linux-kernel.vger.kernel.org archive mirror
* [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters)
@ 2012-01-11 17:25 Will Drewry
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
  0 siblings, 2 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-11 17:25 UTC
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

The goal of the patchset is straightforward:

 To provide a means of reducing the kernel attack surface.

In practice, this is done at the primary kernel ABI: system calls.
Achieving this goal will address the needs expressed by many systems
projects:
  qemu/kvm, openssh, vsftpd, lxc, and chromium and chromium os (me).

While system call filtering has been attempted many times, I hope that
this approach shows more promise.  It works as described below and in
the patch series.

A userland task may call prctl(PR_ATTACH_SECCOMP_FILTER) to attach a
BPF program to itself.  Once attached, all system calls made by the
task will be evaluated by the BPF program prior to being accepted.
Evaluation is done by executing the BPF program over the struct
user_regs_struct for the process.
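
A minimal userland sketch of what such a program could look like follows.
It is not part of the patch.  It assumes the "return the full buffer length
to allow, 0 to deny" convention described in the evaluator comments of
patch 1/2.  The helper name build_allow_one() and its parameters are
hypothetical; the offset of the syscall-number register and the buffer
length depend on the architecture, so they are passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_STMT, BPF_JUMP */

/*
 * Hypothetical helper: fill out[0..3] with a classic BPF program that
 * allows exactly one system call.
 *
 * nr_offset: byte offset of the syscall-number register inside the
 *            register buffer (architecture dependent, supplied by caller)
 * nr:        the one system call number to allow
 * regs_len:  size of the register buffer in bytes
 *
 * Returns the number of instructions written.
 */
size_t build_allow_one(struct sock_filter *out, unsigned int nr_offset,
		       unsigned int nr, unsigned int regs_len)
{
	/* A = 32-bit word at nr_offset in the register buffer */
	out[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, nr_offset);
	/* if (A == nr) fall through to "allow", else skip one insn to "deny" */
	out[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1);
	/* allow: return the full buffer length */
	out[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, regs_len);
	/* deny: return 0 */
	out[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, 0);
	return 4;
}
```

A caller would pick nr_offset and regs_len from its own ABI knowledge,
for example with offsetof() and sizeof() on its struct user_regs_struct.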

!! If you don't care about background or reasoning, stop reading !!

Past attempts have used:
- bitmap of system call numbers evaluated by seccomp (or tracehooks)
- standalone data structures and extra entry hooks
  (cgroups syscall, systrace)
- a collection of ftrace filter strings evaluated by seccomp
- perf_event hackery to allow process termination when an event matches
  (or doesn't)

In addition to the publicly posted approaches, I've personally attempted
continued deeper integration with ftrace along a number of different
lines (lead up to that can be found here[1]).  What inspired the current
patch series was a number of realizations:
1. Userland knows its ABI - that's how it made the system calls in the
   first place.
2. We already exposed a filtering system to userland processes in the
   form of BPF and there is continued focus on optimizing evaluation
   even after so many years.
3. System call filtering policies should not expose
   time-of-check-time-of-use (TOCTOU) vulnerable interfaces but should
   expose all the information that may be relevant to a syscall policy
   decision.

The prior seccomp-ftrace implementations struggled with very
fixable challenges in ftrace: incomplete syscall coverage,
mismatched syscall names versus unistd, incomplete arch coverage,
etc.  These challenges may all be fixed with some time and effort, and
potentially, even closer integration.  I explored a number of
alternative approaches from making system call tracepoints per-thread
and "active" to adding a new less-perf-oriented system call.

In the process of experimentation, a number of things became clear:
- perf/ftrace system-wide analysis goals don't align with lightweight
  per-thread analysis.
- ftrace/perf ABI doesn't mix well with security policy enforcement,
  reduced attack surface environments, or keeping users from specifying
  vulnerable filtering policies.
- other than system calls, tracepoints aren't considered ABI-stable.

The core focus of ftrace and perf is to support system-wide
performance and debugging tracing.  Despite its amazing flexibility,
there are tradeoffs that are made to provide efficient system-wide
behavior that are less efficient at a per-thread level.  For instance,
system call tracepoints are global.  It is possible to make them
per-thread (since they use a TIF anyway).  However, doing so would mean
that a system-wide system call analysis would require one trace event
per thread rather than one total.  It's possible to alleviate that pain,
but that in turn requires more bookkeeping (global versus local
tracepoint registrations mapping to the thread info flag).

Another example is the ftrace ABI.  Both the debugfs entry point with
unstable event ids and the perf-oriented perf_event_open(2) are not
suitable for providing a subsystem which is meant to reduce the attack
surface -- much less avoid maintainer flame wars :) The third aspect of
its ABI was also concerning and hints at yet-another-potential struggle.
The ftrace filter language happily accepts globbing and string matching.
This is excellent for tracing, but horrible for system call
interposition.  If, despite warning, a user decides that blocking a
system call based on a string is what they want, they can do it.  The
result is that their policy may be bypassed due to a time of check, time
of use race.  While addressable, it would mean that the filtering engine
would need to allow operation filtering or offer a "secure" subset.

A side challenge that emerged from the desire to enable tracing to act
as a security policy mechanism was the ability to enact policy over more
than just the system calls.  While this would be doable if all
tracepoints became active, there is a fundamental problem in that very
few, if any, tracepoints aside from system calls can be considered
stable.  If a subset were to emerge as stable, there is still the
challenge of enacting security policy in parallel with tracing policy.
In an example patch where security policy logic was added to
perf_event_open(2), the basics of the system worked, but enforcement of
the security policy was simplistic and intertwined with a large number
of event attributes that were meaningless or altered the behavior.

At every turn, it appears that the tracing infrastructure was unsuited
for being used for attack surface reduction or as a larger security
subsystem on its own.  It is well suited for feeding a policy
enforcement mechanism (like seccomp), but not for letting the logic
co-exist.  It doesn't mean that it has security problems, just that
there will be a continued struggle between having a really good perf
system and a really good kernel attack surface reduction system if
they were merged.  While there may be some distant vision where the
apparent struggle does not exist, I don't see how it would be reached.
Of course, anything is possible with unlimited time. :)

That said, much of that discussion is history, included to fill in some of
the gaps since I posted the last ftrace-based patches.  This patch series
should stand on its own as both straightforward and effective.  In my
opinion, this is the direction I should have taken before I sent my
first patch.

I am looking forward to any and all feedback - thanks!
will


[1] http://search.gmane.org/?query=seccomp+wad%40chromium.org&group=gmane.linux.kernel


Will Drewry (2):
  seccomp_filters: dynamic system call filtering using BPF programs
  Documentation: prctl/seccomp_filter

 Documentation/prctl/seccomp_filter.txt |  179 ++++++++
 fs/exec.c                              |    5 +
 include/linux/prctl.h                  |    3 +
 include/linux/seccomp.h                |   70 +++++-
 kernel/Makefile                        |    1 +
 kernel/fork.c                          |    4 +
 kernel/seccomp.c                       |    8 +
 kernel/seccomp_filter.c                |  639 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                           |    4 +
 security/Kconfig                       |   12 +
 9 files changed, 743 insertions(+), 3 deletions(-)
 create mode 100644 kernel/seccomp_filter.c
 create mode 100644 Documentation/prctl/seccomp_filter.txt
-- 
1.7.5.4

* [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters) Will Drewry
@ 2012-01-11 17:25 ` Will Drewry
  2012-01-12  8:53   ` Serge Hallyn
                     ` (5 more replies)
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
  1 sibling, 6 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-11 17:25 UTC
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

This patch adds support for seccomp mode 2.  This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task.  The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering.  Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).

A filter program may be installed by a userland task by calling
  prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.
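
As a sketch of the userland side (not part of the patch): the program is
an array of struct sock_filter wrapped in a struct sock_fprog, exactly as
for socket filters.  The helper make_fprog() and the allow_all_insns
array are hypothetical names.  The allow-all program assumes the
"return the buffer length to allow" convention used by the evaluator in
this patch.  The prctl() call itself is shown only as a comment because
PR_ATTACH_SECCOMP_FILTER is defined only by this series.

```c
#include <assert.h>
#include <linux/filter.h>	/* struct sock_filter, struct sock_fprog */

/* Hypothetical allow-everything program: A = buffer length; return A. */
struct sock_filter allow_all_insns[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_LEN, 0),
	BPF_STMT(BPF_RET | BPF_A, 0),
};

/* Wrap an instruction array in the struct sock_fprog that prctl() takes. */
struct sock_fprog make_fprog(struct sock_filter *insns, unsigned short count)
{
	struct sock_fprog fprog = {
		.len = count,		/* number of instructions, not bytes */
		.filter = insns,
	};
	return fprog;
}

/*
 * With this RFC applied, the caller would then do:
 *
 *	struct sock_fprog fprog = make_fprog(allow_all_insns, 2);
 *	prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
 *
 * The call is left as a comment on purpose: the option number proposed
 * here (36) may mean something else on a kernel without this series.
 */
```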

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached.  All attached programs
must be evaluated before a system call will be allowed to proceed.
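
The composition rule can be modelled in plain C as below.  This is a
simplified userland sketch, not the kernel code: each attached program is
stood in for by a function pointer, and the names filter_node,
test_filters, allow_all and deny_all are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for one attached BPF program: returns how many bytes of the
 * register buffer it accepts (regs_len to allow, anything else to deny).
 */
typedef size_t (*filter_fn)(const unsigned char *regs, size_t regs_len);

struct filter_node {
	filter_fn run;
	const struct filter_node *parent;	/* previously attached filter */
};

/* Returns 0 if every filter in the chain allows the call, -1 otherwise. */
int test_filters(const struct filter_node *newest,
		 const unsigned char *regs, size_t regs_len)
{
	const struct filter_node *f;
	int ret = 0;

	if (!newest)
		return -1;	/* no filter attached: deny */
	/* Walk from the newest filter back through every ancestor. */
	for (f = newest; f != NULL; f = f->parent)
		if (f->run(regs, regs_len) != regs_len)
			ret = -1;	/* keep walking: all filters run */
	return ret;
}

size_t allow_all(const unsigned char *regs, size_t regs_len)
{
	(void)regs;
	return regs_len;
}

size_t deny_all(const unsigned char *regs, size_t regs_len)
{
	(void)regs;
	(void)regs_len;
	return 0;
}
```

Attaching a permissive filter after a restrictive one therefore cannot
widen the policy: one denying ancestor is enough to reject the call.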

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs may _only_ cross the execve(2) barrier if the last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace.  Once a task-local filter program is attached from a
process without privileges, execve will fail.  This ensures that only a
privileged parent task can affect its privileged children (e.g., a setuid
binary).

There are a number of benefits to this approach, a few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
  call numbers.
- Full register information is provided which may be relevant for
  certain syscalls (fork, rt_sigreturn) or for other userland
  filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code.  It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
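
To illustrate the evaluation model, here is a userland sketch of an
evaluator for a small subset of classic BPF.  It is not the evaluator
from this patch: it handles only five opcodes, the name run_filter() is
hypothetical, and it assumes the program has already been validated
(ends in a RET, jumps stay in range).  Like the patch, it bounds-checks
every load and reads values in native byte order instead of byte
swapping them as the network path does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_* opcodes */

/*
 * Evaluate a validated classic BPF program over buf.  Returns the
 * program's return value, or 0 on an out-of-bounds load or an opcode
 * outside the supported subset.
 */
uint32_t run_filter(const uint8_t *buf, size_t buflen,
		    const struct sock_filter *insn)
{
	uint32_t A = 0;		/* accumulator */

	for (;; insn++) {
		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			/* Reject loads that start or end past the buffer. */
			if (insn->k >= buflen || buflen - insn->k < 4)
				return 0;
			/* Native byte order: no ntohl() as on packets. */
			memcpy(&A, buf + insn->k, 4);
			continue;
		case BPF_LD | BPF_W | BPF_LEN:
			A = (uint32_t)buflen;
			continue;
		case BPF_JMP | BPF_JEQ | BPF_K:
			/* Offsets are relative to the next instruction. */
			insn += (A == insn->k) ? insn->jt : insn->jf;
			continue;
		case BPF_RET | BPF_K:
			return insn->k;
		case BPF_RET | BPF_A:
			return A;
		default:
			return 0;	/* unsupported in this sketch */
		}
	}
}
```

A caller would treat a return value equal to buflen as "allow" and
anything else as "deny", matching the convention used by this patch.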

Signed-off-by: Will Drewry <wad@chromium.org>
---
 fs/exec.c               |    5 +
 include/linux/prctl.h   |    3 +
 include/linux/seccomp.h |   70 +++++-
 kernel/Makefile         |    1 +
 kernel/fork.c           |    4 +
 kernel/seccomp.c        |    8 +
 kernel/seccomp_filter.c |  639 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c            |    4 +
 security/Kconfig        |   12 +
 9 files changed, 743 insertions(+), 3 deletions(-)
 create mode 100644 kernel/seccomp_filter.c

diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
 #include <linux/namei.h>
 #include <linux/mount.h>
 #include <linux/security.h>
+#include <linux/seccomp.h>
 #include <linux/syscalls.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
 	if (retval)
 		goto out_ret;
 
+	retval = seccomp_check_exec();
+	if (retval)
+		goto out_ret;
+
 	retval = -ENOMEM;
 	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 	if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
 #define PR_GET_SECCOMP	21
 #define PR_SET_SECCOMP	22
 
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER	36
+
 /* Get/set the capability bounding set (as per security/commoncap.c) */
 #define PR_CAPBSET_READ 23
 #define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ *     if this is 0, seccomp is not in use.
+ *             is 1, the process is under standard seccomp rules.
+ *             is 2, the process is only allowed to make system calls where
+ *                   associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ *          @filter must only be accessed from the context of current as there
+ *          is no guard.
+ */
+typedef struct seccomp_struct {
+	int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+	struct seccomp_filter *filter;
+#endif
+} seccomp_t;
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
 
 #include <linux/errno.h>
 
-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
 #define secure_computing(x) do { } while (0)
 
 static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
 
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+	(_tsk)->seccomp.filter = NULL; \
+} while (0)
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+	put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0)
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+				struct task_struct *parent);
+
+#else  /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0)
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0)
+#define seccomp_filter_free_task(_tsk) do { } while (0)
+
+static inline int seccomp_check_exec(void)
+{
+	return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+	return -ENOSYS;
+}
+
+#endif  /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	seccomp_filter_free_task(tsk);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	sched_fork(p);
 
+	seccomp_filter_init_task(p);
 	retval = perf_event_init_task(p);
 	if (retval)
 		goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
+	seccomp_filter_fork(p, current);
 	return p;
 
 bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+#ifdef CONFIG_SECCOMP_FILTER
+	case 2:
+		if (seccomp_test_filters(this_syscall) == 0)
+			return;
+
+		seccomp_filter_log_failure(this_syscall);
+		break;
+#endif
 	default:
 		BUG();
 	}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ *         get/put helpers should be used when accessing an instance
+ *         outside of a lifetime-guarded section.  In general, this
+ *         is only needed for handling filters shared across tasks.
+ * @creator: pointer to the pid that created this filter
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+	struct kref usage;
+	struct pid *creator;
+	struct seccomp_filter *parent;
+	struct {
+		uint32_t admin:1,  /* can allow execve */
+			 compat:1,  /* CONFIG_COMPAT */
+			 __reserved:30;
+	} flags;
+	int personality;
+	unsigned short count;  /* Instruction count */
+	struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+				       const size_t buflen,
+				       const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+	struct seccomp_filter *f;
+	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+	/* Drop oversized requests. */
+	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+		return ERR_PTR(-EINVAL);
+
+	/* Padding should always be in sock_filter increments. */
+	BUG_ON(padding % sizeof(struct sock_filter));
+
+	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+	if (!f)
+		return ERR_PTR(-ENOMEM);
+	kref_init(&f->usage);
+	f->creator = get_task_pid(current, PIDTYPE_PID);
+	f->count = bpf_blocks;
+	return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (!filter)
+		return;
+	put_seccomp_filter(filter->parent);
+	put_pid(filter->creator);
+	kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+	struct seccomp_filter *orig =
+		container_of(kref, struct seccomp_filter, usage);
+	seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+		current->comm, task_pid_nr(current), syscall,
+		KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return;
+	kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return NULL;
+	kref_get(&orig->usage);
+	return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+	if (filter->personality != current->personality)
+		return -EACCES;
+#ifdef CONFIG_COMPAT
+	if (filter->flags.compat != (!!(is_compat_task())))
+		return -EACCES;
+#endif
+	return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+	const struct user_regset *regset;
+	int n;
+
+	/* Skip 0. */
+	for (n = 1; n < view->n; ++n) {
+		regset = view->regsets + n;
+		if (regset->core_note_type == NT_PRSTATUS)
+			return regset;
+	}
+
+	return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality.  The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure.  On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may point to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+	/* regset is usually returned based on task personality, not current
+	 * system call convention.  This behavior makes it unsafe to execute
+	 * BPF programs over regviews if is_compat_task or the personality
+	 * have changed since the program was installed.
+	 */
+	const struct user_regset_view *view = task_user_regset_view(current);
+	const struct user_regset *regset = &view->regsets[0];
+	size_t scratch_size = *available;
+	if (regset->core_note_type != NT_PRSTATUS) {
+		/* The architecture should override this method for speed. */
+		regset = find_prstatus(view);
+		if (!regset)
+			return NULL;
+	}
+	*available = regset->n * regset->size;
+	/* Make sure the scratch space isn't exceeded. */
+	if (*available > scratch_size)
+		*available = scratch_size;
+	if (regset->get(current, regset, 0, *available, scratch, NULL))
+		return NULL;
+	return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+	struct seccomp_filter *filter;
+	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+	size_t regs_size = sizeof(struct user_regs_struct);
+	int ret = -EACCES;
+
+	filter = current->seccomp.filter; /* uses task ref */
+	if (!filter)
+		goto out;
+
+	/* All filters in the list are required to share the same system call
+	 * convention so only the first filter is ever checked.
+	 */
+	if (seccomp_check_personality(filter))
+		goto out;
+
+	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
+	 * that is not mandatory.  E.g., it may return a pointer to
+	 * task_pt_regs(current).  NULL checking is mandatory.
+	 */
+	regs = seccomp_get_regs(regs_tmp, &regs_size);
+	if (!regs)
+		goto out;
+
+	/* Only allow a system call if it is allowed in all ancestors. */
+	ret = 0;
+	for ( ; filter != NULL; filter = filter->parent) {
+		/* Allowed if return value is the size of the data supplied. */
+		if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+		    regs_size)
+			ret = -EACCES;
+	}
+out:
+	return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+	struct seccomp_filter *filter = NULL;
+	/* Note, len is a short so overflow should be impossible. */
+	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+	long ret = -EPERM;
+
+	/* Allocate a new seccomp_filter */
+	filter = seccomp_filter_alloc(fp_size);
+	if (IS_ERR(filter)) {
+		/* Never hand an ERR_PTR to put_seccomp_filter() at out. */
+		return PTR_ERR(filter);
+	}
+
+	/* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		filter->flags.compat = 1;
+#endif
+	filter->personality = current->personality;
+
+	/* Auditing is not needed since the capability wasn't requested */
+	if (security_real_capable_noaudit(current, current_user_ns(),
+					  CAP_SYS_ADMIN) == 0)
+		filter->flags.admin = 1;
+
+	/* Copy the instructions from fprog. */
+	ret = -EFAULT;
+	if (copy_from_user(filter->insns, fprog->filter, fp_size))
+		goto out;
+
+	/* Check the fprog */
+	ret = sk_chk_filter(filter->insns, filter->count);
+	if (ret)
+		goto out;
+
+	/* If there is an existing filter, make it the parent
+	 * and reuse the existing task-based ref.
+	 */
+	filter->parent = current->seccomp.filter;
+
+	/* Force all filters to use one system call convention. */
+	ret = -EINVAL;
+	if (filter->parent) {
+		if (filter->parent->flags.compat != filter->flags.compat)
+			goto out;
+		if (filter->parent->personality != filter->personality)
+			goto out;
+	}
+
+	/* Double claim the new filter so we can release it below simplifying
+	 * the error paths earlier.
+	 */
+	ret = 0;
+	get_seccomp_filter(filter);
+	current->seccomp.filter = filter;
+	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+	if (!current->seccomp.mode) {
+		current->seccomp.mode = 2;
+		set_thread_flag(TIF_SECCOMP);
+	}
+
+out:
+	put_seccomp_filter(filter);  /* for get or task, on err */
+	return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+	struct sock_fprog fprog;
+	long ret = -EINVAL;
+
+	ret = -EFAULT;
+	if (!user_filter)
+		goto out;
+
+	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+		goto out;
+
+	ret = seccomp_attach_filter(&fprog);
+out:
+	return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+	if (current->seccomp.mode != 2)
+		return 0;
+	/* We can rely on the task refcount for the filter. */
+	if (!current->seccomp.filter)
+		return -EPERM;
+	/* The last attached filter set for the process is checked. It must
+	 * have been installed with CAP_SYS_ADMIN capabilities.
+	 */
+	if (current->seccomp.filter->flags.admin)
+		return 0;
+	return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * @child: forkee
+ * @parent: forker
+ * Ensures that @child inherits a seccomp_filter iff seccomp is enabled
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+			 struct task_struct *parent)
+{
+	if (!parent->seccomp.mode)
+		return;
+	child->seccomp.mode = parent->seccomp.mode;
+	child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer into @buf for the BPF evaluator after checking the offset
+ * and size boundaries.  The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+				int offset, size_t size,
+				void *unused)
+{
+	if (offset >= buflen)
+		goto fail;
+	if (offset < 0)
+		goto fail;
+	if (size > buflen - offset)
+		goto fail;
+	return buf + offset;
+fail:
+	return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ *	@buf: buffer to execute the filter over
+ *	@buflen: length of the buffer
+ *	@fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @fentry is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we don't need to check
+ * flen.
+ *
+ * See net/core/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure.  In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+				       const size_t buflen,
+				       const struct sock_filter *fentry)
+{
+	const void *ptr;
+	u32 A = 0;			/* Accumulator */
+	u32 X = 0;			/* Index Register */
+	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
+	u32 tmp;
+	int k;
+
+	/*
+	 * Process array of filter instructions.
+	 */
+	for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define	K (fentry->k)
+#else
+		const u32 K = fentry->k;
+#endif
+
+		switch (fentry->code) {
+		case BPF_S_ALU_ADD_X:
+			A += X;
+			continue;
+		case BPF_S_ALU_ADD_K:
+			A += K;
+			continue;
+		case BPF_S_ALU_SUB_X:
+			A -= X;
+			continue;
+		case BPF_S_ALU_SUB_K:
+			A -= K;
+			continue;
+		case BPF_S_ALU_MUL_X:
+			A *= X;
+			continue;
+		case BPF_S_ALU_MUL_K:
+			A *= K;
+			continue;
+		case BPF_S_ALU_DIV_X:
+			if (X == 0)
+				return 0;
+			A /= X;
+			continue;
+		case BPF_S_ALU_DIV_K:
+			A = reciprocal_divide(A, K);
+			continue;
+		case BPF_S_ALU_AND_X:
+			A &= X;
+			continue;
+		case BPF_S_ALU_AND_K:
+			A &= K;
+			continue;
+		case BPF_S_ALU_OR_X:
+			A |= X;
+			continue;
+		case BPF_S_ALU_OR_K:
+			A |= K;
+			continue;
+		case BPF_S_ALU_LSH_X:
+			A <<= X;
+			continue;
+		case BPF_S_ALU_LSH_K:
+			A <<= K;
+			continue;
+		case BPF_S_ALU_RSH_X:
+			A >>= X;
+			continue;
+		case BPF_S_ALU_RSH_K:
+			A >>= K;
+			continue;
+		case BPF_S_ALU_NEG:
+			A = -A;
+			continue;
+		case BPF_S_JMP_JA:
+			fentry += K;
+			continue;
+		case BPF_S_JMP_JGT_K:
+			fentry += (A > K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_K:
+			fentry += (A >= K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_K:
+			fentry += (A == K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_K:
+			fentry += (A & K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGT_X:
+			fentry += (A > X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_X:
+			fentry += (A >= X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_X:
+			fentry += (A == X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_X:
+			fentry += (A & X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_LD_W_ABS:
+			k = K;
+load_w:
+			ptr = load_pointer(buf, buflen, k, 4, &tmp);
+			if (ptr != NULL) {
+				/* Note, unlike on network data, values are not
+				 * byte swapped.
+				 */
+				A = *(const u32 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_H_ABS:
+			k = K;
+load_h:
+			ptr = load_pointer(buf, buflen, k, 2, &tmp);
+			if (ptr != NULL) {
+				A = *(const u16 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_B_ABS:
+			k = K;
+load_b:
+			ptr = load_pointer(buf, buflen, k, 1, &tmp);
+			if (ptr != NULL) {
+				A = *(const u8 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_W_LEN:
+			A = buflen;
+			continue;
+		case BPF_S_LDX_W_LEN:
+			X = buflen;
+			continue;
+		case BPF_S_LD_W_IND:
+			k = X + K;
+			goto load_w;
+		case BPF_S_LD_H_IND:
+			k = X + K;
+			goto load_h;
+		case BPF_S_LD_B_IND:
+			k = X + K;
+			goto load_b;
+		case BPF_S_LDX_B_MSH:
+			ptr = load_pointer(buf, buflen, K, 1, &tmp);
+			if (ptr != NULL) {
+				X = (*(u8 *)ptr & 0xf) << 2;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_IMM:
+			A = K;
+			continue;
+		case BPF_S_LDX_IMM:
+			X = K;
+			continue;
+		case BPF_S_LD_MEM:
+			A = mem[K];
+			continue;
+		case BPF_S_LDX_MEM:
+			X = mem[K];
+			continue;
+		case BPF_S_MISC_TAX:
+			X = A;
+			continue;
+		case BPF_S_MISC_TXA:
+			A = X;
+			continue;
+		case BPF_S_RET_K:
+			return K;
+		case BPF_S_RET_A:
+			return A;
+		case BPF_S_ST:
+			mem[K] = A;
+			continue;
+		case BPF_S_STX:
+			mem[K] = X;
+			continue;
+		case BPF_S_ANC_PROTOCOL:
+		case BPF_S_ANC_PKTTYPE:
+		case BPF_S_ANC_IFINDEX:
+		case BPF_S_ANC_MARK:
+		case BPF_S_ANC_QUEUE:
+		case BPF_S_ANC_HATYPE:
+		case BPF_S_ANC_RXHASH:
+		case BPF_S_ANC_CPU:
+		case BPF_S_ANC_NLATTR:
+		case BPF_S_ANC_NLATTR_NEST:
+			/* ignored */
+			continue;
+		default:
+			WARN_RATELIMIT(1, "Unknown code:%u jt:%u jf:%u k:%u\n",
+				       fentry->code, fentry->jt,
+				       fentry->jf, fentry->k);
+			return 0;
+		}
+	}
+
+	return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_SECCOMP:
 			error = prctl_set_seccomp(arg2);
 			break;
+		case PR_ATTACH_SECCOMP_FILTER:
+			error = prctl_attach_seccomp_filter((char __user *)
+								arg2);
+			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
 			break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
 
 	  If you are unsure how to answer this question, answer N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	select SECCOMP
+	depends on EXPERIMENTAL
+	help
+	  This kernel feature expands CONFIG_SECCOMP to allow computing
+	  in environments with reduced kernel access dictated by a system
+	  call filter, expressed in BPF, installed by the application itself
+	  through prctl(2).
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
-- 
1.7.5.4
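
The evaluation convention used by seccomp_run_filter() above (load a word
from the register data, branch on it, and return the data length to allow the
call) can be sketched in userspace.  This is only a minimal illustration: the
opcode names, struct insn, example_prog, run_filter() and syscall_allowed()
are hypothetical and cover just five operations, whereas the patch uses the
kernel's BPF_S_* codes and struct sock_filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode names; the patch uses the kernel's BPF_S_* codes. */
enum op { OP_LD_W_ABS, OP_LD_W_LEN, OP_JEQ_K, OP_RET_K, OP_RET_A };

struct insn {
	enum op code;
	uint8_t jt, jf;		/* jump offsets, relative to the next insn */
	uint32_t k;
};

/*
 * Run a program over buf.  Like the kernel loop, this assumes the program
 * was checked beforehand: jumps stay in range and the last insn returns.
 */
uint32_t run_filter(const uint8_t *buf, size_t buflen, const struct insn *pc)
{
	uint32_t A = 0;

	for (;; pc++) {
		switch (pc->code) {
		case OP_LD_W_ABS:
			/* Out-of-range loads reject, as load_pointer() == NULL does. */
			if (pc->k > buflen || buflen - pc->k < 4)
				return 0;
			memcpy(&A, buf + pc->k, 4);	/* native byte order */
			continue;
		case OP_LD_W_LEN:
			A = (uint32_t)buflen;
			continue;
		case OP_JEQ_K:
			pc += (A == pc->k) ? pc->jt : pc->jf;
			continue;
		case OP_RET_K:
			return pc->k;
		case OP_RET_A:
			return A;
		}
		return 0;	/* unknown opcode: reject */
	}
}

/* Allow word 0 == 4 (a made-up "syscall number"), reject everything else. */
const struct insn example_prog[] = {
	{ OP_LD_W_ABS, 0, 0, 0 },
	{ OP_JEQ_K,    1, 0, 4 },
	{ OP_RET_K,    0, 0, 0 },
	{ OP_LD_W_LEN, 0, 0, 0 },
	{ OP_RET_A,    0, 0, 0 },
};

/* Returns 1 if the filter returned the full data length, 0 otherwise. */
int syscall_allowed(const uint32_t *regs, size_t nregs, const struct insn *prog)
{
	size_t len = nregs * sizeof(*regs);

	return run_filter((const uint8_t *)regs, len, prog) == len;
}
```

Calling syscall_allowed() with a four-word buffer whose first word is 4 takes
the accept path (load length, return A); any other first word falls through
to the explicit return of 0.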


^ permalink raw reply related	[flat|nested] 235+ messages in thread

* [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 17:25 [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters) Will Drewry
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-11 17:25 ` Will Drewry
  2012-01-11 20:03   ` Jonathan Corbet
  2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
  1 sibling, 2 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-11 17:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

Document how system call filtering with BPF works
and can be used.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |  159 ++++++++++++++++++++++++++++++++
 1 files changed, 159 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefit by having a reduced set
+of available system calls.  The resulting set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct.  This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not set directly by the
+consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over user_regs_struct data
+	reflecting system call time, except with the system call number
+	resident in orig_[register].  To allow a system call, the size
+	of the data must be returned.  At present, all other return values
+	result in the system call being blocked, but it is recommended to
+	return 0 in those cases.  This will allow for future custom return
+	values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct sock_fprog which will
+	contain the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	The struct user_regs_struct the @prog will see is based on the
+	personality of the task at the time of this prctl call.  Additionally,
+	is_compat_task is also tracked for the @prog.  This means that once set
+	the calling task will have all of its system calls blocked if it
+	switches its system call ABI (via personality or other means).
+
+	If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+	namespace, the @prog will be marked as inheritable across execve.  Any
+	inherited filters are still subject to the system call ABI constraints
+	above and any ABI mismatched system calls will result in process death.
+
+The above call returns 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+	struct sock_filter filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+		/* Put the "accept" value in A */
+		BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+		BPF_STMT(BPF_RET+BPF_A,0),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	if (prctl(36, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+	return 0;
+}
+
+Additionally, if prctl(2) is allowed by the installed filter, additional
+filters may be layered on which will increase evaluation time, but allow for
+further decreasing the attack surface during execution of a process.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+  a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
-- 
1.7.5.4
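
The jump offsets in the example filter are relative to the instruction that
follows the jump, which matches the interpreter loop in patch 1/2 (fentry is
advanced by jt or jf and then incremented by the for loop).  A small sketch
for checking where each branch lands; jump_target() and the IDX_* names are
hypothetical, and the indices assume the 14 instructions of the example are
numbered from 0 in the order listed.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the instruction reached from a jump at index pc with offset off. */
size_t jump_target(size_t pc, unsigned char off)
{
	return pc + 1 + off;
}

/* Indices of interest in the example filter (numbered from 0). */
enum {
	IDX_JEQ_RT_SIGRETURN = 1,
	IDX_JEQ_READ = 5,
	IDX_JEQ_WRITE = 6,
	IDX_READ_CHECK = 7,	/* loads ebx for the read check */
	IDX_JEQ_STDIN = 8,
	IDX_WRITE_CHECK = 9,	/* loads ebx for the write check */
	IDX_JEQ_STDOUT = 10,
	IDX_JEQ_STDERR = 11,
	IDX_ACCEPT = 12,	/* BPF_LD+BPF_W+BPF_LEN */
	IDX_RET = 13,		/* BPF_RET+BPF_A */
};
```

For instance, jump_target(IDX_JEQ_WRITE, 6) is IDX_RET: a system call that
matches none of the listed numbers reaches BPF_RET+BPF_A directly, with A
still holding the system call number rather than the data length.  On this
reading, the reject paths rely on that value differing from the data length;
a filter following the advice to return 0 on rejection would add an explicit
BPF_RET+BPF_K instruction with a K of 0.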


^ permalink raw reply related	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-01-11 20:03   ` Jonathan Corbet
  2012-01-11 20:10     ` Will Drewry
  2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
  1 sibling, 1 reply; 235+ messages in thread
From: Jonathan Corbet @ 2012-01-11 20:03 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Interesting approach to the problem, I think I like it.  Watch for news at
11...:)

One nit:

> +Example
> +-------
> +
> +Assume a process would like to cleanly read and write to stdin/out/err and exit
> +cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
> +

It seems like this little program belongs in the samples/ directory.

Thanks,

jon

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 20:03   ` Jonathan Corbet
@ 2012-01-11 20:10     ` Will Drewry
  2012-01-11 23:19       ` [PATCH v2 " Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-11 20:10 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wed, Jan 11, 2012 at 2:03 PM, Jonathan Corbet <corbet@lwn.net> wrote:
> Interesting approach to the problem, I think I like it.  Watch for news at
> 11...:)

Thanks - I'm glad to hear it!

> One nit:
>
>> +Example
>> +-------
>> +
>> +Assume a process would like to cleanly read and write to stdin/out/err and exit
>> +cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
>> +
>
> It seems like this little program belongs in the samples/ directory.

Cool - I'll do that and rev this patch.

cheers!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 20:10     ` Will Drewry
@ 2012-01-11 23:19       ` Will Drewry
  2012-01-12  0:29         ` Will Drewry
  2012-01-12 18:16         ` Randy Dunlap
  0 siblings, 2 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-11 23:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet

Document how system call filtering with BPF works and
may be used.  Includes an example for x86 (32-bit).

Signed-off-by: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
 samples/Makefile                       |    2 +-
 samples/seccomp/Makefile               |   12 ++++
 samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt
 create mode 100644 samples/seccomp/Makefile
 create mode 100644 samples/seccomp/bpf-example.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..15d4645
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,99 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefit by having a reduced set
+of available system calls.  The resulting set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct.  This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not set directly by the
+consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over user_regs_struct data
+	reflecting system call time, except with the system call number
+	resident in orig_[register].  To allow a system call, the size
+	of the data must be returned.  At present, all other return values
+	result in the system call being blocked, but it is recommended to
+	return 0 in those cases.  This will allow for future custom return
+	values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct sock_fprog which will
+	contain the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	The struct user_regs_struct the @prog will see is based on the
+	personality of the task at the time of this prctl call.  Additionally,
+	is_compat_task is also tracked for the @prog.  This means that once set
+	the calling task will have all of its system calls blocked if it
+	switches its system call ABI (via personality or other means).
+
+	If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+	namespace, the @prog will be marked as inheritable across execve.  Any
+	inherited filters are still subject to the system call ABI constraints
+	above and any ABI mismatched system calls will result in process death.
+
+	Additionally, if prctl(2) is allowed by the attached filter,
+	additional filters may be layered on which will increase evaluation
+	time, but allow for further decreasing the attack surface during
+	execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp/bpf-example.c shows an example process that allows read from
+stdin, write to stdout/err, exit and signal returns for 32-bit x86.
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+  a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ tracepoints/ trace_events/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..80dc8e4
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-$(CONFIG_X86_32) := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
+HOSTLOADLIBES_bpf-example += -m32
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+#	define PR_ATTACH_SECCOMP_FILTER 36
+#endif
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+static int install_filter(void)
+{
+	struct sock_filter filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+		/* Put the "accept" value in A */
+		BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+		BPF_STMT(BPF_RET+BPF_A,0),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+	return 0;
+}
-- 
1.7.5.4
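
The K operands passed to the loads in bpf-example.c are byte offsets into the
register data, computed with offsetof().  The sketch below uses a stand-in
struct, regs32, whose field order is assumed to mirror the 32-bit x86
struct user_regs_struct from <sys/user.h>.  The struct, the regoffset32()
macro and both helper functions are hypothetical and exist only so that the
offsets can be inspected on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the i386 user_regs_struct (assumed layout, 17 words). */
struct regs32 {
	uint32_t ebx, ecx, edx, esi, edi, ebp, eax;
	uint32_t xds, xes, xfs, xgs;
	uint32_t orig_eax, eip, xcs, eflags, esp, xss;
};

#define regoffset32(_reg) (offsetof(struct regs32, _reg))

/* Offset the example filter loads the system call number from. */
size_t syscall_nr_offset(void)
{
	return regoffset32(orig_eax);
}

/* Offset of the first system call argument (the fd for read/write). */
size_t first_arg_offset(void)
{
	return regoffset32(ebx);
}
```

Under that assumed layout the system call number sits at byte 44, the first
argument at byte 0, and the whole structure is 68 bytes, which would be the
length an accepting filter has to return.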


^ permalink raw reply related	[flat|nested] 235+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 23:19       ` [PATCH v2 " Will Drewry
@ 2012-01-12  0:29         ` Will Drewry
  2012-01-12 18:16         ` Randy Dunlap
  1 sibling, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12  0:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet

Hrm, I may need to guard sample compilation based on host arch and not
just target arch. Documentation v3 will be on the way once I have that
behaving properly. :/

Sorry!
will

On Wed, Jan 11, 2012 at 5:19 PM, Will Drewry <wad@chromium.org> wrote:
> Document how system call filtering with BPF works and
> may be used.  Includes an example for x86 (32-bit).
>
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
>  samples/Makefile                       |    2 +-
>  samples/seccomp/Makefile               |   12 ++++
>  samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
>  4 files changed, 186 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>  create mode 100644 samples/seccomp/Makefile
>  create mode 100644 samples/seccomp/bpf-example.c
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> new file mode 100644
> index 0000000..15d4645
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,99 @@
> +               Seccomp filtering
> +               =================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the process.
> +As system calls change and mature, bugs are found and eradicated.  A
> +certain subset of userland applications benefit by having a reduced set
> +of available system calls.  The resulting set reduces the total kernel
> +surface exposed to the application.  System call filtering is meant for
> +use with those applications.
> +
> +Seccomp filtering provides a means for a process to specify a filter
> +for incoming system calls.  The filter is expressed as a Berkeley Packet
> +Filter program, as with socket filters, except that the data operated on
> +is the current user_regs_struct.  This allows for expressive filtering
> +of system calls using the pre-existing system call ABI and using a filter
> +program language with a long history of being exposed to userland.
> +Additionally, BPF makes it impossible for users of seccomp to fall prey to
> +time-of-check-time-of-use (TOCTOU) attacks that are common in system call
> +interposition frameworks because the evaluated data is solely register state
> +just after system call entry.
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox.  It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface.  Beyond that,
> +policy for logical behavior and information flow should be managed with
> +a combinations of other system hardening techniques and, potentially, a
> +LSM of your choosing.  Expressive, dynamic filters provide further options down
> +this path (avoiding pathological sizes or selecting which of the multiplexed
> +system calls in socketcall() is allowed, for instance) which could be
> +construed, incorrectly, as a more complete sandboxing solution.
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is added, but they are not directly set by the
> +consuming process.  The new mode, '2', is only available if
> +CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
> +PR_ATTACH_SECCOMP_FILTER argument.
> +
> +Interacting with seccomp filters is done using one prctl(2) call.
> +
> +PR_ATTACH_SECCOMP_FILTER:
> +       Allows the specification of a new filter using a BPF program.
> +       The BPF program will be executed over a user_regs_struct data
> +       reflecting system call time except with the system call number
> +       resident in orig_[register].  To allow a system call, the size
> +       of the data must be returned.  At present, all other return values
> +       result in the system call being blocked, but it is recommended to
> +       return 0 in those cases.  This will allow for future custom return
> +       values to be introduced, if ever desired.
> +
> +       Usage:
> +               prctl(PR_ATTACH_SECCOMP_FILTER, prog);
> +
> +       The 'prog' argument is a pointer to a struct sock_fprog which will
> +       contain the filter program.  If the program is invalid, the call
> +       will return -1 and set errno to -EINVAL.
> +
> +       The struct user_regs_struct the @prog will see is based on the
> +       personality of the task at the time of this prctl call.  Additionally,
> +       is_compat_task is also tracked for the @prog.  This means that once set
> +       the calling task will have all of its system calls blocked if it
> +       switches its system call ABI (via personality or other means).
> +
> +       If the @prog is installed while the task has CAP_SYS_ADMIN in its user
> +       namespace, the @prog will be marked as inheritable across execve.  Any
> +       inherited filters are still subject to the system call ABI constraints
> +       above and any ABI mismatched system calls will result in process death.
> +
> +       Additionally, if prctl(2) is allowed by the attached filter,
> +       additional filters may be layered on which will increase evaluation
> +       time, but allow for further decreasing the attack surface during
> +       execution of a process.
> +
> +The above call returns 0 on success and non-zero on error.
> +
> +Example
> +-------
> +
> +samples/seccomp-bpf-example.c shows an example process that allows read from stdin,
> +write to stdout/err, exit and signal returns for 32-bit x86.
> +
> +Caveats
> +-------
> +
> +- execve will fail unless the most recently attached filter was installed by
> +  a process with CAP_SYS_ADMIN (in its namespace).
> +
> +Adding architecture support
> +-----------------------
> +
> +Any platform with seccomp support will support seccomp filters
> +as long as CONFIG_SECCOMP_FILTER is enabled.
> diff --git a/samples/Makefile b/samples/Makefile
> index 6280817..f29b19c 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -1,4 +1,4 @@
>  # Makefile for Linux samples code
>
>  obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ tracepoints/ trace_events/ \
> -                          hw_breakpoint/ kfifo/ kdb/ hidraw/
> +                          hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
> new file mode 100644
> index 0000000..80dc8e4
> --- /dev/null
> +++ b/samples/seccomp/Makefile
> @@ -0,0 +1,12 @@
> +# kbuild trick to avoid linker error. Can be omitted if a module is built.
> +obj- := dummy.o
> +
> +# List of programs to build
> +hostprogs-$(CONFIG_X86_32) := bpf-example
> +bpf-example-objs := bpf-example.o
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
> +HOSTLOADLIBES_bpf-example += -m32
> diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
> new file mode 100644
> index 0000000..f98b70a
> --- /dev/null
> +++ b/samples/seccomp/bpf-example.c
> @@ -0,0 +1,74 @@
> +/*
> + * Seccomp BPF example
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <asm/unistd.h>
> +#include <linux/filter.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <sys/prctl.h>
> +#include <sys/user.h>
> +#include <unistd.h>
> +
> +#ifndef PR_ATTACH_SECCOMP_FILTER
> +#      define PR_ATTACH_SECCOMP_FILTER 36
> +#endif
> +
> +#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
> +static int install_filter(void)
> +{
> +       struct sock_filter filter[] = {
> +               /* Grab the system call number */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
> +               /* Jump table for the allowed syscalls */
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
> +
> +               /* Check that read is only using stdin. */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
> +
> +               /* Check that write is only using stdout/stderr */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
> +
> +               /* Put the "accept" value in A */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
> +
> +               BPF_STMT(BPF_RET+BPF_A,0),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> +               .filter = filter,
> +       };
> +       if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
> +               perror("prctl");
> +               return 1;
> +       }
> +       return 0;
> +}
> +
> +#define payload(_c) _c, sizeof(_c)
> +int main(int argc, char **argv) {
> +       char buf[4096];
> +       ssize_t bytes = 0;
> +       if (install_filter())
> +               return 1;
> +       syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
> +       bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
> +       syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
> +       syscall(__NR_write, STDOUT_FILENO, buf, bytes);
> +       return 0;
> +}
> --
> 1.7.5.4
>

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-12  8:53   ` Serge Hallyn
  2012-01-12 16:54     ` Will Drewry
  2012-01-12 14:50   ` Oleg Nesterov
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 235+ messages in thread
From: Serge Hallyn @ 2012-01-12  8:53 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

Quoting Will Drewry (wad@chromium.org):
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).
> 
> A filter program may be installed by a userland task by calling
>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
> where fprog is of type struct sock_fprog.
> 
> If the first filter program allows subsequent prctl(2) calls, then
> additional filter programs may be attached.  All attached programs
> must be evaluated before a system call will be allowed to proceed.
> 
> To avoid CONFIG_COMPAT related landmines, once a filter program is
> installed using specific is_compat_task() and current->personality, it
> is not allowed to make system calls or attach additional filters which
> use a different combination of is_compat_task() and
> current->personality.
> 
> Filter programs may _only_ cross the execve(2) barrier if the last
> filter program was attached by a task with CAP_SYS_ADMIN capabilities
> in its user namespace.  Once a task-local filter program is attached
> from a process without privileges, execve will fail.  This ensures that
> only a privileged parent task can affect its privileged children (e.g.,
> a setuid binary).
> 
> There are a number of benefits to this approach, a few of which are
> as follows:
> - BPF has been exposed to userland for a long time.
> - Userland already knows its ABI: expected register layout and system
>   call numbers.
> - Full register information is provided which may be relevant for
>   certain syscalls (fork, rt_sigreturn) or for other userland
>   filtering tactics (checking the PC).
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> 
> This patch includes its own BPF evaluator, but relies on the
> net/core/filter.c BPF checking code.  It is possible to share
> evaluators, but the performance sensitive nature of the network
> filtering path makes it an iterative optimization which (I think :) can
> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)
> 
> Signed-off-by: Will Drewry <wad@chromium.org>
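
To check my reading of "all attached programs must be evaluated", here is
a rough userland sketch of the composition rule.  It is an illustration
only, not the patch's code: struct sketch_filter, its run callback and the
sketch_* helpers are hypothetical stand-ins for struct seccomp_filter and
seccomp_run_filter().  The only things taken from the patch are the
newest-to-oldest walk over ->parent and the convention that "allow" means
returning the full length of the register buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one attached filter: a verdict callback
 * plus a link to the filter that was attached before it.
 */
struct sketch_filter {
	unsigned int (*run)(const unsigned char *regs, size_t len);
	const struct sketch_filter *parent;
};

/* Returns 0 if every filter in the chain allows the call, -EACCES
 * otherwise.  As in the patch, the walk does not stop at the first
 * denial; every filter is evaluated.
 */
static int sketch_test_filters(const struct sketch_filter *newest,
			       const unsigned char *regs, size_t len)
{
	const struct sketch_filter *f;
	int ret = 0;

	assert(regs != NULL);
	if (!newest)
		return -EACCES;
	for (f = newest; f != NULL; f = f->parent) {
		/* "Allow" is signalled by returning the buffer length. */
		if (f->run(regs, len) != len)
			ret = -EACCES;
	}
	return ret;
}

/* Two trivial verdicts used to exercise the chain. */
static unsigned int sketch_allow(const unsigned char *regs, size_t len)
{
	(void)regs;
	return (unsigned int)len;
}

static unsigned int sketch_deny(const unsigned char *regs, size_t len)
{
	(void)regs;
	(void)len;
	return 0;
}
```

With a chain built as child -> parent, a denial from either link gives
-EACCES, so a later filter can only narrow what an earlier one permits.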

Hey Will,

A few comments below, but otherwise

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

thanks,
-serge

> ---
>  fs/exec.c               |    5 +
>  include/linux/prctl.h   |    3 +
>  include/linux/seccomp.h |   70 +++++-
>  kernel/Makefile         |    1 +
>  kernel/fork.c           |    4 +
>  kernel/seccomp.c        |    8 +
>  kernel/seccomp_filter.c |  639 +++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sys.c            |    4 +
>  security/Kconfig        |   12 +
>  9 files changed, 743 insertions(+), 3 deletions(-)
>  create mode 100644 kernel/seccomp_filter.c
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 3625464..e9cc89c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -44,6 +44,7 @@
>  #include <linux/namei.h>
>  #include <linux/mount.h>
>  #include <linux/security.h>
> +#include <linux/seccomp.h>
>  #include <linux/syscalls.h>
>  #include <linux/tsacct_kern.h>
>  #include <linux/cn_proc.h>
> @@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
>  	if (retval)
>  		goto out_ret;
>  
> +	retval = seccomp_check_exec();
> +	if (retval)
> +		goto out_ret;
> +
>  	retval = -ENOMEM;
>  	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
>  	if (!bprm)
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index a3baeb2..15e2460 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -64,6 +64,9 @@
>  #define PR_GET_SECCOMP	21
>  #define PR_SET_SECCOMP	22
>  
> +/* Set process seccomp filters */
> +#define PR_ATTACH_SECCOMP_FILTER	36
> +
>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>  #define PR_CAPBSET_READ 23
>  #define PR_CAPBSET_DROP 24
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index cc7a4e9..99d163e 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,9 +5,28 @@
>  #ifdef CONFIG_SECCOMP
>  
>  #include <linux/thread_info.h>
> +#include <linux/types.h>
>  #include <asm/seccomp.h>
>  
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_filter;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:
> + *     if this is 0, seccomp is not in use.
> + *             is 1, the process is under standard seccomp rules.
> + *             is 2, the process is only allowed to make system calls where
> + *                   associated filters evaluate successfully.
> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
> + *          @filter must only be accessed from the context of current as there
> + *          is no guard.
> + */
> +typedef struct seccomp_struct {
> +	int mode;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	struct seccomp_filter *filter;
> +#endif
> +} seccomp_t;
>  
>  extern void __secure_computing(int);
>  static inline void secure_computing(int this_syscall)
> @@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
>  
>  #include <linux/errno.h>
>  
> -typedef struct { } seccomp_t;
> -
> +typedef struct seccomp_struct { } seccomp_t;
>  #define secure_computing(x) do { } while (0)
>  
>  static inline long prctl_get_seccomp(void)
> @@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
>  
>  #endif /* CONFIG_SECCOMP */
>  
> +#ifdef CONFIG_SECCOMP_FILTER
> +
> +#define seccomp_filter_init_task(_tsk) do { \
> +	(_tsk)->seccomp.filter = NULL; \
> +} while (0)
> +
> +/* No locking is needed here because the task_struct will
> + * have no parallel consumers.
> + */
> +#define seccomp_filter_free_task(_tsk) do { \
> +	put_seccomp_filter((_tsk)->seccomp.filter); \
> +} while (0)
> +
> +extern int seccomp_check_exec(void);
> +
> +extern long prctl_attach_seccomp_filter(char __user *);
> +
> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
> +extern void put_seccomp_filter(struct seccomp_filter *);
> +
> +extern int seccomp_test_filters(int);
> +extern void seccomp_filter_log_failure(int);
> +extern void seccomp_filter_fork(struct task_struct *child,
> +				struct task_struct *parent);
> +
> +#else  /* CONFIG_SECCOMP_FILTER */
> +
> +#include <linux/errno.h>
> +
> +struct seccomp_filter { };
> +#define seccomp_filter_init_task(_tsk) do { } while (0)
> +#define seccomp_filter_fork(_tsk, _orig) do { } while (0)
> +#define seccomp_filter_free_task(_tsk) do { } while (0)
> +
> +static inline int seccomp_check_exec(void)
> +{
> +	return 0;
> +}
> +
> +
> +static inline long prctl_attach_seccomp_filter(char __user *a2)
> +{
> +	return -ENOSYS;
> +}
> +
> +#endif  /* CONFIG_SECCOMP_FILTER */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index e898c5b..0584090 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>  obj-$(CONFIG_SECCOMP) += seccomp.o
> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> diff --git a/kernel/fork.c b/kernel/fork.c
> index da4a6a1..cc1d628 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/security.h>
>  #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
>  #include <linux/swap.h>
>  #include <linux/syscalls.h>
>  #include <linux/jiffies.h>
> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> +	seccomp_filter_free_task(tsk);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	sched_fork(p);
>  
> +	seccomp_filter_init_task(p);
>  	retval = perf_event_init_task(p);
>  	if (retval)
>  		goto bad_fork_cleanup_policy;
> @@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	if (clone_flags & CLONE_THREAD)
>  		threadgroup_fork_read_unlock(current);
>  	perf_event_fork(p);
> +	seccomp_filter_fork(p, current);
>  	return p;
>  
>  bad_fork_free_pid:
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 57d4b13..78719be 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
>  				return;
>  		} while (*++syscall);
>  		break;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	case 2:
> +		if (seccomp_test_filters(this_syscall) == 0)
> +			return;
> +
> +		seccomp_filter_log_failure(this_syscall);
> +		break;
> +#endif
>  	default:
>  		BUG();
>  	}
> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> new file mode 100644
> index 0000000..4770847
> --- /dev/null
> +++ b/kernel/seccomp_filter.c
> @@ -0,0 +1,639 @@
> +/* bpf program-based system call filtering
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + */
> +
> +#include <linux/capability.h>
> +#include <linux/compat.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/rculist.h>
> +#include <linux/filter.h>
> +#include <linux/kallsyms.h>
> +#include <linux/kref.h>
> +#include <linux/module.h>
> +#include <linux/pid.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/ratelimit.h>
> +#include <linux/reciprocal_div.h>
> +#include <linux/regset.h>
> +#include <linux/seccomp.h>
> +#include <linux/security.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/user.h>
> +
> +
> +/**
> + * struct seccomp_filter - container for seccomp BPF programs
> + *
> + * @usage: reference count to manage the object lifetime.
> + *         get/put helpers should be used when accessing an instance
> + *         outside of a lifetime-guarded section.  In general, this
> + *         is only needed for handling filters shared across tasks.
> + * @creator: pointer to the pid that created this filter
> + * @parent: pointer to the ancestor which this filter will be composed with.
> + * @flags: provide information about filter from creation time.
> + * @personality: personality of the process at filter creation time.
> + * @insns: the BPF program instructions to evaluate
> + * @count: the number of instructions in the program.
> + *
> + * seccomp_filter objects should never be modified after being attached
> + * to a task_struct (other than @usage).
> + */
> +struct seccomp_filter {
> +	struct kref usage;
> +	struct pid *creator;
> +	struct seccomp_filter *parent;
> +	struct {
> +		uint32_t admin:1,  /* can allow execve */
> +			 compat:1,  /* CONFIG_COMPAT */
> +			 __reserved:30;
> +	} flags;
> +	int personality;
> +	unsigned short count;  /* Instruction count */
> +	struct sock_filter insns[0];
> +};
> +
> +static unsigned int seccomp_run_filter(const u8 *buf,
> +				       const size_t buflen,
> +				       const struct sock_filter *);
> +
> +/**
> + * seccomp_filter_alloc - allocates a new filter object
> + * @padding: size of the insns[0] array in bytes
> + *
> + * The @padding should be a multiple of
> + * sizeof(struct sock_filter).
> + *
> + * Returns ERR_PTR on error or an allocated object.
> + */
> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
> +{
> +	struct seccomp_filter *f;
> +	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
> +
> +	/* Drop oversized requests. */
> +	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* Padding should always be in sock_filter increments. */
> +	BUG_ON(padding % sizeof(struct sock_filter));

I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace.  Was there a reason not to return -EINVAL here?
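
To make the suggestion concrete, here is a userland sketch of the size
check with the BUG_ON replaced by an error return.  sketch_check_padding
and the SKETCH_* constants are made up for illustration; 8 and 4096 are
what I believe sizeof(struct sock_filter) and BPF_MAXINSNS are in current
headers, so treat them as assumptions rather than something this patch
defines.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_INSN_SIZE 8UL	/* assumed sizeof(struct sock_filter) */
#define SKETCH_MAXINSNS  4096UL	/* assumed BPF_MAXINSNS */

/* Returns the instruction count for a program of @padding bytes, or
 * -EINVAL if the size is empty, oversized, or not a whole number of
 * instructions.  No path here can take the caller down.
 */
static long sketch_check_padding(unsigned long padding)
{
	unsigned long blocks = padding / SKETCH_INSN_SIZE;

	/* Drop empty and oversized requests. */
	if (blocks == 0 || blocks > SKETCH_MAXINSNS)
		return -EINVAL;

	/* Reject partial instructions instead of BUG_ON(). */
	if (padding % SKETCH_INSN_SIZE)
		return -EINVAL;

	return (long)blocks;
}
```

If the only caller always passes len * sizeof(struct sock_filter), the
second check can never fire, and then it costs nothing to make it an
ordinary error.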

> +
> +	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
> +	if (!f)
> +		return ERR_PTR(-ENOMEM);
> +	kref_init(&f->usage);
> +	f->creator = get_task_pid(current, PIDTYPE_PID);
> +	f->count = bpf_blocks;
> +	return f;
> +}
> +
> +/**
> + * seccomp_filter_free - frees the allocated filter.
> + * @filter: NULL or live object to be completely destructed.
> + */
> +static void seccomp_filter_free(struct seccomp_filter *filter)
> +{
> +	if (!filter)
> +		return;
> +	put_seccomp_filter(filter->parent);
> +	put_pid(filter->creator);
> +	kfree(filter);
> +}
> +
> +static void __put_seccomp_filter(struct kref *kref)
> +{
> +	struct seccomp_filter *orig =
> +		container_of(kref, struct seccomp_filter, usage);
> +	seccomp_filter_free(orig);
> +}
> +
> +void seccomp_filter_log_failure(int syscall)
> +{
> +	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		KSTK_EIP(current));
> +}
> +
> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
> +void put_seccomp_filter(struct seccomp_filter *orig)
> +{
> +	if (!orig)
> +		return;
> +	kref_put(&orig->usage, __put_seccomp_filter);
> +}
> +
> +/* get_seccomp_filter - increments the reference count of @orig. */
> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
> +{
> +	if (!orig)
> +		return NULL;
> +	kref_get(&orig->usage);
> +	return orig;
> +}
> +
> +static int seccomp_check_personality(struct seccomp_filter *filter)
> +{
> +	if (filter->personality != current->personality)
> +		return -EACCES;
> +#ifdef CONFIG_COMPAT
> +	if (filter->flags.compat != (!!(is_compat_task())))
> +		return -EACCES;
> +#endif
> +	return 0;
> +}
> +
> +static const struct user_regset *
> +find_prstatus(const struct user_regset_view *view)
> +{
> +	const struct user_regset *regset;
> +	int n;
> +
> +	/* Skip 0. */
> +	for (n = 1; n < view->n; ++n) {
> +		regset = view->regsets + n;
> +		if (regset->core_note_type == NT_PRSTATUS)
> +			return regset;
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * seccomp_get_regs - returns a pointer to struct user_regs_struct
> + * @scratch: preallocated storage of size @available
> + * @available: pointer to the size of scratch.
> + *
> + * Returns NULL if the registers cannot be acquired or copied.
> + * Returns a populated pointer to @scratch by default.
> + * Otherwise, returns a pointer to a u8 array containing the struct
> + * user_regs_struct appropriate for the task personality.  The pointer
> + * may be to the beginning of @scratch or to an externally managed data
> + * structure.  On success, @available should be updated with the
> + * valid region size of the returned pointer.
> + *
> + * If the architecture overrides the linkage, then the pointer may point to
> + * another location.
> + */
> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
> +{
> +	/* regset is usually returned based on task personality, not current
> +	 * system call convention.  This behavior makes it unsafe to execute
> +	 * BPF programs over regviews if is_compat_task or the personality
> +	 * have changed since the program was installed.
> +	 */
> +	const struct user_regset_view *view = task_user_regset_view(current);
> +	const struct user_regset *regset = &view->regsets[0];
> +	size_t scratch_size = *available;
> +	if (regset->core_note_type != NT_PRSTATUS) {
> +		/* The architecture should override this method for speed. */
> +		regset = find_prstatus(view);
> +		if (!regset)
> +			return NULL;
> +	}
> +	*available = regset->n * regset->size;
> +	/* Make sure the scratch space isn't exceeded. */
> +	if (*available > scratch_size)
> +		*available = scratch_size;
> +	if (regset->get(current, regset, 0, *available, scratch, NULL))
> +		return NULL;
> +	return scratch;
> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	struct seccomp_filter *filter;
> +	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
> +	size_t regs_size = sizeof(struct user_regs_struct);
> +	int ret = -EACCES;
> +
> +	filter = current->seccomp.filter; /* uses task ref */
> +	if (!filter)
> +		goto out;
> +
> +	/* All filters in the list are required to share the same system call
> +	 * convention so only the first filter is ever checked.
> +	 */
> +	if (seccomp_check_personality(filter))
> +		goto out;
> +
> +	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
> +	 * that is not mandatory.  E.g., it may return a pointer to
> +	 * task_pt_regs(current).  NULL checking is mandatory.
> +	 */
> +	regs = seccomp_get_regs(regs_tmp, &regs_size);
> +	if (!regs)
> +		goto out;
> +
> +	/* Only allow a system call if it is allowed in all ancestors. */
> +	ret = 0;
> +	for ( ; filter != NULL; filter = filter->parent) {
> +		/* Allowed if return value is the size of the data supplied. */
> +		if (seccomp_run_filter(regs, regs_size, filter->insns) !=
> +		    regs_size)
> +			ret = -EACCES;
> +	}
> +out:
> +	return ret;
> +}
> +
> +/**
> + * seccomp_attach_filter: Attaches a seccomp filter to current.
> + * @fprog: BPF program to install
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called (usually prctl).
> + *
> + * This function may be called repeatedly to install additional filters.
> + * Every filter successfully installed will be evaluated (in reverse order)
> + * for each system call the thread makes.
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +long seccomp_attach_filter(struct sock_fprog *fprog)
> +{
> +	struct seccomp_filter *filter = NULL;
> +	/* Note, len is a short so overflow should be impossible. */
> +	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> +	long ret = -EPERM;
> +
> +	/* Allocate a new seccomp_filter */
> +	filter = seccomp_filter_alloc(fp_size);
> +	if (IS_ERR(filter)) {
> +		ret = PTR_ERR(filter);
> +		goto out;
> +	}
> +
> +	/* Lock the process personality and calling convention. */
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		filter->flags.compat = 1;
> +#endif
> +	filter->personality = current->personality;
> +
> +	/* Auditing is not needed since the capability wasn't requested */
> +	if (security_real_capable_noaudit(current, current_user_ns(),
> +					  CAP_SYS_ADMIN) == 0)
> +		filter->flags.admin = 1;
> +
> +	/* Copy the instructions from fprog. */
> +	ret = -EFAULT;
> +	if (copy_from_user(filter->insns, fprog->filter, fp_size))
> +		goto out;
> +
> +	/* Check the fprog */
> +	ret = sk_chk_filter(filter->insns, filter->count);
> +	if (ret)
> +		goto out;
> +
> +	/* If there is an existing filter, make it the parent
> +	 * and reuse the existing task-based ref.
> +	 */
> +	filter->parent = current->seccomp.filter;
> +
> +	/* Force all filters to use one system call convention. */
> +	ret = -EINVAL;
> +	if (filter->parent) {
> +		if (filter->parent->flags.compat != filter->flags.compat)
> +			goto out;
> +		if (filter->parent->personality != filter->personality)
> +			goto out;
> +	}
> +
> +	/* Double claim the new filter so we can release it below simplifying
> +	 * the error paths earlier.
> +	 */
> +	ret = 0;
> +	get_seccomp_filter(filter);
> +	current->seccomp.filter = filter;
> +	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
> +	if (!current->seccomp.mode) {
> +		current->seccomp.mode = 2;
> +		set_thread_flag(TIF_SECCOMP);
> +	}
> +
> +out:
> +	put_seccomp_filter(filter);  /* for get or task, on err */
> +	return ret;
> +}
> +
> +long prctl_attach_seccomp_filter(char __user *user_filter)
> +{
> +	struct sock_fprog fprog;
> +	long ret = -EINVAL;
> +
> +	ret = -EFAULT;
> +	if (!user_filter)
> +		goto out;
> +
> +	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
> +		goto out;
> +
> +	ret = seccomp_attach_filter(&fprog);
> +out:
> +	return ret;
> +}
> +
> +/**
> + * seccomp_check_exec: determines if exec is allowed for current
> + * Returns 0 if allowed.
> + */
> +int seccomp_check_exec(void)
> +{
> +	if (current->seccomp.mode != 2)
> +		return 0;
> +	/* We can rely on the task refcount for the filter. */
> +	if (!current->seccomp.filter)
> +		return -EPERM;
> +	/* The last attached filter set for the process is checked. It must
> +	 * have been installed with CAP_SYS_ADMIN capabilities.

This comment is confusing.  By 'It must' you mean that if not, it's
denied.  But if I didn't know better I would read that as "we can't
get to this code unless".  Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?
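
For what it's worth, this is how I would expect the check to read with
the comment reworded.  It is a standalone sketch, not a replacement hunk:
struct sketch_filter and struct sketch_seccomp are hypothetical,
cut-down versions of the structures in this patch, kept only so the
logic can be compiled and tried in userland.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cut-down versions of the patch's structures. */
struct sketch_filter {
	unsigned int admin:1;	/* installed with CAP_SYS_ADMIN */
};

struct sketch_seccomp {
	int mode;				/* 2 == filter mode */
	const struct sketch_filter *filter;	/* last attached filter */
};

/* Returns 0 if exec is allowed, -EPERM otherwise. */
static int sketch_check_exec(const struct sketch_seccomp *s)
{
	assert(s != NULL);
	if (s->mode != 2)
		return 0;
	if (!s->filter)
		return -EPERM;
	/* Exec is refused unless the last attached filter was
	 * installed with CAP_SYS_ADMIN privilege.
	 */
	if (s->filter->admin)
		return 0;
	return -EPERM;
}
```

Worded that way, the comment describes the decision the code makes
rather than sounding like a precondition for reaching it.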

> +	 */
> +	if (current->seccomp.filter->flags.admin)
> +		return 0;
> +	return -EPERM;
> +}
> +
> +/* seccomp_filter_fork: manages inheritance on fork
> + * @child: forkee
> + * @parent: forker
> + * Ensures that @child inherits a seccomp_filter iff seccomp is enabled
> + * and the set of filters is marked as 'enabled'.
> + */
> +void seccomp_filter_fork(struct task_struct *child,
> +			 struct task_struct *parent)
> +{
> +	if (!parent->seccomp.mode)
> +		return;
> +	child->seccomp.mode = parent->seccomp.mode;
> +	child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
> +}
> +
> +/* Returns a pointer into @buf after checking the offset and size
> + * boundaries.  The signature almost matches the signature from
> + * net/core/filter.c with the hopes of sharing code in the future.
> + */
> +static const void *load_pointer(const u8 *buf, size_t buflen,
> +				int offset, size_t size,
> +				void *unused)
> +{
> +	if (offset >= buflen)
> +		goto fail;
> +	if (offset < 0)
> +		goto fail;
> +	if (size > buflen - offset)
> +		goto fail;
> +	return buf + offset;
> +fail:
> +	return NULL;
> +}
> +
> +/**
> + * seccomp_run_filter - evaluate BPF (over user_regs_struct)
> + *	@buf: buffer to execute the filter over
> + *	@buflen: length of the buffer
> + *	@fentry: filter to apply
> + *
> + * Decode and apply filter instructions to the buffer.
> + * Return length to keep, 0 for none. @buf is a regset we are
> + * filtering, @filter is the array of filter instructions.
> + * Because all jumps are guaranteed to be before last instruction,
> + * and last instruction guaranteed to be a RET, we don't need to check
> + * flen.
> + *
> + * See net/core/filter.c as this is nearly an exact copy.
> + * At some point, it would be nice to merge them to take advantage of
> + * optimizations (like JIT).
> + *
> + * A successful filter must return the full length of the data. Anything less
> + * will currently result in a seccomp failure.  In the future, it may be
> + * possible to use that for hard filtering registers on the fly so it is
> + * ideal for consumers to return 0 on intended failure.
> + */
> +static unsigned int seccomp_run_filter(const u8 *buf,
> +				       const size_t buflen,
> +				       const struct sock_filter *fentry)
> +{
> +	const void *ptr;
> +	u32 A = 0;			/* Accumulator */
> +	u32 X = 0;			/* Index Register */
> +	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
> +	u32 tmp;
> +	int k;
> +
> +	/*
> +	 * Process array of filter instructions.
> +	 */
> +	for (;; fentry++) {
> +#if defined(CONFIG_X86_32)
> +#define	K (fentry->k)
> +#else
> +		const u32 K = fentry->k;
> +#endif
> +
> +		switch (fentry->code) {
> +		case BPF_S_ALU_ADD_X:
> +			A += X;
> +			continue;
> +		case BPF_S_ALU_ADD_K:
> +			A += K;
> +			continue;
> +		case BPF_S_ALU_SUB_X:
> +			A -= X;
> +			continue;
> +		case BPF_S_ALU_SUB_K:
> +			A -= K;
> +			continue;
> +		case BPF_S_ALU_MUL_X:
> +			A *= X;
> +			continue;
> +		case BPF_S_ALU_MUL_K:
> +			A *= K;
> +			continue;
> +		case BPF_S_ALU_DIV_X:
> +			if (X == 0)
> +				return 0;
> +			A /= X;
> +			continue;
> +		case BPF_S_ALU_DIV_K:
> +			A = reciprocal_divide(A, K);
> +			continue;
> +		case BPF_S_ALU_AND_X:
> +			A &= X;
> +			continue;
> +		case BPF_S_ALU_AND_K:
> +			A &= K;
> +			continue;
> +		case BPF_S_ALU_OR_X:
> +			A |= X;
> +			continue;
> +		case BPF_S_ALU_OR_K:
> +			A |= K;
> +			continue;
> +		case BPF_S_ALU_LSH_X:
> +			A <<= X;
> +			continue;
> +		case BPF_S_ALU_LSH_K:
> +			A <<= K;
> +			continue;
> +		case BPF_S_ALU_RSH_X:
> +			A >>= X;
> +			continue;
> +		case BPF_S_ALU_RSH_K:
> +			A >>= K;
> +			continue;
> +		case BPF_S_ALU_NEG:
> +			A = -A;
> +			continue;
> +		case BPF_S_JMP_JA:
> +			fentry += K;
> +			continue;
> +		case BPF_S_JMP_JGT_K:
> +			fentry += (A > K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGE_K:
> +			fentry += (A >= K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JEQ_K:
> +			fentry += (A == K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JSET_K:
> +			fentry += (A & K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGT_X:
> +			fentry += (A > X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGE_X:
> +			fentry += (A >= X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JEQ_X:
> +			fentry += (A == X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JSET_X:
> +			fentry += (A & X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_LD_W_ABS:
> +			k = K;
> +load_w:
> +			ptr = load_pointer(buf, buflen, k, 4, &tmp);
> +			if (ptr != NULL) {
> +				/* Note, unlike on network data, values are not
> +				 * byte swapped.
> +				 */
> +				A = *(const u32 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_H_ABS:
> +			k = K;
> +load_h:
> +			ptr = load_pointer(buf, buflen, k, 2, &tmp);
> +			if (ptr != NULL) {
> +				A = *(const u16 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_B_ABS:
> +			k = K;
> +load_b:
> +			ptr = load_pointer(buf, buflen, k, 1, &tmp);
> +			if (ptr != NULL) {
> +				A = *(const u8 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_W_LEN:
> +			A = buflen;
> +			continue;
> +		case BPF_S_LDX_W_LEN:
> +			X = buflen;
> +			continue;
> +		case BPF_S_LD_W_IND:
> +			k = X + K;
> +			goto load_w;
> +		case BPF_S_LD_H_IND:
> +			k = X + K;
> +			goto load_h;
> +		case BPF_S_LD_B_IND:
> +			k = X + K;
> +			goto load_b;
> +		case BPF_S_LDX_B_MSH:
> +			ptr = load_pointer(buf, buflen, K, 1, &tmp);
> +			if (ptr != NULL) {
> +				X = (*(u8 *)ptr & 0xf) << 2;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_IMM:
> +			A = K;
> +			continue;
> +		case BPF_S_LDX_IMM:
> +			X = K;
> +			continue;
> +		case BPF_S_LD_MEM:
> +			A = mem[K];
> +			continue;
> +		case BPF_S_LDX_MEM:
> +			X = mem[K];
> +			continue;
> +		case BPF_S_MISC_TAX:
> +			X = A;
> +			continue;
> +		case BPF_S_MISC_TXA:
> +			A = X;
> +			continue;
> +		case BPF_S_RET_K:
> +			return K;
> +		case BPF_S_RET_A:
> +			return A;
> +		case BPF_S_ST:
> +			mem[K] = A;
> +			continue;
> +		case BPF_S_STX:
> +			mem[K] = X;
> +			continue;
> +		case BPF_S_ANC_PROTOCOL:
> +		case BPF_S_ANC_PKTTYPE:
> +		case BPF_S_ANC_IFINDEX:
> +		case BPF_S_ANC_MARK:
> +		case BPF_S_ANC_QUEUE:
> +		case BPF_S_ANC_HATYPE:
> +		case BPF_S_ANC_RXHASH:
> +		case BPF_S_ANC_CPU:
> +		case BPF_S_ANC_NLATTR:
> +		case BPF_S_ANC_NLATTR_NEST:
> +			/* ignored */
> +			continue;
> +		default:
> +			WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
> +				       fentry->code, fentry->jt,
> +				       fentry->jf, fentry->k);
> +			return 0;
> +		}
> +	}
> +
> +	return 0;
> +}
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 481611f..77f2eda 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		case PR_SET_SECCOMP:
>  			error = prctl_set_seccomp(arg2);
>  			break;
> +		case PR_ATTACH_SECCOMP_FILTER:
> +			error = prctl_attach_seccomp_filter((char __user *)
> +								arg2);
> +			break;
>  		case PR_GET_TSC:
>  			error = GET_TSC_CTL(arg2);
>  			break;
> diff --git a/security/Kconfig b/security/Kconfig
> index 51bd5a0..77b1106 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
>  
>  	  If you are unsure how to answer this question, answer N.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	select SECCOMP
> +	depends on EXPERIMENTAL
> +	help
> +	  This kernel feature expands CONFIG_SECCOMP to allow computing
> +	  in environments with reduced kernel access dictated by a system
> +	  call filter, expressed in BPF, installed by the application itself
> +	  through prctl(2).
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> -- 
> 1.7.5.4
> 

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
  2012-01-11 20:03   ` Jonathan Corbet
@ 2012-01-12 13:13   ` Łukasz Sowa
  2012-01-12 17:25     ` Will Drewry
  1 sibling, 1 reply; 235+ messages in thread
From: Łukasz Sowa @ 2012-01-12 13:13 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Hi Will,

That's a very different approach to the system call interposition problem.
I find your solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
makes it to the mainline (and I guess that's really possible at the current
stage :)), there will be no place for my solution, but that's ok.

There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with my solution.

I'll examine your patch later on and put some comments if I bump into
something.

Best Regards,
Lukasz Sowa



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-12  8:53   ` Serge Hallyn
@ 2012-01-12 14:50   ` Oleg Nesterov
  2012-01-12 16:55     ` Will Drewry
  2012-01-12 15:43   ` Steven Rostedt
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-12 14:50 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/11, Will Drewry wrote:
>
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).

Cool ;)
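
For anyone skimming the thread, here is a rough idea of what a userland
filter for this interface might look like, built with the existing BPF
macros from <linux/filter.h>.  This is only a sketch: build_allow_one()
and its parameters are hypothetical, the offset of the syscall number
inside struct user_regs_struct is arch-specific and left to the caller,
and it assumes (as with socket filters) that a zero return rejects and a
non-zero return accepts.  It only builds the program; it does not attach
it, and it says nothing about how the evaluator treats word width or
byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>	/* struct sock_filter, struct sock_fprog, BPF_* */

/*
 * Fill 'insns' (room for 4 entries) with a program that accepts only the
 * system call number 'nr'.  'nr_offset' is the byte offset of the syscall
 * number in the register block the filter runs over (an assumption the
 * caller has to get right for its architecture).
 */
static struct sock_fprog build_allow_one(struct sock_filter *insns,
					 uint32_t nr_offset, uint32_t nr)
{
	struct sock_fprog prog;

	assert(insns != NULL);

	/* A = 32-bit word at nr_offset */
	insns[0] = (struct sock_filter)
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, nr_offset);
	/* if (A == nr) go to the next insn, else skip one insn */
	insns[1] = (struct sock_filter)
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1);
	/* accept (assumed: non-zero means allow) */
	insns[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, 0xffffffff);
	/* reject (assumed: zero means deny) */
	insns[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, 0);

	prog.len = 4;
	prog.filter = insns;
	return prog;
}
```

The resulting struct sock_fprog is what would be handed to
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog) per the patch description.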

I didn't really read this patch yet, just one nit.

> +#define seccomp_filter_init_task(_tsk) do { \
> +	(_tsk)->seccomp.filter = NULL; \
> +} while (0);

Cosmetic and subjective, but imho it would be better to add inline
functions instead of define's.
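
A minimal sketch of what the inline form could look like, using
userspace stand-ins for the kernel structs (the layouts below are
assumptions trimmed to the fields involved, not the patch's definitions,
and init_or_skip() is a made-up caller).  As a side effect it avoids the
trailing ';' after 'while (0)' in the quoted macro, which would break an
unbraced if/else around the call:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct seccomp_filter;
struct seccomp_struct {
	int mode;
	struct seccomp_filter *filter;
};
struct task_struct {
	struct seccomp_struct seccomp;
};

/* Inline-function form of seccomp_filter_init_task(): type-checked. */
static inline void seccomp_filter_init_task(struct task_struct *tsk)
{
	tsk->seccomp.filter = NULL;
}

/*
 * Hypothetical caller.  With the macro form ending in "while (0);" the
 * extra ';' would detach the 'else' below and this would not compile.
 */
static inline int init_or_skip(struct task_struct *tsk, int do_init)
{
	assert(tsk != NULL);

	if (do_init)
		seccomp_filter_init_task(tsk);
	else
		return 0;
	return 1;
}
```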

> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> +	seccomp_filter_free_task(tsk);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	sched_fork(p);
>  
> +	seccomp_filter_init_task(p);

This doesn't look right, or I missed something. seccomp_filter_init_task()
should be called right after dup_task_struct(), at least before
copy_process() can fail.

Otherwise copy_process()->free_task()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().

> +struct seccomp_filter {
> +	struct kref usage;
> +	struct pid *creator;

Why? seccomp_filter->creator is never used, no?

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-12  8:53   ` Serge Hallyn
  2012-01-12 14:50   ` Oleg Nesterov
@ 2012-01-12 15:43   ` Steven Rostedt
  2012-01-12 16:14     ` Oleg Nesterov
                       ` (3 more replies)
  2012-01-12 16:18   ` Alan Cox
                     ` (2 subsequent siblings)
  5 siblings, 4 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 15:43 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace.  Once a task-local filter program is attached from a
> process without privileges, execve will fail.  This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?

Maybe I don't understand this correctly.

-- Steve
 



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
@ 2012-01-12 16:14     ` Oleg Nesterov
  2012-01-12 16:38       ` Steven Rostedt
  2012-01-12 16:14     ` Andrew Lutomirski
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-12 16:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Steven Rostedt wrote:
>
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace.  Once a task-local filter program is attached from a
> > process without privileges, execve will fail.  This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

Maybe this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.

OTOH, currently seccomp.mode == 1 doesn't allow exec at all.

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
  2012-01-12 16:14     ` Oleg Nesterov
@ 2012-01-12 16:14     ` Andrew Lutomirski
  2012-01-12 16:27       ` Steven Rostedt
  2012-01-12 16:59     ` Will Drewry
  2012-01-12 17:36     ` Jamie Lokier
  3 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 16:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 7:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

Time to resurrect execve_nosecurity?  If so, then the rule could be
simplified to: seccomp programs cannot use normal execve at all.

The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical.  I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*.  I think that the only
things missing in the kernel (other than something intelligent to do
about SELinux) are execve_nosecurity and the ability for a normal
program to wait for an unrelated program to finish (or some other way
that a program can ask a daemon to spawn a privileged program for it
and then to cleanly wait for that program to finish in a way that
could survive re-exec of the daemon).

--Andy

>
> -- Steve
>
>


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
                     ` (2 preceding siblings ...)
  2012-01-12 15:43   ` Steven Rostedt
@ 2012-01-12 16:18   ` Alan Cox
  2012-01-12 17:03     ` Will Drewry
  2012-01-13  1:31     ` James Morris
  2012-01-12 16:22   ` Oleg Nesterov
  2012-01-12 17:02   ` Andrew Lutomirski
  5 siblings, 2 replies; 235+ messages in thread
From: Alan Cox @ 2012-01-12 16:18 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace.  Once a task-local filter program is attached from a
> process without privileges, execve will fail.  This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply a filter to their own code
arbitrarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.
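
For reference, the dumpable attribute is visible from userspace through
prctl(2).  A small sketch (the helper names are made up for
illustration; PR_GET_DUMPABLE and PR_SET_DUMPABLE are the existing prctl
options):

```c
#include <assert.h>
#include <sys/prctl.h>

/* Current "dumpable" attribute of the calling process, or -1 on error. */
static int get_dumpable(void)
{
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/* Set the attribute to 0 (not dumpable) or 1 (dumpable); 0 on success. */
static int set_dumpable(int on)
{
	assert(on == 0 || on == 1);
	return prctl(PR_SET_DUMPABLE, on, 0, 0, 0);
}
```

A process that clears the attribute can no longer be core dumped or
attached to by an unprivileged tracer, which is the existing rule set
being referred to here.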

[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]

> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)

A BPF jit ought to be trivial and would be a big win.

In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.

Alan


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
                     ` (3 preceding siblings ...)
  2012-01-12 16:18   ` Alan Cox
@ 2012-01-12 16:22   ` Oleg Nesterov
  2012-01-12 17:10     ` Will Drewry
  2012-01-12 17:02   ` Andrew Lutomirski
  5 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-12 16:22 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/11, Will Drewry wrote:
>
> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
> +{
> +	/* regset is usually returned based on task personality, not current
> +	 * system call convention.  This behavior makes it unsafe to execute
> +	 * BPF programs over regviews if is_compat_task or the personality
> +	 * have changed since the program was installed.
> +	 */
> +	const struct user_regset_view *view = task_user_regset_view(current);
> +	const struct user_regset *regset = &view->regsets[0];
> +	size_t scratch_size = *available;
> +	if (regset->core_note_type != NT_PRSTATUS) {
> +		/* The architecture should override this method for speed. */
> +		regset = find_prstatus(view);
> +		if (!regset)
> +			return NULL;
> +	}
> +	*available = regset->n * regset->size;
> +	/* Make sure the scratch space isn't exceeded. */
> +	if (*available > scratch_size)
> +		*available = scratch_size;
> +	if (regset->get(current, regset, 0, *available, scratch, NULL))
> +		return NULL;
> +	return scratch;
> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	struct seccomp_filter *filter;
> +	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
> +	size_t regs_size = sizeof(struct user_regs_struct);
> +	int ret = -EACCES;
> +
> +	filter = current->seccomp.filter; /* uses task ref */
> +	if (!filter)
> +		goto out;
> +
> +	/* All filters in the list are required to share the same system call
> +	 * convention so only the first filter is ever checked.
> +	 */
> +	if (seccomp_check_personality(filter))
> +		goto out;
> +
> +	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
> +	 * that is not mandatory.  E.g., it may return a point to
> +	 * task_pt_regs(current).  NULL checking is mandatory.
> +	 */
> +	regs = seccomp_get_regs(regs_tmp, &regs_size);

Stupid question. I am sure you know what you are doing ;) and I know
nothing about !x86 arches.

But could you explain why it is designed to use user_regs_struct?
Why can't we simply use task_pt_regs() and avoid the (costly) regsets?

Just curious.

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:14     ` Andrew Lutomirski
@ 2012-01-12 16:27       ` Steven Rostedt
  2012-01-12 16:51         ` Andrew Lutomirski
  2012-01-12 17:09         ` Linus Torvalds
  0 siblings, 2 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 16:27 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:

> The longer I linger on lists and see neat ideas like this, the more I
> get annoyed that execve is magical.  I dream of a distribution that
> doesn't use setuid, file capabilities, selinux transitions on exec, or
> any other privilege changes on exec *at all*. 

Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?

In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.

-- Steve




* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:14     ` Oleg Nesterov
@ 2012-01-12 16:38       ` Steven Rostedt
  2012-01-12 16:47         ` Oleg Nesterov
  2012-01-12 17:30         ` Jamie Lokier
  0 siblings, 2 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 16:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:

> May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> 
> OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.

I've never used seccomp, so I admit I'm totally ignorant on this topic.

But looking at seccomp from the outside, the biggest advantage to this
would be the ability for a normal process to limit the tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.

I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.

What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.

Or is this a race where I can have a soft link pointing to a normal
file, run this, and have the link change to a setuid file at just the
right time that causes it to happen?


-- Steve




* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:38       ` Steven Rostedt
@ 2012-01-12 16:47         ` Oleg Nesterov
  2012-01-12 17:08           ` Will Drewry
  2012-01-12 17:30         ` Jamie Lokier
  1 sibling, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-12 16:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Steven Rostedt wrote:
>
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>
> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> >
> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
>
> I've never used seccomp, so I admit I'm totally ignorant on this topic.

me too ;)

> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs. But
> it would be nice if I could execute something from the command line that
> I could run in a sand box.
>
> What's the problem with making sure that the setuid isn't set before
> doing an execv? Only fail when setuid (or some other magic) is enabled
> on the file being exec'd.

I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
I just do not know what the simplest/cleanest way to do this would be.


And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{
	if (current->seccomp.mode != 2)
		return 0;
	/* We can rely on the task refcount for the filter. */
	if (!current->seccomp.filter)
		return -EPERM;

How is it possible to have seccomp.filter == NULL with mode == 2?

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:27       ` Steven Rostedt
@ 2012-01-12 16:51         ` Andrew Lutomirski
  2012-01-12 17:09         ` Linus Torvalds
  1 sibling, 0 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 16:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:
>
>> The longer I linger on lists and see neat ideas like this, the more I
>> get annoyed that execve is magical.  I dream of a distribution that
>> doesn't use setuid, file capabilities, selinux transitions on exec, or
>> any other privilege changes on exec *at all*.
>
> Is that the fear with filtering on execv? That if we have filters on an
> execv calling a setuid program that we change the behavior of that
> privileged program and might cause unexpected results?

Exactly.

>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.
>

How do you define "non-magical"?

If setuid is set, then it's obviously magical.  On a nosuid
filesystem, strange things happen.  If file capabilities are enabled
and set, then different magic happens.  With LSMs involved, anything
can be magical.  (SELinux AFAICT looks up rules on every single exec,
so it might be impossible for execve to be non-magical.)  If execve is
banned entirely when seccomp is enabled, then there will never be any
attacks based on abusing these mechanisms.

My proposal is to have an alternative mechanism that, from a security
POV, does nothing that the caller couldn't have done on its own.  The
only reason it would be needed at all is because implementing execve
with correct semantics from userspace is a PITA -- the right set of
fds needs to be closed, threads need to be killed (without races),
vmas need to be found and unmapped, a new program needs to be mapped in
(possibly at the same place that the old one was mapped at),
/proc/self/exe needs to be updated, auxv needs to be recreated
(including using values that glibc might have erased already), etc.
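
To give a feel for just the first item (closing the right fds), here is
a rough userspace sketch.  close_cloexec_fds() is a hypothetical helper,
and it covers none of the other steps listed above:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Close every descriptor in [0, max_fd) that has FD_CLOEXEC set, which
 * is what execve() would do on the caller's behalf.  Returns the number
 * of descriptors closed.
 */
static int close_cloexec_fds(int max_fd)
{
	int fd, closed = 0;

	assert(max_fd >= 0);

	for (fd = 0; fd < max_fd; fd++) {
		int flags = fcntl(fd, F_GETFD);

		if (flags == -1)	/* not an open descriptor */
			continue;
		if (flags & FD_CLOEXEC) {
			close(fd);
			closed++;
		}
	}
	return closed;
}
```

A caller might pass sysconf(_SC_OPEN_MAX) as max_fd.  The walk is not
atomic with respect to other threads opening descriptors, which is one
reason the thread-killing step cannot be skipped.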

The code is short and it works (although I have no idea whether it
applies to current kernels).

Oleg: my only issue with setting something like LSM_UNSAFE_SECCOMP is
that a different class of vulnerability might be introduced: take a
setuid program that calls other setuid programs (or just uses execve
as a way to get the default execve capability handling, SELinux
handling, etc), run it (as root!) inside seccomp, and watch it
possibly develop security holes.  If the alternate execve is a
different syscall, then this can't happen.  And if someone remaps
execve to execve_nosecurity (from userspace or via some in-kernel
mechanism) and causes problems, it's entirely clear who to blame.

--Andy


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12  8:53   ` Serge Hallyn
@ 2012-01-12 16:54     ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 16:54 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 2:53 AM, Serge Hallyn
<serge.hallyn@canonical.com> wrote:
> Quoting Will Drewry (wad@chromium.org):
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>>
>> A filter program may be installed by a userland task by calling
>>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
>> where fprog is of type struct sock_fprog.
>>
>> If the first filter program allows subsequent prctl(2) calls, then
>> additional filter programs may be attached.  All attached programs
>> must be evaluated before a system call will be allowed to proceed.
>>
>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>> installed using specific is_compat_task() and current->personality, it
>> is not allowed to make system calls or attach additional filters which
>> use a different combination of is_compat_task() and
>> current->personality.
>>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time.
>> - Userland already knows its ABI: expected register layout and system
>>   call numbers.
>> - Full register information is provided which may be relevant for
>>   certain syscalls (fork, rt_sigreturn) or for other userland
>>   filtering tactics (checking the PC).
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>>
>> This patch includes its own BPF evaluator, but relies on the
>> net/core/filter.c BPF checking code.  It is possible to share
>> evaluators, but the performance sensitive nature of the network
>> filtering path makes it an iterative optimization which (I think :) can
>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>
> Hey Will,
>
> A few comments below, but otherwise
>
> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

Thanks! Unimportant responses below.  Fixes will be incorporated in
the next round (along with Oleg's feedback).

cheers,
will

> thanks,
> -serge
>
>> ---
>>  fs/exec.c               |    5 +
>>  include/linux/prctl.h   |    3 +
>>  include/linux/seccomp.h |   70 +++++-
>>  kernel/Makefile         |    1 +
>>  kernel/fork.c           |    4 +
>>  kernel/seccomp.c        |    8 +
>>  kernel/seccomp_filter.c |  639 +++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sys.c            |    4 +
>>  security/Kconfig        |   12 +
>>  9 files changed, 743 insertions(+), 3 deletions(-)
>>  create mode 100644 kernel/seccomp_filter.c
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 3625464..e9cc89c 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -44,6 +44,7 @@
>>  #include <linux/namei.h>
>>  #include <linux/mount.h>
>>  #include <linux/security.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/tsacct_kern.h>
>>  #include <linux/cn_proc.h>
>> @@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
>>       if (retval)
>>               goto out_ret;
>>
>> +     retval = seccomp_check_exec();
>> +     if (retval)
>> +             goto out_ret;
>> +
>>       retval = -ENOMEM;
>>       bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
>>       if (!bprm)
>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> index a3baeb2..15e2460 100644
>> --- a/include/linux/prctl.h
>> +++ b/include/linux/prctl.h
>> @@ -64,6 +64,9 @@
>>  #define PR_GET_SECCOMP       21
>>  #define PR_SET_SECCOMP       22
>>
>> +/* Set process seccomp filters */
>> +#define PR_ATTACH_SECCOMP_FILTER     36
>> +
>>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>>  #define PR_CAPBSET_READ 23
>>  #define PR_CAPBSET_DROP 24
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index cc7a4e9..99d163e 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -5,9 +5,28 @@
>>  #ifdef CONFIG_SECCOMP
>>
>>  #include <linux/thread_info.h>
>> +#include <linux/types.h>
>>  #include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_filter;
>> +/**
>> + * struct seccomp_struct - the state of a seccomp'ed process
>> + *
>> + * @mode:
>> + *     if this is 0, seccomp is not in use.
>> + *             is 1, the process is under standard seccomp rules.
>> + *             is 2, the process is only allowed to make system calls where
>> + *                   associated filters evaluate successfully.
>> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
>> + *          @filter must only be accessed from the context of current as there
>> + *          is no guard.
>> + */
>> +typedef struct seccomp_struct {
>> +     int mode;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     struct seccomp_filter *filter;
>> +#endif
>> +} seccomp_t;
>>
>>  extern void __secure_computing(int);
>>  static inline void secure_computing(int this_syscall)
>> @@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
>>
>>  #include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> -
>> +typedef struct seccomp_struct { } seccomp_t;
>>  #define secure_computing(x) do { } while (0)
>>
>>  static inline long prctl_get_seccomp(void)
>> @@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
>>
>>  #endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +
>> +#define seccomp_filter_init_task(_tsk) do { \
>> +     (_tsk)->seccomp.filter = NULL; \
>> +} while (0);
>> +
>> +/* No locking is needed here because the task_struct will
>> + * have no parallel consumers.
>> + */
>> +#define seccomp_filter_free_task(_tsk) do { \
>> +     put_seccomp_filter((_tsk)->seccomp.filter); \
>> +} while (0);
>> +
>> +extern int seccomp_check_exec(void);
>> +
>> +extern long prctl_attach_seccomp_filter(char __user *);
>> +
>> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
>> +extern void put_seccomp_filter(struct seccomp_filter *);
>> +
>> +extern int seccomp_test_filters(int);
>> +extern void seccomp_filter_log_failure(int);
>> +extern void seccomp_filter_fork(struct task_struct *child,
>> +                             struct task_struct *parent);
>> +
>> +#else  /* CONFIG_SECCOMP_FILTER */
>> +
>> +#include <linux/errno.h>
>> +
>> +struct seccomp_filter { };
>> +#define seccomp_filter_init_task(_tsk) do { } while (0);
>> +#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
>> +#define seccomp_filter_free_task(_tsk) do { } while (0);
>> +
>> +static inline int seccomp_check_exec(void)
>> +{
>> +     return 0;
>> +}
>> +
>> +
>> +static inline long prctl_attach_seccomp_filter(char __user *a2)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +#endif  /* CONFIG_SECCOMP_FILTER */
>>  #endif /* _LINUX_SECCOMP_H */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index e898c5b..0584090 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>>  obj-$(CONFIG_SECCOMP) += seccomp.o
>> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index da4a6a1..cc1d628 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>>  #include <linux/cgroup.h>
>>  #include <linux/security.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/swap.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/jiffies.h>
>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     seccomp_filter_free_task(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       /* Perform scheduler related setup. Assign this task to a CPU. */
>>       sched_fork(p);
>>
>> +     seccomp_filter_init_task(p);
>>       retval = perf_event_init_task(p);
>>       if (retval)
>>               goto bad_fork_cleanup_policy;
>> @@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       if (clone_flags & CLONE_THREAD)
>>               threadgroup_fork_read_unlock(current);
>>       perf_event_fork(p);
>> +     seccomp_filter_fork(p, current);
>>       return p;
>>
>>  bad_fork_free_pid:
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..78719be 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case 2:
>> +             if (seccomp_test_filters(this_syscall) == 0)
>> +                     return;
>> +
>> +             seccomp_filter_log_failure(this_syscall);
>> +             break;
>> +#endif
>>       default:
>>               BUG();
>>       }
>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> new file mode 100644
>> index 0000000..4770847
>> --- /dev/null
>> +++ b/kernel/seccomp_filter.c
>> @@ -0,0 +1,639 @@
>> +/* bpf program-based system call filtering
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> + *
>> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + */
>> +
>> +#include <linux/capability.h>
>> +#include <linux/compat.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/rculist.h>
>> +#include <linux/filter.h>
>> +#include <linux/kallsyms.h>
>> +#include <linux/kref.h>
>> +#include <linux/module.h>
>> +#include <linux/pid.h>
>> +#include <linux/prctl.h>
>> +#include <linux/ptrace.h>
>> +#include <linux/ratelimit.h>
>> +#include <linux/reciprocal_div.h>
>> +#include <linux/regset.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/security.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/user.h>
>> +
>> +
>> +/**
>> + * struct seccomp_filter - container for seccomp BPF programs
>> + *
>> + * @usage: reference count to manage the object lifetime.
>> + *         get/put helpers should be used when accessing an instance
>> + *         outside of a lifetime-guarded section.  In general, this
>> + *         is only needed for handling filters shared across tasks.
>> + * @creator: pointer to the pid that created this filter
>> + * @parent: pointer to the ancestor which this filter will be composed with.
>> + * @flags: provide information about filter from creation time.
>> + * @personality: personality of the process at filter creation time.
>> + * @insns: the BPF program instructions to evaluate
>> + * @count: the number of instructions in the program.
>> + *
>> + * seccomp_filter objects should never be modified after being attached
>> + * to a task_struct (other than @usage).
>> + */
>> +struct seccomp_filter {
>> +     struct kref usage;
>> +     struct pid *creator;
>> +     struct seccomp_filter *parent;
>> +     struct {
>> +             uint32_t admin:1,  /* can allow execve */
>> +                      compat:1,  /* CONFIG_COMPAT */
>> +                      __reserved:30;
>> +     } flags;
>> +     int personality;
>> +     unsigned short count;  /* Instruction count */
>> +     struct sock_filter insns[0];
>> +};
>> +
>> +static unsigned int seccomp_run_filter(const u8 *buf,
>> +                                    const size_t buflen,
>> +                                    const struct sock_filter *);
>> +
>> +/**
>> + * seccomp_filter_alloc - allocates a new filter object
>> + * @padding: size of the insns[0] array in bytes
>> + *
>> + * The @padding should be a multiple of
>> + * sizeof(struct sock_filter).
>> + *
>> + * Returns ERR_PTR on error or an allocated object.
>> + */
>> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
>> +{
>> +     struct seccomp_filter *f;
>> +     unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
>> +
>> +     /* Drop oversized requests. */
>> +     if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     /* Padding should always be in sock_filter increments. */
>> +     BUG_ON(padding % sizeof(struct sock_filter));
>
> I still think the BUG_ON here is harsh given that the progsize is passed
> in by userspace.  Was there a reason not to return -EINVAL here?

I've changed it in the next revision.  As is, I don't believe userspace
can control the size of padding directly, just the increment, since it
specifies its length in terms of bpf blocks (sizeof(struct sock_filter)).
But EINVAL is certainly less aggressive :)
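
As a rough illustration of the EINVAL approach (a userspace sketch, not
the code from the next revision), the size check can reject a bad length
instead of asserting on it.  The model_* names and the struct layout are
stand-ins invented for this example; only the 4096-instruction limit
mirrors BPF_MAXINSNS.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel definitions; not the real headers. */
#define MODEL_BPF_MAXINSNS 4096

struct model_sock_filter {
	unsigned short code;
	unsigned char jt;
	unsigned char jf;
	unsigned int k;
};

/*
 * Returns the instruction count for a program of @padding bytes, or a
 * negative errno if the size is unusable.  A bad size only fails the
 * call, which is the point of preferring -EINVAL over BUG_ON().
 */
static long model_check_padding(unsigned long padding)
{
	unsigned long blocks = padding / sizeof(struct model_sock_filter);

	/* Reject sizes that are not a whole number of instructions. */
	if (padding % sizeof(struct model_sock_filter))
		return -EINVAL;
	/* Reject empty and oversized programs. */
	if (blocks == 0 || blocks > MODEL_BPF_MAXINSNS)
		return -EINVAL;
	return (long)blocks;
}
```

With this shape the caller simply propagates the error, so a malformed
length can never take the machine down.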

>> +
>> +     f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
>> +     if (!f)
>> +             return ERR_PTR(-ENOMEM);
>> +     kref_init(&f->usage);
>> +     f->creator = get_task_pid(current, PIDTYPE_PID);
>> +     f->count = bpf_blocks;
>> +     return f;
>> +}
>> +
>> +/**
>> + * seccomp_filter_free - frees the allocated filter.
>> + * @filter: NULL or live object to be completely destructed.
>> + */
>> +static void seccomp_filter_free(struct seccomp_filter *filter)
>> +{
>> +     if (!filter)
>> +             return;
>> +     put_seccomp_filter(filter->parent);
>> +     put_pid(filter->creator);
>> +     kfree(filter);
>> +}
>> +
>> +static void __put_seccomp_filter(struct kref *kref)
>> +{
>> +     struct seccomp_filter *orig =
>> +             container_of(kref, struct seccomp_filter, usage);
>> +     seccomp_filter_free(orig);
>> +}
>> +
>> +void seccomp_filter_log_failure(int syscall)
>> +{
>> +     pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current), syscall,
>> +             KSTK_EIP(current));
>> +}
>> +
>> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
>> +void put_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return;
>> +     kref_put(&orig->usage, __put_seccomp_filter);
>> +}
>> +
>> +/* get_seccomp_filter - increments the reference count of @orig. */
>> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     kref_get(&orig->usage);
>> +     return orig;
>> +}
>> +
>> +static int seccomp_check_personality(struct seccomp_filter *filter)
>> +{
>> +     if (filter->personality != current->personality)
>> +             return -EACCES;
>> +#ifdef CONFIG_COMPAT
>> +     if (filter->flags.compat != (!!(is_compat_task())))
>> +             return -EACCES;
>> +#endif
>> +     return 0;
>> +}
>> +
>> +static const struct user_regset *
>> +find_prstatus(const struct user_regset_view *view)
>> +{
>> +     const struct user_regset *regset;
>> +     int n;
>> +
>> +     /* Skip 0. */
>> +     for (n = 1; n < view->n; ++n) {
>> +             regset = view->regsets + n;
>> +             if (regset->core_note_type == NT_PRSTATUS)
>> +                     return regset;
>> +     }
>> +
>> +     return NULL;
>> +}
>> +
>> +/**
>> + * seccomp_get_regs - returns a pointer to struct user_regs_struct
>> + * @scratch: preallocated storage of size @available
>> + * @available: pointer to the size of scratch.
>> + *
>> + * Returns NULL if the registers cannot be acquired or copied.
>> + * Returns a populated pointer to @scratch by default.
>> + * Otherwise, returns a pointer to a u8 array containing the struct
>> + * user_regs_struct appropriate for the task personality.  The pointer
>> + * may be to the beginning of @scratch or to an externally managed data
>> + * structure.  On success, @available should be updated with the
>> + * valid region size of the returned pointer.
>> + *
>> + * If the architecture overrides the linkage, then the pointer may point to
>> + * another location.
>> + */
>> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
>> +{
>> +     /* regset is usually returned based on task personality, not current
>> +      * system call convention.  This behavior makes it unsafe to execute
>> +      * BPF programs over regviews if is_compat_task or the personality
>> +      * have changed since the program was installed.
>> +      */
>> +     const struct user_regset_view *view = task_user_regset_view(current);
>> +     const struct user_regset *regset = &view->regsets[0];
>> +     size_t scratch_size = *available;
>> +     if (regset->core_note_type != NT_PRSTATUS) {
>> +             /* The architecture should override this method for speed. */
>> +             regset = find_prstatus(view);
>> +             if (!regset)
>> +                     return NULL;
>> +     }
>> +     *available = regset->n * regset->size;
>> +     /* Make sure the scratch space isn't exceeded. */
>> +     if (*available > scratch_size)
>> +             *available = scratch_size;
>> +     if (regset->get(current, regset, 0, *available, scratch, NULL))
>> +             return NULL;
>> +     return scratch;
>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     struct seccomp_filter *filter;
>> +     u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
>> +     size_t regs_size = sizeof(struct user_regs_struct);
>> +     int ret = -EACCES;
>> +
>> +     filter = current->seccomp.filter; /* uses task ref */
>> +     if (!filter)
>> +             goto out;
>> +
>> +     /* All filters in the list are required to share the same system call
>> +      * convention so only the first filter is ever checked.
>> +      */
>> +     if (seccomp_check_personality(filter))
>> +             goto out;
>> +
>> +     /* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
>> +      * that is not mandatory.  E.g., it may return a pointer to
>> +      * task_pt_regs(current).  NULL checking is mandatory.
>> +      */
>> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>> +     if (!regs)
>> +             goto out;
>> +
>> +     /* Only allow a system call if it is allowed in all ancestors. */
>> +     ret = 0;
>> +     for ( ; filter != NULL; filter = filter->parent) {
>> +             /* Allowed if return value is the size of the data supplied. */
>> +             if (seccomp_run_filter(regs, regs_size, filter->insns) !=
>> +                 regs_size)
>> +                     ret = -EACCES;
>> +     }
>> +out:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_filter: Attaches a seccomp filter to current.
>> + * @fprog: BPF program to install
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called (usually prctl).
>> + *
>> + * This function may be called repeatedly to install additional filters.
>> + * Every filter successfully installed will be evaluated (in reverse order)
>> + * for each system call the thread makes.
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> +     struct seccomp_filter *filter = NULL;
>> +     /* Note, len is a short so overflow should be impossible. */
>> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> +     long ret = -EPERM;
>> +
>> +     /* Allocate a new seccomp_filter */
>> +     filter = seccomp_filter_alloc(fp_size);
>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +
>> +     /* Lock the process personality and calling convention. */
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             filter->flags.compat = 1;
>> +#endif
>> +     filter->personality = current->personality;
>> +
>> +     /* Auditing is not needed since the capability wasn't requested */
>> +     if (security_real_capable_noaudit(current, current_user_ns(),
>> +                                       CAP_SYS_ADMIN) == 0)
>> +             filter->flags.admin = 1;
>> +
>> +     /* Copy the instructions from fprog. */
>> +     ret = -EFAULT;
>> +     if (copy_from_user(filter->insns, fprog->filter, fp_size))
>> +             goto out;
>> +
>> +     /* Check the fprog */
>> +     ret = sk_chk_filter(filter->insns, filter->count);
>> +     if (ret)
>> +             goto out;
>> +
>> +     /* If there is an existing filter, make it the parent
>> +      * and reuse the existing task-based ref.
>> +      */
>> +     filter->parent = current->seccomp.filter;
>> +
>> +     /* Force all filters to use one system call convention. */
>> +     ret = -EINVAL;
>> +     if (filter->parent) {
>> +             if (filter->parent->flags.compat != filter->flags.compat)
>> +                     goto out;
>> +             if (filter->parent->personality != filter->personality)
>> +                     goto out;
>> +     }
>> +
>> +     /* Double claim the new filter so we can release it below simplifying
>> +      * the error paths earlier.
>> +      */
>> +     ret = 0;
>> +     get_seccomp_filter(filter);
>> +     current->seccomp.filter = filter;
>> +     /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
>> +     if (!current->seccomp.mode) {
>> +             current->seccomp.mode = 2;
>> +             set_thread_flag(TIF_SECCOMP);
>> +     }
>> +
>> +out:
>> +     put_seccomp_filter(filter);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +
>> +long prctl_attach_seccomp_filter(char __user *user_filter)
>> +{
>> +     struct sock_fprog fprog;
>> +     long ret = -EINVAL;
>> +
>> +     ret = -EFAULT;
>> +     if (!user_filter)
>> +             goto out;
>> +
>> +     if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
>> +             goto out;
>> +
>> +     ret = seccomp_attach_filter(&fprog);
>> +out:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_check_exec: determines if exec is allowed for current
>> + * Returns 0 if allowed.
>> + */
>> +int seccomp_check_exec(void)
>> +{
>> +     if (current->seccomp.mode != 2)
>> +             return 0;
>> +     /* We can rely on the task refcount for the filter. */
>> +     if (!current->seccomp.filter)
>> +             return -EPERM;
>> +     /* The last attached filter set for the process is checked. It must
>> +      * have been installed with CAP_SYS_ADMIN capabilities.
>
> This comment is confusing.  By 'It must' you mean that if not, it's
> denied.  But if I didn't know better I would read that as "we can't
> get to this code unless".  Can you change it to something like
> "Exec is refused unless the filter was installed with CAP_SYS_ADMIN
> privilege"?

Sounds good!
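
For reference, a small userspace model of the exec check with the
reworded comment might look like the following.  This is only a sketch:
the model_* types are invented for the example and reduce the task state
to the two fields the check reads.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the structures in the patch. */
struct model_filter {
	unsigned int admin;  /* set if installed with CAP_SYS_ADMIN */
};

struct model_seccomp {
	int mode;                    /* 0 = off, 1 = strict, 2 = filter */
	struct model_filter *filter; /* most recently attached filter */
};

/* Returns 0 if exec is allowed, -EPERM otherwise. */
static int model_check_exec(const struct model_seccomp *s)
{
	if (s->mode != 2)
		return 0;
	if (!s->filter)
		return -EPERM;
	/* Exec is refused unless the filter was installed with
	 * CAP_SYS_ADMIN privilege.
	 */
	return s->filter->admin ? 0 : -EPERM;
}
```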

>> +      */
>> +     if (current->seccomp.filter->flags.admin)
>> +             return 0;
>> +     return -EPERM;
>> +}
>> +
>> +/* seccomp_filter_fork: manages inheritance on fork
>> + * @child: forkee
>> + * @parent: forker
>> + * Ensures that @child inherits a seccomp_filter iff seccomp is enabled
>> + * and the set of filters is marked as 'enabled'.
>> + */
>> +void seccomp_filter_fork(struct task_struct *child,
>> +                      struct task_struct *parent)
>> +{
>> +     if (!parent->seccomp.mode)
>> +             return;
>> +     child->seccomp.mode = parent->seccomp.mode;
>> +     child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
>> +}
>> +
>> +/* Returns a pointer to the BPF evaluator after checking the offset and size
>> + * boundaries.  The signature almost matches the signature from
>> + * net/core/filter.c with the hopes of sharing code in the future.
>> + */
>> +static const void *load_pointer(const u8 *buf, size_t buflen,
>> +                             int offset, size_t size,
>> +                             void *unused)
>> +{
>> +     if (offset >= buflen)
>> +             goto fail;
>> +     if (offset < 0)
>> +             goto fail;
>> +     if (size > buflen - offset)
>> +             goto fail;
>> +     return buf + offset;
>> +fail:
>> +     return NULL;
>> +}
>> +
>> +/**
>> + * seccomp_run_filter - evaluate BPF (over user_regs_struct)
>> + *   @buf: buffer to execute the filter over
>> + *   @buflen: length of the buffer
>> + *   @fentry: filter to apply
>> + *
>> + * Decode and apply filter instructions to the buffer.
>> + * Return length to keep, 0 for none. @buf is a regset we are
>> + * filtering, @fentry is the array of filter instructions.
>> + * Because all jumps are guaranteed to be before last instruction,
>> + * and last instruction guaranteed to be a RET, we don't need to check
>> + * flen.
>> + *
>> + * See net/core/filter.c, as this is nearly an exact copy.
>> + * At some point, it would be nice to merge them to take advantage of
>> + * optimizations (like JIT).
>> + *
>> + * A successful filter must return the full length of the data. Anything less
>> + * will currently result in a seccomp failure.  In the future, it may be
>> + * possible to use that for hard filtering registers on the fly so it is
>> + * ideal for consumers to return 0 on intended failure.
>> + */
>> +static unsigned int seccomp_run_filter(const u8 *buf,
>> +                                    const size_t buflen,
>> +                                    const struct sock_filter *fentry)
>> +{
>> +     const void *ptr;
>> +     u32 A = 0;                      /* Accumulator */
>> +     u32 X = 0;                      /* Index Register */
>> +     u32 mem[BPF_MEMWORDS];          /* Scratch Memory Store */
>> +     u32 tmp;
>> +     int k;
>> +
>> +     /*
>> +      * Process array of filter instructions.
>> +      */
>> +     for (;; fentry++) {
>> +#if defined(CONFIG_X86_32)
>> +#define      K (fentry->k)
>> +#else
>> +             const u32 K = fentry->k;
>> +#endif
>> +
>> +             switch (fentry->code) {
>> +             case BPF_S_ALU_ADD_X:
>> +                     A += X;
>> +                     continue;
>> +             case BPF_S_ALU_ADD_K:
>> +                     A += K;
>> +                     continue;
>> +             case BPF_S_ALU_SUB_X:
>> +                     A -= X;
>> +                     continue;
>> +             case BPF_S_ALU_SUB_K:
>> +                     A -= K;
>> +                     continue;
>> +             case BPF_S_ALU_MUL_X:
>> +                     A *= X;
>> +                     continue;
>> +             case BPF_S_ALU_MUL_K:
>> +                     A *= K;
>> +                     continue;
>> +             case BPF_S_ALU_DIV_X:
>> +                     if (X == 0)
>> +                             return 0;
>> +                     A /= X;
>> +                     continue;
>> +             case BPF_S_ALU_DIV_K:
>> +                     A = reciprocal_divide(A, K);
>> +                     continue;
>> +             case BPF_S_ALU_AND_X:
>> +                     A &= X;
>> +                     continue;
>> +             case BPF_S_ALU_AND_K:
>> +                     A &= K;
>> +                     continue;
>> +             case BPF_S_ALU_OR_X:
>> +                     A |= X;
>> +                     continue;
>> +             case BPF_S_ALU_OR_K:
>> +                     A |= K;
>> +                     continue;
>> +             case BPF_S_ALU_LSH_X:
>> +                     A <<= X;
>> +                     continue;
>> +             case BPF_S_ALU_LSH_K:
>> +                     A <<= K;
>> +                     continue;
>> +             case BPF_S_ALU_RSH_X:
>> +                     A >>= X;
>> +                     continue;
>> +             case BPF_S_ALU_RSH_K:
>> +                     A >>= K;
>> +                     continue;
>> +             case BPF_S_ALU_NEG:
>> +                     A = -A;
>> +                     continue;
>> +             case BPF_S_JMP_JA:
>> +                     fentry += K;
>> +                     continue;
>> +             case BPF_S_JMP_JGT_K:
>> +                     fentry += (A > K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGE_K:
>> +                     fentry += (A >= K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JEQ_K:
>> +                     fentry += (A == K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JSET_K:
>> +                     fentry += (A & K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGT_X:
>> +                     fentry += (A > X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGE_X:
>> +                     fentry += (A >= X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JEQ_X:
>> +                     fentry += (A == X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JSET_X:
>> +                     fentry += (A & X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_LD_W_ABS:
>> +                     k = K;
>> +load_w:
>> +                     ptr = load_pointer(buf, buflen, k, 4, &tmp);
>> +                     if (ptr != NULL) {
>> +                             /* Note, unlike on network data, values are not
>> +                              * byte swapped.
>> +                              */
>> +                             A = *(const u32 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_H_ABS:
>> +                     k = K;
>> +load_h:
>> +                     ptr = load_pointer(buf, buflen, k, 2, &tmp);
>> +                     if (ptr != NULL) {
>> +                             A = *(const u16 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_B_ABS:
>> +                     k = K;
>> +load_b:
>> +                     ptr = load_pointer(buf, buflen, k, 1, &tmp);
>> +                     if (ptr != NULL) {
>> +                             A = *(const u8 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_W_LEN:
>> +                     A = buflen;
>> +                     continue;
>> +             case BPF_S_LDX_W_LEN:
>> +                     X = buflen;
>> +                     continue;
>> +             case BPF_S_LD_W_IND:
>> +                     k = X + K;
>> +                     goto load_w;
>> +             case BPF_S_LD_H_IND:
>> +                     k = X + K;
>> +                     goto load_h;
>> +             case BPF_S_LD_B_IND:
>> +                     k = X + K;
>> +                     goto load_b;
>> +             case BPF_S_LDX_B_MSH:
>> +                     ptr = load_pointer(buf, buflen, K, 1, &tmp);
>> +                     if (ptr != NULL) {
>> +                             X = (*(u8 *)ptr & 0xf) << 2;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_IMM:
>> +                     A = K;
>> +                     continue;
>> +             case BPF_S_LDX_IMM:
>> +                     X = K;
>> +                     continue;
>> +             case BPF_S_LD_MEM:
>> +                     A = mem[K];
>> +                     continue;
>> +             case BPF_S_LDX_MEM:
>> +                     X = mem[K];
>> +                     continue;
>> +             case BPF_S_MISC_TAX:
>> +                     X = A;
>> +                     continue;
>> +             case BPF_S_MISC_TXA:
>> +                     A = X;
>> +                     continue;
>> +             case BPF_S_RET_K:
>> +                     return K;
>> +             case BPF_S_RET_A:
>> +                     return A;
>> +             case BPF_S_ST:
>> +                     mem[K] = A;
>> +                     continue;
>> +             case BPF_S_STX:
>> +                     mem[K] = X;
>> +                     continue;
>> +             case BPF_S_ANC_PROTOCOL:
>> +             case BPF_S_ANC_PKTTYPE:
>> +             case BPF_S_ANC_IFINDEX:
>> +             case BPF_S_ANC_MARK:
>> +             case BPF_S_ANC_QUEUE:
>> +             case BPF_S_ANC_HATYPE:
>> +             case BPF_S_ANC_RXHASH:
>> +             case BPF_S_ANC_CPU:
>> +             case BPF_S_ANC_NLATTR:
>> +             case BPF_S_ANC_NLATTR_NEST:
>> +                     /* ignored */
>> +                     continue;
>> +             default:
>> +                     WARN_RATELIMIT(1, "Unknown code:%u jt:%u jf:%u k:%u\n",
>> +                                    fentry->code, fentry->jt,
>> +                                    fentry->jf, fentry->k);
>> +                     return 0;
>> +             }
>> +     }
>> +
>> +     return 0;
>> +}
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 481611f..77f2eda 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>               case PR_SET_SECCOMP:
>>                       error = prctl_set_seccomp(arg2);
>>                       break;
>> +             case PR_ATTACH_SECCOMP_FILTER:
>> +                     error = prctl_attach_seccomp_filter((char __user *)
>> +                                                             arg2);
>> +                     break;
>>               case PR_GET_TSC:
>>                       error = GET_TSC_CTL(arg2);
>>                       break;
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 51bd5a0..77b1106 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
>>
>>         If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> +     bool "Enable seccomp-based system call filtering"
>> +     select SECCOMP
>> +     depends on EXPERIMENTAL
>> +     help
>> +       This kernel feature expands CONFIG_SECCOMP to allow computing
>> +       in environments with reduced kernel access dictated by a system
>> +       call filter, expressed in BPF, installed by the application itself
>> +       through prctl(2).
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>>  config SECURITY
>>       bool "Enable different security models"
>>       depends on SYSFS
>> --
>> 1.7.5.4
>>

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 14:50   ` Oleg Nesterov
@ 2012-01-12 16:55     ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 16:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, Jan 12, 2012 at 8:50 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/11, Will Drewry wrote:
>>
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>
> Cool ;)
>
> I didn't really read this patch yet, just one nit.
>
>> +#define seccomp_filter_init_task(_tsk) do { \
>> +     (_tsk)->seccomp.filter = NULL; \
>> +} while (0);
>
> Cosmetic and subjective, but imho it would be better to add inline
> functions instead of define's.

I'm refactoring it a bit to make that possible.  Since seccomp
fork/init/free never need access to the whole task_struct, I'll just pass
in what's needed (and avoid the sched.h inclusion recursion).
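
A minimal sketch of what the inline-function form could look like,
written as a userspace model rather than the actual next revision.  The
model_* names are made up for the example.  One practical difference: the
posted macros end in "while (0);", so a caller's own semicolon adds an
empty statement and an unbraced if/else around the macro stops compiling,
while a function call has no such trap and is type-checked as well.

```c
#include <stddef.h>

struct model_filter;  /* opaque in this sketch */

/* Hypothetical reduction of seccomp_t to the fields used here. */
struct model_seccomp {
	int mode;
	struct model_filter *filter;
};

/* Type-checked replacement for the init macro; it takes only the
 * seccomp state rather than a whole task.
 */
static inline void model_seccomp_filter_init(struct model_seccomp *s)
{
	s->filter = NULL;
}

/* A function call is a single statement, so it is safe between
 * if and else without braces.  Returns 1 if the state was reset.
 */
static inline int model_reset_if(struct model_seccomp *s, int cond)
{
	if (cond)
		model_seccomp_filter_init(s);
	else
		return 0;
	return 1;
}
```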

Comments on the next round will most definitely be appreciated!

>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     seccomp_filter_free_task(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       /* Perform scheduler related setup. Assign this task to a CPU. */
>>       sched_fork(p);
>>
>> +     seccomp_filter_init_task(p);
>
> This doesn't look right, or I missed something. seccomp_filter_init_task()
> should be called right after dup_task_struct(), at least before
> copy_process() can fail.
>
> Otherwise copy_process()->free_task()->seccomp_filter_free_task() can put
> current->seccomp.filter copied by arch_dup_task_struct().

Ah - makes sense!  I moved it under dup_task_struct before any goto's
to bad_fork_free.
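
The ordering problem can be modeled in userspace.  This is only a sketch
under stated assumptions: struct filter uses a plain int where the patch
uses a struct kref, and dup_task(), filter_free_task() and failed_fork()
are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted filter; the patch uses a struct kref instead. */
struct filter {
	int refs;
};

struct task {
	struct filter *filter;
};

/* Plain structure copy of the parent, so the filter pointer is copied too. */
static struct task dup_task(const struct task *parent)
{
	return *parent;
}

/* Drop the reference the task appears to hold, if any. */
static void filter_free_task(struct task *t)
{
	if (t->filter) {
		assert(t->filter->refs > 0);
		t->filter->refs--;
	}
	t->filter = NULL;
}

/*
 * Model of a fork that fails after duplication.  With init_early set, the
 * copied pointer is cleared before anything can fail, so the error path
 * has nothing to put.  Without it, the error path drops a reference that
 * the child never took.
 */
static void failed_fork(const struct task *parent, int init_early)
{
	struct task child = dup_task(parent);

	if (init_early)
		child.filter = NULL;	/* models seccomp_filter_init_task() */
	filter_free_task(&child);	/* models the error path */
}
```

With a parent holding the only reference, a failed fork with early init
leaves the count at 1; without early init it drops to 0 while the parent
still points at the filter.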

>> +struct seccomp_filter {
>> +     struct kref usage;
>> +     struct pid *creator;
>
> Why? seccomp_filter->creator is never used, no?

Removing it. It is from a related patch I'm experimenting with (adding
optional tracehook support), but it has no bearing here.

Thanks - new patch revision incoming!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
  2012-01-12 16:14     ` Oleg Nesterov
  2012-01-12 16:14     ` Andrew Lutomirski
@ 2012-01-12 16:59     ` Will Drewry
  2012-01-12 17:22       ` Jamie Lokier
  2012-01-12 17:36     ` Jamie Lokier
  3 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 16:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.

> Maybe I don't understand this correctly.

You're right on.  This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below.  (Goes back to catching up.)

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
                     ` (4 preceding siblings ...)
  2012-01-12 16:22   ` Oleg Nesterov
@ 2012-01-12 17:02   ` Andrew Lutomirski
  2012-01-16 20:28     ` Will Drewry
  5 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 17:02 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wed, Jan 11, 2012 at 9:25 AM, Will Drewry <wad@chromium.org> wrote:
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).
>

There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c.  How should time(), getcpu(), and
gettimeofday() be handled?  If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.

Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work.  I
don't know how to change that efficiently.

--Andy

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:18   ` Alan Cox
@ 2012-01-12 17:03     ` Will Drewry
  2012-01-12 17:11       ` Alan Cox
  2012-01-13  1:31     ` James Morris
  1 sibling, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:18 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> I think this model is wrong. The rest of the policy rules all work on the
> basis that dumpable is the decider (the same rules for not dumping, not
> tracing, etc). A user should be able to apply filter to their own code
> arbitrarily. Any setuid app should IMHO lose the trace subject to the usual
> uid rules and capability rules. That would seem to be more flexible and
> also the path of least surprise.

My line of thinking up to now has been that disallowing setuid exec
would mean there is no risk of an errant setuid binary allowing escape
from the system call filters (which the containers people may care
more about).  Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.

Would it be sane to just disallow setuid exec exclusively?

> [plus you can implement non setuid exec entirely in userspace so it's
> a rather meaningless distinction you propose]

Agreed.

>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>
> A BPF jit ought to be trivial and would be a big win.
>
> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.
>

Thanks!  Yeah I think merging with the network stack is eminently
doable, but I didn't want to bog down the proposal in how much
overhead I might be adding to the network layer.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:47         ` Oleg Nesterov
@ 2012-01-12 17:08           ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/12, Steven Rostedt wrote:
>>
>> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>>
>> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
>> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
>> >
>> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
>>
>> I've never used seccomp, so I admit I'm totally ignorant on this topic.
>
> me too ;)
>
>> But looking at seccomp from the outside, the biggest advantage to this
>> would be the ability for normal processes to be able to limit tasks it
>> kicks off. If I want to run a task in a sandbox, I don't want to be root
>> to do so.
>>
>> I guess a web browser doesn't perform an exec to run java programs. But
>> it would be nice if I could execute something from the command line that
>> I could run in a sand box.
>>
>> What's the problem with making sure that the setuid isn't set before
>> doing an execv? Only fail when setuid (or some other magic) is enabled
>> on the file being exec'd.
>
> I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
> Just I do not know what would be the most simple/clean way to do this.
>
>
> And in any case I agree that the current seccomp_check_exec() looks
> strange. Btw, it does
> {
>        if (current->seccomp.mode != 2)
>                return 0;
>        /* We can rely on the task refcount for the filter. */
>        if (!current->seccomp.filter)
>                return -EPERM;
>
> How is it possible to have seccomp.filter == NULL with mode == 2?

It shouldn't be. It's another relic I missed from development. (Adding to v3 :)

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:27       ` Steven Rostedt
  2012-01-12 16:51         ` Andrew Lutomirski
@ 2012-01-12 17:09         ` Linus Torvalds
  2012-01-12 17:17           ` Steven Rostedt
  2012-01-12 18:18           ` Andrew Lutomirski
  1 sibling, 2 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-12 17:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.

The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.

For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.

And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.

We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.

There are probably other things like that - things like allowing users
to do bind mounts etc - that aren't dangerous in themselves, but that
are dangerous mainly because they can be used to fool things into
privilege escalations. So this is definitely not a filter-only issue.
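
One possible reading of the rule, as a userspace model.  Everything here is
hypothetical: struct proc, its field names and both helpers are made up for
illustration, and a real implementation would have more cases to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process state; field names are illustrative only. */
struct proc {
	bool no_escalate;	/* one-way "cannot escalate privileges" flag */
	bool filtered;		/* a syscall filter is attached */
};

/* The flag can be set but there is deliberately no way to clear it. */
static void set_no_escalate(struct proc *p)
{
	assert(p != NULL);
	p->no_escalate = true;
}

/*
 * Decide whether execve() may proceed.  Once the flag is set, the process
 * is locked to its current privileges, so suid/sgid binaries are refused
 * and a filter may safely cross the exec.  Without the flag, a filtered
 * process may not exec at all.
 */
static bool may_exec(const struct proc *p, bool file_is_setid)
{
	if (p->no_escalate)
		return !file_is_setid;
	return !p->filtered;
}
```

The same test could gate chroot() or bind mounts for unprivileged users.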

                       Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:22   ` Oleg Nesterov
@ 2012-01-12 17:10     ` Will Drewry
  2012-01-12 17:23       ` Oleg Nesterov
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/11, Will Drewry wrote:
>>
>> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
>> +{
>> +     /* regset is usually returned based on task personality, not current
>> +      * system call convention.  This behavior makes it unsafe to execute
>> +      * BPF programs over regviews if is_compat_task or the personality
>> +      * have changed since the program was installed.
>> +      */
>> +     const struct user_regset_view *view = task_user_regset_view(current);
>> +     const struct user_regset *regset = &view->regsets[0];
>> +     size_t scratch_size = *available;
>> +     if (regset->core_note_type != NT_PRSTATUS) {
>> +             /* The architecture should override this method for speed. */
>> +             regset = find_prstatus(view);
>> +             if (!regset)
>> +                     return NULL;
>> +     }
>> +     *available = regset->n * regset->size;
>> +     /* Make sure the scratch space isn't exceeded. */
>> +     if (*available > scratch_size)
>> +             *available = scratch_size;
>> +     if (regset->get(current, regset, 0, *available, scratch, NULL))
>> +             return NULL;
>> +     return scratch;
>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     struct seccomp_filter *filter;
>> +     u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
>> +     size_t regs_size = sizeof(struct user_regs_struct);
>> +     int ret = -EACCES;
>> +
>> +     filter = current->seccomp.filter; /* uses task ref */
>> +     if (!filter)
>> +             goto out;
>> +
>> +     /* All filters in the list are required to share the same system call
>> +      * convention so only the first filter is ever checked.
>> +      */
>> +     if (seccomp_check_personality(filter))
>> +             goto out;
>> +
>> +     /* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
>> +      * that is not mandatory.  E.g., it may return a pointer to
>> +      * task_pt_regs(current).  NULL checking is mandatory.
>> +      */
>> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>
> Stupid question. I am sure you know what are you doing ;) and I know
> nothing about !x86 arches.
>
> But could you explain why it is designed to use user_regs_struct ?
> Why we can't simply use task_pt_regs() and avoid the (costly) regsets?

So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true.  I don't think it's
kosher to expose pt_regs to the userspace, but if, let's say, x86-32
overrides the weak linkage, then it could just return task_pt_regs and
be the fastest path.

If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
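
A rough userspace model of the two paths, to show the contract the caller
has to honor.  The patch relies on __weak so an architecture can replace
seccomp_get_regs() at link time; in this sketch two separately named
functions stand in for the generic and the overriding definition, and
task_regs is a hypothetical stand-in for the saved registers of current.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the saved user registers of the current task. */
static unsigned char task_regs[64];

/*
 * Generic path: copy the registers into the caller's scratch buffer,
 * clamped to the space the caller said it has.
 */
static const unsigned char *get_regs_generic(unsigned char *scratch,
					     size_t *available)
{
	size_t n = sizeof(task_regs);

	/* Make sure the scratch space isn't exceeded. */
	if (n > *available)
		n = *available;
	memcpy(scratch, task_regs, n);
	*available = n;
	return scratch;
}

/*
 * Hypothetical arch override: the layouts match, so skip the copy and
 * hand back the saved registers directly.  The scratch buffer is unused.
 */
static const unsigned char *get_regs_fast(unsigned char *scratch,
					  size_t *available)
{
	(void)scratch;
	*available = sizeof(task_regs);
	return task_regs;
}
```

Either way the caller must use the returned pointer and the updated size,
and must not assume the data lives in its own scratch buffer.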

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:03     ` Will Drewry
@ 2012-01-12 17:11       ` Alan Cox
  2012-01-12 17:52         ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Alan Cox @ 2012-01-12 17:11 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

> more about).  Since setuid is privilege escalation, then perhaps it
> makes sense to allow it as an escape hatch.
> 
> Would it be sane to just disallow setuid exec exclusively?

I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?

Alan

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:09         ` Linus Torvalds
@ 2012-01-12 17:17           ` Steven Rostedt
  2012-01-12 18:18           ` Andrew Lutomirski
  1 sibling, 0 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 09:09 -0800, Linus Torvalds wrote:

> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.

After I wrote this comment I thought the same thing. It would be nice to
have a way to just set a flag to a process that will prevent it from
doing any escalating of privileges.

I totally agree, this would solve a whole host of issues with regard to
security issues in things that shouldn't be a problem but currently are.

-- Steve





^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:59     ` Will Drewry
@ 2012-01-12 17:22       ` Jamie Lokier
  2012-01-12 17:35         ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:22 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Will Drewry wrote:
> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> >
> >> Filter programs may _only_ cross the execve(2) barrier if last filter
> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> >> user namespace.  Once a task-local filter program is attached from a
> >> process without privileges, execve will fail.  This ensures that only
> >> privileged parent task can affect its privileged children (e.g., setuid
> >> binary).
> >
> > This means that a non privileged user can not run another program with
> > limited features? How would a process exec another program and filter
> > it? I would assume that the filter would need to be attached first and
> > then the execv() would be performed. But after the filter is attached,
> > the execv is prevented?
> 
> Yeah - it means tasks can filter themselves, but not each other.
> However, you can inject a filter for any dynamically linked executable
> using LD_PRELOAD.
> 
> > Maybe I don't understand this correctly.
> 
> You're right on.  This was to ensure that one process didn't cause
> crazy behavior in another. I think Alan has a better proposal than
> mine below.  (Goes back to catching up.)

You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.

ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.

Speeding up ptrace() with BPF filters would be really nice.  Not
that I like ptrace(), but sometimes it's the only thing you can rely on.

LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.

There's already a security model around who can use ptrace(); speeding
it up needn't break that.

If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants.  Google's NaCl comes to mind as a potential user.

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:10     ` Will Drewry
@ 2012-01-12 17:23       ` Oleg Nesterov
  2012-01-12 17:51         ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-12 17:23 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> +      */
> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
> >
> > Stupid question. I am sure you know what are you doing ;) and I know
> > nothing about !x86 arches.
> >
> > But could you explain why it is designed to use user_regs_struct ?
> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>
> So on x86 32, it would work since user_regs_struct == task_pt_regs
> (iirc), but on x86-64
> and others, that's not true.

Yes sure, I meant that userspace should use pt_regs too.

> If it would be appropriate to expose pt_regs to userspace, then I'd
> happily do so :)

Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.

Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.

Oleg.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-12 18:16         ` Randy Dunlap
@ 2012-01-12 17:23           ` Will Drewry
  2012-01-12 17:34             ` Steven Rostedt
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:23 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet

On Thu, Jan 12, 2012 at 12:16 PM, Randy Dunlap <rdunlap@xenotime.net> wrote:
> On 01/11/2012 03:19 PM, Will Drewry wrote:
>> Document how system call filtering with BPF works and
>> may be used.  Includes an example for x86 (32-bit).
>
> Please tell some of us what "BPF" means.  wikipedia lists 15 possible
> choices, but I don't know which one to choose.

I'll make it clearer in the documentation file and update the patch description.

BPF == Berkeley Packet Filters, which are implemented in Linux Socket
Filters (LSF).
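
To give a feel for the idea, here is a toy interpreter run over register
data rather than a packet.  It is a sketch, not real BPF: the three opcodes
and struct insn are simplified stand-ins, loads use host byte order, and
the convention that a nonzero return means "allow" is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified opcode set; real BPF encodes instructions differently. */
enum op { OP_LD_ABS_W, OP_JEQ, OP_RET };

struct insn {
	enum op op;
	uint32_t k;	/* load offset, comparison value, or return value */
	uint8_t jt, jf;	/* OP_JEQ: instructions to skip if equal / not equal */
};

/*
 * Run prog over data and return the value of the first OP_RET reached.
 * Out-of-bounds loads and running off the end of the program return 0.
 */
static uint32_t run_filter(const struct insn *prog, size_t len,
			   const void *data, size_t data_len)
{
	uint32_t acc = 0;
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct insn *i = &prog[pc];

		switch (i->op) {
		case OP_LD_ABS_W:
			if (i->k > data_len || data_len - i->k < sizeof(acc))
				return 0;
			memcpy(&acc, (const unsigned char *)data + i->k,
			       sizeof(acc));
			break;
		case OP_JEQ:
			pc += (acc == i->k) ? i->jt : i->jf;
			break;
		case OP_RET:
			return i->k;
		}
	}
	return 0;
}

/* Example: allow only when the 32-bit word at offset 0 equals 4. */
static const struct insn allow_nr_4[] = {
	{ OP_LD_ABS_W, 0, 0, 0 },	/* acc = word at offset 0 */
	{ OP_JEQ, 4, 0, 1 },		/* equal: fall through, else skip one */
	{ OP_RET, 1, 0, 0 },		/* allow */
	{ OP_RET, 0, 0, 0 },		/* deny */
};
```

The value 4 and the offset 0 are arbitrary; a real filter would use the
offset of the syscall number in the register layout it is evaluated over.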

thanks!

>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
>>  samples/Makefile                       |    2 +-
>>  samples/seccomp/Makefile               |   12 ++++
>>  samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
>>  4 files changed, 186 insertions(+), 1 deletions(-)
>>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>>  create mode 100644 samples/seccomp/Makefile
>>  create mode 100644 samples/seccomp/bpf-example.c
>
>
> --
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
@ 2012-01-12 17:25     ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:25 UTC (permalink / raw)
  To: Łukasz Sowa
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 7:13 AM, Łukasz Sowa <luksow@gmail.com> wrote:
> Hi Will,
>
> That's a very different approach to the system call interposition problem.
> I find your solution very interesting. It gives far more capabilities
> than my syscalls cgroup that you commented on some time ago. It's ready
> now but I haven't tried filtering yet. I think that if your solution
> makes it to the mainline (and I guess that's really possible at the current
> stage :)), there will be no place for my solution but that's ok.

Yeah - there've been so many tries, I'll be happy when one makes it in
which is usable :)

> There's one thing that I'm curious about - have you measured overhead in
> any way? That was one of the biggest issues in all previous attempts to
> limit syscalls. I'd love to compare the numbers with mine solution.

Certainly. I have some rough numbers, but nothing I'd call strong
measurements.  There is still a fair amount of cost due to the syscall
slow path.

> I'll examine your patch later on and put some comments if I bump into
> something.

Much appreciated - cheers!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:38       ` Steven Rostedt
  2012-01-12 16:47         ` Oleg Nesterov
@ 2012-01-12 17:30         ` Jamie Lokier
  2012-01-12 17:40           ` Steven Rostedt
  1 sibling, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
> 
> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> > 
> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
> 
> I've never used seccomp, so I admit I'm totally ignorant on this topic.
> 
> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
> 
> I guess a web browser doesn't perform an exec to run java programs.

Actually it does.  Firefox on Linux forks and execs the Java VM.
Same for Flash, using "plugin-container".

> But it would be nice if I could execute something from the command
> line that I could run in a sand box.

You can do this now, using ptrace().  It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address.  (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-12 17:23           ` Will Drewry
@ 2012-01-12 17:34             ` Steven Rostedt
  0 siblings, 0 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:34 UTC (permalink / raw)
  To: Will Drewry
  Cc: Randy Dunlap, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet

On Thu, 2012-01-12 at 11:23 -0600, Will Drewry wrote:

> > Please tell some of us what "BPF" means.  wikipedia lists 15 possible
> > choices, but I don't know which one to choose.
> 
> I'll make it clearer in the documentation file and update the patch description.
> 
> BPF == Berkeley Packet Filters, which are implemented in Linux Socket
> Filters (LSF).
> 

I admit, I was totally clueless in what it meant too ;)

Even the LWN article didn't explain (shame on you Jon).

"he has repurposed the networking layer's packet filtering mechanism
(BPF)"

I didn't know what the "B" stood for.

-- Steve



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:22       ` Jamie Lokier
@ 2012-01-12 17:35         ` Will Drewry
  2012-01-12 17:57           ` Jamie Lokier
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:35 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >
>> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> user namespace.  Once a task-local filter program is attached from a
>> >> process without privileges, execve will fail.  This ensures that only
>> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> binary).
>> >
>> > This means that a non privileged user can not run another program with
>> > limited features? How would a process exec another program and filter
>> > it? I would assume that the filter would need to be attached first and
>> > then the execv() would be performed. But after the filter is attached,
>> > the execv is prevented?
>>
>> Yeah - it means tasks can filter themselves, but not each other.
>> However, you can inject a filter for any dynamically linked executable
>> using LD_PRELOAD.
>>
>> > Maybe I don't understand this correctly.
>>
>> You're right on.  This was to ensure that one process didn't cause
>> crazy behavior in another. I think Alan has a better proposal than
>> mine below.  (Goes back to catching up.)
>
> You can already use ptrace() to cause crazy behaviour in another
> process, including modifying registers arbitrarily at syscall entry
> and exit, aborting and emulating syscalls.
>
> ptrace() is quite slow and it would be really nice to speed it up,
> especially for trapping a small subset of syscalls, or limiting some
> kinds of access to some file descriptors, while everything else runs
> at normal speed.
>
> Speeding up ptrace() with BPF filters would be really nice.  Not
> that I like ptrace(), but sometimes it's the only thing you can rely on.
>
> LD_PRELOAD and code running in the target process address space can't
> always be trusted in some contexts (e.g. the target process may modify
> the tracing code or its data); whereas ptrace() is pretty complete and
> reliable, if ugly.
>
> There's already a security model around who can use ptrace(); speeding
> it up needn't break that.
>
> If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> needed as userspace could have done it, with exactly the restrictions
> it wants.  Google's NaCl comes to mind as a potential user.

That's not entirely true.  ptrace supervisors are subject to races and
always fail open.  This makes them effective but not as robust as a
seccomp solution can provide.

With seccomp, it fails closed.  What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.

Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.

Does that make sense?
thanks!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
                       ` (2 preceding siblings ...)
  2012-01-12 16:59     ` Will Drewry
@ 2012-01-12 17:36     ` Jamie Lokier
  3 siblings, 0 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Steven Rostedt wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> 
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace.  Once a task-local filter program is attached from a
> > process without privileges, execve will fail.  This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
> 
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Ugly method: Using ptrace(), trap after the execve() and issue fake
syscalls to install the filter.  I feel dirty thinking it, in a good way.

LD_PRELOAD has been suggested.  It's not 100% reliable because not all
executables are dynamic (on some uClinux platforms none of them are),
but it will usually work.

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:30         ` Jamie Lokier
@ 2012-01-12 17:40           ` Steven Rostedt
  2012-01-12 17:44             ` Jamie Lokier
  2012-01-12 22:18             ` Will Drewry
  0 siblings, 2 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:

> You can do this now, using ptrace().  It's horrible, but half of the
> horribleness is needing to understand machine-dependent registers,
> which this new patch doesn't address.  (The other half is a ton of
> undocumented but important ptrace() behaviours on Linux.)

Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p

I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say it's OK, and then switching the symlink to a setuid file.

When the new execv is executed, the parent process would lose all control
over it. The idea is to prevent this.

I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.

-- Steve



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:40           ` Steven Rostedt
@ 2012-01-12 17:44             ` Jamie Lokier
  2012-01-12 17:56               ` Steven Rostedt
  2012-01-12 22:18             ` Will Drewry
  1 sibling, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
> 
> > You can do this now, using ptrace().  It's horrible, but half of the
> > horribleness is needing to understand machine-dependent registers,
> > which this new patch doesn't address.  (The other half is a ton of
> > undocumented but important ptrace() behaviours on Linux.)
> 
> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p

That warm fuzzy feeling :-)

> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say it's OK, and then switching the symlink to a setuid file.
>
> When the new execv is executed, the parent process would lose all control
> over it. The idea is to prevent this.

fexecve() exists to solve the problem.
Also known as execve("/proc/self/fd/...") on Linux.
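
A minimal sketch of that idea, assuming a made-up policy of refusing
setuid/setgid binaries (mode_is_setid() and exec_checked() are
hypothetical names, not from the patch): open the file once, check the
opened file, then exec that same file descriptor, so swapping the
symlink after the check changes nothing.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns nonzero if the mode has the setuid or setgid bit set. */
int mode_is_setid(mode_t mode)
{
	return (mode & (S_ISUID | S_ISGID)) != 0;
}

/*
 * Hypothetical helper: resolve the path exactly once, inspect the file
 * that was actually opened, then execute that same open file.
 * Returns -1 on any failure; does not return on success.
 */
int exec_checked(const char *path, char *const argv[], char *const envp[])
{
	struct stat st;
	int fd = open(path, O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	/* The check is on the fd, not on the path, so a later symlink
	 * swap cannot redirect the exec to a different file. */
	if (fstat(fd, &st) < 0 || mode_is_setid(st.st_mode)) {
		close(fd);
		return -1;
	}
	fexecve(fd, argv, envp);	/* only returns on error */
	close(fd);
	return -1;
}
```

This only pins which file gets executed; it does not stop the file's
owner from changing its mode bits between the fstat() and the exec.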

> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

I like the idea of letting the tracer decide what it wants.

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:23       ` Oleg Nesterov
@ 2012-01-12 17:51         ` Will Drewry
  2012-01-13 17:31           ` Oleg Nesterov
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/12, Will Drewry wrote:
>>
>> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >> +      */
>> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>> >
>> > Stupid question. I am sure you know what you are doing ;) and I know
>> > nothing about !x86 arches.
>> >
>> > But could you explain why it is designed to use user_regs_struct ?
>> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>
>> So on x86 32, it would work since user_regs_struct == task_pt_regs
>> (iirc), but on x86-64 and others, that's not true.
>
> Yes sure, I meant that userspace should use pt_regs too.
>
>> If it would be appropriate to expose pt_regs to userspace, then I'd
>> happily do so :)
>
> Ah, so that was the reason. But it is already exported? At least I see
> the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>
> Once again, I am not arguing, just trying to understand. And I do not
> know if this definition is part of abi.

I don't either :/  My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.

I'd love for pt_regs to be fair game to cut down on the copying!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:11       ` Alan Cox
@ 2012-01-12 17:52         ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 17:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:11 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> more about).  Since setuid is privilege escalation, then perhaps it
>> makes sense to allow it as an escape hatch.
>>
>> Would it be sane to just disallow setuid exec exclusively?
>
> I think that is a policy question. I can imagine cases where either
> behaviour is the "right" one so it may need to be a parameter ?

Makes sense. I'll make it flaggable (ignoring the parallel conversation
about having a thread-wide suidable bit).

thanks!

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:44             ` Jamie Lokier
@ 2012-01-12 17:56               ` Steven Rostedt
  2012-01-12 23:27                 ` Alan Cox
  0 siblings, 1 reply; 235+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 17:44 +0000, Jamie Lokier wrote:

> > I like Alan's suggestion. Have userspace decide to allow execv or not,
> > and even let it decide if it should allow setuid execv's or not, but
> > still allow non-setuid execvs. If you allow the setuid execv, once that
> > happens, the same behavior will occur as with ptrace. A setuid execv
> > will lose all its filtering.
> 
> I like the idea of letting the tracer decide what it wants.

Right, and if we implement the suggestion that Linus made, to set a flag
to prevent a task from ever getting privilege, then seccomp can add
that too.

That is, there can be a filter to say "prevent this task from doing
anything with privilege" and that will prevent execv from gaining setuid
privilege. Perhaps, it would still do the execv, but the program that is
executed will run as the normal user, and just fail when it tries to do
something that requires sys admin privilege.

Thus, execv will not be a "special" case here. Seccomp either allows it
or not. But also add a command to tell seccomp that this task will not
be allowed to do anything privileged.

-- Steve



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:35         ` Will Drewry
@ 2012-01-12 17:57           ` Jamie Lokier
  2012-01-12 18:03             ` Will Drewry
                               ` (2 more replies)
  0 siblings, 3 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:57 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Will Drewry wrote:
> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
> > Will Drewry wrote:
> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> >> >
> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> >> >> user namespace.  Once a task-local filter program is attached from a
> >> >> process without privileges, execve will fail.  This ensures that only
> >> >> privileged parent task can affect its privileged children (e.g., setuid
> >> >> binary).
> >> >
> >> > This means that a non privileged user can not run another program with
> >> > limited features? How would a process exec another program and filter
> >> > it? I would assume that the filter would need to be attached first and
> >> > then the execv() would be performed. But after the filter is attached,
> >> > the execv is prevented?
> >>
> >> Yeah - it means tasks can filter themselves, but not each other.
> >> However, you can inject a filter for any dynamically linked executable
> >> using LD_PRELOAD.
> >>
> >> > Maybe I don't understand this correctly.
> >>
> >> You're right on.  This was to ensure that one process didn't cause
> >> crazy behavior in another. I think Alan has a better proposal than
> >> mine below.  (Goes back to catching up.)
> >
> > You can already use ptrace() to cause crazy behaviour in another
> > process, including modifying registers arbitrarily at syscall entry
> > and exit, aborting and emulating syscalls.
> >
> > ptrace() is quite slow and it would be really nice to speed it up,
> > especially for trapping a small subset of syscalls, or limiting some
> > kinds of access to some file descriptors, while everything else runs
> > at normal speed.
> >
> > Speeding up ptrace() with BPF filters would be really nice.  Not
> > that I like ptrace(), but sometimes it's the only thing you can rely on.
> >
> > LD_PRELOAD and code running in the target process address space can't
> > always be trusted in some contexts (e.g. the target process may modify
> > the tracing code or its data); whereas ptrace() is pretty complete and
> > reliable, if ugly.
> >
> > There's already a security model around who can use ptrace(); speeding
> > it up needn't break that.
> >
> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> > needed as userspace could have done it, with exactly the restrictions
> > it wants.  Google's NaCl comes to mind as a potential user.
> 
> That's not entirely true.  ptrace supervisors are subject to races and
> always fail open.  This makes them effective but not as robust as a
> seccomp solution can provide.

What races do you know about?

I'm not aware of any ptrace races if it's used properly.  I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.

Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.

> With seccomp, it fails closed.  What I think would make sense would be
> to add a user-controllable failure mode with seccomp bpf that calls
> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
> works quite well, but I didn't want to conflate the discussions.

I think it's a nice idea.  While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)

(I think it's ok to call the tracehook function on all archs though.)

> Using ptrace() would also mean that all consumers of this interface
> would need a supervisor, but with seccomp, the filters are installed
> and require no supervisors to stick around for when failure occurs.
> 
> Does that make sense?

It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.

On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.

For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access.  I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well.  And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:57           ` Jamie Lokier
@ 2012-01-12 18:03             ` Will Drewry
  2012-01-13  1:34               ` Jamie Lokier
  2012-01-13  2:44             ` Indan Zupancic
  2012-01-13  6:33             ` Chris Evans
  2 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 18:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:57 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace.  Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail.  This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on.  This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below.  (Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be really nice.  Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.
>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants.  Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true.  ptrace supervisors are subject to races and
>> always fail open.  This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?

I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.

> I'm not aware of any ptrace races if it's used properly.  I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

Exactly.  Security systems that, on failure, allow the action to
proceed can't be relied on.

> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.

Not a problem - BPF only allows register inspection. No TOCTOU attacks
need apply :D
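
To make that concrete, here is a rough sketch of what a register-only
check amounts to.  It is plain C rather than a real BPF program, and
struct regs_snapshot, filter_allows() and the SKETCH_NR_* constants
are all made up for illustration (the numbers happen to be the x86-64
ones); the patch itself evaluates the filter over the task's register
state.

```c
#include <stdint.h>

/* Hypothetical snapshot of what the filter is handed at syscall entry. */
struct regs_snapshot {
	uint64_t nr;		/* system call number */
	uint64_t args[6];	/* argument registers, as raw values */
};

/* x86-64 syscall numbers, used only to make the sketch concrete. */
#define SKETCH_NR_WRITE	1
#define SKETCH_NR_EXIT	60

/*
 * Returns 1 to allow the call and 0 to deny it.  Every decision is a
 * comparison on register values.  args[1] of write() is a user pointer,
 * but it is only ever a number here and is never dereferenced, so there
 * is no user memory for another thread to change after the check.
 */
int filter_allows(const struct regs_snapshot *r)
{
	switch (r->nr) {
	case SKETCH_NR_EXIT:
		return 1;
	case SKETCH_NR_WRITE:
		/* args[0] is the fd: allow stdout and stderr only */
		return r->args[0] == 1 || r->args[0] == 2;
	default:
		return 0;	/* anything not listed is denied */
	}
}
```

The flip side is that a check like this cannot make decisions based on
path strings or other pointed-to data at all.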

>> With seccomp, it fails closed.  What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> I think it's a nice idea.  While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access.  I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well.  And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.

Cool - if the rest of this discussion proceeds, then hopefully, we can
move towards discussing if tying it with ptrace is a good idea or a
horrible one :)

thanks!

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 23:19       ` [PATCH v2 " Will Drewry
  2012-01-12  0:29         ` Will Drewry
@ 2012-01-12 18:16         ` Randy Dunlap
  2012-01-12 17:23           ` Will Drewry
  1 sibling, 1 reply; 235+ messages in thread
From: Randy Dunlap @ 2012-01-12 18:16 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet

On 01/11/2012 03:19 PM, Will Drewry wrote:
> Document how system call filtering with BPF works and
> may be used.  Includes an example for x86 (32-bit).

Please tell some of us what "BPF" means.  Wikipedia lists 15 possible
choices, but I don't know which one to choose.

> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
>  samples/Makefile                       |    2 +-
>  samples/seccomp/Makefile               |   12 ++++
>  samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
>  4 files changed, 186 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>  create mode 100644 samples/seccomp/Makefile
>  create mode 100644 samples/seccomp/bpf-example.c


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:09         ` Linus Torvalds
  2012-01-12 17:17           ` Steven Rostedt
@ 2012-01-12 18:18           ` Andrew Lutomirski
  2012-01-12 18:32             ` Linus Torvalds
  1 sibling, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 9:09 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> In that case, just have execv fail if filtering is enabled and we are
>> execing a setuid program. But I don't see why non "magical" execv's
>> should be prohibited.
>
> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.
>
> For example, somebody just asked me the other day why "chroot()"
> requires admin privileges, since it would be good to limit even
> non-root things.
>
> And it's really the exact same issue as filtering: in some sense,
> chroot() "filters" FS name lookups, and can be used to fool programs
> that are written to be secure.
>
> We could easily introduce a per-process flag that just says "cannot
> escalate privileges". Which basically just disables execve() of
> suid/sgid programs (and possibly other things too), and locks the
> process to the current privileges. And then make the rule be that *if*
> that flag is set, you can then filter across an execve, or chroot as a
> normal user, or whatever.

Like this?

http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

(This depends on execve_nosecurity, which is controversial, but that
dependency would be trivial to remove.)

Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything.  (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)

--Andy

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:18           ` Andrew Lutomirski
@ 2012-01-12 18:32             ` Linus Torvalds
  2012-01-12 18:44               ` Andrew Lutomirski
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-12 18:32 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Steven Rostedt, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> Like this?
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.

> Note that there's a huge can of worms if execve is allowed but
> suid/sgid is not: selinux may elevate privileges on exec of pretty
> much anything.  (I think that this is a really awful idea, but it's in
> the kernel, so we're stuck with it.)

You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..

I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.

So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
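
For reference, a flag along these lines later went into mainline
(Linux 3.5) under the name PR_SET_NO_NEW_PRIVS rather than
PR_RESTRICT_ME.  A minimal sketch of a task marking itself that way,
assuming a kernel new enough to know the flag; restrict_self() and
is_restricted() are names made up for this example, and the fallback
defines are only there for older headers.

```c
#include <sys/prctl.h>

/* Values from <linux/prctl.h>; defined here only for older headers. */
#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef PR_GET_NO_NEW_PRIVS
#define PR_GET_NO_NEW_PRIVS 39
#endif

/*
 * One-way switch: once set, it cannot be cleared, it is inherited
 * across fork, and it is kept across execve.  A later execve of a
 * setuid/setgid binary still runs, but grants no extra privileges.
 * Returns 0 on success, -1 on error.
 */
int restrict_self(void)
{
	return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns 1 if the flag is set, 0 if it is not, -1 on error. */
int is_restricted(void)
{
	return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

A task would call restrict_self() before attaching a filter, so that
the filter can safely stay in place across execve.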

                            Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:32             ` Linus Torvalds
@ 2012-01-12 18:44               ` Andrew Lutomirski
  2012-01-12 19:08                 ` Kyle Moffett
  2012-01-12 19:40                 ` Will Drewry
  0 siblings, 2 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>
>> Like this?
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>
> I don't know the execve_nosecurity patches, so the diff makes little
> sense to me, but yeah, I wouldn't expect it to be more than a couple
> of lines. Exactly *how* you set the bit etc is not something I care
> deeply about, prctl seems about as good as anything.
>
>> Note that there's a huge can of worms if execve is allowed but
>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>> much anything.  (I think that this is a really awful idea, but it's in
>> the kernel, so we're stuck with it.)
>
> You can do any amount of crazy things with selinux, but the other side
> of the coin is that it would also be trivial to teach selinux about
> this same "restricted environment" bit, and just say that a process
> with that bit set doesn't get to match whatever selinux privilege
> escalation rules..
>
> I really don't think this is just about "execve cannot do setuid". I
> think it's about the process being marked as restricted.
>
> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
> artificially limit it to be some "execve feature", and more think of
> it as a "this is a process that has *no* extra privileges at all, and
> can never get them".

Fair enough.  I'll submit the simpler patch tonight.

execve_nosecurity was my attempt to sidestep selinux issues.  It's a
different syscall that does all of the non-security-related things
that execve does but does not escalate (or even change) any
privileges.  Maybe I'll try to rework that for newer kernels as well.
The idea is that programs that expect to run in sandboxes / chroots /
namespaces / whatever can use it, and older programs that might
malfunction dangerously if the semantics of execve change will just
fail instead.

--Andy

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:44               ` Andrew Lutomirski
@ 2012-01-12 19:08                 ` Kyle Moffett
  2012-01-12 23:05                   ` Eric Paris
  2012-01-12 19:40                 ` Will Drewry
  1 sibling, 1 reply; 235+ messages in thread
From: Kyle Moffett @ 2012-01-12 19:08 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, jmorris, scarybeasts, avi, penberg, viro, mingo,
	akpm, khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <luto@mit.edu> wrote:
> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>> Like this?
>>>
>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>
>> I don't know the execve_nosecurity patches, so the diff makes little
>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>> of lines. Exactly *how* you set the bit etc is not something I care
>> deeply about, prctl seems about as good as anything.
>>
>>> Note that there's a huge can of worms if execve is allowed but
>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>> much anything.  (I think that this is a really awful idea, but it's in
>>> the kernel, so we're stuck with it.)
>>
>> You can do any amount of crazy things with selinux, but the other side
>> of the coin is that it would also be trivial to teach selinux about
>> this same "restricted environment" bit, and just say that a process
>> with that bit set doesn't get to match whatever selinux privilege
>> escalation rules..
>>
>> I really don't think this is just about "execve cannot do setuid". I
>> think it's about the process being marked as restricted.
>>
>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>> artificially limit it to be some "execve feature", and more think of
>> it as a "this is a process that has *no* extra privileges at all, and
>> can never get them".
>
> execve_nosecurity was my attempt to sidestep selinux issues.  It's a
> different syscall that does all of the non-security-related things
> that execve does but does not escalate (or even change) any
> privileges.  Maybe I'll try to rework that for newer kernels as well.
> The idea is that programs that expect to run in sandboxes / chroots /
> namespaces / whatever can use it, and older programs that might
> malfunction dangerously if the semantics of execve change will just
> fail instead.

I don't see any issues with SELinux support for this feature.

Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition IE: security label change).

But if that fails in many cases it may still be allowed to
"execute_no_trans" (IE: regular non-privileged exec() without a
transition).

If you add this feature, it should just disable the normal "execute"
with transition path and unconditionally fall back to
"execute_no_trans".

Likewise, enabling these bits should also disable the "transition" and
"dyntransition" process access vectors, and I'm on the fence about
whether "setfscreate", etc should be allowed.

Cheers,
Kyle Moffett

-- 
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:44               ` Andrew Lutomirski
  2012-01-12 19:08                 ` Kyle Moffett
@ 2012-01-12 19:40                 ` Will Drewry
  2012-01-12 19:42                   ` Will Drewry
  2012-01-12 19:46                   ` Andrew Lutomirski
  1 sibling, 2 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 19:40 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 12:44 PM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>>
>>> Like this?
>>>
>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>
>> I don't know the execve_nosecurity patches, so the diff makes little
>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>> of lines. Exactly *how* you set the bit etc is not something I care
>> deeply about, prctl seems about as good as anything.
>>
>>> Note that there's a huge can of worms if execve is allowed but
>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>> much anything.  (I think that this is a really awful idea, but it's in
>>> the kernel, so we're stuck with it.)
>>
>> You can do any amount of crazy things with selinux, but the other side
>> of the coin is that it would also be trivial to teach selinux about
>> this same "restricted environment" bit, and just say that a process
>> with that bit set doesn't get to match whatever selinux privilege
>> escalation rules..
>>
>> I really don't think this is just about "execve cannot do setuid". I
>> think it's about the process being marked as restricted.
>>
>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>> artificially limit it to be some "execve feature", and more think of
>> it as a "this is a process that has *no* extra privileges at all, and
>> can never get them".
>
> Fair enough.  I'll submit the simpler patch tonight.

This sounds cool.  Do you think you'll go for a new task_struct member
or will it be a securebit?  (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).

I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.

thanks!
will

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:40                 ` Will Drewry
@ 2012-01-12 19:42                   ` Will Drewry
  2012-01-12 19:46                   ` Andrew Lutomirski
  1 sibling, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-12 19:42 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 1:40 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 12:44 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>>>
>>>> Like this?
>>>>
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>>
>>> I don't know the execve_nosecurity patches, so the diff makes little
>>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>>> of lines. Exactly *how* you set the bit etc is not something I care
>>> deeply about, prctl seems about as good as anything.
>>>
>>>> Note that there's a huge can of worms if execve is allowed but
>>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>>> much anything.  (I think that this is a really awful idea, but it's in
>>>> the kernel, so we're stuck with it.)
>>>
>>> You can do any amount of crazy things with selinux, but the other side
>>> of the coin is that it would also be trivial to teach selinux about
>>> this same "restricted environment" bit, and just say that a process
>>> with that bit set doesn't get to match whatever selinux privilege
>>> escalation rules..
>>>
>>> I really don't think this is just about "execve cannot do setuid". I
>>> think it's about the process being marked as restricted.
>>>
>>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>>> artificially limit it to be some "execve feature", and more think of
>>> it as a "this is a process that has *no* extra privileges at all, and
>>> can never get them".
>>
>> Fair enough.  I'll submit the simpler patch tonight.
>
> This sounds cool.  Do you think you'll go for a new task_struct member
> or will it be a securebit?  (Seems like securebits might be too tied to
> posix file caps, but I figured I'd ask).

Or cred member, etc.

> I'm planning on going ahead and mocking up your potential patch so I
> can respin this series using it and make sure I understand the
> interactions.
>
> thanks!
> will

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:40                 ` Will Drewry
  2012-01-12 19:42                   ` Will Drewry
@ 2012-01-12 19:46                   ` Andrew Lutomirski
  2012-01-12 20:00                     ` Linus Torvalds
  1 sibling, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 19:46 UTC (permalink / raw)
  To: Will Drewry
  Cc: Linus Torvalds, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:40 AM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 12:44 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>>>
>>>> Like this?
>>>>
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>>
>>> I don't know the execve_nosecurity patches, so the diff makes little
>>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>>> of lines. Exactly *how* you set the bit etc is not something I care
>>> deeply about, prctl seems about as good as anything.
>>>
>>>> Note that there's a huge can of worms if execve is allowed but
>>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>>> much anything.  (I think that this is a really awful idea, but it's in
>>>> the kernel, so we're stuck with it.)
>>>
>>> You can do any amount of crazy things with selinux, but the other side
>>> of the coin is that it would also be trivial to teach selinux about
>>> this same "restricted environment" bit, and just say that a process
>>> with that bit set doesn't get to match whatever selinux privilege
>>> escalation rules..
>>>
>>> I really don't think this is just about "execve cannot do setuid". I
>>> think it's about the process being marked as restricted.
>>>
>>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>>> artificially limit it to be some "execve feature", and more think of
>>> it as a "this is a process that has *no* extra privileges at all, and
>>> can never get them".
>>
>> Fair enough.  I'll submit the simpler patch tonight.
>
> This sounds cool.  Do you think you'll go for a new task_struct member
> or will it be a securebit?  (Seems like securebits might be too tied to
> posix file caps, but I figured I'd ask).
>
> I'm planning on going ahead and mocking up your potential patch so I
> can respin this series using it and make sure I understand the
> interactions.

I think securebits and cred didn't exist the first time I did this,
and sticking it in struct cred might unnecessarily prevent sharing
cred (assuming that even happens).  So I'd say task_struct.

--Andy

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:46                   ` Andrew Lutomirski
@ 2012-01-12 20:00                     ` Linus Torvalds
  0 siblings, 0 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-12 20:00 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:46 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> I think securebits and cred didn't exist the first time I did this,
> and sticking it in struct cred might unnecessarily prevent sharing
> cred (assuming that even happens).  So I'd say task_struct.

I think it almost has to be task state, since we very much want to
make sure it's trivial to see that nothing ever clears that bit, and
that it always gets copied right over a fork/exec/whatever.

Putting it in some cred or capability bit or something would make that
kind of transparency pretty much totally impossible.
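
To make the "nothing ever clears that bit" property concrete, here is a
minimal userspace sketch. It is a model, not kernel code: struct task,
task_restrict(), task_fork() and task_exec_setuid() are hypothetical names
invented for illustration. The point it tries to show is that with the bit
in plain task state there is exactly one writer, and that writer only sets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of per-task state; NOT the kernel's task_struct. */
struct task {
	bool restricted;	/* one-way "no extra privileges" bit */
	unsigned int euid;	/* effective uid of the modelled task */
};

/* The only writer of the bit: it can set the bit, it can never clear it. */
static void task_restrict(struct task *t)
{
	assert(t != NULL);
	t->restricted = true;
}

/* Model of fork: the child starts as a copy, so the bit is inherited. */
static struct task task_fork(const struct task *parent)
{
	assert(parent != NULL);
	return *parent;
}

/*
 * Model of exec of a setuid binary owned by file_uid.  An unrestricted
 * task takes the file owner's uid; a restricted task keeps its own.
 * The bit itself is left untouched either way.
 */
static void task_exec_setuid(struct task *t, unsigned int file_uid)
{
	assert(t != NULL);
	if (!t->restricted)
		t->euid = file_uid;
}
```

Auditing this model means reading three short functions; only
task_restrict() assigns to the bit, and no code path assigns false.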

                 Linus

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:40           ` Steven Rostedt
  2012-01-12 17:44             ` Jamie Lokier
@ 2012-01-12 22:18             ` Will Drewry
  2012-01-12 23:00               ` Andrew Lutomirski
  1 sibling, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-12 22:18 UTC (permalink / raw)
  To: Steven Rostedt, Alan Cox
  Cc: Jamie Lokier, Oleg Nesterov, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:40 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>
>> You can do this now, using ptrace().  It's horrible, but half of the
>> horribleness is needing to understand machine-dependent registers,
>> which this new patch doesn't address.  (The other half is a ton of
>> undocumented but important ptrace() behaviours on Linux.)
>
> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p
>
> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say it's OK, and then switching the symlink to a setuid file.
>
> When the new execv executed, the parent process would lose all control
> over it. The idea is to prevent this.
>
> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

In the ptrace case, doesn't it just downgrade the privileges of the new process
if there is a tracer, rather than detach the tracer?

Ignoring that, I've been looking at system call filters as being equivalent to
something like the caps bounding set.  Once reduced, there's no going
back. I think Linus's proposal perfectly resolves the policy decision around
suid execution behavior in the run-with-privs or not scenarios (just like with
how ptrace does it).  However, I'd like to avoid allowing any process to
escape system call filters once installed.  (It's doable to add
suid/caps-based-bypass, but it's certainly not ideal from my perspective.)
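
As a rough illustration of the "once reduced, there's no going back"
property, the sketch below models filters as plain C predicates rather than
BPF programs. Everything in it (struct filter_set, filter_attach(),
syscall_allowed(), the two example policies and their syscall numbers) is
made up for the example and is not the interface from the patch series. A
call is allowed only if every attached filter allows it, and there is no
detach operation, so the allowed set can only shrink.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILTERS 8

/* A filter is modelled as a predicate over the syscall number. */
typedef bool (*syscall_filter)(int nr);

/* Hypothetical per-task filter list; deliberately has no removal operation. */
struct filter_set {
	syscall_filter filters[MAX_FILTERS];
	size_t count;
};

/* Attach one more filter.  Returns false if the list is full. */
static bool filter_attach(struct filter_set *fs, syscall_filter f)
{
	assert(fs != NULL && f != NULL);
	if (fs->count == MAX_FILTERS)
		return false;
	fs->filters[fs->count++] = f;
	return true;
}

/* A syscall is allowed only if every attached filter allows it. */
static bool syscall_allowed(const struct filter_set *fs, int nr)
{
	assert(fs != NULL);
	for (size_t i = 0; i < fs->count; i++)
		if (!fs->filters[i](nr))
			return false;
	return true;
}

/* Two example policies using arbitrary, made-up syscall numbers. */
static bool allow_below_100(int nr) { return nr < 100; }
static bool deny_59(int nr) { return nr != 59; }
```

Because the decision is a logical AND over all attached filters, attaching
a permissive filter later cannot re-enable a call that an earlier filter
rejected.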

cheers,
will

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 22:18             ` Will Drewry
@ 2012-01-12 23:00               ` Andrew Lutomirski
  0 siblings, 0 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 23:00 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, Alan Cox, Jamie Lokier, Oleg Nesterov,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 2:18 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 11:40 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>>
>>> You can do this now, using ptrace().  It's horrible, but half of the
>>> horribleness is needing to understand machine-dependent registers,
>>> which this new patch doesn't address.  (The other half is a ton of
>>> undocumented but important ptrace() behaviours on Linux.)
>>
>> Yeah I know the horrid use of ptrace, I've implemented programs that use
>> it :-p
>>
>> I guess ptrace can capture the execv and determine if it is OK or not to
>> run it. But again, this doesn't stop the possible attacks that could
>> happen, with having the execv on a symlink file, having the ptrace check
>> say it's OK, and then switching the symlink to a setuid file.
>>
>> When the new execv executed, the parent process would lose all control
>> over it. The idea is to prevent this.
>>
>> I like Alan's suggestion. Have userspace decide to allow execv or not,
>> and even let it decide if it should allow setuid execv's or not, but
>> still allow non-setuid execvs. If you allow the setuid execv, once that
>> happens, the same behavior will occur as with ptrace. A setuid execv
>> will lose all its filtering.
>
> In the ptrace case, doesn't it just downgrade the privileges of the new process
> if there is a tracer, rather than detach the tracer?
>
> Ignoring that, I've been looking at system call filters as being equivalent to
> something like the caps bounding set.  Once reduced, there's no going
> back. I think Linus's proposal perfectly resolves the policy decision around
> suid execution behavior in the run-with-privs or not scenarios (just like with
> how ptrace does it).  However, I'd like to avoid allowing any process to
> escape system call filters once installed.  (It's doable to add
> suid/caps-based-bypass, but it's certainly not ideal from my perspective.)

I agree.

In principle, it could be safe for an outside (non-seccomp) process
with appropriate credentials to lift seccomp restrictions from a
different process.  But why?

--Andy

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:08                 ` Kyle Moffett
@ 2012-01-12 23:05                   ` Eric Paris
  2012-01-12 23:33                     ` Andrew Lutomirski
  0 siblings, 1 reply; 235+ messages in thread
From: Eric Paris @ 2012-01-12 23:05 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Andrew Lutomirski, Linus Torvalds, Steven Rostedt, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, djm, segoon, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, oleg, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 14:08 -0500, Kyle Moffett wrote:
> On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <luto@mit.edu> wrote:
> > On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >> You can do any amount of crazy things with selinux, but the other side
> >> of the coin is that it would also be trivial to teach selinux about
> >> this same "restricted environment" bit, and just say that a process
> >> with that bit set doesn't get to match whatever selinux privilege
> >> escalation rules..

> I don't see any issues with SELinux support for this feature.
> 
> Specifically, when you try to execute something in SELinux, it will
> first look at the types and try to "execute" (involving a type
> transition IE: security label change).
> 
> But if that fails in many cases it may still be allowed to
> "execute_no_trans" (IE: regular non-privileged exec() without a
> transition).

That's not true.  See specifically
security/selinux/hooks.c::selinux_bprm_set_creds()  We calculate a label
for the new task (that may or may not be the same) and then check if
there is permission to run the new binary with the new label.  There is
no fallback.

The exception would be if the binary is on a MNT_NOSUID mount point, in
which case we calculate the new label, then just revert to the same
label.

At first glance it looks to me like a reasonable way to implement this
would be to do the new checks right next to any place we
already do MNT_NOSUID checks and mimic their behavior.  If there are
other priv escalation points in the kernel we might need to consider if
MNT_NOSUID is adequate....

-Eric


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:56               ` Steven Rostedt
@ 2012-01-12 23:27                 ` Alan Cox
  2012-01-12 23:38                   ` Linus Torvalds
  0 siblings, 1 reply; 235+ messages in thread
From: Alan Cox @ 2012-01-12 23:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jamie Lokier, Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

> Thus, execv will not be a "special" case here. Seccomp either allows it
> or not. But also add a command to tell seccomp that this task will not
> be allowed to do anything privileged.

A setuid binary is not necessarily privileged - indeed a root -> user
transition via setuid is pretty much the reverse.

It's a change of user context. Things like ptrace and file permissions
basically mean you can't build a barrier between stuff running as the
same uid to a great extent except with heavy restricting, but saying
"you can't become someone else" is very useful.

Alan

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 23:05                   ` Eric Paris
@ 2012-01-12 23:33                     ` Andrew Lutomirski
  0 siblings, 0 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 23:33 UTC (permalink / raw)
  To: Eric Paris
  Cc: Kyle Moffett, Linus Torvalds, Steven Rostedt, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, djm, segoon, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, oleg, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 3:05 PM, Eric Paris <eparis@redhat.com> wrote:
> On Thu, 2012-01-12 at 14:08 -0500, Kyle Moffett wrote:
>> On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <luto@mit.edu> wrote:
>> > On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> >> You can do any amount of crazy things with selinux, but the other side
>> >> of the coin is that it would also be trivial to teach selinux about
>> >> this same "restricted environment" bit, and just say that a process
>> >> with that bit set doesn't get to match whatever selinux privilege
>> >> escalation rules..
>
>> I don't see any issues with SELinux support for this feature.
>>
>> Specifically, when you try to execute something in SELinux, it will
>> first look at the types and try to "execute" (involving a type
>> transition IE: security label change).
>>
>> But if that fails in many cases it may still be allowed to
>> "execute_no_trans" (IE: regular non-privileged exec() without a
>> transition).
>
> That's not true.  See specifically
> security/selinux/hooks.c::selinux_bprm_set_creds()  We calculate a label
> for the new task (that may or may not be the same) and then check if
> there is permission to run the new binary with the new label.  There is
> no fallback.
>
> The exception would be if the binary is on a MNT_NOSUID mount point, in
> which case we calculate the new label, then just revert to the same
> label.
>
> At first glance it looks to me like a reasonable way to implement this
> would be to do the new checks right next to any place we
> already do MNT_NOSUID checks and mimic their behavior.  If there are
> other priv escalation points in the kernel we might need to consider if
> MNT_NOSUID is adequate....
>

I don't really like the current logic.  It does:

        if (old_tsec->exec_sid) {
                new_tsec->sid = old_tsec->exec_sid;
                /* Reset exec SID on execve. */
                new_tsec->exec_sid = 0;
        } else {
                /* Check for a default transition on this program. */
                rc = security_transition_sid(old_tsec->sid, isec->sid,
                                             SECCLASS_PROCESS, NULL,
                                             &new_tsec->sid);
                if (rc)
                        return rc;
        }

        COMMON_AUDIT_DATA_INIT(&ad, PATH);
        ad.u.path = bprm->file->f_path;

        if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
                new_tsec->sid = old_tsec->sid;

which means that, if MNT_NOSUID, then exec_sid is silently ignored.
I'd rather fail in that case, but it's probably too late for that.
However, if we set the "no new privileges" flag, then we could fail,
since there's no old ABI to be compatible with.  I'll implement it
that way.
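
One possible reading of that plan, as a standalone userspace sketch rather
than a patch: struct exec_ctx, its fields and compute_exec_sid() are
hypothetical stand-ins for the values the hook above works with, and the
exact failure condition (any label change while the flag is set) is an
assumption, not something settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the inputs of the exec label decision. */
struct exec_ctx {
	uint32_t old_sid;		/* label of the task before exec */
	uint32_t exec_sid;		/* explicitly requested exec label, 0 if none */
	uint32_t transition_sid;	/* label a default transition would yield */
	bool mnt_nosuid;		/* binary lives on a MNT_NOSUID mount */
	bool no_new_privs;		/* task has set the proposed flag */
};

/*
 * Returns 0 and stores the label to run with, or -EPERM.
 * Legacy behaviour is preserved: MNT_NOSUID silently keeps the old label.
 * With the new flag set, an exec that would change the label fails
 * instead (assumed semantics for this sketch).
 */
static int compute_exec_sid(const struct exec_ctx *c, uint32_t *new_sid)
{
	uint32_t wanted;

	assert(c != NULL && new_sid != NULL);
	wanted = c->exec_sid ? c->exec_sid : c->transition_sid;

	if (wanted != c->old_sid) {
		if (c->no_new_privs)
			return -EPERM;		/* new ABI: fail loudly */
		if (c->mnt_nosuid)
			wanted = c->old_sid;	/* old ABI: silent revert */
	}
	*new_sid = wanted;
	return 0;
}
```

Checking the flag before the MNT_NOSUID revert is the design choice here:
a task that opted in gets an error where an unflagged task on a nosuid
mount would quietly keep its old label.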

--Andy

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 23:27                 ` Alan Cox
@ 2012-01-12 23:38                   ` Linus Torvalds
  0 siblings, 0 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-12 23:38 UTC (permalink / raw)
  To: Alan Cox
  Cc: Steven Rostedt, Jamie Lokier, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, jmorris, scarybeasts, avi, penberg,
	viro, luto, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 3:27 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> It's a change of user context. Things like ptrace and file permissions
> basically mean you can't build a barrier between stuff running as the
> same uid to a great extent except with heavy restricting, but saying
> "you can't become someone else" is very useful.

Not just "someone else".

The guarantee basically has to be "you can't change your security
context". Where "become somebody else" is part of it, but any
capability changes etc would be part of it too. So it should disable
all games with capabilities etc.

And I don't think selinux really should be all that much of a problem
- we should just make sure that selinux would honor such a bit, and
refuse to do any op that would change any selinux capabilities either.
Same goes for other security models.

And that may include restricting the ways a binary can be executed
totally outside of suid/sgid bits. For example, if you consider
binaries under /home to have different selinux rules than system
binaries in /usr/bin, then a cross-execute from one to the other may
not work, regardless of whether it's suid or not.

I think that is the kind of guarantee a sandbox environment really
wants: "I'm setting up a sandbox, you'd better not change the
permissions on me regardless of what crazy things I do".

                       Linus

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:18   ` Alan Cox
  2012-01-12 17:03     ` Will Drewry
@ 2012-01-13  1:31     ` James Morris
  1 sibling, 0 replies; 235+ messages in thread
From: James Morris @ 2012-01-13  1:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, 12 Jan 2012, Alan Cox wrote:

> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.

Yes, I can see potential to vastly simplify MAC policy in some cases.


- James
-- 
James Morris
<jmorris@namei.org>

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:03             ` Will Drewry
@ 2012-01-13  1:34               ` Jamie Lokier
  0 siblings, 0 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-13  1:34 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Will Drewry wrote:
> >> > There's already a security model around who can use ptrace(); speeding
> >> > it up needn't break that.
> >> >
> >> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> >> > needed as userspace could have done it, with exactly the restrictions
> >> > it wants.  Google's NaCl comes to mind as a potential user.
> >>
> >> That's not entirely true.  ptrace supervisors are subject to races and
> >> always fail open.  This makes them effective but not as robust as a
> >> seccomp solution can provide.
> >
> > What races do you know about?
> 
> I'm pretty sure that if you have two "isolated" processes, they could
> cause irregular behavior using signals.

Do you have an example?  I'm not aware of one and I've been studying
ptrace quite a bit lately.  If there's a race (other than temporary
kernel bugs with all the ptrace patching lately ;-), I would like to
know and maybe patch it.

The only signal confusion when ptracing syscalls I'm aware of is with
SIGTRAP, and that was fixed in 2.5.46, long, long ago (PTRACE_SETOPTIONS).

> > I'm not aware of any ptrace races if it's used properly.  I'm also not
> > sure what you mean by fail open/close here, unless you mean the target
> > process gets to carry on if the tracing process dies.
> 
> Exactly.  Security systems that, on failure, allow the action to
> proceed can't be relied on.

That's fair enough.  There are numerous occasions when ptracer death
should kill the tracee anyway regardless of security.  E.g. "strace
command..." and strace dies, you'd normally want the command to
be killed as well.  So that could be worth a ptrace option anyway.

Thanks,
-- Jamie


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:57           ` Jamie Lokier
  2012-01-12 18:03             ` Will Drewry
@ 2012-01-13  2:44             ` Indan Zupancic
  2012-01-13  6:33             ` Chris Evans
  2 siblings, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-13  2:44 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Will Drewry, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds

Hello,

I think execve should be allowed and follow the same rules as execve under ptrace.

On Thu, January 12, 2012 18:57, Jamie Lokier wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace.  Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail.  This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on.  This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below.  (Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be really nice.  Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.
>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants.  Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true.  ptrace supervisors are subject to races and
>> always fail open.  This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?
>
> I'm not aware of any ptrace races if it's used properly.  I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

That one could be easily fixed with a new ptrace option.

The tracer can kill all traced tasks before it dies, except when it is
killed by SIGKILL. In that case another observer task could kill all the
traced tasks, but that is just moving the problem around.

> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.

I have implemented a ptrace based jailer which avoids these kinds of
races by copying such strings to read-only memory before the system call
is allowed to proceed. The only races that can't be closed with ptrace are
symlink races, and then only with an attacker outside the jail.
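
The copy-before-check idea can be pictured with a toy model. This is only a
sketch under stated assumptions, not code from the (unreleased) jailer: a
bytearray stands in for writable tracee memory, check_path() is a made-up
policy, and the "attacker" callback plays the role of a second thread that
rewrites the argument between the check and the use.

```python
# Toy model of the check-then-use race on syscall string arguments.
# "tracee_mem" stands in for writable memory in the traced process;
# check_path() is a hypothetical policy, not taken from any real jailer.

def check_path(path: bytes) -> bool:
    """Allow only paths under /jail/ (illustrative policy)."""
    return path.startswith(b"/jail/")

def racy_open(tracee_mem: bytearray, attacker) -> bytes:
    """Check the live buffer, then use it after another thread has run."""
    if not check_path(bytes(tracee_mem)):
        raise PermissionError("denied")
    attacker(tracee_mem)        # second thread rewrites the argument
    return bytes(tracee_mem)    # the "syscall" sees the new contents

def snapshot_open(tracee_mem: bytearray, attacker) -> bytes:
    """Copy first, check the copy, and use only the copy afterwards."""
    frozen = bytes(tracee_mem)  # immutable snapshot, like read-only memory
    if not check_path(frozen):
        raise PermissionError("denied")
    attacker(tracee_mem)        # rewriting the original no longer matters
    return frozen

def swap_to_shadow(buf: bytearray) -> None:
    """Hypothetical attacker: replace the path after it was checked."""
    buf[:] = b"/etc/shadow"
```

In this model racy_open() ends up using the attacker's path while
snapshot_open() keeps the path that was actually checked.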

And the architectural differences in registers are easily abstracted away
when you're only interested in system call arguments and the instruction
pointer. The system call table information is more annoying, but unavoidable.

Our jailer is around 5k lines of code and supports checking file paths, PIDs,
FDs, SYSV IPC and has limited networking support (no incoming peer address
filtering), all race free. The idea is transparent jailing of complex tasks
with minimal configuration (everything is contained within the jail, access
to anything else needs explicit permission). It has been more or less finished
for a few years now, but everyone is busy with other things and no one got around
to releasing the code. :-/

It would be nice to avoid the ptrace overhead for system calls that are always
allowed or always denied, so I hope this BPF filtering can be made to work in
conjunction with ptrace so that the tracer only has to handle system calls not
handled by the BPF filter. One way to achieve that is to have a way for the BPF
filter to let a system call generate ptrace system call events or not, with a
new ptrace option PTRACE_UNHANDLED_SYSCALL or something like that to ask for
the unhandled system calls events.
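
To make the proposed split concrete, here is a rough userspace model. It is a
sketch only: the Verdict names, the STATIC_POLICY table and the tracer
callback are all hypothetical, and nothing here corresponds to an existing
kernel or ptrace interface (PTRACE_UNHANDLED_SYSCALL above is just a suggested
name). The filter settles the always-allowed and always-denied calls itself
and only the remainder costs a round trip to the tracer.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"   # filter lets the call through, tracer not involved
    DENY = "deny"     # filter rejects the call, tracer not involved
    TRACE = "trace"   # filter has no opinion, hand the call to the tracer

# Hypothetical static policy; anything not listed is deferred to the tracer.
STATIC_POLICY = {
    "read": Verdict.ALLOW,
    "write": Verdict.ALLOW,
    "ptrace": Verdict.DENY,
}

def filter_verdict(syscall: str) -> Verdict:
    """Stand-in for the BPF filter: a cheap table lookup per syscall."""
    return STATIC_POLICY.get(syscall, Verdict.TRACE)

def run(syscalls, tracer):
    """Return (per-call results, number of tracer round trips)."""
    results = []
    stops = 0
    for name in syscalls:
        verdict = filter_verdict(name)
        if verdict is Verdict.TRACE:
            stops += 1  # only unhandled calls pay the ptrace cost
            verdict = Verdict.ALLOW if tracer(name) else Verdict.DENY
        results.append((name, verdict))
    return results, stops
```

With a tracer that only approves "open", the sequence read, open, write,
ptrace needs a single tracer stop in this model.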

Greetings,

Indan




* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:57           ` Jamie Lokier
  2012-01-12 18:03             ` Will Drewry
  2012-01-13  2:44             ` Indan Zupancic
@ 2012-01-13  6:33             ` Chris Evans
  2 siblings, 0 replies; 235+ messages in thread
From: Chris Evans @ 2012-01-13  6:33 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Will Drewry, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 9:57 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace.  Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail.  This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on.  This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below.  (Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be really nice.  Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.
>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants.  Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true.  ptrace supervisors are subject to races and
>> always fail open.  This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?
>
> I'm not aware of any ptrace races if it's used properly.  I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

Yeah, that's one, and it's a pretty awful one when you consider
that the untrusted tracee can play games such as trying to get the
kernel to fire OOM SIGKILLs.

My memory is hazy but the last time I looked at this in detail there
were other racy areas:

- Bad problems if the tracee takes a SIGTSTP or (real) SIGCONT.
- Difficulty in stopping the syscall from executing once it has
started, especially if the tracer dies.


Cheers
Chris

>
> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.
>
>> With seccomp, it fails close.  What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> I think it's a nice idea.  While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access.  I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well.  And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.
>
> -- Jamie


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:51         ` Will Drewry
@ 2012-01-13 17:31           ` Oleg Nesterov
  2012-01-13 19:01             ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-13 17:31 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 01/12, Will Drewry wrote:
> >>
> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> >> +      */
> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
> >> >
> >> > Stupid question. I am sure you know what are you doing ;) and I know
> >> > nothing about !x86 arches.
> >> >
> >> > But could you explain why it is designed to use user_regs_struct ?
> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
> >>
> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
> >> (iirc), but on x86-64
> >> and others, that's not true.
> >
> > Yes sure, I meant that userspace should use pt_regs too.
> >
> >> If it would be appropriate to expose pt_regs to userspace, then I'd
> >> happily do so :)
> >
> > Ah, so that was the reason. But it is already exported? At least I see
> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
> >
> > Once again, I am not arguing, just trying to understand. And I do not
> > know if this definition is part of abi.
>
> I don't either :/  My original idea was to operate on task_pt_regs(current),
> but I noticed that PTRACE_GETREGS/SETREGS only uses the
> user_regs_struct. So I went that route.

Well, I don't know where user_regs_struct comes from initially. But
probably it is needed to allow access to the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.

> I'd love for pt_regs to be fair game to cut down on the copying!

Me too. I see no point in using user_regs_struct.

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 17:31           ` Oleg Nesterov
@ 2012-01-13 19:01             ` Will Drewry
  2012-01-13 23:10               ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-13 19:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/12, Will Drewry wrote:
>>
>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > On 01/12, Will Drewry wrote:
>> >>
>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >> >> +      */
>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>> >> >
>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>> >> > nothing about !x86 arches.
>> >> >
>> >> > But could you explain why it is designed to use user_regs_struct ?
>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>> >>
>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>> >> (iirc), but on x86-64
>> >> and others, that's not true.
>> >
>> > Yes sure, I meant that userspace should use pt_regs too.
>> >
>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>> >> happily do so :)
>> >
>> > Ah, so that was the reason. But it is already exported? At least I see
>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>> >
>> > Once again, I am not arguing, just trying to understand. And I do not
>> > know if this definition is part of abi.
>>
>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>> user_regs_struct. So I went that route.
>
> Well, I don't know where user_regs_struct come from initially. But
> probably it is needed to allow to access the "artificial" things like
> fs_base. Or perhaps this struct mimics the layout in the coredump.

Not sure - added Roland whose name was on many of the files :)

I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).

But I think the latter is guaranteed to have a certain layout while the macros
for PEEKUSR can do post-processing fixup.  (Which could be done in the
bpf evaluator load_pointer() helper if needed.)

>> I'd love for pt_regs to be fair game to cut down on the copying!
>
> Me too. I see no point in using user_regs_struct.

I'll rev the change to use pt_regs and drop all the helper code.  If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.


Thanks!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 19:01             ` Will Drewry
@ 2012-01-13 23:10               ` Will Drewry
  2012-01-13 23:12                 ` Will Drewry
                                   ` (3 more replies)
  0 siblings, 4 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-13 23:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen

On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/12, Will Drewry wrote:
>>>
>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> > On 01/12, Will Drewry wrote:
>>> >>
>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >> >> +      */
>>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>>> >> >
>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>> >> > nothing about !x86 arches.
>>> >> >
>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>> >>
>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>> >> (iirc), but on x86-64
>>> >> and others, that's not true.
>>> >
>>> > Yes sure, I meant that userspace should use pt_regs too.
>>> >
>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>> >> happily do so :)
>>> >
>>> > Ah, so that was the reason. But it is already exported? At least I see
>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>> >
>>> > Once again, I am not arguing, just trying to understand. And I do not
>>> > know if this definition is part of abi.
>>>
>>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>> user_regs_struct. So I went that route.
>>
>> Well, I don't know where user_regs_struct come from initially. But
>> probably it is needed to allow to access the "artificial" things like
>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>
> Not sure - added Roland whose name was on many of the files :)
>
> I just noticed that ptrace ABI allows pt_regs access using the register
> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>
> But I think the latter is guaranteed to have a certain layout while the macros
> for PEEKUSR can do post-processing fixup.  (Which could be done in the
> bpf evaluator load_pointer() helper if needed.)
>
>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>
>> Me too. I see no point in using user_regs_struct.
>
> I'll rev the change to use pt_regs and drop all the helper code.  If
> no one says otherwise, that certainly seems ideal from a performance
> perspective, and I see pt_regs exported to userland along with ptrace
> abi register offset macros.

On second thought, pt_regs is scary :)

From looking at
  http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall entry code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.

I'm not 100% sure that I am reading the code right, but it means that I can either
keep using user_regs_struct or fork the code behavior based on compat. That
would need to be arch dependent then which is pretty rough.
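
The indexing problem can be shown with plain byte buffers. This is a sketch
under assumptions: the register values are made up, the buffers are
little-endian, and the slot order is illustrative rather than the real
pt_regs or user_regs_struct field order. A filter that computes offsets for
4-byte register slots reads the wrong bytes once each slot is 8 bytes wide.

```python
import struct

# Made-up register values, in an illustrative (not real) slot order.
REGS = [0x11, 0x22, 0x33, 0x44]

# What a 32-bit program expects: 4 bytes per register.
layout32 = struct.pack("<4I", *REGS)
# What a 64-bit kernel would keep on its stack: 8 bytes per register.
layout64 = struct.pack("<4Q", *REGS)

def load_word(buf: bytes, offset: int) -> int:
    """Model of a 32-bit BPF load at a byte offset into the register data."""
    return struct.unpack_from("<I", buf, offset)[0]

def reg_offset32(index: int) -> int:
    """Offset a 32-bit program would compute for register slot 'index'."""
    return 4 * index

def reg_offset64(index: int) -> int:
    """Offset that is actually correct when every slot is 8 bytes wide."""
    return 8 * index
```

Reading slot 1 with the 32-bit offset gives the expected value from layout32
but lands in the upper half of slot 0 in layout64, which is why either a
fixed layout or a per-arch compat fixup is needed.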

Any thoughts?

I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
about the pt_regs
change yet.  If the performance boost is worth the effort of having a
per-arch fixup,
I can go that route.  Otherwise, I could look at some alternate approach for a
faster-than-regview payload.

Thanks!


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
@ 2012-01-13 23:12                 ` Will Drewry
  2012-01-13 23:30                 ` Eric Paris
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-13 23:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen

On Fri, Jan 13, 2012 at 5:10 PM, Will Drewry <wad@chromium.org> wrote:
> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
>> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/12, Will Drewry wrote:
>>>>
>>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> > On 01/12, Will Drewry wrote:
>>>> >>
>>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> >> >> +      */
>>>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>>>> >> >
>>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>>> >> > nothing about !x86 arches.
>>>> >> >
>>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>>> >>
>>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>>> >> (iirc), but on x86-64
>>>> >> and others, that's not true.
>>>> >
>>>> > Yes sure, I meant that userspace should use pt_regs too.
>>>> >
>>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>>> >> happily do so :)
>>>> >
>>>> > Ah, so that was the reason. But it is already exported? At least I see
>>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>>> >
>>>> > Once again, I am not arguing, just trying to understand. And I do not
>>>> > know if this definition is part of abi.
>>>>
>>>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>>> user_regs_struct. So I went that route.
>>>
>>> Well, I don't know where user_regs_struct come from initially. But
>>> probably it is needed to allow to access the "artificial" things like
>>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>>
>> Not sure - added Roland whose name was on many of the files :)
>>
>> I just noticed that ptrace ABI allows pt_regs access using the register
>> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>>
>> But I think the latter is guaranteed to have a certain layout while the macros
>> for PEEKUSR can do post-processing fixup.  (Which could be done in the
>> bpf evaluator load_pointer() helper if needed.)
>>
>>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>>
>>> Me too. I see no point in using user_regs_struct.
>>
>> I'll rev the change to use pt_regs and drop all the helper code.  If
>> no one says otherwise, that certainly seems ideal from a performance
>> perspective, and I see pt_regs exported to userland along with ptrace
>> abi register offset macros.
>
> On second thought, pt_regs is scary :)
>
> From looking at
>  http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> and ia32syscall entry code, it appears that for x86, at least, the
> pt_regs for compat processes will be 8 bytes wide per register on the
> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
> IA32_EMU, its filters will always index into pt_regs incorrectly.
>
> I'm not 100% sure that I am reading the code right, but it means that I can either
> keep using user_regs_struct or fork the code behavior based on compat. That
> would need to be arch dependent then which is pretty rough.
>
> Any thoughts?
>
> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
> about the pt_regs
> change yet.  If the performance boost is worth the effort of having a
> per-arch fixup,
> I can go that route.  Otherwise, I could look at some alternate approach for a
> faster-than-regview payload.

Ugh. Sorry about the formatting. (The other option is to disallow compat ;).

cheers!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
  2012-01-13 23:12                 ` Will Drewry
@ 2012-01-13 23:30                 ` Eric Paris
  2012-01-15  3:40                 ` Indan Zupancic
  2012-01-16 18:37                 ` Oleg Nesterov
  3 siblings, 0 replies; 235+ messages in thread
From: Eric Paris @ 2012-01-13 23:30 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, djm, torvalds, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

For anyone who is interested I hacked up a program to turn what I think
is a readable seccomp syntax into BPF rules.  It should make it easier
to prototype this new thing.  The translator needs a LOT of love to be
worth much, but for now it can handle a couple of things and can build a
set of rules!

The rules are of the form:
label object:
	value label

So using Will's BPF example code in my syntax looks like:

start syscall:
        rt_sigreturn success
        sigreturn success
        exit_group success
        exit success
        read read
        write write
read arg0:
        0 success
write arg0:
        1 success
        2 success

So this says the first label is "start" and it is going to deal with the
syscall number.  The first value is 'rt_sigreturn': if syscall ==
rt_sigreturn, you jump to 'success' (success and fail are
implied labels).  If the syscall is 'write' we will jump to 'write.'
The write rules look at arg0.  If arg0 == "1" we jump to "success".  If
you run that syntax through my translator you should get Will's BPF
rules!
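
For readers who want to experiment with the syntax itself, here is a minimal
Python 3 sketch of a reader for it. It is independent of the attached
translate.py and makes its own assumptions: parse_rules() is a hypothetical
name, every section header has exactly the form "label object:", and every
rule has exactly the form "value label". It only builds a table and does not
emit BPF.

```python
EXAMPLE = """\
start syscall:
        rt_sigreturn success
        sigreturn success
        exit_group success
        exit success
        read read
        write write
read arg0:
        0 success
write arg0:
        1 success
        2 success
"""

def parse_rules(text):
    """Return {label: (object, [(value, target), ...])} in source order."""
    sections = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue  # blank lines carry no meaning
        if line.endswith(":"):
            # Section header: "label object:"
            label, obj = line[:-1].split()
            current = sections.setdefault(label, (obj, []))
        else:
            if current is None:
                raise ValueError("line %d: rule before any section" % lineno)
            # Rule: "value label" (jump to label when object == value)
            value, target = line.split()
            current[1].append((value, target))
    return sections
```

Calling parse_rules(EXAMPLE) yields three sections, "start", "read" and
"write", each paired with the object it inspects and its list of jumps.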

You'll quickly notice that the translator only understands "syscall" and
"arg0" and only x86_32, but it should be easy to add more, support the
right registers on different arches, etc, etc.  If others think they
might want to hack on the translator I put it at:

http://git.infradead.org/users/eparis/bpf-translate.git

-Eric

[-- Attachment #2: translate.py --]
[-- Type: text/x-python, Size: 2179 bytes --]

#! /usr/bin/python -Es

import sys

if len(sys.argv) > 1:
	file = open(sys.argv[1])
else:
	file = sys.stdin

linecount = 0
sections = []
rules = {}
output = []
section_map = {}

def new_section(section):
	if section[1] == "syscall":
		output.append(("BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),", section[0]))
	elif section[1] == "arg0":
		output.append(("BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),", section[0]))
	elif section[0] == "success":
		output.append(("BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),", section[0]))
	elif section[0] == "fail":
		output.append(("BPF_STMT(BPF_RET+BPF_A,0),", section[0]))

def new_rule(rule, section, last=None):
	string = "BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, %s, %s, 0)," % (rule[0], rule[1])
	if last:
		string = string.replace(", 0)", ", fail)")
	output.append((string, "0"))

if __name__ == '__main__':
	while 1:
		line = file.readline()
		if not line:
			break
		linecount = linecount + 1
		if ":" in line:
			sections.append(line.strip().strip(":").split())
		else:
			key = sections[-1][0]
			current_list = rules.get(key, [])
			newrule = line.strip().split()
			if sections[-1][1] == "syscall":
				newrule = ["__NR_%s" % newrule[0], newrule[1]]
			current_list.append(newrule)
			rules[key] = current_list
			
		

sections.append(["success", "*"])
sections.append(["fail", "*"])

for section in sections:
	new_section(section)
	if section[0] in rules:
		for rule in rules[section[0]]:
			if rule == rules[section[0]][-1]:
				new_rule(rule, section, 1)
			else:
				new_rule(rule, section)

for lineno,line in enumerate(output):
	if (line[1] == "0"):
		continue
	section_map[line[1]] = lineno

for lineno,line in enumerate(output):
	line = line[0]
	for section in section_map.keys():
		# Only replace in those last 2 commas 
		#if VALUE == section:
			#replace VALUE with str(section_map[section] - lineno - 2)
		splitline = line.split(",")
		if section in splitline[-3]:
			splitline[-3] = splitline[-3].replace(section, str(section_map[section] - lineno - 1))
		if section in splitline[-2]:
			splitline[-2] = splitline[-2].replace(section, str(section_map[section] - lineno - 1))
		line = ",".join(splitline)
	print(line)


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
  2012-01-13 23:12                 ` Will Drewry
  2012-01-13 23:30                 ` Eric Paris
@ 2012-01-15  3:40                 ` Indan Zupancic
  2012-01-16  1:40                   ` Will Drewry
  2012-01-16 18:37                 ` Oleg Nesterov
  3 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-15  3:40 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

Hello,

(I hope everyone is CC'ed; emails forwarded via lkml truncate the CC list
when it's too long.)

On Sat, January 14, 2012 00:10, Will Drewry wrote:
> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
>> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/12, Will Drewry wrote:
>>>>
>>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> > On 01/12, Will Drewry wrote:
>>>> >>
>>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> >> >> +      */
>>>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>>>> >> >
>>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>>> >> > nothing about !x86 arches.
>>>> >> >
>>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>>> >>
>>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>>> >> (iirc), but on x86-64
>>>> >> and others, that's not true.
>>>> >
>>>> > Yes sure, I meant that userpace should use pt_regs too.
>>>> >
>>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>>> >> happily do so :)
>>>> >
>>>> > Ah, so that was the reason. But it is already exported? At least I see
>>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>>> >
>>>> > Once again, I am not arguing, just trying to understand. And I do not
>>>> > know if this definition is part of abi.
>>>>
>>>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>>> user_regs_struct. So I went that route.
>>>
>>> Well, I don't know where user_regs_struct come from initially. But
>>> probably it is needed to allow to access the "artificial" things like
>>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>>
>> Not sure - added Roland whose name was on many of the files :)
>>
>> I just noticed that ptrace ABI allows pt_regs access using the register
>> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>>
>> But I think the latter is guaranteed to have a certain layout while the macros
>> for PEEKUSR can do post-processing fixup.  (Which could be done in the
>> bpf evaluator load_pointer() helper if needed.)
>>
>>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>>
>>> Me too. I see no point in using user_regs_struct.
>>
>> I'll rev the change to use pt_regs and drop all the helper code.  If
>> no one says otherwise, that certainly seems ideal from a performance
>> perspective, and I see pt_regs exported to userland along with ptrace
>> abi register offset macros.
>
> On second thought, pt_regs is scary :)
>
> From looking at
>   http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> and ia32syscall enty code, it appears that for x86, at least, the
> pt_regs for compat processes will be 8 bytes wide per register on the
> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
> IA32_EMU, its filters will always index into pt_regs incorrectly.
>
> I'm not 100% that I am reading the code right, but it means that I can either
> keep using user_regs_struct or fork the code behavior based on compat. That
> would need to be arch dependent then which is pretty rough.
>
> Any thoughts?
>
> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
> about the pt_regs
> change yet.  If the performance boost is worth the effort of having a
> per-arch fixup,
> I can go that route.  Otherwise, I could look at some alternate approach
> for a faster-than-regview payload.

I recommend not giving access to the registers at all, but to instead provide
a fixed cross-platform ABI (a 32 bit version and one 64 bit version).

Everyone dealing with system calls is mostly interested in the same things:
the syscall number and the arguments. You can add some other potentially useful
info like instruction pointer as well, but keep it limited to cross-platform
things with a clear meaning that make sense for system call filtering.

So I propose an interface like the following instead of a register interface:

/* Currently 6, but to be future proof, make it 8 */
#define MAX_SC_ARGS	8

struct syscall_bpf_data {
	unsigned long syscall_nr;
	unsigned long flags;
	unsigned long instruction_pointer;
	unsigned long arg[MAX_SC_ARGS];
	unsigned long _reserved[5];
};

The flags field can be used, e.g., to tell whether a compat 32 bit program is
running on a 64 bit system.

This way the registers have to be interpreted only once by the kernel and all
filtering programs don't have to do that mapping themselves. It also avoids
doing unnecessary work fiddling/translating registers like the ptrace ABI does.
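
A minimal sketch of that one-time mapping, assuming the usual Linux x86
syscall calling conventions.  The table and the normalize() helper are
hypothetical names for illustration only, not anything from the patch:

```python
# Registers that carry the syscall number and the arguments, per arch.
# The orderings are the standard Linux x86 conventions; SYSCALL_REGS and
# normalize() are hypothetical names used only in this sketch.
SYSCALL_REGS = {
    "x86_32": ("orig_eax", ("ebx", "ecx", "edx", "esi", "edi", "ebp")),
    "x86_64": ("orig_rax", ("rdi", "rsi", "rdx", "r10", "r8", "r9")),
}

MAX_SC_ARGS = 8  # as in the proposed struct


def normalize(arch, regs):
    """Turn an arch-specific register dict into syscall_nr plus arg[]."""
    nr_reg, arg_regs = SYSCALL_REGS[arch]
    args = [regs[name] for name in arg_regs]
    # Pad unused slots so every arch presents the same shape to a filter.
    args += [0] * (MAX_SC_ARGS - len(args))
    return {"syscall_nr": regs[nr_reg], "arg": args}


# write(1, buf, 5) as seen on x86_32, where __NR_write is 4.
regs32 = {"orig_eax": 4, "ebx": 1, "ecx": 0x8000, "edx": 5,
          "esi": 0, "edi": 0, "ebp": 0}
view = normalize("x86_32", regs32)
print(view["syscall_nr"], view["arg"][:3])
```

A filter written against such a view would test arg[0] the same way on
every arch; only the syscall numbers stay arch specific.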

I missed whether the original version was allowed to change the registers.
If it is, then perhaps the BPF program should set a specific flag after changing
anything, to make it more explicit.

Greetings,

Indan




* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-15  3:40                 ` Indan Zupancic
@ 2012-01-16  1:40                   ` Will Drewry
  2012-01-16  6:49                     ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-16  1:40 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

On Sat, Jan 14, 2012 at 9:40 PM, Indan Zupancic <indan@nul.nu> wrote:
> Hello,
>
> (I hoped everyone is CC'ed, emails forwarded via lkml truncate the CC list
> when it's too long.)
>
> On Sat, January 14, 2012 00:10, Will Drewry wrote:
>> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
>>> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> On 01/12, Will Drewry wrote:
>>>>>
>>>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>>> > On 01/12, Will Drewry wrote:
>>>>> >>
>>>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>>> >> >> +      */
>>>>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>>>>> >> >
>>>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>>>> >> > nothing about !x86 arches.
>>>>> >> >
>>>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>>>> >>
>>>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>>>> >> (iirc), but on x86-64
>>>>> >> and others, that's not true.
>>>>> >
>>>>> > Yes sure, I meant that userpace should use pt_regs too.
>>>>> >
>>>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>>>> >> happily do so :)
>>>>> >
>>>>> > Ah, so that was the reason. But it is already exported? At least I see
>>>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>>>> >
>>>>> > Once again, I am not arguing, just trying to understand. And I do not
>>>>> > know if this definition is part of abi.
>>>>>
>>>>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>>>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>>>> user_regs_struct. So I went that route.
>>>>
>>>> Well, I don't know where user_regs_struct come from initially. But
>>>> probably it is needed to allow to access the "artificial" things like
>>>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>>>
>>> Not sure - added Roland whose name was on many of the files :)
>>>
>>> I just noticed that ptrace ABI allows pt_regs access using the register
>>> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>>>
>>> But I think the latter is guaranteed to have a certain layout while the macros
>>> for PEEKUSR can do post-processing fixup.  (Which could be done in the
>>> bpf evaluator load_pointer() helper if needed.)
>>>
>>>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>>>
>>>> Me too. I see no point in using user_regs_struct.
>>>
>>> I'll rev the change to use pt_regs and drop all the helper code.  If
>>> no one says otherwise, that certainly seems ideal from a performance
>>> perspective, and I see pt_regs exported to userland along with ptrace
>>> abi register offset macros.
>>
>> On second thought, pt_regs is scary :)
>>
>> From looking at
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
>> and ia32syscall enty code, it appears that for x86, at least, the
>> pt_regs for compat processes will be 8 bytes wide per register on the
>> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
>> IA32_EMU, its filters will always index into pt_regs incorrectly.
>>
>> I'm not 100% that I am reading the code right, but it means that I can either
>> keep using user_regs_struct or fork the code behavior based on compat. That
>> would need to be arch dependent then which is pretty rough.
>>
>> Any thoughts?
>>
>> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
>> about the pt_regs
>> change yet.  If the performance boost is worth the effort of having a
>> per-arch fixup,
>> I can go that route.  Otherwise, I could look at some alternate approach
>> for a faster-than-regview payload.
>
> I recommend not giving access to the registers at all, but to instead provide
> a fixed cross-platform ABI (a 32 bit version and one 64 bit version).

I don't believe it is possible to create a fixed cross-platform ABI
that will be compat and future safe.  user_regs_struct (or even
pt_regs) should always match whatever each arch is doing -- even weird
personality-based changes.  If we do a new ABI, not only does that
have to be exported to userland, but it means we're still copying and
munging the data around (which was why I was trying to see if pt_regs
was an easier way to get a speed boost).

If there's consensus, I'll change it (and use syscall_get_arguments),
but I don't believe it makes sense. (more below)

> As everyone dealing with system call is mostly interested in the same things:
> The syscall number and the arguments. You can add some other potentially useful
> info like instruction pointer as well, but keep it limited to cross-platform
> things with a clear meaning that make sense for system call filtering.

Well, there are also clone/fork, rt_sigreturn, etc. related registers
too.  I like not making the decision for each syscall filtering
consumer.  We have an ABI so it seems like I'd be making work for the
kernel to manage yet another one for system call calling conventions.

> So I propose an interface like the following instead of a register interface:
>
> /* Currently 6, but to be future proof, make it 8 */
> #define MAX_SC_ARGS     8
>
> struct syscall_bpf_data {
>        unsigned long syscall_nr;
>        unsigned long flags;
>        unsigned long instruction_pointer;
>        unsigned long arg[MAX_SC_ARGS];
>        unsigned long _reserved[5];
> };
>
> The flag argument can be used to e.g. tell if it is a compat 32 program
> running on a 64 bit system.

I certainly considered this, but I don't think this is a practical
idea.  Firstly, CONFIG_COMPAT is meant to be compatibility mode.  We
can't assume a program knows about it.  Second, if we assume any new
program will be "smart" and check @flags, then the first few
instructions of _every_ (32-bit) seccomp filter program will be
checking compat mode - a serious waste :(  I'm also not sure if
is_compat_task actually covers all random personality-based changes --
just 32-bit v 64-bit.

I _really_ wanted to make compat a flag and push that logic out of the
kernel, but I don't think it makes sense to burden all ABI consumers
with a "just in case" compat flag check.  Also, what happens if a new,
weird architecture comes along where that flag doesn't make the same
sense?  We can fix all the internal kernel stuff, but we'd end up with
an ABI change to boot :/  Using regviews, we stay consistent
regardless of whatever the new craziness is.  I just wish there was a
way to make it speedier.

> This way the registers have to be interpreted only once by the kernel and all
> filtering programs don't have to do that mapping themselves. It also avoids
> doing unnecessary work fiddling/translating registers like the ptrace ABI does.

The kernel does only interpret them once (after entry to
__secure_computing). It gets the regview and has it populate a
user_regs_struct.  All the register info is per-arch and matches
PTRACE_GETREGS, but not PTRACE_PEEKUSR.  All the weird stuff is in
PEEKUSR to deal with the fact that compat pt_regs members are not
actually the same width as userspace would expect.  If we populated an
ABI as you've proposed, we'd at least need to build that data set and
give it syscall_get_arguments() output.

I was hoping I could just hand over pt_regs and avoid any processing,
but it doesn't look promising.  In theory, all the same bit-twiddling
compat_ptrace does could be done during load_pointer in the patch
series, but it seems wrong to go that route.

> I missed if the original version was allowed to change the registers or not,
> if it is then perhaps the BPF program should set a specific flag after changing
> anything, to make it more explicit.

Registers are const from the BPF perspective (just like with socket
filters).   Adding support for tracehook interception later could
allow for supervisor guided register mutation.

Thanks!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16  1:40                   ` Will Drewry
@ 2012-01-16  6:49                     ` Indan Zupancic
  2012-01-16 20:12                       ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-16  6:49 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

Hello,

On Mon, January 16, 2012 02:40, Will Drewry wrote:
> On Sat, Jan 14, 2012 at 9:40 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Sat, January 14, 2012 00:10, Will Drewry wrote:
>>> Any thoughts?
>>>
>>> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
>>> about the pt_regs
>>> change yet.  If the performance boost is worth the effort of having a
>>> per-arch fixup,
>>> I can go that route.  Otherwise, I could look at some alternate approach
>>> for a faster-than-regview payload.
>>
>> I recommend not giving access to the registers at all, but to instead provide
>> a fixed cross-platform ABI (a 32 bit version and one 64 bit version).
>
> I don't believe it is possible to create a fixed cross-platform ABI
> that will be compat and future safe.

The main problem is that BPF is inherently 32 bits while system call
arguments can be 64 bits.

You could change it so that all BPF programs are always 64 bits, which
would solve everything, except the when-to-sign-extend-and-when-not-to
problem. Luckily tgkill() is the only system call I'm aware of which
would have preferred sign extension, so never sign extending is fairly
safe. But this would require a new user space ABI as it's incompatible
with the current sock_filter.

You could make it work on "unsigned long" and solve all subtleties
that way. For 32 bits compat mode you execute the 32 bit version, if
you want to support it. This would require two seccomp_run_filter()
versions if compat mode is supported. Again, would change the BPF
filter ABI.

So considering that you seem to be stuck with running 32 bits BPF
filters anyway, you may as well make the input always 32 bits or
always 64 bits. 32 bits would lose information sometimes, so making
it always 64 bits without sign extension seems logical. This would
uglify the actual BPF filters immensely though, because then they
often have to check the upper argument half too, usually just to
make sure it's zero. They can't be sure that the kernel will ignore
the upper half for 'int' arguments.

So I suppose the question is: How do you expect people to use this on
64 bit systems?

> user_regs_struct (or even
> pt_regs) should always match whatever each arch is doing -- even weird
> personality-based changes.

I don't think weird personality-based changes are really relevant, no one
is going to write different versions of each filter depending on which
personality is being run. Perhaps weird personality changes should be
denied when a filter is installed, in case the filter forgets to do it.

So if the personality would change in a drastic way across an execve,
it should fail. That is something a filter can't check beforehand and
the only way to deal with it afterwards is by checking what the current
personality is for each system call.

All in all I think filters should be per personality, and if a process's
personality changes it is only allowed if there is also a filter installed
for the new personality too. Or just disallow personality changes, wasn't
that what your patch did anyway?

> If we do a new ABI, not only does that
> have to be exported to userland, but it means we're still copying and
> munging the data around (which was why I was trying to see if pt_regs
> was a easier way to get a speed boost).

The difference is that the register ABI uses the messy ptrace ABI which
is a bit strange and not cross-platform, while simply exporting system
call arguments is plain simple and what everyone tries to get out of the
pt_regs anyway.

But considering system call numbers are platform dependent anyway, it
doesn't really matter that much. I think an array of syscall arguments
would be a cleaner interface, but struct pt_regs would be acceptable.

> If there's consensus, I'll change it (and use syscall_get_arguments),
> but I don't believe it makes sense. (more below)

It's about system call filtering, I'd argue that giving anything else than
the arguments doesn't make sense. Yes, the registers are the current system
call ABI and reusing that is in a way simpler, but that ABI is about telling
the kernel what the arguments are, it's not the best way for the kernel to
tell userspace what the arguments are because for userspace it ends up in
some structure with its own ABI instead of in the actual registers.

>> As everyone dealing with system call is mostly interested in the same things:
>> The syscall number and the arguments. You can add some other potentially useful
>> info like instruction pointer as well, but keep it limited to cross-platform
>> things with a clear meaning that make sense for system call filtering.
>
> Well there are also clone/fork, sig_rt_return, etc related registers
> too.

What special clone/fork registers are you talking about?

I don't think anyone would ever want to filter rt_sigreturn; you may as
well kill the process.

> I like not making the decision for each syscall filtering
> consumer.  We have an ABI so it seems like I'd be making work for the
> kernel to manage yet another one for system call calling conventions.

I think it's pretty much about the arguments and not much else. Even
adding instruction pointer was a bit of a stretch, but it's something
I can imagine people using to make decisions. But as the BPF filters
are stateless they can't really use much else than the syscall number
and arguments anyway, the rest is usually too context dependent.

In order of exponentially decreasing likelihood of being filtered on:
- syscall number
- syscall arguments
- Instruction pointer
- Stack pointer
- ...?!

Keep in mind that x86 only has a handful of registers: 6 of them are used
for the arguments, one is the syscall number and return value, one is
the instruction pointer and there's the stack pointer. There just isn't
much room for much else.

Adding the instruction and stack pointers is quite a stretch already and
should cover pretty much any need. If there is any other information that
might be useful for filtering system calls I'd like to hear about it.

>> So I propose an interface like the following instead of a register interface:
>>
>> /* Currently 6, but to be future proof, make it 8 */
>> #define MAX_SC_ARGS     8
>>
>> struct syscall_bpf_data {
>>        unsigned long syscall_nr;
>>        unsigned long flags;
>>        unsigned long instruction_pointer;
>>        unsigned long arg[MAX_SC_ARGS];
>>        unsigned long _reserved[5];
>> };

BTW, the width of the fields depends on how you want to resolve
the 64 bit issue. As BPF is always 32 bits, it doesn't make much
sense to use longs. And as offsets are used anyway, it probably
makes more sense to define those instead of a structure.
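
To make the width question concrete, here is a rough sketch that lays
out the struct above twice, once with 32 bit and once with 64 bit
fields, and prints the byte offsets a filter would have to load from.
It only models the layout with ctypes; make_layout() is a hypothetical
helper, and the fixed-width types stand in for "unsigned long":

```python
import ctypes

MAX_SC_ARGS = 8  # from the proposal above


def make_layout(word):
    """Build the proposed struct with `word` standing in for unsigned long."""
    class SyscallBpfData(ctypes.Structure):
        _fields_ = [
            ("syscall_nr", word),
            ("flags", word),
            ("instruction_pointer", word),
            ("arg", word * MAX_SC_ARGS),
            ("_reserved", word * 5),
        ]
    return SyscallBpfData


layout32 = make_layout(ctypes.c_uint32)
layout64 = make_layout(ctypes.c_uint64)

# Offsets a BPF load would use for each field under the two widths.
for name in ("syscall_nr", "flags", "instruction_pointer", "arg"):
    print(name, getattr(layout32, name).offset, getattr(layout64, name).offset)
print("size", ctypes.sizeof(layout32), ctypes.sizeof(layout64))
```

Every offset doubles between the two layouts, so a filter is tied to
one width unless the ABI fixes the field size.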

>>
>> The flag argument can be used to e.g. tell if it is a compat 32 program
>> running on a 64 bit system.
>
> I certainly considered this, but I don't think this is a practical
> idea.  Firstly, CONFIG_COMPAT is meant to be compatibility mode.  We
> can't assume a program knows about it.  Second, if we assume any new
> program will be "smart" and check @flags, then the first few
> instruction of _every_ (32-bit) seccomp filter program will be
> checking compat mode - a serious waste :(  I'm also not sure if
> is_compat_task actually covers all random personality-based changes --
> just 32-bit v 64-bit.

Yeah, bad idea. Forget about the flag thing.

> I _really_ wanted to make compat a flag and push that logic out of the
> kernel, but I don't think it makes sense to burden all ABI consumers
> with a "just in case" compat flag check.  Also, what happens if a new,
> weird architecture comes along where that flag doesn't make the same
> sense?  We can fix all the internal kernel stuff, but we'd end up with
> an ABI change to boot :/  Using regviews, we stay consistent
> regardless of whatever the new craziness is.  I just wish there was a
> way to make it speedier.

Better to have filters per personality. That solves this whole issue,
independently of regviews or argument list ABI.

>> This way the registers have to be interpreted only once by the kernel and all
>> filtering programs don't have to do that mapping themselves. It also avoids
>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>
> The kernel does only interpret them once (after entry to
> __secure_computing).

Not if data shuffling is needed for compat related stuff.

> It gets the regview and has it populate a
> user_regs_struct.  All the register info is per-arch and matches
> PTRACE_GETREGS, but not PTRACE_PEEKUSR.

GETREGS seems to be a subset of PEEKUSR. That is, both start with
a struct pt_regs/user_regs_struct (they seem to be the same thing?), and
PEEKUSR only adds access to the debugging registers.

That is another problem of giving a register view: Which registers
are you going to give access to?

> All the weird stuff is in
> PEEKUSR to deal with the fact that compat pt_regs members are not
> actually the same width as userspace would expect.
>
> If we populated an ABI as you've proposed, we'd at least need to
> build that data set and give it syscall_get_arguments() output.

Yes, but that's all you have to do, nothing more.

The pt_regs a 64 bit kernel builds for a 32 bit compat process is
different than one from a 32 bit kernel, so you have to do some kind
of data shuffling anyway.

Worse, once you pick this ABI you're stuck with it and can't get rid
of compat horrors like you have now with ptrace(). Do you really want
to reuse an obscure ptrace ABI instead of creating a simpler new one?

> I was hoping I could just hand over pt_regs and avoid any processing,
> but it doesn't look promising.  In theory, all the same bit-twiddling
> compat_ptrace does could be done during load_pointer in the patch
> series, but it seems wrong to go that route.

Your problem is worse because BPF programs are 32 bits but registers/args
can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
easy.

Do you propose that people not only know about 64 bitness, but also
about endianness when grabbing bits and pieces of 64 bit registers?
Because that seems like a fun source of bugs.
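
A small illustration of the endianness trap, using nothing but the
struct module.  halves_at_offsets() is a hypothetical helper that shows
which 32 bit half of a 64 bit register value sits at byte offsets 0 and
4 for each byte order:

```python
import struct


def halves_at_offsets(value, byteorder):
    """Return the 32-bit words found at byte offsets 0 and 4 of `value`."""
    fmt = "<" if byteorder == "little" else ">"
    raw = struct.pack(fmt + "Q", value)      # the register as stored in memory
    return struct.unpack(fmt + "II", raw)    # two 32-bit loads, at +0 and +4


value = 0x1122334455667788
little = halves_at_offsets(value, "little")
big = halves_at_offsets(value, "big")
print([hex(word) for word in little])  # low half comes first
print([hex(word) for word in big])     # high half comes first
```

A filter that loads "the low half" from offset 0 is only right on
little-endian machines.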

>> I missed if the original version was allowed to change the registers or not,
>> if it is then perhaps the BPF program should set a specific flag after changing
>> anything, to make it more explicit.
>
> Registers are const from the BPF perspective (just like with socket
> filters).   Adding support for tracehook interception later could
> allow for supervisor guided register mutation.

If the ABI gives access to arguments instead of registers you don't have
to do anything tricky: No security checks, no need for fixing up register
values to their original value after the system call returns or any other
subtleties. BPF filters can just change the values without side effects.

I would prefer if it would work nicely with a ptrace supervisor, because
to me it seems that if something can't be resolved in the BPF filter, more
context and direct control is needed. The main downside of ptrace for
jailing is its overhead (and some quirks). If that can be taken away for
most system calls by using BPF then it would be useful for my use case.

Greetings,

Indan




* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
                                   ` (2 preceding siblings ...)
  2012-01-15  3:40                 ` Indan Zupancic
@ 2012-01-16 18:37                 ` Oleg Nesterov
  2012-01-16 20:15                   ` Will Drewry
  3 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-16 18:37 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On 01/13, Will Drewry wrote:
>
> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
> > On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> Me too. I see no point in using user_regs_struct.
> >
> > I'll rev the change to use pt_regs and drop all the helper code.  If
> > no one says otherwise, that certainly seems ideal from a performance
> > perspective, and I see pt_regs exported to userland along with ptrace
> > abi register offset macros.
>
> On second thought, pt_regs is scary :)
>
> From looking at
>   http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> and ia32syscall enty code, it appears that for x86, at least, the
> pt_regs for compat processes will be 8 bytes wide per register on the
> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
> IA32_EMU, its filters will always index into pt_regs incorrectly.

Yes, thanks, I forgot about compat tasks again. But this is easy, we just
need regs_64_to_32().

Doesn't matter. I think Indan has a better suggestion.

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16  6:49                     ` Indan Zupancic
@ 2012-01-16 20:12                       ` Will Drewry
  2012-01-17  6:46                         ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-16 20:12 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

On Mon, Jan 16, 2012 at 12:49 AM, Indan Zupancic <indan@nul.nu> wrote:
> Hello,
>
> On Mon, January 16, 2012 02:40, Will Drewry wrote:
>> On Sat, Jan 14, 2012 at 9:40 PM, Indan Zupancic <indan@nul.nu> wrote:
>>> On Sat, January 14, 2012 00:10, Will Drewry wrote:
>>>> Any thoughts?
>>>>
>>>> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
>>>> about the pt_regs
>>>> change yet.  If the performance boost is worth the effort of having a
>>>> per-arch fixup,
>>>> I can go that route.  Otherwise, I could look at some alternate approach
>>>> for a faster-than-regview payload.
>>>
>>> I recommend not giving access to the registers at all, but to instead provide
>>> a fixed cross-platform ABI (a 32 bit version and one 64 bit version).
>>
>> I don't believe it is possible to create a fixed cross-platform ABI
>> that will be compat and future safe.
>
> The main problem is that BPF is inherently 32 bits while system call
> arguments can be 64 bits.

Inefficient, but ok :)

> You could change it so that all BPF programs are always 64 bits, which
> would solve everything, except the when-to-sign-extend-and-when-not-to
> problem. Luckily tgkill() is the only system call I'm aware of which
> would have preferred sign extension, so never sign extending is fairly
> safe. But this would require a new user space ABI as it's incompatible
> with the current sock_filter.

Yeah, I don't think that'd be a good idea.  I suspect that someday BPF
will grow a 64-bit set of instructions to deal with incoming socket
data better, but there's no rush.

> You could make it work on "unsigned long" and solve all subtleties
> that way. For 32 bits compat mode you execute the 32 bit version, if
> you want to support it. This would require two seccomp_run_filter()
> versions if compat mode is supported. Again, would change the BPF
> filter ABI.

Agreed - this seems like a bad path.

> So considering that you seem to be stuck with running 32 bits BPF
> filters anyway, you can as well make the input always 32 bits or
> always 64 bits. 32 bits would lose information sometimes, so making
> it always 64 bits without sign extension seems logical.

This is not fully formed logic.  BPF can operate on values that exceed
its "bus" width.  Just because it can't do a load/jump on a 64-bit
value doesn't mean you can't implement a jump based on a 64-bit value.
You just split it up.  Load the high-order 32 bits, jump on failure,
fall through and load the next 32-bits and do the rest of the
comparison. Extending Eric's python translator wouldn't be
particularly painful and then it would be transparent to an end user.
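
To make the split concrete, here is a rough sketch.  It does not use the
real BPF opcodes; it uses a made-up three-opcode interpreter (OP_LD_W,
OP_JEQ, OP_RET) that only mimics classic BPF load/jump/return semantics.
The opcode names, run_filter(), and the low-word-first layout are all
assumptions for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini instruction set mirroring a classic BPF subset. */
enum op { OP_LD_W, OP_JEQ, OP_RET };

struct insn {
	enum op op;
	uint8_t jt, jf;	/* relative jump offsets, used by OP_JEQ only */
	uint32_t k;	/* byte offset (LD), constant (JEQ) or verdict (RET) */
};

enum { DENY = 0, ALLOW = 1 };

/* Run prog over data viewed as 32-bit words; fail closed on any error. */
static uint32_t run_filter(const struct insn *prog, size_t len,
			   const uint32_t *data, size_t nwords)
{
	uint32_t acc = 0;

	for (size_t pc = 0; pc < len; pc++) {
		const struct insn *i = &prog[pc];

		switch (i->op) {
		case OP_LD_W:
			if (i->k % 4 || i->k / 4 >= nwords)
				return DENY;	/* unaligned or out of bounds */
			acc = data[i->k / 4];
			break;
		case OP_JEQ:
			/* offsets are relative to the next instruction */
			pc += (acc == i->k) ? i->jt : i->jf;
			break;
		case OP_RET:
			return i->k;
		}
	}
	return DENY;	/* fell off the end of the program */
}

/* Assumed layout: one 64-bit argument stored as { low word, high word }. */
#define ARG_LO 0
#define ARG_HI 4

/* Allow only if the 64-bit argument equals 0x0000000100000002. */
static const struct insn match_u64[] = {
	{ OP_LD_W, 0, 0, ARG_HI },
	{ OP_JEQ,  0, 3, 0x00000001 },	/* high half differs: jump to DENY */
	{ OP_LD_W, 0, 0, ARG_LO },
	{ OP_JEQ,  0, 1, 0x00000002 },	/* low half differs: jump to DENY */
	{ OP_RET,  0, 0, ALLOW },
	{ OP_RET,  0, 0, DENY },
};
```

The cost is two extra instructions per 64-bit comparison, which a
translator could emit automatically.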

> This would
> uglify the actual BPF filters immensely though, because then they
> often have to check the upper argument half too, usually just to
> make sure it's zero. They can't be sure that the kernel will ignore
> the upper half for 'int' arguments.

Of course not!  This will uglify the filters until someday when BPF
grows 64-bit support (or never), but it's not that bad.  The BPF
doesn't need to be pretty, just effective.  And it could be made even
easier with JIT support enabled since it could provide better native
code behaviors.

> So I suppose the question is: How do you expect people to use this on
> 64 bit systems?

As mentioned above.  The whole point of using BPF and user_regs_struct
is to implement _less_ kernel magic instead of more.

>> user_regs_struct (or even
>> pt_regs) should always match whatever each arch is doing -- even weird
>> personality-based changes.
>
> I don't think weird personality-based changes are really relevant, no one
> is going to write different versions of each filter depending on which
> personality is being run. Perhaps weird personality changes should be
> denied when a filter is installed, in case the filter forgets to do it.

It does this in the patch today.  Personalities can affect system call
numbers and argument ordering so it is relevant.  It'd also be a
viable way to escape system call filters if they weren't locked.

> So if the personality would change in a drastic way across an execve,
> it should fail. That is something a filter can't check beforehand and
> the only way to deal with it afterwards is by checking what the current
> personality is for each system call.
>
> All in all I think filters should be per personality, and if a process's
> personality changes it is only allowed if there is also a filter installed
> for the new personality too. Or just disallow personality changes, wasn't
> that what your patch did anyway?

Yup :)  You can't predict _all_ personality-ish changes (at least with x86).

>> If we do a new ABI, not only does that
>> have to be exported to userland, but it means we're still copying and
>> munging the data around (which was why I was trying to see if pt_regs
>> was an easier way to get a speed boost).
>
> The difference is that the register ABI uses the messy ptrace ABI which
> is a bit strange and not cross-platform, while simply exporting system
> call arguments is plain simple and what everyone tries to get out of the
> pt_regs anyway.

user_regs_struct != pt_regs.  user_regs_struct is acquired using
regviews which is already provided by each arch for use by coredump
generation and PTRACE_[GS]ETREGS.  There is no messy ptrace ABI.

Also, the whole point of the patch series was to create filters that
were not cross-platform.  I don't believe that is the kernel's job.
system calls are inherently arch specific so why go to all the effort
to hide that?

> But considering system call numbers are platform dependent anyway, it
> doesn't really matter that much. I think an array of syscall arguments
> would be a cleaner interface, but struct pt_regs would be acceptable.

As I said, user_regs_struct is the portable pt_regs, so I don't see
why it's a problem. But, using syscall_get_arguments is doable too if
that's the route this goes.

>> If there's consensus, I'll change it (and use syscall_get_arguments),
>> but I don't believe it makes sense. (more below)
>
> It's about system call filtering, I'd argue that giving anything else than
> the arguments doesn't make sense. Yes, the registers are the current system
> call ABI and reusing that is in a way simpler, but that ABI is about telling
> the kernel what the arguments are, it's not the best way for the kernel to
> tell userspace what the arguments are because for userspace it ends up in
> some structure with its own ABI instead of in the actual registers.

I don't see this disconnect.  The ABI is as expected by userspace.
Giving an array of arguments and a system call number will work fine,
but it is not a known ABI.  We can create a new one, but I don't
believe this argument justifies it.  It's not what the kernel is
telling user space, it's what is safe to evaluate in the kernel using
what userspace knows without adding another new interface.  If a new
interface is what it takes to get this merged, then I'll clearly do
it, but I'm still not sold that it is actually better.

>>> As everyone dealing with system call is mostly interested in the same things:
>>> The syscall number and the arguments. You can add some other potentially useful
>>> info like instruction pointer as well, but keep it limited to cross-platform
>>> things with a clear meaning that make sense for system call filtering.
>>
>> Well there are also clone/fork, sig_rt_return, etc related registers
>> too.
>
> What special clone/fork registers are you talking about?

On x86, si is used to indicate the tls area:
  http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
(and r8 on 64-bit).  Also segment registers, etc.

> I don't think anyone would want to ever filter sig_rt_return, you can as
> well kill the process.

Why make that decision for them?

>> I like not making the decision for each syscall filtering
>> consumer.  We have an ABI so it seems like I'd be making work for the
>> kernel to manage yet another one for system call calling conventions.
>
> I think it's pretty much about the arguments and not much else. Even
> adding instruction pointer was a bit of a stretch, but it's something
> I can imagine people using to make decisions. But as the BPF filters
> are stateless they can't really use much else than the syscall number
> and arguments anyway, the rest is usually too context dependent.
>
> In order of exponentially less likelihood to filter on:
> - syscall number
> - syscall arguments
> - Instruction pointer
> - Stack pointer
> - ...?!
>
> Keep in mind that x86 only has a handful registers, 6 of them are used
> for the arguments, one is the syscall number and return value, one is
> the instruction pointer and there's the stack pointer. There just isn't
> much room for much else.
>
> Adding the instruction and stack pointers is quite a stretch already and
> should cover pretty much any need. If there is any other information that
> might be useful for filtering system calls I'd like to hear about it.

At this point, why create a new ABI?  Just use the existing fully
supported register views expressed via user_regs_struct.

That said, I can imagine filtering on other registers as being useful
for tentative research.  Think of the past work where control flow
integrity was done by XORing the system call number with a run-time
selected value.  Instead of doing that, you could populate a
non-argument register with the xor of the syscall number and the
secret (picked and then added to the BPF program before install).

I'm not saying this is a good idea, but it seems silly to exclude it
when there doesn't seem to be any specific gain and only added
kernel-side complexity.  It may also be useful to know what other
saved registers (segment, etc) depending on what sort of sandboxing is
being done.  The PC/IP is fun for that one since you could limit all
syscalls to only come from the vdso or vsyscall locations.
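
As a rough sketch of the two checks just described (plain C standing in
for what the filter would compute; tag_matches(), ip_in_range() and their
parameters are hypothetical names, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Tag check: the caller is assumed to load (nr ^ secret) into a spare
 * register before entering the kernel; the filter recomputes it.
 */
static bool tag_matches(uint32_t nr, uint32_t spare_reg, uint32_t secret)
{
	return spare_reg == (nr ^ secret);
}

/* Origin check: the instruction pointer must lie in [start, end). */
static bool ip_in_range(uint64_t ip, uint64_t start, uint64_t end)
{
	return ip >= start && ip < end;
}
```

Both checks only need values that are already in the register view plus
constants baked into the program at install time.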

>>> So I propose an interface like the following instead of a register interface:
>>>
>>> /* Currently 6, but to be future proof, make it 8 */
>>> #define MAX_SC_ARGS     8
>>>
>>> struct syscall_bpf_data {
>>>        unsigned long syscall_nr;
>>>        unsigned long flags;
>>>        unsigned long instruction_pointer;
>>>        unsigned long arg[MAX_SC_ARGS];
>>>        unsigned long _reserved[5];
>>> };
>
> BTW, the width of the fields depends on how you want to resolve
> the 64 bit issue. As BPF is always 32 bits, it doesn't make much
> sense to use longs. And as offsets are used anyway, it probably
> makes more sense to define those instead of a structure.

Yup. I'm still not sold on needing a standalone ABI for this when it
is some combination of syscall_get_arguments and KSTK_EIP, since
user_regs_struct already handles the right type widths, etc.  In fact,
it gets a bit more challenging.

If you look at syscall_get_arguments for x86, it always uses unsigned
long even when it is a TS_COMPAT task:
  lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
That means that the BPF exposed types would either always need to be
unsigned long, irrespective of is_compat_task, or seccomp filter would
need additional per-arch fixups (which is what regviews already do :/
).

>>>
>>> The flag argument can be used to e.g. tell if it is a compat 32 program
>>> running on a 64 bit system.
>>
>> I certainly considered this, but I don't think this is a practical
>> idea.  Firstly, CONFIG_COMPAT is meant to be compatibility mode.  We
>> can't assume a program knows about it.  Second, if we assume any new
>> program will be "smart" and check @flags, then the first few
>> instruction of _every_ (32-bit) seccomp filter program will be
>> checking compat mode - a serious waste :(  I'm also not sure if
>> is_compat_task actually covers all random personality-based changes --
>> just 32-bit v 64-bit.
>
> Yeah, bad idea. Forget about the flag thing.

It seemed so elegant, too :(

>> I _really_ wanted to make compat a flag and push that logic out of the
>> kernel, but I don't think it makes sense to burden all ABI consumers
>> with a "just in case" compat flag check.  Also, what happens if a new,
>> weird architecture comes along where that flag doesn't make the same
>> sense?  We can fix all the internal kernel stuff, but we'd end up with
>> an ABI change to boot :/  Using regviews, we stay consistent
>> regardless of whatever the new craziness is.  I just wish there was a
>> way to make it speedier.
>
> Better to have filters per personality. That solves this whole issue,
> independently of regviews or argument list ABI.

Agreed -- except that, as I mentioned above, there are still
significant complexities kernel-side if anything other than regviews
are used.

>>> This way the registers have to be interpreted only once by the kernel and all
>>> filtering programs don't have to do that mapping themselves. It also avoids
>>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>>
>> The kernel does only interpret them once (after entry to
>> __secure_computing).
>
> Not if data shuffling is needed for compat related stuff.

I agree!  user_regs_struct gets rid of the data shuffling.  pt_regs and
syscall_get_arguments both seem to induce data shuffling for compat
junk.  I just wish pt_regs was compat-width appropriate, but it makes
sense that a 64-bit kernel with a 32-bit program would use 64-bit
registers on its side.  Just frustrating.

>> It gets the regview and has it populate a
>> user_regs_struct.  All the register info is per-arch and matches
>> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
>
> GETREGS seems to be a subset of PEEKUSR. That is, both start with
> a struct pt_regs/user_regs_struct (seems to be the same thing?)


Not quite.  on x86-32, pt_regs and user_regs_struct are identical.
Power PC as well, I think.  They diverge on pretty much every other
platform.  Also, x86 compat has some trickiness.  pt_regs is 64-bit on
x86-64 even with compat processes.  Instead what happens is the naming
is  kept if __KERNEL__ such that there aren't different struct member
names in all the syscall.h and ptrace code.  The
IA32_EMULATION/TS_COMPAT stuff can then just use the reordered member
names without even more #ifdef madness.

user_regs_struct will use the correct width according to the process
personality.  On all arches with is_compat_task support, this matches
-- except x86.  With x86, you can force a 32-bit syscall entry from a
64-bit process resulting in a temporary setting of TS_COMPAT but with
a personality that is still 64-bit.  This is an edge case and one I
think forcing compat and personality to not-change addresses.

> PEEKUSR only has extra access to debugging registers.

GETREGS uses a regview:
  http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1153
PEEKUSR uses getreg or getreg32 directly (on x86).  compat_arch_ptrace
on x86 will then grab a specified register based on the 32-bit offsets
out of a 64-bit pt_regs and can return any register offset, like
ORIG_EAX:
  lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1026

It's this magic fixup that allows ptrace to just use pt_regs for
PEEKUSR while GETREGS is forced to do the full register copy.

> That is another problem of giving a register view: Which registers
> are you going to give access to?

Always the general set.  This is the set denoted by core_note_type
NT_PRSTATUS on all architectures as far as I can tell.

>> All the weird stuff is in
>> PEEKUSR to deal with the fact that compat pt_regs members are not
>> actually the same width as userspace would expect.
>>
>> If we populated an ABI as you've proposed, we'd at least need to
>> build that data set and give it syscall_get_arguments() output.
>
> Yes, but that's all you have to do, nothing more.

I do even less now! :)

> The pt_regs a 64 bit kernel builds for a 32 bit compat process is
> different than one from a 32 bit kernel, so you have to do some kind
> of data shuffling anyway.

Yes - that's why I use user_regs_struct.

> Worse, once you pick this ABI you're stuck with it and can't get rid
> of compat horrors like you have now with ptrace(). Do you really want
> to reuse an obscure ptrace ABI instead of creating a simpler new one?

Exactly why I'm using user_regs_struct.  I think we've been having
some cross-talk, but I'm not sure.  The only edge case I can find with
user_regs_struct across all platforms is the nasty x86 32-bit entry
from 64-bit personality.  Perhaps someday we can just nuke that path
:)  But even so, it is tightly constrained by saving the personality
and compat flag in the filter program metadata and checking it at each
syscall.
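
Roughly, and only as a sketch (struct filter_meta and filter_applies()
are hypothetical names, and failing closed on a mismatch is my
assumption here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical metadata captured when the filter is installed. */
struct filter_meta {
	uint32_t personality;
	bool compat;
};

/*
 * The filter is only trusted if the task still looks the way it did at
 * install time.  A caller would treat 'false' as a denial.
 */
static bool filter_applies(const struct filter_meta *m,
			   uint32_t cur_personality, bool cur_compat)
{
	return m->personality == cur_personality && m->compat == cur_compat;
}
```

The x86 edge case above then shows up as an unchanged personality with
cur_compat set, which no longer matches the saved metadata.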

>> I was hoping I could just hand over pt_regs and avoid any processing,
>> but it doesn't look promising.  In theory, all the same bit-twiddling
>> compat_ptrace does could be done during load_pointer in the patch
>> series, but it seems wrong to go that route.
>
> Your problem is worse because BPF programs are 32 bits but registers/args
> can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
> easy.
>
> Do you propose that people not only know about 64 bitness, but also
> about endianness when grabbing bits and pieces of 64 bit registers?
> Because that seems like a fun source of bugs.

Endianness is calling-convention specific.  For arches that allow
endianness changes, that should be personality based.  I believe that
"people" don't need to know anything unless they are crafting BPF
filters by hand, but I do believe that the userland software they rely
on should understand the current endianness and system call
conventions.  glibc already has to know this stuff, and so does any
other piece of userland code directly interacting with the kernel, so
I don't believe it is a hardship on userland.  It certainly isn't
shiny and isn't naturally intuitive, but those don't seem like the
only guiding requirements.  Making it cross-arch and future-friendly
using what user-space is already aware of seems like it will result in
a robust ABI less afflicted by bit rot or the addition of a crazy new
128-bit architecture :)  But who knows.


>>> I missed if the original version was allowed to change the registers or not,
>>> if it is then perhaps the BPF program should set a specific flag after changing
>>> anything, to make it more explicit.
>>
>> Registers are const from the BPF perspective (just like with socket
>> filters).   Adding support for tracehook interception later could
>> allow for supervisor guided register mutation.
>
> If the ABI gives access to arguments instead of registers you don't have
> to do anything tricky: No security checks, no need for fixing up register
> values to their original value after the system call returns or any other
> subtleties. BPF filters can just change the values without side effects.

BPF programs should never change any filters.  BPF does not have the
capability to modify the data it is evaluating.  Doing that would
require a BPF change and alter its very nature, imo.

While arguments seem tidy, we still end up with the nasty compat pain
and it is only worse kernel-side since there'd be no arch-independent
way to get the correct width system call arguments.  I'd need to guess
they were 32-bit and downgrade them if compat.  That or add a new arch
callout.  Very fiddly :/

> I would prefer if it would work nicely with a ptrace supervisor, because
> to me it seems that if something can't be resolved in the BPF filter, more
> context and direct control is needed. The main downside of ptrace for
> jailing is its overhead (and some quirks). If that can be taken away for
> most system calls by using BPF then it would be useful for my use case.

I could not agree more.  I have a patch already in existence that adds
a call to tracehook_syscall_entry on failure under certain conditions,
but I didn't want to bog down discussion of the core feature with that
discussion too.  I think supporting a ptrace supervisor would allow
for better debugging and sandbox development.  (Then I think most of
the logic could move directly to BPF.  E.g., only allow pointer
arguments for open() to live in this known read-only memory, etc.)

Cheers!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16 18:37                 ` Oleg Nesterov
@ 2012-01-16 20:15                   ` Will Drewry
  2012-01-17 16:45                     ` Oleg Nesterov
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-16 20:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/13, Will Drewry wrote:
>>
>> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
>> > On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >>
>> >> Me too. I see no point in using user_regs_struct.
>> >
>> > I'll rev the change to use pt_regs and drop all the helper code.  If
>> > no one says otherwise, that certainly seems ideal from a performance
>> > perspective, and I see pt_regs exported to userland along with ptrace
>> > abi register offset macros.
>>
>> On second thought, pt_regs is scary :)
>>
>> From looking at
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
>> and ia32syscall entry code, it appears that for x86, at least, the
>> pt_regs for compat processes will be 8 bytes wide per register on the
>> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
>> IA32_EMU, its filters will always index into pt_regs incorrectly.
>
> Yes, thanks, I forgot about compat tasks again. But this is easy, we
> just need regs_64_to_32().

Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32.  Seems kinda wonky though :/
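
Something like this is what I have in mind, as a sketch only.  The
signature of regs_64_to_32() and the stand-in structs regs64/regs32 are
assumptions; the real pt_regs layouts are per-arch and much larger:

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 4	/* hypothetical register count, for illustration only */

struct regs64 { uint64_t r[NREGS]; };	/* stand-in for a 64-bit pt_regs */
struct regs32 { uint32_t r[NREGS]; };	/* what a 32-bit filter indexes */

/* Copy each register, keeping only the low 32 bits. */
static void regs_64_to_32(struct regs32 *dst, const struct regs64 *src)
{
	for (int i = 0; i < NREGS; i++)
		dst->r[i] = (uint32_t)src->r[i];
}
```

Without the copy, a filter built for 4-byte registers would read register
1 at byte offset 4, which is the upper half of register 0 in the 64-bit
layout.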

> Doesn't matter. I think Indan has a better suggestion.

I disagree, but perhaps I'm not fully understanding!

Thanks!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:02   ` Andrew Lutomirski
@ 2012-01-16 20:28     ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-16 20:28 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:02 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Wed, Jan 11, 2012 at 9:25 AM, Will Drewry <wad@chromium.org> wrote:
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>>
> There's some seccomp-related code in the vsyscall emulation path in
> arch/x86/kernel/vsyscall_64.c.  How should time(), getcpu(), and
> gettimeofday() be handled?

Nice catch:
  lxr.linux.no/linux+v3.2.1/arch/x86/kernel/vsyscall_64.c#L180
I'd missed it.

> If you want filtering to work, there
> aren't any real syscall registers to inspect, but they could be
> synthesized.

Hrm, I wonder if making sure orig_eax is populated with the
vsyscall_nr would be enough.  Unless I'm misreading, args 0 and 1 are
correct, so there may be other noise, but performing a call to
__secure_computing() (either in the case or with a pre-validate
syscall nr: 0-2) should send the do_exit.  Does that sound reasonable?

I'll try to do the right thing in my next patch set.

> Preventing a malicious task from figuring out approximately what time
> it is is basically impossible because of the way that vvars work.  I
> don't know how to change that efficiently.

There are other ways to guess the time too, so I don't think it's that
bad.  For those that are really worried, they could disable or
otherwise attempt to limit vsyscall access from their sandbox.

thanks!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16 20:12                       ` Will Drewry
@ 2012-01-17  6:46                         ` Indan Zupancic
  2012-01-17 17:37                           ` Will Drewry
  2012-01-17 20:34                           ` Kees Cook
  0 siblings, 2 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-17  6:46 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

On Mon, January 16, 2012 21:12, Will Drewry wrote:
> On Mon, Jan 16, 2012 at 12:49 AM, Indan Zupancic <indan@nul.nu> wrote:
>> Hello,
>>
>> On Mon, January 16, 2012 02:40, Will Drewry wrote:
>>> On Sat, Jan 14, 2012 at 9:40 PM, Indan Zupancic <indan@nul.nu> wrote:
>>>> On Sat, January 14, 2012 00:10, Will Drewry wrote:
>>>>> Any thoughts?
>>>>>
>>>>> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
>>>>> about the pt_regs
>>>>> change yet.  If the performance boost is worth the effort of having a
>>>>> per-arch fixup,
>>>>> I can go that route.  Otherwise, I could look at some alternate approach
>>>>> for a faster-than-regview payload.
>>>>
>>>> I recommend not giving access to the registers at all, but to instead provide
>>>> a fixed cross-platform ABI (a 32 bit version and one 64 bit version).
>>>
>>> I don't believe it is possible to create a fixed cross-platform ABI
>>> that will be compat and future safe.
>>
>> The main problem is that BPF is inherently 32 bits while system call
>> arguments can be 64 bits.
>
> Inefficient, but ok :)

I hope you're right.

If BPF is always 32 bits and people have to deal with the 64 bit pain
anyway, you can as well have one fixed ABI that is always the same and
cross-platform by just making the arguments always 64 bit, with lower
and upper halves in a fixed order to avoid endianness problems. This
would be compat and future safe.
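
A possible shape for that, as a sketch: the offset macros and sc_pack()
below are hypothetical names, and MAX_SC_ARGS follows the value of 8 from
my earlier proposal.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SC_ARGS 8

/* Hypothetical 32-bit word offsets into the filter's input buffer. */
#define SC_NR		0
#define SC_ARG_LO(n)	(1 + 2 * (n))
#define SC_ARG_HI(n)	(2 + 2 * (n))
#define SC_WORDS	(1 + 2 * MAX_SC_ARGS)

/*
 * Fill buf with the syscall number followed by each argument split into
 * explicit low and high words.  The order is fixed, so it does not
 * depend on host endianness, and nothing is sign extended.
 */
static void sc_pack(uint32_t buf[SC_WORDS], uint32_t nr,
		    const uint64_t args[MAX_SC_ARGS])
{
	buf[SC_NR] = nr;
	for (int i = 0; i < MAX_SC_ARGS; i++) {
		buf[SC_ARG_LO(i)] = (uint32_t)args[i];
		buf[SC_ARG_HI(i)] = (uint32_t)(args[i] >> 32);
	}
}
```

A 32-bit task would simply always see zero in the high words, so the same
filter offsets work for both widths.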

>> So considering that you seem to be stuck with running 32 bits BPF
>> filters anyway, you can as well make the input always 32 bits or
>> always 64 bits. 32 bits would lose information sometimes, so making
>> it always 64 bits without sign extension seems logical.
>
> This is not fully formed logic.  BPF can operate on values that exceed
> its "bus" width.  Just because it can't do a load/jump on a 64-bit
> value doesn't mean you can't implement a jump based on a 64-bit value.
>  You just split it up. Load the high order 32-bits, jump on failure,
> fall through and load the next 32-bits and do the rest of the

Yes, hence my proposal to just bite the bullet and provide a fixed,
cross-platform 64 bit argument interface.

> comparison. Extending Eric's python translator wouldn't be
> particularly painful and then it would be transparent to an end user.

Your end user uses your ABI directly, Eric's python translator is only
one of them.

>> This would
>> uglify the actual BPF filters immensely though, because then they
>> often have to check the upper argument half too, usually just to
>> make sure it's zero. They can't be sure that the kernel will ignore
>> the upper half for 'int' arguments.
>
> Of course not!  This will uglify the filters until someday when BPF
> grows 64-bit support (or never), but it's not that bad.  The BPF
> doesn't need to be pretty, just effective.  And it could be made even
> easier with JIT support enabled since it could provide better native
> code behaviors.

What JIT? If there is one I doubt it's smart enough to consolidate two
32 bit operations into one 64 bit operation.

>> So I suppose the question is: How do you expect people to use this on
>> 64 bit systems?
>
> As mentioned above.  The whole point of using BPF and user_regs_struct
> is to implement _less_ kernel magic instead of more.

At the cost of making it not cross-platform and harder to use. I think it is
a bit sad that the code still ends up being so platform dependent while
it is running in a virtual machine.

And you still have to fix up the compat case.

[...]
>>> If we do a new ABI, not only does that
>>> have to be exported to userland, but it means we're still copying and
>>> munging the data around (which was why I was trying to see if pt_regs
>>> was an easier way to get a speed boost).
>>
>> The difference is that the register ABI uses the messy ptrace ABI which
>> is a bit strange and not cross-platform, while simply exporting system
>> call arguments is plain simple and what everyone tries to get out of the
>> pt_regs anyway.
>
> user_regs_struct != pt_regs.  user_regs_struct is acquired using
> regviews which is already provided by each arch for use by coredump
> generation and PTRACE_[GS]ETREGS.  There is no messy ptrace ABI.

How is exporting registers via a structure not messy? And if PEEKUSR
uses a different ABI then ptrace's ABI is very messy. And it gets messy
whenever you cross a 32/64 bit boundary.

> Also, the whole point of the patch series was to create filters that
> were not cross-platform.  I don't believe that is the kernel's job.

It's the kernel's job to provide an abstracted view of the hardware so
user space has a consistent view. Not trying to make it cross-platform
is just slacking.

> system calls are inherently arch specific so why go to all the effort
> to hide that?

Because, although the numbers are certainly arch specific, the system calls
themselves including the argument ordering are surprisingly consistent.

The numbers are handled by the SYS_* defines, so when porting to a different
arch people just have to check if the arguments they use still match and
that's it. If you provide registers they have to put more effort into
porting the code.

>> But considering system call numbers are platform dependent anyway, it
>> doesn't really matter that much. I think an array of syscall arguments
>> would be a cleaner interface, but struct pt_regs would be acceptable.
>
> As I said, user_regs_struct is the portable pt_regs, so I don't see
> why it's a problem. But, using syscall_get_arguments is doable too if
> that's the route this goes.

It's not portable because it is different for every arch.

>>> If there's consensus, I'll change it (and use syscall_get_arguments),
>>> but I don't believe it makes sense. (more below)
>>
>> It's about system call filtering, I'd argue that giving anything else than
>> the arguments doesn't make sense. Yes, the registers are the current system
>> call ABI and reusing that is in a way simpler, but that ABI is about telling
>> the kernel what the arguments are, it's not the best way for the kernel to
>> tell userspace what the arguments are because for userspace it ends up in
>> some structure with its own ABI instead of in the actual registers.
>
> I don't see this disconnect.  The ABI is as expected by userspace.
> Giving an array of arguments and a system call number will work fine,
> but it is not a known ABI.  We can create a new one, but I don't
> believe this argument justifies it.  It's not what the kernel is
> telling user space, it's what is safe to evaluate in the kernel using
> what userspace knows without adding another new interface.  If a new
> interface is what it takes to get this merged, then I'll clearly do
> it, but I'm still not sold that it is actually better.

You are adding a new interface anyway with this feature. Using user_regs_struct
is not something expected by userspace, unless they are hardcore ptrace users.
And they will get the offsets wrong if they are on 64 bits because those are
different than for ptrace. The ptrace ABI uses longs, BPF is fixed to 32 bits,
it's just not a good fit.

Put another way, why isn't user_regs_struct passed on to each system call
implementation in the kernel instead of the arguments? It's exactly the
same reason as why passing arguments is better for BPF too.

>>>> As everyone dealing with system call is mostly interested in the same things:
>>>> The syscall number and the arguments. You can add some other potentially useful
>>>> info like instruction pointer as well, but keep it limited to cross-platform
>>>> things with a clear meaning that make sense for system call filtering.
>>>
>>> Well there are also clone/fork, sig_rt_return, etc related registers
>>> too.
>>
>> What special clone/fork registers are you talking about?
>
> On x86, si is used to indicate the tls area:
>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
> (and r8 on 64-bit).  Also segment registers, etc.

si is just the 4th (5th for x86_64) system call argument, it's nothing special.

System calls never use segment registers directly, do they? It's not part
of the system call ABI, so why would you want to use them in filtering?

>> I don't think anyone would want to ever filter sig_rt_return, you can as
>> well kill the process.
>
> Why make that decision for them?

I don't, I'm just saying it doesn't make sense to filter it. Based on what
would anyone ever want to filter it? It's just a kernel helper thing to
implement signal handlers.

But now you mention it, it won't be bad to enforce this in the kernel,
otherwise everyone has to add code to allow it. Same for exit(_group).
Because if those are denied, there's nothing else to run instead.

>>> I like not making the decision for each syscall filtering
>>> consumer.  We have an ABI so it seems like I'd be making work for the
>>> kernel to manage yet another one for system call calling conventions.
>>
>> I think it's pretty much about the arguments and not much else. Even
>> adding instruction pointer was a bit of a stretch, but it's something
>> I can imagine people using to make decisions. But as the BPF filters
>> are stateless they can't really use much else than the syscall number
>> and arguments anyway, the rest is usually too context dependent.
>>
>> In order of exponentially less likelihood to filter on:
>> - syscall number
>> - syscall arguments
>> - Instruction pointer
>> - Stack pointer
>> - ...?!
>>
>> Keep in mind that x86 only has a handful of registers: 6 of them are used
>> for the arguments, one is the syscall number and return value, one is
>> the instruction pointer and there's the stack pointer. There just isn't
>> much room for much else.
>>
>> Adding the instruction and stack pointers is quite a stretch already and
>> should cover pretty much any need. If there is any other information that
>> might be useful for filtering system calls I'd like to hear about it.
>
> At this point, why create a new ABI?  Just use the existing fully
> supported register views expressed via user_regs_struct.

It's the difference between 6 args + a couple extras versus 17 registers
for a register starved arch like x86.

But yeah, better to not provide the instruction or stack pointers indeed.
At least the instruction pointer gives some system call related information
(from where it is called).

> That said, I can imagine filtering on other registers as being useful
> for tentative research.

They can use ptrace for that.

> Think of the past work where control flow
> integrity was done by XORing the system call number with a run-time
> selected value.  Instead of doing that, you could populate a
> non-argument register with the xor of the syscall number and the
> secret (picked and then added to the BPF program before install).

What non-argument register would you like to use on x86? I think all
are used up already. All you got left is the segment registers, and
using those seems a bad idea. It also seems a bad idea to promote
non-portable BPF filtering programs.

If you support modifying arguments and syscall nr then people can keep
doing the XORing trick with BPF. Another advantage of allowing that is
that unsafe old system calls can be replaced with secure ones on the
fly transparently.

Really, disallowing modifications is much more limiting than not providing
all registers. But allowing modifications is a lot harder to get right
with a register interface.

> I'm not saying this is a good idea, but it seems silly to exclude it
> when there doesn't seem to be any specific gain and only added
> kernel-side complexity.  It may also be useful to know what other
> saved registers (segment, etc) depending on what sort of sandboxing is
> being done.  The PC/IP is fun for that one since you could limit all
> syscalls to only come from the vdso or vsyscall locations.

Problem is that that is less useful than it seems because malicious code
can always just jump to a syscall entry instruction. Randomization helps
a bit, but it gives no guarantees. Better to store an XORed secret in the
syscall nr and arguments, that gives up to 224 bits of security.
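
To spell out the arithmetic: the syscall number plus six arguments makes
seven 32-bit values, so 7 * 32 = 224 bits. A minimal sketch of the simplest
form of the trick, tagging only the syscall number. The helper names are
hypothetical and nothing here is an existing kernel interface:

```c
#include <assert.h>
#include <stdint.h>

#define NR_VALUES 7	/* assumption: syscall nr + 6 arguments, 32 bits each */

/* Upper bound on the secret material that fits in those values. */
static unsigned int max_secret_bits(void)
{
	return NR_VALUES * 32;
}

/* Caller side: hide the real syscall number behind the secret. */
static uint32_t tag_nr(uint32_t nr, uint32_t secret)
{
	return nr ^ secret;
}

/* Filter side: XOR again; a wrong secret yields a different number. */
static uint32_t untag_nr(uint32_t tagged, uint32_t secret)
{
	return tagged ^ secret;
}
```

Code that jumps straight to a syscall instruction without knowing the
secret ends up with the wrong number after the untag step.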

>> BTW, the width of the fields depends on how you want to resolve
>> the 64 bit issue. As BPF is always 32 bits, it doesn't make much
>> sense to use longs. And as offsets are used anyway, it probably
>> makes more sense to define those instead of a structure.
>
> Yup. I'm still not sold on needing a standalone ABI for this when it
> is some combination of syscall_get_arguments and KSTK_EIP, since
> user_regs_struct already handles the right type widths, etc.  In fact,
> it gets a bit more challenging.

I would go for system call number + arguments only, and forget about the
EIP and stack, except if people really want it. But if you do add it then
it's barely any less limiting than a register view.

> If you look at syscall_get_arguments for x86, it always uses unsigned
> long even when it is a TS_COMPAT task:
>   lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> That means that the BPF exposed types would either always need to be
> unsigned long, irrespective of is_compat_task, or seccomp filter would
> need additional per-arch fixups (which is what regviews already do :/
> ).

If compat tasks are involved you are screwed anyway and have to fiddle
with data, it's unavoidable.

Arguments exposed to BPF should always be 64 bits even on 32 bit archs,
that solves all compat and portability problems.

I really don't see the problem of copying 6 arguments to a fixed place.

If that is tricky then you're either trying to use the wrong function
or doing it at the wrong place in the kernel. I'd expect that passing on
the arguments is highly optimised in the kernel, all system calls have
easy access to them, why would it be hard for the BPF code to get it?

If you use syscall_get_arguments you have to call it once for each arg
instead of calling it once and trying to fix up the 32/64 bit and
endianness afterwards.

So call it once and store the value in a long. Then copy the low half
to the right place and then the upper half when on 64 bits. It may not
look too pretty, but the compiler should be able to optimise almost all
overhead away and end up with 6 (or 12) int copies. Something like this:

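#include <linux/types.h>
#include <asm/syscall.h>

#define MAX_SC_ARGS 6	/* assumed: six syscall arguments */

struct bpf_data {
	u32 syscall_nr;
	u32 arg_low[MAX_SC_ARGS];
	u32 arg_high[MAX_SC_ARGS];
};

void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
{
	int i;
	unsigned long arg;

	d->syscall_nr = syscall_get_nr(t, r);
	for (i = 0; i < MAX_SC_ARGS; ++i) {
		/* Fetch argument i only (n = 1). */
		syscall_get_arguments(t, r, i, 1, &arg);
		d->arg_low[i] = arg;
		/* Widen first: shifting a 32-bit long by 32 is undefined. */
		d->arg_high[i] = (u64)arg >> 32;
	}
}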
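
A userspace sketch of the same split, so it can be compiled and checked
outside the kernel. The struct and function names are hypothetical stand-ins
for the bpf_data layout above, and MAX_SC_ARGS = 6 is an assumption. The
halves land in fixed positions regardless of host endianness, and widening
to 64 bits before shifting keeps the result defined when long is 32 bits:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SC_ARGS 6	/* assumption: six syscall arguments */

/* Hypothetical userspace stand-in for the bpf_data layout. */
struct bpf_data_sketch {
	uint32_t syscall_nr;
	uint32_t arg_low[MAX_SC_ARGS];
	uint32_t arg_high[MAX_SC_ARGS];
};

/* Split one argument into its low and high 32-bit halves. */
static void store_arg(struct bpf_data_sketch *d, int i, unsigned long arg)
{
	uint64_t wide = arg;	/* zero-extends, no sign extension */

	assert(d && i >= 0 && i < MAX_SC_ARGS);
	d->arg_low[i] = (uint32_t)wide;
	d->arg_high[i] = (uint32_t)(wide >> 32);
}
```

On a 32-bit build the high half is always zero, so the same filter can be
used on both widths.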

>
> Agreed -- except that, as I mentioned above, there are still
> significant complexities kernel-side if anything other than regviews
> are used.

I'm missing what those complexities are.

>
>>>> This way the registers have to be interpreted only once by the kernel and all
>>>> filtering programs don't have to do that mapping themselves. It also avoids
>>>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>>>
>>> The kernel does only interpret them once (after entry to
>>> __secure_computing).
>>
>> Not if data shuffling is needed for compat related stuff.
>
> I agree!  user_regs_struct gets rid of the data shuffling.  pt_regs and
> syscall_get_arguments all seem to induce data shuffling for compat
> junk.  I just wish pt_regs was compat-width appropriate, but it makes
> sense that a 64-bit kernel with a 32-bit program would use 64-bit
> registers on its side.  Just frustrating.

Are user_regs_struct entries 32-bit for 32-bit tasks or is it 64-bit if
the kernel is 64-bit? If they're 64-bit then you didn't get rid of the
data shuffling.

>>> It gets the regview and has it populate a
>>> user_regs_struct.  All the register info is per-arch and matches
>>> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
>>
>> GETREGS seems to be a subset of PEEKUSR. That is, both start with
>> a struct pt_regs/user_regs_struct (seems to be the same thing?)
>
>
> Not quite.  on x86-32, pt_regs and user_regs_struct are identical.
> Power PC as well, I think.  They diverge on pretty much every other
> platform.  Also, x86 compat has some trickiness.  pt_regs is 64-bit on
> x86-64 even with compat processes.  Instead what happens is the naming
> is  kept if __KERNEL__ such that there aren't different struct member
> names in all the syscall.h and ptrace code.  The
> IA32_EMULATION/TS_COMPAT stuff can then just use the reordered member
> names without even more #ifdef madness.

It was a surprise to me to find out that the pt_regs a 64-bit ptrace user
gets for a 32 bit tracee differs from the pt_regs when both are 32 bits.

> user_regs_struct will use the correct width according to the process
> personality.  On all arches with is_compat_task support, this matches
> -- except x86.  With x86, you can force a 32-bit syscall entry from a
> 64-bit process resulting in a temporary setting of TS_COMPAT but with
> a personality that is still 64-bit.  This is an edge case and one I
> think forcing compat and personality to not-change addresses.

How's that possible? Setting CS to 0x23? Can userspace do that?

>> PEEKUSR only has extra access to debugging registers.
>
> GETREGS uses a regview:
>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1153
> PEEKUSR uses getreg or getreg32 directly (on x86).  compat_arch_ptrace
> on x86 will then grab a specified register based on the 32-bit offsets
> out of a 64-bit pt_regs and can return any register offset, like
> ORIG_EAX:
>   lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1026
>
> It's this magic fixup that allows ptrace to just use pt_regs for
> PEEKUSR while GETREGS is forced to do the full register copy.
>
>> That is another problem of giving a register view: Which registers
>> are you going to give access to?
>
> Always the general set.  This is the set denoted by core_note_type
> NT_PRSTATUS on all architectures as far as I can tell.
>
>>> All the weird stuff is in
>>> PEEKUSR to deal with the fact that compat pt_regs members are not
>>> actually the same width as userspace would expect.
>>>
>>> If we populated an ABI as you've proposed, we'd at least need to
>>> build that data set and give it syscall_get_arguments() output.
>>
>> Yes, but that's all you have to do, nothing more.
>
> I do even less now! :)

You have to do more for the compat case.

>> The pt_regs a 64 bit kernel builds for a 32 bit compat process is
>> different than one from a 32 bit kernel, so you have to do some kind
>> of data shuffling anyway.
>
> Yes - that's why I use user_regs_struct.

But there are different versions of user_regs_struct, depending on the
situation. This implies that the BPF filters have to be different too,
while they could be exactly the same (except for the syscall nr).

>> Worse, once you pick this ABI you're stuck with it and can't get rid
>> of compat horrors like you have now with ptrace(). Do you really want
>> to reuse an obscure ptrace ABI instead of creating a simpler new one?
>
> Exactly why I'm using user_regs_struct.  I think we've been having
> some cross-talk, but I'm not sure.  The only edge case I can find with
> user_regs_struct across all platforms is the nasty x86 32-bit entry
> from 64-bit personality.  Perhaps someday we can just nuke that path
> :)  But even so, it is tightly constrained by saving the personality
> and compat flag in the filter program metadata and checking it at each
> syscall.

I think it's a good idea to nuke that path, it seems like a security hole
waiting to happen.

>
>>> I was hoping I could just hand over pt_regs and avoid any processing,
>>> but it doesn't look promising.  In theory, all the same bit-twiddling
>>> compat_ptrace does could be done during load_pointer in the patch
>>> series, but it seems wrong to go that route.
>>
>> Your problem is worse because BPF programs are 32 bits but registers/args
>> can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
>> easy.
>>
>> Do you propose that people not only know about 64 bitness, but also
>> about endianness when grabbing bits and pieces of 64 bit registers?
>> Because that seems like a fun source of bugs.
>
> Endianness is calling convention specific.  For arches that allow
> endianness changes, that should be personality based.  I believe that
> "people" don't need to know anything unless they are crafting BPF
> filters by hand, but I do believe that the userland software they rely
> on should understand the current endianness and system call
> conventions.  glibc already has to know this stuff, and so does any
> other piece of userland code directly interacting with the kernel, so
> I don't believe it is a hardship on userland.  It certainly isn't
> shiny and isn't naturally intuitive, but those don't seem like the
> only guiding requirements.  Making it cross-arch and future-friendly
> using what user-space is already aware of seems like it will result in
> a robust ABI less afflicted by bit rot or the addition of a crazy new
> 128-bit architecture :)  But who knows.

If your ABI is too hard to use directly, it won't be used at all.
Any excuse that people won't use this ABI directly is a sign that
it is not good enough.

And the more complicated you make it, the less likely it is that
anyone will use this.

>
>>>> I missed if the original version was allowed to change the registers or not,
>>>> if it is then perhaps the BPF program should set a specific flag after changing
>>>> anything, to make it more explicit.
>>>
>>> Registers are const from the BPF perspective (just like with socket
>>> filters).  Adding support for tracehook interception later could
>>> allow for supervisor guided register mutation.
>>
>> If the ABI gives access to arguments instead of registers you don't have
>> to do anything tricky: No security checks, no need for fixing up register
>> values to their original value after the system call returns or any other
>> subtleties. BPF filters can just change the values without side effects.
>
> BPF programs should never change any filters.  BPF does not have the
> capability to modify the data it is evaluating.  Doing that would
> require a BPF change and alter its very nature, imo.

It could if you make the data part of the scratch memory. If you put the
data at the top, just after BPF_MEMWORDS, then it's all compatible with
the read-only version. Except the sk_chk_filter() code. But if you ever
want to consolidate with the networking version, then you already need
new non-byteswapping instructions. You can as well add a special modify
instruction too then. Making it very explicit seems better anyway.

Using BPF for system call filtering changes its very nature already.

I must say that until your patch came up, I've never heard of BPF filters
before. I think I'm going to use it in our ptrace jailer for network
filtering, if it's possible to get the peer address for normal TCP/UDP
sockets. Documentation is quite vague.

> While arguments seem tidy, we still end up with the nasty compat pain
> and it is only worse kernel-side since there'd be no arch-independent
> way to get the correct width system call arguments.  I'd need to guess
> they were 32-bit and downgrade them if compat.  That or add a new arch
> callout.  Very fiddly :/

See code above. It seems fairly tidy to me.

You could also do the BPF filtering later in the system call entry path
when the arguments are passed directly, but then it's harder to interact
well with ptrace.

>
>> I would prefer if it would work nicely with a ptrace supervisor, because
>> to me it seems that if something can't be resolved in the BPF filter, more
>> context and direct control is needed. The main downside of ptrace for
>> jailing is its overhead (and some quirks). If that can be taken away for
>> most system calls by using BPF then it would be useful for my use case.
>
> I could not agree more.  I have a patch already in existence that adds
> a call to tracehook_syscall_entry on failure under certain conditions,
> but I didn't want to bog down discussion of the core feature with that
> discussion too.  I think supporting a ptrace supervisor would allow
> for better debugging and sandbox development.  (Then I think most of
> the logic could move directly to BPF.  E.g., only allow pointer
> arguments for open() to live in this known read-only memory, etc.)

That is very hard to do in practice except for very limited sandboxing
cases. In the general case you want to check all paths, but knowing
beforehand where those are stored is hard when running arbitrary stuff.
And it doesn't guarantee that they are safe paths, because they can start
in the middle of a stored path and turn an absolute path into a relative
one.

And updating the filters on the fly all the time is a hassle too. So
I think most logic will stay out of BPF, especially because it is the
more tricky stuff to do. But open() is not that performance critical
compared to stuff that happens all the time and where you really don't
want the ptrace overhead, like gettimeofday().

By the way, I think you want the filter to decide with what error code
the system call fails instead of hard coding it to EACCES. So just use
the return value instead of checking against regs_size, which doesn't
make much sense anyway. Then you also have a way for the filter to tell
whether the system call should be passed on to ptrace or not.

Ideally, the BPF filter should be able to deny the system call with a
specific error code, deny the call and kill the task, have a way to
defer to ptrace, and a way to allow it.
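
One way to picture those four outcomes is to pack them into the filter's
32-bit return value. A minimal sketch only: the action names and the bit
layout are assumptions made up for illustration, not an existing or
proposed kernel ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical verdicts, stored in the top 16 bits of the return value. */
enum filter_action {
	FILTER_KILL  = 0,	/* deny and kill the task; 0 so "nothing set" fails closed */
	FILTER_ERRNO = 1,	/* deny; the low 16 bits carry the errno */
	FILTER_TRACE = 2,	/* defer the decision to a ptrace supervisor */
	FILTER_ALLOW = 3,	/* let the system call proceed */
};

/* Build a return value from an action and 16 bits of data. */
static uint32_t filter_ret(enum filter_action action, uint16_t data)
{
	return ((uint32_t)action << 16) | data;
}

/* Extract the action from a return value. */
static enum filter_action filter_ret_action(uint32_t ret)
{
	return (enum filter_action)(ret >> 16);
}

/* Extract the errno; only meaningful for FILTER_ERRNO. */
static int filter_ret_errno(uint32_t ret)
{
	assert(filter_ret_action(ret) == FILTER_ERRNO);
	return (int)(ret & 0xffff);
}
```

A filter would then return, say, filter_ret(FILTER_ERRNO, EPERM) for a
denied call, and the kernel side would switch on filter_ret_action().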

Greetings,

Indan




* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16 20:15                   ` Will Drewry
@ 2012-01-17 16:45                     ` Oleg Nesterov
  2012-01-17 16:56                       ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-17 16:45 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On 01/16, Will Drewry wrote:
>
> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, thanks, I forgot about compat tasks again. But this is easy, just
> > we need regs_64_to_32().
>
> Yup - we could make the assumption that is_compat_task is always
> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
> regs_64_to_32.  Seems kinda wonky though :/

much simpler/faster than what regset does to create the artificial
user_regs_struct32.

> > Doesn't matter. I think Indan has a better suggestion.
>
> I disagree, but perhaps I'm not fully understanding!

I have much more chances to be wrong ;) I leave it to you and Indan.

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 16:45                     ` Oleg Nesterov
@ 2012-01-17 16:56                       ` Will Drewry
  2012-01-17 17:01                         ` Andrew Lutomirski
  2012-01-17 19:35                         ` Will Drewry
  0 siblings, 2 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-17 16:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/16, Will Drewry wrote:
>>
>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > Yes, thanks, I forgot about compat tasks again. But this is easy, just
>> > we need regs_64_to_32().
>>
>> Yup - we could make the assumption that is_compat_task is always
>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>> regs_64_to_32.  Seems kinda wonky though :/
>
> much simpler/faster than what regset does to create the artificial
> user_regs_struct32.

True, I could collapse pt_regs to look like the exported ABI pt_regs.
 Then only compat processes would get the copy overhead.  That could
be tidy and not break ABI.  It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)

I like this idea!
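
A rough sketch of that collapse, assuming made-up layouts: regs64_sketch
and regs32_sketch are hypothetical stand-ins, not pt_regs, and NREGS is
arbitrary. Each 64-bit field is truncated into the 32-bit view, and only
compat tasks would pay for the copy:

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 4	/* arbitrary register count for the sketch */

struct regs64_sketch { uint64_t r[NREGS]; };	/* kernel-side view */
struct regs32_sketch { uint32_t r[NREGS]; };	/* what a compat filter sees */

/* Copy every register, keeping only the low 32 bits. */
static void regs_64_to_32_sketch(struct regs32_sketch *dst,
				 const struct regs64_sketch *src)
{
	int i;

	assert(dst && src);
	for (i = 0; i < NREGS; i++)
		dst->r[i] = (uint32_t)src->r[i];
}
```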

>> > Doesn't matter. I think Indan has a better suggestion.
>>
>> I disagree, but perhaps I'm not fully understanding!
>
> I have much more chances to be wrong ;) I leave it to you and Indan.

We're being very verbose. I hope we can come to a good place!  I took
a break from my response to reply here :)

thanks!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 16:56                       ` Will Drewry
@ 2012-01-17 17:01                         ` Andrew Lutomirski
  2012-01-17 17:05                           ` Oleg Nesterov
  2012-01-17 17:06                           ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-17 19:35                         ` Will Drewry
  1 sibling, 2 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-17 17:01 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 8:56 AM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/16, Will Drewry wrote:
>>>
>>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >
>>> > Yes, thanks, I forgot about compat tasks again. But this is easy, just
>>> > we need regs_64_to_32().
>>>
>>> Yup - we could make the assumption that is_compat_task is always
>>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>>> regs_64_to_32.  Seems kinda wonky though :/
>>
>> much simpler/faster than what regset does to create the artificial
>> user_regs_struct32.
>
> True, I could collapse pt_regs to look like the exported ABI pt_regs.
>  Then only compat processes would get the copy overhead.  That could
> be tidy and not break ABI.  It would mean that I have to assume that
> if unsigned long == 64-bit and is_compat_task(), then the task is
> 32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
> arch that we will add a is_compat64_task() so that I could properly
> collapse? :)
>
> I like this idea!

FWIW, it's possible for a task to execute in 32-bit mode when
!is_compat_task or in 64-bit mode when is_compat_task.  From earlier
in the thread, I think you were planning to block the wrong-bitness
syscall entries, but it's worth double-checking that you don't open up
a hole when a compat task issues the 64-bit syscall instruction.

(is_compat_task says whether the executable was marked as 32-bit.  The
actual execution mode is determined by the cs register, which the user
can control.  See the user_64bit_mode function in
arch/x86/include/asm/ptrace.h.  But maybe it would make more sense to have a
separate 32-bit and 64-bit BPF program and select which one to use
based on the entry point.)

--Andy


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:01                         ` Andrew Lutomirski
@ 2012-01-17 17:05                           ` Oleg Nesterov
  2012-01-17 17:45                             ` Andrew Lutomirski
  2012-01-17 17:06                           ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  1 sibling, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-17 17:05 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On 01/17, Andrew Lutomirski wrote:
>
> (is_compat_task says whether the executable was marked as 32-bit.  The
> actual execution mode is determined by the cs register, which the user
> can control.

Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).

TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
64-bit or not, we should treat it as 32-bit in this case.

No?

Oleg.



* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:01                         ` Andrew Lutomirski
  2012-01-17 17:05                           ` Oleg Nesterov
@ 2012-01-17 17:06                           ` Will Drewry
  1 sibling, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-17 17:06 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 11:01 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Tue, Jan 17, 2012 at 8:56 AM, Will Drewry <wad@chromium.org> wrote:
>> On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/16, Will Drewry wrote:
>>>>
>>>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> >
>>>> > Yes, thanks, I forgot about compat tasks again. But this is easy, just
>>>> > we need regs_64_to_32().
>>>>
>>>> Yup - we could make the assumption that is_compat_task is always
>>>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>>>> regs_64_to_32.  Seems kinda wonky though :/
>>>
>>> much simpler/faster than what regset does to create the artificial
>>> user_regs_struct32.
>>
>> True, I could collapse pt_regs to look like the exported ABI pt_regs.
>>  Then only compat processes would get the copy overhead.  That could
>> be tidy and not break ABI.  It would mean that I have to assume that
>> if unsigned long == 64-bit and is_compat_task(), then the task is
>> 32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
>> arch that we will add a is_compat64_task() so that I could properly
>> collapse? :)
>>
>> I like this idea!
>
> FWIW, it's possible for a task to execute in 32-bit mode when
> !is_compat_task or in 64-bit mode when is_compat_task.  From earlier
> in the thread, I think you were planning to block the wrong-bitness
> syscall entries, but it's worth double-checking that you don't open up
> a hole when a compat task issues the 64-bit syscall instruction.

Yup - I had to (see below).

> (is_compat_task says whether the executable was marked as 32-bit.  The
> actual execution mode is determined by the cs register, which the user
> can control.  See the user_64bit_mode function in
> arch/x86/include/asm/ptrace.h.  But maybe it would make more sense to have a
> separate 32-bit and 64-bit BPF program and select which one to use
> based on the entry point.)

So that was my original design, but the problem was with how regviews
decides on the user_regs_struct.  It decides using TIF_IA32 while I
can only check the cross-arch is_compat_task() which checks TS_COMPAT
on x86.  If I'm just collapsing registers for compat calls (which I am
exploring the viability of right now), then I guess I could re-fork
the filtering to support compat versus non-compat.  The nastier bit
there was that I don't want a compat call to be allowed just because
a process only defined a non-compat filter. I think that can be made
manageable though.

I'll finish proving out the possibilities here.

Thanks!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17  6:46                         ` Indan Zupancic
@ 2012-01-17 17:37                           ` Will Drewry
  2012-01-18  4:06                             ` Indan Zupancic
  2012-01-17 20:34                           ` Kees Cook
  1 sibling, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-17 17:37 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

On Tue, Jan 17, 2012 at 12:46 AM, Indan Zupancic <indan@nul.nu> wrote:
> On Mon, January 16, 2012 21:12, Will Drewry wrote:
>> On Mon, Jan 16, 2012 at 12:49 AM, Indan Zupancic <indan@nul.nu> wrote:
>>> Hello,
>>>
>>> On Mon, January 16, 2012 02:40, Will Drewry wrote:
>>>> On Sat, Jan 14, 2012 at 9:40 PM, Indan Zupancic <indan@nul.nu> wrote:
>>>>> On Sat, January 14, 2012 00:10, Will Drewry wrote:
>>>>>> Any thoughts?
>>>>>>
>>>>>> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
>>>>>> about the pt_regs change yet.  If the performance boost is worth the
>>>>>> effort of having a per-arch fixup, I can go that route.  Otherwise, I
>>>>>> could look at some alternate approach for a faster-than-regview payload.
>>>>>
>>>>> I recommend not giving access to the registers at all, but to instead provide
>>>>> a fixed cross-platform ABI (a 32 bit version and one 64 bit version).
>>>>
>>>> I don't believe it is possible to create a fixed cross-platform ABI
>>>> that will be compat and future safe.
>>>
>>> The main problem is that BPF is inherently 32 bits while system call
>>> arguments can be 64 bits.
>>
>> Inefficient, but ok :)
>
> I hope you're right.
>
> If BPF is always 32 bits and people have to deal with the 64 bit pain
> anyway, you can as well have one fixed ABI that is always the same and
> cross-platform by just making the arguments always 64 bit, with lower
> and upper halves in a fixed order to avoid endianness problems. This
> would be compat and future safe.

What happens if unsigned long is no longer 64-bit in some distant future?

As to endianness, fixed endianness means that userland programs that
have a different endianness will need to translate their values.  It's
just shifting the work around.

>>> So considering that you seem to be stuck with running 32 bits BPF
>>> filters anyway, you can as well make the input always 32 bits or
>>> always 64 bits. 32 bits would lose information sometimes, so making
>>> it always 64 bits without sign extension seems logical.
>>
>> This is not fully formed logic.  BPF can operate on values that exceed
>> its "bus" width.  Just because it can't do a load/jump on a 64-bit
>> value doesn't mean you can't implement a jump based on a 64-bit value.
>>  You just split it up. Load the high order 32-bits, jump on failure,
>> fall through and load the next 32-bits and do the rest of the
>
> Yes, hence my proposal to just bite the bullet and provide a fixed,
> cross-platform 64 bit argument interface.
>
>> comparison. Extending Eric's python translator wouldn't be
>> particularly painful and then it would be transparent to an end user.
>
> Your end user uses your ABI directly, Eric's python translator is only
> one of them.

Sure, but how many people write BPF manually today versus those who
use ethereal/wireshark/tcpdump/libpcap?  And for network data, the
data protocols may vary per packet.

>>> This would
>>> uglify the actual BPF filters immensely though, because then they
>>> often have to check the upper argument half too, usually just to
>>> make sure it's zero. They can't be sure that the kernel will ignore
>>> the upper half for 'int' arguments.
>>
>> Of course not!  This will uglify the filters until someday when BPF
>> grows 64-bit support (or never), but it's not that bad.  The BPF
>> doesn't need to be pretty, just effective.  And it could be made even
>> easier with JIT support enabled since it could provide better native
>> code behaviors.
>
> What JIT? If there is one I doubt it's smart enough to consolidate two
> 32 bit operations into one 64 bit operation.

There's a BPF JITer in the kernel now.  I doubt it does consolidation
either, but with a lightweight lookahead, it would be possible to
collapse known patterns for checking 64-bit values (not in all cases,
but in generic ones).
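
As a rough illustration of the split comparison discussed above (load the
high 32 bits, bail out on mismatch, then load the low 32 bits), here is a
plain C sketch rather than real BPF.  The names demo_load32,
demo_host_is_little_endian and demo_arg_equals_u64 are made up for this
example, and it assumes the register data is laid out in the host's native
byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: native-endian 32-bit load at a byte offset,
 * standing in for a 32-bit BPF load over the register data. */
uint32_t demo_load32(const unsigned char *data, size_t off)
{
	uint32_t word;

	memcpy(&word, data + off, sizeof(word));
	return word;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int demo_host_is_little_endian(void)
{
	const uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/* Compare the 64-bit value stored at 'off' against 'want' using only
 * 32-bit loads and 32-bit compares, as a 32-bit filter would have to. */
int demo_arg_equals_u64(const unsigned char *data, size_t len,
			size_t off, uint64_t want)
{
	size_t hi_off, lo_off;

	assert(data != NULL);
	if (len < sizeof(uint64_t) || off > len - sizeof(uint64_t))
		return 0;	/* out of bounds: reject */

	/* Which half comes first in memory depends on byte order. */
	hi_off = demo_host_is_little_endian() ? off + 4 : off;
	lo_off = demo_host_is_little_endian() ? off : off + 4;

	if (demo_load32(data, hi_off) != (uint32_t)(want >> 32))
		return 0;	/* the "jump on failure" step */
	return demo_load32(data, lo_off) == (uint32_t)want;
}
```

The half offsets are the part a userland library would have to get right
per architecture; the comparison itself stays two loads and two compares.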

>>> So I suppose the question is: How do you expect people to use this on
>>> 64 bit systems?
>>
>> As mentioned above.  The whole point of using BPF and user_regs_struct
>> is to implement _less_ kernel magic instead of more.
>
> At the cost of making it cross-platform and harder to use. I think it is
> a bit sad that the code still ends up being so platform dependent while
> it is running in a virtual machine.

Not all virtual machines have the same goal.  The goal of BPF is to
provide a safe place to evaluate user-supplied instructions over a
fixed window of data to determine acceptance.  While BPF for socket
data is arch-independent, each BPF program needs to understand the
packet data it is operating on.  In this case, the
user_regs_struct is the packet data and the BPF program needs to be
tailored to it.

> And you still have to fix up the compat case.

Not really.  I lock down the compat case.  _Even_ with fixed 64-bit
arguments, you still get system call number mismatches which mean you
need to keep independent filters for the same task. I had this in one
of my first implementations and it adds a nasty amount of implicit
logic during evaluation.
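
To make "locking down" concrete, here is a minimal sketch, not the code
from the patch.  The struct, enum and function names are hypothetical, and
killing the task on a mismatch is my assumption about what the lock-down
would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical metadata recorded when a filter is attached. */
struct demo_filter_meta {
	bool installed_compat;	/* calling convention at attach time */
};

enum demo_verdict { DEMO_RUN_FILTER, DEMO_KILL };

/* Checked on every system call entry, before the filter is evaluated. */
enum demo_verdict demo_check_convention(const struct demo_filter_meta *meta,
					bool entry_is_compat)
{
	assert(meta != NULL);
	/* A filter written for one convention never sees the other one,
	 * so no second filter and no number translation is needed. */
	if (meta->installed_compat != entry_is_compat)
		return DEMO_KILL;
	return DEMO_RUN_FILTER;
}
```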

> [...]
>>>> If we do a new ABI, not only does that
>>>> have to be exported to userland, but it means we're still copying and
>>>> munging the data around (which was why I was trying to see if pt_regs
>>>> was an easier way to get a speed boost).
>>>
>>> The difference is that the register ABI uses the messy ptrace ABI which
>>> is a bit strange and not cross-platform, while simply exporting system
>>> call arguments is plain simple and what everyone tries to get out of the
>>> pt_regs anyway.
>>
>> user_regs_struct != pt_regs.  user_regs_struct is acquired using
>> regviews which is already provided by each arch for use by coredump
>> generation and PTRACE_[GS]ETREGS.  There is no messy ptrace ABI.
>
> How is exporting registers via a structure not messy? And if PEEKUSR
> uses a different ABI then ptrace's ABI is very messy. And it gets messy
> whenever you cross a 32/64 bit boundary.

I'm not following.

>> Also, the whole point of the patch series was to create filters that
>> were not cross-platform.  I don't believe that is the kernel's job.
>
> It's the kernel's job to provide an abstracted view of the hardware so
> user space has a consistent view. Not trying to make it cross-platform
> is just slacking.

I disagree. I believe it is overengineering and poor design to create
a brand new interface with new semantics to hide an ABI that userland
is already aware of.

>> system calls are inherently arch specific so why go to all the effort
>> to hide that?
>
> Because, although the numbers are certainly arch specific, the system calls
> themselves including the argument ordering are surprisingly consistent.
>
> The numbers are handled by the SYS_* defines, so when porting to a different
> arch people just have to check if the arguments they use still match and
> that's it. If you provide registers they have to put more effort into
> porting the code.

Yup the __NR_* defines are ABI and so is the argument ordering
(clearly).  Of course, userspace knows that the argument ordering is
based on registers since that's how it preps the registers before
calling the arch-specific system call trap.  This is why I believe it
makes sense to just store the arch reg mappings in a userland library
rather than do it kernel-side.

>>> But considering system call numbers are platform dependent anyway, it
>>> doesn't really matter that much. I think an array of syscall arguments
>>> would be a cleaner interface, but struct pt_regs would be acceptable.
>>
>> As I said, user_regs_struct is the portable pt_regs, so I don't see
>> why it's a problem. But, using syscall_get_arguments is doable too if
>> that's the route this goes.
>
> It's not portable because it is different for every arch.

I was describing the kernel code, not the data set.  By using
regviews, I get the consistent register view for the personality of
the process for the architecture it is running on.  This means that
the user_regs_struct will always be consistent _for the architecture_
when given to the user's BPF code.  It does not create a portable
userland ABI but instead uses the existing ABI in a way that is
arch-agnostic in the kernel (using the regviews interfaces for arch
fixup).

>>>> If there's consensus, I'll change it (and use syscall_get_arguments),
>>>> but I don't believe it makes sense. (more below)
>>>
>>> It's about system call filtering, I'd argue that giving anything else than
>>> the arguments doesn't make sense. Yes, the registers are the current system
>>> call ABI and reusing that is in a way simpler, but that ABI is about telling
>>> the kernel what the arguments are, it's not the best way for the kernel to
>>> tell userspace what the arguments are because for userspace it ends up in
>>> some structure with its own ABI instead of in the actual registers.
>>
>> I don't see this disconnect.  The ABI is as expected by userspace.
>> Giving an array of arguments and a system call number will work fine,
>> but it is not a known ABI.  We can create a new one, but I don't
>> believe this argument justifies it.  It's not what the kernel is
>> telling user space, it's what is safe to evaluate in the kernel using
>> what userspace knows without adding another new interface.  If a new
>> interface is what it takes to get this merged, then I'll clearly do
>> it, but I'm still not sold that it is actually better.
>
> You are adding a new interface anyway with this feature. Using user_regs_struct
> is not something expected by userspace, except by hardcore ptrace users.

And anything that parses coredumps.

> And they will get the offsets wrong if they are on 64 bits because those are
> different than for ptrace. The ptrace ABI uses longs, BPF is fixed to 32 bits,
> it's just not a good fit.

That's not true on x86-64:

 http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L945

PEEKUSR uses offsetof() on the 32-bit register struct for compat
calls and, with that macro, maps it to the proper entry in pt_regs.
For non-compat, it just uses the offset into pt_regs:
  http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L475
  lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L170

There is significant overlap in the contents of user_regs_struct
and pt_regs on the platform. While it's possible for another arch to
use a different set of ptrace ABI offsets, using offsetof(struct
user_regs_struct, ...) will always work.  It is not long-based.
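
For what it's worth, a userland library could derive the offsets roughly
like this.  It is only a sketch: the field names are the ones glibc's
<sys/user.h> exposes on Linux x86, the demo_* function names are invented
here, and the -1 fallback for other architectures is a placeholder, not a
real mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/user.h>

/* Byte offset of the system call number inside user_regs_struct, or -1
 * if this sketch has no mapping for the build architecture. */
long demo_syscall_nr_offset(void)
{
#if defined(__linux__) && defined(__x86_64__)
	return (long)offsetof(struct user_regs_struct, orig_rax);
#elif defined(__linux__) && defined(__i386__)
	return (long)offsetof(struct user_regs_struct, orig_eax);
#else
	return -1;
#endif
}

/* Byte offset of the first system call argument, same convention. */
long demo_first_arg_offset(void)
{
#if defined(__linux__) && defined(__x86_64__)
	return (long)offsetof(struct user_regs_struct, rdi);
#elif defined(__linux__) && defined(__i386__)
	return (long)offsetof(struct user_regs_struct, ebx);
#else
	return -1;
#endif
}

/* A filter loads 32-bit words, so the offsets it uses should be
 * multiples of four. */
int demo_offset_is_word_aligned(long off)
{
	assert(off >= 0);
	return off % 4 == 0;
}
```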

If you want to convince me this isn't a good fit, I need you to meet
me halfway and make sure your assertions match the code! :)

> Put another way, why isn't user_regs_struct passed on to each system call
> implementation in the kernel instead of the arguments? It's exactly the
> same reason as why passing arguments is better for BPF too.

This is silly.  When you call a system call, what do you do?  You
prepare the register state such that the arguments are in the right
positioning.  user_regs_struct is the snapshot of the registers the
program-itself prepared.

If you think it is more pleasing to have [syscall_nr|args0|...|args6],
then I can respect that.  But so far, the technical arguments have not
backed up that direction.

>>>>> As everyone dealing with system call is mostly interested in the same things:
>>>>> The syscall number and the arguments. You can add some other potentially useful
>>>>> info like instruction pointer as well, but keep it limited to cross-platform
>>>>> things with a clear meaning that make sense for system call filtering.
>>>>
>>>> Well there are also clone/fork, sig_rt_return, etc related registers
>>>> too.
>>>
>>> What special clone/fork registers are you talking about?
>>
>> On x86, si is used to indicate the tls area:
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
>> (and r8 on 64-bit).  Also segment registers, etc.
>
> si is just the 4th (5th for x86_64) system call argument, it's nothing special.

True - arguably though fork() takes no arguments.  Now
syscall_get_arguments() won't know that, so si/r8 and all will still
be exposed for filtering.  For some reason, I thought some other
pieces were used (EFLAGS?), but I can't back that up in the source.

> System calls never use segment registers directly, do they? It's not part
> of the system call ABI, so why would you want to use them in filtering?

On x86-32, you may be using segmentation for isolation purposes.  If
you do so, it may be of interest where the call comes from.  Nothing
truly relevant, just another possibility.  *shrug*

>>> I don't think anyone would want to ever filter sig_rt_return, you can as
>>> well kill the process.
>>
>> Why make that decision for them?
>
> I don't, I'm just saying it doesn't make sense to filter it. Based on what
> would anyone ever want to filter it? It's just a kernel helper thing to
> implement signal handlers.

What if you want the process to die if it returns from any signals?

> But now you mention it, it won't be bad to enforce this in the kernel,
> otherwise everyone has to add code to allow it. Same for exit(_group).
> Because if those are denied, there's nothing else to run instead.

The process will be do_exit()d.  I don't know why it matters?

>>>> I like not making the decision for each syscall filtering
>>>> consumer.  We have an ABI so it seems like I'd be making work for the
>>>> kernel to manage yet another one for system call calling conventions.
>>>
>>> I think it's pretty much about the arguments and not much else. Even
>>> adding instruction pointer was a bit of a stretch, but it's something
>>> I can imagine people using to make decisions. But as the BPF filters
>>> are stateless they can't really use much else than the syscall number
>>> and arguments anyway, the rest is usually too context dependent.
>>>
>>> In order of exponentially less likelihood to filter on:
>>> - syscall number
>>> - syscall arguments
>>> - Instruction pointer
>>> - Stack pointer
>>> - ...?!
>>>
>>> Keep in mind that x86 only has a handful registers, 6 of them are used
>>> for the arguments, one is the syscall number and return value, one is
>>> the instruction pointer and there's the stack pointer. There just isn't
>>> much room for much else.
>>>
>>> Adding the instruction and stack pointers is quite a stretch already and
>>> should cover pretty much any need. If there is any other information that
>>> might be useful for filtering system calls I'd like to hear about it.
>>
>> At this point, why create a new ABI?  Just use the existing fully
>> supported register views expressed via user_regs_struct.
>
> It's the difference between 6 args + a couple extras versus 17 registers
> for a register starved arch like x86.
>
> But yeah, better to not provide the instruction or stack pointers indeed.
> At least the instruction pointer gives some system call related information
> (from where it is called).

Yup - it's nice to have that.

>> That said, I can imagine filtering on other registers as being useful
>> for tentative research.
>
> They can use ptrace for that.

And it will stay research forever.

>> Think of the past work where control flow
>> integrity was done by XORing the system call number with a run-time
>> selected value.  Instead of doing that, you could populate a
>> non-argument register with the xor of the syscall number and the
>> secret (picked and then added to the BPF program before install).
>
> What non-argument register would you like to use on x86? I think all
> are used up already. All you got left is the segment registers, and
> using those seems a bad idea.

There are other arches where this would be feasible.

> It also seems a bad idea to promote non-portable BPF filtering programs.

Why?  If it's possible to make a userland abstraction layer, why do we
force the kernel to take on more work?

> If you support modifying arguments and syscall nr then people can keep
> doing the XORing trick with BPF. Another advantage of allowing that is
> that unsafe old system calls can be replaced with secure ones on the
> fly transparently.
>
> Really, disallowing modifications is much more limiting than not providing
> all registers. But allowing modifications is a lot harder to get right
> with a register interface.

I'm not going to make the change to support BPF making the data
mutable or using scratch space as an output vector. If that is
something that makes sense to the networking stack, then we could
benefit from it, but I don't want to go there.

>> I'm not saying this is a good idea, but it seems silly to exclude it
>> when there doesn't seem to be any specific gain and only added
>> kernel-side complexity.  It may also be useful to know what other
>> saved registers (segment, etc) depending on what sort of sandboxing is
>> being done.  The PC/IP is fun for that one since you could limit all
>> syscalls to only come from the vdso or vsyscall locations.
>
> Problem is that that is less useful than it seems because malicious code
> can always just jump to a syscall entry instruction. Randomization helps
> a bit, but it gives no guarantees. Better to store an XORed secret in the
> syscall nr and arguments, that gives up to 224 bits of security.

Yup, but rewriting the system call number or arguments would require a
ptrace supervisor without changing the nature of BPF.  Also, if
something like this were crafted, a XOR-guess failure would result in
immediate process termination.  This allows for lower entropy to still
provide a robust mechanism.
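
A toy version of that check, in plain C instead of BPF, might look like
the following.  Everything named demo_* is hypothetical, and which spare
register would carry the tag is left open since that is arch-specific.

```c
#include <assert.h>
#include <stdint.h>

enum demo_action { DEMO_ALLOW, DEMO_TERMINATE };

/* What the task would place in a spare (non-argument) register before
 * each call.  'secret' is picked at run time and added to the filter
 * before it is installed. */
uint32_t demo_make_tag(uint32_t syscall_nr, uint32_t secret)
{
	/* A zero secret would make the tag equal the plain number. */
	assert(secret != 0);
	return syscall_nr ^ secret;
}

/* What the filter would evaluate for each call. */
enum demo_action demo_check_xor_tag(uint32_t syscall_nr, uint32_t tag,
				    uint32_t secret)
{
	/* One wrong guess ends the process, so there is no retry loop. */
	return ((tag ^ secret) == syscall_nr) ? DEMO_ALLOW : DEMO_TERMINATE;
}
```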

>>> BTW, the width of the fields depends on how you want to resolve
>>> the 64 bit issue. As BPF is always 32 bits, it doesn't make much
>>> sense to use longs. And as offsets are used anyway, it probably
>>> makes more sense to define those instead of a structure.
>>
>> Yup. I'm still not sold on needing a standalone ABI for this when it
>> is some combination of syscall_get_arguments and KSTK_EIP, since
>> user_regs_struct already handles the right type widths, etc.  In fact,
>> it gets a bit more challenging.
>
> I would go for system call number + arguments only, and forget about the
> EIP and stack, except if people really want it. But if you do add it then
> it's barely any less limiting than a register view.

What do you think of Oleg's proposal? I like it a lot!  If I can just
use pt_regs for all but compat tasks, then it means there is _no_ copy
needed to evaluate the BPF.  This saves a lot more than just the
user_regs_struct register copy.  It also avoids
syscall_get_arguments, etc. I'm looking into how to best implement
this, but I think it may be a real option.

It does mean BPF is per-arch, per syscall-convention, but you know I
am fine with that :)  I do think the performance gains could be well
worth avoiding any copying.  (Perf gains are the strongest argument I
think for your proposal and the thing that would likely lead me to do
it.)

>> If you look at syscall_get_arguments for x86, it always uses unsigned
>> long even when it is a TS_COMPAT task:
>>   lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
>> That means that the BPF exposed types would either always need to be
>> unsigned long, irrespective of is_compat_task, or seccomp filter would
>> need additional per-arch fixups (which is what regviews already do :/
>> ).
>
> If compat tasks are involved you are screwed anyway and have to fiddle
> with data, it's unavoidable.

Well I was letting the existing code do that for me.

> Arguments exposed to BPF should always be 64 bits even on 32 bit archs,
> that solves all compat and portability problems.

Not really. It means that if there is ever a 128-bit register arch, a
new ABI would be spawned for it.  But I don't know what else would be
broken in the kernel by that, so it's hard to tell if that argument
makes sense.

> I really don't see the problem of copying 6 arguments to a fixed place.

I was indicating the need to truncate them for compat.

> If that is tricky then you're either trying to use the wrong function
> or doing it at the wrong place in the kernel. I'd expect that passing on
> the arguments is highly optimised in the kernel, all system calls have
> easy access to them, why would it be hard for the BPF code to get it?

See the source. They are static inlines but there are still memory
copies.  I would then need to copy a second time to truncate them to
the correct width (or some other fanciness).

> If you use syscall_get_arguments you have to call it once for each arg
> instead of calling it once and trying to fix up the 32/64 bit and
> endianness afterwards.

Yup - or call it once and then iterate over the emitted array doing fixups.

> So call it once and store the value in a long. Then copy the low half
> to the right place and then the upper half when on 64 bits. It may not
> look too pretty, but the compiler should be able to optimise almost all
> overhead away and end up with 6 (or 12) int copies. Something like this:
>
> struct bpf_data {
>        uint32 syscall_nr;
>        uint32 arg_low[MAX_SC_ARGS];
>        uint32 arg_high[MAX_SC_ARGS];
> };
>
> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
> {
>        int i;
>        unsigned long arg;
>
>        d->syscall_nr = syscall_get_nr(t, r);
>        for (i = 0; i < MAX_SC_ARGS; ++i){
>                syscall_get_arguments(t, r, i, 1, &arg);
>                d->arg_low[i] = arg;
>                d->arg_high[i] = arg >> 32;
>        }
> }

Sure, but it seems weird to keep the arguments high and low not
adjacent, but I realize you want an arch-independent interface. I
guess in that world, you could do this:

{
  int32_t nr;
  uint32_t arg32[MAX_SC_ARGS];
  uint32_t arg64[MAX_SC_ARGS];
  /* room for future expansion is here */
};

If no one else sees the benefit of keeping this out of the kernel,
then I can do this. I hope Oleg's idea works though because I think it
addresses the performance problems even if it means the userspace
tooling work is higher to achieve good portability.
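
For reference, a userspace sketch of filling such a fixed-width layout
could look like this.  The demo_* names and the explicit argument array
are stand-ins for whatever the kernel side would actually use.  Note that
shifting an unsigned long by 32 is undefined where long is 32 bits wide,
so the sketch widens to 64 bits first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SC_ARGS 6

/* Hypothetical fixed-width layout handed to the filter. */
struct demo_bpf_data {
	int32_t nr;
	uint32_t arg_low[DEMO_MAX_SC_ARGS];
	uint32_t arg_high[DEMO_MAX_SC_ARGS];
};

/* 'args' stands in for the values syscall_get_arguments() would emit. */
void demo_fill_bpf_data(struct demo_bpf_data *d, int nr,
			const unsigned long *args)
{
	size_t i;

	assert(d != NULL && args != NULL);
	d->nr = nr;
	for (i = 0; i < DEMO_MAX_SC_ARGS; ++i) {
		/* Widen first: zero-extends and keeps the shift defined
		 * even when unsigned long is only 32 bits wide. */
		uint64_t wide = args[i];

		d->arg_low[i] = (uint32_t)wide;
		d->arg_high[i] = (uint32_t)(wide >> 32);
	}
}
```

On a 32-bit build the high halves simply come out as zero, which is the
"without sign extension" behaviour discussed earlier in the thread.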

>>
>> Agreed -- except that, as I mentioned above, there are still
>> significant complexities kernel-side if anything other than regviews
>> are used.
>
> I'm missing what those complexities are.
>
>>
>>>>> This way the registers have to be interpreted only once by the kernel and all
>>>>> filtering programs don't have to do that mapping themselves. It also avoids
>>>>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>>>>
>>>> The kernel does only interpret them once (after entry to
>>>> __secure_computing).
>>>
>>> Not if data shuffling is needed for compat related stuff.
>>
>> I agree!  user_regs_struct gets rid of the data shuffling.  pt_regs and
>> syscall_get_arguments all seem to induce data shuffling for compat
>> junk.  I just wish pt_regs was compat-width appropriate, but it makes
>> sense that a 64-bit kernel with a 32-bit program would use 64-bit
>> registers on its side.  Just frustrating.
>
> Are user_regs_struct entries 32-bit for 32-bit tasks or is it 64-bit if
> the kernel is 64-bit? If they're 64-bit then you didn't get rid of the
> data shuffling.

They are appropriate to the process personality as I said and linked
to in the ptrace code. Take a look at my patch - no data shuffling is
needed.

>>>> It gets the regview and has it populate a
>>>> user_regs_struct.  All the register info is per-arch and matches
>>>> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
>>>
>>> GETREGS seems to be a subset of PEEKUSR. That is, both start with
>>> a struct pt_regs/user_regs_struct (seems to be the same thing?)
>>
>>
>> Not quite.  on x86-32, pt_regs and user_regs_struct are identical.
>> Power PC as well, I think.  They diverge on pretty much every other
>> platform.  Also, x86 compat has some trickiness.  pt_regs is 64-bit on
>> x86-64 even with compat processes.  Instead what happens is the naming
>> is  kept if __KERNEL__ such that there aren't different struct member
>> names in all the syscall.h and ptrace code.  The
>> IA32_EMULATION/TS_COMPAT stuff can then just use the reordered member
>> names without even more #ifdef madness.
>
> It was a surprise to me to find out that the pt_regs a 64-bit ptrace user
> gets for a 32 bit tracee differs from the pt_regs when both are 32 bits.
>
>> user_regs_struct will use the correct width according to the process
>> personality.  On all arches with is_compat_task support, this matches
>> -- except x86.  With x86, you can force a 32-bit syscall entry from a
>> 64-bit process resulting in a temporary setting of TS_COMPAT but with
>> a personality that is still 64-bit.  This is an edge case and one I
>> think forcing compat and personality to not-change addresses.
>
> How's that possible? Setting CS to 0x23? Can userspace do that?

int 0x80 will do the trick or setting the CS which I believe can be done.

>>> PEEKUSR only has extra access to debugging registers.
>>
>> GETREGS uses a regview:
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1153
>> PEEKUSR uses getreg or getreg32 directly (on x86).  compat_arch_ptrace
>> on x86 will then grab a specified register based on the 32-bit offsets
>> out of a 64-bit pt_regs and can return any register offset, like
>> ORIG_EAX:
>>   lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1026
>>
>> It's this magic fixup that allows ptrace to just use pt_regs for
>> PEEKUSR while GETREGS is forced to do the full register copy.
>>
>>> That is another problem of giving a register view: Which registers
>>> are you going to give access to?
>>
>> Always the general set.  This is the set denoted by core_note_type
>> NT_PRSTATUS on all architectures as far as I can tell.
>>
>>>> All the weird stuff is in
>>>> PEEKUSR to deal with the fact that compat pt_regs members are not
>>>> actually the same width as userspace would expect.
>>>>
>>>> If we populated an ABI as you've proposed, we'd at least need to
>>>> build that data set and give it syscall_get_arguments() output.
>>>
>>> Yes, but that's all you have to do, nothing more.
>>
>> I do even less now! :)
>
> You have to do more for the compat case.

No. Please look at the code.


>>> The pt_regs a 64 bit kernel builds for a 32 bit compat process is
>>> different than one from a 32 bit kernel, so you have to do some kind
>>> of data shuffling anyway.
>>
>> Yes - that's why I use user_regs_struct.
>
> But there are different versions of user_regs_struct, depending on the
> situation. This implies that the BPF filters have to be different too,
> while they could be exactly the same (except for the syscall nr).

Yes - per-arch. If you're already doing per-arch fixup, why is
mapping 6 extra registers such a burden?  If the code can't do that in
userspace, it is slacking.

>>> Worse, once you pick this ABI you're stuck with it and can't get rid
>>> of compat horrors like you have now with ptrace(). Do you really want
>>> to reuse an obscure ptrace ABI instead of creating a simpler new one?
>>
>> Exactly why I'm using user_regs_struct.  I think we've been having
>> some cross-talk, but I'm not sure.  The only edge case I can find with
>> user_regs_struct across all platforms is the nasty x86 32-bit entry
>> from 64-bit personality.  Perhaps someday we can just nuke that path
>> :)  But even so, it is tightly constrained by saving the personality
>> and compat flag in the filter program metadata and checking it at each
>> syscall.
>
> I think it's a good idea to nuke that path, it seems like a security hole
> waiting to happen.

Agreed. And yet we can't (yet?).


>>
>>>> I was hoping I could just hand over pt_regs and avoid any processing,
>>>> but it doesn't look promising.  In theory, all the same bit-twiddling
>>>> compat_ptrace does could be done during load_pointer in the patch
>>>> series, but it seems wrong to go that route.
>>>
>>> Your problem is worse because BPF programs are 32 bits but registers/args
>>> can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
>>> easy.
>>>
>>> Do you propose that people not only know about 64 bitness, but also
>>> about endianness when grabbing bits and pieces of 64 bit registers?
>>> Because that seems like a fun source of bugs.
>>
>> Endianness is calling convention specific.  For arches that allow
>> endianness changes, that should be personality based.  I believe that
>> "people" don't need to know anything unless they are crafting BPF
>> filters by hand, but I do believe that the userland software they rely
>> on should understand the current endianness and system call
>> conventions.  glibc already has to know this stuff, and so does any
>> other piece of userland code directly interacting with the kernel, so
>> I don't believe it is a hardship on userland.  It certainly isn't
>> shiny and isn't naturally intuitive, but those don't seem like the
>> only guiding requirements.  Making it cross-arch and future-friendly
>> using what user-space is already aware of seems like it will result in
>> a robust ABI less afflicted by bit rot or the addition of a crazy new
>> 128-bit architecture :)  But who knows.
>
> If your ABI is too hard to use directly, it won't be used at all.
> Any excuse that people won't use this ABI directly is a sign that
> it is not good enough.

That is blatantly untrue. Have you ever used tcpdump's expression
language for filtering packets? Wireshark?

> And the more complicated you make it, the less likely it is that
> anyone will use this.

Unless there is a nice library that makes it work well.

>>>>> I missed if the original version was allowed to change the registers or not,
>>>>> if it is then perhaps the BPF program should set a specific flag after changing
>>>>> anything, to make it more explicit.
>>>>
>>>> Registers are const from the BPF perspective (just like with socket
>>>> filters).  Adding support for tracehook interception later could
>>>> allow for supervisor guided register mutation.
>>>
>>> If the ABI gives access to arguments instead of registers you don't have
>>> to do anything tricky: No security checks, no need for fixing up register
>>> values to their original value after the system call returns or any other
>>> subtleties. BPF filters can just change the values without side effects.
>>
>> BPF programs should never change any filters.  BPF does not have the
>> capability to modify the data it is evaluating.  Doing that would
>> require a BPF change and alter its very nature, imo.
>
> It could if you make the data part of the scratch memory. If you put the
> data at the top, just after BPF_MEMWORDS, then it's all compatible with
> the read-only version. Except the sk_chk_filter() code. But if you ever
> want to consolidate with the networking version, then you already need
> new non-byteswapping instructions. You can as well add a special modify
> instruction too then. Making it very explicit seems better anyway.

I am not going to go this route right now.  If you want to, be my
guest. We can add BPF instructions later, but I am not going down that
rabbit hole now.

> Using BPF for system call filtering changes its very nature already.

No it doesn't.  user_regs_struct becomes another data protocol.

> I must say that until your patch came up, I've never heard of BPF filters
> before. I think I'm going to use it in our ptrace jailer for network
> filtering, if it's possible to get the peer address for normal TCP/UDP
> sockets. Documentation is quite vague.

Cool!

>> While arguments seem tidy, we still end up with the nasty compat pain
>> and it is only worse kernel-side since there'd be no arch-independent
>> way to get the correct width system call arguments.  I'd need to guess
>> they were 32-bit and downgrade them if compat.  That or add a new arch
>> callout.  Very fiddly :/
>
> See code above. It seems fairly tidy to me.
>
> You could also do the BPF filtering later in the system call entry path
> when the arguments are passed directly, but then it's harder to interact
> well with ptrace.

Where? Interpose it into each system call?  The later I put it, the
less attack surface reduction I get.  The whole point of this
framework is to reduce the kernel's attack surface by doing minimal
kernel-side work before making a policy decision.

Also, I've already gone the ftrace route. As I said in the writeup, I
don't think it is the right path for this sort of functionality.

>>
>>> I would prefer if it would work nicely with a ptrace supervisor, because
>>> to me it seems that if something can't be resolved in the BPF filter, more
>>> context and direct control is needed. The main downside of ptrace for
>>> jailing is its overhead (and some quirks). If that can be taken away for
>>> most system calls by using BPF then it would be useful for my use case.
>>
>> I could not agree more.  I have a patch already in existence that adds
>> a call to tracehook_syscall_entry on failure under certain conditions,
>> but I didn't want to bog down discussion of the core feature with that
>> discussion too.  I think supporting a ptrace supervisor would allow
>> for better debugging and sandbox development.  (Then I think most of
>> the logic could move directly to BPF.  E.g., only allow pointer
>> arguments for open() to live in this known read-only memory, etc.)
>
> That is very hard to do in practise except for very limited sandboxing
> cases. In the general case you want to check all paths, but knowing
> beforehand where those are stored is hard when running arbitrary stuff.
> And it doesn't guarantee that they are safe paths, because they can start
> in the middle of a stored path and turn an absolute path into a relative
> one.

Yeah - you'd need a lookup table in the BPF, etc. It'd be pretty ugly :)

> And updating the filters on the run all the time is a hassle too. So
> I think most logic will stay out of BPF, especially because it is the
> more tricky stuff to do. But open() is not that performance critical
> compared to stuff that happens all the time and where you really don't
> want the ptrace overhead, like gettimeofday().

yeah definitely.

> By the way, I think you want the filter to decide with what error code
> the system call fails instead of hard coding it to EACCES. So just use
> the return value instead of checking against regs_size, which doesn't
> make much sense anyway. Then you also have a way for the filter to tell
> whether the system call should be passed on to ptrace or not.

EACCES is never passed to userspace.  As far as ptrace is concerned, I
don't think the filter needs any awareness of whether there is a
tracer or not.

That said, I'm happy to change the return value semantic.  Right now
it matches how it works in the networking stack.  It returns the data
size to be accepted.

What would make sense? 0 is success and any other value is a failure.
Then specify that the ABI failure return code is _some value_ and then
populate the other later?  I was planning on doing that with
regs_size.  regs_size is reserved, but the other return values could be
exported and used if they ever came into existence.  I'm open to what
everyone thinks makes the most sense!


> Ideally, the BPF filter should be able to deny the system call with a
> specific error code, deny the call and kill the task, have a way to
> defer to ptrace, and a way to allow it.

Not happening (by my hand :).  I'm not changing seccomp to allow it to
cause a system call to fail with an error code. I'll add support for
tracehook integration if this patch can get merged, but I'm not going
to change the basic semantics of seccomp.  The nice thing is, if we
reserve return values, this functionality can be layered on later
without it causing any ABI breakage and with proper consideration
independent of whether the basic functionality gets merged. Then, if
you want to retool the entire seccomp path on all architectures to allow
graceful system call failure, it'd be totally doable.

Thanks!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:05                           ` Oleg Nesterov
@ 2012-01-17 17:45                             ` Andrew Lutomirski
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-17 17:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/17, Andrew Lutomirski wrote:
>>
>> (is_compat_task says whether the executable was marked as 32-bit.  The
>> actual execution mode is determined by the cs register, which the user
>> can control.
>
> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
> along with TS_COMPAT).
>
> TS_COMPAT says that, say, the task did "int 80" to enters the kernel.
> 64-bit or not, we should treat is as 32-bit in this case.

I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80).  That's
what I get for insufficiently careful reading of the assembly.  (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)

--Andy


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 16:56                       ` Will Drewry
  2012-01-17 17:01                         ` Andrew Lutomirski
@ 2012-01-17 19:35                         ` Will Drewry
  1 sibling, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-17 19:35 UTC (permalink / raw)
  To: Oleg Nesterov, Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen

On Tue, Jan 17, 2012 at 10:56 AM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/16, Will Drewry wrote:
>>>
>>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >
>>> > Yes, thanks, I forgot about compat tasks again. But this is easy, just
>>> > we need regs_64_to_32().
>>>
>>> Yup - we could make the assumption that is_compat_task is always
>>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>>> regs_64_to_32.  Seems kinda wonky though :/
>>
>> much simpler/faster than what regset does to create the artificial
>> user_regs_struct32.
>
> True, I could collapse pt_regs to looks like the exported ABI pt_regs.
>  Then only compat processes would get the copy overhead.  That could
> be tidy and not break ABI.  It would mean that I have to assume that
> if unsigned long == 64-bit and is_compat_task(), then the task is
> 32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
> arch that we will add a is_compat64_task() so that I could properly
> collapse? :)
>
> I like this idea!

Ouch, so a few issues:
- pt_regs isn't exported for most arches
- is_compat_task arches would need custom fixups

I think Indan takes this round :) I'll begin integrating a
syscall_get_arguments approach.  Hopefully it can be quite efficient.

cheers!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17  6:46                         ` Indan Zupancic
  2012-01-17 17:37                           ` Will Drewry
@ 2012-01-17 20:34                           ` Kees Cook
  2012-01-17 20:42                             ` Will Drewry
  1 sibling, 1 reply; 235+ messages in thread
From: Kees Cook @ 2012-01-17 20:34 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Will Drewry, Oleg Nesterov, linux-kernel, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

On Mon, Jan 16, 2012 at 10:46 PM, Indan Zupancic <indan@nul.nu> wrote:
> So call it once and store the value in a long. Then copy the low half
> to the right place and then the upper half when on 64 bits. It may not
> look too pretty, but the compiler should be able to optimise almost all
> overhead away and end up with 6 (or 12) int copies. Something like this:
>
> struct bpf_data {
>        uint32 syscall_nr;
>        uint32 arg_low[MAX_SC_ARGS];
>        uint32 arg_high[MAX_SC_ARGS];
> };
>
> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
> {
>        int i;
>        unsigned long arg;
>
>        d->syscall_nr = syscall_get_nr(t, r);
>        for (i = 0; i < MAX_SC_ARGS; ++i){
>                syscall_get_arguments(t, r, i, 1, &arg);
>                d->arg_low[i] = arg;
>                d->arg_high[i] = arg >> 32;
>        }
> }

If this turns out to be expensive, it might be possible to break it up
and load the arguments on demand (and cache them); i.e. have
load_pointer() or similar notice when it is about to access something
other than bpf_data.syscall_nr.
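
A minimal userspace sketch of that on-demand idea, assuming a
hypothetical fetch callback standing in for syscall_get_arguments() and
a hypothetical struct lazy_args cache (none of these names come from the
patch).  It keeps each argument as 64 bits before splitting, because
shifting a 32-bit unsigned long right by 32 is undefined:

```c
#include <stdint.h>

#define MAX_SC_ARGS 6

/* Hypothetical stand-in for syscall_get_arguments(): returns argument i. */
typedef unsigned long (*fetch_arg_fn)(void *ctx, int i);

/* Hypothetical per-syscall cache; valid is a bitmask of fetched arguments. */
struct lazy_args {
	fetch_arg_fn fetch;
	void *ctx;
	unsigned int valid;
	uint64_t arg[MAX_SC_ARGS];
};

void lazy_args_init(struct lazy_args *c, fetch_arg_fn fetch, void *ctx)
{
	c->fetch = fetch;
	c->ctx = ctx;
	c->valid = 0;
}

/* Fetch argument i at most once, then serve it from the cache. */
uint64_t lazy_arg(struct lazy_args *c, int i)
{
	if (i < 0 || i >= MAX_SC_ARGS)
		return 0;
	if (!(c->valid & (1u << i))) {
		c->arg[i] = c->fetch(c->ctx, i);
		c->valid |= 1u << i;
	}
	return c->arg[i];
}

/* Low and high 32-bit halves; the high half is 0 for 32-bit arguments. */
uint32_t lazy_arg_low(struct lazy_args *c, int i)
{
	return (uint32_t)lazy_arg(c, i);
}

uint32_t lazy_arg_high(struct lazy_args *c, int i)
{
	return (uint32_t)(lazy_arg(c, i) >> 32);
}

/* Demo fetch: counts how often it is called through ctx. */
unsigned long demo_fetch(void *ctx, int i)
{
	++*(int *)ctx;
	return 0x1000ul + (unsigned long)i;
}
```

A filter that only looks at the syscall number never triggers a fetch;
one that reads both halves of an argument triggers exactly one.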

-Kees

-- 
Kees Cook
ChromeOS Security


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 20:34                           ` Kees Cook
@ 2012-01-17 20:42                             ` Will Drewry
  2012-01-17 21:09                               ` Will Drewry
  2012-01-18  4:47                               ` Indan Zupancic
  0 siblings, 2 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-17 20:42 UTC (permalink / raw)
  To: Kees Cook
  Cc: Indan Zupancic, Oleg Nesterov, linux-kernel, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, Roland McGrath

On Tue, Jan 17, 2012 at 2:34 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jan 16, 2012 at 10:46 PM, Indan Zupancic <indan@nul.nu> wrote:
>> So call it once and store the value in a long. Then copy the low half
>> to the right place and then the upper half when on 64 bits. It may not
>> look too pretty, but the compiler should be able to optimise almost all
>> overhead away and end up with 6 (or 12) int copies. Something like this:
>>
>> struct bpf_data {
>>        uint32 syscall_nr;
>>        uint32 arg_low[MAX_SC_ARGS];
>>        uint32 arg_high[MAX_SC_ARGS];
>> };
>>
>> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
>> {
>>        int i;
>>        unsigned long arg;
>>
>>        d->syscall_nr = syscall_get_nr(t, r);
>>        for (i = 0; i < MAX_SC_ARGS; ++i){
>>                syscall_get_arguments(t, r, i, 1, &arg);
>>                d->arg_low[i] = arg;
>>                d->arg_high[i] = arg >> 32;
>>        }
>> }
>
> If this turns out to be expensive, it might be possible to break it up
> and load the arguments on demand (and cache them); i.e. have
> load_pointer() or similar notice when it is about to access something
> other than bpf_data.syscall_nr.

Makes perfect sense!  In theory (as a few other people pointed out
off list), it is entirely possible to never populate any data for
load_pointer except an optional cache.  Just provide a custom
load_pointer that knows to take the offset and return the syscall nr,
the args, or some slice of the returned data.

This is even easier if the struct looks like:
struct {
  int nr;
  union {
    uint32_t args32[6];
    uint64_t args64[6];
  };
};

since you can just use the offset without doing any endian-based
splitting.  Another suggestion (thanks roland!) was to add
  int syscall_arch;
to the struct populated with the AUDIT_ARCH_* defines.  This would
help the case Indan was worried about -- portable filter programs.

It looks like there'd be some cross-arch plumbing to make the
AUDIT_ARCH_ data available, but not too bad.

Seem sane? I'm headed down this path now and I think it'll work out
assuming there aren't major objections to the syscall_arch piece.
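
To make the offset-based load concrete, here is a rough userspace
sketch.  The struct name sc_data, the helper sc_load_word() and the
exact field order are assumptions for illustration only; the idea is
just that a 32-bit load at a byte offset needs a bounds and alignment
check and nothing else:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the struct from this mail plus syscall_arch. */
struct sc_data {
	int nr;
	int syscall_arch;
	union {
		uint32_t args32[6];
		uint64_t args64[6];
	};
};

/*
 * Hypothetical load helper: copy the 32-bit word at byte offset off
 * into *out.  Returns 0 on success, -1 if the access is unaligned or
 * out of bounds.
 */
int sc_load_word(const struct sc_data *d, size_t off, uint32_t *out)
{
	if (off % sizeof(uint32_t) != 0)
		return -1;
	if (off > sizeof(*d) - sizeof(uint32_t))
		return -1;
	memcpy(out, (const unsigned char *)d + off, sizeof(*out));
	return 0;
}
```

A filter would then address fields with offsetof(struct sc_data, nr) or
offsetof(struct sc_data, args32) + 4 * i.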

thanks!
will


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 20:42                             ` Will Drewry
@ 2012-01-17 21:09                               ` Will Drewry
  2012-01-18  4:47                               ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-17 21:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Indan Zupancic, Oleg Nesterov, linux-kernel, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, Roland McGrath

On Tue, Jan 17, 2012 at 2:42 PM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jan 17, 2012 at 2:34 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Mon, Jan 16, 2012 at 10:46 PM, Indan Zupancic <indan@nul.nu> wrote:
>>> So call it once and store the value in a long. Then copy the low half
>>> to the right place and then the upper half when on 64 bits. It may not
>>> look too pretty, but the compiler should be able to optimise almost all
>>> overhead away and end up with 6 (or 12) int copies. Something like this:
>>>
>>> struct bpf_data {
>>>        uint32 syscall_nr;
>>>        uint32 arg_low[MAX_SC_ARGS];
>>>        uint32 arg_high[MAX_SC_ARGS];
>>> };
>>>
>>> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
>>> {
>>>        int i;
>>>        unsigned long arg;
>>>
>>>        d->syscall_nr = syscall_get_nr(t, r);
>>>        for (i = 0; i < MAX_SC_ARGS; ++i){
>>>                syscall_get_arguments(t, r, i, 1, &arg);
>>>                d->arg_low[i] = arg;
>>>                d->arg_high[i] = arg >> 32;
>>>        }
>>> }
>>
>> If this turns out to be expensive, it might be possible to break it up
>> and load the arguments on demand (and cache them); i.e. have
>> load_pointer() or similar notice when it is about to access something
>> other than bpf_data.syscall_nr.
>
> Makes perfect sense!  In theory (as a few other people pointed this
> out off list), it is entirely possible to never populate any data for
> load_pointer except an optional cache.  Just provide a custom
> load_pointer that knows to take the offset return the syscall nr or
> the args or some slice of the returned data.
>
> This is even easier if the struct looks like:
> struct {
>  int nr;
>  union {
>    uint32_t args32[6];
>    uint64_t args64[6];
>  }
> };
>
> since you can just use the offset without doing any endian-based
> splitting.  Another suggestion (thanks roland!) was to add
>  int syscall_arch;
> to the struct populated with the AUDIT_ARCH_* defines.  This would
> help the case Indan was worried about -- portable filter programs.
>
> It looks like there'd be some cross-arch plumbing to make the
> AUDIT_ARCH_ data available, but not too bad.
>
> Seem sane? I'm headed down this path now and I think it'll work out
> assuming there aren't major objections to the syscall_arch piece.

Hrm. I'm still not so sure about the arch bit.  Without it, BPF
programs aren't directly share-able, but they could be as long as the
values for k and syscall numbers are being adapted.  By putting arch
in the program, it makes it more likely that every system call will
have a bpf preamble that has to check the syscall_arch.  It could
easily add 100s of nanoseconds to every call (on slower arches).

I'll probably do the next patch series without arch-checking support;
then I can add it if it seems needed.  Nothing forces a filter program
to check it, so it could be that we let the author make the decision.
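
For a sense of what that preamble would cost, here is a sketch of a
filter with an arch check in front, written with the classic BPF macros
from linux/filter.h.  The struct sc_arch_data layout, the SC_ACCEPT and
SC_REJECT return values and the allowed syscall number are all
assumptions, since none of those are settled yet:

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/audit.h>
#include <linux/filter.h>

/* Hypothetical data layout; the real one is still under discussion. */
struct sc_arch_data {
	int nr;
	int syscall_arch;
	uint64_t args[6];
};

/* Hypothetical return values, for illustration only. */
#define SC_REJECT 0
#define SC_ACCEPT 0xffffffffu

/* Arbitrary example syscall number to allow. */
#define SC_ALLOWED_NR 39

struct sock_filter arch_checked_prog[] = {
	/* Preamble: load syscall_arch, fall through only for x86-64. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct sc_arch_data, syscall_arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 3),
	/* Body: load the syscall number and allow a single call. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct sc_arch_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SC_ALLOWED_NR, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SC_ACCEPT),
	BPF_STMT(BPF_RET | BPF_K, SC_REJECT),
};
```

The preamble is the first two instructions (one load, one compare), and
they run on every system call the task makes.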

cheers!
will


* Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-17 17:45                             ` Andrew Lutomirski
@ 2012-01-18  0:56                               ` Indan Zupancic
  2012-01-18  1:01                                 ` Andrew Lutomirski
                                                   ` (3 more replies)
  0 siblings, 4 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18  0:56 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
> On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/17, Andrew Lutomirski wrote:
>>>
>>> (is_compat_task says whether the executable was marked as 32-bit.  The
>>> actual execution mode is determined by the cs register, which the user
>>> can control.
>>
>> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
>> along with TS_COMPAT).
>>
>> TS_COMPAT says that, say, the task did "int 80" to enters the kernel.
>> 64-bit or not, we should treat is as 32-bit in this case.
>
> I think you're right, and checking which entry was used is better than
> checking the cs register (since 64-bit code can use int80).  That's
> what I get for insufficiently careful reading of the assembly.  (And
> for going from memory from when I wrote the vsyscall emulation code --
> that code is entered from a page fault, so the entry point used is
> irrelevant.)

Wait: If a task is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?

Our ptrace jailer is checking cs to figure out if a task is a compat task
or not. If the kernel can change that behind our back, it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?

I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
@ 2012-01-18  1:01                                 ` Andrew Lutomirski
  2012-01-19  1:06                                   ` Indan Zupancic
  2012-01-18  1:07                                 ` Roland McGrath
                                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-18  1:01 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
>> On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/17, Andrew Lutomirski wrote:
>>>>
>>>> (is_compat_task says whether the executable was marked as 32-bit.  The
>>>> actual execution mode is determined by the cs register, which the user
>>>> can control.
>>>
>>> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
>>> along with TS_COMPAT).
>>>
>>> TS_COMPAT says that, say, the task did "int 80" to enters the kernel.
>>> 64-bit or not, we should treat is as 32-bit in this case.
>>
>> I think you're right, and checking which entry was used is better than
>> checking the cs register (since 64-bit code can use int80).  That's
>> what I get for insufficiently careful reading of the assembly.  (And
>> for going from memory from when I wrote the vsyscall emulation code --
>> that code is entered from a page fault, so the entry point used is
>> irrelevant.)
>
> Wait: If a tasks is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?
>
> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

I don't know what your ptrace jailer does.  But a task can switch
itself between 32-bit and 64-bit execution at will, and there's
nothing the kernel can do about it.  (That isn't quite true -- in
theory the kernel could fiddle with the GDT, but that would be
expensive and wouldn't work on Xen.)

That being said, is_compat_task is apparently a good indication of
whether the current *syscall* entry is a 64-bit syscall or a 32-bit
syscall.  Perhaps the function should be renamed to in_compat_syscall,
because that's what it does.

>
> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

Nowhere, I think.

--Andy


* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
  2012-01-18  1:01                                 ` Andrew Lutomirski
@ 2012-01-18  1:07                                 ` Roland McGrath
  2012-01-18  1:47                                   ` Indan Zupancic
  2012-01-18  1:48                                 ` Jamie Lokier
  2012-01-18  1:50                                 ` Andi Kleen
  3 siblings, 1 reply; 235+ messages in thread
From: Roland McGrath @ 2012-01-18  1:07 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Andi Kleen

On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
> Wait: If a tasks is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?

Well, saying it like that suggests that there is more of a "mode change"
than really exists.  It's simply that any task can use int $0x80 and
this always means using the 32-bit syscall table with TS_COMPAT set.

> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

I don't think there's another way.  hpa and I once discussed adding a field
to the extractable "register state" that would say which method the syscall
in progress had taken to enter the kernel.  That would tell you which
flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
But nobody ever had a real need for it, and we didn't pursue it further.
(We originally talked about it in the context of distinguishing whether a
32-bit task had used sysenter or syscall or int $0x80, I think.)

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

It's documented the same place the entire Linux machine-level ABI is
documented, which is nowhere.  Someone somewhere may once have been
counting on it.  (The story I heard was about an implementation of valgrind
for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
was really done.)  The general rule is that if it ever worked before in a
coherent way, we don't break binary compatibility.

In the implementation, it would require a special check to make it barf.
It's really just something that falls out of how the hardware and the
kernel implementation works.  I suppose you could add such a check under a
new kconfig option that's marked as being potentially incompatible with
some old applications.  Good luck with that.


* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:07                                 ` Roland McGrath
@ 2012-01-18  1:47                                   ` Indan Zupancic
  0 siblings, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18  1:47 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Andi Kleen

On Wed, January 18, 2012 02:07, Roland McGrath wrote:
> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>> Wait: If a tasks is set to 64 bit mode, but calls into the kernel via
>> int 0x80 it's changed to 32 bit mode for that system call and back to
>> 64 bit mode when the system call is finished!?
>
> Well, saying it like that suggests that there is more of a "mode change"
> than really exists.  It's simply that any task can use int $0x80 and
> this always means using the 32-bit syscall table with TS_COMPAT set.

True, the kernel always runs in 64-bit mode, it just selects which path
is taken.

>> Our ptrace jailer is checking cs to figure out if a task is a compat task
>> or not, if the kernel can change that behind our back it means our jailer
>> isn't secure for x86_64 with compat enabled. Or is cs changed before the
>> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
>> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
>> there another way?
>
> I don't think there's another way.  hpa and I once discussed adding a field
> to the extractable "register state" that would say which method the syscall
> in progress had taken to enter the kernel.  That would tell you which
> flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
> But nobody ever had a real need for it, and we didn't pursue it further.
> (We originally talked about it in the context of distinguishing whether a
> 32-bit task had used sysenter or syscall or int $0x80, I think.)

Argh. So strace and all other ptrace users will think the task is calling a
different system call than it executes, except if they check for int 0x80,
which I bet they don't.

I suppose I could cache the checked EIP-2's results, but then I also have to
check if the memory is read-only and invalidate the cache when the mapping may
be changed. Probably not worth the complexity.
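
For reference, the check itself is simple.  A minimal sketch with
hypothetical names, assuming the tracer has already read the two bytes
in front of the reported instruction pointer (for example with
PTRACE_PEEKTEXT); the opcode bytes are the standard x86 encodings:

```c
#include <stdint.h>

enum entry_insn {
	ENTRY_UNKNOWN,
	ENTRY_INT80,    /* cd 80: int $0x80 */
	ENTRY_SYSCALL,  /* 0f 05: syscall */
	ENTRY_SYSENTER, /* 0f 34: sysenter */
};

/* Classify the two bytes that precede the reported instruction pointer. */
enum entry_insn classify_entry(const uint8_t insn[2])
{
	if (insn[0] == 0xcd && insn[1] == 0x80)
		return ENTRY_INT80;
	if (insn[0] == 0x0f && insn[1] == 0x05)
		return ENTRY_SYSCALL;
	if (insn[0] == 0x0f && insn[1] == 0x34)
		return ENTRY_SYSENTER;
	return ENTRY_UNKNOWN;
}
```

The answer is only as trustworthy as the memory read behind it: if the
traced code can be rewritten between the trap and the read, the bytes
may no longer be the ones that were executed.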

>> I think this behaviour is so unexpected that it can only cause security
>> problems in the long run. Is anyone counting on this? Where is this
>> behaviour documented?
>
> It's documented the same place the entire Linux machine-level ABI is
> documented, which is nowhere.

AMD wrote the "System V Application Binary Interface" which describes
some Linux conventions. It's better than nothing. But it just mentions
'syscall', not what happens when int 0x80 is called anyway.

> Someone somewhere may once have been
> counting on it.  (The story I heard was about an implementation of valgrind
> for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
> was really done.)  The general rule is that if it ever worked before in a
> coherent way, we don't break binary compatibility.

Well, considering the code can't be sure if the kernel supports compat mode
at all, I think this case is getting even more obscure than it already is.
Disallowing it won't change the kernel behaviour compared to a kernel with
compat disabled.

What about disallowing this path when the task is being ptraced?

> In the implementation, it would require a special check to make it barf.
> It's really just something that falls out of how the hardware and the
> kernel implementation works.  I suppose you could add such a check under a
> new kconfig option that's marked as being potentially incompatible with
> some old applications.  Good luck with that.

That seems a hopeless path to follow, and won't solve my problem because
my code has to be able to run on all kernels. Half the point of using
ptrace for jailing was that it's mostly portable with no special kernel
support.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
  2012-01-18  1:01                                 ` Andrew Lutomirski
  2012-01-18  1:07                                 ` Roland McGrath
@ 2012-01-18  1:48                                 ` Jamie Lokier
  2012-01-18  1:50                                 ` Andi Kleen
  3 siblings, 0 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-18  1:48 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

Indan Zupancic wrote:
> On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
> > On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> On 01/17, Andrew Lutomirski wrote:
> >>>
> >>> (is_compat_task says whether the executable was marked as 32-bit.  The
> >>> actual execution mode is determined by the cs register, which the user
> >>> can control.
> >>
> >> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
> >> along with TS_COMPAT).
> >>
> >> TS_COMPAT says that, say, the task did "int 80" to enters the kernel.
> >> 64-bit or not, we should treat is as 32-bit in this case.
> >
> > I think you're right, and checking which entry was used is better than
> > checking the cs register (since 64-bit code can use int80).  That's
> > what I get for insufficiently careful reading of the assembly.  (And
> > for going from memory from when I wrote the vsyscall emulation code --
> > that code is entered from a page fault, so the entry point used is
> > irrelevant.)
> 
> Wait: If a tasks is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?
> 
> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

PTRACE_PEEKTEXT won't securely tell you if it's int 0x80 if there's
another thread modifying the code, or changing the mappings, or it's
executing from a file or shared memory that someone's writing to.

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

It's a surprise to me too.  And like you I'm using ptrace, to trace
what a process touches, not restrict it, but it's subject to the same problem.

This looks like it needs a kernel patch.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
                                                   ` (2 preceding siblings ...)
  2012-01-18  1:48                                 ` Jamie Lokier
@ 2012-01-18  1:50                                 ` Andi Kleen
  2012-01-18  2:00                                   ` Steven Rostedt
  2012-01-18  2:04                                   ` Jamie Lokier
  3 siblings, 2 replies; 235+ messages in thread
From: Andi Kleen @ 2012-01-18  1:50 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer

Every user program change it behind your back.

Your ptrace jailer isn't.

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

Look up far jumps in any x86 manual.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:50                                 ` Andi Kleen
@ 2012-01-18  2:00                                   ` Steven Rostedt
  2012-01-18  2:04                                   ` Jamie Lokier
  1 sibling, 0 replies; 235+ messages in thread
From: Steven Rostedt @ 2012-01-18  2:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Wed, 2012-01-18 at 02:50 +0100, Andi Kleen wrote:

> Every user program change it behind your back.
> 
> Your ptrace jailer isn't.

I'm sorry but I can't read the above two lines without hearing Yoda's
voice. "Hmm hmm"

-- Steve



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:50                                 ` Andi Kleen
  2012-01-18  2:00                                   ` Steven Rostedt
@ 2012-01-18  2:04                                   ` Jamie Lokier
  2012-01-18  2:22                                     ` Andi Kleen
  2012-01-18  2:27                                     ` Linus Torvalds
  1 sibling, 2 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-18  2:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

Andi Kleen wrote:
> > Our ptrace jailer is checking cs to figure out if a task is a compat task
> > or not, if the kernel can change that behind our back it means our jailer
> 
> Every user program change it behind your back.
..
> Look up far jumps in any x86 manual.

I'm pretty sure this isn't about changing cs or far jumps.

I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers.  It looks like a hole
in ptrace which could be fixed.

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:04                                   ` Jamie Lokier
@ 2012-01-18  2:22                                     ` Andi Kleen
  2012-01-18  2:25                                       ` Andrew Lutomirski
  2012-01-18  4:22                                       ` Indan Zupancic
  2012-01-18  2:27                                     ` Linus Torvalds
  1 sibling, 2 replies; 235+ messages in thread
From: Andi Kleen @ 2012-01-18  2:22 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Indan Zupancic, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

> I'm pretty sure this isn't about changing cs or far jumps

He's assuming that code can only run on two code segments and
not arbitrarily switch between them, which is a completely incorrect
assumption.

> I think Indan means code is running with 64-bit cs, but the kernel
> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> and there's no way for the ptracer to know which syscall the kernel
> will perform, even by looking at all registers.  It looks like a hole
> in ptrace which could be fixed.

Possibly, but anything that bases its security on ptrace is typically
unfixably racy (just think what happens with multiple threads
and syscall arguments), so it's unlikely to do any good.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:22                                     ` Andi Kleen
@ 2012-01-18  2:25                                       ` Andrew Lutomirski
  2012-01-18  4:22                                       ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-18  2:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jamie Lokier, Indan Zupancic, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 6:22 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I'm pretty sure this isn't about changing cs or far jumps
>
> He's assuming that code can only run on two code segments and
> not arbitrarily switch between them which is a completely incorrect
> assumption.

I think all he needs is to figure out which type of syscall was just
intercepted.  (Obviously arguments in memory are a problem.)

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:04                                   ` Jamie Lokier
  2012-01-18  2:22                                     ` Andi Kleen
@ 2012-01-18  2:27                                     ` Linus Torvalds
  2012-01-18  2:31                                       ` Andi Kleen
  1 sibling, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18  2:27 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Indan Zupancic, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 6:04 PM, Jamie Lokier <jamie@shareable.org> wrote:
>
> I think Indan means code is running with 64-bit cs, but the kernel
> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> and there's no way for the ptracer to know which syscall the kernel
> will perform, even by looking at all registers.  It looks like a hole
> in ptrace which could be fixed.

We could possibly munge the "orig_ax" field to be different for the
int80 vs syscall cases. That's really the only field that isn't direct
x86 state. And it's 64 bits wide, but we really only care about the
low 32 bits in the kernel. So a bit in the high bits that says "this
was a int80 entry" would be possible.
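
A minimal sketch of how such a bit could be produced and consumed,
assuming a flag position (bit 32) that nothing here fixes. The macro
ORIG_AX_INT80_FLAG and all three helper names are made up for
illustration; this is not existing kernel or ptrace API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: bit 32 of orig_ax means "entered via int $0x80".
 * The bit position is an assumption made for this sketch only. */
#define ORIG_AX_INT80_FLAG (UINT64_C(1) << 32)

/* What the kernel side might store: number in the low half, flag above it. */
static uint64_t orig_ax_encode(uint32_t nr, bool via_int80)
{
	uint64_t v = nr;

	if (via_int80)
		v |= ORIG_AX_INT80_FLAG;
	/* The flag must never alias the syscall number. */
	assert((uint32_t)v == nr);
	return v;
}

/* Tracer side: the syscall number lives in the low 32 bits. */
static uint32_t orig_ax_syscall_nr(uint64_t orig_ax)
{
	return (uint32_t)(orig_ax & UINT32_MAX);
}

/* Tracer side: was the hypothetical int80 marker set? */
static bool orig_ax_is_int80(uint64_t orig_ax)
{
	return (orig_ax & ORIG_AX_INT80_FLAG) != 0;
}
```

A tracer that masks orig_ax down to 32 bits before looking up the
syscall number would keep working unchanged with this encoding.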

                       Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:27                                     ` Linus Torvalds
@ 2012-01-18  2:31                                       ` Andi Kleen
  2012-01-18  2:46                                         ` Linus Torvalds
  0 siblings, 1 reply; 235+ messages in thread
From: Andi Kleen @ 2012-01-18  2:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andi Kleen, Indan Zupancic, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 06:27:19PM -0800, Linus Torvalds wrote:
> On Tue, Jan 17, 2012 at 6:04 PM, Jamie Lokier <jamie@shareable.org> wrote:
> >
> > I think Indan means code is running with 64-bit cs, but the kernel
> > treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> > and there's no way for the ptracer to know which syscall the kernel
> > will perform, even by looking at all registers.  It looks like a hole
> > in ptrace which could be fixed.
> 
> We could possibly munge the "orig_ax" field to be different for the
> int80 vs syscall cases. That's really the only field that isn't direct
> x86 state. And it's 64 bits wide, but we really only care about the
> low 32 bits in the kernel. So a bit in the high bits that says "this
> was a int80 entry" would be possible.

That would be incompatible. However you could just add another virtual
register with such information (in fact I thought about that
when I did the compat code originally). However I don't think it'll salvage
the original broken by design ptrace jailer. And everyone else
so far has done fine without it.

-Andi

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:31                                       ` Andi Kleen
@ 2012-01-18  2:46                                         ` Linus Torvalds
  2012-01-18 14:06                                           ` Martin Mares
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18  2:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jamie Lokier, Andi Kleen, Indan Zupancic, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 6:31 PM, Andi Kleen <ak@linux.intel.com> wrote:
>
> That would be incompatible.

No it wouldn't.

We'd only do it for the case that everybody gets wrong: int80 from a
64-bit context.

All the other cases are trivial to see (look at CS to determine 32-bit
vs 64-bit system call) and are the common case.

So the one new "incompatible" bit case would be the case that existing
users would inevitably get wrong, so it can hardly be "incompatible".

                  Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:37                           ` Will Drewry
@ 2012-01-18  4:06                             ` Indan Zupancic
  2012-01-18  4:38                               ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18  4:06 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

Hello,

I'll try to reduce the verbosity level a bit, for everyone's sake.

On Tue, January 17, 2012 18:37, Will Drewry wrote:
>> If BPF is always 32 bits and people have to deal with the 64 bit pain
>> anyway, you can as well have one fixed ABI that is always the same and
>> cross-platform by just making the arguments always 64 bit, with lower
>> and upper halves in a fixed order to avoid endianness problems. This
>> would be compat and future safe.
>
> What happens if unsigned long is no longer 64-bit in some distant future?

It won't matter, because unsigned long is not used directly and system calls
will stay 64 bit anyway.

Besides that, I don't think unsigned long will ever be bigger than 64-bit.
And if it did, by that time, we can only hope that BPF has gained 64-bit
support and hence a new ABI.

> As to endianness, fixed endianness means that userland programs that
> have a different endianness will need to translate their values.  It's
> just shifting the work around.

No, they never need to translate values. BPF is 32 bit, it doesn't work
on longs directly. If you make it explicit where the upper half of a
64-bit value goes, it can be set and read directly. Only when you access
one half of a 64-bit value through a pointer do you need to worry about
endianness. That doesn't happen with my proposal, it does if the data
you expose contains longs.
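
To make the distinction concrete, here is a minimal userland sketch
(not kernel code; struct split_arg and both helper names are made up
for illustration). Storing the halves in separate fields gives the same
result on any host, while reading the first 32-bit word of a 64-bit
value in memory does not:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed layout: the two halves are separate 32-bit fields. */
struct split_arg {
	uint32_t low;
	uint32_t high;
};

/* Split by value: no pointer games, so host endianness never shows up. */
static struct split_arg split_u64(uint64_t v)
{
	struct split_arg s = { (uint32_t)v, (uint32_t)(v >> 32) };

	/* Round-trip check: the halves reassemble to the original value. */
	assert((((uint64_t)s.high << 32) | s.low) == v);
	return s;
}

/* Read the first four bytes of the 64-bit object as one 32-bit word.
 * This yields the low half on little-endian and the high half on
 * big-endian hosts, which is the problem with exposing longs. */
static uint32_t first_word_in_memory(uint64_t v)
{
	uint32_t w;

	memcpy(&w, &v, sizeof(w));
	return w;
}
```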

> Not really.  I lock down the compat case.  _Even_ with fixed 64-bit
> arguments, you still get system call number mismatches which mean you
> need to keep independent filters for the same task. I had this in one
> of my first implementations and it adds a nasty amount of implicit
> logic during evaluation.

Well, that's why I proposed to have a way to set filters per personality.
Then applications can just blindly install both the 32 and the 64 bit
filters and be assured the kernel picks up the right one. A -1 means
"use current personality", anything else uses the personality given.

Without a persona option you can only install filters for the current
personality, which can be a hassle, especially if the task gets killed
if it changes mode and calls a system call.

>> It's not portable because it is different for every arch.
>
> I was describing the kernel code, not the data set.  By using
> regviews, I get the consistent register view for the personality of
> the process for the architecture it is running on.  This means that
> the user_regs_struct will always be consistent _for the architecture_
> when given to the user's BPF code.  It does not create a portable
> userland ABI but instead uses the existing ABI in a way that is
> arch-agnostic in the kernel (using the regviews interfaces for arch
> fixup).

I argue from user space's point of view, which probably explains most
of our disagreements. :-)

>> And they will get the offsets wrong if they are on 64 bits because those are
>> different than for ptrace. The ptrace ABI uses longs, BPF is fixed to 32 bits,
>> it's just not a good fit.
>
> That's not true on x86-64:
>
>  http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L945
>
> PEEKUSR uses the offsetof() the 32-bit register struct for compat
> calls and, with that macro, maps it to the proper entry in pt_regs.
> For non-compat, it just uses the offset into pt_regs:
>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L475
>   lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L170
>
> There is significant overlap in the contents of user_regs_struct
> and pt_regs on the platform. While it's possible for another arch to
> use a different set of ptrace ABI offsets, using offsetof(struct
> user_regs_struct, ...) will always work.  It is not long-based.
>
> If you want to convince me this isn't a good fit, I need you to meet
> me halfway and make sure your assertions match the code! :)

You're right, the offset is in bytes, not words.

I was looking at my own code, which uses an array of longs and indexes
into that, and got defines for the register indices. So it's my own
abstraction layer that's long-based, not ptrace. With ptrace I use
PTRACE_GETREGS, so I don't use offsets there.

>>>> What special clone/fork registers are you talking about?
>>>
>>> On x86, si is used to indicate the tls area:
>>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
>>> (and r8 on 64-bit).  Also segment registers, etc.
>>
>> si is just the 4th (5th for x86_64) system call argument, it's nothing special.
>
> True - arguably though fork() takes no arguments.  Now
> syscall_get_arguments() won't know that, so si/r8 and all will still
> be exposed for filtering.  For some reason, I thought some other
> pieces were used (EFLAGS?), but I can't back that up in the source.

si is only used when the CLONE_SETTLS flag is set. Fork doesn't set that
flag; it comes from a clone argument.

>> System calls never use segment registers directly, do they? It's not part
>> of the system call ABI, so why would you want to use them in filtering?
>
> On x86-32, you may be using segmentation for isolation purposes.  If
> you do so, it may be of interest where the call comes from.  Nothing
> truly relevant, just another possibility.  *shrug*

Isn't EIP an absolute address, corrected for any segmentation?
Or relative to fixed segment registers user space can't change?
Doesn't really matter I suppose.

>
>>>> I don't think anyone would want to ever filter sig_rt_return, you can as
>>>> well kill the process.
>>>
>>> Why make that decision for them?
>>
>> I don't, I'm just saying it doesn't make sense to filter it. Based on what
>> would anyone ever want to filter it? It's just a kernel helper thing to
>> implement signal handlers.
>
> What if you want the process to die if it returns from any signals?

Why would you want that? You can as well send it a SIGKILL at random moments.

>> But now you mention it, it won't be bad to enforce this in the kernel,
>> otherwise everyone has to add code to allow it. Same for exit(_group).
>> Because if those are denied, there's nothing else to run instead.
>
> The process will be do_exit()d.  I don't know why it matters?

Ah, yes, because you kill the process if the filter fails.

Won't it change the return value and exit state?

And in the case of exit_group, do you kill just the thread or the whole
process? Whatever you choose, it either won't be possible to exit one
thread anymore, or any thread exiting kills all threads.

Seems better to always allow it. You may not like the overhead of checking
specific system calls, but it's either hardcoded or done in every filter.
I suppose it doesn't really matter. If it can be done easily in the kernel,
then just do it there. If it's too much of a hassle, push it to the filters.

>> But yeah, better to not provide the instruction or stack pointers indeed.
>> At least the instruction pointer gives some system call related information
>> (from where it is called).
>
> Yup - it's nice to have that.

And it would round up the total of data to 8 fields. My main concern would be
that people try to use it for simplistic security checks and trust it too
much.

>>> That said, I can imagine filtering on other registers as being useful
>>> for tentative research.
>>
>> They can use ptrace for that.
>
> And it will stay research forever.

Speaking from experience, you're probably right. But in our case we really
want to be able to modify registers too, mostly the args. We use the EIP
for system call restarting. But that's not in a performance sensitive path
(once per execve). Modifying the registers in filters is not that useful
for our case because we can't copy paths to read-only memory from
BPF anyway.

>> What non-argument register would you like to use on x86? I think all
>> are used up already. All you got left is the segment registers, and
>> using those seems a bad idea.
>
> There are other arches where this would be feasible.

Yes. Like I said:
>> It also seems a bad idea to promote non-portable BPF filtering programs.

> Why?  If it's possible to make a userland abstraction layer, why do we
> force the kernel to take on more work?

Non-portable implies that it is not possible to make a userspace abstraction
layer. Letting the kernel take on slightly more work avoids non-portable
filter programs. Yes, this is limiting, but I think limiting filters isn't
a bad idea.

>
>> If you support modifying arguments and syscall nr then people can keep
>> doing the XORing trick with BPF. Another advantage of allowing that is
>> that unsafe old system calls can be replaced with secure ones on the
>> fly transparently.
>>
>> Really, disallowing modifications is much more limiting than not providing
>> all registers. But allowing modifications is a lot harder to get right
>> with a register interface.
>
> I'm not going to make the change to support BPF making the data
> mutable or using scratch space as an output vector. If that is
> something that makes sense to the networking stack, then we could
> benefit from it, but I don't want to go there.

It doesn't make sense for the networking stack, but it would make BPF
usable for more use cases when used for system call filtering, like
that XOR thing. For our jailer it's not very useful because we need
to copy data to read-only memory. Only other cases where we modify
data are replacing (v)fork with clone with CLONE_PTRACE set for 2.4
kernels, and system call restarting once per execve.

I don't have strong feelings about it, but not supporting it is more
limiting than not providing all registers. That's all.

> It does mean BPF is per-arch, per syscall-convention, but you know I
> am fine with that :)  I do think the performance gains could be well
> worth avoiding any copying.  (Perf gains are the strongest argument I
> think for your proposal and the thing that would likely lead me to do
> it.)

Aesthetics is my main argument against using register views.

>> So call it once and store the value in a long. Then copy the low half
>> to the right place and then the upper half when on 64 bits. It may not
>> look too pretty, but the compiler should be able to optimise almost all
>> overhead away and end up with 6 (or 12) int copies. Something like this:
>>
>> struct bpf_data {
>>        uint32_t syscall_nr;
>>        uint32_t arg_low[MAX_SC_ARGS];
>>        uint32_t arg_high[MAX_SC_ARGS];
>> };
>>
>> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
>> {
>>        int i;
>>        unsigned long arg;
>>
>>        d->syscall_nr = syscall_get_nr(t, r);
>>        for (i = 0; i < MAX_SC_ARGS; ++i) {
>>                /* fetch argument i on its own */
>>                syscall_get_arguments(t, r, i, 1, &arg);
>>                d->arg_low[i] = arg;
>>                /* widen first: shifting a 32-bit long by 32 is undefined */
>>                d->arg_high[i] = (uint64_t)arg >> 32;
>>        }
>> }
>
> Sure, but it seems weird to keep the arguments high and low not
> adjacent, but I realize you want an arch-independent interface.

Yes, that was the idea.

You can achieve the same with an adjacent layout, but then it's easier
to get endianness wrong.

> I guess in that world, you could do this:
>
> {
>   int32_t nr;
>   uint32_t arg32[MAX_SC_ARGS];
>   uint32_t arg64[MAX_SC_ARGS];
>   /* room for future expansion is here */
> };

Isn't that the same?

> Yes - per-arch. If you're already doing per-arch fixup, why is
> mapping 6 extra registers such a burden?  If the code can't do that in
> userspace, it is slacking.

Fair enough.

>>> Exactly why I'm using user_regs_struct.  I think we've been having
>>> some cross-talk, but I'm not sure.  The only edge case I can find with
>>> user_regs_struct across all platforms is the nasty x86 32-bit entry
>>> from 64-bit personality.  Perhaps someday we can just nuke that path
>>> :)  But even so, it is tightly constrained by saving the personality
>>> and compat flag in the filter program metadata and checking it at each
>>> syscall.
>>
>> I think it's a good idea to nuke that path, it seems like a security hole
>> waiting to happen.
>
> Agreed. And yet we can't (yet?).

Apparently unwritten obscure behaviour may not be broken, in case someone
counts on it. I disagree, because any users can't be sure that compat mode
is supported anyway, so the behaviour is no different than a 64 bit kernel
without compat mode. Oh well.

>> If your ABI is too hard to use directly, it won't be used at all.
>> Any excuse that people won't use this ABI directly is a sign that
>> it is not good enough.
>
> That is blatantly untrue. Have you ever used tcpdump's expression
> language for filtering packets? Wireshark?

Okay, "at all" was stretching it a bit.

>> And the more complicated you make it, the less likely it is that
>> anyone will use this.
>
> Unless there is a nice library that makes it work well.

I beg you, please don't count on this.

BPF will be used for security sensitive code, you really want to keep it simple
and easy to use, otherwise you're just encouraging bugs to happen. Using registers
only adds subtleties for just a tiny performance gain, at best.

>> It could if you make the data part of the scratch memory. If you put the
>> data at the top, just after BPF_MEMWORDS, then it's all compatible with
>> the read-only version. Except the sk_chk_filter() code. But if you ever
>> want to consolidate with the networking version, then you already need
>> new non-byteswapping instructions. You can as well add a special modify
>> instruction too then. Making it very explicit seems better anyway.
>
> I am not going to go this route right now.  If you want to, be my
> guest. We can add BPF instructions later, but I am not going down that
> rabbit hole now.

Agreed.

>> Using BPF for system call filtering changes its very nature already.
>
> No it doesn't.  user_regs_struct becomes another data protocol.

I can't decide whether I agree or disagree. In a way it makes perfect sense,
while yet something inside me screams it's not the same at all.

>> I must say that until your patch came up, I've never heard of BPF filters
>> before. I think I'm going to use it in our ptrace jailer for network
>> filtering, if it's possible to get the peer address for normal TCP/UDP
>> sockets. Documentation is quite vague.
>
> Cool!

I might be the first one to use it on regular sockets instead of raw or
packet sockets though. I fear I'll only get the packet data and no IP
header. But I haven't tried it yet.

>> You could also do the BPF filtering later in the system call entry path
>> when the arguments are passed directly, but then it's harder to interact
>> well with ptrace.
>
> Where? Interpose it into each system call?  The later I put it, the
> less attack surface reduction I get.  The whole point of this
> framework is to reduce the kernel's attack surface by doing minimal
> kernel-side work before making a policy decision.

Agreed.

> Also, I've already gone the ftrace route. As I said in the writeup, I
> don't think it is the right path for this sort of functionality.

Indeed.

>> Ideally, the BPF filter should be able to deny the system call with a
>> specific error code, deny the call and kill the task, have a way to
>> defer to ptrace, and a way to allow it.
>
> Not happening (by my hand :).  I'm not changing seccomp to allow it to
> cause a system call to fail with an error code. I'll add support for
> tracehook integration if this patch can get merged, but I'm not going
> to change the basic semantics of seccomp.

I guess the difference is that you have a somewhat controlled environment
and aren't trying to run arbitrary programs in your jail. Because standard
programs are trying to do stuff they don't really have to all the time,
and killing them for each silly offence would make the jail useless.

It is also crucial in handling new system calls. Glibc does that all the
time, it tries the new version, and if it's not supported, it falls back
to the old one. That's why we return ENOSYS for denied system calls and
that seems to work pretty well.
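
For illustration, the fallback pattern looks roughly like this in
userland. It is a sketch in the spirit of what is described above, not
glibc's actual code; pipe_cloexec is a made-up helper name, and pipe2
vs. pipe is just one convenient new/old pair:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a close-on-exec pipe. Try the newer pipe2 system call first;
 * if the kernel (or a jail) answers ENOSYS, fall back to pipe + fcntl. */
static int pipe_cloexec(int fds[2])
{
	assert(fds != NULL);

	if (syscall(SYS_pipe2, fds, O_CLOEXEC) == 0)
		return 0;
	if (errno != ENOSYS)
		return -1;	/* a real failure, not "unsupported" */

	/* Old path: two steps instead of one atomic call. */
	if (pipe(fds) != 0)
		return -1;
	if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) != 0 ||
	    fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}
```

A jail that kills the task on the first call never lets the program
reach the fallback, whereas one that returns ENOSYS does.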

I guess it would be okay if BPF could somehow defer to ptrace instead of
killing. But without that, this BPF filtering is useless for our generic
jail.

I don't know anything about tracehooks, is that related to ftrace?

> The nice thing is, if we
> reserve return values, this functionality can be layered on later
> without it causing any ABI breakage and with proper consideration
> independent of whether the basic functionality gets merged.

That seems like a good idea.

> Then, if
> you want to retool the entire seccomp path on all architectures to allow
> graceful system call failure, it'd be totally doable.

It's trivial if ENOSYS is always returned, then you just change the syscall
nr (that's what I do now). But specific return values are indeed harder
and maybe not worth the trouble.

If this gets in and no one else does it, I'll try to add ptrace support
so they work nicely together.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:22                                     ` Andi Kleen
  2012-01-18  2:25                                       ` Andrew Lutomirski
@ 2012-01-18  4:22                                       ` Indan Zupancic
  2012-01-18  5:23                                         ` Linus Torvalds
  2012-01-18  5:43                                         ` Chris Evans
  1 sibling, 2 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18  4:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jamie Lokier, Andi Kleen, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, January 18, 2012 03:22, Andi Kleen wrote:
>> I'm pretty sure this isn't about changing cs or far jumps
>
> He's assuming that code can only run on two code segments and
> not arbitrarily switch between them which is a completely incorrect
> assumption.

All I assumed up to now was that cs shows the current mode of the process,
and that that defines which system call path is taken. Apparently that is
not true and int 0x80 forces the compat system call path.

Looking at EIP - 2 seems like a secure way to check how we entered the kernel.
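
A rough sketch of what such a check might look like. The opcode bytes
are the standard x86 encodings; the enum, the function name, and the
assumption that the tracer has already fetched the two bytes in front
of the reported instruction pointer (for instance with PTRACE_PEEKTEXT)
are made up for illustration. The sketch does not address whether those
bytes can change between the check and the system call:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum entry_insn {
	ENTRY_UNKNOWN,
	ENTRY_INT80,
	ENTRY_SYSCALL,
	ENTRY_SYSENTER
};

/* insn holds the two bytes just before the reported instruction pointer. */
static enum entry_insn classify_entry(const uint8_t insn[2])
{
	assert(insn != NULL);

	if (insn[0] == 0xcd && insn[1] == 0x80)
		return ENTRY_INT80;	/* int $0x80 */
	if (insn[0] == 0x0f && insn[1] == 0x05)
		return ENTRY_SYSCALL;	/* syscall */
	if (insn[0] == 0x0f && insn[1] == 0x34)
		return ENTRY_SYSENTER;	/* sysenter */
	return ENTRY_UNKNOWN;
}
```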

>> I think Indan means code is running with 64-bit cs, but the kernel
>> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
>> and there's no way for the ptracer to know which syscall the kernel
>> will perform, even by looking at all registers.

Yes, that's what I meant.

>> It looks like a hole in ptrace which could be fixed.
>
> Possibly, but anything that bases its security on ptrace is typically
> unfixably racy (just think what happens with multiple threads
> and syscall arguments), so it's unlikely to do any good.

As far as I know, we fixed all races except symlink races caused by malicious
code outside the jail. Those are controllable by limiting what filesystem access
the prisoners get. A special open() flag which causes open to fail when a part
of the path is a symlink with a distinguishable error code would solve this for
us.

Other than that and the abysmal performance, ptrace is fine for jailing.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-18  4:06                             ` Indan Zupancic
@ 2012-01-18  4:38                               ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-18  4:38 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris

On Tue, Jan 17, 2012 at 10:06 PM, Indan Zupancic <indan@nul.nu> wrote:
> Hello,
>
> I'll try to reduce the verbosity level a bit, for everyone's sake.

Me too - I wasn't speaking just about your verbosity level :)

> On Tue, January 17, 2012 18:37, Will Drewry wrote:
>>> If BPF is always 32 bits and people have to deal with the 64 bit pain
>>> anyway, you can as well have one fixed ABI that is always the same and
>>> cross-platform by just making the arguments always 64 bit, with lower
>>> and upper halves in a fixed order to avoid endianness problems. This
>>> would be compat and future safe.
>>
>> What happens if unsigned long is no longer 64-bit in some distant future?
>
> It won't matter, because unsigned long is not used directly and system calls
> will stay 64 bit anyway.
>
> Besides that, I don't think unsigned long will ever be bigger than 64-bit.
> And if it did, by that time, we can only hope that BPF has gained 64-bit
> support and hence a new ABI.

There is a BPF major and minor number exported from linux/filter.h so
it's possible to add BPF instructions without needing to change the
data layout, but yeah - anything beyond 64-bit is stretching it at
this point I guess.

>> As to endianness, fixed endianness means that userland programs that
>> have a different endianness will need to translate their values.  It's
>> just shifting the work around.
>
> No, they never need to translate values. BPF is 32 bit, it doesn't work
> on longs directly. If you make it explicit where the upper half of a
> 64-bit value goes, it can be set and read directly. Only when you access
> one half of a 64-bit value through a pointer do you need to worry about
> endianness. That doesn't happen with my proposal, it does if the data
> you expose contains longs.

True enough and data adjacency is all an illusion to BPF anyway.  As
long as the struct layout is respected, the BPF engine can
index/offset/etc however it likes.

>> Not really.  I lock down the compat case.  _Even_ with fixed 64-bit
>> arguments, you still get system call number mismatches which mean you
>> need to keep independent filters for the same task. I had this in one
>> of my first implementations and it adds a nasty amount of implicit
>> logic during evaluation.
>
> Well, that's why I proposed to have a way to set filters per personality.
> Then applications can just blindly install both the 32 and the 64 bit
> filters and be assured the kernel picks up the right one. A -1 means
> "use current personality", anything else uses the personality given.
>
> Without a persona option you can only install filters for the current
> personality, which can be a hassle, especially if the task gets killed
> if it changes mode and calls a system call.

In general, this is a good thing.  Very few legitimate programs change
personality after putting themselves in any sort of restricted
environment. I've only seen that used to escape ptrace jails.

>>> It's not portable because it is different for every arch.
>>
>> I was describing the kernel code, not the data set.  By using
>> regviews, I get the consistent register view for the personality of
>> the process for the architecture it is running on.  This means that
>> the user_regs_struct will always be consistent _for the architecture_
>> when given to the user's BPF code.  It does not create a portable
>> userland ABI but instead uses the existing ABI in a way that is
>> arch-agnostic in the kernel (using the regviews interfaces for arch
>> fixup).
>
> I argue from user space's point of view, which probably explains most
> of our disagreements. :-)

Makes sense!

>>> And they will get the offsets wrong if they are on 64 bits because those are
>>> different than for ptrace. The ptrace ABI uses longs, BPF is fixed to 32 bits,
>>> it's just not a good fit.
>>
>> That's not true on x86-64:
>>
>>  http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L945
>>
>> PEEKUSR uses the offsetof() the 32-bit register struct for compat
>> calls and, with that macro, maps it to the proper entry in pt_regs.
>> For non-compat, it just uses the offset into pt_regs:
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L475
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L170
>>
>> There is significant overlap in the contents of user_regs_struct
>> and pt_regs on the platform. While it's possible for another arch to
>> use a different set of ptrace ABI offsets, using offsetof(struct
>> user_regs_struct, ...) will always work.  It is not long-based.
>>
>> If you want to convince me this isn't a good fit, I need you to meet
>> me halfway and make sure your assertions match the code! :)
>
> You're right, the offset is in bytes, not words.
>
> I was looking at my own code, which uses an array of longs and indexes
> into that, and got defines for the register indices. So it's my own
> abstraction layer that's long-based, not ptrace. With ptrace I use
> PTRACE_GETREGS, so I don't use offsets there.

Makes sense!

>>>>> What special clone/fork registers are you talking about?
>>>>
>>>> On x86, si is used to indicate the tls area:
>>>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
>>>> (and r8 on 64-bit).  Also segment registers, etc.
>>>
>>> si is just the 4th (5th for x86_64) system call argument, it's nothing special.
>>
>> True - arguably though fork() takes no arguments.  Now
>> syscall_get_arguments() won't know that, so si/r8 and all will still
>> be exposed for filtering.  For some reason, I thought some other
>> pieces were used (EFLAGS?), but I can't back that up in the source.
>
> si is only used when the CLONE_SETTLS flag is set. Fork doesn't set that flag,
> it comes from a clone argument.

Doh. No idea what I was thinking about then.

>>> System calls never use segment registers directly, do they? It's not part
>>> of the system call ABI, so why would you want to use them in filtering?
>>
>> On x86-32, you may be using segmentation for isolation purposes.  If
>> you do so, it may be of interest where the call comes from.  Nothing
>> truly relevant, just another possibility.  *shrug*
>
> Isn't EIP an absolute address, corrected for any segmentation?
> Or relative to fixed segment registers user space can't change?
> Doesn't really matter I suppose.

Nah. I think my user_regs_struct proposal is dead on its feet.  Roland
pointed out a much stronger reason to drop it than even performance.
Right now, seccomp takes the system call slow-path which adds some
pretty heavy overheads, but there is very little reason that it
couldn't take a fast-path.  If I add a dependency on all register
state being available, then the fast-path is never an option again.
I'd prefer to dream of a fast-path seccomp than to have access to all
the register state. :)  (On an Atom, the overhead is on the order of a
few hundred nanoseconds iirc.)

>>
>>>>> I don't think anyone would want to ever filter rt_sigreturn, you can as
>>>>> well kill the process.
>>>>
>>>> Why make that decision for them?
>>>
>>> I don't, I'm just saying it doesn't make sense to filter it. Based on what
>>> would anyone ever want to filter it? It's just a kernel helper thing to
>>> implement signal handlers.
>>
>> What if you want the process to die if it returns from any signals?
>
> Why would you want that? You can as well send it a SIGKILL at random moments.

I look at this from the case where there is no supervisor.  An example
might be if you want to fire up a sandbox and want to make sure that no
signals are handled by the untrusted code.  A bit unnatural, but I
could see it being desirable in some very limited cases.

>>> But now you mention it, it won't be bad to enforce this in the kernel,
>>> otherwise everyone has to add code to allow it. Same for exit(_group).
>>> Because if those are denied, there's nothing else to run instead.
>>
>> The process will be do_exit()d.  I don't know why it matters?
>
> Ah, yes, because you kill the process if the filter fails.
>
> Won't it change the return value and exit state?

Nope - that'd require a call to syscall_set_error.  The entry point to
seccomp, __secure_computing, does not return a value. So you'd have to
call the helper and not all arches that support seccomp also support
syscall_*.  Though it now seems like syscall_get_arguments will be a
requirement for seccomp_filter so it is fair to reconsider the
syscall_* helpers.  The only trickiness is that calling them could
change the subsequent behavior of ptrace or syscall audits since they
are linearly evaluated in the slow-path (on x86 - not all arches).

> And in the case of exit_group, do you kill just the thread or the whole
> process? Whatever you choose, it either won't be possible to exit one
> thread anymore, or any thread exiting kills all threads.

Yeah

> Seems better to always allow it. You may not like the overhead of checking
> specific system calls, but it's either hardcoded or done in every filter.
> I suppose it doesn't really matter. If it can be done easily in the kernel,
> then just do it there. If it's too much of a hassle, push it to the filters.

Checking specific syscall numbers is a PITA.  For all CONFIG_COMPAT
supporting arches, you have to add a seccomp_syscall_blah and
seccomp_syscall_blah_32 so you filter the right numbers.  Better to
leave it to the filters :/

>>> But yeah, better to not provide the instruction or stack pointers indeed.
>>> At least the instruction pointer gives some system call related information
>>> (from where it is called).
>>
>> Yup - it's nice to have that.
>
> And it would round up the total of data to 8 fields. My main concern would be
> that people try to use it for simplistic security checks and trust it too
> much.

Yeah though even without TOCTOU vulnerability, people could still
write crazy dumb filters :)

>>>> That said, I can imagine filtering on other registers as being useful
>>>> for tentative research.
>>>
>>> They can use ptrace for that.
>>
>> And it will stay research forever.
>
> Speaking from experience, you're probably right. But in our case we really
> want to be able to modify registers too, mostly the args. We use the EIP
> for system call restarting. But that's not in a performance sensitive path
> (one per execve). Modifying the registers in filters is not that useful
> for our case because we can't copy paths to read-only memory from the
> BPF anyway.

Yeah - this would require the ptrace callout.

>>> What non-argument register would you like to use on x86? I think all
>>> are used up already. All you got left is the segment registers, and
>>> using those seems a bad idea.
>>
>> There are other arches where this would be feasible.
>
> Yes. Like I said:
>>> It also seems a bad idea to promote non-portable BPF filtering programs.
>
>> Why?  If it's possible to make a userland abstraction layer, why do we
>> force the kernel to take on more work?
>
> Non-portable implies that it is not possible to make a userspace abstraction
> layer. Letting the kernel take on slightly more work avoids non-portable
> filter programs. Yes, this is limiting, but I think limiting filters isn't
> a bad idea.

Well I believe it was possible to create a userspace abstraction
layer, but I don't think that matters as much now. The question will
be more around how the final syscall args get laid out and if we need
a syscall_arch identifier. (I think)

>>
>>> If you support modifying arguments and syscall nr then people can keep
>>> doing the XORing trick with BPF. Another advantage of allowing that is
>>> that unsafe old system calls can be replaced with secure ones on the
>>> fly transparently.
>>>
>>> Really, disallowing modifications is much more limiting than not providing
>>> all registers. But allowing modifications is a lot harder to get right
>>> with a register interface.
>>
>> I'm not going to make the change to support BPF making the data
>> mutable or using scratch space as an output vector. If that is
>> something that makes sense to the networking stack, then we could
>> benefit from it, but I don't want to go there.
>
> It doesn't make sense for the networking stack, but it would make BPF
> usable for more use cases when used for system call filtering, like
> that XOR thing. For our jailer it's not very useful because we need
> to copy data to read-only memory. Only other cases where we modify
> data are replacing (v)fork with clone with CLONE_PTRACE set for 2.4
> kernels, and system call restarting once per execve.
>
> I don't have strong feelings about it, but not supporting it is more
> limiting than not providing all registers. That's all.

Agreed, but it is how the engine is.  It's possible to look at adding
it in the future, but I wouldn't want to start out of the gate with
it. It can have a further-reaching impact than just seccomp (via the
slow-path).

>> It does mean BPF is per-arch, per syscall-convention, but you know I
>> am fine with that :)  I do think the performance gains could be well
>> worth avoiding any copying.  (Perf gains are the strongest argument I
>> think for your proposal and the thing that would likely lead me to do
>> it.)
>
> Aesthetics is my main argument against using register views.

Fair enough!

>>> So call it once and store the value in a long. Then copy the low half
>>> to the right place and then the upper half when on 64 bits. It may not
>>> look too pretty, but the compiler should be able to optimise almost all
>>> overhead away and end up with 6 (or 12) int copies. Something like this:
>>>
>>> struct bpf_data {
>>>        uint32 syscall_nr;
>>>        uint32 arg_low[MAX_SC_ARGS];
>>>        uint32 arg_high[MAX_SC_ARGS];
>>> };
>>>
>>> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
>>> {
>>>        int i;
>>>        unsigned long arg;
>>>
>>>        d->syscall_nr = syscall_get_nr(t, r);
>>>        for (i = 0; i < MAX_SC_ARGS; ++i){
>>>                syscall_get_arguments(t, r, i, 1, &arg);
>>>                d->arg_low[i] = arg;
>>>                d->arg_high[i] = arg >> 32;
>>>        }
>>> }
>>
>> Sure, but it seems weird to keep the arguments high and low not
>> adjacent, but I realize you want an arch-independent interface.
>
> Yes, that was the idea.
>
> You can achieve the same with an adjacent layout, but then it's easier
> to get endianness wrong.
>
>> I guess in that world, you could do this:
>>
>> {
>>   int32_t nr;
>>   uint32_t arg32[MAX_SC_ARGS];
>>   uint32_t arg64[MAX_SC_ARGS];
>>   /* room for future expansion is here */
>> };
>
> Isn't that the same?

Yup - I was pointing out the comment and that your proposal would be
friendly to arbitrary expansion.
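
For anyone following along, here is a minimal userspace sketch of the
split-halves layout discussed above. It is only an illustration under
stated assumptions: the helper names bpf_data_set_arg() and
bpf_data_get_arg() are hypothetical, and the argument is carried as a
uint64_t so the shift stays well defined when long is 32 bits. It is not
the code from the patch series.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SC_ARGS 6

/* Layout from the thread: low and high halves live in separate arrays,
 * so a 32-bit filter never has to know the host byte order. */
struct bpf_data {
	uint32_t syscall_nr;
	uint32_t arg_low[MAX_SC_ARGS];
	uint32_t arg_high[MAX_SC_ARGS];
};

/* Hypothetical helper: store one argument as two 32-bit words. */
void bpf_data_set_arg(struct bpf_data *d, unsigned int i, uint64_t arg)
{
	assert(i < MAX_SC_ARGS);
	d->arg_low[i] = (uint32_t)arg;
	d->arg_high[i] = (uint32_t)(arg >> 32);	/* 0 for 32-bit values */
}

/* Hypothetical helper: rebuild the full value from the two halves. */
uint64_t bpf_data_get_arg(const struct bpf_data *d, unsigned int i)
{
	assert(i < MAX_SC_ARGS);
	return ((uint64_t)d->arg_high[i] << 32) | d->arg_low[i];
}
```

A filter that only cares about 32-bit values can compare arg_low[i] and
require arg_high[i] to be zero, with the same offsets on every arch.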

>> Yes - per-arch. If you're already doing per-arch fixup, why is
>> mapping 6 extra registers such a burden?  If the code can't do that in
>> userspace, it is slacking.
>
> Fair enough.
>
>>>> Exactly why I'm using user_regs_struct.  I think we've been having
>>>> some cross-talk, but I'm not sure.  The only edge case I can find with
>>>> user_regs_struct across all platforms is the nasty x86 32-bit entry
>>>> from 64-bit personality.  Perhaps someday we can just nuke that path
>>>> :)  But even so, it is tightly constrained by saving the personality
>>>> and compat flag in the filter program metadata and checking it at each
>>>> syscall.
>>>
>>> I think it's a good idea to nuke that path, it seems like a security hole
>>> waiting to happen.
>>
>> Agreed. And yet we can't (yet?).
>
> Apparently unwritten obscure behaviour may not be broken, in case someone
> counts on it. I disagree, because any users can't be sure that compat mode
> is supported anyway, so the behaviour is no different than a 64 bit kernel
> without compat mode. Oh well.

Pretty painful. A big reason why ptrace jails are defense in depth but
not a perfect solution and why I didn't want to merge seccomp and
ptrace directly (by making ptrace-bpf).

>>> If your ABI is too hard to use directly, it won't be used at all.
>>> Any excuse that people won't use this ABI directly is a sign that
>>> it is not good enough.
>>
>> That is blatantly untrue. Have you ever used tcpdump's expression
>> language for filtering packets? Wireshark?
>
> Okay, "at all" was stretching it a bit.
>
>>> And the more complicated you make it, the less likely it is that
>>> anyone will use this.
>>
>> Unless there is a nice library that makes it work well.
>
> I beg you, please don't count on this.
>
> BPF will be used for security sensitive code, you really want to keep it simple
> and easy to use, otherwise you're just encouraging bugs to happen. Using registers
> only adds subtleties for just a tiny performance gain, at best.

Makes sense. My goal was a clean, simple to validate implementation in
the kernel with a usable userland API.  It sounds like it'll need to
be more like your proposal though.

>>> It could if you make the data part of the scratch memory. If you put the
>>> data at the top, just after BPF_MEMWORDS, then it's all compatible with
>>> the read-only version. Except the sk_chk_filter() code. But if you ever
>>> want to consolidate with the networking version, then you already need
>>> new non-byteswapping instructions. You can as well add a special modify
>>> instruction too then. Making it very explicit seems better anyway.
>>
>> I am not going to go this route right now.  If you want to, be my
>> guest. We can add BPF instructions later, but I am not going down that
>> rabbit hole now.
>
> Agreed.
>
>>> Using BPF for system call filtering changes its very nature already.
>>
>> No it doesn't.  user_regs_struct becomes another data protocol.
>
> I can't decide whether I agree or disagree. In a way it makes perfect sense,
> while yet something inside me screams it's not the same at all.

Then my original design succeeded :)

>>> I must say that until your patch came up, I've never heard of BPF filters
>>> before. I think I'm going to use it in our ptrace jailer for network
>>> filtering, if it's possible to get the peer address for normal TCP/UDP
>>> sockets. Documentation is quite vague.
>>
>> Cool!
>
> I might be the first one to use it on regular sockets instead of raw or
> packet sockets though. I fear I'll only get the packet data and no IP
> header. But I haven't tried it yet.
>
>>> You could also do the BPF filtering later in the system call entry path
>>> when the arguments are passed directly, but then it's harder to interact
>>> well with ptrace.
>>
>> Where? Interpose it into each system call?  The later I put it, the
>> less attack surface reduction I get.  The whole point of this
>> framework is to reduce the kernel's attack surface by doing minimal
>> kernel-side work before making a policy decision.
>
> Agreed.
>
>> Also, I've already gone the ftrace route. As I said in the writeup, I
>> don't think it is the right path for this sort of functionality.
>
> Indeed.
>
>>> Ideally, the BPF filter should be able to deny the system call with a
>>> specific error code, deny the call and kill the task, have a way to
>>> defer to ptrace, and a way to allow it.
>>
>> Not happening (by my hand :).  I'm not changing seccomp to allow it to
>> cause a system call to fail with an error code. I'll add support for
>> tracehook integration if this patch can get merged, but I'm not going
>> to change the basic semantics of seccomp.
>
> I guess the difference is that you have a somewhat controlled environment
> and aren't trying to run arbitrary programs in your jail. Because standard
> programs are trying to do stuff they don't really have to all the time,
> and killing them for each silly offence would make the jail useless.
>
> It is also crucial in handling new system calls. Glibc does that all the
> time, it tries the new version, and if it's not supported, it falls back
> to the old one. That's why we return ENOSYS for denied system calls and
> that seems to work pretty well.
>
> I guess it would be okay if BPF could somehow defer to ptrace instead of
> killing. But without that, this BPF filtering is useless for our generic
> jail.
>
> I don't know anything about tracehooks, is that related to ftrace?

Roughly, tracehook == ptrace.  I was using it as shorthand for ptrace
integration.  There's more arch-related goo.

http://lxr.linux.no/linux+v3.2.1/include/linux/tracehook.h#L22

By using seccomp filter as a feeder for ptrace, it becomes more
reasonable to attempt to implement some of the fixes for cases where a
race results in a process executing what it wants.  However, I still
think ptrace will always be a security nightmare so I'd like
seccomp+bpf to stand alone with optional ptrace support.
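
As an aside on the fallback behaviour Indan describes above (try the
newer call, fall back to the older one when it reports ENOSYS), the
pattern looks roughly like the sketch below. The function-pointer type
and the fake_* stand-ins are hypothetical and exist only to make the
example self-contained; a real libc wraps actual system calls.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical signature standing in for a libc syscall wrapper. */
typedef long (*call_fn)(int arg);

/* Try the newer call first; retry with the older one only when the
 * kernel (or a jail) reports that the newer call does not exist. */
long call_with_fallback(call_fn newer, call_fn older, int arg)
{
	long ret;

	assert(newer && older);
	errno = 0;
	ret = newer(arg);
	if (ret == -1 && errno == ENOSYS)
		ret = older(arg);
	return ret;
}

/* Stand-ins used only to exercise the sketch. */
long fake_new_call(int arg)   { (void)arg; errno = ENOSYS; return -1; }
long fake_old_call(int arg)   { return arg + 1; }
long fake_eperm_call(int arg) { (void)arg; errno = EPERM; return -1; }
```

This is why a jail that answers ENOSYS lets such programs carry on,
while one that kills the task on the first denied call does not: any
other error, or a kill, gives the caller no chance to fall back.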

>> The nice thing is, if we
>> reserve return values, this functionality can be layered on later
>> without it causing any ABI breakage and with proper consideration
>> independent of whether the basic functionality gets merged.
>
> That seems like a good idea.

I'll make that explicit in the next patch.

>> Then, if
>> you want retool the entire seccomp path on all architectures to allow
>> graceful system call failure, it'd be totally doable.
>
> It's trivial if ENOSYS is always returned, then you just change the syscall
> nr (that's what I do now). But specific return values are indeed harder
> and maybe not worth the trouble.
>
> If this gets in and no one else does it, I'll try to add ptrace support
> so they work nicely together.

Sounds good!  I have a tentative patch for this already too, so
hopefully we could come to a nice solution!  I'd like ptrace support
for effective, deployed "learning" mode and soft-enforcement of
policies in addition to exploring more advanced policy logic.  But
that's all future :)

Thanks!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 20:42                             ` Will Drewry
  2012-01-17 21:09                               ` Will Drewry
@ 2012-01-18  4:47                               ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18  4:47 UTC (permalink / raw)
  To: Will Drewry
  Cc: Kees Cook, Oleg Nesterov, linux-kernel, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, Roland McGrath

On Tue, January 17, 2012 21:42, Will Drewry wrote:
> On Tue, Jan 17, 2012 at 2:34 PM, Kees Cook <keescook@chromium.org> wrote:
>> If this turns out to be expensive, it might be possible to break it up
>> and load the arguments on demand (and cache them); i.e. have
>> load_pointer() or similar notice when it is about to access something
>> other than bpf_data.syscall_nr.
>
> Makes perfect sense!  In theory (as a few other people pointed
> out off list), it is entirely possible to never populate any data for
> load_pointer except an optional cache.  Just provide a custom
> load_pointer that knows to take the offset and return the syscall nr or
> the args or some slice of the returned data.

That sounds like a very good idea. I don't think the cache is needed
because an argument is usually only checked once.
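
A rough sketch of that on-demand idea, with no cache, might look like
the following. Everything here is an assumption made for illustration:
lazy_load_word(), struct sc_source and the offsets are hypothetical
names, the callbacks merely stand in for syscall_get_nr() and
syscall_get_arguments(), and the virtual layout (nr, six low halves, six
high halves) is the split layout proposed earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SC_ARGS 6

/* Hypothetical callbacks standing in for the kernel's syscall_get_nr()
 * and syscall_get_arguments(). */
struct sc_source {
	uint32_t (*get_nr)(void *regs);
	uint64_t (*get_arg)(void *regs, unsigned int i);
	void *regs;
};

/* Virtual layout seen by the filter: nr, low halves, high halves. */
#define OFF_NR       0u
#define OFF_ARG_LOW  4u
#define OFF_ARG_HIGH (OFF_ARG_LOW + 4u * MAX_SC_ARGS)
#define DATA_SIZE    (OFF_ARG_HIGH + 4u * MAX_SC_ARGS)

/* Fetch only the word the filter asks for.
 * Returns 0 on success, -1 for a misaligned or out-of-range offset. */
int lazy_load_word(const struct sc_source *s, uint32_t off, uint32_t *out)
{
	assert(s && out);
	if (off % 4 != 0 || off >= DATA_SIZE)
		return -1;
	if (off == OFF_NR)
		*out = s->get_nr(s->regs);
	else if (off < OFF_ARG_HIGH)
		*out = (uint32_t)s->get_arg(s->regs, (off - OFF_ARG_LOW) / 4);
	else
		*out = (uint32_t)(s->get_arg(s->regs,
					     (off - OFF_ARG_HIGH) / 4) >> 32);
	return 0;
}

/* Stand-in register file, only for exercising the sketch. */
struct fake_regs {
	uint32_t nr;
	uint64_t args[MAX_SC_ARGS];
	unsigned int fetches;	/* counts argument reads */
};

uint32_t fake_get_nr(void *regs)
{
	return ((struct fake_regs *)regs)->nr;
}

uint64_t fake_get_arg(void *regs, unsigned int i)
{
	struct fake_regs *r = regs;

	r->fetches++;
	return r->args[i];
}
```

A filter that only looks at the syscall number never triggers an
argument fetch, which is the saving being discussed.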

> This is even easier if the struct looks like:
> struct {
>   int nr;
>   union {
>     uint32_t args32[6];
>     uint64_t args64[6];
>   };
> };
>
> since you can just use the offset without doing any endian-based
> splitting.

But then the filter has to know if it's 32 or 64-bit, and still know about
endianness, or am I missing something? It seems better to provide the upper
32 bits separately so code that cares about it can check it.
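
To make the endianness concern concrete, here is a small userspace
sketch. It assumes the union layout quoted above (with the union given a
member name, u, so the struct is valid C) and uses hypothetical helper
names. It shows that once a 64-bit argument has been stored, which half
a 32-bit load at the same offset returns depends on the host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* The union layout from the quoted proposal. */
struct sc_data {
	int nr;
	union {
		uint32_t args32[6];
		uint64_t args64[6];
	} u;
};

/* Store a 64-bit first argument, then read it back as 32-bit word 0 or
 * 1, the way a 32-bit BPF load at that offset would see it. */
uint32_t union_word(uint64_t value, unsigned int word)
{
	struct sc_data d = { 0 };

	assert(word < 2);
	d.u.args64[0] = value;
	return d.u.args32[word];
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}
```

On a little-endian host word 0 is the low half; on a big-endian host it
is the high half. So a filter written against this layout has to know
both the word size and the byte order, which separate low and high
arrays avoid.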

> Another suggestion (thanks roland!) was to add
>   int syscall_arch;
> to the struct populated with the AUDIT_ARCH_* defines.  This would
> help the case Indan was worried about -- portable filter programs.

I never heard of AUDIT_ARCH* before, so I have no idea how it could be
used. Do you have a documentation pointer?

From a quick scan through the kernel code, it seems to tell what arch the system is.

Such a flag could be used to interpret the syscall number depending on what
arch it is, but I'm not sure that would make things cleaner or easier for anyone.
Still, it won't hurt to have it and at least gives that option.
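
As a sketch of what such a check could look like in C: the
AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 values come from <linux/audit.h>,
while struct sc_view, is_exit_call() and the NR_* names are hypothetical
and exist only for this example. The syscall_arch field was still a
proposal at this point in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/audit.h>	/* AUDIT_ARCH_X86_64, AUDIT_ARCH_I386 */

/* Hypothetical view of what a filter would be handed. */
struct sc_view {
	uint32_t arch;	/* one of the AUDIT_ARCH_* values */
	int nr;		/* syscall number, meaningful only per arch */
};

/* x86 syscall numbers, spelled out so the sketch builds on any host. */
enum {
	NR_EXIT_X86_64 = 60,
	NR_EXIT_GROUP_X86_64 = 231,
	NR_EXIT_I386 = 1,
	NR_EXIT_GROUP_I386 = 252,
};

/* Returns 1 if the call is exit or exit_group for the given arch. */
int is_exit_call(const struct sc_view *v)
{
	assert(v != NULL);
	switch (v->arch) {
	case AUDIT_ARCH_X86_64:
		return v->nr == NR_EXIT_X86_64 ||
		       v->nr == NR_EXIT_GROUP_X86_64;
	case AUDIT_ARCH_I386:
		return v->nr == NR_EXIT_I386 ||
		       v->nr == NR_EXIT_GROUP_I386;
	default:
		return 0;	/* unknown arch: do not guess */
	}
}
```

The same number means different things per arch: 1 is exit on i386 but
write on x86-64, and 60 is exit on x86-64 but umask on i386. A filter
that ignores the arch can therefore allow the wrong call.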

> It looks like there'd be some cross-arch plumbing to make the
> AUDIT_ARCH_ data available, but not too bad.
>
> Seem sane? I'm headed down this path now and I think it'll work out
> assuming there aren't major objections to the syscall_arch piece.

I'm not sure about the union, and no objections to syscall_arch if it just
tells what arch it is.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  4:22                                       ` Indan Zupancic
@ 2012-01-18  5:23                                         ` Linus Torvalds
  2012-01-18  6:25                                           ` Linus Torvalds
  2012-01-18  5:43                                         ` Chris Evans
  1 sibling, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18  5:23 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 8:22 PM, Indan Zupancic <indan@nul.nu> wrote:
>
> Looking at EIP - 2 seems like a secure way to check how we entered the kernel.

Secure? No. Not at all.

It's actually very easy to fool it. Do something like this:

 - map the same physical page executably at one address, and writably
4kB above it (use shared memory, and map it twice).

 - in that page, do this:

      lea 1f,%edx
      movl $SYSCALL,%eax
      movl $-1,4096(%edx)
  1:
      int 0x80

and what happens is that the move that *overwrites* the int 0x80 will
not be noticed by the I$ coherency because it's at another address,
but by the time you read at $pc-2, you'll get -1, not "int 0x80"
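
The double-mapping step can be sketched in userspace as below. This only
demonstrates the aliasing: it assumes a temporary file under /tmp as the
shared backing object, maps the first alias read-only rather than
executable, and does not attempt the instruction overwrite itself. The
function name is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for mkstemp() and MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one shared page at two adjacent addresses and check that a write
 * through the upper (writable) alias shows up in the lower one.
 * Returns 1 if it does, 0 if not, -1 on setup failure. */
int alias_write_is_visible(void)
{
	char path[] = "/tmp/alias-demo-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	volatile char *lower, *upper;
	char *base;
	int fd, seen;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* the file lives on through fd */
	if (ftruncate(fd, page) != 0) {
		close(fd);
		return -1;
	}

	/* Reserve two adjacent pages, then place both aliases in them. */
	base = mmap(NULL, 2 * (size_t)page, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED) {
		close(fd);
		return -1;
	}
	lower = mmap(base, (size_t)page, PROT_READ,
		     MAP_SHARED | MAP_FIXED, fd, 0);
	upper = mmap(base + page, (size_t)page, PROT_READ | PROT_WRITE,
		     MAP_SHARED | MAP_FIXED, fd, 0);
	close(fd);
	if (lower == MAP_FAILED || upper == MAP_FAILED) {
		munmap(base, 2 * (size_t)page);
		return -1;
	}

	upper[0] = 0x5a;		/* write one page above ... */
	seen = (lower[0] == 0x5a);	/* ... and read it one page below */
	munmap(base, 2 * (size_t)page);
	return seen;
}
```

Because both addresses refer to the same physical page, a tracer that
later reads the lower address sees whatever was last written through the
upper one, not necessarily what the CPU fetched and executed.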

                  Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  4:22                                       ` Indan Zupancic
  2012-01-18  5:23                                         ` Linus Torvalds
@ 2012-01-18  5:43                                         ` Chris Evans
  2012-01-18 12:12                                           ` Indan Zupancic
  2012-01-18 17:00                                           ` Oleg Nesterov
  1 sibling, 2 replies; 235+ messages in thread
From: Chris Evans @ 2012-01-18  5:43 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Tue, Jan 17, 2012 at 8:22 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 03:22, Andi Kleen wrote:
>>> I'm pretty sure this isn't about changing cs or far jumps
>>
>> He's assuming that code can only run on two code segments and
>> not arbitrarily switch between them which is a completely incorrect
>> assumption.
>
> All I assumed up to now was that cs shows the current mode of the process,
> and that that defines which system call path is taken. Apparently that is
> not true and int 0x80 forces the compat system call path.
>
> Looking at EIP - 2 seems like a secure way to check how we entered the kernel.

For 64-bit processes, you need to look at that (hard due to races) and
_also_ CS.
At least that was the state the last time I played with this in
earnest: http://scary.beasts.org/security/CESA-2009-001.html

I see Linus posted one of the race conditions that "EIP - 2" is
vulnerable to. You can start to chip away at the problem by making
sure your policy doesn't allow mmap() or mprotect() with PROT_EXEC (or
MAP_SHARED) but it's a long battle.

>
>>> I think Indan means code is running with 64-bit cs, but the kernel
>>> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
>>> and there's no way for the ptracer to know which syscall the kernel
>>> will perform, even by looking at all registers.
>
> Yes, that's what I meant.
>
>>> It looks like a hole in ptrace which could be fixed.
>>
>> Possibly, but anything that bases its security on ptrace is typically
>> unfixably racy (just think what happens with multiple threads
>> and syscall arguments), so it's unlikely to do any good.
>
> As far as I know, we fixed all races except symlink races caused by malicious
> code outside the jail.

Are you sure? I've remembered possibly the worst one I encountered,
since my previous e-mail to Jamie:

1) Tracee is compromised; executes fork() which is a syscall that isn't allowed
2) Tracee traps
2b) Tracee could take a SIGKILL here
3) Tracer looks at registers; bad syscall
3b) Or tracee could take a SIGKILL here
4) The only way to stop the bad syscall from executing is to rewrite
orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
syscall has finished)
5) Disaster: the tracee took a SIGKILL so any attempt to address it by
pid (such as PTRACE_SETREGS) fails.
6) Syscall fork() executes; possible unsupervised process now running
since the tracer wasn't expecting the fork() to be allowed.


All this ptrace() security headache is why vsftpd is waiting for
Will's seccomp enhancements to hit the kernel. Then they will be used
pronto.


Cheers
Chris

> Those are controllable by limiting what filesystem access
> the prisoners get. A special open() flag which causes open to fail when a part
> of the path is a symlink with a distinguishable error code would solve this for
> us.
>
> Other than that and the abysmal performance, ptrace is fine for jailing.
>
> Greetings,
>
> Indan
>
>

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  5:23                                         ` Linus Torvalds
@ 2012-01-18  6:25                                           ` Linus Torvalds
  2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
  2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
  0 siblings, 2 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18  6:25 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 9:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  - in that page, do this:
>
>      lea 1f,%edx
>      movl $SYSCALL,%eax
>      movl $-1,4096(%edx)
>  1:
>      int 0x80
>
> and what happens is that the move that *overwrites* the int 0x80 will
> not be noticed by the I$ coherency because it's at another address,
> but by the time you read at $pc-2, you'll get -1, not "int 0x80"

Btw, that I$ coherency comment is not technically the correct explanation.

The I$ coherency isn't the problem, the problem is that the pipeline
has already fetched the "int 0x80" before the write happens. And the
write - because it's not to the same linear address as the code fetch
- won't trigger the internal "pipeline flush on write to code stream".
So the D$ (and I$) will have the -1 in it, but the instruction fetch
will have walked ahead and seen the "int 80" that existed earlier, and
will execute it.

And the above depends very much on uarch details, so depending on
microarchitecture it may or may not work. But I think the "use a
different virtual address, but same physical address" thing will fake
out all modern x86 cpu's, and your 'ptrace' will see the -1, even
though the system call happened.
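
To make the "different virtual address, but same physical address" setup
concrete, here is a minimal sketch in C. It is an illustration only: the
helper names map_page_twice() and alias_is_visible() and the /tmp scratch
path are made up for the example, and both views are mapped read/write
(neither is executable), so it shows the aliasing itself, not the stale
instruction fetch.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first page of an unlinked temporary file at two different
 * virtual addresses. Returns 0 and stores both views on success, -1 on
 * failure. The mappings are never unmapped; this is a sketch.
 */
int map_page_twice(unsigned char **view_a, unsigned char **view_b)
{
	char path[] = "/tmp/alias-XXXXXX";	/* hypothetical scratch file */
	long page = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) < 0) {
		close(fd);
		return -1;
	}
	/* MAP_SHARED makes both views refer to the same page cache page. */
	*view_a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	*view_b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (*view_a == MAP_FAILED || *view_b == MAP_FAILED)
		return -1;
	return 0;
}

/*
 * Write through one view and read the byte back through the other.
 * Returns 1 if the store is visible at the second address, 0 if not,
 * -1 if the mappings could not be set up.
 */
int alias_is_visible(void)
{
	unsigned char *a, *b;

	if (map_page_twice(&a, &b) < 0)
		return -1;
	a[0] = 0xCD;
	return a != b && b[0] == 0xCD;
}
```

In the scenario described above one of the two views would be mapped
executable and the store would go through the other, writable one.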

Anyway, the *kernel* knows, since the kernel will have seen which
entrypoint it comes through. So we can handle it in the kernel. But
no, you cannot currently securely/reliably use $pc-2 in gdb or ptrace
to determine how the system call was made, afaik.

Of course, limiting things so that you cannot map the same page
executably *and* writably is one solution - and a good idea regardless
- so secure environments can still exist. But even then you could have
races in a multi-threaded environment (they'd just be *much* harder to
trigger for an attacker).

                 Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  5:43                                         ` Chris Evans
@ 2012-01-18 12:12                                           ` Indan Zupancic
  2012-01-18 21:13                                             ` Chris Evans
  2012-01-18 17:00                                           ` Oleg Nesterov
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18 12:12 UTC (permalink / raw)
  To: Chris Evans
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, January 18, 2012 06:43, Chris Evans wrote:
>> As far as I know, we fixed all races except symlink races caused by malicious
>> code outside the jail.
>
> Are you sure? I've remembered possibly the worst one I encountered,
> since my previous e-mail to Jamie:
>
> 1) Tracee is compromised; executes fork() which is a syscall that isn't allowed

How do you mean compromised? Tracees aren't trusted by definition. And fork is
allowed in our jail, we're ptracing all tasks within the jail.

> 2) Tracee traps
> 2b) Tracee could take a SIGKILL here
> 3) Tracer looks at registers; bad syscall
> 3b) Or tracee could take a SIGKILL here
> 4) The only way to stop the bad syscall from executing is to rewrite
> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> syscall has finished)

Yes, we rewrite it to -1.

> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> pid (such as PTRACE_SETREGS) fails.

I assume that if a task can execute system calls and we get ptrace events
for that, that we can do other ptrace operations too. Are you saying that
the kernel has this ptrace gap between SIGKILL and task exit where ptrace
doesn't work but the task continues executing system calls? That would be
a huge bug, but it seems very unlikely too, as the task is stopped and
shouldn't be able to disappear till it is continued by the tracer.

I mean, really? That would be stupid.

If true we have to work around it by disallowing SIGKILL and just sending
them ourselves within the jail. Meh.

> 6) Syscall fork() executes; possible unsupervised process now running
> since the tracer wasn't expecting the fork() to be allowed.

We use PTRACE_O_TRACEFORK (or, on 2.4 kernels, replace the fork with clone
and set CLONE_PTRACE). And yes, I check for CLONE_UNTRACED in clone calls.

>
> All this ptrace() security headache is why vsftpd is waiting for
> Will's seccomp enhancements to hit the kernel. Then they will be used
> pronto.

How will you avoid file path races with BPF?

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18  6:25                                           ` Linus Torvalds
@ 2012-01-18 13:12                                             ` Indan Zupancic
  2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-18 13:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, January 18, 2012 07:25, Linus Torvalds wrote:
> On Tue, Jan 17, 2012 at 9:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> 	- in that page, do this:
>>
>> 			lea 1f,%edx
>> 			movl $SYSCALL,%eax
>> 			movl $-1,4096(%edx)
>> 	1:
>> 			int 0x80
>>
>> and what happens is that the move that *overwrites* the int 0x80 will
>> not be noticed by the I$ coherency because it's at another address,
>> but by the time you read at $pc-2, you'll get -1, not "int 0x80"

Oh jolly. I feared something like that might have been possible.

> Btw, that I$ coherency comment is not technically the correct explanation.
>
> The I$ coherency isn't the problem, the problem is that the pipeline
> has already fetched the "int 0x80" before the write happens. And the
> write - because it's not to the same linear address as the code fetch
> - won't trigger the internal "pipeline flush on write to code stream".
> So the D$ (and I$) will have the -1 in it, but the instruction fetch
> will have walked ahead and seen the "int 80" that existed earlier, and
> will execute it.
>
> And the above depends very much on uarch details, so depending on
> microarchitecture it may or may not work. But I think the "use a
> different virtual address, but same physical address" thing will fake
> out all modern x86 cpu's, and your 'ptrace' will see the -1, even
> though the system call happened.
>
> Anyway, the *kernel* knows, since the kernel will have seen which
> entrypoint it comes through. So we can handle it in the kernel. But
> no, you cannot currently securely/reliably use $pc-2 in gdb or ptrace
> to determine how the system call was made, afaik.

So there is this gap and there is no good way to handle it at all for
user space? And even if it's fixed in the kernel, that won't help with
older kernels, so it will stay a problem for a while.

Can this int 0x80 trick be blocked for ptraced task (preferably always),
pretty please?

> Of course, limiting things so that you cannot map the same page
> executably *and* writably is one solution - and a good idea regardless
> - so secure environments can still exist.

We got the infrastructure in place to do that, though it would be a hassle.
But browsing around in /proc/$PID/maps, it seems w+x mappings are very
common, and we want to jail normal programs, so that seems a bit of a
problem. We could disallow system calls coming from such double mapped
memory, instead of disallowing such mappings altogether.

We'd either need to keep track of all mappings or scan /proc/$PID/maps.
Because that is a pain, we need to cache the results and invalidate or
update the cache after each new writeable mapping.
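
As a rough sketch of the scanning half of this (not our actual jail
code), one line of /proc/$PID/maps can be checked like so. The function
name maps_line_is_wx() is made up for the example, and it only flags a
single mapping that is both writable and executable. Catching one file
mapped twice with different permissions would also need the dev and inode
fields compared across lines.

```c
#include <stdio.h>

/*
 * Parse one line of /proc/PID/maps ("start-end perms offset dev inode path").
 * Returns 1 if the mapping is both writable and executable, 0 if it is
 * not, and -1 if the line cannot be parsed.
 */
int maps_line_is_wx(const char *line)
{
	unsigned long long start, end;
	char perms[5];	/* "rwxp" plus the terminating NUL */

	if (sscanf(line, "%llx-%llx %4s", &start, &end, perms) != 3)
		return -1;
	/* A well-formed perms field has at least the r, w and x slots. */
	if (perms[0] == '\0' || perms[1] == '\0' || perms[2] == '\0')
		return -1;
	return perms[1] == 'w' && perms[2] == 'x';
}
```

A tracer would feed it each line read from the maps file with fgets() and
refresh its cache whenever a traced mmap or mprotect call could have
changed the answer.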

Doable, but starting to look silly and fragile.

I suppose restarting the system call would avoid same-task tricks,
but doesn't solve the other-task-having-a-writeable-mapping problem.

> But even then you could have
> races in a multi-threaded environment (they'd just be *much* harder to
> trigger for an attacker).

All hostile threads are either jailed or running as a different user,
so at least the mapping checks can be done race-free. Syscall from
unknown mappings can be disallowed.

I hope there is a really dirty trick that works reliably to find a very
subtle difference between a system call entered via 'syscall' and one
entered via 'int 0x80'.

At this point it starts to look attractive to only allow system calls
coming from vdso and protecting the vdso mapping (or is that done by
the kernel already?) System calls coming from elsewhere can be
restarted at the vdso (need to fix up EIP post-syscall then too.)
All in all something like this seems the simplest and most practical
solution to me.

Anyone got any better idea?

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:46                                         ` Linus Torvalds
@ 2012-01-18 14:06                                           ` Martin Mares
  2012-01-18 18:24                                             ` Andi Kleen
  0 siblings, 1 reply; 235+ messages in thread
From: Martin Mares @ 2012-01-18 14:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Jamie Lokier, Andi Kleen, Indan Zupancic,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

Hello!

> > That would be incompatible.
> 
> No it wouldn't.
> 
> We'd only do it for the case that everybody gets wrong: int80 from a
> 64-bit context.

Not everybody. There are programs which try hard to distinguish between
int80 and syscall. One such example is a sandbox for programming contests
I wrote several years ago. It analyses the instruction before EIP and as
it does not allow threads nor executing writeable memory, it should be
correct.
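
For readers who have not seen this kind of check, a minimal sketch of
"analyse the instruction before EIP" follows. It is not the code of the
sandbox, just an illustration. The names classify_entry() and enum
entry_insn are invented here, and the caller is assumed to have already
copied the tracee's code bytes (for example with PTRACE_PEEKTEXT) into a
local buffer. The opcodes are the usual x86 encodings: CD 80 for int 0x80
and 0F 05 for syscall.

```c
#include <stddef.h>

enum entry_insn { ENTRY_UNKNOWN, ENTRY_INT80, ENTRY_SYSCALL };

/*
 * code points at bytes copied from the tracee, and pc_off is the offset
 * within that buffer that corresponds to the saved program counter.
 * Both instructions are two bytes long, so look at the two bytes that
 * precede the saved pc.
 */
enum entry_insn classify_entry(const unsigned char *code, size_t pc_off)
{
	unsigned char b0, b1;

	if (pc_off < 2)
		return ENTRY_UNKNOWN;
	b0 = code[pc_off - 2];
	b1 = code[pc_off - 1];
	if (b0 == 0xCD && b1 == 0x80)
		return ENTRY_INT80;	/* legacy 32-bit entry */
	if (b0 == 0x0F && b1 == 0x05)
		return ENTRY_SYSCALL;	/* native 64-bit entry */
	return ENTRY_UNKNOWN;
}
```

The result is only as trustworthy as the bytes read: it assumes nothing
could have rewritten them between the instruction fetch and the read.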

The change you propose would break it. It is not a huge deal, I can fix it
in a minute, but I suspect there are other such pieces of code in the wild.

However, having TS_COMPAT available through ptrace would be great and I do not
see any other nice way to export it to userspace, so maybe breaking the
ABI in this case is acceptable.

				Have a nice fortnight
-- 
Martin `MJ' Mares                          <mj@ucw.cz>   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Anything is good and useful if it's made of chocolate.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  6:25                                           ` Linus Torvalds
  2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
@ 2012-01-18 15:04                                             ` Eric Paris
  2012-01-18 17:51                                               ` Linus Torvalds
  1 sibling, 1 reply; 235+ messages in thread
From: Eric Paris @ 2012-01-18 15:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Tue, 2012-01-17 at 22:25 -0800, Linus Torvalds wrote:

> Of course, limiting things so that you cannot map the same page
> executably *and* writably is one solution - and a good idea regardless
> - so secure environments can still exist. But even then you could have
> races in a multi-threaded environment (they'd just be *much* harder to
> trigger for an attacker).

Gratuitous SELinux for the win e-mail!  (Feel free to delete now)  We
typically, for all confined domains, do not allow mapping anonymous
memory both W and X.  Actually you can't even map it W and then map it
X...

Now if there is a file for which you have both W and X SELinux permissions
(which is rare, but not impossible) you could map it in two places.  So
we can (and do) build SELinux sandboxes which address this.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  5:43                                         ` Chris Evans
  2012-01-18 12:12                                           ` Indan Zupancic
@ 2012-01-18 17:00                                           ` Oleg Nesterov
  2012-01-18 17:12                                             ` Oleg Nesterov
  2012-01-19  0:29                                             ` Indan Zupancic
  1 sibling, 2 replies; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-18 17:00 UTC (permalink / raw)
  To: Chris Evans
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On 01/17, Chris Evans wrote:
>
> 1) Tracee is compromised; executes fork() which is a syscall that isn't allowed
> 2) Tracee traps
> 2b) Tracee could take a SIGKILL here
> 3) Tracer looks at registers; bad syscall
> 3b) Or tracee could take a SIGKILL here
> 4) The only way to stop the bad syscall from executing is to rewrite
> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> syscall has finished)
> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> pid (such as PTRACE_SETREGS) fails.
> 6) Syscall fork() executes; possible unsupervised process now running
> since the tracer wasn't expecting the fork() to be allowed.

As for fork() in particular, it can't succeed after SIGKILL.

But I agree, probably it makes sense to change ptrace_stop() to check
fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
in TASK_TRACED. Or we can change tracehook_report_syscall_entry()

	-	return 0;
	+	return !fatal_signal_pending();

(no, I do not literally mean the change above)

Not only for security. The current behaviour sometimes confuses
users. A debugger sends SIGKILL to the tracee and assumes it should
die asap, but the tracee exits only after the syscall.

Oleg.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:00                                           ` Oleg Nesterov
@ 2012-01-18 17:12                                             ` Oleg Nesterov
  2012-01-18 21:09                                               ` Chris Evans
  2012-02-07 11:45                                               ` Indan Zupancic
  2012-01-19  0:29                                             ` Indan Zupancic
  1 sibling, 2 replies; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-18 17:12 UTC (permalink / raw)
  To: Chris Evans
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On 01/18, Oleg Nesterov wrote:
>
> On 01/17, Chris Evans wrote:
> >
> > 1) Tracee is compromised; executes fork() which is a syscall that isn't allowed
> > 2) Tracee traps
> > 2b) Tracee could take a SIGKILL here
> > 3) Tracer looks at registers; bad syscall
> > 3b) Or tracee could take a SIGKILL here
> > 4) The only way to stop the bad syscall from executing is to rewrite
> > orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> > syscall has finished)
> > 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> > pid (such as PTRACE_SETREGS) fails.
> > 6) Syscall fork() executes; possible unsupervised process now running
> > since the tracer wasn't expecting the fork() to be allowed.
>
> As for fork() in particular, it can't succeed after SIGKILL.
>
> But I agree, probably it makes sense to change ptrace_stop() to check
> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>
> 	-	return 0;
> 	+	return !fatal_signal_pending();
>
> (no, I do not literally mean the change above)
>
> Not only for security. The current behaviour sometimes confuses
> users. A debugger sends SIGKILL to the tracee and assumes it should
> die asap, but the tracee exits only after the syscall.

Something like the patch below.

Oleg.

--- x/include/linux/tracehook.h
+++ x/include/linux/tracehook.h
@@ -54,12 +54,12 @@ struct linux_binprm;
 /*
  * ptrace report for syscall entry and exit looks identical.
  */
-static inline void ptrace_report_syscall(struct pt_regs *regs)
+static inline int ptrace_report_syscall(struct pt_regs *regs)
 {
 	int ptrace = current->ptrace;
 
 	if (!(ptrace & PT_PTRACED))
-		return;
+		return 0;
 
 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
 
@@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
 		send_sig(current->exit_code, current, 1);
 		current->exit_code = 0;
 	}
+
+	return fatal_signal_pending(current);
 }
 
 /**
@@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
 static inline __must_check int tracehook_report_syscall_entry(
 	struct pt_regs *regs)
 {
-	ptrace_report_syscall(regs);
-	return 0;
+	return ptrace_report_syscall(regs);
 }
 
 /**


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
@ 2012-01-18 17:51                                               ` Linus Torvalds
  0 siblings, 0 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 17:51 UTC (permalink / raw)
  To: Eric Paris
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 7:04 AM, Eric Paris <eparis@redhat.com> wrote:
>
> Gratuitous SELinux for the win e-mail!  (Feel free to delete now)  We
> typically, for all confined domains, do not allow mapping anonymous
> memory both W and X.  Actually you can't even map it W and then map it
> X...

That doesn't help.

Anonymous memory is the *one* kind of mapping that this cannot happen
for - because then you have the same page mapped only at one
particular virtual address (and all modern x86's are entirely coherent
in the pipeline for that case, afaik).

> Now if there is a file for which you have both W and X SELinux permissions
> (which is rare, but not impossible) you could map it in two places.  So
> we can (and do) build SELinux sandboxes which address this.

So the cases that matter are file-backed and various shared memory setups.

                   Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 14:06                                           ` Martin Mares
@ 2012-01-18 18:24                                             ` Andi Kleen
  2012-01-19 16:04                                               ` Jamie Lokier
  0 siblings, 1 reply; 235+ messages in thread
From: Andi Kleen @ 2012-01-18 18:24 UTC (permalink / raw)
  To: Martin Mares
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andi Kleen,
	Indan Zupancic, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

> Not everybody. There are programs which try hard to distinguish between
> int80 and syscall. One such example is a sandbox for programming contests
> I wrote several years ago. It analyses the instruction before EIP and as
> it does not allow threads nor executing writeable memory, it should be
> correct.

There are other ways to break it, like using the syscall itself to change
input arguments, or using ptrace from another process.

Generally there are so many races with ptrace that if you want to do
things like that it's better to use a LSM. That's what they are for.

-Andi


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
@ 2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 19:36                                                 ` Andi Kleen
                                                                   ` (3 more replies)
  0 siblings, 4 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 19:31 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 4199 bytes --]

On Wed, Jan 18, 2012 at 5:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>
> So there is this gap and there is no good way to handle it at all for
> user space? And even if it's fixed in the kernel, that won't help with
> older kernels, so it will stay a problem for a while.

Correct.

> Can this int 0x80 trick be blocked for ptraced task (preferably always),
> pretty please?

Nope. Not that I can tell. The "unable to read $pc-2" is a hardware
feature, and we cannot stop users from running the "int 0x80" code.
The only way to block it is to simply not enable the 32-bit
compatibility mode at all, at which point the "int 0x80" interface
simply doesn't exist.

And sure, we could do something in the kernel (like saying that you
cannot do "int 0x80" from 64-bit code by explicitly testing in the
ia32_syscall function), but that has the same "even if it's fixed in
the kernel" issue.

You can test this feature out with a test-program something like this:

  #define _GNU_SOURCE	/* must precede the libc headers to take effect */
  #include <errno.h>
  #include <stdio.h>	/* printf */
  #include <stdlib.h>
  #include <signal.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  void handler(int sig)
  {
	printf("SIGWINCH\n");
  }

  int main(int argc, char **argv)
  {
	signal(SIGWINCH, handler);
	asm("int $0x80": :"a" (29));	/* sys_pause - 32-bit */
	syscall(34);	/* sys_pause - 64-bit */
	return 0;
  }

which does two "pause()" system calls from 64-bit mode, the first one
using the legacy system call interface.

At least "strace" gets really confused, and will show the first one as

   shmget(0x1c, 140734112566944, 0)        = ? ERESTARTNOHAND (To be restarted)

because it assumes that in 64-bit mode, system call number 29 means
"shmget". It doesn't even look at $pc-2, which (since this code
doesn't try to obfuscate it) would have worked in this case.
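
The ambiguity is easy to see if you write the lookup down. This is a
toy sketch, not strace's code: x86_syscall_name() is a made-up helper and
it only knows the three table entries that matter for the test program
above (29 and 34 in the x86-64 table, 29 in the i386 table).

```c
#include <stddef.h>

/*
 * Map a system call number to a name. Which table applies depends on
 * how the call was entered, not on the mode the task runs in:
 * compat != 0 means the i386 table (int 0x80), compat == 0 means the
 * x86-64 table (syscall instruction). Returns NULL for numbers this
 * toy table does not cover.
 */
const char *x86_syscall_name(int compat, long nr)
{
	if (compat) {
		if (nr == 29)
			return "pause";
	} else {
		if (nr == 29)
			return "shmget";
		if (nr == 34)
			return "pause";
	}
	return NULL;
}
```

A tracer that picks "compat" purely from the task's mode will look up 29
in the wrong table for the first call, which is exactly the shmget line
shown above.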

I actually checked the strace source code. It has

  #  if 0
                /* This version analyzes the opcode of a syscall instruction.
                 * (int 0x80 on i386 vs. syscall on x86-64)
                 * It works, but is too complicated.
                 */
                unsigned long val, rip, i;

                if (upeek(tcp, 8*RIP, &rip) < 0)
                        perror("upeek(RIP)");

                /* sizeof(syscall) == sizeof(int 0x80) == 2 */
                rip -= 2;
                errno = 0;
              ...

so there is code there that could make it work, but it's compiled out
with "#if 0". The actually used code just does

                /* Check CS register value. On x86-64 linux it is:
                 *      0x33    for long mode (64 bit)
                 *      0x23    for compatibility mode (32 bit)
                 * It takes only one ptrace and thus doesn't need
                 * to be cached.
                 */
                if (upeek(tcp, 8*CS, &val) < 0)
                        return -1;
                switch (val) {
                        case 0x23: currpers = 1; break;
                        case 0x33: currpers = 0; break;

which is the reasonable and obvious approach.

I'm looking at "struct user_regs_struct" and there really isn't any
non-architected state there outside of "high bits".

There are high bits that we can hide things in outside of orig_ax - we
do have 64 bits for "cs" for example - but it all boils down to the
same issue: we *will* break something that thinks it knows the details
of this. The advantage of "orig_eax" would be that at least it makes
conceptual sense there.

Using the high bits of 'eflags' might work. Hopefully nobody tests
that. IOW, something like the attached might work. It sets bit#32 in
eflags for a 64-bit system call, and bit#33 if it is a compat call.

With that, ptrace would at least be able to tell (assuming a new
kernel, of course - it would still need to have the "look at cs" as a
fallback) if it's a compat call or not, but it could do something like

   mode = (eflags >> 32) & 3;
   switch (mode) {
   case 0:
          .. guess it from CS ..
   case 1:
           64-bit
   case 2:
            32-bit
   default:
            Oddity.
   }

or something like that. The idea being that you can also see from
eflags whether the new feature is supported or not.
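
Spelled out as compilable C, the tracer side could look roughly like the
sketch below. It assumes the bit layout of the attached (untested) patch:
bit 32 set for a 64-bit call, bit 33 set for a compat call. The names
syscall_mode() and enum sc_mode are invented for the example, and the CS
fallback uses the two selector values from the strace comment quoted
above. Any other selector is reported as an oddity rather than guessed.

```c
#include <stdint.h>

enum sc_mode { SC_MODE_64, SC_MODE_32, SC_MODE_ODD };

/*
 * eflags and cs are the values a tracer read from the tracee's
 * user_regs_struct. The two bits above the architectural 32-bit flags
 * carry the hint; if neither is set the kernel predates the hint and
 * the only option is to guess from CS.
 */
enum sc_mode syscall_mode(uint64_t eflags, uint64_t cs)
{
	switch ((eflags >> 32) & 3) {
	case 1:
		return SC_MODE_64;
	case 2:
		return SC_MODE_32;
	case 0:
		/* No hint from the kernel: fall back to the CS selector. */
		if (cs == 0x33)
			return SC_MODE_64;
		if (cs == 0x23)
			return SC_MODE_32;
		return SC_MODE_ODD;
	default:
		/* Both bits set should never happen. */
		return SC_MODE_ODD;
	}
}
```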

THIS IS TOTALLY UNTESTED!

                      Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 934 bytes --]

 arch/x86/kernel/ptrace.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 50267386b766..e7b019cd88d3 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -353,6 +353,7 @@ static int set_segment_reg(struct task_struct *task,
 
 static unsigned long get_flags(struct task_struct *task)
 {
+	int bit = 32;
 	unsigned long retval = task_pt_regs(task)->flags;
 
 	/*
@@ -361,7 +362,12 @@ static unsigned long get_flags(struct task_struct *task)
 	if (test_tsk_thread_flag(task, TIF_FORCED_TF))
 		retval &= ~X86_EFLAGS_TF;
 
-	return retval;
+#ifdef CONFIG_IA32_EMULATION
+	/* Set bit 32 for 64-bit system calls, bit 33 for compat system calls */
+	bit += (task_thread_info(task)->status & TS_COMPAT) / TS_COMPAT;
+#endif
+
+	return retval | (1ul << bit);
 }
 
 static int set_flags(struct task_struct *task, unsigned long value)

^ permalink raw reply related	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
@ 2012-01-18 19:36                                                 ` Andi Kleen
  2012-01-18 19:39                                                   ` Linus Torvalds
  2012-01-18 19:41                                                   ` Martin Mares
  2012-01-18 19:38                                                 ` Andrew Lutomirski
                                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 235+ messages in thread
From: Andi Kleen @ 2012-01-18 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath


The real fix is really to use a LSM for custom jails.  Trying to make
ptrace secure is trying to make a sieve watertight by plugging the individual
holes one by one. It's simply not suitable for this.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 19:36                                                 ` Andi Kleen
@ 2012-01-18 19:38                                                 ` Andrew Lutomirski
  2012-01-19 16:01                                                   ` Jamie Lokier
  2012-01-18 20:26                                                 ` Linus Torvalds
  2012-01-25 19:36                                                 ` Oleg Nesterov
  3 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-18 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> The actually used code just does
>
>                /* Check CS register value. On x86-64 linux it is:
>                 *      0x33    for long mode (64 bit)
>                 *      0x23    for compatibility mode (32 bit)
>                 * It takes only one ptrace and thus doesn't need
>                 * to be cached.
>                 */
>                if (upeek(tcp, 8*CS, &val) < 0)
>                        return -1;
>                switch (val) {
>                        case 0x23: currpers = 1; break;
>                        case 0x33: currpers = 0; break;
>
> which is the reasonable and obvious approach.

*sigh*

It's reasonable, obvious, and even more wrong than it appears.  On
Xen, there's an extra 64-bit GDT entry, and it gets used by default.
(I got bitten by this in some iteration of the vsyscall emulation
patches -- see user_64bit_mode for the correct and
unusable-from-user-mode way to do this.)

--Andy

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:36                                                 ` Andi Kleen
@ 2012-01-18 19:39                                                   ` Linus Torvalds
  2012-01-18 19:44                                                     ` Andi Kleen
  2012-01-18 19:41                                                   ` Martin Mares
  1 sibling, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 19:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 11:36 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> The real fix is really to use a LSM for custom jails.  Trying to make
> ptrace secure is trying to make a sieve watertight by plugging the individual
> holes one by one. It's simply not suitable for this.

Umm. But the exact same is true of "LSM for custom jail". It's a
f*&^ing disaster, and it's a whole lot more complicated than ptrace.

Plus it can't even do what ptrace does, so what's the point?  There's
a lot of system calls that don't have any kind of lsm hooks, and
shouldn't. Exactly because THAT is a "plugging individual holes one by
one" approach.

                     Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:36                                                 ` Andi Kleen
  2012-01-18 19:39                                                   ` Linus Torvalds
@ 2012-01-18 19:41                                                   ` Martin Mares
  1 sibling, 0 replies; 235+ messages in thread
From: Martin Mares @ 2012-01-18 19:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Hello!

> The real fix is really to use a LSM for custom jails.  Trying to make 
> ptrace secure is trying to make a sieve watertight by plugging the individual
> holes one by one. It's simply not suitable for this.

As long as the set of syscalls which are permitted is trivial,
it should be secure and much easier than writing a custom LSM.

Regardless, having working strace would be nice.

				Have a nice fortnight
-- 
Martin `MJ' Mares                          <mj@ucw.cz>   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"Never send to know for whom the bell tolls: it tolls for thee." -- John Donne

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:39                                                   ` Linus Torvalds
@ 2012-01-18 19:44                                                     ` Andi Kleen
  2012-01-18 19:47                                                       ` Linus Torvalds
  0 siblings, 1 reply; 235+ messages in thread
From: Andi Kleen @ 2012-01-18 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

> Umm. But the exact same is true of "LSM for custom jail". It's a
> f*&^ing disaster, and it's a whole lot more complicated than ptrace.
> 
> Plus it can't even do what ptrace does, so what's the point?  There's

It can securely enable syscall auditing which can catch all syscalls
(however you only get race free memory arguments for the ones with LSM hooks 
at the right place). Really need both.

I agree it's not easy to get tight (and also not pretty), but you have a lot 
better chance doing it this way than with ptrace.

-Andi

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:44                                                     ` Andi Kleen
@ 2012-01-18 19:47                                                       ` Linus Torvalds
  2012-01-18 19:52                                                         ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 19:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 11:44 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> It can securely enable syscall auditing which can catch all syscalls
> (however you only get race free memory arguments for the ones with LSM hooks
> at the right place). Really need both.
>
> I agree it's not easy to get tight (and also not pretty), but you have a lot
> better chance doing it this way than with ptrace.

.. And how the f*^& did you imagine that something like chrome would do that?

You need massive amounts of privileges, and it's a total disaster in
every single respect.

Stop pushing crap. No, ptrace isn't wonderful, but your LSM+auditing
idea is a billion times worse in all respects.

We can definitely fix the ptrace issue with compat system calls.

THERE IS NO WAY IN HELL YOU CAN EVER FIX LSM+AUDIT TO BE USABLE!

Stop bothering to even bring it up. It's dead, Jim.

               Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:47                                                       ` Linus Torvalds
@ 2012-01-18 19:52                                                         ` Will Drewry
  2012-01-18 19:58                                                           ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-18 19:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 1:47 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 18, 2012 at 11:44 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> It can securely enable syscall auditing which can catch all syscalls
>> (however you only get race free memory arguments for the ones with LSM hooks
>> at the right place). Really need both.
>>
>> I agree it's not easy to get tight (and also not pretty), but you have a lot
>> better chance doing it this way than with ptrace.
>
> .. And how the f*^& did you imagine that something like chrome would do that?
>
> You need massive amounts of privileges, and it's a total disaster in
> every single respect.
>
> Stop pushing crap. No, ptrace isn't wonderful, but your LSM+auditing
> idea is a billion times worse in all respects.
>
> We can definitely fix the ptrace issue with compat system calls.

FWIW, it looks like audit needs fixing too.  If a process only uses
TIF_SYSCALL_AUDIT, then the fast-path will properly annotate the entry
with AUDIT_ARCH_I386, but if it takes the slow path because of some
other tracing on a thread (ftrace, ptrace, ...), then the audit record
will incorrectly use TIF_IA32 to write the audit record.  Easy patch
(I'll write it up shortly), but yet another case of breakage.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:52                                                         ` Will Drewry
@ 2012-01-18 19:58                                                           ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-18 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 1:52 PM, Will Drewry <wad@chromium.org> wrote:
> On Wed, Jan 18, 2012 at 1:47 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, Jan 18, 2012 at 11:44 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>>
>>> It can securely enable syscall auditing which can catch all syscalls
>>> (however you only get race free memory arguments for the ones with LSM hooks
>>> at the right place). Really need both.
>>>
>>> I agree it's not easy to get tight (and also not pretty), but you have a lot
>>> better chance doing it this way than with ptrace.
>>
>> .. And how the f*^& did you imagine that something like chrome would do that?
>>
>> You need massive amounts of privileges, and it's a total disaster in
>> every single respect.
>>
>> Stop pushing crap. No, ptrace isn't wonderful, but your LSM+auditing
>> idea is a billion times worse in all respects.
>>
>> We can definitely fix the ptrace issue with compat system calls.
>
> FWIW, it looks like audit needs fixing too.  If a process only uses
> TIF_SYSCALL_AUDIT, then the fast-path will properly annotate the entry
> with AUDIT_ARCH_I386, but if it takes the slow path because of some
> other tracing on a thread (ftrace, ptrace, ...), then the audit record
> will incorrectly use TIF_IA32 to write the audit record.  Easy patch
> (I'll write it up shortly), but yet another case of breakage.

Never mind - I mis-dereferenced the IS_IA32 define.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 19:36                                                 ` Andi Kleen
  2012-01-18 19:38                                                 ` Andrew Lutomirski
@ 2012-01-18 20:26                                                 ` Linus Torvalds
  2012-01-18 20:55                                                   ` H. Peter Anvin
  2012-02-06  8:32                                                   ` Indan Zupancic
  2012-01-25 19:36                                                 ` Oleg Nesterov
  3 siblings, 2 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 20:26 UTC (permalink / raw)
  To: Indan Zupancic, H. Peter Anvin
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 2734 bytes --]

Added Peter to the cc, since this is now about some x86-specific
things. Ingo was already cc'd earlier.

On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Using the high bits of 'eflags' might work. Hopefully nobody tests
> that. IOW, something like the attached might work. It just sets bit#32
> in eflags if the system call is a compat call.

So that description was bogus, it was what my original patch did, but
not the one I actually sent out (Peter - you can find it on lkml,
although the description below is probably sufficient for you to
understand what it does, or the obvious nature of the attached patch
for strace).

The one I sent out *unconditionally* sets one bit in the high bits of
the returned value of the eflags register from ptrace(), very much on
purpose. That way you can unambiguously see whether it's an old kernel
(bits clear) or a new kernel that supports the feature. On a new
kernel, bit #32 of eflags will be set for a native 64-bit system call,
and bit #33 will be set for a compat system call.
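
As a rough illustration of the encoding described above (a sketch only,
not the kernel patch or the strace change itself), a tracer could
classify the 64-bit value it reads back for eflags along these lines.
The enum and function names are hypothetical; only the bit positions
(#32 and #33) come from the description above.

```c
#include <stdint.h>

/* Hypothetical names; only the bit positions (#32, #33) come from the text. */
enum syscall_mode {
	MODE_UNKNOWN,	/* old kernel: neither bit set, fall back to CS */
	MODE_NATIVE64,	/* bit #32 set: native 64-bit system call */
	MODE_COMPAT32,	/* bit #33 set: compat system call */
};

static enum syscall_mode classify_eflags(uint64_t eflags)
{
	switch ((eflags >> 32) & 3) {
	case 1:
		return MODE_NATIVE64;
	case 2:
		return MODE_COMPAT32;
	default:
		/* 0 means an old kernel; 3 is not described, so treat it as unknown. */
		return MODE_UNKNOWN;
	}
}
```

A caller that gets MODE_UNKNOWN would presumably fall back to the old
CS-based check, which is what the "bits clear" case is meant to allow.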

And some testing says that it works. In particular, I have a patch to
strace-4.6 that is able to correctly decode my mixed-case binary that
uses both the compat system call and the native system calls from
64-bit long mode. Also, it looks like gdb ignores the high bits of
eflags, since it "knows" that eflags is just a 32-bit register even in
64-bit mode, so the fact that we set some random bits in there doesn't
end up being noisy for at least one debugger.

HOWEVER. I'm not going to guarantee that this is the right approach.
It seems to work, and it clearly gives people real information, but
whether this is the best way to do things or not is open.

The reason I picked 'eflags' was that it

 (a) was easy from an implementation standpoint, since we already have
to handle reading of eflags specially in ptrace (we have to fake out
the resume bit)

 (b) it "kind of" makes sense to make high bits be "system flags",
with low bits being "cpu flags", so it fits at least *some* kind of
conceptual model.

 (c) the other sane places to put it (high bits of CS and/or ORIG_AX)
were being used and compared as 64-bit values at least by strace.
Whether eflags works for all users, I have no idea, but generally you
would never compare eflags for one particular value - you might check
individual bits in eflags, but hopefully setting a few new bits should
not be something that any legacy user would ever really notice.

So there are reasons to think that my patch is sane, but...

Here's the strace patch, so people can look. I didn't even test it on
an old kernel, but the fallback case to the old behavior looks
trivial.

Comments?

                     Linus

[-- Attachment #2: strace.diff --]
[-- Type: text/x-patch, Size: 1031 bytes --]

 syscall.c |   21 +++++++++++++++++++--
 1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/syscall.c b/syscall.c
index e66ac0a95582..edd9cb804318 100644
--- a/syscall.c
+++ b/syscall.c
@@ -901,14 +901,31 @@ get_scno(struct tcb *tcp)
 		long val;
 		int pid = tcp->pid;
 
+		/* Check the high bits of eflags for processor mode */
+		if (upeek(tcp, 8*EFLAGS, &val) < 0)
+			return -1;
+		val >>= 32;
 		/* Check CS register value. On x86-64 linux it is:
 		 * 	0x33	for long mode (64 bit)
 		 * 	0x23	for compatibility mode (32 bit)
 		 * It takes only one ptrace and thus doesn't need
 		 * to be cached.
 		 */
-		if (upeek(tcp, 8*CS, &val) < 0)
-			return -1;
+		switch (val & 3) {
+		case 0:
+			/* Legacy case: check CS */
+			if (upeek(tcp, 8*CS, &val) < 0)
+				return -1;
+			break;
+		case 1:
+			/* "Long mode" value */
+			val = 0x33;
+			break;
+		case 2:
+			/* Compatibility mode */
+			val = 0x23;
+			break;
+		}
 		switch (val) {
 			case 0x23: currpers = 1; break;
 			case 0x33: currpers = 0; break;

^ permalink raw reply related	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 20:26                                                 ` Linus Torvalds
@ 2012-01-18 20:55                                                   ` H. Peter Anvin
  2012-01-18 21:01                                                     ` Linus Torvalds
  2012-02-06  8:32                                                   ` Indan Zupancic
  1 sibling, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-18 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

In the past, I have asked for a metaregister for this, rather than
hijacking a real hardware register.  Firstly because I suspect we're
going to need more than one bit like this, and second because it seems
cleaner to me.  However, when proposed in the past Roland McGrath
strongly opposed it for reasons which are unclear to me.

I would really like to not use a hack with the flags, because although
there currently aren't any flags in the high half of RFLAGS, they are
architecturally defined and could appear in the future.

If we're going to use bits in an existing register field I would be
happier if we used bits [31:16] of CS, which are unlikely to ever be
used for anything.

	-hpa


On 01/18/2012 12:26 PM, Linus Torvalds wrote:
> Added Peter to the cc, since this is now about some x86-specific
> things. Ingo was already cc'd earlier.
> 
> On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Using the high bits of 'eflags' might work. Hopefully nobody tests
>> that. IOW, something like the attached might work. It just sets bit#32
>> in eflags if the system call is a compat call.
> 
> So that description was bogus, it was what my original patch did, but
> not the one I actually sent out (Peter - you can find it on lkml,
> although the description below is probably sufficient for you to
> understand what it does, or the obvious nature of the attached patch
> for strace).
> 
> The one I sent out *unconditionally* sets one bit in the high bits of
> the returned value of the eflags register from ptrace(), very much on
> purpose. That way you can unambiguously see whether it's an old kernel
> (bits clear) or a new kernel that supports the feature. On a new
> kernel, bit #32 of eflags will be set for a native 64-bit system call,
> and bit #33 will be set for a compat system call.
> 
> And some testing says that it works. In particular, I have a patch to
> strace-4.6 that is able to correctly decode my mixed-case binary that
> uses both the compat system call and the native system calls from
> 64-bit long mode. Also, it looks like gdb ignores the high bits of
> eflags, since it "knows" that eflags is just a 32-bit register even in
> 64-bit mode, so the fact that we set some random bits in there doesn't
> end up being noisy for at least one debugger.
> 
> HOWEVER. I'm not going to guarantee that this is the right approach.
> It seems to work, and it clearly gives people real information, but
> whether this is the best way to do things or not is open.
> 
> The reason I picked 'eflags' was that it
> 
>  (a) was easy from an implementation standpoint, since we already have
> to handle reading of eflags specially in ptrace (we have to fake out
> the resume bit)
> 
>  (b) it "kind of" makes sense to make high bits be "system flags",
> with low bits being "cpu flags", so it fits at least *some* kind of
> conceptual model.
> 
>  (c) the other sane places to put it (high bits of CS and/or ORIG_AX)
> were being used and compared as 64-bit values at least by strace.
> Whether eflags works for all users, I have no idea, but generally you
> would never compare eflags for one particular value - you might check
> individual bits in eflags, but hopefully setting a few new bits should
> not be something that any legacy user would ever really notice.
> 
> So there are reasons to think that my patch is sane, but...
> 
> Here's the strace patch, so people can look. I didn't even test it on
> an old kernel, but the fallback case to the old behavior looks
> trivial.
> 
> Comments?
> 
>                      Linus


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 20:55                                                   ` H. Peter Anvin
@ 2012-01-18 21:01                                                     ` Linus Torvalds
  2012-01-18 21:04                                                       ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 21:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 12:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> If we're going to use bits in an existing register field I would be
> happier if we used bits [31:16] of CS, which are unlikely to ever be
> used for anything.

See my note about that: I would have preferred CS or ORIG_AX myself,
but that breaks existing binaries. So that isn't really an option.

                Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:01                                                     ` Linus Torvalds
@ 2012-01-18 21:04                                                       ` H. Peter Anvin
  2012-01-18 21:21                                                         ` H. Peter Anvin
  2012-01-18 21:26                                                         ` Linus Torvalds
  0 siblings, 2 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On 01/18/2012 01:01 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 12:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> If we're going to use bits in an existing register field I would be
>> happier if we used bits [31:16] of CS, which are unlikely to ever be
>> used for anything.
> 
> See my note about that: I would have preferred CS or ORIG_AX myself,
> but that breaks existing binaries. So that isn't really an option.
> 

Fair enough.  Sigh.  I still think an actual pseudo-register would be
better.

	-hpa


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:12                                             ` Oleg Nesterov
@ 2012-01-18 21:09                                               ` Chris Evans
  2012-01-23 16:56                                                 ` Oleg Nesterov
  2012-02-07 11:45                                               ` Indan Zupancic
  1 sibling, 1 reply; 235+ messages in thread
From: Chris Evans @ 2012-01-18 21:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

Thanks, Oleg. Seems like this would be a nice change to have. As we
can see, people do use ptrace() as a security technology.

With this in place, you can also (where possible) set up the tracee
with PR_SET_PDEATHSIG set to SIGKILL. You then have defences against
either the tracer or the tracee dying from a stray SIGKILL.
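
For reference, a minimal sketch of the tracee-side setup mentioned
above, assuming Linux and the prctl(2) PR_SET_PDEATHSIG option. The
helper names are hypothetical. The first helper would be called in the
child after fork() and before it is handed any untrusted work.

```c
#include <signal.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: ask the kernel to deliver SIGKILL to the calling
 * process when its parent (here, the tracer) dies.
 * Returns 0 on success, -1 on failure (errno is set by prctl).
 */
static int kill_me_when_parent_dies(void)
{
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL) < 0)
		return -1;
	return 0;
}

/* Read the setting back; returns the signal number, or -1 on failure. */
static int parent_death_signal(void)
{
	int sig = 0;

	if (prctl(PR_GET_PDEATHSIG, &sig) < 0)
		return -1;
	return sig;
}
```

Note that if the parent has already exited before the call is made, no
signal is delivered, so callers commonly check getppid() right after
setting this.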

On Wed, Jan 18, 2012 at 9:12 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/18, Oleg Nesterov wrote:
>>
>> On 01/17, Chris Evans wrote:
>> >
>> > 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>> > 2) Tracee traps
>> > 2b) Tracee could take a SIGKILL here
>> > 3) Tracer looks at registers; bad syscall
>> > 3b) Or tracee could take a SIGKILL here
>> > 4) The only way to stop the bad syscall from executing is to rewrite
>> > orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> > syscall has finished)
>> > 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> > pid (such as PTRACE_SETREGS) fails.
>> > 6) Syscall fork() executes; possible unsupervised process now running
>> > since the tracer wasn't expecting the fork() to be allowed.
>>
>> As for fork() in particular, it can't succeed after SIGKILL.
>>
>> But I agree, probably it makes sense to change ptrace_stop() to check
>> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
>> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>>
>>       -       return 0;
>>       +       return !fatal_signal_pending();
>>
>> (no, I do not literally mean the change above)
>>
> >> Not only for security. The current behaviour sometimes confuses the
>> users. Debugger sends SIGKILL to the tracee and assumes it should
>> die asap, but the tracee exits only after syscall.
>
> Something like the patch below.
>
> Oleg.
>
> --- x/include/linux/tracehook.h
> +++ x/include/linux/tracehook.h
> @@ -54,12 +54,12 @@ struct linux_binprm;
>  /*
>  * ptrace report for syscall entry and exit looks identical.
>  */
> -static inline void ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs)
>  {
>        int ptrace = current->ptrace;
>
>        if (!(ptrace & PT_PTRACED))
> -               return;
> +               return 0;
>
>        ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
>                send_sig(current->exit_code, current, 1);
>                current->exit_code = 0;
>        }
> +
> +       return fatal_signal_pending(current);
>  }
>
>  /**
> @@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
>  static inline __must_check int tracehook_report_syscall_entry(
>        struct pt_regs *regs)
>  {
> -       ptrace_report_syscall(regs);
> -       return 0;
> +       return ptrace_report_syscall(regs);
>  }
>
>  /**
>

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 12:12                                           ` Indan Zupancic
@ 2012-01-18 21:13                                             ` Chris Evans
  2012-01-19  0:14                                               ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Chris Evans @ 2012-01-18 21:13 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>> As far as I know, we fixed all races except symlink races caused by malicious
>>> code outside the jail.
>>
>> Are you sure? I've remembered possibly the worst one I encountered,
>> since my previous e-mail to Jamie:
>>
>> 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>
> How do you mean compromised? Tracees aren't trusted by definition. And fork is
> allowed in our jail, we're ptracing all tasks within the jail.

Right, the tracee isn't trusted because you're worried it _might_ get
compromised.
If it _does_ get compromised, you don't want it playing various tricks
to break out of the ptrace() sandbox.

>
>> 2) Tracee traps
>> 2b) Tracee could take a SIGKILL here
>> 3) Tracer looks at registers; bad syscall
>> 3b) Or tracee could take a SIGKILL here
>> 4) The only way to stop the bad syscall from executing is to rewrite
>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> syscall has finished)
>
> Yes, we rewrite it to -1.
>
>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> pid (such as PTRACE_SETREGS) fails.
>
> I assume that if a task can execute system calls and we get ptrace events
> for that, that we can do other ptrace operations too. Are you saying that
> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
> doesn't work but the task continues executing system calls? That would be
> a huge bug, but it seems very unlikely too, as the task is stopped and
> shouldn't be able to disappear till it is continued by the tracer.
>
> I mean, really? That would be stupid.
>
> If true we have to work around it by disallowing SIGKILL and just sending
> them ourselves within the jail. Meh.
>
>> 6) Syscall fork() executes; possible unsupervised process now running
>> since the tracer wasn't expecting the fork() to be allowed.
>
> We use PTRACE_O_TRACEFORK (or replace it with clone and set CLONE_PTRACE
> for 2.4 kernels. Yes, I check for CLONE_UNTRACED in clone calls.)
>
>>
>> All this ptrace() security headache is why vsftpd is waiting for
>> Will's seccomp enhancements to hit the kernel. Then they will be used
>> pronto.
>
> How will you avoid file path races with BPF?

There is typically no need for file-path based access control in an FTP server.
Take for example anonymous FTP, which will typically be inside a
chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
it, you can have it.


Cheers
Chris

>
> Greetings,
>
> Indan
>
>

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:04                                                       ` H. Peter Anvin
@ 2012-01-18 21:21                                                         ` H. Peter Anvin
  2012-01-18 21:51                                                           ` Roland McGrath
  2012-01-18 21:26                                                         ` Linus Torvalds
  1 sibling, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On 01/18/2012 01:04 PM, H. Peter Anvin wrote:
> On 01/18/2012 01:01 PM, Linus Torvalds wrote:
>> On Wed, Jan 18, 2012 at 12:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> If we're going to use bits in an existing register field I would be
>>> happier if we used bits [31:16] of CS, which are unlikely to ever be
>>> used for anything.
>>
>> See my note about that: I would have preferred CS or ORIG_AX myself,
>> but that breaks existing binaries. So that isn't really an option.
>>
> 
> Fair enough.  Sigh.  I still think an actual pseudo-register would be
> better.
> 

Roland, could you refresh my memory what your objection to this was?

	-hpa


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:04                                                       ` H. Peter Anvin
  2012-01-18 21:21                                                         ` H. Peter Anvin
@ 2012-01-18 21:26                                                         ` Linus Torvalds
  2012-01-18 21:30                                                           ` H. Peter Anvin
  2012-01-19  1:45                                                           ` Indan Zupancic
  1 sibling, 2 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 21:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 1:04 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Fair enough.  Sigh.  I still think an actual pseudo-register would be
> better.

.. and that breaks existing binaries too, because the indexing is
based on offsets into "struct pt_regs", and while we *could* change
that - leave pt_regs untouched but add a new virtual register - it
would be problematic.
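
As a rough illustration of the offset problem (a sketch only: the
structs below are made-up, cut-down layouts, not the real struct
pt_regs, and peek_word() is a hypothetical stand-in for an offset-based
read such as PTRACE_PEEKUSER):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout an old tracer binary was compiled against. */
struct regs_old {
	unsigned long ax;
	unsigned long orig_ax;
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
};

/* Same layout with a made-up pseudo-register inserted before cs. */
struct regs_inserted {
	unsigned long ax;
	unsigned long orig_ax;
	unsigned long ip;
	unsigned long entry_kind;	/* hypothetical new field */
	unsigned long cs;
	unsigned long flags;
};

/* Same layout with the pseudo-register appended at the end instead. */
struct regs_appended {
	unsigned long ax;
	unsigned long orig_ax;
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long entry_kind;	/* hypothetical new field */
};

/* Read one word at a byte offset, the way an offset-indexed API would. */
static unsigned long peek_word(const void *regs, size_t size, size_t offset)
{
	unsigned long v;

	assert(offset + sizeof(v) <= size);
	memcpy(&v, (const char *)regs + offset, sizeof(v));
	return v;
}
```

An old binary that reads at offsetof(struct regs_old, cs) gets
entry_kind from the "inserted" layout and the real cs only from the
"appended" one. The toy model shows nothing beyond that offset shift.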

We could add a whole new ptrace() access command (eg
PTRACE_GETSYSTEMREGSET), of course. But that's a lot of effort for
very little gain.

So on the whole, putting it in eflags seemed like the *much* simpler approach.

                 Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:26                                                         ` Linus Torvalds
@ 2012-01-18 21:30                                                           ` H. Peter Anvin
  2012-01-18 21:42                                                             ` Linus Torvalds
  2012-01-19  1:45                                                           ` Indan Zupancic
  1 sibling, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On 01/18/2012 01:26 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:04 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Fair enough.  Sigh.  I still think an actual pseudo-register would be
>> better.
> 
> .. and that breaks existing binaries too, because the indexing is
> based on offsets into "struct pt_regs", and while we *could* change
> that - leave pt_regs untouched but add a new virtual register - it
> would be problematic.
> 
> We could add a whole new ptrace() access command (eg
> PTRACE_GETSYSTEMREGSET), of course. But that's a lot of effort for
> very little gain.
> 
> So on the whole, putting it in eflags seemed like the *much* simpler approach.
> 

I would have assumed it would be a new register set (which could be
expanded in the future if we have additional system information to provide.)

	-hpa


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:30                                                           ` H. Peter Anvin
@ 2012-01-18 21:42                                                             ` Linus Torvalds
  2012-01-18 21:47                                                               ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 21:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 1:30 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I would have assumed it would be a new register set (which could be
> expanded in the future if we have additional system information to provide.)

Well, I really don't think we want to expose much. In fact, I'd argue
we should expose as little as humanly possible.

Which at this point is literally just a single bit (and effectively
another bit to say "we support the new feature").

So...

              Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:42                                                             ` Linus Torvalds
@ 2012-01-18 21:47                                                               ` H. Peter Anvin
  0 siblings, 0 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On 01/18/2012 01:42 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:30 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I would have assumed it would be a new register set (which could be
>> expanded in the future if we have additional system information to provide.)
> 
> Well, I really don't think we want to expose much. In fact, I'd argue
> we should expose as little as humanly possible.
> 
> Which at this point is literally just a single bit (and effectively
> another bit to say "we support the new feature").
> 
> So...
> 

I actually think we need to also have a bit for some of the 32-bit entry
point differences, since the registers have different meanings for them.
 We have kluges in place for them, but those kluges cause their own
problems when registers are modified.

So that means at least four states (SYSCALL64, SYSENTER, SYSCALL32, INT
80) plus the presence bit.  Furthermore, three out of those states apply
even to pure 32-bit kernels.
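
One possible shape for such a field, as a sketch only: the names, the
bit positions and the three-bit encoding below are all assumptions made
for illustration, not anything proposed in this thread.

```c
#include <assert.h>

/* The four entry paths named above; numeric values are hypothetical. */
enum entry_kind {
	ENTRY_SYSCALL64 = 0,
	ENTRY_SYSENTER  = 1,
	ENTRY_SYSCALL32 = 2,
	ENTRY_INT80     = 3,
};

#define ENTRY_KIND_MASK	0x3u	/* two bits cover four states */
#define ENTRY_PRESENT	0x4u	/* hypothetical "kernel fills this in" bit */

/* Pack an entry kind together with the presence bit. */
static unsigned int entry_encode(enum entry_kind kind)
{
	assert((unsigned int)kind <= ENTRY_KIND_MASK);
	return ENTRY_PRESENT | (unsigned int)kind;
}

/*
 * Returns 1 and stores the kind if the presence bit is set.
 * Returns 0 if it is clear, i.e. the kernel did not report anything.
 */
static int entry_decode(unsigned int field, enum entry_kind *kind)
{
	if (!(field & ENTRY_PRESENT))
		return 0;
	*kind = (enum entry_kind)(field & ENTRY_KIND_MASK);
	return 1;
}

/* Every path except the 64-bit SYSCALL also exists on a 32-bit kernel. */
static int entry_possible_on_32bit_kernel(enum entry_kind kind)
{
	return kind != ENTRY_SYSCALL64;
}
```

Without the presence bit, an all-zero field from an old kernel could
not be told apart from ENTRY_SYSCALL64 in this encoding.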

	-hpa

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:21                                                         ` H. Peter Anvin
@ 2012-01-18 21:51                                                           ` Roland McGrath
  2012-01-18 21:53                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Roland McGrath @ 2012-01-18 21:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Wed, Jan 18, 2012 at 1:21 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Roland, could you refresh my memory what your objection to this was?

Sorry, I don't really recall.  I could dig through old email archives, but
they'd be archives of messages I sent to you, so you should have them too.

The only principle in this area I recall having an opinion about is that
we should not mix up things that are bona fide user-visible state with
things that aren't.  (By "user-visible" I mean things that the task in
question can see or affect by just doing normal instructions, as opposed
to things only controlled via ptrace, such as the debug registers.)  But
that principle is not really being violated here.

I recall that you and I discussed making the path-of-entry visible somehow
and I was in favor of doing that.  As I recall it, we just never bothered
to follow through.

There are all the concerns about obscure ABI compatibility with
expectations of existing debuggers and so forth, which Linus has mentioned.
For that I can accept his point that things today so mishandle the
int80-from-64 case that something like a new meaning for high bits of
orig_ax or whatnot in just that case would not be actually problematic.
When you and I were discussing a more general feature of distinguishing
int80 from sysenter from syscall from traps from asynchronous interrupts,
that was of more concern.

I do feel strongly that any new means of exposing bona fide user state
ought to be done via the user_regset mechanism.  (i.e., either overloading
some existing user_regs_struct bits if that truly is harmless to
compatibility, or adding a new regset flavor.)  That way it is
automatically recorded in core files, accessible with PTRACE_GETREGSET,
etc.  (But I'm not really working on this stuff any more, so I'm out of the
business of arguing strenuously about such opinions.)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:51                                                           ` Roland McGrath
@ 2012-01-18 21:53                                                             ` H. Peter Anvin
  2012-01-18 23:28                                                               ` Linus Torvalds
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:53 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On 01/18/2012 01:51 PM, Roland McGrath wrote:
> 
> There are all the concerns about obscure ABI compatibility with
> expectations of existing debuggers and so forth, which Linus has mentioned.
> For that I can accept his point that things today so mishandle the
> int80-from-64 case that something like a new meaning for high bits of
> orig_ax or whatnot in just that case would not be actually problematic.
> When you and I were discussing a more general feature of distinguishing
> int80 from sysenter from syscall from traps from asynchronous interrupts,
> that was of more concern.
> 
> I do feel strongly that any new means of exposing bona fide user state
> ought to be done via the user_regset mechanism.  (i.e., either overloading
> some existing user_regs_struct bits if that truly is harmless to
> compatibility, or adding a new regset flavor.)  That way it is
> automatically recorded in core files, accessible with PTRACE_GETREGSET,
> etc.  (But I'm not really working on this stuff any more, so I'm out of the
> business of arguing strenuously about such opinions.)
> 

I think we can obviously agree that regsets is the only way to go for
any kind of new state.

	-hpa


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:53                                                             ` H. Peter Anvin
@ 2012-01-18 23:28                                                               ` Linus Torvalds
  2012-01-19  0:38                                                                 ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-18 23:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Wed, Jan 18, 2012 at 1:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I think we can obviously agree that regsets is the only way to go for
> any kind of new state.

So I really don't necessarily agree at all.

Exactly because there is a heavy burden to introducing new models.
It's not only relatively much more kernel code, it's also relatively
much more painful for user code. If we can hide it in existing
structures, user code is *much* better off, because any existing code
to get the state will just continue to work. Otherwise, you need to
have the code to figure out the new structures (how do you compile it
without the new kernel headers?), you need to do the extra accesses
conditionally etc etc.

There's a real cost to introducing new interfaces. There's a *reason*
people try to make do with old ones.

          Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 21:13                                             ` Chris Evans
@ 2012-01-19  0:14                                               ` Indan Zupancic
  2012-01-19  8:16                                                 ` Chris Evans
  2012-01-19 15:40                                                 ` Jamie Lokier
  0 siblings, 2 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-19  0:14 UTC (permalink / raw)
  To: Chris Evans
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, January 18, 2012 22:13, Chris Evans wrote:
> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>> 2) Tracee traps
>>> 2b) Tracee could take a SIGKILL here
>>> 3) Tracer looks at registers; bad syscall
>>> 3b) Or tracee could take a SIGKILL here
>>> 4) The only way to stop the bad syscall from executing is to rewrite
>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>>> syscall has finished)
>>
>> Yes, we rewrite it to -1.
>>
>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>>> pid (such as PTRACE_SETREGS) fails.
>>
>> I assume that if a task can execute system calls and we get ptrace events
>> for that, that we can do other ptrace operations too. Are you saying that
>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
>> doesn't work but the task continues executing system calls? That would be
>> a huge bug, but it seems very unlikely too, as the task is stopped and
>> shouldn't be able to disappear till it is continued by the tracer.
>>
>> I mean, really? That would be stupid.

Okay, I tested this scenario and you're right, we're screwed.

What the hell guys? What about other PID checks in the kernel, are they still
safe if the process looks dead but is still active? Or is it a ptrace-only
problem?

>> If true we have to work around it by disallowing SIGKILL and just sending
>> them ourselves within the jail. Meh.

I guess this helps a bit. It doesn't prevent external signals, but prisoners
don't have control over that.

Is this SIGKILL specific or is it true for all task ending signals?

>> How will you avoid file path races with BPF?
>
> There is typically no need for file-path based access control in an FTP server.
> Take for example anonymous FTP, which will typically be inside a
> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
> it, you can have it.

Ah, you count on having root access. We don't.

Do you know any more crazy security destroying holes?

Thanks,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:00                                           ` Oleg Nesterov
  2012-01-18 17:12                                             ` Oleg Nesterov
@ 2012-01-19  0:29                                             ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-19  0:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Evans, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, January 18, 2012 18:00, Oleg Nesterov wrote:
> On 01/17, Chris Evans wrote:
>>
>> 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>> 2) Tracee traps
>> 2b) Tracee could take a SIGKILL here
>> 3) Tracer looks at registers; bad syscall
>> 3b) Or tracee could take a SIGKILL here
>> 4) The only way to stop the bad syscall from executing is to rewrite
>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> syscall has finished)
>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> pid (such as PTRACE_SETREGS) fails.
>> 6) Syscall fork() executes; possible unsupervised process now running
>> since the tracer wasn't expecting the fork() to be allowed.
>
> As for fork() in particular, it can't succeed after SIGKILL.

That was sadly exactly the system call I used for testing my code...

> But I agree, probably it makes sense to change ptrace_stop() to check
> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>
> 	-	return 0;
> 	+	return !fatal_signal_pending();
>
> (no, I do not literally mean the change above)
>
> Not only for security. The current behaviour sometimes confuses the
> users. Debugger sends SIGKILL to the tracee and assumes it should
> die asap, but the tracee exits only after syscall.

I didn't expect the tracee to die asap when sending SIGKILL, but I
did for PTRACE_KILL.

Improving this behaviour is highly appreciated, thanks!

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 23:28                                                               ` Linus Torvalds
@ 2012-01-19  0:38                                                                 ` H. Peter Anvin
  2012-01-20 21:51                                                                   ` Denys Vlasenko
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-19  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On 01/18/2012 03:28 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I think we can obviously agree that regsets is the only way to go for
>> any kind of new state.
> 
> So I really don't necessarily agree at all.
> 
> Exactly because there is a heavy burden to introducing new models.
> It's not only relatively much more kernel code, it's also relatively
> much more painful for user code. If we can hide it in existing
> structures, user code is *much* better off, because any existing code
> to get the state will just continue to work. Otherwise, you need to
> have the code to figure out the new structures (how do you compile it
> without the new kernel headers?), you need to do the extra accesses
> conditionally etc etc.
> 
> There's a real cost to introducing new interfaces. There's a *reason*
> people try to make do with old ones.
> 

Of course.  However, the whole point with regsets is that at the very
least the vast majority of the infrastructure is generic and extends
without a bunch of new machinery.  What you are saying is "we might be
able to get away with existing state", what I'm saying is "if we add
state it should be a regset".

The question of whether this should be new state is currently open.  I
personally would still prefer if this didn't overlay real CPU state.

	-hpa



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:01                                 ` Andrew Lutomirski
@ 2012-01-19  1:06                                   ` Indan Zupancic
  2012-01-19  1:19                                     ` Andrew Lutomirski
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-19  1:06 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Wed, January 18, 2012 02:01, Andrew Lutomirski wrote:
> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
>>> On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> On 01/17, Andrew Lutomirski wrote:
>>>>>
>>>>> (is_compat_task says whether the executable was marked as 32-bit.  The
>>>>> actual execution mode is determined by the cs register, which the user
>>>>> can control.
>>>>
>>>> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
>>>> along with TS_COMPAT).
>>>>
>>>> TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
>>>> 64-bit or not, we should treat it as 32-bit in this case.
>>>
>>> I think you're right, and checking which entry was used is better than
>>> checking the cs register (since 64-bit code can use int80).  That's
>>> what I get for insufficiently careful reading of the assembly.  (And
>>> for going from memory from when I wrote the vsyscall emulation code --
>>> that code is entered from a page fault, so the entry point used is
>>> irrelevant.)
>>
>> Wait: If a task is set to 64 bit mode, but calls into the kernel via
>> int 0x80 it's changed to 32 bit mode for that system call and back to
>> 64 bit mode when the system call is finished!?
>>
>> Our ptrace jailer is checking cs to figure out if a task is a compat task
>> or not, if the kernel can change that behind our back it means our jailer
>> isn't secure for x86_64 with compat enabled. Or is cs changed before the
>> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
>> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
>> there another way?
>
> I don't know what your ptrace jailer does.  But a task can switch
> itself between 32-bit and 64-bit execution at will, and there's
> nothing the kernel can do about it.  (That isn't quite true -- in
> theory the kernel could fiddle with the GDT, but that would be
> expensive and wouldn't work on Xen.)

That's why we don't cache the CS value but check it for every system call.
But you said elsewhere that checking CS isn't always correct either.
I grepped arch/x86 for "user_64bit_mode", but couldn't find anything,
but maybe my kernel sources are too old, I haven't updated this system
for almost a year. The current code only handles 0x23 and 0x33 and kills
the jail if it encounters anything else.
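
Roughly, that per-syscall check looks like the sketch below. The names
(jail_mode, classify_cs, on_syscall_stop, struct jail) are made up for
this illustration and are not the jailer's real code; 0x23 and 0x33 are
the usual 32-bit and 64-bit user code segment selectors on x86_64.
As discussed in this thread, cs == 0x33 does not rule out an
int 0x80 entry from 64-bit code, so this check alone does not identify
the syscall ABI.

```c
#include <assert.h>

enum jail_mode {
	JAIL_MODE_32,	/* cs == 0x23: 32-bit user code segment */
	JAIL_MODE_64,	/* cs == 0x33: 64-bit user code segment */
	JAIL_MODE_KILL,	/* anything else: give up and kill the jail */
};

/* Hypothetical per-jail state; only the field this sketch needs. */
struct jail {
	int alive;
};

/* Map a cs selector value to the mode the jailer will assume. */
static enum jail_mode classify_cs(unsigned long cs)
{
	switch (cs) {
	case 0x23:
		return JAIL_MODE_32;
	case 0x33:
		return JAIL_MODE_64;
	default:
		return JAIL_MODE_KILL;
	}
}

/* Called at every syscall stop: cs is re-read each time, never cached. */
static enum jail_mode on_syscall_stop(struct jail *j, unsigned long cs)
{
	assert(j->alive);	/* a killed jail must not see further stops */

	enum jail_mode mode = classify_cs(cs);
	if (mode == JAIL_MODE_KILL)
		j->alive = 0;
	return mode;
}
```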

> That being said, is_compat_task is apparently a good indication of
> whether the current *syscall* entry is a 64-bit syscall or a 32-bit
> syscall.  Perhaps the function should be renamed to in_compat_syscall,
> because that's what it does.

That seems like a good idea.

>
>>
>> I think this behaviour is so unexpected that it can only cause security
>> problems in the long run. Is anyone counting on this? Where is this
>> behaviour documented?
>
> Nowhere, I think.

Such is life.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  1:06                                   ` Indan Zupancic
@ 2012-01-19  1:19                                     ` Andrew Lutomirski
  2012-01-19  1:47                                       ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-19  1:19 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Wed, Jan 18, 2012 at 5:06 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 02:01, Andrew Lutomirski wrote:
>> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>> I don't know what your ptrace jailer does.  But a task can switch
>> itself between 32-bit and 64-bit execution at will, and there's
>> nothing the kernel can do about it.  (That isn't quite true -- in
>> theory the kernel could fiddle with the GDT, but that would be
>> expensive and wouldn't work on Xen.)
>
> That's why we don't cache the CS value but check it for every system call.
> But you said elsewhere that checking CS isn't always correct either.
> I grepped arch/x86 for "user_64bit_mode", but couldn't find anything,
> but maybe my kernel sources are too old, I haven't updated this system
> for almost a year. The current code only handles 0x23 and 0x33 and kills
> the jail if it encounters anything else.

I think you're hosed on Xen, then.  Xen regularly runs with a
different Xen-specific cs value.

--Andy

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:26                                                         ` Linus Torvalds
  2012-01-18 21:30                                                           ` H. Peter Anvin
@ 2012-01-19  1:45                                                           ` Indan Zupancic
  2012-01-19  2:16                                                             ` H. Peter Anvin
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-19  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, January 18, 2012 22:26, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:04 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Fair enough.  Sigh.  I still think an actual pseudo-register would be
>> better.
>
> .. and that breaks existing binaries too, because the indexing is
> based on offsets into "struct pt_regs", and while we *could* change
> that - leave pt_regs untouched but add a new virtual register - it
> would be problematic.
>
> We could add a whole new ptrace() access command (eg
> PTRACE_GETSYSTEMREGSET), of course. But that's a lot of effort for
> very little gain.
>
> So on the whole, putting it in eflags seemed like the *much* simpler approach.

For security reasons it should be impossible for userspace to set those bits
itself, otherwise the tracer can be easily fooled on an old kernel. That
seems to be the case for the higher bits of eflags, so eflags would work. The
current code already checks cs, so also checking eflags would be very easy to add.
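
A sketch of what the tracer-side eflags check could look like. The bit
positions and names are purely hypothetical (the thread does not fix
any), and the sketch assumes the flags word the tracer reads is 64 bits
wide with the upper half not settable by the tracee:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical positions, chosen only for this illustration. */
#define HYP_EFLAGS_SUPPORTED	(UINT64_C(1) << 62)	/* "kernel reports this" */
#define HYP_EFLAGS_COMPAT	(UINT64_C(1) << 63)	/* "32-bit syscall entry" */

static_assert(((HYP_EFLAGS_SUPPORTED | HYP_EFLAGS_COMPAT) &
	       UINT64_C(0xffffffff)) == 0,
	      "hypothetical bits must sit above the low 32 flag bits");

enum syscall_abi {
	ABI_UNKNOWN,	/* old kernel: no information, do not guess */
	ABI_NATIVE64,
	ABI_COMPAT32,
};

/* Trust the compat bit only when the support bit is set as well. */
static enum syscall_abi abi_from_eflags(uint64_t eflags)
{
	if (!(eflags & HYP_EFLAGS_SUPPORTED))
		return ABI_UNKNOWN;
	return (eflags & HYP_EFLAGS_COMPAT) ? ABI_COMPAT32 : ABI_NATIVE64;
}
```

On an old kernel both bits read as zero, so the tracer sees ABI_UNKNOWN
and can refuse to run rather than guess.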

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  1:19                                     ` Andrew Lutomirski
@ 2012-01-19  1:47                                       ` Indan Zupancic
  0 siblings, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-19  1:47 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Thu, January 19, 2012 02:19, Andrew Lutomirski wrote:
> On Wed, Jan 18, 2012 at 5:06 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Wed, January 18, 2012 02:01, Andrew Lutomirski wrote:
>>> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>>> I don't know what your ptrace jailer does.  But a task can switch
>>> itself between 32-bit and 64-bit execution at will, and there's
>>> nothing the kernel can do about it.  (That isn't quite true -- in
>>> theory the kernel could fiddle with the GDT, but that would be
>>> expensive and wouldn't work on Xen.)
>>
>> That's why we don't cache the CS value but check it for every system call.
>> But you said elsewhere that checking CS isn't always correct either.
>> I grepped arch/x86 for "user_64bit_mode", but couldn't find anything,
>> but maybe my kernel sources are too old, I haven't updated this system
>> for almost a year. The current code only handles 0x23 and 0x33 and kills
>> the jail if it encounters anything else.
>
> I think you're hosed on Xen, then.  Xen regularly runs with a
> different Xen-specific cs value.

That's fine as long as a cs value of 0x23 or 0x33 gives reliable information.
Not running is highly preferred over running insecurely. Security first,
functionality second.
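
The fail-closed check described above can be sketched roughly as follows.
This is only an illustration, not the jailer's actual code: the enum and
the name classify_cs() are hypothetical, and the only facts taken from
this thread are the two selector values and the "kill on anything else"
policy. The result says nothing about which entry path a given system
call used.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum cs_mode { CS_MODE_32, CS_MODE_64, CS_MODE_UNKNOWN };

/*
 * Classify a tracee's CS selector, re-read at every syscall stop
 * rather than cached. Only the two values named above are accepted;
 * anything else (for example a Xen-specific selector) is reported as
 * unknown so that the caller can kill the jail instead of guessing.
 */
static enum cs_mode classify_cs(uint64_t cs)
{
	switch (cs) {
	case 0x23:
		return CS_MODE_32;
	case 0x33:
		return CS_MODE_64;
	default:
		return CS_MODE_UNKNOWN;
	}
}
```

A caller would treat CS_MODE_UNKNOWN as fatal for the whole jail.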

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19  1:45                                                           ` Indan Zupancic
@ 2012-01-19  2:16                                                             ` H. Peter Anvin
  0 siblings, 0 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-19  2:16 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On 01/18/2012 05:45 PM, Indan Zupancic wrote:
> 
> For security reasons it should be impossible for userspace to set those bits
> themselves, otherwise the tracer can be easily fooled on an old kernel. That
> seems to be the case for the higher bits of eflags, so eflags would work. And
> the current code checks cs, also checking eflags would be very easy to add.
> 

I think this goes without saying, and isn't an issue for the options
currently on the table (including regset).

	-phpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  0:14                                               ` Indan Zupancic
@ 2012-01-19  8:16                                                 ` Chris Evans
  2012-01-19 11:34                                                   ` Indan Zupancic
  2012-01-19 15:40                                                 ` Jamie Lokier
  1 sibling, 1 reply; 235+ messages in thread
From: Chris Evans @ 2012-01-19  8:16 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, Jan 18, 2012 at 4:14 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 22:13, Chris Evans wrote:
>> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>>> 2) Tracee traps
>>>> 2b) Tracee could take a SIGKILL here
>>>> 3) Tracer looks at registers; bad syscall
>>>> 3b) Or tracee could take a SIGKILL here
>>>> 4) The only way to stop the bad syscall from executing is to rewrite
>>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>>>> syscall has finished)
>>>
>>> Yes, we rewrite it to -1.
>>>
>>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>>>> pid (such as PTRACE_SETREGS) fails.
>>>
>>> I assume that if a task can execute system calls and we get ptrace events
>>> for that, that we can do other ptrace operations too. Are you saying that
>>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
>>> doesn't work but the task continues executing system calls? That would be
>>> a huge bug, but it seems very unlikely too, as the task is stopped and
>>> shouldn't be able to disappear till it is continued by the tracer.
>>>
>>> I mean, really? That would be stupid.
>
> Okay, I tested this scenario and you're right, we're screwed.
>
> What the hell guys?

Steady on :) ptrace() has never been sold as a technology upon which
it's safe to build security solutions.

> What about other PID checks in the kernel, are they still
> safe if the process looks dead but is still active? Or is it a ptrace-only
> problem?
>
>>> If true we have to work around it by disallowing SIGKILL and just sending
>>> them ourselves within the jail. Meh.
>
> I guess this helps a bit. It doesn't prevent external signals, but prisoners
> don't have control over that.

Well.... a prisoner may be able to play other tricks:
- Allocate lots of memory... kernel may start spraying around SIGKILLs
- Sending SIGKILL via prctl()
- Sending SIGKILL via fcntl()
- Sending SIGKILL via clone()

>
> Is this SIGKILL specific or is it true for all task ending signals?

Can't remember - try it?

>
>>> How will you avoid file path races with BPF?
>>
>> There is typically no need for file-path based access control in an FTP server.
>> Take for example anonymous FTP, which will typically be inside a
>> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
>> it, you can have it.
>
> Ah, you count on having root access. We don't.
>
> Do you know any more crazy security destroying holes?

Try spraying SIGCONT and / or SIGSTOP at tracees. It may be possible
to confuse the tracer about whether a SIGTRAP event is syscall entry
or exit.
Try doing an execve() that fails. May cause similar state confusion in
the tracer.


Cheers
Chris

>
> Thanks,
>
> Indan
>
>


* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  8:16                                                 ` Chris Evans
@ 2012-01-19 11:34                                                   ` Indan Zupancic
  2012-01-19 16:11                                                     ` Jamie Lokier
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-19 11:34 UTC (permalink / raw)
  To: Chris Evans
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Thu, January 19, 2012 09:16, Chris Evans wrote:
> On Wed, Jan 18, 2012 at 4:14 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Wed, January 18, 2012 22:13, Chris Evans wrote:
>>> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>>>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>>>> 2) Tracee traps
>>>>> 2b) Tracee could take a SIGKILL here
>>>>> 3) Tracer looks at registers; bad syscall
>>>>> 3b) Or tracee could take a SIGKILL here
>>>>> 4) The only way to stop the bad syscall from executing is to rewrite
>>>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>>>>> syscall has finished)
>>>>
>>>> Yes, we rewrite it to -1.
>>>>
>>>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>>>>> pid (such as PTRACE_SETREGS) fails.
>>>>
>>>> I assume that if a task can execute system calls and we get ptrace events
>>>> for that, that we can do other ptrace operations too. Are you saying that
>>>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
>>>> doesn't work but the task continues executing system calls? That would be
>>>> a huge bug, but it seems very unlikely too, as the task is stopped and
>>>> shouldn't be able to disappear till it is continued by the tracer.
>>>>
>>>> I mean, really? That would be stupid.
>>
>> Okay, I tested this scenario and you're right, we're screwed.
>>
>> What the hell guys?
>
> Steady on :) ptrace() has never been sold as a technology upon which
> its safe to build security solutions.

Well, that can be said of pretty much all kernel functionality.
That is no excuse for crazy behaviour.

I more or less fixed it by turning all SIGKILLs into SIGTERMs.
Perhaps I should use a more obscure signal instead.

>> What about other PID checks in the kernel, are they still
>> safe if the process looks dead but is still active? Or is it a ptrace-only
>> problem?
>>
>>>> If true we have to work around it by disallowing SIGKILL and just sending
>>>> them ourselves within the jail. Meh.
>>
>> I guess this helps a bit. It doesn't prevent external signals, but prisoners
>> don't have control over that.
>
> Well.... a prisoner may be able to play other tricks:
> - Allocate lots of memory... kernel may start spraying around SIGKILLs
> - Sending SIGKILL via prctl()

prctl is disallowed within our jail. Did you have PR_SET_PDEATHSIG in mind?
But doesn't the tracer become the parent when ptracing or not for this?
Or were you thinking about enabling SECCOMP and counting on the SIGKILL
being process-wide instead of thread-specific?

> - Sending SIGKILL via fcntl()

I haven't written the fcntl demultiplexor yet, but I missed that fcntl
could be used for sending signals. I knew there was wacky stuff in there,
but didn't expect it to be that bad. Thanks.

> - Sending SIGKILL via clone()

How? And can you send it to another process than yourself?

>
>>
>> Is this SIGKILL specific or is it true for all task ending signals?
>
> Can't remember - try it?

Tried: It's safe with SIGTERM, so I assume the others are fine too.
I'll double check though...

>>
>>>> How will you avoid file path races with BPF?
>>>
>>> There is typically no need for file-path based access control in an FTP server.
>>> Take for example anonymous FTP, which will typically be inside a
>>> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
>>> it, you can have it.
>>
>> Ah, you count on having root access. We don't.
>>
>> Do you know any more crazy security destroying holes?
>
> Try spraying SIGCONT and / or SIGSTOP at tracees. It may be possible
> to confuse the tracer about whether a SIGTRAP event is syscall entry
> or exit.

Yes, heard about that weirdness before, but it's all ignored. We're
using PTRACE_O_TRACESYSGOOD.

> Try doing an execve() that fails. May cause similar state confusion in
> the tracer.

Our jailer pretty much ignores all signals and only handles syscalls
and task exits. We actually check execve's return value to know if we
have to do our stuff or not.
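
For reference, the PTRACE_O_TRACESYSGOOD test mentioned above usually
looks something like the sketch below. The helper name is_syscall_stop()
is made up for this example, and it assumes the option has already been
set on the tracee and that status comes from waitpid(). It does not tell
entry from exit; the tracer still has to track that itself.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * With PTRACE_O_TRACESYSGOOD set, syscall stops are reported with
 * bit 7 set in the stop signal (SIGTRAP | 0x80), so they cannot be
 * confused with a SIGTRAP that arrived as an ordinary signal.
 */
static bool is_syscall_stop(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```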

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  0:14                                               ` Indan Zupancic
  2012-01-19  8:16                                                 ` Chris Evans
@ 2012-01-19 15:40                                                 ` Jamie Lokier
  1 sibling, 0 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-19 15:40 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Chris Evans, Andi Kleen, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

Indan Zupancic wrote:
> On Wed, January 18, 2012 22:13, Chris Evans wrote:
> > On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
> >> On Wed, January 18, 2012 06:43, Chris Evans wrote:
> >>> 2) Tracee traps
> >>> 2b) Tracee could take a SIGKILL here
> >>> 3) Tracer looks at registers; bad syscall
> >>> 3b) Or tracee could take a SIGKILL here
> >>> 4) The only way to stop the bad syscall from executing is to rewrite
> >>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> >>> syscall has finished)
> >>
> >> Yes, we rewrite it to -1.
> >>
> >>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> >>> pid (such as PTRACE_SETREGS) fails.
> >>
> >> I assume that if a task can execute system calls and we get ptrace events
> >> for that, that we can do other ptrace operations too. Are you saying that
> >> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
> >> doesn't work but the task continues executing system calls? That would be
> >> a huge bug, but it seems very unlikely too, as the task is stopped and
> >> shouldn't be able to disappear till it is continued by the tracer.
> >>
> >> I mean, really? That would be stupid.
> 
> Okay, I tested this scenario and you're right, we're screwed.

Ha!

Perhaps this could be fixed generically in
tracehook_report_syscall_entry(), for those architectures which bother
to call it and bother to disable the syscall if it says to.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:38                                                 ` Andrew Lutomirski
@ 2012-01-19 16:01                                                   ` Jamie Lokier
  2012-01-19 16:13                                                     ` Andrew Lutomirski
  2012-01-19 19:21                                                     ` Linus Torvalds
  0 siblings, 2 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-19 16:01 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

Andrew Lutomirski wrote:
> It's reasonable, obvious, and even more wrong than it appears.  On
> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
> (I got bitten by this in some iteration of the vsyscall emulation
> patches -- see user_64bit_mode for the correct and
> unusable-from-user-mode way to do this.)

Here it is:

	static inline bool user_64bit_mode(struct pt_regs *regs)
	{
	#ifndef CONFIG_PARAVIRT
		/*
		 * On non-paravirt systems, this is the only long mode CPL 3
		 * selector.  We do not allow long mode selectors in the LDT.
		 */
		return regs->cs == __USER_CS;
	#else
		/* Headers are too twisted for this to go in paravirt.h. */
		return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
	#endif
	}

Perhaps userspace can do that.
Would it be right for a ptracer to say:

   CS == 0x23 -> 32-bit
   (CS & 4)   -> 32-bit (LDT, "we do not allow long mode selectors in the LDT")
   else       -> 64-bit (__USER_CS or some other GDT entry which must be pv_info's)

I.e. assume that no other *GDT* CS values are available to userspace?
There are other 32-bit GDT entries, but are they not all for data or kernel use only?
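
Written out as code, the proposed rule would be roughly the sketch below.
The enum and the name guess_mode_from_cs() are hypothetical, and the last
branch encodes exactly the assumption being asked about here (that no
other user-usable GDT code selector is 32-bit), so it is only as good as
that assumption.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum mode_guess { GUESS_32BIT, GUESS_64BIT };

static enum mode_guess guess_mode_from_cs(uint16_t cs)
{
	if (cs == 0x23)
		return GUESS_32BIT;	/* the usual 32-bit user code selector */
	if (cs & 4)
		return GUESS_32BIT;	/* TI bit set: selector refers to the LDT */
	return GUESS_64BIT;		/* assumed: any other GDT code selector */
}
```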

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 18:24                                             ` Andi Kleen
@ 2012-01-19 16:04                                               ` Jamie Lokier
  2012-01-20  0:21                                                 ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-19 16:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin Mares, Linus Torvalds, Andi Kleen, Indan Zupancic,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

Andi Kleen wrote:
> > Not everybody. There are programs which try hard to distinguish between
> > int80 and syscall. One such example is a sandbox for programming contests
> > I wrote several years ago. It analyses the instruction before EIP and as
> > it does not allow threads nor executing writeable memory, it should be
> > correct.
> 
> There are other ways to break it, like using the syscall itself to change
> input arguments or using ptrace from another process and other ways.
> 
> Generally there are so many races with ptrace that if you want to do
> things like that it's better to use a LSM. That's what they are for.

I could see the LSM approach working *if* there was an LSM module to
make it available to unprivileged userspace.  I.e. a replacement for
ptrace() for this purpose.

It would be nice to be able to trace and check syscall strings properly.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19 11:34                                                   ` Indan Zupancic
@ 2012-01-19 16:11                                                     ` Jamie Lokier
  0 siblings, 0 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-19 16:11 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Chris Evans, Andi Kleen, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

Indan Zupancic wrote:
> On Thu, January 19, 2012 09:16, Chris Evans wrote:
> > On Wed, Jan 18, 2012 at 4:14 PM, Indan Zupancic <indan@nul.nu> wrote:
> >> On Wed, January 18, 2012 22:13, Chris Evans wrote:
> >>> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
> >>>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
> >>>>> 2) Tracee traps
> >>>>> 2b) Tracee could take a SIGKILL here
> >>>>> 3) Tracer looks at registers; bad syscall
> >>>>> 3b) Or tracee could take a SIGKILL here
> >>>>> 4) The only way to stop the bad syscall from executing is to rewrite
> >>>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> >>>>> syscall has finished)
> >>>>
> >>>> Yes, we rewrite it to -1.
> >>>>
> >>>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> >>>>> pid (such as PTRACE_SETREGS) fails.
> >>>>
> >>>> I assume that if a task can execute system calls and we get ptrace events
> >>>> for that, that we can do other ptrace operations too. Are you saying that
> >>>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
> >>>> doesn't work but the task continues executing system calls? That would be
> >>>> a huge bug, but it seems very unlikely too, as the task is stopped and
> >>>> shouldn't be able to disappear till it is continued by the tracer.
> >>>>
> >>>> I mean, really? That would be stupid.
> >>
> >> Okay, I tested this scenario and you're right, we're screwed.
> >>
> >> What the hell guys?
> >
> > Steady on :) ptrace() has never been sold as a technology upon which
> > its safe to build security solutions.
> 
> Well, that can be said of pretty much all kernel functionality.
> That is no excuse for crazy behaviour.
> 
> I more or less fixed it by turning all SIGKILLs into SIGTERMs.
> Perhaps I should use a more obscure signal instead.
> 
> >> What about other PID checks in the kernel, are they still
> >> safe if the process looks dead but is still active? Or is it a ptrace-only
> >> problem?
> >>
> >>>> If true we have to work around it by disallowing SIGKILL and just sending
> >>>> them ourselves within the jail. Meh.
> >>
> >> I guess this helps a bit. It doesn't prevent external signals, but prisoners
> >> don't have control over that.
> >
> > Well.... a prisoner may be able to play other tricks:
> > - Allocate lots of memory... kernel may start spraying around SIGKILLs
> > - Sending SIGKILL via prctl()
> 
> prctl is disallowed within our jail. Did you had PR_SET_PDEATHSIG in mind?
> But doesn't the tracer become the parent when ptracing or not for this?
> Or were you thinking about enabling SECCOMP and counting on the SIGKILL
> being process-wide instead of thread-specific?
> 
> > - Sending SIGKILL via fcntl()
> 
> I haven't written the fcntl demultiplexor yet, but I missed fcntl could
> be used for sending signals. I knew there was whacky stuff in there, but
> didn't expect it to be that bad. Thanks.
> 
> > - Sending SIGKILL via clone()
> 
> How? And can you send it to another process than yourself?
> 
> >
> >>
> >> Is this SIGKILL specific or is it true for all task ending signals?
> >
> > Can't remember - try it?
> 
> Tried: It's safe with SIGTERM, so I assume the others are fine too.
> I'll double check though...
> 
> >>
> >>>> How will you avoid file path races with BPF?
> >>>
> >>> There is typically no need for file-path based access control in an FTP server.
> >>> Take for example anonymous FTP, which will typically be inside a
> >>> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
> >>> it, you can have it.
> >>
> >> Ah, you count on having root access. We don't.
> >>
> >> Do you know any more crazy security destroying holes?
> >
> > Try spraying SIGCONT and / or SIGSTOP at tracees. It may be possible
> > to confuse the tracer about whether a SIGTRAP event is syscall entry
> > or exit.
> 
> Yes, heard about that weirdness before, but it's all ignored. We're
> using PTRACE_O_TRACESYSGOOD.
> 
> > Try doing an execve() that fails. May cause similar state confusion in
> > the tracer.
> 
> Our jailer pretty much ignores all signals and only handles syscalls
> and task exits. We actually check execve's return value to know if we
> have to do our stuff or not.

Take a look at the file README-linux-ptrace in recent strace Git.
(Thanks Denys!)

It describes some *really* ugly things Linux does to ptrace on execve
when there are threads: The most exciting being the return value is
sent to a different tid than called execve(), and other tids magically
disappear without notification.

You can use PTRACE_O_TRACEEXEC to see if the execve() succeeds, btw.
It has the useful side-effect of preventing the legacy behaviour of
SIGTRAP being sent as a normal queued signal after successful execve().
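
A minimal sketch of recognising that event in the tracer's wait loop,
assuming PTRACE_O_TRACEEXEC has been set with PTRACE_SETOPTIONS and that
status comes from waitpid(). The helper name is_exec_event() is invented
for the example.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * A successful execve() under PTRACE_O_TRACEEXEC stops the tracee
 * with the event code stored above the stop signal, i.e.
 * status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)).
 */
static bool is_exec_event(int status)
{
	return WIFSTOPPED(status) &&
	       (status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8));
}
```

A failed execve() produces no such stop, so seeing this event is the
success indication.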

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 16:01                                                   ` Jamie Lokier
@ 2012-01-19 16:13                                                     ` Andrew Lutomirski
  2012-01-19 19:21                                                     ` Linus Torvalds
  1 sibling, 0 replies; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-19 16:13 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Andrew Lutomirski wrote:
>> It's reasonable, obvious, and even more wrong than it appears.  On
>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>> (I got bitten by this in some iteration of the vsyscall emulation
>> patches -- see user_64bit_mode for the correct and
>> unusable-from-user-mode way to do this.)
>
> Here it is:
>
>        static inline bool user_64bit_mode(struct pt_regs *regs)
>        {
>        #ifndef CONFIG_PARAVIRT
>                /*
>                 * On non-paravirt systems, this is the only long mode CPL 3
>                 * selector.  We do not allow long mode selectors in the LDT.
>                 */
>                return regs->cs == __USER_CS;
>        #else
>                /* Headers are too twisted for this to go in paravirt.h. */
>                return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
>        #endif
>        }
>
> Perhaps userspace can do that.
> Would it be right for a ptracer to say:
>
>   CS == 0x23 -> 32-bit
>   (CS & 4)   -> 32-bit (LDT, "we do not allow long mode selectors in the LDT")
>   else       -> 64-bit (__USER_CS or some other GDT entry which must be pv_info's)
>
> I.e. assume that no other *GDT* CS values are available to userspace?
> There are other 32-bit GDT entries, but are they not all for data or kernel use only?

I suspect not.  asm/xen/interface_64.h has:

#define FLAT_RING3_CS32 0xe023  /* GDT index 260 */
#define FLAT_RING3_CS64 0xe033  /* GDT index 261 */
#define FLAT_RING3_DS32 0xe02b  /* GDT index 262 */
#define FLAT_RING3_DS64 0x0000  /* NULL selector */
#define FLAT_RING3_SS32 0xe02b  /* GDT index 262 */
#define FLAT_RING3_SS64 0xe02b  /* GDT index 262 */

which sounds like there's an extra 32-bit selector as well.  I haven't
checked, though.

--Andy


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 16:01                                                   ` Jamie Lokier
  2012-01-19 16:13                                                     ` Andrew Lutomirski
@ 2012-01-19 19:21                                                     ` Linus Torvalds
  2012-01-19 19:30                                                       ` Andrew Lutomirski
                                                                         ` (2 more replies)
  1 sibling, 3 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-19 19:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrew Lutomirski, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Andrew Lutomirski wrote:
>> It's reasonable, obvious, and even more wrong than it appears.  On
>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>> (I got bitten by this in some iteration of the vsyscall emulation
>> patches -- see user_64bit_mode for the correct and
>> unusable-from-user-mode way to do this.)
>
> Here it is:
>
>        static inline bool user_64bit_mode(struct pt_regs *regs)

This is pointless, even if it worked, which it clearly doesn't on Xen
(or other random situations).

Why would you care?

The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.

The problem is the system call itself, and that can be 32-bit or
64-bit independently of the execution mode. So knowing the user-mode
mode is simply not relevant.

In the kernel, we know this with the TS_COMPAT flag - exactly because
it's impossible to tell from any actual CPU state. So *that* is the
flag you need to figure out, and currently the kernel doesn't export
it any way (but my suggested patch would export it in the high bits of
rflags).

So looking at CS isn't *ever* going to help.

                 Linus


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:21                                                     ` Linus Torvalds
@ 2012-01-19 19:30                                                       ` Andrew Lutomirski
  2012-01-19 19:37                                                         ` Linus Torvalds
  2012-01-19 23:54                                                       ` Jamie Lokier
  2012-01-20 15:35                                                       ` Will Drewry
  2 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2012-01-19 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 11:21 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> Andrew Lutomirski wrote:
>>> It's reasonable, obvious, and even more wrong than it appears.  On
>>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>>> (I got bitten by this in some iteration of the vsyscall emulation
>>> patches -- see user_64bit_mode for the correct and
>>> unusable-from-user-mode way to do this.)
>>
>> Here it is:
>>
>>        static inline bool user_64bit_mode(struct pt_regs *regs)
>
> This is pointless, even if it worked, which it clearly doesn't on Xen
> (or other random situations).
>
> Why would you care?
>
> The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.
>
> The problem is the system call itself, and that can be 32-bit or
> 64-bit independently of the execution mode. So knowing the user-mode
> mode is simply not relevant.

Unless you're writing a debugger and you want to disassemble the code
that's being executed (i.e. normal code, not a system call).  I wonder
how gdb guesses whether the cpu is in long mode.

--Andy


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:30                                                       ` Andrew Lutomirski
@ 2012-01-19 19:37                                                         ` Linus Torvalds
  2012-01-19 19:41                                                           ` Linus Torvalds
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-19 19:37 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Jamie Lokier, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 11:30 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> Unless you're writing a debugger and you want to disassemble the code
> that's being executed (i.e. normal code, not a system call).  I wonder
> how gdb guesses whether the cpu is in long mode.

Yes, if you need to disassemble user space you would need to figure
out the mode.

I would suggest looking at 'rip/rsp' first, though, and just say that
if it's >32-bit, it's flat mode. Only if both rsp and rip fit in 32
bits should you even bother to start guessing.

Because technically I suspect you really do need to look it up in the
segment descriptors, and I don't think we have that kind of interface
(nor do I think we really want to expose one).
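
The first step of that heuristic could look something like the sketch
below. The function name regs_imply_64bit() is hypothetical, and the
register values are assumed to come from something like PTRACE_GETREGS.
A false result decides nothing; it only means the guessing has to go on.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when either register has bits set above bit 31, in
 * which case the task cannot be running with 32-bit addresses.
 */
static bool regs_imply_64bit(uint64_t rip, uint64_t rsp)
{
	return (rip >> 32) != 0 || (rsp >> 32) != 0;
}
```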

                          Linus


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:37                                                         ` Linus Torvalds
@ 2012-01-19 19:41                                                           ` Linus Torvalds
  0 siblings, 0 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-19 19:41 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Jamie Lokier, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 11:37 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I would suggest looking at 'rip/rsp' first, though, and just say that
> if it's >32-bit, it's flat mode. Only if both rsp and rip fit in 32
> bits should you even bother start guessing.

Oh, there's a few other hints you can look at. If 'ds' is zero, you
might technically be in 32-bit mode, but realistically nothing really
would work, so you might as well assume you're in long mode.

So you can have a lot of heuristics (including just looking at what
the disassembly itself looks like) if you really want to..
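
A rough sketch of those register heuristics in C, for illustration only.
The function name, the enum, and the choice to report "unknown" rather
than "32-bit" are assumptions made here, not something gdb or strace is
known to do:

```c
#include <stdint.h>

enum mode_guess { MODE_LONG, MODE_UNKNOWN };

/*
 * Guess whether a stopped x86-64 tracee is executing in long mode,
 * using only register values the tracer already has.
 * (Hypothetical helper; the names are made up for this sketch.)
 */
enum mode_guess guess_user_mode(uint64_t rip, uint64_t rsp, uint64_t ds)
{
	/* An address above 4 GiB cannot belong to 32-bit code. */
	if ((rip >> 32) != 0 || (rsp >> 32) != 0)
		return MODE_LONG;

	/* A null ds is practically unusable for 32-bit code. */
	if ((ds & 0xffff) == 0)
		return MODE_LONG;

	/* rip and rsp both fit in 32 bits and ds is set: keep guessing. */
	return MODE_UNKNOWN;
}
```

MODE_UNKNOWN is where the remaining guesswork (such as looking at the
disassembly) would have to start.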

                    Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:21                                                     ` Linus Torvalds
  2012-01-19 19:30                                                       ` Andrew Lutomirski
@ 2012-01-19 23:54                                                       ` Jamie Lokier
  2012-01-20  0:05                                                         ` Linus Torvalds
  2012-01-20 15:35                                                       ` Will Drewry
  2 siblings, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-19 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Lutomirski, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

Linus Torvalds wrote:
> On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> > Andrew Lutomirski wrote:
> >> It's reasonable, obvious, and even more wrong than it appears.  On
> >> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
> >> (I got bitten by this in some iteration of the vsyscall emulation
> >> patches -- see user_64bit_mode for the correct and
> >> unusable-from-user-mode way to do this.)
> >
> > Here it is:
> >
> >        static inline bool user_64bit_mode(struct pt_regs *regs)
> 
> This is pointless, even if it worked, which it clearly doesn't on Xen
> (or other random situations).
> 
> Why would you care?
> 
> The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.
> 
> The problem is the system call itself, and that can be 32-bit or
> 64-bit independently of the execution mode. So knowing the user-mode
> mode is simply not relevant.

Sorry, you're responding to a different question than the one I was
talking about.  My bad for adding to the confusion.

Mine was: strace currently checks the CS value and may have a bug on
existing/older kernels if Xen is involved when using the *normal*
syscall entry point (not int $0x80).  Can we patch strace to solve
that on those kernels in a generic way, or does the fix need to
hard-code knowledge of Xen's CS values (and any similar PV hypervisors
if there are any).

No amount of patching newer kernels will fix that, but it would be
nice if newer kernels made it unambiguous.

You've usefully pointed out that there's no reliable way to tell if
the tracee is executing in long mode.  If we're adding pseudo-flags to
say what kind of syscall it is, it would be no bad thing to have a
pseudo-flag to say if userspace is in long mode -- made available to
breakpoints and single-stepping as well.

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 23:54                                                       ` Jamie Lokier
@ 2012-01-20  0:05                                                         ` Linus Torvalds
  0 siblings, 0 replies; 235+ messages in thread
From: Linus Torvalds @ 2012-01-20  0:05 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrew Lutomirski, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 3:54 PM, Jamie Lokier <jamie@shareable.org> wrote:
>
> Mine was: strace currently checks the CS value and may have a bug on
> existing/older kernels if Xen is involved when using the *normal*
> syscall entry point (not int $0x80).  Can we patch strace to solve
> that on those kernels in a generic way, or does the fix need to
> hard-code knowledge of Xen's CS values (and any similar PV hypervisors
> if there are any).
>
> No amount of patching newer kernels will fix that, but it would be
> nice if newer kernels made it unambiguous.

Ok.  So yeah, I think the heuristics for strace could possibly be
improved when running under Xen, I agree. See my suggestion for taking
other register contents into account (%rsp in particular - the code
segment tends to be mapped low in 64-bit mode, but the stack is almost
always high unless you are doing something really odd).

So heuristics improvements could be a good idea. Very few real
programs will use "int 0x80" in long mode, since it's slow and limited
to 32 bit.

               Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19 16:04                                               ` Jamie Lokier
@ 2012-01-20  0:21                                                 ` Indan Zupancic
  2012-01-20  0:53                                                   ` Linus Torvalds
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-20  0:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Martin Mares, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 19, 2012 17:04, Jamie Lokier wrote:
> Andi Kleen wrote:
>> > Not everybody. There are programs which try hard to distinguish between
>> > int80 and syscall. One such example is a sandbox for programming contests
>> > I wrote several years ago. It analyses the instruction before EIP and as
>> > it does not allow threads nor executing writeable memory, it should be
>> > correct.
>>
>> There are other ways to break it, like using the syscall itself to change
>> input arguments or using ptrace from another process and other ways.
>>
>> Generally there are so many races with ptrace that if you want to do
>> things like that it's better to use a LSM. That's what they are for.
>
> I could see the LSM approach working *if* there was an LSM module to
> make it available to unpriviledged userspace.  I.e. a replacement for
> ptrace() for this purpose.
>
> It would be nice to be able to trace and check syscall strings properly.

With current ptrace you can do exactly that. It's just very slow, because
you have to copy the data word by word via PTRACE_PEEKDATA. But if Linux
would support something like BSD's PT_IO ptrace request, then it could be
limited to one extra ptrace command. (PTRACE_STRNCPY would be handy.)
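
To show where the cost comes from, here is a minimal sketch of the
word-by-word copy. The helper names are made up for this example, and it
assumes the tracee is already attached and stopped:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Append the bytes of one fetched word to buf.
 * Returns 1 once a NUL has been stored, 0 to continue, -1 if buf is full.
 */
int append_word(char *buf, size_t buflen, size_t *used, long word)
{
	char bytes[sizeof(word)];

	memcpy(bytes, &word, sizeof(word));
	for (size_t i = 0; i < sizeof(bytes); i++) {
		if (*used == buflen)
			return -1;
		buf[(*used)++] = bytes[i];
		if (bytes[i] == '\0')
			return 1;
	}
	return 0;
}

/* Copy a string from the tracee: one ptrace call per word. */
long peek_string(pid_t pid, unsigned long addr, char *buf, size_t buflen)
{
	size_t used = 0;

	for (;;) {
		/* -1 is a valid word, so errno is the only error signal. */
		errno = 0;
		long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
		if (word == -1 && errno != 0)
			return -1;

		int r = append_word(buf, buflen, &used, word);
		if (r != 0)
			return r == 1 ? (long)(used - 1) : -1;
		addr += sizeof(long);
	}
}
```

A path of a few hundred bytes costs dozens of system calls this way,
which is the overhead a PT_IO-style request would remove.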

After the check we memcpy the data to a shared read-only mapping, but
that's very quick. We could read the data directly into the RO area,
but as we're mostly dealing with path strings it seemed more efficient
to allocate the needed memory instead of the max every time.

No matter how you make it available to userspace via some LSM, you will
end up with the same context switch overhead ptrace suffers, so I don't
see how a LSM module would give either more options or make it much faster
compared to ptrace.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-20  0:21                                                 ` Indan Zupancic
@ 2012-01-20  0:53                                                   ` Linus Torvalds
  2012-01-20  2:02                                                     ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-20  0:53 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jamie Lokier, Andi Kleen, Martin Mares, Andi Kleen,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 4:21 PM, Indan Zupancic <indan@nul.nu> wrote:
>
> With current ptrace you can do exactly that. It's just very slow, because
> you have to copy the data word by word via PTRACE_PEEKDATA. But if Linux
> would support something like BSD's PT_IO ptrace request, then it could be
> limited to one extra ptrace command. (PTRACE_STRNCPY would be handy.)

Actually, you could use the new "process_vm_readv/writev()" system
calls. No need to do the crazy slow ptrace stuff.

I dunno. It got merged through Andrew, and the code looks sane, but
I've never actually seen anybody *use* it. So maybe there is something
wrong there. And no, it doesn't have a "strncpy" interface, I'm
afraid.

                   Linus

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-20  0:53                                                   ` Linus Torvalds
@ 2012-01-20  2:02                                                     ` Indan Zupancic
  0 siblings, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-20  2:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andi Kleen, Martin Mares, Andi Kleen,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Fri, January 20, 2012 01:53, Linus Torvalds wrote:
> On Thu, Jan 19, 2012 at 4:21 PM, Indan Zupancic <indan@nul.nu> wrote:
>>
>> With current ptrace you can do exactly that. It's just very slow, because
>> you have to copy the data word by word via PTRACE_PEEKDATA. But if Linux
>> would support something like BSD's PT_IO ptrace request, then it could be
>> limited to one extra ptrace command. (PTRACE_STRNCPY would be handy.)
>
> Actually, you could use the new "process_vm_readv/writev()" system
> calls. No need to do the crazy slow ptrace stuff.

Oh wow, that's great! I tried pread on /proc/$PID/mem before, but that
didn't work for some reason and would eat many fd's if there were a lot
of prisoners.

When did it get merged?

> I dunno. It got merged through Andrew, and the code looks sane, but
> I've never actually seen anybody *use* it. So maybe there is something
> wrong there. And no, it doesn't have a "strncpy" interface, I'm
> afraid.

My main problem is that I don't know beforehand how much I have to read,
and if I always read a fixed amount it may go across a page border and
error out. So if process_vm_readv() reads the accessible data only and
doesn't give up halfway, it's perfect. That seems to be the behaviour,
but the manpage is fuzzy enough that it may not be true. I'll take a
look at the source later.
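
Until that is checked, one way to sidestep the question is to never issue
a read that crosses a page boundary. A minimal sketch, assuming a 64-bit
tracer with permission to read the target; the function names are made up
here, and the call goes through syscall(2) so it does not depend on a
libc wrapper being available:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Bytes from addr up to (not including) the next page boundary. */
size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
	/* page_size must be a power of two for the mask to work. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return page_size - (addr & (page_size - 1));
}

/*
 * Copy a NUL-terminated string from another process, one page-bounded
 * chunk at a time. Returns the string length, or -1 on error or if no
 * NUL was found within buflen bytes.
 */
ssize_t read_remote_string(pid_t pid, uintptr_t addr, char *buf, size_t buflen)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t done = 0;

	while (done < buflen) {
		size_t chunk = bytes_to_page_end(addr + done, page);
		if (chunk > buflen - done)
			chunk = buflen - done;

		struct iovec local = { .iov_base = buf + done, .iov_len = chunk };
		struct iovec remote = { .iov_base = (void *)(addr + done),
					.iov_len = chunk };
		long n = syscall(SYS_process_vm_readv, pid, &local, 1UL,
				 &remote, 1UL, 0UL);
		if (n <= 0)
			return -1;

		char *nul = memchr(buf + done, '\0', (size_t)n);
		if (nul)
			return nul - buf;
		done += (size_t)n;
	}
	return -1;
}
```

For short path strings this is usually one call, two if the string
happens to straddle a page.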

Thanks,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:21                                                     ` Linus Torvalds
  2012-01-19 19:30                                                       ` Andrew Lutomirski
  2012-01-19 23:54                                                       ` Jamie Lokier
@ 2012-01-20 15:35                                                       ` Will Drewry
  2012-01-20 17:56                                                         ` Roland McGrath
  2 siblings, 1 reply; 235+ messages in thread
From: Will Drewry @ 2012-01-20 15:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andrew Lutomirski, Indan Zupancic, Andi Kleen,
	Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 1:21 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> Andrew Lutomirski wrote:
>>> It's reasonable, obvious, and even more wrong than it appears.  On
>>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>>> (I got bitten by this in some iteration of the vsyscall emulation
>>> patches -- see user_64bit_mode for the correct and
>>> unusable-from-user-mode way to do this.)
>>
>> Here it is:
>>
>>        static inline bool user_64bit_mode(struct pt_regs *regs)
>
> This is pointless, even if it worked, which it clearly doesn't on Xen
> (or other random situations).
>
> Why would you care?
>
> The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.
>
> The problem is the system call itself, and that can be 32-bit or
> 64-bit independently of the execution mode. So knowing the user-mode
> mode is simply not relevant.
>
> In the kernel, we know this with the TS_COMPAT flag - exactly because
> it's impossible to tell from any actual CPU state. So *that* is the
> flag you need to figure out, and currently the kernel doesn't export
> it any way (but my suggested patch would export it in the high bits of
> rflags).

Would it be worth considering changing the return from
task_user_regset_view, like:

--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1311,7 +1311,11 @@ void update_regset_xstate_info(unsigned int size, u64 xstate_mask)
 const struct user_regset_view *task_user_regset_view(struct task_struct *task)
 {
 #ifdef CONFIG_IA32_EMULATION
-       if (test_tsk_thread_flag(task, TIF_IA32))
+       /* If the task is in a syscall, then the TS_COMPAT status
+        * is more accurate than the personality.
+        */
+       if (test_tsk_thread_flag(task, TIF_IA32) ||
+           task_thread_info(task)->status & TS_COMPAT)
 #endif
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
                return &user_x86_32_view;


This would make TS_COMPAT behave like a personality change.
PTRACE_POKEUSR and PEEKUSR would still access the 64-bit view with no
compat info (just like with TIF_IA32 tasks), but PTRACE_[GS]ETREGS
would return/expect 32-bit struct user_regs_struct.  This would result
in the tracer needing to check the returned regs to see if it was
fully populated (which seems heinous), but it would export the
TS_COMPAT state.

Right now, if a 64-bit tracer changes the regs for a TS_COMPAT call,
the args will be 32-bit truncated (for better or worse). Of course, on
trace_syscall_leave, 64-bit registers won't be truncated so it maybe
makes less sense.

Perhaps this was considered and discarded as being obviously broken,
but it wasn't clear cut to me.

Thanks!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 15:35                                                       ` Will Drewry
@ 2012-01-20 17:56                                                         ` Roland McGrath
  2012-01-20 19:45                                                           ` Will Drewry
  0 siblings, 1 reply; 235+ messages in thread
From: Roland McGrath @ 2012-01-20 17:56 UTC (permalink / raw)
  To: Will Drewry
  Cc: Linus Torvalds, Jamie Lokier, Andrew Lutomirski, Indan Zupancic,
	Andi Kleen, Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

In arch_ptrace, task_user_regset_view is called on current.  On an x86-64
kernel, that path is only reached for a 64-bit syscall.  compat_arch_ptrace
doesn't use it at all, always using the 32-bit view.  So your change would
have no effect on PTRACE_GETREGS.

It would only affect PTRACE_GETREGSET, which calls task_user_regset_view on
the target task.  Is that what you meant?  I think that would be confusing
at best.  A caller of PTRACE_GETREGSET is expecting a particular layout
based on what type of task he thinks he's dealing with.  The caller can
look at the iov_len in the result to discern which layout it actually got
filled in, but I don't think that's what callers expect.
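
For illustration, a tracer that did want to cope with either layout could
classify the result by iov_len. This is only a sketch: the names are made
up, and the sizes assume the x86 layouts of struct user_regs_struct
(27 eight-byte registers for 64-bit, 17 four-byte registers for 32-bit):

```c
#include <stddef.h>

/* Sizes of the two x86 general-register layouts. */
#define REGS_SIZE_X86_64 (27 * 8)
#define REGS_SIZE_I386   (17 * 4)

enum regs_layout { LAYOUT_UNKNOWN, LAYOUT_I386, LAYOUT_X86_64 };

/*
 * Decide which layout the kernel filled in, given the iov_len that
 * PTRACE_GETREGSET(NT_PRSTATUS) wrote back into the caller's iovec.
 */
enum regs_layout classify_regs(size_t iov_len)
{
	if (iov_len == REGS_SIZE_X86_64)
		return LAYOUT_X86_64;
	if (iov_len == REGS_SIZE_I386)
		return LAYOUT_I386;
	return LAYOUT_UNKNOWN;
}
```

The caller has to pass a buffer large enough for the 64-bit layout and
then check the result on every stop, which is the extra burden I mean.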

The other use of task_user_regset_view is in core dump
(binfmt_elf.c:fill_note_info).  Off hand I don't think there's a way a core
dump can be started while still "inside" a syscall so that TS_COMPAT could
ever be set.  But that should be double-checked.

As to whether it was considered before, I doubt that it was.  I don't
really recall the sequence of events, but I think that I did all the
user_regset code before I was really cognizant of the TS_COMPAT subtleties.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 17:56                                                         ` Roland McGrath
@ 2012-01-20 19:45                                                           ` Will Drewry
  0 siblings, 0 replies; 235+ messages in thread
From: Will Drewry @ 2012-01-20 19:45 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Jamie Lokier, Andrew Lutomirski, Indan Zupancic,
	Andi Kleen, Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Fri, Jan 20, 2012 at 11:56 AM, Roland McGrath <mcgrathr@google.com> wrote:
> In arch_ptrace, task_user_regset_view is called on current.  On an x86-64
> kernel, that path is only reached for a 64-bit syscall.  compat_arch_ptrace
> doesn't use it at all, always using the 32-bit view.  So your change would
> have no effect on PTRACE_GETREGS.
>
> It would only affect PTRACE_GETREGSET, which calls task_user_regset_view on
> the target task.  Is that what you meant?

Exactly - sorry for being unclear!

> I think that would be confusing
> at best.  A caller of PTRACE_GETREGSET is expecting a particular layout
> based on what type of task he thinks he's dealing with.  The caller can
> look at the iov_len in the result to discern which layout it actually got
> filled in, but I don't think that's what callers expect.

The question of what callers expect wasn't so clear to me -- for two reasons:
1. I was misreading
2. Compat syscall numbering.

#1 I had mistakenly thought that TIF_IA32 was set on a task if
personality(2) was called with PER_LINUX/PER_LINUX32.  It appears that
thread info flag can only be set by the binfmt handlers at exec-time,
so personality(2) cannot be used to change the user_regs_struct on the
fly (just signal mappings).

#2 In the case of a 64-bit process doing a 32-bit system call without
a personality change, the 64-bit register view will be consistent,
but, as discussed, the numbering will be incorrect.  So what the
caller gets back still seems to not be what they were expecting, it's
just not as far off as a different register view.

In either case the output from PTRACE_GETREGS is broken for the
TS_COMPAT-64-bit process flow, but it all comes down to determining
which brokenness is worse: the silent system call number change and
register truncation, or a different, but accurate, user_regs_struct :/

> The other use of task_user_regset_view is in core dump
> (binfmt_elf.c:fill_note_info).  Off hand I don't think there's a way a core
> dump can be started while still "inside" a syscall so that TS_COMPAT could
> ever be set.  But that should be double-checked.

That was my reading, too, but additional eyes would be useful.

> As to whether it was considered before, I doubt that it was.  I don't
> really recall the sequence of events, but I think that I did all the
> user_regset code before I was really cognizant of the TS_COMPAT subtleties.

Makes sense.

Thanks!
will

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19  0:38                                                                 ` H. Peter Anvin
@ 2012-01-20 21:51                                                                   ` Denys Vlasenko
  2012-01-20 22:40                                                                     ` Roland McGrath
  0 siblings, 1 reply; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-20 21:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Roland McGrath, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

[-- Attachment #1: Type: text/plain, Size: 2817 bytes --]

On Thu, Jan 19, 2012 at 1:38 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/18/2012 03:28 PM, Linus Torvalds wrote:
>> On Wed, Jan 18, 2012 at 1:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> I think we can obviously agree that regsets is the only way to go for
>>> any kind of new state.
>>
>> So I really don't necessarily agree at all.
>>
>> Exactly because there is a heavy burden to introducing new models.
>> It's not only relatively much more kernel code, it's also relatively
>> much more painful for user code. If we can hide it in existing
>> structures, user code is *much* better off, because any existing code
>> to get the state will just continue to work. Otherwise, you need to
>> have the code to figure out the new structures (how do you compile it
>> without the new kernel headers?), you need to do the extra accesses
>> conditionally etc etc.
>>
>> There's a real cost to introducing new interfaces. There's a *reason*
>> people try to make do with old ones.
>
> Of course.  However, the whole point with regsets is that at the very
> least the vast majority of the infrastructure is generic and extends
> without a bunch of new machine.  What you are saying is "we might be
> able to get away with existing state", what I'm saying is "if we add
> state it should be a regset".
>
> The question if this should be new state is currently open.  I
> personally would still would prefer if this didn't overlay real CPU state.

What about extending one of the GETREGSET layouts?
GETREGSET uses struct iovec, which has an iov_len field.
Currently, if iov_len is larger than the register structure
being requested, the kernel simply returns less data than
userspace asks for.

In the x86 case, we can add additional field(s) at
the end of NT_PRSTATUS layout.

Old programs which use PTRACE_GETREGS will get
old user_regs_struct layout (without appended fields).
Old programs which use
PTRACE_GETREGSET(NT_PRSTATUS, sizeof(struct user_regs_struct))
will also get the same.
New programs which use
PTRACE_GETREGSET(NT_PRSTATUS, sizeof(struct user_regs_struct) + N *
sizeof(long))
will get new fields too.

It's more intrusive than Linus' solution, but it avoids
the problem of overlaying real register data
with OS-specific special bits. It can also be employed
on other architectures (does not depend on having
a suitable register to abuse).

OTOH it is less intrusive than adding a whole new regset
just in order to add a few bits to an existing one;
and would allow strace to extract both registers
and this new data with one operation instead of two.

Please see attached patch. NOT TESTED.

I'm new to this machinery, thus I might be missing some
obvious flaw with this idea (such as breaking on-disk
coredump format?)

-- 
vda

[-- Attachment #2: add_one_word_to_regset0.diff --]
[-- Type: text/x-patch, Size: 1271 bytes --]

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..16455c0 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -419,6 +419,10 @@ static int putreg(struct task_struct *child,
 		if (child->thread.gs != value)
 			return do_arch_prctl(child, ARCH_SET_GS, value);
 		return 0;
+
+	case sizeof(struct user_regs_struct) + 0 * sizeof(long):
+		/* Modifying of thread_info->status is not allowed */
+		return 0;
 #endif
 	}
 
@@ -469,6 +473,10 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
 			return 0;
 		return get_desc_base(&task->thread.tls_array[GS_TLS]);
 	}
+
+	case sizeof(struct user_regs_struct) + 0 * sizeof(long):
+		/* One day we might want to expose other bits too */
+		return (task_thread_info(task)->status & TS_COMPAT);
 #endif
 	}
 
@@ -1203,7 +1211,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 static struct user_regset x86_64_regsets[] __read_mostly = {
 	[REGSET_GENERAL] = {
 		.core_note_type = NT_PRSTATUS,
-		.n = sizeof(struct user_regs_struct) / sizeof(long),
+		.n = (sizeof(struct user_regs_struct) + 1 * sizeof(long)) / sizeof(long),
 		.size = sizeof(long), .align = sizeof(long),
 		.get = genregs_get, .set = genregs_set
 	},

^ permalink raw reply related	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 21:51                                                                   ` Denys Vlasenko
@ 2012-01-20 22:40                                                                     ` Roland McGrath
  2012-01-20 22:41                                                                       ` H. Peter Anvin
                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 235+ messages in thread
From: Roland McGrath @ 2012-01-20 22:40 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: H. Peter Anvin, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

If you change the size of a regset, then the new full size will be the size
of the core file notes.  Existing userland tools will not be expecting
this, they expect a known exact size.  If you need to add new stuff, it
really is easier all around to add a new regset flavor.  When adding a new
one, you can make it variable-sized from the start so as to be extensible
in the future.  We did this for NT_X86_XSTATE, for example.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:40                                                                     ` Roland McGrath
@ 2012-01-20 22:41                                                                       ` H. Peter Anvin
  2012-01-20 23:49                                                                         ` Indan Zupancic
  2012-01-24  8:19                                                                       ` Indan Zupancic
  2012-02-06 20:30                                                                       ` H. Peter Anvin
  2 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-01-20 22:41 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On 01/20/2012 02:40 PM, Roland McGrath wrote:
> If you change the size of a regset, then the new full size will be the size
> of the core file notes.  Existing userland tools will not be expecting
> this, they expect a known exact size.  If you need to add new stuff, it
> really is easier all around to add a new regset flavor.  When adding a new
> one, you can make it variable-sized from the start so as to be extensible
> in the future.  We did this for NT_X86_XSTATE, for example.
>

Yes, that definitely seems cleaner.

	-hpa


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:41                                                                       ` H. Peter Anvin
@ 2012-01-20 23:49                                                                         ` Indan Zupancic
  2012-01-20 23:55                                                                           ` Roland McGrath
  2012-01-21  0:07                                                                           ` Denys Vlasenko
  0 siblings, 2 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-20 23:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, Denys Vlasenko, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Fri, January 20, 2012 23:41, H. Peter Anvin wrote:
> On 01/20/2012 02:40 PM, Roland McGrath wrote:
>> If you change the size of a regset, then the new full size will be the size
>> of the core file notes.  Existing userland tools will not be expecting
>> this, they expect a known exact size.  If you need to add new stuff, it
>> really is easier all around to add a new regset flavor.  When adding a new
>> one, you can make it variable-sized from the start so as to be extensible
>> in the future.  We did this for NT_X86_XSTATE, for example.
>>
>
> Yes, that definitely seems cleaner.

I would prefer Linus' way of just stuffing it into cs. Jamie also wanted
a bit telling in what mode the userspace is running. That's 3 bits in total,
with one bit telling whether the other bits are valid or not. Anything else?
Maybe a bit telling whether it is syscall entry or exit?

As all this is very x86_64 specific and cs is already used to figure out
the mode, it seems overkill to add a new regset just for this.

It's a lot easier for existing code to add an extra cs check than to use
different register sets and different ptrace commands. Considering that
PTRACE_GETREGSET is undocumented it's likely that existing code isn't
using it much.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:49                                                                         ` Indan Zupancic
@ 2012-01-20 23:55                                                                           ` Roland McGrath
  2012-01-20 23:58                                                                             ` hpanvin@gmail.com
  2012-01-23  2:14                                                                             ` Indan Zupancic
  2012-01-21  0:07                                                                           ` Denys Vlasenko
  1 sibling, 2 replies; 235+ messages in thread
From: Roland McGrath @ 2012-01-20 23:55 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H. Peter Anvin, Denys Vlasenko, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Fri, Jan 20, 2012 at 3:49 PM, Indan Zupancic <indan@nul.nu> wrote:
> It's a lot easier for existing code to add an extra cs check than to use

The issue is whether showing fictitious high bits of %cs as set will break
existing applications (debuggers, etc.) that look at it and think that it's
nothing but the hardware state zero-extended, as it is today.

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:55                                                                           ` Roland McGrath
@ 2012-01-20 23:58                                                                             ` hpanvin@gmail.com
  2012-01-23  2:14                                                                             ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: hpanvin@gmail.com @ 2012-01-20 23:58 UTC (permalink / raw)
  To: Roland McGrath, Indan Zupancic
  Cc: Denys Vlasenko, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Linus claims it does break.

Roland McGrath <mcgrathr@google.com> wrote:

>On Fri, Jan 20, 2012 at 3:49 PM, Indan Zupancic <indan@nul.nu> wrote:
>> It's a lot easier for existing code to add an extra cs check than to
>use
>
>The issue is whether showing fictitious high bits of %cs as set will
>break
>existing applications (debuggers, etc.) that look at it and think that
>it's
>nothing but the hardware state zero-extended, as it is today.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:49                                                                         ` Indan Zupancic
  2012-01-20 23:55                                                                           ` Roland McGrath
@ 2012-01-21  0:07                                                                           ` Denys Vlasenko
  2012-01-21  0:10                                                                             ` Roland McGrath
  1 sibling, 1 reply; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-21  0:07 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H. Peter Anvin, Roland McGrath, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Saturday 21 January 2012 00:49, Indan Zupancic wrote:
> On Fri, January 20, 2012 23:41, H. Peter Anvin wrote:
> > On 01/20/2012 02:40 PM, Roland McGrath wrote:
> >> If you change the size of a regset, then the new full size will be the size
> >> of the core file notes.  Existing userland tools will not be expecting
> >> this, they expect a known exact size.  If you need to add new stuff, it
> >> really is easier all around to add a new regset flavor.  When adding a new
> >> one, you can make it variable-sized from the start so as to be extensible
> >> in the future.  We did this for NT_X86_XSTATE, for example.
> >>
> >
> > Yes, that definitely seems cleaner.
> 
> I would prefer Linus' way of just stuffing it into cs. Jamie also wanted
> a bit telling in what mode the userspace is running. That's 3 bits in total,
> with one bit telling whether the other bits are valid or not. Anything else?

There is actually a bunch of ptrace-specific stuff we want to return.

For example, Oleg wants to be able to print *which syscall*,
(along with its arguments if possible) is restarted when
we restart the ERESTART_RESTARTBLOCK-returning syscall.
Which happens every time strace attaches to a process sleeping
in nanosleep or poll, for example. We get just

$ strace -p 1234
Process 1234 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>_

and that's it.

Returning the syscall and its parameters requires several words,
not a few bits.

> Maybe a bit telling whether it is syscall entry or exit?

Yes, this one too. It is a longstanding annoyance
that this information is not exposed.

> As all this is very x86_64 specific and cs is already used to figure out
> the mode, it seems overkill to add a new regset just for this.
> 
> It's a lot easier for existing code to add an extra cs check than to use
> different register sets and different ptrace commands.

You don't understand. Returning new bits in cs will break *existing*
programs. This is generally a bad thing. For example, old strace binaries
on a new kernel will complain:

        switch (x86_64_regs.cs) {
                case 0x23: currpers = 1; break;
                case 0x33: currpers = 0; break;
                default:
                        fprintf(stderr, "Unknown value CS=0x%08X while "
                                 "detecting personality of process "
                                 "PID=%d\n", (int)x86_64_regs.cs, tcp->pid);
                        currpers = current_personality;
                        break;
        }

when they see an unfamiliar x86_64_regs.cs value.
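
To make the breakage concrete, here is a minimal sketch of the exact-match
check from the strace snippet next to a masked variant that a new tracer
would need. HYP_CS_COMPAT_FLAG and both function names are hypothetical:
nothing in this thread fixes which bit (if any) would be used, so bit 31 is
only an assumption for illustration.

```c
/* Hypothetical flag: assume the kernel set bit 31 of the reported cs
 * value to mean "compat syscall".  No such bit exists today. */
#define HYP_CS_COMPAT_FLAG (1UL << 31)

/* Exact-match logic, as in the strace snippet:
 * returns 1 for the 32-bit code segment, 0 for the 64-bit one,
 * and -1 for any value it does not recognise. */
int personality_from_cs(unsigned long cs)
{
	switch (cs) {
	case 0x23: return 1;
	case 0x33: return 0;
	default:   return -1;
	}
}

/* What an updated tracer would have to do instead: strip the
 * (hypothetical) flag before comparing. */
int personality_from_cs_masked(unsigned long cs)
{
	return personality_from_cs(cs & ~HYP_CS_COMPAT_FLAG);
}
```

An old binary only has the first function, so any cs value carrying the
extra bit falls through to the "unknown" case.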

-- 
vda

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-21  0:07                                                                           ` Denys Vlasenko
@ 2012-01-21  0:10                                                                             ` Roland McGrath
  2012-01-21  1:23                                                                               ` Jamie Lokier
  0 siblings, 1 reply; 235+ messages in thread
From: Roland McGrath @ 2012-01-21  0:10 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Indan Zupancic, H. Peter Anvin, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Fri, Jan 20, 2012 at 4:07 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>> Maybe a bit telling whether it is syscall entry or exit?
>
> Yes, this one too. This is one of longstanding annoyances
> that this information is not exposed.

That is not really "state", it's just which event you want.
That is much better addressed by replacing PTRACE_SYSCALL
with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT} and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT}.
Oleg can whip that up for you no problem.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-21  0:10                                                                             ` Roland McGrath
@ 2012-01-21  1:23                                                                               ` Jamie Lokier
  2012-01-23  2:37                                                                                 ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-21  1:23 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Indan Zupancic, H. Peter Anvin, Linus Torvalds,
	Andi Kleen, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Roland McGrath wrote:
> On Fri, Jan 20, 2012 at 4:07 PM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
> >> Maybe a bit telling whether it is syscall entry or exit?
> >
> > Yes, this one too. This is one of longstanding annoyances
> > that this information is not exposed.
> 
> That is not really "state", it's just which event you want.
> That is much better addressed by replacing PTRACE_SYSCALL
> with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT} and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT}.
> Oleg can whip that up for you no problem.

I agree, that is so obviously the right thing to do and it's very easy
to do in the tracehook functions.

There is one slight problem that some archs don't use
tracehook yet. Probably that should be fixed anyway.

(Fwiw, two other issues with arch-independent ptrace have come up in this
thread, which ought to be fairly easy to fix:
   - If tracer dies, tracee is free to continue running.  For security
     tracers (and it would be useful for strace as well), it would be good
     to have an option to SIGKILL the tracee if tracer dies.
   - Can't abort or change an unwanted syscall if the process receives
     SIGKILL as it's about to start a syscall (which will be its last).)

-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:55                                                                           ` Roland McGrath
  2012-01-20 23:58                                                                             ` hpanvin@gmail.com
@ 2012-01-23  2:14                                                                             ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-23  2:14 UTC (permalink / raw)
  To: Roland McGrath
  Cc: H. Peter Anvin, Denys Vlasenko, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Sat, January 21, 2012 00:55, Roland McGrath wrote:
> On Fri, Jan 20, 2012 at 3:49 PM, Indan Zupancic <indan@nul.nu> wrote:
>> It's a lot easier for existing code to add an extra cs check than to use
>
> The issue is whether showing fictitious high bits of %cs as set will break
> existing applications (debuggers, etc.) that look at it and think that it's
> nothing but the hardware state zero-extended, as it is today.

Argh, sorry, I meant eflags.

I even checked how many bits are free in eflags and still wrote 'cs'.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-21  1:23                                                                               ` Jamie Lokier
@ 2012-01-23  2:37                                                                                 ` Indan Zupancic
  2012-01-23 16:48                                                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-23  2:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Roland McGrath, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Andi Kleen, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Sat, January 21, 2012 02:23, Jamie Lokier wrote:
> Roland McGrath wrote:
>> On Fri, Jan 20, 2012 at 4:07 PM, Denys Vlasenko
>> <vda.linux@googlemail.com> wrote:
>> >> Maybe a bit telling whether it is syscall entry or exit?
>> >
>> > Yes, this one too. This is one of longstanding annoyances
>> > that this information is not exposed.
>>
>> That is not really "state", it's just which event you want.
>> That is much better addressed by replacing PTRACE_SYSCALL
>> with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT} and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT}.
>> Oleg can whip that up for you no problem.
>
> I agree, that is so obviously the right thing to do and it's very easy
> to do in the tracehook functions.

Yes, bad place for it, much better via ptrace flags. We're usually not
interested in syscall exit events, so having a way to not always get
syscall exit events would improve performance quite a bit too.

> There is one slight problem that some archs don't use
> tracehook yet. Probably that should be fixed anyway.
>
> (Fwiw, two other issues with arch-independent ptrace have come up in this
> thread, which ought to be fairly easy to fix:
>    - If tracer dies, tracee is free to continue running.  For security
>      tracers, and would be useful for strace as well, it would be good
>      to have an option to SIGKILL the tracee if tracer dies.

It should be easy to add a PTRACE_O_SIGKILL_ON_DEATH option.

>    - Can't abort or change an unwanted syscall if the process receives
>      SIGKILL as it's about to start a syscall (which will be its last).)

This is very important for any syscall filtering/control via ptrace, otherwise
SIGKILL becomes a security problem. Oleg had a patch for that:

On Wed, January 18, 2012 18:12, Oleg Nesterov wrote:
> On 01/18, Oleg Nesterov wrote:
>> Not only for security. The current behaviour sometime confuses the
>> users. Debugger sends SIGKILL to the tracee and assumes it should
>> die asap, but the tracee exits only after syscall.
>
> Something like the patch below.
>
> Oleg.
>
> --- x/include/linux/tracehook.h
> +++ x/include/linux/tracehook.h
> @@ -54,12 +54,12 @@ struct linux_binprm;
>  /*
>   * ptrace report for syscall entry and exit looks identical.
>   */
> -static inline void ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs)
>  {
>  	int ptrace = current->ptrace;
>
>  	if (!(ptrace & PT_PTRACED))
> -		return;
> +		return 0;
>
>  	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
>  		send_sig(current->exit_code, current, 1);
>  		current->exit_code = 0;
>  	}
> +
> +	return fatal_signal_pending(current);
>  }
>
>  /**
> @@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
>  static inline __must_check int tracehook_report_syscall_entry(
>  	struct pt_regs *regs)
>  {
> -	ptrace_report_syscall(regs);
> -	return 0;
> +	return ptrace_report_syscall(regs);
>  }
>
>  /**
>


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-23  2:37                                                                                 ` Indan Zupancic
@ 2012-01-23 16:48                                                                                   ` Oleg Nesterov
  0 siblings, 0 replies; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-23 16:48 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jamie Lokier, Roland McGrath, Denys Vlasenko, H. Peter Anvin,
	Linus Torvalds, Andi Kleen, Andrew Lutomirski, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On 01/23, Indan Zupancic wrote:
>
> On Sat, January 21, 2012 02:23, Jamie Lokier wrote:
> >
> > (Fwiw, two other issues with arch-independent ptrace have come up in this
> > thread, which ought to be fairly easy to fix:
> >    - If tracer dies, tracee is free to continue running.  For security
> >      tracers, and would be useful for strace as well, it would be good
> >      to have an option to SIGKILL the tracee if tracer dies.
>
> It should be easy to add a PTRACE_O_SIGKILL_ON_DEATH option.

Yes, this looks simple.

> >    - Can't abort or change an unwanted syscall if the process receives
> >      SIGKILL as it's about to start a syscall (which will be its last).)
>
> This is very important for any syscall filtering/control via ptrace, otherwise
> SIGKILL becomes a security problem. Oleg had a patch for that:

OK, I'll send this patch after some testing. Although it looks trivial.

Oleg.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 21:09                                               ` Chris Evans
@ 2012-01-23 16:56                                                 ` Oleg Nesterov
  2012-01-23 22:23                                                   ` Chris Evans
  0 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-23 16:56 UTC (permalink / raw)
  To: Chris Evans
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On 01/18, Chris Evans wrote:
>
> Thanks, Oleg. Seems like this would be a nice change to have. As we
> can see, people do use ptrace() as a security technology.

OK, I'll send it.

> With this in place, you can also (where possible) set up the tracee
> with PR_SET_PDEATHSIG==SIGKILL. And then, you have defences again
> either of the tracer or tracee dying from a stray SIGKILL.

This can only help if the tracer is the natural parent, is it enough?

Indan suggested PTRACE_O_SIGKILL_ON_DEATH, perhaps it makes sense.
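
For reference, a minimal sketch of the PR_SET_PDEATHSIG approach mentioned
above, using the existing prctl() interface. The function names are
hypothetical. It assumes the tracee is a direct child of the tracer and calls
this right after fork(); as noted, it does not help when the tracer is not
the natural parent.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Call in the child right after fork().  expected_parent is the pid the
 * parent had before fork().  Returns 0 on success, -1 if prctl() failed
 * or the parent had already exited. */
int arm_parent_death_kill(pid_t expected_parent)
{
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL) == -1)
		return -1;
	/* The parent may have died between fork() and prctl(); no signal
	 * would be delivered in that case, so detect the reparenting. */
	if (getppid() != expected_parent)
		return -1;
	return 0;
}

/* Read back the armed signal: 0 means none, -1 means prctl() failed. */
int armed_parent_death_signal(void)
{
	int sig = 0;

	if (prctl(PR_GET_PDEATHSIG, &sig) == -1)
		return -1;
	return sig;
}
```

The getppid() check covers the window in which the parent dies before the
child has armed the signal.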

Oleg.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-23 16:56                                                 ` Oleg Nesterov
@ 2012-01-23 22:23                                                   ` Chris Evans
  0 siblings, 0 replies; 235+ messages in thread
From: Chris Evans @ 2012-01-23 22:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Mon, Jan 23, 2012 at 8:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/18, Chris Evans wrote:
>>
>> Thanks, Oleg. Seems like this would be a nice change to have. As we
>> can see, people do use ptrace() as a security technology.
>
> OK, I'll send it.
>
>> With this in place, you can also (where possible) set up the tracee
>> with PR_SET_PDEATHSIG==SIGKILL. And then, you have defences again
>> either of the tracer or tracee dying from a stray SIGKILL.
>
> This can only help if the tracer is the natural parent, is it enough?
>
> Indan suggested PTRACE_O_SIGKILL_ON_DEATH, perhaps it makes sense.

Yeah, this takes care of all cases.

One caveat I can think of with the implementation: in the parent
exit() path, the child's SIGKILL needs to be delivered _before_ the
tracer is detached. Otherwise it might feasibly wake up and run for a
bit :)


Cheers
Chris

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:40                                                                     ` Roland McGrath
  2012-01-20 22:41                                                                       ` H. Peter Anvin
@ 2012-01-24  8:19                                                                       ` Indan Zupancic
  2012-02-06 20:30                                                                       ` H. Peter Anvin
  2 siblings, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-24  8:19 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, H. Peter Anvin, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Fri, January 20, 2012 23:40, Roland McGrath wrote:
> If you change the size of a regset, then the new full size will be the size
> of the core file notes.  Existing userland tools will not be expecting
> this, they expect a known exact size.  If you need to add new stuff, it
> really is easier all around to add a new regset flavor.  When adding a new
> one, you can make it variable-sized from the start so as to be extensible
> in the future.  We did this for NT_X86_XSTATE, for example.

If stuffing it into eflags is not acceptable and you really want a
new regset, perhaps that new regset should only contain the new,
mostly cross-platform information, instead of slapping it at the
end of the x86 regset. Because if you do the latter, you really
might as well have just stuffed it into eflags.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
                                                                   ` (2 preceding siblings ...)
  2012-01-18 20:26                                                 ` Linus Torvalds
@ 2012-01-25 19:36                                                 ` Oleg Nesterov
  2012-01-25 20:20                                                   ` Pedro Alves
  2012-01-25 23:32                                                   ` Denys Vlasenko
  3 siblings, 2 replies; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-25 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Denys Vlasenko

On 01/18, Linus Torvalds wrote:
>
> Using the high bits of 'eflags' might work.

I thought about changing eflags too, this looks very natural to me.

But I do not understand the result of this discussion, are you going
to apply this change?

If not...

Not sure this is really better, but there is another idea. Currently we
have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?

IOW. Currently ptrace_report_syscall() does

	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));

We can add the new events,

	PTRACE_EVENT_SYSCALL_ENTRY
	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
	PTRACE_EVENT_SYSCALL_EXIT
	PTRACE_EVENT_SYSCALL_COMPAT_EXIT

and change ptrace_report_syscall() to do

	if (ptrace & PT_SEIZED) /* or PT_TRACESYS_VERY_GOOD? */ {
		int event = entry ? PTRACE_EVENT_SYSCALL_ENTRY
				  : PTRACE_EVENT_SYSCALL_EXIT;
		/* relies on each COMPAT value being its native value + 1 */
		if (is_compat_task(current))
			event++;
		ptrace_notify((event << 8) | SIGTRAP);
	} else {
		ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
	}

This also makes it possible to distinguish entry from exit.


However, the change in get_flags() also exposes the state of the
TIF_IA32 bit outside of syscall entry/exit reports; perhaps there
is a reason why we want this?

Oleg.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 19:36                                                 ` Oleg Nesterov
@ 2012-01-25 20:20                                                   ` Pedro Alves
  2012-01-25 23:36                                                     ` Denys Vlasenko
  2012-01-25 23:32                                                   ` Denys Vlasenko
  1 sibling, 1 reply; 235+ messages in thread
From: Pedro Alves @ 2012-01-25 20:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Denys Vlasenko

On 01/25/2012 07:36 PM, Oleg Nesterov wrote:
> 
> Not sure this is really better, but there is another idea. Currently we
> have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?

May I beg you not to rely on PTRACE_SYSCALL for anything new?
You can't PTRACE_SINGLESTEP and PTRACE_SYSCALL simultaneously.  Think of
gdb single-stepping all the way for some reason (software watchpoints, for ex.),
while at the same time wanting to catch syscalls.

As Roland suggested, replacing PTRACE_SYSCALL with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT}
and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT} would be superior, syscall tracing wise.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 19:36                                                 ` Oleg Nesterov
  2012-01-25 20:20                                                   ` Pedro Alves
@ 2012-01-25 23:32                                                   ` Denys Vlasenko
  2012-01-26  0:40                                                     ` Indan Zupancic
                                                                       ` (2 more replies)
  1 sibling, 3 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-25 23:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> On 01/18, Linus Torvalds wrote:
> >
> > Using the high bits of 'eflags' might work.
> 
> I thought about changing eflags too, this looks very natural to me.
> 
> But I do not understand the result of this discussion, are you going
> to apply this change?
> 
> If not...
> 
> Not sure this is really better, but there is another idea. Currently we
> have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?
> 
> IOW. Currently ptrace_report_syscall() does
> 
> 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
> 
> We can add the new events,
> 
> 	PTRACE_EVENT_SYSCALL_ENTRY
> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> 	PTRACE_EVENT_SYSCALL_EXIT
> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT

We can get away with just the first one.
(1) It's unlikely people would want to get native sysentry events but not compat ones,
thus first two options can be combined into one;
(2) syscall exit compat-ness is known from entry type - no need to indicate it; and
(3) if we flag syscall entry with an event value in the wait status, then syscall
exit will already be distinguishable.

Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
"on syscall entry ptrace stop, set a nonzero event value in wait status",
and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry) and
PTRACE_EVENT_SYSCALL_ENTRY1 (for the compat one).

To future-proof this scheme we may reserve a few more event values
PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
if we'll ever have arches with more than one non-native syscall
entry. I'm no expert, but looking at strace code, ARM may already have
more than one additional convention for passing syscall args.


-- 
vda

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 20:20                                                   ` Pedro Alves
@ 2012-01-25 23:36                                                     ` Denys Vlasenko
  0 siblings, 0 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-25 23:36 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Oleg Nesterov, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Wednesday 25 January 2012 21:20, Pedro Alves wrote:
> On 01/25/2012 07:36 PM, Oleg Nesterov wrote:
> > 
> > Not sure this is really better, but there is another idea. Currently we
> > have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> > Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> > PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?
> 
> May I beg to don't rely on PTRACE_SYSCALL for anything new?

This doesn't *add* anything new. All the same ptrace stops will happen
at exactly the same moments. No new stops added. We only add a value
into upper half of waitpid status: (status >> 16) used to be 0
on syscall entry. Now it will be PTRACE_EVENT_SYSCALL_ENTRY[1].
That's all.
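
A minimal sketch of how a tracer might read that value, assuming the
proposal were adopted. The two event numbers below are placeholders
invented for illustration; this thread does not assign values, and the
names classify_stop() and wait_status_event() are hypothetical helpers.

```c
#include <signal.h>

/* Hypothetical values for the events proposed in this thread. */
#define PROPOSED_EVENT_SYSCALL_ENTRY   100  /* native entry */
#define PROPOSED_EVENT_SYSCALL_ENTRY1  101  /* compat entry */

enum stop_kind {
    STOP_NOT_SYSCALL,           /* not a PTRACE_O_TRACESYSGOOD syscall stop */
    STOP_SYSCALL_UNFLAGGED,     /* syscall stop with (status >> 16) == 0 */
    STOP_SYSCALL_ENTRY_NATIVE,
    STOP_SYSCALL_ENTRY_COMPAT,
};

/* Upper half of the waitpid() status; 0 today on every syscall stop. */
static int wait_status_event(int status)
{
    return (status >> 16) & 0xffff;
}

static enum stop_kind classify_stop(int status)
{
    /* Low byte 0x7f marks a stopped task; bits 8..15 hold the signal. */
    if ((status & 0xff) != 0x7f)
        return STOP_NOT_SYSCALL;
    if (((status >> 8) & 0xff) != (SIGTRAP | 0x80))
        return STOP_NOT_SYSCALL;

    switch (wait_status_event(status)) {
    case PROPOSED_EVENT_SYSCALL_ENTRY:
        return STOP_SYSCALL_ENTRY_NATIVE;
    case PROPOSED_EVENT_SYSCALL_ENTRY1:
        return STOP_SYSCALL_ENTRY_COMPAT;
    default:
        /* Under the proposal this would be a syscall exit; on a kernel
         * without the new option it is any syscall stop. */
        return STOP_SYSCALL_UNFLAGGED;
    }
}
```

The stop itself is detected exactly as with PTRACE_O_TRACESYSGOOD today;
only the extra switch on the upper half is new.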

> You can't PTRACE_SINGLESTEP and PTRACE_SYSCALL simultaneously.

This is an orthogonal problem.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 23:32                                                   ` Denys Vlasenko
@ 2012-01-26  0:40                                                     ` Indan Zupancic
  2012-01-26  1:08                                                       ` Jamie Lokier
  2012-01-26  1:09                                                       ` Denys Vlasenko
  2012-01-26  0:59                                                     ` Jamie Lokier
  2012-01-26 18:44                                                     ` Oleg Nesterov
  2 siblings, 2 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-26  0:40 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 26, 2012 00:32, Denys Vlasenko wrote:
> On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
>> On 01/18, Linus Torvalds wrote:
>> >
>> > Using the high bits of 'eflags' might work.
>>
>> I thought about changing eflags too, this looks very natural to me.
>>
>> But I do not understand the result of this discussion, are you going
>> to apply this change?
>>
>> If not...
>>
>> Not sure this is really better, but there is another idea. Currently we
>> have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
>> Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
>> PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?

Disadvantage of that is that all archs have to add support for this,
while it only affects x86_64.

>>
>> IOW. Currently ptrace_report_syscall() does
>>
>> 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>>
>> We can add the new events,
>>
>> 	PTRACE_EVENT_SYSCALL_ENTRY
>> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
>> 	PTRACE_EVENT_SYSCALL_EXIT
>> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
>
> We can get away with just the first one.
> (1) It's unlikely people would want to get native sysentry events but not compat ones,
> thus first two options can be combined into one;

True.

> (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> (3) if we would flag syscall entry with an event value in wait status, then syscall
> exit will be already distinguishable.

False for execve which messes everything up by changing TID sometimes.

>
> Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> "on syscall entry ptrace stop, set a nonzero event value in wait status"
> , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.

Not all code wants to receive a syscall exit event all the time, so
if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
too. That would pretty much halve ptrace's overhead for my use case.
But this is orthogonal to the compat problem.

> To future-proof this scheme we may reserve a few more event values
> PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> if we'll ever have arches with more than one non-native syscall
> entry. I'm no expert, but looking at strace code, ARM may already have
> more than one additional convention how to pass syscall args.

Please, no! This way lays madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
would be horrible. Keep arch specific stuff in arch specific areas,
please don't spread it around.

What was wrong with using eflags again? Is it too simple or something?

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 23:32                                                   ` Denys Vlasenko
  2012-01-26  0:40                                                     ` Indan Zupancic
@ 2012-01-26  0:59                                                     ` Jamie Lokier
  2012-01-26  1:21                                                       ` Denys Vlasenko
  2012-01-26  8:23                                                       ` Pedro Alves
  2012-01-26 18:44                                                     ` Oleg Nesterov
  2 siblings, 2 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-26  0:59 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Denys Vlasenko wrote:
> On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> > On 01/18, Linus Torvalds wrote:
> > >
> > > Using the high bits of 'eflags' might work.
> > 
> > I thought about changing eflags too, this looks very natural to me.
> > 
> > But I do not understand the result of this discussion, are you going
> > to apply this change?
> > 
> > If not...
> > 
> > Not sure this is really better, but there is another idea. Currently we
> > have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> > Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> > PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?
> > 
> > IOW. Currently ptrace_report_syscall() does
> > 
> > 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
> > 
> > We can add the new events,
> > 
> > 	PTRACE_EVENT_SYSCALL_ENTRY
> > 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> > 	PTRACE_EVENT_SYSCALL_EXIT
> > 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> 
> We can get away with just the first one.
> (1) It's unlikely people would want to get native sysentry events but not compat ones,
> thus first two options can be combined into one;

Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
whether it's compat as such.

I'm thinking it might be a little kinder like this:

#define PTRACE_EVENT_SYSCALL_ENTRY_ABI32 (...)
#define PTRACE_EVENT_SYSCALL_ENTRY_ABI64 (...)

#ifdef CONFIG_64BIT
# define PTRACE_EVENT_SYSCALL_ENTRY         PTRACE_EVENT_SYSCALL_ENTRY_ABI64
# define PTRACE_EVENT_SYSCALL_ENTRY_COMPAT  PTRACE_EVENT_SYSCALL_ENTRY_ABI32
#else
# define PTRACE_EVENT_SYSCALL_ENTRY         PTRACE_EVENT_SYSCALL_ENTRY_ABI32
#endif

So the ABI is represented directly, with the _ENTRY referring to the
tracer's own.  (Other ABI numbers can exist, e.g. OABI and EABI for
ARM, see below.)

This has two specific advantages:

  1. It can match on specific ABI or regular/compat, as suits the tracer's code.
  2. When a 32-bit *tracer* is running a 64-bit *tracee* at least it knows ;-)

With your idea, what happens in situation 2?  I'm not sure a 32-bit
tracer can do anything useful, because it can't get the 64-bit
registers, but at least it can see when it's got the wrong registers :-)

> (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> (3) if we would flag syscall entry with an event value in wait status, then syscall
> exit will be already distinguishable.
>
> Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> "on syscall entry ptrace stop, set a nonzero event value in wait status"
> , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.

PTRACE_EVENT_SYSCALL_EXIT would cleanly indicate that the new option
is actually working without the tracer needing to do a fork+test, if
PTRACE_ATTACH is used and for some reason the tracer sees a syscall
exit first.  I'm not sure if this can happen but I've heard rumour of
it on some archs or kernel versions.

> To future-proof this scheme we may reserve a few more event values
> PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> if we'll ever have arches with more than one non-native syscall
> entry.

> I'm no expert, but looking at strace code, ARM may already have
> more than one additional convention how to pass syscall args.

I was just looking at ARM and see exactly the same thing.  The
difference between EABI and OABI calls is significant on ARM, even
though syscall numbers are the same; and the ABI is selected by the
syscall instruction used, not process personality.  The __NR_name
values differ for each ABI, but (if I read arm/kernel/entry-common.S
properly) strace sees the same __NR_name values for both ABIs.

MIPS also has two different 32-bit ABIs, as well as 64-bit, but on
MIPS the syscall numbers are distinct, and should be seen by ptrace.
(Again if I read mips/kernel/ correctly.)

PA-RISC also has two different ABIs, the Linux one and the HPUX one.
The syscall numbers are different but overlap.  I don't know if they
are distinct to ptrace, in which case using the HPUX entry point might
be used to subvert a ptracer unless the ABI number is exposed.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:40                                                     ` Indan Zupancic
@ 2012-01-26  1:08                                                       ` Jamie Lokier
  2012-01-26  1:22                                                         ` Denys Vlasenko
  2012-01-26  6:34                                                         ` Indan Zupancic
  2012-01-26  1:09                                                       ` Denys Vlasenko
  1 sibling, 2 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-26  1:08 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Indan Zupancic wrote:
> On Thu, January 26, 2012 00:32, Denys Vlasenko wrote:
> > On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> >> IOW. Currently ptrace_report_syscall() does
> >>
> >> 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
> >>
> >> We can add the new events,
> >>
> >> 	PTRACE_EVENT_SYSCALL_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_EXIT
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> >
> > We can get away with just the first one.
> > (1) It's unlikely people would want to get native sysentry events but not compat ones,
> > thus first two options can be combined into one;
> 
> True.
> 
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > > exit will be already distinguishable.
> 
> False for execve which messes everything up by changing TID sometimes.

Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
returns, and you knowing the TID always changes to the PID?  I haven't
yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
not the old one, perhaps that could be changed.

It would be good to improve the threaded execve() behaviour for all
the disappearing TIDs to issue a disappearing event, and the winning
execve changing-TID to issue an I-am-changing-TID event, anyway.

> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
> 
> Not all code wants to receive a syscall exit event all the time, so
> if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
> too. That would pretty much halve ptrace's overhead for my use case.
> But this is orthogonal to the compat problem.

I agree.  I would like to ignore the exit for most syscalls but see a
few of them.  I guess PTRACE_SETOPTIONS could be used to toggle it,
with some overhead.  But in the spirit of this thread,
PTRACE_O_TRACE_BPF would be even better, to completely ignore
irrelevant syscalls :-)

> > To future-proof this scheme we may reserve a few more event values
> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> > if we'll ever have arches with more than one non-native syscall
> > entry. I'm no expert, but looking at strace code, ARM may already have
> > more than one additional convention how to pass syscall args.
> 
> Please, no! This way lays madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
> would be horrible. Keep arch specific stuff in arch specific areas,
> please don't spread it around.
> 
> What was wrong with using eflags again? Is it too simple or something?

Well it doesn't deal with the equivalent issue on ARM and PA-RISC.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:40                                                     ` Indan Zupancic
  2012-01-26  1:08                                                       ` Jamie Lokier
@ 2012-01-26  1:09                                                       ` Denys Vlasenko
  2012-01-26  3:47                                                         ` Linus Torvalds
  2012-01-26  5:57                                                         ` Indan Zupancic
  1 sibling, 2 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26  1:09 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thursday 26 January 2012 01:40, Indan Zupancic wrote:
> >> We can add the new events,
> >>
> >> 	PTRACE_EVENT_SYSCALL_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_EXIT
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> >
> > We can get away with just the first one.
> > (1) It's unlikely people would want to get native sysentry events but not compat ones,
> > thus first two options can be combined into one;
> 
> True.
> 
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > > exit will be already distinguishable.
> 
> False for execve which messes everything up by changing TID sometimes.

Dealt with in Linus tree already: set PTRACE_O_TRACEEXEC option,
and use PTRACE_GETEVENTMSG in PTRACE_EVENT_EXEC stop to get
the old TID.
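
A rough sketch of that sequence from the tracer's side, assuming the
tracee was attached with PTRACE_O_TRACEEXEC set and a kernel new enough
to report the former TID. The helper names is_exec_event_stop() and
exec_former_tid() are made up for illustration.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* True if a waitpid() status describes a PTRACE_EVENT_EXEC stop. */
static int is_exec_event_stop(int status)
{
    if ((status & 0xff) != 0x7f)        /* task is not stopped */
        return 0;
    return (status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8));
}

/*
 * Ask the kernel for the event message of the current stop. At a
 * PTRACE_EVENT_EXEC stop this is the TID the thread had before execve.
 * Returns -1 if the request fails.
 */
static pid_t exec_former_tid(pid_t tracee)
{
    unsigned long msg = 0;

    if (ptrace(PTRACE_GETEVENTMSG, tracee, NULL, &msg) == -1)
        return -1;
    return (pid_t)msg;
}
```

A tracer would call exec_former_tid() only after is_exec_event_stop()
returned true for the status reported for that tracee, then move its
per-thread bookkeeping from the former TID to the reported one.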


> > To future-proof this scheme we may reserve a few more event values
> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> > if we'll ever have arches with more than one non-native syscall
> > entry. I'm no expert, but looking at strace code, ARM may already have
> > more than one additional convention how to pass syscall args.
> 
> Please, no! This way lays madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
> would be horrible. Keep arch specific stuff in arch specific areas,
> please don't spread it around.

The situation when an architecture has 32- and 64-bit varieties,
and sometimes different ABIs (parameter passing conventions),
is rather typical, it's not a quirk of just one unfortunate
architecture.

Please look at strace source, get_scno() function, where
it reads syscall no and parameters. Let's see....
- POWERPC: has 32-bit and 64-bit mode
- X86_64: has 32-bit and 64-bit mode
- IA64: has i386-compat mode
- ARM: has more than one ABI
- SPARC: has 32-bit and 64-bit mode

Do you want to re-invent a different arch-specific way to report
syscall type for each of these arches?


> What was wrong with using eflags again? Is it too simple or something?

It's x86-specific, and abuses a bit in a real register.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:59                                                     ` Jamie Lokier
@ 2012-01-26  1:21                                                       ` Denys Vlasenko
  2012-01-26  8:23                                                       ` Pedro Alves
  1 sibling, 0 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26  1:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Oleg Nesterov, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thursday 26 January 2012 01:59, Jamie Lokier wrote:
> Denys Vlasenko wrote:
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > exit will be already distinguishable.
> >
> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
> 
> PTRACE_EVENT_SYSCALL_EXIT would cleanly indicate that the new option
> is actually working without the tracer needing to do a fork+test, if
> PTRACE_ATTACH is used and for some reason the tracer sees a syscall
> exit first.

Can't happen. After PTRACE_ATTACH, you can only see tracee dying, or
getting a signal delivery (usually a SIGSTOP). Anything else
would be a kernel bug.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:08                                                       ` Jamie Lokier
@ 2012-01-26  1:22                                                         ` Denys Vlasenko
  2012-01-26  6:34                                                         ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26  1:22 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thursday 26 January 2012 02:08, Jamie Lokier wrote:
> Indan Zupancic wrote:
> > > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > > exit will be already distinguishable.
> > 
> > False for execve which messes everything up by changing TID sometimes.
> 
> Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> returns, and you knowing the TID always changes to the PID?  I haven't
> yet checked which TID gets the PTRACE_EVENT_EXEC event,

tid change happens before PTRACE_EVENT_EXEC event generation.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:09                                                       ` Denys Vlasenko
@ 2012-01-26  3:47                                                         ` Linus Torvalds
  2012-01-26 18:03                                                           ` Denys Vlasenko
  2012-01-26  5:57                                                         ` Indan Zupancic
  1 sibling, 1 reply; 235+ messages in thread
From: Linus Torvalds @ 2012-01-26  3:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Indan Zupancic, Oleg Nesterov, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Jan 25, 2012 at 5:09 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>
> Please look at strace source, get_scno() function, where
> it reads syscall no and parameters. Let's see....
> - POWERPC: has 32-bit and 64-bit mode
> - X86_64: has 32-bit and 64-bit mode
> - IA64: has i386-compat mode
> - ARM: has more than one ABI
> - SPARC: has 32-bit and 64-bit mode
>
> Do you want to re-invent a different arch-specific way to report
> syscall type for each of these arches?

I think an arch-specific one is better than trying to make some
generic one that is messy.

As you say, many architectures have multiple system call ABIs.

But they tend to be very *different* issues. They can be about
multiple ABI's, as you mention, and even when they *look* similar
(32-bit vs 64-bit ABI's) they are actually totally different issues.

On x86, the real issue is not so much "32-bit vs 64-bit" as "multiple
system call entry models", where a 64-bit process can use the system
call entry for a 32-bit one. That is not true on POWER, for example,
and trying to make it out to be the same issue only muddles the point,
and confuses things. It really is NOT AT ALL the same issue, even if
you can make it "look" like the same issue by calling it a 32-bit vs
64-bit thing.

So for POWER, it really is about the mode of the CPU/process. For x86,
it really isn't. Trying to equate the two is *wrong*.

I seriously think it's better to be architecture-specific than to be
that kind of totally confused, and try to "consolidate" the issue,
when they are actually two totally different issues.

                        Linus


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:09                                                       ` Denys Vlasenko
  2012-01-26  3:47                                                         ` Linus Torvalds
@ 2012-01-26  5:57                                                         ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-26  5:57 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 26, 2012 02:09, Denys Vlasenko wrote:
> Dealt with in Linus tree already: set PTRACE_O_TRACEEXEC option,
> and use PTRACE_GETEVENTMSG in PTRACE_EVENT_EXEC stop to get
> the old TID.

Thanks, getting the old TID is useful, that was the missing bit to
handle execve statelessly.

>
>
>> > To future-proof this scheme we may reserve a few more event values
>> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
>> > if we'll ever have arches with more than one non-native syscall
>> > entry. I'm no expert, but looking at strace code, ARM may already have
>> > more than one additional convention how to pass syscall args.
>>
>> Please, no! This way lays madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
>> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
>> would be horrible. Keep arch specific stuff in arch specific areas,
>> please don't spread it around.
>
> The situation when an architecture has 32- and 64-bit varieties,
> and sometimes different ABIs (parameter passing conventions),
> is rather typical, it's not a quirk of just one unfortunate
> architecture.

The question is how many of those have a different ABI and can not
be reliably detected at system call entry time. If the ABI can't
be changed at runtime then there's no problem either.

x86_64's case is very peculiar because it can execute a 32-bit compat
syscall while the process itself is in 64-bit mode.
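
To illustrate why that matters to a tracer, here is a small sketch with
a handful of syscall numbers from the x86_64 and i386 tables. The enum
and the function syscall_name() are hypothetical; a real tracer would
use complete tables.

```c
/* Which entry path the syscall came in through (tracer's own notion). */
enum entry_abi { ENTRY_X86_64, ENTRY_I386 };

/* Deliberately tiny excerpt of the two tables, for illustration only. */
static const char *syscall_name(enum entry_abi abi, long nr)
{
    if (abi == ENTRY_X86_64) {
        switch (nr) {
        case 1:  return "write";
        case 11: return "munmap";
        case 59: return "execve";
        case 60: return "exit";
        }
    } else {
        switch (nr) {
        case 1:  return "exit";
        case 4:  return "write";
        case 11: return "execve";
        }
    }
    return "not in this excerpt";
}
```

A tracer that sees number 11 and assumes the 64-bit table would treat an
execve made through the 32-bit entry as a munmap, which is exactly the
kind of confusion being discussed.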

> Please look at strace source, get_scno() function, where
> it reads syscall no and parameters. Let's see....
> - POWERPC: has 32-bit and 64-bit mode
> - X86_64: has 32-bit and 64-bit mode
> - IA64: has i386-compat mode
> - ARM: has more than one ABI
> - SPARC: has 32-bit and 64-bit mode

For most of them you can reliably check the mode by looking at registers.

x86_64 and apparently ARM are problematic. Others may too in similar subtle
ways as x86_64, but I can't tell that from strace's code.

ARM looks ok when old cruft isn't enabled (much more likely than compat
mode being disabled in x86_64).

Can SPARC change mode on the fly without detection? Otherwise it looks
like it may be slightly problematic too, though it seems that at least
the ABI is pretty much the same between 32 and 64 bit mode. Same for
PA-RISC. So all in all not sure if they have a problem or not.

For it to be a problem, the only way to figure out what mode the system
call will use has to be looking at the trapping/syscall instruction itself. If
that isn't needed, or if there isn't much difference between the modes
anyway, then there's no problem.
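
For x86, looking at the instruction could be sketched like this. It
assumes the tracer has already fetched the two bytes just before the
tracee's instruction pointer (for example with PTRACE_PEEKTEXT), and the
names below are hypothetical. The result is only a hint: another thread
can rewrite the code between the syscall and the peek.

```c
#include <stdint.h>

enum entry_insn {
    INSN_UNKNOWN,
    INSN_SYSCALL,   /* 0f 05: 64-bit entry */
    INSN_INT80,     /* cd 80: 32-bit entry, usable from a 64-bit task */
    INSN_SYSENTER,  /* 0f 34: 32-bit fast entry */
};

/* Classify the two opcode bytes preceding the return address. */
static enum entry_insn classify_entry_insn(uint8_t b0, uint8_t b1)
{
    if (b0 == 0x0f && b1 == 0x05)
        return INSN_SYSCALL;
    if (b0 == 0xcd && b1 == 0x80)
        return INSN_INT80;
    if (b0 == 0x0f && b1 == 0x34)
        return INSN_SYSENTER;
    return INSN_UNKNOWN;
}
```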

>
> Do you want to re-invent a different arch-specific way to report
> syscall type for each of these arches?

Thing is, it is not just 32 versus 64 bit mode. So one way or the other,
you do end up with an arch-specific way of saying what syscall type it is.

It doesn't matter much where that info is stuffed, it will always be arch
specific, because depending on that value people have to do different
arch specific things.

It's fine to somehow give that info together with PTRACE_EVENT_SYSCALL_ENTRY,
but I don't think it's a good idea to have different syscall entry events
depending on what type they are. I suppose the only reason to do that would
be because we're running out of bits elsewhere.

>
>> What was wrong with using eflags again? Is it too simple or something?
>
> It's x86-specific, and abuses a bit in a real register.

If the problem is limited to only a couple of archs, and we can stuff this
info in the register set for all of them, then I'm all for it.

So far it's just x86_64 and ARM with OABI enabled, with the rest either
fine or unclear.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:08                                                       ` Jamie Lokier
  2012-01-26  1:22                                                         ` Denys Vlasenko
@ 2012-01-26  6:34                                                         ` Indan Zupancic
  2012-01-26 10:31                                                           ` Jamie Lokier
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-26  6:34 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
> Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> returns, and you knowing the TID always changes to the PID?  I haven't
> yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
> not the old one, perhaps that could be changed.

Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
barely documented already, but if it ever changes it will also be
unreliable.

It's still unclear if the PTRACE_EVENT_EXEC comes before or after
or instead of the post-execve ptrace event. I guess before, but
can I count on that? If it is after then I get a stray weird
execve event that messes up the system call cadence.

> It would be good to improve the threaded execve() behaviour for all
> the disappearing TIDs to issue a disappearing event, and the winning
> execve changing-TID to issue an I-am-changing-TID event, anyway.

As Denys said, you get the event with the new PID, and apparently with
the latest kernel you can get the old TID with PTRACE_GETEVENTMSG.

So all the info is there to handle it statelessly now.

My point was that stateless handling is much preferred to stateful
handling, and hence why not having the syscall mode available for
the syscall exit event would be inconvenient sometimes (meaning the
real mode can be different than guessed).
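
For comparison, the stateful bookkeeping a tracer needs when the exit
event does not carry the mode looks roughly like the sketch below. The
struct and function names are hypothetical, and entry_mode stands for
whatever mode value the tracer obtained at entry.

```c
#include <stdbool.h>

/* Per-TID state the tracer has to keep between two syscall stops. */
struct tracee_state {
    bool in_syscall;   /* true between the entry stop and the exit stop */
    int  entry_mode;   /* mode remembered from the entry stop */
};

/*
 * Call on every syscall stop of one TID. Returns true if the stop is
 * taken to be an entry, false if it is taken to be the matching exit.
 * This is a guess based purely on alternation: one missed or extra
 * stop, or a TID change at execve, leaves the guess off by one.
 */
static bool on_syscall_stop(struct tracee_state *t, int mode_at_entry)
{
    if (!t->in_syscall) {
        t->in_syscall = true;
        t->entry_mode = mode_at_entry;
        return true;
    }
    t->in_syscall = false;
    return false;
}
```

With the mode reported on the exit event as well, none of this state
would be needed.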

>> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
>> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
>> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
>> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
>>
>> Not all code wants to receive a syscall exit event all the time, so
>> if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
>> too. That would pretty much halve ptrace's overhead for my use case.
>> But this is orthogonal to the compat problem.
>
> I agree.  I would like to ignore the exit for most syscalls but see a
> few of them.  I guess PTRACE_SETOPTIONS could be used to toggle it,
> with some overhead.

Yes, that's what I had in mind.

> But in the spirit of this thread,
> PTRACE_O_TRACE_BPF would be even better, to completely ignore
> irrelevant syscalls :-)

Yes, that's the only reason I'm interested in BPF, really.
Most system calls are either always allowed, or always denied.
Of the ones that need checking, most of them have file paths.
For those I'm not interested in the post-syscall event.

>> > To future-proof this scheme we may reserve a few more event values
>> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
>> > if we'll ever have arches with more than one non-native syscall
>> > entry. I'm no expert, but looking at strace code, ARM may already have
>> > more than one additional convention how to pass syscall args.
>>
>> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
>> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
>> would be horrible. Keep arch specific stuff in arch specific areas,
>> please don't spread it around.
>>
>> What was wrong with using eflags again? Is it too simple or something?
>
> Well it doesn't deal with the equivalent issue on ARM and PA-RISC.

Those issues are not equivalent. ARM only has that OABI thing which
is hopefully not used in practice. Can you switch modes on-the-fly in
PA-RISC without doing a system call? Both ARM and PA-RISC use only one
struct pt_regs and one syscall table.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:59                                                     ` Jamie Lokier
  2012-01-26  1:21                                                       ` Denys Vlasenko
@ 2012-01-26  8:23                                                       ` Pedro Alves
  2012-01-26  8:53                                                         ` Denys Vlasenko
  1 sibling, 1 reply; 235+ messages in thread
From: Pedro Alves @ 2012-01-26  8:23 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Indan Zupancic,
	Andi Kleen, Andrew Lutomirski, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On 01/26/2012 12:59 AM, Jamie Lokier wrote:
> Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
> whether it's compat as such.

Another idea, avoiding new PTRACE_EVENTs per arch, would be to make
the abi32/abi64/compat/whatnot discriminator retrievable with PTRACE_GETEVENTMSG
instead.  So you'd get PTRACE_EVENT_SYSCALL_ENTRY|EXIT, or the regular old
0x80|SIGTRAP, you'd still fetch the syscall number from $orig_ax (or whatever means
for other archs), as usual, then have extra syscall info in PTRACE_GETEVENTMSG.
I don't know if it'd be simple to make it possible to do PTRACE_GETEVENTMSG
on a 0x80|SIGTRAP trap, but I imagine it so.

-> wait
  <- 0x80|SIGTRAP   (or PTRACE_EVENT_SYSCALL_ENTRY)
-> read regs, find out syscall number
-> PTRACE_GETEVENTMSG, figure out which entry mode was used.

-- 
Pedro Alves


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  8:23                                                       ` Pedro Alves
@ 2012-01-26  8:53                                                         ` Denys Vlasenko
  2012-01-26  9:51                                                           ` Pedro Alves
  0 siblings, 1 reply; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26  8:53 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Jamie Lokier, Oleg Nesterov, Linus Torvalds, Indan Zupancic,
	Andi Kleen, Andrew Lutomirski, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Thu, Jan 26, 2012 at 9:23 AM, Pedro Alves <palves@redhat.com> wrote:
> On 01/26/2012 12:59 AM, Jamie Lokier wrote:
>> Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
>> whether it's compat as such.
>
> Another idea, avoiding new PTRACE_EVENTs per arch, would be to make
> the abi32/abi64/compat/whatnot discriminator retrievable with PTRACE_GETEVENTMSG
> instead.  So you'd get PTRACE_EVENT_SYSCALL_ENTRY|EXIT, or the regular old
> 0x80|SIGTRAP, you'd still fetch the syscall number from $orig_ax (or whatever means
> for other archs), as usual, then have extra syscall info in PTRACE_GETEVENTMSG.
> I don't know if it'd be simple to make it possible to do PTRACE_GETEVENTMSG
> on a 0x80|SIGTRAP trap, but I imagine it so.
>
> -> wait
>  <- 0x80|SIGTRAP   (or PTRACE_EVENT_SYSCALL_ENTRY)
> -> read regs, find out syscall number
> -> PTRACE_GETEVENTMSG, figure out which entry mode was used.

This would require additional ptrace op per syscall entry.
Linus' method and event method wouldn't.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  8:53                                                         ` Denys Vlasenko
@ 2012-01-26  9:51                                                           ` Pedro Alves
  0 siblings, 0 replies; 235+ messages in thread
From: Pedro Alves @ 2012-01-26  9:51 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Pedro Alves, Jamie Lokier, Oleg Nesterov, Linus Torvalds,
	Indan Zupancic, Andi Kleen, Andrew Lutomirski, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On 01/26/2012 08:53 AM, Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 9:23 AM, Pedro Alves <palves@redhat.com> wrote:
>> On 01/26/2012 12:59 AM, Jamie Lokier wrote:
>>> Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
>>> whether it's compat as such.
>>
>> Another idea, avoiding new PTRACE_EVENTs per arch, would be to make
>> the abi32/abi64/compat/whatnot discriminator retrievable with PTRACE_GETEVENTMSG
>> instead.  So you'd get PTRACE_EVENT_SYSCALL_ENTRY|EXIT, or the regular old
>> 0x80|SIGTRAP, you'd still fetch the syscall number from $orig_ax (or whatever means
>> for other archs), as usual, then have extra syscall info in PTRACE_GETEVENTMSG.
>> I don't know if it'd be simple to make it possible to do PTRACE_GETEVENTMSG
>> on a 0x80|SIGTRAP trap, but I imagine it so.
>>
>> -> wait
>>  <- 0x80|SIGTRAP   (or PTRACE_EVENT_SYSCALL_ENTRY)
>> -> read regs, find out syscall number
>> -> PTRACE_GETEVENTMSG, figure out which entry mode was used.
> 
> This would require additional ptrace op per syscall entry.
> Linus' method and event method wouldn't.

Yes.

In any case, with ptrace events the state is not recorded in core files,
which is possibly also important for userspace c/r.
Linus' method or a new regset don't have that drawback.  A new regset requires
an additional ptrace op too, while the former abuses an architecture register,
possibly leading to headaches later on.

-- 
Pedro Alves


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  6:34                                                         ` Indan Zupancic
@ 2012-01-26 10:31                                                           ` Jamie Lokier
  2012-01-26 10:40                                                             ` Denys Vlasenko
  2012-01-26 11:20                                                             ` Indan Zupancic
  0 siblings, 2 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-26 10:31 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Indan Zupancic wrote:
> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> > returns, and you knowing the TID always changes to the PID?  I haven't
> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
> > not the old one, perhaps that could be changed.
> 
> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
> barely documented already, but if it ever changes it will also be
> unreliable.
> 
> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
> or instead of the post-execve ptrace event. I guess before, but
> can I count on that? If it is after then I get a stray weird
> execve event that messes up the system call cadence.

It should be *sent* before because the exec steps must finish before
the execve() syscall "returns".

I'm not sure if the events are guaranteed to be received in the same
order as they are sent.

> >> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> >> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
> >> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> >> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
> >>
> >> Not all code wants to receive a syscall exit event all the time, so
> >> if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
> >> too. That would pretty much halve ptrace's overhead for my use case.
> >> But this is orthogonal to the compat problem.
> >
> > I agree.  I would like to ignore the exit for most syscalls but see a
> > few of them.  I guess PTRACE_SETOPTIONS could be used to toggle it,
> > with some overhead.
> 
> Yes, that's what I had in mind.
> 
> > But in the spirit of this thread,
> > PTRACE_O_TRACE_BPF would be even better, to completely ignore
> > irrelevant syscalls :-)
> 
> Yes, that's the only reason I'm interested in BPF, really.
> Most system calls are either always allowed, or always denied.
> Of the ones that need checking, most of them have file paths.
> For those I'm not interested in the post-syscall event.

Same here, though for tracing file paths rather than blocking anything.

> >> > To future-proof this scheme we may reserve a few more event values
> >> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> >> > if we'll ever have arches with more than one non-native syscall
> >> > entry. I'm no expert, but looking at strace code, ARM may already have
> >> > more than one additional convention how to pass syscall args.
> >>
> >> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
> >> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
> >> would be horrible. Keep arch specific stuff in arch specific areas,
> >> please don't spread it around.
> >>
> >> What was wrong with using eflags again? Is it too simple or something?
> >
> > Well it doesn't deal with the equivalent issue on ARM and PA-RISC.
> 
> Those issues are not equivalent. ARM only has that OABI thing which
> is hopefully not used in practice.

I am still using OABI on some currently-sold and still-developed
devices with userspace libraries that I can't replace or rebuild.
Maybe I'm the only one, but the issue is still there.  It should be
supported in ptrace() as long as it's supported in the kernel at all.

I don't know if the PA-RISC thing is real.

But it's occurred to me that there are a lot of 32/64 archs now (I was
extracting all their syscall number tables last night), and it would
be good if there were a consistent, arch-independent way to signal if
the syscall number is in the 32 or 64-bit table - or at least, in the
same ABI as the tracer gets from <asm/unistd.h>.  For tracers doing
simple things to avoid needing a ton of arch-specific knowledge.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:31                                                           ` Jamie Lokier
@ 2012-01-26 10:40                                                             ` Denys Vlasenko
  2012-01-26 11:01                                                               ` Jamie Lokier
  2012-01-26 11:19                                                               ` Indan Zupancic
  2012-01-26 11:20                                                             ` Indan Zupancic
  1 sibling, 2 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26 10:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Indan Zupancic wrote:
>> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
>> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
>> > returns, and you knowing the TID always changes to the PID?  I haven't
>> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
>> > not the old one, perhaps that could be changed.
>>
>> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
>> barely documented already, but if it ever changes it will also be
>> unreliable.
>>
>> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
>> or instead of the post-execve ptrace event.

Denis <- confused.
Was ist das "post-execve ptrace event"? I know no such thing.
I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".


>> I guess before, but
>> can I count on that? If it is after then I get a stray weird
>> execve event that messes up the system call cadence.
>
> It should be *sent* before because the exec steps must finish before
> the execve() syscall "returns".
>
> I'm not sure if the events are guaranteed to be received in the same
> order as they are sent.

All ptrace stops (events and other stops) are synchronous.
Tracee stops, tracer notices it, tracer restarts tracee,
and only after this tracee can generate next event.
Therefore ptrace stops can't get reordered.
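
Because stops are strictly ordered, a tracer can keep the entry/exit
cadence with one flag per tracee. A minimal sketch, assuming
PTRACE_O_TRACESYSGOOD is set so that syscall stops arrive as
SIGTRAP | 0x80, and assuming tracing started while the tracee was outside
a syscall; `struct cadence` and `cadence_on_syscall_stop` are names made
up for illustration:

```c
#include <stdbool.h>

enum phase { PHASE_ENTRY, PHASE_EXIT };

/* One of these per traced thread. */
struct cadence {
    bool inside;   /* true between a syscall-entry stop and its exit stop */
};

/*
 * Call exactly once for every syscall stop (SIGTRAP | 0x80) of a tracee.
 * Returns which side of the syscall this stop is and flips the flag.
 */
enum phase cadence_on_syscall_stop(struct cadence *c)
{
    enum phase p = c->inside ? PHASE_EXIT : PHASE_ENTRY;

    c->inside = !c->inside;
    return p;
}
```

Event stops such as PTRACE_EVENT_EXEC must not go through this function,
so an exec event that lands between the entry and exit stops of execve
leaves the flag alone.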

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:40                                                             ` Denys Vlasenko
@ 2012-01-26 11:01                                                               ` Jamie Lokier
  2012-01-26 14:02                                                                 ` Denys Vlasenko
  2012-01-26 11:19                                                               ` Indan Zupancic
  1 sibling, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-01-26 11:01 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
> > Indan Zupancic wrote:
> >> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
> >> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> >> > returns, and you knowing the TID always changes to the PID?  I haven't
> >> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
> >> > not the old one, perhaps that could be changed.
> >>
> >> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
> >> barely documented already, but if it ever changes it will also be
> >> unreliable.
> >>
> >> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
> >> or instead of the post-execve ptrace event.
> 
> Denis <- confused.
> Was ist das "post-execve ptrace event"? I know no such thing.
> I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".

Sorry, I meant to write execve-syscall-exit event.

> >> I guess before, but
> >> can I count on that? If it is after then I get a stray weird
> >> execve event that messes up the system call cadence.
> >
> > It should be *sent* before because the exec steps must finish before
> > the execve() syscall "returns".
> >
> > I'm not sure if the events are guaranteed to be received in the same
> > order as they are sent.
> 
> All ptrace stops (events and other stops) are synchronous.
> Tracee stops, tracer notices it, tracer restarts tracee,
> and only after this tracee can generate next event.
> Therefore ptrace stops can't get reordered.

That's good to know, thanks.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:40                                                             ` Denys Vlasenko
  2012-01-26 11:01                                                               ` Jamie Lokier
@ 2012-01-26 11:19                                                               ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-01-26 11:19 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Jamie Lokier, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 26, 2012 11:40, Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> Indan Zupancic wrote:
>>> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
>>> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
>>> > returns, and you knowing the TID always changes to the PID?  I haven't
>>> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
>>> > not the old one, perhaps that could be changed.
>>>
>>> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
>>> barely documented already, but if it ever changes it will also be
>>> unreliable.
>>>
>>> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
>>> or instead of the post-execve ptrace event.
>
> Denis <- confused.
> Was ist das "post-execve ptrace event"? I know no such thing.
> I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".

I mean the second SIGTRAP | 0x80 event, the syscall return of execve.

> All ptrace stops (events and other stops) are synchronous.
> Tracee stops, tracer notices it, tracer restarts tracee,
> and only after this tracee can generate next event.
> Therefore ptrace stops can't get reordered.

That's good to know and what I expected.

Since which kernel version does the PTRACE_GETEVENTMSG work and
is there a way to find out before it returns zero?

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:31                                                           ` Jamie Lokier
  2012-01-26 10:40                                                             ` Denys Vlasenko
@ 2012-01-26 11:20                                                             ` Indan Zupancic
  2012-01-26 11:47                                                               ` Jamie Lokier
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-26 11:20 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> Indan Zupancic wrote:
>> Yes, that's the only reason I'm interested in BPF, really.
>> Most system calls are either always allowed, or always denied.
>> Of the ones that need checking, most of them have file paths.
>> For those I'm not interested in the post-syscall event.
>
> Same here, though for tracing file paths rather than blocking anything.

The jailer I wrote works pretty well as a simplistic strace replacement.
It can only print out the arguments we're checking, but that's usually
the more interesting info.

>> Those issues are not equivalent. ARM only has that OABI thing which
>> is hopefully not used in practice.
>
> I am still using OABI on some currently-sold and still-developed
> devices with userspace libraries that I can't replace or rebuild.
> Maybe I'm the only one, but the issue is still there.  It should be
> supported in ptrace() as long as it's supported in the kernel at all.

It's not a 32 versus 64-bit issue though, so it will be something on
its own anyway. Can as well add an extra ARM specific ptrace command
to get that info, or hack it in some other way. For instance, ip is
(ab)used to tell if it is syscall entry or exit, so doing these tricks
isn't anything new in ARM either.
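
The ip convention mentioned above is small enough to show. A sketch,
assuming the ARM kernel behaviour of this era, where ip (r12) reads 0 at
a syscall-entry stop and 1 at a syscall-exit stop; `arm_phase_from_ip` is
a hypothetical helper, and the caller is assumed to have fetched the
registers already:

```c
enum arm_phase { ARM_SYSCALL_ENTRY, ARM_SYSCALL_EXIT, ARM_PHASE_UNKNOWN };

/* Map the tracee's ip register, read at a syscall stop, to a phase. */
enum arm_phase arm_phase_from_ip(unsigned long ip)
{
    switch (ip) {
    case 0:
        return ARM_SYSCALL_ENTRY;
    case 1:
        return ARM_SYSCALL_EXIT;
    default:
        return ARM_PHASE_UNKNOWN;   /* not a value this convention uses */
    }
}
```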

> I don't know if the PA-RISC thing is real.
>
> But it's occurred to me that there are a lot of 32/64 archs now (I was
> extracting all their syscall number tables last night), and it would
> be good if there were a consistent, arch-independent way to signal if
> the syscall number is in the 32 or 64-bit table - or at least, in the
> same ABI as the tracer gets from <asm/unistd.h>.  For tracers doing
> simple things to avoid needing a ton of arch-specific knowledge.

You can't avoid the arch-specific knowledge, because depending on the
answer, you have to do something arch specific. In ARM's OABI case, it's
reading program memory to find out the system call number, of all things.
(I hope I read the code wrong). So ARM's solution would need to get all
info it needs to handle the system call securely without reading any text
memory, otherwise it's racy.

And then there's the whole confusion what that flag says, some might think
it says in what mode the tracee is instead of what mode the system call is.
That those two can be different is not obvious at all and seems very x86_64
specific.
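
To make the distinction concrete: on x86_64 a tracer can read the
tracee's cs and learn the task mode, but that is not the mode of the
system call. A minimal sketch, assuming the usual Linux user code segment
selectors (0x33 for 64-bit, 0x23 for 32-bit); `x86_64_mode_from_cs` is a
hypothetical helper:

```c
enum task_mode { MODE_64BIT, MODE_32BIT, MODE_UNKNOWN };

/* Classify the tracee's execution mode from its cs selector. */
enum task_mode x86_64_mode_from_cs(unsigned long long cs)
{
    switch (cs) {
    case 0x33:
        return MODE_64BIT;    /* 64-bit user code segment */
    case 0x23:
        return MODE_32BIT;    /* 32-bit (compat) user code segment */
    default:
        return MODE_UNKNOWN;
    }
}
```

A 64-bit task that enters the kernel through int $0x80 still reports 0x33
here although the 32-bit syscall table is used, so this value is only a
guess at the syscall ABI.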

I'm not sure what you're doing, but perhaps we should share code and write
a kind of Linux ptrace library. The code I wrote was university stuff and
we want to release it, but it will take a while to get things sorted out.
Hopefully it's released in April, maybe before.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:20                                                             ` Indan Zupancic
@ 2012-01-26 11:47                                                               ` Jamie Lokier
  2012-01-26 14:05                                                                 ` Denys Vlasenko
  2012-01-27  7:23                                                                 ` Indan Zupancic
  0 siblings, 2 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-01-26 11:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Indan Zupancic wrote:
> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> > Indan Zupancic wrote:
> >> Yes, that's the only reason I'm interested in BPF, really.
> >> Most system calls are either always allowed, or always denied.
> >> Of the ones that need checking, most of them have file paths.
> >> For those I'm not interested in the post-syscall event.
> >
> > Same here, though for tracing file paths rather than blocking anything.
> 
> The jailer I wrote works pretty well as a simplistic strace replacement.
> It can only print out the arguments we're checking, but that's usually
> the more interesting info.

In theory such a thing should be easy to write, but as we both found,
ptrace() on Linux has a huge number of difficult quirks to deal with
to trace reliably.  At least it's getting better with later kernels.

> >> Those issues are not equivalent. ARM only has that OABI thing which
> >> is hopefully not used in practice.
> >
> > I am still using OABI on some currently-sold and still-developed
> > devices with userspace libraries that I can't replace or rebuild.
> > Maybe I'm the only one, but the issue is still there.  It should be
> > supported in ptrace() as long as it's supported in the kernel at all.
> 
> It's not a 32 versus 64-bit issue though, so it will be something on
> its own anyway. Can as well add an extra ARM specific ptrace command
> to get that info, or hack it in some other way. For instance, ip is
> (ab)used to tell if it is syscall entry or exit, so doing these tricks
> isn't anything new in ARM either.

In theory, aren't we supposed to know whether it's entry/exit anyway?
Why does strace care?  Have there been kernel bugs in the past?  Maybe
it was just to deal with SIGTRAP-after-exit in the past, which could
be delivered at an unpredictable time if blocked and then unblocked by
sigreturn().

> You can't avoid the arch-specific knowledge, because depending on the
> answer, you have to do something arch specific. In ARM's OABI case, it's
> reading program memory to find out the system call number, of all things.
> (I hope I read the code wrong). So ARM's solution would need to get all
> info it needs to handle the system call securely without reading any text
> memory, otherwise it's racy.

A few archs read program memory to get the syscall number even now, in
the current strace source.  Look for PEEKTEXT: S390, ARM, SPARC use it
on every syscall entry, and X86_64 has it commented out.

As we know, all of them are buggy if the memory is modified while
reading it, and it's silly because the kernel knows the syscall
number.
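
For reference, the ARM decoding discussed above looks roughly like this.
A sketch, assuming ARM (not Thumb) mode, that `insn` is the instruction
word the tracer fetched with PTRACE_PEEKTEXT from just before the
tracee's pc, and that `r7` comes from its registers; `arm_decode_swi` is
a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

#define ARM_OABI_SYSCALL_BASE 0x900000u

/*
 * Decode the syscall number from an ARM-mode SWI instruction word.
 * Returns false if the word is not a recognisable syscall trap.
 */
bool arm_decode_swi(uint32_t insn, uint32_t r7, uint32_t *nr)
{
    uint32_t imm;

    if ((insn & 0x0f000000u) != 0x0f000000u)
        return false;               /* not an SWI instruction */

    imm = insn & 0x00ffffffu;       /* 24-bit immediate field */
    if (imm == 0) {
        *nr = r7;                   /* EABI: "swi 0", number in r7 */
        return true;
    }
    if ((imm & 0x00f00000u) != ARM_OABI_SYSCALL_BASE)
        return false;               /* immediate lacks the OABI base */

    *nr = imm & 0x000fffffu;        /* OABI: number encoded in the insn */
    return true;
}
```

Another thread can rewrite the instruction between the trap and the read,
so a number obtained this way cannot be trusted for security decisions.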

> And then there's the whole confusion what that flag says, some might think
> it says in what mode the tracee is instead of what mode the system call is.
> That those two can be different is not obvious at all and seems very x86_64
> specific.

My rough read of PARISC entry code suggests it has two entry methods,
similar to ARM and x86_64, but I'm not really familiar with PARISC and
I don't have a machine handy to try it out :-)

> I'm not sure what you're doing, but perhaps we should share code and write
> a kind of Linux ptrace library. The code I wrote was university stuff and
> we want to release it, but it will take a while to get things sorted out.
> Hopefully it's released in April, maybe before.

I've been thinking along similar lines.  The idea came up when I was
hacking on strace last year and it so wanted to be cleaned up (but now
strace is in good hands, my work on it is obsolete); now I'm doing
ptracing for other purposes.  Denys' ptrace API document, currently in
strace git, is extremely useful.

Denys, would you be interested in further refactoring strace to use a
"libsystrace" sort of thing which abstracts the detail of archs,
tracing (and maybe syscall argument layout) away from the printing and
user-interface, for strace's use and other users?  I would be happy to
help with that and keep strace's non-Linux support as well (if there's
any way to test the latter...)  I seem to be going in the direction of
a library like that anyway for another project.

The seccomp-BPF stuff could also benefit from a part dealing with
syscall argument layout, as it too needs that arch-specific
knowledge.  I have a script in progress which extracts all the
per-arch and per-ABI syscall numbers, syscall argument layouts and
kernel function names to keep track of arch-specific fixups, from a
Linux source tree.  It currently works on all archs except it breaks
on x86 which insists on being different ;-)

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:01                                                               ` Jamie Lokier
@ 2012-01-26 14:02                                                                 ` Denys Vlasenko
  0 siblings, 0 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26 14:02 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, Jan 26, 2012 at 12:01 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Denys Vlasenko wrote:
>> On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> >> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
>> >> or instead of the post-execve ptrace event.
>>
>> Denis <- confused.
>> Was ist das "post-execve ptrace event"? I know no such thing.
>> I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".
>
> Sorry, I meant to write execve-syscall-exit event.

PTRACE_EVENT_EXEC happens before syscall exit. syscall exit
is not lost. Basically, the sequence is:

tracer               tracee with tid N, tgid M
   <------------- syscall entry for execve, pid=N
PTRACE_SYSCALL--->
   <------------- PTRACE_EVENT_EXEC, pid=M
PTRACE_GETEVENTMSG-->
   <------------- returns N ("I used to be tid N")
PTRACE_SYSCALL--->
   <------------- syscall exit for execve, pid=M
...
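
The sequence above could be handled in a tracer roughly as follows. This
is a sketch, not strace's code: `classify_stop` and `old_tid_after_exec`
are hypothetical helper names, PTRACE_O_TRACESYSGOOD and
PTRACE_O_TRACEEXEC are assumed to be set, and the old TID is only
reported by kernels new enough to support it.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

enum stop_kind { STOP_OTHER, STOP_SYSCALL, STOP_EXEC_EVENT };

/* Classify a status value returned by waitpid() for a traced thread. */
enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_OTHER;
    /* Syscall entry or exit stop, marked by PTRACE_O_TRACESYSGOOD. */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;
    /* PTRACE_EVENT_EXEC stop: the event code sits above the signal. */
    if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
        return STOP_EXEC_EVENT;
    return STOP_OTHER;
}

/*
 * At a PTRACE_EVENT_EXEC stop, ask for the event message, which carries
 * the former TID of the thread that called execve. Returns -1 on error.
 */
long old_tid_after_exec(pid_t pid)
{
    unsigned long msg = 0;

    if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == -1)
        return -1;
    return (long)msg;
}
```

A tracer would call `old_tid_after_exec` only when `classify_stop`
returns STOP_EXEC_EVENT, then drop its state for the old TID and carry on
under the new pid until the execve exit stop arrives.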

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:47                                                               ` Jamie Lokier
@ 2012-01-26 14:05                                                                 ` Denys Vlasenko
  2012-01-27  7:23                                                                 ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26 14:05 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, Jan 26, 2012 at 12:47 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Indan Zupancic wrote:
> Denys, would you be interested in further refactoring strace to use a
> "libsystrace" sort of thing which abstracts the detail of archs,
> tracing (and maybe syscall argument layout) away from the printing and
> user-interface, for strace's use and other users?  I would be happy to
> help with that and keep strace's non-Linux support as well (if there's
> any way to test the latter...)  I seem to be going in the direction of
> a library like that anyway for another project.

It might make sense to do this.
Design of this library would depend on the needs
of those other projects. Where can I see their code?

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  3:47                                                         ` Linus Torvalds
@ 2012-01-26 18:03                                                           ` Denys Vlasenko
  2017-03-08 23:41                                                             ` Dmitry V. Levin
  0 siblings, 1 reply; 235+ messages in thread
From: Denys Vlasenko @ 2012-01-26 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Oleg Nesterov, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Hi Linus,

On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>> Please look at strace source, get_scno() function, where
>> it reads syscall no and parameters. Let's see....
>> - POWERPC: has 32-bit and 64-bit mode
>> - X86_64: has 32-bit and 64-bit mode
>> - IA64: has i386-compat mode
>> - ARM: has more than one ABI
>> - SPARC: has 32-bit and 64-bit mode
>>
>> Do you want to re-invent a different arch-specific way to report
>> syscall type for each of these arches?
>
> I think an arch-specific one is better than trying to make some
> generic one that is messy.
>
> As you say, many architectures have multiple system call ABIs.
>
> But they tend to be very *different* issues. They can be about
> multiple ABI's, as you mention, and even when they *look* similar
> (32-bit vs 64-bit ABI's) they are actually totally different issues.
> [skip]

I don't have a particular attachment to my solution,
and I think we have already talked about this problem for
far too long.

Looks like nobody is _strongly_ opposed to your patch
which uses a few bits in eflags to report bitness
of the x86 syscall.

Let's just do that already. If you commit it to the kernel git,
I will immediately change strace accordingly.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 23:32                                                   ` Denys Vlasenko
  2012-01-26  0:40                                                     ` Indan Zupancic
  2012-01-26  0:59                                                     ` Jamie Lokier
@ 2012-01-26 18:44                                                     ` Oleg Nesterov
  2012-02-10  2:51                                                       ` Jamie Lokier
  2 siblings, 1 reply; 235+ messages in thread
From: Oleg Nesterov @ 2012-01-26 18:44 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On 01/26, Denys Vlasenko wrote:
>
> On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> >
> > We can add the new events,
> >
> > 	PTRACE_EVENT_SYSCALL_ENTRY
> > 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> > 	PTRACE_EVENT_SYSCALL_EXIT
> > 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
>
> We can get away with just the first one.
> (1) It's unlikely people would want to get native sysentry events but not compat ones,
> thus first two options can be combined into one;

Confused... Sure, we need the single option, or we could even report
this unconditionally if PT_SEIZED.

I meant the different PTRACE_EVENT_* codes only.

> (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> (3) if we flag syscall entry with an event value in the wait status, then syscall
> exit will already be distinguishable.

Well, if we add _ENTRY then it looks more consistent to report _EXIT
as well even if it is not that useful.

Doesn't matter. Nobody seems to like this, and afaics Linus has
good arguments against the arch-independent "consolidation".

Oleg.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:47                                                               ` Jamie Lokier
  2012-01-26 14:05                                                                 ` Denys Vlasenko
@ 2012-01-27  7:23                                                                 ` Indan Zupancic
  2012-02-10  2:02                                                                   ` Jamie Lokier
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-01-27  7:23 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
> Indan Zupancic wrote:
>> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
>> > Indan Zupancic wrote:
>> >> Yes, that's the only reason I'm interested in BPF, really.
>> >> Most system calls are either always allowed, or always denied.
>> >> Of the ones that need checking, most of them have file paths.
>> >> For those I'm not interested in the post-syscall event.
>> >
>> > Same here, though for tracing file paths rather than blocking anything.
>>
>> The jailer I wrote works pretty well as a simplistic strace replacement.
>> It can only print out the arguments we're checking, but that's usually
>> the more interesting info.
>
> In theory such a thing should be easy to write, but as we both found,
> ptrace() on Linux has a huge number of difficult quirks to deal with
> to trace reliably.  At least it's getting better with later kernels.

It's not that bad, there are a few quirks, but not that many.
The ptrace specific code is less than 500 lines of code, with
a couple of hundred lines of header files. Linux ptrace specific
stuff creeps in elsewhere too though, like that execve mess.

>> It's not a 32 versus 64-bit issue though, so it will be something on
>> its own anyway. Can as well add an extra ARM specific ptrace command
>> to get that info, or hack it in some other way. For instance, ip is
>> (ab)used to tell if it is syscall entry or exit, so doing these tricks
>> isn't anything new in ARM either.
>
> In theory, aren't we supposed to know whether it's entry/exit anyway?
> Why does strace care?  Have there been kernel bugs in the past?  Maybe
> it was just to deal with SIGTRAP-after-exit in the past, which could
> be delivered at an unpredictable time if blocked and then unblocked by
> sigreturn().

Maybe. I don't know why ARM does that ip thing.

In theory you know the entries/exits if you keep track, but one
mistake or unexpected behaviour (like execve for my code) and you can get
it wrong. So for robustness' sake it's good if it can be double-checked.

>> You can't avoid the arch-specific knowledge, because depending on the
>> answer, you have to do something arch specific. In ARM's OABI case, it's
>> reading program memory to find out the system call number, of all things.
>> (I hope I read the code wrong). So ARM's solution would need to get all
>> info it needs to handle the system call securely without reading any text
>> memory, otherwise it's racy.
>
> A few archs read program memory to get the syscall number even now, in
> the current strace source.  Look for PEEKTEXT: S390, ARM, SPARC use it
> on every syscall entry, and X86_64 has it commented out.

I did look for PEEKTEXT. For ARM it's to check if OABI is used (and
if it is, the syscall is in memory, otherwise it's in r7). Strace only
uses it on S390 to handle old style ABI, 2.6 is fine. On SPARC Strace
does it to figure out what personality is used. But that can only be
changed via personality(2) and not secretly at runtime, or so it seems,
so SPARC should be safe too. But I can't really figure out the kernel
SPARC code to be honest, so I may be wrong. It seems the trap instruction
differs between SPARC 32 and 64-bit, but on the other hand they both use
the same syscall table, so at least the syscall nr can't be confused.

> As we know, all of them are buggy if the memory is modified while
> reading it, and it's silly because the kernel knows the syscall
> number.

Only ARM OABI is really problematic in that regard, but that's not a
32 versus 64-bit issue.

I don't know anything about OABI. Can you link an OABI program against
an EABI library? If you can, then libc can be EABI and the kernel doesn't
need OABI support.

>> And then there's the whole confusion what that flag says, some might think
>> it says in what mode the tracee is instead of what mode the system call is.
>> That those two can be different is not obvious at all and seems very x86_64
>> specific.
>
> My rough read of PARISC entry code suggests it has two entry methods,
> similar to ARM and x86_64, but I'm not really familiar with PARISC and
> I don't have a machine handy to try it out :-)

It has a unified syscall table, so does it really matter?

>> I'm not sure what you're doing, but perhaps we should share code and write
>> a kind of Linux ptrace library. The code I wrote was university stuff and
>> we want to release it, but it will take a while to get things sorted out.
>> Hopefully it's released in April, maybe before.
>
> I've been thinking along similar lines.  The idea came up when I was
> hacking on strace last year and it so wanted to be cleaned up (but now
> strace is in good hands, my work on it is obsolete); now I'm doing
> ptracing for other purposes.  Denys' ptrace API document, currently in
> strace git, is extremely useful.
>
> Denys, would you be interested in further refactoring strace to use a
> "libsystrace" sort of thing which abstracts the detail of archs,
> tracing (and maybe syscall argument layout) away from the printing and
> user-interface, for strace's use and other users?  I would be happy to
> help with that and keep strace's non-Linux support as well (if there's
> any way to test the latter...)  I seem to be going in the direction of
> a library like that anyway for another project.

I actually recommend leaving strace as it is. I've seen the code,
it's full of arch and OS specific stuff scattered all over the
place. Considering it actually works now, why risk breaking anything?
Especially considering you can't test any changes on all supported
platforms. Just leave it be and slowly improve it, bit by bit, in the
parts you can actually test.

The point of the library would be to make it easier to create new
software, possibly by using all the new features and dropping support
for too old kernels. Strace doesn't really benefit from that.

> The seccomp-BPF stuff could also benefit from a part dealing with
> syscall argument layout, as it too needs that arch-specific
> knowledge.

It seems I convinced them to use a cross-platform ABI, so you should
get the system call number and arguments directly.

> I have a script in progress which extracts all the
> per-arch and per-ABI syscall numbers, syscall argument layouts and
> kernel function names to keep track of arch-specific fixups, from a
> Linux source tree.  It currently works on all archs except it breaks
> on x86 which insists on being different ;-)

That's handy, but I thought strace had such a script already?
See HACKING-scripts in strace source. Or is yours much better?

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 20:26                                                 ` Linus Torvalds
  2012-01-18 20:55                                                   ` H. Peter Anvin
@ 2012-02-06  8:32                                                   ` Indan Zupancic
  2012-02-06 17:02                                                     ` H. Peter Anvin
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-06  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, January 18, 2012 21:26, Linus Torvalds wrote:
> Added Peter to the cc, since this is now about some x86-specific
> things. Ingo was already cc'd earlier.
>
> On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Using the high bits of 'eflags' might work. Hopefully nobody tests
>> that. IOW, something like the attached might work. It just sets bit#32
>> in eflags if the system call is a compat call.
>
> So that description was bogus, it was what my original patch did, but
> not the one I actually sent out (Peter - you can find it on lkml,
> although the description below is probably sufficient for you to
> understand what it does, or the obvious nature of the attached patch
> for strace).
>
> The one I sent out *unconditionally* sets one bit in the high bits of
> the returned value of the eflags register from ptrace(), very much on
> purpose. That way you can unambiguously see whether it's an old kernel
> (bits clear) or a new kernel that supports the feature. On a new
> kernel, bit #32 of eflags will be set for a native 64-bit system call,
> and bit #33 will be set for a compat system call.
>
> And some testing says that it works. In particular, I have a patch to
> strace-4.6 that is able to correctly decode my mixed-case binary that
> uses both the compat system call and the native system calls from
> 64-bit long mode. Also, it looks like gdb ignores the high bits of
> eflags, since it "knows" that eflags is just a 32-bit register even in
> 64-bit mode, so the fact that we set some random bits in there doesn't
> end up being noisy for at least one debugger.
>
> HOWEVER. I'm not going to guarantee that this is the right approach.
> It seems to work, and it clearly gives people real information, but
> whether this is the best way to do things or not is open.

It seems that just using eflags is a lot simpler than the alternatives,
let's just go for it.

>
> The reason I picked 'eflags' was that it
>
>  (a) was easy from an implementation standpoint, since we already have
> to handle reading of eflags specially in ptrace (we have to fake out
> the resume bit)
>
>  (b) it "kind of" makes sense to make high bits be "system flags",
> with low bits being "cpu flags", so it fits at least *some* kind of
> conceptual model.
>
>  (c) the other sane places to put it (high bits of CS and/or ORIG_AX)
> were being used and compared as 64-bit values at least by strace.
> Whether eflags works for all users, I have no idea, but generally you
> would never compare eflags for one particular value - you might check
> individual bits in eflags, but hopefully setting a few new bits should
> not be something that any legacy user would ever really notice.
>
> So there are reasons to think that my patch is sane, but...
>
> Here's the strace patch, so people can look. I didn't even test it on
> an old kernel, but the fallback case to the old behavior looks
> trivial.
>
> Comments?

I propose using bits somewhere in the middle of the upper half. If new
flags are ever added by Intel or AMD, they will use the lower bits. If
anyone else ever adds flags, they most likely add them to the top (VIA).
So the middle seems the safest spot as far as long-term maintenance goes.

The below version does that, but instead of setting one of the two bits,
it always sets bit 50 for newer kernels and sets bit 51 if it's a compat
system call. I find this version more readable and after compilation it's
also a couple of bytes smaller compared to Linus' original version.

Should we make sure that the top 32 bits are zero, in case any weird
hardware does set our bits?

Greetings,

Indan

---

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..a7fda48 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -353,6 +353,7 @@ static int set_segment_reg(struct task_struct *task,

 static unsigned long get_flags(struct task_struct *task)
 {
+	int bit = 50;
 	unsigned long retval = task_pt_regs(task)->flags;

 	/*
@@ -360,8 +361,11 @@ static unsigned long get_flags(struct task_struct *task)
 	 */
 	if (test_tsk_thread_flag(task, TIF_FORCED_TF))
 		retval &= ~X86_EFLAGS_TF;
-
-	return retval;
+#ifdef CONFIG_IA32_EMULATION
+	if (task_thread_info(task)->status & TS_COMPAT)
+		retval |= (1ul << 51);
+#endif
+	return retval | (1ul << bit);
 }

 static int set_flags(struct task_struct *task, unsigned long value)




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06  8:32                                                   ` Indan Zupancic
@ 2012-02-06 17:02                                                     ` H. Peter Anvin
  2012-02-07  1:52                                                       ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-06 17:02 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, H.J. Lu

On 02/06/2012 12:32 AM, Indan Zupancic wrote:
> 
> It seems that just using eflags is a lot simpler than the alternatives,
> let's just go for it.
> 
> 
> I propose using bits somewhere in the middle of the upper half. If new
> flags are ever added by Intel or AMD, they will use the lower bits. If
> anyone else ever adds flags, they most likely add them to the top (VIA).
> So the middle seems the safest spot as far as long-term maintenance goes.
> 
> The below version does that, but instead of setting one of the two bits,
> it always sets bit 50 for newer kernels and sets bit 51 if it's a compat
> system call. I find this version more readable and after compilation it's
> also a couple of bytes smaller compared to Linus' original version.
> 
> Should we make sure that the top 32 bits are zero, in case any weird
> hardware does set our bits?
> 

[Adding H.J. Lu, since he has run into some of these requirements before]

NAK in the extreme.

We have not heard back from the architecture people on this, and I will
NAK this unless that happens.

Furthermore, you're picking bits that do not work for 32 bits, EVEN
THOUGH WE HAVE A SIMILAR PROBLEM ON 32 BITS; I outlined it for you and
you chose to ignore it.

Finally, I think we actually are going to need a fair number of bits in
the end.  All of this points to using a new regset designed for
extension in the first place.

As far as I can tell, we need at least the following information:

- If the CPU is currently in 32- or 64-bit mode.
- If we are currently inside a system call, and if so if it was entered
  via:
	- SYSCALL64
	- INT 80
	- SYSCALL32
	- SYSENTER

  The reason we need this information is because for the various 32-bit
  entry points we do some very ugly swizzling of registers, which
  matters to a ptrace client which wants to modify system call
  arguments.
- If the process was started as a 64-bit process, i386 process or x32
  process.

This adds up to a minimum of six bits already (and at least two bits on
i386), and that's just a start.
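
One possible accounting that arrives at six bits on x86_64, written
out as a sketch. Everything here (enum names, field order, the
pack_state() helper) is hypothetical and only illustrates the list
above; it is not a proposed layout for the new regset.

```c
#include <assert.h>
#include <stdint.h>

enum cpu_mode   { MODE_32 = 0, MODE_64 };			/* 1 bit  */
enum entry_kind { ENTRY_NONE = 0, ENTRY_SYSCALL64, ENTRY_INT80,
		  ENTRY_SYSCALL32, ENTRY_SYSENTER };		/* 3 bits */
enum proc_abi   { PROC_X86_64 = 0, PROC_I386, PROC_X32 };	/* 2 bits */

/* Pack the three items into the low six bits of a word. */
static uint32_t pack_state(enum cpu_mode mode, enum entry_kind entry,
			   enum proc_abi abi)
{
	assert(mode <= MODE_64 && entry <= ENTRY_SYSENTER && abi <= PROC_X32);
	return (uint32_t)mode | ((uint32_t)entry << 1) | ((uint32_t)abi << 4);
}

static enum cpu_mode state_mode(uint32_t s)
{
	return (enum cpu_mode)(s & 0x1);
}

static enum entry_kind state_entry(uint32_t s)
{
	return (enum entry_kind)((s >> 1) & 0x7);
}

static enum proc_abi state_abi(uint32_t s)
{
	return (enum proc_abi)((s >> 4) & 0x3);
}
```

ENTRY_NONE covers "not inside a system call", which is what pushes the
entry field to three bits.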

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:40                                                                     ` Roland McGrath
  2012-01-20 22:41                                                                       ` H. Peter Anvin
  2012-01-24  8:19                                                                       ` Indan Zupancic
@ 2012-02-06 20:30                                                                       ` H. Peter Anvin
  2012-02-06 20:39                                                                         ` Roland McGrath
  2 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-06 20:30 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On 01/20/2012 02:40 PM, Roland McGrath wrote:
> If you change the size of a regset, then the new full size will be the size
> of the core file notes.  Existing userland tools will not be expecting
> this, they expect a known exact size.  If you need to add new stuff, it
> really is easier all around to add a new regset flavor.  When adding a new
> one, you can make it variable-sized from the start so as to be extensible
> in the future.  We did this for NT_X86_XSTATE, for example.
> 
> Thanks,
> Roland

Hi Roland,

What is needed to make a regset variable-sized?  Just declaring that it
may change in size in the future, or does one need a length field at the
top (I would personally have expected that both notes and ptrace would
have out-of-band methods for getting the size?)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06 20:30                                                                       ` H. Peter Anvin
@ 2012-02-06 20:39                                                                         ` Roland McGrath
  2012-02-06 20:42                                                                           ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Roland McGrath @ 2012-02-06 20:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Mon, Feb 6, 2012 at 12:30 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> What is needed to make a regset variable-sized?  Just declaring that it
> may change in size in the future, or does one need a length field at the
> top (I would personally have expected that both notes and ptrace would
> have out-of-band methods for getting the size?)

ELF notes do have a size field, so core files are self-explanatory.  There
is no ptrace interface to directly interrogate the regset details (one
could be added).  But the PTRACE_GETREGSET interface is to accept an upper
bound and yield the actual size filled in (which might be less than the
regset's size if the user-supplied buffer was smaller).  So in practice, a
caller can just use a buffer that's sure to be large enough, and then look
at iov_len for the actual size delivered.  (And nobody has yet complained
about this for xstate, though that might just be that nobody is really
using it.)
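
The buffer-and-iov_len pattern described above might look like this on
the tracer side. A sketch only: read_regset() is a hypothetical helper
name, while PTRACE_GETREGSET, struct iovec and the NT_* note types are
the existing interfaces. The tracee is assumed to be attached and
stopped.

```c
#include <assert.h>
#include <elf.h>		/* NT_* note types, e.g. NT_PRSTATUS */
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Fetch the regset identified by `note_type` from a stopped tracee.
 * `cap` is the upper bound the caller is willing to accept. Returns
 * the number of bytes the kernel filled in, or -1 with errno set.
 */
static ssize_t read_regset(pid_t pid, unsigned int note_type,
			   void *buf, size_t cap)
{
	struct iovec iov = { .iov_base = buf, .iov_len = cap };

	assert(buf != NULL);

	if (ptrace(PTRACE_GETREGSET, pid,
		   (void *)(uintptr_t)note_type, &iov) == -1)
		return -1;

	/* The kernel rewrote iov_len to the size it actually delivered. */
	return (ssize_t)iov.iov_len;
}
```

A caller would pass a generously sized buffer, for example
read_regset(pid, NT_PRSTATUS, buf, sizeof(buf)), and treat the return
value as the size delivered.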


Thanks,
Roland


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06 20:39                                                                         ` Roland McGrath
@ 2012-02-06 20:42                                                                           ` H. Peter Anvin
  0 siblings, 0 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-06 20:42 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On 02/06/2012 12:39 PM, Roland McGrath wrote:
> On Mon, Feb 6, 2012 at 12:30 PM, H. Peter Anvin<hpa@zytor.com>  wrote:
>> What is needed to make a regset variable-sized?  Just declaring that it
>> may change in size in the future, or does one need a length field at the
>> top (I would personally have expected that both notes and ptrace would
>> have out-of-band methods for getting the size?)
>
> ELF notes do have a size field, so core files are self-explanatory.  There
> is no ptrace interface to directly interrogate the regset details (one
> could be added).  But the PTRACE_GETREGSET interface is to accept an upper
> bound and yield the actual size filled in (which might be less than the
> regset's size if the user-supplied buffer was smaller).  So in practice, a
> caller can just use a buffer that's sure to be large enough, and then look
> at iov_len for the actual size delivered.  (And nobody has yet complained
> about this for xstate, though that might just be that nobody is really
> using it.)
>

That should be fine, since you'd just set it to the size of the fields 
that you know about, and if there are additional fields that you don't 
know about, you logically don't care about them.  If you want to dump 
the full set of data you'd just read until you get a short read... like 
if you were reading a regular file.
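
A sketch of what that means for a client written against one version
of an extensible regset. The struct layout and the import_state() name
are invented for illustration; `delivered` stands for the iov_len
value that PTRACE_GETREGSET hands back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: `mode` came first, `entry` was appended later. */
struct task_state_v2 {
	unsigned int mode;
	unsigned int entry;
};

/*
 * Copy what the kernel delivered into the client's view of the regset.
 * Fields the kernel did not supply stay zero; bytes beyond what this
 * client knows about are ignored. Returns the number of valid bytes
 * in *out.
 */
static size_t import_state(struct task_state_v2 *out, const void *raw,
			   size_t delivered)
{
	size_t n = delivered < sizeof(*out) ? delivered : sizeof(*out);

	assert(out != NULL && raw != NULL);

	memset(out, 0, sizeof(*out));
	memcpy(out, raw, n);
	return n;
}
```

The same client then works against an older kernel (short data, zeroed
tail) and a newer one (extra fields it never looks at).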

	-hpa



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06 17:02                                                     ` H. Peter Anvin
@ 2012-02-07  1:52                                                       ` Indan Zupancic
  2012-02-09  0:19                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-07  1:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, H.J. Lu

On Mon, February 6, 2012 18:02, H. Peter Anvin wrote:
> On 02/06/2012 12:32 AM, Indan Zupancic wrote:
>>
>> It seems that just using eflags is a lot simpler than the alternatives,
>> let's just go for it.
>>
>>
>> I propose using bits somewhere in the middle of the upper half. If new
>> flags are ever added by Intel or AMD, they will use the lower bits. If
>> anyone else ever adds flags, they most likely add them to the top (VIA).
>> So the middle seems the safest spot as far as long-term maintenance goes.
>>
>> The below version does that, but instead of setting one of the two bits,
>> it always sets bit 50 for newer kernels and sets bit 51 if it's a compat
>> system call. I find this version more readable and after compilation it's
>> also a couple of bytes smaller compared to Linus' original version.
>>
>> Should we make sure that the top 32 bits are zero, in case any weird
>> hardware does set our bits?
>>
>
> [Adding H.J. Lu, since he has run into some of these requirements before]
>
> NAK in the extreme.
>
> We have not heard back from the architecture people on this, and I will
> NAK this unless that happens.
>
> Furthermore, you're picking bits that do not work for 32 bits, EVEN
> THOUGH WE HAVE A SIMILAR PROBLEM ON 32 BITS; I outlined it for you and
> you chose to ignore it.

Sorry, I missed that. I looked up that email and you indeed did, though
you didn't give any details about what the problems are.

> Finally, I think we actually are going to need a fair number of bits in
> the end.  All of this points to using a new regset designed for
> extension in the first place.
>
> As far as I can tell, we need at least the following information:
>
> - If the CPU is currently in 32- or 64-bit mode.

What is the best way to find that out at the kernel side? Add a function
that checks cs and returns the correct answer? But in the kernel path the
CPU is always in 64-bit mode, so I suppose you want to know what mode the
tracee was in?

> - If we are currently inside a system call, and if so if it was entered
>   via:
> 	- SYSCALL64
> 	- INT 80
> 	- SYSCALL32
> 	- SYSENTER
>
>   The reason we need this information is because for the various 32-bit
>   entry points we do some very ugly swizzling of registers, which
>   matters to a ptrace client which wants to modify system call
>   arguments.

But isn't the swizzling done in such way that all this is hidden from
ptrace clients (and the rest of the kernel)? Why would a ptrace client
need to know the details of the 32-bit entry call?

The ptrace client can always modify the same registers, as system calls
always use the same registers too. No unexpected behaviour happens as
far as I can tell from looking at the code, at least not in the syscall
entry path.

E.g. ENTRY(ia32_cstar_target) in ia32entry.S does:

	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

To hide that for SYSCALL32 arg2 comes in ebp instead of rcx. Same for arg6.

(I actually can't find a SYSCALL32 entry in entry_32.S, am I blind or
was it too slow until the 64-bit Athlons showed up?)

A pure 32-bit kernel is compiled with:

#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

So all arguments are passed on the stack and those arguments can be
directly modified by ptrace. For compat kernels the arguments are
reloaded after ptrace and before the actual system call is done.

> - If the process was started as a 64-bit process, i386 process or x32
>   process.

Can't that be figured out by looking at the AUXV data? Either via /proc
or PTRACE_GETREGSET + NT_AUXV. And as this can't change, there is no
need to pass it on all the time.

> This adds up to a minimum of six bits already (and at least two bits on
> i386), and that's just a start.

I'm not convinced that there is any real problem, it seems only one extra
bit for the task CPU mode would be needed, so three bits in total.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:12                                             ` Oleg Nesterov
  2012-01-18 21:09                                               ` Chris Evans
@ 2012-02-07 11:45                                               ` Indan Zupancic
  1 sibling, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-02-07 11:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Evans, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, January 18, 2012 18:12, Oleg Nesterov wrote:
> On 01/18, Oleg Nesterov wrote:
>>
>> On 01/17, Chris Evans wrote:
>> >
>> > 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>> > 2) Tracee traps
>> > 2b) Tracee could take a SIGKILL here
>> > 3) Tracer looks at registers; bad syscall
>> > 3b) Or tracee could take a SIGKILL here
>> > 4) The only way to stop the bad syscall from executing is to rewrite
>> > orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> > syscall has finished)
>> > 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> > pid (such as PTRACE_SETREGS) fails.
>> > 6) Syscall fork() executes; possible unsupervised process now running
>> > since the tracer wasn't expecting the fork() to be allowed.
>>
>> As for fork() in particular, it can't succeed after SIGKILL.
>>
>> But I agree, probably it makes sense to change ptrace_stop() to check
>> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
>> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>>
>> 	-	return 0;
>> 	+	return !fatal_signal_pending();
>>
>> (no, I do not literally mean the change above)
>>
>> Not only for security. The current behaviour sometimes confuses the
>> users. Debugger sends SIGKILL to the tracee and assumes it should
>> die asap, but the tracee exits only after syscall.
>
> Something like the patch below.
>
> Oleg.
>
> --- x/include/linux/tracehook.h
> +++ x/include/linux/tracehook.h
> @@ -54,12 +54,12 @@ struct linux_binprm;
>  /*
>   * ptrace report for syscall entry and exit looks identical.
>   */
> -static inline void ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs)
>  {
>  	int ptrace = current->ptrace;
>
>  	if (!(ptrace & PT_PTRACED))
> -		return;
> +		return 0;
>
>  	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
>  		send_sig(current->exit_code, current, 1);
>  		current->exit_code = 0;
>  	}
> +
> +	return fatal_signal_pending(current);
>  }
>
>  /**
> @@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
>  static inline __must_check int tracehook_report_syscall_entry(
>  	struct pt_regs *regs)
>  {
> -	ptrace_report_syscall(regs);
> -	return 0;
> +	return ptrace_report_syscall(regs);
>  }
>

Tested-by: Indan Zupancic <indan@nul.nu>

Tested on 32-bit x86. It behaves as expected, please apply.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-07  1:52                                                       ` Indan Zupancic
@ 2012-02-09  0:19                                                         ` H. Peter Anvin
  2012-02-09  4:20                                                           ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-09  0:19 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, H.J. Lu

On 02/06/2012 05:52 PM, Indan Zupancic wrote:
>>
>> - If the CPU is currently in 32- or 64-bit mode.
> 
> What is the best way to find that out at the kernel side? Add a function
> that checks cs and returns the correct answer? But in the kernel path the
> CPU is always in 64-bit mode, so I suppose you want to know what mode the
> tracee was in?
> 

You need to look at the CS descriptor.

>> - If we are currently inside a system call, and if so if it was entered
>>   via:
>> 	- SYSCALL64
>> 	- INT 80
>> 	- SYSCALL32
>> 	- SYSENTER
>>
>>   The reason we need this information is because for the various 32-bit
>>   entry points we do some very ugly swizzling of registers, which
>>   matters to a ptrace client which wants to modify system call
>>   arguments.
> 
> But isn't the swizzling done in such way that all this is hidden from
> ptrace clients (and the rest of the kernel)? Why would a ptrace client
> need to know the details of the 32-bit entry call?
>  
> The ptrace client can always modify the same registers, as system calls
> always use the same registers too. No unexpected behaviour happens as
> far as I can tell from looking at the code, at least not in the syscall
> entry path.

The simple stuff works, but once you want to do things like change the
arguments and/or move the execution point, things get unswizzled in
uncontrolled ways.  There are bug reports related to that (I would have
to dig them up) and they aren't really fixable in any sane way right now.

> A pure 32-bit kernel is compiled with:
> 
> #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

... which we'd like to get rid of ...

> So all arguments are passed on the stack and those arguments can be
> directly modified by ptrace. For compat kernels the arguments are
> reloaded after ptrace and before the actual system call is done.

>> - If the process was started as a 64-bit process, i386 process or x32
>>   process.
> 
> Can't that be figured out by looking at the AUXV data? Either via /proc
> or PTRACE_GETREGSET + NT_AUXV. And as this can't change, there is no
> need to pass it on all the time.

I'll look at the auxv stuff.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  0:19                                                         ` H. Peter Anvin
@ 2012-02-09  4:20                                                           ` Indan Zupancic
  2012-02-09  4:29                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-09  4:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, H.J. Lu

On Thu, February 9, 2012 01:19, H. Peter Anvin wrote:
> On 02/06/2012 05:52 PM, Indan Zupancic wrote:
>>>
>>> - If the CPU is currently in 32- or 64-bit mode.
>>
>> What is the best way to find that out at the kernel side? Add a function
>> that checks cs and returns the correct answer? But in the kernel path the
>> CPU is always in 64-bit mode, so I suppose you want to know what mode the
>> tracee was in?
>>
>
> You need to look at the CS descriptor.

CS is already available to user space, but any other value than 0x23 or 0x33
will confuse user space, as that is all they know about. Apparently Xen uses
different values, but if those are static then user space can check for them
separately. But if the values change dynamically then some other way may be
needed.
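
For illustration, the check a tracer can do today on the cs field it
gets from PTRACE_GETREGS looks roughly like this. The helper and enum
names are hypothetical; only the two selector values come from the
discussion above:

```c
#include <stdint.h>

/* Hypothetical tracer-side helper, not an existing API. */
enum cs_mode { CS_MODE_32, CS_MODE_64, CS_MODE_UNKNOWN };

static enum cs_mode classify_user_cs(uint64_t cs)
{
	switch (cs) {
	case 0x23:	/* 32-bit user code segment */
		return CS_MODE_32;
	case 0x33:	/* 64-bit user code segment */
		return CS_MODE_64;
	default:	/* any other selector: user space cannot tell */
		return CS_MODE_UNKNOWN;
	}
}
```

Everything that ends up in the default case is where the confusion
starts.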

But does it make much sense to pass the CPU mode of user space if that mode
can be changed at any moment? I don't think it really does. Can you give an
example of how that info can be used by a ptracer?

>
>>> - If we are currently inside a system call, and if so if it was entered
>>>   via:
>>> 	- SYSCALL64
>>> 	- INT 80
>>> 	- SYSCALL32
>>> 	- SYSENTER
>>>
>>>   The reason we need this information is because for the various 32-bit
>>>   entry points we do some very ugly swizzling of registers, which
>>>   matters to a ptrace client which wants to modify system call
>>>   arguments.
>>
>> But isn't the swizzling done in such way that all this is hidden from
>> ptrace clients (and the rest of the kernel)? Why would a ptrace client
>> need to know the details of the 32-bit entry call?
>>
>> The ptrace client can always modify the same registers, as system calls
>> always use the same registers too. No unexpected behaviour happens as
>> far as I can tell from looking at the code, at least not in the syscall
>> entry path.
>
> The simple stuff works, but once you want to do things like change the
> arguments and/or move the execution point, things get unswizzled in
> uncontrolled ways.

I do both and haven't encountered any problems.

I can't find any unswizzling happening in the return path though. So
from a ptracer's point of view it all looks the same after a system
call, no matter how it was entered. Except for IP perhaps, but that's
handled in the vDSO.

> There are bug reports related to that (I would have
> to dig them up) and they aren't really fixable in any sane way right now.

I don't see any problems in the code.

Only confusion I can think of is someone following the register values
across a system call instruction. Then the swizzling may be unexpected.
But if they do that they could check how the syscall was entered and
compensate for that. (I can't think of any requirement why this would
need to be race-free.)

>> A pure 32-bit kernel is compiled with:
>>
>> #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
>
> ... which we'd like to get rid of ...

If you do get rid of it, then you have to reload the registers after
ptrace, just like currently happens on x86_64 kernels. So regparm(0)
isn't a requirement, I only explained why reloading the registers
isn't needed for pure 32-bit.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  4:20                                                           ` Indan Zupancic
@ 2012-02-09  4:29                                                             ` H. Peter Anvin
  2012-02-09  6:03                                                               ` Indan Zupancic
  2012-02-09 16:00                                                               ` H.J. Lu
  0 siblings, 2 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-09  4:29 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, H.J. Lu

On 02/08/2012 08:20 PM, Indan Zupancic wrote:
> 
> CS is already available to user space, but any other value than 0x23 or 0x33
> will confuse user space, as that is all they know about. Apparently Xen uses
> different values, but if those are static then user space can check for them
> separately. But if the values change dynamically then some other way may be
> needed.
> 
> But does it make much sense to pass the CPU mode of user space if that mode
> can be changed at any moment? I don't think it really does. Can you give an
> example of how that info can be used by a ptracer?
> 

Uh... you could make THAT argument about ANY register state!

I believe H.J. can fill you in about the usage.

> 
> Only confusion I can think of is someone following the register values
> across a system call instruction. Then the swizzling may be unexpected.
> But if they do that they could check how the syscall was entered and
> compensate for that. (I can't think of any requirement why this would
> need to be race-free.)
> 

You'd have to know how you'd entered, which right now you don't have any
way to know.

	-hpa


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  4:29                                                             ` H. Peter Anvin
@ 2012-02-09  6:03                                                               ` Indan Zupancic
  2012-02-09 14:47                                                                 ` H. Peter Anvin
  2012-02-09 16:00                                                               ` H.J. Lu
  1 sibling, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-09  6:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, H.J. Lu

On Thu, February 9, 2012 05:29, H. Peter Anvin wrote:
> On 02/08/2012 08:20 PM, Indan Zupancic wrote:
>>
>> CS is already available to user space, but any other value than 0x23 or 0x33
>> will confuse user space, as that is all they know about. Apparently Xen uses
>> different values, but if those are static then user space can check for them
>> separately. But if the values change dynamically then some other way may be
>> needed.
>>
>> But does it make much sense to pass the CPU mode of user space if that mode
>> can be changed at any moment? I don't think it really does. Can you give an
>> example of how that info can be used by a ptracer?
>>
>
> Uh... you could make THAT argument about ANY register state!

Well, when the tracee is in a system call, it can't change registers,
and their values determine the system call number and arguments. That
information is stable for the current system call. And as a ptracer
can't determine if the 32 or 64-bit syscall entry path was taken in
a race-free way, it makes sense to provide that extra info.

But the same is not true for the user space CPU mode, that can change
at any time without the tracer getting a notification, except if it is
single stepping (which I forgot about).

Would it be useful to know the CPU mode when single stepping or otherwise?

I'm asking because I don't see a need for it, but if someone else does
it's better to add it now together with the syscall mode bit. Unlike the
system call mode, the CPU mode can be checked via CS. The question is
if that works well enough or if the values are dynamic enough that it's
better to pass the info explicitly instead.

Unlike the syscall mode info, figuring out the mode from CS isn't trivial
when it can change dynamically. Then all places that use non-standard CS
values need to be changed to provide the mode somehow.

> I believe H.J. can fill you in about the usage.

That would be great.

>>
>> Only confusion I can think of is someone following the register values
>> across a system call instruction. Then the swizzling may be unexpected.
>> But if they do that they could check how the syscall was entered and
>> compensate for that. (I can't think of any requirement why this would
>> need to be race-free.)
>>
>
> You'd have to know how you'd entered, which right now you don't have any
> way to know.

You can check the syscall instruction itself, either before it's executed
or afterwards by checking the IP. Though that's trickier, because the
kernel points the IP to just after int80 for a sysenter call, so you have
to check if there's a sysenter nearby too.
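
A minimal sketch of that check, assuming the tracer has already read
the two bytes just before the reported IP (e.g. with PTRACE_PEEKTEXT).
The names are made up; the opcodes are the standard x86 encodings:

```c
#include <stdint.h>

/* Hypothetical names for illustration. */
enum entry_insn { INSN_INT80, INSN_SYSENTER, INSN_SYSCALL, INSN_UNKNOWN };

/* code[] holds the two bytes immediately before the reported IP. */
static enum entry_insn classify_entry_insn(const uint8_t code[2])
{
	if (code[0] == 0xcd && code[1] == 0x80)
		return INSN_INT80;	/* int $0x80 */
	if (code[0] == 0x0f && code[1] == 0x34)
		return INSN_SYSENTER;	/* sysenter */
	if (code[0] == 0x0f && code[1] == 0x05)
		return INSN_SYSCALL;	/* syscall */
	return INSN_UNKNOWN;
}
```

Because of the IP adjustment mentioned above, an INSN_INT80 result
seen after the call does not rule out sysenter; the tracer would still
have to look at the surrounding vDSO code.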

You can also figure out what the entry instruction was by comparing the
register values with the expected ones and deducing it that way.

But the kernel is actually changing the registers, so why hide that?

I mean, once user space is aware that the kernel may do swizzling, is there
any actual problem left? Because this sounds like user space was trying to
be clever, but got it wrong. E.g. it knew the kernel was entered not via
int80, but then got confused because of the swizzling.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  6:03                                                               ` Indan Zupancic
@ 2012-02-09 14:47                                                                 ` H. Peter Anvin
  0 siblings, 0 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-09 14:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, H.J. Lu

On 02/08/2012 10:03 PM, Indan Zupancic wrote:
>
> You can check the syscall instruction itself, either before it's executed
> or afterwards by checking the IP. Though that's trickier, because the
> kernel points the IP to just after int80 for a sysenter call, so you have
> to check if there's a sysenter nearby too.
>

No, that's a total nightmare.  FAIL.

> But the kernel is actually changing the registers, so why hide that?
>
> I mean, once user space is aware that the kernel may do swizzling, is there
> any actual problem left? Because this sounds like user space was trying to
> be clever, but got it wrong. E.g. it knew the kernel was entered not via
> int80, but then got confused because of the swizzling.

It would be great if we didn't have an existing compatibility problem.
As it is we can't get rid of it easily.

	-hpa



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  4:29                                                             ` H. Peter Anvin
  2012-02-09  6:03                                                               ` Indan Zupancic
@ 2012-02-09 16:00                                                               ` H.J. Lu
  2012-02-10  1:09                                                                 ` Indan Zupancic
  1 sibling, 1 reply; 235+ messages in thread
From: H.J. Lu @ 2012-02-09 16:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Feb 8, 2012 at 8:29 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 02/08/2012 08:20 PM, Indan Zupancic wrote:
>>
>> CS is already available to user space, but any other value than 0x23 or 0x33
>> will confuse user space, as that is all they know about. Apparently Xen uses
>> different values, but if those are static then user space can check for them
>> separately. But if the values change dynamically then some other way may be
>> needed.
>>
>> But does it make much sense to pass the CPU mode of user space if that mode
>> can be changed at any moment? I don't think it really does. Can you give an
>> example of how that info can be used by a ptracer?
>>
>
> Uh... you could make THAT argument about ANY register state!
>
> I believe H.J. can fill you in about the usage.
>

GDB uses CS value to tell ia32 process from x86-64 process.
At minimum, we need a bit in CS for GDB.  But any changes
will break old GDB.

H.J.


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09 16:00                                                               ` H.J. Lu
@ 2012-02-10  1:09                                                                 ` Indan Zupancic
  2012-02-10  1:15                                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-10  1:09 UTC (permalink / raw)
  To: H.J. Lu
  Cc: H. Peter Anvin, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Thu, February 9, 2012 17:00, H.J. Lu wrote:
> GDB uses CS value to tell ia32 process from x86-64 process.

Are there any cases when this doesn't work? Someone said Xen can
have different CS values, but looking at the source it seems it's
using the same ones, at least with a Linux hypervisor. So perhaps
it was KVM. Looking at the header it seems paravirtualisation uses
different cs values. On the upside, it seems we can just use that
user_64bit_mode() to know whether it is 32 or 64 bit mode, so
adding a bit telling the process mode is easier than I thought.

Currently there is a need to tell if the 32 or 64-bit syscall
path is being taken, which is independent of the process mode.

> At minimum, we need a bit in CS for GDB.  But any changes
> will break old GDB.

Would adding bits to the upper 32-bit of rflags break GDB?

Do you also need a way to know whether the kernel was entered via
int 0x80, SYSCALL32/64 or SYSENTER?

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  1:09                                                                 ` Indan Zupancic
@ 2012-02-10  1:15                                                                   ` H. Peter Anvin
  2012-02-10  2:29                                                                     ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-10  1:15 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On 02/09/2012 05:09 PM, Indan Zupancic wrote:
> On Thu, February 9, 2012 17:00, H.J. Lu wrote:
>> GDB uses CS value to tell ia32 process from x86-64 process.
> 
> Are there any cases when this doesn't work? Someone said Xen can
> have different CS values, but looking at the source it seems it's
> using the same ones, at least with a Linux hypervisor. So perhaps
> it was KVM. Looking at the header it seems paravirtualisation uses
> different cs values. On the upside, it seems we can just use that
> user_64bit_mode() to know whether it is 32 or 64 bit mode, so
> adding a bit telling the process mode is easier than I thought.
> 
> Currently there is a need to tell if the 32 or 64-bit syscall
> path is being taken, which is independent of the process mode.
> 

There are definitely cases where the current reliance on magic CS values
doesn't work; never mind the fact that it's just broken.

>> At minimum, we need a bit in CS for GDB.  But any changes
>> will break old GDB.
> 
> Would adding bits to the upper 32-bit of rflags break GDB?

It doesn't work for i386, never mind that this is reserved hardware
state and we don't have an OK at this time to redeclare them available.

> Do you also need a way to know whether the kernel was entered via
> int 0x80, SYSCALL32/64 or SYSENTER?

gdb, probably not.  That came from another user (pin, I think, but I'm
not sure.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-27  7:23                                                                 ` Indan Zupancic
@ 2012-02-10  2:02                                                                   ` Jamie Lokier
  2012-02-10  3:37                                                                     ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: Jamie Lokier @ 2012-02-10  2:02 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Indan Zupancic wrote:
> On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
> > Indan Zupancic wrote:
> >> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> >> > Indan Zupancic wrote:
> >> The jailer I wrote works pretty well as a simplistic strace replacement.
> >> It can only print out the arguments we're checking, but that's usually
> >> the more interesting info.
> >
> > In theory such a thing should be easy to write, but as we both found,
> > ptrace() on Linux has a huge number of difficult quirks to deal with
> > to trace reliably.  At least it's getting better with later kernels.
> 
> It's not that bad, there are a few quirks, but not that many.
> The ptrace specific code is less than 500 lines of code, with
> a couple of hundred lines of header files. Linux ptrace specific
> stuff creeps in elsewhere too though, like that execve mess.

I count 720 lines *just* to read the syscall number and arguments in
strace-git, for the Linux archs it supports.

That's only the Linux code, I excluded non-Linux, and it's only a
little bit of syscall.c, I didn't include generic ptracing,
fork-following, threaded-exec-fixups, signal handling etc. nor other
arch-specific functions and ABI fixups.  And it doesn't even have all
archs currently in Linux mainline.

> >> It's not a 32 versus 64-bit issue though, so it will be something on
> >> its own anyway. Can as well add an extra ARM specific ptrace command
> >> to get that info, or hack it in some other way. For instance, ip is
> >> (ab)used to tell if it is syscall entry or exit, so doing these tricks
> >> isn't anything new in ARM either.
> >
> > In theory, aren't we supposed to know whether it's entry/exit anyway?
> > Why does strace care?  Have there been kernel bugs in the past?  Maybe
> > it was just to deal with SIGTRAP-after-exit in the past, which could
> > be delivered at an unpredictable time if blocked and then unblocked by
> > sigreturn().
> 
> Maybe. I don't know why ARM does that ip thing.
> 
> Although in theory you know the entry/exits if you keep track, but one
> mistake or unexpected behaviour (like execve for my code) and you can get
> it wrong. So for robustness sake it's good if it can be double checked.

I agree, and I think the PTRACE_EVENT_SYSCALL_ENTRY/EXIT events would
be a clean way to represent that.

I wonder if all archs report syscall-exit as the first event in traced
fork children.  Looking at arch/hexagon I'm guessing it doesn't, but
it's hard to be sure and no practical way to test it :-/

That wouldn't matter if the events were robust.

I read somewhere about a bug report where syscall-exit was seen after
attach, but I don't remember where now.

> I don't know anything about OABI, can you link an OABI program against
> an EABI library? If you can then libc can be EABI and the kernel doesn't
> need OABI support.

That's not the point.  If you're writing a ptrace jailer (as you are)
a program can deliberately use OABI calls to subvert the tracer, even
if it's using EABI for normal calls.
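
To make that concrete, a sketch of the check a jailer would need,
assuming ARM-mode code where (as far as I know) an EABI call is
"swi 0" with the number in r7, and an OABI call encodes 0x900000 + nr
in the swi immediate. The names are hypothetical and Thumb mode is
ignored:

```c
#include <stdint.h>

/* Hypothetical names for illustration. */
enum arm_abi { ARM_ABI_EABI, ARM_ABI_OABI, ARM_ABI_INVALID };

/*
 * insn: the 32-bit ARM-mode instruction at pc - 4 when the tracee stops.
 * r7:   the tracee's r7 register.
 * On success *nr holds the syscall number the kernel is assumed to use.
 */
static enum arm_abi arm_decode_syscall(uint32_t insn, uint32_t r7,
				       uint32_t *nr)
{
	uint32_t imm = insn & 0x00ffffff;	/* 24-bit swi immediate */

	if ((insn & 0x0f000000) != 0x0f000000)
		return ARM_ABI_INVALID;		/* not a swi instruction */
	if (imm == 0) {
		*nr = r7;			/* EABI: number is in r7 */
		return ARM_ABI_EABI;
	}
	if ((imm & 0xf00000) != 0x900000)
		return ARM_ABI_INVALID;		/* unexpected immediate */
	*nr = imm & 0x0fffff;			/* OABI: number is in the insn */
	return ARM_ABI_OABI;
}
```

A tracer that trusts r7 alone would be fooled by an OABI call that
leaves a harmless-looking number in r7.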

For linking, you are mostly right.  Ideally everything would be open
and recompilable anyway, but that's sadly not always possible.  OABI
and EABI have different struct layouts among other changes, and EABI
being newer tends to accompany other libc changes; embedded libcs
aren't always as drop-in backward-compatible as glibc.

> >> And then there's the whole confusion what that flag says, some might think
> >> it says in what mode the tracee is instead of what mode the system call is.
> >> That those two can be different is not obvious at all and seems very x86_64
> >> specific.
> >
> > My rough read of PARISC entry code suggests it has two entry methods,
> > similar to ARM and x86_64, but I'm not really familiar with PARISC and
> > I don't have a machine handy to try it out :-)
> 
> It has a unified syscall table, so does it really matter?

I don't know if the 32/64 matters.  For security or accurate tracing,
I wouldn't like to assume without checking if there are 64-on-32
argument alignment fixups.

PARISC has a second set of HPUX-compatible system call numbers,
handled in arch/parisc/hpux/*.  I don't know if those are available to
all programs and can be used to subvert a ptracer.  Looking at
hpux/gate.S I think they bypass ptrace entirely; maybe they can subvert it.

> > I have a script in progress which extracts all the
> > per-arch and per-ABI syscall numbers, syscall argument layouts and
> > kernel function names to keep track of arch-specific fixups, from a
> > Linux source tree.  It currently works on all archs except it breaks
> > on x86 which insists on being different ;-)
> 
> That's handy, but I thought strace had such a script already?
> See HACKING-scripts in strace source. Or is yours much better?

The strace script only gets the syscall numbers (so doesn't help
cross-check I've applied all arch-specific syscall fixups), doesn't
work for all arch/ABI combinations without editing unistd.h, and
requires a configured and partly built kernel for some archs.  It's
only really useful for getting new syscall numbers which you then
hand-edit into the real table.  You still have to set the number of
arguments and check carefully you haven't missed any arch-specific
fixups.

All the best,
-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  1:15                                                                   ` H. Peter Anvin
@ 2012-02-10  2:29                                                                     ` Indan Zupancic
  2012-02-10  2:47                                                                       ` H. Peter Anvin
       [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
  0 siblings, 2 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-02-10  2:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Fri, February 10, 2012 02:15, H. Peter Anvin wrote:
> On 02/09/2012 05:09 PM, Indan Zupancic wrote:
>> On Thu, February 9, 2012 17:00, H.J. Lu wrote:
>>> GDB uses CS value to tell ia32 process from x86-64 process.
>>
>> Are there any cases when this doesn't work? Someone said Xen can
>> have different CS values, but looking at the source it seems it's
>> using the same ones, at least with a Linux hypervisor. So perhaps
>> it was KVM. Looking at the header it seems paravirtualisation uses
>> different cs values. On the upside, it seems we can just use that
>> user_64bit_mode() to know whether it is 32 or 64 bit mode, so
>> adding a bit telling the process mode is easier than I thought.
>>
>> Currently there is a need to tell if the 32 or 64-bit syscall
>> path is being taken, which is independent of the process mode.
>>
>
> There are definitely cases where the current reliance on magic CS values
> doesn't work; never mind the fact that it's just broken.

It's only broken because it doesn't work sometimes. ;-)
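
A minimal sketch of the CS-based check under discussion, for reference. The
selector values 0x33 and 0x23 are the ones a native x86-64 Linux kernel uses
for 64-bit and 32-bit user code; the enum and function names are made up for
illustration. Anything else (for example a paravirtualised kernel with other
selectors) is reported as unknown, which is exactly the case where the
heuristic stops working.

```c
#include <assert.h>

/* User code segment selectors on a native x86-64 Linux kernel. */
#define NATIVE_USER_CS   0x33UL	/* 64-bit user code (__USER_CS) */
#define NATIVE_USER32_CS 0x23UL	/* 32-bit compat user code (__USER32_CS) */

/* Hypothetical names, not taken from GDB or strace. */
enum task_mode { MODE_UNKNOWN, MODE_32BIT, MODE_64BIT };

/* Guess the tracee's mode from the CS value read via ptrace. */
enum task_mode guess_mode_from_cs(unsigned long cs)
{
	/* A user-mode code selector always has requested privilege level 3. */
	assert((cs & 3) == 3);

	switch (cs) {
	case NATIVE_USER_CS:
		return MODE_64BIT;
	case NATIVE_USER32_CS:
		return MODE_32BIT;
	default:
		/* Unrecognised selector: the heuristic cannot decide. */
		return MODE_UNKNOWN;
	}
}
```

Note that this only says something about the mode the task runs in, not
about which syscall path was taken.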

>>> At minimum, we need a bit in CS for GDB.  But any changes
>>> will break old GDB.
>>
>> Would adding bits to the upper 32-bit of rflags break GDB?
>
> It doesn't work for i386, never mind that this is reserved hardware
> state and we don't have an OK at this time to redeclare them available.

It doesn't need to work for i386 because it's close to practically
impossible to ptrace a 64-bit task with a 32-bit ptracer.

An alternative would be to use some of the bits in the lower half.

E.g. bits 1, 3, 5 and 15 are reserved and very unlikely to be ever
used for anything, because they can use plenty of bits at the top.
Problem would be that we can't be sure that they are always zero.
If they are, they're safe to use.

The VIF and VIP flags can also be stolen as they're always zero
outside of vm86 mode (which can't be ptraced AFAIK). So we could
set VIF or VIP to tell if we stole bits 1, 3, 5 and/or 15. That
would give us 6 bits in total, and the only confusing thing might
be VIF or VIP set for user space. But anyone counting on those
being zero seems unlikely, and even more unlikely for the reserved
bits, as they are intermixed with unpredictable bits. We could use
VM too, but that might be too confusing, while VIF or VIP without
VM set make no sense.

Perhaps using VIF or VIP to tell whether the other bits are valid
is a good idea anyway, as it can never clash because they are well
defined already and always zero for non-VM mode.

With the current rate of adding flags it will take forever before
any of this might break. And if that happens, we just move to other
bits and user space needs to check those first. Or if the flags
aren't useful for userspace, hide them and keep using it for the
kernel.

>> Do you also need a way to know whether the kernel was entered via
>> int 0x80, SYSCALL32/64 or SYSENTER?
>
> gdb, probably not.  That came from another user (pin, I think, but I'm
> not sure.)

Could you find out? Because I have a hard time thinking of any good
reason why anyone would want to know this specifically.

If this info is added it can replace the bit saying if it's 32 or 64-bit
syscall path. So one bit for enabling all this, 2 bits for the syscall
entry instruction (with SYSCALL64 being 0 as an easy check for the 64-bit
path) and one bit for user space mode. This would end up being 4 bits in
total, except if I forgot anything.

Only downside of adding the entry instruction info would be that more
work in the entry-specific code is needed. The code wouldn't be contained
to a small ptrace specific bit anymore.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  2:29                                                                     ` Indan Zupancic
@ 2012-02-10  2:47                                                                       ` H. Peter Anvin
       [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
  1 sibling, 0 replies; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-10  2:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On 02/09/2012 06:29 PM, Indan Zupancic wrote:
> 
> It's only broken because it doesn't work sometimes. ;-)
> 

I really hope you realize how idiotic that sounds.

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 18:44                                                     ` Oleg Nesterov
@ 2012-02-10  2:51                                                       ` Jamie Lokier
  0 siblings, 0 replies; 235+ messages in thread
From: Jamie Lokier @ 2012-02-10  2:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Oleg Nesterov wrote:
> On 01/26, Denys Vlasenko wrote:
> >
> > On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> > >
> > > We can add the new events,
> > >
> > > 	PTRACE_EVENT_SYSCALL_ENTRY
> > > 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> > > 	PTRACE_EVENT_SYSCALL_EXIT
> > > 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> >
> > We can get away with just the first one.
> > (1) It's unlikely people would want to get native sysentry events but not compat ones,
> > thus first two options can be combined into one;
> 
> Confused... Sure, we need the single option, or we could even report
> this unconditionally if PT_SEIZED.
> 
> I meant the different PTRACE_EVENT_* codes only.
> 
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > exit will already be distinguishable.
> 
> Well, if we add _ENTRY then it looks more consistent to report _EXIT
> as well even if it is not that useful.
> 
> Doesn't matter. Nobody seems to like this, and afaics Linus has the
> good arguments against the arch-independent "consolidation".

Regarding distinction between ENTRY/EXIT:

  I agree only a buggy kernel should get out of sync, but are we sure
  the kernel is never buggy, and wouldn't this be nice protection, and
  an excuse for strace to drop the heuristics it currently does for
  this condition?

  The behaviour from fork() appears to have changed.  (This is from
  reading kernel code, I'm too lazy to try out old kernels.)  If I
  read correctly, before 2.5.35, Linux returned an EXIT event first to
  a child process if CLONE_PTRACE was used, and then it didn't, and
  then from 2.5.46 the tracer's use of PTRACE_EVENT_* determines if it
  does or not.

  So it's not surprising strace has heuristics... shame they're buggy
  (sigreturn can look like anything).

  Anyway, PTRACE_EVENT_* for syscall entry/exit just look prettier!

Regarding ABI indication:

  At least with new syscalls, a tracer that doesn't know about them
  will see they're unrecognised; whereas a different ABI sometimes
  looks like an innocent syscall so can trick the tracer.

  However the argument for putting this in register state that goes
  into core dumps and checkpoint/restart state instead is pretty good.

  I don't have a strong opinion.  It's unfortunate that the current
  method not only makes it easy to subvert a ptracer, it makes
  ptracing slow and racy on archs where it has to read the syscall
  instruction.  (Weirdly that includes ARM, despite ARM using a
  register these days and having a ptrace option to set, but not read,
  the syscall number).

  That really is an argument for making sure all archs have the
  syscall number and, if necessary, the type of syscall entry point,
  somewhere in the register set.

All the best,
-- Jamie

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  2:02                                                                   ` Jamie Lokier
@ 2012-02-10  3:37                                                                     ` Indan Zupancic
  2012-02-10 21:19                                                                       ` Denys Vlasenko
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-10  3:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Fri, February 10, 2012 03:02, Jamie Lokier wrote:
> Indan Zupancic wrote:
>> On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
>> > Indan Zupancic wrote:
>> >> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
>> >> > Indan Zupancic wrote:
>> >> The jailer I wrote works pretty well as a simplistic strace replacement.
>> >> It can only print out the arguments we're checking, but that's usually
>> >> the more interesting info.
>> >
>> > In theory such a thing should be easy to write, but as we both found,
>> > ptrace() on Linux has a huge number of difficult quirks to deal with
>> > to trace reliably.  At least it's getting better with later kernels.
>>
>> It's not that bad, there are a few quirks, but not that many.
>> The ptrace specific code is less than 500 lines of code, with
>> a couple of hundred lines of header files. Linux ptrace specific
>> stuff creeps in elsewhere too though, like that execve mess.
>
> I count 720 lines *just* to read the syscall number and arguments in
> strace-git, for the Linux archs it supports.
>
> That's only the Linux code, I excluded non-Linux, and it's only a
> little bit of syscall.c, I didn't include generic ptracing,
> fork-following, threaded-exec-fixups, signal handling etc. nor other
> arch-specific functions and ABI fixups.  And it doesn't even have all
> archs currently in Linux mainline.

Well, I was talking about my own code, not strace. Counting strace lines
of code is tricky because of all the ifdefs.

I have to add threaded-exec-fixups, though that's not ptrace specific,
but Linux specific. Although I only support x86 at the moment, I try
to keep the per-arch code to a minimum. Currently it's 20 lines of x86
header file and 50 for x86_64 for the ptrace code. The real work is the
syscall info table, which is both system call and arch specific.

My code is written with cross-platform support in mind, I try to keep
the number of (Linux, ptrace or arch specific) assumptions as low as
possible. But if I added support for e.g. BSD then I would keep its
ptrace code totally separate from the Linux one.

>> >> It's not a 32 versus 64-bit issue though, so it will be something on
>> >> its own anyway. Can as well add an extra ARM specific ptrace command
>> >> to get that info, or hack it in some other way. For instance, ip is
>> >> (ab)used to tell if it is syscall entry or exit, so doing these tricks
>> >> isn't anything new in ARM either.
>> >
>> > In theory, aren't we supposed to know whether it's entry/exit anyway?
>> > Why does strace care?  Have there been kernel bugs in the past?  Maybe
>> > it was just to deal with SIGTRAP-after-exit in the past, which could
>> > be delivered at an unpredictable time if blocked and then unblocked by
>> > sigreturn().
>>
>> Maybe. I don't know why ARM does that ip thing.
>>
>> Although in theory you know the entry/exits if you keep track, but one
>> mistake or unexpected behaviour (like execve for my code) and you can get
>> it wrong. So for robustness sake it's good if it can be double checked.
>
> I agree, and I think the PTRACE_EVENT_SYSCALL_ENTRY/EXIT events would
> be a clean way to represent that.

Yes, that would be perfect.

> I wonder if all archs report syscall-exit as the first event in traced
> fork children.  Looking at arch/hexagon I'm guessing it doesn't, but
> it's hard to be sure and no practical way to test it :-/

I would expect none of them to return syscall-exit for the child process.
It was the parent that called it, the child never did!

> That wouldn't matter if the events were robust.

Yes. It's a lot better to not worry about all these kinds of details which
may or may not change between archs and kernel versions.

> I read somewhere about a bug report where syscall-exit was seen after
> attach, but I don't remember where now.

Well, if you attach at a random moment you can get a syscall-exit first,
I guess. I suppose you have to wait till you get the SIGSTOP notification
before you can be sure that the next syscall event will be an entry one.

>> I don't know anything about OABI, can you link an OABI program against
>> an EABI library? If you can then libc can be EABI and the kernel doesn't
>> need OABI support.
>
> That's not the point.  If you're writing a ptrace jailer (as you are)
> a program can deliberately use OABI calls to subvert the tracer, even
> if it's using EABI for normal calls.

I know, but I can say that kernels supporting OABI aren't supported
because they are unsafe. Just like a 32-bit only jailer running on
x86_64 is unsafe. Best would be if I checked it at startup too.

Right now I have to add very paranoid code to support compat32 on
x86_64 anyway.

> For linking, you are mostly right.  Ideally everything would be open
> and recompilable anyway, but that's sadly not always possible.  OABI
> and EABI have different struct layouts among other changes, and EABI
> being newer tends to accompany other libc changes; embedded libcs
> aren't always as drop-in backward-compatible as glibc.

Russell King told me about PTRACE_SET_SYSCALL on ARM, that would solve
the reading memory problem, as we can always set the expected syscall
number to make sure it wasn't changed behind our back. The system call
numbers are the same for EABI and OABI, so it's not as bad as int 0x80
from 64-bit.

The alignment changes hopefully don't make a difference for my jailer.
If they do then I have to add specific code to handle it, which I don't
like doing. But looking at sys_oabi-compat.c it doesn't seem too bad.

>> >> And then there's the whole confusion what that flag says, some might think
>> >> it says in what mode the tracee is instead of what mode the system call is.
>> >> That those two can be different is not obvious at all and seems very x86_64
>> >> specific.
>> >
>> > My rough read of PARISC entry code suggests it has two entry methods,
>> > similar to ARM and x86_64, but I'm not really familiar with PARISC and
>> > I don't have a machine handy to try it out :-)
>>
>> It has a unified syscall table, so does it really matter?
>
> I don't know if the 32/64 matters.  For security or accurate tracing,
> I wouldn't like to assume without checking if there are 64-on-32
> argument alignment fixups.

I thought it was just ARM passing a 64-bit arg in two 32-bit regs.
But yes, it's something that needs to be checked. That's most of
the work of adding a new arch, checking all system calls.

> PARISC has a second set of HPUX-compatible system call numbers,
> handled in arch/parisc/hpux/*.  I don't know if those are available to
> all programs and can be used to subvert a ptracer.  Looking at
> hpux/gate.S I think they bypass ptrace entirely; maybe they can subvert it.

That's only set when CONFIG_HPUX is set. If they bypass ptrace entirely
then such kernels can't be supported anyway, except if they have some
other mechanism for syscall interception. But the obscurer the setup,
the less worried I am about supporting it.

>> > I have a script in progress which extracts all the
>> > per-arch and per-ABI syscall numbers, syscall argument layouts and
>> > kernel function names to keep track of arch-specific fixups, from a
>> > Linux source tree.  It currently works on all archs except it breaks
>> > on x86 which insists on being different ;-)
>>
>> That's handy, but I thought strace had such a script already?
>> See HACKING-scripts in strace source. Or is yours much better?
>
> The strace script only gets the syscall numbers (so doesn't help
> cross-check I've applied all arch-specific syscall fixups), doesn't
> work for all arch/ABI combinations without editing unistd.h, and
> requires a configured and partly built kernel for some archs.  It's
> only really useful for getting new syscall numbers which you then
> hand-edit into the real table.  You still have to set the number of
> arguments and check carefully you haven't missed any arch-specific
> fixups.

Your script sounds quite useful then. I might ask for it when I'm
adding support for more archs.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
       [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
@ 2012-02-10 15:53                                                                         ` H. Peter Anvin
  2012-02-10 22:42                                                                           ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-10 15:53 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On 02/09/2012 11:42 PM, Indan Zupancic wrote:
> Patch implementing this below. It uses bit 3 for task mode and bit 5
> for syscall mode. Those bits are only valid if VIF is set. It increases
> the kernel size by around 50 bytes, 6 for a 32-bit kernel.
>
> Any objections?

#include <stdnak.h>

	-hpa


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  3:37                                                                     ` Indan Zupancic
@ 2012-02-10 21:19                                                                       ` Denys Vlasenko
  0 siblings, 0 replies; 235+ messages in thread
From: Denys Vlasenko @ 2012-02-10 21:19 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jamie Lokier, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Friday 10 February 2012 04:37, Indan Zupancic wrote:
> > I read somewhere about a bug report where syscall-exit was seen after
> > attach, but I don't remember where now.
> 
> Well, if you attach at a random moment you can get a syscall-exit first,
> I guess. I suppose you have to wait till you get the SIGSTOP notification
> before you can be sure that the next syscall event will be an entry one.

No. After PTRACE_ATTACH, next reported waitpid result will be either
a ptrace-stop of signal-delivery-stop variety,
or death (WIFEXITED/WIFSIGNALED). Syscall exit notification
is not possible (modulo kernel bugs). For one, syscall entry/exit
notifications must be explicitly requested by PTRACE_SYSCALL, which
wasn't yet done!
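
A minimal sketch of how a tracer might sort waitpid() results along these
lines. It assumes the tracer has set PTRACE_O_TRACESYSGOOD, so that
syscall-stops are reported with SIGTRAP | 0x80 as the stop signal; the enum
and function names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of a status returned by waitpid(). */
enum stop_kind {
	TRACEE_EXITED,	/* WIFEXITED */
	TRACEE_KILLED,	/* WIFSIGNALED */
	EVENT_STOP,	/* PTRACE_EVENT_* stop: event code in bits 16 and up */
	SYSCALL_STOP,	/* only seen after PTRACE_SYSCALL was requested */
	SIGNAL_STOP,	/* signal-delivery-stop, e.g. SIGSTOP after attach */
	OTHER_STATUS
};

enum stop_kind classify_status(int status)
{
	if (WIFEXITED(status))
		return TRACEE_EXITED;
	if (WIFSIGNALED(status))
		return TRACEE_KILLED;
	if (WIFSTOPPED(status)) {
		if (status >> 16)
			return EVENT_STOP;
		/* Relies on PTRACE_O_TRACESYSGOOD having been set. */
		if (WSTOPSIG(status) == (SIGTRAP | 0x80))
			return SYSCALL_STOP;
		return SIGNAL_STOP;
	}
	return OTHER_STATUS;
}
```

With this split, the first status after PTRACE_ATTACH should come out as
SIGNAL_STOP (or one of the two death cases), never as SYSCALL_STOP.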

-- 
vda

^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10 15:53                                                                         ` H. Peter Anvin
@ 2012-02-10 22:42                                                                           ` Indan Zupancic
  2012-02-10 22:56                                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 235+ messages in thread
From: Indan Zupancic @ 2012-02-10 22:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Fri, February 10, 2012 16:53, H. Peter Anvin wrote:
> On 02/09/2012 11:42 PM, Indan Zupancic wrote:
>> Patch implementing this below. It uses bit 3 for task mode and bit 5
>> for syscall mode. Those bits are only valid if VIF is set. It increases
>> the kernel size by around 50 bytes, 6 for a 32-bit kernel.
>>
>> Any objections?
>
> #include <stdnak.h>

Could you please elaborate? Is it just the stealing of eflags bits that
irks you or are there technical problems too?

I understand some people would prefer a new regset, but that would force
everyone to use PTRACE_GETREGSET instead of whatever they are using now.
The problem with that is that not all archs support PTRACE_GETREGSET, so
the user space ptrace code needs to use different ptrace calls depending
on the architecture for no good reason. If PEEK_USER works then it's less
of a problem, then it's one extra ptrace call compared to the eflag way
if PTRACE_GETREGS is used. If this new info is exposed with a special
regset instead of being appended to normal regs then one extra ptrace
call per system call event needs to be done. You can as well add special
x86 ptrace requests then.
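
To make the cost concrete, a minimal sketch of fetching one regset with
PTRACE_GETREGSET. The wrapper name is made up; the request, NT_PRSTATUS and
struct iovec are the existing interface. Each regset needs its own call
like this, which is where the extra ptrace call per syscall event would
come from.

```c
#include <assert.h>
#include <elf.h>		/* NT_PRSTATUS */
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>		/* struct iovec */

/*
 * Read the general-purpose register set of a stopped tracee into buf.
 * Returns the number of bytes the kernel filled in, or -1 on error
 * (errno is set by ptrace). Hypothetical helper, not from any project.
 */
long fetch_gp_regset(pid_t pid, void *buf, size_t buflen)
{
	struct iovec iov = { .iov_base = buf, .iov_len = buflen };

	assert(buf != NULL);

	if (ptrace(PTRACE_GETREGSET, pid,
		   (void *)(uintptr_t)NT_PRSTATUS, &iov) == -1)
		return -1;

	/* The kernel updates iov_len to the size it actually wrote. */
	return (long)iov.iov_len;
}
```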

Or is the main advantage of using a regset that it shows up in coredumps?
That would merit the extra effort at least.

Stealing eflags bits may be ugly like hell, but it's very simple for both
the kernel and user space to implement.

I think everyone agrees that this kind of info needs to be exposed somehow.
In the end I don't care how it is done, as long as the info is easily
available.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10 22:42                                                                           ` Indan Zupancic
@ 2012-02-10 22:56                                                                             ` H. Peter Anvin
  2012-02-12 12:07                                                                               ` Indan Zupancic
  0 siblings, 1 reply; 235+ messages in thread
From: H. Peter Anvin @ 2012-02-10 22:56 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On 02/10/2012 02:42 PM, Indan Zupancic wrote:
>> #include <stdnak.h>
> 
> Could you please elaborate? Is it just the stealing of eflags bits that
> irks you or are there technical problems too?

Yes, I will not accept that unless it gets ok'd by the architecture
people, which may take a long time.

> I understand some people would prefer a new regset, but that would force
> everyone to use PTRACE_GETREGSET instead of whatever they are using now.
> The problem with that is that not all archs support PTRACE_GETREGSET, so
> the user space ptrace code needs to use different ptrace calls depending
> on the architecture for no good reason. If PEEK_USER works then it's less
> of a problem, then it's one extra ptrace call compared to the eflag way
> if PTRACE_GETREGS is used. If this new info is exposed with a special
> regset instead of being appended to normal regs then one extra ptrace
> call per system call event needs to be done. You can as well add special
> x86 ptrace requests then.

Seriously... if you're mucking with registers on this level, an
architecture dependency is not a big deal, and perhaps it's a good sign
that the laggard architectures need to catch up.  If multiple ptrace
requests is a problem, then perhaps this is a good sign that we need a
single way to get multiple regsets in a single request?

> Or is the main advantage of using a regset that it shows up in coredumps?
> That would merit the extra effort at least.

That is another plus, which is significant, too.  The final advantage is
expandability.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10 22:56                                                                             ` H. Peter Anvin
@ 2012-02-12 12:07                                                                               ` Indan Zupancic
  0 siblings, 0 replies; 235+ messages in thread
From: Indan Zupancic @ 2012-02-12 12:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Fri, February 10, 2012 23:56, H. Peter Anvin wrote:
> On 02/10/2012 02:42 PM, Indan Zupancic wrote:
>>> #include <stdnak.h>
>>
>> Could you please elaborate? Is it just the stealing of eflags bits that
>> irks you or are there technical problems too?
>
> Yes, I will not accept that unless it gets ok'd by the architecture
> people, which may take a long time.

The kernel x86 people or the Intel CPU people?

With the latest patch it doesn't matter what bits Intel decides to use in
the future, any clashes can always be handled unambiguously.

>> I understand some people would prefer a new regset, but that would force
>> everyone to use PTRACE_GETREGSET instead of whatever they are using now.
>> The problem with that is that not all archs support PTRACE_GETREGSET, so
>> the user space ptrace code needs to use different ptrace calls depending
>> on the architecture for no good reason. If PEEK_USER works then it's less
>> of a problem, then it's one extra ptrace call compared to the eflag way
>> if PTRACE_GETREGS is used. If this new info is exposed with a special
>> regset instead of being appended to normal regs then one extra ptrace
>> call per system call event needs to be done. You can as well add special
>> x86 ptrace requests then.
>
> Seriously... if you're mucking with registers on this level, an
> architecture dependency is not a big deal, and perhaps it's a good sign
> that the laggard architectures need to catch up.  If multiple ptrace
> requests is a problem, then perhaps this is a good sign that we need a
> single way to get multiple regsets in a single request?

Well, if we're forcing people to use a different API then we can as well
overhaul the whole ptrace thing to have a decent interface instead of all
this mucking around with waitpid() and whatnot.

That is the main advantage of the stealing eflags bits approach, it's mostly
API independent. That it puts the info close to the data where it is used is
a bonus.

>> Or is the main advantage of using a regset that it shows up in coredumps?
>> That would merit the extra effort at least.
>
> That is another plus, which is significant, too.  The final advantage is
> expandability.

I just realized that if coredumping uses ptrace's code the eflags will
show up too. As for expandability, there are a few more bits left...
But more seriously, what other highly x86 specific flags are needed?
Other than maybe the syscall entry instruction, which I'm not convinced
of, I can't think of anything.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 235+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 18:03                                                           ` Denys Vlasenko
@ 2017-03-08 23:41                                                             ` Dmitry V. Levin
  2017-03-09  4:39                                                               ` Andrew Lutomirski
  0 siblings, 1 reply; 235+ messages in thread
From: Dmitry V. Levin @ 2017-03-08 23:41 UTC (permalink / raw)
  To: Denys Vlasenko, Linus Torvalds
  Cc: Indan Zupancic, Oleg Nesterov, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 1731 bytes --]

Hi,

On Thu, Jan 26, 2012 at 07:03:43PM +0100, Denys Vlasenko wrote:
> Hi Linus,
> 
> On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >> Please look at strace source, get_scno() function, where
> >> it reads syscall no and parameters. Let's see....
> >> - POWERPC: has 32-bit and 64-bit mode
> >> - X86_64: has 32-bit and 64-bit mode
> >> - IA64: has i386-compat mode
> >> - ARM: has more than one ABI
> >> - SPARC: has 32-bit and 64-bit mode
> >>
> >> Do you want to re-invent a different arch-specific way to report
> >> syscall type for each of these arches?
> >
> > I think an arch-specific one is better than trying to make some
> > generic one that is messy.
> >
> > As you say, many architectures have multiple system call ABIs.
> >
> > But they tend to be very *different* issues. They can be about
> > multiple ABI's, as you mention, and even when they *look* similar
> > (32-bit vs 64-bit ABI's) they are actually totally different issues.
> > [skip]
> 
> I don't have a particular attachment to my solution,
> and I think we have already talked about this problem for
> far too long.
> 
> Looks like nobody is _strongly_ opposed to your patch
> which uses a few bits in eflags to report bitness
> of the x86 syscall.
> 
> Let's just do that already. If you commit it to kernel git,
> I will immediately change strace accordingly.

Is there any progress with this (or any alternative) solution?

I see the kernel side has changed a bit, and the strace part
is in a better shape than 5 years ago (although I'm biased of course),
but I don't see any kernel interface that would allow strace to reliably
recognize this 0x80 case.


-- 
ldv

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2017-03-08 23:41                                                             ` Dmitry V. Levin
@ 2017-03-09  4:39                                                               ` Andrew Lutomirski
  2017-03-14  2:57                                                                 ` Dmitry V. Levin
  0 siblings, 1 reply; 235+ messages in thread
From: Andrew Lutomirski @ 2017-03-09  4:39 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Oleg Nesterov,
	Andi Kleen, Jamie Lokier, Will Drewry, linux-kernel, Kees Cook,
	John Johansen, Serge Hallyn, coreyb, pmoore, Eric Paris, djm,
	segoon, Steven Rostedt, James Morris, Chris Evans, Avi Kivity,
	penberg, Al Viro, Ingo Molnar, Andrew Morton, khilman,
	borislav.petkov, amwang, Andi Kleen, Eric Dumazet, gregkh,
	dhowells, daniel.lezcano, Linux FS Devel, linux-security-module,
	olofj, Michael Halcrow, dlaor, Roland McGrath

On Wed, Mar 8, 2017 at 3:41 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> Hi,
>
> On Thu, Jan 26, 2012 at 07:03:43PM +0100, Denys Vlasenko wrote:
>> Hi Linus,
>>
>> On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >> Please look at strace source, get_scno() function, where
>> >> it reads syscall no and parameters. Let's see....
>> >> - POWERPC: has 32-bit and 64-bit mode
>> >> - X86_64: has 32-bit and 64-bit mode
>> >> - IA64: has i386-compat mode
>> >> - ARM: has more than one ABI
>> >> - SPARC: has 32-bit and 64-bit mode
>> >>
>> >> Do you want to re-invent a different arch-specific way to report
>> >> syscall type for each of these arches?
>> >
>> > I think an arch-specific one is better than trying to make some
>> > generic one that is messy.
>> >
>> > As you say, many architectures have multiple system call ABIs.
>> >
>> > But they tend to be very *different* issues. They can be about
>> > multiple ABI's, as you mention, and even when they *look* similar
>> > (32-bit vs 64-bit ABI's) they are actually totally different issues.
>> > [skip]
>>
>> I don't have a particular attachment to my solution,
>> and I think we have already talked about this problem for
>> far too long.
>>
>> Looks like nobody is _strongly_ opposed to your patch
>> which uses a few bits in eflags to report bitness
>> of the x86 syscall.
>>
>> Let's just do that already. If you commit it to kernel git,
>> I will immediately change strace accordingly.
>
> Is there any progress with this (or any alternative) solution?
>
> I see the kernel side has changed a bit, and the strace part
> is in a better shape than 5 years ago (although I'm biased of course),
> but I don't see any kernel interface that would allow strace to reliably
> recognize this 0x80 case.

I am strongly opposed to fudging registers to half-arsedly slightly
improve the epicly crappy ptrace(2) interface for syscalls.

To fix this right, please just add PTRACE_GET_SYSCALL_INFO or similar
to, in one shot, read out all the syscall details.  This means: arch,
syscall number, arg0..arg5, and *whether it's entry or exit*.  I propose
returning this structure:

struct ptrace_syscall_info {
  u8 op;  /* 0 for entry, 1 for exit */
  u8 pad0;
  u16 pad1;
  u32 pad2;
  union {
    struct seccomp_data syscall_entry;
    s64 syscall_exit_retval;
  };
};

because struct seccomp_data already gets this right.  There's plenty
of opportunity to fine-tune this.  Now it works on all architectures.

Since struct seccomp_data may be extended in the future, the operation
should be:

ptrace(PTRACE_GET_SYSCALL_INFO, pid,
       (void *)sizeof(struct ptrace_syscall_info), &info);

which returns 0 on success and some error code if, for example, the
current ptrace stop isn't a syscall entry or exit.

--Andy
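
A minimal userspace sketch of how a tracer might consume the structure
proposed above. It is an illustration under stated assumptions, not
kernel code: it never calls ptrace(), because the request described in
this message is only a proposal here. Every sketch_* name is
hypothetical. The field layout of sketch_seccomp_data mirrors struct
seccomp_data from <linux/seccomp.h>, and the two arch constants carry
the values of AUDIT_ARCH_I386 and AUDIT_ARCH_X86_64 from
<linux/audit.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values of AUDIT_ARCH_I386 and AUDIT_ARCH_X86_64 from <linux/audit.h>. */
#define SKETCH_AUDIT_ARCH_I386   0x40000003u
#define SKETCH_AUDIT_ARCH_X86_64 0xC000003Eu

/* Userspace mirror of struct seccomp_data (same field order and widths). */
struct sketch_seccomp_data {
    int32_t  nr;                  /* syscall number */
    uint32_t arch;                /* AUDIT_ARCH_* value for this syscall */
    uint64_t instruction_pointer;
    uint64_t args[6];             /* arg0..arg5 */
};

/* The structure proposed in the message, with fixed-width userspace types. */
struct sketch_syscall_info {
    uint8_t  op;                  /* 0 for entry, 1 for exit */
    uint8_t  pad0;
    uint16_t pad1;
    uint32_t pad2;
    union {
        struct sketch_seccomp_data syscall_entry;
        int64_t syscall_exit_retval;
    };
};

enum { SKETCH_OP_ENTRY = 0, SKETCH_OP_EXIT = 1 };

/*
 * Hypothetical helper: format one syscall stop into buf.
 * Returns what snprintf() returns, or -1 if op is neither entry nor exit.
 */
static int sketch_describe(const struct sketch_syscall_info *info,
                           char *buf, size_t len)
{
    assert(info != NULL && buf != NULL);

    switch (info->op) {
    case SKETCH_OP_ENTRY:
        return snprintf(buf, len, "entry arch=0x%08x nr=%d arg0=0x%llx",
                        (unsigned)info->syscall_entry.arch,
                        (int)info->syscall_entry.nr,
                        (unsigned long long)info->syscall_entry.args[0]);
    case SKETCH_OP_EXIT:
        return snprintf(buf, len, "exit retval=%lld",
                        (long long)info->syscall_exit_retval);
    default:
        return -1;
    }
}

/*
 * Hypothetical helper: true if this stop is a syscall entry made through
 * the i386 ABI. Because arch is reported per syscall, this would also hold
 * for an int 0x80 issued by a 64-bit task.
 */
static int sketch_is_i386_entry(const struct sketch_syscall_info *info)
{
    assert(info != NULL);
    return info->op == SKETCH_OP_ENTRY &&
           info->syscall_entry.arch == SKETCH_AUDIT_ARCH_I386;
}
```

The point of the sketch is that the tracer only looks at op and arch
and never has to infer the ABI from register contents or from the
bitness of the task.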

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2017-03-09  4:39                                                               ` Andrew Lutomirski
@ 2017-03-14  2:57                                                                 ` Dmitry V. Levin
  0 siblings, 0 replies; 235+ messages in thread
From: Dmitry V. Levin @ 2017-03-14  2:57 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Elvira Khabirova, Denys Vlasenko, Linus Torvalds, Indan Zupancic,
	Oleg Nesterov, Andi Kleen, Jamie Lokier, Will Drewry, Kees Cook,
	John Johansen, pmoore, Eric Paris, djm, segoon, Steven Rostedt,
	James Morris, Chris Evans, Avi Kivity, penberg, Al Viro,
	Ingo Molnar, Andrew Morton, Andi Kleen, Eric Dumazet, dhowells,
	daniel.lezcano, Linux FS Devel, linux-security-module, olofj,
	Michael Halcrow, Roland McGrath, linux-kernel

On Wed, Mar 08, 2017 at 08:39:55PM -0800, Andrew Lutomirski wrote:
> On Wed, Mar 8, 2017 at 3:41 PM, Dmitry V. Levin wrote:
[...]
> > Is there any progress with this (or any alternative) solution?
> >
> > I see the kernel side has changed a bit, and the strace part
> > is in a better shape than 5 years ago (although I'm biased of course),
> > but I don't see any kernel interface that would allow strace to reliably
> > recognize this 0x80 case.
> 
> I am strongly opposed to fudging registers to half-arsedly slightly
> improve the epicly crappy ptrace(2) interface for syscalls.
> 
> To fix this right, please just add PTRACE_GET_SYSCALL_INFO or similar
> to, in one shot, read out all the syscall details.  This means: arch,
> no, arg0..arg5, and *whether it's entry or exit*.  I propose returning
> this structure:
> 
> struct ptrace_syscall_info {
>   u8 op;  /* 0 for entry, 1 for exit */
>   u8 pad0;
>   u16 pad1;
>   u32 pad2;
>   union {
>     struct seccomp_data syscall_entry;
>     s64 syscall_exit_retval;
>   };
> };
> 
> because struct seccomp_data already gets this right.  There's plenty
> of opportunity to fine-tune this.  Now it works on all architectures.

Unfortunately, the API is missing.

Unlike syscall_get_nr(), syscall_get_arch() works with the current task
only, so there is no API to get the arch identifier for a given task
that would work on all architectures.


-- 
ldv

end of thread, other threads:[~2017-03-14  2:57 UTC | newest]

Thread overview: 235+ messages
2012-01-11 17:25 [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters) Will Drewry
2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-12  8:53   ` Serge Hallyn
2012-01-12 16:54     ` Will Drewry
2012-01-12 14:50   ` Oleg Nesterov
2012-01-12 16:55     ` Will Drewry
2012-01-12 15:43   ` Steven Rostedt
2012-01-12 16:14     ` Oleg Nesterov
2012-01-12 16:38       ` Steven Rostedt
2012-01-12 16:47         ` Oleg Nesterov
2012-01-12 17:08           ` Will Drewry
2012-01-12 17:30         ` Jamie Lokier
2012-01-12 17:40           ` Steven Rostedt
2012-01-12 17:44             ` Jamie Lokier
2012-01-12 17:56               ` Steven Rostedt
2012-01-12 23:27                 ` Alan Cox
2012-01-12 23:38                   ` Linus Torvalds
2012-01-12 22:18             ` Will Drewry
2012-01-12 23:00               ` Andrew Lutomirski
2012-01-12 16:14     ` Andrew Lutomirski
2012-01-12 16:27       ` Steven Rostedt
2012-01-12 16:51         ` Andrew Lutomirski
2012-01-12 17:09         ` Linus Torvalds
2012-01-12 17:17           ` Steven Rostedt
2012-01-12 18:18           ` Andrew Lutomirski
2012-01-12 18:32             ` Linus Torvalds
2012-01-12 18:44               ` Andrew Lutomirski
2012-01-12 19:08                 ` Kyle Moffett
2012-01-12 23:05                   ` Eric Paris
2012-01-12 23:33                     ` Andrew Lutomirski
2012-01-12 19:40                 ` Will Drewry
2012-01-12 19:42                   ` Will Drewry
2012-01-12 19:46                   ` Andrew Lutomirski
2012-01-12 20:00                     ` Linus Torvalds
2012-01-12 16:59     ` Will Drewry
2012-01-12 17:22       ` Jamie Lokier
2012-01-12 17:35         ` Will Drewry
2012-01-12 17:57           ` Jamie Lokier
2012-01-12 18:03             ` Will Drewry
2012-01-13  1:34               ` Jamie Lokier
2012-01-13  2:44             ` Indan Zupancic
2012-01-13  6:33             ` Chris Evans
2012-01-12 17:36     ` Jamie Lokier
2012-01-12 16:18   ` Alan Cox
2012-01-12 17:03     ` Will Drewry
2012-01-12 17:11       ` Alan Cox
2012-01-12 17:52         ` Will Drewry
2012-01-13  1:31     ` James Morris
2012-01-12 16:22   ` Oleg Nesterov
2012-01-12 17:10     ` Will Drewry
2012-01-12 17:23       ` Oleg Nesterov
2012-01-12 17:51         ` Will Drewry
2012-01-13 17:31           ` Oleg Nesterov
2012-01-13 19:01             ` Will Drewry
2012-01-13 23:10               ` Will Drewry
2012-01-13 23:12                 ` Will Drewry
2012-01-13 23:30                 ` Eric Paris
2012-01-15  3:40                 ` Indan Zupancic
2012-01-16  1:40                   ` Will Drewry
2012-01-16  6:49                     ` Indan Zupancic
2012-01-16 20:12                       ` Will Drewry
2012-01-17  6:46                         ` Indan Zupancic
2012-01-17 17:37                           ` Will Drewry
2012-01-18  4:06                             ` Indan Zupancic
2012-01-18  4:38                               ` Will Drewry
2012-01-17 20:34                           ` Kees Cook
2012-01-17 20:42                             ` Will Drewry
2012-01-17 21:09                               ` Will Drewry
2012-01-18  4:47                               ` Indan Zupancic
2012-01-16 18:37                 ` Oleg Nesterov
2012-01-16 20:15                   ` Will Drewry
2012-01-17 16:45                     ` Oleg Nesterov
2012-01-17 16:56                       ` Will Drewry
2012-01-17 17:01                         ` Andrew Lutomirski
2012-01-17 17:05                           ` Oleg Nesterov
2012-01-17 17:45                             ` Andrew Lutomirski
2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
2012-01-18  1:01                                 ` Andrew Lutomirski
2012-01-19  1:06                                   ` Indan Zupancic
2012-01-19  1:19                                     ` Andrew Lutomirski
2012-01-19  1:47                                       ` Indan Zupancic
2012-01-18  1:07                                 ` Roland McGrath
2012-01-18  1:47                                   ` Indan Zupancic
2012-01-18  1:48                                 ` Jamie Lokier
2012-01-18  1:50                                 ` Andi Kleen
2012-01-18  2:00                                   ` Steven Rostedt
2012-01-18  2:04                                   ` Jamie Lokier
2012-01-18  2:22                                     ` Andi Kleen
2012-01-18  2:25                                       ` Andrew Lutomirski
2012-01-18  4:22                                       ` Indan Zupancic
2012-01-18  5:23                                         ` Linus Torvalds
2012-01-18  6:25                                           ` Linus Torvalds
2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
2012-01-18 19:31                                               ` Linus Torvalds
2012-01-18 19:36                                                 ` Andi Kleen
2012-01-18 19:39                                                   ` Linus Torvalds
2012-01-18 19:44                                                     ` Andi Kleen
2012-01-18 19:47                                                       ` Linus Torvalds
2012-01-18 19:52                                                         ` Will Drewry
2012-01-18 19:58                                                           ` Will Drewry
2012-01-18 19:41                                                   ` Martin Mares
2012-01-18 19:38                                                 ` Andrew Lutomirski
2012-01-19 16:01                                                   ` Jamie Lokier
2012-01-19 16:13                                                     ` Andrew Lutomirski
2012-01-19 19:21                                                     ` Linus Torvalds
2012-01-19 19:30                                                       ` Andrew Lutomirski
2012-01-19 19:37                                                         ` Linus Torvalds
2012-01-19 19:41                                                           ` Linus Torvalds
2012-01-19 23:54                                                       ` Jamie Lokier
2012-01-20  0:05                                                         ` Linus Torvalds
2012-01-20 15:35                                                       ` Will Drewry
2012-01-20 17:56                                                         ` Roland McGrath
2012-01-20 19:45                                                           ` Will Drewry
2012-01-18 20:26                                                 ` Linus Torvalds
2012-01-18 20:55                                                   ` H. Peter Anvin
2012-01-18 21:01                                                     ` Linus Torvalds
2012-01-18 21:04                                                       ` H. Peter Anvin
2012-01-18 21:21                                                         ` H. Peter Anvin
2012-01-18 21:51                                                           ` Roland McGrath
2012-01-18 21:53                                                             ` H. Peter Anvin
2012-01-18 23:28                                                               ` Linus Torvalds
2012-01-19  0:38                                                                 ` H. Peter Anvin
2012-01-20 21:51                                                                   ` Denys Vlasenko
2012-01-20 22:40                                                                     ` Roland McGrath
2012-01-20 22:41                                                                       ` H. Peter Anvin
2012-01-20 23:49                                                                         ` Indan Zupancic
2012-01-20 23:55                                                                           ` Roland McGrath
2012-01-20 23:58                                                                             ` hpanvin@gmail.com
2012-01-23  2:14                                                                             ` Indan Zupancic
2012-01-21  0:07                                                                           ` Denys Vlasenko
2012-01-21  0:10                                                                             ` Roland McGrath
2012-01-21  1:23                                                                               ` Jamie Lokier
2012-01-23  2:37                                                                                 ` Indan Zupancic
2012-01-23 16:48                                                                                   ` Oleg Nesterov
2012-01-24  8:19                                                                       ` Indan Zupancic
2012-02-06 20:30                                                                       ` H. Peter Anvin
2012-02-06 20:39                                                                         ` Roland McGrath
2012-02-06 20:42                                                                           ` H. Peter Anvin
2012-01-18 21:26                                                         ` Linus Torvalds
2012-01-18 21:30                                                           ` H. Peter Anvin
2012-01-18 21:42                                                             ` Linus Torvalds
2012-01-18 21:47                                                               ` H. Peter Anvin
2012-01-19  1:45                                                           ` Indan Zupancic
2012-01-19  2:16                                                             ` H. Peter Anvin
2012-02-06  8:32                                                   ` Indan Zupancic
2012-02-06 17:02                                                     ` H. Peter Anvin
2012-02-07  1:52                                                       ` Indan Zupancic
2012-02-09  0:19                                                         ` H. Peter Anvin
2012-02-09  4:20                                                           ` Indan Zupancic
2012-02-09  4:29                                                             ` H. Peter Anvin
2012-02-09  6:03                                                               ` Indan Zupancic
2012-02-09 14:47                                                                 ` H. Peter Anvin
2012-02-09 16:00                                                               ` H.J. Lu
2012-02-10  1:09                                                                 ` Indan Zupancic
2012-02-10  1:15                                                                   ` H. Peter Anvin
2012-02-10  2:29                                                                     ` Indan Zupancic
2012-02-10  2:47                                                                       ` H. Peter Anvin
     [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
2012-02-10 15:53                                                                         ` H. Peter Anvin
2012-02-10 22:42                                                                           ` Indan Zupancic
2012-02-10 22:56                                                                             ` H. Peter Anvin
2012-02-12 12:07                                                                               ` Indan Zupancic
2012-01-25 19:36                                                 ` Oleg Nesterov
2012-01-25 20:20                                                   ` Pedro Alves
2012-01-25 23:36                                                     ` Denys Vlasenko
2012-01-25 23:32                                                   ` Denys Vlasenko
2012-01-26  0:40                                                     ` Indan Zupancic
2012-01-26  1:08                                                       ` Jamie Lokier
2012-01-26  1:22                                                         ` Denys Vlasenko
2012-01-26  6:34                                                         ` Indan Zupancic
2012-01-26 10:31                                                           ` Jamie Lokier
2012-01-26 10:40                                                             ` Denys Vlasenko
2012-01-26 11:01                                                               ` Jamie Lokier
2012-01-26 14:02                                                                 ` Denys Vlasenko
2012-01-26 11:19                                                               ` Indan Zupancic
2012-01-26 11:20                                                             ` Indan Zupancic
2012-01-26 11:47                                                               ` Jamie Lokier
2012-01-26 14:05                                                                 ` Denys Vlasenko
2012-01-27  7:23                                                                 ` Indan Zupancic
2012-02-10  2:02                                                                   ` Jamie Lokier
2012-02-10  3:37                                                                     ` Indan Zupancic
2012-02-10 21:19                                                                       ` Denys Vlasenko
2012-01-26  1:09                                                       ` Denys Vlasenko
2012-01-26  3:47                                                         ` Linus Torvalds
2012-01-26 18:03                                                           ` Denys Vlasenko
2017-03-08 23:41                                                             ` Dmitry V. Levin
2017-03-09  4:39                                                               ` Andrew Lutomirski
2017-03-14  2:57                                                                 ` Dmitry V. Levin
2012-01-26  5:57                                                         ` Indan Zupancic
2012-01-26  0:59                                                     ` Jamie Lokier
2012-01-26  1:21                                                       ` Denys Vlasenko
2012-01-26  8:23                                                       ` Pedro Alves
2012-01-26  8:53                                                         ` Denys Vlasenko
2012-01-26  9:51                                                           ` Pedro Alves
2012-01-26 18:44                                                     ` Oleg Nesterov
2012-02-10  2:51                                                       ` Jamie Lokier
2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
2012-01-18 17:51                                               ` Linus Torvalds
2012-01-18  5:43                                         ` Chris Evans
2012-01-18 12:12                                           ` Indan Zupancic
2012-01-18 21:13                                             ` Chris Evans
2012-01-19  0:14                                               ` Indan Zupancic
2012-01-19  8:16                                                 ` Chris Evans
2012-01-19 11:34                                                   ` Indan Zupancic
2012-01-19 16:11                                                     ` Jamie Lokier
2012-01-19 15:40                                                 ` Jamie Lokier
2012-01-18 17:00                                           ` Oleg Nesterov
2012-01-18 17:12                                             ` Oleg Nesterov
2012-01-18 21:09                                               ` Chris Evans
2012-01-23 16:56                                                 ` Oleg Nesterov
2012-01-23 22:23                                                   ` Chris Evans
2012-02-07 11:45                                               ` Indan Zupancic
2012-01-19  0:29                                             ` Indan Zupancic
2012-01-18  2:27                                     ` Linus Torvalds
2012-01-18  2:31                                       ` Andi Kleen
2012-01-18  2:46                                         ` Linus Torvalds
2012-01-18 14:06                                           ` Martin Mares
2012-01-18 18:24                                             ` Andi Kleen
2012-01-19 16:04                                               ` Jamie Lokier
2012-01-20  0:21                                                 ` Indan Zupancic
2012-01-20  0:53                                                   ` Linus Torvalds
2012-01-20  2:02                                                     ` Indan Zupancic
2012-01-17 17:06                           ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-17 19:35                         ` Will Drewry
2012-01-12 17:02   ` Andrew Lutomirski
2012-01-16 20:28     ` Will Drewry
2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
2012-01-11 20:03   ` Jonathan Corbet
2012-01-11 20:10     ` Will Drewry
2012-01-11 23:19       ` [PATCH v2 " Will Drewry
2012-01-12  0:29         ` Will Drewry
2012-01-12 18:16         ` Randy Dunlap
2012-01-12 17:23           ` Will Drewry
2012-01-12 17:34             ` Steven Rostedt
2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
2012-01-12 17:25     ` Will Drewry
