linux-fsdevel.vger.kernel.org archive mirror
* [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters)
@ 2012-01-11 17:25 Will Drewry
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
  0 siblings, 2 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-11 17:25 UTC
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

The goal of the patchset is straightforward:

 To provide a means of reducing the kernel attack surface.

In practice, this is done at the primary kernel ABI: system calls.
Achieving this goal will address the needs expressed by many systems
projects:
  qemu/kvm, openssh, vsftpd, lxc, and chromium and chromium os (me).

While system call filtering has been attempted many times, I hope that
this approach shows more promise.  It works as described below and in
the patch series.

A userland task may call prctl(PR_ATTACH_SECCOMP_FILTER) to attach a
BPF program to itself.  Once attached, all system calls made by the
task will be evaluated by the BPF program prior to being accepted.
Evaluation is done by executing the BPF program over the struct
user_regs_struct for the process.

!! If you don't care about background or reasoning, stop reading !!

Past attempts have used:
- bitmap of system call numbers evaluated by seccomp (or tracehooks)
- standalone data structures and extra entry hooks
  (cgroups syscall, systrace)
- a collection of ftrace filter strings evaluated by seccomp
- perf_event hackery to allow process termination when an event matches
  (or doesn't)

In addition to the publicly posted approaches, I've personally attempted
continued deeper integration with ftrace along a number of different
lines (lead up to that can be found here[1]).  What inspired the current
patch series was a number of realizations:
1. Userland knows its ABI - that's how it made the system calls in the
   first place.
2. We already exposed a filtering system to userland processes in the
   form of BPF and there is continued focus on optimizing evaluation
   even after so many years.
3. System call filtering policies should not expose
   time-of-check-time-of-use (TOCTOU) vulnerable interfaces but should
   expose all the information that may be relevant to a syscall policy
   decision.

The prior seccomp-ftrace implementations struggled with very
fixable challenges in ftrace: incomplete syscall coverage,
mismatched syscall names versus unistd, incomplete arch coverage,
etc.  These challenges may all be fixed with some time and effort, and
potentially, even closer integration.  I explored a number of
alternative approaches from making system call tracepoints per-thread
and "active" to adding a new less-perf-oriented system call.

In the process of experimentation, a number of things became clear:
- perf/ftrace system-wide analysis goals don't align with lightweight
  per-thread analysis.
- ftrace/perf ABI doesn't mix well with security policy enforcement,
  reduced attack surface environments, or keeping users from specifying
  vulnerable filtering policies.
- other than system calls, tracepoints aren't considered ABI-stable.

The core focus of ftrace and perf is to support system-wide
performance and debugging tracing.  Despite its amazing flexibility,
there are tradeoffs that are made to provide efficient system-wide
behavior that are less efficient at a per-thread level.  For instance,
system call tracepoints are global.  It is possible to make them
per-thread (since they use a TIF anyway).  However, doing so would mean
that a system-wide system call analysis would require one trace event
per thread rather than one total.  It's possible to alleviate that pain,
but that in turn requires more bookkeeping (global versus local
tracepoint registrations mapping to the thread info flag).

Another example is the ftrace ABI.  Both the debugfs entry point with
unstable event ids and the perf-oriented perf_event_open(2) are not
suitable for providing a subsystem which is meant to reduce the attack
surface -- much less avoid maintainer flame wars :) The third aspect of
its ABI was also concerning and hints at yet-another-potential struggle.
The ftrace filter language happily accepts globbing and string matching.
This is excellent for tracing, but horrible for system call
interposition.  If, despite warning, a user decides that blocking a
system call based on a string is what they want, they can do it.  The
result is that their policy may be bypassed due to a time of check, time
of use race.  While addressable, it would mean that the filtering engine
would need to allow operation filtering or offer a "secure" subset.

A side challenge that emerged from the desire to enable tracing to act
as a security policy mechanism was the ability to enact policy over more
than just the system calls.  While this would be doable if all
tracepoints became active, there is a fundamental problem in that very
few, if any, tracepoints aside from system calls can be considered
stable.  If a subset were to emerge as stable, there is still the
challenge of enacting security policy in parallel with tracing policy.
In an example patch where security policy logic was added to
perf_event_open(2), the basics of the system worked, but enforcement of
the security policy was simplistic and intertwined with a large number
of event attributes that were meaningless or altered the behavior.

At every turn, it appears that the tracing infrastructure was unsuited
for being used for attack surface reduction or as a larger security
subsystem on its own.  It is well suited for feeding a policy
enforcement mechanism (like seccomp), but not for letting the logic
co-exist.  It doesn't mean that it has security problems, just that
there will be a continued struggle between having a really good perf
system and a really good kernel attack surface reduction system if
they were merged.  While there may be some distant vision where the
apparent struggle does not exist, I don't see how it would be reached.
Of course, anything is possible with unlimited time. :)

That said, much of that discussion is history, included to fill in some
of the gaps since I posted the last ftrace-based patches.  This patch series
should stand on its own as both straightforward and effective.  In my
opinion, this is the direction I should have taken before I sent my
first patch.

I am looking forward to any and all feedback - thanks!
will


[1] http://search.gmane.org/?query=seccomp+wad%40chromium.org&group=gmane.linux.kernel


Will Drewry (3):
  seccomp_filters: dynamic system call filtering using BPF programs
  Documentation: prctl/seccomp_filter

 Documentation/prctl/seccomp_filter.txt |  179 ++++++++
 fs/exec.c                              |    5 +
 include/linux/prctl.h                  |    3 +
 include/linux/seccomp.h                |   70 +++++-
 kernel/Makefile                        |    1 +
 kernel/fork.c                          |    4 +
 kernel/seccomp.c                       |    8 +
 kernel/seccomp_filter.c                |  639 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                           |    4 +
 security/Kconfig                       |   12 +
 9 files changed, 743 insertions(+), 3 deletions(-)
 create mode 100644 kernel/seccomp_filter.c
 create mode 100644 Documentation/prctl/seccomp_filter.txt
-- 
1.7.5.4

* [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters) Will Drewry
@ 2012-01-11 17:25 ` Will Drewry
  2012-01-12  8:53   ` Serge Hallyn
                     ` (5 more replies)
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
  1 sibling, 6 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-11 17:25 UTC
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

This patch adds support for seccomp mode 2.  This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task.  The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering.  Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).

A filter program may be installed by a userland task by calling
  prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached.  All attached programs
must be evaluated before a system call will be allowed to proceed.

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs may _only_ cross the execve(2) barrier if the last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace.  Once a task-local filter program is attached from a
process without privileges, execve will fail.  This ensures that only a
privileged parent task can affect its privileged children (e.g., setuid
binary).

There are a number of benefits to this approach, a few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
  call numbers.
- Full register information is provided which may be relevant for
  certain syscalls (fork, rt_sigreturn) or for other userland
  filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code.  It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

Signed-off-by: Will Drewry <wad@chromium.org>
---
 fs/exec.c               |    5 +
 include/linux/prctl.h   |    3 +
 include/linux/seccomp.h |   70 +++++-
 kernel/Makefile         |    1 +
 kernel/fork.c           |    4 +
 kernel/seccomp.c        |    8 +
 kernel/seccomp_filter.c |  639 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c            |    4 +
 security/Kconfig        |   12 +
 9 files changed, 743 insertions(+), 3 deletions(-)
 create mode 100644 kernel/seccomp_filter.c

diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
 #include <linux/namei.h>
 #include <linux/mount.h>
 #include <linux/security.h>
+#include <linux/seccomp.h>
 #include <linux/syscalls.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
 	if (retval)
 		goto out_ret;
 
+	retval = seccomp_check_exec();
+	if (retval)
+		goto out_ret;
+
 	retval = -ENOMEM;
 	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 	if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
 #define PR_GET_SECCOMP	21
 #define PR_SET_SECCOMP	22
 
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER	36
+
 /* Get/set the capability bounding set (as per security/commoncap.c) */
 #define PR_CAPBSET_READ 23
 #define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ *     if this is 0, seccomp is not in use.
+ *             is 1, the process is under standard seccomp rules.
+ *             is 2, the process is only allowed to make system calls where
+ *                   associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ *          @filter must only be accessed from the context of current as there
+ *          is no guard.
+ */
+typedef struct seccomp_struct {
+	int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+	struct seccomp_filter *filter;
+#endif
+} seccomp_t;
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
 
 #include <linux/errno.h>
 
-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
 #define secure_computing(x) do { } while (0)
 
 static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
 
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+	(_tsk)->seccomp.filter = NULL; \
+} while (0)
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+	put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0)
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+				struct task_struct *parent);
+
+#else  /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0)
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0)
+#define seccomp_filter_free_task(_tsk) do { } while (0)
+
+static inline int seccomp_check_exec(void)
+{
+	return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+	return -ENOSYS;
+}
+
+#endif  /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	seccomp_filter_free_task(tsk);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	sched_fork(p);
 
+	seccomp_filter_init_task(p);
 	retval = perf_event_init_task(p);
 	if (retval)
 		goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
+	seccomp_filter_fork(p, current);
 	return p;
 
 bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+#ifdef CONFIG_SECCOMP_FILTER
+	case 2:
+		if (seccomp_test_filters(this_syscall) == 0)
+			return;
+
+		seccomp_filter_log_failure(this_syscall);
+		break;
+#endif
 	default:
 		BUG();
 	}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ *         get/put helpers should be used when accessing an instance
+ *         outside of a lifetime-guarded section.  In general, this
+ *         is only needed for handling filters shared across tasks.
+ * @creator: pointer to the pid that created this filter
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+	struct kref usage;
+	struct pid *creator;
+	struct seccomp_filter *parent;
+	struct {
+		uint32_t admin:1,  /* can allow execve */
+			 compat:1,  /* CONFIG_COMPAT */
+			 __reserved:30;
+	} flags;
+	int personality;
+	unsigned short count;  /* Instruction count */
+	struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+				       const size_t buflen,
+				       const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+	struct seccomp_filter *f;
+	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+	/* Drop oversized requests. */
+	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+		return ERR_PTR(-EINVAL);
+
+	/* Padding should always be in sock_filter increments. */
+	BUG_ON(padding % sizeof(struct sock_filter));
+
+	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+	if (!f)
+		return ERR_PTR(-ENOMEM);
+	kref_init(&f->usage);
+	f->creator = get_task_pid(current, PIDTYPE_PID);
+	f->count = bpf_blocks;
+	return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (!filter)
+		return;
+	put_seccomp_filter(filter->parent);
+	put_pid(filter->creator);
+	kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+	struct seccomp_filter *orig =
+		container_of(kref, struct seccomp_filter, usage);
+	seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+		current->comm, task_pid_nr(current), syscall,
+		KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return;
+	kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return NULL;
+	kref_get(&orig->usage);
+	return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+	if (filter->personality != current->personality)
+		return -EACCES;
+#ifdef CONFIG_COMPAT
+	if (filter->flags.compat != (!!(is_compat_task())))
+		return -EACCES;
+#endif
+	return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+	const struct user_regset *regset;
+	int n;
+
+	/* Skip 0. */
+	for (n = 1; n < view->n; ++n) {
+		regset = view->regsets + n;
+		if (regset->core_note_type == NT_PRSTATUS)
+			return regset;
+	}
+
+	return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality.  The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure.  On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may point to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+	/* regset is usually returned based on task personality, not current
+	 * system call convention.  This behavior makes it unsafe to execute
+	 * BPF programs over regviews if is_compat_task or the personality
+	 * have changed since the program was installed.
+	 */
+	const struct user_regset_view *view = task_user_regset_view(current);
+	const struct user_regset *regset = &view->regsets[0];
+	size_t scratch_size = *available;
+	if (regset->core_note_type != NT_PRSTATUS) {
+		/* The architecture should override this method for speed. */
+		regset = find_prstatus(view);
+		if (!regset)
+			return NULL;
+	}
+	*available = regset->n * regset->size;
+	/* Make sure the scratch space isn't exceeded. */
+	if (*available > scratch_size)
+		*available = scratch_size;
+	if (regset->get(current, regset, 0, *available, scratch, NULL))
+		return NULL;
+	return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+	struct seccomp_filter *filter;
+	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+	size_t regs_size = sizeof(struct user_regs_struct);
+	int ret = -EACCES;
+
+	filter = current->seccomp.filter; /* uses task ref */
+	if (!filter)
+		goto out;
+
+	/* All filters in the list are required to share the same system call
+	 * convention so only the first filter is ever checked.
+	 */
+	if (seccomp_check_personality(filter))
+		goto out;
+
+	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
+	 * that is not mandatory.  E.g., it may return a pointer to
+	 * task_pt_regs(current).  NULL checking is mandatory.
+	 */
+	regs = seccomp_get_regs(regs_tmp, &regs_size);
+	if (!regs)
+		goto out;
+
+	/* Only allow a system call if it is allowed in all ancestors. */
+	ret = 0;
+	for ( ; filter != NULL; filter = filter->parent) {
+		/* Allowed if return value is the size of the data supplied. */
+		if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+		    regs_size)
+			ret = -EACCES;
+	}
+out:
+	return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+	struct seccomp_filter *filter = NULL;
+	/* Note, len is a short so overflow should be impossible. */
+	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+	long ret = -EPERM;
+
+	/* Allocate a new seccomp_filter */
+	filter = seccomp_filter_alloc(fp_size);
+	if (IS_ERR(filter)) {
+		/* Not 'goto out': put_seccomp_filter() must not see an ERR_PTR. */
+		return PTR_ERR(filter);
+	}
+
+	/* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		filter->flags.compat = 1;
+#endif
+	filter->personality = current->personality;
+
+	/* Auditing is not needed since the capability wasn't requested */
+	if (security_real_capable_noaudit(current, current_user_ns(),
+					  CAP_SYS_ADMIN) == 0)
+		filter->flags.admin = 1;
+
+	/* Copy the instructions from fprog. */
+	ret = -EFAULT;
+	if (copy_from_user(filter->insns, fprog->filter, fp_size))
+		goto out;
+
+	/* Check the fprog */
+	ret = sk_chk_filter(filter->insns, filter->count);
+	if (ret)
+		goto out;
+
+	/* If there is an existing filter, make it the parent
+	 * and reuse the existing task-based ref.
+	 */
+	filter->parent = current->seccomp.filter;
+
+	/* Force all filters to use one system call convention. */
+	ret = -EINVAL;
+	if (filter->parent) {
+		if (filter->parent->flags.compat != filter->flags.compat)
+			goto out;
+		if (filter->parent->personality != filter->personality)
+			goto out;
+	}
+
+	/* Double claim the new filter so we can release it below simplifying
+	 * the error paths earlier.
+	 */
+	ret = 0;
+	get_seccomp_filter(filter);
+	current->seccomp.filter = filter;
+	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+	if (!current->seccomp.mode) {
+		current->seccomp.mode = 2;
+		set_thread_flag(TIF_SECCOMP);
+	}
+
+out:
+	put_seccomp_filter(filter);  /* for get or task, on err */
+	return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+	struct sock_fprog fprog;
+	long ret = -EINVAL;
+
+	ret = -EFAULT;
+	if (!user_filter)
+		goto out;
+
+	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+		goto out;
+
+	ret = seccomp_attach_filter(&fprog);
+out:
+	return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+	if (current->seccomp.mode != 2)
+		return 0;
+	/* We can rely on the task refcount for the filter. */
+	if (!current->seccomp.filter)
+		return -EPERM;
+	/* The last attached filter set for the process is checked. It must
+	 * have been installed with CAP_SYS_ADMIN capabilities.
+	 */
+	if (current->seccomp.filter->flags.admin)
+		return 0;
+	return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * @child: forkee
+ * @parent: forker
+ * Ensures that @child inherits a seccomp_filter iff seccomp is enabled
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+			 struct task_struct *parent)
+{
+	if (!parent->seccomp.mode)
+		return;
+	child->seccomp.mode = parent->seccomp.mode;
+	child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer into @buf for the BPF evaluator after checking the offset
+ * and size boundaries.  The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+				int offset, size_t size,
+				void *unused)
+{
+	if (offset >= buflen)
+		goto fail;
+	if (offset < 0)
+		goto fail;
+	if (size > buflen - offset)
+		goto fail;
+	return buf + offset;
+fail:
+	return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ *	@buf: buffer to execute the filter over
+ *	@buflen: length of the buffer
+ *	@fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @fentry is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we don't need to check
+ * flen.
+ *
+ * See net/core/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure.  In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+				       const size_t buflen,
+				       const struct sock_filter *fentry)
+{
+	const void *ptr;
+	u32 A = 0;			/* Accumulator */
+	u32 X = 0;			/* Index Register */
+	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
+	u32 tmp;
+	int k;
+
+	/*
+	 * Process array of filter instructions.
+	 */
+	for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define	K (fentry->k)
+#else
+		const u32 K = fentry->k;
+#endif
+
+		switch (fentry->code) {
+		case BPF_S_ALU_ADD_X:
+			A += X;
+			continue;
+		case BPF_S_ALU_ADD_K:
+			A += K;
+			continue;
+		case BPF_S_ALU_SUB_X:
+			A -= X;
+			continue;
+		case BPF_S_ALU_SUB_K:
+			A -= K;
+			continue;
+		case BPF_S_ALU_MUL_X:
+			A *= X;
+			continue;
+		case BPF_S_ALU_MUL_K:
+			A *= K;
+			continue;
+		case BPF_S_ALU_DIV_X:
+			if (X == 0)
+				return 0;
+			A /= X;
+			continue;
+		case BPF_S_ALU_DIV_K:
+			A = reciprocal_divide(A, K);
+			continue;
+		case BPF_S_ALU_AND_X:
+			A &= X;
+			continue;
+		case BPF_S_ALU_AND_K:
+			A &= K;
+			continue;
+		case BPF_S_ALU_OR_X:
+			A |= X;
+			continue;
+		case BPF_S_ALU_OR_K:
+			A |= K;
+			continue;
+		case BPF_S_ALU_LSH_X:
+			A <<= X;
+			continue;
+		case BPF_S_ALU_LSH_K:
+			A <<= K;
+			continue;
+		case BPF_S_ALU_RSH_X:
+			A >>= X;
+			continue;
+		case BPF_S_ALU_RSH_K:
+			A >>= K;
+			continue;
+		case BPF_S_ALU_NEG:
+			A = -A;
+			continue;
+		case BPF_S_JMP_JA:
+			fentry += K;
+			continue;
+		case BPF_S_JMP_JGT_K:
+			fentry += (A > K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_K:
+			fentry += (A >= K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_K:
+			fentry += (A == K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_K:
+			fentry += (A & K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGT_X:
+			fentry += (A > X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_X:
+			fentry += (A >= X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_X:
+			fentry += (A == X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_X:
+			fentry += (A & X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_LD_W_ABS:
+			k = K;
+load_w:
+			ptr = load_pointer(buf, buflen, k, 4, &tmp);
+			if (ptr != NULL) {
+				/* Note, unlike on network data, values are not
+				 * byte swapped.
+				 */
+				A = *(const u32 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_H_ABS:
+			k = K;
+load_h:
+			ptr = load_pointer(buf, buflen, k, 2, &tmp);
+			if (ptr != NULL) {
+				A = *(const u16 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_B_ABS:
+			k = K;
+load_b:
+			ptr = load_pointer(buf, buflen, k, 1, &tmp);
+			if (ptr != NULL) {
+				A = *(const u8 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_W_LEN:
+			A = buflen;
+			continue;
+		case BPF_S_LDX_W_LEN:
+			X = buflen;
+			continue;
+		case BPF_S_LD_W_IND:
+			k = X + K;
+			goto load_w;
+		case BPF_S_LD_H_IND:
+			k = X + K;
+			goto load_h;
+		case BPF_S_LD_B_IND:
+			k = X + K;
+			goto load_b;
+		case BPF_S_LDX_B_MSH:
+			ptr = load_pointer(buf, buflen, K, 1, &tmp);
+			if (ptr != NULL) {
+				X = (*(u8 *)ptr & 0xf) << 2;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_IMM:
+			A = K;
+			continue;
+		case BPF_S_LDX_IMM:
+			X = K;
+			continue;
+		case BPF_S_LD_MEM:
+			A = mem[K];
+			continue;
+		case BPF_S_LDX_MEM:
+			X = mem[K];
+			continue;
+		case BPF_S_MISC_TAX:
+			X = A;
+			continue;
+		case BPF_S_MISC_TXA:
+			A = X;
+			continue;
+		case BPF_S_RET_K:
+			return K;
+		case BPF_S_RET_A:
+			return A;
+		case BPF_S_ST:
+			mem[K] = A;
+			continue;
+		case BPF_S_STX:
+			mem[K] = X;
+			continue;
+		case BPF_S_ANC_PROTOCOL:
+		case BPF_S_ANC_PKTTYPE:
+		case BPF_S_ANC_IFINDEX:
+		case BPF_S_ANC_MARK:
+		case BPF_S_ANC_QUEUE:
+		case BPF_S_ANC_HATYPE:
+		case BPF_S_ANC_RXHASH:
+		case BPF_S_ANC_CPU:
+		case BPF_S_ANC_NLATTR:
+		case BPF_S_ANC_NLATTR_NEST:
+			/* ignored */
+			continue;
+		default:
+			WARN_RATELIMIT(1, "Unknown code:%u jt:%u jf:%u k:%u\n",
+				       fentry->code, fentry->jt,
+				       fentry->jf, fentry->k);
+			return 0;
+		}
+	}
+
+	return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_SECCOMP:
 			error = prctl_set_seccomp(arg2);
 			break;
+		case PR_ATTACH_SECCOMP_FILTER:
+			error = prctl_attach_seccomp_filter((char __user *)
+								arg2);
+			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
 			break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
 
 	  If you are unsure how to answer this question, answer N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	select SECCOMP
+	depends on EXPERIMENTAL
+	help
+	  This kernel feature expands CONFIG_SECCOMP to allow computing
+	  in environments with reduced kernel access dictated by a system
+	  call filter, expressed in BPF, installed by the application itself
+	  through prctl(2).
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
-- 
1.7.5.4
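
As a reading aid for the interpreter loop in patch 1/2, here is a minimal
userspace sketch of the same idea: an accumulator machine that walks an
instruction array over a flat data buffer, fails closed on out-of-bounds
loads and division by zero, and returns a value the caller compares against
the data length.  The opcode names, struct insn, run_filter() and
filter_allows() are illustrative assumptions, not the kernel's definitions,
and the "allow only when the return value equals the data length" rule is
one reading of the comment above seccomp_run_filter().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced opcode set; NOT the kernel's BPF encoding. */
enum op { OP_LD_W_ABS, OP_LD_LEN, OP_TAX, OP_DIV_X, OP_JEQ_K, OP_RET_K, OP_RET_A };

struct insn {
	enum op code;
	uint8_t jt, jf;	/* relative jump offsets, as in struct sock_filter */
	uint32_t k;	/* immediate operand */
};

/*
 * Run a program over buf.  The program is assumed to be pre-validated:
 * every jump stays in bounds and the last instruction is a return.
 */
uint32_t run_filter(const uint8_t *buf, size_t buflen, const struct insn *pc)
{
	uint32_t A = 0;	/* accumulator */
	uint32_t X = 0;	/* index register */

	assert(buf != NULL && pc != NULL);
	for (;; pc++) {
		switch (pc->code) {
		case OP_LD_W_ABS:
			/* An out-of-bounds load fails the whole filter. */
			if (pc->k > buflen || buflen - pc->k < 4)
				return 0;
			/* Host byte order: no swap, unlike network data. */
			memcpy(&A, buf + pc->k, 4);
			continue;
		case OP_LD_LEN:
			A = (uint32_t)buflen;
			continue;
		case OP_TAX:
			X = A;
			continue;
		case OP_DIV_X:
			if (X == 0)
				return 0;
			A /= X;
			continue;
		case OP_JEQ_K:
			/* The loop's pc++ then steps past the jump itself. */
			pc += (A == pc->k) ? pc->jt : pc->jf;
			continue;
		case OP_RET_K:
			return pc->k;
		case OP_RET_A:
			return A;
		default:
			return 0;
		}
	}
}

/* Assumed caller-side rule: only the full data length allows the call. */
int filter_allows(const uint8_t *buf, size_t buflen, const struct insn *prog)
{
	return run_filter(buf, buflen, prog) == buflen;
}
```

A caller would build an array of struct insn ending in OP_RET_K or OP_RET_A
and pass it to filter_allows() together with the buffer to inspect.  Note
that every failure path returns 0, which matches the recommendation that
filters return 0 on intended failure.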



* [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 17:25 [RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters) Will Drewry
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-11 17:25 ` Will Drewry
  2012-01-11 20:03   ` Jonathan Corbet
  2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
  1 sibling, 2 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-11 17:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

Document how system call filtering with BPF works
and can be used.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |  159 ++++++++++++++++++++++++++++++++
 1 files changed, 159 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefit by having a reduced set
+of available system calls.  The resulting set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct.  This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not directly set by the
+consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over user_regs_struct data that
+	reflects register state at system call entry, with the system call
+	number resident in orig_[register].  To allow a system call, the size
+	of the data must be returned.  At present, all other return values
+	result in the system call being blocked, but it is recommended to
+	return 0 in those cases.  This will allow for future custom return
+	values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct sock_fprog which will
+	contain the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	The struct user_regs_struct the @prog will see is based on the
+	personality of the task at the time of this prctl call.  Additionally,
+	is_compat_task is also tracked for the @prog.  This means that once set
+	the calling task will have all of its system calls blocked if it
+	switches its system call ABI (via personality or other means).
+
+	If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+	namespace, the @prog will be marked as inheritable across execve.  Any
+	inherited filters are still subject to the system call ABI constraints
+	above and any ABI mismatched system calls will result in process death.
+
+The above call returns 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+	struct sock_filter filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 7),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 5),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 2),
+
+		/* Accept: return the data length.  Deny: return 0. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+		BPF_STMT(BPF_RET+BPF_A, 0),
+		BPF_STMT(BPF_RET+BPF_K, 0),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	if (prctl(36, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+	return 0;
+}
+
+Additionally, if prctl(2) is allowed by the installed filter, additional
+filters may be layered on which will increase evaluation time, but allow for
+further decreasing the attack surface during execution of a process.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+  a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
-- 
1.7.5.4
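
For readers less used to raw BPF, the policy that the documentation example
is meant to express can be written as plain C over the same kind of data:
load the system call number and first argument by byte offset, then return
the data length to allow or 0 to deny.  This is only a sketch: struct
fake_regs, load_word() and example_policy() are hypothetical stand-ins, and
the NR_* constants are local names carrying the x86-32 system call numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two register fields the example reads. */
struct fake_regs {
	uint32_t ebx;		/* first system call argument: the fd */
	uint32_t orig_eax;	/* system call number */
};

#define regoffset(_reg) (offsetof(struct fake_regs, _reg))

/* x86-32 system call numbers used by the example, under local names. */
enum {
	NR_EXIT = 1, NR_READ = 3, NR_WRITE = 4,
	NR_SIGRETURN = 119, NR_RT_SIGRETURN = 173, NR_EXIT_GROUP = 252,
};

/* Load one 32-bit word at a byte offset, like a BPF word load. */
static uint32_t load_word(const void *data, size_t off)
{
	uint32_t v;

	memcpy(&v, (const uint8_t *)data + off, sizeof(v));
	return v;
}

/* Returns the data length to allow the call and 0 to deny it. */
uint32_t example_policy(const void *data, size_t len)
{
	uint32_t nr, fd;

	assert(data != NULL && len >= sizeof(struct fake_regs));
	nr = load_word(data, regoffset(orig_eax));
	fd = load_word(data, regoffset(ebx));

	switch (nr) {
	case NR_RT_SIGRETURN:
	case NR_SIGRETURN:
	case NR_EXIT_GROUP:
	case NR_EXIT:
		return (uint32_t)len;
	case NR_READ:
		/* read is allowed on stdin only */
		return fd == 0 ? (uint32_t)len : 0;
	case NR_WRITE:
		/* write is allowed on stdout and stderr only */
		return (fd == 1 || fd == 2) ? (uint32_t)len : 0;
	default:
		return 0;
	}
}
```

Returning an explicit 0 on every deny path keeps the result independent of
whatever value was last loaded, which is the behaviour the documentation
recommends for blocked calls.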



* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-01-11 20:03   ` Jonathan Corbet
  2012-01-11 20:10     ` Will Drewry
  2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
  1 sibling, 1 reply; 222+ messages in thread
From: Jonathan Corbet @ 2012-01-11 20:03 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Interesting approach to the problem, I think I like it.  Watch for news at
11...:)

One nit:

> +Example
> +-------
> +
> +Assume a process would like to cleanly read and write to stdin/out/err and exit
> +cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
> +

It seems like this little program belongs in the samples/ directory.

Thanks,

jon


* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 20:03   ` Jonathan Corbet
@ 2012-01-11 20:10     ` Will Drewry
  2012-01-11 23:19       ` [PATCH v2 " Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-11 20:10 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wed, Jan 11, 2012 at 2:03 PM, Jonathan Corbet <corbet@lwn.net> wrote:
> Interesting approach to the problem, I think I like it.  Watch for news at
> 11...:)

Thanks - I'm glad to hear it!

> One nit:
>
>> +Example
>> +-------
>> +
>> +Assume a process would like to cleanly read and write to stdin/out/err and exit
>> +cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
>> +
>
> It seems like this little program belongs in the samples/ directory.

Cool - I'll do that and rev this patch.

cheers!
will


* [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 20:10     ` Will Drewry
@ 2012-01-11 23:19       ` Will Drewry
  2012-01-12  0:29         ` Will Drewry
  2012-01-12 18:16         ` Randy Dunlap
  0 siblings, 2 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-11 23:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet

Document how system call filtering with BPF works and
may be used.  Includes an example for x86 (32-bit).

Signed-off-by: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
 samples/Makefile                       |    2 +-
 samples/seccomp/Makefile               |   12 ++++
 samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt
 create mode 100644 samples/seccomp/Makefile
 create mode 100644 samples/seccomp/bpf-example.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..15d4645
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,99 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefit by having a reduced set
+of available system calls.  The resulting set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct.  This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not directly set by the
+consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over user_regs_struct data that
+	reflects register state at system call entry, with the system call
+	number resident in orig_[register].  To allow a system call, the size
+	of the data must be returned.  At present, all other return values
+	result in the system call being blocked, but it is recommended to
+	return 0 in those cases.  This will allow for future custom return
+	values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct sock_fprog which will
+	contain the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	The struct user_regs_struct the @prog will see is based on the
+	personality of the task at the time of this prctl call.  Additionally,
+	is_compat_task is also tracked for the @prog.  This means that once set
+	the calling task will have all of its system calls blocked if it
+	switches its system call ABI (via personality or other means).
+
+	If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+	namespace, the @prog will be marked as inheritable across execve.  Any
+	inherited filters are still subject to the system call ABI constraints
+	above and any ABI mismatched system calls will result in process death.
+
+	Additionally, if prctl(2) is allowed by the attached filter,
+	additional filters may be layered on which will increase evaluation
+	time, but allow for further decreasing the attack surface during
+	execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp/bpf-example.c shows an example process that allows read from
+stdin, write to stdout/err, exit and signal returns for 32-bit x86.
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+  a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ tracepoints/ trace_events/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..80dc8e4
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-$(CONFIG_X86_32) := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
+HOSTLOADLIBES_bpf-example += -m32
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+#	define PR_ATTACH_SECCOMP_FILTER 36
+#endif
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+static int install_filter(void)
+{
+	struct sock_filter filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 7),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 5),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 2),
+
+		/* Accept: return the data length.  Deny: return 0. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+		BPF_STMT(BPF_RET+BPF_A, 0),
+		BPF_STMT(BPF_RET+BPF_K, 0),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+	return 0;
+}
-- 
1.7.5.4
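
The jump table in the sample is easier to audit once the offset arithmetic
is spelled out: jt and jf are counted from the instruction after the jump,
so the target index is pc + 1 + offset.  In the sample, the jump at index 1
with jt = 10 therefore resolves to index 12, the instruction that loads the
accept value.  The sketch below applies that arithmetic and checks the two
properties the interpreter relies on (every jump lands inside the program,
and the program ends in a return).  struct jinsn, jump_target() and
jumps_ok() are hypothetical names, and this is not the kernel's validator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced instruction: only what jump checking needs. */
struct jinsn {
	int is_jump;	/* non-zero for a conditional jump */
	int is_ret;	/* non-zero for a return instruction */
	uint8_t jt, jf;	/* relative offsets from the next instruction */
};

/* Index reached from a jump at index pc with relative offset off. */
size_t jump_target(size_t pc, uint8_t off)
{
	return pc + 1 + off;
}

/*
 * Returns 1 when the program ends in a return and every conditional
 * jump lands inside the program, 0 otherwise.
 */
int jumps_ok(const struct jinsn *prog, size_t len)
{
	size_t pc;

	assert(prog != NULL);
	if (len == 0 || !prog[len - 1].is_ret)
		return 0;
	for (pc = 0; pc < len; pc++) {
		if (!prog[pc].is_jump)
			continue;
		if (jump_target(pc, prog[pc].jt) >= len ||
		    jump_target(pc, prog[pc].jf) >= len)
			return 0;
	}
	return 1;
}
```

Running a check like this over a hand-written table before attaching it is
a cheap way to catch an off-by-one offset, which otherwise tends to show up
as a system call being silently allowed or denied.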


* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 23:19       ` [PATCH v2 " Will Drewry
@ 2012-01-12  0:29         ` Will Drewry
  2012-01-12 18:16         ` Randy Dunlap
  1 sibling, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12  0:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet

Hrm, I may need to guard sample compilation based on host arch and not
just target arch. Documentation v3 will be on the way once I have that
behaving properly. :/

Sorry!
will

On Wed, Jan 11, 2012 at 5:19 PM, Will Drewry <wad@chromium.org> wrote:
> Document how system call filtering with BPF works and
> may be used.  Includes an example for x86 (32-bit).
>
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
>  samples/Makefile                       |    2 +-
>  samples/seccomp/Makefile               |   12 ++++
>  samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
>  4 files changed, 186 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>  create mode 100644 samples/seccomp/Makefile
>  create mode 100644 samples/seccomp/bpf-example.c
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> new file mode 100644
> index 0000000..15d4645
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,99 @@
> +               Seccomp filtering
> +               =================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the process.
> +As system calls change and mature, bugs are found and eradicated.  A
> +certain subset of userland applications benefit by having a reduced set
> +of available system calls.  The resulting set reduces the total kernel
> +surface exposed to the application.  System call filtering is meant for
> +use with those applications.
> +
> +Seccomp filtering provides a means for a process to specify a filter
> +for incoming system calls.  The filter is expressed as a Berkeley Packet
> +Filter program, as with socket filters, except that the data operated on
> +is the current user_regs_struct.  This allows for expressive filtering
> +of system calls using the pre-existing system call ABI and using a filter
> +program language with a long history of being exposed to userland.
> +Additionally, BPF makes it impossible for users of seccomp to fall prey to
> +time-of-check-time-of-use (TOCTOU) attacks that are common in system call
> +interposition frameworks because the evaluated data is solely register state
> +just after system call entry.
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox.  It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface.  Beyond that,
> +policy for logical behavior and information flow should be managed with
> +a combination of other system hardening techniques and, potentially, an
> +LSM of your choosing.  Expressive, dynamic filters provide further options down
> +this path (avoiding pathological sizes or selecting which of the multiplexed
> +system calls in socketcall() is allowed, for instance) which could be
> +construed, incorrectly, as a more complete sandboxing solution.
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is added, but it is not directly set by the
> +consuming process.  The new mode, '2', is only available if
> +CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
> +PR_ATTACH_SECCOMP_FILTER argument.
> +
> +Interacting with seccomp filters is done using one prctl(2) call.
> +
> +PR_ATTACH_SECCOMP_FILTER:
> +       Allows the specification of a new filter using a BPF program.
> +       The BPF program will be executed over user_regs_struct data that
> +       reflects register state at system call entry, with the system call
> +       number resident in orig_[register].  To allow a system call, the size
> +       of the data must be returned.  At present, all other return values
> +       result in the system call being blocked, but it is recommended to
> +       return 0 in those cases.  This will allow for future custom return
> +       values to be introduced, if ever desired.
> +
> +       Usage:
> +               prctl(PR_ATTACH_SECCOMP_FILTER, prog);
> +
> +       The 'prog' argument is a pointer to a struct sock_fprog which will
> +       contain the filter program.  If the program is invalid, the call
> +       will return -1 and set errno to EINVAL.
> +
> +       The struct user_regs_struct the @prog will see is based on the
> +       personality of the task at the time of this prctl call.  Additionally,
> +       is_compat_task is also tracked for the @prog.  This means that once set
> +       the calling task will have all of its system calls blocked if it
> +       switches its system call ABI (via personality or other means).
> +
> +       If the @prog is installed while the task has CAP_SYS_ADMIN in its user
> +       namespace, the @prog will be marked as inheritable across execve.  Any
> +       inherited filters are still subject to the system call ABI constraints
> +       above and any ABI mismatched system calls will result in process death.
> +
> +       Additionally, if prctl(2) is allowed by the attached filter,
> +       additional filters may be layered on which will increase evaluation
> +       time, but allow for further decreasing the attack surface during
> +       execution of a process.
> +
> +The above call returns 0 on success and non-zero on error.
> +
> +Example
> +-------
> +
> +samples/seccomp/bpf-example.c shows an example process that allows read from
> +stdin, write to stdout/err, exit and signal returns for 32-bit x86.
> +
> +Caveats
> +-------
> +
> +- execve will fail unless the most recently attached filter was installed by
> +  a process with CAP_SYS_ADMIN (in its namespace).
> +
> +Adding architecture support
> +---------------------------
> +
> +Any platform with seccomp support will support seccomp filters
> +as long as CONFIG_SECCOMP_FILTER is enabled.
> diff --git a/samples/Makefile b/samples/Makefile
> index 6280817..f29b19c 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -1,4 +1,4 @@
>  # Makefile for Linux samples code
>
>  obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ tracepoints/ trace_events/ \
> -                          hw_breakpoint/ kfifo/ kdb/ hidraw/
> +                          hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
> new file mode 100644
> index 0000000..80dc8e4
> --- /dev/null
> +++ b/samples/seccomp/Makefile
> @@ -0,0 +1,12 @@
> +# kbuild trick to avoid linker error. Can be omitted if a module is built.
> +obj- := dummy.o
> +
> +# List of programs to build
> +hostprogs-$(CONFIG_X86_32) := bpf-example
> +bpf-example-objs := bpf-example.o
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
> +HOSTLOADLIBES_bpf-example += -m32
> diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
> new file mode 100644
> index 0000000..f98b70a
> --- /dev/null
> +++ b/samples/seccomp/bpf-example.c
> @@ -0,0 +1,74 @@
> +/*
> + * Seccomp BPF example
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <asm/unistd.h>
> +#include <linux/filter.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <sys/prctl.h>
> +#include <sys/user.h>
> +#include <unistd.h>
> +
> +#ifndef PR_ATTACH_SECCOMP_FILTER
> +#      define PR_ATTACH_SECCOMP_FILTER 36
> +#endif
> +
> +#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
> +static int install_filter(void)
> +{
> +       struct sock_filter filter[] = {
> +               /* Grab the system call number */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
> +               /* Jump table for the allowed syscalls */
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
> +
> +               /* Check that read is only using stdin. */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
> +
> +               /* Check that write is only using stdout/stderr */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
> +
> +               /* Put the "accept" value in A */
> +               BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
> +
> +               BPF_STMT(BPF_RET+BPF_A,0),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> +               .filter = filter,
> +       };
> +       if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
> +               perror("prctl");
> +               return 1;
> +       }
> +       return 0;
> +}
> +
> +#define payload(_c) _c, sizeof(_c)
> +int main(int argc, char **argv) {
> +       char buf[4096];
> +       ssize_t bytes = 0;
> +       if (install_filter())
> +               return 1;
> +       syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
> +       bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
> +       syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
> +       syscall(__NR_write, STDOUT_FILENO, buf, bytes);
> +       return 0;
> +}
> --
> 1.7.5.4
>


* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-12  8:53   ` Serge Hallyn
  2012-01-12 16:54     ` Will Drewry
  2012-01-12 14:50   ` Oleg Nesterov
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 222+ messages in thread
From: Serge Hallyn @ 2012-01-12  8:53 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

Quoting Will Drewry (wad@chromium.org):
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).
> 
> A filter program may be installed by a userland task by calling
>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
> where fprog is of type struct sock_fprog.
> 
> If the first filter program allows subsequent prctl(2) calls, then
> additional filter programs may be attached.  All attached programs
> must be evaluated before a system call will be allowed to proceed.
> 
> To avoid CONFIG_COMPAT related landmines, once a filter program is
> installed using specific is_compat_task() and current->personality, it
> is not allowed to make system calls or attach additional filters which
> use a different combination of is_compat_task() and
> current->personality.
> 
> Filter programs may _only_ cross the execve(2) barrier if the last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace.  Once a task-local filter program is attached from a
> process without privileges, execve will fail.  This ensures that only
> a privileged parent task can affect its privileged children (e.g., a
> setuid binary).
> 
> There are a number of benefits to this approach, a few of which are
> as follows:
> - BPF has been exposed to userland for a long time.
> - Userland already knows its ABI: expected register layout and system
>   call numbers.
> - Full register information is provided which may be relevant for
>   certain syscalls (fork, rt_sigreturn) or for other userland
>   filtering tactics (checking the PC).
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> 
> This patch includes its own BPF evaluator, but relies on the
> net/core/filter.c BPF checking code.  It is possible to share
> evaluators, but the performance sensitive nature of the network
> filtering path makes it an iterative optimization which (I think :) can
> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)
> 
> Signed-off-by: Will Drewry <wad@chromium.org>

Hey Will,

A few comments below, but otherwise

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

thanks,
-serge

> ---
>  fs/exec.c               |    5 +
>  include/linux/prctl.h   |    3 +
>  include/linux/seccomp.h |   70 +++++-
>  kernel/Makefile         |    1 +
>  kernel/fork.c           |    4 +
>  kernel/seccomp.c        |    8 +
>  kernel/seccomp_filter.c |  639 +++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sys.c            |    4 +
>  security/Kconfig        |   12 +
>  9 files changed, 743 insertions(+), 3 deletions(-)
>  create mode 100644 kernel/seccomp_filter.c
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 3625464..e9cc89c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -44,6 +44,7 @@
>  #include <linux/namei.h>
>  #include <linux/mount.h>
>  #include <linux/security.h>
> +#include <linux/seccomp.h>
>  #include <linux/syscalls.h>
>  #include <linux/tsacct_kern.h>
>  #include <linux/cn_proc.h>
> @@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
>  	if (retval)
>  		goto out_ret;
>  
> +	retval = seccomp_check_exec();
> +	if (retval)
> +		goto out_ret;
> +
>  	retval = -ENOMEM;
>  	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
>  	if (!bprm)
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index a3baeb2..15e2460 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -64,6 +64,9 @@
>  #define PR_GET_SECCOMP	21
>  #define PR_SET_SECCOMP	22
>  
> +/* Set process seccomp filters */
> +#define PR_ATTACH_SECCOMP_FILTER	36
> +
>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>  #define PR_CAPBSET_READ 23
>  #define PR_CAPBSET_DROP 24
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index cc7a4e9..99d163e 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,9 +5,28 @@
>  #ifdef CONFIG_SECCOMP
>  
>  #include <linux/thread_info.h>
> +#include <linux/types.h>
>  #include <asm/seccomp.h>
>  
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_filter;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:
> + *     if this is 0, seccomp is not in use.
> + *             is 1, the process is under standard seccomp rules.
> + *             is 2, the process is only allowed to make system calls where
> + *                   associated filters evaluate successfully.
> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
> + *          @filter must only be accessed from the context of current as there
> + *          is no guard.
> + */
> +typedef struct seccomp_struct {
> +	int mode;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	struct seccomp_filter *filter;
> +#endif
> +} seccomp_t;
>  
>  extern void __secure_computing(int);
>  static inline void secure_computing(int this_syscall)
> @@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
>  
>  #include <linux/errno.h>
>  
> -typedef struct { } seccomp_t;
> -
> +typedef struct seccomp_struct { } seccomp_t;
>  #define secure_computing(x) do { } while (0)
>  
>  static inline long prctl_get_seccomp(void)
> @@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
>  
>  #endif /* CONFIG_SECCOMP */
>  
> +#ifdef CONFIG_SECCOMP_FILTER
> +
> +#define seccomp_filter_init_task(_tsk) do { \
> +	(_tsk)->seccomp.filter = NULL; \
> +} while (0);
> +
> +/* No locking is needed here because the task_struct will
> + * have no parallel consumers.
> + */
> +#define seccomp_filter_free_task(_tsk) do { \
> +	put_seccomp_filter((_tsk)->seccomp.filter); \
> +} while (0);
> +
> +extern int seccomp_check_exec(void);
> +
> +extern long prctl_attach_seccomp_filter(char __user *);
> +
> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
> +extern void put_seccomp_filter(struct seccomp_filter *);
> +
> +extern int seccomp_test_filters(int);
> +extern void seccomp_filter_log_failure(int);
> +extern void seccomp_filter_fork(struct task_struct *child,
> +				struct task_struct *parent);
> +
> +#else  /* CONFIG_SECCOMP_FILTER */
> +
> +#include <linux/errno.h>
> +
> +struct seccomp_filter { };
> +#define seccomp_filter_init_task(_tsk) do { } while (0);
> +#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
> +#define seccomp_filter_free_task(_tsk) do { } while (0);
> +
> +static inline int seccomp_check_exec(void)
> +{
> +	return 0;
> +}
> +
> +
> +static inline long prctl_attach_seccomp_filter(char __user *a2)
> +{
> +	return -ENOSYS;
> +}
> +
> +#endif  /* CONFIG_SECCOMP_FILTER */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index e898c5b..0584090 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>  obj-$(CONFIG_SECCOMP) += seccomp.o
> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> diff --git a/kernel/fork.c b/kernel/fork.c
> index da4a6a1..cc1d628 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/security.h>
>  #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
>  #include <linux/swap.h>
>  #include <linux/syscalls.h>
>  #include <linux/jiffies.h>
> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> +	seccomp_filter_free_task(tsk);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	sched_fork(p);
>  
> +	seccomp_filter_init_task(p);
>  	retval = perf_event_init_task(p);
>  	if (retval)
>  		goto bad_fork_cleanup_policy;
> @@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	if (clone_flags & CLONE_THREAD)
>  		threadgroup_fork_read_unlock(current);
>  	perf_event_fork(p);
> +	seccomp_filter_fork(p, current);
>  	return p;
>  
>  bad_fork_free_pid:
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 57d4b13..78719be 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
>  				return;
>  		} while (*++syscall);
>  		break;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	case 2:
> +		if (seccomp_test_filters(this_syscall) == 0)
> +			return;
> +
> +		seccomp_filter_log_failure(this_syscall);
> +		break;
> +#endif
>  	default:
>  		BUG();
>  	}
> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> new file mode 100644
> index 0000000..4770847
> --- /dev/null
> +++ b/kernel/seccomp_filter.c
> @@ -0,0 +1,639 @@
> +/* bpf program-based system call filtering
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + */
> +
> +#include <linux/capability.h>
> +#include <linux/compat.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/rculist.h>
> +#include <linux/filter.h>
> +#include <linux/kallsyms.h>
> +#include <linux/kref.h>
> +#include <linux/module.h>
> +#include <linux/pid.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/ratelimit.h>
> +#include <linux/reciprocal_div.h>
> +#include <linux/regset.h>
> +#include <linux/seccomp.h>
> +#include <linux/security.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/user.h>
> +
> +
> +/**
> + * struct seccomp_filter - container for seccomp BPF programs
> + *
> + * @usage: reference count to manage the object lifetime.
> + *         get/put helpers should be used when accessing an instance
> + *         outside of a lifetime-guarded section.  In general, this
> + *         is only needed for handling filters shared across tasks.
> + * @creator: pointer to the pid that created this filter
> + * @parent: pointer to the ancestor which this filter will be composed with.
> + * @flags: provide information about filter from creation time.
> + * @personality: personality of the process at filter creation time.
> + * @insns: the BPF program instructions to evaluate
> + * @count: the number of instructions in the program.
> + *
> + * seccomp_filter objects should never be modified after being attached
> + * to a task_struct (other than @usage).
> + */
> +struct seccomp_filter {
> +	struct kref usage;
> +	struct pid *creator;
> +	struct seccomp_filter *parent;
> +	struct {
> +		uint32_t admin:1,  /* can allow execve */
> +			 compat:1,  /* CONFIG_COMPAT */
> +			 __reserved:30;
> +	} flags;
> +	int personality;
> +	unsigned short count;  /* Instruction count */
> +	struct sock_filter insns[0];
> +};
> +
> +static unsigned int seccomp_run_filter(const u8 *buf,
> +				       const size_t buflen,
> +				       const struct sock_filter *);
> +
> +/**
> + * seccomp_filter_alloc - allocates a new filter object
> + * @padding: size of the insns[0] array in bytes
> + *
> + * The @padding should be a multiple of
> + * sizeof(struct sock_filter).
> + *
> + * Returns ERR_PTR on error or an allocated object.
> + */
> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
> +{
> +	struct seccomp_filter *f;
> +	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
> +
> +	/* Drop oversized requests. */
> +	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* Padding should always be in sock_filter increments. */
> +	BUG_ON(padding % sizeof(struct sock_filter));

I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace.  Was there a reason not to return -EINVAL here?
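
For illustration only, a minimal userspace sketch of that alternative.
It assumes the alignment check can simply be folded into the existing
size validation.  struct sock_filter_model and MODEL_BPF_MAXINSNS are
hypothetical stand-ins for struct sock_filter and BPF_MAXINSNS so that
the check compiles outside the kernel; seccomp_filter_size_ok() is not
a function from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock_filter (same field layout). */
struct sock_filter_model {
	unsigned short code;
	unsigned char jt;
	unsigned char jf;
	unsigned int k;
};

/* Hypothetical stand-in for BPF_MAXINSNS. */
#define MODEL_BPF_MAXINSNS 4096

/*
 * Returns 0 if @padding is an acceptable program size in bytes,
 * -EINVAL otherwise.  A misaligned size is reported to the caller
 * instead of triggering BUG_ON.
 */
int seccomp_filter_size_ok(unsigned long padding)
{
	unsigned long bpf_blocks = padding / sizeof(struct sock_filter_model);

	/* Padding should always be in sock_filter increments. */
	if (padding % sizeof(struct sock_filter_model))
		return -EINVAL;

	/* Drop empty and oversized requests. */
	if (bpf_blocks == 0 || bpf_blocks > MODEL_BPF_MAXINSNS)
		return -EINVAL;

	return 0;
}
```

In seccomp_filter_alloc() the equivalent change would be returning
ERR_PTR(-EINVAL) from the same spot, which the caller already handles
through IS_ERR().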

> +
> +	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
> +	if (!f)
> +		return ERR_PTR(-ENOMEM);
> +	kref_init(&f->usage);
> +	f->creator = get_task_pid(current, PIDTYPE_PID);
> +	f->count = bpf_blocks;
> +	return f;
> +}
> +
> +/**
> + * seccomp_filter_free - frees the allocated filter.
> + * @filter: NULL or live object to be completely destructed.
> + */
> +static void seccomp_filter_free(struct seccomp_filter *filter)
> +{
> +	if (!filter)
> +		return;
> +	put_seccomp_filter(filter->parent);
> +	put_pid(filter->creator);
> +	kfree(filter);
> +}
> +
> +static void __put_seccomp_filter(struct kref *kref)
> +{
> +	struct seccomp_filter *orig =
> +		container_of(kref, struct seccomp_filter, usage);
> +	seccomp_filter_free(orig);
> +}
> +
> +void seccomp_filter_log_failure(int syscall)
> +{
> +	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		KSTK_EIP(current));
> +}
> +
> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
> +void put_seccomp_filter(struct seccomp_filter *orig)
> +{
> +	if (!orig)
> +		return;
> +	kref_put(&orig->usage, __put_seccomp_filter);
> +}
> +
> +/* get_seccomp_filter - increments the reference count of @orig. */
> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
> +{
> +	if (!orig)
> +		return NULL;
> +	kref_get(&orig->usage);
> +	return orig;
> +}
> +
> +static int seccomp_check_personality(struct seccomp_filter *filter)
> +{
> +	if (filter->personality != current->personality)
> +		return -EACCES;
> +#ifdef CONFIG_COMPAT
> +	if (filter->flags.compat != (!!(is_compat_task())))
> +		return -EACCES;
> +#endif
> +	return 0;
> +}
> +
> +static const struct user_regset *
> +find_prstatus(const struct user_regset_view *view)
> +{
> +	const struct user_regset *regset;
> +	int n;
> +
> +	/* Skip 0. */
> +	for (n = 1; n < view->n; ++n) {
> +		regset = view->regsets + n;
> +		if (regset->core_note_type == NT_PRSTATUS)
> +			return regset;
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * seccomp_get_regs - returns a pointer to struct user_regs_struct
> + * @scratch: preallocated storage of size @available
> + * @available: pointer to the size of scratch.
> + *
> + * Returns NULL if the registers cannot be acquired or copied.
> + * Returns a populated pointer to @scratch by default.
> + * Otherwise, returns a pointer to a u8 array containing the struct
> + * user_regs_struct appropriate for the task personality.  The pointer
> + * may be to the beginning of @scratch or to an externally managed data
> + * structure.  On success, @available should be updated with the
> + * valid region size of the returned pointer.
> + *
> + * If the architecture overrides the linkage, then the pointer may point to
> + * another location.
> + */
> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
> +{
> +	/* regset is usually returned based on task personality, not current
> +	 * system call convention.  This behavior makes it unsafe to execute
> +	 * BPF programs over regviews if is_compat_task or the personality
> +	 * have changed since the program was installed.
> +	 */
> +	const struct user_regset_view *view = task_user_regset_view(current);
> +	const struct user_regset *regset = &view->regsets[0];
> +	size_t scratch_size = *available;
> +	if (regset->core_note_type != NT_PRSTATUS) {
> +		/* The architecture should override this method for speed. */
> +		regset = find_prstatus(view);
> +		if (!regset)
> +			return NULL;
> +	}
> +	*available = regset->n * regset->size;
> +	/* Make sure the scratch space isn't exceeded. */
> +	if (*available > scratch_size)
> +		*available = scratch_size;
> +	if (regset->get(current, regset, 0, *available, scratch, NULL))
> +		return NULL;
> +	return scratch;
> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	struct seccomp_filter *filter;
> +	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
> +	size_t regs_size = sizeof(struct user_regs_struct);
> +	int ret = -EACCES;
> +
> +	filter = current->seccomp.filter; /* uses task ref */
> +	if (!filter)
> +		goto out;
> +
> +	/* All filters in the list are required to share the same system call
> +	 * convention so only the first filter is ever checked.
> +	 */
> +	if (seccomp_check_personality(filter))
> +		goto out;
> +
> +	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
> +	 * that is not mandatory.  E.g., it may return a pointer to
> +	 * task_pt_regs(current).  NULL checking is mandatory.
> +	 */
> +	regs = seccomp_get_regs(regs_tmp, &regs_size);
> +	if (!regs)
> +		goto out;
> +
> +	/* Only allow a system call if it is allowed in all ancestors. */
> +	ret = 0;
> +	for ( ; filter != NULL; filter = filter->parent) {
> +		/* Allowed if return value is the size of the data supplied. */
> +		if (seccomp_run_filter(regs, regs_size, filter->insns) !=
> +		    regs_size)
> +			ret = -EACCES;
> +	}
> +out:
> +	return ret;
> +}
> +
> +/**
> + * seccomp_attach_filter: Attaches a seccomp filter to current.
> + * @fprog: BPF program to install
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called (usually prctl).
> + *
> + * This function may be called repeatedly to install additional filters.
> + * Every filter successfully installed will be evaluated (in reverse order)
> + * for each system call the thread makes.
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +long seccomp_attach_filter(struct sock_fprog *fprog)
> +{
> +	struct seccomp_filter *filter = NULL;
> +	/* Note, len is a short so overflow should be impossible. */
> +	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> +	long ret = -EPERM;
> +
> +	/* Allocate a new seccomp_filter */
> +	filter = seccomp_filter_alloc(fp_size);
> +	if (IS_ERR(filter)) {
> +		ret = PTR_ERR(filter);
> +		goto out;
> +	}
> +
> +	/* Lock the process personality and calling convention. */
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		filter->flags.compat = 1;
> +#endif
> +	filter->personality = current->personality;
> +
> +	/* Auditing is not needed since the capability wasn't requested */
> +	if (security_real_capable_noaudit(current, current_user_ns(),
> +					  CAP_SYS_ADMIN) == 0)
> +		filter->flags.admin = 1;
> +
> +	/* Copy the instructions from fprog. */
> +	ret = -EFAULT;
> +	if (copy_from_user(filter->insns, fprog->filter, fp_size))
> +		goto out;
> +
> +	/* Check the fprog */
> +	ret = sk_chk_filter(filter->insns, filter->count);
> +	if (ret)
> +		goto out;
> +
> +	/* If there is an existing filter, make it the parent
> +	 * and reuse the existing task-based ref.
> +	 */
> +	filter->parent = current->seccomp.filter;
> +
> +	/* Force all filters to use one system call convention. */
> +	ret = -EINVAL;
> +	if (filter->parent) {
> +		if (filter->parent->flags.compat != filter->flags.compat)
> +			goto out;
> +		if (filter->parent->personality != filter->personality)
> +			goto out;
> +	}
> +
> +	/* Double claim the new filter so we can release it below simplifying
> +	 * the error paths earlier.
> +	 */
> +	ret = 0;
> +	get_seccomp_filter(filter);
> +	current->seccomp.filter = filter;
> +	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
> +	if (!current->seccomp.mode) {
> +		current->seccomp.mode = 2;
> +		set_thread_flag(TIF_SECCOMP);
> +	}
> +
> +out:
> +	put_seccomp_filter(filter);  /* for get or task, on err */
> +	return ret;
> +}
> +
> +long prctl_attach_seccomp_filter(char __user *user_filter)
> +{
> +	struct sock_fprog fprog;
> +	long ret = -EINVAL;
> +
> +	ret = -EFAULT;
> +	if (!user_filter)
> +		goto out;
> +
> +	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
> +		goto out;
> +
> +	ret = seccomp_attach_filter(&fprog);
> +out:
> +	return ret;
> +}
> +
> +/**
> + * seccomp_check_exec: determines if exec is allowed for current
> + * Returns 0 if allowed.
> + */
> +int seccomp_check_exec(void)
> +{
> +	if (current->seccomp.mode != 2)
> +		return 0;
> +	/* We can rely on the task refcount for the filter. */
> +	if (!current->seccomp.filter)
> +		return -EPERM;
> +	/* The last attached filter set for the process is checked. It must
> +	 * have been installed with CAP_SYS_ADMIN capabilities.

This comment is confusing.  By 'It must' you mean that if not, it's
denied.  But if I didn't know better I would read that as "we can't
get to this code unless".  Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?
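
As a rough illustration of how the check reads with that wording, here
is a userspace model of the function.  struct filter_model, struct
seccomp_model and seccomp_check_exec_model() are hypothetical names
standing in for the kernel structures; only the control flow mirrors
the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the flags carried by struct seccomp_filter. */
struct filter_model {
	unsigned int admin:1;	/* filter was installed with CAP_SYS_ADMIN */
};

/* Hypothetical stand-in for the per-task seccomp state. */
struct seccomp_model {
	int mode;
	struct filter_model *filter;
};

/* Returns 0 if exec is allowed for the modelled task, -EPERM otherwise. */
int seccomp_check_exec_model(const struct seccomp_model *s)
{
	if (s->mode != 2)
		return 0;
	if (!s->filter)
		return -EPERM;
	/*
	 * Exec is refused unless the most recently attached filter was
	 * installed with CAP_SYS_ADMIN privilege.
	 */
	if (s->filter->admin)
		return 0;
	return -EPERM;
}
```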

> +	 */
> +	if (current->seccomp.filter->flags.admin)
> +		return 0;
> +	return -EPERM;
> +}
> +
> +/* seccomp_filter_fork: manages inheritance on fork
> + * @child: forkee
> + * @parent: forker
> + * Ensures that @child inherits a seccomp_filter iff seccomp is enabled
> + * and the set of filters is marked as 'enabled'.
> + */
> +void seccomp_filter_fork(struct task_struct *child,
> +			 struct task_struct *parent)
> +{
> +	if (!parent->seccomp.mode)
> +		return;
> +	child->seccomp.mode = parent->seccomp.mode;
> +	child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
> +}
> +
> +/* Returns a pointer to the BPF evaluator after checking the offset and size
> + * boundaries.  The signature almost matches the signature from
> + * net/core/filter.c with the hopes of sharing code in the future.
> + */
> +static const void *load_pointer(const u8 *buf, size_t buflen,
> +				int offset, size_t size,
> +				void *unused)
> +{
> +	if (offset >= buflen)
> +		goto fail;
> +	if (offset < 0)
> +		goto fail;
> +	if (size > buflen - offset)
> +		goto fail;
> +	return buf + offset;
> +fail:
> +	return NULL;
> +}
> +
> +/**
> + * seccomp_run_filter - evaluate BPF (over user_regs_struct)
> + *	@buf: buffer to execute the filter over
> + *	@buflen: length of the buffer
> + *	@fentry: filter to apply
> + *
> + * Decode and apply filter instructions to the buffer.
> + * Return length to keep, 0 for none. @buf is a regset we are
> + * filtering, @fentry is the array of filter instructions.
> + * Because all jumps are guaranteed to be before the last instruction,
> + * and the last instruction is guaranteed to be a RET, we don't need to
> + * check flen.
> + *
> + * See net/core/filter.c as this is nearly an exact copy.
> + * At some point, it would be nice to merge them to take advantage of
> + * optimizations (like JIT).
> + *
> + * A successful filter must return the full length of the data. Anything less
> + * will currently result in a seccomp failure.  In the future, it may be
> + * possible to use that for hard filtering registers on the fly so it is
> + * ideal for consumers to return 0 on intended failure.
> + */
> +static unsigned int seccomp_run_filter(const u8 *buf,
> +				       const size_t buflen,
> +				       const struct sock_filter *fentry)
> +{
> +	const void *ptr;
> +	u32 A = 0;			/* Accumulator */
> +	u32 X = 0;			/* Index Register */
> +	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
> +	u32 tmp;
> +	int k;
> +
> +	/*
> +	 * Process array of filter instructions.
> +	 */
> +	for (;; fentry++) {
> +#if defined(CONFIG_X86_32)
> +#define	K (fentry->k)
> +#else
> +		const u32 K = fentry->k;
> +#endif
> +
> +		switch (fentry->code) {
> +		case BPF_S_ALU_ADD_X:
> +			A += X;
> +			continue;
> +		case BPF_S_ALU_ADD_K:
> +			A += K;
> +			continue;
> +		case BPF_S_ALU_SUB_X:
> +			A -= X;
> +			continue;
> +		case BPF_S_ALU_SUB_K:
> +			A -= K;
> +			continue;
> +		case BPF_S_ALU_MUL_X:
> +			A *= X;
> +			continue;
> +		case BPF_S_ALU_MUL_K:
> +			A *= K;
> +			continue;
> +		case BPF_S_ALU_DIV_X:
> +			if (X == 0)
> +				return 0;
> +			A /= X;
> +			continue;
> +		case BPF_S_ALU_DIV_K:
> +			A = reciprocal_divide(A, K);
> +			continue;
> +		case BPF_S_ALU_AND_X:
> +			A &= X;
> +			continue;
> +		case BPF_S_ALU_AND_K:
> +			A &= K;
> +			continue;
> +		case BPF_S_ALU_OR_X:
> +			A |= X;
> +			continue;
> +		case BPF_S_ALU_OR_K:
> +			A |= K;
> +			continue;
> +		case BPF_S_ALU_LSH_X:
> +			A <<= X;
> +			continue;
> +		case BPF_S_ALU_LSH_K:
> +			A <<= K;
> +			continue;
> +		case BPF_S_ALU_RSH_X:
> +			A >>= X;
> +			continue;
> +		case BPF_S_ALU_RSH_K:
> +			A >>= K;
> +			continue;
> +		case BPF_S_ALU_NEG:
> +			A = -A;
> +			continue;
> +		case BPF_S_JMP_JA:
> +			fentry += K;
> +			continue;
> +		case BPF_S_JMP_JGT_K:
> +			fentry += (A > K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGE_K:
> +			fentry += (A >= K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JEQ_K:
> +			fentry += (A == K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JSET_K:
> +			fentry += (A & K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGT_X:
> +			fentry += (A > X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGE_X:
> +			fentry += (A >= X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JEQ_X:
> +			fentry += (A == X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JSET_X:
> +			fentry += (A & X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_LD_W_ABS:
> +			k = K;
> +load_w:
> +			ptr = load_pointer(buf, buflen, k, 4, &tmp);
> +			if (ptr != NULL) {
> +				/* Note, unlike on network data, values are not
> +				 * byte swapped.
> +				 */
> +				A = *(const u32 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_H_ABS:
> +			k = K;
> +load_h:
> +			ptr = load_pointer(buf, buflen, k, 2, &tmp);
> +			if (ptr != NULL) {
> +				A = *(const u16 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_B_ABS:
> +			k = K;
> +load_b:
> +			ptr = load_pointer(buf, buflen, k, 1, &tmp);
> +			if (ptr != NULL) {
> +				A = *(const u8 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_W_LEN:
> +			A = buflen;
> +			continue;
> +		case BPF_S_LDX_W_LEN:
> +			X = buflen;
> +			continue;
> +		case BPF_S_LD_W_IND:
> +			k = X + K;
> +			goto load_w;
> +		case BPF_S_LD_H_IND:
> +			k = X + K;
> +			goto load_h;
> +		case BPF_S_LD_B_IND:
> +			k = X + K;
> +			goto load_b;
> +		case BPF_S_LDX_B_MSH:
> +			ptr = load_pointer(buf, buflen, K, 1, &tmp);
> +			if (ptr != NULL) {
> +				X = (*(u8 *)ptr & 0xf) << 2;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_IMM:
> +			A = K;
> +			continue;
> +		case BPF_S_LDX_IMM:
> +			X = K;
> +			continue;
> +		case BPF_S_LD_MEM:
> +			A = mem[K];
> +			continue;
> +		case BPF_S_LDX_MEM:
> +			X = mem[K];
> +			continue;
> +		case BPF_S_MISC_TAX:
> +			X = A;
> +			continue;
> +		case BPF_S_MISC_TXA:
> +			A = X;
> +			continue;
> +		case BPF_S_RET_K:
> +			return K;
> +		case BPF_S_RET_A:
> +			return A;
> +		case BPF_S_ST:
> +			mem[K] = A;
> +			continue;
> +		case BPF_S_STX:
> +			mem[K] = X;
> +			continue;
> +		case BPF_S_ANC_PROTOCOL:
> +		case BPF_S_ANC_PKTTYPE:
> +		case BPF_S_ANC_IFINDEX:
> +		case BPF_S_ANC_MARK:
> +		case BPF_S_ANC_QUEUE:
> +		case BPF_S_ANC_HATYPE:
> +		case BPF_S_ANC_RXHASH:
> +		case BPF_S_ANC_CPU:
> +		case BPF_S_ANC_NLATTR:
> +		case BPF_S_ANC_NLATTR_NEST:
> +			/* ignored */
> +			continue;
> +		default:
> +			WARN_RATELIMIT(1, "Unknown code:%u jt:%u jf:%u k:%u\n",
> +				       fentry->code, fentry->jt,
> +				       fentry->jf, fentry->k);
> +			return 0;
> +		}
> +	}
> +
> +	return 0;
> +}
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 481611f..77f2eda 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		case PR_SET_SECCOMP:
>  			error = prctl_set_seccomp(arg2);
>  			break;
> +		case PR_ATTACH_SECCOMP_FILTER:
> +			error = prctl_attach_seccomp_filter((char __user *)
> +								arg2);
> +			break;
>  		case PR_GET_TSC:
>  			error = GET_TSC_CTL(arg2);
>  			break;
> diff --git a/security/Kconfig b/security/Kconfig
> index 51bd5a0..77b1106 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
>  
>  	  If you are unsure how to answer this question, answer N.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	select SECCOMP
> +	depends on EXPERIMENTAL
> +	help
> +	  This kernel feature expands CONFIG_SECCOMP to allow computing
> +	  in environments with reduced kernel access dictated by a system
> +	  call filter, expressed in BPF, installed by the application itself
> +	  through prctl(2).
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> -- 
> 1.7.5.4
> 

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 17:25 ` [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter Will Drewry
  2012-01-11 20:03   ` Jonathan Corbet
@ 2012-01-12 13:13   ` Łukasz Sowa
  2012-01-12 17:25     ` Will Drewry
  1 sibling, 1 reply; 222+ messages in thread
From: Łukasz Sowa @ 2012-01-12 13:13 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Hi Will,

That's a very different approach to the system call interposition problem.
I find your solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
makes it to the mainline (and I guess that's really possible at the current
stage :)), there will be no place for my solution, but that's ok.

There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with my solution.

I'll examine your patch later on and put some comments if I bump into
something.

Best Regards,
Lukasz Sowa


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-12  8:53   ` Serge Hallyn
@ 2012-01-12 14:50   ` Oleg Nesterov
  2012-01-12 16:55     ` Will Drewry
  2012-01-12 15:43   ` Steven Rostedt
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-12 14:50 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/11, Will Drewry wrote:
>
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).

Cool ;)

I didn't really read this patch yet, just one nit.

> +#define seccomp_filter_init_task(_tsk) do { \
> +	(_tsk)->seccomp.filter = NULL; \
> +} while (0);

Cosmetic and subjective, but imho it would be better to add inline
functions instead of #defines.
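
A minimal sketch of the inline-function form, for illustration only. The
struct definitions are stand-ins (assumptions) so that the fragment compiles
on its own; they are not the kernel's real types.

```c
#include <stddef.h>

/* Stand-in types, assumed for illustration; the real ones are in the kernel. */
struct seccomp_filter;
struct seccomp_struct {
	int mode;
	struct seccomp_filter *filter;
};
struct task_struct {
	struct seccomp_struct seccomp;
};

/* Type-checked replacement for the seccomp_filter_init_task() macro. */
static inline void seccomp_filter_init_task(struct task_struct *tsk)
{
	tsk->seccomp.filter = NULL;
}
```

Unlike the macro, this form type-checks its argument, and it avoids the
trailing semicolon after "while (0)", which would break the macro when used
in an unbraced if/else.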

> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> +	seccomp_filter_free_task(tsk);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	sched_fork(p);
>  
> +	seccomp_filter_init_task(p);

This doesn't look right, or I missed something. seccomp_filter_init_task()
should be called right after dup_task_struct(), or at least before
copy_process() can fail.

Otherwise copy_process()->free_task()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().
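
The ordering problem can be modelled in a few lines of userspace C. This is
a toy sketch, not kernel code: struct filter, struct task, put_filter(),
dup_task() and refs_after_failed_fork() are all hypothetical names, and the
refcount is a plain int rather than a kref.

```c
#include <stddef.h>

/* Toy model, not kernel code: all names here are made up for illustration. */
struct filter { int refs; };
struct task { struct filter *filter; };

static void put_filter(struct filter *f)
{
	if (f)
		f->refs--;	/* the real code would free at zero */
}

/* Models arch_dup_task_struct(): a plain copy that takes no reference. */
static void dup_task(struct task *child, const struct task *parent)
{
	*child = *parent;
}

/*
 * Simulate a fork that fails partway through and return the parent's
 * filter refcount afterwards.  init_early selects whether the child's
 * filter pointer is cleared before the failure point.
 */
static int refs_after_failed_fork(struct task *parent, int init_early)
{
	struct task child;

	dup_task(&child, parent);
	if (init_early)
		child.filter = NULL;	/* init right after the dup */

	/* ... an intermediate step fails here, before any late init ... */
	put_filter(child.filter);	/* error path frees the child */
	return parent->filter->refs;
}
```

With init_early set, the error path puts a NULL pointer and the parent's
count is untouched. Without it, the error path drops a reference that the
child never took.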

> +struct seccomp_filter {
> +	struct kref usage;
> +	struct pid *creator;

Why? seccomp_filter->creator is never used, no?

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-12  8:53   ` Serge Hallyn
  2012-01-12 14:50   ` Oleg Nesterov
@ 2012-01-12 15:43   ` Steven Rostedt
  2012-01-12 16:14     ` Oleg Nesterov
                       ` (3 more replies)
  2012-01-12 16:18   ` Alan Cox
                     ` (2 subsequent siblings)
  5 siblings, 4 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 15:43 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace.  Once a task-local filter program is attached from a
> process without privileges, execve will fail.  This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?

Maybe I don't understand this correctly.

-- Steve
 

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
@ 2012-01-12 16:14     ` Oleg Nesterov
  2012-01-12 16:38       ` Steven Rostedt
  2012-01-12 16:14     ` Andrew Lutomirski
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-12 16:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Steven Rostedt wrote:
>
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace.  Once a task-local filter program is attached from a
> > process without privileges, execve will fail.  This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

Maybe this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.

OTOH, currently seccomp.mode == 1 doesn't allow exec at all.

Oleg.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
  2012-01-12 16:14     ` Oleg Nesterov
@ 2012-01-12 16:14     ` Andrew Lutomirski
  2012-01-12 16:27       ` Steven Rostedt
  2012-01-12 16:59     ` Will Drewry
  2012-01-12 17:36     ` Jamie Lokier
  3 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 16:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 7:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

Time to resurrect execve_nosecurity?  If so, then the rule could be
simplified to: seccomp programs cannot use normal execve at all.

The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical.  I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*.  I think that the only
things missing in the kernel (other than something intelligent to do
about SELinux) are execve_nosecurity and the ability for a normal
program to wait for an unrelated program to finish (or some other way
that a program can ask a daemon to spawn a privileged program for it
and then to cleanly wait for that program to finish in a way that
could survive re-exec of the daemon).

--Andy

>
> -- Steve
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
                     ` (2 preceding siblings ...)
  2012-01-12 15:43   ` Steven Rostedt
@ 2012-01-12 16:18   ` Alan Cox
  2012-01-12 17:03     ` Will Drewry
  2012-01-13  1:31     ` James Morris
  2012-01-12 16:22   ` Oleg Nesterov
  2012-01-12 17:02   ` Andrew Lutomirski
  5 siblings, 2 replies; 222+ messages in thread
From: Alan Cox @ 2012-01-12 16:18 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace.  Once a task-local filter program is attached from a
> process without privileges, execve will fail.  This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply a filter to their own code
arbitrarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.

[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]

> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)

A BPF jit ought to be trivial and would be a big win.

In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.

Alan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
                     ` (3 preceding siblings ...)
  2012-01-12 16:18   ` Alan Cox
@ 2012-01-12 16:22   ` Oleg Nesterov
  2012-01-12 17:10     ` Will Drewry
  2012-01-12 17:02   ` Andrew Lutomirski
  5 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-12 16:22 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/11, Will Drewry wrote:
>
> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
> +{
> +	/* regset is usually returned based on task personality, not current
> +	 * system call convention.  This behavior makes it unsafe to execute
> +	 * BPF programs over regviews if is_compat_task or the personality
> +	 * have changed since the program was installed.
> +	 */
> +	const struct user_regset_view *view = task_user_regset_view(current);
> +	const struct user_regset *regset = &view->regsets[0];
> +	size_t scratch_size = *available;
> +	if (regset->core_note_type != NT_PRSTATUS) {
> +		/* The architecture should override this method for speed. */
> +		regset = find_prstatus(view);
> +		if (!regset)
> +			return NULL;
> +	}
> +	*available = regset->n * regset->size;
> +	/* Make sure the scratch space isn't exceeded. */
> +	if (*available > scratch_size)
> +		*available = scratch_size;
> +	if (regset->get(current, regset, 0, *available, scratch, NULL))
> +		return NULL;
> +	return scratch;
> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	struct seccomp_filter *filter;
> +	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
> +	size_t regs_size = sizeof(struct user_regs_struct);
> +	int ret = -EACCES;
> +
> +	filter = current->seccomp.filter; /* uses task ref */
> +	if (!filter)
> +		goto out;
> +
> +	/* All filters in the list are required to share the same system call
> +	 * convention so only the first filter is ever checked.
> +	 */
> +	if (seccomp_check_personality(filter))
> +		goto out;
> +
> +	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
> +	 * that is not mandatory.  E.g., it may return a pointer to
> +	 * task_pt_regs(current).  NULL checking is mandatory.
> +	 */
> +	regs = seccomp_get_regs(regs_tmp, &regs_size);

Stupid question. I am sure you know what you are doing ;) and I know
nothing about !x86 arches.

But could you explain why it is designed to use user_regs_struct?
Why can't we simply use task_pt_regs() and avoid the (costly) regsets?

Just curious.

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:14     ` Andrew Lutomirski
@ 2012-01-12 16:27       ` Steven Rostedt
  2012-01-12 16:51         ` Andrew Lutomirski
  2012-01-12 17:09         ` Linus Torvalds
  0 siblings, 2 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 16:27 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:

> The longer I linger on lists and see neat ideas like this, the more I
> get annoyed that execve is magical.  I dream of a distribution that
> doesn't use setuid, file capabilities, selinux transitions on exec, or
> any other privilege changes on exec *at all*. 

Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?

In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.

-- Steve



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:14     ` Oleg Nesterov
@ 2012-01-12 16:38       ` Steven Rostedt
  2012-01-12 16:47         ` Oleg Nesterov
  2012-01-12 17:30         ` Jamie Lokier
  0 siblings, 2 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 16:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:

> Maybe this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> 
> OTOH, currently seccomp.mode == 1 doesn't allow exec at all.

I've never used seccomp, so I admit I'm totally ignorant on this topic.

But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.

I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.

What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.

Or is this a race where I can have a soft link pointing to a normal
file, run this, and have the link change to a setuid file at just the
right time that causes it to happen?


-- Steve



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:38       ` Steven Rostedt
@ 2012-01-12 16:47         ` Oleg Nesterov
  2012-01-12 17:08           ` Will Drewry
  2012-01-12 17:30         ` Jamie Lokier
  1 sibling, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-12 16:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Steven Rostedt wrote:
>
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>
> > Maybe this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> >
> > OTOH, currently seccomp.mode == 1 doesn't allow exec at all.
>
> I've never used seccomp, so I admit I'm totally ignorant on this topic.

me too ;)

> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs. But
> it would be nice if I could execute something from the command line that
> I could run in a sand box.
>
> What's the problem with making sure that the setuid isn't set before
> doing an execv? Only fail when setuid (or some other magic) is enabled
> on the file being exec'd.

I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
I just do not know what would be the simplest/cleanest way to do this.


And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{
	if (current->seccomp.mode != 2)
		return 0;
	/* We can rely on the task refcount for the filter. */
	if (!current->seccomp.filter)
		return -EPERM;

How is it possible to have seccomp.filter == NULL with mode == 2?

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:27       ` Steven Rostedt
@ 2012-01-12 16:51         ` Andrew Lutomirski
  2012-01-12 17:09         ` Linus Torvalds
  1 sibling, 0 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 16:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:
>
>> The longer I linger on lists and see neat ideas like this, the more I
>> get annoyed that execve is magical.  I dream of a distribution that
>> doesn't use setuid, file capabilities, selinux transitions on exec, or
>> any other privilege changes on exec *at all*.
>
> Is that the fear with filtering on execv? That if we have filters on an
> execv calling a setuid program that we change the behavior of that
> privileged program and might cause unexpected results?

Exactly.

>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.
>

How do you define "non-magical"?

If setuid is set, then it's obviously magical.  On a nosuid
filesystem, strange things happen.  If file capabilities are enabled
and set, then different magic happens.  With LSMs involved, anything
can be magical.  (SELinux AFAICT looks up rules on every single exec,
so it might be impossible for execve to be non-magical.)  If execve is
banned entirely when seccomp is enabled, then there will never be any
attacks based on abusing these mechanisms.

My proposal is to have an alternative mechanism that, from a security
POV, does nothing that the caller couldn't have done on its own.  The
only reason it would be needed at all is because implementing execve
with correct semantics from userspace is a PITA -- the right set of
fds needs to be closed, threads need to be killed (without races),
vmas need to be found and unmapped, a new program needs to be mapped in
(possibly at the same place that the old one was mapped at),
/proc/self/exe needs to be updated, auxv needs to be recreated
(including using values that glibc might have erased already), etc.

The code is short and it works (although I have no idea whether it
applies to current kernels).

Oleg: my only issue with setting something like LSM_UNSAFE_SECCOMP is
that a different class of vulnerability might be introduced: take a
setuid program that calls other setuid programs (or just uses execve
as a way to get the default execve capability handling, SELinux
handling, etc), run it (as root!) inside seccomp, and watch it
possibly develop security holes.  If the alternate execve is a
different syscall, then this can't happen.  And if someone remaps
execve to execve_nosecurity (from userspace or via some in-kernel
mechanism) and causes problems, it's entirely clear who to blame.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12  8:53   ` Serge Hallyn
@ 2012-01-12 16:54     ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 16:54 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 2:53 AM, Serge Hallyn
<serge.hallyn@canonical.com> wrote:
> Quoting Will Drewry (wad@chromium.org):
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>>
>> A filter program may be installed by a userland task by calling
>>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
>> where fprog is of type struct sock_fprog.
>>
>> If the first filter program allows subsequent prctl(2) calls, then
>> additional filter programs may be attached.  All attached programs
>> must be evaluated before a system call will be allowed to proceed.
>>
>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>> installed using specific is_compat_task() and current->personality, it
>> is not allowed to make system calls or attach additional filters which
>> use a different combination of is_compat_task() and
>> current->personality.
>>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time.
>> - Userland already knows its ABI: expected register layout and system
>>   call numbers.
>> - Full register information is provided which may be relevant for
>>   certain syscalls (fork, rt_sigreturn) or for other userland
>>   filtering tactics (checking the PC).
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>>
>> This patch includes its own BPF evaluator, but relies on the
>> net/core/filter.c BPF checking code.  It is possible to share
>> evaluators, but the performance sensitive nature of the network
>> filtering path makes it an iterative optimization which (I think :) can
>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>
> Hey Will,
>
> A few comments below, but otherwise
>
> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

Thanks! Unimportant responses below.  Fixes will be incorporated in
the next round (along with Oleg's feedback).

cheers,
will

> thanks,
> -serge
>
>> ---
>>  fs/exec.c               |    5 +
>>  include/linux/prctl.h   |    3 +
>>  include/linux/seccomp.h |   70 +++++-
>>  kernel/Makefile         |    1 +
>>  kernel/fork.c           |    4 +
>>  kernel/seccomp.c        |    8 +
>>  kernel/seccomp_filter.c |  639 +++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sys.c            |    4 +
>>  security/Kconfig        |   12 +
>>  9 files changed, 743 insertions(+), 3 deletions(-)
>>  create mode 100644 kernel/seccomp_filter.c
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 3625464..e9cc89c 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -44,6 +44,7 @@
>>  #include <linux/namei.h>
>>  #include <linux/mount.h>
>>  #include <linux/security.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/tsacct_kern.h>
>>  #include <linux/cn_proc.h>
>> @@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
>>       if (retval)
>>               goto out_ret;
>>
>> +     retval = seccomp_check_exec();
>> +     if (retval)
>> +             goto out_ret;
>> +
>>       retval = -ENOMEM;
>>       bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
>>       if (!bprm)
>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> index a3baeb2..15e2460 100644
>> --- a/include/linux/prctl.h
>> +++ b/include/linux/prctl.h
>> @@ -64,6 +64,9 @@
>>  #define PR_GET_SECCOMP       21
>>  #define PR_SET_SECCOMP       22
>>
>> +/* Set process seccomp filters */
>> +#define PR_ATTACH_SECCOMP_FILTER     36
>> +
>>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>>  #define PR_CAPBSET_READ 23
>>  #define PR_CAPBSET_DROP 24
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index cc7a4e9..99d163e 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -5,9 +5,28 @@
>>  #ifdef CONFIG_SECCOMP
>>
>>  #include <linux/thread_info.h>
>> +#include <linux/types.h>
>>  #include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_filter;
>> +/**
>> + * struct seccomp_struct - the state of a seccomp'ed process
>> + *
>> + * @mode:
>> + *     if this is 0, seccomp is not in use.
>> + *             is 1, the process is under standard seccomp rules.
>> + *             is 2, the process is only allowed to make system calls where
>> + *                   associated filters evaluate successfully.
>> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
>> + *          @filter must only be accessed from the context of current as there
>> + *          is no guard.
>> + */
>> +typedef struct seccomp_struct {
>> +     int mode;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     struct seccomp_filter *filter;
>> +#endif
>> +} seccomp_t;
>>
>>  extern void __secure_computing(int);
>>  static inline void secure_computing(int this_syscall)
>> @@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
>>
>>  #include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> -
>> +typedef struct seccomp_struct { } seccomp_t;
>>  #define secure_computing(x) do { } while (0)
>>
>>  static inline long prctl_get_seccomp(void)
>> @@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
>>
>>  #endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +
>> +#define seccomp_filter_init_task(_tsk) do { \
>> +     (_tsk)->seccomp.filter = NULL; \
>> +} while (0);
>> +
>> +/* No locking is needed here because the task_struct will
>> + * have no parallel consumers.
>> + */
>> +#define seccomp_filter_free_task(_tsk) do { \
>> +     put_seccomp_filter((_tsk)->seccomp.filter); \
>> +} while (0);
>> +
>> +extern int seccomp_check_exec(void);
>> +
>> +extern long prctl_attach_seccomp_filter(char __user *);
>> +
>> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
>> +extern void put_seccomp_filter(struct seccomp_filter *);
>> +
>> +extern int seccomp_test_filters(int);
>> +extern void seccomp_filter_log_failure(int);
>> +extern void seccomp_filter_fork(struct task_struct *child,
>> +                             struct task_struct *parent);
>> +
>> +#else  /* CONFIG_SECCOMP_FILTER */
>> +
>> +#include <linux/errno.h>
>> +
>> +struct seccomp_filter { };
>> +#define seccomp_filter_init_task(_tsk) do { } while (0);
>> +#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
>> +#define seccomp_filter_free_task(_tsk) do { } while (0);
>> +
>> +static inline int seccomp_check_exec(void)
>> +{
>> +     return 0;
>> +}
>> +
>> +
>> +static inline long prctl_attach_seccomp_filter(char __user *a2)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +#endif  /* CONFIG_SECCOMP_FILTER */
>>  #endif /* _LINUX_SECCOMP_H */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index e898c5b..0584090 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>>  obj-$(CONFIG_SECCOMP) += seccomp.o
>> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index da4a6a1..cc1d628 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>>  #include <linux/cgroup.h>
>>  #include <linux/security.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/swap.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/jiffies.h>
>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     seccomp_filter_free_task(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       /* Perform scheduler related setup. Assign this task to a CPU. */
>>       sched_fork(p);
>>
>> +     seccomp_filter_init_task(p);
>>       retval = perf_event_init_task(p);
>>       if (retval)
>>               goto bad_fork_cleanup_policy;
>> @@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       if (clone_flags & CLONE_THREAD)
>>               threadgroup_fork_read_unlock(current);
>>       perf_event_fork(p);
>> +     seccomp_filter_fork(p, current);
>>       return p;
>>
>>  bad_fork_free_pid:
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..78719be 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case 2:
>> +             if (seccomp_test_filters(this_syscall) == 0)
>> +                     return;
>> +
>> +             seccomp_filter_log_failure(this_syscall);
>> +             break;
>> +#endif
>>       default:
>>               BUG();
>>       }
>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> new file mode 100644
>> index 0000000..4770847
>> --- /dev/null
>> +++ b/kernel/seccomp_filter.c
>> @@ -0,0 +1,639 @@
>> +/* bpf program-based system call filtering
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> + *
>> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + */
>> +
>> +#include <linux/capability.h>
>> +#include <linux/compat.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/rculist.h>
>> +#include <linux/filter.h>
>> +#include <linux/kallsyms.h>
>> +#include <linux/kref.h>
>> +#include <linux/module.h>
>> +#include <linux/pid.h>
>> +#include <linux/prctl.h>
>> +#include <linux/ptrace.h>
>> +#include <linux/ratelimit.h>
>> +#include <linux/reciprocal_div.h>
>> +#include <linux/regset.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/security.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/user.h>
>> +
>> +
>> +/**
>> + * struct seccomp_filter - container for seccomp BPF programs
>> + *
>> + * @usage: reference count to manage the object lifetime.
>> + *         get/put helpers should be used when accessing an instance
>> + *         outside of a lifetime-guarded section.  In general, this
>> + *         is only needed for handling filters shared across tasks.
>> + * @creator: pointer to the pid that created this filter
>> + * @parent: pointer to the ancestor which this filter will be composed with.
>> + * @flags: provide information about filter from creation time.
>> + * @personality: personality of the process at filter creation time.
>> + * @insns: the BPF program instructions to evaluate
>> + * @count: the number of instructions in the program.
>> + *
>> + * seccomp_filter objects should never be modified after being attached
>> + * to a task_struct (other than @usage).
>> + */
>> +struct seccomp_filter {
>> +     struct kref usage;
>> +     struct pid *creator;
>> +     struct seccomp_filter *parent;
>> +     struct {
>> +             uint32_t admin:1,  /* can allow execve */
>> +                      compat:1,  /* CONFIG_COMPAT */
>> +                      __reserved:30;
>> +     } flags;
>> +     int personality;
>> +     unsigned short count;  /* Instruction count */
>> +     struct sock_filter insns[0];
>> +};
>> +
>> +static unsigned int seccomp_run_filter(const u8 *buf,
>> +                                    const size_t buflen,
>> +                                    const struct sock_filter *);
>> +
>> +/**
>> + * seccomp_filter_alloc - allocates a new filter object
>> + * @padding: size of the insns[0] array in bytes
>> + *
>> + * The @padding should be a multiple of
>> + * sizeof(struct sock_filter).
>> + *
>> + * Returns ERR_PTR on error or an allocated object.
>> + */
>> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
>> +{
>> +     struct seccomp_filter *f;
>> +     unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
>> +
>> +     /* Drop oversized requests. */
>> +     if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     /* Padding should always be in sock_filter increments. */
>> +     BUG_ON(padding % sizeof(struct sock_filter));
>
> I still think the BUG_ON here is harsh given that the progsize is passed
> in by userspace.  Was there a reason not to return -EINVAL here?

I've changed it in the next revision.  As is, I don't believe userspace
can control the size of padding directly, just the increment, since it
specifies its length in terms of bpf blocks (sizeof(struct sock_filter)).
But EINVAL is certainly less aggressive :)
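
For illustration only, here is a minimal userspace sketch of what the
-EINVAL variant of the size check could look like.  The names
struct sock_filter_sketch, MAX_INSNS_SKETCH and check_padding() are
stand-ins invented for this example (MAX_INSNS_SKETCH is assumed to
mirror BPF_MAXINSNS); this is not the code from the next revision.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct sock_filter: 8 bytes (code, jt, jf, k). */
struct sock_filter_sketch {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* Assumed to mirror BPF_MAXINSNS. */
#define MAX_INSNS_SKETCH 4096

/*
 * Returns 0 if @padding (in bytes) is a usable program size,
 * -EINVAL otherwise.  Nothing here can crash the caller.
 */
static int check_padding(size_t padding)
{
	size_t blocks = padding / sizeof(struct sock_filter_sketch);

	/* Drop empty and oversized requests. */
	if (blocks == 0 || blocks > MAX_INSNS_SKETCH)
		return -EINVAL;

	/* Reject a trailing partial instruction instead of BUG_ON(). */
	if (padding % sizeof(struct sock_filter_sketch))
		return -EINVAL;

	return 0;
}
```

With this shape, a malformed size is reported to the caller as an
ordinary error, in the same way as the other checks in the function.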

>> +
>> +     f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
>> +     if (!f)
>> +             return ERR_PTR(-ENOMEM);
>> +     kref_init(&f->usage);
>> +     f->creator = get_task_pid(current, PIDTYPE_PID);
>> +     f->count = bpf_blocks;
>> +     return f;
>> +}
>> +
>> +/**
>> + * seccomp_filter_free - frees the allocated filter.
>> + * @filter: NULL or live object to be completely destructed.
>> + */
>> +static void seccomp_filter_free(struct seccomp_filter *filter)
>> +{
>> +     if (!filter)
>> +             return;
>> +     put_seccomp_filter(filter->parent);
>> +     put_pid(filter->creator);
>> +     kfree(filter);
>> +}
>> +
>> +static void __put_seccomp_filter(struct kref *kref)
>> +{
>> +     struct seccomp_filter *orig =
>> +             container_of(kref, struct seccomp_filter, usage);
>> +     seccomp_filter_free(orig);
>> +}
>> +
>> +void seccomp_filter_log_failure(int syscall)
>> +{
>> +     pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current), syscall,
>> +             KSTK_EIP(current));
>> +}
>> +
>> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
>> +void put_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return;
>> +     kref_put(&orig->usage, __put_seccomp_filter);
>> +}
>> +
>> +/* get_seccomp_filter - increments the reference count of @orig. */
>> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     kref_get(&orig->usage);
>> +     return orig;
>> +}
>> +
>> +static int seccomp_check_personality(struct seccomp_filter *filter)
>> +{
>> +     if (filter->personality != current->personality)
>> +             return -EACCES;
>> +#ifdef CONFIG_COMPAT
>> +     if (filter->flags.compat != (!!(is_compat_task())))
>> +             return -EACCES;
>> +#endif
>> +     return 0;
>> +}
>> +
>> +static const struct user_regset *
>> +find_prstatus(const struct user_regset_view *view)
>> +{
>> +     const struct user_regset *regset;
>> +     int n;
>> +
>> +     /* Skip 0. */
>> +     for (n = 1; n < view->n; ++n) {
>> +             regset = view->regsets + n;
>> +             if (regset->core_note_type == NT_PRSTATUS)
>> +                     return regset;
>> +     }
>> +
>> +     return NULL;
>> +}
>> +
>> +/**
>> + * seccomp_get_regs - returns a pointer to struct user_regs_struct
>> + * @scratch: preallocated storage of size @available
>> + * @available: pointer to the size of scratch.
>> + *
>> + * Returns NULL if the registers cannot be acquired or copied.
>> + * Returns a populated pointer to @scratch by default.
>> + * Otherwise, returns a pointer to a u8 array containing the struct
>> + * user_regs_struct appropriate for the task personality.  The pointer
>> + * may be to the beginning of @scratch or to an externally managed data
>> + * structure.  On success, @available should be updated with the
>> + * valid region size of the returned pointer.
>> + *
>> + * If the architecture overrides the linkage, then the pointer may point to
>> + * another location.
>> + */
>> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
>> +{
>> +     /* regset is usually returned based on task personality, not current
>> +      * system call convention.  This behavior makes it unsafe to execute
>> +      * BPF programs over regviews if is_compat_task or the personality
>> +      * have changed since the program was installed.
>> +      */
>> +     const struct user_regset_view *view = task_user_regset_view(current);
>> +     const struct user_regset *regset = &view->regsets[0];
>> +     size_t scratch_size = *available;
>> +     if (regset->core_note_type != NT_PRSTATUS) {
>> +             /* The architecture should override this method for speed. */
>> +             regset = find_prstatus(view);
>> +             if (!regset)
>> +                     return NULL;
>> +     }
>> +     *available = regset->n * regset->size;
>> +     /* Make sure the scratch space isn't exceeded. */
>> +     if (*available > scratch_size)
>> +             *available = scratch_size;
>> +     if (regset->get(current, regset, 0, *available, scratch, NULL))
>> +             return NULL;
>> +     return scratch;
>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     struct seccomp_filter *filter;
>> +     u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
>> +     size_t regs_size = sizeof(struct user_regs_struct);
>> +     int ret = -EACCES;
>> +
>> +     filter = current->seccomp.filter; /* uses task ref */
>> +     if (!filter)
>> +             goto out;
>> +
>> +     /* All filters in the list are required to share the same system call
>> +      * convention so only the first filter is ever checked.
>> +      */
>> +     if (seccomp_check_personality(filter))
>> +             goto out;
>> +
>> +     /* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
>> +      * that is not mandatory.  E.g., it may return a pointer to
>> +      * task_pt_regs(current).  NULL checking is mandatory.
>> +      */
>> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>> +     if (!regs)
>> +             goto out;
>> +
>> +     /* Only allow a system call if it is allowed in all ancestors. */
>> +     ret = 0;
>> +     for ( ; filter != NULL; filter = filter->parent) {
>> +             /* Allowed if return value is the size of the data supplied. */
>> +             if (seccomp_run_filter(regs, regs_size, filter->insns) !=
>> +                 regs_size)
>> +                     ret = -EACCES;
>> +     }
>> +out:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_filter: Attaches a seccomp filter to current.
>> + * @fprog: BPF program to install
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called (usually prctl).
>> + *
>> + * This function may be called repeatedly to install additional filters.
>> + * Every filter successfully installed will be evaluated (in reverse order)
>> + * for each system call the thread makes.
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> +     struct seccomp_filter *filter = NULL;
>> +     /* Note, len is a short so overflow should be impossible. */
>> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> +     long ret = -EPERM;
>> +
>> +     /* Allocate a new seccomp_filter */
>> +     filter = seccomp_filter_alloc(fp_size);
>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +
>> +     /* Lock the process personality and calling convention. */
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             filter->flags.compat = 1;
>> +#endif
>> +     filter->personality = current->personality;
>> +
>> +     /* Auditing is not needed since the capability wasn't requested */
>> +     if (security_real_capable_noaudit(current, current_user_ns(),
>> +                                       CAP_SYS_ADMIN) == 0)
>> +             filter->flags.admin = 1;
>> +
>> +     /* Copy the instructions from fprog. */
>> +     ret = -EFAULT;
>> +     if (copy_from_user(filter->insns, fprog->filter, fp_size))
>> +             goto out;
>> +
>> +     /* Check the fprog */
>> +     ret = sk_chk_filter(filter->insns, filter->count);
>> +     if (ret)
>> +             goto out;
>> +
>> +     /* If there is an existing filter, make it the parent
>> +      * and reuse the existing task-based ref.
>> +      */
>> +     filter->parent = current->seccomp.filter;
>> +
>> +     /* Force all filters to use one system call convention. */
>> +     ret = -EINVAL;
>> +     if (filter->parent) {
>> +             if (filter->parent->flags.compat != filter->flags.compat)
>> +                     goto out;
>> +             if (filter->parent->personality != filter->personality)
>> +                     goto out;
>> +     }
>> +
>> +     /* Double claim the new filter so we can release it below simplifying
>> +      * the error paths earlier.
>> +      */
>> +     ret = 0;
>> +     get_seccomp_filter(filter);
>> +     current->seccomp.filter = filter;
>> +     /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
>> +     if (!current->seccomp.mode) {
>> +             current->seccomp.mode = 2;
>> +             set_thread_flag(TIF_SECCOMP);
>> +     }
>> +
>> +out:
>> +     put_seccomp_filter(filter);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +
>> +long prctl_attach_seccomp_filter(char __user *user_filter)
>> +{
>> +     struct sock_fprog fprog;
>> +     long ret = -EINVAL;
>> +
>> +     ret = -EFAULT;
>> +     if (!user_filter)
>> +             goto out;
>> +
>> +     if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
>> +             goto out;
>> +
>> +     ret = seccomp_attach_filter(&fprog);
>> +out:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_check_exec: determines if exec is allowed for current
>> + * Returns 0 if allowed.
>> + */
>> +int seccomp_check_exec(void)
>> +{
>> +     if (current->seccomp.mode != 2)
>> +             return 0;
>> +     /* We can rely on the task refcount for the filter. */
>> +     if (!current->seccomp.filter)
>> +             return -EPERM;
>> +     /* The last attached filter set for the process is checked. It must
>> +      * have been installed with CAP_SYS_ADMIN capabilities.
>
> This comment is confusing.  By 'It must' you mean that if not, it's
> denied.  But if I didn't know better I would read that as "we can't
> get to this code unless".  Can you change it to something like
> "Exec is refused unless the filter was installed with CAP_SYS_ADMIN
> privilege"?

Sounds good!

>> +      */
>> +     if (current->seccomp.filter->flags.admin)
>> +             return 0;
>> +     return -EPERM;
>> +}
>> +
>> +/* seccomp_filter_fork: manages inheritance on fork
>> + * @child: forkee
>> + * @parent: forker
>> + * Ensures that @child inherits a seccomp_filter iff seccomp is enabled
>> + * and the set of filters is marked as 'enabled'.
>> + */
>> +void seccomp_filter_fork(struct task_struct *child,
>> +                      struct task_struct *parent)
>> +{
>> +     if (!parent->seccomp.mode)
>> +             return;
>> +     child->seccomp.mode = parent->seccomp.mode;
>> +     child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
>> +}
>> +
>> +/* Returns a pointer to the BPF evaluator after checking the offset and size
>> + * boundaries.  The signature almost matches the signature from
>> + * net/core/filter.c with the hopes of sharing code in the future.
>> + */
>> +static const void *load_pointer(const u8 *buf, size_t buflen,
>> +                             int offset, size_t size,
>> +                             void *unused)
>> +{
>> +     if (offset >= buflen)
>> +             goto fail;
>> +     if (offset < 0)
>> +             goto fail;
>> +     if (size > buflen - offset)
>> +             goto fail;
>> +     return buf + offset;
>> +fail:
>> +     return NULL;
>> +}
>> +
>> +/**
>> + * seccomp_run_filter - evaluate BPF (over user_regs_struct)
>> + *   @buf: buffer to execute the filter over
>> + *   @buflen: length of the buffer
>> + *   @fentry: filter to apply
>> + *
>> + * Decode and apply filter instructions to the buffer.
>> + * Return length to keep, 0 for none. @buf is a regset we are
>> + * filtering, @fentry is the array of filter instructions.
>> + * Because all jumps are guaranteed to be before last instruction,
>> + * and last instruction guaranteed to be a RET, we don't need to check
>> + * flen.
>> + *
>> + * See net/core/filter.c as this is nearly an exact copy.
>> + * At some point, it would be nice to merge them to take advantage of
>> + * optimizations (like JIT).
>> + *
>> + * A successful filter must return the full length of the data. Anything less
>> + * will currently result in a seccomp failure.  In the future, it may be
>> + * possible to use that for hard filtering registers on the fly so it is
>> + * ideal for consumers to return 0 on intended failure.
>> + */
>> +static unsigned int seccomp_run_filter(const u8 *buf,
>> +                                    const size_t buflen,
>> +                                    const struct sock_filter *fentry)
>> +{
>> +     const void *ptr;
>> +     u32 A = 0;                      /* Accumulator */
>> +     u32 X = 0;                      /* Index Register */
>> +     u32 mem[BPF_MEMWORDS];          /* Scratch Memory Store */
>> +     u32 tmp;
>> +     int k;
>> +
>> +     /*
>> +      * Process array of filter instructions.
>> +      */
>> +     for (;; fentry++) {
>> +#if defined(CONFIG_X86_32)
>> +#define      K (fentry->k)
>> +#else
>> +             const u32 K = fentry->k;
>> +#endif
>> +
>> +             switch (fentry->code) {
>> +             case BPF_S_ALU_ADD_X:
>> +                     A += X;
>> +                     continue;
>> +             case BPF_S_ALU_ADD_K:
>> +                     A += K;
>> +                     continue;
>> +             case BPF_S_ALU_SUB_X:
>> +                     A -= X;
>> +                     continue;
>> +             case BPF_S_ALU_SUB_K:
>> +                     A -= K;
>> +                     continue;
>> +             case BPF_S_ALU_MUL_X:
>> +                     A *= X;
>> +                     continue;
>> +             case BPF_S_ALU_MUL_K:
>> +                     A *= K;
>> +                     continue;
>> +             case BPF_S_ALU_DIV_X:
>> +                     if (X == 0)
>> +                             return 0;
>> +                     A /= X;
>> +                     continue;
>> +             case BPF_S_ALU_DIV_K:
>> +                     A = reciprocal_divide(A, K);
>> +                     continue;
>> +             case BPF_S_ALU_AND_X:
>> +                     A &= X;
>> +                     continue;
>> +             case BPF_S_ALU_AND_K:
>> +                     A &= K;
>> +                     continue;
>> +             case BPF_S_ALU_OR_X:
>> +                     A |= X;
>> +                     continue;
>> +             case BPF_S_ALU_OR_K:
>> +                     A |= K;
>> +                     continue;
>> +             case BPF_S_ALU_LSH_X:
>> +                     A <<= X;
>> +                     continue;
>> +             case BPF_S_ALU_LSH_K:
>> +                     A <<= K;
>> +                     continue;
>> +             case BPF_S_ALU_RSH_X:
>> +                     A >>= X;
>> +                     continue;
>> +             case BPF_S_ALU_RSH_K:
>> +                     A >>= K;
>> +                     continue;
>> +             case BPF_S_ALU_NEG:
>> +                     A = -A;
>> +                     continue;
>> +             case BPF_S_JMP_JA:
>> +                     fentry += K;
>> +                     continue;
>> +             case BPF_S_JMP_JGT_K:
>> +                     fentry += (A > K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGE_K:
>> +                     fentry += (A >= K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JEQ_K:
>> +                     fentry += (A == K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JSET_K:
>> +                     fentry += (A & K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGT_X:
>> +                     fentry += (A > X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGE_X:
>> +                     fentry += (A >= X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JEQ_X:
>> +                     fentry += (A == X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JSET_X:
>> +                     fentry += (A & X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_LD_W_ABS:
>> +                     k = K;
>> +load_w:
>> +                     ptr = load_pointer(buf, buflen, k, 4, &tmp);
>> +                     if (ptr != NULL) {
>> +                             /* Note, unlike on network data, values are not
>> +                              * byte swapped.
>> +                              */
>> +                             A = *(const u32 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_H_ABS:
>> +                     k = K;
>> +load_h:
>> +                     ptr = load_pointer(buf, buflen, k, 2, &tmp);
>> +                     if (ptr != NULL) {
>> +                             A = *(const u16 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_B_ABS:
>> +                     k = K;
>> +load_b:
>> +                     ptr = load_pointer(buf, buflen, k, 1, &tmp);
>> +                     if (ptr != NULL) {
>> +                             A = *(const u8 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_W_LEN:
>> +                     A = buflen;
>> +                     continue;
>> +             case BPF_S_LDX_W_LEN:
>> +                     X = buflen;
>> +                     continue;
>> +             case BPF_S_LD_W_IND:
>> +                     k = X + K;
>> +                     goto load_w;
>> +             case BPF_S_LD_H_IND:
>> +                     k = X + K;
>> +                     goto load_h;
>> +             case BPF_S_LD_B_IND:
>> +                     k = X + K;
>> +                     goto load_b;
>> +             case BPF_S_LDX_B_MSH:
>> +                     ptr = load_pointer(buf, buflen, K, 1, &tmp);
>> +                     if (ptr != NULL) {
>> +                             X = (*(u8 *)ptr & 0xf) << 2;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_IMM:
>> +                     A = K;
>> +                     continue;
>> +             case BPF_S_LDX_IMM:
>> +                     X = K;
>> +                     continue;
>> +             case BPF_S_LD_MEM:
>> +                     A = mem[K];
>> +                     continue;
>> +             case BPF_S_LDX_MEM:
>> +                     X = mem[K];
>> +                     continue;
>> +             case BPF_S_MISC_TAX:
>> +                     X = A;
>> +                     continue;
>> +             case BPF_S_MISC_TXA:
>> +                     A = X;
>> +                     continue;
>> +             case BPF_S_RET_K:
>> +                     return K;
>> +             case BPF_S_RET_A:
>> +                     return A;
>> +             case BPF_S_ST:
>> +                     mem[K] = A;
>> +                     continue;
>> +             case BPF_S_STX:
>> +                     mem[K] = X;
>> +                     continue;
>> +             case BPF_S_ANC_PROTOCOL:
>> +             case BPF_S_ANC_PKTTYPE:
>> +             case BPF_S_ANC_IFINDEX:
>> +             case BPF_S_ANC_MARK:
>> +             case BPF_S_ANC_QUEUE:
>> +             case BPF_S_ANC_HATYPE:
>> +             case BPF_S_ANC_RXHASH:
>> +             case BPF_S_ANC_CPU:
>> +             case BPF_S_ANC_NLATTR:
>> +             case BPF_S_ANC_NLATTR_NEST:
>> +                     /* ignored */
>> +                     continue;
>> +             default:
>> +                     WARN_RATELIMIT(1, "Unknown code:%u jt:%u jf:%u k:%u\n",
>> +                                    fentry->code, fentry->jt,
>> +                                    fentry->jf, fentry->k);
>> +                     return 0;
>> +             }
>> +     }
>> +
>> +     return 0;
>> +}
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 481611f..77f2eda 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>               case PR_SET_SECCOMP:
>>                       error = prctl_set_seccomp(arg2);
>>                       break;
>> +             case PR_ATTACH_SECCOMP_FILTER:
>> +                     error = prctl_attach_seccomp_filter((char __user *)
>> +                                                             arg2);
>> +                     break;
>>               case PR_GET_TSC:
>>                       error = GET_TSC_CTL(arg2);
>>                       break;
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 51bd5a0..77b1106 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
>>
>>         If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> +     bool "Enable seccomp-based system call filtering"
>> +     select SECCOMP
>> +     depends on EXPERIMENTAL
>> +     help
>> +       This kernel feature expands CONFIG_SECCOMP to allow computing
>> +       in environments with reduced kernel access dictated by a system
>> +       call filter, expressed in BPF, installed by the application itself
>> +       through prctl(2).
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>>  config SECURITY
>>       bool "Enable different security models"
>>       depends on SYSFS
>> --
>> 1.7.5.4
>>

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 14:50   ` Oleg Nesterov
@ 2012-01-12 16:55     ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 16:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, Jan 12, 2012 at 8:50 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/11, Will Drewry wrote:
>>
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>
> Cool ;)
>
> I didn't really read this patch yet, just one nit.
>
>> +#define seccomp_filter_init_task(_tsk) do { \
>> +     (_tsk)->seccomp.filter = NULL; \
>> +} while (0);
>
> Cosmetic and subjective, but imho it would be better to add inline
> functions instead of define's.

Refactoring it a bit to make that possible.  Since seccomp fork/init/free
never needs access to the whole task_structs, I'll just pass in what's
needed (and avoid the sched.h inclusion recursion).
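
A minimal sketch of that refactoring, assuming a stand-in struct seccomp
that carries only a mode and a filter pointer.  The struct layout and the
name seccomp_filter_init() are illustrative and not taken from the next
patch revision.  Besides type checking, an inline function also avoids the
stray semicolon after "while (0)" in the macro, which would break an
unbraced if/else at the call site.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct seccomp_filter;
struct seccomp {
	int mode;
	struct seccomp_filter *filter;
};

/* Takes the seccomp state directly, so no task_struct (sched.h) is needed. */
static inline void seccomp_filter_init(struct seccomp *seccomp)
{
	assert(seccomp != NULL);
	seccomp->filter = NULL;
}
```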

Comments on the next round will most definitely be appreciated!

>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     seccomp_filter_free_task(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       /* Perform scheduler related setup. Assign this task to a CPU. */
>>       sched_fork(p);
>>
>> +     seccomp_filter_init_task(p);
>
> This doesn't look right or I missed something. seccomp_filter_init_task()
> should be called right after dup_task_struct(), at least before copy_process()
> can fail.
>
> Otherwise copy_process()->free_task()->seccomp_filter_free_task() can put
> current->seccomp.filter copied by arch_dup_task_struct().

Ah - makes sense!  I moved it under dup_task_struct before any goto's
to bad_fork_free.
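
The ordering problem can be modelled in userspace.  This is a sketch
under stated assumptions, not kernel code: struct filter, struct task,
task_fork() and task_free() are hypothetical stand-ins for the filter
refcount, task_struct, copy_process() and free_task().  The child starts
as a byte copy of the parent, so its filter pointer has to be cleared
before any failure path can drop a reference it never took.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a refcounted filter and a task holding one reference. */
struct filter { int refs; };
struct task { struct filter *filter; };

static void filter_put(struct filter *f)
{
	if (f) {
		assert(f->refs > 0);
		f->refs--;
	}
}

/* Models free_task(): drops whatever reference the task holds. */
static void task_free(struct task *t)
{
	filter_put(t->filter);
	t->filter = NULL;
}

/* Models copy_process(); returns 0 on success, -1 on a simulated failure. */
static int task_fork(const struct task *parent, struct task *child, int fail_early)
{
	*child = *parent;		/* like dup_task_struct(): pointer copied, no ref taken */
	child->filter = NULL;		/* init immediately after the copy */

	if (fail_early) {
		task_free(child);	/* sees NULL, parent's count is untouched */
		return -1;
	}

	child->filter = parent->filter;	/* the reference is taken only here */
	if (child->filter)
		child->filter->refs++;
	return 0;
}
```

If the NULL assignment were moved below the failure branch, task_free()
would drop a reference that belongs to the parent.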

>> +struct seccomp_filter {
>> +     struct kref usage;
>> +     struct pid *creator;
>
> Why? seccomp_filter->creator is never used, no?

Removing it. It is from a related patch I'm experimenting with (adding
optional tracehook support), but it has no bearing here.

Thanks - new patch revision incoming!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
  2012-01-12 16:14     ` Oleg Nesterov
  2012-01-12 16:14     ` Andrew Lutomirski
@ 2012-01-12 16:59     ` Will Drewry
  2012-01-12 17:22       ` Jamie Lokier
  2012-01-12 17:36     ` Jamie Lokier
  3 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 16:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.

> Maybe I don't understand this correctly.

You're right on.  This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below.  (Goes back to catching up.)

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-11 17:25 ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
                     ` (4 preceding siblings ...)
  2012-01-12 16:22   ` Oleg Nesterov
@ 2012-01-12 17:02   ` Andrew Lutomirski
  2012-01-16 20:28     ` Will Drewry
  5 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 17:02 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wed, Jan 11, 2012 at 9:25 AM, Will Drewry <wad@chromium.org> wrote:
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).
>

There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c.  How should time(), getcpu(), and
gettimeofday() be handled?  If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.

Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work.  I
don't know how to change that efficiently.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:18   ` Alan Cox
@ 2012-01-12 17:03     ` Will Drewry
  2012-01-12 17:11       ` Alan Cox
  2012-01-13  1:31     ` James Morris
  1 sibling, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:18 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> I think this model is wrong. The rest of the policy rules all work on the
> basis that dumpable is the decider (the same rules for not dumping, not
> tracing, etc). A user should be able to apply filter to their own code
> arbitrarily. Any setuid app should IMHO lose the trace subject to the usual
> uid rules and capability rules. That would seem to be more flexible and
> also the path of least surprise.

My line of thinking up to now has been that disallowing setuid exec
would mean there is no risk of an errant setuid binary allowing escape
from the system call filters (which the containers people may care
more about).  Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.

Would it be sane to just disallow setuid exec exclusively?

> [plus you can implement non setuid exec entirely in userspace so it's
> a rather meaningless distinction you propose]

Agreed.

>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>
> A BPF jit ought to be trivial and would be a big win.
>
> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.
>

Thanks!  Yeah I think merging with the network stack is eminently
doable, but I didn't want to bog down the proposal in how much
overhead I might be adding to the network layer.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:47         ` Oleg Nesterov
@ 2012-01-12 17:08           ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/12, Steven Rostedt wrote:
>>
>> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>>
>> > Maybe this needs something like LSM_UNSAFE_SECCOMP, or perhaps
>> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
>> >
>> > OTOH, currently seccomp.mode == 1 doesn't allow exec at all.
>>
>> I've never used seccomp, so I admit I'm totally ignorant on this topic.
>
> me too ;)
>
>> But looking at seccomp from the outside, the biggest advantage to this
>> would be the ability for normal processes to be able to limit tasks it
>> kicks off. If I want to run a task in a sandbox, I don't want to be root
>> to do so.
>>
>> I guess a web browser doesn't perform an exec to run java programs. But
>> it would be nice if I could execute something from the command line that
>> I could run in a sand box.
>>
>> What's the problem with making sure that the setuid isn't set before
>> doing an execv? Only fail when setuid (or some other magic) is enabled
>> on the file being exec'd.
>
> I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
> Just I do not know what would be the most simple/clean way to do this.
>
>
> And in any case I agree that the current seccomp_check_exec() looks
> strange. Btw, it does
> {
>        if (current->seccomp.mode != 2)
>                return 0;
>        /* We can rely on the task refcount for the filter. */
>        if (!current->seccomp.filter)
>                return -EPERM;
>
> How is it possible to have seccomp.filter == NULL with mode == 2?

It shouldn't be. It's another relic I missed from development. (Adding to v3 :)

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:27       ` Steven Rostedt
  2012-01-12 16:51         ` Andrew Lutomirski
@ 2012-01-12 17:09         ` Linus Torvalds
  2012-01-12 17:17           ` Steven Rostedt
  2012-01-12 18:18           ` Andrew Lutomirski
  1 sibling, 2 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-12 17:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.

The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.

For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.

And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.

We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.

There are probably other things like that - things like allowing users
to do bind mounts etc - that aren't dangerous in themselves, but that
are dangerous mainly because they can be used to fool things into
privilege escalations. So this is definitely not a filter-only issue.

                       Linus
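
This message does not define an interface for such a flag.  As a hedged
illustration only, the sketch below uses the prctl(2) option that later
kernels (3.5 and newer) provide for this purpose, PR_SET_NO_NEW_PRIVS.
The wrapper names are hypothetical, and the fallback defines cover older
userspace headers.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Values from linux/prctl.h, for userspace headers that predate them. */
#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef PR_GET_NO_NEW_PRIVS
#define PR_GET_NO_NEW_PRIVS 39
#endif

/* One-way switch: returns 0 on success, -1 with errno set on failure. */
static int lock_privileges(void)
{
	int ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

	/* The flag cannot be cleared again, so success implies it reads back as 1. */
	assert(ret != 0 || prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1);
	return ret;
}

/* Returns 1 if the flag is set, 0 if not, -1 on error (e.g. an old kernel). */
static int privileges_locked(void)
{
	return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Once set, the flag is inherited across fork and execve, and an execve of
a suid/sgid binary no longer grants the extra privileges.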

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:22   ` Oleg Nesterov
@ 2012-01-12 17:10     ` Will Drewry
  2012-01-12 17:23       ` Oleg Nesterov
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/11, Will Drewry wrote:
>>
>> +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
>> +{
>> +     /* regset is usually returned based on task personality, not current
>> +      * system call convention.  This behavior makes it unsafe to execute
>> +      * BPF programs over regviews if is_compat_task or the personality
>> +      * have changed since the program was installed.
>> +      */
>> +     const struct user_regset_view *view = task_user_regset_view(current);
>> +     const struct user_regset *regset = &view->regsets[0];
>> +     size_t scratch_size = *available;
>> +     if (regset->core_note_type != NT_PRSTATUS) {
>> +             /* The architecture should override this method for speed. */
>> +             regset = find_prstatus(view);
>> +             if (!regset)
>> +                     return NULL;
>> +     }
>> +     *available = regset->n * regset->size;
>> +     /* Make sure the scratch space isn't exceeded. */
>> +     if (*available > scratch_size)
>> +             *available = scratch_size;
>> +     if (regset->get(current, regset, 0, *available, scratch, NULL))
>> +             return NULL;
>> +     return scratch;
>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     struct seccomp_filter *filter;
>> +     u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
>> +     size_t regs_size = sizeof(struct user_regs_struct);
>> +     int ret = -EACCES;
>> +
>> +     filter = current->seccomp.filter; /* uses task ref */
>> +     if (!filter)
>> +             goto out;
>> +
>> +     /* All filters in the list are required to share the same system call
>> +      * convention so only the first filter is ever checked.
>> +      */
>> +     if (seccomp_check_personality(filter))
>> +             goto out;
>> +
>> +     /* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
>> +      * that is not mandatory.  E.g., it may return a point to
>> +      * task_pt_regs(current).  NULL checking is mandatory.
>> +      */
>> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>
> Stupid question. I am sure you know what you are doing ;) and I know
> nothing about !x86 arches.
>
> But could you explain why it is designed to use user_regs_struct?
> Why can't we simply use task_pt_regs() and avoid the (costly) regsets?

So on x86-32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true.  I don't think it's
kosher to expose pt_regs to userspace, but if, let's say, x86-32
overrides the weak linkage, then it could just return task_pt_regs and
be the fastest path.

If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:03     ` Will Drewry
@ 2012-01-12 17:11       ` Alan Cox
  2012-01-12 17:52         ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Alan Cox @ 2012-01-12 17:11 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

> more about).  Since setuid is privilege escalation, then perhaps it
> makes sense to allow it as an escape hatch.
> 
> Would it be sane to just disallow setuid exec exclusively?

I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?

Alan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:09         ` Linus Torvalds
@ 2012-01-12 17:17           ` Steven Rostedt
  2012-01-12 18:18           ` Andrew Lutomirski
  1 sibling, 0 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 09:09 -0800, Linus Torvalds wrote:

> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.

After I wrote this comment I thought the same thing. It would be nice to
have a way to just set a flag to a process that will prevent it from
doing any escalating of privileges.

I totally agree, this would solve a whole host of issues with regard to
security issues in things that shouldn't be a problem but currently are.

-- Steve
^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:59     ` Will Drewry
@ 2012-01-12 17:22       ` Jamie Lokier
  2012-01-12 17:35         ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:22 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Will Drewry wrote:
> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> >
> >> Filter programs may _only_ cross the execve(2) barrier if last filter
> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> >> user namespace.  Once a task-local filter program is attached from a
> >> process without privileges, execve will fail.  This ensures that only
> >> privileged parent task can affect its privileged children (e.g., setuid
> >> binary).
> >
> > This means that a non privileged user can not run another program with
> > limited features? How would a process exec another program and filter
> > it? I would assume that the filter would need to be attached first and
> > then the execv() would be performed. But after the filter is attached,
> > the execv is prevented?
> 
> Yeah - it means tasks can filter themselves, but not each other.
> However, you can inject a filter for any dynamically linked executable
> using LD_PRELOAD.
> 
> > Maybe I don't understand this correctly.
> 
> You're right on.  This was to ensure that one process didn't cause
> crazy behavior in another. I think Alan has a better proposal than
> mine below.  (Goes back to catching up.)

You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.

ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.

Speeding up ptrace() with BPF filters would be really nice.  Not
that I like ptrace(), but sometimes it's the only thing you can rely on.

LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.

There's already a security model around who can use ptrace(); speeding
it up needn't break that.

If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants.  Google's NaCl comes to mind as a potential user.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:10     ` Will Drewry
@ 2012-01-12 17:23       ` Oleg Nesterov
  2012-01-12 17:51         ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-12 17:23 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> +      */
> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
> >
> > Stupid question. I am sure you know what you are doing ;) and I know
> > nothing about !x86 arches.
> >
> > But could you explain why it is designed to use user_regs_struct?
> > Why can't we simply use task_pt_regs() and avoid the (costly) regsets?
>
> So on x86-32, it would work since user_regs_struct == task_pt_regs
> (iirc), but on x86-64 and others, that's not true.

Yes sure, I meant that userspace should use pt_regs too.

> If it would be appropriate to expose pt_regs to userspace, then I'd
> happily do so :)

Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.

Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-12 18:16         ` Randy Dunlap
@ 2012-01-12 17:23           ` Will Drewry
  2012-01-12 17:34             ` Steven Rostedt
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:23 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet

On Thu, Jan 12, 2012 at 12:16 PM, Randy Dunlap <rdunlap@xenotime.net> wrote:
> On 01/11/2012 03:19 PM, Will Drewry wrote:
>> Document how system call filtering with BPF works and
>> may be used.  Includes an example for x86 (32-bit).
>
> Please tell some of us what "BPF" means.  wikipedia lists 15 possible
> choices, but I don't know which one to choose.

I'll make it clearer in the documentation file and update the patch description.

BPF == Berkeley Packet Filters, which are implemented in Linux Socket
Filters (LSF).

thanks!
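
To make "BPF program evaluated over a buffer" concrete, here is a toy
sketch, not the kernel's interpreter and not code from the patch.  It
uses struct sock_filter and the BPF_STMT/BPF_JUMP macros from
<linux/filter.h>, and evaluates only three classic BPF instructions.
The function bpf_run(), the array allow_only_39, the value 39 and the
native-endian load are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>

/*
 * Evaluates a program made of: load word at absolute offset, jump if
 * equal to constant, return constant.  Returns the program's return
 * value, or 0 on an unknown opcode, an out-of-bounds load, or when the
 * program runs off its end.
 */
static uint32_t bpf_run(const struct sock_filter *prog, size_t len,
			const uint8_t *data, size_t data_len)
{
	uint32_t acc = 0;
	size_t pc = 0;

	assert(prog != NULL && data != NULL);

	while (pc < len) {
		const struct sock_filter *insn = &prog[pc++];

		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (data_len < sizeof(acc) || insn->k > data_len - sizeof(acc))
				return 0;
			memcpy(&acc, data + insn->k, sizeof(acc));
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (acc == insn->k) ? insn->jt : insn->jf;
			break;
		case BPF_RET | BPF_K:
			return insn->k;
		default:
			return 0;
		}
	}
	return 0;
}

/* Example policy: return 1 only if the 32-bit word at offset 0 equals 39. */
static const struct sock_filter allow_only_39[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 39, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 1),
	BPF_STMT(BPF_RET | BPF_K, 0),
};
```

In the proposed seccomp mode the buffer would be the task's register
state rather than a packet, and the offset would select the register
holding the system call number.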

>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
>>  samples/Makefile                       |    2 +-
>>  samples/seccomp/Makefile               |   12 ++++
>>  samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
>>  4 files changed, 186 insertions(+), 1 deletions(-)
>>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>>  create mode 100644 samples/seccomp/Makefile
>>  create mode 100644 samples/seccomp/bpf-example.c
>
>
> --
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
  2012-01-12 13:13   ` [RFC,PATCH " Łukasz Sowa
@ 2012-01-12 17:25     ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:25 UTC (permalink / raw)
  To: Łukasz Sowa
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 7:13 AM, Łukasz Sowa <luksow@gmail.com> wrote:
> Hi Will,
>
> That's a very different approach to the system call interposition problem.
> I find your solution very interesting. It gives far more capabilities
> than my syscalls cgroup that you commented on some time ago. It's ready
> now but I haven't tried filtering yet. I think that if your solution
> makes it to the mainline (and I guess that's really possible at the current
> stage :)), there will be no place for my solution, but that's ok.

Yeah - there've been so many tries, I'll be happy when one makes it in
which is usable :)

> There's one thing that I'm curious about - have you measured overhead in
> any way? That was one of the biggest issues in all previous attempts to
> limit syscalls. I'd love to compare the numbers with my solution.

Certainly. I have some rough numbers, but nothing I'd call strong
measurements.  There is still a fair amount of cost due to the syscall
slow path.

> I'll examine your patch later on and put some comments if I bump into
> something.

Much appreciated - cheers!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:38       ` Steven Rostedt
  2012-01-12 16:47         ` Oleg Nesterov
@ 2012-01-12 17:30         ` Jamie Lokier
  2012-01-12 17:40           ` Steven Rostedt
  1 sibling, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
> 
> > Maybe this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> > 
> > OTOH, currently seccomp.mode == 1 doesn't allow exec at all.
> 
> I've never used seccomp, so I admit I'm totally ignorant on this topic.
> 
> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
> 
> I guess a web browser doesn't perform an exec to run java programs.

Actually it does.  Firefox on Linux forks and execs the Java VM.
Same for Flash, using "plugin-container".

> But it would be nice if I could execute something from the command
> line that I could run in a sand box.

You can do this now, using ptrace().  It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address.  (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-12 17:23           ` Will Drewry
@ 2012-01-12 17:34             ` Steven Rostedt
  0 siblings, 0 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:34 UTC (permalink / raw)
  To: Will Drewry
  Cc: Randy Dunlap, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet

On Thu, 2012-01-12 at 11:23 -0600, Will Drewry wrote:

> > Please tell some of us what "BPF" means.  wikipedia lists 15 possible
> > choices, but I don't know which one to choose.
> 
> I'll make it clearer in the documentation file and update the patch description.
> 
> BPF == Berkeley Packet Filters, which are implemented in Linux Socket
> Filters (LSF).
> 

I admit, I was totally clueless in what it meant too ;)

Even the LWN article didn't explain (shame on you Jon).

"he has repurposed the networking layer's packet filtering mechanism
(BPF)"

I didn't know what the "B" stood for.

-- Steve



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:22       ` Jamie Lokier
@ 2012-01-12 17:35         ` Will Drewry
  2012-01-12 17:57           ` Jamie Lokier
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:35 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >
>> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> user namespace.  Once a task-local filter program is attached from a
>> >> process without privileges, execve will fail.  This ensures that only
>> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> binary).
>> >
>> > This means that a non privileged user can not run another program with
>> > limited features? How would a process exec another program and filter
>> > it? I would assume that the filter would need to be attached first and
>> > then the execv() would be performed. But after the filter is attached,
>> > the execv is prevented?
>>
>> Yeah - it means tasks can filter themselves, but not each other.
>> However, you can inject a filter for any dynamically linked executable
>> using LD_PRELOAD.
>>
>> > Maybe I don't understand this correctly.
>>
>> You're right on.  This was to ensure that one process didn't cause
>> crazy behavior in another. I think Alan has a better proposal than
>> mine below.  (Goes back to catching up.)
>
> You can already use ptrace() to cause crazy behaviour in another
> process, including modifying registers arbitrarily at syscall entry
> and exit, aborting and emulating syscalls.
>
> ptrace() is quite slow and it would be really nice to speed it up,
> especially for trapping a small subset of syscalls, or limiting some
> kinds of access to some file descriptors, while everything else runs
> at normal speed.
>
> Speeding up ptrace() with BPF filters would be really nice.  Not
> that I like ptrace(), but sometimes it's the only thing you can rely on.
>
> LD_PRELOAD and code running in the target process address space can't
> always be trusted in some contexts (e.g. the target process may modify
> the tracing code or its data); whereas ptrace() is pretty complete and
> reliable, if ugly.
>
> There's already a security model around who can use ptrace(); speeding
> it up needn't break that.
>
> If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> needed as userspace could have done it, with exactly the restrictions
> it wants.  Google's NaCl comes to mind as a potential user.

That's not entirely true.  ptrace supervisors are subject to races and
always fail open.  This makes them effective but not as robust as a
seccomp solution can provide.

With seccomp, it fails closed.  What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.

Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.
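
As an illustration of a fail-closed filter with no supervisor, here is a
minimal sketch. It assumes the interface that later landed in mainline
(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) over struct seccomp_data),
not the PR_ATTACH_SECCOMP_FILTER call proposed in this series. The names
sketch_filter, sketch_install and SKETCH_ARCH are made up for the example.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__i386__)
#define SKETCH_ARCH AUDIT_ARCH_I386
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the audit arch constant for this architecture"
#endif

/* Allow read, write and exit_group; kill the task on anything else. */
struct sock_filter sketch_filter[] = {
    /* 0: load the architecture and kill on a mismatch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 3: load the syscall number and compare against the allow-list */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
    /* 7: default action, reached when nothing matched */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 8: target of the three allow-list jumps */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Attach the filter to the calling task. Returns 0 on success, -1 on error. */
int sketch_install(void)
{
    struct sock_fprog prog = {
        .len = sizeof(sketch_filter) / sizeof(sketch_filter[0]),
        .filter = sketch_filter,
    };

    /* Required for unprivileged callers in the merged interface. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Once sketch_install() returns 0 the kernel enforces the policy by itself;
nothing has to stay around to watch the task.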

Does that make sense?
thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 15:43   ` Steven Rostedt
                       ` (2 preceding siblings ...)
  2012-01-12 16:59     ` Will Drewry
@ 2012-01-12 17:36     ` Jamie Lokier
  3 siblings, 0 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Steven Rostedt wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> 
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace.  Once a task-local filter program is attached from a
> > process without privileges, execve will fail.  This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
> 
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Ugly method: Using ptrace(), trap after the execve() and issue fake
syscalls to install the filter.  I feel dirty thinking it, in a good way.

LD_PRELOAD has been suggested.  It's not 100% reliable because not all
executables are dynamic (on some uClinux platforms none of them are),
but it will usually work.
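
A minimal sketch of the LD_PRELOAD route, assuming a dynamically linked
target. The names sketch_preload_status and sketch_preload_init are
hypothetical, and the prctl() call is only a stand-in for whatever call
would attach the filter.

```c
#include <sys/prctl.h>

/* 0 = constructor has not run, 1 = setup succeeded, -1 = setup failed. */
int sketch_preload_status;

/* The dynamic linker runs constructors of preloaded libraries before
 * the target's main(), so the setup happens inside the target process. */
__attribute__((constructor))
static void sketch_preload_init(void)
{
    /* Stand-in for the filter attach call; a real library would
     * install its syscall filter here. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0)
        sketch_preload_status = 1;
    else
        sketch_preload_status = -1;
}
```

Built with "gcc -shared -fPIC -o libsketch.so sketch.c" and run as
"LD_PRELOAD=./libsketch.so ./target", the constructor runs first. A
statically linked target never loads the library, which is the gap noted
above.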

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:30         ` Jamie Lokier
@ 2012-01-12 17:40           ` Steven Rostedt
  2012-01-12 17:44             ` Jamie Lokier
  2012-01-12 22:18             ` Will Drewry
  0 siblings, 2 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:

> You can do this now, using ptrace().  It's horrible, but half of the
> horribleness is needing to understand machine-dependent registers,
> which this new patch doesn't address.  (The other half is a ton of
> undocumented but important ptrace() behaviours on Linux.)

Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p

I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say it's OK, and then switching the symlink to a setuid file.

When the new execv is executed, the parent process would lose all control
over it. The idea is to prevent this.

I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.

-- Steve



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:40           ` Steven Rostedt
@ 2012-01-12 17:44             ` Jamie Lokier
  2012-01-12 17:56               ` Steven Rostedt
  2012-01-12 22:18             ` Will Drewry
  1 sibling, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
> 
> > You can do this now, using ptrace().  It's horrible, but half of the
> > horribleness is needing to understand machine-dependent registers,
> > which this new patch doesn't address.  (The other half is a ton of
> > undocumented but important ptrace() behaviours on Linux.)
> 
> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p

That warm fuzzy feeling :-)

> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say it's OK, and then switching the symlink to a setuid file.
>
> When the new execv is executed, the parent process would lose all control
> over it. The idea is to prevent this.

fexecve() exists to solve the problem.
Also known as execve("/proc/self/fd/...") on Linux.
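
A minimal sketch of that pattern: open the file once, check the open
descriptor, then exec the same descriptor, so swapping a symlink after
the check has no effect. The helper names sketch_fd_is_setid and
sketch_exec_checked are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the open file is setuid or setgid, 0 if not, -1 on error. */
int sketch_fd_is_setid(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (st.st_mode & (S_ISUID | S_ISGID)) ? 1 : 0;
}

/* Check and exec the same open file. Does not return on success;
 * returns -1 with errno set otherwise. The descriptor pins the inode,
 * although the owner could still change the file's mode in between.
 * O_CLOEXEC is fine for ELF binaries but breaks "#!" scripts. */
int sketch_exec_checked(const char *path, char *const argv[],
                        char *const envp[])
{
    int saved;
    int setid;
    int fd = open(path, O_RDONLY | O_CLOEXEC);

    if (fd < 0)
        return -1;
    setid = sketch_fd_is_setid(fd);
    if (setid == 0)
        fexecve(fd, argv, envp);   /* only returns on failure */
    else if (setid > 0)
        errno = EPERM;             /* refuse setuid/setgid files */
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```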

> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

I like the idea of letting the tracer decide what it wants.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:23       ` Oleg Nesterov
@ 2012-01-12 17:51         ` Will Drewry
  2012-01-13 17:31           ` Oleg Nesterov
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/12, Will Drewry wrote:
>>
>> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >> +      */
>> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>> >
>> > Stupid question. I am sure you know what you are doing ;) and I know
>> > nothing about !x86 arches.
>> >
>> > But could you explain why it is designed to use user_regs_struct ?
>> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>
>> So on x86 32, it would work since user_regs_struct == task_pt_regs
>> (iirc), but on x86-64 and others, that's not true.
>
> Yes sure, I meant that userspace should use pt_regs too.
>
>> If it would be appropriate to expose pt_regs to userspace, then I'd
>> happily do so :)
>
> Ah, so that was the reason. But it is already exported? At least I see
> the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>
> Once again, I am not arguing, just trying to understand. And I do not
> know if this definition is part of abi.

I don't either :/  My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.

I'd love for pt_regs to be fair game to cut down on the copying!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:11       ` Alan Cox
@ 2012-01-12 17:52         ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 17:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:11 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> more about).  Since setuid is privilege escalation, then perhaps it
>> makes sense to allow it as an escape hatch.
>>
>> Would it be sane to just disallow setuid exec exclusively?
>
> I think that is a policy question. I can imagine cases where either
> behaviour is the "right" one so it may need to be a parameter ?

Makes sense. I'll make it flaggable (ignoring the parallel conversation
about having a thread-wide suidable bit).

thanks!

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:44             ` Jamie Lokier
@ 2012-01-12 17:56               ` Steven Rostedt
  2012-01-12 23:27                 ` Alan Cox
  0 siblings, 1 reply; 222+ messages in thread
From: Steven Rostedt @ 2012-01-12 17:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 17:44 +0000, Jamie Lokier wrote:

> > I like Alan's suggestion. Have userspace decide to allow execv or not,
> > and even let it decide if it should allow setuid execv's or not, but
> > still allow non-setuid execvs. If you allow the setuid execv, once that
> > happens, the same behavior will occur as with ptrace. A setuid execv
> > will lose all its filtering.
> 
> I like the idea of letting the tracer decide what it wants.

Right, and if we implement the suggestion that Linus made, to set a flag
to prevent a task from ever getting privilege, then seccomp can add
that too.

That is, there can be a filter to say "prevent this task from doing
anything with privilege" and that will prevent execv from gaining setuid
privilege. Perhaps, it would still do the execv, but the program that is
executed will run as the normal user, and just fail when it tries to do
something that requires sys admin privilege.

Thus, execv will not be a "special" case here. Seccomp either allows it
or not. But also add a command to tell seccomp that this task will not
be allowed to do anything privileged.

-- Steve



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:35         ` Will Drewry
@ 2012-01-12 17:57           ` Jamie Lokier
  2012-01-12 18:03             ` Will Drewry
  2012-01-13  6:33             ` Chris Evans
  0 siblings, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-12 17:57 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Will Drewry wrote:
> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
> > Will Drewry wrote:
> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> >> >
> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> >> >> user namespace.  Once a task-local filter program is attached from a
> >> >> process without privileges, execve will fail.  This ensures that only
> >> >> privileged parent task can affect its privileged children (e.g., setuid
> >> >> binary).
> >> >
> >> > This means that a non privileged user can not run another program with
> >> > limited features? How would a process exec another program and filter
> >> > it? I would assume that the filter would need to be attached first and
> >> > then the execv() would be performed. But after the filter is attached,
> >> > the execv is prevented?
> >>
> >> Yeah - it means tasks can filter themselves, but not each other.
> >> However, you can inject a filter for any dynamically linked executable
> >> using LD_PRELOAD.
> >>
> >> > Maybe I don't understand this correctly.
> >>
> >> You're right on.  This was to ensure that one process didn't cause
> >> crazy behavior in another. I think Alan has a better proposal than
> >> mine below.  (Goes back to catching up.)
> >
> > You can already use ptrace() to cause crazy behaviour in another
> > process, including modifying registers arbitrarily at syscall entry
> > and exit, aborting and emulating syscalls.
> >
> > ptrace() is quite slow and it would be really nice to speed it up,
> > especially for trapping a small subset of syscalls, or limiting some
> > kinds of access to some file descriptors, while everything else runs
> > at normal speed.
> >
> > Speeding up ptrace() with BPF filters would be really nice.  Not
> > that I like ptrace(), but sometimes it's the only thing you can rely on.
> >
> > LD_PRELOAD and code running in the target process address space can't
> > always be trusted in some contexts (e.g. the target process may modify
> > the tracing code or its data); whereas ptrace() is pretty complete and
> > reliable, if ugly.
> >
> > There's already a security model around who can use ptrace(); speeding
> > it up needn't break that.
> >
> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> > needed as userspace could have done it, with exactly the restrictions
> > it wants.  Google's NaCl comes to mind as a potential user.
> 
> That's not entirely true.  ptrace supervisors are subject to races and
> always fail open.  This makes them effective but not as robust as a
> seccomp solution can provide.

What races do you know about?

I'm not aware of any ptrace races if it's used properly.  I'm also not
sure what you mean by fail open/closed here, unless you mean the target
process gets to carry on if the tracing process dies.

Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.

> With seccomp, it fails closed.  What I think would make sense would be
> to add a user-controllable failure mode with seccomp bpf that calls
> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
> works quite well, but I didn't want to conflate the discussions.

I think it's a nice idea.  While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)

(I think it's ok to call the tracehook function on all archs though.)

> Using ptrace() would also mean that all consumers of this interface
> would need a supervisor, but with seccomp, the filters are installed
> and require no supervisors to stick around for when failure occurs.
> 
> Does that make sense?

It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.

On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.

For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access.  I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well.  And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:57           ` Jamie Lokier
@ 2012-01-12 18:03             ` Will Drewry
  2012-01-13  1:34               ` Jamie Lokier
  2012-01-13  6:33             ` Chris Evans
  1 sibling, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 18:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:57 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace.  Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail.  This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on.  This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below.  (Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be really nice.  Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.
>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants.  Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true.  ptrace supervisors are subject to races and
>> always fail open.  This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?

I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.

> I'm not aware of any ptrace races if it's used properly.  I'm also not
> sure what you mean by fail open/closed here, unless you mean the target
> process gets to carry on if the tracing process dies.

Exactly.  Security systems that, on failure, allow the action to
proceed can't be relied on.

> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.

Not a problem - BPF only allows register inspection. No TOCTOU attacks
need apply :D
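
To make the register-only point concrete, here is a minimal sketch
written against the interface that was merged later (struct seccomp_data),
which is an assumption relative to this RFC. The filter sees each
argument as a plain 64-bit value and has no way to follow a pointer, so
it can compare a file descriptor but cannot look at a path string. The
names sketch_write_filter, SKETCH_ARG_LO and SKETCH_ARG_HI are made up,
little-endian layout is assumed, and the architecture check that a real
filter needs first is omitted for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Little-endian assumption: the low 32 bits of each argument come first. */
#define SKETCH_ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define SKETCH_ARG_HI(n) (SKETCH_ARG_LO(n) + 4)

/* Let write() through only when its first argument (the fd) equals 1;
 * every other syscall is allowed unchanged. */
struct sock_filter sketch_write_filter[] = {
    /* 0: syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 1: not write() -> jump to ALLOW at index 7 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 5),
    /* 2-3: high half of args[0] must be zero, else ERRNO at index 6 */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKETCH_ARG_HI(0)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 2),
    /* 4-5: low half of args[0] must be 1, else ERRNO at index 6 */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKETCH_ARG_LO(0)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 1, 0),
    /* 6: fail the call with EBADF instead of running it */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EBADF & SECCOMP_RET_DATA)),
    /* 7: allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

Because the values are copied from the registers before the filter runs,
another thread cannot change them between the check and the syscall.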

>> With seccomp, it fails closed.  What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> I think it's a nice idea.  While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access.  I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well.  And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.

Cool - if the rest of this discussion proceeds, then hopefully, we can
move towards discussing if tying it with ptrace is a good idea or a
horrible one :)

thanks!

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [PATCH v2 2/2] Documentation: prctl/seccomp_filter
  2012-01-11 23:19       ` [PATCH v2 " Will Drewry
  2012-01-12  0:29         ` Will Drewry
@ 2012-01-12 18:16         ` Randy Dunlap
  2012-01-12 17:23           ` Will Drewry
  1 sibling, 1 reply; 222+ messages in thread
From: Randy Dunlap @ 2012-01-12 18:16 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet

On 01/11/2012 03:19 PM, Will Drewry wrote:
> Document how system call filtering with BPF works and
> may be used.  Includes an example for x86 (32-bit).

Please tell some of us what "BPF" means.  wikipedia lists 15 possible
choices, but I don't know which one to choose.

> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  Documentation/prctl/seccomp_filter.txt |   99 ++++++++++++++++++++++++++++++++
>  samples/Makefile                       |    2 +-
>  samples/seccomp/Makefile               |   12 ++++
>  samples/seccomp/bpf-example.c          |   74 ++++++++++++++++++++++++
>  4 files changed, 186 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>  create mode 100644 samples/seccomp/Makefile
>  create mode 100644 samples/seccomp/bpf-example.c


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:09         ` Linus Torvalds
  2012-01-12 17:17           ` Steven Rostedt
@ 2012-01-12 18:18           ` Andrew Lutomirski
  2012-01-12 18:32             ` Linus Torvalds
  1 sibling, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 9:09 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> In that case, just have execv fail if filtering is enabled and we are
>> execing a setuid program. But I don't see why non "magical" execv's
>> should be prohibited.
>
> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.
>
> For example, somebody just asked me the other day why "chroot()"
> requires admin privileges, since it would be good to limit even
> non-root things.
>
> And it's really the exact same issue as filtering: in some sense,
> chroot() "filters" FS name lookups, and can be used to fool programs
> that are written to be secure.
>
> We could easily introduce a per-process flag that just says "cannot
> escalate privileges". Which basically just disables execve() of
> suid/sgid programs (and possibly other things too), and locks the
> process to the current privileges. And then make the rule be that *if*
> that flag is set, you can then filter across an execve, or chroot as a
> normal user, or whatever.

Like this?

http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

(This depends on execve_nosecurity, which is controversial, but that
dependency would be trivial to remove.)

Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything.  (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:18           ` Andrew Lutomirski
@ 2012-01-12 18:32             ` Linus Torvalds
  2012-01-12 18:44               ` Andrew Lutomirski
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-12 18:32 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Steven Rostedt, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> Like this?
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.

> Note that there's a huge can of worms if execve is allowed but
> suid/sgid is not: selinux may elevate privileges on exec of pretty
> much anything.  (I think that this is a really awful idea, but it's in
> the kernel, so we're stuck with it.)

You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..

I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.

So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".

                            Linus
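
[Illustrative sketch, not part of the original message.  The
"restricted" bit described above can be pictured with a small
user-space toy model.  The names toy_task, toy_binary and toy_execve()
are hypothetical, and failing with EPERM (rather than quietly ignoring
the suid bit) is only one of the behaviours floated in this thread.]

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not real kernel structures. */
struct toy_task {
	unsigned int euid;
	bool restricted;	/* the proposed "cannot escalate privileges" bit */
};

struct toy_binary {
	unsigned int owner_uid;
	bool setuid;		/* S_ISUID is set on the file */
};

/*
 * Model of execve(): a setuid binary normally switches euid to the file
 * owner.  A restricted task is refused instead, so its privileges stay
 * locked to whatever it had when the bit was set.
 */
static int toy_execve(struct toy_task *t, const struct toy_binary *b)
{
	if (b->setuid && b->owner_uid != t->euid) {
		if (t->restricted)
			return -EPERM;
		t->euid = b->owner_uid;
	}
	return 0;
}
```

[In this model an ordinary (non-suid) exec is unaffected by the bit,
which matches the point that only "magical" execs need to be refused.]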
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:32             ` Linus Torvalds
@ 2012-01-12 18:44               ` Andrew Lutomirski
  2012-01-12 19:08                 ` Kyle Moffett
  2012-01-12 19:40                 ` Will Drewry
  0 siblings, 2 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>
>> Like this?
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>
> I don't know the execve_nosecurity patches, so the diff makes little
> sense to me, but yeah, I wouldn't expect it to be more than a couple
> of lines. Exactly *how* you set the bit etc is not something I care
> deeply about, prctl seems about as good as anything.
>
>> Note that there's a huge can of worms if execve is allowed but
>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>> much anything.  (I think that this is a really awful idea, but it's in
>> the kernel, so we're stuck with it.)
>
> You can do any amount of crazy things with selinux, but the other side
> of the coin is that it would also be trivial to teach selinux about
> this same "restricted environment" bit, and just say that a process
> with that bit set doesn't get to match whatever selinux privilege
> escalation rules..
>
> I really don't think this is just about "execve cannot do setuid". I
> think it's about the process being marked as restricted.
>
> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
> artificially limit it to be some "execve feature", and more think of
> it as a "this is a process that has *no* extra privileges at all, and
> can never get them".

Fair enough.  I'll submit the simpler patch tonight.

execve_nosecurity was my attempt to sidestep selinux issues.  It's a
different syscall that does all of the non-security-related things
that execve does but does not escalate (or even change) any
privileges.  Maybe I'll try to rework that for newer kernels as well.
The idea is that programs that expect to run in sandboxes / chroots /
namespaces / whatever can use it, and older programs that might
malfunction dangerously if the semantics of execve change will just
fail instead.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:44               ` Andrew Lutomirski
@ 2012-01-12 19:08                 ` Kyle Moffett
  2012-01-12 23:05                   ` Eric Paris
  2012-01-12 19:40                 ` Will Drewry
  1 sibling, 1 reply; 222+ messages in thread
From: Kyle Moffett @ 2012-01-12 19:08 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, jmorris, scarybeasts, avi, penberg, viro, mingo,
	akpm, khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <luto@mit.edu> wrote:
> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>> Like this?
>>>
>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>
>> I don't know the execve_nosecurity patches, so the diff makes little
>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>> of lines. Exactly *how* you set the bit etc is not something I care
>> deeply about, prctl seems about as good as anything.
>>
>>> Note that there's a huge can of worms if execve is allowed but
>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>> much anything.  (I think that this is a really awful idea, but it's in
>>> the kernel, so we're stuck with it.)
>>
>> You can do any amount of crazy things with selinux, but the other side
>> of the coin is that it would also be trivial to teach selinux about
>> this same "restricted environment" bit, and just say that a process
>> with that bit set doesn't get to match whatever selinux privilege
>> escalation rules..
>>
>> I really don't think this is just about "execve cannot do setuid". I
>> think it's about the process being marked as restricted.
>>
>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>> artificially limit it to be some "execve feature", and more think of
>> it as a "this is a process that has *no* extra privileges at all, and
>> can never get them".
>
> execve_nosecurity was my attempt to sidestep selinux issues.  It's a
> different syscall that does all of the non-security-related things
> that execve does but does not escalate (or even change) any
> privileges.  Maybe I'll try to rework that for newer kernels as well.
> The idea is that programs that expect to run in sandboxes / chroots /
> namespaces / whatever can use it, and older programs that might
> malfunction dangerously if the semantics of execve change will just
> fail instead.

I don't see any issues with SELinux support for this feature.

Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition, i.e. a security label change).

But if that fails, in many cases it may still be allowed to
"execute_no_trans" (i.e. a regular non-privileged exec() without a
transition).

If you add this feature, it should just disable the normal "execute"
with transition path and unconditionally fall back to
"execute_no_trans".

Likewise, enabling these bits should also disable the "transition" and
"dyntransition" process access vectors, and I'm on the fence about
whether "setfscreate", etc should be allowed.

Cheers,
Kyle Moffett

-- 
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:44               ` Andrew Lutomirski
  2012-01-12 19:08                 ` Kyle Moffett
@ 2012-01-12 19:40                 ` Will Drewry
  2012-01-12 19:42                   ` Will Drewry
  2012-01-12 19:46                   ` Andrew Lutomirski
  1 sibling, 2 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 19:40 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 12:44 PM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>>
>>> Like this?
>>>
>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>
>> I don't know the execve_nosecurity patches, so the diff makes little
>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>> of lines. Exactly *how* you set the bit etc is not something I care
>> deeply about, prctl seems about as good as anything.
>>
>>> Note that there's a huge can of worms if execve is allowed but
>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>> much anything.  (I think that this is a really awful idea, but it's in
>>> the kernel, so we're stuck with it.)
>>
>> You can do any amount of crazy things with selinux, but the other side
>> of the coin is that it would also be trivial to teach selinux about
>> this same "restricted environment" bit, and just say that a process
>> with that bit set doesn't get to match whatever selinux privilege
>> escalation rules..
>>
>> I really don't think this is just about "execve cannot do setuid". I
>> think it's about the process being marked as restricted.
>>
>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>> artificially limit it to be some "execve feature", and more think of
>> it as a "this is a process that has *no* extra privileges at all, and
>> can never get them".
>
> Fair enough.  I'll submit the simpler patch tonight.

This sounds cool.  Do you think you'll go for a new task_struct member
or will it be a securebit?  (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).

I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.

thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:40                 ` Will Drewry
@ 2012-01-12 19:42                   ` Will Drewry
  2012-01-12 19:46                   ` Andrew Lutomirski
  1 sibling, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-12 19:42 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 1:40 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 12:44 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>>>
>>>> Like this?
>>>>
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>>
>>> I don't know the execve_nosecurity patches, so the diff makes little
>>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>>> of lines. Exactly *how* you set the bit etc is not something I care
>>> deeply about, prctl seems about as good as anything.
>>>
>>>> Note that there's a huge can of worms if execve is allowed but
>>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>>> much anything.  (I think that this is a really awful idea, but it's in
>>>> the kernel, so we're stuck with it.)
>>>
>>> You can do any amount of crazy things with selinux, but the other side
>>> of the coin is that it would also be trivial to teach selinux about
>>> this same "restricted environment" bit, and just say that a process
>>> with that bit set doesn't get to match whatever selinux privilege
>>> escalation rules..
>>>
>>> I really don't think this is just about "execve cannot do setuid". I
>>> think it's about the process being marked as restricted.
>>>
>>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>>> artificially limit it to be some "execve feature", and more think of
>>> it as a "this is a process that has *no* extra privileges at all, and
>>> can never get them".
>>
>> Fair enough.  I'll submit the simpler patch tonight.
>
> This sounds cool.  Do you think you'll go for a new task_struct member
> or will it be a securebit?  (Seems like securebits might be too tied to
> posix file caps, but I figured I'd ask).

Or cred member, etc.

> I'm planning on going ahead and mocking up your potential patch so I
> can respin this series using it and make sure I understand the
> interactions.
>
> thanks!
> will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:40                 ` Will Drewry
  2012-01-12 19:42                   ` Will Drewry
@ 2012-01-12 19:46                   ` Andrew Lutomirski
  2012-01-12 20:00                     ` Linus Torvalds
  1 sibling, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 19:46 UTC (permalink / raw)
  To: Will Drewry
  Cc: Linus Torvalds, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:40 AM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 12:44 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>>>>
>>>> Like this?
>>>>
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
>>>
>>> I don't know the execve_nosecurity patches, so the diff makes little
>>> sense to me, but yeah, I wouldn't expect it to be more than a couple
>>> of lines. Exactly *how* you set the bit etc is not something I care
>>> deeply about, prctl seems about as good as anything.
>>>
>>>> Note that there's a huge can of worms if execve is allowed but
>>>> suid/sgid is not: selinux may elevate privileges on exec of pretty
>>>> much anything.  (I think that this is a really awful idea, but it's in
>>>> the kernel, so we're stuck with it.)
>>>
>>> You can do any amount of crazy things with selinux, but the other side
>>> of the coin is that it would also be trivial to teach selinux about
>>> this same "restricted environment" bit, and just say that a process
>>> with that bit set doesn't get to match whatever selinux privilege
>>> escalation rules..
>>>
>>> I really don't think this is just about "execve cannot do setuid". I
>>> think it's about the process being marked as restricted.
>>>
>>> So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
>>> should simply be "PR_RESTRICT_ME", and be done with it, and not try to
>>> artificially limit it to be some "execve feature", and more think of
>>> it as a "this is a process that has *no* extra privileges at all, and
>>> can never get them".
>>
>> Fair enough.  I'll submit the simpler patch tonight.
>
> This sounds cool.  Do you think you'll go for a new task_struct member
> or will it be a securebit?  (Seems like securebits might be too tied to
> posix file caps, but I figured I'd ask).
>
> I'm planning on going ahead and mocking up your potential patch so I
> can respin this series using it and make sure I understand the
> interactions.

I think securebits and cred didn't exist the first time I did this,
and sticking it in struct cred might unnecessarily prevent sharing
cred (assuming that even happens).  So I'd say task_struct.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:46                   ` Andrew Lutomirski
@ 2012-01-12 20:00                     ` Linus Torvalds
  0 siblings, 0 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-12 20:00 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:46 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> I think securebits and cred didn't exist the first time I did this,
> and sticking it in struct cred might unnecessarily prevent sharing
> cred (assuming that even happens).  So I'd say task_struct.

I think it almost has to be task state, since we very much want to
make sure it's trivial to see that nothing ever clears that bit, and
that it always gets copied right over a fork/exec/whatever.

Putting it in some cred or capability bit or something would make that
kind of transparency pretty much totally impossible.

                 Linus
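
[Illustrative sketch, not part of the original message.  The property
asked for here (the bit lives in task state, is copied over fork and
exec, and has no code path that clears it) can be shown with a toy
model.  struct sketch_task and the sketch_*() helpers are hypothetical
names, not kernel interfaces.]

```c
#include <stdbool.h>

/* Hypothetical per-task state; the real task_struct is far larger. */
struct sketch_task {
	int pid;
	const char *comm;	/* name of the running program */
	bool restricted;	/* set-once "no extra privileges" bit */
};

/* The only writer of the bit: it can set it, and nothing clears it. */
static void sketch_restrict(struct sketch_task *t)
{
	t->restricted = true;
}

/* fork(): the child starts as a copy of the parent's task state. */
static struct sketch_task sketch_fork(const struct sketch_task *parent,
				      int child_pid)
{
	struct sketch_task child = *parent;

	child.pid = child_pid;
	return child;
}

/* exec(): the program changes, the task (and its bit) stays. */
static void sketch_exec(struct sketch_task *t, const char *new_comm)
{
	t->comm = new_comm;
}
```

[Because sketch_restrict() is the only function that writes the field,
auditing "nothing ever clears that bit" reduces to checking that no
other writer exists, which is the transparency argument made above.]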

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:40           ` Steven Rostedt
  2012-01-12 17:44             ` Jamie Lokier
@ 2012-01-12 22:18             ` Will Drewry
  2012-01-12 23:00               ` Andrew Lutomirski
  1 sibling, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-12 22:18 UTC (permalink / raw)
  To: Steven Rostedt, Alan Cox
  Cc: Jamie Lokier, Oleg Nesterov, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:40 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>
>> You can do this now, using ptrace().  It's horrible, but half of the
>> horribleness is needing to understand machine-dependent registers,
>> which this new patch doesn't address.  (The other half is a ton of
>> undocumented but important ptrace() behaviours on Linux.)
>
> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p
>
> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say its OK, and then switching the symlink to a setuid file.
>
> When the new execv executed, the parent process would lose all control
> over it. The idea is to prevent this.
>
> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

In the ptrace case, doesn't it just downgrade the privileges of the new process
if there is a tracer, rather than detach the tracer?

Ignoring that, I've been looking at system call filters as being equivalent to
something like the caps bounding set.  Once reduced, there's no going
back. I think Linus's proposal perfectly resolves the policy decision around
suid execution behavior in the run-with-privs-or-not scenarios (just as
ptrace does).  However, I'd like to avoid allowing any process to
escape system call filters once installed.  (It's doable to add a
suid/caps-based bypass, but it's certainly not ideal from my perspective.)

cheers,
will
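
[Illustrative sketch, not part of the original message.  The "once
reduced, there's no going back" property can be shown with a toy model
in which a filter is just a bitmask of allowed syscall numbers.  This
is a simplification: the patch series evaluates BPF programs, not
bitmaps, and struct filt_task, filt_attach() and filt_permits() are
hypothetical names.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: bit n set means syscall number n is allowed (n < 64). */
struct filt_task {
	uint64_t allowed;
};

#define FILT_ALLOW_ALL UINT64_MAX

/*
 * Attaching a filter intersects it with what is already allowed, so a
 * later, more permissive filter can never widen the set again.
 */
static void filt_attach(struct filt_task *t, uint64_t mask)
{
	t->allowed &= mask;
}

static bool filt_permits(const struct filt_task *t, unsigned int nr)
{
	return nr < 64 && (t->allowed & (UINT64_C(1) << nr));
}
```

[The model deliberately has no "detach" operation, mirroring the wish
that no process can escape its filters once they are installed.]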

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 22:18             ` Will Drewry
@ 2012-01-12 23:00               ` Andrew Lutomirski
  0 siblings, 0 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 23:00 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, Alan Cox, Jamie Lokier, Oleg Nesterov,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 2:18 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 11:40 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>>
>>> You can do this now, using ptrace().  It's horrible, but half of the
>>> horribleness is needing to understand machine-dependent registers,
>>> which this new patch doesn't address.  (The other half is a ton of
>>> undocumented but important ptrace() behaviours on Linux.)
>>
>> Yeah I know the horrid use of ptrace, I've implemented programs that use
>> it :-p
>>
>> I guess ptrace can capture the execv and determine if it is OK or not to
>> run it. But again, this doesn't stop the possible attacks that could
>> happen, with having the execv on a symlink file, having the ptrace check
>> say its OK, and then switching the symlink to a setuid file.
>>
>> When the new execv executed, the parent process would lose all control
>> over it. The idea is to prevent this.
>>
>> I like Alan's suggestion. Have userspace decide to allow execv or not,
>> and even let it decide if it should allow setuid execv's or not, but
>> still allow non-setuid execvs. If you allow the setuid execv, once that
>> happens, the same behavior will occur as with ptrace. A setuid execv
>> will lose all its filtering.
>
> In the ptrace case, doesn't it just downgrade the privileges of the new process
> if there is a tracer, rather than detach the tracer?
>
> Ignoring that, I've been looking at system call filters as being equivalent to
> something like the caps bounding set.  Once reduced, there's no going
> back. I think Linus's proposal perfectly resolves the policy decision around
> suid execution behavior in the run-with-privs or not scenarios (just like with
> how ptrace does it).  However, I'd like to avoid allowing any process to
> escape system call filters once installed.  (It's doable to add
> suid/caps-based-bypass, but it's certainly not ideal from my perspective.)

I agree.

In principle, it could be safe for an outside (non-seccomp) process
with appropriate credentials to lift seccomp restrictions from a
different process.  But why?

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 19:08                 ` Kyle Moffett
@ 2012-01-12 23:05                   ` Eric Paris
  2012-01-12 23:33                     ` Andrew Lutomirski
  0 siblings, 1 reply; 222+ messages in thread
From: Eric Paris @ 2012-01-12 23:05 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Andrew Lutomirski, Linus Torvalds, Steven Rostedt, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, djm, segoon, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, oleg, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, 2012-01-12 at 14:08 -0500, Kyle Moffett wrote:
> On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <luto@mit.edu> wrote:
> > On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >> You can do any amount of crazy things with selinux, but the other side
> >> of the coin is that it would also be trivial to teach selinux about
> >> this same "restricted environment" bit, and just say that a process
> >> with that bit set doesn't get to match whatever selinux privilege
> >> escalation rules..

> I don't see any issues with SELinux support for this feature.
> 
> Specifically, when you try to execute something in SELinux, it will
> first look at the types and try to "execute" (involving a type
> transition IE: security label change).
> 
> But if that fails in many cases it may still be allowed to
> "execute_no_trans" (IE: regular non-privileged exec() without a
> transition).

That's not true.  See specifically
security/selinux/hooks.c::selinux_bprm_set_creds().  We calculate a label
for the new task (that may or may not be the same) and then check if
there is permission to run the new binary with the new label.  There is
no fallback.

The exception would be if the binary is on a MNT_NOSUID mount point, in
which case we calculate the new label, then just revert to the same
label.

At first glance, it looks to me like a reasonable way to implement this
would be to do the new checks right next to any place we
already do MNT_NOSUID checks and mimic their behavior.  If there are
other priv escalation points in the kernel we might need to consider if
MNT_NOSUID is adequate....

-Eric


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:56               ` Steven Rostedt
@ 2012-01-12 23:27                 ` Alan Cox
  2012-01-12 23:38                   ` Linus Torvalds
  0 siblings, 1 reply; 222+ messages in thread
From: Alan Cox @ 2012-01-12 23:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jamie Lokier, Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, scarybeasts, avi, penberg, viro, luto,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

> Thus, execv will not be a "special" case here. Seccomp either allows it
> or not. But also add a command to tell seccomp that this task will not
> be allowed to do anything privileged.

A setuid binary is not necessarily privileged - indeed a root -> user
transition via setuid is pretty much the reverse.

It's a change of user context. Things like ptrace and file permissions
basically mean you can't build a barrier between stuff running as the
same uid to a great extent except with heavy restricting, but saying
"you can't become someone else" is very useful.

Alan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 23:05                   ` Eric Paris
@ 2012-01-12 23:33                     ` Andrew Lutomirski
  0 siblings, 0 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-12 23:33 UTC (permalink / raw)
  To: Eric Paris
  Cc: Kyle Moffett, Linus Torvalds, Steven Rostedt, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, djm, segoon, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, oleg, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 3:05 PM, Eric Paris <eparis@redhat.com> wrote:
> On Thu, 2012-01-12 at 14:08 -0500, Kyle Moffett wrote:
>> On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <luto@mit.edu> wrote:
>> > On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> >> You can do any amount of crazy things with selinux, but the other side
>> >> of the coin is that it would also be trivial to teach selinux about
>> >> this same "restricted environment" bit, and just say that a process
>> >> with that bit set doesn't get to match whatever selinux privilege
>> >> escalation rules..
>
>> I don't see any issues with SELinux support for this feature.
>>
>> Specifically, when you try to execute something in SELinux, it will
>> first look at the types and try to "execute" (involving a type
>> transition IE: security label change).
>>
>> But if that fails in many cases it may still be allowed to
>> "execute_no_trans" (IE: regular non-privileged exec() without a
>> transition).
>
> That's not true.  See specifically
> security/selinux/hooks.c::selinux_bprm_set_creds()  We calculate a label
> for the new task (that may or may not be the same) and then check if
> there is permission to run the new binary with the new label.  There is
> no fallback.
>
> The exception would be if the binary is on a MNT_NOSUID mount point, in
> which case we calculate the new label, then just revert to the same
> label.
>
> At first glance it looks to me like a reasonable way to implement this
> at first would be to do the new checks right next to any place we
> already do MNT_NOSUID checks and mimic their behavior.  If there are
> other priv escalation points in the kernel we might need to consider if
> MNT_NOSUID is adequate....
>

I don't really like the current logic.  It does:

        if (old_tsec->exec_sid) {
                new_tsec->sid = old_tsec->exec_sid;
                /* Reset exec SID on execve. */
                new_tsec->exec_sid = 0;
        } else {
                /* Check for a default transition on this program. */
                rc = security_transition_sid(old_tsec->sid, isec->sid,
                                             SECCLASS_PROCESS, NULL,
                                             &new_tsec->sid);
                if (rc)
                        return rc;
        }

        COMMON_AUDIT_DATA_INIT(&ad, PATH);
        ad.u.path = bprm->file->f_path;

        if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
                new_tsec->sid = old_tsec->sid;

which means that, if MNT_NOSUID, then exec_sid is silently ignored.
I'd rather fail in that case, but it's probably too late for that.
However, if we set the "no new privileges" flag, then we could fail,
since there's no old ABI to be compatible with.  I'll implement it
that way.
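
To make the difference concrete, here is a small user-space model of the
label selection above, in Python rather than kernel C.  It is only a
sketch under stated assumptions: choose_exec_sid(), the integer SIDs, the
no_new_privs parameter and the choice of EPERM are all made up for
illustration, and the transition lookup is reduced to a plain argument.

```python
import errno


def choose_exec_sid(old_sid, exec_sid, transition_sid, nosuid,
                    no_new_privs=False):
    """Toy model of SID selection at execve(); all names are hypothetical.

    transition_sid stands in for the result of security_transition_sid().
    Returns (new_sid, error), where error is 0 or a negative errno.
    """
    # An explicitly requested exec SID wins over the default transition.
    new_sid = exec_sid if exec_sid else transition_sid

    if no_new_privs and new_sid != old_sid:
        # Proposed behaviour: refuse the exec instead of silently
        # dropping the label change.  EPERM is an assumed error code.
        return old_sid, -errno.EPERM

    if nosuid:
        # Existing behaviour: on a MNT_NOSUID mount the computed label
        # is discarded and the old label is kept.
        new_sid = old_sid

    return new_sid, 0


print(choose_exec_sid(1, 7, 3, nosuid=True))
print(choose_exec_sid(1, 7, 3, nosuid=False, no_new_privs=True))
```

The first call shows the silent fallback on a nosuid mount; the second
shows the flagged task getting an error instead of a new label.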

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 23:27                 ` Alan Cox
@ 2012-01-12 23:38                   ` Linus Torvalds
  0 siblings, 0 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-12 23:38 UTC (permalink / raw)
  To: Alan Cox
  Cc: Steven Rostedt, Jamie Lokier, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, jmorris, scarybeasts, avi, penberg,
	viro, luto, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 3:27 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> It's a change of user context. Things like ptrace and file permissions
> basically mean you can't build a barrier between stuff running as the
> same uid to a great extent except with heavy restricting, but saying
> "you can't become someone else" is very useful.

Not just "someone else".

The guarantee basically has to be "you can't change your security
context". Where "become somebody else" is part of it, but any
capability changes etc would be part of it too. So it should disable
all games with capabilities etc.

And I don't think selinux really should be all that much of a problem
- we should just make sure that selinux would honor such a bit, and
refuse to do any op that would change any selinux capabilities either.
Same goes for other security models.

And that may include restricting the ways a binary can be executed
totally outside of suid/sgid bits. For example, if you consider
binaries under /home to have different selinux rules than system
binaries in /usr/bin, then a cross-execute from one to the other may
not work, regardless of whether it's suid or not.

I think that is the kind of guarantee a sandbox environment really
wants: "I'm setting up a sandbox, you'd better not change the
permissions on me regardless of what crazy things I do".

                       Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 16:18   ` Alan Cox
  2012-01-12 17:03     ` Will Drewry
@ 2012-01-13  1:31     ` James Morris
  1 sibling, 0 replies; 222+ messages in thread
From: James Morris @ 2012-01-13  1:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, 12 Jan 2012, Alan Cox wrote:

> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.

Yes, I can see potential to vastly simplify MAC policy in some cases.


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 18:03             ` Will Drewry
@ 2012-01-13  1:34               ` Jamie Lokier
  0 siblings, 0 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-13  1:34 UTC (permalink / raw)
  To: Will Drewry
  Cc: Steven Rostedt, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Will Drewry wrote:
> >> > There's already a security model around who can use ptrace(); speeding
> >> > it up needn't break that.
> >> >
> >> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> >> > needed as userspace could have done it, with exactly the restrictions
> >> > it wants.  Google's NaCl comes to mind as a potential user.
> >>
> >> That's not entirely true.  ptrace supervisors are subject to races and
> >> always fail open.  This makes them effective but not as robust as a
> >> seccomp solution can provide.
> >
> > What races do you know about?
> 
> I'm pretty sure that if you have two "isolated" processes, they could
> cause irregular behavior using signals.

Do you have an example?  I'm not aware of one and I've been studying
ptrace quite a bit lately.  If there's a race (other than temporary
kernel bugs with all the ptrace patching lately ;-), I would like to
know and maybe patch it.

The only signal confusion when ptracing syscalls I'm aware of is with
SIGTRAP, and that was fixed in 2.5.46, long, long ago (PTRACE_SETOPTIONS).

> > I'm not aware of any ptrace races if it's used properly.  I'm also not
> > sure what you mean by fail open/close here, unless you mean the target
> > process gets to carry on if the tracing process dies.
> 
> Exactly.  Security systems that, on failure, allow the action to
> proceed can't be relied on.

That's fair enough.  There are numerous occasions when ptracer death
should kill the tracee anyway regardless of security.  E.g. "strace
command..." and strace dies, you'd normally want the command to
be killed as well.  So that could be worth a ptrace option anyway.

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:57           ` Jamie Lokier
  2012-01-12 18:03             ` Will Drewry
@ 2012-01-13  6:33             ` Chris Evans
  1 sibling, 0 replies; 222+ messages in thread
From: Chris Evans @ 2012-01-13  6:33 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Will Drewry, Steven Rostedt, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, jmorris, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 9:57 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace.  Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail.  This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on.  This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below.  (Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be really nice.  Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.
>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants.  Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true.  ptrace supervisors are subject to races and
>> always fail open.  This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?
>
> I'm not aware of any ptrace races if it's used properly.  I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

Yeah, that's one, and it's a pretty awful one when you consider
that the untrusted tracee can play games such as trying to get the
kernel to fire OOM SIGKILLs.

My memory is hazy but the last time I looked at this in detail there
were other racy areas:

- Bad problems if the tracee takes a SIGTSTP or (real) SIGCONT.
- Difficulty in stopping the syscall from executing once it has
started, especially if the tracer dies.


Cheers
Chris

>
> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.
>
>> With seccomp, it fails close.  What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> I think it's a nice idea.  While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access.  I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well.  And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.
>
> -- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:51         ` Will Drewry
@ 2012-01-13 17:31           ` Oleg Nesterov
  2012-01-13 19:01             ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-13 17:31 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 01/12, Will Drewry wrote:
> >>
> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> >> +      */
> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
> >> >
> >> > Stupid question. I am sure you know what are you doing ;) and I know
> >> > nothing about !x86 arches.
> >> >
> >> > But could you explain why it is designed to use user_regs_struct ?
> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
> >>
> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
> >> (iirc), but on x86-64
> >> and others, that's not true.
> >
> > Yes sure, I meant that userspace should use pt_regs too.
> >
> >> If it would be appropriate to expose pt_regs to userspace, then I'd
> >> happily do so :)
> >
> > Ah, so that was the reason. But it is already exported? At least I see
> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
> >
> > Once again, I am not arguing, just trying to understand. And I do not
> > know if this definition is part of abi.
>
> I don't either :/  My original idea was to operate on task_pt_regs(current),
> but I noticed that PTRACE_GETREGS/SETREGS only uses the
> user_regs_struct. So I went that route.

Well, I don't know where user_regs_struct comes from initially. But
probably it is needed to allow access to the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.

> I'd love for pt_regs to be fair game to cut down on the copying!

Me too. I see no point in using user_regs_struct.

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 17:31           ` Oleg Nesterov
@ 2012-01-13 19:01             ` Will Drewry
  2012-01-13 23:10               ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-13 19:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/12, Will Drewry wrote:
>>
>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > On 01/12, Will Drewry wrote:
>> >>
>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >> >> +      */
>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>> >> >
>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>> >> > nothing about !x86 arches.
>> >> >
>> >> > But could you explain why it is designed to use user_regs_struct ?
>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>> >>
>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>> >> (iirc), but on x86-64
>> >> and others, that's not true.
>> >
>> > Yes sure, I meant that userspace should use pt_regs too.
>> >
>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>> >> happily do so :)
>> >
>> > Ah, so that was the reason. But it is already exported? At least I see
>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>> >
>> > Once again, I am not arguing, just trying to understand. And I do not
>> > know if this definition is part of abi.
>>
>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>> user_regs_struct. So I went that route.
>
> Well, I don't know where user_regs_struct comes from initially. But
> probably it is needed to allow access to the "artificial" things like
> fs_base. Or perhaps this struct mimics the layout in the coredump.

Not sure - added Roland whose name was on many of the files :)

I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).

But I think the latter is guaranteed to have a certain layout while the macros
for PEEKUSR can do post-processing fixup.  (Which could be done in the
bpf evaluator load_pointer() helper if needed.)

>> I'd love for pt_regs to be fair game to cut down on the copying!
>
> Me too. I see no point in using user_regs_struct.

I'll rev the change to use pt_regs and drop all the helper code.  If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.


Thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 19:01             ` Will Drewry
@ 2012-01-13 23:10               ` Will Drewry
  2012-01-13 23:12                 ` Will Drewry
                                   ` (2 more replies)
  0 siblings, 3 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-13 23:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen

On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/12, Will Drewry wrote:
>>>
>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> > On 01/12, Will Drewry wrote:
>>> >>
>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >> >> +      */
>>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>>> >> >
>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>> >> > nothing about !x86 arches.
>>> >> >
>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>> >>
>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>> >> (iirc), but on x86-64
>>> >> and others, that's not true.
>>> >
>>> > Yes sure, I meant that userspace should use pt_regs too.
>>> >
>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>> >> happily do so :)
>>> >
>>> > Ah, so that was the reason. But it is already exported? At least I see
>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>> >
>>> > Once again, I am not arguing, just trying to understand. And I do not
>>> > know if this definition is part of abi.
>>>
>>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>> user_regs_struct. So I went that route.
>>
>> Well, I don't know where user_regs_struct comes from initially. But
>> probably it is needed to allow access to the "artificial" things like
>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>
> Not sure - added Roland whose name was on many of the files :)
>
> I just noticed that ptrace ABI allows pt_regs access using the register
> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>
> But I think the latter is guaranteed to have a certain layout while the macros
> for PEEKUSR can do post-processing fixup.  (Which could be done in the
> bpf evaluator load_pointer() helper if needed.)
>
>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>
>> Me too. I see no point in using user_regs_struct.
>
> I'll rev the change to use pt_regs and drop all the helper code.  If
> no one says otherwise, that certainly seems ideal from a performance
> perspective, and I see pt_regs exported to userland along with ptrace
> abi register offset macros.

On second thought, pt_regs is scary :)

From looking at
  http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall entry code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.

I'm not 100% sure that I am reading the code right, but it means that I can either
keep using user_regs_struct or fork the code behavior based on compat. That
would need to be arch dependent then which is pretty rough.
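
As a rough illustration of the indexing problem, the sketch below lays
out the two register frames with ctypes and prints where a couple of
fields land.  The field orders are written from memory of the x86
headers of this era and should be treated as an assumption, as should
the class and helper names; the point is only that a byte offset which
is right for a 32-bit frame lands on a different register in the 64-bit
frame the kernel actually saves for a compat task.

```python
import ctypes


class PtRegs32(ctypes.Structure):
    """Frame a native 32-bit x86 filter would expect (assumed layout)."""
    _fields_ = [(name, ctypes.c_uint32) for name in (
        "bx", "cx", "dx", "si", "di", "bp", "ax",
        "ds", "es", "fs", "gs", "orig_ax",
        "ip", "cs", "flags", "sp", "ss")]


class PtRegs64(ctypes.Structure):
    """Frame saved by a 64-bit kernel, compat tasks included (assumed layout)."""
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "r15", "r14", "r13", "r12", "bp", "bx",
        "r11", "r10", "r9", "r8", "ax", "cx", "dx", "si", "di",
        "orig_ax", "ip", "cs", "flags", "sp", "ss")]


def offset(struct, field):
    """Byte offset of a named field inside a ctypes structure."""
    return getattr(struct, field).offset


# Print the 32-bit and 64-bit offsets side by side.
for field in ("orig_ax", "bx"):
    print(field, offset(PtRegs32, field), offset(PtRegs64, field))
```

Under these assumed layouts a filter that loads the syscall number from
the 32-bit offset of orig_ax would read a different register slot in
the 64-bit frame, which is the mismatch described above.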

Any thoughts?

I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
about the pt_regs
change yet.  If the performance boost is worth the effort of having a
per-arch fixup,
I can go that route.  Otherwise, I could look at some alternate approach for a
faster-than-regview payload.

Thanks!

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
@ 2012-01-13 23:12                 ` Will Drewry
  2012-01-13 23:30                 ` Eric Paris
  2012-01-16 18:37                 ` Oleg Nesterov
  2 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-13 23:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen

On Fri, Jan 13, 2012 at 5:10 PM, Will Drewry <wad@chromium.org> wrote:
> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
>> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/12, Will Drewry wrote:
>>>>
>>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> > On 01/12, Will Drewry wrote:
>>>> >>
>>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> >> >> +      */
>>>> >> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
>>>> >> >
>>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>>> >> > nothing about !x86 arches.
>>>> >> >
>>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>>> >>
>>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>>> >> (iirc), but on x86-64
>>>> >> and others, that's not true.
>>>> >
>>>> > Yes sure, I meant that userspace should use pt_regs too.
>>>> >
>>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>>> >> happily do so :)
>>>> >
>>>> > Ah, so that was the reason. But it is already exported? At least I see
>>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>>> >
>>>> > Once again, I am not arguing, just trying to understand. And I do not
>>>> > know if this definition is part of abi.
>>>>
>>>> I don't either :/  My original idea was to operate on task_pt_regs(current),
>>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>>> user_regs_struct. So I went that route.
>>>
>>> Well, I don't know where user_regs_struct comes from initially. But
>>> probably it is needed to allow access to the "artificial" things like
>>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>>
>> Not sure - added Roland whose name was on many of the files :)
>>
>> I just noticed that ptrace ABI allows pt_regs access using the register
>> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>>
>> But I think the latter is guaranteed to have a certain layout while the macros
>> for PEEKUSR can do post-processing fixup.  (Which could be done in the
>> bpf evaluator load_pointer() helper if needed.)
>>
>>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>>
>>> Me too. I see no point in using user_regs_struct.
>>
>> I'll rev the change to use pt_regs and drop all the helper code.  If
>> no one says otherwise, that certainly seems ideal from a performance
>> perspective, and I see pt_regs exported to userland along with ptrace
>> abi register offset macros.
>
> On second thought, pt_regs is scary :)
>
> From looking at
>  http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> and ia32syscall entry code, it appears that for x86, at least, the
> pt_regs for compat processes will be 8 bytes wide per register on the
> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
> IA32_EMU, its filters will always index into pt_regs incorrectly.
>
> I'm not 100% sure that I am reading the code right, but it means that I can either
> keep using user_regs_struct or fork the code behavior based on compat. That
> would need to be arch dependent then which is pretty rough.
>
> Any thoughts?
>
> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
> about the pt_regs
> change yet.  If the performance boost is worth the effort of having a
> per-arch fixup,
> I can go that route.  Otherwise, I could look at some alternate approach for a
> faster-than-regview payload.

Ugh. Sorry about the formatting. (The other option is to disallow compat ;).

cheers!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
  2012-01-13 23:12                 ` Will Drewry
@ 2012-01-13 23:30                 ` Eric Paris
  2012-01-16 18:37                 ` Oleg Nesterov
  2 siblings, 0 replies; 222+ messages in thread
From: Eric Paris @ 2012-01-13 23:30 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, djm, torvalds, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, luto, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

For anyone who is interested I hacked up a program to turn what I think
is a readable seccomp syntax into BPF rules.  It should make it easier
to prototype this new thing.  The translator needs a LOT of love to be
worth much, but for now it can handle a couple of things and can build a
set of rules!

The rules are of the form:
label object:
	value label

So using Will's BPF example code in my syntax looks like:

start syscall:
        rt_sigreturn success
        sigreturn success
        exit_group success
        exit success
        read read
        write write
read arg0:
        0 success
write arg0:
        1 success
        2 success

So this says the first label is "start" and it is going to deal with the
syscall number.  The first value is 'rt_sigreturn' and if syscall ==
rt_sigreturn will cause you to jump to 'success' (success and fail are
implied labels).  If the syscall is 'write' we will jump to 'write.'
The write rules look at arg0.  If arg0 == "1" we jump to "success".  If
you run that syntax through my translator you should get Will's BPF
rules!
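
For readers who want to see the rule format as data before looking at
the translator, here is a minimal parsing sketch in Python 3.  It is an
illustrative re-reading of the syntax described above, not the attached
script: parse_rules() and EXAMPLE are hypothetical names, and it only
builds a list of sections without emitting any BPF.

```python
def parse_rules(text):
    """Parse 'label object:' headers followed by 'value label' lines.

    Returns a list of (label, object, [(value, target), ...]) tuples in
    input order.
    """
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.endswith(":"):
            # Section header, e.g. "start syscall:"
            label, obj = line[:-1].split()
            sections.append((label, obj, []))
        else:
            if not sections:
                raise ValueError("rule before any section header: %r" % line)
            # Rule line, e.g. "read read" or "0 success"
            value, target = line.split()
            sections[-1][2].append((value, target))
    return sections


EXAMPLE = """\
start syscall:
        rt_sigreturn success
        read read
read arg0:
        0 success
"""

for label, obj, rules in parse_rules(EXAMPLE):
    print(label, obj, rules)
```

Each section keeps its rules in order, which matters because the last
rule in a section is the one that falls through to "fail".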

You'll quickly notice that the translator only understands "syscall" and
"arg0" and only x86_32, but it should be easy to add more, support the
right registers on different arches, etc, etc.  If others think they
might want to hack on the translator I put it at:

http://git.infradead.org/users/eparis/bpf-translate.git

-Eric

[-- Attachment #2: translate.py --]
[-- Type: text/x-python, Size: 2179 bytes --]

#! /usr/bin/python -Es

import sys

if len(sys.argv) > 1:
	file = open(sys.argv[1])
else:
	file = sys.stdin

linecount = 0
sections = []
rules = {}
output = []
section_map = {}

def new_section(section):
	if section[1] == "syscall":
		output.append(("BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),", section[0]))
	elif section[1] == "arg0":
		output.append(("BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),", section[0]))
	elif section[0] == "success":
		output.append(("BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),", section[0]))
	elif section[0] == "fail":
		output.append(("BPF_STMT(BPF_RET+BPF_A,0),", section[0]))

def new_rule(rule, section, last=None):
	string = "BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, %s, %s, 0)," % (rule[0], rule[1])
	if last:
		string = string.replace(", 0)", ", fail)")
	output.append((string, "0"))

if __name__ == '__main__':
	while 1:
		line = file.readline()
		if not line:
			break
		linecount = linecount + 1
		if ":" in line:
			sections.append(line.strip().strip(":").split())
		else:
			key = sections[-1][0]
			current_list = rules.get(key, [])
			newrule = line.strip().split()
			if sections[-1][1] == "syscall":
				newrule = ["__NR_%s" % newrule[0], newrule[1]]
			current_list.append(newrule)
			rules[key] = current_list
			
		

sections.append(["success", "*"])
sections.append(["fail", "*"])

for section in sections:
	new_section(section)
	if section[0] in rules:
		for rule in rules[section[0]]:
			if rule == rules[section[0]][-1]:
				new_rule(rule, section, 1)
			else:
				new_rule(rule, section)

for lineno,line in enumerate(output):
	if (line[1] == "0"):
		continue
	section_map[line[1]] = lineno

for lineno,line in enumerate(output):
	line = line[0]
	for section in section_map.keys():
		# Only replace in those last 2 commas 
		#if VALUE == section:
			#replace VALUE with str(section_map[section] - lineno - 2)
		splitline = line.split(",")
		if section in splitline[-3]:
			splitline[-3] = splitline[-3].replace(section, str(section_map[section] - lineno - 1))
		if section in splitline[-2]:
			splitline[-2] = splitline[-2].replace(section, str(section_map[section] - lineno - 1))
		line = ",".join(splitline)
	print(line)
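
The label resolution at the end of the script can be sketched in
isolation.  Classic BPF jump offsets count from the instruction that
follows the jump, which is why the script subtracts lineno and 1.  The
tuple layout, the instruction strings and resolve_jumps below are
hypothetical and only stand in for the BPF_STMT/BPF_JUMP text above.

```python
def resolve_jumps(program):
    """Turn symbolic jump targets into relative offsets.

    program is a list of (label_or_None, op, jt, jf).  jt and jf are
    either a label name or an already-numeric offset.
    """
    # First pass: remember the index of every labelled instruction.
    label_index = {}
    for idx, (label, _op, _jt, _jf) in enumerate(program):
        if label is not None:
            label_index[label] = idx

    def offset(target, idx):
        if isinstance(target, int):
            return target  # already numeric, keep as-is
        # Relative to the instruction after the jump.
        return label_index[target] - idx - 1

    resolved = []
    for idx, (_label, op, jt, jf) in enumerate(program):
        resolved.append((op, offset(jt, idx), offset(jf, idx)))
    return resolved


PROGRAM = [
    ("start", "load syscall", 0, 0),
    (None, "jeq __NR_exit", "success", 0),
    (None, "jeq __NR_write", "write", "fail"),
    ("write", "load arg0", 0, 0),
    (None, "jeq 1", "success", "fail"),
    ("success", "ret allow", 0, 0),
    ("fail", "ret deny", 0, 0),
]

resolved = resolve_jumps(PROGRAM)
for op, jt, jf in resolved:
    print(op, jt, jf)
```

An offset of 0 means "fall through to the next instruction", so a
section that directly follows the jump needs no special casing.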

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-13 23:10               ` Will Drewry
  2012-01-13 23:12                 ` Will Drewry
  2012-01-13 23:30                 ` Eric Paris
@ 2012-01-16 18:37                 ` Oleg Nesterov
  2012-01-16 20:15                   ` Will Drewry
  2 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-16 18:37 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On 01/13, Will Drewry wrote:
>
> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
> > On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> Me too. I see no point in using user_regs_struct.
> >
> > I'll rev the change to use pt_regs and drop all the helper code.  If
> > no one says otherwise, that certainly seems ideal from a performance
> > perspective, and I see pt_regs exported to userland along with ptrace
> > abi register offset macros.
>
> On second thought, pt_regs is scary :)
>
> From looking at
>   http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> and ia32syscall entry code, it appears that for x86, at least, the
> pt_regs for compat processes will be 8 bytes wide per register on the
> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
> IA32_EMU, its filters will always index into pt_regs incorrectly.

Yes, thanks, I forgot about compat tasks again. But this is easy, we
just need regs_64_to_32().
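
A small sketch of the width problem and of what a truncating copy
would do.  The three-register layout is illustrative only and is not
the real pt_regs; regs_64_to_32 is the helper name floated in this
thread, modelled here in Python under the assumption of little-endian
storage as on x86.

```python
import struct


def pack_regs_64(values):
    """Store each register in an 8-byte slot, as on a 64-bit kernel."""
    return struct.pack("<%dQ" % len(values), *values)


def read_u32(buf, offset):
    """Load a 32-bit word at a byte offset, as a 32-bit filter would."""
    return struct.unpack_from("<I", buf, offset)[0]


def regs_64_to_32(buf):
    """Copy 8-byte slots into 4-byte slots, keeping the low 32 bits."""
    count = len(buf) // 8
    wide = struct.unpack("<%dQ" % count, buf)
    return struct.pack("<%dI" % count, *(v & 0xFFFFFFFF for v in wide))


# Hypothetical register values: slot 0 = 1, slot 1 = 2, slot 2 = 3.
wide = pack_regs_64([1, 2, 3])

# A filter written for 4-byte slots expects slot 1 at byte offset 4.
# In the wide layout that offset is the upper half of slot 0.
wrong = read_u32(wide, 4)

narrow = regs_64_to_32(wide)
right = read_u32(narrow, 4)

print("wide layout:", wrong, "collapsed layout:", right)
```

Only compat callers would pay for the copy; native callers can be
evaluated against the registers as they are.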

Doesn't matter. I think Indan has a better suggestion.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16 18:37                 ` Oleg Nesterov
@ 2012-01-16 20:15                   ` Will Drewry
  2012-01-17 16:45                     ` Oleg Nesterov
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-16 20:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/13, Will Drewry wrote:
>>
>> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <wad@chromium.org> wrote:
>> > On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >>
>> >> Me too. I see no point in using user_regs_struct.
>> >
>> > I'll rev the change to use pt_regs and drop all the helper code.  If
>> > no one says otherwise, that certainly seems ideal from a performance
>> > perspective, and I see pt_regs exported to userland along with ptrace
>> > abi register offset macros.
>>
>> On second thought, pt_regs is scary :)
>>
>> From looking at
>>   http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
>> and ia32syscall entry code, it appears that for x86, at least, the
>> pt_regs for compat processes will be 8 bytes wide per register on the
>> stack.  This means if a self-filtering 32-bit program runs on a 64-bit host in
>> IA32_EMU, its filters will always index into pt_regs incorrectly.
>
> Yes, thanks, I forgot about compat tasks again. But this is easy, we
> just need regs_64_to_32().

Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32.  Seems kinda wonky though :/

> Doesn't matter. I think Indan has a better suggestion.

I disagree, but perhaps I'm not fully understanding!

Thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-12 17:02   ` Andrew Lutomirski
@ 2012-01-16 20:28     ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-16 20:28 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 12, 2012 at 11:02 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Wed, Jan 11, 2012 at 9:25 AM, Will Drewry <wad@chromium.org> wrote:
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>>
>
> There's some seccomp-related code in the vsyscall emulation path in
> arch/x86/kernel/vsyscall_64.c.  How should time(), getcpu(), and
> gettimeofday() be handled?

Nice catch:
  http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/vsyscall_64.c#L180
I'd missed it.

> If you want filtering to work, there
> aren't any real syscall registers to inspect, but they could be
> synthesized.

Hrm, I wonder if making sure orig_eax is populated with the
vsyscall_nr would be enough.  Unless I'm misreading, args 0 and 1 are
correct, so there may be other noise, but performing a call to
__secure_computing() (either in the case or with a pre-validate
syscall nr: 0-2) should send the do_exit.  Does that sound reasonable?

I'll try to do the right thing in my next patch set.

> Preventing a malicious task from figuring out approximately what time
> it is is basically impossible because of the way that vvars work.  I
> don't know how to change that efficiently.

There are other ways to guess the time too, so I don't think it's that
bad.  For those that are really worried, they could disable or
otherwise attempt to limit vsyscall access from their sandbox.

thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-16 20:15                   ` Will Drewry
@ 2012-01-17 16:45                     ` Oleg Nesterov
  2012-01-17 16:56                       ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-17 16:45 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On 01/16, Will Drewry wrote:
>
> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, thanks, I forgot about compat tasks again. But this is easy, we
> > just need regs_64_to_32().
>
> Yup - we could make the assumption that is_compat_task is always
> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
> regs_64_to_32.  Seems kinda wonky though :/

much simpler/faster than what regset does to create the artificial
user_regs_struct32.

> > Doesn't matter. I think Indan has a better suggestion.
>
> I disagree, but perhaps I'm not fully understanding!

I have much more chances to be wrong ;) I leave it to you and Indan.

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 16:45                     ` Oleg Nesterov
@ 2012-01-17 16:56                       ` Will Drewry
  2012-01-17 17:01                         ` Andrew Lutomirski
  2012-01-17 19:35                         ` Will Drewry
  0 siblings, 2 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-17 16:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/16, Will Drewry wrote:
>>
>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > Yes, thanks, I forgot about compat tasks again. But this is easy, we
>> > just need regs_64_to_32().
>>
>> Yup - we could make the assumption that is_compat_task is always
>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>> regs_64_to_32.  Seems kinda wonky though :/
>
> much simpler/faster than what regset does to create the artificial
> user_regs_struct32.

True, I could collapse pt_regs to look like the exported ABI pt_regs.
 Then only compat processes would get the copy overhead.  That could
be tidy and not break ABI.  It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)

I like this idea!

>> > Doesn't matter. I think Indan has a better suggestion.
>>
>> I disagree, but perhaps I'm not fully understanding!
>
> I have much more chances to be wrong ;) I leave it to you and Indan.

We're being very verbose. I hope we can come to a good place!  I took
a break from my response to reply here :)

thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 16:56                       ` Will Drewry
@ 2012-01-17 17:01                         ` Andrew Lutomirski
  2012-01-17 17:05                           ` Oleg Nesterov
  2012-01-17 17:06                           ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-17 19:35                         ` Will Drewry
  1 sibling, 2 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-17 17:01 UTC (permalink / raw)
  To: Will Drewry
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 8:56 AM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/16, Will Drewry wrote:
>>>
>>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >
>>> > Yes, thanks, I forgot about compat tasks again. But this is easy, we
>>> > just need regs_64_to_32().
>>>
>>> Yup - we could make the assumption that is_compat_task is always
>>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>>> regs_64_to_32.  Seems kinda wonky though :/
>>
>> much simpler/faster than what regset does to create the artificial
>> user_regs_struct32.
>
> True, I could collapse pt_regs to look like the exported ABI pt_regs.
>  Then only compat processes would get the copy overhead.  That could
> be tidy and not break ABI.  It would mean that I have to assume that
> if unsigned long == 64-bit and is_compat_task(), then the task is
> 32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
> arch that we will add a is_compat64_task() so that I could properly
> collapse? :)
>
> I like this idea!

FWIW, it's possible for a task to execute in 32-bit mode when
!is_compat_task or in 64-bit mode when is_compat_task.  From earlier
in the thread, I think you were planning to block the wrong-bitness
syscall entries, but it's worth double-checking that you don't open up
a hole when a compat task issues the 64-bit syscall instruction.

(is_compat_task says whether the executable was marked as 32-bit.  The
actual execution mode is determined by the cs register, which the user
can control.  See the user_64bit_mode function in
arch/x86/include/asm/ptrace.h.  But maybe it would make more sense to
have a separate 32-bit and 64-bit BPF program and select which one to
use based on the entry point.)

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:01                         ` Andrew Lutomirski
@ 2012-01-17 17:05                           ` Oleg Nesterov
  2012-01-17 17:45                             ` Andrew Lutomirski
  2012-01-17 17:06                           ` [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Will Drewry
  1 sibling, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-17 17:05 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On 01/17, Andrew Lutomirski wrote:
>
> (is_compat_task says whether the executable was marked as 32-bit.  The
> actual execution mode is determined by the cs register, which the user
> can control.

Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).

TS_COMPAT says that, say, the task did "int 0x80" to enter the kernel.
64-bit or not, we should treat it as 32-bit in this case.

No?

Oleg.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:01                         ` Andrew Lutomirski
  2012-01-17 17:05                           ` Oleg Nesterov
@ 2012-01-17 17:06                           ` Will Drewry
  1 sibling, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-17 17:06 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 11:01 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Tue, Jan 17, 2012 at 8:56 AM, Will Drewry <wad@chromium.org> wrote:
>> On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/16, Will Drewry wrote:
>>>>
>>>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> >
>>>> > Yes, thanks, I forgot about compat tasks again. But this is easy, we
>>>> > just need regs_64_to_32().
>>>>
>>>> Yup - we could make the assumption that is_compat_task is always
>>>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>>>> regs_64_to_32.  Seems kinda wonky though :/
>>>
>>> much simpler/faster than what regset does to create the artificial
>>> user_regs_struct32.
>>
>> True, I could collapse pt_regs to look like the exported ABI pt_regs.
>>  Then only compat processes would get the copy overhead.  That could
>> be tidy and not break ABI.  It would mean that I have to assume that
>> if unsigned long == 64-bit and is_compat_task(), then the task is
>> 32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
>> arch that we will add a is_compat64_task() so that I could properly
>> collapse? :)
>>
>> I like this idea!
>
> FWIW, it's possible for a task to execute in 32-bit mode when
> !is_compat_task or in 64-bit mode when is_compat_task.  From earlier
> in the thread, I think you were planning to block the wrong-bitness
> syscall entries, but it's worth double-checking that you don't open up
> a hole when a compat task issues the 64-bit syscall instruction.

Yup - I had to (see below).

> (is_compat_task says whether the executable was marked as 32-bit.  The
> actual execution mode is determined by the cs register, which the user
> can control.  See the user_64bit_mode function in
> arch/x86/include/asm/ptrace.h.  But maybe it would make more sense to
> have a separate 32-bit and 64-bit BPF program and select which one to
> use based on the entry point.)

So that was my original design, but the problem was with how regviews
decides on the user_regs_struct.  It decides using TIF_IA32, while I
can only check the cross-arch is_compat_task(), which checks TS_COMPAT
on x86.  If I'm just collapsing registers for compat calls (which I am
exploring the viability of right now), then I guess I could re-fork
the filtering to support compat versus non-compat.  The nastier bit
there was that I don't want a compat call to be allowed just because
a process only defined a non-compat filter.  I think that can be made
manageable though.
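
A rough sketch of the fail-closed behaviour described above: one
filter per entry mode, and a call arriving in a mode that has no
filter is denied rather than waved through.  The dictionary keys, the
callable filters and check_syscall are assumptions made for this
illustration, not the interface in the patch.

```python
def check_syscall(filters, compat, nr):
    """Return True if the call is allowed under the attached filters.

    filters maps "native" and/or "compat" to a callable taking the
    syscall number.  A missing entry means the task never defined a
    policy for that mode, so the call is denied.
    """
    key = "compat" if compat else "native"
    flt = filters.get(key)
    if flt is None:
        return False  # fail closed: no policy for this entry mode
    return flt(nr)


# Hypothetical policies; the numbers are placeholders.
native_only = {"native": lambda nr: nr in {0, 1}}
both_modes = {
    "native": lambda nr: nr in {0, 1},
    "compat": lambda nr: nr in {3, 4},
}

print(check_syscall(native_only, compat=False, nr=1))
print(check_syscall(native_only, compat=True, nr=1))
print(check_syscall(both_modes, compat=True, nr=4))
```

The second call is the case of concern: the number would pass the
native filter, but it arrived through the compat path, so it is
rejected.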

I'll finish proving out the possibilities here.

Thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 17:05                           ` Oleg Nesterov
@ 2012-01-17 17:45                             ` Andrew Lutomirski
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-17 17:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen, indan

On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/17, Andrew Lutomirski wrote:
>>
>> (is_compat_task says whether the executable was marked as 32-bit.  The
>> actual execution mode is determined by the cs register, which the user
>> can control.
>
> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
> along with TS_COMPAT).
>
> TS_COMPAT says that, say, the task did "int 0x80" to enter the kernel.
> 64-bit or not, we should treat it as 32-bit in this case.

I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80).  That's
what I get for insufficiently careful reading of the assembly.  (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
  2012-01-17 16:56                       ` Will Drewry
  2012-01-17 17:01                         ` Andrew Lutomirski
@ 2012-01-17 19:35                         ` Will Drewry
  1 sibling, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-17 19:35 UTC (permalink / raw)
  To: Oleg Nesterov, Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath, Andi Kleen

On Tue, Jan 17, 2012 at 10:56 AM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jan 17, 2012 at 10:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/16, Will Drewry wrote:
>>>
>>> On Mon, Jan 16, 2012 at 12:37 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >
>>> > Yes, thanks, I forgot about compat tasks again. But this is easy, we
>>> > just need regs_64_to_32().
>>>
>>> Yup - we could make the assumption that is_compat_task is always
>>> 32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
>>> regs_64_to_32.  Seems kinda wonky though :/
>>
>> much simpler/faster than what regset does to create the artificial
>> user_regs_struct32.
>
> True, I could collapse pt_regs to look like the exported ABI pt_regs.
>  Then only compat processes would get the copy overhead.  That could
> be tidy and not break ABI.  It would mean that I have to assume that
> if unsigned long == 64-bit and is_compat_task(), then the task is
> 32-bit.  Do you think if we ever add a crazy 128-bit "supercomputer"
> arch that we will add a is_compat64_task() so that I could properly
> collapse? :)
>
> I like this idea!

Ouch, so a few issues:
- pt_regs isn't exported for most arches
- is_compat_task arches would need custom fixups

I think Indan takes this round :) I'll begin integrating a
syscall_get_arguments approach.  Hopefully it can be quite efficient.

cheers!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-17 17:45                             ` Andrew Lutomirski
@ 2012-01-18  0:56                               ` Indan Zupancic
  2012-01-18  1:01                                 ` Andrew Lutomirski
                                                   ` (3 more replies)
  0 siblings, 4 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-18  0:56 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
> On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 01/17, Andrew Lutomirski wrote:
>>>
>>> (is_compat_task says whether the executable was marked as 32-bit.  The
>>> actual execution mode is determined by the cs register, which the user
>>> can control.
>>
>> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
>> along with TS_COMPAT).
>>
>> TS_COMPAT says that, say, the task did "int 0x80" to enter the kernel.
>> 64-bit or not, we should treat it as 32-bit in this case.
>
> I think you're right, and checking which entry was used is better than
> checking the cs register (since 64-bit code can use int80).  That's
> what I get for insufficiently careful reading of the assembly.  (And
> for going from memory from when I wrote the vsyscall emulation code --
> that code is entered from a page fault, so the entry point used is
> irrelevant.)

Wait: If a task is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?

Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?

I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?

Greetings,

Indan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
@ 2012-01-18  1:01                                 ` Andrew Lutomirski
  2012-01-19  1:06                                   ` Indan Zupancic
  2012-01-18  1:07                                 ` Roland McGrath
                                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-18  1:01 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
>> On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 01/17, Andrew Lutomirski wrote:
>>>>
>>>> (is_compat_task says whether the executable was marked as 32-bit.  The
>>>> actual execution mode is determined by the cs register, which the user
>>>> can control.
>>>
>>> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
>>> along with TS_COMPAT).
>>>
>>> TS_COMPAT says that, say, the task did "int 0x80" to enter the kernel.
>>> 64-bit or not, we should treat it as 32-bit in this case.
>>
>> I think you're right, and checking which entry was used is better than
>> checking the cs register (since 64-bit code can use int80).  That's
>> what I get for insufficiently careful reading of the assembly.  (And
>> for going from memory from when I wrote the vsyscall emulation code --
>> that code is entered from a page fault, so the entry point used is
>> irrelevant.)
>
> Wait: If a task is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?
>
> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

I don't know what your ptrace jailer does.  But a task can switch
itself between 32-bit and 64-bit execution at will, and there's
nothing the kernel can do about it.  (That isn't quite true -- in
theory the kernel could fiddle with the GDT, but that would be
expensive and wouldn't work on Xen.)

That being said, is_compat_task is apparently a good indication of
whether the current *syscall* entry is a 64-bit syscall or a 32-bit
syscall.  Perhaps the function should be renamed to in_compat_syscall,
because that's what it does.

>
> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

Nowhere, I think.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
  2012-01-18  1:01                                 ` Andrew Lutomirski
@ 2012-01-18  1:07                                 ` Roland McGrath
  2012-01-18  1:47                                   ` Indan Zupancic
  2012-01-18  1:48                                 ` Jamie Lokier
  2012-01-18  1:50                                 ` Andi Kleen
  3 siblings, 1 reply; 222+ messages in thread
From: Roland McGrath @ 2012-01-18  1:07 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Andi Kleen

On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
> Wait: If a task is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?

Well, saying it like that suggests that there is more of a "mode change"
than really exists.  It's simply that any task can use int $0x80 and
this always means using the 32-bit syscall table with TS_COMPAT set.

> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

I don't think there's another way.  hpa and I once discussed adding a field
to the extractable "register state" that would say which method the syscall
in progress had taken to enter the kernel.  That would tell you which
flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
But nobody ever had a real need for it, and we didn't pursue it further.
(We originally talked about it in the context of distinguishing whether a
32-bit task had used sysenter or syscall or int $0x80, I think.)

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

It's documented the same place the entire Linux machine-level ABI is
documented, which is nowhere.  Someone somewhere may once have been
counting on it.  (The story I heard was about an implementation of valgrind
for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
was really done.)  The general rule is that if it ever worked before in a
coherent way, we don't break binary compatibility.

In the implementation, it would require a special check to make it barf.
It's really just something that falls out of how the hardware and the
kernel implementation works.  I suppose you could add such a check under a
new kconfig option that's marked as being potentially incompatible with
some old applications.  Good luck with that.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:07                                 ` Roland McGrath
@ 2012-01-18  1:47                                   ` Indan Zupancic
  0 siblings, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-18  1:47 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Andi Kleen

On Wed, January 18, 2012 02:07, Roland McGrath wrote:
> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>> Wait: If a task is set to 64 bit mode, but calls into the kernel via
>> int 0x80 it's changed to 32 bit mode for that system call and back to
>> 64 bit mode when the system call is finished!?
>
> Well, saying it like that suggests that there is more of a "mode change"
> than really exists.  It's simply that any task can use int $0x80 and
> this always means using the 32-bit syscall table with TS_COMPAT set.

True, the kernel always runs in 64-bit mode, it just selects which path
is taken.

>> Our ptrace jailer is checking cs to figure out if a task is a compat task
>> or not, if the kernel can change that behind our back it means our jailer
>> isn't secure for x86_64 with compat enabled. Or is cs changed before the
>> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
>> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
>> there another way?
>
> I don't think there's another way.  hpa and I once discussed adding a field
> to the extractable "register state" that would say which method the syscall
> in progress had taken to enter the kernel.  That would tell you which
> flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
> But nobody ever had a real need for it, and we didn't pursue it further.
> (We originally talked about it in the context of distinguishing whether a
> 32-bit task had used sysenter or syscall or int $0x80, I think.)

Argh. So strace and all other ptrace users will think the task is calling a
different system call than it executes, except if they check for int 0x80,
which I bet they don't.
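
To make the mismatch concrete, here is a minimal sketch of how the same
number decodes under the two tables.  The helper names (syscall_name,
abis_agree) and the tiny excerpted tables are assumptions made up for
illustration; only the first few well-known numbers are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Which syscall table the kernel will index for this entry. */
enum abi { ABI_X86_64, ABI_I386 };

/* Excerpts only: the first few entries of each table. */
static const char *const x86_64_names[] = {
    "read", "write", "open", "close"
};
static const char *const i386_names[] = {
    "restart_syscall", "exit", "fork", "read", "write", "open", "close"
};

/* Hypothetical helper: name of syscall nr under the given table, or "?". */
const char *syscall_name(enum abi abi, long nr)
{
    const char *const *table = (abi == ABI_I386) ? i386_names : x86_64_names;
    size_t count = (abi == ABI_I386)
        ? sizeof(i386_names) / sizeof(i386_names[0])
        : sizeof(x86_64_names) / sizeof(x86_64_names[0]);

    if (nr < 0 || (size_t)nr >= count)
        return "?";
    return table[nr];
}

/* Returns 1 only if both tables are known to map nr to the same call. */
int abis_agree(long nr)
{
    const char *a = syscall_name(ABI_X86_64, nr);
    const char *b = syscall_name(ABI_I386, nr);

    assert(a != NULL && b != NULL);
    return strcmp(a, "?") != 0 && strcmp(a, b) == 0;
}
```

A tracer that picks the table from cs alone would decode number 2 from a
64-bit task as open, while an int 0x80 entry would make it fork.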

I suppose I could cache the checked EIP-2's results, but then I also have to
check if the memory is read-only and invalidate the cache when the mapping may
be changed. Probably not worth the complexity.

>> I think this behaviour is so unexpected that it can only cause security
>> problems in the long run. Is anyone counting on this? Where is this
>> behaviour documented?
>
> It's documented the same place the entire Linux machine-level ABI is
> documented, which is nowhere.

AMD wrote the "System V Application Binary Interface" which describes
some Linux conventions. It's better than nothing. But it just mentions
'syscall', not what happens when int 0x80 is called anyway.

> Someone somewhere may once have been
> counting on it.  (The story I heard was about an implementation of valgrind
> for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
> was really done.)  The general rule is that if it ever worked before in a
> coherent way, we don't break binary compatibility.

Well, considering the code can't be sure if the kernel supports compat mode
at all, I think this case is getting even more obscure than it already is.
Disallowing it won't change the kernel behaviour compared to a kernel with
compat disabled.

What about disallowing this path when the task is being ptraced?

> In the implementation, it would require a special check to make it barf.
> It's really just something that falls out of how the hardware and the
> kernel implementation works.  I suppose you could add such a check under a
> new kconfig option that's marked as being potentially incompatible with
> some old applications.  Good luck with that.

That seems a hopeless path to follow, and won't solve my problem because
my code has to be able to run on all kernels. Half the point of using
ptrace for jailing was that it's mostly portable with no special kernel
support.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
  2012-01-18  1:01                                 ` Andrew Lutomirski
  2012-01-18  1:07                                 ` Roland McGrath
@ 2012-01-18  1:48                                 ` Jamie Lokier
  2012-01-18  1:50                                 ` Andi Kleen
  3 siblings, 0 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-18  1:48 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

Indan Zupancic wrote:
> On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
> > On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> On 01/17, Andrew Lutomirski wrote:
> >>>
> >>> (is_compat_task says whether the executable was marked as 32-bit.  The
> >>> actual execution mode is determined by the cs register, which the user
> >>> can control.
> >>
> >> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
> >> along with TS_COMPAT).
> >>
> >> TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
> >> 64-bit or not, we should treat it as 32-bit in this case.
> >
> > I think you're right, and checking which entry was used is better than
> > checking the cs register (since 64-bit code can use int80).  That's
> > what I get for insufficiently careful reading of the assembly.  (And
> > for going from memory from when I wrote the vsyscall emulation code --
> > that code is entered from a page fault, so the entry point used is
> > irrelevant.)
> 
> Wait: If a task is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?
> 
> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

PTRACE_PEEKTEXT won't securely tell you if it's int 0x80 if there's
another thread modifying the code, or changing the mappings, or it's
executing from a file or shared memory that someone's writing to.
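
For reference, the check being discussed boils down to comparing the two
bytes just before the saved instruction pointer against the known opcode
encodings.  A minimal sketch follows; the names (entry_kind,
classify_entry) are made up for illustration, and the caller is assumed
to have already fetched the two bytes (for example with PTRACE_PEEKTEXT).
It inherits every race described above: the result says what the bytes
are now, not what the CPU executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum entry_kind {
    ENTRY_UNKNOWN = 0,
    ENTRY_INT80,    /* int $0x80: bytes CD 80 */
    ENTRY_SYSCALL   /* syscall:   bytes 0F 05 */
};

/*
 * Hypothetical helper: classify the two bytes found at ip - 2.
 * This is only a byte comparison; it cannot tell whether the bytes
 * were rewritten between the trap and the read.
 */
enum entry_kind classify_entry(const uint8_t insn[2])
{
    assert(insn != NULL);

    if (insn[0] == 0xCD && insn[1] == 0x80)
        return ENTRY_INT80;
    if (insn[0] == 0x0F && insn[1] == 0x05)
        return ENTRY_SYSCALL;
    return ENTRY_UNKNOWN;
}
```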

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

It's a surprise to me too.  And like you I'm using ptrace, to trace
what a process touches, not restrict it, but it's subject to the same problem.

This looks like it needs a kernel patch.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  0:56                               ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Indan Zupancic
                                                   ` (2 preceding siblings ...)
  2012-01-18  1:48                                 ` Jamie Lokier
@ 2012-01-18  1:50                                 ` Andi Kleen
  2012-01-18  2:00                                   ` Steven Rostedt
  2012-01-18  2:04                                   ` Jamie Lokier
  3 siblings, 2 replies; 222+ messages in thread
From: Andi Kleen @ 2012-01-18  1:50 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer

Every user program change it behind your back.

Your ptrace jailer isn't.

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

Look up far jumps in any x86 manual.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:50                                 ` Andi Kleen
@ 2012-01-18  2:00                                   ` Steven Rostedt
  2012-01-18  2:04                                   ` Jamie Lokier
  1 sibling, 0 replies; 222+ messages in thread
From: Steven Rostedt @ 2012-01-18  2:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath

On Wed, 2012-01-18 at 02:50 +0100, Andi Kleen wrote:

> Every user program change it behind your back.
> 
> Your ptrace jailer isn't.

I'm sorry but I can't read the above two lines without hearing Yoda's
voice. "Hmm hmm"

-- Steve



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:50                                 ` Andi Kleen
  2012-01-18  2:00                                   ` Steven Rostedt
@ 2012-01-18  2:04                                   ` Jamie Lokier
  2012-01-18  2:22                                     ` Andi Kleen
  2012-01-18  2:27                                     ` Linus Torvalds
  1 sibling, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-18  2:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

Andi Kleen wrote:
> > Our ptrace jailer is checking cs to figure out if a task is a compat task
> > or not, if the kernel can change that behind our back it means our jailer
> 
> Every user program change it behind your back.
..
> Look up far jumps in any x86 manual.

I'm pretty sure this isn't about changing cs or far jumps.

I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers.  It looks like a hole
in ptrace which could be fixed.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:04                                   ` Jamie Lokier
@ 2012-01-18  2:22                                     ` Andi Kleen
  2012-01-18  2:25                                       ` Andrew Lutomirski
  2012-01-18  4:22                                       ` Indan Zupancic
  2012-01-18  2:27                                     ` Linus Torvalds
  1 sibling, 2 replies; 222+ messages in thread
From: Andi Kleen @ 2012-01-18  2:22 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Indan Zupancic, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland

> I'm pretty sure this isn't about changing cs or far jumps

He's assuming that code can only run on two code segments and
not arbitrarily switch between them which is a completely incorrect
assumption.

> I think Indan means code is running with 64-bit cs, but the kernel
> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> and there's no way for the ptracer to know which syscall the kernel
> will perform, even by looking at all registers.  It looks like a hole
> in ptrace which could be fixed.

Possibly, but anything that bases its security on ptrace is typically
unfixably racy (just think what happens with multiple threads
and syscall arguments), so it's unlikely to do any good.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:22                                     ` Andi Kleen
@ 2012-01-18  2:25                                       ` Andrew Lutomirski
  2012-01-18  4:22                                       ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-18  2:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jamie Lokier, Indan Zupancic, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 6:22 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I'm pretty sure this isn't about changing cs or far jumps
>
> He's assuming that code can only run on two code segments and
> not arbitrarily switch between them which is a completely incorrect
> assumption.

I think all he needs is to figure out which type of syscall was just
intercepted.  (Obviously arguments in memory are a problem.)

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:04                                   ` Jamie Lokier
  2012-01-18  2:22                                     ` Andi Kleen
@ 2012-01-18  2:27                                     ` Linus Torvalds
  2012-01-18  2:31                                       ` Andi Kleen
  1 sibling, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18  2:27 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Indan Zupancic, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 6:04 PM, Jamie Lokier <jamie@shareable.org> wrote:
>
> I think Indan means code is running with 64-bit cs, but the kernel
> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> and there's no way for the ptracer to know which syscall the kernel
> will perform, even by looking at all registers.  It looks like a hole
> in ptrace which could be fixed.

We could possibly munge the "orig_ax" field to be different for the
int80 vs syscall cases. That's really the only field that isn't direct
x86 state. And it's 64 bits wide, but we really only care about the
low 32 bits in the kernel. So a bit in the high bits that says "this
was a int80 entry" would be possible.
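
A rough sketch of what a tracer-side decode of such a bit could look
like.  Everything here is hypothetical: no such flag exists, and the bit
position (bit 32) and the names below are assumptions chosen only to
illustrate the idea of keeping the number in the low 32 bits and a marker
above them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker bit; NOT an existing kernel ABI. */
#define HYPOTHETICAL_INT80_FLAG (UINT64_C(1) << 32)

/* The syscall number proper lives in the low 32 bits. */
uint32_t syscall_nr(uint64_t orig_ax)
{
    return (uint32_t)orig_ax;
}

/* What the kernel side would do on an int80 entry (illustrative only). */
uint64_t mark_int80(uint64_t orig_ax)
{
    assert((orig_ax >> 32) == 0);   /* expects a plain syscall number */
    return orig_ax | HYPOTHETICAL_INT80_FLAG;
}

/*
 * What a tracer would check.  A negative orig_ax (such as -1, meaning
 * "not in a syscall") has all high bits set, so it is excluded first.
 */
int entered_via_int80(uint64_t orig_ax)
{
    if ((int64_t)orig_ax < 0)
        return 0;
    return (orig_ax & HYPOTHETICAL_INT80_FLAG) != 0;
}
```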

                       Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:27                                     ` Linus Torvalds
@ 2012-01-18  2:31                                       ` Andi Kleen
  2012-01-18  2:46                                         ` Linus Torvalds
  0 siblings, 1 reply; 222+ messages in thread
From: Andi Kleen @ 2012-01-18  2:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andi Kleen, Indan Zupancic, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 06:27:19PM -0800, Linus Torvalds wrote:
> On Tue, Jan 17, 2012 at 6:04 PM, Jamie Lokier <jamie@shareable.org> wrote:
> >
> > I think Indan means code is running with 64-bit cs, but the kernel
> > treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> > and there's no way for the ptracer to know which syscall the kernel
> > will perform, even by looking at all registers.  It looks like a hole
> > in ptrace which could be fixed.
> 
> We could possibly munge the "orig_ax" field to be different for the
> int80 vs syscall cases. That's really the only field that isn't direct
> x86 state. And it's 64 bits wide, but we really only care about the
> low 32 bits in the kernel. So a bit in the high bits that says "this
> was a int80 entry" would be possible.

That would be incompatible. However you could just add another virtual
register with such information (in fact I thought about that
when I did the compat code originally). However I don't think it'll salvage
the original broken by design ptrace jailer. And everyone else
so far has done fine without it.

-Andi

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:31                                       ` Andi Kleen
@ 2012-01-18  2:46                                         ` Linus Torvalds
  2012-01-18 14:06                                           ` Martin Mares
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18  2:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jamie Lokier, Andi Kleen, Indan Zupancic, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 6:31 PM, Andi Kleen <ak@linux.intel.com> wrote:
>
> That would be incompatible.

No it wouldn't.

We'd only do it for the case that everybody gets wrong: int80 from a
64-bit context.

All the other cases are trivial to see (look at CS to determine 32-bit
vs 64-bit system call) and are the common case.

So the one new "incompatible" bit case would be the case that existing
users would inevitably get wrong, so it can hardly be "incompatible".

                  Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:22                                     ` Andi Kleen
  2012-01-18  2:25                                       ` Andrew Lutomirski
@ 2012-01-18  4:22                                       ` Indan Zupancic
  2012-01-18  5:23                                         ` Linus Torvalds
  2012-01-18  5:43                                         ` Chris Evans
  1 sibling, 2 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-18  4:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jamie Lokier, Andi Kleen, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Wed, January 18, 2012 03:22, Andi Kleen wrote:
>> I'm pretty sure this isn't about changing cs or far jumps
>
> He's assuming that code can only run on two code segments and
> not arbitrarily switch between them which is a completely incorrect
> assumption.

All I assumed up to now was that cs shows the current mode of the process,
and that that defines which system call path is taken. Apparently that is
not true and int 0x80 forces the compat system call path.

Looking at EIP - 2 seems like a secure way to check how we entered the kernel.

>> I think Indan means code is running with 64-bit cs, but the kernel
>> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
>> and there's no way for the ptracer to know which syscall the kernel
>> will perform, even by looking at all registers.

Yes, that's what I meant.

>> It looks like a hole in ptrace which could be fixed.
>
> Possibly, but anything that bases its security on ptrace is typically
> unfixably racy (just think what happens with multiple threads
> and syscall arguments), so it's unlikely to do any good.

As far as I know, we fixed all races except symlink races caused by malicious
code outside the jail. Those are controllable by limiting what filesystem access
the prisoners get. A special open() flag which causes open to fail when a part
of the path is a symlink with a distinguishable error code would solve this for
us.

Other than that and the abysmal performance, ptrace is fine for jailing.

Greetings,

Indan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  4:22                                       ` Indan Zupancic
@ 2012-01-18  5:23                                         ` Linus Torvalds
  2012-01-18  6:25                                           ` Linus Torvalds
  2012-01-18  5:43                                         ` Chris Evans
  1 sibling, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18  5:23 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 8:22 PM, Indan Zupancic <indan@nul.nu> wrote:
>
> Looking at EIP - 2 seems like a secure way to check how we entered the kernel.

Secure? No. Not at all.

It's actually very easy to fool it. Do something like this:

 - map the same physical page executably at one address, and writably
4kB above it (use shared memory, and map it twice).

 - in that page, do this:

      lea 1f,%edx
      movl $SYSCALL,%eax
      movl $-1,4096(%edx)
  1:
      int 0x80

and what happens is that the move that *overwrites* the int 0x80 will
not be noticed by the I$ coherency because it's at another address,
but by the time you read at $pc-2, you'll get -1, not "int 0x80"

                  Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  4:22                                       ` Indan Zupancic
  2012-01-18  5:23                                         ` Linus Torvalds
@ 2012-01-18  5:43                                         ` Chris Evans
  2012-01-18 12:12                                           ` Indan Zupancic
  2012-01-18 17:00                                           ` Oleg Nesterov
  1 sibling, 2 replies; 222+ messages in thread
From: Chris Evans @ 2012-01-18  5:43 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Tue, Jan 17, 2012 at 8:22 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 03:22, Andi Kleen wrote:
>>> I'm pretty sure this isn't about changing cs or far jumps
>>
>> He's assuming that code can only run on two code segments and
>> not arbitrarily switch between them which is a completely incorrect
>> assumption.
>
> All I assumed up to now was that cs shows the current mode of the process,
> and that that defines which system call path is taken. Apparently that is
> not true and int 0x80 forces the compat system call path.
>
> Looking at EIP - 2 seems like a secure way to check how we entered the kernel.

For 64-bit processes, you need to look at that (hard due to races) and
_also_ CS.
At least that was the state the last time I played with this in
earnest: http://scary.beasts.org/security/CESA-2009-001.html

I see Linus posted one of the race conditions that "EIP - 2" is
vulnerable to. You can start to chip away at the problem by making
sure your policy doesn't allow mmap() or mprotect() with PROT_EXEC (or
MAP_SHARED) but it's a long battle.

>
>>> I think Indan means code is running with 64-bit cs, but the kernel
>>> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
>>> and there's no way for the ptracer to know which syscall the kernel
>>> will perform, even by looking at all registers.
>
> Yes, that's what I meant.
>
>>> It looks like a hole in ptrace which could be fixed.
>>
>> Possibly, but anything that bases its security on ptrace is typically
>> unfixably racy (just think what happens with multiple threads
>> and syscall arguments), so it's unlikely to do any good.
>
> As far as I know, we fixed all races except symlink races caused by malicious
> code outside the jail.

Are you sure? I've remembered possibly the worst one I encountered,
since my previous e-mail to Jamie:

1) Tracee is compromised; executes fork() which is a syscall that isn't allowed
2) Tracee traps
2b) Tracee could take a SIGKILL here
3) Tracer looks at registers; bad syscall
3b) Or tracee could take a SIGKILL here
4) The only way to stop the bad syscall from executing is to rewrite
orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
syscall has finished)
5) Disaster: the tracee took a SIGKILL so any attempt to address it by
pid (such as PTRACE_SETREGS) fails.
6) Syscall fork() executes; possible unsupervised process now running
since the tracer wasn't expecting the fork() to be allowed.
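
Steps 4 and 5 can be sketched roughly as follows.  deny_syscall is a
made-up name and this is not taken from any existing jailer; it assumes
x86_64 (where the field is orig_rax) and simply reports failure on other
architectures.  The point is the second error path: if the tracee was
killed in the meantime, the ptrace call fails and the tracer has no way
left to cancel the syscall.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <unistd.h>

/*
 * Hypothetical helper: cancel the syscall a stopped tracee is about to
 * make by overwriting its syscall number with -1.
 * Returns 0 on success, -1 if the tracee could not be addressed.
 */
int deny_syscall(pid_t pid)
{
    assert(pid > 0);
#if defined(__x86_64__)
    struct user_regs_struct regs;

    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
        return -1;

    /* An invalid number makes the kernel skip the syscall. */
    regs.orig_rax = (unsigned long long)-1;

    /* This is the call that fails (step 5) if the tracee took a SIGKILL. */
    if (ptrace(PTRACE_SETREGS, pid, NULL, &regs) == -1)
        return -1;
    return 0;
#else
    errno = ENOSYS;   /* sketch only covers x86_64 */
    return -1;
#endif
}
```

A caller that gets -1 back cannot assume the syscall was stopped, which
is exactly the hole described in step 6.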


All this ptrace() security headache is why vsftpd is waiting for
Will's seccomp enhancements to hit the kernel. Then they will be used
pronto.


Cheers
Chris

> Those are controllable by limiting what filesystem access
> the prisoners get. A special open() flag which causes open to fail when a part
> of the path is a symlink with a distinguishable error code would solve this for
> us.
>
> Other than that and the abysmal performance, ptrace is fine for jailing.
>
> Greetings,
>
> Indan
>
>

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  5:23                                         ` Linus Torvalds
@ 2012-01-18  6:25                                           ` Linus Torvalds
  2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
  2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
  0 siblings, 2 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18  6:25 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Tue, Jan 17, 2012 at 9:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  - in that page, do this:
>
>      lea 1f,%edx
>      movl $SYSCALL,%eax
>      movl $-1,4096(%edx)
>  1:
>      int 0x80
>
> and what happens is that the move that *overwrites* the int 0x80 will
> not be noticed by the I$ coherency because it's at another address,
> but by the time you read at $pc-2, you'll get -1, not "int 0x80"

Btw, that I$ coherency comment is not technically the correct explanation.

The I$ coherency isn't the problem, the problem is that the pipeline
has already fetched the "int 0x80" before the write happens. And the
write - because it's not to the same linear address as the code fetch
- won't trigger the internal "pipeline flush on write to code stream".
So the D$ (and I$) will have the -1 in it, but the instruction fetch
will have walked ahead and seen the "int 80" that existed earlier, and
will execute it.

And the above depends very much on uarch details, so depending on
microarchitecture it may or may not work. But I think the "use a
different virtual address, but same physical address" thing will fake
out all modern x86 cpu's, and your 'ptrace' will see the -1, even
though the system call happened.
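
To make the "different virtual address, but same physical address" part
concrete, here is a minimal sketch of how one page ends up visible at two
linear addresses. It only shows the aliasing; it does not map anything
executable and does not attempt the pipeline trick itself. The helper name
map_page_twice() is made up for illustration, and using tmpfile() as the
backing object is an assumption (any shared file or shm mapping would do).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first page of a temporary file twice,
 * MAP_SHARED, so both views refer to the same physical page.
 * Returns 0 and fills in both views on success, -1 on failure.
 */
static int map_page_twice(unsigned char **view_a, unsigned char **view_b)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *f;
	void *a, *b;

	assert(view_a && view_b);
	if (page <= 0)
		return -1;

	f = tmpfile();
	if (!f)
		return -1;
	if (ftruncate(fileno(f), page) < 0) {
		fclose(f);
		return -1;
	}

	a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fileno(f), 0);
	b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fileno(f), 0);
	fclose(f);	/* the mappings keep the underlying file alive */

	if (a == MAP_FAILED || b == MAP_FAILED) {
		if (a != MAP_FAILED)
			munmap(a, page);
		if (b != MAP_FAILED)
			munmap(b, page);
		return -1;
	}

	*view_a = a;
	*view_b = b;
	return 0;
}
```

A store through one view is immediately readable through the other, which
is why a write at "address + 4096" can replace the bytes a tracer later
reads at $pc-2.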

Anyway, the *kernel* knows, since the kernel will have seen which
entrypoint it comes through. So we can handle it in the kernel. But
no, you cannot currently securely/reliably use $pc-2 in gdb or ptrace
to determine how the system call was made, afaik.

Of course, limiting things so that you cannot map the same page
executably *and* writably is one solution - and a good idea regardless
- so secure environments can still exist. But even then you could have
races in a multi-threaded environment (they'd just be *much* harder to
trigger for an attacker).

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  5:43                                         ` Chris Evans
@ 2012-01-18 12:12                                           ` Indan Zupancic
  2012-01-18 21:13                                             ` Chris Evans
  2012-01-18 17:00                                           ` Oleg Nesterov
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-18 12:12 UTC (permalink / raw)
  To: Chris Evans
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,

On Wed, January 18, 2012 06:43, Chris Evans wrote:
>> As far as I know, we fixed all races except symlink races caused by malicious
>> code outside the jail.
>
> Are you sure? I've remembered possibly the worst one I encountered,
> since my previous e-mail to Jamie:
>
> 1) Tracee is compromised; executes fork(), which is a syscall that isn't allowed

How do you mean compromised? Tracees aren't trusted by definition. And fork is
allowed in our jail, we're ptracing all tasks within the jail.

> 2) Tracee traps
> 2b) Tracee could take a SIGKILL here
> 3) Tracer looks at registers; bad syscall
> 3b) Or tracee could take a SIGKILL here
> 4) The only way to stop the bad syscall from executing is to rewrite
> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> syscall has finished)

Yes, we rewrite it to -1.

> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> pid (such as PTRACE_SETREGS) fails.

I assume that if a task can execute system calls and we get ptrace events
for that, that we can do other ptrace operations too. Are you saying that
the kernel has this ptrace gap between SIGKILL and task exit where ptrace
doesn't work but the task continues executing system calls? That would be
a huge bug, but it seems very unlikely too, as the task is stopped and
shouldn't be able to disappear till it is continued by the tracer.

I mean, really? That would be stupid.

If true we have to work around it by disallowing SIGKILL and just sending
them ourselves within the jail. Meh.

> 6) Syscall fork() executes; possible unsupervised process now running
> since the tracer wasn't expecting the fork() to be allowed.

We use PTRACE_O_TRACEFORK (or replace it with clone and set CLONE_PTRACE
for 2.4 kernels. Yes, I check for CLONE_UNTRACED in clone calls.)
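
Roughly, the clone() flag handling looks like the sketch below. This is an
illustration only, not our actual jail code: both helper names are made up,
and the fallback #defines are only there in case the libc headers do not
expose the flags.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdbool.h>

/* Fallbacks for headers that hide these flags (values from the Linux ABI). */
#ifndef CLONE_PTRACE
#define CLONE_PTRACE	0x00002000
#endif
#ifndef CLONE_UNTRACED
#define CLONE_UNTRACED	0x00800000
#endif

/*
 * Hypothetical policy check: a clone() that asks for CLONE_UNTRACED
 * would let the child escape the tracer, so the jail refuses it.
 */
static bool clone_flags_allowed(unsigned long flags)
{
	return (flags & CLONE_UNTRACED) == 0;
}

/*
 * Hypothetical rewrite for kernels without PTRACE_O_TRACEFORK:
 * force CLONE_PTRACE so the child starts out traced.
 */
static unsigned long clone_flags_force_trace(unsigned long flags)
{
	return flags | CLONE_PTRACE;
}
```

The tracer would apply these to the flags argument it reads from the
stopped tracee's registers before letting the call proceed.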

>
> All this ptrace() security headache is why vsftpd is waiting for
> Will's seccomp enhancements to hit the kernel. Then they will be used
> pronto.

How will you avoid file path races with BPF?

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18  6:25                                           ` Linus Torvalds
@ 2012-01-18 13:12                                             ` Indan Zupancic
  2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-18 13:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, January 18, 2012 07:25, Linus Torvalds wrote:
> On Tue, Jan 17, 2012 at 9:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> 	- in that page, do this:
>>
>> 			lea 1f,%edx
>> 			movl $SYSCALL,%eax
>> 			movl $-1,4096(%edx)
>> 	1:
>> 			int 0x80
>>
>> and what happens is that the move that *overwrites* the int 0x80 will
>> not be noticed by the I$ coherency because it's at another address,
>> but by the time you read at $pc-2, you'll get -1, not "int 0x80"

Oh jolly. I feared something like that might have been possible.

> Btw, that I$ coherency comment is not technically the correct explanation.
>
> The I$ coherency isn't the problem, the problem is that the pipeline
> has already fetched the "int 0x80" before the write happens. And the
> write - because it's not to the same linear address as the code fetch
> - won't trigger the internal "pipeline flush on write to code stream".
> So the D$ (and I$) will have the -1 in it, but the instruction fetch
> will have walked ahead and seen the "int 80" that existed earlier, and
> will execute it.
>
> And the above depends very much on uarch details, so depending on
> microarchitecture it may or may not work. But I think the "use a
> different virtual address, but same physical address" thing will fake
> out all modern x86 cpu's, and your 'ptrace' will see the -1, even
> though the system call happened.
>
> Anyway, the *kernel* knows, since the kernel will have seen which
> entrypoint it comes through. So we can handle it in the kernel. But
> no, you cannot currently securely/reliably use $pc-2 in gdb or ptrace
> to determine how the system call was made, afaik.

So there is this gap and there is no good way to handle it at all for
user space? And even if it's fixed in the kernel, that won't help with
older kernels, so it will stay a problem for a while.

Can this int 0x80 trick be blocked for ptraced tasks (preferably always),
pretty please?

> Of course, limiting things so that you cannot map the same page
> executably *and* writably is one solution - and a good idea regardless
> - so secure environments can still exist.

We have the infrastructure in place to do that, though it would be a hassle.
But browsing around in /proc/$PID/maps, it seems w+x mappings are very
common, and we want to jail normal programs, so that seems a bit of a
problem. We could disallow system calls coming from such double mapped
memory, instead of disallowing such mappings altogether.

We'd either need to keep track of all mappings or scan /proc/$PID/maps.
Because that is a pain, we need to cache the results and invalidate or
update the cache after each new writeable mapping.
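
A minimal sketch of the scanning part, assuming the usual
"start-end perms offset dev inode path" line format of /proc/$PID/maps.
The function names are made up. Note that it only flags a single mapping
that is both writable and executable; catching the same page mapped
writable at one address and executable at another would additionally need
comparing device, inode and offset across mappings, which is not shown.

```c
#include <stdbool.h>
#include <stdio.h>

/* True if one maps line describes a mapping that is both 'w' and 'x'. */
static bool maps_line_is_wx(const char *line)
{
	unsigned long start, end;
	char perms[5] = "";

	if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
		return false;
	return perms[1] == 'w' && perms[2] == 'x';
}

/*
 * Count w+x mappings of a process.
 * Returns -1 if /proc/<pid>/maps cannot be opened.
 */
static int count_wx_mappings(int pid)
{
	char path[64];
	char line[4352];	/* room for the fixed fields plus a long path */
	FILE *f;
	int n = 0;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (maps_line_is_wx(line))
			n++;
	fclose(f);
	return n;
}
```

The result is only valid until the next mmap()/mprotect() in the jail,
hence the cache invalidation mentioned above.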

Doable, but starting to look silly and fragile.

I suppose restarting the system call would avoid same-task tricks,
but doesn't solve the other-task-having-a-writeable-mapping problem.

> But even then you could have
> races in a multi-threaded environment (they'd just be *much* harder to
> trigger for an attacker).

All hostile threads are either jailed or running as a different user,
so at least the mapping checks can be done race-free. Syscall from
unknown mappings can be disallowed.

I hope there is a really dirty trick that works reliably to find a very
subtle difference between system calls entered via 'syscall' and 'int 0x80'.

At this point it starts to look attractive to only allow system calls
coming from vdso and protecting the vdso mapping (or is that done by
the kernel already?) System calls coming from elsewhere can be
restarted at the vdso (need to fix up EIP post-syscall then too.)
All in all something like this seems the simplest and most practical
solution to me.

Anyone got any better idea?

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  2:46                                         ` Linus Torvalds
@ 2012-01-18 14:06                                           ` Martin Mares
  2012-01-18 18:24                                             ` Andi Kleen
  0 siblings, 1 reply; 222+ messages in thread
From: Martin Mares @ 2012-01-18 14:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Jamie Lokier, Andi Kleen, Indan Zupancic,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

Hello!

> > That would be incompatible.
> 
> No it wouldn't.
> 
> We'd only do it for the case that everybody gets wrong: int80 from a
> 64-bit context.

Not everybody. There are programs which try hard to distinguish between
int80 and syscall. One such example is a sandbox for programming contests
I wrote several years ago. It analyses the instruction before EIP and as
it does not allow threads nor executing writeable memory, it should be
correct.
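
For reference, the check boils down to something like the sketch below.
This is an illustration, not the sandbox's actual code: the names are made
up, and fetching the two bytes before the instruction pointer (for example
with PTRACE_PEEKTEXT) is left out. It is only meaningful under the
assumptions above, i.e. when nothing can rewrite those bytes behind the
tracer's back.

```c
/* How the two bytes preceding the saved instruction pointer decode. */
enum entry_insn {
	ENTRY_UNKNOWN,
	ENTRY_INT80,	/* cd 80: int $0x80, 32-bit syscall table */
	ENTRY_SYSCALL	/* 0f 05: syscall, 64-bit syscall table */
};

/* insn must point at the 2 bytes read from (instruction pointer - 2). */
static enum entry_insn classify_entry(const unsigned char *insn)
{
	if (insn[0] == 0xcd && insn[1] == 0x80)
		return ENTRY_INT80;
	if (insn[0] == 0x0f && insn[1] == 0x05)
		return ENTRY_SYSCALL;
	return ENTRY_UNKNOWN;
}
```

Anything that decodes as ENTRY_UNKNOWN is best treated as a violation
rather than guessed at.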

The change you propose would break it. It is not a huge deal, I can fix it
in a minute, but I suspect there are other such pieces of code in the wild.

However, having TS_COMPAT available through ptrace would be great and I do not
see any other nice way to export it to userspace, so maybe breaking the
ABI in this case is acceptable.

				Have a nice fortnight
-- 
Martin `MJ' Mares                          <mj@ucw.cz>   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Anything is good and useful if it's made of chocolate.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  6:25                                           ` Linus Torvalds
  2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
@ 2012-01-18 15:04                                             ` Eric Paris
  2012-01-18 17:51                                               ` Linus Torvalds
  1 sibling, 1 reply; 222+ messages in thread
From: Eric Paris @ 2012-01-18 15:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Tue, 2012-01-17 at 22:25 -0800, Linus Torvalds wrote:

> Of course, limiting things so that you cannot map the same page
> executably *and* writably is one solution - and a good idea regardless
> - so secure environments can still exist. But even then you could have
> races in a multi-threaded environment (they'd just be *much* harder to
> trigger for an attacker).

Gratuitous SELinux for the win e-mail!  (Feel free to delete now)  We
typically, for all confined domains, do not allow mapping anonymous
memory both W and X.  Actually you can't even map it W and then map it
X...

Now if there is a file for which you have both W and X SELinux permissions
(which is rare, but not impossible) you could map it in two places.  So
we can (and do) build SELinux sandboxes which address this.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  5:43                                         ` Chris Evans
  2012-01-18 12:12                                           ` Indan Zupancic
@ 2012-01-18 17:00                                           ` Oleg Nesterov
  2012-01-18 17:12                                             ` Oleg Nesterov
  2012-01-19  0:29                                             ` Indan Zupancic
  1 sibling, 2 replies; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-18 17:00 UTC (permalink / raw)
  To: Chris Evans
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On 01/17, Chris Evans wrote:
>
> 1) Tracee is compromised; executes fork(), which is a syscall that isn't allowed
> 2) Tracee traps
> 2b) Tracee could take a SIGKILL here
> 3) Tracer looks at registers; bad syscall
> 3b) Or tracee could take a SIGKILL here
> 4) The only way to stop the bad syscall from executing is to rewrite
> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> syscall has finished)
> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> pid (such as PTRACE_SETREGS) fails.
> 6) Syscall fork() executes; possible unsupervised process now running
> since the tracer wasn't expecting the fork() to be allowed.

As for fork() in particular, it can't succeed after SIGKILL.

But I agree, probably it makes sense to change ptrace_stop() to check
fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
in TASK_TRACED. Or we can change tracehook_report_syscall_entry()

	-	return 0;
	+	return !fatal_signal_pending();

(no, I do not literally mean the change above)

Not only for security. The current behaviour sometimes confuses
users: a debugger sends SIGKILL to the tracee and assumes it will
die asap, but the tracee exits only after the syscall.

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:00                                           ` Oleg Nesterov
@ 2012-01-18 17:12                                             ` Oleg Nesterov
  2012-01-18 21:09                                               ` Chris Evans
  2012-02-07 11:45                                               ` Indan Zupancic
  2012-01-19  0:29                                             ` Indan Zupancic
  1 sibling, 2 replies; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-18 17:12 UTC (permalink / raw)
  To: Chris Evans
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On 01/18, Oleg Nesterov wrote:
>
> On 01/17, Chris Evans wrote:
> >
> > 1) Tracee is compromised; executes fork(), which is a syscall that isn't allowed
> > 2) Tracee traps
> > 2b) Tracee could take a SIGKILL here
> > 3) Tracer looks at registers; bad syscall
> > 3b) Or tracee could take a SIGKILL here
> > 4) The only way to stop the bad syscall from executing is to rewrite
> > orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> > syscall has finished)
> > 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> > pid (such as PTRACE_SETREGS) fails.
> > 6) Syscall fork() executes; possible unsupervised process now running
> > since the tracer wasn't expecting the fork() to be allowed.
>
> As for fork() in particular, it can't succeed after SIGKILL.
>
> But I agree, probably it makes sense to change ptrace_stop() to check
> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>
> 	-	return 0;
> 	+	return !fatal_signal_pending();
>
> (no, I do not literally mean the change above)
>
> Not only for security. The current behaviour sometimes confuses
> users: a debugger sends SIGKILL to the tracee and assumes it will
> die asap, but the tracee exits only after the syscall.

Something like the patch below.

Oleg.

--- x/include/linux/tracehook.h
+++ x/include/linux/tracehook.h
@@ -54,12 +54,12 @@ struct linux_binprm;
 /*
  * ptrace report for syscall entry and exit looks identical.
  */
-static inline void ptrace_report_syscall(struct pt_regs *regs)
+static inline int ptrace_report_syscall(struct pt_regs *regs)
 {
 	int ptrace = current->ptrace;
 
 	if (!(ptrace & PT_PTRACED))
-		return;
+		return 0;
 
 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
 
@@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
 		send_sig(current->exit_code, current, 1);
 		current->exit_code = 0;
 	}
+
+	return fatal_signal_pending(current);
 }
 
 /**
@@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
 static inline __must_check int tracehook_report_syscall_entry(
 	struct pt_regs *regs)
 {
-	ptrace_report_syscall(regs);
-	return 0;
+	return ptrace_report_syscall(regs);
 }
 
 /**


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 15:04                                             ` Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF] Eric Paris
@ 2012-01-18 17:51                                               ` Linus Torvalds
  0 siblings, 0 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 17:51 UTC (permalink / raw)
  To: Eric Paris
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 7:04 AM, Eric Paris <eparis@redhat.com> wrote:
>
> Gratuitous SELinux for the win e-mail!  (Feel free to delete now)  We
> typically, for all confined domains, do not allow mapping anonymous
> memory both W and X.  Actually you can't even map it W and then map it
> X...

That doesn't help.

Anonymous memory is the *one* kind of mapping that this cannot happen
for - because then you have the same page mapped only at one
particular virtual address (and all modern x86's are entirely coherent
in the pipeline for that case, afaik).

> Now if there is file which you have both W and X SELinux permissions
> (which is rare, but not impossible) you could map it in two places.  So
> we can (and do) build SELinux sandboxes which address this.

So the cases that matter are file-backed and various shared memory setups.

                   Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 14:06                                           ` Martin Mares
@ 2012-01-18 18:24                                             ` Andi Kleen
  2012-01-19 16:04                                               ` Jamie Lokier
  0 siblings, 1 reply; 222+ messages in thread
From: Andi Kleen @ 2012-01-18 18:24 UTC (permalink / raw)
  To: Martin Mares
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andi Kleen,
	Indan Zupancic, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module

> Not everybody. There are programs which try hard to distinguish between
> int80 and syscall. One such example is a sandbox for programming contests
> I wrote several years ago. It analyses the instruction before EIP and as
> it does not allow threads nor executing writeable memory, it should be
> correct.

There are other ways to break it, like using the syscall itself to change
input arguments, or using ptrace from another process.

Generally there are so many races with ptrace that if you want to do
things like that it's better to use a LSM. That's what they are for.

-Andi


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 13:12                                             ` Compat 32-bit syscall entry from 64-bit task!? Indan Zupancic
@ 2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 19:36                                                 ` Andi Kleen
                                                                   ` (3 more replies)
  0 siblings, 4 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 19:31 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 4199 bytes --]

On Wed, Jan 18, 2012 at 5:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>
> So there is this gap and there is no good way to handle it at all for
> user space? And even if it's fixed in the kernel, that won't help with
> older kernels, so it will stay a problem for a while.

Correct.

> Can this int 0x80 trick be blocked for ptraced task (preferably always),
> pretty please?

Nope. Not that I can tell. The "unable to read $pc-2" is a hardware
feature, and we cannot stop users from running the "int 0x80" code.
The only way to block it is to simply not enable the 32-bit
compatibility mode at all, at which point the "int 0x80" interface
simply doesn't exist.

And sure, we could do something in the kernel (like saying that you
cannot do "int 0x80" from 64-bit code by explicitly testing in the
ia32_syscall function), but that has the same "even if it's fixed in
the kernel" issue.

You can test this feature out with a test-program something like this:

  #define _GNU_SOURCE	/* must come before any include */
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <signal.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  void handler(int sig)
  {
	printf("SIGWINCH\n");
  }

  int main(int argc, char **argv)
  {
	long nr = 29;	/* eax is overwritten with the return value */

	signal(SIGWINCH, handler);
	asm volatile("int $0x80" : "+a" (nr) : : "memory");	/* sys_pause - 32-bit */
	syscall(34);	/* sys_pause - 64-bit */
	return 0;
  }

which does two "pause()" system calls from 64-bit mode, the first one
using the legacy system call interface.

At least "strace" gets really confused, and will show the first one as

   shmget(0x1c, 140734112566944, 0)        = ? ERESTARTNOHAND (To be restarted)

because it assumes that in 64-bit mode, system call number 29 means
"shmget". It doesn't even look at $pc-2, which (since this code
doesn't try to obfuscate it) would have worked in this case.

I actually checked the strace source code. It has

  #  if 0
                /* This version analyzes the opcode of a syscall instruction.
                 * (int 0x80 on i386 vs. syscall on x86-64)
                 * It works, but is too complicated.
                 */
                unsigned long val, rip, i;

                if (upeek(tcp, 8*RIP, &rip) < 0)
                        perror("upeek(RIP)");

                /* sizeof(syscall) == sizeof(int 0x80) == 2 */
                rip -= 2;
                errno = 0;
              ...

so there is code there that could make it work, but it's #ifdef'ed
out. The actually used code just does

                /* Check CS register value. On x86-64 linux it is:
                 *      0x33    for long mode (64 bit)
                 *      0x23    for compatibility mode (32 bit)
                 * It takes only one ptrace and thus doesn't need
                 * to be cached.
                 */
                if (upeek(tcp, 8*CS, &val) < 0)
                        return -1;
                switch (val) {
                        case 0x23: currpers = 1; break;
                        case 0x33: currpers = 0; break;

which is the reasonable and obvious approach.

I'm looking at "struct user_regs_struct" and there really isn't any
non-architected state there outside of "high bits".

There are high bits that we can hide things in outside of orig_ax - we
do have 64 bits for "cs" for example - but it all boils down to the
same issue: we *will* break something that thinks it knows the details
of this. The advantage of "orig_eax" would be that at least it makes
conceptual sense there.

Using the high bits of 'eflags' might work. Hopefully nobody tests
that. IOW, something like the attached might work. It just sets bit#32
in eflags if the system call is a compat call.

With that, ptrace would at least be able to tell (assuming a new
kernel, of course - it would still need to have the "look at cs" as a
fallback) if it's a compat call or not, but it could do something like

   mode = (eflags >> 32) & 3;
   switch (mode) {
   case 0:
          .. guess it from CS ..
   case 1:
           64-bit
   case 2:
            32-bit
   default:
            Oddity.
   }

or something like that. The idea being that you can also see from
eflags whether the new feature is supported or not.
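
Spelled out as a sketch, with the CS fallback folded in. The names are
made up for illustration, the eflags bits are the ones from the untested
patch attached here (not an existing ABI), and the CS values are the ones
from the strace comment quoted above.

```c
#include <stdint.h>

enum sc_mode { SC_MODE_64, SC_MODE_32, SC_MODE_UNKNOWN };

/*
 * Decode the syscall mode from the eflags and cs values a tracer reads
 * from the tracee. Bit 32 of eflags means "64-bit system call", bit 33
 * means "compat system call"; neither set means the kernel does not
 * report it and we have to guess from CS.
 */
static enum sc_mode syscall_mode(uint64_t eflags, uint64_t cs)
{
	switch ((eflags >> 32) & 3) {
	case 1:
		return SC_MODE_64;
	case 2:
		return SC_MODE_32;
	case 0:
		/* Old kernel: fall back to the CS selector. */
		if (cs == 0x33)
			return SC_MODE_64;
		if (cs == 0x23)
			return SC_MODE_32;
		return SC_MODE_UNKNOWN;
	default:
		/* Both bits set: oddity. */
		return SC_MODE_UNKNOWN;
	}
}
```

With the fallback, "int 0x80" from 64-bit code on an old kernel is still
misreported as a 64-bit call, which is the gap discussed above.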

THIS IS TOTALLY UNTESTED!

                      Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 934 bytes --]

 arch/x86/kernel/ptrace.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 50267386b766..e7b019cd88d3 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -353,6 +353,7 @@ static int set_segment_reg(struct task_struct *task,
 
 static unsigned long get_flags(struct task_struct *task)
 {
+	int bit = 32;
 	unsigned long retval = task_pt_regs(task)->flags;
 
 	/*
@@ -361,7 +362,12 @@ static unsigned long get_flags(struct task_struct *task)
 	if (test_tsk_thread_flag(task, TIF_FORCED_TF))
 		retval &= ~X86_EFLAGS_TF;
 
-	return retval;
+#ifdef CONFIG_IA32_EMULATION
+	/* Set bit 32 for 64-bit system calls, bit 33 for compat system calls */
+	bit += (task_thread_info(task)->status & TS_COMPAT) / TS_COMPAT;
+#endif
+
+	return retval | (1ul << bit);
 }
 
 static int set_flags(struct task_struct *task, unsigned long value)

^ permalink raw reply related	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
@ 2012-01-18 19:36                                                 ` Andi Kleen
  2012-01-18 19:39                                                   ` Linus Torvalds
  2012-01-18 19:41                                                   ` Martin Mares
  2012-01-18 19:38                                                 ` Andrew Lutomirski
                                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 222+ messages in thread
From: Andi Kleen @ 2012-01-18 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro


The real fix is really to use a LSM for custom jails.  Trying to make 
ptrace secure is trying to make a sieve watertight by plugging the individual
holes one by one. It's simply not suitable for this.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 19:36                                                 ` Andi Kleen
@ 2012-01-18 19:38                                                 ` Andrew Lutomirski
  2012-01-19 16:01                                                   ` Jamie Lokier
  2012-01-18 20:26                                                 ` Linus Torvalds
  2012-01-25 19:36                                                 ` Oleg Nesterov
  3 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-18 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> The actually used code just does
>
>                /* Check CS register value. On x86-64 linux it is:
>                 *      0x33    for long mode (64 bit)
>                 *      0x23    for compatibility mode (32 bit)
>                 * It takes only one ptrace and thus doesn't need
>                 * to be cached.
>                 */
>                if (upeek(tcp, 8*CS, &val) < 0)
>                        return -1;
>                switch (val) {
>                        case 0x23: currpers = 1; break;
>                        case 0x33: currpers = 0; break;
>
> which is the reasonable and obvious approach.

*sigh*

It's reasonable, obvious, and even more wrong than it appears.  On
Xen, there's an extra 64-bit GDT entry, and it gets used by default.
(I got bitten by this in some iteration of the vsyscall emulation
patches -- see user_64bit_mode for the correct and
unusable-from-user-mode way to do this.)
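
A minimal sketch of why the two-value switch quoted above misclassifies
tasks under Xen, written as ordinary user-space C. The selector 0xe033
is an assumption (it is what I recall Xen PV's FLAT_RING3_CS64 being);
the function names are hypothetical and come from neither strace nor the
kernel.

```c
#include <stdint.h>

#define USER_CS_64     0x33u    /* native 64-bit user code segment */
#define USER_CS_32     0x23u    /* compat (32-bit) user code segment */
/* Assumption: the extra 64-bit user selector a Xen PV guest runs with. */
#define XEN_USER_CS_64 0xe033u

/* The strace-style check: it only knows the two bare-metal selectors.
 * Returns 0 for 64-bit, 1 for 32-bit, -1 for an unrecognised selector. */
int naive_personality(uint64_t cs)
{
	switch (cs) {
	case USER_CS_32: return 1;
	case USER_CS_64: return 0;
	default:         return -1;
	}
}

/* Same check, but it also accepts the (assumed) Xen 64-bit selector. */
int xen_aware_personality(uint64_t cs)
{
	if (cs == XEN_USER_CS_64)
		return 0;
	return naive_personality(cs);
}
```

Even the second variant only describes the code segment. It cannot tell
which system call entry path a 64-bit task actually took, which is the
problem this thread is about.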

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:36                                                 ` Andi Kleen
@ 2012-01-18 19:39                                                   ` Linus Torvalds
  2012-01-18 19:44                                                     ` Andi Kleen
  2012-01-18 19:41                                                   ` Martin Mares
  1 sibling, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 19:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 11:36 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> The real fix is really to use a LSM for custom jails.  Trying to make
> ptrace secure is trying to make a sieve watertight by plugging the individual
> holes one by one. It's simply not suitable for this.

Umm. But the exact same is true of "LSM for custom jail". It's a
f*&^ing disaster, and it's a whole lot more complicated than ptrace.

Plus it can't even do what ptrace does, so what's the point?  There's
a lot of system calls that don't have any kind of lsm hooks, and
shouldn't. Exactly because THAT is a "plugging individual holes one by
one" approach.

                     Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:36                                                 ` Andi Kleen
  2012-01-18 19:39                                                   ` Linus Torvalds
@ 2012-01-18 19:41                                                   ` Martin Mares
  1 sibling, 0 replies; 222+ messages in thread
From: Martin Mares @ 2012-01-18 19:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Hello!

> The real fix is really to use a LSM for custom jails.  Trying to make 
> ptrace secure is trying to make a sieve watertight by plugging the individual
> holes one by one. It's simply not suitable for this.

As long as the set of syscalls which are permitted is trivial,
it should be secure and much easier than writing a custom LSM.

Regardless, having working strace would be nice.

				Have a nice fortnight
-- 
Martin `MJ' Mares                          <mj@ucw.cz>   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"Never send to know for whom the bell tolls: it tolls for thee." -- John Donne

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:39                                                   ` Linus Torvalds
@ 2012-01-18 19:44                                                     ` Andi Kleen
  2012-01-18 19:47                                                       ` Linus Torvalds
  0 siblings, 1 reply; 222+ messages in thread
From: Andi Kleen @ 2012-01-18 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

> Umm. But the exact same is true of "LSM for custom jail". It's a
> f*&^ing disaster, and it's a whole lot more complicated than ptrace.
> 
> Plus it can't even do what ptrace does, so what's the point?  There's

It can securely enable syscall auditing which can catch all syscalls
(however you only get race free memory arguments for the ones with LSM hooks 
at the right place). Really need both.

I agree it's not easy to get tight (and also not pretty), but you have a lot 
better chance doing it this way than with ptrace.

-Andi

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:44                                                     ` Andi Kleen
@ 2012-01-18 19:47                                                       ` Linus Torvalds
  2012-01-18 19:52                                                         ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 19:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Indan Zupancic, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 11:44 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> It can securely enable syscall auditing which can catch all syscalls
> (however you only get race free memory arguments for the ones with LSM hooks
> at the right place). Really need both.
>
> I agree it's not easy to get tight (and also not pretty), but you have a lot
> better chance doing it this way than with ptrace.

.. And how the f*^& did you imagine that something like chrome would do that?

You need massive amounts of privileges, and it's a total disaster in
every single respect.

Stop pushing crap. No, ptrace isn't wonderful, but your LSM+auditing
idea is a billion times worse in all respects.

We can definitely fix the ptrace issue with compat system calls.

THERE IS NO WAY IN HELL YOU CAN EVER FIX LSM+AUDIT TO BE USABLE!

Stop bothering to even bring it up. It's dead, Jim.

               Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:47                                                       ` Linus Torvalds
@ 2012-01-18 19:52                                                         ` Will Drewry
  2012-01-18 19:58                                                           ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-18 19:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 1:47 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 18, 2012 at 11:44 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> It can securely enable syscall auditing which can catch all syscalls
>> (however you only get race free memory arguments for the ones with LSM hooks
>> at the right place). Really need both.
>>
>> I agree it's not easy to get tight (and also not pretty), but you have a lot
>> better chance doing it this way than with ptrace.
>
> .. And how the f*^& did you imagine that something like chrome would do that?
>
> You need massive amounts of privileges, and it's a total disaster in
> every single respect.
>
> Stop pushing crap. No, ptrace isn't wonderful, but your LSM+auditing
> idea is a billion times worse in all respects.
>
> We can definitely fix the ptrace issue with compat system calls.

FWIW, it looks like audit needs fixing too.  If a process only uses
TIF_SYSCALL_AUDIT, then the fast-path will properly annotate the entry
with AUDIT_ARCH_I386, but if it takes the slow path because of some
other tracing on a thread (ftrace, ptrace, ...), then the audit record
will incorrectly use TIF_IA32 to write the audit record.  Easy patch
(I'll write it up shortly), but yet another case of breakage.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:52                                                         ` Will Drewry
@ 2012-01-18 19:58                                                           ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-18 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Indan Zupancic, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Wed, Jan 18, 2012 at 1:52 PM, Will Drewry <wad@chromium.org> wrote:
> On Wed, Jan 18, 2012 at 1:47 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, Jan 18, 2012 at 11:44 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>>
>>> It can securely enable syscall auditing which can catch all syscalls
>>> (however you only get race free memory arguments for the ones with LSM hooks
>>> at the right place). Really need both.
>>>
>>> I agree it's not easy to get tight (and also not pretty), but you have a lot
>>> better chance doing it this way than with ptrace.
>>
>> .. And how the f*^& did you imagine that something like chrome would do that?
>>
>> You need massive amounts of privileges, and it's a total disaster in
>> every single respect.
>>
>> Stop pushing crap. No, ptrace isn't wonderful, but your LSM+auditing
>> idea is a billion times worse in all respects.
>>
>> We can definitely fix the ptrace issue with compat system calls.
>
> FWIW, it looks like audit needs fixing too.  If a process only uses
> TIF_SYSCALL_AUDIT, then the fast-path will properly annotate the entry
> with AUDIT_ARCH_I386, but if it takes the slow path because of some
> other tracing on a thread (ftrace, ptrace, ...), then the audit record
> will incorrectly use TIF_IA32 to write the audit record.  Easy patch
> (I'll write it up shortly), but yet another case of breakage.

Never mind - I mis-dereferenced the IS_IA32 define.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
  2012-01-18 19:36                                                 ` Andi Kleen
  2012-01-18 19:38                                                 ` Andrew Lutomirski
@ 2012-01-18 20:26                                                 ` Linus Torvalds
  2012-01-18 20:55                                                   ` H. Peter Anvin
  2012-02-06  8:32                                                   ` Indan Zupancic
  2012-01-25 19:36                                                 ` Oleg Nesterov
  3 siblings, 2 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 20:26 UTC (permalink / raw)
  To: Indan Zupancic, H. Peter Anvin
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 2734 bytes --]

Added Peter to the cc, since this is now about some x86-specific
things. Ingo was already cc'd earlier.

On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Using the high bits of 'eflags' might work. Hopefully nobody tests
> that. IOW, something like the attached might work. It just sets bit#32
> in eflags if the system call is a compat call.

So that description was bogus, it was what my original patch did, but
not the one I actually sent out (Peter - you can find it on lkml,
although the description below is probably sufficient for you to
understand what it does, or the obvious nature of the attached patch
for strace).

The one I sent out *unconditionally* sets one bit in the high bits of
the returned value of the eflags register from ptrace(), very much on
purpose. That way you can unambiguously see whether it's an old kernel
(bits clear) or a new kernel that supports the feature. On a new
kernel, bit #32 of eflags will be set for a native 64-bit system call,
and bit #33 will be set for a compat system call.
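
The encoding described above can be sketched in user-space C as follows.
This only models the scheme proposed in this thread: tag_eflags() stands
in for what the patched get_flags() would return, and abi_from_eflags()
is what a tracer might do with the value it reads back. Both function
names and the enum are hypothetical.

```c
#include <stdint.h>

enum syscall_abi {
	ABI_UNKNOWN = 0,   /* high bits clear: old kernel, fall back to CS */
	ABI_NATIVE64,      /* bit 32 set: native 64-bit system call */
	ABI_COMPAT32,      /* bit 33 set: compat system call */
};

/* Model of the kernel side: set bit 32 for a native call, or bit 33
 * for a compat call, on top of the real flags value. */
uint64_t tag_eflags(uint64_t flags, int is_compat)
{
	return flags | (1ULL << (32 + (is_compat ? 1 : 0)));
}

/* Model of the tracer side: look only at bits 32 and 33. */
enum syscall_abi abi_from_eflags(uint64_t eflags)
{
	switch ((eflags >> 32) & 3) {
	case 1:  return ABI_NATIVE64;
	case 2:  return ABI_COMPAT32;
	default: return ABI_UNKNOWN;   /* 0 = old kernel, 3 = never produced */
	}
}
```

Because one of the two bits is always set on a kernel with the patch, a
value with both bits clear can only come from a kernel without it.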

And some testing says that it works. In particular, I have a patch to
strace-4.6 that is able to correctly decode my mixed-case binary that
uses both the compat system call and the native system calls from
64-bit long mode. Also, it looks like gdb ignores the high bits of
eflags, since it "knows" that eflags is just a 32-bit register even in
64-bit mode, so the fact that we set some random bits in there doesn't
end up being noisy for at least one debugger.

HOWEVER. I'm not going to guarantee that this is the right approach.
It seems to work, and it clearly gives people real information, but
whether this is the best way to do things or not is open.

The reason I picked 'eflags' was that it

 (a) was easy from an implementation standpoint, since we already have
to handle reading of eflags specially in ptrace (we have to fake out
the resume bit)

 (b) it "kind of" makes sense to make high bits be "system flags",
with low bits being "cpu flags", so it fits at least *some* kind of
conceptual model.

 (c) the other sane places to put it (high bits of CS and/or ORIG_AX)
were being used and compared as 64-bit values at least by strace.
Whether eflags works for all users, I have no idea, but generally you
would never compare eflags for one particular value - you might check
individual bits in eflags, but hopefully setting a few new bits should
not be something that any legacy user would ever really notice.

So there are reasons to think that my patch is sane, but...

Here's the strace patch, so people can look. I didn't even test it on
an old kernel, but the fallback case to the old behavior looks
trivial.

Comments?

                     Linus

[-- Attachment #2: strace.diff --]
[-- Type: text/x-patch, Size: 1031 bytes --]

 syscall.c |   21 +++++++++++++++++++--
 1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/syscall.c b/syscall.c
index e66ac0a95582..edd9cb804318 100644
--- a/syscall.c
+++ b/syscall.c
@@ -901,14 +901,31 @@ get_scno(struct tcb *tcp)
 		long val;
 		int pid = tcp->pid;
 
+		/* Check the high bits of eflags for processor mode */
+		if (upeek(tcp, 8*EFLAGS, &val) < 0)
+			return -1;
+		val >>= 32;
 		/* Check CS register value. On x86-64 linux it is:
 		 * 	0x33	for long mode (64 bit)
 		 * 	0x23	for compatibility mode (32 bit)
 		 * It takes only one ptrace and thus doesn't need
 		 * to be cached.
 		 */
-		if (upeek(tcp, 8*CS, &val) < 0)
-			return -1;
+		switch (val & 3) {
+		case 0:
+			/* Legacy case: check CS */
+			if (upeek(tcp, 8*CS, &val) < 0)
+				return -1;
+			break;
+		case 1:
+			/* "Long mode" value */
+			val = 0x33;
+			break;
+		case 2:
+			/* Compatibility mode */
+			val = 0x23;
+			break;
+		}
 		switch (val) {
 			case 0x23: currpers = 1; break;
 			case 0x33: currpers = 0; break;

^ permalink raw reply related	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 20:26                                                 ` Linus Torvalds
@ 2012-01-18 20:55                                                   ` H. Peter Anvin
  2012-01-18 21:01                                                     ` Linus Torvalds
  2012-02-06  8:32                                                   ` Indan Zupancic
  1 sibling, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-18 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

In the past, I have asked for a metaregister for this, rather than
hijacking a real hardware register.  Firstly because I suspect we're
going to need more than one bit like this, and second because it seems
cleaner to me.  However, when proposed in the past Roland McGrath
strongly opposed it for reasons which are unclear to me.

I would really like to not use a hack with the flags, because although
there currently aren't any flags in the high half of RFLAGS they are
architecturally defined and could appear in the future.

If we're going to use bits in an existing register field I would be
happier if we used bits [31:16] of CS, which are unlikely to ever be
used for anything.

	-hpa


On 01/18/2012 12:26 PM, Linus Torvalds wrote:
> Added Peter to the cc, since this is now about some x86-specific
> things. Ingo was already cc'd earlier.
> 
> On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Using the high bits of 'eflags' might work. Hopefully nobody tests
>> that. IOW, something like the attached might work. It just sets bit#32
>> in eflags if the system call is a compat call.
> 
> So that description was bogus, it was what my original patch did, but
> not the one I actually sent out (Peter - you can find it on lkml,
> although the description below is probably sufficient for you to
> understand what it does, or the obvious nature of the attached patch
> for strace).
> 
> The one I sent out *unconditionally* sets one bit in the high bits of
> the returned value of the eflags register from ptrace(), very much on
> purpose. That way you can unambiguously see whether it's an old kernel
> (bits clear) or a new kernel that supports the feature. On a new
> kernel, bit #32 of eflags will be set for a native 64-bit system call,
> and bit #33 will be set for a compat system call.
> 
> And some testing says that it works. In particular, I have a patch to
> strace-4.6 that is able to correctly decode my mixed-case binary that
> uses both the compat system call and the native system calls from
> 64-bit long mode. Also, it looks like gdb ignores the high bits of
> eflags, since it "knows" that eflags is just a 32-bit register even in
> 64-bit mode, so the fact that we set some random bits in there doesn't
> end up being noisy for at least one debugger.
> 
> HOWEVER. I'm not going to guarantee that this is the right approach.
> It seems to work, and it clearly gives people real information, but
> whether this is the best way to do things or not is open.
> 
> The reason I picked 'eflags' was that it
> 
>  (a) was easy from an implementation standpoint, since we already have
> to handle reading of eflags specially in ptrace (we have to fake out
> the resume bit)
> 
>  (b) it "kind of" makes sense to make high bits be "system flags",
> with low bits being "cpu flags", so it fits at least *some* kind of
> conceptual model.
> 
>  (c) the other sane places to put it (high bits of CS and/or ORIG_AX)
> were being used and compared as 64-bit values at least by strace.
> Whether eflags works for all users, I have no idea, but generally you
> would never compare eflags for one particular value - you might check
> individual bits in eflags, but hopefully setting a few new bits should
> not be something that any legacy user would ever really notice.
> 
> So there are reasons to think that my patch is sane, but...
> 
> Here's the strace patch, so people can look. I didn't even test it on
> an old kernel, but the fallback case to the old behavior looks
> trivial.
> 
> Comments?
> 
>                      Linus


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 20:55                                                   ` H. Peter Anvin
@ 2012-01-18 21:01                                                     ` Linus Torvalds
  2012-01-18 21:04                                                       ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 21:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On Wed, Jan 18, 2012 at 12:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> If we're going to use bits in an existing register field I would be
> happier if we used bits [31:16] of CS, which are unlikely to ever be
> used for anything.

See my note about that: I would have preferred CS or ORIG_AX myself,
but that breaks existing binaries. So that isn't really an option.

                Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:01                                                     ` Linus Torvalds
@ 2012-01-18 21:04                                                       ` H. Peter Anvin
  2012-01-18 21:21                                                         ` H. Peter Anvin
  2012-01-18 21:26                                                         ` Linus Torvalds
  0 siblings, 2 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On 01/18/2012 01:01 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 12:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> If we're going to use bits in an existing register field I would be
>> happier if we used bits [31:16] of CS, which are unlikely to ever be
>> used for anything.
> 
> See my note about that: I would have preferred CS or ORIG_AX myself,
> but that breaks existing binaries. So that isn't really an option.
> 

Fair enough.  Sigh.  I still think an actual pseudo-register would be
better.

	-hpa

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:12                                             ` Oleg Nesterov
@ 2012-01-18 21:09                                               ` Chris Evans
  2012-01-23 16:56                                                 ` Oleg Nesterov
  2012-02-07 11:45                                               ` Indan Zupancic
  1 sibling, 1 reply; 222+ messages in thread
From: Chris Evans @ 2012-01-18 21:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

Thanks, Oleg. Seems like this would be a nice change to have. As we
can see, people do use ptrace() as a security technology.

With this in place, you can also (where possible) set up the tracee
with PR_SET_PDEATHSIG==SIGKILL. And then you have defences against
either the tracer or the tracee dying from a stray SIGKILL.
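
For reference, a minimal sketch of the PR_SET_PDEATHSIG setup mentioned
above, using the prctl(2) options of that name. The two wrapper
functions are hypothetical helpers; a real sandbox would call the first
one in the child, after fork() and before the tracee starts running
untrusted code.

```c
#include <signal.h>
#include <sys/prctl.h>

/* Ask the kernel to send SIGKILL to the calling process when its
 * parent dies. Returns 0 on success, -1 on error. */
int kill_me_when_parent_dies(void)
{
	return prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL);
}

/* Read back the configured parent-death signal.
 * Returns the signal number (0 if none is set), or -1 on error. */
int current_pdeathsig(void)
{
	int sig = 0;

	if (prctl(PR_GET_PDEATHSIG, &sig) < 0)
		return -1;
	return sig;
}
```

One caveat: if the parent has already exited by the time the child makes
the call, no signal is delivered, so the child may want to compare
getppid() against the expected tracer pid afterwards.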

On Wed, Jan 18, 2012 at 9:12 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/18, Oleg Nesterov wrote:
>>
>> On 01/17, Chris Evans wrote:
>> >
>> > 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>> > 2) Tracee traps
>> > 2b) Tracee could take a SIGKILL here
>> > 3) Tracer looks at registers; bad syscall
>> > 3b) Or tracee could take a SIGKILL here
>> > 4) The only way to stop the bad syscall from executing is to rewrite
>> > orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> > syscall has finished)
>> > 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> > pid (such as PTRACE_SETREGS) fails.
>> > 6) Syscall fork() executes; possible unsupervised process now running
>> > since the tracer wasn't expecting the fork() to be allowed.
>>
>> As for fork() in particular, it can't succeed after SIGKILL.
>>
>> But I agree, probably it makes sense to change ptrace_stop() to check
>> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
>> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>>
>>       -       return 0;
>>       +       return !fatal_signal_pending();
>>
>> (no, I do not literally mean the change above)
>>
>> Not only for security. The current behaviour sometimes confuses the
>> users. Debugger sends SIGKILL to the tracee and assumes it should
>> die asap, but the tracee exits only after syscall.
>
> Something like the patch below.
>
> Oleg.
>
> --- x/include/linux/tracehook.h
> +++ x/include/linux/tracehook.h
> @@ -54,12 +54,12 @@ struct linux_binprm;
>  /*
>  * ptrace report for syscall entry and exit looks identical.
>  */
> -static inline void ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs)
>  {
>        int ptrace = current->ptrace;
>
>        if (!(ptrace & PT_PTRACED))
> -               return;
> +               return 0;
>
>        ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
>                send_sig(current->exit_code, current, 1);
>                current->exit_code = 0;
>        }
> +
> +       return fatal_signal_pending(current);
>  }
>
>  /**
> @@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
>  static inline __must_check int tracehook_report_syscall_entry(
>        struct pt_regs *regs)
>  {
> -       ptrace_report_syscall(regs);
> -       return 0;
> +       return ptrace_report_syscall(regs);
>  }
>
>  /**
>

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 12:12                                           ` Indan Zupancic
@ 2012-01-18 21:13                                             ` Chris Evans
  2012-01-19  0:14                                               ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Chris Evans @ 2012-01-18 21:13 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>> As far as I know, we fixed all races except symlink races caused by malicious
>>> code outside the jail.
>>
>> Are you sure? I've remembered possibly the worst one I encountered,
>> since my previous e-mail to Jamie:
>>
>> 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>
> How do you mean compromised? Tracees aren't trusted by definition. And fork is
> allowed in our jail, we're ptracing all tasks within the jail.

Right, the tracee isn't trusted because you're worried it _might_ get
compromised.
If it _does_ get compromised, you don't want it playing various tricks
to break out of the ptrace() sandbox.

>
>> 2) Tracee traps
>> 2b) Tracee could take a SIGKILL here
>> 3) Tracer looks at registers; bad syscall
>> 3b) Or tracee could take a SIGKILL here
>> 4) The only way to stop the bad syscall from executing is to rewrite
>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> syscall has finished)
>
> Yes, we rewrite it to -1.
>
>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> pid (such as PTRACE_SETREGS) fails.
>
> I assume that if a task can execute system calls and we get ptrace events
> for that, that we can do other ptrace operations too. Are you saying that
> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
> doesn't work but the task continues executing system calls? That would be
> a huge bug, but it seems very unlikely too, as the task is stopped and
> shouldn't be able to disappear till it is continued by the tracer.
>
> I mean, really? That would be stupid.
>
> If true we have to work around it by disallowing SIGKILL and just sending
> them ourselves within the jail. Meh.
>
>> 6) Syscall fork() executes; possible unsupervised process now running
>> since the tracer wasn't expecting the fork() to be allowed.
>
> We use PTRACE_O_TRACEFORK (or replace it with clone and set CLONE_PTRACE
> for 2.4 kernels. Yes, I check for CLONE_UNTRACED in clone calls.)
>
>>
>> All this ptrace() security headache is why vsftpd is waiting for
>> Will's seccomp enhancements to hit the kernel. Then they will be used
>> pronto.
>
> How will you avoid file path races with BPF?

There is typically no need for file-path based access control in an FTP server.
Take for example anonymous FTP, which will typically be inside a
chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
it, you can have it.


Cheers
Chris

>
> Greetings,
>
> Indan
>
>

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:04                                                       ` H. Peter Anvin
@ 2012-01-18 21:21                                                         ` H. Peter Anvin
  2012-01-18 21:51                                                           ` Roland McGrath
  2012-01-18 21:26                                                         ` Linus Torvalds
  1 sibling, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On 01/18/2012 01:04 PM, H. Peter Anvin wrote:
> On 01/18/2012 01:01 PM, Linus Torvalds wrote:
>> On Wed, Jan 18, 2012 at 12:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> If we're going to use bits in an existing register field I would be
>>> happier if we used bits [31:16] of CS, which are unlikely to ever be
>>> used for anything.
>>
>> See my note about that: I would have preferred CS or ORIG_AX myself,
>> but that breaks existing binaries. So that isn't really an option.
>>
> 
> Fair enough.  Sigh.  I still think an actual pseudo-register would be
> better.
> 

Roland, could you refresh my memory what your objection to this was?

	-hpa


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:04                                                       ` H. Peter Anvin
  2012-01-18 21:21                                                         ` H. Peter Anvin
@ 2012-01-18 21:26                                                         ` Linus Torvalds
  2012-01-18 21:30                                                           ` H. Peter Anvin
  2012-01-19  1:45                                                           ` Indan Zupancic
  1 sibling, 2 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 21:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On Wed, Jan 18, 2012 at 1:04 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Fair enough.  Sigh.  I still think an actual pseudo-register would be
> better.

.. and that breaks existing binaries too, because the indexing is
based on offsets into "struct pt_regs", and while we *could* change
that - leave pt_regs untouched but add a new virtual register - it
would be problematic.

We could add a whole new ptrace() access command (eg
PTRACE_GETSYSTEMREGSET), of course. But that's a lot of effort for
very little gain.

So on the whole, putting it in eflags seemed like the *much* simpler approach.

                 Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:26                                                         ` Linus Torvalds
@ 2012-01-18 21:30                                                           ` H. Peter Anvin
  2012-01-18 21:42                                                             ` Linus Torvalds
  2012-01-19  1:45                                                           ` Indan Zupancic
  1 sibling, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On 01/18/2012 01:26 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:04 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Fair enough.  Sigh.  I still think an actual pseudo-register would be
>> better.
> 
> .. and that breaks existing binaries too, because the indexing is
> based on offsets into "struct pt_regs", and while we *could* change
> that - leave pt_regs untouched but add a new virtual register - it
> would be problematic.
> 
> We could add a whole new ptrace() access command (eg
> PTRACE_GETSYSTEMREGSET), of course. But that's a lot of effort for
> very little gain.
> 
> So on the whole, putting it in eflags seemed like the *much* simpler approach.
> 

I would have assumed it would be a new register set (which could be
expanded in the future if we have additional system information to provide.)

	-hpa


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:30                                                           ` H. Peter Anvin
@ 2012-01-18 21:42                                                             ` Linus Torvalds
  2012-01-18 21:47                                                               ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 21:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On Wed, Jan 18, 2012 at 1:30 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I would have assumed it would be a new register set (which could be
> expanded in the future if we have additional system information to provide.)

Well, I really don't think we want to expose much. In fact, I'd argue
we should expose as little as humanly possible.

Which at this point is literally just a single bit (and effectively
another bit to say "we support the new feature").

So...

              Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:42                                                             ` Linus Torvalds
@ 2012-01-18 21:47                                                               ` H. Peter Anvin
  0 siblings, 0 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On 01/18/2012 01:42 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:30 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I would have assumed it would be a new register set (which could be
>> expanded in the future if we have additional system information to provide.)
> 
> Well, I really don't think we want to expose much. In fact, I'd argue
> we should expose as little as humanly possible.
> 
> Which at this point is literally just a single bit (and effectively
> another bit to say "we support the new feature").
> 
> So...
> 

I actually think we need to also have a bit for some of the 32-bit entry
point differences, since the registers have different meanings for them.
 We have kluges in place for them, but those kluges cause their own
problems when registers are modified.

So that means at least four states (SYSCALL64, SYSENTER, SYSCALL32, INT
80) plus the presence bit.  Furthermore, three out of those states apply
even to pure 32-bit kernels.
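
For illustration only, the four entry states plus a presence bit fit in three
bits. The sketch below is an assumption, not a proposed ABI: the names
entry_path, ENTRY_PRESENT, entry_encode and entry_decode, and the bit layout
itself, are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four entry paths listed above. The numeric values are hypothetical. */
enum entry_path {
    ENTRY_SYSCALL64 = 0,
    ENTRY_SYSENTER  = 1,
    ENTRY_SYSCALL32 = 2,
    ENTRY_INT80     = 3
};

#define ENTRY_PATH_MASK 0x3u       /* two bits select one of four states */
#define ENTRY_PRESENT   (1u << 2)  /* set only by a kernel that reports the path */

/* Pack an entry path together with the presence bit. */
uint32_t entry_encode(enum entry_path path)
{
    return ENTRY_PRESENT | (uint32_t)path;
}

/*
 * Unpack a field. Returns false when the presence bit is clear, i.e. the
 * kernel does not report the entry path and *out is left untouched.
 */
bool entry_decode(uint32_t field, enum entry_path *out)
{
    if (!(field & ENTRY_PRESENT))
        return false;
    *out = (enum entry_path)(field & ENTRY_PATH_MASK);
    return true;
}
```

A tracer would have to treat a false return from entry_decode() as "unknown"
and fall back to whatever heuristic it uses today.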

	-hpa

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:21                                                         ` H. Peter Anvin
@ 2012-01-18 21:51                                                           ` Roland McGrath
  2012-01-18 21:53                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Roland McGrath @ 2012-01-18 21:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On Wed, Jan 18, 2012 at 1:21 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Roland, could you refresh my memory what your objection to this was?

Sorry, I don't really recall.  I could dig through old email archives, but
they'd be archives of messages I sent to you, so you should have them too.

The only principle in this area I recall having an opinion about is that
we should not mix up things that are bona fide user-visible state with
things that aren't.  (By "user-visible" I mean things that the task in
question can see or affect by just doing normal instructions, as opposed
to things only controlled via ptrace, such as the debug registers.)  But
that principle is not really being violated here.

I recall that you and I discussed making the path-of-entry visible somehow
and I was in favor of doing that.  As I recall it, we just never bothered
to follow through.

There are all the concerns about obscure ABI compatibility with
expectations of existing debuggers and so forth, which Linus has mentioned.
For that I can accept his point that things today so mishandle the
int80-from-64 case that something like a new meaning for high bits of
orig_ax or whatnot in just that case would not be actually problematic.
When you and I were discussing a more general feature of distinguishing
int80 from sysenter from syscall from traps from asynchronous interrupts,
that was of more concern.

I do feel strongly that any new means of exposing bona fide user state
ought to be done via the user_regset mechanism.  (i.e., either overloading
some existing user_regs_struct bits if that truly is harmless to
compatibility, or adding a new regset flavor.)  That way it is
automatically recorded in core files, accessible with PTRACE_GETREGSET,
etc.  (But I'm not really working on this stuff any more, so I'm out of the
business of arguing strenuously about such opinions.)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:51                                                           ` Roland McGrath
@ 2012-01-18 21:53                                                             ` H. Peter Anvin
  2012-01-18 23:28                                                               ` Linus Torvalds
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-18 21:53 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On 01/18/2012 01:51 PM, Roland McGrath wrote:
> 
> There are all the concerns about obscure ABI compatibility with
> expectations of existing debuggers and so forth, which Linus has mentioned.
> For that I can accept his point that things today so mishandle the
> int80-from-64 case that something like a new meaning for high bits of
> orig_ax or whatnot in just that case would not be actually problematic.
> When you and I were discussing a more general feature of distinguishing
> int80 from sysenter from syscall from traps from asynchronous interrupts,
> that was of more concern.
> 
> I do feel strongly that any new means of exposing bona fide user state
> ought to be done via the user_regset mechanism.  (i.e., either overloading
> some existing user_regs_struct bits if that truly is harmless to
> compatibility, or adding a new regset flavor.)  That way it is
> automatically recorded in core files, accessible with PTRACE_GETREGSET,
> etc.  (But I'm not really working on this stuff any more, so I'm out of the
> business of arguing strenuously about such opinions.)
> 

I think we can obviously agree that regsets is the only way to go for
any kind of new state.

	-hpa

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:53                                                             ` H. Peter Anvin
@ 2012-01-18 23:28                                                               ` Linus Torvalds
  2012-01-19  0:38                                                                 ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-18 23:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mh

On Wed, Jan 18, 2012 at 1:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I think we can obviously agree that regsets is the only way to go for
> any kind of new state.

So I really don't necessarily agree at all.

Exactly because there is a heavy burden to introducing new models.
It's not only relatively much more kernel code, it's also relatively
much more painful for user code. If we can hide it in existing
structures, user code is *much* better off, because any existing code
to get the state will just continue to work. Otherwise, you need to
have the code to figure out the new structures (how do you compile it
without the new kernel headers?), you need to do the extra accesses
conditionally etc etc.

There's a real cost to introducing new interfaces. There's a *reason*
people try to make do with old ones.

          Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 21:13                                             ` Chris Evans
@ 2012-01-19  0:14                                               ` Indan Zupancic
  2012-01-19  8:16                                                 ` Chris Evans
  2012-01-19 15:40                                                 ` Jamie Lokier
  0 siblings, 2 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-19  0:14 UTC (permalink / raw)
  To: Chris Evans
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,

On Wed, January 18, 2012 22:13, Chris Evans wrote:
> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>> 2) Tracee traps
>>> 2b) Tracee could take a SIGKILL here
>>> 3) Tracer looks at registers; bad syscall
>>> 3b) Or tracee could take a SIGKILL here
>>> 4) The only way to stop the bad syscall from executing is to rewrite
>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>>> syscall has finished)
>>
>> Yes, we rewrite it to -1.
>>
>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>>> pid (such as PTRACE_SETREGS) fails.
>>
>> I assume that if a task can execute system calls and we get ptrace events
>> for that, that we can do other ptrace operations too. Are you saying that
>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
>> doesn't work but the task continues executing system calls? That would be
>> a huge bug, but it seems very unlikely too, as the task is stopped and
>> shouldn't be able to disappear till it is continued by the tracer.
>>
>> I mean, really? That would be stupid.

Okay, I tested this scenario and you're right, we're screwed.

What the hell guys? What about other PID checks in the kernel, are they still
safe if the process looks dead but is still active? Or is it a ptrace-only
problem?
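
A minimal sketch of the deny step in this scenario, assuming an x86-64 tracer
and assuming the failure shows up as ESRCH from ptrace(). The names
deny_result, classify_ptrace_result and deny_syscall are hypothetical and are
not taken from any jailer discussed in this thread. The point is only that a
failed register rewrite must be treated as "the syscall may still run", never
as "the tracee is gone, nothing to do".

```c
#include <errno.h>
#include <stddef.h>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#endif

/* What the jailer learned from trying to neutralise a syscall. */
enum deny_result {
    DENY_OK,          /* syscall number rewritten; safe to resume the tracee */
    DENY_TRACEE_GONE, /* ESRCH: tracee was killed while stopped, so the
                         original syscall may still execute */
    DENY_ERROR        /* any other ptrace failure */
};

/* Map a ptrace() return value and errno to a decision. */
enum deny_result classify_ptrace_result(long ret, int err)
{
    if (ret == 0)
        return DENY_OK;
    return err == ESRCH ? DENY_TRACEE_GONE : DENY_ERROR;
}

#if defined(__linux__) && defined(__x86_64__)
/* Rewrite the syscall number of a tracee stopped at syscall entry to -1. */
enum deny_result deny_syscall(pid_t pid)
{
    struct user_regs_struct regs;

    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) != 0)
        return classify_ptrace_result(-1, errno);

    regs.orig_rax = (unsigned long long)-1; /* no such syscall */

    if (ptrace(PTRACE_SETREGS, pid, NULL, &regs) != 0)
        return classify_ptrace_result(-1, errno);

    return DENY_OK;
}
#endif
```

On DENY_TRACEE_GONE the caller cannot assume the denied call was stopped; the
conservative reaction is to tear down the whole jail.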

>> If true we have to work around it by disallowing SIGKILL and just sending
>> them ourselves within the jail. Meh.

I guess this helps a bit. It doesn't prevent external signals, but prisoners
don't have control over that.

Is this SIGKILL specific or is it true for all task ending signals?

>> How will you avoid file path races with BPF?
>
> There is typically no need for file-path based access control in an FTP server.
> Take for example anonymous FTP, which will typically be inside a
> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
> it, you can have it.

Ah, you count on having root access. We don't.

Do you know any more crazy security destroying holes?

Thanks,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:00                                           ` Oleg Nesterov
  2012-01-18 17:12                                             ` Oleg Nesterov
@ 2012-01-19  0:29                                             ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-19  0:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Evans, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,

On Wed, January 18, 2012 18:00, Oleg Nesterov wrote:
> On 01/17, Chris Evans wrote:
>>
>> 1) Tracee is compromised; executes fork() which is a syscall that isn't allowed
>> 2) Tracee traps
>> 2b) Tracee could take a SIGKILL here
>> 3) Tracer looks at registers; bad syscall
>> 3b) Or tracee could take a SIGKILL here
>> 4) The only way to stop the bad syscall from executing is to rewrite
>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> syscall has finished)
>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> pid (such as PTRACE_SETREGS) fails.
>> 6) Syscall fork() executes; possible unsupervised process now running
>> since the tracer wasn't expecting the fork() to be allowed.
>
> As for fork() in particular, it can't succeed after SIGKILL.

That was sadly exactly the system call I used for testing my code...

> But I agree, probably it makes sense to change ptrace_stop() to check
> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>
> 	-	return 0;
> 	+	return !fatal_signal_pending();
>
> (no, I do not literally mean the change above)
>
> Not only for security. The current behaviour sometimes confuses the
> users. Debugger sends SIGKILL to the tracee and assumes it should
> die asap, but the tracee exits only after syscall.

I didn't expect the tracee to die asap when sending SIGKILL, but I
did for PTRACE_KILL.

Improving this behaviour is highly appreciated, thanks!
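
The idea quoted above can be modelled in userland as follows. This is a toy
model, not kernel code: model_task, model_report_syscall_entry and
model_syscall are hypothetical names, and the only behaviour shown is that a
nonzero return from the entry hook prevents the syscall body from running.

```c
#include <stdbool.h>

/* Toy stand-in for a task; only the one flag that matters here. */
struct model_task {
    bool fatal_signal_pending; /* a SIGKILL arrived while the task was stopped */
};

/*
 * Models the syscall-entry report. In the real kernel the task would sleep
 * in the ptrace stop at this point; when it wakes, a pending fatal signal
 * is reported as "abort this syscall" (nonzero).
 */
int model_report_syscall_entry(const struct model_task *t)
{
    return t->fatal_signal_pending ? 1 : 0;
}

/* Returns true if the syscall body would have executed. */
bool model_syscall(const struct model_task *t)
{
    if (model_report_syscall_entry(t))
        return false; /* skipped: the task is on its way out */
    return true;      /* the syscall body runs */
}
```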

Greetings,

Indan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 23:28                                                               ` Linus Torvalds
@ 2012-01-19  0:38                                                                 ` H. Peter Anvin
  2012-01-20 21:51                                                                   ` Denys Vlasenko
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-19  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mh

On 01/18/2012 03:28 PM, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I think we can obviously agree that regsets is the only way to go for
>> any kind of new state.
> 
> So I really don't necessarily agree at all.
> 
> Exactly because there is a heavy burden to introducing new models.
> It's not only relatively much more kernel code, it's also relatively
> much more painful for user code. If we can hide it in existing
> structures, user code is *much* better off, because any existing code
> to get the state will just continue to work. Otherwise, you need to
> have the code to figure out the new structures (how do you compile it
> without the new kernel headers?), you need to do the extra accesses
> conditionally etc etc.
> 
> There's a real cost to introducing new interfaces. There's a *reason*
> people try to make do with old ones.
> 

Of course.  However, the whole point with regsets is that at the very
least the vast majority of the infrastructure is generic and extends
without a bunch of new machinery.  What you are saying is "we might be
able to get away with existing state", what I'm saying is "if we add
state it should be a regset".

The question of whether this should be new state is currently open.  I
personally would still prefer it if this didn't overlay real CPU state.

	-hpa



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18  1:01                                 ` Andrew Lutomirski
@ 2012-01-19  1:06                                   ` Indan Zupancic
  2012-01-19  1:19                                     ` Andrew Lutomirski
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-19  1:06 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Wed, January 18, 2012 02:01, Andrew Lutomirski wrote:
> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Tue, January 17, 2012 18:45, Andrew Lutomirski wrote:
>>> On Tue, Jan 17, 2012 at 9:05 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> On 01/17, Andrew Lutomirski wrote:
>>>>>
>>>>> (is_compat_task says whether the executable was marked as 32-bit.  The
>>>>> actual execution mode is determined by the cs register, which the user
>>>>> can control.
>>>>
>>>> Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
>>>> along with TS_COMPAT).
>>>>
>>>> TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
>>>> 64-bit or not, we should treat it as 32-bit in this case.
>>>
>>> I think you're right, and checking which entry was used is better than
>>> checking the cs register (since 64-bit code can use int80).  That's
>>> what I get for insufficiently careful reading of the assembly.  (And
>>> for going from memory from when I wrote the vsyscall emulation code --
>>> that code is entered from a page fault, so the entry point used is
>>> irrelevant.)
>>
>> Wait: If a task is set to 64 bit mode, but calls into the kernel via
>> int 0x80 it's changed to 32 bit mode for that system call and back to
>> 64 bit mode when the system call is finished!?
>>
>> Our ptrace jailer is checking cs to figure out if a task is a compat task
>> or not, if the kernel can change that behind our back it means our jailer
>> isn't secure for x86_64 with compat enabled. Or is cs changed before the
>> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
>> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
>> there another way?
>
> I don't know what your ptrace jailer does.  But a task can switch
> itself between 32-bit and 64-bit execution at will, and there's
> nothing the kernel can do about it.  (That isn't quite true -- in
> theory the kernel could fiddle with the GDT, but that would be
> expensive and wouldn't work on Xen.)

That's why we don't cache the CS value but check it for every system call.
But you said elsewhere that checking CS isn't always correct either.
I grepped arch/x86 for "user_64bit_mode", but couldn't find anything,
but maybe my kernel sources are too old, I haven't updated this system
for almost a year. The current code only handles 0x23 and 0x33 and kills
the jail if it encounters anything else.
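
A minimal sketch of the per-syscall CS check described above. Only the two
selector values 0x23 and 0x33 come from this message; the names task_mode and
classify_cs are hypothetical. As noted elsewhere in the thread, this check is
not sufficient on its own: 64-bit code that enters via int 0x80 still shows
0x33, and Xen guests can run with a different selector.

```c
#include <stdint.h>

/* User code-segment selectors the jailer recognises. */
#define CS_USER32 0x23u /* 32-bit (compat) code segment */
#define CS_USER64 0x33u /* 64-bit code segment */

enum task_mode {
    MODE_32BIT,
    MODE_64BIT,
    MODE_UNKNOWN /* anything else: the jailer kills the jail */
};

/* Classify the CS value read from a stopped tracee's registers. */
enum task_mode classify_cs(uint64_t cs)
{
    switch (cs) {
    case CS_USER32:
        return MODE_32BIT;
    case CS_USER64:
        return MODE_64BIT;
    default:
        return MODE_UNKNOWN;
    }
}
```

The value must be re-read at every syscall stop, since the task can change
its execution mode between calls.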

> That being said, is_compat_task is apparently a good indication of
> whether the current *syscall* entry is a 64-bit syscall or a 32-bit
> syscall.  Perhaps the function should be renamed to in_compat_syscall,
> because that's what it does.

That seems like a good idea.

>
>>
>> I think this behaviour is so unexpected that it can only cause security
>> problems in the long run. Is anyone counting on this? Where is this
>> behaviour documented?
>
> Nowhere, I think.

Such is life.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  1:06                                   ` Indan Zupancic
@ 2012-01-19  1:19                                     ` Andrew Lutomirski
  2012-01-19  1:47                                       ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-19  1:19 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Wed, Jan 18, 2012 at 5:06 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 02:01, Andrew Lutomirski wrote:
>> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>> I don't know what your ptrace jailer does.  But a task can switch
>> itself between 32-bit and 64-bit execution at will, and there's
>> nothing the kernel can do about it.  (That isn't quite true -- in
>> theory the kernel could fiddle with the GDT, but that would be
>> expensive and wouldn't work on Xen.)
>
> That's why we don't cache the CS value but check it for every system call.
> But you said elsewhere that checking CS isn't always correct either.
> I grepped arch/x86 for "user_64bit_mode", but couldn't find anything,
> but maybe my kernel sources are too old, I haven't updated this system
> for almost a year. The current code only handles 0x23 and 0x33 and kills
> the jail if it encounters anything else.

I think you're hosed on Xen, then.  Xen regularly runs with a
different Xen-specific cs value.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 21:26                                                         ` Linus Torvalds
  2012-01-18 21:30                                                           ` H. Peter Anvin
@ 2012-01-19  1:45                                                           ` Indan Zupancic
  2012-01-19  2:16                                                             ` H. Peter Anvin
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-19  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wed, January 18, 2012 22:26, Linus Torvalds wrote:
> On Wed, Jan 18, 2012 at 1:04 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Fair enough.  Sigh.  I still think an actual pseudo-register would be
>> better.
>
> .. and that breaks existing binaries too, because the indexing is
> based on offsets into "struct pt_regs", and while we *could* change
> that - leave pt_regs untouched but add a new virtual register - it
> would be problematic.
>
> We could add a whole new ptrace() access command (eg
> PTRACE_GETSYSTEMREGSET), of course. But that's a lot of effort for
> very little gain.
>
> So on the whole, putting it in eflags seemed like the *much* simpler approach.

For security reasons it should be impossible for userspace to set those bits
themselves, otherwise the tracer can be easily fooled on an old kernel. That
seems to be the case for the higher bits of eflags, so eflags would work. And
the current code checks cs, also checking eflags would be very easy to add.

Greetings,

Indan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  1:19                                     ` Andrew Lutomirski
@ 2012-01-19  1:47                                       ` Indan Zupancic
  0 siblings, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-19  1:47 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm,
	torvalds, segoon, rostedt, jmorris, scarybeasts, avi, penberg,
	viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, Roland McGrath,
	Andi Kleen

On Thu, January 19, 2012 02:19, Andrew Lutomirski wrote:
> On Wed, Jan 18, 2012 at 5:06 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Wed, January 18, 2012 02:01, Andrew Lutomirski wrote:
>>> On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@nul.nu> wrote:
>>> I don't know what your ptrace jailer does.  But a task can switch
>>> itself between 32-bit and 64-bit execution at will, and there's
>>> nothing the kernel can do about it.  (That isn't quite true -- in
>>> theory the kernel could fiddle with the GDT, but that would be
>>> expensive and wouldn't work on Xen.)
>>
>> That's why we don't cache the CS value but check it for every system call.
>> But you said elsewhere that checking CS isn't always correct either.
>> I grepped arch/x86 for "user_64bit_mode", but couldn't find anything,
>> but maybe my kernel sources are too old, I haven't updated this system
>> for almost a year. The current code only handles 0x23 and 0x33 and kills
>> the jail if it encounters anything else.
>
> I think you're hosed on Xen, then.  Xen regularly runs with a
> different Xen-specific cs value.

That's fine as long as a cs value of 0x23 or 0x33 gives reliable information.
Not running is highly preferred above running insecurely. Security first,
functionality second.
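
A minimal sketch of that fail-closed check, assuming the jailer reads the
tracee's cs from the saved registers at every syscall stop. The names
classify_cs and enum abi_guess are made up for illustration; 0x23 and 0x33
are the only values taken from this thread. It only describes the execution
mode implied by cs, not which syscall entry path was used.

```c
#include <stdint.h>

/* Hypothetical result type: what the jailer concludes from one cs value. */
enum abi_guess { ABI_32BIT, ABI_64BIT, ABI_UNKNOWN };

/*
 * Classify a user-mode code segment selector read at a syscall stop.
 * Only the two values named in this thread are accepted. Anything else
 * (for example a Xen-specific selector) is reported as unknown, so the
 * caller can kill the jail instead of guessing.
 */
enum abi_guess classify_cs(uint64_t cs)
{
    switch (cs) {
    case 0x23: return ABI_32BIT;
    case 0x33: return ABI_64BIT;
    default:   return ABI_UNKNOWN;
    }
}
```

The caller would treat ABI_UNKNOWN as fatal, which is the "not running"
outcome described above.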

Greetings,

Indan

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19  1:45                                                           ` Indan Zupancic
@ 2012-01-19  2:16                                                             ` H. Peter Anvin
  0 siblings, 0 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-19  2:16 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dla

On 01/18/2012 05:45 PM, Indan Zupancic wrote:
> 
> For security reasons it should be impossible for userspace to set those bits
> themselves, otherwise the tracer can be easily fooled on an old kernel. That
> seems to be the case for the higher bits of eflags, so eflags would work. And
> the current code checks cs, also checking eflags would be very easy to add.
> 

I think this goes without saying, and isn't an issue for the options
currently on the table (including regset).

	-phpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  0:14                                               ` Indan Zupancic
@ 2012-01-19  8:16                                                 ` Chris Evans
  2012-01-19 11:34                                                   ` Indan Zupancic
  2012-01-19 15:40                                                 ` Jamie Lokier
  1 sibling, 1 reply; 222+ messages in thread
From: Chris Evans @ 2012-01-19  8:16 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, Jan 18, 2012 at 4:14 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Wed, January 18, 2012 22:13, Chris Evans wrote:
>> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>>> 2) Tracee traps
>>>> 2b) Tracee could take a SIGKILL here
>>>> 3) Tracer looks at registers; bad syscall
>>>> 3b) Or tracee could take a SIGKILL here
>>>> 4) The only way to stop the bad syscall from executing is to rewrite
>>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>>>> syscall has finished)
>>>
>>> Yes, we rewrite it to -1.
>>>
>>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>>>> pid (such as PTRACE_SETREGS) fails.
>>>
>>> I assume that if a task can execute system calls and we get ptrace events
>>> for that, that we can do other ptrace operations too. Are you saying that
>>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
>>> doesn't work but the task continues executing system calls? That would be
>>> a huge bug, but it seems very unlikely too, as the task is stopped and
>>> shouldn't be able to disappear till it is continued by the tracer.
>>>
>>> I mean, really? That would be stupid.
>
> Okay, I tested this scenario and you're right, we're screwed.
>
> What the hell guys?

Steady on :) ptrace() has never been sold as a technology upon which
it's safe to build security solutions.

> What about other PID checks in the kernel, are they still
> safe if the process looks dead but is still active? Or is it a ptrace-only
> problem?
>
>>> If true we have to work around it by disallowing SIGKILL and just sending
>>> them ourselves within the jail. Meh.
>
> I guess this helps a bit. It doesn't prevent external signals, but prisoners
> don't have control over that.

Well.... a prisoner may be able to play other tricks:
- Allocate lots of memory... kernel may start spraying around SIGKILLs
- Sending SIGKILL via prctl()
- Sending SIGKILL via fcntl()
- Sending SIGKILL via clone()

>
> Is this SIGKILL specific or is it true for all task ending signals?

Can't remember - try it?

>
>>> How will you avoid file path races with BPF?
>>
>> There is typically no need for file-path based access control in an FTP server.
>> Take for example anonymous FTP, which will typically be inside a
>> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
>> it, you can have it.
>
> Ah, you count on having root access. We don't.
>
> Do you know any more crazy security destroying holes?

Try spraying SIGCONT and / or SIGSTOP at tracees. It may be possible
to confuse the tracer about whether a SIGTRAP event is syscall entry
or exit.
Try doing an execve() that fails. May cause similar state confusion in
the tracer.


Cheers
Chris

>
> Thanks,
>
> Indan
>
>

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  8:16                                                 ` Chris Evans
@ 2012-01-19 11:34                                                   ` Indan Zupancic
  2012-01-19 16:11                                                     ` Jamie Lokier
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-19 11:34 UTC (permalink / raw)
  To: Chris Evans
  Cc: Andi Kleen, Jamie Lokier, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,

On Thu, January 19, 2012 09:16, Chris Evans wrote:
> On Wed, Jan 18, 2012 at 4:14 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Wed, January 18, 2012 22:13, Chris Evans wrote:
>>> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>>>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
>>>>> 2) Tracee traps
>>>>> 2b) Tracee could take a SIGKILL here
>>>>> 3) Tracer looks at registers; bad syscall
>>>>> 3b) Or tracee could take a SIGKILL here
>>>>> 4) The only way to stop the bad syscall from executing is to rewrite
>>>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>>>>> syscall has finished)
>>>>
>>>> Yes, we rewrite it to -1.
>>>>
>>>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>>>>> pid (such as PTRACE_SETREGS) fails.
>>>>
>>>> I assume that if a task can execute system calls and we get ptrace events
>>>> for that, that we can do other ptrace operations too. Are you saying that
>>>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
>>>> doesn't work but the task continues executing system calls? That would be
>>>> a huge bug, but it seems very unlikely too, as the task is stopped and
>>>> shouldn't be able to disappear till it is continued by the tracer.
>>>>
>>>> I mean, really? That would be stupid.
>>
>> Okay, I tested this scenario and you're right, we're screwed.
>>
>> What the hell guys?
>
> Steady on :) ptrace() has never been sold as a technology upon which
> it's safe to build security solutions.

Well, that can be said of pretty much all kernel functionality.
That is no excuse for crazy behaviour.

I more or less fixed it by turning all SIGKILLs into SIGTERMs.
Perhaps I should use a more obscure signal instead.

>> What about other PID checks in the kernel, are they still
>> safe if the process looks dead but is still active? Or is it a ptrace-only
>> problem?
>>
>>>> If true we have to work around it by disallowing SIGKILL and just sending
>>>> them ourselves within the jail. Meh.
>>
>> I guess this helps a bit. It doesn't prevent external signals, but prisoners
>> don't have control over that.
>
> Well.... a prisoner may be able to play other tricks:
> - Allocate lots of memory... kernel may start spraying around SIGKILLs
> - Sending SIGKILL via prctl()

prctl is disallowed within our jail. Did you have PR_SET_PDEATHSIG in mind?
But doesn't the tracer become the parent when ptracing or not for this?
Or were you thinking about enabling SECCOMP and counting on the SIGKILL
being process-wide instead of thread-specific?

> - Sending SIGKILL via fcntl()

I haven't written the fcntl demultiplexor yet, but I missed that fcntl could
be used for sending signals. I knew there was whacky stuff in there, but
didn't expect it to be that bad. Thanks.
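
A rough sketch of what such a demultiplexor could check first, assuming it
sees the fcntl command number at syscall entry. The function name
fcntl_cmd_may_signal is hypothetical and the list is not claimed to be
exhaustive; it only covers the commands that choose a signal or its
recipient.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes F_SETSIG and F_SETOWN_EX only with this */
#endif
#include <fcntl.h>
#include <stdbool.h>

/* Linux ABI values, used only if the headers were already included
 * without _GNU_SOURCE and the macros are therefore missing. */
#ifndef F_SETSIG
#define F_SETSIG 10
#endif
#ifndef F_SETOWN_EX
#define F_SETOWN_EX 15
#endif

/* Returns true for fcntl commands that set up signal delivery. */
bool fcntl_cmd_may_signal(int cmd)
{
    switch (cmd) {
    case F_SETOWN:     /* pick the process or group that receives I/O signals */
    case F_SETOWN_EX:  /* same, with thread-level targeting */
    case F_SETSIG:     /* pick which signal is delivered */
        return true;
    default:
        return false;
    }
}
```

A jailer could deny these commands outright, or at least inspect their
argument before letting the call through.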

> - Sending SIGKILL via clone()

How? And can you send it to another process than yourself?

>
>>
>> Is this SIGKILL specific or is it true for all task ending signals?
>
> Can't remember - try it?

Tried: It's safe with SIGTERM, so I assume the others are fine too.
I'll double check though...

>>
>>>> How will you avoid file path races with BPF?
>>>
>>> There is typically no need for file-path based access control in an FTP server.
>>> Take for example anonymous FTP, which will typically be inside a
>>> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
>>> it, you can have it.
>>
>> Ah, you count on having root access. We don't.
>>
>> Do you know any more crazy security destroying holes?
>
> Try spraying SIGCONT and / or SIGSTOP at tracees. It may be possible
> to confuse the tracer about whether a SIGTRAP event is syscall entry
> or exit.

Yes, heard about that weirdness before, but it's all ignored. We're
using PTRACE_O_TRACESYSGOOD.
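
For reference, a small sketch of how a tracer can recognise those stops
from the waitpid() status. The helper name is_syscall_stop is made up; the
SIGTRAP | 0x80 encoding is the documented effect of PTRACE_O_TRACESYSGOOD.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * With PTRACE_O_TRACESYSGOOD set, syscall stops are reported with bit 7
 * set in the stop signal (SIGTRAP | 0x80), so they cannot be mistaken
 * for a plain SIGTRAP that arrived for some other reason.
 */
bool is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```

This only separates syscall stops from other stops. Telling entry from exit
is still up to the tracer's own bookkeeping.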

> Try doing an execve() that fails. May cause similar state confusion in
> the tracer.

Our jailer pretty much ignores all signals and only handles syscalls
and task exits. We actually check execve's return value to know if we
have to do our stuff or not.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19  0:14                                               ` Indan Zupancic
  2012-01-19  8:16                                                 ` Chris Evans
@ 2012-01-19 15:40                                                 ` Jamie Lokier
  1 sibling, 0 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-19 15:40 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Chris Evans, Andi Kleen, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

Indan Zupancic wrote:
> On Wed, January 18, 2012 22:13, Chris Evans wrote:
> > On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
> >> On Wed, January 18, 2012 06:43, Chris Evans wrote:
> >>> 2) Tracee traps
> >>> 2b) Tracee could take a SIGKILL here
> >>> 3) Tracer looks at registers; bad syscall
> >>> 3b) Or tracee could take a SIGKILL here
> >>> 4) The only way to stop the bad syscall from executing is to rewrite
> >>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> >>> syscall has finished)
> >>
> >> Yes, we rewrite it to -1.
> >>
> >>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> >>> pid (such as PTRACE_SETREGS) fails.
> >>
> >> I assume that if a task can execute system calls and we get ptrace events
> >> for that, that we can do other ptrace operations too. Are you saying that
> >> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
> >> doesn't work but the task continues executing system calls? That would be
> >> a huge bug, but it seems very unlikely too, as the task is stopped and
> >> shouldn't be able to disappear till it is continued by the tracer.
> >>
> >> I mean, really? That would be stupid.
> 
> Okay, I tested this scenario and you're right, we're screwed.

Ha!

Perhaps this could be fixed generically in
tracehook_report_syscall_entry(), for those architectures which bother
to call it and bother to disable the syscall if it says to.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:38                                                 ` Andrew Lutomirski
@ 2012-01-19 16:01                                                   ` Jamie Lokier
  2012-01-19 16:13                                                     ` Andrew Lutomirski
  2012-01-19 19:21                                                     ` Linus Torvalds
  0 siblings, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-19 16:01 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

Andrew Lutomirski wrote:
> It's reasonable, obvious, and even more wrong than it appears.  On
> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
> (I got bitten by this in some iteration of the vsyscall emulation
> patches -- see user_64bit_mode for the correct and
> unusable-from-user-mode way to do this.)

Here it is:

	static inline bool user_64bit_mode(struct pt_regs *regs)
	{
	#ifndef CONFIG_PARAVIRT
		/*
		 * On non-paravirt systems, this is the only long mode CPL 3
		 * selector.  We do not allow long mode selectors in the LDT.
		 */
		return regs->cs == __USER_CS;
	#else
		/* Headers are too twisted for this to go in paravirt.h. */
		return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
	#endif
	}

Perhaps userspace can do that.
Would it be right for a ptracer to say:

   CS == 0x23 -> 32-bit
   (CS & 4)   -> 32-bit (LDT, "we do not allow long mode selectors in the LDT")
   else       -> 64-bit (__USER_CS or some other GDT entry which must be pv_info's)

I.e. assume that no other *GDT* CS values are available to userspace?
There are other 32-bit GDT entries, but are they not all for data or kernel use only?
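
Written out as code, the rule in question would look roughly like this. It
is only a sketch of the proposal, not a verified classifier: the name
heuristic_cs_is_64bit is made up, and the assumption that every GDT
selector other than 0x23 is 64-bit is exactly the part in doubt.

```c
#include <stdbool.h>
#include <stdint.h>

#define SELECTOR_TI_LDT 0x4u   /* table-indicator bit: set means LDT */

/* Returns true if the proposed rule would treat this cs as 64-bit code. */
bool heuristic_cs_is_64bit(uint16_t cs)
{
    if (cs == 0x23)
        return false;   /* the usual 32-bit user code selector */
    if (cs & SELECTOR_TI_LDT)
        return false;   /* LDT entry: long mode selectors are not allowed there */
    return true;        /* any other GDT selector is assumed to be 64-bit */
}
```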

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 18:24                                             ` Andi Kleen
@ 2012-01-19 16:04                                               ` Jamie Lokier
  2012-01-20  0:21                                                 ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-19 16:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin Mares, Linus Torvalds, Andi Kleen, Indan Zupancic,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlao

Andi Kleen wrote:
> > Not everybody. There are programs which try hard to distinguish between
> > int80 and syscall. One such example is a sandbox for programming contests
> > I wrote several years ago. It analyses the instruction before EIP and as
> > it does not allow threads nor executing writeable memory, it should be
> > correct.
> 
> There are other ways to break it, like using the syscall itself to change
> input arguments or using ptrace from another process and other ways.
> 
> Generally there are so many races with ptrace that if you want to do
> things like that it's better to use a LSM. That's what they are for.

I could see the LSM approach working *if* there was an LSM module to
make it available to unprivileged userspace.  I.e. a replacement for
ptrace() for this purpose.

It would be nice to be able to trace and check syscall strings properly.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19 11:34                                                   ` Indan Zupancic
@ 2012-01-19 16:11                                                     ` Jamie Lokier
  0 siblings, 0 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-19 16:11 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Chris Evans, Andi Kleen, Andrew Lutomirski, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

Indan Zupancic wrote:
> On Thu, January 19, 2012 09:16, Chris Evans wrote:
> > On Wed, Jan 18, 2012 at 4:14 PM, Indan Zupancic <indan@nul.nu> wrote:
> >> On Wed, January 18, 2012 22:13, Chris Evans wrote:
> >>> On Wed, Jan 18, 2012 at 4:12 AM, Indan Zupancic <indan@nul.nu> wrote:
> >>>> On Wed, January 18, 2012 06:43, Chris Evans wrote:
> >>>>> 2) Tracee traps
> >>>>> 2b) Tracee could take a SIGKILL here
> >>>>> 3) Tracer looks at registers; bad syscall
> >>>>> 3b) Or tracee could take a SIGKILL here
> >>>>> 4) The only way to stop the bad syscall from executing is to rewrite
> >>>>> orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
> >>>>> syscall has finished)
> >>>>
> >>>> Yes, we rewrite it to -1.
> >>>>
> >>>>> 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
> >>>>> pid (such as PTRACE_SETREGS) fails.
> >>>>
> >>>> I assume that if a task can execute system calls and we get ptrace events
> >>>> for that, that we can do other ptrace operations too. Are you saying that
> >>>> the kernel has this ptrace gap between SIGKILL and task exit where ptrace
> >>>> doesn't work but the task continues executing system calls? That would be
> >>>> a huge bug, but it seems very unlikely too, as the task is stopped and
> >>>> shouldn't be able to disappear till it is continued by the tracer.
> >>>>
> >>>> I mean, really? That would be stupid.
> >>
> >> Okay, I tested this scenario and you're right, we're screwed.
> >>
> >> What the hell guys?
> >
> > Steady on :) ptrace() has never been sold as a technology upon which
> > it's safe to build security solutions.
> 
> Well, that can be said of pretty much all kernel functionality.
> That is no excuse for crazy behaviour.
> 
> I more or less fixed it by turning all SIGKILLs into SIGTERMs.
> Perhaps I should use a more obscure signal instead.
> 
> >> What about other PID checks in the kernel, are they still
> >> safe if the process looks dead but is still active? Or is it a ptrace-only
> >> problem?
> >>
> >>>> If true we have to work around it by disallowing SIGKILL and just sending
> >>>> them ourselves within the jail. Meh.
> >>
> >> I guess this helps a bit. It doesn't prevent external signals, but prisoners
> >> don't have control over that.
> >
> > Well.... a prisoner may be able to play other tricks:
> > - Allocate lots of memory... kernel may start spraying around SIGKILLs
> > - Sending SIGKILL via prctl()
> 
> prctl is disallowed within our jail. Did you have PR_SET_PDEATHSIG in mind?
> But doesn't the tracer become the parent when ptracing or not for this?
> Or were you thinking about enabling SECCOMP and counting on the SIGKILL
> being process-wide instead of thread-specific?
> 
> > - Sending SIGKILL via fcntl()
> 
> I haven't written the fcntl demultiplexor yet, but I missed that fcntl could
> be used for sending signals. I knew there was whacky stuff in there, but
> didn't expect it to be that bad. Thanks.
> 
> > - Sending SIGKILL via clone()
> 
> How? And can you send it to another process than yourself?
> 
> >
> >>
> >> Is this SIGKILL specific or is it true for all task ending signals?
> >
> > Can't remember - try it?
> 
> Tried: It's safe with SIGTERM, so I assume the others are fine too.
> I'll double check though...
> 
> >>
> >>>> How will you avoid file path races with BPF?
> >>>
> >>> There is typically no need for file-path based access control in an FTP server.
> >>> Take for example anonymous FTP, which will typically be inside a
> >>> chroot() to /var/ftp. Inside that filesystem tree -- if you can open()
> >>> it, you can have it.
> >>
> >> Ah, you count on having root access. We don't.
> >>
> >> Do you know any more crazy security destroying holes?
> >
> > Try spraying SIGCONT and / or SIGSTOP at tracees. It may be possible
> > to confuse the tracer about whether a SIGTRAP event is syscall entry
> > or exit.
> 
> Yes, heard about that weirdness before, but it's all ignored. We're
> using PTRACE_O_TRACESYSGOOD.
> 
> > Try doing an execve() that fails. May cause similar state confusion in
> > the tracer.
> 
> Our jailer pretty much ignores all signals and only handles syscalls
> and task exits. We actually check execve's return value to know if we
> have to do our stuff or not.

Take a look at the file README-linux-ptrace in recent strace Git.
(Thanks Denys!)

It describes some *really* ugly things Linux does to ptrace on execve
when there are threads: The most exciting being the return value is
sent to a different tid than called execve(), and other tids magically
disappear without notification.

You can use PTRACE_O_TRACEEXEC to see if the execve() succeeds, btw.
It has the useful side-effect of preventing the legacy behaviour of
SIGTRAP being sent as a normal queued signal after successful execve().
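
A short sketch of detecting that event in a tracer's wait loop, assuming
PTRACE_O_TRACEEXEC was set with PTRACE_SETOPTIONS. The helper name
is_exec_event is made up; the status encoding is the one given in the
ptrace(2) manual page.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * A successful execve() under PTRACE_O_TRACEEXEC stops the tracee with
 * status >> 8 equal to SIGTRAP | (PTRACE_EVENT_EXEC << 8). A failed
 * execve() produces no such stop.
 */
bool is_exec_event(int status)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8));
}
```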

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 16:01                                                   ` Jamie Lokier
@ 2012-01-19 16:13                                                     ` Andrew Lutomirski
  2012-01-19 19:21                                                     ` Linus Torvalds
  1 sibling, 0 replies; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-19 16:13 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Andrew Lutomirski wrote:
>> It's reasonable, obvious, and even more wrong than it appears.  On
>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>> (I got bitten by this in some iteration of the vsyscall emulation
>> patches -- see user_64bit_mode for the correct and
>> unusable-from-user-mode way to do this.)
>
> Here it is:
>
>        static inline bool user_64bit_mode(struct pt_regs *regs)
>        {
>        #ifndef CONFIG_PARAVIRT
>                /*
>                 * On non-paravirt systems, this is the only long mode CPL 3
>                 * selector.  We do not allow long mode selectors in the LDT.
>                 */
>                return regs->cs == __USER_CS;
>        #else
>                /* Headers are too twisted for this to go in paravirt.h. */
>                return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
>        #endif
>        }
>
> Perhaps userspace can do that.
> Would it be right for a ptracer to say:
>
>   CS == 0x23 -> 32-bit
>   (CS & 4)   -> 32-bit (LDT, "we do not allow long mode selectors in the LDT")
>   else       -> 64-bit (__USER_CS or some other GDT entry which must be pv_info's)
>
> I.e. assume that no other *GDT* CS values are available to userspace?
> There are other 32-bit GDT entries, but are they not all for data or kernel use only?

I suspect not.  asm/xen/interface_64.h has:

#define FLAT_RING3_CS32 0xe023  /* GDT index 260 */
#define FLAT_RING3_CS64 0xe033  /* GDT index 261 */
#define FLAT_RING3_DS32 0xe02b  /* GDT index 262 */
#define FLAT_RING3_DS64 0x0000  /* NULL selector */
#define FLAT_RING3_SS32 0xe02b  /* GDT index 262 */
#define FLAT_RING3_SS64 0xe02b  /* GDT index 262 */

which sounds like there's an extra 32-bit selector as well.  I haven't
checked, though.
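
To make the concern concrete, here is a small sketch that decodes the
architectural selector fields. The helper names are hypothetical. For
0xe023 it reports a ring-3 selector in the GDT, so a rule of the form
"GDT and not 0x23 means 64-bit" would misclassify it, provided
FLAT_RING3_CS32 is in fact reachable from user mode, which is unverified
here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Requested privilege level: the low two bits of a segment selector. */
unsigned selector_rpl(uint16_t sel)
{
    return sel & 0x3u;
}

/* Table indicator: bit 2 set means LDT, clear means GDT. */
bool selector_in_ldt(uint16_t sel)
{
    return (sel & 0x4u) != 0;
}
```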

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 16:01                                                   ` Jamie Lokier
  2012-01-19 16:13                                                     ` Andrew Lutomirski
@ 2012-01-19 19:21                                                     ` Linus Torvalds
  2012-01-19 19:30                                                       ` Andrew Lutomirski
                                                                         ` (2 more replies)
  1 sibling, 3 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-19 19:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrew Lutomirski, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Andrew Lutomirski wrote:
>> It's reasonable, obvious, and even more wrong than it appears.  On
>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>> (I got bitten by this in some iteration of the vsyscall emulation
>> patches -- see user_64bit_mode for the correct and
>> unusable-from-user-mode way to do this.)
>
> Here it is:
>
>        static inline bool user_64bit_mode(struct pt_regs *regs)

This is pointless, even if it worked, which it clearly doesn't on Xen
(or other random situations).

Why would you care?

The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.

The problem is the system call itself, and that can be 32-bit or
64-bit independently of the execution mode. So knowing the user-mode
mode is simply not relevant.

In the kernel, we know this with the TS_COMPAT flag - exactly because
it's impossible to tell from any actual CPU state. So *that* is the
flag you need to figure out, and currently the kernel doesn't export
it any way (but my suggested patch would export it in the high bits of
rflags).

So looking at CS isn't *ever* going to help.

                 Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:21                                                     ` Linus Torvalds
@ 2012-01-19 19:30                                                       ` Andrew Lutomirski
  2012-01-19 19:37                                                         ` Linus Torvalds
  2012-01-19 23:54                                                       ` Jamie Lokier
  2012-01-20 15:35                                                       ` Will Drewry
  2 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2012-01-19 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 11:21 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> Andrew Lutomirski wrote:
>>> It's reasonable, obvious, and even more wrong than it appears.  On
>>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>>> (I got bitten by this in some iteration of the vsyscall emulation
>>> patches -- see user_64bit_mode for the correct and
>>> unusable-from-user-mode way to do this.)
>>
>> Here it is:
>>
>>        static inline bool user_64bit_mode(struct pt_regs *regs)
>
> This is pointless, even if it worked, which it clearly doesn't on Xen
> (or other random situations).
>
> Why would you care?
>
> The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.
>
> The problem is the system call itself, and that can be 32-bit or
> 64-bit independently of the execution mode. So knowing the user-mode
> mode is simply not relevant.

Unless you're writing a debugger and you want to disassemble the code
that's being executed (i.e. normal code, not a system call).  I wonder
how gdb guesses whether the cpu is in long mode.

--Andy

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:30                                                       ` Andrew Lutomirski
@ 2012-01-19 19:37                                                         ` Linus Torvalds
  2012-01-19 19:41                                                           ` Linus Torvalds
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-19 19:37 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Jamie Lokier, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 11:30 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> Unless you're writing a debugger and you want to disassemble the code
> that's being executed (i.e. normal code, not a system call).  I wonder
> how gdb guesses whether the cpu is in long mode.

Yes, if you need to disassemble user space you would need to figure
out the mode.

I would suggest looking at 'rip/rsp' first, though, and just say that
if it's >32-bit, it's flat mode. Only if both rsp and rip fit in 32
bits should you even bother start guessing.

Because technically I suspect you really do need to look it up in the
segment descriptors, and I don't think we have that kind of interface
(nor do I think we really want to expose one).

                          Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:37                                                         ` Linus Torvalds
@ 2012-01-19 19:41                                                           ` Linus Torvalds
  0 siblings, 0 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-19 19:41 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Jamie Lokier, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 11:37 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I would suggest looking at 'rip/rsp' first, though, and just say that
> if it's >32-bit, it's flat mode. Only if both rsp and rip fit in 32
> bits should you even bother start guessing.

Oh, there's a few other hints you can look at. If 'ds' is zero, you
might technically be in 32-bit mode, but realistically nothing really
would work, so you might as well assume you're in long mode.

So you can have a lot of heuristics (including just looking at what
the disassembly itself looks like) if you really want to..
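
A minimal sketch of the register heuristics above, for illustration only.
The struct reg_snapshot type and the guess_user_mode() name are
hypothetical; a real tracer would fill the fields from PTRACE_GETREGS.
The function only ever answers "long mode" or "cannot tell", because low
rip/rsp values prove nothing either way.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the few tracee registers the heuristic uses. */
struct reg_snapshot {
	uint64_t rip;
	uint64_t rsp;
	uint16_t ds;
};

enum mode_guess { MODE_LONG, MODE_UNKNOWN };

bool fits_in_32_bits(uint64_t v)
{
	return (v >> 32) == 0;
}

enum mode_guess guess_user_mode(const struct reg_snapshot *r)
{
	/* Anything above 4GB cannot belong to 32-bit code: flat mode. */
	if (!fits_in_32_bits(r->rip) || !fits_in_32_bits(r->rsp))
		return MODE_LONG;

	/* A null ds is technically possible in 32-bit mode, but nothing
	 * would realistically work, so assume long mode. */
	if (r->ds == 0)
		return MODE_LONG;

	/* Both fit in 32 bits and ds is loaded: further guessing needed. */
	return MODE_UNKNOWN;
}
```

MODE_UNKNOWN is the point where a tool would have to fall back on weaker
hints, such as what the disassembly itself looks like.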

                    Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:21                                                     ` Linus Torvalds
  2012-01-19 19:30                                                       ` Andrew Lutomirski
@ 2012-01-19 23:54                                                       ` Jamie Lokier
  2012-01-20  0:05                                                         ` Linus Torvalds
  2012-01-20 15:35                                                       ` Will Drewry
  2 siblings, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-19 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Lutomirski, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

Linus Torvalds wrote:
> On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> > Andrew Lutomirski wrote:
> >> It's reasonable, obvious, and even more wrong than it appears.  On
> >> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
> >> (I got bitten by this in some iteration of the vsyscall emulation
> >> patches -- see user_64bit_mode for the correct and
> >> unusable-from-user-mode way to do this.)
> >
> > Here it is:
> >
> >        static inline bool user_64bit_mode(struct pt_regs *regs)
> 
> This is pointless, even if it worked, which it clearly doesn't on Xen
> (or other random situations).
> 
> Why would you care?
> 
> The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.
> 
> The problem is the system call itself, and that can be 32-bit or
> 64-bit independently of the execution mode. So knowing the user-mode
> mode is simply not relevant.

Sorry, you're responding to a different question than the one I was
talking about.  My bad for adding to the confusion.

Mine was: strace currently checks the CS value and may have a bug on
existing/older kernels if Xen is involved when using the *normal*
syscall entry point (not int $0x80).  Can we patch strace to solve
that on those kernels in a generic way, or does the fix need to
hard-code knowledge of Xen's CS values (and any similar PV hypervisors
if there are any).

No amount of patching newer kernels will fix that, but it would be
nice if newer kernels made it unambiguous.

You've usefully pointed out that there's no reliable way to tell if
the tracee is executing in long mode.  If we're adding pseudo-flags to
say what kind of syscall it is, it would be no bad thing to have a
pseudo-flag to say if userspace is in long mode -- made available to
breakpoints and single-stepping as well.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 23:54                                                       ` Jamie Lokier
@ 2012-01-20  0:05                                                         ` Linus Torvalds
  0 siblings, 0 replies; 222+ messages in thread
From: Linus Torvalds @ 2012-01-20  0:05 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrew Lutomirski, Indan Zupancic, Andi Kleen, Oleg Nesterov,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 3:54 PM, Jamie Lokier <jamie@shareable.org> wrote:
>
> Mine was: strace currently checks the CS value and may have a bug on
> existing/older kernels if Xen is involved when using the *normal*
> syscall entry point (not int $0x80).  Can we patch strace to solve
> that on those kernels in a generic way, or does the fix need to
> hard-code knowledge of Xen's CS values (and any similar PV hypervisors
> if there are any).
>
> No amount of patching newer kernels will fix that, but it would be
> nice if newer kernels made it unambiguous.

Ok.  So yeah, I think the heuristics for strace could possibly be
improved when running under Xen, I agree. See my suggestion for taking
other register contents into account (%rsp in particular - the code
segment tends to be mapped low in 64-bit mode, but the stack is almost
always high unless you are doing something really odd).

So heuristics improvements could be a good idea. Very few real
programs will use "int 0x80" in long mode, since it's slow and limited
to 32 bit.

               Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-19 16:04                                               ` Jamie Lokier
@ 2012-01-20  0:21                                                 ` Indan Zupancic
  2012-01-20  0:53                                                   ` Linus Torvalds
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-20  0:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Martin Mares, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow

On Thu, January 19, 2012 17:04, Jamie Lokier wrote:
> Andi Kleen wrote:
>> > Not everybody. There are programs which try hard to distinguish between
>> > int80 and syscall. One such example is a sandbox for programming contests
>> > I wrote several years ago. It analyses the instruction before EIP and as
>> > it does not allow threads nor executing writeable memory, it should be
>> > correct.
>>
>> There are other ways to break it, like using the syscall itself to change
>> input arguments or using ptrace from another process and other ways.
>>
>> Generally there are so many races with ptrace that if you want to do
>> things like that it's better to use a LSM. That's what they are for.
>
> I could see the LSM approach working *if* there was an LSM module to
> make it available to unprivileged userspace.  I.e. a replacement for
> ptrace() for this purpose.
>
> It would be nice to be able to trace and check syscall strings properly.

With current ptrace you can do exactly that. It's just very slow, because
you have to copy the data word by word via PTRACE_PEEKDATA. But if Linux
would support something like BSD's PT_IO ptrace request, then it could be
limited to one extra ptrace command. (PTRACE_STRNCPY would be handy.)

After the check we memcpy the data to a shared read-only mapping, but
that's very quick. We could read the data directly into the RO area,
but as we're mostly dealing with path strings it seemed more efficient
to allocate the needed memory instead of the max every time.

No matter how you make it available to userspace via some LSM, you will
end up with the same context switch overhead ptrace suffers, so I don't
see how a LSM module would give either more options or make it much faster
compared to ptrace.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-20  0:21                                                 ` Indan Zupancic
@ 2012-01-20  0:53                                                   ` Linus Torvalds
  2012-01-20  2:02                                                     ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-20  0:53 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jamie Lokier, Andi Kleen, Martin Mares, Andi Kleen,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor

On Thu, Jan 19, 2012 at 4:21 PM, Indan Zupancic <indan@nul.nu> wrote:
>
> With current ptrace you can do exactly that. It's just very slow, because
> you have to copy the data word by word via PTRACE_PEEKDATA. But if Linux
> would support something like BSD's PT_IO ptrace request, then it could be
> limited to one extra ptrace command. (PTRACE_STRNCPY would be handy.)

Actually, you could use the new "process_vm_readv/writev()" system
calls. No need to do the crazy slow ptrace stuff.

I dunno. It got merged through Andrew, and the code looks sane, but
I've never actually seen anybody *use* it. So maybe there is something
wrong there. And no, it doesn't have a "strncpy" interface, I'm
afraid.
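
A minimal sketch of such a read, assuming a glibc new enough to provide
the process_vm_readv() wrapper.  The read_remote() helper name is made up
for illustration; the caller still needs ptrace-level permission on the
target process.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copy len bytes from remote_addr in process pid into buf with a single
 * system call.  Returns the number of bytes copied, or -1 with errno set. */
ssize_t read_remote(pid_t pid, const void *remote_addr, void *buf, size_t len)
{
	struct iovec local  = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };

	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

Reading from the caller's own pid, e.g. read_remote(getpid(), src, dst, n),
is a cheap way to check whether the call is available at all.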

                   Linus

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-20  0:53                                                   ` Linus Torvalds
@ 2012-01-20  2:02                                                     ` Indan Zupancic
  0 siblings, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-20  2:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andi Kleen, Martin Mares, Andi Kleen,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow

On Fri, January 20, 2012 01:53, Linus Torvalds wrote:
> On Thu, Jan 19, 2012 at 4:21 PM, Indan Zupancic <indan@nul.nu> wrote:
>>
>> With current ptrace you can do exactly that. It's just very slow, because
>> you have to copy the data word by word via PTRACE_PEEKDATA. But if Linux
>> would support something like BSD's PT_IO ptrace request, then it could be
>> limited to one extra ptrace command. (PTRACE_STRNCPY would be handy.)
>
> Actually, you could use the new "process_vm_readv/writev()" system
> calls. No need to do the crazy slow ptrace stuff.

Oh wow, that's great! I tried pread on /proc/$PID/mem before, but that
didn't work for some reason and would eat many fd's if there were a lot
of prisoners.

When did it get merged?

> I dunno. It got merged through Andrew, and the code looks sane, but
> I've never actually seen anybody *use* it. So maybe there is something
> wrong there. And no, it doesn't have a "strncpy" interface, I'm
> afraid.

My main problem is that I don't know beforehand how much I have to read,
and if I always read a fixed amount it may go across a page border and
error out. So if process_vm_readv() reads the accessible data only and
doesn't give up halfway, it's perfect. That seems to be the behaviour,
but the manpage is fuzzy enough that it may not be true. I'll take a
look at the source later.
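
One way to sidestep the question is to never issue a read that crosses a
page boundary.  The sketch below does that for a NUL-terminated string;
read_remote_string() is a hypothetical helper, not an existing interface,
and it makes no claim about how the kernel handles partial transfers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read a NUL-terminated string of at most cap bytes (including the NUL)
 * from addr in process pid.  Each read stops at the end of the current
 * page, so an unmapped following page cannot fail an otherwise good read.
 * Returns the string length, or -1 with errno set. */
ssize_t read_remote_string(pid_t pid, uintptr_t addr, char *buf, size_t cap)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t done = 0;

	if (page <= 0 || cap == 0) {
		errno = EINVAL;
		return -1;
	}

	while (done < cap) {
		/* Bytes left in the page that contains addr + done. */
		size_t chunk = (size_t)page -
			       (size_t)((addr + done) % (uintptr_t)page);
		if (chunk > cap - done)
			chunk = cap - done;

		struct iovec local  = { .iov_base = buf + done, .iov_len = chunk };
		struct iovec remote = { .iov_base = (void *)(addr + done),
					.iov_len = chunk };
		ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
		if (n <= 0)
			return -1;

		char *nul = memchr(buf + done, '\0', (size_t)n);
		if (nul)
			return (ssize_t)(nul - buf);
		done += (size_t)n;
	}

	errno = ENAMETOOLONG;	/* no terminator within cap bytes */
	return -1;
}
```

The cost is one system call per page touched, which for path strings is
almost always one or two.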

Thanks,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19 19:21                                                     ` Linus Torvalds
  2012-01-19 19:30                                                       ` Andrew Lutomirski
  2012-01-19 23:54                                                       ` Jamie Lokier
@ 2012-01-20 15:35                                                       ` Will Drewry
  2012-01-20 17:56                                                         ` Roland McGrath
  2 siblings, 1 reply; 222+ messages in thread
From: Will Drewry @ 2012-01-20 15:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Andrew Lutomirski, Indan Zupancic, Andi Kleen,
	Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On Thu, Jan 19, 2012 at 1:21 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 19, 2012 at 8:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> Andrew Lutomirski wrote:
>>> It's reasonable, obvious, and even more wrong than it appears.  On
>>> Xen, there's an extra 64-bit GDT entry, and it gets used by default.
>>> (I got bitten by this in some iteration of the vsyscall emulation
>>> patches -- see user_64bit_mode for the correct and
>>> unusable-from-user-mode way to do this.)
>>
>> Here it is:
>>
>>        static inline bool user_64bit_mode(struct pt_regs *regs)
>
> This is pointless, even if it worked, which it clearly doesn't on Xen
> (or other random situations).
>
> Why would you care?
>
> The issue is *not* whether somebody is running in 32-bit mode or 64-bit mode.
>
> The problem is the system call itself, and that can be 32-bit or
> 64-bit independently of the execution mode. So knowing the user-mode
> mode is simply not relevant.
>
> In the kernel, we know this with the TS_COMPAT flag - exactly because
> it's impossible to tell from any actual CPU state. So *that* is the
> flag you need to figure out, and currently the kernel doesn't export
> it any way (but my suggested patch would export it in the high bits of
> rflags).

Would it be worth considering changing the return from
task_user_regset_view, like:

--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1311,7 +1311,11 @@ void update_regset_xstate_info(unsigned int size, u64 xstate_mask)
 const struct user_regset_view *task_user_regset_view(struct task_struct *task)
 {
 #ifdef CONFIG_IA32_EMULATION
-       if (test_tsk_thread_flag(task, TIF_IA32))
+       /* If the task is in a syscall, then the TS_COMPAT status
+        * is more accurate than the personality.
+        */
+       if (test_tsk_thread_flag(task, TIF_IA32) ||
+           task_thread_info(task)->status & TS_COMPAT)
 #endif
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
                return &user_x86_32_view;


This would make TS_COMPAT behave like a personality change.
PTRACE_POKEUSR and PEEKUSR would still access the 64-bit view with no
compat info (just like with TIF_IA32 tasks), but PTRACE_[GS]ETREGS
would return/expect a 32-bit struct user_regs_struct.  This would result
in the tracer needing to check the returned regs to see if it was
fully populated (which seems heinous), but it would export the
TS_COMPAT state.

Right now, if a 64-bit tracer changes the regs for a TS_COMPAT call,
the args will be truncated to 32 bits (for better or worse). Of course, on
syscall_trace_leave, 64-bit registers won't be truncated, so it may
make less sense.

Perhaps this was considered and discarded as being obviously broken,
but it wasn't clear cut to me.

Thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 15:35                                                       ` Will Drewry
@ 2012-01-20 17:56                                                         ` Roland McGrath
  2012-01-20 19:45                                                           ` Will Drewry
  0 siblings, 1 reply; 222+ messages in thread
From: Roland McGrath @ 2012-01-20 17:56 UTC (permalink / raw)
  To: Will Drewry
  Cc: Linus Torvalds, Jamie Lokier, Andrew Lutomirski, Indan Zupancic,
	Andi Kleen, Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlao

In arch_ptrace, task_user_regset_view is called on current.  On an x86-64
kernel, that path is only reached for a 64-bit syscall.  compat_arch_ptrace
doesn't use it at all, always using the 32-bit view.  So your change would
have no effect on PTRACE_GETREGS.

It would only affect PTRACE_GETREGSET, which calls task_user_regset_view on
the target task.  Is that what you meant?  I think that would be confusing
at best.  A caller of PTRACE_GETREGSET is expecting a particular layout
based on what type of task he thinks he's dealing with.  The caller can
look at the iov_len in the result to discern which layout it actually got
filled in, but I don't think that's what callers expect.
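
A sketch of that iov_len check from the tracer's side, for illustration.
The byte counts are my reading of the two x86 general-register layouts
(27 eight-byte fields for x86-64, 17 four-byte fields for i386); the enum
and function names are made up.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#define X86_64_GREGS_BYTES (27 * 8)	/* assumed size of the 64-bit layout */
#define I386_GREGS_BYTES   (17 * 4)	/* assumed size of the 32-bit layout */

enum gregs_layout { GREGS_X86_64, GREGS_I386, GREGS_UNKNOWN };

/* Classify the layout by the length the kernel wrote back into iov_len. */
enum gregs_layout layout_from_len(size_t iov_len)
{
	if (iov_len == X86_64_GREGS_BYTES)
		return GREGS_X86_64;
	if (iov_len == I386_GREGS_BYTES)
		return GREGS_I386;
	return GREGS_UNKNOWN;
}

/* Fetch NT_PRSTATUS for a stopped tracee into a buffer large enough for
 * the bigger layout, then report which layout was actually filled in. */
enum gregs_layout fetch_gregs(pid_t pid, uint64_t buf[27])
{
	struct iovec iov = { .iov_base = buf, .iov_len = X86_64_GREGS_BYTES };

	if (ptrace(PTRACE_GETREGSET, pid,
		   (void *)(uintptr_t)NT_PRSTATUS, &iov) == -1)
		return GREGS_UNKNOWN;
	return layout_from_len(iov.iov_len);
}
```

A caller that skips this check and reads the buffer as the layout it
assumed is exactly the case described above.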

The other use of task_user_regset_view is in core dump
(binfmt_elf.c:fill_note_info).  Off hand I don't think there's a way a core
dump can be started while still "inside" a syscall so that TS_COMPAT could
ever be set.  But that should be double-checked.

As to whether it was considered before, I doubt that it was.  I don't
really recall the sequence of events, but I think that I did all the
user_regset code before I was really cognizant of the TS_COMPAT subtleties.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 17:56                                                         ` Roland McGrath
@ 2012-01-20 19:45                                                           ` Will Drewry
  0 siblings, 0 replies; 222+ messages in thread
From: Will Drewry @ 2012-01-20 19:45 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Jamie Lokier, Andrew Lutomirski, Indan Zupancic,
	Andi Kleen, Oleg Nesterov, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, segoon, rostedt,
	jmorris, scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlao

On Fri, Jan 20, 2012 at 11:56 AM, Roland McGrath <mcgrathr@google.com> wrote:
> In arch_ptrace, task_user_regset_view is called on current.  On an x86-64
> kernel, that path is only reached for a 64-bit syscall.  compat_arch_ptrace
> doesn't use it at all, always using the 32-bit view.  So your change would
> have no effect on PTRACE_GETREGS.
>
> It would only affect PTRACE_GETREGSET, which calls task_user_regset_view on
> the target task.  Is that what you meant?

Exactly - sorry for being unclear!

> I think that would be confusing
> at best.  A caller of PTRACE_GETREGSET is expecting a particular layout
> based on what type of task he thinks he's dealing with.  The caller can
> look at the iov_len in the result to discern which layout it actually got
> filled in, but I don't think that's what callers expect.

The question of what callers expect wasn't so clear to me -- for two reasons:
1. I was misreading
2. Compat syscall numbering.

#1 I had mistakenly thought that TIF_IA32 was set on a task if
personality(2) was called with PER_LINUX/PER_LINUX32.  It appears that
thread info flag can only be set by the binfmt handlers at exec-time,
so personality(2) cannot be used to change the user_regs_struct on the
fly (just signal mappings).

#2 In the case of a 64-bit process doing a 32-bit system call without
a personality change, the 64-bit register view will be consistent,
but, as discussed, the numbering will be incorrect.  So what the
caller gets back still seems to not be what they were expecting, it's
just not as far off as a different register view.

In either case the output from PTRACE_GETREGS is broken for the
TS_COMPAT-64-bit process flow, but it all comes down to determining
which brokenness is worse: the silent change in system call numbering
and register truncation, or a different but accurate user_regs_struct :/
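
To make the numbering problem concrete, here is a small illustrative
lookup covering only the first few entries of each table.  The
syscall_name() helper and the enum are hypothetical; the numbers are the
standard x86-64 and i386 assignments.

```c
#include <stddef.h>
#include <string.h>

enum syscall_abi { ABI_X86_64, ABI_I386 };

/* Return the name of syscall nr under the given ABI, or NULL if nr is
 * outside the tiny tables kept here. */
const char *syscall_name(enum syscall_abi abi, long nr)
{
	static const char *const x86_64_names[] = {
		"read", "write", "open", "close",		/* 0..3 */
	};
	static const char *const i386_names[] = {
		"restart_syscall", "exit", "fork", "read",	/* 0..3 */
		"write", "open", "close",			/* 4..6 */
	};
	const char *const *table = x86_64_names;
	size_t count = sizeof(x86_64_names) / sizeof(x86_64_names[0]);

	if (abi == ABI_I386) {
		table = i386_names;
		count = sizeof(i386_names) / sizeof(i386_names[0]);
	}
	if (nr < 0 || (size_t)nr >= count)
		return NULL;
	return table[nr];
}
```

A tracer that sees number 1 and assumes the 64-bit table reports write()
where the task actually called exit(), which is why the ABI of the call
has to come from somewhere other than the register contents.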

> The other use of task_user_regset_view is in core dump
> (binfmt_elf.c:fill_note_info).  Off hand I don't think there's a way a core
> dump can be started while still "inside" a syscall so that TS_COMPAT could
> ever be set.  But that should be double-checked.

That was my reading, too, but additional eyes would be useful.

> As to whether it was considered before, I doubt that it was.  I don't
> really recall the sequence of events, but I think that I did all the
> user_regset code before I was really cognizant of the TS_COMPAT subtleties.

Makes sense.

Thanks!
will

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-19  0:38                                                                 ` H. Peter Anvin
@ 2012-01-20 21:51                                                                   ` Denys Vlasenko
  2012-01-20 22:40                                                                     ` Roland McGrath
  0 siblings, 1 reply; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-20 21:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Roland McGrath, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-securit

[-- Attachment #1: Type: text/plain, Size: 2817 bytes --]

On Thu, Jan 19, 2012 at 1:38 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/18/2012 03:28 PM, Linus Torvalds wrote:
>> On Wed, Jan 18, 2012 at 1:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> I think we can obviously agree that regsets is the only way to go for
>>> any kind of new state.
>>
>> So I really don't necessarily agree at all.
>>
>> Exactly because there is a heavy burden to introducing new models.
>> It's not only relatively much more kernel code, it's also relatively
>> much more painful for user code. If we can hide it in existing
>> structures, user code is *much* better off, because any existing code
>> to get the state will just continue to work. Otherwise, you need to
>> have the code to figure out the new structures (how do you compile it
>> without the new kernel headers?), you need to do the extra accesses
>> conditionally etc etc.
>>
>> There's a real cost to introducing new interfaces. There's a *reason*
>> people try to make do with old ones.
>
> Of course.  However, the whole point with regsets is that at the very
> least the vast majority of the infrastructure is generic and extends
> without a bunch of new machinery.  What you are saying is "we might be
> able to get away with existing state", what I'm saying is "if we add
> state it should be a regset".
>
> The question if this should be new state is currently open.  I
> personally would still prefer if this didn't overlay real CPU state.

What about extending one of the GETREGSET layouts?
GETREGSET uses struct iovec, which carries a length (iov_len).
Currently, if iov_len is larger than the register structure
being requested, the kernel simply returns less data than
userspace asks for.

In the x86 case, we can add additional field(s) at
the end of NT_PRSTATUS layout.

Old programs which use PTRACE_GETREGS will get
old user_regs_struct layout (without appended fields).
Old programs which use
PTRACE_GETREGSET(NT_PRSTATUS, sizeof(struct user_regs_struct))
will also get the same.
New programs which use
PTRACE_GETREGSET(NT_PRSTATUS, sizeof(struct user_regs_struct) + N * sizeof(long))
will get new fields too.

It's more intrusive than Linus' solution, but it avoids
the problem of overlaying real register data
with OS-specific special bits. It can also be employed
on other architectures (does not depend on having
a suitable register to abuse).

OTOH it is less intrusive than adding a whole new regset
just in order to add a few bits to an existing one;
and would allow strace to extract both registers
and this new data with one operation instead of two.

Please see attached patch. NOT TESTED.

I'm new to this machinery, thus I might be missing some
obvious flaw with this idea (such as breaking on-disk
coredump format?)

-- 
vda

[-- Attachment #2: add_one_word_to_regset0.diff --]
[-- Type: text/x-patch, Size: 1271 bytes --]

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..16455c0 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -419,6 +419,10 @@ static int putreg(struct task_struct *child,
 		if (child->thread.gs != value)
 			return do_arch_prctl(child, ARCH_SET_GS, value);
 		return 0;
+
+	case sizeof(struct user_regs_struct) + 0 * sizeof(long):
+		/* Modifying of thread_info->status is not allowed */
+		return 0;
 #endif
 	}
 
@@ -469,6 +473,10 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
 			return 0;
 		return get_desc_base(&task->thread.tls_array[GS_TLS]);
 	}
+
+	case sizeof(struct user_regs_struct) + 0 * sizeof(long):
+		/* One day we might want to expose other bits too */
+		return (task_thread_info(task)->status & TS_COMPAT);
 #endif
 	}
 
@@ -1203,7 +1211,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 static struct user_regset x86_64_regsets[] __read_mostly = {
 	[REGSET_GENERAL] = {
 		.core_note_type = NT_PRSTATUS,
-		.n = sizeof(struct user_regs_struct) / sizeof(long),
+		.n = (sizeof(struct user_regs_struct) + 1 * sizeof(long)) / sizeof(long),
 		.size = sizeof(long), .align = sizeof(long),
 		.get = genregs_get, .set = genregs_set
 	},

^ permalink raw reply related	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 21:51                                                                   ` Denys Vlasenko
@ 2012-01-20 22:40                                                                     ` Roland McGrath
  2012-01-20 22:41                                                                       ` H. Peter Anvin
                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 222+ messages in thread
From: Roland McGrath @ 2012-01-20 22:40 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: H. Peter Anvin, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-mo

If you change the size of a regset, then the new full size will be the size
of the core file notes.  Existing userland tools will not be expecting
this, they expect a known exact size.  If you need to add new stuff, it
really is easier all around to add a new regset flavor.  When adding a new
one, you can make it variable-sized from the start so as to be extensible
in the future.  We did this for NT_X86_XSTATE, for example.
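
To make the compatibility concern concrete, here is a minimal sketch of
the kind of check a core-file reader might apply to an NT_PRSTATUS note.
The function name, and the assumption that the reader compares against a
single compiled-in size, are invented for illustration; this is not code
from any particular tool.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical consumer-side check: accept an NT_PRSTATUS note only if
 * its descriptor has exactly the size this tool was built to expect.
 */
static bool prstatus_note_accepted(const Elf64_Nhdr *nhdr,
				   size_t expected_descsz)
{
	if (nhdr->n_type != NT_PRSTATUS)
		return false;
	/* A regset that grew by one word makes n_descsz larger than expected. */
	return nhdr->n_descsz == expected_descsz;
}
```

A reader written this way rejects the note as soon as the regset grows,
which is why a separate note type is the safer route.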


Thanks,
Roland


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:40                                                                     ` Roland McGrath
@ 2012-01-20 22:41                                                                       ` H. Peter Anvin
  2012-01-20 23:49                                                                         ` Indan Zupancic
  2012-01-24  8:19                                                                       ` Indan Zupancic
  2012-02-06 20:30                                                                       ` H. Peter Anvin
  2 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-01-20 22:41 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-se

On 01/20/2012 02:40 PM, Roland McGrath wrote:
> If you change the size of a regset, then the new full size will be the size
> of the core file notes.  Existing userland tools will not be expecting
> this, they expect a known exact size.  If you need to add new stuff, it
> really is easier all around to add a new regset flavor.  When adding a new
> one, you can make it variable-sized from the start so as to be extensible
> in the future.  We did this for NT_X86_XSTATE, for example.
>

Yes, that definitely seems cleaner.

	-hpa



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:41                                                                       ` H. Peter Anvin
@ 2012-01-20 23:49                                                                         ` Indan Zupancic
  2012-01-20 23:55                                                                           ` Roland McGrath
  2012-01-21  0:07                                                                           ` Denys Vlasenko
  0 siblings, 2 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-20 23:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, Denys Vlasenko, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel

On Fri, January 20, 2012 23:41, H. Peter Anvin wrote:
> On 01/20/2012 02:40 PM, Roland McGrath wrote:
>> If you change the size of a regset, then the new full size will be the size
>> of the core file notes.  Existing userland tools will not be expecting
>> this, they expect a known exact size.  If you need to add new stuff, it
>> really is easier all around to add a new regset flavor.  When adding a new
>> one, you can make it variable-sized from the start so as to be extensible
>> in the future.  We did this for NT_X86_XSTATE, for example.
>>
>
> Yes, that definitely seems cleaner.

I would prefer Linus' way of just stuffing it into cs. Jamie also wanted
a bit telling in what mode the userspace is running. That's 3 bits in total,
with one bit telling whether the other bits are valid or not. Anything else?
Maybe a bit telling whether it is syscall entry or exit?

As all this is very x86_64 specific and cs is already used to figure out
the mode, it seems overkill to add a new regset just for this.

It's a lot easier for existing code to add an extra cs check than to use
different register sets and different ptrace commands. Considering that
PTRACE_GETREGSET is undocumented it's likely that existing code isn't
using it much.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:49                                                                         ` Indan Zupancic
@ 2012-01-20 23:55                                                                           ` Roland McGrath
  2012-01-20 23:58                                                                             ` hpanvin@gmail.com
  2012-01-23  2:14                                                                             ` Indan Zupancic
  2012-01-21  0:07                                                                           ` Denys Vlasenko
  1 sibling, 2 replies; 222+ messages in thread
From: Roland McGrath @ 2012-01-20 23:55 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H. Peter Anvin, Denys Vlasenko, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux

On Fri, Jan 20, 2012 at 3:49 PM, Indan Zupancic <indan@nul.nu> wrote:
> It's a lot easier for existing code to add an extra cs check than to use

The issue is whether showing fictitious high bits of %cs as set will break
existing applications (debuggers, etc.) that look at it and think that it's
nothing but the hardware state zero-extended, as it is today.


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:55                                                                           ` Roland McGrath
@ 2012-01-20 23:58                                                                             ` hpanvin@gmail.com
  2012-01-23  2:14                                                                             ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: hpanvin@gmail.com @ 2012-01-20 23:58 UTC (permalink / raw)
  To: Roland McGrath, Indan Zupancic
  Cc: Denys Vlasenko, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, o

Linus claims it does break.

Roland McGrath <mcgrathr@google.com> wrote:

>On Fri, Jan 20, 2012 at 3:49 PM, Indan Zupancic <indan@nul.nu> wrote:
>> It's a lot easier for existing code to add an extra cs check than to
>use
>
>The issue is whether showing fictitious high bits of %cs as set will
>break
>existing applications (debuggers, etc.) that look at it and think that
>it's
>nothing but the hardware state zero-extended, as it is today.

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:49                                                                         ` Indan Zupancic
  2012-01-20 23:55                                                                           ` Roland McGrath
@ 2012-01-21  0:07                                                                           ` Denys Vlasenko
  2012-01-21  0:10                                                                             ` Roland McGrath
  1 sibling, 1 reply; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-21  0:07 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H. Peter Anvin, Roland McGrath, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel

On Saturday 21 January 2012 00:49, Indan Zupancic wrote:
> On Fri, January 20, 2012 23:41, H. Peter Anvin wrote:
> > On 01/20/2012 02:40 PM, Roland McGrath wrote:
> >> If you change the size of a regset, then the new full size will be the size
> >> of the core file notes.  Existing userland tools will not be expecting
> >> this, they expect a known exact size.  If you need to add new stuff, it
> >> really is easier all around to add a new regset flavor.  When adding a new
> >> one, you can make it variable-sized from the start so as to be extensible
> >> in the future.  We did this for NT_X86_XSTATE, for example.
> >>
> >
> > Yes, that definitely seems cleaner.
> 
> I would prefer Linus' way of just stuffing it into cs. Jamie also wanted
> a bit telling in what mode the userspace is running. That's 3 bits in total,
> with one bit telling whether the other bits are valid or not. Anything else?

There is actually a bunch of ptrace-specific stuff we want to return.

For example, Oleg wants to be able to print *which syscall*,
(along with its arguments if possible) is restarted when
we restart the ERESTART_RESTARTBLOCK-returning syscall.
Which happens every time strace attaches to a process sleeping
in nanosleep or poll, for example. We get just

$ strace -p 1234
Process 1234 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>_

and that's it.

Returning the syscall and its parameters requires several words,
not a few bits.

> Maybe a bit telling whether it is syscall entry or exit?

Yes, this one too. This is one of longstanding annoyances
that this information is not exposed.

> As all this is very x86_64 specific and cs is already used to figure out
> the mode, it seems overkill to add a new regset just for this.
> 
> It's a lot easier for existing code to add an extra cs check than to use
> different register sets and different ptrace commands.

You don't understand. Returning new bits in cs will break *existing*
programs. This is generally a bad thing. For example, old strace binaries
on new kernel will complain:

        switch (x86_64_regs.cs) {
                case 0x23: currpers = 1; break;
                case 0x33: currpers = 0; break;
                default:
                        fprintf(stderr, "Unknown value CS=0x%08X while "
                                 "detecting personality of process "
                                 "PID=%d\n", (int)x86_64_regs.cs, tcp->pid);
                        currpers = current_personality;
                        break;
        }

when they see an unfamiliar x86_64_regs.cs value.
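
The failure mode can be reproduced outside strace in a few lines. In this
sketch the function name and the flag bit position are invented for
illustration; only the 0x23 and 0x33 selector values come from the strace
code quoted above.

```c
/*
 * Same decision as the strace switch, reduced to a pure function.
 * Returns 1 for a 32-bit personality, 0 for 64-bit, and -1 for a
 * value it does not recognise.
 */
static int personality_from_cs(unsigned long cs)
{
	switch (cs) {
	case 0x23:
		return 1;
	case 0x33:
		return 0;
	default:
		return -1;
	}
}

/* HYPOTHETICAL: bit position chosen only for this example. */
#define FAKE_CS_COMPAT_BIT (1UL << 16)
```

With any extra bit set, personality_from_cs(0x33 | FAKE_CS_COMPAT_BIT)
falls into the default case. A new tracer could mask the value down to
the low 16 bits first, but binaries that are already deployed will not.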

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-21  0:07                                                                           ` Denys Vlasenko
@ 2012-01-21  0:10                                                                             ` Roland McGrath
  2012-01-21  1:23                                                                               ` Jamie Lokier
  0 siblings, 1 reply; 222+ messages in thread
From: Roland McGrath @ 2012-01-21  0:10 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Indan Zupancic, H. Peter Anvin, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-mod

On Fri, Jan 20, 2012 at 4:07 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>> Maybe a bit telling whether it is syscall entry or exit?
>
> Yes, this one too. This is one of longstanding annoyances
> that this information is not exposed.

That is not really "state", it's just which event you want.
That is much better addressed by replacing PTRACE_SYSCALL
with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT} and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT}.
Oleg can whip that up for you no problem.


Thanks,
Roland


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-21  0:10                                                                             ` Roland McGrath
@ 2012-01-21  1:23                                                                               ` Jamie Lokier
  2012-01-23  2:37                                                                                 ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-21  1:23 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Indan Zupancic, H. Peter Anvin, Linus Torvalds,
	Andi Kleen, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-secu

Roland McGrath wrote:
> On Fri, Jan 20, 2012 at 4:07 PM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
> >> Maybe a bit telling whether it is syscall entry or exit?
> >
> > Yes, this one too. This is one of longstanding annoyances
> > that this information is not exposed.
> 
> That is not really "state", it's just which event you want.
> That is much better addressed by replacing PTRACE_SYSCALL
> with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT} and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT}.
> Oleg can whip that up for you no problem.

I agree, that is so obviously the right thing to do and it's very easy
to do in the tracehook functions.

There is one slight problem that some archs don't use
tracehook yet. Probably that should be fixed anyway.

(Fwiw, two other issues with arch-independent ptrace have come up in this
thread, which ought to be fairly easy to fix:
   - If tracer dies, tracee is free to continue running.  For security
     tracers, and would be useful for strace as well, it would be good
     to have an option to SIGKILL the tracee if tracer dies.
   - Can't abort or change an unwanted syscall if the process receives
     SIGKILL as it's about to start a syscall (which will be its last).)

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 23:55                                                                           ` Roland McGrath
  2012-01-20 23:58                                                                             ` hpanvin@gmail.com
@ 2012-01-23  2:14                                                                             ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-23  2:14 UTC (permalink / raw)
  To: Roland McGrath
  Cc: H. Peter Anvin, Denys Vlasenko, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel

On Sat, January 21, 2012 00:55, Roland McGrath wrote:
> On Fri, Jan 20, 2012 at 3:49 PM, Indan Zupancic <indan@nul.nu> wrote:
>> It's a lot easier for existing code to add an extra cs check than to use
>
> The issue is whether showing fictitious high bits of %cs as set will break
> existing applications (debuggers, etc.) that look at it and think that it's
> nothing but the hardware state zero-extended, as it is today.

Argh, sorry, I meant eflags.

I even checked how many bits are free in eflags and still wrote 'cs'.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-21  1:23                                                                               ` Jamie Lokier
@ 2012-01-23  2:37                                                                                 ` Indan Zupancic
  2012-01-23 16:48                                                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-23  2:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Roland McGrath, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Andi Kleen, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel

On Sat, January 21, 2012 02:23, Jamie Lokier wrote:
> Roland McGrath wrote:
>> On Fri, Jan 20, 2012 at 4:07 PM, Denys Vlasenko
>> <vda.linux@googlemail.com> wrote:
>> >> Maybe a bit telling whether it is syscall entry or exit?
>> >
>> > Yes, this one too. This is one of longstanding annoyances
>> > that this information is not exposed.
>>
>> That is not really "state", it's just which event you want.
>> That is much better addressed by replacing PTRACE_SYSCALL
>> with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT} and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT}.
>> Oleg can whip that up for you no problem.
>
> I agree, that is so obviously the right thing to do and it's very easy
> to do in the tracehook functions.

Yes, bad place for it, much better via ptrace flags. We're usually not
interested in syscall exit events, so having a way to not always get
syscall exit events would improve performance quite a bit too.

> There is one slight problem that some archs don't use
> tracehook yet. Probably that should be fixed anyway.
>
> (Fwiw, two other issues with arch-independent ptrace have come up in this
> thread, which ought to be fairly easy to fix:
>    - If tracer dies, tracee is free to continue running.  For security
>      tracers, and would be useful for strace as well, it would be good
>      to have an option to SIGKILL the tracee if tracer dies.

It should be easy to add a PTRACE_O_SIGKILL_ON_DEATH option.

>    - Can't abort or change an unwanted syscall if the process receives
>      SIGKILL as it's about to start a syscall (which will be its last).)

This is very important for any syscall filtering/control via ptrace, otherwise
SIGKILL becomes a security problem. Oleg had a patch for that:

On Wed, January 18, 2012 18:12, Oleg Nesterov wrote:
> On 01/18, Oleg Nesterov wrote:
>> Not only for security. The current behaviour sometime confuses the
>> users. Debugger sends SIGKILL to the tracee and assumes it should
>> die asap, but the tracee exits only after syscall.
>
> Something like the patch below.
>
> Oleg.
>
> --- x/include/linux/tracehook.h
> +++ x/include/linux/tracehook.h
> @@ -54,12 +54,12 @@ struct linux_binprm;
>  /*
>   * ptrace report for syscall entry and exit looks identical.
>   */
> -static inline void ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs)
>  {
>  	int ptrace = current->ptrace;
>
>  	if (!(ptrace & PT_PTRACED))
> -		return;
> +		return 0;
>
>  	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
>  		send_sig(current->exit_code, current, 1);
>  		current->exit_code = 0;
>  	}
> +
> +	return fatal_signal_pending(current);
>  }
>
>  /**
> @@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
>  static inline __must_check int tracehook_report_syscall_entry(
>  	struct pt_regs *regs)
>  {
> -	ptrace_report_syscall(regs);
> -	return 0;
> +	return ptrace_report_syscall(regs);
>  }
>
>  /**
>



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-23  2:37                                                                                 ` Indan Zupancic
@ 2012-01-23 16:48                                                                                   ` Oleg Nesterov
  0 siblings, 0 replies; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-23 16:48 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jamie Lokier, Roland McGrath, Denys Vlasenko, H. Peter Anvin,
	Linus Torvalds, Andi Kleen, Andrew Lutomirski, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel

On 01/23, Indan Zupancic wrote:
>
> On Sat, January 21, 2012 02:23, Jamie Lokier wrote:
> >
> > (Fwiw, two other issues with arch-independent ptrace have come up in this
> > thread, which ought to be fairly easy to fix:
> >    - If tracer dies, tracee is free to continue running.  For security
> >      tracers, and would be useful for strace as well, it would be good
> >      to have an option to SIGKILL the tracee if tracer dies.
>
> It should be easy to add a PTRACE_O_SIGKILL_ON_DEATH option.

Yes, this looks simple.

> >    - Can't abort or change an unwanted syscall if the process receives
> >      SIGKILL as it's about to start a syscall (which will be its last).)
>
> This is very important for any syscall filtering/control via ptrace, otherwise
> SIGKILL becomes a security problem. Oleg had a patch for that:

OK, I'll send this patch after some testing. Although it looks trivial.

Oleg.



* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 21:09                                               ` Chris Evans
@ 2012-01-23 16:56                                                 ` Oleg Nesterov
  2012-01-23 22:23                                                   ` Chris Evans
  0 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-23 16:56 UTC (permalink / raw)
  To: Chris Evans
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On 01/18, Chris Evans wrote:
>
> Thanks, Oleg. Seems like this would be a nice change to have. As we
> can see, people do use ptrace() as a security technology.

OK, I'll send it.

> With this in place, you can also (where possible) set up the tracee
> with PR_SET_PDEATHSIG==SIGKILL. And then, you have defences again
> either of the tracer or tracee dying from a stray SIGKILL.

This can only help if the tracer is the natural parent, is it enough?

Indan suggested PTRACE_O_SIGKILL_ON_DEATH, perhaps it makes sense.
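
For reference, the PR_SET_PDEATHSIG setup discussed above could look
roughly like this on the tracee side. It is only a sketch: the function
name is made up, and it covers just the case where the tracer is the
natural parent that forked the tracee.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Call in the child after fork(), before it asks to be traced.
 * expected_parent is the pid the parent had before fork(); it is used
 * to detect the race where the parent died before prctl() took effect.
 * Returns 0 on success, -1 on failure.
 */
static int kill_me_if_parent_dies(pid_t expected_parent)
{
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL) != 0)
		return -1;
	/* Already reparented: the death signal would never arrive. */
	if (getppid() != expected_parent)
		return -1;
	return 0;
}
```

A tracer that attaches to an unrelated process gets nothing from this,
which is the gap a ptrace option would close.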

Oleg.



* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-23 16:56                                                 ` Oleg Nesterov
@ 2012-01-23 22:23                                                   ` Chris Evans
  0 siblings, 0 replies; 222+ messages in thread
From: Chris Evans @ 2012-01-23 22:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Mon, Jan 23, 2012 at 8:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/18, Chris Evans wrote:
>>
>> Thanks, Oleg. Seems like this would be a nice change to have. As we
>> can see, people do use ptrace() as a security technology.
>
> OK, I'll send it.
>
>> With this in place, you can also (where possible) set up the tracee
>> with PR_SET_PDEATHSIG==SIGKILL. And then, you have defences again
>> either of the tracer or tracee dying from a stray SIGKILL.
>
> This can only help if the tracer is the natural parent, is it enough?
>
> Indan suggested PTRACE_O_SIGKILL_ON_DEATH, perhaps it makes sense.

Yeah, this takes care of all cases.

One caveat I can think of with the implementation: in the parent
exit() path, the child's SIGKILL needs to be delivered _before_ the
tracer is detached. Otherwise it might feasibly wake up and run for a
bit :)


Cheers
Chris

>
> Oleg.
>


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:40                                                                     ` Roland McGrath
  2012-01-20 22:41                                                                       ` H. Peter Anvin
@ 2012-01-24  8:19                                                                       ` Indan Zupancic
  2012-02-06 20:30                                                                       ` H. Peter Anvin
  2 siblings, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-24  8:19 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, H. Peter Anvin, Linus Torvalds, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel

On Fri, January 20, 2012 23:40, Roland McGrath wrote:
> If you change the size of a regset, then the new full size will be the size
> of the core file notes.  Existing userland tools will not be expecting
> this, they expect a known exact size.  If you need to add new stuff, it
> really is easier all around to add a new regset flavor.  When adding a new
> one, you can make it variable-sized from the start so as to be extensible
> in the future.  We did this for NT_X86_XSTATE, for example.

If stuffing it into eflags is not acceptable and you really want a
new regset, perhaps that new regset should only contain the new,
mostly cross-platform information, instead of slapping it at the
end of the x86 regset. Because if you do the latter, you might as
well have just stuffed it into eflags.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 19:31                                               ` Linus Torvalds
                                                                   ` (2 preceding siblings ...)
  2012-01-18 20:26                                                 ` Linus Torvalds
@ 2012-01-25 19:36                                                 ` Oleg Nesterov
  2012-01-25 20:20                                                   ` Pedro Alves
  2012-01-25 23:32                                                   ` Denys Vlasenko
  3 siblings, 2 replies; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-25 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, Roland McGrath

On 01/18, Linus Torvalds wrote:
>
> Using the high bits of 'eflags' might work.

I thought about changing eflags too, this looks very natural to me.

But I do not understand the result of this discussion, are you going
to apply this change?

If not...

Not sure this is really better, but there is another idea. Currently we
have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?

IOW. Currently ptrace_report_syscall() does

	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));

We can add the new events,

	PTRACE_EVENT_SYSCALL_ENTRY
	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
	PTRACE_EVENT_SYSCALL_EXIT
	PTRACE_EVENT_SYSCALL_COMPAT_EXIT

and change ptrace_report_syscall() to do

	if (PT_SEIZED) /* or PT_TRACESYS_VERY_GOOD? */ {
		int event = entry ? PTRACE_EVENT_SYSCALL_ENTRY : EXIT;
		if (is_compat_task(current))
			event++;
		ptrace_notify((event << 8) | SIGTRAP);
	} else {
		ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
	}

This also makes it possible to distinguish entry from exit.
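
On the tracer side, decoding such a stop from the wait status might look
like the sketch below. The XEV_* numbers are placeholders invented for
this example, since the proposal names the events without assigning
values; the only property relied on is that each compat event is the
plain event plus one, matching the event++ above.

```c
#include <signal.h>
#include <sys/wait.h>

/* HYPOTHETICAL event numbers, chosen only for illustration. */
enum {
	XEV_SYSCALL_ENTRY        = 16,
	XEV_SYSCALL_COMPAT_ENTRY = 17,
	XEV_SYSCALL_EXIT         = 18,
	XEV_SYSCALL_COMPAT_EXIT  = 19,
};

struct syscall_stop {
	int is_exit;	/* 1 for a syscall-exit stop, 0 for entry */
	int is_compat;	/* 1 if reported as a compat syscall */
};

/*
 * ptrace_notify((event << 8) | SIGTRAP) reaches the tracer as a wait
 * status with SIGTRAP as the stop signal and the event in bits 16 and up.
 * Returns 0 and fills *out for one of the new events, -1 otherwise.
 */
static int decode_syscall_stop(int status, struct syscall_stop *out)
{
	int event;

	if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
		return -1;
	event = status >> 16;
	if (event < XEV_SYSCALL_ENTRY || event > XEV_SYSCALL_COMPAT_EXIT)
		return -1;
	out->is_exit = event >= XEV_SYSCALL_EXIT;
	out->is_compat = (event - XEV_SYSCALL_ENTRY) & 1;
	return 0;
}
```

An old-style PT_TRACESYSGOOD stop has SIGTRAP | 0x80 as its stop signal,
so it is rejected here and can still be handled by existing code.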


However. The change in get_flags() also allows one to know the state of the
TIF_IA32 bit outside of syscall entry/exit reports; perhaps there
is a reason why we want this?

Oleg.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 19:36                                                 ` Oleg Nesterov
@ 2012-01-25 20:20                                                   ` Pedro Alves
  2012-01-25 23:36                                                     ` Denys Vlasenko
  2012-01-25 23:32                                                   ` Denys Vlasenko
  1 sibling, 1 reply; 222+ messages in thread
From: Pedro Alves @ 2012-01-25 20:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On 01/25/2012 07:36 PM, Oleg Nesterov wrote:
> 
> Not sure this is really better, but there is another idea. Currently we
> have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?

May I beg you not to rely on PTRACE_SYSCALL for anything new?
You can't PTRACE_SINGLESTEP and PTRACE_SYSCALL simultaneously.  Think of
gdb single-stepping all the way for some reason (software watchpoints, for ex.),
while at the same time wanting to catch syscalls.

As Roland suggested, replacing PTRACE_SYSCALL with PTRACE_O_TRACE_SYSCALL_{ENTRY,EXIT}
and PTRACE_EVENT_SYSCALL_{ENTRY,EXIT} would be superior, syscall tracing wise.

-- 
Pedro Alves


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 19:36                                                 ` Oleg Nesterov
  2012-01-25 20:20                                                   ` Pedro Alves
@ 2012-01-25 23:32                                                   ` Denys Vlasenko
  2012-01-26  0:40                                                     ` Indan Zupancic
                                                                       ` (2 more replies)
  1 sibling, 3 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-25 23:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> On 01/18, Linus Torvalds wrote:
> >
> > Using the high bits of 'eflags' might work.
> 
> I thought about changing eflags too, this looks very natural to me.
> 
> But I do not understand the result of this discussion, are you going
> to apply this change?
> 
> If not...
> 
> Not sure this is really better, but there is another idea. Currently we
> have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?
> 
> IOW. Currently ptrace_report_syscall() does
> 
> 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
> 
> We can add the new events,
> 
> 	PTRACE_EVENT_SYSCALL_ENTRY
> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> 	PTRACE_EVENT_SYSCALL_EXIT
> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT

We can get away with just the first one.
(1) It's unlikely people would want to get native sysentry events but not compat ones,
thus first two options can be combined into one;
(2) syscall exit compat-ness is known from entry type - no need to indicate it; and
(3) if we would flag syscall entry with an event value in wait status, then syscall
exit will already be distinguishable.

Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
"on syscall entry ptrace stop, set a nonzero event value in wait status"
, and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.

To future-proof this scheme we may reserve a few more event values
PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
if we'll ever have arches with more than one non-native syscall
entry. I'm no expert, but looking at strace code, ARM may already have
more than one additional convention how to pass syscall args.


-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 20:20                                                   ` Pedro Alves
@ 2012-01-25 23:36                                                     ` Denys Vlasenko
  0 siblings, 0 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-25 23:36 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Oleg Nesterov, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On Wednesday 25 January 2012 21:20, Pedro Alves wrote:
> On 01/25/2012 07:36 PM, Oleg Nesterov wrote:
> > 
> > Not sure this is really better, but there is another idea. Currently we
> > have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> > Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> > PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?
> 
> May I beg you not to rely on PTRACE_SYSCALL for anything new?

This doesn't *add* anything new. All the same ptrace stops will happen
at exactly the same moments. No new stops added. We only add a value
into upper half of waitpid status: (status >> 16) used to be 0
on syscall entry. Now it will be PTRACE_EVENT_SYSCALL_ENTRY[1].
That's all.
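
On the tracer side this could look roughly as follows. A sketch only: the
two event values are made up, classify_stop() is a hypothetical helper, and
it assumes PTRACE_O_TRACESYSGOOD is set so that syscall stops carry
SIGTRAP | 0x80.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical values; the thread only proposes the names. */
#define EV_SYSCALL_ENTRY   16	/* native entry */
#define EV_SYSCALL_ENTRY1  17	/* compat entry */

enum stop_kind {
	STOP_OTHER,		/* not a syscall stop at all */
	STOP_ENTRY_NATIVE,
	STOP_ENTRY_COMPAT,
	STOP_SYSCALL_EXIT,	/* syscall stop with no event value */
};

/* Classify a waitpid() status under the proposed scheme. */
static enum stop_kind classify_stop(int status)
{
	/* The event value lives in the upper half of the status. */
	unsigned int event = (unsigned int)status >> 16;

	if (!WIFSTOPPED(status))
		return STOP_OTHER;
	if (WSTOPSIG(status) != (SIGTRAP | 0x80))
		return STOP_OTHER;

	switch (event) {
	case EV_SYSCALL_ENTRY:
		return STOP_ENTRY_NATIVE;
	case EV_SYSCALL_ENTRY1:
		return STOP_ENTRY_COMPAT;
	case 0:
		/* Entry is always tagged, so an untagged stop is an exit. */
		return STOP_SYSCALL_EXIT;
	default:
		return STOP_OTHER;
	}
}
```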

> You can't PTRACE_SINGLESTEP and PTRACE_SYSCALL simultaneously.

This is an orthogonal problem.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 23:32                                                   ` Denys Vlasenko
@ 2012-01-26  0:40                                                     ` Indan Zupancic
  2012-01-26  1:08                                                       ` Jamie Lokier
  2012-01-26  1:09                                                       ` Denys Vlasenko
  2012-01-26  0:59                                                     ` Jamie Lokier
  2012-01-26 18:44                                                     ` Oleg Nesterov
  2 siblings, 2 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-26  0:40 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Thu, January 26, 2012 00:32, Denys Vlasenko wrote:
> On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
>> On 01/18, Linus Torvalds wrote:
>> >
>> > Using the high bits of 'eflags' might work.
>>
>> I thought about changing eflags too, this looks very natural to me.
>>
>> But I do not understand the result of this discussion, are you going
>> to apply this change?
>>
>> If not...
>>
>> Not sure this is really better, but there is another idea. Currently we
>> have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
>> Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
>> PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?

Disadvantage of that is that all archs have to add support for this,
while it only affects x86_64.

>>
>> IOW. Currently ptrace_report_syscall() does
>>
>> 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>>
>> We can add the new events,
>>
>> 	PTRACE_EVENT_SYSCALL_ENTRY
>> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
>> 	PTRACE_EVENT_SYSCALL_EXIT
>> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
>
> We can get away with just the first one.
> (1) It's unlikely people would want to get native sysentry events but not compat ones,
> thus first two options can be combined into one;

True.

> (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> (3) if we would flag syscall entry with an event value in wait status, then syscall
> exit will already be distinguishable.

False for execve which messes everything up by changing TID sometimes.

>
> Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> "on syscall entry ptrace stop, set a nonzero event value in wait status"
> , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.

Not all code wants to receive a syscall exit event all the time, so
if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
too. That would pretty much halve ptrace's overhead for my use case.
But this is orthogonal to the compat problem.

> To future-proof this scheme we may reserve a few more event values
> PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> if we'll ever have arches with more than one non-native syscall
> entry. I'm no expert, but looking at strace code, ARM may already have
> more than one additional convention how to pass syscall args.

Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
would be horrible. Keep arch specific stuff in arch specific areas,
please don't spread it around.

What was wrong with using eflags again? Is it too simple or something?

Greetings,

Indan


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 23:32                                                   ` Denys Vlasenko
  2012-01-26  0:40                                                     ` Indan Zupancic
@ 2012-01-26  0:59                                                     ` Jamie Lokier
  2012-01-26  1:21                                                       ` Denys Vlasenko
  2012-01-26  8:23                                                       ` Pedro Alves
  2012-01-26 18:44                                                     ` Oleg Nesterov
  2 siblings, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-26  0:59 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Denys Vlasenko wrote:
> On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> > On 01/18, Linus Torvalds wrote:
> > >
> > > Using the high bits of 'eflags' might work.
> > 
> > I thought about changing eflags too, this looks very natural to me.
> > 
> > But I do not understand the result of this discussion, are you going
> > to apply this change?
> > 
> > If not...
> > 
> > Not sure this is really better, but there is another idea. Currently we
> > have PTRACE_O_TRACESYSGOOD to avoid the confusion with the real SIGTRAP.
> > Perhaps we can add PTRACE_O_TRACESYS_VERY_GOOD (or we can look at
> > PT_SEIZED instead) and report TS_COMPAT via ptrace_report_syscall ?
> > 
> > IOW. Currently ptrace_report_syscall() does
> > 
> > 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
> > 
> > We can add the new events,
> > 
> > 	PTRACE_EVENT_SYSCALL_ENTRY
> > 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> > 	PTRACE_EVENT_SYSCALL_EXIT
> > 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> 
> We can get away with just the first one.
> (1) It's unlikely people would want to get native sysentry events but not compat ones,
> thus first two options can be combined into one;

Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
whether it's compat as such.

I'm thinking it might be a little kinder like this:

#define PTRACE_EVENT_SYSCALL_ENTRY_ABI32 (...)
#define PTRACE_EVENT_SYSCALL_ENTRY_ABI64 (...)

#ifdef CONFIG_64BIT
# define PTRACE_EVENT_SYSCALL_ENTRY         PTRACE_EVENT_SYSCALL_ENTRY_ABI64
# define PTRACE_EVENT_SYSCALL_ENTRY_COMPAT  PTRACE_EVENT_SYSCALL_ENTRY_ABI32
#else
# define PTRACE_EVENT_SYSCALL_ENTRY         PTRACE_EVENT_SYSCALL_ENTRY_ABI32
#endif

So the ABI is represented directly, with the _ENTRY referring to the
tracer's own.  (Other ABI numbers can exist, e.g. OABI and EABI for
ARM, see below.)

This has two specific advantages:

  1. It can match on specific ABI or regular/compat, as suits the tracer's code.
  2. When a 32-bit *tracer* is running a 64-bit *tracee*, at least it knows ;-)

With your idea, what happens in situation 2?  I'm not sure a 32-bit
tracer can do anything useful, because it can't get the 64-bit
registers, but at least it can see when it's got the wrong registers :-)

> (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> (3) if we would flag syscall entry with an event value in wait status, then syscall
> exit will already be distinguishable.
>
> Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> "on syscall entry ptrace stop, set a nonzero event value in wait status"
> , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.

PTRACE_EVENT_SYSCALL_EXIT would cleanly indicate that the new option
is actually working without the tracer needing to do a fork+test, if
PTRACE_ATTACH is used and for some reason the tracer sees a syscall
exit first.  I'm not sure if this can happen but I've heard rumour of
it on some archs or kernel versions.

> To future-proof this scheme we may reserve a few more event values
> PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> if we'll ever have arches with more than one non-native syscall
> entry.

> I'm no expert, but looking at strace code, ARM may already have
> more than one additional convention how to pass syscall args.

I was just looking at ARM and see exactly the same thing.  The
difference between EABI and OABI calls is significant on ARM, even
though syscall numbers are the same; and the ABI is selected by the
syscall instruction used, not process personality.  The __NR_name
values differ for each ABI, but (if I read arm/kernel/entry-common.S
properly) strace sees the same __NR_name values for both ABIs.

MIPS also has two different 32-bit ABIs, as well as 64-bit, but on
MIPS the syscall numbers are distinct, and should be seen by ptrace.
(Again if I read mips/kernel/ correctly.)

PA-RISC also has two different ABIs, the Linux one and the HPUX one.
The syscall numbers are different but overlap.  I don't know if they
are distinct to ptrace, in which case using the HPUX entry point might
be used to subvert a ptracer unless the ABI number is exposed.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:40                                                     ` Indan Zupancic
@ 2012-01-26  1:08                                                       ` Jamie Lokier
  2012-01-26  1:22                                                         ` Denys Vlasenko
  2012-01-26  6:34                                                         ` Indan Zupancic
  2012-01-26  1:09                                                       ` Denys Vlasenko
  1 sibling, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-26  1:08 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

Indan Zupancic wrote:
> On Thu, January 26, 2012 00:32, Denys Vlasenko wrote:
> > On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> >> IOW. Currently ptrace_report_syscall() does
> >>
> >> 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
> >>
> >> We can add the new events,
> >>
> >> 	PTRACE_EVENT_SYSCALL_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_EXIT
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> >
> > We can get away with just the first one.
> > (1) It's unlikely people would want to get native sysentry events but not compat ones,
> > thus first two options can be combined into one;
> 
> True.
> 
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > exit will already be distinguishable.
> 
> False for execve which messes everything up by changing TID sometimes.

Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
returns, and you knowing the TID always changes to the PID?  I haven't
yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
not the old one, perhaps that could be changed.

It would be good to improve the threaded execve() behaviour for all
the disappearing TIDs to issue a disappearing event, and the winning
execve changing-TID to issue an I-am-changing-TID event, anyway.

> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
> 
> Not all code wants to receive a syscall exit event all the time, so
> if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
> too. That would pretty much halve ptrace's overhead for my use case.
> But this is orthogonal to the compat problem.

I agree.  I would like to ignore the exit for most syscalls but see a
few of them.  I guess PTRACE_SETOPTIONS could be used to toggle it,
with some overhead.  But in the spirit of this thread,
PTRACE_O_TRACE_BPF would be even better, to completely ignore
irrelevant syscalls :-)

> > To future-proof this scheme we may reserve a few more event values
> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> > if we'll ever have arches with more than one non-native syscall
> > entry. I'm no expert, but looking at strace code, ARM may already have
> > more than one additional convention how to pass syscall args.
> 
> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
> would be horrible. Keep arch specific stuff in arch specific areas,
> please don't spread it around.
> 
> What was wrong with using eflags again? Is it too simple or something?

Well it doesn't deal with the equivalent issue on ARM and PA-RISC.

-- Jamie


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:40                                                     ` Indan Zupancic
  2012-01-26  1:08                                                       ` Jamie Lokier
@ 2012-01-26  1:09                                                       ` Denys Vlasenko
  2012-01-26  3:47                                                         ` Linus Torvalds
  2012-01-26  5:57                                                         ` Indan Zupancic
  1 sibling, 2 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26  1:09 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oleg Nesterov, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Thursday 26 January 2012 01:40, Indan Zupancic wrote:
> >> We can add the new events,
> >>
> >> 	PTRACE_EVENT_SYSCALL_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> >> 	PTRACE_EVENT_SYSCALL_EXIT
> >> 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> >
> > We can get away with just the first one.
> > (1) It's unlikely people would want to get native sysentry events but not compat ones,
> > thus first two options can be combined into one;
> 
> True.
> 
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > exit will already be distinguishable.
> 
> False for execve which messes everything up by changing TID sometimes.

Dealt with in Linus tree already: set PTRACE_O_TRACEEXEC option,
and use PTRACE_GETEVENTMSG in PTRACE_EVENT_EXEC stop to get
the old TID.
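
A rough sketch of how a tracer might use this to keep per-thread state
across execve. PTRACE_GETEVENTMSG and PTRACE_EVENT_EXEC are the real
interfaces named above; struct tracee, find_tracee(), rekey_after_exec()
and handle_exec_stop() are illustrative names, and the flat table is only
for brevity.

```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Per-thread tracer state. abi_at_entry is whatever the tracer recorded
 * at the last syscall-entry stop (the encoding is up to the tracer). */
struct tracee {
	pid_t tid;		/* 0 marks a free slot */
	int in_syscall;
	int abi_at_entry;
};

static struct tracee *find_tracee(struct tracee *tab, size_t n, pid_t tid)
{
	for (size_t i = 0; i < n; i++)
		if (tab[i].tid == tid)
			return &tab[i];
	return NULL;
}

/* Move the state recorded under old_tid to new_tid. A stale entry that
 * is still keyed by new_tid (the former leader) is dropped. */
static int rekey_after_exec(struct tracee *tab, size_t n,
			    pid_t old_tid, pid_t new_tid)
{
	struct tracee *old = find_tracee(tab, n, old_tid);
	struct tracee *stale;

	if (!old)
		return -1;
	if (old_tid == new_tid)
		return 0;	/* the leader itself called execve */

	stale = find_tracee(tab, n, new_tid);
	if (stale)
		stale->tid = 0;
	old->tid = new_tid;
	return 0;
}

/* Call at a PTRACE_EVENT_EXEC stop, which is reported for new_tid. */
int handle_exec_stop(struct tracee *tab, size_t n, pid_t new_tid)
{
	unsigned long old_tid = 0;

	if (ptrace(PTRACE_GETEVENTMSG, new_tid, NULL, &old_tid) == -1)
		return -1;
	return rekey_after_exec(tab, n, (pid_t)old_tid, new_tid);
}
```

With the entry state carried over like this, the syscall-exit stop that
follows execve can still be matched to the ABI recorded at entry.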


> > To future-proof this scheme we may reserve a few more event values
> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> > if we'll ever have arches with more than one non-native syscall
> > entry. I'm no expert, but looking at strace code, ARM may already have
> > more than one additional convention how to pass syscall args.
> 
> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
> would be horrible. Keep arch specific stuff in arch specific areas,
> please don't spread it around.

The situation when an architecture has 32- and 64-bit varieties,
and sometimes different ABIs (parameter passing conventions),
is rather typical, it's not a quirk of just one unfortunate
architecture.

Please look at strace source, get_scno() function, where
it reads syscall no and parameters. Let's see....
- POWERPC: has 32-bit and 64-bit mode
- X86_64: has 32-bit and 64-bit mode
- IA64: has i386-compat mode
- ARM: has more than one ABI
- SPARC: has 32-bit and 64-bit mode

Do you want to re-invent a different arch-specific way to report
syscall type for each of these arches?


> What was wrong with using eflags again? Is it too simple or something?

It's x86-specific, and abuses a bit in a real register.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:59                                                     ` Jamie Lokier
@ 2012-01-26  1:21                                                       ` Denys Vlasenko
  2012-01-26  8:23                                                       ` Pedro Alves
  1 sibling, 0 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26  1:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Oleg Nesterov, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thursday 26 January 2012 01:59, Jamie Lokier wrote:
> Denys Vlasenko wrote:
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > exit will already be distinguishable.
> >
> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
> 
> PTRACE_EVENT_SYSCALL_EXIT would cleanly indicate that the new option
> is actually working without the tracer needing to do a fork+test, if
> PTRACE_ATTACH is used and for some reason the tracer sees a syscall
> exit first.

Can't happen. After PTRACE_ATTACH, you can only see tracee dying, or
getting a signal delivery (usually a SIGSTOP). Anything else
would be a kernel bug.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:08                                                       ` Jamie Lokier
@ 2012-01-26  1:22                                                         ` Denys Vlasenko
  2012-01-26  6:34                                                         ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26  1:22 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thursday 26 January 2012 02:08, Jamie Lokier wrote:
> Indan Zupancic wrote:
> > > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > > exit will already be distinguishable.
> > 
> > False for execve which messes everything up by changing TID sometimes.
> 
> Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> returns, and you knowing the TID always changes to the PID?  I haven't
> yet checked which TID gets the PTRACE_EVENT_EXEC event,

tid change happens before PTRACE_EVENT_EXEC event generation.

-- 
vda


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:09                                                       ` Denys Vlasenko
@ 2012-01-26  3:47                                                         ` Linus Torvalds
  2012-01-26 18:03                                                           ` Denys Vlasenko
  2012-01-26  5:57                                                         ` Indan Zupancic
  1 sibling, 1 reply; 222+ messages in thread
From: Linus Torvalds @ 2012-01-26  3:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Indan Zupancic, Oleg Nesterov, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

On Wed, Jan 25, 2012 at 5:09 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>
> Please look at strace source, get_scno() function, where
> it reads syscall no and parameters. Let's see....
> - POWERPC: has 32-bit and 64-bit mode
> - X86_64: has 32-bit and 64-bit mode
> - IA64: has i386-compat mode
> - ARM: has more than one ABI
> - SPARC: has 32-bit and 64-bit mode
>
> Do you want to re-invent a different arch-specific way to report
> syscall type for each of these arches?

I think an arch-specific one is better than trying to make some
generic one that is messy.

As you say, many architectures have multiple system call ABIs.

But they tend to be very *different* issues. They can be about
multiple ABI's, as you mention, and even when they *look* similar
(32-bit vs 64-bit ABI's) they are actually totally different issues.

On x86, the real issue is not so much "32-bit vs 64-bit" as "multiple
system call entry models", where a 64-bit process can use the system
call entry for a 32-bit one. That is not true on POWER, for example,
and trying to make it out to be the same issue only muddles the point,
and confuses things. It really is NOT AT ALL the same issue, even if
you can make it "look" like the same issue by calling it a 32-bit vs
64-bit thing.

So for POWER, it really is about the mode of the CPU/process. For x86,
it really isn't. Trying to equate the two is *wrong*.
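
To make the x86 case concrete: the same number names different system
calls depending on the entry path. The sketch below hard-codes a few
well-known numbers from the x86_64 and i386 syscall tables; the enum, the
table names and the helpers are illustrative, not any tracer's real code.

```c
#include <stddef.h>
#include <string.h>

enum syscall_abi { ABI_X86_64, ABI_I386 };

struct nr_name {
	int nr;
	const char *name;
};

/* A handful of entries from each table, enough to show the clash. */
static const struct nr_name x86_64_names[] = {
	{ 0, "read" }, { 1, "write" }, { 2, "open" }, { 3, "close" },
	{ 11, "munmap" }, { 59, "execve" }, { 60, "exit" },
};

static const struct nr_name i386_names[] = {
	{ 1, "exit" }, { 2, "fork" }, { 3, "read" }, { 4, "write" },
	{ 5, "open" }, { 6, "close" }, { 11, "execve" },
};

static const char *lookup(const struct nr_name *tab, size_t n, int nr)
{
	for (size_t i = 0; i < n; i++)
		if (tab[i].nr == nr)
			return tab[i].name;
	return "?";	/* not in this abbreviated table */
}

/* Name of syscall nr, given the entry path that was actually used. */
static const char *syscall_name(enum syscall_abi abi, int nr)
{
	if (abi == ABI_I386)
		return lookup(i386_names,
			      sizeof(i386_names) / sizeof(i386_names[0]), nr);
	return lookup(x86_64_names,
		      sizeof(x86_64_names) / sizeof(x86_64_names[0]), nr);
}

/* A policy check that is only correct if abi matches the entry path. */
static int is_execve(enum syscall_abi abi, int nr)
{
	return strcmp(syscall_name(abi, nr), "execve") == 0;
}
```

A tracer that sees nr 11 from a 64-bit process and assumes the 64-bit
table reads munmap, while an int $0x80 entry would actually run execve.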

I seriously think it's better to be architecture-specific than to be
that kind of totally confused, and try to "consolidate" the issue,
when they are actually two totally different issues.

                        Linus


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:09                                                       ` Denys Vlasenko
  2012-01-26  3:47                                                         ` Linus Torvalds
@ 2012-01-26  5:57                                                         ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-26  5:57 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Thu, January 26, 2012 02:09, Denys Vlasenko wrote:
> Dealt with in Linus tree already: set PTRACE_O_TRACEEXEC option,
> and use PTRACE_GETEVENTMSG in PTRACE_EVENT_EXEC stop to get
> the old TID.

Thanks, getting the old TID is useful, that was the missing bit to
handle execve statelessly.

>
>
>> > To future-proof this scheme we may reserve a few more event values
>> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
>> > if we'll ever have arches with more than one non-native syscall
>> > entry. I'm no expert, but looking at strace code, ARM may already have
>> > more than one additional convention how to pass syscall args.
>>
>> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
>> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
>> would be horrible. Keep arch specific stuff in arch specific areas,
>> please don't spread it around.
>
> The situation when an architecture has 32- and 64-bit varieties,
> and sometimes different ABIs (parameter passing conventions),
> is rather typical, it's not a quirk of just one unfortunate
> architecture.

The question is how many of those have a different ABI and can not
be reliably detected at system call entry time. If the ABI can't
be changed at runtime then there's no problem either.

x86_64's case is very peculiar because it can execute a 32-bit compat
syscall while the process itself is in 64-bit mode.

> Please look at strace source, get_scno() function, where
> it reads syscall no and parameters. Let's see....
> - POWERPC: has 32-bit and 64-bit mode
> - X86_64: has 32-bit and 64-bit mode
> - IA64: has i386-compat mode
> - ARM: has more than one ABI
> - SPARC: has 32-bit and 64-bit mode

For most of them you can reliably check the mode by looking at registers.

x86_64 and apparently ARM are problematic. Others may be too, in similarly
subtle ways as x86_64, but I can't tell that from strace's code.

ARM looks ok when old cruft isn't enabled (much more likely than compat
mode being disabled in x86_64).

Can SPARC change mode on the fly without detection? Otherwise it looks
like it may be slightly problematic too, though it seems that at least
the ABI is pretty much the same between 32 and 64 bit mode. Same for
PA-RISC. So all in all not sure if they have a problem or not.

It is only a problem if the only way to figure out which mode the system
call will use is by looking at the trapping/syscall instruction itself. If
that isn't needed, or if there isn't much difference between the modes
anyway, then there's no problem.
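
For x86, looking at the instruction means reading the two bytes just
before the reported instruction pointer (for instance with
PTRACE_PEEKTEXT). A sketch of the decoding step only: classify_entry_insn()
is a hypothetical helper, and the answer is a guess at best, since another
thread can rewrite those bytes and the kernel may report an address that
does not directly follow the instruction that was executed.

```c
enum entry_insn {
	INSN_UNKNOWN,
	INSN_INT80,	/* int $0x80: 32-bit syscall table */
	INSN_SYSCALL,	/* syscall */
	INSN_SYSENTER,	/* sysenter */
};

/* b0 and b1 are the two bytes immediately before the instruction
 * pointer reported at the syscall stop. */
static enum entry_insn classify_entry_insn(unsigned char b0, unsigned char b1)
{
	if (b0 == 0xcd && b1 == 0x80)
		return INSN_INT80;
	if (b0 == 0x0f && b1 == 0x05)
		return INSN_SYSCALL;
	if (b0 == 0x0f && b1 == 0x34)
		return INSN_SYSENTER;
	return INSN_UNKNOWN;
}
```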

>
> Do you want to re-invent a different arch-specific way to report
> syscall type for each of these arches?

Thing is, it is not just 32 versus 64 bit mode. So one way or the other,
you do end up with an arch-specific way of saying what syscall type it is.

It doesn't matter much where that info is stuffed, it will always be arch
specific, because depending on that value people have to do different
arch specific things.

It's fine to somehow give that info together with PTRACE_EVENT_SYSCALL_ENTRY,
but I don't think it's a good idea to have different syscall entry events
depending on what type they are. I suppose the only reason to do that would
be because we're running out of bits elsewhere.

>
>> What was wrong with using eflags again? Is it too simple or something?
>
> It's x86-specific, and abuses a bit in a real register.

If the problem is limited to only a couple of archs, and we can stuff this
info in the register set for all of them, then I'm all for it.

So far it's just x86_64 and ARM with OABI enabled, with the rest either
fine or unclear.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  1:08                                                       ` Jamie Lokier
  2012-01-26  1:22                                                         ` Denys Vlasenko
@ 2012-01-26  6:34                                                         ` Indan Zupancic
  2012-01-26 10:31                                                           ` Jamie Lokier
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-26  6:34 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalc

On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
> Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> returns, and you knowing the TID always changes to the PID?  I haven't
> yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
> not the old one, perhaps that could be changed.

Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
barely documented already, but if it ever changes it will also be
unreliable.

It's still unclear if the PTRACE_EVENT_EXEC comes before or after
or instead of the post-execve ptrace event. I guess before, but
can I count on that? If it is after then I get a stray weird
execve event that messes up the system call cadence.

> It would be good to improve the threaded execve() behaviour for all
> the disappearing TIDs to issue a disappearing event, and the winning
> execve changing-TID to issue an I-am-changing-TID event, anyway.

As Denys said, you get the event with the new PID, and apparently with
the latest kernel you can get the old TID with PTRACE_GETEVENTMSG.

So all the info is there to handle it statelessly now.
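
[Illustration: a minimal sketch of the stateless handling described above.
It assumes a kernel where PTRACE_GETEVENTMSG at a PTRACE_EVENT_EXEC stop
returns the former thread ID (the ptrace(2) man page documents this for
Linux 3.0 and later). The helper name exec_former_tid is made up for this
example.]

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: given a status from waitpid(), return
 *   1  if it is a PTRACE_EVENT_EXEC stop and *old_tid was filled in,
 *   0  if it is any other kind of status,
 *  -1  if PTRACE_GETEVENTMSG failed.
 * On older kernels the event message may not hold the old TID.
 */
static int exec_former_tid(pid_t pid, int status, unsigned long *old_tid)
{
    if (!WIFSTOPPED(status))
        return 0;
    /* Event stops are reported as SIGTRAP | (event << 8) in status >> 8. */
    if ((status >> 8) != (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
        return 0;
    if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, old_tid) == -1)
        return -1;
    return 1;
}
```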

My point was that stateless handling is much preferred to stateful
handling, which is why not having the syscall mode available at the
syscall exit event would sometimes be inconvenient (the real mode
can differ from the guessed one).

>> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
>> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
>> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
>> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
>>
>> Not all code wants to receive a syscall exit event all the time, so
>> if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
>> too. That would pretty much halve ptrace's overhead for my use case.
>> But this is orthogonal to the compat problem.
>
> I agree.  I would like to ignore the exit for most syscalls but see a
> few of them.  I guess PTRACE_SETOPTIONS could be used to toggle it,
> with some overhead.

Yes, that's what I had in mind.

> But in the spirit of this thread,
> PTRACE_O_TRACE_BPF would be even better, to completely ignore
> irrelevant syscalls :-)

Yes, that's the only reason I'm interested in BPF, really.
Most system calls are either always allowed, or always denied.
Of the ones that need checking, most of them have file paths.
For those I'm not interested in the post-syscall event.

>> > To future-proof this scheme we may reserve a few more event values
>> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
>> > if we'll ever have arches with more than one non-native syscall
>> > entry. I'm no expert, but looking at strace code, ARM may already have
>> > more than one additional convention how to pass syscall args.
>>
>> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
>> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
>> would be horrible. Keep arch specific stuff in arch specific areas,
>> please don't spread it around.
>>
>> What was wrong with using eflags again? Is it too simple or something?
>
> Well it doesn't deal with the equivalent issue on ARM and PA-RISC.

Those issues are not equivalent. ARM only has that OABI thing which
is hopefully not used in practice. Can you switch modes on-the-fly in
PA-RISC without doing a system call? Both ARM and PA-RISC use only one
struct pt_regs and one syscall table.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  0:59                                                     ` Jamie Lokier
  2012-01-26  1:21                                                       ` Denys Vlasenko
@ 2012-01-26  8:23                                                       ` Pedro Alves
  2012-01-26  8:53                                                         ` Denys Vlasenko
  1 sibling, 1 reply; 222+ messages in thread
From: Pedro Alves @ 2012-01-26  8:23 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Indan Zupancic,
	Andi Kleen, Andrew Lutomirski, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On 01/26/2012 12:59 AM, Jamie Lokier wrote:
> Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
> whether it's compat as such.

Another idea, avoiding new PTRACE_EVENTs per arch, would be to make
the abi32/abi64/compat/whatnot discriminator retrievable with PTRACE_GETEVENTMSG
instead.  So you'd get PTRACE_EVENT_SYSCALL_ENTRY|EXIT, or the regular old
0x80|SIGTRAP, you'd still fetch the syscall number from $orig_ax (or whatever means
for other archs), as usual, then have extra syscall info in PTRACE_GETEVENTMSG.
I don't know if it'd be simple to make it possible to do PTRACE_GETEVENTMSG
on a 0x80|SIGTRAP trap, but I imagine so.

-> wait
  <- 0x80|SIGTRAP   (or PTRACE_EVENT_SYSCALL_ENTRY)
-> read regs, find out syscall number
-> PTRACE_GETEVENTMSG, figure out which entry mode was used.
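
[Illustration: a minimal sketch of the first step of the flow above, telling
a 0x80|SIGTRAP syscall stop apart from a PTRACE_EVENT_* stop using only the
wait status. It assumes PTRACE_O_TRACESYSGOOD is set so that syscall stops
carry the 0x80 bit. The names classify_stop and stop_kind are made up for
this example. PTRACE_EVENT_SYSCALL_ENTRY and a PTRACE_GETEVENTMSG that
reports the entry mode are only proposals in this thread, so they are not
used here.]

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of a waitpid() status for a tracee. */
enum stop_kind { STOP_NONE, STOP_SYSCALL, STOP_EVENT, STOP_SIGNAL };

/*
 * Classify a wait status. For STOP_EVENT, *event receives the
 * PTRACE_EVENT_* number taken from bits 16 and up of the status.
 */
static enum stop_kind classify_stop(int status, int *event)
{
    *event = 0;
    if (!WIFSTOPPED(status))
        return STOP_NONE;               /* exited or killed, not a stop */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;            /* syscall entry or exit stop */
    if (WSTOPSIG(status) == SIGTRAP && (status >> 16) != 0) {
        *event = status >> 16;
        return STOP_EVENT;              /* e.g. PTRACE_EVENT_EXEC */
    }
    return STOP_SIGNAL;                 /* ordinary signal-delivery stop */
}
```

[After STOP_SYSCALL the tracer would read the registers to find the syscall
number; under this proposal it would then issue the extra PTRACE_GETEVENTMSG
call to learn the entry mode.]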

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  8:23                                                       ` Pedro Alves
@ 2012-01-26  8:53                                                         ` Denys Vlasenko
  2012-01-26  9:51                                                           ` Pedro Alves
  0 siblings, 1 reply; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26  8:53 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Jamie Lokier, Oleg Nesterov, Linus Torvalds, Indan Zupancic,
	Andi Kleen, Andrew Lutomirski, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On Thu, Jan 26, 2012 at 9:23 AM, Pedro Alves <palves@redhat.com> wrote:
> On 01/26/2012 12:59 AM, Jamie Lokier wrote:
>> Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
>> whether it's compat as such.
>
> Another idea, avoiding new PTRACE_EVENTs per arch, would be to make
> the abi32/abi64/compat/whatnot discriminator retrievable with PTRACE_GETEVENTMSG
> instead.  So you'd get PTRACE_EVENT_SYSCALL_ENTRY|EXIT, or the regular old
> 0x80|SIGTRAP, you'd still fetch the syscall number from $orig_ax (or whatever means
> for other archs), as usual, then have extra syscall info in PTRACE_GETEVENTMSG.
> I don't know if it'd be simple to make it possible to do PTRACE_GETEVENTMSG
> on a 0x80|SIGTRAP trap, but I imagine it so.
>
> -> wait
>  <- 0x80|SIGTRAP   (or PTRACE_EVENT_SYSCALL_ENTRY)
> -> read regs, find out syscall number
> -> PTRACE_GETEVENTMSG, figure out which entry mode was used.

This would require an additional ptrace op per syscall entry.
Linus' method and event method wouldn't.

-- 
vda

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  8:53                                                         ` Denys Vlasenko
@ 2012-01-26  9:51                                                           ` Pedro Alves
  0 siblings, 0 replies; 222+ messages in thread
From: Pedro Alves @ 2012-01-26  9:51 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Pedro Alves, Jamie Lokier, Oleg Nesterov, Linus Torvalds,
	Indan Zupancic, Andi Kleen, Andrew Lutomirski, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-mod

On 01/26/2012 08:53 AM, Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 9:23 AM, Pedro Alves <palves@redhat.com> wrote:
>> On 01/26/2012 12:59 AM, Jamie Lokier wrote:
>>> Tracers mainly want to know if it's a 32-bit or 64-bit syscall, not
>>> whether it's compat as such.
>>
>> Another idea, avoiding new PTRACE_EVENTs per arch, would be to make
>> the abi32/abi64/compat/whatnot discriminator retrievable with PTRACE_GETEVENTMSG
>> instead.  So you'd get PTRACE_EVENT_SYSCALL_ENTRY|EXIT, or the regular old
>> 0x80|SIGTRAP, you'd still fetch the syscall number from $orig_ax (or whatever means
>> for other archs), as usual, then have extra syscall info in PTRACE_GETEVENTMSG.
>> I don't know if it'd be simple to make it possible to do PTRACE_GETEVENTMSG
>> on a 0x80|SIGTRAP trap, but I imagine it so.
>>
>> -> wait
>>  <- 0x80|SIGTRAP   (or PTRACE_EVENT_SYSCALL_ENTRY)
>> -> read regs, find out syscall number
>> -> PTRACE_GETEVENTMSG, figure out which entry mode was used.
> 
> This would require additional ptrace op per syscall entry.
> Linus' method and event method wouldn't.

Yes.

In any case, ptrace events don't get the state recorded in core files,
which is possibly also important for userspace c/r.
Linus' method or a new regset don't have that drawback.  A new regset requires
an additional ptrace op too, while the former abuses an architecture register,
possibly leading to headaches later on.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  6:34                                                         ` Indan Zupancic
@ 2012-01-26 10:31                                                           ` Jamie Lokier
  2012-01-26 10:40                                                             ` Denys Vlasenko
  2012-01-26 11:20                                                             ` Indan Zupancic
  0 siblings, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-26 10:31 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

Indan Zupancic wrote:
> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> > returns, and you knowing the TID always changes to the PID?  I haven't
> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
> > not the old one, perhaps that could be changed.
> 
> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
> barely documented already, but if it ever changes it will be also
> unreliable.
> 
> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
> or instead of the post-execve ptrace event. I guess before, but
> can I count on that? If it is after then I get a stray weird
> execve event that messes up the system call cadence.

It should be *sent* before because the exec steps must finish before
the execve() syscall "returns".

I'm not sure if the events are guaranteed to be received in the same
order as they are sent.

> >> > Thus, minimally we need one new option, PTRACE_O_TRACE_SYSENTRY -
> >> > "on syscall entry ptrace stop, set a nonzero event value in wait status"
> >> > , and two event values: PTRACE_EVENT_SYSCALL_ENTRY (for native entry),
> >> > PTRACE_EVENT_SYSCALL_ENTRY1 for compat one.
> >>
> >> Not all code wants to receive a syscall exit event all the time, so
> >> if you add PTRACE_O_TRACE_SYSENTRY, please add PTRACE_O_TRACE_SYSEXIT
> >> too. That would pretty much halve ptrace's overhead for my use case.
> >> But this is orthogonal to the compat problem.
> >
> > I agree.  I would like to ignore the exit for most syscalls but see a
> > few of them.  I guess PTRACE_SETOPTIONS could be used to toggle it,
> > with some overhead.
> 
> Yes, that's what I had in mind.
> 
> > But in the spirit of this thread,
> > PTRACE_O_TRACE_BPF would be even better, to completely ignore
> > irrelevant syscalls :-)
> 
> Yes, that's the only reason I'm interested in BPF, really.
> Most system calls are either always allowed, or always denied.
> Of the ones that need checking, most of them have file paths.
> For those I'm not interested in the post-syscall event.

Same here, though for tracing file paths rather than blocking anything.

> >> > To future-proof this scheme we may reserve a few more event values
> >> > PTRACE_EVENT_SYSCALL_ENTRY2, PTRACE_EVENT_SYSCALL_ENTRY3, etc,
> >> > if we'll ever have arches with more than one non-native syscall
> >> > entry. I'm no expert, but looking at strace code, ARM may already have
> >> > more than one additional convention how to pass syscall args.
> >>
> >> Please, no! This way lies madness, just one PTRACE_EVENT_SYSCALL_ENTRY,
> >> no PTRACE_EVENT_SYSCALL_ENTRY1 or PTRACE_EVENT_SYSCALL_ENTRY2, that
> >> would be horrible. Keep arch specific stuff in arch specific areas,
> >> please don't spread it around.
> >>
> >> What was wrong with using eflags again? Is it too simple or something?
> >
> > Well it doesn't deal with the equivalent issue on ARM and PA-RISC.
> 
> Those issues are not equivalent. ARM only has that OABI thing which
> is hopefully not used in practice.

I am still using OABI on some currently-sold and still-developed
devices with userspace libraries that I can't replace or rebuild.
Maybe I'm the only one, but the issue is still there.  It should be
supported in ptrace() as long as it's supported in the kernel at all.

I don't know if the PA-RISC thing is real.

But it's occurred to me that there are a lot of 32/64 archs now (I was
extracting all their syscall number tables last night), and it would
be good if there were a consistent, arch-independent way to signal if
the syscall number is in the 32 or 64-bit table - or at least, in the
same ABI as the tracer gets from <asm/unistd.h>.  For tracers doing
simple things to avoid needing a ton of arch-specific knowledge.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:31                                                           ` Jamie Lokier
@ 2012-01-26 10:40                                                             ` Denys Vlasenko
  2012-01-26 11:01                                                               ` Jamie Lokier
  2012-01-26 11:19                                                               ` Indan Zupancic
  2012-01-26 11:20                                                             ` Indan Zupancic
  1 sibling, 2 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26 10:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Indan Zupancic wrote:
>> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
>> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
>> > returns, and you knowing the TID always changes to the PID?  I haven't
>> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
>> > not the old one, perhaps that could be changed.
>>
>> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
>> barely documented already, but if it ever changes it will be also
>> unreliable.
>>
>> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
>> or instead of the post-execve ptrace event.

Denis <- confused.
What is the "post-execve ptrace event"? I know no such thing.
I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".


>> I guess before, but
>> can I count on that? If it is after then I get a stray weird
>> execve event that messes up the system call cadence.
>
> It should be *sent* before because the exec steps must finish before
> the execve() syscall "returns".
>
> I'm not sure if the events are guaranteed to be received in the same
> order as they are sent.

All ptrace stops (events and other stops) are synchronous.
Tracee stops, tracer notices it, tracer restarts tracee,
and only after this tracee can generate next event.
Therefore ptrace stops can't get reordered.

-- 
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:40                                                             ` Denys Vlasenko
@ 2012-01-26 11:01                                                               ` Jamie Lokier
  2012-01-26 14:02                                                                 ` Denys Vlasenko
  2012-01-26 11:19                                                               ` Indan Zupancic
  1 sibling, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-01-26 11:01 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
> > Indan Zupancic wrote:
> >> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
> >> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
> >> > returns, and you knowing the TID always changes to the PID?  I haven't
> >> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
> >> > not the old one, perhaps that could be changed.
> >>
> >> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
> >> barely documented already, but if it ever changes it will be also
> >> unreliable.
> >>
> >> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
> >> or instead of the post-execve ptrace event.
> 
> Denis <- confused.
> Was ist das "post-execve ptrace event"? I know no such thing.
> I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".

Sorry, I meant to write execve-syscall-exit event.

> >> I guess before, but
> >> can I count on that? If it is after then I get a stray weird
> >> execve event that messes up the system call cadence.
> >
> > It should be *sent* before because the exec steps must finish before
> > the execve() syscall "returns".
> >
> > I'm not sure if the events are guaranteed to be received in the same
> > order as they are sent.
> 
> All ptrace stops (events and other stops) are synchronous.
> Tracee stops, tracer notices it, tracer restarts tracee,
> and only after this tracee can generate next event.
> Therefore ptrace stops can't get reordered.

That's good to know, thanks.

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:40                                                             ` Denys Vlasenko
  2012-01-26 11:01                                                               ` Jamie Lokier
@ 2012-01-26 11:19                                                               ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-01-26 11:19 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Jamie Lokier, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Thu, January 26, 2012 11:40, Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> Indan Zupancic wrote:
>>> On Thu, January 26, 2012 02:08, Jamie Lokier wrote:
>>> > Is it disambiguated by PTRACE_EVENT_EXEC happening before the execve
>>> > returns, and you knowing the TID always changes to the PID?  I haven't
>>> > yet checked which TID gets the PTRACE_EVENT_EXEC event, but if it's
>>> > not the old one, perhaps that could be changed.
>>>
>>> Please don't ever change the behaviour of PTRACE_EVENT_EXEC, it's
>>> barely documented already, but if it ever changes it will be also
>>> unreliable.
>>>
>>> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
>>> or instead of the post-execve ptrace event.
>
> Denis <- confused.
> Was ist das "post-execve ptrace event"? I know no such thing.
> I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".

I mean the second SIGTRAP | 0x80 event, the syscall return of execve.

> All ptrace stops (events and other stops) are synchronous.
> Tracee stops, tracer notices it, tracer restarts tracee,
> and only after this tracee can generate next event.
> Therefore ptrace stops can't get reordered.

That's good to know and what I expected.

Since which kernel version does the PTRACE_GETEVENTMSG work and
is there a way to find out before it returns zero?

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 10:31                                                           ` Jamie Lokier
  2012-01-26 10:40                                                             ` Denys Vlasenko
@ 2012-01-26 11:20                                                             ` Indan Zupancic
  2012-01-26 11:47                                                               ` Jamie Lokier
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-26 11:20 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalc

On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> Indan Zupancic wrote:
>> Yes, that's the only reason I'm interested in BPF, really.
>> Most system calls are either always allowed, or always denied.
>> Of the ones that need checking, most of them have file paths.
>> For those I'm not interested in the post-syscall event.
>
> Same here, though for tracing file paths rather than blocking anything.

The jailer I wrote works pretty well as a simplistic strace replacement.
It can only print out the arguments we're checking, but that's usually
the more interesting info.

>> Those issues are not equivalent. ARM only has that OABI thing which
>> is hopefully not used in practice.
>
> I am still using OABI on some currently-sold and still-developed
> devices with userspace libraries that I can't replace or rebuild.
> Maybe I'm the only one, but the issue is still there.  It should be
> supported in ptrace() as long as it's supported in the kernel at all.

It's not a 32 versus 64-bit issue though, so it will be something on
its own anyway. We might as well add an extra ARM specific ptrace command
to get that info, or hack it in some other way. For instance, ip is
(ab)used to tell if it is syscall entry or exit, so doing these tricks
isn't anything new in ARM either.

> I don't know if the PA-RISC thing is real.
>
> But it's occurred to me that there are a lot of 32/64 archs now (I was
> extracting all their syscall number tables last night), and it would
> be good if there were a consistent, arch-independent way to signal if
> the syscall number is in the 32 or 64-bit table - or at least, in the
> same ABI as the tracer gets from <asm/unistd.h>.  For tracers doing
> simple things to avoid needing a ton of arch-specific knowledge.

You can't avoid the arch-specific knowledge, because depending on the
answer, you have to do something arch specific. In ARM's OABI case, it's
reading program memory to find out the system call number, of all things.
(I hope I read the code wrong). So ARM's solution would need to get all
info it needs to handle the system call securely without reading any text
memory, otherwise it's racy.
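
[Illustration: a minimal sketch of why the OABI case needs the instruction
word. It assumes ARM-mode code, where an EABI syscall is "swi 0" with the
number in r7, and an OABI syscall encodes 0x900000 + number in the swi
immediate. The function name arm_syscall_nr is made up for this example.
Thumb mode and conditional swi encodings are ignored. The instruction word
would have to be read from tracee memory, which is exactly the racy part.]

```c
#include <stdint.h>

/*
 * Hypothetical helper: recover the syscall number from an ARM-mode swi
 * instruction word and the tracee's r7. Returns -1 if the word is not
 * a recognised syscall trap.
 */
static long arm_syscall_nr(uint32_t insn, uint32_t r7)
{
    /* EABI: "swi 0" (condition AL); the number is passed in r7. */
    if (insn == 0xef000000u)
        return (long)r7;

    /* OABI: swi with immediate 0x9xxxxx; the low bits are the number. */
    if ((insn & 0x0ff00000u) == 0x0f900000u)
        return (long)(insn & 0x000fffffu);

    return -1;
}
```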

And then there's the whole confusion about what that flag says: some might
think it says what mode the tracee is in instead of what mode the system
call is in. That those two can be different is not obvious at all and seems
very x86_64 specific.

I'm not sure what you're doing, but perhaps we should share code and write
a kind of Linux ptrace library. The code I wrote was university stuff and
we want to release it, but it will take a while to get things sorted out.
Hopefully it's released in April, maybe before.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:20                                                             ` Indan Zupancic
@ 2012-01-26 11:47                                                               ` Jamie Lokier
  2012-01-26 14:05                                                                 ` Denys Vlasenko
  2012-01-27  7:23                                                                 ` Indan Zupancic
  0 siblings, 2 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-01-26 11:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

Indan Zupancic wrote:
> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> > Indan Zupancic wrote:
> >> Yes, that's the only reason I'm interested in BPF, really.
> >> Most system calls are either always allowed, or always denied.
> >> Of the ones that need checking, most of them have file paths.
> >> For those I'm not interested in the post-syscall event.
> >
> > Same here, though for tracing file paths rather than blocking anything.
> 
> The jailer I wrote works pretty well as a simplistic strace replacement.
> It can only print out the arguments we're checking, but that's usually
> the more interesting info.

In theory such a thing should be easy to write, but as we both found,
ptrace() on Linux has a huge number of difficult quirks to deal with
to trace reliably.  At least it's getting better with later kernels.

> >> Those issues are not equivalent. ARM only has that OABI thing which
> >> is hopefully not used in practice.
> >
> > I am still using OABI on some currently-sold and still-developed
> > devices with userspace libraries that I can't replace or rebuild.
> > Maybe I'm the only one, but the issue is still there.  It should be
> > supported in ptrace() as long as it's supported in the kernel at all.
> 
> It's not a 32 versus 64-bit issue though, so it will be something on
> its own anyway. Can as well add an extra ARM specific ptrace command
> to get that info, or hack it in some other way. For instance, ip is
> (ab)used to tell if it is syscall entry or exit, so doing these tricks
> isn't anything new in ARM either.

In theory, aren't we supposed to know whether it's entry/exit anyway?
Why does strace care?  Have there been kernel bugs in the past?  Maybe
it was just to deal with SIGTRAP-after-exit in the past, which could
be delivered at an unpredictable time if blocked and then unblocked by
sigreturn().

> You can't avoid the arch-specific knowledge, because depending on the
> answer, you have to do something arch specific. In ARM's OABI case, it's
> reading program memory to find out the system call number, of all things.
> (I hope I read the code wrong). So ARM's solution would need to get all
> info it needs to handle the system call securely without reading any text
> memory, otherwise it's racy.

A few archs read program memory to get the syscall number even now, in
the current strace source.  Look for PEEKTEXT: S390, ARM, SPARC use it
on every syscall entry, and X86_64 has it commented out.

As we know, all of them are buggy if the memory is modified while
reading it, and it's silly because the kernel knows the syscall
number.
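
[Illustration: a minimal sketch of what a PEEKTEXT-style check looks like on
x86, classifying the two opcode bytes of the trapping instruction. The
opcodes are the standard encodings of int $0x80, syscall and sysenter; the
names x86_entry and x86_entry_insn are made up for this example. Where the
bytes sit relative to the saved instruction pointer is an assumption left to
the caller, and the bytes can change under the tracer, which is the bug
described above.]

```c
/* Hypothetical classification of an x86 syscall entry instruction. */
enum x86_entry { ENTRY_UNKNOWN, ENTRY_INT80, ENTRY_SYSCALL, ENTRY_SYSENTER };

/*
 * Classify the two opcode bytes of the instruction that trapped.
 * The caller is assumed to have fetched them from tracee memory,
 * e.g. with PTRACE_PEEKTEXT.
 */
static enum x86_entry x86_entry_insn(const unsigned char insn[2])
{
    if (insn[0] == 0xcd && insn[1] == 0x80)
        return ENTRY_INT80;     /* int $0x80: 32-bit syscall table */
    if (insn[0] == 0x0f && insn[1] == 0x05)
        return ENTRY_SYSCALL;   /* syscall */
    if (insn[0] == 0x0f && insn[1] == 0x34)
        return ENTRY_SYSENTER;  /* sysenter */
    return ENTRY_UNKNOWN;
}
```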

> And then there's the whole confusion what that flag says, some might think
> it says in what mode the tracee is instead of what mode the system call is.
> That those two can be different is not obvious at all and seems very x86_64
> specific.

My rough read of PARISC entry code suggests it has two entry methods,
similar to ARM and x86_64, but I'm not really familiar with PARISC and
I don't have a machine handy to try it out :-)

> I'm not sure what you're doing, but perhaps we should share code and write
> a kind of Linux ptrace library. The code I wrote was university stuff and
> we want to release it, but it will take a while to get things sorted out.
> Hopefully it's released in April, maybe before.

I've been thinking along similar lines.  The idea came up when I was
hacking on strace last year and it so wanted to be cleaned up (but now
strace is in good hands, my work on it is obsolete); now I'm doing
ptracing for other purposes.  Denys' ptrace API document, currently in
strace git, is extremely useful.

Denys, would you be interested in further refactoring strace to use a
"libsystrace" sort of thing which abstracts the detail of archs,
tracing (and maybe syscall argument layout) away from the printing and
user-interface, for strace's use and other users?  I would be happy to
help with that and keep strace's non-Linux support as well (if there's
any way to test the latter...)  I seem to be going in the direction of
a library like that anyway for another project.

The seccomp-BPF stuff could also benefit from a part dealing with
syscall argument layout, as it too needs that arch-specific
knowledge.  I have a script in progress which extracts all the
per-arch and per-ABI syscall numbers, syscall argument layouts and
kernel function names to keep track of arch-specific fixups, from a
Linux source tree.  It currently works on all archs except it breaks
on x86 which insists on being different ;-)

-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:01                                                               ` Jamie Lokier
@ 2012-01-26 14:02                                                                 ` Denys Vlasenko
  0 siblings, 0 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26 14:02 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 26, 2012 at 12:01 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Denys Vlasenko wrote:
>> On Thu, Jan 26, 2012 at 11:31 AM, Jamie Lokier <jamie@shareable.org> wrote:
>> >> It's still unclear if the PTRACE_EVENT_EXEC comes before or after
>> >> or instead of the post-execve ptrace event.
>>
>> Denis <- confused.
>> Was ist das "post-execve ptrace event"? I know no such thing.
>> I know about PTRACE_EVENT_EXEC, and "post-execve SIGTRAP".
>
> Sorry, I meant to write execve-syscall-exit event.

PTRACE_EVENT_EXEC happens before syscall exit. syscall exit
is not lost. Basically, the sequence is:

tracer               tracee with tid N, tgid M
   <------------- syscall entry for execve, pid=N
PTRACE_SYSCALL--->
   <------------- PTRACE_EVENT_EXEC, pid=M
PTRACE_GETEVENTMSG-->
   <------------- returns N ("I used to be tid N")
PTRACE_SYSCALL--->
   <------------- syscall exit for execve, pid=M
...
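
A tracer following the sequence above has to tell the stops apart from the
waitpid() status alone.  The following is a minimal sketch of that, assuming
the tracer has set PTRACE_O_TRACEEXEC and PTRACE_O_TRACESYSGOOD; the helper
names are made up for illustration and are not taken from strace.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Returns 1 if a waitpid() status is a PTRACE_EVENT_EXEC stop, else 0.
 * Assumes PTRACE_O_TRACEEXEC was set on the tracee. */
int is_exec_event(int status)
{
	return WIFSTOPPED(status) &&
	       (status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8));
}

/* Returns 1 if the status is a syscall stop (entry or exit), else 0.
 * Assumes PTRACE_O_TRACESYSGOOD, which marks such stops as SIGTRAP | 0x80. */
int is_syscall_stop(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* On an exec event, ask for the tid the thread had before execve
 * (the "I used to be tid N" step).  Returns -1 on error. */
long exec_former_tid(pid_t pid)
{
	unsigned long msg = 0;

	assert(pid > 0);
	if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == -1)
		return -1;
	return (long)msg;
}
```

The tracer would call is_exec_event() on each status it gets back, and only
then exec_former_tid(); the execve syscall-exit stop still arrives afterwards
as an ordinary syscall stop.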

-- 
vda

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:47                                                               ` Jamie Lokier
@ 2012-01-26 14:05                                                                 ` Denys Vlasenko
  2012-01-27  7:23                                                                 ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26 14:05 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Indan Zupancic, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Thu, Jan 26, 2012 at 12:47 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Indan Zupancic wrote:
> Denys, would you be interested in further refactoring strace to use a
> "libsystrace" sort of thing which abstracts the detail of archs,
> tracing (and maybe syscall argument layout) away from the printing and
> user-interface, for strace's use and other users?  I would be happy to
> help with that and keep strace's non-Linux support as well (if there's
> any way to test the latter...)  I seem to be going in the direction of
> a library like that anyway for another project.

It might make sense to do this.
Design of this library would depend on the needs
of those other projects. Where can I see their code?

-- 
vda

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26  3:47                                                         ` Linus Torvalds
@ 2012-01-26 18:03                                                           ` Denys Vlasenko
  2017-03-08 23:41                                                             ` Dmitry V. Levin
  0 siblings, 1 reply; 222+ messages in thread
From: Denys Vlasenko @ 2012-01-26 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Indan Zupancic, Oleg Nesterov, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Ro

Hi Linus,

On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>> Please look at strace source, get_scno() function, where
>> it reads syscall no and parameters. Let's see....
>> - POWERPC: has 32-bit and 64-bit mode
>> - X86_64: has 32-bit and 64-bit mode
>> - IA64: has i386-compat mode
>> - ARM: has more than one ABI
>> - SPARC: has 32-bit and 64-bit mode
>>
>> Do you want to re-invent a different arch-specific way to report
>> syscall type for each of these arches?
>
> I think an arch-specific one is better than trying to make some
> generic one that is messy.
>
> As you say, many architectures have multiple system call ABIs.
>
> But they tend to be very *different* issues. They can be about
> multiple ABI's, as you mention, and even when they *look* similar
> (32-bit vs 64-bit ABI's) they are actually totally different issues.
> [skip]

I don't have a particular attachment to my solution,
and I think we have already talked about this problem
for far too long.

Looks like nobody is _strongly_ opposed to your patch
which uses a few bits in eflags to report bitness
of the x86 syscall.

Let's just do that already. If you commit it to kernel git,
I will immediately change strace accordingly.

-- 
vda

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-25 23:32                                                   ` Denys Vlasenko
  2012-01-26  0:40                                                     ` Indan Zupancic
  2012-01-26  0:59                                                     ` Jamie Lokier
@ 2012-01-26 18:44                                                     ` Oleg Nesterov
  2012-02-10  2:51                                                       ` Jamie Lokier
  2 siblings, 1 reply; 222+ messages in thread
From: Oleg Nesterov @ 2012-01-26 18:44 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Indan Zupancic, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On 01/26, Denys Vlasenko wrote:
>
> On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> >
> > We can add the new events,
> >
> > 	PTRACE_EVENT_SYSCALL_ENTRY
> > 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> > 	PTRACE_EVENT_SYSCALL_EXIT
> > 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
>
> We can get away with just the first one.
> (1) It's unlikely people would want to get native sysentry events but not compat ones,
> thus first two options can be combined into one;

Confused... Sure, we need the single option, or we could even report
this unconditionally if PT_SEIZED.

I meant the different PTRACE_EVENT_* codes only.

> (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> (3) if we would flag syscall entry with an event value in wait status, then syscall
> exit will already be distinguishable.

Well, if we add _ENTRY then it looks more consistent to report _EXIT
as well even if it is not that useful.

Doesn't matter. Nobody seems to like this, and afaics Linus has
good arguments against the arch-independent "consolidation".

Oleg.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 11:47                                                               ` Jamie Lokier
  2012-01-26 14:05                                                                 ` Denys Vlasenko
@ 2012-01-27  7:23                                                                 ` Indan Zupancic
  2012-02-10  2:02                                                                   ` Jamie Lokier
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-01-27  7:23 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalc

On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
> Indan Zupancic wrote:
>> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
>> > Indan Zupancic wrote:
>> >> Yes, that's the only reason I'm interested in BPF, really.
>> >> Most system calls are either always allowed, or always denied.
>> >> Of the ones that need checking, most of them have file paths.
>> >> For those I'm not interested in the post-syscall event.
>> >
>> > Same here, though for tracing file paths rather than blocking anything.
>>
>> The jailer I wrote works pretty well as a simplistic strace replacement.
>> It can only print out the arguments we're checking, but that's usually
>> the more interesting info.
>
> In theory such a thing should be easy to write, but as we both found,
> ptrace() on Linux has a huge number of difficult quirks to deal with
> to trace reliably.  At least it's getting better with later kernels.

It's not that bad, there are a few quirks, but not that many.
The ptrace specific code is less than 500 lines of code, with
a couple of hundred lines of header files. Linux ptrace specific
stuff creeps in elsewhere too though, like that execve mess.

>> It's not a 32 versus 64-bit issue though, so it will be something on
>> its own anyway. Can as well add an extra ARM specific ptrace command
>> to get that info, or hack it in some other way. For instance, ip is
>> (ab)used to tell if it is syscall entry or exit, so doing these tricks
>> isn't anything new in ARM either.
>
> In theory, aren't we supposed to know whether it's entry/exit anyway?
> Why does strace care?  Have there been kernel bugs in the past?  Maybe
> it was just to deal with SIGTRAP-after-exit in the past, which could
> be delivered at an unpredictable time if blocked and then unblocked by
> sigreturn().

Maybe. I don't know why ARM does that ip thing.

In theory you know the entries/exits if you keep track, but one
mistake or unexpected behaviour (like execve for my code) and you can get
it wrong. So for robustness' sake it's good if it can be double-checked.

>> You can't avoid the arch-specific knowledge, because depending on the
>> answer, you have to do something arch specific. In ARM's OABI case, it's
>> reading program memory to find out the system call number, of all things.
>> (I hope I read the code wrong). So ARM's solution would need to get all
>> info it needs to handle the system call securely without reading any text
>> memory, otherwise it's racy.
>
> A few archs read program memory to get the syscall number even now, in
> the current strace source.  Look for PEEKTEXT: S390, ARM, SPARC use it
> on every syscall entry, and X86_64 has it commented out.

I did look for PEEKTEXT. For ARM it's to check if OABI is used (and
if it is, the syscall is in memory, otherwise it's in r7). Strace only
uses it on S390 to handle old style ABI, 2.6 is fine. On SPARC Strace
does it to figure out what personality is used. But that can only be
changed via personality(2) and not secretly at runtime, or so it seems,
so SPARC should be safe too. But I can't really figure out the kernel
SPARC code to be honest, so I may be wrong. It seems the trap instruction
differs between SPARC 32 and 64-bit, but on the other hand they both use
the same syscall table, so at least the syscall nr can't be confused.
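
To make the ARM case concrete, here is a rough sketch of the decoding step
once the instruction word at pc - 4 has been fetched with PTRACE_PEEKTEXT.
The constants reflect my understanding of the ARM-mode swi encoding (EABI uses
"swi 0" with the number in r7, OABI encodes 0x900000 plus the number in the
instruction itself); the function name is made up, and Thumb mode is not
handled.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_EABI_SWI   0xef000000u  /* "swi 0": EABI, syscall nr is in r7 */
#define ARM_OABI_MASK  0x0ff00000u  /* ignores the condition-code nibble */
#define ARM_OABI_MAGIC 0x0f900000u  /* swi carrying the 0x900000 OABI base */

/* Decode the syscall number from the ARM-mode instruction word that
 * trapped into the kernel.  Returns the number, or -1 if the word is
 * not a recognised syscall trap. */
long arm_decode_scno(uint32_t insn, uint32_t r7)
{
	if (insn == ARM_EABI_SWI)
		return (long)r7;              /* EABI: no text read needed */
	if ((insn & ARM_OABI_MASK) != ARM_OABI_MAGIC)
		return -1;                    /* not an OABI swi */
	return (long)(insn & 0x000fffffu);    /* OABI: nr lives in the text */
}
```

The OABI branch is the racy one: the number comes from text memory that
another thread can rewrite between the trap and the PEEKTEXT.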

> As we know, all of them are buggy if the memory is modified while
> reading it, and it's silly because the kernel knows the syscall
> number.

Only ARM OABI is really problematic in that regard, but that's not a
32 versus 64-bit issue.

I don't know anything about OABI, can you link an OABI program against
an EABI library? If you can then libc can be EABI and the kernel doesn't
need OABI support.

>> And then there's the whole confusion what that flag says, some might think
>> it says in what mode the tracee is instead of what mode the system call is.
>> That those two can be different is not obvious at all and seems very x86_64
>> specific.
>
> My rough read of PARISC entry code suggests it has two entry methods,
> similar to ARM and x86_64, but I'm not really familiar with PARISC and
> I don't have a machine handy to try it out :-)

It has a unified syscall table, so does it really matter?

>> I'm not sure what you're doing, but perhaps we should share code and write
>> a kind of Linux ptrace library. The code I wrote was university stuff and
>> we want to release it, but it will take a while to get things sorted out.
>> Hopefully it's released in April, maybe before.
>
> I've been thinking along similar lines.  The idea came up when I was
> hacking on strace last year and it so wanted to be cleaned up (but now
> strace is in good hands, my work on it is obsolete); now I'm doing
> ptracing for other purposes.  Denys' ptrace API document, currently in
> strace git, is extremely useful.
>
> Denys, would you be interested in further refactoring strace to use a
> "libsystrace" sort of thing which abstracts the detail of archs,
> tracing (and maybe syscall argument layout) away from the printing and
> user-interface, for strace's use and other users?  I would be happy to
> help with that and keep strace's non-Linux support as well (if there's
> any way to test the latter...)  I seem to be going in the direction of
> a library like that anyway for another project.

I actually recommend leaving strace as it is. I've seen the code,
it's full of arch and OS specific stuff scattered all over the
place. Considering it actually works now, why risk breaking anything?
Especially considering you can't test any changes on all supported
platforms. Just leave it be and slowly improve it bit by bit, for
the parts you can actually test.

The point of the library would be to make it easier to create new
software, possibly by using all the new features and dropping support
for too old kernels. Strace doesn't really benefit from that.

> The seccomp-BPF stuff could also benefit from a part dealing with
> syscall argument layout, as it too needs that arch-specific
> knowledge.

It seems I convinced them to use a cross-platform ABI, so you should
get the system call number and arguments directly.

> I have a script in progress which extracts all the
> per-arch and per-ABI syscall numbers, syscall argument layouts and
> kernel function names to keep track of arch-specific fixups, from a
> Linux source tree.  It currently works on all archs except it breaks
> on x86 which insists on being different ;-)

That's handy, but I thought strace had such a script already?
See HACKING-scripts in strace source. Or is yours much better?

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-18 20:26                                                 ` Linus Torvalds
  2012-01-18 20:55                                                   ` H. Peter Anvin
@ 2012-02-06  8:32                                                   ` Indan Zupancic
  2012-02-06 17:02                                                     ` H. Peter Anvin
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-06  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor

On Wed, January 18, 2012 21:26, Linus Torvalds wrote:
> Added Peter to the cc, since this is now about some x86-specific
> things. Ingo was already cc'd earlier.
>
> On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Using the high bits of 'eflags' might work. Hopefully nobody tests
>> that. IOW, something like the attached might work. It just sets bit#32
>> in eflags if the system call is a compat call.
>
> So that description was bogus, it was what my original patch did, but
> not the one I actually sent out (Peter - you can find it on lkml,
> although the description below is probably sufficient for you to
> understand what it does, or the obvious nature of the attached patch
> for strace).
>
> The one I sent out *unconditionally* sets one bit in the high bits of
> the returned value of the eflags register from ptrace(), very much on
> purpose. That way you can unambiguously see whether it's an old kernel
> (bits clear) or a new kernel that supports the feature. On a new
> kernel, bit #32 of eflags will be set for a native 64-bit system call,
> and bit #33 will be set for a compat system call.
>
> And some testing says that it works. In particular, I have a patch to
> strace-4.6 that is able to correctly decode my mixed-case binary that
> uses both the compat system call and the native system calls from
> 64-bit long mode. Also, it looks like gdb ignores the high bits of
> eflags, since it "knows" that eflags is just a 32-bit register even in
> 64-bit mode, so the fact that we set some random bits in there doesn't
> end up being noisy for at least one debugger.
>
> HOWEVER. I'm not going to guarantee that this is the right approach.
> It seems to work, and it clearly gives people real information, but
> whether this is the best way to do things or not is open.

It seems that just using eflags is a lot simpler than the alternatives,
let's just go for it.

>
> The reason I picked 'eflags' was that it
>
>  (a) was easy from an implementation standpoint, since we already have
> to handle reading of eflags specially in ptrace (we have to fake out
> the resume bit)
>
>  (b) it "kind of" makes sense to make high bits be "system flags",
> with low bits being "cpu flags", so it fits at least *some* kind of
> conceptual model.
>
>  (c) the other sane places to put it (high bits of CS and/or ORIG_AX)
> were being used and compared as 64-bit values at least by strace.
> Whether eflags works for all users, I have no idea, but generally you
> would never compare eflags for one particular value - you might check
> individual bits in eflags, but hopefully setting a few new bits should
> not be something that any legacy user would ever really notice.
>
> So there are reasons to think that my patch is sane, but...
>
> Here's the strace patch, so people can look. I didn't even test it on
> an old kernel, but the fallback case to the old behavior looks
> trivial.
>
> Comments?

I propose using bits somewhere in the middle of the upper half. If new
flags are ever added by Intel or AMD, they will use the lower bits. If
anyone else ever adds flags, they most likely add them to the top (VIA).
So the middle seems the safest spot as far as long-term maintenance goes.

The below version does that, but instead of setting one of the two bits,
it always sets bit 50 for newer kernels and sets bit 51 if it's a compat
system call. I find this version more readable and after compilation it's
also a couple of bytes smaller compared to Linus' original version.

Should we make sure that the top 32 bits are zero, in case any weird
hardware does set our bits?
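
For illustration, this is roughly how a tracer could decode the eflags value
returned by ptrace under this proposal.  It is only a sketch of the proposed
(not merged) bit layout: bit 50 meaning "kernel reports syscall mode" and bit
51 meaning "compat call".  The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PT_EFLAGS_ABI_VALID  (1ULL << 50)  /* hypothetical: kernel has the feature */
#define PT_EFLAGS_ABI_COMPAT (1ULL << 51)  /* hypothetical: compat system call */

enum syscall_abi { ABI_UNKNOWN, ABI_NATIVE64, ABI_COMPAT32 };

/* Classify the system call from the 64-bit eflags value read via ptrace. */
enum syscall_abi eflags_syscall_abi(uint64_t eflags)
{
	if (!(eflags & PT_EFLAGS_ABI_VALID))
		return ABI_UNKNOWN;           /* old kernel: fall back to guessing */
	return (eflags & PT_EFLAGS_ABI_COMPAT) ? ABI_COMPAT32 : ABI_NATIVE64;
}

/* The architectural flags live in the low 32 bits only. */
uint32_t eflags_cpu_bits(uint64_t eflags)
{
	return (uint32_t)eflags;
}
```

A tracer that gets ABI_UNKNOWN would keep its existing heuristics, so the
fallback for old kernels stays trivial.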

Greetings,

Indan

---

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..a7fda48 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -353,6 +353,7 @@ static int set_segment_reg(struct task_struct *task,

 static unsigned long get_flags(struct task_struct *task)
 {
+	int bit = 50;
 	unsigned long retval = task_pt_regs(task)->flags;

 	/*
@@ -360,8 +361,11 @@ static unsigned long get_flags(struct task_struct *task)
 	 */
 	if (test_tsk_thread_flag(task, TIF_FORCED_TF))
 		retval &= ~X86_EFLAGS_TF;
-
-	return retval;
+#ifdef CONFIG_IA32_EMULATION
+	if (task_thread_info(task)->status & TS_COMPAT)
+		retval |= (1ul << 51);
+#endif
+	return retval | (1ul << bit);
 }

 static int set_flags(struct task_struct *task, unsigned long value)



^ permalink raw reply related	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06  8:32                                                   ` Indan Zupancic
@ 2012-02-06 17:02                                                     ` H. Peter Anvin
  2012-02-07  1:52                                                       ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-06 17:02 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dla

On 02/06/2012 12:32 AM, Indan Zupancic wrote:
> 
> It seems that just using eflags is a lot simpler than the alternatives,
> let's just go for it.
> 
> 
> I propose using bits somewhere in the middle of the upper half. If new
> flags are ever added by Intel or AMD, they will use the lower bits. If
> anyone else ever adds flags, they most likely add them to the top (VIA).
> So the middle seems the safest spot as far as long-term maintenance goes.
> 
> The below version does that, but instead of setting one of the two bits,
> it always sets bit 50 for newer kernels and sets bit 51 if it's a compat
> system call. I find this version more readable and after compilation it's
> also a couple of bytes smaller compared to Linus' original version.
> 
> Should we make sure that the top 32 bits are zero, in case any weird
> hardware does set our bits?
> 

[Adding H.J. Lu, since he has run into some of these requirements before]

NAK in the extreme.

We have not heard back from the architecture people on this, and I will
NAK this unless that happens.

Furthermore, you're picking bits that do not work for 32 bits, EVEN
THOUGH WE HAVE A SIMILAR PROBLEM ON 32 BITS; I outlined it for you and
you chose to ignore it.

Finally, I think we actually are going to need a fair number of bits in
the end.  All of this points to using a new regset designed for
extension in the first place.

As far as I can tell, we need at least the following information:

- If the CPU is currently in 32- or 64-bit mode.
- If we are currently inside a system call, and if so if it was entered
  via:
	- SYSCALL64
	- INT 80
	- SYSCALL32
	- SYSENTER

  The reason we need this information is because for the various 32-bit
  entry points we do some very ugly swizzling of registers, which
  matters to a ptrace client which wants to modify system call
  arguments.
- If the process was started as a 64-bit process, i386 process or x32
  process.

This adds up to a minimum of six bits already (and at least two bits on
i386), and that's just a start.
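
One way to arrive at six bits, sketched below, is 1 bit for the CPU mode, 3
bits for "not in a syscall" plus the four entry methods, and 2 bits for the
three process types.  This is only an illustrative packing under those
assumptions; the enum names and the layout are invented here and do not
describe any existing regset.

```c
#include <assert.h>
#include <stdint.h>

/* 5 states, so 3 bits (hypothetical encoding) */
enum entry_method {
	ENTRY_NONE, ENTRY_SYSCALL64, ENTRY_INT80, ENTRY_SYSCALL32, ENTRY_SYSENTER
};

/* 3 states, so 2 bits (hypothetical encoding) */
enum proc_type { PROC_X86_64, PROC_I386, PROC_X32 };

/* Pack the state into the low six bits:
 * bit 0 = CPU in 64-bit mode, bits 1-3 = entry method, bits 4-5 = process type. */
uint8_t pack_syscall_state(int cpu_is_64bit, enum entry_method entry,
			   enum proc_type proc)
{
	assert(entry <= ENTRY_SYSENTER && proc <= PROC_X32);
	return (uint8_t)((cpu_is_64bit ? 1u : 0u) |
			 ((unsigned)entry << 1) |
			 ((unsigned)proc << 4));
}
```

Any further field pushes this past what can sensibly be hidden in spare
eflags bits, which is the argument for an extensible regset.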

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-20 22:40                                                                     ` Roland McGrath
  2012-01-20 22:41                                                                       ` H. Peter Anvin
  2012-01-24  8:19                                                                       ` Indan Zupancic
@ 2012-02-06 20:30                                                                       ` H. Peter Anvin
  2012-02-06 20:39                                                                         ` Roland McGrath
  2 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-06 20:30 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-se

On 01/20/2012 02:40 PM, Roland McGrath wrote:
> If you change the size of a regset, then the new full size will be the size
> of the core file notes.  Existing userland tools will not be expecting
> this, they expect a known exact size.  If you need to add new stuff, it
> really is easier all around to add a new regset flavor.  When adding a new
> one, you can make it variable-sized from the start so as to be extensible
> in the future.  We did this for NT_X86_XSTATE, for example.
> 
> Thanks,
> Roland

Hi Roland,

What is needed to make a regset variable-sized?  Just declaring that it
may change in size in the future, or does one need a length field at the
top (I would personally have expected that both notes and ptrace would
have out-of-band methods for getting the size?)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06 20:30                                                                       ` H. Peter Anvin
@ 2012-02-06 20:39                                                                         ` Roland McGrath
  2012-02-06 20:42                                                                           ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Roland McGrath @ 2012-02-06 20:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-se

On Mon, Feb 6, 2012 at 12:30 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> What is needed to make a regset variable-sized?  Just declaring that it
> may change in size in the future, or does one need a length field at the
> top (I would personally have expected that both notes and ptrace would
> have out-of-band methods for getting the size?)

ELF notes do have a size field, so core files are self-explanatory.  There
is no ptrace interface to directly interrogate the regset details (one
could be added).  But the PTRACE_GETREGSET interface is to accept an upper
bound and yield the actual size filled in (which might be less than the
regset's size if the user-supplied buffer was smaller).  So in practice, a
caller can just use a buffer that's sure to be large enough, and then look
at iov_len for the actual size delivered.  (And nobody has yet complained
about this for xstate, though that might just be that nobody is really
using it.)
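
As a sketch of that calling convention: pass a buffer that is surely large
enough, then read iov_len back for the size actually delivered.  The wrapper
name read_regset() is made up for this example; PTRACE_GETREGSET, struct iovec
and NT_PRSTATUS are the real interfaces.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Read the regset identified by an ELF note type (e.g. NT_PRSTATUS) from
 * a stopped tracee into buf, which holds up to cap bytes.
 * Returns the number of bytes the kernel filled in, or -1 on error. */
long read_regset(pid_t pid, int note_type, void *buf, size_t cap)
{
	struct iovec iov = { .iov_base = buf, .iov_len = cap };

	assert(buf != NULL);
	if (ptrace(PTRACE_GETREGSET, pid, (void *)(long)note_type, &iov) == -1)
		return -1;
	return (long)iov.iov_len;  /* the kernel updates iov_len to the size delivered */
}
```

A caller that only knows the first few fields of a variable-sized regset can
pass a buffer of just that size and ignore whatever the kernel has beyond it.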


Thanks,
Roland

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06 20:39                                                                         ` Roland McGrath
@ 2012-02-06 20:42                                                                           ` H. Peter Anvin
  0 siblings, 0 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-06 20:42 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Jamie Lokier, Andrew Lutomirski, Oleg Nesterov, Will Drewry,
	linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, mingo, akpm, khilman, borislav.petkov, amwang, ak,
	eric.dumazet, gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-se

On 02/06/2012 12:39 PM, Roland McGrath wrote:
> On Mon, Feb 6, 2012 at 12:30 PM, H. Peter Anvin<hpa@zytor.com>  wrote:
>> What is needed to make a regset variable-sized?  Just declaring that it
>> may change in size in the future, or does one need a length field at the
>> top (I would personally have expected that both notes and ptrace would
>> have out-of-band methods for getting the size?)
>
> ELF notes do have a size field, so core files are self-explanatory.  There
> is no ptrace interface to directly interrogate the regset details (one
> could be added).  But the PTRACE_GETREGSET interface is to accept an upper
> bound and yield the actual size filled in (which might be less than the
> regset's size if the user-supplied buffer was smaller).  So in practice, a
> caller can just use a buffer that's sure to be large enough, and then look
> at iov_len for the actual size delivered.  (And nobody has yet complained
> about this for xstate, though that might just be that nobody is really
> using it.)
>

That should be fine, since you'd just set it to the size of the fields 
that you know about, and if there are additional fields that you don't 
know about, you logically don't care about them.  If you want to dump 
the full set of data you'd just read until you get a short read... like 
if you were reading a regular file.

	-hpa

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-06 17:02                                                     ` H. Peter Anvin
@ 2012-02-07  1:52                                                       ` Indan Zupancic
  2012-02-09  0:19                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-07  1:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Mon, February 6, 2012 18:02, H. Peter Anvin wrote:
> On 02/06/2012 12:32 AM, Indan Zupancic wrote:
>>
>> It seems that just using eflags is a lot simpler than the alternatives,
>> let's just go for it.
>>
>>
>> I propose using bits somewhere in the middle of the upper half. If new
>> flags are ever added by Intel or AMD, they will use the lower bits. If
>> anyone else ever adds flags, they most likely add them to the top (VIA).
>> So the middle seems the safest spot as far as long-term maintenance goes.
>>
>> The below version does that, but instead of setting one of the two bits,
>> it always sets bit 50 for newer kernels and sets bit 51 if it's a compat
>> system call. I find this version more readable and after compilation it's
>> also a couple of bytes smaller compared to Linus' original version.
>>
>> Should we make sure that the top 32 bits are zero, in case any weird
>> hardware does set our bits?
>>
>
> [Adding H.J. Lu, since he has run into some of these requirements before]
>
> NAK in the extreme.
>
> We have not heard back from the architecture people on this, and I will
> NAK this unless that happens.
>
> Furthermore, you're picking bits that do not work for 32 bits, EVEN
> THOUGH WE HAVE A SIMILAR PROBLEM ON 32 BITS; I outlined it for you and
> you chose to ignore it.

Sorry, I missed that. I looked up that email and you indeed did, though
you didn't give any details about what the problems are.

> Finally, I think we actually are going to need a fair number of bits in
> the end.  All of this points to using a new regset designed for
> extension in the first place.
>
> As far as I can tell, we need at least the following information:
>
> - If the CPU is currently in 32- or 64-bit mode.

What is the best way to find that out at the kernel side? Add a function
that checks cs and returns the correct answer? But in the kernel path the
CPU is always in 64-bit mode, so I suppose you want to know what mode the
tracee was in?

> - If we are currently inside a system call, and if so if it was entered
>   via:
> 	- SYSCALL64
> 	- INT 80
> 	- SYSCALL32
> 	- SYSENTER
>
>   The reason we need this information is because for the various 32-bit
>   entry points we do some very ugly swizzling of registers, which
>   matters to a ptrace client which wants to modify system call
>   arguments.

But isn't the swizzling done in such way that all this is hidden from
ptrace clients (and the rest of the kernel)? Why would a ptrace client
need to know the details of the 32-bit entry call?

The ptrace client can always modify the same registers, as system calls
always use the same registers too. No unexpected behaviour happens as
far as I can tell from looking at the code, at least not in the syscall
entry path.

E.g. ENTRY(ia32_cstar_target) in ia32entry.S does:

	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

To hide that for SYSCALL32 arg2 comes in ebp instead of rcx. Same for arg6.

(I actually can't find a SYSCALL32 entry in entry_32.S, am I blind or
was it too slow until the 64-bit Athlons showed up?)

A pure 32-bit kernel is compiled with:

#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

So all arguments are passed on the stack and those arguments can be
directly modified by ptrace. For compat kernels the arguments are
reloaded after ptrace and before the actual system call is done.

> - If the process was started as a 64-bit process, i386 process or x32
>   process.

Can't that be figured out by looking at the AUXV data? Either via /proc
or PTRACE_GETREGSET + NT_AUXV. And as this can't change, there is no
need to pass it on all the time.

> This adds up to a minimum of six bits already (and at least two bits on
> i386), and that's just a start.

I'm not convinced that there is any real problem, it seems only one extra
bit for the task CPU mode would be needed, so three bits in total.
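
For readers following the bit layout proposed at the top of this message (bit 50 always set by newer kernels, bit 51 set in addition for a compat system call), a tracer could decode it roughly as sketched below. This is only an illustration of the proposal as worded in this thread, not an existing kernel interface; the macro, enum and function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the proposal in this thread; names are hypothetical. */
#define EFLAGS_SYSCALL_INFO_BIT   50 /* always set by kernels carrying the patch */
#define EFLAGS_SYSCALL_COMPAT_BIT 51 /* additionally set for compat system calls */

enum syscall_mode {
	SYSCALL_MODE_UNKNOWN, /* bit 50 clear: kernel does not report the mode */
	SYSCALL_MODE_NATIVE,  /* 64-bit system call path */
	SYSCALL_MODE_COMPAT,  /* 32-bit (compat) system call path */
};

/* Decode the two proposed bits from the 64-bit flags value seen by the tracer. */
enum syscall_mode decode_syscall_mode(uint64_t rflags)
{
	if (!(rflags & (UINT64_C(1) << EFLAGS_SYSCALL_INFO_BIT)))
		return SYSCALL_MODE_UNKNOWN;
	if (rflags & (UINT64_C(1) << EFLAGS_SYSCALL_COMPAT_BIT))
		return SYSCALL_MODE_COMPAT;
	return SYSCALL_MODE_NATIVE;
}
```

Having a separate "info present" bit is what lets a tracer tell an old kernel apart from a native system call on a new one.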

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]
  2012-01-18 17:12                                             ` Oleg Nesterov
  2012-01-18 21:09                                               ` Chris Evans
@ 2012-02-07 11:45                                               ` Indan Zupancic
  1 sibling, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-02-07 11:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Evans, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	avi, penberg, viro, mingo, akpm, khilman, borislav.petkov,
	amwang, ak, eric.dumazet, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	Roland McGrath

On Wed, January 18, 2012 18:12, Oleg Nesterov wrote:
> On 01/18, Oleg Nesterov wrote:
>>
>> On 01/17, Chris Evans wrote:
>> >
>> > 1) Tracee is compromised; executes fork() which is syscall that isn't allowed
>> > 2) Tracee traps
>> > 2b) Tracee could take a SIGKILL here
>> > 3) Tracer looks at registers; bad syscall
>> > 3b) Or tracee could take a SIGKILL here
>> > 4) The only way to stop the bad syscall from executing is to rewrite
>> > orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
>> > syscall has finished)
>> > 5) Disaster: the tracee took a SIGKILL so any attempt to address it by
>> > pid (such as PTRACE_SETREGS) fails.
>> > 6) Syscall fork() executes; possible unsupervised process now running
>> > since the tracer wasn't expecting the fork() to be allowed.
>>
>> As for fork() in particular, it can't succeed after SIGKILL.
>>
>> But I agree, probably it makes sense to change ptrace_stop() to check
>> fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
>> in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
>>
>> 	-	return 0;
>> 	+	return !fatal_signal_pending();
>>
>> (no, I do not literally mean the change above)
>>
> >> Not only for security. The current behaviour sometimes confuses the
>> users. Debugger sends SIGKILL to the tracee and assumes it should
>> die asap, but the tracee exits only after syscall.
>
> Something like the patch below.
>
> Oleg.
>
> --- x/include/linux/tracehook.h
> +++ x/include/linux/tracehook.h
> @@ -54,12 +54,12 @@ struct linux_binprm;
>  /*
>   * ptrace report for syscall entry and exit looks identical.
>   */
> -static inline void ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs)
>  {
>  	int ptrace = current->ptrace;
>
>  	if (!(ptrace & PT_PTRACED))
> -		return;
> +		return 0;
>
>  	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
>  		send_sig(current->exit_code, current, 1);
>  		current->exit_code = 0;
>  	}
> +
> +	return fatal_signal_pending(current);
>  }
>
>  /**
> @@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
>  static inline __must_check int tracehook_report_syscall_entry(
>  	struct pt_regs *regs)
>  {
> -	ptrace_report_syscall(regs);
> -	return 0;
> +	return ptrace_report_syscall(regs);
>  }
>

Tested-by: Indan Zupancic <indan@nul.nu>

Tested on 32-bit x86. It behaves as expected, please apply.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-07  1:52                                                       ` Indan Zupancic
@ 2012-02-09  0:19                                                         ` H. Peter Anvin
  2012-02-09  4:20                                                           ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-09  0:19 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dla

On 02/06/2012 05:52 PM, Indan Zupancic wrote:
>>
>> - If the CPU is currently in 32- or 64-bit mode.
> 
> What is the best way to find that out at the kernel side? Add a function
> that checks cs and returns the correct answer? But in the kernel path the
> CPU is always in 64-bit mode, so I suppose you want to know what mode the
> tracee was in?
> 

You need to look at the CS descriptor.

>> - If we are currently inside a system call, and if so if it was entered
>>   via:
>> 	- SYSCALL64
>> 	- INT 80
>> 	- SYSCALL32
>> 	- SYSENTER
>>
>>   The reason we need this information is because for the various 32-bit
>>   entry points we do some very ugly swizzling of registers, which
>>   matters to a ptrace client which wants to modify system call
>>   arguments.
> 
> But isn't the swizzling done in such way that all this is hidden from
> ptrace clients (and the rest of the kernel)? Why would a ptrace client
> need to know the details of the 32-bit entry call?
>  
> The ptrace client can always modify the same registers, as system calls
> always use the same registers too. No unexpected behaviour happens as
> far as I can tell from looking at the code, at least not in the syscall
> entry path.

The simple stuff works, but once you want to do things like change the
arguments and/or move the execution point, things get unswizzled in
uncontrolled ways.  There are bug reports related to that (I would have
to dig them up) and they aren't really fixable in any sane way right now.

> A pure 32-bit kernel is compiled with:
> 
> #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

... which we'd like to get rid of ...

> So all arguments are passed on the stack and those arguments can be
> directly modified by ptrace. For compat kernels the arguments are
> reloaded after ptrace and before the actual system call is done.

>> - If the process was started as a 64-bit process, i386 process or x32
>>   process.
> 
> Can't that be figured out by looking at the AUXV data? Either via /proc
> or PTRACE_GETREGSET + NT_AUXV. And as this can't change, there is no
> need to pass it on all the time.

I'll look at the auxv stuff.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  0:19                                                         ` H. Peter Anvin
@ 2012-02-09  4:20                                                           ` Indan Zupancic
  2012-02-09  4:29                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-09  4:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Thu, February 9, 2012 01:19, H. Peter Anvin wrote:
> On 02/06/2012 05:52 PM, Indan Zupancic wrote:
>>>
>>> - If the CPU is currently in 32- or 64-bit mode.
>>
>> What is the best way to find that out at the kernel side? Add a function
>> that checks cs and returns the correct answer? But in the kernel path the
>> CPU is always in 64-bit mode, so I suppose you want to know what mode the
>> tracee was in?
>>
>
> You need to look at the CS descriptor.

CS is already available to user space, but any other value than 0x23 or 0x33
will confuse user space, as that is all they know about. Apparently Xen uses
different values, but if those are static then user space can check for them
separately. But if the values change dynamically then some other way may be
needed.
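
To make the above concrete, this is roughly the check user space does today, as a sketch that assumes only the two selector values mentioned here (0x23 and 0x33). The enum and function names are made up for illustration; any other value, such as whatever a paravirtualised setup uses, falls through to "unknown".

```c
#include <assert.h>
#include <stdint.h>

enum user_cs_mode {
	USER_CS_UNKNOWN, /* not one of the two values user space knows about */
	USER_CS_32BIT,   /* 0x23 */
	USER_CS_64BIT,   /* 0x33 */
};

/* Classify the cs value a tracer reads from the tracee's registers. */
enum user_cs_mode classify_user_cs(uint64_t cs)
{
	switch (cs) {
	case 0x23:
		return USER_CS_32BIT;
	case 0x33:
		return USER_CS_64BIT;
	default:
		return USER_CS_UNKNOWN;
	}
}
```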

But does it make much sense to pass the CPU mode of user space if that mode
can be changed at any moment? I don't think it really does. Can you give an
example of how that info can be used by a ptracer?

>
>>> - If we are currently inside a system call, and if so if it was entered
>>>   via:
>>> 	- SYSCALL64
>>> 	- INT 80
>>> 	- SYSCALL32
>>> 	- SYSENTER
>>>
>>>   The reason we need this information is because for the various 32-bit
>>>   entry points we do some very ugly swizzling of registers, which
>>>   matters to a ptrace client which wants to modify system call
>>>   arguments.
>>
>> But isn't the swizzling done in such way that all this is hidden from
>> ptrace clients (and the rest of the kernel)? Why would a ptrace client
>> need to know the details of the 32-bit entry call?
>>
>> The ptrace client can always modify the same registers, as system calls
>> always use the same registers too. No unexpected behaviour happens as
>> far as I can tell from looking at the code, at least not in the syscall
>> entry path.
>
> The simple stuff works, but once you want to do things like change the
> arguments and/or move the execution point, things get unswizzled in
> uncontrolled ways.

I do both and haven't encountered any problems.

I can't find any unswizzling happening in the return path though. So
from a ptracer's point of view it all looks the same after a system
call, no matter how it was entered. Except for IP perhaps, but that's
handled in the vDSO.

> There are bug reports related to that (I would have
> to dig them up) and they aren't really fixable in any sane way right now.

I don't see any problems in the code.

Only confusion I can think of is someone following the register values
across a system call instruction. Then the swizzling may be unexpected.
But if they do that they could check how the syscall was entered and
compensate for that. (I can't think of any requirement why this would
need to be race-free.)

>> A pure 32-bit kernel is compiled with:
>>
>> #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
>
> ... which we'd like to get rid of ...

If you do get rid of it, then you have to reload the registers after
ptrace, just like currently happens on x86_64 kernels. So regparm(0)
isn't a requirement, I only explained why reloading the registers
isn't needed for pure 32-bit.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  4:20                                                           ` Indan Zupancic
@ 2012-02-09  4:29                                                             ` H. Peter Anvin
  2012-02-09  6:03                                                               ` Indan Zupancic
  2012-02-09 16:00                                                               ` H.J. Lu
  0 siblings, 2 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-09  4:29 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dla

On 02/08/2012 08:20 PM, Indan Zupancic wrote:
> 
> CS is already available to user space, but any other value than 0x23 or 0x33
> will confuse user space, as that is all they know about. Apparently Xen uses
> different values, but if those are static then user space can check for them
> separately. But if the values change dynamically then some other way may be
> needed.
> 
> But does it make much sense to pass the CPU mode of user space if that mode
> can be changed at any moment? I don't think it really does. Can you give an
> example of how that info can be used by a ptracer?
> 

Uh... you could make THAT argument about ANY register state!

I believe H.J. can fill you in about the usage.

> 
> Only confusion I can think of is someone following the register values
> across a system call instruction. Then the swizzling may be unexpected.
> But if they do that they could check how the syscall was entered and
> compensate for that. (I can't think of any requirement why this would
> need to be race-free.)
> 

You'd have to know how you'd entered, which right now you don't have any
way to know.

	-hpa


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  4:29                                                             ` H. Peter Anvin
@ 2012-02-09  6:03                                                               ` Indan Zupancic
  2012-02-09 14:47                                                                 ` H. Peter Anvin
  2012-02-09 16:00                                                               ` H.J. Lu
  1 sibling, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-09  6:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor

On Thu, February 9, 2012 05:29, H. Peter Anvin wrote:
> On 02/08/2012 08:20 PM, Indan Zupancic wrote:
>>
>> CS is already available to user space, but any other value than 0x23 or 0x33
>> will confuse user space, as that is all they know about. Apparently Xen uses
>> different values, but if those are static then user space can check for them
>> separately. But if the values change dynamically then some other way may be
>> needed.
>>
>> But does it make much sense to pass the CPU mode of user space if that mode
>> can be changed at any moment? I don't think it really does. Can you give an
>> example of how that info can be used by a ptracer?
>>
>
> Uh... you could make THAT argument about ANY register state!

Well, when the tracee is in a system call, it can't change registers,
and their values determine the system call number and arguments. That
information is stable for the current system call. And as a ptracer
can't determine if the 32 or 64-bit syscall entry path was taken in
a race-free way, it makes sense to provide that extra info.

But the same is not true for the user space CPU mode, that can change
at any time without the tracer getting a notification, except if it is
single stepping (which I forgot about).

Would it be useful to know the CPU mode when single stepping or otherwise?

I'm asking because I don't see a need for it, but if someone else does
it's better to add it now together with the syscall mode bit. Unlike the
system call mode, the CPU mode can be checked via CS. The question is
if that works well enough or if the values are dynamic enough that it's
better to pass the info explicitly instead.

Unlike the syscall mode info, figuring out the mode from CS isn't trivial
when it can change dynamically. Then all places that use non-standard CS
values need to be changed to provide the mode somehow.

> I believe H.J. can fill you in about the usage.

That would be great.

>>
>> Only confusion I can think of is someone following the register values
>> across a system call instruction. Then the swizzling may be unexpected.
>> But if they do that they could check how the syscall was entered and
>> compensate for that. (I can't think of any requirement why this would
>> need to be race-free.)
>>
>
> You'd have to know how you'd entered, which right now you don't have any
> way to know.

You can check the syscall instruction itself, either before it's executed
or afterwards by checking the IP. Though that's trickier, because the
kernel points the IP to just after int80 for a sysenter call, so you have
to check if there's a sysenter nearby too.
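
A sketch of the "check the instruction" idea, assuming the tracer has already read the two bytes just before the reported IP from the tracee (for example with PTRACE_PEEKTEXT). The enum and function names are made up; the opcodes are the standard x86 encodings (int $0x80 is cd 80, syscall is 0f 05, sysenter is 0f 34). Because of the IP adjustment mentioned above, a sysenter entry would show up here as int 0x80, so this alone is not sufficient.

```c
#include <assert.h>
#include <stdint.h>

enum entry_insn {
	ENTRY_UNKNOWN,
	ENTRY_INT80,    /* cd 80 */
	ENTRY_SYSCALL,  /* 0f 05 */
	ENTRY_SYSENTER, /* 0f 34 */
};

/*
 * Classify the two bytes immediately preceding the reported IP:
 * b0 is the byte at IP - 2, b1 the byte at IP - 1.
 */
enum entry_insn classify_entry_insn(uint8_t b0, uint8_t b1)
{
	if (b0 == 0xcd && b1 == 0x80)
		return ENTRY_INT80;
	if (b0 == 0x0f && b1 == 0x05)
		return ENTRY_SYSCALL;
	if (b0 == 0x0f && b1 == 0x34)
		return ENTRY_SYSENTER;
	return ENTRY_UNKNOWN;
}
```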

You can also figure out what the entry instruction was by comparing the
register values with the expected ones and deducing it that way.

But the kernel is actually changing the registers, so why hide that?

I mean, once user space is aware that the kernel may do swizzling, is there
any actual problem left? Because this sounds like user space was trying to
be clever, but got it wrong. E.g. it knew the kernel was entered not via
int80, but then got confused because of the swizzling.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  6:03                                                               ` Indan Zupancic
@ 2012-02-09 14:47                                                                 ` H. Peter Anvin
  0 siblings, 0 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-09 14:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Linus Torvalds, Andi Kleen, Jamie Lokier, Andrew Lutomirski,
	Oleg Nesterov, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, R

On 02/08/2012 10:03 PM, Indan Zupancic wrote:
>
> You can check the syscall instruction itself, either before it's executed
> or afterwards by checking the IP. Though that's trickier, because the
> kernel points the IP to just after int80 for a sysenter call, so you have
> to check if there's a sysenter nearby too.
>

No, that's a total nightmare.  FAIL.

> But the kernel is actually changing the registers, so why hide that?
>
> I mean, once user space is aware that the kernel may do swizzling, is there
> any actual problem left? Because this sounds like user space was trying to
> be clever, but got it wrong. E.g. it knew the kernel was entered not via
> int80, but then got confused because of the swizzling.

It would be great if we didn't have an existing compatibility problem.
As it is we can't get rid of it easily.

	-hpa



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09  4:29                                                             ` H. Peter Anvin
  2012-02-09  6:03                                                               ` Indan Zupancic
@ 2012-02-09 16:00                                                               ` H.J. Lu
  2012-02-10  1:09                                                                 ` Indan Zupancic
  1 sibling, 1 reply; 222+ messages in thread
From: H.J. Lu @ 2012-02-09 16:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Indan Zupancic, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On Wed, Feb 8, 2012 at 8:29 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 02/08/2012 08:20 PM, Indan Zupancic wrote:
>>
>> CS is already available to user space, but any other value than 0x23 or 0x33
>> will confuse user space, as that is all they know about. Apparently Xen uses
>> different values, but if those are static then user space can check for them
>> separately. But if the values change dynamically then some other way may be
>> needed.
>>
>> But does it make much sense to pass the CPU mode of user space if that mode
>> can be changed at any moment? I don't think it really does. Can you give an
>> example of how that info can be used by a ptracer?
>>
>
> Uh... you could make THAT argument about ANY register state!
>
> I believe H.J. can fill you in about the usage.
>

GDB uses CS value to tell ia32 process from x86-64 process.
At minimum, we need a bit in CS for GDB.  But any changes
will break old GDB.

H.J.


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-09 16:00                                                               ` H.J. Lu
@ 2012-02-10  1:09                                                                 ` Indan Zupancic
  2012-02-10  1:15                                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-10  1:09 UTC (permalink / raw)
  To: H.J. Lu
  Cc: H. Peter Anvin, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module

On Thu, February 9, 2012 17:00, H.J. Lu wrote:
> GDB uses CS value to tell ia32 process from x86-64 process.

Are there any cases when this doesn't work? Someone said Xen can
have different CS values, but looking at the source it seems it's
using the same ones, at least with a Linux hypervisor. So perhaps
it was KVM. Looking at the header it seems paravirtualisation uses
different cs values. On the upside, it seems we can just use that
user_64bit_mode() to know whether it is 32 or 64 bit mode, so
adding a bit telling the process mode is easier than I thought.

Currently there is a need to tell if the 32 or 64-bit syscall
path is being taken, which is independent of the process mode.

> At minimum, we need a bit in CS for GDB.  But any changes
> will break old GDB.

Would adding bits to the upper 32-bit of rflags break GDB?

Do you also need a way to know whether the kernel was entered via
int 0x80, SYSCALL32/64 or SYSENTER?

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  1:09                                                                 ` Indan Zupancic
@ 2012-02-10  1:15                                                                   ` H. Peter Anvin
  2012-02-10  2:29                                                                     ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-10  1:15 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On 02/09/2012 05:09 PM, Indan Zupancic wrote:
> On Thu, February 9, 2012 17:00, H.J. Lu wrote:
>> GDB uses CS value to tell ia32 process from x86-64 process.
> 
> Are there any cases when this doesn't work? Someone said Xen can
> have different CS values, but looking at the source it seems it's
> using the same ones, at least with a Linux hypervisor. So perhaps
> it was KVM. Looking at the header it seems paravirtualisation uses
> different cs values. On the upside, it seems we can just use that
> user_64bit_mode() to know whether it is 32 or 64 bit mode, so
> adding a bit telling the process mode is easier than I thought.
> 
> Currently there is a need to tell if the 32 or 64-bit syscall
> path is being taken, which is independent of the process mode.
> 

There are definitely cases where the current reliance on magic CS values
doesn't work; never mind the fact that it's just broken.

>> At minimum, we need a bit in CS for GDB.  But any changes
>> will break old GDB.
> 
> Would adding bits to the upper 32-bit of rflags break GDB?

It doesn't work for i386, never mind that this is reserved hardware
state and we don't have an OK at this time to redeclare them available.

> Do you also need a way to know whether the kernel was entered via
> int 0x80, SYSCALL32/64 or SYSENTER?

gdb, probably not.  That came from another user (pin, I think, but I'm
not sure.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-27  7:23                                                                 ` Indan Zupancic
@ 2012-02-10  2:02                                                                   ` Jamie Lokier
  2012-02-10  3:37                                                                     ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: Jamie Lokier @ 2012-02-10  2:02 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

Indan Zupancic wrote:
> On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
> > Indan Zupancic wrote:
> >> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> >> > Indan Zupancic wrote:
> >> The jailer I wrote works pretty well as a simplistic strace replacement.
> >> It can only print out the arguments we're checking, but that's usually
> >> the more interesting info.
> >
> > In theory such a thing should be easy to write, but as we both found,
> > ptrace() on Linux has a huge number of difficult quirks to deal with
> > to trace reliably.  At least it's getting better with later kernels.
> 
> It's not that bad, there are a few quirks, but not that many.
> The ptrace specific code is less than 500 lines of code, with
> a couple of hundred lines of header files. Linux ptrace specific
> stuff creeps in elsewhere too though, like that execve mess.

I count 720 lines *just* to read the syscall number and arguments in
strace-git, for the Linux archs it supports.

That's only the Linux code, I excluded non-Linux, and it's only a
little bit of syscall.c, I didn't include generic ptracing,
fork-following, threaded-exec-fixups, signal handling etc. nor other
arch-specific functions and ABI fixups.  And it doesn't even have all
archs currently in Linux mainline.

> >> It's not a 32 versus 64-bit issue though, so it will be something on
> >> its own anyway. Can as well add an extra ARM specific ptrace command
> >> to get that info, or hack it in some other way. For instance, ip is
> >> (ab)used to tell if it is syscall entry or exit, so doing these tricks
> >> isn't anything new in ARM either.
> >
> > In theory, aren't we supposed to know whether it's entry/exit anyway?
> > Why does strace care?  Have there been kernel bugs in the past?  Maybe
> > it was just to deal with SIGTRAP-after-exit in the past, which could
> > be delivered at an unpredictable time if blocked and then unblocked by
> > sigreturn().
> 
> Maybe. I don't know why ARM does that ip thing.
> 
> Although in theory you know the entry/exits if you keep track, but one
> mistake or unexpected behaviour (like execve for my code) and you can get
> it wrong. So for robustness sake it's good if it can be double checked.

I agree, and I think the PTRACE_EVENT_SYSCALL_ENTRY/EXIT events would
be a clean way to represent that.

I wonder if all archs report syscall-exit as the first event in traced
fork children.  Looking at arch/hexagon I'm guessing it doesn't, but
it's hard to be sure and no practical way to test it :-/

That wouldn't matter if the events were robust.

I read somewhere about a bug report where syscall-exit was seen after
attach, but I don't remember where now.

> I don't know anything about OABI, can you link an OABI program against
> an EABI library? If you can then libc can be EABI and the kernel doesn't
> need OABI support.

That's not the point.  If you're writing a ptrace jailer (as you are)
a program can deliberately use OABI calls to subvert the tracer, even
if it's using EABI for normal calls.

For linking, you are mostly right.  Ideally everything would be open
and recompilable anyway, but that's sadly not always possible.  OABI
and EABI have different struct layouts among other changes, and EABI
being newer tends to accompany other libc changes; embedded libcs
aren't always as drop-in backward-compatible as glibc.

> >> And then there's the whole confusion what that flag says, some might think
> >> it says in what mode the tracee is instead of what mode the system call is.
> >> That those two can be different is not obvious at all and seems very x86_64
> >> specific.
> >
> > My rough read of PARISC entry code suggests it has two entry methods,
> > similar to ARM and x86_64, but I'm not really familiar with PARISC and
> > I don't have a machine handy to try it out :-)
> 
> It has a unified syscall table, so does it really matter?

I don't know if the 32/64 matters.  For security or accurate tracing,
I wouldn't like to assume without checking if there are 64-on-32
argument alignment fixups.

PARISC has a second set of HPUX-compatible system call numbers,
handled in arch/parisc/hpux/*.  I don't know if those are available to
all programs and can be used to subvert a ptracer.  Looking at
hpux/gate.S I think they bypass ptrace entirely; maybe they can subvert it.

> > I have a script in progress which extracts all the
> > per-arch and per-ABI syscall numbers, syscall argument layouts and
> > kernel function names to keep track of arch-specific fixups, from a
> > Linux source tree.  It currently works on all archs except it breaks
> > on x86 which insists on being different ;-)
> 
> That's handy, but I thought strace had such a script already?
> See HACKING-scripts in strace source. Or is yours much better?

The strace script only gets the syscall numbers (so doesn't help
cross-check I've applied all arch-specific syscall fixups), doesn't
work for all arch/ABI combinations without editing unistd.h, and
requires a configured and partly built kernel for some archs.  It's
only really useful for getting new syscall numbers which you then
hand-edit into the real table.  You still have to set the number of
arguments and check carefully you haven't missed any arch-specific
fixups.

All the best,
-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  1:15                                                                   ` H. Peter Anvin
@ 2012-02-10  2:29                                                                     ` Indan Zupancic
  2012-02-10  2:47                                                                       ` H. Peter Anvin
       [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
  0 siblings, 2 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-02-10  2:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module

On Fri, February 10, 2012 02:15, H. Peter Anvin wrote:
> On 02/09/2012 05:09 PM, Indan Zupancic wrote:
>> On Thu, February 9, 2012 17:00, H.J. Lu wrote:
>>> GDB uses CS value to tell ia32 process from x86-64 process.
>>
>> Are there any cases when this doesn't work? Someone said Xen can
>> have different CS values, but looking at the source it seems it's
>> using the same ones, at least with a Linux hypervisor. So perhaps
>> it was KVM. Looking at the header it seems paravirtualisation uses
>> different cs values. On the upside, it seems we can just use that
>> user_64bit_mode() to know whether it is 32 or 64 bit mode, so
>> adding a bit telling the process mode is easier than I thought.
>>
>> Currently there is a need to tell if the 32 or 64-bit syscall
>> path is being taken, which is independent of the process mode.
>>
>
> There are definitely cases where the current reliance on magic CS values
> doesn't work; never mind the fact that it's just broken.

It's only broken because it doesn't work sometimes. ;-)
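
For reference, the CS heuristic being discussed boils down to something
like the sketch below. The two selector values are what I believe a native
x86_64 Linux kernel uses for user code segments (__USER_CS and
__USER32_CS); the enum and function names are made up for illustration.
Anything else has to be treated as "don't know", which is exactly the
paravirt problem mentioned above.

```c
#include <stdint.h>

/* Assumed user code-segment selectors on a native x86_64 Linux kernel. */
#define NATIVE_USER_CS64 0x33u
#define NATIVE_USER_CS32 0x23u

/* Hypothetical result type for this sketch. */
enum cs_guess { CS_GUESS_64BIT, CS_GUESS_32BIT, CS_GUESS_UNKNOWN };

/* Guess the tracee's mode from the CS value read via ptrace.
 * Only the low 16 bits form the selector. */
enum cs_guess guess_mode_from_cs(uint64_t cs)
{
    switch (cs & 0xffff) {
    case NATIVE_USER_CS64:
        return CS_GUESS_64BIT;
    case NATIVE_USER_CS32:
        return CS_GUESS_32BIT;
    default:
        /* Other selectors (e.g. under paravirtualisation) are not covered. */
        return CS_GUESS_UNKNOWN;
    }
}
```

Note that even when the guess is right it only describes the task mode,
not which syscall path was taken.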

>>> At minimum, we need a bit in CS for GDB.  But any changes
>>> will break old GDB.
>>
>> Would adding bits to the upper 32-bit of rflags break GDB?
>
> It doesn't work for i386, never mind that this is reserved hardware
> state and we don't have an OK at this time to redeclare them available.

It doesn't need to work for i386 because it's close to practically
impossible to ptrace a 64-bit task with a 32-bit ptracer.

An alternative would be to use some of the bits in the lower half.

E.g. bits 1, 3, 5 and 15 are reserved and very unlikely to be ever
used for anything, because they can use plenty of bits at the top.
Problem would be that we can't be sure that they are always zero.
If they are, they're safe to use.

The VIF and VIP flags can also be stolen as they're always zero
outside of vm86 mode (which can't be ptraced AFAIK). So we could
set VIF or VIP to tell if we stole bits 1, 3, 5 and/or 15. That
would give us 6 bits in total, and the only confusing thing might
be VIF or VIP set for user space. But anyone counting on those
being zero seems unlikely, and even more unlikely for the reserved
bits, as they are intermixed with unpredictable bits. We could use
VM too, but that might be too confusing, while VIF or VIP without
VM set make no sense.

Perhaps using VIF or VIP to tell whether the other bits are valid
is a good idea anyway, as it can never clash because they are well
defined already and always zero for non-VM mode.

With the current rate of adding flags it will take forever before
any of this might break. And if that happens, we just move to other
bits and user space needs to check those first. Or if the flags
aren't useful for userspace, hide them and keep using it for the
kernel.

>> Do you also need a way to know whether the kernel was entered via
>> int 0x80, SYSCALL32/64 or SYSENTER?
>
> gdb, probably not.  That came from another user (pin, I think, but I'm
> not sure.)

Could you find out? Because I have a hard time thinking of any good
reason why anyone would want to know this specifically.

If this info is added it can replace the bit saying if it's 32 or 64-bit
syscall path. So one bit for enabling all this, 2 bits for the syscall
entry instruction (with SYSCALL64 being 0 as an easy check for the 64-bit
path) and one bit for user space mode. This would end up being 4 bits in
total, except if I forgot anything.
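
To make the 4-bit idea concrete, here is a rough sketch of how such a
field could be packed and checked. Only "SYSCALL64 is 0" comes from the
description above; the other enum values, the bit positions inside the
field and all names are assumptions for illustration, and it says nothing
about where the field would actually live.

```c
#include <stdbool.h>

/* Entry instruction, 2 bits. SYSCALL64 == 0 as described; the rest are
 * placeholder values. */
enum entry_insn {
    ENTRY_SYSCALL64 = 0,
    ENTRY_SYSCALL32 = 1,
    ENTRY_SYSENTER  = 2,
    ENTRY_INT80     = 3
};

#define INFO_VALID        (1u << 0)  /* the other bits are meaningful */
#define INFO_ENTRY_SHIFT  1
#define INFO_ENTRY_MASK   (3u << INFO_ENTRY_SHIFT)
#define INFO_USER_64BIT   (1u << 3)  /* user space mode */

/* Build the 4-bit field (what the kernel side would do). */
unsigned pack_info(enum entry_insn insn, bool user_64bit)
{
    unsigned info = INFO_VALID | ((unsigned)insn << INFO_ENTRY_SHIFT);
    if (user_64bit)
        info |= INFO_USER_64BIT;
    return info;
}

/* Extract the entry instruction from the field. */
enum entry_insn info_entry(unsigned info)
{
    return (enum entry_insn)((info & INFO_ENTRY_MASK) >> INFO_ENTRY_SHIFT);
}

/* The "easy check": valid and entry bits zero means the 64-bit path. */
bool info_is_64bit_path(unsigned info)
{
    return (info & INFO_VALID) && (info & INFO_ENTRY_MASK) == 0;
}
```

A 64-bit task doing int 0x80 would then show up as user-mode bit set but
info_is_64bit_path() false.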

Only downside of adding the entry instruction info would be that more
work in the entry-specific code is needed. The code wouldn't be contained
to a small ptrace specific bit anymore.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  2:29                                                                     ` Indan Zupancic
@ 2012-02-10  2:47                                                                       ` H. Peter Anvin
       [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
  1 sibling, 0 replies; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-10  2:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On 02/09/2012 06:29 PM, Indan Zupancic wrote:
> 
> It's only broken because it doesn't work sometimes. ;-)
> 

I really hope you realize how idiotic that sounds.

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 18:44                                                     ` Oleg Nesterov
@ 2012-02-10  2:51                                                       ` Jamie Lokier
  0 siblings, 0 replies; 222+ messages in thread
From: Jamie Lokier @ 2012-02-10  2:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

Oleg Nesterov wrote:
> On 01/26, Denys Vlasenko wrote:
> >
> > On Wednesday 25 January 2012 20:36, Oleg Nesterov wrote:
> > >
> > > We can add the new events,
> > >
> > > 	PTRACE_EVENT_SYSCALL_ENTRY
> > > 	PTRACE_EVENT_SYSCALL_COMPAT_ENTRY
> > > 	PTRACE_EVENT_SYSCALL_EXIT
> > > 	PTRACE_EVENT_SYSCALL_COMPAT_EXIT
> >
> > We can get away with just the first one.
> > (1) It's unlikely people would want to get native sysentry events but not compat ones,
> > thus first two options can be combined into one;
> 
> Confused... Sure, we need the single option, or we could even report
> this unconditionally if PT_SEIZED.
> 
> I meant the different PTRACE_EVENT_* codes only.
> 
> > (2) syscall exit compat-ness is known from entry type - no need to indicate it; and
> > (3) if we would flag syscall entry with an event value in wait status, then syscall
> > exit will be already distinguishable.
> 
> Well, if we add _ENTRY then it looks more consistent to report _EXIT
> as well even if it is not that useful.
> 
> Doesn't matter. Nobody seems to like this, and afaics Linus has the
> good arguments against the arch-independent "consolidation".

Regarding distinction between ENTRY/EXIT:

  I agree only a buggy kernel should get out of sync, but are we sure
  the kernel is never buggy, and wouldn't this be nice protection, and
  an excuse for strace to drop the heuristics it currently does for
  this condition?

  The behaviour from fork() appears to have changed.  (This is from
  reading kernel code, I'm too lazy to try out old kernels.)  If I
  read correctly, before 2.5.35, Linux returned an EXIT event first to
  a child process if CLONE_PTRACE was used, and then it didn't, and
  then from 2.5.46 the tracer's use of PTRACE_EVENT_* determines if it
  does or not.

  So it's not surprising strace has heuristics... shame they're buggy
  (sigreturn can look like anything).

  Anyway, PTRACE_EVENT_* for syscall entry/exit just look prettier!
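
To illustrate what "an event value in wait status" would look like from
the tracer side, here is a small sketch. Existing PTRACE_EVENT_* stops
are reported with SIGTRAP as the stop signal and the event number in the
bits above it (status >> 16). The two HYP_* constants are placeholders
for the events proposed in this thread; they are not defined by any
kernel header and the values are invented.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical event numbers for the proposed events (not real). */
#define HYP_PTRACE_EVENT_SYSCALL_ENTRY 100
#define HYP_PTRACE_EVENT_SYSCALL_EXIT  101

/* Return the ptrace event number carried in a wait status,
 * or 0 if the status is not a SIGTRAP stop with an event. */
int stop_event(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (int)(((unsigned)status >> 16) & 0xffff);
}
```

With that, a tracer would just compare stop_event(status) against the
entry/exit codes instead of keeping its own toggle.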

Regarding ABI indication:

  At least with new syscalls, a tracer that doesn't know about them
  will see they're unrecognised; whereas a different ABI sometimes
  looks like an innocent syscall so can trick the tracer.

  However the argument for putting this in register state that goes
  into core dumps and checkpoint/restart state instead is pretty good.

  I don't have a strong opinion.  It's unfortunate that the current
  method not only makes it easy to subvert a ptracer, it makes
  ptracing slow and racy on archs where it has to read the syscall
  instruction.  (Weirdly that includes ARM, despite ARM using a
  register these days and having a ptrace option to set, but not read,
  the syscall number).

  That really is an argument for making sure all archs have the
  syscall number and, if necessary, the type of syscall entry point,
  somewhere in the register set.

All the best,
-- Jamie

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  2:02                                                                   ` Jamie Lokier
@ 2012-02-10  3:37                                                                     ` Indan Zupancic
  2012-02-10 21:19                                                                       ` Denys Vlasenko
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-10  3:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalc

On Fri, February 10, 2012 03:02, Jamie Lokier wrote:
> Indan Zupancic wrote:
>> On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
>> > Indan Zupancic wrote:
>> >> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
>> >> > Indan Zupancic wrote:
>> >> The jailer I wrote works pretty well as a simplistic strace replacement.
>> >> It can only print out the arguments we're checking, but that's usually
>> >> the more interesting info.
>> >
>> > In theory such a thing should be easy to write, but as we both found,
>> > ptrace() on Linux has a huge number of difficult quirks to deal with
>> > to trace reliably.  At least it's getting better with later kernels.
>>
>> It's not that bad, there are a few quirks, but not that many.
>> The ptrace specific code is less than 500 lines of code, with
>> a couple of hundred lines of header files. Linux ptrace specific
>> stuff creeps in elsewhere too though, like that execve mess.
>
> I count 720 lines *just* to read the syscall number and arguments in
> strace-git, for the Linux archs it supports.
>
> That's only the Linux code, I excluded non-Linux, and it's only a
> little bit of syscall.c, I didn't include generic ptracing,
> fork-following, threaded-exec-fixups, signal handling etc. nor other
> arch-specific functions and ABI fixups.  And it doesn't even have all
> archs currently in Linux mainline.

Well, I was talking about my own code, not strace. Counting strace lines
of code is tricky because of all the ifdefs.

I have to add threaded-exec-fixups, though that's not ptrace specific,
but Linux specific. Although I only support x86 at the moment, I try
to keep the per-arch code to a minimum. Currently it's 20 lines of x86
header file and 50 for x86_64 for the ptrace code. The real work is the
syscall info table, which is both system call and arch specific.

My code is written with cross-platform support in mind, I try to keep
the number of (Linux, ptrace or arch specific) assumptions as low as
possible. But if I added support for e.g. BSD then I would keep its
ptrace code totally separate from the Linux one.

>> >> It's not a 32 versus 64-bit issue though, so it will be something on
>> >> its own anyway. Can as well add an extra ARM specific ptrace command
>> >> to get that info, or hack it in some other way. For instance, ip is
>> >> (ab)used to tell if it is syscall entry or exit, so doing these tricks
>> >> isn't anything new in ARM either.
>> >
>> > In theory, aren't we supposed to know whether it's entry/exit anyway?
>> > Why does strace care?  Have there been kernel bugs in the past?  Maybe
>> > it was just to deal with SIGTRAP-after-exit in the past, which could
>> > be delivered at an unpredictable time if blocked and then unblocked by
>> > sigreturn().
>>
>> Maybe. I don't know why ARM does that ip thing.
>>
>> Although in theory you know the entry/exits if you keep track, but one
>> mistake or unexpected behaviour (like execve for my code) and you can get
>> it wrong. So for robustness sake it's good if it can be double checked.
>
> I agree, and I think the PTRACE_EVENT_SYSCALL_ENTRY/EXIT events would
> be a clean way to represent that.

Yes, that would be perfect.

> I wonder if all archs report syscall-exit as the first event in traced
> fork children.  Looking at arch/hexagon I'm guessing it doesn't, but
> it's hard to be sure and no practical way to test it :-/

I would expect none of them to return syscall-exit for the child process.
It was the parent that called it, the child never did!

> That wouldn't matter if the events were robust.

Yes. It's a lot better to not worry about all these kinds of details which
may or may not change between archs and kernel versions.
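
For the record, the bookkeeping a tracer does without such events is
basically a toggle per tracee, something like the sketch below (names
are made up, this is not my actual code). It shows why one unexpected or
missing stop is enough to get entry and exit swapped from then on.

```c
#include <stdbool.h>

/* Per-tracee state kept by the tracer (hypothetical). */
struct tracee_state {
    bool in_syscall;   /* true between an entry stop and its exit stop */
};

/* Call on every syscall-stop. Returns true if the tracer believes
 * this stop is a syscall entry, false if it believes it is an exit. */
bool on_syscall_stop(struct tracee_state *t)
{
    t->in_syscall = !t->in_syscall;
    return t->in_syscall;
}
```

If the exit stop of one syscall never arrives, the next real entry is
classified as an exit and everything after it is off by one until
something resets the state.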

> I read somewhere about a bug report where syscall-exit was seen after
> attach, but I don't remember where now.

Well, if you attach at a random moment you can get a syscall-exit first,
I guess. I suppose you have to wait till you get the SIGSTOP notification
before you can be sure that the next syscall event will be an entry one.

>> I don't know anything about OABI, can you link an OABI program against
>> an EABI library? If you can then libc can be EABI and the kernel doesn't
>> need OABI support.
>
> That's not the point.  If you're writing a ptrace jailer (as you are)
> a program can deliberately use OABI calls to subvert the tracer, even
> if it's using EABI for normal calls.

I know, but I can say that kernels supporting OABI aren't supported
because they are unsafe. Just like a 32-bit only jailer running on
x86_64 is unsafe. Best would be if I checked it at startup too.

Right now I have to add very paranoid code to support compat32 on
x86_64 anyway.

> For linking, you are mostly right.  Ideally everything would be open
> and recompilable anyway, but that's sadly not always possible.  OABI
> and EABI have different struct layouts among other changes, and EABI
> being newer tends to accompany other libc changes; embedded libcs
> aren't always as drop-in backward-compatible as glibc.

Russell King told me about PTRACE_SET_SYSCALL on ARM, that would solve
the reading memory problem, as we can always set the expected syscall
number to make sure it wasn't changed behind our back. The system call
number are the same for EABI and OABI, so it's not as bad as int 0x80
from 64-bit.

The alignment changes hopefully don't make a difference for my jailer.
If they do then I have to add specific code to handle it, which I don't
like doing. But looking at sys_oabi-compat.c it doesn't seem too bad.

>> >> And then there's the whole confusion what that flag says, some might think
>> >> it says in what mode the tracee is instead of what mode the system call is.
>> >> That those two can be different is not obvious at all and seems very x86_64
>> >> specific.
>> >
>> > My rough read of PARISC entry code suggests it has two entry methods,
>> > similar to ARM and x86_64, but I'm not really familiar with PARISC and
>> > I don't have a machine handy to try it out :-)
>>
>> It has a unified syscall table, so does it really matter?
>
> I don't know if the 32/64 matters.  For security or accurate tracing,
> I wouldn't like to assume without checking if there are 64-on-32
> argument alignment fixups.

I thought it was just ARM passing a 64-bit arg in two 32-bit regs.
But yes, it's something that needs to be checked. That's most of
the work of adding a new arch, checking all system calls.

> PARISC has a second set of HPUX-compatible system call numbers,
> handled in arch/parisc/hpux/*.  I don't know if those are available to
> all programs and can be used to subvert a ptracer.  Looking at
> hpux/gate.S I think they bypass ptrace entirely; maybe they can subvert it.

That's only set when CONFIG_HPUX is set. If they bypass ptrace entirely
then such kernels can't be supported anyway, except if they have some
other mechanism for syscall interception. But the obscurer the setup,
the less worried I am about supporting it.

>> > I have a script in progress which extracts all the
>> > per-arch and per-ABI syscall numbers, syscall argument layouts and
>> > kernel function names to keep track of arch-specific fixups, from a
>> > Linux source tree.  It currently works on all archs except it breaks
>> > on x86 which insists on being different ;-)
>>
>> That's handy, but I thought strace had such a script already?
>> See HACKING-scripts in strace source. Or is yours much better?
>
> The strace script only gets the syscall numbers (so doesn't help
> cross-check I've applied all arch-specific syscall fixups), doesn't
> work for all arch/ABI combinations without editing unistd.h, and
> requires a configured and partly built kernel for some archs.  It's
> only really useful for getting new syscall numbers which you then
> hand-edit into the real table.  You still have to set the number of
> arguments and check carefully you haven't missed any arch-specific
> fixups.

Your script sounds quite useful then. I might ask for it when I'm
adding support for more archs.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
       [not found]                                                                       ` <cc95fcf4b1c28ee6f73e373d04593634.squirrel@webmail.greenhost.nl>
@ 2012-02-10 15:53                                                                         ` H. Peter Anvin
  2012-02-10 22:42                                                                           ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-10 15:53 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On 02/09/2012 11:42 PM, Indan Zupancic wrote:
> Patch implementing this below. It uses bit 3 for task mode and bit 5
> for syscall mode. Those bits are only valid if VIF is set. It increases
> the kernel size by around 50 bytes, 6 for a 32-bit kernel.
>
> Any objections?

#include <stdnak.h>

	-hpa


^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10  3:37                                                                     ` Indan Zupancic
@ 2012-02-10 21:19                                                                       ` Denys Vlasenko
  0 siblings, 0 replies; 222+ messages in thread
From: Denys Vlasenko @ 2012-02-10 21:19 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jamie Lokier, Oleg Nesterov, Linus Torvalds, Andi Kleen,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow

On Friday 10 February 2012 04:37, Indan Zupancic wrote:
> > I read somewhere about a bug report where syscall-exit was seen after
> > attach, but I don't remember where now.
> 
> Well, if you attach at a random moment you can get a syscall-exit first,
> I guess. I suppose you have to wait till you get the SIGSTOP notification
> before you can be sure that the next syscall event will be an entry one.

No. After PTRACE_ATTACH, next reported waitpid result will be either
a ptrace-stop of signal-delivery-stop variety,
or death (WIFEXITED/WIFSIGNALED). Syscall exit notification
is not possible (modulo kernel bugs). For one, syscall entry/exit
notifications must be explicitly requested by PTRACE_SYSCALL, which
wasn't yet done!
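
A quick way to check this is a sketch along the following lines: fork a
child that just sleeps in pause(), attach to it and look at the first
wait status. The function name and return convention are only for
illustration. It returns -1 when fork or ptrace is not available (for
example in a restricted sandbox), so it does not prove anything there.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the first stop reported after PTRACE_ATTACH is a
 * SIGSTOP signal-delivery-stop, 0 if it is anything else, and -1 if
 * fork/ptrace could not be used at all. */
int first_stop_after_attach(void)
{
    int status;
    int result = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child: sleep in a syscall forever */
    }

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0) {
        /* No PTRACE_SYSCALL has been issued yet, so this cannot be
         * a syscall-stop. */
        if (waitpid(pid, &status, 0) == pid)
            result = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;
    }

    kill(pid, SIGKILL);         /* clean up the child in all cases */
    waitpid(pid, &status, 0);
    return result;
}
```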

-- 
vda

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10 15:53                                                                         ` H. Peter Anvin
@ 2012-02-10 22:42                                                                           ` Indan Zupancic
  2012-02-10 22:56                                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 222+ messages in thread
From: Indan Zupancic @ 2012-02-10 22:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module

On Fri, February 10, 2012 16:53, H. Peter Anvin wrote:
> On 02/09/2012 11:42 PM, Indan Zupancic wrote:
>> Patch implementing this below. It uses bit 3 for task mode and bit 5
>> for syscall mode. Those bits are only valid if VIF is set. It increases
>> the kernel size by around 50 bytes, 6 for a 32-bit kernel.
>>
>> Any objections?
>
> #include <stdnak.h>

Could you please elaborate? Is it just the stealing of eflags bits that
irks you or are there technical problems too?

I understand some people would prefer a new regset, but that would force
everyone to use PTRACE_GETREGSET instead of whatever they are using now.
The problem with that is that not all archs support PTRACE_GETREGSET, so
the user space ptrace code needs to use different ptrace calls depending
on the architecture for no good reason. If PEEK_USER works then it's less
of a problem, then it's one extra ptrace call compared to the eflag way
if PTRACE_GETREGS is used. If this new info is exposed with a special
regset instead of being appended to normal regs then one extra ptrace
call per system call event needs to be done. You can as well add special
x86 ptrace requests then.
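
For comparison, this is roughly what reading the general registers via
PTRACE_GETREGSET looks like in user space (a sketch, the function name is
made up). A separate regset for the new info would mean a second call
like this one, with a different regset type, on every syscall event.

```c
#include <elf.h>          /* NT_PRSTATUS */
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>      /* struct iovec */
#include <sys/user.h>     /* struct user_regs_struct */

/* Read the general-purpose regset of a stopped tracee.
 * Returns the number of bytes the kernel filled in, or -1 on error. */
long read_gp_regset(pid_t pid, struct user_regs_struct *regs)
{
    struct iovec iov = {
        .iov_base = regs,
        .iov_len  = sizeof(*regs),
    };

    /* The regset is selected by its ELF note type. */
    if (ptrace(PTRACE_GETREGSET, pid,
               (void *)(uintptr_t)NT_PRSTATUS, &iov) == -1)
        return -1;

    /* The kernel updates iov_len to the size actually written. */
    return (long)iov.iov_len;
}
```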

Or is the main advantage of using a regset that it shows up in coredumps?
That would merit the extra effort at least.

Stealing eflags bits may be ugly like hell, but it's very simple for both
the kernel and user space to implement.
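
To show how little is needed on the user space side, here is a sketch
that follows the layout from the patch description quoted above: bit 3
for task mode, bit 5 for syscall mode, both only valid if VIF (bit 19 of
eflags) is set. This is just the proposal from this thread; the names are
invented and the sketch does not assume which polarity means 32 or
64 bit, it only returns the raw bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_VIF                (1u << 19) /* virtual interrupt flag */
#define PROPOSED_TASK_MODE_BIT    (1u << 3)  /* reserved bit, per the proposal */
#define PROPOSED_SYSCALL_MODE_BIT (1u << 5)  /* reserved bit, per the proposal */

/* Decoded view of the proposed bits (hypothetical). */
struct mode_info {
    bool valid;        /* VIF set: the two bits below carry information */
    bool task_bit;     /* raw value of bit 3 */
    bool syscall_bit;  /* raw value of bit 5 */
};

/* Decode the proposed mode bits from an eflags value read via ptrace. */
struct mode_info decode_proposed_flags(uint64_t eflags)
{
    struct mode_info m = { false, false, false };

    if (eflags & EFLAGS_VIF) {
        m.valid = true;
        m.task_bit = (eflags & PROPOSED_TASK_MODE_BIT) != 0;
        m.syscall_bit = (eflags & PROPOSED_SYSCALL_MODE_BIT) != 0;
    }
    return m;
}
```

On an old kernel VIF is never set for a normal task, so valid stays
false and the tracer can fall back to whatever it does now.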

I think everyone agrees that this kind of info needs to be exposed somehow.
In the end I don't care how it is done, as long as the info is easily
available.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10 22:42                                                                           ` Indan Zupancic
@ 2012-02-10 22:56                                                                             ` H. Peter Anvin
  2012-02-12 12:07                                                                               ` Indan Zupancic
  0 siblings, 1 reply; 222+ messages in thread
From: H. Peter Anvin @ 2012-02-10 22:56 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj

On 02/10/2012 02:42 PM, Indan Zupancic wrote:
>> #include <stdnak.h>
> 
> Could you please elaborate? Is it just the stealing of eflags bits that
> irks you or are there technical problems too?

Yes, I will not accept that unless it gets ok'd by the architecture
people, which may take a long time.

> I understand some people would prefer a new regset, but that would force
> everyone to use PTRACE_GETREGSET instead of whatever they are using now.
> The problem with that is that not all archs support PTRACE_GETREGSET, so
> the user space ptrace code needs to use different ptrace calls depending
> on the architecture for no good reason. If PEEK_USER works then it's less
> of a problem, then it's one extra ptrace call compared to the eflag way
> if PTRACE_GETREGS is used. If this new info is exposed with a special
> regset instead of being appended to normal regs then one extra ptrace
> call per system call event needs to be done. You can as well add special
> x86 ptrace requests then.

Seriously... if you're mucking with registers on this level, an
architecture dependency is not a big deal, and perhaps it's a good sign
that the laggard architectures need to catch up.  If multiple ptrace
requests is a problem, then perhaps this is a good sign that we need a
single way to get multiple regsets in a single request?

> Or is the main advantage of using a regset that it shows up in coredumps?
> That would merit the extra effort at least.

That is another plus, which is significant, too.  The final advantage is
expandability.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 222+ messages in thread

* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-02-10 22:56                                                                             ` H. Peter Anvin
@ 2012-02-12 12:07                                                                               ` Indan Zupancic
  0 siblings, 0 replies; 222+ messages in thread
From: Indan Zupancic @ 2012-02-12 12:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H.J. Lu, Linus Torvalds, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Oleg Nesterov, Will Drewry, linux-kernel,
	keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, segoon, rostedt, jmorris, scarybeasts, avi, penberg, viro,
	mingo, akpm, khilman, borislav.petkov, amwang, ak, eric.dumazet,
	gregkh, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module

On Fri, February 10, 2012 23:56, H. Peter Anvin wrote:
> On 02/10/2012 02:42 PM, Indan Zupancic wrote:
>>> #include <stdnak.h>
>>
>> Could you please elaborate? Is it just the stealing of eflags bits that
>> irks you or are there technical problems too?
>
> Yes, I will not accept that unless it gets ok'd by the architecture
> people, which may take a long time.

The kernel x86 people or the Intel CPU people?

With the latest patch it doesn't matter what bits Intel decides to use in
the future, any clashes can always be handled unambiguously.

>> I understand some people would prefer a new regset, but that would force
>> everyone to use PTRACE_GETREGSET instead of whatever they are using now.
>> The problem with that is that not all archs support PTRACE_GETREGSET, so
>> the user space ptrace code needs to use different ptrace calls depending
>> on the architecture for no good reason. If PEEK_USER works then it's less
>> of a problem, then it's one extra ptrace call compared to the eflag way
>> if PTRACE_GETREGS is used. If this new info is exposed with a special
>> regset instead of being appended to normal regs then one extra ptrace
>> call per system call event needs to be done. You can as well add special
>> x86 ptrace requests then.
>
> Seriously... if you're mucking with registers on this level, an
> architecture dependency is not a big deal, and perhaps it's a good sign
> that the laggard architectures need to catch up.  If multiple ptrace
> requests is a problem, then perhaps this is a good sign that we need a
> single way to get multiple regsets in a single request?

Well, if we're forcing people to use a different API then we might as well
overhaul the whole ptrace thing to have a decent interface instead of all
this mucking around with waitpid() and whatnot.

That is the main advantage of the eflags bit-stealing approach: it's mostly
API independent. That it puts the info close to the data where it is used is
a bonus.

>> Or is the main advantage of using a regset that it shows up in coredumps?
>> That would merit the extra effort at least.
>
> That is another plus, which is significant, too.  The final advantage is
> expandability.

I just realized that if coredumping uses ptrace's code the eflags will
show up too. As for expandability, there are a few more bits left...
But more seriously, what other highly x86 specific flags are needed?
Other than maybe the syscall entry instruction, which I'm not convinced
of, I can't think of anything.

Greetings,

Indan




* Re: Compat 32-bit syscall entry from 64-bit task!?
  2012-01-26 18:03                                                           ` Denys Vlasenko
@ 2017-03-08 23:41                                                             ` Dmitry V. Levin
  2017-03-09  4:39                                                               ` Andrew Lutomirski
  0 siblings, 1 reply; 222+ messages in thread
From: Dmitry V. Levin @ 2017-03-08 23:41 UTC (permalink / raw)
  To: Denys Vlasenko, Linus Torvalds
  Cc: Indan Zupancic, Oleg Nesterov, Andi Kleen, Jamie Lokier,
	Andrew Lutomirski, Will Drewry, linux-kernel, keescook,
	john.johansen, serge.hallyn, coreyb, pmoore, eparis, djm, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, Roland McGrath

Hi,

On Thu, Jan 26, 2012 at 07:03:43PM +0100, Denys Vlasenko wrote:
> Hi Linus,
> 
> On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >> Please look at strace source, get_scno() function, where
> >> it reads syscall no and parameters. Let's see....
> >> - POWERPC: has 32-bit and 64-bit mode
> >> - X86_64: has 32-bit and 64-bit mode
> >> - IA64: has i386-compat mode
> >> - ARM: has more than one ABI
> >> - SPARC: has 32-bit and 64-bit mode
> >>
> >> Do you want to re-invent a different arch-specific way to report
> >> syscall type for each of these arches?
> >
> > I think an arch-specific one is better than trying to make some
> > generic one that is messy.
> >
> > As you say, many architectures have multiple system call ABIs.
> >
> > But they tend to be very *different* issues. They can be about
> > multiple ABI's, as you mention, and even when they *look* similar
> > (32-bit vs 64-bit ABI's) they are actually totally different issues.
> > [skip]
> 
> I don't have a particular attachment to my solution,
> and I think we have already talked about this problem for
> far too long.
> 
> Looks like nobody is _strongly_ opposed to your patch
> which uses a few bits in eflags to report bitness
> of the x86 syscall.
> 
> Let's just do that already. If you commit it to kernel git,
> I will immediately change strace accordingly.

Is there any progress with this (or any alternative) solution?

I see the kernel side has changed a bit, and the strace part
is in a better shape than 5 years ago (although I'm biased of course),
but I don't see any kernel interface that would allow strace to reliably
recognize this 0x80 case.


-- 
ldv


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2017-03-08 23:41                                                             ` Dmitry V. Levin
@ 2017-03-09  4:39                                                               ` Andrew Lutomirski
  2017-03-14  2:57                                                                 ` Dmitry V. Levin
  0 siblings, 1 reply; 222+ messages in thread
From: Andrew Lutomirski @ 2017-03-09  4:39 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Denys Vlasenko, Linus Torvalds, Indan Zupancic, Oleg Nesterov,
	Andi Kleen, Jamie Lokier, Will Drewry, linux-kernel, Kees Cook,
	John Johansen, Serge Hallyn, coreyb, pmoore, Eric Paris, djm,
	segoon, Steven Rostedt, James Morris, Chris Evans, Avi Kivity,
	penberg, Al Viro, Ingo Molnar, Andrew Morton, khilman,
	borislav.petkov, amwang, Andi Kleen, Eric Dumazet, gregkh,
	dhowells, daniel.lezcano, Linux FS Devel, linux-security-module,
	olofj, Michael Halcrow, dlaor, Roland McGrath

On Wed, Mar 8, 2017 at 3:41 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> Hi,
>
> On Thu, Jan 26, 2012 at 07:03:43PM +0100, Denys Vlasenko wrote:
>> Hi Linus,
>>
>> On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >> Please look at strace source, get_scno() function, where
>> >> it reads syscall no and parameters. Let's see....
>> >> - POWERPC: has 32-bit and 64-bit mode
>> >> - X86_64: has 32-bit and 64-bit mode
>> >> - IA64: has i386-compat mode
>> >> - ARM: has more than one ABI
>> >> - SPARC: has 32-bit and 64-bit mode
>> >>
>> >> Do you want to re-invent a different arch-specific way to report
>> >> syscall type for each of these arches?
>> >
>> > I think an arch-specific one is better than trying to make some
>> > generic one that is messy.
>> >
>> > As you say, many architectures have multiple system call ABIs.
>> >
>> > But they tend to be very *different* issues. They can be about
>> > multiple ABI's, as you mention, and even when they *look* similar
>> > (32-bit vs 64-bit ABI's) they are actually totally different issues.
>> > [skip]
>>
>> I don't have a particular attachment to my solution,
>> and I think we have already talked about this problem for
>> far too long.
>>
>> Looks like nobody is _strongly_ opposed to your patch
>> which uses a few bits in eflags to report bitness
>> of the x86 syscall.
>>
>> Let's just do that already. If you commit it to kernel git,
>> I will immediately change strace accordingly.
>
> Is there any progress with this (or any alternative) solution?
>
> I see the kernel side has changed a bit, and the strace part
> is in a better shape than 5 years ago (although I'm biased of course),
> but I don't see any kernel interface that would allow strace to reliably
> recognize this 0x80 case.

I am strongly opposed to fudging registers to half-arsedly slightly
improve the epically crappy ptrace(2) interface for syscalls.

To fix this right, please just add PTRACE_GET_SYSCALL_INFO or similar
to, in one shot, read out all the syscall details.  This means: arch,
syscall number, arg0..arg5, and *whether it's entry or exit*.  I propose
returning this structure:

struct ptrace_syscall_info {
  u8 op;  /* 0 for entry, 1 for exit */
  u8 pad0;
  u16 pad1;
  u32 pad2;
  union {
    struct seccomp_data syscall_entry;
    s64 syscall_exit_retval;
  };
};

because struct seccomp_data already gets this right.  There's plenty
of opportunity to fine-tune this.  Now it works on all architectures.

Since struct seccomp_data may be extended in the future, the operation
should be:

ptrace(PTRACE_GET_SYSCALL_INFO, pid,
       (void *)sizeof(struct ptrace_syscall_info), &info);

This returns 0 on success and some error code if, for example, the current
ptrace stop isn't a syscall entry or exit.
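
A minimal userspace sketch of the layout proposed above, for readers who
want to check the sizes and see how a tracer might consume the result.  It
is only an illustration under stated assumptions: the local
sketch_seccomp_data mirrors the fields of the kernel's struct seccomp_data
(nr, arch, instruction_pointer, args[6]), and every sketch_/SKETCH_ name,
including the describe helper, is hypothetical and not a kernel interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the kernel's struct seccomp_data (assumed field layout). */
struct sketch_seccomp_data {
	int32_t  nr;                   /* syscall number */
	uint32_t arch;                 /* AUDIT_ARCH_* identifier */
	uint64_t instruction_pointer;
	uint64_t args[6];              /* arg0..arg5 */
};

/* Values of 'op' as given in the proposal's comment. */
enum { SKETCH_OP_ENTRY = 0, SKETCH_OP_EXIT = 1 };

/* The proposed structure, spelled with fixed-width userspace types. */
struct sketch_syscall_info {
	uint8_t  op;
	uint8_t  pad0;
	uint16_t pad1;
	uint32_t pad2;
	union {
		struct sketch_seccomp_data syscall_entry;
		int64_t syscall_exit_retval;
	};
};

/* The explicit padding puts the union at offset 8 on every ABI. */
_Static_assert(offsetof(struct sketch_syscall_info, syscall_entry) == 8,
	       "union must start at offset 8");
_Static_assert(sizeof(struct sketch_syscall_info) == 72,
	       "8-byte header plus 64-byte seccomp_data");

/* Hypothetical helper: render one record as text, snprintf-style. */
static int sketch_describe(const struct sketch_syscall_info *info,
			   char *buf, size_t len)
{
	switch (info->op) {
	case SKETCH_OP_ENTRY:
		return snprintf(buf, len, "entry arch=0x%08x nr=%d arg0=0x%llx",
				(unsigned)info->syscall_entry.arch,
				(int)info->syscall_entry.nr,
				(unsigned long long)info->syscall_entry.args[0]);
	case SKETCH_OP_EXIT:
		return snprintf(buf, len, "exit retval=%lld",
				(long long)info->syscall_exit_retval);
	default:
		return snprintf(buf, len, "unknown op %u", (unsigned)info->op);
	}
}
```

Because the caller passes sizeof(struct ptrace_syscall_info) in the addr
argument, a kernel with a larger struct seccomp_data could copy out only as
many bytes as an older tracer asked for.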

--Andy


* Re: Compat 32-bit syscall entry from 64-bit task!?
  2017-03-09  4:39                                                               ` Andrew Lutomirski
@ 2017-03-14  2:57                                                                 ` Dmitry V. Levin
  0 siblings, 0 replies; 222+ messages in thread
From: Dmitry V. Levin @ 2017-03-14  2:57 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Elvira Khabirova, Denys Vlasenko, Linus Torvalds, Indan Zupancic,
	Oleg Nesterov, Andi Kleen, Jamie Lokier, Will Drewry, Kees Cook,
	John Johansen, pmoore, Eric Paris, djm, segoon, Steven Rostedt,
	James Morris, Chris Evans, Avi Kivity, penberg, Al Viro,
	Ingo Molnar, Andrew Morton, Andi Kleen, Eric Dumazet, dhowells,
	daniel.lezcano, Linux FS Devel, linux-security-module, olofj,
	Michael Halcrow, Roland McGrath, linux-kernel

On Wed, Mar 08, 2017 at 08:39:55PM -0800, Andrew Lutomirski wrote:
> On Wed, Mar 8, 2017 at 3:41 PM, Dmitry V. Levin wrote:
[...]
> > Is there any progress with this (or any alternative) solution?
> >
> > I see the kernel side has changed a bit, and the strace part
> > is in a better shape than 5 years ago (although I'm biased of course),
> > but I don't see any kernel interface that would allow strace to reliably
> > recognize this 0x80 case.
> 
> I am strongly opposed to fudging registers to half-arsedly slightly
> improve the epically crappy ptrace(2) interface for syscalls.
> 
> To fix this right, please just add PTRACE_GET_SYSCALL_INFO or similar
> to, in one shot, read out all the syscall details.  This means: arch,
> syscall number, arg0..arg5, and *whether it's entry or exit*.  I propose
> returning this structure:
> 
> struct ptrace_syscall_info {
>   u8 op;  /* 0 for entry, 1 for exit */
>   u8 pad0;
>   u16 pad1;
>   u32 pad2;
>   union {
>     struct seccomp_data syscall_entry;
>     s64 syscall_exit_retval;
>   };
> };
> 
> because struct seccomp_data already gets this right.  There's plenty
> of opportunity to fine-tune this.  Now it works on all architectures.

Unfortunately, the API is missing.

Unlike syscall_get_nr(), syscall_get_arch() works with the current task
only so there is no API to get the arch identifier for the given task
that would work on all architectures.
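
For context, the arch identifier in question is the AUDIT_ARCH_* value that
struct seccomp_data carries in its arch field.  A small userspace sketch of
what a tracer could do with it if the kernel exposed it per task: the
constants are local copies of the values I believe <linux/audit.h> and
<elf.h> define, and the SKETCH_/sketch_ names and helpers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag bits used to build AUDIT_ARCH_* values. */
#define SKETCH_AUDIT_ARCH_64BIT 0x80000000u  /* __AUDIT_ARCH_64BIT */
#define SKETCH_AUDIT_ARCH_LE    0x40000000u  /* __AUDIT_ARCH_LE */

/* ELF machine numbers for the two x86 syscall ABIs discussed here. */
#define SKETCH_EM_386    3u
#define SKETCH_EM_X86_64 62u

#define SKETCH_AUDIT_ARCH_I386 \
	(SKETCH_EM_386 | SKETCH_AUDIT_ARCH_LE)
#define SKETCH_AUDIT_ARCH_X86_64 \
	(SKETCH_EM_X86_64 | SKETCH_AUDIT_ARCH_64BIT | SKETCH_AUDIT_ARCH_LE)

/* True if the arch identifier names a 64-bit syscall ABI. */
static bool sketch_arch_is_64bit(uint32_t arch)
{
	return (arch & SKETCH_AUDIT_ARCH_64BIT) != 0;
}

/*
 * True for the "0x80 case": the task itself is 64-bit, but this
 * particular syscall was entered through a 32-bit ABI.
 */
static bool sketch_is_compat_entry(uint32_t task_arch, uint32_t syscall_arch)
{
	return sketch_arch_is_64bit(task_arch) &&
	       !sketch_arch_is_64bit(syscall_arch);
}
```

With a per-syscall arch value, a tracer could pick the syscall table from
the identifier alone instead of guessing from register contents.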


-- 
ldv

