linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/8] Various fixes related to ptrace_may_access()
@ 2016-09-23 20:40 Jann Horn
  2016-09-23 20:40 ` [PATCH v2 1/8] exec: introduce cred_guard_light Jann Horn
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

This series has a bunch of loosely-related fixes for minor security bugs. Since
the bugs are minor and the patches aren't trivial, I'm sending it publicly.

The reason I'm bundling these patches up as a series instead of sending patches
one by one is that e.g. patch 2 creates some common infrastructure that multiple
other patches depend on.

For specific information about what the purpose of this series is, please see
the individual commits - but the general theme is:

 - get rid of races that can leak things like userspace addresses during setuid
   execve()
 - get rid of procfs files that cause unexpected behavior when passed around
 - add warnings to keep developers from creating more issues like this
 - document access checks

Changes in v2:
 - removed "ptrace: forbid ptrace checks against current_cred() from VFS
   context" (Linus Torvalds)
 - use the luid scheme suggested by Andy Lutomirski - patch 2/8 changed a lot
 - various other changes in individual patches

There is a somewhat ugly detail in patch 2/8 now, which is that the
tasklist_lock is taken for reading while regenerating the luid during execve.
I'm not sure whether that can be avoided.

Jann Horn (8):
  exec: introduce cred_guard_light
  exec: turn self_exec_id into self_privunit
  proc: use open()-time creds for ptrace checks
  futex: don't leak robust_list pointer
  proc: lock properly in ptrace_may_access callers
  ptrace: warn on ptrace_may_access without proper locking
  fs/proc: fix attr access check
  Documentation: add security/ptrace_checks.txt

 Documentation/security/ptrace_checks.txt | 229 +++++++++++++++++++++++++++++++
 fs/exec.c                                |  56 +++++++-
 fs/proc/array.c                          |  10 +-
 fs/proc/base.c                           | 224 ++++++++++++++++++++++--------
 fs/proc/internal.h                       |  14 ++
 fs/proc/namespaces.c                     |  14 ++
 include/linux/init_task.h                |   1 +
 include/linux/lsm_hooks.h                |   3 +-
 include/linux/ptrace.h                   |   5 +
 include/linux/sched.h                    |  27 +++-
 include/linux/security.h                 |  10 +-
 kernel/fork.c                            |   6 +-
 kernel/futex.c                           |  30 ++--
 kernel/futex_compat.c                    |  30 ++--
 kernel/ptrace.c                          |  54 ++++++--
 kernel/signal.c                          |   5 +-
 security/apparmor/include/ipc.h          |   2 +-
 security/apparmor/ipc.c                  |   4 +-
 security/apparmor/lsm.c                  |  14 +-
 security/commoncap.c                     |   8 +-
 security/security.c                      |   5 +-
 security/selinux/hooks.c                 |  15 +-
 security/smack/smack_lsm.c               |  18 ++-
 security/yama/yama_lsm.c                 |   9 +-
 24 files changed, 662 insertions(+), 131 deletions(-)
 create mode 100644 Documentation/security/ptrace_checks.txt

-- 
2.1.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/8] exec: introduce cred_guard_light
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-30 15:35   ` Oleg Nesterov
  2016-09-23 20:40 ` [PATCH v2 2/8] exec: turn self_exec_id into self_privunit Jann Horn
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

This is a new per-threadgroup lock that can often be taken instead of
cred_guard_mutex and has less deadlock potential. I'm doing this because
Oleg Nesterov mentioned the potential for deadlocks, in particular if a
debugged task is stuck in execve, trying to get rid of a ptrace-stopped
thread, and the debugger attempts to inspect procfs files of the debugged
task.

The binfmt handlers (in particular for elf_fdpic and flat) might still
call VFS read and mmap operations on the binary with the lock held, but
not open operations (as is the case with cred_guard_mutex).

An rwlock would be more appropriate here, but apparently those don't
have _killable variants of the locking functions?

This is a preparation patch for using proper locking in more places.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c                 | 15 ++++++++++++++-
 include/linux/init_task.h |  1 +
 include/linux/sched.h     | 10 ++++++++++
 kernel/fork.c             |  1 +
 kernel/ptrace.c           | 10 ++++++++++
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6fcfb3f..84430ee 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1238,6 +1238,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_light);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1251,7 +1255,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	acct_arg_size(bprm, 0);
 	retval = exec_mmap(bprm->mm);
 	if (retval)
-		goto out;
+		goto out_unlock;
 
 	bprm->mm = NULL;		/* We're using it now */
 
@@ -1263,6 +1267,8 @@ int flush_old_exec(struct linux_binprm * bprm)
 
 	return 0;
 
+out_unlock:
+	mutex_unlock(&current->signal->cred_guard_light);
 out:
 	return retval;
 }
@@ -1386,6 +1392,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_guard_light);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
@@ -1753,6 +1760,12 @@ static int do_execveat_common(int fd, struct filename *filename,
 	return retval;
 
 out:
+	if (!bprm->mm && bprm->cred) {
+		/* failure after flush_old_exec(), but before
+		 * install_exec_creds()
+		 */
+		mutex_unlock(&current->signal->cred_guard_light);
+	}
 	if (bprm->mm) {
 		acct_arg_size(bprm, 0);
 		mmput(bprm->mm);
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f8834f8..cd9faa0 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -58,6 +58,7 @@ extern struct fs_struct init_fs;
 	INIT_PREV_CPUTIME(sig)						\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	.cred_guard_light = __MUTEX_INITIALIZER(sig.cred_guard_light)	\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..2a1df2f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -808,6 +808,16 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	/*
+	 * Lightweight version of cred_guard_mutex; used to prevent race
+	 * conditions where a user can gain information about the post-execve
+	 * state of a task to which access should only be granted pre-execve.
+	 * Hold this mutex while performing remote task inspection associated
+	 * with a security check.
+	 * This mutex MUST NOT be used in cases where anything changes about
+	 * the security properties of a running execve().
+	 */
+	struct mutex cred_guard_light;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index beb3172..2d46f3a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1215,6 +1215,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 				   current->signal->is_child_subreaper;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_guard_light);
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1d3b766..b5120ec 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -283,6 +283,16 @@ ok:
 	return security_ptrace_access_check(task, mode);
 }
 
+/*
+ * NOTE: When you call this function, you need to ensure that the target task
+ * can't acquire (via setuid execve) credentials between the ptrace access
+ * check and the privileged access. The recommended way to do this is to hold
+ * one of task->signal->{cred_guard_mutex,cred_guard_light} while calling this
+ * function and performing the requested access.
+ *
+ * This function may only be used if access is requested in the name of
+ * current_cred().
+ */
 bool ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	int err;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
  2016-09-23 20:40 ` [PATCH v2 1/8] exec: introduce cred_guard_light Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-23 21:04   ` Andy Lutomirski
  2016-09-30 13:20   ` Oleg Nesterov
  2016-09-23 20:40 ` [PATCH v2 3/8] proc: use open()-time creds for ptrace checks Jann Horn
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

This ensures that self_privunit ("privilege unit locally unique ID")
is only shared by processes that share the mm_struct and the signal_struct;
not just spatially, but also temporally. In other words, if you do execve()
or clone() without CLONE_THREAD, you get a new privunit that has never
been used before.

One reason for doing this is that it prevents an attacker from sending an
arbitrary signal to a parent process after performing 2^32-1 execve()
calls.

The second reason for this is that it permits using the self_exec_luid in
a later patch to check during a ptrace access whether subject and object
are temporally and spatially equal for privilege checking purposes.

The implementation of locally unique IDs is in sched.h and exec.c for now
because those are the only users so far - if anything else wants to use
them in the future, they can be moved elsewhere.

changed in v2:
 - have 2^64 IDs per CPU instead of 2^64 shared ones (luid scheme,
   suggested by Andy Lutomirski)
 - take task_lock for reading in setup_new_exec() while bumping the LUID

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c             | 41 +++++++++++++++++++++++++++++++++++++++--
 include/linux/sched.h | 17 +++++++++++++++--
 kernel/fork.c         |  5 +++--
 kernel/signal.c       |  5 ++++-
 4 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 84430ee..fcc11f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1281,6 +1281,34 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 }
 EXPORT_SYMBOL(would_dump);
 
+static DEFINE_PER_CPU(u64, luid_counters);
+
+static int __init init_luid_counters(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		/* value 0 is reserved for init */
+		per_cpu(luid_counters, cpu) = 1;
+	}
+
+	return 0;
+}
+early_initcall(init_luid_counters);
+
+/*
+ * Allocates a new LUID and writes the allocated LUID to @out.
+ * This function must not be called from IRQ context.
+ */
+void fill_luid(struct luid *out)
+{
+	preempt_disable();
+	out->count = raw_cpu_read(luid_counters);
+	raw_cpu_add(luid_counters, 1);
+	out->cpu = smp_processor_id();
+	preempt_enable();
+}
+
 void setup_new_exec(struct linux_binprm * bprm)
 {
 	arch_pick_mmap_layout(current->mm);
@@ -1313,8 +1341,17 @@ void setup_new_exec(struct linux_binprm * bprm)
 	}
 
 	/* An exec changes our domain. We are no longer part of the thread
-	   group */
-	current->self_exec_id++;
+	 * group.
+	 * The privunit luid is regenerated with the tasklist_lock held for
+	 * reading to allow do_notify_parent() (which only runs with
+	 * tasklist_lock held for writing) to inspect privunit IDs of other
+	 * tasks without taking the cred_guard_light (which wouldn't work
+	 * because the tasklist_lock is held).
+	 */
+	read_lock(&tasklist_lock);
+	fill_luid(&current->self_privunit);
+	read_unlock(&tasklist_lock);
+
 	flush_signal_handlers(current, 0);
 	do_close_on_exec(current->files);
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2a1df2f..fa90e36 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1467,6 +1467,19 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+/* locally unique ID */
+struct luid {
+	u64 count;
+	unsigned int cpu;
+};
+
+void fill_luid(struct luid *out);
+
+static inline bool luid_eq(const struct luid *a, const struct luid *b)
+{
+	return a->count == b->count && a->cpu == b->cpu;
+}
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1688,8 +1701,8 @@ struct task_struct {
 	struct seccomp seccomp;
 
 /* Thread group tracking */
-   	u32 parent_exec_id;
-   	u32 self_exec_id;
+	struct luid parent_privunit;
+	struct luid self_privunit;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
  * mempolicy */
 	spinlock_t alloc_lock;
diff --git a/kernel/fork.c b/kernel/fork.c
index 2d46f3a..e1bd501 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1567,6 +1567,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 			p->exit_signal = (clone_flags & CSIGNAL);
 		p->group_leader = p;
 		p->tgid = p->pid;
+		fill_luid(&p->self_privunit);
 	}
 
 	p->nr_dirtied = 0;
@@ -1597,10 +1598,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
 		p->real_parent = current->real_parent;
-		p->parent_exec_id = current->parent_exec_id;
+		p->parent_privunit = current->parent_privunit;
 	} else {
 		p->real_parent = current;
-		p->parent_exec_id = current->self_exec_id;
+		p->parent_privunit = current->self_privunit;
 	}
 
 	spin_lock(&current->sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc..3dbd25b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1566,6 +1566,8 @@ ret:
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
  *
+ * Must be called with tasklist_lock held for writing.
+ *
  * Returns true if our parent ignored us and so we've switched to
  * self-reaping.
  */
@@ -1590,7 +1592,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		 * This is only possible if parent == real_parent.
 		 * Check if it has changed security domain.
 		 */
-		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
+		if (!luid_eq(&tsk->parent_privunit,
+			     &tsk->parent->self_privunit))
 			sig = SIGCHLD;
 	}
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 3/8] proc: use open()-time creds for ptrace checks
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
  2016-09-23 20:40 ` [PATCH v2 1/8] exec: introduce cred_guard_light Jann Horn
  2016-09-23 20:40 ` [PATCH v2 2/8] exec: turn self_exec_id into self_privunit Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-23 20:40 ` [PATCH v2 4/8] futex: don't leak robust_list pointer Jann Horn
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

This adds a new ptrace_may_access_noncurrent() method that
uses the supplied credentials instead of those of the
current task. (However, the current task may still be
inspected for auditing purposes, e.g. by the Smack LSM.)

procfs used the caller's creds for a few ptrace_may_access()
checks at read() time, which made a confused deputy attack
by passing an FD to a procfs file to a setuid program
possible. Therefore, the following was a local userspace
ASLR bypass:

rm -f /tmp/foobar
procmail <(echo 'DEFAULT=/tmp/foobar') </proc/1/stat
(
  echo 'obase=16'
  cut -d' ' -f26-30,45-51 </tmp/foobar | tr ' ' '\n'
) | bc
rm /tmp/foobar

procmail is installed setuid root on Debian and read()s
data from stdin before dropping privs, so the
ptrace_may_access() check in the VFS read handler of
/proc/1/stat passes. Procmail then dumps the read data
to a user-accessible file (/tmp/foobar here).

changed in v2:
 - also handle PTRACE_MODE_ATTACH access correctly in
   SELinux (Stephen Smalley)
 - changed privunit-using code

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/proc/array.c                 |  3 +-
 fs/proc/base.c                  | 63 ++++++++++++++++++++++++++++++++++-------
 fs/proc/internal.h              | 14 +++++++++
 include/linux/lsm_hooks.h       |  3 +-
 include/linux/ptrace.h          |  5 ++++
 include/linux/security.h        | 10 ++++---
 kernel/ptrace.c                 | 41 ++++++++++++++++++++-------
 security/apparmor/include/ipc.h |  2 +-
 security/apparmor/ipc.c         |  4 +--
 security/apparmor/lsm.c         | 14 +++++++--
 security/commoncap.c            |  8 +++---
 security/security.c             |  5 ++--
 security/selinux/hooks.c        | 15 +++++-----
 security/smack/smack_lsm.c      | 18 ++++++++----
 security/yama/yama_lsm.c        |  9 ++++--
 15 files changed, 157 insertions(+), 57 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c7de1..3349742 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,7 +413,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
-	permitted = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT);
+	permitted = proc_ptrace_may_access_seq(task,
+			PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT, m);
 	mm = get_task_mm(task);
 	if (mm) {
 		vsize = task_vsize(mm);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ac0df4d..d6c98ab 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -430,8 +430,9 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 
 	wchan = get_wchan(task);
 
-	if (wchan && ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)
-			&& !lookup_symbol_name(wchan, symname))
+	if (wchan &&
+	    proc_ptrace_may_access_seq(task, PTRACE_MODE_READ_FSCREDS, m) &&
+	    !lookup_symbol_name(wchan, symname))
 		seq_printf(m, "%s", symname);
 	else
 		seq_putc(m, '0');
@@ -440,12 +441,12 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_KALLSYMS */
 
-static int lock_trace(struct task_struct *task)
+static int lock_trace(struct seq_file *m, struct task_struct *task)
 {
 	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
 	if (err)
 		return err;
-	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
+	if (!proc_ptrace_may_access_seq(task, PTRACE_MODE_ATTACH_FSCREDS, m)) {
 		mutex_unlock(&task->signal->cred_guard_mutex);
 		return -EPERM;
 	}
@@ -478,7 +479,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
 	trace.entries		= entries;
 	trace.skip		= 0;
 
-	err = lock_trace(task);
+	err = lock_trace(m, task);
 	if (!err) {
 		save_stack_trace_tsk(task, &trace);
 
@@ -660,7 +661,7 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long args[6], sp, pc;
 	int res;
 
-	res = lock_trace(task);
+	res = lock_trace(m, task);
 	if (res)
 		return res;
 
@@ -770,7 +771,7 @@ static const struct inode_operations proc_def_inode_operations = {
 
 static int proc_single_show(struct seq_file *m, void *v)
 {
-	struct inode *inode = m->private;
+	struct inode *inode = m->file->f_inode;
 	struct pid_namespace *ns;
 	struct pid *pid;
 	struct task_struct *task;
@@ -790,14 +791,35 @@ static int proc_single_show(struct seq_file *m, void *v)
 
 static int proc_single_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, proc_single_show, inode);
+	int ret;
+	struct luid *privunit = kmalloc(sizeof(struct luid), GFP_KERNEL);
+
+	if (privunit == NULL)
+		return -ENOMEM;
+	*privunit = current->self_privunit;
+
+	ret = single_open(filp, proc_single_show, privunit);
+
+	if (ret)
+		kfree(privunit);
+	return ret;
+}
+
+static int proc_single_release(struct inode *inode, struct file *filp)
+{
+	struct seq_file *seq = filp->private_data;
+	struct luid *privunit = seq->private;
+
+	kfree(privunit);
+
+	return single_release(inode, filp);
 }
 
 static const struct file_operations proc_single_file_operations = {
 	.open		= proc_single_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= proc_single_release,
 };
 
 
@@ -2157,10 +2179,29 @@ out:
 	return ret;
 }
 
+int proc_open_get_privunit(struct inode *inode, struct file *filp)
+{
+	struct luid *privunit = kmalloc(sizeof(struct luid), GFP_KERNEL);
+
+	if (privunit == NULL)
+		return -ENOMEM;
+	*privunit = current->self_privunit;
+	filp->private_data = privunit;
+	return 0;
+}
+
+int proc_release_put_privunit(struct inode *inode, struct file *filp)
+{
+	kfree((struct luid *)filp->private_data);
+	return 0;
+}
+
 static const struct file_operations proc_map_files_operations = {
 	.read		= generic_read_dir,
 	.iterate_shared	= proc_map_files_readdir,
 	.llseek		= generic_file_llseek,
+	.open		= proc_open_get_privunit,
+	.release	= proc_release_put_privunit,
 };
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
@@ -2614,7 +2655,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	if (result)
 		return result;
 
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
+	if (!proc_ptrace_may_access_seq(task, PTRACE_MODE_READ_FSCREDS, m)) {
 		result = -EACCES;
 		goto out_unlock;
 	}
@@ -2798,7 +2839,7 @@ static const struct file_operations proc_setgroups_operations = {
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
 {
-	int err = lock_trace(task);
+	int err = lock_trace(m, task);
 	if (!err) {
 		seq_printf(m, "%08x\n", task->personality);
 		unlock_trace(task);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 7931c55..64ac610 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -14,6 +14,9 @@
 #include <linux/spinlock.h>
 #include <linux/atomic.h>
 #include <linux/binfmts.h>
+#include <linux/ptrace.h>
+#include <linux/pid.h>
+#include <linux/seq_file.h>
 
 struct ctl_table_header;
 struct mempolicy;
@@ -134,6 +137,15 @@ out:
 	return ~0U;
 }
 
+static inline bool proc_ptrace_may_access_seq(struct task_struct *task,
+			unsigned int mode, struct seq_file *m)
+{
+	const struct luid *privunit = m->private;
+
+	return ptrace_may_access_noncurrent(task, mode, m->file->f_cred,
+					    privunit);
+}
+
 /*
  * Offset of the first process in the /proc root directory..
  */
@@ -168,6 +180,8 @@ extern int pid_delete_dentry(const struct dentry *);
 extern int proc_pid_readdir(struct file *, struct dir_context *);
 extern struct dentry *proc_pid_lookup(struct inode *, struct dentry *, unsigned int);
 extern loff_t mem_lseek(struct file *, loff_t, int);
+extern int proc_open_get_privunit(struct inode *, struct file *);
+extern int proc_release_put_privunit(struct inode *, struct file *);
 
 /* Lookups */
 typedef int instantiate_t(struct inode *, struct dentry *,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 101bf19..f33ef61 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1147,6 +1147,7 @@
  *	attributes would be changed by the execve.
  *	@child contains the task_struct structure for the target process.
  *	@mode contains the PTRACE_MODE flags indicating the form of access.
+ *	@cred contains the caller's credentials
  *	Return 0 if permission is granted.
  * @ptrace_traceme:
  *	Check that the @parent process has sufficient permission to trace the
@@ -1315,7 +1316,7 @@ union security_list_options {
 					struct file *file);
 
 	int (*ptrace_access_check)(struct task_struct *child,
-					unsigned int mode);
+			unsigned int mode, const struct cred *cred);
 	int (*ptrace_traceme)(struct task_struct *parent);
 	int (*capget)(struct task_struct *target, kernel_cap_t *effective,
 			kernel_cap_t *inheritable, kernel_cap_t *permitted);
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 504c98a..12b3d2c 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -66,6 +66,9 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
 #define PTRACE_MODE_ATTACH_FSCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
 #define PTRACE_MODE_ATTACH_REALCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)
 
+/* ignore relationship between current and target task */
+#define PTRACE_MODE_NON_CURRENT 0x20
+
 /**
  * ptrace_may_access - check whether the caller is permitted to access
  * a target task.
@@ -81,6 +84,8 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
  * process_vm_writev or ptrace (and should use the real credentials).
  */
 extern bool ptrace_may_access(struct task_struct *task, unsigned int mode);
+bool ptrace_may_access_noncurrent(struct task_struct *task, unsigned int mode,
+		const struct cred *cred, const struct luid *privunit);
 
 static inline int ptrace_reparented(struct task_struct *child)
 {
diff --git a/include/linux/security.h b/include/linux/security.h
index 7831cd5..e2cca1d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -72,7 +72,8 @@ struct timezone;
 extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
 		       int cap, int audit);
 extern int cap_settime(const struct timespec64 *ts, const struct timezone *tz);
-extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
+extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode,
+				   const struct cred *cred);
 extern int cap_ptrace_traceme(struct task_struct *parent);
 extern int cap_capget(struct task_struct *target, kernel_cap_t *effective, kernel_cap_t *inheritable, kernel_cap_t *permitted);
 extern int cap_capset(struct cred *new, const struct cred *old,
@@ -191,7 +192,8 @@ int security_binder_transfer_binder(struct task_struct *from,
 				    struct task_struct *to);
 int security_binder_transfer_file(struct task_struct *from,
 				  struct task_struct *to, struct file *file);
-int security_ptrace_access_check(struct task_struct *child, unsigned int mode);
+int security_ptrace_access_check(struct task_struct *child, unsigned int mode,
+				 const struct cred *cred);
 int security_ptrace_traceme(struct task_struct *parent);
 int security_capget(struct task_struct *target,
 		    kernel_cap_t *effective,
@@ -414,9 +416,9 @@ static inline int security_binder_transfer_file(struct task_struct *from,
 }
 
 static inline int security_ptrace_access_check(struct task_struct *child,
-					     unsigned int mode)
+			unsigned int mode, const struct cred *cred)
 {
-	return cap_ptrace_access_check(child, mode);
+	return cap_ptrace_access_check(child, mode, cred);
 }
 
 static inline int security_ptrace_traceme(struct task_struct *parent)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index b5120ec..29ace11 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -206,18 +206,21 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	return ret;
 }
 
-static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
+static int ptrace_has_cap(const struct cred *cred, struct user_namespace *ns,
+			  unsigned int mode)
 {
 	if (mode & PTRACE_MODE_NOAUDIT)
-		return has_ns_capability_noaudit(current, ns, CAP_SYS_PTRACE);
+		return security_capable_noaudit(cred, ns, CAP_SYS_PTRACE) == 0;
 	else
-		return has_ns_capability(current, ns, CAP_SYS_PTRACE);
+		return security_capable(cred, ns, CAP_SYS_PTRACE) == 0;
 }
 
 /* Returns 0 on success, -errno on denial. */
-static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
+static int __ptrace_may_access(struct task_struct *task, unsigned int mode,
+			       const struct cred *cred,
+			       const struct luid *privunit)
 {
-	const struct cred *cred = current_cred(), *tcred;
+	const struct cred *tcred;
 	int dumpable = 0;
 	kuid_t caller_uid;
 	kgid_t caller_gid;
@@ -237,8 +240,9 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	 */
 
 	/* Don't let security modules deny introspection */
-	if (same_thread_group(task, current))
+	if (luid_eq(privunit, &task->self_privunit))
 		return 0;
+
 	rcu_read_lock();
 	if (mode & PTRACE_MODE_FSCREDS) {
 		caller_uid = cred->fsuid;
@@ -263,7 +267,7 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	    gid_eq(caller_gid, tcred->sgid) &&
 	    gid_eq(caller_gid, tcred->gid))
 		goto ok;
-	if (ptrace_has_cap(tcred->user_ns, mode))
+	if (ptrace_has_cap(cred, tcred->user_ns, mode))
 		goto ok;
 	rcu_read_unlock();
 	return -EPERM;
@@ -274,13 +278,13 @@ ok:
 		dumpable = get_dumpable(task->mm);
 	rcu_read_lock();
 	if (dumpable != SUID_DUMP_USER &&
-	    !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
+	    !ptrace_has_cap(cred, __task_cred(task)->user_ns, mode)) {
 		rcu_read_unlock();
 		return -EPERM;
 	}
 	rcu_read_unlock();
 
-	return security_ptrace_access_check(task, mode);
+	return security_ptrace_access_check(task, mode, cred);
 }
 
 /*
@@ -297,7 +301,21 @@ bool ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	int err;
 	task_lock(task);
-	err = __ptrace_may_access(task, mode);
+	err = __ptrace_may_access(task, mode, current_cred(),
+				  &current->self_privunit);
+	task_unlock(task);
+	return !err;
+}
+
+bool ptrace_may_access_noncurrent(struct task_struct *task, unsigned int mode,
+				  const struct cred *cred,
+				  const struct luid *privunit)
+{
+	int err;
+
+	task_lock(task);
+	err = __ptrace_may_access(task, mode | PTRACE_MODE_NON_CURRENT, cred,
+				  privunit);
 	task_unlock(task);
 	return !err;
 }
@@ -338,7 +356,8 @@ static int ptrace_attach(struct task_struct *task, long request,
 		goto out;
 
 	task_lock(task);
-	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS,
+	    current_cred(), &current->self_privunit);
 	task_unlock(task);
 	if (retval)
 		goto unlock_creds;
diff --git a/security/apparmor/include/ipc.h b/security/apparmor/include/ipc.h
index 288ca76..eb685c1 100644
--- a/security/apparmor/include/ipc.h
+++ b/security/apparmor/include/ipc.h
@@ -22,7 +22,7 @@ struct aa_profile;
 int aa_may_ptrace(struct aa_profile *tracer, struct aa_profile *tracee,
 		  unsigned int mode);
 
-int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
+int aa_ptrace(struct aa_profile *tracer_p, struct task_struct *tracee,
 	      unsigned int mode);
 
 #endif /* __AA_IPC_H */
diff --git a/security/apparmor/ipc.c b/security/apparmor/ipc.c
index 777ac1c..c60b374 100644
--- a/security/apparmor/ipc.c
+++ b/security/apparmor/ipc.c
@@ -82,7 +82,7 @@ int aa_may_ptrace(struct aa_profile *tracer, struct aa_profile *tracee,
  *
  * Returns: %0 else error code if permission denied or error
  */
-int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
+int aa_ptrace(struct aa_profile *tracer_p, struct task_struct *tracee,
 	      unsigned int mode)
 {
 	/*
@@ -94,7 +94,6 @@ int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
 	 *       - tracer profile has CAP_SYS_PTRACE
 	 */
 
-	struct aa_profile *tracer_p = aa_get_task_profile(tracer);
 	int error = 0;
 
 	if (!unconfined(tracer_p)) {
@@ -105,7 +104,6 @@ int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
 
 		aa_put_profile(tracee_p);
 	}
-	aa_put_profile(tracer_p);
 
 	return error;
 }
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 41b8cb1..dd94df2 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -94,14 +94,22 @@ static void apparmor_cred_transfer(struct cred *new, const struct cred *old)
 }
 
 static int apparmor_ptrace_access_check(struct task_struct *child,
-					unsigned int mode)
+				 unsigned int mode, const struct cred *cred)
 {
-	return aa_ptrace(current, child, mode);
+	struct aa_profile *tracer_p = aa_get_profile(aa_cred_profile(cred));
+	int res = aa_ptrace(tracer_p, child, mode);
+
+	aa_put_profile(tracer_p);
+	return res;
 }
 
 static int apparmor_ptrace_traceme(struct task_struct *parent)
 {
-	return aa_ptrace(parent, current, PTRACE_MODE_ATTACH);
+	struct aa_profile *tracer_p = aa_get_task_profile(parent);
+	int res = aa_ptrace(tracer_p, current, PTRACE_MODE_ATTACH);
+
+	aa_put_profile(tracer_p);
+	return res;
 }
 
 /* Derived from security/commoncap.c:cap_capget */
diff --git a/security/commoncap.c b/security/commoncap.c
index 14540bd..fe0a73a 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -133,14 +133,14 @@ int cap_settime(const struct timespec64 *ts, const struct timezone *tz)
  * Determine whether a process may access another, returning 0 if permission
  * granted, -ve if denied.
  */
-int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
+int cap_ptrace_access_check(struct task_struct *child, unsigned int mode,
+			    const struct cred *cred)
 {
 	int ret = 0;
-	const struct cred *cred, *child_cred;
+	const struct cred *child_cred;
 	const kernel_cap_t *caller_caps;
 
 	rcu_read_lock();
-	cred = current_cred();
 	child_cred = __task_cred(child);
 	if (mode & PTRACE_MODE_FSCREDS)
 		caller_caps = &cred->cap_effective;
@@ -149,7 +149,7 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 	if (cred->user_ns == child_cred->user_ns &&
 	    cap_issubset(child_cred->cap_permitted, *caller_caps))
 		goto out;
-	if (ns_capable(child_cred->user_ns, CAP_SYS_PTRACE))
+	if (security_capable(cred, child_cred->user_ns, CAP_SYS_PTRACE) == 0)
 		goto out;
 	ret = -EPERM;
 out:
diff --git a/security/security.c b/security/security.c
index 4838e7f..70c021a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -154,9 +154,10 @@ int security_binder_transfer_file(struct task_struct *from,
 	return call_int_hook(binder_transfer_file, 0, from, to, file);
 }
 
-int security_ptrace_access_check(struct task_struct *child, unsigned int mode)
+int security_ptrace_access_check(struct task_struct *child, unsigned int mode,
+				 const struct cred *cred)
 {
-	return call_int_hook(ptrace_access_check, 0, child, mode);
+	return call_int_hook(ptrace_access_check, 0, child, mode, cred);
 }
 
 int security_ptrace_traceme(struct task_struct *parent)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 13185a6..4c3c4be 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2133,15 +2133,16 @@ static int selinux_binder_transfer_file(struct task_struct *from,
 }
 
 static int selinux_ptrace_access_check(struct task_struct *child,
-				     unsigned int mode)
+				     unsigned int mode, const struct cred *cred)
 {
-	if (mode & PTRACE_MODE_READ) {
-		u32 sid = current_sid();
-		u32 csid = task_sid(child);
-		return avc_has_perm(sid, csid, SECCLASS_FILE, FILE__READ, NULL);
-	}
+	u32 sid = cred_sid(cred);
+	u32 csid = task_sid(child);
 
-	return current_has_perm(child, PROCESS__PTRACE);
+	if (mode & PTRACE_MODE_READ)
+		return avc_has_perm(sid, csid, SECCLASS_FILE, FILE__READ, NULL);
+	else
+		return avc_has_perm(sid, csid, SECCLASS_PROCESS,
+				    PROCESS__PTRACE, NULL);
 }
 
 static int selinux_ptrace_traceme(struct task_struct *parent)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 87a9741..7616742 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -408,7 +408,8 @@ static inline unsigned int smk_ptrace_mode(unsigned int mode)
 
 /**
  * smk_ptrace_rule_check - helper for ptrace access
- * @tracer: tracer process
+ * @tracer: tracer process - only used for auditing if @tracer_cred is set
+ * @tracer_cred: creds for checks, may be different from @tracer's creds
  * @tracee_known: label entry of the process that's about to be traced
  * @mode: ptrace attachment mode (PTRACE_MODE_*)
  * @func: name of the function that called us, used for audit
@@ -416,6 +417,7 @@ static inline unsigned int smk_ptrace_mode(unsigned int mode)
  * Returns 0 on access granted, -error on error
  */
 static int smk_ptrace_rule_check(struct task_struct *tracer,
+				 const struct cred *tracer_cred,
 				 struct smack_known *tracee_known,
 				 unsigned int mode, const char *func)
 {
@@ -431,7 +433,9 @@ static int smk_ptrace_rule_check(struct task_struct *tracer,
 	}
 
 	rcu_read_lock();
-	tsp = __task_cred(tracer)->security;
+	if (tracer_cred == NULL)
+		tracer_cred = __task_cred(tracer);
+	tsp = tracer_cred->security;
 	tracer_known = smk_of_task(tsp);
 
 	if ((mode & PTRACE_MODE_ATTACH) &&
@@ -476,13 +480,14 @@ static int smk_ptrace_rule_check(struct task_struct *tracer,
  *
  * Do the capability checks.
  */
-static int smack_ptrace_access_check(struct task_struct *ctp, unsigned int mode)
+static int smack_ptrace_access_check(struct task_struct *ctp, unsigned int mode,
+				     const struct cred *cred)
 {
 	struct smack_known *skp;
 
 	skp = smk_of_task_struct(ctp);
 
-	return smk_ptrace_rule_check(current, skp, mode, __func__);
+	return smk_ptrace_rule_check(current, cred, skp, mode, __func__);
 }
 
 /**
@@ -500,7 +505,8 @@ static int smack_ptrace_traceme(struct task_struct *ptp)
 
 	skp = smk_of_task(current_security());
 
-	rc = smk_ptrace_rule_check(ptp, skp, PTRACE_MODE_ATTACH, __func__);
+	rc = smk_ptrace_rule_check(ptp, NULL, skp, PTRACE_MODE_ATTACH,
+				   __func__);
 	return rc;
 }
 
@@ -941,7 +947,7 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
 		rcu_read_lock();
 		tracer = ptrace_parent(current);
 		if (likely(tracer != NULL))
-			rc = smk_ptrace_rule_check(tracer,
+			rc = smk_ptrace_rule_check(tracer, NULL,
 						   isp->smk_task,
 						   PTRACE_MODE_ATTACH,
 						   __func__);
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 0309f21..32a2962 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -347,7 +347,7 @@ static int ptracer_exception_found(struct task_struct *tracer,
  * Returns 0 if following the ptrace is allowed, -ve on error.
  */
 static int yama_ptrace_access_check(struct task_struct *child,
-				    unsigned int mode)
+				    unsigned int mode, const struct cred *cred)
 {
 	int rc = 0;
 
@@ -359,7 +359,9 @@ static int yama_ptrace_access_check(struct task_struct *child,
 			break;
 		case YAMA_SCOPE_RELATIONAL:
 			rcu_read_lock();
-			if (!task_is_descendant(current, child) &&
+			/* fail open for some procfs-based ATTACH accesses! */
+			if ((mode & PTRACE_MODE_NON_CURRENT) == 0 &&
+			    !task_is_descendant(current, child) &&
 			    !ptracer_exception_found(current, child) &&
 			    !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
 				rc = -EPERM;
@@ -367,7 +369,8 @@ static int yama_ptrace_access_check(struct task_struct *child,
 			break;
 		case YAMA_SCOPE_CAPABILITY:
 			rcu_read_lock();
-			if (!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
+			if (security_capable(cred, __task_cred(child)->user_ns,
+					     CAP_SYS_PTRACE) != 0)
 				rc = -EPERM;
 			rcu_read_unlock();
 			break;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 4/8] futex: don't leak robust_list pointer
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (2 preceding siblings ...)
  2016-09-23 20:40 ` [PATCH v2 3/8] proc: use open()-time creds for ptrace checks Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-30 14:52   ` Oleg Nesterov
  2016-09-23 20:40 ` [PATCH v2 5/8] proc: lock properly in ptrace_may_access callers Jann Horn
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

This prevents an attacker from determining the robust_list or
compat_robust_list userspace pointer of a process created by executing
a setuid binary. Such an attack could be performed by racing
get_robust_list() with a setuid execution. The impact of this issue is that
an attacker could theoretically bypass ASLR when attacking setuid binaries.

changed in v2:
 - only get_task_struct(p) if p!=NULL (Ben Hutchings)
 - move the -ESRCH return check

Signed-off-by: Jann Horn <jann@thejh.net>
---
 kernel/futex.c        | 30 +++++++++++++++++++++---------
 kernel/futex_compat.c | 30 +++++++++++++++++++++---------
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 46cb3a3..773806e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3007,31 +3007,43 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
 
-	rcu_read_lock();
-
-	ret = -ESRCH;
-	if (!pid)
+	if (!pid) {
 		p = current;
-	else {
+		get_task_struct(p);
+	} else {
+		rcu_read_lock();
 		p = find_task_by_vpid(pid);
+		/* pin the task to permit dropping the RCU read lock before
+		 * acquiring the mutex
+		 */
+		if (p)
+			get_task_struct(p);
+		rcu_read_unlock();
 		if (!p)
-			goto err_unlock;
+			return -ESRCH;
 	}
 
+	ret = mutex_lock_killable(&p->signal->cred_guard_light);
+	if (ret)
+		goto err_put;
+
 	ret = -EPERM;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
 		goto err_unlock;
 
 	head = p->robust_list;
-	rcu_read_unlock();
+
+	mutex_unlock(&p->signal->cred_guard_light);
+	put_task_struct(p);
 
 	if (put_user(sizeof(*head), len_ptr))
 		return -EFAULT;
 	return put_user(head, head_ptr);
 
 err_unlock:
-	rcu_read_unlock();
-
+	mutex_unlock(&current->signal->cred_guard_light);
+err_put:
+	put_task_struct(p);
 	return ret;
 }
 
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 4ae3232..92c350f 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -143,31 +143,43 @@ COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid,
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
 
-	rcu_read_lock();
-
-	ret = -ESRCH;
-	if (!pid)
+	if (!pid) {
 		p = current;
-	else {
+		get_task_struct(p);
+	} else {
+		rcu_read_lock();
 		p = find_task_by_vpid(pid);
+		/* pin the task to permit dropping the RCU read lock before
+		 * acquiring the mutex
+		 */
+		if (p)
+			get_task_struct(p);
+		rcu_read_unlock();
 		if (!p)
-			goto err_unlock;
+			return -ESRCH;
 	}
 
+	ret = mutex_lock_killable(&p->signal->cred_guard_light);
+	if (ret)
+		goto err_put;
+
 	ret = -EPERM;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
 		goto err_unlock;
 
 	head = p->compat_robust_list;
-	rcu_read_unlock();
+
+	mutex_unlock(&p->signal->cred_guard_light);
+	put_task_struct(p);
 
 	if (put_user(sizeof(*head), len_ptr))
 		return -EFAULT;
 	return put_user(ptr_to_compat(head), head_ptr);
 
 err_unlock:
-	rcu_read_unlock();
-
+	mutex_unlock(&current->signal->cred_guard_light);
+err_put:
+	put_task_struct(p);
 	return ret;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 5/8] proc: lock properly in ptrace_may_access callers
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (3 preceding siblings ...)
  2016-09-23 20:40 ` [PATCH v2 4/8] futex: don't leak robust_list pointer Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-23 20:40 ` [PATCH v2 6/8] ptrace: warn on ptrace_may_access without proper locking Jann Horn
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

Use the new cred_guard_light to prevent information leaks
through races in procfs.

changed in v2:
 - also use mm_access() for proc_map_files_readdir() (0day test robot)

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/proc/array.c      |   7 +++
 fs/proc/base.c       | 134 ++++++++++++++++++++++++++++++---------------------
 fs/proc/namespaces.c |  14 ++++++
 3 files changed, 99 insertions(+), 56 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 3349742..c28f254 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -410,9 +410,15 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
+	int err;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
+
+	err = mutex_lock_killable(&task->signal->cred_guard_light);
+	if (err)
+		return err;
+
 	permitted = proc_ptrace_may_access_seq(task,
 			PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT, m);
 	mm = get_task_mm(task);
@@ -568,6 +574,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
+	mutex_unlock(&task->signal->cred_guard_light);
 	return 0;
 }
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d6c98ab..15845cf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -135,6 +135,25 @@ struct pid_entry {
 		NULL, &proc_single_file_operations,	\
 		{ .proc_show = show } )
 
+static int lock_trace(struct seq_file *m, struct task_struct *task,
+		      unsigned int mode)
+{
+	int err = mutex_lock_killable(&task->signal->cred_guard_light);
+
+	if (err)
+		return err;
+	if (!proc_ptrace_may_access_seq(task, mode | PTRACE_MODE_FSCREDS, m)) {
+		mutex_unlock(&task->signal->cred_guard_light);
+		return -EPERM;
+	}
+	return 0;
+}
+
+static void unlock_trace(struct task_struct *task)
+{
+	mutex_unlock(&task->signal->cred_guard_light);
+}
+
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
  * and .. links.
@@ -428,36 +447,20 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long wchan;
 	char symname[KSYM_NAME_LEN];
 
-	wchan = get_wchan(task);
-
-	if (wchan &&
-	    proc_ptrace_may_access_seq(task, PTRACE_MODE_READ_FSCREDS, m) &&
-	    !lookup_symbol_name(wchan, symname))
-		seq_printf(m, "%s", symname);
-	else
-		seq_putc(m, '0');
+	if (lock_trace(m, task, PTRACE_MODE_READ) == 0) {
+		wchan = get_wchan(task);
+		unlock_trace(task);
+		if (wchan && !lookup_symbol_name(wchan, symname)) {
+			seq_printf(m, "%s", symname);
+			return 0;
+		}
+	}
 
+	seq_putc(m, '0');
 	return 0;
 }
 #endif /* CONFIG_KALLSYMS */
 
-static int lock_trace(struct seq_file *m, struct task_struct *task)
-{
-	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
-	if (err)
-		return err;
-	if (!proc_ptrace_may_access_seq(task, PTRACE_MODE_ATTACH_FSCREDS, m)) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
-		return -EPERM;
-	}
-	return 0;
-}
-
-static void unlock_trace(struct task_struct *task)
-{
-	mutex_unlock(&task->signal->cred_guard_mutex);
-}
-
 #ifdef CONFIG_STACKTRACE
 
 #define MAX_STACK_TRACE_DEPTH	64
@@ -479,7 +482,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
 	trace.entries		= entries;
 	trace.skip		= 0;
 
-	err = lock_trace(m, task);
+	err = lock_trace(m, task, PTRACE_MODE_ATTACH);
 	if (!err) {
 		save_stack_trace_tsk(task, &trace);
 
@@ -661,7 +664,7 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long args[6], sp, pc;
 	int res;
 
-	res = lock_trace(m, task);
+	res = lock_trace(m, task, PTRACE_MODE_ATTACH);
 	if (res)
 		return res;
 
@@ -685,23 +688,38 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 /*                       Here the fs part begins                        */
 /************************************************************************/
 
-/* permission checks */
-static int proc_fd_access_allowed(struct inode *inode)
+/* permission checks.
+ * If this returns 1, you'll have to unlock_trace(*taskp) afterwards.
+ */
+static int proc_fd_access_allowed_lock(struct inode *inode,
+					 struct task_struct **taskp)
 {
 	struct task_struct *task;
 	int allowed = 0;
-	/* Allow access to a task's file descriptors if it is us or we
-	 * may use ptrace attach to the process and find out that
-	 * information.
-	 */
+
 	task = get_proc_task(inode);
-	if (task) {
-		allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (!task)
+		return 0;
+	if (mutex_lock_killable(&task->signal->cred_guard_light))
+		goto out_put;
+	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (!allowed)
+		mutex_unlock(&task->signal->cred_guard_light);
+out_put:
+	if (allowed)
+		*taskp = task;
+	else
 		put_task_struct(task);
-	}
+
 	return allowed;
 }
 
+static void proc_fd_access_allowed_unlock(struct task_struct *task)
+{
+	mutex_unlock(&task->signal->cred_guard_light);
+	put_task_struct(task);
+}
+
 int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
@@ -722,6 +740,7 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 /*
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
+ * NOTE: When you call this, you must hold cred_guard_mutex or cred_guard_light.
  */
 static bool has_pid_permissions(struct pid_namespace *pid,
 				 struct task_struct *task,
@@ -1600,15 +1619,17 @@ static const char *proc_pid_get_link(struct dentry *dentry,
 {
 	struct path path;
 	int error = -EACCES;
+	struct task_struct *task;
 
 	if (!dentry)
 		return ERR_PTR(-ECHILD);
 
 	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
+	if (!proc_fd_access_allowed_lock(inode, &task))
 		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
+	proc_fd_access_allowed_unlock(task);
 	if (error)
 		goto out;
 
@@ -1647,12 +1668,14 @@ static int proc_pid_readlink(struct dentry * dentry, char __user * buffer, int b
 	int error = -EACCES;
 	struct inode *inode = d_inode(dentry);
 	struct path path;
+	struct task_struct *task;
 
 	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
+	if (!proc_fd_access_allowed_lock(inode, &task))
 		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
+	proc_fd_access_allowed_unlock(task);
 	if (error)
 		goto out;
 
@@ -2047,17 +2070,15 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	if (!task)
 		goto out;
 
-	result = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR(mm))
+		result = PTR_ERR(mm);
+	if (IS_ERR_OR_NULL(mm))
 		goto out_put_task;
 
 	result = -ENOENT;
 	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
-		goto out_put_task;
-
-	mm = get_task_mm(task);
-	if (!mm)
-		goto out_put_task;
+		goto out_put_mm;
 
 	down_read(&mm->mmap_sem);
 	vma = find_exact_vma(mm, vm_start, vm_end);
@@ -2070,6 +2091,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 
 out_no_vma:
 	up_read(&mm->mmap_sem);
+out_put_mm:
 	mmput(mm);
 out_put_task:
 	put_task_struct(task);
@@ -2100,17 +2122,17 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	if (!task)
 		goto out;
 
-	ret = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+	ret = 0;
+
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR(mm))
+		ret = PTR_ERR(mm);
+	if (IS_ERR_OR_NULL(mm))
 		goto out_put_task;
 
-	ret = 0;
 	if (!dir_emit_dots(file, ctx))
-		goto out_put_task;
+		goto out_mmput;
 
-	mm = get_task_mm(task);
-	if (!mm)
-		goto out_put_task;
 	down_read(&mm->mmap_sem);
 
 	nr_files = 0;
@@ -2139,8 +2161,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 			if (fa)
 				flex_array_free(fa);
 			up_read(&mm->mmap_sem);
-			mmput(mm);
-			goto out_put_task;
+			goto out_mmput;
 		}
 		for (i = 0, vma = mm->mmap, pos = 2; vma;
 				vma = vma->vm_next) {
@@ -2171,8 +2192,9 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	}
 	if (fa)
 		flex_array_free(fa);
-	mmput(mm);
 
+out_mmput:
+	mmput(mm);
 out_put_task:
 	put_task_struct(task);
 out:
@@ -2839,7 +2861,7 @@ static const struct file_operations proc_setgroups_operations = {
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
 {
-	int err = lock_trace(m, task);
+	int err = lock_trace(m, task, PTRACE_MODE_ATTACH);
 	if (!err) {
 		seq_printf(m, "%08x\n", task->personality);
 		unlock_trace(task);
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 51b8b0a..e1246e8 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -49,11 +49,18 @@ static const char *proc_ns_get_link(struct dentry *dentry,
 	if (!task)
 		return error;
 
+	error = ERR_PTR(mutex_lock_killable(&task->signal->cred_guard_light));
+	if (error)
+		goto out_put_task;
+
+	error = ERR_PTR(-EACCES);
 	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
 		error = ns_get_path(&ns_path, task, ns_ops);
 		if (!error)
 			nd_jump_link(&ns_path);
 	}
+	mutex_unlock(&task->signal->cred_guard_light);
+out_put_task:
 	put_task_struct(task);
 	return error;
 }
@@ -70,11 +77,18 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
 	if (!task)
 		return res;
 
+	res = mutex_lock_killable(&task->signal->cred_guard_light);
+	if (res)
+		goto out_put_task;
+
+	res = -EACCES;
 	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
 		res = ns_get_name(name, sizeof(name), task, ns_ops);
 		if (res >= 0)
 			res = readlink_copy(buffer, buflen, name);
 	}
+	mutex_unlock(&task->signal->cred_guard_light);
+out_put_task:
 	put_task_struct(task);
 	return res;
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 6/8] ptrace: warn on ptrace_may_access without proper locking
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (4 preceding siblings ...)
  2016-09-23 20:40 ` [PATCH v2 5/8] proc: lock properly in ptrace_may_access callers Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-23 20:40 ` [PATCH v2 7/8] fs/proc: fix attr access check Jann Horn
  2016-09-23 20:40 ` [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
  7 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

If neither of those locks is taken during a ptrace_may_access() call, that
is a strong indicator for a security bug where post-execve process
properties can be leaked by racing with execve().

Signed-off-by: Jann Horn <jann@thejh.net>
---
 kernel/ptrace.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 29ace11..17c2245 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -225,6 +225,9 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode,
 	kuid_t caller_uid;
 	kgid_t caller_gid;
 
+	WARN_ON(!mutex_is_locked(&task->signal->cred_guard_mutex) &&
+		!mutex_is_locked(&task->signal->cred_guard_light));
+
 	if (!(mode & PTRACE_MODE_FSCREDS) == !(mode & PTRACE_MODE_REALCREDS)) {
 		WARN(1, "denying ptrace access check without PTRACE_MODE_*CREDS\n");
 		return -EPERM;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 7/8] fs/proc: fix attr access check
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (5 preceding siblings ...)
  2016-09-23 20:40 ` [PATCH v2 6/8] ptrace: warn on ptrace_may_access without proper locking Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-09-23 20:40 ` [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
  7 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

Make sure files in /proc/$pid/attr/ can only be written by the task that
opened them.

This prevents an attacking process from changing the security context of
another process that it can force to write attacker-controlled data into an
attacker-supplied file descriptor. I'm not sure what the impact of this is.

changed in v2:
 - changed privunit-using code

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/proc/base.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 15845cf..27f369d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2484,6 +2484,18 @@ out:
 }
 
 #ifdef CONFIG_SECURITY
+static int proc_pid_attr_open(struct inode *inode, struct file *file)
+{
+	struct luid *opener_privunit;
+
+	opener_privunit = kmalloc(sizeof(struct luid), GFP_KERNEL);
+	if (opener_privunit == NULL)
+		return -ENOMEM;
+	*opener_privunit = current->self_privunit;
+	file->private_data = opener_privunit;
+	return 0;
+}
+
 static ssize_t proc_pid_attr_read(struct file * file, char __user * buf,
 				  size_t count, loff_t *ppos)
 {
@@ -2512,6 +2524,7 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	void *page;
 	ssize_t length;
 	struct task_struct *task = get_proc_task(inode);
+	struct luid *opener_privunit = file->private_data;
 
 	length = -ESRCH;
 	if (!task)
@@ -2535,9 +2548,29 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (length < 0)
 		goto out_free;
 
+	/*
+	 * Ensure that a process can't be tricked into writing into its own attr
+	 * files without intending to do so.
+	 *
+	 * SELinux has a rule that prevents anyone other than `task` from
+	 * writing, but if the fd stays open across execve, or is sent across a
+	 * unix domain socket or whatever, that is bypassable.
+	 * Same thing in AppArmor and in Smack.
+	 *
+	 * To prevent this, compare the opener's exec_id with the target's to
+	 * ensure that they're in the same task group and no exec happened in
+	 * the meantime.
+	 *
+	 * Why is this a file and not a prctl or whatever. :/
+	 */
+	length = -EACCES;
+	if (!luid_eq(opener_privunit, &task->self_privunit))
+		goto out_unlock;
+
 	length = security_setprocattr(task,
 				      (char*)file->f_path.dentry->d_name.name,
 				      page, count);
+out_unlock:
 	mutex_unlock(&task->signal->cred_guard_mutex);
 out_free:
 	kfree(page);
@@ -2547,10 +2580,20 @@ out_no_task:
 	return length;
 }
 
+static int proc_pid_attr_release(struct inode *inode, struct file *file)
+{
+	struct luid *opener_privunit = file->private_data;
+
+	kfree(opener_privunit);
+	return 0;
+}
+
 static const struct file_operations proc_pid_attr_operations = {
+	.open		= proc_pid_attr_open,
 	.read		= proc_pid_attr_read,
 	.write		= proc_pid_attr_write,
 	.llseek		= generic_file_llseek,
+	.release	= proc_pid_attr_release,
 };
 
 static const struct pid_entry attr_dir_stuff[] = {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt
  2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (6 preceding siblings ...)
  2016-09-23 20:40 ` [PATCH v2 7/8] fs/proc: fix attr access check Jann Horn
@ 2016-09-23 20:40 ` Jann Horn
  2016-10-02  3:16   ` Krister Johansen
  7 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-09-23 20:40 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds
  Cc: linux-fsdevel, linux-security-module, security

This includes some things suggested by Ben Hutchings and Kees Cook.

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
---
 Documentation/security/ptrace_checks.txt | 229 +++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)
 create mode 100644 Documentation/security/ptrace_checks.txt

diff --git a/Documentation/security/ptrace_checks.txt b/Documentation/security/ptrace_checks.txt
new file mode 100644
index 0000000..98689d5
--- /dev/null
+++ b/Documentation/security/ptrace_checks.txt
@@ -0,0 +1,229 @@
+			     ====================
+			     PTRACE ACCESS CHECKS
+			     ====================
+
+By: Jann Horn <jann@thejh.net>
+
+=====
+INTRO
+=====
+
+This file describes the rules for ptrace() and ptrace()-like access to another
+task (inspecting or modifying the virtual memory of another process, inspecting
+another task's register state, inspecting another task's open files and so on).
+
+This kind of access is particularly security-sensitive because it often allows
+the caller to take complete control over the remote process. Therefore, the
+kernel must be very careful here.
+
+There are various kernel APIs that grant debugging access to other processes.
+
+
+===========================
+ORDINARY DEBUGGING SYSCALLS
+===========================
+
+The following ordinary syscalls grant debugging-like access:
+
+ - ptrace() grants R+W access to registers and virtual memory
+ - process_vm_readv() / process_vm_writev() grant R/W access to virtual memory
+ - get_robust_list() reveals an address from the target task, which would be
+   useful e.g. for an ASLR bypass
+ - perf_event_open()
+ - (kcmp() reveals a very small amount of information)
+
+These syscalls use the caller's *real* UID/GID and his *permitted* capability
+set (for historical reasons) to determine the permitted access using
+ptrace_may_access(..., PTRACE_MODE_REALCREDS | ...). This function is
+responsible for ensuring that the caller can't steal any privileges using the
+debugging access; in other words, it has to ensure that the caller's credentials
+are (more or less) a superset of the target's credentials. Namespaces aside,
+these rules work as follows:
+
+Introspection, including cross-thread introspection, is always allowed. This
+means that it is unsafe to use any of the debugging APIs on behalf of another,
+untrusted process unless it is verified somehow that the target pid does not
+actually belong to the calling process - dropping privileges is not sufficient.
+
+root (to be more precise, a process with CAP_SYS_PTRACE effective) can always
+use these debugging APIs on any process - unless a LSM denies the access.
+
+For normal users, there are more checks:
+
+ - The real uid of the caller must be equal to ruid, euid and suid of the target
+   task, same thing for the gids. All of the uids and gids of the target (apart
+   from the fs*id) are checked here because a privileged process might stash his
+   privileged uid into any one of them. (checked in __ptrace_may_access())
+   (Note: This doesn't cover the supplementary GIDs.)
+ - The permitted capabilities of the caller must be a superset of the permitted
+   capabilities of the target. (checked in cap_ptrace_access_check())
+ - The target process must be dumpable. What this means is explained in the next
+   section.
+
+
+===========
+DUMPABILITY
+===========
+
+Originally, the dumpability rule didn't exist - as long as a process had ruid,
+euid and suid set to the uid of a user (same thing with the gids), that user was
+able to gain debugging access to the process. This was problematic: It meant
+that if a setuid binary opened some important file, then dropped all privileges,
+the user was able to debug the process and steal its file descriptor - which is
+also a form of privilege! Other issues are e.g. that when a process drops its
+privileges, it might still have secrets in memory that the user isn't meant to
+see.
+
+Therefore, a process now becomes "nondumpable" when it:
+
+ - changes its euid / egid
+ - changes its fsuid / fsgid
+ - changes its permitted capability set
+
+These checks are in commit_creds().
+
+When a process is nondumpable, you need to have the CAP_SYS_PTRACE capability to
+access it - just having the same uids is not enough anymore.
+
+The most interesting part of dumpability is what happens on execve().
+(Non-)Dumpability is not inherited; instead, it is recalculated in
+setup_new_exec() and commit_creds(): A process will be nondumpable after
+execve() if its post-setuid-bit-handling euid and ruid or egid and rgid aren't
+the same. Additionally, it will be nondumpable if execve() changes its
+credentials in a way that commit_creds() triggers on (as described above).
+
+In effect, this means that normally, when a setuid/setgid binary executes
+another program, the other program is still protected by nondumpability (and
+also by secure execution).
+
+(Additionally, a process will be nondumpable if the euid doesn't have read
+access to the executed binary, which can be the case if the binary is
+executable, but not readable; however, this restriction is irrelevant from a
+security perspective because it isn't enforced consistently, neither through
+AT_SECURE nor for bprm->unsafe. In other words, you can e.g. be attached to such
+a binary via ptrace while it is executed, or you can LD_PRELOAD a library into
+it that dumps its memory. If this was fixed, it would be important to keep
+namespacing issues in mind, in particular the interaction with
+LSM_UNSAFE_PTRACE_CAP, the privileged kind of ptrace that works across setuid
+execve.)
+
+Dumpability and the secure exec flag can be affected by the following "unsafe
+execution" rules. When a process with ruid != euid executes another program and
+an unsafe execution is detected, *ITS EUID WILL BE DROPPED TO ITS RUID* by
+cap_bprm_set_creds() (unless it has CAP_SETUID effective in the init user ns).
+Unsafe execution means one of the following:
+
+ - LSM_UNSAFE_PTRACE: An unprivileged process is currently attached via ptrace.
+ - LSM_UNSAFE_NO_NEW_PRIVS: The current thread turned on NO_NEW_PRIVS.
+ - LSM_UNSAFE_SHARE: The fs struct is shared with a non-thread.
+
+
+=====================
+FILESYSTEM DEBUG APIS
+=====================
+
+The pid / tgid entries in procfs contain various entries that allow debugging
+access to a process. Interesting entries are:
+
+ - auxv permits an ASLR bypass
+ - cwd can permit bypassing filesystem restrictions in some cases
+ - environ can leak secret tokens
+ - fd can permit bypassing filesystem restrictions or leak access to things like
+   pipes
+ - maps permits an ASLR bypass
+ - mem grants R+W access to process memory
+ - stat permits an ASLR bypass
+
+Of these, all use both a normal filesystem DAC check (where the file owner is
+the process owner for a dumpable process, root for a nondumpable process) and a
+ptrace_may_access() check; however, the DAC check may be modified, and the
+ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
+the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
+effective capabilities are used, causing the case where a daemon drops its euid
+prior to accessing a file for the user to be treated correctly for this check.
+
+NEEDS ATTENTION:
+Special-case rules that permit introspection become a big problem here. In
+particular, if a setuid binary runs with dropped privileges (ruid=euid=user and
+suid=0 or so), it is still able to open the following entries under /proc/self:
+
+ - cwd (because normal DAC rules always permit symlink access, proc_pid_get_link
+   only checks for ptrace access and proc_cwd_link doesn't check anything)
+ - fd (same as cwd for the entries inside; the directory itself has an override
+   on the permission handler that makes introspection possible even if the
+   normal DAC rules would forbid it)
+ - maps (mode 0444, so DAC doesn't restrict anything)
+ - stat (mode 0444, and anyone can read from it; however, the interesting
+     pointers are only printed if the file opener has ptrace access to the
+     target process)
+
+In particular the fd directory is dangerous: Imagine a privileged process that
+creates some important pipe using pipe() with a dropped EUID and later performs
+writes with user-specified data on a file at a user-specified path with the same
+dropped EUID. An attacker could let the root-owned process reference the pipe as
+/proc/$pid/fd/3 and let the privileged process write to it.
+
+If necessary, it might make sense to introduce some "enable/disable
+introspection" prctl that can be used by userspace to disambiguate whether an
+access is intended to be introspection.
+
+
+Note that in all cases in which the DAC check permits access but a ptrace access
+check later on (in the VFS open handler or afterwards) restricts access,
+faccessat() is broken. However, that's probably not a real problem because
+faccessat() is racy anyway.
+
+
+===============
+USER NAMESPACES
+===============
+
+When ptrace_may_access() checks for CAP_SYS_PTRACE, the capability doesn't need
+to be active in the init namespace or in the current namespace; having it in the
+namespace of the ptrace target is sufficient. There are currently no further
+restrictions on this, meaning that the behavior of a task that creates or enters
+a user namespace is somewhat unintuitive:
+
+ - If a nondumpable task enters a user namespace, it will still be
+   nondumpable - but because the owner of the namespace is capable relative to
+   it when viewing the namespace from outside, this will often cause a
+   nondumpable task to effectively become dumpable because of a setns() /
+   clone() / unshare() call.
+
+ - The root user of a namespace can ptrace any process that enters the namespace
+   (via setns()) - immediately. This means that a process that wants to enter an
+   untrusted namespace from outside needs to be *very* careful to drop any
+   privileges it might have - uids, open file descriptors, secret data that's
+   still in memory, the controlling TTY and so on. (This also means that a
+   namespace owner who is unprivileged outside the namespace can't safely enter
+   the namespace if he doesn't trust the namespace root user: He needs
+   euid==ns->owner to enter the namespace, but must not have euid==ns->owner
+   after having entered the namespace. To get around this, it is necessary to
+   introduce a third namespace between the other two.)
+
+One way to make it easier for userspace to get this right might be to bind
+dumpability to namespaces, as follows: Let a task become nondumpable on
+setns() / clone(CLONE_NEWUSER) / unshare(CLONE_NEWUSER). Record the current
+namespace on dumpable -> nondumpable transition as "dumpability namespace".
+Require the caller to be privileged in the target's dumpability namespace, not
+just the target's current namespace.
+
+=============
+FUSE AND CUSE
+=============
+
+Anyone who can open /dev/cuse (root-owned, mode 0600) can arbitrarily read and
+write to the memory of various system processes (to be more precise, any process
+that opens files in /dev when they appear or so, then calls ioctl() on them; in
+particular, acpid does this) without further checks on capabilities or so.
+
+NEEDS ATTENTION:
+FUSE is more accessible than CUSE, but also more strict. ioctl() replies are
+restricted in location and size based on the ioctl command number (size) and the
+argument (location), but the ability to interactively control the replies to VFS
+method calls is still scary. Therefore, FUSE by default rejects filesystem
+access by anyone except the filesystem owner. This isn't done using the normal
+ptrace check, though; instead, fuse_allow_current_process() is used, which only
+looks at the UIDs and GIDs and therefore is a very simplified version of the
+normal ptrace access check. In particular, this provides no protection for
+setcap binaries.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-23 20:40 ` [PATCH v2 2/8] exec: turn self_exec_id into self_privunit Jann Horn
@ 2016-09-23 21:04   ` Andy Lutomirski
  2016-09-23 21:33     ` Jann Horn
  2016-09-30 13:20   ` Oleg Nesterov
  1 sibling, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2016-09-23 21:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings, Linus Torvalds,
	Linux FS Devel, LSM List, security

On Fri, Sep 23, 2016 at 1:40 PM, Jann Horn <jann@thejh.net> wrote:
> This ensures that self_privunit ("privilege unit locally unique ID")
> is only shared by processes that share the mm_struct and the signal_struct;
> not just spatially, but also temporally. In other words, if you do execve()
> or clone() without CLONE_THREAD, you get a new privunit that has never
> been used before.
>
> One reason for doing this is that it prevents an attacker from sending an
> arbitrary signal to a parent process after performing 2^32-1 execve()
> calls.
>
> The second reason for this is that it permits using the self_exec_luid in
> a later patch to check during a ptrace access whether subject and object
> are temporally and spatially equal for privilege checking purposes.
>
> The implementation of locally unique IDs is in sched.h and exec.c for now
> because those are the only users so far - if anything else wants to use
> them in the future, they can be moved elsewhere.
>
> changed in v2:
>  - have 2^64 IDs per CPU instead of 2^64 shared ones (luid scheme,
>    suggested by Andy Lutomirski)
>  - take task_lock for reading in setup_new_exec() while bumping the LUID
>
> Signed-off-by: Jann Horn <jann@thejh.net>
> ---
>  fs/exec.c             | 41 +++++++++++++++++++++++++++++++++++++++--
>  include/linux/sched.h | 17 +++++++++++++++--
>  kernel/fork.c         |  5 +++--
>  kernel/signal.c       |  5 ++++-
>  4 files changed, 61 insertions(+), 7 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 84430ee..fcc11f0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1281,6 +1281,34 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
>  }
>  EXPORT_SYMBOL(would_dump);
>
> +static DEFINE_PER_CPU(u64, luid_counters);
> +
> +static int __init init_luid_counters(void)
> +{
> +       unsigned int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               /* value 0 is reserved for init */
> +               per_cpu(luid_counters, cpu) = 1;
> +       }
> +
> +       return 0;
> +}
> +early_initcall(init_luid_counters);

How about static DEFINE_PER_CPU(u64, luid_counters) = 1?  You could
optionally use local64_t instead, which would let you avoid needing to
think about preemption.

> +
> +/*
> + * Allocates a new LUID and writes the allocated LUID to @out.
> + * This function must not be called from IRQ context.
> + */
> +void fill_luid(struct luid *out)
> +{
> +       preempt_disable();
> +       out->count = raw_cpu_read(luid_counters);
> +       raw_cpu_add(luid_counters, 1);
> +       out->cpu = smp_processor_id();
> +       preempt_enable();
> +}
> +

I would call this alloc_luid().

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-23 21:04   ` Andy Lutomirski
@ 2016-09-23 21:33     ` Jann Horn
  0 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-09-23 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings, Linus Torvalds,
	Linux FS Devel, LSM List, security

[-- Attachment #1: Type: text/plain, Size: 3028 bytes --]

On Fri, Sep 23, 2016 at 02:04:30PM -0700, Andy Lutomirski wrote:
> On Fri, Sep 23, 2016 at 1:40 PM, Jann Horn <jann@thejh.net> wrote:
> > This ensures that self_privunit ("privilege unit locally unique ID")
> > is only shared by processes that share the mm_struct and the signal_struct;
> > not just spatially, but also temporally. In other words, if you do execve()
> > or clone() without CLONE_THREAD, you get a new privunit that has never
> > been used before.
> >
> > One reason for doing this is that it prevents an attacker from sending an
> > arbitrary signal to a parent process after performing 2^32-1 execve()
> > calls.
> >
> > The second reason for this is that it permits using the self_exec_luid in
> > a later patch to check during a ptrace access whether subject and object
> > are temporally and spatially equal for privilege checking purposes.
> >
> > The implementation of locally unique IDs is in sched.h and exec.c for now
> > because those are the only users so far - if anything else wants to use
> > them in the future, they can be moved elsewhere.
> >
> > changed in v2:
> >  - have 2^64 IDs per CPU instead of 2^64 shared ones (luid scheme,
> >    suggested by Andy Lutomirski)
> >  - take task_lock for reading in setup_new_exec() while bumping the LUID
> >
> > Signed-off-by: Jann Horn <jann@thejh.net>
> > ---
> >  fs/exec.c             | 41 +++++++++++++++++++++++++++++++++++++++--
> >  include/linux/sched.h | 17 +++++++++++++++--
> >  kernel/fork.c         |  5 +++--
> >  kernel/signal.c       |  5 ++++-
> >  4 files changed, 61 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 84430ee..fcc11f0 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1281,6 +1281,34 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
> >  }
> >  EXPORT_SYMBOL(would_dump);
> >
> > +static DEFINE_PER_CPU(u64, luid_counters);
> > +
> > +static int __init init_luid_counters(void)
> > +{
> > +       unsigned int cpu;
> > +
> > +       for_each_possible_cpu(cpu) {
> > +               /* value 0 is reserved for init */
> > +               per_cpu(luid_counters, cpu) = 1;
> > +       }
> > +
> > +       return 0;
> > +}
> > +early_initcall(init_luid_counters);
> 
> How about static DEFINE_PER_CPU(u64, luid_counters) = 1?  You could
> optionally use local64_t instead, which would let you avoid needing to
> think about preemption.

Ah, I didn't realize that either of those was possible. Yes, I guess I'll
change it to use local64_t.

> > +
> > +/*
> > + * Allocates a new LUID and writes the allocated LUID to @out.
> > + * This function must not be called from IRQ context.
> > + */
> > +void fill_luid(struct luid *out)
> > +{
> > +       preempt_disable();
> > +       out->count = raw_cpu_read(luid_counters);
> > +       raw_cpu_add(luid_counters, 1);
> > +       out->cpu = smp_processor_id();
> > +       preempt_enable();
> > +}
> > +
> 
> I would call this alloc_luid().

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-23 20:40 ` [PATCH v2 2/8] exec: turn self_exec_id into self_privunit Jann Horn
  2016-09-23 21:04   ` Andy Lutomirski
@ 2016-09-30 13:20   ` Oleg Nesterov
  2016-09-30 13:44     ` Oleg Nesterov
  1 sibling, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2016-09-30 13:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On 09/23, Jann Horn wrote:
>
> One reason for doing this is that it prevents an attacker from sending an
> arbitrary signal to a parent process after performing 2^32-1 execve()
> calls.

I think we should simply kill self/parent_exec_id's. I am going to send
the patch below after re-check/testing.

Oleg.


--- x/include/linux/sched.h
+++ x/include/linux/sched.h
@@ -1677,9 +1677,6 @@ struct task_struct {
 #endif
 	struct seccomp seccomp;
 
-/* Thread group tracking */
-   	u32 parent_exec_id;
-   	u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
  * mempolicy */
 	spinlock_t alloc_lock;
--- x/fs/exec.c
+++ x/fs/exec.c
@@ -1163,6 +1163,14 @@ static int de_thread(struct task_struct 
 no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
+	if (!list_empty(&father->children)) {
+		struct task_struct *child;
+
+		read_lock(&tasklist_lock);
+		list_for_each_entry(child, &father->children, sibling)
+			child->exit_signal = SIGCHLD;
+		read_unlock(&tasklist_lock);
+	}
 
 	exit_itimers(sig);
 	flush_itimer_signals();
@@ -1306,9 +1314,6 @@ void setup_new_exec(struct linux_binprm 
 			set_dumpable(current->mm, suid_dumpable);
 	}
 
-	/* An exec changes our domain. We are no longer part of the thread
-	   group */
-	current->self_exec_id++;
 	flush_signal_handlers(current, 0);
 	do_close_on_exec(current->files);
 }
--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -1573,13 +1573,10 @@ static struct task_struct *copy_process(
 	write_lock_irq(&tasklist_lock);
 
 	/* CLONE_PARENT re-uses the old parent */
-	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
+	if (clone_flags & (CLONE_PARENT|CLONE_THREAD))
 		p->real_parent = current->real_parent;
-		p->parent_exec_id = current->parent_exec_id;
-	} else {
+	else
 		p->real_parent = current;
-		p->parent_exec_id = current->self_exec_id;
-	}
 
 	spin_lock(&current->sighand->siglock);
 
--- x/kernel/signal.c
+++ x/kernel/signal.c
@@ -1585,15 +1585,6 @@ bool do_notify_parent(struct task_struct
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
-	if (sig != SIGCHLD) {
-		/*
-		 * This is only possible if parent == real_parent.
-		 * Check if it has changed security domain.
-		 */
-		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
-			sig = SIGCHLD;
-	}
-
 	info.si_signo = sig;
 	info.si_errno = 0;
 	/*


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-30 13:20   ` Oleg Nesterov
@ 2016-09-30 13:44     ` Oleg Nesterov
  2016-09-30 18:30       ` Kees Cook
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2016-09-30 13:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

forgot to mention...

On 09/30, Oleg Nesterov wrote:
>
> On 09/23, Jann Horn wrote:
> >
> > One reason for doing this is that it prevents an attacker from sending an
> > arbitrary signal to a parent process after performing 2^32-1 execve()
> > calls.

No, sets ->exit_signal = SIGCHLD. So the only problem is that the parent
can do clone(SIGKILL), then do execve() 2^32-1 times, then it can be killed
by SIGKILL from the exiting child.

Honestly, I do not think this is security problem.

> I think we should simply kill self/parent_exec_id's. I am going to send
> the patch below after re-check/testing.

Yes, I think this makes sense anyway.

Oleg.

> --- x/include/linux/sched.h
> +++ x/include/linux/sched.h
> @@ -1677,9 +1677,6 @@ struct task_struct {
>  #endif
>  	struct seccomp seccomp;
>  
> -/* Thread group tracking */
> -   	u32 parent_exec_id;
> -   	u32 self_exec_id;
>  /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
>   * mempolicy */
>  	spinlock_t alloc_lock;
> --- x/fs/exec.c
> +++ x/fs/exec.c
> @@ -1163,6 +1163,14 @@ static int de_thread(struct task_struct 
>  no_thread_group:
>  	/* we have changed execution domain */
>  	tsk->exit_signal = SIGCHLD;
> +	if (!list_empty(&father->children)) {
> +		struct task_struct *child;
> +
> +		read_lock(&tasklist_lock);
> +		list_for_each_entry(child, &father->children, sibling)
> +			child->exit_signal = SIGCHLD;
> +		read_unlock(&tasklist_lock);
> +	}
>  
>  	exit_itimers(sig);
>  	flush_itimer_signals();
> @@ -1306,9 +1314,6 @@ void setup_new_exec(struct linux_binprm 
>  			set_dumpable(current->mm, suid_dumpable);
>  	}
>  
> -	/* An exec changes our domain. We are no longer part of the thread
> -	   group */
> -	current->self_exec_id++;
>  	flush_signal_handlers(current, 0);
>  	do_close_on_exec(current->files);
>  }
> --- x/kernel/fork.c
> +++ x/kernel/fork.c
> @@ -1573,13 +1573,10 @@ static struct task_struct *copy_process(
>  	write_lock_irq(&tasklist_lock);
>  
>  	/* CLONE_PARENT re-uses the old parent */
> -	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
> +	if (clone_flags & (CLONE_PARENT|CLONE_THREAD))
>  		p->real_parent = current->real_parent;
> -		p->parent_exec_id = current->parent_exec_id;
> -	} else {
> +	else
>  		p->real_parent = current;
> -		p->parent_exec_id = current->self_exec_id;
> -	}
>  
>  	spin_lock(&current->sighand->siglock);
>  
> --- x/kernel/signal.c
> +++ x/kernel/signal.c
> @@ -1585,15 +1585,6 @@ bool do_notify_parent(struct task_struct
>  	BUG_ON(!tsk->ptrace &&
>  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
>  
> -	if (sig != SIGCHLD) {
> -		/*
> -		 * This is only possible if parent == real_parent.
> -		 * Check if it has changed security domain.
> -		 */
> -		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> -			sig = SIGCHLD;
> -	}
> -
>  	info.si_signo = sig;
>  	info.si_errno = 0;
>  	/*


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/8] futex: don't leak robust_list pointer
  2016-09-23 20:40 ` [PATCH v2 4/8] futex: don't leak robust_list pointer Jann Horn
@ 2016-09-30 14:52   ` Oleg Nesterov
  2016-10-30 17:16     ` Jann Horn
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2016-09-30 14:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On 09/23, Jann Horn wrote:
>
> This prevents an attacker from determining the robust_list or
> compat_robust_list userspace pointer of a process created by executing
> a setuid binary. Such an attack could be performed by racing
> get_robust_list() with a setuid execution. The impact of this issue is that
> an attacker could theoretically bypass ASLR when attacking setuid binaries.

Well. I am not sure this actually needs a fix, but I won't argue.

I can't really understand what this patch actually fixes,

> @@ -3007,31 +3007,43 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
>  	if (!futex_cmpxchg_enabled)
>  		return -ENOSYS;
>
> -	rcu_read_lock();
> -
> -	ret = -ESRCH;
> -	if (!pid)
> +	if (!pid) {
>  		p = current;
> -	else {
> +		get_task_struct(p);
> +	} else {
> +		rcu_read_lock();
>  		p = find_task_by_vpid(pid);
> +		/* pin the task to permit dropping the RCU read lock before
> +		 * acquiring the mutex
> +		 */
> +		if (p)
> +			get_task_struct(p);
> +		rcu_read_unlock();
>  		if (!p)
> -			goto err_unlock;
> +			return -ESRCH;
>  	}
>
> +	ret = mutex_lock_killable(&p->signal->cred_guard_light);
> +	if (ret)
> +		goto err_put;
> +
>  	ret = -EPERM;
>  	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
>  		goto err_unlock;
>
>  	head = p->robust_list;
> -	rcu_read_unlock();

OK, suppose it races with setuid exec, and mutex_lock_killable() +
ptrace_may_access() comes after flush_old_exec() but before
install_exec_creds(), in this case ptrace_may_access() can wrongly
succeed.

In theory, it is possible that the execing thread can complete exec,
return to user-mode and call sys_set_robust_list() before we read
head = p->robust_list. Yes, this is unlikely, but unless I am totally
confused the race you are trying to fix is equally unlikely?

perhaps we can make a much simpler change to prevent this, see below.
We can rely on fact that both ptrace_may_access() and exec_mmap()
takes the same task_lock(). Sure, this can "leak" robust_list too,
a set-uid binary can exec and/or lower its credentials after we
read p->robust_list, but personally I think we do not care.

Or I missed something else?

Oleg.

--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -3019,10 +3019,10 @@ SYSCALL_DEFINE3(get_robust_list, int, pi
 	}
 
 	ret = -EPERM;
+	head = p->robust_list;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
 		goto err_unlock;
 
-	head = p->robust_list;
 	rcu_read_unlock();
 
 	if (put_user(sizeof(*head), len_ptr))


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/8] exec: introduce cred_guard_light
  2016-09-23 20:40 ` [PATCH v2 1/8] exec: introduce cred_guard_light Jann Horn
@ 2016-09-30 15:35   ` Oleg Nesterov
  2016-09-30 18:27     ` Eric W. Biederman
  2016-10-30 21:12     ` Jann Horn
  0 siblings, 2 replies; 30+ messages in thread
From: Oleg Nesterov @ 2016-09-30 15:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On 09/23, Jann Horn wrote:
>
> This is a new per-threadgroup lock that can often be taken instead of
> cred_guard_mutex and has less deadlock potential.

Oh, please don't.

> I'm doing this because
> Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> thread, and the debugger attempts to inspect procfs files of the debugged
> task.

Yes, but we need to fix this anyway. And I am not sure the new mutex can
actually help.

And I think that cred_guard_mutex is already over-used in fs/proc. Say,
I think lock_trace() must die, I simply can't understand why it is useful.

Suppose we modify, say, proc_pid_stack() to do

	save_stack_trace_tsk(task, &trace);
	if (!ptrace_may_access(task, ...))
		goto return -EPERM;

	for (i = 0; i < trace.nr_entries; i++)
		seq_printf(...);

	return 0;

is there any problem if it shows some trace before setuid exec does
install_exec_creds() ?

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/8] exec: introduce cred_guard_light
  2016-09-30 15:35   ` Oleg Nesterov
@ 2016-09-30 18:27     ` Eric W. Biederman
  2016-10-03 16:02       ` Oleg Nesterov
  2016-10-30 21:12     ` Jann Horn
  1 sibling, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2016-09-30 18:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

Oleg Nesterov <oleg@redhat.com> writes:

> On 09/23, Jann Horn wrote:
>>
>> This is a new per-threadgroup lock that can often be taken instead of
>> cred_guard_mutex and has less deadlock potential.
>
> Oh, please don't.
>
>> I'm doing this because
>> Oleg Nesterov mentioned the potential for deadlocks, in particular if a
>> debugged task is stuck in execve, trying to get rid of a ptrace-stopped
>> thread, and the debugger attempts to inspect procfs files of the debugged
>> task.
>
> Yes, but we need to fix this anyway. And I am not sure the new mutex can
> actually help.
>
> And I think that cred_guard_mutex is already over-used in fs/proc. Say,
> I think lock_trace() must die, I simply can't understand why it is useful.
>
> Suppose we modify, say, proc_pid_stack() to do
>
> 	save_stack_trace_tsk(task, &trace);
> 	if (!ptrace_may_access(task, ...))
> 		goto return -EPERM;
>
> 	for (i = 0; i < trace.nr_entries; i++)
> 		seq_printf(...);
>
> 	return 0;
>
> is there any problem if it shows some trace before setuid exec does
> install_exec_creds() ?

You should make certain that the mm doesn't change in that picture,
perhaps like /proc/<pid>/mem does.  At which point exec (which changes
the mm) should not be an issue.

But we definitely should be able to check permissions on open (possibly
including grabbing resources) and then perform the work.

Eric


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-30 13:44     ` Oleg Nesterov
@ 2016-09-30 18:30       ` Kees Cook
  2016-09-30 18:59         ` Jann Horn
  2016-10-03 16:37         ` Oleg Nesterov
  0 siblings, 2 replies; 30+ messages in thread
From: Kees Cook @ 2016-09-30 18:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On Fri, Sep 30, 2016 at 6:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> forgot to mention...
>
> On 09/30, Oleg Nesterov wrote:
>>
>> On 09/23, Jann Horn wrote:
>> >
>> > One reason for doing this is that it prevents an attacker from sending an
>> > arbitrary signal to a parent process after performing 2^32-1 execve()
>> > calls.
>
> No, sets ->exit_signal = SIGCHLD. So the only problem is that the parent
> can do clone(SIGKILL), then do execve() 2^32-1 times, then it can be killed
> by SIGKILL from the exiting child.
>
> Honestly, I do not think this is security problem.

It's a corner case, to be sure. But even sending a SIGKILL across
privilege boundaries should not be allowed to happen.

>
>> I think we should simply kill self/parent_exec_id's. I am going to send
>> the patch below after re-check/testing.
>
> Yes, I think this makes sense anyway.

Hrm, I also thought this was used for more than just signal checking,
but I don't see anything else right now. Maybe I was remembering
earlier versions of Jann's patches...

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-30 18:30       ` Kees Cook
@ 2016-09-30 18:59         ` Jann Horn
  2016-09-30 19:05           ` Kees Cook
  2016-10-03 16:37         ` Oleg Nesterov
  1 sibling, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-09-30 18:59 UTC (permalink / raw)
  To: Kees Cook
  Cc: Oleg Nesterov, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 694 bytes --]

On Fri, Sep 30, 2016 at 11:30:23AM -0700, Kees Cook wrote:
> On Fri, Sep 30, 2016 at 6:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 09/30, Oleg Nesterov wrote:
> >> I think we should simply kill self/parent_exec_id's. I am going to send
> >> the patch below after re-check/testing.
> >
> > Yes, I think this makes sense anyway.
> 
> Hrm, I also thought this was used for more than just signal checking,
> but I don't see anything else right now. Maybe I was remembering
> earlier versions of Jann's patches...

Maybe you're thinking of grsecurity's exec_id (which I used as the basis
for my first implementation of the unique ID before Andy suggested the
LUID approach)?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-30 18:59         ` Jann Horn
@ 2016-09-30 19:05           ` Kees Cook
  0 siblings, 0 replies; 30+ messages in thread
From: Kees Cook @ 2016-09-30 19:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On Fri, Sep 30, 2016 at 11:59 AM, Jann Horn <jann@thejh.net> wrote:
> On Fri, Sep 30, 2016 at 11:30:23AM -0700, Kees Cook wrote:
>> On Fri, Sep 30, 2016 at 6:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > On 09/30, Oleg Nesterov wrote:
>> >> I think we should simply kill self/parent_exec_id's. I am going to send
>> >> the patch below after re-check/testing.
>> >
>> > Yes, I think this makes sense anyway.
>>
>> Hrm, I also thought this was used for more than just signal checking,
>> but I don't see anything else right now. Maybe I was remembering
>> earlier versions of Jann's patches...
>
> Maybe you're thinking of grsecurity's exec_id (which I used as the basis
> for my first implementation of the unique ID before Andy suggested the
> LUID approach)?

Ah, yes, so I am. Looks like it's part of CONFIG_GRKERNSEC_PROC_MEMMAP.

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt
  2016-09-23 20:40 ` [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
@ 2016-10-02  3:16   ` Krister Johansen
  2016-10-30 19:09     ` Jann Horn
  0 siblings, 1 reply; 30+ messages in thread
From: Krister Johansen @ 2016-10-02  3:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, linux-fsdevel,
	linux-security-module, security

On Fri, Sep 23, 2016 at 10:40:38PM +0200, Jann Horn wrote:
> +=====================
> +FILESYSTEM DEBUG APIS
> +=====================
> +
> +The pid / tgid entries in procfs contain various entries that allow debugging
> +access to a process. Interesting entries are:
> +
> + - auxv permits an ASLR bypass
> + - cwd can permit bypassing filesystem restrictions in some cases
> + - environ can leak secret tokens
> + - fd can permit bypassing filesystem restrictions or leak access to things like
> +   pipes
> + - maps permits an ASLR bypass
> + - mem grants R+W access to process memory
> + - stat permits an ASLR bypass
> +
> +Of these, all use both a normal filesystem DAC check (where the file owner is
> +the process owner for a dumpable process, root for a nondumpable process) and a
> +ptrace_may_access() check; however, the DAC check may be modified, and the
> +ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
> +the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
> +effective capabilities are used, causing the case where a daemon drops its euid
> +prior to accessing a file for the user to be treated correctly for this check.

Thanks for writing this up.

Is it worth mentioning some of the less obvious aspects of how user
namespaces interact with the filesystem debug APIs?  Of particular note:
a nondumpable process will always be assigned the global root ids.
Checks against capabilities for procfs require that the uid and gid have
a mapping in the current namepsace.   That's enforced through
capable_wrt_inode_uidgid().

By way of example, if you enter a user namespace that doesn't have a
mapping for the global root id, then non-dumpable processes are off
limits from /proc.  The global root ids get mapped to a special id for
unresolved mappings.  If the process that entered the namespace has
CAP_DAC_OVERRIDE/CAP_DAC_READ_SEARCH, these don't suffice to grant any
access to the non-dumpable process because the inode has no [ug]id
mapping in the particular namespace.

-K

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/8] exec: introduce cred_guard_light
  2016-09-30 18:27     ` Eric W. Biederman
@ 2016-10-03 16:02       ` Oleg Nesterov
  0 siblings, 0 replies; 30+ messages in thread
From: Oleg Nesterov @ 2016-10-03 16:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On 09/30, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > And I think that cred_guard_mutex is already over-used in fs/proc. Say,
> > I think lock_trace() must die, I simply can't understand why it is useful.
> >
> > Suppose we modify, say, proc_pid_stack() to do
> >
> > 	save_stack_trace_tsk(task, &trace);
> > 	if (!ptrace_may_access(task, ...))
> > 		goto return -EPERM;
> >
> > 	for (i = 0; i < trace.nr_entries; i++)
> > 		seq_printf(...);
> >
> > 	return 0;
> >
> > is there any problem if it shows some trace before setuid exec does
> > install_exec_creds() ?
>
> You should make certain that the mm doesn't change in that picture,
> perhaps like /proc/<pid>/mem does.

Why? We do not care I think, /proc/pid/stack has nothing to do with
task->mm.

OK, we can use __ptrace_may_access() and call save_stack_trace_tsk()
under task_lock(), but I don't understand why should we worry.

And again, do you see any security problem with the code above? Yes,
it is "racy" but I think it is fine to occasionally succeed when it
races with credentials change. I fail to understand why the current
code abuses cred_guard_mutex to close the race with setuid exec.

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/8] exec: turn self_exec_id into self_privunit
  2016-09-30 18:30       ` Kees Cook
  2016-09-30 18:59         ` Jann Horn
@ 2016-10-03 16:37         ` Oleg Nesterov
  1 sibling, 0 replies; 30+ messages in thread
From: Oleg Nesterov @ 2016-10-03 16:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

On 09/30, Kees Cook wrote:
>
> On Fri, Sep 30, 2016 at 6:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > forgot to mention...
> >
> > On 09/30, Oleg Nesterov wrote:
> >>
> >> On 09/23, Jann Horn wrote:
> >> >
> >> > One reason for doing this is that it prevents an attacker from sending an
> >> > arbitrary signal to a parent process after performing 2^32-1 execve()
> >> > calls.
> >
> > No, sets ->exit_signal = SIGCHLD. So the only problem is that the parent
> > can do clone(SIGKILL), then do execve() 2^32-1 times, then it can be killed
> > by SIGKILL from the exiting child.
> >
> > Honestly, I do not think this is security problem.
>
> It's a corner case, to be sure. But even sending a SIGKILL across
> privilege boundaries should not be allowed to happen.

Agreed, and actually I need to take my words back, of course this is not
nice security-wise.

So lets kill these counters. At least they should not live in task_struct.

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/8] futex: don't leak robust_list pointer
  2016-09-30 14:52   ` Oleg Nesterov
@ 2016-10-30 17:16     ` Jann Horn
  2016-11-02 21:39       ` Jann Horn
  0 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-10-30 17:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 3062 bytes --]

On Fri, Sep 30, 2016 at 04:52:57PM +0200, Oleg Nesterov wrote:
> On 09/23, Jann Horn wrote:
> >
> > This prevents an attacker from determining the robust_list or
> > compat_robust_list userspace pointer of a process created by executing
> > a setuid binary. Such an attack could be performed by racing
> > get_robust_list() with a setuid execution. The impact of this issue is that
> > an attacker could theoretically bypass ASLR when attacking setuid binaries.
> 
> Well. I am not sure this actually needs a fix, but I won't argue.
> 
> I can't really understand what this patch actually fixes,
> 
> > @@ -3007,31 +3007,43 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
> >  	if (!futex_cmpxchg_enabled)
> >  		return -ENOSYS;
> >
> > -	rcu_read_lock();
> > -
> > -	ret = -ESRCH;
> > -	if (!pid)
> > +	if (!pid) {
> >  		p = current;
> > -	else {
> > +		get_task_struct(p);
> > +	} else {
> > +		rcu_read_lock();
> >  		p = find_task_by_vpid(pid);
> > +		/* pin the task to permit dropping the RCU read lock before
> > +		 * acquiring the mutex
> > +		 */
> > +		if (p)
> > +			get_task_struct(p);
> > +		rcu_read_unlock();
> >  		if (!p)
> > -			goto err_unlock;
> > +			return -ESRCH;
> >  	}
> >
> > +	ret = mutex_lock_killable(&p->signal->cred_guard_light);
> > +	if (ret)
> > +		goto err_put;
> > +
> >  	ret = -EPERM;
> >  	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
> >  		goto err_unlock;
> >
> >  	head = p->robust_list;
> > -	rcu_read_unlock();
> 
> OK, suppose it races with setuid exec, and mutex_lock_killable() +
> ptrace_may_access() comes after flush_old_exec() but before
> install_exec_creds(), in this case ptrace_may_access() can wrongly
> succeed.

I take cred_guard_light in flush_old_exec() and release it in
install_exec_creds(), so that shouldn't work, I think.


> In theory, it is possible that the execing thread can complete exec,
> return to user-mode and call sys_set_robust_list() before we read
> head = p->robust_list. Yes, this is unlikely, but unless I am totally
> confused the race you are trying to fix is equally unlikely?
> 
> perhaps we can make a much simpler change to prevent this, see below.
> We can rely on fact that both ptrace_may_access() and exec_mmap()
> takes the same task_lock(). Sure, this can "leak" robust_list too,
> a set-uid binary can exec and/or lower its credentials after we
> read p->robust_list, but personally I think we do not care.
> 
> Or I missed something else?

No - I think your patch would work, too, apart from the potential
leak you mentioned.

But unless this breaks something, I would prefer to do it properly.

> Oleg.
> 
> --- x/kernel/futex.c
> +++ x/kernel/futex.c
> @@ -3019,10 +3019,10 @@ SYSCALL_DEFINE3(get_robust_list, int, pi
>  	}
>  
>  	ret = -EPERM;
> +	head = p->robust_list;
>  	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
>  		goto err_unlock;
>  
> -	head = p->robust_list;
>  	rcu_read_unlock();
>  
>  	if (put_user(sizeof(*head), len_ptr))
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt
  2016-10-02  3:16   ` Krister Johansen
@ 2016-10-30 19:09     ` Jann Horn
  2016-10-31  4:14       ` Eric W. Biederman
  0 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-10-30 19:09 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric . Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, linux-fsdevel,
	linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

On Sat, Oct 01, 2016 at 08:16:00PM -0700, Krister Johansen wrote:
> On Fri, Sep 23, 2016 at 10:40:38PM +0200, Jann Horn wrote:
> > +=====================
> > +FILESYSTEM DEBUG APIS
> > +=====================
> > +
> > +The pid / tgid entries in procfs contain various entries that allow debugging
> > +access to a process. Interesting entries are:
> > +
> > + - auxv permits an ASLR bypass
> > + - cwd can permit bypassing filesystem restrictions in some cases
> > + - environ can leak secret tokens
> > + - fd can permit bypassing filesystem restrictions or leak access to things like
> > +   pipes
> > + - maps permits an ASLR bypass
> > + - mem grants R+W access to process memory
> > + - stat permits an ASLR bypass
> > +
> > +Of these, all use both a normal filesystem DAC check (where the file owner is
> > +the process owner for a dumpable process, root for a nondumpable process) and a
> > +ptrace_may_access() check; however, the DAC check may be modified, and the
> > +ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
> > +the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
> > +effective capabilities are used, causing the case where a daemon drops its euid
> > +prior to accessing a file for the user to be treated correctly for this check.
> 
> Thanks for writing this up.
> 
> Is it worth mentioning some of the less obvious aspects of how user
> namespaces interact with the filesystem debug APIs?  Of particular note:
> a nondumpable process will always be assigned the global root ids.
> Checks against capabilities for procfs require that the uid and gid have
> a mapping in the current namepsace.   That's enforced through
> capable_wrt_inode_uidgid().

Yeah, makes sense. Added that. Thanks!

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/8] exec: introduce cred_guard_light
  2016-09-30 15:35   ` Oleg Nesterov
  2016-09-30 18:27     ` Eric W. Biederman
@ 2016-10-30 21:12     ` Jann Horn
  1 sibling, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-10-30 21:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 1536 bytes --]

On Fri, Sep 30, 2016 at 05:35:05PM +0200, Oleg Nesterov wrote:
> On 09/23, Jann Horn wrote:
> >
> > This is a new per-threadgroup lock that can often be taken instead of
> > cred_guard_mutex and has less deadlock potential.
> 
> Oh, please don't.
> 
> > I'm doing this because
> > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > thread, and the debugger attempts to inspect procfs files of the debugged
> > task.
> 
> Yes, but we need to fix this anyway. And I am not sure the new mutex can
> actually help.
> 
> And I think that cred_guard_mutex is already over-used in fs/proc. Say,
> I think lock_trace() must die, I simply can't understand why it is useful.

IMO it's just about reducing potential (really small) information leaks.
These leaks probably don't matter much in practice - they're racy and only
disclose minimal amounts of data -, but in principle, they expose data.

But since you're opposed to it and I don't see a significant benefit in
having these checks, I'll just remove the lock_trace() stuff from my patch
for now.

> Suppose we modify, say, proc_pid_stack() to do
> 
> 	save_stack_trace_tsk(task, &trace);
> 	if (!ptrace_may_access(task, ...))
> 		goto return -EPERM;
> 
> 	for (i = 0; i < trace.nr_entries; i++)
> 		seq_printf(...);
> 
> 	return 0;
> 
> is there any problem if it shows some trace before setuid exec does
> install_exec_creds() ?
> 
> Oleg.
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt
  2016-10-30 19:09     ` Jann Horn
@ 2016-10-31  4:14       ` Eric W. Biederman
  2016-10-31 13:39         ` Jann Horn
  2016-11-03 20:43         ` Krister Johansen
  0 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2016-10-31  4:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Krister Johansen, Alexander Viro, Roland McGrath, Oleg Nesterov,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Kees Cook,
	Andrew Morton, Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

Jann Horn <jann@thejh.net> writes:

> On Sat, Oct 01, 2016 at 08:16:00PM -0700, Krister Johansen wrote:
>> On Fri, Sep 23, 2016 at 10:40:38PM +0200, Jann Horn wrote:
>> > +=====================
>> > +FILESYSTEM DEBUG APIS
>> > +=====================
>> > +
>> > +The pid / tgid entries in procfs contain various entries that allow debugging
>> > +access to a process. Interesting entries are:
>> > +
>> > + - auxv permits an ASLR bypass
>> > + - cwd can permit bypassing filesystem restrictions in some cases
>> > + - environ can leak secret tokens
>> > + - fd can permit bypassing filesystem restrictions or leak access to things like
>> > +   pipes
>> > + - maps permits an ASLR bypass
>> > + - mem grants R+W access to process memory
>> > + - stat permits an ASLR bypass
>> > +
>> > +Of these, all use both a normal filesystem DAC check (where the file owner is
>> > +the process owner for a dumpable process, root for a nondumpable process) and a
>> > +ptrace_may_access() check; however, the DAC check may be modified, and the
>> > +ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
>> > +the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
>> > +effective capabilities are used, causing the case where a daemon drops its euid
>> > +prior to accessing a file for the user to be treated correctly for this check.
>> 
>> Thanks for writing this up.
>> 
>> Is it worth mentioning some of the less obvious aspects of how user
>> namespaces interact with the filesystem debug APIs?  Of particular note:
>> a nondumpable process will always be assigned the global root ids.
>> Checks against capabilities for procfs require that the uid and gid have
>> a mapping in the current namepsace.   That's enforced through
>> capable_wrt_inode_uidgid().
>
> Yeah, makes sense. Added that. Thanks!

That will actually be changing for 4.10.  mm->user_ns allows me to use
the user namespace id 0 if that id is mapped.

Eric


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt
  2016-10-31  4:14       ` Eric W. Biederman
@ 2016-10-31 13:39         ` Jann Horn
  2016-11-03 20:43         ` Krister Johansen
  1 sibling, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-10-31 13:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Krister Johansen, Alexander Viro, Roland McGrath, Oleg Nesterov,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Kees Cook,
	Andrew Morton, Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 2187 bytes --]

On Sun, Oct 30, 2016 at 11:14:04PM -0500, Eric W. Biederman wrote:
> Jann Horn <jann@thejh.net> writes:
> 
> > On Sat, Oct 01, 2016 at 08:16:00PM -0700, Krister Johansen wrote:
> >> On Fri, Sep 23, 2016 at 10:40:38PM +0200, Jann Horn wrote:
> >> > +=====================
> >> > +FILESYSTEM DEBUG APIS
> >> > +=====================
> >> > +
> >> > +The pid / tgid entries in procfs contain various entries that allow debugging
> >> > +access to a process. Interesting entries are:
> >> > +
> >> > + - auxv permits an ASLR bypass
> >> > + - cwd can permit bypassing filesystem restrictions in some cases
> >> > + - environ can leak secret tokens
> >> > + - fd can permit bypassing filesystem restrictions or leak access to things like
> >> > +   pipes
> >> > + - maps permits an ASLR bypass
> >> > + - mem grants R+W access to process memory
> >> > + - stat permits an ASLR bypass
> >> > +
> >> > +Of these, all use both a normal filesystem DAC check (where the file owner is
> >> > +the process owner for a dumpable process, root for a nondumpable process) and a
> >> > +ptrace_may_access() check; however, the DAC check may be modified, and the
> >> > +ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
> >> > +the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
> >> > +effective capabilities are used, causing the case where a daemon drops its euid
> >> > +prior to accessing a file for the user to be treated correctly for this check.
> >> 
> >> Thanks for writing this up.
> >> 
> >> Is it worth mentioning some of the less obvious aspects of how user
> >> namespaces interact with the filesystem debug APIs?  Of particular note:
> >> a nondumpable process will always be assigned the global root ids.
> >> Checks against capabilities for procfs require that the uid and gid have
> >> a mapping in the current namepsace.   That's enforced through
> >> capable_wrt_inode_uidgid().
> >
> > Yeah, makes sense. Added that. Thanks!
> 
> That will actually be changing for 4.10.  mm->user_ns allows me to use
> the user namespace id 0 if that id is mapped.

Okay, I'll take it back out for now.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/8] futex: don't leak robust_list pointer
  2016-10-30 17:16     ` Jann Horn
@ 2016-11-02 21:39       ` Jann Horn
  2016-11-02 22:47         ` Jann Horn
  0 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2016-11-02 21:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 4781 bytes --]

On Sun, Oct 30, 2016 at 06:16:50PM +0100, Jann Horn wrote:
> On Fri, Sep 30, 2016 at 04:52:57PM +0200, Oleg Nesterov wrote:
> > On 09/23, Jann Horn wrote:
> > >
> > > This prevents an attacker from determining the robust_list or
> > > compat_robust_list userspace pointer of a process created by executing
> > > a setuid binary. Such an attack could be performed by racing
> > > get_robust_list() with a setuid execution. The impact of this issue is that
> > > an attacker could theoretically bypass ASLR when attacking setuid binaries.
> > 
> > Well. I am not sure this actually needs a fix, but I won't argue.
> > 
> > I can't really understand what this patch actually fixes,
> > 
> > > @@ -3007,31 +3007,43 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
> > >  	if (!futex_cmpxchg_enabled)
> > >  		return -ENOSYS;
> > >
> > > -	rcu_read_lock();
> > > -
> > > -	ret = -ESRCH;
> > > -	if (!pid)
> > > +	if (!pid) {
> > >  		p = current;
> > > -	else {
> > > +		get_task_struct(p);
> > > +	} else {
> > > +		rcu_read_lock();
> > >  		p = find_task_by_vpid(pid);
> > > +		/* pin the task to permit dropping the RCU read lock before
> > > +		 * acquiring the mutex
> > > +		 */
> > > +		if (p)
> > > +			get_task_struct(p);
> > > +		rcu_read_unlock();
> > >  		if (!p)
> > > -			goto err_unlock;
> > > +			return -ESRCH;
> > >  	}
> > >
> > > +	ret = mutex_lock_killable(&p->signal->cred_guard_light);
> > > +	if (ret)
> > > +		goto err_put;
> > > +
> > >  	ret = -EPERM;
> > >  	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
> > >  		goto err_unlock;
> > >
> > >  	head = p->robust_list;
> > > -	rcu_read_unlock();
> > 
> > OK, suppose it races with setuid exec, and mutex_lock_killable() +
> > ptrace_may_access() comes after flush_old_exec() but before
> > install_exec_creds(), in this case ptrace_may_access() can wrongly
> > succeed.
> 
> I take cred_guard_light in flush_old_exec() and release it in
> install_exec_creds(), so that shouldn't work, I think.
> 
> 
> > In theory, it is possible that the execing thread can complete exec,
> > return to user-mode and call sys_set_robust_list() before we read
> > head = p->robust_list. Yes, this is unlikely, but unless I am totally
> > confused the race you are trying to fix is equally unlikely?
> > 
> > perhaps we can make a much simpler change to prevent this, see below.
> > We can rely on fact that both ptrace_may_access() and exec_mmap()
> > takes the same task_lock(). Sure, this can "leak" robust_list too,
> > a set-uid binary can exec and/or lower its credentials after we
> > read p->robust_list, but personally I think we do not care.
> > 
> > Or I missed something else?
> 
> No - I think your patch would work, too, apart from the potential
> leak you mentioned.

Changing my opinion:

This does not just affect setuid binaries. It also affects daemons like
cron and atd that execute processes with dropped privileges.

This is how atd runs jobs (strace output, with irrelevant stuff removed):

[...]
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa81b1099d0) = 14915
Process 14915 attached
[...]
[pid 14915] set_robust_list(0x7fa81b1099e0, 24) = 0
[...]
[pid 14915] setregid(0, 1)              = 0
[pid 14915] setreuid(0, 1)              = 0
[pid 14915] close(0)                    = 0
[pid 14915] close(1)                    = 0
[pid 14915] close(2)                    = 0
[pid 14915] clone(Process 14916 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa81b1099d0) = 14916
[pid 14916] set_robust_list(0x7fa81b1099e0, 24) = 0
[pid 14915] wait4(14916,  <unfinished ...>
[pid 14916] lseek(6, 0, SEEK_SET)       = 0
[pid 14916] dup2(6, 0)                  = 0
[pid 14916] dup2(5, 1)                  = 1
[pid 14916] dup2(5, 2)                  = 2
[pid 14916] close(6)                    = 0
[pid 14916] close(5)                    = 0
[pid 14916] setreuid(1, 0)              = 0
[pid 14916] setregid(1, 0)              = 0
[...]
[pid 14916] setgroups(13, [1000, [...]]) = 0
[pid 14916] setgid(1000)                = 0
[pid 14916] setuid(1000)                = 0
[pid 14916] chdir("/")                  = 0
[pid 14916] execve("/bin/sh", ["sh"], [/* 0 vars */]) = 0
[...]

Basically, you can see that the pointer 0x7fa81b1099e0, which reveals
information about the address space layout, is the robust list of pid 14916
when it calls execve(), and after that execve() call, pid 14916 will be
ptraceable for the user (modulo LSMs).

So I think that my patch is a bit safer. Yes, there aren't many local
daemons whose address space layout you can discover this way, but it's still
not great.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/8] futex: don't leak robust_list pointer
  2016-11-02 21:39       ` Jann Horn
@ 2016-11-02 22:47         ` Jann Horn
  0 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2016-11-02 22:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric . Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 5748 bytes --]

On Wed, Nov 02, 2016 at 10:39:32PM +0100, Jann Horn wrote:
> On Sun, Oct 30, 2016 at 06:16:50PM +0100, Jann Horn wrote:
> > On Fri, Sep 30, 2016 at 04:52:57PM +0200, Oleg Nesterov wrote:
> > > On 09/23, Jann Horn wrote:
> > > >
> > > > This prevents an attacker from determining the robust_list or
> > > > compat_robust_list userspace pointer of a process created by executing
> > > > a setuid binary. Such an attack could be performed by racing
> > > > get_robust_list() with a setuid execution. The impact of this issue is that
> > > > an attacker could theoretically bypass ASLR when attacking setuid binaries.
> > > 
> > > Well. I am not sure this actually needs a fix, but I won't argue.
> > > 
> > > I can't really understand what this patch actually fixes,
> > > 
> > > > @@ -3007,31 +3007,43 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
> > > >  	if (!futex_cmpxchg_enabled)
> > > >  		return -ENOSYS;
> > > >
> > > > -	rcu_read_lock();
> > > > -
> > > > -	ret = -ESRCH;
> > > > -	if (!pid)
> > > > +	if (!pid) {
> > > >  		p = current;
> > > > -	else {
> > > > +		get_task_struct(p);
> > > > +	} else {
> > > > +		rcu_read_lock();
> > > >  		p = find_task_by_vpid(pid);
> > > > +		/* pin the task to permit dropping the RCU read lock before
> > > > +		 * acquiring the mutex
> > > > +		 */
> > > > +		if (p)
> > > > +			get_task_struct(p);
> > > > +		rcu_read_unlock();
> > > >  		if (!p)
> > > > -			goto err_unlock;
> > > > +			return -ESRCH;
> > > >  	}
> > > >
> > > > +	ret = mutex_lock_killable(&p->signal->cred_guard_light);
> > > > +	if (ret)
> > > > +		goto err_put;
> > > > +
> > > >  	ret = -EPERM;
> > > >  	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
> > > >  		goto err_unlock;
> > > >
> > > >  	head = p->robust_list;
> > > > -	rcu_read_unlock();
> > > 
> > > OK, suppose it races with setuid exec, and mutex_lock_killable() +
> > > ptrace_may_access() comes after flush_old_exec() but before
> > > install_exec_creds(), in this case ptrace_may_access() can wrongly
> > > succeed.
> > 
> > I take cred_guard_light in flush_old_exec() and release it in
> > install_exec_creds(), so that shouldn't work, I think.
> > 
> > 
> > > In theory, it is possible that the execing thread can complete exec,
> > > return to user-mode and call sys_set_robust_list() before we read
> > > head = p->robust_list. Yes, this is unlikely, but unless I am totally
> > > confused the race you are trying to fix is equally unlikely?
> > > 
> > > perhaps we can make a much simpler change to prevent this, see below.
> > > We can rely on fact that both ptrace_may_access() and exec_mmap()
> > > takes the same task_lock(). Sure, this can "leak" robust_list too,
> > > a set-uid binary can exec and/or lower its credentials after we
> > > read p->robust_list, but personally I think we do not care.
> > > 
> > > Or I missed something else?
> > 
> > No - I think your patch would work, too, apart from the potential
> > leak you mentioned.
> 
> Changing my opinion:
> 
> This does not just affect setuid binaries. It also affects daemons like
> cron and atd that execute processes with dropped privileges.
> 
> This is how atd runs jobs (strace output, with irrelevant stuff removed):
> 
> [...]
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa81b1099d0) = 14915
> Process 14915 attached
> [...]
> [pid 14915] set_robust_list(0x7fa81b1099e0, 24) = 0
> [...]
> [pid 14915] setregid(0, 1)              = 0
> [pid 14915] setreuid(0, 1)              = 0
> [pid 14915] close(0)                    = 0
> [pid 14915] close(1)                    = 0
> [pid 14915] close(2)                    = 0
> [pid 14915] clone(Process 14916 attached
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa81b1099d0) = 14916
> [pid 14916] set_robust_list(0x7fa81b1099e0, 24) = 0
> [pid 14915] wait4(14916,  <unfinished ...>
> [pid 14916] lseek(6, 0, SEEK_SET)       = 0
> [pid 14916] dup2(6, 0)                  = 0
> [pid 14916] dup2(5, 1)                  = 1
> [pid 14916] dup2(5, 2)                  = 2
> [pid 14916] close(6)                    = 0
> [pid 14916] close(5)                    = 0
> [pid 14916] setreuid(1, 0)              = 0
> [pid 14916] setregid(1, 0)              = 0
> [...]
> [pid 14916] setgroups(13, [1000, [...]]) = 0
> [pid 14916] setgid(1000)                = 0
> [pid 14916] setuid(1000)                = 0
> [pid 14916] chdir("/")                  = 0
> [pid 14916] execve("/bin/sh", ["sh"], [/* 0 vars */]) = 0
> [...]
> 
> Basically, you can see that the pointer 0x7fa81b1099e0, which reveals
> information about the address space layout, is the robust list of pid 14916
> when it calls execve(), and after that execve() call, pid 14916 will be
> ptraceable for the user (modulo LSMs).
> 
> So I think that my patch is a bit safer. Yes, there aren't many local
> daemons whose address space layout you can discover this way, but it's still
> not great.

I think my previous message wasn't very clear about what I think the issue is.

Basically, here, it would be plausible for uid 1000 to be able to determine
the pre-execve() robust_list pointer of pid 14916 by racing get_robust_list()
during the execve(). That itself isn't a big issue because the memory mappings
of pid 14916 are thrown away during the execve(), but what is potentially
interesting to an attacker is that before the execve(), pid 14916 shared its
address space layout with its parents, including the atd daemon. So if an
attacker has a vulnerability in atd but needs an address leak in order to
exploit it, this would be such a leak.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt
  2016-10-31  4:14       ` Eric W. Biederman
  2016-10-31 13:39         ` Jann Horn
@ 2016-11-03 20:43         ` Krister Johansen
  1 sibling, 0 replies; 30+ messages in thread
From: Krister Johansen @ 2016-11-03 20:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Krister Johansen, Alexander Viro, Roland McGrath,
	Oleg Nesterov, John Johansen, James Morris, Serge E. Hallyn,
	Paul Moore, Stephen Smalley, Eric Paris, Casey Schaufler,
	Kees Cook, Andrew Morton, Janis Danisevskis, Seth Forshee,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, linux-fsdevel,
	linux-security-module, security

On Sun, Oct 30, 2016 at 11:14:04PM -0500, Eric W. Biederman wrote:
> Jann Horn <jann@thejh.net> writes:
> 
> > On Sat, Oct 01, 2016 at 08:16:00PM -0700, Krister Johansen wrote:
> >> On Fri, Sep 23, 2016 at 10:40:38PM +0200, Jann Horn wrote:
> >> > +=====================
> >> > +FILESYSTEM DEBUG APIS
> >> > +=====================
> >> > +
> >> > +The pid / tgid entries in procfs contain various entries that allow debugging
> >> > +access to a process. Interesting entries are:
> >> > +
> >> > + - auxv permits an ASLR bypass
> >> > + - cwd can permit bypassing filesystem restrictions in some cases
> >> > + - environ can leak secret tokens
> >> > + - fd can permit bypassing filesystem restrictions or leak access to things like
> >> > +   pipes
> >> > + - maps permits an ASLR bypass
> >> > + - mem grants R+W access to process memory
> >> > + - stat permits an ASLR bypass
> >> > +
> >> > +Of these, all use both a normal filesystem DAC check (where the file owner is
> >> > +the process owner for a dumpable process, root for a nondumpable process) and a
> >> > +ptrace_may_access() check; however, the DAC check may be modified, and the
> >> > +ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
> >> > +the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
> >> > +effective capabilities are used, causing the case where a daemon drops its euid
> >> > +prior to accessing a file for the user to be treated correctly for this check.
> >> 
> >> Thanks for writing this up.
> >> 
> >> Is it worth mentioning some of the less obvious aspects of how user
> >> namespaces interact with the filesystem debug APIs?  Of particular note:
> >> a nondumpable process will always be assigned the global root ids.
> >> Checks against capabilities for procfs require that the uid and gid have
> >> a mapping in the current namepsace.   That's enforced through
> >> capable_wrt_inode_uidgid().
> >
> > Yeah, makes sense. Added that. Thanks!
> 
> That will actually be changing for 4.10.  mm->user_ns allows me to use
> the user namespace id 0 if that id is mapped.

I'll be excited to sync up with this change once you land it in 4.10.
There are a bunch of tools that get confused if you run them in a user
namespace but they can't access /proc/[pid]/whatever.

-K

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-11-03 20:43 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-23 20:40 [PATCH v2 0/8] Various fixes related to ptrace_may_access() Jann Horn
2016-09-23 20:40 ` [PATCH v2 1/8] exec: introduce cred_guard_light Jann Horn
2016-09-30 15:35   ` Oleg Nesterov
2016-09-30 18:27     ` Eric W. Biederman
2016-10-03 16:02       ` Oleg Nesterov
2016-10-30 21:12     ` Jann Horn
2016-09-23 20:40 ` [PATCH v2 2/8] exec: turn self_exec_id into self_privunit Jann Horn
2016-09-23 21:04   ` Andy Lutomirski
2016-09-23 21:33     ` Jann Horn
2016-09-30 13:20   ` Oleg Nesterov
2016-09-30 13:44     ` Oleg Nesterov
2016-09-30 18:30       ` Kees Cook
2016-09-30 18:59         ` Jann Horn
2016-09-30 19:05           ` Kees Cook
2016-10-03 16:37         ` Oleg Nesterov
2016-09-23 20:40 ` [PATCH v2 3/8] proc: use open()-time creds for ptrace checks Jann Horn
2016-09-23 20:40 ` [PATCH v2 4/8] futex: don't leak robust_list pointer Jann Horn
2016-09-30 14:52   ` Oleg Nesterov
2016-10-30 17:16     ` Jann Horn
2016-11-02 21:39       ` Jann Horn
2016-11-02 22:47         ` Jann Horn
2016-09-23 20:40 ` [PATCH v2 5/8] proc: lock properly in ptrace_may_access callers Jann Horn
2016-09-23 20:40 ` [PATCH v2 6/8] ptrace: warn on ptrace_may_access without proper locking Jann Horn
2016-09-23 20:40 ` [PATCH v2 7/8] fs/proc: fix attr access check Jann Horn
2016-09-23 20:40 ` [PATCH v2 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
2016-10-02  3:16   ` Krister Johansen
2016-10-30 19:09     ` Jann Horn
2016-10-31  4:14       ` Eric W. Biederman
2016-10-31 13:39         ` Jann Horn
2016-11-03 20:43         ` Krister Johansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).