linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/8] Various fixes related to ptrace_may_access()
@ 2016-10-30 21:46 Jann Horn
  2016-10-30 21:46 ` [PATCH v3 1/8] exec: introduce cred_guard_light Jann Horn
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

Next try.

Changes to the individual patches are mostly documented in their
commit messages.

Added/removed patches:
 - Added "proc: fix timerslack_ns handling"
 - Removed "ptrace: warn on ptrace_may_access without proper locking"
   (because of some reverted changes in the "proc: lock properly [...]"
   patch)


Jann Horn (8):
  exec: introduce cred_guard_light
  exec: add privunit to task_struct
  proc: use open()-time creds for ptrace checks
  futex: don't leak robust_list pointer
  proc: lock properly in ptrace_may_access callers
  fs/proc: fix attr access check
  proc: fix timerslack_ns handling
  Documentation: add security/ptrace_checks.txt

 Documentation/security/ptrace_checks.txt | 243 +++++++++++++++++++++++++++++++
 arch/mips/kernel/mips-mt-fpaff.c         |   4 +-
 fs/exec.c                                |  33 ++++-
 fs/proc/array.c                          |  10 +-
 fs/proc/base.c                           | 220 +++++++++++++++++++++-------
 fs/proc/internal.h                       |  14 ++
 fs/proc/namespaces.c                     |  21 ++-
 include/linux/init_task.h                |   1 +
 include/linux/lsm_hooks.h                |  17 ++-
 include/linux/ptrace.h                   |   5 +
 include/linux/sched.h                    |  28 +++-
 include/linux/security.h                 |  23 +--
 kernel/cpuset.c                          |   2 +-
 kernel/fork.c                            |   2 +
 kernel/futex.c                           |  30 ++--
 kernel/futex_compat.c                    |  30 ++--
 kernel/ptrace.c                          |  51 +++++--
 kernel/sched/core.c                      |  14 +-
 security/apparmor/include/ipc.h          |   2 +-
 security/apparmor/ipc.c                  |   4 +-
 security/apparmor/lsm.c                  |  14 +-
 security/commoncap.c                     |  24 +--
 security/security.c                      |  13 +-
 security/selinux/hooks.c                 |  35 +++--
 security/smack/smack_lsm.c               |  27 +++-
 security/yama/yama_lsm.c                 |   9 +-
 26 files changed, 718 insertions(+), 158 deletions(-)
 create mode 100644 Documentation/security/ptrace_checks.txt

-- 
2.1.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-11-02 18:18   ` Oleg Nesterov
  2016-10-30 21:46 ` [PATCH v3 2/8] exec: add privunit to task_struct Jann Horn
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

This is a new per-threadgroup lock that can often be taken instead of
cred_guard_mutex and has less deadlock potential. I'm doing this because
Oleg Nesterov mentioned the potential for deadlocks, in particular if a
debugged task is stuck in execve, trying to get rid of a ptrace-stopped
thread, and the debugger attempts to inspect procfs files of the debugged
task.

The binfmt handlers (in particular for elf_fdpic and flat) might still
call VFS read and mmap operations on the binary with the lock held, but
not open operations (as is the case with cred_guard_mutex).

An rwlock would be more appropriate here, but apparently those don't
have _killable variants of the locking functions?

This is a preparation patch for using proper locking in more places.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c                 | 15 ++++++++++++++-
 include/linux/init_task.h |  1 +
 include/linux/sched.h     | 10 ++++++++++
 kernel/fork.c             |  1 +
 kernel/ptrace.c           | 10 ++++++++++
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9ee71e..67b76cb319d8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1243,6 +1243,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_light);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1256,7 +1260,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	acct_arg_size(bprm, 0);
 	retval = exec_mmap(bprm->mm);
 	if (retval)
-		goto out;
+		goto out_unlock;
 
 	bprm->mm = NULL;		/* We're using it now */
 
@@ -1268,6 +1272,8 @@ int flush_old_exec(struct linux_binprm * bprm)
 
 	return 0;
 
+out_unlock:
+	mutex_unlock(&current->signal->cred_guard_light);
 out:
 	return retval;
 }
@@ -1391,6 +1397,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_guard_light);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
@@ -1758,6 +1765,12 @@ static int do_execveat_common(int fd, struct filename *filename,
 	return retval;
 
 out:
+	if (!bprm->mm && bprm->cred) {
+		/* failure after flush_old_exec(), but before
+		 * install_exec_creds()
+		 */
+		mutex_unlock(&current->signal->cred_guard_light);
+	}
 	if (bprm->mm) {
 		acct_arg_size(bprm, 0);
 		mmput(bprm->mm);
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 325f649d77ff..c6819468e79a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -60,6 +60,7 @@ extern struct fs_struct init_fs;
 	INIT_PREV_CPUTIME(sig)						\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	.cred_guard_light = __MUTEX_INITIALIZER(sig.cred_guard_light)	\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec92..0ccb379895b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -812,6 +812,16 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	/*
+	 * Lightweight version of cred_guard_mutex; used to prevent race
+	 * conditions where a user can gain information about the post-execve
+	 * state of a task to which access should only be granted pre-execve.
+	 * Hold this mutex while performing remote task inspection associated
+	 * with a security check.
+	 * This mutex MUST NOT be used in cases where anything changes about
+	 * the security properties of a running execve().
+	 */
+	struct mutex cred_guard_light;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 623259fc794d..d0e1d6fa4d00 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1361,6 +1361,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 				   current->signal->is_child_subreaper;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_guard_light);
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index e6474f7272ec..c3312e9e0078 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -285,6 +285,16 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	return security_ptrace_access_check(task, mode);
 }
 
+/*
+ * NOTE: When you call this function, you need to ensure that the target task
+ * can't acquire (via setuid execve) credentials between the ptrace access
+ * check and the privileged access. The recommended way to do this is to hold
+ * one of task->signal->{cred_guard_mutex,cred_guard_light} while calling this
+ * function and performing the requested access.
+ *
+ * This function may only be used if access is requested in the name of
+ * current_cred().
+ */
 bool ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	int err;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 2/8] exec: add privunit to task_struct
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
  2016-10-30 21:46 ` [PATCH v3 1/8] exec: introduce cred_guard_light Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-10-30 21:46 ` [PATCH v3 3/8] proc: use open()-time creds for ptrace checks Jann Horn
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

This adds a member privunit ("privilege unit locally unique ID") to
task_struct. privunit is only shared by processes that share the
mm_struct and the signal_struct; not just spatially, but also temporally.
In other words, if you do execve() or clone() without CLONE_THREAD, you
get a new privunit that has never been used before.

privunit is used in later patches to check during ptrace access
checks whether subject and object are temporally and spatially equal for
privilege checking purposes.

The implementation of locally unique IDs is in sched.h and exec.c for now
because those are the only users so far - if anything else wants to use
them in the future, they can be moved elsewhere.

changed in v2:
 - have 2^64 IDs per CPU instead of 2^64 shared ones (luid scheme,
   suggested by Andy Lutomirski)
 - take task_lock for reading in setup_new_exec() while bumping the LUID

changed in v3:
 - Make privunit a new member of task_struct instead of reusing
   self_exec_id. This reduces locking trouble and allows self_exec_id to be
   removed at a later point. (Oleg Nesterov)
 - statically initialize luid_counters instead of using an __init function
   (Andy Lutomirski)

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c             | 18 ++++++++++++++++++
 include/linux/sched.h | 18 ++++++++++++++++--
 kernel/fork.c         |  1 +
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 67b76cb319d8..c695dcd355ac 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1286,6 +1286,22 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 }
 EXPORT_SYMBOL(would_dump);
 
+/* value 0 is reserved for init */
+static DEFINE_PER_CPU(u64, luid_counters) = 1;
+
+/*
+ * Allocates a new LUID and writes the allocated LUID to @out.
+ * This function must not be called from IRQ context.
+ */
+void alloc_luid(struct luid *out)
+{
+	preempt_disable();
+	out->count = raw_cpu_read(luid_counters);
+	raw_cpu_add(luid_counters, 1);
+	out->cpu = smp_processor_id();
+	preempt_enable();
+}
+
 void setup_new_exec(struct linux_binprm * bprm)
 {
 	arch_pick_mmap_layout(current->mm);
@@ -1320,6 +1336,8 @@ void setup_new_exec(struct linux_binprm * bprm)
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
 	current->self_exec_id++;
+	alloc_luid(&current->privunit);
+
 	flush_signal_handlers(current, 0);
 	do_close_on_exec(current->files);
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0ccb379895b3..86146977d60c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1482,6 +1482,19 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+/* locally unique ID */
+struct luid {
+	u64 count;
+	unsigned int cpu;
+};
+
+void alloc_luid(struct luid *out);
+
+static inline bool luid_eq(const struct luid *a, const struct luid *b)
+{
+	return a->count == b->count && a->cpu == b->cpu;
+}
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1713,8 +1726,9 @@ struct task_struct {
 	struct seccomp seccomp;
 
 /* Thread group tracking */
-   	u32 parent_exec_id;
-   	u32 self_exec_id;
+	u32 parent_exec_id;
+	u32 self_exec_id;
+	struct luid privunit;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
  * mempolicy */
 	spinlock_t alloc_lock;
diff --git a/kernel/fork.c b/kernel/fork.c
index d0e1d6fa4d00..c7a658d5a6cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,6 +1714,7 @@ static __latent_entropy struct task_struct *copy_process(
 			p->exit_signal = (clone_flags & CSIGNAL);
 		p->group_leader = p;
 		p->tgid = p->pid;
+		alloc_luid(&p->privunit);
 	}
 
 	p->nr_dirtied = 0;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 3/8] proc: use open()-time creds for ptrace checks
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
  2016-10-30 21:46 ` [PATCH v3 1/8] exec: introduce cred_guard_light Jann Horn
  2016-10-30 21:46 ` [PATCH v3 2/8] exec: add privunit to task_struct Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-10-30 21:46 ` [PATCH v3 4/8] futex: don't leak robust_list pointer Jann Horn
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

This adds a new ptrace_may_access_noncurrent() method that
uses the supplied credentials instead of those of the
current task. (However, the current task may still be
inspected for auditing purposes, e.g. by the Smack LSM.)

procfs used the caller's creds for a few ptrace_may_access()
checks at read() time, which made a confused deputy attack
by passing an FD to a procfs file to a setuid program
possible. Therefore, the following was a local userspace
ASLR bypass:

rm -f /tmp/foobar
procmail <(echo 'DEFAULT=/tmp/foobar') </proc/1/stat
(
  echo 'obase=16'
  cut -d' ' -f26-30,45-51 </tmp/foobar | tr ' ' '\n'
) | bc
rm /tmp/foobar

procmail is installed setuid root on Debian and read()s
data from stdin before dropping privs, so the
ptrace_may_access() check in the VFS read handler of
/proc/1/stat passes. Procmail then dumps the read data
to a user-accessible file (/tmp/foobar here).

changed in v2:
 - also handle PTRACE_MODE_ATTACH access correctly in
   SELinux (Stephen Smalley)
 - changed privunit-using code

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/proc/array.c                 |  3 +-
 fs/proc/base.c                  | 63 ++++++++++++++++++++++++++++++++++-------
 fs/proc/internal.h              | 14 +++++++++
 include/linux/lsm_hooks.h       |  3 +-
 include/linux/ptrace.h          |  5 ++++
 include/linux/security.h        | 10 ++++---
 kernel/ptrace.c                 | 41 ++++++++++++++++++++-------
 security/apparmor/include/ipc.h |  2 +-
 security/apparmor/ipc.c         |  4 +--
 security/apparmor/lsm.c         | 14 +++++++--
 security/commoncap.c            |  8 +++---
 security/security.c             |  5 ++--
 security/selinux/hooks.c        | 15 +++++-----
 security/smack/smack_lsm.c      | 18 ++++++++----
 security/yama/yama_lsm.c        |  9 ++++--
 15 files changed, 157 insertions(+), 57 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 81818adb8e9e..524cff7e0104 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -408,7 +408,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
-	permitted = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT);
+	permitted = proc_ptrace_may_access_seq(task,
+			PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT, m);
 	mm = get_task_mm(task);
 	if (mm) {
 		vsize = task_vsize(mm);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ca651ac00660..d3b20d3a01c9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -413,8 +413,9 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 
 	wchan = get_wchan(task);
 
-	if (wchan && ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)
-			&& !lookup_symbol_name(wchan, symname))
+	if (wchan &&
+	    proc_ptrace_may_access_seq(task, PTRACE_MODE_READ_FSCREDS, m) &&
+	    !lookup_symbol_name(wchan, symname))
 		seq_printf(m, "%s", symname);
 	else
 		seq_putc(m, '0');
@@ -423,12 +424,12 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_KALLSYMS */
 
-static int lock_trace(struct task_struct *task)
+static int lock_trace(struct seq_file *m, struct task_struct *task)
 {
 	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
 	if (err)
 		return err;
-	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
+	if (!proc_ptrace_may_access_seq(task, PTRACE_MODE_ATTACH_FSCREDS, m)) {
 		mutex_unlock(&task->signal->cred_guard_mutex);
 		return -EPERM;
 	}
@@ -461,7 +462,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
 	trace.entries		= entries;
 	trace.skip		= 0;
 
-	err = lock_trace(task);
+	err = lock_trace(m, task);
 	if (!err) {
 		save_stack_trace_tsk(task, &trace);
 
@@ -643,7 +644,7 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long args[6], sp, pc;
 	int res;
 
-	res = lock_trace(task);
+	res = lock_trace(m, task);
 	if (res)
 		return res;
 
@@ -753,7 +754,7 @@ static const struct inode_operations proc_def_inode_operations = {
 
 static int proc_single_show(struct seq_file *m, void *v)
 {
-	struct inode *inode = m->private;
+	struct inode *inode = m->file->f_inode;
 	struct pid_namespace *ns;
 	struct pid *pid;
 	struct task_struct *task;
@@ -773,14 +774,35 @@ static int proc_single_show(struct seq_file *m, void *v)
 
 static int proc_single_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, proc_single_show, inode);
+	int ret;
+	struct luid *privunit = kmalloc(sizeof(struct luid), GFP_KERNEL);
+
+	if (privunit == NULL)
+		return -ENOMEM;
+	*privunit = current->privunit;
+
+	ret = single_open(filp, proc_single_show, privunit);
+
+	if (ret)
+		kfree(privunit);
+	return ret;
+}
+
+static int proc_single_release(struct inode *inode, struct file *filp)
+{
+	struct seq_file *seq = filp->private_data;
+	struct luid *privunit = seq->private;
+
+	kfree(privunit);
+
+	return single_release(inode, filp);
 }
 
 static const struct file_operations proc_single_file_operations = {
 	.open		= proc_single_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= proc_single_release,
 };
 
 
@@ -2172,10 +2194,29 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	return ret;
 }
 
+int proc_open_get_privunit(struct inode *inode, struct file *filp)
+{
+	struct luid *privunit = kmalloc(sizeof(struct luid), GFP_KERNEL);
+
+	if (privunit == NULL)
+		return -ENOMEM;
+	*privunit = current->privunit;
+	filp->private_data = privunit;
+	return 0;
+}
+
+int proc_release_put_privunit(struct inode *inode, struct file *filp)
+{
+	kfree((struct luid *)filp->private_data);
+	return 0;
+}
+
 static const struct file_operations proc_map_files_operations = {
 	.read		= generic_read_dir,
 	.iterate_shared	= proc_map_files_readdir,
 	.llseek		= generic_file_llseek,
+	.open		= proc_open_get_privunit,
+	.release	= proc_release_put_privunit,
 };
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
@@ -2649,7 +2690,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	if (result)
 		return result;
 
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
+	if (!proc_ptrace_may_access_seq(task, PTRACE_MODE_READ_FSCREDS, m)) {
 		result = -EACCES;
 		goto out_unlock;
 	}
@@ -2833,7 +2874,7 @@ static const struct file_operations proc_setgroups_operations = {
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
 {
-	int err = lock_trace(task);
+	int err = lock_trace(m, task);
 	if (!err) {
 		seq_printf(m, "%08x\n", task->personality);
 		unlock_trace(task);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5378441ec1b7..bdbeec3a382a 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -14,6 +14,9 @@
 #include <linux/spinlock.h>
 #include <linux/atomic.h>
 #include <linux/binfmts.h>
+#include <linux/ptrace.h>
+#include <linux/pid.h>
+#include <linux/seq_file.h>
 
 struct ctl_table_header;
 struct mempolicy;
@@ -134,6 +137,15 @@ static inline unsigned name_to_int(const struct qstr *qstr)
 	return ~0U;
 }
 
+static inline bool proc_ptrace_may_access_seq(struct task_struct *task,
+			unsigned int mode, struct seq_file *m)
+{
+	const struct luid *privunit = m->private;
+
+	return ptrace_may_access_noncurrent(task, mode, m->file->f_cred,
+					    privunit);
+}
+
 /*
  * Offset of the first process in the /proc root directory..
  */
@@ -168,6 +180,8 @@ extern int pid_delete_dentry(const struct dentry *);
 extern int proc_pid_readdir(struct file *, struct dir_context *);
 extern struct dentry *proc_pid_lookup(struct inode *, struct dentry *, unsigned int);
 extern loff_t mem_lseek(struct file *, loff_t, int);
+extern int proc_open_get_privunit(struct inode *, struct file *);
+extern int proc_release_put_privunit(struct inode *, struct file *);
 
 /* Lookups */
 typedef int instantiate_t(struct inode *, struct dentry *,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 558adfa5c8a8..26f260c24f31 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1174,6 +1174,7 @@
  *	attributes would be changed by the execve.
  *	@child contains the task_struct structure for the target process.
  *	@mode contains the PTRACE_MODE flags indicating the form of access.
+ *	@cred contains the caller's credentials
  *	Return 0 if permission is granted.
  * @ptrace_traceme:
  *	Check that the @parent process has sufficient permission to trace the
@@ -1342,7 +1343,7 @@ union security_list_options {
 					struct file *file);
 
 	int (*ptrace_access_check)(struct task_struct *child,
-					unsigned int mode);
+			unsigned int mode, const struct cred *cred);
 	int (*ptrace_traceme)(struct task_struct *parent);
 	int (*capget)(struct task_struct *target, kernel_cap_t *effective,
 			kernel_cap_t *inheritable, kernel_cap_t *permitted);
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 504c98a278d4..12b3d2c64389 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -66,6 +66,9 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
 #define PTRACE_MODE_ATTACH_FSCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
 #define PTRACE_MODE_ATTACH_REALCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)
 
+/* ignore relationship between current and target task */
+#define PTRACE_MODE_NON_CURRENT 0x20
+
 /**
  * ptrace_may_access - check whether the caller is permitted to access
  * a target task.
@@ -81,6 +84,8 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
  * process_vm_writev or ptrace (and should use the real credentials).
  */
 extern bool ptrace_may_access(struct task_struct *task, unsigned int mode);
+bool ptrace_may_access_noncurrent(struct task_struct *task, unsigned int mode,
+		const struct cred *cred, const struct luid *privunit);
 
 static inline int ptrace_reparented(struct task_struct *child)
 {
diff --git a/include/linux/security.h b/include/linux/security.h
index c2125e9093e8..8b6ce7c8984d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -72,7 +72,8 @@ struct timezone;
 extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
 		       int cap, int audit);
 extern int cap_settime(const struct timespec64 *ts, const struct timezone *tz);
-extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
+extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode,
+				   const struct cred *cred);
 extern int cap_ptrace_traceme(struct task_struct *parent);
 extern int cap_capget(struct task_struct *target, kernel_cap_t *effective, kernel_cap_t *inheritable, kernel_cap_t *permitted);
 extern int cap_capset(struct cred *new, const struct cred *old,
@@ -191,7 +192,8 @@ int security_binder_transfer_binder(struct task_struct *from,
 				    struct task_struct *to);
 int security_binder_transfer_file(struct task_struct *from,
 				  struct task_struct *to, struct file *file);
-int security_ptrace_access_check(struct task_struct *child, unsigned int mode);
+int security_ptrace_access_check(struct task_struct *child, unsigned int mode,
+				 const struct cred *cred);
 int security_ptrace_traceme(struct task_struct *parent);
 int security_capget(struct task_struct *target,
 		    kernel_cap_t *effective,
@@ -419,9 +421,9 @@ static inline int security_binder_transfer_file(struct task_struct *from,
 }
 
 static inline int security_ptrace_access_check(struct task_struct *child,
-					     unsigned int mode)
+			unsigned int mode, const struct cred *cred)
 {
-	return cap_ptrace_access_check(child, mode);
+	return cap_ptrace_access_check(child, mode, cred);
 }
 
 static inline int security_ptrace_traceme(struct task_struct *parent)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index c3312e9e0078..321c69449f21 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -208,18 +208,21 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	return ret;
 }
 
-static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
+static int ptrace_has_cap(const struct cred *cred, struct user_namespace *ns,
+			  unsigned int mode)
 {
 	if (mode & PTRACE_MODE_NOAUDIT)
-		return has_ns_capability_noaudit(current, ns, CAP_SYS_PTRACE);
+		return security_capable_noaudit(cred, ns, CAP_SYS_PTRACE) == 0;
 	else
-		return has_ns_capability(current, ns, CAP_SYS_PTRACE);
+		return security_capable(cred, ns, CAP_SYS_PTRACE) == 0;
 }
 
 /* Returns 0 on success, -errno on denial. */
-static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
+static int __ptrace_may_access(struct task_struct *task, unsigned int mode,
+			       const struct cred *cred,
+			       const struct luid *privunit)
 {
-	const struct cred *cred = current_cred(), *tcred;
+	const struct cred *tcred;
 	int dumpable = 0;
 	kuid_t caller_uid;
 	kgid_t caller_gid;
@@ -239,8 +242,9 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	 */
 
 	/* Don't let security modules deny introspection */
-	if (same_thread_group(task, current))
+	if (luid_eq(privunit, &task->privunit))
 		return 0;
+
 	rcu_read_lock();
 	if (mode & PTRACE_MODE_FSCREDS) {
 		caller_uid = cred->fsuid;
@@ -265,7 +269,7 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	    gid_eq(caller_gid, tcred->sgid) &&
 	    gid_eq(caller_gid, tcred->gid))
 		goto ok;
-	if (ptrace_has_cap(tcred->user_ns, mode))
+	if (ptrace_has_cap(cred, tcred->user_ns, mode))
 		goto ok;
 	rcu_read_unlock();
 	return -EPERM;
@@ -276,13 +280,13 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 		dumpable = get_dumpable(task->mm);
 	rcu_read_lock();
 	if (dumpable != SUID_DUMP_USER &&
-	    !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
+	    !ptrace_has_cap(cred, __task_cred(task)->user_ns, mode)) {
 		rcu_read_unlock();
 		return -EPERM;
 	}
 	rcu_read_unlock();
 
-	return security_ptrace_access_check(task, mode);
+	return security_ptrace_access_check(task, mode, cred);
 }
 
 /*
@@ -299,7 +303,21 @@ bool ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	int err;
 	task_lock(task);
-	err = __ptrace_may_access(task, mode);
+	err = __ptrace_may_access(task, mode, current_cred(),
+				  &current->privunit);
+	task_unlock(task);
+	return !err;
+}
+
+bool ptrace_may_access_noncurrent(struct task_struct *task, unsigned int mode,
+				  const struct cred *cred,
+				  const struct luid *privunit)
+{
+	int err;
+
+	task_lock(task);
+	err = __ptrace_may_access(task, mode | PTRACE_MODE_NON_CURRENT, cred,
+				  privunit);
 	task_unlock(task);
 	return !err;
 }
@@ -340,7 +358,8 @@ static int ptrace_attach(struct task_struct *task, long request,
 		goto out;
 
 	task_lock(task);
-	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS,
+	    current_cred(), &current->privunit);
 	task_unlock(task);
 	if (retval)
 		goto unlock_creds;
diff --git a/security/apparmor/include/ipc.h b/security/apparmor/include/ipc.h
index 288ca76e2fb1..eb685c11fe83 100644
--- a/security/apparmor/include/ipc.h
+++ b/security/apparmor/include/ipc.h
@@ -22,7 +22,7 @@ struct aa_profile;
 int aa_may_ptrace(struct aa_profile *tracer, struct aa_profile *tracee,
 		  unsigned int mode);
 
-int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
+int aa_ptrace(struct aa_profile *tracer_p, struct task_struct *tracee,
 	      unsigned int mode);
 
 #endif /* __AA_IPC_H */
diff --git a/security/apparmor/ipc.c b/security/apparmor/ipc.c
index 777ac1c47253..c60b374885a2 100644
--- a/security/apparmor/ipc.c
+++ b/security/apparmor/ipc.c
@@ -82,7 +82,7 @@ int aa_may_ptrace(struct aa_profile *tracer, struct aa_profile *tracee,
  *
  * Returns: %0 else error code if permission denied or error
  */
-int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
+int aa_ptrace(struct aa_profile *tracer_p, struct task_struct *tracee,
 	      unsigned int mode)
 {
 	/*
@@ -94,7 +94,6 @@ int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
 	 *       - tracer profile has CAP_SYS_PTRACE
 	 */
 
-	struct aa_profile *tracer_p = aa_get_task_profile(tracer);
 	int error = 0;
 
 	if (!unconfined(tracer_p)) {
@@ -105,7 +104,6 @@ int aa_ptrace(struct task_struct *tracer, struct task_struct *tracee,
 
 		aa_put_profile(tracee_p);
 	}
-	aa_put_profile(tracer_p);
 
 	return error;
 }
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 41b8cb115801..dd94df2c0167 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -94,14 +94,22 @@ static void apparmor_cred_transfer(struct cred *new, const struct cred *old)
 }
 
 static int apparmor_ptrace_access_check(struct task_struct *child,
-					unsigned int mode)
+				 unsigned int mode, const struct cred *cred)
 {
-	return aa_ptrace(current, child, mode);
+	struct aa_profile *tracer_p = aa_get_profile(aa_cred_profile(cred));
+	int res = aa_ptrace(tracer_p, child, mode);
+
+	aa_put_profile(tracer_p);
+	return res;
 }
 
 static int apparmor_ptrace_traceme(struct task_struct *parent)
 {
-	return aa_ptrace(parent, current, PTRACE_MODE_ATTACH);
+	struct aa_profile *tracer_p = aa_get_task_profile(parent);
+	int res = aa_ptrace(tracer_p, current, PTRACE_MODE_ATTACH);
+
+	aa_put_profile(tracer_p);
+	return res;
 }
 
 /* Derived from security/commoncap.c:cap_capget */
diff --git a/security/commoncap.c b/security/commoncap.c
index 8df676fbd393..ffbb3ea18720 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -133,14 +133,14 @@ int cap_settime(const struct timespec64 *ts, const struct timezone *tz)
  * Determine whether a process may access another, returning 0 if permission
  * granted, -ve if denied.
  */
-int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
+int cap_ptrace_access_check(struct task_struct *child, unsigned int mode,
+			    const struct cred *cred)
 {
 	int ret = 0;
-	const struct cred *cred, *child_cred;
+	const struct cred *child_cred;
 	const kernel_cap_t *caller_caps;
 
 	rcu_read_lock();
-	cred = current_cred();
 	child_cred = __task_cred(child);
 	if (mode & PTRACE_MODE_FSCREDS)
 		caller_caps = &cred->cap_effective;
@@ -149,7 +149,7 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 	if (cred->user_ns == child_cred->user_ns &&
 	    cap_issubset(child_cred->cap_permitted, *caller_caps))
 		goto out;
-	if (ns_capable(child_cred->user_ns, CAP_SYS_PTRACE))
+	if (security_capable(cred, child_cred->user_ns, CAP_SYS_PTRACE) == 0)
 		goto out;
 	ret = -EPERM;
 out:
diff --git a/security/security.c b/security/security.c
index f825304f04a7..cb375573f021 100644
--- a/security/security.c
+++ b/security/security.c
@@ -154,9 +154,10 @@ int security_binder_transfer_file(struct task_struct *from,
 	return call_int_hook(binder_transfer_file, 0, from, to, file);
 }
 
-int security_ptrace_access_check(struct task_struct *child, unsigned int mode)
+int security_ptrace_access_check(struct task_struct *child, unsigned int mode,
+				 const struct cred *cred)
 {
-	return call_int_hook(ptrace_access_check, 0, child, mode);
+	return call_int_hook(ptrace_access_check, 0, child, mode, cred);
 }
 
 int security_ptrace_traceme(struct task_struct *parent)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 09fd6108e421..8c383176a846 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2128,15 +2128,16 @@ static int selinux_binder_transfer_file(struct task_struct *from,
 }
 
 static int selinux_ptrace_access_check(struct task_struct *child,
-				     unsigned int mode)
+				     unsigned int mode, const struct cred *cred)
 {
-	if (mode & PTRACE_MODE_READ) {
-		u32 sid = current_sid();
-		u32 csid = task_sid(child);
-		return avc_has_perm(sid, csid, SECCLASS_FILE, FILE__READ, NULL);
-	}
+	u32 sid = cred_sid(cred);
+	u32 csid = task_sid(child);
 
-	return current_has_perm(child, PROCESS__PTRACE);
+	if (mode & PTRACE_MODE_READ)
+		return avc_has_perm(sid, csid, SECCLASS_FILE, FILE__READ, NULL);
+	else
+		return avc_has_perm(sid, csid, SECCLASS_PROCESS,
+				    PROCESS__PTRACE, NULL);
 }
 
 static int selinux_ptrace_traceme(struct task_struct *parent)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 1cb060293505..6aef4b558e73 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -408,7 +408,8 @@ static inline unsigned int smk_ptrace_mode(unsigned int mode)
 
 /**
  * smk_ptrace_rule_check - helper for ptrace access
- * @tracer: tracer process
+ * @tracer: tracer process - only used for auditing if @tracer_cred is set
+ * @tracer_cred: creds for checks, may be different from @tracer's creds
  * @tracee_known: label entry of the process that's about to be traced
  * @mode: ptrace attachment mode (PTRACE_MODE_*)
  * @func: name of the function that called us, used for audit
@@ -416,6 +417,7 @@ static inline unsigned int smk_ptrace_mode(unsigned int mode)
  * Returns 0 on access granted, -error on error
  */
 static int smk_ptrace_rule_check(struct task_struct *tracer,
+				 const struct cred *tracer_cred,
 				 struct smack_known *tracee_known,
 				 unsigned int mode, const char *func)
 {
@@ -431,7 +433,9 @@ static int smk_ptrace_rule_check(struct task_struct *tracer,
 	}
 
 	rcu_read_lock();
-	tsp = __task_cred(tracer)->security;
+	if (tracer_cred == NULL)
+		tracer_cred = __task_cred(tracer);
+	tsp = tracer_cred->security;
 	tracer_known = smk_of_task(tsp);
 
 	if ((mode & PTRACE_MODE_ATTACH) &&
@@ -476,13 +480,14 @@ static int smk_ptrace_rule_check(struct task_struct *tracer,
  *
  * Do the capability checks.
  */
-static int smack_ptrace_access_check(struct task_struct *ctp, unsigned int mode)
+static int smack_ptrace_access_check(struct task_struct *ctp, unsigned int mode,
+				     const struct cred *cred)
 {
 	struct smack_known *skp;
 
 	skp = smk_of_task_struct(ctp);
 
-	return smk_ptrace_rule_check(current, skp, mode, __func__);
+	return smk_ptrace_rule_check(current, cred, skp, mode, __func__);
 }
 
 /**
@@ -500,7 +505,8 @@ static int smack_ptrace_traceme(struct task_struct *ptp)
 
 	skp = smk_of_task(current_security());
 
-	rc = smk_ptrace_rule_check(ptp, skp, PTRACE_MODE_ATTACH, __func__);
+	rc = smk_ptrace_rule_check(ptp, NULL, skp, PTRACE_MODE_ATTACH,
+				   __func__);
 	return rc;
 }
 
@@ -941,7 +947,7 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
 		rcu_read_lock();
 		tracer = ptrace_parent(current);
 		if (likely(tracer != NULL))
-			rc = smk_ptrace_rule_check(tracer,
+			rc = smk_ptrace_rule_check(tracer, NULL,
 						   isp->smk_task,
 						   PTRACE_MODE_ATTACH,
 						   __func__);
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 0309f2111c70..32a29625d873 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -347,7 +347,7 @@ static int ptracer_exception_found(struct task_struct *tracer,
  * Returns 0 if following the ptrace is allowed, -ve on error.
  */
 static int yama_ptrace_access_check(struct task_struct *child,
-				    unsigned int mode)
+				    unsigned int mode, const struct cred *cred)
 {
 	int rc = 0;
 
@@ -359,7 +359,9 @@ static int yama_ptrace_access_check(struct task_struct *child,
 			break;
 		case YAMA_SCOPE_RELATIONAL:
 			rcu_read_lock();
-			if (!task_is_descendant(current, child) &&
+			/* fail open for some procfs-based ATTACH accesses! */
+			if ((mode & PTRACE_MODE_NON_CURRENT) == 0 &&
+			    !task_is_descendant(current, child) &&
 			    !ptracer_exception_found(current, child) &&
 			    !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
 				rc = -EPERM;
@@ -367,7 +369,8 @@ static int yama_ptrace_access_check(struct task_struct *child,
 			break;
 		case YAMA_SCOPE_CAPABILITY:
 			rcu_read_lock();
-			if (!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
+			if (security_capable(cred, __task_cred(child)->user_ns,
+					     CAP_SYS_PTRACE) != 0)
 				rc = -EPERM;
 			rcu_read_unlock();
 			break;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 4/8] futex: don't leak robust_list pointer
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (2 preceding siblings ...)
  2016-10-30 21:46 ` [PATCH v3 3/8] proc: use open()-time creds for ptrace checks Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-10-30 21:46 ` [PATCH v3 5/8] proc: lock properly in ptrace_may_access callers Jann Horn
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

This prevents an attacker from determining the robust_list or
compat_robust_list userspace pointer of a process created by executing
a setuid binary. Such an attack could be performed by racing
get_robust_list() with a setuid execution. The impact of this issue is that
an attacker could theoretically bypass ASLR when attacking setuid binaries.

changed in v2:
 - only get_task_struct(p) if p!=NULL (Ben Hutchings)
 - move the -ESRCH return check

Signed-off-by: Jann Horn <jann@thejh.net>
---
 kernel/futex.c        | 30 +++++++++++++++++++++---------
 kernel/futex_compat.c | 30 +++++++++++++++++++++---------
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2c4be467fecd..f0697e0d55b8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3016,31 +3016,43 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
 
-	rcu_read_lock();
-
-	ret = -ESRCH;
-	if (!pid)
+	if (!pid) {
 		p = current;
-	else {
+		get_task_struct(p);
+	} else {
+		rcu_read_lock();
 		p = find_task_by_vpid(pid);
+		/* pin the task to permit dropping the RCU read lock before
+		 * acquiring the mutex
+		 */
+		if (p)
+			get_task_struct(p);
+		rcu_read_unlock();
 		if (!p)
-			goto err_unlock;
+			return -ESRCH;
 	}
 
+	ret = mutex_lock_killable(&p->signal->cred_guard_light);
+	if (ret)
+		goto err_put;
+
 	ret = -EPERM;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
 		goto err_unlock;
 
 	head = p->robust_list;
-	rcu_read_unlock();
+
+	mutex_unlock(&p->signal->cred_guard_light);
+	put_task_struct(p);
 
 	if (put_user(sizeof(*head), len_ptr))
 		return -EFAULT;
 	return put_user(head, head_ptr);
 
 err_unlock:
-	rcu_read_unlock();
-
+	mutex_unlock(&current->signal->cred_guard_light);
+err_put:
+	put_task_struct(p);
 	return ret;
 }
 
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 4ae3232e7a28..92c350f05ff8 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -143,31 +143,43 @@ COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid,
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
 
-	rcu_read_lock();
-
-	ret = -ESRCH;
-	if (!pid)
+	if (!pid) {
 		p = current;
-	else {
+		get_task_struct(p);
+	} else {
+		rcu_read_lock();
 		p = find_task_by_vpid(pid);
+		/* pin the task to permit dropping the RCU read lock before
+		 * acquiring the mutex
+		 */
+		if (p)
+			get_task_struct(p);
+		rcu_read_unlock();
 		if (!p)
-			goto err_unlock;
+			return -ESRCH;
 	}
 
+	ret = mutex_lock_killable(&p->signal->cred_guard_light);
+	if (ret)
+		goto err_put;
+
 	ret = -EPERM;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
 		goto err_unlock;
 
 	head = p->compat_robust_list;
-	rcu_read_unlock();
+
+	mutex_unlock(&p->signal->cred_guard_light);
+	put_task_struct(p);
 
 	if (put_user(sizeof(*head), len_ptr))
 		return -EFAULT;
 	return put_user(ptr_to_compat(head), head_ptr);
 
 err_unlock:
-	rcu_read_unlock();
-
+	mutex_unlock(&current->signal->cred_guard_light);
+err_put:
+	put_task_struct(p);
 	return ret;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 5/8] proc: lock properly in ptrace_may_access callers
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (3 preceding siblings ...)
  2016-10-30 21:46 ` [PATCH v3 4/8] futex: don't leak robust_list pointer Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-10-30 21:46 ` [PATCH v3 6/8] fs/proc: fix attr access check Jann Horn
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

Use the new cred_guard_light to prevent information leaks
through races in procfs.

changed in v2:
 - also use mm_access() for proc_map_files_readdir() (0day test robot)

changed in v3:
 - drop the lock_trace() stuff (Oleg Nesterov)

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/proc/array.c      |  7 +++++
 fs/proc/base.c       | 77 +++++++++++++++++++++++++++++++++-------------------
 fs/proc/namespaces.c | 21 +++++++++++---
 3 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 524cff7e0104..e180d4d82d21 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -405,9 +405,15 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
+	int err;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
+
+	err = mutex_lock_killable(&task->signal->cred_guard_light);
+	if (err)
+		return err;
+
 	permitted = proc_ptrace_may_access_seq(task,
 			PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT, m);
 	mm = get_task_mm(task);
@@ -564,6 +570,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
+	mutex_unlock(&task->signal->cred_guard_light);
 	return 0;
 }
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d3b20d3a01c9..32ea9bc3d320 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -668,23 +668,39 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 /*                       Here the fs part begins                        */
 /************************************************************************/
 
-/* permission checks */
-static int proc_fd_access_allowed(struct inode *inode)
+/* permission checks.
+ * If this returns 1, you'll have to proc_fd_access_allowed_unlock(*taskp)
+ * afterwards.
+ */
+static int proc_fd_access_allowed_lock(struct inode *inode,
+					 struct task_struct **taskp)
 {
 	struct task_struct *task;
 	int allowed = 0;
-	/* Allow access to a task's file descriptors if it is us or we
-	 * may use ptrace attach to the process and find out that
-	 * information.
-	 */
+
 	task = get_proc_task(inode);
-	if (task) {
-		allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (!task)
+		return 0;
+	if (mutex_lock_killable(&task->signal->cred_guard_light))
+		goto out_put;
+	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (!allowed)
+		mutex_unlock(&task->signal->cred_guard_light);
+out_put:
+	if (allowed)
+		*taskp = task;
+	else
 		put_task_struct(task);
-	}
+
 	return allowed;
 }
 
+static void proc_fd_access_allowed_unlock(struct task_struct *task)
+{
+	mutex_unlock(&task->signal->cred_guard_light);
+	put_task_struct(task);
+}
+
 int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
@@ -705,6 +721,8 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 /*
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
+ * NOTE: When you call this, you should hold cred_guard_mutex or
+ * cred_guard_light.
  */
 static bool has_pid_permissions(struct pid_namespace *pid,
 				 struct task_struct *task,
@@ -1615,15 +1633,17 @@ static const char *proc_pid_get_link(struct dentry *dentry,
 {
 	struct path path;
 	int error = -EACCES;
+	struct task_struct *task;
 
 	if (!dentry)
 		return ERR_PTR(-ECHILD);
 
 	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
+	if (!proc_fd_access_allowed_lock(inode, &task))
 		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
+	proc_fd_access_allowed_unlock(task);
 	if (error)
 		goto out;
 
@@ -1662,12 +1682,14 @@ static int proc_pid_readlink(struct dentry * dentry, char __user * buffer, int b
 	int error = -EACCES;
 	struct inode *inode = d_inode(dentry);
 	struct path path;
+	struct task_struct *task;
 
 	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
+	if (!proc_fd_access_allowed_lock(inode, &task))
 		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
+	proc_fd_access_allowed_unlock(task);
 	if (error)
 		goto out;
 
@@ -2062,17 +2084,15 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	if (!task)
 		goto out;
 
-	result = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR(mm))
+		result = PTR_ERR(mm);
+	if (IS_ERR_OR_NULL(mm))
 		goto out_put_task;
 
 	result = -ENOENT;
 	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
-		goto out_put_task;
-
-	mm = get_task_mm(task);
-	if (!mm)
-		goto out_put_task;
+		goto out_put_mm;
 
 	down_read(&mm->mmap_sem);
 	vma = find_exact_vma(mm, vm_start, vm_end);
@@ -2085,6 +2105,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 
 out_no_vma:
 	up_read(&mm->mmap_sem);
+out_put_mm:
 	mmput(mm);
 out_put_task:
 	put_task_struct(task);
@@ -2115,17 +2136,17 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	if (!task)
 		goto out;
 
-	ret = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+	ret = 0;
+
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR(mm))
+		ret = PTR_ERR(mm);
+	if (IS_ERR_OR_NULL(mm))
 		goto out_put_task;
 
-	ret = 0;
 	if (!dir_emit_dots(file, ctx))
-		goto out_put_task;
+		goto out_mmput;
 
-	mm = get_task_mm(task);
-	if (!mm)
-		goto out_put_task;
 	down_read(&mm->mmap_sem);
 
 	nr_files = 0;
@@ -2154,8 +2175,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 			if (fa)
 				flex_array_free(fa);
 			up_read(&mm->mmap_sem);
-			mmput(mm);
-			goto out_put_task;
+			goto out_mmput;
 		}
 		for (i = 0, vma = mm->mmap, pos = 2; vma;
 				vma = vma->vm_next) {
@@ -2186,8 +2206,9 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	}
 	if (fa)
 		flex_array_free(fa);
-	mmput(mm);
 
+out_mmput:
+	mmput(mm);
 out_put_task:
 	put_task_struct(task);
 out:
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 51b8b0a8ad91..c8dee46939df 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -49,11 +49,18 @@ static const char *proc_ns_get_link(struct dentry *dentry,
 	if (!task)
 		return error;
 
+	error = ERR_PTR(mutex_lock_killable(&task->signal->cred_guard_light));
+	if (error)
+		goto out_put_task;
+
+	error = ERR_PTR(-EACCES);
 	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
 		error = ns_get_path(&ns_path, task, ns_ops);
 		if (!error)
 			nd_jump_link(&ns_path);
 	}
+	mutex_unlock(&task->signal->cred_guard_light);
+out_put_task:
 	put_task_struct(task);
 	return error;
 }
@@ -70,11 +77,17 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
 	if (!task)
 		return res;
 
-	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
+	res = mutex_lock_killable(&task->signal->cred_guard_light);
+	if (res)
+		goto out_put_task;
+
+	res = -EACCES;
+	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
 		res = ns_get_name(name, sizeof(name), task, ns_ops);
-		if (res >= 0)
-			res = readlink_copy(buffer, buflen, name);
-	}
+	mutex_unlock(&task->signal->cred_guard_light);
+	if (res >= 0)
+		res = readlink_copy(buffer, buflen, name);
+out_put_task:
 	put_task_struct(task);
 	return res;
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 6/8] fs/proc: fix attr access check
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (4 preceding siblings ...)
  2016-10-30 21:46 ` [PATCH v3 5/8] proc: lock properly in ptrace_may_access callers Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-10-30 21:46 ` [PATCH v3 7/8] proc: fix timerslack_ns handling Jann Horn
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

Make sure files in /proc/$pid/attr/ can only be written by the task that
opened them.

This prevents an attacking process from changing the security context of
another process that it can force to write attacker-controlled data into an
attacker-supplied file descriptor. I'm not sure what the impact of this is.

changed in v2:
 - changed privunit-using code

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/proc/base.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 32ea9bc3d320..cbba490543e2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2518,6 +2518,18 @@ static int proc_pident_readdir(struct file *file, struct dir_context *ctx,
 }
 
 #ifdef CONFIG_SECURITY
+static int proc_pid_attr_open(struct inode *inode, struct file *file)
+{
+	struct luid *opener_privunit;
+
+	opener_privunit = kmalloc(sizeof(struct luid), GFP_KERNEL);
+	if (opener_privunit == NULL)
+		return -ENOMEM;
+	*opener_privunit = current->privunit;
+	file->private_data = opener_privunit;
+	return 0;
+}
+
 static ssize_t proc_pid_attr_read(struct file * file, char __user * buf,
 				  size_t count, loff_t *ppos)
 {
@@ -2546,6 +2558,7 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	void *page;
 	ssize_t length;
 	struct task_struct *task = get_proc_task(inode);
+	struct luid *opener_privunit = file->private_data;
 
 	length = -ESRCH;
 	if (!task)
@@ -2569,9 +2582,29 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (length < 0)
 		goto out_free;
 
+	/*
+	 * Ensure that a process can't be tricked into writing into its own attr
+	 * files without intending to do so.
+	 *
+	 * SELinux has a rule that prevents anyone other than `task` from
+	 * writing, but if the fd stays open across execve, or is sent across a
+	 * unix domain socket or whatever, that is bypassable.
+	 * Same thing in AppArmor and in Smack.
+	 *
+	 * To prevent this, compare the opener's exec_id with the target's to
+	 * ensure that they're in the same task group and no exec happened in
+	 * the meantime.
+	 *
+	 * Why is this a file and not a prctl or whatever. :/
+	 */
+	length = -EACCES;
+	if (!luid_eq(opener_privunit, &task->privunit))
+		goto out_unlock;
+
 	length = security_setprocattr(task,
 				      (char*)file->f_path.dentry->d_name.name,
 				      page, count);
+out_unlock:
 	mutex_unlock(&task->signal->cred_guard_mutex);
 out_free:
 	kfree(page);
@@ -2581,10 +2614,20 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	return length;
 }
 
+static int proc_pid_attr_release(struct inode *inode, struct file *file)
+{
+	struct luid *opener_privunit = file->private_data;
+
+	kfree(opener_privunit);
+	return 0;
+}
+
 static const struct file_operations proc_pid_attr_operations = {
+	.open		= proc_pid_attr_open,
 	.read		= proc_pid_attr_read,
 	.write		= proc_pid_attr_write,
 	.llseek		= generic_file_llseek,
+	.release	= proc_pid_attr_release,
 };
 
 static const struct pid_entry attr_dir_stuff[] = {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 7/8] proc: fix timerslack_ns handling
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (5 preceding siblings ...)
  2016-10-30 21:46 ` [PATCH v3 6/8] fs/proc: fix attr access check Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-10-30 21:46 ` [PATCH v3 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
  2016-11-01 23:57 ` [PATCH v3 0/8] Various fixes related to ptrace_may_access() Linus Torvalds
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

This changes the privilege checks in /proc/*/timerslack_ns to use the
privileges of the file opener instead of the privileges of current (which
must not be used in privilege checks).

The Smack LSM hook is not changed because I'm not sure how to properly
check the privileges of a non-current credential struct in Smack.

Don't take any mutex here for now - since the timerslack isn't reset on
execve anyway, that should be fine.

Note that this, as a side effect, also permits cross-thread access, which
was previously disallowed.

Signed-off-by: Jann Horn <jann@thejh.net>
---
 arch/mips/kernel/mips-mt-fpaff.c |  4 ++--
 fs/proc/base.c                   | 37 +++++++++++++++++++++++++++----------
 include/linux/lsm_hooks.h        | 14 ++++++++------
 include/linux/security.h         | 13 ++++++++-----
 kernel/cpuset.c                  |  2 +-
 kernel/sched/core.c              | 14 +++++++-------
 security/commoncap.c             | 16 +++++++++-------
 security/security.c              |  8 ++++----
 security/selinux/hooks.c         | 20 ++++++++++++++++----
 security/smack/smack_lsm.c       |  9 +++++++--
 10 files changed, 89 insertions(+), 48 deletions(-)

diff --git a/arch/mips/kernel/mips-mt-fpaff.c b/arch/mips/kernel/mips-mt-fpaff.c
index 789d7bf4fef3..13a2ad1ac021 100644
--- a/arch/mips/kernel/mips-mt-fpaff.c
+++ b/arch/mips/kernel/mips-mt-fpaff.c
@@ -103,7 +103,7 @@ asmlinkage long mipsmt_sys_sched_setaffinity(pid_t pid, unsigned int len,
 	if (!check_same_owner(p) && !capable(CAP_SYS_NICE))
 		goto out_unlock;
 
-	retval = security_task_setscheduler(p);
+	retval = security_task_setscheduler(p, current_cred());
 	if (retval)
 		goto out_unlock;
 
@@ -169,7 +169,7 @@ asmlinkage long mipsmt_sys_sched_getaffinity(pid_t pid, unsigned int len,
 	p = find_process_by_pid(pid);
 	if (!p)
 		goto out_unlock;
-	retval = security_task_getscheduler(p);
+	retval = security_task_getscheduler(p, current_cred());
 	if (retval)
 		goto out_unlock;
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index cbba490543e2..3f87cef3b5c1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2345,6 +2345,8 @@ static ssize_t timerslack_ns_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *offset)
 {
 	struct inode *inode = file_inode(file);
+	struct luid *opener_privunit =
+		((struct seq_file *)file->private_data)->private;
 	struct task_struct *p;
 	u64 slack_ns;
 	int err;
@@ -2357,13 +2359,13 @@ static ssize_t timerslack_ns_write(struct file *file, const char __user *buf,
 	if (!p)
 		return -ESRCH;
 
-	if (p != current) {
-		if (!capable(CAP_SYS_NICE)) {
+	if (!luid_eq(&p->privunit, opener_privunit)) {
+		if (!file_ns_capable(file, &init_user_ns, CAP_SYS_NICE)) {
 			count = -EPERM;
 			goto out;
 		}
 
-		err = security_task_setscheduler(p);
+		err = security_task_setscheduler(p, file->f_cred);
 		if (err) {
 			count = err;
 			goto out;
@@ -2385,7 +2387,9 @@ static ssize_t timerslack_ns_write(struct file *file, const char __user *buf,
 
 static int timerslack_ns_show(struct seq_file *m, void *v)
 {
-	struct inode *inode = m->private;
+	struct inode *inode = file_inode(m->file);
+	struct luid *opener_privunit = m->private;
+
 	struct task_struct *p;
 	int err = 0;
 
@@ -2393,13 +2397,12 @@ static int timerslack_ns_show(struct seq_file *m, void *v)
 	if (!p)
 		return -ESRCH;
 
-	if (p != current) {
-
-		if (!capable(CAP_SYS_NICE)) {
+	if (!luid_eq(&p->privunit, opener_privunit)) {
+		if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_NICE)) {
 			err = -EPERM;
 			goto out;
 		}
-		err = security_task_getscheduler(p);
+		err = security_task_getscheduler(p, m->file->f_cred);
 		if (err)
 			goto out;
 	}
@@ -2416,7 +2419,21 @@ static int timerslack_ns_show(struct seq_file *m, void *v)
 
 static int timerslack_ns_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, timerslack_ns_show, inode);
+	struct luid *privunit = kmalloc(sizeof(*privunit), GFP_KERNEL);
+
+	if (!privunit)
+		return -ENOMEM;
+	*privunit = current->privunit;
+	return single_open(filp, timerslack_ns_show, privunit);
+}
+
+static int timerslack_ns_release(struct inode *inode, struct file *file)
+{
+	struct luid *opener_privunit =
+		((struct seq_file *)file->private_data)->private;
+
+	kfree(opener_privunit);
+	return single_release(inode, file);
 }
 
 static const struct file_operations proc_pid_set_timerslack_ns_operations = {
@@ -2424,7 +2441,7 @@ static const struct file_operations proc_pid_set_timerslack_ns_operations = {
 	.read		= seq_read,
 	.write		= timerslack_ns_write,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= timerslack_ns_release,
 };
 
 static int proc_pident_instantiate(struct inode *dir,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 26f260c24f31..c84c92a52386 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -640,15 +640,15 @@
  *	Return 0 if permission is granted.
  * @task_setscheduler:
  *	Check permission before setting scheduling policy and/or parameters of
- *	process @p based on @policy and @lp.
+ *	process @p on behalf of subject @cred.
  *	@p contains the task_struct for process.
- *	@policy contains the scheduling policy.
- *	@lp contains the scheduling parameters.
+ *	@cred contains the credentials of the caller.
  *	Return 0 if permission is granted.
  * @task_getscheduler:
  *	Check permission before obtaining scheduling information for process
- *	@p.
+ *	@p on behalf of subject @cred.
  *	@p contains the task_struct for process.
+ *	@cred contains the credentials of the caller.
  *	Return 0 if permission is granted.
  * @task_movememory
  *	Check permission before moving memory owned by process @p.
@@ -1503,8 +1503,10 @@ union security_list_options {
 	int (*task_getioprio)(struct task_struct *p);
 	int (*task_setrlimit)(struct task_struct *p, unsigned int resource,
 				struct rlimit *new_rlim);
-	int (*task_setscheduler)(struct task_struct *p);
-	int (*task_getscheduler)(struct task_struct *p);
+	int (*task_setscheduler)(struct task_struct *p,
+				 const struct cred *cred);
+	int (*task_getscheduler)(struct task_struct *p,
+				 const struct cred *cred);
 	int (*task_movememory)(struct task_struct *p);
 	int (*task_kill)(struct task_struct *p, struct siginfo *info,
 				int sig, u32 secid);
diff --git a/include/linux/security.h b/include/linux/security.h
index 8b6ce7c8984d..1e54ef927537 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -93,7 +93,8 @@ extern int cap_mmap_file(struct file *file, unsigned long reqprot,
 extern int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags);
 extern int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			  unsigned long arg4, unsigned long arg5);
-extern int cap_task_setscheduler(struct task_struct *p);
+extern int cap_task_setscheduler(struct task_struct *p,
+				 const struct cred *cred);
 extern int cap_task_setioprio(struct task_struct *p, int ioprio);
 extern int cap_task_setnice(struct task_struct *p, int nice);
 extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
@@ -329,8 +330,8 @@ int security_task_setioprio(struct task_struct *p, int ioprio);
 int security_task_getioprio(struct task_struct *p);
 int security_task_setrlimit(struct task_struct *p, unsigned int resource,
 		struct rlimit *new_rlim);
-int security_task_setscheduler(struct task_struct *p);
-int security_task_getscheduler(struct task_struct *p);
+int security_task_setscheduler(struct task_struct *p, const struct cred *cred);
+int security_task_getscheduler(struct task_struct *p, const struct cred *cred);
 int security_task_movememory(struct task_struct *p);
 int security_task_kill(struct task_struct *p, struct siginfo *info,
 			int sig, u32 secid);
@@ -960,12 +961,14 @@ static inline int security_task_setrlimit(struct task_struct *p,
 	return 0;
 }
 
-static inline int security_task_setscheduler(struct task_struct *p)
+static inline int security_task_setscheduler(struct task_struct *p,
+					     const struct cred *cred)
 {
 	return cap_task_setscheduler(p);
 }
 
-static inline int security_task_getscheduler(struct task_struct *p)
+static inline int security_task_getscheduler(struct task_struct *p,
+					     const struct cred *cred)
 {
 	return 0;
 }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 29f815d2ef7e..e3a797b4dde3 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1480,7 +1480,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
 			goto out_unlock;
-		ret = security_task_setscheduler(task);
+		ret = security_task_setscheduler(task, current_cred());
 		if (ret)
 			goto out_unlock;
 	}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027f9e26..37f1dc2ea7d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4165,7 +4165,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	if (user) {
-		retval = security_task_setscheduler(p);
+		retval = security_task_setscheduler(p, current_cred());
 		if (retval)
 			return retval;
 	}
@@ -4544,7 +4544,7 @@ SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
 	rcu_read_lock();
 	p = find_process_by_pid(pid);
 	if (p) {
-		retval = security_task_getscheduler(p);
+		retval = security_task_getscheduler(p, current_cred());
 		if (!retval)
 			retval = p->policy
 				| (p->sched_reset_on_fork ? SCHED_RESET_ON_FORK : 0);
@@ -4576,7 +4576,7 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
 	if (!p)
 		goto out_unlock;
 
-	retval = security_task_getscheduler(p);
+	retval = security_task_getscheduler(p, current_cred());
 	if (retval)
 		goto out_unlock;
 
@@ -4658,7 +4658,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	if (!p)
 		goto out_unlock;
 
-	retval = security_task_getscheduler(p);
+	retval = security_task_getscheduler(p, current_cred());
 	if (retval)
 		goto out_unlock;
 
@@ -4722,7 +4722,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 		rcu_read_unlock();
 	}
 
-	retval = security_task_setscheduler(p);
+	retval = security_task_setscheduler(p, current_cred());
 	if (retval)
 		goto out_free_new_mask;
 
@@ -4819,7 +4819,7 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
 	if (!p)
 		goto out_unlock;
 
-	retval = security_task_getscheduler(p);
+	retval = security_task_getscheduler(p, current_cred());
 	if (retval)
 		goto out_unlock;
 
@@ -5164,7 +5164,7 @@ SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
 	if (!p)
 		goto out_unlock;
 
-	retval = security_task_getscheduler(p);
+	retval = security_task_getscheduler(p, current_cred());
 	if (retval)
 		goto out_unlock;
 
diff --git a/security/commoncap.c b/security/commoncap.c
index ffbb3ea18720..a2508cd3aab9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -813,14 +813,15 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
  * yet with increased caps.
  * So we check for increased caps on the target process.
  */
-static int cap_safe_nice(struct task_struct *p)
+static int cap_safe_nice(struct task_struct *p, const struct cred *cred)
 {
 	int is_subset, ret = 0;
 
 	rcu_read_lock();
 	is_subset = cap_issubset(__task_cred(p)->cap_permitted,
-				 current_cred()->cap_permitted);
-	if (!is_subset && !ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE))
+				 cred->cap_permitted);
+	if (!is_subset &&
+	    security_capable(cred, __task_cred(p)->user_ns, CAP_SYS_NICE) != 0)
 		ret = -EPERM;
 	rcu_read_unlock();
 
@@ -830,13 +831,14 @@ static int cap_safe_nice(struct task_struct *p)
 /**
  * cap_task_setscheduler - Detemine if scheduler policy change is permitted
  * @p: The task to affect
+ * @cred: The caller's credentials
  *
  * Detemine if the requested scheduler policy change is permitted for the
  * specified task, returning 0 if permission is granted, -ve if denied.
  */
-int cap_task_setscheduler(struct task_struct *p)
+int cap_task_setscheduler(struct task_struct *p, const struct cred *cred)
 {
-	return cap_safe_nice(p);
+	return cap_safe_nice(p, cred);
 }
 
 /**
@@ -849,7 +851,7 @@ int cap_task_setscheduler(struct task_struct *p)
  */
 int cap_task_setioprio(struct task_struct *p, int ioprio)
 {
-	return cap_safe_nice(p);
+	return cap_safe_nice(p, current_cred());
 }
 
 /**
@@ -862,7 +864,7 @@ int cap_task_setioprio(struct task_struct *p, int ioprio)
  */
 int cap_task_setnice(struct task_struct *p, int nice)
 {
-	return cap_safe_nice(p);
+	return cap_safe_nice(p, current_cred());
 }
 
 /*
diff --git a/security/security.c b/security/security.c
index cb375573f021..846012b5502b 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1005,14 +1005,14 @@ int security_task_setrlimit(struct task_struct *p, unsigned int resource,
 	return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
 }
 
-int security_task_setscheduler(struct task_struct *p)
+int security_task_setscheduler(struct task_struct *p, const struct cred *cred)
 {
-	return call_int_hook(task_setscheduler, 0, p);
+	return call_int_hook(task_setscheduler, 0, p, cred);
 }
 
-int security_task_getscheduler(struct task_struct *p)
+int security_task_getscheduler(struct task_struct *p, const struct cred *cred)
 {
-	return call_int_hook(task_getscheduler, 0, p);
+	return call_int_hook(task_getscheduler, 0, p, cred);
 }
 
 int security_task_movememory(struct task_struct *p)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 8c383176a846..9a19da9f1ea2 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3910,14 +3910,26 @@ static int selinux_task_setrlimit(struct task_struct *p, unsigned int resource,
 	return 0;
 }
 
-static int selinux_task_setscheduler(struct task_struct *p)
+static int selinux_task_setscheduler(struct task_struct *p,
+				     const struct cred *cred)
 {
-	return current_has_perm(p, PROCESS__SETSCHED);
+	u32 sid, tsid;
+
+	sid = cred_sid(cred);
+	tsid = task_sid(p);
+	return avc_has_perm(sid, tsid, SECCLASS_PROCESS, PROCESS__SETSCHED,
+			    NULL);
 }
 
-static int selinux_task_getscheduler(struct task_struct *p)
+static int selinux_task_getscheduler(struct task_struct *p,
+				     const struct cred *cred)
 {
-	return current_has_perm(p, PROCESS__GETSCHED);
+	u32 sid, tsid;
+
+	sid = cred_sid(cred);
+	tsid = task_sid(p);
+	return avc_has_perm(sid, tsid, SECCLASS_PROCESS, PROCESS__GETSCHED,
+			    NULL);
 }
 
 static int selinux_task_movememory(struct task_struct *p)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 6aef4b558e73..0737325baa14 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -2209,13 +2209,16 @@ static int smack_task_getioprio(struct task_struct *p)
 /**
  * smack_task_setscheduler - Smack check on setting scheduler
  * @p: the task object
+ * @cred: the caller's credentials
  * @policy: unused
  * @lp: unused
  *
  * Return 0 if read access is permitted
  */
-static int smack_task_setscheduler(struct task_struct *p)
+static int smack_task_setscheduler(struct task_struct *p,
+				   const struct cred *cred)
 {
+	// TODO: handle non-current accesses (via timerslack)
 	return smk_curacc_on_task(p, MAY_WRITE, __func__);
 }
 
@@ -2225,8 +2228,10 @@ static int smack_task_setscheduler(struct task_struct *p)
  *
  * Return 0 if read access is permitted
  */
-static int smack_task_getscheduler(struct task_struct *p)
+static int smack_task_getscheduler(struct task_struct *p,
+				   const struct cred *cred)
 {
+	// TODO: handle non-current accesses (via timerslack)
 	return smk_curacc_on_task(p, MAY_READ, __func__);
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 8/8] Documentation: add security/ptrace_checks.txt
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (6 preceding siblings ...)
  2016-10-30 21:46 ` [PATCH v3 7/8] proc: fix timerslack_ns handling Jann Horn
@ 2016-10-30 21:46 ` Jann Horn
  2016-11-01 23:57 ` [PATCH v3 0/8] Various fixes related to ptrace_may_access() Linus Torvalds
  8 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-10-30 21:46 UTC (permalink / raw)
  To: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen
  Cc: linux-fsdevel, linux-security-module, security

This includes some things suggested by Ben Hutchings and Kees Cook.

changed in v3:
 - talk about interaction between /proc DAC checks and user namespaces
   (Krister Johansen)

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
---
 Documentation/security/ptrace_checks.txt | 243 +++++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)
 create mode 100644 Documentation/security/ptrace_checks.txt

diff --git a/Documentation/security/ptrace_checks.txt b/Documentation/security/ptrace_checks.txt
new file mode 100644
index 000000000000..afe01ae111bd
--- /dev/null
+++ b/Documentation/security/ptrace_checks.txt
@@ -0,0 +1,243 @@
+			     ====================
+			     PTRACE ACCESS CHECKS
+			     ====================
+
+By: Jann Horn <jann@thejh.net>
+
+=====
+INTRO
+=====
+
+This file describes the rules for ptrace() and ptrace()-like access to another
+task (inspecting or modifying the virtual memory of another process, inspecting
+another task's register state, inspecting another task's open files and so on).
+
+This kind of access is particularly security-sensitive because it often allows
+the caller to take complete control over the remote process. Therefore, the
+kernel must be very careful here.
+
+There are various kernel APIs that grant debugging access to other processes.
+
+
+===========================
+ORDINARY DEBUGGING SYSCALLS
+===========================
+
+The following ordinary syscalls grant debugging-like access:
+
+ - ptrace() grants R+W access to registers and virtual memory
+ - process_vm_readv() / process_vm_writev() grant R/W access to virtual memory
+ - get_robust_list() reveals an address from the target task, which would be
+   useful e.g. for an ASLR bypass
+ - perf_event_open()
+ - (kcmp() reveals a very small amount of information)
+
+These syscalls use the caller's *real* UID/GID and his *permitted* capability
+set (for historical reasons) to determine the permitted access using
+ptrace_may_access(..., PTRACE_MODE_REALCREDS | ...). This function is
+responsible for ensuring that the caller can't steal any privileges using the
+debugging access; in other words, it has to ensure that the caller's credentials
+are (more or less) a superset of the target's credentials. Namespaces aside,
+these rules work as follows:
+
+Introspection, including cross-thread introspection, is always allowed. This
+means that it is unsafe to use most of the debugging APIs on behalf of another,
+untrusted process unless it is verified somehow that the target pid does not
+actually belong to the calling process - dropping privileges is not sufficient.
+However, this rule does not apply to ptrace() - a task can't attach to itself or
+to another thread of the same thread group using ptrace().
+
+root (to be more precise, a process with CAP_SYS_PTRACE effective) can always
+use these debugging APIs on any process - unless a LSM denies the access.
+
+For normal users, there are more checks:
+
+ - The real uid of the caller must be equal to ruid, euid and suid of the target
+   task, same thing for the gids. All of the uids and gids of the target (apart
+   from the fs*id) are checked here because a privileged process might stash his
+   privileged uid into any one of them. (checked in __ptrace_may_access())
+   (Note: This doesn't cover the supplementary GIDs.)
+ - The permitted capabilities of the caller must be a superset of the permitted
+   capabilities of the target. (checked in cap_ptrace_access_check())
+ - The target process must be dumpable. What this means is explained in the next
+   section.
+
+
+===========
+DUMPABILITY
+===========
+
+Originally, the dumpability rule didn't exist - as long as a process had ruid,
+euid and suid set to the uid of a user (same thing with the gids), that user was
+able to gain debugging access to the process. This was problematic: It meant
+that if a setuid binary opened some important file, then dropped all privileges,
+the user was able to debug the process and steal its file descriptor - which is
+also a form of privilege! Other issues are e.g. that when a process drops its
+privileges, it might still have secrets in memory that the user isn't meant to
+see.
+
+Therefore, a process now becomes "nondumpable" when it:
+
+ - changes its euid / egid
+ - changes its fsuid / fsgid
+ - changes its permitted capability set
+
+These checks are in commit_creds().
+
+When a process is nondumpable, you need to have the CAP_SYS_PTRACE capability to
+access it - just having the same uids is not enough anymore.
+
+The most interesting part of dumpability is what happens on execve().
+(Non-)Dumpability is not inherited; instead, it is recalculated in
+setup_new_exec() and commit_creds(): A process will be nondumpable after
+execve() if its post-setuid-bit-handling euid and ruid or egid and rgid aren't
+the same. Additionally, it will be nondumpable if execve() changes its
+credentials in a way that commit_creds() triggers on (as described above).
+
+In effect, this means that normally, when a setuid/setgid binary executes
+another program, the other program is still protected by nondumpability (and
+also by secure execution).
+
+(Additionally, a process will be nondumpable if the euid doesn't have read
+access to the executed binary, which can be the case if the binary is
+executable, but not readable; however, this restriction is irrelevant from a
+security perspective because it isn't enforced consistently, neither through
+AT_SECURE nor for bprm->unsafe. In other words, you can e.g. be attached to such
+a binary via ptrace while it is executed, or you can LD_PRELOAD a library into
+it that dumps its memory. If this was fixed, it would be important to keep
+namespacing issues in mind, in particular the interaction with
+LSM_UNSAFE_PTRACE_CAP, the privileged kind of ptrace that works across setuid
+execve.)
+
+Dumpability and the secure exec flag can be affected by the following "unsafe
+execution" rules. When a process with ruid != euid executes another program and
+an unsafe execution is detected, *ITS EUID WILL BE DROPPED TO ITS RUID* by
+cap_bprm_set_creds() (unless it has CAP_SETUID effective in the init user ns).
+Unsafe execution means one of the following:
+
+ - LSM_UNSAFE_PTRACE: An unprivileged process is currently attached via ptrace.
+ - LSM_UNSAFE_NO_NEW_PRIVS: The current thread turned on NO_NEW_PRIVS.
+ - LSM_UNSAFE_SHARE: The fs struct is shared with a non-thread.
+
+
+=====================
+FILESYSTEM DEBUG APIS
+=====================
+
+The pid / tgid entries in procfs contain various entries that allow debugging
+access to a process. Interesting entries are:
+
+ - auxv permits an ASLR bypass
+ - cwd can permit bypassing filesystem restrictions in some cases
+ - environ can leak secret tokens
+ - fd can permit bypassing filesystem restrictions or leak access to things like
+   pipes
+ - maps permits an ASLR bypass
+ - mem grants R+W access to process memory
+ - stat permits an ASLR bypass
+
+Of these, all use both a normal filesystem DAC check (where the file owner is
+the process owner for a dumpable process, root for a nondumpable process) and a
+ptrace_may_access() check; however, the DAC check may be modified, and the
+ptrace_may_access() is performed under PTRACE_FSCREDS, meaning that instead of
+the caller's ruid, rgid and permitted capabilities, the fsuid, fsgid and
+effective capabilities are used, causing the case where a daemon drops its euid
+prior to accessing a file for the user to be treated correctly for this check.
+
+For a nondumpable process in a user namespace, GLOBAL_ROOT_UID and
+GLOBAL_ROOT_GID are the owners of the corresponding procfs files. This means
+that, when a process in a user namespace becomes nondumpable, even the namespace
+owner can't fully inspect it through procfs anymore.
+
+When a privileged process inside a user namespace tries to inspect a dumpable
+process that belongs to another user in that namespace via procfs, that passes
+the DAC checks because the procfs files of the target process are owned by the
+other user, meaning that, as long as the other user's UID is mapped into the
+namespace, the namespace administrator is capable relative to the
+procfs files (see capable_wrt_inode_uidgid()).
+
+NEEDS ATTENTION:
+Special-case rules that permit introspection become a big problem here. In
+particular, if a setuid binary runs with dropped privileges (ruid=euid=user and
+suid=0 or so), it is still able to open the following entries under /proc/self:
+
+ - cwd (because normal DAC rules always permit symlink access, proc_pid_get_link
+   only checks for ptrace access and proc_cwd_link doesn't check anything)
+ - fd (same as cwd for the entries inside; the directory itself has an override
+   on the permission handler that makes introspection possible even if the
+   normal DAC rules would forbid it)
+ - maps (mode 0444, so DAC doesn't restrict anything)
+ - stat (mode 0444, and anyone can read from it; however, the interesting
+     pointers are only printed if the file opener has ptrace access to the
+     target process)
+
+In particular the fd directory is dangerous: Imagine a privileged process that
+creates some important pipe using pipe() with a dropped EUID and later performs
+writes with user-specified data on a file at a user-specified path with the same
+dropped EUID. An attacker could let the root-owned process reference the pipe as
+/proc/$pid/fd/3 and let the privileged process write to it.
+
+If necessary, it might make sense to introduce some "enable/disable
+introspection" prctl that can be used by userspace to disambiguate whether an
+access is intended to be introspection.
+
+
+Note that in all cases in which the DAC check permits access but a ptrace access
+check later on (in the VFS open handler or afterwards) restricts access,
+faccessat() is broken. However, that's probably not a real problem because
+faccessat() is racy anyway.
+
+
+===============
+USER NAMESPACES
+===============
+
+When ptrace_may_access() checks for CAP_SYS_PTRACE, the capability doesn't need
+to be active in the init namespace or in the current namespace; having it in the
+namespace of the ptrace target is sufficient. There are currently no further
+restrictions on this, meaning that the behavior of a task that creates or enters
+a user namespace is somewhat unintuitive:
+
+ - If a nondumpable task enters a user namespace, it will still be
+   nondumpable - but because the owner of the namespace is capable relative to
+   it when viewing the namespace from outside, this will often cause a
+   nondumpable task to effectively become dumpable because of a setns() /
+   clone() / unshare() call.
+
+ - The root user of a namespace can ptrace any process that enters the namespace
+   (via setns()) - immediately. This means that a process that wants to enter an
+   untrusted namespace from outside needs to be *very* careful to drop any
+   privileges it might have - uids, open file descriptors, secret data that's
+   still in memory, the controlling TTY and so on. (This also means that a
+   namespace owner who is unprivileged outside the namespace can't safely enter
+   the namespace if he doesn't trust the namespace root user: He needs
+   euid==ns->owner to enter the namespace, but must not have euid==ns->owner
+   after having entered the namespace. To get around this, it is necessary to
+   introduce a third namespace between the other two.)
+
+One way to make it easier for userspace to get this right might be to bind
+dumpability to namespaces, as follows: Let a task become nondumpable on
+setns() / clone(CLONE_NEWUSER) / unshare(CLONE_NEWUSER). Record the current
+namespace on dumpable -> nondumpable transition as "dumpability namespace".
+Require the caller to be privileged in the target's dumpability namespace, not
+just the target's current namespace.
+
+=============
+FUSE AND CUSE
+=============
+
+Anyone who can open /dev/cuse (root-owned, mode 0600) can arbitrarily read and
+write to the memory of various system processes (to be more precise, any process
+that opens files in /dev when they appear or so, then calls ioctl() on them; in
+particular, acpid does this) without further checks on capabilities or so.
+
+NEEDS ATTENTION:
+FUSE is more accessible than CUSE, but also more strict. ioctl() replies are
+restricted in location and size based on the ioctl command number (size) and the
+argument (location), but the ability to interactively control the replies to VFS
+method calls is still scary. Therefore, FUSE by default rejects filesystem
+access by anyone except the filesystem owner. This isn't done using the normal
+ptrace check, though; instead, fuse_allow_current_process() is used, which only
+looks at the UIDs and GIDs and therefore is a very simplified version of the
+normal ptrace access check. In particular, this provides no protection for
+setcap binaries.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] Various fixes related to ptrace_may_access()
  2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
                   ` (7 preceding siblings ...)
  2016-10-30 21:46 ` [PATCH v3 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
@ 2016-11-01 23:57 ` Linus Torvalds
  2016-11-02 18:38   ` Oleg Nesterov
                     ` (2 more replies)
  8 siblings, 3 replies; 32+ messages in thread
From: Linus Torvalds @ 2016-11-01 23:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, Oleg Nesterov, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Krister Johansen, linux-fsdevel, LSM List,
	security

On Sun, Oct 30, 2016 at 3:46 PM, Jann Horn <jann@thejh.net> wrote:
> Next try.
>
> Changes to the individual patches are mostly documented in their
> commit messages.
>
> Added/removed patches:
>  - Added "proc: fix timerslack_ns handling"
>  - Removed "ptrace: warn on ptrace_may_access without proper locking"
>    (because of some reverted changes in the "proc: lock properly [...]"
>    patch)

So I'm a bit unsure which tree this series is going to come in
through. There's no clear maintinership for this area, so I'm just
making sure that Andrew has this on his radar because I suspect this
is going to fall in his lap.

Oleg, you're really the obvious maintainer choice at least for some of
this, but I don't recall having ever pulled from you? If you are ok
with this and were to put  git tree etc, that would certainly also
work very well.  Or at least ack's for Andrew?

                  Linus

---

> Jann Horn (8):
>   exec: introduce cred_guard_light
>   exec: add privunit to task_struct
>   proc: use open()-time creds for ptrace checks
>   futex: don't leak robust_list pointer
>   proc: lock properly in ptrace_may_access callers
>   fs/proc: fix attr access check
>   proc: fix timerslack_ns handling
>   Documentation: add security/ptrace_checks.txt

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-10-30 21:46 ` [PATCH v3 1/8] exec: introduce cred_guard_light Jann Horn
@ 2016-11-02 18:18   ` Oleg Nesterov
  2016-11-02 20:50     ` Jann Horn
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2016-11-02 18:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric W. Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

On 10/30, Jann Horn wrote:
>
> This is a new per-threadgroup lock that can often be taken instead of
> cred_guard_mutex and has less deadlock potential. I'm doing this because
> Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> thread, and the debugger attempts to inspect procfs files of the debugged
> task.

Yes, but let me repeat that we need to fix this anyway. So I don't really
understand why should we add yet another mutex.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] Various fixes related to ptrace_may_access()
  2016-11-01 23:57 ` [PATCH v3 0/8] Various fixes related to ptrace_may_access() Linus Torvalds
@ 2016-11-02 18:38   ` Oleg Nesterov
  2016-11-02 21:40     ` Jann Horn
  2016-11-03 19:09   ` Andrew Morton
  2016-11-04  0:57   ` James Morris
  2 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2016-11-02 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Krister Johansen, linux-fsdevel, LSM List,
	security

On 11/01, Linus Torvalds wrote:
>
> Oleg, you're really the obvious maintainer choice at least for some of
> this,

Well. I still disagree with 1/8, I think we need to fix and cleanup
the usage of cred_guard_mutex we already have. And to me the additional
complications added by, say, 4/8 make no sense, we can make a much more
simple change to avoid this leak "in practice".

But. I never pretended I understand the security problems. So I won't
really argue with these changes.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-02 18:18   ` Oleg Nesterov
@ 2016-11-02 20:50     ` Jann Horn
  2016-11-02 21:38       ` Ben Hutchings
  2016-11-03 18:12       ` Oleg Nesterov
  0 siblings, 2 replies; 32+ messages in thread
From: Jann Horn @ 2016-11-02 20:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric W. Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 3514 bytes --]

On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> On 10/30, Jann Horn wrote:
> >
> > This is a new per-threadgroup lock that can often be taken instead of
> > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > thread, and the debugger attempts to inspect procfs files of the debugged
> > task.
> 
> Yes, but let me repeat that we need to fix this anyway. So I don't really
> understand why should we add yet another mutex.

execve() only takes the new mutex immediately after de_thread(), so this
problem shouldn't occur there. Basically, I think that I'm not making the
problem worse with my patches this way.

I believe that it should be possible to convert most existing users of the
cred_guard_mutex to the new cred_guard_light - exceptions to that that I
see are:

 - PTRACE_ATTACH
 - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)

Beyond that, conceptually, the new cred_guard_light could also be turned
into a read-write mutex to prevent deadlocks between its users (where
execve would take it for writing and everyone else would take it for
reading), but afaik the kernel doesn't have an implementation of
read-write mutexes yet?

cred_guard_light would mean that you could theoretically still create
deadlocks, but afaics only if you do things like trying to read
/proc/$pid/mem in the FUSE read handler for the file that is currently
being executed - and in that case, I think it's okay to have a killable
deadlock.

Do you think that, if (apart from execve) only PTRACE_ATTACH and
SECCOMP_FILTER_FLAG_TSYNC remain as users of cred_guard_mutex and
everything else used my new cred_guard_light, that would be sufficient to
fix the races you are concerned about?

It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
deadlocking issues. PTRACE_ATTACH isn't that clear to me; if a debugger
tries to attach to a newly spawned thread while another ptraced thread is
dying because of de_thread() in a third thread, that might still cause
the debugger to deadlock, right?

The problem with PTRACE_ATTACH is basically that bprm->unsafe is used
in the bprm_set_creds LSM hook, so it needs to have been calculated when
that hook is executed. (Also in the bprm_secureexec hook, but that
one happens after install_exec_creds(), so that's unproblematic.)
security_bprm_set_creds() is called in prepare_binprm(), which is
executed very early in do_execveat_common(), at a point where failures
should still be graceful (return an error code instead of killing the
whole process), and therefore other threads can still run and debuggers
can still attach.

The LSM hooks that execute at that point e.g. inspect and modify
bprm->cred, and they can still cleanly prohibit execution. E.g. SELinux
does this - it can cancel execution with errors like -EPERM and -EACCES.

AFAICS the hard case is:

 - Multithreaded process with tasks A and B is running.
 - Task C attaches to B via ptrace.
 - Task A calls execve(), takes the mutex, reaches de_thread(), kills
   task B.
 - Task C tries to attach to A, tries to take the mutex again,
   deadlock.

I'm not sure whether it'd be possible to get rid of the deadlock
for PTRACE_ATTACH without ABI changes, and I would be surprised if it
was doable without nontrivial additional logic.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-02 20:50     ` Jann Horn
@ 2016-11-02 21:38       ` Ben Hutchings
  2016-11-02 21:54         ` Jann Horn
  2016-11-03 18:12       ` Oleg Nesterov
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Hutchings @ 2016-11-02 21:38 UTC (permalink / raw)
  To: Jann Horn, Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric W. Biederman, Thomas Gleixner,
	Benjamin LaHaise, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 1731 bytes --]

On Wed, 2016-11-02 at 21:50 +0100, Jann Horn wrote:
> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > On 10/30, Jann Horn wrote:
> > > 
> > > This is a new per-threadgroup lock that can often be taken instead of
> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > > thread, and the debugger attempts to inspect procfs files of the debugged
> > > task.
> > 
> > 
> > Yes, but let me repeat that we need to fix this anyway. So I don't really
> > understand why should we add yet another mutex.
> 
> 
> execve() only takes the new mutex immediately after de_thread(), so this
> problem shouldn't occur there. Basically, I think that I'm not making the
> problem worse with my patches this way.
> 
> I believe that it should be possible to convert most existing users of the
> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> see are:
> 
>  - PTRACE_ATTACH
>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
> 
> Beyond that, conceptually, the new cred_guard_light could also be turned
> into a read-write mutex to prevent deadlocks between its users (where
> execve would take it for writing and everyone else would take it for
> reading), but afaik the kernel doesn't have an implementation of
> read-write mutexes yet?
[...]

It does (rw_semaphore), but with PREEMPT_RT enabled they become simple
mutexes.  So I don't think we should rely on that to avoid deadlock.

Ben.

-- 
Ben Hutchings
The world is coming to an end.	Please log off.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] Various fixes related to ptrace_may_access()
  2016-11-02 18:38   ` Oleg Nesterov
@ 2016-11-02 21:40     ` Jann Horn
  0 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-11-02 21:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Krister Johansen, linux-fsdevel, LSM List,
	security

[-- Attachment #1: Type: text/plain, Size: 1464 bytes --]

On Wed, Nov 02, 2016 at 07:38:26PM +0100, Oleg Nesterov wrote:
> On 11/01, Linus Torvalds wrote:
> >
> > Oleg, you're really the obvious maintainer choice at least for some of
> > this,
> 
> Well. I still disagree with 1/8, I think we need to fix and cleanup
> the usage of cred_guard_mutex we already have. And to me the additional
> complications added by, say, 4/8 make no sense, we can make a much more
> simple change to avoid this leak "in practice".
> 
> But. I never pretended I understand the security problems. So I won't
> really argue with these changes.

No, I think it's good that you make me think about this stuff properly.
And I do sometimes get overzealous with trying to do thing "completely
correctly".

See the mail I sent earlier today for my opinion on the deadlocking
potential related to 1/8. Basically, I think that my patch doesn't make
things worse and makes subsequent cleanup work, which should fix most
of the current deadlocking trouble, easier. I think that the remaining
edcecase (concurrent PTRACE_ATTACH and execve) would be at least hard
to fix, maybe even unfixable without API changes.
If anyone has ideas on how we could completely prevent userspace from
deadlocking itself there without changing the ABI, please speak up.
Does that make sense? If so, do you want a
cred_guard_mutex->cred_guard_light conversion in this series or
afterwards?

Regarding 4/8, see the other message I just sent.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-02 21:38       ` Ben Hutchings
@ 2016-11-02 21:54         ` Jann Horn
  0 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-11-02 21:54 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Oleg Nesterov, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Andy Lutomirski,
	Linus Torvalds, Krister Johansen, linux-fsdevel,
	linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 2176 bytes --]

On Wed, Nov 02, 2016 at 03:38:53PM -0600, Ben Hutchings wrote:
> On Wed, 2016-11-02 at 21:50 +0100, Jann Horn wrote:
> > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > > On 10/30, Jann Horn wrote:
> > > > 
> > > > This is a new per-threadgroup lock that can often be taken instead of
> > > > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > > > thread, and the debugger attempts to inspect procfs files of the debugged
> > > > task.
> > > 
> > > 
> > > Yes, but let me repeat that we need to fix this anyway. So I don't really
> > > understand why should we add yet another mutex.
> > 
> > 
> > execve() only takes the new mutex immediately after de_thread(), so this
> > problem shouldn't occur there. Basically, I think that I'm not making the
> > problem worse with my patches this way.
> > 
> > I believe that it should be possible to convert most existing users of the
> > cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> > see are:
> > 
> >  - PTRACE_ATTACH
> >  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
> > 
> > Beyond that, conceptually, the new cred_guard_light could also be turned
> > into a read-write mutex to prevent deadlocks between its users (where
> > execve would take it for writing and everyone else would take it for
> > reading), but afaik the kernel doesn't have an implementation of
> > read-write mutexes yet?
> [...]
> 
> It does (rw_semaphore)

Aren't those spinlocks? I don't think those would be appropriate here,
considering that the cred_guard_light can be held during filesystem
access during execve().

> but with PREEMPT_RT enabled they become simple
> mutexes.  So I don't think we should rely on that to avoid deadlock.

Well, I don't think it's really necessary - as far as I can tell, the
only locking operations that would be executed with the
cred_guard_light held would be taking the sighand lock, filesystem
stuff and LSM stuff.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-02 20:50     ` Jann Horn
  2016-11-02 21:38       ` Ben Hutchings
@ 2016-11-03 18:12       ` Oleg Nesterov
  2016-11-03 21:17         ` Jann Horn
  2016-11-04 13:26         ` Eric W. Biederman
  1 sibling, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2016-11-03 18:12 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric W. Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

On 11/02, Jann Horn wrote:
>
> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > On 10/30, Jann Horn wrote:
> > >
> > > This is a new per-threadgroup lock that can often be taken instead of
> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > > thread, and the debugger attempts to inspect procfs files of the debugged
> > > task.
> >
> > Yes, but let me repeat that we need to fix this anyway. So I don't really
> > understand why should we add yet another mutex.
>
> execve() only takes the new mutex immediately after de_thread(), so this
> problem shouldn't occur there.

Yes, I see.

> Basically, I think that I'm not making the
> problem worse with my patches this way.

In a sense that it doesn't add the new deadlocks, I agree. But it adds
yet another per-process mutex while we already have the similar one,

> I believe that it should be possible to convert most existing users of the
> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> see are:
>
>  - PTRACE_ATTACH

This is the main problem afaics. So "strace -f" can hang if it races
with mt-exec. And we need to fix this. I constantly forget about this
problem, but I tried many times to find a reasonable solution, still
can't.

IMO, it would be nice to rework the lsm hooks, so that we could take
cred_guard_mutex after de_thread() (like your cred_guard_light) or
at least drop it earlier, but unlikely this is possible...

So the only plan I currently have is change de_thread() to wait until
other threads pass exit_notify() or even exit_signals(), but I don't
like this.

>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)

I forgot about this one... Need to re-check but at first glance this
is not a real problem.

> Beyond that, conceptually, the new cred_guard_light could also be turned
> into a read-write mutex

Not sure I understand how this can help... doesn't matter.

My point is, imo you should not add the new mutex. Just use the old
one in (say) 4/8 (which I do not personally like as you know ;), this
won't add the new problem.


> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
> deadlocking issues.

Yes, agreed.

> PTRACE_ATTACH isn't that clear to me; if a debugger
> tries to attach to a newly spawned thread while another ptraced thread is
> dying because of de_thread() in a third thread, that might still cause
> the debugger to deadlock, right?

This is the trivial test-case I wrote when the problem was initially
reported. And damn, I always knew that cred_guard_mutex needs fixes,
but somehow I completely forgot that it is used by PTRACE_ATTACH when
I was going to try to remove from fs/proc a long ago.

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

The problem is trivial. The execing thread waits until its sub-thread
goes away, it should be reaped by the tracer, the tracer waits for
cred_guard_mutex.

> security_bprm_set_creds() is called in prepare_binprm(), which is
> executed very early in do_execveat_common(), at a point where failures
> should still be graceful (return an error code instead of killing the
> whole process),

Yes, yes.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] Various fixes related to ptrace_may_access()
  2016-11-01 23:57 ` [PATCH v3 0/8] Various fixes related to ptrace_may_access() Linus Torvalds
  2016-11-02 18:38   ` Oleg Nesterov
@ 2016-11-03 19:09   ` Andrew Morton
  2016-11-03 20:01     ` Jann Horn
  2016-11-04  0:57   ` James Morris
  2 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2016-11-03 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Alexander Viro, Roland McGrath, Oleg Nesterov,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Kees Cook,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Krister Johansen, linux-fsdevel, LSM List,
	security

On Tue, 1 Nov 2016 17:57:05 -0600 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Oct 30, 2016 at 3:46 PM, Jann Horn <jann@thejh.net> wrote:
> > Next try.
> >
> > Changes to the individual patches are mostly documented in their
> > commit messages.
> >
> > Added/removed patches:
> >  - Added "proc: fix timerslack_ns handling"
> >  - Removed "ptrace: warn on ptrace_may_access without proper locking"
> >    (because of some reverted changes in the "proc: lock properly [...]"
> >    patch)
> 
> So I'm a bit unsure which tree this series is going to come in
> through. There's no clear maintinership for this area, so I'm just
> making sure that Andrew has this on his radar because I suspect this
> is going to fall in his lap.
> 
> Oleg, you're really the obvious maintainer choice at least for some of
> this, but I don't recall having ever pulled from you? If you are ok
> with this and were to put  git tree etc, that would certainly also
> work very well.  Or at least ack's for Andrew?

Yup, I was cc'ed and it is on my radar.  It's a little hard to tell
from the discussion, but I *think* we'll be seeing a v4 series?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] Various fixes related to ptrace_may_access()
  2016-11-03 19:09   ` Andrew Morton
@ 2016-11-03 20:01     ` Jann Horn
  0 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-11-03 20:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Alexander Viro, Roland McGrath, Oleg Nesterov,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Kees Cook,
	Janis Danisevskis, Seth Forshee, Eric W. Biederman,
	Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Krister Johansen, linux-fsdevel, LSM List,
	security

[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]

On Thu, Nov 03, 2016 at 12:09:04PM -0700, Andrew Morton wrote:
> On Tue, 1 Nov 2016 17:57:05 -0600 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Sun, Oct 30, 2016 at 3:46 PM, Jann Horn <jann@thejh.net> wrote:
> > > Next try.
> > >
> > > Changes to the individual patches are mostly documented in their
> > > commit messages.
> > >
> > > Added/removed patches:
> > >  - Added "proc: fix timerslack_ns handling"
> > >  - Removed "ptrace: warn on ptrace_may_access without proper locking"
> > >    (because of some reverted changes in the "proc: lock properly [...]"
> > >    patch)
> > 
> > So I'm a bit unsure which tree this series is going to come in
> > through. There's no clear maintinership for this area, so I'm just
> > making sure that Andrew has this on his radar because I suspect this
> > is going to fall in his lap.
> > 
> > Oleg, you're really the obvious maintainer choice at least for some of
> > this, but I don't recall having ever pulled from you? If you are ok
> > with this and were to put  git tree etc, that would certainly also
> > work very well.  Or at least ack's for Andrew?
> 
> Yup, I was cc'ed and it is on my radar.  It's a little hard to tell
> from the discussion, but I *think* we'll be seeing a v4 series?

Yup, I still have to address Oleg's last email, there will be a v4.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-03 18:12       ` Oleg Nesterov
@ 2016-11-03 21:17         ` Jann Horn
  2016-11-04 13:26         ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Jann Horn @ 2016-11-03 21:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Roland McGrath, John Johansen, James Morris,
	Serge E. Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Eric W. Biederman, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

[-- Attachment #1: Type: text/plain, Size: 2005 bytes --]

On Thu, Nov 03, 2016 at 07:12:25PM +0100, Oleg Nesterov wrote:
> On 11/02, Jann Horn wrote:
> >
> > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > > On 10/30, Jann Horn wrote:
[...]
> > I believe that it should be possible to convert most existing users of the
> > cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> > see are:
> >
> >  - PTRACE_ATTACH
> 
> This is the main problem afaics. So "strace -f" can hang if it races
> with mt-exec. And we need to fix this. I constantly forget about this
> problem, but I tried many times to find a reasonable solution, still
> can't.

Ah, okay, it wasn't clear to me that you consider the race with
PTRACE_ATTACH to be a similarly big problem as the other ones.

> IMO, it would be nice to rework the lsm hooks, so that we could take
> cred_guard_mutex after de_thread() (like your cred_guard_light) or
> at least drop it earlier, but unlikely this is possible...

An idea: Maybe we can change the LSM hook so that, immediately before
de_thread(), the LSMs can handle the execve() based on the current
state and report the circumstances under which they would deny the
execution or treat it differently. Then, during de_thread(), we can
immediately reject any access that would have changed the execution
and immediately permit any access that wouldn't have.

This could theoretically cause userland to see spurious permission
denials, but only if an LSM has an inconsistent security policy where
some access degrades execution although it would have been permitted
after a normal execution.

Does that make sense?

> So the only plan I currently have is change de_thread() to wait until
> other threads pass exit_notify() or even exit_signals(), but I don't
> like this.
[...]
> My point is, imo you should not add the new mutex. Just use the old
> one in (say) 4/8 (which I do not personally like as you know ;), this
> won't add the new problem.

Okay, I'll do that.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] Various fixes related to ptrace_may_access()
  2016-11-01 23:57 ` [PATCH v3 0/8] Various fixes related to ptrace_may_access() Linus Torvalds
  2016-11-02 18:38   ` Oleg Nesterov
  2016-11-03 19:09   ` Andrew Morton
@ 2016-11-04  0:57   ` James Morris
  2 siblings, 0 replies; 32+ messages in thread
From: James Morris @ 2016-11-04  0:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Alexander Viro, Roland McGrath, Oleg Nesterov,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Kees Cook,
	Andrew Morton, Janis Danisevskis, Seth Forshee,
	Eric W. Biederman, Thomas Gleixner, Benjamin LaHaise,
	Ben Hutchings, Andy Lutomirski, Krister Johansen, linux-fsdevel,
	LSM List, security

On Tue, 1 Nov 2016, Linus Torvalds wrote:

> Oleg, you're really the obvious maintainer choice at least for some of
> this, but I don't recall having ever pulled from you? If you are ok
> with this and were to put  git tree etc, that would certainly also
> work very well.  Or at least ack's for Andrew?
> 

I can take it through my tree if necessary.


-- 
James Morris
<jmorris@namei.org>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-03 18:12       ` Oleg Nesterov
  2016-11-03 21:17         ` Jann Horn
@ 2016-11-04 13:26         ` Eric W. Biederman
  2016-11-04 15:00           ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2016-11-04 13:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/02, Jann Horn wrote:
>>
>> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
>> > On 10/30, Jann Horn wrote:
>> > >
>> > > This is a new per-threadgroup lock that can often be taken instead of
>> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
>> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
>> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
>> > > thread, and the debugger attempts to inspect procfs files of the debugged
>> > > task.
>> >
>> > Yes, but let me repeat that we need to fix this anyway. So I don't really
>> > understand why should we add yet another mutex.
>>
>> execve() only takes the new mutex immediately after de_thread(), so this
>> problem shouldn't occur there.
>
> Yes, I see.
>
>> Basically, I think that I'm not making the
>> problem worse with my patches this way.
>
> In a sense that it doesn't add the new deadlocks, I agree. But it adds
> yet another per-process mutex while we already have the similar one,
>
>> I believe that it should be possible to convert most existing users of the
>> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
>> see are:
>>
>>  - PTRACE_ATTACH
>
> This is the main problem afaics. So "strace -f" can hang if it races
> with mt-exec. And we need to fix this. I constantly forget about this
> problem, but I tried many times to find a reasonable solution, still
> can't.
>
> IMO, it would be nice to rework the lsm hooks, so that we could take
> cred_guard_mutex after de_thread() (like your cred_guard_light) or
> at least drop it earlier, but unlikely this is possible...
>
> So the only plan I currently have is change de_thread() to wait until
> other threads pass exit_notify() or even exit_signals(), but I don't
> like this.
>
>>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
>
> I forgot about this one... Need to re-check but at first glance this
> is not a real problem.
>
>> Beyond that, conceptually, the new cred_guard_light could also be turned
>> into a read-write mutex
>
> Not sure I understand how this can help... doesn't matter.
>
> My point is, imo you should not add the new mutex. Just use the old
> one in (say) 4/8 (which I do not personally like as you know ;), this
> won't add the new problem.
>
>
>> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
>> deadlocking issues.
>
> Yes, agreed.
>
>> PTRACE_ATTACH isn't that clear to me; if a debugger
>> tries to attach to a newly spawned thread while another ptraced thread is
>> dying because of de_thread() in a third thread, that might still cause
>> the debugger to deadlock, right?
>
> This is the trivial test-case I wrote when the problem was initially
> reported. And damn, I always knew that cred_guard_mutex needs fixes,
> but somehow I completely forgot that it is used by PTRACE_ATTACH when
> I was going to try to remove from fs/proc a long ago.
>
> 	void *thread(void *arg)
> 	{
> 		ptrace(PTRACE_TRACEME, 0,0,0);
> 		return NULL;
> 	}
>
> 	int main(void)
> 	{
> 		int pid = fork();
>
> 		if (!pid) {
> 			pthread_t pt;
> 			pthread_create(&pt, NULL, thread, NULL);
> 			pthread_join(pt, NULL);
> 			execlp("echo", "echo", "passed", NULL);
> 		}
>
> 		sleep(1);
> 		// or anything else which needs ->cred_guard_mutex,
> 		// say open(/proc/$pid/mem)
> 		ptrace(PTRACE_ATTACH, pid, 0,0);
> 		kill(pid, SIGCONT);
>
> 		return 0;
> 	}
>
> The problem is trivial. The execing thread waits until its sub-thread
> goes away, it should be reaped by the tracer, the tracer waits for
> cred_guard_mutex.

There is a bug here but I don't believe it has anything to do with
the cred_guard_mutex.

If we reach zap_other_threads fundamentally the tracer should not
be able to block the traced thread from exiting.  Those are the
semantics described in the comments in the code.

I have poked things a little and have a half fix for that but
the fix appears to be the wrong, but enlightening.

AKA the following prevents the hang of your test case.
diff --git a/kernel/signal.c b/kernel/signal.c
index 75761acc77cf..a6f83450500e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1200,7 +1200,7 @@ int zap_other_threads(struct task_struct *p)
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+		signal_wake_up_state(t, TASK_WAKEKILL | __TASK_TRACED);
 	}
 
 	return count;

It looks like somewhere on the exit path the traced thread is blocking
without setting TASK_WAKEKILL.

Eric



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-04 13:26         ` Eric W. Biederman
@ 2016-11-04 15:00           ` Eric W. Biederman
  2016-11-04 18:04             ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2016-11-04 15:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

ebiederm@xmission.com (Eric W. Biederman) writes:

> Oleg Nesterov <oleg@redhat.com> writes:
>
>> On 11/02, Jann Horn wrote:
>>>
>>> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
>>> > On 10/30, Jann Horn wrote:
>>> > >
>>> > > This is a new per-threadgroup lock that can often be taken instead of
>>> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
>>> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
>>> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
>>> > > thread, and the debugger attempts to inspect procfs files of the debugged
>>> > > task.
>>> >
>>> > Yes, but let me repeat that we need to fix this anyway. So I don't really
>>> > understand why should we add yet another mutex.
>>>
>>> execve() only takes the new mutex immediately after de_thread(), so this
>>> problem shouldn't occur there.
>>
>> Yes, I see.
>>
>>> Basically, I think that I'm not making the
>>> problem worse with my patches this way.
>>
>> In a sense that it doesn't add the new deadlocks, I agree. But it adds
>> yet another per-process mutex while we already have the similar one,
>>
>>> I believe that it should be possible to convert most existing users of the
>>> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
>>> see are:
>>>
>>>  - PTRACE_ATTACH
>>
>> This is the main problem afaics. So "strace -f" can hang if it races
>> with mt-exec. And we need to fix this. I constantly forget about this
>> problem, but I tried many times to find a reasonable solution, still
>> can't.
>>
>> IMO, it would be nice to rework the lsm hooks, so that we could take
>> cred_guard_mutex after de_thread() (like your cred_guard_light) or
>> at least drop it earlier, but unlikely this is possible...
>>
>> So the only plan I currently have is change de_thread() to wait until
>> other threads pass exit_notify() or even exit_signals(), but I don't
>> like this.
>>
>>>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
>>
>> I forgot about this one... Need to re-check but at first glance this
>> is not a real problem.
>>
>>> Beyond that, conceptually, the new cred_guard_light could also be turned
>>> into a read-write mutex
>>
>> Not sure I understand how this can help... doesn't matter.
>>
>> My point is, imo you should not add the new mutex. Just use the old
>> one in (say) 4/8 (which I do not personally like as you know ;), this
>> won't add the new problem.
>>
>>
>>> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
>>> deadlocking issues.
>>
>> Yes, agreed.
>>
>>> PTRACE_ATTACH isn't that clear to me; if a debugger
>>> tries to attach to a newly spawned thread while another ptraced thread is
>>> dying because of de_thread() in a third thread, that might still cause
>>> the debugger to deadlock, right?
>>
>> This is the trivial test-case I wrote when the problem was initially
>> reported. And damn, I always knew that cred_guard_mutex needs fixes,
>> but somehow I completely forgot that it is used by PTRACE_ATTACH when
>> I was going to try to remove from fs/proc a long ago.
>>
>> 	void *thread(void *arg)
>> 	{
>> 		ptrace(PTRACE_TRACEME, 0,0,0);
>> 		return NULL;
>> 	}
>>
>> 	int main(void)
>> 	{
>> 		int pid = fork();
>>
>> 		if (!pid) {
>> 			pthread_t pt;
>> 			pthread_create(&pt, NULL, thread, NULL);
>> 			pthread_join(pt, NULL);
>> 			execlp("echo", "echo", "passed", NULL);
>> 		}
>>
>> 		sleep(1);
>> 		// or anything else which needs ->cred_guard_mutex,
>> 		// say open(/proc/$pid/mem)
>> 		ptrace(PTRACE_ATTACH, pid, 0,0);
>> 		kill(pid, SIGCONT);
>>
>> 		return 0;
>> 	}
>>
>> The problem is trivial. The execing thread waits until its sub-thread
>> goes away, it should be reaped by the tracer, the tracer waits for
>> cred_guard_mutex.
>
> There is a bug here but I don't believe it has anything to do with
> the cred_guard_mutex.
>
> If we reach zap_other_threads fundamentally the tracer should not
> be able to block the traced thread from exiting.  Those are the
> semantics described in the comments in the code.
>
> I have poked things a little and have a half fix for that but
> the fix appears to be the wrong, but enlightening.
>
> AKA the following prevents the hang of your test case.
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 75761acc77cf..a6f83450500e 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1200,7 +1200,7 @@ int zap_other_threads(struct task_struct *p)
>  		if (t->exit_state)
>  			continue;
>  		sigaddset(&t->pending.signal, SIGKILL);
> -		signal_wake_up(t, 1);
> +		signal_wake_up_state(t, TASK_WAKEKILL | __TASK_TRACED);
>  	}
>  
>  	return count;
>
> It looks like somewhere on the exit path the traced thread is blocking
> without setting TASK_WAKEKILL.

Apologies there was a testing mistake and that patch does not actually
help anything.

The following mostly correct patch modifies zap_other_threads in
the case of a de_thread to not wait for zombies to be reaped.  The only
case that cares is ptrace (as threads are self reaping).  So I don't
think this will cause any problems except removing the strace -f race.

Not waiting for zombies to be reaped in de_thread keeps the kernel from
holding the cred_guard_mutex while waiting for userspace.  Which should
mean we don't have to move it.

Not waiting for zombies to be reaped should also speed of mt-exec.  So I
think this is a benefit all around.


diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45ebbe3..8c8556cab655 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -109,7 +109,8 @@ static void __exit_signal(struct task_struct *tsk)
 		 * If there is any task waiting for the group exit
 		 * then notify it:
 		 */
-		if (sig->notify_count > 0 && !--sig->notify_count)
+		if ((sig->flags & SIGNAL_GROUP_EXIT) &&
+		    sig->notify_count > 0 && !--sig->notify_count)
 			wake_up_process(sig->group_exit_task);
 
 		if (tsk == sig->curr_target)
@@ -690,6 +691,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
+	if (!(tsk->signal->flags & SIGNAL_GROUP_EXIT) &&
+	    tsk->signal->notify_count > 0 && !--tsk->signal->notify_count)
+		wake_up_process(tsk->signal->group_exit_task);
+
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
 		wake_up_process(tsk->signal->group_exit_task);
diff --git a/kernel/signal.c b/kernel/signal.c
index 75761acc77cf..a3a5cd8dad0f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1194,7 +1194,9 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
+		if ((t->signal->flags & SIGNAL_GROUP_EXIT) ||
+		    !t->exit_state)
+			count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)


Eric

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-04 15:00           ` Eric W. Biederman
@ 2016-11-04 18:04             ` Oleg Nesterov
  2016-11-04 18:45               ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2016-11-04 18:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

On 11/04, Eric W. Biederman wrote:
>
> The following mostly correct patch modifies zap_other_threads in
> the case of a de_thread to not wait for zombies to be reaped.  The only
> case that cares is ptrace (as threads are self reaping).  So I don't
> think this will cause any problems except removing the strace -f race.

>From my previous email:

	So the only plan I currently have is change de_thread() to wait until
	other threads pass exit_notify() or even exit_signals(), but I don't
	like this.

And yes, I don't like this, but perhaps this is what we should do.

The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
checks doesn't look right, but off course technically this change should
be simple enough.

But not that simple. Just for example, the exiting sub-threads should
not run with ->group_leader pointing to nowhere, in case it was reaped
by de_thread.

And we have another problem with PTRACE_EVENT_EXIT which can lead to the
same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
defined. But this change will add the user-visible change.

And if we add the user-visible changes, then perhaps we could simply untrace
the traced sub-threads on exec. This change is simple, we do not even need
to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
if exec is in progress. But I'm afraid we can't do this.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-04 18:04             ` Oleg Nesterov
@ 2016-11-04 18:45               ` Oleg Nesterov
  2016-11-05 14:56                 ` Oleg Nesterov
  2016-11-08 22:02                 ` Kees Cook
  0 siblings, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2016-11-04 18:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

On 11/04, Oleg Nesterov wrote:
>
> On 11/04, Eric W. Biederman wrote:
> >
> > The following mostly correct patch modifies zap_other_threads in
> > the case of a de_thread to not wait for zombies to be reaped.  The only
> > case that cares is ptrace (as threads are self reaping).  So I don't
> > think this will cause any problems except removing the strace -f race.
>
> From my previous email:
>
> 	So the only plan I currently have is change de_thread() to wait until
> 	other threads pass exit_notify() or even exit_signals(), but I don't
> 	like this.
>
> And yes, I don't like this, but perhaps this is what we should do.
>
> The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
> checks doesn't look right, but off course technically this change should
> be simple enough.
>
> But not that simple. Just for example, the exiting sub-threads should
> not run with ->group_leader pointing to nowhere, in case it was reaped
> by de_thread.

Not to mention other potential problems outside of ptrace/exec. For example
userns_install() can fail after mt-exec even without ptrace, simply because
thread_group_empty() can be false. Sure, easy to fix, and probably _install()
should use signal->live anyway, but still.

And I didn't mention the fun with sighand unsharing. We simply can't do this
until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
The execing thread and the zombie threads will use different locks to, say,
remove the task from thread-group. Again, this is fixable, but not that
simple.

> And we have another problem with PTRACE_EVENT_EXIT which can lead to the
> same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
> defined. But this change will add the user-visible change.
>
> And if we add the user-visible changes, then perhaps we could simply untrace
> the traced sub-threads on exec. This change is simple, we do not even need
> to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
> if exec is in progress. But I'm afraid we can't do this.

Eric, I hope you see my emails, I got the "Undelivered Mail Returned to Sender"
	...
	This is the mail system at host mail.kernel.org.
	...
	<ebiederm@xmission.com> (expanded from <security@kernel.org>): host
	    mx.xmission.com[166.70.12.20] said: 550-XM-RJCT16: SPF Failure
	    (ip=198.145.29.136, frm=oleg@redhat.com, 550 result=fail) (in reply to RCPT
	    TO command)

right now I have no idea what does this mean.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-04 18:45               ` Oleg Nesterov
@ 2016-11-05 14:56                 ` Oleg Nesterov
  2016-11-09  0:34                   ` Eric W. Biederman
  2016-11-16 20:03                   ` Eric W. Biederman
  2016-11-08 22:02                 ` Kees Cook
  1 sibling, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2016-11-05 14:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

On 11/04, Oleg Nesterov wrote:
>
> On 11/04, Oleg Nesterov wrote:
> >
> > On 11/04, Eric W. Biederman wrote:
> > >
> > > The following mostly correct patch modifies zap_other_threads in
> > > the case of a de_thread to not wait for zombies to be reaped.  The only
> > > case that cares is ptrace (as threads are self reaping).  So I don't
> > > think this will cause any problems except removing the strace -f race.
> >
> > From my previous email:
> >
> > 	So the only plan I currently have is change de_thread() to wait until
> > 	other threads pass exit_notify() or even exit_signals(), but I don't
> > 	like this.
> >
> > And yes, I don't like this, but perhaps this is what we should do.
> >
> > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
> > checks doesn't look right, but off course technically this change should
> > be simple enough.
> >
> > But not that simple. Just for example, the exiting sub-threads should
> > not run with ->group_leader pointing to nowhere, in case it was reaped
> > by de_thread.
>
> Not to mention other potential problems outside of ptrace/exec. For example
> userns_install() can fail after mt-exec even without ptrace, simply because
> thread_group_empty() can be false. Sure, easy to fix, and probably _install()
> should use signal->live anyway, but still.
>
> And I didn't mention the fun with sighand unsharing. We simply can't do this
> until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
> The execing thread and the zombie threads will use different locks to, say,
> remove the task from thread-group. Again, this is fixable, but not that
> simple.
>
> > And we have another problem with PTRACE_EVENT_EXIT which can lead to the
> > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
> > defined. But this change will add the user-visible change.
> >
> > And if we add the user-visible changes, then perhaps we could simply untrace
> > the traced sub-threads on exec. This change is simple, we do not even need
> > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
> > if exec is in progress. But I'm afraid we can't do this.

So I was thinking about something like below. Untested, probably buggy/incomplete
too, but hopefully can work.

flush_old_exec() calls the new kill_sub_threads() helper which waits until
all the sub-threads pass exit_notify().

de_thread() is called after install_exec_creds(), it is simplified and waits
for thread_group_empty() without cred_guard_mutex.

But again, I do not really like this, and we need to do something with
PTRACE_EVENT_EXIT anyway, this needs another/separate change. User-visible.

And I disagree that this has nothing to do with cred_guard_mutex. And in any
case we should narrow its scope in do_execve() path. Why do we take it so early?
Why do we need to do, say, copy_strings() with this lock held? The original
motivation for this has gone, acct_arg_size() can work just fine even if
multiple threads call sys_execve().

I'll try to discuss the possible changes in LSM hooks with Jann, I still think
that this is what we actually need to do. At least try to do, possibly this is
too complicated.

Oleg.

 fs/binfmt_elf.c         |   6 ++-
 fs/exec.c               | 108 ++++++++++++++++++++++++------------------------
 include/linux/binfmts.h |   1 +
 kernel/exit.c           |   5 ++-
 kernel/signal.c         |   3 +-
 5 files changed, 65 insertions(+), 58 deletions(-)


--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	setup_new_exec(bprm);
 	install_exec_creds(bprm);
 
+	retval = de_thread(current);
+	if (retval)
+		goto out_free_dentry;
+
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
 	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
 				 executable_stack);
 	if (retval < 0)
 		goto out_free_dentry;
-	
+
 	current->mm->start_stack = bprm->p;
 
 	/* Now we do a little grungy work by mmapping the ELF image into
diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9..7246b9f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1036,13 +1036,59 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig)
+{
+	for (;;) {
+		if (unlikely(__fatal_signal_pending(tsk)))
+			goto killed;
+		set_current_state(TASK_KILLABLE);
+		if (!sig->notify_count)
+			break;
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EINTR;
+}
+
+static int kill_sub_threads(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	int err = -EINTR;
+
+	if (thread_group_empty(tsk))
+		return 0;
+
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (!signal_group_exit(sig)) {
+		sig->group_exit_task = tsk;
+		sig->notify_count = -zap_other_threads(tsk);
+		err = 0;
+	}
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+
+	if (!err)
+		err = wait_for_notify_count(tsk, sig);
+	return err;
+
+}
+
 /*
  * This function makes sure the current process has its own signal table,
  * so that flush_signal_handlers can later reset the handlers without
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
@@ -1051,34 +1097,15 @@ static int de_thread(struct task_struct *tsk)
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
 
-	/*
-	 * Kill all other threads in the thread group.
-	 */
 	spin_lock_irq(lock);
-	if (signal_group_exit(sig)) {
-		/*
-		 * Another group action in progress, just
-		 * return so that the signal is processed.
-		 */
-		spin_unlock_irq(lock);
-		return -EAGAIN;
-	}
-
-	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
+	sig->notify_count = sig->nr_threads;
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
-
-	while (sig->notify_count) {
-		__set_current_state(TASK_KILLABLE);
-		spin_unlock_irq(lock);
-		schedule();
-		if (unlikely(__fatal_signal_pending(tsk)))
-			goto killed;
-		spin_lock_irq(lock);
-	}
 	spin_unlock_irq(lock);
 
+	if (wait_for_notify_count(tsk, sig))
+		return -EINTR;
+
 	/*
 	 * At this point all other threads have exited, all we have to
 	 * do is to wait for the thread group leader to become inactive,
@@ -1087,24 +1114,8 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exit_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			threadgroup_change_end(tsk);
-			schedule();
-			if (unlikely(__fatal_signal_pending(tsk)))
-				goto killed;
-		}
-
+		threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 		/*
 		 * The only record we have of the real-time age of a
 		 * process, regardless of execs it's done, is start_time.
@@ -1162,10 +1173,9 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+no_thread_group:
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
-
-no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
@@ -1197,14 +1207,6 @@ static int de_thread(struct task_struct *tsk)
 
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 char *get_task_comm(char *buf, struct task_struct *tsk)
@@ -1239,7 +1241,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = kill_sub_threads(current);
 	if (retval)
 		goto out;
 
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
+extern int de_thread(struct task_struct *tsk);
 extern void would_dump(struct linux_binprm *, struct file *);
 
 extern int suid_dumpable;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45..f3dd46d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -690,8 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
+	/* mt-exec, kill_sub_threads() is waiting for group exit */
+	if (unlikely(tsk->signal->notify_count < 0) &&
+	    !++tsk->signal->notify_count)
 		wake_up_process(tsk->signal->group_exit_task);
 	write_unlock_irq(&tasklist_lock);
 
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1194,13 +1194,12 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
-
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+		count++;
 	}
 
 	return count;


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-04 18:45               ` Oleg Nesterov
  2016-11-05 14:56                 ` Oleg Nesterov
@ 2016-11-08 22:02                 ` Kees Cook
  2016-11-08 22:46                   ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Kees Cook @ 2016-11-08 22:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Jann Horn, Alexander Viro, Roland McGrath,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

On Fri, Nov 4, 2016 at 11:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Eric, I hope you see my emails, I got the "Undelivered Mail Returned to Sender"
>         ...
>         This is the mail system at host mail.kernel.org.
>         ...
>         <ebiederm@xmission.com> (expanded from <security@kernel.org>): host
>             mx.xmission.com[166.70.12.20] said: 550-XM-RJCT16: SPF Failure
>             (ip=198.145.29.136, frm=oleg@redhat.com, 550 result=fail) (in reply to RCPT
>             TO command)
>
> right now I have no idea what does this mean.

This is a problem for Google folks too sometimes. This is saying that
xmission.com is checking redhat.com's SPF records and refusing to let
kernel.org deliver email as if it were redhat.com (due to
security@kernel.org being an alias not a mailing list). There aren't
good solutions for this, but best I've found is to have my
security@kernel.org alias be a @kernel.org address instead of an
@google.com address...

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-08 22:02                 ` Kees Cook
@ 2016-11-08 22:46                   ` Eric W. Biederman
  2016-11-08 22:56                     ` Benjamin LaHaise
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2016-11-08 22:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Oleg Nesterov, Jann Horn, Alexander Viro, Roland McGrath,
	John Johansen, James Morris, Serge E. Hallyn, Paul Moore,
	Stephen Smalley, Eric Paris, Casey Schaufler, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

Kees Cook <keescook@chromium.org> writes:

> On Fri, Nov 4, 2016 at 11:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> Eric, I hope you see my emails, I got the "Undelivered Mail Returned to Sender"
>>         ...

Oleg I can receive your messages directly and through vger.kernel.org
lists, but I can't receive them through the email reflector at
security@kernel.org.

>>         This is the mail system at host mail.kernel.org.
>>         ...
>>         <ebiederm@xmission.com> (expanded from <security@kernel.org>): host
>>             mx.xmission.com[166.70.12.20] said: 550-XM-RJCT16: SPF Failure
>>             (ip=198.145.29.136, frm=oleg@redhat.com, 550 result=fail) (in reply to RCPT
>>             TO command)
>>
>> right now I have no idea what does this mean.
>
> This is a problem for Google folks too sometimes. This is saying that
> xmission.com is checking redhat.com's SPF records and refusing to let
> kernel.org deliver email as if it were redhat.com (due to
> security@kernel.org being an alias not a mailing list). There aren't
> good solutions for this, but best I've found is to have my
> security@kernel.org alias be a @kernel.org address instead of an
> @google.com address...

Ugh.  Is even redhat configuring the redhat email to do that?
I will have to look.

Last I looked xmission.com was just enforcing the policy that the other
mail domains were asking to be enforced on themselves.  But those are
policies that are incompatible with mailing lists in general.  Although
I do get confused about which part SPF and DKIM play in this mess.

I just remember that the last several ``enhancements'' to email were
busily breaking mailing lists and I thought they were completely insane.
I can even find evidence that it is (or at least was) so bad that email
standards comittee member's can't comminicate with each other via email
lists.

vger.kernel.org appears to rewrite the envelope sender to avoid
problems.

If xmission is doing any more than just performing what the domain of
the senders of email asked them to do I will be happy to see if I can
to sort it out.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-08 22:46                   ` Eric W. Biederman
@ 2016-11-08 22:56                     ` Benjamin LaHaise
  2016-11-08 23:33                       ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: Benjamin LaHaise @ 2016-11-08 22:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Oleg Nesterov, Jann Horn, Alexander Viro,
	Roland McGrath, John Johansen, James Morris, Serge E. Hallyn,
	Paul Moore, Stephen Smalley, Eric Paris, Casey Schaufler,
	Andrew Morton, Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Ben Hutchings, Andy Lutomirski, Linus Torvalds, Krister Johansen,
	linux-fsdevel, linux-security-module, security

On Tue, Nov 08, 2016 at 04:46:44PM -0600, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
...
> > This is a problem for Google folks too sometimes. This is saying that
> > xmission.com is checking redhat.com's SPF records and refusing to let
> > kernel.org deliver email as if it were redhat.com (due to
> > security@kernel.org being an alias not a mailing list). There aren't
> > good solutions for this, but best I've found is to have my
> > security@kernel.org alias be a @kernel.org address instead of an
> > @google.com address...
> 
> Ugh.  Is even redhat configuring the redhat email to do that?
> I will have to look.
> 
> Last I looked xmission.com was just enforcing the policy that the other
> mail domains were asking to be enforced on themselves.  But those are
> policies that are incompatible with mailing lists in general.  Although
> I do get confused about which part SPF and DKIM play in this mess.
> 
> I just remember that the last several ``enhancements'' to email were
> busily breaking mailing lists and I thought they were completely insane.
> I can even find evidence that it is (or at least was) so bad that email
> standards comittee member's can't comminicate with each other via email
> lists.
> 
> vger.kernel.org appears to rewrite the envelope sender to avoid
> problems.

Envelope sender rewriting is insufficient, the From: lines need to be 
rewritten to be compliant.  This is a pain in the ass for the @kvack.org 
mailing lists as well -- people with @google.com addresses don't see the 
mailing list postings of users from @google.com and other domains using 
"enhanced" email header "validation" techniques.

		-ben

> If xmission is doing any more than just performing what the domain of
> the senders of email asked them to do I will be happy to see if I can
> to sort it out.
> 
> Eric

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-08 22:56                     ` Benjamin LaHaise
@ 2016-11-08 23:33                       ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2016-11-08 23:33 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kees Cook, Oleg Nesterov, Jann Horn, Alexander Viro,
	Roland McGrath, John Johansen, James Morris, Serge E. Hallyn,
	Paul Moore, Stephen Smalley, Eric Paris, Casey Schaufler,
	Andrew Morton, Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Ben Hutchings, Andy Lutomirski, Linus Torvalds, Krister Johansen,
	linux-fsdevel, linux-security-module, security

Benjamin LaHaise <bcrl@kvack.org> writes:

> On Tue, Nov 08, 2016 at 04:46:44PM -0600, Eric W. Biederman wrote:
>> Kees Cook <keescook@chromium.org> writes:
> ...
>> > This is a problem for Google folks too sometimes. This is saying that
>> > xmission.com is checking redhat.com's SPF records and refusing to let
>> > kernel.org deliver email as if it were redhat.com (due to
>> > security@kernel.org being an alias not a mailing list). There aren't
>> > good solutions for this, but best I've found is to have my
>> > security@kernel.org alias be a @kernel.org address instead of an
>> > @google.com address...
>> 
>> Ugh.  Is even redhat configuring the redhat email to do that?
>> I will have to look.
>> 
>> Last I looked xmission.com was just enforcing the policy that the other
>> mail domains were asking to be enforced on themselves.  But those are
>> policies that are incompatible with mailing lists in general.  Although
>> I do get confused about which part SPF and DKIM play in this mess.
>> 
>> I just remember that the last several ``enhancements'' to email were
>> busily breaking mailing lists and I thought they were completely insane.
>> I can even find evidence that it is (or at least was) so bad that email
>> standards comittee member's can't comminicate with each other via email
>> lists.
>> 
>> vger.kernel.org appears to rewrite the envelope sender to avoid
>> problems.
>
> Envelope sender rewriting is insufficient, the From: lines need to be 
> rewritten to be compliant.  This is a pain in the ass for the @kvack.org 
> mailing lists as well -- people with @google.com addresses don't see the 
> mailing list postings of users from @google.com and other domains using 
> "enhanced" email header "validation" techniques.

That definitely happens in the worst case.  At least for Oleg something
less serious is happening because the from header does not get changed
and the email gets to me through the vger.kernel.org lists.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-05 14:56                 ` Oleg Nesterov
@ 2016-11-09  0:34                   ` Eric W. Biederman
  2016-11-16 20:03                   ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2016-11-09  0:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/04, Oleg Nesterov wrote:
>>
>> On 11/04, Oleg Nesterov wrote:
>> >
>> > On 11/04, Eric W. Biederman wrote:
>> > >
>> > > The following mostly correct patch modifies zap_other_threads in
>> > > the case of a de_thread to not wait for zombies to be reaped.  The only
>> > > case that cares is ptrace (as threads are self reaping).  So I don't
>> > > think this will cause any problems except removing the strace -f race.
>> >
>> > From my previous email:
>> >
>> > 	So the only plan I currently have is change de_thread() to wait until
>> > 	other threads pass exit_notify() or even exit_signals(), but I don't
>> > 	like this.
>> >
>> > And yes, I don't like this, but perhaps this is what we should do.
>> >
>> > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
>> > checks doesn't look right, but off course technically this change should
>> > be simple enough.
>> >
>> > But not that simple. Just for example, the exiting sub-threads should
>> > not run with ->group_leader pointing to nowhere, in case it was reaped
>> > by de_thread.
>>
>> Not to mention other potential problems outside of ptrace/exec. For example
>> userns_install() can fail after mt-exec even without ptrace, simply because
>> thread_group_empty() can be false. Sure, easy to fix, and probably _install()
>> should use signal->live anyway, but still.
>>
>> And I didn't mention the fun with sighand unsharing. We simply can't do this
>> until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
>> The execing thread and the zombie threads will use different locks to, say,
>> remove the task from thread-group. Again, this is fixable, but not that
>> simple.
>>
>> > And we have another problem with PTRACE_EVENT_EXIT which can lead to the
>> > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
>> > defined. But this change will add the user-visible change.
>> >
>> > And if we add the user-visible changes, then perhaps we could simply untrace
>> > the traced sub-threads on exec. This change is simple, we do not even need
>> > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
>> > if exec is in progress. But I'm afraid we can't do this.
>
> So I was thinking about something like below. Untested, probably buggy/incomplete
> too, but hopefully can work.
>
> flush_old_exec() calls the new kill_sub_threads() helper which waits until
> all the sub-threads pass exit_notify().
>
> de_thread() is called after install_exec_creds(), it is simplified and waits
> for thread_group_empty() without cred_guard_mutex.
>
> But again, I do not really like this, and we need to do something with
> PTRACE_EVENT_EXIT anyway, this needs another/separate change. User-visible.
>
> And I disagree that this has nothing to do with cred_guard_mutex. And in any
> case we should narrow its scope in do_execve() path. Why do we take it so early?
> Why do we need to do, say, copy_strings() with this lock held? The original
> motivation for this has gone, acct_arg_size() can work just fine even if
> multiple threads call sys_execve().

The little piece of this puzzle that I understand is that we don't want
to ptrace_attach while a processes is in the middle of exec.  The name
cred_guard_mutex is odd for that, but that is what I see that lock
doing.

But ptrace really needs to consider either the original creds and mm or
the final creds and mm.  Halfway states are problem.

Solution to avoid that may simply be some code motion that allows the
mutex to have a smaller hold time.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/8] exec: introduce cred_guard_light
  2016-11-05 14:56                 ` Oleg Nesterov
  2016-11-09  0:34                   ` Eric W. Biederman
@ 2016-11-16 20:03                   ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2016-11-16 20:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
	James Morris, Serge E. Hallyn, Paul Moore, Stephen Smalley,
	Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
	Janis Danisevskis, Seth Forshee, Thomas Gleixner,
	Benjamin LaHaise, Ben Hutchings, Andy Lutomirski, Linus Torvalds,
	Krister Johansen, linux-fsdevel, linux-security-module, security

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/04, Oleg Nesterov wrote:
>>
>> On 11/04, Oleg Nesterov wrote:
>> >
>> > On 11/04, Eric W. Biederman wrote:
>> > >
>> > > The following mostly correct patch modifies zap_other_threads in
>> > > the case of a de_thread to not wait for zombies to be reaped.  The only
>> > > case that cares is ptrace (as threads are self reaping).  So I don't
>> > > think this will cause any problems except removing the strace -f race.
>> >
>> > From my previous email:
>> >
>> > 	So the only plan I currently have is change de_thread() to wait until
>> > 	other threads pass exit_notify() or even exit_signals(), but I don't
>> > 	like this.
>> >
>> > And yes, I don't like this, but perhaps this is what we should do.
>> >
>> > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
>> > checks doesn't look right, but off course technically this change should
>> > be simple enough.
>> >
>> > But not that simple. Just for example, the exiting sub-threads should
>> > not run with ->group_leader pointing to nowhere, in case it was reaped
>> > by de_thread.
>>
>> Not to mention other potential problems outside of ptrace/exec. For example
>> userns_install() can fail after mt-exec even without ptrace, simply because
>> thread_group_empty() can be false. Sure, easy to fix, and probably _install()
>> should use signal->live anyway, but still.
>>
>> And I didn't mention the fun with sighand unsharing. We simply can't do this
>> until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
>> The execing thread and the zombie threads will use different locks to, say,
>> remove the task from thread-group. Again, this is fixable, but not that
>> simple.
>>
>> > And we have another problem with PTRACE_EVENT_EXIT which can lead to the
>> > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
>> > defined. But this change will add the user-visible change.
>> >
>> > And if we add the user-visible changes, then perhaps we could simply untrace
>> > the traced sub-threads on exec. This change is simple, we do not even need
>> > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
>> > if exec is in progress. But I'm afraid we can't do this.
>
> So I was thinking about something like below. Untested, probably buggy/incomplete
> too, but hopefully can work.
>
> flush_old_exec() calls the new kill_sub_threads() helper which waits until
> all the sub-threads pass exit_notify().
>
> de_thread() is called after install_exec_creds(), it is simplified and waits
> for thread_group_empty() without cred_guard_mutex.
>
> But again, I do not really like this, and we need to do something with
> PTRACE_EVENT_EXIT anyway, this needs another/separate change. User-visible.
>
> And I disagree that this has nothing to do with cred_guard_mutex. And in any
> case we should narrow its scope in do_execve() path. Why do we take it so early?
> Why do we need to do, say, copy_strings() with this lock held? The original
> motivation for this has gone, acct_arg_size() can work just fine even if
> multiple threads call sys_execve().
>
> I'll try to discuss the possible changes in LSM hooks with Jann, I still think
> that this is what we actually need to do. At least try to do, possibly this is
> too complicated.

The code below looks interesting.

Am I wrong or do we get the PTRACE_EVENT_EXIT case wrong for the
multi-threaded exec's when we don't exec from the primary thread?  AKA I
think the primary thread will pass through ptrace_event(PTRACE_EVENT_EXIT)
before we steal it's thread and likewise the thread that calls exec
won't pass through ptrace_event(PTRACE_EVENT_EXIT).

Which I suspect gives us quite a bit of lattitude to skip that
notification entirely without notifying userspace.  We need to test to
be certain that both gdb and strace can cope.  But I do suspect we could
just throw ptrace_event(PTRACE_EVENT_EXIT) out in the case of a
multi-threaded exec and no one would care.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-11-16 20:07 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-30 21:46 [PATCH v3 0/8] Various fixes related to ptrace_may_access() Jann Horn
2016-10-30 21:46 ` [PATCH v3 1/8] exec: introduce cred_guard_light Jann Horn
2016-11-02 18:18   ` Oleg Nesterov
2016-11-02 20:50     ` Jann Horn
2016-11-02 21:38       ` Ben Hutchings
2016-11-02 21:54         ` Jann Horn
2016-11-03 18:12       ` Oleg Nesterov
2016-11-03 21:17         ` Jann Horn
2016-11-04 13:26         ` Eric W. Biederman
2016-11-04 15:00           ` Eric W. Biederman
2016-11-04 18:04             ` Oleg Nesterov
2016-11-04 18:45               ` Oleg Nesterov
2016-11-05 14:56                 ` Oleg Nesterov
2016-11-09  0:34                   ` Eric W. Biederman
2016-11-16 20:03                   ` Eric W. Biederman
2016-11-08 22:02                 ` Kees Cook
2016-11-08 22:46                   ` Eric W. Biederman
2016-11-08 22:56                     ` Benjamin LaHaise
2016-11-08 23:33                       ` Eric W. Biederman
2016-10-30 21:46 ` [PATCH v3 2/8] exec: add privunit to task_struct Jann Horn
2016-10-30 21:46 ` [PATCH v3 3/8] proc: use open()-time creds for ptrace checks Jann Horn
2016-10-30 21:46 ` [PATCH v3 4/8] futex: don't leak robust_list pointer Jann Horn
2016-10-30 21:46 ` [PATCH v3 5/8] proc: lock properly in ptrace_may_access callers Jann Horn
2016-10-30 21:46 ` [PATCH v3 6/8] fs/proc: fix attr access check Jann Horn
2016-10-30 21:46 ` [PATCH v3 7/8] proc: fix timerslack_ns handling Jann Horn
2016-10-30 21:46 ` [PATCH v3 8/8] Documentation: add security/ptrace_checks.txt Jann Horn
2016-11-01 23:57 ` [PATCH v3 0/8] Various fixes related to ptrace_may_access() Linus Torvalds
2016-11-02 18:38   ` Oleg Nesterov
2016-11-02 21:40     ` Jann Horn
2016-11-03 19:09   ` Andrew Morton
2016-11-03 20:01     ` Jann Horn
2016-11-04  0:57   ` James Morris

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).