LKML Archive on lore.kernel.org
* [PATCH] exec: Fix a deadlock in ptrace
@ 2020-03-01 11:27 Bernd Edlinger
  2020-03-01 15:13 ` Aleksa Sarai
  2020-03-01 18:21 ` Jann Horn
  0 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-01 11:27 UTC
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Jann Horn, James Morris,
	Kees Cook, Greg Kroah-Hartman, Shakeel Butt, Christian Brauner,
	Jason Gunthorpe, Christian Kellner, Andrea Arcangeli,
	Aleksa Sarai, Dmitry V. Levin, linux-doc, linux-kernel,
	linux-fsdevel, linux-mm

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap while another thread in the tracee starts to
execve a child process, but the execve cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to have a second mutex that is
used in mm_access, so it is allowed to continue while the
dying threads are not yet terminated.
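
As a rough illustration of the lock split, here is a user-space model in C.
It is only a sketch under assumptions: the two kernel mutexes are modelled
as try-lock flags so that the states can be observed from a single thread
(the kernel code uses mutex_lock_killable and would block instead), and
struct signal_model, exec_begin, exec_flush_old, exec_install_creds and
mm_access_model are hypothetical names, not kernel functions.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the two mutexes in struct signal_struct. */
struct signal_model {
	atomic_flag cred_guard_mutex;	/* held for the whole execve */
	atomic_flag cred_change_mutex;	/* held only while creds and mm change */
};

#define SIGNAL_MODEL_INIT { ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT }

/* Try-lock model: 0 on success, -EBUSY if the flag is already set. */
static int model_trylock(atomic_flag *m)
{
	return atomic_flag_test_and_set(m) ? -EBUSY : 0;
}

static void model_unlock(atomic_flag *m)
{
	atomic_flag_clear(m);
}

/* Start of execve: taken before waiting for the sibling threads to die. */
static int exec_begin(struct signal_model *sig)
{
	return model_trylock(&sig->cred_guard_mutex);
}

/* Models the point in flush_old_exec() where the patch takes the new mutex. */
static int exec_flush_old(struct signal_model *sig)
{
	return model_trylock(&sig->cred_change_mutex);
}

/* Models install_exec_creds(): both mutexes are released. */
static void exec_install_creds(struct signal_model *sig)
{
	model_unlock(&sig->cred_change_mutex);
	model_unlock(&sig->cred_guard_mutex);
}

/* Models mm_access() after the patch: only the new mutex is consulted. */
static int mm_access_model(struct signal_model *sig)
{
	int err = model_trylock(&sig->cred_change_mutex);

	if (err)
		return err;
	model_unlock(&sig->cred_change_mutex);
	return 0;
}
```

In this model, mm_access_model() succeeds between exec_begin() and
exec_flush_old(), which corresponds to the window in which the tracer
previously blocked, and reports -EBUSY only between exec_flush_old() and
exec_install_creds().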

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst | 18 ++++++++++--------
 fs/exec.c                              |  9 +++++++++
 include/linux/binfmts.h                |  6 +++++-
 include/linux/sched/signal.h           |  1 +
 init/init_task.c                       |  1 +
 kernel/cred.c                          |  2 +-
 kernel/fork.c                          |  5 +++--
 mm/process_vm_access.c                 |  2 +-
 8 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..c98e0a8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,13 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and the mutex
+current->signal->cred_change_mutex is acquired later, while the credentials
+and the process mmap are actually changed.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +470,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +489,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a6884e4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_change_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_flush_old_exec)
+			mutex_unlock(&current->signal->cred_change_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_change_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_change_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..37eeabe 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	struct mutex cred_change_mutex; /* guard against credentials change */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cd9a0f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0395154 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_killable(&task->signal->cred_change_mutex);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 		mmput(mm);
 		mm = ERR_PTR(-EACCES);
 	}
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_change_mutex);
 
 	return mm;
 }
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_change_mutex);
 
 	return 0;
 }
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
-- 
1.9.1


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 11:27 [PATCH] exec: Fix a deadlock in ptrace Bernd Edlinger
@ 2020-03-01 15:13 ` Aleksa Sarai
  2020-03-01 15:58   ` Christian Brauner
  2020-03-01 17:24   ` Bernd Edlinger
  2020-03-01 18:21 ` Jann Horn
  1 sibling, 2 replies; 203+ messages in thread
From: Aleksa Sarai @ 2020-03-01 15:13 UTC
  To: Bernd Edlinger
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Jann Horn, James Morris,
	Kees Cook, Greg Kroah-Hartman, Shakeel Butt, Christian Brauner,
	Jason Gunthorpe, Christian Kellner, Andrea Arcangeli,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm


[-- Attachment #1: Type: text/plain, Size: 2465 bytes --]

On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread is running.
> 
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but strace is no
> longer able to respond, since it is blocked in mm_access.
> 
> The deadlock always happens when strace needs to access the
> tracee's process mmap while another thread in the tracee starts to
> execve a child process, but the execve cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.
> 
> I also took the opportunity to improve the documentation
> of prepare_creds, which is obviously out of sync.
>
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

I can't comment on the validity of the patch, but I also found and
reported this issue in 2016[1] and the discussion quickly veered into
the problem being more complicated (and uglier) than it seems at first
glance.

You should probably also Cc stable, given this has been a long-standing
issue and your patch doesn't look (too) invasive.

[1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 15:13 ` Aleksa Sarai
@ 2020-03-01 15:58   ` Christian Brauner
  2020-03-01 17:46     ` Bernd Edlinger
  2020-03-01 17:24   ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-01 15:58 UTC
  To: Aleksa Sarai, Bernd Edlinger, Oleg Nesterov
  Cc: Bernd Edlinger, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Alexey Dobriyan, Eric W. Biederman, Thomas Gleixner,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Jann Horn, James Morris,
	Kees Cook, Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm

On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> > This fixes a deadlock in the tracer when tracing a multi-threaded
> > application that calls execve while more than one thread is running.
> > 
> > I observed that when running strace on the gcc test suite, it always
> > blocks after a while, when expect calls execve, because other threads
> > have to be terminated.  They send ptrace events, but strace is no
> > longer able to respond, since it is blocked in mm_access.
> > 
> > The deadlock always happens when strace needs to access the
> > tracee's process mmap while another thread in the tracee starts to
> > execve a child process, but the execve cannot continue until the
> > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> > 
> > strace          D    0 30614  30584 0x00000000
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > schedule_preempt_disabled+0x15/0x20
> > __mutex_lock.isra.13+0x1ec/0x520
> > __mutex_lock_killable_slowpath+0x13/0x20
> > mutex_lock_killable+0x28/0x30
> > mm_access+0x27/0xa0
> > process_vm_rw_core.isra.3+0xff/0x550
> > process_vm_rw+0xdd/0xf0
> > __x64_sys_process_vm_readv+0x31/0x40
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > expect          D    0 31933  30876 0x80004003
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > flush_old_exec+0xc4/0x770
> > load_elf_binary+0x35a/0x16c0
> > search_binary_handler+0x97/0x1d0
> > __do_execve_file.isra.40+0x5d4/0x8a0
> > __x64_sys_execve+0x49/0x60
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > The proposed solution is to have a second mutex that is
> > used in mm_access, so it is allowed to continue while the
> > dying threads are not yet terminated.
> > 
> > I also took the opportunity to improve the documentation
> > of prepare_creds, which is obviously out of sync.
> >
> > Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> 
> I can't comment on the validity of the patch, but I also found and
> reported this issue in 2016[1] and the discussion quickly veered into
> the problem being more complicated (and uglier) than it seems at first
> glance.
> 
> You should probably also Cc stable, given this has been a long-standing
> issue and your patch doesn't look (too) invasive.
> 
> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/

Yeah, I remember you mentioning this a while back.

Bernd, we really want a reproducer for this sent alongside this
patch and added to:
tools/testing/selftests/ptrace/
Having a test for this bug, irrespective of whether or not we go with
this as the fix, seems really worth it.

Oleg seems to have suggested that a potential alternative fix is to wait
in de_thread() until all other threads in the thread-group have passed
exit_notify(). Right now we only kill them but don't wait. Currently
de_thread() only waits for the thread-group leader to pass exit_notify()
whenever a non-thread-group leader thread execs (because the exec'ing
thread becomes the new thread-group leader with the same pid as the
former thread-group leader).

Christian


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 15:13 ` Aleksa Sarai
  2020-03-01 15:58   ` Christian Brauner
@ 2020-03-01 17:24   ` Bernd Edlinger
  1 sibling, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-01 17:24 UTC
  To: Aleksa Sarai
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Jann Horn, James Morris,
	Kees Cook, Greg Kroah-Hartman, Shakeel Butt, Christian Brauner,
	Jason Gunthorpe, Christian Kellner, Andrea Arcangeli,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm


[-- Attachment #1: Type: text/plain, Size: 4335 bytes --]

Hi Aleksa,

On 3/1/20 4:13 PM, Aleksa Sarai wrote:
> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread is running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but strace is no
>> longer able to respond, since it is blocked in mm_access.
>>
>> The deadlock always happens when strace needs to access the
>> tracee's process mmap while another thread in the tracee starts to
>> execve a child process, but the execve cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>
>> strace          D    0 30614  30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect          D    0 31933  30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> The proposed solution is to have a second mutex that is
>> used in mm_access, so it is allowed to continue while the
>> dying threads are not yet terminated.
>>
>> I also took the opportunity to improve the documentation
>> of prepare_creds, which is obviously out of sync.
>>
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> 
> I can't comment on the validity of the patch, but I also found and
> reported this issue in 2016[1] and the discussion quickly veered into
> the problem being more complicated (and uglier) than it seems at first
> glance.
> 
> You should probably also Cc stable, given this has been a long-standing
> issue and your patch doesn't look (too) invasive.
> 

I am fully aware that this patch won't fix the case when PTRACE_ACCESS is racing
with de_thread.  But I don't see a problem with allowing vm access based on the
current credentials as they are still the same until de_thread is done with its
job.  And in a practical way this fixes 99% of the real problem here, as it only
happens because strace is currently tracing something and needs access to the
parameters in the tracee's vm space.
Of course you could fork the strace process to do any PTRACE_ACCESS when necessary,
and, well, maybe that would fix the remaining problem here...

However, before I considered changing the kernel for this I tried to fix this
within strace.  First I tried to wait in the signal handler.  See attached
strace-patch-1.diff, but that did not work, BUT I think it is possible that the
patch you proposed previously would actually make it work.

I then tried another approach, using a worker thread to wait for the children,
but it only worked when I removed PTRACE_O_TRACEEXIT from the ptrace options,
because the ptrace(PTRACE_SYSCALL, pid, 0L, 0L) does not work in the worker thread
(rv = -1, errno = 3, i.e. ESRCH), and unfortunately the main thread is blocked and
unable to do the ptrace call that makes the thread continue.
So I consider that second patch really ugly, and wouldn't propose something like
that seriously.
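
For reference, errno 3 is ESRCH on Linux, which ptrace(2) documents for a
target that does not exist, is not traced by the caller, or is not stopped.
My understanding is that a tracee is bound to the thread that attached to
it, which would explain why the request fails from a worker thread. The
sketch below does not reproduce the worker-thread setup; it assumes a
simpler case that yields the same error, a child the caller never attached
to. The helper name ptrace_errno_for_untraced_child is hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Issue PTRACE_SYSCALL against a child that the caller never attached to.
 * Returns the resulting errno, 0 if the request unexpectedly succeeds,
 * or -1 if fork() fails.
 */
static int ptrace_errno_for_untraced_child(void)
{
	pid_t pid = fork();
	int err = 0;

	if (pid < 0)
		return -1;
	if (pid == 0) {
		pause();	/* child: idle until it is killed */
		_exit(0);
	}

	errno = 0;
	if (ptrace(PTRACE_SYSCALL, pid, 0L, 0L) == -1)
		err = errno;

	/* Clean up the child regardless of the ptrace result. */
	kill(pid, SIGKILL);
	while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
		;
	return err;
}
```

On an unrestricted Linux system I would expect this helper to return ESRCH;
a sandbox that filters ptrace may report a different error.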


@@ -69,7 +71,7 @@
 cflag_t cflag = CFLAG_NONE;
 unsigned int followfork;
 unsigned int ptrace_setoptions = PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC
-                                | PTRACE_O_TRACEEXIT;
+                                ;//| PTRACE_O_TRACEEXIT;
 unsigned int xflag;
 bool debug_flag;
 bool Tflag;

So it only works because of this line; without it, strace is not able to make the
thread continue after the PTRACE_EVENT_EXIT.
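
To make the event check used in the attached patches explicit, here is a
small C sketch of how a wait status can be tested for a PTRACE_EVENT_EXIT
stop. It assumes the Linux status encoding described in ptrace(2), where
the event number is carried in bits 16 and up of a SIGTRAP stop; the helper
names ptrace_event_of and is_exit_event_stop are hypothetical and not taken
from strace.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* number carried by a ptrace event stop,
 * or 0 if the status is not a SIGTRAP stop.
 */
static int ptrace_event_of(int status)
{
	if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
		return 0;
	return (int)((unsigned int)status >> 16);
}

/* True if the tracee stopped just before exiting (PTRACE_O_TRACEEXIT). */
static bool is_exit_event_stop(int status)
{
	return ptrace_event_of(status) == PTRACE_EVENT_EXIT;
}
```

A PTRACE_EVENT_EXIT stop is distinct from the final WIFEXITED status: the
tracee is still alive at the stop and has to be resumed by its tracer
before the exit status can be collected with wait.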


Thanks
Bernd.

> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> 

[-- Attachment #2: strace-patch-1.diff --]
[-- Type: text/x-patch; name="strace-patch-1.diff", Size: 7652 bytes --]

diff -ur strace-5.5/delay.h strace-5.5.x/delay.h
--- strace-5.5/delay.h	2019-08-06 15:38:20.000000000 +0200
+++ strace-5.5.x/delay.h	2020-02-29 12:39:51.563110827 +0100
@@ -14,5 +14,6 @@
 void delay_timer_expired(void);
 void arm_delay_timer(const struct tcb *);
 void delay_tcb(struct tcb *, uint16_t delay_idx, bool isenter);
+int my_waitpid(int, int*, int);
 
 #endif /* !STRACE_DELAY_H */
diff -ur strace-5.5/filter_seccomp.c strace-5.5.x/filter_seccomp.c
--- strace-5.5/filter_seccomp.c	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.x/filter_seccomp.c	2020-02-29 12:42:43.184120263 +0100
@@ -19,6 +19,7 @@
 #include "number_set.h"
 #include "syscall.h"
 #include "scno.h"
+#include "delay.h"
 
 bool seccomp_filtering;
 bool seccomp_before_sysentry;
@@ -136,7 +137,7 @@
 		int status;
 
 		for (;;) {
-			long rc = waitpid(pid, &status, 0);
+			long rc = my_waitpid(pid, &status, 0);
 			if (rc < 0 && errno == EINTR)
 				continue;
 			if (rc == pid)
@@ -272,7 +273,7 @@
 	if (pid) {
 		kill(pid, SIGKILL);
 		for (;;) {
-			long rc = waitpid(pid, NULL, 0);
+			long rc = my_waitpid(pid, NULL, 0);
 			if (rc < 0 && errno == EINTR)
 				continue;
 			break;
diff -ur strace-5.5/Makefile.am strace-5.5.x/Makefile.am
--- strace-5.5/Makefile.am	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.x/Makefile.am	2020-02-29 10:28:04.515676065 +0100
@@ -45,7 +45,7 @@
 strace_CPPFLAGS = $(AM_CPPFLAGS)
 strace_CFLAGS = $(AM_CFLAGS)
 strace_LDFLAGS =
-strace_LDADD = libstrace.a $(clock_LIBS) $(timer_LIBS)
+strace_LDADD = libstrace.a -lpthread $(clock_LIBS) $(timer_LIBS)
 noinst_LIBRARIES = libstrace.a
 
 libstrace_a_CPPFLAGS = $(strace_CPPFLAGS)
diff -ur strace-5.5/Makefile.in strace-5.5.x/Makefile.in
--- strace-5.5/Makefile.in	2020-02-06 17:23:35.000000000 +0100
+++ strace-5.5.x/Makefile.in	2020-02-29 10:28:28.833677402 +0100
@@ -1631,7 +1631,7 @@
 	$(am__append_11) $(CODE_COVERAGE_CPPFLAGS)
 strace_CFLAGS = $(AM_CFLAGS) $(am__append_4) $(CODE_COVERAGE_CFLAGS)
 strace_LDFLAGS = $(am__append_5) $(am__append_9) $(am__append_12)
-strace_LDADD = libstrace.a $(clock_LIBS) $(timer_LIBS) $(am__append_6) \
+strace_LDADD = libstrace.a -lpthread $(clock_LIBS) $(timer_LIBS) $(am__append_6) \
 	$(am__append_10) $(am__append_13) $(CODE_COVERAGE_LIBS) \
 	$(am__append_14) $(am__append_18)
 noinst_LIBRARIES = libstrace.a $(am__append_15) $(am__append_19)
diff -ur strace-5.5/ptrace_syscall_info.c strace-5.5.x/ptrace_syscall_info.c
--- strace-5.5/ptrace_syscall_info.c	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.x/ptrace_syscall_info.c	2020-02-29 12:41:44.565117040 +0100
@@ -12,6 +12,7 @@
 #include "ptrace.h"
 #include "ptrace_syscall_info.h"
 #include "scno.h"
+#include "delay.h"
 
 #include <signal.h>
 #include <sys/wait.h>
@@ -118,7 +119,7 @@
 		};
 		const size_t size = sizeof(info);
 		int status;
-		long rc = waitpid(pid, &status, 0);
+		long rc = my_waitpid(pid, &status, 0);
 		if (rc != pid) {
 			/* cannot happen */
 			kill_tracee(pid);
@@ -247,7 +248,7 @@
 done:
 	if (pid) {
 		kill_tracee(pid);
-		waitpid(pid, NULL, 0);
+		my_waitpid(pid, NULL, 0);
 		ptrace_stop = -1U;
 	}
 
diff -ur strace-5.5/strace.c strace-5.5.x/strace.c
--- strace-5.5/strace.c	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.x/strace.c	2020-03-01 07:53:27.028407698 +0100
@@ -15,6 +15,7 @@
 #include <fcntl.h>
 #include "ptrace.h"
 #include <signal.h>
+#include <semaphore.h>
 #include <sys/resource.h>
 #include <sys/stat.h>
 #ifdef HAVE_PATHS_H
@@ -1002,7 +1003,7 @@
 	 */
 	for (;;) {
 		unsigned int sig;
-		if (waitpid(tcp->pid, &status, __WALL) < 0) {
+		if (my_waitpid(tcp->pid, &status, __WALL) < 0) {
 			if (errno == EINTR)
 				continue;
 			/*
@@ -1615,7 +1616,7 @@
 		int status, tracee_pid;
 
 		errno = 0;
-		tracee_pid = waitpid(pid, &status, 0);
+		tracee_pid = my_waitpid(pid, &status, 0);
 		if (tracee_pid <= 0) {
 			if (errno == EINTR)
 				continue;
@@ -1663,6 +1664,69 @@
 	sigaction(signo, &sa, oldact);
 }
 
+#define MAX_WAITIDX 65536
+static unsigned short in_idx = 0, out_idx = 0;
+static sem_t wait_sem;
+static int wait_pid[MAX_WAITIDX];
+static int wait_status[MAX_WAITIDX];
+static struct rusage wait_rusage[MAX_WAITIDX];
+
+static void
+child_sighandler(int sig)
+{
+	int old_errno = errno;
+	int status;
+	struct rusage ru;
+	int pid = wait4(-1, &status, __WALL | WNOHANG, (cflag ? &ru : NULL));
+
+	if (pid > 0) {
+		if (WIFSTOPPED(status) && (status >> 16) == PTRACE_EVENT_EXIT)
+			ptrace(PTRACE_SYSCALL, pid, 0L, 0L);
+		wait_pid[in_idx] = pid;
+		wait_status[in_idx] = status;
+		if (cflag)
+			wait_rusage[in_idx] = ru;
+		in_idx++;
+		if (in_idx == out_idx || sem_post(&wait_sem) == -1)
+		{
+			const char *msg = "fatal error in child_sighandler\n"; 
+			status = write(STDERR_FILENO, msg, strlen(msg));
+			_exit(2);
+		}
+	}
+
+	errno = old_errno; 
+}
+
+int my_waitpid(int pid, int *status, int options)
+{
+	int skip = 0;
+	unsigned short idx = out_idx;
+	for (;;) {
+		while (sem_wait(&wait_sem) == -1 && errno == EINTR)
+			;
+		if (wait_pid[idx] == pid)
+			break;
+		idx++;
+		skip++;
+	}
+	*status = wait_status[idx];
+	while (skip > 0) {
+		unsigned short idx1 = idx;
+		idx1--;
+		wait_status[idx] = wait_status[idx1];
+		wait_pid[idx] = wait_pid[idx1];
+		if (cflag)
+			wait_rusage[idx] = wait_rusage[idx1];
+		if (sem_post(&wait_sem) == -1)
+			error_msg_and_die("fatal error in my_waitpid"); 
+		skip--;
+		idx--;
+	}
+	out_idx++;
+	return pid;
+}
+
 /*
  * Initialization part of main() was eating much stack (~0.5k),
  * which was unused after init.
@@ -2015,7 +2079,9 @@
 	memset(acolumn_spaces, ' ', acolumn);
 	acolumn_spaces[acolumn] = '\0';
 
-	set_sighandler(SIGCHLD, SIG_DFL, &params_for_tracee.child_sa);
+	if (sem_init(&wait_sem, 0, 0) == -1)
+		perror_msg_and_die("Unable to initialize signal wait sema");
+	set_sighandler(SIGCHLD, child_sighandler, &params_for_tracee.child_sa);
 
 #ifdef ENABLE_STACKTRACE
 	if (stack_trace_enabled)
@@ -2607,10 +2673,28 @@
 	 * then the system call will be interrupted and
 	 * the expiration will be handled by the signal handler.
 	 */
-	int status;
+	int status = 0;
 	struct rusage ru;
-	int pid = wait4(-1, &status, __WALL, (cflag ? &ru : NULL));
-	int wait_errno = errno;
+	int pid = 0;
+	int wait_errno = 0;
+	if (in_idx == out_idx) {
+		pid = wait4(-1, &status, __WALL | WNOHANG, (cflag ? &ru : NULL));
+		wait_errno = errno;
+		if (pid > 0 && WIFSTOPPED(status) && (status >> 16) == PTRACE_EVENT_EXIT)
+			ptrace(PTRACE_SYSCALL, pid, 0L, 0L);
+	}
+	if (pid == 0) {
+		while (sem_wait(&wait_sem) == -1 && errno == EINTR)
+			;
+
+		if (in_idx == out_idx)
+			error_msg_and_die("wait queue error");
+		pid = wait_pid[out_idx];
+		status = wait_status[out_idx];
+		if (cflag)
+			ru = wait_rusage[out_idx];
+		out_idx++;
+	}
 
 	/*
 	 * The window of opportunity to handle expirations
@@ -2791,8 +2875,17 @@
 			break;
 
 next_event_wait_next:
-		pid = wait4(-1, &status, __WALL | WNOHANG, (cflag ? &ru : NULL));
-		wait_errno = errno;
+		pid = 0;
+		if (in_idx != out_idx) {
+			while (sem_wait(&wait_sem) == -1 && errno == EINTR)
+				;
+
+			pid = wait_pid[out_idx];
+			status = wait_status[out_idx];
+			if (cflag)
+				ru = wait_rusage[out_idx];
+			out_idx++;
+		}
 		wait_nohang = true;
 	}
 
@@ -3019,7 +3112,7 @@
 
 	case TE_STOP_BEFORE_EXIT:
 		print_event_exit(current_tcp);
-		break;
+		return true;
 	}
 
 	/* We handled quick cases, we are permitted to interrupt now. */
@@ -3138,7 +3231,7 @@
 	if (shared_log != stderr)
 		fclose(shared_log);
 	if (popen_pid) {
-		while (waitpid(popen_pid, NULL, 0) < 0 && errno == EINTR)
+		while (my_waitpid(popen_pid, NULL, 0) < 0 && errno == EINTR)
 			;
 	}
 	if (sig) {

[-- Attachment #3: strace-patch-2.diff --]
[-- Type: text/x-patch; name="strace-patch-2.diff", Size: 7878 bytes --]

diff -ur strace-5.5/delay.h strace-5.5.y/delay.h
--- strace-5.5/delay.h	2019-08-06 15:38:20.000000000 +0200
+++ strace-5.5.y/delay.h	2020-02-29 12:39:51.563110827 +0100
@@ -14,5 +14,6 @@
 void delay_timer_expired(void);
 void arm_delay_timer(const struct tcb *);
 void delay_tcb(struct tcb *, uint16_t delay_idx, bool isenter);
+int my_waitpid(int, int*, int);
 
 #endif /* !STRACE_DELAY_H */
diff -ur strace-5.5/filter_seccomp.c strace-5.5.y/filter_seccomp.c
--- strace-5.5/filter_seccomp.c	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.y/filter_seccomp.c	2020-02-29 12:42:43.184120263 +0100
@@ -19,6 +19,7 @@
 #include "number_set.h"
 #include "syscall.h"
 #include "scno.h"
+#include "delay.h"
 
 bool seccomp_filtering;
 bool seccomp_before_sysentry;
@@ -136,7 +137,7 @@
 		int status;
 
 		for (;;) {
-			long rc = waitpid(pid, &status, 0);
+			long rc = my_waitpid(pid, &status, 0);
 			if (rc < 0 && errno == EINTR)
 				continue;
 			if (rc == pid)
@@ -272,7 +273,7 @@
 	if (pid) {
 		kill(pid, SIGKILL);
 		for (;;) {
-			long rc = waitpid(pid, NULL, 0);
+			long rc = my_waitpid(pid, NULL, 0);
 			if (rc < 0 && errno == EINTR)
 				continue;
 			break;
diff -ur strace-5.5/Makefile.am strace-5.5.y/Makefile.am
--- strace-5.5/Makefile.am	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.y/Makefile.am	2020-02-29 10:28:04.515676065 +0100
@@ -45,7 +45,7 @@
 strace_CPPFLAGS = $(AM_CPPFLAGS)
 strace_CFLAGS = $(AM_CFLAGS)
 strace_LDFLAGS =
-strace_LDADD = libstrace.a $(clock_LIBS) $(timer_LIBS)
+strace_LDADD = libstrace.a -lpthread $(clock_LIBS) $(timer_LIBS)
 noinst_LIBRARIES = libstrace.a
 
 libstrace_a_CPPFLAGS = $(strace_CPPFLAGS)
diff -ur strace-5.5/Makefile.in strace-5.5.y/Makefile.in
--- strace-5.5/Makefile.in	2020-02-06 17:23:35.000000000 +0100
+++ strace-5.5.y/Makefile.in	2020-02-29 10:28:28.833677402 +0100
@@ -1631,7 +1631,7 @@
 	$(am__append_11) $(CODE_COVERAGE_CPPFLAGS)
 strace_CFLAGS = $(AM_CFLAGS) $(am__append_4) $(CODE_COVERAGE_CFLAGS)
 strace_LDFLAGS = $(am__append_5) $(am__append_9) $(am__append_12)
-strace_LDADD = libstrace.a $(clock_LIBS) $(timer_LIBS) $(am__append_6) \
+strace_LDADD = libstrace.a -lpthread $(clock_LIBS) $(timer_LIBS) $(am__append_6) \
 	$(am__append_10) $(am__append_13) $(CODE_COVERAGE_LIBS) \
 	$(am__append_14) $(am__append_18)
 noinst_LIBRARIES = libstrace.a $(am__append_15) $(am__append_19)
diff -ur strace-5.5/strace.c strace-5.5.y/strace.c
--- strace-5.5/strace.c	2020-02-06 16:16:17.000000000 +0100
+++ strace-5.5.y/strace.c	2020-03-01 07:59:55.586429063 +0100
@@ -15,6 +15,8 @@
 #include <fcntl.h>
 #include "ptrace.h"
 #include <signal.h>
+#include <semaphore.h>
+#include <pthread.h>
 #include <sys/resource.h>
 #include <sys/stat.h>
 #ifdef HAVE_PATHS_H
@@ -69,7 +71,7 @@
 cflag_t cflag = CFLAG_NONE;
 unsigned int followfork;
 unsigned int ptrace_setoptions = PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC
-				 | PTRACE_O_TRACEEXIT;
+				 ;//| PTRACE_O_TRACEEXIT;
 unsigned int xflag;
 bool debug_flag;
 bool Tflag;
@@ -1002,7 +1004,7 @@
 	 */
 	for (;;) {
 		unsigned int sig;
-		if (waitpid(tcp->pid, &status, __WALL) < 0) {
+		if (my_waitpid(tcp->pid, &status, __WALL) < 0) {
 			if (errno == EINTR)
 				continue;
 			/*
@@ -1663,6 +1665,83 @@
 	sigaction(signo, &sa, oldact);
 }
 
+#define MAX_WAITIDX 65536
+static unsigned short in_idx = 0, out_idx = 0;
+static sem_t wait_sem;
+static pthread_t wait_thread;
+static int wait_pid[MAX_WAITIDX];
+static int wait_status[MAX_WAITIDX];
+static struct rusage wait_rusage[MAX_WAITIDX];
+
+static void*
+child_sighandler(void *arg)
+{
+	int status;
+	struct rusage ru;
+	int pid;
+	for (;;) {
+		pid = wait4(-1, &status, __WALL, (cflag ? &ru : NULL));
+		if (pid < 0 && errno == EINTR)
+			continue;
+
+		if (pid < 0)
+			pid = -errno;
+
+		if (pid > 0 && WIFSTOPPED(status) && (status >> 16) == PTRACE_EVENT_EXIT) {
+			int i = ptrace(PTRACE_SYSCALL, pid, 0L, 0L);
+			fprintf(stderr, "in thread: ptrace(PTRACE_SYSCALL, %d, 0L, 0L)=%d errno=%d\n", pid, i, errno);
+		}
+		wait_pid[in_idx] = pid;
+		wait_status[in_idx] = status;
+		if (cflag)
+			wait_rusage[in_idx] = ru;
+		in_idx++;
+		if (in_idx == out_idx || sem_post(&wait_sem) == -1)
+			error_msg_and_die("fatal error in child_sighandler"); 
+		if (pid < 0)
+			break;
+	}
+
+	return NULL;
+}
+
+int my_waitpid(int pid, int *status, int options)
+{
+	int skip = 0;
+	unsigned short idx = out_idx;
+	for (;;) {
+		while (sem_wait(&wait_sem) == -1 && errno == EINTR)
+			;
+		if (wait_pid[idx] < 0) {
+			while (skip-- >= 0)
+				sem_post(&wait_sem);
+			errno = -wait_pid[idx];
+			return -1;
+		}
+		if (wait_pid[idx] == pid)
+			break;
+		idx++;
+		skip++;
+	}
+	*status = wait_status[idx];
+	while (skip > 0) {
+		unsigned short idx1 = idx;
+		idx1--;
+		wait_status[idx] = wait_status[idx1];
+		wait_pid[idx] = wait_pid[idx1];
+		if (cflag)
+			wait_rusage[idx] = wait_rusage[idx1];
+		if (sem_post(&wait_sem) == -1)
+			error_msg_and_die("fatal error in my_waitpid"); 
+		skip--;
+		idx--;
+	}
+	out_idx++;
+	if (pid < 0)
+		errno = -pid;
+	return pid < 0 ? -1 : pid;
+}
+
 /*
  * Initialization part of main() was eating much stack (~0.5k),
  * which was unused after init.
@@ -2124,6 +2203,9 @@
 		startup_child(argv);
 	}
 
+	if (sem_init(&wait_sem, 0, 0) == -1)
+		perror_msg_and_die("Unable to initialize signal wait sema");
+
 	set_sighandler(SIGTTOU, SIG_IGN, NULL);
 	set_sighandler(SIGTTIN, SIG_IGN, NULL);
 	if (opt_intr != INTR_ANYWHERE) {
@@ -2150,6 +2232,7 @@
 	if (nprocs != 0 || daemonized_tracer)
 		startup_attach();
 
+	pthread_create(&wait_thread, NULL, child_sighandler, NULL);
 	/* Do we want pids printed in our -o OUTFILE?
 	 * -ff: no (every pid has its own file); or
 	 * -f: yes (there can be more pids in the future); or
@@ -2607,10 +2690,28 @@
 	 * then the system call will be interrupted and
 	 * the expiration will be handled by the signal handler.
 	 */
-	int status;
+	int status = 0;
 	struct rusage ru;
-	int pid = wait4(-1, &status, __WALL, (cflag ? &ru : NULL));
-	int wait_errno = errno;
+	int pid = 0;
+	int wait_errno = 0;
+	while (sem_wait(&wait_sem) == -1 && errno == EINTR)
+		;
+
+	if (in_idx == out_idx)
+		error_msg_and_die("wait queue error");
+	pid = wait_pid[out_idx];
+	status = wait_status[out_idx];
+	ru = wait_rusage[out_idx];
+	if (pid > 0 && WIFSTOPPED(status) && (status >> 16) == PTRACE_EVENT_EXIT) {
+		int i = ptrace(PTRACE_SYSCALL, pid, 0L, 0L);
+		fprintf(stderr, "ptrace(PTRACE_SYSCALL, %d, 0L, 0L)=%d errno=%d\n", pid, i, errno);
+	}
+	out_idx++;
+	if (pid < 0) {
+		wait_errno = -pid;
+		out_idx--;
+		sem_post(&wait_sem);
+	}
 
 	/*
 	 * The window of opportunity to handle expirations
@@ -2791,8 +2892,25 @@
 			break;
 
 next_event_wait_next:
-		pid = wait4(-1, &status, __WALL | WNOHANG, (cflag ? &ru : NULL));
-		wait_errno = errno;
+		pid = 0;
+		if (in_idx != out_idx) {
+			while (sem_wait(&wait_sem) == -1 && errno == EINTR)
+				;
+
+			pid = wait_pid[out_idx];
+			status = wait_status[out_idx];
+			ru = wait_rusage[out_idx];
+			if (pid > 0 && WIFSTOPPED(status) && (status >> 16) == PTRACE_EVENT_EXIT) {
+				int i = ptrace(PTRACE_SYSCALL, pid, 0L, 0L);
+				fprintf(stderr, "ptrace(PTRACE_SYSCALL, %d, 0L, 0L)=%d errno=%d\n", pid, i, errno);
+			}
+			out_idx++;
+			if (pid < 0) {
+				wait_errno = -pid;
+				out_idx--;
+				sem_post(&wait_sem);
+			}
+		}
 		wait_nohang = true;
 	}
 
@@ -3019,7 +3137,8 @@
 
 	case TE_STOP_BEFORE_EXIT:
 		print_event_exit(current_tcp);
-		break;
+		//droptcb(current_tcp);
+		return true;
 	}
 
 	/* We handled quick cases, we are permitted to interrupt now. */
@@ -3138,9 +3257,10 @@
 	if (shared_log != stderr)
 		fclose(shared_log);
 	if (popen_pid) {
-		while (waitpid(popen_pid, NULL, 0) < 0 && errno == EINTR)
+		while (my_waitpid(popen_pid, NULL, 0) < 0 && errno == EINTR)
 			;
 	}
+	pthread_join(wait_thread, NULL);
 	if (sig) {
 		exit_code = 0x100 | sig;
 	}


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 15:58   ` Christian Brauner
@ 2020-03-01 17:46     ` Bernd Edlinger
  2020-03-01 18:20       ` Christian Brauner
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-01 17:46 UTC (permalink / raw)
  To: Christian Brauner, Aleksa Sarai, Oleg Nesterov
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Frederic Weisbecker,
	Andrei Vagin, Ingo Molnar, Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Jann Horn, James Morris,
	Kees Cook, Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm

On 3/1/20 4:58 PM, Christian Brauner wrote:
> On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
>> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>> This fixes a deadlock in the tracer when tracing a multi-threaded
>>> application that calls execve while more than one thread are running.
>>>
>>> I observed that when running strace on the gcc test suite, it always
>>> blocks after a while, when expect calls execve, because other threads
>>> have to be terminated.  They send ptrace events, but the strace is no
>>> longer able to respond, since it is blocked in vm_access.
>>>
>>> The deadlock is always happening when strace needs to access the
>>> tracees process mmap, while another thread in the tracee starts to
>>> execve a child process, but that cannot continue until the
>>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>>
>>> strace          D    0 30614  30584 0x00000000
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> schedule_preempt_disabled+0x15/0x20
>>> __mutex_lock.isra.13+0x1ec/0x520
>>> __mutex_lock_killable_slowpath+0x13/0x20
>>> mutex_lock_killable+0x28/0x30
>>> mm_access+0x27/0xa0
>>> process_vm_rw_core.isra.3+0xff/0x550
>>> process_vm_rw+0xdd/0xf0
>>> __x64_sys_process_vm_readv+0x31/0x40
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> expect          D    0 31933  30876 0x80004003
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> flush_old_exec+0xc4/0x770
>>> load_elf_binary+0x35a/0x16c0
>>> search_binary_handler+0x97/0x1d0
>>> __do_execve_file.isra.40+0x5d4/0x8a0
>>> __x64_sys_execve+0x49/0x60
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> The proposed solution is to have a second mutex that is
>>> used in mm_access, so it is allowed to continue while the
>>> dying threads are not yet terminated.
>>>
>>> I also took the opportunity to improve the documentation
>>> of prepare_creds, which is obviously out of sync.
>>>
>>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>>
>> I can't comment on the validity of the patch, but I also found and
>> reported this issue in 2016[1] and the discussion quickly veered into
>> the problem being more complicated (and uglier) than it seems at first
>> glance.
>>
>> You should probably also Cc stable, given this has been a long-standing
>> issue and your patch doesn't look (too) invasive.
>>
>> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> 
> Yeah, I remember you mentioning this a while back.
> 
> Bernd, we really want a reproducer for this sent alongside with this
> patch added to:
> tools/testing/selftests/ptrace/
> Having a test for this bug irrespective of whether or not we go with
> this as fix seems really worth it.
> 

I ran into this issue because I wanted to fix a problem in the gcc testsuite,
namely why it forgets to remove some temp files,
so I did the following:

strace -ftt -o trace.txt make check-gcc-c -k -j4

I reproduced this with v4.20 and v5.5 kernels. I don't know why, but it does
not happen on all systems I tested. Maybe it is something that the expect program
does, because whenever I reproduce this, the deadlock is in "expect".

I use expect version 5.45 on the computer where the above test freezes after
a couple of minutes.

I think the issue with strace is that it ends up in mm_access (via
process_vm_readv) to get the parameters of a syscall that is in progress in one
thread, and that races with another thread that calls execve and holds the
cred_guard_mutex.

Oleg's test case here will certainly not be fixed by this patch:

https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/

However, he mentions access to "anything else which needs ->cred_guard_mutex,
say open(/proc/$pid/mem)". I don't know for sure how that can be done, but if
it is possible, it would probably work as a test case.

What do you think?


Bernd.


> Oleg seems to have suggested that a potential alternative fix is to wait
> in de_thread() until all other threads in the thread-group have passed
> exit_notify(). Right now we only kill them but don't wait. Currently
> de_thread() only waits for the thread-group leader to pass exit_notify()
> whenever a non-thread-group leader thread execs (because the exec'ing
> thread becomes the new thread-group leader with the same pid as the
> former thread-group leader).
> 
> Christian
> 


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 17:46     ` Bernd Edlinger
@ 2020-03-01 18:20       ` Christian Brauner
  0 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-01 18:20 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Aleksa Sarai, Oleg Nesterov, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Eric W. Biederman,
	Thomas Gleixner, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Jann Horn, James Morris,
	Kees Cook, Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm

On Sun, Mar 01, 2020 at 05:46:08PM +0000, Bernd Edlinger wrote:
> On 3/1/20 4:58 PM, Christian Brauner wrote:
> > On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
> >> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>> This fixes a deadlock in the tracer when tracing a multi-threaded
> >>> application that calls execve while more than one thread are running.
> >>>
> >>> I observed that when running strace on the gcc test suite, it always
> >>> blocks after a while, when expect calls execve, because other threads
> >>> have to be terminated.  They send ptrace events, but the strace is no
> >>> longer able to respond, since it is blocked in vm_access.
> >>>
> >>> The deadlock is always happening when strace needs to access the
> >>> tracees process mmap, while another thread in the tracee starts to
> >>> execve a child process, but that cannot continue until the
> >>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> >>>
> >>> strace          D    0 30614  30584 0x00000000
> >>> Call Trace:
> >>> __schedule+0x3ce/0x6e0
> >>> schedule+0x5c/0xd0
> >>> schedule_preempt_disabled+0x15/0x20
> >>> __mutex_lock.isra.13+0x1ec/0x520
> >>> __mutex_lock_killable_slowpath+0x13/0x20
> >>> mutex_lock_killable+0x28/0x30
> >>> mm_access+0x27/0xa0
> >>> process_vm_rw_core.isra.3+0xff/0x550
> >>> process_vm_rw+0xdd/0xf0
> >>> __x64_sys_process_vm_readv+0x31/0x40
> >>> do_syscall_64+0x64/0x220
> >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>
> >>> expect          D    0 31933  30876 0x80004003
> >>> Call Trace:
> >>> __schedule+0x3ce/0x6e0
> >>> schedule+0x5c/0xd0
> >>> flush_old_exec+0xc4/0x770
> >>> load_elf_binary+0x35a/0x16c0
> >>> search_binary_handler+0x97/0x1d0
> >>> __do_execve_file.isra.40+0x5d4/0x8a0
> >>> __x64_sys_execve+0x49/0x60
> >>> do_syscall_64+0x64/0x220
> >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>
> >>> The proposed solution is to have a second mutex that is
> >>> used in mm_access, so it is allowed to continue while the
> >>> dying threads are not yet terminated.
> >>>
> >>> I also took the opportunity to improve the documentation
> >>> of prepare_creds, which is obviously out of sync.
> >>>
> >>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> >>
> >> I can't comment on the validity of the patch, but I also found and
> >> reported this issue in 2016[1] and the discussion quickly veered into
> >> the problem being more complicated (and uglier) than it seems at first
> >> glance.
> >>
> >> You should probably also Cc stable, given this has been a long-standing
> >> issue and your patch doesn't look (too) invasive.
> >>
> >> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> > 
> > Yeah, I remember you mentioning this a while back.
> > 
> > Bernd, we really want a reproducer for this sent alongside with this
> > patch added to:
> > tools/testing/selftests/ptrace/
> > Having a test for this bug irrespective of whether or not we go with
> > this as fix seems really worth it.
> > 
> 
> I ran into this issue, because I wanted to fix an issue in the gcc testsuite,
> namely why it forgets to remove some temp files,
> so I did the following:
> 
> strace -ftt -o trace.txt make check-gcc-c -k -j4
> 
> I reproduced with v4.20 and v5.5 kernel, and I don't know why but it is
> not happening on all systems I tested, maybe it is something that the expect program
> does, because, always when I try to reproduce this, the deadlock was always in "expect".
> 
> I use expect version 5.45 on the computer where the above test freezes after
> a couple of minutes.
> 
> I think the issue with strace is that it is using vm_access to get the parameters
> of a syscall that is going on in one thread, and that races with another thread
> that calls execve, and blocks the cred_guard_mutex.
> 
> While Olg's test case here, will certainly not be fixed:
> 
> https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
> 
> he mentions the access to "anything else which needs ->cred_guard_mutex,
> say open(/proc/$pid/mem)", I don't know for sure how that can be done, but if
> that is possible, it would probably work as a test case.
> 
> What do you think?

Yeah, anything that calls ptrace_may_access() is fine and
open(/proc/$pid/mem) will work so long as $pid is not in the same
thread-group as the caller. A polished version of the reproducer you
linked in would probably be good.
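
For illustration, a minimal sketch of the kind of access discussed here:
reading another process's memory through /proc/$pid/mem. The helper name
read_proc_mem is made up for this example and is not part of the patch or of
any library, and it assumes a 64-bit off_t so that user addresses fit in the
file offset.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name, not from the patch): copy len
 * bytes starting at address addr in process pid into buf by reading
 * /proc/<pid>/mem.  Returns the number of bytes read, or -1 on error.
 */
ssize_t read_proc_mem(pid_t pid, const void *addr, void *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);

	/* The access check against the target is done when the file is opened. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The file offset is interpreted as a virtual address in the target. */
	n = pread(fd, buf, len, (off_t)(uintptr_t)addr);
	close(fd);
	return n;
}
```

Pointed at the caller's own pid this simply copies the bytes. Pointed at a
process in the state described above, the open() is the call that would be
expected to block on an unpatched kernel.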


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 11:27 [PATCH] exec: Fix a deadlock in ptrace Bernd Edlinger
  2020-03-01 15:13 ` Aleksa Sarai
@ 2020-03-01 18:21 ` Jann Horn
  2020-03-01 18:52   ` Christian Brauner
  1 sibling, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-01 18:21 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Christian Brauner,
	Jason Gunthorpe, Christian Kellner, Andrea Arcangeli,
	Aleksa Sarai, Dmitry V. Levin, linux-doc, linux-kernel,
	linux-fsdevel, linux-mm

On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.

Just for context: When I proposed something similar back in 2016,
https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
was the resulting discussion thread. At least back then, I looked
through the various existing users of cred_guard_mutex, and the only
places that couldn't be converted to the new second mutex were
PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.


The ideal solution would IMO be something like this: Decide what the
new task's credentials should be *before* reaching de_thread(),
install them into a second cred* on the task (together with the new
dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
check against both. After that, some further restructuring might even
allow the cred_guard_mutex to not be held across all of the VFS
operations that happen early on in execve, which may block
indefinitely. But that would be pretty complicated, so I think your
proposed solution makes sense for now, given that nobody has managed
to implement anything better in the last few years.


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 18:21 ` Jann Horn
@ 2020-03-01 18:52   ` Christian Brauner
  2020-03-01 19:00     ` Bernd Edlinger
  2020-03-01 20:00     ` Jann Horn
  0 siblings, 2 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-01 18:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Alexey Dobriyan, Eric W. Biederman, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm

On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
> > The proposed solution is to have a second mutex that is
> > used in mm_access, so it is allowed to continue while the
> > dying threads are not yet terminated.
> 
> Just for context: When I proposed something similar back in 2016,
> https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> was the resulting discussion thread. At least back then, I looked
> through the various existing users of cred_guard_mutex, and the only
> places that couldn't be converted to the new second mutex were
> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> 
> 
> The ideal solution would IMO be something like this: Decide what the
> new task's credentials should be *before* reaching de_thread(),
> install them into a second cred* on the task (together with the new
> dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> check against both. After that, some further restructuring might even

Hm, so essentially a private ptrace_access_cred member in task_struct?
That would presumably also involve altering various LSM hooks to look at
ptrace_access_cred.

(Minor side-note: de_thread() takes a struct task_struct argument but
 is only ever passed current.)

> allow the cred_guard_mutex to not be held across all of the VFS
> operations that happen early on in execve, which may block
> indefinitely. But that would be pretty complicated, so I think your
> proposed solution makes sense for now, given that nobody has managed
> to implement anything better in the last few years.

Reading through the old threads and how often this issue came up, I tend
to agree.


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 18:52   ` Christian Brauner
@ 2020-03-01 19:00     ` Bernd Edlinger
  2020-03-01 20:00     ` Jann Horn
  1 sibling, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-01 19:00 UTC (permalink / raw)
  To: Christian Brauner, Jann Horn
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm

On 3/1/20 7:52 PM, Christian Brauner wrote:
> On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
>> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
>> <bernd.edlinger@hotmail.de> wrote:
>>> The proposed solution is to have a second mutex that is
>>> used in mm_access, so it is allowed to continue while the
>>> dying threads are not yet terminated.
>>
>> Just for context: When I proposed something similar back in 2016,
>> https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>> was the resulting discussion thread. At least back then, I looked
>> through the various existing users of cred_guard_mutex, and the only
>> places that couldn't be converted to the new second mutex were
>> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
>>
>>
>> The ideal solution would IMO be something like this: Decide what the
>> new task's credentials should be *before* reaching de_thread(),
>> install them into a second cred* on the task (together with the new
>> dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
>> check against both. After that, some further restructuring might even
> 
> Hm, so essentially a private ptrace_access_cred member in task_struct?
> That would presumably also involve altering various LSM hooks to look at
> ptrace_access_cred.
> 
> (Minor side-note, de_thread() takes a struct task_struct argument but
>  only ever is passed current.)
> 
>> allow the cred_guard_mutex to not be held across all of the VFS
>> operations that happen early on in execve, which may block
>> indefinitely. But that would be pretty complicated, so I think your
>> proposed solution makes sense for now, given that nobody has managed
>> to implement anything better in the last few years.
> 
> Reading through the old threads and how often this issue came up, I tend
> to agree.
> 

Okay, fine.

I managed to change Oleg's test case into one that shows exactly what
this patch changes:


$ cat t.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <signal.h>
#include <sys/ptrace.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int f, pid = fork();
	char mm[64];

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	sleep(1);
	sprintf(mm, "/proc/%d/mem", pid);
	printf("open(%s)\n", mm);
	f = open(mm, O_RDONLY);
	printf("f = %d\n", f);
	// this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	if (f >= 0)
		close(f);
	return 0;
}
$ gcc -pthread -Wall t.c
$ ./a.out 
open(/proc/2802/mem)
f = 3
$ passed

Previously this blocked. How can I turn this into a test case?
I am not very experienced in this area.


Thanks
Bernd.


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 18:52   ` Christian Brauner
  2020-03-01 19:00     ` Bernd Edlinger
@ 2020-03-01 20:00     ` Jann Horn
  2020-03-01 20:34       ` [PATCHv2] " Bernd Edlinger
  2020-03-02  7:47       ` [PATCH] " Christian Brauner
  1 sibling, 2 replies; 203+ messages in thread
From: Jann Horn @ 2020-03-01 20:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Bernd Edlinger, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Alexey Dobriyan, Eric W. Biederman, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm

On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > <bernd.edlinger@hotmail.de> wrote:
> > > The proposed solution is to have a second mutex that is
> > > used in mm_access, so it is allowed to continue while the
> > > dying threads are not yet terminated.
> >
> > Just for context: When I proposed something similar back in 2016,
> > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> > was the resulting discussion thread. At least back then, I looked
> > through the various existing users of cred_guard_mutex, and the only
> > places that couldn't be converted to the new second mutex were
> > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> >
> >
> > The ideal solution would IMO be something like this: Decide what the
> > new task's credentials should be *before* reaching de_thread(),
> > install them into a second cred* on the task (together with the new
> > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > check against both. After that, some further restructuring might even
>
> Hm, so essentially a private ptrace_access_cred member in task_struct?

And a second dumpability field, because that changes together with the
creds during execve. (Btw, currently the dumpability is in the
mm_struct, but that's kinda wrong. The mm_struct is removed from a
task on exit while access checks can still be performed against it, and
currently ptrace_may_access() just lets the access go through in that
case, which weakens the protection offered by PR_SET_DUMPABLE when
used for security purposes. I think it ought to be moved over into the
task_struct.)

> That would presumably also involve altering various LSM hooks to look at
> ptrace_access_cred.

When I tried to implement this in the past, I changed the LSM hook to
take the target task's cred* as an argument, and then called the LSM
hook twice from ptrace_may_access(). IIRC having the target task's
creds as an argument works for almost all the LSMs, with the exception
of Yama, which doesn't really care about the target task's creds, so
you have to pass in both the task_struct* and the cred*.
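
To make the idea concrete, here is a rough userspace model of "check against
both sets of credentials". It is only a sketch: every name in it (toy_cred,
toy_task, toy_access_ok, toy_ptrace_may_access) is invented for this example,
the dumpable flag is stored next to the creds as suggested above, and the real
ptrace_may_access() also deals with capabilities, namespaces and the LSMs,
none of which is modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these names exist in the kernel. */
struct toy_cred {
	unsigned int uid;
	bool dumpable;	/* kept next to the creds in this model */
};

struct toy_task {
	struct toy_cred cred;			/* credentials in effect now */
	const struct toy_cred *exec_cred;	/* non-NULL while an execve is in flight */
};

/* One "hook" call: may tracer access a target that has these creds? */
static bool toy_access_ok(const struct toy_cred *tracer,
			  const struct toy_cred *target)
{
	if (tracer->uid == 0)
		return true;
	return tracer->uid == target->uid && target->dumpable;
}

/* Call the hook once per credential set; access needs both to agree. */
bool toy_ptrace_may_access(const struct toy_cred *tracer,
			   const struct toy_task *task)
{
	if (!toy_access_ok(tracer, &task->cred))
		return false;
	if (task->exec_cred && !toy_access_ok(tracer, task->exec_cred))
		return false;
	return true;
}
```

In this model an unprivileged tracer that may access the task today is refused
as soon as the pending credentials (say, from exec'ing a setuid binary) would
not allow it, without the check having to wait for the execve to finish.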


* [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-01 20:00     ` Jann Horn
@ 2020-03-01 20:34       ` Bernd Edlinger
  2020-03-02  6:38         ` Eric W. Biederman
  2020-03-02 12:28         ` [PATCHv2] " Oleg Nesterov
  2020-03-02  7:47       ` [PATCH] " Christian Brauner
  1 sibling, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-01 20:34 UTC (permalink / raw)
  To: Jann Horn, Christian Brauner
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, Aleksa Sarai, stable

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to have a second mutex that is
used in mm_access, so it is allowed to continue while the
dying threads are not yet terminated.

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 18 ++++++------
 fs/exec.c                                 |  9 ++++++
 include/linux/binfmts.h                   |  6 +++-
 include/linux/sched/signal.h              |  1 +
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  2 +-
 kernel/fork.c                             |  5 ++--
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +--
 tools/testing/selftests/ptrace/vmaccess.c | 46 +++++++++++++++++++++++++++++++
 10 files changed, 79 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.


diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..c98e0a8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,13 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and the mutex
+current->signal->cred_change_mutex is acquired later, while the credentials
+and the process mmap are actually changed.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +470,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +489,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a6884e4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_change_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_flush_old_exec)
+			mutex_unlock(&current->signal->cred_change_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_change_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_change_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..37eeabe 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	struct mutex cred_change_mutex; /* guard against credentials change */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cd9a0f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0395154 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_killable(&task->signal->cred_change_mutex);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 		mmput(mm);
 		mm = ERR_PTR(-EACCES);
 	}
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_change_mutex);
 
 	return mm;
 }
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_change_mutex);
 
 	return 0;
 }
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..ef08c9f
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0, 0);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f)
+		close(f);
+	/* this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0); */
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
-- 
1.9.1
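
The locking change in the patch above can be illustrated with a small
single-threaded state model.  This is only a sketch under stated
assumptions: struct signal_model, the boolean "held" flags, and every
function name below are hypothetical stand-ins, not kernel code.  It
shows only that a reader of the second mutex is not blocked while the
exec'ing thread holds the first one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two mutexes in signal_struct. */
struct signal_model {
	bool cred_guard_held;	/* models cred_guard_mutex */
	bool cred_change_held;	/* models cred_change_mutex */
};

/* Try-lock on a modelled mutex: true if it was free and is now held. */
static bool model_trylock(bool *held)
{
	if (*held)
		return false;
	*held = true;
	return true;
}

static void model_unlock(bool *held)
{
	assert(*held);
	*held = false;
}

/* execve has taken the guard mutex and is waiting in de_thread(). */
void exec_enter_de_thread(struct signal_model *sig)
{
	bool ok = model_trylock(&sig->cred_guard_held);

	assert(ok);
	(void)ok;
}

/* flush_old_exec() after de_thread(): take the second mutex as well. */
void exec_begin_cred_change(struct signal_model *sig)
{
	bool ok = model_trylock(&sig->cred_change_held);

	assert(ok);
	(void)ok;
}

/* install_exec_creds(): release both, in the order used by the patch. */
void exec_finish(struct signal_model *sig)
{
	model_unlock(&sig->cred_change_held);
	model_unlock(&sig->cred_guard_held);
}

/* mm_access() before the patch: needs the guard mutex. */
bool mm_access_before_patch(struct signal_model *sig)
{
	if (!model_trylock(&sig->cred_guard_held))
		return false;	/* the real code would sleep here */
	model_unlock(&sig->cred_guard_held);
	return true;
}

/* mm_access() after the patch: needs only the change mutex. */
bool mm_access_after_patch(struct signal_model *sig)
{
	if (!model_trylock(&sig->cred_change_held))
		return false;	/* the real code would sleep here */
	model_unlock(&sig->cred_change_held);
	return true;
}
```

In this model the window between exec_enter_de_thread() and
exec_begin_cred_change() is the one in which the tracer used to block.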


* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-01 20:34       ` [PATCHv2] " Bernd Edlinger
@ 2020-03-02  6:38         ` Eric W. Biederman
  2020-03-02 15:43           ` Bernd Edlinger
  2020-03-02 12:28         ` [PATCHv2] " Oleg Nesterov
  1 sibling, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-02  6:38 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

I think your patch works, but I don't think another mutex is necessary
to solve your case.  Possibly it is justified, but I hesitate to
introduce yet another concept in the code.

Having read elsewhere in the thread that this does not solve the problem
Oleg has mentioned, I am really hesitant to add more complexity to the
situation.


For your case there is a straightforward and local workaround.

When the current task is ptracing the target task, don't bother with
cred_guard_mutex and ptrace_may_access in mm_access, as those tests
have already passed.  Instead just confirm the ptrace status, AKA
the permission check in ptrace_access_vm.

I think something like this is all we need.

diff --git a/kernel/fork.c b/kernel/fork.c
index cee89229606a..b0ab98c84589 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,6 +1224,16 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
+	if (task->ptrace && (current == task->parent)) {
+		mm = get_task_mm(task);
+		if ((get_dumpable(mm) != SUID_DUMP_USER) &&
+		    !ptracer_capable(task, mm->user_ns)) {
+			mmput(mm);
+			mm = ERR_PTR(-EACCESS);
+		}
+		return mm;
+	}
+
 	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
 	if (err)
 		return ERR_PTR(err);

Does this solve your test case?

The patch above lacks the appropriate locking for the ptrace-attached
check (tasklist_lock, I think).  But it is enough to illustrate the idea,
and it is probably a check we want in any event, so that if the tracer
starts dropping privileges, process_vm_readv and process_vm_writev will
still be usable by the tracer.

Eric
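
The branch condition in the mm_access hunk above can be modelled in
userspace as follows.  This is a sketch only: struct task_model and
mm_access_path_for() are hypothetical names, the dumpable and
capability checks are left out, and so is the locking that the real
check would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields the hunk reads. */
struct task_model {
	bool ptraced;			  /* models task->ptrace != 0 */
	const struct task_model *parent;  /* models task->parent */
};

enum mm_access_path {
	MM_ACCESS_TRACER_SHORTCUT,	/* skip cred_guard_mutex */
	MM_ACCESS_GUARD_MUTEX,		/* take cred_guard_mutex as before */
};

/* Decide which path a caller would take for a given target task. */
enum mm_access_path mm_access_path_for(const struct task_model *caller,
				       const struct task_model *target)
{
	assert(caller != NULL);
	assert(target != NULL);

	if (target->ptraced && target->parent == caller)
		return MM_ACCESS_TRACER_SHORTCUT;
	return MM_ACCESS_GUARD_MUTEX;
}
```

Only the tracer itself takes the shortcut; any other caller still goes
through cred_guard_mutex.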


> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.
>
> I also took the opportunity to improve the documentation
> of prepare_creds, which is obviously out of sync.
>
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  Documentation/security/credentials.rst    | 18 ++++++------
>  fs/exec.c                                 |  9 ++++++
>  include/linux/binfmts.h                   |  6 +++-
>  include/linux/sched/signal.h              |  1 +
>  init/init_task.c                          |  1 +
>  kernel/cred.c                             |  2 +-
>  kernel/fork.c                             |  5 ++--
>  mm/process_vm_access.c                    |  2 +-
>  tools/testing/selftests/ptrace/Makefile   |  4 +--
>  tools/testing/selftests/ptrace/vmaccess.c | 46 +++++++++++++++++++++++++++++++
>  10 files changed, 79 insertions(+), 15 deletions(-)
>  create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
>
> v2: adds a test case which passes when this patch is applied.
>
>
> diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
> index 282e79f..c98e0a8 100644
> --- a/Documentation/security/credentials.rst
> +++ b/Documentation/security/credentials.rst
> @@ -437,9 +437,13 @@ new set of credentials by calling::
>  
>  	struct cred *prepare_creds(void);
>  
> -this locks current->cred_replace_mutex and then allocates and constructs a
> -duplicate of the current process's credentials, returning with the mutex still
> -held if successful.  It returns NULL if not successful (out of memory).
> +this allocates and constructs a duplicate of the current process's credentials.
> +It returns NULL if not successful (out of memory).
> +
> +If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
> +is acquired before this function gets called, and the mutex
> +current->signal->cred_change_mutex is acquired later, while the credentials
> +and the process mmap are actually changed.
>  
>  The mutex prevents ``ptrace()`` from altering the ptrace state of a process
>  while security checks on credentials construction and changing is taking place
> @@ -466,9 +470,8 @@ by calling::
>  
>  This will alter various aspects of the credentials and the process, giving the
>  LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
> -actually commit the new credentials to ``current->cred``, it will release
> -``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
> -will notify the scheduler and others of the changes.
> +actually commit the new credentials to ``current->cred``, and it will notify
> +the scheduler and others of the changes.
>  
>  This function is guaranteed to return 0, so that it can be tail-called at the
>  end of such functions as ``sys_setresuid()``.
> @@ -486,8 +489,7 @@ invoked::
>  
>  	void abort_creds(struct cred *new);
>  
> -This releases the lock on ``current->cred_replace_mutex`` that
> -``prepare_creds()`` got and then releases the new credentials.
> +This releases the new credentials.
>  
>  
>  A typical credentials alteration function would look something like this::
> diff --git a/fs/exec.c b/fs/exec.c
> index 74d88da..a6884e4 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> +	retval = mutex_lock_killable(&current->signal->cred_change_mutex);
> +	if (retval)
> +		goto out;
> +
> +	bprm->called_flush_old_exec = 1;
> +
>  	/*
>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>  	 * not visibile until then. This also enables the update
> @@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> +		if (bprm->called_flush_old_exec)
> +			mutex_unlock(&current->signal->cred_change_mutex);
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> @@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>  	 * credentials; any time after this it may be unlocked.
>  	 */
>  	security_bprm_committed_creds(bprm);
> +	mutex_unlock(&current->signal->cred_change_mutex);
>  	mutex_unlock(&current->signal->cred_guard_mutex);
>  }
>  EXPORT_SYMBOL(install_exec_creds);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b40fc63..2e1318b 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,11 @@ struct linux_binprm {
>  		 * exec has happened. Used to sanitize execution environment
>  		 * and to set AT_SECURE auxv for glibc.
>  		 */
> -		secureexec:1;
> +		secureexec:1,
> +		/*
> +		 * Set by flush_old_exec, when the cred_change_mutex is taken.
> +		 */
> +		called_flush_old_exec:1;
>  #ifdef __alpha__
>  	unsigned int taso:1;
>  #endif
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 8805025..37eeabe 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -225,6 +225,7 @@ struct signal_struct {
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
>  					 * (notably. ptrace) */
> +	struct mutex cred_change_mutex; /* guard against credentials change */
>  } __randomize_layout;
>  
>  /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5..6cd9a0f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@
>  	.multiprocess	= HLIST_HEAD_INIT,
>  	.rlim		= INIT_RLIMITS,
>  	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> +	.cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
>  #ifdef CONFIG_POSIX_TIMERS
>  	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>  	.cputimer	= {
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 809a985..e4c78de 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -676,7 +676,7 @@ void __init cred_init(void)
>   *
>   * Returns the new credentials or NULL if out of memory.
>   *
> - * Does not take, and does not return holding current->cred_replace_mutex.
> + * Does not take, and does not return holding ->cred_guard_mutex.
>   */
>  struct cred *prepare_kernel_cred(struct task_struct *daemon)
>  {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0808095..0395154 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	struct mm_struct *mm;
>  	int err;
>  
> -	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	err =  mutex_lock_killable(&task->signal->cred_change_mutex);
>  	if (err)
>  		return ERR_PTR(err);
>  
> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  		mmput(mm);
>  		mm = ERR_PTR(-EACCES);
>  	}
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->cred_change_mutex);
>  
>  	return mm;
>  }
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>  
>  	mutex_init(&sig->cred_guard_mutex);
> +	mutex_init(&sig->cred_change_mutex);
>  
>  	return 0;
>  }
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7b..b3e6eb5 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
>  	if (!mm || IS_ERR(mm)) {
>  		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>  		/*
> -		 * Explicitly map EACCES to EPERM as EPERM is a more a
> +		 * Explicitly map EACCES to EPERM as EPERM is a more
>  		 * appropriate error code for process_vw_readv/writev
>  		 */
>  		if (rc == -EACCES)
> diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
> index c0b7f89..2f1f532 100644
> --- a/tools/testing/selftests/ptrace/Makefile
> +++ b/tools/testing/selftests/ptrace/Makefile
> @@ -1,6 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -iquote../../../../include/uapi -Wall
> +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
>  
> -TEST_GEN_PROGS := get_syscall_info peeksiginfo
> +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> new file mode 100644
> index 0000000..ef08c9f
> --- /dev/null
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
> + * All rights reserved.
> + *
> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
> + * when de_thread is blocked with ->cred_guard_mutex held.
> + */
> +
> +#include "../kselftest_harness.h"
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <unistd.h>
> +#include <sys/ptrace.h>
> +
> +static void *thread(void *arg)
> +{
> +	ptrace(PTRACE_TRACEME, 0, 0, 0);
> +	return NULL;
> +}
> +
> +TEST(vmaccess)
> +{
> +	int f, pid = fork();
> +	char mm[64];
> +
> +	if (!pid) {
> +		pthread_t pt;
> +		pthread_create(&pt, NULL, thread, NULL);
> +		pthread_join(pt, NULL);
> +		execlp("true", "true", NULL);
> +	}
> +
> +	sleep(1);
> +	sprintf(mm, "/proc/%d/mem", pid);
> +	f = open(mm, O_RDONLY);
> +	ASSERT_LE(0, f)
> +		close(f);
> +	/* this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0); */
> +	f = kill(pid, SIGCONT);
> +	ASSERT_EQ(0, f);
> +}
> +
> +TEST_HARNESS_MAIN


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-01 20:00     ` Jann Horn
  2020-03-01 20:34       ` [PATCHv2] " Bernd Edlinger
@ 2020-03-02  7:47       ` Christian Brauner
  2020-03-02  7:48         ` Christian Brauner
  1 sibling, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-02  7:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Alexey Dobriyan, Eric W. Biederman, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm

On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote:
> On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > > <bernd.edlinger@hotmail.de> wrote:
> > > > The proposed solution is to have a second mutex that is
> > > > used in mm_access, so it is allowed to continue while the
> > > > dying threads are not yet terminated.
> > >
> > > Just for context: When I proposed something similar back in 2016,
> > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> > > was the resulting discussion thread. At least back then, I looked
> > > through the various existing users of cred_guard_mutex, and the only
> > > places that couldn't be converted to the new second mutex were
> > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> > >
> > >
> > > The ideal solution would IMO be something like this: Decide what the
> > > new task's credentials should be *before* reaching de_thread(),
> > > install them into a second cred* on the task (together with the new
> > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > > check against both. After that, some further restructuring might even
> >
> > Hm, so essentially a private ptrace_access_cred member in task_struct?
> 
> And a second dumpability field, because that changes together with the
> creds during execve. (Btw, currently the dumpability is in the
> mm_struct, but that's kinda wrong. The mm_struct is removed from a
> task on exit while access checks can still be performed against it, and
> currently ptrace_may_access() just lets the access go through in that
> case, which weakens the protection offered by PR_SET_DUMPABLE when
> used for security purposes. I think it ought to be moved over into the
> task_struct.)
> 
> > That would presumably also involve altering various LSM hooks to look at
> > ptrace_access_cred.
> 
> When I tried to implement this in the past, I changed the LSM hook to
> take the target task's cred* as an argument, and then called the LSM
> hook twice from ptrace_may_access(). IIRC having the target task's
> creds as an argument works for almost all the LSMs, with the exception
> of Yama, which doesn't really care about the target task's creds, so
> you have to pass in both the task_struct* and the cred*.

It seems we should try PoCing this.

Christian
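
The "check against both" idea quoted above can be sketched as a
userspace model.  Everything here is an assumption made for
illustration: struct cred_model, struct exec_task_model, the exec_cred
field, and the reduction of the access check to a uid comparison plus
a dumpable flag.  The real ptrace_may_access() also involves
capabilities, user namespaces, and LSM hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced credentials: one uid plus dumpability. */
struct cred_model {
	unsigned int uid;
	bool dumpable;
};

/* Hypothetical task: current creds plus the creds a pending execve
 * will install (NULL when no execve is in flight). */
struct exec_task_model {
	const struct cred_model *cred;
	const struct cred_model *exec_cred;
};

/* Reduced access check against a single set of credentials. */
static bool may_access_one(unsigned int caller_uid,
			   const struct cred_model *target)
{
	return target->dumpable && caller_uid == target->uid;
}

/* Allow access only if it passes against the current credentials and,
 * when an execve is in flight, against the pending ones too. */
bool may_access_both(unsigned int caller_uid,
		     const struct exec_task_model *task)
{
	assert(task != NULL);
	assert(task->cred != NULL);

	if (!may_access_one(caller_uid, task->cred))
		return false;
	if (task->exec_cred && !may_access_one(caller_uid, task->exec_cred))
		return false;
	return true;
}
```

With a check of this shape, a caller that passes against the old
credentials is still refused while an execve of a more privileged
binary is pending, without having to wait for a mutex.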


* Re: [PATCH] exec: Fix a deadlock in ptrace
  2020-03-02  7:47       ` [PATCH] " Christian Brauner
@ 2020-03-02  7:48         ` Christian Brauner
  0 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-02  7:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Alexey Dobriyan, Eric W. Biederman, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm

On Mon, Mar 02, 2020 at 08:47:53AM +0100, Christian Brauner wrote:
> On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote:
> > On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > > > <bernd.edlinger@hotmail.de> wrote:
> > > > > The proposed solution is to have a second mutex that is
> > > > > used in mm_access, so it is allowed to continue while the
> > > > > dying threads are not yet terminated.
> > > >
> > > > Just for context: When I proposed something similar back in 2016,
> > > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> > > > was the resulting discussion thread. At least back then, I looked
> > > > through the various existing users of cred_guard_mutex, and the only
> > > > places that couldn't be converted to the new second mutex were
> > > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> > > >
> > > >
> > > > The ideal solution would IMO be something like this: Decide what the
> > > > new task's credentials should be *before* reaching de_thread(),
> > > > install them into a second cred* on the task (together with the new
> > > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > > > check against both. After that, some further restructuring might even
> > >
> > > Hm, so essentially a private ptrace_access_cred member in task_struct?
> > 
> > And a second dumpability field, because that changes together with the
> > creds during execve. (Btw, currently the dumpability is in the
> > mm_struct, but that's kinda wrong. The mm_struct is removed from a
> > task on exit while access checks can still be performed against it, and
> > currently ptrace_may_access() just lets the access go through in that
> > case, which weakens the protection offered by PR_SET_DUMPABLE when
> > used for security purposes. I think it ought to be moved over into the
> > task_struct.)
> > 
> > > That would presumably also involve altering various LSM hooks to look at
> > > ptrace_access_cred.
> > 
> > When I tried to implement this in the past, I changed the LSM hook to
> > take the target task's cred* as an argument, and then called the LSM
> > hook twice from ptrace_may_access(). IIRC having the target task's
> > creds as an argument works for almost all the LSMs, with the exception
> > of Yama, which doesn't really care about the target task's creds, so
> > you have to pass in both the task_struct* and the cred*.
> 
> It seems we should try PoCing this.

Independent of the fix for Bernd's issue, that is.


* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-01 20:34       ` [PATCHv2] " Bernd Edlinger
  2020-03-02  6:38         ` Eric W. Biederman
@ 2020-03-02 12:28         ` Oleg Nesterov
  2020-03-02 15:56           ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Oleg Nesterov @ 2020-03-02 12:28 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Eric W. Biederman,
	Thomas Gleixner, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 03/01, Bernd Edlinger wrote:
>
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.

Heh. Yes, known problem. See my attempt to fix it:
https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/

> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	struct mm_struct *mm;
>  	int err;
>  
> -	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	err =  mutex_lock_killable(&task->signal->cred_change_mutex);

So if I understand correctly, your patch doesn't fix the other problems
with a debugger waiting for cred_guard_mutex.

I too do not think this can justify the new mutex in signal_struct...

Oleg.



* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02  6:38         ` Eric W. Biederman
@ 2020-03-02 15:43           ` Bernd Edlinger
  2020-03-02 15:57             ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 15:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 3/2/20 7:38 AM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> I think your patch works, but I don't think to solve your case another
> mutex is necessary.  Possibly it is justified, but I hesitate to
> introduce yet another concept in the code.
> 
> Having read elsewhere in the thread that this does not solve the problem
> Oleg has mentioned I am really hesitant to add more complexity to the
> situation.
> 
> 
> For your case there is a straightforward and local workaround.
> 
> When the current task is ptracing the target task don't bother with
> cred_guard_mutex and ptrace_may_access in mm_access as those tests
> have already passed.  Instead just confirm the ptrace status. AKA
> the permission check in ptrace_access_vm.
> 
> I think something like this is all we need.
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cee89229606a..b0ab98c84589 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,6 +1224,16 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	struct mm_struct *mm;
>  	int err;
>  
> +	if (task->ptrace && (current == task->parent)) {
> +		mm = get_task_mm(task);
> +		if ((get_dumpable(mm) != SUID_DUMP_USER) &&
> +		    !ptracer_capable(task, mm->user_ns)) {
> +			mmput(mm);
> +			mm = ERR_PTR(-EACCESS);
> +		}
> +		return mm;
> +	}
> +
>  	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
>  	if (err)
>  		return ERR_PTR(err);
> 
> Does this solve your test case?
> 

I tried this with s/EACCESS/EACCES/.

The test case in this patch is not fixed, but strace does not freeze,
at least with my setup where it did freeze repeatably.  That is
obviously because it bypasses the cred_guard_mutex.  But all other
processes that access this file still freeze, and cannot be
interrupted except with kill -9.

However, that smells like a denial of service: this simple test case,
which can be executed by a guest, creates a /proc/$pid/mem
that freezes any process, even root, when it looks at it.
I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.


Bernd.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 12:28         ` [PATCHv2] " Oleg Nesterov
@ 2020-03-02 15:56           ` Bernd Edlinger
  0 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 15:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Eric W. Biederman,
	Thomas Gleixner, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable



On 3/2/20 1:28 PM, Oleg Nesterov wrote:
> On 03/01, Bernd Edlinger wrote:
>>
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
> 
> Heh. Yes, known problem. See my attempt to fix it:
> https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
> 
>> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>>  	struct mm_struct *mm;
>>  	int err;
>>  
>> -	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
>> +	err =  mutex_lock_killable(&task->signal->cred_change_mutex);
> 
> So if I understand correctly your patch doesn't fix other problems
> with debugger waiting for cred_guard_mutex.
> 

No, but I see this just as a first step.

> I too do not think this can justify the new mutex in signal_struct...
> 

I think for mm_access the semantics of this mutex are clear: it
prevents the credentials from changing while it is held by mm_access,
and probably other places can take advantage of this mutex as well.

On the other hand, the cred_guard_mutex is needed to avoid two
threads calling execve at the same time.  So that is needed as well.

What remains is probably making PTRACE_ATTACH detect that the process
is currently in execve, and make that call fail in that situation.
I have not thought in depth about that problem, but it will probably
just need the right mutex to access current->in_execve.


That's at least how I see it.


Thanks
Bernd.




^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 15:43           ` Bernd Edlinger
@ 2020-03-02 15:57             ` Eric W. Biederman
  2020-03-02 16:02               ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-02 15:57 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

>
> I tried this with s/EACCESS/EACCES/.
>
> The test case in this patch is not fixed, but strace does not freeze,
> at least with my setup where it did freeze repeatable.

Thanks, that is what I was aiming at.

So we have one method we can pursue to fix this in practice.

> That is
> obviously because it bypasses the cred_guard_mutex.  But all other
> process that access this file still freeze, and cannot be
> interrupted except with kill -9.
>
> However that smells like a denial of service, that this
> simple test case which can be executed by guest, creates a /proc/$pid/mem
> that freezes any process, even root, when it looks at it.
> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.

Yes.  The test case in your patch is a variant of the original
problem.


I have been staring at this trying to understand the fundamentals of the
original deeper problem.

The current scope of cred_guard_mutex in exec is because being ptraced
causes suid exec to act differently.  So we need to know early if we are
ptraced.

If that case did not exist we could reduce the scope of the
cred_guard_mutex in exec to where your patch puts the cred_change_mutex.

I am starting to think reworking how we deal with ptrace and exec is the
way to solve this problem.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 15:57             ` Eric W. Biederman
@ 2020-03-02 16:02               ` Bernd Edlinger
  2020-03-02 16:17                 ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 16:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable



On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>>
>> I tried this with s/EACCESS/EACCES/.
>>
>> The test case in this patch is not fixed, but strace does not freeze,
>> at least with my setup where it did freeze repeatable.
> 
> Thanks, That is what I was aiming at.
> 
> So we have one method we can pursue to fix this in practice.
> 
>> That is
>> obviously because it bypasses the cred_guard_mutex.  But all other
>> process that access this file still freeze, and cannot be
>> interrupted except with kill -9.
>>
>> However that smells like a denial of service, that this
>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>> that freezes any process, even root, when it looks at it.
>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
> 
> Yes.  Your the test case in your patch a variant of the original
> problem.
> 
> 
> I have been staring at this trying to understand the fundamentals of the
> original deeper problem.
> 
> The current scope of cred_guard_mutex in exec is because being ptraced
> causes suid exec to act differently.  So we need to know early if we are
> ptraced.
> 

It has a second use, that it prevents two threads entering execve,
which would probably result in disaster.

> If that case did not exist we could reduce the scope of the
> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
> 
> I am starting to think reworking how we deal with ptrace and exec is the
> way to solve this problem.
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 16:02               ` Bernd Edlinger
@ 2020-03-02 16:17                 ` Eric W. Biederman
  2020-03-02 16:43                   ` Jann Horn
  2020-03-02 17:13                   ` [PATCHv2] " Bernd Edlinger
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-02 16:17 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>>
>>> I tried this with s/EACCESS/EACCES/.
>>>
>>> The test case in this patch is not fixed, but strace does not freeze,
>>> at least with my setup where it did freeze repeatable.
>> 
>> Thanks, That is what I was aiming at.
>> 
>> So we have one method we can pursue to fix this in practice.
>> 
>>> That is
>>> obviously because it bypasses the cred_guard_mutex.  But all other
>>> process that access this file still freeze, and cannot be
>>> interrupted except with kill -9.
>>>
>>> However that smells like a denial of service, that this
>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>> that freezes any process, even root, when it looks at it.
>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>> 
>> Yes.  Your the test case in your patch a variant of the original
>> problem.
>> 
>> 
>> I have been staring at this trying to understand the fundamentals of the
>> original deeper problem.
>> 
>> The current scope of cred_guard_mutex in exec is because being ptraced
>> causes suid exec to act differently.  So we need to know early if we are
>> ptraced.
>> 
>
> It has a second use, that it prevents two threads entering execve,
> which would probably result in disaster.

Exec can fail with an error code up until de_thread.  de_thread causes
exec to fail with the error code -EAGAIN for the second thread to get
into de_thread.

So no.  The cred_guard_mutex is not needed for that case at all.

>> If that case did not exist we could reduce the scope of the
>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>> 
>> I am starting to think reworking how we deal with ptrace and exec is the
>> way to solve this problem.


I am 99% convinced that the fix is to move cred_guard_mutex down.

Then right after we take cred_guard_mutex do:
	if (ptraced) {
		use_original_creds();
	}

And call it a day.

The details suck but I am 99% certain that would solve everyone's
problems, and not be too bad to audit either.

Eric
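
For readers following along, the de_thread point above can be modelled
in user space.  This is only a sketch under stated assumptions, not the
kernel code: struct model_signal, its fields and model_de_thread() are
hypothetical names, and a pthread mutex stands in for the siglock.  It
shows only the property being discussed: the first thread to get there
owns the exec, and a second thread is sent back with -EAGAIN.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-process signal state. */
struct model_signal {
	pthread_mutex_t siglock;      /* models sighand->siglock */
	bool group_exec_in_progress;  /* models "a thread already owns the exec" */
};

#define MODEL_SIGNAL_INIT { PTHREAD_MUTEX_INITIALIZER, false }

/*
 * Returns 0 for the first caller, which becomes the owner of the exec,
 * and -EAGAIN for every later caller, which has to back out.
 */
static int model_de_thread(struct model_signal *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->siglock);
	if (sig->group_exec_in_progress)
		ret = -EAGAIN;	/* lost the race against another thread */
	else
		sig->group_exec_in_progress = true;
	pthread_mutex_unlock(&sig->siglock);

	return ret;
}
```

In this model the exclusion between two exec'ing threads comes from the
flag checked under the siglock alone, which is the argument made above
for why cred_guard_mutex is not required for that case.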

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 16:17                 ` Eric W. Biederman
@ 2020-03-02 16:43                   ` Jann Horn
  2020-03-02 17:01                     ` Bernd Edlinger
  2020-03-02 17:13                   ` [PATCHv2] " Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-02 16:43 UTC (permalink / raw)
  To: Eric W. Biederman, James Morris
  Cc: Bernd Edlinger, Christian Brauner, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
> > On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> >> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> >>
> >>>
> >>> I tried this with s/EACCESS/EACCES/.
> >>>
> >>> The test case in this patch is not fixed, but strace does not freeze,
> >>> at least with my setup where it did freeze repeatable.
> >>
> >> Thanks, That is what I was aiming at.
> >>
> >> So we have one method we can pursue to fix this in practice.
> >>
> >>> That is
> >>> obviously because it bypasses the cred_guard_mutex.  But all other
> >>> process that access this file still freeze, and cannot be
> >>> interrupted except with kill -9.
> >>>
> >>> However that smells like a denial of service, that this
> >>> simple test case which can be executed by guest, creates a /proc/$pid/mem
> >>> that freezes any process, even root, when it looks at it.
> >>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
> >>
> >> Yes.  Your the test case in your patch a variant of the original
> >> problem.
> >>
> >>
> >> I have been staring at this trying to understand the fundamentals of the
> >> original deeper problem.
> >>
> >> The current scope of cred_guard_mutex in exec is because being ptraced
> >> causes suid exec to act differently.  So we need to know early if we are
> >> ptraced.
> >>
> >
> > It has a second use, that it prevents two threads entering execve,
> > which would probably result in disaster.
>
> Exec can fail with an error code up until de_thread.  de_thread causes
> exec to fail with the error code -EAGAIN for the second thread to get
> into de_thread.
>
> So no.  The cred_guard_mutex is not needed for that case at all.
>
> >> If that case did not exist we could reduce the scope of the
> >> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
> >>
> >> I am starting to think reworking how we deal with ptrace and exec is the
> >> way to solve this problem.
>
>
> I am 99% convinced that the fix is to move cred_guard_mutex down.

"move cred_guard_mutex down" as in "take it once we've already set up
the new process, past the point of no return"?

> Then right after we take cred_guard_mutex do:
>         if (ptraced) {
>                 use_original_creds();
>         }
>
> And call it a day.
>
> The details suck but I am 99% certain that would solve everyones
> problems, and not be too bad to audit either.

Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.

SELinux normally doesn't do the execution-degrading thing, it just
blocks the execution completely - see their selinux_bprm_set_creds()
hook. So I think they'd still need to set some state on the task that
says "we're currently in the middle of an execution where the target
task will run in context X", and then check against that in the
ptrace_may_access hook. Or I suppose they could just kill the task
near the end of execve, although that'd be kinda ugly.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 16:43                   ` Jann Horn
@ 2020-03-02 17:01                     ` Bernd Edlinger
  2020-03-02 17:37                       ` Jann Horn
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 17:01 UTC (permalink / raw)
  To: Jann Horn, Eric W. Biederman, James Morris
  Cc: Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

On 3/2/20 5:43 PM, Jann Horn wrote:
> On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>
>>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>>
>>>>>
>>>>> I tried this with s/EACCESS/EACCES/.
>>>>>
>>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>>> at least with my setup where it did freeze repeatable.
>>>>
>>>> Thanks, That is what I was aiming at.
>>>>
>>>> So we have one method we can pursue to fix this in practice.
>>>>
>>>>> That is
>>>>> obviously because it bypasses the cred_guard_mutex.  But all other
>>>>> process that access this file still freeze, and cannot be
>>>>> interrupted except with kill -9.
>>>>>
>>>>> However that smells like a denial of service, that this
>>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>>> that freezes any process, even root, when it looks at it.
>>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>>
>>>> Yes.  Your the test case in your patch a variant of the original
>>>> problem.
>>>>
>>>>
>>>> I have been staring at this trying to understand the fundamentals of the
>>>> original deeper problem.
>>>>
>>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>>> causes suid exec to act differently.  So we need to know early if we are
>>>> ptraced.
>>>>
>>>
>>> It has a second use, that it prevents two threads entering execve,
>>> which would probably result in disaster.
>>
>> Exec can fail with an error code up until de_thread.  de_thread causes
>> exec to fail with the error code -EAGAIN for the second thread to get
>> into de_thread.
>>
>> So no.  The cred_guard_mutex is not needed for that case at all.
>>
>>>> If that case did not exist we could reduce the scope of the
>>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>>>>
>>>> I am starting to think reworking how we deal with ptrace and exec is the
>>>> way to solve this problem.
>>
>>
>> I am 99% convinced that the fix is to move cred_guard_mutex down.
> 
> "move cred_guard_mutex down" as in "take it once we've already set up
> the new process, past the point of no return"?
> 
>> Then right after we take cred_guard_mutex do:
>>         if (ptraced) {
>>                 use_original_creds();
>>         }
>>
>> And call it a day.
>>
>> The details suck but I am 99% certain that would solve everyones
>> problems, and not be too bad to audit either.
> 
> Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
> 
> SELinux normally doesn't do the execution-degrading thing, it just
> blocks the execution completely - see their selinux_bprm_set_creds()
> hook. So I think they'd still need to set some state on the task that
> says "we're currently in the middle of an execution where the target
> task will run in context X", and then check against that in the
> ptrace_may_access hook. Or I suppose they could just kill the task
> near the end of execve, although that'd be kinda ugly.
> 

We have current->in_execve for that, right?
I think when the cred_guard_mutex is taken only in the critical section,
then PTRACE_ATTACH could take the cred_guard_mutex, look at current->in_execve,
and just return -EAGAIN in that case, right, everybody happy :)


Bernd.
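
A rough user-space model of the idea above, for illustration only.  It
is a sketch under stated assumptions rather than a kernel patch: struct
model_task and model_ptrace_attach() are hypothetical names, and a
pthread mutex stands in for cred_guard_mutex.  The attach path takes
the mutex briefly, looks at the in_execve flag, and fails with -EAGAIN
instead of waiting for the exec to finish.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the tracee state discussed above. */
struct model_task {
	pthread_mutex_t cred_guard;  /* models signal->cred_guard_mutex */
	bool in_execve;              /* models the in_execve flag of the tracee */
	bool traced;                 /* set once a tracer has attached */
};

#define MODEL_TASK_INIT { PTHREAD_MUTEX_INITIALIZER, false, false }

/*
 * Attach only if the target is not in the middle of an exec.
 * Returns 0 on success and -EAGAIN if the caller should retry later.
 */
static int model_ptrace_attach(struct model_task *t)
{
	int ret = 0;

	pthread_mutex_lock(&t->cred_guard);
	if (t->in_execve)
		ret = -EAGAIN;	/* exec in flight: refuse rather than block */
	else
		t->traced = true;
	pthread_mutex_unlock(&t->cred_guard);

	return ret;
}
```

The tracer never sleeps on the mutex for the duration of an exec in
this model, so it stays free to handle exit events from the other
threads.  The cost is that an attach can fail spuriously, and the
tracer would have to retry.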

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 16:17                 ` Eric W. Biederman
  2020-03-02 16:43                   ` Jann Horn
@ 2020-03-02 17:13                   ` Bernd Edlinger
  2020-03-02 21:49                     ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 17:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable



On 3/2/20 5:17 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>
>>>>
>>>> I tried this with s/EACCESS/EACCES/.
>>>>
>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>> at least with my setup where it did freeze repeatable.
>>>
>>> Thanks, That is what I was aiming at.
>>>
>>> So we have one method we can pursue to fix this in practice.
>>>
>>>> That is
>>>> obviously because it bypasses the cred_guard_mutex.  But all other
>>>> process that access this file still freeze, and cannot be
>>>> interrupted except with kill -9.
>>>>
>>>> However that smells like a denial of service, that this
>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>> that freezes any process, even root, when it looks at it.
>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>
>>> Yes.  Your the test case in your patch a variant of the original
>>> problem.
>>>
>>>
>>> I have been staring at this trying to understand the fundamentals of the
>>> original deeper problem.
>>>
>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>> causes suid exec to act differently.  So we need to know early if we are
>>> ptraced.
>>>
>>
>> It has a second use, that it prevents two threads entering execve,
>> which would probably result in disaster.
> 
> Exec can fail with an error code up until de_thread.  de_thread causes
> exec to fail with the error code -EAGAIN for the second thread to get
> into de_thread.
> 
> So no.  The cred_guard_mutex is not needed for that case at all.
> 

Okay, but that will reset current->in_execve, right?

>>> If that case did not exist we could reduce the scope of the
>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>>>
>>> I am starting to think reworking how we deal with ptrace and exec is the
>>> way to solve this problem.
> 
> 
> I am 99% convinced that the fix is to move cred_guard_mutex down.
> 
> Then right after we take cred_guard_mutex do:
> 	if (ptraced) {
> 		use_original_creds();
> 	}
> 
> And call it a day.
> 
> The details suck but I am 99% certain that would solve everyones
> problems, and not be too bad to audit either.
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 17:01                     ` Bernd Edlinger
@ 2020-03-02 17:37                       ` Jann Horn
  2020-03-02 17:42                         ` christian
  0 siblings, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-02 17:37 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, James Morris, Christian Brauner,
	Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Thomas Gleixner, Oleg Nesterov, Frederic Weisbecker,
	Andrei Vagin, Ingo Molnar, Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> On 3/2/20 5:43 PM, Jann Horn wrote:
> > On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> >>
> >>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> >>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> >>>>
> >>>>>
> >>>>> I tried this with s/EACCESS/EACCES/.
> >>>>>
> >>>>> The test case in this patch is not fixed, but strace does not freeze,
> >>>>> at least with my setup where it did freeze repeatable.
> >>>>
> >>>> Thanks, That is what I was aiming at.
> >>>>
> >>>> So we have one method we can pursue to fix this in practice.
> >>>>
> >>>>> That is
> >>>>> obviously because it bypasses the cred_guard_mutex.  But all other
> >>>>> process that access this file still freeze, and cannot be
> >>>>> interrupted except with kill -9.
> >>>>>
> >>>>> However that smells like a denial of service, that this
> >>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
> >>>>> that freezes any process, even root, when it looks at it.
> >>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
> >>>>
> >>>> Yes.  Your the test case in your patch a variant of the original
> >>>> problem.
> >>>>
> >>>>
> >>>> I have been staring at this trying to understand the fundamentals of the
> >>>> original deeper problem.
> >>>>
> >>>> The current scope of cred_guard_mutex in exec is because being ptraced
> >>>> causes suid exec to act differently.  So we need to know early if we are
> >>>> ptraced.
> >>>>
> >>>
> >>> It has a second use, that it prevents two threads entering execve,
> >>> which would probably result in disaster.
> >>
> >> Exec can fail with an error code up until de_thread.  de_thread causes
> >> exec to fail with the error code -EAGAIN for the second thread to get
> >> into de_thread.
> >>
> >> So no.  The cred_guard_mutex is not needed for that case at all.
> >>
> >>>> If that case did not exist we could reduce the scope of the
> >>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
> >>>>
> >>>> I am starting to think reworking how we deal with ptrace and exec is the
> >>>> way to solve this problem.
> >>
> >>
> >> I am 99% convinced that the fix is to move cred_guard_mutex down.
> >
> > "move cred_guard_mutex down" as in "take it once we've already set up
> > the new process, past the point of no return"?
> >
> >> Then right after we take cred_guard_mutex do:
> >>         if (ptraced) {
> >>                 use_original_creds();
> >>         }
> >>
> >> And call it a day.
> >>
> >> The details suck but I am 99% certain that would solve everyones
> >> problems, and not be too bad to audit either.
> >
> > Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
> >
> > SELinux normally doesn't do the execution-degrading thing, it just
> > blocks the execution completely - see their selinux_bprm_set_creds()
> > hook. So I think they'd still need to set some state on the task that
> > says "we're currently in the middle of an execution where the target
> > task will run in context X", and then check against that in the
> > ptrace_may_access hook. Or I suppose they could just kill the task
> > near the end of execve, although that'd be kinda ugly.
> >
>
> We have current->in_execve for that, right?
> I think when the cred_guard_mutex is taken only in the critical section,
> then PTRACE_ATTACH could take the guard_mutex, and look at current->in_execve,
> and just return -EAGAIN in that case, right, everybody happy :)

It's probably going to mean that things like strace will just randomly
fail to attach to processes if they happen to be in the middle of
execve... but I guess that works?

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 17:37                       ` Jann Horn
@ 2020-03-02 17:42                         ` christian
  2020-03-02 18:08                           ` Jann Horn
  0 siblings, 1 reply; 203+ messages in thread
From: christian @ 2020-03-02 17:42 UTC (permalink / raw)
  To: Jann Horn, Bernd Edlinger
  Cc: Eric W. Biederman, James Morris, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

From: Christian Brauner <christian.brauner@ubuntu.com>
Message-ID: <9C3BF644-0F82-48C9-9116-8554204FB57D@ubuntu.com>

On March 2, 2020 6:37:27 PM GMT+01:00, Jann Horn <jannh@google.com> wrote:
>On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger
><bernd.edlinger@hotmail.de> wrote:
>> On 3/2/20 5:43 PM, Jann Horn wrote:
>> > On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman
><ebiederm@xmission.com> wrote:
>> >>
>> >> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> >>
>> >>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>> >>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> >>>>
>> >>>>>
>> >>>>> I tried this with s/EACCESS/EACCES/.
>> >>>>>
>> >>>>> The test case in this patch is not fixed, but strace does not
>freeze,
>> >>>>> at least with my setup where it did freeze repeatable.
>> >>>>
>> >>>> Thanks, That is what I was aiming at.
>> >>>>
>> >>>> So we have one method we can pursue to fix this in practice.
>> >>>>
>> >>>>> That is
>> >>>>> obviously because it bypasses the cred_guard_mutex.  But all
>other
>> >>>>> process that access this file still freeze, and cannot be
>> >>>>> interrupted except with kill -9.
>> >>>>>
>> >>>>> However that smells like a denial of service, that this
>> >>>>> simple test case which can be executed by guest, creates a
>/proc/$pid/mem
>> >>>>> that freezes any process, even root, when it looks at it.
>> >>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>> >>>>
>> >>>> Yes.  Your the test case in your patch a variant of the original
>> >>>> problem.
>> >>>>
>> >>>>
>> >>>> I have been staring at this trying to understand the fundamentals of the
>> >>>> original deeper problem.
>> >>>>
>> >>>> The current scope of cred_guard_mutex in exec is because being ptraced
>> >>>> causes suid exec to act differently.  So we need to know early if we are
>> >>>> ptraced.
>> >>>>
>> >>>
>> >>> It has a second use, that it prevents two threads entering execve,
>> >>> which would probably result in disaster.
>> >>
>> >> Exec can fail with an error code up until de_thread.  de_thread causes
>> >> exec to fail with the error code -EAGAIN for the second thread to get
>> >> into de_thread.
>> >>
>> >> So no.  The cred_guard_mutex is not needed for that case at all.
>> >>
>> >>>> If that case did not exist we could reduce the scope of the
>> >>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>> >>>>
>> >>>> I am starting to think reworking how we deal with ptrace and exec is the
>> >>>> way to solve this problem.
>> >>
>> >>
>> >> I am 99% convinced that the fix is to move cred_guard_mutex down.
>> >
>> > "move cred_guard_mutex down" as in "take it once we've already set up
>> > the new process, past the point of no return"?
>> >
>> >> Then right after we take cred_guard_mutex do:
>> >>         if (ptraced) {
>> >>                 use_original_creds();
>> >>         }
>> >>
>> >> And call it a day.
>> >>
>> >> The details suck but I am 99% certain that would solve everyone's
>> >> problems, and not be too bad to audit either.
>> >
>> > Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
>> >
>> > SELinux normally doesn't do the execution-degrading thing, it just
>> > blocks the execution completely - see their selinux_bprm_set_creds()
>> > hook. So I think they'd still need to set some state on the task that
>> > says "we're currently in the middle of an execution where the target
>> > task will run in context X", and then check against that in the
>> > ptrace_may_access hook. Or I suppose they could just kill the task
>> > near the end of execve, although that'd be kinda ugly.
>> >
>>
>> We have current->in_execve for that, right?
>> I think when the cred_guard_mutex is taken only in the critical section,
>> then PTRACE_ATTACH could take the guard_mutex, and look at current->in_execve,
>> and just return -EAGAIN in that case, right, everybody happy :)
>
>It's probably going to mean that things like strace will just randomly
>fail to attach to processes if they happen to be in the middle of
>execve... but I guess that works?

That sounds like an acceptable outcome.
We can at least risk it and if we regress
revert or come up with the more complex
solution suggested in another mail here?


* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 17:42                         ` christian
@ 2020-03-02 18:08                           ` Jann Horn
  2020-03-02 20:10                             ` [PATCHv3] " Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-02 18:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Bernd Edlinger, Eric W. Biederman, James Morris, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

On Mon, Mar 2, 2020 at 6:43 PM <christian@brauner.io> wrote:
> On March 2, 2020 6:37:27 PM GMT+01:00, Jann Horn <jannh@google.com> wrote:
> >On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger
> ><bernd.edlinger@hotmail.de> wrote:
> >> On 3/2/20 5:43 PM, Jann Horn wrote:
> >> > On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman
> ><ebiederm@xmission.com> wrote:
[...]
> >> >> I am 99% convinced that the fix is to move cred_guard_mutex down.
> >> >
> >> > "move cred_guard_mutex down" as in "take it once we've already set up
> >> > the new process, past the point of no return"?
> >> >
> >> >> Then right after we take cred_guard_mutex do:
> >> >>         if (ptraced) {
> >> >>                 use_original_creds();
> >> >>         }
> >> >>
> >> >> And call it a day.
> >> >>
> >> >> The details suck but I am 99% certain that would solve everyone's
> >> >> problems, and not be too bad to audit either.
> >> >
> >> > Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
> >> >
> >> > SELinux normally doesn't do the execution-degrading thing, it just
> >> > blocks the execution completely - see their selinux_bprm_set_creds()
> >> > hook. So I think they'd still need to set some state on the task that
> >> > says "we're currently in the middle of an execution where the target
> >> > task will run in context X", and then check against that in the
> >> > ptrace_may_access hook. Or I suppose they could just kill the task
> >> > near the end of execve, although that'd be kinda ugly.
> >> >
> >>
> >> We have current->in_execve for that, right?
> >> I think when the cred_guard_mutex is taken only in the critical section,
> >> then PTRACE_ATTACH could take the guard_mutex, and look at current->in_execve,
> >> and just return -EAGAIN in that case, right, everybody happy :)
> >
> >It's probably going to mean that things like strace will just randomly
> >fail to attach to processes if they happen to be in the middle of
> >execve... but I guess that works?
>
> That sounds like an acceptable outcome.
> We can at least risk it and if we regress
> revert or come up with the more complex
> solution suggested in another mail here?

Yeah, sounds reasonable, I guess.
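
If PTRACE_ATTACH can return EAGAIN while the target is in the middle of
execve, a tracer can treat that as a transient condition and try again.
A minimal user-space sketch, assuming a bounded retry with a fixed delay
is acceptable.  attach_with_retry(), ptrace_attach_pid() and
flaky_attach() are hypothetical helper names used for illustration; they
are not part of strace or of the kernel:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for nanosleep() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*attach_fn)(pid_t pid);

/* Real attach: thin wrapper around ptrace(PTRACE_ATTACH). */
int ptrace_attach_pid(pid_t pid)
{
	return ptrace(PTRACE_ATTACH, pid, 0L, 0L) == -1 ? -1 : 0;
}

/* Stand-in for dry runs: fails with EAGAIN twice, then succeeds. */
int flaky_calls;

int flaky_attach(pid_t pid)
{
	(void)pid;
	if (++flaky_calls <= 2) {
		errno = EAGAIN;
		return -1;
	}
	return 0;
}

/*
 * Call attach() up to max_tries times, sleeping delay_ms between
 * attempts, for as long as it keeps failing with EAGAIN.  Any other
 * error is reported immediately.  Returns 0 on success, or -1 with
 * errno set on failure.
 */
int attach_with_retry(attach_fn attach, pid_t pid, int max_tries,
		      long delay_ms)
{
	struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

	assert(attach != NULL);

	for (int i = 0; i < max_tries; i++) {
		if (attach(pid) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;
		if (i + 1 < max_tries)
			nanosleep(&ts, NULL);
	}
	errno = EAGAIN;
	return -1;
}
```

A tracer would call attach_with_retry(ptrace_attach_pid, pid, 10, 5) or
similar; the retry count and delay are arbitrary here and would need
tuning against how long execve really keeps the flag set.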


* [PATCHv3] exec: Fix a deadlock in ptrace
  2020-03-02 18:08                           ` Jann Horn
@ 2020-03-02 20:10                             ` Bernd Edlinger
  2020-03-02 20:28                               ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 20:10 UTC (permalink / raw)
  To: Jann Horn, Christian Brauner
  Cc: Eric W. Biederman, James Morris, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to take the cred_guard_mutex only
in a critical section at the beginning, and at the end of the
execve function, and let PTRACE_ATTACH fail with EAGAIN while
execve is not complete, but other functions like mm_access are
allowed to complete normally.

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 19 +++++----
 fs/exec.c                                 | 28 ++++++++++++--
 include/linux/binfmts.h                   |  6 ++-
 include/linux/sched/signal.h              |  1 +
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  2 +-
 kernel/fork.c                             |  1 +
 kernel/ptrace.c                           |  4 ++
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 64 +++++++++++++++++++++++++++++++
 11 files changed, 115 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..61d6704 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_for_ptrace.  The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_for_ptrace is reset again.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..e466301 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,28 +1404,41 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_for_ptrace.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
+	int ret;
+
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_for_ptrace))
+		goto out;
+
+	ret = -ENOMEM;
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	if (likely(bprm->cred)) {
+		current->signal->cred_locked_for_ptrace = true;
+		ret = 0;
+	}
 
+out:
 	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	return ret;
 }
 
 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->called_flush_old_exec)
+			mutex_lock(&current->signal->cred_guard_mutex);
+		current->signal->cred_locked_for_ptrace = false;
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1488,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	current->signal->cred_locked_for_ptrace = false;
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_change_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..073a2b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	bool cred_locked_for_ptrace;	/* set while in execve */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..ecefff28 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_locked_for_ptrace = false,
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..a2b2ec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	sig->cred_locked_for_ptrace = false;
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..abf09ba 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
+	retval = -EAGAIN;
+	if (task->signal->cred_locked_for_ptrace)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..63ff531
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
-- 
1.9.1


* Re: [PATCHv3] exec: Fix a deadlock in ptrace
  2020-03-02 20:10                             ` [PATCHv3] " Bernd Edlinger
@ 2020-03-02 20:28                               ` Bernd Edlinger
  0 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 20:28 UTC (permalink / raw)
  To: Jann Horn, Christian Brauner
  Cc: Eric W. Biederman, James Morris, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, Kees Cook, Greg Kroah-Hartman,
	Shakeel Butt, Jason Gunthorpe, Christian Kellner,
	Andrea Arcangeli, Aleksa Sarai, Dmitry V. Levin, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable,
	linux-security-module

On 3/2/20 9:10 PM, Bernd Edlinger wrote:
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,11 @@ struct linux_binprm {
>  		 * exec has happened. Used to sanitize execution environment
>  		 * and to set AT_SECURE auxv for glibc.
>  		 */
> -		secureexec:1;
> +		secureexec:1,
> +		/*
> +		 * Set by flush_old_exec, when the cred_change_mutex is taken.

Oops, I forgot to update this comment; it should read "when the cred_guard_mutex is taken".

I'll send a new patch later.

Bernd.

> +		 */
> +		called_flush_old_exec:1;
>  #ifdef __alpha__
>  	unsigned int taso:1;
>  #endif


* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 17:13                   ` [PATCHv2] " Bernd Edlinger
@ 2020-03-02 21:49                     ` Eric W. Biederman
  2020-03-02 22:00                       ` Bernd Edlinger
  2020-03-02 22:18                       ` [PATCHv4] " Bernd Edlinger
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-02 21:49 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/2/20 5:17 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>>
>>>>>
>>>>> I tried this with s/EACCESS/EACCES/.
>>>>>
>>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>>> at least with my setup where it did freeze repeatably.
>>>>
>>>> Thanks, That is what I was aiming at.
>>>>
>>>> So we have one method we can pursue to fix this in practice.
>>>>
>>>>> That is
>>>>> obviously because it bypasses the cred_guard_mutex.  But all other
>>>>> processes that access this file still freeze, and cannot be
>>>>> interrupted except with kill -9.
>>>>>
>>>>> However that smells like a denial of service, that this
>>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>>> that freezes any process, even root, when it looks at it.
>>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>>
>>>> Yes.  The test case in your patch is a variant of the original
>>>> problem.
>>>>
>>>>
>>>> I have been staring at this trying to understand the fundamentals of the
>>>> original deeper problem.
>>>>
>>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>>> causes suid exec to act differently.  So we need to know early if we are
>>>> ptraced.
>>>>
>>>
>>> It has a second use, that it prevents two threads entering execve,
>>> which would probably result in disaster.
>> 
>> Exec can fail with an error code up until de_thread.  de_thread causes
>> exec to fail with the error code -EAGAIN for the second thread to get
>> into de_thread.
>> 
>> So no.  The cred_guard_mutex is not needed for that case at all.
>> 
>
> Okay, but that will reset current->in_execve, right?

Absolutely.

The error handling kicks in and exec_binprm fails with a negative
return code.  Then __do_execve_file cleans up and clears
current->in_execve.

Eric


* Re: [PATCHv2] exec: Fix a deadlock in ptrace
  2020-03-02 21:49                     ` Eric W. Biederman
@ 2020-03-02 22:00                       ` Bernd Edlinger
  2020-03-02 22:18                       ` [PATCHv4] " Bernd Edlinger
  1 sibling, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 22:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 3/2/20 10:49 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/2/20 5:17 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>
>>>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>>>
>>>>>>
>>>>>> I tried this with s/EACCESS/EACCES/.
>>>>>>
>>>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>>>> at least with my setup where it did freeze repeatably.
>>>>>
>>>>> Thanks, That is what I was aiming at.
>>>>>
>>>>> So we have one method we can pursue to fix this in practice.
>>>>>
>>>>>> That is
>>>>>> obviously because it bypasses the cred_guard_mutex.  But all other
>>>>>> processes that access this file still freeze, and cannot be
>>>>>> interrupted except with kill -9.
>>>>>>
>>>>>> However that smells like a denial of service, that this
>>>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>>>> that freezes any process, even root, when it looks at it.
>>>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>>>
>>>>> Yes.  The test case in your patch is a variant of the original
>>>>> problem.
>>>>>
>>>>>
>>>>> I have been staring at this trying to understand the fundamentals of the
>>>>> original deeper problem.
>>>>>
>>>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>>>> causes suid exec to act differently.  So we need to know early if we are
>>>>> ptraced.
>>>>>
>>>>
>>>> It has a second use, that it prevents two threads entering execve,
>>>> which would probably result in disaster.
>>>
>>> Exec can fail with an error code up until de_thread.  de_thread causes
>>> exec to fail with the error code -EAGAIN for the second thread to get
>>> into de_thread.
>>>
>>> So no.  The cred_guard_mutex is not needed for that case at all.
>>>
>>
>> Okay, but that will reset current->in_execve, right?
> 
> Absolutely.
> 
> The error handling kicks in and exec_binprm fails with a negative
> return code.  Then __do_execve_file cleans up and clears
> current->in_execve.
> 

Yes of course.  I was under the wrong impression that that value is
a kind of global, but it is a thread local.
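
The distinction matters because in_execve is a per-task field, so each
thread has its own copy, whereas a field in signal_struct is shared by
the whole thread group.  A rough user-space analogy, assuming C11
_Thread_local as a stand-in for a per-task field and a plain global as
a stand-in for a signal_struct field (all names below are made up for
illustration):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-thread flag, standing in for a per-task field such as in_execve. */
static _Thread_local bool in_execve_model;
/* Process-wide flag, standing in for a field in signal_struct. */
static bool shared_flag_model;

struct observed {
	bool per_thread;	/* what the second thread saw in its own copy */
	bool shared;		/* what the second thread saw in the shared flag */
};

static void *observer(void *arg)
{
	struct observed *o = arg;

	o->per_thread = in_execve_model;
	o->shared = shared_flag_model;
	return NULL;
}

/*
 * Set both flags in the calling thread, then report what a freshly
 * created second thread observes.  pthread_create() and pthread_join()
 * order the accesses, so no extra locking is needed in this demo.
 */
struct observed set_flags_and_observe(void)
{
	struct observed o = { false, false };
	pthread_t t;
	int rc;

	in_execve_model = true;
	shared_flag_model = true;

	rc = pthread_create(&t, NULL, observer, &o);
	assert(rc == 0);
	(void)rc;
	pthread_join(t, NULL);
	return o;
}
```

The second thread sees the shared flag set but not the per-thread one,
which is why a tracer looking at a different task cannot rely on
in_execve of the task that called execve.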

So I think I need a new boolean see v3 of the patch, and soon v4 (with
just one comment fixed).
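
The idea behind the new boolean can be modelled in user space as a flag
that is only ever read or written under the mutex, while the mutex
itself is held just briefly.  This is a simplified sketch and not the
patch itself: a pthread mutex stands in for cred_guard_mutex, the
model_* function names are hypothetical, and the real patch additionally
re-takes the mutex in flush_old_exec and holds it until
install_exec_creds:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for the two signal_struct fields used by the patch. */
struct signal_model {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_for_ptrace;
};

/* Start of exec: mark the credentials busy, then drop the mutex again. */
int model_exec_begin(struct signal_model *sig)
{
	int ret = -EAGAIN;	/* another exec is already in flight */

	assert(sig != NULL);
	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (!sig->cred_locked_for_ptrace) {
		sig->cred_locked_for_ptrace = true;
		ret = 0;
	}
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* End of exec, on success or failure: clear the flag under the mutex. */
void model_exec_end(struct signal_model *sig)
{
	assert(sig != NULL);
	pthread_mutex_lock(&sig->cred_guard_mutex);
	sig->cred_locked_for_ptrace = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* Attach attempt: does not wait for exec to finish, reports -EAGAIN. */
int model_ptrace_attach(struct signal_model *sig)
{
	int ret;

	assert(sig != NULL);
	pthread_mutex_lock(&sig->cred_guard_mutex);
	ret = sig->cred_locked_for_ptrace ? -EAGAIN : 0;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}
```

Because nobody sleeps for the duration of exec while holding the mutex
in this model, a reader that only needs the mutex for a short check is
not stuck behind a thread that is waiting for its siblings to exit.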

I'm currently running the strace v5.5 testsuite, and every test
has passed so far.  I'll also look at the gdb testsuite before I send the
next version.


Thanks
Bernd.



* [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-02 21:49                     ` Eric W. Biederman
  2020-03-02 22:00                       ` Bernd Edlinger
@ 2020-03-02 22:18                       ` Bernd Edlinger
  2020-03-03  2:26                         ` Kees Cook
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-02 22:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Christian Brauner, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, Kees Cook,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to take the cred_guard_mutex only
in a critical section at the beginning, and at the end of the
execve function, and let PTRACE_ATTACH fail with EAGAIN while
execve is not complete, but other functions like mm_access are
allowed to complete normally.

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 19 +++++----
 fs/exec.c                                 | 28 +++++++++++--
 include/linux/binfmts.h                   |  6 ++-
 include/linux/sched/signal.h              |  1 +
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  2 +-
 kernel/fork.c                             |  1 +
 kernel/ptrace.c                           |  4 ++
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
 11 files changed, 117 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case. 

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..61d6704 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_for_ptrace.  The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_for_ptrace is reset again.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..e466301 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,28 +1404,41 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_for_ptrace.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
+	int ret;
+
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_for_ptrace))
+		goto out;
+
+	ret = -ENOMEM;
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	if (likely(bprm->cred)) {
+		current->signal->cred_locked_for_ptrace = true;
+		ret = 0;
+	}
 
+out:
 	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	return ret;
 }
 
 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->called_flush_old_exec)
+			mutex_lock(&current->signal->cred_guard_mutex);
+		current->signal->cred_locked_for_ptrace = false;
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1488,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	current->signal->cred_locked_for_ptrace = false;
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_guard_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..073a2b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	bool cred_locked_for_ptrace;	/* set while in execve */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..ecefff28 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_locked_for_ptrace = false,
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..a2b2ec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	sig->cred_locked_for_ptrace = false;
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..abf09ba 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
+	retval = -EAGAIN;
+	if (task->signal->cred_locked_for_ptrace)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
-- 
1.9.1

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-02 22:18                       ` [PATCHv4] " Bernd Edlinger
@ 2020-03-03  2:26                         ` Kees Cook
  2020-03-03  4:54                           ` Bernd Edlinger
  2020-03-03  8:58                           ` Christian Brauner
  0 siblings, 2 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-03  2:26 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Jann Horn, Christian Brauner, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
> 
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
> 
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The proposed solution is to take the cred_guard_mutex only
> in a critical section at the beginning, and at the end of the
> execve function, and let PTRACE_ATTACH fail with EAGAIN while
> execve is not complete, but other functions like vm_access are
> allowed to complete normally.

Sorry to be a bummer, but I don't think this will work. A few more things
during the exec process depend on cred_guard_mutex being held.

If I'm reading this patch correctly, this changes the lifetime of the
cred_guard_mutex lock to be:
	- during prepare_bprm_creds()
	- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().
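
A minimal userspace sketch of the flag-under-mutex scheme the patch proposes,
for illustration only. It is not kernel code: struct sig_model and the
model_*() helpers are hypothetical names, a pthread mutex stands in for
cred_guard_mutex, and the second critical section (flush_old_exec() through
install_exec_creds()) is collapsed into model_exec_end().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the two signal_struct fields the patch touches. */
struct sig_model {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_for_ptrace;	/* only accessed under the mutex */
};

/* Models prepare_bprm_creds(): set the flag in a short critical section. */
static int model_exec_begin(struct sig_model *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_for_ptrace)
		ret = -EAGAIN;		/* an exec is already in flight */
	else
		sig->cred_locked_for_ptrace = true;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* Models install_exec_creds() or free_bprm(): clear the flag again. */
static void model_exec_end(struct sig_model *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	assert(sig->cred_locked_for_ptrace);
	sig->cred_locked_for_ptrace = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* Models the new check in ptrace_attach(): fail instead of blocking. */
static int model_ptrace_attach(struct sig_model *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_for_ptrace)
		ret = -EAGAIN;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}
```

In this model an attach attempted between model_exec_begin() and
model_exec_end() returns -EAGAIN immediately, while nothing holds the mutex
across that window.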

That means, for example, that check_unsafe_exec()'s documented invariant
is violated:
    /*
     * determine how safe it is to execute the proposed program
     * - the caller must hold ->cred_guard_mutex to protect against
     *   PTRACE_ATTACH or seccomp thread-sync
     */
    static void check_unsafe_exec(struct linux_binprm *bprm) ...
which is looking at no_new_privs as well as other details, and making
decisions about the bprm state from the current state.

I think it also means that the potentially multiple invocations
of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
a lock (another place where current's no_new_privs is evaluated).

Related, it also means that cred_guard_mutex is unheld for every
invocation of search_binary_handler() (which can loop via the previously
mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
currently.)

For seccomp, the expectations about existing thread states risk races
too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from
  appearing/disappearing, which would destroy filter refcounting and
  lead to memory corruption.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to
  avoid no_new_privs and filter confusion during exec, which could
  lead to exploitable setuid conditions (see below).

Just racing a malicious thread during TSYNC is not a very strong
example (a malicious thread could do lots of fun things to "current"
before it ever got near calling TSYNC), but I think there is the risk
of mismatched/confused states that we don't want to allow. One is a
particularly bad state that could lead to privilege escalations (in the
form of the old "sendmail doesn't check setuid" flaw; if a setuid process
has a filter attached that silently fails a priv-dropping setuid call
and continues execution with elevated privs, it can be tricked into
doing bad things on behalf of the unprivileged parent, which was the
primary goal of the original use of cred_guard_mutex with TSYNC[1]):

thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen
         by thread A, _and_ nnp set (which doesn't matter)

With the original locking, thread B will fail check_unsafe_exec()
because filter and nnp state are changed together, with "atomicity"
protected by the cred_guard_mutex.
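
The interleaving above can be replayed with a small deterministic model. This
is only a sketch: struct thread_model and its helpers are hypothetical, each
function stands for one step of the listed sequence, and serialized() shows
just one of the orderings a common lock would allow (TSYNC completing before
the exec checks).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of thread B; field names are illustrative only. */
struct thread_model {
	bool no_new_privs;
	bool has_filter;
	bool gained_privs;
};

/* The two updates TSYNC makes to a sibling thread, as separate steps. */
static void tsync_set_filter(struct thread_model *t)
{
	t->has_filter = true;
}

static void tsync_set_nnp(struct thread_model *t)
{
	t->no_new_privs = true;
}

/* Setuid exec: privileges are gained only if no_new_privs is unset. */
static void exec_setuid(struct thread_model *t)
{
	if (!t->no_new_privs)
		t->gained_privs = true;
}

/* No common lock: exec observes the state between the two TSYNC steps. */
static struct thread_model interleaved(void)
{
	struct thread_model b = { false, false, false };

	tsync_set_filter(&b);
	exec_setuid(&b);	/* nnp still unset, so privs are gained */
	tsync_set_nnp(&b);
	return b;
}

/* Common lock: exec sees both TSYNC updates or neither; here, both. */
static struct thread_model serialized(void)
{
	struct thread_model b = { false, false, false };

	tsync_set_filter(&b);
	tsync_set_nnp(&b);
	exec_setuid(&b);	/* nnp set, so no privs are gained */
	assert(!b.gained_privs);
	return b;
}
```

interleaved() ends with a filter attached and privileges gained, which is the
state described above; serialized() ends with the filter and nnp set but no
privileges gained.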

And this is just the bad state I _can_ see. I'm worried there are more...

All this said, I do see a small similarity here to the work I did to
stabilize stack rlimits (there was an ongoing problem with making multiple
decisions for the bprm based on current's state -- but current's state
was mutable during exec). For this, I saved rlim_stack to bprm and ignored
current's copy until exec ended and then stored bprm's copy into current.
If the only problem anyone can see here is the handling of no_new_privs,
we might be able to solve that similarly, at least disentangling tsync/nnp
from cred_guard_mutex.
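
The snapshot approach described above can be sketched in userspace as
follows. All names (struct task_model, struct bprm_model, the model_*()
helpers) are hypothetical, and the division by four in model_stack_gap() is an
arbitrary stand-in for whatever decision reads the limit during exec.

```c
#include <assert.h>

/* Hypothetical stand-ins for current's mutable state and for the bprm. */
struct task_model { unsigned long rlim_stack; };
struct bprm_model { unsigned long rlim_stack; };

/* Start of exec: take a single snapshot of the limit. */
static void model_bprm_init(struct bprm_model *bprm,
			    const struct task_model *cur)
{
	assert(bprm && cur);
	bprm->rlim_stack = cur->rlim_stack;
}

/* Every decision during exec reads the snapshot, never the live value. */
static unsigned long model_stack_gap(const struct bprm_model *bprm)
{
	return bprm->rlim_stack / 4;	/* arbitrary example computation */
}

/* End of exec: store the snapshot back into the task. */
static void model_exec_commit(struct task_model *cur,
			      const struct bprm_model *bprm)
{
	cur->rlim_stack = bprm->rlim_stack;
}
```

Because every reader goes through the bprm copy, a change to the task's live
value in the middle of exec cannot produce two decisions based on different
limits.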

-Kees

[1] https://lore.kernel.org/lkml/20140625142121.GD7892@redhat.com/

-- 
Kees Cook

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  2:26                         ` Kees Cook
@ 2020-03-03  4:54                           ` Bernd Edlinger
  2020-03-03  5:29                             ` Kees Cook
  2020-03-03  8:58                           ` Christian Brauner
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-03  4:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Jann Horn, Christian Brauner, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 3/3/20 3:26 AM, Kees Cook wrote:
> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>
>> strace          D    0 30614  30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect          D    0 31933  30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> The proposed solution is to take the cred_guard_mutex only
>> in a critical section at the beginning, and at the end of the
>> execve function, and let PTRACE_ATTACH fail with EAGAIN while
>> execve is not complete, but other functions like vm_access are
>> allowed to complete normally.
> 
> Sorry to be a bummer, but I don't think this will work. A few more things
> during the exec process depend on cred_guard_mutex being held.
> 
> If I'm reading this patch correctly, this changes the lifetime of the
> cred_guard_mutex lock to be:
> 	- during prepare_bprm_creds()
> 	- from flush_old_exec() through install_exec_creds()
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
> 
> That means, for example, that check_unsafe_exec()'s documented invariant
> is violated:
>     /*
>      * determine how safe it is to execute the proposed program
>      * - the caller must hold ->cred_guard_mutex to protect against
>      *   PTRACE_ATTACH or seccomp thread-sync
>      */

Oh, right, I haven't understood that hint...

>     static void check_unsafe_exec(struct linux_binprm *bprm) ...
> which is looking at no_new_privs as well as other details, and making
> decisions about the bprm state from the current state.
> 
> I think it also means that the potentially multiple invocations
> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> a lock (another place where current's no_new_privs is evaluated).

So no_new_privs can change from 0->1, but should not
when execve is running.

As long as the calling thread is in execve it won't do this,
and the only other place where it may be set for other threads
is in seccomp_sync_threads, but that can easily be avoided, see below.

> 
> Related, it also means that cred_guard_mutex is unheld for every
> invocation of search_binary_handler() (which can loop via the previously
> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> currently.)
> 
> For seccomp, the expectations about existing thread states risk races
> too. There are two locks held for TSYNC:
> - current->sighand->siglock is held to keep new threads from
>   appearing/disappearing, which would destroy filter refcounting and
>   lead to memory corruption.

I don't understand what you mean here.
How can this lead to memory corruption?

> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
>   avoid no_new_privs and filter confusion during exec, which could
>   lead to exploitable setuid conditions (see below).
> 
> Just racing a malicious thread during TSYNC is not a very strong
> example (a malicious thread could do lots of fun things to "current"
> before it ever got near calling TSYNC), but I think there is the risk
> of mismatched/confused states that we don't want to allow. One is a
> particularly bad state that could lead to privilege escalations (in the
> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> has a filter attached that silently fails a priv-dropping setuid call
> and continues execution with elevated privs, it can be tricked into
> doing bad things on behalf of the unprivileged parent, which was the
> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> 
> thread A clones thread B
> thread B starts setuid exec
> thread A sets no_new_privs
> thread A calls seccomp with TSYNC
> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> thread B passes check_unsafe_exec() with no_new_privs unset
> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> thread B finishes exec, now running with elevated privs, a filter chosen
>          by thread A, _and_ nnp set (which doesn't matter)
> 
> With the original locking, thread B will fail check_unsafe_exec()
> because filter and nnp state are changed together, with "atomicity"
> protected by the cred_guard_mutex.
> 

Ah, good point, thanks!

This can be fixed by checking current->signal->cred_locked_for_ptrace
while the cred_guard_mutex is locked, like this for instance:

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
        BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
        assert_spin_locked(&current->sighand->siglock);
 
+       if (current->signal->cred_locked_for_ptrace)
+               return -EAGAIN;
+
        /* Validate all threads being eligible for synchronization. */
        caller = current;
        for_each_thread(caller, thread) {


> And this is just the bad state I _can_ see. I'm worried there are more...
> 
> All this said, I do see a small similarity here to the work I did to
> stabilize stack rlimits (there was an ongoing problem with making multiple
> decisions for the bprm based on current's state -- but current's state
> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> current's copy until exec ended and then stored bprm's copy into current.
> If the only problem anyone can see here is the handling of no_new_privs,
> we might be able to solve that similarly, at least disentangling tsync/nnp
> from cred_guard_mutex.
> 

I still think that is solvable by using cred_locked_for_ptrace and
simply making the tsync fail if it would otherwise be blocked.


Thanks
Bernd.

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  4:54                           ` Bernd Edlinger
@ 2020-03-03  5:29                             ` Kees Cook
  2020-03-03  8:08                               ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-03  5:29 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Jann Horn, Christian Brauner, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> On 3/3/20 3:26 AM, Kees Cook wrote:
> > On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > > [...]
> >
> > If I'm reading this patch correctly, this changes the lifetime of the
> > cred_guard_mutex lock to be:
> > 	- during prepare_bprm_creds()
> > 	- from flush_old_exec() through install_exec_creds()
> > Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > install_exec_creds().

BTW, I think the effect of this change (i.e. my paragraph above) should
be distinctly called out in the commit log if this solution moves
forward.

> > That means, for example, that check_unsafe_exec()'s documented invariant
> > is violated:
> >     /*
> >      * determine how safe it is to execute the proposed program
> >      * - the caller must hold ->cred_guard_mutex to protect against
> >      *   PTRACE_ATTACH or seccomp thread-sync
> >      */
> 
> Oh, right, I haven't understood that hint...

I know no_new_privs is checked there, but I haven't studied the
PTRACE_ATTACH part of that comment. If that is handled with the new
check, this comment should be updated.

> > I think it also means that the potentially multiple invocations
> > of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > a lock (another place where current's no_new_privs is evaluated).
> 
> So no_new_privs can change from 0->1, but should not
> when execve is running.
> 
> As long as the calling thread is in execve it won't do this,
> and the only other place where it may be set for other threads
> is in seccomp_sync_threads, but that can easily be avoided, see below.

Yeah, everything was fine until I had to go complicate things with
TSYNC. ;) The real goal is making sure an exec cannot gain privs while
later gaining a seccomp filter from an unpriv process. The no_new_privs
flag was used to control this, but it required that the filter not get
applied during exec.

> > Related, it also means that cred_guard_mutex is unheld for every
> > invocation of search_binary_handler() (which can loop via the previously
> > mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> > currently.)
> > 
> > For seccomp, the expectations about existing thread states risk races
> > too. There are two locks held for TSYNC:
> > - current->sighand->siglock is held to keep new threads from
> >   appearing/disappearing, which would destroy filter refcounting and
> >   lead to memory corruption.
> 
> I don't understand what you mean here.
> How can this lead to memory corruption?

Mainly this is a matter of how seccomp manages its filter hierarchy
(since the filters are shared through process ancestry), so if a thread
appears in the middle of TSYNC it may be racing another TSYNC and break
ancestry, leading to bad reference counting on process death, etc.
(Though, yes, with refcount_t now, things should never corrupt, just
waste memory.)

> > - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >   avoid no_new_privs and filter confusion during exec, which could
> >   lead to exploitable setuid conditions (see below).
> > 
> > Just racing a malicious thread during TSYNC is not a very strong
> > example (a malicious thread could do lots of fun things to "current"
> > before it ever got near calling TSYNC), but I think there is the risk
> > of mismatched/confused states that we don't want to allow. One is a
> > particularly bad state that could lead to privilege escalations (in the
> > form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > has a filter attached that silently fails a priv-dropping setuid call
> > and continues execution with elevated privs, it can be tricked into
> > doing bad things on behalf of the unprivileged parent, which was the
> > primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> > 
> > thread A clones thread B
> > thread B starts setuid exec
> > thread A sets no_new_privs
> > thread A calls seccomp with TSYNC
> > thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > thread B passes check_unsafe_exec() with no_new_privs unset
> > thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > thread B finishes exec, now running with elevated privs, a filter chosen
> >          by thread A, _and_ nnp set (which doesn't matter)
> > 
> > With the original locking, thread B will fail check_unsafe_exec()
> > because filter and nnp state are changed together, with "atomicity"
> > protected by the cred_guard_mutex.
> > 
> 
> Ah, good point, thanks!
> 
> This can be fixed by checking current->signal->cred_locked_for_ptrace
> while the cred_guard_mutex is locked, like this for instance:
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b6ea3dc..377abf0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
>         assert_spin_locked(&current->sighand->siglock);
>  
> +       if (current->signal->cred_locked_for_ptrace)
> +               return -EAGAIN;
> +

Hmm. I guess something like that could work. TSYNC expects to be able to
report _which_ thread wrecked the call, though... I wonder if in_execve
could be used to figure out the offending thread. Hm, nope, that would
be outside of lock too (and all users are "current" right now, so the
lock wasn't needed before).

>         /* Validate all threads being eligible for synchronization. */
>         caller = current;
>         for_each_thread(caller, thread) {
> 
> 
> > And this is just the bad state I _can_ see. I'm worried there are more...
> > 
> > All this said, I do see a small similarity here to the work I did to
> > stabilize stack rlimits (there was an ongoing problem with making multiple
> > decisions for the bprm based on current's state -- but current's state
> > was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > current's copy until exec ended and then stored bprm's copy into current.
> > If the only problem anyone can see here is the handling of no_new_privs,
> > we might be able to solve that similarly, at least disentangling tsync/nnp
> > from cred_guard_mutex.
> > 
> 
> I still think that is solvable by using cred_locked_for_ptrace and
> simply making the tsync fail if it would otherwise be blocked.

I wonder if we can find a better name than "cred_locked_for_ptrace"?
Maybe "cred_unfinished" or "cred_locked_in_exec" or something?

And the comment on bool cred_locked_for_ptrace should mention that
access is only allowed under cred_guard_mutex lock.

> > > +	sig->cred_locked_for_ptrace = false;

This is redundant to the zalloc -- I think you can drop it (unless
someone wants to keep it for clarity?)

Also, I think cred_locked_for_ptrace needs checking deeper, in
__ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
be sufficient to see a stable version of the thread...

(I remain very nervous about weakening cred_guard_mutex without
addressing the many many users...)

-- 
Kees Cook

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  5:29                             ` Kees Cook
@ 2020-03-03  8:08                               ` Bernd Edlinger
  2020-03-03  8:34                                 ` Christian Brauner
  2020-03-04 15:30                                 ` Christian Brauner
  0 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-03  8:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Jann Horn, Christian Brauner, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 3/3/20 6:29 AM, Kees Cook wrote:
> On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
>> On 3/3/20 3:26 AM, Kees Cook wrote:
>>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>>>> [...]
>>>
>>> If I'm reading this patch correctly, this changes the lifetime of the
>>> cred_guard_mutex lock to be:
>>> 	- during prepare_bprm_creds()
>>> 	- from flush_old_exec() through install_exec_creds()
>>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
>>> install_exec_creds().
> 
> BTW, I think the effect of this change (i.e. my paragraph above) should
> be distinctly called out in the commit log if this solution moves
> forward.
> 

Okay, will do.

>>> That means, for example, that check_unsafe_exec()'s documented invariant
>>> is violated:
>>>     /*
>>>      * determine how safe it is to execute the proposed program
>>>      * - the caller must hold ->cred_guard_mutex to protect against
>>>      *   PTRACE_ATTACH or seccomp thread-sync
>>>      */
>>
>> Oh, right, I haven't understood that hint...
> 
> I know no_new_privs is checked there, but I haven't studied the
> PTRACE_ATTACH part of that comment. If that is handled with the new
> check, this comment should be updated.
> 

Okay, I will change that comment to:

/*
 * determine how safe it is to execute the proposed program
 * - the caller must have set ->cred_locked_in_execve to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */

>>> I think it also means that the potentially multiple invocations
>>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
>>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
>>> a lock (another place where current's no_new_privs is evaluated).
>>
>> So no_new_privs can change from 0->1, but should not
>> when execve is running.
>>
>> As long as the calling thread is in execve it won't do this,
>> and the only other place, where it may set for other threads
>> is in seccomp_sync_threads, but that can easily be avoided see below.
> 
> Yeah, everything was fine until I had to go complicate things with
> TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> later gaining a seccomp filter from an unpriv process. The no_new_privs
> flag was used to control this, but it required that the filter not get
> applied during exec.
> 
>>> Related, it also means that cred_guard_mutex is unheld for every
>>> invocation of search_binary_handler() (which can loop via the previously
>>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
>>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
>>> currently.)
>>>
>>> For seccomp, the expectations about existing thread states risks races
>>> too. There are two locks held for TSYNC:
>>> - current->sighand->siglock is held to keep new threads from
>>>   appearing/disappearing, which would destroy filter refcounting and
>>>   lead to memory corruption.
>>
>> I don't understand what you mean here.
>> How can this lead to memory corruption?
> 
> Mainly this is a matter of how seccomp manages its filter hierarchy
> (since the filters are shared through process ancestry), so if a thread
> appears in the middle of TSYNC it may be racing another TSYNC and break
> ancestry, leading to bad reference counting on process death, etc.
> (Though, yes, with refcount_t now, things should never corrupt, just
> waste memory.)
> 

I assume for now that holding current->sighand->siglock while iterating
over all threads is sufficient here.

>>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
>>>   avoid no_new_privs and filter confusion during exec, which could
>>>   lead to exploitable setuid conditions (see below).
>>>
>>> Just racing a malicious thread during TSYNC is not a very strong
>>> example (a malicious thread could do lots of fun things to "current"
>>> before it ever got near calling TSYNC), but I think there is the risk
>>> of mismatched/confused states that we don't want to allow. One is a
>>> particularly bad state that could lead to privilege escalations (in the
>>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
>>> has a filter attached that silently fails a priv-dropping setuid call
>>> and continues execution with elevated privs, it can be tricked into
>>> doing bad things on behalf of the unprivileged parent, which was the
>>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
>>>
>>> thread A clones thread B
>>> thread B starts setuid exec
>>> thread A sets no_new_privs
>>> thread A calls seccomp with TSYNC
>>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
>>> thread B passes check_unsafe_exec() with no_new_privs unset
>>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
>>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
>>> thread B finishes exec, now running with elevated privs, a filter chosen
>>>          by thread A, _and_ nnp set (which doesn't matter)
>>>
>>> With the original locking, thread B will fail check_unsafe_exec()
>>> because filter and nnp state are changed together, with "atomicity"
>>> protected by the cred_guard_mutex.
>>>
>>
>> Ah, good point, thanks!
>>
>> This can be fixed by checking current->signal->cred_locked_for_ptrace
>> while the cred_guard_mutex is locked, like this for instance:
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index b6ea3dc..377abf0 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
>>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
>>         assert_spin_locked(&current->sighand->siglock);
>>  
>> +       if (current->signal->cred_locked_for_ptrace)
>> +               return -EAGAIN;
>> +
> 
> Hmm. I guess something like that could work. TSYNC expects to be able to
> report _which_ thread wrecked the call, though... I wonder if in_execve
> could be used to figure out the offending thread. Hm, nope, that would
> be outside of lock too (and all users are "current" right now, so the
> lock wasn't needed before).
> 

I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
but the caller will die quickly and cannot do anything with that information
when another thread executes execve, right?

>>         /* Validate all threads being eligible for synchronization. */
>>         caller = current;
>>         for_each_thread(caller, thread) {
>>
>>
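
To make the interleaving above and the effect of the proposed check
concrete, here is a minimal userspace model. It is not kernel code: all
model_* names and the honor_flag switch are made up for illustration, and
the mutex is reduced to a boolean. Only cred_guard_mutex,
cred_locked_for_ptrace, no_new_privs and -EAGAIN come from the discussion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only: none of these types exist in the kernel. */
struct model_thread {
	bool no_new_privs;	/* models the task's no_new_privs bit */
	bool has_filter;	/* models an attached seccomp filter */
	bool privileged;	/* models creds gained by a setuid exec */
};

struct model_signal {
	bool cred_guard_mutex_held;	/* stands in for the real mutex */
	bool cred_locked_for_ptrace;	/* flag proposed in the patch */
};

struct model_exec {
	bool saw_no_new_privs;	/* what the exec observed when it started */
};

/* Models TSYNC; honor_flag selects whether the proposed check is present. */
static int model_tsync(struct model_signal *sig, struct model_thread *self,
		       struct model_thread *other, bool honor_flag)
{
	/* The real code uses BUG_ON(!mutex_is_locked(...)) here. */
	assert(sig->cred_guard_mutex_held);

	if (honor_flag && sig->cred_locked_for_ptrace)
		return -EAGAIN;

	self->has_filter = other->has_filter = true;
	self->no_new_privs = other->no_new_privs = true;
	return 0;
}

/* Start of a setuid exec: set the flag and sample no_new_privs. */
static void model_exec_begin(struct model_signal *sig,
			     const struct model_thread *t,
			     struct model_exec *e)
{
	sig->cred_locked_for_ptrace = true;
	e->saw_no_new_privs = t->no_new_privs;
}

/* End of the exec: gain privs based on the earlier sample, clear the flag. */
static void model_exec_finish(struct model_signal *sig,
			      struct model_thread *t,
			      const struct model_exec *e)
{
	if (!e->saw_no_new_privs)
		t->privileged = true;
	sig->cred_locked_for_ptrace = false;
}

/* Replays the race; returns true if B ends up privileged and filtered. */
static bool model_replay(bool honor_flag, int *tsync_ret)
{
	struct model_signal sig = { false, false };
	struct model_thread a = { false, false, false };
	struct model_thread b = { false, false, false };
	struct model_exec e;

	model_exec_begin(&sig, &b, &e);		/* thread B starts setuid exec */
	a.no_new_privs = true;			/* thread A sets no_new_privs */

	sig.cred_guard_mutex_held = true;	/* thread A locks for TSYNC */
	*tsync_ret = model_tsync(&sig, &a, &b, honor_flag);
	sig.cred_guard_mutex_held = false;

	model_exec_finish(&sig, &b, &e);	/* thread B finishes exec */
	return b.privileged && b.has_filter;
}
```

In this model, model_replay(false, &ret) ends in the bad state (B is
privileged and carries A's filter), while model_replay(true, &ret) makes
TSYNC fail with -EAGAIN, so B never receives the filter.
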
>>> And this is just the bad state I _can_ see. I'm worried there are more...
>>>
>>> All this said, I do see a small similarity here to the work I did to
>>> stabilize stack rlimits (there was an ongoing problem with making multiple
>>> decisions for the bprm based on current's state -- but current's state
>>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
>>> current's copy until exec ended and then stored bprm's copy into current.
>>> If the only problem anyone can see here is the handling of no_new_privs,
>>> we might be able to solve that similarly, at least disentangling tsync/nnp
>>> from cred_guard_mutex.
>>>
>>
>> I still think that is solvable with using cred_locked_for_ptrace and
>> simply make the tsync fail if it would otherwise be blocked.
> 
> I wonder if we can find a better name than "cred_locked_for_ptrace"?
> Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> 

Yeah, I'd go with "cred_locked_in_execve".
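
For comparison, the rlim_stack approach mentioned above (save the value
into the bprm at the start of exec, ignore current's copy while exec runs,
store the bprm copy back at the end) can be sketched roughly as follows.
This is only an illustrative userspace model; the snap_* names and struct
layouts are assumptions and do not match the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's structures. */
struct snap_rlimit {
	unsigned long cur;
	unsigned long max;
};

struct snap_task {
	struct snap_rlimit rlim_stack;	/* live value, may change during exec */
};

struct snap_bprm {
	struct snap_rlimit rlim_stack;	/* stable copy used during exec */
};

/* Take the snapshot once, when exec starts. */
static void snap_exec_begin(struct snap_bprm *bprm, const struct snap_task *cur)
{
	assert(bprm != NULL && cur != NULL);
	bprm->rlim_stack = cur->rlim_stack;
}

/* Every decision during exec reads the snapshot, never the live task. */
static unsigned long snap_stack_limit(const struct snap_bprm *bprm)
{
	return bprm->rlim_stack.cur;
}

/* When exec ends, the snapshot becomes the task's value. */
static void snap_exec_commit(const struct snap_bprm *bprm, struct snap_task *cur)
{
	cur->rlim_stack = bprm->rlim_stack;
}
```

The point of the pattern is that repeated decisions made while exec runs
(for example across several search_binary_handler() rounds) all see the
same value, even if the live copy is modified in between.
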

> And the comment on bool cred_locked_for_ptrace should mention that
> access is only allowed under cred_guard_mutex lock.
> 

okay.

>>>> +	sig->cred_locked_for_ptrace = false;
> 
> This is redundant to the zalloc -- I think you can drop it (unless
> someone wants to keep it for clarity?)
> 

I'll remove that here and in init/init_task.c

> Also, I think cred_locked_for_ptrace needs checking deeper, in
> __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> be sufficient to see a stable version of the thread...
> 

No, these need to be addressed individually. Most users just want to know
whether the current credentials are sufficient at this moment and will
not change the credentials, as ptrace and TSYNC do.

BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
So adding a check of cred_locked_in_execve to ptrace_may_access() is
probably not an option.

However, one nice added value of this change is this:

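#include <pthread.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <unistd.h>

/* Helper thread: asks to be traced by the parent, then exits. */
void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int pid = fork();

	if (!pid) {
		/* Child: run the traced helper thread, then call execve. */
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	/* Parent: wait, then try to attach to the child. */
	sleep(1000);
	ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	return 0;
}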

cat /proc/3812/stack 
[<0>] flush_old_exec+0xbf/0x760
[<0>] load_elf_binary+0x35a/0x16c0
[<0>] search_binary_handler+0x97/0x1d0
[<0>] __do_execve_file.isra.40+0x624/0x920
[<0>] __x64_sys_execve+0x49/0x60
[<0>] do_syscall_64+0x64/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9


> (I remain very nervous about weakening cred_guard_mutex without
> addressing the many many users...)
> 

They need to be looked at closely, that's pretty clear.
Most fall into the class where only the current credentials need
to stay stable for a certain time.


Bernd.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  8:08                               ` Bernd Edlinger
@ 2020-03-03  8:34                                 ` Christian Brauner
  2020-03-03  8:43                                   ` Christian Brauner
  2020-03-04 15:30                                 ` Christian Brauner
  1 sibling, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-03  8:34 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Kees Cook, Eric W. Biederman, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> 	- during prepare_bprm_creds()
> >>> 	- from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> > 
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> > 
> 
> Okay, will do.
> 
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>>     /*
> >>>      * determine how safe it is to execute the proposed program
> >>>      * - the caller must hold ->cred_guard_mutex to protect against
> >>>      *   PTRACE_ATTACH or seccomp thread-sync
> >>>      */
> >>
> >> Oh, right, I haven't understood that hint...
> > 
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> > 
> 
> Okay, I change that comment to:
> 
> /*
>  * determine how safe it is to execute the proposed program
>  * - the caller must have set ->cred_locked_in_execve to protect against
>  *   PTRACE_ATTACH or seccomp thread-sync
>  */
> 
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not
> >> when execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place, where it may set for other threads
> >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > 
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> > 
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risks races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>>   appearing/disappearing, which would destroy filter refcounting and
> >>>   lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> > 
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> > 
> 
> I assume for now, that the current->sighand->siglock held while iterating all
> threads is sufficient here.
> 
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>>   avoid no_new_privs and filter confusion during exec, which could
> >>>   lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>>          by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >>         assert_spin_locked(&current->sighand->siglock);
> >>  
> >> +       if (current->signal->cred_locked_for_ptrace)
> >> +               return -EAGAIN;
> >> +
> > 
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> > 
> 
> I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> but the caller will die quickly and cannot do anything with that information
> when another thread executes execve, right?
> 
> >>         /* Validate all threads being eligible for synchronization. */
> >>         caller = current;
> >>         for_each_thread(caller, thread) {
> >>
> >>
> >>> And this is just the bad state I _can_ see. I'm worried there are more...
> >>>
> >>> All this said, I do see a small similarity here to the work I did to
> >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> >>> decisions for the bprm based on current's state -- but current's state
> >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> >>> current's copy until exec ended and then stored bprm's copy into current.
> >>> If the only problem anyone can see here is the handling of no_new_privs,
> >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> >>> from cred_guard_mutex.
> >>>
> >>
> >> I still think that is solvable with using cred_locked_for_ptrace and
> >> simply make the tsync fail if it would otherwise be blocked.
> > 
> > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > 
> 
> Yeah, I'd go with "cred_locked_in_execve".
> 
> > And the comment on bool cred_locked_for_ptrace should mention that
> > access is only allowed under cred_guard_mutex lock.
> > 
> 
> okay.
> 
> >>>> +	sig->cred_locked_for_ptrace = false;
> > 
> > This is redundant to the zalloc -- I think you can drop it (unless
> > someone wants to keep it for clarity?)
> > 
> 
> I'll remove that here and in init/init_task.c
> 
> > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > be sufficient to see a stable version of the thread...
> > 
> 
> No, these need to be addressed individually, but most users just want
> to know if the current credentials are sufficient at this moment, but will
> not change the credentials, as ptrace and TSYNC do. 
> 
> BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> So adding an access to cred_locked_for_execve in ptrace_may_access is
> probably not an option.
> 
> However, one nice added value by this change is this:
> 
> void *thread(void *arg)
> {
> 	ptrace(PTRACE_TRACEME, 0,0,0);
> 	return NULL;
> }
> 
> int main(void)
> {
> 	int pid = fork();
> 
> 	if (!pid) {
> 		pthread_t pt;
> 		pthread_create(&pt, NULL, thread, NULL);
> 		pthread_join(pt, NULL);
> 		execlp("echo", "echo", "passed", NULL);
> 	}
> 
> 	sleep(1000);
> 	ptrace(PTRACE_ATTACH, pid, 0,0);
> 	kill(pid, SIGCONT);
> 	return 0;
> }
> 
> cat /proc/3812/stack 
> [<0>] flush_old_exec+0xbf/0x760
> [<0>] load_elf_binary+0x35a/0x16c0
> [<0>] search_binary_handler+0x97/0x1d0
> [<0>] __do_execve_file.isra.40+0x624/0x920
> [<0>] __x64_sys_execve+0x49/0x60
> [<0>] do_syscall_64+0x64/0x220
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> > (I remain very nervous about weakening cred_guard_mutex without
> > addressing the many many users...)
> > 
> 
> They need to be looked at closely, that's pretty clear.
> Most fall in the class, that just the current credentials need
> to stay stable for a certain time.

I remain rather set on wanting some very basic tests with this change.
IMHO, looking through tools/testing/selftests again, we have nowhere near
enough tests for these codepaths, if any. Basically, if someone wants
to make a change affecting the current problem, we should really have at
least a single simple test/reproducer that can be run without digging
through lore. And hopefully over time we'll have more tests.

Christian

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  8:34                                 ` Christian Brauner
@ 2020-03-03  8:43                                   ` Christian Brauner
  0 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-03  8:43 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Kees Cook, Eric W. Biederman, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> > On 3/3/20 6:29 AM, Kees Cook wrote:
> > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> > >> On 3/3/20 3:26 AM, Kees Cook wrote:
> > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > >>>> [...]
> > >>>
> > >>> If I'm reading this patch correctly, this changes the lifetime of the
> > >>> cred_guard_mutex lock to be:
> > >>> 	- during prepare_bprm_creds()
> > >>> 	- from flush_old_exec() through install_exec_creds()
> > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > >>> install_exec_creds().
> > > 
> > > BTW, I think the effect of this change (i.e. my paragraph above) should
> > > be distinctly called out in the commit log if this solution moves
> > > forward.
> > > 
> > 
> > Okay, will do.
> > 
> > >>> That means, for example, that check_unsafe_exec()'s documented invariant
> > >>> is violated:
> > >>>     /*
> > >>>      * determine how safe it is to execute the proposed program
> > >>>      * - the caller must hold ->cred_guard_mutex to protect against
> > >>>      *   PTRACE_ATTACH or seccomp thread-sync
> > >>>      */
> > >>
> > >> Oh, right, I haven't understood that hint...
> > > 
> > > I know no_new_privs is checked there, but I haven't studied the
> > > PTRACE_ATTACH part of that comment. If that is handled with the new
> > > check, this comment should be updated.
> > > 
> > 
> > Okay, I change that comment to:
> > 
> > /*
> >  * determine how safe it is to execute the proposed program
> >  * - the caller must have set ->cred_locked_in_execve to protect against
> >  *   PTRACE_ATTACH or seccomp thread-sync
> >  */
> > 
> > >>> I think it also means that the potentially multiple invocations
> > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > >>> a lock (another place where current's no_new_privs is evaluated).
> > >>
> > >> So no_new_privs can change from 0->1, but should not
> > >> when execve is running.
> > >>
> > >> As long as the calling thread is in execve it won't do this,
> > >> and the only other place, where it may set for other threads
> > >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > > 
> > > Yeah, everything was fine until I had to go complicate things with
> > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > > flag was used to control this, but it required that the filter not get
> > > applied during exec.
> > > 
> > >>> Related, it also means that cred_guard_mutex is unheld for every
> > >>> invocation of search_binary_handler() (which can loop via the previously
> > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > >>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> > >>> currently.)
> > >>>
> > >>> For seccomp, the expectations about existing thread states risks races
> > >>> too. There are two locks held for TSYNC:
> > >>> - current->sighand->siglock is held to keep new threads from
> > >>>   appearing/disappearing, which would destroy filter refcounting and
> > >>>   lead to memory corruption.
> > >>
> > >> I don't understand what you mean here.
> > >> How can this lead to memory corruption?
> > > 
> > > Mainly this is a matter of how seccomp manages its filter hierarchy
> > > (since the filters are shared through process ancestry), so if a thread
> > > appears in the middle of TSYNC it may be racing another TSYNC and break
> > > ancestry, leading to bad reference counting on process death, etc.
> > > (Though, yes, with refcount_t now, things should never corrupt, just
> > > waste memory.)
> > > 
> > 
> > I assume for now, that the current->sighand->siglock held while iterating all
> > threads is sufficient here.
> > 
> > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> > >>>   avoid no_new_privs and filter confusion during exec, which could
> > >>>   lead to exploitable setuid conditions (see below).
> > >>>
> > >>> Just racing a malicious thread during TSYNC is not a very strong
> > >>> example (a malicious thread could do lots of fun things to "current"
> > >>> before it ever got near calling TSYNC), but I think there is the risk
> > >>> of mismatched/confused states that we don't want to allow. One is a
> > >>> particularly bad state that could lead to privilege escalations (in the
> > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > >>> has a filter attached that silently fails a priv-dropping setuid call
> > >>> and continues execution with elevated privs, it can be tricked into
> > >>> doing bad things on behalf of the unprivileged parent, which was the
> > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> > >>>
> > >>> thread A clones thread B
> > >>> thread B starts setuid exec
> > >>> thread A sets no_new_privs
> > >>> thread A calls seccomp with TSYNC
> > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > >>> thread B passes check_unsafe_exec() with no_new_privs unset
> > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > >>> thread B finishes exec, now running with elevated privs, a filter chosen
> > >>>          by thread A, _and_ nnp set (which doesn't matter)
> > >>>
> > >>> With the original locking, thread B will fail check_unsafe_exec()
> > >>> because filter and nnp state are changed together, with "atomicity"
> > >>> protected by the cred_guard_mutex.
> > >>>
> > >>
> > >> Ah, good point, thanks!
> > >>
> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> > >> while the cred_guard_mutex is locked, like this for instance:
> > >>
> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > >> index b6ea3dc..377abf0 100644
> > >> --- a/kernel/seccomp.c
> > >> +++ b/kernel/seccomp.c
> > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> > >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> > >>         assert_spin_locked(&current->sighand->siglock);
> > >>  
> > >> +       if (current->signal->cred_locked_for_ptrace)
> > >> +               return -EAGAIN;
> > >> +
> > > 
> > > Hmm. I guess something like that could work. TSYNC expects to be able to
> > > report _which_ thread wrecked the call, though... I wonder if in_execve
> > > could be used to figure out the offending thread. Hm, nope, that would
> > > be outside of lock too (and all users are "current" right now, so the
> > > lock wasn't needed before).
> > > 
> > 
> > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> > but the caller will die quickly and cannot do anything with that information
> > when another thread executes execve, right?
> > 
> > >>         /* Validate all threads being eligible for synchronization. */
> > >>         caller = current;
> > >>         for_each_thread(caller, thread) {
> > >>
> > >>
> > >>> And this is just the bad state I _can_ see. I'm worried there are more...
> > >>>
> > >>> All this said, I do see a small similarity here to the work I did to
> > >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> > >>> decisions for the bprm based on current's state -- but current's state
> > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > >>> current's copy until exec ended and then stored bprm's copy into current.
> > >>> If the only problem anyone can see here is the handling of no_new_privs,
> > >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> > >>> from cred_guard_mutex.
> > >>>
> > >>
> > >> I still think that is solvable with using cred_locked_for_ptrace and
> > >> simply make the tsync fail if it would otherwise be blocked.
> > > 
> > > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > > 
> > 
> > Yeah, I'd go with "cred_locked_in_execve".
> > 
> > > And the comment on bool cred_locked_for_ptrace should mention that
> > > access is only allowed under cred_guard_mutex lock.
> > > 
> > 
> > okay.
> > 
> > >>>> +	sig->cred_locked_for_ptrace = false;
> > > 
> > > This is redundant to the zalloc -- I think you can drop it (unless
> > > someone wants to keep it for clarity?)
> > > 
> > 
> > I'll remove that here and in init/init_task.c
> > 
> > > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > > be sufficient to see a stable version of the thread...
> > > 
> > 
> > No, these need to be addressed individually, but most users just want
> > to know if the current credentials are sufficient at this moment, but will
> > not change the credentials, as ptrace and TSYNC do. 
> > 
> > BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> > So adding an access to cred_locked_for_execve in ptrace_may_access is
> > probably not an option.
> > 
> > However, one nice added value by this change is this:
> > 
> > void *thread(void *arg)
> > {
> > 	ptrace(PTRACE_TRACEME, 0,0,0);
> > 	return NULL;
> > }
> > 
> > int main(void)
> > {
> > 	int pid = fork();
> > 
> > 	if (!pid) {
> > 		pthread_t pt;
> > 		pthread_create(&pt, NULL, thread, NULL);
> > 		pthread_join(pt, NULL);
> > 		execlp("echo", "echo", "passed", NULL);
> > 	}
> > 
> > 	sleep(1000);
> > 	ptrace(PTRACE_ATTACH, pid, 0,0);
> > 	kill(pid, SIGCONT);
> > 	return 0;
> > }
> > 
> > cat /proc/3812/stack 
> > [<0>] flush_old_exec+0xbf/0x760
> > [<0>] load_elf_binary+0x35a/0x16c0
> > [<0>] search_binary_handler+0x97/0x1d0
> > [<0>] __do_execve_file.isra.40+0x624/0x920
> > [<0>] __x64_sys_execve+0x49/0x60
> > [<0>] do_syscall_64+0x64/0x220
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > 
> > > (I remain very nervous about weakening cred_guard_mutex without
> > > addressing the many many users...)
> > > 
> > 
> > They need to be looked at closely, that's pretty clear.
> > Most fall in the class, that just the current credentials need
> > to stay stable for a certain time.
> 
> I remain rather set on wanting some very basic tests with this change.
> Imho, looking through tools/testing/selftests again we don't have nearly
> enough for these codepaths; not to say none. Basically, if someone wants
> to make a change affecting the current problem we should really have at
> least a single simple test/reproducer that can be run without digging
> through lore. And hopefully over time we'll have more tests.

Which you added in v4. Which is great! (I should've mentioned this in my
first mail.)
Christian


* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  2:26                         ` Kees Cook
  2020-03-03  4:54                           ` Bernd Edlinger
@ 2020-03-03  8:58                           ` Christian Brauner
  2020-03-03 10:34                             ` Bernd Edlinger
  2020-03-03 13:02                             ` [PATCHv5] " Bernd Edlinger
  1 sibling, 2 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-03  8:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: Bernd Edlinger, Eric W. Biederman, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > This fixes a deadlock in the tracer when tracing a multi-threaded
> > application that calls execve while more than one thread are running.
> > 
> > I observed that when running strace on the gcc test suite, it always
> > blocks after a while, when expect calls execve, because other threads
> > have to be terminated.  They send ptrace events, but the strace is no
> > longer able to respond, since it is blocked in vm_access.
> > 
> > The deadlock is always happening when strace needs to access the
> > tracees process mmap, while another thread in the tracee starts to
> > execve a child process, but that cannot continue until the
> > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> > 
> > strace          D    0 30614  30584 0x00000000
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > schedule_preempt_disabled+0x15/0x20
> > __mutex_lock.isra.13+0x1ec/0x520
> > __mutex_lock_killable_slowpath+0x13/0x20
> > mutex_lock_killable+0x28/0x30
> > mm_access+0x27/0xa0
> > process_vm_rw_core.isra.3+0xff/0x550
> > process_vm_rw+0xdd/0xf0
> > __x64_sys_process_vm_readv+0x31/0x40
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > expect          D    0 31933  30876 0x80004003
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > flush_old_exec+0xc4/0x770
> > load_elf_binary+0x35a/0x16c0
> > search_binary_handler+0x97/0x1d0
> > __do_execve_file.isra.40+0x5d4/0x8a0
> > __x64_sys_execve+0x49/0x60
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > The proposed solution is to take the cred_guard_mutex only
> > in a critical section at the beginning, and at the end of the
> > execve function, and let PTRACE_ATTACH fail with EAGAIN while
> > execve is not complete, but other functions like vm_access are
> > allowed to complete normally.
> 
> Sorry to be bummer, but I don't think this will work. A few more things
> during the exec process depend on cred_guard_mutex being held.
> 
> If I'm reading this patch correctly, this changes the lifetime of the
> cred_guard_mutex lock to be:
> 	- during prepare_bprm_creds()
> 	- from flush_old_exec() through install_exec_creds()
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
> 
> That means, for example, that check_unsafe_exec()'s documented invariant
> is violated:
>     /*
>      * determine how safe it is to execute the proposed program
>      * - the caller must hold ->cred_guard_mutex to protect against
>      *   PTRACE_ATTACH or seccomp thread-sync
>      */
>     static void check_unsafe_exec(struct linux_binprm *bprm) ...
> which is looking at no_new_privs as well as other details, and making
> decisions about the bprm state from the current state.
> 
> I think it also means that the potentially multiple invocations
> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> a lock (another place where current's no_new_privs is evaluated).
> 
> Related, it also means that cred_guard_mutex is unheld for every
> invocation of search_binary_handler() (which can loop via the previously
> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> currently.)

So one issue I see with having to reacquire the cred_guard_mutex might
be that this would allow tasks holding the cred_guard_mutex to block a
killed exec'ing task from exiting, right?


* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  8:58                           ` Christian Brauner
@ 2020-03-03 10:34                             ` Bernd Edlinger
  2020-03-03 11:23                               ` Bernd Edlinger
  2020-03-03 13:02                             ` [PATCHv5] " Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-03 10:34 UTC (permalink / raw)
  To: Christian Brauner, Kees Cook
  Cc: Eric W. Biederman, Jann Horn, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 3/3/20 9:58 AM, Christian Brauner wrote:
> On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>>> This fixes a deadlock in the tracer when tracing a multi-threaded
>>> application that calls execve while more than one thread are running.
>>>
>>> I observed that when running strace on the gcc test suite, it always
>>> blocks after a while, when expect calls execve, because other threads
>>> have to be terminated.  They send ptrace events, but the strace is no
>>> longer able to respond, since it is blocked in vm_access.
>>>
>>> The deadlock is always happening when strace needs to access the
>>> tracees process mmap, while another thread in the tracee starts to
>>> execve a child process, but that cannot continue until the
>>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>>
>>> strace          D    0 30614  30584 0x00000000
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> schedule_preempt_disabled+0x15/0x20
>>> __mutex_lock.isra.13+0x1ec/0x520
>>> __mutex_lock_killable_slowpath+0x13/0x20
>>> mutex_lock_killable+0x28/0x30
>>> mm_access+0x27/0xa0
>>> process_vm_rw_core.isra.3+0xff/0x550
>>> process_vm_rw+0xdd/0xf0
>>> __x64_sys_process_vm_readv+0x31/0x40
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> expect          D    0 31933  30876 0x80004003
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> flush_old_exec+0xc4/0x770
>>> load_elf_binary+0x35a/0x16c0
>>> search_binary_handler+0x97/0x1d0
>>> __do_execve_file.isra.40+0x5d4/0x8a0
>>> __x64_sys_execve+0x49/0x60
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> The proposed solution is to take the cred_guard_mutex only
>>> in a critical section at the beginning, and at the end of the
>>> execve function, and let PTRACE_ATTACH fail with EAGAIN while
>>> execve is not complete, but other functions like vm_access are
>>> allowed to complete normally.
>>
>> Sorry to be bummer, but I don't think this will work. A few more things
>> during the exec process depend on cred_guard_mutex being held.
>>
>> If I'm reading this patch correctly, this changes the lifetime of the
>> cred_guard_mutex lock to be:
>> 	- during prepare_bprm_creds()
>> 	- from flush_old_exec() through install_exec_creds()
>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
>> install_exec_creds().
>>
>> That means, for example, that check_unsafe_exec()'s documented invariant
>> is violated:
>>     /*
>>      * determine how safe it is to execute the proposed program
>>      * - the caller must hold ->cred_guard_mutex to protect against
>>      *   PTRACE_ATTACH or seccomp thread-sync
>>      */
>>     static void check_unsafe_exec(struct linux_binprm *bprm) ...
>> which is looking at no_new_privs as well as other details, and making
>> decisions about the bprm state from the current state.
>>
>> I think it also means that the potentially multiple invocations
>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
>> a lock (another place where current's no_new_privs is evaluated).
>>
>> Related, it also means that cred_guard_mutex is unheld for every
>> invocation of search_binary_handler() (which can loop via the previously
>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
>> currently.)
> 
> So one issue I see with having to reacquire the cred_guard_mutex might
> be that this would allow tasks holding the cred_guard_mutex to block a
> killed exec'ing task from exiting, right?
> 

Yes, maybe, but I think it will be no worse than it is now.
The second time the mutex is acquired, it is taken with
mutex_lock_killable, so at least kill -9 should terminate the task.


Bernd.


* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03 10:34                             ` Bernd Edlinger
@ 2020-03-03 11:23                               ` Bernd Edlinger
  2020-03-03 14:20                                 ` Christian Brauner
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-03 11:23 UTC (permalink / raw)
  To: Christian Brauner, Kees Cook
  Cc: Eric W. Biederman, Jann Horn, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On 3/3/20 11:34 AM, Bernd Edlinger wrote:
> On 3/3/20 9:58 AM, Christian Brauner wrote:
>> So one issue I see with having to reacquire the cred_guard_mutex might
>> be that this would allow tasks holding the cred_guard_mutex to block a
>> killed exec'ing task from exiting, right?
>>
> 
> Yes maybe, but I think it will not be worse than it is now.
> Since the second time the mutex is acquired it is done with
> mutex_lock_killable, so at least kill -9 should get it terminated.
> 



>  static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> +		if (!bprm->called_flush_old_exec)
> +			mutex_lock(&current->signal->cred_guard_mutex);
> +		current->signal->cred_locked_for_ptrace = false;
>  		mutex_unlock(&current->signal->cred_guard_mutex);


Hmm, cough...
Actually, when mutex_lock_killable in flush_old_exec fails due to kill -9,
free_bprm locks the same mutex again, this time unkillably. I had better use
mutex_lock_killable here as well, and if that fails, I can leave
cred_locked_for_ptrace set. That shouldn't matter, since a fatal signal is
pending anyway, right?

Bernd.


* [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03  8:58                           ` Christian Brauner
  2020-03-03 10:34                             ` Bernd Edlinger
@ 2020-03-03 13:02                             ` Bernd Edlinger
  2020-03-03 15:18                               ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-03 13:02 UTC (permalink / raw)
  To: Christian Brauner, Kees Cook
  Cc: Eric W. Biederman, Jann Horn, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to take the cred_guard_mutex only
in a critical section at the beginning and at the end of the
execve function, and to let PTRACE_ATTACH fail with EAGAIN while
execve is not complete, while other functions like mm_access are
allowed to complete normally.

This changes the lifetime of the cred_guard_mutex lock to be:
	- during prepare_bprm_creds()
	- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().
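
The scheme described above can be sketched as a rough userspace model
(this is not kernel code): a pthread mutex stands in for
cred_guard_mutex and a bool stands in for cred_locked_in_execve.  The
names model_signal, model_exec_begin, model_exec_end and model_attach
are hypothetical and exist only for this illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct signal_struct. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_in_execve;	/* only valid with the mutex held */
};

#define MODEL_SIGNAL_INIT { PTHREAD_MUTEX_INITIALIZER, false }

/* Start of execve: set the flag in a short critical section, then drop the mutex. */
static int model_exec_begin(struct model_signal *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_in_execve)
		ret = -EAGAIN;		/* another execve is already in flight */
	else
		sig->cred_locked_in_execve = true;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* End of execve (success or failure): clear the flag under the mutex. */
static void model_exec_end(struct model_signal *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	sig->cred_locked_in_execve = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* PTRACE_ATTACH: fail fast while an execve is in flight instead of sleeping. */
static int model_attach(struct model_signal *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_in_execve)
		ret = -EAGAIN;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}
```

The model simplifies one point: in the patch the mutex is taken a second
time in flush_old_exec() and held through install_exec_creds(), whereas
the model only takes it briefly to clear the flag.  What it does show is
that the mutex is never held across the long middle part of execve, so a
caller that only needs the mutex for a moment is not blocked there.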

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 19 +++++----
 fs/exec.c                                 | 41 ++++++++++++++++---
 include/linux/binfmts.h                   |  6 ++-
 include/linux/sched/signal.h              |  2 +
 kernel/cred.c                             |  2 +-
 kernel/ptrace.c                           |  4 ++
 kernel/seccomp.c                          |  3 ++
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
 10 files changed, 130 insertions(+), 19 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case. 
v5: addresses review comments.

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..0988798 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_in_execve.  The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_in_execve is reset again.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..5fc744e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,29 +1404,51 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_in_execve.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
+	int ret;
+
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_in_execve))
+		goto out;
+
+	ret = -ENOMEM;
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	if (likely(bprm->cred)) {
+		current->signal->cred_locked_in_execve = true;
+		ret = 0;
+	}
 
+out:
 	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	return ret;
 }
 
 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		/*
+		 * If flush_old_exec did not acquire the cred_guard_mutex,
+		 * try again here, but if that fails, just leave
+		 * cred_locked_in_execve alone, since this means there
+		 * must be a fatal signal pending.
+		 * We don't want to prevent this task to be killed, just
+		 * because it is stuck in the middle of execve.
+		 */
+		if (bprm->called_flush_old_exec ||
+		    !mutex_lock_killable(&current->signal->cred_guard_mutex)) {
+			current->signal->cred_locked_in_execve = false;
+			mutex_unlock(&current->signal->cred_guard_mutex);
+		}
 		abort_creds(bprm->cred);
 	}
 	if (bprm->file) {
@@ -1469,13 +1497,14 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	current->signal->cred_locked_in_execve = false;
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
 
 /*
  * determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must have set ->cred_locked_in_execve to protect against
  *   PTRACE_ATTACH or seccomp thread-sync
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_guard_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..8f8e358 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,8 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	bool cred_locked_in_execve;	/* set while in execve, only valid when
+					 * cred_guard_mutex is held */
 } __randomize_layout;
 
 /*
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..0f82bab 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
+	retval = -EAGAIN;
+	if (task->signal->cred_locked_in_execve)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..3efa3e5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_in_execve)
+		return -EAGAIN;
+
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
-- 
1.9.1


* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03 11:23                               ` Bernd Edlinger
@ 2020-03-03 14:20                                 ` Christian Brauner
  0 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-03 14:20 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Kees Cook, Eric W. Biederman, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Tue, Mar 03, 2020 at 11:23:31AM +0000, Bernd Edlinger wrote:
> On 3/3/20 11:34 AM, Bernd Edlinger wrote:
> > On 3/3/20 9:58 AM, Christian Brauner wrote:
> >> So one issue I see with having to reacquire the cred_guard_mutex might
> >> be that this would allow tasks holding the cred_guard_mutex to block a
> >> killed exec'ing task from exiting, right?
> >>
> > 
> > Yes maybe, but I think it will not be worse than it is now.
> > Since the second time the mutex is acquired it is done with
> > mutex_lock_killable, so at least kill -9 should get it terminated.
> > 
> 
> 
> 
> >  static void free_bprm(struct linux_binprm *bprm)
> >  {
> >  	free_arg_pages(bprm);
> >  	if (bprm->cred) {
> > +		if (!bprm->called_flush_old_exec)
> > +			mutex_lock(&current->signal->cred_guard_mutex);
> > +		current->signal->cred_locked_for_ptrace = false;
> >  		mutex_unlock(&current->signal->cred_guard_mutex);
> 
> 
> Hmm, cough...
> actually when the mutex_lock_killable fails, due to kill -9, in flush_old_exec
> free_bprm locks the same mutex, this time unkillable, but I should better do
> mutex_lock_killable here, and if that fails, I can leave cred_locked_for_ptrace,
> it shouldn't matter, since this is a fatal signal anyway, right?

I think so, yes.


* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 13:02                             ` [PATCHv5] " Bernd Edlinger
@ 2020-03-03 15:18                               ` Eric W. Biederman
  2020-03-03 16:48                                 ` Bernd Edlinger
  2020-03-03 16:50                                 ` [PATCHv5] exec: Fix a deadlock in ptrace Christian Brauner
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-03 15:18 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

A couple of things.

Why do we think it is safe to change the behavior exposed to userspace?
Not the deadlock but all of the times the current code would not
deadlock?

Especially given that this is a small window it might be hard for people
to track down and report so we need a strong argument that this won't
break existing userspace before we just change things.

Usually surveying all of the users of a system call that we can find
and checking to see if they might be affected by the change in behavior
is difficult enough that we usually opt for not being lazy and
preserving the behavior.

This patch is up to two changes in behavior now, that could potentially
affect a whole array of programs.  Adding linux-api so that this change
in behavior can be documented if/when this change goes through.

If you can split the documentation and test fixes out into separate
patches that would help reviewing this code, or please make it explicit
that you are changing documentation about behavior that is changing
with this patch.

Eric

> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> new file mode 100644
> index 0000000..6d8a048
> --- /dev/null
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -0,0 +1,66 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
> + * All rights reserved.
> + *
> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
> + * when de_thread is blocked with ->cred_guard_mutex held.
> + */
> +
> +#include "../kselftest_harness.h"
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <unistd.h>
> +#include <sys/ptrace.h>
> +
> +static void *thread(void *arg)
> +{
> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
> +	return NULL;
> +}
> +
> +TEST(vmaccess)
> +{
> +	int f, pid = fork();
> +	char mm[64];
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread, NULL);
> +		pthread_join(pt, NULL);
> +		execlp("true", "true", NULL);
> +	}
> +
> +	sleep(1);
> +	sprintf(mm, "/proc/%d/mem", pid);
> +	f = open(mm, O_RDONLY);
> +	ASSERT_LE(0, f);
> +	close(f);
> +	f = kill(pid, SIGCONT);
> +	ASSERT_EQ(0, f);
> +}
> +
> +TEST(attach)
> +{
> +	int f, pid = fork();
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread, NULL);
> +		pthread_join(pt, NULL);
> +		execlp("true", "true", NULL);
> +	}
> +
> +	sleep(1);
> +	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);

To be meaningful this code needs to learn to loop when
ptrace returns -EAGAIN.

Because that is pretty much what any self respecting user space
process will do.

At which point I am not certain we can say that the behavior has
sufficiently improved not to be a deadlock.

> +	ASSERT_EQ(EAGAIN, errno);
> +	ASSERT_EQ(f, -1);
> +	f = kill(pid, SIGCONT);
> +	ASSERT_EQ(0, f);
> +}
> +
> +TEST_HARNESS_MAIN

Eric

* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 15:18                               ` Eric W. Biederman
@ 2020-03-03 16:48                                 ` Bernd Edlinger
  2020-03-03 17:01                                   ` Christian Brauner
  2020-03-03 20:08                                   ` Eric W. Biederman
  2020-03-03 16:50                                 ` [PATCHv5] exec: Fix a deadlock in ptrace Christian Brauner
  1 sibling, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-03 16:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/3/20 4:18 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> A couple of things.
> 
> Why do we think it is safe to change the behavior exposed to userspace?
> Not the deadlock but all of the times the current code would not
> deadlock?
> 
> Especially given that this is a small window it might be hard for people
> to track down and report so we need a strong argument that this won't
> break existing userspace before we just change things.
> 

Hmm, I tend to agree.

> Usually surveying all of the users of a system call that we can find
> and checking to see if they might be affected by the change in behavior
> is difficult enough that we usually opt for not being lazy and
> preserving the behavior.
> 
> This patch is up to two changes in behavior now, that could potentially
> affect a whole array of programs.  Adding linux-api so that this change
> in behavior can be documented if/when this change goes through.
> 

One is PTRACE_ATTACH possibly returning EAGAIN, yes.

We could try to restrict that behavior change to when any
thread is ptraced when execve starts, can't be too complicated.


But the other is only SYS_seccomp returning EAGAIN, when a different
thread of the current process is calling execve at the same time.

I would consider it completely impossible to have any user-visible effect,
since de_thread is just terminating all threads, including the thread
where the -EAGAIN was returned, so we will never know what happened.


> If you can split the documentation and test fixes out into separate
> patches that would help reviewing this code, or please make it explicit
> that you are changing documentation about behavior that is changing
> with this patch.
> 

I am not sure if I have touched the right user documentation.

I only saw a document referring to a non-existent "current->cred_replace_mutex".
I haven't dug into the git history, but that must be prehistoric IMHO.
It appears to me that it is some developer documentation, but it's nevertheless
worth keeping up to date when the code changes.

So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ?


Bernd.

> Eric
> 
>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>> new file mode 100644
>> index 0000000..6d8a048
>> --- /dev/null
>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>> @@ -0,0 +1,66 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
>> + * All rights reserved.
>> + *
>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>> + * when de_thread is blocked with ->cred_guard_mutex held.
>> + */
>> +
>> +#include "../kselftest_harness.h"
>> +#include <stdio.h>
>> +#include <fcntl.h>
>> +#include <pthread.h>
>> +#include <signal.h>
>> +#include <unistd.h>
>> +#include <sys/ptrace.h>
>> +
>> +static void *thread(void *arg)
>> +{
>> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>> +	return NULL;
>> +}
>> +
>> +TEST(vmaccess)
>> +{
>> +	int f, pid = fork();
>> +	char mm[64];
>> +
>> +	if (!pid) {
>> +		pthread_t pt;
>> +
>> +		pthread_create(&pt, NULL, thread, NULL);
>> +		pthread_join(pt, NULL);
>> +		execlp("true", "true", NULL);
>> +	}
>> +
>> +	sleep(1);
>> +	sprintf(mm, "/proc/%d/mem", pid);
>> +	f = open(mm, O_RDONLY);
>> +	ASSERT_LE(0, f);
>> +	close(f);
>> +	f = kill(pid, SIGCONT);
>> +	ASSERT_EQ(0, f);
>> +}
>> +
>> +TEST(attach)
>> +{
>> +	int f, pid = fork();
>> +
>> +	if (!pid) {
>> +		pthread_t pt;
>> +
>> +		pthread_create(&pt, NULL, thread, NULL);
>> +		pthread_join(pt, NULL);
>> +		execlp("true", "true", NULL);
>> +	}
>> +
>> +	sleep(1);
>> +	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> 
> To be meaningful this code needs to learn to loop when
> ptrace returns -EAGAIN.
> 
> Because that is pretty much what any self respecting user space
> process will do.
> 
> At which point I am not certain we can say that the behavior has
> sufficiently improved not to be a deadlock.
> 

In this special dead-duck test it won't work, but it would
still be lots more transparent what is going on, since previously
you had two zombie processes, and no way to even output debug
messages, which also all self respecting user space processes
should do.

So yes, I can at least give a good example and re-try it several
times together with wait4 which a tracer is expected to do.
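
A minimal sketch of what such a retry loop could look like, assuming
the behaviour proposed in this patch where PTRACE_ATTACH may fail with
EAGAIN. The helper name attach_with_retry, the retry limit and the
10 ms delay are illustrative assumptions and are not part of the patch:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Hypothetical helper (not from the patch): try PTRACE_ATTACH up to
 * max_tries times.  Between attempts, drain pending wait events
 * without blocking, so that exiting threads which are already traced
 * can make progress, then pause briefly.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int attach_with_retry(pid_t pid, int max_tries)
{
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int status;

	assert(max_tries > 0);

	for (int i = 0; i < max_tries; i++) {
		if (ptrace(PTRACE_ATTACH, pid, 0L, 0L) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;	/* a real error, do not retry */

		/*
		 * A real tracer would handle each event here; this
		 * sketch only reaps them.
		 */
		while (waitpid(-1, &status, WNOHANG | __WALL) > 0)
			;

		nanosleep(&delay, NULL);
	}

	errno = EAGAIN;		/* still busy after max_tries attempts */
	return -1;
}
```

Whether a bounded retry like this ever succeeds in the dead-duck test
depends on the tracee making progress, which is the open question in
this thread.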

Bernd.

>> +	ASSERT_EQ(EAGAIN, errno);
>> +	ASSERT_EQ(f, -1);
>> +	f = kill(pid, SIGCONT);
>> +	ASSERT_EQ(0, f);
>> +}
>> +
>> +TEST_HARNESS_MAIN
> 
> Eric
> 

* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 15:18                               ` Eric W. Biederman
  2020-03-03 16:48                                 ` Bernd Edlinger
@ 2020-03-03 16:50                                 ` Christian Brauner
  1 sibling, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-03 16:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 03, 2020 at 09:18:44AM -0600, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
> > This fixes a deadlock in the tracer when tracing a multi-threaded
> > application that calls execve while more than one thread are running.
> >
> > I observed that when running strace on the gcc test suite, it always
> > blocks after a while, when expect calls execve, because other threads
> > have to be terminated.  They send ptrace events, but the strace is no
> > longer able to respond, since it is blocked in vm_access.
> >
> > The deadlock is always happening when strace needs to access the
> > tracees process mmap, while another thread in the tracee starts to
> > execve a child process, but that cannot continue until the
> > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> A couple of things.
> 
> Why do we think it is safe to change the behavior exposed to userspace?
> Not the deadlock but all of the times the current code would not
> deadlock?
> 
> Especially given that this is a small window it might be hard for people
> to track down and report so we need a strong argument that this won't
> break existing userspace before we just change things.
> 
> Usually surveying all of the users of a system call that we can find
> and checking to see if they might be affected by the change in behavior
> is difficult enough that we usually opt for not being lazy and
> preserving the behavior.
> 
> This patch is up to two changes in behavior now, that could potentially
> affect a whole array of programs.  Adding linux-api so that this change
> in behavior can be documented if/when this change goes through.
> 
> If you can split the documentation and test fixes out into separate
> patches that would help reviewing this code, or please make it explicit
> that you are changing documentation about behavior that is changing
> with this patch.

Agreed. I think it'd be good to do it in three patches:
1. unrelated documentation update
2. fix + documentation changes specific to the fix
3. test(s)

Christian

* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 16:48                                 ` Bernd Edlinger
@ 2020-03-03 17:01                                   ` Christian Brauner
  2020-03-03 17:20                                     ` Christian Brauner
  2020-03-03 20:08                                   ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-03 17:01 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote:
> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
> > Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> > 
> >> This fixes a deadlock in the tracer when tracing a multi-threaded
> >> application that calls execve while more than one thread are running.
> >>
> >> I observed that when running strace on the gcc test suite, it always
> >> blocks after a while, when expect calls execve, because other threads
> >> have to be terminated.  They send ptrace events, but the strace is no
> >> longer able to respond, since it is blocked in vm_access.
> >>
> >> The deadlock is always happening when strace needs to access the
> >> tracees process mmap, while another thread in the tracee starts to
> >> execve a child process, but that cannot continue until the
> >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> > 
> > A couple of things.
> > 
> > Why do we think it is safe to change the behavior exposed to userspace?
> > Not the deadlock but all of the times the current code would not
> > deadlock?
> > 
> > Especially given that this is a small window it might be hard for people
> > to track down and report so we need a strong argument that this won't
> > break existing userspace before we just change things.
> > 
> 
> Hmm, I tend to agree.
> 
> > Usually surveying all of the users of a system call that we can find
> > and checking to see if they might be affected by the change in behavior
> > is difficult enough that we usually opt for not being lazy and
> > preserving the behavior.
> > 
> > This patch is up to two changes in behavior now, that could potentially
> > affect a whole array of programs.  Adding linux-api so that this change
> > in behavior can be documented if/when this change goes through.
> > 
> 
> One is PTRACE_ATTACH possibly returning EAGAIN, yes.
> 
> We could try to restrict that behavior change to when any
> thread is ptraced when execve starts, can't be too complicated.
> 
> 
> But the other is only SYS_seccomp returning EAGAIN, when a different
> thread of the current process is calling execve at the same time.
> 
> I would consider it completely impossible to have any user-visible effect,
> since de_thread is just terminating all threads, including the thread
> where the -EAGAIN was returned, so we will never know what happened.

I think if we risk a user-space facing change we should try the simple
thing first before making the fix more convoluted? But it's a tough
call...

> 
> 
> > If you can split the documentation and test fixes out into separate
> > patches that would help reviewing this code, or please make it explicit
> > that you are changing documentation about behavior that is changing
> > with this patch.
> > 
> 
> I am not sure if I have touched the right user documentation.
> 
> I only saw a document referring to a non-existent "current->cred_replace_mutex".
> I haven't dug into the git history, but that must be prehistoric IMHO.
> It appears to me that it is some developer documentation, but it's nevertheless
> worth keeping up to date when the code changes.
> 
> So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ?

Since that would be a potentially user-visible change it would make the
most sense to add it to man ptrace(2) if/when we land this change.

For developers, placing a comment in kernel/ptrace.c:ptrace_attach()
would make the most sense? We already have something about exec
protection in there.

Christian

* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 17:01                                   ` Christian Brauner
@ 2020-03-03 17:20                                     ` Christian Brauner
  0 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-03 17:20 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 03, 2020 at 06:01:11PM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote:
> > On 3/3/20 4:18 PM, Eric W. Biederman wrote:
> > > Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> > > 
> > >> This fixes a deadlock in the tracer when tracing a multi-threaded
> > >> application that calls execve while more than one thread are running.
> > >>
> > >> I observed that when running strace on the gcc test suite, it always
> > >> blocks after a while, when expect calls execve, because other threads
> > >> have to be terminated.  They send ptrace events, but the strace is no
> > >> longer able to respond, since it is blocked in vm_access.
> > >>
> > >> The deadlock is always happening when strace needs to access the
> > >> tracees process mmap, while another thread in the tracee starts to
> > >> execve a child process, but that cannot continue until the
> > >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> > > 
> > > A couple of things.
> > > 
> > > Why do we think it is safe to change the behavior exposed to userspace?
> > > Not the deadlock but all of the times the current code would not
> > > deadlock?
> > > 
> > > Especially given that this is a small window it might be hard for people
> > > to track down and report so we need a strong argument that this won't
> > > break existing userspace before we just change things.
> > > 
> > 
> > Hmm, I tend to agree.
> > 
> > > Usually surveying all of the users of a system call that we can find
> > > and checking to see if they might be affected by the change in behavior
> > > is difficult enough that we usually opt for not being lazy and
> > > preserving the behavior.
> > > 
> > > This patch is up to two changes in behavior now, that could potentially
> > > affect a whole array of programs.  Adding linux-api so that this change
> > > in behavior can be documented if/when this change goes through.
> > > 
> > 
> > One is PTRACE_ATTACH possibly returning EAGAIN, yes.
> > 
> > We could try to restrict that behavior change to when any
> > thread is ptraced when execve starts, can't be too complicated.
> > 
> > 
> > But the other is only SYS_seccomp returning EAGAIN, when a different
> > thread of the current process is calling execve at the same time.
> > 
> > I would consider it completely impossible to have any user-visible effect,
> > since de_thread is just terminating all threads, including the thread
> > where the -EAGAIN was returned, so we will never know what happened.
> 
> I think if we risk a user-space facing change we should try the simple
> thing first before making the fix more convoluted? But it's a tough
> call...

Actually, to get a _rough_ estimate of the possible impact I would
recommend you run the criu test suite (and possibly the strace
test suite) on a kernel with and without your fix. That's what I tend to
do when I touch code I fear will have an impact on APIs that very deeply
touch the core kernel. Criu's test suite makes heavy use of ptrace and
usually runs into a bunch of interesting (exec) races too, and does have
tests for handling zombie processes etc.

Should be relatively simple: create a VM, install the criu build dependencies,
then: git clone criu; cd criu; make; cd test; ./zdtm.py run -a --keep-going
If your system doesn't support SELinux properly, you need to disable it
when running the tests, and you also need to make sure that you're using
python3 or change the shebang in zdtm.py to python3.

Just a recommendation.

Christian

* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 16:48                                 ` Bernd Edlinger
  2020-03-03 17:01                                   ` Christian Brauner
@ 2020-03-03 20:08                                   ` Eric W. Biederman
  2020-03-04 14:37                                     ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-03 20:08 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>> new file mode 100644
>>> index 0000000..6d8a048
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>> @@ -0,0 +1,66 @@
>>> +// SPDX-License-Identifier: GPL-2.0+
>>> +/*
>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
>>> + * All rights reserved.
>>> + *
>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>> + */
>>> +
>>> +#include "../kselftest_harness.h"
>>> +#include <stdio.h>
>>> +#include <fcntl.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <unistd.h>
>>> +#include <sys/ptrace.h>
>>> +
>>> +static void *thread(void *arg)
>>> +{
>>> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>> +	return NULL;
>>> +}
>>> +
>>> +TEST(vmaccess)
>>> +{
>>> +	int f, pid = fork();
>>> +	char mm[64];
>>> +
>>> +	if (!pid) {
>>> +		pthread_t pt;
>>> +
>>> +		pthread_create(&pt, NULL, thread, NULL);
>>> +		pthread_join(pt, NULL);
>>> +		execlp("true", "true", NULL);
>>> +	}
>>> +
>>> +	sleep(1);
>>> +	sprintf(mm, "/proc/%d/mem", pid);
>>> +	f = open(mm, O_RDONLY);
>>> +	ASSERT_LE(0, f);
>>> +	close(f);
>>> +	f = kill(pid, SIGCONT);
>>> +	ASSERT_EQ(0, f);
>>> +}
>>> +
>>> +TEST(attach)
>>> +{
>>> +	int f, pid = fork();
>>> +
>>> +	if (!pid) {
>>> +		pthread_t pt;
>>> +
>>> +		pthread_create(&pt, NULL, thread, NULL);
>>> +		pthread_join(pt, NULL);
>>> +		execlp("true", "true", NULL);
>>> +	}
>>> +
>>> +	sleep(1);
>>> +	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>> 
>> To be meaningful this code needs to learn to loop when
>> ptrace returns -EAGAIN.
>> 
>> Because that is pretty much what any self respecting user space
>> process will do.
>> 
>> At which point I am not certain we can say that the behavior has
>> sufficiently improved not to be a deadlock.
>> 
>
> In this special dead-duck test it won't work, but it would
> still be lots more transparent what is going on, since previously
> you had two zombie processes, and no way to even output debug
> messages, which also all self respecting user space processes
> should do.

Agreed it is more transparent.  So if you are going to deadlock
it is better.

My previous proposal (which I admit is more work to implement) would
actually allow succeeding in this case and so it would not be subject to
a deadlock (even via -EAGAIN) at this point.

> So yes, I can at least give a good example and re-try it several
> times together with wait4 which a tracer is expected to do.

Thank you,

Eric

* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-03 20:08                                   ` Eric W. Biederman
@ 2020-03-04 14:37                                     ` Bernd Edlinger
  2020-03-04 16:33                                       ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-04 14:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/3/20 9:08 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>>> new file mode 100644
>>>> index 0000000..6d8a048
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>>> @@ -0,0 +1,66 @@
>>>> +// SPDX-License-Identifier: GPL-2.0+
>>>> +/*
>>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
>>>> + * All rights reserved.
>>>> + *
>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>>> + */
>>>> +
>>>> +#include "../kselftest_harness.h"
>>>> +#include <stdio.h>
>>>> +#include <fcntl.h>
>>>> +#include <pthread.h>
>>>> +#include <signal.h>
>>>> +#include <unistd.h>
>>>> +#include <sys/ptrace.h>
>>>> +
>>>> +static void *thread(void *arg)
>>>> +{
>>>> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +TEST(vmaccess)
>>>> +{
>>>> +	int f, pid = fork();
>>>> +	char mm[64];
>>>> +
>>>> +	if (!pid) {
>>>> +		pthread_t pt;
>>>> +
>>>> +		pthread_create(&pt, NULL, thread, NULL);
>>>> +		pthread_join(pt, NULL);
>>>> +		execlp("true", "true", NULL);
>>>> +	}
>>>> +
>>>> +	sleep(1);
>>>> +	sprintf(mm, "/proc/%d/mem", pid);
>>>> +	f = open(mm, O_RDONLY);
>>>> +	ASSERT_LE(0, f);
>>>> +	close(f);
>>>> +	f = kill(pid, SIGCONT);
>>>> +	ASSERT_EQ(0, f);
>>>> +}
>>>> +
>>>> +TEST(attach)
>>>> +{
>>>> +	int f, pid = fork();
>>>> +
>>>> +	if (!pid) {
>>>> +		pthread_t pt;
>>>> +
>>>> +		pthread_create(&pt, NULL, thread, NULL);
>>>> +		pthread_join(pt, NULL);
>>>> +		execlp("true", "true", NULL);
>>>> +	}
>>>> +
>>>> +	sleep(1);
>>>> +	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>>
>>> To be meaningful this code needs to learn to loop when
>>> ptrace returns -EAGAIN.
>>>
>>> Because that is pretty much what any self respecting user space
>>> process will do.
>>>
>>> At which point I am not certain we can say that the behavior has
>>> sufficiently improved not to be a deadlock.
>>>
>>
>> In this special dead-duck test it won't work, but it would
>> still be lots more transparent what is going on, since previously
>> you had two zombie processes, and no way to even output debug
>> messages, which also all self respecting user space processes
>> should do.
> 
> Agreed it is more transparent.  So if you are going to deadlock
> it is better.
> 
> My previous proposal (which I admit is more work to implement) would
> actually allow succeeding in this case and so it would not be subject to
> a deadlock (even via -EAGAIN) at this point.
> 
>> So yes, I can at least give a good example and re-try it several
>> times together with wait4 which a tracer is expected to do.
> 
> Thank you,
> 
> Eric
> 

Okay, I think it can be done with minimal API changes,
but it needs two mutexes, one that guards the execve,
and one that guards only the credentials.

If no traced sibling thread exists, the mutexes are used this way:
lock(exec_guard_mutex)
cred_locked_in_execve = true;
de_thread()
lock(cred_guard_mutex)
unlock(cred_guard_mutex)
cred_locked_in_execve = false;
unlock(exec_guard_mutex)

so effectively no API change at all.

If a traced sibling thread exists, the mutexes are used differently:
lock(exec_guard_mutex)
cred_locked_in_execve = true;
unlock(exec_guard_mutex)
de_thread()
lock(cred_guard_mutex)
unlock(cred_guard_mutex)
lock(exec_guard_mutex)
cred_locked_in_execve = false;
unlock(exec_guard_mutex)

Only the case changes that would deadlock anyway.
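
As a rough user-space model of the two orderings above, with pthread
mutexes standing in for the kernel mutexes. The function names
model_exec_begin, model_exec_end and model_attach are made up for
illustration and do not exist in the kernel, and the attach path shown
is only my assumption of how a tracer would use the flag:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* User-space stand-ins for the two proposed kernel mutexes. */
static pthread_mutex_t exec_guard_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cred_guard_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool cred_locked_in_execve;	/* protected by exec_guard_mutex */

/* First half of execve, up to the point where de_thread() would run. */
static void model_exec_begin(bool traced_sibling)
{
	pthread_mutex_lock(&exec_guard_mutex);
	cred_locked_in_execve = true;
	if (traced_sibling)
		/* Let a tracer get in and see the flag instead of blocking. */
		pthread_mutex_unlock(&exec_guard_mutex);
	/* de_thread() would run here */
}

/* Second half of execve, after de_thread() has returned. */
static void model_exec_end(bool traced_sibling)
{
	pthread_mutex_lock(&cred_guard_mutex);
	/* assumption: the new credentials would be installed here */
	pthread_mutex_unlock(&cred_guard_mutex);

	if (traced_sibling)
		pthread_mutex_lock(&exec_guard_mutex);
	assert(cred_locked_in_execve);	/* model_exec_begin() must have run */
	cred_locked_in_execve = false;
	pthread_mutex_unlock(&exec_guard_mutex);
}

/* What a PTRACE_ATTACH-like caller would do in this model. */
static int model_attach(void)
{
	int ret = 0;

	pthread_mutex_lock(&exec_guard_mutex);
	if (cred_locked_in_execve)
		ret = -EAGAIN;	/* execve in progress, caller should retry */
	pthread_mutex_unlock(&exec_guard_mutex);
	return ret;
}
```

In the traced case model_attach() returns -EAGAIN between
model_exec_begin(true) and model_exec_end(true); in the untraced case
it would simply block on exec_guard_mutex until execve is done, as
the current code does on cred_guard_mutex.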


Bernd.

* Re: [PATCHv4] exec: Fix a deadlock in ptrace
  2020-03-03  8:08                               ` Bernd Edlinger
  2020-03-03  8:34                                 ` Christian Brauner
@ 2020-03-04 15:30                                 ` Christian Brauner
  1 sibling, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-04 15:30 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Kees Cook, Eric W. Biederman, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable

On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> 	- during prepare_bprm_creds()
> >>> 	- from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> > 
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> > 
> 
> Okay, will do.
> 
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>>     /*
> >>>      * determine how safe it is to execute the proposed program
> >>>      * - the caller must hold ->cred_guard_mutex to protect against
> >>>      *   PTRACE_ATTACH or seccomp thread-sync
> >>>      */
> >>
> >> Oh, right, I haven't understood that hint...
> > 
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> > 
> 
> Okay, I change that comment to:
> 
> /*
>  * determine how safe it is to execute the proposed program
>  * - the caller must have set ->cred_locked_in_execve to protect against
>  *   PTRACE_ATTACH or seccomp thread-sync
>  */
> 
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not
> >> when execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place, where it may set for other threads
> >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > 
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> > 
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risks races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>>   appearing/disappearing, which would destroy filter refcounting and
> >>>   lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> > 
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> > 
> 
> I assume for now, that the current->sighand->siglock held while iterating all
> threads is sufficient here.
> 
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>>   avoid no_new_privs and filter confusion during exec, which could
> >>>   lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>>          by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >>         assert_spin_locked(&current->sighand->siglock);
> >>  
> >> +       if (current->signal->cred_locked_for_ptrace)
> >> +               return -EAGAIN;
> >> +
> > 
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> > 
> 
> I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> but the caller will die quickly and cannot do anything with that information
> when another thread executes execve, right?
> 
> >>         /* Validate all threads being eligible for synchronization. */
> >>         caller = current;
> >>         for_each_thread(caller, thread) {
> >>
> >>
> >>> And this is just the bad state I _can_ see. I'm worried there are more...
> >>>
> >>> All this said, I do see a small similarity here to the work I did to
> >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> >>> decisions for the bprm based on current's state -- but current's state
> >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> >>> current's copy until exec ended and then stored bprm's copy into current.
> >>> If the only problem anyone can see here is the handling of no_new_privs,
> >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> >>> from cred_guard_mutex.
> >>>
> >>
> >> I still think that is solvable with using cred_locked_for_ptrace and
> >> simply make the tsync fail if it would otherwise be blocked.
> > 
> > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > 
> 
> Yeah, I'd go with "cred_locked_in_execve".
> 
> > And the comment on bool cred_locked_for_ptrace should mention that
> > access is only allowed under cred_guard_mutex lock.
> > 
> 
> okay.
> 
> >>>> +	sig->cred_locked_for_ptrace = false;
> > 
> > This is redundant to the zalloc -- I think you can drop it (unless
> > someone wants to keep it for clarity?)
> > 
> 
> I'll remove that here and in init/init_task.c
> 
> > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > be sufficient to see a stable version of the thread...
> > 
> 
> No, these need to be addressed individually, but most users just want
> to know if the current credentials are sufficient at this moment, but will
> not change the credentials, as ptrace and TSYNC do. 
> 
> BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> So adding an access to cred_locked_in_execve in ptrace_may_access is
> probably not an option.

That could be solved by e.g. adding ptrace_may_access_{no}exec() taking
cred_guard_mutex.


* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-04 14:37                                     ` Bernd Edlinger
@ 2020-03-04 16:33                                       ` Eric W. Biederman
  2020-03-04 21:49                                         ` Bernd Edlinger
  2020-03-04 21:56                                         ` [PATCHv6] " Bernd Edlinger
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-04 16:33 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/3/20 9:08 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>>>> new file mode 100644
>>>>> index 0000000..6d8a048
>>>>> --- /dev/null
>>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>>>> @@ -0,0 +1,66 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0+
>>>>> +/*
>>>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
>>>>> + * All rights reserved.
>>>>> + *
>>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>>>> + */
>>>>> +
>>>>> +#include "../kselftest_harness.h"
>>>>> +#include <stdio.h>
>>>>> +#include <fcntl.h>
>>>>> +#include <pthread.h>
>>>>> +#include <signal.h>
>>>>> +#include <unistd.h>
>>>>> +#include <sys/ptrace.h>
>>>>> +
>>>>> +static void *thread(void *arg)
>>>>> +{
>>>>> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>>>> +	return NULL;
>>>>> +}
>>>>> +
>>>>> +TEST(vmaccess)
>>>>> +{
>>>>> +	int f, pid = fork();
>>>>> +	char mm[64];
>>>>> +
>>>>> +	if (!pid) {
>>>>> +		pthread_t pt;
>>>>> +
>>>>> +		pthread_create(&pt, NULL, thread, NULL);
>>>>> +		pthread_join(pt, NULL);
>>>>> +		execlp("true", "true", NULL);
>>>>> +	}
>>>>> +
>>>>> +	sleep(1);
>>>>> +	sprintf(mm, "/proc/%d/mem", pid);
>>>>> +	f = open(mm, O_RDONLY);
>>>>> +	ASSERT_LE(0, f);
>>>>> +	close(f);
>>>>> +	f = kill(pid, SIGCONT);
>>>>> +	ASSERT_EQ(0, f);
>>>>> +}
>>>>> +
>>>>> +TEST(attach)
>>>>> +{
>>>>> +	int f, pid = fork();
>>>>> +
>>>>> +	if (!pid) {
>>>>> +		pthread_t pt;
>>>>> +
>>>>> +		pthread_create(&pt, NULL, thread, NULL);
>>>>> +		pthread_join(pt, NULL);
>>>>> +		execlp("true", "true", NULL);
>>>>> +	}
>>>>> +
>>>>> +	sleep(1);
>>>>> +	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>>>
>>>> To be meaningful this code needs to learn to loop when
>>>> ptrace returns -EAGAIN.
>>>>
>>>> Because that is pretty much what any self respecting user space
>>>> process will do.
>>>>
>>>> At which point I am not certain we can say that the behavior has
>>>> sufficiently improved not to be a deadlock.
>>>>
>>>
>>> In this special dead-duck test it won't work, but it would
>>> still be lots more transparent what is going on, since previously
>>> you had two zombie process, and no way to even output debug
>>> messages, which also all self respecting user space processes
>>> should do.
>> 
>> Agreed it is more transparent.  So if you are going to deadlock
>> it is better.
>> 
>> My previous proposal (which I admit is more work to implement) would
>> actually allow succeeding in this case and so it would not be subject to
>> a deadlock (even via -EAGAIN) at this point.
>> 
>>> So yes, I can at least give a good example and re-try it several
>>> times together with wait4 which a tracer is expected to do.
>> 
>> Thank you,
>> 
>> Eric
>> 
>
> Okay, I think it can be done with minimal API changes,
> but it needs two mutexes, one that guards the execve,
> and one that guards only the credentials.
>
> If no traced sibling thread exists, the mutexes are used this way:
> lock(exec_guard_mutex)
> cred_locked_in_execve = true;
> de_thread()
> lock(cred_guard_mutex)
> unlock(cred_guard_mutex)
> cred_locked_in_execve = false;
> unlock(exec_guard_mutex)
>
> so effectively no API change at all.
>
> If a traced sibling thread exists, the mutexes are used differently:
> lock(exec_guard_mutex)
> cred_locked_in_execve = true;
> unlock(exec_guard_mutex)
> de_thread()
> lock(cred_guard_mutex)
> unlock(cred_guard_mutex)
> lock(exec_guard_mutex)
> cred_locked_in_execve = false;
> unlock(exec_guard_mutex)
>
> Only the case changes that would deadlock anyway.


Let me propose a slight alternative that I think sets us up for long
term success.

Leave cred_guard_mutex as is, but declare it undesirable.  The
cred_guard_mutex as designed really is something we should get rid of.
As it is, it can sleep over several different userspace accesses.  The
copying of the exec arguments is technically as prone to deadlock as the
ptrace case.

Add a new mutex with a better name perhaps "exec_change_mutex" that is
used to guard the changes that exec makes to a process.

Then we gradually shift all the cred_guard_mutex users over to the new
mutex.  AKA one patch per user of cred_guard_mutex.  At each patch that
shifts things over we will have the opportunity to review the code to
see that there are no funny dependencies that were missed.

I will sign up for working on the no_new_privs and ptrace_attach cases
as I think I can make those happen.  Especially no_new_privs.

Getting the easier cases will resolve your issues and put things on a
better footing.

Eric


* Re: [PATCHv5] exec: Fix a deadlock in ptrace
  2020-03-04 16:33                                       ` Eric W. Biederman
@ 2020-03-04 21:49                                         ` Bernd Edlinger
  2020-03-04 21:56                                         ` [PATCHv6] " Bernd Edlinger
  1 sibling, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-04 21:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/4/20 5:33 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/3/20 9:08 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>
>>>> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>>>>> new file mode 100644
>>>>>> index 0000000..6d8a048
>>>>>> --- /dev/null
>>>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>>>>> @@ -0,0 +1,66 @@
>>>>>> +// SPDX-License-Identifier: GPL-2.0+
>>>>>> +/*
>>>>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
>>>>>> + * All rights reserved.
>>>>>> + *
>>>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>>>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>>>>> + */
>>>>>> +
>>>>>> +#include "../kselftest_harness.h"
>>>>>> +#include <stdio.h>
>>>>>> +#include <fcntl.h>
>>>>>> +#include <pthread.h>
>>>>>> +#include <signal.h>
>>>>>> +#include <unistd.h>
>>>>>> +#include <sys/ptrace.h>
>>>>>> +
>>>>>> +static void *thread(void *arg)
>>>>>> +{
>>>>>> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>>>>> +	return NULL;
>>>>>> +}
>>>>>> +
>>>>>> +TEST(vmaccess)
>>>>>> +{
>>>>>> +	int f, pid = fork();
>>>>>> +	char mm[64];
>>>>>> +
>>>>>> +	if (!pid) {
>>>>>> +		pthread_t pt;
>>>>>> +
>>>>>> +		pthread_create(&pt, NULL, thread, NULL);
>>>>>> +		pthread_join(pt, NULL);
>>>>>> +		execlp("true", "true", NULL);
>>>>>> +	}
>>>>>> +
>>>>>> +	sleep(1);
>>>>>> +	sprintf(mm, "/proc/%d/mem", pid);
>>>>>> +	f = open(mm, O_RDONLY);
>>>>>> +	ASSERT_LE(0, f);
>>>>>> +	close(f);
>>>>>> +	f = kill(pid, SIGCONT);
>>>>>> +	ASSERT_EQ(0, f);
>>>>>> +}
>>>>>> +
>>>>>> +TEST(attach)
>>>>>> +{
>>>>>> +	int f, pid = fork();
>>>>>> +
>>>>>> +	if (!pid) {
>>>>>> +		pthread_t pt;
>>>>>> +
>>>>>> +		pthread_create(&pt, NULL, thread, NULL);
>>>>>> +		pthread_join(pt, NULL);
>>>>>> +		execlp("true", "true", NULL);
>>>>>> +	}
>>>>>> +
>>>>>> +	sleep(1);
>>>>>> +	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>>>>
>>>>> To be meaningful this code needs to learn to loop when
>>>>> ptrace returns -EAGAIN.
>>>>>
>>>>> Because that is pretty much what any self respecting user space
>>>>> process will do.
>>>>>
>>>>> At which point I am not certain we can say that the behavior has
>>>>> sufficiently improved not to be a deadlock.
>>>>>
>>>>
>>>> In this special dead-duck test it won't work, but it would
>>>> still be lots more transparent what is going on, since previously
>>>> you had two zombie process, and no way to even output debug
>>>> messages, which also all self respecting user space processes
>>>> should do.
>>>
>>> Agreed it is more transparent.  So if you are going to deadlock
>>> it is better.
>>>
>>> My previous proposal (which I admit is more work to implement) would
>>> actually allow succeeding in this case and so it would not be subject to
>>> a deadlock (even via -EAGAIN) at this point.
>>>
>>>> So yes, I can at least give a good example and re-try it several
>>>> times together with wait4 which a tracer is expected to do.
>>>
>>> Thank you,
>>>
>>> Eric
>>>
>>
>> Okay, I think it can be done with minimal API changes,
>> but it needs two mutexes, one that guards the execve,
>> and one that guards only the credentials.
>>
>> If no traced sibling thread exists, the mutexes are used this way:
>> lock(exec_guard_mutex)
>> cred_locked_in_execve = true;
>> de_thread()
>> lock(cred_guard_mutex)
>> unlock(cred_guard_mutex)
>> cred_locked_in_execve = false;
>> unlock(exec_guard_mutex)
>>
>> so effectively no API change at all.
>>
>> If a traced sibling thread exists, the mutexes are used differently:
>> lock(exec_guard_mutex)
>> cred_locked_in_execve = true;
>> unlock(exec_guard_mutex)
>> de_thread()
>> lock(cred_guard_mutex)
>> unlock(cred_guard_mutex)
>> lock(exec_guard_mutex)
>> cred_locked_in_execve = false;
>> unlock(exec_guard_mutex)
>>
>> Only the case changes that would deadlock anyway.
> 
> 
> Let me propose a slight alternative that I think sets us up for long
> term success.
> 
> Leave cred_guard_mutex as is, but declare it undesirable.  The
> cred_guard_mutex as designed really is something we should get rid of.
> As it is, it can sleep over several different userspace accesses.  The
> copying of the exec arguments is technically as prone to deadlock as the
> ptrace case.
> 
> Add a new mutex with a better name perhaps "exec_change_mutex" that is
> used to guard the changes that exec makes to a process.
> 
> Then we gradually shift all the cred_guard_mutex users over to the new
> mutex.  AKA one patch per user of cred_guard_mutex.  At each patch that
> shifts things over we will have the opportunity to review the code to
> see that there are no funny dependencies that were missed.
> 
> I will sign up for working on the no_new_privs and ptrace_attach cases
> as I think I can make those happen.  Especially no_new_privs.
> 
> Getting the easier cases will resolve your issues and put things on a
> better footing.
> 
> Eric
> 

Okay, however I think we will need two mutexes in the long term.

So currently I have reduced the cred_guard_mutex to protect just
the loading of the executable code into the process vm, since that
is what mm_access() needs (covered by the vmaccess test case).
A second mutex protects the whole execve function; that
is needed for ptrace (and seccomp).
But I have a test case only for ptrace.


If I understand that right, I should not recycle cred_guard_mutex
but leave it as is, and create two additional mutexes which will
take over step by step.

Sounds reasonable, indeed.

I will send an update (v6) with what I have right now,
but just for information, so you can see how my minimal API-change
approach works.


Bernd.
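
The two-mutex scheme quoted above can be modelled in userspace to make the
locking order easier to follow.  This is only a sketch under stated
assumptions, not kernel code: struct sig_model and the model_*() functions
are hypothetical names, pthread mutexes stand in for the kernel mutexes,
and model_attach() uses trylock (returning -EBUSY) where the kernel side
would sleep, so that a single-threaded walk-through never blocks.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the relevant signal_struct fields (hypothetical). */
struct sig_model {
	pthread_mutex_t exec_guard_mutex;	/* guards the whole execve */
	pthread_mutex_t cred_guard_mutex;	/* guards only mm/cred install */
	bool cred_locked_in_execve;	/* only valid under exec_guard_mutex */
	bool exec_holds_guard;		/* model-only; the patch keeps this in bprm */
};

static void sig_model_init(struct sig_model *s)
{
	assert(s != NULL);
	pthread_mutex_init(&s->exec_guard_mutex, NULL);
	pthread_mutex_init(&s->cred_guard_mutex, NULL);
	s->cred_locked_in_execve = false;
	s->exec_holds_guard = false;
}

/* Start of execve: 0 on success, -EAGAIN if another exec is in flight. */
static int model_exec_begin(struct sig_model *s, bool traced_sibling)
{
	pthread_mutex_lock(&s->exec_guard_mutex);
	if (s->cred_locked_in_execve) {
		pthread_mutex_unlock(&s->exec_guard_mutex);
		return -EAGAIN;
	}
	s->cred_locked_in_execve = true;
	s->exec_holds_guard = !traced_sibling;
	/* With a traced sibling, drop the mutex so tracers are not blocked. */
	if (traced_sibling)
		pthread_mutex_unlock(&s->exec_guard_mutex);
	return 0;
}

/* End of execve; de_thread() would have run before this point. */
static void model_exec_end(struct sig_model *s)
{
	pthread_mutex_lock(&s->cred_guard_mutex);
	/* ... install the new mm and credentials here ... */
	pthread_mutex_unlock(&s->cred_guard_mutex);

	if (!s->exec_holds_guard)
		pthread_mutex_lock(&s->exec_guard_mutex);
	s->cred_locked_in_execve = false;
	s->exec_holds_guard = false;
	pthread_mutex_unlock(&s->exec_guard_mutex);
}

/* Attach side: -EAGAIN while an exec with a traced sibling is in flight. */
static int model_attach(struct sig_model *s)
{
	int ret = 0;

	/* trylock keeps the model non-blocking; the kernel would sleep here. */
	if (pthread_mutex_trylock(&s->exec_guard_mutex) != 0)
		return -EBUSY;
	if (s->cred_locked_in_execve)
		ret = -EAGAIN;
	pthread_mutex_unlock(&s->exec_guard_mutex);
	return ret;
}
```

In this model, only the traced-sibling case lets an attach attempt observe
cred_locked_in_execve and fail with -EAGAIN; without a traced sibling the
attach side simply waits for exec_guard_mutex, as it did for
cred_guard_mutex before.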


* [PATCHv6] exec: Fix a deadlock in ptrace
  2020-03-04 16:33                                       ` Eric W. Biederman
  2020-03-04 21:49                                         ` Bernd Edlinger
@ 2020-03-04 21:56                                         ` Bernd Edlinger
  2020-03-05 18:36                                           ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-04 21:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but the strace is no
longer able to respond, since it is blocked in mm_access().

The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to detect whether a traced sibling thread
exists, and in that case to make PTRACE_ATTACH
fail with -EAGAIN instead of deadlocking.
Other functions like mm_access() are still allowed to complete normally.

This changes the lifetime of the cred_guard_mutex lock to be
from flush_old_exec() through install_exec_creds().
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().

Additionally a new mutex exec_guard_mutex is introduced that is used
for PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 29 ++++++++---
 fs/exec.c                                 | 58 ++++++++++++++++++---
 include/linux/binfmts.h                   | 15 +++++-
 include/linux/sched/signal.h              | 10 ++--
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  4 +-
 kernel/fork.c                             |  1 +
 kernel/ptrace.c                           | 20 ++++++--
 kernel/seccomp.c                          | 15 +++---
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 85 +++++++++++++++++++++++++++++++
 12 files changed, 210 insertions(+), 34 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case. 
v5: addresses review comments.
v6: minimal API changes, using a second mutex, improved test case.
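
For a tracer, the -EAGAIN result from PTRACE_ATTACH described above implies
a bounded retry loop, as requested in the review of v5.  A minimal sketch,
assuming the behavior proposed in this patch (a kernel without it is not
expected to return -EAGAIN here); retry_on_eagain(), attach_op and
do_attach() are hypothetical names, not part of the patch or the selftest:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*attach_op)(void *arg);

/*
 * Call op() up to max_tries times, retrying only while it fails with
 * EAGAIN.  Returns 0 on success, -1 with errno set otherwise.
 */
static int retry_on_eagain(attach_op op, void *arg, int max_tries)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
	int i;

	assert(op != NULL);
	for (i = 0; i < max_tries; i++) {
		if (op(arg) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;	/* a real error, give up */
		/* A real tracer would reap pending events with wait4() here. */
		nanosleep(&delay, NULL);
	}
	errno = EAGAIN;		/* retries exhausted (or max_tries <= 0) */
	return -1;
}

/* Wrapper around PTRACE_ATTACH; arg points to the target pid_t. */
static int do_attach(void *arg)
{
	pid_t pid = *(pid_t *)arg;

	return ptrace(PTRACE_ATTACH, pid, 0L, 0L) == -1 ? -1 : 0;
}
```

A caller would use it as retry_on_eagain(do_attach, &pid, 10) and treat a
final EAGAIN as "the tracee is still stuck in execve", which keeps the
failure visible instead of hanging the tracer.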

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..b08899f 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,15 +437,30 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->exec_guard_mutex
+is acquired before this function gets called, and usually released after
+the new process mmap and credentials are installed.  However if one of the
+sibling threads is being traced when the execve is invoked, there is no
+guarantee how long it takes to terminate all sibling threads, and therefore
+the variable current->signal->cred_locked_in_execve is set, and the
+exec_guard_mutex is released immediately.  Functions that may have effect
+on the credentials of a different thread need to lock the exec_guard_mutex
+and additionally check the cred_locked_in_execve status, and fail with
+-EAGAIN if that variable is set.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
 as the ptrace state may alter the outcome, particularly in the case of
 ``execve()``.
 
+The mutex current->signal->cred_guard_mutex is acquired when only a single thread
+is remaining, and the credentials and the process mmap are actually changed.
+Functions that only need access to a consistent state of the credentials
+and the process mmap need to acquire only this mutex.
+
 The new credentials set should be altered appropriately, and any security
 checks and hooks done.  Both the current and the proposed sets of credentials
 are available for this purpose as current_cred() will return the current set
@@ -466,9 +481,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +500,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..8a23804 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1258,6 +1258,11 @@ int flush_old_exec(struct linux_binprm * bprm)
 {
 	int retval;
 
+	if (bprm->detected_unsafe_exec) {
+		mutex_unlock(&current->signal->exec_guard_mutex);
+		bprm->holding_exec_guard_mutex = 0;
+	}
+
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
@@ -1266,6 +1271,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->holding_cred_guard_mutex = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,29 +1409,56 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_in_execve.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+	int ret;
+	struct task_struct *t;
+
+	if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	bprm->holding_exec_guard_mutex = 1;
+
+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_in_execve))
+		goto out;
+
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	ret = -ENOMEM;
+	if (unlikely(bprm->cred == NULL))
+		goto out;
 
-	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	current->signal->cred_locked_in_execve = true;
+
+	spin_lock_irq(&current->sighand->siglock);
+	t = current;
+	while_each_thread(current, t) {
+		if (t->ptrace)
+			bprm->detected_unsafe_exec = 1;
+	}
+	spin_unlock_irq(&current->sighand->siglock);
+	return 0;
+
+out:
+	mutex_unlock(&current->signal->exec_guard_mutex);
+	return ret;
 }
 
 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		if (bprm->holding_cred_guard_mutex)
+			mutex_unlock(&current->signal->cred_guard_mutex);
+		if (!bprm->holding_exec_guard_mutex)
+			mutex_lock(&current->signal->exec_guard_mutex);
+		current->signal->cred_locked_in_execve = false;
+		mutex_unlock(&current->signal->exec_guard_mutex);
 		abort_creds(bprm->cred);
 	}
 	if (bprm->file) {
@@ -1470,12 +1508,16 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 */
 	security_bprm_committed_creds(bprm);
 	mutex_unlock(&current->signal->cred_guard_mutex);
+	if (bprm->detected_unsafe_exec)
+		mutex_lock(&current->signal->exec_guard_mutex);
+	current->signal->cred_locked_in_execve = false;
+	mutex_unlock(&current->signal->exec_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
 
 /*
  * determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must have set ->cred_locked_in_execve to protect against
  *   PTRACE_ATTACH or seccomp thread-sync
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..238e280 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,20 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by prepare_bprm_creds, if a sibling thread is being
+		 * traced and the exec_guard_mutex is therefore not taken.
+		 */
+		detected_unsafe_exec:1,
+		/*
+		 * Set when the cred_guard_mutex is taken.
+		 */
+		holding_cred_guard_mutex:1,
+		/*
+		 * Set when the exec_guard_mutex is taken.
+		 */
+		holding_exec_guard_mutex:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..4484aa3 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -222,9 +222,13 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
-	struct mutex cred_guard_mutex;	/* guard against foreign influences on
-					 * credential calculations
-					 * (notably. ptrace) */
+	struct mutex cred_guard_mutex;	/* guard against changing credentials */
+	struct mutex exec_guard_mutex;	/* guard against foreign influences on
+					 * execve (notably. ptrace)
+					 */
+	bool cred_locked_in_execve;	/* set while in execve, only valid when
+					 * exec_guard_mutex is held
+					 */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cf602a 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_guard_mutex = __MUTEX_INITIALIZER(init_signals.exec_guard_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..620cd50 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -295,7 +295,7 @@ struct cred *prepare_creds(void)
 
 /*
  * Prepare credentials for current to perform an execve()
- * - The caller must hold ->cred_guard_mutex
+ * - The caller must hold ->exec_guard_mutex
  */
 struct cred *prepare_exec_creds(void)
 {
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0c21baa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_guard_mutex);
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..1af8ff4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -392,9 +392,13 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 * under ptrace.
 	 */
 	retval = -ERESTARTNOINTR;
-	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
+	if (mutex_lock_interruptible(&task->signal->exec_guard_mutex))
 		goto out;
 
+	retval = -EAGAIN;
+	if (task->signal->cred_locked_in_execve)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
@@ -447,7 +451,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 unlock_tasklist:
 	write_unlock_irq(&tasklist_lock);
 unlock_creds:
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_guard_mutex);
 out:
 	if (!retval) {
 		/*
@@ -472,10 +476,18 @@ static int ptrace_attach(struct task_struct *task, long request,
  */
 static int ptrace_traceme(void)
 {
-	int ret = -EPERM;
+	int ret;
+
+	if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	ret = -EAGAIN;
+	if (current->signal->cred_locked_in_execve)
+		goto unlock_creds;
 
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
+	ret = -EPERM;
 	if (!current->ptrace) {
 		ret = security_ptrace_traceme(current->parent);
 		/*
@@ -490,6 +502,8 @@ static int ptrace_traceme(void)
 	}
 	write_unlock_irq(&tasklist_lock);
 
+unlock_creds:
+	mutex_unlock(&current->signal->exec_guard_mutex);
 	return ret;
 }
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..7ec66b1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -329,7 +329,7 @@ static int is_ancestor(struct seccomp_filter *parent,
 /**
  * seccomp_can_sync_threads: checks if all threads can be synchronized
  *
- * Expects sighand and cred_guard_mutex locks to be held.
+ * Expects sighand and exec_guard_mutex locks to be held.
  *
  * Returns 0 on success, -ve on error, or the pid of a thread which was
  * either not in the correct seccomp mode or did not have an ancestral
@@ -339,9 +339,12 @@ static inline pid_t seccomp_can_sync_threads(void)
 {
 	struct task_struct *thread, *caller;
 
-	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+	BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_in_execve)
+		return -EAGAIN;
+
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
@@ -371,7 +374,7 @@ static inline pid_t seccomp_can_sync_threads(void)
 /**
  * seccomp_sync_threads: sets all threads to use current's filter
  *
- * Expects sighand and cred_guard_mutex locks to be held, and for
+ * Expects sighand and exec_guard_mutex locks to be held, and for
  * seccomp_can_sync_threads() to have returned success already
  * without dropping the locks.
  *
@@ -380,7 +383,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
 {
 	struct task_struct *thread, *caller;
 
-	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+	BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
 	/* Synchronize all threads. */
@@ -1319,7 +1322,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
+	    mutex_lock_killable(&current->signal->exec_guard_mutex))
 		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
@@ -1337,7 +1340,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 out:
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		mutex_unlock(&current->signal->exec_guard_mutex);
 out_put_fd:
 	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
 		if (ret) {
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..fdca30b
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_GE(f, 0);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(f, 0);
+}
+
+TEST(attach)
+{
+	int s, k, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("sleep", "sleep", "2", NULL);
+	}
+
+	sleep(1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(errno, EAGAIN);
+	ASSERT_EQ(k, -1);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_NE(k, 0);
+	ASSERT_NE(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	sleep(1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	k = waitpid(-1, NULL, 0);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+TEST_HARNESS_MAIN
-- 
1.9.1


* Re: [PATCHv6] exec: Fix a deadlock in ptrace
  2020-03-04 21:56                                         ` [PATCHv6] " Bernd Edlinger
@ 2020-03-05 18:36                                           ` Bernd Edlinger
  2020-03-05 21:14                                             ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-05 18:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/4/20 10:56 PM, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread is running.
> 
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but strace is no
> longer able to respond, since it is blocked in mm_access.
> 
> The deadlock always happens when strace needs to access the
> tracee's process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The proposed solution is to detect if a sibling thread
> exists that is traced and in this case to make PTRACE_ATTACH
> fail with -EAGAIN instead of deadlocking.
> But other functions like mm_access are allowed to complete normally.
> 
> This changes the lifetime of the cred_guard_mutex lock to be
> from flush_old_exec() through install_exec_creds().
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
> 
> Additionally a new mutex exec_guard_mutex is introduced that is used
> for PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  Documentation/security/credentials.rst    | 29 ++++++++---
>  fs/exec.c                                 | 58 ++++++++++++++++++---
>  include/linux/binfmts.h                   | 15 +++++-
>  include/linux/sched/signal.h              | 10 ++--
>  init/init_task.c                          |  1 +
>  kernel/cred.c                             |  4 +-
>  kernel/fork.c                             |  1 +
>  kernel/ptrace.c                           | 20 ++++++--
>  kernel/seccomp.c                          | 15 +++---
>  mm/process_vm_access.c                    |  2 +-
>  tools/testing/selftests/ptrace/Makefile   |  4 +-
>  tools/testing/selftests/ptrace/vmaccess.c | 85 +++++++++++++++++++++++++++++++
>  12 files changed, 210 insertions(+), 34 deletions(-)
>  create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
> 

Okay, I think there is consensus that the next steps are as follows:

- post the Documentation/security/credentials.rst changes as an independent patch.
- post an infrastructure patch which only introduces two new mutexes,
  exec_guard_mutex and "cred_change_mutex" (I am unhappy with that name,
  because credentials can change without the cred_guard_mutex; it appears more
  to guarantee that the credentials of the process and the process memory map are
  consistent, so I need to think of a better name first...).
  This keeps cred_guard_mutex as is, just deprecates it, and adds a note that it will
  go away.
- post one patch that fixes the mm_access code path
- post one patch that fixes the PTRACE_ATTACH code path
- post one patch that introduces the new test cases


Thanks
Bernd.


* [PATCH 0/2] Infrastructure to allow fixing exec deadlocks
  2020-03-05 18:36                                           ` Bernd Edlinger
@ 2020-03-05 21:14                                             ` Eric W. Biederman
  2020-03-05 21:15                                               ` [PATCH 1/2] exec: Properly mark the point of no return Eric W. Biederman
                                                                 ` (3 more replies)
  0 siblings, 4 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-05 21:14 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


Bernd, everyone

This is how I think the infrastructure change should look that makes way
for fixing this issue.

- Correct the point of no return.
- Add a new mutex to replace cred_guard_mutex

Then I think it is just going through the existing
users of cred_guard_mutex and fixing them to use the new one.

There really aren't that many users of cred_guard_mutex so we should be
able to get through the easy ones fairly quickly.  Anything that
isn't easy can wait until we have a good fix.

The users of cred_guard_mutex that I saw were:
    fs/proc/base.c:
       proc_pid_attr_write
       do_io_accounting
       proc_pid_stack
       proc_pid_syscall
       proc_pid_personality
    
    perf_event_open
    mm_access
    kcmp
    pidfd_fget
    seccomp_set_mode_filter

Bernd, does this make sense to you?

I think we can fix the seccomp/no_new_privs issue with some careful
refactoring.  We can probably do the same for ptrace but that appears
to need a little lsm bug fixing.

My goal here is to allow us to fix the uncontroversial easy bits, while
still allowing the difficult tricky bits to be fixed later.

Eric W. Biederman (2):
      exec: Properly mark the point of no return
      exec: Add a exec_update_mutex to replace cred_guard_mutex

 fs/exec.c                    | 11 ++++++++---
 include/linux/binfmts.h      |  7 ++++++-
 include/linux/sched/signal.h |  9 ++++++++-
 kernel/fork.c                |  1 +
 4 files changed, 23 insertions(+), 5 deletions(-)

Eric


* [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-05 21:14                                             ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Eric W. Biederman
@ 2020-03-05 21:15                                               ` Eric W. Biederman
  2020-03-05 22:34                                                 ` Bernd Edlinger
  2020-03-05 22:56                                                 ` Bernd Edlinger
  2020-03-05 21:16                                               ` [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
                                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-05 21:15 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


Add a flag bprm->unrecoverable to mark when execution has gotten to
the point where it is impossible to return to userspace with the
calling process unchanged.

While technically this state starts as soon as de_thread starts
killing threads, the only return path at that point is if there is a
fatal signal pending.  I have chosen instead to set unrecoverable
when the killing stops, from which point failures other than fatal
signals become possible.  In particular it is possible for the
allocation of a new sighand structure to fail.

Setting unrecoverable at this point has the benefit that other actions
can be taken after the other threads are all dead, and the
unrecoverable flag can double as a flag that those actions have been
taken.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c               | 7 ++++---
 include/linux/binfmts.h | 7 ++++++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c243f9660d46 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
@@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	bprm->unrecoverable = true;
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
 
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(bprm, current);
 	if (retval)
 		goto out;
 
@@ -1664,7 +1665,7 @@ int search_binary_handler(struct linux_binprm *bprm)
 
 		read_lock(&binfmt_lock);
 		put_binfmt(fmt);
-		if (retval < 0 && !bprm->mm) {
+		if (retval < 0 && bprm->unrecoverable) {
 			/* we got to flush_old_exec() and failed after it */
 			read_unlock(&binfmt_lock);
 			force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc633f3be..12263115ce7a 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,12 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set when changes have been made that prevent returning
+		 * to userspace.
+		 */
+		unrecoverable:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
-- 
2.25.0
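
As a rough userspace illustration of what the flag buys the caller,
the decision made in search_binary_handler() can be modelled like
this.  It is only a sketch: struct exec_ctx, model_load(), model_exec()
and the exec_outcome values are made-up names, and the two fail_*
parameters merely simulate a failure before or after the point of no
return.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct linux_binprm; only the flag matters. */
struct exec_ctx {
	bool unrecoverable;	/* set once the old process image is gone */
};

enum exec_outcome {
	EXEC_OK,		/* exec succeeded */
	EXEC_RETURN_ERROR,	/* caller is unchanged, report the error */
	EXEC_KILL_CALLER,	/* nothing left to return to */
};

/* Simulated loader: may fail before or after the point of no return. */
int model_load(struct exec_ctx *ctx, bool fail_early, bool fail_late)
{
	if (fail_early)
		return -ENOMEM;

	ctx->unrecoverable = true;	/* point of no return */

	if (fail_late)
		return -ENOMEM;
	return 0;
}

/* Simulated caller: the flag decides how a failure is handled. */
enum exec_outcome model_exec(bool fail_early, bool fail_late)
{
	struct exec_ctx ctx = { .unrecoverable = false };
	int retval = model_load(&ctx, fail_early, fail_late);

	if (retval < 0 && ctx.unrecoverable)
		return EXEC_KILL_CALLER;
	if (retval < 0)
		return EXEC_RETURN_ERROR;
	return EXEC_OK;
}
```

A failure before the flag is set can still be reported to the caller
as an ordinary error; a failure after it leaves nothing to return to.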



* [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-05 21:14                                             ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Eric W. Biederman
  2020-03-05 21:15                                               ` [PATCH 1/2] exec: Properly mark the point of no return Eric W. Biederman
@ 2020-03-05 21:16                                               ` Eric W. Biederman
  2020-03-05 21:51                                                 ` Bernd Edlinger
  2020-03-05 22:31                                               ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Bernd Edlinger
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
  3 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-05 21:16 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


The cred_guard_mutex is problematic.  The cred_guard_mutex is held
over the userspace accesses as the arguments from userspace are read.
The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
threads are killed.  The cred_guard_mutex is held over
"put_user(0, tsk->clear_child_tid)" in exit_mm().

Any of those can result in deadlock, as the cred_guard_mutex is held
over a possibly indefinite wait for userspace.

Add exec_update_mutex that is only held while exec is updating the
process with the new contents of exec, so that code that must not be
confused by exec changing the mm and the cred in ways that cannot
happen during ordinary execution of a process can take it.

The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one.  This lets us move forward while still
being careful and not introducing any regressions.

Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 4 ++++
 include/linux/sched/signal.h | 9 ++++++++-
 kernel/fork.c                | 1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index c243f9660d46..ad7b518f906d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	mutex_lock(&current->signal->exec_update_mutex);
 	bprm->unrecoverable = true;
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
@@ -1425,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->unrecoverable)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1474,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);
 
 	return 0;
 }
-- 
2.25.0
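
The lock/unlock pairing in this patch can be checked with a small
bookkeeping model.  This is a sketch under assumptions, not kernel
code: struct exec_model and the model_* functions are hypothetical
names, a plain counter stands in for exec_update_mutex so the pairing
can be checked single-threaded, and it assumes that
install_exec_creds() clears bprm->cred, which is not visible in the
hunks above.

```c
#include <stdbool.h>

/* Hypothetical model: "held" counts lock minus unlock operations. */
struct exec_model {
	int held;		/* stands in for exec_update_mutex */
	bool unrecoverable;	/* stands in for bprm->unrecoverable */
	bool have_cred;		/* stands in for bprm->cred != NULL */
};

/* End of de_thread(): take the lock, mark the point of no return. */
void model_de_thread_done(struct exec_model *m)
{
	m->held++;
	m->unrecoverable = true;
}

/* install_exec_creds(): creds are consumed, then the lock is dropped. */
void model_install_exec_creds(struct exec_model *m)
{
	m->have_cred = false;	/* assumption: bprm->cred is cleared here */
	m->held--;
}

/* free_bprm(): unlocks only if exec failed after the point of no return. */
void model_free_bprm(struct exec_model *m)
{
	if (m->have_cred && m->unrecoverable)
		m->held--;
}

/* Run one simulated exec and return the final lock balance. */
int model_run(bool reach_point_of_no_return, bool succeed)
{
	struct exec_model m = {
		.held = 0,
		.unrecoverable = false,
		.have_cred = true,
	};

	if (reach_point_of_no_return) {
		model_de_thread_done(&m);
		if (succeed)
			model_install_exec_creds(&m);
	}
	model_free_bprm(&m);
	return m.held;
}
```

In this model the balance returns to zero on all three paths: early
failure, late failure, and success.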



* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-05 21:16                                               ` [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
@ 2020-03-05 21:51                                                 ` Bernd Edlinger
  2020-03-06  5:17                                                   ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-05 21:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/5/20 10:16 PM, Eric W. Biederman wrote:
> 
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> 
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possibly indefinite wait for userspace.
> 
> Add exec_update_mutex that is only held while exec is updating the
> process with the new contents of exec, so that code that must not be
> confused by exec changing the mm and the cred in ways that cannot
> happen during ordinary execution of a process can take it.
> 
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one.  This lets us move forward while still
> being careful and not introducing any regressions.
> 
> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c                    | 4 ++++
>  include/linux/sched/signal.h | 9 ++++++++-
>  kernel/fork.c                | 1 +
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c243f9660d46..ad7b518f906d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	mutex_lock(&current->signal->exec_update_mutex);
>  	bprm->unrecoverable = true;
>  	sig->group_exit_task = NULL;
>  	sig->notify_count = 0;
> @@ -1425,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> +		if (bprm->unrecoverable)
> +			mutex_unlock(&current->signal->exec_update_mutex);
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> @@ -1474,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>  	 * credentials; any time after this it may be unlocked.
>  	 */
>  	security_bprm_committed_creds(bprm);
> +	mutex_unlock(&current->signal->exec_update_mutex);
>  	mutex_unlock(&current->signal->cred_guard_mutex);
>  }
>  EXPORT_SYMBOL(install_exec_creds);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 88050259c466..a29df79540ce 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -224,7 +224,14 @@ struct signal_struct {
>  
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
> -					 * (notably. ptrace) */
> +					 * (notably. ptrace)
> +					 * Deprecated do not use in new code.
> +					 * Use exec_update_mutex instead.
> +					 */
> +	struct mutex exec_update_mutex;	/* Held while task_struct is being
> +					 * updated during exec, and may have
> +					 * inconsistent permissions.
> +					 */
>  } __randomize_layout;
>  
>  /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 60a1295f4384..12896a6ecee6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>  
>  	mutex_init(&sig->cred_guard_mutex);
> +	mutex_init(&sig->exec_update_mutex);
>  
>  	return 0;
>  }
> 
Don't you need to add something like this to init/init_task.c?

.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),


Bernd.
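
The concern is that a statically defined object never passes through
the runtime init path (copy_signal() in the patch), so its mutex needs
a static initializer.  A userspace analogue with pthreads might look
like this; struct signal_model, init_signals_model and the helper
functions are hypothetical names used only for illustration.

```c
#include <pthread.h>

/* Hypothetical stand-in for the two mutexes in struct signal_struct. */
struct signal_model {
	pthread_mutex_t cred_guard_mutex;
	pthread_mutex_t exec_update_mutex;
};

/*
 * Statically defined instance: no init function ever runs for it,
 * so every mutex needs a static initializer.
 */
struct signal_model init_signals_model = {
	.cred_guard_mutex  = PTHREAD_MUTEX_INITIALIZER,
	.exec_update_mutex = PTHREAD_MUTEX_INITIALIZER,
};

/* Runtime-created instance: the analogue of the copy_signal() path. */
int signal_model_init(struct signal_model *sig)
{
	int err = pthread_mutex_init(&sig->cred_guard_mutex, NULL);

	if (err)
		return err;
	return pthread_mutex_init(&sig->exec_update_mutex, NULL);
}

/* Take and release the new mutex; returns 0 if both steps succeed. */
int signal_model_lock_unlock(struct signal_model *sig)
{
	int err = pthread_mutex_lock(&sig->exec_update_mutex);

	if (err)
		return err;
	return pthread_mutex_unlock(&sig->exec_update_mutex);
}
```

Both the static instance and one set up through signal_model_init()
can then be locked the same way.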


* Re: [PATCH 0/2] Infrastructure to allow fixing exec deadlocks
  2020-03-05 21:14                                             ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Eric W. Biederman
  2020-03-05 21:15                                               ` [PATCH 1/2] exec: Properly mark the point of no return Eric W. Biederman
  2020-03-05 21:16                                               ` [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
@ 2020-03-05 22:31                                               ` Bernd Edlinger
  2020-03-06  5:06                                                 ` Eric W. Biederman
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
  3 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-05 22:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api



On 3/5/20 10:14 PM, Eric W. Biederman wrote:
> 
> Bernd, everyone
> 
> This is how I think the infrastructure change should look that makes way
> for fixing this issue.
> 
> - Correct the point of no return.
> - Add a new mutex to replace cred_guard_mutex
> 
> Then I think it is just going through the existing
> users of cred_guard_mutex and fixing them to use the new one.
> 
> There really aren't that many users of cred_guard_mutex so we should be
> able to get through the easy ones fairly quickly.  And anything that
> isn't easy we can wait until we have a good fix.
> 
> The users of cred_guard_mutex that I saw were:
>     fs/proc/base.c:
>        proc_pid_attr_write
>        do_io_accounting
>        proc_pid_stack
>        proc_pid_syscall
>        proc_pid_personality
>     
>     perf_event_open
>     mm_access
>     kcmp
>     pidfd_fget
>     seccomp_set_mode_filter
> 
> Bernd does this make sense to you?  
> 
> I think we can fix the seccomp/no_new_privs issue with some careful
> refactoring.  We can probably do the same for ptrace but that appears
> to need a little lsm bug fixing.
> 

Yes, for most functions the proposed "exec_update_mutex" is fine,
but ptrace_attach, seccomp_set_mode_filter and proc_pid_attr_write
need to be blocked for the whole exec duration, so they need a second
"mutex" with deadlock detection as in my previous patch, if I see
that right.

Unfortunately only one of the two test cases can be fixed without the
second mutex; of course the mm_access case is what causes the practical
problem.

Currently, for the unlimited user space delay, I have only the case of
a ptraced sibling thread on my radar: de_thread waits for the parent
to call wait in this case, and that can literally take forever.
But I know that a PTRACE_CONT may also be needed after a PTRACE_EVENT_EXIT.

Can you explain what else in user space can go wrong and cause an
unlimited delay in the execve?


Bernd.

* Re: [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-05 21:15                                               ` [PATCH 1/2] exec: Properly mark the point of no return Eric W. Biederman
@ 2020-03-05 22:34                                                 ` Bernd Edlinger
  2020-03-06  5:19                                                   ` Eric W. Biederman
  2020-03-05 22:56                                                 ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-05 22:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/5/20 10:15 PM, Eric W. Biederman wrote:
> 
> Add a flag binfmt->unrecoverable to mark when execution has gotten to
> the point where it is impossible to return to userspace with the
> calling process unchanged.
> 
> While techinically this state starts as soon as de_thread starts
> killing threads, the only return path at that point is if there is a
> fatal signal pending.  I have choosen instead to set unrecoverable
> when the killing stops, and there are possibilities of failures other
> than fatal signals.  In particular it is possible for the allocation
> of a new sighand structure to fail.
> 
> Setting unrecoverable at this point has the benefit that other actions
> can be taken after the other threads are all dead, and the
> unrecoverable flag can double as a flag that those actions have been
> taken.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c               | 7 ++++---
>  include/linux/binfmts.h | 7 ++++++-
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..c243f9660d46 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
>   * disturbing other processes.  (Other processes might share the signal
>   * table via the CLONE_SIGHAND option to clone().)
>   */
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
> @@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	bprm->unrecoverable = true;
>  	sig->group_exit_task = NULL;
>  	sig->notify_count = 0;
>  
> @@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 * Make sure we have a private signal table and that
>  	 * we are unassociated from the previous thread group.
>  	 */
> -	retval = de_thread(current);
> +	retval = de_thread(bprm, current);

Can we get rid of passing current as a parameter here?

Thanks
Bernd.

>  	if (retval)
>  		goto out;
>  
> @@ -1664,7 +1665,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>  
>  		read_lock(&binfmt_lock);
>  		put_binfmt(fmt);
> -		if (retval < 0 && !bprm->mm) {
> +		if (retval < 0 && bprm->unrecoverable) {
>  			/* we got to flush_old_exec() and failed after it */
>  			read_unlock(&binfmt_lock);
>  			force_sigsegv(SIGSEGV);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b40fc633f3be..12263115ce7a 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,12 @@ struct linux_binprm {
>  		 * exec has happened. Used to sanitize execution environment
>  		 * and to set AT_SECURE auxv for glibc.
>  		 */
> -		secureexec:1;
> +		secureexec:1,
> +		/*
> +		 * Set when changes have been made that prevent returning
> +		 * to userspace.
> +		 */
> +		unrecoverable:1;
>  #ifdef __alpha__
>  	unsigned int taso:1;
>  #endif
> 

* Re: [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-05 21:15                                               ` [PATCH 1/2] exec: Properly mark the point of no return Eric W. Biederman
  2020-03-05 22:34                                                 ` Bernd Edlinger
@ 2020-03-05 22:56                                                 ` Bernd Edlinger
  2020-03-06  5:09                                                   ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-05 22:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/5/20 10:15 PM, Eric W. Biederman wrote:
> 
> Add a flag binfmt->unrecoverable to mark when execution has gotten to
> the point where it is impossible to return to userspace with the
> calling process unchanged.
> 
> While techinically this state starts as soon as de_thread starts
> killing threads, the only return path at that point is if there is a
> fatal signal pending.  I have choosen instead to set unrecoverable
> when the killing stops, and there are possibilities of failures other
> than fatal signals.  In particular it is possible for the allocation
> of a new sighand structure to fail.
> 
> Setting unrecoverable at this point has the benefit that other actions
> can be taken after the other threads are all dead, and the
> unrecoverable flag can double as a flag that those actions have been
> taken.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c               | 7 ++++---
>  include/linux/binfmts.h | 7 ++++++-
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..c243f9660d46 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
>   * disturbing other processes.  (Other processes might share the signal
>   * table via the CLONE_SIGHAND option to clone().)
>   */
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
> @@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	bprm->unrecoverable = true;
>  	sig->group_exit_task = NULL;
>  	sig->notify_count = 0;
>  

ah, sorry, 
        if (thread_group_empty(tsk))
                goto no_thread_group;
will skip this:

        sig->group_exit_task = NULL;
        sig->notify_count = 0;

no_thread_group:
        /* we have changed execution domain */
        tsk->exit_signal = SIGCHLD;

so I think the bprm->unrecoverable = true; should be here?
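
To make the control flow concrete, here is a minimal user-space sketch,
not kernel code. The names toy_bprm, toy_de_thread_posted and
toy_de_thread_moved are hypothetical stand-ins for struct linux_binprm
and de_thread; only the position of the flag assignment relative to the
no_thread_group label is modelled, everything else is elided.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one field of linux_binprm that matters. */
struct toy_bprm {
	bool unrecoverable;
};

/* Placement as in the posted patch: the flag is set before the label. */
static int toy_de_thread_posted(struct toy_bprm *bprm, bool thread_group_empty)
{
	if (thread_group_empty)
		goto no_thread_group;

	/* ... kill the other threads and wait for them (elided) ... */

	bprm->unrecoverable = true;	/* skipped when the goto is taken */

no_thread_group:
	/* we have changed execution domain */
	return 0;
}

/* Suggested placement: the flag is set after the label. */
static int toy_de_thread_moved(struct toy_bprm *bprm, bool thread_group_empty)
{
	if (thread_group_empty)
		goto no_thread_group;

	/* ... kill the other threads and wait for them (elided) ... */

no_thread_group:
	/* we have changed execution domain */
	bprm->unrecoverable = true;	/* reached on both paths */
	return 0;
}
```

For a single-threaded caller (thread_group_empty true) the first
variant returns with the flag still clear, while the second one sets it
on both paths.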


Bernd.
> @@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 * Make sure we have a private signal table and that
>  	 * we are unassociated from the previous thread group.
>  	 */
> -	retval = de_thread(current);
> +	retval = de_thread(bprm, current);
>  	if (retval)
>  		goto out;
>  
> @@ -1664,7 +1665,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>  
>  		read_lock(&binfmt_lock);
>  		put_binfmt(fmt);
> -		if (retval < 0 && !bprm->mm) {
> +		if (retval < 0 && bprm->unrecoverable) {
>  			/* we got to flush_old_exec() and failed after it */
>  			read_unlock(&binfmt_lock);
>  			force_sigsegv(SIGSEGV);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b40fc633f3be..12263115ce7a 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,12 @@ struct linux_binprm {
>  		 * exec has happened. Used to sanitize execution environment
>  		 * and to set AT_SECURE auxv for glibc.
>  		 */
> -		secureexec:1;
> +		secureexec:1,
> +		/*
> +		 * Set when changes have been made that prevent returning
> +		 * to userspace.
> +		 */
> +		unrecoverable:1;
>  #ifdef __alpha__
>  	unsigned int taso:1;
>  #endif
> 

* Re: [PATCH 0/2] Infrastructure to allow fixing exec deadlocks
  2020-03-05 22:31                                               ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Bernd Edlinger
@ 2020-03-06  5:06                                                 ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06  5:06 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/5/20 10:14 PM, Eric W. Biederman wrote:
>> 
>> Bernd, everyone
>> 
>> This is how I think the infrastructure change should look that makes way
>> for fixing this issue.
>> 
>> - Correct the point of no return.
>> - Add a new mutex to replace cred_guard_mutex
>> 
>> Then I think it is just going through the existing
>> users of cred_guard_mutex and fixing them to use the new one.
>> 
>> There really aren't that many users of cred_guard_mutex so we should be
>> able to get through the easy ones fairly quickly.  And anything that
>> isn't easy we can wait until we have a good fix.
>> 
>> The users of cred_guard_mutex that I saw were:
>>     fs/proc/base.c:
>>        proc_pid_attr_write
>>        do_io_accounting
>>        proc_pid_stack
>>        proc_pid_syscall
>>        proc_pid_personality
>>     
>>     perf_event_open
>>     mm_access
>>     kcmp
>>     pidfd_fget
>>     seccomp_set_mode_filter
>> 
>> Bernd does this make sense to you?  
>> 
>> I think we can fix the seccomp/no_new_privs issue with some careful
>> refactoring.  We can probably do the same for ptrace but that appears
>> to need a little lsm bug fixing.
>> 
>
> Yes, for most functions the proposed "exec_update_mutex" is fine,
> but we will need a longer-time block for ptrace_attach, seccomp_set_mode_filter
> and proc_pid_attr_write need to be blocked for the whole exec duration so
> they need a second "mutex", with deadlock-detection as in my previous patch,
> if I see that right.

So far I am leaving "cred_guard_mutex" as that second "mutex".  My sense
is that when all we have left are the hard cases we can take those
cases one at a time, examine them in detail, and see what really can
be done.

> Unfortunately only one of the two test cases can be fixed without the
> second mutex, of course the mm_access is what cause the practical problem.

Fixing the practical problems is foremost on my agenda.
That, and clearing away enough of the noise that we can really focus on
the hard problems when we begin to address them.

That way I am hoping we can really solve some of these issues and make
them go away.

> Currently for the unlimited user space delay, I have only the case of
> a ptraced sibling thread on my radar, de_thread waits for the parent
> to call wait in this case, that can literally take forever.
> But I know that also PTRACE_CONT may be needed after a PTRACE_EVENT_EXIT.
>
> Can you explain what else in the user space can go wrong to make an
> unlimited delay in the execve?

Triggering a page fault.  Depending on the backing store, or possibly
with the use of userfaultfd, that page fault can be delayed indefinitely
and be pretty much as bad as the ptrace case.

Eric

* Re: [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-05 22:56                                                 ` Bernd Edlinger
@ 2020-03-06  5:09                                                   ` Eric W. Biederman
  2020-03-06 16:26                                                     ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06  5:09 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/5/20 10:15 PM, Eric W. Biederman wrote:
>> 
>> Add a flag binfmt->unrecoverable to mark when execution has gotten to
>> the point where it is impossible to return to userspace with the
>> calling process unchanged.
>> 
>> While techinically this state starts as soon as de_thread starts
>> killing threads, the only return path at that point is if there is a
>> fatal signal pending.  I have choosen instead to set unrecoverable
>> when the killing stops, and there are possibilities of failures other
>> than fatal signals.  In particular it is possible for the allocation
>> of a new sighand structure to fail.
>> 
>> Setting unrecoverable at this point has the benefit that other actions
>> can be taken after the other threads are all dead, and the
>> unrecoverable flag can double as a flag that those actions have been
>> taken.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c               | 7 ++++---
>>  include/linux/binfmts.h | 7 ++++++-
>>  2 files changed, 10 insertions(+), 4 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index db17be51b112..c243f9660d46 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
>>   * disturbing other processes.  (Other processes might share the signal
>>   * table via the CLONE_SIGHAND option to clone().)
>>   */
>> -static int de_thread(struct task_struct *tsk)
>> +static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>>  {
>>  	struct signal_struct *sig = tsk->signal;
>>  	struct sighand_struct *oldsighand = tsk->sighand;
>> @@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk)
>>  		release_task(leader);
>>  	}
>>  
>> +	bprm->unrecoverable = true;
>>  	sig->group_exit_task = NULL;
>>  	sig->notify_count = 0;
>>  
>
> ah, sorry, 
>         if (thread_group_empty(tsk))
>                 goto no_thread_group;
> will skip this:
>
>         sig->group_exit_task = NULL;
>         sig->notify_count = 0;
>
> no_thread_group:
>         /* we have changed execution domain */
>         tsk->exit_signal = SIGCHLD;
>
> so I think the bprm->unrecoverable = true; should be here?

Absolutely.  Thank you very much.

This is why I try to keep each patch to one clear, simple thing, so that
silly thinkos like that can be caught.

Eric

* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-05 21:51                                                 ` Bernd Edlinger
@ 2020-03-06  5:17                                                   ` Eric W. Biederman
  2020-03-06 11:46                                                     ` Bernd Edlinger
  2020-03-06 19:16                                                     ` Bernd Edlinger
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06  5:17 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>> 
>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> threads are killed.  The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> 
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>> 
>> Add exec_update_mutex that is only held over exec updating process
>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process can take.
>> 
>> The plan is to switch the users of cred_guard_mutex to
>> exed_udpate_mutex one by one.  This lets us move forward while still
>> being careful and not introducing any regressions.
>> 
>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c                    | 4 ++++
>>  include/linux/sched/signal.h | 9 ++++++++-
>>  kernel/fork.c                | 1 +
>>  3 files changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index c243f9660d46..ad7b518f906d 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>>  		release_task(leader);
>>  	}
>>  
>> +	mutex_lock(&current->signal->exec_update_mutex);
>>  	bprm->unrecoverable = true;
>>  	sig->group_exit_task = NULL;
>>  	sig->notify_count = 0;
>> @@ -1425,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
>>  {
>>  	free_arg_pages(bprm);
>>  	if (bprm->cred) {
>> +		if (bprm->unrecoverable)
>> +			mutex_unlock(&current->signal->exec_update_mutex);
>>  		mutex_unlock(&current->signal->cred_guard_mutex);
>>  		abort_creds(bprm->cred);
>>  	}
>> @@ -1474,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>>  	 * credentials; any time after this it may be unlocked.
>>  	 */
>>  	security_bprm_committed_creds(bprm);
>> +	mutex_unlock(&current->signal->exec_update_mutex);
>>  	mutex_unlock(&current->signal->cred_guard_mutex);
>>  }
>>  EXPORT_SYMBOL(install_exec_creds);
>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index 88050259c466..a29df79540ce 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -224,7 +224,14 @@ struct signal_struct {
>>  
>>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>>  					 * credential calculations
>> -					 * (notably. ptrace) */
>> +					 * (notably. ptrace)
>> +					 * Deprecated do not use in new code.
>> +					 * Use exec_update_mutex instead.
>> +					 */
>> +	struct mutex exec_update_mutex;	/* Held while task_struct is being
>> +					 * updated during exec, and may have
>> +					 * inconsistent permissions.
>> +					 */
>>  } __randomize_layout;
>>  
>>  /*
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 60a1295f4384..12896a6ecee6 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>>  
>>  	mutex_init(&sig->cred_guard_mutex);
>> +	mutex_init(&sig->exec_update_mutex);
>>  
>>  	return 0;
>>  }
>> 
> Don't you need to add something like this to init/init_task.c ?
>
> .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),

Yes.  I overlooked that.  Thank you.
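
For readers following along, a small user-space sketch of why a
statically defined object needs its own initializer. This is not kernel
code: toy_mutex, TOY_MUTEX_INITIALIZER, toy_signal and the two
toy_init_signals_* objects are hypothetical stand-ins for struct mutex,
__MUTEX_INITIALIZER, signal_struct and init_signals. The "magic"
self-pointer is only an assumption used here to make a missing
initialization observable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: 'magic' points back at the lock once it has been set up. */
struct toy_mutex {
	struct toy_mutex *magic;
	bool locked;
};

/* Compile-time initializer, in the spirit of __MUTEX_INITIALIZER(name). */
#define TOY_MUTEX_INITIALIZER(name) { .magic = &(name), .locked = false }

/* Run-time initializer, in the spirit of mutex_init(). */
static void toy_mutex_init(struct toy_mutex *m)
{
	m->magic = m;
	m->locked = false;
}

/* Returns 0 on success, -1 if the lock was never initialized. */
static int toy_mutex_lock(struct toy_mutex *m)
{
	if (m->magic != m)
		return -1;
	m->locked = true;
	return 0;
}

struct toy_signal {
	struct toy_mutex cred_guard_mutex;
	struct toy_mutex exec_update_mutex;
};

/* Run-time path: every forked task would pass through something like this. */
static void toy_copy_signal(struct toy_signal *sig)
{
	toy_mutex_init(&sig->cred_guard_mutex);
	toy_mutex_init(&sig->exec_update_mutex);
}

/* Static object that never sees toy_copy_signal(); new member forgotten. */
static struct toy_signal toy_init_signals_missing = {
	.cred_guard_mutex =
		TOY_MUTEX_INITIALIZER(toy_init_signals_missing.cred_guard_mutex),
};

/* Same object with the new member given its static initializer. */
static struct toy_signal toy_init_signals_fixed = {
	.cred_guard_mutex =
		TOY_MUTEX_INITIALIZER(toy_init_signals_fixed.cred_guard_mutex),
	.exec_update_mutex =
		TOY_MUTEX_INITIALIZER(toy_init_signals_fixed.exec_update_mutex),
};
```

In this model, locking exec_update_mutex fails on the "missing" object
and succeeds on the "fixed" one and on any object that went through
toy_copy_signal().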

Eric


* Re: [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-05 22:34                                                 ` Bernd Edlinger
@ 2020-03-06  5:19                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06  5:19 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/5/20 10:15 PM, Eric W. Biederman wrote:
>> @@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	 * Make sure we have a private signal table and that
>>  	 * we are unassociated from the previous thread group.
>>  	 */
>> -	retval = de_thread(current);
>> +	retval = de_thread(bprm, current);
>
> can we get rid of passing current as parameter here?

With a separate patch.  It makes the patch less clear if I make that
change in this one.

Eric

* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-06  5:17                                                   ` Eric W. Biederman
@ 2020-03-06 11:46                                                     ` Bernd Edlinger
  2020-03-06 21:18                                                       ` Eric W. Biederman
  2020-03-06 19:16                                                     ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-06 11:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api



Am 06.03.20 um 06:17 schrieb Eric W. Biederman:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>> threads are killed.  The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possible indefinite userspace waits for userspace.
>>>
>>> Add exec_update_mutex that is only held over exec updating process
>>> with the new contents of exec, so that code that needs not to be
>>> confused by exec changing the mm and the cred in ways that can not
>>> happen during ordinary execution of a process can take.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exed_udpate_mutex one by one.  This lets us move forward while still
>>> being careful and not introducing any regressions.
>>>
>>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>  fs/exec.c                    | 4 ++++
>>>  include/linux/sched/signal.h | 9 ++++++++-
>>>  kernel/fork.c                | 1 +
>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index c243f9660d46..ad7b518f906d 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>>>  		release_task(leader);
>>>  	}
>>>  
>>> +	mutex_lock(&current->signal->exec_update_mutex);

And by the way, could you make this mutex_lock_killable?
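
A rough user-space model of what the killable variant would mean for
the caller, under stated assumptions. toy_exec, toy_lock_killable,
toy_de_thread_tail and toy_free_bprm are hypothetical names; the only
kernel behaviour relied on is that mutex_lock_killable() returns 0 on
success and -EINTR when the wait is interrupted by a fatal signal. The
point of the sketch is that the error has to be checked before the
unrecoverable flag is set, because the posted free_bprm() uses that
flag to decide whether to unlock.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state that matters here. */
struct toy_exec {
	bool lock_contended;		/* someone else holds the lock */
	bool fatal_signal_pending;	/* e.g. SIGKILL queued for the caller */
	bool lock_held;			/* models exec_update_mutex ownership */
	bool unrecoverable;		/* models bprm->unrecoverable */
};

/*
 * Stand-in for mutex_lock_killable(): 0 on success, -EINTR when the
 * caller would have to wait and a fatal signal is pending.  A contended
 * lock without a fatal signal is treated as "waited and acquired".
 */
static int toy_lock_killable(struct toy_exec *e)
{
	if (e->lock_contended && e->fatal_signal_pending)
		return -EINTR;
	e->lock_held = true;
	return 0;
}

/* Tail of a de_thread-like function using the killable variant. */
static int toy_de_thread_tail(struct toy_exec *e)
{
	int ret = toy_lock_killable(e);

	if (ret)
		return ret;	/* flag stays clear, nothing to unlock later */

	e->unrecoverable = true;
	return 0;
}

/* Cleanup in the style of the posted free_bprm(): unlock only if flagged. */
static void toy_free_bprm(struct toy_exec *e)
{
	if (e->unrecoverable)
		e->lock_held = false;
}
```

On the -EINTR path the flag is never set, so the cleanup does not try
to release a lock that was never taken.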


Bernd.

* Re: [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-06  5:09                                                   ` Eric W. Biederman
@ 2020-03-06 16:26                                                     ` Bernd Edlinger
  2020-03-06 17:16                                                       ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-06 16:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/6/20 6:09 AM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/5/20 10:15 PM, Eric W. Biederman wrote:
>>>
>>> Add a flag binfmt->unrecoverable to mark when execution has gotten to
>>> the point where it is impossible to return to userspace with the
>>> calling process unchanged.
>>>
>>> While techinically this state starts as soon as de_thread starts

typo: s/techinically/technically/

>>> killing threads, the only return path at that point is if there is a
>>> fatal signal pending.  I have choosen instead to set unrecoverable

I'm not good at English; should this be "chosen"?


Bernd.

* Re: [PATCH 1/2] exec: Properly mark the point of no return
  2020-03-06 16:26                                                     ` Bernd Edlinger
@ 2020-03-06 17:16                                                       ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06 17:16 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/6/20 6:09 AM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/5/20 10:15 PM, Eric W. Biederman wrote:
>>>>
>>>> Add a flag binfmt->unrecoverable to mark when execution has gotten to
>>>> the point where it is impossible to return to userspace with the
>>>> calling process unchanged.
>>>>
>>>> While techinically this state starts as soon as de_thread starts
>
> typo: s/techinically/technically/

>>>> killing threads, the only return path at that point is if there is a
>>>> fatal signal pending.  I have choosen instead to set unrecoverable
>
> I'm not good at english, is this chosen ?
>

Yes.  Definitely worth fixing.

Eric

* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-06  5:17                                                   ` Eric W. Biederman
  2020-03-06 11:46                                                     ` Bernd Edlinger
@ 2020-03-06 19:16                                                     ` Bernd Edlinger
  2020-03-06 21:58                                                       ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-06 19:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/6/20 6:17 AM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>> threads are killed.  The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>

I am all for this patch, and the direction it is heading, Eric.

I just wanted to add a note: I think exec_mm_release can also invoke
put_user(0, tsk->clear_child_tid) under the new exec_update_mutex.
mm_access increments mm->mm_users under the cred_guard_mutex and then
releases the mutex, but the caller can hold the reference for a while,
so exec_mmap may not be releasing the last reference.


Bernd.


* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-06 11:46                                                     ` Bernd Edlinger
@ 2020-03-06 21:18                                                       ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06 21:18 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> Am 06.03.20 um 06:17 schrieb Eric W. Biederman:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>> threads are killed.  The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over possibly indefinite waits for userspace.
>>>>
>>>> Add exec_update_mutex, which is held only while exec updates the
>>>> process with the new contents of exec, so that code that must not be
>>>> confused by exec changing the mm and the cred in ways that cannot
>>>> happen during ordinary execution of a process has a lock it can take.
>>>>
>>>> The plan is to switch the users of cred_guard_mutex to
>>>> exec_update_mutex one by one.  This lets us move forward while still
>>>> being careful and not introducing any regressions.
>>>>
>>>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> ---
>>>>  fs/exec.c                    | 4 ++++
>>>>  include/linux/sched/signal.h | 9 ++++++++-
>>>>  kernel/fork.c                | 1 +
>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index c243f9660d46..ad7b518f906d 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
>>>>  		release_task(leader);
>>>>  	}
>>>>  
>>>> +	mutex_lock(&current->signal->exec_update_mutex);
>
> And by the way, could you make this mutex_lock_killable?

For some reason, when I first read this suggestion I thought making this
mutex_lock_killable would force me to rework the logic of when I set
unrecoverable and when I unlock the mutex.  I blame a tired brain.
If a process has received a fatal signal none of that matters.

So yes, I will do that, just to make things robust in case we miss
something that would still make it possible to deadlock with the new
mutex.

I am a little worried that the new mutex might still cover a little too
much.  But past a certain point we are not going to be able to make this
code perfect in the first change.  The best we can do is to be careful
and avoid regressions.  Whatever slips through we can fix when we spot
the problem.

Eric


* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-06 19:16                                                     ` Bernd Edlinger
@ 2020-03-06 21:58                                                       ` Eric W. Biederman
  2020-03-06 22:29                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06 21:58 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/6/20 6:17 AM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>> threads are killed.  The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>
> I am all for this patch, and the direction it is heading, Eric.
>
> I just wanted to add a note: I think exec_mm_release can also invoke
> put_user(0, tsk->clear_child_tid) under the new exec_update_mutex.
> mm_access increments mm->mm_users under the cred_guard_mutex and then
> releases the mutex, but the caller can hold the reference for a while,
> so exec_mmap may not be releasing the last reference.

Good catch.  I really appreciate your close look at the details.

I am wondering if process_vm_readv and process_vm_writev could be
safely changed to use mmgrab and mmdrop, instead of mmget and mmput.

That would resolve the potential issue you have pointed out.  I just
haven't figured out if it is safe.  The mm code has been seriously
refactored since I knew how it all worked.

Eric


* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-06 21:58                                                       ` Eric W. Biederman
@ 2020-03-06 22:29                                                         ` Eric W. Biederman
  2020-03-07  1:03                                                           ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-06 22:29 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
>> On 3/6/20 6:17 AM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>> 
>>>> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>>>>>
>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>
>> I am all for this patch, and the direction it is heading, Eric.
>>
>> I just wanted to add a note: I think exec_mm_release can also invoke
>> put_user(0, tsk->clear_child_tid) under the new exec_update_mutex.
>> mm_access increments mm->mm_users under the cred_guard_mutex and then
>> releases the mutex, but the caller can hold the reference for a while,
>> so exec_mmap may not be releasing the last reference.
>
> Good catch.  I really appreciate your close look at the details.
>
> I am wondering if process_vm_readv and process_vm_writev could be
> safely changed to use mmgrab and mmdrop, instead of mmget and mmput.
>
> That would resolve the potential issue you have pointed out.  I just
> haven't figured out if it is safe.  The mm code has been seriously
> refactored since I knew how it all worked.

Nope, mmget can not be replaced by mmgrab.

It might be possible to do something creative like store a cred in place
of the userns on the mm and use that for mm_access permission checks.
Still we are talking a pretty narrow window, and a case that no one has
figured out how to trigger yet.  So I will leave that corner case as
something for future improvements.
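
For anyone following along, the difference between the two reference
counts can be sketched in userspace.  This is only a toy model, not
kernel code: struct toy_mm and the toy_* helpers are hypothetical names
made up for illustration.  They mirror only the usual roles of the two
counters, where mm_users (mmget/mmput) keeps the address space alive and
mm_count (mmgrab/mmdrop) keeps just the structure alive.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct mm_struct; only the two counters matter here. */
struct toy_mm {
	int mm_users;	/* users of the address space (mmget/mmput) */
	int mm_count;	/* references to the struct itself (mmgrab/mmdrop) */
	bool mapped;	/* true while the mappings still exist */
	bool freed;	/* set instead of freeing, so callers can inspect it */
};

static void toy_mm_init(struct toy_mm *mm)
{
	mm->mm_users = 1;	/* the owning task uses the address space */
	mm->mm_count = 1;	/* all mm_users together hold one mm_count */
	mm->mapped = true;
	mm->freed = false;
}

static void toy_mmgrab(struct toy_mm *mm)
{
	mm->mm_count++;
}

static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->mm_count == 0)
		mm->freed = true;
}

static void toy_mmget(struct toy_mm *mm)
{
	mm->mm_users++;
}

static void toy_mmput(struct toy_mm *mm)
{
	if (--mm->mm_users == 0) {
		mm->mapped = false;	/* last user gone: mappings torn down */
		toy_mmdrop(mm);		/* drop the reference mm_users held */
	}
}

/* A reader pins the mm with "mmgrab", then the owner execs away from it. */
static bool mappings_survive_exec_with_mmgrab(void)
{
	struct toy_mm mm;
	bool survived;

	toy_mm_init(&mm);
	toy_mmgrab(&mm);	/* reader pins only the struct */
	toy_mmput(&mm);		/* exec drops the owner's mm_users reference */
	survived = mm.mapped;	/* false: nothing left to read */
	toy_mmdrop(&mm);	/* reader is done, struct can go */
	return survived;
}

/* Same scenario, but the reader pins the mm with "mmget". */
static bool mappings_survive_exec_with_mmget(void)
{
	struct toy_mm mm;
	bool survived;

	toy_mm_init(&mm);
	toy_mmget(&mm);		/* reader pins the address space */
	toy_mmput(&mm);		/* exec's mmput is not the last reference */
	survived = mm.mapped;	/* true: reader can still copy memory */
	toy_mmput(&mm);		/* reader is done, mappings torn down now */
	return survived;
}
```

In this model the mmgrab reader finds the mappings gone as soon as the
owner drops its reference, which is the sort of reason a remote reader
needs mmget.  The mmget case also shows Bernd's point: while the reader
holds its reference, exec is not dropping the last mm_users reference.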

Eric



* Re: [PATCH 2/2] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-06 22:29                                                         ` Eric W. Biederman
@ 2020-03-07  1:03                                                           ` Eric W. Biederman
  2020-03-08 12:58                                                             ` [PATCH] exec: make de_thread alloc new signal struct earlier Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-07  1:03 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>
>>> On 3/6/20 6:17 AM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>> 
>>>>> On 3/5/20 10:16 PM, Eric W. Biederman wrote:
>>>>>>
>>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>>
>>>
>>> I am all for this patch, and the direction it is heading, Eric.
>>>
>>> I just wanted to add a note: I think exec_mm_release can also invoke
>>> put_user(0, tsk->clear_child_tid) under the new exec_update_mutex.
>>> mm_access increments mm->mm_users under the cred_guard_mutex and then
>>> releases the mutex, but the caller can hold the reference for a while,
>>> so exec_mmap may not be releasing the last reference.
>>
>> Good catch.  I really appreciate your close look at the details.
>>
>> I am wondering if process_vm_readv and process_vm_writev could be
>> safely changed to use mmgrab and mmdrop, instead of mmget and mmput.
>>
>> That would resolve the potential issue you have pointed out.  I just
>> haven't figured out if it is safe.  The mm code has been seriously
>> refactored since I knew how it all worked.
>
> Nope, mmget can not be replaced by mmgrab.
>
> It might be possible to do something creative like store a cred in place
> of the userns on the mm and use that for mm_access permission checks.
> Still we are talking a pretty narrow window, and a case that no one has
> figured out how to trigger yet.  So I will leave that corner case as
> something for future improvements.

My brain is restless and keeps looking at it.

The worst case is processes created with CLONE_VM|CLONE_CHILD_CLEARTID
but not CLONE_THREAD.  For those, that put_user will occur every time
in exec_mmap.

The only solution that I can see is to move taking the new mutex after
exec_mm_release, which may be feasible given how closely exec_mmap
follows de_thread.

I am going to sleep on that and perhaps I will be able to see how to
move taking the mutex lower.

It would be very nice not to have a known issue going into this set of
changes.
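
The worst case above can be observed from userspace with a small
experiment.  This is a hedged sketch, not a test from this series: the
names child_tid_slot, exec_child and clear_child_tid_after_exec are made
up for illustration, and it assumes Linux with glibc's clone() wrapper,
a downward-growing stack, and a working /bin/sh.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in the mm shared through CLONE_VM; the kernel zeroes it on exec. */
static volatile pid_t child_tid_slot;

/* Child body: leave the shared mm by exec'ing a trivial shell command. */
static int exec_child(void *unused)
{
	char *const argv[] = { "sh", "-c", "exit 0", NULL };
	char *const envp[] = { NULL };

	(void)unused;
	execve("/bin/sh", argv, envp);
	return 127;	/* only reached if execve failed */
}

/*
 * Returns the value left in child_tid_slot once the child has exec'd and
 * been reaped, or -1 if the clone or the exec did not work.
 */
static int clear_child_tid_after_exec(void)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	int status;
	pid_t pid;

	if (!stack)
		return -1;

	child_tid_slot = 12345;	/* sentinel, expected to be overwritten */

	/* CLONE_VM without CLONE_THREAD: a separate process sharing our mm. */
	pid = clone(exec_child, stack + stack_size,
		    CLONE_VM | CLONE_CHILD_CLEARTID | SIGCHLD,
		    NULL, NULL, NULL, (pid_t *)&child_tid_slot);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	if (waitpid(pid, &status, 0) != pid) {
		free(stack);
		return -1;
	}
	free(stack);

	/* Exit status 0 means the shell ran, so the exec itself succeeded. */
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return (int)child_tid_slot;
}
```

Because the parent still uses the shared mm when the child execs, the
old mm has more than one user at that point, and the sentinel comes back
as 0.  That write into userspace memory is the put_user in question.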

Eric



* [PATCH] exec: make de_thread alloc new signal struct earlier
  2020-03-07  1:03                                                           ` Eric W. Biederman
@ 2020-03-08 12:58                                                             ` Bernd Edlinger
  2020-03-08 18:12                                                               ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-08 12:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

It was pointed out that de_thread may return -ENOMEM
when it has already terminated threads.  At that point,
returning an error from execve is no longer an option
unless a fatal signal is being delivered.

Allocate the memory for the signal table earlier,
and make sure that -ENOMEM is returned before the
unrecoverable actions are started.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
Eric, what do you think, might this be helpful
to move the "point of no return" lower, and simplify
your patch?

 fs/exec.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a0328dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1057,16 +1057,26 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
 {
+	struct task_struct *tsk = current;
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct sighand_struct *newsighand = NULL;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
 
 	/*
+	 * This is the last time for an out of memory error.
+	 * After this point only fatal signals are okay.
+	 */
+	newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+	if (!newsighand)
+		return -ENOMEM;
+
+	/*
 	 * Kill all other threads in the thread group.
 	 */
 	spin_lock_irq(lock);
@@ -1076,7 +1086,7 @@ static int de_thread(struct task_struct *tsk)
 		 * return so that the signal is processed.
 		 */
 		spin_unlock_irq(lock);
-		return -EAGAIN;
+		goto err_free;
 	}
 
 	sig->group_exit_task = tsk;
@@ -1191,14 +1201,16 @@ static int de_thread(struct task_struct *tsk)
 #endif
 
 	if (refcount_read(&oldsighand->count) != 1) {
-		struct sighand_struct *newsighand;
 		/*
 		 * This ->sighand is shared with the CLONE_SIGHAND
 		 * but not CLONE_THREAD task, switch to the new one.
 		 */
-		newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
-		if (!newsighand)
-			return -ENOMEM;
+		if (!newsighand) {
+			newsighand = kmem_cache_alloc(sighand_cachep,
+						      GFP_KERNEL);
+			if (!newsighand)
+				return -ENOMEM;
+		}
 
 		refcount_set(&newsighand->count, 1);
 		memcpy(newsighand->action, oldsighand->action,
@@ -1211,7 +1223,8 @@ static int de_thread(struct task_struct *tsk)
 		write_unlock_irq(&tasklist_lock);
 
 		__cleanup_sighand(oldsighand);
-	}
+	} else if (newsighand)
+		kmem_cache_free(sighand_cachep, newsighand);
 
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
@@ -1222,6 +1235,8 @@ static int de_thread(struct task_struct *tsk)
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
 	read_unlock(&tasklist_lock);
+err_free:
+	kmem_cache_free(sighand_cachep, newsighand);
 	return -EAGAIN;
 }
 
@@ -1262,7 +1277,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread();
 	if (retval)
 		goto out;
 
-- 
1.9.1


* Re: [PATCH] exec: make de_thread alloc new signal struct earlier
  2020-03-08 12:58                                                             ` [PATCH] exec: make de_thread alloc new signal struct earlier Bernd Edlinger
@ 2020-03-08 18:12                                                               ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 18:12 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> It was pointed out that de_thread may return -ENOMEM
> when it has already terminated threads.  At that point,
> returning an error from execve is no longer an option
> unless a fatal signal is being delivered.
>
> Allocate the memory for the signal table earlier,
> and make sure that -ENOMEM is returned before the
> unrecoverable actions are started.
>
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
> Eric, what do you think, might this be helpful
> to move the "point of no return" lower, and simplify
> your patch?

Good thinking but no.  In this case it is possible to move the entire
allocation lower.  As well as the posix timer cleanup.  That code is
actually much clearer outside of de_thread.  I will post a patch in that
direction in a moment.

It is something of a bad idea to move the new sighand allocation sooner
because in practice the allocation does not happen.  It only exists to
support the CLONE_VM | CLONE_SIGHAND without CLONE_SIGNAL case, which is
not used by the modern posix thread libraries.

There are just enough old executables floating out there that I don't
think we can remove the CLONE_SIGHAND case in general but I keep
dreaming about it.  We get a lot of complexity in the code to support
something that no one really does anymore.
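
For reference, the rare case being discussed (a task sharing its signal
handler table through CLONE_SIGHAND without being a CLONE_THREAD thread)
can be reproduced from userspace.  The following is only a hedged
sketch: ignore_sigusr1 and child_sigaction_visible_in_parent are
hypothetical names, and it assumes Linux with glibc's clone() wrapper
and a downward-growing stack.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Child body: change the disposition of SIGUSR1 to "ignore". */
static int ignore_sigusr1(void *unused)
{
	struct sigaction sa;

	(void)unused;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_IGN;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGUSR1, &sa, NULL) == 0 ? 0 : 1;
}

/*
 * Returns 1 if a sigaction() made by the child is visible in the parent,
 * 0 if it is not, and -1 on error.
 */
static int child_sigaction_visible_in_parent(int share_sighand)
{
	const size_t stack_size = 64 * 1024;
	int flags = CLONE_VM | SIGCHLD;	/* CLONE_SIGHAND requires CLONE_VM */
	struct sigaction dfl, cur;
	char *stack;
	int status, visible;
	pid_t pid;

	if (share_sighand)
		flags |= CLONE_SIGHAND;

	/* Start from a known disposition in the parent. */
	memset(&dfl, 0, sizeof(dfl));
	dfl.sa_handler = SIG_DFL;
	sigemptyset(&dfl.sa_mask);
	if (sigaction(SIGUSR1, &dfl, NULL) != 0)
		return -1;

	stack = malloc(stack_size);
	if (!stack)
		return -1;

	pid = clone(ignore_sigusr1, stack + stack_size, flags, NULL);
	if (pid < 0 || waitpid(pid, &status, 0) != pid) {
		free(stack);
		return -1;
	}
	free(stack);
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* Did the child's change land in the table the parent uses? */
	if (sigaction(SIGUSR1, NULL, &cur) != 0)
		return -1;
	visible = (cur.sa_handler == SIG_IGN);

	sigaction(SIGUSR1, &dfl, NULL);	/* restore the default */
	return visible;
}
```

With CLONE_SIGHAND the child's change shows up in the parent, so an exec
in either task has to give that task a private copy of the table first.
Without the flag each process already has its own copy and there is
nothing to unshare.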

Eric

>  fs/exec.c | 31 +++++++++++++++++++++++--------
>  1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 74d88da..a0328dc 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1057,16 +1057,26 @@ static int exec_mmap(struct mm_struct *mm)
>   * disturbing other processes.  (Other processes might share the signal
>   * table via the CLONE_SIGHAND option to clone().)
>   */
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(void)
>  {
> +	struct task_struct *tsk = current;
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct sighand_struct *newsighand = NULL;
>  
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
>  
>  	/*
> +	 * This is the last time for an out of memory error.
> +	 * After this point only fatal signals are okay.
> +	 */
> +	newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
> +	if (!newsighand)
> +		return -ENOMEM;
> +
> +	/*
>  	 * Kill all other threads in the thread group.
>  	 */
>  	spin_lock_irq(lock);
> @@ -1076,7 +1086,7 @@ static int de_thread(struct task_struct *tsk)
>  		 * return so that the signal is processed.
>  		 */
>  		spin_unlock_irq(lock);
> -		return -EAGAIN;
> +		goto err_free;
>  	}
>  
>  	sig->group_exit_task = tsk;
> @@ -1191,14 +1201,16 @@ static int de_thread(struct task_struct *tsk)
>  #endif
>  
>  	if (refcount_read(&oldsighand->count) != 1) {
> -		struct sighand_struct *newsighand;
>  		/*
>  		 * This ->sighand is shared with the CLONE_SIGHAND
>  		 * but not CLONE_THREAD task, switch to the new one.
>  		 */
> -		newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
> -		if (!newsighand)
> -			return -ENOMEM;
> +		if (!newsighand) {
> +			newsighand = kmem_cache_alloc(sighand_cachep,
> +						      GFP_KERNEL);
> +			if (!newsighand)
> +				return -ENOMEM;
> +		}
>  
>  		refcount_set(&newsighand->count, 1);
>  		memcpy(newsighand->action, oldsighand->action,
> @@ -1211,7 +1223,8 @@ static int de_thread(struct task_struct *tsk)
>  		write_unlock_irq(&tasklist_lock);
>  
>  		__cleanup_sighand(oldsighand);
> -	}
> +	} else if (newsighand)
> +		kmem_cache_free(sighand_cachep, newsighand);
>  
>  	BUG_ON(!thread_group_leader(tsk));
>  	return 0;
> @@ -1222,6 +1235,8 @@ static int de_thread(struct task_struct *tsk)
>  	sig->group_exit_task = NULL;
>  	sig->notify_count = 0;
>  	read_unlock(&tasklist_lock);
> +err_free:
> +	kmem_cache_free(sighand_cachep, newsighand);
>  	return -EAGAIN;
>  }
>  
> @@ -1262,7 +1277,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 * Make sure we have a private signal table and that
>  	 * we are unassociated from the previous thread group.
>  	 */
> -	retval = de_thread(current);
> +	retval = de_thread();
>  	if (retval)
>  		goto out;


* [PATCH 0/5] Infrastructure to allow fixing exec deadlocks
  2020-03-05 21:14                                             ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Eric W. Biederman
                                                                 ` (2 preceding siblings ...)
  2020-03-05 22:31                                               ` [PATCH 0/2] Infrastructure to allow fixing exec deadlocks Bernd Edlinger
@ 2020-03-08 21:34                                               ` Eric W. Biederman
  2020-03-08 21:35                                                 ` [PATCH v2 1/5] exec: Only compute current once in flush_old_exec Eric W. Biederman
                                                                   ` (5 more replies)
  3 siblings, 6 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 21:34 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


Bernd, everyone

This is how I think the infrastructure change should look that makes way
for fixing this issue.

- Clean up and reorder the code so code that can potentially wait
  indefinitely for userspace comes at the beginning of flush_old_exec.
- Add a new mutex and take it after we have passed any potential
  indefinite waits for userspace.

Then I think it is just going through the existing users of
cred_guard_mutex and fixing them to use the new one.

There really aren't that many users of cred_guard_mutex so we should be
able to get through the easy ones fairly quickly.  And anything that
isn't easy can wait until we have a good fix.

The users of cred_guard_mutex that I saw were:
    fs/proc/base.c:
       proc_pid_attr_write
       do_io_accounting
       proc_pid_stack
       proc_pid_syscall
       proc_pid_personality
    
    perf_event_open
    mm_access
    kcmp
    pidfd_fget
    seccomp_set_mode_filter

Bernd, I think I have addressed the issues you pointed out in v1.
Please let me know if you see anything else.

Eric W. Biederman (5):
      exec: Only compute current once in flush_old_exec
      exec: Factor unshare_sighand out of de_thread and call it separately
      exec: Move cleanup of posix timers on exec out of de_thread
      exec: Move exec_mmap right after de_thread in flush_old_exec
      exec: Add a exec_update_mutex to replace cred_guard_mutex

 fs/exec.c                    | 65 ++++++++++++++++++++++++++++++--------------
 include/linux/sched/signal.h |  9 +++++-
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 4 files changed, 54 insertions(+), 22 deletions(-)




* [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
@ 2020-03-08 21:35                                                 ` Eric W. Biederman
  2020-03-09 13:56                                                   ` Bernd Edlinger
                                                                     ` (2 more replies)
  2020-03-08 21:36                                                 ` [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately Eric W. Biederman
                                                                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 21:35 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


Make it clear that current only needs to be computed once in
flush_old_exec.  This may have some efficiency improvements and it
makes the code easier to change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct task_struct *me = current;
 	int retval;
 
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(me);
 	if (retval)
 		goto out;
 
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	bprm->mm = NULL;
 
 	set_fs(USER_DS);
-	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
 	flush_thread();
-	current->personality &= ~bprm->per_clear;
+	me->personality &= ~bprm->per_clear;
 
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
@@ -1305,7 +1306,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * trying to access the should-be-closed file descriptors of a process
 	 * undergoing exec(2).
 	 */
-	do_close_on_exec(current->files);
+	do_close_on_exec(me->files);
 	return 0;
 
 out:
-- 
2.25.0



* [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
  2020-03-08 21:35                                                 ` [PATCH v2 1/5] exec: Only compute current once in flush_old_exec Eric W. Biederman
@ 2020-03-08 21:36                                                 ` Eric W. Biederman
  2020-03-09 19:28                                                   ` Bernd Edlinger
                                                                     ` (2 more replies)
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
                                                                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 21:36 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


This makes the code clearer and makes it easier to implement a mutex
that is not taken over any locations that may block indefinitely waiting
for userspace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c3f34791f2f0..ff74b9a74d34 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
 	flush_itimer_signals();
 #endif
 
+	BUG_ON(!thread_group_leader(tsk));
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EAGAIN;
+}
+
+
+static int unshare_sighand(struct task_struct *me)
+{
+	struct sighand_struct *oldsighand = me->sighand;
+
 	if (refcount_read(&oldsighand->count) != 1) {
 		struct sighand_struct *newsighand;
 		/*
@@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
 
 		write_lock_irq(&tasklist_lock);
 		spin_lock(&oldsighand->siglock);
-		rcu_assign_pointer(tsk->sighand, newsighand);
+		rcu_assign_pointer(me->sighand, newsighand);
 		spin_unlock(&oldsighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 
 		__cleanup_sighand(oldsighand);
 	}
-
-	BUG_ON(!thread_group_leader(tsk));
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
@@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
 	int retval;
 
 	/*
-	 * Make sure we have a private signal table and that
-	 * we are unassociated from the previous thread group.
+	 * Make this the only thread in the thread group.
 	 */
 	retval = de_thread(me);
 	if (retval)
 		goto out;
 
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
-- 
2.25.0
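
For readers following the thread, the refactoring in patch 2/5 can be
pictured with a small userspace sketch.  This is not kernel code: the
names handler_table, task and unshare_table are hypothetical stand-ins
for sighand_struct, task_struct and unshare_sighand, and the locking
done by the real function (tasklist_lock, siglock) is left out.  It only
shows the "copy the table if it is shared" step, which does not depend
on the rest of de_thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sighand_struct: a refcounted table. */
struct handler_table {
	int refcount;
	int action[4];
};

/* Hypothetical stand-in for the task: it only points at a table. */
struct task {
	struct handler_table *table;
};

/*
 * Give "me" a private copy of its handler table if the table is shared.
 * Returns 0 on success and -1 if the copy cannot be allocated.
 */
static int unshare_table(struct task *me)
{
	struct handler_table *old = me->table;
	struct handler_table *new;

	if (old->refcount == 1)
		return 0;	/* already private, nothing to do */

	new = malloc(sizeof(*new));
	if (!new)
		return -1;

	new->refcount = 1;
	memcpy(new->action, old->action, sizeof(new->action));

	me->table = new;
	old->refcount--;	/* drop our reference to the shared table */
	assert(old->refcount >= 1);
	return 0;
}
```

A caller with a table whose refcount is 2 ends up with a fresh copy whose
refcount is 1, and a second call is a no-op.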



* [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
  2020-03-08 21:35                                                 ` [PATCH v2 1/5] exec: Only compute current once in flush_old_exec Eric W. Biederman
  2020-03-08 21:36                                                 ` [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately Eric W. Biederman
@ 2020-03-08 21:36                                                 ` Eric W. Biederman
  2020-03-09 19:30                                                   ` Bernd Edlinger
                                                                     ` (4 more replies)
  2020-03-08 21:38                                                 ` [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec Eric W. Biederman
                                                                   ` (2 subsequent siblings)
  5 siblings, 5 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 21:36 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


These functions have very little to do with de_thread.  Move them out
of de_thread and into flush_old_exec proper so that it can be more
clearly seen what flush_old_exec is doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ff74b9a74d34..215d86f77b63 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(sig);
-	flush_itimer_signals();
-#endif
-
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
 
@@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
+
 	/*
 	 * Make the signal table private.
 	 */
-- 
2.25.0



* [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
                                                                   ` (2 preceding siblings ...)
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
@ 2020-03-08 21:38                                                 ` Eric W. Biederman
  2020-03-09 19:34                                                   ` Bernd Edlinger
                                                                     ` (2 more replies)
  2020-03-08 21:38                                                 ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
  2020-03-09 13:58                                                 ` [PATCH 0/5] Infrastructure to allow fixing exec deadlocks Bernd Edlinger
  5 siblings, 3 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 21:38 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


I have read through the code in exec_mmap and I do not see anything
that depends on sighand or the sighand lock, or on signals in any way,
so this should be safe.

This rearrangement of code has two significant benefits.  It makes
the determination of passing the point of no return by testing bprm->mm
accurate.  All failures prior to that point in flush_old_exec are
either truly recoverable or they are fatal.

Further, this consolidates all of the possible indefinite waits for
userspace together at the top of flush_old_exec: the possible wait
for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
to be resolved in clear_child_tid, and the possible wait for a page
fault in exit_robust_list.

This consolidation allows the creation of a mutex to replace
cred_guard_mutex that is not held over possible indefinite userspace
waits, which will allow removing deadlock scenarios from the kernel.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 215d86f77b63..d820a7272a76 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(me->signal);
-	flush_itimer_signals();
-#endif
-
-	/*
-	 * Make the signal table private.
-	 */
-	retval = unshare_sighand(me);
-	if (retval)
-		goto out;
-
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	bprm->mm = NULL;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
+
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	set_fs(USER_DS);
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
-- 
2.25.0
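
The ordering argument made in patch 4/5 can be illustrated with a toy,
single-threaded model.  This is only a sketch under stated assumptions:
toy_mutex, wait_for_userspace and model_exec are hypothetical names, the
lock is a plain flag rather than a kernel mutex, and the "userspace
party" (for example a tracer) is modelled as something that must briefly
take the same lock before exec can continue.  If exec already holds the
lock while it waits, neither side can make progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded lock; a hypothetical stand-in for a kernel mutex. */
struct toy_mutex {
	bool locked;
};

static bool toy_trylock(struct toy_mutex *m)
{
	if (m->locked)
		return false;
	m->locked = true;
	return true;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = false;
}

/*
 * A step in which exec waits for userspace.  The userspace party needs
 * the mutex before it can answer; if it cannot get it, neither side
 * makes progress.  Returns 1 for such a stall, else 0.
 */
static int wait_for_userspace(struct toy_mutex *m)
{
	if (!toy_trylock(m))
		return 1;
	toy_unlock(m);
	return 0;
}

/* Count stalls for one modelled exec; lock_first selects the old order. */
static int model_exec(bool lock_first)
{
	struct toy_mutex m = { .locked = false };
	int stalls = 0;

	if (lock_first && !toy_trylock(&m))
		return -1;

	stalls += wait_for_userspace(&m);	/* ptracer, PTRACE_EVENT_EXIT */
	stalls += wait_for_userspace(&m);	/* fault on clear_child_tid */
	stalls += wait_for_userspace(&m);	/* fault in exit_robust_list */

	if (!lock_first && !toy_trylock(&m))
		return -1;

	/* bounded work that never waits for userspace would go here */

	toy_unlock(&m);
	return stalls;
}
```

With the lock taken first, every one of the three waits stalls; with the
waits grouped ahead of the lock, none does.  The model says nothing
about real scheduling; it only shows why the waits have to come first.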



* [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
                                                                   ` (3 preceding siblings ...)
  2020-03-08 21:38                                                 ` [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec Eric W. Biederman
@ 2020-03-08 21:38                                                 ` Eric W. Biederman
  2020-03-09 13:45                                                   ` Bernd Edlinger
                                                                     ` (3 more replies)
  2020-03-09 13:58                                                 ` [PATCH 0/5] Infrastructure to allow fixing exec deadlocks Bernd Edlinger
  5 siblings, 4 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-08 21:38 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


The cred_guard_mutex is problematic.  The cred_guard_mutex is held
over the userspace accesses as the arguments from userspace are read.
The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
threads are killed.  The cred_guard_mutex is held over
"put_user(0, tsk->clear_child_tid)" in exit_mm().

Any of those can result in deadlock, as the cred_guard_mutex is held
over a possible indefinite userspace waits for userspace.

Add exec_update_mutex that is only held over exec updating process
with the new contents of exec, so that code that needs not to be
confused by exec changing the mm and the cred in ways that can not
happen during ordinary execution of a process.

The plan is to switch the users of cred_guard_mutex to
exec_udpate_mutex one by one.  This lets us move forward while still
being careful and not introducing any regressions.

Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 9 +++++++++
 include/linux/sched/signal.h | 9 ++++++++-
 init/init_task.c             | 1 +
 kernel/fork.c                | 1 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index d820a7272a76..ffeebb1f167b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
@@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
 			return -EINTR;
 		}
 	}
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->mm)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..bd403ed3e418 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);
 
 	return 0;
 }
-- 
2.25.0
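
The lock and unlock sites that patch 5/5 adds can be checked for balance
with a small model.  This is a sketch, not the kernel code: toy_bprm and
the toy_* functions are hypothetical, and the model assumes (from the
surrounding fs/exec.c code, not from this diff) that install_exec_creds
consumes bprm->cred and that bprm->mm is cleared once exec_mmap has
succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the bprm fields that the unlock paths test. */
struct toy_bprm {
	bool has_cred;	/* models bprm->cred != NULL */
	bool has_mm;	/* models bprm->mm != NULL */
	int held;	/* how many times exec_update_mutex is held */
};

enum fail_point { FAIL_BEFORE_MMAP, FAIL_AFTER_MMAP, NO_FAILURE };

/* Models exec_mmap() plus the "bprm->mm = NULL" that follows it. */
static void toy_exec_mmap(struct toy_bprm *b)
{
	b->held++;
	b->has_mm = false;
}

/* Models install_exec_creds(): commit the creds, then unlock. */
static void toy_install_exec_creds(struct toy_bprm *b)
{
	b->has_cred = false;	/* assumption: creds are consumed here */
	b->held--;
}

/* Models free_bprm(): unlock only if exec_mmap ran and creds remain. */
static void toy_free_bprm(struct toy_bprm *b)
{
	if (b->has_cred && !b->has_mm)
		b->held--;
}

/* Run one modelled exec and report the final hold count (0 is balanced). */
static int toy_exec(enum fail_point fail)
{
	struct toy_bprm b = { .has_cred = true, .has_mm = true, .held = 0 };

	if (fail != FAIL_BEFORE_MMAP) {
		toy_exec_mmap(&b);
		if (fail == NO_FAILURE)
			toy_install_exec_creds(&b);
	}
	toy_free_bprm(&b);
	assert(b.held >= 0);
	return b.held;
}
```

In this model the hold count returns to zero whether exec fails before
exec_mmap, fails after it, or succeeds, which is the property the
"if (!bprm->mm)" test in free_bprm is there to provide.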



* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-08 21:38                                                 ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
@ 2020-03-09 13:45                                                   ` Bernd Edlinger
  2020-03-09 17:40                                                     ` Eric W. Biederman
  2020-03-10 21:21                                                   ` Jann Horn
                                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 13:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/8/20 10:38 PM, Eric W. Biederman wrote:
> 
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other

... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
or something?

I wonder if we also should mention that
it is held while waiting for the trace parent to
receive the exit code with "wait"?

> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> 
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
> 
> Add exec_update_mutex that is only held over exec updating process

Add ?

> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
> 
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one.  This lets us move forward while still

s/udpate/update/


Bernd.


* Re: [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-08 21:35                                                 ` [PATCH v2 1/5] exec: Only compute current once in flush_old_exec Eric W. Biederman
@ 2020-03-09 13:56                                                   ` Bernd Edlinger
  2020-03-09 17:34                                                     ` Eric W. Biederman
  2020-03-10 20:17                                                   ` Kees Cook
  2020-03-10 21:12                                                   ` Christian Brauner
  2 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 13:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/8/20 10:35 PM, Eric W. Biederman wrote:
> 
> Make it clear that current only needs to be computed once in
> flush_old_exec.  This may have some efficiency improvements and it
> makes the code easier to change.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..c3f34791f2f0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>   */
>  int flush_old_exec(struct linux_binprm * bprm)
>  {
> +	struct task_struct *me = current;
>  	int retval;
>  
>  	/*
>  	 * Make sure we have a private signal table and that
>  	 * we are unassociated from the previous thread group.
>  	 */
> -	retval = de_thread(current);
> +	retval = de_thread(me);
>  	if (retval)
>  		goto out;
>  
> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	bprm->mm = NULL;
>  
>  	set_fs(USER_DS);
> -	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> +	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>  					PF_NOFREEZE | PF_NO_SETAFFINITY);

I wonder if this line should be aligned with the previous?


Bernd.


* Re: [PATCH 0/5] Infrastructure to allow fixing exec deadlocks
  2020-03-08 21:34                                               ` [PATCH 0/5] " Eric W. Biederman
                                                                   ` (4 preceding siblings ...)
  2020-03-08 21:38                                                 ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
@ 2020-03-09 13:58                                                 ` Bernd Edlinger
  5 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 13:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/8/20 10:34 PM, Eric W. Biederman wrote:
> 
> Bernd, everyone
> 
> This is how I think the infrastructure change should look that makes way
> for fixing this issue.
> 
> - Cleanup and reorder the code so code that can potentially wait
>   indefinitely for userspace comes at the beginning for flush_old_exec.
> - Add a new mutex and take it after we have passed any potential
>   indefinite waits for userspace.
> 
> Then I think it is just going through the existing users of
> cred_guard_mutex and fixing them to use the new one.
> 
> There really aren't that many users of cred_guard_mutex so we should be
> able to get through the easy ones fairly quickly.  And anything that
> isn't easy we can wait until we have a good fix.
> 
> The users of cred_guard_mutex that I saw were:
>     fs/proc/base.c:
>        proc_pid_attr_write
>        do_io_accounting
>        proc_pid_stack
>        proc_pid_syscall
>        proc_pid_personality
>     
>     perf_event_open
>     mm_access
>     kcmp
>     pidfd_fget
>     seccomp_set_mode_filter
> 
> Bernd I think I have addressed the issues you pointed out in v1.
> Please let me know if you see anything else.
> 

Yes, looks good, except some nits.


Thanks
Bernd.


* Re: [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-09 13:56                                                   ` Bernd Edlinger
@ 2020-03-09 17:34                                                     ` Eric W. Biederman
  2020-03-09 17:56                                                       ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 17:34 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>> 
>> Make it clear that current only needs to be computed once in
>> flush_old_exec.  This may have some efficiency improvements and it
>> makes the code easier to change.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index db17be51b112..c3f34791f2f0 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>>   */
>>  int flush_old_exec(struct linux_binprm * bprm)
>>  {
>> +	struct task_struct *me = current;
>>  	int retval;
>>  
>>  	/*
>>  	 * Make sure we have a private signal table and that
>>  	 * we are unassociated from the previous thread group.
>>  	 */
>> -	retval = de_thread(current);
>> +	retval = de_thread(me);
>>  	if (retval)
>>  		goto out;
>>  
>> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	bprm->mm = NULL;
>>  
>>  	set_fs(USER_DS);
>> -	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>> +	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
>
> I wonder if this line should be aligned with the previous?

In this case I don't think so.  The style used for second line is indent
with tabs as much as possible to the right.  I haven't changed that.

Further mixing a change in indentation style with just a variable rename
will make the patch confusing to read because two things have to be
verified at the same time.

So while I see why you ask I think this bit needs to stay as is.

Eric






* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 13:45                                                   ` Bernd Edlinger
@ 2020-03-09 17:40                                                     ` Eric W. Biederman
  2020-03-09 18:01                                                       ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 17:40 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>> 
>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
                                ^ over
>
> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
> or something?

Yes.  Let me see if I can phrase that better.

> I wonder if we also should mention that
> it is held while waiting for the trace parent to
> receive the exit code with "wait"?

I don't think we have to spell out the details of how it all works,
unless that makes things clearer.  Kernel developers can be expected
to figure out how the kernel works.  The critical thing is that it is
an indefinite wait for userspace to take action.

But I will look.

>> threads are killed.  The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> 
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>> 
>> Add exec_update_mutex that is only held over exec updating process
>
> Add ?

Yes.  That is what the change does: add exec_update_mutex.

>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>> 
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one.  This lets us move forward while still
>
> s/udpate/update/

Yes.  Very much so.

Eric



* Re: [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-09 17:34                                                     ` Eric W. Biederman
@ 2020-03-09 17:56                                                       ` Bernd Edlinger
  2020-03-09 19:27                                                         ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 17:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/9/20 6:34 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>>>
>>> Make it clear that current only needs to be computed once in
>>> flush_old_exec.  This may have some efficiency improvements and it
>>> makes the code easier to change.
>>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>  fs/exec.c | 9 +++++----
>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index db17be51b112..c3f34791f2f0 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>>>   */
>>>  int flush_old_exec(struct linux_binprm * bprm)
>>>  {
>>> +	struct task_struct *me = current;
>>>  	int retval;
>>>  
>>>  	/*
>>>  	 * Make sure we have a private signal table and that
>>>  	 * we are unassociated from the previous thread group.
>>>  	 */
>>> -	retval = de_thread(current);
>>> +	retval = de_thread(me);
>>>  	if (retval)
>>>  		goto out;
>>>  
>>> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>>>  	bprm->mm = NULL;
>>>  
>>>  	set_fs(USER_DS);
>>> -	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>> +	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
>>
>> I wonder if this line should be aligned with the previous?
> 
> In this case I don't think so.  The style used for second line is indent
> with tabs as much as possible to the right.  I haven't changed that.
> 
> Further mixing a change in indentation style with just a variable rename
> will make the patch confusing to read because two things have to be
> verified at the same time.
> 
> So while I see why you ask I think this bit needs to stay as is.
> 

Ah, okay, I see.
Thanks for explaining this rule, I was not aware of it,
but I am still new here :)


Thanks
Bernd.


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 17:40                                                     ` Eric W. Biederman
@ 2020-03-09 18:01                                                       ` Bernd Edlinger
  2020-03-09 18:10                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 18:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/9/20 6:40 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>                                 ^ over
>>
>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>> or something?
> 
> Yes.  Let me see if I can phrase that better.
> 
>> I wonder if we also should mention that
>> it is held while waiting for the trace parent to
>> receive the exit code with "wait"?
> 
> I don't think we have to spell out the details of how it all works,
> unless that makes things clearer.  Kernel developers can be expected
> to figure out how the kernel works.  The critical thing is that it is
> an indefinite wait for userspace to take action.
> 
> But I will look.
> 
>>> threads are killed.  The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possible indefinite userspace waits for userspace.
>>>
>>> Add exec_update_mutex that is only held over exec updating process
>>
>> Add ?
> 
> Yes.  That is what the change does: add exec_update_mutex.
> 

I just kind of missed the "subject" in this sentence,
like "This patch adds an exec_update_mutex that is ..."
but English is a foreign language for me, so it may be okay as is.


Bernd.

>>> with the new contents of exec, so that code that needs not to be
>>> confused by exec changing the mm and the cred in ways that can not
>>> happen during ordinary execution of a process.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>
>> s/udpate/update/
> 
> Yes.  Very much so.
> 
> Eric
> 


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 18:01                                                       ` Bernd Edlinger
@ 2020-03-09 18:10                                                         ` Eric W. Biederman
  2020-03-09 18:24                                                           ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 18:10 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 6:40 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>                                 ^ over
>>>
>>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>>> or something?
>> 
>> Yes.  Let me see if I can phrase that better.
>> 
>>> I wonder if we also should mention that
>>> it is held while waiting for the trace parent to
>>> receive the exit code with "wait"?
>> 
>> I don't think we have to spell out the details of how it all works,
>> unless that makes things clearer.  Kernel developers can be expected
>> to figure out how the kernel works.  The critical thing is that it is
>> an indefinite wait for userspace to take action.
>> 
>> But I will look.
>> 
>>>> threads are killed.  The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over a possible indefinite userspace waits for userspace.
>>>>
>>>> Add exec_update_mutex that is only held over exec updating process
>>>
>>> Add ?
>> 
>> Yes.  That is what the change does: add exec_update_mutex.
>> 
>
> I just kind of missed the "subject" in this sentence,
> like "This patch adds an exec_update_mutex that is ..."
> but English is a foreign language for me, so it may be okay as is.

English has a lot of options.  I think this is a stylistic difference.

Instead of being an observer and describing what the change does:
"This patch adds exec_update_mutex ..."  

I was being there in the moment and saying/commanding what is happening:
"Add exec_update_mutex ..."

Using the more immediate form ends up with more concise and clearer
sentences.

Every one of my writing teachers in school emphasized that point
and I see how it works when I write things.  But writing is hard and
I still tend toward long rambling sentences with many qualifiers that
confuse and detract from the point rather than make it clear what is
happening.

Eric


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 18:10                                                         ` Eric W. Biederman
@ 2020-03-09 18:24                                                           ` Eric W. Biederman
  2020-03-09 18:36                                                             ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 18:24 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
>> On 3/9/20 6:40 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>> 
>>>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>>>
>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>                                 ^ over
>>>>
>>>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>>>> or something?
>>> 
>>> Yes.  Let me see if I can phrase that better.
>>> 
>>>> I wonder if we also should mention that
>>>> it is held while waiting for the trace parent to
>>>> receive the exit code with "wait"?
>>> 
>>> I don't think we have to spell out the details of how it all works,
>>> unless that makes things clearer.  Kernel developers can be expected
>>> to figure out how the kernel works.  The critical thing is that it is
>>> an indefinite wait for userspace to take action.
>>> 
>>> But I will look.
>>> 
>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>
>>>> Add ?
>>> 
>>> Yes.  That is what the change does: add exec_update_mutex.
>>> 
>>
>> I just kind of missed the "subject" in this sentence,
>> like "This patch adds an exec_update_mutex that is ..."
>> but English is a foreign language for me, so it may be okay as is.
>
> English has a lot of options.  I think this is a stylistic difference.
>
> Instead of being an observer and describing what the change does:
> "This patch adds exec_update_mutex ..."  
>
> I was being there in the moment and saying/commanding what is happening:
> "Add exec_update_mutex ..."
>
> Using the more immediate form ends up with more concise and clearer
> sentences.
>
> Every one of my writing teachers in school emphasized that point
> and I see how it works when I write things.  But writing is hard and
> I still tend toward long rambling sentences with many qualifiers that
> confuse and detract from the point rather than make it clear what is
> happening.

And reading through it all now I can see your confusion.  That
description of my changes was not well done.  Reworking it now.

Eric


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 18:24                                                           ` Eric W. Biederman
@ 2020-03-09 18:36                                                             ` Eric W. Biederman
  2020-03-09 18:47                                                               ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 18:36 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


My rewritten change description reads as follows:

    exec: Add a exec_update_mutex to replace cred_guard_mutex
    
    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace.  The possilbe indefinite waits for
    userspace that I have identified are: The cred_guard_mutex is held in
    PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
    held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
    cred_guard_mutex is held over "get_user(futex_offset, ...")  in
    exit_robust_list.  The cred_guard_mutex held over copy_strings.
    
    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.
    
    In any of those cases the userspace process that the kernel is waiting
    for might userspace might make a different system call that winds up
    taking the cred_guard_mutex and result in deadlock.
    
    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary.  Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.
    
    The plan is to switch the users of cred_guard_mutex to
    exec_udpate_mutex one by one.  This lets us move forward while still
    being careful and not introducing any regressions.
    
    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Does that sound better?

Eric


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 18:36                                                             ` Eric W. Biederman
@ 2020-03-09 18:47                                                               ` Bernd Edlinger
  2020-03-09 19:02                                                                 ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 18:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/9/20 7:36 PM, Eric W. Biederman wrote:
> 
> My rewritten change description reads as follows:
> 
>     exec: Add a exec_update_mutex to replace cred_guard_mutex

Should this be "an" exec_update_mutex?

>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for
>     userspace that I have identified are: The cred_guard_mutex is held in
>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>     exit_robust_list.  The cred_guard_mutex held over copy_strings.
>     
>     The functions get_user and put_user can trigger a page fault which can
>     potentially wait indefinitely in the case of userfaultfd or if
>     userspace implements part of the page fault path.
>     
>     In any of those cases the userspace process that the kernel is waiting
>     for might userspace might make a different system call that winds up
                ^-------------^
                      ^- remove this
>     taking the cred_guard_mutex and result in deadlock.
>     
>     Holding a mutex over any of those possibly indefinite waits for
>     userspace does not appear necessary.  Add exec_update_mutex that will
>     just cover updating the process during exec where the permissions and
>     the objects pointed to by the task struct may be out of sync.
>     
>     The plan is to switch the users of cred_guard_mutex to
>     exec_udpate_mutex one by one.  This lets us move forward while still

            ^-- typo: update

>     being careful and not introducing any regressions.
>     
>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Does that sound better?
> 

almost done.

> Eric
> 


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 18:47                                                               ` Bernd Edlinger
@ 2020-03-09 19:02                                                                 ` Eric W. Biederman
  2020-03-09 19:24                                                                   ` Bernd Edlinger
                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 19:02 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>> 
>> 
>> Does that sound better?
>> 
>
> almost done.

I think this text is finally clean.

    exec: Add exec_update_mutex to replace cred_guard_mutex
    
    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace.  The possilbe indefinite waits for
    userspace that I have identified are: The cred_guard_mutex is held in
    PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
    held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
    cred_guard_mutex is held over "get_user(futex_offset, ...)" in
    exit_robust_list.  The cred_guard_mutex is held over copy_strings.
    
    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.
    
    In any of those cases the userspace process that the kernel is waiting
    for might make a different system call that winds up taking the
    cred_guard_mutex and result in deadlock.
    
    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary.  Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.
    
    The plan is to switch the users of cred_guard_mutex to
    exec_update_mutex one by one.  This lets us move forward while still
    being careful and not introducing any regressions.
    
    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


Bernd do you want to give me your Reviewed-by for this part of the
series?

After that do you think you can write the obvious patch for mm_access?

I will apply these changes to my tree and push them into linux-next.

Eric
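
To make the locking idea in the change description above concrete, here
is a minimal user-space sketch in C.  It is not the kernel
implementation: struct toy_task, its integer fields and all toy_*
functions are hypothetical stand-ins, and a pthread mutex stands in for
exec_update_mutex.  The point it tries to show is only that the mutex
is taken around the short update of mm and cred, and not around
anything that may wait for userspace.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the exec-related parts of a task. */
struct toy_task {
	pthread_mutex_t exec_update_mutex; /* held only while mm/cred change */
	int mm_id;                         /* stands in for the task's mm */
	int cred_id;                       /* stands in for the task's cred */
};

void toy_task_init(struct toy_task *t, int mm_id, int cred_id)
{
	pthread_mutex_init(&t->exec_update_mutex, NULL);
	t->mm_id = mm_id;
	t->cred_id = cred_id;
}

/*
 * Swap mm and cred together.  Anything that may wait for userspace
 * (reading arguments, waiting for other threads to exit) is assumed to
 * have happened before this call, without the mutex held.
 */
void toy_exec_update(struct toy_task *t, int new_mm, int new_cred)
{
	pthread_mutex_lock(&t->exec_update_mutex);
	t->mm_id = new_mm;
	t->cred_id = new_cred;
	pthread_mutex_unlock(&t->exec_update_mutex);
}

/* Read a consistent (mm, cred) pair, as a tracer-like observer would. */
void toy_snapshot(struct toy_task *t, int *mm_id, int *cred_id)
{
	pthread_mutex_lock(&t->exec_update_mutex);
	*mm_id = t->mm_id;
	*cred_id = t->cred_id;
	pthread_mutex_unlock(&t->exec_update_mutex);
}

/* Returns true if an observer could take the mutex now without blocking. */
bool toy_observer_can_proceed(struct toy_task *t)
{
	if (pthread_mutex_trylock(&t->exec_update_mutex) != 0)
		return false;
	pthread_mutex_unlock(&t->exec_update_mutex);
	return true;
}
```

Because the mutex is never held across a wait, an observer such as the
hypothetical toy_snapshot() blocks only for the duration of the two
assignments in toy_exec_update().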


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 19:02                                                                 ` Eric W. Biederman
@ 2020-03-09 19:24                                                                   ` Bernd Edlinger
  2020-03-09 19:35                                                                     ` Eric W. Biederman
  2020-03-09 19:39                                                                     ` Eric W. Biederman
  2020-03-09 19:33                                                                   ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Dmitry V. Levin
  2020-03-10 20:55                                                                   ` Kees Cook
  2 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 19:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api



On 3/9/20 8:02 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>>
>>>
>>> Does that sound better?
>>>
>>
>> almost done.
> 
> I think this text is finally clean.
> 
>     exec: Add exec_update_mutex to replace cred_guard_mutex
>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for
>     userspace that I have identified are: The cred_guard_mutex is held in
>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>     cred_guard_mutex is held over "get_user(futex_offset, ...)" in
>     exit_robust_list.  The cred_guard_mutex is held over copy_strings.
>     
>     The functions get_user and put_user can trigger a page fault which can
>     potentially wait indefinitely in the case of userfaultfd or if
>     userspace implements part of the page fault path.
>     
>     In any of those cases the userspace process that the kernel is waiting
>     for might make a different system call that winds up taking the
>     cred_guard_mutex and result in deadlock.
>     
>     Holding a mutex over any of those possibly indefinite waits for
>     userspace does not appear necessary.  Add exec_update_mutex that will
>     just cover updating the process during exec where the permissions and
>     the objects pointed to by the task struct may be out of sync.
>     
>     The plan is to switch the users of cred_guard_mutex to
>     exec_update_mutex one by one.  This lets us move forward while still
>     being careful and not introducing any regressions.
>     
>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")

I checked the URLs; they all work.
Just one last question: are these git references?
I can't find them in my linux git tree (cloned from Linus' git).

Sorry for being pedantic.


>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> 
> Bernd do you want to give me your Reviewed-by for this part of the
> series?
> 

Sure, and for the other parts as well, of course.

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

> After that do you think you can write the obvious patch for mm_access?
> 

Yes, I can do that.
I also found some typos in comments; I will send fixes for those as
separate patches.

I wonder if it is okay for the test case to include the ptrace_attach
case, although that is not yet passing?


Thanks
Bernd.


* Re: [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-09 17:56                                                       ` Bernd Edlinger
@ 2020-03-09 19:27                                                         ` Bernd Edlinger
  0 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 19:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/9/20 6:56 PM, Bernd Edlinger wrote:
> On 3/9/20 6:34 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>
>>> On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>>>>
>>>> Make it clear that current only needs to be computed once in
>>>> flush_old_exec.  This may have some efficiency improvements and it
>>>> makes the code easier to change.
>>>>
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> ---
>>>>  fs/exec.c | 9 +++++----
>>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index db17be51b112..c3f34791f2f0 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>>>>   */
>>>>  int flush_old_exec(struct linux_binprm * bprm)
>>>>  {
>>>> +	struct task_struct *me = current;
>>>>  	int retval;
>>>>  
>>>>  	/*
>>>>  	 * Make sure we have a private signal table and that
>>>>  	 * we are unassociated from the previous thread group.
>>>>  	 */
>>>> -	retval = de_thread(current);
>>>> +	retval = de_thread(me);
>>>>  	if (retval)
>>>>  		goto out;
>>>>  
>>>> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>>>>  	bprm->mm = NULL;
>>>>  
>>>>  	set_fs(USER_DS);
>>>> -	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>>> +	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>>>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
>>>
>>> I wonder if this line should be aligned with the previous?
>>
>> In this case I don't think so.  The style used for second line is indent
>> with tabs as much as possible to the right.  I haven't changed that.
>>
>> Further mixing a change in indentation style with just a variable rename
>> will make the patch confusing to read because two things have to be
>> verified at the same time.
>>
>> So while I see why you ask I think this bit needs to stay as is.
>>
> 
> Ah, okay, I see.
> Thanks for explaining this rule, I was not aware of it,
> but I am still new here :)
> 

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>


Bernd.
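
The idea behind this patch, looking up current once and reusing the
result through a local variable, can be sketched in user space as
follows.  In the kernel, current is a macro; here a hypothetical
toy_current() function with a call counter stands in for it, and
struct toy_task and the TOY_PF_* flags are likewise made up for the
illustration.

```c
/* Hypothetical stand-in for the few task fields touched in the sketch. */
struct toy_task {
	unsigned int flags;
	int exit_signal;
};

#define TOY_PF_A 0x1u
#define TOY_PF_B 0x2u

static struct toy_task the_task;
static int lookups; /* how many times the "current" lookup has run */

/* Stand-in for the kernel's `current`; counts how often it is used. */
struct toy_task *toy_current(void)
{
	lookups++;
	return &the_task;
}

int toy_lookup_count(void)
{
	return lookups;
}

/* Looks up the task once and uses the cached pointer for every access. */
int toy_flush(void)
{
	struct toy_task *me = toy_current();

	me->flags &= ~(TOY_PF_A | TOY_PF_B);
	me->exit_signal = 17;
	return 0;
}
```

Besides any saving from the single lookup, having one local name for
the task makes later edits to the function mechanical, which matches
the "easier to change" argument in the quoted commit message.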


* Re: [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
  2020-03-08 21:36                                                 ` [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately Eric W. Biederman
@ 2020-03-09 19:28                                                   ` Bernd Edlinger
  2020-03-10 20:29                                                   ` Kees Cook
  2020-03-10 21:21                                                   ` Christian Brauner
  2 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 19:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/8/20 10:36 PM, Eric W. Biederman wrote:
> 
> This makes the code clearer and makes it easier to implement a mutex
> that is not taken over any locations that may block indefinitely waiting
> for userspace.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>


Bernd.
> ---
>  fs/exec.c | 39 ++++++++++++++++++++++++++-------------
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c3f34791f2f0..ff74b9a74d34 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
>  	flush_itimer_signals();
>  #endif
>  
> +	BUG_ON(!thread_group_leader(tsk));
> +	return 0;
> +
> +killed:
> +	/* protects against exit_notify() and __exit_signal() */
> +	read_lock(&tasklist_lock);
> +	sig->group_exit_task = NULL;
> +	sig->notify_count = 0;
> +	read_unlock(&tasklist_lock);
> +	return -EAGAIN;
> +}
> +
> +
> +static int unshare_sighand(struct task_struct *me)
> +{
> +	struct sighand_struct *oldsighand = me->sighand;
> +
>  	if (refcount_read(&oldsighand->count) != 1) {
>  		struct sighand_struct *newsighand;
>  		/*
> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>  
>  		write_lock_irq(&tasklist_lock);
>  		spin_lock(&oldsighand->siglock);
> -		rcu_assign_pointer(tsk->sighand, newsighand);
> +		rcu_assign_pointer(me->sighand, newsighand);
>  		spin_unlock(&oldsighand->siglock);
>  		write_unlock_irq(&tasklist_lock);
>  
>  		__cleanup_sighand(oldsighand);
>  	}
> -
> -	BUG_ON(!thread_group_leader(tsk));
>  	return 0;
> -
> -killed:
> -	/* protects against exit_notify() and __exit_signal() */
> -	read_lock(&tasklist_lock);
> -	sig->group_exit_task = NULL;
> -	sig->notify_count = 0;
> -	read_unlock(&tasklist_lock);
> -	return -EAGAIN;
>  }
>  
>  char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
> @@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	int retval;
>  
>  	/*
> -	 * Make sure we have a private signal table and that
> -	 * we are unassociated from the previous thread group.
> +	 * Make this the only thread in the thread group.
>  	 */
>  	retval = de_thread(me);
>  	if (retval)
>  		goto out;
>  
> +	/*
> +	 * Make the signal table private.
> +	 */
> +	retval = unshare_sighand(me);
> +	if (retval)
> +		goto out;
> +
>  	/*
>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>  	 * not visibile until then. This also enables the update
> 
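
The copy-if-shared step that the quoted patch factors out into
unshare_sighand can be sketched in user space like this.  It is only a
sketch under stated assumptions: struct toy_sighand, struct toy_task
and toy_unshare_sighand() are hypothetical, a plain int replaces the
refcount, and the locking visible in the quoted diff (tasklist_lock and
siglock) is left out entirely.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_NSIG 4

/* Hypothetical stand-in for a shareable signal handler table. */
struct toy_sighand {
	int count;            /* number of tasks sharing this table */
	int action[TOY_NSIG]; /* one handler id per signal */
};

struct toy_task {
	struct toy_sighand *sighand;
};

/*
 * Give `me` a private copy of its handler table if the table is shared.
 * Returns 0 on success and -1 if the copy cannot be allocated.
 */
int toy_unshare_sighand(struct toy_task *me)
{
	struct toy_sighand *old = me->sighand;

	if (old->count != 1) {
		struct toy_sighand *copy = malloc(sizeof(*copy));

		if (!copy)
			return -1;
		copy->count = 1;
		memcpy(copy->action, old->action, sizeof(copy->action));
		me->sighand = copy;
		old->count--; /* drop this task's reference to the old table */
	}
	return 0;
}
```

When the table is already private (count equal to 1) the function does
nothing, which is why it can be called unconditionally after the thread
group has been reduced to a single thread.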


* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
@ 2020-03-09 19:30                                                   ` Bernd Edlinger
  2020-03-09 19:59                                                   ` Christian Brauner
                                                                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 19:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/8/20 10:36 PM, Eric W. Biederman wrote:
> 
> These functions have very little to do with de_thread.  Move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>


Bernd.
> ---
>  fs/exec.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>  	/* we have changed execution domain */
>  	tsk->exit_signal = SIGCHLD;
>  
> -#ifdef CONFIG_POSIX_TIMERS
> -	exit_itimers(sig);
> -	flush_itimer_signals();
> -#endif
> -
>  	BUG_ON(!thread_group_leader(tsk));
>  	return 0;
>  
> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> +#ifdef CONFIG_POSIX_TIMERS
> +	exit_itimers(me->signal);
> +	flush_itimer_signals();
> +#endif
> +
>  	/*
>  	 * Make the signal table private.
>  	 */
> 


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 19:02                                                                 ` Eric W. Biederman
  2020-03-09 19:24                                                                   ` Bernd Edlinger
@ 2020-03-09 19:33                                                                   ` Dmitry V. Levin
  2020-03-09 19:42                                                                     ` Eric W. Biederman
  2020-03-10 20:55                                                                   ` Kees Cook
  2 siblings, 1 reply; 203+ messages in thread
From: Dmitry V. Levin @ 2020-03-09 19:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jann Horn,
	Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Thomas Gleixner, Oleg Nesterov, Frederic Weisbecker,
	Andrei Vagin, Ingo Molnar, Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable, linux-api

On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
> > On 3/9/20 7:36 PM, Eric W. Biederman wrote:
> >> 
> >> 
> >> Does that sound better?
> >> 
> >
> > almost done.
> 
> I think this text is finally clean.
> 
>     exec: Add exec_update_mutex to replace cred_guard_mutex
>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for

-------------------------------------------^^^^^^^^ possible?


-- 
ldv

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-08 21:38                                                 ` [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec Eric W. Biederman
@ 2020-03-09 19:34                                                   ` Bernd Edlinger
  2020-03-09 19:45                                                     ` Eric W. Biederman
  2020-03-10 20:44                                                   ` Kees Cook
  2020-03-10 20:47                                                   ` Kees Cook
  2 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 19:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/8/20 10:38 PM, Eric W. Biederman wrote:
> 
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in anyway
> so this should be safe.
> 
> This rearrangement of code has two siginficant benefits.  It makes
                                        ^ typo: significant

> the determination of passing the point of no return by testing bprm->mm
> accurate.  All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.
> 
> Futher this consolidates all of the possible indefinite waits for
  ^ typo: Further

> userspace together at the top of flush_old_exec.  The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
> 
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held of possible indefinite userspace

can you also reword this "held of" thing here as well?


Thanks
Bernd.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 19:24                                                                   ` Bernd Edlinger
@ 2020-03-09 19:35                                                                     ` Eric W. Biederman
  2020-03-09 19:39                                                                     ` Eric W. Biederman
  1 sibling, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 19:35 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 8:02 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>>>
>>>>
>>>> Does that sound better?
>>>>
>>>
>>> almost done.
>> 
>> I think this text is finally clean.
>> 
>>     exec: Add exec_update_mutex to replace cred_guard_mutex
>>     
>>     The cred_guard_mutex is problematic as it is held over possibly
>>     indefinite waits for userspace.  The possilbe indefinite waits for
>>     userspace that I have identified are: The cred_guard_mutex is held in
>>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>>     exit_robust_list.  The cred_guard_mutex held over copy_strings.
>>     
>>     The functions get_user and put_user can trigger a page fault which can
>>     potentially wait indefinitely in the case of userfaultfd or if
>>     userspace implements part of the page fault path.
>>     
>>     In any of those cases the userspace process that the kernel is waiting
>>     for might make a different system call that winds up taking the
>>     cred_guard_mutex and result in deadlock.
>>     
>>     Holding a mutex over any of those possibly indefinite waits for
>>     userspace does not appear necessary.  Add exec_update_mutex that will
>>     just cover updating the process during exec where the permissions and
>>     the objects pointed to by the task struct may be out of sync.
>>     
>>     The plan is to switch the users of cred_guard_mutex to
>>     exec_update_mutex one by one.  This lets us move forward while still
>>     being careful and not introducing any regressions.
>>     
>>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>
> I checked the urls they all work.
> Just one last question, are these git references?
> I can't find them in my linux git tree (cloned from linus' git)?
>
> Sorry for being pedantic.

You have to track down tglx's historical git tree from when everything
was in bitkeeper.

But yes they are git references and yes they work.  Just that part
of the history is not in linux.git.

>>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> 
>> Bernd do you want to give me your Reviewed-by for this part of the
>> series?
>> 
>
> Sure also the other parts of course.
>
> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>
>> After that do you think you can write the obvious patch for mm_access?
>> 
>
> Yes, I can do that.
> I also have some typos in comments, will make them extra patches as well.
>
> I wonder if the test case is okay to include the ptrace_attach although
> that is not yet passing?

It is an existing kernel bug that it doesn't pass.

My sense is that if you include it as a separate patch if it is a
problem for someone we can identify it easily via bisect and we do
whatever is appropriate.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 19:24                                                                   ` Bernd Edlinger
  2020-03-09 19:35                                                                     ` Eric W. Biederman
@ 2020-03-09 19:39                                                                     ` Eric W. Biederman
  2020-03-10 13:43                                                                       ` [PATCH 0/4] Use new infrastructure to fix deadlocks in execve Bernd Edlinger
                                                                                         ` (4 more replies)
  1 sibling, 5 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 19:39 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 8:02 PM, Eric W. Biederman wrote:
>> 
>>     
>>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>
> I checked the urls they all work.
> Just one last question, are these git references?
> I can't find them in my linux git tree (cloned from linus' git)?

I will add this tag to help people figure out what is going on.

History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
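
A minimal sketch of how those references could be resolved, assuming an
existing clone of linux.git.  The remote name "history" and the two helper
function names are hypothetical; only the URL comes from the tag above.

```shell
# URL taken from the History Tree tag above.
HISTORY_URL="https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git"

# Add the history tree as an extra remote of an existing clone and fetch it.
# $1: path to the clone.  The remote name "history" is an arbitrary choice.
fetch_history_tree() {
    git -C "$1" remote add history "$HISTORY_URL" &&
    git -C "$1" fetch history
}

# Print the abbreviated hash and the subject line of one commit.
# $1: path to the clone, $2: commit id, for example 45c1a159b85b.
show_ref() {
    git -C "$1" show -s --format='%h ("%s")' "$2"
}
```

Typical use would be "fetch_history_tree ~/src/linux" once, followed by
"show_ref ~/src/linux 45c1a159b85b" for each Ref: line; the clone path here
is only an example.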

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 19:33                                                                   ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Dmitry V. Levin
@ 2020-03-09 19:42                                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 19:42 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jann Horn,
	Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Thomas Gleixner, Oleg Nesterov, Frederic Weisbecker,
	Andrei Vagin, Ingo Molnar, Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable, linux-api

"Dmitry V. Levin" <ldv@altlinux.org> writes:

> On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>> > On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>> >> 
>> >> 
>> >> Does that sound better?
>> >> 
>> >
>> > almost done.
>> 
>> I think this text is finally clean.
>> 
>>     exec: Add exec_update_mutex to replace cred_guard_mutex
>>     
>>     The cred_guard_mutex is problematic as it is held over possibly
>>     indefinite waits for userspace.  The possilbe indefinite waits for
>
> -------------------------------------------^^^^^^^^ possible?


Yes.  Thank you.  Fixed.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-09 19:34                                                   ` Bernd Edlinger
@ 2020-03-09 19:45                                                     ` Eric W. Biederman
  2020-03-09 19:52                                                       ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 19:45 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>> 
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held of possible indefinite userspace
>
> can you also reword this "held of" thing here as well?

Done:

    exec: Move exec_mmap right after de_thread in flush_old_exec
    
    I have read through the code in exec_mmap and I do not see anything
    that depends on sighand or the sighand lock, or on signals in anyway
    so this should be safe.
    
    This rearrangement of code has two siginficant benefits.  It makes
    the determination of passing the point of no return by testing bprm->mm
    accurate.  All failures prior to that point in flush_old_exec are
    either truly recoverable or they are fatal.
    
    Futher this consolidates all of the possible indefinite waits for
    userspace together at the top of flush_old_exec.  The possible wait
    for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
    to be resolved in clear_child_tid, and the possible wait for a page
    fault in exit_robust_list.
    
    This consolidation allows the creation of a mutex to replace
    cred_guard_mutex that is not held over possible indefinite userspace
    waits.  Which will allow removing deadlock scenarios from the kernel.
    
    Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-09 19:45                                                     ` Eric W. Biederman
@ 2020-03-09 19:52                                                       ` Bernd Edlinger
  2020-03-09 19:58                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 19:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api



On 3/9/20 8:45 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>
>>> This consolidation allows the creation of a mutex to replace
>>> cred_guard_mutex that is not held of possible indefinite userspace
>>
>> can you also reword this "held of" thing here as well?
> 
> Done:
> 
>     exec: Move exec_mmap right after de_thread in flush_old_exec
>     
>     I have read through the code in exec_mmap and I do not see anything
>     that depends on sighand or the sighand lock, or on signals in anyway
>     so this should be safe.
>     
>     This rearrangement of code has two siginficant benefits.  It makes

watch out: sig_i_nificant

>     the determination of passing the point of no return by testing bprm->mm
>     accurate.  All failures prior to that point in flush_old_exec are
>     either truly recoverable or they are fatal.
>     
>     Futher this consolidates all of the possible indefinite waits for

Add some r to "Futher", please?

>     userspace together at the top of flush_old_exec.  The possible wait
>     for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>     to be resolved in clear_child_tid, and the possible wait for a page
>     fault in exit_robust_list.
>     
>     This consolidation allows the creation of a mutex to replace
>     cred_guard_mutex that is not held over possible indefinite userspace
>     waits.  Which will allow removing deadlock scenarios from the kernel.
>     
>     Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-09 19:52                                                       ` Bernd Edlinger
@ 2020-03-09 19:58                                                         ` Eric W. Biederman
  2020-03-09 20:03                                                           ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 19:58 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api


Ok.  I think this has it sorted:

    exec: Move exec_mmap right after de_thread in flush_old_exec
    
    I have read through the code in exec_mmap and I do not see anything
    that depends on sighand or the sighand lock, or on signals in any way
    so this should be safe.
    
    This rearrangement of code has two significant benefits.  It makes
    the determination of passing the point of no return by testing bprm->mm
    accurate.  All failures prior to that point in flush_old_exec are
    either truly recoverable or they are fatal.
    
    Further this consolidates all of the possible indefinite waits for
    userspace together at the top of flush_old_exec.  The possible wait
    for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
    to be resolved in clear_child_tid, and the possible wait for a page
    fault in exit_robust_list.
    
    This consolidation allows the creation of a mutex to replace
    cred_guard_mutex that is not held over possible indefinite userspace
    waits.  Which will allow removing deadlock scenarios from the kernel.
    
    Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


I don't think I usually have this many typos.  Sigh.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
  2020-03-09 19:30                                                   ` Bernd Edlinger
@ 2020-03-09 19:59                                                   ` Christian Brauner
  2020-03-09 20:06                                                     ` Eric W. Biederman
  2020-03-10 20:31                                                   ` Kees Cook
                                                                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-09 19:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> 
> These functions have very little to do with de_thread move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)

While you're cleaning up de_thread() wouldn't it be good to also take
the opportunity and remove the task argument from de_thread(). It's only
ever used with current. Could be done in one of your patches or as a
separate patch.

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..ee108707e4b0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1061,8 +1061,9 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
 {
+       struct task_struct *tsk = current;
        struct signal_struct *sig = tsk->signal;
        struct sighand_struct *oldsighand = tsk->sighand;
        spinlock_t *lock = &oldsighand->siglock;
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
         * Make sure we have a private signal table and that
         * we are unassociated from the previous thread group.
         */
-       retval = de_thread(current);
+       retval = de_thread();
        if (retval)
                goto out;

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-09 19:58                                                         ` Eric W. Biederman
@ 2020-03-09 20:03                                                           ` Bernd Edlinger
  2020-03-09 20:35                                                             ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-09 20:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/9/20 8:58 PM, Eric W. Biederman wrote:
> 
> Ok.  I think this has it sorted:
> 
>     exec: Move exec_mmap right after de_thread in flush_old_exec
>     
>     I have read through the code in exec_mmap and I do not see anything
>     that depends on sighand or the sighand lock, or on signals in any way
>     so this should be safe.
>     
>     This rearrangement of code has two significant benefits.  It makes
>     the determination of passing the point of no return by testing bprm->mm
>     accurate.  All failures prior to that point in flush_old_exec are
>     either truly recoverable or they are fatal.
>     
>     Further this consolidates all of the possible indefinite waits for
>     userspace together at the top of flush_old_exec.  The possible wait
>     for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>     to be resolved in clear_child_tid, and the possible wait for a page
>     fault in exit_robust_list.
>     
>     This consolidation allows the creation of a mutex to replace
>     cred_guard_mutex that is not held over possible indefinite userspace
>     waits.  Which will allow removing deadlock scenarios from the kernel.
>     
>     Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> 
> I don't think I usually have this many typos.  Sigh.
> 

OK.

never mind,
Bernd.

 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-09 19:59                                                   ` Christian Brauner
@ 2020-03-09 20:06                                                     ` Eric W. Biederman
  2020-03-09 20:17                                                       ` Christian Brauner
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 20:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>> 
>> These functions have very little to do with de_thread move them out
>> of de_thread and into flush_old_exec proper so it can be more clearly
>> seen what flush_old_exec is doing.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index ff74b9a74d34..215d86f77b63 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>
> While you're cleaning up de_thread() wouldn't it be good to also take
> the opportunity and remove the task argument from de_thread(). It's only
> ever used with current. Could be done in one of your patches or as a
> separate patch.

How does that affect the code generation?

My sense is that computing current once in flush_old_exec is much
better than computing it in each function flush_old_exec calls.
I remember that computing current used to be not expensive but
noticeable.

For clarity I can see renaming tsk to me.  So that it is clear we are
talking about the current process, and not some arbitrary process.
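
The two shapes under discussion can be sketched in userspace C.  This is
only an illustration under stated assumptions: get_current_task(), the
lookup counter, and the helper names are hypothetical stand-ins, not kernel
code, and the counter says nothing about what the compiler really generates
for current.

```c
#include <assert.h>

/* Minimal stand-in for the kernel's task_struct; only what the sketch needs. */
struct task_struct {
	int pid;
};

static struct task_struct the_task = { .pid = 42 };
static int current_lookups;	/* how often "current" was evaluated */

/* Hypothetical stand-in for the kernel's current macro. */
static struct task_struct *get_current_task(void)
{
	current_lookups++;
	return &the_task;
}

/* Shape 1: the caller evaluates current once and passes it down as "me". */
static int de_thread_sketch_arg(struct task_struct *me)
{
	return me->pid > 0 ? 0 : -1;
}

static int other_step_arg(struct task_struct *me)
{
	return me->pid > 0 ? 0 : -1;
}

int flush_passing_me(void)
{
	struct task_struct *me = get_current_task();
	int retval;

	assert(me);	/* userspace analogue of a BUG_ON() sanity check */
	retval = de_thread_sketch_arg(me);
	if (retval)
		return retval;
	return other_step_arg(me);
}

/* Shape 2: no argument; every helper evaluates current itself. */
static int de_thread_sketch_noarg(void)
{
	struct task_struct *me = get_current_task();

	return me->pid > 0 ? 0 : -1;
}

static int other_step_noarg(void)
{
	struct task_struct *me = get_current_task();

	return me->pid > 0 ? 0 : -1;
}

int flush_without_arg(void)
{
	int retval = de_thread_sketch_noarg();

	if (retval)
		return retval;
	return other_step_noarg();
}

/* Return the number of lookups since the last call and reset the counter. */
int lookups_since_reset(void)
{
	int n = current_lookups;

	current_lookups = 0;
	return n;
}
```

In this sketch the first shape evaluates the stand-in once per flush and the
second once per helper.  Renaming the parameter from tsk to me keeps the
first shape while still making it clear that only the current task is meant.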

And for clarity my goal here is not to clean up de_thread.  Though
I don't mind that result.  My goal is to get the extra work out of
de_thread so we can do process tear down cleanups that are safe
according to the ordinary process rules, before taking a mutex that
protects exec mucking with all of the state in exec.

Eric


> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..ee108707e4b0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1061,8 +1061,9 @@ static int exec_mmap(struct mm_struct *mm)
>   * disturbing other processes.  (Other processes might share the signal
>   * table via the CLONE_SIGHAND option to clone().)
>   */
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(void)
>  {
> +       struct task_struct *tsk = current;
>         struct signal_struct *sig = tsk->signal;
>         struct sighand_struct *oldsighand = tsk->sighand;
>         spinlock_t *lock = &oldsighand->siglock;
> @@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>          * Make sure we have a private signal table and that
>          * we are unassociated from the previous thread group.
>          */
> -       retval = de_thread(current);
> +       retval = de_thread();
>         if (retval)
>                 goto out;

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-09 20:06                                                     ` Eric W. Biederman
@ 2020-03-09 20:17                                                       ` Christian Brauner
  2020-03-09 20:48                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-09 20:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> >> 
> >> These functions have very little to do with de_thread move them out
> >> of de_thread and into flush_old_exec proper so it can be more clearly
> >> seen what flush_old_exec is doing.
> >> 
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  fs/exec.c | 10 +++++-----
> >>  1 file changed, 5 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index ff74b9a74d34..215d86f77b63 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> >
> > While you're cleaning up de_thread() wouldn't it be good to also take
> > the opportunity and remove the task argument from de_thread(). It's only
> > ever used with current. Could be done in one of your patches or as a
> > separate patch.
> 
> How does that affect the code generation?

The same way renaming "tsk" to "me" does.

> 
> My sense is that computing current once in flush_old_exec is much
> better than computing it in each function flush_old_exec calls.
> I remember that computing current used to be not expensive but
> noticeable.
> 
> For clarity I can see renaming tsk to me.  So that it is clear we are
> talking about the current process, and not some arbitrary process.

For clarity: de_thread() uses "tsk", which gives the impression that any
task can be dethreaded, while it's only ever used with current. It's just
a suggestion; since you're doing the rename tsk->me anyway, it would fit
with the series. You do whatever you want though.
(I just remember that the same request was made once to changes I did:
Don't pass current as arg when it's the only task passed.)
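
A rough userspace sketch of the two calling conventions being discussed
may help here.  Everything in it is an assumption made for illustration:
"struct task", "current_task", "de_thread_arg" and "de_thread_self" are
hypothetical names, and a thread-local pointer only stands in for the
kernel's "current".  It is not the code from fs/exec.c.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct task_struct. */
struct task {
	int pid;
	int nr_threads;
};

/* Hypothetical stand-in for the kernel's "current" pointer. */
static _Thread_local struct task *current_task;

/* Style with an argument: the signature suggests any task may be passed. */
static int de_thread_arg(struct task *tsk)
{
	tsk->nr_threads = 1;
	return 0;
}

/* Style without an argument: the function can only act on the caller. */
static int de_thread_self(void)
{
	struct task *me = current_task;	/* looked up once, named "me" */

	if (!me)
		return -1;
	me->nr_threads = 1;
	return 0;
}
```

The trade-off raised in the thread shows up directly: the first form lets
the caller look up the task once and pass it down, the second makes it
impossible to hand in a task other than the caller's own.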

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-09 20:03                                                           ` Bernd Edlinger
@ 2020-03-09 20:35                                                             ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 20:35 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 8:58 PM, Eric W. Biederman wrote:
>> 
>> Ok.  I think this has it sorted:
>> 
>>     exec: Move exec_mmap right after de_thread in flush_old_exec
>>     
>>     I have read through the code in exec_mmap and I do not see anything
>>     that depends on sighand or the sighand lock, or on signals in any way
>>     so this should be safe.
>>     
>>     This rearrangement of code has two significant benefits.  It makes
>>     the determination of passing the point of no return by testing bprm->mm
>>     accurate.  All failures prior to that point in flush_old_exec are
>>     either truly recoverable or they are fatal.
>>     
>>     Further this consolidates all of the possible indefinite waits for
>>     userspace together at the top of flush_old_exec.  The possible wait
>>     for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>>     to be resolved in clear_child_tid, and the possible wait for a page
>>     fault in exit_robust_list.
>>     
>>     This consolidation allows the creation of a mutex to replace
>>     cred_guard_mutex that is not held over possible indefinite userspace
>>     waits.  Which will allow removing deadlock scenarios from the kernel.
>>     
>>     Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> 
>> I don't think I usually have this many typos.  Sigh.
>> 
>
> OK.
>
> never mind,

No no.  I really appreciate all of the scrutiny.  Frequently the issues
that will produce typos or poor patch descriptions are also the issues
that will produce sloppy patches as well.  I was just frustrated with
myself.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-09 20:17                                                       ` Christian Brauner
@ 2020-03-09 20:48                                                         ` Eric W. Biederman
  2020-03-10  8:55                                                           ` Christian Brauner
  2020-03-10 20:16                                                           ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Kees Cook
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-09 20:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>> >> 
>> >> These functions have very little to do with de_thread; move them out
>> >> of de_thread and into flush_old_exec proper so it can be more clearly
>> >> seen what flush_old_exec is doing.
>> >> 
>> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> >> ---
>> >>  fs/exec.c | 10 +++++-----
>> >>  1 file changed, 5 insertions(+), 5 deletions(-)
>> >> 
>> >> diff --git a/fs/exec.c b/fs/exec.c
>> >> index ff74b9a74d34..215d86f77b63 100644
>> >> --- a/fs/exec.c
>> >> +++ b/fs/exec.c
>> >> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>> >
>> > While you're cleaning up de_thread() wouldn't it be good to also take
>> > the opportunity and remove the task argument from de_thread(). It's only
>> > ever used with current. Could be done in one of your patches or as a
>> > separate patch.
>> 
>> How does that affect the code generation?
>
> The same way renaming "tsk" to "me" does.
>
>> 
>> My sense is that computing current once in flush_old_exec is much
>> better than computing it in each function flush_old_exec calls.
>> I remember that computing current used to be not expensive but
>> noticeable.
>> 
>> For clarity I can see renaming tsk to me.  So that it is clear we are
>> talking about the current process, and not some arbitrary process.
>
> For clarity: de_thread() uses "tsk", which gives the impression that any
> task can be dethreaded, while it's only ever used with current. It's just
> a suggestion; since you're doing the rename tsk->me anyway, it would fit
> with the series. You do whatever you want though.
> (I just remember that the same request was made once to changes I did:
> Don't pass current as arg when it's the only task passed.)

That's fair.

And I completely agree that we should at least rename tsk to me.
Just for clarity.

My apologies if I am a little short.  My little son has been an extra
handful lately.

Eric


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-09 20:48                                                         ` Eric W. Biederman
@ 2020-03-10  8:55                                                           ` Christian Brauner
  2020-03-10 18:52                                                             ` [PATCH] pidfd: Stop taking cred_guard_mutex Eric W. Biederman
  2020-03-10 20:16                                                           ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Kees Cook
  1 sibling, 1 reply; 203+ messages in thread
From: Christian Brauner @ 2020-03-10  8:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Mon, Mar 09, 2020 at 03:48:55PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >> 
> >> > On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> >> >> 
> >> >> These functions have very little to do with de_thread; move them out
> >> >> of de_thread and into flush_old_exec proper so it can be more clearly
> >> >> seen what flush_old_exec is doing.
> >> >> 
> >> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> >> ---
> >> >>  fs/exec.c | 10 +++++-----
> >> >>  1 file changed, 5 insertions(+), 5 deletions(-)
> >> >> 
> >> >> diff --git a/fs/exec.c b/fs/exec.c
> >> >> index ff74b9a74d34..215d86f77b63 100644
> >> >> --- a/fs/exec.c
> >> >> +++ b/fs/exec.c
> >> >> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> >> >
> >> > While you're cleaning up de_thread() wouldn't it be good to also take
> >> > the opportunity and remove the task argument from de_thread(). It's only
> >> > ever used with current. Could be done in one of your patches or as a
> >> > separate patch.
> >> 
> >> How does that affect the code generation?
> >
> > The same way renaming "tsk" to "me" does.
> >
> >> 
> >> My sense is that computing current once in flush_old_exec is much
> >> better than computing it in each function flush_old_exec calls.
> >> I remember that computing current used to be not expensive but
> >> noticeable.
> >> 
> >> For clarity I can see renaming tsk to me.  So that it is clear we are
> >> talking about the current process, and not some arbitrary process.
> >
> > For clarity: de_thread() uses "tsk", which gives the impression that any
> > task can be dethreaded, while it's only ever used with current. It's just
> > a suggestion; since you're doing the rename tsk->me anyway, it would fit
> > with the series. You do whatever you want though.
> > (I just remember that the same request was made once to changes I did:
> > Don't pass current as arg when it's the only task passed.)
> 
> That's fair.
> 
> And I completely agree that we should at least rename tsk to me.
> Just for clarity.
> 
> My apologies if I am a little short.  My little son has been an extra
> handful lately.

No worries, stress is a thing most of us know too well.

Christian

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 0/4] Use new infrastructure to fix deadlocks in execve
  2020-03-09 19:39                                                                     ` Eric W. Biederman
@ 2020-03-10 13:43                                                                       ` Bernd Edlinger
  2020-03-10 15:35                                                                         ` Eric W. Biederman
  2020-03-10 13:43                                                                       ` [PATCH 1/4] exec: Fix a deadlock in ptrace Bernd Edlinger
                                                                                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 13:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This is a follow up on Eric's patch series to
fix the deadlocks observed with ptracing when execve
in multi-threaded applications.

This fixes the simple and most important case where
the cred_guard_mutex causes strace to deadlock.

This also adds a test case (which is only partially
fixed so far, the rest of the fixes will follow
soon).

Two trivial comment fixes are also included.

Bernd Edlinger (4):
  exec: Fix a deadlock in ptrace
  selftests/ptrace: add test cases for dead-locks
  mm: docs: Fix a comment in process_vm_rw_core
  kernel: doc: remove outdated comment in prepare_kernel_cred

 kernel/cred.c                             |  2 -
 kernel/fork.c                             |  4 +-
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
 5 files changed, 91 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 1/4] exec: Fix a deadlock in ptrace
  2020-03-09 19:39                                                                     ` Eric W. Biederman
  2020-03-10 13:43                                                                       ` [PATCH 0/4] Use new infrastructure to fix deadlocks in execve Bernd Edlinger
@ 2020-03-10 13:43                                                                       ` Bernd Edlinger
  2020-03-10 15:13                                                                         ` Eric W. Biederman
  2020-03-10 21:00                                                                         ` Kees Cook
  2020-03-10 13:44                                                                       ` [PATCH 2/4] selftests/ptrace: add test cases for dead-locks Bernd Edlinger
                                                                                         ` (2 subsequent siblings)
  4 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 13:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

This changes mm_access to use the new exec_update_mutex
instead of cred_guard_mutex.

This patch is based on the following patch by Eric W. Biederman:
"[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 kernel/fork.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index c12595a..5720ff3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 		mmput(mm);
 		mm = ERR_PTR(-EACCES);
 	}
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 
 	return mm;
 }
-- 
1.9.1
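
The effect of switching mm_access from cred_guard_mutex to
exec_update_mutex can be sketched in userspace, under loud assumptions:
the two pthread mutexes below are hypothetical stand-ins for the kernel
mutexes, "reader_probe" and "probe_during_exec_wait" are invented names,
and the indefinite wait for the tracer is only marked by a comment.  The
probe uses pthread_mutex_trylock so the sketch itself cannot hang.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-ins for the two kernel mutexes. */
static pthread_mutex_t cred_guard = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t exec_update = PTHREAD_MUTEX_INITIALIZER;

/*
 * What a reader in the position of mm_access() would find:
 * 0 if the lock is free, EBUSY if somebody holds it.
 */
static int reader_probe(pthread_mutex_t *lock)
{
	int err = pthread_mutex_trylock(lock);

	if (err == 0)
		pthread_mutex_unlock(lock);
	return err;
}

struct probe_result {
	int via_cred_guard;	/* reader behaviour before the patch */
	int via_exec_update;	/* reader behaviour after the patch */
};

static struct probe_result probe_during_exec_wait(void)
{
	struct probe_result r;
	int err;

	/* The exec'ing thread holds cred_guard for the whole exec. */
	err = pthread_mutex_lock(&cred_guard);
	assert(err == 0);

	/*
	 * This is where the kernel may wait indefinitely for the tracer;
	 * exec_update is not taken until that wait is over.
	 */
	r.via_cred_guard = reader_probe(&cred_guard);
	r.via_exec_update = reader_probe(&exec_update);

	pthread_mutex_unlock(&cred_guard);
	return r;
}
```

In this model a reader that needs cred_guard finds it busy for as long as
the exec'ing thread waits, while a reader that only needs exec_update gets
through, which is the property the one-line change above relies on.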

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
  2020-03-09 19:39                                                                     ` Eric W. Biederman
  2020-03-10 13:43                                                                       ` [PATCH 0/4] Use new infrastructure to fix deadlocks in execve Bernd Edlinger
  2020-03-10 13:43                                                                       ` [PATCH 1/4] exec: Fix a deadlock in ptrace Bernd Edlinger
@ 2020-03-10 13:44                                                                       ` Bernd Edlinger
  2020-03-10 21:36                                                                         ` Kees Cook
  2020-03-10 22:41                                                                         ` Dmitry V. Levin
  2020-03-10 13:44                                                                       ` [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core Bernd Edlinger
  2020-03-10 13:44                                                                       ` [PATCH 4/4] kernel: doc: remove outdated comment cred.c Bernd Edlinger
  4 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 13:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This adds test cases for ptrace deadlocks.

Additionally fixes a compile problem in get_syscall_info.c,
observed with gcc-4.8.4:

get_syscall_info.c: In function 'get_syscall_info':
get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
                                 allowed in C99 mode
   for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
   ^
get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
                               your code

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..4db327b
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_GE(f, 0);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(f, 0);
+}
+
+TEST(attach)
+{
+	int s, k, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("sleep", "sleep", "2", NULL);
+	}
+
+	sleep(1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(errno, EAGAIN);
+	ASSERT_EQ(k, -1);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_NE(k, -1);
+	ASSERT_NE(k, 0);
+	ASSERT_NE(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	sleep(1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	k = waitpid(-1, NULL, 0);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+TEST_HARNESS_MAIN
-- 
1.9.1
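
For reference, the gcc-4.8.4 error quoted in the changelog comes from
declaring the loop variable inside the for statement.  The patch deals
with it by adding -std=c99 to CFLAGS; a different way, shown here only as
a sketch, is to hoist the declaration so that the older default dialect
accepts the loop as well.  "sum_args" is a hypothetical helper and not
code from get_syscall_info.c.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper, only here to carry the loop. */
static unsigned long sum_args(const unsigned long *args, size_t n)
{
	size_t i;	/* declared at block scope, not in the for statement */
	unsigned long sum = 0;

	for (i = 0; i < n; ++i)
		sum += args[i];
	return sum;
}
```

Hoisting the declaration keeps the file independent of the compiler's
default -std setting, at the cost of a slightly wider variable scope.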

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
  2020-03-09 19:39                                                                     ` Eric W. Biederman
                                                                                         ` (2 preceding siblings ...)
  2020-03-10 13:44                                                                       ` [PATCH 2/4] selftests/ptrace: add test cases for dead-locks Bernd Edlinger
@ 2020-03-10 13:44                                                                       ` Bernd Edlinger
  2020-03-11 18:53                                                                         ` Kees Cook
  2020-03-10 13:44                                                                       ` [PATCH 4/4] kernel: doc: remove outdated comment cred.c Bernd Edlinger
  4 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 13:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This removes a duplicate "a" in the comment in process_vm_rw_core.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 mm/process_vm_access.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 4/4] kernel: doc: remove outdated comment cred.c
  2020-03-09 19:39                                                                     ` Eric W. Biederman
                                                                                         ` (3 preceding siblings ...)
  2020-03-10 13:44                                                                       ` [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core Bernd Edlinger
@ 2020-03-10 13:44                                                                       ` Bernd Edlinger
  2020-03-11 18:54                                                                         ` Kees Cook
  4 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 13:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This removes an outdated comment in prepare_kernel_cred.

There is no "cred_replace_mutex" any more, so the comment must
go away.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 kernel/cred.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..71a7926 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -675,8 +675,6 @@ void __init cred_init(void)
  * The caller may change these controls afterwards if desired.
  *
  * Returns the new credentials or NULL if out of memory.
- *
- * Does not take, and does not return holding current->cred_replace_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
-- 
1.9.1

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 1/4] exec: Fix a deadlock in ptrace
  2020-03-10 13:43                                                                       ` [PATCH 1/4] exec: Fix a deadlock in ptrace Bernd Edlinger
@ 2020-03-10 15:13                                                                         ` Eric W. Biederman
  2020-03-10 15:17                                                                           ` Bernd Edlinger
  2020-03-10 21:00                                                                         ` Kees Cook
  1 sibling, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 15:13 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread is running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but strace is no
> longer able to respond, since it is blocked in mm_access.
>
> The deadlock always happens when strace needs to access the
> tracee's process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

Overall this looks good.  Mind if I change the subject to:
"exec: Fix a deadlock in strace" ?

Eric


>
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This changes mm_access to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This patch is based on the following patch by Eric W. Biederman:
> "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
> Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
>
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  kernel/fork.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c12595a..5720ff3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	struct mm_struct *mm;
>  	int err;
>  
> -	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	err =  mutex_lock_killable(&task->signal->exec_update_mutex);
>  	if (err)
>  		return ERR_PTR(err);
>  
> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  		mmput(mm);
>  		mm = ERR_PTR(-EACCES);
>  	}
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->exec_update_mutex);
>  
>  	return mm;
>  }

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 1/4] exec: Fix a deadlock in ptrace
  2020-03-10 15:13                                                                         ` Eric W. Biederman
@ 2020-03-10 15:17                                                                           ` Bernd Edlinger
  0 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 15:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/10/20 4:13 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread is running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but strace is no
>> longer able to respond, since it is blocked in mm_access.
>>
>> The deadlock always happens when strace needs to access the
>> tracee's process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> Overall this looks good.  Mind if I change the subject to:
> "exec: Fix a deadlock in strace" ?
> 

Sure, go ahead.

Thanks
Bernd.

> Eric
> 
> 
>>
>> strace          D    0 30614  30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect          D    0 31933  30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> This changes mm_access to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This patch is based on the following patch by Eric W. Biederman:
>> "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
>> Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
>>
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>> ---
>>  kernel/fork.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index c12595a..5720ff3 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>>  	struct mm_struct *mm;
>>  	int err;
>>  
>> -	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
>> +	err =  mutex_lock_killable(&task->signal->exec_update_mutex);
>>  	if (err)
>>  		return ERR_PTR(err);
>>  
>> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>>  		mmput(mm);
>>  		mm = ERR_PTR(-EACCES);
>>  	}
>> -	mutex_unlock(&task->signal->cred_guard_mutex);
>> +	mutex_unlock(&task->signal->exec_update_mutex);
>>  
>>  	return mm;
>>  }


* Re: [PATCH 0/4] Use new infrastructure to fix deadlocks in execve
  2020-03-10 13:43                                                                       ` [PATCH 0/4] Use new infrastructure to fix deadlocks in execve Bernd Edlinger
@ 2020-03-10 15:35                                                                         ` Eric W. Biederman
  2020-03-10 17:44                                                                           ` [PATCH 0/4] Use new infrastructure in more simple cases Bernd Edlinger
                                                                                             ` (4 more replies)
  0 siblings, 5 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 15:35 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This is a follow-up to Eric's patch series to
> fix the deadlocks observed when ptracing multi-threaded
> applications that call execve.
>
> This fixes the simple and most important case where
> the cred_guard_mutex causes strace to deadlock.
>
> This also adds a test case (which is only partially
> fixed so far, the rest of the fixes will follow
> soon).
>
> Two trivial comment fixes are also included.
>
> Bernd Edlinger (4):
>   exec: Fix a deadlock in ptrace
>   selftests/ptrace: add test cases for dead-locks
>   mm: docs: Fix a comment in process_vm_rw_core
>   kernel: doc: remove outdated comment in prepare_kernel_cred
>
>  kernel/cred.c                             |  2 -
>  kernel/fork.c                             |  4 +-
>  mm/process_vm_access.c                    |  2 +-
>  tools/testing/selftests/ptrace/Makefile   |  4 +-
>  tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
>  5 files changed, 91 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

Applied.

Thank you,
Eric



* [PATCH 0/4] Use new infrastructure in more simple cases
  2020-03-10 15:35                                                                         ` Eric W. Biederman
@ 2020-03-10 17:44                                                                           ` Bernd Edlinger
  2020-03-10 17:45                                                                           ` [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve Bernd Edlinger
                                                                                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 17:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This continues the execve anti-deadlock patch and addresses all
of the (mostly) simple cases, where the new exec_update_mutex
can be used instead of the cred_guard_mutex.

Note: these patches are independent of each other, so
in case one of them turns out to be controversial, that does
not affect the others.

Bernd Edlinger (4):
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve

 fs/proc/base.c       | 10 +++++-----
 kernel/events/core.c | 12 ++++++------
 kernel/kcmp.c        |  8 ++++----
 3 files changed, 15 insertions(+), 15 deletions(-)

-- 
1.9.1


* [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  2020-03-10 15:35                                                                         ` Eric W. Biederman
  2020-03-10 17:44                                                                           ` [PATCH 0/4] Use new infrastructure in more simple cases Bernd Edlinger
@ 2020-03-10 17:45                                                                           ` Bernd Edlinger
  2020-03-10 19:01                                                                             ` Eric W. Biederman
  2020-03-10 17:45                                                                           ` [PATCH 2/4] proc: " Bernd Edlinger
                                                                                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 17:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This changes kcmp_epoll_target to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials are only used for reading,
and furthermore ->mm and ->sighand are updated on execve,
but only under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 kernel/kcmp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index a0e3d7a..b3ff928 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
-			&task2->signal->cred_guard_mutex);
+	ret = kcmp_lock(&task1->signal->exec_update_mutex,
+			&task2->signal->exec_update_mutex);
 	if (ret)
 		goto err;
 	if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
@@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
 	}
 
 err_unlock:
-	kcmp_unlock(&task1->signal->cred_guard_mutex,
-		    &task2->signal->cred_guard_mutex);
+	kcmp_unlock(&task1->signal->exec_update_mutex,
+		    &task2->signal->exec_update_mutex);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);
-- 
1.9.1
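
The kcmp_lock()/kcmp_unlock() helpers touched by this patch take two
per-task mutexes at once.  Their bodies are not shown in this thread, so
the following is only a userspace sketch with pthreads of how such a
pair of locks can be taken safely, assuming acquisition in address
order.  The names fake_signal, pair_lock and pair_unlock are
hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-task signal struct. */
struct fake_signal {
	pthread_mutex_t exec_update_mutex;
};

/*
 * Lock two mutexes in a fixed (address) order, so that two callers
 * passing the same pair in opposite order cannot deadlock against
 * each other.  If both pointers name the same mutex, it is locked
 * only once.  Returns 0 on success or a pthread error number.
 */
static int pair_lock(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	int err;

	assert(m1 != NULL && m2 != NULL);

	/* Always take the lower address first. */
	if ((uintptr_t)m2 < (uintptr_t)m1) {
		pthread_mutex_t *tmp = m1;
		m1 = m2;
		m2 = tmp;
	}

	err = pthread_mutex_lock(m1);
	if (err)
		return err;

	if (m1 != m2) {
		err = pthread_mutex_lock(m2);
		if (err)
			pthread_mutex_unlock(m1); /* undo the first lock */
	}
	return err;
}

/* Release both mutexes; unlock order does not matter for correctness. */
static void pair_unlock(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	pthread_mutex_unlock(m1);
	if (m1 != m2)
		pthread_mutex_unlock(m2);
}
```

With this ordering, pair_lock(&a, &b) and pair_lock(&b, &a) both end up
locking the same mutex first, which is what rules out an ABBA deadlock
between two concurrent comparisons of the same two tasks.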


* [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
  2020-03-10 15:35                                                                         ` Eric W. Biederman
  2020-03-10 17:44                                                                           ` [PATCH 0/4] Use new infrastructure in more simple cases Bernd Edlinger
  2020-03-10 17:45                                                                           ` [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve Bernd Edlinger
@ 2020-03-10 17:45                                                                           ` Bernd Edlinger
  2020-03-11 18:59                                                                             ` Kees Cook
  2020-03-11 19:10                                                                             ` Kees Cook
  2020-03-10 17:45                                                                           ` [PATCH 3/4] proc: io_accounting: " Bernd Edlinger
  2020-03-10 17:45                                                                           ` [PATCH 4/4] perf: " Bernd Edlinger
  4 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 17:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the tracer is accessing
/proc/$pid/stack for instance.

This should be safe, as the credentials are only used for reading,
and task->mm is updated on execve under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/proc/base.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea950..4fdfe4f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 
 static int lock_trace(struct task_struct *task)
 {
-	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	int err = mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (err)
 		return err;
 	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
+		mutex_unlock(&task->signal->exec_update_mutex);
 		return -EPERM;
 	}
 	return 0;
@@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
 
 static void unlock_trace(struct task_struct *task)
 {
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 }
 
 #ifdef CONFIG_STACKTRACE
-- 
1.9.1


* [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  2020-03-10 15:35                                                                         ` Eric W. Biederman
                                                                                             ` (2 preceding siblings ...)
  2020-03-10 17:45                                                                           ` [PATCH 2/4] proc: " Bernd Edlinger
@ 2020-03-10 17:45                                                                           ` Bernd Edlinger
  2020-03-10 19:06                                                                             ` Eric W. Biederman
  2020-03-11 19:08                                                                             ` Kees Cook
  2020-03-10 17:45                                                                           ` [PATCH 4/4] perf: " Bernd Edlinger
  4 siblings, 2 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 17:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the tracer is accessing
/proc/$pid/io for instance.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/proc/base.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4fdfe4f..529d0c6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	unsigned long flags;
 	int result;
 
-	result = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	result = mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (result)
 		return result;
 
@@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	result = 0;
 
 out_unlock:
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 	return result;
 }
 
-- 
1.9.1


* [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
  2020-03-10 15:35                                                                         ` Eric W. Biederman
                                                                                             ` (3 preceding siblings ...)
  2020-03-10 17:45                                                                           ` [PATCH 3/4] proc: io_accounting: " Bernd Edlinger
@ 2020-03-10 17:45                                                                           ` Bernd Edlinger
  4 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 17:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

This changes perf_event_set_clock to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 kernel/events/core.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2173c23..c37f6eb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1248,7 +1248,7 @@ static void put_ctx(struct perf_event_context *ctx)
  * function.
  *
  * Lock order:
- *    cred_guard_mutex
+ *    exec_update_mutex
  *	task_struct::perf_event_mutex
  *	  perf_event_context::mutex
  *	    perf_event::child_mutex;
@@ -11254,14 +11254,14 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 	}
 
 	if (task) {
-		err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
+		err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
 		if (err)
 			goto err_task;
 
 		/*
 		 * Reuse ptrace permission checks for now.
 		 *
-		 * We must hold cred_guard_mutex across this and any potential
+		 * We must hold exec_update_mutex across this and any potential
 		 * perf_install_in_context() call for this new event to
 		 * serialize against exec() altering our credentials (and the
 		 * perf_event_exit_task() that could imply).
@@ -11550,7 +11550,7 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 	mutex_unlock(&ctx->mutex);
 
 	if (task) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
+		mutex_unlock(&task->signal->exec_update_mutex);
 		put_task_struct(task);
 	}
 
@@ -11586,7 +11586,7 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 		free_event(event);
 err_cred:
 	if (task)
-		mutex_unlock(&task->signal->cred_guard_mutex);
+		mutex_unlock(&task->signal->exec_update_mutex);
 err_task:
 	if (task)
 		put_task_struct(task);
@@ -11891,7 +11891,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 /*
  * When a child task exits, feed back event values to parent events.
  *
- * Can be called with cred_guard_mutex held when called from
+ * Can be called with exec_update_mutex held when called from
  * install_exec_creds().
  */
 void perf_event_exit_task(struct task_struct *child)
-- 
1.9.1


* [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10  8:55                                                           ` Christian Brauner
@ 2020-03-10 18:52                                                             ` Eric W. Biederman
  2020-03-10 19:15                                                               ` Christian Brauner
  2020-03-10 19:16                                                               ` Jann Horn
  0 siblings, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 18:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon


During exec some file descriptors are closed and the files struct is
unshared.  But all of that can happen at other times and it has the
same protections during exec as at ordinary times.  So stop taking the
cred_guard_mutex as it is useless.

Furthermore the cred_guard_mutex is a bad idea because it is deadlock
prone, as it is held in several places while waiting possibly
indefinitely for userspace to do something.

Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/pid.c | 6 ------
 1 file changed, 6 deletions(-)

Christian if you don't have any objections I will take this one through
my tree.

I tried to figure out why this code path takes the cred_guard_mutex and
the archive on lore.kernel.org was not helpful in finding that part of
the conversation.

diff --git a/kernel/pid.c b/kernel/pid.c
index 60820e72634c..53646d5616d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	struct file *file;
 	int ret;
 
-	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
-	if (ret)
-		return ERR_PTR(ret);
-
 	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
 		file = fget_task(task, fd);
 	else
 		file = ERR_PTR(-EPERM);
 
-	mutex_unlock(&task->signal->cred_guard_mutex);
-
 	return file ?: ERR_PTR(-EBADF);
 }
 
-- 
2.20.1



* Re: [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  2020-03-10 17:45                                                                           ` [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve Bernd Edlinger
@ 2020-03-10 19:01                                                                             ` Eric W. Biederman
  2020-03-10 19:42                                                                               ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 19:01 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This changes kcmp_epoll_target to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This should be safe, as the credentials are only used for reading,
> and furthermore ->mm and ->sighand are updated on execve,
> but only under the new exec_update_mutex.
>

Can you add a comment that the exec_update_mutex is not needed for
KCMP_FILE?  Both sets of credentials during exec are valid
for accessing the files, so exec_update_mutex does not matter.

I don't think exec_update_mutex is needed for KCMP_SYSVSEM
or KCMP_EPOLL_TFD either, as I don't think exec changes either
one of those.

Eric


> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  kernel/kcmp.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kcmp.c b/kernel/kcmp.c
> index a0e3d7a..b3ff928 100644
> --- a/kernel/kcmp.c
> +++ b/kernel/kcmp.c
> @@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
>  	/*
>  	 * One should have enough rights to inspect task details.
>  	 */
> -	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
> -			&task2->signal->cred_guard_mutex);
> +	ret = kcmp_lock(&task1->signal->exec_update_mutex,
> +			&task2->signal->exec_update_mutex);
>  	if (ret)
>  		goto err;
>  	if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
> @@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
>  	}
>  
>  err_unlock:
> -	kcmp_unlock(&task1->signal->cred_guard_mutex,
> -		    &task2->signal->cred_guard_mutex);
> +	kcmp_unlock(&task1->signal->exec_update_mutex,
> +		    &task2->signal->exec_update_mutex);
>  err:
>  	put_task_struct(task1);
>  	put_task_struct(task2);


* Re: [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  2020-03-10 17:45                                                                           ` [PATCH 3/4] proc: io_accounting: " Bernd Edlinger
@ 2020-03-10 19:06                                                                             ` Eric W. Biederman
  2020-03-10 20:19                                                                               ` Bernd Edlinger
  2020-03-11 19:08                                                                             ` Kees Cook
  1 sibling, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 19:06 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> This changes do_io_accounting to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This fixes possible deadlocks when the tracer is accessing
> /proc/$pid/io for instance.
>
> This should be safe, as the credentials are only used for reading.

This is an improvement.

We probably want to do this as an incremental step in making things
better, but perhaps I am blind: I do not find the reason for
guarding this with the cred_guard_mutex at all persuasive.

I think moving the ptrace_may_access check down to after the
unlock_task_sighand would be just as effective at addressing the
concerns raised in the original commit.  I think the task_lock provides
all of the barrier we need to make it safe to move the
ptrace_may_access check.

The reason I say this is that I don't see exec changing ->ioac.  It just
performs some I/O, which updates the io accounting statistics.

Can anyone see if I am wrong?

Eric


commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0
Author: Vasiliy Kulikov <segoon@openwall.com>
Date:   Tue Jul 26 16:08:38 2011 -0700

    proc: fix a race in do_io_accounting()
    
    If an inode's mode permits opening /proc/PID/io and the resulting file
    descriptor is kept across execve() of a setuid or similar binary, the
    ptrace_may_access() check tries to prevent using this fd against the
    task with escalated privileges.
    
    Unfortunately, there is a race in the check against execve().  If
    execve() is processed after the ptrace check, but before the actual io
    information gathering, io statistics will be gathered from the
    privileged process.  At least in theory this might lead to gathering
    sensible information (like ssh/ftp password length) that wouldn't be
    available otherwise.
    
    Holding task->signal->cred_guard_mutex while gathering the io
    information should protect against the race.
    
    The order of locking is similar to the one inside of ptrace_attach():
    first goes cred_guard_mutex, then lock_task_sighand().
    
    Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: <stable@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
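
The race described in the commit message above is a check-then-use
problem: the permission check and the read of the statistics must not
be separated by an exec that raises privileges.  The following is only
a userspace sketch with pthreads of the "check and read inside one
critical section" approach; fake_task, may_access, fake_exec_privileged
and read_io_stat are hypothetical stand-ins, not the kernel's
do_io_accounting() or ptrace_may_access().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task; not a kernel structure. */
struct fake_task {
	pthread_mutex_t exec_update_mutex; /* held while "exec" swaps state */
	bool privileged;                   /* set once a setuid-like image runs */
	unsigned long read_bytes;          /* stand-in for an io accounting field */
};

/* Stand-in permission check: only unprivileged tasks may be inspected. */
static bool may_access(const struct fake_task *t)
{
	return !t->privileged;
}

/* Simulated exec of a privileged binary; state changes under the mutex. */
static void fake_exec_privileged(struct fake_task *t)
{
	pthread_mutex_lock(&t->exec_update_mutex);
	t->privileged = true;
	t->read_bytes += 4096; /* the new image performs some I/O */
	pthread_mutex_unlock(&t->exec_update_mutex);
}

/*
 * Check permission and gather the statistic while holding the same
 * mutex that the simulated exec takes, so the exec cannot slip in
 * between the check and the read.
 */
static int read_io_stat(struct fake_task *t, unsigned long *out)
{
	int ret = 0;

	assert(t != NULL && out != NULL);

	pthread_mutex_lock(&t->exec_update_mutex);
	if (!may_access(t))
		ret = -EACCES;
	else
		*out = t->read_bytes;
	pthread_mutex_unlock(&t->exec_update_mutex);

	return ret;
}
```

In this sketch a reader either sees the counters of the unprivileged
task or gets -EACCES; it never passes the check and then reads the
counters of the privileged image.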



> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  fs/proc/base.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fdfe4f..529d0c6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>  	unsigned long flags;
>  	int result;
>  
> -	result = mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	result = mutex_lock_killable(&task->signal->exec_update_mutex);
>  	if (result)
>  		return result;
>  
> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>  	result = 0;
>  
>  out_unlock:
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->exec_update_mutex);
>  	return result;
>  }


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 18:52                                                             ` [PATCH] pidfd: Stop taking cred_guard_mutex Eric W. Biederman
@ 2020-03-10 19:15                                                               ` Christian Brauner
  2020-03-10 19:16                                                               ` Jann Horn
  1 sibling, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-10 19:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On Tue, Mar 10, 2020 at 01:52:05PM -0500, Eric W. Biederman wrote:
> 
> During exec some file descriptors are closed and the files struct is
> unshared.  But all of that can happen at other times and it has the
> same protections during exec as at ordinary times.  So stop taking the
> cred_guard_mutex as it is useless.
> 
> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> prone, as it is held in several places while waiting possibly
> indefinitely for userspace to do something.
> 
> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/pid.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> Christian if you don't have any objections I will take this one through
> my tree.

Sure.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

> 
> I tried to figure out why this code path takes the cred_guard_mutex and
> the archive on lore.kernel.org was not helpful in finding that part of
> the conversation.

Let me think a little harder and hopefully get back to you with a
sensible explanation.


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 18:52                                                             ` [PATCH] pidfd: Stop taking cred_guard_mutex Eric W. Biederman
  2020-03-10 19:15                                                               ` Christian Brauner
@ 2020-03-10 19:16                                                               ` Jann Horn
  2020-03-10 19:27                                                                 ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-10 19:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Bernd Edlinger, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> During exec some file descriptors are closed and the files struct is
> unshared.  But all of that can happen at other times and it has the
> same protections during exec as at ordinary times.  So stop taking the
> cred_guard_mutex as it is useless.
>
> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> prone, as it is held in several places while waiting possibly
> indefinitely for userspace to do something.

Please don't. Just use the new exec_update_mutex like everywhere else.

> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/pid.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> Christian if you don't have any objections I will take this one through
> my tree.
>
> I tried to figure out why this code path takes the cred_guard_mutex and
> the archive on lore.kernel.org was not helpful in finding that part of
> the conversation.

That was my suggestion.

> diff --git a/kernel/pid.c b/kernel/pid.c
> index 60820e72634c..53646d5616d2 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
>         struct file *file;
>         int ret;
>
> -       ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
> -       if (ret)
> -               return ERR_PTR(ret);
> -
>         if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
>                 file = fget_task(task, fd);
>         else
>                 file = ERR_PTR(-EPERM);
>
> -       mutex_unlock(&task->signal->cred_guard_mutex);
> -
>         return file ?: ERR_PTR(-EBADF);
>  }

If you make this change, then if this races with execution of a setuid
program that afterwards e.g. opens a unix domain socket, an attacker
will be able to steal that socket and inject messages into
communication with things like DBus. procfs currently has the same
race, and that still needs to be fixed, but at least procfs doesn't
let you open things like sockets because they don't have a working
->open handler, and it enforces the normal permission check for opening files.
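
For readers unfamiliar with the interface under discussion, here is a
minimal userspace sketch of how pidfd_getfd() is reached. It is not part
of the patch. The helper name dup_fd_from_pid() is made up for
illustration, and the fallback syscall numbers (434 and 438) are the
generic ones, used only if the libc headers do not define them. The
kernel-side __pidfd_fget() shown in the diff above is what performs the
ptrace_may_access() check on behalf of this call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers; these are the generic syscall numbers. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/*
 * Hypothetical helper: duplicate file descriptor `targetfd` of process
 * `pid` into the calling process. Returns the new descriptor, or a
 * negative errno value on failure (e.g. -EPERM if the ptrace access
 * check fails, -ENOSYS on kernels without pidfd_getfd).
 */
static int dup_fd_from_pid(pid_t pid, int targetfd)
{
	int pidfd, fd, err;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
	if (pidfd < 0)
		return -errno;

	fd = (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, 0U);
	err = errno;	/* save before close() can overwrite it */
	close(pidfd);

	return fd < 0 ? -err : fd;
}
```

Calling dup_fd_from_pid(getpid(), fd) on one's own process should pass
the access check on a kernel that supports the syscall. The race
described above only matters when the target is a different process
that is in the middle of execve().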


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 19:16                                                               ` Jann Horn
@ 2020-03-10 19:27                                                                 ` Eric W. Biederman
  2020-03-10 20:00                                                                   ` Jann Horn
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 19:27 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Bernd Edlinger, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

Jann Horn <jannh@google.com> writes:

> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> During exec some file descriptors are closed and the files struct is
>> unshared.  But all of that can happen at other times and it has the
>> same protections during exec as at ordinary times.  So stop taking the
>> cred_guard_mutex as it is useless.
>>
>> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>> prone, as it is held in several places while waiting possibly
>> indefinitely for userspace to do something.
>
> Please don't. Just use the new exec_update_mutex like everywhere else.
>
>> Cc: Sargun Dhillon <sargun@sargun.me>
>> Cc: Christian Brauner <christian.brauner@ubuntu.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/pid.c | 6 ------
>>  1 file changed, 6 deletions(-)
>>
>> Christian if you don't have any objections I will take this one through
>> my tree.
>>
>> I tried to figure out why this code path takes the cred_guard_mutex and
>> the archive on lore.kernel.org was not helpful in finding that part of
>> the conversation.
>
> That was my suggestion.
>
>> diff --git a/kernel/pid.c b/kernel/pid.c
>> index 60820e72634c..53646d5616d2 100644
>> --- a/kernel/pid.c
>> +++ b/kernel/pid.c
>> @@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
>>         struct file *file;
>>         int ret;
>>
>> -       ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> -       if (ret)
>> -               return ERR_PTR(ret);
>> -
>>         if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
>>                 file = fget_task(task, fd);
>>         else
>>                 file = ERR_PTR(-EPERM);
>>
>> -       mutex_unlock(&task->signal->cred_guard_mutex);
>> -
>>         return file ?: ERR_PTR(-EBADF);
>>  }
>
> If you make this change, then if this races with execution of a setuid
> program that afterwards e.g. opens a unix domain socket, an attacker
> will be able to steal that socket and inject messages into
> communication with things like DBus. procfs currently has the same
> race, and that still needs to be fixed, but at least procfs doesn't
> let you open things like sockets because they don't have a working
> ->open handler, and it enforces the normal permission check for
> opening files.

It isn't only exec that can change credentials.  Do we need a lock for
changing credentials?

Wouldn't it be sufficient to simply test ptrace_may_access after
we get a copy of the file?

If we need a lock around credential change let's design and build that.
Having a mismatch between what a lock is designed to do, and what
people use it for can only result in other bugs as people get confused.

Eric


* Re: [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  2020-03-10 19:01                                                                             ` Eric W. Biederman
@ 2020-03-10 19:42                                                                               ` Bernd Edlinger
  0 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 19:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/10/20 8:01 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> This changes kcmp_epoll_target to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This should be safe, as the credentials are only used for reading,
>> and furthermore ->mm and ->sighand are updated on execve,
>> but only under the new exec_update_mutex.
>>
> 
> Can you add a comment that the exec_update_mutex is not needed for
> KCMP_FILE?  Both sets of credentials during exec are valid for
> accessing the files, so exec_update_mutex does not matter.
> 

Some files are closed by do_close_on_exec, so in theory this lets
you examine, with either credential, files that were open for the
old user but closed for the new user.

It is not a race condition, but it may be a security
concern.

> I don't think exec_update_mutex is needed for KCMP_SYSVSEM
> or KCMP_EPOLL_TFD either, as I don't think exec changes either
> one of those.
> 

KCMP_EPOLL_TFD also accesses file pointers,
so the same is possible there.

It might be that KCMP_SYSVSEM is a missed optimization, but
I may have overlooked something.
I'd rather err on the safe side.

> Eric
> 
> 
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>> ---
>>  kernel/kcmp.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/kcmp.c b/kernel/kcmp.c
>> index a0e3d7a..b3ff928 100644
>> --- a/kernel/kcmp.c
>> +++ b/kernel/kcmp.c
>> @@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
>>  	/*
>>  	 * One should have enough rights to inspect task details.
>>  	 */
>> -	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
>> -			&task2->signal->cred_guard_mutex);
>> +	ret = kcmp_lock(&task1->signal->exec_update_mutex,
>> +			&task2->signal->exec_update_mutex);
>>  	if (ret)
>>  		goto err;
>>  	if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
>> @@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
>>  	}
>>  
>>  err_unlock:
>> -	kcmp_unlock(&task1->signal->cred_guard_mutex,
>> -		    &task2->signal->cred_guard_mutex);
>> +	kcmp_unlock(&task1->signal->exec_update_mutex,
>> +		    &task2->signal->exec_update_mutex);
>>  err:
>>  	put_task_struct(task1);
>>  	put_task_struct(task2);


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 19:27                                                                 ` Eric W. Biederman
@ 2020-03-10 20:00                                                                   ` Jann Horn
  2020-03-10 20:10                                                                     ` Jann Horn
  0 siblings, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-10 20:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Bernd Edlinger, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Jann Horn <jannh@google.com> writes:
> > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> During exec some file descriptors are closed and the files struct is
> >> unshared.  But all of that can happen at other times and it has the
> >> same protections during exec as at ordinary times.  So stop taking the
> >> cred_guard_mutex as it is useless.
> >>
> >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> >> prone, as it is held in several places while waiting possibly
> >> indefinitely for userspace to do something.
> >
> > Please don't. Just use the new exec_update_mutex like everywhere else.
> >
> >> Cc: Sargun Dhillon <sargun@sargun.me>
> >> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> >> Cc: Arnd Bergmann <arnd@arndb.de>
> >> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  kernel/pid.c | 6 ------
> >>  1 file changed, 6 deletions(-)
> >>
> >> Christian if you don't have any objections I will take this one through
> >> my tree.
> >>
> >> I tried to figure out why this code path takes the cred_guard_mutex and
> >> the archive on lore.kernel.org was not helpful in finding that part of
> >> the conversation.
> >
> > That was my suggestion.
> >
> >> diff --git a/kernel/pid.c b/kernel/pid.c
> >> index 60820e72634c..53646d5616d2 100644
> >> --- a/kernel/pid.c
> >> +++ b/kernel/pid.c
> >> @@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
> >>         struct file *file;
> >>         int ret;
> >>
> >> -       ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
> >> -       if (ret)
> >> -               return ERR_PTR(ret);
> >> -
> >>         if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
> >>                 file = fget_task(task, fd);
> >>         else
> >>                 file = ERR_PTR(-EPERM);
> >>
> >> -       mutex_unlock(&task->signal->cred_guard_mutex);
> >> -
> >>         return file ?: ERR_PTR(-EBADF);
> >>  }
> >
> > If you make this change, then if this races with execution of a setuid
> > program that afterwards e.g. opens a unix domain socket, an attacker
> > will be able to steal that socket and inject messages into
> > communication with things like DBus. procfs currently has the same
> > race, and that still needs to be fixed, but at least procfs doesn't
> > let you open things like sockets because they don't have a working
> > ->open handler, and it enforces the normal permission check for
> > opening files.
>
> It isn't only exec that can change credentials.  Do we need a lock for
> changing credentials?

Hmm, I guess so? Normally, a task that's changing credentials becomes
nondumpable at the same time (and there are explicit memory barriers
in commit_creds() and __ptrace_may_access() to enforce the ordering
for this); so you normally don't see tasks becoming ptrace-accessible
via anything other than execve(). But I guess if someone opens a
root-only file, closes it, drops privileges, and then explicitly does
prctl(PR_SET_DUMPABLE, 1), we should probably protect that, too.

> Wouldn't it be sufficient to simply test ptrace_may_access after
> we get a copy of the file?

There are also setuid helpers that can, after having done privileged
stuff, drop privileges and call execve(); after that,
ptrace_may_access() succeeds again. In particular, polkit has a helper
that does this.

> If we need a lock around credential change let's design and build that.
> Having a mismatch between what a lock is designed to do, and what
> people use it for can only result in other bugs as people get confused.

Hmm... what benefits do we get from making it a separate lock? I guess
it would allow us to make it a per-task lock instead of a
signal_struct-wide one? That might be helpful...
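
The last case in the first paragraph above, a process that drops
privileges and then explicitly makes itself dumpable again, relies on
the PR_SET_DUMPABLE prctl. As a minimal userspace sketch (the helper
name is made up for illustration; it only toggles the flag and does not
demonstrate the race itself):

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set the calling process's "dumpable" attribute
 * and return the value the kernel reports afterwards, or -1 on error.
 */
static int set_and_get_dumpable(int value)
{
	if (prctl(PR_SET_DUMPABLE, value, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

While the flag is 0, ptrace-style access checks are expected to fail for
callers without CAP_SYS_PTRACE over the target. Setting it back to 1 is
what can make the task ptrace-accessible again without going through
execve().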


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 20:00                                                                   ` Jann Horn
@ 2020-03-10 20:10                                                                     ` Jann Horn
  2020-03-10 20:22                                                                       ` Bernd Edlinger
  2020-03-10 20:57                                                                       ` Eric W. Biederman
  0 siblings, 2 replies; 203+ messages in thread
From: Jann Horn @ 2020-03-10 20:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Bernd Edlinger, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <jannh@google.com> wrote:
> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Jann Horn <jannh@google.com> writes:
> > > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >> During exec some file descriptors are closed and the files struct is
> > >> unshared.  But all of that can happen at other times and it has the
> > >> same protections during exec as at ordinary times.  So stop taking the
> > >> cred_guard_mutex as it is useless.
> > >>
> > >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> > >> prone, as it is held in several places while waiting possibly
> > >> indefinitely for userspace to do something.
[...]
> > > If you make this change, then if this races with execution of a setuid
> > > program that afterwards e.g. opens a unix domain socket, an attacker
> > > will be able to steal that socket and inject messages into
> > > communication with things like DBus. procfs currently has the same
> > > race, and that still needs to be fixed, but at least procfs doesn't
> > > let you open things like sockets because they don't have a working
> > > ->open handler, and it enforces the normal permission check for
> > > opening files.
> >
> > It isn't only exec that can change credentials.  Do we need a lock for
> > changing credentials?
[...]
> > If we need a lock around credential change let's design and build that.
> > Having a mismatch between what a lock is designed to do, and what
> > people use it for can only result in other bugs as people get confused.
>
> Hmm... what benefits do we get from making it a separate lock? I guess
> it would allow us to make it a per-task lock instead of a
> signal_struct-wide one? That might be helpful...

But actually, isn't the core purpose of the cred_guard_mutex to guard
against concurrent credential changes anyway? That's what almost
everyone uses it for, and it's in the name...


* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-09 20:48                                                         ` Eric W. Biederman
  2020-03-10  8:55                                                           ` Christian Brauner
@ 2020-03-10 20:16                                                           ` Kees Cook
  1 sibling, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Bernd Edlinger, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Mon, Mar 09, 2020 at 03:48:55PM -0500, Eric W. Biederman wrote:
> And I completely agree that we should at least rename tsk to me.
> Just for clarity.

I think it wouldn't hurt to add comments to spell it out explicitly
in each of the tsk->me functions, something like:

/*
 * The "me" task_struct argument here must only ever refer to "current",
 * but it gets passed in to avoid re-calculating "current" in each helper.
 */

I've found that the exec code in its entirety would be better off with
more comments. :) Usually that's the bulk of what I find myself adding
when I make changes in this area. ;)

-Kees

-- 
Kees Cook


* Re: [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-08 21:35                                                 ` [PATCH v2 1/5] exec: Only compute current once in flush_old_exec Eric W. Biederman
  2020-03-09 13:56                                                   ` Bernd Edlinger
@ 2020-03-10 20:17                                                   ` Kees Cook
  2020-03-10 21:12                                                   ` Christian Brauner
  2 siblings, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:35:26PM -0500, Eric W. Biederman wrote:
> 
> Make it clear that current only needs to be computed once in
> flush_old_exec.  This may have some efficiency improvements and it
> makes the code easier to change.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Modulo my suggestion of adding more comments (it could even be kernel-doc!)
that explicitly state that "me" should always be "current", yup, looks
good:

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  fs/exec.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..c3f34791f2f0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>   */
>  int flush_old_exec(struct linux_binprm * bprm)
>  {
> +	struct task_struct *me = current;
>  	int retval;
>  
>  	/*
>  	 * Make sure we have a private signal table and that
>  	 * we are unassociated from the previous thread group.
>  	 */
> -	retval = de_thread(current);
> +	retval = de_thread(me);
>  	if (retval)
>  		goto out;
>  
> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	bprm->mm = NULL;
>  
>  	set_fs(USER_DS);
> -	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> +	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
>  	flush_thread();
> -	current->personality &= ~bprm->per_clear;
> +	me->personality &= ~bprm->per_clear;
>  
>  	/*
>  	 * We have to apply CLOEXEC before we change whether the process is
> @@ -1305,7 +1306,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 * trying to access the should-be-closed file descriptors of a process
>  	 * undergoing exec(2).
>  	 */
> -	do_close_on_exec(current->files);
> +	do_close_on_exec(me->files);
>  	return 0;
>  
>  out:
> -- 
> 2.25.0
> 

-- 
Kees Cook


* Re: [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  2020-03-10 19:06                                                                             ` Eric W. Biederman
@ 2020-03-10 20:19                                                                               ` Bernd Edlinger
  2020-03-10 21:25                                                                                 ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 20:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/10/20 8:06 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> This changes do_io_accounting to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This fixes possible deadlocks when the tracer is accessing
>> /proc/$pid/io for instance.
>>
>> This should be safe, as the credentials are only used for reading.
> 
> This is an improvement.
> 
> We probably want to do this just as an incremental step in making things
> better, but perhaps I am blind: I am not finding the reason for
> guarding this with the cred_guard_mutex to be at all persuasive.
> 
> I think moving the ptrace_may_access check down to after the
> unlock_task_sighand would be just as effective at addressing the
> concerns raised in the original commit.  I think the task_lock provides
> all of the barrier we need to make it safe to move the
> ptrace_may_access checks.
> 
> The reason I say this is I don't see exec changing ->ioac.  Just
> performing some I/O which would update the io accounting statistics.
> 

Maybe the suid executable is starting up and doing io, or maybe not.
What the program does immediately at startup is a secret that we
want to keep, but evil Eve wants to find out.
Eve is using /proc/alice/io to do that.

It is a bit constructed, but it seems like a security concern.
When we hold the exec_update_mutex while collecting the data, we
cannot see any io of the new process when the new credentials
don't allow that.


Bernd.
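
The argument above, that the permission check and the data gathering
must sit under one lock, can be illustrated with a userspace analogy.
This is a toy sketch using pthreads, not kernel code; struct toy_task
and both functions are invented for illustration.

```c
#include <errno.h>
#include <pthread.h>

/* Toy stand-in for a task; none of this is kernel code. */
struct toy_task {
	pthread_mutex_t exec_update_mutex; /* held by "exec" while it swaps state */
	int privileged;                    /* 1 once the "setuid exec" has happened */
	long io_bytes;                     /* statistic an observer wants to read */
};

/* "exec": credentials and the data they guard change under the mutex. */
static void toy_exec_setuid(struct toy_task *t)
{
	pthread_mutex_lock(&t->exec_update_mutex);
	t->privileged = 1;
	t->io_bytes = 0;
	pthread_mutex_unlock(&t->exec_update_mutex);
}

/*
 * Observer: the permission check and the read happen under the same
 * lock, so an exec cannot slip in between them. Returns 0 and stores
 * the statistic in *out, or -EPERM if the task is privileged.
 */
static int toy_read_io(struct toy_task *t, long *out)
{
	int ret = -EPERM;

	pthread_mutex_lock(&t->exec_update_mutex);
	if (!t->privileged) {
		*out = t->io_bytes;
		ret = 0;
	}
	pthread_mutex_unlock(&t->exec_update_mutex);
	return ret;
}
```

If the check were done before taking the lock, toy_exec_setuid() could
run between the check and the read, and the observer would see the
statistics of the privileged program.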

> Can anyone see if I am wrong?
> 
> Eric
> 
> 
> commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0
> Author: Vasiliy Kulikov <segoon@openwall.com>
> Date:   Tue Jul 26 16:08:38 2011 -0700
> 
>     proc: fix a race in do_io_accounting()
>     
>     If an inode's mode permits opening /proc/PID/io and the resulting file
>     descriptor is kept across execve() of a setuid or similar binary, the
>     ptrace_may_access() check tries to prevent using this fd against the
>     task with escalated privileges.
>     
>     Unfortunately, there is a race in the check against execve().  If
>     execve() is processed after the ptrace check, but before the actual io
>     information gathering, io statistics will be gathered from the
>     privileged process.  At least in theory this might lead to gathering
>     sensitive information (like ssh/ftp password length) that wouldn't be
>     available otherwise.
>     
>     Holding task->signal->cred_guard_mutex while gathering the io
>     information should protect against the race.
>     
>     The order of locking is similar to the one inside of ptrace_attach():
>     first goes cred_guard_mutex, then lock_task_sighand().
>     
>     Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
>     Cc: Al Viro <viro@zeniv.linux.org.uk>
>     Cc: <stable@kernel.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> 
> 
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>> ---
>>  fs/proc/base.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 4fdfe4f..529d0c6 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>>  	unsigned long flags;
>>  	int result;
>>  
>> -	result = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> +	result = mutex_lock_killable(&task->signal->exec_update_mutex);
>>  	if (result)
>>  		return result;
>>  
>> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>>  	result = 0;
>>  
>>  out_unlock:
>> -	mutex_unlock(&task->signal->cred_guard_mutex);
>> +	mutex_unlock(&task->signal->exec_update_mutex);
>>  	return result;
>>  }


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 20:10                                                                     ` Jann Horn
@ 2020-03-10 20:22                                                                       ` Bernd Edlinger
  2020-03-11  6:11                                                                         ` Bernd Edlinger
  2020-03-10 20:57                                                                       ` Eric W. Biederman
  1 sibling, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 20:22 UTC (permalink / raw)
  To: Jann Horn, Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On 3/10/20 9:10 PM, Jann Horn wrote:
> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <jannh@google.com> wrote:
>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> Jann Horn <jannh@google.com> writes:
>>>> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>> During exec some file descriptors are closed and the files struct is
>>>>> unshared.  But all of that can happen at other times and it has the
>>>>> same protections during exec as at ordinary times.  So stop taking the
>>>>> cred_guard_mutex as it is useless.
>>>>>
>>>>> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>>>>> prone, as it is held in several places while waiting possibly
>>>>> indefinitely for userspace to do something.
> [...]
>>>> If you make this change, then if this races with execution of a setuid
>>>> program that afterwards e.g. opens a unix domain socket, an attacker
>>>> will be able to steal that socket and inject messages into
>>>> communication with things like DBus. procfs currently has the same
>>>> race, and that still needs to be fixed, but at least procfs doesn't
>>>> let you open things like sockets because they don't have a working
>>>> ->open handler, and it enforces the normal permission check for
>>>> opening files.
>>>
>>> It isn't only exec that can change credentials.  Do we need a lock for
>>> changing credentials?
> [...]
>>> If we need a lock around credential change let's design and build that.
>>> Having a mismatch between what a lock is designed to do, and what
>>> people use it for can only result in other bugs as people get confused.
>>
>> Hmm... what benefits do we get from making it a separate lock? I guess
>> it would allow us to make it a per-task lock instead of a
>> signal_struct-wide one? That might be helpful...
> 
> But actually, isn't the core purpose of the cred_guard_mutex to guard
> against concurrent credential changes anyway? That's what almost
> everyone uses it for, and it's in the name...
> 

The main raison d'être of exec_update_mutex is to get a consistent
view of task->mm and task credentials.

The reason why you want the cred_guard_mutex is that some action
is changing the resulting credentials that the execve is about
to install, and that is the data flow in the opposite direction.


Bernd.


* Re: [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
  2020-03-08 21:36                                                 ` [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately Eric W. Biederman
  2020-03-09 19:28                                                   ` Bernd Edlinger
@ 2020-03-10 20:29                                                   ` Kees Cook
  2020-03-10 20:34                                                     ` Bernd Edlinger
  2020-03-10 21:21                                                   ` Christian Brauner
  2 siblings, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
> 
> This makes the code clearer and makes it easier to implement a mutex
> that is not taken over any locations that may block indefinitely waiting
> for userspace.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 39 ++++++++++++++++++++++++++-------------
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c3f34791f2f0..ff74b9a74d34 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
>  	flush_itimer_signals();
>  #endif

Semi-related (existing behavior): in de_thread(), what keeps the thread
group from changing? i.e.:

        if (thread_group_empty(tsk))
                goto no_thread_group;

        /*
         * Kill all other threads in the thread group.
         */
        spin_lock_irq(lock);
	... kill other threads under lock ...

Why is the thread_group_empty() test not under lock?

>  
> +	BUG_ON(!thread_group_leader(tsk));
> +	return 0;
> +
> +killed:
> +	/* protects against exit_notify() and __exit_signal() */

I wonder if include/linux/sched/task.h's definition of tasklist_lock
should explicitly gain a note about group_exit_task and notify_count,
or, alternatively, signal.h's section on these fields should gain a
comment? tasklist_lock is unmentioned in signal.h... :(

> +	read_lock(&tasklist_lock);
> +	sig->group_exit_task = NULL;
> +	sig->notify_count = 0;
> +	read_unlock(&tasklist_lock);
> +	return -EAGAIN;
> +}
> +
> +
> +static int unshare_sighand(struct task_struct *me)
> +{
> +	struct sighand_struct *oldsighand = me->sighand;
> +
>  	if (refcount_read(&oldsighand->count) != 1) {
>  		struct sighand_struct *newsighand;
>  		/*
> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>  
>  		write_lock_irq(&tasklist_lock);
>  		spin_lock(&oldsighand->siglock);
> -		rcu_assign_pointer(tsk->sighand, newsighand);
> +		rcu_assign_pointer(me->sighand, newsighand);
>  		spin_unlock(&oldsighand->siglock);
>  		write_unlock_irq(&tasklist_lock);
>  
>  		__cleanup_sighand(oldsighand);
>  	}
> -
> -	BUG_ON(!thread_group_leader(tsk));
>  	return 0;
> -
> -killed:
> -	/* protects against exit_notify() and __exit_signal() */
> -	read_lock(&tasklist_lock);
> -	sig->group_exit_task = NULL;
> -	sig->notify_count = 0;
> -	read_unlock(&tasklist_lock);
> -	return -EAGAIN;
>  }
>  
>  char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
> @@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	int retval;
>  
>  	/*
> -	 * Make sure we have a private signal table and that
> -	 * we are unassociated from the previous thread group.
> +	 * Make this the only thread in the thread group.
>  	 */
>  	retval = de_thread(me);
>  	if (retval)
>  		goto out;
>  
> +	/*
> +	 * Make the signal table private.
> +	 */
> +	retval = unshare_sighand(me);
> +	if (retval)
> +		goto out;
> +
>  	/*
>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>  	 * not visibile until then. This also enables the update
> -- 
> 2.25.0

Otherwise, yes, sensible separation.

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
  2020-03-09 19:30                                                   ` Bernd Edlinger
  2020-03-09 19:59                                                   ` Christian Brauner
@ 2020-03-10 20:31                                                   ` Kees Cook
  2020-03-10 20:57                                                   ` Jann Horn
  2020-03-10 21:22                                                   ` Christian Brauner
  4 siblings, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> 
> These functions have very little to do with de_thread; move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>  	/* we have changed execution domain */
>  	tsk->exit_signal = SIGCHLD;
>  
> -#ifdef CONFIG_POSIX_TIMERS
> -	exit_itimers(sig);
> -	flush_itimer_signals();
> -#endif
> -
>  	BUG_ON(!thread_group_leader(tsk));
>  	return 0;
>  
> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> +#ifdef CONFIG_POSIX_TIMERS
> +	exit_itimers(me->signal);
> +	flush_itimer_signals();
> +#endif
> +

I twitch at seeing #ifdefs in .c instead of hidden in the .h declarations
of these two functions, but as this is a copy/paste, I'll live. ;)
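
For illustration only, the usual shape of that header idiom is
sketched below.  This is a standalone approximation, not the actual
kernel headers: struct signal_struct is reduced to a hypothetical
one-field stand-in, the function bodies are placeholders, and
flush_timers_on_exec() is an invented caller.

```c
/* Hypothetical stand-in for the kernel's struct signal_struct. */
struct signal_struct {
	int nr_timers;
};

/*
 * "Header" part: the config test lives here, once.  Build with
 * -DCONFIG_POSIX_TIMERS to select the first branch.
 */
#ifdef CONFIG_POSIX_TIMERS
static inline void exit_itimers(struct signal_struct *sig)
{
	sig->nr_timers = 0;	/* placeholder for the real cleanup */
}
static inline void flush_itimer_signals(void)
{
	/* placeholder for the real signal flushing */
}
#else
/* Empty stubs: callers compile unchanged and the calls vanish. */
static inline void exit_itimers(struct signal_struct *sig)
{
	(void)sig;
}
static inline void flush_itimer_signals(void)
{
}
#endif

/* ".c" part: no #ifdef needed at the call site. */
static int flush_timers_on_exec(struct signal_struct *sig)
{
	exit_itimers(sig);
	flush_itimer_signals();
	return sig->nr_timers;
}
```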

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

>  	/*
>  	 * Make the signal table private.
>  	 */
> -- 
> 2.25.0
> 

-- 
Kees Cook


* Re: [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
  2020-03-10 20:29                                                   ` Kees Cook
@ 2020-03-10 20:34                                                     ` Bernd Edlinger
  2020-03-10 20:57                                                       ` Kees Cook
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-10 20:34 UTC (permalink / raw)
  To: Kees Cook, Eric W. Biederman
  Cc: Christian Brauner, Jann Horn, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/10/20 9:29 PM, Kees Cook wrote:
> On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
>>
>> This makes the code clearer and makes it easier to implement a mutex
>> that is not taken over any locations that may block indefinitely waiting
>> for userspace.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c | 39 ++++++++++++++++++++++++++-------------
>>  1 file changed, 26 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index c3f34791f2f0..ff74b9a74d34 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
>>  	flush_itimer_signals();
>>  #endif
> 
> Semi-related (existing behavior): in de_thread(), what keeps the thread
> group from changing? i.e.:
> 
>         if (thread_group_empty(tsk))
>                 goto no_thread_group;
> 
>         /*
>          * Kill all other threads in the thread group.
>          */
>         spin_lock_irq(lock);
> 	... kill other threads under lock ...
> 
> Why is the thread_group_empty() test not under lock?
> 

A new thread cannot be created when only one thread is executing,
right?

>>  
>> +	BUG_ON(!thread_group_leader(tsk));
>> +	return 0;
>> +
>> +killed:
>> +	/* protects against exit_notify() and __exit_signal() */
> 
> I wonder if include/linux/sched/task.h's definition of tasklist_lock
> should explicitly gain a note about group_exit_task and notify_count,
> or, alternatively, signal.h's section on these fields should gain a
> comment? tasklist_lock is unmentioned in signal.h... :(
> 
>> +	read_lock(&tasklist_lock);
>> +	sig->group_exit_task = NULL;
>> +	sig->notify_count = 0;
>> +	read_unlock(&tasklist_lock);
>> +	return -EAGAIN;
>> +}
>> +
>> +
>> +static int unshare_sighand(struct task_struct *me)
>> +{
>> +	struct sighand_struct *oldsighand = me->sighand;
>> +
>>  	if (refcount_read(&oldsighand->count) != 1) {
>>  		struct sighand_struct *newsighand;
>>  		/*
>> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>>  
>>  		write_lock_irq(&tasklist_lock);
>>  		spin_lock(&oldsighand->siglock);
>> -		rcu_assign_pointer(tsk->sighand, newsighand);
>> +		rcu_assign_pointer(me->sighand, newsighand);
>>  		spin_unlock(&oldsighand->siglock);
>>  		write_unlock_irq(&tasklist_lock);
>>  
>>  		__cleanup_sighand(oldsighand);
>>  	}
>> -
>> -	BUG_ON(!thread_group_leader(tsk));
>>  	return 0;
>> -
>> -killed:
>> -	/* protects against exit_notify() and __exit_signal() */
>> -	read_lock(&tasklist_lock);
>> -	sig->group_exit_task = NULL;
>> -	sig->notify_count = 0;
>> -	read_unlock(&tasklist_lock);
>> -	return -EAGAIN;
>>  }
>>  
>>  char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
>> @@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	int retval;
>>  
>>  	/*
>> -	 * Make sure we have a private signal table and that
>> -	 * we are unassociated from the previous thread group.
>> +	 * Make this the only thread in the thread group.
>>  	 */
>>  	retval = de_thread(me);
>>  	if (retval)
>>  		goto out;
>>  
>> +	/*
>> +	 * Make the signal table private.
>> +	 */
>> +	retval = unshare_sighand(me);
>> +	if (retval)
>> +		goto out;
>> +
>>  	/*
>>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>>  	 * not visibile until then. This also enables the update
>> -- 
>> 2.25.0
> 
> Otherwise, yes, sensible separation.
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> 


* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-08 21:38                                                 ` [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec Eric W. Biederman
  2020-03-09 19:34                                                   ` Bernd Edlinger
@ 2020-03-10 20:44                                                   ` Kees Cook
  2020-03-10 21:20                                                     ` Eric W. Biederman
  2020-03-10 20:47                                                   ` Kees Cook
  2 siblings, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
> 
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in any way,
> so this should be safe.
> 
> This rearrangement of code has two significant benefits.  It makes
> the determination of passing the point of no return by testing bprm->mm
> accurate.  All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.

Agreed. Though I see a use of "current", which maybe you want to
parameterize to a "me" argument in acct_arg_size(). (Though looking at
the callers, perhaps there is no benefit?)

> 
> Further, this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec: the possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
> 
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held over possible indefinite userspace
> waits, which will allow removing deadlock scenarios from the kernel.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 215d86f77b63..d820a7272a76 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> -#ifdef CONFIG_POSIX_TIMERS
> -	exit_itimers(me->signal);
> -	flush_itimer_signals();
> -#endif

I think this comment:

/*
 * This is called by do_exit or de_thread, only when there are no more
 * references to the shared signal_struct.
 */
void exit_itimers(struct signal_struct *sig)

Refers to there being other threads, yes? Not that the signal table is
private yet?

> -
> -	/*
> -	 * Make the signal table private.
> -	 */
> -	retval = unshare_sighand(me);
> -	if (retval)
> -		goto out;
> -
>  	/*
>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>  	 * not visibile until then. This also enables the update
> @@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 */
>  	bprm->mm = NULL;
>  
> +#ifdef CONFIG_POSIX_TIMERS
> +	exit_itimers(me->signal);
> +	flush_itimer_signals();
> +#endif

I've mostly convinced myself that there are no "side-effects" from having
these timers expire as the mm is going away. I think some kind of comment
of that intent should be explicitly stated here above the timer work.

Beyond that:

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> +
> +	/*
> +	 * Make the signal table private.
> +	 */
> +	retval = unshare_sighand(me);
> +	if (retval)
> +		goto out;
> +
>  	set_fs(USER_DS);
>  	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
> -- 
> 2.25.0
> 

-- 
Kees Cook


* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-08 21:38                                                 ` [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec Eric W. Biederman
  2020-03-09 19:34                                                   ` Bernd Edlinger
  2020-03-10 20:44                                                   ` Kees Cook
@ 2020-03-10 20:47                                                   ` Kees Cook
  2020-03-10 21:09                                                     ` Eric W. Biederman
  2 siblings, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
> Further, this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec: the possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.

I forgot to mention, just as a point of clarity, there are lots of
other page faults possible, but they're _before_ flush_old_exec()
(i.e. all the copy_strings() calls). Is it worth clarifying this to
"before or at the top of flush_old_exec()" or do you mean something
else? (And as always: perhaps expand flush_old_exec()'s comment to
describe the newly intended state.)

-- 
Kees Cook


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-09 19:02                                                                 ` Eric W. Biederman
  2020-03-09 19:24                                                                   ` Bernd Edlinger
  2020-03-09 19:33                                                                   ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Dmitry V. Levin
@ 2020-03-10 20:55                                                                   ` Kees Cook
  2020-03-10 21:02                                                                     ` Eric W. Biederman
  2 siblings, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>     exec: Add exec_update_mutex to replace cred_guard_mutex
>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possible indefinite waits for
>     userspace that I have identified are: The cred_guard_mutex is held in
>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>     cred_guard_mutex is held over "get_user(futex_offset, ...)" in
>     exit_robust_list.  The cred_guard_mutex is held over copy_strings.

I suspect you're not trying to make a comprehensive list here, but do
you want to mention seccomp too (since it's yet another weird case)?

> [...]
>     Holding a mutex over any of those possibly indefinite waits for
>     userspace does not appear necessary.  Add exec_update_mutex that will
>     just cover updating the process during exec where the permissions and
>     the objects pointed to by the task struct may be out of sync.

Should the specific resources be pointed out here? creds, mm, ... ?

But otherwise, yup, looks sane:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 20:10                                                                     ` Jann Horn
  2020-03-10 20:22                                                                       ` Bernd Edlinger
@ 2020-03-10 20:57                                                                       ` Eric W. Biederman
  2020-03-10 21:29                                                                         ` Christian Brauner
  2020-03-11 18:49                                                                         ` Kees Cook
  1 sibling, 2 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 20:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Bernd Edlinger, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

Jann Horn <jannh@google.com> writes:

> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <jannh@google.com> wrote:
>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> > Jann Horn <jannh@google.com> writes:
>> > > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> > >> During exec some file descriptors are closed and the files struct is
>> > >> unshared.  But all of that can happen at other times and it has the
>> > >> same protections during exec as at ordinary times.  So stop taking the
>> > >> cred_guard_mutex as it is useless.
>> > >>
>> > >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>> > >> prone, as it is held in several places while waiting possibly indefinitely
>> > >> for userspace to do something.
> [...]
>> > > If you make this change, then if this races with execution of a setuid
>> > > program that afterwards e.g. opens a unix domain socket, an attacker
>> > > will be able to steal that socket and inject messages into
>> > > communication with things like DBus. procfs currently has the same
>> > > race, and that still needs to be fixed, but at least procfs doesn't
>> > > let you open things like sockets because they don't have a working
>> > > ->open handler, and it enforces the normal permission check for
>> > > opening files.
>> >
>> > It isn't only exec that can change credentials.  Do we need a lock for
>> > changing credentials?
> [...]
>> > If we need a lock around credential change let's design and build that.
>> > Having a mismatch between what a lock is designed to do, and what
>> > people use it for can only result in other bugs as people get confused.
>>
>> Hmm... what benefits do we get from making it a separate lock? I guess
>> it would allow us to make it a per-task lock instead of a
>> signal_struct-wide one? That might be helpful...
>
> But actually, isn't the core purpose of the cred_guard_mutex to guard
> against concurrent credential changes anyway? That's what almost
> everyone uses it for, and it's in the name...

Having been through all of the users: nope.

Maybe someone tried to repurpose it for that.  I haven't traced through
when it was renamed from cred_exec_mutex to cred_guard_mutex.

The original purpose was to make exec and ptrace deadlock.  But it
was seen as being there to allow safely calculating the new credentials
before the point of no return, because whether a process is ptraced or
not affects the new credential calculations.  Unfortunately offering that
guarantee fundamentally leads to deadlock.

So ptrace_attach and seccomp use the cred_guard_mutex to guarantee
a deadlock.

The common use is to take cred_guard_mutex to guard the window when
credentials and process details are out of sync in exec.  But there
is at least do_io_accounting, which seems to have the same justification
for holding it as __pidfd_fget.

With effort I suspect we can replace exec_change_mutex with task_lock.
When we are guaranteed to be single threaded placing exec_change_mutex
in signal_struct doesn't really help us (except maybe in some races?).

The deep problem is no one really understands cred_guard_mutex so it is
a mess.  Code with poorly defined semantics is always wrong somewhere
for someone.  Which is part of why I am attacking this and having the
conversations to make certain I understand what is going on.

I see your point about commit_creds making a process undumpable.  So in
practice it really is only exec that changes creds in a way that
ptrace_may_access will allow the process to be inspected.

So I guess for now the practical non-regressing course is to change
everything to my exec_change_mutex, removing the deadlock.  Then we
figure out how to cleanly deal with the races inspecting a process with
changing credentials has.
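
A minimal userspace sketch of the ordering being aimed at, assuming
pthreads and invented names (struct proc_sketch, probe_during_wait(),
exec_sketch()); it is an analogy, not the kernel implementation.  The
possibly indefinite wait runs before the mutex is taken, so an
observer can still acquire the mutex while exec is waiting:

```c
#include <pthread.h>

/* Hypothetical stand-in for the state that exec swaps out. */
struct proc_sketch {
	pthread_mutex_t update_mutex;	/* plays the role of the new mutex */
	int mm_id;
	int cred_id;
};

static void proc_init(struct proc_sketch *p)
{
	pthread_mutex_init(&p->update_mutex, NULL);
	p->mm_id = 1;
	p->cred_id = 1;
}

/*
 * Stand-in for a tracer that needs the mutex while exec is still
 * waiting on userspace.  Returns 1 if the mutex was free, 0 if not.
 */
static int probe_during_wait(struct proc_sketch *p)
{
	if (pthread_mutex_trylock(&p->update_mutex) != 0)
		return 0;
	pthread_mutex_unlock(&p->update_mutex);
	return 1;
}

static int exec_sketch(struct proc_sketch *p)
{
	/* Phase 1: anything that may wait on userspace, no mutex held. */
	int observer_ok = probe_during_wait(p);

	/* Phase 2: short, non-blocking update with the mutex held. */
	pthread_mutex_lock(&p->update_mutex);
	p->mm_id = 2;
	p->cred_id = 2;
	pthread_mutex_unlock(&p->update_mutex);

	return observer_ok;
}
```

If the lock in exec_sketch() were moved above phase 1, the probe would
fail its trylock, which is the userspace analogue of the tracer
blocking in the traces at the start of this thread.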

Eric


* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
                                                                     ` (2 preceding siblings ...)
  2020-03-10 20:31                                                   ` Kees Cook
@ 2020-03-10 20:57                                                   ` Jann Horn
  2020-03-10 21:05                                                     ` Eric W. Biederman
  2020-03-10 21:22                                                   ` Christian Brauner
  4 siblings, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-10 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 8, 2020 at 10:39 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> These functions have very little to do with de_thread; move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>         /* we have changed execution domain */
>         tsk->exit_signal = SIGCHLD;
>
> -#ifdef CONFIG_POSIX_TIMERS
> -       exit_itimers(sig);
> -       flush_itimer_signals();
> -#endif
> -
>         BUG_ON(!thread_group_leader(tsk));
>         return 0;
>
> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
>         if (retval)
>                 goto out;
>
> +#ifdef CONFIG_POSIX_TIMERS
> +       exit_itimers(me->signal);
> +       flush_itimer_signals();
> +#endif

nit: exit_itimers() has a comment referring to de_thread, that should
probably be updated


* Re: [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
  2020-03-10 20:34                                                     ` Bernd Edlinger
@ 2020-03-10 20:57                                                       ` Kees Cook
  0 siblings, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-10 20:57 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 09:34:03PM +0100, Bernd Edlinger wrote:
> On 3/10/20 9:29 PM, Kees Cook wrote:
> > On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
> >>
> >> This makes the code clearer and makes it easier to implement a mutex
> >> that is not taken over any locations that may block indefinitely waiting
> >> for userspace.
> >>
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  fs/exec.c | 39 ++++++++++++++++++++++++++-------------
> >>  1 file changed, 26 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index c3f34791f2f0..ff74b9a74d34 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
> >>  	flush_itimer_signals();
> >>  #endif
> > 
> > Semi-related (existing behavior): in de_thread(), what keeps the thread
> > group from changing? i.e.:
> > 
> >         if (thread_group_empty(tsk))
> >                 goto no_thread_group;
> > 
> >         /*
> >          * Kill all other threads in the thread group.
> >          */
> >         spin_lock_irq(lock);
> > 	... kill other threads under lock ...
> > 
> > > Why is the thread_group_empty() test not under lock?
> > 
> 
> A new thread cannot be created when only one thread is executing,
> right?

*face palm* Yes, of course. :) I'm thinking too hard.

-- 
Kees Cook


* Re: [PATCH 1/4] exec: Fix a deadlock in ptrace
  2020-03-10 13:43                                                                       ` [PATCH 1/4] exec: Fix a deadlock in ptrace Bernd Edlinger
  2020-03-10 15:13                                                                         ` Eric W. Biederman
@ 2020-03-10 21:00                                                                         ` Kees Cook
  1 sibling, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-10 21:00 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 02:43:41PM +0100, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread is running.
> 
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but strace is no
> longer able to respond, since it is blocked in mm_access.
> 
> The deadlock always happens when strace needs to access the
> tracee's process mmap while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> This changes mm_access to use the new exec_update_mutex
> instead of cred_guard_mutex.
> 
> This patch is based on the following patch by Eric W. Biederman:
> "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
> Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

Cool, yes, on top of the new infrastructure this looks correct to me --
the new mutex wraps mm changes and mm_access() is looking at *drum roll*
the mm! :)
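
For anyone following along, here is a rough toy model of why the
narrower mutex helps.  It is only a sketch: toy_mutex, toy_signal and
the phase names are made up for illustration, and the lock state per
phase is an assumption taken from the description in this thread
(cred_guard_mutex held across the whole exec, exec_update_mutex held
only around the mm update).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a kernel mutex: only tracks whether it is held. */
struct toy_mutex { bool held; };

/* Hypothetical model of the two mutexes living in signal_struct. */
struct toy_signal {
	struct toy_mutex cred_guard_mutex;  /* held for the whole exec */
	struct toy_mutex exec_update_mutex; /* held only around the mm update */
};

enum toy_exec_phase {
	TOY_EXEC_IDLE,        /* no exec in progress */
	TOY_EXEC_WAIT_TRACER, /* de_thread(): waiting on PTRACE_EVENT_EXIT */
	TOY_EXEC_UPDATE_MM,   /* exec_mmap(): swapping in the new mm */
};

/* Set the lock state this model assumes for each phase of exec. */
static void toy_enter_phase(struct toy_signal *sig, enum toy_exec_phase phase)
{
	assert(sig);
	sig->cred_guard_mutex.held = (phase != TOY_EXEC_IDLE);
	sig->exec_update_mutex.held = (phase == TOY_EXEC_UPDATE_MM);
}

/* Would an mm_access()-like caller in the tracer have to sleep here? */
static bool toy_mm_access_blocks(const struct toy_signal *sig,
				 bool use_exec_update_mutex)
{
	const struct toy_mutex *m = use_exec_update_mutex ?
		&sig->exec_update_mutex : &sig->cred_guard_mutex;

	return m->held;
}
```

In the TOY_EXEC_WAIT_TRACER phase, toy_mm_access_blocks(sig, false)
is true (the strace hang from the report), while
toy_mm_access_blocks(sig, true) is false, so the tracer can make
progress and handle the exit event.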

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  kernel/fork.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c12595a..5720ff3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	struct mm_struct *mm;
>  	int err;
>  
> -	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	err =  mutex_lock_killable(&task->signal->exec_update_mutex);
>  	if (err)
>  		return ERR_PTR(err);
>  
> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  		mmput(mm);
>  		mm = ERR_PTR(-EACCES);
>  	}
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->exec_update_mutex);
>  
>  	return mm;
>  }
> -- 
> 1.9.1

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-10 20:55                                                                   ` Kees Cook
@ 2020-03-10 21:02                                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 21:02 UTC (permalink / raw)
  To: Kees Cook
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Kees Cook <keescook@chromium.org> writes:

> On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>>     exec: Add exec_update_mutex to replace cred_guard_mutex
>>     
>>     The cred_guard_mutex is problematic as it is held over possibly
>>     indefinite waits for userspace.  The possible indefinite waits for
>>     userspace that I have identified are: The cred_guard_mutex is held in
>>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>>     cred_guard_mutex is held over "get_user(futex_offset, ...)" in
>>     exit_robust_list.  The cred_guard_mutex is held over copy_strings.
>
> I suspect you're not trying to make a comprehensive list here, but do
> you want to mention seccomp too (since it's yet another weird case).

I was calling out all of the places I have found so far where
cred_guard_mutex is held over waiting for userspace to maybe do
something.  Those places are what cause our deadlocks.

>> [...]
>>     Holding a mutex over any of those possibly indefinite waits for
>>     userspace does not appear necessary.  Add exec_update_mutex that will
>>     just cover updating the process during exec where the permissions and
>>     the objects pointed to by the task struct may be out of sync.
>
> Should the specific resources be pointed out here? creds, mm, ... ?
>
> But otherwise, yup, looks sane:

Probably not.  The design is that if exec changes something, we hold the
exec_update_mutex over it, so things are semi-atomic.

> Reviewed-by: Kees Cook <keescook@chromium.org>

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-10 20:57                                                   ` Jann Horn
@ 2020-03-10 21:05                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 21:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Jann Horn <jannh@google.com> writes:

> On Sun, Mar 8, 2020 at 10:39 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> These functions have very little to do with de_thread, so move them out
>> of de_thread and into flush_old_exec proper so it can be more clearly
>> seen what flush_old_exec is doing.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index ff74b9a74d34..215d86f77b63 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>>         /* we have changed execution domain */
>>         tsk->exit_signal = SIGCHLD;
>>
>> -#ifdef CONFIG_POSIX_TIMERS
>> -       exit_itimers(sig);
>> -       flush_itimer_signals();
>> -#endif
>> -
>>         BUG_ON(!thread_group_leader(tsk));
>>         return 0;
>>
>> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
>>         if (retval)
>>                 goto out;
>>
>> +#ifdef CONFIG_POSIX_TIMERS
>> +       exit_itimers(me->signal);
>> +       flush_itimer_signals();
>> +#endif
>
> nit: exit_itimers() has a comment referring to de_thread, that should
> probably be updated

Good point.

Eric


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-10 20:47                                                   ` Kees Cook
@ 2020-03-10 21:09                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 21:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Kees Cook <keescook@chromium.org> writes:

> On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>> Further, this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec.  The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>
> I forgot to mention, just as a point of clarity, there are lots of
> other page faults possible, but they're _before_ flush_old_exec()
> (i.e. all the copy_strings() calls). Is it worth clarifying this to
> "before or at the top of flush_old_exec()" or do you mean something
> else? (And as always: perhaps expand flush_old_exec()'s comment to
> describe the newly intended state.)

Yes.  Before or at the start of flush_old_exec where the mutex
is taken.  That is the point.  I will see if I can come up with
an appropriate comment.

Eric




^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
  2020-03-08 21:35                                                 ` [PATCH v2 1/5] exec: Only compute current once in flush_old_exec Eric W. Biederman
  2020-03-09 13:56                                                   ` Bernd Edlinger
  2020-03-10 20:17                                                   ` Kees Cook
@ 2020-03-10 21:12                                                   ` Christian Brauner
  2 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-10 21:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:35:26PM -0500, Eric W. Biederman wrote:
> 
> Make it clear that current only needs to be computed once in
> flush_old_exec.  This may have some efficiency improvements and it
> makes the code easier to change.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
  2020-03-10 20:44                                                   ` Kees Cook
@ 2020-03-10 21:20                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 21:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: Bernd Edlinger, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Kees Cook <keescook@chromium.org> writes:

> On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>> 
>> I have read through the code in exec_mmap and I do not see anything
>> that depends on sighand or the sighand lock, or on signals in any way
>> so this should be safe.
>> 
>> This rearrangement of code has two significant benefits.  It makes
>> the determination of passing the point of no return by testing bprm->mm
>> accurate.  All failures prior to that point in flush_old_exec are
>> either truly recoverable or they are fatal.
>
> Agreed. Though I see a use of "current", which maybe you want to
> parameterize to a "me" argument in acct_arg_size(). (Though looking at
> the callers, perhaps there is no benefit?)

My testing suggests there is a small benefit on x86.

The code is just "#define current get_current()"
and get_current() resolves into a read of "%gs:current_task".

Looking at the generated code, I find that gcc can sometimes optimize
a read away when the reads are close together in the source code.
But in general gcc does not manage to optimize the extra reads of
"%gs:current_task" away.

So I think things are much much better than they used to be,
code generation wise.  But it still helps to cache current
in a local variable.
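
A small userspace sketch of the difference, for illustration only.
The toy_* names are invented, and the counter merely stands in for a
read of "%gs:current_task"; it models the worst case where the
compiler merges none of the reads.

```c
#include <assert.h>

struct toy_task { int pid; };

static struct toy_task toy_current_task = { .pid = 42 };
static unsigned int toy_current_reads;	/* counts simulated reads */

/* Stand-in for get_current(): every call counts as one read. */
static struct toy_task *toy_get_current(void)
{
	toy_current_reads++;
	return &toy_current_task;
}
#define toy_current toy_get_current()

/* Uses the macro at every site: three reads. */
static int toy_sum_uncached(void)
{
	return toy_current->pid + toy_current->pid + toy_current->pid;
}

/* Caches the pointer in a local, like "me" in flush_old_exec: one read. */
static int toy_sum_cached(void)
{
	struct toy_task *me = toy_current;

	assert(me);
	return me->pid + me->pid + me->pid;
}
```

Both functions return the same value; only the number of simulated
reads differs (three versus one).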

>> Further, this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec.  The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>> 
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held over possible indefinite userspace
>> waits.  Which will allow removing deadlock scenarios from the kernel.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c | 24 ++++++++++++------------
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 215d86f77b63..d820a7272a76 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	if (retval)
>>  		goto out;
>>  
>> -#ifdef CONFIG_POSIX_TIMERS
>> -	exit_itimers(me->signal);
>> -	flush_itimer_signals();
>> -#endif
>
> I think this comment:
>
> /*
>  * This is called by do_exit or de_thread, only when there are no more
>  * references to the shared signal_struct.
>  */
> void exit_itimers(struct signal_struct *sig)
>
> Refers to there being other threads, yes? Not that the signal table is
> private yet?

The signal table is in sighand_struct.

So yes that refers to the other threads being gone.


>> -
>> -	/*
>> -	 * Make the signal table private.
>> -	 */
>> -	retval = unshare_sighand(me);
>> -	if (retval)
>> -		goto out;
>> -
>>  	/*
>>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>>  	 * not visibile until then. This also enables the update
>> @@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	 */
>>  	bprm->mm = NULL;
>>  
>> +#ifdef CONFIG_POSIX_TIMERS
>> +	exit_itimers(me->signal);
>> +	flush_itimer_signals();
>> +#endif
>
> I've mostly convinced myself that there are no "side-effects" from having
> these timers expire as the mm is going away. I think some kind of comment
> of that intent should be explicitly stated here above the timer work.

The timers can at most generate signals.  And we are not handling
signals in the middle of exec.

So the only possible interaction would be to set a timeout and then try
exec, and have the timer kill the caller.

Maybe we get a killable signal from a scenario like that and maybe this
changes the time before the timer expires into the dangerous zone.
But that is all I can think of.

We have to return to the edge of userspace before any signals are
delivered.
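
As a loose userspace analogy (not kernel code), the gap between a
signal being generated and being delivered can be seen by blocking
the signal.  This is only a sketch: the toy_* names are invented, and
SIGUSR1 plus raise() stand in for a timer firing in the middle of
exec.  It assumes SIGUSR1 is not already blocked by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

static volatile sig_atomic_t toy_handled;

static void toy_handler(int signo)
{
	(void)signo;
	toy_handled = 1;
}

/*
 * Generate SIGUSR1 while it is blocked and report whether it was
 * queued (pending) without the handler having run.  The handler only
 * runs once the old mask is restored, the "delivery point" here.
 */
static bool toy_signal_queued_not_delivered(void)
{
	struct sigaction sa = { .sa_handler = toy_handler };
	sigset_t block, old, pending;
	bool queued;
	int rc;

	sigemptyset(&sa.sa_mask);
	rc = sigaction(SIGUSR1, &sa, NULL);
	assert(rc == 0);
	(void)rc;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	sigprocmask(SIG_BLOCK, &block, &old);

	toy_handled = 0;
	raise(SIGUSR1);		/* "timer fires": signal is generated */
	sigpending(&pending);
	queued = sigismember(&pending, SIGUSR1) == 1 && !toy_handled;

	sigprocmask(SIG_SETMASK, &old, NULL);	/* handler runs here */
	return queued;
}
```

The function returns true when the signal was pending but unhandled
while blocked, and toy_handled becomes 1 only after the mask is
restored.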


> Beyond that:
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>
> -Kees
>
>> +
>> +	/*
>> +	 * Make the signal table private.
>> +	 */
>> +	retval = unshare_sighand(me);
>> +	if (retval)
>> +		goto out;
>> +
>>  	set_fs(USER_DS);
>>  	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
>> -- 
>> 2.25.0
>> 

Eric


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
  2020-03-08 21:36                                                 ` [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately Eric W. Biederman
  2020-03-09 19:28                                                   ` Bernd Edlinger
  2020-03-10 20:29                                                   ` Kees Cook
@ 2020-03-10 21:21                                                   ` Christian Brauner
  2 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-10 21:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
> 
> This makes the code clearer and makes it easier to implement a mutex
> that is not taken over any locations that may block indefinitely waiting
> for userspace.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 39 ++++++++++++++++++++++++++-------------
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c3f34791f2f0..ff74b9a74d34 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
>  	flush_itimer_signals();
>  #endif
>  
> +	BUG_ON(!thread_group_leader(tsk));
> +	return 0;
> +
> +killed:
> +	/* protects against exit_notify() and __exit_signal() */
> +	read_lock(&tasklist_lock);
> +	sig->group_exit_task = NULL;
> +	sig->notify_count = 0;
> +	read_unlock(&tasklist_lock);
> +	return -EAGAIN;
> +}
> +
> +
> +static int unshare_sighand(struct task_struct *me)
> +{
> +	struct sighand_struct *oldsighand = me->sighand;
> +
>  	if (refcount_read(&oldsighand->count) != 1) {
>  		struct sighand_struct *newsighand;
>  		/*
> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>  
>  		write_lock_irq(&tasklist_lock);
>  		spin_lock(&oldsighand->siglock);
> -		rcu_assign_pointer(tsk->sighand, newsighand);
> +		rcu_assign_pointer(me->sighand, newsighand);
>  		spin_unlock(&oldsighand->siglock);
>  		write_unlock_irq(&tasklist_lock);
>  
>  		__cleanup_sighand(oldsighand);
>  	}

This is fine for now, but we share an awful lot of code with
copy_sighand(). We should earmark this to look into consolidating the
core operations into a common helper called from both copy_sighand() and
unshare_sighand(), maybe even reducing it to a single helper. But that
is not needed for now.

Otherwise:
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-08 21:38                                                 ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
  2020-03-09 13:45                                                   ` Bernd Edlinger
@ 2020-03-10 21:21                                                   ` Jann Horn
  2020-03-10 21:30                                                     ` Eric W. Biederman
  2020-03-11 13:18                                                   ` Qian Cai
  2020-03-12 10:27                                                   ` Kirill Tkhai
  3 siblings, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-10 21:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over possibly indefinite waits for userspace.
>
> Add exec_update_mutex that is only held while exec updates the process
> with the new contents of exec, for use by code that must not be
> confused by exec changing the mm and the cred in ways that cannot
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one.  This lets us move forward while still
> being careful and not introducing any regressions.
[...]
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>                         return -EINTR;
>                 }
>         }
> +
> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> +       if (ret)
> +               return ret;

We're already holding the old mmap_sem, and now nest the
exec_update_mutex inside it; but then while still holding the
exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
at least lockdep will be unhappy, and I'm not sure whether it's an
actual problem or not.
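
To make the ordering concern concrete, here is a toy lock-order
checker.  It is a sketch only, not how lockdep is implemented: the
toy_* names and the two lock classes are invented, it tracks classes
rather than instances, and it only catches a direct A->B / B->A
inversion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes for the two locks discussed above. */
enum toy_lock_class { TOY_MMAP_SEM, TOY_EXEC_UPDATE_MUTEX, TOY_NR_CLASSES };

struct toy_lockdep {
	bool held[TOY_NR_CLASSES];	/* classes currently held */
	/* before[a][b]: class a has been seen held while taking b */
	bool before[TOY_NR_CLASSES][TOY_NR_CLASSES];
};

/*
 * Record taking class "c".  Returns false if some currently held
 * class "h" was earlier taken while "c" was held, i.e. the new
 * order h -> c inverts a recorded c -> h.
 */
static bool toy_acquire(struct toy_lockdep *ld, enum toy_lock_class c)
{
	bool ok = true;

	assert(c < TOY_NR_CLASSES);
	for (int h = 0; h < TOY_NR_CLASSES; h++) {
		if (!ld->held[h] || h == (int)c)
			continue;
		if (ld->before[c][h])
			ok = false;		/* inversion detected */
		ld->before[h][c] = true;	/* remember h -> c */
	}
	ld->held[c] = true;
	return ok;
}

static void toy_release(struct toy_lockdep *ld, enum toy_lock_class c)
{
	ld->held[c] = false;
}
```

Replaying the sequence described above: take TOY_MMAP_SEM, take
TOY_EXEC_UPDATE_MUTEX inside it, release TOY_MMAP_SEM (assuming the
old mmap_sem is dropped before mmput()), then take TOY_MMAP_SEM again
as __ksm_exit() would.  The last toy_acquire() returns false, which
is the kind of report I would expect from lockdep.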

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
  2020-03-08 21:36                                                 ` [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread Eric W. Biederman
                                                                     ` (3 preceding siblings ...)
  2020-03-10 20:57                                                   ` Jann Horn
@ 2020-03-10 21:22                                                   ` Christian Brauner
  4 siblings, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-10 21:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> 
> These functions have very little to do with de_thread, so move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  2020-03-10 20:19                                                                               ` Bernd Edlinger
@ 2020-03-10 21:25                                                                                 ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 21:25 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/10/20 8:06 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> This changes do_io_accounting to use the new exec_update_mutex
>>> instead of cred_guard_mutex.
>>>
>>> This fixes possible deadlocks when the tracer is accessing
>>> /proc/$pid/io for instance.
>>>
>>> This should be safe, as the credentials are only used for reading.
>> 
>> This is an improvement.
>> 
>> We probably want to do this just as an incremental step in making things
>> better.  But perhaps I am blind, as I am not finding the reason for
>> guarding this with the cred_guard_mutex at all persuasive.
>> 
>> I think moving the ptrace_may_access check down to after the
>> unlock_task_sighand would be just as effective at addressing the
>> concerns raised in the original commit.  I think the task_lock provides
>> all of the barrier we need to make it safe to move the
>> ptrace_may_access checks.
>> 
>> The reason I say this is I don't see exec changing ->ioac.  Just
>> performing some I/O which would update the io accounting statistics.
>> 
>
> Maybe the suid executable is starting up and doing io or not,
> and what the program does immediately at startup is a secret
> that we want to keep, but evil Eve wants to find it out.
> Eve is using /proc/alice/io to do that.
>
> It is a bit contrived, but seems like a security concern.
> When we hold the exec_update_mutex while collecting the data, we
> cannot see any io of the new process when the new credentials
> don't allow that.

Jann Horn has convinced me we should just convert these to the
exec_update_mutex today.  Because while not 100% correct in theory, the
only really interesting case is exec.  So the code does something
interesting and worthwhile, and mostly correct.  The last thing I want
to do is to cause an unnecessary regression.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 20:57                                                                       ` Eric W. Biederman
@ 2020-03-10 21:29                                                                         ` Christian Brauner
  2020-03-11 18:49                                                                         ` Kees Cook
  1 sibling, 0 replies; 203+ messages in thread
From: Christian Brauner @ 2020-03-10 21:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Bernd Edlinger, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On Tue, Mar 10, 2020 at 03:57:35PM -0500, Eric W. Biederman wrote:
> Jann Horn <jannh@google.com> writes:
> 
> > On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <jannh@google.com> wrote:
> >> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> > Jann Horn <jannh@google.com> writes:
> >> > > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> > >> During exec some file descriptors are closed and the files struct is
> >> > >> unshared.  But all of that can happen at other times and it has the
> >> > >> same protections during exec as at ordinary times.  So stop taking the
> >> > >> cred_guard_mutex as it is useless.
> >> > >>
> >> > >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> >> > >> prone, as it is held in several places while waiting possibly
> >> > >> indefinitely for userspace to do something.
> > [...]
> >> > > If you make this change, then if this races with execution of a setuid
> >> > > program that afterwards e.g. opens a unix domain socket, an attacker
> >> > > will be able to steal that socket and inject messages into
> >> > > communication with things like DBus. procfs currently has the same
> >> > > race, and that still needs to be fixed, but at least procfs doesn't
> >> > > let you open things like sockets because they don't have a working
> >> > > ->open handler, and it enforces the normal permission check for
> >> > > opening files.
> >> >
> >> > It isn't only exec that can change credentials.  Do we need a lock for
> >> > changing credentials?
> > [...]
> >> > If we need a lock around credential change let's design and build that.
> >> > Having a mismatch between what a lock is designed to do, and what
> >> > people use it for can only result in other bugs as people get confused.
> >>
> >> Hmm... what benefits do we get from making it a separate lock? I guess
> >> it would allow us to make it a per-task lock instead of a
> >> signal_struct-wide one? That might be helpful...
> >
> > But actually, isn't the core purpose of the cred_guard_mutex to guard
> > against concurrent credential changes anyway? That's what almost
> > everyone uses it for, and it's in the name...
> 
> Having been through all of the users nope.
> 
> Maybe someone tried to repurpose it for that.  I haven't traced through
> when it was renamed from cred_exec_mutex to cred_guard_mutex.
> 
> The original purpose was to make exec and ptrace deadlock.  But it
> was seen as being there to allow safely calculating the new credentials
> before the point of no return.  Because whether a process is ptraced or
> not affects the new credential calculations.  Unfortunately offering that
> guarantee fundamentally leads to deadlock.
> 
> So ptrace_attach and seccomp use the cred_guard_mutex to guarantee
> a deadlock.
> 
> The common use is to take cred_guard_mutex to guard the window when
> credentials and process details are out of sync in exec.  But there
> is at least do_io_accounting that seems to have the same justification
> for holding __pidfd_fget.
> 
> With effort I suspect we can replace exec_update_mutex with task_lock.
> When we are guaranteed to be single threaded placing exec_change_mutex
> in signal_struct doesn't really help us (except maybe in some races?).
> 
> The deep problem is no one really understands cred_guard_mutex so it is
> a mess.  Code with poorly defined semantics is always wrong somewhere

This is a good point. When discussing patches sensitive to credential
changes cred_guard_mutex was always introduced as having the purpose to
guard against concurrent credential changes. And I'm pretty sure that
that's how most people have been using it for quite a long time. I mean,
it's at least the case for seccomp and proc and probably quite a few
more. So the problem seems to me that it has clear _intended_ semantics
that run into issues in all sorts of cases. So if cred_guard_mutex is
not that then we seem to need to provide something that serves its
intended purpose.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-10 21:21                                                   ` Jann Horn
@ 2020-03-10 21:30                                                     ` Eric W. Biederman
  2020-03-10 23:21                                                       ` Jann Horn
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-10 21:30 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Jann Horn <jannh@google.com> writes:

> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>> threads are killed.  The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over possibly indefinite waits for userspace.
>>
>> Add exec_update_mutex that is only held over exec updating process
>> with the new contents of exec, for code that must not be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_update_mutex one by one.  This lets us move forward while still
>> being careful and not introducing any regressions.
> [...]
>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>                         return -EINTR;
>>                 }
>>         }
>> +
>> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> +       if (ret)
>> +               return ret;
>
> We're already holding the old mmap_sem, and now nest the
> exec_update_mutex inside it; but then while still holding the
> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
> at least lockdep will be unhappy, and I'm not sure whether it's an
> actual problem or not.

Good point.  I should double check the lock ordering here with mmap_sem.
It doesn't look like mmput takes mmap_sem, but still there might be a
lock inversion of some kind here.  At least as far as lockdep is
concerned and we don't want anything like that.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
  2020-03-10 13:44                                                                       ` [PATCH 2/4] selftests/ptrace: add test cases for dead-locks Bernd Edlinger
@ 2020-03-10 21:36                                                                         ` Kees Cook
  2020-03-10 22:41                                                                         ` Dmitry V. Levin
  1 sibling, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-10 21:36 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 02:44:01PM +0100, Bernd Edlinger wrote:
> This adds test cases for ptrace deadlocks.
> 
> Additionally fixes a compile problem in get_syscall_info.c,
> observed with gcc-4.8.4:
> 
> get_syscall_info.c: In function 'get_syscall_info':
> get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
>                                  allowed in C99 mode
>    for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
>    ^
> get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
>                                your code

*discomfort noises* (see below)

> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  tools/testing/selftests/ptrace/Makefile   |  4 +-
>  tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
>  2 files changed, 88 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
> 
> diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
> index c0b7f89..2f1f532 100644
> --- a/tools/testing/selftests/ptrace/Makefile
> +++ b/tools/testing/selftests/ptrace/Makefile
> @@ -1,6 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -iquote../../../../include/uapi -Wall
> +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall

This isn't the common solution in the kernel (the variable declaration
would just be lifted out of the loop), but as it's selftest code, which
does lots of special things ... I *guess* this is okay.

>  
> -TEST_GEN_PROGS := get_syscall_info peeksiginfo
> +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess

I love having this deadlock test added to the selftests.

I think I need to make an improvement to the test harness, though, as
the failure mode right now just blows up after the 30 second timeout
and leaves this deadlocked:

$ ./vmaccess
[==========] Running 2 tests from 1 test cases.
[ RUN      ] global.vmaccess
Alarm clock
$ ps
  PID TTY          TIME CMD
 2605 pts/0    00:00:00 bash
23360 pts/0    00:00:00 vmaccess
23361 pts/0    00:00:00 vmaccess
23363 pts/0    00:00:00 ps

But that's mostly unrelated to this code.

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> new file mode 100644
> index 0000000..4db327b
> --- /dev/null
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -0,0 +1,86 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
> + * All rights reserved.
> + *
> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
> + * when de_thread is blocked with ->cred_guard_mutex held.
> + */
> +
> +#include "../kselftest_harness.h"
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <unistd.h>
> +#include <sys/ptrace.h>
> +
> +static void *thread(void *arg)
> +{
> +	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
> +	return NULL;
> +}
> +
> +TEST(vmaccess)
> +{
> +	int f, pid = fork();
> +	char mm[64];
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread, NULL);
> +		pthread_join(pt, NULL);
> +		execlp("true", "true", NULL);
> +	}
> +
> +	sleep(1);
> +	sprintf(mm, "/proc/%d/mem", pid);
> +	f = open(mm, O_RDONLY);
> +	ASSERT_GE(f, 0);
> +	close(f);
> +	f = kill(pid, SIGCONT);
> +	ASSERT_EQ(f, 0);
> +}
> +
> +TEST(attach)
> +{
> +	int s, k, pid = fork();
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread, NULL);
> +		pthread_join(pt, NULL);
> +		execlp("sleep", "sleep", "2", NULL);
> +	}
> +
> +	sleep(1);
> +	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> +	ASSERT_EQ(errno, EAGAIN);
> +	ASSERT_EQ(k, -1);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_NE(k, -1);
> +	ASSERT_NE(k, 0);
> +	ASSERT_NE(k, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	sleep(1);
> +	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	k = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(errno, ECHILD);
> +}
> +
> +TEST_HARNESS_MAIN
> -- 
> 1.9.1

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
  2020-03-10 13:44                                                                       ` [PATCH 2/4] selftests/ptrace: add test cases for dead-locks Bernd Edlinger
  2020-03-10 21:36                                                                         ` Kees Cook
@ 2020-03-10 22:41                                                                         ` Dmitry V. Levin
  1 sibling, 0 replies; 203+ messages in thread
From: Dmitry V. Levin @ 2020-03-10 22:41 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Kees Cook, Jann Horn,
	Jonathan Corbet, Alexander Viro, Andrew Morton, Alexey Dobriyan,
	Thomas Gleixner, Oleg Nesterov, Frederic Weisbecker,
	Andrei Vagin, Ingo Molnar, Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 02:44:01PM +0100, Bernd Edlinger wrote:
> This adds test cases for ptrace deadlocks.
> 
> Additionally fixes a compile problem in get_syscall_info.c,
> observed with gcc-4.8.4:
> 
> get_syscall_info.c: In function 'get_syscall_info':
> get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
>                                  allowed in C99 mode
>    for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
>    ^
> get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
>                                your code
[...]
> @@ -1,6 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -iquote../../../../include/uapi -Wall
> +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall

Wouldn't it be better to choose -std=gnu99 over -std=c99?


-- 
ldv

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-10 21:30                                                     ` Eric W. Biederman
@ 2020-03-10 23:21                                                       ` Jann Horn
  2020-03-11  0:15                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Jann Horn @ 2020-03-10 23:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Jann Horn <jannh@google.com> writes:
> > On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> >> over the userspace accesses as the arguments from userspace are read.
> >> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> >> threads are killed.  The cred_guard_mutex is held over
> >> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> >>
> >> Any of those can result in deadlock, as the cred_guard_mutex is held
> >> over possibly indefinite waits for userspace.
> >>
> >> Add exec_update_mutex that is only held over exec updating process
> >> with the new contents of exec, for code that must not be
> >> confused by exec changing the mm and the cred in ways that can not
> >> happen during ordinary execution of a process.
> >>
> >> The plan is to switch the users of cred_guard_mutex to
> >> exec_update_mutex one by one.  This lets us move forward while still
> >> being careful and not introducing any regressions.
> > [...]
> >> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
> >>                         return -EINTR;
> >>                 }
> >>         }
> >> +
> >> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> >> +       if (ret)
> >> +               return ret;
> >
> > We're already holding the old mmap_sem, and now nest the
> > exec_update_mutex inside it; but then while still holding the
> > exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
> > which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
> > at least lockdep will be unhappy, and I'm not sure whether it's an
> > actual problem or not.
>
> Good point.  I should double check the lock ordering here with mmap_sem.
> It doesn't look like mmput takes mmap_sem

You sure about that? mmput() -> __mmput() -> ksm_exit() ->
__ksm_exit() -> down_write(&mm->mmap_sem)

Or also: mmput() -> __mmput() -> khugepaged_exit() ->
__khugepaged_exit() -> down_write(&mm->mmap_sem)

Or is there a reason why those paths can't happen?

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-10 23:21                                                       ` Jann Horn
@ 2020-03-11  0:15                                                         ` Eric W. Biederman
  2020-03-11  6:33                                                           ` Bernd Edlinger
  0 siblings, 1 reply; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-11  0:15 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Jann Horn <jannh@google.com> writes:

> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Jann Horn <jannh@google.com> writes:
>> > On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> >> over the userspace accesses as the arguments from userspace are read.
>> >> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>> >> threads are killed.  The cred_guard_mutex is held over
>> >> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> >>
>> >> Any of those can result in deadlock, as the cred_guard_mutex is held
>> >> over possibly indefinite waits for userspace.
>> >>
>> >> Add exec_update_mutex that is only held over exec updating process
>> >> with the new contents of exec, for code that must not be
>> >> confused by exec changing the mm and the cred in ways that can not
>> >> happen during ordinary execution of a process.
>> >>
>> >> The plan is to switch the users of cred_guard_mutex to
>> >> exec_update_mutex one by one.  This lets us move forward while still
>> >> being careful and not introducing any regressions.
>> > [...]
>> >> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>> >>                         return -EINTR;
>> >>                 }
>> >>         }
>> >> +
>> >> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> >> +       if (ret)
>> >> +               return ret;
>> >
>> > We're already holding the old mmap_sem, and now nest the
>> > exec_update_mutex inside it; but then while still holding the
>> > exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>> > which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>> > at least lockdep will be unhappy, and I'm not sure whether it's an
>> > actual problem or not.
>>
>> Good point.  I should double check the lock ordering here with mmap_sem.
>> It doesn't look like mmput takes mmap_sem
>
> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
> __ksm_exit() -> down_write(&mm->mmap_sem)
>
> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>
> Or is there a reason why those paths can't happen?

Clearly I didn't look far enough. 

I will adjust this so that exec_update_mutex is taken before mmap_sem.
Anything else is just asking for trouble.

Eric

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 20:22                                                                       ` Bernd Edlinger
@ 2020-03-11  6:11                                                                         ` Bernd Edlinger
  2020-03-11 14:56                                                                           ` Jann Horn
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-11  6:11 UTC (permalink / raw)
  To: jannh, Eric W. Biederman
  Cc: Christian Brauner, Kees Cook, Jonathan Corbet, Alexander Viro,
	Andrew Morton, adobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, avagin, Ingo Molnar, Peter Zijlstra (Intel),
	duyuyang, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, gregkh,
	Shakeel Butt, Jason Gunthorpe, christian, Andrea Arcangeli,
	Aleksa Sarai, Dmitry V. Levin, linux-doc, linux-kernel,
	linux-fsdevel, linux-mm, stable, linux-api, Arnd Bergmann,
	sargun

On 3/10/20 9:22 PM, Bernd Edlinger wrote:
> On 3/10/20 9:10 PM, Jann Horn wrote:
>> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <jannh@google.com> wrote:
>>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> Jann Horn <jannh@google.com> writes:
>>>>> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>>> During exec some file descriptors are closed and the files struct is
>>>>>> unshared.  But all of that can happen at other times and it has the
>>>>>> same protections during exec as at ordinary times.  So stop taking the
>>>>>> cred_guard_mutex as it is useless.
>>>>>>
>>>>>> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>>>>>> prone, as it is held in several places while waiting possibly
>>>>>> indefinitely for userspace to do something.
>> [...]
>>>>> If you make this change, then if this races with execution of a setuid
>>>>> program that afterwards e.g. opens a unix domain socket, an attacker
>>>>> will be able to steal that socket and inject messages into
>>>>> communication with things like DBus. procfs currently has the same
>>>>> race, and that still needs to be fixed, but at least procfs doesn't
>>>>> let you open things like sockets because they don't have a working
>>>>> ->open handler, and it enforces the normal permission check for
>>>>> opening files.
>>>>
>>>> It isn't only exec that can change credentials.  Do we need a lock for
>>>> changing credentials?
>> [...]
>>>> If we need a lock around credential change let's design and build that.
>>>> Having a mismatch between what a lock is designed to do, and what
>>>> people use it for can only result in other bugs as people get confused.
>>>
>>> Hmm... what benefits do we get from making it a separate lock? I guess
>>> it would allow us to make it a per-task lock instead of a
>>> signal_struct-wide one? That might be helpful...
>>
>> But actually, isn't the core purpose of the cred_guard_mutex to guard
>> against concurrent credential changes anyway? That's what almost
>> everyone uses it for, and it's in the name...
>>
> 
> The main raison d'etre of exec_update_mutex is to get a consistent
> view of task->mm and task credentials.
> 
> The reason why you want the cred_guard_mutex, is that some action
> is changing the resulting credentials that the execve is about
> to install, and that is the data flow in the opposite direction.
> 

So in other words, you need the exec_update_mutex when you
access another thread's credentials and possibly the mmap at the
same time.

You need the cred_guard_mutex when you *change* the credentials
of another thread.  (You cannot be sure that the other thread
has not just started to execve something.)

You need no mutex at all when you are just accessing or
even changing the credentials of the current thread.  (If another
thread is doing execve, your task will be killed, and whether
or not the credentials were changed does not matter any more.)

> 
> Bernd.
> 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-11  0:15                                                         ` Eric W. Biederman
@ 2020-03-11  6:33                                                           ` Bernd Edlinger
  2020-03-11 16:29                                                             ` Eric W. Biederman
  0 siblings, 1 reply; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-11  6:33 UTC (permalink / raw)
  To: Eric W. Biederman, Jann Horn
  Cc: Christian Brauner, Kees Cook, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Alexey Dobriyan, Thomas Gleixner, Oleg Nesterov,
	Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/11/20 1:15 AM, Eric W. Biederman wrote:
> Jann Horn <jannh@google.com> writes:
> 
>> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Jann Horn <jannh@google.com> writes:
>>>> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over possibly indefinite waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>> with the new contents of exec, for code that must not be
>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>> happen during ordinary execution of a process.
>>>>>
>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>> exec_update_mutex one by one.  This lets us move forward while still
>>>>> being careful and not introducing any regressions.
>>>> [...]
>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>                         return -EINTR;
>>>>>                 }
>>>>>         }
>>>>> +
>>>>> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>
>>>> We're already holding the old mmap_sem, and now nest the
>>>> exec_update_mutex inside it; but then while still holding the
>>>> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>>>> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>>>> at least lockdep will be unhappy, and I'm not sure whether it's an
>>>> actual problem or not.
>>>
>>> Good point.  I should double check the lock ordering here with mmap_sem.
>>> It doesn't look like mmput takes mmap_sem
>>
>> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
>> __ksm_exit() -> down_write(&mm->mmap_sem)
>>
>> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
>> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>>
>> Or is there a reason why those paths can't happen?
> 
> Clearly I didn't look far enough. 
> 
> I will adjust this so that exec_update_mutex is taken before mmap_sem.
> Anything else is just asking for trouble.
> 

Note that mm_access also does an mmput under the exec_update_mutex.
So I don't see a huge problem here.
But maybe I missed something.


Bernd.


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-08 21:38                                                 ` [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex Eric W. Biederman
  2020-03-09 13:45                                                   ` Bernd Edlinger
  2020-03-10 21:21                                                   ` Jann Horn
@ 2020-03-11 13:18                                                   ` Qian Cai
  2020-03-12 10:27                                                   ` Kirill Tkhai
  3 siblings, 0 replies; 203+ messages in thread
From: Qian Cai @ 2020-03-11 13:18 UTC (permalink / raw)
  To: Eric W. Biederman, Bernd Edlinger
  Cc: Christian Brauner, Kees Cook, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Sun, 2020-03-08 at 16:38 -0500, Eric W. Biederman wrote:
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> 
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over possibly indefinite waits for userspace.
> 
> Add exec_update_mutex that is only held over exec updating process
> with the new contents of exec, for code that must not be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
> 
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one.  This lets us move forward while still
> being careful and not introducing any regressions.
> 
> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

This patch will trigger a warning during boot,

[   19.707214][    T1] pci 0035:01:00.0: enabling device (0545 -> 0547)
[   19.707287][    T1] EEH: Capable adapter found: recovery enabled.
[   19.732541][    T1] cpuidle-powernv: Default stop: psscr = 0x0000000000000330,mask=0x00000000003003ff
[   19.732567][    T1] cpuidle-powernv: Deepest stop: psscr = 0x0000000000300375,mask=0x00000000003003ff
[   19.732598][    T1] cpuidle-powernv: First stop level that may lose SPRs = 0x4
[   19.732617][    T1] cpuidle-powernv: First stop level that may lose timebase = 0x10
[   19.769784][    T1] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[   19.769810][    T1] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[   19.789344][  T718] 
[   19.789367][  T718] =====================================
[   19.789379][  T718] WARNING: bad unlock balance detected!
[   19.789393][  T718] 5.6.0-rc5-next-20200311+ #4 Not tainted
[   19.789414][  T718] -------------------------------------
[   19.789426][  T718] kworker/u257:0/718 is trying to release lock (&sig->exec_update_mutex) at:
[   19.789459][  T718] [<c0000000004c6770>] free_bprm+0xe0/0xf0
[   19.789481][  T718] but there are no more locks to release!
[   19.789502][  T718] 
[   19.789502][  T718] other info that might help us debug this:
[   19.789537][  T718] 1 lock held by kworker/u257:0/718:
[   19.789558][  T718]  #0: c000001fa8842808 (&sig->cred_guard_mutex){+.+.}, at: __do_execve_file.isra.33+0x1b0/0xda0
[   19.789611][  T718] 
[   19.789611][  T718] stack backtrace:
[   19.789645][  T718] CPU: 8 PID: 718 Comm: kworker/u257:0 Not tainted 5.6.0-rc5-next-20200311+ #4
[   19.789681][  T718] Call Trace:
[   19.789703][  T718] [c000000dad8cfa70] [c000000000979b40] dump_stack+0xf4/0x164 (unreliable)
[   19.789742][  T718] [c000000dad8cfac0] [c0000000001c1d78] print_unlock_imbalance_bug+0x118/0x140
[   19.789780][  T718] [c000000dad8cfb40] [c0000000001ceaa0] lock_release+0x270/0x520
[   19.789817][  T718] [c000000dad8cfbf0] [c0000000009a2898] __mutex_unlock_slowpath+0x68/0x400
[   19.789854][  T718] [c000000dad8cfcc0] [c0000000004c6770] free_bprm+0xe0/0xf0
[   19.789900][  T718] [c000000dad8cfcf0] [c0000000004c845c] __do_execve_file.isra.33+0x44c/0xda0
__do_execve_file at fs/exec.c:1904
[   19.789938][  T718] [c000000dad8cfde0] [c0000000001391d8] call_usermodehelper_exec_async+0x218/0x250
[   19.789977][  T718] [c000000dad8cfe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74

> ---
>  fs/exec.c                    | 9 +++++++++
>  include/linux/sched/signal.h | 9 ++++++++-
>  init/init_task.c             | 1 +
>  kernel/fork.c                | 1 +
>  4 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a7272a76..ffeebb1f167b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk;
>  	struct mm_struct *old_mm, *active_mm;
> +	int ret;
>  
>  	/* Notify parent that we're no longer interested in the old VM */
>  	tsk = current;
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>  			return -EINTR;
>  		}
>  	}
> +
> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> +	if (ret)
> +		return ret;
> +
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> @@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> +		if (!bprm->mm)
> +			mutex_unlock(&current->signal->exec_update_mutex);
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> @@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>  	 * credentials; any time after this it may be unlocked.
>  	 */
>  	security_bprm_committed_creds(bprm);
> +	mutex_unlock(&current->signal->exec_update_mutex);
>  	mutex_unlock(&current->signal->cred_guard_mutex);
>  }
>  EXPORT_SYMBOL(install_exec_creds);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 88050259c466..a29df79540ce 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -224,7 +224,14 @@ struct signal_struct {
>  
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
> -					 * (notably. ptrace) */
> +					 * (notably. ptrace)
> +					 * Deprecated do not use in new code.
> +					 * Use exec_update_mutex instead.
> +					 */
> +	struct mutex exec_update_mutex;	/* Held while task_struct is being
> +					 * updated during exec, and may have
> +					 * inconsistent permissions.
> +					 */
>  } __randomize_layout;
>  
>  /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5eab7b..bd403ed3e418 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
>  	.multiprocess	= HLIST_HEAD_INIT,
>  	.rlim		= INIT_RLIMITS,
>  	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> +	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
>  #ifdef CONFIG_POSIX_TIMERS
>  	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>  	.cputimer	= {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 60a1295f4384..12896a6ecee6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>  
>  	mutex_init(&sig->cred_guard_mutex);
> +	mutex_init(&sig->exec_update_mutex);
>  
>  	return 0;
>  }


* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-11  6:11                                                                         ` Bernd Edlinger
@ 2020-03-11 14:56                                                                           ` Jann Horn
  0 siblings, 0 replies; 203+ messages in thread
From: Jann Horn @ 2020-03-11 14:56 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, adobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, avagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	duyuyang, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris, gregkh,
	Shakeel Butt, Jason Gunthorpe, christian, Andrea Arcangeli,
	Aleksa Sarai, Dmitry V. Levin, linux-doc, linux-kernel,
	linux-fsdevel, linux-mm, stable, linux-api, Arnd Bergmann,
	sargun

On Wed, Mar 11, 2020 at 7:12 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> On 3/10/20 9:22 PM, Bernd Edlinger wrote:
> > On 3/10/20 9:10 PM, Jann Horn wrote:
> >> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <jannh@google.com> wrote:
> >>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>>> Jann Horn <jannh@google.com> writes:
> >>>>> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>>>>> During exec some file descriptors are closed and the files struct is
> >>>>>> unshared.  But all of that can happen at other times and it has the
> >>>>>> same protections during exec as at ordinary times.  So stop taking the
> >>>>>> cred_guard_mutex as it is useless.
> >>>>>>
> >>>>>> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> >>>>>> prone, as it is held in several places while waiting possibly indefinitely
> >>>>>> for userspace to do something.
> >> [...]
> >>>>> If you make this change, then if this races with execution of a setuid
> >>>>> program that afterwards e.g. opens a unix domain socket, an attacker
> >>>>> will be able to steal that socket and inject messages into
> >>>>> communication with things like DBus. procfs currently has the same
> >>>>> race, and that still needs to be fixed, but at least procfs doesn't
> >>>>> let you open things like sockets because they don't have a working
> >>>>> ->open handler, and it enforces the normal permission check for
> >>>>> opening files.
> >>>>
> >>>> It isn't only exec that can change credentials.  Do we need a lock for
> >>>> changing credentials?
> >> [...]
> >>>> If we need a lock around credential change let's design and build that.
> >>>> Having a mismatch between what a lock is designed to do, and what
> >>>> people use it for can only result in other bugs as people get confused.
> >>>
> >>> Hmm... what benefits do we get from making it a separate lock? I guess
> >>> it would allow us to make it a per-task lock instead of a
> >>> signal_struct-wide one? That might be helpful...
> >>
> >> But actually, isn't the core purpose of the cred_guard_mutex to guard
> >> against concurrent credential changes anyway? That's what almost
> >> everyone uses it for, and it's in the name...
> >>
> >
> > The main raison d'etre of exec_update_mutex is to get a consistent
> > view of task->mm and task credentials.
> >
> > The reason why you want the cred_guard_mutex is that some action
> > is changing the resulting credentials that the execve is about
> > to install, and that is the data flow in the opposite direction.
> >
>
> So in other words, you need the exec_update_mutex when you
> access another thread's credentials and possibly the mmap at the
> same time.

Or the file descriptor table, or register state, ...

> You need no mutex at all when you are just accessing or
> even changing the credentials of the current thread.  (If another
> thread is doing execve, your task will be killed, and whether
> or not the credentials were changed does not matter any more.)

Only if the only access checks you care about are those related to mm access.


* Re: [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
  2020-03-11  6:33                                                           ` Bernd Edlinger
@ 2020-03-11 16:29                                                             ` Eric W. Biederman
  0 siblings, 0 replies; 203+ messages in thread
From: Eric W. Biederman @ 2020-03-11 16:29 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Christian Brauner, Kees Cook, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/11/20 1:15 AM, Eric W. Biederman wrote:
>> Jann Horn <jannh@google.com> writes:
>> 
>>> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Jann Horn <jannh@google.com> writes:
>>>>> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>>
>>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>>> over a possibly indefinite wait for userspace.
>>>>>>
>>>>>> Add exec_update_mutex that is only held while exec updates the process
>>>>>> with the new contents of exec, for use by code that must not be
>>>>>> confused by exec changing the mm and the cred in ways that cannot
>>>>>> happen during ordinary execution of a process.
>>>>>>
>>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>>> exec_update_mutex one by one.  This lets us move forward while still
>>>>>> being careful and not introducing any regressions.
>>>>> [...]
>>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>                         return -EINTR;
>>>>>>                 }
>>>>>>         }
>>>>>> +
>>>>>> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>>> +       if (ret)
>>>>>> +               return ret;
>>>>>
>>>>> We're already holding the old mmap_sem, and now nest the
>>>>> exec_update_mutex inside it; but then while still holding the
>>>>> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>>>>> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>>>>> at least lockdep will be unhappy, and I'm not sure whether it's an
>>>>> actual problem or not.
>>>>
>>>> Good point.  I should double check the lock ordering here with mmap_sem.
>>>> It doesn't look like mmput takes mmap_sem
>>>
>>> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
>>> __ksm_exit() -> down_write(&mm->mmap_sem)
>>>
>>> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
>>> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>>>
>>> Or is there a reason why those paths can't happen?
>> 
>> Clearly I didn't look far enough. 
>> 
>> I will adjust this so that exec_update_mutex is taken before mmap_sem.
>> Anything else is just asking for trouble.
>> 
>
> Note that mm_access does also mmput under the exec_update_mutex.
> So I don't see a huge problem here.
> But maybe I missed something.

The issue is that, to prevent deadlock, locks must always be taken
in the same order.

Taking mmap_sem then exec_update_mutex at the start of the function,
then taking exec_update_mutex then mmap_sem in mmput, takes the
two locks in two different orders.  Which means that in the right
set of circumstances:

thread1:                                thread2:
   obtain mmap_sem                      obtain exec_update_mutex
      wait for exec_update_mutex        wait for mmap_sem

Which guarantees that neither thread will make progress.

The fix is easy: I just need to take exec_update_mutex a few lines
earlier.
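
To illustrate the ordering rule in isolation, here is a minimal userspace
sketch using POSIX threads, not kernel code.  The names exec_update_lock,
mmap_lock, locked_increment and run_ordered_demo are hypothetical; the two
pthread mutexes only play the roles of exec_update_mutex and mmap_sem.
Both threads take the locks in the same order, so the wait cycle in the
diagram cannot form:

```c
#include <pthread.h>

/* Hypothetical stand-ins for exec_update_mutex and mmap_sem. */
static pthread_mutex_t exec_update_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter;

/* Every path takes exec_update_lock first, then mmap_lock. */
static void locked_increment(void)
{
	pthread_mutex_lock(&exec_update_lock);
	pthread_mutex_lock(&mmap_lock);
	shared_counter++;
	/* Release in the reverse order of acquisition. */
	pthread_mutex_unlock(&mmap_lock);
	pthread_mutex_unlock(&exec_update_lock);
}

static void *worker(void *arg)
{
	int i, n = *(int *)arg;

	for (i = 0; i < n; i++)
		locked_increment();
	return NULL;
}

/*
 * Run two threads that both follow the same lock order.
 * Returns the final counter value, or -1 if a thread could not be created.
 */
int run_ordered_demo(int iterations)
{
	pthread_t a, b;

	shared_counter = 0;
	if (pthread_create(&a, NULL, worker, &iterations))
		return -1;
	if (pthread_create(&b, NULL, worker, &iterations)) {
		pthread_join(a, NULL);
		return -1;
	}
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return shared_counter;
}
```

Calling run_ordered_demo(1000) returns 2000 once both workers finish.
If one path took mmap_lock before exec_update_lock instead, the two
threads could each hold one lock and wait forever for the other.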

Eric



* Re: [PATCH] pidfd: Stop taking cred_guard_mutex
  2020-03-10 20:57                                                                       ` Eric W. Biederman
  2020-03-10 21:29                                                                         ` Christian Brauner
@ 2020-03-11 18:49                                                                         ` Kees Cook
  2020-03-14  9:12                                                                           ` [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-11 18:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Christian Brauner, Bernd Edlinger, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api, Arnd Bergmann, Sargun Dhillon

On Tue, Mar 10, 2020 at 03:57:35PM -0500, Eric W. Biederman wrote:
> So ptrace_attach and seccomp use the cred_guard_mutex to guarantee
> a deadlock.

Well, that's the result, but seccomp uses it because it wants to
be certain that credentials and no_new_privs are changed together
"atomically".

-- 
Kees Cook


* Re: [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
  2020-03-10 13:44                                                                       ` [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core Bernd Edlinger
@ 2020-03-11 18:53                                                                         ` Kees Cook
  0 siblings, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-11 18:53 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 02:44:10PM +0100, Bernd Edlinger wrote:
> This removes a duplicate "a" in the comment in process_vm_rw_core.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  mm/process_vm_access.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7b..b3e6eb5 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
>  	if (!mm || IS_ERR(mm)) {
>  		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>  		/*
> -		 * Explicitly map EACCES to EPERM as EPERM is a more a
> +		 * Explicitly map EACCES to EPERM as EPERM is a more
>  		 * appropriate error code for process_vw_readv/writev
>  		 */
>  		if (rc == -EACCES)
> -- 
> 1.9.1

-- 
Kees Cook


* Re: [PATCH 4/4] kernel: doc: remove outdated comment cred.c
  2020-03-10 13:44                                                                       ` [PATCH 4/4] kernel: doc: remove outdated comment cred.c Bernd Edlinger
@ 2020-03-11 18:54                                                                         ` Kees Cook
  0 siblings, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-11 18:54 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 02:44:18PM +0100, Bernd Edlinger wrote:
> This removes an outdated comment in prepare_kernel_cred.
> 
> There is no "cred_replace_mutex" any more, so the comment must
> go away.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  kernel/cred.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 809a985..71a7926 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -675,8 +675,6 @@ void __init cred_init(void)
>   * The caller may change these controls afterwards if desired.
>   *
>   * Returns the new credentials or NULL if out of memory.
> - *
> - * Does not take, and does not return holding current->cred_replace_mutex.
>   */
>  struct cred *prepare_kernel_cred(struct task_struct *daemon)
>  {
> -- 
> 1.9.1

-- 
Kees Cook


* Re: [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
  2020-03-10 17:45                                                                           ` [PATCH 2/4] proc: " Bernd Edlinger
@ 2020-03-11 18:59                                                                             ` Kees Cook
  2020-03-11 19:10                                                                             ` Kees Cook
  1 sibling, 0 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-11 18:59 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
> This changes lock_trace to use the new exec_update_mutex
> instead of cred_guard_mutex.
> 
> This fixes possible deadlocks when the tracer is accessing
> /proc/$pid/stack for instance.
> 
> This should be safe, as the credentials are only used for reading,
> and task->mm is updated on execve under the new exec_update_mutex.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  fs/proc/base.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ebea950..4fdfe4f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
>  
>  static int lock_trace(struct task_struct *task)
>  {
> -	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	int err = mutex_lock_killable(&task->signal->exec_update_mutex);
>  	if (err)
>  		return err;
>  	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
> -		mutex_unlock(&task->signal->cred_guard_mutex);
> +		mutex_unlock(&task->signal->exec_update_mutex);
>  		return -EPERM;
>  	}
>  	return 0;
> @@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
>  
>  static void unlock_trace(struct task_struct *task)
>  {
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->exec_update_mutex);
>  }
>  
>  #ifdef CONFIG_STACKTRACE
> -- 
> 1.9.1

-- 
Kees Cook


* Re: [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  2020-03-10 17:45                                                                           ` [PATCH 3/4] proc: io_accounting: " Bernd Edlinger
  2020-03-10 19:06                                                                             ` Eric W. Biederman
@ 2020-03-11 19:08                                                                             ` Kees Cook
  2020-03-11 19:48                                                                               ` Bernd Edlinger
  2020-03-11 19:48                                                                               ` Eric W. Biederman
  1 sibling, 2 replies; 203+ messages in thread
From: Kees Cook @ 2020-03-11 19:08 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
> This changes do_io_accounting to use the new exec_update_mutex
> instead of cred_guard_mutex.
> 
> This fixes possible deadlocks when the tracer is accessing
> /proc/$pid/io for instance.
> 
> This should be safe, as the credentials are only used for reading.

I'd like to see the rationale described better here for why it should be
safe. I'm still not seeing why this is safe here, as we might check
ptrace_may_access() with one cred and then iterate io accounting with a
different credential...

What am I missing?

-Kees

> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  fs/proc/base.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fdfe4f..529d0c6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>  	unsigned long flags;
>  	int result;
>  
> -	result = mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	result = mutex_lock_killable(&task->signal->exec_update_mutex);
>  	if (result)
>  		return result;
>  
> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>  	result = 0;
>  
>  out_unlock:
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->exec_update_mutex);
>  	return result;
>  }
>  
> -- 
> 1.9.1

-- 
Kees Cook


* Re: [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
  2020-03-10 17:45                                                                           ` [PATCH 2/4] proc: " Bernd Edlinger
  2020-03-11 18:59                                                                             ` Kees Cook
@ 2020-03-11 19:10                                                                             ` Kees Cook
  2020-03-11 19:38                                                                               ` Bernd Edlinger
  1 sibling, 1 reply; 203+ messages in thread
From: Kees Cook @ 2020-03-11 19:10 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
> This changes lock_trace to use the new exec_update_mutex
> instead of cred_guard_mutex.
> 
> This fixes possible deadlocks when the tracer is accessing
> /proc/$pid/stack for instance.
> 
> This should be safe, as the credentials are only used for reading,
> and task->mm is updated on execve under the new exec_update_mutex.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

I have the same question here as in 3/4. I should probably rescind my
Reviewed-by until I'm convinced about the security-safety of this -- why
is this not a race against cred changes?

-Kees

> ---
>  fs/proc/base.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ebea950..4fdfe4f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
>  
>  static int lock_trace(struct task_struct *task)
>  {
> -	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	int err = mutex_lock_killable(&task->signal->exec_update_mutex);
>  	if (err)
>  		return err;
>  	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
> -		mutex_unlock(&task->signal->cred_guard_mutex);
> +		mutex_unlock(&task->signal->exec_update_mutex);
>  		return -EPERM;
>  	}
>  	return 0;
> @@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
>  
>  static void unlock_trace(struct task_struct *task)
>  {
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	mutex_unlock(&task->signal->exec_update_mutex);
>  }
>  
>  #ifdef CONFIG_STACKTRACE
> -- 
> 1.9.1

-- 
Kees Cook


* Re: [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
  2020-03-11 19:10                                                                             ` Kees Cook
@ 2020-03-11 19:38                                                                               ` Bernd Edlinger
  0 siblings, 0 replies; 203+ messages in thread
From: Bernd Edlinger @ 2020-03-11 19:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Christian Brauner, Jann Horn, Jonathan Corbet,
	Alexander Viro, Andrew Morton, Alexey Dobriyan, Thomas Gleixner,
	Oleg Nesterov, Frederic Weisbecker, Andrei Vagin, Ingo Molnar,
	Peter Zijlstra (Intel),
	Yuyang Du, David Hildenbrand, Sebastian Andrzej Siewior,
	Anshuman Khandual, David Howells, James Morris,
	Greg Kroah-Hartman, Shakeel Butt, Jason Gunthorpe,
	Christian Kellner, Andrea Arcangeli, Aleksa Sarai,
	Dmitry V. Levin, linux-doc, linux-kernel, linux-fsdevel,
	linux-mm, stable, linux-api

On 3/11/20 8:10 PM, Kees Cook wrote:
> On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
>> This changes lock_trace to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This fixes possible deadlocks when the tracer is accessing
>> /proc/$pid/stack for instance.
>>
>> This should be safe, as the credentials are only used for reading,
>> and task->mm is updated on execve under the new exec_update_mutex.
>>
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> 
> I have the same question here as in 3/4. I should probably rescind my
> Reviewed-by until I'm convinced about the security-safety of this -- why
> is this not a race against cred changes?
> 

The credentials of a thread that is currently executing execve are already
set in the bprm structure; however, the credentials in the task structure
are not yet changed, and the process memory map stays stable
until the exec_update_mutex is acquired.

What these functions do is access the call stack of the
process before the new executable is actually started.

There would immediately be a severe security problem if we did
not use any mutex, as the check would then be done with the old credentials,
but the stack trace would potentially reveal secret function
calls that are done by a setuid program when it starts up.
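
The point about holding one mutex across both the permission check and
the read can be sketched in userspace C.  This is an illustration under
stated assumptions, not the kernel implementation: struct target,
target_exec and target_read are hypothetical names, update_lock stands in
for exec_update_mutex, and the privileged flag stands in for the
credential check done by ptrace_may_access().

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical model of a task whose state is replaced by exec. */
struct target {
	pthread_mutex_t update_lock;	/* stand-in for exec_update_mutex */
	int privileged;			/* stand-in for the task credentials */
	int secret;			/* stand-in for stack or io data */
};

/* Models exec: credentials and data change together under the lock. */
void target_exec(struct target *t, int privileged, int secret)
{
	pthread_mutex_lock(&t->update_lock);
	t->privileged = privileged;
	t->secret = secret;
	pthread_mutex_unlock(&t->update_lock);
}

/*
 * Models lock_trace() plus the read that follows: the check and the
 * read happen under the same lock, so the data that is returned always
 * belongs to the credentials that were checked.
 */
int target_read(struct target *t, int *out)
{
	int ret = 0;

	pthread_mutex_lock(&t->update_lock);
	if (t->privileged)
		ret = -EPERM;		/* caller may not look at this task */
	else
		*out = t->secret;
	pthread_mutex_unlock(&t->update_lock);
	return ret;
}
```

If target_read() dropped the lock between the check and the read, a
concurrent target_exec() could slip in, and the reader would see the
new secret after having passed the check against the old state.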


Bernd.


> -Kees
> 
>> ---
>>  fs/proc/base.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ebea950..4fdfe4f 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
>>  
>>  static int lock_trace(struct task_struct *task)
>>  {
>> -	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> +	int err = mutex_lock_killable(&task->signal->exec_update_mutex);
>>  	if (err)
>>  		return err;
>>  	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
>> -		mutex_unlock(&task->signal->cred_guard_mutex);
>> +		mutex_unlock(&task->signal->exec_update_mutex);
>>  		return -EPERM;
>>  	}
>>  	return 0;
>> @@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
>>  
>>  static void unlock_trace(struct task_struct *task)
>>  {
>> -	mutex_unlock(&task->signal->cred_guard_mutex);
>> +	mutex_unlock(&task->signal->exec_update_mutex);
>>  }
>>  
>>  #ifdef CONFIG_STACKTRACE
>> -- 
>> 1.9.1
> 

^ permalink raw reply	[flat|