Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
@ 2020-07-27 21:03 Eric W. Biederman
  2020-07-28  0:20 ` Linus Torvalds
  2020-07-28  9:41 ` [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Aleksa Sarai
  0 siblings, 2 replies; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-27 21:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Pavel Machek, Rafael J. Wysocki,
	linux-fsdevel, Oleg Nesterov, linux-pm


Many of the challenges of implementing a simple version of exec come
from the fact the code handles execing multi-thread processes.

To the best of my knowledge processes with more than one thread
calling exec are not common, and as all of the threads will be killed
by exec there does not appear to be any useful work a thread can
reliably do during exec.

Therefore make it simpler to get exec correct by freezing the other
threads at the beginning of exec.  This removes an entire class of
races, and makes it tractable to fix some of the long standing
issues with exec.

One issue that this change makes it easier to solve is the issue of
deailing with the file table.  Today exec unshares the file table at
the beginning to ensure there are no weird races with file
descriptors.  Unfortunately this unsharing can unshare the file table
when only threads of the current process share it.  Which results in
unnecessary unsharing and posix locks being inappropriately dropped by
a multi-threaded exec.  With all of the threads frozen the thread
count is stable and it is easy to tell if the if the file table really
needs to be unshared by exec.

Further this changes allows seccomp to stop taking cred_guard_mutex,
as the seccomp code takes cred_guard_mutex to protect against another
thread that is in the middle of calling exec and this change
guarantees that if one threads is calling exec all of the other threads
have stopped running.  So this problematic kind of concurrency between
threads can no longer happen.

As this change reuses the generic freezer code interactions with
signal group stop, ptrace stop, the cgroup freezer, fatal signals, and
SIGCONT are already well defined and inherited from the freezer code.
In short other threads will not wake up to participate in signal group
stop, ptrace stop, the cgroup freezer, to handle fatal signals.  As
SIGCONT is handled by the caller that is still processed as usual.
Fatal signals while not processed by other threads are still processed
by the thread calling exec so they take effect as usual.

The code in de_thread was modified to unfreeze the threads at the same
time as it is killing them ensuring that code continues to work as it
does today, and without introducing any races where a thread might
perform any problematic work in the middle of de_thread.

The code in fork is optimized to set TIF_SIGPENDING if the new task
needs to freeze.  This makes it more likely that a new task will
freeze immediately instead of proceeding to userspace to execute some
code, where the next freezing loop will need to tell it to freeze.

A new function exec_freezing is added and called from the refrigerator
so that the freezer code will actually ensure threads called from
exec are frozen as well.

A new function exec_freeze_threads based upon
kernel/power/process.c:try_to_freeze_tasks is added.  To play well
with other uses of the kernel freezer it uses a killable sleep wrapped
with freezer_do_not_count/freezer_count.  So that it will not lock out
the global freezer when it is simply waiting for it's threads to
freeze.  This new function also ensures that only one thread of a
thread group can enter exec at a time.

While this does now allow every process to touch system_freezing_cnt
which is an int.  This should be fine as the maximum number of tasks
is PID_MAX_LIMIT which has an upper bound of 4 * 1024 * 1024.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 99 +++++++++++++++++++++++++++++++++++-
 include/linux/sched/signal.h | 10 ++++
 kernel/fork.c                |  3 ++
 kernel/freezer.c             |  2 +-
 4 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 3698252719a3..cce6b700c3bb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,6 +62,7 @@
 #include <linux/oom.h>
 #include <linux/compat.h>
 #include <linux/vmalloc.h>
+#include <linux/freezer.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1145,6 +1146,94 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static void unfreeze_threads(struct signal_struct *signal)
+{
+	if (signal->group_execing_task) {
+		signal->group_execing_task = NULL;
+		atomic_dec(&system_freezing_cnt);
+	}
+}
+
+/*
+ * Freeze all other threads in the group for exec
+ */
+static int exec_freeze_threads(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+	spinlock_t *lock = &p->sighand->siglock;
+	unsigned long sleep = msecs_to_jiffies(1);
+	unsigned int todo;
+	int ret;
+
+	if (thread_group_empty(p))
+		return 0;
+
+	spin_lock_irq(lock);
+	if (signal_pending(p) || signal_group_exit(signal) ||
+	    signal->group_execing_task) {
+		spin_unlock(lock);
+		return -ERESTARTNOINTR;
+	}
+
+	signal->group_execing_task = p;
+	atomic_inc(&system_freezing_cnt);
+	for (;;) {
+		todo = 0;
+		__for_each_thread(signal, t) {
+			if ((t == p) || !freeze_task(p))
+				continue;
+
+			if (!freezer_should_skip(p))
+				todo++;
+		}
+
+		if (!todo || __fatal_signal_pending(p) ||
+		    (sleep > msecs_to_jiffies(8)))
+			break;
+
+		/*
+		 * We need to retry, but first give the freezing tasks some
+		 * time to enter the refrigerator.  Start with an initial
+		 * 1 ms sleep followed by exponential backoff until 8 ms.
+		 */
+		spin_unlock_irq(lock);
+
+		freezer_do_not_count();
+		schedule_timeout_killable(sleep);
+		freezer_count();
+		sleep *= 2;
+
+		spin_lock_irq(lock);
+	}
+	ret = 0;
+	if (todo)
+		ret = -EBUSY;
+	if (__fatal_signal_pending(p))
+		ret = -ERESTARTNOINTR;
+	if (ret)
+		unfreeze_threads(signal);
+	spin_unlock_irq(lock);
+	return ret;
+}
+
+static void exec_thaw_threads(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+	spinlock_t *lock = &p->sighand->siglock;
+
+	spin_lock_irq(lock);
+	if (signal->group_execing_task) {
+		unfreeze_threads(signal);
+		__for_each_thread(signal, t) {
+			if (t != p)
+				__thaw_task(t);
+		}
+	}
+	spin_unlock_irq(lock);
+}
+
 static int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
@@ -1167,6 +1256,7 @@ static int de_thread(struct task_struct *tsk)
 		return -EAGAIN;
 	}
 
+	unfreeze_threads(sig);
 	sig->group_exit_task = tsk;
 	sig->notify_count = zap_other_threads(tsk);
 	if (!thread_group_leader(tsk))
@@ -1885,10 +1975,15 @@ static int bprm_execve(struct linux_binprm *bprm,
 	struct files_struct *displaced;
 	int retval;
 
-	retval = unshare_files(&displaced);
+	/* If the process is multi-threaded stop the other threads */
+	retval = exec_freeze_threads();
 	if (retval)
 		return retval;
 
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out_ret;
+
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
 		goto out_files;
@@ -1949,6 +2044,8 @@ static int bprm_execve(struct linux_binprm *bprm,
 out_files:
 	if (displaced)
 		reset_files_struct(displaced);
+out_ret:
+	exec_thaw_threads();
 
 	return retval;
 }
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..bbf53fcd913b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -106,6 +106,9 @@ struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* Task that is performing exec */
+	struct task_struct	*group_execing_task;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
@@ -269,6 +272,13 @@ static inline int signal_group_exit(const struct signal_struct *sig)
 		(sig->group_exit_task != NULL);
 }
 
+static inline bool exec_freezing(struct task_struct *p)
+{
+	struct task_struct *execing = READ_ONCE(p->signal->group_execing_task);
+
+	return execing && (execing != p);
+}
+
 extern void flush_signals(struct task_struct *);
 extern void ignore_signals(struct task_struct *);
 extern void flush_signal_handlers(struct task_struct *, int force_default);
diff --git a/kernel/fork.c b/kernel/fork.c
index bf215af7a904..d3a0f914231c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2299,6 +2299,9 @@ static __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	/* Ensure a process that should freeze exits on the slow path */
+	if (freezing(p))
+		set_tsk_thread_flag(p, TIF_SIGPENDING);
 	proc_fork_connector(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..97c6f69b832e 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (test_tsk_thread_flag(p, TIF_MEMDIE))
 		return false;
 
-	if (pm_nosig_freezing || cgroup_freezing(p))
+	if (pm_nosig_freezing || cgroup_freezing(p) || exec_freezing(p))
 		return true;
 
 	if (pm_freezing && !(p->flags & PF_KTHREAD))
-- 
2.25.0



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-27 21:03 [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Eric W. Biederman
@ 2020-07-28  0:20 ` Linus Torvalds
  2020-07-28 12:39   ` Eric W. Biederman
  2020-07-28  9:41 ` [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Aleksa Sarai
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-07-28  0:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

On Mon, Jul 27, 2020 at 2:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Therefore make it simpler to get exec correct by freezing the other
> threads at the beginning of exec.  This removes an entire class of
> races, and makes it tractable to fix some of the long standing
> issues with exec.

I hate the global state part of the freezer.

It's also pointless. We don't want to trigger all the tests that
various random driver kernel threads do.

I also really don't like how now execve() by any random person will
suddenly impact everything that might be doing freezing.

It also makes for a possible _huge_ latency regression for execve(),
since freezing really has never been a very low-latency operation.

Other threads doing IO can now basically block execve() for a long
long long time.

Finally, I think your patch is fundamentally broken for another
reason: it depends on CONFIG_FREEZER, and that isn't even required to
be set!

So no, this is not at all acceptable in that form.

Now, maybe we could _make_ it acceptable, by

 (a) add a per-process freezer count to avoid the global state for this case

 (b)  make a small subset of the freezing code available for the
!CONFIG_FREEZER thing

 (c) fix this "simple freezer" to not actually force wakeups etc, but
catch things in the

but honestly, at that point nothing of the "CONFIG_FREEZER" code even
really exists any more. It would be more of a "execve_synchronize()"
thing, where we'd catch things in the scheduler and/or system call
entry/exit or whatever.

Also, that makes these kinds of nasty hacks that just make the
existign freezer code even harder to figure out:

> A new function exec_freeze_threads based upon
> kernel/power/process.c:try_to_freeze_tasks is added.  To play well
> with other uses of the kernel freezer it uses a killable sleep wrapped
> with freezer_do_not_count/freezer_count.

Ugh. Just _ugly_.

And honestly, completely and utterly broken. See above.

I understand the wish to re-use existing infrastructure. But the fact
is, the FREEZER code is just about the _last_ thing you should want to
use. That, and stop_machine(), is just too much of a big hammer.

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-27 21:03 [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Eric W. Biederman
  2020-07-28  0:20 ` Linus Torvalds
@ 2020-07-28  9:41 ` Aleksa Sarai
  2020-07-28 12:18   ` Eric W. Biederman
  1 sibling, 1 reply; 15+ messages in thread
From: Aleksa Sarai @ 2020-07-28  9:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, linux-pm


[-- Attachment #1: Type: text/plain, Size: 736 bytes --]

On 2020-07-27, Eric W. Biederman <ebiederm@xmission.com> wrote:
> To the best of my knowledge processes with more than one thread
> calling exec are not common, and as all of the threads will be killed
> by exec there does not appear to be any useful work a thread can
> reliably do during exec.

Every Go program which calls exec (this includes runc, Docker, LXD,
Kubernetes, et al) fills the niche of "multi-threaded program that calls
exec" -- all Go programs are multi-threaded and there's no way of
disabling this. This will most likely cause pretty bad performance
regression for basically all container workloads.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-28  9:41 ` [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Aleksa Sarai
@ 2020-07-28 12:18   ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-28 12:18 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, linux-pm

Aleksa Sarai <cyphar@cyphar.com> writes:

> On 2020-07-27, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> To the best of my knowledge processes with more than one thread
>> calling exec are not common, and as all of the threads will be killed
>> by exec there does not appear to be any useful work a thread can
>> reliably do during exec.
>
> Every Go program which calls exec (this includes runc, Docker, LXD,
> Kubernetes, et al) fills the niche of "multi-threaded program that calls
> exec" -- all Go programs are multi-threaded and there's no way of
> disabling this. This will most likely cause pretty bad performance
> regression for basically all container workloads.

So it is a good point that container runtimes use Go, and that
fundamentally anything that uses Go will be multi-threaded.  Having just
looked closely at this I don't think in practice this is an issue even
at this early state of my code.

If those other threads are sleeping the code I have implemented should
be a no-op.

If those threads aren't sleeping you have undefined behavior, as at
some point the kernel will come through and kill those threads.

Further unless I am completely mistaken the container runtimes use
forkAndExecInChild from go/src/syscall/exec_linux.go which performs a
vfork before performing the exec.

Eric


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-28  0:20 ` Linus Torvalds
@ 2020-07-28 12:39   ` Eric W. Biederman
  2020-07-28 13:20     ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-28 12:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jul 27, 2020 at 2:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Therefore make it simpler to get exec correct by freezing the other
>> threads at the beginning of exec.  This removes an entire class of
>> races, and makes it tractable to fix some of the long standing
>> issues with exec.
>
> I hate the global state part of the freezer.
>
> It's also pointless. We don't want to trigger all the tests that
> various random driver kernel threads do.
>
> I also really don't like how now execve() by any random person will
> suddenly impact everything that might be doing freezing.

Yes.  system_freezing_cnt as an enable/disable that affects all
tasks in the system does seem like a misfeature for use in the context
of exec where only a few tasks need to be dealt with.

> It also makes for a possible _huge_ latency regression for execve(),
> since freezing really has never been a very low-latency operation.
>
> Other threads doing IO can now basically block execve() for a long
> long long time.

Hmm.  Potentially.  The synchronization with the other threads must
happen in a multi-threaded exec in de_thread.

So I need to look at the differences between where de_thread thread
can kill a thread and the freezer can not freeze a thread.  I am hoping
that the freezer has already instrumented most of those sleeps but I
admit I have not looked yet.

> Finally, I think your patch is fundamentally broken for another
> reason: it depends on CONFIG_FREEZER, and that isn't even required to
> be set!

Very true.  I absolutely missed that detail.

> So no, this is not at all acceptable in that form.
>
> Now, maybe we could _make_ it acceptable, by
>
>  (a) add a per-process freezer count to avoid the global state for this case

Perhaps even a single bit.


>  (b)  make a small subset of the freezing code available for the
> !CONFIG_FREEZER thing

The code that is controlled by CONFIG_FREEZER is just kernel/freezer.c,
and include/linux/freezer.h.  Which is 177 + 303 lines respectively
so not much.

Or are you thinking about all of the locations that already include
freezable sleeps?

>  (c) fix this "simple freezer" to not actually force wakeups etc, but
> catch things in the

To catch things in the scheduler I presume?

The thing is the freezer code does not wake up anything if it is in a
sleep that it has wrapped.  Which I thought was just about all
significant ones but I need to verify that.

> but honestly, at that point nothing of the "CONFIG_FREEZER" code even
> really exists any more. It would be more of a "execve_synchronize()"
> thing, where we'd catch things in the scheduler and/or system call
> entry/exit or whatever.

Yes.  Where we catch things seems to be key.  I believe if all sleeps
that are killable plus system call exit should be enough, to be a noop.
As those are the places where the code can be killed now.

The tricky part is to mark processes that are sleeping in such a way
that then they wake up they go into a slow path and they get trapped
by the freezing code.

> Also, that makes these kinds of nasty hacks that just make the
> existign freezer code even harder to figure out:


>> A new function exec_freeze_threads based upon
>> kernel/power/process.c:try_to_freeze_tasks is added.  To play well
>> with other uses of the kernel freezer it uses a killable sleep wrapped
>> with freezer_do_not_count/freezer_count.
>
> Ugh. Just _ugly_.
>
> And honestly, completely and utterly broken. See above.
>
> I understand the wish to re-use existing infrastructure. But the fact
> is, the FREEZER code is just about the _last_ thing you should want to
> use. That, and stop_machine(), is just too much of a big hammer.

Part of my challenge is that it the more layers that get put around a
sleep the trickier it is to subsequently wrap.

I can see the point of building something very simple and more
fundamental that doesn't need as much support as the current freezer.
Perhaps something the freezer can the be rebuilt upon.


If the basic idea of stopping other threads early before we kill them in
exec sounds plausible then I will see what I can do.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-28 12:39   ` Eric W. Biederman
@ 2020-07-28 13:20     ` Eric W. Biederman
  2020-07-28 18:17       ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-28 13:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

ebiederm@xmission.com (Eric W. Biederman) writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> It also makes for a possible _huge_ latency regression for execve(),
>> since freezing really has never been a very low-latency operation.
>>
>> Other threads doing IO can now basically block execve() for a long
>> long long time.
>
> Hmm.  Potentially.  The synchronization with the other threads must
> happen in a multi-threaded exec in de_thread.
>
> So I need to look at the differences between where de_thread thread
> can kill a thread and the freezer can not freeze a thread.  I am hoping
> that the freezer has already instrumented most of those sleeps but I
> admit I have not looked yet.

Alright I have looked at the freezer a bit more and I now see that the
point of marking things freezable is for kernel threads rather that user
space threads.  I think there are 5 maybe 6 places the code sleeps
reachable by userspace threads that are marked as freezable and most
of those are callable from get_signal.

For exec all I care about are user space threads.  So it appears the
freezer infrastructure adds very little.

Now to see if I can find another way to divert a task into a slow path
as it wakes up, so I don't need to manually wrap all of the sleeping
calls.  Something that plays nice with the scheduler.

Eric


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-28 13:20     ` Eric W. Biederman
@ 2020-07-28 18:17       ` Linus Torvalds
  2020-07-30 13:16         ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-07-28 18:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> For exec all I care about are user space threads.  So it appears the
> freezer infrastructure adds very little.

Yeah. 99% of the freezer stuff is for just adding the magic notations
for kernel threads, and for any user space threads it seems the wrong
interface.

> Now to see if I can find another way to divert a task into a slow path
> as it wakes up, so I don't need to manually wrap all of the sleeping
> calls.  Something that plays nice with the scheduler.

The thing is, how many places really care?

Because I think there are like five of them. And they are all marked
by taking cred_guard_mutex, or the file table lock.

So it seems really excessive to then create some whole new "let's
serialize every thread", when you actually don't care about any of it,
except for a couple of very very special cases.

If you care about "thread count stable", you care about exit() and
clone().  You don't care about threads that are happily running - or
sleeping - doing their own thing.

So trying to catch those threads and freezing them really feels like
entirely the wrong interface.

             Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec
  2020-07-28 18:17       ` Linus Torvalds
@ 2020-07-30 13:16         ` Eric W. Biederman
  2020-07-30 22:56           ` [RFC][PATCH] exec: Conceal the other threads from wakeups during exec Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-30 13:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> For exec all I care about are user space threads.  So it appears the
>> freezer infrastructure adds very little.
>
> Yeah. 99% of the freezer stuff is for just adding the magic notations
> for kernel threads, and for any user space threads it seems the wrong
> interface.
>
>> Now to see if I can find another way to divert a task into a slow path
>> as it wakes up, so I don't need to manually wrap all of the sleeping
>> calls.  Something that plays nice with the scheduler.
>
> The thing is, how many places really care?
>
> Because I think there are like five of them. And they are all marked
> by taking cred_guard_mutex, or the file table lock.
>
> So it seems really excessive to then create some whole new "let's
> serialize every thread", when you actually don't care about any of it,
> except for a couple of very very special cases.
>
> If you care about "thread count stable", you care about exit() and
> clone().  You don't care about threads that are happily running - or
> sleeping - doing their own thing.
>
> So trying to catch those threads and freezing them really feels like
> entirely the wrong interface.

For me stopping the other threads has been a conceptually simple
direction that needs exploration even if it doesn't work out.

On that note I have something that is just a little longer than the
patch I posted that doesn't use the freezer, and leans on the fact that
TASK_INTERRUPTBLE and TASK_KILLABLE can occassionaly expect a spurious
wake up.  Which means (with the right locking) those two states
can be transformed into TASK_WAKEKILL to keep sleeping processes
sleeping.

After that I only need one small test in get_signal to catch the
unlikely case of processes running in userspace.

I have not figured out TASK_STOPPED or TASK_TRACED yet as they do
not handle spurious wake ups.

So I think I can stop and keep stopped the other threads in the group
without too much code or complexity for other parts of the kernel
(assuming spurious wakes ups are allowed).




Even with the other threads stopped the code does not simplify as
much as I had hoped.  The code still has to deal with the existence
of the other threads.  So while races don't happen and thus the code
is easier to understand and make correct the actual work of walking
the threads making a count etc still remains.



The real sticky widget right now is the unshare_files call.  When Al
moved the call in fd8328be874f ("[PATCH] sanitize handling of shared
descriptor tables in failing execve()") it introduced a regression that
causes posix file locks to be improperly removed during exec.
Unfortunately no one noticed for about a decade.

What caused the regression is that unshare_files is a noop if the usage
count is 1.  Which means that after de_thread unshare_files only does
something if our file table is shared by another process.  However when
unshare_files is called before de_thread in a multi-threaded process
unshare_files will always perform work.

For the locks that set fl_owner to current->files unsharing
current->files when unnecessary already causes problems, as we now
have an unnecessary change of owner during exec.

After the unnecessary change in owner the existing locks are
eventually removed at the end of bprm_execve:
    bprm_execve()
       if (displaced)
           put_files_struct(displaced)
              filp_close()
                 locks_remove_posix()
                    /* Which removes the posix locks */



After 12 years moving unshare_files back where it used to be is
also problematic, as with the addition of execveat we grew
a clear dependency other threads not messing with our open files:

	/*
	 * Record that a name derived from an O_CLOEXEC fd will be
	 * inaccessible after exec. Relies on having exclusive access to
	 * current->files (due to unshare_files above).
	 */
	if (bprm->fdpath &&
	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;


I have made a cursory look and I don't expect any other issues as the
only code in exec that messes with file descriptors is in fs/exec.c
Everything else simply uses "struct file *".


Testing to see if the file descriptor is close_on_exec is just there to
prevent trying to run a script that through a close_on_exec file
descriptor, that is part of the path to the script.  So it is a quality
of implementation issue so the code does not fail later if userspace
tries something silly.

So I think we can safely just update the comment and say if userspace
messes with the file descriptor they pass to exec during exec userspace
can keep both pieces, as it is a self inflicted problem.

All of the other issues I see with reverting the location where the file
table is unshared also look like userspace shooting themselves in the
foot and not a problem with correctness of kernel code.

Which is a long way of saying that I have just convinced myself to
effectively revert fd8328be874f ("[PATCH] sanitize handling of shared
descriptor tables in failing execve()") 

A memory allocation failure after the point of no return is the only
reason to avoid doing this, and since the unshare is expected to
be a noop most of the time that doesn't look like a downside either.


Assuming I don't find anything else I think I will kick down the road
a bit the problem of stopping other threads during exec.  I can handle
unshare_files and seccomp without it.  There still might be something
I can not solve another way, but until I find it I will put this on the
back burner.

Eric


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-30 13:16         ` Eric W. Biederman
@ 2020-07-30 22:56           ` Eric W. Biederman
  2020-07-30 23:17             ` Linus Torvalds
  2020-07-31  6:28             ` Oleg Nesterov
  0 siblings, 2 replies; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-30 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM


Right now I think I see solutions to the problems of exec without
using this code.

However this code is tested and working for the common cases (and has no
lockdep warnings) and the techniques it is using could potentially be
used to simplify the freezer, the cgroup v1 freezer, or the cgroup v2
freezer.

The key is the function make_task_wakekill which could probably
benefit from a little more review and refinement but appears to
be basically correct.

---
[This change requires more work to handle TASK_STOPPED and TASK_TRACED]
[This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]

Many of the challenges of implementing a simple version of exec come
from the fact the code handles exec'ing multi-thread processes.

To the best of my knowledge processes with more than one thread
calling exec are not common, and as all of the threads will be killed
by exec there does not appear to be any useful work a thread can
reliably do during exec.

Therefore make it simpler to get exec correct by concealing the other
threads from wakeups at the beginning of exec.  This removes an entire
class of races, and makes it tractable to fix some of the long
standing issues with exec.

One issue that this change makes it easier to solve is the issue of
deailing with the file table.  Today exec unshares the file table at
the beginning to ensure there are no weird races with file
descriptors.  Unfortunately this unsharing can unshare the file table
when only threads of the current process share it.  Which results in
unnecessary unsharing and posix locks being inappropriately dropped by
a multi-threaded exec.  With all of the threads frozen the thread
count is stable and it is easy to tell if the if the file table really
needs to be unshared by exec.

Further this changes allows seccomp to stop taking cred_guard_mutex,
as the seccomp code takes cred_guard_mutex to protect against another
thread that is in the middle of calling exec and this change
guarantees that if one threads is calling exec all of the other threads
have stopped running.  So this problematic kind of concurrency between
threads can no longer happen.

The code in de_thread was modified to unmask the threads at the same
time as it is killing them ensuring that code continues to work as it
does today, and without introducing any races where a thread might
perform any problematic work in the middle of de_thread.

The code in fork is modified to fail if another thread in the parent
is in the middle of fork.

A new generic scheduler function make_task_killable is added.
I think the locking is ok, changing task->state under
both pi_lock and the rq_lock, but I have not done a detailed
looked yet to confirm that I am not missing something subtle.

A new function exec_conceal_threads is added, to set
group_execing_task and walk through all of the threads and change
their state to TASK_WAKEKILL if it is not already.

A new companion function exec_reveal_threads sends a wake up to all of
the other threads and clear group_execing_task.  This may cause a
spuroius wake up but that is an uncommon case and the code for
TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE is expected to be handle
spurious so it should be fine.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 98 +++++++++++++++++++++++++++++++++++-
 include/linux/sched.h        |  2 +
 include/linux/sched/signal.h |  3 ++
 kernel/fork.c                |  6 +++
 kernel/sched/core.c          | 38 ++++++++++++++
 kernel/signal.c              | 11 ++++
 6 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 3698252719a3..5e4b0187ac05 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1145,6 +1145,95 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static void exec_reveal_threads_locked(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+
+	if (signal->group_execing_task) {
+		signal->group_execing_task = NULL;
+		__for_each_thread(signal, t) {
+			if (t == p)
+				continue;
+			/*
+			 * This might be a spurious wake up task t but
+			 * this should be fine as t should verify it
+			 * is the appropriate time to wake up and if
+			 * not fall back asleep.
+			 *
+			 * Any performance (rather than correctness)
+			 * implications of this code should be unimportant
+			 * as it is only called on error.
+			 */
+			wake_up_state(t, TASK_WAKEKILL);
+		}
+	}
+}
+
+static void exec_reveal_threads(void)
+{
+	spinlock_t *lock = &current->sighand->siglock;
+
+	spin_lock_irq(lock);
+	exec_reveal_threads_locked();
+	spin_unlock_irq(lock);
+}
+
+/*
+ * Conceal all other threads in the thread group from wakeups
+ */
+static int exec_conceal_threads(void)
+{
+	struct task_struct *me = current, *t;
+	struct signal_struct *signal = me->signal;
+	spinlock_t *lock = &me->sighand->siglock;
+	int ret = 0;
+
+	if (thread_group_empty(me))
+		return 0;
+
+	spin_lock_irq(lock);
+	if (signal_pending(me) || signal_group_exit(signal) ||
+	    signal->group_execing_task) {
+		spin_unlock(lock);
+		return -ERESTARTNOINTR;
+	}
+
+	signal->group_execing_task = me;
+	for (;;) {
+		unsigned int todo = 0;
+
+		__for_each_thread(signal, t) {
+			if ((t == me) || (t->flags & PF_EXITING))
+				continue;
+
+			if (make_task_wakekill(t))
+				continue;
+
+			signal_wake_up(t, 0);
+			todo++;
+		}
+
+		if ((todo == 0) || __fatal_signal_pending(me))
+			break;
+
+		set_current_state(TASK_KILLABLE);
+		spin_unlock_irq(lock);
+
+		schedule();
+
+		spin_lock_irq(lock);
+		if (__fatal_signal_pending(me))
+			break;
+	}
+	if (__fatal_signal_pending(me)) {
+		ret = -ERESTARTNOINTR;
+		exec_reveal_threads_locked();
+	}
+	spin_unlock_irq(lock);
+	return ret;
+}
+
 static int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
@@ -1885,10 +1974,15 @@ static int bprm_execve(struct linux_binprm *bprm,
 	struct files_struct *displaced;
 	int retval;
 
-	retval = unshare_files(&displaced);
+	/* Conceal any other threads from wakeups */
+	retval = exec_conceal_threads();
 	if (retval)
 		return retval;
 
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out_ret;
+
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
 		goto out_files;
@@ -1949,6 +2043,8 @@ static int bprm_execve(struct linux_binprm *bprm,
 out_files:
 	if (displaced)
 		reset_files_struct(displaced);
+out_ret:
+	exec_reveal_threads();
 
 	return retval;
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edb2020875ad..dcd79e78b651 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1726,6 +1726,8 @@ extern void kick_process(struct task_struct *tsk);
 static inline void kick_process(struct task_struct *tsk) { }
 #endif
 
+extern int make_task_wakekill(struct task_struct *tsk);
+
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 
 static inline void set_task_comm(struct task_struct *tsk, const char *from)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..647b7d0d2231 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -106,6 +106,9 @@ struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* Task that is performing exec */
+	struct task_struct	*group_execing_task;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
diff --git a/kernel/fork.c b/kernel/fork.c
index bf215af7a904..686c6901eabd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2247,6 +2247,12 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	/* Don't allow creation of new tasks during exec */
+	if (unlikely(current->signal->group_execing_task)) {
+		retval = -ERESTARTNOINTR;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/* past the last point of failure */
 	if (pidfile)
 		fd_install(pidfd, pidfile);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..1ac8b81f22de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2648,6 +2648,44 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	return success;
 }
 
+int make_task_wakekill(struct task_struct *p)
+{
+	unsigned long flags;
+	int cpu, success = 0;
+	struct rq_flags rf;
+	struct rq *rq;
+	long state;
+
+	/* Assumes p != current */
+	preempt_disable();
+	/*
+	 * If we are going to change a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
+	state = p->state;
+
+	/* FIXME handle TASK_STOPPED and TASK_TRACED */
+	if ((state == TASK_KILLABLE) ||
+	    (state == TASK_INTERRUPTIBLE)) {
+		success = 1;
+		cpu = task_cpu(p);
+		rq = cpu_rq(cpu);
+		rq_lock(rq, &rf);
+		p->state = TASK_WAKEKILL;
+		rq_unlock(rq, &rf);
+	}
+	else if (state == TASK_WAKEKILL)
+		success = 1;
+
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	preempt_enable();
+	return success;
+}
+
 /**
  * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
  * @p: Process for which the function is to be invoked.
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ca48cc5da76..19f73eda1d54 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2590,6 +2590,15 @@ bool get_signal(struct ksignal *ksig)
 		goto fatal;
 	}
 
+	/* Is this task concealing itself from wake-ups during exec? */
+	if (unlikely(signal->group_execing_task)) {
+		set_current_state(TASK_WAKEKILL);
+		wake_up_process(signal->group_execing_task);
+		spin_unlock_irq(&sighand->siglock);
+		schedule();
+		goto relock;
+	}
+
 	for (;;) {
 		struct k_sigaction *ka;
 
@@ -2849,6 +2858,8 @@ void exit_signals(struct task_struct *tsk)
 	    task_participate_group_stop(tsk))
 		group_stop = CLD_STOPPED;
 out:
+	if (unlikely(tsk->signal->group_execing_task))
+		wake_up_process(tsk->signal->group_execing_task);
 	spin_unlock_irq(&tsk->sighand->siglock);
 
 	/*
-- 
2.20.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-30 22:56           ` [RFC][PATCH] exec: Conceal the other threads from wakeups during exec Eric W. Biederman
@ 2020-07-30 23:17             ` Linus Torvalds
  2020-07-31 17:16               ` Eric W. Biederman
  2020-07-31  6:28             ` Oleg Nesterov
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-07-30 23:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

On Thu, Jul 30, 2020 at 4:00 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The key is the function make_task_wakekill which could probably
> benefit from a little more review and refinement but appears to
> be basically correct.

You really need to explain a lot more why you think this is all a good idea.

For example, what if one of those other threads is waiting in line for
a critical lock, and the wait-queue you basically disabled was the
exclusive wait after lock handoff?

That means that the lock will now effectively be held by that thread.
No, it wasn't woken up, but it had the lock handed to it, and it's now
entirely unresponsive until it is killed.

How is that different from the deadlocks you're actually trying to fix?

These are the kinds of problems that the freezer() code had too, with
freezing things that held locks etc.

This approach does seem better than the freezer thing, and if I read
it right it will gather things in the signal handler code, but it's
not obvious that gathering them in random places where they sleep for
random reasons is safe or a good idea.

I can imagine _so_ many dead systems if you just basically froze
something that holds the mmap lock and is sleeping on a page fault,
for example.

Maybe I'm missing something, but I really think your "let's freeze
things" is seriously misguided. You're concentrating on some small
problem and trying to solve that, and not seeign the HUGE HONKING
problems that your approach is fundamentally introducing.

              Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-30 22:56           ` [RFC][PATCH] exec: Conceal the other threads from wakeups during exec Eric W. Biederman
  2020-07-30 23:17             ` Linus Torvalds
@ 2020-07-31  6:28             ` Oleg Nesterov
  2020-07-31 16:50               ` Eric W. Biederman
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2020-07-31  6:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, Kees Cook,
	Pavel Machek, Rafael J. Wysocki, linux-fsdevel, Linux PM

Eric, I won't comment the intent, but I too do not understand this idea.

On 07/30, Eric W. Biederman wrote:
>
> [This change requires more work to handle TASK_STOPPED and TASK_TRACED]

Yes. And it is not clear to me how can you solve this.

> [This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]

Not really, ttwu() can be safely called with siglock held and it takes
pi_lock + rq_lock. Say, signal_wake_up().

> +int make_task_wakekill(struct task_struct *p)
> +{
> +	unsigned long flags;
> +	int cpu, success = 0;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	long state;
> +
> +	/* Assumes p != current */
> +	preempt_disable();
> +	/*
> +	 * If we are going to change a thread waiting for CONDITION we
> +	 * need to ensure that CONDITION=1 done by the caller can not be
> +	 * reordered with p->state check below. This pairs with mb() in
> +	 * set_current_state() the waiting thread does.
> +	 */
> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	smp_mb__after_spinlock();
> +	state = p->state;
> +
> +	/* FIXME handle TASK_STOPPED and TASK_TRACED */
> +	if ((state == TASK_KILLABLE) ||
> +	    (state == TASK_INTERRUPTIBLE)) {
> +		success = 1;
> +		cpu = task_cpu(p);
> +		rq = cpu_rq(cpu);
> +		rq_lock(rq, &rf);
> +		p->state = TASK_WAKEKILL;

You can only do this if the task was already deactivated. Just suppose it
is preempted or does something like

	set_current_sate(TASK_INTERRUPTIBLE);

	if (CONDITION) {
		// make_task_wakekill() sets state = TASK_WAKEKILL
		__set_current_state(TASK_RUNNING);
		return;
	}

	schedule();

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-31  6:28             ` Oleg Nesterov
@ 2020-07-31 16:50               ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-31 16:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Kees Cook,
	Pavel Machek, Rafael J. Wysocki, linux-fsdevel, Linux PM

Oleg Nesterov <oleg@redhat.com> writes:

> Eric, I won't comment the intent, but I too do not understand this idea.
>
> On 07/30, Eric W. Biederman wrote:
>>
>> [This change requires more work to handle TASK_STOPPED and TASK_TRACED]
>
> Yes. And it is not clear to me how can you solve this.

I was imagining something putting TASK_STOPPED and TASK_TRACED in a loop
that verified they should be in that state before exiting so they could
handle spurious wake ups.

There are a many subtlties in that code, especially in the conversion
fo TASK_STOPPED to TASK_TRACED.  So I suspect something more would be
required but I have not looked yet to see how tricky that would be.

>> [This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]
>
> Not really, ttwu() can be safely called with siglock held and it takes
> pi_lock + rq_lock. Say, signal_wake_up().

Good point.

>> +int make_task_wakekill(struct task_struct *p)
>> +{
>> +	unsigned long flags;
>> +	int cpu, success = 0;
>> +	struct rq_flags rf;
>> +	struct rq *rq;
>> +	long state;
>> +
>> +	/* Assumes p != current */
>> +	preempt_disable();
>> +	/*
>> +	 * If we are going to change a thread waiting for CONDITION we
>> +	 * need to ensure that CONDITION=1 done by the caller can not be
>> +	 * reordered with p->state check below. This pairs with mb() in
>> +	 * set_current_state() the waiting thread does.
>> +	 */
>> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> +	smp_mb__after_spinlock();
>> +	state = p->state;
>> +
>> +	/* FIXME handle TASK_STOPPED and TASK_TRACED */
>> +	if ((state == TASK_KILLABLE) ||
>> +	    (state == TASK_INTERRUPTIBLE)) {
>> +		success = 1;
>> +		cpu = task_cpu(p);
>> +		rq = cpu_rq(cpu);
>> +		rq_lock(rq, &rf);
>> +		p->state = TASK_WAKEKILL;
>
> You can only do this if the task was already deactivated. Just suppose it
> is preempted or does something like
>
> 	set_current_sate(TASK_INTERRUPTIBLE);
>
> 	if (CONDITION) {
> 		// make_task_wakekill() sets state = TASK_WAKEKILL
> 		__set_current_state(TASK_RUNNING);
> 		return;
> 	}
>
> 	schedule();

You are quite right.

So that bit of code would need to be:
	if (!task->on_rq)
        	goto out;
	if ((state == TASK_KILLABLE) ||
            (state == TASK_INTERRUPTIBLE)) {
            ...

Eric


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-30 23:17             ` Linus Torvalds
@ 2020-07-31 17:16               ` Eric W. Biederman
  2020-07-31 17:41                 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-31 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jul 30, 2020 at 4:00 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The key is the function make_task_wakekill which could probably
>> benefit from a little more review and refinement but appears to
>> be basically correct.
>
> You really need to explain a lot more why you think this is all a good idea.
>
> For example, what if one of those other threads is waiting in line for
> a critical lock, and the wait-queue you basically disabled was the
> exclusive wait after lock handoff?
>
> That means that the lock will now effectively be held by that thread.
> No, it wasn't woken up, but it had the lock handed to it, and it's now
> entirely unresponsive until it is killed.
>
> How is that different from the deadlocks you're actually trying to fix?
>
> These are the kinds of problems that the freezer() code had too, with
> freezing things that held locks etc.
>
> This approach does seem better than the freezer thing, and if I read
> it right it will gather things in the signal handler code, but it's
> not obvious that gathering them in random places where they sleep for
> random reasons is safe or a good idea.
>
> I can imagine _so_ many dead systems if you just basically froze
> something that holds the mmap lock and is sleeping on a page fault,
> for example.
>
> Maybe I'm missing something, but I really think your "let's freeze
> things" is seriously misguided. You're concentrating on some small
> problem and trying to solve that, and not seeign the HUGE HONKING
> problems that your approach is fundamentally introducing.

Very good point.  That would be a priority inversion on mmap_lock.
Without great care that could indeed result in lockups.

That definitely requires the points where things are already sleeping
that can be converted to be opt-in.  Which potentially makes things much
more work.

Thanks, that helps kill my bright idea as I expressed it.

Part of what I was trying to solve (because I ran into the problem while
I was reading the code) was that the freezer, the cgroup v2 freezer, and
other waits do not compose nicely.

Even limited to opt-in locations I think the trick of being able to
transform the wait-state may solve that composition problem.


That said I was really just posting this so if the ideas were good they
could inspire future code, and if the ideas were bad they could be sunk.
When it comes to sorting out future especially in exec I will know which
ideas don't fly, so it will be easier to make the case for ideas that
will work.

Eric


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-31 17:16               ` Eric W. Biederman
@ 2020-07-31 17:41                 ` Linus Torvalds
  2020-07-31 20:07                   ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-07-31 17:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

On Fri, Jul 31, 2020 at 10:19 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Even limited to opt-in locations I think the trick of being able to
> transform the wait-state may solve that composition problem.

So the part I found intriguing was the "catch things in the signal
handling path".

Catching things there - and *only* there - would avoid a lot of the
problems we had with the freezer. When you're about to return to user
mode, there are no lock inversions etc.

And it kind of makes conceptual sense to do, since what you're trying
to capture is the signal group - so using the signal state to do so
seems like a natural thing to do. No touching of any runqueues or
scheduler data structures, do everything _purely_ with the signal
handling pathways.

So that "feels" ok to me.

That said, I do wonder if there are nasty nasty latency issues with
odd users. Normally, you'd expect that execve() with other threads in
the group shouldn't be a performance issue, because people simply
shouldn't do that. So it might be ok.

And if you capture them all in the signal handling pathway, that ends
up being a very convenient place to zap them all too, so maybe my
latency worry is misguided.

IOW, I think that you could try to do your "freese other threads" not
at all like the freezer, but more like a "collect all threads in their
signal handler parts as the first phase of zapping them".

So maybe this approach is salvageable. I see where something like the
above could work well. But I say that with a lot of handwaving, and
maybe if I see the patch I'd go "Christ, I was a complete idiot for
ever even suggesting that".

                    Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] exec: Conceal the other threads from wakeups during exec
  2020-07-31 17:41                 ` Linus Torvalds
@ 2020-07-31 20:07                   ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2020-07-31 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kees Cook, Pavel Machek,
	Rafael J. Wysocki, linux-fsdevel, Oleg Nesterov, Linux PM

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jul 31, 2020 at 10:19 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Even limited to opt-in locations I think the trick of being able to
>> transform the wait-state may solve that composition problem.
>
> So the part I found intriguing was the "catch things in the signal
> handling path".
>
> Catching things there - and *only* there - would avoid a lot of the
> problems we had with the freezer. When you're about to return to user
> mode, there are no lock inversions etc.
>
> And it kind of makes conceptual sense to do, since what you're trying
> to capture is the signal group - so using the signal state to do so
> seems like a natural thing to do. No touching of any runqueues or
> scheduler data structures, do everything _purely_ with the signal
> handling pathways.
>
> So that "feels" ok to me.
>
> That said, I do wonder if there are nasty nasty latency issues with
> odd users. Normally, you'd expect that execve() with other threads in
> the group shouldn't be a performance issue, because people simply
> shouldn't do that. So it might be ok.
>
> And if you capture them all in the signal handling pathway, that ends
> up being a very convenient place to zap them all too, so maybe my
> latency worry is misguided.
>
> IOW, I think that you could try to do your "freese other threads" not
> at all like the freezer, but more like a "collect all threads in their
> signal handler parts as the first phase of zapping them".
>
> So maybe this approach is salvageable. I see where something like the
> above could work well. But I say that with a lot of handwaving, and
> maybe if I see the patch I'd go "Christ, I was a complete idiot for
> ever even suggesting that".

Yes.

The tricky bit is that there are a handful of stops that must
be handled, or it is impossible to stop everything without causing
disruption if the exec fails.  The big ones are TASK_STOPPED and
TASK_TRACED.  There is another in wait_for_vfork_done.

At which point I am looking at writting a wrapper around schedule that
changes task state to something like TASK_WAKEKILL when asked, and then
restores the state when released.  Something that is independent of
which freezer the code is using.

It could be the scheduler to with a special bit in state that says
opt-in.  But if we have to opt in it is probably much less error
prone to write the code as an wrapper around schedule, and only
modify the core scheduling code if necessary.

If I can make TASK_STOPPED and TASK_TRACED handle spurious wake-ups
I think I can build something that is independent of the rest of the
freezers so the code doesn't have to go 3 deep on wrappers of different
freezer at those locations.  It is already 2 layers deep.

But I really don't intend to work on that again for a while.


Right now I am in the final stages of ressurecting:
https://lore.kernel.org/linux-fsdevel/87a7ohs5ow.fsf@xmission.com/

The hard part looks like cleaning up and resurrecting Oleg's patch
to prevent the abuse of files_struct->count.
https://lore.kernel.org/linux-fsdevel/20180915160423.GA31461@redhat.com/

I am close but dotting all of the i's and crossing all of the t's is
taking ab bit.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, back to index

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 21:03 [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Eric W. Biederman
2020-07-28  0:20 ` Linus Torvalds
2020-07-28 12:39   ` Eric W. Biederman
2020-07-28 13:20     ` Eric W. Biederman
2020-07-28 18:17       ` Linus Torvalds
2020-07-30 13:16         ` Eric W. Biederman
2020-07-30 22:56           ` [RFC][PATCH] exec: Conceal the other threads from wakeups during exec Eric W. Biederman
2020-07-30 23:17             ` Linus Torvalds
2020-07-31 17:16               ` Eric W. Biederman
2020-07-31 17:41                 ` Linus Torvalds
2020-07-31 20:07                   ` Eric W. Biederman
2020-07-31  6:28             ` Oleg Nesterov
2020-07-31 16:50               ` Eric W. Biederman
2020-07-28  9:41 ` [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec Aleksa Sarai
2020-07-28 12:18   ` Eric W. Biederman

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git