[v2,4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
diff mbox series

Message ID 875zfe5xzb.fsf_-_@x220.int.ebiederm.org
State In Next
Commit 1a2e3486a9f2ff147140ec862c31ff3826765340
Headers show
Series
  • Infrastructure to allow fixing exec deadlocks
Related show

Commit Message

Eric W. Biederman March 8, 2020, 9:38 p.m. UTC
I have read through the code in exec_mmap and I do not see anything
that depends on sighand or the sighand lock, or on signals in anyway
so this should be safe.

This rearrangement of code has two siginficant benefits.  It makes
the determination of passing the point of no return by testing bprm->mm
accurate.  All failures prior to that point in flush_old_exec are
either truly recoverable or they are fatal.

Futher this consolidates all of the possible indefinite waits for
userspace together at the top of flush_old_exec.  The possible wait
for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
to be resolved in clear_child_tid, and the possible wait for a page
fault in exit_robust_list.

This consolidation allows the creation of a mutex to replace
cred_guard_mutex that is not held of possible indefinite userspace
waits.  Which will allow removing deadlock scenarios from the kernel.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

Comments

Bernd Edlinger March 9, 2020, 7:34 p.m. UTC | #1
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
> 
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in anyway
> so this should be safe.
> 
> This rearrangement of code has two siginficant benefits.  It makes
                                        ^ typo: significant

> the determination of passing the point of no return by testing bprm->mm
> accurate.  All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.
> 
> Futher this consolidates all of the possible indefinite waits for   ^ typo: Further

> userspace together at the top of flush_old_exec.  The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
> 
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held of possible indefinite userspace

can you also reword this "held of" thing here as well?


Thanks
Bernd.
Eric W. Biederman March 9, 2020, 7:45 p.m. UTC | #2
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>> 
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held of possible indefinite userspace
>
> can you also reword this "held of" thing here as well?

Done:

    exec: Move exec_mmap right after de_thread in flush_old_exec
    
    I have read through the code in exec_mmap and I do not see anything
    that depends on sighand or the sighand lock, or on signals in anyway
    so this should be safe.
    
    This rearrangement of code has two siginficant benefits.  It makes
    the determination of passing the point of no return by testing bprm->mm
    accurate.  All failures prior to that point in flush_old_exec are
    either truly recoverable or they are fatal.
    
    Futher this consolidates all of the possible indefinite waits for
    userspace together at the top of flush_old_exec.  The possible wait
    for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
    to be resolved in clear_child_tid, and the possible wait for a page
    fault in exit_robust_list.
    
    This consolidation allows the creation of a mutex to replace
    cred_guard_mutex that is not held over possible indefinite userspace
    waits.  Which will allow removing deadlock scenarios from the kernel.
    
    Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric
Bernd Edlinger March 9, 2020, 7:52 p.m. UTC | #3
On 3/9/20 8:45 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>
>>> This consolidation allows the creation of a mutex to replace
>>> cred_guard_mutex that is not held of possible indefinite userspace
>>
>> can you also reword this "held of" thing here as well?
> 
> Done:
> 
>     exec: Move exec_mmap right after de_thread in flush_old_exec
>     
>     I have read through the code in exec_mmap and I do not see anything
>     that depends on sighand or the sighand lock, or on signals in anyway
>     so this should be safe.
>     
>     This rearrangement of code has two siginficant benefits.  It makes

watch out: sig_i_nificant

>     the determination of passing the point of no return by testing bprm->mm
>     accurate.  All failures prior to that point in flush_old_exec are
>     either truly recoverable or they are fatal.
>     
>     Futher this consolidates all of the possible indefinite waits for

Add some r to "Futher", please?

>     userspace together at the top of flush_old_exec.  The possible wait
>     for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>     to be resolved in clear_child_tid, and the possible wait for a page
>     fault in exit_robust_list.
>     
>     This consolidation allows the creation of a mutex to replace
>     cred_guard_mutex that is not held over possible indefinite userspace
>     waits.  Which will allow removing deadlock scenarios from the kernel.
>     
>     Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Eric
>
Eric W. Biederman March 9, 2020, 7:58 p.m. UTC | #4
Ok.  I think this has it sorted:

    exec: Move exec_mmap right after de_thread in flush_old_exec
    
    I have read through the code in exec_mmap and I do not see anything
    that depends on sighand or the sighand lock, or on signals in anyway
    so this should be safe.
    
    This rearrangement of code has two significant benefits.  It makes
    the determination of passing the point of no return by testing bprm->mm
    accurate.  All failures prior to that point in flush_old_exec are
    either truly recoverable or they are fatal.
    
    Further this consolidates all of the possible indefinite waits for
    userspace together at the top of flush_old_exec.  The possible wait
    for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
    to be resolved in clear_child_tid, and the possible wait for a page
    fault in exit_robust_list.
    
    This consolidation allows the creation of a mutex to replace
    cred_guard_mutex that is not held over possible indefinite userspace
    waits.  Which will allow removing deadlock scenarios from the kernel.
    
    Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


I don't think I usually have this many typos.  Sigh.

Eric
Bernd Edlinger March 9, 2020, 8:03 p.m. UTC | #5
On 3/9/20 8:58 PM, Eric W. Biederman wrote:
> 
> Ok.  I think this has it sorted:
> 
>     exec: Move exec_mmap right after de_thread in flush_old_exec
>     
>     I have read through the code in exec_mmap and I do not see anything
>     that depends on sighand or the sighand lock, or on signals in anyway
>     so this should be safe.
>     
>     This rearrangement of code has two significant benefits.  It makes
>     the determination of passing the point of no return by testing bprm->mm
>     accurate.  All failures prior to that point in flush_old_exec are
>     either truly recoverable or they are fatal.
>     
>     Further this consolidates all of the possible indefinite waits for
>     userspace together at the top of flush_old_exec.  The possible wait
>     for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>     to be resolved in clear_child_tid, and the possible wait for a page
>     fault in exit_robust_list.
>     
>     This consolidation allows the creation of a mutex to replace
>     cred_guard_mutex that is not held over possible indefinite userspace
>     waits.  Which will allow removing deadlock scenarios from the kernel.
>     
>     Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> 
> I don't think I usually have this many typos.  Sigh.
> 

OK.

never mind,
Bernd.
Eric W. Biederman March 9, 2020, 8:35 p.m. UTC | #6
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 8:58 PM, Eric W. Biederman wrote:
>> 
>> Ok.  I think this has it sorted:
>> 
>>     exec: Move exec_mmap right after de_thread in flush_old_exec
>>     
>>     I have read through the code in exec_mmap and I do not see anything
>>     that depends on sighand or the sighand lock, or on signals in anyway
>>     so this should be safe.
>>     
>>     This rearrangement of code has two significant benefits.  It makes
>>     the determination of passing the point of no return by testing bprm->mm
>>     accurate.  All failures prior to that point in flush_old_exec are
>>     either truly recoverable or they are fatal.
>>     
>>     Further this consolidates all of the possible indefinite waits for
>>     userspace together at the top of flush_old_exec.  The possible wait
>>     for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>>     to be resolved in clear_child_tid, and the possible wait for a page
>>     fault in exit_robust_list.
>>     
>>     This consolidation allows the creation of a mutex to replace
>>     cred_guard_mutex that is not held over possible indefinite userspace
>>     waits.  Which will allow removing deadlock scenarios from the kernel.
>>     
>>     Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> 
>> I don't think I usually have this many typos.  Sigh.
>> 
>
> OK.
>
> never mind,

No no.  I really appreciate all of the scrutiny.  Frequently the issues
that will produce typos or poor patch descriptions are also the issues
that will produce sloppy patches as well.  I was just frustrated with
myself.

Eric
Kees Cook March 10, 2020, 8:44 p.m. UTC | #7
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
> 
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in anyway
> so this should be safe.
> 
> This rearrangement of code has two siginficant benefits.  It makes
> the determination of passing the point of no return by testing bprm->mm
> accurate.  All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.

Agreed. Though I see a use of "current", which maybe you want to
parameterize to a "me" argument in acct_arg_size(). (Though looking at
the callers, perhaps there is no benefit?)

> 
> Futher this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec.  The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
> 
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held of possible indefinite userspace
> waits.  Which will allow removing deadlock scenarios from the kernel.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 215d86f77b63..d820a7272a76 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> -#ifdef CONFIG_POSIX_TIMERS
> -	exit_itimers(me->signal);
> -	flush_itimer_signals();
> -#endif

I think this comment:

/*
 * This is called by do_exit or de_thread, only when there are no more
 * references to the shared signal_struct.
 */
void exit_itimers(struct signal_struct *sig)

Refers to there being other threads, yes? Not that the signal table is
private yet?

> -
> -	/*
> -	 * Make the signal table private.
> -	 */
> -	retval = unshare_sighand(me);
> -	if (retval)
> -		goto out;
> -
>  	/*
>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>  	 * not visibile until then. This also enables the update
> @@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 */
>  	bprm->mm = NULL;
>  
> +#ifdef CONFIG_POSIX_TIMERS
> +	exit_itimers(me->signal);
> +	flush_itimer_signals();
> +#endif

I've mostly convinced myself that there are no "side-effects" from having
these timers expire as the mm is going away. I think some kind of comment
of that intent should be explicitly stated here above the timer work.

Beyond that:

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> +
> +	/*
> +	 * Make the signal table private.
> +	 */
> +	retval = unshare_sighand(me);
> +	if (retval)
> +		goto out;
> +
>  	set_fs(USER_DS);
>  	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
> -- 
> 2.25.0
>
Kees Cook March 10, 2020, 8:47 p.m. UTC | #8
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
> Futher this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec.  The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.

I forgot to mention, just as a point of clarity, there are lots of
other page faults possible, but they're _before_ flush_old_exec()
(i.e. all the copy_strings() calls). Is it worth clarifying this to
"before or at the top of flush_old_exec()" or do you mean something
else? (And as always: perhaps expand flush_old_exec()'s comment to
describe the newly intended state.)
Eric W. Biederman March 10, 2020, 9:09 p.m. UTC | #9
Kees Cook <keescook@chromium.org> writes:

> On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>> Futher this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec.  The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>
> I forgot to mention, just as a point of clarity, there are lots of
> other page faults possible, but they're _before_ flush_old_exec()
> (i.e. all the copy_strings() calls). Is it worth clarifying this to
> "before or at the top of flush_old_exec()" or do you mean something
> else? (And as always: perhaps expand flush_old_exec()'s comment to
> describe the newly intended state.)

Yes.  Before or at the start of flush_old_exec where the mutex
is taken.  That is the point.  I will see if I can come up with
and appropriate comment.

Eric
Eric W. Biederman March 10, 2020, 9:20 p.m. UTC | #10
Kees Cook <keescook@chromium.org> writes:

> On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>> 
>> I have read through the code in exec_mmap and I do not see anything
>> that depends on sighand or the sighand lock, or on signals in anyway
>> so this should be safe.
>> 
>> This rearrangement of code has two siginficant benefits.  It makes
>> the determination of passing the point of no return by testing bprm->mm
>> accurate.  All failures prior to that point in flush_old_exec are
>> either truly recoverable or they are fatal.
>
> Agreed. Though I see a use of "current", which maybe you want to
> parameterize to a "me" argument in acct_arg_size(). (Though looking at
> the callers, perhaps there is no benefit?)

My testing suggests there is a small benefit on x86.

The code is just "#define current get_current()"
and get_current() revoles into a read of "%gs:current_task".

But looking at the code I find gcc can sometimes when the
reads are close in the source code can optimize the read
away.  But gcc does not manage to optimize the extra
read of "%gs:current_task" away.

So I think things are much much better than they used to be,
code generation wise.  But it still helps to cache current
in a local variable.

>> Futher this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec.  The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>> 
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held of possible indefinite userspace
>> waits.  Which will allow removing deadlock scenarios from the kernel.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c | 24 ++++++++++++------------
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 215d86f77b63..d820a7272a76 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	if (retval)
>>  		goto out;
>>  
>> -#ifdef CONFIG_POSIX_TIMERS
>> -	exit_itimers(me->signal);
>> -	flush_itimer_signals();
>> -#endif
>
> I think this comment:
>
> /*
>  * This is called by do_exit or de_thread, only when there are no more
>  * references to the shared signal_struct.
>  */
> void exit_itimers(struct signal_struct *sig)
>
> Refers to there being other threads, yes? Not that the signal table is
> private yet?

The signal table is in sighand_struct.

So yes that refers to the other threads being gone.


>> -
>> -	/*
>> -	 * Make the signal table private.
>> -	 */
>> -	retval = unshare_sighand(me);
>> -	if (retval)
>> -		goto out;
>> -
>>  	/*
>>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>>  	 * not visibile until then. This also enables the update
>> @@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
>>  	 */
>>  	bprm->mm = NULL;
>>  
>> +#ifdef CONFIG_POSIX_TIMERS
>> +	exit_itimers(me->signal);
>> +	flush_itimer_signals();
>> +#endif
>
> I've mostly convinced myself that there are no "side-effects" from having
> these timers expire as the mm is going away. I think some kind of comment
> of that intent should be explicitly stated here above the timer work.

The timers can at most generate signals.  And we are not handling
signals in the middle of exec.

So the only possible interaction would be to set a timeout and then try
exec, and have the timer kill the caller.

Maybe we get a killable signal from a scenario like that and maybe this
changes the time before the timer expires into the dangerous zone.
But that is all I can think of.

We have to return to the edge of userspace before any signals are
delivered.


> Beyond that:
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>
> -Kees
>
>> +
>> +	/*
>> +	 * Make the signal table private.
>> +	 */
>> +	retval = unshare_sighand(me);
>> +	if (retval)
>> +		goto out;
>> +
>>  	set_fs(USER_DS);
>>  	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>  					PF_NOFREEZE | PF_NO_SETAFFINITY);
>> -- 
>> 2.25.0
>> 

Eric

Patch
diff mbox series

diff --git a/fs/exec.c b/fs/exec.c
index 215d86f77b63..d820a7272a76 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,18 +1272,6 @@  int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(me->signal);
-	flush_itimer_signals();
-#endif
-
-	/*
-	 * Make the signal table private.
-	 */
-	retval = unshare_sighand(me);
-	if (retval)
-		goto out;
-
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1307,6 +1295,18 @@  int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	bprm->mm = NULL;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
+
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	set_fs(USER_DS);
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);