[v2,5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
diff mbox series

Message ID 87zhcq4jdj.fsf_-_@x220.int.ebiederm.org
State In Next
Commit 0ce3510a3c0d2cd0e9f80cea83e3d508ea363a13
Headers show
Series
  • Infrastructure to allow fixing exec deadlocks
Related show

Commit Message

Eric W. Biederman March 8, 2020, 9:38 p.m. UTC
The cred_guard_mutex is problematic.  The cred_guard_mutex is held
over the userspace accesses as the arguments from userspace are read.
The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
threads are killed.  The cred_guard_mutex is held over
"put_user(0, tsk->clear_child_tid)" in exit_mm().

Any of those can result in deadlock, as the cred_guard_mutex is held
over a possible indefinite userspace waits for userspace.

Add exec_update_mutex that is only held over exec updating process
with the new contents of exec, so that code that needs not to be
confused by exec changing the mm and the cred in ways that can not
happen during ordinary execution of a process.

The plan is to switch the users of cred_guard_mutex to
exec_udpate_mutex one by one.  This lets us move forward while still
being careful and not introducing any regressions.

Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 9 +++++++++
 include/linux/sched/signal.h | 9 ++++++++-
 init/init_task.c             | 1 +
 kernel/fork.c                | 1 +
 4 files changed, 19 insertions(+), 1 deletion(-)

Comments

Bernd Edlinger March 9, 2020, 1:45 p.m. UTC | #1
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
> 
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other

... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
or something?

I wonder if we also should mention that
it is held while waiting for the trace parent to
receive the exit code with "wait"?

> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> 
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
> 
> Add exec_update_mutex that is only held over exec updating process

Add ?

> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
> 
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one.  This lets us move forward while still

s/udpate/update/


Bernd.
Eric W. Biederman March 9, 2020, 5:40 p.m. UTC | #2
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>> 
>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
                                ^ over
>
> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
> or something?

Yes.  Let me see if I can phrase that better.

> I wonder if we also should mention that
> it is held while waiting for the trace parent to
> receive the exit code with "wait"?

I don't think we have to spell out the details of how it all works,
unless that makes things clearer.  Kernel developers can be expected
to figure out how the kernel works.  The critical thing is that it is
an indefinite wait for userspace to take action.

But I will look.

>> threads are killed.  The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> 
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>> 
>> Add exec_update_mutex that is only held over exec updating process
>
> Add ?

Yes.  That is what the change does: add exec_update_mutex.

>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>> 
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one.  This lets us move forward while still
>
> s/udpate/update/

Yes.  Very much so.

Eric
Bernd Edlinger March 9, 2020, 6:01 p.m. UTC | #3
On 3/9/20 6:40 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>                                 ^ over
>>
>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>> or something?
> 
> Yes.  Let me see if I can phrase that better.
> 
>> I wonder if we also should mention that
>> it is held while waiting for the trace parent to
>> receive the exit code with "wait"?
> 
> I don't think we have to spell out the details of how it all works,
> unless that makes things clearer.  Kernel developers can be expected
> to figure out how the kernel works.  The critical thing is that it is
> an indefinite wait for userspace to take action.
> 
> But I will look.
> 
>>> threads are killed.  The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possible indefinite userspace waits for userspace.
>>>
>>> Add exec_update_mutex that is only held over exec updating process
>>
>> Add ?
> 
> Yes.  That is what the change does: add exec_update_mutex.
> 

I just kind of missed the "subject" in this sentence,
like "This patch adds an exec_update_mutex that is ..."
but english is a foreign language for me, so may be okay as is.


Bernd.

>>> with the new contents of exec, so that code that needs not to be
>>> confused by exec changing the mm and the cred in ways that can not
>>> happen during ordinary execution of a process.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>
>> s/udpate/update/
> 
> Yes.  Very much so.
> 
> Eric
>
Eric W. Biederman March 9, 2020, 6:10 p.m. UTC | #4
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 6:40 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>                                 ^ over
>>>
>>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>>> or something?
>> 
>> Yes.  Let me see if I can phrase that better.
>> 
>>> I wonder if we also should mention that
>>> it is held while waiting for the trace parent to
>>> receive the exit code with "wait"?
>> 
>> I don't think we have to spell out the details of how it all works,
>> unless that makes things clearer.  Kernel developers can be expected
>> to figure out how the kernel works.  The critical thing is that it is
>> an indefinite wait for userspace to take action.
>> 
>> But I will look.
>> 
>>>> threads are killed.  The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over a possible indefinite userspace waits for userspace.
>>>>
>>>> Add exec_update_mutex that is only held over exec updating process
>>>
>>> Add ?
>> 
>> Yes.  That is what the change does: add exec_update_mutex.
>> 
>
> I just kind of missed the "subject" in this sentence,
> like "This patch adds an exec_update_mutex that is ..."
> but english is a foreign language for me, so may be okay as is.

English has a lot of options.  I think this is a stylistic difference.

Instead of being an observer and describing what the change does:
"This patch adds exec_update_mutex ..."  

I was being there in the moment and saying/commading what is happening:
"Add exec_update_mutex ..."

Using the more immdediate form ends up with more concise and clearer
sentences.

Every one of my writing teachers in school emphasized that point
and I see the who it works when I write things.  But writing is hard and
I still tend toward long rambling sentences with many qualifiers that
confuse and detract from the point rather than make it clear what is
happening.

Eric
Eric W. Biederman March 9, 2020, 6:24 p.m. UTC | #5
ebiederm@xmission.com (Eric W. Biederman) writes:

> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
>> On 3/9/20 6:40 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>> 
>>>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>>>
>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>                                 ^ over
>>>>
>>>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>>>> or something?
>>> 
>>> Yes.  Let me see if I can phrase that better.
>>> 
>>>> I wonder if we also should mention that
>>>> it is held while waiting for the trace parent to
>>>> receive the exit code with "wait"?
>>> 
>>> I don't think we have to spell out the details of how it all works,
>>> unless that makes things clearer.  Kernel developers can be expected
>>> to figure out how the kernel works.  The critical thing is that it is
>>> an indefinite wait for userspace to take action.
>>> 
>>> But I will look.
>>> 
>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>
>>>> Add ?
>>> 
>>> Yes.  That is what the change does: add exec_update_mutex.
>>> 
>>
>> I just kind of missed the "subject" in this sentence,
>> like "This patch adds an exec_update_mutex that is ..."
>> but english is a foreign language for me, so may be okay as is.
>
> English has a lot of options.  I think this is a stylistic difference.
>
> Instead of being an observer and describing what the change does:
> "This patch adds exec_update_mutex ..."  
>
> I was being there in the moment and saying/commading what is happening:
> "Add exec_update_mutex ..."
>
> Using the more immdediate form ends up with more concise and clearer
> sentences.
>
> Every one of my writing teachers in school emphasized that point
> and I see the who it works when I write things.  But writing is hard and
> I still tend toward long rambling sentences with many qualifiers that
> confuse and detract from the point rather than make it clear what is
> happening.

And reading through it all now I can see your confusion.  That
description of my changes was not well done.  Reworking it now.

Eric
Eric W. Biederman March 9, 2020, 6:36 p.m. UTC | #6
My rewritten change description reads as follows:

    exec: Add a exec_update_mutex to replace cred_guard_mutex
    
    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace.  The possilbe indefinite waits for
    userspace that I have identified are: The cred_guard_mutex is held in
    PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
    held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
    cred_guard_mutex is held over "get_user(futex_offset, ...")  in
    exit_robust_list.  The cred_guard_mutex held over copy_strings.
    
    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.
    
    In any of those cases the userspace process that the kernel is waiting
    for might userspace might make a different system call that winds up
    taking the cred_guard_mutex and result in deadlock.
    
    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary.  Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.
    
    The plan is to switch the users of cred_guard_mutex to
    exec_udpate_mutex one by one.  This lets us move forward while still
    being careful and not introducing any regressions.
    
    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Does that sound better?

Eric
Bernd Edlinger March 9, 2020, 6:47 p.m. UTC | #7
On 3/9/20 7:36 PM, Eric W. Biederman wrote:
> 
> My rewritten change description reads as follows:
> 
>     exec: Add a exec_update_mutex to replace cred_guard_mutex

is this "an" exec_update_mutex?

>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for
>     userspace that I have identified are: The cred_guard_mutex is held in
>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>     exit_robust_list.  The cred_guard_mutex held over copy_strings.
>     
>     The functions get_user and put_user can trigger a page fault which can
>     potentially wait indefinitely in the case of userfaultfd or if
>     userspace implements part of the page fault path.
>     
>     In any of those cases the userspace process that the kernel is waiting
>     for might userspace might make a different system call that winds up
                ^-------------^
                      ^- remove this
>     taking the cred_guard_mutex and result in deadlock.
>     
>     Holding a mutex over any of those possibly indefinite waits for
>     userspace does not appear necessary.  Add exec_update_mutex that will
>     just cover updating the process during exec where the permissions and
>     the objects pointed to by the task struct may be out of sync.
>     
>     The plan is to switch the users of cred_guard_mutex to
>     exec_udpate_mutex one by one.  This lets us move forward while still

            ^-- typo: update

>     being careful and not introducing any regressions.
>     
>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Does that sound better?
> 

almost done.

> Eric
>
Eric W. Biederman March 9, 2020, 7:02 p.m. UTC | #8
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>> 
>> 
>> Does that sound better?
>> 
>
> almost done.

I think this text is finally clean.

    exec: Add exec_update_mutex to replace cred_guard_mutex
    
    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace.  The possilbe indefinite waits for
    userspace that I have identified are: The cred_guard_mutex is held in
    PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
    held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
    cred_guard_mutex is held over "get_user(futex_offset, ...")  in
    exit_robust_list.  The cred_guard_mutex held over copy_strings.
    
    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.
    
    In any of those cases the userspace process that the kernel is waiting
    for might make a different system call that winds up taking the
    cred_guard_mutex and result in deadlock.
    
    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary.  Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.
    
    The plan is to switch the users of cred_guard_mutex to
    exec_update_mutex one by one.  This lets us move forward while still
    being careful and not introducing any regressions.
    
    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


Bernd do you want to give me your Reviewed-by for this part of the
series?

After that do you think you can write the obvious patch for mm_access?

I will apply these changes to my tree and push them into linux-next.

Eric
Bernd Edlinger March 9, 2020, 7:24 p.m. UTC | #9
On 3/9/20 8:02 PM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>>
>>>
>>> Does that sound better?
>>>
>>
>> almost done.
> 
> I think this text is finally clean.
> 
>     exec: Add exec_update_mutex to replace cred_guard_mutex
>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for
>     userspace that I have identified are: The cred_guard_mutex is held in
>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>     exit_robust_list.  The cred_guard_mutex held over copy_strings.
>     
>     The functions get_user and put_user can trigger a page fault which can
>     potentially wait indefinitely in the case of userfaultfd or if
>     userspace implements part of the page fault path.
>     
>     In any of those cases the userspace process that the kernel is waiting
>     for might make a different system call that winds up taking the
>     cred_guard_mutex and result in deadlock.
>     
>     Holding a mutex over any of those possibly indefinite waits for
>     userspace does not appear necessary.  Add exec_update_mutex that will
>     just cover updating the process during exec where the permissions and
>     the objects pointed to by the task struct may be out of sync.
>     
>     The plan is to switch the users of cred_guard_mutex to
>     exec_update_mutex one by one.  This lets us move forward while still
>     being careful and not introducing any regressions.
>     
>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")

I checked the urls they all work.
Just one last question, are these git references?
I can't find them in my linux git tree (cloned from linus' git)?

Sorry for being pedantically.


>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> 
> Bernd do you want to give me your Reviewed-by for this part of the
> series?
> 

Sure also the other parts of course.

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

> After that do you think you can write the obvious patch for mm_access?
> 

Yes, I can do that.
I also have some typos in comments, will make them extra patches as well.

I wonder if the test case is okay to include the ptrace_attach altough
that is not yet passing?


Thanks
Bernd.
Dmitry V. Levin March 9, 2020, 7:33 p.m. UTC | #10
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
> > On 3/9/20 7:36 PM, Eric W. Biederman wrote:
> >> 
> >> 
> >> Does that sound better?
> >> 
> >
> > almost done.
> 
> I think this text is finally clean.
> 
>     exec: Add exec_update_mutex to replace cred_guard_mutex
>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for

-------------------------------------------^^^^^^^^ possible?
Eric W. Biederman March 9, 2020, 7:35 p.m. UTC | #11
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 8:02 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>>> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>>>
>>>>
>>>> Does that sound better?
>>>>
>>>
>>> almost done.
>> 
>> I think this text is finally clean.
>> 
>>     exec: Add exec_update_mutex to replace cred_guard_mutex
>>     
>>     The cred_guard_mutex is problematic as it is held over possibly
>>     indefinite waits for userspace.  The possilbe indefinite waits for
>>     userspace that I have identified are: The cred_guard_mutex is held in
>>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>>     exit_robust_list.  The cred_guard_mutex held over copy_strings.
>>     
>>     The functions get_user and put_user can trigger a page fault which can
>>     potentially wait indefinitely in the case of userfaultfd or if
>>     userspace implements part of the page fault path.
>>     
>>     In any of those cases the userspace process that the kernel is waiting
>>     for might make a different system call that winds up taking the
>>     cred_guard_mutex and result in deadlock.
>>     
>>     Holding a mutex over any of those possibly indefinite waits for
>>     userspace does not appear necessary.  Add exec_update_mutex that will
>>     just cover updating the process during exec where the permissions and
>>     the objects pointed to by the task struct may be out of sync.
>>     
>>     The plan is to switch the users of cred_guard_mutex to
>>     exec_update_mutex one by one.  This lets us move forward while still
>>     being careful and not introducing any regressions.
>>     
>>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>
> I checked the urls they all work.
> Just one last question, are these git references?
> I can't find them in my linux git tree (cloned from linus' git)?
>
> Sorry for being pedantically.

You have to track down tglx's historicaly git tree from when everything
was in bitkeeper.

But yes they are git references and yes they work.  Just that part
of the history is not in linux.git.

>>     Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> 
>> Bernd do you want to give me your Reviewed-by for this part of the
>> series?
>> 
>
> Sure also the other parts of course.
>
> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>
>> After that do you think you can write the obvious patch for mm_access?
>> 
>
> Yes, I can do that.
> I also have some typos in comments, will make them extra patches as well.
>
> I wonder if the test case is okay to include the ptrace_attach altough
> that is not yet passing?

It is an existing kernel but that it doesn't pass.

My sense is that if you include it as a separate patch if it is a
problem for someone we can identify it easily via bisect and we do
whatever is appropriate.

Eric
Eric W. Biederman March 9, 2020, 7:39 p.m. UTC | #12
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/9/20 8:02 PM, Eric W. Biederman wrote:
>> 
>>     
>>     Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>     Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>     Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>     Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>     Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>     Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>
> I checked the urls they all work.
> Just one last question, are these git references?
> I can't find them in my linux git tree (cloned from linus' git)?

I will add this tag to help people figure out what is going on.

History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

Eric
Eric W. Biederman March 9, 2020, 7:42 p.m. UTC | #13
"Dmitry V. Levin" <ldv@altlinux.org> writes:

> On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> 
>> > On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>> >> 
>> >> 
>> >> Does that sound better?
>> >> 
>> >
>> > almost done.
>> 
>> I think this text is finally clean.
>> 
>>     exec: Add exec_update_mutex to replace cred_guard_mutex
>>     
>>     The cred_guard_mutex is problematic as it is held over possibly
>>     indefinite waits for userspace.  The possilbe indefinite waits for
>
> -------------------------------------------^^^^^^^^ possible?


Yes.  Thank you.  Fixed.

Eric
Kees Cook March 10, 2020, 8:55 p.m. UTC | #14
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>     exec: Add exec_update_mutex to replace cred_guard_mutex
>     
>     The cred_guard_mutex is problematic as it is held over possibly
>     indefinite waits for userspace.  The possilbe indefinite waits for
>     userspace that I have identified are: The cred_guard_mutex is held in
>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>     exit_robust_list.  The cred_guard_mutex held over copy_strings.

I suspect you're not trying to make a comprehensive list here, but do
you want to mention seccomp too (since it's yet another weird case).

> [...]
>     Holding a mutex over any of those possibly indefinite waits for
>     userspace does not appear necessary.  Add exec_update_mutex that will
>     just cover updating the process during exec where the permissions and
>     the objects pointed to by the task struct may be out of sync.

Should the specific resources be pointed out here? creds, mm, ... ?

But otherwise, yup, looks sane:

Reviewed-by: Kees Cook <keescook@chromium.org>
Eric W. Biederman March 10, 2020, 9:02 p.m. UTC | #15
Kees Cook <keescook@chromium.org> writes:

> On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>>     exec: Add exec_update_mutex to replace cred_guard_mutex
>>     
>>     The cred_guard_mutex is problematic as it is held over possibly
>>     indefinite waits for userspace.  The possilbe indefinite waits for
>>     userspace that I have identified are: The cred_guard_mutex is held in
>>     PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
>>     held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
>>     cred_guard_mutex is held over "get_user(futex_offset, ...")  in
>>     exit_robust_list.  The cred_guard_mutex held over copy_strings.
>
> I suspect you're not trying to make a comprehensive list here, but do
> you want to mention seccomp too (since it's yet another weird case).

I was calling out all of the places I have found so far where
cred_guard_mutex is held over waiting for userspace to maybe do
something.  Those places are what cause our deadlocks.

>> [...]
>>     Holding a mutex over any of those possibly indefinite waits for
>>     userspace does not appear necessary.  Add exec_update_mutex that will
>>     just cover updating the process during exec where the permissions and
>>     the objects pointed to by the task struct may be out of sync.
>
> Should the specific resources be pointed out here? creds, mm, ... ?
>
> But otherwise, yup, looks sane:

Probably not.  The design is if exec changes it we will hold the
cred_guard_mutex over it, so things are semi-atomic.

> Reviewed-by: Kees Cook <keescook@chromium.org>

Eric
Jann Horn March 10, 2020, 9:21 p.m. UTC | #16
On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
>
> Add exec_update_mutex that is only held over exec updating process
> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one.  This lets us move forward while still
> being careful and not introducing any regressions.
[...]
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>                         return -EINTR;
>                 }
>         }
> +
> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> +       if (ret)
> +               return ret;

We're already holding the old mmap_sem, and now nest the
exec_update_mutex inside it; but then while still holding the
exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
at least lockdep will be unhappy, and I'm not sure whether it's an
actual problem or not.
Eric W. Biederman March 10, 2020, 9:30 p.m. UTC | #17
Jann Horn <jannh@google.com> writes:

> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> threads are killed.  The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>>
>> Add exec_update_mutex that is only held over exec updating process
>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one.  This lets us move forward while still
>> being careful and not introducing any regressions.
> [...]
>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>                         return -EINTR;
>>                 }
>>         }
>> +
>> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> +       if (ret)
>> +               return ret;
>
> We're already holding the old mmap_sem, and now nest the
> exec_update_mutex inside it; but then while still holding the
> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
> at least lockdep will be unhappy, and I'm not sure whether it's an
> actual problem or not.

Good point.  I should double check the lock ordering here with mmap_sem.
It doesn't look like mmput takes mmap_sem, but still there might be a
lock inversion of some kind here.  At least as far as lockdep is
concerned and we don't want anything like that.

Eric
Jann Horn March 10, 2020, 11:21 p.m. UTC | #18
On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Jann Horn <jannh@google.com> writes:
> > On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> >> over the userspace accesses as the arguments from userspace are read.
> >> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> >> threads are killed.  The cred_guard_mutex is held over
> >> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> >>
> >> Any of those can result in deadlock, as the cred_guard_mutex is held
> >> over a possible indefinite userspace waits for userspace.
> >>
> >> Add exec_update_mutex that is only held over exec updating process
> >> with the new contents of exec, so that code that needs not to be
> >> confused by exec changing the mm and the cred in ways that can not
> >> happen during ordinary execution of a process.
> >>
> >> The plan is to switch the users of cred_guard_mutex to
> >> exec_udpate_mutex one by one.  This lets us move forward while still
> >> being careful and not introducing any regressions.
> > [...]
> >> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
> >>                         return -EINTR;
> >>                 }
> >>         }
> >> +
> >> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> >> +       if (ret)
> >> +               return ret;
> >
> > We're already holding the old mmap_sem, and now nest the
> > exec_update_mutex inside it; but then while still holding the
> > exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
> > which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
> > at least lockdep will be unhappy, and I'm not sure whether it's an
> > actual problem or not.
>
> Good point.  I should double check the lock ordering here with mmap_sem.
> It doesn't look like mmput takes mmap_sem

You sure about that? mmput() -> __mmput() -> ksm_exit() ->
__ksm_exit() -> down_write(&mm->mmap_sem)

Or also: mmput() -> __mmput() -> khugepaged_exit() ->
__khugepaged_exit() -> down_write(&mm->mmap_sem)

Or is there a reason why those paths can't happen?
Eric W. Biederman March 11, 2020, 12:15 a.m. UTC | #19
Jann Horn <jannh@google.com> writes:

> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Jann Horn <jannh@google.com> writes:
>> > On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> >> over the userspace accesses as the arguments from userspace are read.
>> >> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> >> threads are killed.  The cred_guard_mutex is held over
>> >> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> >>
>> >> Any of those can result in deadlock, as the cred_guard_mutex is held
>> >> over a possible indefinite userspace waits for userspace.
>> >>
>> >> Add exec_update_mutex that is only held over exec updating process
>> >> with the new contents of exec, so that code that needs not to be
>> >> confused by exec changing the mm and the cred in ways that can not
>> >> happen during ordinary execution of a process.
>> >>
>> >> The plan is to switch the users of cred_guard_mutex to
>> >> exec_udpate_mutex one by one.  This lets us move forward while still
>> >> being careful and not introducing any regressions.
>> > [...]
>> >> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>> >>                         return -EINTR;
>> >>                 }
>> >>         }
>> >> +
>> >> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> >> +       if (ret)
>> >> +               return ret;
>> >
>> > We're already holding the old mmap_sem, and now nest the
>> > exec_update_mutex inside it; but then while still holding the
>> > exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>> > which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>> > at least lockdep will be unhappy, and I'm not sure whether it's an
>> > actual problem or not.
>>
>> Good point.  I should double check the lock ordering here with mmap_sem.
>> It doesn't look like mmput takes mmap_sem
>
> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
> __ksm_exit() -> down_write(&mm->mmap_sem)
>
> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>
> Or is there a reason why those paths can't happen?

Clearly I didn't look far enough. 

I will adjust this so that exec_update_mutex is taken before mmap_sem.
Anything else is just asking for trouble.

Eric
Bernd Edlinger March 11, 2020, 6:33 a.m. UTC | #20
On 3/11/20 1:15 AM, Eric W. Biederman wrote:
> Jann Horn <jannh@google.com> writes:
> 
>> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Jann Horn <jannh@google.com> writes:
>>>> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>> with the new contents of exec, so that code that needs not to be
>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>> happen during ordinary execution of a process.
>>>>>
>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>>>> being careful and not introducing any regressions.
>>>> [...]
>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>                         return -EINTR;
>>>>>                 }
>>>>>         }
>>>>> +
>>>>> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>
>>>> We're already holding the old mmap_sem, and now nest the
>>>> exec_update_mutex inside it; but then while still holding the
>>>> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>>>> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>>>> at least lockdep will be unhappy, and I'm not sure whether it's an
>>>> actual problem or not.
>>>
>>> Good point.  I should double check the lock ordering here with mmap_sem.
>>> It doesn't look like mmput takes mmap_sem
>>
>> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
>> __ksm_exit() -> down_write(&mm->mmap_sem)
>>
>> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
>> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>>
>> Or is there a reason why those paths can't happen?
> 
> Clearly I didn't look far enough. 
> 
> I will adjust this so that exec_update_mutex is taken before mmap_sem.
> Anything else is just asking for trouble.
> 

Note that vm_access does also mmput under the exec_update_mutex.
So I don't see a huge problem here.
But maybe I missed something.


Bernd.
Qian Cai March 11, 2020, 1:18 p.m. UTC | #21
On Sun, 2020-03-08 at 16:38 -0500, Eric W. Biederman wrote:
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> 
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
> 
> Add exec_update_mutex that is only held over exec updating process
> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
> 
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one.  This lets us move forward while still
> being careful and not introducing any regressions.
> 
> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

This patch will trigger a warning during boot,

[   19.707214][    T1] pci 0035:01:00.0: enabling device (0545 -> 0547)
[   19.707287][    T1] EEH: Capable adapter found: recovery enabled.
[   19.732541][    T1] cpuidle-powernv: Default stop: psscr =
0x0000000000000330,mask=0x00000000003003ff
[   19.732567][    T1] cpuidle-powernv: Deepest stop: psscr =
0x0000000000300375,mask=0x00000000003003ff
[   19.732598][    T1] cpuidle-powernv: First stop level that may lose SPRs =
0x4
[   19.732617][    T1] cpuidle-powernv: First stop level that may lose timebase
= 0x10
[   19.769784][    T1] HugeTLB registered 2.00 MiB page size, pre-allocated 0
pages
[   19.769810][    T1] HugeTLB registered 1.00 GiB page size, pre-allocated 0
pages
[   19.789344][  T718] 
[   19.789367][  T718] =====================================
[   19.789379][  T718] WARNING: bad unlock balance detected!
[   19.789393][  T718] 5.6.0-rc5-next-20200311+ #4 Not tainted
[   19.789414][  T718] -------------------------------------
[   19.789426][  T718] kworker/u257:0/718 is trying to release lock (&sig-
>exec_update_mutex) at:
[   19.789459][  T718] [<c0000000004c6770>] free_bprm+0xe0/0xf0
[   19.789481][  T718] but there are no more locks to release!
[   19.789502][  T718] 
[   19.789502][  T718] other info that might help us debug this:
[   19.789537][  T718] 1 lock held by kworker/u257:0/718:
[   19.789558][  T718]  #0: c000001fa8842808 (&sig->cred_guard_mutex){+.+.}, at:
__do_execve_file.isra.33+0x1b0/0xda0
[   19.789611][  T718] 
[   19.789611][  T718] stack backtrace:
[   19.789645][  T718] CPU: 8 PID: 718 Comm: kworker/u257:0 Not tainted 5.6.0-
rc5-next-20200311+ #4
[   19.789681][  T718] Call Trace:
[   19.789703][  T718] [c000000dad8cfa70] [c000000000979b40]
dump_stack+0xf4/0x164 (unreliable)
[   19.789742][  T718] [c000000dad8cfac0] [c0000000001c1d78]
print_unlock_imbalance_bug+0x118/0x140
[   19.789780][  T718] [c000000dad8cfb40] [c0000000001ceaa0]
lock_release+0x270/0x520
[   19.789817][  T718] [c000000dad8cfbf0] [c0000000009a2898]
__mutex_unlock_slowpath+0x68/0x400
[   19.789854][  T718] [c000000dad8cfcc0] [c0000000004c6770] free_bprm+0xe0/0xf0
[   19.789900][  T718] [c000000dad8cfcf0] [c0000000004c845c]
__do_execve_file.isra.33+0x44c/0xda0
__do_execve_file at fs/exec.c:1904
[   19.789938][  T718] [c000000dad8cfde0] [c0000000001391d8]
call_usermodehelper_exec_async+0x218/0x250
[   19.789977][  T718] [c000000dad8cfe20] [c00000000000b748]
ret_from_kernel_thread+0x5c/0x74

> ---
>  fs/exec.c                    | 9 +++++++++
>  include/linux/sched/signal.h | 9 ++++++++-
>  init/init_task.c             | 1 +
>  kernel/fork.c                | 1 +
>  4 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a7272a76..ffeebb1f167b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk;
>  	struct mm_struct *old_mm, *active_mm;
> +	int ret;
>  
>  	/* Notify parent that we're no longer interested in the old VM */
>  	tsk = current;
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>  			return -EINTR;
>  		}
>  	}
> +
> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> +	if (ret)
> +		return ret;
> +
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> @@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> +		if (!bprm->mm)
> +			mutex_unlock(&current->signal->exec_update_mutex);
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> @@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>  	 * credentials; any time after this it may be unlocked.
>  	 */
>  	security_bprm_committed_creds(bprm);
> +	mutex_unlock(&current->signal->exec_update_mutex);
>  	mutex_unlock(&current->signal->cred_guard_mutex);
>  }
>  EXPORT_SYMBOL(install_exec_creds);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 88050259c466..a29df79540ce 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -224,7 +224,14 @@ struct signal_struct {
>  
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
> -					 * (notably. ptrace) */
> +					 * (notably. ptrace)
> +					 * Deprecated do not use in new code.
> +					 * Use exec_update_mutex instead.
> +					 */
> +	struct mutex exec_update_mutex;	/* Held while task_struct is being
> +					 * updated during exec, and may have
> +					 * inconsistent permissions.
> +					 */
>  } __randomize_layout;
>  
>  /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5eab7b..bd403ed3e418 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
>  	.multiprocess	= HLIST_HEAD_INIT,
>  	.rlim		= INIT_RLIMITS,
>  	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> +	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
>  #ifdef CONFIG_POSIX_TIMERS
>  	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>  	.cputimer	= {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 60a1295f4384..12896a6ecee6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>  
>  	mutex_init(&sig->cred_guard_mutex);
> +	mutex_init(&sig->exec_update_mutex);
>  
>  	return 0;
>  }
Eric W. Biederman March 11, 2020, 4:29 p.m. UTC | #22
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> On 3/11/20 1:15 AM, Eric W. Biederman wrote:
>> Jann Horn <jannh@google.com> writes:
>> 
>>> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Jann Horn <jannh@google.com> writes:
>>>>> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>>
>>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>>> over a possible indefinite userspace waits for userspace.
>>>>>>
>>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>>> with the new contents of exec, so that code that needs not to be
>>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>>> happen during ordinary execution of a process.
>>>>>>
>>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>>>>> being careful and not introducing any regressions.
>>>>> [...]
>>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>                         return -EINTR;
>>>>>>                 }
>>>>>>         }
>>>>>> +
>>>>>> +       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>>> +       if (ret)
>>>>>> +               return ret;
>>>>>
>>>>> We're already holding the old mmap_sem, and now nest the
>>>>> exec_update_mutex inside it; but then while still holding the
>>>>> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>>>>> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>>>>> at least lockdep will be unhappy, and I'm not sure whether it's an
>>>>> actual problem or not.
>>>>
>>>> Good point.  I should double check the lock ordering here with mmap_sem.
>>>> It doesn't look like mmput takes mmap_sem
>>>
>>> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
>>> __ksm_exit() -> down_write(&mm->mmap_sem)
>>>
>>> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
>>> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>>>
>>> Or is there a reason why those paths can't happen?
>> 
>> Clearly I didn't look far enough. 
>> 
>> I will adjust this so that exec_update_mutex is taken before mmap_sem.
>> Anything else is just asking for trouble.
>> 
>
> Note that vm_access does also mmput under the exec_update_mutex.
> So I don't see a huge problem here.
> But maybe I missed something.

The issue is that to prevent deadlock locks must always be taken
in the same order.

Taking mmap_sem then exec_update_mutex at the start of the function,
then taking exec_update_mutex then mmap_sem in mmput, takes the
two locks in two different orders.   Which means that in the right
set or circumstances:

thread1:                                thread2:
   obtain mmap_sem                      optain exec_update_mutex
      wait for exec_update_mutex        wait for mmap_sem

Which guarantees that neither thread will make progress.

The fix is easy I just need to take exec_update_mutex a few lines
earlier.

Eric
Kirill Tkhai March 12, 2020, 10:27 a.m. UTC | #23
On 09.03.2020 00:38, Eric W. Biederman wrote:
> 
> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> threads are killed.  The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> 
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
> 
> Add exec_update_mutex that is only held over exec updating process
> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
> 
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one.  This lets us move forward while still
> being careful and not introducing any regressions.
> 
> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c                    | 9 +++++++++
>  include/linux/sched/signal.h | 9 ++++++++-
>  init/init_task.c             | 1 +
>  kernel/fork.c                | 1 +
>  4 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a7272a76..ffeebb1f167b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk;
>  	struct mm_struct *old_mm, *active_mm;
> +	int ret;
>  
>  	/* Notify parent that we're no longer interested in the old VM */
>  	tsk = current;
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>  			return -EINTR;
>  		}
>  	}
> +
> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> +	if (ret)
> +		return ret;

You missed old_mm->mmap_sem unlock. See here:

diff --git a/fs/exec.c b/fs/exec.c
index 47582cd97f86..d557bac3e862 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1063,8 +1063,11 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 
 	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
-	if (ret)
+	if (ret) {
+		if (old_mm)
+			up_read(&old_mm->mmap_sem);
 		return ret;
+	}
 
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
Eric W. Biederman March 12, 2020, 12:24 p.m. UTC | #24
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 09.03.2020 00:38, Eric W. Biederman wrote:
>> 
>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> threads are killed.  The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> 
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>> 
>> Add exec_update_mutex that is only held over exec updating process
>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>> 
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one.  This lets us move forward while still
>> being careful and not introducing any regressions.
>> 
>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/exec.c                    | 9 +++++++++
>>  include/linux/sched/signal.h | 9 ++++++++-
>>  init/init_task.c             | 1 +
>>  kernel/fork.c                | 1 +
>>  4 files changed, 19 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index d820a7272a76..ffeebb1f167b 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>  {
>>  	struct task_struct *tsk;
>>  	struct mm_struct *old_mm, *active_mm;
>> +	int ret;
>>  
>>  	/* Notify parent that we're no longer interested in the old VM */
>>  	tsk = current;
>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>  			return -EINTR;
>>  		}
>>  	}
>> +
>> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> +	if (ret)
>> +		return ret;
>
> You missed old_mm->mmap_sem unlock. See here:

Duh.  Thank you.

I actually need to switch the lock ordering here, and I haven't yet
because my son was sick yesterday.

Something like this.

diff --git a/fs/exec.c b/fs/exec.c
index 96f89401b4d1..03d50c27ec01 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1020,9 +1020,14 @@ static int exec_mmap(struct mm_struct *mm)
        tsk = current;
        old_mm = current->mm;
        exec_mm_release(tsk, old_mm);
+       if (old_mm)
+               sync_mm_rss(old_mm);
+
+       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+       if (ret)
+               return ret;
 
        if (old_mm) {
-               sync_mm_rss(old_mm);
                /*
                 * Make sure that if there is a core dump in progress
                 * for the old mm, we get out and die instead of going
@@ -1032,14 +1037,11 @@ static int exec_mmap(struct mm_struct *mm)
                down_read(&old_mm->mmap_sem);
                if (unlikely(old_mm->core_state)) {
                        up_read(&old_mm->mmap_sem);
+                       mutex_unlock(&tsk->signal->exec_update_mutex);
                        return -EINTR;
                }
        }
 
-       ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
-       if (ret)
-               return ret;
-
        task_lock(tsk);
        active_mm = tsk->active_mm;
        membarrier_exec_mmap(mm);



> diff --git a/fs/exec.c b/fs/exec.c
> index 47582cd97f86..d557bac3e862 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1063,8 +1063,11 @@ static int exec_mmap(struct mm_struct *mm)
>  	}
>  
>  	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> -	if (ret)
> +	if (ret) {
> +		if (old_mm)
> +			up_read(&old_mm->mmap_sem);
>  		return ret;
> +	}
>  
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;

Eric
Kirill Tkhai March 12, 2020, 1:45 p.m. UTC | #25
On 12.03.2020 15:24, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>> threads are killed.  The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possible indefinite userspace waits for userspace.
>>>
>>> Add exec_update_mutex that is only held over exec updating process
>>> with the new contents of exec, so that code that needs not to be
>>> confused by exec changing the mm and the cred in ways that can not
>>> happen during ordinary execution of a process.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>> being careful and not introducing any regressions.
>>>
>>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>  fs/exec.c                    | 9 +++++++++
>>>  include/linux/sched/signal.h | 9 ++++++++-
>>>  init/init_task.c             | 1 +
>>>  kernel/fork.c                | 1 +
>>>  4 files changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index d820a7272a76..ffeebb1f167b 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>  {
>>>  	struct task_struct *tsk;
>>>  	struct mm_struct *old_mm, *active_mm;
>>> +	int ret;
>>>  
>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>  	tsk = current;
>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>  			return -EINTR;
>>>  		}
>>>  	}
>>> +
>>> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>> +	if (ret)
>>> +		return ret;
>>
>> You missed old_mm->mmap_sem unlock. See here:
> 
> Duh.  Thank you.
> 
> I actually need to switch the lock ordering here, and I haven't yet
> because my son was sick yesterday.

There is some fundamental problem with your patch, since the below fires in 100% cases
on current linux-next:

[   22.838717] kernel BUG at fs/exec.c:1474!

diff --git a/fs/exec.c b/fs/exec.c
index 47582cd97f86..0f77f8c94905 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		if (!bprm->mm)
+		if (!bprm->mm) {
+			BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
 			mutex_unlock(&current->signal->exec_update_mutex);
+		}
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
 	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }

---------------------------------------------------------------------------------------------

First time the mutex is unlocked in:

exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()

Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:

        bprm->mm = NULL;        

Second time the mutex is unlocked in free_bprm():

	if (bprm->cred) {
                if (!bprm->mm)
                        mutex_unlock(&current->signal->exec_update_mutex);

My opinion is we should not relay on side indicators like bprm->mm. Better you may
introduce struct linux_binprm::exec_update_mutex_is_locked. So the next person dealing
with this after you won't waste much time on diving into this. Also, if someone decides
to change the place, where bprm->mm is set into NULL, this person will bump into hell
of dependences between unrelated components like your newly introduced mutex.

So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves
modularity.
Eric W. Biederman March 12, 2020, 2:38 p.m. UTC | #26
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 12.03.2020 15:24, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>> 
>>> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>> threads are killed.  The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over a possible indefinite userspace waits for userspace.
>>>>
>>>> Add exec_update_mutex that is only held over exec updating process
>>>> with the new contents of exec, so that code that needs not to be
>>>> confused by exec changing the mm and the cred in ways that can not
>>>> happen during ordinary execution of a process.
>>>>
>>>> The plan is to switch the users of cred_guard_mutex to
>>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>>> being careful and not introducing any regressions.
>>>>
>>>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> ---
>>>>  fs/exec.c                    | 9 +++++++++
>>>>  include/linux/sched/signal.h | 9 ++++++++-
>>>>  init/init_task.c             | 1 +
>>>>  kernel/fork.c                | 1 +
>>>>  4 files changed, 19 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index d820a7272a76..ffeebb1f167b 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>  {
>>>>  	struct task_struct *tsk;
>>>>  	struct mm_struct *old_mm, *active_mm;
>>>> +	int ret;
>>>>  
>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>  	tsk = current;
>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>  			return -EINTR;
>>>>  		}
>>>>  	}
>>>> +
>>>> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>> +	if (ret)
>>>> +		return ret;
>>>
>>> You missed old_mm->mmap_sem unlock. See here:
>> 
>> Duh.  Thank you.
>> 
>> I actually need to switch the lock ordering here, and I haven't yet
>> because my son was sick yesterday.
>
> There is some fundamental problem with your patch, since the below fires in 100% cases
> on current linux-next:

Thank you.

I have just backed this out of linux-next for now because it is clearly
flawed.

You make some good points about the recursion.  I will go back to the
drawing board and see what I can work out.


> [   22.838717] kernel BUG at fs/exec.c:1474!
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 47582cd97f86..0f77f8c94905 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> -		if (!bprm->mm)
> +		if (!bprm->mm) {
> +			BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
>  			mutex_unlock(&current->signal->exec_update_mutex);
> +		}
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> @@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>  	 * credentials; any time after this it may be unlocked.
>  	 */
>  	security_bprm_committed_creds(bprm);
> +	BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
>  	mutex_unlock(&current->signal->exec_update_mutex);
>  	mutex_unlock(&current->signal->cred_guard_mutex);
>  }
>
> ---------------------------------------------------------------------------------------------
>
> First time the mutex is unlocked in:
>
> exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()
>
> Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:
>
>         bprm->mm = NULL;        
>
> Second time the mutex is unlocked in free_bprm():
>
> 	if (bprm->cred) {
>                 if (!bprm->mm)
>                         mutex_unlock(&current->signal->exec_update_mutex);
>
> My opinion is we should not relay on side indicators like bprm->mm. Better you may
> introduce struct linux_binprm::exec_update_mutex_is_locked. So the next person dealing
> with this after you won't waste much time on diving into this. Also, if someone decides
> to change the place, where bprm->mm is set into NULL, this person will bump into hell
> of dependences between unrelated components like your newly introduced mutex.
>
> So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves
> modularity.

Am I wrong or is that also a problem with cred_guard_mutex?

Eric
Kirill Tkhai March 12, 2020, 3:23 p.m. UTC | #27
On 12.03.2020 17:38, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 12.03.2020 15:24, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>>> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>>>>
>>>>> The cred_guard_mutex is problematic.  The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>>> threads are killed.  The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>> with the new contents of exec, so that code that needs not to be
>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>> happen during ordinary execution of a process.
>>>>>
>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>> exec_udpate_mutex one by one.  This lets us move forward while still
>>>>> being careful and not introducing any regressions.
>>>>>
>>>>> Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>>> Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>>>>> Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
>>>>> Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>> ---
>>>>>  fs/exec.c                    | 9 +++++++++
>>>>>  include/linux/sched/signal.h | 9 ++++++++-
>>>>>  init/init_task.c             | 1 +
>>>>>  kernel/fork.c                | 1 +
>>>>>  4 files changed, 19 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>> index d820a7272a76..ffeebb1f167b 100644
>>>>> --- a/fs/exec.c
>>>>> +++ b/fs/exec.c
>>>>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>  {
>>>>>  	struct task_struct *tsk;
>>>>>  	struct mm_struct *old_mm, *active_mm;
>>>>> +	int ret;
>>>>>  
>>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>>  	tsk = current;
>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>  			return -EINTR;
>>>>>  		}
>>>>>  	}
>>>>> +
>>>>> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>> +	if (ret)
>>>>> +		return ret;
>>>>
>>>> You missed old_mm->mmap_sem unlock. See here:
>>>
>>> Duh.  Thank you.
>>>
>>> I actually need to switch the lock ordering here, and I haven't yet
>>> because my son was sick yesterday.
>>
>> There is some fundamental problem with your patch, since the below fires in 100% cases
>> on current linux-next:
> 
> Thank you.
> 
> I have just backed this out of linux-next for now because it is clearly
> flawed.
> 
> You make some good points about the recursion.  I will go back to the
> drawing board and see what I can work out.
> 
> 
>> [   22.838717] kernel BUG at fs/exec.c:1474!
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 47582cd97f86..0f77f8c94905 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
>>  {
>>  	free_arg_pages(bprm);
>>  	if (bprm->cred) {
>> -		if (!bprm->mm)
>> +		if (!bprm->mm) {
>> +			BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
>>  			mutex_unlock(&current->signal->exec_update_mutex);
>> +		}
>>  		mutex_unlock(&current->signal->cred_guard_mutex);
>>  		abort_creds(bprm->cred);
>>  	}
>> @@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>>  	 * credentials; any time after this it may be unlocked.
>>  	 */
>>  	security_bprm_committed_creds(bprm);
>> +	BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
>>  	mutex_unlock(&current->signal->exec_update_mutex);
>>  	mutex_unlock(&current->signal->cred_guard_mutex);
>>  }
>>
>> ---------------------------------------------------------------------------------------------
>>
>> First time the mutex is unlocked in:
>>
>> exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()
>>
>> Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:
>>
>>         bprm->mm = NULL;        
>>
>> Second time the mutex is unlocked in free_bprm():
>>
>> 	if (bprm->cred) {
>>                 if (!bprm->mm)
>>                         mutex_unlock(&current->signal->exec_update_mutex);
>>
>> My opinion is we should not relay on side indicators like bprm->mm. Better you may
>> introduce struct linux_binprm::exec_update_mutex_is_locked. So the next person dealing
>> with this after you won't waste much time on diving into this. Also, if someone decides
>> to change the place, where bprm->mm is set into NULL, this person will bump into hell
>> of dependences between unrelated components like your newly introduced mutex.
>>
>> So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves
>> modularity.
> 
> Am I wrong or is that also a problem with cred_guard_mutex?

No, there is no a problem.

cred_guard_mutex is locked in a pair with bprm->cred = prepare_exec_creds() assignment.

cred_guard_mutex is unlocked in a pair with bprm->cred = NULL clearing (see install_exec_creds()).
Further free_bprm() skip unlock in case of bprm->cred is NULL.
Bernd Edlinger March 13, 2020, 1:05 a.m. UTC | #28
On 3/12/20 3:38 PM, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 12.03.2020 15:24, Eric W. Biederman wrote:
>>>
>>> I actually need to switch the lock ordering here, and I haven't yet
>>> because my son was sick yesterday.

All the best wishes to you and your son.  I hope he will get well soon.

And sorry for not missing the issue in the review.  The reason turns
out that bprm_mm_init is called after prepare_bprm_creds, but there
are error pathes between those where free_bprm is called up with
cred != NULL and mm == NULL, but the mutex not locked.

I figured out a possible fix for the problem that was pointed out:


From ceb6f65b52b3a7f0280f4f20509a1564a439edf6 Mon Sep 17 00:00:00 2001
From: Bernd Edlinger <bernd.edlinger@hotmail.de>
Date: Wed, 11 Mar 2020 15:31:07 +0100
Subject: [PATCH] Fix issues with exec_update_mutex

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/exec.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ffeebb1..cde4937 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1021,8 +1021,14 @@ static int exec_mmap(struct mm_struct *mm)
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
 
-	if (old_mm) {
+	if (old_mm)
 		sync_mm_rss(old_mm);
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
+	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
 		 * for the old mm, we get out and die instead of going
@@ -1032,14 +1038,11 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
 
-	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
-	if (ret)
-		return ret;
-
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1444,8 +1447,6 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		if (!bprm->mm)
-			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1846,6 +1847,8 @@ static int __do_execve_file(int fd, struct filename *filename,
 	would_dump(bprm, bprm->file);
 
 	retval = exec_binprm(bprm);
+	if (bprm->cred && !bprm->mm)
+		mutex_unlock(&current->signal->exec_update_mutex);
 	if (retval < 0)
 		goto out;
Kirill Tkhai March 13, 2020, 9:13 a.m. UTC | #29
On 13.03.2020 04:05, Bernd Edlinger wrote:
> On 3/12/20 3:38 PM, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>
>>> On 12.03.2020 15:24, Eric W. Biederman wrote:
>>>>
>>>> I actually need to switch the lock ordering here, and I haven't yet
>>>> because my son was sick yesterday.
> 
> All the best wishes to you and your son.  I hope he will get well soon.
> 
> And sorry for not missing the issue in the review.  The reason turns
> out that bprm_mm_init is called after prepare_bprm_creds, but there
> are error pathes between those where free_bprm is called up with
> cred != NULL and mm == NULL, but the mutex not locked.
> 
> I figured out a possible fix for the problem that was pointed out:
> 
> 
> From ceb6f65b52b3a7f0280f4f20509a1564a439edf6 Mon Sep 17 00:00:00 2001
> From: Bernd Edlinger <bernd.edlinger@hotmail.de>
> Date: Wed, 11 Mar 2020 15:31:07 +0100
> Subject: [PATCH] Fix issues with exec_update_mutex
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  fs/exec.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ffeebb1..cde4937 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1021,8 +1021,14 @@ static int exec_mmap(struct mm_struct *mm)
>  	old_mm = current->mm;
>  	exec_mm_release(tsk, old_mm);
>  
> -	if (old_mm) {
> +	if (old_mm)
>  		sync_mm_rss(old_mm);
> +
> +	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> +	if (ret)
> +		return ret;
> +
> +	if (old_mm) {
>  		/*
>  		 * Make sure that if there is a core dump in progress
>  		 * for the old mm, we get out and die instead of going
> @@ -1032,14 +1038,11 @@ static int exec_mmap(struct mm_struct *mm)
>  		down_read(&old_mm->mmap_sem);
>  		if (unlikely(old_mm->core_state)) {
>  			up_read(&old_mm->mmap_sem);
> +			mutex_unlock(&tsk->signal->exec_update_mutex);
>  			return -EINTR;
>  		}
>  	}
>  
> -	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> -	if (ret)
> -		return ret;
> -
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> @@ -1444,8 +1447,6 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> -		if (!bprm->mm)
> -			mutex_unlock(&current->signal->exec_update_mutex);
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> @@ -1846,6 +1847,8 @@ static int __do_execve_file(int fd, struct filename *filename,
>  	would_dump(bprm, bprm->file);
>  
>  	retval = exec_binprm(bprm);
> +	if (bprm->cred && !bprm->mm)
> +		mutex_unlock(&current->signal->exec_update_mutex);

Despite this should fix the problem, this looks like a broken puzzle.

We can't use bprm->cred as an identifier whether the mutex was locked or not.
We can check for bprm->cred in regard to cred_guard_mutex, because of there is
strong rule: "cred_guard_mutex is becomes locked together with bprm->cred assignment
(see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
Take attention on modularity of all this: there is no dependencies between anything else.

In regard to newly introduced exec_update_mutex, your fix and source patch way look like
an obfuscation. The mutex becomes deadly glued to unrelated bprm->cred and bprm->mm,
and this introduces the problems in the future modifications and support of all involved
entities. If someone wants to move some functions in relation to each other, there will
be a pain, and this person will have to go again the same dependencies and bug way,
Eric stepped on in the original patch.
Bernd Edlinger March 14, 2020, 9:57 a.m. UTC | #30
On 3/13/20 10:13 AM, Kirill Tkhai wrote:
> 
> Despite this should fix the problem, this looks like a broken puzzle.
> 
> We can't use bprm->cred as an identifier whether the mutex was locked or not.
> We can check for bprm->cred in regard to cred_guard_mutex, because of there is
> strong rule: "cred_guard_mutex is becomes locked together with bprm->cred assignment
> (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
> Take attention on modularity of all this: there is no dependencies between anything else.
> 
> In regard to newly introduced exec_update_mutex, your fix and source patch way look like
> an obfuscation. The mutex becomes deadly glued to unrelated bprm->cred and bprm->mm,
> and this introduces the problems in the future modifications and support of all involved
> entities. If someone wants to move some functions in relation to each other, there will
> be a pain, and this person will have to go again the same dependencies and bug way,
> Eric stepped on in the original patch.
> 

Okay, yes, valid points you make, thanks.
I just wanted to understand what was exactly wrong with this patch,
since the failure mode looked a lot like it was failing because of
something clobbering the data unexpectedly.


So I have posted a few updated patch for the failed one here:

[PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
[PATCH] pidfd: Use new infrastructure to fix deadlocks in execve

which replaces these:
[PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
https://lore.kernel.org/lkml/87zhcq4jdj.fsf_-_@x220.int.ebiederm.org/

[PATCH] pidfd: Stop taking cred_guard_mutex 
https://lore.kernel.org/lkml/87wo7svy96.fsf_-_@x220.int.ebiederm.org/


and a new patch series to fix deadlock in ptrace_attach and update doc:
[PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
[PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
[PATCH 2/2] doc: Update documentation of ->exec_*_mutex


Other patches needed, still valid:

[PATCH v2 1/5] exec: Only compute current once in flush_old_exec
https://lore.kernel.org/lkml/87pndm5y3l.fsf_-_@x220.int.ebiederm.org/

[PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
https://lore.kernel.org/lkml/87k13u5y26.fsf_-_@x220.int.ebiederm.org/

[PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
https://lore.kernel.org/lkml/875zfe5xzb.fsf_-_@x220.int.ebiederm.org/

[PATCH 1/4] exec: Fix a deadlock in ptrace
https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 2/4] selftests/ptrace: add test cases for dead-locks
https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 4/4] kernel: doc: remove outdated comment cred.c
https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/

[PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/


I think most of the existing patches are already approved, but if
there are still change requests, please let me know.


Thanks
Bernd.
Bernd Edlinger March 14, 2020, 10:02 a.m. UTC | #31
On 3/14/20 10:57 AM, Bernd Edlinger wrote:
> On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>>
>> Despite this should fix the problem, this looks like a broken puzzle.
>>
>> We can't use bprm->cred as an identifier whether the mutex was locked or not.
>> We can check for bprm->cred in regard to cred_guard_mutex, because of there is
>> strong rule: "cred_guard_mutex is becomes locked together with bprm->cred assignment
>> (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
>> Take attention on modularity of all this: there is no dependencies between anything else.
>>
>> In regard to newly introduced exec_update_mutex, your fix and source patch way look like
>> an obfuscation. The mutex becomes deadly glued to unrelated bprm->cred and bprm->mm,
>> and this introduces the problems in the future modifications and support of all involved
>> entities. If someone wants to move some functions in relation to each other, there will
>> be a pain, and this person will have to go again the same dependencies and bug way,
>> Eric stepped on in the original patch.
>>
> 
> Okay, yes, valid points you make, thanks.
> I just wanted to understand what was exactly wrong with this patch,
> since the failure mode looked a lot like it was failing because of
> something clobbering the data unexpectedly.
> 
> 
> So I have posted a few updated patch for the failed one here:
> 
> [PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
> [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
> 
> which replaces these:
> [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
> https://lore.kernel.org/lkml/87zhcq4jdj.fsf_-_@x220.int.ebiederm.org/
> 
> [PATCH] pidfd: Stop taking cred_guard_mutex 
> https://lore.kernel.org/lkml/87wo7svy96.fsf_-_@x220.int.ebiederm.org/
> 
> 
> and a new patch series to fix deadlock in ptrace_attach and update doc:
> [PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
> [PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
> [PATCH 2/2] doc: Update documentation of ->exec_*_mutex
> 
> 
> Other patches needed, still valid:
> 
> [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
> https://lore.kernel.org/lkml/87pndm5y3l.fsf_-_@x220.int.ebiederm.org/
> 
> [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
> https://lore.kernel.org/lkml/87k13u5y26.fsf_-_@x220.int.ebiederm.org/
> 

Ah, sorry, forgot this one:
[PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
https://lore.kernel.org/lkml/87eeu25y14.fsf_-_@x220.int.ebiederm.org/

> [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
> https://lore.kernel.org/lkml/875zfe5xzb.fsf_-_@x220.int.ebiederm.org/
> 
> [PATCH 1/4] exec: Fix a deadlock in ptrace
> https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
> https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
> https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 4/4] kernel: doc: remove outdated comment cred.c
> https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
> 
> 
> I think most of the existing patches are already approved, but if
> there are still change requests, please let me know.
> 
> 
> Thanks
> Bernd.
> 

Hope it is correct now.
I haven't seen the new patches on the kernel archives yet,
so I cannot add URLs for them.

Bernd.
Kirill Tkhai March 17, 2020, 8:58 a.m. UTC | #32
On 14.03.2020 13:02, Bernd Edlinger wrote:
> On 3/14/20 10:57 AM, Bernd Edlinger wrote:
>> On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>>>
>>> Despite this should fix the problem, this looks like a broken puzzle.
>>>
>>> We can't use bprm->cred as an identifier whether the mutex was locked or not.
>>> We can check for bprm->cred in regard to cred_guard_mutex, because of there is
>>> strong rule: "cred_guard_mutex is becomes locked together with bprm->cred assignment
>>> (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
>>> Take attention on modularity of all this: there is no dependencies between anything else.
>>>
>>> In regard to newly introduced exec_update_mutex, your fix and source patch way look like
>>> an obfuscation. The mutex becomes deadly glued to unrelated bprm->cred and bprm->mm,
>>> and this introduces the problems in the future modifications and support of all involved
>>> entities. If someone wants to move some functions in relation to each other, there will
>>> be a pain, and this person will have to go again the same dependencies and bug way,
>>> Eric stepped on in the original patch.
>>>
>>
>> Okay, yes, valid points you make, thanks.
>> I just wanted to understand what was exactly wrong with this patch,
>> since the failure mode looked a lot like it was failing because of
>> something clobbering the data unexpectedly.
>>
>>
>> So I have posted a few updated patch for the failed one here:
>>
>> [PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
>> [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
>>
>> which replaces these:
>> [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
>> https://lore.kernel.org/lkml/87zhcq4jdj.fsf_-_@x220.int.ebiederm.org/
>>
>> [PATCH] pidfd: Stop taking cred_guard_mutex 
>> https://lore.kernel.org/lkml/87wo7svy96.fsf_-_@x220.int.ebiederm.org/
>>
>>
>> and a new patch series to fix deadlock in ptrace_attach and update doc:
>> [PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
>> [PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
>> [PATCH 2/2] doc: Update documentation of ->exec_*_mutex
>>
>>
>> Other patches needed, still valid:
>>
>> [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
>> https://lore.kernel.org/lkml/87pndm5y3l.fsf_-_@x220.int.ebiederm.org/
>>
>> [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
>> https://lore.kernel.org/lkml/87k13u5y26.fsf_-_@x220.int.ebiederm.org/
>>
> 
> Ah, sorry, forgot this one:
> [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
> https://lore.kernel.org/lkml/87eeu25y14.fsf_-_@x220.int.ebiederm.org/
> 
>> [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
>> https://lore.kernel.org/lkml/875zfe5xzb.fsf_-_@x220.int.ebiederm.org/

1-4/5 look OK for me. You may add my

Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>

>> [PATCH 1/4] exec: Fix a deadlock in ptrace
>> https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
>> https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
>> https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 4/4] kernel: doc: remove outdated comment cred.c
>> https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>>
>> I think most of the existing patches are already approved, but if
>> there are still change requests, please let me know.
>>
>>
>> Thanks
>> Bernd.
>>
> 
> Hope it is correct now.
> I haven't seen the new patches on the kernel archives yet,
> so I cannot add URLs for them.
> 
> Bernd.
>

Patch
diff mbox series

diff --git a/fs/exec.c b/fs/exec.c
index d820a7272a76..ffeebb1f167b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,6 +1014,7 @@  static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
@@ -1034,6 +1035,11 @@  static int exec_mmap(struct mm_struct *mm)
 			return -EINTR;
 		}
 	}
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1438,6 +1444,8 @@  static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->mm)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1495,7 @@  void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@  struct signal_struct {
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..bd403ed3e418 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@  static struct signal_struct init_signals = {
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@  static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);
 
 	return 0;
 }