* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
       [not found] <87blobnq02.fsf@x220.int.ebiederm.org>
@ 2020-04-02 19:04 ` Linus Torvalds
  2020-04-02 19:31   ` Bernd Edlinger
  2020-04-03 15:09   ` Bernd Edlinger
  2020-04-10 13:03 ` [GIT PULL] proc fix " Eric W. Biederman
  1 sibling, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-02 19:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Bernd Edlinger, Alexey Gladkov

On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The work on exec starts solving a long standing issue with exec that
> it takes mutexes of blocking userspace applications, which makes exec
> extremely deadlock prone.  For the moment this adds a second mutex
> with a narrower scope that handles all of the easy cases.  Which
> makes the tricky cases easy to spot.  With a little luck the code to
> solve those deadlocks will be ready by next merge window.

So this worries me.

I've pulled it, but I'm not entirely happy about some of it.

For example, the "rationale" for some of the changes is

    This should be safe, as the credentials are only used for reading.

which is just nonsensical. "Only used for reading" is immaterial, and
there's no explanation for why that would matter at all. Most of the
credentials are only ever used for reading, and the worry about execve() is
that the credentials can change, and people race with them and use the
new 'suid' credentials and allow things that shouldn't be allowed. So
the rationale makes no sense at all.

Btw, if "this only takes it for reading" is such a big deal, why is
that mutex not an rw-semaphore?

The pidfd change at least has a rationale that makes sense:

    This should be safe, as the credentials do not change
    before exec_update_mutex is locked.  Therefore whatever
    file access is possible with holding the cred_guard_mutex
    here is also possible with the exec_update_mutex.

so now you at least have an explanation of why that particular lock
makes sense and works and is equivalent.

It's still not a *great* explanation for why it would be equivalent,
because cred_guard_mutex ends up not just guarding the write of the
credentials, but makes it atomic wrt *other* data. That's the same
problem as "I'm only reading".

Locking is not about *one* value being consistent - if that was the
case, then you could just do a "get refcount on the credentials, now I
have a stable set of creds I can read forever". No lock needed.

So locking is about *multiple* values being consistent, which is why
that "I'm only reading" is not an explanation for why you can change
the lock.
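
As a rough illustration of that point, here is a single-threaded toy model
in C. It is not kernel code: cred_model, task_model and the helper names are
made up for this sketch. It assumes only what is said above, namely that a
reference pins one stable creds object, while exec updates the creds and
other state (here a "dumpable" flag) in separate steps.

```c
/* Hypothetical, simplified stand-ins; not the kernel's struct cred or task_struct. */
struct cred_model {
	int refcount;
	unsigned int euid;
};

struct task_model {
	struct cred_model *cred;	/* replaced wholesale by exec */
	int dumpable;			/* separate state that exec also updates */
};

/* Taking a reference pins one creds object: a stable snapshot, no lock needed. */
static struct cred_model *get_cred_model(struct task_model *t)
{
	t->cred->refcount++;
	return t->cred;
}

static void put_cred_model(struct cred_model *c)
{
	c->refcount--;
}

/* Models exec switching to the new creds (first step). */
static void exec_install_creds(struct task_model *t, struct cred_model *new_cred)
{
	t->cred = new_cred;
}

/* Models exec updating the related state afterwards (second step). */
static void exec_update_dumpable(struct task_model *t, int dumpable)
{
	t->dumpable = dumpable;
}
```

A reader holding the snapshot keeps seeing the old euid, which is fine. But a
reader that looks at the task between the two exec steps sees the new creds
together with the old dumpable value. Only a lock held across both steps
rules out that combination, and the snapshot does nothing for it.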

It's also why that second one is questionable: it's a _better_ attempt
at explaining things, but the point is really that cred_guard_mutex
protects *other* things too.

A real explanation would have absolutely *nothing* to do with
"reading" or "same value of credentials". Both of those are entirely
immaterial, since - as mentioned - you could just get a snapshot of
the creds instead.

A real explanation would be about how there is no other state that
cred_guard_mutex protects that matters.

See what I'm saying?

This code is subtle as h*ll, and we've had bugs in it, and it has a
series of tens of patches to fix them. But that also means that the
explanations for the patches should take the subtleties into account,
and not gloss over them with things like this.

Ok, enough about the explanations. The actual _code_ is kind of odd
too. For example, you have that "bprm->called_exec_mmap" flag to say
"I've taken the exec_update_mutex, and need to drop it".

But that flag is not set anywhere _near_ actually taking the lock.
Sure, it is set after exec_mmap() returns successfully, and that
makes sense from a naming standpoint, but wouldn't it have been a
_lot_ more obvious if you just set the flag when you took that lock,
and instead of naming it by some magical code sequence, you named it
for what it does?

Again, this looks all technically correct, but it's written in a way
that doesn't seem to make a lot of sense. Why is the code literally
written with a magical assumption of "calling exec_mmap takes this
lock, so if the flag named called_exec_mmap is set, I have to free
that lock that is not named that at all".

I hate conditional locking in the first place, but if it has to exist,
then the conditional should be named after the lock, and the lock
getting should be very very explicitly tied to it.

Wouldn't it have been much clearer if you called that flag
"exec_update_mutex_taken", and set it WHEN YOU TAKE IT?

In fact, then you could drop the

                        mutex_unlock(&tsk->signal->exec_update_mutex);

in the error case of exec_mmap(), because now the error handling in
free_bprm() would do the cleanup automatically.
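
A minimal sketch of that suggestion, as a user-space toy model rather than
the actual fs/exec.c code. toy_mutex, bprm_model, exec_mmap_model() and
free_bprm_model() are hypothetical stand-ins; only the flag name
exec_update_mutex_taken comes from the text above.

```c
#include <stdbool.h>

/* Stand-in for a kernel mutex; only tracks whether it is held. */
struct toy_mutex {
	bool held;
};

static void toy_lock(struct toy_mutex *m)
{
	m->held = true;
}

static void toy_unlock(struct toy_mutex *m)
{
	m->held = false;
}

struct bprm_model {
	struct toy_mutex *exec_update_mutex;
	bool exec_update_mutex_taken;	/* named after the lock it tracks */
};

/* Takes the lock and sets the flag in the same place. */
static int exec_mmap_model(struct bprm_model *b, bool fail)
{
	toy_lock(b->exec_update_mutex);
	b->exec_update_mutex_taken = true;

	if (fail)
		return -1;	/* no unlock here: cleanup is left to free_bprm_model() */
	return 0;
}

/* Single cleanup point: drops the lock if and only if it was taken. */
static void free_bprm_model(struct bprm_model *b)
{
	if (b->exec_update_mutex_taken) {
		toy_unlock(b->exec_update_mutex);
		b->exec_update_mutex_taken = false;
	}
}
```

With the flag set where the lock is taken, the success path and the error
path are cleaned up by the same code, and the error path needs no unlock of
its own.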

See what I'm saying? You've made the locking more complex and subtle
than it needed to be. And since the whole point of the *new* lock is
that it should replace an old lock that was really complex and subtle,
that's a problem.

                   Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 19:04 ` [GIT PULL] Please pull proc and exec work for 5.7-rc1 Linus Torvalds
@ 2020-04-02 19:31   ` Bernd Edlinger
  2020-04-02 19:52     ` Linus Torvalds
  2020-04-03 15:09   ` Bernd Edlinger
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-02 19:31 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 9:04 PM, Linus Torvalds wrote:
> On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The work on exec starts solving a long standing issue with exec that
>> it takes mutexes of blocking userspace applications, which makes exec
>> extremely deadlock prone.  For the moment this adds a second mutex
>> with a narrower scope that handles all of the easy cases.  Which
>> makes the tricky cases easy to spot.  With a little luck the code to
>> solve those deadlocks will be ready by next merge window.
> 
> So this worries me.
> 
> I've pulled it, but I'm not entirely happy about some of it.
> 
> For example, the "rationale" for some of the changes is
> 
>     This should be safe, as the credentials are only used for reading.
> 

Here is what I meant, but probably did not find a good way to say.

There are places where the credentials of other threads are written,
e.g. setting no_new_privs on a thread that has already started to execve
a setuid process.

You always have the right to change the credentials of your own thread;
you don't need a mutex for that.

At least that is my impression of how the existing mutexes are used.
In my opinion "cred_guard_mutex" is not a very self-explanatory name:
it is totally unclear what it "guards", and why.


Bernd.

> which is just nonsensical. "Only used for reading" is immaterial, and
> there's no explanation for why that would matter at all. Most of the
> credentials are only ever used for reading, and the worry about execve() is
> that the credentials can change, and people race with them and use the
> new 'suid' credentials and allow things that shouldn't be allowed. So
> the rationale makes no sense at all.
> 
> Btw, if "this only takes it for reading" is such a big deal, why is
> that mutex not an rw-semaphore?
> 
> The pidfd change at least has a rationale that makes sense:
> 
>     This should be safe, as the credentials do not change
>     before exec_update_mutex is locked.  Therefore whatever
>     file access is possible with holding the cred_guard_mutex
>     here is also possible with the exec_update_mutex.
> 
> so now you at least have an explanation of why that particular lock
> makes sense and works and is equivalent.
> 
> It's still not a *great* explanation for why it would be equivalent,
> because cred_guard_mutex ends up not just guarding the write of the
> credentials, but makes it atomic wrt *other* data. That's the same
> problem as "I'm only reading".
> 
> Locking is not about *one* value being consistent - if that was the
> case, then you could just do a "get refcount on the credentials, now I
> have a stable set of creds I can read forever". No lock needed.
> 
> So locking is about *multiple* values being consistent, which is why
> that "I'm only reading" is not an explanation for why you can change
> the lock.
> 
> It's also why that second one is questionable: it's a _better_ attempt
> at explaining things, but the point is really that cred_guard_mutex
> protects *other* things too.
> 
> A real explanation would have absolutely *nothing* to do with
> "reading" or "same value of credentials". Both of those are entirely
> immaterial, since - as mentioned - you could just get a snapshot of
> the creds instead.
> 
> A real explanation would be about how there is no other state that
> cred_guard_mutex protects that matters.
> 
> See what I'm saying?
> 
> This code is subtle as h*ll, and we've had bugs in it, and it has a
> series of tens of patches to fix them. But that also means that the
> explanations for the patches should take the subtleties into account,
> and not gloss over them with things like this.
> 
> Ok, enough about the explanations. The actual _code_ is kind of odd
> too. For example, you have that "bprm->called_exec_mmap" flag to say
> "I've taken the exec_update_mutex, and need to drop it".
> 
> But that flag is not set anywhere _near_ actually taking the lock.
> Sure, it is set after exec_mmap() returns successfully, and that
> makes sense from a naming standpoint, but wouldn't it have been a
> _lot_ more obvious if you just set the flag when you took that lock,
> and instead of naming it by some magical code sequence, you named it
> for what it does?
> 
> Again, this looks all technically correct, but it's written in a way
> that doesn't seem to make a lot of sense. Why is the code literally
> written with a magical assumption of "calling exec_mmap takes this
> lock, so if the flag named called_exec_mmap is set, I have to free
> that lock that is not named that at all".
> 
> I hate conditional locking in the first place, but if it has to exist,
> then the conditional should be named after the lock, and the lock
> getting should be very very explicitly tied to it.
> 
> Wouldn't it have been much clearer if you called that flag
> "exec_update_mutex_taken", and set it WHEN YOU TAKE IT?
> 
> In fact, then you could drop the
> 
>                         mutex_unlock(&tsk->signal->exec_update_mutex);
> 
> in the error case of exec_mmap(), because now the error handling in
> free_bprm() would do the cleanup automatically.
> 
> See what I'm saying? You've made the locking more complex and subtle
> than it needed to be. And since the whole point of the *new* lock is
> that it should replace an old lock that was really complex and subtle,
> that's a problem.
> 
>                    Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 19:31   ` Bernd Edlinger
@ 2020-04-02 19:52     ` Linus Torvalds
  2020-04-02 20:59       ` Bernd Edlinger
  2020-04-03 16:00       ` Bernd Edlinger
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-02 19:52 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On Thu, Apr 2, 2020 at 12:31 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> At least that is my impression of how the existing mutexes are used.
> In my opinion "cred_guard_mutex" is not a very self-explanatory name:
> it is totally unclear what it "guards", and why.

Oh, I absolutely agree that cred_guard_mutex is a horrible lock.

It actually _used_ to be a lot more understandable, and the name used
to make more sense in the context it was used.

See commit

  a2a8474c3fff ("exec: do not sleep in TASK_TRACED under ->cred_guard_mutex")

for when it changed from "somewhat understandable" to "really hard to follow".

Don't get me wrong - that commit has a very good reason for it, but it
does make the locking really hard to understand.

It all used to be in one function - do_execve() - and it was holding
the lock over a fairly obvious range, starting at

    bprm->cred = prepare_exec_creds();

and ending at basically "we're done with execve()".

So basically, cred_guard_mutex ends up being the thing that is held
all the way from the "before execve looks at the old creds" to "execve
is done, and has changed the creds".

The reason it's needed is exactly that there are some nasty situations
where execve() itself does things with creds to determine that the new
creds are ok. And it uses the old creds to do that, but it also uses
the task->flags and task->ptrace.

So think of cred_guard_mutex as a lock around not just the creds, but
the combination of creds and the task flags/ptrace.

Anybody who changes the task ptrace setting needs to serialize with
execve(). Or anybody who tests for "dumpable()", for example.

If *all* you care about is just the creds, then you don't need it.
It's really only users that do more checks than just credentials.
"dumpable()" is I think the common one.

And that's why cred_guard_mutex has that big range - it starts when we
read the original creds (because it will use those creds to determine
how the *new* creds will affect dumpability etc), and it ends when it
has updated not only to the new creds, but it has set all those other
flags too.

So I'm not at all against splitting the lock up, and trying to make it
more directed and specific.

My complaints were about how the new lock wasn't much better. It was
still completely incomprehensible, the conditional unlocking was hard
to follow, and it really wasn't obvious that the converted users were
fine.

See?

               Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 19:52     ` Linus Torvalds
@ 2020-04-02 20:59       ` Bernd Edlinger
  2020-04-02 21:46         ` Linus Torvalds
  2020-04-03 16:00       ` Bernd Edlinger
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-02 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 9:52 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 12:31 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> At least that is my impression of how the existing mutexes are used.
>> In my opinion "cred_guard_mutex" is not a very self-explanatory name:
>> it is totally unclear what it "guards", and why.
> 
> Oh, I absolutely agree that cred_guard_mutex is a horrible lock.
> 
> It actually _used_ to be a lot more understandable, and the name used
> to make more sense in the context it was used.
> 
> See commit
> 
>   a2a8474c3fff ("exec: do not sleep in TASK_TRACED under ->cred_guard_mutex")
> 
> for when it changed from "somewhat understandable" to "really hard to follow".
> 
> Don't get me wrong - that commit has a very good reason for it, but it
> does make the locking really hard to understand.
> 
> It all used to be in one function - do_execve() - and it was holding
> the lock over a fairly obvious range, starting at
> 
>     bprm->cred = prepare_exec_creds();
> 
> and ending at basically "we're done with execve()".
> 
> So basically, cred_guard_mutex ends up being the thing that is held
> all the way from the "before execve looks at the old creds" to "execve
> is done, and has changed the creds".
> 
> The reason it's needed is exactly that there are some nasty situations
> where execve() itself does things with creds to determine that the new
> creds are ok. And it uses the old creds to do that, but it also uses
> the task->flags and task->ptrace.
> 
> So think of cred_guard_mutex as a lock around not just the creds, but
> the combination of creds and the task flags/ptrace.
> 
> Anybody who changes the task ptrace setting needs to serialize with
> execve(). Or anybody who tests for "dumpable()", for example.
> 
> If *all* you care about is just the creds, then you don't need it.
> It's really only users that do more checks than just credentials.
> "dumpable()" is I think the common one.
> 
> And that's why cred_guard_mutex has that big range - it starts when we
> read the original creds (because it will use those creds to determine
> how the *new* creds will affect dumpability etc), and it ends when it
> has updated not only to the new creds, but it has set all those other
> flags too.
> 
> So I'm not at all against splitting the lock up, and trying to make it
> more directed and specific.
> 
> My complaints were about how the new lock wasn't much better. It was
> still completely incomprehensible, the conditional unlocking was hard
> to follow, and it really wasn't obvious that the converted users were
> fine.
> 
> See?
> 

Understood completely.  The change is in a way mechanical; that is, we
have the following sequence:

1 execve starts
     |
     |    access args, may fault, deadlock in user mode fault handler
     |    de_thread, may block waiting for strace to call wait and so on
     |
     |    exec_mm_release, may also fault, deadlock in user mode fault handler
     v
2 process update begins
     |
     |    should not block, to our current knowledge (except probably when loading an NFS image?)
     |    credentials may change at any time.
     |
     v
3 process update done
     |
     v
4 execve done


So we have functions that access the process memory map.  They use the
credentials and need to access the correct image, so as not to reveal secrets
of the new, about-to-be-loaded image.  They need the inner mutex from 2 .. 3.

Also, when you want to read the credentials of another thread, it is probably
better to have a consistent state of the credentials, and of no_new_privs for
instance.  That also needs the inner mutex from 2 .. 3.

And then we have things that change the credentials or no_new_privs; in
general all ptrace_attach and security modules are of that kind.  They need
the mutex from 1 .. 4, but I want to change the name in the two patches below,
and I want to break the deadlock from ptrace in an API-incompatible way, but
with a very limited breaking change that only breaks what is already broken.
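
The two scopes from the diagram can be sketched roughly as follows. This is
a single-threaded toy model, not the kernel implementation: toy_mutex,
signal_model, lock_trace and execve_model() are invented for the
illustration, and the lock names are simply mapped onto the outer (1 .. 4)
and inner (2 .. 3) ranges described above.

```c
#include <stdbool.h>

/* Stand-in for a kernel mutex; only tracks whether it is held. */
struct toy_mutex {
	bool held;
};

struct signal_model {
	struct toy_mutex cred_guard_mutex;	/* outer scope: 1 .. 4 */
	struct toy_mutex exec_update_mutex;	/* inner scope: 2 .. 3 */
};

/* Which locks were held in each stage: [0] = 1..2, [1] = 2..3, [2] = 3..4. */
struct lock_trace {
	bool outer[3];
	bool inner[3];
};

static void record(const struct signal_model *s, struct lock_trace *t, int stage)
{
	t->outer[stage] = s->cred_guard_mutex.held;
	t->inner[stage] = s->exec_update_mutex.held;
}

static void execve_model(struct signal_model *s, struct lock_trace *t)
{
	s->cred_guard_mutex.held = true;	/* 1: execve starts */
	record(s, t, 0);			/* may fault or block on userspace */

	s->exec_update_mutex.held = true;	/* 2: process update begins */
	record(s, t, 1);			/* mm and credentials change here */

	s->exec_update_mutex.held = false;	/* 3: process update done */
	record(s, t, 2);

	s->cred_guard_mutex.held = false;	/* 4: execve done */
}
```

The inner lock is only ever held in the stage that should not block, which
is what makes it safe for readers of the memory map or of another thread's
credentials to wait on it.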


There are two more patches, which might be of interest for you, just to
make the picture complete.
It is not clear if we go that way, or if Eric has a yet better idea.

[PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
https://www.spinics.net/lists/kernel/msg3459067.html

[PATCH v6 16/16] doc: Update documentation of ->exec_*_mutex
https://www.spinics.net/lists/kernel/msg3449539.html



Thanks
Bernd.



>                Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 20:59       ` Bernd Edlinger
@ 2020-04-02 21:46         ` Linus Torvalds
  2020-04-02 23:01           ` Eric W. Biederman
                             ` (3 more replies)
  0 siblings, 4 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-02 21:46 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>
> There are two more patches, which might be of interest for you, just to
> make the picture complete.
> It is not clear if we go that way, or if Eric has a yet better idea.
>
> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
> https://www.spinics.net/lists/kernel/msg3459067.html

There is no way I would ever take that patch.

The amount of confusion in that patch is not acceptable. Randomly
unlocking the new lock?

That code makes everything worse, it's completely incomprehensible,
the locking rules make no sense what-so-ever.

I'm seriously starting to feel like I should not have pulled this
code, because the future looks _worse_ than what we used to have.

No. No no no. Eric, this is not an acceptable direction.

             Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 21:46         ` Linus Torvalds
@ 2020-04-02 23:01           ` Eric W. Biederman
  2020-04-02 23:42             ` Bernd Edlinger
                               ` (3 more replies)
  2020-04-02 23:02           ` Bernd Edlinger
                             ` (2 subsequent siblings)
  3 siblings, 4 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-02 23:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
>
> There is no way I would ever take that patch.
>
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
>
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense what-so-ever.
>
> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
>
> No. No no no. Eric, this is not an acceptable direction.

That is not the direction I intend to take either.

I was hoping I could put off replying to this thread for a bit because
I only managed to get 4 hours of sleep last night and I am not as alert
to technical details as I would like to be.

Long story short:

The exec_update_mutex is to be a strict subset of today's
cred_guard_mutex.  I tried to copy cred_guard_mutex's unlock style so
that was obvious and that turns out was messier than I intended.

I thought the individual locking changes were sufficiently unsubtle
that they did not need my personal attention.
Especially as they are just a substitution of one lock for another
with a slightly smaller scope.

I started working on the series of changes that reorganizes
the changes in exec.

It was reported that something had gone wrong with my introduction
of exec_update_mutex and I pulled it from linux-next.

By the time I was ready to start putting humpty dumpty back together
again Bernd had collected everything up and had it working.  I had seen
that he had been given the feedback about better change descriptions.

I had looked at the code of his patches earlier and the basic changes
were trivial.

Since I thought I already knew what was in the patches and the worst
problem was the missing unlock of cred_guard_mutex, and I know Bernd's
patches had been tested I applied them.  I missed that Bernd had added
the called_exec_mmap flag into my patch.  I thought he had only added
the missing unlock.

I spotted the weirdness in unlocking exec_update_mutex, and because it
does fix a real world deadlock with ptrace I did not back it out from my
tree.

I have been much laxer on the details than I like to be; my apologies.

The plan is:
        exec_update_mutex will cover just the subset of cred_guard_mutex
        after the point of no return, and after we do any actions that
        might block waiting for userspace to do anything.

        So exec_update_mutex will just cover those things that exec
        is updating, so if you want an atomic snapshot of them
        plus the appropriate struct cred you can grab exec_update_mutex.

        I added a new mutex instead of just fixing cred_guard_mutex so
        that we can update or revert the individual code paths if it
        is found that something is wrong.

        The cred_guard_mutex also prevents other tasks from starting
        to ptrace the task that is exec'ing, and other tasks from
        setting no_new_privs on the task that is exec'ing.

        My plan is to carefully refactor the code so it can perform
        both the ptrace and no_new_privs checks after the point of
        no return.

I have uncovered all kinds of surprises while working in that direction
and I realize it is going to take a very delicate and careful touch to
achieve my goal.

There are silly things: with normal Linux exec, when you are ptraced and
exec changes the credentials, the ordinary code will continue with the
old credentials, but with an LSM enabled your process is likely to be
killed instead.

There is the personally mind-blowing scenario where selinux will increase
your credentials upon exec, but if a magic directive is supplied in its
rules it will avoid setting AT_SECURE, so that userspace won't protect
itself from hostile takeover from the pre credential change environment.
Much to my surprise "noatsecure" is a known and documented feature of
selinux.  I am not certain, but I think I even spotted it in use in
production.

I will catch up on my sleep before I allow any more changes, and I will
see about replacing the called_exec_mmap flag with something saner.

Eric


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 21:46         ` Linus Torvalds
  2020-04-02 23:01           ` Eric W. Biederman
@ 2020-04-02 23:02           ` Bernd Edlinger
  2020-04-02 23:22           ` Bernd Edlinger
  2020-04-03  7:38           ` Bernd Edlinger
  3 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-02 23:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 11:46 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
> 
> There is no way I would ever take that patch.
> 
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
> 
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense what-so-ever.
> 
> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
> 
> No. No no no. Eric, this is not an acceptable direction.
> 

Seriously, Linus,

nobody is forcing anything on you.

That would be quite a stupid idea (to try to force you by e-mail :-) )

The future is not yet written.

I think Eric has an alternative idea for the next step (he did not tell
me more but I am curious).  Maybe that will be better, maybe not. And of
course I do not try to win a battle here, and I am willing to take advice.
So I am sure we can work together to understand the problem better when
we take the time to analyze it (I have not yet read everything in your
last mail completely, or followed every link you have given, so what I
write is just preliminary).

I just would like to know one thing,
how did you like my "big fat warning" comments?

If it turns out to be the wrong direction,
is it too late to turn back now?



Thanks
Bernd.


>              Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 21:46         ` Linus Torvalds
  2020-04-02 23:01           ` Eric W. Biederman
  2020-04-02 23:02           ` Bernd Edlinger
@ 2020-04-02 23:22           ` Bernd Edlinger
  2020-04-03  7:38           ` Bernd Edlinger
  3 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-02 23:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 11:46 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
> 
> There is no way I would ever take that patch.
> 
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
> 
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense what-so-ever.
> 

Linus,

let me explain what the locking here does.

It is a kind of soft mutex, which is normally strong, so it is taken
from 1 .. 4, and nothing changes from how it was before.

But it can also be weak.

So if we detect that another thread is being ptraced, we drop
the lock but keep the boolean set to true.  That lets ptrace_attach
acquire the lock, see that the boolean is true, return -EAGAIN,
and release the lock immediately.  The deadlock is broken: the thread
can handle the deadly signal from de_thread, and de_thread continues.
And just at the end of the execve, when the boolean has to be set
to false again, we have to lock the mutex, set the boolean to
false, and unlock the mutex.  It is very important for
correctness that the boolean is only changed while the mutex
is held.
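
A rough sketch of the scheme just described, as a single-threaded toy
model and not the code of the patch itself. The toy_mutex type, the
signal_model struct and the boolean's name exec_in_progress are all
assumptions made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a kernel mutex; asserts instead of blocking. */
struct toy_mutex {
	bool held;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);	/* a real mutex would sleep here */
	m->held = true;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

struct signal_model {
	struct toy_mutex cred_guard_mutex;
	bool exec_in_progress;	/* hypothetical name; only written with the mutex held */
};

/* 1: execve starts, the mutex is "strong". */
static void exec_begin(struct signal_model *s)
{
	toy_lock(&s->cred_guard_mutex);
	s->exec_in_progress = true;
}

/* A ptraced sibling was detected: drop the lock, keep the boolean set. */
static void exec_go_weak(struct signal_model *s)
{
	toy_unlock(&s->cred_guard_mutex);
}

/* The tracer can take the lock, but is told to retry while exec runs. */
static int ptrace_attach_model(struct signal_model *s)
{
	int ret = 0;

	toy_lock(&s->cred_guard_mutex);
	if (s->exec_in_progress)
		ret = -EAGAIN;
	toy_unlock(&s->cred_guard_mutex);
	return ret;
}

/* 4: execve done. If the lock was dropped, retake it to clear the boolean. */
static void exec_end(struct signal_model *s, bool went_weak)
{
	if (went_weak)
		toy_lock(&s->cred_guard_mutex);
	s->exec_in_progress = false;
	toy_unlock(&s->cred_guard_mutex);
}
```

In this model an attach attempt made while exec is in its weak state gets
-EAGAIN instead of sleeping on the mutex, and an attempt made after
exec_end() succeeds.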

Once again, please give Eric the time to catch up on his
sleep; having too little sleep can be more serious than you
would think.  Then I am looking forward to seeing his idea;
usually it is something worth considering.  But
we have all the time we want for that.


Thanks
Bernd.


> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
> 
> No. No no no. Eric, this is not an acceptable direction.
> 
>              Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:01           ` Eric W. Biederman
@ 2020-04-02 23:42             ` Bernd Edlinger
  2020-04-02 23:45               ` Eric W. Biederman
  2020-04-02 23:45               ` Linus Torvalds
  2020-04-02 23:44             ` Linus Torvalds
                               ` (2 subsequent siblings)
  3 siblings, 2 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-02 23:42 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 1:01 AM, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
>> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>
>>> There are two more patches, which might be of interest for you, just to
>>> make the picture complete.
>>> It is not clear if we go that way, or if Eric has a yet better idea.
>>>
>>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>>> https://www.spinics.net/lists/kernel/msg3459067.html
>>
>> There is no way I would ever take that patch.
>>
>> The amount of confusion in that patch is not acceptable. Randomly
>> unlocking the new lock?
>>
>> That code makes everything worse, it's completely incomprehensible,
>> the locking rules make no sense what-so-ever.
>>
>> I'm seriously starting to feel like I should not have pulled this
>> code, because the future looks _worse_ than what we used to have.
>>
>> No. No no no. Eric, this is not an acceptable direction.
> 
> That is not the direction I intend to take either.
> 
> I was hoping I could put off replying to this thread for a bit because
> I only managed to get 4 hours of sleep last night and I am not as alert
> to technical details as I would like to be.
> 
> Long story short:
> 
> The exec_update_mutex is to be a strict subset of today's
> cred_guard_mutex.  I tried to copy cred_guard_mutex's unlock style so
> that was obvious and that turns out was messier than I intended.
> 
> I thought the changes to the individual locking changes were
> sufficiently unsubtle that they did not need my personal attention.
> Especially as they are just a substitution of one lock for another
> with a slightly smaller scope.
> 
> I started working on the series of changes that reorganizes
> the changes in exec.
> 
> It was reported that something had gone wrong with my introduction
> of exec_update_mutex and I pulled it from linux-next.
> 
> By the time I was ready to start putting humpty dumpty back together
> again Bernd had collected everything up and had it working.  I had seen
> that he had been given the feedback about better change descriptions.
> 
> I had looked at the code of his patches earlier and the basic changes
> were trivial.
> 
> Since I thought I already knew what was in the patches and the worst
> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
> patches had been tested I applied them.  I missed that Bernd had added
> the exec_mmap_called flag into my patch.  I thought he had only added
> the missing unlock.
> 

Hi Eric,

oh, sorry for that, that was requested in the peer review, I could not
get a patch approved that does not have such a boolean, that simplified
the error handling.

Actually I had sent you an e-mail with that patch 24H before I posted
the update, then Greg asked me to re-post the whole series, that
took at least another two days, so at that time I was seriously
concerned how you are doing, since I heard nothing from you about the
updated patch with the exec_mmap_called.

Linus, that is not the boolean I was talking about in the other mail.
That boolean is called unsafe_execve_in_progres.

So, and now I also try to get some sleep....


Thanks
Bernd.

> I spotted the weirdness in unlocking exec_update_mutex, and because it
> does fix a real world deadlock with ptrace I did not back it out from my
> tree.
> 
> I have been much laxer on the details than I like to be, my apologies.
> 
> The plan is:
> 	exec_update_mutex will cover just the subset of cred_guard_mutex
>         after the point of no return, and after we do any actions that
> 	might block waiting for userspace to do anything.
> 
> 	So exec_update_mutex will just cover those things that exec
>         is updating, so if you want an atomic snapshot of them
>         plus the appropriate struct cred you can grab exec_update_mutex.
>         
> 	I added a new mutex instead of just fixing cred_guard_mutex so
>         that we can update or revert the individual code paths if it
>         is found that something is wrong.
> 
> 	The cred_guard_mutex also prevents other tasks from starting
>         to ptrace the task that is exec'ing, and other tasks from
>         setting no_new_privs on the task that is exec'ing.
> 
>         My plan is to carefully refactor the code so it can perform
>         both the ptrace and no_new_privs checks after the point of
>         no return.
> 
> I have uncovered all kinds of surprises while working in that direction
> and I realize it is going to take a very delicate and careful touch to
> achieve my goal.
> 
> There are silly things like normal linux exec when you are ptraced and
> exec changes the credentials the ordinary code will continue with the
> old credentials, but with an LSM enabled your process is likely to be
> killed instead.
> 
> There is the personal mind blowing scenario where selinux will increase
> your credentials upon exec but if a magic directive is supplied in its
> rules will avoid setting AT_SECURE, so that userspace won't protect
> itself from hostile takeover from the pre credential change environment.
> Much to my surprise "noatsecure" is a known and documented feature of
> selinux.  I am not certain but I think I even spotted it in use on
> production.
> 
> I will catch up on my sleep before I allow any more changes, and I will
> see about replacing the called_exec_mmap flag with something saner.
> 
> Eric
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:01           ` Eric W. Biederman
  2020-04-02 23:42             ` Bernd Edlinger
@ 2020-04-02 23:44             ` Linus Torvalds
  2020-04-03  0:05               ` Eric W. Biederman
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
  2020-04-03  5:09             ` [GIT PULL] Please pull proc and exec work for 5.7-rc1 Bernd Edlinger
  2020-04-03 19:26             ` Linus Torvalds
  3 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-02 23:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov

On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> That is not the direction I intend to take either.

Ahh, good. Because those kinds of "play games with dropping locks in
the middle based on conditionals" really have been horrible.

Yes, we've done it, and it's almost always been a source of truly subtle bugs.

> The exec_update_mutex is to be a strict subset of today's
> cred_guard_mutex.  I tried to copy cred_guard_mutex's unlock style so
> that was obvious and that turns out was messier than I intended.

Yes. That is why I had no problem pulling that subset, and my worries
were mainly about the explanations and that flag use.

> Since I thought I already knew what was in the patches and the worst
> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
> patches had been tested I applied them.  I missed that Bernd had added
> the exec_mmap_called flag into my patch.  I thought he had only added
> the missing unlock.

Ahh, so you meant for all of that to be entirely static refactoring,
rather than the conditional unlock depending on just how far we had
gotten.

Good, that's generally the much superior approach.

I absolutely _hate_ the "drop and retake" model, unless it's a very
local case with a very explicit retry path.

In contrast, the "we have a flag that shows how far we've gotten"
_has_ been a successful model, and while I much prefer a static "lock
pairs with unlock", that "I have done this, so you need to unlock" is
not entirely out of the question when the static rules become too
complex to think about.

The vfs code has something similar in FMODE_OPENED which is basically
a flag saying "I actually made it all the way to the ->open()"
callback. We used to have a static model, but the rules for when we
can use fput(), and when we have to use fdrop() were too hard for
people.
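
The two styles contrasted above can be sketched roughly as follows. This
is a minimal userspace illustration, not kernel code: pthread mutexes
stand in for kernel mutexes, and every name in it (struct ctx,
update_lock, took_update_lock, do_work_flagged, do_work_static) is
hypothetical. The first function sets a flag at the point where the lock
is taken and lets a common cleanup path drop it; the second pairs lock
and unlock statically in one function.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical context carrying a "how far did we get" flag. */
struct ctx {
	pthread_mutex_t update_lock;
	bool took_update_lock;	/* set right where the lock is taken */
	int state;
};

void ctx_init(struct ctx *c)
{
	int rc = pthread_mutex_init(&c->update_lock, NULL);

	assert(rc == 0);
	(void)rc;
	c->took_update_lock = false;
	c->state = 0;
}

/* Common cleanup: unlock only if the flag says the lock is held. */
void ctx_cleanup(struct ctx *c)
{
	if (c->took_update_lock) {
		c->took_update_lock = false;
		pthread_mutex_unlock(&c->update_lock);
	}
}

/* Flag style: can fail before or after the lock has been taken. */
int do_work_flagged(struct ctx *c, bool fail_early, bool fail_late)
{
	int ret = -1;

	if (fail_early)
		goto out;
	pthread_mutex_lock(&c->update_lock);
	c->took_update_lock = true;	/* named for what it records */
	if (fail_late)
		goto out;
	c->state++;
	ret = 0;
out:
	ctx_cleanup(c);
	return ret;
}

/* Static style: the lock and its unlock sit in the same function. */
int do_work_static(struct ctx *c, bool fail_early, bool fail_late)
{
	int ret = -1;

	if (fail_early)
		return ret;
	pthread_mutex_lock(&c->update_lock);
	if (!fail_late) {
		c->state++;
		ret = 0;
	}
	pthread_mutex_unlock(&c->update_lock);
	return ret;
}
```

In the flag variant the flag is set on the line right after the lock is
taken, so the cleanup path never has to infer lock state from some
unrelated step having completed.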

> I spotted the weirdness in unlocking exec_update_mutex, and because it
> does fix a real world deadlock with ptrace I did not back it out from my
> tree.
>
> I have been much laxer on the details than I like to be, my apologies.

Ok, as long as we have a sane plan..

And

> The plan is:
>         exec_update_mutex will cover just the subset of cred_guard_mutex
>         after the point of no return, and after we do any actions that
>         might block waiting for userspace to do anything.
>
>         So exec_update_mutex will just cover those things that exec
>         is updating, so if you want an atomic snapshot of them
>         plus the appropriate struct cred you can grab exec_update_mutex.
>
>         I added a new mutex instead of just fixing cred_guard_mutex so
>         that we can update or revert the individual code paths if it
>         is found that something is wrong.
>
>         The cred_guard_mutex also prevents other tasks from starting
>         to ptrace the task that is exec'ing, and other tasks from
>         setting no_new_privs on the task that is exec'ing.
>
>         My plan is to carefully refactor the code so it can perform
>         both the ptrace and no_new_privs checks after the point of
>         no return.

Ok. Sounds good.
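
A rough userspace sketch of the scoping described in that plan, under
stated assumptions: pthread mutexes stand in for the kernel mutexes,
guard_lock and update_lock are hypothetical stand-ins for
cred_guard_mutex and exec_update_mutex, and mm_id / cred_id are
placeholder integers for the state exec replaces. It is not the kernel
implementation. The narrow lock is taken only at the point of no return,
and a reader that wants a consistent pair needs only the narrow lock.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the state that exec replaces together. */
struct task {
	pthread_mutex_t guard_lock;	/* wide scope */
	pthread_mutex_t update_lock;	/* narrow scope */
	int mm_id;			/* placeholder for the address space */
	int cred_id;			/* placeholder for the credentials */
};

struct snapshot {
	int mm_id;
	int cred_id;
};

void task_init(struct task *t)
{
	int rc = pthread_mutex_init(&t->guard_lock, NULL);

	assert(rc == 0);
	rc = pthread_mutex_init(&t->update_lock, NULL);
	assert(rc == 0);
	(void)rc;
	t->mm_id = 1;
	t->cred_id = 1;
}

/* Exec-like update: only the final replacement holds update_lock. */
void task_exec(struct task *t, int new_id)
{
	pthread_mutex_lock(&t->guard_lock);
	/* ... steps that may block waiting for userspace go here ... */

	/* Point of no return: replace both values as one unit. */
	pthread_mutex_lock(&t->update_lock);
	t->mm_id = new_id;
	t->cred_id = new_id;
	pthread_mutex_unlock(&t->update_lock);

	pthread_mutex_unlock(&t->guard_lock);
}

/* Reader: the narrow lock is enough for a consistent pair. */
struct snapshot task_snapshot(struct task *t)
{
	struct snapshot s;

	pthread_mutex_lock(&t->update_lock);
	s.mm_id = t->mm_id;
	s.cred_id = t->cred_id;
	pthread_mutex_unlock(&t->update_lock);
	return s;
}
```

Because the reader never touches the wide lock, it cannot get stuck
behind the part of the exec-like path that may wait on userspace.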

> I have uncovered all kinds of surprises while working in that direction
> and I realize it is going to take a very delicate and careful touch to
> achieve my goal.
>
> There are silly things like normal linux exec when you are ptraced and
> exec changes the credentials the ordinary code will continue with the
> old credentials, but with an LSM enabled your process is likely to be
> killed instead.

Yeah. The "continue with old credentials" is actually very traditional
and the original behavior, and is useful for handling the case of
debugging something that is suid, but doesn't necessarily _require_
it.

But the LSM's just say yes/no.

I have this dim memory that it also triggers when you do the debugging
as root, but that may be some medication-induced memory.

> There is the personal mind blowing scenario where selinux will increase
> your credentials upon exec but if a magic directive is supplied in its
> rules will avoid setting AT_SECURE, so that userspace won't protect
> itself from hostile takeover from the pre credential change environment.
> Much to my surprise "noatsecure" is a known and documented feature of
> selinux.  I am not certain but I think I even spotted it in use on
> production.

We have had a _ton_ of random small rules so that people could enable
SElinux in legacy environments.

They are _probably_ effectively dead code in this day and age, but
it's hard to tell...

            Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:42             ` Bernd Edlinger
@ 2020-04-02 23:45               ` Eric W. Biederman
  2020-04-02 23:49                 ` Bernd Edlinger
  2020-04-02 23:45               ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-02 23:45 UTC (permalink / raw)
  To: Bernd Edlinger; +Cc: Linus Torvalds, Linux Kernel Mailing List, Alexey Gladkov

Bernd Edlinger <bernd.edlinger@hotmail.de> writes:

> oh, sorry for that, that was requested in the peer review, I could not
> get a patch approved that does not have such a boolean, that simplified
> the error handling.

If you had included a note in your changelog when you respun my patch I
probably would have realized what you had done and would have spotted it
faster.

When I glanced at the patch quickly I thought you had just added the
missing unlock.

Eric


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:42             ` Bernd Edlinger
  2020-04-02 23:45               ` Eric W. Biederman
@ 2020-04-02 23:45               ` Linus Torvalds
  1 sibling, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-02 23:45 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On Thu, Apr 2, 2020 at 4:42 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>
> Linus, that is not the boolean I was talking about in the other mail.
> That boolean is called unsafe_execve_in_progres.

Yeah, I tracked what you were trying to say - I did understand how
that patch worked.

I just absolutely and utterly hated it ;/

              Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:45               ` Eric W. Biederman
@ 2020-04-02 23:49                 ` Bernd Edlinger
  0 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-02 23:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, Alexey Gladkov



On 4/3/20 1:45 AM, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> 
>> oh, sorry for that, that was requested in the peer review, I could not
>> get a patch approved that does not have such a boolean, that simplified
>> the error handling.
> 
> If you had included a note in your changelog when you respun my patch I
> probably would have realized what you had done and would have spotted it
> faster.
> 

Yeah, mistakes happen.

> When I glanced at the patch quickly I thought you had just added the
> missing unlock.
> 
> Eric
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:44             ` Linus Torvalds
@ 2020-04-03  0:05               ` Eric W. Biederman
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
  1 sibling, 0 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-03  0:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> That is not the direction I intend to take either.
>
> Ahh, good. Because those kinds of "play games with dropping locks in
> the middle based on conditionals" really have been horrible.
>
> Yes, we've done it, and it's almost always been a source of truly subtle bugs.
>
>> The exec_update_mutex is to be a strict subset of today's
>> cred_guard_mutex.  I tried to copy cred_guard_mutex's unlock style so
>> that was obvious and that turns out was messier than I intended.
>
> Yes. That is why I had no problem pulling that subset, and my worries
> were mainly about the explanations and that flag use.
>
>> Since I thought I already knew what was in the patches and the worst
>> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
>> patches had been tested I applied them.  I missed that Bernd had added
>> the exec_mmap_called flag into my patch.  I thought he had only added
>> the missing unlock.
>
> Ahh, so you meant for all of that to be entirely static refactoring,
> rather than the conditional unlock depending on just how far we had
> gotten.
>
> Good, that's generally the much superior approach.

Looking at it right now if I add the unlock to one code path I can
get the flag and free_binprm out of it, and have it completely static.

> I absolutely _hate_ the "drop and retake" model, unless it's a very
> local case with a very explicit retry path.
>
> In contrast, the "we have a flag that shows how far we've gotten"
> _has_ been a successful model, and while I much prefer a static "lock
> pairs with unlock", that "I have done this, so you need to unlock" is
> not entirely out of the question when the static rules become too
> complex to think about.

We do definitely need one of those for the point of no return.  I
need to check if we can set it sooner.  I think we have a weird case
where we can't set the flag because calling force_sigsegv while
coredumping is rude.

> The vfs code has something similar in FMODE_OPENED which is basically
> a flag saying "I actually made it all the way to the ->open()"
> callback. We used to have a static model, but the rules for when we
> can use fput(), and when we have to use fdrop() were too hard for
> people.
>
>> I spotted the weirdness in unlocking exec_update_mutex, and because it
>> does fix a real world deadlock with ptrace I did not back it out from my
>> tree.
>>
>> I have been much laxer on the details than I like to be, my apologies.
>
> Ok, as long as we have a sane plan..
>
> And
>
>> The plan is:
>>         exec_update_mutex will cover just the subset of cred_guard_mutex
>>         after the point of no return, and after we do any actions that
>>         might block waiting for userspace to do anything.
>>
>>         So exec_update_mutex will just cover those things that exec
>>         is updating, so if you want an atomic snapshot of them
>>         plus the appropriate struct cred you can grab exec_update_mutex.
>>
>>         I added a new mutex instead of just fixing cred_guard_mutex so
>>         that we can update or revert the individual code paths if it
>>         is found that something is wrong.
>>
>>         The cred_guard_mutex also prevents other tasks from starting
>>         to ptrace the task that is exec'ing, and other tasks from
>>         setting no_new_privs on the task that is exec'ing.
>>
>>         My plan is to carefully refactor the code so it can perform
>>         both the ptrace and no_new_privs checks after the point of
>>         no return.
>
> Ok. Sounds good.
>
>> I have uncovered all kinds of surprises while working in that direction
>> and I realize it is going to take a very delicate and careful touch to
>> achieve my goal.
>>
>> There are silly things like normal linux exec when you are ptraced and
>> exec changes the credentials the ordinary code will continue with the
>> old credentials, but with an LSM enabled your process is likely to be
>> killed instead.
>
> Yeah. The "continue with old credentials" is actually very traditional
> and the original behavior, and is useful for handling the case of
> debugging something that is suid, but doesn't necessarily _require_
> it.

If we continue with old credentials I think we are still setting
AT_SECURE which seems odd.  Especially since it doesn't appear to be
intentional.

> But the LSM's just say yes/no.

Oh I wish what the LSM's were doing was anything approaching as
simple as merely saying yes/no during exec.

> I have this dim memory that it also triggers when you do the debugging
> as root, but that may be some medication-induced memory.

I have a memory of someone fixing something like that years ago.
If you have sufficient privileges while ptracing the current code will
allow you to trace a suid root exec.

>> There is the personal mind blowing scenario where selinux will increase
>> your credentials upon exec but if a magic directive is supplied in its
>> rules will avoid setting AT_SECURE, so that userspace won't protect
>> itself from hostile takeover from the pre credential change environment.
>> Much to my surprise "noatsecure" is a known and documented feature of
>> selinux.  I am not certain but I think I even spotted it in use on
>> production.
>
> We have had a _ton_ of random small rules so that people could enable
> SElinux in legacy environments.
>
> They are _probably_ effectively dead code in this day and age, but
> it's hard to tell...

For that specific case I attempted to look at the SELinux rules
file on a production RHEL7 configuration and strings told me
"noatsecure" is present.  But the whole thing is a binary blob
and I have not spent enough time to figure out how to properly
return it to text so I can see what that "noatsecure" means.
It might just have been a section header in the binary file.

There are a lot of things like that that are either going to need comments
from the relevant maintainers or to just be avoided for the time being.

The plan is small careful patches so I get through it with setting
off the minimal number of landmines.

Eric


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:01           ` Eric W. Biederman
  2020-04-02 23:42             ` Bernd Edlinger
  2020-04-02 23:44             ` Linus Torvalds
@ 2020-04-03  5:09             ` Bernd Edlinger
  2020-04-03 19:26             ` Linus Torvalds
  3 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-03  5:09 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 1:01 AM, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
>> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>
>>> There are two more patches, which might be of interest for you, just to
>>> make the picture complete.
>>> It is not clear if we go that way, or if Eric has a yet better idea.
>>>
>>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>>> https://www.spinics.net/lists/kernel/msg3459067.html
>>
>> There is no way I would ever take that patch.
>>
>> The amount of confusion in that patch is not acceptable. Randomly
>> unlocking the new lock?
>>
>> That code makes everything worse, it's completely incomprehensible,
>> the locking rules make no sense what-so-ever.
>>
>> I'm seriously starting to feel like I should not have pulled this
>> code, because the future looks _worse_ than what we used to have.
>>
>> No. No no no. Eric, this is not an acceptable direction.
> 
> That is not the direction I intend to take either.
> 
> I was hoping I could put off replying to this thread for a bit because
> I only managed to get 4 hours of sleep last night and I am not as alert
> to technical details as I would like to be.
> 
> Long story short:
> 
> The exec_update_mutex is to be a strict subset of today's
> cred_guard_mutex.  I tried to copy cred_guard_mutex's unlock style so
> that was obvious and that turns out was messier than I intended.
> 
> I thought the changes to the individual locking changes were
> sufficiently unsubtle that they did not need my personal attention.
> Especially as they are just a substitution of one lock for another
> with a slightly smaller scope.
> 
> I started working on the series of changes that reorganizes
> the changes in exec.
> 
> It was reported that something had gone wrong with my introduction
> of exec_update_mutex and I pulled it from linux-next.
> 
> By the time I was ready to start putting humpty dumpty back together
> again Bernd had collected everything up and had it working.  I had seen
> that he had been given the feedback about better change descriptions.
> 

Sorry, I did it as slowly as I could possibly do.
I wanted to wait for you, but....


> I had looked at the code of his patches earlier and the basic changes
> were trivial.
> 
> Since I thought I already knew what was in the patches and the worst
> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
> patches had been tested I applied them.  I missed that Bernd had added
> the exec_mmap_called flag into my patch.  I thought he had only added
> the missing unlock.
> 
> I spotted the weirdness in unlocking exec_update_mutex, and because it
> does fix a real world deadlock with ptrace I did not back it out from my
> tree.
> 
> I have been much laxer on the details than I like to be, my apologies.
> 
> The plan is:
> 	exec_update_mutex will cover just the subset of cred_guard_mutex
>         after the point of no return, and after we do any actions that
> 	might block waiting for userspace to do anything.
> 
> 	So exec_update_mutex will just cover those things that exec
>         is updating, so if you want an atomic snapshot of them
>         plus the appropriate struct cred you can grab exec_update_mutex.
>         
> 	I added a new mutex instead of just fixing cred_guard_mutex so
>         that we can update or revert the individual code paths if it
>         is found that something is wrong.
> 
> 	The cred_guard_mutex also prevents other tasks from starting
>         to ptrace the task that is exec'ing, and other tasks from
>         setting no_new_privs on the task that is exec'ing.
> 
>         My plan is to carefully refactor the code so it can perform
>         both the ptrace and no_new_privs checks after the point of
>         no return.
> 
> I have uncovered all kinds of surprises while working in that direction
> and I realize it is going to take a very delicate and careful touch to
> achieve my goal.
> 

That worries me a bit.
Could you please share details of the failed attempts with us?
Learning from failures could help us better understand the issue.



> There are silly things like normal linux exec when you are ptraced and
> exec changes the credentials the ordinary code will continue with the
> old credentials, but with an LSM enabled your process is likely to be
> killed instead.
> 

Please elaborate on the details.

> There is the personal mind blowing scenario where selinux will increase
> your credentials upon exec but if a magic directive is supplied in its
> rules will avoid setting AT_SECURE, so that userspace won't protect
> itself from hostile takeover from the pre credential change environment.
> Much to my surprise "noatsecure" is a known and documented feature of
> selinux.  I am not certain but I think I even spotted it in use on
> production.
> 

Also here, it might help to make us aware of the problems you face.

I also considered moving all the credentials to the inner block,
but had the impression that is probably a really tough problem instead.

I wondered what happens if a ptraced process executes a suid
program via execve.  Don't you need different credentials
when you are ptraced?  I mean, doesn't that override the suid bit,
while when not ptraced, you are the root user and have all the root
powers to load the image in the new vm?

Isn't there a race when execve starts, and ptrace attach happens later?


Thanks
Bernd.

> I will catch up on my sleep before I allow any more changes, and I will
> see about replacing the called_exec_mmap flag with something saner.
> 
> Eric
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 21:46         ` Linus Torvalds
                             ` (2 preceding siblings ...)
  2020-04-02 23:22           ` Bernd Edlinger
@ 2020-04-03  7:38           ` Bernd Edlinger
  3 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-03  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 11:46 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
> 
> There is no way I would ever take that patch.
> 
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
> 
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense what-so-ever.
> 
> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
> 

No problem, sometimes they say the cure is worse than the disease,
and I would not rule out the possibility that this is also
an example of that.

My initial proposal was much smaller and probably more focused on the
issue, but in peer review it turned out that we want to solve the problem
from the ground up.  Otherwise I saw no possibility to get it approved.
That forced me in the direction that this took.

I just try to help with that.  But I do not insist on a specific
direction (-:

This is what I initially proposed:

[PATCH] exec: Fix a deadlock in ptrace
https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/


Thanks
Bernd.

> No. No no no. Eric, this is not an acceptable direction.
> 
>              Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 19:04 ` [GIT PULL] Please pull proc and exec work for 5.7-rc1 Linus Torvalds
  2020-04-02 19:31   ` Bernd Edlinger
@ 2020-04-03 15:09   ` Bernd Edlinger
  2020-04-03 16:23     ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-03 15:09 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 9:04 PM, Linus Torvalds wrote:
> On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The work on exec starts solving a long standing issue with exec that
>> it takes mutexes of blocking userspace applications, which makes exec
>> extremely deadlock prone.  For the moment this adds a second mutex
>> with a narrower scope that handles all of the easy cases.  Which
>> makes the tricky cases easy to spot.  With a little luck the code to
>> solve those deadlocks will be ready by next merge window.
> 
> So this worries me.
> 
> I've pulled it, but I'm not entirely happy about some of it.
> 
> For example, the "rationale" for some of the changes is
> 
>     This should be safe, as the credentials are only used for reading.
> 
> which is just nonsensical. "Only used for reading" is immaterial, and
> there's no explanation for why that would matter at all. Most of the
> credentials are ever used for reading, and the worry about execve() is
> that the credentials can change, and people race with them and use the
> new 'suid' credentials and allow things that shouldn't be allowed. So
> the rationale makes no sense at all.
> 
> Btw, if "this only takes it for reading" is such a big deal, why is
> that mutex not an rw-semaphore?
> 
> The pidfd change at least has a rationale that makes sense:
> 
>     This should be safe, as the credentials do not change
>     before exec_update_mutex is locked.  Therefore whatever
>     file access is possible with holding the cred_guard_mutex
>     here is also possible with the exec_update_mutex.
> 
> so now you at least have an explanation of why that particular lock
> makes sense and works and is equivalent.
> 
> It's still not a *great* explanation for why it would be equivalent,
> because cred_guard_mutex ends up not just guarding the write of the
> credentials, but makes it atomic wrt *other* data. That's the same
> problem as "I'm only reading".
> 
> Locking is not about *one* value being consistent - if that was the
> case, then you could just do a "get refcount on the credentials, now I
> have a stable set of creds I can read forever". No lock needed.
> 
> So locking is about *multiple* values being consistent, which is why
> that "I'm only reading" is not an explanation for why you can change
> the lock.
> 
> It's also why that second one is questionable: it's a _better_ attempt
> at explaining things, but the point is really that cred_guard_mutex
> protects *other* things too.
> 
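
The "get refcount on the credentials" remark quoted above can be
sketched as follows. This is a single-threaded userspace illustration
with hypothetical names (struct creds, creds_get, creds_put,
holder_replace) and a plain, non-atomic counter; it is not the kernel's
struct cred API. It shows only that a referenced snapshot of *one*
object stays stable after the owner switches to a new one, which is why
a snapshot alone needs no lock, while keeping several values consistent
with each other still does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted credentials, never modified once published. */
struct creds {
	int refcount;	/* plain int: single-threaded illustration only */
	int uid;
};

struct creds *creds_new(int uid)
{
	struct creds *c = malloc(sizeof(*c));

	assert(c != NULL);
	c->refcount = 1;
	c->uid = uid;
	return c;
}

/* Take a reference: the object cannot go away while it is held. */
struct creds *creds_get(struct creds *c)
{
	c->refcount++;
	return c;
}

/* Drop a reference and free the object when the last one is gone. */
void creds_put(struct creds *c)
{
	if (--c->refcount == 0)
		free(c);
}

/* Owner whose credentials pointer can be swapped, as exec would do. */
struct holder {
	struct creds *cred;
};

/* Install new credentials; the old object lives on for other holders. */
void holder_replace(struct holder *h, struct creds *next)
{
	creds_put(h->cred);
	h->cred = next;
}
```

A reader that calls creds_get() before the swap keeps seeing the old
uid, however often the owner replaces its pointer afterwards. Nothing
in this sketch ties that snapshot to any other piece of state.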

Can we still edit the change logs?  Maybe it is a clear indication
that they are not sufficiently clear when one doesn't understand the
patch without following the whole email thread.


> A real explanation would have absolutely *nothing* to do with
> "reading" or "same value of credentials". Both of those are entirely
> immaterial, since - as mentioned - you could just get a snapshot of
> the creds instead.
> 

The problem we have here is that *another* thread can change no_new_privs
of the execve thread, that is a write.  I think that must be avoided
whatever it costs.  Those are the hard issues.
Reading another thread's credentials and taking a reference to the
vm need to be consistent, so that should just not happen while the vm
is updated but the credentials are not yet.

Or am I missing something here?

> A real explanation would be about how there is no other state that
> cred_guard_mutex protects that matters.
> 
> See what I'm saying?
> 
> This code is subtle as h*ll, and we've had bugs in it, and it has a
> series of tens of patches to fix them. But that also means that the
> explanations for the patches should take the subtleties into account,
> and not gloss over them with things like this.
> 

:-)

> Ok, enough about the explanations. The actual _code_ is kind of odd
> too. For example, you have that "bprm->called_exec_mmap" flag to say
> "I've taken the exec_update_mutex, and need to drop it".
> 

Previously that was bprm->mm == NULL, which is even more hacky.

> But that flag is not set anywhere _near_ actually taking the lock.
> Sure, it is taken after exec_mmap() returns successfully, and that
> makes sense from a naming standpoint, but wouldn't it have been a
> _lot_ more obvious if you just set the flag when you took that lock,
> and instead of naming it by some magical code sequence, you named it
> for what it does?
> 

Linus, I take full responsibility for this part of the patch.
In this case, I just did not want to change the name again.
That name was in a previous version of my patch, which I merged
with Eric's while also fixing the mutex lock-order
issue in Eric's original patch.  If anybody had
suggested a better name, and advised me to pass a parameter to
exec_mmap, that would have happened.
So it was a kind of laziness on my side, and unfortunately I forgot to
point out all the changes in a revision log.  I usually do that,
but this time I forgot it somehow.  This was a 16-part patch
series at that time, so I was really busy following
each mail about the previous patch version, and also getting the latest
revision of the change log (I use the mail; maybe I should have
pulled Eric's tree, but I am still a newbie here ... :-) ).
Anyhow, I was surprised that Eric did not see my changes by
looking at them, but that is human nature, nothing to be
blamed for.


> Again, this looks all technically correct, but it's written in a way
> that doesn't seem to make a lot of sense. Why is the code literally
> written with a magical assumption of "calling exec_mmap takes this
> lock, so if the flag named called_exec_mmap is set, I have to free
> that lock that is not named that at all".
> 

Names can be changed.  In the peer review everybody was happy with it.
But that is not set in stone.

Initially I only wanted to address the ptrace attach, but Eric
came up with the user mode page fault handler, which made the patch
a lot more complicated.  If that goal is dropped, the place
where the mutex needs to be taken could also be a different one.


> I hate conditional locking in the first place, but if it has to exist,
> then the conditional should be named after the lock, and the lock
> getting should be very very explicitly tied to it.
> 
> Wouldn't it have been much clearer if you called that flag
> "exec_update_mutex_taken", and set it WHEN YOU TAKE IT?
> 

Can be done.  I don't care.  It is one additional register taken
by a parameter to exec_mmap, which is probably inlined, nothing
more, nothing less.


> In fact, then you could drop the
> 
>                         mutex_unlock(&tsk->signal->exec_update_mutex);
> 
> in the error case of exec_mmap(), because now the error handling in
> free_bprm() would do the cleanup automatically.
> 

The error handling is sometimes called when the exec_update_mutex is
not taken, in fact even when de_thread has not been called.

Can you say how you would suggest that to be done?


> See what I'm saying? You've made the locking more complex and subtle
> than it needed to be. And since the whole point of the *new* lock is
> that it should replace an old lock that was really complex and subtle,
> that's a problem.
> 
>                    Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 19:52     ` Linus Torvalds
  2020-04-02 20:59       ` Bernd Edlinger
@ 2020-04-03 16:00       ` Bernd Edlinger
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-03 16:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

On 4/2/20 9:52 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 12:31 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> This is at least what is my impression how the existing mutexes are used,
>> a mutex called "cred_guard_mutex" is a not very good self explaining name,
>> in my opinion, it is totally unclear what it does "guard", and why.
> 
> Oh, I absolutely agree that cred_guard_mutex is a horrible lock.
> 
> It actually _used_ to be a lot more understandable, and the name used
> to make more sense in the context it was used.
> 
> See commit
> 
>   a2a8474c3fff ("exec: do not sleep in TASK_TRACED under ->cred_guard_mutex")
> 
> for when it changed from "somewhat understandable" to "really hard to follow".
> 

Ah, yes, there it was introduced.

That fixed only the case of a single-threaded process doing execve,
but failed to fix the case of a multi-threaded process doing execve
with the other threads racing with the execve.  That is what happened
on my laptop, again and again, when I tried to fix a bug in the
gcc testsuite while tracking down another bug,
namely why the gcc testsuite left loads of temp files in /tmp,
until I decided to go on a little bug-hunt in the linux kernel
instead :-/

And I had no idea what was happening at all.  But this bug
bit me again and again, until I realized the nature of the strace
problem, and I was really baffled.

Before I considered a linux patch for that, I tried to fix it in the
strace code instead, and in fact I tried two approaches.
One is to wait in a signal handler; that did not work.
The second one is to use another thread that does the wait, and that
only worked when I disabled the PTRACE_O_TRACEEXIT flags.

I posted the two patches on lkml, just for reference.
Maybe you are amused by those patches. I consider that a craziness myself,
but it was indeed able to avoid the deadlock, with a user space change alone:

https://lore.kernel.org/lkml/AM6PR03MB5170D68B5010FCA627A603F8E4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/


So that is more or less for your amusement; honestly, I would not propose
that as the way to fix the strace deadlock.


Bernd.

> Don't get me wrong - that commit has a very good reason for it, but it
> does make the locking really hard to understand.
> 
> It all used to be in one function - do_execve() - and it was holding
> the lock over a fairly obvious range, starting at
> 
>     bprm->cred = prepare_exec_creds();
> 
> and ending at basically "we're done with execve()".
> 
> So basically, cred_guard_mutex ends up being the thing that is held
> all the way from the "before execve looks at the old creds" to "execve
> is done, and has changed the creds".
> 
> The reason it's needed is exactly that there are some nasty situations
> where execve() itself does things with creds to determine that the new
> creds are ok. And it uses the old creds to do that, but it also uses
> the task->flags and task->ptrace.
> 
> So think of cred_guard_mutex as a lock around not just the creds, but
> the combination of creds and the task flags/ptrace.
> 
> Anybody who changes the task ptrace setting needs to serialize with
> execve(). Or anybody who tests for "dumpable()", for example.
> 
> If *all* you care about is just the creds, then you don't need it.
> It's really only users that do more checks than just credentials.
> "dumpable()" is I think the common one.
> 
> And that's why cred_guard_mutex has that big range - it starts when we
> read the original creds (because it will use those creds to determine
> how the *new* creds will affect dumpability etc), and it ends when it
> has updated not only to the new creds, but it has set all those other
> flags too.
> 
> So I'm not at all against splitting the lock up, and trying to make it
> more directed and specific.
> 
> My complaints were about how the new lock wasn't much better. It was
> still completely incomprehensible, the conditional unlocking was hard
> to follow, and it really wasn't obvious that the converted users were
> fine.
> 
> See?
> 
>                Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 15:09   ` Bernd Edlinger
@ 2020-04-03 16:23     ` Linus Torvalds
  2020-04-03 16:36       ` Bernd Edlinger
  2020-04-04  5:43       ` Bernd Edlinger
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-03 16:23 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov

[-- Attachment #1: Type: text/plain, Size: 1409 bytes --]

On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>
> On 4/2/20 9:04 PM, Linus Torvalds wrote:
> > In fact, then you could drop the
> >
> >                         mutex_unlock(&tsk->signal->exec_update_mutex);
> >
> > in the error case of exec_mmap(), because now the error handling in
> > free_bprm() would do the cleanup automatically.
> >
>
> The error handling is sometimes called when the exec_update_mutex is
> not taken, in fact even de_thread not called.

But that's the whole point of the flag. Make the flag be about "do I
hold the mutex", and then the error handling does the right thing
regardless.

> Can you say how you would suggest that to be done?

I think the easiest thing to do to explain is to just write the patch.

This is entirely untested, but see what the difference is? I make the
flag be about exactly where I take the lock, not about some "I have
called exec_mmap".

Which means that now exec_mmap() doesn't even need to unlock it in the
error case, because the unlocking will happen properly in the
bprm_exit regardless.

This makes that unconditional unlocking logic much more obvious.
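
Stripped of the kernel details, the pattern looks like the sketch below.
Everything in it is a hypothetical userspace stand-in (toy_mutex, struct ctx,
do_update), not the fs/exec.c code: the flag is named after the lock, it is
set on the line right after the lock is taken, and a single cleanup function
is the only place that unlocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only tracks its state, so misuse trips an assert. */
struct toy_mutex {
	bool locked;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->locked);
	m->locked = true;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = false;
}

/* Hypothetical context, playing the role of a much smaller linux_binprm. */
struct ctx {
	struct toy_mutex *update_mutex;
	bool update_mutex_held;	/* named for the lock, not for a call site */
};

static int do_update(struct ctx *c, bool fail_early, bool fail_late)
{
	if (fail_early)
		return -1;	/* lock never taken, flag stays clear */

	toy_lock(c->update_mutex);
	c->update_mutex_held = true;	/* set exactly where the lock is taken */

	if (fail_late)
		return -1;	/* no unlock here: cleanup will do it */
	return 0;
}

/* The one place that drops the lock, on success and on every error path. */
static void ctx_cleanup(struct ctx *c)
{
	if (c->update_mutex_held) {
		toy_unlock(c->update_mutex);
		c->update_mutex_held = false;
	}
}
```

The error paths in do_update() need no unlock of their own, which is the
point: whether the failure happened before or after the lock was taken,
ctx_cleanup() does the right thing.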

That said, Eric says he can make it all properly static so that it
doesn't need that kind of dynamic "if (x) unlock()" logic at all,
which is much better.

So this patch is not for consumption, it's purely for "look, something
like this"

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2743 bytes --]

 fs/exec.c               | 15 +++++++--------
 include/linux/binfmts.h |  2 +-
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..cdc7f1145662 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,8 +1041,9 @@ EXPORT_SYMBOL(read_code);
  * On success, this function returns with the mutex
  * exec_update_mutex locked.
  */
-static int exec_mmap(struct mm_struct *mm)
+static int exec_mmap(struct linux_binprm *bprm)
 {
+	struct mm_struct *mm = bprm->mm;
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
 	int ret;
@@ -1055,6 +1056,7 @@ static int exec_mmap(struct mm_struct *mm)
 	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
 	if (ret)
 		return ret;
+	bprm->update_mutex_held = 1;
 
 	if (old_mm) {
 		sync_mm_rss(old_mm);
@@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
-			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
@@ -1321,17 +1322,15 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Release all of the old mmap stuff
 	 */
 	acct_arg_size(bprm, 0);
-	retval = exec_mmap(bprm->mm);
+	retval = exec_mmap(bprm);
 	if (retval)
 		goto out;
 
 	/*
-	 * After setting bprm->called_exec_mmap (to mark that current is
-	 * using the prepared mm now), we have nothing left of the original
+	 * After setting bprm->mm to NULL, we have nothing left of the original
 	 * process. If anything from here on returns an error, the check
 	 * in search_binary_handler() will SEGV current.
 	 */
-	bprm->called_exec_mmap = 1;
 	bprm->mm = NULL;
 
 #ifdef CONFIG_POSIX_TIMERS
@@ -1477,7 +1476,7 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		if (bprm->called_exec_mmap)
+		if (bprm->update_mutex_held)
 			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
@@ -1720,7 +1719,7 @@ int search_binary_handler(struct linux_binprm *bprm)
 
 		read_lock(&binfmt_lock);
 		put_binfmt(fmt);
-		if (retval < 0 && bprm->called_exec_mmap) {
+		if (retval < 0 && !bprm->mm) {
 			/* we got to flush_old_exec() and failed after it */
 			read_unlock(&binfmt_lock);
 			force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a345d9fed3d8..b815783c8b2c 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -50,7 +50,7 @@ struct linux_binprm {
 		 * This is past the point of no return, when the
 		 * exec_update_mutex has been taken.
 		 */
-		called_exec_mmap:1;
+		update_mutex_held:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 16:23     ` Linus Torvalds
@ 2020-04-03 16:36       ` Bernd Edlinger
  2020-04-04  5:43       ` Bernd Edlinger
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-03 16:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov



On 4/3/20 6:23 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>> In fact, then you could drop the
>>>
>>>                         mutex_unlock(&tsk->signal->exec_update_mutex);
>>>
>>> in the error case of exec_mmap(), because now the error handling in
>>> free_bprm() would do the cleanup automatically.
>>>
>>
>> The error handling is sometimes called when the exec_update_mutex is
>> not taken, in fact even de_thread not called.
> 
> But that's the whole point of the flag. Make the flag be about "do I
> hold the mutex", and then the error handling does the right thing
> regardless.
> 
>> Can you say how you would suggest that to be done?
> 
> I think the easiest thing to do to explain is to just write the patch.
> 
> This is entirely untested, but see what the difference is? I make the
> flag be about exactly where I take the lock, not about some "I have
> called exec_mmap".
> 
> Which means that now exec_mmap() doesn't even need to unlock it in the
> error case, because the unlocking will happen properly in the
> bprm_exit regardless.
> 
> This makes that unconditional unlocking logic much more obvious.
> 
> That said, Eric says he can make it all properly static so that it
> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
> which is much better.
> 
> So this patch is not for consumption, it's purely for "look, something
> like this"
> 

Works for me.  But I also want to wait for Eric, I am curious.
I have a lot of time.


Bernd.

>               Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-02 23:01           ` Eric W. Biederman
                               ` (2 preceding siblings ...)
  2020-04-03  5:09             ` [GIT PULL] Please pull proc and exec work for 5.7-rc1 Bernd Edlinger
@ 2020-04-03 19:26             ` Linus Torvalds
  2020-04-03 20:41               ` Waiman Long
  2020-04-06 22:17               ` Eric W. Biederman
  3 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-03 19:26 UTC (permalink / raw)
  To: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov

[ For Waiman & co - the problem is that the current cred_guard_mutex
is horrendous and has problems with execve() deadlocking against
various users. We've had this bug before, there's a new one, it's just
nasty ]

On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> That is not the direction I intend to take either.
>
> I was hoping I could put off replying to this thread for a bit because
> I only managed to get 4 hours of sleep last night and I am not as alert
> to technical details as I would like to be.

Hmm.. So I've been looking at this cred_guard_mutex, and I wonder...

This is a bit hand-wavy, because I haven't walked through all the
paths, but could we perhaps work around a lot of the problems a
different way, namely:

 - make the "cred_guard_mutex" an rwsem-like thing instead of being a mutex.

 - make the ptrace_attach() case get it for writing - not because
ptrace changes the creds, but because ptrace changes 'task->ptrace'
and depends on dumpability etc.

 - change the *name* of that damn thing. Not because it's now
rwsem'ish rather than a mutex, but because it was never really about
just "creds". It was about creds+ptrace+dumpable flags etc.

 - make all the ones that read the creds to just take it for reading
(IOW, the cases that were basically switched over to
exec_update_mutex).

 - and finally: make "execve()" take it just for reading too, but
introduce an "upgrade to write" at the very end (when it actually is
all done and then finally changes the creds and dumpability)

Wouldn't that solve all problems? We wouldn't get deadlocks wrt
execve(), simply because execve() doesn't need it to be writable, and
the things execve() does and can deadlock all only want readability.

But hear me out, because the above is fundamentally broken in a couple
of ways, so let me address that brokenness before you tell me I'm a
complete nincompoop and an idiot.

I'm including some locking people here because of these issues, so
that they can maybe verify my thinking.

 (a) our rwsem's are fair

     So the whole "execve takes it for reading, so now others can take
it for reading too without deadlocks" is simply not true - if you use
the existing rwsem.

     Because a concurrent (blocked) writer will then block other
readers for fairness reasons, and holding it for reading doesn't
guarantee that others can get it for reading.

     So clearly, the above doesn't even *fix* the deadlocks - unless
we have an unfair mode (or just a special lock for just this that is
not our standard rwsem, but a special unfair one).

     So I'm suggesting we use a special unfair rwsem here (we can make
a simple spinlock-based one - it doesn't need to be as clever or
optimized as the real rwsems are)

 (b) similarly, our rwsem's don't actually have an "upgrade from read
to write", because that's also a fundamentally deadlocky operation.

     Again, that's true. Except execve() is special, and we know
there's only _one_ execve() at a time that will complete, since we're
serializing them. So for this particular use, "upgrade to write" would
be possible without the general-case deadlock issues.

 (c) I didn't think things through, and even with these special
semantics, my idea is complete garbage

     Ok, this may well be true.
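
To illustrate point (a), here is a small sketch of why fairness brings the
deadlock back.  It is a toy spinlock-guarded lock with try-style operations
that return false where a real lock would sleep; the type and function names
are hypothetical and this is not the kernel's rw_semaphore.  The only
difference between the two read paths is whether a queued writer is allowed
to hold new readers back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock, not the kernel's struct rw_semaphore. */
struct toy_rwsem {
	atomic_flag guard;	/* tiny spinlock protecting the fields below */
	int readers;		/* number of read holders */
	bool writer;		/* true while write-locked */
	int writers_waiting;	/* writers that asked and could not get in */
};

#define TOY_RWSEM_INIT { .guard = ATOMIC_FLAG_INIT }

static void guard_lock(struct toy_rwsem *s)
{
	while (atomic_flag_test_and_set_explicit(&s->guard, memory_order_acquire))
		;	/* spin */
}

static void guard_unlock(struct toy_rwsem *s)
{
	atomic_flag_clear_explicit(&s->guard, memory_order_release);
}

/* Fair read: refuses to overtake a writer that is already waiting. */
static bool read_trylock_fair(struct toy_rwsem *s)
{
	guard_lock(s);
	bool ok = !s->writer && s->writers_waiting == 0;
	if (ok)
		s->readers++;
	guard_unlock(s);
	return ok;
}

/* Unfair read: only an active writer keeps a reader out. */
static bool read_trylock_unfair(struct toy_rwsem *s)
{
	guard_lock(s);
	bool ok = !s->writer;
	if (ok)
		s->readers++;
	guard_unlock(s);
	return ok;
}

static void read_unlock(struct toy_rwsem *s)
{
	guard_lock(s);
	assert(s->readers > 0);
	s->readers--;
	guard_unlock(s);
}

/* A writer announces itself; a real lock would now put it to sleep. */
static void write_enqueue(struct toy_rwsem *s)
{
	guard_lock(s);
	s->writers_waiting++;
	guard_unlock(s);
}

/* A queued writer gets the lock only once every reader has left. */
static bool write_trylock_queued(struct toy_rwsem *s)
{
	guard_lock(s);
	bool ok = !s->writer && s->readers == 0 && s->writers_waiting > 0;
	if (ok) {
		s->writer = true;
		s->writers_waiting--;
	}
	guard_unlock(s);
	return ok;
}
```

With execve() holding the lock for reading and one writer queued, a fair
reader is refused even though no writer holds the lock.  If execve() is
itself waiting on that reader, nobody makes progress; the unfair variant
lets the reader through.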

Anyway, the advantage of this (if it works) is that it would allow us
to go back to the _really_ simple original model of just taking this
lock for reading at the beginning of execve(), and not worrying so
much about complex nesting or very complex rules for exactly when we
got the lock and error handling.

The final part when we actually update the credentials and dumpability
and stuff in execve() is actually fairly simple. So the "upgrade to a
write lock" phase doesn't worry me too much.  It's the interaction
with all the previous parts (which happen with it held just for
reading) that tend to be the nastier ones.

And ptrace_attach() really is special, and I think it would be the
only one that really needs that write lock.

The disadvantage, of course, is that it would require that
special-case lock semantic, and I might also be missing something
that makes it not work anyway.

Comments? Am I just dreaming of a simpler model without my medications again?

             Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 19:26             ` Linus Torvalds
@ 2020-04-03 20:41               ` Waiman Long
  2020-04-03 20:59                 ` Linus Torvalds
  2020-04-06 22:17               ` Eric W. Biederman
  1 sibling, 1 reply; 127+ messages in thread
From: Waiman Long @ 2020-04-03 20:41 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman, Ingo Molnar, Will Deacon
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 3:26 PM, Linus Torvalds wrote:
> I'm including some locking people here because of these issues, so
> that they can maybe verify my thinking.
>
>  (a) our rwsem's are fair
>
>      So the whole "execve takes it for reading, so now others can take
> it for reading too without deadlocks" is simply not true - if you use
> the existing rwsem.
>
>      Because a concurrent (blocked) writer will then block other
> readers for fairness reasons, and holding it for reading doesn't
> guarantee that others can get it for reading.
>
>      So clearly, the above doesn't even *fix* the deadlocks - unless
> we have an unfair mode (or just a special lock for just this that is
> not our standard rwsem, but a special unfair one).
>
>      So I'm suggesting we use a special unfair rwsem here (we can make
> a simple spinlock-based one - it doesn't need to be as clever or
> optimized as the real rwsems are)
>
>  (b) similarly, our rwsem's don't actually have a "upgrade from read
> to write", because that's also a fundamentally deadlocky operation.
>
>      Again, that's true. Except execve() is special, and we know
> there's only _one_ execve() at a time that will complete, since we're
> serializing them. So for this particular use, "upgrade to write" would
> be possible without the general-case deadlock issues.
>
>  (c) I didn't think things through, and even with these special
> semantics, my idea is complete garbage
>
>      Ok, this may well be true.
>
> Anyway, the advantage of this (if it works) is that it would allow us
> to go back to the _really_ simple original model of just taking this
> lock for reading at the beginning of execve(), and not worrying so
> much about complex nesting or very complex rules for exactly when we
> got the lock and error handling.
>
> The final part when we actually update the credentials and dumpability
> and stuff in execve() is actually fairly simple. So the "upgrade to a
> write lock" phase doesn't worry me too much.  It's the interaction
> with all the previous parts (which happen with it held just for
> reading) that tend to be the nastier ones.
>
> And ptrace_attach() really is special, and I think it would be the
> only one that really needs that write lock.

Making an unfair rwsem that prefers readers (like the original rwlock
semantics) is certainly doable. I don't think that is hard to do. I can
think of 2 possible ways to do that. We could make the unfairness
apply globally to all the readers of a rwsem by defining the fairness
state at init time. That will require keeping the state in the rwsem
structure, increasing its size.

Another alternative is to add new functions like down_read_unfair() that
perform unfair read locking for their callers. That will require less code
change, but the calling functions have to make the right choice.
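
A rough sketch of how the two options differ at the API level, using a toy
lock with try-style operations.  All names here are hypothetical (toy_rwsem,
toy_down_read_trylock, toy_down_read_unfair_trylock), and nothing below is
existing rwsem code; it is only meant to show where the decision lives in
each case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock, not the kernel's struct rw_semaphore. */
struct toy_rwsem {
	atomic_flag guard;	/* tiny spinlock protecting the fields below */
	int readers;
	bool writer;
	int writers_waiting;
	bool reader_preferring;	/* option 1: extra state, fixed at init time */
};

#define TOY_RWSEM_INIT(unfair) \
	{ .guard = ATOMIC_FLAG_INIT, .reader_preferring = (unfair) }

static void guard_lock(struct toy_rwsem *s)
{
	while (atomic_flag_test_and_set_explicit(&s->guard, memory_order_acquire))
		;	/* spin */
}

static void guard_unlock(struct toy_rwsem *s)
{
	atomic_flag_clear_explicit(&s->guard, memory_order_release);
}

/* Common helper: take a read lock, optionally overtaking queued writers. */
static bool try_read(struct toy_rwsem *s, bool ignore_waiting_writers)
{
	guard_lock(s);
	bool ok = !s->writer &&
		  (ignore_waiting_writers || s->writers_waiting == 0);
	if (ok)
		s->readers++;
	guard_unlock(s);
	return ok;
}

/* Option 1: one entry point, the lock's init-time policy decides. */
static bool toy_down_read_trylock(struct toy_rwsem *s)
{
	return try_read(s, s->reader_preferring);
}

/* Option 2: the caller decides per call, whatever the lock's policy is. */
static bool toy_down_read_unfair_trylock(struct toy_rwsem *s)
{
	return try_read(s, true);
}

/* A writer announces itself; a real lock would now put it to sleep. */
static void toy_write_enqueue(struct toy_rwsem *s)
{
	guard_lock(s);
	s->writers_waiting++;
	guard_unlock(s);
}
```

Option 1 costs a field in every lock but keeps the call sites unchanged;
option 2 leaves the structure alone and moves the decision to each caller.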

Cheers,
Longman



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 20:41               ` Waiman Long
@ 2020-04-03 20:59                 ` Linus Torvalds
  2020-04-03 23:16                   ` Waiman Long
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-03 20:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <longman@redhat.com> wrote:
>
> Another alternative is to add new functions like down_read_unfair() that
> perform unfair read locking for its callers. That will require less code
> change, but the calling functions have to make the right choice.

I'd prefer the static choice model - and I'd hide this in some
"task_cred_read_lock()" function anyway rather than have the users do
"mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
now.

How nasty would it be to add the "upgrade" op? I took a quick look,
but that just made me go "Waiman would know" ;)

             Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 20:59                 ` Linus Torvalds
@ 2020-04-03 23:16                   ` Waiman Long
  2020-04-03 23:23                     ` Waiman Long
  2020-04-04  4:23                     ` Bernd Edlinger
  0 siblings, 2 replies; 127+ messages in thread
From: Waiman Long @ 2020-04-03 23:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 4:59 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <longman@redhat.com> wrote:
>> Another alternative is to add new functions like down_read_unfair() that
>> perform unfair read locking for its callers. That will require less code
>> change, but the calling functions have to make the right choice.
> I'd prefer the static choice model - and I'd hide this in some
> "task_cred_read_lock()" function anyway rather than have the users do
> "mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
> now.
>
> How nasty would it be to add the "upgrade" op? I took a quick look,
> but that just made me go "Waiman would know" ;)
>
>              Linus
>
With static choice, you mean defined at init time, right? In that case,
you don't really need a special encapsulation function.

With upgrade, if there is only one reader, it is pretty straightforward.
With more than one reader, it gets more complicated as we have
to wait for the other readers to unlock. We can spin for a certain period of
time. After that, the upgrading reader can use the handoff mechanism by queuing
itself at the front of the wait queue before releasing the read lock and going to
sleep. That will make sure that it will get the lock once all the other
readers exit. For an unfair rwsem, the writer cannot assert the handoff
bit and so it shouldn't interfere with this upgrade process.

If there are multiple upgrade readers, only one can win the race. The
others have to release the read lock and queue themselves as writers.
Will that be acceptable?

Cheers,
Longman





^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 23:16                   ` Waiman Long
@ 2020-04-03 23:23                     ` Waiman Long
  2020-04-04  1:30                       ` Linus Torvalds
  2020-04-04  4:23                     ` Bernd Edlinger
  1 sibling, 1 reply; 127+ messages in thread
From: Waiman Long @ 2020-04-03 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 7:16 PM, Waiman Long wrote:
> On 4/3/20 4:59 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <longman@redhat.com> wrote:
>>> Another alternative is to add new functions like down_read_unfair() that
>>> perform unfair read locking for its callers. That will require less code
>>> change, but the calling functions have to make the right choice.
>> I'd prefer the static choice model - and I'd hide this in some
>> "task_cred_read_lock()" function anyway rather than have the users do
>> "mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
>> now.
>>
>> How nasty would it be to add the "upgrade" op? I took a quick look,
>> but that just made me go "Waiman would know" ;)
>>
>>              Linus
>>
> With static choice, you mean defined at init time. Right? In that case,
> you don't really need a special encapsulation function.
>
> With upgrade, if there is only one reader, it is pretty straight
> forward. With more than one readers, it gets more complicated as we have
> to wait for other readers to unlock. We can spin for a certain period of
> time. After that, that reader can use the handoff mechanism by queuing
> itself in front the wait queue before releasing the read lock and go to
> sleep. That will make sure that it will get the lock once all the other
> readers exits. For an unfair rwsem, the writer cannot assert the handoff
> bit and so it shouldn't interfere with this upgrade process.
>
> If there are multiple upgrade readers, only one can win the race. The
> others have to release the read lock and queue themselves as writers.
> Will that be acceptable?

Alternatively, we could assert that only one reader can do the upgrade
and do a WARN_ON_ONCE() if multiple concurrent upgrade attempts are detected.
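
As a sketch of that single-upgrader rule, assuming a toy spinlock-guarded
lock with try-style operations (all names hypothetical, not existing rwsem
code): a reader first claims the one upgrade slot, and a second claim simply
fails, which is where the kernel version would WARN_ON_ONCE().  Holding new
readers back once an upgrade has started is an extra assumption of this
sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock, not the kernel's struct rw_semaphore. */
struct toy_rwsem {
	atomic_flag guard;	/* tiny spinlock protecting the fields below */
	int readers;		/* number of read holders */
	bool writer;		/* true while write-locked */
	bool upgrader;		/* a reader has claimed the single upgrade slot */
};

#define TOY_RWSEM_INIT { .guard = ATOMIC_FLAG_INIT }

static void guard_lock(struct toy_rwsem *s)
{
	while (atomic_flag_test_and_set_explicit(&s->guard, memory_order_acquire))
		;	/* spin */
}

static void guard_unlock(struct toy_rwsem *s)
{
	atomic_flag_clear_explicit(&s->guard, memory_order_release);
}

/* New readers are held back once an upgrade is pending (an assumption). */
static bool read_trylock(struct toy_rwsem *s)
{
	guard_lock(s);
	bool ok = !s->writer && !s->upgrader;
	if (ok)
		s->readers++;
	guard_unlock(s);
	return ok;
}

static void read_unlock(struct toy_rwsem *s)
{
	guard_lock(s);
	assert(s->readers > 0);
	s->readers--;
	guard_unlock(s);
}

/* Claim the upgrade slot; caller must hold a read lock.  A false return
 * is the "multiple concurrent upgrade attempts" case. */
static bool upgrade_begin(struct toy_rwsem *s)
{
	guard_lock(s);
	assert(s->readers > 0);
	bool ok = !s->upgrader;
	if (ok)
		s->upgrader = true;
	guard_unlock(s);
	return ok;
}

/* Turn the read lock into a write lock once the caller is the last reader. */
static bool upgrade_try_finish(struct toy_rwsem *s)
{
	guard_lock(s);
	assert(s->upgrader && s->readers > 0);
	bool ok = s->readers == 1;
	if (ok) {
		s->readers = 0;
		s->upgrader = false;
		s->writer = true;
	}
	guard_unlock(s);
	return ok;
}

static void write_unlock(struct toy_rwsem *s)
{
	guard_lock(s);
	assert(s->writer);
	s->writer = false;
	guard_unlock(s);
}
```

A real implementation would sleep or spin in place of the false return from
upgrade_try_finish(); the try form just keeps the rule easy to see.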

Regards,
Longman


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 23:23                     ` Waiman Long
@ 2020-04-04  1:30                       ` Linus Torvalds
  2020-04-04  2:02                         ` Waiman Long
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-04  1:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Fri, Apr 3, 2020 at 4:23 PM Waiman Long <longman@redhat.com> wrote:
>
> Alternatively, we could assert that only one reader can do the upgrade
> and do a WARN_ON_ONCE() if multiple concurrent upgrade attempts is detected.

Ack, that would be best.

[ And since I'm not on mobile any more, and my html email got thrown
out by the list, I'll just repeat that by "static choice" I mean "no
runtime decisions or flags": code that needs the unfair behavior would
use a special unfair function ]

              Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  1:30                       ` Linus Torvalds
@ 2020-04-04  2:02                         ` Waiman Long
  2020-04-04  2:28                           ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Waiman Long @ 2020-04-04  2:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 9:30 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 4:23 PM Waiman Long <longman@redhat.com> wrote:
>> Alternatively, we could assert that only one reader can do the upgrade
>> and do a WARN_ON_ONCE() if multiple concurrent upgrade attempts is detected.
> Ack, that would be best.
>
> [ And since I'm not on mobile any more, and my html email got thrown
> out by the list, I'll just repeat that by "static choice" I mean "no
> runtime decisions or flags": code that needs the unfair behavior would
> use a special unfair function ]
>
>               Linus
>
Got it.

So in terms of priority, my current thinking is

    upgrading unfair reader > unfair reader > reader/writer

A higher priority locker will block other lockers from acquiring the lock.

Thoughts?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  2:02                         ` Waiman Long
@ 2020-04-04  2:28                           ` Linus Torvalds
  2020-04-04  6:34                             ` Bernd Edlinger
                                               ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-04  2:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <longman@redhat.com> wrote:
>
> So in term of priority, my current thinking is
>
>     upgrading unfair reader > unfair reader > reader/writer
>
> A higher priority locker will block other lockers from acquiring the lock.

An alternative option might be to have readers normally be 100% normal
(ie with fairness wrt writers), and not really introduce any special
"unfair reader" lock.

Instead, all the unfairness would come into play only when the special
case - execve() - does its special "lock for reading with intent to
upgrade".

But when it enters that kind of "intent to upgrade" lock state, it
would not only block all subsequent writers, it would also guarantee
that all other readers can continue to go.

So then the new rwsem operations would be

 - read_with_write_intent_lock_interruptible()

   This is the beginning of "execve()", and waits for all writers to
exit, and puts the lock into "all readers can go" mode.

   You could think of it as a "I'm queuing myself for a write lock,
but I'm allowing readers to go ahead" state.

 - read_lock_to_write_upgrade()

   This is the "now this turns into a regular write lock". It needs to
wait for all other readers to exit, of course.

 - read_with_write_intent_unlock()

   This is the "I'm unqueuing myself, I aborted and will not become a
write lock after all" operation.

NOTE! In this model, there may be multiple threads that do that
initial queuing thing. We only guarantee that only one of them will
get to the actual write lock stage, and the others will abort before
that happens.

If that is a more natural state machine, then that should work fine
too. And it has some advantages, in that it keeps the readers normally
fair, and only turns them unfair when we get to that special
read-for-write stage.

But whatever is most natural for the rwsem code. Entirely up to you.

               Linus
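
As a rough illustration of the three operations above, here is a minimal
single-threaded model in C. It is only a sketch of the state machine, not
kernel code: the struct, the model_* names and the trylock-style return
values are all hypothetical, and a real rwsem would sleep where this model
returns false. Two choices are assumptions of the model rather than
anything stated in the proposal: new readers and new intent holders are
refused once an upgrade has started, and an upgrade only completes after
every other intent holder has aborted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the lock state, not the kernel's struct rw_semaphore. */
struct intent_lock {
	int readers;     /* plain readers holding the lock */
	int intents;     /* holders in "read with write intent" state */
	bool upgrading;  /* an intent holder has asked to become the writer */
	bool writer;     /* a writer holds the lock */
};

/* Plain reader: refused while a writer holds the lock or an upgrade is draining. */
bool model_read_trylock(struct intent_lock *l)
{
	if (l->writer || l->upgrading)
		return false;
	l->readers++;
	return true;
}

void model_read_unlock(struct intent_lock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

/* Plain writer: refused while anyone holds the lock, so intent holders block writers. */
bool model_write_trylock(struct intent_lock *l)
{
	if (l->writer || l->readers > 0 || l->intents > 0)
		return false;
	l->writer = true;
	return true;
}

void model_write_unlock(struct intent_lock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Models read_with_write_intent_lock: waits only for writers, several may hold it. */
bool model_intent_trylock(struct intent_lock *l)
{
	if (l->writer || l->upgrading)
		return false;
	l->intents++;
	return true;
}

/* Models read_with_write_intent_unlock: abort without becoming the writer. */
void model_intent_unlock(struct intent_lock *l)
{
	assert(l->intents > 0);
	if (--l->intents == 0)
		l->upgrading = false;
}

/* Models read_lock_to_write_upgrade: succeeds once the caller is the only holder left. */
bool model_upgrade_try(struct intent_lock *l)
{
	assert(l->intents > 0);
	l->upgrading = true;            /* stop admitting new readers */
	if (l->readers > 0 || l->intents > 1)
		return false;           /* a real lock would sleep here */
	l->intents = 0;
	l->upgrading = false;
	l->writer = true;
	return true;
}
```

In this model, readers keep getting the lock while one or more threads
hold it with write intent, writers are held off for the whole time, and
the upgrade completes only after the plain readers have left and the
competing intent holders have called the unlock operation.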

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 23:16                   ` Waiman Long
  2020-04-03 23:23                     ` Waiman Long
@ 2020-04-04  4:23                     ` Bernd Edlinger
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-04  4:23 UTC (permalink / raw)
  To: Waiman Long, Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov



On 4/4/20 1:16 AM, Waiman Long wrote:
> On 4/3/20 4:59 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <longman@redhat.com> wrote:
>>> Another alternative is to add new functions like down_read_unfair() that
>>> perform unfair read locking for its callers. That will require less code
>>> change, but the calling functions have to make the right choice.
>> I'd prefer the static choice model - and I'd hide this in some
>> "task_cred_read_lock()" function anyway rather than have the users do
>> "mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
>> now.
>>
>> How nasty would it be to add the "upgrade" op? I took a quick look,
>> but that just made me go "Waiman would know" ;)
>>
>>              Linus
>>
> With static choice, you mean defined at init time. Right? In that case,
> you don't really need a special encapsulation function.
> 
> With upgrade, if there is only one reader, it is pretty straight
> forward. With more than one readers, it gets more complicated as we have
> to wait for other readers to unlock. We can spin for a certain period of
> time. After that, that reader can use the handoff mechanism by queuing
> itself in front the wait queue before releasing the read lock and go to
> sleep. That will make sure that it will get the lock once all the other
> readers exits. For an unfair rwsem, the writer cannot assert the handoff
> bit and so it shouldn't interfere with this upgrade process.
> 
> If there are multiple upgrade readers, only one can win the race. The
> others have to release the read lock and queue themselves as writers.
> Will that be acceptable?
> 

Someone pointed out previously, I think,
that with real-time Linux
the rwsems are just mutexes, and we
had better not base our design on that.

To me linux_rt is a must.

Thanks
Bernd.

> Cheers,
> Longman
> 
> 
> 
> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 16:23     ` Linus Torvalds
  2020-04-03 16:36       ` Bernd Edlinger
@ 2020-04-04  5:43       ` Bernd Edlinger
  2020-04-04  5:48         ` Bernd Edlinger
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-04  5:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov



On 4/3/20 6:23 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>> In fact, then you could drop the
>>>
>>>                         mutex_unlock(&tsk->signal->exec_update_mutex);
>>>
>>> in the error case of exec_mmap(), because now the error handling in
>>> free_bprm() would do the cleanup automatically.
>>>
>>
>> The error handling is sometimes called when the exec_update_mutex is
>> not taken, in fact even de_thread not called.
> 
> But that's the whole point of the flag. Make the flag be about "do I
> hold the mutex", and then the error handling does the right thing
> regardless.
> 
>> Can you say how you would suggest that to be done?
> 
> I think the easiest thing to do to explain is to just write the patch.
> 
> This is entirely untested, but see what the difference is? I make the
> flag be about exactly where I take the lock, not about some "I have
> called exec_mmap".
> 
> Which means that now exec_mmap() doesn't even need to unlock it in the
> error case, because the unlocking will happen properly in the
> bprm_exit regardless.
> 
> This makes that unconditional unlocking logic much more obvious.
> 
> That said, Eric says he can make it all properly static so that it
> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
> which is much better.
> 
> So this patch is not for consumption, it's purely for "look, something
> like this"
> 


Just one suggestion: in general, I would be perfectly happy for you to
improve the naming and the consistency in any of my patches.

> @@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
>  		down_read(&old_mm->mmap_sem);
>  		if (unlikely(old_mm->core_state)) {
>  			up_read(&old_mm->mmap_sem);
> -			mutex_unlock(&tsk->signal->exec_update_mutex);

I was trying to replicate the behavior of prepare_bprm_creds
which also unlocks the mutex in the error case, therefore it felt
okay to unlock the mutex here, but it will work either way.

I should further note that the mutex would still be locked if this
error exit is taken, but would not be locked if this error happens:

        ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
        if (ret)
                return ret;

so at least the function comment I introduced above should be updated:
 * Maps the mm_struct mm into the current task struct.
 * On success, this function returns with the mutex
 * exec_update_mutex locked.


>  		put_binfmt(fmt);
> -		if (retval < 0 && bprm->called_exec_mmap) {
> +		if (retval < 0 && !bprm->mm) {

Using bprm->mm like this feels like a hack to me.  It works here,
but nowhere else.  Therefore I changed this line.

Using !bprm->mm in the error handling code made Eric's patch fail.


Thanks
Bernd.
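
The two error exits described above can be sketched with a small
single-threaded model in C. Everything here is a hypothetical stand-in
(model_mutex, bprm_model, the model_* functions and the
holds_exec_update_mutex flag are not the kernel's names), and returning
the same error code from both exits is an assumption made to show that
the return value alone does not tell the caller whether the mutex is
held. The flag is set exactly where the lock is taken, so the cleanup
can do the right thing for either exit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Single-threaded stand-in for a mutex, enough to track whether it is held. */
struct model_mutex {
	bool locked;
};

/* Hypothetical stand-in for struct linux_binprm. */
struct bprm_model {
	struct model_mutex *exec_update_mutex;
	bool holds_exec_update_mutex;   /* set exactly where the lock is taken */
};

/*
 * Stand-in for exec_mmap().  fail_lock models mutex_lock_killable()
 * being interrupted; fail_after_lock models the core_state error exit.
 */
int model_exec_mmap(struct bprm_model *b, bool fail_lock, bool fail_after_lock)
{
	if (fail_lock)
		return -EINTR;          /* lock never taken, flag stays clear */

	assert(!b->exec_update_mutex->locked);
	b->exec_update_mutex->locked = true;
	b->holds_exec_update_mutex = true;

	if (fail_after_lock)
		return -EINTR;          /* no unlock here, the cleanup owns it */
	return 0;
}

/* Stand-in for the cleanup in free_bprm(): unlock only if the flag says so. */
void model_free_bprm(struct bprm_model *b)
{
	if (b->holds_exec_update_mutex) {
		b->exec_update_mutex->locked = false;
		b->holds_exec_update_mutex = false;
	}
}
```

With the flag tied to the lock itself, the cleanup is a no-op when the
lock was never taken and an unlock when it was, which is the behavior
the "do I hold the mutex" flag is meant to give.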


>               Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  5:43       ` Bernd Edlinger
@ 2020-04-04  5:48         ` Bernd Edlinger
  2020-04-06  6:41           ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-04  5:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov



On 4/4/20 7:43 AM, Bernd Edlinger wrote:
> 
> 
> On 4/3/20 6:23 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>
>>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>>> In fact, then you could drop the
>>>>
>>>>                         mutex_unlock(&tsk->signal->exec_update_mutex);
>>>>
>>>> in the error case of exec_mmap(), because now the error handling in
>>>> free_bprm() would do the cleanup automatically.
>>>>
>>>
>>> The error handling is sometimes called when the exec_update_mutex is
>>> not taken, in fact even de_thread not called.
>>
>> But that's the whole point of the flag. Make the flag be about "do I
>> hold the mutex", and then the error handling does the right thing
>> regardless.
>>
>>> Can you say how you would suggest that to be done?
>>
>> I think the easiest thing to do to explain is to just write the patch.
>>
>> This is entirely untested, but see what the difference is? I make the
>> flag be about exactly where I take the lock, not about some "I have
>> called exec_mmap".
>>
>> Which means that now exec_mmap() doesn't even need to unlock it in the
>> error case, because the unlocking will happen properly in the
>> bprm_exit regardless.
>>
>> This makes that unconditional unlocking logic much more obvious.
>>
>> That said, Eric says he can make it all properly static so that it
>> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
>> which is much better.
>>
>> So this patch is not for consumption, it's purely for "look, something
>> like this"
>>
> 
> 
> Just one suggestion, in general It would feel pretty much okay if you
> like to improve the naming, and the consistency in any of my patches.
> 
>> @@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
>>  		down_read(&old_mm->mmap_sem);
>>  		if (unlikely(old_mm->core_state)) {
>>  			up_read(&old_mm->mmap_sem);
>> -			mutex_unlock(&tsk->signal->exec_update_mutex);
> 
> I was trying to replicate the behavior of prepare_bprm_creds
> which also unlocks the mutex in the error case, therefore it felt
> okay to unlock the mutex here, but it will work either way.
> 
> I should further note, that the mutex would be locked if this
> error exit is taken, and unlocked if this error happens:
> 
>         ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>         if (ret)
>                 return ret;
> 
> so at least the function comment I introduced above should be updated:
>  * Maps the mm_struct mm into the current task struct.
>  * On success, this function returns with the mutex
>  * exec_update_mutex locked.
> 
> 
>>  		put_binfmt(fmt);
>> -		if (retval < 0 && bprm->called_exec_mmap) {
>> +		if (retval < 0 && !bprm->mm) {
> 
> Using bprm->mm like this feels like a hack to me.  It works here,
> but nowhere else.  Therefore I changed this line.
> 
> Using !bprm->mm in the error handling code made Eric's patch fail.
> 

That would probably work better if the boolean were named
after_the_point_of_no_return or something....


> 
> Thanks
> Bernd.
> 
> 
>>               Linus
>>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  2:28                           ` Linus Torvalds
@ 2020-04-04  6:34                             ` Bernd Edlinger
  2020-04-05  6:34                               ` Bernd Edlinger
  2020-04-05  2:42                             ` Waiman Long
  2020-04-06 13:13                             ` Will Deacon
  2 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-04  6:34 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov



On 4/4/20 4:28 AM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <longman@redhat.com> wrote:
>>
>> So in term of priority, my current thinking is
>>
>>     upgrading unfair reader > unfair reader > reader/writer
>>
>> A higher priority locker will block other lockers from acquiring the lock.
> 
> An alternative option might be to have readers normally be 100% normal
> (ie with fairness wrt writers), and not really introduce any special
> "unfair reader" lock.
> 
> Instead, all the unfairness would come into play only when the special
> case - execve() - does it's special "lock for reading with intent to
> upgrade".
> 
> But when it enters that kind of "intent to upgrade" lock state, it
> would not only block all subsequent writers, it would also guarantee
> that all other readers can continue to go).
> 
> So then the new rwsem operations would be
> 
>  - read_with_write_intent_lock_interruptible()
> 
>    This is the beginning of "execve()", and waits for all writers to
> exit, and puts the lock into "all readers can go" mode.
> 
>    You could think of it as a "I'm queuing myself for a write lock,
> but I'm allowing readers to go ahead" state.
> 
>  - read_lock_to_write_upgrade()
> 
>    This is the "now this turns into a regular write lock". It needs to
> wait for all other readers to exit, of course.
> 
>  - read_with_write_intent_unlock()
> 
>    This is the "I'm unqueuing myself, I aborted and will not become a
> write lock after all" operation.
> 
> NOTE! In this model, there may be multiple threads that do that
> initial queuing thing. We only guarantee that only one of them will
> get to the actual write lock stage, and the others will abort before
> that happens.

One of the problems that adds to the current situation is that the
cred_guard_mutex is sometimes taken with a killable lock, so the waiter
can be killed by de_thread. But in other places the cred_guard_mutex lock
is not killable, so the waiter can neither get the lock nor be killed
-> deadlock.


But Fear Not!

Overall we are in a pretty good position to defeat the
enemy now, once and for all.

- We have my ugly-crazy patch that just works.

- We will have Eric's patch that is even better.

- We can try to put something together with creative new rw-type semaphores.

- We can merge ideas from one of the patches to another.


So it is impossible that we fail to fix it this time :-)


Bernd.

> 
> If that is a more natural state machine, then that should work fine
> too. And it has some advantages, in that it keeps the readers normally
> fair, and only turns them unfair when we get to that special
> read-for-write stage.
> 
> But whatever it most natural for the rwsem code. Entirely up to you.
> 
>                Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  2:28                           ` Linus Torvalds
  2020-04-04  6:34                             ` Bernd Edlinger
@ 2020-04-05  2:42                             ` Waiman Long
  2020-04-05  3:35                               ` Bernd Edlinger
  2020-04-06 13:13                             ` Will Deacon
  2 siblings, 1 reply; 127+ messages in thread
From: Waiman Long @ 2020-04-05  2:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/3/20 10:28 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <longman@redhat.com> wrote:
>> So in term of priority, my current thinking is
>>
>>     upgrading unfair reader > unfair reader > reader/writer
>>
>> A higher priority locker will block other lockers from acquiring the lock.
> An alternative option might be to have readers normally be 100% normal
> (ie with fairness wrt writers), and not really introduce any special
> "unfair reader" lock.
A regular down_read() caller will be handled normally.
> Instead, all the unfairness would come into play only when the special
> case - execve() - does it's special "lock for reading with intent to
> upgrade".
>
> But when it enters that kind of "intent to upgrade" lock state, it
> would not only block all subsequent writers, it would also guarantee
> that all other readers can continue to go).

Yes, that shouldn't be hard to do. If that is what is required, we may
only need a special upgrade function to drain the OSQ and then wake up
all the readers in the wait queue. I will add a flags argument to that
special upgrade function so that we may be able to select different
behavior in the future.

The regular down_read_interruptible() can be used unless we want to
designate that only some readers are allowed to upgrade, by having them
call a special down_read() function.
>
> So then the new rwsem operations would be
>
>  - read_with_write_intent_lock_interruptible()
>
>    This is the beginning of "execve()", and waits for all writers to
> exit, and puts the lock into "all readers can go" mode.
>
>    You could think of it as a "I'm queuing myself for a write lock,
> but I'm allowing readers to go ahead" state.
>
>  - read_lock_to_write_upgrade()
>
>    This is the "now this turns into a regular write lock". It needs to
> wait for all other readers to exit, of course.
>
>  - read_with_write_intent_unlock()
>
>    This is the "I'm unqueuing myself, I aborted and will not become a
> write lock after all" operation.
>
> NOTE! In this model, there may be multiple threads that do that
> initial queuing thing. We only guarantee that only one of them will
> get to the actual write lock stage, and the others will abort before
> that happens.
>
> If that is a more natural state machine, then that should work fine
> too. And it has some advantages, in that it keeps the readers normally
> fair, and only turns them unfair when we get to that special
> read-for-write stage.
>
> But whatever it most natural for the rwsem code. Entirely up to you.

To be symmetric with the existing downgrade_write() function, I will
choose the name upgrade_read() for the upgrade function.

Will that work for you?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-05  2:42                             ` Waiman Long
@ 2020-04-05  3:35                               ` Bernd Edlinger
  2020-04-05  3:45                                 ` Waiman Long
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-05  3:35 UTC (permalink / raw)
  To: Waiman Long, Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/5/20 4:42 AM, Waiman Long wrote:
> On 4/3/20 10:28 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <longman@redhat.com> wrote:
>>> So in term of priority, my current thinking is
>>>
>>>     upgrading unfair reader > unfair reader > reader/writer
>>>
>>> A higher priority locker will block other lockers from acquiring the lock.
>> An alternative option might be to have readers normally be 100% normal
>> (ie with fairness wrt writers), and not really introduce any special
>> "unfair reader" lock.
> A regular down_read() caller will be handled normally.
>> Instead, all the unfairness would come into play only when the special
>> case - execve() - does it's special "lock for reading with intent to
>> upgrade".
>>
>> But when it enters that kind of "intent to upgrade" lock state, it
>> would not only block all subsequent writers, it would also guarantee
>> that all other readers can continue to go).
> 
> Yes, that shouldn't be hard to do. If that is what is required, we may
> only need a special upgrade function to drain the OSQ and then wake up
> all the readers in the wait queue. I will add a flags argument to that
> special upgrade function so that we may be able to select different
> behavior in the future.
> 
> The regular down_read_interruptible() can be used unless we want to
> designate only some readers are allowed to do upgrade by calling a
> special down_read() function.
>>
>> So then the new rwsem operations would be
>>
>>  - read_with_write_intent_lock_interruptible()
>>
>>    This is the beginning of "execve()", and waits for all writers to
>> exit, and puts the lock into "all readers can go" mode.
>>
>>    You could think of it as a "I'm queuing myself for a write lock,
>> but I'm allowing readers to go ahead" state.
>>
>>  - read_lock_to_write_upgrade()
>>
>>    This is the "now this turns into a regular write lock". It needs to
>> wait for all other readers to exit, of course.
>>
>>  - read_with_write_intent_unlock()
>>
>>    This is the "I'm unqueuing myself, I aborted and will not become a
>> write lock after all" operation.
>>
>> NOTE! In this model, there may be multiple threads that do that
>> initial queuing thing. We only guarantee that only one of them will
>> get to the actual write lock stage, and the others will abort before
>> that happens.
>>
>> If that is a more natural state machine, then that should work fine
>> too. And it has some advantages, in that it keeps the readers normally
>> fair, and only turns them unfair when we get to that special
>> read-for-write stage.
>>
>> But whatever it most natural for the rwsem code. Entirely up to you.
> 
> To be symmetric with the existing downgrade_write() function, I will
> choose the name upgrade_read() for the upgrade function.
> 
> Will that work for you?
> 

May I ask, if the proposed rwsem will also work for RT-linux,
or will it be a normal mutex there?


Thanks
Bernd.


> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-05  3:35                               ` Bernd Edlinger
@ 2020-04-05  3:45                                 ` Waiman Long
  0 siblings, 0 replies; 127+ messages in thread
From: Waiman Long @ 2020-04-05  3:45 UTC (permalink / raw)
  To: Bernd Edlinger, Linus Torvalds
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/4/20 11:35 PM, Bernd Edlinger wrote:
>> To be symmetric with the existing downgrade_write() function, I will
>> choose the name upgrade_read() for the upgrade function.
>>
>> Will that work for you?
>>
> May I ask, if the proposed rwsem will also work for RT-linux,
> or will it be a normal mutex there?

Good question. RT has its own special code for rwsem. I need to take
a look at that to see if something like that is possible.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  6:34                             ` Bernd Edlinger
@ 2020-04-05  6:34                               ` Bernd Edlinger
  2020-04-05 19:35                                 ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-05  6:34 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long
  Cc: Eric W. Biederman, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov



On 4/4/20 8:34 AM, Bernd Edlinger wrote:
> 
> 
> On 4/4/20 4:28 AM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <longman@redhat.com> wrote:
>>>
>>> So in term of priority, my current thinking is
>>>
>>>     upgrading unfair reader > unfair reader > reader/writer
>>>
>>> A higher priority locker will block other lockers from acquiring the lock.
>>
>> An alternative option might be to have readers normally be 100% normal
>> (ie with fairness wrt writers), and not really introduce any special
>> "unfair reader" lock.
>>
>> Instead, all the unfairness would come into play only when the special
>> case - execve() - does it's special "lock for reading with intent to
>> upgrade".
>>
>> But when it enters that kind of "intent to upgrade" lock state, it
>> would not only block all subsequent writers, it would also guarantee
>> that all other readers can continue to go).
>>
>> So then the new rwsem operations would be
>>
>>  - read_with_write_intent_lock_interruptible()
>>
>>    This is the beginning of "execve()", and waits for all writers to
>> exit, and puts the lock into "all readers can go" mode.
>>
>>    You could think of it as a "I'm queuing myself for a write lock,
>> but I'm allowing readers to go ahead" state.
>>
>>  - read_lock_to_write_upgrade()
>>
>>    This is the "now this turns into a regular write lock". It needs to
>> wait for all other readers to exit, of course.
>>
>>  - read_with_write_intent_unlock()
>>
>>    This is the "I'm unqueuing myself, I aborted and will not become a
>> write lock after all" operation.
>>
>> NOTE! In this model, there may be multiple threads that do that
>> initial queuing thing. We only guarantee that only one of them will
>> get to the actual write lock stage, and the others will abort before
>> that happens.
> 
> One of the problems that add to the current situation, is that sometimes
> the cred_guard_mutex is locked killable, so can be killed by de_thread.
> But in other places cred_guard_mutex is not killable. So cannot be
> locked and cannot be killed either -> dead-lock.
> 
> 
> But Fear Not!
> 
> Overall we are pretty much in a good position to defeat the
> enemy now, once an forever.
> 
> - We have my ugly-crazy patch that just works.
> 
> - We will have Eric's patch that is even better.
> 
> - We can try to put something togeter with creative new rw-type semaphores.
> 
> - We can merge ideas from one of the patches to another.
> 
> 
> So it is impossible we not succeed to fix it this time :-)
> 

BTW, there is one other independent thing that came to my attention when I
first tried to fix the ptrace deadlock from user space.

In order to break the deadlock from user space, the strace program would have
to be rewritten to be multi-threaded.  I tried that, but the problem was
that an event from the tracee can be received either in the main thread,
or in a signal handler, or in another thread.  I tried to implement both
possibilities; see my strace patches which I pointed out previously here.

The signal handler approach completely failed, and the second thread
approach did not completely fail but was just definitely insane.

What really makes it impossible to write a multi-threaded strace program
is that *only* the thread that made PTRACE_ATTACH can do all the other
PTRACE-APIs, but for a multi-threaded strace, any thread should be able
to call PTRACE-APIs as long as we are in the same process.

I don't know if that is really hard to achieve, but it seems like something
that would allow user space much more flexibility.

What do you think?


Thanks
Bernd.


> 
> Bernd.
> 
>>
>> If that is a more natural state machine, then that should work fine
>> too. And it has some advantages, in that it keeps the readers normally
>> fair, and only turns them unfair when we get to that special
>> read-for-write stage.
>>
>> But whatever it most natural for the rwsem code. Entirely up to you.
>>
>>                Linus
>>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-05  6:34                               ` Bernd Edlinger
@ 2020-04-05 19:35                                 ` Linus Torvalds
  0 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-05 19:35 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Waiman Long, Eric W. Biederman, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov

On Sat, Apr 4, 2020 at 11:34 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> What really makes it impossible to write a multi-threaded strace program,
> is that *only* the tread that made PTRACE_ATTACH can do all the other
> PTRACE-APIs, but for a multi-treaded strace, any thread should be able
> to call PTRACE-APIs as long as we are in the same process.

I agree that the ptrace model is broken, and no, you can't do a
threaded ptrace the way things are now.

Some of that is really fundamental to how we do things (ie the ptracer
is the parent), and our data structures really make that be
per-thread.

I'm not sure how easy it would be to fix. Some of it is probably
really painful. For example, right now we know we can't race between
different ptrace operations, because only the thread that did
PTRACE_ATTACH is allowed to do most of them.

So it could be very painful indeed to try to fix it so that you can do
threaded tracing. It would probably be a good thing to have, but it
might not be worth the pain.

Some daring person could try...

               Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  5:48         ` Bernd Edlinger
@ 2020-04-06  6:41           ` Bernd Edlinger
  0 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-06  6:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Alexey Gladkov


On 4/4/20 7:48 AM, Bernd Edlinger wrote:
> 
> 
> On 4/4/20 7:43 AM, Bernd Edlinger wrote:
>>
>>
>> On 4/3/20 6:23 PM, Linus Torvalds wrote:
>>> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>
>>>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>>>> In fact, then you could drop the
>>>>>
>>>>>                         mutex_unlock(&tsk->signal->exec_update_mutex);
>>>>>
>>>>> in the error case of exec_mmap(), because now the error handling in
>>>>> free_bprm() would do the cleanup automatically.
>>>>>
>>>>
>>>> The error handling is sometimes called when the exec_update_mutex is
>>>> not taken, in fact even de_thread not called.
>>>
>>> But that's the whole point of the flag. Make the flag be about "do I
>>> hold the mutex", and then the error handling does the right thing
>>> regardless.
>>>
>>>> Can you say how you would suggest that to be done?
>>>
>>> I think the easiest thing to do to explain is to just write the patch.
>>>
>>> This is entirely untested, but see what the difference is? I make the
>>> flag be about exactly where I take the lock, not about some "I have
>>> called exec_mmap".
>>>
>>> Which means that now exec_mmap() doesn't even need to unlock it in the
>>> error case, because the unlocking will happen properly in the
>>> bprm_exit regardless.
>>>
>>> This makes that unconditional unlocking logic much more obvious.
>>>
>>> That said, Eric says he can make it all properly static so that it
>>> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
>>> which is much better.
>>>
>>> So this patch is not for consumption, it's purely for "look, something
>>> like this"
>>>
>>
>>
>> Just one suggestion, in general It would feel pretty much okay if you
>> like to improve the naming, and the consistency in any of my patches.
>>

I mean it: I could not imagine a greater honor than you improving
one of my patches.

Just please consider what I said below before you do it.


Thanks
Bernd.

>>> @@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
>>>  		down_read(&old_mm->mmap_sem);
>>>  		if (unlikely(old_mm->core_state)) {
>>>  			up_read(&old_mm->mmap_sem);
>>> -			mutex_unlock(&tsk->signal->exec_update_mutex);
>>
>> I was trying to replicate the behavior of prepare_bprm_creds
>> which also unlocks the mutex in the error case, therefore it felt
>> okay to unlock the mutex here, but it will work either way.
>>
>> I should further note, that the mutex would be locked if this
>> error exit is taken, and unlocked if this error happens:
>>
>>         ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>         if (ret)
>>                 return ret;
>>
>> so at least the function comment I introduced above should be updated:
>>  * Maps the mm_struct mm into the current task struct.
>>  * On success, this function returns with the mutex
>>  * exec_update_mutex locked.
>>
>>
>>>  		put_binfmt(fmt);
>>> -		if (retval < 0 && bprm->called_exec_mmap) {
>>> +		if (retval < 0 && !bprm->mm) {
>>
>> Using bprm->mm like this feels like a hack to me.  It works here,
>> but nowhere else.  Therefore I changed this line.
>>
>> Using !bprm->mm in the error handling code made Eric's patch fail.
>>
> 
> That does probably work better it the boolean is named
> after_the_point_of_no_return or something....
> 
> 
>>
>> Thanks
>> Bernd.
>>
>>
>>>               Linus
>>>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-04  2:28                           ` Linus Torvalds
  2020-04-04  6:34                             ` Bernd Edlinger
  2020-04-05  2:42                             ` Waiman Long
@ 2020-04-06 13:13                             ` Will Deacon
  2 siblings, 0 replies; 127+ messages in thread
From: Will Deacon @ 2020-04-06 13:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Eric W. Biederman, Ingo Molnar, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov, peterz

[+Peter]

On Fri, Apr 03, 2020 at 07:28:36PM -0700, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <longman@redhat.com> wrote:
> >
> > So in term of priority, my current thinking is
> >
> >     upgrading unfair reader > unfair reader > reader/writer
> >
> > A higher priority locker will block other lockers from acquiring the lock.
> 
> An alternative option might be to have readers normally be 100% normal
> (ie with fairness wrt writers), and not really introduce any special
> "unfair reader" lock.
> 
> Instead, all the unfairness would come into play only when the special
> case - execve() - does its special "lock for reading with intent to
> upgrade".
> 
> But when it enters that kind of "intent to upgrade" lock state, it
> would not only block all subsequent writers, it would also guarantee
> that all other readers can continue to go.
> 
> So then the new rwsem operations would be
> 
>  - read_with_write_intent_lock_interruptible()
> 
>    This is the beginning of "execve()", and waits for all writers to
> exit, and puts the lock into "all readers can go" mode.
> 
>    You could think of it as a "I'm queuing myself for a write lock,
> but I'm allowing readers to go ahead" state.
> 
>  - read_lock_to_write_upgrade()
> 
>    This is the "now this turns into a regular write lock". It needs to
> wait for all other readers to exit, of course.

... and at this point, subsequent readers queue behind the upgrader so we
can't run into the usual "stream of readers prevents forward progress"
issue, which was my initial worry when I started reading the thread. Makes
sense.

>  - read_with_write_intent_unlock()
> 
>    This is the "I'm unqueuing myself, I aborted and will not become a
> write lock after all" operation.
> 
> NOTE! In this model, there may be multiple threads that do that
> initial queuing thing. We only guarantee that only one of them will
> get to the actual write lock stage, and the others will abort before
> that happens.

I do worry a bit about how much of this we can enforce, but I suppose I'll
wait for the patches. For example, it would be nice for
read_lock_to_write_upgrade() to return -EBUSY if there was a concurrent
(successful) upgrade rather than some pathological failure mode like
deadlock, but that feels like it might be a pain to do. It would probably
also be nice to scream if read_lock_to_write_upgrade() is called on a lock
where the upgrade *did* go ahead. Maybe some of this is food for lockdep.

That said, if this all ends up being spelled task_cred_*() then perhaps
it doesn't matter.
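
To make the proposed semantics concrete, here is a minimal
single-threaded model of the state transitions in C. It is only a
sketch: struct intent_lock and the model_*() names are hypothetical, the
functions are non-blocking "try" variants that return -EBUSY where a
real lock would sleep, and writer queuing and fairness are not modelled
at all. It also assumes that an upgrade has to wait for any other intent
holders to abort, since they count as readers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the lock state; not a kernel structure. */
struct intent_lock {
	int readers;	/* plain readers currently holding the lock */
	int intents;	/* holders in "read with write intent" state */
	bool writer;	/* true while a write lock (or an upgrade) is held */
};

/* Plain reader: admitted unless a writer or upgraded holder owns the lock. */
int model_read_trylock(struct intent_lock *l)
{
	if (l->writer)
		return -EBUSY;
	l->readers++;
	return 0;
}

void model_read_unlock(struct intent_lock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

/* Plain writer: excluded by readers, intent holders and other writers. */
int model_write_trylock(struct intent_lock *l)
{
	if (l->writer || l->readers || l->intents)
		return -EBUSY;
	l->writer = true;
	return 0;
}

void model_write_unlock(struct intent_lock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Start of execve(): only a writer keeps us out; several may queue here. */
int model_intent_trylock(struct intent_lock *l)
{
	if (l->writer)
		return -EBUSY;
	l->intents++;
	return 0;
}

/* Abort: "I will not become a write lock after all". */
void model_intent_unlock(struct intent_lock *l)
{
	assert(l->intents > 0);
	l->intents--;
}

/*
 * Upgrade: allowed once no plain readers and no other intent holders
 * remain (the latter is an assumption of this sketch).
 */
int model_intent_upgrade(struct intent_lock *l)
{
	assert(l->intents > 0);
	if (l->readers || l->intents > 1)
		return -EBUSY;
	l->intents--;
	l->writer = true;
	return 0;
}
```

In this model, with one intent holder in place a plain reader still gets
in while a plain writer gets -EBUSY; once the reader leaves, the upgrade
succeeds and new readers are refused until the write lock is dropped.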

Will

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-03 19:26             ` Linus Torvalds
  2020-04-03 20:41               ` Waiman Long
@ 2020-04-06 22:17               ` Eric W. Biederman
  2020-04-07 19:50                 ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-06 22:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> [ For Waiman & co - the problem is that the current cred_guard_mutex
> is horrendous and has problems with execve() deadlocking against
> various users. We've had this bug before, there's a new one, it's just
> nasty ]
>
> On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> That is not the direction I intend to take either.
>>
>> I was hoping I could put off replying to this thread for a bit because
>> I only managed to get 4 hours of sleep last night and I am not as alert
>> to technical details as I would like to be.
>
> Hmm.. So I've been looking at this cred_guard_mutex, and I wonder...
>
> This is a bit hand-wavy, because I haven't walked through all the
> paths, but could we perhaps work around a lot of the problems a
> different way, namely:
>
>  - make the "cred_guard_mutex" an rwsem-like thing instead of being a mutex.
>
>  - make the ptrace_attach() case get it for writing - not because
> ptrace changes the creds, but because ptrace changes 'task->ptrace'
> and depends on dumpability etc.
>
>  - change the *name* of that damn thing. Not because it's now
> rwsem'ish rather than a mutex, but because it was never really about
> just "creds". It was about creds+ptrace+dumpable flags etc.
>
>  - make all the ones that read the creds to just take it for reading
> (IOW, the cases that were basically switched over to
> exec_update_mutex).
>
>  - and finally: make "execve()" take it just for reading too, but
> introduce a "upgrade to write" at the very end (when it actually is
> all done and then finally changes the creds and dumpability)
>
> Wouldn't that solve all problems? We wouldn't get deadlocks wrt
> execve(), simply because execve() doesn't need it to be writable, and
> the things execve() does and can deadlock all only want readability.
>
> But hear me out, because the above is fundamentally broken in a couple
> of ways, so let me address that brokenness before you tell me I'm a
> complete nincompoop and an idiot.
>
> I'm including some locking people here because of these issues, so
> that they can maybe verify my thinking.
>
>  (a) our rwsem's are fair
>
>      So the whole "execve takes it for reading, so now others can take
> it for reading too without deadlocks" is simply not true - if you use
> the existing rwsem.
>
>      Because a concurrent (blocked) writer will then block other
> readers for fairness reasons, and holding it for reading doesn't
> guarantee that others can get it for reading.
>
>      So clearly, the above doesn't even *fix* the deadlocks - unless
> we have an unfair mode (or just a special lock for just this that is
> not our standard rwsem, but a special unfair one).
>
>      So I'm suggesting we use a special unfair rwsem here (we can make
> a simple spinlock-based one - it doesn't need to be as clever or
> optimized as the real rwsems are)
>
>  (b) similarly, our rwsem's don't actually have a "upgrade from read
> to write", because that's also a fundamentally deadlocky operation.
>
>      Again, that's true. Except execve() is special, and we know
> there's only _one_ execve() at a time that will complete, since we're
> serializing them. So for this particular use, "upgrade to write" would
> be possible without the general-case deadlock issues.
>
>  (c) I didn't think things through, and even with these special
> semantics, my idea is complete garbage
>
>      Ok, this may well be true.
>
> Anyway, the advantage of this (if it works) is that it would allow us
> to go back to the _really_ simple original model of just taking this
> lock for reading at the beginning of execve(), and not worrying so
> much about complex nesting or very complex rules for exactly when we
> got the lock and error handling.
>
> The final part when we actually update the credentials and dumpability
> and stuff in execve() is actually fairly simple. So the "upgrade to a
> write lock" phase doesn't worry me too much.  It's the interaction
> with all the previous parts (which happen with it held just for
> reading) that tend to be the nastier ones.
>
> And ptrace_attach() really is special, and I think it would be the
> only one that really needs that write lock.
>
> The disadvantage, of course, is that it would require that
> special-case lock semantic, and I might also be missing some thing
> that makes it not work anyway.
>
> Comments? Am I just dreaming of a simpler model without my medications
> again?

Without reading everything through, at least:

* There is also security_setprocattr, which needs the ptrace and nnp
  state not to change: it sets something that at least selinux's cred
  calculations need to remain constant (like nnp and ptrace).

  Which means one thread calling security_setprocattr and another thread
  calling exec can deadlock in de_thread.

* Even with your lock and just the ptrace case I can deadlock.
  Ptracer:                  Thread A               Thread B
     ptrace_attach A
                                                   exec
     ptrace_attach B
                                                   upgrade R to RW
     ---------------------- DEADLOCKED -------------------------

Those are the first two cases I have thought of.  There are probably
more.



But fundamentally the only reason we need this information stable
before the point of no return is so that we can return a nice error
code to the process calling exec.  Instead of terminating the
process with SIGSEGV.

These are for the most part unlikely scenarios or people would have been
complaining much more loudly about deadlock.

So my plan is to perform the relevant calculations effectively twice.
Once just before the point of no return, and give a graceful return code
if necessary and possible.  Once just after the point of no return, and
SIGSEGV if necessary.
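
As a rough illustration of the "check twice" idea, here is a small
userspace model in C. Everything in it is hypothetical: struct
exec_state, the outcome enum and the policy in creds_change_allowed()
are made up for the sketch and only stand in for whatever an LSM would
actually evaluate. The optional racer callback models something like a
ptrace_attach() landing between the two checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state a security check looks at. */
struct exec_state {
	bool ptraced;
	bool no_new_privs;
	bool wants_new_creds;	/* e.g. exec of a setuid binary */
};

enum exec_outcome {
	EXEC_OK,		/* exec completed */
	EXEC_ERROR_RETURNED,	/* graceful error back to the caller */
	EXEC_KILLED,		/* too late to return: SIGSEGV */
};

/* Assumed policy: refuse a credential change under ptrace or NNP. */
bool creds_change_allowed(const struct exec_state *s)
{
	return !(s->wants_new_creds && (s->ptraced || s->no_new_privs));
}

/* Example racer: a tracer attaches while exec is in flight. */
void attach_ptracer(struct exec_state *s)
{
	s->ptraced = true;
}

enum exec_outcome model_exec(struct exec_state *s,
			     void (*racer)(struct exec_state *))
{
	assert(s);

	/* First pass: before the point of no return, fail gracefully. */
	if (!creds_change_allowed(s))
		return EXEC_ERROR_RETURNED;

	/* The state may change here because no lock is held across both. */
	if (racer)
		racer(s);

	/*
	 * Second pass: after the point of no return there is no process
	 * left to return an error to, so the only option is to kill it.
	 */
	if (!creds_change_allowed(s))
		return EXEC_KILLED;

	return EXEC_OK;
}
```

In the common case both passes agree and the caller either gets a clean
error or a completed exec; only when the state changes in the window
between them does the second pass turn into a kill.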



Of course this all only applies to LSMs that refuse to continue under
NNP or ptrace without changing the cred.  Linux without those LSMs
enabled will just continue with the original credentials.


So I don't think we will noticeably be sacrificing the quality of
the user experience with my plan.  In the worst case a deadlock
will become a SIGSEGV killing the execing program.

Eric

p.s. Yes we can do better than a mutex that makes everything mutually
     exclusive. I am just starting there for simplicity, and to
     see if we need anything better.  Unfortunately too many things are
     changing simultaneously for rcu to cover all of the read side
     cases.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC][PATCH 0/3] exec_update_mutex related cleanups
  2020-04-02 23:44             ` Linus Torvalds
  2020-04-03  0:05               ` Eric W. Biederman
@ 2020-04-07  1:29               ` Eric W. Biederman
  2020-04-07  1:31                 ` [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf Eric W. Biederman
                                   ` (4 more replies)
  1 sibling, 5 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-07  1:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner


Linus,

Since you rightly pointed out that the code in fs/exec.c is less readable
than it should be right now, here is where I currently sit on making
that code static where possible and as obvious as possible.

I will resend this after the merge window for a proper review when
people are less likely to be distracted, but I figured I might as well
send this out now so I can see if anyone runs screaming from this code.

Eric W. Biederman (3):
      binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
      exec: Make unlocking exec_update_mutex explict
      exec: Rename the flag called_exec_mmap point_of_no_return

 arch/x86/ia32/ia32_aout.c |  3 +--
 fs/binfmt_aout.c          |  2 +-
 fs/binfmt_elf_fdpic.c     |  2 +-
 fs/binfmt_flat.c          |  3 +--
 fs/exec.c                 | 18 +++++++++---------
 include/linux/binfmts.h   |  7 +++----
 6 files changed, 16 insertions(+), 19 deletions(-)


Eric

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
@ 2020-04-07  1:31                 ` Eric W. Biederman
  2020-04-07 15:58                   ` Kees Cook
                                     ` (2 more replies)
  2020-04-07  1:31                 ` [PATCH 2/3] exec: Make unlocking exec_update_mutex explict Eric W. Biederman
                                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-07  1:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner


In 2016 Linus moved install_exec_creds immediately after
setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
potential information leak.

Perform the same cleanup for the other binary formats.

Different binary formats doing the same things the same way makes exec
easier to reason about and easier to maintain.

Putting install_exec_creds immediately after setup_new_exec makes many
simplifications possible in the code.

Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/x86/ia32/ia32_aout.c | 3 +--
 fs/binfmt_aout.c          | 2 +-
 fs/binfmt_elf_fdpic.c     | 2 +-
 fs/binfmt_flat.c          | 3 +--
 4 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 9bb71abd66bd..37b36a8ce5fa 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -140,6 +140,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
 	set_personality_ia32(false);
 
 	setup_new_exec(bprm);
+	install_exec_creds(bprm);
 
 	regs->cs = __USER32_CS;
 	regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 =
@@ -156,8 +157,6 @@ static int load_aout_binary(struct linux_binprm *bprm)
 	if (retval < 0)
 		return retval;
 
-	install_exec_creds(bprm);
-
 	if (N_MAGIC(ex) == OMAGIC) {
 		unsigned long text_addr, map_size;
 
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 8e8346a81723..ace587b66904 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -162,6 +162,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
 	set_personality(PER_LINUX);
 #endif
 	setup_new_exec(bprm);
+	install_exec_creds(bprm);
 
 	current->mm->end_code = ex.a_text +
 		(current->mm->start_code = N_TXTADDR(ex));
@@ -174,7 +175,6 @@ static int load_aout_binary(struct linux_binprm * bprm)
 	if (retval < 0)
 		return retval;
 
-	install_exec_creds(bprm);
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		unsigned long text_addr, map_size;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 240f66663543..6c94c6d53d97 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -353,6 +353,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 		current->personality |= READ_IMPLIES_EXEC;
 
 	setup_new_exec(bprm);
+	install_exec_creds(bprm);
 
 	set_binfmt(&elf_fdpic_format);
 
@@ -434,7 +435,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 	current->mm->start_stack = current->mm->start_brk + stack_size;
 #endif
 
-	install_exec_creds(bprm);
 	if (create_elf_fdpic_tables(bprm, current->mm,
 				    &exec_params, &interp_params) < 0)
 		goto error;
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index 831a2b25ba79..1a1d1fcb893f 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 		/* OK, This is the point of no return */
 		set_personality(PER_LINUX_32BIT);
 		setup_new_exec(bprm);
+		install_exec_creds(bprm);
 	}
 
 	/*
@@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
 		}
 	}
 
-	install_exec_creds(bprm);
-
 	set_binfmt(&flat_format);
 
 #ifdef CONFIG_MMU
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH 2/3] exec: Make unlocking exec_update_mutex explict
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
  2020-04-07  1:31                 ` [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf Eric W. Biederman
@ 2020-04-07  1:31                 ` Eric W. Biederman
  2020-04-07 16:02                   ` Kees Cook
  2020-04-07 16:17                   ` Christian Brauner
  2020-04-07  1:32                 ` [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return Eric W. Biederman
                                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-07  1:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner


With install_exec_creds updated to follow immediately after
setup_new_exec, the failure of unshare_sighand is the only
code path where exec_update_mutex is held but not explicitly
unlocked.

Update that code path to explicitly unlock exec_update_mutex.

Remove the unlocking of exec_update_mutex from free_bprm.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c               | 6 +++---
 include/linux/binfmts.h | 3 +--
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index d55710a36056..28c87020da9b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1318,7 +1318,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	retval = unshare_sighand(me);
 	if (retval)
-		goto out;
+		goto out_unlock;
 
 	set_fs(USER_DS);
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
@@ -1335,6 +1335,8 @@ int flush_old_exec(struct linux_binprm * bprm)
 	do_close_on_exec(me->files);
 	return 0;
 
+out_unlock:
+	mutex_unlock(&me->signal->exec_update_mutex);
 out:
 	return retval;
 }
@@ -1451,8 +1453,6 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		if (bprm->called_exec_mmap)
-			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a345d9fed3d8..6f564b9ad882 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -47,8 +47,7 @@ struct linux_binprm {
 		secureexec:1,
 		/*
 		 * Set by flush_old_exec, when exec_mmap has been called.
-		 * This is past the point of no return, when the
-		 * exec_update_mutex has been taken.
+		 * This is past the point of no return.
 		 */
 		called_exec_mmap:1;
 #ifdef __alpha__
-- 
2.25.0
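
For readers who want the locking contract of this patch in isolation,
the following is a minimal userspace sketch in C, not kernel code.
struct toy_mutex and the model_*() functions are hypothetical; the toy
mutex only asserts that lock and unlock calls stay balanced. The
contract modelled is: on success the function returns with the mutex
held, and on every error path it returns with the mutex released.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical single-threaded stand-in for a mutex. */
struct toy_mutex {
	bool locked;
};

void toy_lock(struct toy_mutex *m)
{
	assert(!m->locked);	/* catches a double lock */
	m->locked = true;
}

void toy_unlock(struct toy_mutex *m)
{
	assert(m->locked);	/* catches an unlock without a lock */
	m->locked = false;
}

/* Models exec_mmap(): takes the mutex, drops it again if it fails. */
int model_exec_mmap(struct toy_mutex *m, int err)
{
	toy_lock(m);
	if (err) {
		toy_unlock(m);
		return err;
	}
	return 0;
}

/*
 * Models flush_old_exec(): sighand_err stands in for the result of
 * unshare_sighand().  Returns 0 with the mutex held, or a negative
 * errno with the mutex released.
 */
int model_flush_old_exec(struct toy_mutex *m, int mmap_err, int sighand_err)
{
	int retval;

	retval = model_exec_mmap(m, mmap_err);
	if (retval)
		goto out;		/* mutex already released */

	retval = sighand_err;
	if (retval)
		goto out_unlock;	/* mutex held: release it here */

	return 0;

out_unlock:
	toy_unlock(m);
out:
	return retval;
}
```

Because each error label releases exactly what is held at that point,
the caller's cleanup no longer needs a flag to decide whether to unlock.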


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
  2020-04-07  1:31                 ` [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf Eric W. Biederman
  2020-04-07  1:31                 ` [PATCH 2/3] exec: Make unlocking exec_update_mutex explict Eric W. Biederman
@ 2020-04-07  1:32                 ` Eric W. Biederman
  2020-04-07 16:03                   ` Kees Cook
  2020-04-07 16:21                   ` Christian Brauner
  2020-04-07 16:22                 ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Christian Brauner
  2020-04-08 17:26                 ` Linus Torvalds
  4 siblings, 2 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-07  1:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner


Update the comments and make the code easier to understand by
renaming this flag.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c               | 12 ++++++------
 include/linux/binfmts.h |  6 +++---
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 28c87020da9b..a61987d6dc33 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1300,12 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;
 
 	/*
-	 * After setting bprm->called_exec_mmap (to mark that current is
-	 * using the prepared mm now), we have nothing left of the original
-	 * process. If anything from here on returns an error, the check
-	 * in search_binary_handler() will SEGV current.
+	 * With the new mm installed it is completely impossible to
+	 * fail and return to the original process.  If anything from
+	 * here on returns an error, the check in
+	 * search_binary_handler() will SEGV current.
 	 */
-	bprm->called_exec_mmap = 1;
+	bprm->point_of_no_return = true;
 	bprm->mm = NULL;
 
 #ifdef CONFIG_POSIX_TIMERS
@@ -1694,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)
 
 		read_lock(&binfmt_lock);
 		put_binfmt(fmt);
-		if (retval < 0 && bprm->called_exec_mmap) {
+		if (retval < 0 && bprm->point_of_no_return) {
 			/* we got to flush_old_exec() and failed after it */
 			read_unlock(&binfmt_lock);
 			force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 6f564b9ad882..8f479dad7931 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -46,10 +46,10 @@ struct linux_binprm {
 		 */
 		secureexec:1,
 		/*
-		 * Set by flush_old_exec, when exec_mmap has been called.
-		 * This is past the point of no return.
+		 * Set when errors can no longer be returned to the
+		 * original userspace.
 		 */
-		called_exec_mmap:1;
+		point_of_no_return:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
  2020-04-07  1:31                 ` [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf Eric W. Biederman
@ 2020-04-07 15:58                   ` Kees Cook
  2020-04-07 16:11                   ` Christian Brauner
  2020-04-08 17:25                   ` Linus Torvalds
  2 siblings, 0 replies; 127+ messages in thread
From: Kees Cook @ 2020-04-07 15:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Jann Horn, Christian Brauner

On Mon, Apr 06, 2020 at 08:31:25PM -0500, Eric W. Biederman wrote:
> 
> In 2016 Linus moved install_exec_creds immediately after
> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
> potential information leak.
> 
> Perform the same cleanup for the other binary formats.
> 
> Different binary formats doing the same things the same way makes exec
> easier to reason about and easier to maintain.
> 
> Putting install_exec_creds immediately after setup_new_exec makes many
> simplifications possible in the code.
> 
> Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/ia32/ia32_aout.c | 3 +--
>  fs/binfmt_aout.c          | 2 +-
>  fs/binfmt_elf_fdpic.c     | 2 +-
>  fs/binfmt_flat.c          | 3 +--
>  4 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
> index 9bb71abd66bd..37b36a8ce5fa 100644
> --- a/arch/x86/ia32/ia32_aout.c
> +++ b/arch/x86/ia32/ia32_aout.c
> @@ -140,6 +140,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
>  	set_personality_ia32(false);
>  
>  	setup_new_exec(bprm);
> +	install_exec_creds(bprm);
>  
>  	regs->cs = __USER32_CS;
>  	regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 =
> @@ -156,8 +157,6 @@ static int load_aout_binary(struct linux_binprm *bprm)
>  	if (retval < 0)
>  		return retval;
>  
> -	install_exec_creds(bprm);
> -
>  	if (N_MAGIC(ex) == OMAGIC) {
>  		unsigned long text_addr, map_size;
>  
> diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
> index 8e8346a81723..ace587b66904 100644
> --- a/fs/binfmt_aout.c
> +++ b/fs/binfmt_aout.c
> @@ -162,6 +162,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
>  	set_personality(PER_LINUX);
>  #endif
>  	setup_new_exec(bprm);
> +	install_exec_creds(bprm);
>  
>  	current->mm->end_code = ex.a_text +
>  		(current->mm->start_code = N_TXTADDR(ex));
> @@ -174,7 +175,6 @@ static int load_aout_binary(struct linux_binprm * bprm)
>  	if (retval < 0)
>  		return retval;
>  
> -	install_exec_creds(bprm);
>  
>  	if (N_MAGIC(ex) == OMAGIC) {
>  		unsigned long text_addr, map_size;
> diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> index 240f66663543..6c94c6d53d97 100644
> --- a/fs/binfmt_elf_fdpic.c
> +++ b/fs/binfmt_elf_fdpic.c
> @@ -353,6 +353,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
>  		current->personality |= READ_IMPLIES_EXEC;
>  
>  	setup_new_exec(bprm);
> +	install_exec_creds(bprm);
>  
>  	set_binfmt(&elf_fdpic_format);
>  
> @@ -434,7 +435,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
>  	current->mm->start_stack = current->mm->start_brk + stack_size;
>  #endif
>  
> -	install_exec_creds(bprm);
>  	if (create_elf_fdpic_tables(bprm, current->mm,
>  				    &exec_params, &interp_params) < 0)
>  		goto error;
> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
> index 831a2b25ba79..1a1d1fcb893f 100644
> --- a/fs/binfmt_flat.c
> +++ b/fs/binfmt_flat.c
> @@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>  		/* OK, This is the point of no return */
>  		set_personality(PER_LINUX_32BIT);
>  		setup_new_exec(bprm);
> +		install_exec_creds(bprm);
>  	}
>  
>  	/*
> @@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
>  		}
>  	}
>  
> -	install_exec_creds(bprm);
> -
>  	set_binfmt(&flat_format);
>  
>  #ifdef CONFIG_MMU
> -- 
> 2.25.0
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict
  2020-04-07  1:31                 ` [PATCH 2/3] exec: Make unlocking exec_update_mutex explict Eric W. Biederman
@ 2020-04-07 16:02                   ` Kees Cook
  2020-04-07 16:17                   ` Christian Brauner
  1 sibling, 0 replies; 127+ messages in thread
From: Kees Cook @ 2020-04-07 16:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Jann Horn, Christian Brauner

On Mon, Apr 06, 2020 at 08:31:52PM -0500, Eric W. Biederman wrote:
> 
> With install_exec_creds updated to follow immediately after
> setup_new_exec, the failure of unshare_sighand is the only
> code path where exec_update_mutex is held but not explicitly
> unlocked.
> 
> Update that code path to explicitly unlock exec_update_mutex.
> 
> Remove the unlocking of exec_update_mutex from free_bprm.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  fs/exec.c               | 6 +++---
>  include/linux/binfmts.h | 3 +--
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index d55710a36056..28c87020da9b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1318,7 +1318,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 */
>  	retval = unshare_sighand(me);
>  	if (retval)
> -		goto out;
> +		goto out_unlock;
>  
>  	set_fs(USER_DS);
>  	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> @@ -1335,6 +1335,8 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	do_close_on_exec(me->files);
>  	return 0;
>  
> +out_unlock:
> +	mutex_unlock(&me->signal->exec_update_mutex);
>  out:
>  	return retval;
>  }
> @@ -1451,8 +1453,6 @@ static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> -		if (bprm->called_exec_mmap)
> -			mutex_unlock(&current->signal->exec_update_mutex);
>  		mutex_unlock(&current->signal->cred_guard_mutex);
>  		abort_creds(bprm->cred);
>  	}
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index a345d9fed3d8..6f564b9ad882 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -47,8 +47,7 @@ struct linux_binprm {
>  		secureexec:1,
>  		/*
>  		 * Set by flush_old_exec, when exec_mmap has been called.
> -		 * This is past the point of no return, when the
> -		 * exec_update_mutex has been taken.
> +		 * This is past the point of no return.
>  		 */
>  		called_exec_mmap:1;
>  #ifdef __alpha__
> -- 
> 2.25.0
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return
  2020-04-07  1:32                 ` [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return Eric W. Biederman
@ 2020-04-07 16:03                   ` Kees Cook
  2020-04-07 16:21                   ` Christian Brauner
  1 sibling, 0 replies; 127+ messages in thread
From: Kees Cook @ 2020-04-07 16:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Jann Horn, Christian Brauner

On Mon, Apr 06, 2020 at 08:32:23PM -0500, Eric W. Biederman wrote:
> 
> Update the comments and make the code easier to understand by
> renaming this flag.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

I like it, yes!

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  fs/exec.c               | 12 ++++++------
>  include/linux/binfmts.h |  6 +++---
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 28c87020da9b..a61987d6dc33 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1300,12 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>  		goto out;
>  
>  	/*
> -	 * After setting bprm->called_exec_mmap (to mark that current is
> -	 * using the prepared mm now), we have nothing left of the original
> -	 * process. If anything from here on returns an error, the check
> -	 * in search_binary_handler() will SEGV current.
> +	 * With the new mm installed it is completely impossible to
> +	 * fail and return to the original process.  If anything from
> +	 * here on returns an error, the check in
> +	 * search_binary_handler() will SEGV current.
>  	 */
> -	bprm->called_exec_mmap = 1;
> +	bprm->point_of_no_return = true;
>  	bprm->mm = NULL;
>  
>  #ifdef CONFIG_POSIX_TIMERS
> @@ -1694,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>  
>  		read_lock(&binfmt_lock);
>  		put_binfmt(fmt);
> -		if (retval < 0 && bprm->called_exec_mmap) {
> +		if (retval < 0 && bprm->point_of_no_return) {
>  			/* we got to flush_old_exec() and failed after it */
>  			read_unlock(&binfmt_lock);
>  			force_sigsegv(SIGSEGV);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 6f564b9ad882..8f479dad7931 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -46,10 +46,10 @@ struct linux_binprm {
>  		 */
>  		secureexec:1,
>  		/*
> -		 * Set by flush_old_exec, when exec_mmap has been called.
> -		 * This is past the point of no return.
> +		 * Set when errors can no longer be returned to the
> +		 * original userspace.
>  		 */
> -		called_exec_mmap:1;
> +		point_of_no_return:1;
>  #ifdef __alpha__
>  	unsigned int taso:1;
>  #endif
> -- 
> 2.25.0
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
  2020-04-07  1:31                 ` [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf Eric W. Biederman
  2020-04-07 15:58                   ` Kees Cook
@ 2020-04-07 16:11                   ` Christian Brauner
  2020-04-08 17:25                   ` Linus Torvalds
  2 siblings, 0 replies; 127+ messages in thread
From: Christian Brauner @ 2020-04-07 16:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Kees Cook, Jann Horn

On Mon, Apr 06, 2020 at 08:31:25PM -0500, Eric W. Biederman wrote:
> 
> In 2016 Linus moved install_exec_creds immediately after
> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
> potential information leak.
> 
> Perform the same cleanup for the other binary formats.
> 
> Different binary formats doing the same things the same way makes exec
> easier to reason about and easier to maintain.
> 
> Putting install_exec_creds immediate after setup_new_exec makes many
> simplifications possible in the code.
> 
> Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Sure, why not.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict
  2020-04-07  1:31                 ` [PATCH 2/3] exec: Make unlocking exec_update_mutex explict Eric W. Biederman
  2020-04-07 16:02                   ` Kees Cook
@ 2020-04-07 16:17                   ` Christian Brauner
  2020-04-07 16:21                     ` Eric W. Biederman
  1 sibling, 1 reply; 127+ messages in thread
From: Christian Brauner @ 2020-04-07 16:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Kees Cook, Jann Horn

On Mon, Apr 06, 2020 at 08:31:52PM -0500, Eric W. Biederman wrote:
> 
> With install_exec_creds updated to follow immediately after
> setup_new_exec, the failure of unshare_sighand is the only
> code path where exec_update_mutex is held but not explicitly
> unlocked.
> 
> Update that code path to explicitly unlock exec_update_mutex.
> 
> Remove the unlocking of exec_update_mutex from free_bprm.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Yeah, assuming that I didn't miss any subtleties just now.
By "explicit" I assume you mean not conditionally unlocked, i.e. we
don't need to check any condition in free_bprm().

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return
  2020-04-07  1:32                 ` [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return Eric W. Biederman
  2020-04-07 16:03                   ` Kees Cook
@ 2020-04-07 16:21                   ` Christian Brauner
  1 sibling, 0 replies; 127+ messages in thread
From: Christian Brauner @ 2020-04-07 16:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Kees Cook, Jann Horn

On Mon, Apr 06, 2020 at 08:32:23PM -0500, Eric W. Biederman wrote:
> 
> Update the comments and make the code easier to understand by
> renaming this flag.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

I could find only 4 places where called_exec_mmap was referenced, the
last one being removed in 2/3 of this series, so this seems fine.
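
For anyone reading along, here is a minimal userspace sketch of what the
renamed flag gates in search_binary_handler().  The toy_* names and both
structs are made up for illustration; only the condition
"retval < 0 && point_of_no_return" comes from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one linux_binprm field used here. */
struct toy_binprm {
	bool point_of_no_return;
};

/* What a handler's return value turns into, in this toy model. */
struct toy_exec_outcome {
	int retval;	/* value handed back towards the caller */
	int signal;	/* 0, or the signal forced on current */
};

/*
 * Before the point of no return an error simply travels back to the
 * caller.  After it there is no original process left to return to,
 * so the same error ends in SIGSEGV instead.
 */
struct toy_exec_outcome toy_handle_retval(const struct toy_binprm *bprm,
					  int retval)
{
	struct toy_exec_outcome out = { .retval = retval, .signal = 0 };

	assert(bprm);
	if (retval < 0 && bprm->point_of_no_return)
		out.signal = SIGSEGV;
	return out;
}
```

With the flag clear, toy_handle_retval(&bprm, -ENOEXEC) reports no
signal; with the flag set, any negative retval reports SIGSEGV.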

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict
  2020-04-07 16:17                   ` Christian Brauner
@ 2020-04-07 16:21                     ` Eric W. Biederman
  0 siblings, 0 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-07 16:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Kees Cook, Jann Horn

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Mon, Apr 06, 2020 at 08:31:52PM -0500, Eric W. Biederman wrote:
>> 
>> With install_exec_creds updated to follow immediately after
>> setup_new_exec, the failure of unshare_sighand is the only
>> code path where exec_update_mutex is held but not explicitly
>> unlocked.
>> 
>> Update that code path to explicitly unlock exec_update_mutex.
>> 
>> Remove the unlocking of exec_update_mutex from free_bprm.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Yeah, assuming that I didn't miss any subtleties just now.
> By "explicit" I assume you mean not conditionally unlocked, i.e. we
> don't need to check any condition in free_binprm().

Yes.  Not conditionally unlocked is what I meant.
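
A rough userspace sketch of the shape this gives the code, for
illustration only.  The toy_* helpers and the bool-based "mutex" are
assumptions of the sketch, not the code in fs/exec.c; the point is only
that the failing path drops the lock itself, so the final cleanup never
has to ask whether the lock is held.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mutex; not a kernel type. */
struct toy_mutex {
	bool locked;
};

void toy_mutex_lock(struct toy_mutex *m)
{
	assert(m && !m->locked);
	m->locked = true;
}

void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m && m->locked);
	m->locked = false;
}

/*
 * Models the step that takes the mutex and may then fail.  On failure
 * the lock is dropped right here, on the failing path.
 */
int toy_take_lock_then_unshare(struct toy_mutex *m, bool unshare_fails)
{
	toy_mutex_lock(m);
	if (unshare_fails) {
		toy_mutex_unlock(m);
		return -ENOMEM;
	}
	return 0;	/* success: the caller unlocks later, explicitly */
}

/* Cleanup has no "did we get far enough to hold the lock?" check. */
void toy_free_bprm(struct toy_mutex *m)
{
	assert(m && !m->locked);
}
```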

> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

Eric

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC][PATCH 0/3] exec_update_mutex related cleanups
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
                                   ` (2 preceding siblings ...)
  2020-04-07  1:32                 ` [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return Eric W. Biederman
@ 2020-04-07 16:22                 ` Christian Brauner
  2020-04-08 17:26                 ` Linus Torvalds
  4 siblings, 0 replies; 127+ messages in thread
From: Christian Brauner @ 2020-04-07 16:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Bernd Edlinger, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov, Kees Cook, Jann Horn

On Mon, Apr 06, 2020 at 08:29:50PM -0500, Eric W. Biederman wrote:
> 
> Linus,
> 
> Since you rightly pointed out the code in fs/exec.c is less readable
> than it should be right now.  Here is where I currently sit on making
> that code static where possible and as obvious as possible.
> 
> I will resend this after the merge window for a proper review when
> people are less likely to be distrcacted but I figured I might as well
> send this out now so I can see if anyone runs screaming from this code.
> 
> Eric W. Biederman (3):
>       binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
>       exec: Make unlocking exec_update_mutex explict
>       exec: Rename the flag called_exec_mmap point_of_no_return

Under the assumption that we go forward with this approach this seems
like a good cleanup.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-06 22:17               ` Eric W. Biederman
@ 2020-04-07 19:50                 ` Linus Torvalds
  2020-04-07 20:29                   ` Bernd Edlinger
  2020-04-08 15:14                   ` Eric W. Biederman
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-07 19:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> But fundamentally the only reason we need this information stable
> before the point of no return is so that we can return a nice error
> code to the process calling exec.  Instead of terminating the
> process with SIGSEGV.

I'd suggest doing it the other way around instead: let the thread that
does the security_setprocattr() die, since execve() is terminating
other threads anyway.

And the easy way to do that is to just make the rule be that anybody
who waits for this thing for write needs to use a killable wait.

So if the execve() got started earlier, and already took the cred lock
(whatever we'll call it) for reading, then zap_other_threads() will
take care of another thread doing setprocattr().

That sounds like a really simple model, no?

                Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-07 19:50                 ` Linus Torvalds
@ 2020-04-07 20:29                   ` Bernd Edlinger
  2020-04-07 20:47                     ` Linus Torvalds
  2020-04-08 15:14                   ` Eric W. Biederman
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-07 20:29 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov



On 4/7/20 9:50 PM, Linus Torvalds wrote:
> On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> But fundamentally the only reason we need this information stable
>> before the point of no return is so that we can return a nice error
>> code to the process calling exec.  Instead of terminating the
>> process with SIGSEGV.
> 
> I'd suggest doing it the other way around instead: let the thread that
> does the security_setprocattr() die, since execve() is terminating
> other threads anyway.
> 
> And the easy way to do that is to just make the rule be that anybody
> who waits for this thing for write needs to use a killable wait.
> 
> So if the execve() got started earlier, and already took the cred lock
> (whatever we'll call it) for reading, then zap_other_threads() will
> take care of another thread doing setprocattr().
> 
> That sounds like a really simple model, no?
> 

Maybe, actually I considered this, but I was anxious that making something
that is so far not killable suddenly killable might break other things.

But I am a wimp :-)


Bernd.




>                 Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-07 20:29                   ` Bernd Edlinger
@ 2020-04-07 20:47                     ` Linus Torvalds
  0 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-07 20:47 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov

On Tue, Apr 7, 2020 at 1:29 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>
> Maybe, actually I considered this, but I was anxious that making something
> that is so far not killable suddenly killable might break other things.

I don't think it can.

Basically, if you have a execve() and a setprocattr() racing, one or
the other starts first.

And if the execve() started first, then the setprocattr() thread would
get killed by the execve(), and there's no serialization. So you might
as well just say "it got killed before it even started to wait".

So semantically, having a killable wait is basically exactly the same
as losing the race - which wasn't ordered to begin with.

It's not like anybody will see the return value - the thread that
would have gotten the value got killed.

So doing

    if (down_write_killable(&credlock))
        return -EINTR;

may *look* like it's new semantics, but it isn't really. That EINTR
error isn't visible to anybody, and everything looks absolutely
identical to "execve() in the other thread started earlier and killed
the thread even before it got to the system call".
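
To make the argument concrete, a toy single-threaded model of that
rule.  Everything named toy_* is hypothetical, and -EAGAIN is only a
stand-in for "keep sleeping"; the one thing taken from the real API is
that a killed waiter gets -EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model types; none of these exist in the kernel. */
struct toy_task {
	bool fatal_signal_pending;	/* set once execve() has zapped us */
};

struct toy_credlock {
	int readers;	/* e.g. an execve() holding the lock for reading */
	bool writer;
};

/*
 * Returns 0 when the write lock is taken, -EINTR when the waiter was
 * killed while the lock was contended, and -EAGAIN when it would
 * simply have to go on waiting.
 */
int toy_down_write_killable(struct toy_credlock *lock,
			    const struct toy_task *tsk)
{
	assert(lock && tsk);
	if (lock->readers == 0 && !lock->writer) {
		lock->writer = true;
		return 0;
	}
	if (tsk->fatal_signal_pending)
		return -EINTR;
	return -EAGAIN;
}
```

The -EINTR case is only reachable for a task that is already dying, so
nobody is left to observe the error.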

              Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-07 19:50                 ` Linus Torvalds
  2020-04-07 20:29                   ` Bernd Edlinger
@ 2020-04-08 15:14                   ` Eric W. Biederman
  2020-04-08 15:21                     ` Bernd Edlinger
  2020-04-08 16:34                     ` Linus Torvalds
  1 sibling, 2 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-08 15:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> But fundamentally the only reason we need this information stable
>> before the point of no return is so that we can return a nice error
>> code to the process calling exec.  Instead of terminating the
>> process with SIGSEGV.
>
> I'd suggest doing it the other way around instead: let the thread that
> does the security_setprocattr() die, since execve() is terminating
> other threads anyway.
>
> And the easy way to do that is to just make the rule be that anybody
> who waits for this thing for write needs to use a killable wait.
>
> So if the execve() got started earlier, and already took the cred lock
> (whatever we'll call it) for reading, then zap_other_threads() will
> take care of another thread doing setprocattr().
>
> That sounds like a really simple model, no?

Yes.  I missed the fact that we could take the lock killable.
We still unfortunately have the deadlock with ptrace.

It might be simpler to make whichever lock we are dealing with per
task_struct instead of per signal_struct.  Then we don't even have to
think about what de_thread does or if the lock is taken killable.


Looking at the code in binfmt_elf.c there are about 11 other places
after install_exec_creds where we can fail and would be forced to
terminate the application with SIGSEGV instead of causing exec to fail.




I keep wondering if we could do something similar to vfork.  That is,
allocate a new task_struct and fully set it up for the post exec
process, and then make it visible under tasklist_lock.  Finally we could
free the old process.

That would appear as if everything happened atomically from
the point of view of the rest of the kernel.

As well as fixing all of the deadlocks and making it easy
to ensure we don't have any more weird failures in the future.

Eric

p.s. For tasklist_lock I suspect we can put a lock in struct pid
and use that to guard the task lists in struct pid.  Which would
allow for tasklist_lock to be taken much less often.  Then we would
just need a solution for task->parent and task->real_parent and
I think all of the major users of tasklist_lock would be gone.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-08 15:14                   ` Eric W. Biederman
@ 2020-04-08 15:21                     ` Bernd Edlinger
  2020-04-08 16:34                     ` Linus Torvalds
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On 4/8/20 5:14 PM, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
>> On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> But fundamentally the only reason we need this information stable
>>> before the point of no return is so that we can return a nice error
>>> code to the process calling exec.  Instead of terminating the
>>> process with SIGSEGV.
>>
>> I'd suggest doing it the other way around instead: let the thread that
>> does the security_setprocattr() die, since execve() is terminating
>> other threads anyway.
>>
>> And the easy way to do that is to just make the rule be that anybody
>> who waits for this thing for write needs to use a killable wait.
>>
>> So if the execve() got started earlier, and already took the cred lock
>> (whatever we'll call it) for reading, then zap_other_threads() will
>> take care of another thread doing setprocattr().
>>
>> That sounds like a really simple model, no?
> 
> Yes.  I missed the fact that we could take the lock killable.
> We still unfortunately have the deadlock with ptrace.
> 
> It might be simpler to make whichever lock we are dealing with per
> task_struct instead of per signal_struct.  Then we don't even have to
> think about what de_thread does or if the lock is taken killable.
> 

I think you said that already, but I did not understand the difference,
could you please give some more details about your idea?


Thanks
Bernd.

> 
> Looking at the code in binfmt_elf.c there are about 11 other places
> after install_exec_creds where we can fail and would be forced to
> terminate the application with SIGSEGV instead of causing fork to fail.
> 
> 
> 
> 
> I keep wondering if we could do something similar to vfork.  That is
> allocate an new task_struct and fully set it up for the post exec
> process, and then make it visible under tasklist_lock.  Finally we could
> free the old process.
> 
> That would appear as if everything happened atomically from
> the point of view of the rest of the kernel.
> 
> As well as fixing all of the deadlocks and making it easy
> to ensure we don't have any more weird failures in the future.
> 
> Eric
> 
> p.s. For tasklist_lock I suspect we can put a lock in struct pid
> and use that to guard the task lists in struct pid.  Which would
> allow for tasklist_lock to be take much less.  Then we would
> just need a solution for task->parent and task->real_parent and
> I think all of the major users of tasklist_lock would be gone.
> 
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-08 15:14                   ` Eric W. Biederman
  2020-04-08 15:21                     ` Bernd Edlinger
@ 2020-04-08 16:34                     ` Linus Torvalds
  2020-04-09 14:58                       ` Eric W. Biederman
  1 sibling, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-08 16:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Wed, Apr 8, 2020 at 8:17 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Yes.  I missed the fact that we could take the lock killable.
> We still unfortunately have the deadlock with ptrace.

That, I feel, is similarly trivial.

Again, anybody who takes the lock for writing should just do so
killably. So you have three cases:

 - ptrace wins the race and gets the lock.

   Fine, the execve will wait until afterwards.

 - ptrace loses the race and is not a thread with execve.

   Fine, the execve() won, and the ptrace will wait until after execve.

 - ptrace loses the race and is a thread with execve.

   Fine, the execve() will kill the thing in de_thread() and the ptrace
thread will release the lock and die.

So all three cases are fine, and none of them have any behavioral
differences (as mentioned, the killing is "invisible" to users since
it's fundamentally a race, and you can consider the kill to have
happened before the ptrace started).
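
The three cases above can be written down as a tiny decision function.
This is only a sketch of the argument: the enum, the function names and
the two boolean inputs are invented here and do not correspond to
anything in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three cases. */
enum toy_race_outcome {
	TOY_EXECVE_WAITS,	/* ptrace won; execve waits until afterwards */
	TOY_PTRACE_WAITS,	/* execve won; ptrace waits until after execve */
	TOY_PTRACE_KILLED,	/* execve won; the ptrace thread is a sibling */
};

enum toy_race_outcome toy_ptrace_vs_execve(bool ptrace_got_lock_first,
					   bool ptrace_is_sibling_thread)
{
	if (ptrace_got_lock_first)
		return TOY_EXECVE_WAITS;
	if (!ptrace_is_sibling_thread)
		return TOY_PTRACE_WAITS;
	return TOY_PTRACE_KILLED;
}

const char *toy_outcome_name(enum toy_race_outcome outcome)
{
	switch (outcome) {
	case TOY_EXECVE_WAITS:
		return "execve waits for ptrace";
	case TOY_PTRACE_WAITS:
		return "ptrace waits for execve";
	case TOY_PTRACE_KILLED:
		return "ptrace thread killed by execve";
	}
	assert(!"unknown outcome");
	return "?";
}
```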

> It might be simpler to make whichever lock we are dealing with per
> task_struct instead of per signal_struct.  Then we don't even have to
> think about what de_thread does or if the lock is taken killable.

Well, yes, but I think the de_thread behavior of killing threads is
required anyway, so..

> I keep wondering if we could do something similar to vfork.  That is
> allocate an new task_struct and fully set it up for the post exec
> process, and then make it visible under tasklist_lock.  Finally we could
> free the old process.
>
> That would appear as if everything happened atomically from
> the point of view of the rest of the kernel.

I do think that would have been a lovely design originally, and would
avoid a lot of things. So "execve()" would basically look like an exit
and a thread creation with the same pid (without the SIGCHLD to the
parent, obviously).

That would also naturally handle the "flush pending signals" etc issues.

The fact that we created a whole new mm-struct ended up fixing a lot
of problems (even if it was painful to do). This might be similar.

But it's not what we've ever done, and I do suspect you'd run into a
lot of odd small special cases if we were to try to do it now.

So I think it's simpler to just start making the "cred lock waiters
have to be killable" rule. It's not like that's a very complex rule.

              Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
  2020-04-07  1:31                 ` [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf Eric W. Biederman
  2020-04-07 15:58                   ` Kees Cook
  2020-04-07 16:11                   ` Christian Brauner
@ 2020-04-08 17:25                   ` Linus Torvalds
  2020-04-08 19:51                     ` Eric W. Biederman
  2 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-08 17:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner

On Mon, Apr 6, 2020 at 6:34 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> In 2016 Linus moved install_exec_creds immediately after
> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
> potential information leak.
>
> Perform the same cleanup for the other binary formats

Can we not move it _into_ setup_new_exec() now if you've changed all
the binfmt handlers?

The fewer cases of "this gets called by the low-level handler at
different points" that we have, the better off we'd be, I think. One
of the complexities of our execve() code is that some of it gets
called directly, and some of it gets called by the binfmt handler, and
it's often very hard to see the logic when it jumps out to the binfmt
code and then back to the generic fs/exec.c code..

             Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC][PATCH 0/3] exec_update_mutex related cleanups
  2020-04-07  1:29               ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Eric W. Biederman
                                   ` (3 preceding siblings ...)
  2020-04-07 16:22                 ` [RFC][PATCH 0/3] exec_update_mutex related cleanups Christian Brauner
@ 2020-04-08 17:26                 ` Linus Torvalds
  4 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-08 17:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner

On Mon, Apr 6, 2020 at 6:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I will resend this after the merge window for a proper review when
> people are less likely to be distrcacted but I figured I might as well
> send this out now so I can see if anyone runs screaming from this code.

Ack. It looks sane to me. I had that one question about a further
simplification, but even without that it looks like an improvement.

           Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
  2020-04-08 17:25                   ` Linus Torvalds
@ 2020-04-08 19:51                     ` Eric W. Biederman
  0 siblings, 0 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-08 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Linux Kernel Mailing List, Alexey Gladkov,
	Oleg Nesterov, Kees Cook, Jann Horn, Christian Brauner

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Apr 6, 2020 at 6:34 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> In 2016 Linus moved install_exec_creds immediately after
>> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
>> potential information leak.
>>
>> Perform the same cleanup for the other binary formats
>
> Can we not move it _into_ setup_new_exec() now if you've changed all
> the binfmt handlers?
>
> The fewer cases of "this gets called by the low-level handler at
> different points" that we have, the better off we'd be, I think. One
> of the complexities of our execve() code is that some of it gets
> called directly, and some of it gets called by the binfmt handler, and
> it's often very hard to see the logic when it jumps out to the binfmt
> code and then back to the generic fs/exec.c code..

Yes.  I already have merging of setup_new_exec and install_exec_creds in
my working tree.  I just posted the simplest set of patches to get the
idea across.

We can almost merge those two with flush_old_exec as well except for the
code that sets the personality between flush_old_exec and
setup_new_exec.  I am wondering if maybe setting the personality should
be a callback.

Eric


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-08 16:34                     ` Linus Torvalds
@ 2020-04-09 14:58                       ` Eric W. Biederman
  2020-04-09 15:15                         ` Bernd Edlinger
  2020-04-09 16:15                         ` Linus Torvalds
  0 siblings, 2 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-09 14:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Apr 8, 2020 at 8:17 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Yes.  I missed the fact that we could take the lock killable.
>> We still unfortunately have the deadlock with ptrace.
>
> That, I feel, is similarly trivial.
>
> Again, anybody who takes the lock for writing should just do so
> killably. So you have three cases:
>
>  - ptrace wins the race and gets the lock.
>
>    Fine, the execve will wait until afterwards.
>
>  - ptrace loses the race and is not a thread with execve.
>
>    Fine, the execve() won, and the ptrace will wait until after execve.
>
>  - ptrace loses the race and is a thread with execve.
>
>    Fine, the execve() will kill the thing in dethread() and the ptrace
> thread will release the lock and die.

That would be nice.

That is unfortunately not how ptrace_event(PTRACE_EVENT_EXIT, ...) works.

When a thread going about its ordinary business receives the SIGKILL
from de_thread the thread changes course and finds its way to do_exit.
In do_exit the thread calls ptrace_event(PTRACE_EVENT_EXIT, ...) and
blocks waiting for the tracer to let it continue.

Further, from previous attempts to fix this we know that there
are pieces of userspace that expect that stop to happen.  So if the
PTRACE_EVENT_EXIT stop does not happen, userspace which is already
attached breaks.

Further, this case with ptrace is something we know userspace
does, and it is just a matter of bad timing of attaching to the
threads when one thread is exec'ing.  So we don't even need to wonder if
userspace would do such a silly thing.
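
As a sketch of why killing the waiter does not help here, the
situation can be drawn as a wait-for chain and checked for a cycle.
The three roles and the edges are my reading of the description above
(exec waits in de_thread for the sibling to exit, the sibling waits in
its PTRACE_EVENT_EXIT stop for the tracer, and the tracer is assumed to
be waiting for the lock exec holds); the code is a toy model, not
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical roles in the toy model. */
enum { TOY_EXEC, TOY_SIBLING, TOY_TRACER, TOY_NR_TASKS };
#define TOY_NOBODY (-1)

/*
 * waits_for[t] names the single task that t is blocked on, or
 * TOY_NOBODY.  Returns true if following the chain from start leads
 * back to start, i.e. start can never make progress.
 */
bool toy_on_wait_cycle(const int waits_for[TOY_NR_TASKS], int start)
{
	int cur = start;
	int steps;

	assert(waits_for && start >= 0 && start < TOY_NR_TASKS);
	for (steps = 0; steps < TOY_NR_TASKS; steps++) {
		cur = waits_for[cur];
		if (cur == TOY_NOBODY)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

If the tracer is not blocked on the lock, the chain ends at the tracer
and there is no cycle; once it is, all three wait on each other.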



There are a lot of similar cases that can happen if userspace inserts
itself into the path of page faults, directly or indirectly,
as long as some wait somewhere ultimately waits for a ptrace attach.


> So all three cases are fine, and none of them have any behavioral
> differences (as mentioned, the killing is "invisible" to users since
> it's fundamentally a race, and you can consider the kill to have
> happened before the ptrace started).

See above.


>> It might be simpler to make whichever lock we are dealing with per
>> task_struct instead of per signal_struct.  Then we don't even have to
>> think about what de_thread does or if the lock is taken killable.
>
> Well, yes, but I think the dethread behavior of killing threads is
> required anyway, so..

It is, but it is actually part of the problem.

I think making some of this thread local might solve another easy case
and let us focus more on the really hard problem.

>> I keep wondering if we could do something similar to vfork.  That is
>> allocate an new task_struct and fully set it up for the post exec
>> process, and then make it visible under tasklist_lock.  Finally we could
>> free the old process.
>>
>> That would appear as if everything happened atomically from
>> the point of view of the rest of the kernel.
>
> I do think that would have been a lovely design originally, and would
> avoid a lot of things. So "execve()" would basically look like an exit
> and a thread creation with the same pid (without the SIGCHILD to the
> parent, obviously)
>
> That would also naturally handle the "flush pending signals" etc issues.
>
> The fact that we created a whole new mm-struct ended up fixing a lot
> of problems (even if it was painful to do). This might be similar.
>
> But it's not what we've ever done, and I do suspect you'd run into a
> lot of odd small special cases if we were to try to do it now.

I completely agree, which is why I haven't been rushing to do that.
But this remains the only idea that I have thought of that would solve all
of the issues.

> So I think it's simpler to just start making the "cred lock waiters
> have to be killable" rule. It's not like that's a very complex rule.

I just looked at the remaining users of cred_guard_mutex and they are
all killable or interruptible.  Further all of the places that have been
converted to use the exec_update_mutex are also all killable or
interruptible.

So where we came in is that we already had the killable rule, and that
is what has allowed this to remain on the back burner for so long.  At
least you could kill the affected process from userspace.  Unfortunately
the deadlocks still happen.

Eric

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 14:58                       ` Eric W. Biederman
@ 2020-04-09 15:15                         ` Bernd Edlinger
  2020-04-09 16:15                         ` Linus Torvalds
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-09 15:15 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On 4/9/20 4:58 PM, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
>> On Wed, Apr 8, 2020 at 8:17 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> Yes.  I missed the fact that we could take the lock killable.
>>> We still unfortunately have the deadlock with ptrace.
>>
>> That, I feel, is similarly trivial.
>>
>> Again, anybody who takes the lock for writing should just do so
>> killably. So you have three cases:
>>
>>  - ptrace wins the race and gets the lock.
>>
>>    Fine, the execve will wait until afterwards.
>>
>>  - ptrace loses the race and is not a thread with execve.
>>
>>    Fine, the execve() won, and the ptrace will wait until after execve.
>>
>>  - ptrace loses the race and is a thread with execve.
>>
>>    Fine, the execve() will kill the thing in dethread() and the ptrace
>> thread will release the lock and die.
> 
> That would be nice.
> 
> That is unfortunately not how ptrace_event(PTRACE_EVENT_EXIT, ...) works.
> 
> When a thread going about it's ordinary business receives the SIGKILL
> from de_thread the thread changes course and finds it's way to do_exit.
> In do_exit the thread calls ptrace_event(PTRACE_EVENT_EXIT, ...) and
> blocks waiting for the tracer to let it continue.
> 
> Further from previous attempts to fix this we know that there
> are pieces of userspace expect that stop to happen.  So if the
> PTRACE_EVENT_EXIT stop does not happen userspace which is already
> attached breaks.
> 
> Further this case with ptrace is something we know userspace
> does and is is just a matter of bad timing of attaching to the
> threads when one thread is exec'ing.  So we don't even need to wonder if
> userspace would do such a silling thing.
> 
> 
> 
> There are a lot similar cases that can happen if userspace inserts
> itself into the path of page faults, directly or indirectly,
> as long as some wait somewhere ultimately waits for a ptrace attach.
> 
> 

Remember, as a last resort there is my "insane" 15/16 patch, which
Linus admittedly hates, but it works.  If we find a cleaner solution
it can always be reverted, that is just fine for me.

Thanks
Bernd.

>> So all three cases are fine, and none of them have any behavioral
>> differences (as mentioned, the killing is "invisible" to users since
>> it's fundamentally a race, and you can consider the kill to have
>> happened before the ptrace started).
> 
> See above.
> 
> 
>>> It might be simpler to make whichever lock we are dealing with per
>>> task_struct instead of per signal_struct.  Then we don't even have to
>>> think about what de_thread does or if the lock is taken killable.
>>
>> Well, yes, but I think the dethread behavior of killing threads is
>> required anyway, so..
> 
> It is, but it is actually part of the problem.
> 
> I think making some of this thread local might solve another easy case
> and let us focus more on the really hard problem.
> 
>>> I keep wondering if we could do something similar to vfork.  That is
>>> allocate a new task_struct and fully set it up for the post exec
>>> process, and then make it visible under tasklist_lock.  Finally we could
>>> free the old process.
>>>
>>> That would appear as if everything happened atomically from
>>> the point of view of the rest of the kernel.
>>
>> I do think that would have been a lovely design originally, and would
>> avoid a lot of things. So "execve()" would basically look like an exit
>> and a thread creation with the same pid (without the SIGCHLD to the
>> parent, obviously)
>>
>> That would also naturally handle the "flush pending signals" etc issues.
>>
>> The fact that we created a whole new mm-struct ended up fixing a lot
>> of problems (even if it was painful to do). This might be similar.
>>
>> But it's not what we've ever done, and I do suspect you'd run into a
>> lot of odd small special cases if we were to try to do it now.
> 
> I completely agree, which is why I haven't been rushing to do that.
> But this remains the only idea that I have thought of that would solve all
> of the issues.
> 
>> So I think it's simpler to just start making the "cred lock waiters
>> have to be killable" rule. It's not like that's a very complex rule.
> 
> I just looked at the remaining users of cred_guard_mutex and they are
> all killable or interruptible.  Further all of the places that have been
> converted to use the exec_update_mutex are also all killable or
> interruptible.
> 
> So where we came in is that we had the killable rule and that is what
> has allowed this to remain on the backburner for so long.  At least you
> could kill the affected process from userspace.   Unfortunately the
> deadlocks still happen.
> 
> Eric
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 14:58                       ` Eric W. Biederman
  2020-04-09 15:15                         ` Bernd Edlinger
@ 2020-04-09 16:15                         ` Linus Torvalds
  2020-04-09 16:24                           ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 16:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Thu, Apr 9, 2020 at 8:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> When a thread going about its ordinary business receives the SIGKILL
> from de_thread the thread changes course and finds its way to do_exit.
> In do_exit the thread calls ptrace_event(PTRACE_EVENT_EXIT, ...) and
> blocks waiting for the tracer to let it continue.

Hah.

That code isn't _supposed_ to block.

may_ptrace_stop() is supposed to stop the blocking exactly so that it
doesn't deadlock.

I wonder why that doesn't work..

[ Goes and look ]

Oh. I see.

That may_ptrace_stop() only ever considered core-dumping, not execve().

But if _that_ is the reason for the deadlock, then it's trivially fixed.

Famous last words..

             Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 16:15                         ` Linus Torvalds
@ 2020-04-09 16:24                           ` Linus Torvalds
  2020-04-09 17:03                             ` Eric W. Biederman
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 16:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov

On Thu, Apr 9, 2020 at 9:15 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> may_ptrace_stop() is supposed to stop the blocking exactly so that it
> doesn't deadlock.
>
> I wonder why that doesn't work..
>
> [ Goes and look ]
>
> Oh. I see.
>
> That may_ptrace_stop() only ever considered core-dumping, not execve().
>
> But if _that_ is the reason for the deadlock, then it's trivially fixed.

So maybe may_ptrace_stop() should just do something like this
(ENTIRELY UNTESTED):

        struct task_struct *me = current, *parent = me->parent;

        if (!likely(me->ptrace))
                return false;

        /*
         * If the parent is exiting or core-dumping, it's not
         * listening to our signals
         */
        if (parent->signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP))
                return false;

        /* if the parent is going through a execve(), it's not listening */
        if (parent->signal->group_exit_task)
                return false;

        return true;

instead of the fairly ad-hoc tests for core-dumping.

The above is hand-wavy - I didn't think a lot about locking.
may_ptrace_stop() is already called under the tasklist_lock, so the
parent won't change, but maybe it should take the signal lock?

So the above very much is *not* meant to be a "do it like this", more
of a "this direction, maybe"?

The existing code is definitely broken. It special-cases core-dumping
probably simply because that's the only case people had realized, and
not thought of the execve() thing.
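
The direction sketched above can be illustrated outside the kernel. The
following is only a userspace model, not kernel code: the model_* structs,
the MODEL_* flag values, and model_may_ptrace_stop() are hypothetical
stand-ins invented for this illustration, and the model ignores the locking
question raised above entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this model only; not taken from kernel headers. */
#define MODEL_SIGNAL_GROUP_EXIT     0x1u
#define MODEL_SIGNAL_GROUP_COREDUMP 0x2u

/* Hypothetical stand-in for the parts of signal_struct the check reads. */
struct model_signal {
	unsigned int flags;
	const void *group_exit_task;	/* non-NULL while an execve() is in progress */
};

/* Hypothetical stand-in for the parts of task_struct the check reads. */
struct model_task {
	bool ptraced;
	struct model_task *parent;
	struct model_signal *signal;
};

/* Returns true only if stopping for the tracer cannot wait forever. */
static bool model_may_ptrace_stop(const struct model_task *me)
{
	const struct model_task *parent = me->parent;

	/* Not traced at all: nothing to stop for. */
	if (!me->ptraced)
		return false;

	/* A parent that is exiting or core-dumping is not listening. */
	if (parent->signal->flags &
	    (MODEL_SIGNAL_GROUP_EXIT | MODEL_SIGNAL_GROUP_COREDUMP))
		return false;

	/* A parent that is going through execve() is not listening either. */
	if (parent->signal->group_exit_task)
		return false;

	return true;
}
```

In this model a traced task stops only while its tracer is neither exiting,
core-dumping, nor in the middle of an execve(); whether the real check needs
the signal lock is left open, as in the text above.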

            Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 16:24                           ` Linus Torvalds
@ 2020-04-09 17:03                             ` Eric W. Biederman
  2020-04-09 17:17                               ` Bernd Edlinger
  2020-04-09 17:36                               ` Linus Torvalds
  0 siblings, 2 replies; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-09 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov


Adding Oleg to the conversation, if for no other reason than that he can
see it is happening.

Oleg has had a test case where this can happen for years and nothing
has come out as an obvious proper fix for this deadlock issue.

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 9, 2020 at 9:15 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> may_ptrace_stop() is supposed to stop the blocking exactly so that it
>> doesn't deadlock.
>>
>> I wonder why that doesn't work..
>>
>> [ Goes and look ]
>>
>> Oh. I see.
>>
>> That may_ptrace_stop() only ever considered core-dumping, not execve().
>>
>> But if _that_ is the reason for the deadlock, then it's trivially fixed.
>
> So maybe may_ptrace_stop() should just do something like this
> (ENTIRELY UNTESTED):
>
>         struct task_struct *me = current, *parent = me->parent;
>
>         if (!likely(me->ptrace))
>                 return false;
>
>         /*
>          * If the parent is exiting or core-dumping, it's not
>          * listening to our signals
>          */
>         if (parent->signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP))
>                 return false;
>
>         /* if the parent is going through a execve(), it's not listening */
>         if (parent->signal->group_exit_task)
>                 return false;
>
>         return true;
>
> instead of the fairly ad-hoc tests for core-dumping.
>
> The above is hand-wavy - I didn't think a lot about locking.
> may_ptrace_stop() is already called under the tasklist_lock, so the
> parent won't change, but maybe it should take the signal lock?
>
> So the above very much is *not* meant to be a "do it like this", more
> of a "this direction, maybe"?
>
> The existing code is definitely broken. It special-cases core-dumping
> probably simply because that's the only case people had realized, and
> not thought of the execve() thing.


I don't see how there can be a complete solution in may_ptrace_stop.

a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.

   Those are the defined semantics and I believe it is something
   as common as strace that depends on them.

b) Even if we added a test for our ptrace parent blocking in a ptrace
   attach of an ongoing exec, it still wouldn't help.

   That ptrace attach could legitimately come after the thread in
   question has stopped and notified its parent it is stopped.



None of this is to say I like the semantics of PTRACE_EVENT_EXIT.  It is
just we will violate the no regressions rule if we don't stop there
during exec.

The normal case is that the strace or whomever is already attached to
all of the threads during exec and no deadlock occurs.  So the current
behavior is quite usable.



Maybe my memory is wrong that userspace cares but I really don't think
so.


Eric


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 17:03                             ` Eric W. Biederman
@ 2020-04-09 17:17                               ` Bernd Edlinger
  2020-04-09 17:37                                 ` Linus Torvalds
  2020-04-09 17:36                               ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-09 17:17 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov


On 4/9/20 7:03 PM, Eric W. Biederman wrote:
> 
> Adding Oleg to the conversation, if for no other reason than that he can
> see it is happening.
> 
> Oleg has had a test case where this can happen for years and nothing
> has come out as an obvious proper fix for this deadlock issue.
> 

Just for reference, I used Oleg's test case,
and improved it a bit.  The test case anticipates the
EAGAIN return code from PTRACE_ATTACH.  This is likely
to change somehow.
If Linus's idea works, you will probably have to
look at adjusting the test expectations again.

I would still be surprised if any other solution works.


Bernd. 

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
>> On Thu, Apr 9, 2020 at 9:15 AM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> may_ptrace_stop() is supposed to stop the blocking exactly so that it
>>> doesn't deadlock.
>>>
>>> I wonder why that doesn't work..
>>>
>>> [ Goes and look ]
>>>
>>> Oh. I see.
>>>
>>> That may_ptrace_stop() only ever considered core-dumping, not execve().
>>>
>>> But if _that_ is the reason for the deadlock, then it's trivially fixed.
>>
>> So maybe may_ptrace_stop() should just do something like this
>> (ENTIRELY UNTESTED):
>>
>>         struct task_struct *me = current, *parent = me->parent;
>>
>>         if (!likely(me->ptrace))
>>                 return false;
>>
>>         /*
>>          * If the parent is exiting or core-dumping, it's not
>>          * listening to our signals
>>          */
>>         if (parent->signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP))
>>                 return false;
>>
>>         /* if the parent is going through a execve(), it's not listening */
>>         if (parent->signal->group_exit_task)
>>                 return false;
>>
>>         return true;
>>
>> instead of the fairly ad-hoc tests for core-dumping.
>>
>> The above is hand-wavy - I didn't think a lot about locking.
>> may_ptrace_stop() is already called under the tasklist_lock, so the
>> parent won't change, but maybe it should take the signal lock?
>>
>> So the above very much is *not* meant to be a "do it like this", more
>> of a "this direction, maybe"?
>>
>> The existing code is definitely broken. It special-cases core-dumping
>> probably simply because that's the only case people had realized, and
>> not thought of the execve() thing.
> 
> 
> I don't see how there can be a complete solution in may_ptrace_stop.
> 
> a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.
> 
>    Those are the defined semantics and I believe it is something
>    as common as strace that depends on them.
> 
> b) Even if we added a test for our ptrace parent blocking in a ptrace
>    attach of an ongoing exec, it still wouldn't help.
> 
>    That ptrace attach could legitimately come after the thread in
>    question has stopped and notified its parent it is stopped.
> 
> 
> 
> None of this is to say I like the semantics of PTRACE_EVENT_EXIT.  It is
> just we will violate the no regressions rule if we don't stop there
> during exec.
> 
> The normal case is that the strace or whomever is already attached to
> all of the threads during exec and no deadlock occurs.  So the current
> behavior is quite usable.
> 
> 
> 
> Maybe my memory is wrong that userspace cares but I really don't think
> so.
> 
> 
> Eric
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 17:03                             ` Eric W. Biederman
  2020-04-09 17:17                               ` Bernd Edlinger
@ 2020-04-09 17:36                               ` Linus Torvalds
  2020-04-09 20:34                                 ` Eric W. Biederman
  1 sibling, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 17:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 10:06 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.
>
>    Those are the defined semantics and I believe it is something
>    as common as strace that depends on them.

Don't be silly.

Of course we must stop IF THE TRACER IS ACTUALLY TRACING US.

But that's simply not the case. The deadlock case is where the tracer
is going through an execve, and the tracing thread is being killed.

Claiming that "user space breaks" is garbage. User space cannot care.
In fact, it's broken right now because it deadlocks, but it deadlocks
because that code waits for absolutely no good reason.

> b) Even if we added a test for our ptrace parent blocking in a ptrace
>    attach of an ongoing exec, it still wouldn't help.
>
>    That ptrace attach could legitimately come after the thread in
>    question has stopped and notified its parent it is stopped.

What?

The whole point is that the tracer _is_ the thing going through
execve(), which is why you get the deadlock in the first place.

You make no sense.

If the tracer is somebody else, we wouldn't be deadlocking. We'd just
be tracing.

I really don't understand your arguments against my patch. They seem
entirely nonsensical. Are we speaking past each other some way?

                   Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 17:17                               ` Bernd Edlinger
@ 2020-04-09 17:37                                 ` Linus Torvalds
  2020-04-09 17:46                                   ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 17:37 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 10:17 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> Just for reference, I used Oleg's test case,
> and improved it a bit.

I'm sure the test-case got posted somewhere, but mind sending it to me
(or just pointing me at it?)

             Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 17:37                                 ` Linus Torvalds
@ 2020-04-09 17:46                                   ` Bernd Edlinger
  2020-04-09 18:36                                     ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-09 17:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov



On 4/9/20 7:37 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 10:17 AM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> Just for reference, I used Oleg's test case,
>> and improved it a bit.
> 
> I'm sure the test-case got posted somewhere, but mind sending it to me
> (or just pointing me at it?)
> 


No problem, looks like you already swallowed the hook :-)
Look at the following commit in your tree:

commit 2de4e82318c7f9d34f4b08599a612cd4cd10bf0b
Author: Bernd Edlinger <bernd.edlinger@hotmail.de>
Date:   Fri Mar 20 21:26:19 2020 +0100

    selftests/ptrace: add test cases for dead-locks
    
    This adds test cases for ptrace deadlocks.
    
    Additionally fixes a compile problem in get_syscall_info.c,
    observed with gcc-4.8.4:
    
    get_syscall_info.c: In function 'get_syscall_info':
    get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
                                     allowed in C99 mode
       for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
       ^
    get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
                                   your code
    
    Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


Test case 1/2 is working, 2/2 is failing (deadlocking).
I think even the time-out handler does not kill the dead-lock,
if I remember correctly.

And sorry, I anticipated part 15/16 and 16/16 would be pulled at the
same time, so the glitch would not be visible by now.


Bernd.

>              Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 17:46                                   ` Bernd Edlinger
@ 2020-04-09 18:36                                     ` Linus Torvalds
  2020-04-09 19:42                                       ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 18:36 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 10:46 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> Test case 1/2 is working, 2/2 is failing (deadlocking).
> I think even the time-out handler does not kill the dead-lock,
> if I remember correctly.

Ok, I get

  [==========] Running 2 tests from 1 test cases.
  [ RUN      ] global.vmaccess
  [       OK ] global.vmaccess
  [ RUN      ] global.attach
  global.attach: Test terminated by timeout
  [     FAIL ] global.attach
  [==========] 1 / 2 tests passed.
  [  FAILED  ]

but reading that test it's not doing what I thought it was doing.

I thought it was doing the ptrace from within a thread. But it's doing
a proper fork() and doing the attach from the parent, just doing the
TRACEME from a thread that exits.

I guess I need to look at what that test is actually testing, because
it wasn't what I thought.

               Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 18:36                                     ` Linus Torvalds
@ 2020-04-09 19:42                                       ` Linus Torvalds
  2020-04-09 19:57                                         ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 19:42 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 11:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I guess I need to look at what that test is actually testing, because
> it wasn't what I thought.

Ahh.

The problem is that zap_other_threads() counts all threads.

But it doesn't bother notifying already dead threads, even if it counts them.

And then it waits for the threads to go away, but didn't do anything
to make that dead thread go away.

And the test case has an already dead thread that is just waiting to
be reaped by the same person who is now waiting for it to go away.

So it just stays around.

Honestly, I'm not entirely sure this is worth worrying about, since
it's all killable anyway and only happens if you do something stupid.

I mean, you can get two threads to wait for each other more easily other ways.

Or maybe we just shouldn't count already dead threads? Yeah, they'd
share that current signal struct, but they're dead and can't do
anything about it, they can only be reaped.

But that would mean that we should also move the signal->notify_count
update to when we mark the EXIT_ZOMBIE or EXIT_DEAD in exit_state.
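
The counting behavior described above can be modeled in userspace. This is
only a sketch under stated assumptions: struct model_thread and both
model_zap_* functions are hypothetical names made up for the illustration,
"dead" stands in for a non-zero exit_state, and "got_kill" stands in for
queueing SIGKILL and waking the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one "other" thread in the thread group. */
struct model_thread {
	bool dead;	/* models a non-zero exit_state (zombie or dead) */
	bool got_kill;	/* models "SIGKILL queued and thread woken up" */
};

/*
 * Models the behavior described above: every thread is counted,
 * but an already dead thread is never notified.
 */
static int model_zap_count_all(struct model_thread *t, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++) {
		count++;
		if (t[i].dead)
			continue;
		t[i].got_kill = true;
	}
	return count;
}

/*
 * Models the alternative: already dead threads are skipped
 * before they are counted.
 */
static int model_zap_skip_dead(struct model_thread *t, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++) {
		if (t[i].dead)
			continue;
		count++;
		t[i].got_kill = true;
	}
	return count;
}
```

With one zombie among three other threads, the first variant reports three
threads to wait for while only two were told to exit; the second reports
two. The model says nothing about where the notify_count accounting would
have to move, which is the open question in the text.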

          Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 19:42                                       ` Linus Torvalds
@ 2020-04-09 19:57                                         ` Bernd Edlinger
  2020-04-09 20:04                                           ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-09 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On 4/9/20 9:42 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 11:36 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I guess I need to look at what that test is actually testing, because
>> it wasn't what I thought.
> 
> Ahh.
> 
> The problem is that zap_other_threads() counts all threads.
> 
> But it doesn't bother notifying already dead threads, even if it counts them.
> 
> And then it waits for the threads to go away, but didn't do anything
> to make that dead thread go away.
> 
> And the test case has an already dead thread that is just waiting to
> be reaped by the same person who is now waiting for it to go away.
> 
> So it just stays around.
> 
> Honestly, I'm not entirely sure this is worth worrying about, since
> it's all killable anyway and only happens if you do something stupid.
> 

The use case where this may happen is strace:
when you call strace with lots of -p <pid> arguments,
and one of them is a bomb, strace gets stuck.

So when that happens in the beginning, it is not much
work lost, but if you traced a megabyte of data to analyze
and then that happens, you are not really amused.

Also, slightly different things happen with PTRACE_O_TRACEEXIT:
then the tracer is supposed to continue the exit, and then
to wait for the thread to die.  Which is twice as ugly...


Bernd.


> I mean, you can get two threads to wait for each other more easily other ways.
> 
> Or maybe we just shouldn't count already dead threads? Yeah, they'd
> share that current signal struct, but they're dead and can't do
> anything about it, they can only be reaped.
> 
> But that would mean that we should also move the signal->notify_count
> update to when we mark the EXIT_ZOMBIE or EXIT_DEAD in exit_state.
> 
>           Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 19:57                                         ` Bernd Edlinger
@ 2020-04-09 20:04                                           ` Linus Torvalds
  2020-04-09 20:36                                             ` Bernd Edlinger
  2020-04-09 21:00                                             ` Eric W. Biederman
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 20:04 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 12:57 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> The use case where this may happen is strace:
> when you call strace with lots of -p <pid> arguments,
> and one of them is a bomb, strace gets stuck.

Yeah, so from a convenience angle I do agree that it would be nicer to
just not count dead threads.

You can test that by just moving the

                /* Don't bother with already dead threads */
                if (t->exit_state)
                        continue;

test in zap_other_threads() to above the

                count++;

line instead.

NOTE! That is *NOT* the correct true fix. I'm just suggesting that you
try if it fixes that particular test-case (I did not try it myself -
because .. lazy)

If Oleg agrees that we could take the approach that we can share a
signal struct with dead threads, we'd also need to change the
accounting to do that notify_count not when the signal struct is
unlinked, but when exit_state is first set.

I'm not convinced that's the right solution, but I do agree that it's
annoying how easily strace can get stuck, since one of the main uses
for strace is for debugging nasty situations.

                Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 17:36                               ` Linus Torvalds
@ 2020-04-09 20:34                                 ` Eric W. Biederman
  2020-04-09 20:56                                   ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-09 20:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 9, 2020 at 10:06 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.
>>
>>    Those are the defined semantics and I believe it is something
>>    as common as strace that depends on them.
>
> Don't be silly.
>
> Of course we must stop IF THE TRACER IS ACTUALLY TRACING US.
>
> But that's simply not the case. The deadlock case is where the tracer
> is going through an execve, and the tracing thread is being killed.

Linus please don't be daft.

I will agree that if one thread in a process ptraces another thread
in the same process, and the tracing thread calls execve we have
a problem.  A different problem but one worth addressing.




The deadlock case I am talking about, the deadlock case that trivially
exists in real code, is:

A single threaded process (the tracer) ptrace attaches to every thread of a
multi-threaded process (the tracee).

If one of these attaches succeeds, and another thread of the tracee
process calls execve before the tracer attaches to it, then the tracer
blocks in ptrace_attach waiting for the tracee's execve to succeed
while the tracee blocks in de_thread waiting for its other threads to
exit.  The threads of the tracee attempt to exit but one or more of them
are in PTRACE_EVENT_EXIT waiting for the tracer to let them continue.

The tracer of course is stalled waiting for the exec to succeed.


Let me see if I can draw a picture.




Tracer                       TraceeThreadA     TraceeThreadB
ptrace_attach(TraceeThreadA)
                                               execve
                                               acquires cred_guard_mutex
ptrace_attach(TraceeThreadB)
 Blocks on cred_guard_mutex
                                               de_thread
                                               waits for other threads to exit
                             Receives SIGKILL
                             do_exit()
                             PTRACE_EVENT_EXIT
                               Waits for tracer


So we have a loop.

    TraceeThreadB is waiting for TraceeThreadA to reach exit_notify.
    TraceeThreadA is waiting for the tracer to allow it to continue.
    The Tracer is waiting for TraceeThreadB to finish its call to exec.

Since they are all waiting for each other that loop is a deadlock.
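
The loop above can be written down as a tiny wait-for graph. The following
is a minimal userspace sketch, not kernel code: the actor names mirror the
picture, while the enum, WAITS_FOR_NOBODY, and model_in_wait_cycle() are
hypothetical names introduced only for this illustration. It assumes each
actor waits on at most one other actor.

```c
#include <stdbool.h>

/* The three parties from the picture above. */
enum { TRACER, TRACEE_THREAD_A, TRACEE_THREAD_B, NR_ACTORS };

#define WAITS_FOR_NOBODY (-1)

/*
 * waits_for[i] names the actor that actor i is blocked on, or
 * WAITS_FOR_NOBODY if actor i can make progress.  Returns true if
 * following the chain from 'start' leads back to 'start', i.e. the
 * actor is part of a circular wait.
 */
static bool model_in_wait_cycle(const int waits_for[NR_ACTORS], int start)
{
	int cur = start;

	/* A cycle through 'start' has at most NR_ACTORS edges. */
	for (int steps = 0; steps < NR_ACTORS; steps++) {
		cur = waits_for[cur];
		if (cur == WAITS_FOR_NOBODY)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

Filling the table with "Tracer waits for TraceeThreadB, TraceeThreadA waits
for Tracer, TraceeThreadB waits for TraceeThreadA" puts all three actors on
one cycle; clearing any single edge breaks it for everyone.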

All it takes is a tracer that uses PTRACE_EVENT_EXIT.

Does that make the deadlock that I see clear?


In your proposed lock revision you were talking about ptrace_attach
taking your new lock for write, so I don't see your proposed lock
being any different in this scenario from cred_guard_mutex.  Perhaps I
missed something?

I know Oleg's test case was a little more involved, but that was to
guarantee the timing; perhaps that introduced confusion.

Eric



* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 20:04                                           ` Linus Torvalds
@ 2020-04-09 20:36                                             ` Bernd Edlinger
  2020-04-09 21:00                                             ` Eric W. Biederman
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-09 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On 4/9/20 10:04 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 12:57 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> The use case where this may happen is strace:
>> when you call strace with lots of -p <pid> arguments,
>> and one of them is a bomb, strace gets stuck.
> 
> Yeah, so from a convenience angle I do agree that it would be nicer to
> just not count dead threads.
> 
> You can test that by just moving the
> 
>                 /* Don't bother with already dead threads */
>                 if (t->exit_state)
>                         continue;
> 
> test in zap_other_threads() to above the
> 
>                 count++;
> 
> line instead.
> 
> NOTE! That is *NOT* the correct true fix. I'm just suggesting that you

Eric, I think he means you, I am too busy with other work ;-) right now.

> try if it fixes that particular test-case (I did not try it myself -
> because .. lazy)
> 
> If Oleg agrees that we could take the approach that we can share a
> signal struct with dead threads, we'd also need to change the
> accounting to do that notify_count not when the signal struct is
> unlinked, but when exit_state is first set.
> 
> I'm not convinced that's the right solution, but I do agree that it's
> annoying how easily strace can get stuck, since one of the main uses
> for strace is for debugging nasty situations.
> 
>                 Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 20:34                                 ` Eric W. Biederman
@ 2020-04-09 20:56                                   ` Linus Torvalds
  0 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 20:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Bernd Edlinger,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 1:37 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Since they are all waiting for each other that loop is a deadlock.

No.

That's just a user bug. It's not a deadlock for the kernel.

The fact that you guys kept calling it a deadlock was what confused me
and made me think you were talking about something much more
fundamental (like the same thread trying to take the lock recursively
- *THAT* is a deadlock).

There are lots of easier ways to make people wait for each other. This
is a trivial one:

  #include <unistd.h>

  int main(void)
  {
        int fd[2];
        char buffer[1];

        pipe(fd);
        fork();
        read(fd[0], buffer, sizeof(buffer));
        write(fd[1], buffer, sizeof(buffer));
  }

where you have two readers that both wait for each other to write.

As far as the kernel is concerned, it's not a deadlock. It's just a
user space bug.

The exact same thing is true here. The user space was buggy, and set
it up so that both sides of two processes were just waiting for the
other side to do something that they never did.

And exactly like the reads, it's not a kernel bug.

Now, I do agree that from a QoI standpoint, it's annoying when
ptrace() just stops like that, particularly when you want to use
ptrace for debugging. So I'm not dismissing trying to improve on
interfaces, but I think you've confused things by calling this a
deadlock and thinking that it's a kernel bug.

The kernel never tries to figure out "Oh, stupid users are waiting for
each other". Sure, file locking has the special circular locking
detection, but that's literally a special case. The normal semantics
are that you give users rope. If users make a noose of the rope and
then trip on it, that's _their_ problem, not the kernel's.

            Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 20:04                                           ` Linus Torvalds
  2020-04-09 20:36                                             ` Bernd Edlinger
@ 2020-04-09 21:00                                             ` Eric W. Biederman
  2020-04-09 21:17                                               ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-09 21:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 9, 2020 at 12:57 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> The use case where this may happen with strace
>> when you call strace with lots of -p <pid> arguments,
>> and one of them is a bomb. strace stuck.
>
> Yeah, so from a convenience angle I do agree that it would be nicer to
> just not count dead threads.
>
> You can test that by just moving the
>
>                 /* Don't bother with already dead threads */
>                 if (t->exit_state)
>                         continue;
>
> test in zap_other_threads() to above the
>
>                 count++;
>
> line instead.

That looks like a legitimate race, and something worth addressing.  It
doesn't look like t->exit_state has siglock protection so I don't think
testing it under siglock would fix that race.  But something like that
certainly should.

But no.  While you are doing a good job at spotting odd corner
cases that need to be fixed.  This also is not the cause of the
deadlock.  It is nothing that subtle.

Eric


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 21:00                                             ` Eric W. Biederman
@ 2020-04-09 21:17                                               ` Linus Torvalds
  2020-04-09 23:52                                                 ` Bernd Edlinger
  2020-04-10  0:30                                                 ` Linus Torvalds
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-09 21:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 2:03 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> But no.  While you are doing a good job at spotting odd corner
> cases that need to be fixed.  This also is not the cause of the
> deadlock.  It is nothing that subtle.

So Eric, I'm now going to stop wasting my time on arguing with you.

Since both you and Bernd claimed to be too busy to even bother testing
that thing, I just built it and booted it.

And guess what? That thing makes your non-deadlock thing go away.

So it's _literally_ that simple.

Now, does it make the tests "pass"? No.

Because the "vmaccess" test fails because the open() now fails -
because we simply don't wait for that dead thread any more, so the
/proc/<pid>/mem thing doesn't exist.

And for the same reason that "attach" test now no longer returns
EAGAIN, it just attaches to the remaining execlp thing instead.

So I'm not just good at "spotting odd corner cases". I told you why
that bogus deadlock of yours failed - the execve was pointlessly
waiting for a dead thread that had marked itself ptraced, and nobody
was reaping it.

And it appears you were too lazy to even try it out.

Yes, that whole "notify_count" vs "tsk->exit_state" test is
fundamentally racy. But that race happens to be irrelevant for the
test case in question.

So until you can actually add something to the discussion, I'm done
with this thread.

           Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 21:17                                               ` Linus Torvalds
@ 2020-04-09 23:52                                                 ` Bernd Edlinger
  2020-04-10  0:30                                                 ` Linus Torvalds
  1 sibling, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-09 23:52 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov



On 4/9/20 11:17 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 2:03 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> But no.  While you are doing a good job at spotting odd corner
>> cases that need to be fixed.  This also is not the cause of the
>> deadlock.  It is nothing that subtle.
> 
> So Eric, I'm now going to stop wasting my time on arguing with you.
> 
> Since both you and Bernd claimed to be too busy to even bother testing
> that thing, I just built it and booted it.
> 
> And guess what? That thing makes your non-deadlock thing go away.
> 
> So it's _literally_ that simple.
> 

You know I was right from the beginning :-) :-) (-: (-:,
I said you would have to adjust the test.  I only thought of the
second part, so that is where I was wrong.

Yeah Thanks.  My real problem is called OpenSSL 3.0 + FIPS and it feels
like a very big pain in the ass......

But please tell nobody.  That is a secret :-)


Thanks
Bernd.

> Now, does it make the tests "pass"? No.
> 
> Because the "vmaccess" test fails because the open() now fails -
> because we simply don't wait for that dead thread any more, so the
> /proc/<pid>/mem thing doesn't exist.
> 
> And for the same reason that "attach" test now no longer returns
> EAGAIN, it just attaches to the remaining execlp thing instead.
> 
> So I'm not just good at "spotting odd corner cases". I told you why
> that bogus deadlock of yours failed - the execve was pointlessly
> waiting for a dead thread that had marked itself ptraced, and nobody
> was reaping it.
> 
> And it appears you were too lazy to even try it out.
> 
> Yes, that whole "notify_count" vs "tsk->exit_state" test is
> fundamentally racy. But that race happens to be irrelevant for the
> test case in question.
> 
> So until you can actually add something to the discussion, I'm done
> with this thread.
> 
>            Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-09 21:17                                               ` Linus Torvalds
  2020-04-09 23:52                                                 ` Bernd Edlinger
@ 2020-04-10  0:30                                                 ` Linus Torvalds
  2020-04-10  0:32                                                   ` Linus Torvalds
  2020-04-11 18:20                                                   ` Oleg Nesterov
  1 sibling, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-10  0:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

On Thu, Apr 9, 2020 at 2:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm not just good at "spotting odd corner cases". I told you why
> that bogus deadlock of yours failed - the execve was pointlessly
> waiting for a dead thread that had marked itself ptraced, and nobody
> was reaping it.

Side note: this does seem to be the only special case at least in this
particular area for execve().

The thing is, for actual _live_ threads that end up waiting for
ptracers, the SIGKILL that zap_other_threads() does should punch
through everything - it should not just break out anybody waiting for
the mutex, it should also break any "wait for tracer" ptrace waiting
conditions.

But already dead threads are special. Unlike SIGKILL'ed other threads,
they can stay around almost forever.

Because they're in Zombie state, and no amount of SIGKILL will do
anything to them (and zap_other_threads() didn't even try), and
they'll stay around until they are reaped by some entirely independent
party (which just happened to be the same one that was doing the
ptrace, but doesn't really have to be)

(Of course, who knows what other special cases there might be - I'm
not saying this is the _only_ special case, but in the particular area
of 'zap other threads and wait for them', already dead threads do seem
to be special).

So the fact that "zap_other_threads()" counts dead threads - but cannot do
anything about them - is fundamentally different, and that's why that
particular test-case has that odd behavior.

So I think we have basically three choices:

 (1) have execve() not wait for dead threads while holding the cred
mutex (that's effectively what that zap_other_threads() hacky patch does,
but it's not really correct because it can cause notify_count
underflows)

 (2) have zap_other_threads() force-reap Zombie threads

 (3) say that it's a user space bug, and if you're doing PTRACE_ATTACH
you need to make sure there are no dead threads of the target that
might be blocking an execve().

For an example of that (3) approach: making the test-case just do a
waitpid() before the PTRACE_ATTACH will unhang it, because it reaps
that thread that did the PTRACE_TRACEME.

So option (3) is basically saying "that test-case is buggy, exactly
like the two readers on a pipe".

But I continued to look at (1), but trying to deal with the fact that
"notify_count" will get decremented not just by live threads, but by
the ones that already died too.

Instead of trying to change how notify_count gets decremented, we
could do something like the attached patch: wait for it to go down to
zero, yes, but go back and re-check until you don't have to wait any
more. That should fix the underflow situation. The comment says it
all.

This patch is 100% untested. It probably compiles, that's all I'll
say. I'm not proud of it. And I may be missing some important thing
(and I'm not happy about the magical "am I not the
thread_group_leader?" special case etc).

It's worth noting that this code is the only one that cares about the
return value of zap_other_threads(), so changing the semantics to only
count non-dead threads is safe in that sense.

Whether it's safe to then share the signal structure over the rest of
execve() - even if it's only with dead threads - I didn't check.

I think Oleg might be the only person alive who understands all of our
process exit code.

Oleg? Comments?

              Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-10  0:30                                                 ` Linus Torvalds
@ 2020-04-10  0:32                                                   ` Linus Torvalds
  2020-04-11  4:07                                                     ` Bernd Edlinger
  2020-04-11 18:20                                                   ` Oleg Nesterov
  1 sibling, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-10  0:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bernd Edlinger, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

On Thu, Apr 9, 2020 at 5:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Instead of trying to change how notify_count gets decremented, we
> could do something like the attached patch: wait for it to go down to
> zero, yes, but go back and re-check until you don't have to wait any
> more. That should fix the underflow situation. The comment says it
> all.

The "attached" patch wasn't.

Blush.

Here it is. Still entirely and utterly untested.

           Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1583 bytes --]

 fs/exec.c       | 22 ++++++++++++++++++----
 kernel/signal.c |  2 +-
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..e847c0417e34 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1122,11 +1122,25 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
-	if (!thread_group_leader(tsk))
-		sig->notify_count--;
 
-	while (sig->notify_count) {
+	/*
+	 * Zap and wait for other threads to go away.
+	 *
+	 * Note that 'notify_count' is not stable, because
+	 * it also gets modified by zombie threads that
+	 * zap_other_threads() does not count, but we're
+	 * guaranteed to under-count, and at worst that will
+	 * cause us to wake up early and go through the
+	 * loop a few times.
+	 */
+	for (;;) {
+		sig->notify_count = zap_other_threads(tsk);
+		if (!thread_group_leader(tsk))
+			sig->notify_count--;
+		if (!sig->notify_count)
+			break;
+
+		/* sig->notify_count going down to zero will wake us up */
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
 		schedule();
diff --git a/kernel/signal.c b/kernel/signal.c
index e58a6c619824..98e5523f792c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1353,11 +1353,11 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
+		count++;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
 	}


* [GIT PULL] proc fix for 5.7-rc1
       [not found] <87blobnq02.fsf@x220.int.ebiederm.org>
  2020-04-02 19:04 ` [GIT PULL] Please pull proc and exec work for 5.7-rc1 Linus Torvalds
@ 2020-04-10 13:03 ` Eric W. Biederman
  2020-04-10 20:40   ` pr-tracker-bot
  1 sibling, 1 reply; 127+ messages in thread
From: Eric W. Biederman @ 2020-04-10 13:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Alexey Gladkov, Oleg Nesterov, Christian Brauner


Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: 63f818f46af9f8b3f17b9695501e8d08959feb60 proc: Use a dedicated lock in struct pid

A brown paper bag bug slipped through my proc changes, and syzkaller caught
it when the code ended up in your tree.  I have opted to fix it the
simplest, cleanest way I know how, so there is no reasonable chance
for the bug to repeat.

Eric

From 63f818f46af9f8b3f17b9695501e8d08959feb60 Mon Sep 17 00:00:00 2001
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 7 Apr 2020 09:43:04 -0500
Subject: [PATCH] proc: Use a dedicated lock in struct pid

syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pid->wait_pidfd);
>                                local_irq_disable();
>                                lock(tasklist_lock);
>                                lock(&pid->wait_pidfd);
>   <Interrupt>
>     lock(tasklist_lock);
>
>  *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:

The problem is that wait_pidfd.lock is taken under the tasklist lock, so
it must always be taken with irqs disabled: tasklist_lock can be taken
from interrupt context, and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
its own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.

It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid->lock with irqs disabled.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/base.c      | 10 +++++-----
 include/linux/pid.h |  1 +
 kernel/pid.c        |  1 +
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 74f948a6b621..6042b646ab27 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1839,9 +1839,9 @@ void proc_pid_evict_inode(struct proc_inode *ei)
 	struct pid *pid = ei->pid;
 
 	if (S_ISDIR(ei->vfs_inode.i_mode)) {
-		spin_lock(&pid->wait_pidfd.lock);
+		spin_lock(&pid->lock);
 		hlist_del_init_rcu(&ei->sibling_inodes);
-		spin_unlock(&pid->wait_pidfd.lock);
+		spin_unlock(&pid->lock);
 	}
 
 	put_pid(pid);
@@ -1877,9 +1877,9 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
 	/* Let the pid remember us for quick removal */
 	ei->pid = pid;
 	if (S_ISDIR(mode)) {
-		spin_lock(&pid->wait_pidfd.lock);
+		spin_lock(&pid->lock);
 		hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
-		spin_unlock(&pid->wait_pidfd.lock);
+		spin_unlock(&pid->lock);
 	}
 
 	task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
@@ -3273,7 +3273,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
 
 void proc_flush_pid(struct pid *pid)
 {
-	proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock);
+	proc_invalidate_siblings_dcache(&pid->inodes, &pid->lock);
 	put_pid(pid);
 }
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 01a0d4e28506..cc896f0fc4e3 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -60,6 +60,7 @@ struct pid
 {
 	refcount_t count;
 	unsigned int level;
+	spinlock_t lock;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/kernel/pid.c b/kernel/pid.c
index efd34874b3d1..517d0855d4cf 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -246,6 +246,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	get_pid_ns(ns);
 	refcount_set(&pid->count, 1);
+	spin_lock_init(&pid->lock);
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
 
-- 
2.20.1



* Re: [GIT PULL] proc fix for 5.7-rc1
  2020-04-10 13:03 ` [GIT PULL] proc fix " Eric W. Biederman
@ 2020-04-10 20:40   ` pr-tracker-bot
  0 siblings, 0 replies; 127+ messages in thread
From: pr-tracker-bot @ 2020-04-10 20:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, linux-kernel, Alexey Gladkov, Oleg Nesterov,
	Christian Brauner

The pull request you sent on Fri, 10 Apr 2020 08:03:04 -0500:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/87ad46e601340394cd75c1c79b19ca906f82c543

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-10  0:32                                                   ` Linus Torvalds
@ 2020-04-11  4:07                                                     ` Bernd Edlinger
  0 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-11  4:07 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov, Oleg Nesterov

On 4/10/20 2:32 AM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 5:30 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Instead of trying to change how notify_count gets decremented, we
>> could do something like the attached patch: wait for it to go down to
>> zero, yes, but go back and re-check until you don't have to wait any
>> more. That should fix the underflow situation. The comment says it
>> all.
> 
> The "attached" patch wasn't.
> 
> Blush.
> 
> Here it is. Still entirely and utterly untested.
> 

Okay, if this works, please do not only make sure that our own test case
works, but also that the strace-5.5 test suite does not regress.

So currently at least one of the test cases was failing
before my totally crazy patch.
After my patch exactly the same test was failing.

So please make sure you don't break their tests.


Thanks
Bernd.

>            Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-10  0:30                                                 ` Linus Torvalds
  2020-04-10  0:32                                                   ` Linus Torvalds
@ 2020-04-11 18:20                                                   ` Oleg Nesterov
  2020-04-11 18:29                                                     ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Oleg Nesterov @ 2020-04-11 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Bernd Edlinger, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

Eric, Linus, et al,

for various reasons I have not been reading email for the last few weeks,
I'll try to read this thread tomorrow, currently I am a bit lost.

On 04/09, Linus Torvalds wrote:
>
>  (1) have execve() not wait for dead threads while holding the cred
> mutex

This is what I tried to do 3 years ago, see

	[PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
	https://lore.kernel.org/lkml/20170213141516.GA30233@redhat.com/

yes, yes, yes, the patch is not pretty.

From your other email:

>	/* if the parent is going through a execve(), it's not listening */
>	if (parent->signal->group_exit_task)
>		return false;

Heh ;) see

	[PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec
	https://lore.kernel.org/lkml/20170213141519.GA30239@redhat.com/

from the same thread.

But this change is more problematic. 

Oleg.



* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 18:20                                                   ` Oleg Nesterov
@ 2020-04-11 18:29                                                     ` Linus Torvalds
  2020-04-11 18:31                                                       ` Linus Torvalds
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-11 18:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Bernd Edlinger, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On Sat, Apr 11, 2020 at 11:21 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 04/09, Linus Torvalds wrote:
> >
> >  (1) have execve() not wait for dead threads while holding the cred
> > mutex
>
> This is what I tried to do 3 years ago, see

Well, you did it differently - by moving the "wait for dead threads"
logic to after releasing the lock.

My simpler patch was lazier - just don't wait for dead threads at all,
since they are dead and not interesting.

Because even if it's Easter weekend, those threads are not coming back
to life ;)

You do say in that old patch that we can't just share the signal
state, but I wonder how true that is. Sharing it with a TASK_ZOMBIE
doesn't seem all that problematic to me. The only thing that can do is
getting reaped by a later wait.

That said, I actually am starting to think that maybe execve() should
just try to reap those threads instead, and avoid the whole issue that
way. Basically my "option (2)" thing.

Sure, that's basically stealing them from the parent, but 'execve()'
really is special wrt threads, and the parent still has the execve()
thread itself. And it's not so different from SIGKILL, which also
forcibly breaks off any ptracer etc without anybody being able to say
anything about it.

                Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 18:29                                                     ` Linus Torvalds
@ 2020-04-11 18:31                                                       ` Linus Torvalds
  2020-04-11 19:15                                                       ` Bernd Edlinger
  2020-04-12 19:50                                                       ` Oleg Nesterov
  2 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-11 18:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Bernd Edlinger, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On Sat, Apr 11, 2020 at 11:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Well, you did it differently - by moving the "wait for dead threads"
> logic to after releasing the lock.

Not that I mind that approach either - the less work we do inside that
lock, the better off we are..

           Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 18:29                                                     ` Linus Torvalds
  2020-04-11 18:31                                                       ` Linus Torvalds
@ 2020-04-11 19:15                                                       ` Bernd Edlinger
  2020-04-11 20:07                                                         ` Linus Torvalds
  2020-04-12 19:50                                                       ` Oleg Nesterov
  2 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-11 19:15 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov



On 4/11/20 8:29 PM, Linus Torvalds wrote:
> On Sat, Apr 11, 2020 at 11:21 AM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> On 04/09, Linus Torvalds wrote:
>>>
>>>  (1) have execve() not wait for dead threads while holding the cred
>>> mutex
>>
>> This is what I tried to do 3 years ago, see
> 
> Well, you did it differently - by moving the "wait for dead threads"
> logic to after releasing the lock.
> 
> My simpler patch was lazier - just don't wait for dead threads at all,
> since they are dead and not interesting.

But won't the dead thread's lifetime overlap the new thread's lifetime
from the tracer's POV?


Bernd.

> 
> Because even if it's Easter weekend, those threads are not coming back
> to life ;)
> 
> You do say in that old patch that we can't just share the signal
> state, but I wonder how true that is. Sharing it with a TASK_ZOMBIE
> doesn't seem all that problematic to me. The only thing that can do is
> getting reaped by a later wait.
> 
> That said, I actually am starting to think that maybe execve() should
> just try to reap those threads instead, and avoid the whole issue that
> way. Basically my "option (2)" thing.
> 
> Sure, that's basically stealing them from the parent, but 'execve()'
> really is special wrt threads, and the parent still has the execve()
> thread itself. And it's not so different from SIGKILL, which also
> forcibly breaks off any ptracer etc without anybody being able to say
> anything about it.
> 
>                 Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 19:15                                                       ` Bernd Edlinger
@ 2020-04-11 20:07                                                         ` Linus Torvalds
  2020-04-11 21:16                                                           ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-11 20:07 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On Sat, Apr 11, 2020 at 12:15 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> But won't the dead thread's lifetime overlap the new thread's lifetime
> from the tracer's POV?

What new thread?

execve() doesn't create any new thread.

But yes, an external tracer could see the (old) thread that did
execve() do new system calls before it sees the (other old) thread
that was a zombie.

But that is already something that can happen, simply because the
events aren't ordered. The whole issue is that the zombie thread
already died, but the tracer just didn't bother to read that state
change.

So it's not that the dead thread somehow _dies_ after the execve(). It
already died.

It's just that whoever is to reap it (or traces it) just hasn't cared
to read the status of that thing yet.

             Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 20:07                                                         ` Linus Torvalds
@ 2020-04-11 21:16                                                           ` Bernd Edlinger
       [not found]                                                             ` <CAHk-=wgWHkBzFazWJj57emHPd3Dg9SZHaZqoO7-AD+UbBTJgig@mail.gmail.com>
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-11 21:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On 4/11/20 10:07 PM, Linus Torvalds wrote:
> On Sat, Apr 11, 2020 at 12:15 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> But won't the dead thread's lifetime overlap the new thread's lifetime
>> from the tracer's POV?
> 
> What new thread?
> 
> execve() doesn't create any new thread.
> 
> But yes, an external tracer could see the (old) thread that did
> execve() do new system calls before it sees the (other old) thread
> that was a zombie.
> 

That is an API change.  Previously strace could rely on there being
a callback at the end of the execve, and on all previous threads
having been de-zombified and waited for.

Then there is an execve done event.

And then the old thread continues to run but executing the new program.

I'd bet the strace test suite has tests for that order of events,
or at least it should.


> But that is already something that can happen, simply because the
> events aren't ordered. The whole issue is that the zombie thread
> already died, but the tracer just didn't bother to read that state
> change.

What causes the deadlock is that de_thread waits for the tracer to
wait on the threads.  If that changes, something will break in
user space.  Of course you could say, I did not say "Simon says".


Bernd.

> 
> So it's not that the dead thread somehow _dies_ after the execve(). It
> already died.
> 
> It's just that whoever is to reap it (or traces it) just hasn't cared
> to read the status of that thing yet.
> 
>              Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
       [not found]                                                             ` <CAHk-=wgWHkBzFazWJj57emHPd3Dg9SZHaZqoO7-AD+UbBTJgig@mail.gmail.com>
@ 2020-04-11 21:57                                                               ` Linus Torvalds
  2020-04-12  6:01                                                                 ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-11 21:57 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On Sat, Apr 11, 2020 at 2:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> You're confused.
>
> There *was* that callback already. It happened when the thread exited and became a zombie. No ordering has changed.

Also, stop calling this a "deadlock" like you do, when you're talking
about the kernel behavior.

That only - again - shows that you're confused.

The kernel at no point was deadlocked.

The kernel had two events - the exit of one thread, and the execve()
of another - that it wanted to return to the tracer.

The tracer isn't listening to the first one, and the kernel is waiting
for the tracer to do so. The tracer wants to get the second one, but
it's not happening until the tracer has dealt with the first one.

That's a deadlock FOR THE TRACER.

But it's not a deadlock for the kernel. See the difference?

It literally is absolutely not at all different from a user space
application that deadlocks on a full pipe buffer, because it's not
emptying it - and it's not emptying it because it's trying to just
write more data to it.

See? It's a user space bug. It really is that simple.

Here, I'll give you this program just so that you can run it and see
for yourself how it hangs:

    #include <unistd.h>

    char buffer[1024*1024];

    int main(void)
    {
        int fd[2];
        pipe(fd);
        write(fd[1], buffer, sizeof(buffer));
        read(fd[0], buffer, sizeof(buffer));
    }

and it's *exactly* the same thing. The kernel buffer for the pipe
fills up, so the write() ends up hanging. There's a read() there to
empty the buffer, but that program will never get to it, because the
write hangs.

Is the above a kernel deadlock? No.

And anybody who calls it a "deadlock" when talking to kernel people is
confused and should be ignored.

So stop calling the ptrace() issue a deadlock when talking to kernel
people - it just annoys and confuses them. I literally thought
something completely different was going on initially because you and
Eric called it a deadlock: I thought that one thread was tracing
another, and that the SIGKILL didn't resolve it.

But no. It's not a kernel deadlock at all, it's something much
"simpler". It is the same exact thing as the stupid buggy example
program above, except instead of using "write()" and waiting for
something that will never happen because the write needs another thing
to happen first, the buggy ptrace test program is using 'ptrace()' to
wait for something that will never happen.

Now, admittedly there is _one_ big difference: ptrace() is a garbage
interface. Nobody disputes that.

ptrace() is absolutely horrible for so many reasons. One reason it's
horrible is that it doesn't have a nonblocking mode with poll(), like
the write() above would have, and that people who use read and write
can do. You can fix the stupid deadlock with the above read/write loop
by switching over to nonblocking IO and a poll loop - doing the same
with ptrace is more difficult, because ptrace() just isn't that kind
of interface.
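
To make the nonblocking variant concrete, here is a minimal sketch of
the same single-process pipe transfer done with O_NONBLOCK and a poll()
loop. The function name pump_through_pipe() and the two static buffers
are made up for illustration; the point is only that the process never
blocks in write() while there is data it could be reading:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static char src[1024 * 1024];
static char dst[1024 * 1024];

/*
 * Hypothetical helper: push 'total' bytes through a pipe from within a
 * single process without hanging. Returns the number of bytes read back,
 * or -1 if the pipe could not be set up.
 */
long pump_through_pipe(size_t total)
{
    int fd[2];
    size_t written = 0, consumed = 0;

    assert(total <= sizeof(src));

    if (pipe(fd) < 0)
        return -1;
    if (fcntl(fd[0], F_SETFL, O_NONBLOCK) < 0 ||
        fcntl(fd[1], F_SETFL, O_NONBLOCK) < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    while (consumed < total) {
        /* A negative fd makes poll() ignore that entry. */
        struct pollfd p[2] = {
            { .fd = fd[0], .events = POLLIN },
            { .fd = written < total ? fd[1] : -1, .events = POLLOUT },
        };

        if (poll(p, 2, -1) < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        if (p[1].revents & POLLOUT) {
            /* Nonblocking write: takes what fits, never hangs. */
            ssize_t n = write(fd[1], src + written, total - written);
            if (n > 0)
                written += (size_t)n;
            else if (n < 0 && errno != EAGAIN && errno != EINTR)
                break;
        }
        if (p[0].revents & POLLIN) {
            /* Drain the pipe so the writer side can make progress. */
            ssize_t n = read(fd[0], dst + consumed, total - consumed);
            if (n > 0)
                consumed += (size_t)n;
            else if (n < 0 && errno != EAGAIN && errno != EINTR)
                break;
        }
    }

    close(fd[0]);
    close(fd[1]);
    return (long)consumed;
}
```

Calling pump_through_pipe(sizeof(src)) from main() moves the same 1MB
that the blocking program above gets stuck on.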

And with read/write, you could also (and this is even more common)
just use multiple processes (or threads), so that one process does
reading, and another does writing. Again, ptrace() is a shitty
interface and doesn't really allow for that very well, although maybe
it could be made to work that way too.
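
As a sketch of that two-process variant for plain read/write (this says
nothing about how ptrace would do it): a forked child does all the
writing and the parent does all the reading, so neither side waits on
itself. The name pipe_with_writer_child() is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char buffer[1024 * 1024];

/*
 * Hypothetical helper: a child process writes 'total' bytes into a pipe
 * while the parent reads them. Returns the number of bytes the parent
 * read, or -1 if pipe() or fork() failed.
 */
long pipe_with_writer_child(size_t total)
{
    int fd[2];
    pid_t pid;
    size_t got = 0;

    assert(total <= sizeof(buffer));

    if (pipe(fd) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the writer. Blocking here is fine, the parent drains. */
        size_t off = 0;

        close(fd[0]);
        while (off < total) {
            ssize_t n = write(fd[1], buffer + off, total - off);
            if (n <= 0)
                _exit(1);
            off += (size_t)n;
        }
        _exit(0);
    }

    /* Parent: the reader. Close the write end so read() can see EOF. */
    close(fd[1]);
    for (;;) {
        ssize_t n = read(fd[0], buffer, sizeof(buffer));
        if (n <= 0)
            break;
        got += (size_t)n;
    }
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return (long)got;
}
```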

But even with ptrace, you could have made the tracing program set a
signal handler for SIGCHLD *before* it started doing forks and stuff,
and then exit of the first thread should have caused a signal, and you
could have reaped it there, and it should work with ptrace.

But that garbage ptrace() test program didn't do even that. So it ends
up hanging, waiting for an event that never happens, because it didn't
do that other thing it needed to do - *EXACTLY* like the "write()"
ends up hanging, waiting for that read() that will never happen,
because the garbage test-program did things wrong.

So as long as you guys keep talking about "deadlocks", the _only_
thing you are showing is that you don't understand the problem.

It's not a deadlock - it's a buggy test.

Now, what I've suggested are a couple of ways to make ptrace()
friendlier to use, and thus allow that stupid test to work the way you
seem to want it to work.

Basically, execve() doesn't have a lot of reasons to really wait for
the threads it waits for, and the only real thing it needs to do is to
make sure they are killed. But maybe it can just ignore them once they
are dead.

Or alternatively, it can just say "if you didn't reap the zombie
threads, I'll reap them for you".

Or, as mentioned, we can do nothing, and just let buggy user space
tracing programs hang. It's not the _kernel's_ problem if you write
buggy code in user space, this is not anything we've ever helped with
before, so it's not like it's a regression.

(Although it can be a regression for your buggy program if you used to
- for example - do threading entirely using some user-space library,
and didn't have threads that the kernel knew about, and so when your
program did 'execve()' it just got rid of those threads
automatically).

                 Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 21:57                                                               ` Linus Torvalds
@ 2020-04-12  6:01                                                                 ` Bernd Edlinger
  0 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-12  6:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

Linus,

On 4/11/20 11:57 PM, Linus Torvalds wrote:
> On Sat, Apr 11, 2020 at 2:21 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> You're confused.
>>
>> There *was* that callback already. It happened when the thread exited and became a zombie. No ordering has changed.
> 
> Also, stop calling this a "deadlock" like you do, when you're talking
> about the kernel behavior.
> 

Agreed, we must first define the terms.  Let's call this a pseudo-deadlock.

I always talk about the tracer, so it is a deadlock in the tracer, yes.
And it is also a deadlock in the process trying to execve.  The process
that execve's is doing something silly: it is not shutting down
the threads that are maybe doing dangerous things, while one of them
does execve.  In my programs I never do that, because it is always
a race condition and dangerous not to shut down threads before exiting
or execve.  But the tracer wants to debug this silly program, just to
understand why it is occasionally malfunctioning.  Then a deadlock in
both processes happens at the same time, repeatably.  Now a user called
Bernd is scratching his head why that happens.....

> That only - again - shows that you're confused.
> 

Sorry, I was writing another more important mail :), and interrupted
that quickly, to help you with one aspect of the problem that may
have been overlooked, so it is quite possible that I missed something.

So the intention was just to make sure you make your decisions based
on some facts that I realized because I have debugged the tracer,
and found out that the kernel is expecting the tracer to have knowledge
of future events, and that is impossible for the tracer, so the tracer
must fail when it predicts the future wrong.  Again that is just my
impression I had when debugging the tracer.

The tracer looks like this
do {
   wait();
   if (needed) ptrace_access();
   if (needed) ptrace_otherstuff();
} while (1);

when the thread death event is in the tracer's input queue,
the tracer does not know of it unless it looks,
but from time to time the tracer wants to do other things
like attach a process that may be doing random things.
But if I think about it, the parent changes in that moment.
The original parent did not listen and make the zombie
thread go away, but now the tracer is the parent, and
it is unable to call wait since it is just doing
ptrace_access.  It is possible that I have missed something,
then it would be good if others with more experience help
me put the puzzle together.

> The kernel at no point was deadlocked.
> 
> The kernel had two events - the exit of one thread, and the execve()
> of another - that it wanted to return to the tracer.
> 
> The tracer isn't listening to the first one, and the kernel is waiting
> for the tracer to do so. The tracer wants to get the second one, but
> it's not happening until the tracer has dealt with the first one.
> 
> That's a deadlock FOR THE TRACER.
> 

I am just saying that changing the order of the events is not what the
tracer wants.  The tracer wants to get out of the ptrace_access and
listen to the events, in the correct order.  If the events happen
in a different order than before the tracer may be confused.
That is the point I am trying to make.  And yes I may be wrong.
So please correct me, if that is the case.

> But it's not a deadlock for the kernel. See the difference?
> 
> It literally is absolutely not at all different from a user space
> application that deadlocks on a full pipe buffer, because it's not
> emptying it - and it's not emptying it because it's trying to just
> write more data to it.
> 
> See? It's a user space bug. It really is that simple.
> 
> Here, I'll give you this program just so that you can run it and see
> for yourself how it hangs:
> 
>     #include <unistd.h>
> 
>     char buffer[1024*1024];
> 
>     int main(void)
>     {
>         int fd[2];
>         pipe(fd);
>         write(fd[1], buffer, sizeof(buffer));
>         read(fd[0], buffer, sizeof(buffer));
>     }
> 
> and it's *exactly* the same thing. The kernel buffer for the pipe
> fills up, so the write() ends up hanging. There's a read() there to
> empty the buffer, but that program will never get to it, because the
> write hangs.
> 

Yes, and no.  To solve this problem we have non-blocking sockets
and pipes.  When I write a server I never use any blocking APIs
because I know a write can block at any time, when the server and
the client use a blocking strategy.  So also when I write a client
it is always using non-blocking sockets.  But not all clients are
written this way, therefore the server must be forgiving.

> Is the above a kernel deadlock? No.
> 
> And anybody who calls it a "deadlock" when talking to kernel people is
> confused and should be ignored.
> 

It is not a good idea to ignore people who are barely aware of the right
terms.  It is better to try to understand what they really mean and
translate what they said to what they meant, then ask if you understood
correctly what was meant.  At least that is how I would like a discussion
like that to go.  It is a suggestion and not meant as an offense.

> So stop calling the ptrace() issue a deadlock when talking to kernel
> people - it just annoys and confuses them. I literally thought
> something completely different was going on initially because you and
> Eric called it a deadlock: I thought that one thread was tracing
> another, and that the SIGKILL didn't resolve it.
> 

Once again, I am new to this lkml list, and it is not meant as an offense.

Yes, SIGKILL does resolve this.
Either the tracer, or one of the tracees, but which one?

But SIGKILL is impolite, I prefer SIGTERM. And that does not resolve
the pseudo-deadlock.

> But no. It's not a kernel deadlock at all, it's something much
> "simpler". It is the same exact thing as the stupid buggy example
> program above, except instead of using "write()" and waiting for
> something that will never happen because the write needs another thing
> to happen first, the buggy ptrace test program is using 'ptrace()' to
> wait for something that will never happen.
> 
> Now, admittedly there is _one_ big difference: ptrace() is a garbage
> interface. Nobody disputes that.
> 
> ptrace() is absolutely horrible for so many reasons. One reason it's
> horrible is that it doesn't have a nonblocking mode with poll(), like
> the write() above would have, and that people who use read and write
> can do. You can fix the stupid deadlock with the above read/write loop
> by switching over to nonblocking IO and a poll loop - doing the same
> with ptrace is more difficult, because ptrace() just isn't that kind
> of interface.
> 

What my other patch (you know the one you hated ;-) ) does, is making
ptrace non-blocking when it must, by returning -EAGAIN.
But keep the order of events as they are.
I think your solution will change the order of events in order
to make the kernel more simple.  Right?

> And with read/write, you could also (and this is even more common)
> just use multiple processes (or threads), so that one process does
> reading, and another does writing. Again, ptrace() is a shitty
> interface and doesn't really allow for that very well, although maybe
> it could be made to work that way too.
> 
> But even with ptrace, you could have made the tracing program set a
> signal handler for SIGCHLD *before* it started doing forks and stuff,
> and then exit of the first thread should have caused a signal, and you
> could have reaped it there, and it should work with ptrace.
> 
> But that garbage ptrace() test program didn't do even that. So it ends
> up hanging, waiting for an event that never happens, because it didn't
> do that other thing it needed to do - *EXACTLY* like the "write()"
> ends up hanging, waiting for that read() that will never happen,
> because the garbage test-program did things wrong.
> 

The test case does that on purpose to demonstrate something that
may also happen in strace, but in a very unlikely case, but the
likelihood is not zero and when it happens users are surprised.

> So as long as you guys keep talking about "deadlocks", the _only_
> thing you are showing is that you don't understand the problem.
> 
> It's not a deadlock - it's a buggy test.
> 

The test case only makes sense together with the last part of
the patch.  If you prefer another solution then you must change
the test case as well.  That is allowed.
Changing strace is not allowed.  And breaking something in the
strace-5.5 test suite is also not allowed.

> Now, what I've suggested are a couple of ways to make ptrace()
> friendlier to use, and thus allow that stupid test to work the way you
> seem to want it to work.

No, do not make my test case happy.  That is not what it was written
for.

> 
> Basically, execve() doesn't have a lot of reasons to really wait for
> the threads it waits for, and the only real thing it needs to do is to
> make sure they are killed. But maybe it can just ignore them once they
> are dead.
> 
> Or alternatively, it can just say "if you didn't reap the zombie
> threads, I'll reap them for you".
> 

Isn't that an API change?

I think there is no way we can avoid an API change, but we have the
choice which API change we want.

> Or, as mentioned, we can do nothing, and just let buggy user space
> tracing programs hang. It's not the _kernel's_ problem if you write
> buggy code in user space, this is not anything we've ever helped with
> before, so it's not like it's a regression.
> 

That is also a possible, allowed solution.
Just leave the test case as is, or change it to KFAIL, i.e. known FAIL.

The remaining problem is certainly not important enough to make you
unhappy.


Bernd.

> (Although it can be a regression for your buggy program if you used to
> - for example - do threading entirely using some user-space library,
> and didn't have threads that the kernel knew about, and so when your
> program did 'execve()' it just got rid of those threads
> automatically).
> 
>                  Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-11 18:29                                                     ` Linus Torvalds
  2020-04-11 18:31                                                       ` Linus Torvalds
  2020-04-11 19:15                                                       ` Bernd Edlinger
@ 2020-04-12 19:50                                                       ` Oleg Nesterov
  2020-04-12 20:14                                                         ` Linus Torvalds
  2 siblings, 1 reply; 127+ messages in thread
From: Oleg Nesterov @ 2020-04-12 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Bernd Edlinger, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On 04/11, Linus Torvalds wrote:
>
> On Sat, Apr 11, 2020 at 11:21 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 04/09, Linus Torvalds wrote:
> > >
> > >  (1) have execve() not wait for dead threads while holding the cred
> > > mutex
> >
> > This is what I tried to do 3 years ago, see
>
> Well, you did it differently - by moving the "wait for dead threads"
> logic to after releasing the lock.

Yes, please see below.

> My simpler patch was lazier

To be honest, I don't understand it... OK, suppose that the main thread
M execs and zap_other_threads() finds a single (and alive) sub-thread T,
sig->notify_count = 1.

If T is traced, then ->notify_count won't be decremented until the tracer
reaps this task, so we have the same problem.

This is fixable, say, we can uglify exit_notify() like my patch does,
but:

> - just don't wait for dead threads at all,
> since they are dead and not interesting.

Well, I am not sure. Just for example, seccomp(SECCOMP_FILTER_FLAG_TSYNC)
can fail after mt-exec because seccomp_can_sync_threads() finds a zombie
thread. Sure, this too can be fixed, but I think there should be no
other threads after exec.

And:

> You do say in that old patch that we can't just share the signal
> state, but I wonder how true that is.

We can share sighand_struct with TASK_ZOMBIE's. The problem is that
we can not unshare ->sighand until they go away, execing thread and
zombies must use the same sighand->siglock to serialize the access to
->thread_head/etc.

OK, we probably can if we complicate unshare_sighand(), we will need
to take tasklist_lock/oldsighand->siglock unconditionally to check
oldsighand->count > sig->nr_thread, then do

	for_each_thread(current, t) {
		t->sighand = newsighand;
		__cleanup_sighand(oldsighand);
	}

but see above, I don't think this makes any sense.

Oleg



* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-12 19:50                                                       ` Oleg Nesterov
@ 2020-04-12 20:14                                                         ` Linus Torvalds
  2020-04-28  2:56                                                           ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-12 20:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Bernd Edlinger, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On Sun, Apr 12, 2020 at 12:51 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> To be honest, I don't understand it... OK, suppose that the main thread
> M execs and zap_other_threads() finds a single (and alive) sub-thread T,
> sig->notify_count = 1.
>
> If T is traced, then ->notify_count won't be decremented until the tracer
> reaps this task, so we have the same problem.

Right you are.

I was hoping to avoid the "move notify_count update", but you're
right, the threads that do get properly killed will never get to that
point, so now the live ones that we're waiting for will just hit the
same issue that the dead ones did.

Good catch. So the optimistic simplification doesn't work.

> > You do say in that old patch that we can't just share the signal
> > state, but I wonder how true that is.
>
> We can share sighand_struct with TASK_ZOMBIE's. The problem is that
> we can not unshare ->sighand until they go away, execing thread and
> zombies must use the same sighand->siglock to serialize the access to
> ->thread_head/etc.

Yeah, they'll still touch the lock, and maybe look at it, but it's not
like they'll be changing any state.

> but see above, I don't think this makes any sense.

Yeah, I think your patch is better since my simplification doesn't work.

             Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-12 20:14                                                         ` Linus Torvalds
@ 2020-04-28  2:56                                                           ` Bernd Edlinger
  2020-04-28 17:07                                                             ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-28  2:56 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Eric W. Biederman, Waiman Long, Ingo Molnar, Will Deacon,
	Linux Kernel Mailing List, Alexey Gladkov

On 4/12/20 10:14 PM, Linus Torvalds wrote:
> On Sun, Apr 12, 2020 at 12:51 PM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> To be honest, I don't understand it... OK, suppose that the main thread
>> M execs and zap_other_threads() finds a single (and alive) sub-thread T,
>> sig->notify_count = 1.
>>
>> If T is traced, then ->notify_count won't be decremented until the tracer
>> reaps this task, so we have the same problem.
> 
> Right you are.
> 
> I was hoping to avoid the "move notify_count update", but you're
> right, the threads that do get properly killed will never get to that
> point, so now the live ones that we're waiting for will just hit the
> same issue that the dead ones did.
> 
> Good catch. So the optimistic simplification doesn't work.
> 
>>> You do say in that old patch that we can't just share the signal
>>> state, but I wonder how true that is.
>>
>> We can share sighand_struct with TASK_ZOMBIE's. The problem is that
>> we can not unshare ->sighand until they go away, execing thread and
>> zombies must use the same sighand->siglock to serialize the access to
>> ->thread_head/etc.
> 
> Yeah, they'll still touch the lock, and maybe look at it, but it's not
> like they'll be changing any state.
> 
>> but see above, I don't think this makes any sense.
> 
> Yeah, I think your patch is better since my simplification doesn't work.
> 

Ping...
was this resolved meanwhile?


Thanks
Bernd.

>              Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28  2:56                                                           ` Bernd Edlinger
@ 2020-04-28 17:07                                                             ` Linus Torvalds
  2020-04-28 19:08                                                               ` Oleg Nesterov
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-28 17:07 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On Mon, Apr 27, 2020 at 7:56 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> was this resolved meanwhile?

No. I think the tentative plan is to just apply Oleg's "don't wait for
zombie threads with cred_guard_mutex held" patch, hopefully with that
de_thread() moved into install_exec_creds() (right after the dropping
of the locks).

But since it's arguably a user-level bug, and not a regression by any
means, it's not been exactly urgent. I suspect I'd like Oleg to
perhaps decide to take the patch up again.

              Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 17:07                                                             ` Linus Torvalds
@ 2020-04-28 19:08                                                               ` Oleg Nesterov
  2020-04-28 20:35                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Oleg Nesterov @ 2020-04-28 19:08 UTC (permalink / raw)
  To: Linus Torvalds, Jann Horn
  Cc: Bernd Edlinger, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On 04/28, Linus Torvalds wrote:
>
> On Mon, Apr 27, 2020 at 7:56 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
> >
> > was this resolved meanwhile?
>
> No. I think the tentative plan is to just apply Oleg's "don't wait for
> zombie threads with cred_guard_mutex held" patch, hopefully with that
> de_thread() moved into install_exec_creds() (right after the dropping
> of the locks).

Oops. I can update that old patch but somehow I thought there is a better
plan which I don't yet understand...

And, IIRC, Jann had some ideas how to rework the new creds calculation in
execve paths to avoid the cred_guard_mutex deadlock?

Oleg.



* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 19:08                                                               ` Oleg Nesterov
@ 2020-04-28 20:35                                                                 ` Linus Torvalds
  2020-04-28 21:06                                                                   ` Jann Horn
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-28 20:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

On Tue, Apr 28, 2020 at 12:08 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Oops. I can update that old patch but somehow I thought there is a better
> plan which I don't yet understand...

I don't think any plan survived reality.

Unless we want to do something *really* hacky.. The attached patch is
not meant to be serious.

> And, IIRC, Jann had some ideas how to rework the new creds calculation in
> execve paths to avoid the cred_guard_mutex deadlock?

I'm not sure how you'd do that.

Execve() fundamentally needs to serialize with PTRACE_ATTACH somehow,
since the whole point is that "tsk->ptrace" changes how the
credentials are interpreted.

So PTRACE_ATTACH doesn't really _change_ the credentials, but it very
much changes what execve() will do with them.

But I guess we could do a "if somebody attached to us while we did the
execve(), just repeat the whole thing"

Jann, what was your clever idea? Maybe it got lost in the long thread..

               Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1319 bytes --]

 kernel/ptrace.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179508d6..ebbc9876914b 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,7 @@
 #include <linux/cn_proc.h>
 #include <linux/compat.h>
 #include <linux/sched/signal.h>
+#include <linux/delay.h>
 
 #include <asm/syscall.h>	/* for syscall_get_* */
 
@@ -390,10 +391,24 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 * Protect exec's credential calculations against our interference;
 	 * SUID, SGID and LSM creds get determined differently
 	 * under ptrace.
+	 *
+	 * Don't wait forever on the credential lock if the target is
+	 * going through an execve.
+	 *
+	 * Whatever. We don't have "mutex_lock_interruptible_timeout()".
+	 * But this would be a disgusting hack even with it.
 	 */
-	retval = -ERESTARTNOINTR;
-	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
-		goto out;
+	for (;;) {
+		if (mutex_trylock(&task->signal->cred_guard_mutex))
+			break;
+		retval = -ERESTARTNOINTR;
+		if (signal_pending(current))
+			goto out;
+		retval = -EAGAIN;
+		if (task->in_execve)
+			goto out;
+		msleep_interruptible(100);
+	}
 
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 20:35                                                                 ` Linus Torvalds
@ 2020-04-28 21:06                                                                   ` Jann Horn
  2020-04-28 21:36                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-28 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Tue, Apr 28, 2020 at 10:36 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2020 at 12:08 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Oops. I can update that old patch but somehow I thought there is a better
> > plan which I don't yet understand...
>
> I don't think any plan survived reality.
>
> Unless we want to do something *really* hacky.. The attached patch is
> not meant to be serious.
>
> > And, IIRC, Jann had some ideas how to rework the new creds calculation in
> > execve paths to avoid the cred_guard_mutex deadlock?
>
> I'm not sure how you'd do that.
>
> Execve() fundamentally needs to serialize with PTRACE_ATTACH somehow,
> since the whole point is that "tsk->ptrace" changes how the
> credentials are interpreted.
>
> So PTRACE_ATTACH doesn't really _change_ the credentials, but it very
> much changes what execve() will do with them.
>
> But I guess we could do a "if somebody attached to us while we did the
> execve(), just repeat the whole thing"
>
> Jann, what was your clever idea? Maybe it got lost in the long thread..

My clever/horrible/overly-complex idea was basically:

In execve:

 - After the point of no return, but before we start waiting for the
   other threads to go away, finish calculating our post-execve creds
   and stash them somewhere in the task_struct or so.
 - Drop the cred_guard_mutex.
 - Wait for the other threads to die.
 - Take the cred_guard_mutex again.
 - Clear out the pointer in the task_struct.
 - Finish execve and install the new creds.
 - Drop the cred_guard_mutex again.

Then in ptrace_may_access, after taking the cred_guard_mutex, we'd
know that the target task is either outside execve or in the middle of
execve, with old and new credentials known; and then we could say "you
only get to access that task if you're capable relative to *both* its
old and new credentials, since the task currently has both state from
the old executable and from the new one". (Other users that expect to
use cred_guard_mutex to synchronize with execve would also have to be
changed appropriately; e.g. seccomp tsync would have to bail out if
the task turns out to be in execve after the mutex has been acquired.)

So I think we can conceptually fix the deadlock, but it requires a bit
of refactoring. (I have an old branch somewhere in which I tried to
implement this, and where I did a bunch of refactoring around
ptrace_may_access() so that e.g. the LSM hooks for ptrace can be
invoked twice when the target task is in execve, and so that they take
the target's cred* as an argument.)
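
For illustration only, the "capable relative to *both* its old and new
credentials" rule could be sketched in userspace C roughly as follows.
Every name here (cred_sketch, task_sketch, exec_cred, the toy uid
policy) is a hypothetical stand-in rather than a kernel API, and the
cred_guard_mutex handling is left out entirely:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's struct cred
 * or task_struct. */
struct cred_sketch {
	unsigned int uid;
};

struct task_sketch {
	const struct cred_sketch *cred;      /* creds currently installed */
	const struct cred_sketch *exec_cred; /* stashed post-execve creds,
					      * non-NULL only mid-execve */
};

/* Toy policy: uid 0 may trace anyone, others only a matching uid. */
static bool capable_relative_to(const struct cred_sketch *tracer,
				const struct cred_sketch *target)
{
	return tracer->uid == 0 || tracer->uid == target->uid;
}

/* Access is granted only if the tracer passes against the old creds
 * and, when the target is in the middle of execve, the new ones too. */
static bool may_access_sketch(const struct cred_sketch *tracer,
			      const struct task_sketch *target)
{
	if (!capable_relative_to(tracer, target->cred))
		return false;
	if (target->exec_cred &&
	    !capable_relative_to(tracer, target->exec_cred))
		return false;
	return true;
}
```

With a uid-1000 target that has uid-0 creds stashed, a uid-1000 tracer
is refused while a uid-0 tracer is still allowed.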


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 21:06                                                                   ` Jann Horn
@ 2020-04-28 21:36                                                                     ` Linus Torvalds
  2020-04-28 21:53                                                                       ` Jann Horn
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-28 21:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Tue, Apr 28, 2020 at 2:06 PM Jann Horn <jannh@google.com> wrote:
>
> In execve:
>
>  - After the point of no return, but before we start waiting for the
>    other threads to go away, finish calculating our post-execve creds
>    and stash them somewhere in the task_struct or so.
>  - Drop the cred_guard_mutex.
>  - Wait for the other threads to die.
>  - Take the cred_guard_mutex again.
>  - Clear out the pointer in the task_struct.
>  - Finish execve and install the new creds.
>  - Drop the cred_guard_mutex again.
>
> Then in ptrace_may_access, after taking the cred_guard_mutex, we'd
> know that the target task is either outside execve or in the middle of
> execve, with old and new credentials known; and then we could say "you
> only get to access that task if you're capable relative to *both* its
> old and new credentials, since the task currently has both state from
> the old executable and from the new one".

That doesn't solve the problem with "check_unsafe_exec()" - you might
miss setting LSM_UNSAFE_PTRACE.

Although maybe that whole function could be moved down (to after you
get the cred_guard_mutex the second time).

               Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 21:36                                                                     ` Linus Torvalds
@ 2020-04-28 21:53                                                                       ` Jann Horn
  2020-04-28 22:14                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-28 21:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Tue, Apr 28, 2020 at 11:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2020 at 2:06 PM Jann Horn <jannh@google.com> wrote:
> > In execve:
> >
> >  - After the point of no return, but before we start waiting for the
> >    other threads to go away, finish calculating our post-execve creds
> >    and stash them somewhere in the task_struct or so.
> >  - Drop the cred_guard_mutex.
> >  - Wait for the other threads to die.
> >  - Take the cred_guard_mutex again.
> >  - Clear out the pointer in the task_struct.
> >  - Finish execve and install the new creds.
> >  - Drop the cred_guard_mutex again.
> >
> > Then in ptrace_may_access, after taking the cred_guard_mutex, we'd
> > know that the target task is either outside execve or in the middle of
> > execve, with old and new credentials known; and then we could say "you
> > only get to access that task if you're capable relative to *both* its
> > old and new credentials, since the task currently has both state from
> > the old executable and from the new one".
>
> That doesn't solve the problem with "check_unsafe_exec()" - you might
> miss setting LSM_UNSAFE_PTRACE.

You don't need LSM_UNSAFE_PTRACE if the tracer has already passed a
ptrace_may_access() check against the post-execve creds of the target
- that's no different from having done PTRACE_ATTACH after execve is
over.


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 21:53                                                                       ` Jann Horn
@ 2020-04-28 22:14                                                                         ` Linus Torvalds
  2020-04-28 23:36                                                                           ` Jann Horn
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-28 22:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Tue, Apr 28, 2020 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>
> You don't need LSM_UNSAFE_PTRACE if the tracer has already passed a
> ptrace_may_access() check against the post-execve creds of the target
> - that's no different from having done PTRACE_ATTACH after execve is
> over.

Hmm. That sounds believable, I guess.

But along these ways, I'm starting to think that we might perhaps skip
the lock entirely.

What if we made the rule instead be:

 - we move check_unsafe_exec() down. As far as I can tell, there's no
reason it's that early - the flags it sets aren't actually used until
when we actually do that final set_creds..

 - we add a "next cred" pointer to the signal struct (or task struct)

 - make the rule be that check_unsafe_exec() checks p->ptrace under
the tasklist_lock (or sighand lock - whatever ends up being most
convenient)

 - set "next cred" to be the known next cred there too under the lock.
We call this small locked region the "cred sync point".

 - ptrace will check if we have the "in_exec" flag set and have one of
those "next cred" pointers, in which case it checks both the old and
the next credentials.

No cred_guard_mutex at all, instead the rule is that as execve() goes
through that "cred sync point", we have two cases

 (a) either ptrace has attached (because task->ptrace is set), and it
does the LSM_UNSAFE_PTRACE dance.

or

 (b) it knows that ptrace will check the next creds if it races with execve.

And then after execve has installed the final new creds, it just
clears the "next cred" pointer again, because at that point it knows
that now any subsequent PTRACE_ATTACH will be checking the new creds.

So instead of taking and dropping the cred_guard_mutex, we'd basically
get rid of it entirely.

Yeah, I didn't look at the seccomp case, but I guess the issues will be similar.

             Linus
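
A rough userspace sketch of the "cred sync point" rule described above,
under loud assumptions: a pthread mutex stands in for tasklist_lock (or
the sighand lock), unsafe_ptrace stands in for LSM_UNSAFE_PTRACE, and
all struct and function names are invented for the example. It does not
model what the LSMs would then do with the unsafe flag.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct cred_sketch {
	unsigned int uid;
};

struct task_sketch {
	pthread_mutex_t lock;   /* stand-in for tasklist_lock/sighand lock */
	bool ptraced;           /* stand-in for task->ptrace */
	bool unsafe_ptrace;     /* stand-in for LSM_UNSAFE_PTRACE */
	const struct cred_sketch *cred;
	const struct cred_sketch *next_cred; /* non-NULL only mid-execve */
};

/* Toy policy: uid 0 may trace anyone, others only a matching uid. */
static bool allowed(const struct cred_sketch *tracer,
		    const struct cred_sketch *c)
{
	return tracer->uid == 0 || tracer->uid == c->uid;
}

/* execve side: the small locked region called the "cred sync point". */
static void exec_cred_sync_point(struct task_sketch *t,
				 const struct cred_sketch *new_cred)
{
	pthread_mutex_lock(&t->lock);
	if (t->ptraced)
		t->unsafe_ptrace = true;  /* case (a): tracer already there */
	t->next_cred = new_cred;          /* case (b): later attachers see it */
	pthread_mutex_unlock(&t->lock);
}

/* execve side: install the new creds and clear the "next cred" pointer. */
static void exec_install_creds_sketch(struct task_sketch *t)
{
	pthread_mutex_lock(&t->lock);
	t->cred = t->next_cred;
	t->next_cred = NULL;
	pthread_mutex_unlock(&t->lock);
}

/* ptrace side: check the old creds and, if present, the next creds. */
static bool ptrace_attach_sketch(const struct cred_sketch *tracer,
				 struct task_sketch *t)
{
	bool ok;

	pthread_mutex_lock(&t->lock);
	ok = allowed(tracer, t->cred) &&
	     (!t->next_cred || allowed(tracer, t->next_cred));
	if (ok)
		t->ptraced = true;
	pthread_mutex_unlock(&t->lock);
	return ok;
}
```

Because both sides take the same lock, an attach either lands before
the sync point (and execve sees ptraced set) or after it (and the
attach is checked against the next creds as well).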


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 22:14                                                                         ` Linus Torvalds
@ 2020-04-28 23:36                                                                           ` Jann Horn
  2020-04-29 17:58                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-28 23:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2020 at 2:53 PM Jann Horn <jannh@google.com> wrote:
> >
> > You don't need LSM_UNSAFE_PTRACE if the tracer has already passed a
> > ptrace_may_access() check against the post-execve creds of the target
> > - that's no different from having done PTRACE_ATTACH after execve is
> > over.
>
> Hmm. That sounds believable, I guess.
>
> But along these ways, I'm starting to think that we might perhaps skip
> the lock entirely.

Just as a note: One effect of that will be that if a process goes into
execve on a setuid binary, but bails out before the point of no
return, a tracer may fail to attach to it during that time window. But
that should be completely fine.

> What if we made the rule instead be:
>
>  - we move check_unsafe_exec() down. As far as I can tell, there's no
> reason it's that early - the flags it sets aren't actually used until
> when we actually do that final set_creds..

Right, we should be able to do that stuff quite a bit later than it happens now.

At the moment the final security_bprm_set_creds() seems to happen
before we're calling would_dump(), which computes some of the data
we'll need for access checks... I guess we'll have to split up
security_bprm_set_creds into one hook for "here's another executable
file that's part of the execve chain" and a second hook
("security_bprm_cred_sync()"?) for "now the ->unsafe flags are ready
and we're about to drop the lock again, time to modify the creds if
you want to do that". And then LSMs can make decisions that are
influenced by ->unsafe (fiddling with the creds or rejecting the
execution - I think e.g. selinux_bprm_set_creds() can do both) inside
the "cred sync point".

>  - we add a "next cred" pointer to the signal struct (or task struct)
>
>  - make the rule be that check_unsafe_exec() checks p->ptrace under
> the tasklist_lock (or sighand lock - whatever ends up being most
> convenient)
>
>  - set "next cred" to be the known next cred there too under the lock.
> We call this small locked region the "cred sync point".
>
>  - ptrace will check if we have the "in_exec" flag set and have one of
> those "next cred" pointers, in which case it checks both the old and
> the next credentials.
>
> No cred_guard_mutex at all, instead the rule is that as execve() goes
> through that "cred sync point", we have two cases
>
>  (a) either ptrace has attached (because task->ptrace is set), and it
> does the LSM_UNSAFE_PTRACE dance.
>
> or
>
>  (b) it knows that ptrace will check the next creds if it races with execve.
>
> And then after execve has installed the final new creds, it just
> clears the "next cred" pointer again, because at that point it knows
> that now any subsequent PTRACE_ATTACH will be checking the new creds.
>
> So instead of taking and dropping the cred_guard_mutex, we'd basically
> get rid of it entirely.
>
> Yeah, I didn't look at the seccomp case, but I guess the issues will be similar.

Yeah, seccomp should be able to reject any thread sync if the "next
cred" pointer is set on any of the threads. It should work well as
long as the lock around the "next cred" pointer is at least on the
level of the signal_struct (or broader), so that TSYNC can decide
whether everything's good before starting to iterate over the threads.


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-28 23:36                                                                           ` Jann Horn
@ 2020-04-29 17:58                                                                             ` Linus Torvalds
  2020-04-29 18:33                                                                               ` Jann Horn
  2020-04-29 19:23                                                                               ` Bernd Edlinger
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-29 17:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Tue, Apr 28, 2020 at 4:36 PM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >  - we move check_unsafe_exec() down. As far as I can tell, there's no
> > reason it's that early - the flags it sets aren't actually used until
> > when we actually do that final set_creds..
>
> Right, we should be able to do that stuff quite a bit later than it happens now.

Actually, looking at it, this looks painful for multiple reasons.

The LSM_UNSAFE_xyz flags are used by security_bprm_set_creds(), which
when I traced it through, happened much earlier than I thought. Making
things worse, it's done by prepare_binprm(), which also potentially
gets called from random points by the low-level binfmt handlers too.

And we also have that odd "fs->in_exec" flag, which is used by thread
cloning and io_uring, and I'm not sure what the exact semantics are.

I'm _almost_ inclined to say that we should just abort the execve()
entirely if somebody tries to attach in the middle.

IOW, get rid of the locking, and replace it all just with a sequence
count. Make execve() abort if the sequence count has changed between
loading the original creds, and having installed the new creds.

You can ptrace _over_ an execve, and you can ptrace _after_ an
execve(), but trying to attach just as we execve() would just cause
the execve() to fail.

We could maybe make it conditional on the credentials actually having
changed at all (set another flag in bprm_fill_uid()). So it would only
fail for the suid exec case.

Because honestly, trying to ptrace in the middle of a suid execve()
sounds like an attack, not a useful thing.

That sequence count approach would be a much simpler change.

              Linus
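
As a minimal sketch of the sequence-count idea, assuming a per-task
counter that every attach bumps: the names attach_seq,
exec_begin_sketch() and exec_finish_sketch() are invented for this
example, and C11 atomics stand in for whatever primitive the kernel
would use. It deliberately ignores how the final check and the cred
install would be made atomic with respect to a concurrent attach.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct task_sketch {
	atomic_uint attach_seq; /* hypothetical: bumped by every attach */
};

/* ptrace side: record that an attach happened. */
static void ptrace_attach_sketch(struct task_sketch *t)
{
	atomic_fetch_add(&t->attach_seq, 1);
}

/* execve side: sample the counter when the original creds are loaded. */
static unsigned int exec_begin_sketch(struct task_sketch *t)
{
	return atomic_load(&t->attach_seq);
}

/* execve side: returns true if the exec may complete. creds_changed
 * models the suggested refinement of failing only the suid case. */
static bool exec_finish_sketch(struct task_sketch *t,
			       unsigned int seq_at_start,
			       bool creds_changed)
{
	if (!creds_changed)
		return true;
	return atomic_load(&t->attach_seq) == seq_at_start;
}
```

An attach between exec_begin_sketch() and exec_finish_sketch() makes a
credential-changing exec fail, while an exec that leaves the creds
unchanged goes through regardless.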


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 17:58                                                                             ` Linus Torvalds
@ 2020-04-29 18:33                                                                               ` Jann Horn
  2020-04-29 18:57                                                                                 ` Linus Torvalds
  2020-04-29 19:23                                                                               ` Bernd Edlinger
  1 sibling, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-29 18:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 7:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2020 at 4:36 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > >  - we move check_unsafe_exec() down. As far as I can tell, there's no
> > > reason it's that early - the flags it sets aren't actually used until
> > > when we actually do that final set_creds..
> >
> > Right, we should be able to do that stuff quite a bit later than it happens now.
>
> Actually, looking at it, this looks painful for multiple reasons.
>
> The LSM_UNSAFE_xyz flags are used by security_bprm_set_creds(), which
> when I traced it through, happened much earlier than I thought. Making
> things worse, it's done by prepare_binprm(), which also potentially
> gets called from random points by the low-level binfmt handlers too.

Yeah, but all of that happens before we actually need to do anything
with the accumulated credential information from the prepare_binprm()
calls. We can probably move the unsafe calculation and a new LSM hook
into flush_old_exec(), right before de_thread().

> And we also have that odd "fs->in_exec" flag, which is used by thread
> cloning and io_uring, and I'm not sure what the exact semantics are.

The idea is to ensure that once we're through check_unsafe_exec() and
have computed our LSM_UNSAFE_* flags, another thread that's still
running must not be able to fork() off a child with CLONE_FS, because
having an fs_struct that's shared with anything other than sibling
threads (which will be killed off) is supposed to only be possible if
LSM_UNSAFE_SHARE is set. So:

If check_unsafe_exec() can match each reference in the refcount
->fs->users with a reference from a sibling thread (iow the fs_struct
is not currently shared with another task), it sets p->fs->in_exec.

If another thread tries to clone(CLONE_FS) while we're in execve(),
copy_fs() will throw -EAGAIN. And if io_uring tries to grab a
reference to the fs_struct with the intent to use it on a kernel
worker thread (which conceptually is kinda similar to the
clone(CLONE_FS) case), that also aborts.

And then at the end of execve(), we clear the ->fs->in_exec flag again.

So this should work fine as long as we ensure that we can't have two
threads from the same process going through execve concurrently. (Or
if we actually want to support that, we could make ->in_exec a counter
instead of a flag, but really, preventing concurrent execve()s from a
multithreaded process seems saner...)
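
To make the ->fs->in_exec handshake concrete, here is a simplified
userspace model of the three steps above. The struct, the _sketch
function names and the UNSAFE_SHARE_SKETCH constant are assumptions
made for the example; the real code also holds the fs_struct's lock
around these accesses, which is omitted here.

```c
#include <errno.h>
#include <stdbool.h>

#define UNSAFE_SHARE_SKETCH 0x1 /* stand-in for LSM_UNSAFE_SHARE */

struct fs_sketch {
	int users;    /* references to this fs_struct */
	bool in_exec; /* set while an execve depends on it staying unshared */
};

/* sibling_refs: references held by threads of the exec'ing process,
 * including the caller. Extra references mean the fs_struct is shared
 * with another task, so the exec is flagged unsafe instead. */
static unsigned int check_unsafe_exec_sketch(struct fs_sketch *fs,
					     int sibling_refs)
{
	if (fs->users > sibling_refs)
		return UNSAFE_SHARE_SKETCH;
	fs->in_exec = true;
	return 0;
}

/* clone(CLONE_FS) side: refuse to add a sharer while execve runs. */
static int copy_fs_sketch(struct fs_sketch *fs)
{
	if (fs->in_exec)
		return -EAGAIN;
	fs->users++;
	return 0;
}

/* End of execve: sharing is allowed again. */
static void exec_done_sketch(struct fs_sketch *fs)
{
	fs->in_exec = false;
}
```

A plain flag is enough here only because a single execve per process
is assumed; two concurrent execve() calls would need the counter
variant mentioned above.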

> I'm _almost_ inclined to say that we should just abort the execve()
> entirely if somebody tries to attach in the middle.
>
> IOW, get rid of the locking, and replace it all just with a sequence
> count. Make execve() abort if the sequence count has changed between
> loading the original creds, and having installed the new creds.
>
> You can ptrace _over_ an execve, and you can ptrace _after_ an
> execve(), but trying to attach just as we execve() would just cause
> the execve() to fail.
>
> We could maybe make it conditional on the credentials actually having
> changed at all (set another flag in bprm_fill_uid()). So it would only
> fail for the suid exec case.
>
> Because honestly, trying to ptrace in the middle of a suid execve()
> sounds like an attack, not a useful thing.
>
> That sequence count approach would be a much simpler change.

In that model, what should happen if someone tries to attach to a
process that's in execve(), but after the point of no return in
de_thread()? "Abort" after the point of no return normally means
force_sigsegv(), right?


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 18:33                                                                               ` Jann Horn
@ 2020-04-29 18:57                                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-29 18:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Bernd Edlinger, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 11:33 AM Jann Horn <jannh@google.com> wrote:
>
> > That sequence count approach would be a much simpler change.
>
> In that model, what should happen if someone tries to attach to a
> process that's in execve(), but after the point of no return in
> de_thread()? "Abort" after the point of no return normally means
> force_sigsegv(), right?

It would by definition have to check the sequence number at the end of
install_exec_creds() (where we currently release the
cred_guard_mutex).

And yes, that's after the point of no return, so it would cause the
usual "kill the process".

We could check earlier too (while still able to return errors) and
return -EAGAIN or something, but that wouldn't obviate the need for
that final check, it would just shrink the window for the "fatal
exec" case.

                Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 17:58                                                                             ` Linus Torvalds
  2020-04-29 18:33                                                                               ` Jann Horn
@ 2020-04-29 19:23                                                                               ` Bernd Edlinger
  2020-04-29 19:26                                                                                 ` Jann Horn
  2020-04-29 22:38                                                                                 ` Linus Torvalds
  1 sibling, 2 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-29 19:23 UTC (permalink / raw)
  To: Linus Torvalds, Jann Horn
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On 4/29/20 7:58 PM, Linus Torvalds wrote:
> On Tue, Apr 28, 2020 at 4:36 PM Jann Horn <jannh@google.com> wrote:
>>
>> On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>>  - we move check_unsafe_exec() down. As far as I can tell, there's no
>>> reason it's that early - the flags it sets aren't actually used until
>>> when we actually do that final set_creds..
>>
>> Right, we should be able to do that stuff quite a bit later than it happens now.
> 
> Actually, looking at it, this looks painful for multiple reasons.
> 
> The LSM_UNSAFE_xyz flags are used by security_bprm_set_creds(), which
> when I traced it through, happened much earlier than I thought. Making
> things worse, it's done by prepare_binprm(), which also potentially
> gets called from random points by the low-level binfmt handlers too.
> 
> And we also have that odd "fs->in_exec" flag, which is used by thread
> cloning and io_uring, and I'm not sure what the exact semantics are.
> 
> I'm _almost_ inclined to say that we should just abort the execve()
> entirely if somebody tries to attach in the middle.
> 
> IOW, get rid of the locking, and replace it all just with a sequence
> count. Make execve() abort if the sequence count has changed between
> loading the original creds, and having installed the new creds.
> 
> You can ptrace _over_ an execve, and you can ptrace _after_ an
> execve(), but trying to attach just as we execve() would just cause
> the execve() to fail.
> 
> We could maybe make it conditional on the credentials actually having
> changed at all (set another flag in bprm_fill_uid()). So it would only
> fail for the suid exec case.
> 
> Because honestly, trying to ptrace in the middle of a suid execve()
> sounds like an attack, not a useful thing.
> 

I think the use case where a program attaches and detaches many
processes at a high rate, is either an attack or a very aggressive
virus checker, fixing a bug that prevents an attack is not a good
idea, but fixing a bug that would otherwise break a virus checker
would be a good thing.

By the way, all other attempts to fix it look much more dangerous
than my initially proposed patch, you know the one you hated, but
it does work and does not look overly complicated either.

What was the reason why that cannot be done this way?


Thanks,
Bernd.



* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 19:23                                                                               ` Bernd Edlinger
@ 2020-04-29 19:26                                                                                 ` Jann Horn
  2020-04-29 20:19                                                                                   ` Bernd Edlinger
  2020-04-29 22:38                                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-29 19:26 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Linus Torvalds, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 9:23 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> On 4/29/20 7:58 PM, Linus Torvalds wrote:
> > On Tue, Apr 28, 2020 at 4:36 PM Jann Horn <jannh@google.com> wrote:
> >>
> >> On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >>>
> >>>  - we move check_unsafe_exec() down. As far as I can tell, there's no
> >>> reason it's that early - the flags it sets aren't actually used until
> >>> when we actually do that final set_creds..
> >>
> >> Right, we should be able to do that stuff quite a bit later than it happens now.
> >
> > Actually, looking at it, this looks painful for multiple reasons.
> >
> > The LSM_UNSAFE_xyz flags are used by security_bprm_set_creds(), which
> > when I traced it through, happened much earlier than I thought. Making
> > things worse, it's done by prepare_binprm(), which also potentially
> > gets called from random points by the low-level binfmt handlers too.
> >
> > And we also have that odd "fs->in_exec" flag, which is used by thread
> > cloning and io_uring, and I'm not sure what the exact semantics are.
> >
> > I'm _almost_ inclined to say that we should just abort the execve()
> > entirely if somebody tries to attach in the middle.
> >
> > IOW, get rid of the locking, and replace it all just with a sequence
> > count. Make execve() abort if the sequence count has changed between
> > loading the original creds, and having installed the new creds.
> >
> > You can ptrace _over_ an execve, and you can ptrace _after_ an
> > execve(), but trying to attach just as we execve() would just cause
> > the execve() to fail.
> >
> > We could maybe make it conditional on the credentials actually having
> > changed at all (set another flag in bprm_fill_uid()). So it would only
> > fail for the suid exec case.
> >
> > Because honestly, trying to ptrace in the middle of a suid execve()
> > sounds like an attack, not a useful thing.
> >
>
> I think the use case where a program attaches and detaches many
> processes at a high rate, is either an attack or a very aggressive
> virus checker, fixing a bug that prevents an attack is not a good
> idea, but fixing a bug that would otherwise break a virus checker
> would be a good thing.
>
> By the way, all other attempts to fix it look much more dangerous
> than my initially proposed patch, you know the one you hated, but
> it does work and does not look overly complicated either.
>
> What was the reason why that cannot be done this way?

I'm not sure which patch you're talking about - I assume you don't
mean <https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/>?


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 19:26                                                                                 ` Jann Horn
@ 2020-04-29 20:19                                                                                   ` Bernd Edlinger
  2020-04-29 21:06                                                                                     ` Jann Horn
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-29 20:19 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linus Torvalds, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On 4/29/20 9:26 PM, Jann Horn wrote:
> On Wed, Apr 29, 2020 at 9:23 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>> On 4/29/20 7:58 PM, Linus Torvalds wrote:
>>> On Tue, Apr 28, 2020 at 4:36 PM Jann Horn <jannh@google.com> wrote:
>>>>
>>>> On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>>  - we move check_unsafe_exec() down. As far as I can tell, there's no
>>>>> reason it's that early - the flags it sets aren't actually used until
>>>>> when we actually do that final set_creds..
>>>>
>>>> Right, we should be able to do that stuff quite a bit later than it happens now.
>>>
>>> Actually, looking at it, this looks painful for multiple reasons.
>>>
>>> The LSM_UNSAFE_xyz flags are used by security_bprm_set_creds(), which
>>> when I traced it through, happened much earlier than I thought. Making
>>> things worse, it's done by prepare_binprm(), which also potentially
>>> gets called from random points by the low-level binfmt handlers too.
>>>
>>> And we also have that odd "fs->in_exec" flag, which is used by thread
>>> cloning and io_uring, and I'm not sure what the exact semantics are.
>>>
>>> I'm _almost_ inclined to say that we should just abort the execve()
>>> entirely if somebody tries to attach in the middle.
>>>
>>> IOW, get rid of the locking, and replace it all just with a sequence
>>> count. Make execve() abort if the sequence count has changed between
>>> loading the original creds, and having installed the new creds.
>>>
>>> You can ptrace _over_ an execve, and you can ptrace _after_ an
>>> execve(), but trying to attach just as we execve() would just cause
>>> the execve() to fail.
>>>
>>> We could maybe make it conditional on the credentials actually having
>>> changed at all (set another flag in bprm_fill_uid()). So it would only
>>> fail for the suid exec case.
>>>
>>> Because honestly, trying to ptrace in the middle of a suid execve()
>>> sounds like an attack, not a useful thing.
>>>
>>
>> I think the use case where a program attaches and detaches many
>> processes at a high rate, is either an attack or a very aggressive
>> virus checker, fixing a bug that prevents an attack is not a good
>> idea, but fixing a bug that would otherwise break a virus checker
>> would be a good thing.
>>
>> By the way, all other attempts to fix it look much more dangerous
>> than my initially proposed patch, you know the one you hated, but
>> it does work and does not look overly complicated either.
>>
>> What was the reason why that cannot be done this way?
> 
> I'm not sure which patch you're talking about - I assume you don't
> mean <https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/>?
> 

No, I meant:

[PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
https://marc.info/?l=linux-kernel&m=158559277631548&w=2

and

[PATCH v6 16/16] doc: Update documentation of ->exec_*_mutex
https://marc.info/?l=linux-kernel&m=158559277631548&w=2


I think that was the latest version, but this had several iterations already.

Thanks
Bernd.


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 20:19                                                                                   ` Bernd Edlinger
@ 2020-04-29 21:06                                                                                     ` Jann Horn
  0 siblings, 0 replies; 127+ messages in thread
From: Jann Horn @ 2020-04-29 21:06 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Linus Torvalds, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 10:20 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> On 4/29/20 9:26 PM, Jann Horn wrote:
> > On Wed, Apr 29, 2020 at 9:23 PM Bernd Edlinger
> > <bernd.edlinger@hotmail.de> wrote:
> >> On 4/29/20 7:58 PM, Linus Torvalds wrote:
> >>> On Tue, Apr 28, 2020 at 4:36 PM Jann Horn <jannh@google.com> wrote:
> >>>>
> >>>> On Wed, Apr 29, 2020 at 12:14 AM Linus Torvalds
> >>>> <torvalds@linux-foundation.org> wrote:
> >>>>>
> >>>>>  - we move check_unsafe_exec() down. As far as I can tell, there's no
> >>>>> reason it's that early - the flags it sets aren't actually used until
> >>>>> when we actually do that final set_creds..
> >>>>
> >>>> Right, we should be able to do that stuff quite a bit later than it happens now.
> >>>
> >>> Actually, looking at it, this looks painful for multiple reasons.
> >>>
> >>> The LSM_UNSAFE_xyz flags are used by security_bprm_set_creds(), which
> >>> when I traced it through, happened much earlier than I thought. Making
> >>> things worse, it's done by prepare_binprm(), which also potentially
> >>> gets called from random points by the low-level binfmt handlers too.
> >>>
> >>> And we also have that odd "fs->in_exec" flag, which is used by thread
> >>> cloning and io_uring, and I'm not sure what the exact semantics are.
> >>>
> >>> I'm _almost_ inclined to say that we should just abort the execve()
> >>> entirely if somebody tries to attach in the middle.
> >>>
> >>> IOW, get rid of the locking, and replace it all just with a sequence
> >>> count. Make execve() abort if the sequence count has changed between
> >>> loading the original creds, and having installed the new creds.
> >>>
> >>> You can ptrace _over_ an execve, and you can ptrace _after_ an
> >>> execve(), but trying to attach just as we execve() would just cause
> >>> the execve() to fail.
> >>>
> >>> We could maybe make it conditional on the credentials actually having
> >>> changed at all (set another flag in bprm_fill_uid()). So it would only
> >>> fail for the suid exec case.
> >>>
> >>> Because honestly, trying to ptrace in the middle of a suid execve()
> >>> sounds like an attack, not a useful thing.
> >>>
> >>
> >> I think the use case where a program attaches and detaches many
> >> processes at a high rate, is either an attack or a very aggressive
> >> virus checker, fixing a bug that prevents an attack is not a good
> >> idea, but fixing a bug that would otherwise break a virus checker
> >> would be a good thing.
> >>
> >> By the way, all other attempts to fix it look much more dangerous
> >> than my initially proposed patch, you know the one you hated, but
> >> it does work and does not look overly complicated either.
> >>
> >> What was the reason why that cannot be done this way?
> >
> > I'm not sure which patch you're talking about - I assume you don't
> > mean <https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/>?
> >
>
> No, I meant:
>
> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
> https://marc.info/?l=linux-kernel&m=158559277631548&w=2

(on lore: https://lore.kernel.org/lkml/AM6PR03MB51700577CF9EF4972FDE568AE4CB0@AM6PR03MB5170.eurprd03.prod.outlook.com/)

I mean - I guess that kinda works? It'll mean that attaching to a task
with "strace" as root won't work reliably if the task is in the middle
of execve though, there'll be a weird race where if strace first
attaches to a sibling and then tries to attach to the thread going
through execve(), it'll fail to attach to that thread, and then lose
the process entirely once de_thread() has killed off the other
threads. My perfectionist side is somewhat irked by that.

Still, it's probably far better than Linus' "simpler change" where the
target just gets killed off if someone tries to trace it at the wrong
time.
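
For reference, the sequence-count idea quoted above can be modelled in
userspace roughly as follows. This is only a sketch: struct task_model and
the model_*() helpers are hypothetical names, the -EAGAIN return value is an
arbitrary choice for the illustration, and a real kernel implementation
would additionally have to make the final check and the credential install
atomic with respect to a concurrent attach.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-task state; not a kernel structure. */
struct task_model {
    atomic_uint attach_seq;   /* bumped on every (modelled) ptrace attach */
};

/* A tracer attaching to the task bumps the sequence count. */
static void model_ptrace_attach(struct task_model *t)
{
    atomic_fetch_add(&t->attach_seq, 1);
}

/* execve() samples the count when it loads the original creds. */
static unsigned int model_exec_begin(struct task_model *t)
{
    return atomic_load(&t->attach_seq);
}

/*
 * Before installing the new creds, execve() compares the count with the
 * sample. Returns 0 if nobody attached in between, -EAGAIN otherwise
 * (the error code is an assumption made for this sketch).
 */
static int model_exec_commit(struct task_model *t, unsigned int snapshot)
{
    if (atomic_load(&t->attach_seq) != snapshot)
        return -EAGAIN;
    return 0;
}
```

In this model an attach before model_exec_begin() or after
model_exec_commit() is harmless; only an attach in between makes the exec
fail.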

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 19:23                                                                               ` Bernd Edlinger
  2020-04-29 19:26                                                                                 ` Jann Horn
@ 2020-04-29 22:38                                                                                 ` Linus Torvalds
  2020-04-29 23:22                                                                                   ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-29 22:38 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 12:23 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> By the way, all other attempts to fix it look much more dangerous
> than my initially proposed patch, you know the one you hated, but
> it does work and does not look overly complicated either.

I don't think it works.

The whole "take lock, release it in the middle, then re-take it" is
fundamentally a broken model. We've never had it work well, and it
tends to have subtle issues. That's particularly true when you then
open-code the (only) acceptable sequence something like five times.

> What was the reason why that cannot be done this way?

If it had introduced a new locking model, and new locking helpers for
that model, with a comment in _one_ place, and nobody doing the ad-hoc
locking on their own, that might be more acceptable.

But that's not what that patch did. No way will I take something that
is so fragile and hacky, and repeats the hack N times.

If you do it properly, with a helper function instead of repeating
that fragile nasty thing, maybe it will look better to me.

That said, locks that get released in the middle aren't really locks.
But at least if the only way to take that lock had the "oh, this lock
is in that inconsistent state, I will return -EAGAIN", that would be
one thing. But when you have N different users and rely on all of them
getting that special semantic right, you're doing something wrong.

                Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 22:38                                                                                 ` Linus Torvalds
@ 2020-04-29 23:22                                                                                   ` Linus Torvalds
  2020-04-29 23:59                                                                                     ` Jann Horn
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-29 23:22 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 3:38 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If you do it properly, with a helper function instead of repeating
> that fragile nasty thing, maybe it will look better to me.

Side note: if it has a special helper function for the "get lock,
repeat if it was invalid", you can do a better job than return
-EAGAIN.

In particular, you can do this

        set_thread_flag(TIF_SIGPENDING);
        return -ERESTARTNOINTR;

which will actually restart the system call. So a ptrace() user (or
somebody doing a "write()" to /proc/<pid>/attr/xyz, wouldn't even see
the impossible EAGAIN error.

But that all requires that you have some locking helper routines like

    int lock_exec_creds(struct task_struct *);
    void unlock_exec_creds(struct task_struct *);

because there's no way we put that kind of super-fragile code in
several places. It would be very much one single routine with a *HUGE*
comment on it.
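
A minimal userspace sketch of what such a pair of helpers might look like,
using pthreads instead of kernel mutexes. Everything here is an assumption
made for illustration: struct exec_state, the exec_in_progress flag, and the
name unlock_exec_creds() for the unlock side are not taken from any posted
patch. The point is only that the "take lock, check whether it is in the
inconsistent state, back out" sequence lives in exactly one place.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the state guarded by the exec lock. */
struct exec_state {
    pthread_mutex_t cred_guard;
    bool exec_in_progress;    /* protected by cred_guard */
};

/*
 * The ONLY way to take the lock. Returns 0 with the lock held, or
 * -EAGAIN with the lock NOT held if an exec has left the state
 * inconsistent. Callers never open-code this sequence themselves.
 */
static int lock_exec_creds(struct exec_state *s)
{
    pthread_mutex_lock(&s->cred_guard);
    if (s->exec_in_progress) {
        pthread_mutex_unlock(&s->cred_guard);
        return -EAGAIN;
    }
    return 0;
}

/* Release a lock obtained by a successful lock_exec_creds(). */
static void unlock_exec_creds(struct exec_state *s)
{
    pthread_mutex_unlock(&s->cred_guard);
}
```

A caller would then do "if (lock_exec_creds(s)) return error;" followed by
its work and unlock_exec_creds(s), and what to return on failure becomes a
decision made per call site rather than per copy of the locking code.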

             Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 23:22                                                                                   ` Linus Torvalds
@ 2020-04-29 23:59                                                                                     ` Jann Horn
  2020-04-30  1:08                                                                                       ` Bernd Edlinger
  2020-04-30  2:16                                                                                       ` Linus Torvalds
  0 siblings, 2 replies; 127+ messages in thread
From: Jann Horn @ 2020-04-29 23:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Thu, Apr 30, 2020 at 1:22 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 29, 2020 at 3:38 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > If you do it properly, with a helper function instead of repeating
> > that fragile nasty thing, maybe it will look better to me.
>
> Side note: if it has a special helper function for the "get lock,
> repeat if it was invalid", you can do a better job than return
> -EAGAIN.
>
> In particular, you can do this
>
>         set_thread_flag(TIF_SIGPENDING);
>         return -ERESTARTNOINTR;
>
> which will actually restart the system call. So a ptrace() user (or
> somebody doing a "write()" to /proc/<pid>/attr/xyz, wouldn't even see
> the impossible EAGAIN error.

Wouldn't you end up livelocked in the scenario that currently deadlocks? Like:

 - tracer attaches to thread A
 - thread B goes into execve, blocks on waiting for A's death
 - tracer tries to attach to B and hits the -EAGAIN

If we make the PTRACE_ATTACH call restart, the tracer will just end up
looping without ever resolving the deadlock. If we want to get through
this cleanly with this approach, userspace needs to either
deprioritize the "I want to attach to pid X" and go back into its
eventloop, or to just treat -EAGAIN as a fatal error and give up
trying to attach to that task.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 23:59                                                                                     ` Jann Horn
@ 2020-04-30  1:08                                                                                       ` Bernd Edlinger
  2020-04-30  2:20                                                                                         ` Linus Torvalds
  2020-04-30  2:16                                                                                       ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-30  1:08 UTC (permalink / raw)
  To: Jann Horn, Linus Torvalds
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On 4/30/20 1:59 AM, Jann Horn wrote:
> On Thu, Apr 30, 2020 at 1:22 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, Apr 29, 2020 at 3:38 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> If you do it properly, with a helper function instead of repeating
>>> that fragile nasty thing, maybe it will look better to me.

I added the BIG FAT WARNING comments as a mitigation for that.
Did you like those comments?

>>
>> Side note: if it has a special helper function for the "get lock,
>> repeat if it was invalid", you can do a better job than return
>> -EAGAIN.
>>
>> In particular, you can do this
>>
>>         set_thread_flag(TIF_SIGPENDING);
>>         return -ERESTARTNOINTR;
>>
>> which will actually restart the system call. So a ptrace() user (or
>> somebody doing a "write()" to /proc/<pid>/attr/xyz, wouldn't even see
>> the impossible EAGAIN error.
> 
> Wouldn't you end up livelocked in the scenario that currently deadlocks? Like:
> 
>  - tracer attaches to thread A
>  - thread B goes into execve, blocks on waiting for A's death
>  - tracer tries to attach to B and hits the -EAGAIN
> 
> If we make the PTRACE_ATTACH call restart, the tracer will just end up
> looping without ever resolving the deadlock. If we want to get through
> this cleanly with this approach, userspace needs to either
> deprioritize the "I want to attach to pid X" and go back into its
> eventloop, or to just treat -EAGAIN as a fatal error and give up
> trying to attach to that task.
> 

Yes, exactly, the point is the caller is expected to call wait in that
scenario, otherwise the -EAGAIN just repeats forever, that is an API
change, yes, but something unavoidable, and the patch tries hard to
limit it to cases where the live-lock or pseudo-dead-lock is unavoidable
anyway.


Bernd.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-29 23:59                                                                                     ` Jann Horn
  2020-04-30  1:08                                                                                       ` Bernd Edlinger
@ 2020-04-30  2:16                                                                                       ` Linus Torvalds
  2020-04-30 13:39                                                                                         ` Bernd Edlinger
  1 sibling, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30  2:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 5:00 PM Jann Horn <jannh@google.com> wrote:
>
> Wouldn't you end up livelocked in the scenario that currently deadlocks?

The test case that we already know is broken, and any fix will have to
change anyway?

Let's just say that I don't care in the least.

But Bernd's patch as-is breaks a test-case that currently *works*,
namely something as simple as

  echo xyz > /proc/<pid>/attr/something

and honestly, breaking something that _works_ and may be used in
reality, in order to make a known buggy user testcase work?

Because no, "write()" returning -EAGAIN isn't ok.

            Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  1:08                                                                                       ` Bernd Edlinger
@ 2020-04-30  2:20                                                                                         ` Linus Torvalds
  2020-04-30  3:00                                                                                           ` Jann Horn
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30  2:20 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 6:08 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> I added the BIG FAT WARNING comments as a mitigation for that.
> Did you like those comments?

No.

What's the point of saying "THIS CODE IS GARBAGE" and then expecting
that to make it ok?

No, that doesn't make it ok. It just means that it should have been
done differently.

> Yes, exactly, the point is the caller is expected to call wait in that
> scenario, otherwise the -EAGAIN just repeats forever, that is an API
> change, yes, but something unavoidable, and the patch tries hard to
> limit it to cases where the live-lock or pseudo-dead-lock is unavoidable
> anyway.

I'm getting really fed up with your insistence on that KNOWN BROKEN
garbage test-case.

It's shit. The test-case is wrong. I've told you before.

Your patch as-is breaks other cases that are *not* wrong in the kernel
currently, and that don't have test-cases because they JustWork(tm).

The livelock isn't interesting. The test-case that shows it is pure
garbage, and is written wrong.

IF that test-case hadn't been buggy in the first place, it would have
had ignored its child (or had a handler for SIGCHLD), and not
livelocked.

                Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  2:20                                                                                         ` Linus Torvalds
@ 2020-04-30  3:00                                                                                           ` Jann Horn
  2020-04-30  3:25                                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-30  3:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Thu, Apr 30, 2020 at 4:20 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 29, 2020 at 6:08 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
> >
> > I added the BIG FAT WARNING comments as a mitigation for that.
> > Did you like those comments?
>
> No.
>
> What's the point of saying "THIS CODE IS GARBAGE" and then expecting
> that to make it ok?
>
> No, that doesn't make it ok. It just means that it should have been
> done differently.
>
> > Yes, exactly, the point is the caller is expected to call wait in that
> > scenario, otherwise the -EAGAIN just repeats forever, that is an API
> > change, yes, but something unavoidable, and the patch tries hard to
> > limit it to cases where the live-lock or pseudo-dead-lock is unavoidable
> > anyway.
>
> I'm getting really fed up with your insistence on that KNOWN BROKEN
> garbage test-case.
>
> It's shit. The test-case is wrong. I've told you before.
>
> Your patch as-is breaks other cases that are *not* wrong in the kernel
> currently, and that don't have test-cases because they JustWork(tm).
>
> The livelock isn't interesting. The test-case that shows it is pure
> garbage, and is written wrong.
>
> IF that test-case hadn't been buggy in the first place, it would have
> had ignored its child (or had a handler for SIGCHLD), and not
> livelocked.

But if we go with Bernd's approach together with your restart
suggestion, then simply doing PTRACE_ATTACH on two threads A and B
would be enough to livelock, right?

tracer: PTRACE_ATTACHes to A
B: enters de_thread()
tracer: attempts to PTRACE_ATTACH to B

Now the tracer will loop on PTRACE_ATTACH, right?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  3:00                                                                                           ` Jann Horn
@ 2020-04-30  3:25                                                                                             ` Linus Torvalds
  2020-04-30  3:41                                                                                               ` Jann Horn
  2020-04-30 13:37                                                                                               ` Linus Torvalds
  0 siblings, 2 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30  3:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 8:00 PM Jann Horn <jannh@google.com> wrote:
>
> But if we go with Bernd's approach together with your restart
> suggestion,

So repeat after me: Bernd's approach _without_ the restart is unacceptable.

It's unacceptable because it breaks things that currently work, and
returns EAGAIN in situations where that is simply not a valid error
code.

His original patch also happens to be unacceptable because it's an
unmaintainable mess, but that's independent of the bug it introduced.

That bug has nothing to do with ptrace(). It's literally a "write()"
to a file in /proc.

What is so hard to get about this basic thing?

> then simply doing PTRACE_ATTACH on two threads A and B
> would be enough to livelock, right?

The same case that just causes a recursive wait. Yes. No worse off than we were.

And the fact is, *THAT* case looks truly trivial to work around.

Just make the ptrace() code - but not the fs/proc/base.c code - do
something like

        if (lock_exec_creds(tsk))
                return -EINTR;

and now ptrace() doesn't repeat (simply because it doesn't return that
ERESTARTNOINTR. It would go through that "return through signal
handling code" in the kernel, but it wouldn't actually retry the
system call).

But I'm getting less and less interested in trying to "fix" this
problem, when people don't seem to realize that the important case is
to not break _good_ programs, and the pointless buggy garbage
test-case is entirely uninteresting. It's buggy user code. If it
causes a wait or a livelock, nobody sane should care in the least. Fix
the bug in user space.

Introducing new bugs in the kernel where they didn't exist before - in
order to try to work around buggy user-space that has never ever
worked - is not acceptable.

                       Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  3:25                                                                                             ` Linus Torvalds
@ 2020-04-30  3:41                                                                                               ` Jann Horn
  2020-04-30  3:50                                                                                                 ` Linus Torvalds
  2020-04-30 13:37                                                                                               ` Linus Torvalds
  1 sibling, 1 reply; 127+ messages in thread
From: Jann Horn @ 2020-04-30  3:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Thu, Apr 30, 2020 at 5:26 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 29, 2020 at 8:00 PM Jann Horn <jannh@google.com> wrote:
> >
> > But if we go with Bernd's approach together with your restart
> > suggestion,
>
> So repeat after me: Bernd's approach _without_ the restart is unacceptable.
>
> It's unacceptable because it breaks things that currently work, and
> returns EAGAIN in situations where that is simply not a valid error
> code.

Sure, makes sense to me. I'm not eager to start randomly throwing
EAGAIN where it couldn't happen before either (and I initially missed
that Bernd's patch did that for procfs files, too).

> That bug has nothing to do with ptrace(). It's literally a "write()"
> to a file in /proc.
>
> What is so hard to get about this basic thing?

You said:

| So a ptrace() user (or [...] wouldn't even see the impossible EAGAIN error.

So I assumed you explicitly wanted ptrace() to restart, too. I was
just pointing out that that didn't make sense to me.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  3:41                                                                                               ` Jann Horn
@ 2020-04-30  3:50                                                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30  3:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 8:41 PM Jann Horn <jannh@google.com> wrote:
>
> | So a ptrace() user (or [...] wouldn't even see the impossible EAGAIN error.
>
> So I assumed you explicitly wanted ptrace() to restart, too. I was
> just pointing out that that didn't make sense to me.

I'm actually ok with the restart option, simply because I continue to
maintain that the program is buggy. "Anything goes".

To not be buggy, the program needs to install a SIGCHLD handler so
that it can reap its (pseudo-)children.

At which point it doesn't actually make any difference whether we fix
the kernel or not, because then the non-buggy program will just work -
even with a non-modified kernel.

Honestly, the main argument for the kernel doing anything different at
all is that from a user-mode perspective, silently hanging in the
kernel waiting for something to happen is likely the least easy to
debug.

But if you do a return to user space - even if it's to just rinse and
repeat - it's at least not "silent" any more, even if the main noise
it makes is just to waste 100% CPU time. At least that's a big hint to
somebody to take a look.

But yes, we can make ptrace() - and _only_ ptrace() - then not repeat,
and return a new error code that it has never returned before. Like
EAGAIN. Mainly because in that case we're only breaking semantics of
something that was already broken - unlike "write()", which has
perfectly well-defined semantics and wasn't broken.

                 Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  3:25                                                                                             ` Linus Torvalds
  2020-04-30  3:41                                                                                               ` Jann Horn
@ 2020-04-30 13:37                                                                                               ` Linus Torvalds
  1 sibling, 0 replies; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30 13:37 UTC (permalink / raw)
  To: Jann Horn
  Cc: Bernd Edlinger, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Wed, Apr 29, 2020 at 8:25 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Bernd's approach _without_ the restart is unacceptable.
>
> It's unacceptable because it breaks things that currently work, and
> returns EAGAIN in situations where that is simply not a valid error
> code.

Looking at my restart thing, I think it's a hack, and I don't think
that's acceptable either.

I was pleased with how clever it was, but it's one of those "clever
hacks" that is in the end more "hack" than "clever".

The basic issue is that releasing a lock in the middle just
fundamentally defeats the purpose of the lock unless you have a way to
redo the operation after fixing whatever caused the drop.

And the system call restart thing is dodgy, because there's none of
that "fixing".

It can cause that "write()" call to do the CPU busy loop too if it
hits that "execve() in progress" situation.

The only difference with the "write()" case vs "ptrace()" is that
nobody has ever written an insane test-case that doesn't wait for
children, and then does a "write()" to the /proc file that can then
require zombie children to be reaped.

So I don't think the approach is valid even with the restart. Not
restarting isn't acceptable for write(), but restarting doesn't really
work either.

I guess we could have a very special lock that does something like

    int lock_exec_cred_mutex(struct task_struct *task)
    {
        if (mutex_trylock(&task->signal->cred_guard_mutex))
                return 0;

        if (lock_can_deadlock(task))
                return -EDEADLK;

        return mutex_lock_interruptible(&task->signal->cred_guard_mutex);
    }

might work. But that "lock_can_deadlock()" needs some kind of oracle
or heuristic.
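
The shape of that lock can be tried out in userspace with pthreads, as in
the sketch below. Note that pthread_mutex_trylock() returns 0 on success,
whereas the kernel's mutex_trylock() returns 1, so the test is inverted
relative to the snippet above. The struct exec_guard type, the
lock_exec_guard() name and the owner_waits_on_caller flag are hypothetical;
the flag merely stands in for the oracle, which is the part nobody has a
good answer for. The userspace lock is also not interruptible.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; the flag is a toy replacement for a real oracle. */
struct exec_guard {
    pthread_mutex_t mutex;
    atomic_bool owner_waits_on_caller;
};

/* Toy oracle: would blocking on the mutex wait on ourselves? */
static bool lock_can_deadlock(struct exec_guard *g)
{
    return atomic_load(&g->owner_waits_on_caller);
}

/*
 * Returns 0 with the mutex held, -EDEADLK without the mutex if waiting
 * could never succeed, or a negated pthread error code.
 */
static int lock_exec_guard(struct exec_guard *g)
{
    /* Fast path: uncontended, so there is nothing to deadlock on. */
    if (pthread_mutex_trylock(&g->mutex) == 0)
        return 0;

    /* Contended: only block if the oracle says that is safe. */
    if (lock_can_deadlock(g))
        return -EDEADLK;

    return -pthread_mutex_lock(&g->mutex);
}
```

Unlike the drop-and-retake approach, the lock holder never releases the
mutex early here; only the waiter decides not to wait.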

And I can't come up with a perfect one, although I can come up with
things like "if the target has threads, and those threads have a
reaper that is you, then you have to have SIGCHLD enabled". But it
gets ugly and hacky.

But I think actually releasing the lock in the middle of execve()
before it's done with is worse than ugly and hacky - it's
fundamentally broken.

Moving things around? Sure - like waiting for the threads _after_ the
lock and having done all the cred calculations. So I think Oleg's
patch works.

                 Linus

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30  2:16                                                                                       ` Linus Torvalds
@ 2020-04-30 13:39                                                                                         ` Bernd Edlinger
  2020-04-30 13:47                                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-30 13:39 UTC (permalink / raw)
  To: Linus Torvalds, Jann Horn
  Cc: Oleg Nesterov, Eric W. Biederman, Waiman Long, Ingo Molnar,
	Will Deacon, Linux Kernel Mailing List, Alexey Gladkov

On 4/30/20 4:16 AM, Linus Torvalds wrote:
> On Wed, Apr 29, 2020 at 5:00 PM Jann Horn <jannh@google.com> wrote:
>>
>> Wouldn't you end up livelocked in the scenario that currently deadlocks?
> 
> The test case that we already know is broken, and any fix will have to
> change anyway?
> 

The purpose of the test case was only to test the behaviour of my
later patch.  The test case _must_ be adjusted to the follow-up
patch, I have no problem with that.  Anybody may change the test case
when we know how to fix the API.  I did just not anticipate that Eric
would only apply 14 of 16 patches = 87.5% of the patch series. Now that
causes some tension, but it is not really a problem IMHO.

> Let's just say that I don't care in the least.
> 
> But Bernd's patch as-is breaks a test-case that currently *works*,
> namely something as simple as
> 
>   echo xyz > /proc/<pid>/attr/something
> 

Excuse me, but what in my /proc folder there is no attr/something
is there a procfs equivalent of pthread_attach ?

What exactly is "attr/something" ?

> and honestly, breaking something that _works_ and may be used in
> reality, in order to make a known buggy user testcase work?
> 
> Because no, "write()" returning -EAGAIN isn't ok.
> 

write can return -EAGAIN if the file is non-blocking, it is
never the case for a disk file, but for a NFS that is not at all
clear, depends on a mount option, and once I had a deadlock in
one of my test systems after OOM-stress, but I never was able
to reproduce, the umount deadlocked, then I was not able to
reboot, could be an alpha-particle of course, who knows...


Hmmm.. maybe a stupid idea:

We could keep the old deadlock-capable API,
and add a new _flag_ somewhere to the PTRACE_ATTACH call,
that _enables_ the non-blocking behavior, how about that.



Thanks
Bernd.

>             Linus
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30 13:39                                                                                         ` Bernd Edlinger
@ 2020-04-30 13:47                                                                                           ` Linus Torvalds
  2020-04-30 14:29                                                                                             ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30 13:47 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Thu, Apr 30, 2020 at 6:39 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> Excuse me, but in my /proc folder there is no attr/something.
> Is there a procfs equivalent of pthread_attach?
>
> What exactly is "attr/something" ?

Anything that uses that proc_pid_attr_write().

Which you should have realized, since you wrote the patch that changed
that function to return -EAGAIN.

That's

    /proc/<pid>/attr/{current,exec,fscreate,keycreate,prev,sockcreate}

and some smack files.

Your patch definitely made them return -EINVAL if they happen in that
execve() black hole, instead of waiting for the execve() to just
complete and then just work.

Dropping a lock really is broken. It's broken even if you then set a
flag saying "I dropped the lock, now you can't use it".

                  Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30 13:47                                                                                           ` Linus Torvalds
@ 2020-04-30 14:29                                                                                             ` Bernd Edlinger
  2020-04-30 16:40                                                                                               ` Linus Torvalds
  0 siblings, 1 reply; 127+ messages in thread
From: Bernd Edlinger @ 2020-04-30 14:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

Hi Linus,

On 4/30/20 3:47 PM, Linus Torvalds wrote:
> On Thu, Apr 30, 2020 at 6:39 AM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> Excuse me, but in my /proc folder there is no attr/something.
>> Is there a procfs equivalent of pthread_attach?
>>
>> What exactly is "attr/something" ?
> 
> Anything that uses that proc_pid_attr_write().
> 
> Which you should have realized, since you wrote the patch that changed
> that function to return -EAGAIN.
> 

Ah, now I see. That was of course not the intended effect,
but that is not where the pseudo-deadlock happens at all.
Would returning -ERESTARTNOINTR in this function make this
patch acceptable? It will not have an effect on the test case.


Bernd.

> That's
> 
>     /proc/<pid>/attr/{current,exec,fscreate,keycreate,prev,sockcreate}
> 
> and some smack files.
> 
> Your patch definitely made them return -EINVAL if they happen in that
> execve() black hole, instead of waiting for the execve() to just
> complete and then just work.
> 
> Dropping a lock really is broken. It's broken even if you then set a
> flag saying "I dropped the lock, now you can't use it".
> 
>                   Linus
> 


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30 14:29                                                                                             ` Bernd Edlinger
@ 2020-04-30 16:40                                                                                               ` Linus Torvalds
  2020-05-02  4:11                                                                                                 ` Bernd Edlinger
  0 siblings, 1 reply; 127+ messages in thread
From: Linus Torvalds @ 2020-04-30 16:40 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On Thu, Apr 30, 2020 at 7:29 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> Ah, now I see. That was of course not the intended effect,
> but that is not where the pseudo-deadlock happens at all.
> Would returning -ERESTARTNOINTR in this function make this
> patch acceptable? It will not have an effect on the test case.

So that was why I suggested doing it all with a helper function, and
also doing that

        set_thread_flag(TIF_SIGPENDING);

because without going through the "check-for-signals" code at return
to user space, -ERESTARTNOINTR doesn't actually _do_ any restart.

However, the more I looked at it, the less I actually liked that hack.

Part of it is simply because it can cause the exact same problem that
ptrace() does (at least in theory). And even if you don't get the
livelock thing, you can get the "use 100% CPU time" thing, because if
that case ever triggers, and we re-try, it will generally just _keep_
on triggering (think "execve is waiting for a zombie, nobody is
reaping it").

IOW, restarting doesn't really fix the problem, or guarantee any
forward progress.
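
To make the lack of forward progress concrete, here is a small
user-space model, a sketch only and not kernel code. The struct, the
try_attr_write() stand-in and attempts_until_done() are all hypothetical
names; the only thing borrowed from the kernel is the internal
ERESTARTNOINTR value (513), which user space never sees. The model
re-issues the call every time it asks for a restart, the way the
signal-return path would, and caps the number of tries so it terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel-internal restart code; defined here only for the model. */
#define ERESTARTNOINTR 513

struct exec_state {
	bool zombie_reaped;	/* execve() can only finish once this is true */
};

/* Hypothetical stand-in for a syscall that bails out while execve() is stuck. */
static int try_attr_write(const struct exec_state *st)
{
	return st->zombie_reaped ? 0 : -ERESTARTNOINTR;
}

/*
 * Re-issue the call each time it asks to be restarted.
 * Returns the number of attempts needed, or -1 if the restart
 * budget ran out without the call ever succeeding.
 */
int attempts_until_done(const struct exec_state *st, int max_restarts)
{
	assert(max_restarts > 0);

	for (int i = 1; i <= max_restarts; i++) {
		if (try_attr_write(st) != -ERESTARTNOINTR)
			return i;
	}
	return -1;
}
```

With zombie_reaped set, the first attempt succeeds. With it clear,
nothing in the loop can change the state, so every restart fails the
same way and the caller only burns CPU until something else reaps
the zombie.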

So I'd have been ok with your "unsafe_exec_flag" if

 (a) it had been done in one place with a helper function.

 (b) it would _only_ trigger for ptrace (and perhaps seccomp).

but I don't think it works for that write() case.

That said, I'm not 100% convinced that that write() case really even
needs that cred_guard_mutex (renamed or not).

Maybe we can introduce a new mutex just against concurrent ptrace
(which is what at least the _comment_ says that
security_setprocattr() wants - I didn't check the actual low-level
security code).

So maybe that proc_pid_attr_write() case could be done some other way entirely.

The more we go through all this, the more I really think that Oleg's
patch to just delay the waiting for things until after dropping the
mutex in execve() is the way to go.

Is it a "simple" and small patch? No. But it really addresses the core
issue, without introducing new odd rules or special cases, or making a
lock that doesn't reliably work as a lock.

                      Linus


* Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
  2020-04-30 16:40                                                                                               ` Linus Torvalds
@ 2020-05-02  4:11                                                                                                 ` Bernd Edlinger
  0 siblings, 0 replies; 127+ messages in thread
From: Bernd Edlinger @ 2020-05-02  4:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Oleg Nesterov, Eric W. Biederman, Waiman Long,
	Ingo Molnar, Will Deacon, Linux Kernel Mailing List,
	Alexey Gladkov

On 4/30/20 6:40 PM, Linus Torvalds wrote:
> On Thu, Apr 30, 2020 at 7:29 AM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> Ah, now I see. That was of course not the intended effect,
>> but that is not where the pseudo-deadlock happens at all.
>> Would returning -ERESTARTNOINTR in this function make this
>> patch acceptable? It will not have an effect on the test case.
> 
> So that was why I suggested doing it all with a helper function, and
> also doing that
> 
>         set_thread_flag(TIF_SIGPENDING);
> 
> because without going through the "check-for-signals" code at return
> to user space, -ERESTARTNOINTR doesn't actually _do_ any restart.
> 
> However, the more I looked at it, the less I actually liked that hack.
> 
> Part of it is simply because it can cause the exact same problem that
> ptrace() does (at least in theory). And even if you don't get the
> livelock thing, you can get the "use 100% CPU time" thing, because if
> that case ever triggers, and we re-try, it will generally just _keep_
> on triggering (think "execve is waiting for a zombie, nobody is
> reaping it").
> 
> IOW, restarting doesn't really fix the problem, or guarantee any
> forward progress.
> 

Right, if it is a real time process it will result in priority-inversion.
Correct.

If it is a virus checker it will be real time priority and it will be
very aggressive ;-) I can feel its aggressiveness already :-) shiver...

And this little zombie-process will paralyze it immediately, nice try.

You see what I mean?

> So I'd have been ok with your "unsafe_exec_flag" if
> 
>  (a) it had been done in one place with a helper function.
> 
>  (b) it would _only_ trigger for ptrace (and perhaps seccomp).
> 
> but I don't think it works for that write() case.
> 
> That said, I'm not 100% convinced that that write() case really even
> needs that cred_guard_mutex (renamed or not).
> 
> Maybe we can introduce a new mutex just against concurrent ptrace
> (which is what at least the _comment_ says that
> security_setprocattr() wants - I didn't check the actual low-level
> security code).
> 
> So maybe that proc_pid_attr_write() case could be done some other way entirely.
> 
> The more we go through all this, the more I really think that Oleg's
> patch to just delay the waiting for things until after dropping the
> mutex in execve() is the way to go.
> 
> Is it a "simple" and small patch? No. But it really addresses the core
> issue, without introducing new odd rules or special cases, or making a
> lock that doesn't reliably work as a lock.
> 

Hmm.  I think I can agree that this problem deserves to be solved
really slowly.

Oleg, where was your last patch? Does it still work, or does it
need to be rebased?

And I almost forgot about Eric, are you still with us?


Thanks
Bernd.


