linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
@ 2019-07-17 17:21 Joel Fernandes (Google)
  2019-07-17 17:55 ` Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Joel Fernandes (Google) @ 2019-07-17 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Suren Baghdasaryan, kernel-team, Joel Fernandes,
	Andrea Arcangeli, Andrew Morton, Christian Brauner,
	Eric W. Biederman, Oleg Nesterov, Tejun Heo

From: Suren Baghdasaryan <surenb@google.com>

There is a race between reading task->exit_state in pidfd_poll and writing
it after do_notify_parent calls do_notify_pidfd. Expected sequence of
events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                  pidfd_poll
                                     if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                   pidfd_poll
                                      if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug,
and the below patch fixes it.

To fix this, we set tsk->exit_state before calling do_notify_pidfd.

Cc: kernel-team@android.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
 kernel/exit.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7f458a..740ceacb4b76 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (group_dead)
 		kill_orphaned_pgrp(tsk->group_leader, NULL);
 
+	tsk->exit_state = EXIT_ZOMBIE;
 	if (unlikely(tsk->ptrace)) {
 		int sig = thread_group_leader(tsk) &&
 				thread_group_empty(tsk) &&
@@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		ptrace_unlink(p);
 
 		/* If parent wants a zombie, don't release it now */
-		state = EXIT_ZOMBIE;
+		p->exit_state = EXIT_ZOMBIE;
 		if (do_notify_parent(p, p->exit_signal))
-			state = EXIT_DEAD;
-		p->exit_state = state;
+			p->exit_state = EXIT_DEAD;
+
+		state = p->exit_state;
 		write_unlock_irq(&tasklist_lock);
 	}
 	if (state == EXIT_DEAD)
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 17:21 [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling Joel Fernandes (Google)
@ 2019-07-17 17:55 ` Christian Brauner
  2019-07-17 18:09   ` Suren Baghdasaryan
  2019-07-17 20:51   ` Joel Fernandes
  2019-07-18 10:17 ` Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Christian Brauner @ 2019-07-17 17:55 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Suren Baghdasaryan, kernel-team, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Oleg Nesterov, Tejun Heo,
	jannh

On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> There is a race between reading task->exit_state in pidfd_poll and writing
> it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> events is:
> 
> CPU 0                            CPU 1
> ------------------------------------------------
> exit_notify
>   do_notify_parent
>     do_notify_pidfd
>   tsk->exit_state = EXIT_DEAD
>                                   pidfd_poll
>                                      if (tsk->exit_state)
> 
> However nothing prevents the following sequence:
> 
> CPU 0                            CPU 1
> ------------------------------------------------
> exit_notify
>   do_notify_parent
>     do_notify_pidfd
>                                    pidfd_poll
>                                       if (tsk->exit_state)
>   tsk->exit_state = EXIT_DEAD
> 
> This causes a polling task to wait forever, since poll blocks because
> exit_state is 0 and the waiting task is not notified again. A stress
> test continuously doing pidfd poll and process exits uncovered this bug,
> and the below patch fixes it.
> 
> To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> 
> Cc: kernel-team@android.com
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

That means in such a situation other users will see EXIT_ZOMBIE where
they didn't see that before until after the parent failed to get
notified.

That's a rather subtle internal change. I was worried about
__ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
at the point when we do set p->exit_signal.

Acked-by: Christian Brauner <christian@brauner.io>

Once Oleg confirms that I'm right not to worty I'll pick this up.

Thanks!
Christian

> 
> ---
>  kernel/exit.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a75b6a7f458a..740ceacb4b76 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	if (group_dead)
>  		kill_orphaned_pgrp(tsk->group_leader, NULL);
>  
> +	tsk->exit_state = EXIT_ZOMBIE;
>  	if (unlikely(tsk->ptrace)) {
>  		int sig = thread_group_leader(tsk) &&
>  				thread_group_empty(tsk) &&
> @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
>  		ptrace_unlink(p);
>  
>  		/* If parent wants a zombie, don't release it now */
> -		state = EXIT_ZOMBIE;
> +		p->exit_state = EXIT_ZOMBIE;
>  		if (do_notify_parent(p, p->exit_signal))
> -			state = EXIT_DEAD;
> -		p->exit_state = state;
> +			p->exit_state = EXIT_DEAD;
> +
> +		state = p->exit_state;
>  		write_unlock_irq(&tasklist_lock);
>  	}
>  	if (state == EXIT_DEAD)
> -- 
> 2.22.0.657.g960e92d24f-goog
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 17:55 ` Christian Brauner
@ 2019-07-17 18:09   ` Suren Baghdasaryan
  2019-07-17 20:47     ` Joel Fernandes
  2019-07-17 20:51   ` Joel Fernandes
  1 sibling, 1 reply; 14+ messages in thread
From: Suren Baghdasaryan @ 2019-07-17 18:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes (Google),
	LKML, kernel-team, Andrea Arcangeli, Andrew Morton,
	Eric W. Biederman, Oleg Nesterov, Tejun Heo, Jann Horn

On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> >
> > There is a race between reading task->exit_state in pidfd_poll and writing
> > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > events is:
> >
> > CPU 0                            CPU 1
> > ------------------------------------------------
> > exit_notify
> >   do_notify_parent
> >     do_notify_pidfd
> >   tsk->exit_state = EXIT_DEAD
> >                                   pidfd_poll
> >                                      if (tsk->exit_state)
> >
> > However nothing prevents the following sequence:
> >
> > CPU 0                            CPU 1
> > ------------------------------------------------
> > exit_notify
> >   do_notify_parent
> >     do_notify_pidfd
> >                                    pidfd_poll
> >                                       if (tsk->exit_state)
> >   tsk->exit_state = EXIT_DEAD
> >
> > This causes a polling task to wait forever, since poll blocks because
> > exit_state is 0 and the waiting task is not notified again. A stress
> > test continuously doing pidfd poll and process exits uncovered this bug,
> > and the below patch fixes it.
> >
> > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> >
> > Cc: kernel-team@android.com
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> That means in such a situation other users will see EXIT_ZOMBIE where
> they didn't see that before until after the parent failed to get
> notified.

I'm a little nervous about that myself even though in my stress
testing this worked fine. I think the safest change would be to move
do_notify_pidfd() out of do_notify_parent() and call it after
tsk->exit_state is finalized. The downside of that is that there are 4
places we call do_notify_parent(), so instead of calling
do_notify_pidfd() one time from do_notify_parent() we will be calling
it 4 times now.

Also my original patch had memory barriers to ensure correct ordering
of tsk->exit_state writes before reads. In this final version Joel
removed them, so I suppose he found out they are not needed. Joel,
please clarify.
Thanks!

> That's a rather subtle internal change. I was worried about
> __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> at the point when we do set p->exit_signal.
>
> Acked-by: Christian Brauner <christian@brauner.io>
>
> Once Oleg confirms that I'm right not to worty I'll pick this up.
>
> Thanks!
> Christian
>
> >
> > ---
> >  kernel/exit.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index a75b6a7f458a..740ceacb4b76 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >       if (group_dead)
> >               kill_orphaned_pgrp(tsk->group_leader, NULL);
> >
> > +     tsk->exit_state = EXIT_ZOMBIE;
> >       if (unlikely(tsk->ptrace)) {
> >               int sig = thread_group_leader(tsk) &&
> >                               thread_group_empty(tsk) &&
> > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> >               ptrace_unlink(p);
> >
> >               /* If parent wants a zombie, don't release it now */
> > -             state = EXIT_ZOMBIE;
> > +             p->exit_state = EXIT_ZOMBIE;
> >               if (do_notify_parent(p, p->exit_signal))
> > -                     state = EXIT_DEAD;
> > -             p->exit_state = state;
> > +                     p->exit_state = EXIT_DEAD;
> > +
> > +             state = p->exit_state;
> >               write_unlock_irq(&tasklist_lock);
> >       }
> >       if (state == EXIT_DEAD)
> > --
> > 2.22.0.657.g960e92d24f-goog
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 18:09   ` Suren Baghdasaryan
@ 2019-07-17 20:47     ` Joel Fernandes
  2019-07-18 10:09       ` Christian Brauner
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2019-07-17 20:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Christian Brauner, LKML, kernel-team, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Oleg Nesterov, Tejun Heo,
	Jann Horn

On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > > From: Suren Baghdasaryan <surenb@google.com>
> > >
> > > There is a race between reading task->exit_state in pidfd_poll and writing
> > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > > events is:
> > >
> > > CPU 0                            CPU 1
> > > ------------------------------------------------
> > > exit_notify
> > >   do_notify_parent
> > >     do_notify_pidfd
> > >   tsk->exit_state = EXIT_DEAD
> > >                                   pidfd_poll
> > >                                      if (tsk->exit_state)
> > >
> > > However nothing prevents the following sequence:
> > >
> > > CPU 0                            CPU 1
> > > ------------------------------------------------
> > > exit_notify
> > >   do_notify_parent
> > >     do_notify_pidfd
> > >                                    pidfd_poll
> > >                                       if (tsk->exit_state)
> > >   tsk->exit_state = EXIT_DEAD
> > >
> > > This causes a polling task to wait forever, since poll blocks because
> > > exit_state is 0 and the waiting task is not notified again. A stress
> > > test continuously doing pidfd poll and process exits uncovered this bug,
> > > and the below patch fixes it.
> > >
> > > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> > >
> > > Cc: kernel-team@android.com
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > That means in such a situation other users will see EXIT_ZOMBIE where
> > they didn't see that before until after the parent failed to get
> > notified.
> 
> I'm a little nervous about that myself even though in my stress
> testing this worked fine. I think the safest change would be to move
> do_notify_pidfd() out of do_notify_parent() and call it after
> tsk->exit_state is finalized. The downside of that is that there are 4

My initial approach to pidfd polling did it this way, and I remember there
was a break in semantics where this does not work well. We want the
notification to happen in do_notify_parent() so that it is in sync with the
wait APIs..

I don't see a risk with this patch though. But let us see what Oleg's eyes
find.

> places we call do_notify_parent(), so instead of calling
> do_notify_pidfd() one time from do_notify_parent() we will be calling
> it 4 times now.
> 
> Also my original patch had memory barriers to ensure correct ordering
> of tsk->exit_state writes before reads. In this final version Joel
> removed them, so I suppose he found out they are not needed. Joel,
> please clarify.

The barriers were initially add by me to your patch, but then I felt it may
not be needed so I removed them before sending the patch. My initial concern
was something like the following:

CPU 0                      CPU 1
------------------------------------------------
store tsk->exit_state = 1
/* smp_wmb() ? */
do_notify_parent
wake up
                           poll_wait()
                           /* smp_rmb(); ? */
                           read tsk->exit_state = 0
                           block...


I was initially concerned if tsk->exit_state write would be missed by the
polling thread and we would block forever (similar to this bug).

I don't think this is possible anymore since wakeup implies release-barrier
and waiting implies acquire barrier AFAIU. I am still not fully sure though,
so yeah if you guys think it is an issue, let us add the memory barriers. As
such I know memory barrier additions to the kernel requires justification,
otherwise Linus calls it "Voodoo programming". So let us convince ourself
first if memory barriers are needed before adding them anyway.

thanks,

 - Joel




> Thanks!
> 
> > That's a rather subtle internal change. I was worried about
> > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> > seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> > at the point when we do set p->exit_signal.
> >
> > Acked-by: Christian Brauner <christian@brauner.io>
> >
> > Once Oleg confirms that I'm right not to worty I'll pick this up.
> >
> > Thanks!
> > Christian
> >
> > >
> > > ---
> > >  kernel/exit.c | 8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > index a75b6a7f458a..740ceacb4b76 100644
> > > --- a/kernel/exit.c
> > > +++ b/kernel/exit.c
> > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >       if (group_dead)
> > >               kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > +     tsk->exit_state = EXIT_ZOMBIE;
> > >       if (unlikely(tsk->ptrace)) {
> > >               int sig = thread_group_leader(tsk) &&
> > >                               thread_group_empty(tsk) &&
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > >               ptrace_unlink(p);
> > >
> > >               /* If parent wants a zombie, don't release it now */
> > > -             state = EXIT_ZOMBIE;
> > > +             p->exit_state = EXIT_ZOMBIE;
> > >               if (do_notify_parent(p, p->exit_signal))
> > > -                     state = EXIT_DEAD;
> > > -             p->exit_state = state;
> > > +                     p->exit_state = EXIT_DEAD;
> > > +
> > > +             state = p->exit_state;
> > >               write_unlock_irq(&tasklist_lock);
> > >       }
> > >       if (state == EXIT_DEAD)
> > > --
> > > 2.22.0.657.g960e92d24f-goog
> > >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 17:55 ` Christian Brauner
  2019-07-17 18:09   ` Suren Baghdasaryan
@ 2019-07-17 20:51   ` Joel Fernandes
  1 sibling, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2019-07-17 20:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Suren Baghdasaryan, kernel-team, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Oleg Nesterov, Tejun Heo,
	jannh

On Wed, Jul 17, 2019 at 07:55:57PM +0200, Christian Brauner wrote:
> On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> > 
> > There is a race between reading task->exit_state in pidfd_poll and writing
> > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > events is:
> > 
> > CPU 0                            CPU 1
> > ------------------------------------------------
> > exit_notify
> >   do_notify_parent
> >     do_notify_pidfd
> >   tsk->exit_state = EXIT_DEAD
> >                                   pidfd_poll
> >                                      if (tsk->exit_state)
> > 
> > However nothing prevents the following sequence:
> > 
> > CPU 0                            CPU 1
> > ------------------------------------------------
> > exit_notify
> >   do_notify_parent
> >     do_notify_pidfd
> >                                    pidfd_poll
> >                                       if (tsk->exit_state)
> >   tsk->exit_state = EXIT_DEAD
> > 
> > This causes a polling task to wait forever, since poll blocks because
> > exit_state is 0 and the waiting task is not notified again. A stress
> > test continuously doing pidfd poll and process exits uncovered this bug,
> > and the below patch fixes it.
> > 
> > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> > 
> > Cc: kernel-team@android.com
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> That means in such a situation other users will see EXIT_ZOMBIE where
> they didn't see that before until after the parent failed to get
> notified.
> 
> That's a rather subtle internal change. I was worried about
> __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> at the point when we do set p->exit_signal.

Right.

> Acked-by: Christian Brauner <christian@brauner.io>

Thanks.

> Once Oleg confirms that I'm right not to worty I'll pick this up.

Ok.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 20:47     ` Joel Fernandes
@ 2019-07-18 10:09       ` Christian Brauner
  0 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2019-07-18 10:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Suren Baghdasaryan, LKML, kernel-team, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Oleg Nesterov, Tejun Heo,
	Jann Horn

On Wed, Jul 17, 2019 at 04:47:58PM -0400, Joel Fernandes wrote:
> On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > > > From: Suren Baghdasaryan <surenb@google.com>
> > > >
> > > > There is a race between reading task->exit_state in pidfd_poll and writing
> > > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > > > events is:
> > > >
> > > > CPU 0                            CPU 1
> > > > ------------------------------------------------
> > > > exit_notify
> > > >   do_notify_parent
> > > >     do_notify_pidfd
> > > >   tsk->exit_state = EXIT_DEAD
> > > >                                   pidfd_poll
> > > >                                      if (tsk->exit_state)
> > > >
> > > > However nothing prevents the following sequence:
> > > >
> > > > CPU 0                            CPU 1
> > > > ------------------------------------------------
> > > > exit_notify
> > > >   do_notify_parent
> > > >     do_notify_pidfd
> > > >                                    pidfd_poll
> > > >                                       if (tsk->exit_state)
> > > >   tsk->exit_state = EXIT_DEAD
> > > >
> > > > This causes a polling task to wait forever, since poll blocks because
> > > > exit_state is 0 and the waiting task is not notified again. A stress
> > > > test continuously doing pidfd poll and process exits uncovered this bug,
> > > > and the below patch fixes it.
> > > >
> > > > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> > > >
> > > > Cc: kernel-team@android.com
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >
> > > That means in such a situation other users will see EXIT_ZOMBIE where
> > > they didn't see that before until after the parent failed to get
> > > notified.
> > 
> > I'm a little nervous about that myself even though in my stress
> > testing this worked fine. I think the safest change would be to move
> > do_notify_pidfd() out of do_notify_parent() and call it after
> > tsk->exit_state is finalized. The downside of that is that there are 4
> 
> My initial approach to pidfd polling did it this way, and I remember there
> was a break in semantics where this does not work well. We want the
> notification to happen in do_notify_parent() so that it is in sync with the
> wait APIs..
> 
> I don't see a risk with this patch though. But let us see what Oleg's eyes
> find.

I've been going through the various codepaths and that change should be
fine. The places I looked at that worried me were release_task(),
reparent_leader(), wait_consider_task() and their callers.
But all of these either take read_lock(&tasklist_lock) or
write_lock_irq(&tasklist_lock) themselves or are called with them held,
same for ptrace_attach() and ptrace_detach(). And the whole sequence
that switches to autoreaping when the parent ingores SIGCHLD in
do_notify_parent() and wait_task_zombie() is under
write_lock_irq(&tasklist_lock) as well so setting it to EXIT_ZOMBIE
before do_notify_parent() and switching it to EXIT_DEAD when the parent
ignores SIGCHLD should be safe.
If we missed a sublety then we'll know pretty soon I'm sure.

I'll pick this up now. We'll have some time anyway.

> 
> > places we call do_notify_parent(), so instead of calling
> > do_notify_pidfd() one time from do_notify_parent() we will be calling
> > it 4 times now.
> > 
> > Also my original patch had memory barriers to ensure correct ordering
> > of tsk->exit_state writes before reads. In this final version Joel
> > removed them, so I suppose he found out they are not needed. Joel,
> > please clarify.
> 
> The barriers were initially add by me to your patch, but then I felt it may
> not be needed so I removed them before sending the patch. My initial concern
> was something like the following:
> 
> CPU 0                      CPU 1
> ------------------------------------------------
> store tsk->exit_state = 1
> /* smp_wmb() ? */
> do_notify_parent
> wake up
>                            poll_wait()
>                            /* smp_rmb(); ? */
>                            read tsk->exit_state = 0
>                            block...
> 
> 
> I was initially concerned if tsk->exit_state write would be missed by the
> polling thread and we would block forever (similar to this bug).
> 
> I don't think this is possible anymore since wakeup implies release-barrier

wake_up_all() which is used in do_notify_pidfd() implies a general
memory barrier if something is actually woken up.

> and waiting implies acquire barrier AFAIU. I am still not fully sure though,

poll_wait() when used with eventpoll hits add_wait_queue which takes
spin_lock_irqsave() which implies an acquire barrier if I remember
memory_barriers right.

> so yeah if you guys think it is an issue, let us add the memory barriers. As
> such I know memory barrier additions to the kernel requires justification,
> otherwise Linus calls it "Voodoo programming". So let us convince ourself
> first if memory barriers are needed before adding them anyway.

I didn't see it as an issue either.

> 
> thanks,
> 
>  - Joel
> 
> 
> 
> 
> > Thanks!
> > 
> > > That's a rather subtle internal change. I was worried about
> > > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> > > seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> > > at the point when we do set p->exit_signal.
> > >
> > > Acked-by: Christian Brauner <christian@brauner.io>
> > >
> > > Once Oleg confirms that I'm right not to worty I'll pick this up.
> > >
> > > Thanks!
> > > Christian
> > >
> > > >
> > > > ---
> > > >  kernel/exit.c | 8 +++++---
> > > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > > index a75b6a7f458a..740ceacb4b76 100644
> > > > --- a/kernel/exit.c
> > > > +++ b/kernel/exit.c
> > > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > >       if (group_dead)
> > > >               kill_orphaned_pgrp(tsk->group_leader, NULL);
> > > >
> > > > +     tsk->exit_state = EXIT_ZOMBIE;
> > > >       if (unlikely(tsk->ptrace)) {
> > > >               int sig = thread_group_leader(tsk) &&
> > > >                               thread_group_empty(tsk) &&
> > > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > > >               ptrace_unlink(p);
> > > >
> > > >               /* If parent wants a zombie, don't release it now */
> > > > -             state = EXIT_ZOMBIE;
> > > > +             p->exit_state = EXIT_ZOMBIE;
> > > >               if (do_notify_parent(p, p->exit_signal))
> > > > -                     state = EXIT_DEAD;
> > > > -             p->exit_state = state;
> > > > +                     p->exit_state = EXIT_DEAD;
> > > > +
> > > > +             state = p->exit_state;
> > > >               write_unlock_irq(&tasklist_lock);
> > > >       }
> > > >       if (state == EXIT_DEAD)
> > > > --
> > > > 2.22.0.657.g960e92d24f-goog
> > > >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 17:21 [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling Joel Fernandes (Google)
  2019-07-17 17:55 ` Christian Brauner
@ 2019-07-18 10:17 ` Christian Brauner
  2019-07-18 16:05   ` Suren Baghdasaryan
  2019-07-18 15:00 ` Oleg Nesterov
  2019-07-19 16:14 ` Oleg Nesterov
  3 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2019-07-18 10:17 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Suren Baghdasaryan, kernel-team, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Oleg Nesterov, Tejun Heo

On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> There is a race between reading task->exit_state in pidfd_poll and writing
> it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> events is:
> 
> CPU 0                            CPU 1
> ------------------------------------------------
> exit_notify
>   do_notify_parent
>     do_notify_pidfd
>   tsk->exit_state = EXIT_DEAD
>                                   pidfd_poll
>                                      if (tsk->exit_state)
> 
> However nothing prevents the following sequence:
> 
> CPU 0                            CPU 1
> ------------------------------------------------
> exit_notify
>   do_notify_parent
>     do_notify_pidfd
>                                    pidfd_poll
>                                       if (tsk->exit_state)
>   tsk->exit_state = EXIT_DEAD
> 
> This causes a polling task to wait forever, since poll blocks because
> exit_state is 0 and the waiting task is not notified again. A stress
> test continuously doing pidfd poll and process exits uncovered this bug,

Btw, if that stress test is in any way upstreamable I'd like to put this
into for-next as well. :)

Christian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 17:21 [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling Joel Fernandes (Google)
  2019-07-17 17:55 ` Christian Brauner
  2019-07-18 10:17 ` Christian Brauner
@ 2019-07-18 15:00 ` Oleg Nesterov
  2019-07-19 16:14 ` Oleg Nesterov
  3 siblings, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2019-07-18 15:00 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Suren Baghdasaryan, kernel-team, Andrea Arcangeli,
	Andrew Morton, Christian Brauner, Eric W. Biederman, Tejun Heo

On 07/17, Joel Fernandes (Google) wrote:
>
> exit_notify
>   do_notify_parent
>     do_notify_pidfd
>   tsk->exit_state = EXIT_DEAD

OOPS. Somehow I thought it sets exit_state before do_notify_parent(),
I didn't even bother to verify this when I reviewed pidfd_poll()...

Sorry, can't read the patch today, will do tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-18 10:17 ` Christian Brauner
@ 2019-07-18 16:05   ` Suren Baghdasaryan
  0 siblings, 0 replies; 14+ messages in thread
From: Suren Baghdasaryan @ 2019-07-18 16:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes (Google),
	LKML, kernel-team, Andrea Arcangeli, Andrew Morton,
	Eric W. Biederman, Oleg Nesterov, Tejun Heo

On Thu, Jul 18, 2019 at 3:17 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> >
> > There is a race between reading task->exit_state in pidfd_poll and writing
> > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > events is:
> >
> > CPU 0                            CPU 1
> > ------------------------------------------------
> > exit_notify
> >   do_notify_parent
> >     do_notify_pidfd
> >   tsk->exit_state = EXIT_DEAD
> >                                   pidfd_poll
> >                                      if (tsk->exit_state)
> >
> > However nothing prevents the following sequence:
> >
> > CPU 0                            CPU 1
> > ------------------------------------------------
> > exit_notify
> >   do_notify_parent
> >     do_notify_pidfd
> >                                    pidfd_poll
> >                                       if (tsk->exit_state)
> >   tsk->exit_state = EXIT_DEAD
> >
> > This causes a polling task to wait forever, since poll blocks because
> > exit_state is 0 and the waiting task is not notified again. A stress
> > test continuously doing pidfd poll and process exits uncovered this bug,
>
> Btw, if that stress test is in any way upstreamable I'd like to put this
> into for-next as well. :)

Definitely. I'll work with Joel on making it upstreamable and posting
as a separate patch.

>
> Christian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-17 17:21 [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2019-07-18 15:00 ` Oleg Nesterov
@ 2019-07-19 16:14 ` Oleg Nesterov
  2019-07-19 16:27   ` Christian Brauner
  3 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2019-07-19 16:14 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Suren Baghdasaryan, kernel-team, Andrea Arcangeli,
	Andrew Morton, Christian Brauner, Eric W. Biederman, Tejun Heo

it seems that I missed something else...

On 07/17, Joel Fernandes (Google) wrote:
>
> @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
>  		ptrace_unlink(p);
>  
>  		/* If parent wants a zombie, don't release it now */
> -		state = EXIT_ZOMBIE;
> +		p->exit_state = EXIT_ZOMBIE;
>  		if (do_notify_parent(p, p->exit_signal))
> -			state = EXIT_DEAD;
> -		p->exit_state = state;
> +			p->exit_state = EXIT_DEAD;
> +
> +		state = p->exit_state;
>  		write_unlock_irq(&tasklist_lock);

why do you think we also need to change wait_task_zombie() ?

pidfd_poll() only needs the exit_state != 0 check, we know that it
is not zero at this point. Why do we need to change exit_state before
do_notify_parent() ?

Oleg.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-19 16:14 ` Oleg Nesterov
@ 2019-07-19 16:27   ` Christian Brauner
  2019-07-19 16:33     ` Suren Baghdasaryan
  2019-07-19 16:51     ` Joel Fernandes
  0 siblings, 2 replies; 14+ messages in thread
From: Christian Brauner @ 2019-07-19 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Joel Fernandes (Google),
	linux-kernel, Suren Baghdasaryan, kernel-team, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Tejun Heo

On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
> it seems that I missed something else...
> 
> On 07/17, Joel Fernandes (Google) wrote:
> >
> > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> >  		ptrace_unlink(p);
> >  
> >  		/* If parent wants a zombie, don't release it now */
> > -		state = EXIT_ZOMBIE;
> > +		p->exit_state = EXIT_ZOMBIE;
> >  		if (do_notify_parent(p, p->exit_signal))
> > -			state = EXIT_DEAD;
> > -		p->exit_state = state;
> > +			p->exit_state = EXIT_DEAD;
> > +
> > +		state = p->exit_state;
> >  		write_unlock_irq(&tasklist_lock);
> 
> why do you think we also need to change wait_task_zombie() ?
> 
> pidfd_poll() only needs the exit_state != 0 check, we know that it
> is not zero at this point. Why do we need to change exit_state before
> do_notify_parent() ?

Oh, because of?:

	/*
	 * Move the task's state to DEAD/TRACE, only one thread can do this.
	 */
	state = (ptrace_reparented(p) && thread_group_leader(p)) ?
		EXIT_TRACE : EXIT_DEAD;
	if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
		return 0;

So exit_state will definitely be set in this scenario. Good point.

Christian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-19 16:27   ` Christian Brauner
@ 2019-07-19 16:33     ` Suren Baghdasaryan
  2019-07-19 16:51     ` Joel Fernandes
  1 sibling, 0 replies; 14+ messages in thread
From: Suren Baghdasaryan @ 2019-07-19 16:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Joel Fernandes (Google),
	LKML, kernel-team, Andrea Arcangeli, Andrew Morton,
	Eric W. Biederman, Tejun Heo

On Fri, Jul 19, 2019 at 9:27 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
> > it seems that I missed something else...
> >
> > On 07/17, Joel Fernandes (Google) wrote:
> > >
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > >             ptrace_unlink(p);
> > >
> > >             /* If parent wants a zombie, don't release it now */
> > > -           state = EXIT_ZOMBIE;
> > > +           p->exit_state = EXIT_ZOMBIE;
> > >             if (do_notify_parent(p, p->exit_signal))
> > > -                   state = EXIT_DEAD;
> > > -           p->exit_state = state;
> > > +                   p->exit_state = EXIT_DEAD;
> > > +
> > > +           state = p->exit_state;
> > >             write_unlock_irq(&tasklist_lock);
> >
> > why do you think we also need to change wait_task_zombie() ?
> >
> > pidfd_poll() only needs the exit_state != 0 check, we know that it
> > is not zero at this point. Why do we need to change exit_state before
> > do_notify_parent() ?
>
> Oh, because of?:
>
>         /*
>          * Move the task's state to DEAD/TRACE, only one thread can do this.
>          */
>         state = (ptrace_reparented(p) && thread_group_leader(p)) ?
>                 EXIT_TRACE : EXIT_DEAD;
>         if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
>                 return 0;
>
> So exit_state will definitely be set in this scenario. Good point.
>

Yes, I think you are right. AFAIU in this code path p->exit_state
should always be equal to EXIT_TRACE because of the earlier cmpxchg()
call and the if condition before do_notify_parent(). That's of course
unless there is a chance that p->exit_state gets changed by some other
thread after cmpxchg() call and before do_notify_parent()... I'm not
that familiar with this code to say for sure that it's impossible. If
that can't happen I think we can remove this one but the change in
exit_notify() should definitely stay.
Thanks,
Suren.

> Christian
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-19 16:27   ` Christian Brauner
  2019-07-19 16:33     ` Suren Baghdasaryan
@ 2019-07-19 16:51     ` Joel Fernandes
  2019-07-19 16:53       ` Christian Brauner
  1 sibling, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2019-07-19 16:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Joel Fernandes (Google),
	LKML, Suren Baghdasaryan, Cc: Android Kernel, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Tejun Heo

On Fri, Jul 19, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
> > it seems that I missed something else...
> >
> > On 07/17, Joel Fernandes (Google) wrote:
> > >
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > >             ptrace_unlink(p);
> > >
> > >             /* If parent wants a zombie, don't release it now */
> > > -           state = EXIT_ZOMBIE;
> > > +           p->exit_state = EXIT_ZOMBIE;
> > >             if (do_notify_parent(p, p->exit_signal))
> > > -                   state = EXIT_DEAD;
> > > -           p->exit_state = state;
> > > +                   p->exit_state = EXIT_DEAD;
> > > +
> > > +           state = p->exit_state;
> > >             write_unlock_irq(&tasklist_lock);
> >
> > why do you think we also need to change wait_task_zombie() ?
> >
> > pidfd_poll() only needs the exit_state != 0 check, we know that it
> > is not zero at this point. Why do we need to change exit_state before
> > do_notify_parent() ?
>
> Oh, because of?:
>
>         /*
>          * Move the task's state to DEAD/TRACE, only one thread can do this.
>          */
>         state = (ptrace_reparented(p) && thread_group_leader(p)) ?
>                 EXIT_TRACE : EXIT_DEAD;
>         if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
>                 return 0;
>
> So exit_state will definitely be set in this scenario. Good point.
>

Agreed. Christian, do you mind dropping this hunk from the patch or do
you want me to resend the patch with the hunk dropped?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling
  2019-07-19 16:51     ` Joel Fernandes
@ 2019-07-19 16:53       ` Christian Brauner
  0 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2019-07-19 16:53 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Oleg Nesterov, Joel Fernandes (Google),
	LKML, Suren Baghdasaryan, Cc: Android Kernel, Andrea Arcangeli,
	Andrew Morton, Eric W. Biederman, Tejun Heo

On July 19, 2019 6:51:20 PM GMT+02:00, Joel Fernandes <joelaf@google.com> wrote:
>On Fri, Jul 19, 2019 at 12:27 PM Christian Brauner
><christian@brauner.io> wrote:
>>
>> On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
>> > it seems that I missed something else...
>> >
>> > On 07/17, Joel Fernandes (Google) wrote:
>> > >
>> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct
>wait_opts *wo, struct task_struct *p)
>> > >             ptrace_unlink(p);
>> > >
>> > >             /* If parent wants a zombie, don't release it now */
>> > > -           state = EXIT_ZOMBIE;
>> > > +           p->exit_state = EXIT_ZOMBIE;
>> > >             if (do_notify_parent(p, p->exit_signal))
>> > > -                   state = EXIT_DEAD;
>> > > -           p->exit_state = state;
>> > > +                   p->exit_state = EXIT_DEAD;
>> > > +
>> > > +           state = p->exit_state;
>> > >             write_unlock_irq(&tasklist_lock);
>> >
>> > why do you think we also need to change wait_task_zombie() ?
>> >
>> > pidfd_poll() only needs the exit_state != 0 check, we know that it
>> > is not zero at this point. Why do we need to change exit_state
>before
>> > do_notify_parent() ?
>>
>> Oh, because of?:
>>
>>         /*
>>          * Move the task's state to DEAD/TRACE, only one thread can
>do this.
>>          */
>>         state = (ptrace_reparented(p) && thread_group_leader(p)) ?
>>                 EXIT_TRACE : EXIT_DEAD;
>>         if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) !=
>EXIT_ZOMBIE)
>>                 return 0;
>>
>> So exit_state will definitely be set in this scenario. Good point.
>>
>
>Agreed. Christian, do you mind dropping this hunk from the patch or do
>you want me to resend the patch with the hunk dropped?

Yeah, no problem. :)

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-07-19 16:53 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-17 17:21 [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling Joel Fernandes (Google)
2019-07-17 17:55 ` Christian Brauner
2019-07-17 18:09   ` Suren Baghdasaryan
2019-07-17 20:47     ` Joel Fernandes
2019-07-18 10:09       ` Christian Brauner
2019-07-17 20:51   ` Joel Fernandes
2019-07-18 10:17 ` Christian Brauner
2019-07-18 16:05   ` Suren Baghdasaryan
2019-07-18 15:00 ` Oleg Nesterov
2019-07-19 16:14 ` Oleg Nesterov
2019-07-19 16:27   ` Christian Brauner
2019-07-19 16:33     ` Suren Baghdasaryan
2019-07-19 16:51     ` Joel Fernandes
2019-07-19 16:53       ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).