linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] namespaces: fix leak on fork() failure
@ 2012-04-28  9:19 Mike Galbraith
  2012-04-28 14:26 ` Oleg Nesterov
  2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
  0 siblings, 2 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-04-28  9:19 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 2448 bytes --]

Greetings,

The attached testcase induces quite a bit of pid/mnt namespace leakage.
The below fixes up one of these leaks.  There's still at least one pid
namespace leak left, namely that the final put_pid() in softirq context
goes missing.

A trace of the leak that's left shows... 
vsftpd-5055  [003] ....  3921.490806: proc_set_super: get_pid_ns: 0xffff8801c996e988 count:1->2
vsftpd-5055  [003] ....  3921.490823: alloc_pid: get_pid_ns: 0xffff8801c996e988 count:2->3
vsftpd-5102  [003] ....  3921.502565: switch_task_namespaces: exiting: 0xffff8801c996e988 count:3
vsftpd-5102  [003] ....  3921.522296: free_nsproxy: put_pid_ns: 0xffff8801c996e988 count:3->2
vsftpd-5055  [003] ....  3921.574201: proc_kill_sb: put_pid_ns: 0xffff8801c996e988 count:2->1

..but that should be..

vsftpd-5055  [003] ....  3921.497313: proc_set_super: get_pid_ns: 0xffff8801c6e65ff0 count:1->2
vsftpd-5055  [003] ....  3921.497330: alloc_pid: get_pid_ns: 0xffff8801c6e65ff0 count:2->3
vsftpd-5124  [003] ....  3921.502977: switch_task_namespaces: exiting: 0xffff8801c6e65ff0 count:3
vsftpd-5124  [003] ....  3921.522308: free_nsproxy: put_pid_ns: 0xffff8801c6e65ff0 count:3->2
vsftpd-5055  [003] ....  3921.698349: proc_kill_sb: put_pid_ns: 0xffff8801c6e65ff0 count:2->1
ksoftirqd/3-16    [003] ..s.  3921.702182: put_pid: put_pid_ns: 0xffff8801c6e65ff0 count:1->0

Anyway, here's what I did for one of the little buggers.

SIGCHLD delivery during fork() may cause failure, resulting in the aborted
child being cloned with CLONE_NEWPID leaking namespaces due to proc being
mounted during pid namespace creation, but not unmounted on fork() failure.

Call pid_ns_release_proc() to prevent the leaks.

Signed-off-by: Mike Galbraith <efault@gmx.de>
 
 kernel/nsproxy.c |    8 ++++++++
 1 file changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b576f7f..fd751d3 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 	rcu_assign_pointer(p->nsproxy, new);
 
 	if (ns && atomic_dec_and_test(&ns->count)) {
+		/* Handle fork() failure, unmount proc before proceeding */
+		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
+			struct pid_namespace *pid_ns = ns->pid_ns;
+
+			if (pid_ns && pid_ns != &init_pid_ns)
+				pid_ns_release_proc(pid_ns);
+		}
+
 		/*
 		 * wait for others to get what they want from this nsproxy.
 		 *


[-- Attachment #2: vsftpd.c --]
[-- Type: text/x-csrc, Size: 2181 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sched.h>
#include <linux/sched.h>
#include <unistd.h>
#include <sys/syscall.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <string.h>

#if !defined(WITH_SIGCHLD)
#define WITH_SIGCHLD 1
#endif

#if WITH_SIGCHLD == 1
/*
 * vsftpd 
 * sysutil.c vsf_sysutil_wait_reap_one()
 * standalone.c handle_sigchld()
 * 
 */
int vsf_sysutil_wait_reap_one(void)
{
    int retval = waitpid(-1, NULL, WNOHANG);
    if (retval == 0 || (retval < 0 && errno == ECHILD)) {
        /* No more children */
        return 0;
    }
    if (retval < 0) {
        perror("waitpid");
        exit(EXIT_FAILURE);
    }
    /* Got one */
    return retval;
}

int received;
int reaped;

void handle_sigchld(int sig)
{
    unsigned int reap_one = 1;

    received++;
    while (reap_one) {
        reap_one = (unsigned int) vsf_sysutil_wait_reap_one();
        if (reap_one)
            reaped++;
    }
}
#endif

int zombies;

int main(int argc, char *argv[])
{
    int i, ret;

#if WITH_SIGCHLD == 1
    /*
     * vsftpd sysutil.c vsf_sysutil_set_sighandler()
     */
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigchld;
    if (-1 == sigfillset(&sa.sa_mask)) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }
    if (-1 == sigaction(SIGCHLD, &sa, NULL)) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }
    fprintf(stderr, "SIGCHLD handler enabled\n");
#else
    fprintf(stderr, "SIGCHLD handler not enabled\n");
#endif

    for (i = 0; i < 100; i++) {

//        if (0 == (ret = syscall(__NR_clone, CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER | SIGCHLD, NULL)))
        if (0 == (ret = syscall(__NR_clone, CLONE_NEWPID | SIGCHLD, NULL)))
            return 0;

        if (-1 == ret) {
            perror("clone");
            exit(EXIT_FAILURE);
        }

    }
#if 1
    while (1) {
        int res = waitpid(-1, NULL, WNOHANG);
        if (res < 0)
            break;
        if (!res)
            continue;
        zombies++;
    }
//    printf("received %d signals, reaped %d - %d zombies left\n", received, reaped, zombies);
//    sleep(1);
#endif
    return 0;
}

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
@ 2012-04-28 14:26 ` Oleg Nesterov
  2012-04-29  4:13   ` Mike Galbraith
  2012-04-29  7:57   ` Eric W. Biederman
  2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
  1 sibling, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-04-28 14:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Pavel Emelyanov, Cyrill Gorcunov, Eric W. Biederman, Louis Rilling

On 04/28, Mike Galbraith wrote:
>
> Greetings,

Hi,

Add CC's. I never understood the proc/namespace interaction in detail,
and it seems to me I forgot everything.

> SIGCHLD delivery during fork() may cause failure,

Or any other reason to fail after copy_namespaces()

> resulting in the aborted
> child being cloned with CLONE_NEWPID leaking namespaces due to proc being
> mounted during pid namespace creation, but not unmounted on fork() failure.

Heh. Please look at http://marc.info/?l=linux-kernel&m=127687751003902
and the whole thread, there are a lot more problems here.

But this particular one looks simple iirc.

> @@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
>  	rcu_assign_pointer(p->nsproxy, new);
>
>  	if (ns && atomic_dec_and_test(&ns->count)) {
> +		/* Handle fork() failure, unmount proc before proceeding */
> +		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
> +			struct pid_namespace *pid_ns = ns->pid_ns;
> +
> +			if (pid_ns && pid_ns != &init_pid_ns)
> +				pid_ns_release_proc(pid_ns);
> +		}
> +
>  		/*
>  		 * wait for others to get what they want from this nsproxy.
>  		 *

At first glance this looks correct. But the PF_EXITING check doesn't
look very nice imho. It is needed to detect the case when the caller
is copy_process()->bad_fork_cleanup_namespaces and p is not current.

Perhaps it would be more clean to add the explicit

	 bad_fork_cleanup_namespaces:
	+	if (unlikely(clone_flags & CLONE_NEWPID))
	+		pid_ns_release_proc(...);
		exit_task_namespaces(p);
		
		
code into this error path in copy_process?

Oleg.



* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-28 14:26 ` Oleg Nesterov
@ 2012-04-29  4:13   ` Mike Galbraith
  2012-04-29  7:57   ` Eric W. Biederman
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-04-29  4:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Pavel Emelyanov, Cyrill Gorcunov, Eric W. Biederman, Louis Rilling

On Sat, 2012-04-28 at 16:26 +0200, Oleg Nesterov wrote: 
> On 04/28, Mike Galbraith wrote:
> >
> > Greetings,
> 
> Hi,
> 
> Add CC's. I never understood the proc/namespace interaction in detail,
> and it seems to me I forgot everything.
> 
> > SIGCHLD delivery during fork() may cause failure,
> 
> Or any other reason to fail after copy_namespaces()

Yeah.

> > resulting in the aborted
> > child being cloned with CLONE_NEWPID leaking namespaces due to proc being
> > mounted during pid namespace creation, but not unmounted on fork() failure.
> 
> Heh. Please look at http://marc.info/?l=linux-kernel&m=127687751003902
> and the whole thread, there are a lot more problems here.

Ew, I would have been better off not reading that ;-)

> But this particular one looks simple iirc.
> 
> > @@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> >  	rcu_assign_pointer(p->nsproxy, new);
> >
> >  	if (ns && atomic_dec_and_test(&ns->count)) {
> > +		/* Handle fork() failure, unmount proc before proceeding */
> > +		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
> > +			struct pid_namespace *pid_ns = ns->pid_ns;
> > +
> > +			if (pid_ns && pid_ns != &init_pid_ns)
> > +				pid_ns_release_proc(pid_ns);
> > +		}
> > +
> >  		/*
> >  		 * wait for others to get what they want from this nsproxy.
> >  		 *
> 
> At first glance this looks correct. But the PF_EXITING check doesn't
> look very nice imho. It is needed to detect the case when the caller
> is copy_process()->bad_fork_cleanup_namespaces and p is not current.

Yeah, that does look a lot like a wart.

This being the first use of pid_ns_release_proc(), I was more concerned
that perhaps I should be doing something else entirely.

> Perhaps it would be more clean to add the explicit
> 
> 	 bad_fork_cleanup_namespaces:
> 	+	if (unlikely(clone_flags & CLONE_NEWPID))
> 	+		pid_ns_release_proc(...);
> 		exit_task_namespaces(p);
> 		
> 		
> code into this error path in copy_process?

Yeah, that's prettier.  Thanks.

-Mike



* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-28 14:26 ` Oleg Nesterov
  2012-04-29  4:13   ` Mike Galbraith
@ 2012-04-29  7:57   ` Eric W. Biederman
  2012-04-29  9:49     ` Mike Galbraith
  2012-04-29 16:58     ` Oleg Nesterov
  1 sibling, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-04-29  7:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/28, Mike Galbraith wrote:
>>
>> Greetings,
>
> Hi,
>
> Add CC's. I never understood the proc/namespace interaction in detail,
> and it seems to me I forgot everything.
>
>> SIGCHLD delivery during fork() may cause failure,
>
> Or any other reason to fail after copy_namespaces()
>
>> resulting in the aborted
>> child being cloned with CLONE_NEWPID leaking namespaces due to proc being
>> mounted during pid namespace creation, but not unmounted on fork() failure.
>
> Heh. Please look at http://marc.info/?l=linux-kernel&m=127687751003902
> and the whole thread, there are a lot more problems here.

I don't remember seeing a leak in that conversation.

> But this particular one looks simple iirc.
>
>> @@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
>>  	rcu_assign_pointer(p->nsproxy, new);
>>
>>  	if (ns && atomic_dec_and_test(&ns->count)) {
>> +		/* Handle fork() failure, unmount proc before proceeding */
>> +		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
>> +			struct pid_namespace *pid_ns = ns->pid_ns;
>> +
>> +			if (pid_ns && pid_ns != &init_pid_ns)
>> +				pid_ns_release_proc(pid_ns);
>> +		}
>> +
>>  		/*
>>  		 * wait for others to get what they want from this nsproxy.
>>  		 *
>
> At first glance this looks correct. But the PF_EXITING check doesn't
> look very nice imho. It is needed to detect the case when the caller
> is copy_process()->bad_fork_cleanup_namespaces and p is not current.

Mike's proposed change to switch_task_namespaces is most definitely not
correct.  This will potentially get called on unshare and so we don't
limit ourselves to just an exit pid_namespace.  The result is that we
could free the proc mount long before it is safe.

At the same time the leak that Mike detected is most definitely real.

> Perhaps it would be more clean to add the explicit
>
> 	 bad_fork_cleanup_namespaces:
> 	+	if (unlikely(clone_flags & CLONE_NEWPID))
> 	+		pid_ns_release_proc(...);
> 		exit_task_namespaces(p);
> 		
> 		
> code into this error path in copy_process?

For now Oleg your minimal patch looks good. 

Part of me would like to call proc_flush_task instead of
pid_ns_release_proc but we have no assurance task_pid and task_tgid are
valid when we get here so proc_flush_task is out.

There are crazy code paths like daemonize() that also call
switch_task_namespaces and change the pid namespace, and those are still
potentially broken.

Breaking the loop between the pid namespace and the proc mount would
be good, and I will see about making the time to push those patches.
So we can have something much less magical going on.

Eric


* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-29  7:57   ` Eric W. Biederman
@ 2012-04-29  9:49     ` Mike Galbraith
  2012-04-29 16:58     ` Oleg Nesterov
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-04-29  9:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

On Sun, 2012-04-29 at 00:57 -0700, Eric W. Biederman wrote:

> Mike's proposed change to switch_task_namespaces is most definitely not
> correct.  This will potentially get called on unshare and so we don't
> limit ourselves to just an exit pid_namespace.  The result is that we
> could free the proc mount long before it is safe.

!new && !(p->flags & PF_EXITING) should prevent that..

> At the same time the leak that Mike detected is most definitely real.
> 
> > Perhaps it would be more clean to add the explicit
> >
> > 	 bad_fork_cleanup_namespaces:
> > 	+	if (unlikely(clone_flags & CLONE_NEWPID))
> > 	+		pid_ns_release_proc(...);
> > 		exit_task_namespaces(p);
> > 		
> > 		
> > code into this error path in copy_process?
> 
> For now Oleg your minimal patch looks good.

..but yeah, that looks much nicer.

> Part of me would like to call proc_flush_task instead of
> pid_ns_release_proc but we have no assurance task_pid and task_tgid are
> valid when we get here so proc_flush_task is out.

I only discovered pid_ns_release_proc() exists after what I tried first didn't work :)

-Mike



* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-29  7:57   ` Eric W. Biederman
  2012-04-29  9:49     ` Mike Galbraith
@ 2012-04-29 16:58     ` Oleg Nesterov
  2012-04-30  2:59       ` Eric W. Biederman
  2012-04-30  3:01       ` [PATCH] " Mike Galbraith
  1 sibling, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-04-29 16:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Mike Galbraith, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

On 04/29, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Heh. Please look at http://marc.info/?l=linux-kernel&m=127687751003902
> > and the whole thread, there are a lot more problems here.
>
> I don't remember seeing a leak in that conversation.

It was discussed many times ;) in particular, from the link above:

	Note: afaics we have another problem. What if copy_process(CLONE_NEWPID)
	fails after pid_ns_prepare_proc() ? Who will do mntput() ?

But we all forgot about this (relatively minor) problem.

> > But this particular one looks simple iirc.
> >
> >> @@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> >>  	rcu_assign_pointer(p->nsproxy, new);
> >>
> >>  	if (ns && atomic_dec_and_test(&ns->count)) {
> >> +		/* Handle fork() failure, unmount proc before proceeding */
> >> +		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
> >> +			struct pid_namespace *pid_ns = ns->pid_ns;
> >> +
> >> +			if (pid_ns && pid_ns != &init_pid_ns)
> >> +				pid_ns_release_proc(pid_ns);
> >> +		}
> >> +
> >>  		/*
> >>  		 * wait for others to get what they want from this nsproxy.
> >>  		 *
> >
> > At first glance this looks correct. But the PF_EXITING check doesn't
> > look very nice imho. It is needed to detect the case when the caller
> > is copy_process()->bad_fork_cleanup_namespaces and p is not current.
>
> Mike's proposed change to switch_task_namespaces is most definitely not
> correct.  This will potentially get called on unshare

Yes, but please note that this change also checks "new == NULL", so I
still think the patch is correct.

But,

> > 	 bad_fork_cleanup_namespaces:
> > 	+	if (unlikely(clone_flags & CLONE_NEWPID))
> > 	+		pid_ns_release_proc(...);
> > 		exit_task_namespaces(p);
> > 		
> > 		
> > code into this error path in copy_process?
>
> For now Oleg your minimal patch looks good.

Good.

Mike, could you please re-send the patch to akpm? Feel free to add my ack.
I guess Eric will ack this fix too.

> Part of me would like to call proc_flush_task instead,

Yes, I thought about this too, it checks upid->nr == 1. But

> pid_ns_release_proc but we have no assurance task_pid and task_tgid are
> valid when we get here so proc_flush_task is out.

Yes.

> There are crazy code paths like daemonize()

Forget. It has no callers anymore, should be killed. A user-space process
should never use kernel_thread() and thus daemonize() is not needed.

Oleg.



* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-29 16:58     ` Oleg Nesterov
@ 2012-04-30  2:59       ` Eric W. Biederman
  2012-04-30  3:25         ` Mike Galbraith
  2012-05-02 12:40         ` Oleg Nesterov
  2012-04-30  3:01       ` [PATCH] " Mike Galbraith
  1 sibling, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-04-30  2:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/29, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>
>> > But this particular one looks simple iirc.
>> >
>> >> @@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
>> >>  	rcu_assign_pointer(p->nsproxy, new);
>> >>
>> >>  	if (ns && atomic_dec_and_test(&ns->count)) {
>> >> +		/* Handle fork() failure, unmount proc before proceeding */
>> >> +		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
>> >> +			struct pid_namespace *pid_ns = ns->pid_ns;
>> >> +
>> >> +			if (pid_ns && pid_ns != &init_pid_ns)
>> >> +				pid_ns_release_proc(pid_ns);
>> >> +		}
>> >> +
>> >>  		/*
>> >>  		 * wait for others to get what they want from this nsproxy.
>> >>  		 *
>> >
>> > At first glance this looks correct. But the PF_EXITING check doesn't
>> > look very nice imho. It is needed to detect the case when the caller
>> > is copy_process()->bad_fork_cleanup_namespaces and p is not current.
>>
>> Mike's proposed change to switch_task_namespaces is most definitely not
>> correct.  This will potentially get called on unshare
>
> Yes, but please note that this change also checks "new == NULL", so I
> still think the patch is correct.

Sort of.  It is correct in the sense that it performs magic checks on
its arguments to see that its caller is exit_task_namespaces called
from the fork failure path.

It is incorrect in that it doesn't handle weird cases like
daemonize() which also call switch_task_namespaces.  So it is no better,
much more confusing, and much less maintainable than your two-line patch
below.

> But,
>
>> > 	 bad_fork_cleanup_namespaces:
>> > 	+	if (unlikely(clone_flags & CLONE_NEWPID))
>> > 	+		pid_ns_release_proc(...);
>> > 		exit_task_namespaces(p);
>> > 		
>> > 		
>> > code into this error path in copy_process?
>>
>> For now Oleg your minimal patch looks good.
>
> Good.
>
> Mike, could you please re-send the patch to akpm? Feel free to add my ack.
> I guess Eric will ack this fix too.

I will.

>> There are crazy code paths like daemonize()
>
> Forget. It has no callers anymore, should be killed. A user-space process
> should never use kernel_thread() and thus daemonize() is not needed.

Good point.  Oleg, do you think you can send in the patches to kill
daemonize?  It would make it a lot easier to sleep at night and review
patches if I did not have to think about that scary code path.

Eric





* [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-29 16:58     ` Oleg Nesterov
  2012-04-30  2:59       ` Eric W. Biederman
@ 2012-04-30  3:01       ` Mike Galbraith
       [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-04-30  3:01 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton
  Cc: Eric W. Biederman, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

On Sun, 2012-04-29 at 18:58 +0200, Oleg Nesterov wrote: 
> On 04/29, Eric W. Biederman wrote:
> >
> > Oleg Nesterov <oleg@redhat.com> writes:
> >
> > > Heh. Please look at http://marc.info/?l=linux-kernel&m=127687751003902
> > > and the whole thread, there are a lot more problems here.
> >
> > I don't remember seeing a leak in that conversation.
> 
> It was discussed many times ;) in particular, from the link above:
> 
> 	Note: afaics we have another problem. What if copy_process(CLONE_NEWPID)
> 	fails after pid_ns_prepare_proc() ? Who will do mntput() ?
> 
> But we all forgot about this (relatively minor) problem.
> 
> > > But this particular one looks simple iirc.
> > >
> > >> @@ -216,6 +216,14 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> > >>  	rcu_assign_pointer(p->nsproxy, new);
> > >>
> > >>  	if (ns && atomic_dec_and_test(&ns->count)) {
> > >> +		/* Handle fork() failure, unmount proc before proceeding */
> > >> +		if (unlikely(!new && !((p->flags & PF_EXITING)))) {
> > >> +			struct pid_namespace *pid_ns = ns->pid_ns;
> > >> +
> > >> +			if (pid_ns && pid_ns != &init_pid_ns)
> > >> +				pid_ns_release_proc(pid_ns);
> > >> +		}
> > >> +
> > >>  		/*
> > >>  		 * wait for others to get what they want from this nsproxy.
> > >>  		 *
> > >
> > > At first glance this looks correct. But the PF_EXITING check doesn't
> > > look very nice imho. It is needed to detect the case when the caller
> > > is copy_process()->bad_fork_cleanup_namespaces and p is not current.
> >
> > Mike's proposed change to switch_task_namespaces is most definitely not
> > correct.  This will potentially get called on unshare
> 
> Yes, but please note that this change also checks "new == NULL", so I
> still think the patch is correct.
> 
> But,
> 
> > > 	 bad_fork_cleanup_namespaces:
> > > 	+	if (unlikely(clone_flags & CLONE_NEWPID))
> > > 	+		pid_ns_release_proc(...);
> > > 		exit_task_namespaces(p);
> > > 		
> > > 		
> > > code into this error path in copy_process?
> >
> > For now Oleg your minimal patch looks good.
> 
> Good.
> 
> Mike, could you please re-send the patch to akpm? Feel free to add my ack.
> I guess Eric will ack this fix too.

namespaces, pid_ns: fix leakage on fork() failure

A fork() failure after namespace creation for a child cloned with
CLONE_NEWPID leaks pid_namespace/mnt_cache due to proc being mounted
during creation but not unmounted during cleanup.  Call
pid_ns_release_proc() during cleanup.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>

 kernel/fork.c |    3 +++
 1 file changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9372a0..91482b6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -67,6 +67,7 @@
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
+#include <linux/proc_fs.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1464,6 +1465,8 @@ bad_fork_cleanup_io:
 	if (p->io_context)
 		exit_io_context(p);
 bad_fork_cleanup_namespaces:
+	if (unlikely(clone_flags & CLONE_NEWPID))
+		pid_ns_release_proc(p->nsproxy->pid_ns);
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
 	if (p->mm)




* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-30  2:59       ` Eric W. Biederman
@ 2012-04-30  3:25         ` Mike Galbraith
  2012-05-02 12:40         ` Oleg Nesterov
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-04-30  3:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

On Sun, 2012-04-29 at 19:59 -0700, Eric W. Biederman wrote: 
> Oleg Nesterov <oleg@redhat.com> writes:

> > Yes, but please note that this change also checks "new == NULL", so I
> > still think the patch is correct.
> 
> Sort of.  It is correct in the sense that it performs magic checks on
> it's arguments to see that it's caller is exit_task_namespaces called
> from the fork failure path.

Yeah, I did that to keep the namespace fix in namespace source.  Bad
idea, even with circles and arrows saying "hey, I'm an unborn
red-headed stepchild".
-Mike





* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
  2012-04-28 14:26 ` Oleg Nesterov
@ 2012-04-30 13:57 ` Mike Galbraith
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-04-30 13:57 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov

On Sat, 2012-04-28 at 11:19 +0200, Mike Galbraith wrote:

> There's still at least one pid
> namespace leak left, namely that the final put_pid() in softirq context
> goes missing.

Ah..

vsftpd-14507 [003] ....  1467.046189: proc_set_super: get_pid_ns: 0xffff8801dc560998 count:1->2
vsftpd-14507 [003] ....  1467.046201: create_pid_namespace: create_pid_namespace: 0xffff8801dc560998
vsftpd-14507 [003] ....  1467.046206: alloc_pid: get_pid_ns: 0xffff8801dc560998 count:2->3
vsftpd-14521 [003] ....  1467.052481: switch_task_namespaces: exiting: 0xffff8801dc560998 count:3
vsftpd-14521 [003] ....  1467.073823: free_nsproxy: put_pid_ns: 0xffff8801dc560998 count:3->2
vsftpd-14507 [003] ....  1467.173657: put_pid: namespace: 0xffff8801dc560998 pid count:2->1 pid_ns count:2
vsftpd-14507 [003] ....  1467.173677: proc_kill_sb: put_pid_ns: 0xffff8801dc560998 count:2->1
<idle>-0     [003] ..s.  1467.213562: put_pid: namespace: 0xffff8801dc560998 pid count:6->5 pid_ns count:1

..somebody grabs pid references while we wait for rcu destruction.

-Mike



* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
       [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
@ 2012-05-01 20:42           ` Andrew Morton
  2012-05-03  3:12             ` Mike Galbraith
  2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
  0 siblings, 2 replies; 69+ messages in thread
From: Andrew Morton @ 2012-05-01 20:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On Tue, 01 May 2012 13:35:03 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 
> Andrew can you please pick up this patch?

Sure.  I assume it's fixing a post-3.4 regression?  No -stable backport
needed?

> This doesn't explain all of the vsftp weirdness people have been seeing
> but it does fix a real leak on fork failure that vsftp could most
> definitely have triggered.
> 
> Mike Galbraith <efault@gmx.de> writes:

hm, Mike had some test code.  If that was put in
tools/testing/selftests/nsproxy, this leak wouldn't happen again!


* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-04-30  2:59       ` Eric W. Biederman
  2012-04-30  3:25         ` Mike Galbraith
@ 2012-05-02 12:40         ` Oleg Nesterov
  2012-05-02 17:37           ` Eric W. Biederman
  1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-02 12:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Mike Galbraith, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

On 04/29, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 04/29, Eric W. Biederman wrote:
> >>
> >> There are crazy code paths like daemonize()
> >
> > Forget. It has no callers anymore, should be killed. A user-space process
> > should never use kernel_thread() and thus daemonize() is not needed.
>
> Good point.  Oleg do you think you can send in the patches to kill
> daemonize.

Yes, will do. I am waiting until the last user in arch/powerpc goes
away, the patch is already in mm.

Oleg.



* Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-02 12:40         ` Oleg Nesterov
@ 2012-05-02 17:37           ` Eric W. Biederman
  0 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-02 17:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/29, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > On 04/29, Eric W. Biederman wrote:
>> >>
>> >> There are crazy code paths like daemonize()
>> >
>> > Forget. It has no callers anymore, should be killed. A user-space process
>> > should never use kernel_thread() and thus daemonize() is not needed.
>>
>> Good point.  Oleg do you think you can send in the patches to kill
>> daemonize.
>
> Yes, will do. I am waiting until the last user in arch/powerpc goes
> away, the patch is already in mm.

Well I don't quite know what the path was but it looks like that change
has already hit Linus's tree:


commit 37ef9bd48af6ab9a3d1fd28df4f929abc19f2cc3
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Wed Mar 28 12:20:57 2012 +0000

    powerpc/eeh: Remove eeh_event_handler()->daemonize()
    
    daemonize() is only needed when a user-space task does kernel_thread().
    
    eeh_event_handler() thread is created by the worker kthread, and thus it
    doesn't need the soon-to-be-deprecated daemonize().
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Linas Vepstas <linasvepstas@gmail.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Acked-by: Matt Fleming <matt.fleming@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

diff --git a/arch/powerpc/platforms/pseries/eeh_event.c b/arch/powerpc/platforms/pseries/eeh_event.c
index 4a47525..92dd84c 100644
--- a/arch/powerpc/platforms/pseries/eeh_event.c
+++ b/arch/powerpc/platforms/pseries/eeh_event.c
@@ -59,7 +59,7 @@ static int eeh_event_handler(void * dummy)
        struct eeh_event *event;
        struct eeh_dev *edev;
 
-       daemonize("eehd");
+       set_task_comm(current, "eehd");
        set_current_state(TASK_INTERRUPTIBLE);
 
        spin_lock_irqsave(&eeh_eventlist_lock, flags);


Eric


* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-01 20:42           ` Andrew Morton
@ 2012-05-03  3:12             ` Mike Galbraith
  2012-05-03 14:56               ` Mike Galbraith
  2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-03  3:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Tue, 2012-05-01 at 13:42 -0700, Andrew Morton wrote:
> On Tue, 01 May 2012 13:35:03 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> 
> > 
> > Andrew can you please pick up this patch?
> 
> Sure.  I assume it's fixing a post-3.4 regression?  No -stable backport
> needed?

Dunno what all should go to stable, but anyone using vsftpd will
appreciate something going.  Large leakage was initially reported
against 3.1.  That was bisected to..
423e0ab0 VFS : mount lock scalability for internal mounts 

Subsequent fixes which did not go to stable were applied..
	905ad269 procfs: fix a vfsmount longterm reference leak
	6f686574 ... and the same kind of leak for mqueue
..but leakage persists even with the fork-failure hole plugged.

The one (at least) that remains, where pid references are grabbed while
we wait for RCU destruction of the pid/namespace, so we subsequently
never destroy the namespace, is being annoying.  Yesterday I ran up >4600
leaks with 10 background instances of the testcase (time that, egad);
much MUCH later ~4000 were released.  Between the huge delay and ftrace
simply refusing to trace the out-of-lined get_pid(), I'm having a jolly
time trying to get an 8x10 color glossy of the leaky event to stare at.

Whatever goes to stable, what fixes this little bugger should go too.

> hm, Mike had some test code.  If that was put in
> tools/testing/selftests/nsproxy, this leak wouldn't happen again!

I didn't write it; it and a perl script to monitor the leakage came
along with the bug report.

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-03  3:12             ` Mike Galbraith
@ 2012-05-03 14:56               ` Mike Galbraith
  2012-05-04  4:27                 ` Mike Galbraith
  2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
  0 siblings, 2 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-05-03 14:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Thu, 2012-05-03 at 05:12 +0200, Mike Galbraith wrote: 
> On Tue, 2012-05-01 at 13:42 -0700, Andrew Morton wrote:
> > On Tue, 01 May 2012 13:35:03 -0700
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> > 
> > > 
> > > Andrew can you please pick up this patch?
> > 
> > Sure.  I assume it's fixing a post-3.4 regression?  No -stable backport
> > needed?
> 
> Dunno what all should go to stable, but anyone using vsftpd will
> appreciate something going.  Large leakage was initially reported
> against 3.1.  That was bisected to..
> 423e0ab0 VFS : mount lock scalability for internal mounts 
> 
> Subsequent fixes which did not go to stable were applied..
> 	905ad269 procfs: fix a vfsmount longterm reference leak
> 	6f686574 ... and the same kind of leak for mqueue
> ..but leakage persists even with fork failure hole plugged.

> Whatever goes to stable, what fixes this little bugger should go too.

Finally have a decent trace, patch to fix the problem below.

marge:~ # grep 0xffff8801fad5dff0 /trace3
          vsftpd-18277 [003] ....  1779.012239: proc_set_super: get_pid_ns: 0xffff8801fad5dff0 count:1->2
          vsftpd-18277 [003] ....  1779.012253: create_pid_namespace: create_pid_namespace: 0xffff8801fad5dff0
          vsftpd-18277 [003] ....  1779.012258: alloc_pid: get_pid_ns: 0xffff8801fad5dff0 count:2->3
          vsftpd-18277 [003] ....  1779.012278: proc_kill_sb: put_pid_ns: 0xffff8801fad5dff0 count:3->2
     ksoftirqd/3-16    [003] ..s.  1779.012731: delayed_put_pid: put_pid_ns: 0xffff8801fad5dff0 count:2->1
          vsftpd-18277 [003] ....  1779.015614: destroy_pid_namespace: destroy_pid_namespace: 0xffff8801fad5dff0
          vsftpd-18277 [003] ....  1779.015614: free_nsproxy: put_pid_ns: 0xffff8801fad5dff0 count:1->0
          vsftpd-18277 [003] ....  1779.249871: proc_set_super: get_pid_ns: 0xffff8801fad5dff0 count:1->2
          vsftpd-18277 [003] ....  1779.249884: create_pid_namespace: create_pid_namespace: 0xffff8801fad5dff0
          vsftpd-18277 [003] ....  1779.249888: alloc_pid: get_pid_ns: 0xffff8801fad5dff0 count:2->3
          vsftpd-18351 [003] ....  1779.256337: switch_task_namespaces: exiting: 0xffff8801fad5dff0 count:3
          vsftpd-18351 [003] ....  1779.266243: free_nsproxy: put_pid_ns: 0xffff8801fad5dff0 count:3->2
<insert>
              ps-18381 [000] ....  1779.298798: proc_fill_cache <-proc_pid_readdir
              ps-18381 [000] ....  1779.298802: proc_pid_instantiate <-proc_fill_cache
              ps-18381 [000] ....  1779.298802: proc_pid_make_inode <-proc_pid_instantiate
              ps-18381 [000] ....  1779.298802: proc_alloc_inode <-alloc_inode
              ps-18381 [000] ....  1779.298807: get_task_pid <-proc_pid_make_inode
              ps-18381 [000] ....  1779.298807: get_pid <-get_task_pid
</insert> ditto for other pid references added post task exit
              ps-18381 [000] ....  1779.298807: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:1->2 pid_ns count:2
              ps-18381 [001] ....  1779.327593: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:2->3 pid_ns count:2
              ps-18381 [001] ....  1779.327653: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:3->4 pid_ns count:2
              ps-18381 [001] ....  1779.327716: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:4->5 pid_ns count:2
              ps-18381 [001] ....  1779.327804: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:5->6 pid_ns count:2
              ps-18381 [001] ....  1779.327817: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:6->7 pid_ns count:2
              ps-18381 [001] ....  1779.327818: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:7->6 pid_ns count:2
          vsftpd-18277 [003] ....  1779.358887: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:6->5 pid_ns count:2
          vsftpd-18277 [003] ....  1779.358889: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:5->4 pid_ns count:2
          vsftpd-18277 [003] ....  1779.358891: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:4->3 pid_ns count:2
          vsftpd-18277 [003] ....  1779.358894: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:3->2 pid_ns count:2
          vsftpd-18277 [003] ....  1779.358897: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:2->1 pid_ns count:2
          vsftpd-18277 [003] ....  1779.358918: proc_kill_sb: put_pid_ns: 0xffff8801fad5dff0 count:2->1
              ps-18386 [001] ....  1779.370210: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:1->2 pid_ns count:1
              ps-18386 [001] ....  1779.370240: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:2->3 pid_ns count:1
              ps-18386 [001] ....  1779.370300: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:3->4 pid_ns count:1
              ps-18386 [001] ....  1779.370361: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:4->5 pid_ns count:1
              ps-18386 [001] ....  1779.370454: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:5->6 pid_ns count:1
              ps-18386 [001] ....  1779.370467: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:6->7 pid_ns count:1
              ps-18386 [001] ....  1779.370468: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:7->6 pid_ns count:1
     ksoftirqd/3-16    [003] ..s.  1779.390717: delayed_put_pid: pid: 0xffff8802031a2fc0 LEAKED namespace: 0xffff8801fad5dff0

Ok, that seems reasonable.

Create >27k "leaked" namespaces, watch many thousands go away over
time.. but many hundreds persist and persist and persist.

Hm.  echo 3 > /proc/sys/vm/drop_caches.. *poof gone*

Grr.  I wonder who is doing the pinning when I don't monitor, but..

<patch>
kick kick kick... it's dead Jim.
</patch>

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-03 14:56               ` Mike Galbraith
@ 2012-05-04  4:27                 ` Mike Galbraith
  2012-05-04  7:55                   ` Eric W. Biederman
  2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

Namespaces have something in common with cgroups.  synchronize_rcu()
makes them somewhat less than wonderful for dynamic use.

default flags = SIGCHLD

-namespace:  flags |= CLONE_NEWPID
-all:  flags |= CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER

marge:/usr/local/tmp/starvation # ./hackbench
Running with 10*40 (== 400) tasks.
Time: 2.636
marge:/usr/local/tmp/starvation # ./hackbench -namespace
Running with 10*40 (== 400) tasks.
Time: 11.624
marge:/usr/local/tmp/starvation # ./hackbench -namespace -all
Running with 10*40 (== 400) tasks.
Time: 51.474

You can create trash quickly, but you have to haul it away.

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04  4:27                 ` Mike Galbraith
@ 2012-05-04  7:55                   ` Eric W. Biederman
  2012-05-04  8:34                     ` Mike Galbraith
  2012-05-04  9:45                     ` Mike Galbraith
  0 siblings, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-04  7:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

Mike Galbraith <efault@gmx.de> writes:

> Namespaces have something in common with cgroups.  synchronize_rcu()
> makes them somewhat less than wonderful for dynamic use.

Well, unlike cgroups, namespaces were not designed for heavy dynamic use.
It appears that vsftpd puts them to that kind of use, though, so some
of the design decisions are worth revisiting.

> default flags = SIGCHLD
>
> -namespace:  flags |= CLONE_NEWPID
> -all:  flags |= CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER
>
> marge:/usr/local/tmp/starvation # ./hackbench
> Running with 10*40 (== 400) tasks.
> Time: 2.636
> marge:/usr/local/tmp/starvation # ./hackbench -namespace
> Running with 10*40 (== 400) tasks.
> Time: 11.624
> marge:/usr/local/tmp/starvation # ./hackbench -namespace -all
> Running with 10*40 (== 400) tasks.
> Time: 51.474

CLONE_NEWUSER?  I presume you have applied my latest user namespace
patches?  Otherwise you are running completely half baked code.

hackbench?  Which kernel are you running?  Hackbench in some kernels is
really good at triggering cache ping-pong effects with pids, and creds.
So I'm not certain what to say there.  In the latest kernels things
should be better with unix domain sockets as long as you don't actually
ask to pass your creds but hackbench is still a pretty ridiculous
benchmark.  Oversharing is always going to be bad for performance.

> You can create trash quickly, but you have to haul it away.

Well, synchronize_rcu is much better in that respect than call_rcu, which
lets the trash build up but never carries it away.

The core design assumption with namespaces is that they will be used
much more than they will be created/destroyed, and as long as there are
progress guarantees in place I don't have a problem with that.   At the
same time if there are easy things we can do to make things go faster
I am in favor of that notion.

Still, especially in the case of hackbench, I think it is worth asking
how much of the slowdown is due to cache ping-pong from oversharing.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-03 14:56               ` Mike Galbraith
  2012-05-04  4:27                 ` Mike Galbraith
@ 2012-05-04  8:03                 ` Eric W. Biederman
  2012-05-04  8:19                   ` Mike Galbraith
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-04  8:03 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

Mike Galbraith <efault@gmx.de> writes:

> On Thu, 2012-05-03 at 05:12 +0200, Mike Galbraith wrote: 
>> On Tue, 2012-05-01 at 13:42 -0700, Andrew Morton wrote:
>> > On Tue, 01 May 2012 13:35:03 -0700
>> > ebiederm@xmission.com (Eric W. Biederman) wrote:
>> > 
>> > > 
>> > > Andrew can you please pick up this patch?
>> > 
>> > Sure.  I assume it's fixing a post-3.4 regression?  No -stable backport
>> > needed?
>> 
>> Dunno what all should go to stable, but anyone using vsftpd will
>> appreciate something going.  Large leakage was initially reported
>> against 3.1.  That was bisected to..
>> 423e0ab0 VFS : mount lock scalability for internal mounts 
>> 
>> Subsequent fixes which did not go to stable were applied..
>> 	905ad269 procfs: fix a vfsmount longterm reference leak
>> 	6f686574 ... and the same kind of leak for mqueue
>> ..but leakage persists even with fork failure hole plugged.
>
>> Whatever goes to stable, what fixes this little bugger should go too.
>
> Finally have a decent trace, patch to fix the problem below.
>
> marge:~ # grep 0xffff8801fad5dff0 /trace3
>           vsftpd-18277 [003] ....  1779.012239: proc_set_super: get_pid_ns: 0xffff8801fad5dff0 count:1->2
>           vsftpd-18277 [003] ....  1779.012253: create_pid_namespace: create_pid_namespace: 0xffff8801fad5dff0
>           vsftpd-18277 [003] ....  1779.012258: alloc_pid: get_pid_ns: 0xffff8801fad5dff0 count:2->3
>           vsftpd-18277 [003] ....  1779.012278: proc_kill_sb: put_pid_ns: 0xffff8801fad5dff0 count:3->2
>      ksoftirqd/3-16    [003] ..s.  1779.012731: delayed_put_pid: put_pid_ns: 0xffff8801fad5dff0 count:2->1
>           vsftpd-18277 [003] ....  1779.015614: destroy_pid_namespace: destroy_pid_namespace: 0xffff8801fad5dff0
>           vsftpd-18277 [003] ....  1779.015614: free_nsproxy: put_pid_ns: 0xffff8801fad5dff0 count:1->0
>           vsftpd-18277 [003] ....  1779.249871: proc_set_super: get_pid_ns: 0xffff8801fad5dff0 count:1->2
>           vsftpd-18277 [003] ....  1779.249884: create_pid_namespace: create_pid_namespace: 0xffff8801fad5dff0
>           vsftpd-18277 [003] ....  1779.249888: alloc_pid: get_pid_ns: 0xffff8801fad5dff0 count:2->3
>           vsftpd-18351 [003] ....  1779.256337: switch_task_namespaces: exiting: 0xffff8801fad5dff0 count:3
>           vsftpd-18351 [003] ....  1779.266243: free_nsproxy: put_pid_ns: 0xffff8801fad5dff0 count:3->2
> <insert>
>               ps-18381 [000] ....  1779.298798: proc_fill_cache <-proc_pid_readdir
>               ps-18381 [000] ....  1779.298802: proc_pid_instantiate <-proc_fill_cache
>               ps-18381 [000] ....  1779.298802: proc_pid_make_inode <-proc_pid_instantiate
>               ps-18381 [000] ....  1779.298802: proc_alloc_inode <-alloc_inode
>               ps-18381 [000] ....  1779.298807: get_task_pid <-proc_pid_make_inode
>               ps-18381 [000] ....  1779.298807: get_pid <-get_task_pid
> </insert> ditto for other pid references added post task exit
>               ps-18381 [000] ....  1779.298807: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:1->2 pid_ns count:2
>               ps-18381 [001] ....  1779.327593: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:2->3 pid_ns count:2
>               ps-18381 [001] ....  1779.327653: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:3->4 pid_ns count:2
>               ps-18381 [001] ....  1779.327716: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:4->5 pid_ns count:2
>               ps-18381 [001] ....  1779.327804: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:5->6 pid_ns count:2
>               ps-18381 [001] ....  1779.327817: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:6->7 pid_ns count:2
>               ps-18381 [001] ....  1779.327818: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:7->6 pid_ns count:2
>           vsftpd-18277 [003] ....  1779.358887: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:6->5 pid_ns count:2
>           vsftpd-18277 [003] ....  1779.358889: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:5->4 pid_ns count:2
>           vsftpd-18277 [003] ....  1779.358891: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:4->3 pid_ns count:2
>           vsftpd-18277 [003] ....  1779.358894: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:3->2 pid_ns count:2
>           vsftpd-18277 [003] ....  1779.358897: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:2->1 pid_ns count:2
>           vsftpd-18277 [003] ....  1779.358918: proc_kill_sb: put_pid_ns: 0xffff8801fad5dff0 count:2->1
>               ps-18386 [001] ....  1779.370210: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:1->2 pid_ns count:1
>               ps-18386 [001] ....  1779.370240: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:2->3 pid_ns count:1
>               ps-18386 [001] ....  1779.370300: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:3->4 pid_ns count:1
>               ps-18386 [001] ....  1779.370361: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:4->5 pid_ns count:1
>               ps-18386 [001] ....  1779.370454: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:5->6 pid_ns count:1
>               ps-18386 [001] ....  1779.370467: get_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:6->7 pid_ns count:1
>               ps-18386 [001] ....  1779.370468: put_pid: pid: 0xffff8802031a2fc0 namespace: 0xffff8801fad5dff0 pid count:7->6 pid_ns count:1
>      ksoftirqd/3-16    [003] ..s.  1779.390717: delayed_put_pid: pid: 0xffff8802031a2fc0 LEAKED namespace: 0xffff8801fad5dff0
>
> Ok, that seems reasonable.
>
> Create > 27k "leaked" namespaces, watch many thousands go away over
> time.. but many hundred persist and persist and persist.
>
> Hm.  echo 3 > /proc/sys/vm/drop_caches.. *poof gone*

Good to hear.  I am sad to hear that proc_flush_task isn't doing a
better job but at least memory pressure will now free everything up as
it is supposed to.

> Grr.  I wonder who is doing the pinning when I don't monitor, but..
>
> <patch>
> kick kick kick... it's dead Jim.
> </patch>

Are you saying all known bugs are fixed?  Thanks for digging into this
by the way.  A fresh set of eyeballs is always nice.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
@ 2012-05-04  8:19                   ` Mike Galbraith
  2012-05-04  8:54                     ` Mike Galbraith
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04  8:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 01:03 -0700, Eric W. Biederman wrote: 
> Mike Galbraith <efault@gmx.de> writes:

> > Hm.  echo 3 > /proc/sys/vm/drop_caches.. *poof gone*
> 
> Good to hear.  I am sad to hear that proc_flush_task isn't doing a
> better job but at least memory pressure will now free everything up as
> it is supposed to.
> 
> > Grr.  I wonder who is doing the pinning when I don't monitor, but..
> >
> > <patch>
> > kick kick kick... it's dead Jim.
> > </patch>
> 
> Are you saying all known bugs are fixed?  Thanks for digging into this
> by the way.  A fresh set of eyeballs is always nice.

Think so.  I'm beating it up in our 3.0 kernel with all fixes, since the
patch that started this odyssey got backported to that kernel as well.

Too bad I had so much ftrace trouble; I chased a dead bug for a week :)

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04  7:55                   ` Eric W. Biederman
@ 2012-05-04  8:34                     ` Mike Galbraith
  2012-05-04  9:45                     ` Mike Galbraith
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04  8:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 00:55 -0700, Eric W. Biederman wrote: 
> Mike Galbraith <efault@gmx.de> writes:
> 
> > Namespaces have something in common with cgroups.  synchronize_rcu()
> > makes them somewhat less than wonderful for dynamic use.
> 
> Well unlike cgroups namespaces were not designed for heavy dynamic use.
> Although it appears that vsftpd puts them to that kind of use so some
> of the design decisions are worth revisiting.

Yeah, the testcase was distilled from vsftpd, so it must be beating on
namespaces pretty hard to induce a bug report.

> > default flags = SIGCHLD
> >
> > -namespace:  flags |= CLONE_NEWPID
> > -all:  flags |= CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER
> >
> > marge:/usr/local/tmp/starvation # ./hackbench
> > Running with 10*40 (== 400) tasks.
> > Time: 2.636
> > marge:/usr/local/tmp/starvation # ./hackbench -namespace
> > Running with 10*40 (== 400) tasks.
> > Time: 11.624
> > marge:/usr/local/tmp/starvation # ./hackbench -namespace -all
> > Running with 10*40 (== 400) tasks.
> > Time: 51.474
> 
> CLONE_NEWUSER?  I presume you have applied my latest user namespace
> patches?  Otherwise you are running completely half baked code.

I was testing in mainline.  While fiddling with the testcase and leakage
monitor script, I decided to see what happens with all namespace flags.
The others didn't cause any leakage, but did make things slow down.

> hackbench?  Which kernel are you running.  Hackbench in some kernels is
> really good at triggering cache ping-pong effects with pids, and creds.
> So I'm not certain what to say there.  In the latest kernels things
> should be better with unix domain sockets as long as you don't actually
> ask to pass your creds but hackbench is still a pretty ridiculous
> benchmark.  Oversharing is always going to be bad for performance.

Hackbench was just to show the price of hefty namespace usage.

> > You can create trash quickly, but you have to haul it away.
> 
> Well synchronize_rcu is much better in that respect than call_rcu, which
> let's the trash build up but is never carried away.
> 
> The core design assumption with namespaces is that they will be used
> much more than they will be created/destroyed, and as long as there are
> progress guarantees in place I don't have a problem with that.   At the
> same time if there are easy things we can do to make things go faster
> I am in favor of that notion.
> 
> Still especially in the case of hackbench I think it is worth asking the
> question how much of the slow down is due to cache ping-pong due to
> oversharing.

Dunno, and doubt I'll have time to tinker with it more.  Darn bugzilla
thing keeps knocking on my mailbox with interesting bugs in places I
know _diddly spit_ about.. like namespaces.

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04  8:19                   ` Mike Galbraith
@ 2012-05-04  8:54                     ` Mike Galbraith
  0 siblings, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04  8:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 10:19 +0200, Mike Galbraith wrote: 
> On Fri, 2012-05-04 at 01:03 -0700, Eric W. Biederman wrote: 

> > Are you saying all known bugs are fixed?

> Think so.  I'm beating it up in our 3.0 kernel with all fixes, since the
> patch that started this odyssey got backported to that kernel as well.

Yeah, it's dead.

	-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04  7:55                   ` Eric W. Biederman
  2012-05-04  8:34                     ` Mike Galbraith
@ 2012-05-04  9:45                     ` Mike Galbraith
  2012-05-04 14:13                       ` Eric W. Biederman
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04  9:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 00:55 -0700, Eric W. Biederman wrote:

> CLONE_NEWUSER?  I presume you have applied my latest user namespace
> patches?  Otherwise you are running completely half baked code.

I removed the CLONE_NEWUSER flag.

> hackbench?  Which kernel are you running.  Hackbench in some kernels is
> really good at triggering cache ping-pong effects with pids, and creds.

Not when pinned.  3.0 kernel without the debug stuff enabled in 3.4.git.
 
marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench
Running with 10*40 (== 400) tasks.
Time: 0.868
marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace
Running with 10*40 (== 400) tasks.
Time: 7.582
marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace -all
Running with 10*40 (== 400) tasks.
Time: 29.677

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04  9:45                     ` Mike Galbraith
@ 2012-05-04 14:13                       ` Eric W. Biederman
  2012-05-04 14:49                         ` Mike Galbraith
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-04 14:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

Mike Galbraith <efault@gmx.de> writes:

> On Fri, 2012-05-04 at 00:55 -0700, Eric W. Biederman wrote:
>
>> CLONE_NEWUSER?  I presume you have applied my latest user namespace
>> patches?  Otherwise you are running completely half baked code.
>
> I removed the CLONE_NEWUSER flag.
>
>> hackbench?  Which kernel are you running.  Hackbench in some kernels is
>> really good at triggering cache ping-pong effects with pids, and creds.
>
> Not when pinned.  3.0 kernel without the debug stuff enabled in 3.4.git.
> 
> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench
> Running with 10*40 (== 400) tasks.
> Time: 0.868
> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace
> Running with 10*40 (== 400) tasks.
> Time: 7.582
> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace -all
> Running with 10*40 (== 400) tasks.
> Time: 29.677

Interesting.  I guess what truly puzzles me is what serializes all of
the processes.  Even synchronize_rcu should sleep and thus let other
synchronize_rcu calls run in parallel.

Did you have HZ=100 in that kernel?  400 tasks at 100Hz all serialized
somehow and then doing synchronize_rcu at a jiffy each would account
for 4 seconds.  And the nsproxy certainly has a synchronize_rcu call.

The network namespace is comparatively heavy weight, at least in the
amount of code and other things it has to go through, so that would be
my prime suspect for those 29 seconds.  There are 2-4 synchronize_rcu
calls needed to put the loopback device.  Still we use
synchronize_rcu_expedited and that work should be out of line and all of
those calls should batch.

Mike, is this something you are looking at pursuing further?

I want to guess the serialization comes from waiting on children to be
reaped but the namespaces are all cleaned up in exit_notify() called
from do_exit() so that theory doesn't hold water.  The worst case
I can see is detach_pid from exit_signal running under the task list lock,
but nothing sleeps under that lock.  :(

So I am very puzzled why the code serializes itself in a way that leads
to those long delays.  Shrug.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04 14:13                       ` Eric W. Biederman
@ 2012-05-04 14:49                         ` Mike Galbraith
  2012-05-04 15:36                           ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04 14:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 07:13 -0700, Eric W. Biederman wrote: 
> Mike Galbraith <efault@gmx.de> writes:
> 
> > On Fri, 2012-05-04 at 00:55 -0700, Eric W. Biederman wrote:
> >
> >> CLONE_NEWUSER?  I presume you have applied my latest user namespace
> >> patches?  Otherwise you are running completely half baked code.
> >
> > I removed the CLONE_NEWUSER flag.
> >
> >> hackbench?  Which kernel are you running.  Hackbench in some kernels is
> >> really good at triggering cache ping-pong effects with pids, and creds.
> >
> > Not when pinned.  3.0 kernel without the debug stuff enabled in 3.4.git.
> > 
> > marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench
> > Running with 10*40 (== 400) tasks.
> > Time: 0.868
> > marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace
> > Running with 10*40 (== 400) tasks.
> > Time: 7.582
> > marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace -all
> > Running with 10*40 (== 400) tasks.
> > Time: 29.677
> 
> Interesting.  I guess what truly puzzles me is what serializes all of
> the processes.  Even synchronize_rcu should sleep and thus let other
> synchronize_rcu calls run in parallel.
> 
> Did you have HZ=100 in that kernel?  400 tasks at 100Hz all serialized
> somehow and then doing synchronize_rcu at a jiffy each would account
> for 4 seconds.  And the nsproxy certainly has a synchronize_rcu call.

HZ=250

> The network namespace is comparatively heavy weight, at least in the
> amount of code and other things it has to go through, so that would be
> my prime suspect for those 29 seconds.  There are 2-4 synchronize_rcu
> calls needed to put the loopback device.  Still we use
> synchronize_rcu_expedited and that work should be out of line and all of
> those calls should batch.
> 
> Mike, is this something you are looking at pursuing further?

Not really, but I can put it on my good intentions list.

> I want to guess the serialization comes from waiting on children to be
> reaped but the namespaces are all cleaned up in exit_notify() called
> from do_exit() so that theory doesn't hold water.  The worst case
> I can see is detach_pid from exit_signal running under the task list lock,
> but nothing sleeps under that lock.  :(

I'm up to my ears in zombies with several instances of the testcase
running in parallel, so I imagine it's the same with hackbench.

marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace& for i in 1 2 3 4 5 6 7 ; do ps ax|grep defunct|wc -l;sleep 1; done
[1] 29985
Running with 10*40 (== 400) tasks.
1
397
327
261
199
135
72
marge:/usr/local/tmp/starvation # Time: 7.675

-Mike


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04 14:49                         ` Mike Galbraith
@ 2012-05-04 15:36                           ` Eric W. Biederman
  2012-05-04 16:57                             ` Mike Galbraith
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-04 15:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

Mike Galbraith <efault@gmx.de> writes:

> On Fri, 2012-05-04 at 07:13 -0700, Eric W. Biederman wrote: 
>> Mike Galbraith <efault@gmx.de> writes:

>> Did you have HZ=100 in that kernel?  400 tasks at 100Hz all serialized
>> somehow and then doing synchronize_rcu at a jiffy each would account
>> for 4 seconds.  And the nsproxy certainly has a synchronize_rcu call.
>
> HZ=250

Rats.  Then none of my theories even approaches holding water.

>> The network namespace is comparatively heavy weight, at least in the
>> amount of code and other things it has to go through, so that would be
>> my prime suspect for those 29 seconds.  There are 2-4 synchronize_rcu
>> calls needed to put the loopback device.  Still we use
>> synchronize_rcu_expedited and that work should be out of line and all of
>> those calls should batch.
>> 
>> Mike, is this something you are looking at or pursuing further?
>
> Not really, but I can put it on my good intentions list.

About what I expected.  I just wanted to make certain I understood the
situation.

I will remember this as something weird and when I have time perhaps
I will investigate and track it.

>> I want to guess the serialization comes from waiting on children to be
>> reaped but the namespaces are all cleaned up in exit_notify() called
>> from do_exit() so that theory doesn't hold water.  The worst case
>> I can see is detach_pid from exit_signal running under the tasklist_lock,
>> but nothing sleeps under that lock.  :(
>
> I'm up to my ears in zombies with several instances of the testcase
> running in parallel, so I imagine it's the same with hackbench.

Oh interesting.

> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace& for i in 1 2 3 4 5 6 7 ; do ps ax|grep defunct|wc -l;sleep 1; done
> [1] 29985
> Running with 10*40 (== 400) tasks.
> 1
> 397
> 327
> 261
> 199
> 135
> 72
> marge:/usr/local/tmp/starvation # Time: 7.675

So if I read your output right the first second is spent running the
code and the rest of the time is spent reaping zombies.

So if this is all in reaping zombies it should be possible to add
go-faster stripes by setting exit_signal to -1 on these guys.  I know
you can do that for threads, and I seem to remember hackbench using
threads, so that might be interesting.

I wonder if it might be userspace scheduling madness.

What changes the speed of a waitpid loop?  Weird.  Very weird.

Eric


* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04 15:36                           ` Eric W. Biederman
@ 2012-05-04 16:57                             ` Mike Galbraith
  2012-05-04 20:29                               ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-04 16:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 08:36 -0700, Eric W. Biederman wrote: 
> Mike Galbraith <efault@gmx.de> writes:
> 
> > On Fri, 2012-05-04 at 07:13 -0700, Eric W. Biederman wrote: 
> >> Mike Galbraith <efault@gmx.de> writes:
> 
> >> Did you have HZ=100 in that kernel?  400 tasks at 100Hz all serialized
> >> somehow and then doing synchronize_rcu at a jiffy each would account
> >> for 4 seconds.  And the nsproxy certainly has a synchronize_rcu call.
> >
> > HZ=250
> 
> Rats.  Then none of my theories even approaches holding water.
> 
> >> The network namespace is comparatively heavy weight, at least in the
> >> amount of code and other things it has to go through, so that would be
> >> my prime suspect for those 29 seconds.  There are 2-4 synchronize_rcu
> >> calls needed to put the loopback device.  Still we use
> >> synchronize_rcu_expedited and that work should be out of line and all of
> >> those calls should batch.
> >> 
> >> Mike, is this something you are looking at or pursuing further?
> >
> > Not really, but I can put it on my good intentions list.
> 
> About what I expected.  I just wanted to make certain I understood the
> situation.
> 
> I will remember this as something weird and when I have time perhaps
> I will investigate and track it.
> 
> >> I want to guess the serialization comes from waiting on children to be
> >> reaped but the namespaces are all cleaned up in exit_notify() called
> >> from do_exit() so that theory doesn't hold water.  The worst case
> >> I can see is detach_pid from exit_signal running under the tasklist_lock,
> >> but nothing sleeps under that lock.  :(
> >
> > I'm up to my ears in zombies with several instances of the testcase
> > running in parallel, so I imagine it's the same with hackbench.
> 
> Oh interesting.
> 
> > marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace& for i in 1 2 3 4 5 6 7 ; do ps ax|grep defunct|wc -l;sleep 1; done
> > [1] 29985
> > Running with 10*40 (== 400) tasks.
> > 1
> > 397
> > 327
> > 261
> > 199
> > 135
> > 72
> > marge:/usr/local/tmp/starvation # Time: 7.675
> 
> So if I read your output right the first second is spent running the
> code and the rest of the time is spent reaping zombies.

The distance between these is mighty fishy.

marge:~ # grep 'signalfd_cleanup ' /trace2
          vsftpd-9628  [003] ....   712.571961: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.575717: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.579698: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.587734: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.591671: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.595695: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.599685: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.603680: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.607682: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.611692: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.615740: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.619705: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.623730: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.627748: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.631712: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.635741: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.643683: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.647685: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.651691: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.655742: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.659738: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.663738: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.667756: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.671693: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.679682: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.683694: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.687750: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.691738: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.695751: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.699740: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.703736: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.707757: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.711685: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.715689: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.719694: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.723742: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.727752: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.731695: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.739687: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.743688: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.747697: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.751689: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.755688: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.759699: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.763705: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.767754: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.771702: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.775749: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.775884: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.783754: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.787754: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.791763: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.795764: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.799755: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.807768: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.835723: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.843695: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.847752: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.851694: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.855711: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.859704: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.863751: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.867754: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.871753: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.875765: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.879706: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.883696: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.887697: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.891711: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.898493: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.911740: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.927755: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.955754: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.975771: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   712.995826: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.003739: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.003920: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.011710: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.015831: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.023827: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.031694: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.035715: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.039714: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.043816: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.047726: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.051818: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.055724: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.059814: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.063725: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.067824: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.071825: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.075726: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.079709: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.083814: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.087850: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.095859: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.099826: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.103830: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.107726: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.111723: signalfd_cleanup <-__cleanup_sighand
          vsftpd-9628  [003] d...   713.115874: signalfd_cleanup <-__cleanup_sighand





* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04 16:57                             ` Mike Galbraith
@ 2012-05-04 20:29                               ` Eric W. Biederman
  2012-05-05  5:56                                 ` Mike Galbraith
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-04 20:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

Mike Galbraith <efault@gmx.de> writes:

> On Fri, 2012-05-04 at 08:36 -0700, Eric W. Biederman wrote: 
>> Mike Galbraith <efault@gmx.de> writes:
>> 
>> > On Fri, 2012-05-04 at 07:13 -0700, Eric W. Biederman wrote: 
>> >> Mike Galbraith <efault@gmx.de> writes:
>> 
>> >> Did you have HZ=100 in that kernel?  400 tasks at 100Hz all serialized
>> >> somehow and then doing synchronize_rcu at a jiffy each would account
>> >> for 4 seconds.  And the nsproxy certainly has a synchronize_rcu call.
>> >
>> > HZ=250
>> 
>> Rats.  Then none of my theories even approaches holding water.
>> 
>> >> The network namespace is comparatively heavy weight, at least in the
>> >> amount of code and other things it has to go through, so that would be
>> >> my prime suspect for those 29 seconds.  There are 2-4 synchronize_rcu
>> >> calls needed to put the loopback device.  Still we use
>> >> synchronize_rcu_expedited and that work should be out of line and all of
>> >> those calls should batch.
>> >> 
>> >> Mike, is this something you are looking at or pursuing further?
>> >
>> > Not really, but I can put it on my good intentions list.
>> 
>> About what I expected.  I just wanted to make certain I understood the
>> situation.
>> 
>> I will remember this as something weird and when I have time perhaps
>> I will investigate and track it.
>> 
>> >> I want to guess the serialization comes from waiting on children to be
>> >> reaped but the namespaces are all cleaned up in exit_notify() called
>> >> from do_exit() so that theory doesn't hold water.  The worst case
>> >> I can see is detach_pid from exit_signal running under the tasklist_lock,
>> >> but nothing sleeps under that lock.  :(
>> >
>> > I'm up to my ears in zombies with several instances of the testcase
>> > running in parallel, so I imagine it's the same with hackbench.
>> 
>> Oh interesting.
>> 
>> > marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace& for i in 1 2 3 4 5 6 7 ; do ps ax|grep defunct|wc -l;sleep 1; done
>> > [1] 29985
>> > Running with 10*40 (== 400) tasks.
>> > 1
>> > 397
>> > 327
>> > 261
>> > 199
>> > 135
>> > 72
>> > marge:/usr/local/tmp/starvation # Time: 7.675
>> 
>> So if I read your output right the first second is spent running the
>> code and the rest of the time is spent reaping zombies.
>
> The distance between these is mighty fishy.

Yes.  1 to 2 jiffies per iteration (a jiffy is 4ms at HZ=250, which
matches the ~4-8ms spacing of those trace stamps).

That probably puts us in:
do_wait()
   do_wait_thread()
      wait_consider_task()
         wait_task_zombie()
            release_task()

The only parts that I see that are clearly outside of the tasklist_lock are:
put_user in wait_task_zombie 
proc_flush_task in release_task
release_thread in release_task

Of those, if I had to take a blind guess, I would guess something in
proc_flush_task, possibly kern_unmount.  That is the only bit that
should be namespace-unique.

But shrug.  I have looked and I don't see anything obvious in those
code paths.

The only other possibilities are schedule and signal delivery in the
syscall return path.  Perhaps there is a kernel thread or a work queue
or something running on the same cpu and using all of the time, and
our reaper thread only gets scheduled occasionally.  Or perhaps
it is something peculiar with the signal delivery logic.

Shrug.  I have skimmed through all of that code and I don't see anything
obvious.  I guess it would take a few more data points to figure out
where we are sleeping for a jiffy or two while we are reaping children.

Eric

> marge:~ # grep 'signalfd_cleanup ' /trace2
> [~100 trace lines snipped -- quoted in full in the previous message]


* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-04 20:29                               ` Eric W. Biederman
@ 2012-05-05  5:56                                 ` Mike Galbraith
  2012-05-05  6:08                                   ` Mike Galbraith
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-05  5:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Fri, 2012-05-04 at 13:29 -0700, Eric W. Biederman wrote:

> Shrug.  I have skimmed through all of that code and I don't see anything
> obvious.  I guess it would take a few more data points to figure out
> where we are sleeping for a jiffy or two while we are reaping children.

GrimReaper meets SynchroZilla

          vsftpd-7977  [003] ....   577.171463: sys_wait4 <-system_call_fastpath
          vsftpd-7977  [003] ....   577.171463: do_wait <-sys_wait4
          vsftpd-7977  [003] ....   577.171463: add_wait_queue <-do_wait
          vsftpd-7977  [003] ....   577.171463: _raw_spin_lock_irqsave <-add_wait_queue
          vsftpd-7977  [003] d...   577.171463: _raw_spin_unlock_irqrestore <-add_wait_queue
          vsftpd-7977  [003] ....   577.171464: _raw_read_lock <-do_wait
          vsftpd-7977  [003] ....   577.171464: wait_consider_task <-do_wait
          vsftpd-7977  [003] ....   577.171464: wait_consider_task <-do_wait
          vsftpd-7977  [003] ....   577.171464: wait_consider_task <-do_wait
          vsftpd-7977  [003] ....   577.171465: __task_pid_nr_ns <-wait_consider_task
          vsftpd-7977  [003] ....   577.171465: pid_nr_ns <-__task_pid_nr_ns
          vsftpd-7977  [003] ....   577.171465: thread_group_times <-wait_consider_task
          vsftpd-7977  [003] ....   577.171466: thread_group_cputime <-thread_group_times
          vsftpd-7977  [003] ....   577.171466: task_sched_runtime <-thread_group_cputime
          vsftpd-7977  [003] ....   577.171466: task_rq_lock <-task_sched_runtime
          vsftpd-7977  [003] ....   577.171466: _raw_spin_lock_irqsave <-task_rq_lock
          vsftpd-7977  [003] d...   577.171467: _raw_spin_lock <-task_rq_lock
          vsftpd-7977  [003] d...   577.171467: _raw_spin_unlock_irqrestore <-task_sched_runtime
          vsftpd-7977  [003] ....   577.171467: nsecs_to_jiffies <-thread_group_times
          vsftpd-7977  [003] ....   577.171467: _raw_spin_lock_irq <-wait_consider_task
          vsftpd-7977  [003] ....   577.171468: release_task <-wait_consider_task
          vsftpd-7977  [003] ....   577.171468: proc_flush_task <-release_task
          vsftpd-7977  [003] ....   577.171470: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.171470: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171470: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171470: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.171471: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.171471: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171471: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171472: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.171472: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.171472: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171472: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171473: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.171473: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.171473: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171473: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.171473: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.171474: pid_ns_release_proc <-proc_flush_task
          vsftpd-7977  [003] ....   577.171474: kern_unmount <-pid_ns_release_proc
          vsftpd-7977  [003] ....   577.171474: mnt_make_shortterm <-kern_unmount
          vsftpd-7977  [003] ....   577.171474: vfsmount_lock_global_lock_online <-mnt_make_shortterm
          vsftpd-7977  [003] ....   577.171474: _raw_spin_lock <-vfsmount_lock_global_lock_online
          vsftpd-7977  [003] ....   577.171475: vfsmount_lock_global_unlock_online <-mnt_make_shortterm
          vsftpd-7977  [003] ....   577.171475: mntput <-kern_unmount
          vsftpd-7977  [003] ....   577.171475: mntput_no_expire <-mntput
          vsftpd-7977  [003] ....   577.171476: vfsmount_lock_local_lock <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171476: vfsmount_lock_global_lock_online <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171476: _raw_spin_lock <-vfsmount_lock_global_lock_online
          vsftpd-7977  [003] ....   577.171476: mnt_get_count <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171477: vfsmount_lock_global_unlock_online <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171477: mnt_get_writers.isra.12 <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171477: __fsnotify_vfsmount_delete <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171477: fsnotify_clear_marks_by_mount <-__fsnotify_vfsmount_delete
          vsftpd-7977  [003] ....   577.171478: _raw_spin_lock <-fsnotify_clear_marks_by_mount
          vsftpd-7977  [003] ....   577.171478: dput <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171478: _raw_spin_lock <-dput
          vsftpd-7977  [003] ....   577.171478: free_vfsmnt <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171478: kfree <-free_vfsmnt
          vsftpd-7977  [003] ....   577.171479: __phys_addr <-kfree
          vsftpd-7977  [003] ....   577.171479: __slab_free <-kfree
          vsftpd-7977  [003] ....   577.171479: free_debug_processing <-__slab_free
          vsftpd-7977  [003] d...   577.171479: check_slab <-free_debug_processing
          vsftpd-7977  [003] d...   577.171479: slab_pad_check.part.42 <-check_slab
          vsftpd-7977  [003] d...   577.171480: on_freelist <-free_debug_processing
          vsftpd-7977  [003] d...   577.171480: check_object <-free_debug_processing
          vsftpd-7977  [003] d...   577.171480: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.171480: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.171481: set_track <-free_debug_processing
          vsftpd-7977  [003] d...   577.171481: dump_trace <-save_stack_trace
          vsftpd-7977  [003] d...   577.171481: print_context_stack <-dump_trace
          vsftpd-7977  [003] d...   577.171484: init_object <-free_debug_processing
          vsftpd-7977  [003] ....   577.171485: _raw_spin_lock_irqsave <-__slab_free
          vsftpd-7977  [003] d...   577.171485: _raw_spin_unlock_irqrestore <-__slab_free
          vsftpd-7977  [003] ....   577.171485: mnt_free_id.isra.20 <-free_vfsmnt
          vsftpd-7977  [003] ....   577.171485: _raw_spin_lock <-mnt_free_id.isra.20
          vsftpd-7977  [003] ....   577.171486: free_percpu <-free_vfsmnt
          vsftpd-7977  [003] ....   577.171486: _raw_spin_lock_irqsave <-free_percpu
          vsftpd-7977  [003] d...   577.171486: pcpu_free_area <-free_percpu
          vsftpd-7977  [003] d...   577.171487: pcpu_chunk_slot <-pcpu_free_area
          vsftpd-7977  [003] d...   577.171487: pcpu_chunk_relocate <-pcpu_free_area
          vsftpd-7977  [003] d...   577.171488: pcpu_chunk_slot <-pcpu_chunk_relocate
          vsftpd-7977  [003] d...   577.171488: _raw_spin_unlock_irqrestore <-free_percpu
          vsftpd-7977  [003] ....   577.171488: kmem_cache_free <-free_vfsmnt
          vsftpd-7977  [003] ....   577.171488: __phys_addr <-kmem_cache_free
          vsftpd-7977  [003] ....   577.171489: __slab_free <-kmem_cache_free
          vsftpd-7977  [003] ....   577.171489: free_debug_processing <-__slab_free
          vsftpd-7977  [003] d...   577.171489: check_slab <-free_debug_processing
          vsftpd-7977  [003] d...   577.171489: slab_pad_check.part.42 <-check_slab
          vsftpd-7977  [003] d...   577.171489: on_freelist <-free_debug_processing
          vsftpd-7977  [003] d...   577.171490: check_object <-free_debug_processing
          vsftpd-7977  [003] d...   577.171490: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.171490: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.171491: set_track <-free_debug_processing
          vsftpd-7977  [003] d...   577.171491: dump_trace <-save_stack_trace
          vsftpd-7977  [003] d...   577.171491: print_context_stack <-dump_trace
          vsftpd-7977  [003] d...   577.171494: init_object <-free_debug_processing
          vsftpd-7977  [003] ....   577.171495: deactivate_super <-mntput_no_expire
          vsftpd-7977  [003] ....   577.171495: down_write <-deactivate_super
          vsftpd-7977  [003] ....   577.171495: _cond_resched <-down_write
          vsftpd-7977  [003] ....   577.171495: deactivate_locked_super <-deactivate_super
          vsftpd-7977  [003] ....   577.171496: proc_kill_sb <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.171496: kill_anon_super <-proc_kill_sb
          vsftpd-7977  [003] ....   577.171496: generic_shutdown_super <-kill_anon_super
          vsftpd-7977  [003] ....   577.171496: shrink_dcache_for_umount <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.171496: down_read_trylock <-shrink_dcache_for_umount
          vsftpd-7977  [003] ....   577.171497: shrink_dcache_for_umount_subtree <-shrink_dcache_for_umount
          vsftpd-7977  [003] ....   577.171497: dentry_lru_prune <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.171498: __d_shrink <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.171498: iput <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.171498: _raw_spin_lock <-_atomic_dec_and_lock
          vsftpd-7977  [003] ....   577.171499: generic_delete_inode <-iput
          vsftpd-7977  [003] ....   577.171499: evict <-iput
          vsftpd-7977  [003] ....   577.171499: _raw_spin_lock <-evict
          vsftpd-7977  [003] ....   577.171500: proc_evict_inode <-evict
          vsftpd-7977  [003] ....   577.171500: truncate_inode_pages <-proc_evict_inode
          vsftpd-7977  [003] ....   577.171500: truncate_inode_pages_range <-truncate_inode_pages
          vsftpd-7977  [003] ....   577.171501: end_writeback <-proc_evict_inode
          vsftpd-7977  [003] ....   577.171501: _cond_resched <-end_writeback
          vsftpd-7977  [003] ....   577.171501: _raw_spin_lock_irq <-end_writeback
          vsftpd-7977  [003] ....   577.171501: _cond_resched <-end_writeback
          vsftpd-7977  [003] ....   577.171501: put_pid <-proc_evict_inode
          vsftpd-7977  [003] ....   577.171502: put_pid: put_pid: NULL
          vsftpd-7977  [003] ....   577.171502: pde_put <-proc_evict_inode
          vsftpd-7977  [003] ....   577.171502: __remove_inode_hash <-evict
          vsftpd-7977  [003] ....   577.171503: _raw_spin_lock <-__remove_inode_hash
          vsftpd-7977  [003] ....   577.171503: _raw_spin_lock <-__remove_inode_hash
          vsftpd-7977  [003] ....   577.171503: _raw_spin_lock <-evict
          vsftpd-7977  [003] ....   577.171504: wake_up_bit <-evict
          vsftpd-7977  [003] ....   577.171504: bit_waitqueue <-wake_up_bit
          vsftpd-7977  [003] ....   577.171504: __phys_addr <-bit_waitqueue
          vsftpd-7977  [003] ....   577.171504: __wake_up_bit <-wake_up_bit
          vsftpd-7977  [003] ....   577.171504: destroy_inode <-evict
          vsftpd-7977  [003] ....   577.171505: __destroy_inode <-destroy_inode
          vsftpd-7977  [003] ....   577.171505: inode_has_buffers <-__destroy_inode
          vsftpd-7977  [003] ....   577.171505: __fsnotify_inode_delete <-__destroy_inode
          vsftpd-7977  [003] ....   577.171505: fsnotify_clear_marks_by_inode <-__fsnotify_inode_delete
          vsftpd-7977  [003] ....   577.171505: _raw_spin_lock <-fsnotify_clear_marks_by_inode
          vsftpd-7977  [003] ....   577.171506: proc_destroy_inode <-destroy_inode
          vsftpd-7977  [003] ....   577.171506: call_rcu_sched <-proc_destroy_inode
          vsftpd-7977  [003] ....   577.171506: __call_rcu <-call_rcu_sched
          vsftpd-7977  [003] ....   577.171506: d_free <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.171507: __d_free <-d_free
          vsftpd-7977  [003] ....   577.171507: kmem_cache_free <-__d_free
          vsftpd-7977  [003] ....   577.171507: __phys_addr <-kmem_cache_free
          vsftpd-7977  [003] ....   577.171507: __slab_free <-kmem_cache_free
          vsftpd-7977  [003] ....   577.171507: free_debug_processing <-__slab_free
          vsftpd-7977  [003] d...   577.171508: check_slab <-free_debug_processing
          vsftpd-7977  [003] d...   577.171508: slab_pad_check.part.42 <-check_slab
          vsftpd-7977  [003] d...   577.171508: on_freelist <-free_debug_processing
          vsftpd-7977  [003] d...   577.171509: check_object <-free_debug_processing
          vsftpd-7977  [003] d...   577.171509: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.171510: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.171510: set_track <-free_debug_processing
          vsftpd-7977  [003] d...   577.171510: dump_trace <-save_stack_trace
          vsftpd-7977  [003] d...   577.171510: print_context_stack <-dump_trace
          vsftpd-7977  [003] d...   577.171514: init_object <-free_debug_processing
          vsftpd-7977  [003] ....   577.171515: sync_filesystem <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.171515: __sync_filesystem <-sync_filesystem
          vsftpd-7977  [003] ....   577.171515: __sync_filesystem <-sync_filesystem
          vsftpd-7977  [003] ....   577.171516: fsnotify_unmount_inodes <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.171516: _raw_spin_lock <-fsnotify_unmount_inodes
          vsftpd-7977  [003] ....   577.171516: evict_inodes <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.171516: _raw_spin_lock <-evict_inodes
          vsftpd-7977  [003] ....   577.171516: dispose_list <-evict_inodes
          vsftpd-7977  [003] ....   577.171517: _raw_spin_lock <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.171517: up_write <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.171517: free_anon_bdev <-kill_anon_super
          vsftpd-7977  [003] ....   577.171517: _raw_spin_lock <-free_anon_bdev
          vsftpd-7977  [003] ....   577.171518: proc_kill_sb: put_pid_ns: 0xffff8801dc56f320 count:2->1
          vsftpd-7977  [003] ....   577.171518: unregister_shrinker <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.171518: down_write <-unregister_shrinker
          vsftpd-7977  [003] ....   577.171518: _cond_resched <-down_write
          vsftpd-7977  [003] ....   577.171519: up_write <-unregister_shrinker
          vsftpd-7977  [003] ....   577.171519: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.171519: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.171519: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.171519: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.171520: _cond_resched <-mutex_lock
          vsftpd-7977  [003] ....   577.171520: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.171520: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.171520: smp_call_function <-on_each_cpu
          vsftpd-7977  [003] ....   577.171521: smp_call_function_many <-smp_call_function
          vsftpd-7977  [003] ....   577.171521: _raw_spin_lock_irqsave <-smp_call_function_many
          vsftpd-7977  [003] d...   577.171521: _raw_spin_unlock_irqrestore <-smp_call_function_many
          vsftpd-7977  [003] ....   577.171522: native_send_call_func_ipi <-smp_call_function_many
          vsftpd-7977  [003] ....   577.171522: flat_send_IPI_allbutself <-native_send_call_func_ipi
          vsftpd-7977  [003] d...   577.171532: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.171532: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] d...   577.171533: __call_rcu <-call_rcu_sched
          vsftpd-7977  [003] ....   577.171533: wait_for_completion <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.171533: wait_for_common <-wait_for_completion
          vsftpd-7977  [003] ....   577.171533: _cond_resched <-wait_for_common
          vsftpd-7977  [003] ....   577.171533: _raw_spin_lock_irq <-wait_for_common
          vsftpd-7977  [003] ....   577.171534: schedule_timeout <-wait_for_common
          vsftpd-7977  [003] ....   577.171534: schedule <-schedule_timeout
          vsftpd-7977  [003] ....   577.171534: __schedule <-schedule




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-05  5:56                                 ` Mike Galbraith
@ 2012-05-05  6:08                                   ` Mike Galbraith
  2012-05-05  7:12                                     ` Mike Galbraith
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2012-05-05  6:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling

On Sat, 2012-05-05 at 07:56 +0200, Mike Galbraith wrote: 
> On Fri, 2012-05-04 at 13:29 -0700, Eric W. Biederman wrote:
> 
> > Shrug.  I have skimmed through all of that code and I don't see anything
> > obvious.  I guess it would take a few more data points to figure out
> > where we are sleeping for a jiffy or two while we are reaping children.

egrep 'synchronize|rcu_barrier' /trace 

          vsftpd-7981  [003] ....   577.164997: synchronize_sched <-switch_task_namespaces
          vsftpd-7981  [003] ....   577.164998: _cond_resched <-synchronize_sched
          vsftpd-7981  [003] ....   577.164998: wait_rcu_gp <-synchronize_sched
          vsftpd-7982  [003] ....   577.166583: synchronize_sched <-switch_task_namespaces
          vsftpd-7982  [003] ....   577.166583: _cond_resched <-synchronize_sched
          vsftpd-7982  [003] ....   577.166584: wait_rcu_gp <-synchronize_sched
          vsftpd-7983  [003] ....   577.167128: synchronize_sched <-switch_task_namespaces
          vsftpd-7983  [003] ....   577.167129: _cond_resched <-synchronize_sched
          vsftpd-7983  [003] ....   577.167129: wait_rcu_gp <-synchronize_sched
          vsftpd-7980  [003] ....   577.167678: synchronize_sched <-switch_task_namespaces
          vsftpd-7980  [003] ....   577.167679: _cond_resched <-synchronize_sched
          vsftpd-7980  [003] ....   577.167679: wait_rcu_gp <-synchronize_sched
          vsftpd-7984  [003] ....   577.168232: synchronize_sched <-switch_task_namespaces
          vsftpd-7984  [003] ....   577.168232: _cond_resched <-synchronize_sched
          vsftpd-7984  [003] ....   577.168233: wait_rcu_gp <-synchronize_sched
          vsftpd-7979  [003] ....   577.168800: synchronize_sched <-switch_task_namespaces
          vsftpd-7979  [003] ....   577.168800: _cond_resched <-synchronize_sched
          vsftpd-7979  [003] ....   577.168801: wait_rcu_gp <-synchronize_sched
          vsftpd-7985  [003] ....   577.169373: synchronize_sched <-switch_task_namespaces
          vsftpd-7985  [003] ....   577.169373: _cond_resched <-synchronize_sched
          vsftpd-7985  [003] ....   577.169373: wait_rcu_gp <-synchronize_sched
          vsftpd-7986  [003] ....   577.169946: synchronize_sched <-switch_task_namespaces
          vsftpd-7986  [003] ....   577.169947: _cond_resched <-synchronize_sched
          vsftpd-7986  [003] ....   577.169947: wait_rcu_gp <-synchronize_sched
          vsftpd-7987  [003] ....   577.170519: synchronize_sched <-switch_task_namespaces
          vsftpd-7987  [003] ....   577.170519: _cond_resched <-synchronize_sched
          vsftpd-7987  [003] ....   577.170519: wait_rcu_gp <-synchronize_sched
          vsftpd-7978  [003] ....   577.171091: synchronize_sched <-switch_task_namespaces
          vsftpd-7978  [003] ....   577.171091: _cond_resched <-synchronize_sched
          vsftpd-7978  [003] ....   577.171091: wait_rcu_gp <-synchronize_sched
          vsftpd-7977  [003] ....   577.171519: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.171519: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.171519: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.171519: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.171520: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.171520: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.171532: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.171532: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.171533: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.171691: rcu_barrier_callback <-__rcu_process_callbacks
          vsftpd-7977  [003] ....   577.176443: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.176552: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.176552: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.176553: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.176553: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.176553: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.176554: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.176561: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.176561: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.176562: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.176730: rcu_barrier_callback <-__rcu_process_callbacks
          vsftpd-7977  [003] ....   577.180448: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.180553: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.180553: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.180554: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.180554: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.180555: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.180555: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.180561: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.180562: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.180563: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.180767: rcu_barrier_callback <-__rcu_process_callbacks
          vsftpd-7977  [003] ....   577.184450: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.184565: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.184566: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.184566: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.184566: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.184566: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.184567: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.184573: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.184573: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.184573: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.184733: rcu_barrier_callback <-__rcu_process_callbacks
          vsftpd-7977  [003] ....   577.188430: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.188536: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.188536: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.188536: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.188536: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.188537: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.188537: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.188555: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.188555: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.188555: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.188715: rcu_barrier_callback <-__rcu_process_callbacks
          vsftpd-7977  [003] ....   577.192421: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192527: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.192527: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.192527: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.192527: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192528: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192528: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.192546: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.192546: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.192547: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.192731: rcu_barrier_callback <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.192731: complete <-rcu_barrier_callback
          vsftpd-7977  [003] ....   577.192750: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192821: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.192821: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.192821: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.192821: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192822: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192822: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.192832: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.192833: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.192833: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.196559: rcu_barrier_callback <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196560: complete <-rcu_barrier_callback
          vsftpd-7977  [003] ....   577.196571: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.196642: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.196642: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.196642: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.196642: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.196643: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.196643: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.196653: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.196654: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.196654: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.200577: rcu_barrier_callback <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.200577: complete <-rcu_barrier_callback
          vsftpd-7977  [003] ....   577.200589: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.200659: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.200659: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.200659: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.200659: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.200660: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.200660: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.200671: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.200671: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.200672: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.204548: rcu_barrier_callback <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.204548: complete <-rcu_barrier_callback
          vsftpd-7977  [003] ....   577.204559: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.204630: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.204631: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.204631: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.204631: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.204631: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.204631: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] d...   577.204642: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.204642: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] ....   577.204643: wait_for_completion <-_rcu_barrier.isra.31
     ksoftirqd/3-16    [003] ..s.   577.208511: rcu_barrier_callback <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.208511: complete <-rcu_barrier_callback
          vsftpd-7977  [003] ....   577.208523: mutex_unlock <-_rcu_barrier.isra.31



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-05  6:08                                   ` Mike Galbraith
@ 2012-05-05  7:12                                     ` Mike Galbraith
  2012-05-05 11:37                                       ` Eric W. Biederman
  2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
  0 siblings, 2 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-05-05  7:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling, Paul E. McKenney

On Sat, 2012-05-05 at 08:08 +0200, Mike Galbraith wrote:

> egrep 'synchronize|rcu_barrier' /trace 
> 
>           vsftpd-7981  [003] ....   577.164997: synchronize_sched <-switch_task_namespaces
>           vsftpd-7981  [003] ....   577.164998: _cond_resched <-synchronize_sched
>           vsftpd-7981  [003] ....   577.164998: wait_rcu_gp <-synchronize_sched
>           vsftpd-7982  [003] ....   577.166583: synchronize_sched <-switch_task_namespaces
>           vsftpd-7982  [003] ....   577.166583: _cond_resched <-synchronize_sched

> vsftpd-7977  [003] ....   577.171519: rcu_barrier_sched <-rcu_barrier
>           vsftpd-7977  [003] ....   577.171519: _rcu_barrier.isra.31 <-rcu_barrier_sched
>           vsftpd-7977  [003] ....   577.171519: mutex_lock <-_rcu_barrier.isra.31
>           vsftpd-7977  [003] ....   577.171520: __init_waitqueue_head <-_rcu_barrier.isra.31
>           vsftpd-7977  [003] ....   577.171520: on_each_cpu <-_rcu_barrier.isra.31
>           vsftpd-7977  [003] d...   577.171532: rcu_barrier_func <-on_each_cpu
>           vsftpd-7977  [003] d...   577.171532: call_rcu_sched <-rcu_barrier_func
>           vsftpd-7977  [003] ....   577.171533: wait_for_completion <-_rcu_barrier.isra.31
>      ksoftirqd/3-16    [003] ..s.   577.171691: rcu_barrier_callback <-__rcu_process_callbacks
>           vsftpd-7977  [003] ....   577.176443: mutex_unlock <-_rcu_barrier.isra.31
...

Ok, so CLONE_NEWPID | SIGCHLD + waitpid is a bad idea given this extreme
unmount synchronization, but why does it take four softirqs?  It seems
this could have gone a lot faster.

          vsftpd-7977  [003] ....   577.192773: do_wait <-sys_wait4
          vsftpd-7977  [003] ....   577.192773: add_wait_queue <-do_wait
          vsftpd-7977  [003] ....   577.192773: _raw_spin_lock_irqsave <-add_wait_queue
          vsftpd-7977  [003] d...   577.192773: _raw_spin_unlock_irqrestore <-add_wait_queue
          vsftpd-7977  [003] ....   577.192774: _raw_read_lock <-do_wait
          vsftpd-7977  [003] ....   577.192774: wait_consider_task <-do_wait
          vsftpd-7977  [003] ....   577.192774: __task_pid_nr_ns <-wait_consider_task
          vsftpd-7977  [003] ....   577.192774: pid_nr_ns <-__task_pid_nr_ns
          vsftpd-7977  [003] ....   577.192774: thread_group_times <-wait_consider_task
          vsftpd-7977  [003] ....   577.192775: thread_group_cputime <-thread_group_times
          vsftpd-7977  [003] ....   577.192775: task_sched_runtime <-thread_group_cputime
          vsftpd-7977  [003] ....   577.192775: task_rq_lock <-task_sched_runtime
          vsftpd-7977  [003] ....   577.192775: _raw_spin_lock_irqsave <-task_rq_lock
          vsftpd-7977  [003] d...   577.192775: _raw_spin_lock <-task_rq_lock
          vsftpd-7977  [003] d...   577.192776: _raw_spin_unlock_irqrestore <-task_sched_runtime
          vsftpd-7977  [003] ....   577.192776: nsecs_to_jiffies <-thread_group_times
          vsftpd-7977  [003] ....   577.192776: _raw_spin_lock_irq <-wait_consider_task
          vsftpd-7977  [003] ....   577.192776: release_task <-wait_consider_task
          vsftpd-7977  [003] ....   577.192776: proc_flush_task <-release_task
          vsftpd-7977  [003] ....   577.192777: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.192777: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192777: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192777: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.192778: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.192778: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192778: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192778: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.192779: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.192779: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192779: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192780: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.192780: d_hash_and_lookup <-proc_flush_task
          vsftpd-7977  [003] ....   577.192780: full_name_hash <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192780: d_lookup <-d_hash_and_lookup
          vsftpd-7977  [003] ....   577.192780: __d_lookup <-d_lookup
          vsftpd-7977  [003] ....   577.192781: pid_ns_release_proc <-proc_flush_task
          vsftpd-7977  [003] ....   577.192781: kern_unmount <-pid_ns_release_proc
          vsftpd-7977  [003] ....   577.192781: mnt_make_shortterm <-kern_unmount
          vsftpd-7977  [003] ....   577.192781: vfsmount_lock_global_lock_online <-mnt_make_shortterm
          vsftpd-7977  [003] ....   577.192781: _raw_spin_lock <-vfsmount_lock_global_lock_online
          vsftpd-7977  [003] ....   577.192782: vfsmount_lock_global_unlock_online <-mnt_make_shortterm
          vsftpd-7977  [003] ....   577.192782: mntput <-kern_unmount
          vsftpd-7977  [003] ....   577.192782: mntput_no_expire <-mntput
          vsftpd-7977  [003] ....   577.192782: vfsmount_lock_local_lock <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192782: vfsmount_lock_global_lock_online <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192783: _raw_spin_lock <-vfsmount_lock_global_lock_online
          vsftpd-7977  [003] ....   577.192783: mnt_get_count <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192783: vfsmount_lock_global_unlock_online <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192784: mnt_get_writers.isra.12 <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192784: __fsnotify_vfsmount_delete <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192784: fsnotify_clear_marks_by_mount <-__fsnotify_vfsmount_delete
          vsftpd-7977  [003] ....   577.192784: _raw_spin_lock <-fsnotify_clear_marks_by_mount
          vsftpd-7977  [003] ....   577.192784: dput <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192784: _raw_spin_lock <-dput
          vsftpd-7977  [003] ....   577.192785: free_vfsmnt <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192785: kfree <-free_vfsmnt
          vsftpd-7977  [003] ....   577.192785: __phys_addr <-kfree
          vsftpd-7977  [003] ....   577.192785: __slab_free <-kfree
          vsftpd-7977  [003] ....   577.192785: free_debug_processing <-__slab_free
          vsftpd-7977  [003] d...   577.192786: check_slab <-free_debug_processing
          vsftpd-7977  [003] d...   577.192786: slab_pad_check.part.42 <-check_slab
          vsftpd-7977  [003] d...   577.192786: on_freelist <-free_debug_processing
          vsftpd-7977  [003] d...   577.192786: check_object <-free_debug_processing
          vsftpd-7977  [003] d...   577.192787: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.192787: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.192787: set_track <-free_debug_processing
          vsftpd-7977  [003] d...   577.192787: dump_trace <-save_stack_trace
          vsftpd-7977  [003] d...   577.192788: print_context_stack <-dump_trace
          vsftpd-7977  [003] d...   577.192791: init_object <-free_debug_processing
          vsftpd-7977  [003] ....   577.192791: _raw_spin_lock_irqsave <-__slab_free
          vsftpd-7977  [003] d...   577.192791: _raw_spin_unlock_irqrestore <-__slab_free
          vsftpd-7977  [003] ....   577.192792: mnt_free_id.isra.20 <-free_vfsmnt
          vsftpd-7977  [003] ....   577.192792: _raw_spin_lock <-mnt_free_id.isra.20
          vsftpd-7977  [003] ....   577.192792: free_percpu <-free_vfsmnt
          vsftpd-7977  [003] ....   577.192792: _raw_spin_lock_irqsave <-free_percpu
          vsftpd-7977  [003] d...   577.192792: pcpu_free_area <-free_percpu
          vsftpd-7977  [003] d...   577.192793: pcpu_chunk_slot <-pcpu_free_area
          vsftpd-7977  [003] d...   577.192793: pcpu_chunk_relocate <-pcpu_free_area
          vsftpd-7977  [003] d...   577.192794: pcpu_chunk_slot <-pcpu_chunk_relocate
          vsftpd-7977  [003] d...   577.192794: _raw_spin_unlock_irqrestore <-free_percpu
          vsftpd-7977  [003] ....   577.192794: kmem_cache_free <-free_vfsmnt
          vsftpd-7977  [003] ....   577.192794: __phys_addr <-kmem_cache_free
          vsftpd-7977  [003] ....   577.192794: __slab_free <-kmem_cache_free
          vsftpd-7977  [003] ....   577.192794: free_debug_processing <-__slab_free
          vsftpd-7977  [003] d...   577.192795: check_slab <-free_debug_processing
          vsftpd-7977  [003] d...   577.192795: slab_pad_check.part.42 <-check_slab
          vsftpd-7977  [003] d...   577.192795: on_freelist <-free_debug_processing
          vsftpd-7977  [003] d...   577.192796: check_object <-free_debug_processing
          vsftpd-7977  [003] d...   577.192796: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.192796: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.192796: set_track <-free_debug_processing
          vsftpd-7977  [003] d...   577.192797: dump_trace <-save_stack_trace
          vsftpd-7977  [003] d...   577.192797: print_context_stack <-dump_trace
          vsftpd-7977  [003] d...   577.192800: init_object <-free_debug_processing
          vsftpd-7977  [003] ....   577.192800: deactivate_super <-mntput_no_expire
          vsftpd-7977  [003] ....   577.192800: down_write <-deactivate_super
          vsftpd-7977  [003] ....   577.192801: _cond_resched <-down_write
          vsftpd-7977  [003] ....   577.192801: deactivate_locked_super <-deactivate_super
          vsftpd-7977  [003] ....   577.192801: proc_kill_sb <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.192801: kill_anon_super <-proc_kill_sb
          vsftpd-7977  [003] ....   577.192801: generic_shutdown_super <-kill_anon_super
          vsftpd-7977  [003] ....   577.192802: shrink_dcache_for_umount <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.192802: down_read_trylock <-shrink_dcache_for_umount
          vsftpd-7977  [003] ....   577.192802: shrink_dcache_for_umount_subtree <-shrink_dcache_for_umount
          vsftpd-7977  [003] ....   577.192802: dentry_lru_prune <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.192802: __d_shrink <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.192803: iput <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.192803: _raw_spin_lock <-_atomic_dec_and_lock
          vsftpd-7977  [003] ....   577.192803: generic_delete_inode <-iput
          vsftpd-7977  [003] ....   577.192803: evict <-iput
          vsftpd-7977  [003] ....   577.192804: _raw_spin_lock <-evict
          vsftpd-7977  [003] ....   577.192804: proc_evict_inode <-evict
          vsftpd-7977  [003] ....   577.192804: truncate_inode_pages <-proc_evict_inode
          vsftpd-7977  [003] ....   577.192804: truncate_inode_pages_range <-truncate_inode_pages
          vsftpd-7977  [003] ....   577.192804: end_writeback <-proc_evict_inode
          vsftpd-7977  [003] ....   577.192804: _cond_resched <-end_writeback
          vsftpd-7977  [003] ....   577.192805: _raw_spin_lock_irq <-end_writeback
          vsftpd-7977  [003] ....   577.192805: _cond_resched <-end_writeback
          vsftpd-7977  [003] ....   577.192805: put_pid <-proc_evict_inode
          vsftpd-7977  [003] ....   577.192805: put_pid: put_pid: NULL
          vsftpd-7977  [003] ....   577.192806: pde_put <-proc_evict_inode
          vsftpd-7977  [003] ....   577.192806: __remove_inode_hash <-evict
          vsftpd-7977  [003] ....   577.192806: _raw_spin_lock <-__remove_inode_hash
          vsftpd-7977  [003] ....   577.192806: _raw_spin_lock <-__remove_inode_hash
          vsftpd-7977  [003] ....   577.192806: _raw_spin_lock <-evict
          vsftpd-7977  [003] ....   577.192807: wake_up_bit <-evict
          vsftpd-7977  [003] ....   577.192807: bit_waitqueue <-wake_up_bit
          vsftpd-7977  [003] ....   577.192807: __phys_addr <-bit_waitqueue
          vsftpd-7977  [003] ....   577.192807: __wake_up_bit <-wake_up_bit
          vsftpd-7977  [003] ....   577.192807: destroy_inode <-evict
          vsftpd-7977  [003] ....   577.192807: __destroy_inode <-destroy_inode
          vsftpd-7977  [003] ....   577.192808: inode_has_buffers <-__destroy_inode
          vsftpd-7977  [003] ....   577.192808: __fsnotify_inode_delete <-__destroy_inode
          vsftpd-7977  [003] ....   577.192808: fsnotify_clear_marks_by_inode <-__fsnotify_inode_delete
          vsftpd-7977  [003] ....   577.192808: _raw_spin_lock <-fsnotify_clear_marks_by_inode
          vsftpd-7977  [003] ....   577.192808: proc_destroy_inode <-destroy_inode
          vsftpd-7977  [003] ....   577.192808: call_rcu_sched <-proc_destroy_inode
          vsftpd-7977  [003] ....   577.192809: __call_rcu <-call_rcu_sched
          vsftpd-7977  [003] ....   577.192809: d_free <-shrink_dcache_for_umount_subtree
          vsftpd-7977  [003] ....   577.192809: __d_free <-d_free
          vsftpd-7977  [003] ....   577.192809: kmem_cache_free <-__d_free
          vsftpd-7977  [003] ....   577.192809: __phys_addr <-kmem_cache_free
          vsftpd-7977  [003] ....   577.192810: __slab_free <-kmem_cache_free
          vsftpd-7977  [003] ....   577.192810: free_debug_processing <-__slab_free
          vsftpd-7977  [003] d...   577.192810: check_slab <-free_debug_processing
          vsftpd-7977  [003] d...   577.192811: slab_pad_check.part.42 <-check_slab
          vsftpd-7977  [003] d...   577.192811: on_freelist <-free_debug_processing
          vsftpd-7977  [003] d...   577.192812: check_object <-free_debug_processing
          vsftpd-7977  [003] d...   577.192812: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.192812: check_bytes_and_report <-check_object
          vsftpd-7977  [003] d...   577.192812: set_track <-free_debug_processing
          vsftpd-7977  [003] d...   577.192813: dump_trace <-save_stack_trace
          vsftpd-7977  [003] d...   577.192813: print_context_stack <-dump_trace
          vsftpd-7977  [003] d...   577.192817: init_object <-free_debug_processing
          vsftpd-7977  [003] ....   577.192817: sync_filesystem <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.192817: __sync_filesystem <-sync_filesystem
          vsftpd-7977  [003] ....   577.192818: __sync_filesystem <-sync_filesystem
          vsftpd-7977  [003] ....   577.192818: fsnotify_unmount_inodes <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.192818: _raw_spin_lock <-fsnotify_unmount_inodes
          vsftpd-7977  [003] ....   577.192818: evict_inodes <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.192818: _raw_spin_lock <-evict_inodes
          vsftpd-7977  [003] ....   577.192818: dispose_list <-evict_inodes
          vsftpd-7977  [003] ....   577.192819: _raw_spin_lock <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.192819: up_write <-generic_shutdown_super
          vsftpd-7977  [003] ....   577.192819: free_anon_bdev <-kill_anon_super
          vsftpd-7977  [003] ....   577.192819: _raw_spin_lock <-free_anon_bdev
          vsftpd-7977  [003] ....   577.192820: proc_kill_sb: put_pid_ns: 0xffff8801dc56b990 count:2->1
          vsftpd-7977  [003] ....   577.192820: unregister_shrinker <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.192820: down_write <-unregister_shrinker
          vsftpd-7977  [003] ....   577.192820: _cond_resched <-down_write
          vsftpd-7977  [003] ....   577.192821: up_write <-unregister_shrinker
          vsftpd-7977  [003] ....   577.192821: rcu_barrier <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.192821: rcu_barrier_sched <-rcu_barrier
          vsftpd-7977  [003] ....   577.192821: _rcu_barrier.isra.31 <-rcu_barrier_sched
          vsftpd-7977  [003] ....   577.192821: mutex_lock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192821: _cond_resched <-mutex_lock
          vsftpd-7977  [003] ....   577.192822: __init_waitqueue_head <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192822: on_each_cpu <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192822: smp_call_function <-on_each_cpu
          vsftpd-7977  [003] ....   577.192822: smp_call_function_many <-smp_call_function
          vsftpd-7977  [003] ....   577.192822: _raw_spin_lock_irqsave <-smp_call_function_many
          vsftpd-7977  [003] d...   577.192823: _raw_spin_unlock_irqrestore <-smp_call_function_many
          vsftpd-7977  [003] ....   577.192823: native_send_call_func_ipi <-smp_call_function_many
          vsftpd-7977  [003] ....   577.192823: flat_send_IPI_allbutself <-native_send_call_func_ipi
          vsftpd-7977  [003] d...   577.192832: rcu_barrier_func <-on_each_cpu
          vsftpd-7977  [003] d...   577.192833: call_rcu_sched <-rcu_barrier_func
          vsftpd-7977  [003] d...   577.192833: __call_rcu <-call_rcu_sched
          vsftpd-7977  [003] ....   577.192833: wait_for_completion <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.192833: wait_for_common <-wait_for_completion
          vsftpd-7977  [003] ....   577.192834: _cond_resched <-wait_for_common
          vsftpd-7977  [003] ....   577.192834: _raw_spin_lock_irq <-wait_for_common
          vsftpd-7977  [003] ....   577.192834: schedule_timeout <-wait_for_common
          vsftpd-7977  [003] ....   577.192834: schedule <-schedule_timeout
          vsftpd-7977  [003] ....   577.192834: __schedule <-schedule
          vsftpd-7977  [003] ....   577.192835: rcu_note_context_switch <-__schedule
          vsftpd-7977  [003] ....   577.192835: rcu_sched_qs <-rcu_note_context_switch
          vsftpd-7977  [003] ....   577.192835: _raw_spin_lock_irq <-__schedule
          vsftpd-7977  [003] d...   577.192835: deactivate_task <-__schedule
          vsftpd-7977  [003] d...   577.192835: dequeue_task <-deactivate_task
          vsftpd-7977  [003] d...   577.192835: update_rq_clock <-dequeue_task
          vsftpd-7977  [003] d...   577.192836: dequeue_task_fair <-dequeue_task
          vsftpd-7977  [003] d...   577.192836: update_curr <-dequeue_task_fair
          vsftpd-7977  [003] d...   577.192836: clear_buddies <-dequeue_task_fair
          vsftpd-7977  [003] d...   577.192836: hrtick_update <-dequeue_task_fair
          vsftpd-7977  [003] d...   577.192836: idle_balance <-__schedule
          vsftpd-7977  [003] d...   577.192837: _raw_spin_lock <-idle_balance
          vsftpd-7977  [003] d...   577.192837: put_prev_task_fair <-__schedule
          vsftpd-7977  [003] d...   577.192837: pick_next_task_fair <-__schedule
          vsftpd-7977  [003] d...   577.192837: pick_next_task_stop <-__schedule
          vsftpd-7977  [003] d...   577.192838: pick_next_task_rt <-__schedule
          vsftpd-7977  [003] d...   577.192838: pick_next_task_fair <-__schedule
          vsftpd-7977  [003] d...   577.192838: pick_next_task_idle <-__schedule
          vsftpd-7977  [003] d...   577.192838: calc_load_account_idle <-pick_next_task_idle
          <idle>-0     [003] d...   577.192839: finish_task_switch <-__schedule
          <idle>-0     [003] ....   577.192839: tick_nohz_idle_enter <-cpu_idle
          <idle>-0     [003] ....   577.192839: set_cpu_sd_state_idle <-tick_nohz_idle_enter
          <idle>-0     [003] d...   577.192839: tick_nohz_stop_sched_tick.isra.9 <-tick_nohz_idle_enter
          <idle>-0     [003] d...   577.192839: ktime_get <-tick_nohz_stop_sched_tick.isra.9
          <idle>-0     [003] ....   577.192840: local_touch_nmi <-cpu_idle
          <idle>-0     [003] d...   577.192840: enter_idle <-cpu_idle
...
          <idle>-0     [003] d...   577.192840: rcu_idle_enter <-cpu_idle
          <idle>-0     [003] d...   577.192841: rcu_idle_enter_common <-rcu_idle_enter
          <idle>-0     [003] d...   577.192841: rcu_cpu_has_callbacks <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.192841: rcu_pending <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.192841: __rcu_pending <-rcu_pending
          <idle>-0     [003] d...   577.192842: raise_softirq <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.192842: __raise_softirq_irqoff <-raise_softirq
          <idle>-0     [003] d...   577.192842: wakeup_softirqd <-raise_softirq
          <idle>-0     [003] d...   577.192842: wake_up_process <-wakeup_softirqd
...
     ksoftirqd/3-16    [003] d...   577.192853: finish_task_switch <-__schedule
     ksoftirqd/3-16    [003] d...   577.192853: __do_softirq <-run_ksoftirqd
     ksoftirqd/3-16    [003] ..s.   577.192853: rcu_process_callbacks <-__do_softirq
     ksoftirqd/3-16    [003] ..s.   577.192853: __rcu_process_callbacks <-rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.192853: rcu_process_gp_end <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] d.s.   577.192854: _raw_spin_trylock <-rcu_process_gp_end
     ksoftirqd/3-16    [003] d.s.   577.192854: __rcu_process_gp_end.isra.5 <-rcu_process_gp_end
     ksoftirqd/3-16    [003] d.s.   577.192854: _raw_spin_unlock_irqrestore <-rcu_process_gp_end
     ksoftirqd/3-16    [003] ..s.   577.192855: check_for_new_grace_period <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] d.s.   577.192855: _raw_spin_trylock <-check_for_new_grace_period
     ksoftirqd/3-16    [003] d.s.   577.192855: __note_new_gpnum.isra.27 <-check_for_new_grace_period
     ksoftirqd/3-16    [003] d.s.   577.192855: _raw_spin_unlock_irqrestore <-check_for_new_grace_period
     ksoftirqd/3-16    [003] ..s.   577.192855: __rcu_process_callbacks <-rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.192855: force_quiescent_state <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.192856: rcu_process_gp_end <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.192856: check_for_new_grace_period <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.192856: rcu_bh_qs <-__do_softirq
     ksoftirqd/3-16    [003] d.s.   577.192856: __local_bh_enable <-__do_softirq
     ksoftirqd/3-16    [003] ....   577.192856: _cond_resched <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.192857: rcu_note_context_switch <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.192857: rcu_sched_qs <-rcu_note_context_switch
     ksoftirqd/3-16    [003] ....   577.192857: kthread_should_stop <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.192857: schedule_preempt_disabled <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.192857: schedule <-schedule_preempt_disabled
     ksoftirqd/3-16    [003] ....   577.192858: __schedule <-schedule
...
     ksoftirqd/3-16    [003] d...   577.192861: calc_load_account_idle <-pick_next_task_idle
          <idle>-0     [003] d...   577.192861: finish_task_switch <-__schedule
          <idle>-0     [003] ....   577.192862: tick_nohz_idle_enter <-cpu_idle
          <idle>-0     [003] ....   577.192862: set_cpu_sd_state_idle <-tick_nohz_idle_enter
          <idle>-0     [003] d...   577.192862: tick_nohz_stop_sched_tick.isra.9 <-tick_nohz_idle_enter
          <idle>-0     [003] d...   577.192862: ktime_get <-tick_nohz_stop_sched_tick.isra.9
          <idle>-0     [003] ....   577.192863: local_touch_nmi <-cpu_idle
          <idle>-0     [003] d...   577.192863: enter_idle <-cpu_idle
          <idle>-0     [003] d...   577.192863: atomic_notifier_call_chain <-enter_idle
          <idle>-0     [003] d...   577.192863: notifier_call_chain <-atomic_notifier_call_chain
          <idle>-0     [003] d...   577.192863: rcu_idle_enter <-cpu_idle
          <idle>-0     [003] d...   577.192863: rcu_idle_enter_common <-rcu_idle_enter
          <idle>-0     [003] d...   577.192864: rcu_cpu_has_callbacks <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.192864: cpuidle_idle_call <-cpu_idle
          <idle>-0     [003] d...   577.192864: cpuidle_get_driver <-cpuidle_idle_call
          <idle>-0     [003] d...   577.192864: mwait_idle <-cpu_idle
          <idle>-0     [003] d...   577.196356: smp_apic_timer_interrupt <-apic_timer_interrupt
          <idle>-0     [003] d...   577.196356: native_apic_mem_write <-smp_apic_timer_interrupt
          <idle>-0     [003] d...   577.196357: irq_enter <-smp_apic_timer_interrupt
          <idle>-0     [003] d...   577.196357: rcu_irq_enter <-irq_enter
          <idle>-0     [003] d...   577.196357: rcu_idle_exit_common <-rcu_irq_enter
          <idle>-0     [003] d...   577.196357: hrtimer_cancel <-rcu_idle_exit_common
...
          <idle>-0     [003] d.h.   577.196366: update_process_times <-tick_sched_timer
          <idle>-0     [003] d.h.   577.196366: account_process_tick <-update_process_times
          <idle>-0     [003] d.h.   577.196367: run_local_timers <-update_process_times
          <idle>-0     [003] d.h.   577.196367: hrtimer_run_queues <-run_local_timers
          <idle>-0     [003] d.h.   577.196367: raise_softirq <-run_local_timers
          <idle>-0     [003] d.h.   577.196367: __raise_softirq_irqoff <-raise_softirq
          <idle>-0     [003] d.h.   577.196368: rcu_check_callbacks <-update_process_times
          <idle>-0     [003] d.h.   577.196368: rcu_sched_qs <-rcu_check_callbacks
          <idle>-0     [003] d.h.   577.196368: rcu_bh_qs <-rcu_check_callbacks
          <idle>-0     [003] d.h.   577.196369: rcu_pending <-rcu_check_callbacks
          <idle>-0     [003] d.h.   577.196369: __rcu_pending <-rcu_pending
          <idle>-0     [003] d.h.   577.196369: raise_softirq <-rcu_check_callbacks
          <idle>-0     [003] d.h.   577.196369: __raise_softirq_irqoff <-raise_softirq
...
          <idle>-0     [003] d...   577.196376: do_softirq <-irq_exit
          <idle>-0     [003] d...   577.196376: __do_softirq <-call_softirq
          <idle>-0     [003] ..s.   577.196376: run_timer_softirq <-__do_softirq
          <idle>-0     [003] ..s.   577.196377: hrtimer_run_pending <-run_timer_softirq
          <idle>-0     [003] ..s.   577.196377: _raw_spin_lock_irq <-run_timer_softirq
          <idle>-0     [003] ..s.   577.196377: rcu_bh_qs <-__do_softirq
          <idle>-0     [003] ..s.   577.196378: rcu_process_callbacks <-__do_softirq
          <idle>-0     [003] ..s.   577.196378: __rcu_process_callbacks <-rcu_process_callbacks
          <idle>-0     [003] ..s.   577.196378: rcu_process_gp_end <-__rcu_process_callbacks
          <idle>-0     [003] d.s.   577.196378: _raw_spin_trylock <-rcu_process_gp_end
          <idle>-0     [003] d.s.   577.196379: __rcu_process_gp_end.isra.5 <-rcu_process_gp_end
          <idle>-0     [003] d.s.   577.196379: _raw_spin_unlock_irqrestore <-rcu_process_gp_end
          <idle>-0     [003] ..s.   577.196379: check_for_new_grace_period <-__rcu_process_callbacks
          <idle>-0     [003] ..s.   577.196380: _raw_spin_lock_irqsave <-__rcu_process_callbacks
          <idle>-0     [003] d.s.   577.196380: rcu_start_gp <-__rcu_process_callbacks
...
          <idle>-0     [003] d...   577.196387: force_qs_rnp <-force_quiescent_state
          <idle>-0     [003] d...   577.196388: _raw_spin_lock_irqsave <-force_qs_rnp
          <idle>-0     [003] d...   577.196388: dyntick_save_progress_counter <-force_qs_rnp
          <idle>-0     [003] d...   577.196388: dyntick_save_progress_counter <-force_qs_rnp
          <idle>-0     [003] d...   577.196389: dyntick_save_progress_counter <-force_qs_rnp
          <idle>-0     [003] d...   577.196389: dyntick_save_progress_counter <-force_qs_rnp
          <idle>-0     [003] d...   577.196390: rcu_report_qs_rnp <-force_qs_rnp
          <idle>-0     [003] d...   577.196390: _raw_spin_unlock_irqrestore <-rcu_report_qs_rnp
          <idle>-0     [003] d...   577.196390: _raw_spin_lock <-force_quiescent_state
          <idle>-0     [003] d...   577.196391: _raw_spin_unlock_irqrestore <-force_quiescent_state
          <idle>-0     [003] d...   577.196391: rcu_cpu_has_callbacks <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.196391: raise_softirq <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.196391: __raise_softirq_irqoff <-raise_softirq
          <idle>-0     [003] d...   577.196392: wakeup_softirqd <-raise_softirq
          <idle>-0     [003] d...   577.196392: wake_up_process <-wakeup_softirqd
...
          <idle>-0     [003] .N..   577.196403: __schedule <-schedule
          <idle>-0     [003] .N..   577.196403: rcu_note_context_switch <-__schedule
          <idle>-0     [003] .N..   577.196403: rcu_sched_qs <-rcu_note_context_switch
          <idle>-0     [003] .N..   577.196403: _raw_spin_lock_irq <-__schedule
          <idle>-0     [003] dN..   577.196404: put_prev_task_idle <-__schedule
          <idle>-0     [003] dN..   577.196404: pick_next_task_fair <-__schedule
          <idle>-0     [003] dN..   577.196404: clear_buddies <-pick_next_task_fair
          <idle>-0     [003] dN..   577.196405: set_next_entity <-pick_next_task_fair
          <idle>-0     [003] dN..   577.196405: update_stats_wait_end <-set_next_entity
     ksoftirqd/3-16    [003] d...   577.196406: finish_task_switch <-__schedule
     ksoftirqd/3-16    [003] d...   577.196406: __do_softirq <-run_ksoftirqd
     ksoftirqd/3-16    [003] ..s.   577.196406: rcu_process_callbacks <-__do_softirq
     ksoftirqd/3-16    [003] ..s.   577.196407: __rcu_process_callbacks <-rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196407: rcu_process_gp_end <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196407: check_for_new_grace_period <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196408: rcu_report_qs_rdp.isra.29 <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196408: _raw_spin_lock_irqsave <-rcu_report_qs_rdp.isra.29
     ksoftirqd/3-16    [003] d.s.   577.196408: rcu_report_qs_rnp <-rcu_report_qs_rdp.isra.29
     ksoftirqd/3-16    [003] d.s.   577.196408: _raw_spin_unlock_irqrestore <-rcu_report_qs_rnp
     ksoftirqd/3-16    [003] ..s.   577.196409: __rcu_process_callbacks <-rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196409: force_quiescent_state <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196409: rcu_process_gp_end <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196410: check_for_new_grace_period <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196410: rcu_bh_qs <-__do_softirq
     ksoftirqd/3-16    [003] d.s.   577.196410: __local_bh_enable <-__do_softirq
     ksoftirqd/3-16    [003] ....   577.196410: _cond_resched <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.196411: rcu_note_context_switch <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.196411: rcu_sched_qs <-rcu_note_context_switch
     ksoftirqd/3-16    [003] ....   577.196411: kthread_should_stop <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.196412: schedule_preempt_disabled <-run_ksoftirqd
     ksoftirqd/3-16    [003] ....   577.196412: schedule <-schedule_preempt_disabled
     ksoftirqd/3-16    [003] ....   577.196412: __schedule <-schedule
     ksoftirqd/3-16    [003] ....   577.196412: rcu_note_context_switch <-__schedule
...
     ksoftirqd/3-16    [003] d...   577.196417: pick_next_task_idle <-__schedule
     ksoftirqd/3-16    [003] d...   577.196417: calc_load_account_idle <-pick_next_task_idle
          <idle>-0     [003] d...   577.196418: finish_task_switch <-__schedule
          <idle>-0     [003] ....   577.196418: tick_nohz_idle_enter <-cpu_idle
          <idle>-0     [003] ....   577.196419: set_cpu_sd_state_idle <-tick_nohz_idle_enter
          <idle>-0     [003] d...   577.196419: tick_nohz_stop_sched_tick.isra.9 <-tick_nohz_idle_enter
          <idle>-0     [003] d...   577.196419: ktime_get <-tick_nohz_stop_sched_tick.isra.9
          <idle>-0     [003] ....   577.196420: local_touch_nmi <-cpu_idle
          <idle>-0     [003] d...   577.196420: enter_idle <-cpu_idle
          <idle>-0     [003] d...   577.196420: atomic_notifier_call_chain <-enter_idle
          <idle>-0     [003] d...   577.196420: notifier_call_chain <-atomic_notifier_call_chain
          <idle>-0     [003] d...   577.196421: rcu_idle_enter <-cpu_idle
          <idle>-0     [003] d...   577.196421: rcu_idle_enter_common <-rcu_idle_enter
          <idle>-0     [003] d...   577.196421: rcu_cpu_has_callbacks <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.196421: rcu_sched_qs <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.196422: force_quiescent_state <-rcu_idle_enter_common
          <idle>-0     [003] d...   577.196422: _raw_spin_trylock <-force_quiescent_state
          <idle>-0     [003] d...   577.196422: _raw_spin_lock <-force_quiescent_state
          <idle>-0     [003] d...   577.196423: force_qs_rnp <-force_quiescent_state
          <idle>-0     [003] d...   577.196423: _raw_spin_lock_irqsave <-force_qs_rnp
          <idle>-0     [003] d...   577.196423: rcu_implicit_dynticks_qs <-force_qs_rnp
          <idle>-0     [003] d...   577.196424: rcu_report_qs_rnp <-force_qs_rnp
          <idle>-0     [003] d...   577.196424: _raw_spin_lock <-rcu_report_qs_rnp
          <idle>-0     [003] d...   577.196424: _raw_spin_lock <-rcu_report_qs_rnp
          <idle>-0     [003] d...   577.196425: rcu_start_gp <-rcu_report_qs_rnp
          <idle>-0     [003] d...   577.196425: _raw_spin_unlock_irqrestore <-rcu_start_gp
          <idle>-0     [003] d...   577.196425: _raw_spin_lock_irqsave <-force_qs_rnp
          <idle>-0     [003] d...   577.196426: _raw_spin_unlock_irqrestore <-force_qs_rnp
          <idle>-0     [003] d...   577.196426: _raw_spin_lock <-force_quiescent_state
          <idle>-0     [003] d...   577.196426: rcu_start_gp <-force_quiescent_state
...
          <idle>-0     [003] dN..   577.196443: clear_buddies <-pick_next_task_fair
          <idle>-0     [003] dN..   577.196443: set_next_entity <-pick_next_task_fair
          <idle>-0     [003] dN..   577.196443: update_stats_wait_end <-set_next_entity
     ksoftirqd/3-16    [003] d...   577.196444: finish_task_switch <-__schedule
     ksoftirqd/3-16    [003] d...   577.196444: __do_softirq <-run_ksoftirqd
     ksoftirqd/3-16    [003] ..s.   577.196444: rcu_process_callbacks <-__do_softirq
     ksoftirqd/3-16    [003] ..s.   577.196445: __rcu_process_callbacks <-rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196445: rcu_process_gp_end <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196445: check_for_new_grace_period <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196446: rcu_report_qs_rdp.isra.29 <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196446: _raw_spin_lock_irqsave <-rcu_report_qs_rdp.isra.29
     ksoftirqd/3-16    [003] d.s.   577.196446: rcu_report_qs_rnp <-rcu_report_qs_rdp.isra.29
     ksoftirqd/3-16    [003] d.s.   577.196446: _raw_spin_unlock_irqrestore <-rcu_report_qs_rnp
     ksoftirqd/3-16    [003] ..s.   577.196447: put_cred_rcu <-__rcu_process_callbacks
     ksoftirqd/3-16    [003] ..s.   577.196447: key_put <-put_cred_rcu
     ksoftirqd/3-16    [003] ..s.   577.196447: key_put <-put_cred_rcu
     ksoftirqd/3-16    [003] ..s.   577.196448: release_tgcred.isra.11 <-put_cred_rcu
     ksoftirqd/3-16    [003] ..s.   577.196448: call_rcu_sched <-release_tgcred.isra.11
     ksoftirqd/3-16    [003] ..s.   577.196448: __call_rcu <-call_rcu_sched
     ksoftirqd/3-16    [003] ..s.   577.196449: free_uid <-put_cred_rcu
     ksoftirqd/3-16    [003] ..s.   577.196449: kmem_cache_free <-put_cred_rcu
     ksoftirqd/3-16    [003] ..s.   577.196449: __phys_addr <-kmem_cache_free
     ksoftirqd/3-16    [003] ..s.   577.196450: __slab_free <-kmem_cache_free
     ksoftirqd/3-16    [003] ..s.   577.196450: free_debug_processing <-__slab_free
     ksoftirqd/3-16    [003] d.s.   577.196450: check_slab <-free_debug_processing
     ksoftirqd/3-16    [003] d.s.   577.196451: slab_pad_check.part.42 <-check_slab
     ksoftirqd/3-16    [003] d.s.   577.196451: on_freelist <-free_debug_processing
     ksoftirqd/3-16    [003] d.s.   577.196451: check_object <-free_debug_processing
     ksoftirqd/3-16    [003] d.s.   577.196452: check_bytes_and_report <-check_object
     ksoftirqd/3-16    [003] d.s.   577.196452: check_bytes_and_report <-check_object
     ksoftirqd/3-16    [003] d.s.   577.196452: set_track <-free_debug_processing
     ksoftirqd/3-16    [003] d.s.   577.196453: dump_trace <-save_stack_trace
     ksoftirqd/3-16    [003] d.s.   577.196453: print_context_stack <-dump_trace
     ksoftirqd/3-16    [003] d.s.   577.196458: init_object <-free_debug_processing
     ksoftirqd/3-16    [003] ..s.   577.196458: delayed_put_pid <-__rcu_process_callbacks
...
     ksoftirqd/3-16    [003] d...   577.196570: pick_next_task_fair <-__schedule
     ksoftirqd/3-16    [003] d...   577.196570: clear_buddies <-pick_next_task_fair
     ksoftirqd/3-16    [003] d...   577.196570: set_next_entity <-pick_next_task_fair
     ksoftirqd/3-16    [003] d...   577.196570: update_stats_wait_end <-set_next_entity
          vsftpd-7977  [003] d...   577.196571: finish_task_switch <-__schedule
          vsftpd-7977  [003] ....   577.196571: _raw_spin_lock_irq <-wait_for_common
          vsftpd-7977  [003] ....   577.196571: mutex_unlock <-_rcu_barrier.isra.31
          vsftpd-7977  [003] ....   577.196572: put_filesystem <-deactivate_locked_super
          vsftpd-7977  [003] ....   577.196572: put_super <-deactivate_locked_super



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
  2012-05-05  7:12                                     ` Mike Galbraith
@ 2012-05-05 11:37                                       ` Eric W. Biederman
  2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
  1 sibling, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-05 11:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling, Paul E. McKenney

Mike Galbraith <efault@gmx.de> writes:

> On Sat, 2012-05-05 at 08:08 +0200, Mike Galbraith wrote:
>
>> egrep 'synchronize|rcu_barrier' /trace 
>> 
>>           vsftpd-7981  [003] ....   577.164997: synchronize_sched <-switch_task_namespaces
>>           vsftpd-7981  [003] ....   577.164998: _cond_resched <-synchronize_sched
>>           vsftpd-7981  [003] ....   577.164998: wait_rcu_gp <-synchronize_sched
>>           vsftpd-7982  [003] ....   577.166583: synchronize_sched <-switch_task_namespaces
>>           vsftpd-7982  [003] ....   577.166583: _cond_resched <-synchronize_sched
>
>> vsftpd-7977  [003] ....   577.171519: rcu_barrier_sched <-rcu_barrier
>>           vsftpd-7977  [003] ....   577.171519: _rcu_barrier.isra.31 <-rcu_barrier_sched
>>           vsftpd-7977  [003] ....   577.171519: mutex_lock <-_rcu_barrier.isra.31
>>           vsftpd-7977  [003] ....   577.171520: __init_waitqueue_head <-_rcu_barrier.isra.31
>>           vsftpd-7977  [003] ....   577.171520: on_each_cpu <-_rcu_barrier.isra.31
>>           vsftpd-7977  [003] d...   577.171532: rcu_barrier_func <-on_each_cpu
>>           vsftpd-7977  [003] d...   577.171532: call_rcu_sched <-rcu_barrier_func
>>           vsftpd-7977  [003] ....   577.171533: wait_for_completion <-_rcu_barrier.isra.31
>>      ksoftirqd/3-16    [003] ..s.   577.171691: rcu_barrier_callback <-__rcu_process_callbacks
>>           vsftpd-7977  [003] ....   577.176443: mutex_unlock <-_rcu_barrier.isra.31
> ...
>
> Ok, so CLONE_NEWPID | SIGCHLD + waitpid is a bad idea given extreme
> unmount synchronization, but why does it take four softirqs?  Seems this
> could have gone a lot faster. 

It is just taking 4 milliseconds, or 1 jiffy at 250HZ, which seems
like correct operation for rcu_barrier.

To recap for anyone watching.  We have:

sys_wait4
   do_wait
     ...
        release_task
           proc_flush_task
              pid_ns_release_proc
                 kern_unmount
                    mntput
                      mntput_no_expire
                         mntfree
                            deactivate_super
                               deactivate_locked_super
                                  rcu_barrier

So each instance of sys_wait4 winds up taking 4ms, sad.  But that
does explain why it takes so long to reap the zombies: we are
synchronous.

The ipc namespace is also going to suffer from this deactivate_super
delay, but more likely in exit_namespaces(), so the delay should not
be synchronized across a bunch of processes.  Aka the wait should
be done before the parent is notified.

I had a nefarious plan to combine the proc mount reference count with
the pid namespace reference count (to break the loop).  I will see if I
can reawaken that.  If that plan comes to fruition the final put_pid on
the pid namespace should happen in a call_rcu after release_task so
wait4 should not be bottlenecked.

I am still mystified why adding the rest of the namespaces adds so
much of a slowdown.  Those tasks exiting should have been parallelized
before do_wait..

The rcu_barrier is new as of 2.6.38-rc5 with commit d863b50ab on Feb 10 2011:

commit d863b50ab01333659314c2034890cb76d9fdc3c7
Author: Boaz Harrosh <bharrosh@panasas.com>
Date:   Thu Feb 10 15:01:20 2011 -0800

    vfs: call rcu_barrier after ->kill_sb()
    
    In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), we use rcu free
    inode instead of freeing the inode directly.  It causes a crash when we
    rmmod immediately after we umount the volume[1].
    
    So we need to call rcu_barrier after we kill_sb so that the inode is
    freed before we do rmmod.  The idea is inspired by Aneesh Kumar.
    rcu_barrier will wait for all callbacks to end before proceeding.  The
    original patch was done by Tao Ma, but synchronize_rcu() is not enough
    here.
    
    1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2
    
    Tested-by: Tao Ma <boyu.mt@taobao.com>
    Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Chris Mason <chris.mason@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/fs/super.c b/fs/super.c
index 74e149e..7e9dd4c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -177,6 +177,11 @@ void deactivate_locked_super(struct super_block *s)
        struct file_system_type *fs = s->s_type;
        if (atomic_dec_and_test(&s->s_active)) {
                fs->kill_sb(s);
+               /*
+                * We need to call rcu_barrier so all the delayed rcu free
+                * inodes are flushed before we release the fs module.
+                */
+               rcu_barrier();
                put_filesystem(fs);
                put_super(s);
        } else {



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 0/3] pidns: Closing the pid namespace exit race.
  2012-05-01 20:42           ` Andrew Morton
  2012-05-03  3:12             ` Mike Galbraith
@ 2012-05-07  0:32             ` Eric W. Biederman
  2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
                                 ` (2 more replies)
  1 sibling, 3 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-07  0:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith


I had some patches that broke the circular reference between the
pid_namespace and the proc_mnt by combining the reference counts, but
the world has changed and now the proc_mnt reference count is not
suitable to be used that way, so that plan is scrapped.

I did play with it and I have found a relatively elegant way of at
least handling the problem of self-reaping children escaping from
zap_pid_ns_processes.

The following patches guarantee that the task with pid == 1 will be the
last task reaped in a pid namespace, making proc_flush_task safe.

The previous patch to call pid_ns_release_proc on fork failure is still
needed.  These patches simply address the other failure mode.

How did it escape my memory that setting SIGCHLD to SIG_IGN caused
children to autoreap?


Eric W. Biederman (3):
      pidns:  Use task_active_pid_ns in do_notify_parent.
      pidns: Guarantee that the pidns init will be the last pidns process reaped.
      pidns: Make killed children autoreap

 kernel/exit.c          |   46 +++++++++++++++++++++++++++++++++++-----------
 kernel/pid_namespace.c |    7 ++++++-
 kernel/signal.c        |   11 +++++------
 3 files changed, 46 insertions(+), 18 deletions(-)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent.
  2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
@ 2012-05-07  0:33               ` Eric W. Biederman
  2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
  2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
  2 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-07  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith


Using task_active_pid_ns is more robust because it works even after we
have called exit_namespaces.  This change allows us to have parent
processes that are zombies.  Normally a zombie parent process is crazy
and the last thing you would want to have, but in the case of not letting
the init process of a pid namespace be reaped until all of its children
are dead and reaped, a zombie parent process is exactly what we want.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/signal.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 17afcaf..0e4ef99 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1665,19 +1665,18 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	info.si_signo = sig;
 	info.si_errno = 0;
 	/*
-	 * we are under tasklist_lock here so our parent is tied to
-	 * us and cannot exit and release its namespace.
+	 * We are under tasklist_lock here so our parent is tied to
+	 * us and cannot change.
 	 *
-	 * the only it can is to switch its nsproxy with sys_unshare,
-	 * bu uncharing pid namespaces is not allowed, so we'll always
-	 * see relevant namespace
+	 * task_active_pid_ns will always return the same pid namespace
+	 * until a task passes through release_task.
 	 *
 	 * write_lock() currently calls preempt_disable() which is the
 	 * same as rcu_read_lock(), but according to Oleg, this is not
 	 * correct to rely on this
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
+	info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(tsk->parent));
 	info.si_uid = map_cred_ns(__task_cred(tsk),
 			task_cred_xxx(tsk->parent, user_ns));
 	rcu_read_unlock();
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
  2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
@ 2012-05-07  0:35               ` Eric W. Biederman
  2012-05-08 22:50                 ` Andrew Morton
  2012-05-16 18:39                 ` Oleg Nesterov
  2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
  2 siblings, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-07  0:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith


This change extends the thread group zombie leader logic to work for pid
namespaces.  The task with pid 1 is declared the pid namespace leader.
A pid namespace with no more processes is detected by observing that the
init task is a zombie in an empty thread group, and the init task
has no children.

Instead of moving lingering EXIT_DEAD tasks off of init's ->children
list we now block init from exiting until those children have self
reaped and have removed themselves.  Which guarantees that the init task
is the last task in a pid namespace to be reaped.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/exit.c |   46 +++++++++++++++++++++++++++++++++++-----------
 1 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d8bd3b42..7269260 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -164,6 +164,16 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+static bool pidns_leader(struct task_struct *tsk)
+{
+	return is_child_reaper(task_pid(tsk));
+}
+
+static bool delay_pidns_leader(struct task_struct *tsk)
+{
+	return pidns_leader(tsk) &&
+	       (!thread_group_empty(tsk) || !list_empty(&tsk->children));
+}
 
 void release_task(struct task_struct * p)
 {
@@ -183,15 +193,23 @@ repeat:
 	__exit_signal(p);
 
 	/*
-	 * If we are the last non-leader member of the thread
-	 * group, and the leader is zombie, then notify the
-	 * group leader's parent process. (if it wants notification.)
+	 * If we are the last non-leader member of the thread group,
+	 * or the last non-leader member of the pid namespace, and the
+	 * leader is zombie, then notify the leader's parent
+	 * process. (if it wants notification.)
 	 */
 	zap_leader = 0;
-	leader = p->group_leader;
-	if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) {
+	leader = NULL;
+	/* Do we need to worry about our thread_group or our pidns leader? */
+	if (p != p->group_leader)
+		leader = p->group_leader;
+	else if (pidns_leader(p->real_parent))
+		leader = p->real_parent;
+
+	if (leader && thread_group_empty(leader) &&
+	    leader->exit_state == EXIT_ZOMBIE && list_empty(&leader->children)) {
 		/*
-		 * If we were the last child thread and the leader has
+		 * If we were the last task in the group and the leader has
 		 * exited already, and the leader's parent ignores SIGCHLD,
 		 * then we are the one who should release the leader.
 		 */
@@ -720,11 +738,10 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 		zap_pid_ns_processes(pid_ns);
 		write_lock_irq(&tasklist_lock);
 		/*
-		 * We can not clear ->child_reaper or leave it alone.
-		 * There may by stealth EXIT_DEAD tasks on ->children,
-		 * forget_original_parent() must move them somewhere.
+		 * Move all lingering EXIT_DEAD tasks onto the
+		 * children list of init's thread group leader.
 		 */
-		pid_ns->child_reaper = init_pid_ns.child_reaper;
+		pid_ns->child_reaper = father->group_leader;
 	} else if (father->signal->has_child_subreaper) {
 		struct task_struct *reaper;
 
@@ -798,6 +815,12 @@ static void forget_original_parent(struct task_struct *father)
 	exit_ptrace(father);
 	reaper = find_new_reaper(father);
 
+	/* Return immediately if we aren't going to reparent anything */
+	if (unlikely(reaper == father)) {
+		write_unlock_irq(&tasklist_lock);
+		return;
+	}
+		
 	list_for_each_entry_safe(p, n, &father->children, sibling) {
 		struct task_struct *t = p;
 		do {
@@ -853,6 +876,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		autoreap = do_notify_parent(tsk, sig);
 	} else if (thread_group_leader(tsk)) {
 		autoreap = thread_group_empty(tsk) &&
+			!delay_pidns_leader(tsk) &&
 			do_notify_parent(tsk, tsk->exit_signal);
 	} else {
 		autoreap = true;
@@ -1579,7 +1603,7 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
 		}
 
 		/* we don't reap group leaders with subthreads */
-		if (!delay_group_leader(p))
+		if (!delay_group_leader(p) && !delay_pidns_leader(p))
 			return wait_task_zombie(wo, p);
 
 		/*
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 3/3] pidns: Make killed children autoreap
  2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
  2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
  2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
@ 2012-05-07  0:35               ` Eric W. Biederman
  2012-05-08 22:51                 ` Andrew Morton
  2 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-07  0:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith


Force SIGCHLD handling to SIG_IGN so that signals are not generated
and so that the children autoreap.  This increases the parallelism
and in general the speed of network namespace shutdown.

Note self-reaping children can exist past zap_pid_ns_processes but
they will all be reaped before we allow the pid namespace init task
with pid == 1 to be reaped.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/pid_namespace.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 57bc1fd..b98b0ed 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -149,7 +149,12 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 {
 	int nr;
 	int rc;
-	struct task_struct *task;
+	struct task_struct *task, *me = current;
+
+	/* Ignore SIGCHLD causing any terminated children to autoreap */
+	spin_lock_irq(&me->sighand->siglock);
+	me->sighand->action[SIGCHLD -1].sa.sa_handler = SIG_IGN;
+	spin_unlock_irq(&me->sighand->siglock);
 
 	/*
 	 * The last thread in the cgroup-init thread group is terminating.
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-05  7:12                                     ` Mike Galbraith
  2012-05-05 11:37                                       ` Eric W. Biederman
@ 2012-05-07 21:51                                       ` Eric W. Biederman
  2012-05-07 22:17                                         ` Al Viro
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-07 21:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling, Paul E. McKenney, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel


Recently it was observed that a distilled version of vsftpd was taking a
surprising amount of time reaping zombies.  A measurement was taken,
and vsftpd was taking about 4ms (one jiffy) to reap each zombie; those
4ms were spent sleeping in rcu_barrier in deactivate_locked_super.

The reason vsftpd was sleeping in deactivate_locked_super is that
vsftpd creates a pid namespace for each connection, and with that
pid namespace comes an internal mount of /proc.  That internal mount
of proc is unmounted when the last process in the pid namespace is
reaped.

/proc and similar non-modular filesystems do not need a rcu_barrier
in deactivate_locked_super.  Being non-modular there is no danger
of the rcu callback running after the module is unloaded.

Therefore do the easy thing and remove 4ms+ from unmount times by only
calling rcu_barrier for modular filesystems in unmount.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/super.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index cf00177..c739ef8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -261,7 +261,8 @@ void deactivate_locked_super(struct super_block *s)
 		 * We need to call rcu_barrier so all the delayed rcu free
 		 * inodes are flushed before we release the fs module.
 		 */
-		rcu_barrier();
+		if (fs->owner)
+			rcu_barrier();
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
@ 2012-05-07 22:17                                         ` Al Viro
  2012-05-07 23:56                                           ` Paul E. McKenney
  0 siblings, 1 reply; 69+ messages in thread
From: Al Viro @ 2012-05-07 22:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling, Paul E. McKenney, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:

> /proc and similar non-modular filesystems do not need a rcu_barrier
> in deactivate_locked_super.  Being non-modular there is no danger
> of the rcu callback running after the module is unloaded.

There's more than just a module unload there, though - actual freeing
struct super_block also happens past that rcu_barrier()...

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-07 22:17                                         ` Al Viro
@ 2012-05-07 23:56                                           ` Paul E. McKenney
  2012-05-08  1:07                                             ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Paul E. McKenney @ 2012-05-07 23:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
> 
> > /proc and similar non-modular filesystems do not need a rcu_barrier
> > in deactivate_locked_super.  Being non-modular there is no danger
> > of the rcu callback running after the module is unloaded.
> 
> There's more than just a module unload there, though - actual freeing
> struct super_block also happens past that rcu_barrier()...

Is there anything in there for which synchronous operation is required?
If not, one approach would be to drop the rcu_barrier() calls to a
workqueue or something similar.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-07 23:56                                           ` Paul E. McKenney
@ 2012-05-08  1:07                                             ` Eric W. Biederman
  2012-05-08  4:53                                               ` Mike Galbraith
  2012-05-09  7:55                                               ` Nick Piggin
  0 siblings, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-08  1:07 UTC (permalink / raw)
  To: paulmck
  Cc: Al Viro, Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
	Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>> 
>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>> > in deactivate_locked_super.  Being non-modular there is no danger
>> > of the rcu callback running after the module is unloaded.
>> 
>> There's more than just a module unload there, though - actual freeing
>>  struct super_block also happens past that rcu_barrier()...

Al.  I have not closely audited the entire code path but at a quick
sample I see no evidence that anything depends on inode->i_sb being
rcu safe.  Do you know of any such location?

It has only been a year and a half since Nick added this code which
isn't very much time to have grown strange dependencies like that.

> Is there anything in there for which synchronous operation is required?
> If not, one approach would be to drop the rcu_barrier() calls to a
> workqueue or something similar.

We need to drain all of the rcu callbacks before we free the slab
and unload the module.

This actually makes deactivate_locked_super the totally wrong place
for the rcu_barrier.  We want the rcu_barrier in the module exit
routine where we destroy the inode cache.

What I see as the real need is the filesystem modules need to do:
	rcu_barrier()
	kmem_cache_destroy(cache);

Perhaps we can add some helpers to make it easy.  But I think
I would be happy today with simply moving the rcu_barrier into
every filesystem's module exit path, just before the filesystem
module destroys its inode cache.

Eric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-08  1:07                                             ` Eric W. Biederman
@ 2012-05-08  4:53                                               ` Mike Galbraith
  2012-05-09  7:55                                               ` Nick Piggin
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Galbraith @ 2012-05-08  4:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling,
	Christoph Hellwig, linux-fsdevel

On Mon, 2012-05-07 at 18:07 -0700, Eric W. Biederman wrote: 
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> What I see as the real need is the filesystem modules need to do:
> 	rcu_barrier()
> 	kmem_cache_destroy(cache);
> 
> Perhaps we can add some helpers to make it easy.  But I think
> I would be happy today with simply moving the rcu_barrier into
> every filesystems module exit path, just before the file system
> module destoryed it's inode cache.

One liner kills the reap bottleneck and 99.999% of cache bloat.  1000
backgrounded vsftpd testcases finished ~instantly and left one
persistent pid namespace, vs taking ages and bloating very badly.

Hacked up hackbench still hurts with all (except user) namespaces, but
that's a different problem (modulo hackbench wonderfulness).

Previous numbers:

default flags = SIGCHLD

-namespace:  flag |= CLONE_NEWPID 
-all:  flags |= CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER

marge:/usr/local/tmp/starvation # ./hackbench
Running with 10*40 (== 400) tasks.
Time: 2.636
marge:/usr/local/tmp/starvation # ./hackbench -namespace
Running with 10*40 (== 400) tasks.
Time: 11.624
marge:/usr/local/tmp/starvation # ./hackbench -namespace -all
Running with 10*40 (== 400) tasks.
Time: 51.474


New numbers: 
marge:/usr/local/tmp/starvation # time ./hackbench
Running with 10*40 (== 400) tasks.
Time: 2.718

real    0m2.877s
user    0m0.060s
sys     0m10.057s
marge:/usr/local/tmp/starvation # time ./hackbench -namespace
Running with 10*40 (== 400) tasks.
Time: 2.689

real    0m2.878s
user    0m0.060s
sys     0m9.945s
marge:/usr/local/tmp/starvation # time ./hackbench -namespace -all
Running with 10*40 (== 400) tasks.
Time: 2.521

real    0m27.774s
user    0m0.048s
sys     0m21.681s
marge:/usr/local/tmp/starvation #


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
@ 2012-05-08 22:50                 ` Andrew Morton
  2012-05-16 18:39                 ` Oleg Nesterov
  1 sibling, 0 replies; 69+ messages in thread
From: Andrew Morton @ 2012-05-08 22:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On Sun, 06 May 2012 17:35:02 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 
> This change extends the thread group zombie leader logic to work for pid
> namespaces.  The task with pid 1 is declared the pid namespace leader.
> A pid namespace with no more processes is detected by observing that the
> init task is a zombie in an empty thread group, and the init task
> has no children.
> 
> Instead of moving lingering EXIT_DEAD tasks off of init's ->children
> list we now block init from exiting until those children have self
> reaped and have removed themselves.  Which guarantees that the init task
> is the last task in a pid namespace to be reaped.
> 
> ...
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -164,6 +164,16 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>  	put_task_struct(tsk);
>  }
>  
> +static bool pidns_leader(struct task_struct *tsk)
> +{
> +	return is_child_reaper(task_pid(tsk));
> +}
> +
> +static bool delay_pidns_leader(struct task_struct *tsk)
> +{
> +	return pidns_leader(tsk) &&
> +	       (!thread_group_empty(tsk) || !list_empty(&tsk->children));
> +}

The code would be significantly easier to understand if the above two
functions were documented.

What is the significance of pidns leadership, and why might callers
want to know this?

delay_pidns_leader() seems poorly named, which doesn't help.  I guess
it's trying to say "should delay the pidns leader".  But even then, it
doesn't describe *why* the leader should be delayed.

Have a think about it, please?

>
> ...
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/3] pidns: Make killed children autoreap
  2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
@ 2012-05-08 22:51                 ` Andrew Morton
  0 siblings, 0 replies; 69+ messages in thread
From: Andrew Morton @ 2012-05-08 22:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On Sun, 06 May 2012 17:35:46 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 
> Force SIGCHLD handling to SIG_IGN so that signals are not generated
> and so that the children autoreap.  This increases the parallelism
> and in general the speed of network namespace shutdown.
> 
> Note self-reaping children can exist past zap_pid_ns_processes but
> they will all be reaped before we allow the pid namespace init task
> with pid == 1 to be reaped.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  kernel/pid_namespace.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 57bc1fd..b98b0ed 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -149,7 +149,12 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  {
>  	int nr;
>  	int rc;
> -	struct task_struct *task;
> +	struct task_struct *task, *me = current;
> +
> +	/* Ignore SIGCHLD causing any terminated children to autoreap */
> +	spin_lock_irq(&me->sighand->siglock);
> +	me->sighand->action[SIGCHLD -1].sa.sa_handler = SIG_IGN;
> +	spin_unlock_irq(&me->sighand->siglock);

Taking a lock around a single atomic write is always fishy.  What
exactly is this locking here for?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-08  1:07                                             ` Eric W. Biederman
  2012-05-08  4:53                                               ` Mike Galbraith
@ 2012-05-09  7:55                                               ` Nick Piggin
  2012-05-09 11:02                                                 ` Eric W. Biederman
  2012-05-09 13:59                                                 ` Paul E. McKenney
  1 sibling, 2 replies; 69+ messages in thread
From: Nick Piggin @ 2012-05-09  7:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>
>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>> > of the rcu callback running after the module is unloaded.
>>>
>>> There's more than just a module unload there, though - actual freeing
>>>  struct super_block also happens past that rcu_barrier()...
>
> Al.  I have not closely audited the entire code path but at a quick
> sample I see no evidence that anything depends on inode->i_sb being
> rcu safe.  Do you know of any such location?
>
> It has only been a year and a half since Nick added this code which
> isn't very much time to have grown strange dependencies like that.

No, it has always depended on this.

Look at ncp_compare_dentry(), for example.


>> Is there anything in there for which synchronous operation is required?
>> If not, one approach would be to drop the rcu_barrier() calls to a
>> workqueue or something similar.
>
> We need to drain all of the rcu callbacks before we free the slab
> and unload the module.
>
> This actually makes deactivate_locked_super the totally wrong place
> for the rcu_barrier.  We want the rcu_barrier in the module exit
> routine where we destroy the inode cache.
>
> What I see as the real need is the filesystem modules need to do:
>        rcu_barrier()
>        kmem_cache_destroy(cache);
>
> Perhaps we can add some helpers to make it easy.  But I think
> I would be happy today with simply moving the rcu_barrier into
> every filesystems module exit path, just before the file system
> module destoryed it's inode cache.

No, because that's not the only requirement for the rcu_barrier.

Making it asynchronous is not something I wanted to do, because
then we potentially have a process exiting from kernel space after
releasing last reference on a mount, but the mount does not go
away until "some time" later. Which is crazy.

However. We are holding vfsmount_lock for read at the point
where we ever actually do anything with an "rcu-referenced"
dentry/inode. I wonder if we could use this to get i_sb pinned.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-09  7:55                                               ` Nick Piggin
@ 2012-05-09 11:02                                                 ` Eric W. Biederman
  2012-05-15  8:40                                                   ` Nick Piggin
  2012-05-09 13:59                                                 ` Paul E. McKenney
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-09 11:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

Nick Piggin <npiggin@gmail.com> writes:

> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>
>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>
>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>> > of the rcu callback running after the module is unloaded.
>>>>
>>>> There's more than just a module unload there, though - actual freeing
>>>>  struct super_block also happens past that rcu_barrier()...
>>
>> Al.  I have not closely audited the entire code path but at a quick
>> sample I see no evidence that anything depends on inode->i_sb being
>> rcu safe.  Do you know of any such location?
>>
>> It has only been a year and a half since Nick added this code which
>> isn't very much time to have grown strange dependencies like that.
>
> No, it has always depended on this.
>
> Look at ncp_compare_dentry(), for example.

Interesting. ncp_compare_dentry this logic is broken.

Accessing i_sb->s_fs_info for parameters does seem reasonable.
Unfortunately ncp_put_super frees server directly.

Meaning if we are depending on only rcu protections a badly timed
ncp_compare_dentry will oops the kernel.

I am going to go out on a limb and guess that every other filesystem
with a similar dependency follows the same pattern and is likely
broken as well.

>> We need to drain all of the rcu callbacks before we free the slab
>> and unload the module.
>>
>> This actually makes deactivate_locked_super the totally wrong place
>> for the rcu_barrier.  We want the rcu_barrier in the module exit
>> routine where we destroy the inode cache.
>>
>> What I see as the real need is the filesystem modules need to do:
>>        rcu_barrier()
>>        kmem_cache_destroy(cache);
>>
>> Perhaps we can add some helpers to make it easy.  But I think
>> I would be happy today with simply moving the rcu_barrier into
>> every filesystems module exit path, just before the file system
>> module destoryed it's inode cache.
>
> No, because that's not the only requirement for the rcu_barrier.
>
> Making it asynchronous is not something I wanted to do, because
> then we potentially have a process exiting from kernel space after
> releasing last reference on a mount, but the mount does not go
> away until "some time" later. Which is crazy.

Well we certainly want a deliberate unmount of a filesystem to safely
and successfully put the filesystem in a sane state before the unmount
returns.

If we have a few linger data structures waiting for an rcu grace period
after a process exits I'm not certain that is bad.  Although I would not
mind it much.

> However. We are holding vfsmount_lock for read at the point
> where we ever actually do anything with an "rcu-referenced"
> dentry/inode. I wonder if we could use this to get i_sb pinned.

Interesting observation.

Taking that observation farther we have a mount reference count, that
pins the super block.  So at first glance the super block looks safe
without any rcu protections.

I'm not certain what pins the inodes. Let's see:

mnt->d_mnt_root has the root dentry of the dentry tree, and that
dentry count is protected by the vfsmount_lock.

Beyond that we have kill_sb.
  kill_sb() typically calls generic_shutdown_super()
  From generic_shutdown_super() we call:
     shrink_dcache_for_umount() which flushes lingering dentries.
     evict_inodes() which flushes lingering inodes.

So in some sense the reference counts on mounts and dentries protect
the cache.

So the only case I can see where rcu appears to matter is when we are
freeing dentries.

When freeing dentries the idiom is:
dentry_iput(dentry);
d_free(dentry);

d_free does if (dentry->d_flags & DCACHE_RCUACCESS) call_rcu(... __d_free);

So while most of the time dentries hold onto inodes reliably with a
reference count and most of the time dentries are kept alive by the
dentry->d_count part of the time there is this gray zone where only
rcu references to dentries are keeping them alive.

Which explains the need for rcu freeing of inodes. 

This makes me wonder why we think calling d_release is safe
before we want the rcu grace period.

Documentation/filesystems/vfs.txt seems to duplicate this reasoning
of why the superblock is safe.  Because we hold a real reference to it
from the vfsmount.

The strangest case is calling __lookup_mnt during an "rcu-path-walk".
But mounts are reference counted from the mount namespace, and
are protected during an "rcu-path-walk" by vfsmount_lock read locked,
and are only changed with vfsmount_lock write locked.

Which leads again (with stronger reasons now) to the conclusions that:
a) We don't depend on rcu_barrier to protect the superblock.
b) My trivial patch is safe.
c) We probably should move rcu_barrier to the filesystem module exit
   routines, just to make things clear and to make everything faster.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-09  7:55                                               ` Nick Piggin
  2012-05-09 11:02                                                 ` Eric W. Biederman
@ 2012-05-09 13:59                                                 ` Paul E. McKenney
  1 sibling, 0 replies; 69+ messages in thread
From: Paul E. McKenney @ 2012-05-09 13:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric W. Biederman, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

On Wed, May 09, 2012 at 05:55:57PM +1000, Nick Piggin wrote:
> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

[ . . . ]

> >> Is there anything in there for which synchronous operation is required?
> >> If not, one approach would be to drop the rcu_barrier() calls to a
> >> workqueue or something similar.
> >
> > We need to drain all of the rcu callbacks before we free the slab
> > and unload the module.
> >
> > This actually makes deactivate_locked_super the totally wrong place
> > for the rcu_barrier.  We want the rcu_barrier in the module exit
> > routine where we destroy the inode cache.
> >
> > What I see as the real need is the filesystem modules need to do:
> >        rcu_barrier()
> >        kmem_cache_destroy(cache);
> >
> > Perhaps we can add some helpers to make it easy.  But I think
> > I would be happy today with simply moving the rcu_barrier into
> > every filesystems module exit path, just before the file system
> > module destoryed it's inode cache.
> 
> No, because that's not the only requirement for the rcu_barrier.
> 
> Making it asynchronous is not something I wanted to do, because
> then we potentially have a process exiting from kernel space after
> releasing last reference on a mount, but the mount does not go
> away until "some time" later. Which is crazy.

In any case, I am looking into making concurrent calls to rcu_barrier()
share each others' work, so if asynchronous turns out to be needed,
it will be efficient.

						Thanx, Paul

> However. We are holding vfsmount_lock for read at the point
> where we ever actually do anything with an "rcu-referenced"
> dentry/inode. I wonder if we could use this to get i_sb pinned.
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-09 11:02                                                 ` Eric W. Biederman
@ 2012-05-15  8:40                                                   ` Nick Piggin
  2012-05-16  0:34                                                     ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Nick Piggin @ 2012-05-15  8:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

On 9 May 2012 21:02, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>
>>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>>
>>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>>> > of the rcu callback running after the module is unloaded.
>>>>>
>>>>> There's more than just a module unload there, though - actual freeing
>>>>>  struct super_block also happens past that rcu_barrier()...
>>>
>>> Al.  I have not closely audited the entire code path but at a quick
>>> sample I see no evidence that anything depends on inode->i_sb being
>>> rcu safe.  Do you know of any such location?
>>>
>>> It has only been a year and a half since Nick added this code which
>>> isn't very much time to have grown strange dependencies like that.
>>
>> No, it has always depended on this.
>>
>> Look at ncp_compare_dentry(), for example.
>
> Interesting. ncp_compare_dentry this logic is broken.
>
> Accessing i_sb->s_fs_info for parameters does seem reasonable.
> Unfortunately ncp_put_super frees server directly.
>
> Meaning if we are depending on only rcu protections a badly timed
> ncp_compare_dentry will oops the kernel.
>
> I am going to go out on a limb and guess that every other filesystem
> with a similar dependency follows the same pattern and is likely
> broken as well.

But ncp_put_super should be called after the rcu_barrier(), no?

How is it broken?


>>> We need to drain all of the rcu callbacks before we free the slab
>>> and unload the module.
>>>
>>> This actually makes deactivate_locked_super the totally wrong place
>>> for the rcu_barrier.  We want the rcu_barrier in the module exit
>>> routine where we destroy the inode cache.
>>>
>>> What I see as the real need is the filesystem modules need to do:
>>>        rcu_barrier()
>>>        kmem_cache_destroy(cache);
>>>
>>> Perhaps we can add some helpers to make it easy.  But I think
>>> I would be happy today with simply moving the rcu_barrier into
>>> every filesystems module exit path, just before the file system
>>> module destoryed it's inode cache.
>>
>> No, because that's not the only requirement for the rcu_barrier.
>>
>> Making it asynchronous is not something I wanted to do, because
>> then we potentially have a process exiting from kernel space after
>> releasing last reference on a mount, but the mount does not go
>> away until "some time" later. Which is crazy.
>
> Well we certainly want a deliberate unmount of a filesystem to safely
> and successfully put the filesystem in a sane state before the unmount
> returns.
>
> If we have a few linger data structures waiting for an rcu grace period
> after a process exits I'm not certain that is bad.  Although I would not
> mind it much.
>
>> However. We are holding vfsmount_lock for read at the point
>> where we ever actually do anything with an "rcu-referenced"
>> dentry/inode. I wonder if we could use this to get i_sb pinned.
>
> Interesting observation.
>
> Taking that observation farther we have a mount reference count, that
> pins the super block.  So at first glance the super block looks safe
> without any rcu protections.

Well yes, that's what I'm getting at. But I don't think it's quite complete...

>
> I'm not certain what pins the inodes. Let's see:
>
> mnt->d_mnt_root has the root dentry of the dentry tree, and that
> dentry count is protected by the vfsmount_lock.

If the mount is already detached from the namespace when we start
to do a path walk, AFAIKS it can be freed up from underneath us at
that point.

This would require cycling vfsmount_lock for write in such path. It's
better than rcu_barrier probably, but not terribly nice.

>
> Beyond that we have kill_sb.
>  kill_sb() typically calls generic_shutdown_super()
>  From generic_shutdown_super() we call:
>     shrink_dcache_for_umount() which flushes lingering dentries.
>     evict_inodes() which flushes lingering inodes.
>
> So in some sense the reference counts on mounts and dentries protect
> the cache.
>
> So the only case I can see where rcu appears to matter is when we are
> freeing dentries.
>
> When freeing dentries the idiom is:
> dentry_iput(dentry);
> d_free(dentry);
>
> d_free does if (dentry->d_flags & DCACHE_RCUACCESS) call_rcu(... __d_free);
>
> So while most of the time dentries hold onto inodes reliably with a
> reference count and most of the time dentries are kept alive by the
> dentry->d_count part of the time there is this gray zone where only
> rcu references to dentries are keeping them alive.
>
> Which explains the need for rcu freeing of inodes.
>
> This makes me wonder why we think calling d_release is safe
> before we want the rcu grace period.

Why wouldn't it be? The superblock cannot go away until all dentries
are freed.

>
> Documentation/filesystems/vfs.txt seems to duplicate this reasoning
> of why the superblock is safe.  Because we hold a real reference to it
> from the vfsmount.

rcu walk does not hold a reference to the vfsmount, however. It can
go away. This is why functions which can be called from rcu-walk
must go through synchronize_rcu() before they go away, also before
the superblock goes away.

The other way we could change the rule is to require barrier only for
those filesystems which access superblock or other info from rcu-walk.
I would prefer not to have such a rule, but it could be pragmatic.

>
> The strangest case is calling __lookup_mnt during an "rcu-path-walk".
> But mounts are reference counted from the mount namespace, and
> are protected during an "rcu-path-walk" by vfsmount_lock read locked,
> and are only changed with vfsmount_lock write locked.
>
> Which leads again (with stronger reasons now) to the conclusions that:
> a) We don't depend on rcu_barrier to protect the superblock.
> b) My trivial patch is safe.
> c) We probably should move rcu_barrier to the filesystem module exit
>   routines, just to make things clear and to make everything faster.

Still not convinced.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
  2012-05-15  8:40                                                   ` Nick Piggin
@ 2012-05-16  0:34                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-16  0:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
	Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
	Christoph Hellwig, linux-fsdevel

Nick Piggin <npiggin@gmail.com> writes:

> On 9 May 2012 21:02, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>>
>>>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>>>
>>>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>>>> > of the rcu callback running after the module is unloaded.
>>>>>>
>>>>>> There's more than just a module unload there, though - actual freeing
>>>>>>  struct super_block also happens past that rcu_barrier()...
>>>>
>>>> Al.  I have not closely audited the entire code path but at a quick
>>>> sample I see no evidence that anything depends on inode->i_sb being
>>>> rcu safe.  Do you know of any such location?
>>>>
>>>> It has only been a year and a half since Nick added this code which
>>>> isn't very much time to have grown strange dependencies like that.
>>>
>>> No, it has always depended on this.
>>>
>>> Look at ncp_compare_dentry(), for example.
>>
>> Interesting. ncp_compare_dentry this logic is broken.
>>
>> Accessing i_sb->s_fs_info for parameters does seem reasonable.
>> Unfortunately ncp_put_super frees server directly.
>>
>> Meaning if we are depending on only rcu protections a badly timed
>> ncp_compare_dentry will oops the kernel.
>>
>> I am going to go out on a limb and guess that every other filesystem
>> with a similar dependency follows the same pattern and is likely
>> broken as well.
>
> But ncp_put_super should be called after the rcu_barrier(), no?
>
> How is it broken?

The interesting hunk of code from deactivate_locked_super is:
>	cleancache_invalidate_fs(s);
>	fs->kill_sb(s);
 	^^^^^^^^^^^^^^  This is where ncp_put_super() is called.
>
>	/* caches are now gone, we can safely kill the shrinker now */
>	unregister_shrinker(&s->s_shrink);
>
>	/*
>	 * We need to call rcu_barrier so all the delayed rcu free
>	 * inodes are flushed before we release the fs module.
>	 */
>	rcu_barrier();
>	put_filesystem(fs);
>	put_super(s);

Which guarantees ncp_put_super() happens before the rcu_barrier.

>> Taking that observation farther we have a mount reference count, that
>> pins the super block.  So at first glance the super block looks safe
>> without any rcu protections.
>
> Well yes, that's what I'm getting at. But I don't think it's quite complete...
>
>>
>> I'm not certain what pins the inodes. Let's see:
>>
>> mnt->d_mnt_root has the root dentry of the dentry tree, and that
>> dentry count is protected by the vfsmount_lock.
>
> If the mount is already detached from the namespace when we start
> to do a path walk, AFAIKS it can be freed up from underneath us at
> that point.
>
> This would require cycling vfsmount_lock for write in such path. It's
> better than rcu_barrier probably, but not terribly nice.

Where do you see the possibility of a mount detached from a namespace
causing problems?   Simply having any count on a mount ensures we cycle
the vfsmount in mntput_no_expire.


Or if you want to see what I am seeing:

The rcu_path_walk starts at one of.  "." "/" or file->f_path, all of
which hold a reference on a struct vfsmount.

We perform an rcu_path_walk with the locking.
br_read_lock(vfsmount_lock);
rcu_read_lock();

We can transition to another vfs mount via follow_mount_rcu 
which consults the mount hash table which can only be modified
under the br_write_lock(vfsmount_lock);

We can also transition to another vfs mount via follow_up_rcu
which simply goes to mnt->mnt_parent.  Where our starting vfsmount
holds a reference to the target vfsmount.

When we complete the rcu_path_walk we do:
rcu_read_unlock()
br_write_lock(vfsmount_lock)

mntput_no_expire, which decrements mount counts takes and releases
br_write_lock before we put the final mount reference.  Which means
that it is impossible for the final mntput on a mount to complete
while we are in the middle of an rcu path walk.

Once we have take and released br_write_lock(vfsmount_lock)
in mntput_no_expire we call mntfree.  mntfree calls
deactivate_super.  And deactivate_super calls deactivate_locked_super.

Which is a long winded way of saying we always call
deactivate_locked_super after we put our final mount count.

I don't possibly see how a mount can be freed while we are in
the middle of a rcu path walk.  Not while we hold the
br_read_lock(vfsmount_lock), and the final mntput takes
br_write_lock(vfsmount_lock).


>> Documentation/filesystems/vfs.txt seems to duplicate this reasoning
>> of why the superblock is safe.  Because we hold a real reference to it
>> from the vfsmount.
>
> rcu walk does not hold a reference to the vfsmount, however. It can
> go away. This is why functions which can be called from rcu-walk
> must go through synchronize_rcu() before they go away, also before
> the superblock goes away.

Not at all.

The rcu walk itself does not hold a reference to the vfsmount, but
something holds a reference to the vfsmount and to drop the final
reference on a vfsmount we must hold the vfsmount_lock for write.
The rcu walk holds the vfsmount_lock for read which prevents us from
grabbing the vfsmount_lock for write.

We need to wait an rcu grace period before freeing dentries and inodes
becuase for dentries and inodes we only have rcu protection for them.
For vfsmounts and the superblock we have a lock protected reference
count.

> The other way we could change the rule is to require barrier only for
> those filesystems which access superblock or other info from rcu-walk.
> I would prefer not to have such a rule, but it could be pragmatic.

I don't see that we need to change a rule.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
  2012-05-08 22:50                 ` Andrew Morton
@ 2012-05-16 18:39                 ` Oleg Nesterov
  2012-05-16 19:34                   ` Oleg Nesterov
  2012-05-16 20:54                   ` Eric W. Biederman
  1 sibling, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-16 18:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Eric, sorry for the huge delay, I was on vacation when you sent this patch...

On 05/06, Eric W. Biederman wrote:
>
> @@ -798,6 +815,12 @@ static void forget_original_parent(struct task_struct *father)
>  	exit_ptrace(father);
>  	reaper = find_new_reaper(father);
>
> +	/* Return immediately if we aren't going to reparent anything */
> +	if (unlikely(reaper == father)) {
> +		write_unlock_irq(&tasklist_lock);
> +		return;
> +	}

I was confused by the comment. Afaics, it is not that "we aren't
going to reparent", we need this change because we can't "reparent"
to the same thread, list_for_each_entry_safe() below can never stop.
But this is off-topic...

Hmm. I don't think the patch is 100% correct. Afaics, this needs more
delay_pidns_leader() checks.

For example. Suppose we have a CLONE_NEWPID zombie I, it has an
EXIT_DEAD child D so delay_pidns_leader(I) == T.

Now suppose that I->real_parent exits, lets denote this task as P.

Suppose that P->real_parent ignores SIGCHLD.

In this case P will do release_task(I) prematurely. And worse, when
D finally does realease_task(D) it will do realease_task(I) again.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-16 18:39                 ` Oleg Nesterov
@ 2012-05-16 19:34                   ` Oleg Nesterov
  2012-05-16 20:54                   ` Eric W. Biederman
  1 sibling, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-16 19:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/16, Oleg Nesterov wrote:
>
> Hmm. I don't think the patch is 100% correct. Afaics, this needs more
> delay_pidns_leader() checks.

Anyway, if we rely on ->children, can't we make a simpler fix?

Something like below. It can be simplified even more, just to explain
the idea. Perhaps we don't even need the new PF_ flag and we can
re-use ->wait_chldexit.

Oleg.

--- x/kernel/pid_namespace.c
+++ x/kernel/pid_namespace.c
@@ -184,6 +184,22 @@ void zap_pid_ns_processes(struct pid_nam
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
+	current->flags |= PF_DEAD_INIT;
+	for (;;) {
+		bool need_wait;
+
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		read_lock(&tasklist_lock);
+		need_wait = !list_empty(current->children);
+		read_unlock(&tasklist_lock);
+
+		if (!need_wait)
+			break;
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	current->flags &= ~PF_DEAD_INIT;
+
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
 
--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -71,6 +71,11 @@ static void __unhash_process(struct task
 
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
+
+		if (unlikely(p->real_parent->flags & PF_DEAD_INIT)
+			if (list_empty(&p->real_parent->children))
+				wake_up_process(p->real_parent);
+
 		__this_cpu_dec(process_counts);
 	}
 	list_del_rcu(&p->thread_group);


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-16 18:39                 ` Oleg Nesterov
  2012-05-16 19:34                   ` Oleg Nesterov
@ 2012-05-16 20:54                   ` Eric W. Biederman
  2012-05-17 17:00                     ` Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-16 20:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> Eric, sorry for the huge delay, I was on vacation when you sent this patch...
>
> On 05/06, Eric W. Biederman wrote:
>>
>> @@ -798,6 +815,12 @@ static void forget_original_parent(struct task_struct *father)
>>  	exit_ptrace(father);
>>  	reaper = find_new_reaper(father);
>>
>> +	/* Return immediately if we aren't going to reparent anything */
>> +	if (unlikely(reaper == father)) {
>> +		write_unlock_irq(&tasklist_lock);
>> +		return;
>> +	}
>
> I was confused by the comment. Afaics, it is not that "we aren't
> going to reparent", we need this change because we can't "reparent"
> to the same thread, list_for_each_entry_safe() below can never stop.
> But this is off-topic...

True.  We will get stuck if we try to reparent to the same process.

> Hmm. I don't think the patch is 100% correct. Afaics, this needs more
> delay_pidns_leader() checks.
>
> For example. Suppose we have a CLONE_NEWPID zombie I, it has an
> EXIT_DEAD child D so delay_pidns_leader(I) == T.
>
> Now suppose that I->real_parent exits, lets denote this task as P.
>
> Suppose that P->real_parent ignores SIGCHLD.
>
> In this case P will do release_task(I) prematurely. And worse, when
> D finally does realease_task(D) it will do realease_task(I) again.

Good point.  I will fix that and post a patch shortly.  It doesn't
need a full delay_pidns_leader test just a test for children.

In looking for any other weird corner case bugs I am noticing that
I don't think I handled the case of a ptraced init quite right.
I don't understand the change signaling semantics when the
ptracer is our parent.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-16 20:54                   ` Eric W. Biederman
@ 2012-05-17 17:00                     ` Oleg Nesterov
  2012-05-17 21:46                       ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-17 17:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/16, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Hmm. I don't think the patch is 100% correct. Afaics, this needs more
> > delay_pidns_leader() checks.
> >
> > For example. Suppose we have a CLONE_NEWPID zombie I, it has an
> > EXIT_DEAD child D so delay_pidns_leader(I) == T.
> >
> > Now suppose that I->real_parent exits, lets denote this task as P.
> >
> > Suppose that P->real_parent ignores SIGCHLD.
> >
> > In this case P will do release_task(I) prematurely. And worse, when
> > D finally does realease_task(D) it will do realease_task(I) again.
>
> Good point.  I will fix that and post a patch shortly.  It doesn't
> need a full delay_pidns_leader test just a test for children.

This will add more complications. And even this is not enough, I guess.
For example __ptrace_detach()...

I agree, the idea to "hack" release_task() so that it switches to
init is clever, but imho this is too clever ;)

Seriously, what do you think about the patch below? Or something
like this. It is still based on your suggestion to check ->children,
but it is much, much more simple and understandable.

Just in case... Even with the PF_EXITING check __wake_up_parent()
can be wrong, but this is very unlikely and harmless.

What do you think?

> In looking for any other weird corner case bugs I am noticing that
> I don't think I handled the case of a ptraced init quite right.
> I don't understand the change signaling semantics when the
> ptracer is our parent.

Do you mean the "if (tsk->ptrace)" code in exit_notify() ? Nobody
understand it ;) Last time this code was modified by me (iirc), but
I simply tried to preserve the previous behaviour.

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -63,6 +63,13 @@ static void exit_mm(struct task_struct *
 
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
+	struct task_struct *parent = p->parent;
+	bool parent_is_init = false;
+
+#ifdef CONFIG_PID_NS
+	parent_is_init = (task_active_pid_ns(p)->child_reaper == parent);
+#endif
+
 	nr_threads--;
 	detach_pid(p, PIDTYPE_PID);
 	if (group_dead) {
@@ -72,6 +79,11 @@ static void __unhash_process(struct task
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
 		__this_cpu_dec(process_counts);
+
+		if (parent_is_init && (parent->flags & PF_EXITING)) {
+			if (list_empty(&parent->children))
+				__wake_up_parent(p, parent);
+		}
 	}
 	list_del_rcu(&p->thread_group);
 }
--- x/kernel/pid_namespace.c
+++ x/kernel/pid_namespace.c
@@ -184,6 +184,9 @@ void zap_pid_ns_processes(struct pid_nam
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
+	wait_event(current->signal->wait_chldexit,
+			list_empty(&current->children));
+
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-17 17:00                     ` Oleg Nesterov
@ 2012-05-17 21:46                       ` Eric W. Biederman
  2012-05-18 12:39                         ` Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-17 21:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> On 05/16, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > Hmm. I don't think the patch is 100% correct. Afaics, this needs more
>> > delay_pidns_leader() checks.
>> >
>> > For example. Suppose we have a CLONE_NEWPID zombie I, it has an
>> > EXIT_DEAD child D so delay_pidns_leader(I) == T.
>> >
>> > Now suppose that I->real_parent exits, lets denote this task as P.
>> >
>> > Suppose that P->real_parent ignores SIGCHLD.
>> >
>> > In this case P will do release_task(I) prematurely. And worse, when
>> > D finally does release_task(D) it will do release_task(I) again.
>>
>> Good point.  I will fix that and post a patch shortly.  It doesn't
>> need a full delay_pidns_leader test just a test for children.
>
> This will add more complications. And even this is not enough, I guess.
> For example __ptrace_detach()...

Agreed.  I am having to step back and think about this a bit more.

I don't like doing things two different ways but delay_thread_group
leader and all of that is pretty horrible from a maintenance point
of view and extending that just makes things worse.

> I agree, the idea to "hack" release_task() so that it switches to
> init is clever, but imho this is too clever ;)
>
> Seriously, what do you think about the patch below? Or something
> like this. It is still based on your suggestion to check ->children,
> but it is much, much more simple and understandable.
>
> Just in case... Even with the PF_EXITING check __wake_up_parent()
> can be wrong, but this is very unlikely and harmless.
>
> What do you think?

I think there is something very compelling about your solution,
we do need my bit about making the init process ignore SIGCHLD
so all of init's children self reap.

Before I go farther I am going to play with the code more.

In part I think the current code for waiting for processes to
die etc. is pretty horrible maintenance-wise and it might just
be worth cleaning up before we extend it with yet another
strange and bizarre case, if for no other reason than to make
it clear what we are doing.


>> In looking for any other weird corner case bugs I am noticing that
>> I don't think I handled the case of a ptraced init quite right.
>> I don't understand the change in signaling semantics when the
>> ptracer is our parent.
>
> Do you mean the "if (tsk->ptrace)" code in exit_notify() ? Nobody
> understands it ;) Last time this code was modified by me (iirc), but
> I simply tried to preserve the previous behaviour.

Yes.  It is some pretty strange code.  Especially where we are reading
a return result which is always false.  I think there is a bug somewhere
between that code and ptrace detach but I don't know that I could tell
you what it is.

Hopefully I have a follow-on patch in another couple of hours.

Eric


> Oleg.
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -63,6 +63,13 @@ static void exit_mm(struct task_struct *
>  
>  static void __unhash_process(struct task_struct *p, bool group_dead)
>  {
> +	struct task_struct *parent = p->parent;
> +	bool parent_is_init = false;
> +
> +#ifdef CONFIG_PID_NS
> +	parent_is_init = (task_active_pid_ns(p)->child_reaper == parent);
> +#endif
> +
>  	nr_threads--;
>  	detach_pid(p, PIDTYPE_PID);
>  	if (group_dead) {
> @@ -72,6 +79,11 @@ static void __unhash_process(struct task
>  		list_del_rcu(&p->tasks);
>  		list_del_init(&p->sibling);
>  		__this_cpu_dec(process_counts);
> +
> +		if (parent_is_init && (parent->flags & PF_EXITING)) {
> +			if (list_empty(&parent->children))
> +				__wake_up_parent(p, parent);
> +		}
>  	}
>  	list_del_rcu(&p->thread_group); 
>  }
> --- x/kernel/pid_namespace.c
> +++ x/kernel/pid_namespace.c
> @@ -184,6 +184,9 @@ void zap_pid_ns_processes(struct pid_nam
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
> +	wait_event(current->signal->wait_chldexit,
> +			list_empty(&current->children));
> +
>  	if (pid_ns->reboot)
>  		current->signal->group_exit_code = pid_ns->reboot;
>  
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-17 21:46                       ` Eric W. Biederman
@ 2012-05-18 12:39                         ` Oleg Nesterov
  2012-05-19  0:03                           ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-18 12:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/17, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > What do you think?
>
> I think there is something very compelling about your solution,
> we do need my bit about making the init process ignore SIGCHLD
> so all of init's children self reap.

Not sure I understand. This can work with or without 3/3 which
changes zap_pid_ns_processes() to ignore SIGCHLD. And just in
case, I think 3/3 is fine.

And once again, this wait_event() + __wake_up_parent() is very
simple and straightforward, we can cleanup this code later if
needed.


> > Do you mean the "if (tsk->ptrace)" code in exit_notify() ? Nobody
> > understands it ;) Last time this code was modified by me (iirc), but
> > I simply tried to preserve the previous behaviour.
>
> Yes.  It is some pretty strange code.

Yes. In particular, I think it should always use SIGCHLD.

> Especially where we are reading
> a return result which is always false.  I think there is a bug somewhere
> between that code and ptrace detach

Yes. This is the known oddity. We always notify the tracer if the
leader exits, even if !thread_group_empty(). But after that the
tracer can't detach, and it can't do do_wait(WEXITED).

The problem is not that we can't "fix" this. Just any discussed
fix adds the subtle/incompatible user-visible change.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-18 12:39                         ` Oleg Nesterov
@ 2012-05-19  0:03                           ` Eric W. Biederman
  2012-05-21 12:44                             ` Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-19  0:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> On 05/17, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > What do you think?
>>
>> I think there is something very compelling about your solution,
>> we do need my bit about making the init process ignore SIGCHLD
>> so all of init's children self reap.
>
> Not sure I understand. This can work with or without 3/3 which
> changes zap_pid_ns_processes() to ignore SIGCHLD. And just in
> case, I think 3/3 is fine.

The only issue I see is that without 3/3 we might have processes that
no one wait(2)s for and so will never have release_task called on them.

We do have the wait loop but I think there is a race possible there.

> And once again, this wait_event() + __wake_up_parent() is very
> simple and straightforward, we can cleanup this code later if
> needed.

Yes, and it doesn't work when you do an UNINTERRUPTIBLE sleep with
an INTERRUPTIBLE wake up, unless I misread the code.

>> > Do you mean the "if (tsk->ptrace)" code in exit_notify() ? Nobody
>> > understands it ;) Last time this code was modified by me (iirc), but
>> > I simply tried to preserve the previous behaviour.
>>
>> Yes.  It is some pretty strange code.
>
> Yes. In particular, I think it should always use SIGCHLD.
>
>> Especially where we are reading
>> a return result which is always false.  I think there is a bug somewhere
>> between that code and ptrace detach
>
> Yes. This is the known oddity. We always notify the tracer if the
> leader exits, even if !thread_group_empty(). But after that the
> tracer can't detach, and it can't do do_wait(WEXITED).
>
> The problem is not that we can't "fix" this. Just any discussed
> fix adds the subtle/incompatible user-visible change.

Yes and that is nasty.

I need to sit down and write a good change log and do a bit more testing
(hopefully tonight) but this is what I have come up with so far.

It is based on your first version of the patch with a few changes
a TASK_INTERRUPTIBLE sleep so that we don't count in the load average,
and moving detach_pid so we don't have to be super careful about
where we call task_active_pid_ns.

Eric


---
 kernel/exit.c          |   13 ++++++++++++-
 kernel/pid_namespace.c |   11 +++++++++++
 2 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d8bd3b42..abc4fc0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -64,15 +64,26 @@ static void exit_mm(struct task_struct * tsk);
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
-	detach_pid(p, PIDTYPE_PID);
 	if (group_dead) {
+		struct task_struct *parent;
+
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
 
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
 		__this_cpu_dec(process_counts);
+
+		/* If we are the last child process in a pid namespace
+		 * to be reaped notify the child_reaper.
+		 */
+		parent = p->real_parent;
+		if ((task_active_pid_ns(p)->child_reaper == parent) &&
+		    list_empty(&parent->children) &&
+		    (parent->flags & PF_EXITING))
+			wake_up_process(parent);
 	}
+	detach_pid(p, PIDTYPE_PID);
 	list_del_rcu(&p->thread_group);
 }
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index b98b0ed..ce96627 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
+	read_lock(&tasklist_lock);
+	for (;;) {
+		__set_current_state(TASK_INTERRUPTIBLE);
+		if (list_empty(&current->children))
+			break;
+		read_unlock(&tasklist_lock);
+		schedule();
+		read_lock(&tasklist_lock);
+	}
+	read_unlock(&tasklist_lock);
+
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-19  0:03                           ` Eric W. Biederman
@ 2012-05-21 12:44                             ` Oleg Nesterov
  2012-05-22  0:16                               ` Eric W. Biederman
  2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
  0 siblings, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-21 12:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/18, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> I think there is something very compelling about your solution,
> >> we do need my bit about making the init process ignore SIGCHLD
> >> so all of init's children self reap.
> >
> > Not sure I understand. This can work with or without 3/3 which
> > changes zap_pid_ns_processes() to ignore SIGCHLD. And just in
> > case, I think 3/3 is fine.
>
> The only issue I see is that without 3/3 we might have processes that
> no one wait(2)s for and so will never have release_task called on them.
>
> We do have the wait loop

Yes, and we need this loop anyway, even if SIGCHLD is ignored.
It is possible that we already have EXIT_ZOMBIE child(s) when
zap_pid_ns_processes() is called.

> but I think there is a race possible there.

Hmm. I do not see any race, but perhaps I missed something.
I think we can trust -ECHILD, or do_wait() is buggy.

Hmm. But there is another (off-topic) problem, security_task_wait()
can return an error if there are some security policy problems...
OK, this shouldn't happen I hope.

> > And once again, this wait_event() + __wake_up_parent() is very
> > simple and straightforward, we can cleanup this code later if
> > needed.
>
> Yes, and it doesn't work when you do an UNINTERRUPTIBLE sleep with
> an INTERRUPTIBLE wake up, unless I misread the code.

Yes. so we need wait_event_interruptible() or __unhash_process()
should use __wake_up_sync_key(wait_chldexit).

> > Yes. This is the known oddity. We always notify the tracer if the
> > leader exits, even if !thread_group_empty(). But after that the
> > tracer can't detach, and it can't do do_wait(WEXITED).
> >
> > The problem is not that we can't "fix" this. Just any discussed
> > fix adds the subtle/incompatible user-visible change.
>
> Yes and that is nasty.

Agreed. ptrace API is nasty ;)

> and moving detach_pid so we don't have to be super careful about
> where we call task_active_pid_ns.

Yes, I was thinking about this change too,

> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>
> +	read_lock(&tasklist_lock);
> +	for (;;) {
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		if (list_empty(&current->children))
> +			break;
> +		read_unlock(&tasklist_lock);
> +		schedule();

OK, but then it makes sense to add clear_thread_flag(TIF_SIGPENDING)
before schedule, to avoid the busy-wait loop (like the sys_wait4 loop
does). Or simply use TASK_UNINTERRUPTIBLE, I do not think it is that
important to "fool" /proc/loadavg. But I am fine either way.

Maybe you can also add "ifdef CONFIG_PID_NS" into __unhash_process(),
but this is minor too.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.
  2012-05-21 12:44                             ` Oleg Nesterov
@ 2012-05-22  0:16                               ` Eric W. Biederman
  2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
  1 sibling, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-22  0:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> On 05/18, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> >> I think there is something very compelling about your solution,
>> >> we do need my bit about making the init process ignore SIGCHLD
>> >> so all of init's children self reap.
>> >
>> > Not sure I understand. This can work with or without 3/3 which
>> > changes zap_pid_ns_processes() to ignore SIGCHLD. And just in
>> > case, I think 3/3 is fine.
>>
>> The only issue I see is that without 3/3 we might have processes that
>> no one wait(2)s for and so will never have release_task called on them.
>>
>> We do have the wait loop
>
> Yes, and we need this loop anyway, even if SIGCHLD is ignored.
> It is possible that we already have EXIT_ZOMBIE child(s) when
> zap_pid_ns_processes() is called.
>
>> but I think there is a race possible there.
>
> Hmm. I do not see any race, but perhaps I missed something.
> I think we can trust -ECHILD, or do_wait() is buggy.

Thinking about it some more, you are right.  For some reason
I had forgotten that without WNOHANG we block until
a child exits.

> Hmm. But there is another (off-topic) problem, security_task_wait()
> can return an error if there are some security policy problems...
> OK, this shouldn't happen I hope.

Agreed.  We might be able to address that problem but that is indeed
another issue.

>> > And once again, this wait_event() + __wake_up_parent() is very
>> > simple and straightforward, we can cleanup this code later if
>> > needed.
>>
>> Yes, and it doesn't work when you do an UNINTERRUPTIBLE sleep with
>> an INTERRUPTIBLE wake up, unless I misread the code.
>
> Yes. so we need wait_event_interruptible() or __unhash_process()
> should use __wake_up_sync_key(wait_chldexit).
>
>> > Yes. This is the known oddity. We always notify the tracer if the
>> > leader exits, even if !thread_group_empty(). But after that the
>> > tracer can't detach, and it can't do do_wait(WEXITED).
>> >
>> > The problem is not that we can't "fix" this. Just any discussed
>> > fix adds the subtle/incompatible user-visible change.
>>
>> Yes and that is nasty.
>
> Agreed. ptrace API is nasty ;)
>
>> and moving detach_pid so we don't have to be super careful about
>> where we call task_active_pid_ns.
>
> Yes, I was thinking about this change too,
>
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>>  	} while (rc != -ECHILD);
>>
>> +	read_lock(&tasklist_lock);
>> +	for (;;) {
>> +		__set_current_state(TASK_INTERRUPTIBLE);
>> +		if (list_empty(&current->children))
>> +			break;
>> +		read_unlock(&tasklist_lock);
>> +		schedule();
>
> OK, but then it makes sense to add clear_thread_flag(TIF_SIGPENDING)
> before schedule, to avoid the busy-wait loop (like the sys_wait4 loop
> does). Or simply use TASK_UNINTERRUPTIBLE, I do not think it is that
> important to "fool" /proc/loadavg. But I am fine either way.

It can get darn strange when you hold a thread stopped with ptrace
and your load mysteriously jumps.  But we already have this problem
with de_thread and people aren't yelling so shrug.

So at a practical level I don't think it is fooling /proc/loadavg, but
at this point if we want more accuracy from /proc/loadavg we need to
fix the computation and distinguish short-term disk sleeps from other
uninterruptible sleeps, rather than hacking around it with code like
this.

> Maybe you can also add "ifdef CONFIG_PID_NS" into __unhash_process(),
> but this is minor too.

An #ifdef just leads to weird build failures in rare
configurations.  If we can hide it all away in a header, fine, but
putting a bare #ifdef in the core of the code simply as a performance
optimization is ugly and a major testing challenge.  Keeping track of
all of the flying pieces with this patch has been tricky enough as it
is.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2
  2012-05-21 12:44                             ` Oleg Nesterov
  2012-05-22  0:16                               ` Eric W. Biederman
@ 2012-05-22  0:20                               ` Eric W. Biederman
  2012-05-22 16:54                                 ` Oleg Nesterov
  2012-05-22 19:23                                 ` Andrew Morton
  1 sibling, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-22  0:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith


Today we have a two-fold bug.  Sometimes release_task on pid == 1 in a
pid namespace can run before other processes in the pid namespace have
had release_task called, with the result that pid_ns_release_proc can be
called before the last proc_flush_task() is done using
upid->ns->proc_mnt, resulting in the use of a stale pointer.  This same
set of circumstances can lead to waitpid(...) returning for a process
started with clone(CLONE_NEWPID) before every process in the pid
namespace has actually exited.

To fix this, modify zap_pid_ns_processes to wait until all other
processes in the pid namespace have exited, even EXIT_DEAD zombies.

The delay_group_leader and related tests ensure that the thread group
leader will be the last thread of a thread group to be reaped, or to
become EXIT_DEAD and self-reap.  With the change to zap_pid_ns_processes
we get the guarantee that pid == 1 in a pid namespace will be the last
task that release_task is called on.

With pid == 1 being the last task to pass through release_task
pid_ns_release_proc can no longer be called too early nor can wait
return before all of the EXIT_DEAD tasks in a pid namespace have exited.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

Andrew can you replace your earlier version of this patch in your tree
with this one, after Oleg takes a look at it.  I think this is about
as simple and maintainable and obvious as we can make this bug fix.

 kernel/exit.c          |   13 ++++++++++++-
 kernel/pid_namespace.c |   11 +++++++++++
 2 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d8bd3b42..abc4fc0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -64,15 +64,26 @@ static void exit_mm(struct task_struct * tsk);
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
-	detach_pid(p, PIDTYPE_PID);
 	if (group_dead) {
+		struct task_struct *parent;
+
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
 
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
 		__this_cpu_dec(process_counts);
+
+		/* If we are the last child process in a pid namespace
+		 * to be reaped notify the child_reaper.
+		 */
+		parent = p->real_parent;
+		if ((task_active_pid_ns(p)->child_reaper == parent) &&
+		    list_empty(&parent->children) &&
+		    (parent->flags & PF_EXITING))
+			wake_up_process(parent);
 	}
+	detach_pid(p, PIDTYPE_PID);
 	list_del_rcu(&p->thread_group);
 }
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index b98b0ed..ba1cbb8 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
+	read_lock(&tasklist_lock);
+	for (;;) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		if (list_empty(&current->children))
+			break;
+		read_unlock(&tasklist_lock);
+		schedule();
+		read_lock(&tasklist_lock);
+	}
+	read_unlock(&tasklist_lock);
+
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2
  2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
@ 2012-05-22 16:54                                 ` Oleg Nesterov
  2012-05-22 19:23                                 ` Andrew Morton
  1 sibling, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-22 16:54 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton
  Cc: LKML, Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith

On 05/21, Eric W. Biederman wrote:
>
> Andrew can you replace your earlier version of this patch in your tree
> with this one,

Yes, please, the old one is

	pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped.patch

> after Oleg takes a look at it.  I think this is about
> as simple and maintainable and obvious as we can make this bug fix.

I believe the patch is fine.

Thanks Eric.

> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -64,15 +64,26 @@ static void exit_mm(struct task_struct * tsk);
>  static void __unhash_process(struct task_struct *p, bool group_dead)
>  {
>  	nr_threads--;
> -	detach_pid(p, PIDTYPE_PID);
>  	if (group_dead) {
> +		struct task_struct *parent;
> +
>  		detach_pid(p, PIDTYPE_PGID);
>  		detach_pid(p, PIDTYPE_SID);
>  
>  		list_del_rcu(&p->tasks);
>  		list_del_init(&p->sibling);
>  		__this_cpu_dec(process_counts);
> +
> +		/* If we are the last child process in a pid namespace
> +		 * to be reaped notify the child_reaper.
> +		 */
> +		parent = p->real_parent;
> +		if ((task_active_pid_ns(p)->child_reaper == parent) &&
> +		    list_empty(&parent->children) &&
> +		    (parent->flags & PF_EXITING))
> +			wake_up_process(parent);
>  	}
> +	detach_pid(p, PIDTYPE_PID);
>  	list_del_rcu(&p->thread_group);
>  }
>  
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index b98b0ed..ba1cbb8 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
> +	read_lock(&tasklist_lock);
> +	for (;;) {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (list_empty(&current->children))
> +			break;
> +		read_unlock(&tasklist_lock);
> +		schedule();
> +		read_lock(&tasklist_lock);
> +	}
> +	read_unlock(&tasklist_lock);
> +
>  	if (pid_ns->reboot)
>  		current->signal->group_exit_code = pid_ns->reboot;
>  
> -- 
> 1.7.5.4
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2
  2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
  2012-05-22 16:54                                 ` Oleg Nesterov
@ 2012-05-22 19:23                                 ` Andrew Morton
  2012-05-23 14:52                                   ` Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2012-05-22 19:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On Mon, 21 May 2012 18:20:31 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 
> Today we have a two-fold bug.  Sometimes release_task on pid == 1 in a
> pid namespace can run before other processes in the pid namespace have
> had release_task called, with the result that pid_ns_release_proc can be
> called before the last proc_flush_task() is done using
> upid->ns->proc_mnt, resulting in the use of a stale pointer.  This same
> set of circumstances can lead to waitpid(...) returning for a process
> started with clone(CLONE_NEWPID) before every process in the pid
> namespace has actually exited.
> 
> To fix this, modify zap_pid_ns_processes to wait until all other
> processes in the pid namespace have exited, even EXIT_DEAD zombies.
> 
> The delay_group_leader and related tests ensure that the thread group
> leader will be the last thread of a thread group to be reaped, or to
> become EXIT_DEAD and self-reap.  With the change to zap_pid_ns_processes
> we get the guarantee that pid == 1 in a pid namespace will be the last
> task that release_task is called on.
> 
> With pid == 1 being the last task to pass through release_task
> pid_ns_release_proc can no longer be called too early nor can wait
> return before all of the EXIT_DEAD tasks in a pid namespace have exited.
> 
> ...
>
> index d8bd3b42..abc4fc0 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -64,15 +64,26 @@ static void exit_mm(struct task_struct * tsk);
>  static void __unhash_process(struct task_struct *p, bool group_dead)
>  {
>  	nr_threads--;
> -	detach_pid(p, PIDTYPE_PID);
>  	if (group_dead) {
> +		struct task_struct *parent;
> +
>  		detach_pid(p, PIDTYPE_PGID);
>  		detach_pid(p, PIDTYPE_SID);
>  
>  		list_del_rcu(&p->tasks);
>  		list_del_init(&p->sibling);
>  		__this_cpu_dec(process_counts);
> +
> +		/* If we are the last child process in a pid namespace

like this:
		/*
		 * If ...

> +		 * to be reaped notify the child_reaper.

s/reaped/reaped,/

More seriously, it isn't a very good comment.  It tells us "what" the
code is doing (which is pretty obvious from reading it), but it didn't
tell us "why" it is doing this.  Why do PID namespaces need special
handling here?  What's the backstory??

> +		 */
> +		parent = p->real_parent;
> +		if ((task_active_pid_ns(p)->child_reaper == parent) &&
> +		    list_empty(&parent->children) &&
> +		    (parent->flags & PF_EXITING))
> +			wake_up_process(parent);
>  	}
> +	detach_pid(p, PIDTYPE_PID);
>  	list_del_rcu(&p->thread_group);
>  }
>  
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index b98b0ed..ba1cbb8 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
> +	read_lock(&tasklist_lock);
> +	for (;;) {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (list_empty(&current->children))
> +			break;
> +		read_unlock(&tasklist_lock);
> +		schedule();
> +		read_lock(&tasklist_lock);
> +	}
> +	read_unlock(&tasklist_lock);

Well.

a) This loop can leave the thread in state TASK_UNINTERRUPTIBLE,
   which looks wrong.

b) Given that the waking side is also testing list_empty(), I think
   you might need set_current_state() here, with the barrier.  So that
   this thread gets the correct view of the list_head wrt the waker's
   view.

c) Did we really need to bang on tasklist_lock so many times?  There
   doesn't seem a nicer way of structuring this :(

d) Anyone who reads this code a year from now will come away
   thinking "wtf".

   IOW, wtf?  We sit in a dead loop waiting for ->children to drain.
   But why?  Who is draining them?  What are the dynamics here? Why
   do we care?

   IOW, it needs a comment!!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2
  2012-05-22 19:23                                 ` Andrew Morton
@ 2012-05-23 14:52                                   ` Oleg Nesterov
  2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-23 14:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/22, Andrew Morton wrote:
>
> On Mon, 21 May 2012 18:20:31 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> > +		/* If we are the last child process in a pid namespace
>
> like this:
> 		/*
> 		 * If ...
>
> > +		 * to be reaped notify the child_reaper.
>
> s/reaped/reaped,/
>
> More seriously, it isn't a very good comment.  It tells us "what" the
> code is doing (which is pretty obvious from reading it), but it didn't
> tell us "why" it is doing this.  Why do PID namespaces need special
> handling here?  What's the backstory??

Well, this is documented in the changelog but I agree, this needs some
documentation.

Perhaps zap_pid_ns_processes() should document that it waits for the
stealth EXIT_DEAD tasks, and __unhash_process() can simply say
"see zap_pid_ns_processes()".

> > @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> >  		rc = sys_wait4(-1, NULL, __WALL, NULL);
> >  	} while (rc != -ECHILD);
> >
> > +	read_lock(&tasklist_lock);
> > +	for (;;) {
> > +		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		if (list_empty(&current->children))
> > +			break;
> > +		read_unlock(&tasklist_lock);
> > +		schedule();
> > +		read_lock(&tasklist_lock);
> > +	}
> > +	read_unlock(&tasklist_lock);
>
> Well.
>
> a) This loop can leave the thread in state TASK_UNINTERRUPTIBLE,
>    which looks wrong.

OOPS. You fixed this in *-fix.patch, thanks.

> b) Given that the waking side is also testing list_empty(), I think
>    you might need set_current_state() here, with the barrier.  So that
>    this thread gets the correct view of the list_head wrt the waker's
>    view.

We rely on tasklist_lock, note that the waking side takes it for writing.

> c) Did we really need to bang on tasklist_lock so many times?  There
>    doesn't seem to be a nicer way of structuring this :(

See above.

We do not really need tasklist_lock to wait for list_empty(children),
we could add a couple of barriers or use wait_event(wait_chldexit).
But the explicit usage of tasklist_lock makes the code more understandable:
this way we obviously can't race with the child doing release_task(),
say, we can't return before the last detach_pid(PIDTYPE_PID).

As for re-structuring, I'd suggest

	for (;;) {
		bool need_wait = false;

		read_lock(&tasklist_lock);
		if (!list_empty(&current->children)) {
			__set_current_state(TASK_UNINTERRUPTIBLE);
			need_wait = true;
		}
		read_unlock(&tasklist_lock);

		if (!need_wait)
			break;
		schedule();
	}

but this is subjective.

Oleg.



* [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix
  2012-05-23 14:52                                   ` Oleg Nesterov
@ 2012-05-25 15:15                                     ` Oleg Nesterov
  2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
  2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Eric W. Biederman
  0 siblings, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-25 15:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

So. Eric, Andrew, will you agree with this cleanup on top of
pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix.patch
?

1. Update the comments in zap_pid_ns_processes() and __unhash_process()

2. Move the wake-up-reaper code in __unhash_process() under IS_ENABLED()

3. Re-structure the wait-for-empty-children code in zap_pid_ns_processes()

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c          |   17 +++++++++--------
 kernel/pid_namespace.c |   21 ++++++++++++++-------
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 231decb..b3e6e0e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -65,8 +65,6 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
 	if (group_dead) {
-		struct task_struct *parent;
-
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
 
@@ -76,13 +74,16 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 
 		/*
 		 * If we are the last child process in a pid namespace to be
-		 * reaped, notify the child_reaper.
+		 * reaped, notify the child_reaper, see zap_pid_ns_processes().
 		 */
-		parent = p->real_parent;
-		if ((task_active_pid_ns(p)->child_reaper == parent) &&
-		    list_empty(&parent->children) &&
-		    (parent->flags & PF_EXITING))
-			wake_up_process(parent);
+		if (IS_ENABLED(CONFIG_PID_NS)) {
+			struct task_struct *parent = p->real_parent;
+
+			if ((task_active_pid_ns(p)->child_reaper == parent) &&
+			    list_empty(&parent->children) &&
+			    (parent->flags & PF_EXITING))
+				wake_up_process(parent);
+		}
 	}
 	detach_pid(p, PIDTYPE_PID);
 	list_del_rcu(&p->thread_group);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 723c948..c2b0df3 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -184,17 +184,24 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
-	read_lock(&tasklist_lock);
+	/*
+	 * sys_wait4() above can't reap the TASK_DEAD children we may
+	 * have. Make sure they all go away, see __unhash_process().
+	 */
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		if (list_empty(&current->children))
-			break;
+		bool need_wait = false;
+
+		read_lock(&tasklist_lock);
+		if (!list_empty(&current->children)) {
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			need_wait = true;
+		}
 		read_unlock(&tasklist_lock);
+
+		if (!need_wait)
+			break;
 		schedule();
-		read_lock(&tasklist_lock);
 	}
-	read_unlock(&tasklist_lock);
-	set_current_state(TASK_RUNNING);
 
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;



* [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
  2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Oleg Nesterov
@ 2012-05-25 15:59                                       ` Oleg Nesterov
  2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
  2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Eric W. Biederman
  1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-25 15:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/25, Oleg Nesterov wrote:
>
> So. Eric, Andrew, will you agree with this cleanup on top of
> pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix.patch
> ?

And I think we need another subtle fix.

Compile-only tested, please review.

Oleg.



* [PATCH -mm 1/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
  2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
@ 2012-05-25 16:00                                         ` Oleg Nesterov
  2012-05-25 21:43                                           ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-25 16:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

find_new_reaper() changes pid_ns->child_reaper, see add0d4df
"pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing".

The original reason has gone away after the previous patch:
the ->children list must be empty after zap_pid_ns_processes().
However, "can't clear ->child_reaper or leave it alone" is
still true, and now we can not use init_pid_ns.child_reaper.

__unhash_process() relies on the "->child_reaper == parent"
check, but this check does not work if the last exiting task
is also the child reaper.

Change find_new_reaper() to use pid_ns->parent->child_reaper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b3e6e0e..9f9af91 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -733,11 +733,11 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 		zap_pid_ns_processes(pid_ns);
 		write_lock_irq(&tasklist_lock);
 		/*
-		 * We can not clear ->child_reaper or leave it alone.
-		 * There may by stealth EXIT_DEAD tasks on ->children,
-		 * forget_original_parent() must move them somewhere.
+		 * Our parent can be ->child_reaper as well, make sure
+		 * we don't break the "child_reaper == parent" logic in
+		 * __unhash_process().
 		 */
-		pid_ns->child_reaper = init_pid_ns.child_reaper;
+		pid_ns->child_reaper = pid_ns->parent->child_reaper;
 	} else if (father->signal->has_child_subreaper) {
 		struct task_struct *reaper;
 
-- 
1.5.5.1




* Re: [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix
  2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Oleg Nesterov
  2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
@ 2012-05-25 21:25                                       ` Eric W. Biederman
  2012-05-27 18:41                                         ` [PATCH -mm v2] " Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-25 21:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> So. Eric, Andrew, will you agree with this cleanup on top of
> pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix.patch
> ?
>
> 1. Update the comments in zap_pid_ns_processes() and __unhash_process()

In zap_pid_ns_processes I wonder if we should update the big block
comment with a little more of the theory.  AKA we want as many children
to self-reap and become EXIT_DEAD children as possible because it
enables more parallelism and is thus faster.

> 2. Move the wake-up-reaper code in __unhash_process() under IS_ENABLED()

I don't really care, it certainly looks better than an #ifdef block.
However, come to think of it, it is about time to just plain start
removing those config options.  The original point was so that there
would be a simple hammer people could throw while we were implementing
the namespaces, to easily avoid any issues.  At this point, with the
namespaces being about as stable as the rest of the kernel, I don't know
that there is any advantage in having a config option.

> 3. Re-structure the wait-for-empty-children code in zap_pid_ns_processes()

The restructuring seems basically sane.

> @@ -76,13 +74,16 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
>  
>  		/*
>  		 * If we are the last child process in a pid namespace to be
> -		 * reaped, notify the child_reaper.
> +		 * reaped, notify the child_reaper, see zap_pid_ns_processes().
>  		 */


How about instead:
>  		/*
>  		 * If we are the last child process in a pid namespace to be
> -		 * reaped, notify the child_reaper.
> +		 * reaped, wake up the child_reaper sleeping in zap_pid_ns_processes().
>  		 */


> -		parent = p->real_parent;
> -		if ((task_active_pid_ns(p)->child_reaper == parent) &&
> -		    list_empty(&parent->children) &&
> -		    (parent->flags & PF_EXITING))
> -			wake_up_process(parent);
> +		if (IS_ENABLED(CONFIG_PID_NS)) {
> +			struct task_struct *parent = p->real_parent;
> +
> +			if ((task_active_pid_ns(p)->child_reaper == parent) &&
> +			    list_empty(&parent->children) &&
> +			    (parent->flags & PF_EXITING))
> +				wake_up_process(parent);
> +		}
>  	}
>  	detach_pid(p, PIDTYPE_PID);
>  	list_del_rcu(&p->thread_group);

Eric


* Re: [PATCH -mm 1/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
  2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
@ 2012-05-25 21:43                                           ` Eric W. Biederman
  2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-25 21:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> find_new_reaper() changes pid_ns->child_reaper, see add0d4df
> "pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing".
>
> The original reason has gone away after the previous patch,
> ->children list must be empty after zap_pid_ns_processes().
> However, "can't clear ->child_reaper or leave it alone" is
> still true, and now we can not use init_pid_ns.child_reaper.
>
> __unhash_process() relies on the "->child_reaper == parent"
> check, but this check does not work if the last exiting task
> is also the child reaper.
>
> Change find_new_reaper() to use pid_ns->parent->child_reaper.

Oleg this is a good catch for a real problem.  However I disagree about
the fix.

We should make unhash_process say:
	if ((task_active_pid_ns(parent)->child_reaper == parent) &&
	    list_empty(&parent->children) &&
	    (parent->flags & PF_EXITING))
		wake_up_process(parent);

It is always the child_reaper of our parent's pid namespace that we are
reparented to if our parent exits.  So we were looking at the wrong
process's pid_namespace.  Just using parent removes any need for magic
after zap_pid_ns_processes(), and the test always becomes valid.

And we should just delete the code after zap_pid_ns_processes() that
changes the child_reaper, since nothing will use the child_reaper
after that.  We could set pid_ns.child_reaper to NULL after that,
but why bother.

Eric


* [PATCH -mm v2] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix
  2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Eric W. Biederman
@ 2012-05-27 18:41                                         ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-27 18:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/25, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > 1. Update the comments in zap_pid_ns_processes() and __unhash_process()
>
> In zap_pid_ns_processes I wonder if we should update the big block
> comment with a little more of the theory.  AKA we want as many children
> to self-reap and become EXIT_DEAD children as possible becasue it
> enables more parallelism and is thus faster.

Yes, the comment can be better, I agree.

Ideally it should explain that we need the sys_wait4() loop even if
we ignore SIGCHLD (with your patch), but at the same time we need
the wait-for-empty loop even if SIGCHLD is not ignored.

OK, I tried to make it a bit better, see below. Feel free to rewrite.

> > 2. Move the wake-up-reaper code in __unhash_process() under IS_ENABLED()
>
> I don't really care, it ceartainly looks better than an #ifdef block.
> However come to think of it, it is about time to just plain start
> removing those config options.

Probably, I do not mind if we remove CONFIG_PID_NS. But until we do this,
I think IS_ENABLED(CONFIG_PID_NS) makes sense as documentation.

> > 3. Re-structure the wait-for-empty-children code in zap_pid_ns_processes()
> The restructuring seems basically sane.

Good.

> > +		 * reaped, notify the child_reaper, see zap_pid_ns_processes().
> >  		 */
>
> How about instead:
> >  		/*
> >  		 * If we are the last child process in a pid namespace to be
> > -		 * reaped, notify the child_reaper.
> > +		 * reaped, wake up the child_reaper sleeping in zap_pid_ns_processes().

OK.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c          |   17 +++++++++--------
 kernel/pid_namespace.c |   22 +++++++++++++++-------
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 231decb..6d66cd2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -65,8 +65,6 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
 	if (group_dead) {
-		struct task_struct *parent;
-
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
 
@@ -76,13 +74,16 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 
 		/*
 		 * If we are the last child process in a pid namespace to be
-		 * reaped, notify the child_reaper.
+		 * reaped, notify the reaper sleeping in zap_pid_ns_processes().
 		 */
-		parent = p->real_parent;
-		if ((task_active_pid_ns(p)->child_reaper == parent) &&
-		    list_empty(&parent->children) &&
-		    (parent->flags & PF_EXITING))
-			wake_up_process(parent);
+		if (IS_ENABLED(CONFIG_PID_NS)) {
+			struct task_struct *parent = p->real_parent;
+
+			if ((task_active_pid_ns(p)->child_reaper == parent) &&
+			    list_empty(&parent->children) &&
+			    (parent->flags & PF_EXITING))
+				wake_up_process(parent);
+		}
 	}
 	detach_pid(p, PIDTYPE_PID);
 	list_del_rcu(&p->thread_group);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 723c948..41ed867 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -179,22 +179,30 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	}
 	read_unlock(&tasklist_lock);
 
+	/* Firstly reap the EXIT_ZOMBIE children we may have. */
 	do {
 		clear_thread_flag(TIF_SIGPENDING);
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
-	read_lock(&tasklist_lock);
+	/*
+	 * sys_wait4() above can't reap the TASK_DEAD children.
+	 * Make sure they all go away, see __unhash_process().
+	 */
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		if (list_empty(&current->children))
-			break;
+		bool need_wait = false;
+
+		read_lock(&tasklist_lock);
+		if (!list_empty(&current->children)) {
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			need_wait = true;
+		}
 		read_unlock(&tasklist_lock);
+
+		if (!need_wait)
+			break;
 		schedule();
-		read_lock(&tasklist_lock);
 	}
-	read_unlock(&tasklist_lock);
-	set_current_state(TASK_RUNNING);
 
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
-- 
1.5.5.1




* [PATCH v2 -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
  2012-05-25 21:43                                           ` Eric W. Biederman
@ 2012-05-27 19:10                                             ` Oleg Nesterov
  2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-27 19:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

On 05/25, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Change find_new_reaper() to use pid_ns->parent->child_reaper.
>
> Oleg this is a good catch for a real problem.  However I disagree about
> the fix.
>
> We should make unhash_process say:
> 	if ((task_active_pid_ns(parent)->child_reaper == parent) &&

Damn, I hate you^W^W^W thanks a lot Eric.

Indeed! Not only is this simpler, it is also more natural!

> And we should just set delete the code after zap_pid_ns_processes

Yes, this is clear.

> We could set pid_ns.child_reaper to NULL after that
> but why bother.

Agreed. Plus I do not think that pid_ns.child_reaper == NULL looks
good even if this doesn't matter currently.

OK, please see v2. Note that I moved detach_pid(PIDTYPE_PID) back.
Yes, yes, there is no real reason to do this. Just I think that if
someone looks at these changes later, it is not easy to understand
why it was moved down.

Thanks!

Oleg.



* [PATCH v2 -mm 1/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
  2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
@ 2012-05-27 19:11                                               ` Oleg Nesterov
  2012-05-29  6:34                                                 ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2012-05-27 19:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

find_new_reaper() changes pid_ns->child_reaper, see add0d4df
"pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing".

The original reason has gone away after the previous patch:
the ->children list must be empty after zap_pid_ns_processes().

However now we can not switch to init_pid_ns.child_reaper.
__unhash_process() relies on the "->child_reaper == parent"
check, but this check does not work if the last exiting task
is also the child reaper.

As Eric suggested, we can change __unhash_process() to use the
parent's pid_ns and remove this code.

Also, with this change we can move detach_pid(PIDTYPE_PID) back
to where it was before the previous fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 6d66cd2..6424e6b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -64,6 +64,7 @@ static void exit_mm(struct task_struct * tsk);
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
+	detach_pid(p, PIDTYPE_PID);
 	if (group_dead) {
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
@@ -79,13 +80,12 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 		if (IS_ENABLED(CONFIG_PID_NS)) {
 			struct task_struct *parent = p->real_parent;
 
-			if ((task_active_pid_ns(p)->child_reaper == parent) &&
+			if ((task_active_pid_ns(parent)->child_reaper == parent) &&
 			    list_empty(&parent->children) &&
 			    (parent->flags & PF_EXITING))
 				wake_up_process(parent);
 		}
 	}
-	detach_pid(p, PIDTYPE_PID);
 	list_del_rcu(&p->thread_group);
 }
 
@@ -732,12 +732,6 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 
 		zap_pid_ns_processes(pid_ns);
 		write_lock_irq(&tasklist_lock);
-		/*
-		 * We can not clear ->child_reaper or leave it alone.
-		 * There may by stealth EXIT_DEAD tasks on ->children,
-		 * forget_original_parent() must move them somewhere.
-		 */
-		pid_ns->child_reaper = init_pid_ns.child_reaper;
 	} else if (father->signal->has_child_subreaper) {
 		struct task_struct *reaper;
 
-- 
1.5.5.1




* Re: [PATCH v2 -mm 1/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
  2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
@ 2012-05-29  6:34                                                 ` Eric W. Biederman
  0 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2012-05-29  6:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, Pavel Emelyanov, Cyrill Gorcunov,
	Louis Rilling, Mike Galbraith

Oleg Nesterov <oleg@redhat.com> writes:

> find_new_reaper() changes pid_ns->child_reaper, see add0d4df
> "pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing".
>
> The original reason has gone away after the previous patch:
> the ->children list must be empty after zap_pid_ns_processes().
>
> However now we can not switch to init_pid_ns.child_reaper.
> __unhash_process() relies on the "->child_reaper == parent"
> check, but this check does not work if the last exiting task
> is also the child reaper.
>
> As Eric suggested, we can change __unhash_process() to use the
> parent's pid_ns and remove this code.
>
> Also, with this change we can move detach_pid(PIDTYPE_PID) back
> to where it was before the previous fix.

This looks good to me. 

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

I will be on the road for the next two days, so I don't expect I will
be particularly active in this conversation for a while.

Eric


> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/exit.c |   10 ++--------
>  1 files changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 6d66cd2..6424e6b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -64,6 +64,7 @@ static void exit_mm(struct task_struct * tsk);
>  static void __unhash_process(struct task_struct *p, bool group_dead)
>  {
>  	nr_threads--;
> +	detach_pid(p, PIDTYPE_PID);
>  	if (group_dead) {
>  		detach_pid(p, PIDTYPE_PGID);
>  		detach_pid(p, PIDTYPE_SID);
> @@ -79,13 +80,12 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
>  		if (IS_ENABLED(CONFIG_PID_NS)) {
>  			struct task_struct *parent = p->real_parent;
>  
> -			if ((task_active_pid_ns(p)->child_reaper == parent) &&
> +			if ((task_active_pid_ns(parent)->child_reaper == parent) &&
>  			    list_empty(&parent->children) &&
>  			    (parent->flags & PF_EXITING))
>  				wake_up_process(parent);
>  		}
>  	}
> -	detach_pid(p, PIDTYPE_PID);
>  	list_del_rcu(&p->thread_group);
>  }
>  
> @@ -732,12 +732,6 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
>  
>  		zap_pid_ns_processes(pid_ns);
>  		write_lock_irq(&tasklist_lock);
> -		/*
> -		 * We can not clear ->child_reaper or leave it alone.
> -		 * There may by stealth EXIT_DEAD tasks on ->children,
> -		 * forget_original_parent() must move them somewhere.
> -		 */
> -		pid_ns->child_reaper = init_pid_ns.child_reaper;
>  	} else if (father->signal->has_child_subreaper) {
>  		struct task_struct *reaper;


end of thread, other threads:[~2012-05-29  6:34 UTC | newest]

Thread overview: 69+ messages
2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
2012-04-28 14:26 ` Oleg Nesterov
2012-04-29  4:13   ` Mike Galbraith
2012-04-29  7:57   ` Eric W. Biederman
2012-04-29  9:49     ` Mike Galbraith
2012-04-29 16:58     ` Oleg Nesterov
2012-04-30  2:59       ` Eric W. Biederman
2012-04-30  3:25         ` Mike Galbraith
2012-05-02 12:40         ` Oleg Nesterov
2012-05-02 17:37           ` Eric W. Biederman
2012-04-30  3:01       ` [PATCH] " Mike Galbraith
     [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
2012-05-01 20:42           ` Andrew Morton
2012-05-03  3:12             ` Mike Galbraith
2012-05-03 14:56               ` Mike Galbraith
2012-05-04  4:27                 ` Mike Galbraith
2012-05-04  7:55                   ` Eric W. Biederman
2012-05-04  8:34                     ` Mike Galbraith
2012-05-04  9:45                     ` Mike Galbraith
2012-05-04 14:13                       ` Eric W. Biederman
2012-05-04 14:49                         ` Mike Galbraith
2012-05-04 15:36                           ` Eric W. Biederman
2012-05-04 16:57                             ` Mike Galbraith
2012-05-04 20:29                               ` Eric W. Biederman
2012-05-05  5:56                                 ` Mike Galbraith
2012-05-05  6:08                                   ` Mike Galbraith
2012-05-05  7:12                                     ` Mike Galbraith
2012-05-05 11:37                                       ` Eric W. Biederman
2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
2012-05-07 22:17                                         ` Al Viro
2012-05-07 23:56                                           ` Paul E. McKenney
2012-05-08  1:07                                             ` Eric W. Biederman
2012-05-08  4:53                                               ` Mike Galbraith
2012-05-09  7:55                                               ` Nick Piggin
2012-05-09 11:02                                                 ` Eric W. Biederman
2012-05-15  8:40                                                   ` Nick Piggin
2012-05-16  0:34                                                     ` Eric W. Biederman
2012-05-09 13:59                                                 ` Paul E. McKenney
2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
2012-05-04  8:19                   ` Mike Galbraith
2012-05-04  8:54                     ` Mike Galbraith
2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
2012-05-08 22:50                 ` Andrew Morton
2012-05-16 18:39                 ` Oleg Nesterov
2012-05-16 19:34                   ` Oleg Nesterov
2012-05-16 20:54                   ` Eric W. Biederman
2012-05-17 17:00                     ` Oleg Nesterov
2012-05-17 21:46                       ` Eric W. Biederman
2012-05-18 12:39                         ` Oleg Nesterov
2012-05-19  0:03                           ` Eric W. Biederman
2012-05-21 12:44                             ` Oleg Nesterov
2012-05-22  0:16                               ` Eric W. Biederman
2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
2012-05-22 16:54                                 ` Oleg Nesterov
2012-05-22 19:23                                 ` Andrew Morton
2012-05-23 14:52                                   ` Oleg Nesterov
2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Oleg Nesterov
2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
2012-05-25 21:43                                           ` Eric W. Biederman
2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
2012-05-29  6:34                                                 ` Eric W. Biederman
2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-reaped-v2-fix-fix Eric W. Biederman
2012-05-27 18:41                                         ` [PATCH -mm v2] " Oleg Nesterov
2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
2012-05-08 22:51                 ` Andrew Morton
2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
