linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Mike Galbraith <efault@gmx.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Louis Rilling <louis.rilling@kerlabs.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH]  Re: [RFC PATCH] namespaces: fix leak on fork() failure
Date: Sat, 05 May 2012 04:37:51 -0700	[thread overview]
Message-ID: <m1obq2kgds.fsf@fess.ebiederm.org> (raw)
In-Reply-To: <1336201977.7346.22.camel@marge.simpson.net> (Mike Galbraith's message of "Sat, 05 May 2012 09:12:57 +0200")

Mike Galbraith <efault@gmx.de> writes:

> On Sat, 2012-05-05 at 08:08 +0200, Mike Galbraith wrote:
>
>> egrep 'synchronize|rcu_barrier' /trace 
>> 
>>           vsftpd-7981  [003] ....   577.164997: synchronize_sched <-switch_task_namespaces
>>           vsftpd-7981  [003] ....   577.164998: _cond_resched <-synchronize_sched
>>           vsftpd-7981  [003] ....   577.164998: wait_rcu_gp <-synchronize_sched
>>           vsftpd-7982  [003] ....   577.166583: synchronize_sched <-switch_task_namespaces
>>           vsftpd-7982  [003] ....   577.166583: _cond_resched <-synchronize_sched
>
>> vsftpd-7977  [003] ....   577.171519: rcu_barrier_sched <-rcu_barrier
>>           vsftpd-7977  [003] ....   577.171519: _rcu_barrier.isra.31 <-rcu_barrier_sched
>>           vsftpd-7977  [003] ....   577.171519: mutex_lock <-_rcu_barrier.isra.31
>>           vsftpd-7977  [003] ....   577.171520: __init_waitqueue_head <-_rcu_barrier.isra.31
>>           vsftpd-7977  [003] ....   577.171520: on_each_cpu <-_rcu_barrier.isra.31
>>           vsftpd-7977  [003] d...   577.171532: rcu_barrier_func <-on_each_cpu
>>           vsftpd-7977  [003] d...   577.171532: call_rcu_sched <-rcu_barrier_func
>>           vsftpd-7977  [003] ....   577.171533: wait_for_completion <-_rcu_barrier.isra.31
>>      ksoftirqd/3-16    [003] ..s.   577.171691: rcu_barrier_callback <-__rcu_process_callbacks
>>           vsftpd-7977  [003] ....   577.176443: mutex_unlock <-_rcu_barrier.isra.31
> ...
>
> Ok, so CLONE_NEWPID | SIGCHLD + waitpid is a bad idea given extreme
> unmount synchronization, but why does it take four softirqs?  Seems this
> could have gone a lot faster. 

It is just taking one 4millisecond or 1 jiffy at 250hz which seems
correct operation for rcu_barrier.

To recap for anyone watching.  We have:

sys_wait4
   do_wait
     ...
        release_task
           proc_flush_task
              pid_ns_release_proc
                 kern_unmount
                    mntput
                      mntput_no_expire
                         mntfree
                            deactivate_super
                               deactivate_locked_super
                                  rcu_barrier

So each instance of sys_wait4 winds up taking 4ms sad.  But that
does explain what it takes so long to reap the zombies we are
synchronous.

The ipc namespace is also going to suffer from this deactivate_super
delay but more likely in exit_namespaces() so the delay should not
be synchronized across a bunch of processes.  Aka the wait should
be done before the parent is notified.

I had a nefarious plan to combine the proc mount reference count with
the pid namespace reference count (to break the loop).  I will see if I
can reawaken that.  If that plan comes to fruition the final put_pid on
the pid namespace should happen in a call_rcu after release_task so
wait4 should not be bottlenecked.

I am still mystified why adding the rest of the namespaces adds so much
of a slowdown.  Those task existing should have been parallized before
do_wait..

The rcu_barrier is new as of 2.6.38-rc5 with commit d863b50ab on Feb 10 2011:

commit d863b50ab01333659314c2034890cb76d9fdc3c7
Author: Boaz Harrosh <bharrosh@panasas.com>
Date:   Thu Feb 10 15:01:20 2011 -0800

    vfs: call rcu_barrier after ->kill_sb()
    
    In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), we use rcu free
    inode instead of freeing the inode directly.  It causes a crash when we
    rmmod immediately after we umount the volume[1].
    
    So we need to call rcu_barrier after we kill_sb so that the inode is
    freed before we do rmmod.  The idea is inspired by Aneesh Kumar.
    rcu_barrier will wait for all callbacks to end before preceding.  The
    original patch was done by Tao Ma, but synchronize_rcu() is not enough
    here.
    
    1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2
    
    Tested-by: Tao Ma <boyu.mt@taobao.com>
    Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Chris Mason <chris.mason@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/fs/super.c b/fs/super.c
index 74e149e..7e9dd4c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -177,6 +177,11 @@ void deactivate_locked_super(struct super_block *s)
        struct file_system_type *fs = s->s_type;
        if (atomic_dec_and_test(&s->s_active)) {
                fs->kill_sb(s);
+               /*
+                * We need to call rcu_barrier so all the delayed rcu free
+                * inodes are flushed before we release the fs module.
+                */
+               rcu_barrier();
                put_filesystem(fs);
                put_super(s);
        } else {



  reply	other threads:[~2012-05-05 11:33 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
2012-04-28 14:26 ` Oleg Nesterov
2012-04-29  4:13   ` Mike Galbraith
2012-04-29  7:57   ` Eric W. Biederman
2012-04-29  9:49     ` Mike Galbraith
2012-04-29 16:58     ` Oleg Nesterov
2012-04-30  2:59       ` Eric W. Biederman
2012-04-30  3:25         ` Mike Galbraith
2012-05-02 12:40         ` Oleg Nesterov
2012-05-02 17:37           ` Eric W. Biederman
2012-04-30  3:01       ` [PATCH] " Mike Galbraith
     [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
2012-05-01 20:42           ` Andrew Morton
2012-05-03  3:12             ` Mike Galbraith
2012-05-03 14:56               ` Mike Galbraith
2012-05-04  4:27                 ` Mike Galbraith
2012-05-04  7:55                   ` Eric W. Biederman
2012-05-04  8:34                     ` Mike Galbraith
2012-05-04  9:45                     ` Mike Galbraith
2012-05-04 14:13                       ` Eric W. Biederman
2012-05-04 14:49                         ` Mike Galbraith
2012-05-04 15:36                           ` Eric W. Biederman
2012-05-04 16:57                             ` Mike Galbraith
2012-05-04 20:29                               ` Eric W. Biederman
2012-05-05  5:56                                 ` Mike Galbraith
2012-05-05  6:08                                   ` Mike Galbraith
2012-05-05  7:12                                     ` Mike Galbraith
2012-05-05 11:37                                       ` Eric W. Biederman [this message]
2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
2012-05-07 22:17                                         ` Al Viro
2012-05-07 23:56                                           ` Paul E. McKenney
2012-05-08  1:07                                             ` Eric W. Biederman
2012-05-08  4:53                                               ` Mike Galbraith
2012-05-09  7:55                                               ` Nick Piggin
2012-05-09 11:02                                                 ` Eric W. Biederman
2012-05-15  8:40                                                   ` Nick Piggin
2012-05-16  0:34                                                     ` Eric W. Biederman
2012-05-09 13:59                                                 ` Paul E. McKenney
2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
2012-05-04  8:19                   ` Mike Galbraith
2012-05-04  8:54                     ` Mike Galbraith
2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
2012-05-08 22:50                 ` Andrew Morton
2012-05-16 18:39                 ` Oleg Nesterov
2012-05-16 19:34                   ` Oleg Nesterov
2012-05-16 20:54                   ` Eric W. Biederman
2012-05-17 17:00                     ` Oleg Nesterov
2012-05-17 21:46                       ` Eric W. Biederman
2012-05-18 12:39                         ` Oleg Nesterov
2012-05-19  0:03                           ` Eric W. Biederman
2012-05-21 12:44                             ` Oleg Nesterov
2012-05-22  0:16                               ` Eric W. Biederman
2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
2012-05-22 16:54                                 ` Oleg Nesterov
2012-05-22 19:23                                 ` Andrew Morton
2012-05-23 14:52                                   ` Oleg Nesterov
2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Oleg Nesterov
2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
2012-05-25 21:43                                           ` Eric W. Biederman
2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
2012-05-29  6:34                                                 ` Eric W. Biederman
2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Eric W. Biederman
2012-05-27 18:41                                         ` [PATCH -mm v2] " Oleg Nesterov
2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
2012-05-08 22:51                 ` Andrew Morton
2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m1obq2kgds.fsf@fess.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=akpm@linux-foundation.org \
    --cc=efault@gmx.de \
    --cc=gorcunov@openvz.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=louis.rilling@kerlabs.com \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).