All of lore.kernel.org
 help / color / mirror / Atom feed
From: Louis Rilling <Louis.Rilling@kerlabs.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Oleg Nesterov <oleg@redhat.com>,
	linux-kernel@vger.kernel.org,
	Linux Containers <containers@lists.osdl.org>,
	Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Subject: Re: [PATCH] pidns: Fix wait for zombies to be reaped in zap_pid_ns_processes
Date: Tue, 13 Jul 2010 23:42:35 +0200	[thread overview]
Message-ID: <20100713214234.GA21042@hawkmoon.kerlabs.com> (raw)
In-Reply-To: <m1tyo4shas.fsf_-_@fess.ebiederm.org>

[-- Attachment #1: Type: text/plain, Size: 8852 bytes --]

Hi Eric,

On 12/07/10 11:09 -0700, Eric W. Biederman wrote:
> 
> Fix zap_pid_ns_processes so that it successfully waits for all of
> the tasks in the pid namespace to be reaped, even if called for a
> non-leader task of the init process.  This guarantees that no task
> can escpae the pid namespace, and that it is safe for proc_flush_task
> to put the proc_mnt of the dead pid_namespace when pid 1 of the
> pid namespace calls proc_flush_task.
> 
> Before zap_pid_ns_processes was fixed to wait for all zombies
> in the pid namespace to be reaped the easiest way to generate
> a zombie that would escape the pid namespace would be to attach
> a debugger to a process inside a pidnamespace from outside the
> pid namespace and then exit the pid namespace.
> 
> In the process of trying to fix this bug I have looked at a lot
> of different options and a lot of different directions we can
> go.  There are several limiting factors.
> 
> - We need to guarantee that the init process of a pid namespace
>   is not reaped before every other task in the pid namespace is
>   reaped.  Wait succeeding on the init process of a pid namespace
>   gives the guarantee that all processes in the pid namespace
>   are dead and gone.  Or more succinctly it is not possible to
>   escape from a pid namespace.
> 
>   The previous behaviour where some zombies could escape the pid
>   namespace violates the assumption made by some reapers of a pid
>   namespace init that all of the pid namespace cleanup has completed
>   by the time that init is reaped.
> 
> - proc_flush_task needs to be called after each task is reaped.
>   Tasks are volatile and applications like top and ps frequently
>   touch every thread group directory in /proc which triggers dcache
>   entries to be created.  If we don't remove those dcache entries
>   when tasks are reaped we can get a large build up of useless
>   inodes and dentries.  Shrink_slab is designed to flush out useful
>   cache entries not useless ones so while in the big picture it doesn't
>   hurt if we leak a few if we leak a lot of dentries we put unnatural
>   pressure on the kernels memory managment.
> 
>   I sat down and attempted to measure the cost of calling
>   proc_flush_task with lat_tcp (from lmbench) and I get the same
>   range of latency readings wether or not proc_flush_task is
>   called.  Which tells me the runtime cost of the existing
>   proc_flush_task is in the noise.
> 
>   By running find /proc/ > /dev/null with proc_flush_task
>   disabled and then examining the counts in the slabcache
>   I managed to see us growing about 84 proc_inodes per
>   iteration, which is pretty horrific.  With proc_flush_task
>   enabled I don't see steady growth in any of the slab caches.
> 
> - Mounts of the /proc need a referenece to the pid namespace
>   that doesn't go away until /proc is unmounted.  Without
>   holding a reference to the pid namespace that lasts until
>   a /proc is unmounted it just isn't safe to lookup and display
>   pids in a particular pid_namespace.
> 
> - The pid_namespace needs to be able to access proc_mnt until
>   the at least the last call of proc_flush_task, for a
>   pid_namespace.
> 
>   Currently there is a the circular reference between proc_mnt
>   and the pid_namespace that we break very carefully through
>   an interaction of zap_pid_ns_processes, and proc_flush_task.
>   That clever interaction going wrong is what caused oopses
>   that led us to realize we have a problem.
> 
>   Even if we fix the pid_namespace and the proc_mnt to have
>   conjoined lifetimes and the oopses are fixed we still have
>   the problem that zombie processes can escape the pid namespace.
>   Which appears to be a problem for people using pid_namespaces
>   as inescapable process containers.
> 
> - fork/exec/waitpid is a common kernel path and as such we need
>   to keep the runtime costs down.  Which means as much as possible
>   we want to keep from adding code (especially expensive code)
>   into the fork/exec/waitpid code path.
> 
>   Changing zap_pid_ns_processes to fix the problem instead of
>   changing the code elsewhere is one of the few solutions I have
>   seen that does not increase the cost of the lat_proc test from
>   lmbench.

This patch looks like it is working (only a small RCU issue shown below). I
couldn't try it yet though.

I must admit that I am using a similar back-off solution in Kerrighed (not to
solve the issue of proc_flush_task(), but for one of the reasons that you stated
above: we want to be sure that all tasks of the namespace have been reaped), but
I considered it too ugly to propose it for Linux ;)

That said, this is probably the least intrusive solution we have seen yet.

> 
> Reported-by: Louis Rilling <Louis.Rilling@kerlabs.com>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  kernel/pid_namespace.c |   50 ++++++++++++++++++++++++++++++++++++-----------
>  1 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a5aff94..aaf2ab0 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -139,16 +139,20 @@ void free_pid_ns(struct kref *kref)
>  
>  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  {
> +	struct task_struct *me = current;
>  	int nr;
>  	int rc;
>  	struct task_struct *task;
>  
>  	/*
> -	 * The last thread in the cgroup-init thread group is terminating.
> -	 * Find remaining pid_ts in the namespace, signal and wait for them
> -	 * to exit.
> +	 * The last task in the pid namespace-init threa group is terminating.

s/threa/thread/

> +	 * Find remaining pids in the namespace, signal and wait for them
> +	 * to to be reaped.
>  	 *
> -	 * Note:  This signals each threads in the namespace - even those that
> +	 * By waiting for all of the tasks to be reaped before init is reaped
> +	 * we provide the invariant that no task can escape the pid namespace.
> +	 *
> +	 * Note:  This signals each task in the namespace - even those that
>  	 * 	  belong to the same thread group, To avoid this, we would have
>  	 * 	  to walk the entire tasklist looking a processes in this
>  	 * 	  namespace, but that could be unnecessarily expensive if the
> @@ -157,28 +161,50 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 *
>  	 */
>  	read_lock(&tasklist_lock);
> -	nr = next_pidmap(pid_ns, 1);
> -	while (nr > 0) {
> -		rcu_read_lock();
> +	for (nr = next_pidmap(pid_ns, 0); nr > 0; nr = next_pidmap(pid_ns, nr)) {
>  
>  		/*
>  		 * Any nested-container's init processes won't ignore the
>  		 * SEND_SIG_NOINFO signal, see send_signal()->si_fromuser().
>  		 */
> -		task = pid_task(find_vpid(nr), PIDTYPE_PID);
> -		if (task)
> +		rcu_read_lock();
> +		task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
> +		if (task && !same_thread_group(task, me))
>  			send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);
> -
>  		rcu_read_unlock();
> -
> -		nr = next_pidmap(pid_ns, nr);
>  	}
>  	read_unlock(&tasklist_lock);
>  
> +	/* Nicely reap all of the remaining children in the namespac */

s/namespac/namespace/

>  	do {
>  		clear_thread_flag(TIF_SIGPENDING);
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
> +       
> +
> +repeat:
> +	/* Brute force wait for any remaining tasks to pass unhash_process
> +	 * in release_task.  Once a task has passed unhash_process there
> +	 * is no pid_namespace state left and they can be safely ignored.
> +	 */
> +	for (nr = next_pidmap(pid_ns, 1); nr > 0; nr = next_pidmap(pid_ns, nr)) {
> +
> +		/* Are there any tasks alive in this pid namespace */
> +		rcu_read_lock();
> +		task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
> +		rcu_read_unlock();
> +		if (task && !same_thread_group(task, me)) {

rcu_read_unlock() should not be called before here, or task may be freed before
calling same_thread_group().

> +			clear_thread_flag(TIF_SIGPENDING);
> +			schedule_timeout_interruptible(HZ/10);
> +			goto repeat;
> +		}
> +	}
> +	/* At this point there are at most two tasks in the pid namespace.
> +	 * These tasks are our current task, and if we aren't pid 1 the zombie
> +	 * of pid 1. In either case pid 1 will be the final task reaped in this
> +	 * pid namespace, as non-leader threads are self reaping and leaders
> +	 * cannot be reaped until all of their siblings have been reaped.
> +	 */
>  
>  	acct_exit_ns(pid_ns);
>  	return;
> -- 
> 1.6.5.2.143.g8cc62

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

  reply	other threads:[~2010-07-13 21:42 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-16 16:34 [PATCH] procfs: Do not release pid_ns->proc_mnt too early Louis Rilling
     [not found] ` <1276706068-18567-1-git-send-email-louis.rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
2010-06-17  9:53   ` Pavel Emelyanov
2010-06-17  9:53     ` Pavel Emelyanov
2010-06-17 13:41     ` Eric W. Biederman
2010-06-17 14:20       ` Louis Rilling
2010-06-17 21:36       ` Oleg Nesterov
2010-06-18  8:27         ` Louis Rilling
2010-06-18 16:27           ` Oleg Nesterov
2010-06-21 11:11             ` Louis Rilling
2010-06-21 12:58               ` Eric W. Biederman
2010-06-21 14:15                 ` Louis Rilling
2010-06-21 14:26                   ` Eric W. Biederman
     [not found]           ` <20100618082738.GE16877-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-06-18 16:27             ` Oleg Nesterov
2010-06-17 21:20 ` Oleg Nesterov
2010-06-18  8:20   ` Louis Rilling
2010-06-18 11:15     ` Oleg Nesterov
2010-06-18 16:08       ` Oleg Nesterov
2010-06-18 16:08         ` Oleg Nesterov
2010-06-18 17:33         ` Louis Rilling
2010-06-18 17:55           ` Oleg Nesterov
2010-06-18 17:55             ` Oleg Nesterov
2010-06-18 21:23             ` Oleg Nesterov
2010-06-18 21:23               ` Oleg Nesterov
2010-06-19 19:08               ` [PATCH 0/4] pid_ns_prepare_proc/unshare cleanups Oleg Nesterov
2010-06-19 19:09                 ` [PATCH 1/4] procfs: proc_get_sb: consolidate/cleanup root_inode->pid logic Oleg Nesterov
2010-06-19 19:10                 ` [PATCH 2/4] procfs: kill the global proc_mnt variable Oleg Nesterov
2010-06-19 19:10                 ` [PATCH 3/4] procfs: move pid_ns_prepare_proc() from copy_process() to create_pid_namespace() Oleg Nesterov
2010-06-19 19:11                 ` [PATCH RESEND 4/4] sys_unshare: simplify the not-really-implemented CLONE_THREAD/SIGHAND/VM code Oleg Nesterov
2010-06-20  8:42                 ` [PATCH 0/6] Unshare support for the pid namespace Eric W. Biederman
2010-06-20  8:44                   ` [PATCH 1/6] pid: Remove the child_reaper special case in init/main.c Eric W. Biederman
     [not found]                     ` <m1ljaaqejm.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-06-20 18:29                       ` Oleg Nesterov
2010-06-20 18:29                         ` Oleg Nesterov
2010-06-20 20:27                         ` Oleg Nesterov
2010-06-20  8:45                   ` [PATCH 2/6] pidns: Call pid_ns_prepare_proc from create_pid_namespace Eric W. Biederman
     [not found]                     ` <m1hbkyqeib.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-06-20 18:19                       ` Oleg Nesterov
2010-06-20 18:19                         ` Oleg Nesterov
2010-06-20  8:45                   ` [PATCH 3/6] procfs: kill the global proc_mnt variable Eric W. Biederman
2010-06-20  8:47                   ` [PATCH 4/6] pidns: Don't allow new pids after the namespace is dead Eric W. Biederman
2010-06-20 18:44                     ` Oleg Nesterov
2010-06-20  8:48                   ` [PATCH 5/6] pidns: Use task_active_pid_ns where appropriate Eric W. Biederman
2010-06-20  8:49                   ` [PATCH 6/6] pidns: Support unsharing the pid namespace Eric W. Biederman
2010-06-20 20:14                     ` Oleg Nesterov
2010-06-20 20:42                       ` Oleg Nesterov
2010-06-21  1:53                       ` Eric W. Biederman
2010-06-20 18:03                   ` [PATCH 0/6] Unshare support for " Oleg Nesterov
2010-06-20 18:05                     ` [PATCH 0/2] pid_ns_release_proc() fixes Oleg Nesterov
2010-06-20 18:06                       ` [PATCH 1/2] pid_ns: move destroy_pid_namespace() into workqueue context Oleg Nesterov
2010-06-20 18:06                       ` [PATCH 2/2] pid_ns: refactor the buggy pid_ns_release_proc() logic Oleg Nesterov
2010-06-20 21:00                     ` [PATCH 0/6] Unshare support for the pid namespace Eric W. Biederman
2010-06-20 21:48                       ` Oleg Nesterov
     [not found]                       ` <m14ogxctd6.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-06-20 21:56                         ` Oleg Nesterov
2010-06-20 21:56                           ` Oleg Nesterov
2011-01-26 15:57                   ` Daniel Lezcano
2010-06-23 20:36                 ` [PATCH 0/1] pid_ns: move pid_ns_release_proc() from proc_flush_task() to zap_pid_ns_processes() Oleg Nesterov
     [not found]                   ` <20100623203652.GA25298-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-06-23 20:37                     ` [PATCH 1/1] " Oleg Nesterov
2010-06-23 20:37                       ` Oleg Nesterov
2010-06-24  6:36                       ` Sukadev Bhattiprolu
2010-06-24 12:59                         ` Oleg Nesterov
2010-06-24  7:06                       ` Eric W. Biederman
2010-06-24 13:01                         ` Oleg Nesterov
2010-06-24  8:37                   ` [PATCH] pid_ns: Fix proc_flush_task() accessing freed proc_mnt Louis Rilling
2010-06-24 17:08                   ` [RESEND PATCH] " Louis Rilling
2010-06-24 19:18                     ` Oleg Nesterov
2010-06-25 10:23                       ` Louis Rilling
     [not found]                         ` <20100625102303.GG3773-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-06-25 12:21                           ` Oleg Nesterov
2010-06-25 12:21                             ` Oleg Nesterov
2010-06-25 18:37                           ` Sukadev Bhattiprolu
2010-06-25 18:37                         ` Sukadev Bhattiprolu
2010-06-25 19:29                           ` Oleg Nesterov
2010-06-25 21:26                             ` Sukadev Bhattiprolu
2010-06-25 21:27                               ` Oleg Nesterov
2010-06-25 22:07                                 ` Sukadev Bhattiprolu
2010-07-09  4:36                                   ` [RFC][PATCH 1/2] pidns: Add a flag to indicate a pid namespace is dead Eric W. Biederman
2010-07-09  4:39                                     ` [RFC][PATCH 2/2] pidns: Remove proc flush races when a pid namespaces are exiting Eric W. Biederman
2010-07-09 12:14                                       ` Louis Rilling
2010-07-09 13:05                                         ` Eric W. Biederman
2010-07-09 14:13                                           ` Louis Rilling
2010-07-09 15:58                                             ` [PATCH 01/24] pidns: Remove races by stopping the caching of proc_mnt Eric W. Biederman
2010-07-09 22:13                                               ` Serge E. Hallyn
2010-07-11 14:14                                               ` Louis Rilling
2010-07-11 14:25                                                 ` Eric W. Biederman
2010-07-12 18:09                                                 ` [PATCH] pidns: Fix wait for zombies to be reaped in zap_pid_ns_processes Eric W. Biederman
2010-07-13 21:42                                                   ` Louis Rilling [this message]
     [not found]                                                     ` <20100713214234.GA21042-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-07-13 22:34                                                       ` Serge E. Hallyn
2010-07-13 22:34                                                     ` Serge E. Hallyn
2010-07-14  1:47                                                     ` Eric W. Biederman
     [not found]                                                       ` <m1oceakf5x.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-10-30  7:07                                                         ` Sukadev Bhattiprolu
2010-10-30  7:07                                                           ` Sukadev Bhattiprolu
2010-07-14 20:53                                                   ` Sukadev Bhattiprolu
2010-07-14 21:35                                                     ` Eric W. Biederman
2010-06-21 11:09             ` [PATCH] procfs: Do not release pid_ns->proc_mnt too early Louis Rilling
2010-06-21 11:15             ` Louis Rilling
2010-06-21 14:38               ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100713214234.GA21042@hawkmoon.kerlabs.com \
    --to=louis.rilling@kerlabs.com \
    --cc=akpm@linux-foundation.org \
    --cc=containers@lists.osdl.org \
    --cc=ebiederm@xmission.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleg@redhat.com \
    --cc=sukadev@linux.vnet.ibm.com \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.