All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oleg Nesterov <oleg@redhat.com>
To: Louis Rilling <louis.rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	Pavel Emelyanov <xemul@openvz.org>,
	Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Containers <containers@lists.osdl.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [RESEND PATCH] pid_ns: Fix proc_flush_task() accessing freed proc_mnt
Date: Thu, 24 Jun 2010 21:18:43 +0200	[thread overview]
Message-ID: <20100624191843.GA14205@redhat.com> (raw)
In-Reply-To: <1277399329-18087-1-git-send-email-louis.rilling@kerlabs.com>

[-- Attachment #1: Type: text/plain, Size: 2039 bytes --]

On 06/24, Louis Rilling wrote:
>
> [ Resending because of buggy quotes in Eric's address. Sorry for the noise. ]
>
> On 06/19, Oleg Nesterov wrote:
> > And the last one on top of this series, before I go away from this
> > thread ;)
> >
> > Since my further fixes were not approved, I think at least it makes
> > sense to cleanup pid_ns_release_proc() logic a bit.
>
> It's completely untested and could be split into three patches. But I think that
> it solves the issues found so far, and that it will work with Eric's
> unshare(CLONE_NEWPID) too.
>
> What do you think about this approach?

Oh. I shouldn't continue to participate in this discussion... I don't have
the time and my previous patch proves that my patches should be ignored ;)

But, looking at this patch,

> - defer pid_ns_release_proc()->mntput() to a worqueue context, so that
>   pid_ns_release_proc() can be called in atomic context;

OK, not good but this is what I did too,

> - introduce pid_ns->nr_pids, so that we can count the number of pids
>   allocated by alloc_pidmap();

and this adds the extra code to alloc/free pidmap.

> - move the call to pid_ns_prepare_proc() to alloc_pid(), where we know
>   when the first pid of a namespace is allocated;

This is what I personally dislike. I do not think pid_ns_prepare_proc()
should depend on the fact that the first pid was already allocated.

And, this subjective, yes, but it looks a bit strange that upid->nr
has a reference to proc_mnt.

And of course, imho it would nice to not create the circular reference
we currently have.


Louis, Eric.

I am attaching my 2 patches (on top of cleanups) again. Could you take
a look?

Changes:

	- pid_ns_release_proc() nullifies sb->s_fs_info

	- proc_pid_lookup() and proc_self_readlink() check ns != NULL
	  (this is sb->s_fs_info)

I even tried to test this finally, seems to work.

I am not going to argue if you prefer Louis's approach. But I will appreciate
if you explain why my fix is wrong. I am curious because I spent 3 hours doing
grep fs/proc ;)

Oleg.

[-- Attachment #2: PNS_5_DESTROY_FROM_WQ.patch --]
[-- Type: text/plain, Size: 2582 bytes --]

[PATCH 1/2] pid_ns: move destroy_pid_namespace() into workqueue context

A separate patch to simplify the review of the next change.

Move destroy_pid_namespace() into workqueue context. This allows us to do
mntput() from free_pid_ns() paths, see the next patch.

Add the new member, "struct work_struct destroy" into struct pid_namespace
and change free_pid_ns() to call destroy_pid_namespace() via schedule_work().

The patch looks a bit complicated because it also moves copy_pid_ns() up.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/pid_namespace.h |    1 +
 kernel/pid_namespace.c        |   25 +++++++++++++++++--------
 2 files changed, 18 insertions(+), 8 deletions(-)

--- 35-rc3/include/linux/pid_namespace.h~PNS_5_DESTROY_FROM_WQ	2010-06-23 22:06:07.000000000 +0200
+++ 35-rc3/include/linux/pid_namespace.h	2010-06-24 15:17:46.000000000 +0200
@@ -24,6 +24,7 @@ struct pid_namespace {
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
 	struct pid_namespace *parent;
+	struct work_struct destroy;
 #ifdef CONFIG_PROC_FS
 	struct vfsmount *proc_mnt;
 #endif
--- 35-rc3/kernel/pid_namespace.c~PNS_5_DESTROY_FROM_WQ	2010-06-24 15:16:22.000000000 +0200
+++ 35-rc3/kernel/pid_namespace.c	2010-06-24 15:17:46.000000000 +0200
@@ -114,6 +114,16 @@ out:
 	return ERR_PTR(err);
 }
 
+struct pid_namespace *copy_pid_ns(unsigned long flags,
+					struct pid_namespace *old_ns)
+{
+	if (!(flags & CLONE_NEWPID))
+		return get_pid_ns(old_ns);
+	if (flags & (CLONE_THREAD|CLONE_PARENT))
+		return ERR_PTR(-EINVAL);
+	return create_pid_namespace(old_ns);
+}
+
 static void destroy_pid_namespace(struct pid_namespace *ns)
 {
 	int i;
@@ -123,13 +133,11 @@ static void destroy_pid_namespace(struct
 	kmem_cache_free(pid_ns_cachep, ns);
 }
 
-struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+static void destroy_pid_namespace_work(struct work_struct *work)
 {
-	if (!(flags & CLONE_NEWPID))
-		return get_pid_ns(old_ns);
-	if (flags & (CLONE_THREAD|CLONE_PARENT))
-		return ERR_PTR(-EINVAL);
-	return create_pid_namespace(old_ns);
+	struct pid_namespace *ns =
+			container_of(work, struct pid_namespace, destroy);
+	destroy_pid_namespace(ns);
 }
 
 void free_pid_ns(struct kref *kref)
@@ -137,9 +145,10 @@ void free_pid_ns(struct kref *kref)
 	struct pid_namespace *ns, *parent;
 
 	ns = container_of(kref, struct pid_namespace, kref);
-
 	parent = ns->parent;
-	destroy_pid_namespace(ns);
+
+	INIT_WORK(&ns->destroy, destroy_pid_namespace_work);
+	schedule_work(&ns->destroy);
 
 	if (parent != NULL)
 		put_pid_ns(parent);

[-- Attachment #3: PNS_6_BREAK_CIRCLE.patch --]
[-- Type: text/plain, Size: 4825 bytes --]

[PATCH 2/2] pid_ns: refactor the buggy pid_ns_release_proc() logic

pid_namespace holds ns->proc_mnt, while this vfsmount has a referene to
the namespace via PROC_I(sb->s_root->d_inode)->pid. To break this circle
/sbin/init does mntput() in pid_ns_release_proc(). See 6f4e6433.

But we have the following problems:

	- Nobody does mntput() if copy_process() fails after
	  pid_ns_prepare_proc().

	- proc_flush_task() checks upid->nr == 1 to verify we are init,
	  this is wrong if a multi-threaded init does exec.

	- As Louis pointed out, this namespace can have the detached
	  EXIT_DEAD tasks which can use ns->proc_mnt after this mntput().

With this patch only pid_namespace has a reference to ns->proc_mnt, and
mntput(ns->proc_mnt) is called by destroy_pid_namespace() paths when we
know that this ns must not have any references (in particular, there are
no pids in this namespace).

Changes:

	- kill proc_flush_task()->pid_ns_release_proc()

	- change fs/proc/root.c so that we don't create the "artificial"
	  references to the namespace or its pid==1.

	- change destroy_pid_namespace() to call pid_ns_release_proc().

	- change pid_ns_release_proc() to clear s_root->d_inode->pid and
	  sb->s_fs_info. The caller is destroy_pid_namespace(), both pid
	  and ns must not have any reference.

	- change proc_self_readlink() and proc_pid_lookup() to check
	  sb->s_fs_info != NULL to detect the case when the proc fs is
	  kept mounted after pid_ns_release_proc().

Reported-by:  Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/pid_namespace.c |    2 ++
 fs/proc/base.c         |   20 ++++++++++++--------
 fs/proc/root.c         |   11 +++++++----
 3 files changed, 21 insertions(+), 12 deletions(-)

--- 35-rc3/kernel/pid_namespace.c~PNS_6_BREAK_CIRCLE	2010-06-24 15:17:46.000000000 +0200
+++ 35-rc3/kernel/pid_namespace.c	2010-06-24 20:48:18.000000000 +0200
@@ -128,6 +128,8 @@ static void destroy_pid_namespace(struct
 {
 	int i;
 
+	pid_ns_release_proc(ns);
+
 	for (i = 0; i < PIDMAP_ENTRIES; i++)
 		kfree(ns->pidmap[i].page);
 	kmem_cache_free(pid_ns_cachep, ns);
--- 35-rc3/fs/proc/base.c~PNS_6_BREAK_CIRCLE	2010-06-24 15:16:21.000000000 +0200
+++ 35-rc3/fs/proc/base.c	2010-06-24 20:48:18.000000000 +0200
@@ -2349,11 +2349,17 @@ static const struct file_operations proc
 /*
  * /proc/self:
  */
+
+static inline pid_t self_tgid(struct dentry *dentry)
+{
+	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
+	return ns ? task_tgid_nr_ns(current, ns) : 0;
+}
+
 static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
 			      int buflen)
 {
-	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
-	pid_t tgid = task_tgid_nr_ns(current, ns);
+	pid_t tgid = self_tgid(dentry);
 	char tmp[PROC_NUMBUF];
 	if (!tgid)
 		return -ENOENT;
@@ -2363,8 +2369,7 @@ static int proc_self_readlink(struct den
 
 static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
-	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
-	pid_t tgid = task_tgid_nr_ns(current, ns);
+	pid_t tgid = self_tgid(dentry);
 	char *name = ERR_PTR(-ENOENT);
 	if (tgid) {
 		name = __getname();
@@ -2745,10 +2750,6 @@ void proc_flush_task(struct task_struct 
 		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
 					tgid->numbers[i].nr);
 	}
-
-	upid = &pid->numbers[pid->level];
-	if (upid->nr == 1)
-		pid_ns_release_proc(upid->ns);
 }
 
 static struct dentry *proc_pid_instantiate(struct inode *dir,
@@ -2796,6 +2797,9 @@ struct dentry *proc_pid_lookup(struct in
 		goto out;
 
 	ns = dentry->d_sb->s_fs_info;
+	if (!ns)
+		goto out;
+
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
 	if (task)
--- 35-rc3/fs/proc/root.c~PNS_6_BREAK_CIRCLE	2010-06-23 22:06:01.000000000 +0200
+++ 35-rc3/fs/proc/root.c	2010-06-24 20:48:18.000000000 +0200
@@ -31,7 +31,7 @@ static int proc_set_super(struct super_b
 	struct pid_namespace *ns;
 
 	ns = (struct pid_namespace *)data;
-	sb->s_fs_info = get_pid_ns(ns);
+	sb->s_fs_info = ns;
 	return set_anon_super(sb, NULL);
 }
 
@@ -68,7 +68,7 @@ static int proc_get_sb(struct file_syste
 		struct proc_inode *ei = PROC_I(sb->s_root->d_inode);
 		if (!ei->pid) {
 			rcu_read_lock();
-			ei->pid = get_pid(find_pid_ns(1, ns));
+			ei->pid = find_pid_ns(1, ns);
 			rcu_read_unlock();
 		}
 	}
@@ -83,7 +83,6 @@ static void proc_kill_sb(struct super_bl
 
 	ns = (struct pid_namespace *)sb->s_fs_info;
 	kill_anon_super(sb);
-	put_pid_ns(ns);
 }
 
 static struct file_system_type proc_fs_type = {
@@ -209,5 +208,9 @@ int pid_ns_prepare_proc(struct pid_names
 
 void pid_ns_release_proc(struct pid_namespace *ns)
 {
-	mntput(ns->proc_mnt);
+	if (ns->proc_mnt) {
+		ns->proc_mnt->mnt_sb->s_fs_info = NULL;
+		PROC_I(ns->proc_mnt->mnt_sb->s_root->d_inode)->pid = NULL;
+		mntput(ns->proc_mnt);
+	}
 }

  reply	other threads:[~2010-06-24 19:18 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-16 16:34 [PATCH] procfs: Do not release pid_ns->proc_mnt too early Louis Rilling
     [not found] ` <1276706068-18567-1-git-send-email-louis.rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
2010-06-17  9:53   ` Pavel Emelyanov
2010-06-17  9:53     ` Pavel Emelyanov
2010-06-17 13:41     ` Eric W. Biederman
2010-06-17 14:20       ` Louis Rilling
2010-06-17 21:36       ` Oleg Nesterov
2010-06-18  8:27         ` Louis Rilling
2010-06-18 16:27           ` Oleg Nesterov
2010-06-21 11:11             ` Louis Rilling
2010-06-21 12:58               ` Eric W. Biederman
2010-06-21 14:15                 ` Louis Rilling
2010-06-21 14:26                   ` Eric W. Biederman
     [not found]           ` <20100618082738.GE16877-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-06-18 16:27             ` Oleg Nesterov
2010-06-17 21:20 ` Oleg Nesterov
2010-06-18  8:20   ` Louis Rilling
2010-06-18 11:15     ` Oleg Nesterov
2010-06-18 16:08       ` Oleg Nesterov
2010-06-18 16:08         ` Oleg Nesterov
2010-06-18 17:33         ` Louis Rilling
2010-06-18 17:55           ` Oleg Nesterov
2010-06-18 17:55             ` Oleg Nesterov
2010-06-18 21:23             ` Oleg Nesterov
2010-06-18 21:23               ` Oleg Nesterov
2010-06-19 19:08               ` [PATCH 0/4] pid_ns_prepare_proc/unshare cleanups Oleg Nesterov
2010-06-19 19:09                 ` [PATCH 1/4] procfs: proc_get_sb: consolidate/cleanup root_inode->pid logic Oleg Nesterov
2010-06-19 19:10                 ` [PATCH 2/4] procfs: kill the global proc_mnt variable Oleg Nesterov
2010-06-19 19:10                 ` [PATCH 3/4] procfs: move pid_ns_prepare_proc() from copy_process() to create_pid_namespace() Oleg Nesterov
2010-06-19 19:11                 ` [PATCH RESEND 4/4] sys_unshare: simplify the not-really-implemented CLONE_THREAD/SIGHAND/VM code Oleg Nesterov
2010-06-20  8:42                 ` [PATCH 0/6] Unshare support for the pid namespace Eric W. Biederman
2010-06-20  8:44                   ` [PATCH 1/6] pid: Remove the child_reaper special case in init/main.c Eric W. Biederman
     [not found]                     ` <m1ljaaqejm.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-06-20 18:29                       ` Oleg Nesterov
2010-06-20 18:29                         ` Oleg Nesterov
2010-06-20 20:27                         ` Oleg Nesterov
2010-06-20  8:45                   ` [PATCH 2/6] pidns: Call pid_ns_prepare_proc from create_pid_namespace Eric W. Biederman
     [not found]                     ` <m1hbkyqeib.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-06-20 18:19                       ` Oleg Nesterov
2010-06-20 18:19                         ` Oleg Nesterov
2010-06-20  8:45                   ` [PATCH 3/6] procfs: kill the global proc_mnt variable Eric W. Biederman
2010-06-20  8:47                   ` [PATCH 4/6] pidns: Don't allow new pids after the namespace is dead Eric W. Biederman
2010-06-20 18:44                     ` Oleg Nesterov
2010-06-20  8:48                   ` [PATCH 5/6] pidns: Use task_active_pid_ns where appropriate Eric W. Biederman
2010-06-20  8:49                   ` [PATCH 6/6] pidns: Support unsharing the pid namespace Eric W. Biederman
2010-06-20 20:14                     ` Oleg Nesterov
2010-06-20 20:42                       ` Oleg Nesterov
2010-06-21  1:53                       ` Eric W. Biederman
2010-06-20 18:03                   ` [PATCH 0/6] Unshare support for " Oleg Nesterov
2010-06-20 18:05                     ` [PATCH 0/2] pid_ns_release_proc() fixes Oleg Nesterov
2010-06-20 18:06                       ` [PATCH 1/2] pid_ns: move destroy_pid_namespace() into workqueue context Oleg Nesterov
2010-06-20 18:06                       ` [PATCH 2/2] pid_ns: refactor the buggy pid_ns_release_proc() logic Oleg Nesterov
2010-06-20 21:00                     ` [PATCH 0/6] Unshare support for the pid namespace Eric W. Biederman
2010-06-20 21:48                       ` Oleg Nesterov
     [not found]                       ` <m14ogxctd6.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-06-20 21:56                         ` Oleg Nesterov
2010-06-20 21:56                           ` Oleg Nesterov
2011-01-26 15:57                   ` Daniel Lezcano
2010-06-23 20:36                 ` [PATCH 0/1] pid_ns: move pid_ns_release_proc() from proc_flush_task() to zap_pid_ns_processes() Oleg Nesterov
     [not found]                   ` <20100623203652.GA25298-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-06-23 20:37                     ` [PATCH 1/1] " Oleg Nesterov
2010-06-23 20:37                       ` Oleg Nesterov
2010-06-24  6:36                       ` Sukadev Bhattiprolu
2010-06-24 12:59                         ` Oleg Nesterov
2010-06-24  7:06                       ` Eric W. Biederman
2010-06-24 13:01                         ` Oleg Nesterov
2010-06-24  8:37                   ` [PATCH] pid_ns: Fix proc_flush_task() accessing freed proc_mnt Louis Rilling
2010-06-24 17:08                   ` [RESEND PATCH] " Louis Rilling
2010-06-24 19:18                     ` Oleg Nesterov [this message]
2010-06-25 10:23                       ` Louis Rilling
     [not found]                         ` <20100625102303.GG3773-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-06-25 12:21                           ` Oleg Nesterov
2010-06-25 12:21                             ` Oleg Nesterov
2010-06-25 18:37                           ` Sukadev Bhattiprolu
2010-06-25 18:37                         ` Sukadev Bhattiprolu
2010-06-25 19:29                           ` Oleg Nesterov
2010-06-25 21:26                             ` Sukadev Bhattiprolu
2010-06-25 21:27                               ` Oleg Nesterov
2010-06-25 22:07                                 ` Sukadev Bhattiprolu
2010-07-09  4:36                                   ` [RFC][PATCH 1/2] pidns: Add a flag to indicate a pid namespace is dead Eric W. Biederman
2010-07-09  4:39                                     ` [RFC][PATCH 2/2] pidns: Remove proc flush races when a pid namespaces are exiting Eric W. Biederman
2010-07-09 12:14                                       ` Louis Rilling
2010-07-09 13:05                                         ` Eric W. Biederman
2010-07-09 14:13                                           ` Louis Rilling
2010-07-09 15:58                                             ` [PATCH 01/24] pidns: Remove races by stopping the caching of proc_mnt Eric W. Biederman
2010-07-09 22:13                                               ` Serge E. Hallyn
2010-07-11 14:14                                               ` Louis Rilling
2010-07-11 14:25                                                 ` Eric W. Biederman
2010-07-12 18:09                                                 ` [PATCH] pidns: Fix wait for zombies to be reaped in zap_pid_ns_processes Eric W. Biederman
2010-07-13 21:42                                                   ` Louis Rilling
     [not found]                                                     ` <20100713214234.GA21042-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-07-13 22:34                                                       ` Serge E. Hallyn
2010-07-13 22:34                                                     ` Serge E. Hallyn
2010-07-14  1:47                                                     ` Eric W. Biederman
     [not found]                                                       ` <m1oceakf5x.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-10-30  7:07                                                         ` Sukadev Bhattiprolu
2010-10-30  7:07                                                           ` Sukadev Bhattiprolu
2010-07-14 20:53                                                   ` Sukadev Bhattiprolu
2010-07-14 21:35                                                     ` Eric W. Biederman
2010-06-21 11:09             ` [PATCH] procfs: Do not release pid_ns->proc_mnt too early Louis Rilling
2010-06-21 11:15             ` Louis Rilling
2010-06-21 14:38               ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100624191843.GA14205@redhat.com \
    --to=oleg@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=containers@lists.osdl.org \
    --cc=ebiederm@xmission.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=louis.rilling@kerlabs.com \
    --cc=sukadev@linux.vnet.ibm.com \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.