[REVIEW][PATCH 0/11] pid namespace cleanups and enhancements

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [REVIEW][PATCH 0/11] pid namespace cleanups and enhancements
@ 2012-11-16 16:32 Eric W. Biederman
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:32 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge E. Hallyn, Gao feng, Andrew Morton


This patchset is my pile of pid namespace patches that I have been
sitting on for entirely too long.  I have been running and testing these
changes for a while but if anyone sees any problems please let me know.

Feature wise this patchset adds unshare and setns support for the pid
namespace.

Cleanup wise this patchset adds an explicit count of how many pids are
hashed in a pid namespace and uses that count to trigger the unmounting
of the internal kernel mount of proc.  The current scheme is buggy and
entirely too clever to continue living.

Some proc bits that were added to support the pid namespace initially
are removed, as they are no no longer necessary.

These patches are also available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git pidns-v73

Since some of this work is closely allied with the user namespace bits I
have pending I intend to merge these changes through my user namespace
tree.

Eric W. Biederman (11):
      procfs: Use the proc generic infrastructure for proc/self.
      procfs: Don't cache a pid in the root inode.
      pidns: Capture the user namespace and filter ns_last_pid
      pidns: Use task_active_pid_ns where appropriate
      pidns: Make the pidns proc mount/umount logic obvious.
      pidns: Don't allow new processes in a dead pid namespace.
      pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
      pidns: Deny strange cases when creating pid namespaces.
      pidns: Add setns support
      pidns: Consolidate initialzation of special init task state
      pidns: Support unsharing the pid namespace.

 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 drivers/staging/android/binder.c          |    3 +-
 fs/hppfs/hppfs.c                          |    2 +-
 fs/proc/Makefile                          |    1 +
 fs/proc/base.c                            |  169 +----------------------------
 fs/proc/internal.h                        |    1 +
 fs/proc/namespaces.c                      |    3 +
 fs/proc/root.c                            |   16 +---
 fs/proc/self.c                            |   59 ++++++++++
 include/linux/pid_namespace.h             |   10 ++-
 include/linux/proc_fs.h                   |    1 +
 init/main.c                               |    1 -
 kernel/cgroup.c                           |    2 +-
 kernel/events/core.c                      |    2 +-
 kernel/exit.c                             |   12 --
 kernel/fork.c                             |   42 +++++---
 kernel/nsproxy.c                          |    4 +-
 kernel/pid.c                              |   46 +++++++--
 kernel/pid_namespace.c                    |   99 +++++++++++++----
 kernel/signal.c                           |    2 +-
 kernel/sysctl_binary.c                    |    2 +-
 22 files changed, 231 insertions(+), 250 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self.
  2012-11-16 16:32 [REVIEW][PATCH 0/11] pid namespace cleanups and enhancements Eric W. Biederman
@ 2012-11-16 16:35 ` Eric W. Biederman
  2012-11-16 16:35   ` [PATCH 02/11] procfs: Don't cache a pid in the root inode Eric W. Biederman
                     ` (9 more replies)
  0 siblings, 10 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

I had visions at one point of splitting proc into two filesystems.  If
that had happened proc/self being the the part of proc that actually deals
with pids would have been a nice cleanup.  As it is proc/self requires
a lot of unnecessary infrastructure for a single file.

The only user visible change is that a mounted /proc for a pid namespace
that is dead now shows a broken proc symlink, instead of being completely
invisible.  I don't think anyone will notice or care.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/Makefile   |    1 +
 fs/proc/base.c     |  154 +---------------------------------------------------
 fs/proc/internal.h |    1 +
 fs/proc/root.c     |    1 +
 fs/proc/self.c     |   59 ++++++++++++++++++++
 5 files changed, 64 insertions(+), 152 deletions(-)
 create mode 100644 fs/proc/self.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 99349ef..981b056 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -21,6 +21,7 @@ proc-y	+= uptime.o
 proc-y	+= version.o
 proc-y	+= softirqs.o
 proc-y	+= namespaces.o
+proc-y	+= self.o
 proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
 proc-$(CONFIG_NET)		+= proc_net.o
 proc-$(CONFIG_PROC_KCORE)	+= kcore.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 144a967..cbe454e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2237,146 +2237,6 @@ static const struct file_operations proc_coredump_filter_operations = {
 };
 #endif
 
-/*
- * /proc/self:
- */
-static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
-			      int buflen)
-{
-	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
-	pid_t tgid = task_tgid_nr_ns(current, ns);
-	char tmp[PROC_NUMBUF];
-	if (!tgid)
-		return -ENOENT;
-	sprintf(tmp, "%d", tgid);
-	return vfs_readlink(dentry,buffer,buflen,tmp);
-}
-
-static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
-{
-	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
-	pid_t tgid = task_tgid_nr_ns(current, ns);
-	char *name = ERR_PTR(-ENOENT);
-	if (tgid) {
-		/* 11 for max length of signed int in decimal + NULL term */
-		name = kmalloc(12, GFP_KERNEL);
-		if (!name)
-			name = ERR_PTR(-ENOMEM);
-		else
-			sprintf(name, "%d", tgid);
-	}
-	nd_set_link(nd, name);
-	return NULL;
-}
-
-static void proc_self_put_link(struct dentry *dentry, struct nameidata *nd,
-				void *cookie)
-{
-	char *s = nd_get_link(nd);
-	if (!IS_ERR(s))
-		kfree(s);
-}
-
-static const struct inode_operations proc_self_inode_operations = {
-	.readlink	= proc_self_readlink,
-	.follow_link	= proc_self_follow_link,
-	.put_link	= proc_self_put_link,
-};
-
-/*
- * proc base
- *
- * These are the directory entries in the root directory of /proc
- * that properly belong to the /proc filesystem, as they describe
- * describe something that is process related.
- */
-static const struct pid_entry proc_base_stuff[] = {
-	NOD("self", S_IFLNK|S_IRWXUGO,
-		&proc_self_inode_operations, NULL, {}),
-};
-
-static struct dentry *proc_base_instantiate(struct inode *dir,
-	struct dentry *dentry, struct task_struct *task, const void *ptr)
-{
-	const struct pid_entry *p = ptr;
-	struct inode *inode;
-	struct proc_inode *ei;
-	struct dentry *error;
-
-	/* Allocate the inode */
-	error = ERR_PTR(-ENOMEM);
-	inode = new_inode(dir->i_sb);
-	if (!inode)
-		goto out;
-
-	/* Initialize the inode */
-	ei = PROC_I(inode);
-	inode->i_ino = get_next_ino();
-	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
-
-	/*
-	 * grab the reference to the task.
-	 */
-	ei->pid = get_task_pid(task, PIDTYPE_PID);
-	if (!ei->pid)
-		goto out_iput;
-
-	inode->i_mode = p->mode;
-	if (S_ISDIR(inode->i_mode))
-		set_nlink(inode, 2);
-	if (S_ISLNK(inode->i_mode))
-		inode->i_size = 64;
-	if (p->iop)
-		inode->i_op = p->iop;
-	if (p->fop)
-		inode->i_fop = p->fop;
-	ei->op = p->op;
-	d_add(dentry, inode);
-	error = NULL;
-out:
-	return error;
-out_iput:
-	iput(inode);
-	goto out;
-}
-
-static struct dentry *proc_base_lookup(struct inode *dir, struct dentry *dentry)
-{
-	struct dentry *error;
-	struct task_struct *task = get_proc_task(dir);
-	const struct pid_entry *p, *last;
-
-	error = ERR_PTR(-ENOENT);
-
-	if (!task)
-		goto out_no_task;
-
-	/* Lookup the directory entry */
-	last = &proc_base_stuff[ARRAY_SIZE(proc_base_stuff) - 1];
-	for (p = proc_base_stuff; p <= last; p++) {
-		if (p->len != dentry->d_name.len)
-			continue;
-		if (!memcmp(dentry->d_name.name, p->name, p->len))
-			break;
-	}
-	if (p > last)
-		goto out;
-
-	error = proc_base_instantiate(dir, dentry, task, p);
-
-out:
-	put_task_struct(task);
-out_no_task:
-	return error;
-}
-
-static int proc_base_fill_cache(struct file *filp, void *dirent,
-	filldir_t filldir, struct task_struct *task, const struct pid_entry *p)
-{
-	return proc_fill_cache(filp, dirent, filldir, p->name, p->len,
-				proc_base_instantiate, task, p);
-}
-
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 static int do_io_accounting(struct task_struct *task, char *buffer, int whole)
 {
@@ -2767,15 +2627,11 @@ out:
 
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsigned int flags)
 {
-	struct dentry *result;
+	struct dentry *result = NULL;
 	struct task_struct *task;
 	unsigned tgid;
 	struct pid_namespace *ns;
 
-	result = proc_base_lookup(dir, dentry);
-	if (!IS_ERR(result) || PTR_ERR(result) != -ENOENT)
-		goto out;
-
 	tgid = name_to_int(dentry);
 	if (tgid == ~0U)
 		goto out;
@@ -2838,7 +2694,7 @@ retry:
 	return iter;
 }
 
-#define TGID_OFFSET (FIRST_PROCESS_ENTRY + ARRAY_SIZE(proc_base_stuff))
+#define TGID_OFFSET (FIRST_PROCESS_ENTRY)
 
 static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
 	struct tgid_iter iter)
@@ -2872,12 +2728,6 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 	if (!reaper)
 		goto out_no_task;
 
-	for (; nr < ARRAY_SIZE(proc_base_stuff); filp->f_pos++, nr++) {
-		const struct pid_entry *p = &proc_base_stuff[nr];
-		if (proc_base_fill_cache(filp, dirent, filldir, reaper, p) < 0)
-			goto out;
-	}
-
 	ns = filp->f_dentry->d_sb->s_fs_info;
 	iter.task = NULL;
 	iter.tgid = filp->f_pos - TGID_OFFSET;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 43973b0..252544c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -15,6 +15,7 @@ struct  ctl_table_header;
 struct  mempolicy;
 
 extern struct proc_dir_entry proc_root;
+extern void proc_self_init(void);
 #ifdef CONFIG_PROC_SYSCTL
 extern int proc_sys_init(void);
 extern void sysctl_head_put(struct ctl_table_header *head);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 9889a92..5da9849 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -169,6 +169,7 @@ void __init proc_root_init(void)
 		return;
 	}
 
+	proc_self_init();
 	proc_symlink("mounts", NULL, "self/mounts");
 
 	proc_net_init();
diff --git a/fs/proc/self.c b/fs/proc/self.c
new file mode 100644
index 0000000..aa5cc3b
--- /dev/null
+++ b/fs/proc/self.c
@@ -0,0 +1,59 @@
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+#include <linux/namei.h>
+
+/*
+ * /proc/self:
+ */
+static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
+			      int buflen)
+{
+	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
+	pid_t tgid = task_tgid_nr_ns(current, ns);
+	char tmp[PROC_NUMBUF];
+	if (!tgid)
+		return -ENOENT;
+	sprintf(tmp, "%d", tgid);
+	return vfs_readlink(dentry,buffer,buflen,tmp);
+}
+
+static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	struct pid_namespace *ns = dentry->d_sb->s_fs_info;
+	pid_t tgid = task_tgid_nr_ns(current, ns);
+	char *name = ERR_PTR(-ENOENT);
+	if (tgid) {
+		/* 11 for max length of signed int in decimal + NULL term */
+		name = kmalloc(12, GFP_KERNEL);
+		if (!name)
+			name = ERR_PTR(-ENOMEM);
+		else
+			sprintf(name, "%d", tgid);
+	}
+	nd_set_link(nd, name);
+	return NULL;
+}
+
+static void proc_self_put_link(struct dentry *dentry, struct nameidata *nd,
+				void *cookie)
+{
+	char *s = nd_get_link(nd);
+	if (!IS_ERR(s))
+		kfree(s);
+}
+
+static const struct inode_operations proc_self_inode_operations = {
+	.readlink	= proc_self_readlink,
+	.follow_link	= proc_self_follow_link,
+	.put_link	= proc_self_put_link,
+};
+
+void __init proc_self_init(void)
+{
+	struct proc_dir_entry *proc_self_symlink;
+	mode_t mode;
+
+	mode = S_IFLNK | S_IRWXUGO;
+	proc_self_symlink = proc_create("self", mode, NULL, NULL );
+	proc_self_symlink->proc_iops = &proc_self_inode_operations;
+}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 02/11] procfs: Don't cache a pid in the root inode.
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  1:07     ` Gao feng
  2012-11-16 16:35   ` [PATCH 03/11] pidns: Capture the user namespace and filter ns_last_pid Eric W. Biederman
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

Now that we have s_fs_info pointing to our pid namespace
the original reason for the proc root inode having a struct
pid is gone.

Caching a pid in the root inode has led to some complicated
code.  Now that we don't need the struct pid, just remove it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c |   11 +----------
 fs/proc/root.c |    8 --------
 2 files changed, 1 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index cbe454e..6177fc2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2714,19 +2714,12 @@ static int fake_filldir(void *buf, const char *name, int namelen,
 /* for the /proc/ directory itself, after non-process stuff has been done */
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
-	unsigned int nr;
-	struct task_struct *reaper;
 	struct tgid_iter iter;
 	struct pid_namespace *ns;
 	filldir_t __filldir;
 
 	if (filp->f_pos >= PID_MAX_LIMIT + TGID_OFFSET)
-		goto out_no_task;
-	nr = filp->f_pos - FIRST_PROCESS_ENTRY;
-
-	reaper = get_proc_task(filp->f_path.dentry->d_inode);
-	if (!reaper)
-		goto out_no_task;
+		goto out;
 
 	ns = filp->f_dentry->d_sb->s_fs_info;
 	iter.task = NULL;
@@ -2747,8 +2740,6 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 	}
 	filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
 out:
-	put_task_struct(reaper);
-out_no_task:
 	return 0;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5da9849..13ef624 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -100,7 +100,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int err;
 	struct super_block *sb;
 	struct pid_namespace *ns;
-	struct proc_inode *ei;
 	char *options;
 
 	if (flags & MS_KERNMOUNT) {
@@ -130,13 +129,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		sb->s_flags |= MS_ACTIVE;
 	}
 
-	ei = PROC_I(sb->s_root->d_inode);
-	if (!ei->pid) {
-		rcu_read_lock();
-		ei->pid = get_pid(find_pid_ns(1, ns));
-		rcu_read_unlock();
-	}
-
 	return dget(sb->s_root);
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 03/11] pidns: Capture the user namespace and filter ns_last_pid
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
  2012-11-16 16:35   ` [PATCH 02/11] procfs: Don't cache a pid in the root inode Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  1:26     ` Gao feng
  2012-11-16 16:35   ` [PATCH 04/11] pidns: Use task_active_pid_ns where appropriate Eric W. Biederman
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

- Capture the the user namespace that creates the pid namespace
- Use that user namespace to test if it is ok to write to
  /proc/sys/kernel/ns_last_pid.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/pid_namespace.h |    8 +++++---
 kernel/nsproxy.c              |    2 +-
 kernel/pid.c                  |    1 +
 kernel/pid_namespace.c        |   16 +++++++++++-----
 4 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 65e3e87..c89c9cf 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -31,6 +31,7 @@ struct pid_namespace {
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct bsd_acct_struct *bacct;
 #endif
+	struct user_namespace *user_ns;
 	kgid_t pid_gid;
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
@@ -46,7 +47,8 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
-extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
+extern struct pid_namespace *copy_pid_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct pid_namespace *ns);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
@@ -59,8 +61,8 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
-static inline struct pid_namespace *
-copy_pid_ns(unsigned long flags, struct pid_namespace *ns)
+static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct pid_namespace *ns)
 {
 	if (flags & CLONE_NEWPID)
 		ns = ERR_PTR(-EINVAL);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b576f7f..5fe88e1 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -84,7 +84,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_ipc;
 	}
 
-	new_nsp->pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
+	new_nsp->pid_ns = copy_pid_ns(flags, task_cred_xxx(tsk, user_ns), task_active_pid_ns(tsk));
 	if (IS_ERR(new_nsp->pid_ns)) {
 		err = PTR_ERR(new_nsp->pid_ns);
 		goto out_pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index aebd4f5..2a624f1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -78,6 +78,7 @@ struct pid_namespace init_pid_ns = {
 	.last_pid = 0,
 	.level = 0,
 	.child_reaper = &init_task,
+	.user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 7b07cc0..7580aa0 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -10,6 +10,7 @@
 
 #include <linux/pid.h>
 #include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/syscalls.h>
 #include <linux/err.h>
 #include <linux/acct.h>
@@ -74,7 +75,8 @@ err_alloc:
 /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
 #define MAX_PID_NS_LEVEL 32
 
-static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_pid_ns)
+static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
+	struct pid_namespace *parent_pid_ns)
 {
 	struct pid_namespace *ns;
 	unsigned int level = parent_pid_ns->level + 1;
@@ -102,6 +104,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
 	kref_init(&ns->kref);
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
+	ns->user_ns = get_user_ns(user_ns);
 
 	set_bit(0, ns->pidmap[0].page);
 	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
@@ -117,6 +120,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
 
 out_put_parent_pid_ns:
 	put_pid_ns(parent_pid_ns);
+	put_user_ns(user_ns);
 out_free_map:
 	kfree(ns->pidmap[0].page);
 out_free:
@@ -134,13 +138,14 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
 	kmem_cache_free(pid_ns_cachep, ns);
 }
 
-struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+struct pid_namespace *copy_pid_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct pid_namespace *old_ns)
 {
 	if (!(flags & CLONE_NEWPID))
 		return get_pid_ns(old_ns);
 	if (flags & (CLONE_THREAD|CLONE_PARENT))
 		return ERR_PTR(-EINVAL);
-	return create_pid_namespace(old_ns);
+	return create_pid_namespace(user_ns, old_ns);
 }
 
 static void free_pid_ns(struct kref *kref)
@@ -239,9 +244,10 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos)
 {
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
 	struct ctl_table tmp = *table;
 
-	if (write && !capable(CAP_SYS_ADMIN))
+	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/*
@@ -250,7 +256,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	 * it should synchronize its usage with external means.
 	 */
 
-	tmp.data = &current->nsproxy->pid_ns->last_pid;
+	tmp.data = &pid_ns->last_pid;
 	return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 04/11] pidns: Use task_active_pid_ns where appropriate
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
  2012-11-16 16:35   ` [PATCH 02/11] procfs: Don't cache a pid in the root inode Eric W. Biederman
  2012-11-16 16:35   ` [PATCH 03/11] pidns: Capture the user namespace and filter ns_last_pid Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  2:02     ` Gao feng
  2012-11-16 16:35   ` [PATCH 05/11] pidns: Make the pidns proc mount/umount logic obvious Eric W. Biederman
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.

Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.

So I have used task_active_pid_ns everywhere I can.

In fork since the pid has not yet been attached to the
process I use ns_of_pid, to achieve the same effect.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 drivers/staging/android/binder.c          |    3 ++-
 fs/hppfs/hppfs.c                          |    2 +-
 fs/proc/root.c                            |    2 +-
 kernel/cgroup.c                           |    2 +-
 kernel/events/core.c                      |    2 +-
 kernel/fork.c                             |    2 +-
 kernel/nsproxy.c                          |    2 +-
 kernel/pid.c                              |    8 ++++----
 kernel/signal.c                           |    2 +-
 kernel/sysctl_binary.c                    |    2 +-
 12 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 965d381..25db92a 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1094,7 +1094,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
 		LOAD_INT(c), LOAD_FRAC(c),
 		count_active_contexts(),
 		atomic_read(&nr_spu_contexts),
-		current->nsproxy->pid_ns->last_pid);
+		task_active_pid_ns(current)->last_pid);
 	return 0;
 }
 
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index 79ccfe6..7fc71c6 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -123,7 +123,7 @@ void mconsole_log(struct mc_request *req)
 
 void mconsole_proc(struct mc_request *req)
 {
-	struct vfsmount *mnt = current->nsproxy->pid_ns->proc_mnt;
+	struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
 	char *buf;
 	int len;
 	struct file *file;
diff --git a/drivers/staging/android/binder.c b/drivers/staging/android/binder.c
index 5d4610b..a97bbcd 100644
--- a/drivers/staging/android/binder.c
+++ b/drivers/staging/android/binder.c
@@ -33,6 +33,7 @@
 #include <linux/uaccess.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/pid_namespace.h>
 
 #include "binder.h"
 
@@ -2344,7 +2345,7 @@ retry:
 		if (t->from) {
 			struct task_struct *sender = t->from->proc->tsk;
 			tr.sender_pid = task_tgid_nr_ns(sender,
-							current->nsproxy->pid_ns);
+							task_active_pid_ns(current));
 		} else {
 			tr.sender_pid = 0;
 		}
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index 78f21f8..43b315f 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -710,7 +710,7 @@ static int hppfs_fill_super(struct super_block *sb, void *d, int silent)
 	struct vfsmount *proc_mnt;
 	int err = -ENOENT;
 
-	proc_mnt = mntget(current->nsproxy->pid_ns->proc_mnt);
+	proc_mnt = mntget(task_active_pid_ns(current)->proc_mnt);
 	if (IS_ERR(proc_mnt))
 		goto out;
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 13ef624..fc16093 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -106,7 +106,7 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		ns = (struct pid_namespace *)data;
 		options = NULL;
 	} else {
-		ns = current->nsproxy->pid_ns;
+		ns = task_active_pid_ns(current);
 		options = data;
 	}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f24f724..0dbfba2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3390,7 +3390,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
 {
 	struct cgroup_pidlist *l;
 	/* don't need task_nsproxy() if we're looking at ourself */
-	struct pid_namespace *ns = current->nsproxy->pid_ns;
+	struct pid_namespace *ns = task_active_pid_ns(current);
 
 	/*
 	 * We can't drop the pidlist_mutex before taking the l->mutex in case
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dbccf83..738f356 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6155,7 +6155,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	event->parent		= parent_event;
 
-	event->ns		= get_pid_ns(current->nsproxy->pid_ns);
+	event->ns		= get_pid_ns(task_active_pid_ns(current));
 	event->id		= atomic64_inc_return(&perf_event_id);
 
 	event->state		= PERF_EVENT_STATE_INACTIVE;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..7798c24 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1442,7 +1442,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 		if (thread_group_leader(p)) {
 			if (is_child_reaper(pid))
-				p->nsproxy->pid_ns->child_reaper = p;
+				ns_of_pid(pid)->child_reaper = p;
 
 			p->signal->leader_pid = pid;
 			p->signal->tty = tty_kref_get(current->signal->tty);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 5fe88e1..cf37d52 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -84,7 +84,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_ipc;
 	}
 
-	new_nsp->pid_ns = copy_pid_ns(flags, task_cred_xxx(tsk, user_ns), task_active_pid_ns(tsk));
+	new_nsp->pid_ns = copy_pid_ns(flags, task_cred_xxx(tsk, user_ns), tsk->nsproxy->pid_ns);
 	if (IS_ERR(new_nsp->pid_ns)) {
 		err = PTR_ERR(new_nsp->pid_ns);
 		goto out_pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index 2a624f1..3a5f238 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -345,7 +345,7 @@ EXPORT_SYMBOL_GPL(find_pid_ns);
 
 struct pid *find_vpid(int nr)
 {
-	return find_pid_ns(nr, current->nsproxy->pid_ns);
+	return find_pid_ns(nr, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(find_vpid);
 
@@ -429,7 +429,7 @@ struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 
 struct task_struct *find_task_by_vpid(pid_t vnr)
 {
-	return find_task_by_pid_ns(vnr, current->nsproxy->pid_ns);
+	return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
 }
 
 struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
@@ -484,7 +484,7 @@ EXPORT_SYMBOL_GPL(pid_nr_ns);
 
 pid_t pid_vnr(struct pid *pid)
 {
-	return pid_nr_ns(pid, current->nsproxy->pid_ns);
+	return pid_nr_ns(pid, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(pid_vnr);
 
@@ -495,7 +495,7 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 
 	rcu_read_lock();
 	if (!ns)
-		ns = current->nsproxy->pid_ns;
+		ns = task_active_pid_ns(current);
 	if (likely(pid_alive(task))) {
 		if (type != PIDTYPE_PID)
 			task = task->group_leader;
diff --git a/kernel/signal.c b/kernel/signal.c
index 0af8868..b2445d8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1752,7 +1752,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	 * see comment in do_notify_parent() about the following 4 lines
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
+	info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
 	info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns), task_uid(tsk));
 	rcu_read_unlock();
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 65bdcf1..5a63844 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -1344,7 +1344,7 @@ static ssize_t binary_sysctl(const int *name, int nlen,
 		goto out_putname;
 	}
 
-	mnt = current->nsproxy->pid_ns->proc_mnt;
+	mnt = task_active_pid_ns(current)->proc_mnt;
 	file = file_open_root(mnt->mnt_root, mnt, pathname, flags);
 	result = PTR_ERR(file);
 	if (IS_ERR(file))
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 05/11] pidns: Make the pidns proc mount/umount logic obvious.
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (2 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 04/11] pidns: Use task_active_pid_ns where appropriate Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-19 11:02     ` Gao feng
  2012-11-16 16:35   ` [PATCH 06/11] pidns: Don't allow new processes in a dead pid namespace Eric W. Biederman
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

Track the number of pids in the proc hash table.  When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move the mount of proc into alloc_pid when we allocate the pid for
init.

Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task.  Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.

Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue.  This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.

In the process of making the code clear and obvious this fixes a bug
reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/base.c                |    4 ----
 fs/proc/root.c                |    5 -----
 include/linux/pid_namespace.h |    2 ++
 kernel/fork.c                 |    2 --
 kernel/pid.c                  |   21 +++++++++++++++++----
 kernel/pid_namespace.c        |   14 +++++++-------
 6 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6177fc2..7621dc5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2590,10 +2590,6 @@ void proc_flush_task(struct task_struct *task)
 		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
 					tgid->numbers[i].nr);
 	}
-
-	upid = &pid->numbers[pid->level];
-	if (upid->nr == 1)
-		pid_ns_release_proc(upid->ns);
 }
 
 static struct dentry *proc_pid_instantiate(struct inode *dir,
diff --git a/fs/proc/root.c b/fs/proc/root.c
index fc16093..f2f2511 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -155,11 +155,6 @@ void __init proc_root_init(void)
 	err = register_filesystem(&proc_fs_type);
 	if (err)
 		return;
-	err = pid_ns_prepare_proc(&init_pid_ns);
-	if (err) {
-		unregister_filesystem(&proc_fs_type);
-		return;
-	}
 
 	proc_self_init();
 	proc_symlink("mounts", NULL, "self/mounts");
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c89c9cf..4c96acd 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -21,6 +21,7 @@ struct pid_namespace {
 	struct kref kref;
 	struct pidmap pidmap[PIDMAP_ENTRIES];
 	int last_pid;
+	int nr_hashed;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
@@ -32,6 +33,7 @@ struct pid_namespace {
 	struct bsd_acct_struct *bacct;
 #endif
 	struct user_namespace *user_ns;
+	struct work_struct proc_work;
 	kgid_t pid_gid;
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7798c24..666dc8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1476,8 +1476,6 @@ bad_fork_cleanup_io:
 	if (p->io_context)
 		exit_io_context(p);
 bad_fork_cleanup_namespaces:
-	if (unlikely(clone_flags & CLONE_NEWPID))
-		pid_ns_release_proc(p->nsproxy->pid_ns);
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
 	if (p->mm)
diff --git a/kernel/pid.c b/kernel/pid.c
index 3a5f238..e957f8b 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -36,6 +36,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/init_task.h>
 #include <linux/syscalls.h>
+#include <linux/proc_fs.h>
 
 #define pid_hashfn(nr, ns)	\
 	hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
@@ -270,8 +271,12 @@ void free_pid(struct pid *pid)
 	unsigned long flags;
 
 	spin_lock_irqsave(&pidmap_lock, flags);
-	for (i = 0; i <= pid->level; i++)
-		hlist_del_rcu(&pid->numbers[i].pid_chain);
+	for (i = 0; i <= pid->level; i++) {
+		struct upid *upid = pid->numbers + i;
+		hlist_del_rcu(&upid->pid_chain);
+		if (--upid->ns->nr_hashed == 0)
+			schedule_work(&upid->ns->proc_work);
+	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
 	for (i = 0; i <= pid->level; i++)
@@ -293,6 +298,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 		goto out;
 
 	tmp = ns;
+	pid->level = ns->level;
 	for (i = ns->level; i >= 0; i--) {
 		nr = alloc_pidmap(tmp);
 		if (nr < 0)
@@ -303,17 +309,23 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 		tmp = tmp->parent;
 	}
 
+	if (unlikely(is_child_reaper(pid))) {
+		if (pid_ns_prepare_proc(ns))
+			goto out_free;
+	}
+
 	get_pid_ns(ns);
-	pid->level = ns->level;
 	atomic_set(&pid->count, 1);
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-	for ( ; upid >= pid->numbers; --upid)
+	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
 				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
+		upid->ns->nr_hashed++;
+	}
 	spin_unlock_irq(&pidmap_lock);
 
 out:
@@ -570,6 +582,7 @@ void __init pidmap_init(void)
 	/* Reserve PID 0. We never call free_pidmap(0) */
 	set_bit(0, init_pid_ns.pidmap[0].page);
 	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
+	init_pid_ns.nr_hashed = 1;
 
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 7580aa0..cc44db7 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -72,6 +72,12 @@ err_alloc:
 	return NULL;
 }
 
+static void proc_cleanup_work(struct work_struct *work)
+{
+	struct pid_namespace *ns = container_of(work, struct pid_namespace, proc_work);
+	pid_ns_release_proc(ns);
+}
+
 /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
 #define MAX_PID_NS_LEVEL 32
 
@@ -105,6 +111,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
 	ns->user_ns = get_user_ns(user_ns);
+	INIT_WORK(&ns->proc_work, proc_cleanup_work);
 
 	set_bit(0, ns->pidmap[0].page);
 	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
@@ -112,15 +119,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
-	err = pid_ns_prepare_proc(ns);
-	if (err)
-		goto out_put_parent_pid_ns;
-
 	return ns;
 
-out_put_parent_pid_ns:
-	put_pid_ns(parent_pid_ns);
-	put_user_ns(user_ns);
 out_free_map:
 	kfree(ns->pidmap[0].page);
 out_free:
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 06/11] pidns: Don't allow new processes in a dead pid namespace.
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (3 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 05/11] pidns: Make the pidns proc mount/umount logic obvious Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  2:17     ` Gao feng
  2012-11-16 16:35   ` [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 Eric W. Biederman
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

Set nr_hashed to -1 just before we schedule the work to cleanup proc.
Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
fail.

This guaranteees that processes never enter a pid namespaces after we
have cleaned up the state to support processes in a pid namespace.

Currently sending SIGKILL to all of the process in a pid namespace as
init exists gives us this guarantee but we need something a little
stronger to support unsharing and joining a pid namespace.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/pid.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e957f8b..9c21911 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -274,8 +274,10 @@ void free_pid(struct pid *pid)
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
 		hlist_del_rcu(&upid->pid_chain);
-		if (--upid->ns->nr_hashed == 0)
+		if (--upid->ns->nr_hashed == 0) {
+			upid->ns->nr_hashed = -1;
 			schedule_work(&upid->ns->proc_work);
+		}
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
@@ -321,6 +323,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
+	if (ns->nr_hashed < 0)
+		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
 				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
@@ -331,6 +335,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 out:
 	return pid;
 
+out_unlock:
+	spin_unlock(&pidmap_lock);
 out_free:
 	while (++i <= ns->level)
 		free_pidmap(pid->numbers + i);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (4 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 06/11] pidns: Don't allow new processes in a dead pid namespace Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  2:24     ` Gao feng
  2012-12-19 18:47     ` Oleg Nesterov
  2012-11-16 16:35   ` [PATCH 08/11] pidns: Deny strange cases when creating pid namespaces Eric W. Biederman
                     ` (3 subsequent siblings)
  9 siblings, 2 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

Looking at pid_ns->nr_hashed is a bit simpler and it works for
disjoint process trees that an unshare or a join of a pid_namespace
may create.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c          |   12 ------------
 kernel/pid.c           |   16 +++++++++++++---
 kernel/pid_namespace.c |   15 ++++-----------
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 346616c..d7fe58d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -72,18 +72,6 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
 		__this_cpu_dec(process_counts);
-		/*
-		 * If we are the last child process in a pid namespace to be
-		 * reaped, notify the reaper sleeping zap_pid_ns_processes().
-		 */
-		if (IS_ENABLED(CONFIG_PID_NS)) {
-			struct task_struct *parent = p->real_parent;
-
-			if ((task_active_pid_ns(parent)->child_reaper == parent) &&
-			    list_empty(&parent->children) &&
-			    (parent->flags & PF_EXITING))
-				wake_up_process(parent);
-		}
 	}
 	list_del_rcu(&p->thread_group);
 }
diff --git a/kernel/pid.c b/kernel/pid.c
index 9c21911..6e8da29 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -273,10 +273,20 @@ void free_pid(struct pid *pid)
 	spin_lock_irqsave(&pidmap_lock, flags);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
+		struct pid_namespace *ns = upid->ns;
 		hlist_del_rcu(&upid->pid_chain);
-		if (--upid->ns->nr_hashed == 0) {
-			upid->ns->nr_hashed = -1;
-			schedule_work(&upid->ns->proc_work);
+		switch(--ns->nr_hashed) {
+		case 1:
+			/* When all that is left in the pid namespace
+			 * is the reaper wake up the reaper.  The reaper
+			 * may be sleeping in zap_pid_ns_processes().
+			 */
+			wake_up_process(ns->child_reaper);
+			break;
+		case 0:
+			ns->nr_hashed = -1;
+			schedule_work(&ns->proc_work);
+			break;
 		}
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index cc44db7..e106b18 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -216,22 +216,15 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 
 	/*
 	 * sys_wait4() above can't reap the TASK_DEAD children.
-	 * Make sure they all go away, see __unhash_process().
+	 * Make sure they all go away, see free_pid().
 	 */
 	for (;;) {
-		bool need_wait = false;
-
-		read_lock(&tasklist_lock);
-		if (!list_empty(&current->children)) {
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-			need_wait = true;
-		}
-		read_unlock(&tasklist_lock);
-
-		if (!need_wait)
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (pid_ns->nr_hashed == 1)
 			break;
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 08/11] pidns: Deny strange cases when creating pid namespaces.
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (5 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  2:25     ` Gao feng
  2012-11-16 16:35   ` [PATCH 09/11] pidns: Add setns support Eric W. Biederman
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

task_active_pid_ns(current) != current->ns_proxy->pid_ns will
soon be allowed to support unshare and setns.

The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of current->ns_proxy->pid_ns.  However
that leads to strange cases like trying to have a single process be
init in multiple pid namespaces, which is racy and hard to think
about.

The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of task_active_pid_ns(current).  While
that seems less racy it does not provide any utility.

Therefore define the semantics of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the
pid namespace creation fails.  That is easy to implement and easy
to think about.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/pid_namespace.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e106b18..066c058 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -145,6 +145,8 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
 		return get_pid_ns(old_ns);
 	if (flags & (CLONE_THREAD|CLONE_PARENT))
 		return ERR_PTR(-EINVAL);
+	if (task_active_pid_ns(current) != old_ns)
+		return ERR_PTR(-EINVAL);
 	return create_pid_namespace(user_ns, old_ns);
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 09/11] pidns: Add setns support
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (6 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 08/11] pidns: Deny strange cases when creating pid namespaces Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-19  9:11     ` Gao feng
  2012-11-21  2:36     ` Gao feng
  2012-11-16 16:35   ` [PATCH 10/11] pidns: Consolidate initialzation of special init task state Eric W. Biederman
  2012-11-16 16:35   ` [PATCH 11/11] pidns: Support unsharing the pid namespace Eric W. Biederman
  9 siblings, 2 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

- Pid namespaces are designed to be inescapable so verify that the
  passed in pid namespace is a child of the currently active
  pid namespace or the currently active pid namespace itself.

  Allowing the currently active pid namespace is important so
  the effects of an earlier setns can be cancelled.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/namespaces.c    |    3 ++
 include/linux/proc_fs.h |    1 +
 kernel/pid_namespace.c  |   54 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index b178ed7..85ca047 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -24,6 +24,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_IPC_NS
 	&ipcns_operations,
 #endif
+#ifdef CONFIG_PID_NS
+	&pidns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 3fd2e87..acaafcd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -251,6 +251,7 @@ struct proc_ns_operations {
 extern const struct proc_ns_operations netns_operations;
 extern const struct proc_ns_operations utsns_operations;
 extern const struct proc_ns_operations ipcns_operations;
+extern const struct proc_ns_operations pidns_operations;
 
 union proc_op {
 	int (*proc_get_link)(struct dentry *, struct path *);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 066c058..c6514f5 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -300,6 +300,60 @@ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 	return 0;
 }
 
+static void *pidns_get(struct task_struct *task)
+{
+	struct pid_namespace *ns;
+
+	rcu_read_lock();
+	ns = get_pid_ns(task_active_pid_ns(task));
+	rcu_read_unlock();
+
+	return ns;
+}
+
+static void pidns_put(void *ns)
+{
+	put_pid_ns(ns);
+}
+
+static int pidns_install(struct nsproxy *nsproxy, void *ns)
+{
+	struct pid_namespace *active = task_active_pid_ns(current);
+	struct pid_namespace *ancestor, *new = ns;
+
+	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/*
+	 * Only allow entering the current active pid namespace
+	 * or a child of the current active pid namespace.
+	 *
+	 * This is required for fork to return a usable pid value and
+	 * this maintains the property that processes and their
+	 * children can not escape their current pid namespace.
+	 */
+	if (new->level < active->level)
+		return -EINVAL;
+
+	ancestor = new;
+	while (ancestor->level > active->level)
+		ancestor = ancestor->parent;
+	if (ancestor != active)
+		return -EINVAL;
+
+	put_pid_ns(nsproxy->pid_ns);
+	nsproxy->pid_ns = get_pid_ns(new);
+	return 0;
+}
+
+const struct proc_ns_operations pidns_operations = {
+	.name		= "pid",
+	.type		= CLONE_NEWPID,
+	.get		= pidns_get,
+	.put		= pidns_put,
+	.install	= pidns_install,
+};
+
 static __init int pid_namespaces_init(void)
 {
 	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 10/11] pidns: Consolidate initialzation of special init task state
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (7 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 09/11] pidns: Add setns support Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  2:56     ` Gao feng
  2012-11-16 16:35   ` [PATCH 11/11] pidns: Support unsharing the pid namespace Eric W. Biederman
  9 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
for the system init process, and another way for pid namespace
init processes test pid->nr == 1 and use the same code for both.

For the global init this results in SIGNAL_UNKILLABLE being set
much earlier in the initialization process.

This is a small cleanup and it paves the way for allowing unshare and
enter of the pid namespace as that path like our global init also will
not set CLONE_NEWPID.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 init/main.c   |    1 -
 kernel/fork.c |    6 +++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/init/main.c b/init/main.c
index 9cf77ab..317750a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -810,7 +810,6 @@ static int __ref kernel_init(void *unused)
 	system_state = SYSTEM_RUNNING;
 	numa_default_policy();
 
-	current->signal->flags |= SIGNAL_UNKILLABLE;
 	flush_delayed_fput();
 
 	if (ramdisk_execute_command) {
diff --git a/kernel/fork.c b/kernel/fork.c
index 666dc8b..0f2bbce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1039,8 +1039,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	atomic_set(&sig->live, 1);
 	atomic_set(&sig->sigcnt, 1);
 	init_waitqueue_head(&sig->wait_chldexit);
-	if (clone_flags & CLONE_NEWPID)
-		sig->flags |= SIGNAL_UNKILLABLE;
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
@@ -1441,8 +1439,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
 
 		if (thread_group_leader(p)) {
-			if (is_child_reaper(pid))
+			if (is_child_reaper(pid)) {
 				ns_of_pid(pid)->child_reaper = p;
+				p->signal->flags |= SIGNAL_UNKILLABLE;
+			}
 
 			p->signal->leader_pid = pid;
 			p->signal->tty = tty_kref_get(current->signal->tty);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
                     ` (8 preceding siblings ...)
  2012-11-16 16:35   ` [PATCH 10/11] pidns: Consolidate initialzation of special init task state Eric W. Biederman
@ 2012-11-16 16:35   ` Eric W. Biederman
  2012-11-21  2:55     ` Gao feng
  2012-12-19 18:14     ` Oleg Nesterov
  9 siblings, 2 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-16 16:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-kernel, Oleg Nesterov, Serge Hallyn, Gao feng,
	Andrew Morton, Eric W. Biederman

From: "Eric W. Biederman" <ebiederm@xmission.com>

Unsharing of the pid namespace unlike unsharing of other namespaces
does not take affect immediately.  Instead it affects the children
created with fork and clone.  The first of these children becomes the init
process of the new pid namespace, the rest become oddball children
of pid 0.  From the point of view of the new pid namespace the process
that created it is pid 0, as it's pid does not map.

A couple of different semantics were considered but this one was
settled on because it is easy to implement and it is usable from
pam modules.  The core reasons for the existence of unshare.

I took a survey of the callers of pam modules and the following
appears to be a representative sample of their logic.
{
	setup stuff include pam
	child = fork();
	if (!child) {
		setuid()
                exec /bin/bash
        }
        waitpid(child);

        pam and other cleanup
}

As you can see there is a fork to create the unprivileged user
space process.  Which means that the unprivileged user space
process will appear as pid 1 in the new pid namespace.  Further
most login processes do not cope with extraneous children which
means shifting the duty of reaping extraneous child process to
the creator of those extraneous children makes the system more
comprehensible.

The practical reason for this set of pid namespace semantics is
that it is simple to implement and verify they work correctly.
Whereas an implementation that requres changing the struct
pid on a process comes with a lot more races and pain.  Not
the least of which is that glibc caches getpid().

These semantics are implemented by having two notions
of the pid namespace of a proces.  There is task_active_pid_ns
which is the pid namspace the process was created with
and the pid namespace that all pids are presented to
that process in.  The task_active_pid_ns is stored
in the struct pid of the task.

Then there is the pid namespace that will be used for children
that pid namespace is stored in task->nsproxy->pid_ns.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/fork.c          |   32 +++++++++++++++++++++++++-------
 kernel/nsproxy.c       |    2 +-
 kernel/pid_namespace.c |    2 --
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0f2bbce..811ffba 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1565,9 +1565,11 @@ long do_fork(unsigned long clone_flags,
 	 * Do some preliminary argument and permissions checking before we
 	 * actually start allocating stuff
 	 */
-	if (clone_flags & CLONE_NEWUSER) {
-		if (clone_flags & CLONE_THREAD)
+	if (clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) {
+		if (clone_flags & (CLONE_THREAD|CLONE_PARENT))
 			return -EINVAL;
+	}
+	if (clone_flags & CLONE_NEWUSER) {
 		/* hopefully this check will go away when userns support is
 		 * complete
 		 */
@@ -1692,7 +1694,8 @@ static int check_unshare_flags(unsigned long unshare_flags)
 {
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
-				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET))
+				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
+				CLONE_NEWPID))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
@@ -1763,15 +1766,30 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 	int do_sysvsem = 0;
 	int err;
 
-	err = check_unshare_flags(unshare_flags);
-	if (err)
-		goto bad_unshare_out;
-
+	/*
+	 * If unsharing a pid namespace must also unshare the thread.
+	 */
+	if (unshare_flags & CLONE_NEWPID)
+		unshare_flags |= CLONE_THREAD;
+	/*
+	 * If unsharing a thread from a thread group, must also unshare vm.
+	 */
+	if (unshare_flags & CLONE_THREAD)
+		unshare_flags |= CLONE_VM;
+	/*
+	 * If unsharing vm, must also unshare signal handlers.
+	 */
+	if (unshare_flags & CLONE_VM)
+		unshare_flags |= CLONE_SIGHAND;
 	/*
 	 * If unsharing namespace, must also unshare filesystem information.
 	 */
 	if (unshare_flags & CLONE_NEWNS)
 		unshare_flags |= CLONE_FS;
+
+	err = check_unshare_flags(unshare_flags);
+	if (err)
+		goto bad_unshare_out;
 	/*
 	 * CLONE_NEWIPC must also detach from the undolist: after switching
 	 * to a new ipc namespace, the semaphore arrays from the old
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index cf37d52..2a11447 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -188,7 +188,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET)))
+			       CLONE_NEWNET | CLONE_NEWPID)))
 		return 0;
 
 	if (!capable(CAP_SYS_ADMIN))
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index c6514f5..a385b98 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -143,8 +143,6 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
 {
 	if (!(flags & CLONE_NEWPID))
 		return get_pid_ns(old_ns);
-	if (flags & (CLONE_THREAD|CLONE_PARENT))
-		return ERR_PTR(-EINVAL);
 	if (task_active_pid_ns(current) != old_ns)
 		return ERR_PTR(-EINVAL);
 	return create_pid_namespace(user_ns, old_ns);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/11] pidns: Add setns support
  2012-11-16 16:35   ` [PATCH 09/11] pidns: Add setns support Eric W. Biederman
@ 2012-11-19  9:11     ` Gao feng
  2012-11-19  9:27       ` Eric W. Biederman
  2012-11-21  2:36     ` Gao feng
  1 sibling, 1 reply; 37+ messages in thread
From: Gao feng @ 2012-11-19  9:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

于 2012年11月17日 00:35, Eric W. Biederman 写道:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> - Pid namespaces are designed to be inescapable so verify that the
>   passed in pid namespace is a child of the currently active
>   pid namespace or the currently active pid namespace itself.
> 
>   Allowing the currently active pid namespace is important so
>   the effects of an earlier setns can be cancelled.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Hi Eric

I noticed that,after we call setns to change task's pidns to container A's pidns.
we can't see this task in container A's proc filesystem.

Is this what we expected?

Thanks
Gao

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/11] pidns: Add setns support
  2012-11-19  9:11     ` Gao feng
@ 2012-11-19  9:27       ` Eric W. Biederman
  0 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-19  9:27 UTC (permalink / raw)
  To: Gao feng
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

Gao feng <gaofeng@cn.fujitsu.com> writes:

> 于 2012年11月17日 00:35, Eric W. Biederman 写道:
>> From: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> - Pid namespaces are designed to be inescapable so verify that the
>>   passed in pid namespace is a child of the currently active
>>   pid namespace or the currently active pid namespace itself.
>> 
>>   Allowing the currently active pid namespace is important so
>>   the effects of an earlier setns can be cancelled.
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>
> Hi Eric
>
> I noticed that,after we call setns to change task's pidns to container A's pidns.
> we can't see this task in container A's proc filesystem.
>
> Is this what we expected?

Only children move to the new pid namespace so yes.

Any other semantic requires ugly races with changing the pid of an
existing process.

Eric


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/11] pidns: Make the pidns proc mount/umount logic obvious.
  2012-11-16 16:35   ` [PATCH 05/11] pidns: Make the pidns proc mount/umount logic obvious Eric W. Biederman
@ 2012-11-19 11:02     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-19 11:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

于 2012年11月17日 00:35, Eric W. Biederman 写道:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Track the number of pids in the proc hash table.  When the number of
> pids goes to 0 schedule work to unmount the kernel mount of proc.
> 
> Move the mount of proc into alloc_pid when we allocate the pid for
> init.
> 
> Remove the surprising calls of pid_ns_release proc in fork and
> proc_flush_task.  Those code paths really shouldn't know about proc
> namespace implementation details and people have demonstrated several
> times that finding and understanding those code paths is difficult and
> non-obvious.
> 
> Because of the call path detach pid is alwasy called with the
> rtnl_lock held free_pid is not allowed to sleep, so the work to
> unmounting proc is moved to a work queue.  This has the side benefit
> of not blocking the entire world waiting for the unnecessary
> rcu_barrier in deactivate_locked_super.
> 
> In the process of making the code clear and obvious this fixes a bug
> reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
> mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
> succeeded and copy_net_ns failed.
> 
> Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/11] procfs: Don't cache a pid in the root inode.
  2012-11-16 16:35   ` [PATCH 02/11] procfs: Don't cache a pid in the root inode Eric W. Biederman
@ 2012-11-21  1:07     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  1:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

On 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Now that we have s_fs_info pointing to our pid namespace
> the original reason for the proc root inode having a struct
> pid is gone.
> 
> Caching a pid in the root inode has led to some complicated
> code.  Now that we don't need the struct pid, just remove it.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 03/11] pidns: Capture the user namespace and filter ns_last_pid
  2012-11-16 16:35   ` [PATCH 03/11] pidns: Capture the user namespace and filter ns_last_pid Eric W. Biederman
@ 2012-11-21  1:26     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  1:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> - Capture the the user namespace that creates the pid namespace
> - Use that user namespace to test if it is ok to write to
>   /proc/sys/kernel/ns_last_pid.
> 
> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] pidns: Use task_active_pid_ns where appropriate
  2012-11-16 16:35   ` [PATCH 04/11] pidns: Use task_active_pid_ns where appropriate Eric W. Biederman
@ 2012-11-21  2:02     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
> aka ns_of_pid(task_pid(tsk)) should have the same number of
> cache line misses with the practical difference that
> ns_of_pid(task_pid(tsk)) is released later in a processes life.
> 
> Furthermore by using task_active_pid_ns it becomes trivial
> to write an unshare implementation for the the pid namespace.
> 
> So I have used task_active_pid_ns everywhere I can.
> 
> In fork since the pid has not yet been attached to the
> process I use ns_of_pid, to achieve the same effect.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/11] pidns: Don't allow new processes in a dead pid namespace.
  2012-11-16 16:35   ` [PATCH 06/11] pidns: Don't allow new processes in a dead pid namespace Eric W. Biederman
@ 2012-11-21  2:17     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Set nr_hashed to -1 just before we schedule the work to cleanup proc.
> Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
> fail.
> 
> This guaranteees that processes never enter a pid namespaces after we
> have cleaned up the state to support processes in a pid namespace.
> 
> Currently sending SIGKILL to all of the process in a pid namespace as
> init exists gives us this guarantee but we need something a little
> stronger to support unsharing and joining a pid namespace.
> 
> Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-11-16 16:35   ` [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 Eric W. Biederman
@ 2012-11-21  2:24     ` Gao feng
  2012-12-19 18:47     ` Oleg Nesterov
  1 sibling, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Looking at pid_ns->nr_hashed is a bit simpler and it works for
> disjoint process trees that an unshare or a join of a pid_namespace
> may create.
> 
> Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/11] pidns: Deny strange cases when creating pid namespaces.
  2012-11-16 16:35   ` [PATCH 08/11] pidns: Deny strange cases when creating pid namespaces Eric W. Biederman
@ 2012-11-21  2:25     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> task_active_pid_ns(current) != current->ns_proxy->pid_ns will
> soon be allowed to support unshare and setns.
> 
> The definition of creating a child pid namespace when
> task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
> we create a child pid namespace of current->ns_proxy->pid_ns.  However
> that leads to strange cases like trying to have a single process be
> init in multiple pid namespaces, which is racy and hard to think
> about.
> 
> The definition of creating a child pid namespace when
> task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
> we create a child pid namespace of task_active_pid_ns(current).  While
> that seems less racy it does not provide any utility.
> 
> Therefore define the semantics of creating a child pid namespace when
> task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the
> pid namespace creation fails.  That is easy to implement and easy
> to think about.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/11] pidns: Add setns support
  2012-11-16 16:35   ` [PATCH 09/11] pidns: Add setns support Eric W. Biederman
  2012-11-19  9:11     ` Gao feng
@ 2012-11-21  2:36     ` Gao feng
  1 sibling, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> - Pid namespaces are designed to be inescapable so verify that the
>   passed in pid namespace is a child of the currently active
>   pid namespace or the currently active pid namespace itself.
> 
>   Allowing the currently active pid namespace is important so
>   the effects of an earlier setns can be cancelled.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-11-16 16:35   ` [PATCH 11/11] pidns: Support unsharing the pid namespace Eric W. Biederman
@ 2012-11-21  2:55     ` Gao feng
  2012-12-19 18:14     ` Oleg Nesterov
  1 sibling, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Unsharing of the pid namespace unlike unsharing of other namespaces
> does not take affect immediately.  Instead it affects the children
> created with fork and clone.  The first of these children becomes the init
> process of the new pid namespace, the rest become oddball children
> of pid 0.  From the point of view of the new pid namespace the process
> that created it is pid 0, as it's pid does not map.
> 
> A couple of different semantics were considered but this one was
> settled on because it is easy to implement and it is usable from
> pam modules.  The core reasons for the existence of unshare.
> 
> I took a survey of the callers of pam modules and the following
> appears to be a representative sample of their logic.
> {
> 	setup stuff include pam
> 	child = fork();
> 	if (!child) {
> 		setuid()
>                 exec /bin/bash
>         }
>         waitpid(child);
> 
>         pam and other cleanup
> }
> 
> As you can see there is a fork to create the unprivileged user
> space process.  Which means that the unprivileged user space
> process will appear as pid 1 in the new pid namespace.  Further
> most login processes do not cope with extraneous children which
> means shifting the duty of reaping extraneous child process to
> the creator of those extraneous children makes the system more
> comprehensible.
> 
> The practical reason for this set of pid namespace semantics is
> that it is simple to implement and verify they work correctly.
> Whereas an implementation that requres changing the struct
> pid on a process comes with a lot more races and pain.  Not
> the least of which is that glibc caches getpid().
> 
> These semantics are implemented by having two notions
> of the pid namespace of a proces.  There is task_active_pid_ns
> which is the pid namspace the process was created with
> and the pid namespace that all pids are presented to
> that process in.  The task_active_pid_ns is stored
> in the struct pid of the task.
> 
> Then there is the pid namespace that will be used for children
> that pid namespace is stored in task->nsproxy->pid_ns.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 10/11] pidns: Consolidate initialzation of special init task state
  2012-11-16 16:35   ` [PATCH 10/11] pidns: Consolidate initialzation of special init task state Eric W. Biederman
@ 2012-11-21  2:56     ` Gao feng
  0 siblings, 0 replies; 37+ messages in thread
From: Gao feng @ 2012-11-21  2:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Oleg Nesterov, Serge Hallyn,
	Andrew Morton

on 2012/11/17 00:35, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
> for the system init process, and another way for pid namespace
> init processes test pid->nr == 1 and use the same code for both.
> 
> For the global init this results in SIGNAL_UNKILLABLE being set
> much earlier in the initialization process.
> 
> This is a small cleanup and it paves the way for allowing unshare and
> enter of the pid namespace as that path like our global init also will
> not set CLONE_NEWPID.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-11-16 16:35   ` [PATCH 11/11] pidns: Support unsharing the pid namespace Eric W. Biederman
  2012-11-21  2:55     ` Gao feng
@ 2012-12-19 18:14     ` Oleg Nesterov
  2012-12-21  1:43       ` Eric W. Biederman
  1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2012-12-19 18:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

Hi Eric,

oleg@tv-sign.ru no longer works, so I just noticed these emails.

On 11/16, Eric W. Biederman wrote:
>
> Unsharing of the pid namespace unlike unsharing of other namespaces
> does not take affect immediately.  Instead it affects the children
> created with fork and clone.

I'll try to read this series later, but I am not sure I will ever
understand the code with these patches ;)

So alloc_pid() becomes the only user nsproxy->pid_ns and it is not
necessarily equal to task_active_pid_ns(). It seems to me that this
adds a lot of new corner cases.

Unless I missed something, at least we should not allow CLONE_THREAD
if active_pid_ns != nsproxy->pid_ns. If nothing else, copy_process()
initializes ->child_reaper only if thread_group_leader(child). And
->child_reaper == NULL can obviously lead to crash.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-11-16 16:35   ` [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 Eric W. Biederman
  2012-11-21  2:24     ` Gao feng
@ 2012-12-19 18:47     ` Oleg Nesterov
  2012-12-21  1:19       ` Eric W. Biederman
  1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2012-12-19 18:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

On 11/16, Eric W. Biederman wrote:
>
> @@ -216,22 +216,15 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>
>  	/*
>  	 * sys_wait4() above can't reap the TASK_DEAD children.
> -	 * Make sure they all go away, see __unhash_process().
> +	 * Make sure they all go away, see free_pid().
>  	 */
>  	for (;;) {
> -		bool need_wait = false;
> -
> -		read_lock(&tasklist_lock);
> -		if (!list_empty(&current->children)) {
> -			__set_current_state(TASK_UNINTERRUPTIBLE);
> -			need_wait = true;
> -		}
> -		read_unlock(&tasklist_lock);
> -
> -		if (!need_wait)
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (pid_ns->nr_hashed == 1)
>  			break;
>  		schedule();
>  	}

I agree, the patch itself looks fine.

But, with all other changes I do not understand this part at all.

A task from the parent namespace can do setns + fork at any time
(until nr_hashed >= 0). So ->nr_hashed can be incremented again
after zap_pid_ns_processes() returns.

Or, we can sleep in TASK_UNINTERRUPTIBLE "forever" if this happens
after kill-them-all.

Could you explain why do we need to wait at all? I can be easily
wrong, but at first glance the original reason for this wait has
gone away?

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-12-19 18:47     ` Oleg Nesterov
@ 2012-12-21  1:19       ` Eric W. Biederman
  2012-12-21 14:11         ` Oleg Nesterov
  0 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-12-21  1:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/16, Eric W. Biederman wrote:
>>
>> @@ -216,22 +216,15 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>
>>  	/*
>>  	 * sys_wait4() above can't reap the TASK_DEAD children.
>> -	 * Make sure they all go away, see __unhash_process().
>> +	 * Make sure they all go away, see free_pid().
>>  	 */
>>  	for (;;) {
>> -		bool need_wait = false;
>> -
>> -		read_lock(&tasklist_lock);
>> -		if (!list_empty(&current->children)) {
>> -			__set_current_state(TASK_UNINTERRUPTIBLE);
>> -			need_wait = true;
>> -		}
>> -		read_unlock(&tasklist_lock);
>> -
>> -		if (!need_wait)
>> +		set_current_state(TASK_UNINTERRUPTIBLE);
>> +		if (pid_ns->nr_hashed == 1)
>>  			break;
>>  		schedule();
>>  	}
>
> I agree, the patch itself looks fine.
>
> But, with all other changes I do not understand this part at all.
>
> A task from the parent namespace can do setns + fork at any time
> (until nr_hashed >= 0). So ->nr_hashed can be incremented again
> after zap_pid_ns_processes() returns.

I want to talk about how alloc_pid and free_pid prevent nr_hashed
from increasing once the last processes has exited the pid namespace
but that doesn't apply here.

> Or, we can sleep in TASK_UNINTERRUPTIBLE "forever" if this happens
> after kill-them-all.

Sleeping forever should be prevented by this chunk in free_pid:

		switch(--ns->nr_hashed) {
		case 1:
			/* When all that is left in the pid namespace
			 * is the reaper wake up the reaper.  The reaper
			 * may be sleeping in zap_pid_ns_processes().
			 */
			wake_up_process(ns->child_reaper);


I admit it continues to be true that if an injected process or a
debugged process does not exit we can block waiting for all of the
processes to be reaped indefinitely.

> Could you explain why do we need to wait at all? I can be easily
> wrong, but at first glance the original reason for this wait has
> gone away?

It is very nice to know that when you do waitpid for the init process of
a pid namespace that there are no other processes in the pid namespace.

Leaving the wait here has the nice effect that it doesn't penalize
anything but pid namespace code paths.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-12-19 18:14     ` Oleg Nesterov
@ 2012-12-21  1:43       ` Eric W. Biederman
  2012-12-21 15:49         ` Oleg Nesterov
  0 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-12-21  1:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> Hi Eric,
>
> oleg@tv-sign.ru no longer works, so I just noticed these emails.

Darn and instead of bouncing the emails just go into a black hole :(

I have updated my address book to point to oleg@redhat.com so
hopefully I don't make that mistake again.

> On 11/16, Eric W. Biederman wrote:
>>
>> Unsharing of the pid namespace unlike unsharing of other namespaces
>> does not take affect immediately.  Instead it affects the children
>> created with fork and clone.
>
> I'll try to read this series later, but I am not sure I will ever
> understand the code with these patches ;)

Hopefully the code doesn't cause you too many problems.

> So alloc_pid() becomes the only user nsproxy->pid_ns and it is not
> necessarily equal to task_active_pid_ns(). It seems to me that this
> adds a lot of new corner cases.

I have tried to simply outlaw the most of the new corner cases as they
simply are not interesting so there is no point implementing them,
or thinking about them once they are outlawed.

> Unless I missed something, at least we should not allow CLONE_THREAD
> if active_pid_ns != nsproxy->pid_ns. If nothing else, copy_process()
> initializes ->child_reaper only if thread_group_leader(child). And
> ->child_reaper == NULL can obviously lead to crash.

Hmm.  Let me think that through as you may have a point.

In copy_pid_ns I fail if task_active_pid_ns != nsproxy->pid_ns, and in
unshare CLONE_NEW_PID implies "CLONE_THREAD|CLONE_VM|CLONE_SIGHAND".  So
I avoid most of those cases already.

You are asking about clone(CLONE_THREAD) after unshare(CLONE_NEWPID).  I
totally failed to realize that case existed.  Oleg thank you for
pointing it out.

Below is my preliminary patch for ruling those things out.  I have added
CLONE_PARENT to the forbidden set because that seems about as bad
as CLONE_SIGHAND or CLONE_THREAD.

I will cook up a proper patch and get it merged shortly.

Eric

diff --git a/kernel/fork.c b/kernel/fork.c
index c36c4e3..340a25c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
                                current->signal->flags & SIGNAL_UNKILLABLE)
                return ERR_PTR(-EINVAL);

+       /*
+        * If the children will be in a different pid namespace don't allow
+        * the creation of threads.
+        */
+       if ((clone_flags & (CLONE_THREAD|CLONE_SIGHAND|CLONE_VM|CLONE_PARENT)) &&
+           task_active_pid_ns(current) != current->nsproxy->pid_ns)
+               return ERR_PTR(-EINVAL);
+
        retval = security_task_create(clone_flags);
        if (retval)
                goto fork_out;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-12-21  1:19       ` Eric W. Biederman
@ 2012-12-21 14:11         ` Oleg Nesterov
  2012-12-21 15:02           ` Oleg Nesterov
  2012-12-21 18:33           ` Eric W. Biederman
  0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2012-12-21 14:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

On 12/20, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 11/16, Eric W. Biederman wrote:
> >>
> >> @@ -216,22 +216,15 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> >>
> >>  	/*
> >>  	 * sys_wait4() above can't reap the TASK_DEAD children.
> >> -	 * Make sure they all go away, see __unhash_process().
> >> +	 * Make sure they all go away, see free_pid().
> >>  	 */
> >>  	for (;;) {
> >> -		bool need_wait = false;
> >> -
> >> -		read_lock(&tasklist_lock);
> >> -		if (!list_empty(&current->children)) {
> >> -			__set_current_state(TASK_UNINTERRUPTIBLE);
> >> -			need_wait = true;
> >> -		}
> >> -		read_unlock(&tasklist_lock);
> >> -
> >> -		if (!need_wait)
> >> +		set_current_state(TASK_UNINTERRUPTIBLE);
> >> +		if (pid_ns->nr_hashed == 1)
> >>  			break;
> >>  		schedule();
> >>  	}
> >
> > I agree, the patch itself looks fine.
> >
> > But, with all other changes I do not understand this part at all.
> >
> > A task from the parent namespace can do setns + fork at any time
> > (until nr_hashed >= 0). So ->nr_hashed can be incremented again
> > after zap_pid_ns_processes() returns.

XXX: this creates the new pid P in this ns. Please see below...

> I want to talk about how alloc_pid and free_pid prevent nr_hashed
> from increasing once the last processes has exited the pid namespace
> but that doesn't apply here.

Not sure I understand, but it seems you agree this can happen.

> > Or, we can sleep in TASK_UNINTERRUPTIBLE "forever" if this happens
> > after kill-them-all.
>
> Sleeping forever should be prevented by this chunk in free_pid:

Note that I said "forever", not forever ;)

>
> 		switch(--ns->nr_hashed) {
> 		case 1:
> 			/* When all that is left in the pid namespace
> 			 * is the reaper wake up the reaper.  The reaper
> 			 * may be sleeping in zap_pid_ns_processes().
> 			 */
> 			wake_up_process(ns->child_reaper);
>
>
> I admit it continues to be true that if an injected process or a
> debugged process does not exit we can block waiting for all of the
> processes to be reaped indefinitely.

Yes, I meant until the injected process exits.

> > Could you explain why do we need to wait at all? I can be easily
> > wrong, but at first glance the original reason for this wait has
> > gone away?
>
> It is very nice to know that when you do waitpid for the init process of
> a pid namespace that there are no other processes in the pid namespace.

OK, and I agree. But my point was, at least this _looks_ strange, because
ns->nr_hashed == 1 is not stable.

And in fact I think this is not strange, but simply wrong.

Please consider the XXX case above. Suppose that free_pid(P) happens
after ns->child_reaper exits and thus this pointer points to nowhere.
Suppose also that there is another injected pid so nr_hashed == 2.
In this case wake_up_process(ns->child_reaper) means use-after-free,
no?

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-12-21 14:11         ` Oleg Nesterov
@ 2012-12-21 15:02           ` Oleg Nesterov
  2012-12-21 15:31             ` Oleg Nesterov
  2012-12-21 18:33           ` Eric W. Biederman
  1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2012-12-21 15:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

On 12/21, Oleg Nesterov wrote:
>
> And in fact I think this is not strange, but simply wrong.
>
> Please consider the XXX case above. Suppose that free_pid(P) happens
> after ns->child_reaper exits and thus this pointer points to nowhere.
> Suppose also that there is another injected pid so nr_hashed == 2.
> In this case wake_up_process(ns->child_reaper) means use-after-free,
> no?

Hmm. And another minor problem, unless I missed something.

Once again, the parent namespace injects the task T after ns->reaper
sees nr_hashed == 1 and returns. Suppose that reaper's parent does
do_wait() and free_pidmap() clears the bit == 1.

Now, what if T doesn't exit but forks? We must not re-create the
task with pid_nr == 1 in the dead namespace. Normally this can't
happen, RESERVED_PIDS logic in alloc_pidmap() saves us. But it
seems that we need

	- .extra1 = &zero,
	+ .extra1 = &one,

in pid_ns_ctl_table.

Oleg.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-12-21 15:02           ` Oleg Nesterov
@ 2012-12-21 15:31             ` Oleg Nesterov
  2012-12-21 18:42               ` Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2012-12-21 15:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

On 12/21, Oleg Nesterov wrote:
>
> Once again, the parent namespace injects the task T after ns->reaper
> sees nr_hashed == 1 and returns. Suppose that reaper's parent does
> do_wait() and free_pidmap() clears the bit == 1.
>
> Now, what if T doesn't exit but forks? We must not re-create the
> task with pid_nr == 1 in the dead namespace. Normally this can't
> happen, RESERVED_PIDS logic in alloc_pidmap() saves us. But it
> seems that we need
>
> 	- .extra1 = &zero,
> 	+ .extra1 = &one,
>
> in pid_ns_ctl_table.

Oh, and another problem, or I am totally confused.

T forks and creates the child C1. C1 creates C2. What if C1 exits?
It will try to reparent C2 to the dead/freed ns->child_reaper.

In short. We shouldn't allow alloc_pid() if ns->child_reaper is dying,
I think. nr_hashed == -1 doesn't really work.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-12-21  1:43       ` Eric W. Biederman
@ 2012-12-21 15:49         ` Oleg Nesterov
  2012-12-21 17:51           ` Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2012-12-21 15:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

On 12/20, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > So alloc_pid() becomes the only user nsproxy->pid_ns and it is not
> > necessarily equal to task_active_pid_ns(). It seems to me that this
> > adds a lot of new corner cases.
>
> I have tried to simply outlaw the most of the new corner cases as they
> simply are not interesting so there is no point implementing them,
> or thinking about them once they are outlawed.

Eric. I understand that it is too late to discuss this. And yes, I simply
do not understand the problem space, I never used containers.

But, stupid question. Let's ignore the pid_ns-specific oddities.

1. Ignoring setns(), why do we need /proc/pid/ns/ ?

2. Why setns() requires /proc/pid/ns/ ? IOW, why it can't be

	sys_setns(pid_t pid, int clone_flags)
	{
		truct task_struct *tsk = find_task_by_vpid(pid);
		struct nsproxy *target = get_nsproxy(tsk->nsproxy);

		new_nsproxy = create_new_namespaces(...);

		if (clone_flags & CLONE_NEWNS)
			mntns_install(...);
		if (clone_flags & CLONE_NEWIPC)
			ipcns_install(...);
		...
	}

I feel I missed something trivial, but what?

> @@ -1166,6 +1166,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>                                 current->signal->flags & SIGNAL_UNKILLABLE)
>                 return ERR_PTR(-EINVAL);
>
> +       /*
> +        * If the children will be in a different pid namespace don't allow
> +        * the creation of threads.
> +        */
> +       if ((clone_flags & (CLONE_THREAD|CLONE_SIGHAND|CLONE_VM|CLONE_PARENT)) &&
> +           task_active_pid_ns(current) != current->nsproxy->pid_ns)
> +               return ERR_PTR(-EINVAL);

Agreed, and this also removes other oddities with pthread_create().

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-12-21 15:49         ` Oleg Nesterov
@ 2012-12-21 17:51           ` Eric W. Biederman
  2012-12-21 19:24             ` Rob Landley
  0 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-12-21 17:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> Eric. I understand that it is too late to discuss this. And yes, I simply
> do not understand the problem space, I never used containers.
>
> But, stupid question. Let's ignore the pid_ns-specific oddities.
>
> 1. Ignoring setns(), why do we need /proc/pid/ns/ ?
>
> 2. Why setns() requires /proc/pid/ns/ ? IOW, why it can't be
>
> 	sys_setns(pid_t pid, int clone_flags)
> 	{
> 		truct task_struct *tsk = find_task_by_vpid(pid);
> 		struct nsproxy *target = get_nsproxy(tsk->nsproxy);
>
> 		new_nsproxy = create_new_namespaces(...);
>
> 		if (clone_flags & CLONE_NEWNS)
> 			mntns_install(...);
> 		if (clone_flags & CLONE_NEWIPC)
> 			ipcns_install(...);
> 		...
> 	}
>
> I feel I missed something trivial, but what?

It is a question of naming.

The problem I set out to solve when all of this was introduced was how
to name namespaces without introducing yet another namespace.

The solution to the naming problem that I finally found was to introduce
something I could mount.  Using a file in /proc I can bind mount it
anywhere in the mount namespace with any name.  That gives me names for
namespaces in the mount namespace.  Furthermore those names go away
when the mount namespace goes away making them very easy to manage.

Being able to open the file instead of passing a path to setns
allows a process for private per process naming (via file descriptors).

To get a practical feel of this it may be worth looking at iproute.

ip netns add
ip netns del
ip netns exec

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-12-21 14:11         ` Oleg Nesterov
  2012-12-21 15:02           ` Oleg Nesterov
@ 2012-12-21 18:33           ` Eric W. Biederman
  1 sibling, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-12-21 18:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> On 12/20, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > On 11/16, Eric W. Biederman wrote:
>> >>
>> >> @@ -216,22 +216,15 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>> >>
>> >>  	/*
>> >>  	 * sys_wait4() above can't reap the TASK_DEAD children.
>> >> -	 * Make sure they all go away, see __unhash_process().
>> >> +	 * Make sure they all go away, see free_pid().
>> >>  	 */
>> >>  	for (;;) {
>> >> -		bool need_wait = false;
>> >> -
>> >> -		read_lock(&tasklist_lock);
>> >> -		if (!list_empty(&current->children)) {
>> >> -			__set_current_state(TASK_UNINTERRUPTIBLE);
>> >> -			need_wait = true;
>> >> -		}
>> >> -		read_unlock(&tasklist_lock);
>> >> -
>> >> -		if (!need_wait)
>> >> +		set_current_state(TASK_UNINTERRUPTIBLE);
>> >> +		if (pid_ns->nr_hashed == 1)
>> >>  			break;
>> >>  		schedule();
>> >>  	}
>> >
>> > I agree, the patch itself looks fine.
>> >
>> > But, with all other changes I do not understand this part at all.
>> >
>> > A task from the parent namespace can do setns + fork at any time
>> > (until nr_hashed >= 0). So ->nr_hashed can be incremented again
>> > after zap_pid_ns_processes() returns.
>
> XXX: this creates the new pid P in this ns. Please see below...
>
>> I want to talk about how alloc_pid and free_pid prevent nr_hashed
>> from increasing once the last processes has exited the pid namespace
>> but that doesn't apply here.
>
> Not sure I understand, but it seems you agree this can happen.
>
>> > Or, we can sleep in TASK_UNINTERRUPTIBLE "forever" if this happens
>> > after kill-them-all.
>>
>> Sleeping forever should be prevented by this chunk in free_pid:
>
> Note that I said "forever", not forever ;)
>
>>
>> 		switch(--ns->nr_hashed) {
>> 		case 1:
>> 			/* When all that is left in the pid namespace
>> 			 * is the reaper wake up the reaper.  The reaper
>> 			 * may be sleeping in zap_pid_ns_processes().
>> 			 */
>> 			wake_up_process(ns->child_reaper);
>>
>>
>> I admit it continues to be true that if an injected process or a
>> debugged process does not exit we can block waiting for all of the
>> processes to be reaped indefinitely.
>
> Yes, I meant until the injected process exits.


>> > Could you explain why do we need to wait at all? I can be easily
>> > wrong, but at first glance the original reason for this wait has
>> > gone away?
>>
>> It is very nice to know that when you do waitpid for the init process of
>> a pid namespace that there are no other processes in the pid namespace.
>
> OK, and I agree. But my point was, at least this _looks_ strange, because
> ns->nr_hashed == 1 is not stable.
>
> And in fact I think this is not strange, but simply wrong.
>
> Please consider the XXX case above. Suppose that free_pid(P) happens
> after ns->child_reaper exits and thus this pointer points to nowhere.
> Suppose also that there is another injected pid so nr_hashed == 2.
> In this case wake_up_process(ns->child_reaper) means use-after-free,
> no?

That is a bug in two dimensions.
- The use after free.
- The fact that a process can persist in the pid namespace after the
  init process has been exited and waited on.

Without the setns support this code is fine, but with setns we have
an process injection bug.

I suspect zap_pid_ns_processes just needs to set a flag that prevents
the allocation of new pids.  I am going to have to step back and think
of something clean.

Eric


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  2012-12-21 15:31             ` Oleg Nesterov
@ 2012-12-21 18:42               ` Eric W. Biederman
  0 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-12-21 18:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Containers, linux-kernel, Serge Hallyn, Gao feng, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> On 12/21, Oleg Nesterov wrote:
>>
>> Once again, the parent namespace injects the task T after ns->reaper
>> sees nr_hashed == 1 and returns. Suppose that reaper's parent does
>> do_wait() and free_pidmap() clears the bit == 1.
>>
>> Now, what if T doesn't exit but forks? We must not re-create the
>> task with pid_nr == 1 in the dead namespace. Normally this can't
>> happen, RESERVED_PIDS logic in alloc_pidmap() saves us. But it
>> seems that we need
>>
>> 	- .extra1 = &zero,
>> 	+ .extra1 = &one,
>>
>> in pid_ns_ctl_table.
>
> Oh, and another problem, or I am totally confused.
>
> T forks and creates the child C1. C1 creates C2. What if C1 exits?
> It will try to reparent C2 to the dead/freed ns->child_reaper.
>
> In short. We shouldn't allow alloc_pid() if ns->child_reaper is dying,
> I think. nr_hashed == -1 doesn't really work.

Certainly nr_hashed == -1 is insufficient.

Injecting a processes when nr_hashed == 1 seems to be the magic poison.

I wonder if we could just say.
	if (ns->nr_hashed == -1)
        	goto out_unlock;
	if ((ns->nr_hashed >= 1) && (ns->child_reaper->flags & PF_EXITING))
		goto out_unlock;

I don't know if the locking is sufficient at that point.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] pidns: Support unsharing the pid namespace.
  2012-12-21 17:51           ` Eric W. Biederman
@ 2012-12-21 19:24             ` Rob Landley
  2012-12-21 22:58               ` namespace documentation Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Rob Landley @ 2012-12-21 19:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Linux Containers, linux-kernel, Andrew Morton

On 12/21/2012 11:51:03 AM, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
> 
> > Eric. I understand that it is too late to discuss this. And yes, I  
> simply
> > do not understand the problem space, I never used containers.
> >
> > But, stupid question. Let's ignore the pid_ns-specific oddities.
> >
> > 1. Ignoring setns(), why do we need /proc/pid/ns/ ?
> >
> > 2. Why setns() requires /proc/pid/ns/ ? IOW, why it can't be
> >
> > 	sys_setns(pid_t pid, int clone_flags)
> > 	{
> > 		truct task_struct *tsk = find_task_by_vpid(pid);
> > 		struct nsproxy *target = get_nsproxy(tsk->nsproxy);
> >
> > 		new_nsproxy = create_new_namespaces(...);
> >
> > 		if (clone_flags & CLONE_NEWNS)
> > 			mntns_install(...);
> > 		if (clone_flags & CLONE_NEWIPC)
> > 			ipcns_install(...);
> > 		...
> > 	}
> >
> > I feel I missed something trivial, but what?
> 
> It is a question of naming.
> 
> The problem I set out to solve when all of this was introduced was how
> to name namespaces without introducing yet another namespace.
> 
> The solution to the naming problem that I finally found was to  
> introduce
> something I could mount.

Where might I find documentation on this? I'm aware of  
Documentation/namespaces but it's only got one file in it (about  
conflicts between namespace types). I'm aware of  
http://lxc.sourceforge.net/index.php/about/kernel-namespaces/ and  
http://lxc.sourceforge.net/man/ but that's mixed in with the  
implementation details of a particular userspace tool, and tends to lag  
the kernel significantly. (Those man pages were last updated in 2010,  
which if I recall was the last time I poked them about it.)

Rob

^ permalink raw reply	[flat|nested] 37+ messages in thread

* namespace documentation.
  2012-12-21 19:24             ` Rob Landley
@ 2012-12-21 22:58               ` Eric W. Biederman
  0 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-12-21 22:58 UTC (permalink / raw)
  To: Rob Landley
  Cc: Oleg Nesterov, Linux Containers, linux-kernel, Andrew Morton,
	Michael Kerrisk (man-pages)

Rob Landley <rob@landley.net> writes:

> On 12/21/2012 11:51:03 AM, Eric W. Biederman wrote:
>> Oleg Nesterov <oleg@redhat.com> writes:
>> 
>> > Eric. I understand that it is too late to discuss this. And yes, I  
>> simply
>> > do not understand the problem space, I never used containers.
>> >
>> > But, stupid question. Let's ignore the pid_ns-specific oddities.
>> >
>> > 1. Ignoring setns(), why do we need /proc/pid/ns/ ?
>> >
>> > 2. Why setns() requires /proc/pid/ns/ ? IOW, why it can't be
>> >
>> > 	sys_setns(pid_t pid, int clone_flags)
>> > 	{
>> > 		truct task_struct *tsk = find_task_by_vpid(pid);
>> > 		struct nsproxy *target = get_nsproxy(tsk->nsproxy);
>> >
>> > 		new_nsproxy = create_new_namespaces(...);
>> >
>> > 		if (clone_flags & CLONE_NEWNS)
>> > 			mntns_install(...);
>> > 		if (clone_flags & CLONE_NEWIPC)
>> > 			ipcns_install(...);
>> > 		...
>> > 	}
>> >
>> > I feel I missed something trivial, but what?
>> 
>> It is a question of naming.
>> 
>> The problem I set out to solve when all of this was introduced was how
>> to name namespaces without introducing yet another namespace.
>> 
>> The solution to the naming problem that I finally found was to  
>> introduce
>> something I could mount.
>
> Where might I find documentation on this? I'm aware of  
> Documentation/namespaces but it's only got one file in it (about  
> conflicts between namespace types). I'm aware of  
> http://lxc.sourceforge.net/index.php/about/kernel-namespaces/ and  
> http://lxc.sourceforge.net/man/ but that's mixed in with the  
> implementation details of a particular userspace tool, and tends to lag  
> the kernel significantly. (Those man pages were last updated in 2010,  
> which if I recall was the last time I poked them about it.)

I'm not certain what you are asking about.

The man pages that I endeavour to keep reasonably current are.

man 5 proc
man 2 setns
man 2 unshare
man 2 clone

You won't get a design discussion but you will get a description of how
the existing pieces work.  Of course now that I look it appears my
patches have not merged yet.  But that is reasonable since my recent
changes did not merge until a few days ago.

There is also iproute2 it's man pages and source.

There is the kernel source.

There are the occassional lwn articles.

I believe there should be a reasonable amount of email in the mailing
list archives when talking about the design descision, and when I
introduced setns.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-12-21 22:59 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-16 16:32 [REVIEW][PATCH 0/11] pid namespace cleanups and enhancements Eric W. Biederman
2012-11-16 16:35 ` [PATCH 01/11] procfs: Use the proc generic infrastructure for proc/self Eric W. Biederman
2012-11-16 16:35   ` [PATCH 02/11] procfs: Don't cache a pid in the root inode Eric W. Biederman
2012-11-21  1:07     ` Gao feng
2012-11-16 16:35   ` [PATCH 03/11] pidns: Capture the user namespace and filter ns_last_pid Eric W. Biederman
2012-11-21  1:26     ` Gao feng
2012-11-16 16:35   ` [PATCH 04/11] pidns: Use task_active_pid_ns where appropriate Eric W. Biederman
2012-11-21  2:02     ` Gao feng
2012-11-16 16:35   ` [PATCH 05/11] pidns: Make the pidns proc mount/umount logic obvious Eric W. Biederman
2012-11-19 11:02     ` Gao feng
2012-11-16 16:35   ` [PATCH 06/11] pidns: Don't allow new processes in a dead pid namespace Eric W. Biederman
2012-11-21  2:17     ` Gao feng
2012-11-16 16:35   ` [PATCH 07/11] pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 Eric W. Biederman
2012-11-21  2:24     ` Gao feng
2012-12-19 18:47     ` Oleg Nesterov
2012-12-21  1:19       ` Eric W. Biederman
2012-12-21 14:11         ` Oleg Nesterov
2012-12-21 15:02           ` Oleg Nesterov
2012-12-21 15:31             ` Oleg Nesterov
2012-12-21 18:42               ` Eric W. Biederman
2012-12-21 18:33           ` Eric W. Biederman
2012-11-16 16:35   ` [PATCH 08/11] pidns: Deny strange cases when creating pid namespaces Eric W. Biederman
2012-11-21  2:25     ` Gao feng
2012-11-16 16:35   ` [PATCH 09/11] pidns: Add setns support Eric W. Biederman
2012-11-19  9:11     ` Gao feng
2012-11-19  9:27       ` Eric W. Biederman
2012-11-21  2:36     ` Gao feng
2012-11-16 16:35   ` [PATCH 10/11] pidns: Consolidate initialzation of special init task state Eric W. Biederman
2012-11-21  2:56     ` Gao feng
2012-11-16 16:35   ` [PATCH 11/11] pidns: Support unsharing the pid namespace Eric W. Biederman
2012-11-21  2:55     ` Gao feng
2012-12-19 18:14     ` Oleg Nesterov
2012-12-21  1:43       ` Eric W. Biederman
2012-12-21 15:49         ` Oleg Nesterov
2012-12-21 17:51           ` Eric W. Biederman
2012-12-21 19:24             ` Rob Landley
2012-12-21 22:58               ` namespace documentation Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).