linux-kernel.vger.kernel.org archive mirror
* [ABI REVIEW][PATCH 0/8] Namespace file descriptors
@ 2010-09-23  8:45 Eric W. Biederman
  2010-09-23  8:46 ` [PATCH 1/8] ns: proc files for namespace naming policy Eric W. Biederman
                   ` (9 more replies)
  0 siblings, 10 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Introduce files for manipulating namespaces, and the related syscalls.
files:
/proc/self/ns/<nstype>

syscalls:
int setns(unsigned int nstype, int fd);
int socketat(int nsfd, int family, int type, int protocol);

Netlink attribute:
IFLA_NS_FD int fd.

Namespace file descriptors address three specific problems that
can make namespaces hard to work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the child of the
  original creator.
- Namespaces don't have names that userspace can use to talk about them.

Opening a /proc/self/ns/<nstype> file returns a file descriptor
that can be used to talk about a specific namespace, and to keep the
specified namespace alive.

/proc/self/ns/<nstype> can be bind mounted as:
mount --bind /proc/self/ns/net /some/filesystem/path
to keep the namespace alive as long as the mount exists.
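
As a rough userspace illustration of the two paragraphs above, the
sketch below builds the /proc/<pid>/ns/<nstype> path and opens it
read-only; holding the returned descriptor is what keeps the namespace
alive. The helper names ns_path() and open_ns() are hypothetical and
are not part of this series.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format "/proc/<pid>/ns/<nstype>" into buf.
 * A pid of 0 selects "/proc/self".  Returns 0 on success, or -1 if
 * the result did not fit in buf.
 */
int ns_path(char *buf, size_t size, pid_t pid, const char *nstype)
{
	int n;

	if (pid == 0)
		n = snprintf(buf, size, "/proc/self/ns/%s", nstype);
	else
		n = snprintf(buf, size, "/proc/%ld/ns/%s", (long)pid, nstype);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Hypothetical helper: open the namespace file read-only.  The
 * namespace stays pinned for as long as the returned fd is open.
 * Returns the fd, or -1 on error.
 */
int open_ns(pid_t pid, const char *nstype)
{
	char path[64];

	if (ns_path(path, sizeof(path), pid, nstype) < 0)
		return -1;
	return open(path, O_RDONLY);
}
```

A caller would keep the descriptor from open_ns(0, "net") open for as
long as the namespace is needed, and close() it to drop the reference.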

setns(), as a companion to unshare(), allows changing the namespace
of the current process.  Being able to unshare the namespace is
a requirement.

There are two primary envisioned uses for this functionality.
o ``Entering'' an existing container.
o Allowing multiple network namespaces to be in use at once on
  the same machine, without requiring elaborate infrastructure.
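
For the first use, "entering" a namespace from userspace could look
roughly like the sketch below. Note the assumptions: it uses the setns
syscall as it exists in later mainline kernels, whose argument order is
(fd, nstype) and whose nstype is 0 or a CLONE_NEW* flag, which differs
from the (nstype, fd) form proposed in this series. enter_ns() is a
hypothetical helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Sketch only: join the namespace referred to by the file ns_file,
 * for example "/proc/<pid>/ns/net" or a bind mount of it.
 *
 * ASSUMPTION: this calls the setns syscall of later mainline kernels,
 * setns(fd, nstype); the series under review proposes
 * setns(nstype, fd) instead.  An nstype of 0 accepts any namespace
 * type.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
int enter_ns(const char *ns_file, int nstype)
{
	int fd, ret, saved_errno;

	fd = open(ns_file, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = (int)syscall(SYS_setns, fd, nstype);
	saved_errno = errno;

	/* The fd is only needed for the duration of the call. */
	close(fd);
	errno = saved_errno;
	return ret;
}
```

The caller needs sufficient privilege (CAP_SYS_ADMIN in the patch as
posted), so an unprivileged process should expect -1 with EPERM.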

Overall this received positive reviews on the containers list but this
needs a wider review of the ABI as this is pretty fundamental kernel
functionality.


I have left out the pid namespace bits for the moment because the pid
namespace still needs work before it is safe to unshare, and my concern
at the moment is ensuring the system calls seem reasonable.

Eric W. Biederman (8):
      ns: proc files for namespace naming policy.
      ns: Introduce the setns syscall
      ns proc: Add support for the network namespace.
      ns proc: Add support for the uts namespace
      ns proc: Add support for the ipc namespace
      ns proc: Add support for the mount namespace
      net: Allow setting the network namespace by fd
      net: Implement socketat.

---
 fs/namespace.c              |   57 +++++++++++++
 fs/proc/Makefile            |    1 +
 fs/proc/base.c              |   22 +++---
 fs/proc/inode.c             |    7 ++
 fs/proc/internal.h          |   18 ++++
 fs/proc/namespaces.c        |  193 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/if_link.h     |    1 +
 include/linux/proc_fs.h     |   20 +++++
 include/net/net_namespace.h |    1 +
 ipc/namespace.c             |   31 +++++++
 kernel/nsproxy.c            |   39 +++++++++
 kernel/utsname.c            |   32 +++++++
 net/core/net_namespace.c    |   56 +++++++++++++
 net/core/rtnetlink.c        |    4 +-
 net/socket.c                |   26 ++++++-
 15 files changed, 494 insertions(+), 14 deletions(-)



* [PATCH 1/8] ns: proc files for namespace naming policy.
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
@ 2010-09-23  8:46 ` Eric W. Biederman
  2010-09-23  8:46 ` [PATCH 2/8] ns: Introduce the setns syscall Eric W. Biederman
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Create files under /proc/<pid>/ns/ to allow controlling the
namespaces of a process.

This addresses three specific problems that can make namespaces hard to
work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the child
  of the original creator.
- Namespaces don't have names that userspace can use to talk about
  them.

The namespace files under /proc/<pid>/ns/ can be opened and the
file descriptor can be used to talk about a specific namespace, and
to keep the specified namespace alive.

A namespace can be kept alive by either holding the file descriptor
open or bind mounting the file somewhere else, for example:
mount --bind /proc/self/ns/net /some/filesystem/path
mount --bind /proc/self/fd/<N> /some/filesystem/path

This allows namespaces to be named with userspace policy.
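
The second mount command above, which gives a name to a namespace that
is only held through an open descriptor, could be expressed in C
roughly as follows. This is a sketch under assumptions: fd_proc_path()
and bind_ns_fd() are hypothetical helper names, and the target must
already exist as a regular file for the bind mount to succeed.

```c
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: format "/proc/self/fd/<fd>" into buf.
 * Returns 0 on success, or -1 if the result did not fit.
 */
int fd_proc_path(char *buf, size_t size, int fd)
{
	int n = snprintf(buf, size, "/proc/self/fd/%d", fd);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Hypothetical helper, equivalent to
 *   mount --bind /proc/self/fd/<fd> <target>
 * Returns 0 on success, -1 with errno set on failure.
 */
int bind_ns_fd(int fd, const char *target)
{
	char src[32];

	if (fd_proc_path(src, sizeof(src), fd) < 0)
		return -1;
	return mount(src, target, NULL, MS_BIND, NULL);
}
```

Once the bind mount is in place the descriptor can be closed; the mount
itself then keeps the namespace alive until it is unmounted.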

Additional support is required to make use of these file descriptors;
it will come in the following patches.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/Makefile        |    1 +
 fs/proc/base.c          |   22 +++---
 fs/proc/inode.c         |    7 ++
 fs/proc/internal.h      |   18 +++++
 fs/proc/namespaces.c    |  183 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/proc_fs.h |   16 ++++
 6 files changed, 236 insertions(+), 11 deletions(-)
 create mode 100644 fs/proc/namespaces.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 2758e2a..3cf2529 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -19,6 +19,7 @@ proc-y	+= stat.o
 proc-y	+= uptime.o
 proc-y	+= version.o
 proc-y	+= softirqs.o
+proc-y	+= namespaces.o
 proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
 proc-$(CONFIG_NET)		+= proc_net.o
 proc-$(CONFIG_PROC_KCORE)	+= kcore.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a1c43e7..30b9384 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -550,7 +550,7 @@ static int proc_fd_access_allowed(struct inode *inode)
 	return allowed;
 }
 
-static int proc_setattr(struct dentry *dentry, struct iattr *attr)
+int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
 	struct inode *inode = dentry->d_inode;
@@ -1585,8 +1585,7 @@ static int task_dumpable(struct task_struct *task)
 	return 0;
 }
 
-
-static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task)
+struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task)
 {
 	struct inode * inode;
 	struct proc_inode *ei;
@@ -1627,7 +1626,7 @@ out_unlock:
 	return NULL;
 }
 
-static int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 {
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task;
@@ -1668,7 +1667,7 @@ static int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
  * made this apply to all per process world readable and executable
  * directories.
  */
-static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
+int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = get_proc_task(inode);
@@ -1704,7 +1703,7 @@ static int pid_delete_dentry(struct dentry * dentry)
 	return !proc_pid(dentry->d_inode)->tasks[PIDTYPE_PID].first;
 }
 
-static const struct dentry_operations pid_dentry_operations =
+const struct dentry_operations pid_dentry_operations =
 {
 	.d_revalidate	= pid_revalidate,
 	.d_delete	= pid_delete_dentry,
@@ -1712,9 +1711,6 @@ static const struct dentry_operations pid_dentry_operations =
 
 /* Lookups */
 
-typedef struct dentry *instantiate_t(struct inode *, struct dentry *,
-				struct task_struct *, const void *);
-
 /*
  * Fill a directory entry.
  *
@@ -1727,8 +1723,8 @@ typedef struct dentry *instantiate_t(struct inode *, struct dentry *,
  * reported by readdir in sync with the inode numbers reported
  * by stat.
  */
-static int proc_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
-	char *name, int len,
+int proc_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	const char *name, int len,
 	instantiate_t instantiate, struct task_struct *task, const void *ptr)
 {
 	struct dentry *child, *dir = filp->f_path.dentry;
@@ -2360,6 +2356,8 @@ static const struct inode_operations proc_attr_dir_inode_operations = {
 
 #endif
 
+
+
 #ifdef CONFIG_ELF_CORE
 static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf,
 					 size_t count, loff_t *ppos)
@@ -2668,6 +2666,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
 #endif
@@ -3007,6 +3006,7 @@ out_no_task:
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 	REG("environ",   S_IRUSR, proc_environ_operations),
 	INF("auxv",      S_IRUSR, proc_pid_auxv),
 	ONE("status",    S_IRUGO, proc_pid_status),
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 9c2b5f4..1e3e720 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -28,6 +28,7 @@
 static void proc_evict_inode(struct inode *inode)
 {
 	struct proc_dir_entry *de;
+	const struct proc_ns_operations *ns_ops;
 
 	truncate_inode_pages(&inode->i_data, 0);
 	end_writeback(inode);
@@ -41,6 +42,10 @@ static void proc_evict_inode(struct inode *inode)
 		pde_put(de);
 	if (PROC_I(inode)->sysctl)
 		sysctl_head_put(PROC_I(inode)->sysctl);
+	/* Release any associated namespace */
+	ns_ops = PROC_I(inode)->ns_ops;
+	if (ns_ops && ns_ops->put)
+		ns_ops->put(PROC_I(inode)->ns);
 }
 
 struct vfsmount *proc_mnt;
@@ -61,6 +66,8 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
 	ei->pde = NULL;
 	ei->sysctl = NULL;
 	ei->sysctl_entry = NULL;
+	ei->ns = NULL;
+	ei->ns_ops = NULL;
 	inode = &ei->vfs_inode;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	return inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1f24a3e..6b61c7f 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -119,3 +119,21 @@ struct inode *proc_get_inode(struct super_block *, unsigned int, struct proc_dir
  */
 int proc_readdir(struct file *, void *, filldir_t);
 struct dentry *proc_lookup(struct inode *, struct dentry *, struct nameidata *);
+
+
+
+/* Lookups */
+typedef struct dentry *instantiate_t(struct inode *, struct dentry *,
+				struct task_struct *, const void *);
+int proc_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	const char *name, int len,
+	instantiate_t instantiate, struct task_struct *task, const void *ptr);
+int pid_revalidate(struct dentry *dentry, struct nameidata *nd);
+struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task);
+extern const struct dentry_operations pid_dentry_operations;
+int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat);
+int proc_setattr(struct dentry *dentry, struct iattr *attr);
+
+extern const struct inode_operations proc_ns_dir_inode_operations;
+extern const struct file_operations proc_ns_dir_operations;
+
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
new file mode 100644
index 0000000..f33537f
--- /dev/null
+++ b/fs/proc/namespaces.c
@@ -0,0 +1,183 @@
+#include <linux/proc_fs.h>
+#include <linux/nsproxy.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/fs_struct.h>
+#include <linux/mount.h>
+#include <linux/path.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/utsname.h>
+#include <net/net_namespace.h>
+#include <linux/mnt_namespace.h>
+#include <linux/ipc_namespace.h>
+#include <linux/pid_namespace.h>
+#include "internal.h"
+
+
+static const struct proc_ns_operations *ns_entries[] = {
+};
+
+static const struct file_operations ns_file_operations = {
+	.llseek		= no_llseek,
+};
+
+static struct dentry *proc_ns_instantiate(struct inode *dir,
+	struct dentry *dentry, struct task_struct *task, const void *ptr)
+{
+	const struct proc_ns_operations *ns_ops = ptr;
+	struct inode *inode;
+	struct proc_inode *ei;
+	struct dentry *error = ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		goto out;
+
+	ei = PROC_I(inode);
+	inode->i_mode = S_IFREG|S_IRUSR;
+	inode->i_fop  = &ns_file_operations;
+	ei->ns_ops    = ns_ops;
+	ei->ns	      = ns_ops->get(task);
+
+	dentry->d_op = &pid_dentry_operations;
+	d_add(dentry, inode);
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		error = NULL;
+out:
+	return error;
+}
+
+static int proc_ns_fill_cache(struct file *filp, void *dirent,
+	filldir_t filldir, struct task_struct *task,
+	const struct proc_ns_operations *ops)
+{
+	return proc_fill_cache(filp, dirent, filldir,
+				ops->name.name, ops->name.len,
+				proc_ns_instantiate, task, ops);
+}
+
+static int proc_ns_dir_readdir(struct file *filp, void *dirent,
+				filldir_t filldir)
+{
+	int i;
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *task = get_proc_task(inode);
+	const struct proc_ns_operations **entry, **last;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	if (!task)
+		goto out_no_task;
+
+	ret = -EPERM;
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	ret = 0;
+	i = filp->f_pos;
+	switch (i) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
+			goto out;
+		i++;
+		filp->f_pos++;
+		/* fall through */
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
+			goto out;
+		i++;
+		filp->f_pos++;
+		/* fall through */
+	default:
+		i -= 2;
+		if (i >= ARRAY_SIZE(ns_entries)) {
+			ret = 1;
+			goto out;
+		}
+		entry = ns_entries + i;
+		last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
+		while (entry <= last) {
+			if (proc_ns_fill_cache(filp, dirent, filldir,
+						task, *entry) < 0)
+				goto out;
+			filp->f_pos++;
+			entry++;
+		}
+	}
+
+	ret = 1;
+out:
+	put_task_struct(task);
+out_no_task:
+	return ret;
+}
+
+const struct file_operations proc_ns_dir_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_ns_dir_readdir,
+};
+
+static struct dentry *proc_ns_dir_lookup(struct inode *dir,
+				struct dentry *dentry, struct nameidata *nd)
+{
+	struct dentry *error;
+	struct task_struct *task = get_proc_task(dir);
+	const struct proc_ns_operations **entry, **last;
+	unsigned int len = dentry->d_name.len;
+
+	error = ERR_PTR(-ENOENT);
+
+	if (!task)
+		goto out_no_task;
+
+	error = ERR_PTR(-EPERM);
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
+	for (entry = ns_entries; entry <= last; entry++) {
+		if ((*entry)->name.len != len)
+			continue;
+		if (!memcmp(dentry->d_name.name, (*entry)->name.name, len))
+			break;
+	}
+	if (entry > last)
+		goto out;
+
+	error = proc_ns_instantiate(dir, dentry, task, *entry);
+out:
+	put_task_struct(task);
+out_no_task:
+	return error;
+}
+
+const struct inode_operations proc_ns_dir_inode_operations = {
+	.lookup		= proc_ns_dir_lookup,
+	.getattr	= pid_getattr,
+	.setattr	= proc_setattr,
+};
+
+struct file *proc_ns_fget(int fd)
+{
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+
+	if (file->f_op != &ns_file_operations)
+		goto out_invalid;
+
+	return file;
+
+out_invalid:
+	fput(file);
+	return ERR_PTR(-EINVAL);
+}
+
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 379eaed..a6c26f0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -250,6 +250,20 @@ kclist_add(struct kcore_list *new, void *addr, size_t size, int type)
 extern void kclist_add(struct kcore_list *, void *, size_t, int type);
 #endif
 
+struct nsproxy;
+struct proc_ns_operations {
+	struct {
+		unsigned int len;
+		const char *name;
+	} name;
+	unsigned int name_len;
+	void *(*get)(struct task_struct *task);
+	void (*put)(void *ns);
+	int (*install)(struct nsproxy *nsproxy, void *ns);
+};
+#define PROC_NSNAME(NAME) { .name = (NAME), .len = (sizeof(NAME) - 1), }
+extern struct file *proc_ns_fget(int fd);
+
 union proc_op {
 	int (*proc_get_link)(struct inode *, struct path *);
 	int (*proc_read)(struct task_struct *task, char *page);
@@ -268,6 +282,8 @@ struct proc_inode {
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
 	struct ctl_table *sysctl_entry;
+	void *ns;
+	const struct proc_ns_operations *ns_ops;
 	struct inode vfs_inode;
 };
 
-- 
1.6.5.2.143.g8cc62



* [PATCH 2/8] ns: Introduce the setns syscall
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
  2010-09-23  8:46 ` [PATCH 1/8] ns: proc files for namespace naming policy Eric W. Biederman
@ 2010-09-23  8:46 ` Eric W. Biederman
  2010-09-23  8:47 ` [PATCH 3/8] ns proc: Add support for the network namespace Eric W. Biederman
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


With the networking stack today there is demand to handle
multiple network stacks at a time.  Not in the context
of containers but in the context of people doing interesting
things with routing.

There is also demand in the context of containers to have
an efficient way to execute some code in the container itself.
If nothing else it is very useful as a debugging technique.

Both problems can be solved by starting some form of login
daemon in the namespaces people want access to, or you
can play games by ptracing a process and getting the
traced process to do things you want it to do. However
it turns out that a login daemon or a ptrace puppet
controller are more code, they are more prone to
failure, and generally they are less efficient than
simply changing the namespace of a process to a
specified one.

Pieces of this puzzle can also be solved by coming up with
targeted system calls, perhaps socketat, that each solve
a subset of the larger problem, instead of coming up with
a general purpose system call.  Overall that appears
to be more work for less reward.

int setns(unsigned int nstype, int fd);

In the setns system call the nstype is either 0 or specifies
the name of the namespace you think you are changing,
to prevent changing a namespace unintentionally.

The fd argument is a file descriptor referring to a proc
file of the namespace you want to switch the process to.
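
One way to read the nstype check in the code below is that the
namespace name is packed into the leading bytes of the integer and
compared with memcmp(). The userspace sketch that follows mirrors that
reading; pack_nstype() and nstype_ok() are hypothetical names, and how
callers are really expected to build nstype is an assumption here.

```c
#include <string.h>

/*
 * Hypothetical helper: pack a short namespace name such as "net"
 * into the leading bytes of *out, leaving the remaining bytes zero.
 * Returns 0 on success, or -1 if the name does not fit with at
 * least one spare byte.
 */
int pack_nstype(const char *name, unsigned int *out)
{
	size_t len = strlen(name);

	if (len >= sizeof(*out))
		return -1;
	*out = 0;
	memcpy(out, name, len);
	return 0;
}

/*
 * Userspace mirror of the check in the proposed setns(): an nstype
 * of 0 accepts anything; otherwise the first len bytes of nstype
 * must equal the namespace name.  Returns 1 if accepted, 0 if not.
 */
int nstype_ok(unsigned int nstype, const char *name, size_t len)
{
	if (nstype == 0)
		return 1;
	if (len >= sizeof(nstype))
		return 0;
	return memcmp(&nstype, name, len) == 0;
}
```

With a 4-byte unsigned int this scheme only covers names of up to
three characters, which holds for "net", "uts", "ipc" and "mnt".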

v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/nsproxy.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..0bf2dba 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -22,6 +22,9 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/proc_fs.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -233,6 +236,42 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+SYSCALL_DEFINE2(setns, unsigned int, nstype, int, fd)
+{
+	const struct proc_ns_operations *ops;
+	struct task_struct *tsk = current;
+	struct nsproxy *new_nsproxy;
+	struct proc_inode *ei;
+	struct file *file;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	err = -EINVAL;
+	ei = PROC_I(file->f_dentry->d_inode);
+	ops = ei->ns_ops;
+	if (nstype &&
+	    ((ops->name.len >= sizeof(nstype)) ||
+	    memcmp(&nstype, ops->name.name, ops->name.len)))
+		goto out;
+
+	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+	err = ops->install(new_nsproxy, ei->ns);
+	if (err) {
+		free_nsproxy(new_nsproxy);
+		goto out;
+	}
+	switch_task_namespaces(tsk, new_nsproxy);
+out:
+	fput(file);
+	return err;
+}
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
-- 
1.6.5.2.143.g8cc62



* [PATCH 3/8] ns proc: Add support for the network namespace.
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
  2010-09-23  8:46 ` [PATCH 1/8] ns: proc files for namespace naming policy Eric W. Biederman
  2010-09-23  8:46 ` [PATCH 2/8] ns: Introduce the setns syscall Eric W. Biederman
@ 2010-09-23  8:47 ` Eric W. Biederman
  2010-09-23 11:27   ` Louis Rilling
  2010-09-23  8:48 ` [PATCH 4/8] ns proc: Add support for the uts namespace Eric W. Biederman
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Implementing file descriptors for the network namespace is simple and
straightforward.
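
The get/put/install pattern that each namespace type plugs into can be
modelled in plain C with toy types. The sketch below is only an
illustration of the reference counting involved: struct toy_ns,
struct toy_nsproxy and the toy_* functions are hypothetical stand-ins,
not kernel code.

```c
#include <stddef.h>

/* Toy stand-ins for a namespace and for struct nsproxy. */
struct toy_ns { int refs; };
struct toy_nsproxy { struct toy_ns *ns; };

/* Same shape as the get/put/install hooks, using the toy types. */
struct toy_ns_ops {
	const char *name;
	void *(*get)(struct toy_nsproxy *from);
	void (*put)(void *ns);
	int (*install)(struct toy_nsproxy *into, void *ns);
};

/* Take a reference on behalf of the proc inode. */
static void *toy_get(struct toy_nsproxy *from)
{
	from->ns->refs++;
	return from->ns;
}

/* Drop the reference held by the proc inode. */
static void toy_put(void *ns)
{
	((struct toy_ns *)ns)->refs--;
}

/* Swap the namespace in a proxy, fixing up both reference counts. */
static int toy_install(struct toy_nsproxy *into, void *ns)
{
	struct toy_ns *new_ns = ns;

	new_ns->refs++;		/* take the new reference first */
	into->ns->refs--;	/* then drop the one being replaced */
	into->ns = new_ns;
	return 0;
}

const struct toy_ns_ops toy_ops = {
	.name		= "net",
	.get		= toy_get,
	.put		= toy_put,
	.install	= toy_install,
};
```

Taking the new reference before dropping the old one keeps the sketch
safe even when both pointers refer to the same object.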

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/namespaces.c     |    3 +++
 include/linux/proc_fs.h  |    1 +
 net/core/net_namespace.c |   30 ++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f33537f..31e32f3 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -16,6 +16,9 @@
 
 
 static const struct proc_ns_operations *ns_entries[] = {
+#ifdef CONFIG_NET_NS
+	&netns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index a6c26f0..9cd3fae 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -262,6 +262,7 @@ struct proc_ns_operations {
 	int (*install)(struct nsproxy *nsproxy, void *ns);
 };
 #define PROC_NSNAME(NAME) { .name = (NAME), .len = (sizeof(NAME) - 1), }
+extern const struct proc_ns_operations netns_operations;
 extern struct file *proc_ns_fget(int fd);
 
 union proc_op {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c988e68..581a088 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -571,3 +571,33 @@ void unregister_pernet_device(struct pernet_operations *ops)
 	mutex_unlock(&net_mutex);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_device);
+
+#ifdef CONFIG_NET_NS
+static void *netns_get(struct task_struct *task)
+{
+	struct net *net;
+	rcu_read_lock();
+	net = get_net(task->nsproxy->net_ns);
+	rcu_read_unlock();
+	return net;
+}
+
+static void netns_put(void *ns)
+{
+	put_net(ns);
+}
+
+static int netns_install(struct nsproxy *nsproxy, void *ns)
+{
+	put_net(nsproxy->net_ns);
+	nsproxy->net_ns = get_net(ns);
+	return 0;
+}
+
+const struct proc_ns_operations netns_operations = {
+	.name		= PROC_NSNAME("net"),
+	.get		= netns_get,
+	.put		= netns_put,
+	.install	= netns_install,
+};
+#endif
-- 
1.6.5.2.143.g8cc62



* [PATCH 4/8] ns proc: Add support for the uts namespace
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (2 preceding siblings ...)
  2010-09-23  8:47 ` [PATCH 3/8] ns proc: Add support for the network namespace Eric W. Biederman
@ 2010-09-23  8:48 ` Eric W. Biederman
  2010-09-23  8:49 ` [PATCH 5/8] ns proc: Add support for the ipc namespace Eric W. Biederman
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/namespaces.c    |    3 +++
 include/linux/proc_fs.h |    1 +
 kernel/utsname.c        |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 31e32f3..902443e 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -19,6 +19,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_NET_NS
 	&netns_operations,
 #endif
+#ifdef CONFIG_UTS_NS
+	&utsns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 9cd3fae..28b4ffd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -263,6 +263,7 @@ struct proc_ns_operations {
 };
 #define PROC_NSNAME(NAME) { .name = (NAME), .len = (sizeof(NAME) - 1), }
 extern const struct proc_ns_operations netns_operations;
+extern const struct proc_ns_operations utsns_operations;
 extern struct file *proc_ns_fget(int fd);
 
 union proc_op {
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..ff06086 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
 #include <linux/utsname.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/proc_fs.h>
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -73,3 +74,34 @@ void free_uts_ns(struct kref *kref)
 	ns = container_of(kref, struct uts_namespace, kref);
 	kfree(ns);
 }
+
+static void *utsns_get(struct task_struct *task)
+{
+	struct uts_namespace *ns;
+	rcu_read_lock();
+	ns = task->nsproxy->uts_ns;
+	get_uts_ns(ns);
+	rcu_read_unlock();
+	return ns;
+}
+
+static void utsns_put(void *ns)
+{
+	put_uts_ns(ns);
+}
+
+static int utsns_install(struct nsproxy *nsproxy, void *ns)
+{
+	get_uts_ns(ns);
+	put_uts_ns(nsproxy->uts_ns);
+	nsproxy->uts_ns = ns;
+	return 0;
+}
+
+const struct proc_ns_operations utsns_operations = {
+	.name		= PROC_NSNAME("uts"),
+	.get		= utsns_get,
+	.put		= utsns_put,
+	.install	= utsns_install,
+};
+
-- 
1.6.5.2.143.g8cc62



* [PATCH 5/8] ns proc: Add support for the ipc namespace
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (3 preceding siblings ...)
  2010-09-23  8:48 ` [PATCH 4/8] ns proc: Add support for the uts namespace Eric W. Biederman
@ 2010-09-23  8:49 ` Eric W. Biederman
  2010-09-23  8:50 ` [PATCH 6/8] ns proc: Add support for the mount namespace Eric W. Biederman
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/namespaces.c    |    3 +++
 include/linux/proc_fs.h |    1 +
 ipc/namespace.c         |   31 +++++++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 902443e..2f503b5 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -22,6 +22,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_UTS_NS
 	&utsns_operations,
 #endif
+#ifdef CONFIG_IPC_NS
+	&ipcns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 28b4ffd..9a9ef31 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -264,6 +264,7 @@ struct proc_ns_operations {
 #define PROC_NSNAME(NAME) { .name = (NAME), .len = (sizeof(NAME) - 1), }
 extern const struct proc_ns_operations netns_operations;
 extern const struct proc_ns_operations utsns_operations;
+extern const struct proc_ns_operations ipcns_operations;
 extern struct file *proc_ns_fget(int fd);
 
 union proc_op {
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..2c5947f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -11,6 +11,7 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
+#include <linux/proc_fs.h>
 
 #include "util.h"
 
@@ -132,3 +133,33 @@ void put_ipc_ns(struct ipc_namespace *ns)
 		free_ipc_ns(ns);
 	}
 }
+
+static void *ipcns_get(struct task_struct *task)
+{
+	struct ipc_namespace *ns;
+	rcu_read_lock();
+	ns = get_ipc_ns(task->nsproxy->ipc_ns);
+	rcu_read_unlock();
+	return ns;
+}
+
+static void ipcns_put(void *ns)
+{
+	put_ipc_ns(ns);
+}
+
+static int ipcns_install(struct nsproxy *nsproxy, void *ns)
+{
+	/* Ditch state from the old ipc namespace */
+	exit_sem(current);
+	put_ipc_ns(nsproxy->ipc_ns);
+	nsproxy->ipc_ns = get_ipc_ns(ns);
+	return 0;
+}
+
+const struct proc_ns_operations ipcns_operations = {
+	.name		= PROC_NSNAME("ipc"),
+	.get		= ipcns_get,
+	.put		= ipcns_put,
+	.install	= ipcns_install,
+};
-- 
1.6.5.2.143.g8cc62



* [PATCH 6/8] ns proc: Add support for the mount namespace
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (4 preceding siblings ...)
  2010-09-23  8:49 ` [PATCH 5/8] ns proc: Add support for the ipc namespace Eric W. Biederman
@ 2010-09-23  8:50 ` Eric W. Biederman
  2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


The mount namespace is a little tricky as an arbitrary
decision must be made about what to set fs->root and
fs->pwd to, as there is no expectation of a relationship
between the two mount namespaces.

Therefore I arbitrarily find the root mount point, and follow
every mount on top of it to find the top of the mount stack.
Then I set fs->root and fs->pwd to that location.

The topmost root of the mount stack seems like a reasonable place to be.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/namespace.c          |   57 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/proc/namespaces.c    |    1 +
 include/linux/proc_fs.h |    1 +
 3 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a72eaab..ed11bac 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -32,6 +32,7 @@
 #include <linux/idr.h>
 #include <linux/fs_struct.h>
 #include <linux/fsnotify.h>
+#include <linux/proc_fs.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include "pnode.h"
@@ -2418,3 +2419,59 @@ void put_mnt_ns(struct mnt_namespace *ns)
 	kfree(ns);
 }
 EXPORT_SYMBOL(put_mnt_ns);
+
+
+static void *mntns_get(struct task_struct *task)
+{
+	struct mnt_namespace *ns;
+	rcu_read_lock();
+	ns = task->nsproxy->mnt_ns;
+	get_mnt_ns(ns);
+	rcu_read_unlock();
+	return ns;
+}
+
+static void mntns_put(void *ns)
+{
+	put_mnt_ns(ns);
+}
+
+static int mntns_install(struct nsproxy *nsproxy, void *ns)
+{
+	struct fs_struct *fs = current->fs;
+	struct mnt_namespace *mnt_ns = ns;
+	struct path root;
+
+	if (fs->users != 1)
+		return -EINVAL;
+
+	get_mnt_ns(mnt_ns);
+	put_mnt_ns(nsproxy->mnt_ns);
+	nsproxy->mnt_ns = mnt_ns;
+
+	/* Find the root */
+	root.mnt    = mnt_ns->root;
+	root.dentry = mnt_ns->root->mnt_root;
+	path_get(&root);
+	while(d_mountpoint(root.dentry) && follow_down(&root))
+		;
+
+	/* Update the pwd and root */
+	path_get(&root);
+	path_get(&root);
+	path_put(&fs->root);
+	path_put(&fs->pwd);
+	fs->root = root;
+	fs->pwd  = root;
+	path_put(&root);
+
+	return 0;
+}
+
+const struct proc_ns_operations mntns_operations = {
+	.name		= PROC_NSNAME("mnt"),
+	.get		= mntns_get,
+	.put		= mntns_put,
+	.install	= mntns_install,
+};
+
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 2f503b5..c5956ad 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -25,6 +25,7 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_IPC_NS
 	&ipcns_operations,
 #endif
+	&mntns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 9a9ef31..8a260b0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -265,6 +265,7 @@ struct proc_ns_operations {
 extern const struct proc_ns_operations netns_operations;
 extern const struct proc_ns_operations utsns_operations;
 extern const struct proc_ns_operations ipcns_operations;
+extern const struct proc_ns_operations mntns_operations;
 extern struct file *proc_ns_fget(int fd);
 
 union proc_op {
-- 
1.6.5.2.143.g8cc62


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (5 preceding siblings ...)
  2010-09-23  8:50 ` [PATCH 6/8] ns proc: Add support for the mount namespace Eric W. Biederman
@ 2010-09-23  8:51 ` Eric W. Biederman
  2010-09-23  9:41   ` Eric Dumazet
                     ` (3 more replies)
  2010-09-23  8:51 ` [PATCH 8/8] net: Implement socketat Eric W. Biederman
                   ` (2 subsequent siblings)
  9 siblings, 4 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Take advantage of the new abstraction and allow network devices
to be placed in any network namespace that we have a fd to talk
about.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/if_link.h     |    1 +
 include/net/net_namespace.h |    1 +
 net/core/net_namespace.c    |   26 ++++++++++++++++++++++++++
 net/core/rtnetlink.c        |    4 +++-
 4 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 2fc66dd..ae73d5e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -116,6 +116,7 @@ enum {
 	IFLA_STATS64,
 	IFLA_VF_PORTS,
 	IFLA_PORT_SELF,
+	IFLA_NET_NS_FD,
 	__IFLA_MAX
 };
 
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index bd10a79..68672ce 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -114,6 +114,7 @@ static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
 extern struct list_head net_namespace_list;
 
 extern struct net *get_net_ns_by_pid(pid_t pid);
+extern struct net *get_net_ns_by_fd(int pid);
 
 #ifdef CONFIG_NET_NS
 extern void __put_net(struct net *net);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 581a088..a9b54a7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -8,6 +8,8 @@
 #include <linux/idr.h>
 #include <linux/rculist.h>
 #include <linux/nsproxy.h>
+#include <linux/proc_fs.h>
+#include <linux/file.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -341,6 +343,30 @@ struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+struct net *get_net_ns_by_fd(int fd)
+{
+	struct proc_inode *ei;
+	struct file *file;
+	struct net *net;
+
+	file = NULL;
+	net = ERR_PTR(-EINVAL);
+	file = proc_ns_fget(fd);
+	if (!fd)
+		goto out;
+		return ERR_PTR(-EINVAL);
+
+	ei = PROC_I(file->f_dentry->d_inode);
+	if (ei->ns_ops != &netns_operations)
+		goto out;
+
+	net = get_net(ei->ns);
+out:
+	if (file)
+		fput(file);
+	return net;
+}
+
 static int __init net_ns_init(void)
 {
 	struct net_generic *ng;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f78d821..771d8be 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1003,6 +1003,8 @@ struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[])
 	 */
 	if (tb[IFLA_NET_NS_PID])
 		net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
+	else if (tb[IFLA_NET_NS_FD])
+		net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
 	else
 		net = get_net(src_net);
 	return net;
@@ -1077,7 +1079,7 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 	int send_addr_notify = 0;
 	int err;
 
-	if (tb[IFLA_NET_NS_PID]) {
+	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]) {
 		struct net *net = rtnl_link_get_net(dev_net(dev), tb);
 		if (IS_ERR(net)) {
 			err = PTR_ERR(net);
-- 
1.6.5.2.143.g8cc62


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 8/8] net: Implement socketat.
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (6 preceding siblings ...)
  2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
@ 2010-09-23  8:51 ` Eric W. Biederman
  2010-09-23  8:56   ` Pavel Emelyanov
  2010-09-23 15:18 ` [ABI REVIEW][PATCH 0/8] Namespace file descriptors David Lamparter
  2010-09-24 13:02 ` Andrew Lutomirski
  9 siblings, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23  8:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy


Add a system call for creating sockets in a specified network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 net/socket.c |   26 ++++++++++++++++++++++++--
 1 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 2270b94..1116f3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1269,7 +1269,7 @@ int sock_create_kern(int family, int type, int protocol, struct socket **res)
 }
 EXPORT_SYMBOL(sock_create_kern);
 
-SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
+static int do_socket(struct net *net, int family, int type, int protocol)
 {
 	int retval;
 	struct socket *sock;
@@ -1289,7 +1289,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
 		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 
-	retval = sock_create(family, type, protocol, &sock);
+	retval = __sock_create(net, family, type, protocol, &sock, 0);
 	if (retval < 0)
 		goto out;
 
@@ -1306,6 +1306,28 @@ out_release:
 	return retval;
 }
 
+SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
+{
+	return do_socket(current->nsproxy->net_ns, family, type, protocol);
+}
+
+SYSCALL_DEFINE4(socketat, int, fd, int, family, int, type, int, protocol)
+{
+	struct net *net;
+	int retval;
+
+	if (fd == -1) {
+		net = get_net(current->nsproxy->net_ns);
+	} else {
+		net = get_net_ns_by_fd(fd);
+		if (IS_ERR(net))
+			return  PTR_ERR(net);
+	}
+	retval = do_socket(net, family, type, protocol);
+	put_net(net);
+	return retval;
+}
+
 /*
  *	Create a pair of connected sockets.
  */
-- 
1.6.5.2.143.g8cc62


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23  8:51 ` [PATCH 8/8] net: Implement socketat Eric W. Biederman
@ 2010-09-23  8:56   ` Pavel Emelyanov
  2010-09-23 11:19     ` jamal
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2010-09-23  8:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
> 
> Add a system call for creating sockets in a specified network namespace.

What for?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
@ 2010-09-23  9:41   ` Eric Dumazet
  2010-09-23 16:03     ` Eric W. Biederman
  2010-09-23 11:22   ` jamal
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 46+ messages in thread
From: Eric Dumazet @ 2010-09-23  9:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On Thursday, 23 September 2010 at 01:51 -0700, Eric W. Biederman wrote:
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  include/linux/if_link.h     |    1 +
>  include/net/net_namespace.h |    1 +
>  net/core/net_namespace.c    |   26 ++++++++++++++++++++++++++
>  net/core/rtnetlink.c        |    4 +++-
>  4 files changed, 31 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
> index 2fc66dd..ae73d5e 100644
> --- a/include/linux/if_link.h
> +++ b/include/linux/if_link.h
> @@ -116,6 +116,7 @@ enum {
>  	IFLA_STATS64,
>  	IFLA_VF_PORTS,
>  	IFLA_PORT_SELF,
> +	IFLA_NET_NS_FD,
>  	__IFLA_MAX
>  };
>  
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index bd10a79..68672ce 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -114,6 +114,7 @@ static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
>  extern struct list_head net_namespace_list;
>  
>  extern struct net *get_net_ns_by_pid(pid_t pid);
> +extern struct net *get_net_ns_by_fd(int pid);
>  
>  #ifdef CONFIG_NET_NS
>  extern void __put_net(struct net *net);
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 581a088..a9b54a7 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -8,6 +8,8 @@
>  #include <linux/idr.h>
>  #include <linux/rculist.h>
>  #include <linux/nsproxy.h>
> +#include <linux/proc_fs.h>
> +#include <linux/file.h>
>  #include <net/net_namespace.h>
>  #include <net/netns/generic.h>
>  
> @@ -341,6 +343,30 @@ struct net *get_net_ns_by_pid(pid_t pid)
>  }
>  EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
>  
> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	file = NULL;
> +	net = ERR_PTR(-EINVAL);
> +	file = proc_ns_fget(fd);
> +	if (!fd)
> +		goto out;
> +		return ERR_PTR(-EINVAL);
> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops != &netns_operations)
> +		goto out;
> +
> +	net = get_net(ei->ns);
> +out:
> +	if (file)
> +		fput(file);
> +	return net;
> +}
> +
>  static int __init net_ns_init(void)
>  {
>  	struct net_generic *ng;
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index f78d821..771d8be 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1003,6 +1003,8 @@ struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[])
>  	 */
>  	if (tb[IFLA_NET_NS_PID])
>  		net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
> +	else if (tb[IFLA_NET_NS_FD])
> +		net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
>  	else
>  		net = get_net(src_net);
>  	return net;
> @@ -1077,7 +1079,7 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
>  	int send_addr_notify = 0;
>  	int err;
>  
> -	if (tb[IFLA_NET_NS_PID]) {
> +	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]) {
>  		struct net *net = rtnl_link_get_net(dev_net(dev), tb);
>  		if (IS_ERR(net)) {
>  			err = PTR_ERR(net);

You probably want to add the following chunk:

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b2a718d..35bb6de 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -927,6 +927,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_LINKMODE]		= { .type = NLA_U8 },
 	[IFLA_LINKINFO]		= { .type = NLA_NESTED },
 	[IFLA_NET_NS_PID]	= { .type = NLA_U32 },
+	[IFLA_NET_NS_FD]	= { .type = NLA_U32 },
 	[IFLA_IFALIAS]	        = { .type = NLA_STRING, .len = IFALIASZ-1 },
 	[IFLA_VFINFO_LIST]	= {. type = NLA_NESTED },
 	[IFLA_VF_PORTS]		= { .type = NLA_NESTED },



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23  8:56   ` Pavel Emelyanov
@ 2010-09-23 11:19     ` jamal
  2010-09-23 11:33       ` Pavel Emelyanov
  0 siblings, 1 reply; 46+ messages in thread
From: jamal @ 2010-09-23 11:19 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On Thu, 2010-09-23 at 12:56 +0400, Pavel Emelyanov wrote:
> On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
> > 
> > Add a system call for creating sockets in a specified network namespace.
> 
> What for?

I can see many uses if my understanding is correct..
ex, from mother namespace:
fdx = open socket at namespace blah
from mother namespace, read/write/poll fdx 
(eg add route with netlink socket)

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
  2010-09-23  9:41   ` Eric Dumazet
@ 2010-09-23 11:22   ` jamal
  2010-09-23 14:58     ` David Lamparter
  2010-09-23 15:14     ` Eric W. Biederman
  2010-09-23 14:22   ` Brian Haley
  2010-09-24 13:46   ` Daniel Lezcano
  3 siblings, 2 replies; 46+ messages in thread
From: jamal @ 2010-09-23 11:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, Daniel Lezcano, Linus Torvalds, Michael Kerrisk,
	Ulrich Drepper, Al Viro, David Miller, Serge E. Hallyn,
	Pavel Emelyanov, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
> 

So ... why just netdevice? could you allow migration of other
net "items" eg a route table since they are all tagged by
netns?

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] ns proc: Add support for the network namespace.
  2010-09-23  8:47 ` [PATCH 3/8] ns proc: Add support for the network namespace Eric W. Biederman
@ 2010-09-23 11:27   ` Louis Rilling
  2010-09-23 16:00     ` Eric W. Biederman
  0 siblings, 1 reply; 46+ messages in thread
From: Louis Rilling @ 2010-09-23 11:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy


On 23/09/10  1:47 -0700, Eric W. Biederman wrote:
> 
> Implementing file descriptors for the network namespace is simple and
> straight forward.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

[...]

> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index c988e68..581a088 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -571,3 +571,33 @@ void unregister_pernet_device(struct pernet_operations *ops)
>  	mutex_unlock(&net_mutex);
>  }
>  EXPORT_SYMBOL_GPL(unregister_pernet_device);
> +
> +#ifdef CONFIG_NET_NS
> +static void *netns_get(struct task_struct *task)
> +{
> +	struct net *net;
> +	rcu_read_lock();
> +	net = get_net(task->nsproxy->net_ns);

task could be exiting, so task->nsproxy could be NULL, right?
Maybe make proc_ns_instantiate() rcu_dereference task->nsproxy, check for it
being not NULL, and pass task->nsproxy to ns_ops->get()?

That could be an issue for the user namespace since it is not in nsproxy, but
maybe no reasonable usage of ns_ops with user namespaces is envisioned.
Otherwise, checking that task is alive with RCU locked in
proc_ns_instantiate() should be enough to rely on task->cred when
calling ns_ops->get().

Thanks,

Louis

> +	rcu_read_unlock();
> +	return net;
> +}
> +
> +static void netns_put(void *ns)
> +{
> +	put_net(ns);
> +}
> +
> +static int netns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +	put_net(nsproxy->net_ns);
> +	nsproxy->net_ns = get_net(ns);
> +	return 0;
> +}
> +
> +const struct proc_ns_operations netns_operations = {
> +	.name		= PROC_NSNAME("net"),
> +	.get		= netns_get,
> +	.put		= netns_put,
> +	.install	= netns_install,
> +};
> +#endif
> -- 
> 1.6.5.2.143.g8cc62
> 

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 11:19     ` jamal
@ 2010-09-23 11:33       ` Pavel Emelyanov
  2010-09-23 11:40         ` jamal
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2010-09-23 11:33 UTC (permalink / raw)
  To: hadi
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On 09/23/2010 03:19 PM, jamal wrote:
> On Thu, 2010-09-23 at 12:56 +0400, Pavel Emelyanov wrote:
>> On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
>>>
>>> Add a system call for creating sockets in a specified network namespace.
>>
>> What for?
> 
> I can see many uses if my understanding is correct..
> ex, from mother namespace:
> fdx = open socket at namespace blah
> from mother namespace, read/write/poll fdx 
> (eg add route with netlink socket)

This particular usecase is unneeded once you have the "enter" ability.

> cheers,
> jamal
> 
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 11:33       ` Pavel Emelyanov
@ 2010-09-23 11:40         ` jamal
  2010-09-23 11:53           ` Pavel Emelyanov
  0 siblings, 1 reply; 46+ messages in thread
From: jamal @ 2010-09-23 11:40 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:

> This particular usecase is unneeded once you have the "enter" ability.

Is that cheaper from a syscall count/cost?
i.e do I have to enter every time i want to write/read this fd?
How does poll/select work in that enter scenario?

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 11:40         ` jamal
@ 2010-09-23 11:53           ` Pavel Emelyanov
  2010-09-23 12:11             ` jamal
  2010-10-02 21:13             ` Daniel Lezcano
  0 siblings, 2 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2010-09-23 11:53 UTC (permalink / raw)
  To: hadi
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On 09/23/2010 03:40 PM, jamal wrote:
> On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:
> 
>> This particular usecase is unneeded once you have the "enter" ability.
> 
> Is that cheaper from a syscall count/cost?

Why does it matter? You said that the usage scenario was to
add routes to a container. If I do 2 syscalls instead of 1, is
it THAT much worse?

> i.e do I have to enter every time i want to write/read this fd?

No - you enter once, create a socket and do whatever you need
within the entered namespace.

> How does poll/select work in that enter scenario?

Just like it used to before the enter.

> cheers,
> jamal
> 
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 11:53           ` Pavel Emelyanov
@ 2010-09-23 12:11             ` jamal
  2010-09-23 12:34               ` Pavel Emelyanov
  2010-10-02 21:13             ` Daniel Lezcano
  1 sibling, 1 reply; 46+ messages in thread
From: jamal @ 2010-09-23 12:11 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:

> Why does it matter? You told, that the usage scenario was to
> add routes to container. If I do 2 syscalls instead of 1, is
> it THAT worse?
> 

Anything to do with socket IO that requires namespace awareness
applies for usage; it could be a tcp/udp/etc socket. If it doesn't
make any difference performance-wise using one scheme vs the other
to write/read heavy messages, then I don't see an issue and socketat
is redundant.

If I were to pick blindly, I would say whatever approach has
fewer syscalls is better, even if it is just a "slow" path one-time
thing. I could create a scenario which would make it bad
to have more syscalls.

But there's also the simplicity aspect in doing:
fdx = socketat namespace foo
use fdx for read/write/poll into foo without any wrapper code.
Vs
enter foo
fdx = socket ..
read/write fdx
leave foo.

> Just like it used to before the enter.
> 

So if i enter foo, get a fdx, leave foo i can use it in
ns0 as if it was in ns0?

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 12:11             ` jamal
@ 2010-09-23 12:34               ` Pavel Emelyanov
  2010-09-23 14:54                 ` David Lamparter
  2010-09-23 15:00                 ` Eric W. Biederman
  0 siblings, 2 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2010-09-23 12:34 UTC (permalink / raw)
  To: hadi, Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, Daniel Lezcano, Linus Torvalds, Michael Kerrisk,
	Ulrich Drepper, Al Viro, David Miller, Serge E. Hallyn,
	Ben Greear, Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On 09/23/2010 04:11 PM, jamal wrote:
> On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:
> 
>> Why does it matter? You told, that the usage scenario was to
>> add routes to container. If I do 2 syscalls instead of 1, is
>> it THAT worse?
>>
> 
> Anything to do with socket IO that requires namespace awareness
> applies for usage; it could be tcp/udp/etc socket. If it doesnt
> make any difference performance wise using one scheme vs other
> to write/read heavy messages then i dont see an issue and socketat
> is redundant.

That's my point: unless we know why we would need it,
we don't need it.

Eric, please clarify: what is the need for creating a socket in a foreign
net namespace?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
  2010-09-23  9:41   ` Eric Dumazet
  2010-09-23 11:22   ` jamal
@ 2010-09-23 14:22   ` Brian Haley
  2010-09-23 16:16     ` Eric W. Biederman
  2010-09-24 13:46   ` Daniel Lezcano
  3 siblings, 1 reply; 46+ messages in thread
From: Brian Haley @ 2010-09-23 14:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On 09/23/2010 04:51 AM, Eric W. Biederman wrote:
> 
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
> 
...
> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	file = NULL;

No need to initialize this.

> +	net = ERR_PTR(-EINVAL);

or this?

> +	file = proc_ns_fget(fd);
> +	if (!fd)
> +		goto out;
> +		return ERR_PTR(-EINVAL);

Shouldn't this be:

	if (!file)

And the "goto" seems wrong, especially without a {} here.  Unless you
meant to keep the "goto" and branch below?

-Brian

> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops != &netns_operations)
> +		goto out;
> +
> +	net = get_net(ei->ns);
> +out:
> +	if (file)
> +		fput(file);
> +	return net;
> +}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 12:34               ` Pavel Emelyanov
@ 2010-09-23 14:54                 ` David Lamparter
  2010-09-23 15:00                 ` Eric W. Biederman
  1 sibling, 0 replies; 46+ messages in thread
From: David Lamparter @ 2010-09-23 14:54 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy

On Thu, Sep 23, 2010 at 04:34:37PM +0400, Pavel Emelyanov wrote:
> On 09/23/2010 04:11 PM, jamal wrote:
> > On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:
> > 
> >> Why does it matter? You told, that the usage scenario was to
> >> add routes to container. If I do 2 syscalls instead of 1, is
> >> it THAT worse?
> >>
> > 
> > Anything to do with socket IO that requires namespace awareness
> > applies for usage; it could be tcp/udp/etc socket. If it doesnt
> > make any difference performance wise using one scheme vs other
> > to write/read heavy messages then i dont see an issue and socketat
> > is redundant.
> 
> That's what my point is about - unless we know why would we need it
> we don't need it.
> 
> Eric, please clarify, what is the need in creating a socket in foreign
> net namespace?

Hmm. If you somewhere get the fd to a socket from another namespace, it
definitely does work (I'm currently implementing my "socketat" with fd
passing through AF_UNIX sockets, so i know it works), so the

  setns(other...)
  fd = socket(...)
  setns(orig...)

sequence would certainly work. However, other things might happen
in between, like a signal (imagine AIO in particular). While
signals are user-controllable (and therefore to be managed/excluded by
the user), we need to think about whether there are other problems with
doing this as a sequence.

If there are no other problematic conditions with this, socketat should
probably be moved to a user library.
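
For illustration only, the setns/socket/setns sequence with signals held
off in between might be sketched in user space roughly as below. Several
things here are assumptions rather than part of the posted series: the
name socketat_user() is hypothetical, the namespace is entered through the
raw setns syscall using the (fd, nstype) argument order that Linux
eventually adopted (the series proposes setns(nstype, fd)), and the
original namespace is reopened via /proc/self/ns/net. As in the patch,
nsfd == -1 means the caller's own namespace.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000	/* network namespace flag */
#endif

/* Raw syscall, so the sketch also builds where libc has no setns() wrapper. */
static int enter_netns(int nsfd)
{
	return (int)syscall(SYS_setns, nsfd, CLONE_NEWNET);
}

/*
 * Hypothetical user-space socketat(): create a socket in the network
 * namespace referred to by nsfd, then switch back to the original one.
 * Returns the new socket, or -1 with errno set.
 */
int socketat_user(int nsfd, int family, int type, int protocol)
{
	sigset_t all, old;
	int origfd, sock = -1, saved_errno = 0;

	if (nsfd == -1)
		return socket(family, type, protocol);

	/* Hold a reference to the current namespace so we can return to it. */
	origfd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (origfd < 0)
		return -1;

	/*
	 * Keep signal handlers from running while we are in the foreign
	 * namespace. On Linux sigprocmask() affects the calling thread;
	 * a threaded program would normally use pthread_sigmask().
	 */
	sigfillset(&all);
	sigprocmask(SIG_BLOCK, &all, &old);

	if (enter_netns(nsfd) < 0) {
		saved_errno = errno;
		goto restore;
	}

	sock = socket(family, type, protocol);
	saved_errno = errno;

	/* If switching back fails we are stuck in the wrong namespace. */
	if (enter_netns(origfd) < 0) {
		saved_errno = errno;
		if (sock >= 0)
			close(sock);
		sock = -1;
	}

restore:
	sigprocmask(SIG_SETMASK, &old, NULL);
	close(origfd);
	if (sock < 0)
		errno = saved_errno;
	return sock;
}
```

Blocking signals only addresses the handler case mentioned above; other
threads sharing the same fs or file table are not covered by this sketch,
and a failure to switch back leaves the caller in the foreign namespace.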


-David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23 11:22   ` jamal
@ 2010-09-23 14:58     ` David Lamparter
  2010-09-24 11:51       ` jamal
  2010-09-23 15:14     ` Eric W. Biederman
  1 sibling, 1 reply; 46+ messages in thread
From: David Lamparter @ 2010-09-23 14:58 UTC (permalink / raw)
  To: jamal
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On Thu, Sep 23, 2010 at 07:22:06AM -0400, jamal wrote:
> On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
> > Take advantage of the new abstraction and allow network devices
> > to be placed in any network namespace that we have a fd to talk
> > about.
> 
> So ... why just netdevice? could you allow migration of other
> net "items" eg a route table since they are all tagged by
> netns?

migrating route table entries makes no sense because
a) they refer to devices and configuration that does not exist in the
   target namespace; they only make sense within their netns context
b) they are purely virtual and you get the same result from deleting and
   recreating them.

Network devices are special because they may have something attached to
them, be it hardware or some daemon.


-David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 12:34               ` Pavel Emelyanov
  2010-09-23 14:54                 ` David Lamparter
@ 2010-09-23 15:00                 ` Eric W. Biederman
  1 sibling, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23 15:00 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, Daniel Lezcano, Linus Torvalds, Michael Kerrisk,
	Ulrich Drepper, Al Viro, David Miller, Serge E. Hallyn,
	Ben Greear, Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

Pavel Emelyanov <xemul@parallels.com> writes:

> On 09/23/2010 04:11 PM, jamal wrote:
>> On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:
>> 
>>> Why does it matter? You told, that the usage scenario was to
>>> add routes to container. If I do 2 syscalls instead of 1, is
>>> it THAT worse?
>>>
>> 
>> Anything to do with socket IO that requires namespace awareness
>> applies for usage; it could be tcp/udp/etc socket. If it doesnt
>> make any difference performance wise using one scheme vs other
>> to write/read heavy messages then i dont see an issue and socketat
>> is redundant.
>
> That's what my point is about - unless we know why would we need it
> we don't need it.
>
> Eric, please clarify, what is the need in creating a socket in foreign
> net namespace?

Strictly speaking you can implement this functionality in userspace
with setns().  aka

int socketat(int nsfd, int domain, int type, int protocol)
{
        int sk;

        setns(0, nsfd);
        sk = socket(domain, type, protocol);
        setns(0, default_nsfd);

        return sk;
}

The major difference is that socketat in userspace suffers
from races, with signals etc.
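
For the signal part of that race, a userspace wrapper can at least block
signals around the window.  The following is only a sketch under stated
assumptions: it uses the setns(int fd, int nstype) signature that was
eventually merged (not the setns(nstype, fd) order proposed in this
series), it assumes a single-threaded caller, and socketat_user() is a
hypothetical name.  It does nothing about other threads or about a
failure to switch back.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for setns() and CLONE_NEWNET */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical userspace socketat(): create a socket in the network
 * namespace referred to by nsfd, then return to the original namespace.
 * Assumes a single-threaded caller.  Returns the socket or -1 with errno set.
 */
int socketat_user(int nsfd, int domain, int type, int protocol)
{
	sigset_t all, old;
	int origfd, sk = -1, err = 0;

	/* Remember the namespace we have to come back to. */
	origfd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (origfd < 0)
		return -1;

	/* No signal handler may run while we sit in the other namespace. */
	sigfillset(&all);
	sigprocmask(SIG_SETMASK, &all, &old);

	if (setns(nsfd, CLONE_NEWNET) != 0) {
		err = errno;
	} else {
		sk = socket(domain, type, protocol);
		if (sk < 0)
			err = errno;
		if (setns(origfd, CLONE_NEWNET) != 0) {
			/* Stuck in the wrong namespace: callers should treat this as fatal. */
			err = errno;
			if (sk >= 0)
				close(sk);
			sk = -1;
		}
	}

	sigprocmask(SIG_SETMASK, &old, NULL);
	close(origfd);
	if (sk < 0)
		errno = err;
	return sk;
}
```

Switching namespaces this way needs the privilege that setns() itself
requires, so unprivileged callers should expect -1.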

The use case is the handful of networking applications
that find that it makes sense to listen to sockets from multiple network
namespaces at once.  Say a home machine that has a vpn into your office
network and the vpn into the office network runs in a different network
namespace so you don't have to worry about address conflicts between
the two networks, the chance of accidentally bridging between them,
and so you can use different dns resolvers for the different networks.

In that scenario it would be nice if I could run some services on both
networks.  Starting two+ copies of the daemons just so they can live
in all of the networks is ok, but in the fullness of time I expect that
there will be daemons that want to optimize things and have sockets in
all of the network namespaces you are connected to.

In a multiple network namespace aware application when it goes to open
a socket it will want to specify which network namespace the socket is
in.  If it is a general listener it will probably be listening for events
in /proc/mounts, waiting for extra namespaces to be mounted under a
standard location, say: /var/run/netns/<netnsname>/ns.

Once the application receives the event for a new network namespace
showing up it will want to create a new socket listening for
connections in the new network namespace.
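
The "listening to events in /proc/mounts" step could look roughly like
this.  It is a minimal sketch that relies on /proc/mounts being pollable
(a mount or unmount is reported as POLLPRI/POLLERR); the helper name
wait_for_mount_change() is hypothetical, and rescanning the standard
location after a wakeup is left to the caller.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until the mount table behind mounts_fd (an open /proc/mounts)
 * changes.  Returns 1 on a change, 0 on timeout, -1 on error.
 */
int wait_for_mount_change(int mounts_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = mounts_fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	/* A mount or unmount shows up as a priority/error event. */
	return (n > 0 && (pfd.revents & (POLLPRI | POLLERR))) ? 1 : 0;
}
```

A daemon would open /proc/mounts once, call this in its event loop, and
on a return of 1 look for new entries under its namespace directory.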

In that scenario none of those network namespaces are foreign, but one
network namespace will be the default and the rest will be non-default
network namespaces.

To support a multiple network namespace aware daemon I need to implement
socketat() somewhere.  So I figured I would see if anyone minded a
trivial in-kernel race-free implementation.  To me it is a wart in the
API and I am busily removing warts in the API.

I don't know of any scenarios with other namespaces where there would be
applications that would be native in multiple namespaces.  So I haven't
done any work in that direction.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23 11:22   ` jamal
  2010-09-23 14:58     ` David Lamparter
@ 2010-09-23 15:14     ` Eric W. Biederman
  1 sibling, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23 15:14 UTC (permalink / raw)
  To: hadi
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, Daniel Lezcano, Linus Torvalds, Michael Kerrisk,
	Ulrich Drepper, Al Viro, David Miller, Serge E. Hallyn,
	Pavel Emelyanov, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

jamal <hadi@cyberus.ca> writes:

> On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
>> Take advantage of the new abstraction and allow network devices
>> to be placed in any network namespace that we have a fd to talk
>> about.
>> 
>
> So ... why just netdevice? could you allow migration of other
> net "items" eg a route table since they are all tagged by
> netns?

For this patchset because we only support migrating physical
network devices between network namespaces today.

In the bigger picture migrating things between network namespaces is
race prone.  Fixing those races probably would reduce network stack
performance and increase code complexity for no particularly good
reason.  Network devices are special because they are physical hardware
and in combination with the rule that all packets coming from a network
device go to a single network namespace we have to implement migration
for network devices.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (7 preceding siblings ...)
  2010-09-23  8:51 ` [PATCH 8/8] net: Implement socketat Eric W. Biederman
@ 2010-09-23 15:18 ` David Lamparter
  2010-09-23 16:32   ` Eric W. Biederman
  2010-09-24 13:02 ` Andrew Lutomirski
  9 siblings, 1 reply; 46+ messages in thread
From: David Lamparter @ 2010-09-23 15:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On Thu, Sep 23, 2010 at 01:45:04AM -0700, Eric W. Biederman wrote:
> Introduce file for manipulating namespaces and related syscalls.
> files:
> /proc/self/ns/<nstype>

As feedback from using network namespaces extensively in more or less
production setups, I would like to make a request/suggestion: there
needs to be a way to enumerate network namespaces independent from
by-pid access.

On several occasions, I was left with some runaway daemon which
kept the namespace alive. To describe this a little more graphically:
I found no other way than doing a
	md5sum /proc/*/net/if_inet6 | sort | uniq -c -w 32
to find out which runaway to kill to terminate the namespace.

This makes network namespaces particularly cumbersome to use without PID
namespaces. While I agree that a large part of the users - namely lxc -
will use them together, network namespaces without pidns are very
interesting for routing applications implementing VRFs.

Is it possible to add some kind of "all namespaces" list, optimally
giving an opportunity to open() exactly this file descriptor that you
get from /proc/<pid>/ns/net?

Also, is it possible to extend that file descriptor to have an
"get all pids" ioctl,
...or, wait, maybe have /proc/...ns/proc/<pid> symlink?

(This obviously isn't fully thought to the end, please pick up...)


-David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] ns proc: Add support for the network namespace.
  2010-09-23 11:27   ` Louis Rilling
@ 2010-09-23 16:00     ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23 16:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Containers, netdev, netfilter-devel, linux-fsdevel, jamal,
	Daniel Lezcano, Linus Torvalds, Michael Kerrisk, Ulrich Drepper,
	Al Viro, David Miller, Serge E. Hallyn, Pavel Emelyanov,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy

Louis Rilling <Louis.Rilling@kerlabs.com> writes:

> On 23/09/10  1:47 -0700, Eric W. Biederman wrote:
>> 
>> Implementing file descriptors for the network namespace is simple and
>> straight forward.
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>
> [...]
>
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index c988e68..581a088 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -571,3 +571,33 @@ void unregister_pernet_device(struct pernet_operations *ops)
>>  	mutex_unlock(&net_mutex);
>>  }
>>  EXPORT_SYMBOL_GPL(unregister_pernet_device);
>> +
>> +#ifdef CONFIG_NET_NS
>> +static void *netns_get(struct task_struct *task)
>> +{
>> +	struct net *net;
>> +	rcu_read_lock();
>> +	net = get_net(task->nsproxy->net_ns);
>
> task could be exiting, so task->nsproxy could be NULL, right?
> Maybe make proc_ns_instantiate() rcu_dereference task->nsproxy, check for it
> being not NULL, and pass task->nsproxy to ns_ops->get()?

Ugh.  Thanks. fixed.

Somehow I forgot /proc shows zombies which means nsproxy
will definitely be NULL in those cases.  It is easy enough to handle
once my brain gets out of park.

I don't hold any locks at that point so I don't think I want to do
anything in proc_ns_instantiate() except handle a NULL return when
the ns_ops->get() fails.

> That could be an issue for the user namespace since it is not in nsproxy, but
> maybe no reasonable usage of ns_ops with user namespaces is
> envisioned.

It is also an issue for the pid namespace.

> Otherwise, checking that task is alive with RCU locked in proc_ns_instantiate() should be enough to be
> rely on task->cred when calling ns_ops->get().

That sounds about right.  I keep conveniently forgetting the user
namespace.  It doesn't support unshare right now so there isn't anything
I can reasonably do with it at the moment.  It wouldn't surprise me if I
wind up handling the user namespace like the pid namespace, where
unsharing it changes the properties for the children and not the parent.
Bleh.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23  9:41   ` Eric Dumazet
@ 2010-09-23 16:03     ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23 16:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le jeudi 23 septembre 2010 à 01:51 -0700, Eric W. Biederman a écrit :
>> Take advantage of the new abstraction and allow network devices
>> to be placed in any network namespace that we have a fd to talk
>> about.
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>  include/linux/if_link.h     |    1 +
>>  include/net/net_namespace.h |    1 +
>>  net/core/net_namespace.c    |   26 ++++++++++++++++++++++++++
>>  net/core/rtnetlink.c        |    4 +++-
>>  4 files changed, 31 insertions(+), 1 deletions(-)
>> 
>> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
>> index 2fc66dd..ae73d5e 100644
>> --- a/include/linux/if_link.h
>> +++ b/include/linux/if_link.h
>> @@ -116,6 +116,7 @@ enum {
>>  	IFLA_STATS64,
>>  	IFLA_VF_PORTS,
>>  	IFLA_PORT_SELF,
>> +	IFLA_NET_NS_FD,
>>  	__IFLA_MAX
>>  };
>>  
>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>> index bd10a79..68672ce 100644
>> --- a/include/net/net_namespace.h
>> +++ b/include/net/net_namespace.h
>> @@ -114,6 +114,7 @@ static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
>>  extern struct list_head net_namespace_list;
>>  
>>  extern struct net *get_net_ns_by_pid(pid_t pid);
>> +extern struct net *get_net_ns_by_fd(int pid);
>>  
>>  #ifdef CONFIG_NET_NS
>>  extern void __put_net(struct net *net);
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index 581a088..a9b54a7 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -8,6 +8,8 @@
>>  #include <linux/idr.h>
>>  #include <linux/rculist.h>
>>  #include <linux/nsproxy.h>
>> +#include <linux/proc_fs.h>
>> +#include <linux/file.h>
>>  #include <net/net_namespace.h>
>>  #include <net/netns/generic.h>
>>  
>> @@ -341,6 +343,30 @@ struct net *get_net_ns_by_pid(pid_t pid)
>>  }
>>  EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
>>  
>> +struct net *get_net_ns_by_fd(int fd)
>> +{
>> +	struct proc_inode *ei;
>> +	struct file *file;
>> +	struct net *net;
>> +
>> +	file = NULL;
>> +	net = ERR_PTR(-EINVAL);
>> +	file = proc_ns_fget(fd);
>> +	if (!fd)
>> +		goto out;
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	ei = PROC_I(file->f_dentry->d_inode);
>> +	if (ei->ns_ops != &netns_operations)
>> +		goto out;
>> +
>> +	net = get_net(ei->ns);
>> +out:
>> +	if (file)
>> +		fput(file);
>> +	return net;
>> +}
>> +
>>  static int __init net_ns_init(void)
>>  {
>>  	struct net_generic *ng;
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index f78d821..771d8be 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -1003,6 +1003,8 @@ struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[])
>>  	 */
>>  	if (tb[IFLA_NET_NS_PID])
>>  		net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
>> +	else if (tb[IFLA_NET_NS_FD])
>> +		net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
>>  	else
>>  		net = get_net(src_net);
>>  	return net;
>> @@ -1077,7 +1079,7 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
>>  	int send_addr_notify = 0;
>>  	int err;
>>  
>> -	if (tb[IFLA_NET_NS_PID]) {
>> +	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]) {
>>  		struct net *net = rtnl_link_get_net(dev_net(dev), tb);
>>  		if (IS_ERR(net)) {
>>  			err = PTR_ERR(net);
>
> You probably want to add following chunk :

Thanks fixed.

Eric

> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index b2a718d..35bb6de 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -927,6 +927,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>  	[IFLA_LINKMODE]		= { .type = NLA_U8 },
>  	[IFLA_LINKINFO]		= { .type = NLA_NESTED },
>  	[IFLA_NET_NS_PID]	= { .type = NLA_U32 },
> +	[IFLA_NET_NS_FD]	= { .type = NLA_U32 },
>  	[IFLA_IFALIAS]	        = { .type = NLA_STRING, .len = IFALIASZ-1 },
>  	[IFLA_VFINFO_LIST]	= {. type = NLA_NESTED },
>  	[IFLA_VF_PORTS]		= { .type = NLA_NESTED },

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23 14:22   ` Brian Haley
@ 2010-09-23 16:16     ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23 16:16 UTC (permalink / raw)
  To: Brian Haley
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

Brian Haley <brian.haley@hp.com> writes:

> On 09/23/2010 04:51 AM, Eric W. Biederman wrote:
>> 
>> Take advantage of the new abstraction and allow network devices
>> to be placed in any network namespace that we have a fd to talk
>> about.
>> 
> ...
>> +struct net *get_net_ns_by_fd(int fd)
>> +{
>> +	struct proc_inode *ei;
>> +	struct file *file;
>> +	struct net *net;
>> +
>> +	file = NULL;
>
> No need to initialize this.
>
>> +	net = ERR_PTR(-EINVAL);
>
> or this?
>
>> +	file = proc_ns_fget(fd);
>> +	if (!fd)
>> +		goto out;
>> +		return ERR_PTR(-EINVAL);
>
> Shouldn't this be:
>
> 	if (!file)
>
> And the "goto" seems wrong, especially without a {} here.  Unless you
> meant to keep the "goto" and branch below?

I think I changed my mind half way through writing the code and never
did anything about it.  Oops.

Thanks fixed.  It is now:

struct net *get_net_ns_by_fd(int fd)
{
	struct proc_inode *ei;
	struct file *file;
	struct net *net;

	net = ERR_PTR(-EINVAL);
	file = proc_ns_fget(fd);
	if (!file)
		goto out;

	ei = PROC_I(file->f_dentry->d_inode);
	if (ei->ns_ops != &netns_operations)
		goto out;

	net = get_net(ei->ns);
out:
	if (file)
		fput(file);
	return net;
}

Which at least makes sense.  Now to test it to double check it does what
it should do.
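
From the userspace side, a request using the new attribute might be
assembled as below.  This is a sketch only: struct move_req and
build_move_req() are hypothetical names, the attribute is sent as a u32
to match the NLA_U32 policy entry suggested elsewhere in this thread, and
actually sending the message over a NETLINK_ROUTE socket is omitted.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

/* Hypothetical request layout: header, ifinfomsg, room for attributes. */
struct move_req {
	struct nlmsghdr  nh;
	struct ifinfomsg ifi;
	char             attrs[64];
};

/*
 * Fill *req with an RTM_SETLINK request asking the kernel to move the
 * device with index ifindex into the network namespace named by nsfd.
 */
void build_move_req(struct move_req *req, int ifindex, int nsfd)
{
	uint32_t fd = (uint32_t)nsfd;
	struct rtattr rta = {
		.rta_len  = RTA_LENGTH(sizeof(fd)),
		.rta_type = IFLA_NET_NS_FD,
	};

	memset(req, 0, sizeof(*req));
	req->nh.nlmsg_type  = RTM_SETLINK;
	req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req->ifi.ifi_family = AF_UNSPEC;
	req->ifi.ifi_index  = ifindex;

	/* Attribute header followed by the u32 payload. */
	memcpy(req->attrs, &rta, sizeof(rta));
	memcpy(req->attrs + RTA_LENGTH(0), &fd, sizeof(fd));

	req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifi)) +
			    RTA_ALIGN(rta.rta_len);
}
```

The fd passed here is one obtained by opening a namespace file such as
/proc/<pid>/ns/net or a bind mount of it.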

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
  2010-09-23 15:18 ` [ABI REVIEW][PATCH 0/8] Namespace file descriptors David Lamparter
@ 2010-09-23 16:32   ` Eric W. Biederman
  2010-09-23 16:49     ` David Lamparter
  0 siblings, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-23 16:32 UTC (permalink / raw)
  To: David Lamparter
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

David Lamparter <equinox@diac24.net> writes:

> On Thu, Sep 23, 2010 at 01:45:04AM -0700, Eric W. Biederman wrote:
>> Introduce file for manipulating namespaces and related syscalls.
>> files:
>> /proc/self/ns/<nstype>
>
> As feedback from using network namespaces extensively in more or less
> production setups, I would like to make a request/suggestion: there
> needs to be a way to enumerate network namespaces independent from
> by-pid access.
>
> At several occasions, I was left with either some runaway daemon which
> kept the namespace alive. To describe this a little more graphically:
> I found no other way than doing a
> 	md5sum /proc/*/net/if_inet6 | sort | uniq -c -w 32
> to find out which runaway to kill to terminate the namespace.
>
> This makes network namespaces particularly cumbersome to use without PID
> namespaces. While I agree that a large part of the users - namely lxc -
> will use them together, network namespaces without pidns are very
> interesting for routing applications implementing VRFs.
>
> Is it possible to add some kind of "all namespaces" list, optimally
> giving an opportunity to open() exactly this file descriptor that you
> get from /proc/<pid>/ns/net?
>
> Also, is it possible to extend that file descriptor to have an
> "get all pids" ioctl,
> ...or, wait, maybe have /proc/...ns/proc/<pid> symlink?
>
> (This obviously isn't fully thought to the end, please pick up...)

Maybe.  I can understand the pain.

Is the problem you are facing that you are shutting down a vrf and want
to make certain nothing is using it any longer?

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
  2010-09-23 16:32   ` Eric W. Biederman
@ 2010-09-23 16:49     ` David Lamparter
  0 siblings, 0 replies; 46+ messages in thread
From: David Lamparter @ 2010-09-23 16:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Lamparter, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, jamal, Daniel Lezcano,
	Linus Torvalds, Michael Kerrisk, Ulrich Drepper, Al Viro,
	David Miller, Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov,
	Ben Greear, Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On Thu, Sep 23, 2010 at 09:32:29AM -0700, Eric W. Biederman wrote:
> > At several occasions, I was left with either some runaway daemon which
> > kept the namespace alive. To describe this a little more graphically:
> > I found no other way than doing a
> > 	md5sum /proc/*/net/if_inet6 | sort | uniq -c -w 32
> > to find out which runaway to kill to terminate the namespace.
> >
> > This makes network namespaces particularly cumbersome to use without PID
> > namespaces. While I agree that a large part of the users - namely lxc -
> > will use them together, network namespaces without pidns are very
> > interesting for routing applications implementing VRFs.
> >
> > Is it possible to add some kind of "all namespaces" list, optimally
> > giving an opportunity to open() exactly this file descriptor that you
> > get from /proc/<pid>/ns/net?
> >
> > Also, is it possible to extend that file descriptor to have an
> > "get all pids" ioctl,
> > ...or, wait, maybe have /proc/...ns/proc/<pid> symlink?
> >
> > (This obviously isn't fully thought to the end, please pick up...)
> 
> Maybe.  I can understand the pain.
> 
> Is the problem you are facing you are shutting down a vrf and you want
> to make certain nothing is using it any longer?

Hrm. There are 2 and a half problems i can describe:

1) identifying namespaces. You can walk over /proc just fine and look at
   all processes' namespaces, but you don't know which are actually the
   same aside from looking at some entry like if_inet6. There is no
   identifier and no easy equality match. (As far as i can tell.)

   Bonus difficulty: your patch will allow namespaces that have no
   process attached to them anymore since they only exist as files.
   Those will be invisible to someone running through /proc. Which leads
   to:
2) enumerating namespaces. Sure you can walk through /proc, but that's
   racy and won't even work with fd-only namespaces. It might even be a
   security risk if some trojan creates, say, a VLAN on your eth0, or a
   macvlan, hides it in a network namespace and communicates through it.

2 1/2) is terminating a namespace. It's not really a problem to add a
   PID namespace when you have "uncontrollable" daemons; however you
   can't be sure whether someone else took a reference on the network
   namespace from the outside.
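
For the equality match in 1), later kernels (not this patch set as
posted) give each /proc/<pid>/ns/<nstype> entry a device and inode
number that identify the namespace, so two processes can be compared
with stat().  A minimal sketch under that assumption, with same_netns()
as a hypothetical helper name:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Returns 1 if pids a and b are in the same network namespace,
 * 0 if they are not, -1 if either entry cannot be examined.
 */
int same_netns(pid_t a, pid_t b)
{
	char pa[64], pb[64];
	struct stat sa, sb;

	snprintf(pa, sizeof(pa), "/proc/%ld/ns/net", (long)a);
	snprintf(pb, sizeof(pb), "/proc/%ld/ns/net", (long)b);

	if (stat(pa, &sa) != 0 || stat(pb, &sb) != 0)
		return -1;

	/* Same device and inode means the same namespace object. */
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

This still does not help with 2): a namespace kept alive only by an
open fd or a bind mount has no pid to look under.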

These all are mainly administration/management issues, not that much
regular operation. Writing routing software with VRF support works just
fine, but the sysadmin can be at somewhat of an odd end here.


-David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23 14:58     ` David Lamparter
@ 2010-09-24 11:51       ` jamal
  2010-09-24 12:57         ` David Lamparter
  0 siblings, 1 reply; 46+ messages in thread
From: jamal @ 2010-09-24 11:51 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On Thu, 2010-09-23 at 16:58 +0200, David Lamparter wrote:

> migrating route table entries makes no sense because
> a) they refer to devices and configuration that does not exist in the
>    target namespace; they only make sense within their netns context
> b) they are purely virtual and you get the same result from deleting and
>    recreating them.
> 
> Network devices are special because they may have something attached to
> them, be it hardware or some daemon.

Routes functionally reside on top of netdevices, point to nexthop
neighbors across these netdevices etc. Underlying assumption is you take
care of that dependency when migrating.
We are talking about FIB entries here not the route cache; moving a few
pointers within the kernel is a hell of a lot faster than recreating a
subset of BGP entries from user space.

Eric, I didnt follow the exposed-races argument: Why would it involve
more than just some basic locking only while you change the struct net
pointer to the new namespace for these sub-subsystems?

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-24 11:51       ` jamal
@ 2010-09-24 12:57         ` David Lamparter
  2010-09-24 13:32           ` jamal
  0 siblings, 1 reply; 46+ messages in thread
From: David Lamparter @ 2010-09-24 12:57 UTC (permalink / raw)
  To: jamal
  Cc: David Lamparter, Eric W. Biederman, linux-kernel,
	Linux Containers, netdev, netfilter-devel, linux-fsdevel

On Fri, Sep 24, 2010 at 07:51:24AM -0400, jamal wrote:
> > migrating route table entries makes no sense because
> > a) they refer to devices and configuration that does not exist in the
> >    target namespace; they only make sense within their netns context
> > b) they are purely virtual and you get the same result from deleting and
> >    recreating them.
> > 
> > Network devices are special because they may have something attached to
> > them, be it hardware or some daemon.
> 
> Routes functionally reside on top of netdevices, point to nexthop
> neighbors across these netdevices etc. Underlying assumption is you take
> care of that dependency when migrating.
> We are talking about FIB entries here not the route cache; moving a few
> pointers within the kernel is a hell lot faster than recreating a subset
> of BGP entries from user space. 

No. While you sure could associate routes with devices, they don't
*functionally* reside on top of network devices. They reside on top of
the entire IP configuration, and in case of BGP they even reside on top
of your set of peerings and their data.

Even if you could "move" routes together with a network device, the
result would be utter nonsense. The routes depend on your BGP view, and
if your set of interfaces (and peers) changes, your routes will change.
Your bgpd will, either way, need to set up new peerings and redo best
path evaluations.

(On an unrelated note, how often are you planning to move stuff between
namespaces? I don't expect to be moving stuff except on configuration
events...)


-David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
  2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
                   ` (8 preceding siblings ...)
  2010-09-23 15:18 ` [ABI REVIEW][PATCH 0/8] Namespace file descriptors David Lamparter
@ 2010-09-24 13:02 ` Andrew Lutomirski
  2010-09-24 13:49   ` Daniel Lezcano
  9 siblings, 1 reply; 46+ messages in thread
From: Andrew Lutomirski @ 2010-09-24 13:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, netfilter-devel,
	linux-fsdevel, jamal, Daniel Lezcano, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

Eric W. Biederman wrote:
> Introduce file for manipulating namespaces and related syscalls.
> files:
> /proc/self/ns/<nstype>
> 
> syscalls:
> int setns(unsigned long nstype, int fd);
> socketat(int nsfd, int family, int type, int protocol);
> 

How does security work?  Are there different kinds of fd that give (say)
pin-the-namespace permission, socketat permission, and setns permission?

--Andy

> Netlink attribute:
> IFLA_NS_FD int fd.
> 
> Name space file descriptors address three specific problems that
> can make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the child of the
>   original creator.
> - Namespaces don't have names that userspace can use to talk about them.
> 
> Opening of the /proc/self/ns/<nstype> files return a file descriptor
> that can be used to talk about a specific namespace, and to keep the
> specified namespace alive.
> 
> /proc/self/ns/<nstype> can be bind mounted as:
> mount --bind /proc/self/ns/net /some/filesystem/path
> to keep the namespace alive as long as the mount exists.
> 
> setns() as a companion to unshare allows changing the namespace
> of the current process, being able to unshare the namespace is
> a requirement.
> 
> There are two primary envisioned uses for this functionality.
> o ``Entering'' an existing container.
> o Allowing multiple network namespaces to be in use at once on
>   the same machine, without requiring elaborate infrastructure.
> 
> Overall this received positive reviews on the containers list but this
> needs a wider review of the ABI as this is pretty fundamental kernel
> functionality.
> 
> 
> I have left out the pid namespaces bits for the moment because the pid
> namespace still needs work before it is safe to unshare, and my concern
> at the moment is ensuring the system calls seem reasonable.
> 
> Eric W. Biederman (8):
>       ns: proc files for namespace naming policy.
>       ns: Introduce the setns syscall
>       ns proc: Add support for the network namespace.
>       ns proc: Add support for the uts namespace
>       ns proc: Add support for the ipc namespace
>       ns proc: Add support for the mount namespace
>       net: Allow setting the network namespace by fd
>       net: Implement socketat.
> 
> ---
>  fs/namespace.c              |   57 +++++++++++++
>  fs/proc/Makefile            |    1 +
>  fs/proc/base.c              |   22 +++---
>  fs/proc/inode.c             |    7 ++
>  fs/proc/internal.h          |   18 ++++
>  fs/proc/namespaces.c        |  193 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/if_link.h     |    1 +
>  include/linux/proc_fs.h     |   20 +++++
>  include/net/net_namespace.h |    1 +
>  ipc/namespace.c             |   31 +++++++
>  kernel/nsproxy.c            |   39 +++++++++
>  kernel/utsname.c            |   32 +++++++
>  net/core/net_namespace.c    |   56 +++++++++++++
>  net/core/rtnetlink.c        |    4 +-
>  net/socket.c                |   26 ++++++-
>  15 files changed, 494 insertions(+), 14 deletions(-)
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-24 12:57         ` David Lamparter
@ 2010-09-24 13:32           ` jamal
  2010-09-24 14:09             ` David Lamparter
  0 siblings, 1 reply; 46+ messages in thread
From: jamal @ 2010-09-24 13:32 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel

On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:

> No. While you sure could associate routes with devices, they don't
> *functionally* reside on top of network devices. They reside on top of
> the entire IP configuration, 

I think i am not making my point clearly. There are data dependencies;
if you were to move routes, you'd need everything that routes depend on.
IOW, if i was to draw a functional graph, routes would appear on top
of netdevs (I dont care what other functional blocks you put in between
or sideways to them).

> and in case of BGP they even reside on top
> of your set of peerings and their data.
> Even if you could "move" routes together with a network device, the
> result would be utter nonsense. 

You could argue that moving a netdevice where some of its fundamental
properties such as an ifindex change is utter nonsense. But you can
work around it.

> The routes depend on your BGP view, and
> if your set of interfaces (and peers) changes, your routes will change.
> Your bgpd will, either way, need to set up new peerings and redo best
> path evaluations.

Worst case scenario, yes. I am beginning to get a feeling we are trying 
to achieve different goals maybe? Why are you even migrating netdevs?

> (On an unrelated note, how often are you planning to move stuff between
> namespaces? I don't expect to be moving stuff except on configuration
> events...)

Triggering on config events is useful, and it is likely the only
possibility if you assume the other namespace is remote. But if I could
send a single command to migrate several things in the kernel (in my
case to recover state to a different ns), then that is much simpler and
uses the least resources (memory, cpu, bandwidth). I admit it is very
hard to do in most cases where the underlying dependencies are evolving,
and there synchronizing via user space is the best approach. The example
of the route table I pointed to is simple.
Besides that: dynamic state created in the kernel that doesn't have to be
recreated by the next arriving 100K packets helps to improve recovery.

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
                     ` (2 preceding siblings ...)
  2010-09-23 14:22   ` Brian Haley
@ 2010-09-24 13:46   ` Daniel Lezcano
  3 siblings, 0 replies; 46+ messages in thread
From: Daniel Lezcano @ 2010-09-24 13:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Sukadev Bhattiprolu, Pavel Emelyanov,
	Pavel Emelyanov, Ulrich Drepper, netdev, Jonathan Corbet,
	Jan Engelhardt, linux-fsdevel, netfilter-devel, Michael Kerrisk,
	Linux Containers, Ben Greear, Linus Torvalds, David Miller,
	Al Viro

On 09/23/2010 10:51 AM, Eric W. Biederman wrote:
>
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
>
> Signed-off-by: Eric W. Biederman<ebiederm@xmission.com>
> ---

[ ... ]

> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	net = ERR_PTR(-EINVAL);
> +	file = proc_ns_fget(fd);
> +	if (IS_ERR(file))
> +		return ERR_CAST(file);
> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops != &netns_operations)
> +		goto out;

Is this check necessary here ? proc_ns_fget checks "file->f_op != 
&ns_file_operations", no ?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
  2010-09-24 13:02 ` Andrew Lutomirski
@ 2010-09-24 13:49   ` Daniel Lezcano
  2010-09-24 17:06     ` Eric W. Biederman
  0 siblings, 1 reply; 46+ messages in thread
From: Daniel Lezcano @ 2010-09-24 13:49 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Eric W. Biederman, Sukadev Bhattiprolu, Pavel Emelyanov,
	Pavel Emelyanov, Ulrich Drepper, netdev, Jonathan Corbet,
	linux-kernel, Jan Engelhardt, linux-fsdevel, netfilter-devel,
	Michael Kerrisk, Linux Containers, Ben Greear, Linus Torvalds,
	David Miller, Al Viro

On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
> Eric W. Biederman wrote:
>> Introduce file for manipulating namespaces and related syscalls.
>> files:
>> /proc/self/ns/<nstype>
>>
>> syscalls:
>> int setns(unsigned long nstype, int fd);
>> socketat(int nsfd, int family, int type, int protocol);
>>
>
> How does security work?  Are there different kinds of fd that give (say) pin-the-namespace permission, socketat permission, and setns permission?

AFAICS, socketat, setns and "set netns by fd" only accept fd from 
/proc/<pid>/ns/<ns>.

setns does :

	file = proc_ns_fget(fd);
	if (IS_ERR(file))
		return PTR_ERR(file);

proc_ns_fget checks if (file->f_op != &ns_file_operations)


socketat and get_net_ns_by_fd:

	net = get_net_ns_by_fd(fd);

this one calls proc_ns_fget.

This guarantees that the fd results from an open of the file
with the right permissions.

Another way to pin the namespace, would be to mount --bind 
/proc/<pid>/ns/<ns> but we have to be root to do that ...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-24 13:32           ` jamal
@ 2010-09-24 14:09             ` David Lamparter
  2010-09-24 14:16               ` jamal
  0 siblings, 1 reply; 46+ messages in thread
From: David Lamparter @ 2010-09-24 14:09 UTC (permalink / raw)
  To: jamal
  Cc: David Lamparter, Eric W. Biederman, linux-kernel,
	Linux Containers, netdev, netfilter-devel, linux-fsdevel

On Fri, Sep 24, 2010 at 09:32:53AM -0400, jamal wrote:
> On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:
> > No. While you sure could associate routes with devices, they don't
> > *functionally* reside on top of network devices. They reside on top of
> > the entire IP configuration, 
> 
> I think i am not clearly making my point. There are data dependencies;
> If you were to move routes, youd need everything that routes depend on.
> IOW, if i was to draw a functional graph, routes would appear on top
> of netdevs (I dont care what other functional blocks you put in between
> or sideways to them).

I understood your point. What I'm saying is that the functional graph
you're describing is too simplistic to be a workable model. Your graph
allows for what you're trying to do, yes. But your graph does not model
reality.

> > The routes depend on your BGP view, and
> > if your set of interfaces (and peers) changes, your routes will change.
> > Your bgpd will, either way, need to set up new peerings and redo best
> > path evaluations.
> 
> Worst case scenario, yes. I am beginning to get a feeling we are trying 
> to achieve different goals maybe? Why are you even migrating netdevs?

Err... I'm migrating netdevs to assign them to namespaces to allow them
to use them? Setup, basically. Either way a device move only happens as
result of some administrative action; be it creating a new namespace or
changing the physical/logical network setup.

> > (On an unrelated note, how often are you planning to move stuff between
> > namespaces? I don't expect to be moving stuff except on configuration
> > events...)
> 
> Triggering on config events is useful and it is likely the only
> possibility if you assumed the other namespace is remote.

wtf is a "remote" namespace?

>                                                           But if could
> send a single command to migrate several things in the kernel (in my
> case to recover state to a different ns), then that is much simpler and
> uses the least resources (memory, cpu, bandwidth). I admit it is very
> hard to do in most cases where the underlying dependencies are evolving
> and synchronizing via user space is the best approach. The example
> of route table i pointed to is simple.
> Besides that: dynamic state created in the kernel that doesnt have to be
> recreated by the next arriving 100K packets helps to improve recovery.

Can you please describe your application that requires moving possibly
several network devices together with "their" routes to a different
namespace?


-David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] net: Allow setting the network namespace by fd
  2010-09-24 14:09             ` David Lamparter
@ 2010-09-24 14:16               ` jamal
  0 siblings, 0 replies; 46+ messages in thread
From: jamal @ 2010-09-24 14:16 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel

On Fri, 2010-09-24 at 16:09 +0200, David Lamparter wrote:

> I understood your point. What I'm saying is that that functional graph
> you're describing is too simplistic to be a workable model. Your graph
> allows for what you're trying to do, yes. But your graph is not modeling
> the reality.

How about we put this specific point to rest by agreeing to
disagree? ;->

> Err... I'm migrating netdevs to assign them to namespaces to allow them
> to use them? Setup, basically. Either way a device move only happens as
> result of some administrative action; be it creating a new namespace or
> changing the physical/logical network setup.
> 

Ok, different need. You have a much more basic requirement than i do.

> wtf is a "remote" namespace?
> 

A namespace that is remotely located on another machine/hardware ;->

> Can you please describe your application that requires moving possibly
> several network devices together with "their" routes to a different
> namespace?

scaling and availability are the driving requirements.

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
  2010-09-24 13:49   ` Daniel Lezcano
@ 2010-09-24 17:06     ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-09-24 17:06 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrew Lutomirski, Sukadev Bhattiprolu, Pavel Emelyanov,
	Pavel Emelyanov, Ulrich Drepper, netdev, Jonathan Corbet,
	linux-kernel, Jan Engelhardt, linux-fsdevel, netfilter-devel,
	Michael Kerrisk, Linux Containers, Ben Greear, Linus Torvalds,
	David Miller, Al Viro

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
>> Eric W. Biederman wrote:
>>> Introduce file for manipulating namespaces and related syscalls.
>>> files:
>>> /proc/self/ns/<nstype>
>>>
>>> syscalls:
>>> int setns(unsigned long nstype, int fd);
>>> socketat(int nsfd, int family, int type, int protocol);
>>>
>>
>> How does security work?  Are there different kinds of fd that give (say) pin-the-namespace permission, socketat permission, and setns permission?
>
> AFAICS, socketat, setns and "set netns by fd" only accept fd from
> /proc/<pid>/ns/<ns>.
>
> setns does :
>
> 	file = proc_ns_fget(fd);
> 	if (IS_ERR(file))
> 		return PTR_ERR(file);
>
> proc_ns_fget checks if (file->f_op != &ns_file_operations)
>
>
> socketat and get_net_ns_by_fd:
>
> 	net = get_net_ns_by_fd(fd);
>
> this one calls proc_ns_fget.
>
> We have the guarantee here, the fd is resulting from an open of the file with
> the right permissions.

In particular the default /proc permissions say you have to be the owner
of the process (or root) to access the file.  If you are the owner of
the process with a namespace (or root) you already have permission to
access and manipulate the namespace.

Additionally setns like unshare requires CAP_SYS_ADMIN (aka root magic).

> Another way to pin the namespace, would be to mount --bind /proc/<pid>/ns/<ns>
> but we have to be root to do that ...

Simply keeping the process running pins the namespace. That requires no
new permissions.

Similarly socketat.  It is possible to use unix domain sockets to
implement it today without any kernel changes.  It is just an
unnecessary pain to run a server process to pin a namespace or to serve
up file descriptors in other network namespaces.
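
As a rough sketch of the unix domain socket approach, assuming it means
passing descriptors as SCM_RIGHTS ancillary data: a server process living
in the target namespace creates the socket and hands the fd to the client.
The function names send_fd() and recv_fd() are illustrative only.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over the unix domain socket sock. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;			/* at least one byte of real data */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor keeps referring to the object created by the
sender, which is why a server inside the namespace is needed at all in this
scheme.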

The primary change of this patchset is the ability to do everything
with file descriptors and with the mount namespace.  That moves
everything from a bizarre, hard to understand and manipulate interface
to one where things can be done much more easily and cheaply,
resulting in a much more powerful and usable interface.

Eric


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-09-23 11:53           ` Pavel Emelyanov
  2010-09-23 12:11             ` jamal
@ 2010-10-02 21:13             ` Daniel Lezcano
  2010-10-03 13:44               ` jamal
  1 sibling, 1 reply; 46+ messages in thread
From: Daniel Lezcano @ 2010-10-02 21:13 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Eric W. Biederman, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Linus Torvalds, Michael Kerrisk,
	Ulrich Drepper, Al Viro, David Miller, Serge E. Hallyn,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy

On 09/23/2010 01:53 PM, Pavel Emelyanov wrote:
> On 09/23/2010 03:40 PM, jamal wrote:
>    
>> On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:
>>
>>      
>>> This particular usecase is unneeded once you have the "enter" ability.
>>>        
>> Is that cheaper from a syscall count/cost?
>>      
> Why does it matter? You told, that the usage scenario was to
> add routes to container. If I do 2 syscalls instead of 1, is
> it THAT worse?
>
>    
>> i.e do I have to enter every time i want to write/read this fd?
>>      
> No - you enter once, create a socket and do whatever you need
> withing the enterned namespace.
>    

Just to clarify this point: you enter the namespace, create the socket,
and go back to the initial namespace (or create a new one). Further
operations can be made against this fd because they use the network
namespace stored in the sock struct. The current process's network
namespace is only consulted at socket creation.

We can actually already do that by unsharing and then create a socket. 
This socket will pin the namespace and can be used as a control socket 
for the namespace (assuming the socket domain will be ok for all the 
operations).
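
The enter/create/return sequence can be sketched as follows. This is only
an illustration: socket_in_netns() is a hypothetical name, and the code
assumes the setns(fd, nstype) argument order that glibc exposes today,
which differs from the prototype in the cover letter. Note that setns
changes the namespace of the calling thread only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a socket inside the network namespace referred to by nsfd,
 * then switch back to the namespace we started in.
 * Returns the socket fd, or -1 on error.
 */
int socket_in_netns(int nsfd, int family, int type, int protocol)
{
	int sock = -1;
	/* Remember where we came from so we can switch back. */
	int self = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);

	if (self < 0)
		return -1;

	if (setns(nsfd, CLONE_NEWNET) == 0) {
		/* The socket records the namespace it is created in. */
		sock = socket(family, type, protocol);
		if (setns(self, CLONE_NEWNET) < 0 && sock >= 0) {
			/* Could not return: do not hand out the socket. */
			close(sock);
			sock = -1;
		}
	}
	close(self);
	return sock;
}
```

This is roughly what socketat would do in a single syscall; the sketch
needs two opens/closes and two setns calls around the socket creation.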

Jamal, I don't know what kind of application you have in mind, but if I
assume you want to create a process controlling 1024 netns, let's try to
identify what happens with setns and with socketat:

With setns:

     * open /proc/self/ns/net (1)
     * unshare the netns
     * open /proc/self/ns/net (2)
     * setns (1)
     * create a virtual network device
     * move the virtual device to (2) (using the set netns by fd)
     * unshare the netns
     ...

With socketat:

     * open a socket (1)
     * unshare the netns
     * open a netlink with socketat(1) => (2)
     * create a virtual device using (2) (at this point it is init_net_ns)
     * move the virtual device to the current netns (using the set netns
       by pid)
     * open a socket (3)
     * unshare the netns
     ...

We have the same number of file descriptors kept open. Except that, with
setns, we can bind mount the namespace file somewhere, which pins the
namespace, and then we can close the /proc/self/ns/net file descriptors
and reopen them later.

If your application has to do a lot of specific network processing in
different namespaces during its life cycle, the socketat syscall
will be better because it reduces the number of syscalls, but at the
cost of keeping the file descriptors open (potentially a big number).
Otherwise, setns should fit your needs.



>> How does poll/select work in that enter scenario?
>>      
> Just like it used to before the enter.
>
>    
>> cheers,
>> jamal
>>
>>
>>      


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-10-02 21:13             ` Daniel Lezcano
@ 2010-10-03 13:44               ` jamal
  2010-10-04 10:13                 ` Daniel Lezcano
                                   ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: jamal @ 2010-10-03 13:44 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Eric W. Biederman, linux-kernel,
	Linux Containers, netdev, netfilter-devel, linux-fsdevel,
	Linus Torvalds, Michael Kerrisk, Ulrich Drepper, Al Viro,
	David Miller, Serge E. Hallyn, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

Hi Daniel,

Thanks for clarifying this ..

On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:
> Just to clarify this point. You enter the namespace, create the socket
> and go back to the initial namespace (or create a new one). Further 
> operations can be made against this fd because it is the network 
> namespace stored in the sock struct which is used, not the current 
> process network namespace which is used at the socket creation only.
> 
> We can actually already do that by unsharing and then create a
> socket. 
> This socket will pin the namespace and can be used as a control socket
> for the namespace (assuming the socket domain will be ok for all the 
> operations).
>
> Jamal, I don't know what kind of application you want to use but if I 
> assume you want to create a process controlling 1024 netns, 

At the moment I am looking at 8K on a Nehalem with lots of RAM. They
will mostly be created at startup but some could be created afterwards.
Each will have its own netdevs etc., also created at startup (and some
other config that may happen later).
Because startup time may accumulate, it is clearly important to me
to pick whichever scheme reduces the number of calls...

> let's try to identificate what happen with setns and with socketat :
> 
> With setns:
> 
>      * open /proc/self/ns/net (1)
>      * unshare the netns
>      * open /proc/self/ns/net (2)
>      * setns (1)
>      * create a virtual network device
>      * move the virtual device to (2) (using the set netns by fd)
>      * unshare the netns
>      ...
> 
> With socketat:
> 
>      * open a socket (1)
>      * unshare the netns
>      * open a netlink with socketat(1) => (2)
>      * create a virtual device using (2) (at this point it is
>        init_net_ns)
>      * move the virtual device to the current netns (using the set
>        netns by pid)
>      * open a socket (3)
>      * unshare the netns
>      ...
> 
> We have the same number of file descriptors kept opened. Except, with 
> setns we can bind mount the directory somewhere, that will pin the 
> namespace and then we can close the /proc/self/ns/net file descriptors
> and reopen them later.
> 

Ok, so a wrapper such as create_socket_on(namespaceid)
will generally need fewer system calls with socketat().

> If your application has to do a lot of specific network processing, 
> during its life cycle, in different namespaces, the socketat syscall 
> will be better because it will reduce the number of syscalls but at
> the cost of keeping the file descriptors opened (potentially a big
> number). Otherwise, setns should fit your needs.

Makes sense. 

One thing still confuses me...
The app control point is in namespace0. I still want to be able to
"boot" namespaces first and maybe a few seconds later do a socketat()...
and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
would involve:
     * open /proc/self/ns/net (namespace-name)
     * unshare the netns
Is this correct?

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-10-03 13:44               ` jamal
@ 2010-10-04 10:13                 ` Daniel Lezcano
  2010-10-04 19:07                 ` Eric W. Biederman
  2010-10-15 12:30                 ` netns patches WAS( " jamal
  2 siblings, 0 replies; 46+ messages in thread
From: Daniel Lezcano @ 2010-10-04 10:13 UTC (permalink / raw)
  To: hadi
  Cc: Pavel Emelyanov, Eric W. Biederman, linux-kernel,
	Linux Containers, netdev, netfilter-devel, linux-fsdevel,
	Linus Torvalds, Michael Kerrisk, Ulrich Drepper, Al Viro,
	David Miller, Serge E. Hallyn, Pavel Emelyanov, Ben Greear,
	Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
	Jan Engelhardt, Patrick McHardy

On 10/03/2010 03:44 PM, jamal wrote:
> Hi Daniel,
>
> Thanks for clarifying this ..
>
> On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:
>    
>> Just to clarify this point. You enter the namespace, create the socket
>> and go back to the initial namespace (or create a new one). Further
>> operations can be made against this fd because it is the network
>> namespace stored in the sock struct which is used, not the current
>> process network namespace which is used at the socket creation only.
>>
>> We can actually already do that by unsharing and then create a
>> socket.
>> This socket will pin the namespace and can be used as a control socket
>> for the namespace (assuming the socket domain will be ok for all the
>> operations).
>>
>> Jamal, I don't know what kind of application you want to use but if I
>> assume you want to create a process controlling 1024 netns,
>>      
> At the moment i am looking at 8K on a Nehalem with lots of RAM. They
> will mostly be created at startup but some could be created afterwards.
> Each will have its own netdevs etc. also created at startup (and some
> other config that may happen later).
> Because startup time may accumulate, it is clearly important to me
> to pick whatever scheme that reduces the number of calls...
>    

8K! Wow! :)


>> let's try to identificate what happen with setns and with socketat :
>>
>> With setns:
>>
>>       * open /proc/self/ns/net (1)
>>       * unshare the netns
>>       * open /proc/self/ns/net (2)
>>       * setns (1)
>>       * create a virtual network device
>>       * move the virtual device to (2) (using the set netns by fd)
>>       * unshare the netns
>>       ...
>>
>> With socketat:
>>
>>       * open a socket (1)
>>       * unshare the netns
>>       * open a netlink with socketat(1) =>  (2)
>>       * create a virtual device using (2) (at this point it is
>>         init_net_ns)
>>       * move the virtual device to the current netns (using the set
>>         netns by pid)
>>       * open a socket (3)
>>       * unshare the netns
>>       ...
>>
>> We have the same number of file descriptors kept opened. Except, with
>> setns we can bind mount the directory somewhere, that will pin the
>> namespace and then we can close the /proc/self/ns/net file descriptors
>> and reopen them later.
>>
>>      
> Ok, so a wrapper such as: create_socket_on(namespaceid)
> will have generally less system calls with socketat()
>    

Yes, I think so.

>> If your application has to do a lot of specific network processing,
>> during its life cycle, in different namespaces, the socketat syscall
>> will be better because it will reduce the number of syscalls but at
>> the cost of keeping the file descriptors opened (potentially a big
>> number). Otherwise, setns should fit your needs.
>>      
> Makes sense.
>
> One thing still confuses me...
> The app control point is in namespace0. I still want to be able to
> "boot" namespaces first and maybe a few seconds later do a socketat()...
> and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
> would involve:
>       * open /proc/self/ns/net (namespace-name)
>       * unshare the netns
> Is this correct?
>    

Maybe I am misunderstanding, but if you are trying to save some syscalls,
you should use socketat only and keep a socket from the app control
namespace0 for it. The process will be in the last netns you unshared
(maybe you can use one setns syscall here to return to namespace0).

     (1) socketat :
         * pros : 1 syscall to create a socket
         * cons : a file descriptor per namespace, namespace is only
                  manageable via a socket

     (2) setns :
         * pros : namespace is fully manageable with generic code
         * cons : 2 syscalls (or 3 if we want to return to the initial
                  netns) to create a socket (setns + socket [ + setns ]),
                  a file descriptor per namespace

     (3) setns + bind mount :
         * pros : no file descriptor needs to be kept open
         * cons : longer startup (unshare + mount --bind), 4 syscalls
                  to create a socket in the namespace (open, setns,
                  socket, close), maybe 5 syscalls if we want to return
                  to the initial netns

Depending on the scheme you choose, the startup will be:

     (1) socketat :
          * open /proc/self/ns/net (one time, to 'save' and pin the
            initial netns)
         and then

         int create_ns(void)
         {
             /* Move into a fresh netns; the socket created next pins it. */
             if (unshare(CLONE_NEWNET) < 0)
                 return -1;
             return socket(...);
         }

         and,

         for (i = 0; i < 8192; i++)
                 mynsfd[i] = create_ns();

     (2) setns :
          * open /proc/self/ns/net (one time, to 'save' and pin the
            initial netns)
         and then

         int create_ns(void)
         {
             /* Move into a fresh netns and keep an fd that names it. */
             if (unshare(CLONE_NEWNET) < 0)
                 return -1;
             return open("/proc/self/ns/net", O_RDONLY);
         }

         and,

         for (i = 0; i < 8192; i++)
                 mynsfd[i] = create_ns();

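     (3) setns + mount :

          * open /proc/self/ns/net (one time, to 'save' and pin the
            initial netns)
         and then

         int create_ns(const char *nspath)
         {
             int fd;

             if (unshare(CLONE_NEWNET) < 0)
                 return -1;
             /* Empty file that serves as the bind-mount target. */
             fd = creat(nspath, 0444);
             if (fd < 0)
                 return -1;
             close(fd);
             return mount("/proc/self/ns/net", nspath, NULL, MS_BIND, NULL);
         }

         for (i = 0; i < 8192; i++)
                 create_ns(mynspath[i]);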

Hope that helps.

   -- Daniel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] net: Implement socketat.
  2010-10-03 13:44               ` jamal
  2010-10-04 10:13                 ` Daniel Lezcano
@ 2010-10-04 19:07                 ` Eric W. Biederman
  2010-10-15 12:30                 ` netns patches WAS( " jamal
  2 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-10-04 19:07 UTC (permalink / raw)
  To: hadi
  Cc: Daniel Lezcano, Pavel Emelyanov, linux-kernel, Linux Containers,
	netdev, netfilter-devel, linux-fsdevel, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

jamal <hadi@cyberus.ca> writes:

> One thing still confuses me...
> The app control point is in namespace0. I still want to be able to
> "boot" namespaces first and maybe a few seconds later do a socketat()...
> and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
> would involve:
>      * open /proc/self/ns/net (namespace-name)
>      * unshare the netns
> Is this correct?

Almost.

create should be:
        * verify namespace-name is not already in use
        * mkdir -p /var/run/netns/<namespace-name>
	* unshare the netns
        * mount --bind /proc/self/ns/net /var/run/netns/<namespace-name>
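
In C, the steps above could look roughly like the sketch below. It is not
taken from the patchset: netns_path() and netns_create() are hypothetical
names, and the sketch assumes the mount target is created as an empty
regular file under /var/run/netns rather than as a directory, since a bind
mount of a file needs a non-directory target. It also assumes /var/run
already exists, and it leaves the caller inside the new namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

#define NETNS_DIR "/var/run/netns"

/*
 * Build "/var/run/netns/<name>" in buf. Rejects empty names, "." and
 * "..", and names containing '/'. Returns 0 on success, -1 otherwise.
 */
int netns_path(char *buf, size_t len, const char *name)
{
	int n;

	if (!name[0] || strchr(name, '/') ||
	    !strcmp(name, ".") || !strcmp(name, ".."))
		return -1;
	n = snprintf(buf, len, "%s/%s", NETNS_DIR, name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Create a named network namespace (needs CAP_SYS_ADMIN). 0 or -1. */
int netns_create(const char *name)
{
	char path[256];
	int fd;

	if (netns_path(path, sizeof(path), name) < 0)
		return -1;
	if (mkdir(NETNS_DIR, 0755) < 0 && errno != EEXIST)
		return -1;

	/* O_EXCL fails if the name is already in use. */
	fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0444);
	if (fd < 0)
		return -1;
	close(fd);

	/* Enter a fresh netns and pin it with a bind mount. */
	if (unshare(CLONE_NEWNET) < 0 ||
	    mount("/proc/self/ns/net", path, NULL, MS_BIND, NULL) < 0) {
		unlink(path);
		return -1;
	}
	return 0;
}
```

A control process that wants to stay in namespace0 would open its own
/proc/self/ns/net beforehand and setns back to it after each create.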

Are you talking about replacing something that used to use the Linux
vrf patches that are floating around?

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* netns patches WAS( Re: [PATCH 8/8] net: Implement socketat.
  2010-10-03 13:44               ` jamal
  2010-10-04 10:13                 ` Daniel Lezcano
  2010-10-04 19:07                 ` Eric W. Biederman
@ 2010-10-15 12:30                 ` jamal
  2010-10-26 20:52                   ` jamal
  2 siblings, 1 reply; 46+ messages in thread
From: jamal @ 2010-10-15 12:30 UTC (permalink / raw)
  To: Daniel Lezcano, Eric W. Biederman
  Cc: Pavel Emelyanov, linux-kernel, Linux Containers, netdev,
	netfilter-devel, linux-fsdevel, Linus Torvalds, Michael Kerrisk,
	Ulrich Drepper, Al Viro, David Miller, Serge E. Hallyn,
	Pavel Emelyanov, Ben Greear, Matt Helsley, Jonathan Corbet,
	Sukadev Bhattiprolu, Jan Engelhardt, Patrick McHardy

Eric et al,

Did these patches make it in? I was looking at
two of Davem's net trees and I don't see them.

cheers,
jamal


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: netns patches WAS( Re: [PATCH 8/8] net: Implement socketat.
  2010-10-15 12:30                 ` netns patches WAS( " jamal
@ 2010-10-26 20:52                   ` jamal
  2010-10-27  0:27                     ` Eric W. Biederman
  0 siblings, 1 reply; 46+ messages in thread
From: jamal @ 2010-10-26 20:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Pavel Emelyanov, linux-kernel, Linux Containers,
	netdev, netfilter-devel, linux-fsdevel, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

Eric,

Ping?
If you are too busy to push these in, maybe have
someone clueful like Daniel help out with submitting? I think it
should be reasonable to leave out the socketat
patch initially if it is deemed controversial.

cheers,
jamal

On Fri, 2010-10-15 at 08:30 -0400, jamal wrote:
> Eric et al,
> 
> Did these patches make it in? I was looking at
> two Davem net trees and i dont see them.
> 
> cheers,
> jamal
> 



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: netns patches WAS( Re: [PATCH 8/8] net: Implement socketat.
  2010-10-26 20:52                   ` jamal
@ 2010-10-27  0:27                     ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2010-10-27  0:27 UTC (permalink / raw)
  To: hadi
  Cc: Daniel Lezcano, Pavel Emelyanov, linux-kernel, Linux Containers,
	netdev, netfilter-devel, linux-fsdevel, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy

jamal <hadi@cyberus.ca> writes:

> Eric,
>
> Ping?
> If you are too busy to push these in maybe have
> someone clueful like Daniel help out submitting? I think it
> should probably be reasonable to leave out the socketat
> patch initially if it is deemed controversial..

This merge cycle I am too busy, and my patches did not make it into
linux-next before the merge window.

Everything except socketat seems non-controversial.  It makes sense
to postpone socketat a little bit until we start converting applications
and there is a little real-world experience about what is needed.

I anticipate some time freeing up in the next couple of weeks so I
should be ready for the next merge window.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2010-10-27  0:35 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-23  8:45 [ABI REVIEW][PATCH 0/8] Namespace file descriptors Eric W. Biederman
2010-09-23  8:46 ` [PATCH 1/8] ns: proc files for namespace naming policy Eric W. Biederman
2010-09-23  8:46 ` [PATCH 2/8] ns: Introduce the setns syscall Eric W. Biederman
2010-09-23  8:47 ` [PATCH 3/8] ns proc: Add support for the network namespace Eric W. Biederman
2010-09-23 11:27   ` Louis Rilling
2010-09-23 16:00     ` Eric W. Biederman
2010-09-23  8:48 ` [PATCH 4/8] ns proc: Add support for the uts namespace Eric W. Biederman
2010-09-23  8:49 ` [PATCH 5/8] ns proc: Add support for the ipc namespace Eric W. Biederman
2010-09-23  8:50 ` [PATCH 6/8] ns proc: Add support for the mount namespace Eric W. Biederman
2010-09-23  8:51 ` [PATCH 7/8] net: Allow setting the network namespace by fd Eric W. Biederman
2010-09-23  9:41   ` Eric Dumazet
2010-09-23 16:03     ` Eric W. Biederman
2010-09-23 11:22   ` jamal
2010-09-23 14:58     ` David Lamparter
2010-09-24 11:51       ` jamal
2010-09-24 12:57         ` David Lamparter
2010-09-24 13:32           ` jamal
2010-09-24 14:09             ` David Lamparter
2010-09-24 14:16               ` jamal
2010-09-23 15:14     ` Eric W. Biederman
2010-09-23 14:22   ` Brian Haley
2010-09-23 16:16     ` Eric W. Biederman
2010-09-24 13:46   ` Daniel Lezcano
2010-09-23  8:51 ` [PATCH 8/8] net: Implement socketat Eric W. Biederman
2010-09-23  8:56   ` Pavel Emelyanov
2010-09-23 11:19     ` jamal
2010-09-23 11:33       ` Pavel Emelyanov
2010-09-23 11:40         ` jamal
2010-09-23 11:53           ` Pavel Emelyanov
2010-09-23 12:11             ` jamal
2010-09-23 12:34               ` Pavel Emelyanov
2010-09-23 14:54                 ` David Lamparter
2010-09-23 15:00                 ` Eric W. Biederman
2010-10-02 21:13             ` Daniel Lezcano
2010-10-03 13:44               ` jamal
2010-10-04 10:13                 ` Daniel Lezcano
2010-10-04 19:07                 ` Eric W. Biederman
2010-10-15 12:30                 ` netns patches WAS( " jamal
2010-10-26 20:52                   ` jamal
2010-10-27  0:27                     ` Eric W. Biederman
2010-09-23 15:18 ` [ABI REVIEW][PATCH 0/8] Namespace file descriptors David Lamparter
2010-09-23 16:32   ` Eric W. Biederman
2010-09-23 16:49     ` David Lamparter
2010-09-24 13:02 ` Andrew Lutomirski
2010-09-24 13:49   ` Daniel Lezcano
2010-09-24 17:06     ` Eric W. Biederman
