linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Fix /proc/net in presence of net namespaces
@ 2008-02-28 15:46 Pavel Emelyanov
  2008-02-28 15:49 ` [PATCH 1/2] Add an id to struct net Pavel Emelyanov
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-28 15:46 UTC (permalink / raw)
  To: Andrew Morton, David Miller
  Cc: Alexey Dobriyan, Linux Netdev List, Linux Kernel Mailing List,
	Eric W. Biederman

Current /proc/net is done with so called "shadows", but current
implementation is broken and has little chances to get fixed.

The problem is that dentries subtree of /proc/net directory has
fancy revalidation rules to make processes living in different
net namespaces see different entries in /proc/net subtree, but
currently, tasks see in the /proc/net subdir the contents of any
other namespace, depending on who opened the file first.

The proposed fix is to turn /proc/net into a symlink, which behaves
similar to /proc/self link - it points to .netns/<id> directory 
where the <id> is the id of net namespace, current task lives in.

# ls -l /proc/net
lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0

The /proc/.netns dir contains subtrees for all the namespaces in 
the system:

# ls -l /proc/.netns/
total 0
dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
dr-xr-xr-x  3 root root 0 Feb 28 18:39 1

To provide some security each /proc/.netns/<id> directory allows
access to tasks that live in the owning namespace only (with the
exception, that init_net tasks can see everything).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] Add an id to struct net
  2008-02-28 15:46 [PATCH 0/2] Fix /proc/net in presence of net namespaces Pavel Emelyanov
@ 2008-02-28 15:49 ` Pavel Emelyanov
  2008-02-28 15:51 ` [PATCH 2/2] Make /proc/net a symlink and drop proc shadows Pavel Emelyanov
  2008-02-28 19:31 ` [PATCH 0/2] Fix /proc/net in presence of net namespaces Eric W. Biederman
  2 siblings, 0 replies; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-28 15:49 UTC (permalink / raw)
  To: Andrew Morton, David Miller
  Cc: Alexey Dobriyan, Linux Netdev List, Linux Kernel Mailing List,
	Eric W. Biederman

This is required to implement the /proc/net symlink, but can later 
be used to enhance networking hash functions - objects like sockets,
fragments, twbuckets, etc should better be stored in different hash
chains when they live in different namespaces.

Id generation is done via ida to handle the id overflowing.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 28738b7..8aedfd6 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -26,6 +26,7 @@ struct net {
 	atomic_t		use_count;	/* To track references we
 						 * destroy on demand
 						 */
+	int			id;
 	struct list_head	list;		/* list of network namespaces */
 	struct work_struct	work;		/* work struct for freeing */
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7b66083..31b018a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -6,6 +6,7 @@
 #include <linux/delay.h>
 #include <linux/sched.h>
 #include <net/net_namespace.h>
+#include <linux/idr.h>
 
 /*
  *	Our network namespace constructor/destructor lists
@@ -20,6 +21,8 @@ LIST_HEAD(net_namespace_list);
 struct net init_net;
 EXPORT_SYMBOL(init_net);
 
+static struct ida net_ns_ids;
+
 /*
  * setup_net runs the initializers for the network namespace object.
  */
@@ -31,8 +34,17 @@ static __net_init int setup_net(struct net *net)
 
 	atomic_set(&net->count, 1);
 	atomic_set(&net->use_count, 0);
+retry:
+	error = ida_get_new(&net_ns_ids, &net->id);
+	if (error) {
+		if (error == -EAGAIN) {
+			ida_pre_get(&net_ns_ids, GFP_KERNEL);
+			goto retry;
+		}
+
+		goto out;
+	}
 
-	error = 0;
 	list_for_each_entry(ops, &pernet_list, list) {
 		if (ops->init) {
 			error = ops->init(net);
@@ -53,6 +65,8 @@ out_undo:
 	}
 
 	rcu_barrier();
+
+	ida_remove(&net_ns_ids, net->id);
 	goto out;
 }
 
@@ -142,6 +156,7 @@ static void cleanup_net(struct work_struct *work)
 	 */
 	rcu_barrier();
 
+	ida_remove(&net_ns_ids, net->id);
 	/* Finally it is safe to free my network namespace structure */
 	net_free(net);
 }
@@ -179,6 +194,8 @@ static int __init net_ns_init(void)
 		panic("Could not create netns workq");
 #endif
 
+	ida_init(&net_ns_ids);
+
 	mutex_lock(&net_mutex);
 	err = setup_net(&init_net);
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] Make /proc/net a symlink and drop proc shadows
  2008-02-28 15:46 [PATCH 0/2] Fix /proc/net in presence of net namespaces Pavel Emelyanov
  2008-02-28 15:49 ` [PATCH 1/2] Add an id to struct net Pavel Emelyanov
@ 2008-02-28 15:51 ` Pavel Emelyanov
  2008-02-28 19:31 ` [PATCH 0/2] Fix /proc/net in presence of net namespaces Eric W. Biederman
  2 siblings, 0 replies; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-28 15:51 UTC (permalink / raw)
  To: Andrew Morton, David Miller
  Cc: Alexey Dobriyan, Linux Netdev List, Linux Kernel Mailing List,
	Eric W. Biederman

Turn /proc/net into a symlink, which behaves similar to /proc/self
link - it points to .netns/<id> directory where the <id> is the id
of net namespace, current task lives in.

# ls -l /proc/net
lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0

The /proc/.netns dir contains subtrees for all the namespaces in 
the system:

# ls -l /proc/.netns/
total 0
dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
dr-xr-xr-x  3 root root 0 Feb 28 18:39 1

To provide some security each /proc/.netns/<id> directory allows
access to tasks that live in the owning namespace only (with the
exception, that init_net tasks can see everything).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 68971e6..d4103e5 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -227,7 +227,7 @@ static const struct file_operations proc_file_operations = {
 	.write		= proc_file_write,
 };
 
-static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
+int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
 {
 	struct inode *inode = dentry->d_inode;
 	struct proc_dir_entry *de = PDE(inode);
@@ -248,7 +248,7 @@ out:
 	return error;
 }
 
-static int proc_getattr(struct vfsmount *mnt, struct dentry *dentry,
+int proc_getattr(struct vfsmount *mnt, struct dentry *dentry,
 			struct kstat *stat)
 {
 	struct inode *inode = dentry->d_inode;
@@ -393,8 +393,6 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam
 			if (!memcmp(dentry->d_name.name, de->name, de->namelen)) {
 				unsigned int ino;
 
-				if (de->shadow_proc)
-					de = de->shadow_proc(current, de);
 				ino = de->low_ino;
 				de_get(de);
 				spin_unlock(&proc_subdir_lock);
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 14e9b5a..8000ea4 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -83,14 +83,6 @@ struct net *get_proc_net(const struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(get_proc_net);
 
-static struct proc_dir_entry *shadow_pde;
-
-static struct proc_dir_entry *proc_net_shadow(struct task_struct *task,
-						struct proc_dir_entry *de)
-{
-	return task->nsproxy->net_ns->proc_net;
-}
-
 struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
 		struct proc_dir_entry *parent)
 {
@@ -102,47 +94,61 @@ struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
 }
 EXPORT_SYMBOL_GPL(proc_net_mkdir);
 
+static int proc_netns_id_permission(struct inode *inode, int mask,
+		struct nameidata *nd)
+{
+	struct net *net = current->nsproxy->net_ns;
+
+	if (net == &init_net || net == PDE(inode)->data)
+		return generic_permission(inode, mask, NULL);
+	else
+		return -EACCES;
+}
+
+static const struct inode_operations proc_netns_id_ops = {
+	.lookup		= proc_lookup,
+	.getattr	= proc_getattr,
+	.setattr	= proc_notify_change,
+	.permission	= proc_netns_id_permission,
+};
+
+static struct proc_dir_entry *netns_dir;
+
 static __net_init int proc_net_ns_init(struct net *net)
 {
-	struct proc_dir_entry *root, *netd, *net_statd;
-	int err;
+	struct proc_dir_entry *netd, *net_statd;
+	char id[PROC_NUMBUF];
 
-	err = -ENOMEM;
-	root = kzalloc(sizeof(*root), GFP_KERNEL);
-	if (!root)
-		goto out;
+	snprintf(id, sizeof(id), "%d", net->id);
 
-	err = -EEXIST;
-	netd = proc_net_mkdir(net, "net", root);
+	netd = proc_net_mkdir(net, id, netns_dir);
 	if (!netd)
-		goto free_root;
+		goto out;
+
+	netd->proc_iops = &proc_netns_id_ops;
 
-	err = -EEXIST;
 	net_statd = proc_net_mkdir(net, "stat", netd);
 	if (!net_statd)
 		goto free_net;
 
-	root->data = net;
-
-	net->proc_net_root = root;
 	net->proc_net = netd;
 	net->proc_net_stat = net_statd;
-	err = 0;
+	return 0;
 
-out:
-	return err;
 free_net:
-	remove_proc_entry("net", root);
-free_root:
-	kfree(root);
-	goto out;
+	remove_proc_entry(id, netns_dir);
+out:
+	return -ENOMEM;
 }
 
 static __net_exit void proc_net_ns_exit(struct net *net)
 {
+	char id[PROC_NUMBUF];
+
+	snprintf(id, sizeof(id), "%d", net->id);
+
 	remove_proc_entry("stat", net->proc_net);
-	remove_proc_entry("net", net->proc_net_root);
-	kfree(net->proc_net_root);
+	remove_proc_entry(id, netns_dir);
 }
 
 static struct pernet_operations __net_initdata proc_net_ns_ops = {
@@ -150,10 +156,36 @@ static struct pernet_operations __net_initdata proc_net_ns_ops = {
 	.exit = proc_net_ns_exit,
 };
 
+static int proc_net_readlink(struct dentry *dentry, char __user *buffer,
+			      int buflen)
+{
+	char tmp[PROC_NUMBUF];
+
+	sprintf(tmp, ".netns/%d", current->nsproxy->net_ns->id);
+	return vfs_readlink(dentry, buffer, buflen, tmp);
+}
+
+static void *proc_net_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	char tmp[PROC_NUMBUF];
+
+	sprintf(tmp, ".netns/%d", current->nsproxy->net_ns->id);
+	return ERR_PTR(vfs_follow_link(nd, tmp));
+}
+
+static const struct inode_operations proc_net_link_ops = {
+	.readlink	= proc_net_readlink,
+	.follow_link	= proc_net_follow_link,
+};
+
 int __init proc_net_init(void)
 {
-	shadow_pde = proc_mkdir("net", NULL);
-	shadow_pde->shadow_proc = proc_net_shadow;
+	struct proc_dir_entry *nnet;
+
+	netns_dir = proc_mkdir(".netns", NULL);
+	nnet = proc_symlink("net", NULL, ".netns/0");
+
+	nnet->proc_iops = &proc_net_link_ops;
 
 	return register_pernet_subsys(&proc_net_ns_ops);
 }
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d9a9e71..be0c9b7 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -82,7 +82,6 @@ struct proc_dir_entry {
 	int pde_users;	/* number of callers into module in progress */
 	spinlock_t pde_unload_lock; /* proc_fops checks and pde_users bumps */
 	struct completion *pde_unload_completion;
-	shadow_proc_t *shadow_proc;
 };
 
 struct kcore_list {
@@ -145,6 +144,9 @@ extern struct inode *proc_get_inode(struct super_block *, unsigned int, struct p
  */
 extern int proc_readdir(struct file *, void *, filldir_t);
 extern struct dentry *proc_lookup(struct inode *, struct dentry *, struct nameidata *);
+extern int proc_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			struct kstat *stat);
+extern int proc_notify_change(struct dentry *dentry, struct iattr *iattr);
 
 extern const struct file_operations proc_kcore_operations;
 extern const struct file_operations proc_kmsg_operations;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 28738b7..8aedfd6 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -32,7 +32,6 @@ struct net {
 
 	struct proc_dir_entry 	*proc_net;
 	struct proc_dir_entry 	*proc_net_stat;
-	struct proc_dir_entry 	*proc_net_root;
 
 	struct list_head	sysctl_table_headers;
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 15:46 [PATCH 0/2] Fix /proc/net in presence of net namespaces Pavel Emelyanov
  2008-02-28 15:49 ` [PATCH 1/2] Add an id to struct net Pavel Emelyanov
  2008-02-28 15:51 ` [PATCH 2/2] Make /proc/net a symlink and drop proc shadows Pavel Emelyanov
@ 2008-02-28 19:31 ` Eric W. Biederman
  2008-02-28 21:17   ` serge
  2008-02-29  7:42   ` Pavel Emelyanov
  2 siblings, 2 replies; 20+ messages in thread
From: Eric W. Biederman @ 2008-02-28 19:31 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, David Miller, Alexey Dobriyan, Linux Netdev List,
	Linux Kernel Mailing List

Pavel Emelyanov <xemul@openvz.org> writes:

> Current /proc/net is done with so called "shadows", but current
> implementation is broken and has little chances to get fixed.
>
> The problem is that dentries subtree of /proc/net directory has
> fancy revalidation rules to make processes living in different
> net namespaces see different entries in /proc/net subtree, but
> currently, tasks see in the /proc/net subdir the contents of any
> other namespace, depending on who opened the file first.
>
> The proposed fix is to turn /proc/net into a symlink, which behaves
> similar to /proc/self link - it points to .netns/<id> directory 
> where the <id> is the id of net namespace, current task lives in.
>
> # ls -l /proc/net
> lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0
>
> The /proc/.netns dir contains subtrees for all the namespaces in 
> the system:
>
> # ls -l /proc/.netns/
> total 0
> dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
> dr-xr-xr-x  3 root root 0 Feb 28 18:39 1
>
> To provide some security each /proc/.netns/<id> directory allows
> access to tasks that live in the owning namespace only (with the
> exception, that init_net tasks can see everything).


Nack.  Yet another global set of ids that require us to implement another
namespace looks like the wrong way to go.

Can you try this approach by capturing a struct pid instead of an id
in a new global namespace? 

In particular the pid of the process that creates the pid namespace.
Like we do with setsid.

I think the implementation difficulty should be about the same, but
it will allow us something that works cleanly in the cases of
migration and nested namespaces.  As well as not adding an unnecessary
special case with init_net and visibility.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 19:31 ` [PATCH 0/2] Fix /proc/net in presence of net namespaces Eric W. Biederman
@ 2008-02-28 21:17   ` serge
  2008-02-28 22:39     ` Eric W. Biederman
  2008-02-29  7:44     ` Pavel Emelyanov
  2008-02-29  7:42   ` Pavel Emelyanov
  1 sibling, 2 replies; 20+ messages in thread
From: serge @ 2008-02-28 21:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Pavel Emelyanov <xemul@openvz.org> writes:
> 
> > Current /proc/net is done with so called "shadows", but current
> > implementation is broken and has little chances to get fixed.
> >
> > The problem is that dentries subtree of /proc/net directory has
> > fancy revalidation rules to make processes living in different
> > net namespaces see different entries in /proc/net subtree, but
> > currently, tasks see in the /proc/net subdir the contents of any
> > other namespace, depending on who opened the file first.
> >
> > The proposed fix is to turn /proc/net into a symlink, which behaves
> > similar to /proc/self link - it points to .netns/<id> directory 
> > where the <id> is the id of net namespace, current task lives in.
> >
> > # ls -l /proc/net
> > lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0
> >
> > The /proc/.netns dir contains subtrees for all the namespaces in 
> > the system:
> >
> > # ls -l /proc/.netns/
> > total 0
> > dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
> > dr-xr-xr-x  3 root root 0 Feb 28 18:39 1
> >
> > To provide some security each /proc/.netns/<id> directory allows
> > access to tasks that live in the owning namespace only (with the
> > exception, that init_net tasks can see everything).
> 
> 
> Nack.  Yet another global set of ids that require us to implement another
> namespace looks like the wrong way to go.

Sentiment granted, but I'm not sure it can be an issue.  It *could* be
in issue if we moved to a more flexible access control here here any
netns could access the .netns/N directories for all it's child
namespaces.

But it can't, and /proc/net is set by the kernel.  So the <id> can't be
an issue for any checkpoint/restart except htat of the whole system, and
of course on whole-system resume we have no <id> collision worries.

So userspace can't do anything with <id>, so there is no reason to worry
about it becoming another namespace?

Right?

thanks,
-serge

> Can you try this approach by capturing a struct pid instead of an id
> in a new global namespace? 
> 
> In particular the pid of the process that creates the pid namespace.
> Like we do with setsid.
> 
> I think the implementation difficulty should be about the same, but
> it will allow us something that works cleanly in the cases of
> migration and nested namespaces.  As well as not adding an unnecessary
> special case with init_net and visibility.
> 
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 21:17   ` serge
@ 2008-02-28 22:39     ` Eric W. Biederman
  2008-02-29  3:17       ` serge
  2008-02-29  7:58       ` Pavel Emelyanov
  2008-02-29  7:44     ` Pavel Emelyanov
  1 sibling, 2 replies; 20+ messages in thread
From: Eric W. Biederman @ 2008-02-28 22:39 UTC (permalink / raw)
  To: serge
  Cc: Pavel Emelyanov, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

serge@hallyn.com writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Pavel Emelyanov <xemul@openvz.org> writes:
>> 
>> > Current /proc/net is done with so called "shadows", but current
>> > implementation is broken and has little chances to get fixed.
>> >
>> > The problem is that dentries subtree of /proc/net directory has
>> > fancy revalidation rules to make processes living in different
>> > net namespaces see different entries in /proc/net subtree, but
>> > currently, tasks see in the /proc/net subdir the contents of any
>> > other namespace, depending on who opened the file first.
>> >
>> > The proposed fix is to turn /proc/net into a symlink, which behaves
>> > similar to /proc/self link - it points to .netns/<id> directory 
>> > where the <id> is the id of net namespace, current task lives in.
>> >
>> > # ls -l /proc/net
>> > lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0
>> >
>> > The /proc/.netns dir contains subtrees for all the namespaces in 
>> > the system:
>> >
>> > # ls -l /proc/.netns/
>> > total 0
>> > dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
>> > dr-xr-xr-x  3 root root 0 Feb 28 18:39 1
>> >
>> > To provide some security each /proc/.netns/<id> directory allows
>> > access to tasks that live in the owning namespace only (with the
>> > exception, that init_net tasks can see everything).
>> 
>> 
>> Nack.  Yet another global set of ids that require us to implement another
>> namespace looks like the wrong way to go.
>
> Sentiment granted, but I'm not sure it can be an issue.  It *could* be
> in issue if we moved to a more flexible access control here here any
> netns could access the .netns/N directories for all it's child
> namespaces.

However at least for visibility and inspection we want that.
We want to inspect what is happening to other processes.  If we didn't
care then all of the pid namespaces could just be disjoint.

Providing interfaces where people can inspect what is going on through
the filesystem is very natural, and a lot easier to support long term
then adding a whole new set of interfaces for debuggers and the like.

> But it can't, and /proc/net is set by the kernel.  So the <id> can't be
> an issue for any checkpoint/restart except htat of the whole system, and
> of course on whole-system resume we have no <id> collision worries.
>
> So userspace can't do anything with <id>, so there is no reason to worry
> about it becoming another namespace?

I was thinking we might be able to hide the existence of
/proc/.netns/NNN/  however we can read the current working directory.
So even if we only allow explicit access through /proc/net and all
others paths don't work we have something that is visible.

So we really need something that we are not afraid to air in public.
That we are not afraid to use and have it's use expanded upon.

> Right?

Think of user space processes inspecting /proc etc.  Having directory
names change out form under you for no apparent reason is pretty nasty.

Plus we have the consequence that a user space visible id is likely to
get used for reporting in user space programs.  Reporting that will go
haywire on a migration event.

And if the id is used in reporting people are likely to want to use the
id for control (so this may be the edge of a slippery slope).

Things like inode numbers that are a secondary effect are enough of a problem
when looking at how things interact.  A directly visible user space visible
id is a problem.

All we need to do if we use a pid as an id is:
- Have one directory .netns with all of the net directories listed by pid.
- Have readdir and lookup filter the directory entries by the pid
  namespace of the proc mount.

It looks like we have to tweak things just a bit so that free_pid
would not be called until the pid namespace goes away.  Something
similar to how we do the hash chains.

If we make namespaces show up anywhere besides under
"/proc/<pid>/task/<tid>/" we have to do something like this, and pids
are largely designed for this kind of use.

It looks like the way /proc is currently structured we don't need a
reverse map from pid to net namespace.  But I would not have a problem
with that.

Our limitations are:
- We need an inviolate dentry tree of the VFS dcache goes nuts.
- We need an id that is in a namespace, or else we get pushed
  into the yet another namespace problem.
- We want to aim for minimal dentry duplication, to keep resource
  consumption under control.  Which makes /proc/<pid>/task/<tid>/net
  an unfortunate choice.

So I think /proc/.netns/ or simply /proc/netns/ is a good choice. We
just need a non-global id for our directory entries so we don't paint
ourselves into a corner.

And honestly pid visibility is a very natural choice for which network
namespaces you can see.  You can see the namespace of any process you
can see.  Which especially means your children.  It is an arbitrary
rule, it is a simple rule to explain, and it works recursively unlike
any init_net is special rule.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 22:39     ` Eric W. Biederman
@ 2008-02-29  3:17       ` serge
  2008-02-29  8:16         ` Pavel Emelyanov
  2008-02-29  7:58       ` Pavel Emelyanov
  1 sibling, 1 reply; 20+ messages in thread
From: serge @ 2008-02-29  3:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: serge, Pavel Emelyanov, Andrew Morton, David Miller,
	Alexey Dobriyan, Linux Netdev List, Linux Kernel Mailing List

Quoting Eric W. Biederman (ebiederm@xmission.com):
> serge@hallyn.com writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Pavel Emelyanov <xemul@openvz.org> writes:
> >> 
> >> > Current /proc/net is done with so called "shadows", but current
> >> > implementation is broken and has little chances to get fixed.
> >> >
> >> > The problem is that dentries subtree of /proc/net directory has
> >> > fancy revalidation rules to make processes living in different
> >> > net namespaces see different entries in /proc/net subtree, but
> >> > currently, tasks see in the /proc/net subdir the contents of any
> >> > other namespace, depending on who opened the file first.
> >> >
> >> > The proposed fix is to turn /proc/net into a symlink, which behaves
> >> > similar to /proc/self link - it points to .netns/<id> directory 
> >> > where the <id> is the id of net namespace, current task lives in.
> >> >
> >> > # ls -l /proc/net
> >> > lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0
> >> >
> >> > The /proc/.netns dir contains subtrees for all the namespaces in 
> >> > the system:
> >> >
> >> > # ls -l /proc/.netns/
> >> > total 0
> >> > dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
> >> > dr-xr-xr-x  3 root root 0 Feb 28 18:39 1
> >> >
> >> > To provide some security each /proc/.netns/<id> directory allows
> >> > access to tasks that live in the owning namespace only (with the
> >> > exception, that init_net tasks can see everything).
> >> 
> >> 
> >> Nack.  Yet another global set of ids that require us to implement another
> >> namespace looks like the wrong way to go.
> >
> > Sentiment granted, but I'm not sure it can be an issue.  It *could* be
> > in issue if we moved to a more flexible access control here here any
> > netns could access the .netns/N directories for all it's child
> > namespaces.
> 
> However at least for visibility and inspection we want that.
> We want to inspect what is happening to other processes.  If we didn't
> care then all of the pid namespaces could just be disjoint.

But the way Pavel has it coded, only tasks in the init namespace can
view /proc/.net/* for any other net namespaces.

If you object to that (I'm undecided - though on a gut level I do prefer
a simpler approach using bind mounts and not forcing any cross-ns access
control this way) then the ids become relevant, but the way Pavel has
it, it's just not.

> Providing interfaces where people can inspect what is going on through
> the filesystem is very natural, and a lot easier to support long term
> then adding a whole new set of interfaces for debuggers and the like.
> 
> > But it can't, and /proc/net is set by the kernel.  So the <id> can't be
> > an issue for any checkpoint/restart except htat of the whole system, and
> > of course on whole-system resume we have no <id> collision worries.
> >
> > So userspace can't do anything with <id>, so there is no reason to worry
> > about it becoming another namespace?
> 
> I was thinking we might be able to hide the existence of
> /proc/.netns/NNN/  however we can read the current working directory.
> So even if we only allow explicit access through /proc/net and all
> others paths don't work we have something that is visible.
> 
> So we really need something that we are not afraid to air in public.
> That we are not afraid to use and have it's use expanded upon.
> 
> > Right?
> 
> Think of user space processes inspecting /proc etc.

Well a task in one pid namespace cannot view the /proc for another pid
namespace, right?

> Having directory
> names change out form under you for no apparent reason is pretty nasty.

Yes, but they won't just 'change out' from under you, you ask for it...
But here like I said I do prefer an approach where /proc/net is bind
mounted by the user.  But I have no good reason to back it up...

> Plus we have the consequence that a user space visible id is likely to
> get used for reporting in user space programs.  Reporting that will go
> haywire on a migration event.
> 
> And if the id is used in reporting people are likely to want to use the
> id for control (so this may be the edge of a slippery slope).
> 
> Things like inode numbers that are a secondary effect are enough of a problem
> when looking at how things interact.  A directly visible user space visible
> id is a problem.
> 
> All we need to do if we use a pid as an id is:
> - Have one directory .netns with all of the net directories listed by pid.

Which pid will be used?  This .netns directory still is systemwide so
these pids  must (a) be globally unique and (b) be available (by your
requirements :)  upon a container restart so that the restarted process
can re-use the same pid...

I don't see the advantage to using a pid.

> - Have readdir and lookup filter the directory entries by the pid
>   namespace of the proc mount.

Yuck, I far prefer to have userspace mount --bind /proc/.net/whatever
to /proc/net rather than have the readdir and lookup do magic filtering.

> It looks like we have to tweak things just a bit so that free_pid
> would not be called until the pid namespace goes away.  Something
> similar to how we do the hash chains.

When you say pid are you talking about a pid_nr or a struct pid?  In any
case, I still don't see what makes it available for reuse on a
restart...

And if not available for reuse on a restart, then we may as well use a
simple global unique id as Pavel uses.

> If we make namespaces show up anywhere besides under
> "/proc/<pid>/task/<tid>/" we have to do something like this, and pids
> are largely designed for this kind of use.

> It looks like the way /proc is currently structured we don't need a
> reverse map from pid to net namespace.  But I would not have a problem
> with that.
> 
> Our limitations are:
> - We need an inviolate dentry tree of the VFS dcache goes nuts.
> - We need an id that is in a namespace, or else we get pushed
>   into the yet another namespace problem.
> - We want to aim for minimal dentry duplication, to keep resource
>   consumption under control.  Which makes /proc/<pid>/task/<tid>/net
>   an unfortunate choice.
> 
> So I think /proc/.netns/ or simply /proc/netns/ is a good choice. We
> just need a non-global id for our directory entries so we don't paint
> ourselves into a corner.

But I don't see how this is going to work.  If you're using a pid_nr
inside a pid namespace, then you're not guaranteeing that the pid_nr
will be unique, so you may not be able to create th e/proc/.netns/X
directory.  If you're using the global pid, then you're not guaranteeing
that it will be available upon a container restart.

> And honestly pid visibility is a very natural choice for which network
> namespaces you can see.  You can see the namespace of any process you
> can see.  Which especially means your children.  It is an arbitrary
> rule, it is a simple rule to explain, and it works recursively unlike

But apparently not simple enough for a simpleton like me to understand,
sorry :(

> any init_net is special rule.
> 
> Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 19:31 ` [PATCH 0/2] Fix /proc/net in presence of net namespaces Eric W. Biederman
  2008-02-28 21:17   ` serge
@ 2008-02-29  7:42   ` Pavel Emelyanov
  2008-03-02  2:29     ` Eric W. Biederman
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-29  7:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, David Miller, Alexey Dobriyan, Linux Netdev List,
	Linux Kernel Mailing List

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@openvz.org> writes:
> 
>> Current /proc/net is done with so called "shadows", but current
>> implementation is broken and has little chances to get fixed.
>>
>> The problem is that dentries subtree of /proc/net directory has
>> fancy revalidation rules to make processes living in different
>> net namespaces see different entries in /proc/net subtree, but
>> currently, tasks see in the /proc/net subdir the contents of any
>> other namespace, depending on who opened the file first.
>>
>> The proposed fix is to turn /proc/net into a symlink, which behaves
>> similar to /proc/self link - it points to .netns/<id> directory 
>> where the <id> is the id of net namespace, current task lives in.
>>
>> # ls -l /proc/net
>> lrwxrwxrwx  1 root root 8 Feb 28 18:38 /proc/net -> .netns/0
>>
>> The /proc/.netns dir contains subtrees for all the namespaces in 
>> the system:
>>
>> # ls -l /proc/.netns/
>> total 0
>> dr-xr-xr-x  5 root root 0 Feb 28 18:39 0
>> dr-xr-xr-x  3 root root 0 Feb 28 18:39 1
>>
>> To provide some security each /proc/.netns/<id> directory allows
>> access to tasks that live in the owning namespace only (with the
>> exception, that init_net tasks can see everything).
> 
> 
> Nack.  Yet another global set of ids that require us to implement another
> namespace looks like the wrong way to go.

I could use the struct net pointer values (obtained with sprintf(id, "%p", net))
instead, but exporting internal kernel addresses seemed even uglier.

> Can you try this approach by capturing a struct pid instead of an id
> in a new global namespace? 

This is a bad approach. When task, that created the namespace dies, his
pid is removed from the pidmap and can be reused, so we can get another
net with the same id.

> In particular the pid of the process that creates the pid namespace.
> Like we do with setsid.
> 
> I think the implementation difficulty should be about the same, but
> it will allow us something that works cleanly in the cases of
> migration and nested namespaces.  As well as not adding an unnecessary
> special case with init_net and visibility.

This net's id is not supposed to be used to address any net in the kernel.
And I see no problems with migration - you can change the net's id safely
during checkpoint/restart - tasks will always see this one via the /proc/net
symlink, which is dynamic.

> Eric
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 21:17   ` serge
  2008-02-28 22:39     ` Eric W. Biederman
@ 2008-02-29  7:44     ` Pavel Emelyanov
  1 sibling, 0 replies; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-29  7:44 UTC (permalink / raw)
  To: serge
  Cc: Eric W. Biederman, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

>> Nack.  Yet another global set of ids that require us to implement another
>> namespace looks like the wrong way to go.
> 
> Sentiment granted, but I'm not sure it can be an issue.  It *could* be
> in issue if we moved to a more flexible access control here here any
> netns could access the .netns/N directories for all it's child
> namespaces.
> 
> But it can't, and /proc/net is set by the kernel.  So the <id> can't be
> an issue for any checkpoint/restart except htat of the whole system, and
> of course on whole-system resume we have no <id> collision worries.
> 
> So userspace can't do anything with <id>, so there is no reason to worry
> about it becoming another namespace?
> 
> Right?

Right. Thanks, Serge.

> thanks,
> -serge

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-28 22:39     ` Eric W. Biederman
  2008-02-29  3:17       ` serge
@ 2008-02-29  7:58       ` Pavel Emelyanov
  2008-03-02  2:03         ` Eric W. Biederman
  2008-03-02  2:17         ` Eric W. Biederman
  1 sibling, 2 replies; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-29  7:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: serge, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

> I was thinking we might be able to hide the existence of
> /proc/.netns/NNN/  however we can read the current working directory.
> So even if we only allow explicit access through /proc/net and all
> others paths don't work we have something that is visible.

I have a patch that overrides the ->readdir method for /proc/.netns,
so that you can no longer read the directory contents, but you still
can guess one by guessing and opening files in it. Overriding the 
->lookup to screw one up looks like "shadowing" technics.

OTOH - consider you have the ids of existing net namespaces, but cannot 
read the contents on any but yours. So what? This information is useless
for you. So I dropped this part of a patch.

> So we really need something that we are not afraid to air in public.
> That we are not afraid to use and have it's use expanded upon.
> 
>> Right?
> 
> Think of user space processes inspecting /proc etc.  Having directory
> names change out form under you for no apparent reason is pretty nasty.

Have you ever bothered about /proc/<pid> change?

> Plus we have the consequence that a user space visible id is likely to
> get used for reporting in user space programs.  Reporting that will go
> haywire on a migration event.
>
> And if the id is used in reporting people are likely to want to use the
> id for control (so this may be the edge of a slippery slope).
> 
> Things like inode numbers that are a secondary effect are enough of a problem
> when looking at how things interact.  A directly visible user space visible
> id is a problem.
> 
> All we need to do if we use a pid as an id is:
> - Have one directory .netns with all of the net directories listed by pid.

We have one now.

> - Have readdir and lookup filter the directory entries by the pid
>   namespace of the proc mount.

So, how are you going to filter the lookup? The problem I see - you have
a process that opened the /proc/.netns/X directory (he onws that namespace)
and the other one trying to do the same. The VFS layer finds the hashed
dentry corresponding to this /proc/.netns/X. The only way you can prevent
VFS from giving one to the second task is to override .d_revalidate method 
and drop that dentry....

But we've already tried to walk this way with no luck.

> It looks like we have to tweak things just a bit so that free_pid
> would not be called until the pid namespace goes away.  Something
> similar to how we do the hash chains.

This is not about pid namespace, this is about net namespace and
tuning pids management to facilitate networking needs is not the right
thing to do.

> If we make namespaces show up anywhere besides under
> "/proc/<pid>/task/<tid>/" we have to do something like this, and pids
> are largely designed for this kind of use.

Proc consists of two parts - the <pid>-s one with generated-on-the-fly
entries and the static one that is represented by proc_dir_entry tree.
Do you propose to mix those two?

> It looks like the way /proc is currently structured we don't need a
> reverse map from pid to net namespace.  But I would not have a problem
> with that.
> 
> Our limitations are:
> - We need an inviolate dentry tree of the VFS dcache goes nuts.
> - We need an id that is in a namespace, or else we get pushed
>   into the yet another namespace problem.
> - We want to aim for minimal dentry duplication, to keep resource
>   consumption under control.  Which makes /proc/<pid>/task/<tid>/net
>   an unfortunate choice.
> 
> So I think /proc/.netns/ or simply /proc/netns/ is a good choice. We

Thanks.

> just need a non-global id for our directory entries so we don't paint
> ourselves into a corner.

What namespace do you mean by "non-global"?

> And honestly pid visibility is a very natural choice for which network
> namespaces you can see.  You can see the namespace of any process you
> can see.  Which especially means your children.  It is an arbitrary
> rule, it is a simple rule to explain, and it works recursively unlike
> any init_net is special rule.
> 
> Eric
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-29  3:17       ` serge
@ 2008-02-29  8:16         ` Pavel Emelyanov
  2008-02-29 15:38           ` serge
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Emelyanov @ 2008-02-29  8:16 UTC (permalink / raw)
  To: serge
  Cc: Eric W. Biederman, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

[snip]

>> However at least for visibility and inspection we want that.
>> We want to inspect what is happening to other processes.  If we didn't
>> care then all of the pid namespaces could just be disjoint.
> 
> But the way Pavel has it coded, only tasks in the init namespace can
> view /proc/.net/* for any other net namespaces.

This is not essential part of the patch :) I did so to make my
life in init namespace easier. This part can be dropped.

[snip]

>> Think of user space processes inspecting /proc etc.
> 
> Well a task in one pid namespace cannot view the /proc for another pid
> namespace, right?

No. Pid namespace provides another pid-to-task map, but have noting
to do with VFS visibility.

>> Having directory
>> names change out form under you for no apparent reason is pretty nasty.
> 
> Yes, but they won't just 'change out' from under you, you ask for it...
> But here like I said I do prefer an approach where /proc/net is bind
> mounted by the user.  But I have no good reason to back it up...

If you make /proc/net be a bund mount then you have to force all of
the users to update their init scripts. I tried to make so with sysctl 
filesystem, but this thread was not very popular :)

Another way to do so - is to mount this one from inside the kernel, 
but are you ready to fight with Al Viro for this? :)

[snip]

>> So I think /proc/.netns/ or simply /proc/netns/ is a good choice. We
>> just need a non-global id for our directory entries so we don't paint
>> ourselves into a corner.
> 
> But I don't see how this is going to work.  If you're using a pid_nr
> inside a pid namespace, then you're not guaranteeing that the pid_nr
> will be unique, so you may not be able to create th e/proc/.netns/X
> directory.  If you're using the global pid, then you're not guaranteeing
> that it will be available upon a container restart.

Right - this is the same as using other ids, which I propose.

>> And honestly pid visibility is a very natural choice for which network
>> namespaces you can see.  You can see the namespace of any process you
>> can see.  Which especially means your children.  It is an arbitrary
>> rule, it is a simple rule to explain, and it works recursively unlike
> 
> But apparently not simple enough for a simpleton like me to understand,
> sorry :(
> 
>> any init_net is special rule.
>>
>> Eric
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-29  8:16         ` Pavel Emelyanov
@ 2008-02-29 15:38           ` serge
  0 siblings, 0 replies; 20+ messages in thread
From: serge @ 2008-02-29 15:38 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: serge, Eric W. Biederman, Andrew Morton, David Miller,
	Alexey Dobriyan, Linux Netdev List, Linux Kernel Mailing List

Quoting Pavel Emelyanov (xemul@openvz.org):
> [snip]
> 
> >> However at least for visibility and inspection we want that.
> >> We want to inspect what is happening to other processes.  If we didn't
> >> care then all of the pid namespaces could just be disjoint.
> > 
> > But the way Pavel has it coded, only tasks in the init namespace can
> > view /proc/.net/* for any other net namespaces.
> 
> This is not essential part of the patch :) I did so to make my
> life in init namespace easier. This part can be dropped.
> 
> [snip]
> 
> >> Think of user space processes inspecting /proc etc.
> > 
> > Well a task in one pid namespace cannot view the /proc for another pid
> > namespace, right?
> 
> No. Pid namespace provides another pid-to-task map, but have noting
> to do with VFS visibility.
> 
> >> Having directory
> >> names change out form under you for no apparent reason is pretty nasty.
> > 
> > Yes, but they won't just 'change out' from under you, you ask for it...
> > But here like I said I do prefer an approach where /proc/net is bind
> > mounted by the user.  But I have no good reason to back it up...
> 
> If you make /proc/net be a bund mount then you have to force all of
> the users to update their init scripts. I tried to make so with sysctl 
> filesystem, but this thread was not very popular :)

Ok.

A symlink is far preferable to sneaky redirection imo.

> Another way to do so - is to mount this one from inside the kernel, 
> but are you ready to fight with Al Viro for this? :)

Nope :)  Especially since it's an ugly idea.

> [snip]
> 
> >> So I think /proc/.netns/ or simply /proc/netns/ is a good choice. We
> >> just need a non-global id for our directory entries so we don't paint
> >> ourselves into a corner.
> > 
> > But I don't see how this is going to work.  If you're using a pid_nr
> > inside a pid namespace, then you're not guaranteeing that the pid_nr
> > will be unique, so you may not be able to create th e/proc/.netns/X
> > directory.  If you're using the global pid, then you're not guaranteeing
> > that it will be available upon a container restart.
> 
> Right - this is the same as using other ids, which I propose.
> 
> >> And honestly pid visibility is a very natural choice for which network
> >> namespaces you can see.  You can see the namespace of any process you
> >> can see.  Which especially means your children.  It is an arbitrary
> >> rule, it is a simple rule to explain, and it works recursively unlike
> > 
> > But apparently not simple enough for a simpleton like me to understand,
> > sorry :(
> > 
> >> any init_net is special rule.
> >>
> >> Eric

I suspect I'm misunderstanding Eric's proposal, though.

-serge

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-29  7:58       ` Pavel Emelyanov
@ 2008-03-02  2:03         ` Eric W. Biederman
  2008-03-02  2:17         ` Eric W. Biederman
  1 sibling, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2008-03-02  2:03 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: serge, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List, Linux Containers



- The experience from vserver, planetlab and OpenVZ is that it is good
  to be able to monitor processes in other namespaces.  

- The linux experience says filesystems are a good way to do that.

- So we really want to filesystem monitoring interfaces to depend on
  the filesystem mount options instead of current.

- Starting with making /proc and sysctls depend on current is a cheap
  way to get things up and going.

- When I consider breaking things up into multiple filesystems I run
  across the occasional file that depends on multiple namespaces.
  uids in /proc/sysvipc/* for example.  Luckily I have yet to find
  any directory structures that depend on more then one namespace.

  Maybe that can be handled properly by capturing multiple 
  namespaces at mount time but I am a bit leery of that.

- The visibility of namespaces should be match the visibility of the
  processes that use them.   Access control of course can be more
  restricted.

- We want to see how namespaces connect to tasks.

Therefore.

/proc/net, /proc/sys, /proc/sysvipc, and probably a few others
should migrate under /proc/<pid>/task/<tid> (not under /proc/<pid>
so we can finally straighten out the task group vs task issue).

Todays problem of course is /proc/net/

What I had intended to implement was:
/proc/current -> /proc/<pid>/task/<tid>
(A new symlink to the task directory)

/proc/net -> /proc/current/net
(like /proc/mounts)

The only downside of placing files under the task directory is
that we use a lot more dentries for /proc.

....

Optimizations.

If the dentry pressure is significant and we don't have data from
other namespaces in the files causing us to want to present the
information differently for different processes I support using
an id and a per namespace upper level directory.  With a symlink
into there from the task directories.

/proc/<pid>/task/<tid>/net -> ../../.../netns/<netns id>


The id I would use is a struct pid because that makes the id useful
for userspace monitoring and control applications and because we
can migrate it.

In my view /proc/netns/<pids> would be implemented like
/proc/<pids> with readdir and lookup returning different contents
based upon the pid namespace captured when we mounted proc.

Further struct pid would be enhanced so that as long as we have
a namespace using a struct pid as an id we would not free that pid_nr
in any of the pid namespaces.  Just like we do with process groups
and sessions today.

I think for the network namespace and network /proc files that
optimization is safe.  I seem to recall checking and not finding any
ids from other namespaces in the files under /proc/net.

I will try for some more detailed replies.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-29  7:58       ` Pavel Emelyanov
  2008-03-02  2:03         ` Eric W. Biederman
@ 2008-03-02  2:17         ` Eric W. Biederman
  2008-03-03  9:07           ` Pavel Emelyanov
  1 sibling, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2008-03-02  2:17 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: serge, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

Pavel Emelyanov <xemul@openvz.org> writes:

>> I was thinking we might be able to hide the existence of
>> /proc/.netns/NNN/  however we can read the current working directory.
>> So even if we only allow explicit access through /proc/net and all
>> others paths don't work we have something that is visible.
>
> I have a patch that overrides the ->readdir method for /proc/.netns,
> so that you can no longer read the directory contents, but you still
> can guess one by guessing and opening files in it. Overriding the 
> ->lookup to screw one up looks like "shadowing" technics.

Or looking at the symlinks under /proc/<pid>/fd/1
Or opening something under /proc/net/ and calling get pwd.

> OTOH - consider you have the ids of existing net namespaces, but cannot 
> read the contents on any but yours. So what? This information is useless
> for you. So I dropped this part of a patch.

However it is fundamental that monitoring programs want to inspect
the namespaces of other processes.

In theory the resource group stuff was suppose to provide us with
all of the names we would need.  However the semantics seem a bit
to flexible to use for something like this.

>> - Have readdir and lookup filter the directory entries by the pid
>>   namespace of the proc mount.
>
> So, how are you going to filter the lookup? The problem I see - you have
> a process that opened the /proc/.netns/X directory (he onws that namespace)
> and the other one trying to do the same. The VFS layer finds the hashed
> dentry corresponding to this /proc/.netns/X. The only way you can prevent
> VFS from giving one to the second task is to override .d_revalidate method 
> and drop that dentry....
>
> But we've already tried to walk this way with no luck.

I meant a per mount filtering.  Exactly like we do for the pids now.

>> It looks like we have to tweak things just a bit so that free_pid
>> would not be called until the pid namespace goes away.  Something
>> similar to how we do the hash chains.
>
> This is not about pid namespace, this is about net namespace and
> tuning pids management to facilitate networking needs is not the right
> thing to do.

Not exactly.  It is about process attribute visibility.  Which
is what proc is about.  In plan9 where the concept comes from
namespaces are referred to as process groups, and that is a valid
view.  At least from a monitoring perspective.

>> If we make namespaces show up anywhere besides under
>> "/proc/<pid>/task/<tid>/" we have to do something like this, and pids
>> are largely designed for this kind of use.
>
> Proc consists of two parts - the <pid>-s one with generated-on-the-fly
> entries and the static one that is represented by proc_dir_entry tree.
> Do you propose to mix those two?

Yes.  Because the static entries are beginning to depend on process
specific attributes.  We have already started with /proc/mounts.

>> just need a non-global id for our directory entries so we don't paint
>> ourselves into a corner.
>
> What namespace do you mean by "non-global"?

The best is an id I can take with me when I migrate from machine A
to machine B.  An id in some namespace or a form that doesn't need
an id at all is the core requirement.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-02-29  7:42   ` Pavel Emelyanov
@ 2008-03-02  2:29     ` Eric W. Biederman
  2008-03-03  8:52       ` Pavel Emelyanov
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2008-03-02  2:29 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, David Miller, Alexey Dobriyan, Linux Netdev List,
	Linux Kernel Mailing List

Pavel Emelyanov <xemul@openvz.org> writes:

> I could use the struct net pointer values (obtained with sprintf(id, "%p", net))
> instead, but exporting internal kernel addresses seemed even uglier.

Agreed.

>> Can you try this approach by capturing a struct pid instead of an id
>> in a new global namespace? 
>
> This is a bad approach. When task, that created the namespace dies, his
> pid is removed from the pidmap and can be reused, so we can get another
> net with the same id.

It takes a little updating of how we use pids.  The easiest method
is to add an extra counter.  So we know when someone besides the hash
chains is using the pid as an id.  However it might make sense to actually
have a net namespace pointer in the pid.

>
> This net's id is not supposed to be used to address any net in the kernel.
> And I see no problems with migration - you can change the net's id safely
> during checkpoint/restart - tasks will always see this one via the /proc/net
> symlink, which is dynamic.

So you are really talking about a hidden id.  There are just enough
ways for something like that to slip out I'm not especially
comfortable with the idea.

I really think we need something clean that we can live with, and be
proud of.  However we implement the enhancement to /proc/net this has
to be maintained for decades.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-03-02  2:29     ` Eric W. Biederman
@ 2008-03-03  8:52       ` Pavel Emelyanov
  2008-03-04 22:23         ` Eric W. Biederman
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Emelyanov @ 2008-03-03  8:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, David Miller, Alexey Dobriyan, Linux Netdev List,
	Linux Kernel Mailing List

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@openvz.org> writes:
> 
>> I could use the struct net pointer values (obtained with sprintf(id, "%p", net))
>> instead, but exporting internal kernel addresses seemed even uglier.
> 
> Agreed.
> 
>>> Can you try this approach by capturing a struct pid instead of an id
>>> in a new global namespace? 
>> This is a bad approach. When task, that created the namespace dies, his
>> pid is removed from the pidmap and can be reused, so we can get another
>> net with the same id.
> 
> It takes a little updating of how we use pids.  The easiest method
> is to add an extra counter.  So we know when someone besides the hash
> chains is using the pid as an id.  However it might make sense to actually
> have a net namespace pointer in the pid.

No, please, no. I'm strongly opposed to making pids provide identification
for anything we need in the kernel.

>> This net's id is not supposed to be used to address any net in the kernel.
>> And I see no problems with migration - you can change the net's id safely
>> during checkpoint/restart - tasks will always see this one via the /proc/net
>> symlink, which is dynamic.
> 
> So you are really talking about a hidden id.  There are just enough
> ways for something like that to slip out I'm not especially
> comfortable with the idea.
> 
> I really think we need something clean that we can live with, and be
> proud of.  However we implement the enhancement to /proc/net this has
> to be maintained for decades.
> 
> Eric
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-03-02  2:17         ` Eric W. Biederman
@ 2008-03-03  9:07           ` Pavel Emelyanov
  2008-03-04 22:49             ` Eric W. Biederman
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Emelyanov @ 2008-03-03  9:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: serge, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@openvz.org> writes:
> 
>>> I was thinking we might be able to hide the existence of
>>> /proc/.netns/NNN/  however we can read the current working directory.
>>> So even if we only allow explicit access through /proc/net and all
>>> others paths don't work we have something that is visible.
>> I have a patch that overrides the ->readdir method for /proc/.netns,
>> so that you can no longer read the directory contents, but you still
>> can guess one by guessing and opening files in it. Overriding the 
>> ->lookup to screw one up looks like "shadowing" technics.
> 
> Or looking at the symlinks under /proc/<pid>/fd/1
> Or opening something under /proc/net/ and calling get pwd.
> 
>> OTOH - consider you have the ids of existing net namespaces, but cannot 
>> read the contents on any but yours. So what? This information is useless
>> for you. So I dropped this part of a patch.
> 
> However it is fundamental that monitoring programs want to inspect
> the namespaces of other processes.
> 
> In theory the resource group stuff was suppose to provide us with
> all of the names we would need.  However the semantics seem a bit
> to flexible to use for something like this.
> 
>>> - Have readdir and lookup filter the directory entries by the pid
>>>   namespace of the proc mount.
>> So, how are you going to filter the lookup? The problem I see - you have
>> a process that opened the /proc/.netns/X directory (he onws that namespace)
>> and the other one trying to do the same. The VFS layer finds the hashed
>> dentry corresponding to this /proc/.netns/X. The only way you can prevent
>> VFS from giving one to the second task is to override .d_revalidate method 
>> and drop that dentry....
>>
>> But we've already tried to walk this way with no luck.
> 
> I meant a per mount filtering.  Exactly like we do for the pids now.

We (me) do not perform any "filtering" in /proc. I just make /proc play 
the VFS rules - one super-block one tree of dentries.

>>> It looks like we have to tweak things just a bit so that free_pid
>>> would not be called until the pid namespace goes away.  Something
>>> similar to how we do the hash chains.
>> This is not about pid namespace, this is about net namespace and
>> tuning pids management to facilitate networking needs is not the right
>> thing to do.
> 
> Not exactly.  It is about process attribute visibility.  Which
> is what proc is about.  In plan9 where the concept comes from
> namespaces are referred to as process groups, and that is a valid
> view.  At least from a monitoring perspective.
> 
>>> If we make namespaces show up anywhere besides under
>>> "/proc/<pid>/task/<tid>/" we have to do something like this, and pids
>>> are largely designed for this kind of use.
>> Proc consists of two parts - the <pid>-s one with generated-on-the-fly
>> entries and the static one that is represented by proc_dir_entry tree.
>> Do you propose to mix those two?
> 
> Yes.  Because the static entries are beginning to depend on process
> specific attributes.  We have already started with /proc/mounts.

/proc/<pid>/mounts is not represented with any proc_dir_entry, but
what you're proposing with /proc/<pid>/net seems like doing this
representation.

>>> just need a non-global id for our directory entries so we don't paint
>>> ourselves into a corner.
>> What namespace do you mean by "non-global"?
> 
> The best is an id I can take with me when I migrate from machine A
> to machine B.  An id in some namespace or a form that doesn't need
> an id at all is the core requirement.

If we're OK in having a /proc/netns/<xxx> for each namespace, then
this <xxx> is an id, regardless whatever it is - a pre-generated 
number, a pointer, etc.

That said, your only wish is to make this <xxx> be preservable across 
migration, right?

> Eric
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-03-03  8:52       ` Pavel Emelyanov
@ 2008-03-04 22:23         ` Eric W. Biederman
  0 siblings, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2008-03-04 22:23 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, David Miller, Alexey Dobriyan, Linux Netdev List,
	Linux Kernel Mailing List

Pavel Emelyanov <xemul@openvz.org> writes:

>> 
>> It takes a little updating of how we use pids.  The easiest method
>> is to add an extra counter.  So we know when someone besides the hash
>> chains is using the pid as an id.  However it might make sense to actually
>> have a net namespace pointer in the pid.
>
> No, please, no. I'm strongly opposed to making pids provide identification
> for anything we need in the kernel.

I don't see the problem.  Sessions and process groups are roughly the
same kinds of concept as namespaces.

However I do agree that not using any kind of id is cleaner, leaves us
with less legacy that we have to deal with, and keeps struct pid
small, so it is preferable.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-03-03  9:07           ` Pavel Emelyanov
@ 2008-03-04 22:49             ` Eric W. Biederman
  2008-03-05  9:43               ` Pavel Emelyanov
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2008-03-04 22:49 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: serge, Andrew Morton, David Miller, Alexey Dobriyan,
	Linux Netdev List, Linux Kernel Mailing List

Pavel Emelyanov <xemul@openvz.org> writes:

>>>> - Have readdir and lookup filter the directory entries by the pid
>>>>   namespace of the proc mount.
>>> So, how are you going to filter the lookup? The problem I see - you have
>>> a process that opened the /proc/.netns/X directory (he onws that namespace)
>>> and the other one trying to do the same. The VFS layer finds the hashed
>>> dentry corresponding to this /proc/.netns/X. The only way you can prevent
>>> VFS from giving one to the second task is to override .d_revalidate method 
>>> and drop that dentry....
>>>
>>> But we've already tried to walk this way with no luck.
>> 
>> I meant a per mount filtering.  Exactly like we do for the pids now.
>
> We (me) do not perform any "filtering" in /proc. I just make /proc play 
> the VFS rules - one super-block one tree of dentries.

Exactly.  For different super blocks we return a different set of
processes and a different set of numbers of those processes.  If you
do use ids that do not live in a namespace I agree you do not need to
do different things for different mounts, but that seems ugly and
problematic.

>>>> If we make namespaces show up anywhere besides under
>>>> "/proc/<pid>/task/<tid>/" we have to do something like this, and pids
>>>> are largely designed for this kind of use.
>>> Proc consists of two parts - the <pid>-s one with generated-on-the-fly
>>> entries and the static one that is represented by proc_dir_entry tree.
>>> Do you propose to mix those two?
>> 
>> Yes.  Because the static entries are beginning to depend on process
>> specific attributes.  We have already started with /proc/mounts.
>
> /proc/<pid>/mounts is not represented with any proc_dir_entry, but
> what you're proposing with /proc/<pid>/net seems like doing this
> representation.

Yes.  I am talking about placing things represented with a
prod_dir_entry and having them show up under a hierarchy not
represented with proc_dir_entries under /proc/<pid>.

As that is clean, worked well for /proc/mounts, does not
require ids at all, and is essentially the optimal form
for monitoring processes.

/proc/mounts used to have a proc_dir_entry.  When it was reimplemented
to be per fs namespace that was removed.

>>>> just need a non-global id for our directory entries so we don't paint
>>>> ourselves into a corner.
>>> What namespace do you mean by "non-global"?
>> 
>> The best is an id I can take with me when I migrate from machine A
>> to machine B.  An id in some namespace or a form that doesn't need
>> an id at all is the core requirement.
>
> If we're OK in having a /proc/netns/<xxx> for each namespace, then
> this <xxx> is an id, regardless whatever it is - a pre-generated 
> number, a pointer, etc.
>
> That said, your only wish is to make this <xxx> be preservable across 
> migration, right?

No, that is not my only wish.

- I wish for a clean maintainable interface.
- I wish for an interface that we can use for monitoring programs like
  top and ps.
- I wish for an interface that is migration safe.

It is my opinion that using an id is simply an optimization to reduce
the number of cached proc dentries.

I gave a full run down of what I wish and the reasons for it earlier
in this thread. I have not seen you respond to that message.

Currently I am NOT ok having a /proc/netns/<xxx>. It appears to be
a contentious premature optimization.

VFS clean, maintainable, and usable for monitoring is
/proc/<pid>/task/<tid>/net.

We can always figure out how to optimize that form later.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] Fix /proc/net in presence of net namespaces
  2008-03-04 22:49             ` Eric W. Biederman
@ 2008-03-05  9:43               ` Pavel Emelyanov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Emelyanov @ 2008-03-05  9:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: serge, Alexey Dobriyan, Linux Netdev List, Linux Kernel Mailing List

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@openvz.org> writes:
> 
>>>>> - Have readdir and lookup filter the directory entries by the pid
>>>>>   namespace of the proc mount.
>>>> So, how are you going to filter the lookup? The problem I see - you have
>>>> a process that opened the /proc/.netns/X directory (he onws that namespace)
>>>> and the other one trying to do the same. The VFS layer finds the hashed
>>>> dentry corresponding to this /proc/.netns/X. The only way you can prevent
>>>> VFS from giving one to the second task is to override .d_revalidate method 
>>>> and drop that dentry....
>>>>
>>>> But we've already tried to walk this way with no luck.
>>> I meant a per mount filtering.  Exactly like we do for the pids now.
>> We (me) do not perform any "filtering" in /proc. I just make /proc play 
>> the VFS rules - one super-block one tree of dentries.
> 
> Exactly.  For different super blocks we return a different set of
> processes and a different set of numbers of those processes.  If you
> do use ids that do not live in a namespace I agree you do not need to
> do different things for different mounts, but that seems ugly and
> problematic.
> 
>>>>> If we make namespaces show up anywhere besides under
>>>>> "/proc/<pid>/task/<tid>/" we have to do something like this, and pids
>>>>> are largely designed for this kind of use.
>>>> Proc consists of two parts - the <pid>-s one with generated-on-the-fly
>>>> entries and the static one that is represented by proc_dir_entry tree.
>>>> Do you propose to mix those two?
>>> Yes.  Because the static entries are beginning to depend on process
>>> specific attributes.  We have already started with /proc/mounts.
>> /proc/<pid>/mounts is not represented with any proc_dir_entry, but
>> what you're proposing with /proc/<pid>/net seems like doing this
>> representation.
> 
> Yes.  I am talking about placing things represented with a
> prod_dir_entry and having them show up under a hierarchy not
> represented with proc_dir_entries under /proc/<pid>.
> 
> As that is clean, worked well for /proc/mounts, does not
> require ids at all, and is essentially the optimal form
> for monitoring processes.
> 
> /proc/mounts used to have a proc_dir_entry.  When it was reimplemented
> to be per fs namespace that was removed.
> 
>>>>> just need a non-global id for our directory entries so we don't paint
>>>>> ourselves into a corner.
>>>> What namespace do you mean by "non-global"?
>>> The best is an id I can take with me when I migrate from machine A
>>> to machine B.  An id in some namespace or a form that doesn't need
>>> an id at all is the core requirement.
>> If we're OK in having a /proc/netns/<xxx> for each namespace, then
>> this <xxx> is an id, regardless whatever it is - a pre-generated 
>> number, a pointer, etc.
>>
>> That said, your only wish is to make this <xxx> be preservable across 
>> migration, right?
> 
> No, that is not my only wish.
> 
> - I wish for a clean maintainable interface.
> - I wish for an interface that we can use for monitoring programs like
>   top and ps.
> - I wish for an interface that is migration safe.
> 
> It is my opinion that using an id is simply an optimization to reduce
> the number of cached proc dentries.
> 
> I gave a full run down of what I wish and the reasons for it earlier
> in this thread. I have not seen you respond to that message.

I took you opinion, expressed in this letter, into account.

> Currently I am NOT ok having a /proc/netns/<xxx>. It appears to be
> a contentious premature optimization.

Have you changed your mind suddenly? 
You told opposite less than a week ago.

> VFS clean, maintainable, and usable for monitoring is
> /proc/<pid>/task/<tid>/net.

Why not /proc/pid/net? Are we ever going to move threads in namespace?

> We can always figure out how to optimize that form later.
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2008-03-05  9:44 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-28 15:46 [PATCH 0/2] Fix /proc/net in presence of net namespaces Pavel Emelyanov
2008-02-28 15:49 ` [PATCH 1/2] Add an id to struct net Pavel Emelyanov
2008-02-28 15:51 ` [PATCH 2/2] Make /proc/net a symlink and drop proc shadows Pavel Emelyanov
2008-02-28 19:31 ` [PATCH 0/2] Fix /proc/net in presence of net namespaces Eric W. Biederman
2008-02-28 21:17   ` serge
2008-02-28 22:39     ` Eric W. Biederman
2008-02-29  3:17       ` serge
2008-02-29  8:16         ` Pavel Emelyanov
2008-02-29 15:38           ` serge
2008-02-29  7:58       ` Pavel Emelyanov
2008-03-02  2:03         ` Eric W. Biederman
2008-03-02  2:17         ` Eric W. Biederman
2008-03-03  9:07           ` Pavel Emelyanov
2008-03-04 22:49             ` Eric W. Biederman
2008-03-05  9:43               ` Pavel Emelyanov
2008-02-29  7:44     ` Pavel Emelyanov
2008-02-29  7:42   ` Pavel Emelyanov
2008-03-02  2:29     ` Eric W. Biederman
2008-03-03  8:52       ` Pavel Emelyanov
2008-03-04 22:23         ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).