* [RFC] [patch 0/6] [Network namespace] introduction
@ 2006-06-09 21:02 dlezcano
  2006-06-09 21:02 ` [RFC] [patch 1/6] [Network namespace] Network namespace structure dlezcano
                   ` (9 more replies)
  0 siblings, 10 replies; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

The following patches create a private "network namespace" for use
within containers. This is intended for use with system containers
like vserver, but might also be useful for restricting individual
applications' access to the network stack.

These patches isolate traffic inside the network namespace. Network
resources, as well as incoming and outgoing packets, are identified
as belonging to a namespace.

It hides network resources not contained in the current namespace,
but still allows administration of the network with the usual
commands such as ifconfig.

The patches apply to kernel version 2.6.17-rc6-mm1.

It provides the following:
-------------------------
   - when an application unshares its network namespace, it loses its
     view of all network devices by default. The administrator can
     choose to make any device visible again. The container then gains
     a view of the device, but without the IP address configured on
     it. It is up to the container administrator to use ifconfig or
     the ip command to set up a new IP address. This IP address is
     only visible inside the container.

   - the loopback is isolated inside the container, so containers
     cannot communicate with each other via the loopback.

   - several containers can each have an application bound to the same
     address:port without conflict.

What is it for?
---------------
   - security : an application can be confined inside a container
     without being able to interact with the network used by another
     container

   - consolidation : several instances of the same application can be
     run in different containers, because the network namespace allows
     each of them to bind to the same addr:port (a minimal sketch
     follows)
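
   As an illustration of this, each container could run the trivial
   listener below and bind to the same port without an EADDRINUSE
   error, provided each instance lives in its own network namespace
   (the port number is an arbitrary choice for the example):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <netinet/in.h>
	#include <sys/socket.h>

	int main(void)
	{
		struct sockaddr_in sin;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0) {
			perror("socket");
			return 1;
		}

		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_addr.s_addr = htonl(INADDR_ANY);
		sin.sin_port = htons(8080);

		/* succeeds in every container: the bind is checked
		 * against port *and* network namespace */
		if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
			perror("bind");
			return 1;
		}
		listen(fd, 5);
		pause();	/* keep the listener around */
		return 0;
	}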

What could be done?
-------------------
    - because the network resources are tied to a namespace, it is
      easy to identify them. That facilitates the implementation of
      network migration

How to use?
-----------
   - do an unshare() with the CLONE_NEWNET flag as root (see the C
     sketch after these steps)
   - do echo eth0 > /sys/kernel/debug/net_ns/dev
   - use ifconfig or the ip command to set a new IP address
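
   Since glibc does not expose the CLONE_NEWNET flag added by this
   patchset, the unshare step can be done with a raw syscall. A
   minimal sketch, assuming SYS_unshare is defined by <sys/syscall.h>
   and using the CLONE_NEWNET value defined in patch 1/6:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef CLONE_NEWNET
	#define CLONE_NEWNET 0x08000000	/* value used by this patchset */
	#endif

	int main(void)
	{
		/* must be run as root (CAP_SYS_ADMIN) */
		if (syscall(SYS_unshare, CLONE_NEWNET) < 0) {
			perror("unshare(CLONE_NEWNET)");
			return 1;
		}
		/* the new namespace starts with no visible devices; add
		 * one back via /sys/kernel/debug/net_ns/dev and then
		 * configure it with ifconfig or ip */
		execl("/bin/sh", "sh", (char *)NULL);
		perror("execl");
		return 1;
	}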

What is missing?
----------------
The routes are not yet isolated, which implies:

   - binding to another container's address is allowed

   - an outgoing packet with an unset source address can potentially
     be given another container's address

   - an incoming packet can be routed to the wrong container if
     several containers are listening on the same addr:port

--


* [RFC] [patch 1/6] [Network namespace] Network namespace structure
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
@ 2006-06-09 21:02 ` dlezcano
  2006-06-09 21:02 ` [RFC] [patch 2/6] [Network namespace] Network device sharing by view dlezcano
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

[-- Attachment #1: net_ns.patch --]
[-- Type: text/plain, Size: 11607 bytes --]

This patch adds the network namespace to the nsproxy, along with a
set of functions to unshare it. The network namespace structure will
be filled later with the network resources identified as needing more
isolation.
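
As a rough illustration of the intended usage (the structure and
functions below are hypothetical; only net_ns(), get_net_ns() and
put_net_ns() come from this patch), another subsystem would pin the
current namespace like this:

	#include <linux/net_ns.h>

	struct my_object {
		struct net_namespace *net_ns;	/* owning namespace */
	};

	static void my_object_init(struct my_object *obj)
	{
		/* take a reference on the current task's namespace */
		obj->net_ns = net_ns();
		get_net_ns(obj->net_ns);
	}

	static void my_object_release(struct my_object *obj)
	{
		/* drop the reference; the namespace is freed on the
		 * last put via free_net_ns() */
		put_net_ns(obj->net_ns);
	}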

Replace-Subject: [Network namespace] Network namespace structure
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> 
--
 include/linux/init_task.h |    2 
 include/linux/net_ns.h    |   59 ++++++++++++++++++++++++++++
 include/linux/nsproxy.h   |    2 
 include/linux/sched.h     |    1 
 init/version.c            |    8 +++
 kernel/fork.c             |   24 +++++++++--
 kernel/nsproxy.c          |   38 +++++++++++-------
 net/Kconfig               |    9 ++++
 net/Makefile              |    1 
 net/net_ns.c              |   96 ++++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 222 insertions(+), 18 deletions(-)

Index: 2.6-mm/include/linux/net_ns.h
===================================================================
--- /dev/null
+++ 2.6-mm/include/linux/net_ns.h
@@ -0,0 +1,59 @@
+#ifndef _LINUX_NET_NS_H
+#define _LINUX_NET_NS_H
+
+#include <linux/kref.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+
+struct net_namespace {
+	struct kref kref;
+};
+
+extern struct net_namespace init_net_ns;
+
+#ifdef CONFIG_NET_NS
+
+extern int unshare_network(unsigned long unshare_flags,
+			   struct net_namespace **new_net);
+
+extern int copy_network(int flags, struct task_struct *tsk);
+
+static inline void get_net_ns(struct net_namespace *ns)
+{
+	kref_get(&ns->kref);
+}
+
+void free_net_ns(struct kref *kref);
+
+static inline void put_net_ns(struct net_namespace *ns)
+{
+	kref_put(&ns->kref, free_net_ns);
+}
+
+static inline void exit_network(struct task_struct *p)
+{
+	struct net_namespace *net_ns = p->nsproxy->net_ns;
+	if (net_ns)
+		put_net_ns(net_ns);
+}
+#else /* !CONFIG_NET_NS */
+static inline int unshare_network(unsigned long unshare_flags,
+				  struct net_namespace **new_net)
+{
+	return -EINVAL;
+}
+static inline int copy_network(int flags, struct task_struct *tsk)
+{
+	return 0;
+}
+static inline void get_net_ns(struct net_namespace *ns) {}
+static inline void put_net_ns(struct net_namespace *ns) {}
+static inline void exit_network(struct task_struct *p) {}
+#endif /* CONFIG_NET_NS */
+
+static inline struct net_namespace *net_ns(void)
+{
+	return current->nsproxy->net_ns;
+}
+
+#endif
Index: 2.6-mm/net/net_ns.c
===================================================================
--- /dev/null
+++ 2.6-mm/net/net_ns.c
@@ -0,0 +1,96 @@
+/*
+ *  net_ns.c - adds support for network namespace
+ *
+ *  Copyright (C) 2006 IBM
+ *
+ *  Author: Daniel Lezcano <dlezcano@fr.ibm.com>
+ *
+ *     This program is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation, version 2 of the
+ *     License.
+ */
+
+#include <linux/net_ns.h>
+#include <linux/module.h>
+
+/*
+ * Clone a new ns copying an original, setting refcount to 1.
+ * The cloned process will have its own network namespace.
+ * @old_ns: namespace to clone
+ * Return NULL on error (failure to kmalloc), new ns otherwise
+ */
+struct net_namespace *clone_net_ns(struct net_namespace *old_ns)
+{
+	struct net_namespace *new_ns;
+
+	new_ns = kmalloc(sizeof(*new_ns), GFP_KERNEL);
+ 	if (!new_ns)
+ 		return NULL;
+ 	kref_init(&new_ns->kref);
+	return new_ns;
+}
+
+/*
+ * unshare the current process' network namespace.
+ * called only in sys_unshare()
+ */
+int unshare_network(unsigned long unshare_flags,
+		    struct net_namespace **new_net)
+{
+ 	if (!(unshare_flags & CLONE_NEWNET))
+ 		return 0;
+
+ 	if (!capable(CAP_SYS_ADMIN))
+ 		return -EPERM;
+
+ 	*new_net = clone_net_ns(current->nsproxy->net_ns);
+ 	if (!*new_net)
+ 		return -ENOMEM;
+
+	return 0;
+}
+
+/*
+ * Copy task tsk's network namespace, or clone it if flags specifies
+ * CLONE_NEWNET.  In latter case, changes to the network ressources of
+ * this process won't be seen by parent, and vice versa.
+ */
+int copy_network(int flags, struct task_struct *tsk)
+{
+	struct net_namespace *old_ns = tsk->nsproxy->net_ns;
+	struct net_namespace *new_ns;
+	int err = 0;
+
+	if (!old_ns)
+		return 0;
+
+	get_net_ns(old_ns);
+
+	if (!(flags & CLONE_NEWNET))
+		return 0;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	new_ns = clone_net_ns(old_ns);
+	if (!new_ns) {
+		err = -ENOMEM;
+		goto out;
+	}
+	tsk->nsproxy->net_ns = new_ns;
+
+out:
+	put_net_ns(old_ns);
+	return err;
+}
+
+void free_net_ns(struct kref *kref)
+{
+	struct net_namespace *ns;
+
+	ns = container_of(kref, struct net_namespace, kref);
+	kfree(ns);
+}
Index: 2.6-mm/include/linux/nsproxy.h
===================================================================
--- 2.6-mm.orig/include/linux/nsproxy.h
+++ 2.6-mm/include/linux/nsproxy.h
@@ -6,6 +6,7 @@
 
 struct namespace;
 struct uts_namespace;
+struct net_namespace;
 
 /*
  * A structure to contain pointers to all per-process
@@ -23,6 +24,7 @@ struct nsproxy {
 	atomic_t count;
 	spinlock_t nslock;
 	struct uts_namespace *uts_ns;
+	struct net_namespace *net_ns;
 	struct namespace *namespace;
 };
 extern struct nsproxy init_nsproxy;
Index: 2.6-mm/kernel/nsproxy.c
===================================================================
--- 2.6-mm.orig/kernel/nsproxy.c
+++ 2.6-mm/kernel/nsproxy.c
@@ -14,6 +14,7 @@
 #include <linux/nsproxy.h>
 #include <linux/namespace.h>
 #include <linux/utsname.h>
+#include <linux/net_ns.h>
 
 static inline void get_nsproxy(struct nsproxy *ns)
 {
@@ -59,6 +60,8 @@ struct nsproxy *dup_namespaces(struct ns
 			get_namespace(ns->namespace);
 		if (ns->uts_ns)
 			get_uts_ns(ns->uts_ns);
+		if (ns->net_ns)
+			get_net_ns(ns->net_ns);
 	}
 
 	return ns;
@@ -79,7 +82,7 @@ int copy_namespaces(int flags, struct ta
 
 	get_nsproxy(old_ns);
 
-	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS)))
+	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET)))
 		return 0;
 
 	new_ns = clone_namespaces(old_ns);
@@ -91,21 +94,28 @@ int copy_namespaces(int flags, struct ta
 	tsk->nsproxy = new_ns;
 
 	err = copy_namespace(flags, tsk);
-	if (err) {
-		tsk->nsproxy = old_ns;
-		put_nsproxy(new_ns);
-		goto out;
-	}
+	if (err)
+		goto bad_copy_namespace;
 
 	err = copy_utsname(flags, tsk);
-	if (err) {
-		if (new_ns->namespace)
-			put_namespace(new_ns->namespace);
-		tsk->nsproxy = old_ns;
-		put_nsproxy(new_ns);
-		goto out;
-	}
+	if (err)
+		goto bad_copy_utsname;
 
+  	err = copy_network(flags, tsk);
+  	if (err)
+ 		goto bad_copy_network;
+
+  	goto out;
+
+bad_copy_network:
+	if (new_ns->uts_ns)
+		put_uts_ns(new_ns->uts_ns);
+bad_copy_utsname:
+	if (new_ns->namespace)
+		put_namespace(new_ns->namespace);
+bad_copy_namespace:
+	tsk->nsproxy = old_ns;
+	put_nsproxy(new_ns);
 out:
 	put_nsproxy(old_ns);
 	return err;
@@ -117,5 +127,7 @@ void free_nsproxy(struct nsproxy *ns)
 			put_namespace(ns->namespace);
 		if (ns->uts_ns)
 			put_uts_ns(ns->uts_ns);
+		if (ns->net_ns)
+			put_net_ns(ns->net_ns);
 		kfree(ns);
 }
Index: 2.6-mm/include/linux/sched.h
===================================================================
--- 2.6-mm.orig/include/linux/sched.h
+++ 2.6-mm/include/linux/sched.h
@@ -25,6 +25,7 @@
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
 #define CLONE_STOPPED		0x02000000	/* Start in stopped state */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
+#define CLONE_NEWNET		0x08000000	/* New network namespace */
 
 /*
  * Scheduling policies
Index: 2.6-mm/net/Kconfig
===================================================================
--- 2.6-mm.orig/net/Kconfig
+++ 2.6-mm/net/Kconfig
@@ -60,6 +60,15 @@ config INET
 
 	  Short answer: say Y.
 
+config NET_NS
+	bool "Network namespaces"
+	depends on NET
+	default n
+	---help---
+	  Support for network namespaces.  This allows containers, i.e.
+	  vservers, to use network namespaces to provide isolated
+	  network for different servers.  If unsure, say N.
+
 if INET
 source "net/ipv4/Kconfig"
 source "net/ipv6/Kconfig"
Index: 2.6-mm/net/Makefile
===================================================================
--- 2.6-mm.orig/net/Makefile
+++ 2.6-mm/net/Makefile
@@ -50,3 +50,4 @@ obj-$(CONFIG_TIPC)		+= tipc/
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
+obj-$(CONFIG_NET_NS)            += net_ns.o
Index: 2.6-mm/include/linux/init_task.h
===================================================================
--- 2.6-mm.orig/include/linux/init_task.h
+++ 2.6-mm/include/linux/init_task.h
@@ -4,6 +4,7 @@
 #include <linux/file.h>
 #include <linux/rcupdate.h>
 #include <linux/utsname.h>
+#include <linux/net_ns.h>
 #include <linux/interrupt.h>
 
 #define INIT_FDTABLE \
@@ -73,6 +74,7 @@ extern struct nsproxy init_nsproxy;
 	.count		= ATOMIC_INIT(1),				\
 	.nslock		= SPIN_LOCK_UNLOCKED,				\
 	.uts_ns		= &init_uts_ns,					\
+	.net_ns         = &init_net_ns,                                 \
 	.namespace	= NULL,						\
 }
 
Index: 2.6-mm/kernel/fork.c
===================================================================
--- 2.6-mm.orig/kernel/fork.c
+++ 2.6-mm/kernel/fork.c
@@ -1592,13 +1592,15 @@ asmlinkage long sys_unshare(unsigned lon
 	struct sem_undo_list *new_ulist = NULL;
 	struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
 	struct uts_namespace *uts, *new_uts = NULL;
+	struct net_namespace *net, *new_net = NULL;
 
 	check_unshare_flags(&unshare_flags);
 
 	/* Return -EINVAL for all unsupported flags */
 	err = -EINVAL;
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
-				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|CLONE_NEWUTS))
+				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|CLONE_NEWUTS|
+			        CLONE_NEWNET))
 		goto bad_unshare_out;
 
 	if ((err = unshare_thread(unshare_flags)))
@@ -1617,18 +1619,20 @@ asmlinkage long sys_unshare(unsigned lon
 		goto bad_unshare_cleanup_fd;
 	if ((err = unshare_utsname(unshare_flags, &new_uts)))
 		goto bad_unshare_cleanup_semundo;
+ 	if ((err = unshare_network(unshare_flags, &new_net)))
+ 		goto bad_unshare_cleanup_utsname;
 
-	if (new_ns || new_uts) {
+	if (new_ns || new_uts || new_net) {
 		old_nsproxy = current->nsproxy;
 		new_nsproxy = dup_namespaces(old_nsproxy);
 		if (!new_nsproxy) {
 			err = -ENOMEM;
-			goto bad_unshare_cleanup_uts;
+			goto bad_unshare_cleanup_net;
 		}
 	}
 
 	if (new_fs || new_ns || new_sigh || new_mm || new_fd || new_ulist ||
-				new_uts) {
+				new_uts || new_net) {
 
 		task_lock(current);
 
@@ -1676,13 +1680,23 @@ asmlinkage long sys_unshare(unsigned lon
 			new_uts = uts;
 		}
 
+ 		if (new_net) {
+ 			net = current->nsproxy->net_ns;
+ 			current->nsproxy->net_ns = new_net;
+			new_net = net;
+ 		}
+
 		task_unlock(current);
 	}
 
 	if (new_nsproxy)
 		put_nsproxy(new_nsproxy);
 
-bad_unshare_cleanup_uts:
+bad_unshare_cleanup_net:
+	if (new_net)
+		put_net_ns(new_net);
+
+bad_unshare_cleanup_utsname:
 	if (new_uts)
 		put_uts_ns(new_uts);
 
Index: 2.6-mm/init/version.c
===================================================================
--- 2.6-mm.orig/init/version.c
+++ 2.6-mm/init/version.c
@@ -10,6 +10,7 @@
 #include <linux/module.h>
 #include <linux/uts.h>
 #include <linux/utsname.h>
+#include <linux/net_ns.h>
 #include <linux/version.h>
 #include <linux/sched.h>
 
@@ -33,6 +34,13 @@ struct uts_namespace init_uts_ns = {
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
+struct net_namespace init_net_ns = {
+	.kref = {
+		.refcount	= ATOMIC_INIT(2),
+	},
+};
+EXPORT_SYMBOL_GPL(init_net_ns);
+
 const char linux_banner[] =
 	"Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
 	LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";

--


* [RFC] [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
  2006-06-09 21:02 ` [RFC] [patch 1/6] [Network namespace] Network namespace structure dlezcano
@ 2006-06-09 21:02 ` dlezcano
  2006-06-11 10:18   ` Andrew Morton
                     ` (2 more replies)
  2006-06-09 21:02 ` [RFC] [patch 3/6] [Network namespace] Network devices isolation dlezcano
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

[-- Attachment #1: net_ns_dev.patch --]
[-- Type: text/plain, Size: 8458 bytes --]

This patch adds a device list view to the network namespace. The view
is empty after an unshare is done. It is filled and emptied by a set
of functions which can be called by an external module.
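
As a rough sketch of how an external module could use this API (the
helper function below is hypothetical; net_ns_dev_add(),
net_ns_dev_find_by_name() and the dev_list member come from this
patch):

	#include <linux/netdevice.h>
	#include <linux/net_ns.h>
	#include <linux/net_ns_dev.h>

	/* make "eth0" visible in the current namespace and look it up
	 * again through the per-namespace view */
	static int example_share_eth0(void)
	{
		struct net_ns_dev_list *view = &net_ns()->dev_list;
		struct net_device *dev;
		int err;

		err = net_ns_dev_add("eth0", view);
		if (err)		/* -ENODEV or -ENOMEM */
			return err;

		dev = net_ns_dev_find_by_name("eth0", view);
		if (!dev)
			return -ENODEV;

		/* ... use dev ... */
		dev_put(dev);		/* find_by_name held a reference */
		return 0;
	}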

Replace-Subject: [Network namespace] Network device sharing by view
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> 
--
 include/linux/net_ns.h     |    2 
 include/linux/net_ns_dev.h |   32 +++++++
 init/version.c             |    4 
 net/core/Makefile          |    2 
 net/core/net_ns_dev.c      |  205 +++++++++++++++++++++++++++++++++++++++++++++
 net/net_ns.c               |    6 +
 6 files changed, 250 insertions(+), 1 deletion(-)

Index: 2.6-mm/include/linux/net_ns_dev.h
===================================================================
--- /dev/null
+++ 2.6-mm/include/linux/net_ns_dev.h
@@ -0,0 +1,32 @@
+#ifndef _LINUX_NET_NS_DEV_H
+#define _LINUX_NET_NS_DEV_H
+
+struct net_device;
+
+struct net_ns_dev {
+	struct list_head list;
+	struct net_device *dev;
+};
+
+struct net_ns_dev_list {
+	struct list_head list;
+	rwlock_t lock;
+};
+
+extern int net_ns_dev_unregister(struct net_device *dev,
+				 struct net_ns_dev_list *devlist);
+
+extern int net_ns_dev_register(struct net_device *dev,
+			       struct net_ns_dev_list *devlist);
+
+extern struct net_device *net_ns_dev_find_by_name(const char *devname,
+						  struct net_ns_dev_list *devlist);
+extern int net_ns_dev_remove(const char *devname,
+			     struct net_ns_dev_list *devlist);
+
+extern int net_ns_dev_add(const char *devname,
+			  struct net_ns_dev_list *devlist);
+
+extern int free_net_ns_dev(struct net_ns_dev_list *devlist);
+
+#endif
Index: 2.6-mm/include/linux/net_ns.h
===================================================================
--- 2.6-mm.orig/include/linux/net_ns.h
+++ 2.6-mm/include/linux/net_ns.h
@@ -4,9 +4,11 @@
 #include <linux/kref.h>
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/net_ns_dev.h>
 
 struct net_namespace {
 	struct kref kref;
+	struct net_ns_dev_list dev_list;
 };
 
 extern struct net_namespace init_net_ns;
Index: 2.6-mm/net/core/net_ns_dev.c
===================================================================
--- /dev/null
+++ 2.6-mm/net/core/net_ns_dev.c
@@ -0,0 +1,205 @@
+/*
+ *  net_ns_dev.c - adds namespace network device view
+ *
+ *  Copyright (C) 2006 IBM
+ *
+ *  Author: Daniel Lezcano <dlezcano@fr.ibm.com>
+ *
+ *     This program is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation, version 2 of the
+ *     License.
+ */
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/netdevice.h>
+#include <linux/net_ns_dev.h>
+
+int free_net_ns_dev(struct net_ns_dev_list *devlist)
+{
+	struct list_head *l, *next;
+	struct net_ns_dev *db;
+	struct net_device *dev;
+
+	write_lock(&devlist->lock);
+	list_for_each_safe(l, next, &devlist->list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
+		list_del(&db->list);
+		dev_put(dev);
+		kfree(db);
+	}
+	write_unlock(&devlist->lock);
+
+	return 0;
+}
+
+/*
+ * Remove a device from the namespace network devices list
+ * when unregistered from a namespace
+ * @dev : network device
+ * @dev_list: network namespace devices
+ * Return ENODEV if the device does not exist,
+ */
+int net_ns_dev_unregister(struct net_device *dev,
+			  struct net_ns_dev_list *devlist)
+{
+	struct net_ns_dev *db;
+	struct list_head *l;
+	int ret = -ENODEV;
+
+	write_lock(&devlist->lock);
+	list_for_each(l, &devlist->list) {
+		db = list_entry(l, struct net_ns_dev, list);
+ 		if (dev != db->dev)
+ 			continue;
+
+ 		list_del(&db->list);
+ 		dev_put(dev);
+ 		kfree(db);
+ 		ret = 0;
+ 		break;
+	}
+	write_unlock(&devlist->lock);
+	return ret;
+}
+
+EXPORT_SYMBOL_GPL(net_ns_dev_unregister);
+
+/*
+ * Add a device to the namespace network devices list
+ * when registered from a namespace
+ * @dev : network device
+ * @dev_list: network namespace devices
+ * Return ENOMEM if allocation fails, 0 on success
+ */
+int net_ns_dev_register(struct net_device *dev,
+			struct net_ns_dev_list *devlist)
+{
+	struct net_ns_dev *db;
+
+	db = kmalloc(sizeof(*db), GFP_KERNEL);
+	if (!db)
+		return -ENOMEM;
+
+	write_lock(&devlist->lock);
+	dev_hold(dev);
+	db->dev = dev;
+	list_add_tail(&db->list, &devlist->list);
+	write_unlock(&devlist->lock);
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(net_ns_dev_register);
+
+/*
+ * Add a device to the namespace network devices list
+ * @devname : network device name
+ * @dev_list: network namespace devices
+ * Return ENODEV if the device does not exist,
+ * ENOMEM if allocation fails, 0 on success
+ */
+int net_ns_dev_add(const char *devname,
+		   struct net_ns_dev_list *devlist)
+{
+	struct net_ns_dev *db;
+	struct net_device *dev;
+	int ret = 0;
+
+	read_lock(&dev_base_lock);
+
+	for (dev = dev_base; dev; dev = dev->next)
+		if (!strncmp(dev->name, devname, IFNAMSIZ))
+			break;
+
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	db = kmalloc(sizeof(*db), GFP_KERNEL);
+	if (!db) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	write_lock(&devlist->lock);
+	db->dev = dev;
+	dev_hold(dev);
+	list_add_tail(&db->list, &devlist->list);
+	write_unlock(&devlist->lock);
+
+out:
+	read_unlock(&dev_base_lock);
+
+	return ret;
+}
+
+EXPORT_SYMBOL_GPL(net_ns_dev_add);
+
+/*
+ * Remove a device from the namespace network devices list
+ * @devname : network device name
+ * @dev_list: network namespace devices
+ * Return ENODEV if the device does not exist, 0 on success
+ */
+int net_ns_dev_remove(const char *devname,
+		      struct net_ns_dev_list *devlist)
+{
+	struct net_ns_dev *db;
+	struct net_device *dev;
+	struct list_head *l;
+	int ret = 0;
+
+	write_lock(&devlist->lock);
+	list_for_each(l, &devlist->list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
+
+		if (!strncmp(dev->name, devname, IFNAMSIZ)) {
+			list_del(&db->list);
+			dev_put(dev);
+			kfree(db);
+			goto out;
+		}
+	}
+	ret = -ENODEV;
+out:
+	write_unlock(&devlist->lock);
+	return ret;
+}
+
+EXPORT_SYMBOL_GPL(net_ns_dev_remove);
+
+/*
+ * Find a namespace network device
+ * @devname : network device name
+ * @dev_list: network namespace devices
+ * Return ENODEV if the device does not exist, 0 on success
+ */
+struct net_device *net_ns_dev_find_by_name(const char *devname,
+					   struct net_ns_dev_list *devlist)
+{
+	struct net_ns_dev *db;
+	struct net_device *dev;
+	struct list_head *l;
+
+	read_lock(&devlist->lock);
+
+	list_for_each(l, &devlist->list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
+
+		if (!strncmp(dev->name, devname, IFNAMSIZ)) {
+			dev_hold(dev);
+			goto out;
+		}
+	}
+	dev = NULL;
+out:
+	read_unlock(&devlist->lock);
+	return dev;
+}
+
+EXPORT_SYMBOL_GPL(net_ns_dev_find_by_name);
Index: 2.6-mm/net/net_ns.c
===================================================================
--- 2.6-mm.orig/net/net_ns.c
+++ 2.6-mm/net/net_ns.c
@@ -23,11 +23,16 @@
 struct net_namespace *clone_net_ns(struct net_namespace *old_ns)
 {
 	struct net_namespace *new_ns;
+	struct net_ns_dev_list *new_dev_list;
 
 	new_ns = kmalloc(sizeof(*new_ns), GFP_KERNEL);
  	if (!new_ns)
  		return NULL;
+
  	kref_init(&new_ns->kref);
+ 	new_dev_list = &new_ns->dev_list;
+ 	INIT_LIST_HEAD(&new_dev_list->list);
+	new_dev_list->lock = RW_LOCK_UNLOCKED;
 	return new_ns;
 }
 
@@ -92,5 +97,6 @@ void free_net_ns(struct kref *kref)
 	struct net_namespace *ns;
 
 	ns = container_of(kref, struct net_namespace, kref);
+	free_net_ns_dev(&ns->dev_list);
 	kfree(ns);
 }
Index: 2.6-mm/net/core/Makefile
===================================================================
--- 2.6-mm.orig/net/core/Makefile
+++ 2.6-mm/net/core/Makefile
@@ -7,7 +7,7 @@ obj-y := sock.o request_sock.o skbuff.o 
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
-obj-y		     += dev.o ethtool.o dev_mcast.o dst.o \
+obj-y		     += dev.o net_ns_dev.o ethtool.o dev_mcast.o dst.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o
 
 obj-$(CONFIG_XFRM) += flow.o
Index: 2.6-mm/init/version.c
===================================================================
--- 2.6-mm.orig/init/version.c
+++ 2.6-mm/init/version.c
@@ -38,6 +38,10 @@ struct net_namespace init_net_ns = {
 	.kref = {
 		.refcount	= ATOMIC_INIT(2),
 	},
+	.dev_list = {
+		 .lock = RW_LOCK_UNLOCKED,
+		 .list = LIST_HEAD_INIT(init_net_ns.dev_list.list),
+	 },
 };
 EXPORT_SYMBOL_GPL(init_net_ns);
 

--


* [RFC] [patch 3/6] [Network namespace] Network devices isolation
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
  2006-06-09 21:02 ` [RFC] [patch 1/6] [Network namespace] Network namespace structure dlezcano
  2006-06-09 21:02 ` [RFC] [patch 2/6] [Network namespace] Network device sharing by view dlezcano
@ 2006-06-09 21:02 ` dlezcano
  2006-06-18 18:57   ` Al Viro
  2006-06-09 21:02 ` [RFC] [patch 4/6] [Network namespace] Network inet " dlezcano
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

[-- Attachment #1: netdev_isolation.patch --]
[-- Type: text/plain, Size: 9453 bytes --]

The dev list view is filled and used from here. The global dev_base
list has been replaced by the dev list view, and a device can be
accessed only if the view has it in its list. All calls from
userspace (ioctls, netlink and procfs) use the network device view
instead of the global network device list.

Replace-Subject: [Network namespace] Network devices isolation 
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> 
--
 net/core/dev.c       |  147 ++++++++++++++++++++++++++++++++++++++-------------
 net/core/rtnetlink.c |   21 +++++--
 2 files changed, 126 insertions(+), 42 deletions(-)

Index: 2.6-mm/net/core/dev.c
===================================================================
--- 2.6-mm.orig/net/core/dev.c
+++ 2.6-mm/net/core/dev.c
@@ -115,6 +115,7 @@
 #include <net/iw_handler.h>
 #include <asm/current.h>
 #include <linux/audit.h>
+#include <linux/net_ns.h>
 #include <linux/dmaengine.h>
 
 /*
@@ -474,13 +475,16 @@
 
 struct net_device *__dev_get_by_name(const char *name)
 {
-	struct hlist_node *p;
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
+	struct net_device *dev;
 
-	hlist_for_each(p, dev_name_hash(name)) {
-		struct net_device *dev
-			= hlist_entry(p, struct net_device, name_hlist);
+	list_for_each(l, list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
 		if (!strncmp(dev->name, name, IFNAMSIZ))
-			return dev;
+ 			return dev;
 	}
 	return NULL;
 }
@@ -498,13 +502,14 @@
 
 struct net_device *dev_get_by_name(const char *name)
 {
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
 	struct net_device *dev;
 
-	read_lock(&dev_base_lock);
+	read_lock(&dev_list->lock);
 	dev = __dev_get_by_name(name);
 	if (dev)
 		dev_hold(dev);
-	read_unlock(&dev_base_lock);
+	read_unlock(&dev_list->lock);
 	return dev;
 }
 
@@ -521,11 +526,14 @@
 
 struct net_device *__dev_get_by_index(int ifindex)
 {
-	struct hlist_node *p;
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
+	struct net_device *dev;
 
-	hlist_for_each(p, dev_index_hash(ifindex)) {
-		struct net_device *dev
-			= hlist_entry(p, struct net_device, index_hlist);
+	list_for_each(l, list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
 		if (dev->ifindex == ifindex)
 			return dev;
 	}
@@ -545,13 +553,14 @@
 
 struct net_device *dev_get_by_index(int ifindex)
 {
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
 	struct net_device *dev;
 
-	read_lock(&dev_base_lock);
+	read_lock(&dev_list->lock);
 	dev = __dev_get_by_index(ifindex);
 	if (dev)
 		dev_hold(dev);
-	read_unlock(&dev_base_lock);
+	read_unlock(&dev_list->lock);
 	return dev;
 }
 
@@ -571,14 +580,24 @@
 
 struct net_device *dev_getbyhwaddr(unsigned short type, char *ha)
 {
-	struct net_device *dev;
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
+	struct net_device *dev = NULL;
 
 	ASSERT_RTNL();
 
-	for (dev = dev_base; dev; dev = dev->next)
+	read_lock(&dev_list->lock);
+	list_for_each(l, list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
 		if (dev->type == type &&
 		    !memcmp(dev->dev_addr, ha, dev->addr_len))
-			break;
+			goto out;
+	}
+	dev = NULL;
+out:
+	read_unlock(&dev_list->lock);
 	return dev;
 }
 
@@ -586,15 +605,25 @@
 
 struct net_device *dev_getfirstbyhwtype(unsigned short type)
 {
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
 	struct net_device *dev;
 
 	rtnl_lock();
-	for (dev = dev_base; dev; dev = dev->next) {
+
+	read_lock(&dev_list->lock);
+	list_for_each(l, list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
 		if (dev->type == type) {
 			dev_hold(dev);
-			break;
+			goto out;
 		}
 	}
+	dev = NULL;
+out:
+	read_unlock(&dev_list->lock);
 	rtnl_unlock();
 	return dev;
 }
@@ -614,16 +643,23 @@
 
 struct net_device * dev_get_by_flags(unsigned short if_flags, unsigned short mask)
 {
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
 	struct net_device *dev;
 
-	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev != NULL; dev = dev->next) {
+	read_lock(&dev_list->lock);
+	list_for_each(l, list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
 		if (((dev->flags ^ if_flags) & mask) == 0) {
 			dev_hold(dev);
-			break;
+			goto out;
 		}
 	}
-	read_unlock(&dev_base_lock);
+	dev = NULL;
+out:
+	read_unlock(&dev_list->lock);
 	return dev;
 }
 
@@ -1942,6 +1978,9 @@
 static int dev_ifconf(char __user *arg)
 {
 	struct ifconf ifc;
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
 	struct net_device *dev;
 	char __user *pos;
 	int len;
@@ -1963,8 +2002,14 @@
 	 */
 
 	total = 0;
-	for (dev = dev_base; dev; dev = dev->next) {
+
+	list_for_each(l, list) {
+
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
+
 		for (i = 0; i < NPROTO; i++) {
+
 			if (gifconf_list[i]) {
 				int done;
 				if (!pos)
@@ -1995,40 +2040,63 @@
  *	This is invoked by the /proc filesystem handler to display a device
  *	in detail.
  */
-static __inline__ struct net_device *dev_get_idx(loff_t pos)
+static __inline__ struct net_ns_dev *dev_get_idx(loff_t pos)
 {
-	struct net_device *dev;
-	loff_t i;
-
-	for (i = 0, dev = dev_base; dev && i < pos; ++i, dev = dev->next);
-
-	return i == pos ? dev : NULL;
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
+
+	loff_t i = 0;
+
+	list_for_each(l, list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		if (i == pos)
+			return db;
+		i++;
+	};
+	return NULL;
 }
 
 void *dev_seq_start(struct seq_file *seq, loff_t *pos)
 {
-	read_lock(&dev_base_lock);
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+
+	read_lock(&dev_list->lock);
 	return *pos ? dev_get_idx(*pos - 1) : SEQ_START_TOKEN;
 }
 
 void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct net_ns_dev *db = NULL;
+	struct list_head *next;
+
 	++*pos;
-	return v == SEQ_START_TOKEN ? dev_base : ((struct net_device *)v)->next;
+
+	if (v == SEQ_START_TOKEN)
+		next = dev_list->list.next;
+	else
+		next = ((struct net_ns_dev*)v)->list.next;
+	if (next && next != &dev_list->list)
+		db = list_entry(next, struct net_ns_dev, list);
+	return db;
 }
 
 void dev_seq_stop(struct seq_file *seq, void *v)
 {
-	read_unlock(&dev_base_lock);
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	read_unlock(&dev_list->lock);
 }
 
-static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
+static void dev_seq_printf_stats(struct seq_file *seq, struct net_ns_dev *db)
 {
+	struct net_device *dev = db->dev;
+
 	if (dev->get_stats) {
 		struct net_device_stats *stats = dev->get_stats(dev);
 
 		seq_printf(seq, "%6s:%8lu %7lu %4lu %4lu %4lu %5lu %10lu %9lu "
-				"%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
+			        "%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
 			   dev->name, stats->rx_bytes, stats->rx_packets,
 			   stats->rx_errors,
 			   stats->rx_dropped + stats->rx_missed_errors,
@@ -2402,7 +2470,7 @@
  */
 static int dev_ifsioc(struct ifreq *ifr, unsigned int cmd)
 {
-	int err;
+	int err = 0;
 	struct net_device *dev = __dev_get_by_name(ifr->ifr_name);
 
 	if (!dev)
@@ -2509,7 +2577,6 @@
 		/*
 		 *	Unknown or private ioctl
 		 */
-
 		default:
 			if ((cmd >= SIOCDEVPRIVATE &&
 			    cmd <= SIOCDEVPRIVATE + 15) ||
@@ -2847,6 +2914,10 @@
 		}
  	}
 
+	ret = net_ns_dev_register(dev, &(net_ns()->dev_list));
+	if (ret)
+		goto out_err;
+
 	/* Fix illegal SG+CSUM combinations. */
 	if ((dev->features & NETIF_F_SG) &&
 	    !(dev->features & (NETIF_F_IP_CSUM |
@@ -3218,6 +3289,8 @@
 		return -ENODEV;
 	}
 
+	net_ns_dev_unregister(dev, &(net_ns()->dev_list));
+
 	dev->reg_state = NETREG_UNREGISTERING;
 
 	synchronize_net();
Index: 2.6-mm/net/core/rtnetlink.c
===================================================================
--- 2.6-mm.orig/net/core/rtnetlink.c
+++ 2.6-mm/net/core/rtnetlink.c
@@ -55,6 +55,7 @@
 #include <linux/wireless.h>
 #include <net/iw_handler.h>
 #endif	/* CONFIG_NET_WIRELESS_RTNETLINK */
+#include <linux/net_ns.h>
 
 static DEFINE_MUTEX(rtnl_mutex);
 
@@ -315,21 +316,31 @@
 
 static int rtnetlink_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	int idx;
+	int idx = 0;
 	int s_idx = cb->args[0];
+
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+	struct list_head *l, *list = &dev_list->list;
+	struct net_ns_dev *db;
 	struct net_device *dev;
 
-	read_lock(&dev_base_lock);
-	for (dev=dev_base, idx=0; dev; dev = dev->next, idx++) {
-		if (idx < s_idx)
+	read_lock(&dev_list->lock);
+	list_for_each(l, list) {
+
+		if (idx++ < s_idx)
 			continue;
+
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
+
 		if (rtnetlink_fill_ifinfo(skb, dev, RTM_NEWLINK,
 					  NETLINK_CB(cb->skb).pid,
 					  cb->nlh->nlmsg_seq, 0,
 					  NLM_F_MULTI) <= 0)
 			break;
 	}
-	read_unlock(&dev_base_lock);
+	read_unlock(&dev_list->lock);
+
 	cb->args[0] = idx;
 
 	return skb->len;

--


* [RFC] [patch 4/6] [Network namespace] Network inet devices isolation
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (2 preceding siblings ...)
  2006-06-09 21:02 ` [RFC] [patch 3/6] [Network namespace] Network devices isolation dlezcano
@ 2006-06-09 21:02 ` dlezcano
  2006-06-09 21:02 ` [RFC] [patch 5/6] [Network namespace] ipv4 isolation dlezcano
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

[-- Attachment #1: inetdev_isolation.patch --]
[-- Type: text/plain, Size: 4457 bytes --]

The network isolation relies on the fact that an application cannot
use IP addresses that do not belong to the container in which it is
running. This patch isolates the inet device level by adding a
network namespace pointer to the structure in_ifaddr. When an IP
address is set inside a network namespace, the structure in_ifaddr is
filled with the current namespace pointer. The loopback address is a
special case: it belongs to all namespaces, which is expressed by
leaving its network namespace pointer set to NULL. This patch
isolates the ifconfig and ip addr commands, so that when an IP
address is set, it is not visible from other network namespaces.
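
The visibility rule used throughout the patch can be condensed into
the following predicate (a sketch only; this helper is not actually
added by the patch):

	#include <linux/inetdevice.h>
	#include <linux/net_ns.h>

	/* an inet address is visible from the current network namespace
	 * if it is a loopback address (ifa_net_ns == NULL, shared by
	 * every namespace) or if it was configured from this namespace */
	static inline int ifa_visible(const struct in_ifaddr *ifa)
	{
		return !ifa->ifa_net_ns || ifa->ifa_net_ns == net_ns();
	}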

Replace-Subject: [Network namespace] Network inet devices isolation 
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> 
--
 include/linux/inetdevice.h |    1 +
 net/ipv4/devinet.c         |   28 +++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

Index: 2.6-mm/include/linux/inetdevice.h
===================================================================
--- 2.6-mm.orig/include/linux/inetdevice.h
+++ 2.6-mm/include/linux/inetdevice.h
@@ -99,6 +99,7 @@
 	unsigned char		ifa_flags;
 	unsigned char		ifa_prefixlen;
 	char			ifa_label[IFNAMSIZ];
+	struct net_namespace    *ifa_net_ns;
 };
 
 extern int register_inetaddr_notifier(struct notifier_block *nb);
Index: 2.6-mm/net/ipv4/devinet.c
===================================================================
--- 2.6-mm.orig/net/ipv4/devinet.c
+++ 2.6-mm/net/ipv4/devinet.c
@@ -54,6 +54,7 @@
 #include <linux/notifier.h>
 #include <linux/inetdevice.h>
 #include <linux/igmp.h>
+#include <linux/net_ns.h>
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
@@ -257,6 +258,7 @@
 
 			if (!(ifa->ifa_flags & IFA_F_SECONDARY) ||
 			    ifa1->ifa_mask != ifa->ifa_mask ||
+			    ifa->ifa_net_ns != net_ns() ||
 			    !inet_ifa_match(ifa1->ifa_address, ifa)) {
 				ifap1 = &ifa->ifa_next;
 				prev_prom = ifa;
@@ -317,6 +319,8 @@
 	if (destroy) {
 		inet_free_ifa(ifa1);
 
+		put_net_ns(ifa1->ifa_net_ns);
+
 		if (!in_dev->ifa_list)
 			inetdev_destroy(in_dev);
 	}
@@ -343,6 +347,7 @@
 		    ifa->ifa_scope <= ifa1->ifa_scope)
 			last_primary = &ifa1->ifa_next;
 		if (ifa1->ifa_mask == ifa->ifa_mask &&
+		    ifa1->ifa_net_ns == ifa->ifa_net_ns &&
 		    inet_ifa_match(ifa1->ifa_address, ifa)) {
 			if (ifa1->ifa_local == ifa->ifa_local) {
 				inet_free_ifa(ifa);
@@ -437,6 +442,8 @@
 
 	for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 	     ifap = &ifa->ifa_next) {
+		if (ifa->ifa_net_ns != net_ns())
+			continue;
 		if ((rta[IFA_LOCAL - 1] &&
 		     memcmp(RTA_DATA(rta[IFA_LOCAL - 1]),
 			    &ifa->ifa_local, 4)) ||
@@ -497,6 +504,9 @@
 	ifa->ifa_scope = ifm->ifa_scope;
 	in_dev_hold(in_dev);
 	ifa->ifa_dev   = in_dev;
+	ifa->ifa_net_ns = net_ns();
+	get_net_ns(net_ns());
+
 	if (rta[IFA_LABEL - 1])
 		rtattr_strlcpy(ifa->ifa_label, rta[IFA_LABEL - 1], IFNAMSIZ);
 	else
@@ -631,10 +641,15 @@
 			for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 			     ifap = &ifa->ifa_next)
 				if (!strcmp(ifr.ifr_name, ifa->ifa_label))
-					break;
+					if (!ifa->ifa_net_ns ||
+					    ifa->ifa_net_ns == net_ns())
+						break;
 		}
 	}
 
+	if (ifa && ifa->ifa_net_ns && ifa->ifa_net_ns != net_ns())
+		goto done;
+
 	ret = -EADDRNOTAVAIL;
 	if (!ifa && cmd != SIOCSIFADDR && cmd != SIOCSIFFLAGS)
 		goto done;
@@ -678,6 +693,12 @@
 			ret = -ENOBUFS;
 			if ((ifa = inet_alloc_ifa()) == NULL)
 				break;
+			if (!LOOPBACK(sin->sin_addr.s_addr)) {
+				ifa->ifa_net_ns = net_ns();
+				get_net_ns(net_ns());
+			} else
+				ifa->ifa_net_ns = NULL;
+
 			if (colon)
 				memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
 			else
@@ -782,6 +803,8 @@
 		goto out;
 
 	for (; ifa; ifa = ifa->ifa_next) {
+		if (ifa->ifa_net_ns && ifa->ifa_net_ns != net_ns())
+			continue;
 		if (!buf) {
 			done += sizeof(ifr);
 			continue;
@@ -1012,6 +1035,7 @@
 				  ifa->ifa_address = htonl(INADDR_LOOPBACK);
 				ifa->ifa_prefixlen = 8;
 				ifa->ifa_mask = inet_make_mask(8);
+				ifa->ifa_net_ns = NULL;
 				in_dev_hold(in_dev);
 				ifa->ifa_dev = in_dev;
 				ifa->ifa_scope = RT_SCOPE_HOST;
@@ -1110,6 +1134,8 @@
 
 		for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 		     ifa = ifa->ifa_next, ip_idx++) {
+			if (ifa->ifa_net_ns && ifa->ifa_net_ns != net_ns())
+				continue;
 			if (ip_idx < s_ip_idx)
 				continue;
 			if (inet_fill_ifaddr(skb, ifa, NETLINK_CB(cb->skb).pid,

--


* [RFC] [patch 5/6] [Network namespace] ipv4 isolation
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (3 preceding siblings ...)
  2006-06-09 21:02 ` [RFC] [patch 4/6] [Network namespace] Network inet " dlezcano
@ 2006-06-09 21:02 ` dlezcano
  2006-06-10  0:23   ` James Morris
  2006-06-09 21:02 ` [RFC] [patch 6/6] [Network namespace] Network namespace debugfs dlezcano
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

[-- Attachment #1: inet_isolation.patch --]
[-- Type: text/plain, Size: 15639 bytes --]

This patch partially isolates ipv4 by adding a network namespace
pointer to the struct sock, the bind bucket and the sk_buff. When a
socket is created, the pointer to the network namespace is stored in
the struct sock, so the socket belongs to that namespace. That allows
sockets related to a namespace to be identified for lookups and
procfs.

The lookup is extended with a network namespace pointer, in order to
distinguish listen points bound to the same port. That allows several
applications to be bound to INADDR_ANY:port in different network
namespaces without conflicting. The bind is checked against port and
network namespace.

When an outgoing packet has the loopback destination address, the
sk_buff is tagged with the network namespace, so loopback packets
never leave their namespace. This approach facilitates the migration
of loopback traffic because identification is done by network
namespace and not by address. The loopback path has been benchmarked
with tbench and the overhead is roughly 1.5 %.
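
The namespace check added to the socket lookup paths boils down to
the following predicate (a condensed sketch of the tests spread
across __inet_lookup_established() and __inet_lookup_listener(); the
helper name is hypothetical):

	#include <linux/in.h>
	#include <net/sock.h>

	/* a socket is skipped during lookup only when the packet is a
	 * loopback packet owned by a different namespace; non-loopback
	 * traffic is still matched globally because the routes are not
	 * isolated yet (see patch 0/6) */
	static inline int sk_ns_mismatch(const struct sock *sk, u32 daddr,
					 const struct net_namespace *net_ns)
	{
		return sk->sk_net_ns != net_ns && LOOPBACK(daddr);
	}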

Replace-Subject: [Network namespace] ipv4 isolation
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> 
--
 include/linux/skbuff.h           |    2 ++
 include/net/inet_hashtables.h    |   34 ++++++++++++++++++++++++----------
 include/net/inet_timewait_sock.h |    1 +
 include/net/sock.h               |    4 ++++
 net/dccp/ipv4.c                  |    7 ++++---
 net/ipv4/af_inet.c               |    2 ++
 net/ipv4/inet_connection_sock.c  |    3 ++-
 net/ipv4/inet_diag.c             |    3 ++-
 net/ipv4/inet_hashtables.c       |    6 +++++-
 net/ipv4/inet_timewait_sock.c    |    1 +
 net/ipv4/ip_output.c             |    4 ++++
 net/ipv4/tcp_ipv4.c              |   25 ++++++++++++++++---------
 net/ipv4/udp.c                   |    7 +++++--
 13 files changed, 72 insertions(+), 27 deletions(-)

Index: 2.6-mm/include/linux/skbuff.h
===================================================================
--- 2.6-mm.orig/include/linux/skbuff.h
+++ 2.6-mm/include/linux/skbuff.h
@@ -27,6 +27,7 @@
 #include <linux/poll.h>
 #include <linux/net.h>
 #include <linux/textsearch.h>
+#include <linux/net_ns.h>
 #include <net/checksum.h>
 #include <linux/dmaengine.h>
 
@@ -301,6 +302,7 @@
 				*data,
 				*tail,
 				*end;
+	struct net_namespace    *net_ns;
 };
 
 #ifdef __KERNEL__
Index: 2.6-mm/include/net/inet_hashtables.h
===================================================================
--- 2.6-mm.orig/include/net/inet_hashtables.h
+++ 2.6-mm/include/net/inet_hashtables.h
@@ -23,6 +23,8 @@
 #include <linux/spinlock.h>
 #include <linux/types.h>
 #include <linux/wait.h>
+#include <linux/in.h>
+#include <linux/net_ns.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_sock.h>
@@ -78,6 +80,7 @@
 	signed short		fastreuse;
 	struct hlist_node	node;
 	struct hlist_head	owners;
+	struct net_namespace    *net_ns;
 };
 
 #define inet_bind_bucket_for_each(tb, node, head) \
@@ -274,13 +277,15 @@
 extern struct sock *__inet_lookup_listener(const struct hlist_head *head,
 					   const u32 daddr,
 					   const unsigned short hnum,
-					   const int dif);
+					   const int dif,
+					   const struct net_namespace *net_ns);
 
 /* Optimize the common listener case. */
 static inline struct sock *
 		inet_lookup_listener(struct inet_hashinfo *hashinfo,
 				     const u32 daddr,
-				     const unsigned short hnum, const int dif)
+				     const unsigned short hnum, const int dif,
+				     const struct net_namespace *net_ns)
 {
 	struct sock *sk = NULL;
 	const struct hlist_head *head;
@@ -294,8 +299,9 @@
 		    (!inet->rcv_saddr || inet->rcv_saddr == daddr) &&
 		    (sk->sk_family == PF_INET || !ipv6_only_sock(sk)) &&
 		    !sk->sk_bound_dev_if)
-			goto sherry_cache;
-		sk = __inet_lookup_listener(head, daddr, hnum, dif);
+			if (sk->sk_net_ns == net_ns && LOOPBACK(daddr))
+				goto sherry_cache;
+		sk = __inet_lookup_listener(head, daddr, hnum, dif, net_ns);
 	}
 	if (sk) {
 sherry_cache:
@@ -358,7 +364,8 @@
 	__inet_lookup_established(struct inet_hashinfo *hashinfo,
 				  const u32 saddr, const u16 sport,
 				  const u32 daddr, const u16 hnum,
-				  const int dif)
+				  const int dif,
+				  const struct net_namespace *net_ns)
 {
 	INET_ADDR_COOKIE(acookie, saddr, daddr)
 	const __u32 ports = INET_COMBINED_PORTS(sport, hnum);
@@ -373,12 +380,16 @@
 	prefetch(head->chain.first);
 	read_lock(&head->lock);
 	sk_for_each(sk, node, &head->chain) {
+		if (sk->sk_net_ns != net_ns && LOOPBACK(daddr))
+			continue;
 		if (INET_MATCH(sk, hash, acookie, saddr, daddr, ports, dif))
 			goto hit; /* You sunk my battleship! */
 	}
 
 	/* Must check for a TIME_WAIT'er before going to listener hash. */
 	sk_for_each(sk, node, &(head + hashinfo->ehash_size)->chain) {
+		if (sk->sk_net_ns != net_ns && LOOPBACK(daddr))
+			continue;
 		if (INET_TW_MATCH(sk, hash, acookie, saddr, daddr, ports, dif))
 			goto hit;
 	}
@@ -394,22 +405,25 @@
 static inline struct sock *__inet_lookup(struct inet_hashinfo *hashinfo,
 					 const u32 saddr, const u16 sport,
 					 const u32 daddr, const u16 hnum,
-					 const int dif)
+					 const int dif,
+					 const struct net_namespace *net_ns)
 {
 	struct sock *sk = __inet_lookup_established(hashinfo, saddr, sport, daddr,
-						    hnum, dif);
-	return sk ? : inet_lookup_listener(hashinfo, daddr, hnum, dif);
+						    hnum, dif, net_ns);
+	return sk ? : inet_lookup_listener(hashinfo, daddr, hnum, dif, net_ns);
 }
 
 static inline struct sock *inet_lookup(struct inet_hashinfo *hashinfo,
 				       const u32 saddr, const u16 sport,
 				       const u32 daddr, const u16 dport,
-				       const int dif)
+				       const int dif,
+				       const struct net_namespace *net_ns)
 {
 	struct sock *sk;
 
 	local_bh_disable();
-	sk = __inet_lookup(hashinfo, saddr, sport, daddr, ntohs(dport), dif);
+	sk = __inet_lookup(hashinfo, saddr, sport, daddr, ntohs(dport),
+			   dif, net_ns);
 	local_bh_enable();
 
 	return sk;
Index: 2.6-mm/include/net/inet_timewait_sock.h
===================================================================
--- 2.6-mm.orig/include/net/inet_timewait_sock.h
+++ 2.6-mm/include/net/inet_timewait_sock.h
@@ -115,6 +115,7 @@
 #define tw_refcnt		__tw_common.skc_refcnt
 #define tw_hash			__tw_common.skc_hash
 #define tw_prot			__tw_common.skc_prot
+#define tw_net_ns               __tw_common.skc_net_ns
 	volatile unsigned char	tw_substate;
 	/* 3 bits hole, try to pack */
 	unsigned char		tw_rcv_wscale;
Index: 2.6-mm/include/net/sock.h
===================================================================
--- 2.6-mm.orig/include/net/sock.h
+++ 2.6-mm/include/net/sock.h
@@ -47,6 +47,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/net_ns.h>
 
 #include <linux/filter.h>
 
@@ -94,6 +95,7 @@
  *	@skc_refcnt: reference count
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_prot: protocol handlers inside a network family
+ *      @skc_net_ns: network namespace owning the socket
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
@@ -108,6 +110,7 @@
 	atomic_t		skc_refcnt;
 	unsigned int		skc_hash;
 	struct proto		*skc_prot;
+	struct net_namespace    *skc_net_ns;
 };
 
 /**
@@ -183,6 +186,7 @@
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_hash			__sk_common.skc_hash
 #define sk_prot			__sk_common.skc_prot
+#define sk_net_ns               __sk_common.skc_net_ns
 	unsigned char		sk_shutdown : 2,
 				sk_no_check : 2,
 				sk_userlocks : 4;
Index: 2.6-mm/net/dccp/ipv4.c
===================================================================
--- 2.6-mm.orig/net/dccp/ipv4.c
+++ 2.6-mm/net/dccp/ipv4.c
@@ -308,7 +308,8 @@
 	}
 
 	sk = inet_lookup(&dccp_hashinfo, iph->daddr, dh->dccph_dport,
-			 iph->saddr, dh->dccph_sport, inet_iif(skb));
+			 iph->saddr, dh->dccph_sport, inet_iif(skb),
+			 skb->net_ns);
 	if (sk == NULL) {
 		ICMP_INC_STATS_BH(ICMP_MIB_INERRORS);
 		return;
@@ -610,7 +611,7 @@
 	nsk = __inet_lookup_established(&dccp_hashinfo,
 					iph->saddr, dh->dccph_sport,
 					iph->daddr, ntohs(dh->dccph_dport),
-					inet_iif(skb));
+					inet_iif(skb), skb->net_ns);
 	if (nsk != NULL) {
 		if (nsk->sk_state != DCCP_TIME_WAIT) {
 			bh_lock_sock(nsk);
@@ -924,7 +925,7 @@
 	sk = __inet_lookup(&dccp_hashinfo,
 			   skb->nh.iph->saddr, dh->dccph_sport,
 			   skb->nh.iph->daddr, ntohs(dh->dccph_dport),
-			   inet_iif(skb));
+			   inet_iif(skb), skb->net_ns);
 
 	/* 
 	 * Step 2:
Index: 2.6-mm/net/ipv4/af_inet.c
===================================================================
--- 2.6-mm.orig/net/ipv4/af_inet.c
+++ 2.6-mm/net/ipv4/af_inet.c
@@ -325,6 +325,7 @@
 	sk->sk_family	   = PF_INET;
 	sk->sk_protocol	   = protocol;
 	sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;
+	sk->sk_net_ns      = net_ns();
 
 	inet->uc_ttl	= -1;
 	inet->mc_loop	= 1;
@@ -616,6 +617,7 @@
 
 	sock_graft(sk2, newsock);
 
+	sk2->sk_net_ns = net_ns();
 	newsock->state = SS_CONNECTED;
 	err = 0;
 	release_sock(sk2);
Index: 2.6-mm/net/ipv4/inet_connection_sock.c
===================================================================
--- 2.6-mm.orig/net/ipv4/inet_connection_sock.c
+++ 2.6-mm/net/ipv4/inet_connection_sock.c
@@ -116,7 +116,7 @@
 		head = &hashinfo->bhash[inet_bhashfn(snum, hashinfo->bhash_size)];
 		spin_lock(&head->lock);
 		inet_bind_bucket_for_each(tb, node, &head->chain)
-			if (tb->port == snum)
+			if (tb->port == snum && tb->net_ns == net_ns())
 				goto tb_found;
 	}
 	tb = NULL;
@@ -146,6 +146,7 @@
 	} else if (tb->fastreuse &&
 		   (!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
 		tb->fastreuse = 0;
+	tb->net_ns = net_ns();
 success:
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, snum);
Index: 2.6-mm/net/ipv4/inet_diag.c
===================================================================
--- 2.6-mm.orig/net/ipv4/inet_diag.c
+++ 2.6-mm/net/ipv4/inet_diag.c
@@ -241,7 +241,8 @@
 	if (req->idiag_family == AF_INET) {
 		sk = inet_lookup(hashinfo, req->id.idiag_dst[0],
 				 req->id.idiag_dport, req->id.idiag_src[0],
-				 req->id.idiag_sport, req->id.idiag_if);
+				 req->id.idiag_sport, req->id.idiag_if,
+				 in_skb->net_ns);
 	}
 #if defined(CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE)
 	else if (req->idiag_family == AF_INET6) {
Index: 2.6-mm/net/ipv4/inet_hashtables.c
===================================================================
--- 2.6-mm.orig/net/ipv4/inet_hashtables.c
+++ 2.6-mm/net/ipv4/inet_hashtables.c
@@ -126,7 +126,8 @@
  * wildcarded during the search since they can never be otherwise.
  */
 struct sock *__inet_lookup_listener(const struct hlist_head *head, const u32 daddr,
-				    const unsigned short hnum, const int dif)
+				    const unsigned short hnum, const int dif,
+				    const struct net_namespace *net_ns)
 {
 	struct sock *result = NULL, *sk;
 	const struct hlist_node *node;
@@ -139,6 +140,9 @@
 			const __u32 rcv_saddr = inet->rcv_saddr;
 			int score = sk->sk_family == PF_INET ? 1 : 0;
 
+			if (sk->sk_net_ns != net_ns && LOOPBACK(daddr))
+				continue;
+
 			if (rcv_saddr) {
 				if (rcv_saddr != daddr)
 					continue;
Index: 2.6-mm/net/ipv4/inet_timewait_sock.c
===================================================================
--- 2.6-mm.orig/net/ipv4/inet_timewait_sock.c
+++ 2.6-mm/net/ipv4/inet_timewait_sock.c
@@ -110,6 +110,7 @@
 		tw->tw_hash	    = sk->sk_hash;
 		tw->tw_ipv6only	    = 0;
 		tw->tw_prot	    = sk->sk_prot_creator;
+		tw->tw_net_ns       = sk->sk_net_ns;
 		atomic_set(&tw->tw_refcnt, 1);
 		inet_twsk_dead_node_init(tw);
 		__module_get(tw->tw_prot->owner);
Index: 2.6-mm/net/ipv4/ip_output.c
===================================================================
--- 2.6-mm.orig/net/ipv4/ip_output.c
+++ 2.6-mm/net/ipv4/ip_output.c
@@ -284,6 +284,10 @@
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
+	if ((skb->nh.iph->protocol == IPPROTO_TCP ||
+	     skb->nh.iph->protocol == IPPROTO_UDP) &&
+	    LOOPBACK(skb->nh.iph->daddr))
+			skb->net_ns = skb->sk->sk_net_ns;
 
 	return NF_HOOK_COND(PF_INET, NF_IP_POST_ROUTING, skb, NULL, dev,
 		            ip_finish_output,
Index: 2.6-mm/net/ipv4/tcp_ipv4.c
===================================================================
--- 2.6-mm.orig/net/ipv4/tcp_ipv4.c
+++ 2.6-mm/net/ipv4/tcp_ipv4.c
@@ -349,7 +349,7 @@
 	}
 
 	sk = inet_lookup(&tcp_hashinfo, iph->daddr, th->dest, iph->saddr,
-			 th->source, inet_iif(skb));
+			 th->source, inet_iif(skb), skb->net_ns);
 	if (!sk) {
 		ICMP_INC_STATS_BH(ICMP_MIB_INERRORS);
 		return;
@@ -933,7 +933,8 @@
 
 	nsk = __inet_lookup_established(&tcp_hashinfo, skb->nh.iph->saddr,
 					th->source, skb->nh.iph->daddr,
-					ntohs(th->dest), inet_iif(skb));
+					ntohs(th->dest), inet_iif(skb),
+					skb->net_ns);
 
 	if (nsk) {
 		if (nsk->sk_state != TCP_TIME_WAIT) {
@@ -1071,7 +1072,7 @@
 
 	sk = __inet_lookup(&tcp_hashinfo, skb->nh.iph->saddr, th->source,
 			   skb->nh.iph->daddr, ntohs(th->dest),
-			   inet_iif(skb));
+			   inet_iif(skb), skb->net_ns);
 
 	if (!sk)
 		goto no_tcp_socket;
@@ -1149,7 +1150,8 @@
 		struct sock *sk2 = inet_lookup_listener(&tcp_hashinfo,
 							skb->nh.iph->daddr,
 							ntohs(th->dest),
-							inet_iif(skb));
+							inet_iif(skb),
+							skb->net_ns);
 		if (sk2) {
 			inet_twsk_deschedule((struct inet_timewait_sock *)sk,
 					     &tcp_death_row);
@@ -1395,7 +1397,8 @@
 	}
 get_sk:
 	sk_for_each_from(sk, node) {
-		if (sk->sk_family == st->family) {
+		if (sk->sk_family == st->family &&
+		    sk->sk_net_ns == net_ns()) {
 			cur = sk;
 			goto out;
 		}
@@ -1446,7 +1449,8 @@
 
 		read_lock(&tcp_hashinfo.ehash[st->bucket].lock);
 		sk_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
-			if (sk->sk_family != st->family) {
+			if (sk->sk_family != st->family ||
+			    sk->sk_net_ns != net_ns()) {
 				continue;
 			}
 			rc = sk;
@@ -1455,7 +1459,8 @@
 		st->state = TCP_SEQ_STATE_TIME_WAIT;
 		inet_twsk_for_each(tw, node,
 				   &tcp_hashinfo.ehash[st->bucket + tcp_hashinfo.ehash_size].chain) {
-			if (tw->tw_family != st->family) {
+			if (tw->tw_family != st->family ||
+			    tw->tw_net_ns != net_ns()) {
 				continue;
 			}
 			rc = tw;
@@ -1481,7 +1486,8 @@
 		tw = cur;
 		tw = tw_next(tw);
 get_tw:
-		while (tw && tw->tw_family != st->family) {
+		while (tw && (tw->tw_family != st->family ||
+			      tw->tw_net_ns != net_ns())) {
 			tw = tw_next(tw);
 		}
 		if (tw) {
@@ -1505,7 +1511,8 @@
 		sk = sk_next(sk);
 
 	sk_for_each_from(sk, node) {
-		if (sk->sk_family == st->family)
+		if (sk->sk_family == st->family &&
+		    sk->sk_net_ns == net_ns())
 			goto found;
 	}
 
Index: 2.6-mm/net/ipv4/udp.c
===================================================================
--- 2.6-mm.orig/net/ipv4/udp.c
+++ 2.6-mm/net/ipv4/udp.c
@@ -184,6 +184,7 @@
 			    (!inet2->rcv_saddr ||
 			     !inet->rcv_saddr ||
 			     inet2->rcv_saddr == inet->rcv_saddr) &&
+			    sk2->sk_net_ns == sk->sk_net_ns &&
 			    (!sk2->sk_reuse || !sk->sk_reuse))
 				goto fail;
 		}
@@ -1404,7 +1405,8 @@
 	for (state->bucket = 0; state->bucket < UDP_HTABLE_SIZE; ++state->bucket) {
 		struct hlist_node *node;
 		sk_for_each(sk, node, &udp_hash[state->bucket]) {
-			if (sk->sk_family == state->family)
+			if (sk->sk_family == state->family &&
+			    sk->sk_net_ns == net_ns())
 				goto found;
 		}
 	}
@@ -1421,7 +1423,8 @@
 		sk = sk_next(sk);
 try_again:
 		;
-	} while (sk && sk->sk_family != state->family);
+	} while (sk && (sk->sk_family != state->family ||
+			sk->sk_net_ns != net_ns()));
 
 	if (!sk && ++state->bucket < UDP_HTABLE_SIZE) {
 		sk = sk_head(&udp_hash[state->bucket]);

--


* [RFC] [patch 6/6] [Network namespace] Network namespace debugfs
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (4 preceding siblings ...)
  2006-06-09 21:02 ` [RFC] [patch 5/6] [Network namespace] ipv4 isolation dlezcano
@ 2006-06-09 21:02 ` dlezcano
  2006-06-10  7:16 ` [RFC] [patch 0/6] [Network namespace] introduction Kari Hurtta
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 113+ messages in thread
From: dlezcano @ 2006-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: serue, haveblue, clg, dlezcano

[-- Attachment #1: net_ns_debugfs.patch --]
[-- Type: text/plain, Size: 4800 bytes --]

This patch is for testing purposes. It allows reading which network
devices are accessible and adding a network device to the view.
This RFC hack is purely for discussing the best way to do that.

After unsharing with CLONE_NEWNET flag:
--------------------------------------
 To see which devices are accessible:
	 cat /sys/kernel/debug/net_ns/dev

 To add a device:
	 echo eth1 > /sys/kernel/debug/net_ns/dev

This functionality is intended to be implemented in a higher-level
container configuration tool.
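
For reference, a minimal user-space sketch of the sequence (unshare, then
make eth0 visible through debugfs). CLONE_NEWNET is defined by this
patchset, not by mainline headers, so the fallback value below is only a
placeholder and the real value has to come from the patched kernel headers;
error handling is kept to a minimum:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	#ifndef CLONE_NEWNET
	#define CLONE_NEWNET 0x40000000	/* placeholder: take the real value from the patched kernel */
	#endif

	int main(void)
	{
		FILE *f;

		if (unshare(CLONE_NEWNET) < 0) {	/* must be run as root */
			perror("unshare");
			return 1;
		}

		f = fopen("/sys/kernel/debug/net_ns/dev", "w");
		if (!f) {
			perror("/sys/kernel/debug/net_ns/dev");
			return 1;
		}
		fprintf(f, "eth0\n");		/* make eth0 visible in the new namespace */
		fclose(f);

		/* now use ifconfig or ip to set an address that is only
		   visible inside this namespace */
		return 0;
	}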

Replace-Subject: [Network namespace] Network namespace debugfs
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> 
--
 fs/debugfs/Makefile |    2 
 fs/debugfs/net_ns.c |  141 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/Kconfig         |    4 +
 3 files changed, 146 insertions(+), 1 deletion(-)

Index: 2.6-mm/fs/debugfs/Makefile
===================================================================
--- 2.6-mm.orig/fs/debugfs/Makefile
+++ 2.6-mm/fs/debugfs/Makefile
@@ -1,4 +1,4 @@
 debugfs-objs	:= inode.o file.o
 
 obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
-
+obj-$(CONFIG_NET_NS_DEBUG) += net_ns.o
Index: 2.6-mm/fs/debugfs/net_ns.c
===================================================================
--- /dev/null
+++ 2.6-mm/fs/debugfs/net_ns.c
@@ -0,0 +1,141 @@
+/*
+ *  net_ns.c - adds a net_ns/ directory to debug NET namespaces
+ *
+ *  Copyright (C) 2006 IBM
+ *
+ *  Author: Daniel Lezcano <dlezcano@fr.ibm.com>
+ *
+ *     This program is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation, version 2 of the
+ *     License.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/pagemap.h>
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+#include <linux/netdevice.h>
+#include <linux/net_ns.h>
+
+static struct dentry *net_ns_dentry;
+static struct dentry *net_ns_dentry_dev;
+
+static ssize_t net_ns_dev_read_file(struct file *file, char __user *user_buf,
+				    size_t count, loff_t *ppos)
+{
+	size_t len;
+	char *buf;
+	struct net_ns_dev_list *devlist = &(net_ns()->dev_list);
+	struct net_ns_dev *db;
+	struct net_device *dev;
+	struct list_head *l;
+
+	if (*ppos < 0)
+		return -EINVAL;
+	if (*ppos >= count)
+		return 0;
+
+	/* It's for debug, everything should fit */
+	buf = kmalloc(4096, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	buf[0] = '\0';
+
+	read_lock(&devlist->lock);
+	list_for_each(l, &devlist->list) {
+		db = list_entry(l, struct net_ns_dev, list);
+		dev = db->dev;
+		strcat(buf,dev->name);
+		strcat(buf,"\n");
+	}
+	read_unlock(&devlist->lock);
+
+	len = strlen(buf);
+
+	if (len > count)
+		len = count;
+
+	if (copy_to_user(user_buf, buf, len)) {
+		kfree(buf);
+		return -EFAULT;
+	}
+
+	*ppos += count;
+	kfree(buf);
+
+	return count;
+}
+
+static ssize_t net_ns_dev_write_file(struct file *file,
+				     const char __user *user_buf,
+				     size_t count, loff_t *ppos)
+{
+	int ret;
+	size_t len;
+	const char __user *p;
+	char c;
+	char devname[IFNAMSIZ];
+	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
+
+	len = 0;
+	p = user_buf;
+	while (len < count) {
+		if (get_user(c, p++))
+			return -EFAULT;
+		if (c == 0 || c == '\n')
+			break;
+		len++;
+	}
+
+	if (len >= IFNAMSIZ)
+		return -EINVAL;
+
+	if (copy_from_user(devname, user_buf, len))
+		return -EFAULT;
+
+	devname[len] = '\0';
+
+	ret = net_ns_dev_add(devname, dev_list);
+	if (ret)
+		return ret;
+
+	*ppos += count;
+	return count;
+}
+
+static int net_ns_dev_open_file(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static struct file_operations net_ns_dev_fops = {
+       .read =         net_ns_dev_read_file,
+       .write =        net_ns_dev_write_file,
+       .open =         net_ns_dev_open_file,
+};
+
+static int __init net_ns_init(void)
+{
+	net_ns_dentry = debugfs_create_dir("net_ns", NULL);
+
+	net_ns_dentry_dev = debugfs_create_file("dev", 0666,
+						net_ns_dentry,
+						NULL,
+						&net_ns_dev_fops);
+	return 0;
+}
+
+static void __exit net_ns_exit(void)
+{
+	debugfs_remove(net_ns_dentry_dev);
+	debugfs_remove(net_ns_dentry);
+}
+
+module_init(net_ns_init);
+module_exit(net_ns_exit);
+
+MODULE_DESCRIPTION("NET namespace debugfs");
+MODULE_AUTHOR("Daniel Lezcano <dlezcano@fr.ibm.com>");
+MODULE_LICENSE("GPL");
Index: 2.6-mm/net/Kconfig
===================================================================
--- 2.6-mm.orig/net/Kconfig
+++ 2.6-mm/net/Kconfig
@@ -69,6 +69,10 @@ config NET_NS
 	  vservers, to use network namespaces to provide isolated
 	  network for different servers.  If unsure, say N.
 
+config NET_NS_DEBUG
+	bool "Debug fs for network namespace"
+	depends on DEBUG_FS && NET_NS
+
 if INET
 source "net/ipv4/Kconfig"
 source "net/ipv6/Kconfig"

--

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 5/6] [Network namespace] ipv4 isolation
  2006-06-09 21:02 ` [RFC] [patch 5/6] [Network namespace] ipv4 isolation dlezcano
@ 2006-06-10  0:23   ` James Morris
  2006-06-10  0:27     ` Rick Jones
  0 siblings, 1 reply; 113+ messages in thread
From: James Morris @ 2006-06-10  0:23 UTC (permalink / raw)
  To: dlezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg

On Fri, 9 Jun 2006, dlezcano@fr.ibm.com wrote:

> When an outgoing packet has the loopback destination addres, the
> skbuff is filled with the network namespace. So the loopback packets
> never go outside the namespace. This approach facilitate the migration
> of loopback because identification is done by network namespace and
> not by address. The loopback has been benchmarked by tbench and the
> overhead is roughly 1.5 %

I think you'll need to make it so this code has zero impact when not 
configured.


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 5/6] [Network namespace] ipv4 isolation
  2006-06-10  0:23   ` James Morris
@ 2006-06-10  0:27     ` Rick Jones
  2006-06-10  0:47       ` James Morris
  0 siblings, 1 reply; 113+ messages in thread
From: Rick Jones @ 2006-06-10  0:27 UTC (permalink / raw)
  To: James Morris; +Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg

James Morris wrote:
> On Fri, 9 Jun 2006, dlezcano@fr.ibm.com wrote:
> 
> 
>>When an outgoing packet has the loopback destination addres, the
>>skbuff is filled with the network namespace. So the loopback packets
>>never go outside the namespace. This approach facilitate the migration
>>of loopback because identification is done by network namespace and
>>not by address. The loopback has been benchmarked by tbench and the
>>overhead is roughly 1.5 %
> 
> 
> I think you'll need to make it so this code has zero impact when not 
> configured.

Indeed, and over stuff other than loopback too.  I'll not so humbly 
suggest :)  netperf TCP_STREAM and TCP_RR figures _with_ CPU 
utilization/service demand measures.

rick jones

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 5/6] [Network namespace] ipv4 isolation
  2006-06-10  0:27     ` Rick Jones
@ 2006-06-10  0:47       ` James Morris
  0 siblings, 0 replies; 113+ messages in thread
From: James Morris @ 2006-06-10  0:47 UTC (permalink / raw)
  To: Rick Jones; +Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg

On Fri, 9 Jun 2006, Rick Jones wrote:

> > I think you'll need to make it so this code has zero impact when not
> > configured.
> 
> Indeed, and over stuff other than loopback too.  I'll not so humbly suggest :)

Yes, I meant the whole lot.



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (5 preceding siblings ...)
  2006-06-09 21:02 ` [RFC] [patch 6/6] [Network namespace] Network namespace debugfs dlezcano
@ 2006-06-10  7:16 ` Kari Hurtta
  2006-06-16  4:23 ` Eric W. Biederman
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 113+ messages in thread
From: Kari Hurtta @ 2006-06-10  7:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

dlezcano@fr.ibm.com writes in gmane.linux.network,gmane.linux.kernel:

> The following patches create a private "network namespace" for use
> within containers. This is intended for use with system containers
> like vserver, but might also be useful for restricting individual
> applications' access to the network stack.
> 
> These patches isolate traffic inside the network namespace. The
> network ressources, the incoming and the outgoing packets are
> identified to be related to a namespace. 
> 
> It hides network resource not contained in the current namespace, but
> still allows administration of the network with normal commands like
> ifconfig.
> 
> It applies to the kernel version 2.6.17-rc6-mm1
> 
> It provides the following:
> -------------------------
>    - when an application unshares its network namespace, it looses its
>      view of all network devices by default. The administrator can
>      choose to make any devices to become visible again. The container
>      then gains a view to the device but without the ip address
>      configured on it. It is up to the container administrator to use
>      ifconfig or ip command to setup a new ip address. This ip address
>      is only visible inside the container.

Do other namespaces work differently?
When a namespace is unshared, it initially has the same resources
(compare, for example, with CLONE_NEWNS).

 
>    - the loopback is isolated inside the container and it is not
>      possible to communicate between containers via the
>      loopback. 
> 
>    - several containers can have an application bind to the same
>      address:port without conflicting. 

That of course would be a problem if the newly unshared namespace
initially shared the same resources.

/ Kari Hurtta


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-09 21:02 ` [RFC] [patch 2/6] [Network namespace] Network device sharing by view dlezcano
@ 2006-06-11 10:18   ` Andrew Morton
  2006-06-18 18:53   ` Al Viro
  2006-06-26  9:47   ` Andrey Savochkin
  2 siblings, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2006-06-11 10:18 UTC (permalink / raw)
  To: dlezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg, dlezcano

On Fri, 09 Jun 2006 23:02:04 +0200
dlezcano@fr.ibm.com wrote:

> +int net_ns_dev_add(const char *devname,
> +		   struct net_ns_dev_list *devlist)
> +{
> +	struct net_ns_dev *db;
> +	struct net_device *dev;
> +	int ret = 0;
> +
> +	read_lock(&dev_base_lock);
> +
> +	for (dev = dev_base; dev; dev = dev->next)
> +		if (!strncmp(dev->name, devname, IFNAMSIZ))
> +			break;
> +
> +	if (!dev) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	db = kmalloc(sizeof(*db), GFP_KERNEL);

sleep-in-spinlock.  Please always test new code with all kernel debugging
options enabled.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (6 preceding siblings ...)
  2006-06-10  7:16 ` [RFC] [patch 0/6] [Network namespace] introduction Kari Hurtta
@ 2006-06-16  4:23 ` Eric W. Biederman
  2006-06-16  9:06   ` Daniel Lezcano
  2006-06-18 18:47 ` Al Viro
  2006-06-26 23:38 ` Patrick McHardy
  9 siblings, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-16  4:23 UTC (permalink / raw)
  To: dlezcano; +Cc: serue, haveblue, clg, dlezcano, linux-kernel, netdev


My apologies for not looking at this earlier; I had an email
hiccup, so I'm having to recreate the context from email archives,
and you didn't copy me.

Have you seen my previous work in this direction?

I know I had a much much more complete implementation.  The only part
I had not completed was iptables support and that was about a day's
more work.

> The following patches create a private "network namespace" for use
> within containers. This is intended for use with system containers
> like vserver, but might also be useful for restricting individual
> applications' access to the network stack.
> 
> These patches isolate traffic inside the network namespace. The
> network ressources, the incoming and the outgoing packets are
> identified to be related to a namespace.
> 
> It hides network resource not contained in the current namespace, but
> still allows administration of the network with normal commands like
> ifconfig.
> 
> It applies to the kernel version 2.6.17-rc6-mm1

A couple of comments.

> ------------
>    - do unshare with the CLONE_NEWNET flag as root
>    - do echo eth0 > /sys/kernel/debug/net_ns/dev
>    - use ifconfig or ip command to set a new ip address
> 
> What is missing ?
> -----------------
> The routes are not yet isolated, that implies:
> 
>    - binding to another container's address is allowed
> 
>    - an outgoing packet which has an unset source address can
>      potentially get another container's address
> 
>    - an incoming packet can be routed to the wrong container if there
>      are several containers listening to the same addr:port

I haven't looked at the patches in enough detail to see how the network
isolation is being done exactly.  But some of these comments and some
of the other pieces I have seen don't give me warm fuzzies.

In particular I did not see a provision for multiple instances of
the loopback device.

As a general rule network sockets and network devices should be attached
to the network namespaces, which basically preserves all of the existing
network stack logic.

Basically this means that the only operations that get more expensive
are reads of global variables, which take a necessary extra indirection.

As a general rule I found that it usually makes sense to add an additional
namespace field to hash tables so they can still use the boot time
memory allocator.  Although if you already have a network device as
part of your hash table key that isn't necessary for the network
stack.
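
To illustrate that last point, making the namespace part of the key needs
only a fold of the namespace pointer into the existing hash (plus the extra
comparison on lookup, as in the __inet_lookup changes earlier in this
series). This is a sketch of the idea only, not code from any of the posted
patches:

	/* sketch: mix the owning namespace into a 4-tuple hash so one
	 * boot-time, statically sized table can hold entries from every
	 * namespace */
	static inline unsigned int ehash_ns(unsigned int laddr, unsigned short lport,
					    unsigned int faddr, unsigned short fport,
					    const void *net_ns)
	{
		unsigned int h = (laddr ^ lport) ^ (faddr ^ fport);

		h ^= (unsigned int)(unsigned long)net_ns;	/* namespace becomes part of the key */
		h ^= h >> 16;
		h ^= h >> 8;
		return h;
	}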

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-16  4:23 ` Eric W. Biederman
@ 2006-06-16  9:06   ` Daniel Lezcano
  2006-06-16  9:22     ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-16  9:06 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: serue, haveblue, clg, linux-kernel, netdev

Eric W. Biederman wrote:

  > Have you seen my previous work in this direction?
> 
> I know I had a much much more complete implementation.  The only part
> I had not completed was iptables support and that was about a days
> more work.

No, I didn't see your work. Is it possible to send me a pointer to it or
a patchset of your code?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-16  9:06   ` Daniel Lezcano
@ 2006-06-16  9:22     ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-16  9:22 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: serue, haveblue, clg, linux-kernel, netdev

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

> Eric W. Biederman wrote:
>
>  > Have you seen my previous work in this direction?
>> I know I had a much much more complete implementation.  The only part
>> I had not completed was iptables support and that was about a days
>> more work.
>
> No, I didn't see your work, is it possible to send me a pointer on it or to have
> a patchset of your code ?

It is in my git tree up on kernel.org.  I think it is in my proof-of-concept
branch.

The individual commits tell a tangled tale that is definitely unacceptable
for an upstream merge but the actual result is interesting.

If that isn't enough to find it, tell me and I will track down the details.
It has been a couple months since I posted it during the design discussion
and I'm way overdue for being in bed, tonight.

Eric




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (7 preceding siblings ...)
  2006-06-16  4:23 ` Eric W. Biederman
@ 2006-06-18 18:47 ` Al Viro
  2006-06-20 21:21   ` Daniel Lezcano
  2006-06-26 23:38 ` Patrick McHardy
  9 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2006-06-18 18:47 UTC (permalink / raw)
  To: dlezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg

On Fri, Jun 09, 2006 at 11:02:02PM +0200, dlezcano@fr.ibm.com wrote:
> What is missing ?
> -----------------
> The routes are not yet isolated, that implies:
> 
>    - binding to another container's address is allowed
> 
>    - an outgoing packet which has an unset source address can
>      potentially get another container's address
> 
>    - an incoming packet can be routed to the wrong container if there
>      are several containers listening to the same addr:port

- renaming an interface in one "namespace" affects everyone.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-09 21:02 ` [RFC] [patch 2/6] [Network namespace] Network device sharing by view dlezcano
  2006-06-11 10:18   ` Andrew Morton
@ 2006-06-18 18:53   ` Al Viro
  2006-06-26  9:47   ` Andrey Savochkin
  2 siblings, 0 replies; 113+ messages in thread
From: Al Viro @ 2006-06-18 18:53 UTC (permalink / raw)
  To: dlezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg

On Fri, Jun 09, 2006 at 11:02:04PM +0200, dlezcano@fr.ibm.com wrote:

> +	read_lock(&dev_base_lock);
> +
> +	for (dev = dev_base; dev; dev = dev->next)
> +		if (!strncmp(dev->name, devname, IFNAMSIZ))
> +			break;
> +
> +	if (!dev) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	db = kmalloc(sizeof(*db), GFP_KERNEL);

deadlock.


Besides, holding references to net_device from userland is Not Good(tm).

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 3/6] [Network namespace] Network devices isolation
  2006-06-09 21:02 ` [RFC] [patch 3/6] [Network namespace] Network devices isolation dlezcano
@ 2006-06-18 18:57   ` Al Viro
  0 siblings, 0 replies; 113+ messages in thread
From: Al Viro @ 2006-06-18 18:57 UTC (permalink / raw)
  To: dlezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg

On Fri, Jun 09, 2006 at 11:02:05PM +0200, dlezcano@fr.ibm.com wrote:
>  struct net_device *dev_get_by_name(const char *name)
>  {
> +	struct net_ns_dev_list *dev_list = &(net_ns()->dev_list);
>  	struct net_device *dev;
>  
> -	read_lock(&dev_base_lock);
> +	read_lock(&dev_list->lock);
>  	dev = __dev_get_by_name(name);
>  	if (dev)
>  		dev_hold(dev);
> -	read_unlock(&dev_base_lock);
> +	read_unlock(&dev_list->lock);
>  	return dev;

And what would stop renames done via different lists from creating a
conflict?  Incidentally, WTF protects the device name while we are
doing that lookup?

While we are at it, what are you going to do with sysfs?
ls /sys/class/net and watch the fun...

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-18 18:47 ` Al Viro
@ 2006-06-20 21:21   ` Daniel Lezcano
  2006-06-20 21:25     ` Al Viro
  0 siblings, 1 reply; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-20 21:21 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, netdev, serue, haveblue, clg

Al Viro wrote:
> On Fri, Jun 09, 2006 at 11:02:02PM +0200, dlezcano@fr.ibm.com wrote:
> - renaming an interface in one "namespace" affects everyone.

Exactly. If we ensure the interface can't be renamed while it is used in a
different namespace, is it really a problem?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-20 21:21   ` Daniel Lezcano
@ 2006-06-20 21:25     ` Al Viro
  2006-06-20 22:45       ` Daniel Lezcano
  0 siblings, 1 reply; 113+ messages in thread
From: Al Viro @ 2006-06-20 21:25 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg

On Tue, Jun 20, 2006 at 11:21:43PM +0200, Daniel Lezcano wrote:
> Al Viro wrote:
> >On Fri, Jun 09, 2006 at 11:02:02PM +0200, dlezcano@fr.ibm.com wrote:
> >- renaming an interface in one "namespace" affects everyone.
> 
> Exact. If we ensure the interface can't be renamed if used in different 
> namespace, is it really a problem ?

You _still_ have a single namespace; look in /sys/class/net and you'll see.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-20 21:25     ` Al Viro
@ 2006-06-20 22:45       ` Daniel Lezcano
  0 siblings, 0 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-20 22:45 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, netdev, serue, haveblue, clg

Al Viro wrote:
> On Tue, Jun 20, 2006 at 11:21:43PM +0200, Daniel Lezcano wrote:
> 
>>Al Viro wrote:
>>
>>>On Fri, Jun 09, 2006 at 11:02:02PM +0200, dlezcano@fr.ibm.com wrote:
>>>- renaming an interface in one "namespace" affects everyone.
>>
>>Exact. If we ensure the interface can't be renamed if used in different 
>>namespace, is it really a problem ?
> 
> 
> You _still_ have a single namespace; look in /sys/class/net and you'll see.

Yes, that's right. The network device namespaces are not yet
implemented. There are potentially some conflicts with /proc and sysfs,
but we will address them in the future.

BTW, do you have any ideas on how to handle these conflicts?



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-09 21:02 ` [RFC] [patch 2/6] [Network namespace] Network device sharing by view dlezcano
  2006-06-11 10:18   ` Andrew Morton
  2006-06-18 18:53   ` Al Viro
@ 2006-06-26  9:47   ` Andrey Savochkin
  2006-06-26 13:02     ` Herbert Poetzl
  2006-06-26 14:56     ` Daniel Lezcano
  2 siblings, 2 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-26  9:47 UTC (permalink / raw)
  To: dlezcano
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro

Hi Daniel,

It's good that you kicked off the network namespace discussion.
Although I wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).

Indeed, the first point to agree on in this discussion is the device list.
In your patch, you essentially introduce a data structure parallel
to the main device list, creating a "view" of this list.
I see a fundamental problem with this approach.
When a device presents an skb to the protocol layer, it needs to know to which
namespace this skb belongs.
Otherwise you would never get rid of problems with bind: what to do if device
eth1 is visible in namespace1, namespace2, and root namespace, and each
namespace has a socket bound to 0.0.0.0:80?
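
Without a namespace on the skb or in the socket lookup, nothing
distinguishes the three listeners. The udp.c hunk earlier in this series
relaxes today's bind-conflict check by also requiring the namespaces to
match (sk2->sk_net_ns == sk->sk_net_ns), which is exactly why the
receive-side lookup must then carry the namespace. A small user-space
illustration of the current single-namespace behaviour (port 8080 chosen
only to avoid needing privileges):

	#include <stdio.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <netinet/in.h>

	static void bind_any(unsigned short port)
	{
		struct sockaddr_in sin;
		int s = socket(AF_INET, SOCK_STREAM, 0);

		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_addr.s_addr = INADDR_ANY;	/* 0.0.0.0 */
		sin.sin_port = htons(port);
		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
			perror("bind");			/* second call: EADDRINUSE */
		else
			printf("bound to 0.0.0.0:%d\n", port);
	}

	int main(void)
	{
		bind_any(8080);		/* succeeds */
		bind_any(8080);		/* fails: only one namespace exists here */
		return 0;
	}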

We have to conclude that each device should be visible only in one namespace.
In this case, instead of introducing net_ns_dev and net_ns_dev_list
structures, we can simply have a separate dev_base list head in each namespace.
Moreover, a separate device list in each namespace will be in line with
making namespace isolation complete.  Complete isolation will allow each
namespace to set up its own tun/tap devices and have its own routes,
netfilter tables, and so on.

My follow-up messages will contain the first set of patches with network
namespaces implemented in the same way as network isolation in OpenVZ.
This patchset introduces namespaces for device list and IPv4 FIB/routing.
Two technical issues are omitted to make the patch idea clearer: device moving
between namespaces, and selective routing cache flush + garbage collection.

If this patchset is agreeable, the next patchset will finalize integration
with nsproxy, add namespaces to socket lookup code and neighbour
cache, and introduce a simple device to pass traffic between namespaces.
Then we will turn to less obvious matters including netlink messages,
network statistics, representation of network information in proc and sysfs,
tuning of parameters through sysctl, IPv6 and other protocols, and
per-namespace netfilters.

Best regards
		Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26  9:47   ` Andrey Savochkin
@ 2006-06-26 13:02     ` Herbert Poetzl
  2006-06-26 14:05       ` Eric W. Biederman
  2006-06-26 14:08       ` Andrey Savochkin
  2006-06-26 14:56     ` Daniel Lezcano
  1 sibling, 2 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-26 13:02 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, ebiederm, viro

On Mon, Jun 26, 2006 at 01:47:11PM +0400, Andrey Savochkin wrote:
> Hi Daniel,
> 
> It's good that you kicked off the network namespace discussion. Although I
> wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).

> Indeed, the first point to agree in this discussion is device list. 
> In your patch, you essentially introduce a data structure parallel
> to the main device list, creating a "view" of this list. 

> I see a fundamental problem with this approach. When a device presents
> an skb to the protocol layer, it needs to know to which namespace this
> skb belongs.

> Otherwise you would never get rid of problems with bind: what to do if
> device eth1 is visible in namespace1, namespace2, and root namespace,
> and each namespace has a socket bound to 0.0.0.0:80?

this is something which isn't a fundamental problem at
all, and IMHO there are at least three options here
(probably more)

 - check at 'bind' time if the binding would overlap
   and give the 'proper' error (as it happens right
   now on the host)
   (this is how Linux-VServer currently handles the
   network isolation, and yes, it works quite fine :)

 - allow arbitrary binds and 'tag' the packets according
   to some 'host' policy (e.g. iptables or tc)
   (this is how the Linux-VServer ngnet was designed)

 - deliver packets to _all_ bound sockets/destinations
   (this is probably a more unusable but quite thinkable
   solution)
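
the first of those options amounts to a very small check at bind() time;
a rough sketch of the idea, with a hypothetical, minimal nx_info structure
(illustrative only, not the actual Linux-VServer code):

	struct nx_info {		/* hypothetical stand-in for the guest's network context */
		int nbipv4;		/* number of addresses assigned to the guest */
		unsigned int ipv4[16];	/* the assigned addresses, network byte order */
	};

	/* non-zero if this context may bind to addr; the host (no context)
	 * is unrestricted */
	static int addr_allowed(const struct nx_info *nxi, unsigned int addr)
	{
		int i;

		if (!nxi || addr == 0)	/* 0.0.0.0 stays allowed and is narrowed
					   to the guest's own address set */
			return 1;
		for (i = 0; i < nxi->nbipv4; i++)
			if (nxi->ipv4[i] == addr)
				return 1;
		return 0;
	}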

> We have to conclude that each device should be visible only in one
> namespace. 

I disagree here, especially some supervisor context or
the host context should be able to 'see' and probably
manipulate _all_ of the devices

> In this case, instead of introducing net_ns_dev and net_ns_dev_list
> structures, we can simply have a separate dev_base list head in each
> namespace. Moreover, separate device list in each namespace will be in
> line with making namespace isolation complete. 

> Complete isolation will allow each namespace to set up own tun/tap
> devices, have own routes, netfilter tables, and so on.

tun/tap devices are quite possible with this approach
too, I see no problem here ...

for iptables and routes, I'm worried about the required
'policy' to make them secure, i.e. how do you ensure
that the packets 'leaving' guest X do not contain
'evil' packets and/or disrupt your host system?

> My follow-up messages will contain the first set of patches with
> network namespaces implemented in the same way as network isolation 
> in OpenVZ. 

hmm, you probably mean 'network virtualization' here

> This patchset introduces namespaces for device list and IPv4
> FIB/routing. Two technical issues are omitted to make the patch idea
> clearer: device moving between namespaces, and selective routing cache
> flush + garbage collection.
>
> If this patchset is agreeable, the next patchset will finalize
> integration with nsproxy, add namespaces to socket lookup code and
> neighbour cache, and introduce a simple device to pass traffic between
> namespaces.

passing traffic 'between' namespaces should happen via
lo, no? what kind of 'device' is required there, and
what overhead does it add to the networking?

TIA,
Herbert

> Then we will turn to less obvious matters including
> netlink messages, network statistics, representation of network
> information in proc and sysfs, tuning of parameters through sysctl,
> IPv6 and other protocols, and per-namespace netfilters.
> 
> Best regards
> 		Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 13:02     ` Herbert Poetzl
@ 2006-06-26 14:05       ` Eric W. Biederman
  2006-06-26 14:08       ` Andrey Savochkin
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 14:05 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, viro

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Mon, Jun 26, 2006 at 01:47:11PM +0400, Andrey Savochkin wrote:
>> Hi Daniel,
>> 
>> It's good that you kicked off the network namespace discussion. Although I
>> wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).
>
>> Indeed, the first point to agree in this discussion is device list. 
>> In your patch, you essentially introduce a data structure parallel
>> to the main device list, creating a "view" of this list. 
>
>> I see a fundamental problem with this approach. When a device presents
>> an skb to the protocol layer, it needs to know to which namespace this
>> skb belongs.
>
>> Otherwise you would never get rid of problems with bind: what to do if
>> device eth1 is visible in namespace1, namespace2, and root namespace,
>> and each namespace has a socket bound to 0.0.0.0:80?
>
> this is something which isn't a fundamental problem at
> all, and IMHO there are at least three options here
> (probably more)

I agree that there are other implementations that can be used for
containers.  However when you think namespaces this is what you need.

For several reasons.
1) So you can use AF_PACKET safely.
   This allows a network namespace to use DHCP and all of the other
   usual network autoconfiguration tools.  0.0.0.0:80 is just
   a special subset of that.

2) It means the existing network stack can be used without
   logic changes.  All that is needed is a lookup of the appropriate
   context.  This is very straight forward to audit.

3) Since all of the network stack is trivially available all of
   the advanced network stack features like iptables are easily
   available.

4) There is no retraining or other rules for user to learn.
   Because people understand what is going on it is more likely
   a setup will be secure.  Most of the other implementations
   don't quite act like a normal network setup and the special
   rules can be hard to learn.

>  - check at 'bind' time if the binding would overlap
>    and give the 'proper' error (as it happens right
>    now on the host)
>    (this is how Linux-VServer currently handles the
>    network isolation, and yes, it works quite fine :)

It works yes but it limits you to a subset of the network
stack.   And has serious problems with concepts like INADDR_ANY.
PF_PACKET is not an option.

>  - allow arbitrary binds and 'tag' the packets according
>    to some 'host' policy (e.g. iptables or tc)
>    (this is how the Linux-VServer ngnet was designed)

A little more general but very weird.

>  - deliver packets to _all_ bound sockets/destinations
>    (this is probably a more unusable but quite thinkable
>    solution)
>
>> We have to conclude that each device should be visible only in one
>> namespace. 
>
> I disagree here, especially some supervisor context or
> the host context should be able to 'see' and probably
> manipulate _all_ of the devices

This part really is necessary.  This does not preclude managing
a network namespace from outside of the namespace.

>> In this case, instead of introducing net_ns_dev and net_ns_dev_list
>> structures, we can simply have a separate dev_base list head in each
>> namespace. Moreover, separate device list in each namespace will be in
>> line with making namespace isolation complete. 
>
>> Complete isolation will allow each namespace to set up own tun/tap
>> devices, have own routes, netfilter tables, and so on.
>
> tun/tap devices are quite possible with this approach
> too, I see no problem here ...
>
> for iptables and routes, I'm worried about the required
> 'policy' to make them secure, i.e. how do you ensure
> that the packets 'leaving' guest X do not contain
> 'evil' packets and/or disrupt your host system?

In the traditional ways.  When you control the router and/or the switch
someone is directly connected to.

We don't need to reinvent the wheel if we do this properly.

>> This patchset introduces namespaces for device list and IPv4
>> FIB/routing. Two technical issues are omitted to make the patch idea
>> clearer: device moving between namespaces, and selective routing cache
>> flush + garbage collection.
>>
>> If this patchset is agreeable, the next patchset will finalize
>> integration with nsproxy, add namespaces to socket lookup code and
>> neighbour cache, and introduce a simple device to pass traffic between
>> namespaces.
>
> passing traffic 'between' namespaces should happen via
> lo, no? what kind of 'device' is required there, and
> what overhead does it add to the networking?

Definitely not.  lo is a local loopback interface.

What is needed is a two-headed device that is the cousin of lo,
but with one network interface in each network namespace.

Note that even connecting network namespaces is optional.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 13:02     ` Herbert Poetzl
  2006-06-26 14:05       ` Eric W. Biederman
@ 2006-06-26 14:08       ` Andrey Savochkin
  2006-06-26 18:28         ` Herbert Poetzl
  1 sibling, 1 reply; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-26 14:08 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, ebiederm, viro, Alexey Kuznetsov

Hi Herbert,

On Mon, Jun 26, 2006 at 03:02:03PM +0200, Herbert Poetzl wrote:
> On Mon, Jun 26, 2006 at 01:47:11PM +0400, Andrey Savochkin wrote:
> 
> > I see a fundamental problem with this approach. When a device presents
> > an skb to the protocol layer, it needs to know to which namespace this
> > skb belongs.
> 
> > Otherwise you would never get rid of problems with bind: what to do if
> > device eth1 is visible in namespace1, namespace2, and root namespace,
> > and each namespace has a socket bound to 0.0.0.0:80?
> 
> this is something which isn't a fundamental problem at
> all, and IMHO there are at least three options here
> (probably more)
> 
>  - check at 'bind' time if the binding would overlap
>    and give the 'proper' error (as it happens right
>    now on the host)
>    (this is how Linux-VServer currently handles the
>    network isolation, and yes, it works quite fine :)

I'm not comfortable with this as a permanent mainstream solution.
It means that network namespaces are actually not namespaces: you can't run
some program (e.g., apache) with default configs in a new namespace without
regards to who runs what in other namespaces.
In other words, name "0.0.0.0:80" creates a collision in your implementation,
so socket "names" do not form isolated spaces.

> 
>  - allow arbitrary binds and 'tag' the packets according
>    to some 'host' policy (e.g. iptables or tc)
>    (this is how the Linux-VServer ngnet was designed)
> 
>  - deliver packets to _all_ bound sockets/destinations
>    (this is probably a more unusable but quite thinkable
>    solution)

Deliver TCP packets to all sockets?
How many connections do you expect to be established in this case?

> 
> > We have to conclude that each device should be visible only in one
> > namespace. 
> 
> I disagree here, especially some supervisor context or
> the host context should be able to 'see' and probably
> manipulate _all_ of the devices

Right, manipulating all devices from some supervisor context is useful.

But this shouldn't necessarily be done by regular ip/ifconfig tools.
Besides, it could be quite confusing if in ifconfig output in the
supervisor context you see 325 "tun0" devices coming from
different namespaces :)

So I'm all for network namespace management mechanisms not bound
to existing tools/APIs.

> 
> > Complete isolation will allow each namespace to set up own tun/tap
> > devices, have own routes, netfilter tables, and so on.
> 
> tun/tap devices are quite possible with this approach
> too, I see no problem here ...
> 
> for iptables and routes, I'm worried about the required
> 'policy' to make them secure, i.e. how do you ensure
> that the packets 'leaving' guest X do not contain
> 'evil' packets and/or disrupt your host system?

Sorry, I don't get your point.
How do you ensure that packets leaving your neighbor's computer
do not disrupt your system?
From my point of view, network namespaces are just neighbors.

> 
> > My follow-up messages will contain the first set of patches with
> > network namespaces implemented in the same way as network isolation 
> > in OpenVZ. 
> 
> hmm, you probably mean 'network virtualization' here

I meant isolation between different network contexts/namespaces.

> 
> > This patchset introduces namespaces for device list and IPv4
> > FIB/routing. Two technical issues are omitted to make the patch idea
> > clearer: device moving between namespaces, and selective routing cache
> > flush + garbage collection.
> >
> > If this patchset is agreeable, the next patchset will finalize
> > integration with nsproxy, add namespaces to socket lookup code and
> > neighbour cache, and introduce a simple device to pass traffic between
> > namespaces.
> 
> passing traffic 'between' namespaces should happen via
> lo, no? what kind of 'device' is required there, and
> what overhead does it add to the networking?

OpenVZ provides 2 options.

 1) A packet appears right inside some namespace, without any additional
    overhead.  Usually this implies that either all packets from this device
    belong to this namespace, i.e. simple device->namespace assignment.
    However, there is nothing conceptually wrong with having
    namespace-aware device drivers or netfilter modules selecting namespaces
    for each incoming packet.  It all depends on how you want packets go
    through various network layers, and how much network management abilities
    you want to have in non-root namespaces.
    My point is that for network namespaces to be real namespaces, decision
    making should be done somewhere before socket lookup.

 2) Parent network namespace acts as a router forwarding packets to child
    namespaces.  This scheme is the preferred one in OpenVZ for various
    reasons, most important being the simplicity of migration of network
    namespaces.  In this case flexibility has the cost of going through
    packet handling layers two times.
    Technically, this is implemented via a simple netdevice doing
    netif_rx in hard_xmit.
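
    A sketch of what the transmit routine of such a pass-through device
    might look like on a 2.6.17-era kernel; the peer pointer kept in
    dev->priv and the net_ns fields on the skb and device are assumptions
    taken from this discussion and patchset, not an existing mainline
    driver:

	#include <linux/netdevice.h>
	#include <linux/etherdevice.h>
	#include <linux/skbuff.h>

	static int ns_pair_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct net_device *peer = dev->priv;	/* the device living in the other namespace */

		skb_orphan(skb);			/* detach from the sending socket, as loopback does */
		skb->protocol = eth_type_trans(skb, peer); /* parse/pull the link header for the receive side */
		skb->dev = peer;			/* the packet now "arrives" on the peer */
		skb->net_ns = peer->net_ns;		/* retag with the destination namespace */
		netif_rx(skb);				/* back into the receive path: the second traversal */
		return 0;
	}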

Regards

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26  9:47   ` Andrey Savochkin
  2006-06-26 13:02     ` Herbert Poetzl
@ 2006-06-26 14:56     ` Daniel Lezcano
  2006-06-26 15:21       ` Eric W. Biederman
  2006-06-26 15:27       ` Andrey Savochkin
  1 sibling, 2 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-26 14:56 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro

Andrey Savochkin wrote:
> Hi Daniel,

Hi Andrey,

> 
> It's good that you kicked off network namespace discussion.
> Although I wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).

devel@openvz.org ?

> When a device presents an skb to the protocol layer, it needs to know to which
> namespace this skb belongs.
> Otherwise you would never get rid of problems with bind: what to do if device
> eth1 is visible in namespace1, namespace2, and root namespace, and each
> namespace has a socket bound to 0.0.0.0:80?

Exactly. But the idea was to retrieve the namespace from the routes.

IMHO, there are roughly two network isolation implementations:

	- make all network resources private to the namespace

	- keep a "flat" model where network resources have a new identifier,
which is the network namespace pointer. The idea is to make only some
network information (e.g. port range, stats, ...) private to the namespace


   Daniel.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 14:56     ` Daniel Lezcano
@ 2006-06-26 15:21       ` Eric W. Biederman
  2006-06-26 15:27       ` Andrey Savochkin
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 15:21 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, herbert, devel, sam, viro

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

> Andrey Savochkin wrote:
>> Hi Daniel,
>
> Hi Andrey,
>
>> It's good that you kicked off network namespace discussion.
>> Although I wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).
>
> devel@openvz.org ?
>
>> When a device presents an skb to the protocol layer, it needs to know to which
>> namespace this skb belongs.
>> Otherwise you would never get rid of problems with bind: what to do if device
>> eth1 is visible in namespace1, namespace2, and root namespace, and each
>> namespace has a socket bound to 0.0.0.0:80?
>
> Exact. But, the idea was to retrieve the namespace from the routes.

The problem is that if you start at the routes you have to do things
at layer 3 and you can't do anything at layer 2.  (i.e. You can't use DHCP).
You lose a whole lot of flexibility and power when you make it a
layer 3 only mechanism.

> IMHO, I think there are roughly 2 network isolation implementation:
>
> 	- make all network ressources private to the namespace
>
> 	- keep a "flat" model where network ressources have a new identifier
> which is the network namespace pointer. The idea is to move only some network
> informations private to the namespace (eg port range, stats, ...)

The problem is that you have to add a lot of new logic which is very
hard to get right and has some really weird corner cases that are very
hard to understand.  

- That makes the patches hard to review.  
- It makes it hard for the implementors to get it right.
- It means that there will be corner cases that the users don't
  understand.
- It is less flexible/powerful in what you can express.

I've been down that route; it sucks.  Anything more than the simple
filter at bind time is asking for real trouble until you do the whole
thing.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 14:56     ` Daniel Lezcano
  2006-06-26 15:21       ` Eric W. Biederman
@ 2006-06-26 15:27       ` Andrey Savochkin
  2006-06-26 15:49         ` Daniel Lezcano
  1 sibling, 1 reply; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-26 15:27 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov

Daniel,

On Mon, Jun 26, 2006 at 04:56:32PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > 
> > It's good that you kicked off network namespace discussion.
> > Although I wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).
> 
> devel@openvz.org ?

devel@openvz.org is fine

> 
> > When a device presents an skb to the protocol layer, it needs to know to which
> > namespace this skb belongs.
> > Otherwise you would never get rid of problems with bind: what to do if device
> > eth1 is visible in namespace1, namespace2, and root namespace, and each
> > namespace has a socket bound to 0.0.0.0:80?
> 
> Exact. But, the idea was to retrieve the namespace from the routes.

Then you lose the ability for each namespace to have its own routing entries,
which implies that you'll have difficulties with devices that should exist
and be visible in one namespace only (like tunnels), as they require IP
addresses and routes.

> 
> IMHO, I think there are roughly 2 network isolation implementation:
> 
> 	- make all network ressources private to the namespace
> 
> 	- keep a "flat" model where network ressources have a new identifier 
> which is the network namespace pointer. The idea is to move only some 
> network informations private to the namespace (eg port range, stats, ...)

Sorry, I don't get the second idea with only some information private to
namespace.

How do you want TCP_INC_STATS macro look?
In my concept, it would be something like
#define TCP_INC_STATS(field) SNMP_INC_STATS(current_net_ns->tcp_stat, field)
where tcp_stat is a TCP statistics array inside net_namespace.

Regards

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 15:27       ` Andrey Savochkin
@ 2006-06-26 15:49         ` Daniel Lezcano
  2006-06-26 16:40           ` Eric W. Biederman
  2006-06-27  9:11           ` Andrey Savochkin
  0 siblings, 2 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-26 15:49 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov


> Then you lose the ability for each namespace to have its own routing entries.
> Which implies that you'll have difficulties with devices that should exist
> and be visible in one namespace only (like tunnels), as they require IP
> addresses and route.

I mean that instead of having the route tables private to the namespace, the
routes carry the information about which namespace they are associated with.

> 
>	- keep a "flat" model where network ressources have a new identifier 
>>which is the network namespace pointer. The idea is to move only some 
>>network informations private to the namespace (eg port range, stats, ...)
> 
> 
> Sorry, I don't get the second idea with only some information private to
> namespace.
> 
> How do you want TCP_INC_STATS macro look?

I was thinking of TCP_INC_STATS(net_ns, field) expanding to
SNMP_INC_STATS(net_ns->tcp_stat, field).

> In my concept, it would be something like
> #define TCP_INC_STATS(field) SNMP_INC_STATS(current_net_ns->tcp_stat, field)
> where tcp_stat is a TCP statistics array inside net_namespace.
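
Side by side, the two variants under discussion would look roughly like
this; SNMP_INC_STATS is the existing kernel macro, while tcp_stat as a
per-namespace member and both TCP_INC_STATS shapes are proposals from this
thread, shown behind an #ifdef purely so the alternatives fit in one
snippet:

	#ifdef NET_NS_IMPLICIT
	/* namespace taken implicitly from the current context */
	#define TCP_INC_STATS(field) \
		SNMP_INC_STATS(current_net_ns->tcp_stat, field)
	#else
	/* namespace passed explicitly by every caller */
	#define TCP_INC_STATS(net_ns, field) \
		SNMP_INC_STATS((net_ns)->tcp_stat, field)
	#endif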


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 15:49         ` Daniel Lezcano
@ 2006-06-26 16:40           ` Eric W. Biederman
  2006-06-26 18:36             ` Herbert Poetzl
  2006-06-27  9:11           ` Andrey Savochkin
  1 sibling, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 16:40 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, herbert, devel, sam, ebiederm, viro,
	Alexey Kuznetsov

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

>> Then you lose the ability for each namespace to have its own routing entries.
>> Which implies that you'll have difficulties with devices that should exist
>> and be visible in one namespace only (like tunnels), as they require IP
>> addresses and route.
>
> I mean instead of having the route tables private to the namespace, the routes
> have the information to which namespace they are associated.

Is this an implementation difference or is this a user visible difference?
As an implementation difference this is sensible, as it is pretty insane
to allocate hash tables at run time.

As a user visible difference that affects semantics of the operations
this is not something we want to do.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 14:08       ` Andrey Savochkin
@ 2006-06-26 18:28         ` Herbert Poetzl
  2006-06-26 18:59           ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-26 18:28 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, ebiederm, viro, Alexey Kuznetsov

On Mon, Jun 26, 2006 at 06:08:03PM +0400, Andrey Savochkin wrote:
> Hi Herbert,
> 
> On Mon, Jun 26, 2006 at 03:02:03PM +0200, Herbert Poetzl wrote:
> > On Mon, Jun 26, 2006 at 01:47:11PM +0400, Andrey Savochkin wrote:
> > 
> > > I see a fundamental problem with this approach. When a device
> > > presents an skb to the protocol layer, it needs to know to which
> > > namespace this skb belongs.
> >
> > > Otherwise you would never get rid of problems with bind: what to
> > > do if device eth1 is visible in namespace1, namespace2, and root
> > > namespace, and each namespace has a socket bound to 0.0.0.0:80?
> > 
> > this is something which isn't a fundamental problem at
> > all, and IMHO there are at least three options here
> > (probably more)
> > 
> >  - check at 'bind' time if the binding would overlap
> >    and give the 'proper' error (as it happens right
> >    now on the host)
> >    (this is how Linux-VServer currently handles the
> >    network isolation, and yes, it works quite fine :)
> 
> I'm not comfortable with this as a permanent mainstream solution.
> It means that network namespaces are actually not namespaces: you
> can't run some program (e.g., apache) with default configs in a new
> namespace without regards to who runs what in other namespaces.

not at all, maybe you should take a closer look at the
current Linux-VServer implementation, which is quite
simple and _does_ allow guests to bind to IP_ANY quite
fine; only the host (which has all privileges) has to
be careful with binding to 0.0.0.0 ...

> In other words, name "0.0.0.0:80" creates a collision in your
> implementation, so socket "names" do not form isolated spaces.
> 
> > 
> >  - allow arbitrary binds and 'tag' the packets according
> >    to some 'host' policy (e.g. iptables or tc)
> >    (this is how the Linux-VServer ngnet was designed)
> > 
> >  - deliver packets to _all_ bound sockets/destinations
> >    (this is probably a more unusable but quite thinkable
> >    solution)
> 
> Deliver TCP packets to all sockets?
> How many connections do you expect to be established in this case?

well, roughly the same number of connections you'll
get when you have two boxes with the same IP on the
same subnet :)

in other words, if there is more than one guest
with the same IP and port open, then we have some
kind of misconfiguration (i.e. policy is required)

> > > We have to conclude that each device should be visible only in one
> > > namespace.
> > 
> > I disagree here, especially some supervisor context or
> > the host context should be able to 'see' and probably
> > manipulate _all_ of the devices
> 
> Right, manipulating all devices from some supervisor context is useful.
> 
> But this shouldn't necessarily be done by regular ip/ifconfig tools.
> Besides, it could be quite confusing if in ifconfig output in the
> supervisor context you see 325 "tun0" devices coming from
> different namespaces :)

isolation would not provide more than _one_ tun0
interface; virtualization OTOH will ...

> So I'm all for network namespace management mechanisms not bound
> to existing tools/APIs.

well, I'm not against new APIs/tools, but I prefer
to keep it simple, and elegant, which often includes
reusing existing APIs and tools ...

> > > Complete isolation will allow each namespace to set up own tun/tap
> > > devices, have own routes, netfilter tables, and so on.
> > 
> > tun/tap devices are quite possible with this approach
> > too, I see no problem here ...
> > 
> > for iptables and routes, I'm worried about the required
> > 'policy' to make them secure, i.e. how do you ensure
> > that the packets 'leaving' guest X do not contain
> > 'evil' packets and/or disrupt your host system?
> 
> Sorry, I don't get your point.
> How do you ensure that packets leaving your neighbor's computer
> do not disrupt your system?

by having a strong 'policy' on the router/switch
which will (hopefully) reject everything sent in error
or to disrupt/harm other boxes ...

> From my point of view, network namespaces are just neighbors.

yes, but you _need_ policy there the same way you need
it for resources, i.e. you cannot simply allow everyone
to do everything with his network interface, especially
if that interface is _shared_ with all others ...


> > > My follow-up messages will contain the first set of patches with
> > > network namespaces implemented in the same way as network isolation 
> > > in OpenVZ. 
> > 
> > hmm, you probably mean 'network virtualization' here
> 
> I meant isolation between different network contexts/namespaces.

well, isolation is basically what we do in Linux-VServer
by allowing binds to certain IPs (or ranges) instead
of binding to _all_ available IPs ... this can be extended
to routing and iptables as well, and does not require
any 'virtualization' which would give each guest its own
set of interfaces, routes, iptables etc ... and it is
usually more lightweight too ...

> > > This patchset introduces namespaces for device list and IPv4
> > > FIB/routing. Two technical issues are omitted to make the patch
> > > idea clearer: device moving between namespaces, and selective
> > > routing cache flush + garbage collection.
> > >
> > > If this patchset is agreeable, the next patchset will finalize
> > > integration with nsproxy, add namespaces to socket lookup code and
> > > neighbour cache, and introduce a simple device to pass traffic
> > > between namespaces.
> > 
> > passing traffic 'between' namespaces should happen via
> > lo, no? what kind of 'device' is required there, and
> > what overhead does it add to the networking?
> 
> OpenVZ provides 2 options.
> 
>  1) A packet appears right inside some namespace, without any additional
>     overhead. Usually this implies that either all packets from this
>     device belong to this namespace, i.e. simple device->namespace
>     assignment. However, there is nothing conceptually wrong with
>     having namespace-aware device drivers or netfilter modules
>     selecting namespaces for each incoming packet. It all depends on
>     how you want packets go through various network layers, and how
>     much network management abilities you want to have in non-root
>     namespaces. My point is that for network namespaces being real
>     namespaces, decision making should be done somewhere before socket
>     lookup.

well, I doubt that many providers will be able to put
roughly a hundred or more network interface cards into
their machines, plus a proper switch to do the policy :)

>  2) Parent network namespace acts as a router forwarding packets to child
>     namespaces.  This scheme is the preferred one in OpenVZ for various
>     reasons, most important being the simplicity of migration of network
>     namespaces.  In this case flexibility has the cost of going through
>     packet handling layers two times.

>     Technically, this is implemented via a simple netdevice doing
>     netif_rx in hard_xmit.

which results in a duplicate stack traversal and kernel-side
policy to decide which goes where ... i.e. at least
twice as much overhead as any isolation would have

best,
Herbert

> Regards
> 
> Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 16:40           ` Eric W. Biederman
@ 2006-06-26 18:36             ` Herbert Poetzl
  2006-06-26 19:35               ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-26 18:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

On Mon, Jun 26, 2006 at 10:40:59AM -0600, Eric W. Biederman wrote:
> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
> 
> >> Then you lose the ability for each namespace to have its own
> >> routing entries. Which implies that you'll have difficulties with
> >> devices that should exist and be visible in one namespace only
> >> (like tunnels), as they require IP addresses and route.
> >
> > I mean instead of having the route tables private to the namespace, the routes
> > have the information to which namespace they are associated.
> 
> Is this an implementation difference or is this a user visible
> difference? As an implementation difference this is sensible, as it is
> pretty insane to allocate hash tables at run time.
>
> As a user visible difference that affects semantics of the operations
> this is not something we want to do.

well, I guess there are even more options here, for
example I'd like to propose the following idea, which
might be a viable solution for the policy/isolation
problem, with the actual overhead on the setup part
not the hot paths for packet and connection handling

we could use the multiple routing tables to provide
a single routing table for each guest, which could
be used inside the guest to add arbitrary routes, but
would allow to keep the 'main' policy on the host, by
selecting the proper table based on IPs and guest tags

similarly we could allow to have a separate iptables
chain for each guest (or several chains), which are
once again directed by the host system (applying the
required policy) which can be managed and configured
via normal iptables interfaces (both on the guest and
host) but actually provide at least two layers

note: this does not work for hierarchical network
contexts, but I do not see that the yet proposed
implementations would do, so I do not think that
is of concern here ...

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 18:28         ` Herbert Poetzl
@ 2006-06-26 18:59           ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 18:59 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: dlezcano, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, ebiederm, viro, Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Mon, Jun 26, 2006 at 06:08:03PM +0400, Andrey Savochkin wrote:
>
> not at all, maybe you should take a closer look at the
> current Linux-VServer implementation, which is quite
> simple and _does_ allow guests to bind to IP_ANY quite
> fine, only the host (which has all privileges) has to
> be careful with binding to 0.0.0.0 ...

It works, and is a reasonable implementation.  However the
semantics change.

The real practical problem is that you lose power, and
the ability to migrate applications.  Not that this precludes
you from loading a security module and doing what you do now.

>> > > We have to conclude that each device should be visible only in one
>> > > namespace.
>> > 
>> > I disagree here, especially some supervisor context or
>> > the host context should be able to 'see' and probably
>> > manipulate _all_ of the devices
>> 
>> Right, manipulating all devices from some supervisor context is useful.
>> 
>> But this shouldn't necessarily be done by regular ip/ifconfig tools.
>> Besides, it could be quite confusing if in ifconfig output in the
>> supervisor context you see 325 "tun0" devices coming from
>> different namespaces :)
>
> isolation would not provide more than _one_ tun0
> interfaces, virtualization OTOH will ...

Think layer 2 isolation not layer 3 isolation.

>> So I'm all for network namespace management mechanisms not bound
>> to existing tools/APIs.
>
> well, I'm not against new APIs/tools, but I prefer
> to keep it simple, and elegant, which often includes
> reusing existing APIs and tools ...

And knowledge.  Except for the single-IP-per-guest case, filtering
at BIND time starts to show some surprising semantics.

> by having a strong 'policy' on the router/switch
> which will (hopefully) reject everything sent in error
> or to disrupt/harm other boxes ...

And Linux has software router and switch capabilities, so
those can easily be used unmodified.

>> From my point of view, network namespaces are just neighbors.
>
> yes, but you _need_ policy there the same way you need
> it for resources, i.e. you cannot simply allow everyone
> to do everything with his network interface, especially
> if that interface is _shared_ with all others ...

Agreed.  And the network stack seems to have a perfectly good
set of utilities to handle that already. 

>
> well, isolation is basically what we do in Linux-VServer
> by allowing to bind to certain IPs (or ranges) instead
> of binding _all_ available IPs ... this can be extended
> for routing and iptables as well, and does not require
> any 'virtualization' which would give each guest its own
> set of interfaces, routes, iptables etc ... and it is
> usually more lightweight too ..

I disagree with the cost.  Done properly we should have the cost
of the existing networking stack plus the cost of an extra pointer
dereference when we look at global variables.
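
A minimal user-space sketch of what that extra dereference amounts to
(the names net_ns, ip_forward and current_ns are invented for the
example, not taken from any of the patches):

#include <stdio.h>

/* Hypothetical per-namespace state: fields that used to be globals. */
struct net_ns {
	int ip_forward;                 /* e.g. what used to be a global sysctl */
	unsigned long rt_hash_size;
};

/* In the kernel this would hang off the current task; here it is a pointer. */
static struct net_ns init_ns = { .ip_forward = 1, .rt_hash_size = 256 };
static struct net_ns *current_ns = &init_ns;

/* The cost under discussion: one extra pointer dereference per access. */
static int ns_ip_forward(void)
{
	return current_ns->ip_forward;
}

int main(void)
{
	printf("ip_forward in this namespace: %d\n", ns_ip_forward());
	return 0;
}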

This is layer 2 isolation.  So we can use protocols like DHCP,
unmodified.

In the normally accepted definition it isn't virtualization because
we aren't emulating anything.

>> > > This patchset introduces namespaces for device list and IPv4
>> > > FIB/routing. Two technical issues are omitted to make the patch
>> > > idea clearer: device moving between namespaces, and selective
>> > > routing cache flush + garbage collection.
>> > >
>> > > If this patchset is agreeable, the next patchset will finalize
>> > > integration with nsproxy, add namespaces to socket lookup code and
>> > > neighbour cache, and introduce a simple device to pass traffic
>> > > between namespaces.
>> > 
>> > passing traffic 'between' namespaces should happen via
>> > lo, no? what kind of 'device' is required there, and
>> > what overhead does it add to the networking?
>> 
>> OpenVZ provides 2 options.
>> 
>>  1) A packet appears right inside some namespace, without any additional
>>     overhead. Usually this implies that either all packets from this
>>     device belong to this namespace, i.e. simple device->namespace
>>     assignment. However, there is nothing conceptually wrong with
>>     having namespace-aware device drivers or netfilter modules
>>     selecting namespaces for each incoming packet. It all depends on
>>     how you want packets go through various network layers, and how
>>     much network management abilities you want to have in non-root
>>     namespaces. My point is that for network namespaces being real
>>     namespaces, decision making should be done somewhere before socket
>>     lookup.
>
> well, I doubt that many providers will be able to put
> roughly hundred or more network interface cards into
> their machines, plus a proper switch to do the policy :)

Well, switches exist.  But yes, because physical hardware
is limited, this is a limited policy.

>>  2) Parent network namespace acts as a router forwarding packets to child
>>     namespaces.  This scheme is the preferred one in OpenVZ for various
>>     reasons, most important being the simplicity of migration of network
>>     namespaces.  In this case flexibility has the cost of going through
>>     packet handling layers two times.
>
>>     Technically, this is implemented via a simple netdevice doing
>>     netif_rx in hard_xmit.
>
> which results in a duplicate stack traversal and kernel
> side policy to decide which goes where ... i.e. at least
> twice as much overhead than any isolation would have

Not twice because you don't traverse the entire network stack.
Just up to the routing layer and then across, and there are
optimization possibilities that should keep it down to a single
traversal of the network stack.

Note: we are not saying that the Linux-VServer implementation
must die, only that we are solving something with a much larger scope.

If the first case does not at least pass packets as fast as the existing
network stack I doubt we will be allowed to merge it.  By making the nic
drivers smarter we can have a single driver that creates multiple
network interfaces simply by looking at the destination mac address,
sort of like the bonding driver in reverse.  That will trivially
remove the extra network stack traversals if we don't want
to apply policy before we let the packet out on the wire.

And there is no requirement that after the namespace is set up we
leave any applications on the inside with CAP_NET_ADMIN, so we don't
need to worry about user-space applications changing the network
configuration if we don't want to.

Eric


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 18:36             ` Herbert Poetzl
@ 2006-06-26 19:35               ` Eric W. Biederman
  2006-06-26 20:02                 ` Herbert Poetzl
  2006-06-26 22:13                 ` Ben Greear
  0 siblings, 2 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 19:35 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Mon, Jun 26, 2006 at 10:40:59AM -0600, Eric W. Biederman wrote:
>> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
>> 
>> >> Then you lose the ability for each namespace to have its own
>> >> routing entries. Which implies that you'll have difficulties with
>> >> devices that should exist and be visible in one namespace only
>> >> (like tunnels), as they require IP addresses and route.
>> >
>> > I mean instead of having the route tables private to the namespace, the
> routes
>> > have the information to which namespace they are associated.
>> 
>> Is this an implementation difference or is this a user visible
>> difference? As an implementation difference this is sensible, as it is
>> pretty insane to allocate hash tables at run time.
>>
>> As a user visible difference that affects semantics of the operations
>> this is not something we want to do.
>
> well, I guess there are even more options here, for
> example I'd like to propose the following idea, which
> might be a viable solution for the policy/isolation
> problem, with the actual overhead on the setup part
> not the hot paths for packet and connection handling
>
> we could use the multiple routing tables to provide
> a single routing table for each guest, which could
> be used inside the guest to add arbitrary routes, but
> would allow to keep the 'main' policy on the host, by
> selecting the proper table based on IPs and guest tags
>
> similarly we could allow to have a separate iptables
> chain for each guest (or several chains), which are
> once again directed by the host system (applying the
> required policy) which can be managed and configured
> via normal iptables interfaces (both on the guest and
> host) but actually provide at least two layers

I have real concerns about the complexity of the route you
have described.

> note: this does not work for hierarchical network
> contexts, but I do not see that the yet proposed
> implementations would do, so I do not think that
> is of concern here ...

Well we are hierarchical in the sense that a parent
can have a different network namespace than a child.
So recursive containers work fine.  So this is like
the uts namespace or the ipc namespace rather than
like the pid namespace.

I really do not believe we have a hotpath issue, if this
is implemented properly. Benchmarks of course need to be taken,
to prove this.

There are only two places a sane implementation should show issues.
- When the access to a global variable goes through a pointer to find
  that global variable.
- When doing a lookup in a hash table we need to look at an additional
  field to verify a hash match.  Because having a completely separate
  hash table is likely too expensive.

If that can be shown to really slow down packets on the hot path
I am willing to consider other possibilities.  Until then I think
we are on path to the simplest and most powerful version of building
a network namespace usable by containers.
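
As a rough illustration of the second point, here is a toy version of
such a lookup, where walking the bucket compares one extra namespace
field; every name in it is made up for the sketch, and a real version
would live in the socket/route hash code:

#include <stdio.h>
#include <stddef.h>

struct net_ns;                            /* opaque namespace tag */

struct sock_ent {
	struct sock_ent *next;            /* hash chain */
	struct net_ns   *ns;              /* owning namespace */
	unsigned short   port;
};

/* One shared hash chain; the namespace check is the only extra work. */
static struct sock_ent *chain_lookup(struct sock_ent *head,
				     struct net_ns *ns, unsigned short port)
{
	struct sock_ent *e;

	for (e = head; e; e = e->next)
		if (e->port == port && e->ns == ns)   /* the extra compare */
			return e;
	return NULL;
}

int main(void)
{
	struct net_ns *ns_a = (struct net_ns *)0x1;   /* stand-in tags */
	struct net_ns *ns_b = (struct net_ns *)0x2;
	struct sock_ent b = { NULL, ns_b, 80 };
	struct sock_ent a = { &b,   ns_a, 80 };

	printf("match for ns_b: %s\n",
	       chain_lookup(&a, ns_b, 80) == &b ? "yes" : "no");
	return 0;
}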

The routing between network namespaces does have the potential to
be more expensive than just a packet trivially coming off the wire
into a socket.  However that is fundamentally from a lack of hardware.
If the rest works, smarter filters in the drivers should make it possible to
remove the cost.

Basically it is just a matter of:
if (dest_mac == my_mac1) it is for device 1.
If (dest_mac == my_mac2) it is for device 2.
etc.

At a small count of macs it is trivial to understand it will go
fast for a larger count of macs it only works with a good data
structure.  We don't hit any extra cache lines of the packet,
and the above test can be collapsed with other routing lookup tests.
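
A user-space sketch of that dispatch (invented names throughout; a real
version would sit in the driver's receive path and would switch to a
proper lookup structure once the number of MACs grows):

#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct pseudo_netdev {
	unsigned char mac[6];
	const char   *name;
};

static struct pseudo_netdev devs[] = {
	{ { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }, "guest0" },
	{ { 0x02, 0x00, 0x00, 0x00, 0x00, 0x02 }, "guest1" },
};

/* Linear scan is fine for a handful of MACs; use a hash for hundreds. */
static struct pseudo_netdev *demux(const unsigned char *dest_mac)
{
	size_t i;

	for (i = 0; i < sizeof(devs) / sizeof(devs[0]); i++)
		if (memcmp(devs[i].mac, dest_mac, 6) == 0)
			return &devs[i];
	return NULL;                      /* not ours: fall back or drop */
}

int main(void)
{
	unsigned char dst[6] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x02 };
	struct pseudo_netdev *d = demux(dst);

	printf("packet is for %s\n", d ? d->name : "nobody");
	return 0;
}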

Eric


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 19:35               ` Eric W. Biederman
@ 2006-06-26 20:02                 ` Herbert Poetzl
  2006-06-26 20:37                   ` Eric W. Biederman
  2006-06-27  9:09                   ` Andrey Savochkin
  2006-06-26 22:13                 ` Ben Greear
  1 sibling, 2 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-26 20:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

On Mon, Jun 26, 2006 at 01:35:15PM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> > On Mon, Jun 26, 2006 at 10:40:59AM -0600, Eric W. Biederman wrote:
> >> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
> >> 
> >> >> Then you lose the ability for each namespace to have its own
> >> >> routing entries. Which implies that you'll have difficulties with
> >> >> devices that should exist and be visible in one namespace only
> >> >> (like tunnels), as they require IP addresses and route.
> >> >
> >> > I mean instead of having the route tables private to the namespace, the
> > routes
> >> > have the information to which namespace they are associated.
> >> 
> >> Is this an implementation difference or is this a user visible
> >> difference? As an implementation difference this is sensible, as it is
> >> pretty insane to allocate hash tables at run time.
> >>
> >> As a user visible difference that affects semantics of the operations
> >> this is not something we want to do.
> >
> > well, I guess there are even more options here, for
> > example I'd like to propose the following idea, which
> > might be a viable solution for the policy/isolation
> > problem, with the actual overhead on the setup part
> > not the hot paths for packet and connection handling
> >
> > we could use the multiple routing tables to provide
> > a single routing table for each guest, which could
> > be used inside the guest to add arbitrary routes, but
> > would allow to keep the 'main' policy on the host, by
> > selecting the proper table based on IPs and guest tags
> >
> > similarly we could allow to have a separate iptables
> > chain for each guest (or several chains), which are
> > once again directed by the host system (applying the
> > required policy) which can be managed and configured
> > via normal iptables interfaces (both on the guest and
> > host) but actually provide at least two layers
> 
> I have real concerns about the complexity of the route you
> have described.
> 
> > note: this does not work for hierarchical network
> > contexts, but I do not see that the yet proposed
> > implementations would do, so I do not think that
> > is of concern here ...
> 
> Well we are hierarchical in the sense that a parent
> can have a different network namespace than a child.
> So recursive containers work fine.  So this is like
> the uts namespace or the ipc namespace rather than
> like the pid namespace.

yes, but you will not be able to apply policy on
the parent, restricting the child networking in a
proper way without jumping through hoops ...

> I really do not believe we have a hotpath issue, if this
> is implemented properly. Benchmarks of course need to be taken,
> to prove this.

I'm fine with proper testing and good numbers here
but until then, I can only consider it a prototype

> There are only two places a sane implementation should show issues.
> - When the access to a pointer goes through a pointer to find
>   that global variable.
> - When doing a lookup in a hash table we need to look at an additional
>   field to verify a hash match.  Because having a completely separate
>   hash table is likely too expensive.
> 
> If that can be shown to really slow down packets on the hot path I am
> willing to consider other possibilities. Until then I think we are on
> path to the simplest and most powerful version of building a network
> namespace usable by containers.

keep in mind that you actually have three kinds
of network traffic on a typical host/guest system:

 - traffic between unit and outside
   - host traffic should be quite minimal
   - guest traffic will be quite high

 - traffic between host and guest
   probably minimal too (only for shared services)

 - traffic between guests
   can be as high (or even higher) than the
   outbound traffic, just think web guest and
   database guest

> The routing between network namespaces does have the potential to be
> more expensive than just a packet trivially coming off the wire into a
> socket.

IMHO the routing between network namespaces should
not require more than the current local traffic
does (i.e. you should be able to achieve loopback
speed within an insignificant tolerance) and not
nearly the time required for on-wire stuff ...

> However that is fundamentally from a lack of hardware. If the
> rest works smarter filters in the drivers should enable to remove the
> cost.
> 
> Basically it is just a matter of:
> if (dest_mac == my_mac1) it is for device 1.
> If (dest_mac == my_mac2) it is for device 2.
> etc.

hmm, so you plan on providing a different MAC for
each guest? how should that be handled from the
user PoV? you cannot simply make up MACs as you
go, and, depending on the network card, operation
in promisc mode might be slower than for a given
set (maybe only one) MAC, no?

> At a small count of macs it is trivial to understand it will go
> fast for a larger count of macs it only works with a good data
> structure.  We don't hit any extra cache lines of the packet,
> and the above test can be collapsed with other routing lookup tests.

well, I'm absolutely not against flexibility or
full virtualization, but the proposed 'routing'
on the host effectively doubles the time the
packet spends in the network stack(s), so I can
not believe that this approach would not add
(significant) overhead to the hot path ...

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 20:02                 ` Herbert Poetzl
@ 2006-06-26 20:37                   ` Eric W. Biederman
  2006-06-26 21:26                     ` Herbert Poetzl
  2006-06-27  9:09                   ` Andrey Savochkin
  1 sibling, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 20:37 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Mon, Jun 26, 2006 at 01:35:15PM -0600, Eric W. Biederman wrote:
>> Herbert Poetzl <herbert@13thfloor.at> writes:
>> 
>
> yes, but you will not be able to apply policy on
> the parent, restricting the child networking in a
> proper way without jumping through hoops ...

?  I don't understand where you are coming from.
There is no restriction on where you can apply policy.

>> I really do not believe we have a hotpath issue, if this
>> is implemented properly. Benchmarks of course need to be taken,
>> to prove this.
>
> I'm fine with proper testing and good numbers here
> but until then, I can only consider it a prototype

We are taking the first steps to get this all sorted out.
I think what we have is more than a prototype but less than
the final implementation.  Call it the very first draft version.

>> There are only two places a sane implementation should show issues.
>> - When the access to a pointer goes through a pointer to find
>>   that global variable.
>> - When doing a lookup in a hash table we need to look at an additional
>>   field to verify a hash match.  Because having a completely separate
>>   hash table is likely too expensive.
>> 
>> If that can be shown to really slow down packets on the hot path I am
>> willing to consider other possibilities. Until then I think we are on
>> path to the simplest and most powerful version of building a network
>> namespace usable by containers.
>
> keep in mind that you actually have three kinds
> of network traffic on a typical host/guest system:
>
>  - traffic between unit and outside
>    - host traffic should be quite minimal
>    - guest traffic will be quite high
>
>  - traffic between host and guest
>    probably minimal too (only for shared services)
>
>  - traffic between guests
>    can be as high (or even higher) than the
>    outbound traffic, just think web guest and
>    database guest

Interesting.

>> The routing between network namespaces does have the potential to be
>> more expensive than just a packet trivially coming off the wire into a
>> socket.
>
> IMHO the routing between network namespaces should
> not require more than the current local traffic
> does (i.e. you should be able to achieve loopback
> speed within an insignificant tolerance) and not
> nearly the time required for on-wire stuff ...

That assumes on the wire stuff is noticeably slower.
You can achieve over 1GB/s on some networks.

But I agree that the cost should resemble the current
loopback device.  I have seen nothing that suggests
it is not.

>> However that is fundamentally from a lack of hardware. If the
>> rest works smarter filters in the drivers should enable to remove the
>> cost.
>> 
>> Basically it is just a matter of:
>> if (dest_mac == my_mac1) it is for device 1.
>> If (dest_mac == my_mac2) it is for device 2.
>> etc.
>
> hmm, so you plan on providing a different MAC for
> each guest? how should that be handled from the
> user PoV? you cannot simply make up MACs as you
> go, and, depending on the network card, operation
> in promisc mode might be slower than for a given
> set (maybe only one) MAC, no?

The speed is a factor certainly.  As for making up
MACs: there is a local-assignment bit that you can set.
With that set it is just a matter of using a decent random
number generator.  The kernel already does this in some places.
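
A sketch of such a generator (user space, plain rand() for brevity; the
essential part is clearing the group/multicast bit and setting the
locally-administered bit in the first octet -- if I remember right the
kernel's random_ether_addr() helper does much the same):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void random_local_mac(unsigned char mac[6])
{
	int i;

	for (i = 0; i < 6; i++)
		mac[i] = rand() & 0xff;
	mac[0] &= 0xfe;                   /* clear the group (multicast) bit */
	mac[0] |= 0x02;                   /* set the locally-administered bit */
}

int main(void)
{
	unsigned char mac[6];

	srand((unsigned)time(NULL));
	random_local_mac(mac);
	printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
	       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	return 0;
}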

>> At a small count of macs it is trivial to understand it will go
>> fast for a larger count of macs it only works with a good data
>> structure.  We don't hit any extra cache lines of the packet,
>> and the above test can be collapsed with other routing lookup tests.
>
> well, I'm absolutely not against flexibility or
> full virtualization, but the proposed 'routing'
> on the host effectively doubles the time the
> packet spends in the network stack(s), so I can
> not believe that this approach would not add
> (significant) overhead to the hot path ...

It might, but I am pretty certain it won't double
the cost, as you don't do 2 full network stack traversals.
And even at a full doubling I doubt it will affect bandwidth
or latency very much.  If it does we have a lot more to optimize
in the network stack than just this code.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 20:37                   ` Eric W. Biederman
@ 2006-06-26 21:26                     ` Herbert Poetzl
  2006-06-26 21:59                       ` Ben Greear
  2006-06-26 22:11                       ` Eric W. Biederman
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-26 21:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

On Mon, Jun 26, 2006 at 02:37:15PM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> > On Mon, Jun 26, 2006 at 01:35:15PM -0600, Eric W. Biederman wrote:
> >> Herbert Poetzl <herbert@13thfloor.at> writes:
> >
> > yes, but you will not be able to apply policy on
> > the parent, restricting the child networking in a
> > proper way without jumping through hoops ...
> 
> ?  I don't understand where you are coming from.
> There is no restriction on where you can apply policy.

in a fully hierarchical system (not that I really think
this is required here) you would be able to 'delegate'
certain addresses (ranges) to a namespace (the child)
below the current one (the parent) with the ability to
limit/control the input/output (which is required for 
security)

> >> I really do not believe we have a hotpath issue, if this
> >> is implemented properly. Benchmarks of course need to be taken,
> >> to prove this.
> >
> > I'm fine with proper testing and good numbers here
> > but until then, I can only consider it a prototype
> 
> We are taking the first steps to get this all sorted out.
> I think what we have is more than a prototype but less then
> the final implementation.  Call it the very first draft version.

what we are desperately missing here is a proper way
to test this; maybe the network folks can come up
with some test equipment/ideas here ...

> >> There are only two places a sane implementation should show issues.
> >> - When the access to a pointer goes through a pointer to find
> >>   that global variable.
> >> - When doing a lookup in a hash table we need to look at an additional
> >>   field to verify a hash match.  Because having a completely separate
> >>   hash table is likely too expensive.
> >> 
> >> If that can be shown to really slow down packets on the hot path I am
> >> willing to consider other possibilities. Until then I think we are on
> >> path to the simplest and most powerful version of building a network
> >> namespace usable by containers.
> >
> > keep in mind that you actually have three kinds
> > of network traffic on a typical host/guest system:
> >
> >  - traffic between unit and outside
> >    - host traffic should be quite minimal
> >    - guest traffic will be quite high
> >
> >  - traffic between host and guest
> >    probably minimal too (only for shared services)
> >
> >  - traffic between guests
> >    can be as high (or even higher) than the
> >    outbound traffic, just think web guest and
> >    database guest
> 
> Interesting.
> 
> >> The routing between network namespaces does have the potential to be
> >> more expensive than just a packet trivially coming off the wire into a
> >> socket.
> >
> > IMHO the routing between network namespaces should
> > not require more than the current local traffic
> > does (i.e. you should be able to achieve loopback
> > speed within an insignificant tolerance) and not
> > nearly the time required for on-wire stuff ...
> 
> That assumes on the wire stuff is noticeably slower.
> You can achieve over 1GB/s on some networks.

well, have you ever tried how much you can achieve
over loopback :)

> But I agree that the cost should resemble the current
> loopback device.  I have seen nothing that suggests
> it is not.
> 
> >> However that is fundamentally from a lack of hardware. If the
> >> rest works smarter filters in the drivers should enable to remove the
> >> cost.
> >> 
> >> Basically it is just a matter of:
> >> if (dest_mac == my_mac1) it is for device 1.
> >> If (dest_mac == my_mac2) it is for device 2.
> >> etc.
> >
> > hmm, so you plan on providing a different MAC for
> > each guest? how should that be handled from the
> > user PoV? you cannot simply make up MACs as you
> > go, and, depending on the network card, operation
> > in promisc mode might be slower than for a given
> > set (maybe only one) MAC, no?
> 
> The speed is a factor certainly.  As for making up
> macs.  There is a local assignment bit that you can set.

well, local is fine, but you cannot utilize that 
on-wire which basically means that you would have
either to 'map' the MAC on transmission (to the
real one) which would basically make the approach
useless, or to use addresses which are fine within
a certain range of routers ...

> With that set it is just a matter of using a decent random
> number generator.  The kernel already does this is some places.

sure you can make up MACs, but you will never
be able to use them 'outside' the box 

> >> At a small count of macs it is trivial to understand it will go
> >> fast for a larger count of macs it only works with a good data
> >> structure.  We don't hit any extra cache lines of the packet,
> >> and the above test can be collapsed with other routing lookup tests.
> >
> > well, I'm absolutely not against flexibility or
> > full virtualization, but the proposed 'routing'
> > on the host effectively doubles the time the
> > packet spends in the network stack(s), so I can
> > not believe that this approach would not add
> > (significant) overhead to the hot path ...
> 
> It might, but I am pretty certain it won't double
> the cost, as you don't do 2 full network stack traversals.

> And even at a full doubling I doubt it will affect bandwith
> or latency very much.  

well, for loopback that would mean half the bandwidth
and twice the latency, no?

> If it does we have a lot more to optimize in the network stack than
> just this code.

why? duplicate stack traversal takes roughly twice
the time, or am I wrong here? if so, please enlighten
me ...

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 21:26                     ` Herbert Poetzl
@ 2006-06-26 21:59                       ` Ben Greear
  2006-06-26 22:11                       ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Ben Greear @ 2006-06-26 21:59 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	devel, sam, viro, Alexey Kuznetsov

Herbert Poetzl wrote:
> On Mon, Jun 26, 2006 at 02:37:15PM -0600, Eric W. Biederman wrote:
> 
>>Herbert Poetzl <herbert@13thfloor.at> writes:
>>
>>
>>>On Mon, Jun 26, 2006 at 01:35:15PM -0600, Eric W. Biederman wrote:
>>>
>>>>Herbert Poetzl <herbert@13thfloor.at> writes:
>>>
>>>yes, but you will not be able to apply policy on
>>>the parent, restricting the child networking in a
>>>proper way without jumping through hoops ...
>>
>>?  I don't understand where you are coming from.
>>There is no restriction on where you can apply policy.
> 
> 
> in a fully hierarchical system (not that I really think
> this is required here) you would be able to 'delegate'
> certain addresses (ranges) to a namespace (the child)
> below the current one (the parent) with the ability to
> limit/control the input/output (which is required for 
> security)
> 
> 
>>>>I really do not believe we have a hotpath issue, if this
>>>>is implemented properly. Benchmarks of course need to be taken,
>>>>to prove this.
>>>
>>>I'm fine with proper testing and good numbers here
>>>but until then, I can only consider it a prototype
>>
>>We are taking the first steps to get this all sorted out.
>>I think what we have is more than a prototype but less then
>>the final implementation.  Call it the very first draft version.
> 
> 
> what we are desperately missing here is a proper way
> to testing this, maybe the network folks can come up
> with some test equipment/ideas here ...
> 
> 
>>>>There are only two places a sane implementation should show issues.
>>>>- When the access to a pointer goes through a pointer to find
>>>>  that global variable.
>>>>- When doing a lookup in a hash table we need to look at an additional
>>>>  field to verify a hash match.  Because having a completely separate
>>>>  hash table is likely too expensive.
>>>>
>>>>If that can be shown to really slow down packets on the hot path I am
>>>>willing to consider other possibilities. Until then I think we are on
>>>>path to the simplest and most powerful version of building a network
>>>>namespace usable by containers.
>>>
>>>keep in mind that you actually have three kinds
>>>of network traffic on a typical host/guest system:
>>>
>>> - traffic between unit and outside
>>>   - host traffic should be quite minimal
>>>   - guest traffic will be quite high
>>>
>>> - traffic between host and guest
>>>   probably minimal too (only for shared services)
>>>
>>> - traffic between guests
>>>   can be as high (or even higher) than the
>>>   outbound traffic, just think web guest and
>>>   database guest
>>
>>Interesting.
>>
>>
>>>>The routing between network namespaces does have the potential to be
>>>>more expensive than just a packet trivially coming off the wire into a
>>>>socket.
>>>
>>>IMHO the routing between network namespaces should
>>>not require more than the current local traffic
>>>does (i.e. you should be able to achieve loopback
>>>speed within an insignificant tolerance) and not
>>>nearly the time required for on-wire stuff ...
>>
>>That assumes on the wire stuff is noticeably slower.
>>You can achieve over 1GB/s on some networks.
> 
> 
> well, have you ever tried how much you can achieve
> over loopback :)
> 
> 
>>But I agree that the cost should resemble the current
>>loopback device.  I have seen nothing that suggests
>>it is not.
>>
>>
>>>>However that is fundamentally from a lack of hardware. If the
>>>>rest works smarter filters in the drivers should enable to remove the
>>>>cost.
>>>>
>>>>Basically it is just a matter of:
>>>>if (dest_mac == my_mac1) it is for device 1.
>>>>If (dest_mac == my_mac2) it is for device 2.
>>>>etc.
>>>
>>>hmm, so you plan on providing a different MAC for
>>>each guest? how should that be handled from the
>>>user PoV? you cannot simply make up MACs as you
>>>go, and, depending on the network card, operation
>>>in promisc mode might be slower than for a given
>>>set (maybe only one) MAC, no?

> well, local is fine, but you cannot utilize that 
> on-wire which basically means that you would have
> either to 'map' the MAC on transmission (to the
> real one) which would basically make the approach
> useless, or to use addresses which are fine within
> a certain range of routers ...
> 
> 
>>With that set it is just a matter of using a decent random
>>number generator.  The kernel already does this is some places.
> 
> 
> sure you can make up MACs, but you will never
> be able to use them 'outside' the box 

I do it all the time with my mac-vlan patch and it works fine, so
long as you pay minimal attention.  Just let the user specify the
MAC addr and they can manage the uniqueness as they wish....

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 21:26                     ` Herbert Poetzl
  2006-06-26 21:59                       ` Ben Greear
@ 2006-06-26 22:11                       ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-26 22:11 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Mon, Jun 26, 2006 at 02:37:15PM -0600, Eric W. Biederman wrote:
>> Herbert Poetzl <herbert@13thfloor.at> writes:
>> 
>> > On Mon, Jun 26, 2006 at 01:35:15PM -0600, Eric W. Biederman wrote:
>> >> Herbert Poetzl <herbert@13thfloor.at> writes:
>> >
>> > yes, but you will not be able to apply policy on
>> > the parent, restricting the child networking in a
>> > proper way without jumping through hoops ...
>> 
>> ?  I don't understand where you are coming from.
>> There is no restriction on where you can apply policy.
>
> in a fully hierarchical system (not that I really think
> this is required here) you would be able to 'delegate'
> certain addresses (ranges) to a namespace (the child)
> below the current one (the parent) with the ability to
> limit/control the input/output (which is required for 
> security)

All of that is possible with the current design.
It is merely a matter of using the features the kernel
currently has.

The trick is knowing that a child namespace only has a loopback
by default, and has to have a network interface added from
the parent to be able to talk to anything.


>> >> I really do not believe we have a hotpath issue, if this
>> >> is implemented properly. Benchmarks of course need to be taken,
>> >> to prove this.
>> >
>> > I'm fine with proper testing and good numbers here
>> > but until then, I can only consider it a prototype
>> 
>> We are taking the first steps to get this all sorted out.
>> I think what we have is more than a prototype but less then
>> the final implementation.  Call it the very first draft version.
>
> what we are desperately missing here is a proper way
> to testing this, maybe the network folks can come up
> with some test equipment/ideas here ...
>

>> That assumes on the wire stuff is noticeably slower.
>> You can achieve over 1GB/s on some networks.
>
> well, have you ever tried how much you can achieve
> over loopback :)

Not recently.

> well, local is fine, but you cannot utilize that 
> on-wire which basically means that you would have
> either to 'map' the MAC on transmission (to the
> real one) which would basically make the approach
> useless, or to use addresses which are fine within
> a certain range of routers ...

I believe on the wire is fine as well.  Certainly I had
no problems when I tested it.  I do agree that it increases
the chance of a mac address collision, so it should be handled
carefully.  But we are talking about a number with almost as many random
bits as a UUID.

As I recall the rule from the birthday paradox is something like: if
you have N possibilities you get a 50% chance of collision when
you have sqrt(N) items present.  sqrt(2**40) = 2**20 ~= 1 Million.

So as long as you have a good random generator the odds of a collision
are quite small until you have used a million local mac addresses.
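
(As a cross-check, the usual birthday-bound approximation, written out in
TeX notation and keeping the same N = 2**40 figure as above, lands in the
same ballpark:

    p_{\text{collision}} \approx 1 - e^{-n^{2}/(2N)}, \qquad
    p = \tfrac{1}{2} \;\Rightarrow\; n \approx 1.18\sqrt{N}
        \approx 1.2 \times 10^{6} \quad \text{for } N = 2^{40}
)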

Now while I can see some small chance of that happening on a very
crowded local area network using lots of logical servers, the
kernel arp cache and switch mac cache limits would start causing
real problems long before you got that far.  So you would need to
start routing, at which point the practical problem goes away.

>
> well, for loopback that would mean half the bandwidth
> and twice the latency, no?

Not at all.  The usual bottleneck is copying the data.  The data
only gets put in a skb once and then pointers to the skb are
passed around.  So you get copied in once and copied out once.
We are certainly not going to add an extra copy to it.

For practical purposes the fast path through the network stack
is a series of hash table lookups.

Add to that the fact that we don't make a full trip through
the network stack on both sides (unless someone is running tcpdump,
for example).

>> If it does we have a lot more to optimize in the network stack than
>> just this code.
>
> why? duplicate stack traversal takes roughly twice
> the time, or am I wrong here? if so, please enlighten
> me ...

The network stack is how we decide what goes where.  Sending
and receiving packets should always have hardware as the bottleneck,
not software.  So software should be a tiny percentage of the time
it takes to send or receive any packet.

With packets limited to 1.5K and below, things aren't always that
clear cut, but the ideal remains.  But basically, anything we can
do short of removing the copy in and the copy out of the network stack,
we should do.

The copy in and the copy out are fundamentally hard to remove without
modifying page tables, which, because of TLB invalidates, can be even
more expensive than a copy.

The best you can hope for on a loopback scenario is a single copy
from one user space buffer to another skipping the intermediate
kernel buffer.  But that is tricky for an entirely different set
of reasons.

Eric


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 19:35               ` Eric W. Biederman
  2006-06-26 20:02                 ` Herbert Poetzl
@ 2006-06-26 22:13                 ` Ben Greear
  2006-06-26 22:54                   ` Herbert Poetzl
  1 sibling, 1 reply; 113+ messages in thread
From: Ben Greear @ 2006-06-26 22:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Herbert Poetzl, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov

Eric W. Biederman wrote:

> Basically it is just a matter of:
> if (dest_mac == my_mac1) it is for device 1.
> If (dest_mac == my_mac2) it is for device 2.
> etc.
> 
> At a small count of macs it is trivial to understand it will go
> fast for a larger count of macs it only works with a good data
> structure.  We don't hit any extra cache lines of the packet,
> and the above test can be collapsed with other routing lookup tests.

I think you should do this at the layer-2 level, well before you get
to routing.  That will make the virtual mac-vlan work with arbitrary
protocols and appear very much like a regular ethernet interface.  This
approach worked well with .1q vlans, and with my version of the mac-vlan
module.

Using the mac-vlan and source-based routing tables, I can give a unique
'interface' to each process and have each process able to bind to the
same IP port, for instance.  Using source-based routing (by binding to a local
IP explicitly and adding a route table for that source IP), I can give unique
default routes to each interface as well.  Since we cannot have more than 256
routing tables, this approach is currently limited to around 250 virtual
interfaces, but that is still a substantial amount.

My mac-vlan patch, redirect-device patch, and other hackings are consolidated
in this patch:

http://www.candelatech.com/oss/candela_2.6.16.patch

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 22:13                 ` Ben Greear
@ 2006-06-26 22:54                   ` Herbert Poetzl
  2006-06-26 23:08                     ` Ben Greear
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-26 22:54 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	devel, sam, viro, Alexey Kuznetsov

On Mon, Jun 26, 2006 at 03:13:17PM -0700, Ben Greear wrote:
> Eric W. Biederman wrote:
> 
> >Basically it is just a matter of:
> >if (dest_mac == my_mac1) it is for device 1.
> >If (dest_mac == my_mac2) it is for device 2.
> >etc.
> >
> >At a small count of macs it is trivial to understand it will go
> >fast for a larger count of macs it only works with a good data
> >structure.  We don't hit any extra cache lines of the packet,
> >and the above test can be collapsed with other routing lookup tests.
> 
> I think you should do this at the layer-2 level, well before you get
> to routing. That will make the virtual mac-vlan work with arbitrary
> protocols and appear very much like a regular ethernet interface.
> This approach worked well with .1q vlans, and with my version of the
> mac-vlan module.

yes, that sounds good to me, any numbers on how that
affects networking in general (performance-wise and
memory-wise, i.e. caches and hashes) ...

> Using the mac-vlan and source-based routing tables, I can give a
> unique 'interface' to each process and have each process able to bind
> to the same IP port, for instance. Using source-based routing (by
> binding to a local IP explicitly and adding a route table for that
> source IP), I can give unique default routes to each interface as
> well. Since we cannot have more than 256 routing tables, this approach
> is currently limitted to around 250 virtual interfaces, but that is
> still a substantial amount.

and typically that would be sufficient IMHO, but
of course, a more 'general' hash tag would be
better in the long run ...

> My mac-vlan patch, redirect-device patch, and other hackings are
> consolidated in this patch:
> 
> http://www.candelatech.com/oss/candela_2.6.16.patch

great! thanks!

best,
Herbert

> Thanks,
> Ben
> 
> -- 
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 22:54                   ` Herbert Poetzl
@ 2006-06-26 23:08                     ` Ben Greear
  2006-06-27 16:07                       ` Ben Greear
  0 siblings, 1 reply; 113+ messages in thread
From: Ben Greear @ 2006-06-26 23:08 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	devel, sam, viro, Alexey Kuznetsov

Herbert Poetzl wrote:
> On Mon, Jun 26, 2006 at 03:13:17PM -0700, Ben Greear wrote:

> yes, that sounds good to me, any numbers how that
> affects networking in general (performance wise and
> memory wise, i.e. caches and hashes) ...

I'll run some tests later today.  Based on my previous tests,
I don't remember any significant overhead.

>>Using the mac-vlan and source-based routing tables, I can give a
>>unique 'interface' to each process and have each process able to bind
>>to the same IP port, for instance. Using source-based routing (by
>>binding to a local IP explicitly and adding a route table for that
>>source IP), I can give unique default routes to each interface as
>>well. Since we cannot have more than 256 routing tables, this approach
>>is currently limitted to around 250 virtual interfaces, but that is
>>still a substantial amount.
> 
> 
> an typically that would be sufficient IMHO, but
> of course, a more 'general' hash tag would be
> better in the long run ...

I'm willing to offer a bounty (hardware, beer, money, ...)
if someone will 'fix' this so we can have 1000 or more routes....

Being able to select these routes at a more global level (without
having to specifically bind to a local IP) would be nice as well.

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [RFC] [patch 0/6] [Network namespace] introduction
  2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
                   ` (8 preceding siblings ...)
  2006-06-18 18:47 ` Al Viro
@ 2006-06-26 23:38 ` Patrick McHardy
  9 siblings, 0 replies; 113+ messages in thread
From: Patrick McHardy @ 2006-06-26 23:38 UTC (permalink / raw)
  To: dlezcano; +Cc: linux-kernel, netdev, serue, haveblue, clg

dlezcano@fr.ibm.com wrote:
> What is missing ?
> -----------------
> The routes are not yet isolated, that implies:
> 
>    - binding to another container's address is allowed
> 
>    - an outgoing packet which has an unset source address can
>      potentially get another container's address
> 
>    - an incoming packet can be routed to the wrong container if there
>      are several containers listening to the same addr:port

Does that mean that identification of containers for incoming packets
is done by IP address through routing (just had a quick look at the
patches, if I missed something obvious please just point me to it)?
How is code that uses global data without verifying its presence
(and visibility in the container) at initialization time going to be
handled? Netfilter and, I think, the TC action code contain some examples
of this.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 20:02                 ` Herbert Poetzl
  2006-06-26 20:37                   ` Eric W. Biederman
@ 2006-06-27  9:09                   ` Andrey Savochkin
  2006-06-27 15:48                     ` Herbert Poetzl
  1 sibling, 1 reply; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-27  9:09 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Eric W. Biederman, Daniel Lezcano, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

Herbert,

On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote:
> 
> keep in mind that you actually have three kinds
> of network traffic on a typical host/guest system:
> 
>  - traffic between unit and outside
>    - host traffic should be quite minimal
>    - guest traffic will be quite high
> 
>  - traffic between host and guest
>    probably minimal too (only for shared services)
> 
>  - traffic between guests
>    can be as high (or even higher) than the
>    outbound traffic, just think web guest and
>    database guest

My experience with host-guest systems tells me the opposite:
outside traffic is way higher than traffic between guests.
People put the web server and the database in different guests no more often than
they put them on separate physical servers.
Unless people are building a really huge system where one server can't take the
whole load, web and database live together and benefit from communicating
over UNIX sockets.

Guests usually consist of web-db pairs, and people place many such
guests on a single computer.

> 
> > The routing between network namespaces does have the potential to be
> > more expensive than just a packet trivially coming off the wire into a
> > socket.
> 
> IMHO the routing between network namespaces should
> not require more than the current local traffic
> does (i.e. you should be able to achieve loopback
> speed within an insignificant tolerance) and not
> nearly the time required for on-wire stuff ...

I'd like to caution against over-optimizing communications between different
network namespaces.
Many optimizations of local traffic (such as high MTU) don't look so
appealing when you start to think about live migration of namespaces.

Regards
	Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 15:49         ` Daniel Lezcano
  2006-06-26 16:40           ` Eric W. Biederman
@ 2006-06-27  9:11           ` Andrey Savochkin
  2006-06-27  9:34             ` Daniel Lezcano
  1 sibling, 1 reply; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-27  9:11 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov

Daniel,

On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
> 
> > Then you lose the ability for each namespace to have its own routing entries.
> > Which implies that you'll have difficulties with devices that should exist
> > and be visible in one namespace only (like tunnels), as they require IP
> > addresses and route.
> 
> I mean instead of having the route tables private to the namespace, the 
> routes have the information to which namespace they are associated.

I think I understand what you're talking about: you want to make routing
responsible for determining destination namespace ID in addition to route
type (local, unicast etc), nexthop information, and so on.  Right?

My point is that if you make namespace tagging at routing time, and
your packets are being routed only once, you lose the ability
to have separate routing tables in each namespace.

	Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27  9:11           ` Andrey Savochkin
@ 2006-06-27  9:34             ` Daniel Lezcano
  2006-06-27  9:38               ` Andrey Savochkin
                                 ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-27  9:34 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov

Andrey Savochkin wrote:
> Daniel,
> 
> On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
> 
>>>Then you lose the ability for each namespace to have its own routing entries.
>>>Which implies that you'll have difficulties with devices that should exist
>>>and be visible in one namespace only (like tunnels), as they require IP
>>>addresses and route.
>>
>>I mean instead of having the route tables private to the namespace, the 
>>routes have the information to which namespace they are associated.
> 
> 
> I think I understand what you're talking about: you want to make routing
> responsible for determining destination namespace ID in addition to route
> type (local, unicast etc), nexthop information, and so on.  Right?

Yes.

> 
> My point is that if you make namespace tagging at routing time, and
> your packets are being routed only once, you lose the ability
> to have separate routing tables in each namespace.

Right. What is the advantage of having separate routing tables?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27  9:34             ` Daniel Lezcano
@ 2006-06-27  9:38               ` Andrey Savochkin
  2006-06-27 11:21                 ` Daniel Lezcano
  2006-06-27  9:54               ` Kirill Korotaev
  2006-07-06  9:45               ` Routing tables (Re: [patch 2/6] [Network namespace] Network device sharing by view) Kari Hurtta
  2 siblings, 1 reply; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-27  9:38 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov

On Tue, Jun 27, 2006 at 11:34:36AM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Daniel,
> > 
> > On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
> > 
> >>>Then you lose the ability for each namespace to have its own routing entries.
> >>>Which implies that you'll have difficulties with devices that should exist
> >>>and be visible in one namespace only (like tunnels), as they require IP
> >>>addresses and route.
> >>
> >>I mean instead of having the route tables private to the namespace, the 
> >>routes have the information to which namespace they are associated.
> > 
> > 
> > I think I understand what you're talking about: you want to make routing
> > responsible for determining destination namespace ID in addition to route
> > type (local, unicast etc), nexthop information, and so on.  Right?
> 
> Yes.
> 
> > 
> > My point is that if you make namespace tagging at routing time, and
> > your packets are being routed only once, you lose the ability
> > to have separate routing tables in each namespace.
> 
> Right. What is the advantage of having separate the routing tables ?

Routing is everything.
For example, I want namespaces to have their private tunnel devices.
It means that namespaces should be allowed to have private routes of local type,
private default routes, and so on...

	Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27  9:34             ` Daniel Lezcano
  2006-06-27  9:38               ` Andrey Savochkin
@ 2006-06-27  9:54               ` Kirill Korotaev
  2006-06-27 16:09                 ` Herbert Poetzl
  2006-07-06  9:45               ` Routing tables (Re: [patch 2/6] [Network namespace] Network device sharing by view) Kari Hurtta
  2 siblings, 1 reply; 113+ messages in thread
From: Kirill Korotaev @ 2006-06-27  9:54 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, herbert, devel, sam, ebiederm, viro,
	Alexey Kuznetsov

>> My point is that if you make namespace tagging at routing time, and
>> your packets are being routed only once, you lose the ability
>> to have separate routing tables in each namespace.
> 
> 
> Right. What is the advantage of having separate the routing tables ?
it is impossible to have bridged networking, tun/tap and many other 
features without it. I even doubt that it is possible to introduce 
private netfilter rules w/o virtualization of routing.

The question is: do we want to have fully featured namespaces which allow 
us to create isolated virtual environments with the semantics and behaviour of 
a standalone linux box, or do we want to introduce some hacks with new 
rules/restrictions to meet one's goals only?

 From my POV, fully virtualized namespaces are the future. It is what 
makes a virtualization solution usable (w/o apps modifications), provides 
all the features, and doesn't require much effort from people to be used.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27  9:38               ` Andrey Savochkin
@ 2006-06-27 11:21                 ` Daniel Lezcano
  2006-06-27 11:52                   ` Eric W. Biederman
  2006-06-27 11:55                   ` Andrey Savochkin
  0 siblings, 2 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-27 11:21 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov

>>>My point is that if you make namespace tagging at routing time, and
>>>your packets are being routed only once, you lose the ability
>>>to have separate routing tables in each namespace.
>>
>>Right. What is the advantage of having separate the routing tables ?
> 
> 
> Routing is everything.
> For example, I want namespaces to have their private tunnel devices.
> It means that namespaces should be allowed have private routes of local type,
> private default routes, and so on...
> 

Ok, we are talking about the same thing; we just do it in a different way:

	* separate routing table :
		 namespace
			|
			\--- route_tables
				|
				\---routes

	* tagged routing table :
		route_tables
			|
			\---routes
				|
				\---namespace

When using routes private to the namespace, the overall logic of the IP 
stack is not changed; it just manipulates different variables. That is 
cleaner than tagging the routes, for the reasons mentioned by Eric.

When using route tagging, the logic is changed: the lookup is done in the 
global route table, and the namespace is used to match a route and make 
it visible.
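
To make the contrast concrete, here is a purely illustrative user-space 
sketch of the two layouts; all names and the simplified route structure 
are made up for the example and are not the code from the patches:

#include <stddef.h>

struct route {
	unsigned int dst;	/* destination network (simplified)     */
	unsigned int mask;	/* netmask                              */
	int          out_dev;	/* output device index                  */
};

/* (a) separate tables: every namespace owns its own array of routes */
struct netns {
	struct route *routes;
	size_t        nr_routes;
};

static struct route *lookup_separate(struct netns *ns, unsigned int daddr)
{
	size_t i;

	for (i = 0; i < ns->nr_routes; i++)
		if ((daddr & ns->routes[i].mask) == ns->routes[i].dst)
			return &ns->routes[i];
	return NULL;
}

/* (b) tagged table: one global array, each entry carries its namespace */
struct tagged_route {
	struct route  route;
	struct netns *owner;	/* tag checked during the lookup */
};

static struct route *lookup_tagged(struct tagged_route *table, size_t n,
				   struct netns *ns, unsigned int daddr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i].owner != ns)	/* invisible to this namespace */
			continue;
		if ((daddr & table[i].route.mask) == table[i].route.dst)
			return &table[i].route;
	}
	return NULL;
}

In (a) a lookup never even sees another namespace's routes; in (b) every 
lookup walks the shared table and filters on the tag.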

I use the second method because I think it is more efficient and reduces 
the overhead. But the isolation is minimalist and only aims to prevent the 
application from using resources outside of the container (aka namespace), 
without taking care of the rest of the system. For example, I didn't take 
care of network devices, because as far as I can see I can't imagine an 
administrator wanting to change a network device name while there are 
hundreds of containers running. Tunnel devices, for example, should be 
created inside the container.

I think the private network resources method is more elegant and covers 
more network resources, but there is probably a significant overhead and 
some difficulties in having a __lightweight__ container (aka application 
container), making NFS work well, etc. I did some tests with tbench over 
the loopback: with the private namespace there is roughly a 4 % overhead 
without the isolation, whereas with the tagging method there is 1 % with 
the isolation.

The network namespace aims at isolation for now, but containers based on 
the namespaces will probably need checkpoint/restart and migration 
ability. The migration is needed not only for servers but for HPC jobs 
too.

So I don't know what level of isolation/virtualization is really needed 
by users, and what should be acceptable (strong isolation with overhead / 
weak isolation with efficiency). I don't know if people wanting strong 
isolation will not prefer Xen (clearly with much more overhead than your 
patches ;) )



Regards
	-- Daniel










^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 11:21                 ` Daniel Lezcano
@ 2006-06-27 11:52                   ` Eric W. Biederman
  2006-06-27 16:02                     ` Herbert Poetzl
  2006-06-27 11:55                   ` Andrey Savochkin
  1 sibling, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-27 11:52 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, herbert, devel, sam, viro, Alexey Kuznetsov

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

>>>>My point is that if you make namespace tagging at routing time, and
>>>>your packets are being routed only once, you lose the ability
>>>>to have separate routing tables in each namespace.
>>>
>>>Right. What is the advantage of having separate the routing tables ?
>> Routing is everything.
>> For example, I want namespaces to have their private tunnel devices.
>> It means that namespaces should be allowed have private routes of local type,
>> private default routes, and so on...
>>
>
> Ok, we are talking about the same things. We do it only in a different way:
>
> 	* separate routing table :
> 		 namespace
> 			|
> 			\--- route_tables
> 				|
> 				\---routes
>
> 	* tagged routing table :
> 		route_tables
> 			|
> 			\---routes
> 				|
> 				\---namespace

There is a third possibility that falls in between these two, if local
communication really is the bottleneck.

We have the dst cache for caching routes, and it can cache the multiple
transformations that happen to a packet.

With a little extra knowledge it is possible to have the separate
routing tables, but with special logic that recognizes the local tunnel
device connecting namespaces, looks into the next namespace's routes,
and builds up a complete stack of dst entries describing where the
packet needs to go.

I keep forgetting about that possibility.  But as long as everything
is done at the routing layer that should work.
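
As a minimal sketch of that idea (a self-contained user-space model with
invented names, not kernel code, and assuming the chain of inter-namespace
tunnels is acyclic):

#include <stdlib.h>

struct ns;

/* One resolved hop; ->child continues the resolution in the peer namespace. */
struct dst_entry {
	struct ns        *ns;       /* namespace this hop was resolved in   */
	int               out_dev;  /* output device chosen by that lookup  */
	struct dst_entry *child;    /* next hop, if out_dev was a tunnel    */
};

struct ns {
	/* per-namespace routing: returns an output device or -1 */
	int (*route_lookup)(struct ns *self, unsigned int daddr);
	/* if out_dev is the local inter-namespace tunnel, return its peer */
	struct ns *(*tunnel_peer)(struct ns *self, int out_dev);
};

/* Build the whole chain of dst entries once, at routing time. */
static struct dst_entry *resolve(struct ns *ns, unsigned int daddr)
{
	int dev = ns->route_lookup(ns, daddr);
	struct dst_entry *d;
	struct ns *peer;

	if (dev < 0)
		return NULL;

	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->ns = ns;
	d->out_dev = dev;

	peer = ns->tunnel_peer(ns, dev);
	if (peer)                  /* cross into the peer namespace's tables */
		d->child = resolve(peer, daddr);
	return d;
}

The idea being that the whole chain is built once at routing time, so
crossing into another namespace costs one extra lookup rather than a
second full trip through the stack.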

> I use the second method, because I think it is more effecient and reduce the
> overhead. But the isolation is minimalist and only aims to avoid the application
> using ressources outside of the container (aka namespace) without taking care of
> the system. For example, I didn't take care of network devices, because as far
> as see I can't imagine an administrator wanting to change the network device
> name while there are hundred of containers running. Concerning tunnel devices
> for example, they should be created inside the container.

Inside the containers I want all network devices named eth0!

> I think, private network ressources method is more elegant and involves more
> network ressources, but there is probably a significant overhead and some
> difficulties to have __lightweight__ container (aka application container), make
> nfs working well, etc... I did some tests with tbench and the loopback with the
> private namespace and there is roughly an overhead of 4 % without the isolation
> since with the tagging method there is 1 % with the isolation.

The overhead went down?

> The network namespace aims the isolation for now, but the container based on the
> namespaces will probably need checkpoint/restart and migration ability. The
> migration is needed not only for servers but for HPC jobs too.

Yes.

> So I don't know what level of isolation/virtualization is really needed by
> users, what should be acceptable (strong isolation and overhead / weak isolation
> and efficiency). I don't know if people wanting strong isolation will not prefer
> Xen (cleary with much more overhead than your patches ;) )

We need a clean abstraction that optimizes well.

However local communication between containers is not what we should
benchmark.  That can always be improved later.  So long as the
performance is reasonable.  What needs to be benchmarked is the
overhead of namespaces when connected to physical networking devices
and on their own local loopback, and comparing that to a kernel
without namespace support.

If we don't hurt that core case we have an implementation we can
merge.  There are a lot of optimization opportunities for local
communications and we can do that after we have a correct and accepted
implementation.  Anything else is optimizing too soon, and will
just be muddying the waters.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 11:21                 ` Daniel Lezcano
  2006-06-27 11:52                   ` Eric W. Biederman
@ 2006-06-27 11:55                   ` Andrey Savochkin
  1 sibling, 0 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-27 11:55 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	herbert, devel, sam, ebiederm, viro, Alexey Kuznetsov

Daniel,

On Tue, Jun 27, 2006 at 01:21:02PM +0200, Daniel Lezcano wrote:
> >>>My point is that if you make namespace tagging at routing time, and
> >>>your packets are being routed only once, you lose the ability
> >>>to have separate routing tables in each namespace.
> >>
> >>Right. What is the advantage of having separate the routing tables ?
> > 
> > 
> > Routing is everything.
> > For example, I want namespaces to have their private tunnel devices.
> > It means that namespaces should be allowed have private routes of local type,
> > private default routes, and so on...
> > 
> 
> Ok, we are talking about the same things. We do it only in a different way:

We are not talking about the same things.

It isn't a technical thing whether route lookup is performed before or after
namespace change.
It is a fundamental question determining functionality of network namespaces.
We are talking about the capabilities namespaces provide.

Your proposal essentially denies namespaces the ability to have their own tunnel
or other devices.  There is no point in having a device inside a namespace if the
namespace owner can't route all or some specific outgoing packets through
that device.  You don't allow system administrators to completely delegate
management of network configuration to namespace owners.

	Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27  9:09                   ` Andrey Savochkin
@ 2006-06-27 15:48                     ` Herbert Poetzl
  2006-06-27 16:19                       ` Andrey Savochkin
  2006-06-27 16:40                       ` Eric W. Biederman
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-27 15:48 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: Eric W. Biederman, Daniel Lezcano, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

On Tue, Jun 27, 2006 at 01:09:11PM +0400, Andrey Savochkin wrote:
> Herbert,
> 
> On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote:
> > 
> > keep in mind that you actually have three kinds
> > of network traffic on a typical host/guest system:
> > 
> >  - traffic between unit and outside
> >    - host traffic should be quite minimal
> >    - guest traffic will be quite high
> > 
> >  - traffic between host and guest
> >    probably minimal too (only for shared services)
> > 
> >  - traffic between guests
> >    can be as high (or even higher) than the
> >    outbound traffic, just think web guest and
> >    database guest
> 
> My experience with host-guest systems tells me the opposite: outside
> traffic is a way higher than traffic between guests. People put web
> server and database in different guests not more frequent than they
> put them on separate physical server. Unless people are building a
> really huge system when 1 server can't take the whole load, web and
> database live together and benefit from communications over UNIX
> sockets.

well, that's probably because you (or your company)
focuses on providers which simply (re)sell the entities
to their customers, in which case it would be more
expensive to put e.g. the database into a separate
guest. but let me state here that this is not the only
application for this technology

many folks use Linux-VServer for separating services
(e.g. mail, web, database, ...) and here a _lot_ of
traffic happens between guests (as it would on a normal
linux system or within a single guest in your case)

> Guests are usually comprised of web-db pairs, and people place many
> such guests on a single computer.

in case two guests cost more than one, yes, in case
two guests allow for better isolation and easier
maintenance without additional cost, no :)

> > > The routing between network namespaces does have the potential to
> > > be more expensive than just a packet trivially coming off the wire
> > > into a socket.
> > 
> > IMHO the routing between network namespaces should
> > not require more than the current local traffic
> > does (i.e. you should be able to achieve loopback
> > speed within an insignificant tolerance) and not
> > nearly the time required for on-wire stuff ...
> 
> I'd like to caution about over-optimizing communications between
> different network namespaces. Many optimizations of local traffic
> (such as high MTU) don't look so appealing when you start to think
> about live migration of namespaces.

I think the 'optimization' (or to be precise: desire
not to sacrifice local/loopback traffic for some use
case as you describe it) does not interfere with live
migration at all, we still will have 'local' and 'remote'
traffic, and personally I doubt that the live migration
is a feature for the masses ...

best,
Herbert

> Regards
> 	Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 11:52                   ` Eric W. Biederman
@ 2006-06-27 16:02                     ` Herbert Poetzl
  2006-06-27 16:47                       ` Eric W. Biederman
  2006-06-27 16:49                       ` Alexey Kuznetsov
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-27 16:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
> 
> >>>>My point is that if you make namespace tagging at routing time,
> >>>>and your packets are being routed only once, you lose the ability
> >>>>to have separate routing tables in each namespace.
> >>>
> >>>Right. What is the advantage of having separate the routing tables ?
> >> Routing is everything. For example, I want namespaces to have their
> >> private tunnel devices. It means that namespaces should be allowed
> >> have private routes of local type, private default routes, and so
> >> on...
> >>
> >
> > Ok, we are talking about the same things. We do it only in a different way:
> >
> > 	* separate routing table :
> > 		 namespace
> > 			|
> > 			\--- route_tables
> > 				|
> > 				\---routes
> >
> > 	* tagged routing table :
> > 		route_tables
> > 			|
> > 			\---routes
> > 				|
> > 				\---namespace
> 
> There is a third possibility, that falls in between these two if local
> communication is really the bottle neck.
>
> We have the dst cache for caching routes and cache multiple
> transformations that happen on a packet.
>
> With a little extra knowledge it is possible to have the separate
> routing tables but have special logic that recognizes the local
> tunnel device that connects namespaces and have it look into the next
> namespaces routes, and build up a complete stack of dst entries of
> where the packet needs to go.
>
> I keep forgetting about that possibility. But as long as everything is
> done at the routing layer that should work.
> 
> > I use the second method, because I think it is more effecient and
> > reduce the overhead. But the isolation is minimalist and only aims
> > to avoid the application using ressources outside of the container
> > (aka namespace) without taking care of the system. For example, I
> > didn't take care of network devices, because as far as see I can't
> > imagine an administrator wanting to change the network device name
> > while there are hundred of containers running. Concerning tunnel
> > devices for example, they should be created inside the container.
> 
> Inside the containers I want all network devices named eth0!

huh? even if there are two of them? also tun?

I think you meant, you want to be able to have eth0 in
_more_ than one guest where eth0 in a guest can also
be/use/relate to eth1 on the host, right?

> > I think, private network ressources method is more elegant
> > and involves more network ressources, but there is probably a
> > significant overhead and some difficulties to have __lightweight__
> > container (aka application container), make nfs working well,
> > etc... I did some tests with tbench and the loopback with the
> > private namespace and there is roughly an overhead of 4 % without
> > the isolation since with the tagging method there is 1 % with the
> > isolation.
> 
> The overhead went down?

yes, this might actually happen, because the guest
has only to look at a certain subset of entries
but this needs a lot more testing, especially with
a lot of guests

> > The network namespace aims the isolation for now, but the container
> > based on the namespaces will probably need checkpoint/restart and
> > migration ability. The migration is needed not only for servers but
> > for HPC jobs too.
> 
> Yes.
> 
> > So I don't know what level of isolation/virtualization is really
> > needed by users, what should be acceptable (strong isolation and
> > overhead / weak isolation and efficiency). I don't know if people
> > wanting strong isolation will not prefer Xen (cleary with much more
> > overhead than your patches ;) )

well, Xen claims something below 2% IIRC, and would
be clearly the better choice if you want strict 
separation with the complete functionality, especially
with hardware support

> We need a clean abstraction that optimizes well.
> 
> However local communication between containers is not what we
> should benchmark. That can always be improved later. So long as
> the performance is reasonable. What needs to be benchmarked is the
> overhead of namespaces when connected to physical networking devices
> and on their own local loopback, and comparing that to a kernel
> without namespace support.

well, for me (obviously advocating the lightweight case)
it seems important that the following conditions are met:

 - loopback traffic inside a guest is insignificantly
   slower than on a normal system

 - loopback traffic on the host is insignificantly
   slower than on a normal system

 - inter guest traffic is faster than on-wire traffic,
   and should be within a small tolerance of the
   loopback case (as it really isn't different)

 - network (on-wire) traffic should be as fast as without
   the namespace (i.e. within 1% or so, better not really
   measurable)

 - all this should be true in a setup with a significant
   number of guests, when only one guest is active, but
   all other guests are ready/configured

 - all this should scale well with a few hundred guests

> If we don't hurt that core case we have an implementation we can
> merge.  There are a lot of optimization opportunities for local
> communications and we can do that after we have a correct and accepted
> implementation.  Anything else is optimizing too soon, and will
> just be muddying the waters.

what I fear is that once something is in, the kernel will
just become slower (as it already did in some areas) and
nobody will care/be able to fix that later on ...

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-26 23:08                     ` Ben Greear
@ 2006-06-27 16:07                       ` Ben Greear
  2006-06-27 22:48                         ` Herbert Poetzl
  0 siblings, 1 reply; 113+ messages in thread
From: Ben Greear @ 2006-06-27 16:07 UTC (permalink / raw)
  To: Ben Greear
  Cc: Herbert Poetzl, Eric W. Biederman, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, viro, Alexey Kuznetsov

Ben Greear wrote:
> Herbert Poetzl wrote:
> 
>> On Mon, Jun 26, 2006 at 03:13:17PM -0700, Ben Greear wrote:
> 
> 
>> yes, that sounds good to me, any numbers how that
>> affects networking in general (performance wise and
>> memory wise, i.e. caches and hashes) ...
> 
> 
> I'll run some tests later today.  Based on my previous tests,
> I don't remember any significant overhead.

Here's a quick benchmark using my redirect devices (RDD).  Each
RDD comes in a pair...when you tx on one, the pkt is rx'd on the peer.
The idea is that it is exactly like two physical ethernet interfaces
connected by a cross-over cable.
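
For illustration only (this is not the actual RDD module), a rough
2.6.16-era sketch of such a device pair's transmit path might look like
this: a frame transmitted on one device is handed straight to the peer's
receive path, as if it had arrived on the wire.

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

struct pair_priv {
	struct net_device *peer;		/* the other end of the pair */
	struct net_device_stats stats;
};

static int pair_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct pair_priv *priv = netdev_priv(dev);
	struct net_device *peer = priv->peer;
	struct pair_priv *peer_priv = netdev_priv(peer);

	/* account the frame as transmitted on this side ... */
	priv->stats.tx_packets++;
	priv->stats.tx_bytes += skb->len;

	/* ... and as received on the peer */
	peer_priv->stats.rx_packets++;
	peer_priv->stats.rx_bytes += skb->len;

	/* hand the frame to the peer as if it had come off the wire */
	skb->dev = peer;
	skb->protocol = eth_type_trans(skb, peer);
	netif_rx(skb);

	return 0;	/* NETDEV_TX_OK */
}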

My test system is a 64-bit dual-core Intel system, 3.013 GHz processor with 1GB RAM.
Fairly standard stuff... it's one of the Shuttle XPC systems.
Kernel is 2.6.16.16 (64-bit).


Test setup is:  rdd1 -- rdd2   [bridge]   rdd3 -- rdd4

I am using my proprietary module for the bridge logic...and the default
bridge should be at least this fast.  I am injecting 1514 byte packets
on rdd1 and rdd4 with pktgen (bi-directional flow).  My pktgen is also
receiving the pkts and gathering stats.

This setup sustains 1.7Gbps of generated and received traffic between
rdd1 and rdd4.

Running only the [bridge] between two 10/100/1000 ports on an Intel PCI-E
NIC will sustain about 870Mbps (bi-directional) on this system, so the
virtual devices are quite efficient, as suspected.

I have not yet had time to benchmark the mac-vlans...hopefully later today.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27  9:54               ` Kirill Korotaev
@ 2006-06-27 16:09                 ` Herbert Poetzl
  2006-06-27 16:29                   ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-27 16:09 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, devel, sam, ebiederm, viro,
	Alexey Kuznetsov

On Tue, Jun 27, 2006 at 01:54:51PM +0400, Kirill Korotaev wrote:
> >>My point is that if you make namespace tagging at routing time, and
> >>your packets are being routed only once, you lose the ability
> >>to have separate routing tables in each namespace.
> >
> >
> >Right. What is the advantage of having separate the routing tables ?

> it is impossible to have bridged networking, tun/tap and many other 
> features without it. I even doubt that it is possible to introduce 
> private netfilter rules w/o virtualization of routing.

why? iptables work quite fine on a typical linux
system when you 'delegate' certain functionality
to certain chains (i.e. doesn't require access to
_all_ of them)

> The question is do we want to have fully featured namespaces which
> allow to create isolated virtual environments with semantics and
> behaviour of standalone linux box or do we want to introduce some
> hacks with new rules/restrictions to meet ones goals only?

well, sometimes 'hacks' are not only simpler but also 
a much better solution for a given problem than the
straightforward approach ... 

for example, you won't have multiple routing tables
in a kernel where this feature is disabled, no?
so why should it affect a guest, or require modified
apps inside a guest, if we decided to provide
only a single routing table?

> From my POV, fully virtualized namespaces are the future. 

the future is already there, it's called Xen or UML, or QEMU :)

> It is what makes virtualization solution usable (w/o apps
> modifications), provides all the features and doesn't require much
> efforts from people to be used.

and what if they want to use virtualization inside
their guests? where do you draw the line?

best,
Herbert

> Thanks,
> Kirill

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 15:48                     ` Herbert Poetzl
@ 2006-06-27 16:19                       ` Andrey Savochkin
  2006-06-27 16:40                       ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-27 16:19 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Eric W. Biederman, Daniel Lezcano, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

Herbert,

On Tue, Jun 27, 2006 at 05:48:19PM +0200, Herbert Poetzl wrote:
> On Tue, Jun 27, 2006 at 01:09:11PM +0400, Andrey Savochkin wrote:
> > 
> > On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote:
> > > 
> > >  - traffic between guests
> > >    can be as high (or even higher) than the
> > >    outbound traffic, just think web guest and
> > >    database guest
> > 
> > My experience with host-guest systems tells me the opposite: outside
> > traffic is a way higher than traffic between guests. People put web
> > server and database in different guests not more frequent than they
> > put them on separate physical server. Unless people are building a
> > really huge system when 1 server can't take the whole load, web and
> > database live together and benefit from communications over UNIX
> > sockets.
> 
> well, that's probably because you (or your company)
> focuses on providers which simply (re)sell the entities
> to their customers, in which case it would be more
> expensive to put e.g. the database into a separate
> guest. but let me state here that this is not the only
> application for this technology

I'm just sharing my experience.
You have one experience, I have another, and your classification of traffic
importance is not the universal one.
My point was that we shouldn't overestimate the use of INET sockets vs. UNIX
ones in configurations where communications, rather than web/db operations,
play a big role in overall performance.
And indeed I've talked with many different people, from universities to
large enterprises.

> 
[snip]
> > I'd like to caution about over-optimizing communications between
> > different network namespaces. Many optimizations of local traffic
> > (such as high MTU) don't look so appealing when you start to think
> > about live migration of namespaces.
> 
> I think the 'optimization' (or to be precise: desire
> not to sacrifice local/loopback traffic for some use
> case as you describe it) does not interfere with live
> migration at all, we still will have 'local' and 'remote'
> traffic, and personally I doubt that the live migration
> is a feature for the masses ...

Why not for the masses?

	Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 16:09                 ` Herbert Poetzl
@ 2006-06-27 16:29                   ` Eric W. Biederman
  2006-06-27 23:07                     ` Herbert Poetzl
  0 siblings, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-27 16:29 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Kirill Korotaev, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, devel, sam,
	ebiederm, viro, Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Tue, Jun 27, 2006 at 01:54:51PM +0400, Kirill Korotaev wrote:
>> >>My point is that if you make namespace tagging at routing time, and
>> >>your packets are being routed only once, you lose the ability
>> >>to have separate routing tables in each namespace.
>> >
>> >
>> >Right. What is the advantage of having separate the routing tables ?
>
>> it is impossible to have bridged networking, tun/tap and many other 
>> features without it. I even doubt that it is possible to introduce 
>> private netfilter rules w/o virtualization of routing.
>
> why? iptables work quite fine on a typical linux
> system when you 'delegate' certain functionality
> to certain chains (i.e. doesn't require access to
> _all_ of them)
>
>> The question is do we want to have fully featured namespaces which
>> allow to create isolated virtual environments with semantics and
>> behaviour of standalone linux box or do we want to introduce some
>> hacks with new rules/restrictions to meet ones goals only?
>
> well, soemtimes 'hacks' are not only simpler but also 
> a much better solution for a given problem than the
> straight forward approach ... 

Well I would like to see a hack that qualifies.  I watched the
linux-vserver irc channel for a while and almost every network problem
was caused by the change in semantics vserver provides.

In this case, when you allow a guest more than one IP, your hack, while
easy to maintain, becomes much more complex.  Especially as you address
each case people care about one at a time.

In one shot this goes the entire way.  Given how many people miss
that you do the work at layer 2 rather than at layer 3, I would not call
this the straightforward approach.  The straightforward implementation,
yes, but not the straightforward approach.

> for example, you won't have multiple routing tables
> in a kernel where this feature is disabled, no?
> so why should it affect a guest, or require modified
> apps inside a guest when we would decide to provide
> only a single routing table?
>
>> From my POV, fully virtualized namespaces are the future. 
>
> the future is already there, it's called Xen or UML, or QEMU :)

Yep.  And now we need it to run fast.

>> It is what makes virtualization solution usable (w/o apps
>> modifications), provides all the features and doesn't require much
>> efforts from people to be used.
>
> and what if they want to use virtualization inside
> their guests? where do you draw the line?

The implementation doesn't have any problems with guests inside
of guests.

The only reason to restrict guests inside of guests is because
we aren't certain which permissions make sense.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 15:48                     ` Herbert Poetzl
  2006-06-27 16:19                       ` Andrey Savochkin
@ 2006-06-27 16:40                       ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-27 16:40 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Daniel Lezcano, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, devel, sam, viro, Alexey Kuznetsov,
	Andrey Savochkin

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Tue, Jun 27, 2006 at 01:09:11PM +0400, Andrey Savochkin wrote:
>> 
>> I'd like to caution about over-optimizing communications between
>> different network namespaces. Many optimizations of local traffic
>> (such as high MTU) don't look so appealing when you start to think
>> about live migration of namespaces.
>
> I think the 'optimization' (or to be precise: desire
> not to sacrifice local/loopback traffic for some use
> case as you describe it) does not interfere with live
> migration at all, we still will have 'local' and 'remote'
> traffic, and personally I doubt that the live migration
> is a feature for the masses ...

Several things.
- The Linux loopback device is not strongly optimized; it is a compatibility
  layer.
- Traffic between guests is an implementation detail.
  There is nothing fundamental in our semantics that says the traffic
  has to be slow for any workload (except for the limits imposed by using
  actual on-the-wire protocols).  The lo device shares the same problem.

Worrying about this case now, when it has clearly been shown that there are
several possible ways to optimize this and get back any lost local performance,
is optimizing way too early.

Criticize the per-namespace performance all you want.  That is pretty
much a merge blocker.  Unless we do worse than a 1-5% penalty, the communication
across namespaces is really a non-issue.

Even with your large communication flows between guests, 1-5% is nothing.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 16:02                     ` Herbert Poetzl
@ 2006-06-27 16:47                       ` Eric W. Biederman
  2006-06-27 17:19                         ` Ben Greear
  2006-06-27 16:49                       ` Alexey Kuznetsov
  1 sibling, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-27 16:47 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Daniel Lezcano, Andrey Savochkin, linux-kernel, netdev, serue,
	haveblue, clg, Andrew Morton, dev, devel, sam, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
>> 
>> Inside the containers I want all network devices named eth0!
>
> huh? even if there are two of them? also tun?
>
> I think you meant, you want to be able to have eth0 in
> _more_ than one guest where eth0 in a guest can also
> be/use/relate to eth1 on the host, right?

Right, I want to have an eth0 in each guest, where eth0 is
its own network device and need have no relationship to
eth0 on the host.

>> We need a clean abstraction that optimizes well.
>> 
>> However local communication between containers is not what we
>> should benchmark. That can always be improved later. So long as
>> the performance is reasonable. What needs to be benchmarked is the
>> overhead of namespaces when connected to physical networking devices
>> and on their own local loopback, and comparing that to a kernel
>> without namespace support.
>
> well, for me (obviously advocating the lightweight case)
> it seems improtant that the following conditions are met:
>
>  - loopback traffic inside a guest is insignificantly
>    slower than on a normal system
>
>  - loopback traffic on the host is insignificantly
>    slower than on a normal system
>
>  - inter guest traffic is faster than on-wire traffic,
>    and should be withing a small tolerance of the
>    loopback case (as it really isn't different)
>
>  - network (on-wire) traffic should be as fast as without
>    the namespace (i.e. within 1% or so, better not really
>    measurable)
>
>  - all this should be true in a setup with a significant
>    number of guests, when only one guest is active, but
>    all other guests are ready/configured
>
>  - all this should scale well with a few hundred guests

Ultimately I agree.  However, only host performance should be
a merge blocker, allowing us to go back and reclaim the few
percentage points we lost later.

>> If we don't hurt that core case we have an implementation we can
>> merge.  There are a lot of optimization opportunities for local
>> communications and we can do that after we have a correct and accepted
>> implementation.  Anything else is optimizing too soon, and will
>> just be muddying the waters.
>
> what I fear is that once something is in, the kernel will
> just become slower (as it already did in some areas) and
> nobody will care/be-able to fix that later on ...

If nobody cares it doesn't matter.

If no one can fix it, that is a problem.  Which is why we need
high standards and clean code, not early optimizations.

But on that front each step of the way must be justified on
its own merits, not because it will give us some holy grail.

The way to keep the inter-guest performance from degrading is
to measure it and complain.  But the Linux network stack is too
big to get in one pass.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 16:02                     ` Herbert Poetzl
  2006-06-27 16:47                       ` Eric W. Biederman
@ 2006-06-27 16:49                       ` Alexey Kuznetsov
  1 sibling, 0 replies; 113+ messages in thread
From: Alexey Kuznetsov @ 2006-06-27 16:49 UTC (permalink / raw)
  To: Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	devel, sam, viro, Alexey Kuznetsov

On Tue, Jun 27, 2006 at 06:02:42PM +0200, Herbert Poetzl wrote:

>  - loopback traffic inside a guest is insignificantly
>    slower than on a normal system
> 
>  - loopback traffic on the host is insignificantly
>    slower than on a normal system
> 
>  - inter guest traffic is faster than on-wire traffic,
>    and should be withing a small tolerance of the
>    loopback case (as it really isn't different)

I do not follow: what are you people arguing about?

Intra-guest, guest-guest and host-guest paths have _no_ differences
from host-host loopback. Only the device is different:
* virtual loopback for intra-guest
* virtual interface for guest-guest and host-guest

But the work is exactly the same; only the place where packets are
looped back is different. How could this be an issue to break a lance over? :-)

Alexey


PS. The only thing I can imagine being "optimized" out is ip_route_input()
in the case of loopback. But this optimization was an obvious design mistake
(mine, sorry) and apparently will die together with the removal of the current
deficiencies of the routing cache. Actually, it is one of those deficiencies.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 16:47                       ` Eric W. Biederman
@ 2006-06-27 17:19                         ` Ben Greear
  2006-06-27 22:52                           ` Herbert Poetzl
  0 siblings, 1 reply; 113+ messages in thread
From: Ben Greear @ 2006-06-27 17:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Herbert Poetzl, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov

Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> 
>>On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
>>
>>>Inside the containers I want all network devices named eth0!
>>
>>huh? even if there are two of them? also tun?
>>
>>I think you meant, you want to be able to have eth0 in
>>_more_ than one guest where eth0 in a guest can also
>>be/use/relate to eth1 on the host, right?
> 
> 
> Right I want to have an eth0 in each guest where eth0 is
> it's own network device and need have no relationship to
> eth0 on the host.

How does that help anything?  Do you envision programs
that make special decisions based on whether the interface is
called eth0 vs. eth151?

Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 16:07                       ` Ben Greear
@ 2006-06-27 22:48                         ` Herbert Poetzl
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-27 22:48 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	devel, sam, viro, Alexey Kuznetsov

On Tue, Jun 27, 2006 at 09:07:38AM -0700, Ben Greear wrote:
> Ben Greear wrote:
> >Herbert Poetzl wrote:
> >
> >>On Mon, Jun 26, 2006 at 03:13:17PM -0700, Ben Greear wrote:
> >
> >>yes, that sounds good to me, any numbers how that
> >>affects networking in general (performance wise and
> >>memory wise, i.e. caches and hashes) ...
> >
> >I'll run some tests later today.  Based on my previous tests,
> >I don't remember any significant overhead.
> 
> Here's a quick benchmark using my redirect devices (RDD). Each RDD
> comes in a pair...when you tx on one, the pkt is rx'd on the peer.
> The idea is that it is exactly like two physical ethernet interfaces
> connected by a cross-over cable.
>
> My test system is a 64-bit dual-core Intel system, 3.013 Ghz processor
> with 1GB RAM. Fairly standard stuff..it's one of the Shuttle XPC
> systems. Kernel is 2.6.16.16 (64-bit).
> 
> 
> Test setup is:  rdd1 -- rdd2   [bridge]   rdd3 -- rdd4
> 
> I am using my proprietary module for the bridge logic...and the
> default bridge should be at least this fast. I am injecting 1514 byte
> packets on rdd1 and rdd4 with pktgen (bi-directional flow). My pktgen
> is also receiving the pkts and gathering stats.
>
> This setup sustains 1.7Gbps of generated and received traffic between
> rdd1 and rdd4.
>
> Running only the [bridge] between two 10/100/1000 ports on an Intel
> PCI-E NIC will sustain about 870Mbps (bi-directional) on this system,
> so the virtual devices are quite efficient, as suspected.
>
> I have not yet had time to benchmark the mac-vlans...hopefully later
> today.

hmm, maybe you could also benchmark loopback connections
(and their throughput) on your system?

my (not so fancy) PIII, 32-bit, 2.6.17.1 seems to do
roughly 2 Gbps on the loopback device (tested with dd
and netcat)

best,
Herbert

> Thanks,
> Ben
> 
> -- 
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 17:19                         ` Ben Greear
@ 2006-06-27 22:52                           ` Herbert Poetzl
  2006-06-27 23:12                             ` Dave Hansen
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-27 22:52 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, dev,
	devel, sam, viro, Alexey Kuznetsov

On Tue, Jun 27, 2006 at 10:19:23AM -0700, Ben Greear wrote:
> Eric W. Biederman wrote:
> >Herbert Poetzl <herbert@13thfloor.at> writes:
> >
> >
> >>On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
> >>
> >>>Inside the containers I want all network devices named eth0!
> >>
> >>huh? even if there are two of them? also tun?
> >>
> >>I think you meant, you want to be able to have eth0 in
> >>_more_ than one guest where eth0 in a guest can also
> >>be/use/relate to eth1 on the host, right?
> >
> >
> >Right I want to have an eth0 in each guest where eth0 is
> >it's own network device and need have no relationship to
> >eth0 on the host.
> 
> How does that help anything?  Do you envision programs
> that make special decisions on whether the interface is
> called eth0 v/s eth151?

well, those poor folks who do not have ethernet
devices for networking :)

seriously, what I think Eric meant was that it
might be nice (especially for migration purposes)
to keep the device namespace completely virtualized
and not just isolated ...

I'm fine with that, as long as it does not add
overhead or complicate handling, and as far as I
can tell, it should not do that ...

best,
Herbert

> Ben
> 
> 
> -- 
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 16:29                   ` Eric W. Biederman
@ 2006-06-27 23:07                     ` Herbert Poetzl
  2006-06-28  4:07                       ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-27 23:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Korotaev, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, devel, sam, viro,
	Alexey Kuznetsov

On Tue, Jun 27, 2006 at 10:29:39AM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> > On Tue, Jun 27, 2006 at 01:54:51PM +0400, Kirill Korotaev wrote:
> >> >>My point is that if you make namespace tagging at routing time, and
> >> >>your packets are being routed only once, you lose the ability
> >> >>to have separate routing tables in each namespace.
> >> >
> >> >
> >> >Right. What is the advantage of having separate the routing tables ?
> >
> >> it is impossible to have bridged networking, tun/tap and many other 
> >> features without it. I even doubt that it is possible to introduce 
> >> private netfilter rules w/o virtualization of routing.
> >
> > why? iptables work quite fine on a typical linux
> > system when you 'delegate' certain functionality
> > to certain chains (i.e. doesn't require access to
> > _all_ of them)
> >
> >> The question is do we want to have fully featured namespaces which
> >> allow to create isolated virtual environments with semantics and
> >> behaviour of standalone linux box or do we want to introduce some
> >> hacks with new rules/restrictions to meet ones goals only?
> >
> > well, soemtimes 'hacks' are not only simpler but also 
> > a much better solution for a given problem than the
> > straight forward approach ... 
> 
> Well I would like to see a hack that qualifies.  

> I watched the linux-vserver irc channel for a while and almost
> every network problem was caused by the change in semantics 
> vserver provides.

the problem here is not the change in semantics compared
to a real linux system (as there basically is none) but
compared to _other_ technologies like UML or QEMU, which
add the need for bridging and additional interfaces, while
Linux-VServer only focuses on the IP layer ...

> In this case when you allow a guest more than one IP your hack 
> while easy to maintain becomes much more complex. 

why? a set of IPs is quite similar to a single IP (which
is actually a subset), so no real change there, only
IP_ANY means something different for a guest ...

> Especially as you address each case people care about one at a time.

hmm?

> In one shot this goes the entire way. Given how many people miss that
> you do the work at layer 2 than at layer 3 I would not call this the
> straight forward approach. The straight forward implementation yes,
> but not the straight forward approach.

seems I lost you here ...

> > for example, you won't have multiple routing tables
> > in a kernel where this feature is disabled, no?
> > so why should it affect a guest, or require modified
> > apps inside a guest when we would decide to provide
> > only a single routing table?
> >
> >> From my POV, fully virtualized namespaces are the future. 
> >
> > the future is already there, it's called Xen or UML, or QEMU :)
> 
> Yep.  And now we need it to run fast.

hmm, maybe you should try to optimize linux for Xen then,
as I'm sure it will provide the optimal virtualization
and has all the features folks are looking for (regarding
virtualization)

I thought we were trying to figure out a light-weight subset
of isolation and virtualization technologies and methods
which make sense to have in mainline ...

> >> It is what makes virtualization solution usable (w/o apps
> >> modifications), provides all the features and doesn't require much
> >> efforts from people to be used.
> >
> > and what if they want to use virtualization inside
> > their guests? where do you draw the line?
> 
> The implementation doesn't have any problems with guests inside
> of guests.
> 
> The only reason to restrict guests inside of guests is because
> the we aren't certain which permissions make sense.

well, we have not even touched the permission issues yet

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 22:52                           ` Herbert Poetzl
@ 2006-06-27 23:12                             ` Dave Hansen
  2006-06-27 23:42                               ` Alexey Kuznetsov
  2006-06-28 14:51                               ` Eric W. Biederman
  0 siblings, 2 replies; 113+ messages in thread
From: Dave Hansen @ 2006-06-27 23:12 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Ben Greear, Eric W. Biederman, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov

On Wed, 2006-06-28 at 00:52 +0200, Herbert Poetzl wrote:
> seriously, what I think Eric meant was that it
> might be nice (especially for migration purposes)
> to keep the device namespace completely virtualized
> and not just isolated ...

It might be nice, but it is probably unneeded for an initial
implementation.  In practice, a cluster doing
checkpoint/restart/migration will already have a system in place for
assigning unique IPs or other identifiers to each container.  It could
just as easily make sure to assign unique network device names to
containers.

The issues really only come into play when you have an unstructured set
of machines and you want to migrate between them without having prepared
them with any kind of unique net device names beforehand.

It may look weird, but do applications really *need* to see eth0 rather
than eth858354?

-- Dave


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 23:12                             ` Dave Hansen
@ 2006-06-27 23:42                               ` Alexey Kuznetsov
  2006-06-28  3:38                                 ` Eric W. Biederman
  2006-06-28 14:51                               ` Eric W. Biederman
  1 sibling, 1 reply; 113+ messages in thread
From: Alexey Kuznetsov @ 2006-06-27 23:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Herbert Poetzl, Ben Greear, Eric W. Biederman, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, clg,
	Andrew Morton, dev, devel, sam, viro, Alexey Kuznetsov

Hello!

> It may look weird, but do application really *need* to see eth0 rather
> than eth858354?

Applications do not care, humans do. :-)

As for applications, they just need to see exactly the same device
after migration. Not only the name, but e.g. also its ifindex. If you do not
create a separate namespace for netdevices, you will inevitably end up
with some strange hack, sort of like VPIDs, to translate (or to partition)
ifindices, or you will have to say that "ping -I eth858354 xxx" is too
complicated an application to survive migration.
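
A small, self-contained user-space example of that kind of dependency
(illustrative only, not taken from any real application):

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "eth0";          /* name the application hard-codes */
	unsigned int idx = if_nametoindex(dev);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (!idx || fd < 0) {
		perror("setup");
		return 1;
	}
	printf("%s has ifindex %u\n", dev, idx);

	/* like "ping -I eth0": tie the socket to that device */
	if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, dev,
		       strlen(dev) + 1) < 0)
		perror("SO_BINDTODEVICE");

	close(fd);
	return 0;
}

If the device name or the ifindex the program cached is different after a
migration, both the lookup and the device binding break.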

Alexey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 23:42                               ` Alexey Kuznetsov
@ 2006-06-28  3:38                                 ` Eric W. Biederman
  2006-06-28 13:36                                   ` Herbert Poetzl
  0 siblings, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28  3:38 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Dave Hansen, Herbert Poetzl, Ben Greear, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, clg,
	Andrew Morton, dev, devel, sam, viro, Alexey Kuznetsov

Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> writes:

> Hello!
>
>> It may look weird, but do application really *need* to see eth0 rather
>> than eth858354?
>
> Applications do not care, humans do. :-)
>
> What's about applications they just need to see exactly the same device
> after migration. Not only name, but f.e. also its ifindex. If you do not
> create a separate namespace for netdevices, you will inevitably end up
> with some strange hack sort of VPIDs to translate (or to partition) ifindices
> or to tell that "ping -I eth858354 xxx" is too coimplicated application
> to survive migration.


Actually there are applications with peculiar licensing practices that
do look at devices like eth0 to verify you have the appropriate MAC, and
do really weird things if you don't have an eth0.

Plus there are other cases where it can be simpler to hard-code things
if it is allowable (the human factor).  Otherwise your configuration
must be done through hotplug scripts.

But yes there are misguided applications that care.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 23:07                     ` Herbert Poetzl
@ 2006-06-28  4:07                       ` Eric W. Biederman
  2006-06-28  6:31                         ` Sam Vilain
                                           ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28  4:07 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Kirill Korotaev, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, devel, sam, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Tue, Jun 27, 2006 at 10:29:39AM -0600, Eric W. Biederman wrote:
>> Herbert Poetzl <herbert@13thfloor.at> writes:
>
>> I watched the linux-vserver irc channel for a while and almost
>> every network problem was caused by the change in semantics 
>> vserver provides.
>
> the problem here is not the change in semantics compared
> to a real linux system (as there basically is none) but
> compared to _other_ technologies like UML or QEMU, which
> add the need for bridging and additional interfaces, while
> Linux-VServer only focuses on the IP layer ...

Not being able to bind to INADDR_ANY is a huge semantic change.
Unless things have changed recently you get that change when
you have two IP addresses in Linux-Vserver.

Talking to the outside world through the loopback interface
is a noticeable semantics change.

Having to be careful of who uses INADDR_ANY on the host
when you have guests is essentially a semantics change.

Being able to talk to the outside world with a server
bound only to the loopback IP is a weird semantic
change.

And I suspect I missed something; it is weird and peculiar, and
I don't care to remember all of the exceptions.

Having a few more network interfaces in a layer 2 solution
is fundamental.  Believing, without proof and despite arguments
to the contrary that you have not contradicted, that a layer 2
solution is inherently slower is non-productive.  Arguing
that a layer-2-only solution must prove itself on guest-to-guest
communication is also non-productive.

So, just to sink one additional nail in the coffin of the silly
guest-to-guest communication issue: for any two guests where
fast communication between them is really important, I can run
an additional interface pair that requires no routing or bridging.
Given that the implementation of the tunnel device is essentially
the same as the loopback interface, and that I make only one
trip through the network stack, there will be no performance overhead.
Similarly, for any critical guest communication to the outside world
I can give the guest a real network adapter.

That said, I don't think those things will be necessary, and if
they are, it is an optimization opportunity to make various bits
of the network stack faster.

Bridging or routing between guests is an exercise in simplicity
and control, not a requirement.

>> In this case when you allow a guest more than one IP your hack 
>> while easy to maintain becomes much more complex. 
>
> why? a set of IPs is quite similar to a single IP (which
> is actually a subset), so no real change there, only
> IP_ANY means something different for a guest ...

Which is something that simple filtering at bind time makes impossible.

With a guest that has 4 IPs
(10.0.0.1, 192.168.0.1, 172.16.0.1, 127.0.0.1),
how do you make INADDR_ANY work with just filtering at bind time?

The host has at least the additional IPs
10.0.0.2, 192.168.0.2, 172.16.0.2, and 127.0.0.1.
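
To illustrate the problem, a hypothetical user-space model of bind-time
filtering (made-up names, not the Linux-VServer code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

struct guest_ctx {
	struct in_addr *addrs;   /* addresses assigned to this guest, e.g.    */
	size_t          nr;      /* 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1 */
};

static bool guest_may_bind(const struct guest_ctx *g, struct in_addr a)
{
	size_t i;

	/*
	 * INADDR_ANY cannot just be passed through: on the shared stack it
	 * would also grab the host's and every other guest's addresses.
	 * A filtering implementation has to reject it or rewrite it, which
	 * is exactly the kind of semantic change described above.
	 */
	if (a.s_addr == htonl(INADDR_ANY))
		return false;            /* or: rewrite to one of g->addrs */

	for (i = 0; i < g->nr; i++)
		if (g->addrs[i].s_addr == a.s_addr)
			return true;     /* address belongs to this guest */

	return false;                    /* not one of the guest's addresses */
}

Either INADDR_ANY is rejected or silently rewritten; both are visible
semantic changes for the guest.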

Herbert, I suspect we are talking about completely different
implementations; otherwise I can't possibly see how we have
such different perceptions of their capabilities.

I am talking precisely about filtering, at connect or bind time,
the IP addresses that a guest can use.  Which, as I recall, is
what vserver implements.  If you are thinking of your ngnet
implementation, that would explain things.

>> Especially as you address each case people care about one at a time.
>
> hmm?

Multiple IPs, IPv6, additional protocols, firewalls. etc.

>> In one shot this goes the entire way. Given how many people miss that
>> you do the work at layer 2 than at layer 3 I would not call this the
>> straight forward approach. The straight forward implementation yes,
>> but not the straight forward approach.
>
> seems I lost you here ...


>> > for example, you won't have multiple routing tables
>> > in a kernel where this feature is disabled, no?
>> > so why should it affect a guest, or require modified
>> > apps inside a guest when we would decide to provide
>> > only a single routing table?
>> >
>> >> From my POV, fully virtualized namespaces are the future. 
>> >
>> > the future is already there, it's called Xen or UML, or QEMU :)
>> 
>> Yep.  And now we need it to run fast.
>
> hmm, maybe you should try to optimize linux for Xen then,
> as I'm sure it will provide the optimal virtualization
> and has all the features folks are looking for (regarding
> virtualization)
>
> I thought we are trying to figure a light-weight subset
> of isolation and virtualization technologies and methods
> which make sense to have in mainline ...

And you presume doing things at layer 2 is more expensive than
layer 3.

>From what I have seen of layer 3 solutions it is a 
bloody maintenance nightmare, and an inflexible mess.

>> >> It is what makes virtualization solution usable (w/o apps
>> >> modifications), provides all the features and doesn't require much
>> >> efforts from people to be used.
>> >
>> > and what if they want to use virtualization inside
>> > their guests? where do you draw the line?
>> 
>> The implementation doesn't have any problems with guests inside
>> of guests.
>> 
>> The only reason to restrict guests inside of guests is because
>> the we aren't certain which permissions make sense.
>
> well, we have not even touched the permission issues yet

Agreed, permissions have not discussed but the point is that is the only
reason to keep from nesting the networking stack the way I have described
it.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28  4:07                       ` Eric W. Biederman
@ 2006-06-28  6:31                         ` Sam Vilain
  2006-06-28 14:15                           ` Herbert Poetzl
  2006-06-28 10:14                         ` Cedric Le Goater
  2006-06-28 14:11                         ` Herbert Poetzl
  2 siblings, 1 reply; 113+ messages in thread
From: Sam Vilain @ 2006-06-28  6:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Herbert Poetzl, Kirill Korotaev, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, devel, viro, Alexey Kuznetsov

Eric W. Biederman wrote:
> Have a few more network interfaces for a layer 2 solution
> is fundamental.  Believing without proof and after arguments
> to the contrary that you have not contradicted that a layer 2
> solution is inherently slower is non-productive.  Arguing
> that a layer 2 only solution most prove itself on guest to guest
> communication is also non-productive.
>   

Yes, it does break what some people consider to be a sanity condition
when you don't have loopback anymore within a guest. I once experimented
with using 127.* addresses for per-guest loopback devices with vserver
to fix this, but that couldn't work without fixing glibc to not make
assumptions deep in the bowels of the resolver. I logged a fault with
gnu.org and you can guess where it went :-).

I don't think it's just the performance issue, though. Consider also
that if you only have one set of interfaces to manage, the overall
configuration of the network stack is simpler. `ip addr list' on the
host shows all the addresses on the system, you only have one routing
table to manage, one set of iptables, etc.

That being said, perhaps if each guest got its own interface, and from
some suitably privileged context you could see them all, perhaps it
would be nicer and maybe just as fast. Perhaps then *devices* could get
their own routing namespaces, and routing namespaces could get iptables
namespaces, or something like that, to give the most options.

> With a guest with 4 IPs 
> 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
> How do you make INADDR_ANY work with just filtering at bind time?
>   

It used to just bind to the first one. Don't know if it still does.

Sam.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28  4:07                       ` Eric W. Biederman
  2006-06-28  6:31                         ` Sam Vilain
@ 2006-06-28 10:14                         ` Cedric Le Goater
  2006-06-28 14:11                         ` Herbert Poetzl
  2 siblings, 0 replies; 113+ messages in thread
From: Cedric Le Goater @ 2006-06-28 10:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Herbert Poetzl, Kirill Korotaev, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, haveblue,
	Andrew Morton, devel, sam, viro, Alexey Kuznetsov

Hi !

Eric W. Biederman wrote:

[ ... ]

> So just to sink one additional nail in the coffin of the silly
> guest to guest communication issue.  For any two guests where
> fast communication between them is really important I can run
> an additional interface pair that requires no routing or bridging.
> Given that the implementation of the tunnel device is essentially
> the same as the loopback interface and that I make only one
> trip through the network stack there will be no performance overhead.
> Similarly for any critical guest communication to the outside world
> I can give the guest a real network adapter.
> 
> That said I don't think those things will be necessary and that if
> they are it is an optimization opportunity to make various bits
> of the network stack faster.

just one comment on the 'guest to guest communication' topic :

guest to guest communication is an important factor in consolidation
scenarios, where containers are packed on one server. This for maintenance
issues or priority issues on a HPC cluster for example. This case of
container migration is problably the most interesting and the performance
should be more than acceptable. May be not a top priority for the moment.


thanks,

C.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28  3:38                                 ` Eric W. Biederman
@ 2006-06-28 13:36                                   ` Herbert Poetzl
  2006-06-28 13:53                                     ` jamal
  2006-06-28 14:21                                     ` Eric W. Biederman
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-28 13:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Kuznetsov, Dave Hansen, Ben Greear, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, clg,
	Andrew Morton, dev, devel, sam, viro, Alexey Kuznetsov

On Tue, Jun 27, 2006 at 09:38:14PM -0600, Eric W. Biederman wrote:
> Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> writes:
> 
> > Hello!
> >
> >> It may look weird, but do application really *need* to see eth0 rather
> >> than eth858354?
> >
> > Applications do not care, humans do. :-)
> >
> > What's about applications they just need to see exactly the same
> > device after migration. Not only name, but f.e. also its ifindex.
> > If you do not create a separate namespace for netdevices, you will
> > inevitably end up with some strange hack sort of VPIDs to translate
> > (or to partition) ifindices or to tell that "ping -I eth858354 xxx"
> > is too coimplicated application to survive migration.
> 
> 
> Actually there are applications with peculiar licensing practices that
> do look at devices like eth0 to verify you have the appropriate mac, and
> do really weird things if you don't have an eth0.
> 
> Plus there are other cases where it can be simpler to hard code things
> if it is allowable. (The human factor)  Otherwise your configuration
> must be done through hotplug scripts.
> 
> But yes there are misguided applications that care.

last time I pointed to such 'misguided' apps which 
made assumptions that are not necessarily true
inside a virtual environment (e.g. pstree, initpid)
the general? position was that those apps should
be fixed instead adding a 'workaround'

note: personally I'm absolutely not against virtualizing
the device names so that each guest can have a separate
name space for devices, but there should be a way to
'see' _and_ 'identify' the interfaces from outside
(i.e. host or spectator context)

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 13:36                                   ` Herbert Poetzl
@ 2006-06-28 13:53                                     ` jamal
  2006-06-28 14:19                                       ` Andrey Savochkin
                                                         ` (2 more replies)
  2006-06-28 14:21                                     ` Eric W. Biederman
  1 sibling, 3 replies; 113+ messages in thread
From: jamal @ 2006-06-28 13:53 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Alexey Kuznetsov, viro, sam, devel, dev, Andrew Morton, clg,
	serue, netdev, linux-kernel, Andrey Savochkin, Daniel Lezcano,
	Ben Greear, Dave Hansen, Alexey Kuznetsov, Eric W. Biederman


On Wed, 2006-28-06 at 15:36 +0200, Herbert Poetzl wrote:

> note: personally I'm absolutely not against virtualizing
> the device names so that each guest can have a separate
> name space for devices, but there should be a way to
> 'see' _and_ 'identify' the interfaces from outside
> (i.e. host or spectator context)
> 

Makes sense for the host side to have naming convention tied
to the guest. Example as a prefix: guest0-eth0. Would it not
be interesting to have the host also manage these interfaces
via standard tools like ip or ifconfig etc? i.e if i admin up
guest0-eth0, then the user in guest0 will see its eth0 going
up.

Anyways, interesting discussion.

cheers,
jamal


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28  4:07                       ` Eric W. Biederman
  2006-06-28  6:31                         ` Sam Vilain
  2006-06-28 10:14                         ` Cedric Le Goater
@ 2006-06-28 14:11                         ` Herbert Poetzl
  2006-06-28 16:10                           ` Eric W. Biederman
  2 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-28 14:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Korotaev, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, devel, sam, viro,
	Alexey Kuznetsov

On Tue, Jun 27, 2006 at 10:07:29PM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> > On Tue, Jun 27, 2006 at 10:29:39AM -0600, Eric W. Biederman wrote:
> >> Herbert Poetzl <herbert@13thfloor.at> writes:
> >
> >> I watched the linux-vserver irc channel for a while and almost
> >> every network problem was caused by the change in semantics 
> >> vserver provides.
> >
> > the problem here is not the change in semantics compared
> > to a real linux system (as there basically is none) but
> > compared to _other_ technologies like UML or QEMU, which
> > add the need for bridging and additional interfaces, while
> > Linux-VServer only focuses on the IP layer ...
> 
> Not being able to bind to INADDR_ANY is a huge semantic change.
> Unless things have changed recently you get that change when
> you have two IP addresses in Linux-Vserver.

not at all, you probably looked at a different
code, binding to INADDR_ANY actually _is_ the
default inside a guest, the only difference here
is that INADDR_ANY maps to a subset of _all_
available IPs ...

> Talking to the outsider world through the loop back interface
> is a noticeable semantics change.

this does not happen either, as I said several
times, networking happens as on a _normal_ 
linux system, local traffic uses loop back, while
outbound traffic uses the appropriate network
interfaces

> Having to be careful of who uses INADDR_ANY on the host
> when you have guests is essentially a semantics change.

this 'semantic' change is intentional, and it
would be quite easy to change that (by putting 
the host in a network context too) but as the
mechanism is isolation the host, similar to the
chroot() semantic for filesystems, sees _all_
the interfaces and IPs and therefore can also
bind to all of them ...

> Being able to talk to the outside world with a server
> bound only to the loopback IP is a weird semantic
> change.

that does not happen either ...

IMHO you should have a closer look (or ask more 
questions) before making false assumptions

> And I suspect I missed something, it is weird peculiar and
> I don't care to remember all of the exceptions.

there are no real exceptions, we have a legacy
mapping which basically 'remaps' localhost to
the first assigned IP (to make guest local traffic
secure without messing with the network stack)
but this can be avoided completely

> Have a few more network interfaces for a layer 2 solution
> is fundamental.  Believing without proof and after arguments
> to the contrary that you have not contradicted that a layer 2
> solution is inherently slower is non-productive.  

assuming that it will not be slower, although it
will now pass two network stacks and the bridging
code is non-productive too, let's see how it goes
but do not ignore the overhead just because it
might simplify the implementation ...

> Arguing that a layer 2 only solution most prove itself on 
> guest to guest communication is also non-productive.
> 
> So just to sink one additional nail in the coffin of the silly
> guest to guest communication issue.  For any two guests where
> fast communication between them is really important I can run
> an additional interface pair that requires no routing or bridging.
> Given that the implementation of the tunnel device is essentially
> the same as the loopback interface and that I make only one
> trip through the network stack there will be no performance overhead.

that is a good argument and I think I'm perfectly
fine with this, given that the implementation 
allows that (i.e. the network stack can handle
two interfaces with the same IP assigned and will
choose the local interface over the remote one
when the traffic will be between guests)

> Similarly for any critical guest communication to the outside world
> I can give the guest a real network adapter.

with a single MAC assigned, that is, I presume?

> That said I don't think those things will be necessary and that if
> they are it is an optimization opportunity to make various bits 
> of the network stack faster.
> 
> Bridging or routing between guests is an exercise in simplicity
> and control not a requirement.
> 
> >> In this case when you allow a guest more than one IP your hack 
> >> while easy to maintain becomes much more complex. 
> >
> > why? a set of IPs is quite similar to a single IP (which
> > is actually a subset), so no real change there, only
> > IP_ANY means something different for a guest ...
> 
> Which simply filtering at bind time makes impossible.
> 
> With a guest with 4 IPs 
> 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
> How do you make INADDR_ANY work with just filtering at bind time?
> 
> The host has at least the additional IPs.
> 10.0.0.2 192.168.0.2 172.16.0.2 127.0.0.1
> 
> Herbert I suspect we are talking about completely different
> implementations otherwise I can't possibly see how we have
> such different perceptions of their capabilities.

guess that's what this discussion is about,
finding out the various aspects how isolation
and/or vitrtualization can be accomplished and
what features we consider common/useful enough
for mainline ... for me that is still in the
brainstorming phase, although several 'working
prototypes' already exist. IMHO the next step
is to collect a set of representative use cases
and test them with each implementation, regarding
performance, usability and practicability

> I am talking precisely about filter IP addresses at connect
> or bind time that a guest can use.  Which as I recall is
> what vserver implements.  If you are thinking of your ngnet
> implementation that would explain things.

I'm thinking of all the various implementations
and 'prototypes' we did and tested, I agree
this might be confusing ...

> >> Especially as you address each case people care about one at a time.
> >
> > hmm?
> 
> Multiple IPs, IPv6, additional protocols, firewalls. etc.
> 
> >> In one shot this goes the entire way. Given how many people miss that
> >> you do the work at layer 2 than at layer 3 I would not call this the
> >> straight forward approach. The straight forward implementation yes,
> >> but not the straight forward approach.
> >
> > seems I lost you here ...
> 
> 
> >> > for example, you won't have multiple routing tables
> >> > in a kernel where this feature is disabled, no?
> >> > so why should it affect a guest, or require modified
> >> > apps inside a guest when we would decide to provide
> >> > only a single routing table?
> >> >
> >> >> From my POV, fully virtualized namespaces are the future. 
> >> >
> >> > the future is already there, it's called Xen or UML, or QEMU :)
> >> 
> >> Yep.  And now we need it to run fast.
> >
> > hmm, maybe you should try to optimize linux for Xen then,
> > as I'm sure it will provide the optimal virtualization
> > and has all the features folks are looking for (regarding
> > virtualization)
> >
> > I thought we are trying to figure a light-weight subset
> > of isolation and virtualization technologies and methods
> > which make sense to have in mainline ...
> 
> And you presume doing things at layer 2 is more expensive than
> layer 3.

not necessarily, but I _know_ that the overhead
added at layer 3 is unmeasureable, and it still
needs to be proven that this is true for a layer
2 solution (which I'd actually prefer, because
it solves the protocol _and_ setup issues)

> >From what I have seen of layer 3 solutions it is a 
> bloody maintenance nightmare, and an inflexible mess.

that is your opinion, I really doubt that you
will have less maintenance when you apply policy
to the guests ...

example here (just to clarify):

 - let's assume we have eth0 on the host and in
   guest A and B, with the following setup:

   eth0(H) 192.168.0.1/24
   eth0(A) 10.0.1.1/16 10.0.1.2/16
   eth0(B) 10.0.2.1/16

 - now what keeps guest B from jsut assigning
   10.0.2.2/16 to eth0? you need some kind of
   mechanism to prevent that, and/or to block
   the packets using inappropriate IPs

 * in the first case, i.e. you prevent assigning
   certain IPs inside a guest, you get a semantic
   change in the behaviour compared to a normal
   system, but there is no additional overhead
   on the communication

 * in the second case, you have to maintain the
   policy mechanism and keep it in sync with the
   guest configuration (somehow), and of course
   you have to verify every communication

 - OTOH, if you do not care about collisions
   basically assuming the point "that's like
   a hub on a network, if there are two guests
   with the same ip, it will be trouble, but
   that's okay" then this becomes a real issue
   for providers with potentially 'evil' customers

best,
Herbert

> >> >> It is what makes virtualization solution usable (w/o apps
> >> >> modifications), provides all the features and doesn't require much
> >> >> efforts from people to be used.
> >> >
> >> > and what if they want to use virtualization inside
> >> > their guests? where do you draw the line?
> >> 
> >> The implementation doesn't have any problems with guests inside
> >> of guests.
> >> 
> >> The only reason to restrict guests inside of guests is because
> >> the we aren't certain which permissions make sense.
> >
> > well, we have not even touched the permission issues yet
> 
> Agreed, permissions have not discussed but the point is that is the only
> reason to keep from nesting the networking stack the way I have described
> it.
> 
> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28  6:31                         ` Sam Vilain
@ 2006-06-28 14:15                           ` Herbert Poetzl
  2006-06-28 15:36                             ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-28 14:15 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Eric W. Biederman, Kirill Korotaev, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, devel, viro, Alexey Kuznetsov

On Wed, Jun 28, 2006 at 06:31:05PM +1200, Sam Vilain wrote:
> Eric W. Biederman wrote:
> > Have a few more network interfaces for a layer 2 solution
> > is fundamental.  Believing without proof and after arguments
> > to the contrary that you have not contradicted that a layer 2
> > solution is inherently slower is non-productive.  Arguing
> > that a layer 2 only solution most prove itself on guest to guest
> > communication is also non-productive.
> >   
> 
> Yes, it does break what some people consider to be a sanity condition
> when you don't have loopback anymore within a guest. I once experimented
> with using 127.* addresses for per-guest loopback devices with vserver
> to fix this, but that couldn't work without fixing glibc to not make
> assumptions deep in the bowels of the resolver. I logged a fault with
> gnu.org and you can guess where it went :-).

this is what the lo* patches address, by providing
the required loopback isolation and providing lo
inside a guest (i.e. it looks and feels like a
normal system, except that you cannot modify the
interfaces from inside)

> I don't think it's just the performance issue, though. Consider also
> that if you only have one set of interfaces to manage, the overall
> configuration of the network stack is simpler. `ip addr list' on the
> host shows all the addresses on the system, you only have one routing
> table to manage, one set of iptables, etc.
> 
> That being said, perhaps if each guest got its own interface, and from
> some suitably privileged context you could see them all, perhaps it
> would be nicer and maybe just as fast. Perhaps then *devices* could get
> their own routing namespaces, and routing namespaces could get iptables
> namespaces, or something like that, to give the most options.
> 
> > With a guest with 4 IPs 
> > 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
> > How do you make INADDR_ANY work with just filtering at bind time?
> >   
> 
> It used to just bind to the first one. Don't know if it still does.

no, it _alway_ binds to INADDR_ANY and checks
against other sockets (in the same context)
comparing the lists of assigned IPs (the subset)

so all checks happen at bind/connect time and
always against the set of IPs, only exception is
a performance optimization we do for single IP
guests (where INADDR_ANY gets rewritten to the
single IP)

best,
Herbert

> Sam.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 13:53                                     ` jamal
@ 2006-06-28 14:19                                       ` Andrey Savochkin
  2006-06-28 16:17                                         ` jamal
  2006-06-28 17:04                                         ` Herbert Poetzl
  2006-06-28 14:39                                       ` Eric W. Biederman
  2006-06-29 21:07                                       ` Sam Vilain
  2 siblings, 2 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-28 14:19 UTC (permalink / raw)
  To: hadi
  Cc: Herbert Poetzl, Alexey Kuznetsov, viro, sam, devel, dev,
	Andrew Morton, clg, serue, netdev, linux-kernel, Daniel Lezcano,
	Ben Greear, Dave Hansen, Eric W. Biederman

Hi Jamal,

On Wed, Jun 28, 2006 at 09:53:23AM -0400, jamal wrote:
> 
> On Wed, 2006-28-06 at 15:36 +0200, Herbert Poetzl wrote:
> 
> > note: personally I'm absolutely not against virtualizing
> > the device names so that each guest can have a separate
> > name space for devices, but there should be a way to
> > 'see' _and_ 'identify' the interfaces from outside
> > (i.e. host or spectator context)
> > 
> 
> Makes sense for the host side to have naming convention tied
> to the guest. Example as a prefix: guest0-eth0. Would it not
> be interesting to have the host also manage these interfaces
> via standard tools like ip or ifconfig etc? i.e if i admin up
> guest0-eth0, then the user in guest0 will see its eth0 going
> up.

Seeing guestXX-eth0 interfaces by standard tools has certain attractive
sides.  But it creates a lot of undesired side effects.

For example, ntpd queries all network devices by the same ioctls as ifconfig,
and creates separate sockets bound to IP addresses of each device, which is
certainly not desired with namespaces.

Or more subtle question: do you want hotplug events to be generated when
guest0-eth0 interface comes up in the root namespace, and standard scripts
to try to set some IP address on this interface?..

In my opinion, the downside of this scheme overweights possible advantages,
and I'm personally quite happy with running commands with switched namespace,
like
vzctl exec guest0 ip addr list
vzctl exec guest0 ip link set eth0 up
and so on.

Best regards

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 13:36                                   ` Herbert Poetzl
  2006-06-28 13:53                                     ` jamal
@ 2006-06-28 14:21                                     ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28 14:21 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Alexey Kuznetsov, Dave Hansen, Ben Greear, Daniel Lezcano,
	Andrey Savochkin, linux-kernel, netdev, serue, clg,
	Andrew Morton, dev, devel, sam, viro, Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> last time I pointed to such 'misguided' apps which 
> made assumptions that are not necessarily true
> inside a virtual environment (e.g. pstree, initpid)
> the general? position was that those apps should
> be fixed instead adding a 'workaround'

I agree that if it was solely misguided apps.  There would be
no justification.

One of the standard applications interfaces we support is renaming
a network interface.  So supporting those misguided apps is a actually
a side effect of supporting one the standard operations on a network interface.

Another way of looking at it is that the names of networking devices like the
names of devices in /dev are a user space policy (today).  In the
configuration of the networking stack historically we had those identifiers
hard coded.  It isn't until just recently that user space has been able to
cope with dynamically added/removed network devices.

As for initpid and friends.  In the context where you are simply isolating
pids and not doing a full pid namespaces it was felt that changing the few
user space applications that care was easier and probably worth doing anyway.

> note: personally I'm absolutely not against virtualizing
> the device names so that each guest can have a separate
> name space for devices, but there should be a way to
> 'see' _and_ 'identify' the interfaces from outside
> (i.e. host or spectator context)

Yep.  Basically that interface comes when we fix the sysfs support,
to support per namespace reporting.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 13:53                                     ` jamal
  2006-06-28 14:19                                       ` Andrey Savochkin
@ 2006-06-28 14:39                                       ` Eric W. Biederman
  2006-06-30  1:41                                         ` Sam Vilain
  2006-06-29 21:07                                       ` Sam Vilain
  2 siblings, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28 14:39 UTC (permalink / raw)
  To: hadi
  Cc: Herbert Poetzl, Alexey Kuznetsov, viro, sam, devel, dev,
	Andrew Morton, clg, serue, netdev, linux-kernel,
	Andrey Savochkin, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov

jamal <hadi@cyberus.ca> writes:

> On Wed, 2006-28-06 at 15:36 +0200, Herbert Poetzl wrote:
>
>> note: personally I'm absolutely not against virtualizing
>> the device names so that each guest can have a separate
>> name space for devices, but there should be a way to
>> 'see' _and_ 'identify' the interfaces from outside
>> (i.e. host or spectator context)
>> 
>
> Makes sense for the host side to have naming convention tied
> to the guest. Example as a prefix: guest0-eth0. Would it not
> be interesting to have the host also manage these interfaces
> via standard tools like ip or ifconfig etc? i.e if i admin up
> guest0-eth0, then the user in guest0 will see its eth0 going
> up.
>
> Anyways, interesting discussion.

Please no.

We really want the fundamental rule that a network device
is tied to a single namespace, and that a socket is tied
to a single namespace.  If those two conditions are met
we don't have to tag packets with a namespace identifier.

We only have to modify hash table lookups in the networking
code to look at a namespace tag in addition to the rest because
that is less expensive than allocating new hash tables.

Currently with a network device only being usable in one
network namespace we have the situation where we can
fairly safely give a guest CAP_NET_ADMIN without problems.

In addition currently nothing in the implementation knows about
the hierarchical structure of how the network namespace will be
used.  To allow ifconfig guest0-eth0 to work would require
understanding the hierarchical structure and places serious questions
on how safe we can make CAP_NET_ADMIN.

Now I am open to radically different designs if they allow the
implementation cost to be lower and they have clean semantics,
and don't wind up being an ugly unmaintainable wart on the linux
networking stack.  The only route I could imagine such a thing coming
from is something like tagging flows, in some netfiler like way.
Which might allow ifconfig guest-eth0 from the host without problems.
But I have not seen such a design.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-27 23:12                             ` Dave Hansen
  2006-06-27 23:42                               ` Alexey Kuznetsov
@ 2006-06-28 14:51                               ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28 14:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Herbert Poetzl, Ben Greear, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov

Dave Hansen <haveblue@us.ibm.com> writes:

> On Wed, 2006-06-28 at 00:52 +0200, Herbert Poetzl wrote:
>> seriously, what I think Eric meant was that it
>> might be nice (especially for migration purposes)
>> to keep the device namespace completely virtualized
>> and not just isolated ...
>
> It might be nice, but it is probably unneeded for an initial
> implementation.  In practice, a cluster doing
> checkpoint/restart/migration will already have a system in place for
> assigning unique IPs or other identifiers to each container.  It could
> just as easily make sure to assign unique network device names to
> containers.
>
> The issues really only come into play when you have an unstructured set
> of machines and you want to migrate between them without having prepared
> them with any kind of unique net device names beforehand.
>
> It may look weird, but do application really *need* to see eth0 rather
> than eth858354?

Actually there is a very practical reason we don't need to preserve device
names across a migration event between machines, is the only sane thing
to do is to generate a hotplug event that says you have removed the old
interface and added a new interface.

My expectation is that during migration you will wind up with add and
remove events for all of your hardware devices.  But most applications
because they do not access hardware devices directly will not care.

I haven't looked closely but I suspect this is one area where a container
style approach will be noticeably different from a Xen or Vmware style
approach.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 14:15                           ` Herbert Poetzl
@ 2006-06-28 15:36                             ` Eric W. Biederman
  2006-06-28 17:18                               ` Herbert Poetzl
  0 siblings, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28 15:36 UTC (permalink / raw)
  To: Sam Vilain, Herbert Poetzl
  Cc: Kirill Korotaev, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, devel, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Wed, Jun 28, 2006 at 06:31:05PM +1200, Sam Vilain wrote:
>> Eric W. Biederman wrote:
>> > Have a few more network interfaces for a layer 2 solution
>> > is fundamental.  Believing without proof and after arguments
>> > to the contrary that you have not contradicted that a layer 2
>> > solution is inherently slower is non-productive.  Arguing
>> > that a layer 2 only solution most prove itself on guest to guest
>> > communication is also non-productive.
>> >   
>> 
>> Yes, it does break what some people consider to be a sanity condition
>> when you don't have loopback anymore within a guest. I once experimented
>> with using 127.* addresses for per-guest loopback devices with vserver
>> to fix this, but that couldn't work without fixing glibc to not make
>> assumptions deep in the bowels of the resolver. I logged a fault with
>> gnu.org and you can guess where it went :-).
>
> this is what the lo* patches address, by providing
> the required loopback isolation and providing lo
> inside a guest (i.e. it looks and feels like a
> normal system, except that you cannot modify the
> interfaces from inside)

Ok.  This is new.  How do you talk between guests now?
Before those patches it was through IP addresses on the loopback interface
as I recall.

>> > With a guest with 4 IPs 
>> > 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
>> > How do you make INADDR_ANY work with just filtering at bind time?
>> >   
>> 
>> It used to just bind to the first one. Don't know if it still does.
>
> no, it _alway_ binds to INADDR_ANY and checks
> against other sockets (in the same context)
> comparing the lists of assigned IPs (the subset)
>
> so all checks happen at bind/connect time and
> always against the set of IPs, only exception is
> a performance optimization we do for single IP
> guests (where INADDR_ANY gets rewritten to the
> single IP)

What is the mechanism there?

My rough extrapolation says this mechanism causes problems when
migrating between machines.  In particular it sounds like
only one process can bind to *:80, even if it is only allowed
to accept connections from a subset of those IPs.

So if on another machine I bound something to *:80 and only allowed to
use a different set of IPs and then attempted to migrate it, the
migration would fail because I could not restart the application,
with all of it's layer 3 resources.

To be clear I assume when I migrate I always take my IP address or
addresses with me.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 14:11                         ` Herbert Poetzl
@ 2006-06-28 16:10                           ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28 16:10 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Kirill Korotaev, Daniel Lezcano, Andrey Savochkin, linux-kernel,
	netdev, serue, haveblue, clg, Andrew Morton, devel, sam, viro,
	Alexey Kuznetsov

Herbert Poetzl <herbert@13thfloor.at> writes:

>> Have a few more network interfaces for a layer 2 solution
>> is fundamental.  Believing without proof and after arguments
>> to the contrary that you have not contradicted that a layer 2
>> solution is inherently slower is non-productive.  
>
> assuming that it will not be slower, although it
> will now pass two network stacks and the bridging
> code is non-productive too, let's see how it goes
> but do not ignore the overhead just because it
> might simplify the implementation ...

Sure.  Mostly I have set it aside because the overhead
is not horrible and it is a very specific case that
can be heavily optimized if the core infrastructure is
solid.

>> Arguing that a layer 2 only solution most prove itself on 
>> guest to guest communication is also non-productive.
>> 
>> So just to sink one additional nail in the coffin of the silly
>> guest to guest communication issue.  For any two guests where
>> fast communication between them is really important I can run
>> an additional interface pair that requires no routing or bridging.
>> Given that the implementation of the tunnel device is essentially
>> the same as the loopback interface and that I make only one
>> trip through the network stack there will be no performance overhead.
>
> that is a good argument and I think I'm perfectly
> fine with this, given that the implementation 
> allows that (i.e. the network stack can handle
> two interfaces with the same IP assigned and will
> choose the local interface over the remote one
> when the traffic will be between guests)

Yep.  That exists today.  The network stack prefers routes
as specific as possible.

>> Similarly for any critical guest communication to the outside world
>> I can give the guest a real network adapter.
>
> with a single MAC assigned, that is, I presume?

Yes.

>
> guess that's what this discussion is about,
> finding out the various aspects how isolation
> and/or vitrtualization can be accomplished and
> what features we consider common/useful enough
> for mainline ... for me that is still in the
> brainstorming phase, although several 'working
> prototypes' already exist. IMHO the next step
> is to collect a set of representative use cases
> and test them with each implementation, regarding
> performance, usability and practicability

I am fairly strongly convinced a layer 2 solution will
do fine.  So for me it is a matter of proving that
and ensuring a good implementation.

> not necessarily, but I _know_ that the overhead
> added at layer 3 is unmeasureable, and it still
> needs to be proven that this is true for a layer
> 2 solution (which I'd actually prefer, because
> it solves the protocol _and_ setup issues)

That is a good perspective.  Layer 3 is free, is layer 2 also free?
Unless the cache miss penalty is a killer layer 2 should come very
close.  Of course VJ recently gave some evidence that packet processing
is dominated by cache misses.

>> >From what I have seen of layer 3 solutions it is a 
>> bloody maintenance nightmare, and an inflexible mess.
>
> that is your opinion, I really doubt that you
> will have less maintenance when you apply policy
> to the guests ...

Yes and mostly of the layer 3 things that I implemented.
At a moderately fundamental level I see layer 3 implementations
being a special case that is a tangent from the rest of the
networking code.  So I don't see a real synthesis with what
the rest of the networking stack is doing.  Plus all of the
limitations that come with a layer 3 implementation.

> example here (just to clarify):
>
>  - let's assume we have eth0 on the host and in
>    guest A and B, with the following setup:
>
>    eth0(H) 192.168.0.1/24
>    eth0(A) 10.0.1.1/16 10.0.1.2/16
>    eth0(B) 10.0.2.1/16
>
>  - now what keeps guest B from jsut assigning
>    10.0.2.2/16 to eth0? you need some kind of
>    mechanism to prevent that, and/or to block
>    the packets using inappropriate IPs
>
>  * in the first case, i.e. you prevent assigning
>    certain IPs inside a guest, you get a semantic
>    change in the behaviour compared to a normal
>    system, but there is no additional overhead
>    on the communication
>
>  * in the second case, you have to maintain the
>    policy mechanism and keep it in sync with the
>    guest configuration (somehow), and of course
>    you have to verify every communication
>
>  - OTOH, if you do not care about collisions
>    basically assuming the point "that's like
>    a hub on a network, if there are two guests
>    with the same ip, it will be trouble, but
>    that's okay" then this becomes a real issue
>    for providers with potentially 'evil' customers

So linux when serving as a router has strong filter capabilities.

So we can either use the strong network filtering linux already has
making work for the host administrator who has poorly behaved customers.
Or we can simply not give those poorly behaved guests CAP_NET_ADMIN,
and assign the IP address at guest startup before dropping the
capability.  At which point the guest cannot misbehave.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 14:19                                       ` Andrey Savochkin
@ 2006-06-28 16:17                                         ` jamal
  2006-06-28 16:58                                           ` Andrey Savochkin
  2006-06-28 17:17                                           ` Eric W. Biederman
  2006-06-28 17:04                                         ` Herbert Poetzl
  1 sibling, 2 replies; 113+ messages in thread
From: jamal @ 2006-06-28 16:17 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: Eric W. Biederman, Dave Hansen, Ben Greear, Daniel Lezcano,
	linux-kernel, netdev, serue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov, Herbert Poetzl

Andrey,

On Wed, 2006-28-06 at 18:19 +0400, Andrey Savochkin wrote:
> Hi Jamal,
> 
> On Wed, Jun 28, 2006 at 09:53:23AM -0400, jamal wrote:
> > 

> 
> Seeing guestXX-eth0 interfaces by standard tools has certain attractive
> sides.  But it creates a lot of undesired side effects.
> 

I apologize because i butted into the discussion without perhaps reading
the full thread. 

> For example, ntpd queries all network devices by the same ioctls as ifconfig,
> and creates separate sockets bound to IP addresses of each device, which is
> certainly not desired with namespaces.
> 

Ok, so the problem is that ntp in this case runs on the host side as
opposed to the guest? This would explain why Eric is reacting vehemently
to the suggestion.

> Or more subtle question: do you want hotplug events to be generated when
> guest0-eth0 interface comes up in the root namespace, and standard scripts
> to try to set some IP address on this interface?..
> 

yes, thats what i was thinking. Even go further and actually create
guestxx-eth0 on the host (which results in creating eth0 on the guest)
and other things.

> In my opinion, the downside of this scheme overweights possible advantages,
> and I'm personally quite happy with running commands with switched namespace,
> like
> vzctl exec guest0 ip addr list
> vzctl exec guest0 ip link set eth0 up
> and so on.

Ok, above may be good enough and doesnt require any state it seems on
the host side. 
I got motivated when the word "migration" was mentioned. I understood it
to be meaning that a guest may become inoperative for some reason and
that its info will be transfered to another guest which may be local or
even remote. In such a case, clearly one would need a protocol and the
state of all guests sitting at the host. Maybe i am over-reaching. 

cheers,
jamal


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 16:17                                         ` jamal
@ 2006-06-28 16:58                                           ` Andrey Savochkin
  2006-06-28 17:17                                           ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-28 16:58 UTC (permalink / raw)
  To: hadi
  Cc: Eric W. Biederman, Dave Hansen, Ben Greear, Daniel Lezcano,
	linux-kernel, netdev, serue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov, Herbert Poetzl

On Wed, Jun 28, 2006 at 12:17:35PM -0400, jamal wrote:
> 
> On Wed, 2006-28-06 at 18:19 +0400, Andrey Savochkin wrote:
> > 
> > Seeing guestXX-eth0 interfaces by standard tools has certain attractive
> > sides.  But it creates a lot of undesired side effects.
> > 
> 
> I apologize because i butted into the discussion without perhaps reading
> the full thread. 

Your comments are quite welcome

> 
> > For example, ntpd queries all network devices by the same ioctls as ifconfig,
> > and creates separate sockets bound to IP addresses of each device, which is
> > certainly not desired with namespaces.
> > 
> 
> Ok, so the problem is that ntp in this case runs on the host side as

yes

> opposed to the guest? This would explain why Eric is reacting vehemently
> to the suggestion.

:)

And I actually do not want to distinguish host and guest sides much.
They are namespaces in the first place.
Parent namespace may have some capabilities to manipulate its child
namespaces, like donate its own device to one of its children.

But it comes secondary to having namespace isolation borders.
In particular, because most cases of cross-namespace interaction lead to
failures of formal security models and inability to migrate
namespaces between computers.

> 
> > Or more subtle question: do you want hotplug events to be generated when
> > guest0-eth0 interface comes up in the root namespace, and standard scripts
> > to try to set some IP address on this interface?..
> > 
> 
> yes, thats what i was thinking. Even go further and actually create
> guestxx-eth0 on the host (which results in creating eth0 on the guest)
> and other things.

This actually goes in the opposite direction to what I keep in mind.
I want to offload as much as possible of network administration work to
guests.  Delegation of management is one of the motivating factors
behind covering not only sockets but devices, routes, and so on by the
namespace patches.

> 
> > In my opinion, the downside of this scheme overweights possible advantages,
> > and I'm personally quite happy with running commands with switched namespace,
> > like
> > vzctl exec guest0 ip addr list
> > vzctl exec guest0 ip link set eth0 up
> > and so on.
> 
> Ok, above may be good enough and doesnt require any state it seems on
> the host side. 
> I got motivated when the word "migration" was mentioned. I understood it
> to be meaning that a guest may become inoperative for some reason and
> that its info will be transfered to another guest which may be local or
> even remote. In such a case, clearly one would need a protocol and the
> state of all guests sitting at the host. Maybe i am over-reaching. 

Migration will work inside the kernel, so it has full access
to whatever state information it needs.

Best regards

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 14:19                                       ` Andrey Savochkin
  2006-06-28 16:17                                         ` jamal
@ 2006-06-28 17:04                                         ` Herbert Poetzl
  1 sibling, 0 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-28 17:04 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: hadi, Alexey Kuznetsov, viro, sam, devel, dev, Andrew Morton,
	clg, serue, netdev, linux-kernel, Daniel Lezcano, Ben Greear,
	Dave Hansen, Eric W. Biederman

On Wed, Jun 28, 2006 at 06:19:00PM +0400, Andrey Savochkin wrote:
> Hi Jamal,
> 
> On Wed, Jun 28, 2006 at 09:53:23AM -0400, jamal wrote:
> > 
> > On Wed, 2006-28-06 at 15:36 +0200, Herbert Poetzl wrote:
> > 
> > > note: personally I'm absolutely not against virtualizing
> > > the device names so that each guest can have a separate
> > > name space for devices, but there should be a way to
> > > 'see' _and_ 'identify' the interfaces from outside
> > > (i.e. host or spectator context)
> > 
> > Makes sense for the host side to have naming convention tied
> > to the guest. Example as a prefix: guest0-eth0. Would it not
> > be interesting to have the host also manage these interfaces
> > via standard tools like ip or ifconfig etc? i.e if i admin up
> > guest0-eth0, then the user in guest0 will see its eth0 going
> > up.
> 
> Seeing guestXX-eth0 interfaces by standard tools has certain 
> attractive sides.  But it creates a lot of undesired side effects.

which all can be avoided by not using the host
context for that, but a special 'all seeing' 
context (as we have in Linux-VServer) which
can see (and probably manipulate) those interfaces
from the 'admin' PoV without entering the guest
context

> For example, ntpd queries all network devices by the same ioctls as
> ifconfig, and creates separate sockets bound to IP addresses of each
> device, which is certainly not desired with namespaces.

applications scanning the interfaces at startup
are broken by design and should probably be
fixed instead of worked around ...

> Or more subtle question: do you want hotplug events to be generated
> when guest0-eth0 interface comes up in the root namespace, and
> standard scripts to try to set some IP address on this interface?..

why not, would it do any harm when the hotplug
scripts on the host would take the appropriate
actions (i.e. do the required config for the guest)
for special guest interfaces?

but now that you mention it, what about hotplug
events inside the guest?

> In my opinion, the downside of this scheme overweights possible
> advantages, and I'm personally quite happy with running commands with
> switched namespace, like
> vzctl exec guest0 ip addr list
> vzctl exec guest0 ip link set eth0 up

I do not consider this the best solution, especially
from the security PoV. don't forget you basically
enter the guest and execute arbitrary programs
(which might have been compromised) to do a setup
task you actually want to happen on the host

best,
Herbert

> and so on.
> 
> Best regards
> 
> Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 16:17                                         ` jamal
  2006-06-28 16:58                                           ` Andrey Savochkin
@ 2006-06-28 17:17                                           ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-28 17:17 UTC (permalink / raw)
  To: hadi
  Cc: Andrey Savochkin, Dave Hansen, Ben Greear, Daniel Lezcano,
	linux-kernel, netdev, serue, clg, Andrew Morton, dev, devel, sam,
	viro, Alexey Kuznetsov, Herbert Poetzl

jamal <hadi@cyberus.ca> writes:

> Andrey,
>
> On Wed, 2006-28-06 at 18:19 +0400, Andrey Savochkin wrote:
>> Hi Jamal,
>> 
>> On Wed, Jun 28, 2006 at 09:53:23AM -0400, jamal wrote:
>> > 
>
>> 
>> Seeing guestXX-eth0 interfaces by standard tools has certain attractive
>> sides.  But it creates a lot of undesired side effects.
>> 
>
> I apologize because i butted into the discussion without perhaps reading
> the full thread. 

This thread is serving as an educational vehicle, and the more people
from outside of our little biased group that begin to understand what
we are about the better.

>> For example, ntpd queries all network devices by the same ioctls as ifconfig,
>> and creates separate sockets bound to IP addresses of each device, which is
>> certainly not desired with namespaces.
>> 
>
> Ok, so the problem is that ntp in this case runs on the host side as
> opposed to the guest? This would explain why Eric is reacting vehemently
> to the suggestion.

Yes, that would be one problem case.

>> Or more subtle question: do you want hotplug events to be generated when
>> guest0-eth0 interface comes up in the root namespace, and standard scripts
>> to try to set some IP address on this interface?..
>> 
>
> yes, thats what i was thinking. Even go further and actually create
> guestxx-eth0 on the host (which results in creating eth0 on the guest)
> and other things.
>
>> In my opinion, the downside of this scheme overweights possible advantages,
>> and I'm personally quite happy with running commands with switched namespace,
>> like
>> vzctl exec guest0 ip addr list
>> vzctl exec guest0 ip link set eth0 up
>> and so on.
>
> Ok, above may be good enough and doesnt require any state it seems on
> the host side. 
> I got motivated when the word "migration" was mentioned. I understood it
> to be meaning that a guest may become inoperative for some reason and
> that its info will be transfered to another guest which may be local or
> even remote. In such a case, clearly one would need a protocol and the
> state of all guests sitting at the host. Maybe i am over-reaching. 

Not really.  Network namespaces while useful in their own right lay
the foundation for some more interesting applications.  Application
migration between machines in particular.

The biggest fundamental problem in migration is after checkpointing your
application you can not acquire the resources you need on the new machine 
because of name conflicts.

So for those of us concerned with migration a question we ask is can
we successfully import resource names that another machine has assigned without
consulting us.

The context for all of this goes to other discussion that we have been having
since January.  Breaking all of this into small pieces that can be merged and
tested a little at a time is a challenge.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 15:36                             ` Eric W. Biederman
@ 2006-06-28 17:18                               ` Herbert Poetzl
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-28 17:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sam Vilain, Kirill Korotaev, Daniel Lezcano, Andrey Savochkin,
	linux-kernel, netdev, serue, haveblue, clg, Andrew Morton, devel,
	viro, Alexey Kuznetsov

On Wed, Jun 28, 2006 at 09:36:40AM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> > On Wed, Jun 28, 2006 at 06:31:05PM +1200, Sam Vilain wrote:
> >> Eric W. Biederman wrote:
> >> > Have a few more network interfaces for a layer 2 solution
> >> > is fundamental.  Believing without proof and after arguments
> >> > to the contrary that you have not contradicted that a layer 2
> >> > solution is inherently slower is non-productive.  Arguing
> >> > that a layer 2 only solution most prove itself on guest to guest
> >> > communication is also non-productive.
> >> >   
> >> 
> >> Yes, it does break what some people consider to be a sanity
> >> condition when you don't have loopback anymore within a guest. I
> >> once experimented with using 127.* addresses for per-guest loopback
> >> devices with vserver to fix this, but that couldn't work without
> >> fixing glibc to not make assumptions deep in the bowels of the
> >> resolver. I logged a fault with gnu.org and you can guess where it
> >> went :-).
> >
> > this is what the lo* patches address, by providing
> > the required loopback isolation and providing lo
> > inside a guest (i.e. it looks and feels like a
> > normal system, except that you cannot modify the
> > interfaces from inside)
> 
> Ok.  This is new.  How do you talk between guests now?

> Before those patches it was through IP addresses on the loopback
> interface as I recall.

no, that was probably your assumption, the IPs are
assigned (in a perfectly normal way) to the interfaces
(e.g. eth0 carries some IPs for guest A and B, eth1
carries others for guest C). the way the linux network
stack works, local addresses (i.e. those of A,B and C)
will automatically communicate via loopback (as they
are local) while outbound traffic will use the proper
interface (nothing is changed here)

the difference in the lo patches is that we allow
the 'localhost' ip range (127.x.x.x) to be used, by
isolating traffic (in this range) per guest on the
loopback interface (which typically allows 127.0.0.1
and lo to be visible inside a guest)
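
to make that concrete, a tiny standalone C toy model of the idea (every
name and structure here is invented for illustration; none of this is
taken from the actual lo patches):

#include <stdio.h>
#include <stdint.h>

#define IS_LOOPBACK(a)  (((a) >> 24) == 127)    /* 127.0.0.0/8, host byte order */

struct pkt   { uint32_t daddr; int ctx; };      /* ctx stamped at send time    */
struct sockc { uint32_t laddr; int ctx; };      /* the socket's owning context */

/* loopback-range traffic is only delivered within the sending context */
static int lo_deliver_ok(const struct pkt *p, const struct sockc *s)
{
    if (IS_LOOPBACK(p->daddr) && p->ctx != s->ctx)
        return 0;
    return p->daddr == s->laddr;
}

int main(void)
{
    uint32_t lh = (127u << 24) | 1;             /* 127.0.0.1 */
    struct pkt   from_a = { lh, 1 };            /* sent by guest A (ctx 1) */
    struct sockc in_a   = { lh, 1 }, in_b = { lh, 2 };

    printf("A -> A: %d\n", lo_deliver_ok(&from_a, &in_a));  /* 1: same guest */
    printf("A -> B: %d\n", lo_deliver_ok(&from_a, &in_b));  /* 0: isolated   */
    return 0;
}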

> >> > With a guest with 4 IPs 
> >> > 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
> >> > How do you make INADDR_ANY work with just filtering at bind time?
> >> >   
> >> 
> >> It used to just bind to the first one. Don't know if it still does.
> >
> > no, it _always_ binds to INADDR_ANY and checks
> > against other sockets (in the same context)
> > comparing the lists of assigned IPs (the subset)
> >
> > so all checks happen at bind/connect time and
> > always against the set of IPs, only exception is
> > a performance optimization we do for single IP
> > guests (where INADDR_ANY gets rewritten to the
> > single IP)
> 
> What is the mechanism there?
> 
> My rough extrapolation says this mechanism causes problems when
> migrating between machines. 

that might be, as we do not consider migration as
important as other folks do :)

> In particular it sounds like only one process can bind to *:80, even
> if it is only allowed to accept connections from a subset of those
> IPs.

no, guests A, B and C can all bind to *:80 and coexist
quite fine, given that they do not have any IP in
the intersection of their subsets (which is checked
at bind time)
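
a toy model of that bind-time check, again as standalone C with invented
names, not the actual implementation:

#include <stdio.h>
#include <stdint.h>

struct guest { const char *name; const uint32_t *addrs; int naddrs; };

static int sets_intersect(const struct guest *a, const struct guest *b)
{
    for (int i = 0; i < a->naddrs; i++)
        for (int j = 0; j < b->naddrs; j++)
            if (a->addrs[i] == b->addrs[j])
                return 1;
    return 0;
}

/* two INADDR_ANY binds on the same port only conflict if the guests'
 * IP subsets overlap */
static int bind_any_conflicts(const struct guest *a, const struct guest *b)
{
    return sets_intersect(a, b);
}

int main(void)
{
    const uint32_t a_ips[] = { 0x0a000001 };              /* 10.0.0.1     */
    const uint32_t b_ips[] = { 0x0a000002 };              /* 10.0.0.2     */
    const uint32_t c_ips[] = { 0x0a000002, 0x0a000003 };  /* 10.0.0.2, .3 */
    const struct guest A = { "A", a_ips, 1 };
    const struct guest B = { "B", b_ips, 1 };
    const struct guest C = { "C", c_ips, 2 };

    printf("A vs B on *:80: conflict=%d\n", bind_any_conflicts(&A, &B)); /* 0 */
    printf("B vs C on *:80: conflict=%d\n", bind_any_conflicts(&B, &C)); /* 1 */
    return 0;
}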

> So if on another machine I bound something to *:80 and was only allowed to
> use a different set of IPs and then attempted to migrate it, the
> migration would fail because I could not restart the application
> with all of its layer 3 resources.

actually I do not see why, unless the destination
has a conflict on the ip subset, in which case you
would end up with a migrated, but not working guest :)

> To be clear I assume when I migrate I always take my IP address or
> addresses with me.

that's fine, the only requirement would be that the
host has a superset of the IP addresses used by the
guests ...

HTC,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 13:53                                     ` jamal
  2006-06-28 14:19                                       ` Andrey Savochkin
  2006-06-28 14:39                                       ` Eric W. Biederman
@ 2006-06-29 21:07                                       ` Sam Vilain
  2006-06-29 22:14                                         ` strict isolation of net interfaces Cedric Le Goater
  2006-06-30  0:15                                         ` [patch 2/6] [Network namespace] Network device sharing by view jamal
  2 siblings, 2 replies; 113+ messages in thread
From: Sam Vilain @ 2006-06-29 21:07 UTC (permalink / raw)
  To: hadi
  Cc: Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, clg, serue, netdev, linux-kernel,
	Andrey Savochkin, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

jamal wrote:
>> note: personally I'm absolutely not against virtualizing
>> the device names so that each guest can have a separate
>> name space for devices, but there should be a way to
>> 'see' _and_ 'identify' the interfaces from outside
>> (i.e. host or spectator context)
>>
>>     
>
> Makes sense for the host side to have naming convention tied
> to the guest. Example as a prefix: guest0-eth0. Would it not
> be interesting to have the host also manage these interfaces
> via standard tools like ip or ifconfig etc? i.e if i admin up
> guest0-eth0, then the user in guest0 will see its eth0 going
> up.

That particular convention only works if you have network namespaces and
UTS namespaces tightly bound.  We plan to have them separate - so for
that to work, each network namespace could have an arbitrary "prefix"
that determines what the interface name will look like from the outside
when combined.  We'd have to be careful about length limits.
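
To put a number on the length limits: interface names are capped at 16
bytes including the trailing NUL (IFNAMSIZ in the kernel), so a combined
"prefix-eth0" name overflows quickly.  A quick standalone illustration
(the helper name is made up):

#include <stdio.h>
#include <net/if.h>   /* IF_NAMESIZE: the same 16-byte limit as the kernel's IFNAMSIZ */

/* compose "<prefix>-<guest name>"; returns -1 if it would not fit */
static int make_host_name(char *out, const char *prefix, const char *guest_name)
{
    int n = snprintf(out, IF_NAMESIZE, "%s-%s", prefix, guest_name);
    return (n < 0 || n >= IF_NAMESIZE) ? -1 : 0;
}

int main(void)
{
    char name[IF_NAMESIZE];

    printf("%d\n", make_host_name(name, "guest0", "eth0"));       /* 0: fits      */
    printf("%d\n", make_host_name(name, "webserver12", "eth0"));  /* -1: too long */
    return 0;
}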

And guest0-eth0 doesn't necessarily make sense; it's not really an
ethernet interface, more like a tun or something.

So, an equally good convention might be to use sequential prefixes on
the host, like "tun", "dummy", or a new prefix - then a property of that
is what the name of the interface is perceived to be to those who are in
the corresponding network namespace.

Then the pragmatic question becomes how to correlate what you see from
`ip addr list' to guests.

Sam.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* strict isolation of net interfaces
  2006-06-29 21:07                                       ` Sam Vilain
@ 2006-06-29 22:14                                         ` Cedric Le Goater
  2006-06-30  2:39                                           ` Serge E. Hallyn
  2006-06-30  0:15                                         ` [patch 2/6] [Network namespace] Network device sharing by view jamal
  1 sibling, 1 reply; 113+ messages in thread
From: Cedric Le Goater @ 2006-06-29 22:14 UTC (permalink / raw)
  To: Sam Vilain
  Cc: hadi, Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, serue, netdev, linux-kernel, Andrey Savochkin,
	Daniel Lezcano, Ben Greear, Dave Hansen, Alexey Kuznetsov,
	Eric W. Biederman

Sam Vilain wrote:
> jamal wrote:
>>> note: personally I'm absolutely not against virtualizing
>>> the device names so that each guest can have a separate
>>> name space for devices, but there should be a way to
>>> 'see' _and_ 'identify' the interfaces from outside
>>> (i.e. host or spectator context)
>>>
>>>     
>> Makes sense for the host side to have naming convention tied
>> to the guest. Example as a prefix: guest0-eth0. Would it not
>> be interesting to have the host also manage these interfaces
>> via standard tools like ip or ifconfig etc? i.e if i admin up
>> guest0-eth0, then the user in guest0 will see its eth0 going
>> up.
> 
> That particular convention only works if you have network namespaces and
> UTS namespaces tightly bound.  We plan to have them separate - so for
> that to work, each network namespace could have an arbitrary "prefix"
> that determines what the interface name will look like from the outside
> when combined.  We'd have to be careful about length limits.
> 
> And guest0-eth0 doesn't necessarily make sense; it's not really an
> ethernet interface, more like a tun or something.
> 
> So, an equally good convention might be to use sequential prefixes on
> the host, like "tun", "dummy", or a new prefix - then a property of that
> is what the name of the interface is perceived to be to those who are in
> the corresponding network namespace.
> 
> Then the pragmatic question becomes how to correlate what you see from
> `ip addr list' to guests.


we could work on virtualizing the net interfaces in the host, map them to
eth0 or something in the guest and let the guest handle upper network layers ?

lo0 would just be exposed relying on skbuff tagging to discriminate traffic
between guests.



host                  |  guest 0  |  guest 1  |  guest2
----------------------+-----------+-----------+--------------
  |                   |           |           |
  |-> l0      <-------+-> lo0 ... | lo0       | lo0
  |                   |           |           |
  |-> bar0   <--------+-> eth0    |           |
  |                   |           |           |
  |-> foo0   <--------+-----------+-----------+-> eth0
  |                   |           |           |
  `-> foo0:1  <-------+-----------+-> eth0    |
                      |           |           |


is that clear ? stupid ? reinventing the wheel ?

thanks,

C.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-29 21:07                                       ` Sam Vilain
  2006-06-29 22:14                                         ` strict isolation of net interfaces Cedric Le Goater
@ 2006-06-30  0:15                                         ` jamal
  2006-06-30  3:35                                           ` Herbert Poetzl
  2006-06-30  7:45                                           ` Andrey Savochkin
  1 sibling, 2 replies; 113+ messages in thread
From: jamal @ 2006-06-30  0:15 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, clg, serue, netdev, linux-kernel,
	Andrey Savochkin, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

On Fri, 2006-30-06 at 09:07 +1200, Sam Vilain wrote:
> jamal wrote:

> > Makes sense for the host side to have naming convention tied
> > to the guest. Example as a prefix: guest0-eth0. Would it not
> > be interesting to have the host also manage these interfaces
> > via standard tools like ip or ifconfig etc? i.e if i admin up
> > guest0-eth0, then the user in guest0 will see its eth0 going
> > up.
> 
> That particular convention only works if you have network namespaces and
> UTS namespaces tightly bound. 

that would be one approach. Another less sophisticated approach is to
have no binding whatsoever, rather some translation table to map two
unrelated devices. 

>  We plan to have them separate - so for
> that to work, each network namespace could have an arbitrary "prefix"
> that determines what the interface name will look like from the outside
> when combined.  We'd have to be careful about length limits.
> 
> And guest0-eth0 doesn't necessarily make sense; it's not really an
> ethernet interface, more like a tun or something.
> 

it wouldn't quite fit as a tun device. More like a mirror side of the 
guest eth0 created on the host side, 
i.e. a sort of passthrough device with one side visible on the host (send
from guest0-eth0 is received on eth0 in the guest and vice-versa).
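
to sketch the idea (a standalone toy model in C, every name made up, not
kernel code): whatever one end transmits simply shows up as a receive on
its peer:

#include <stdio.h>

struct pair_dev {
    const char *name;
    struct pair_dev *peer;
};

static void dev_receive(struct pair_dev *dev, const char *pkt)
{
    printf("%s: received \"%s\"\n", dev->name, pkt);
}

/* transmitting on one end of the pair is simply a receive on the other end */
static void dev_transmit(struct pair_dev *dev, const char *pkt)
{
    dev_receive(dev->peer, pkt);
}

int main(void)
{
    struct pair_dev host_side  = { "guest0-eth0 (host)", NULL };
    struct pair_dev guest_side = { "eth0 (guest0)", NULL };

    host_side.peer  = &guest_side;
    guest_side.peer = &host_side;

    dev_transmit(&host_side, "frame sent by the host");   /* shows up on the guest's eth0 */
    dev_transmit(&guest_side, "frame sent by the guest"); /* shows up on guest0-eth0      */
    return 0;
}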

Note this is radically different from what i have heard Andrey and co
talk about and i don't wanna disturb any shit because there seems to be
some agreement. But if you address me i respond because it is a very
interesting topic ;->

> So, an equally good convention might be to use sequential prefixes on
> the host, like "tun", "dummy", or a new prefix - then a property of that
> is what the name of the interface is perceived to be to those who are in
> the corresponding network namespace.
>
> Then the pragmatic question becomes how to correlate what you see from
> `ip addr list' to guests.

on the host ip addr and the one seen on the guest side are the same.
Except one is seen (on the host) on guest0-eth0 and another is seen 
on eth0 (on guest).
Anyways, ignore what i am saying if it is disrupting the discussion.

cheers,
jamal 





^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-28 14:39                                       ` Eric W. Biederman
@ 2006-06-30  1:41                                         ` Sam Vilain
  0 siblings, 0 replies; 113+ messages in thread
From: Sam Vilain @ 2006-06-30  1:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, clg, serue, netdev, linux-kernel,
	Andrey Savochkin, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov

Eric W. Biederman wrote:
>> Makes sense for the host side to have naming convention tied
>> to the guest. Example as a prefix: guest0-eth0. Would it not
>> be interesting to have the host also manage these interfaces
>> via standard tools like ip or ifconfig etc? i.e if i admin up
>> guest0-eth0, then the user in guest0 will see its eth0 going
>> up.
>>     
> Please no.
> [...]
> Now I am open to radically different designs if they allow the
> implementation cost to be lower and they have clean semantics,
> and don't wind up being an ugly unmaintainable wart on the linux
> networking stack.  The only route I could imagine such a thing coming
> from is something like tagging flows, in some netfilter-like way.
> Which might allow ifconfig guest-eth0 from the host without problems.
> But I have not seen such a design.
>   

Right.  New tools to support new features would probably be tidier, anyway.

Sam.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-29 22:14                                         ` strict isolation of net interfaces Cedric Le Goater
@ 2006-06-30  2:39                                           ` Serge E. Hallyn
  2006-06-30  2:49                                             ` Sam Vilain
                                                               ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Serge E. Hallyn @ 2006-06-30  2:39 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Sam Vilain, hadi, Herbert Poetzl, Alexey Kuznetsov, viro, devel,
	dev, Andrew Morton, serue, netdev, linux-kernel,
	Andrey Savochkin, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

Quoting Cedric Le Goater (clg@fr.ibm.com):
> Sam Vilain wrote:
> > jamal wrote:
> >>> note: personally I'm absolutely not against virtualizing
> >>> the device names so that each guest can have a separate
> >>> name space for devices, but there should be a way to
> >>> 'see' _and_ 'identify' the interfaces from outside
> >>> (i.e. host or spectator context)
> >>>
> >>>     
> >> Makes sense for the host side to have naming convention tied
> >> to the guest. Example as a prefix: guest0-eth0. Would it not
> >> be interesting to have the host also manage these interfaces
> >> via standard tools like ip or ifconfig etc? i.e if i admin up
> >> guest0-eth0, then the user in guest0 will see its eth0 going
> >> up.
> > 
> > That particular convention only works if you have network namespaces and
> > UTS namespaces tightly bound.  We plan to have them separate - so for
> > that to work, each network namespace could have an arbitrary "prefix"
> > that determines what the interface name will look like from the outside
> > when combined.  We'd have to be careful about length limits.
> > 
> > And guest0-eth0 doesn't necessarily make sense; it's not really an
> > ethernet interface, more like a tun or something.
> > 
> > So, an equally good convention might be to use sequential prefixes on
> > the host, like "tun", "dummy", or a new prefix - then a property of that
> > is what the name of the interface is perceived to be to those who are in
> > the corresponding network namespace.
> > 
> > Then the pragmatic question becomes how to correlate what you see from
> > `ip addr list' to guests.
> 
> 
> we could work on virtualizing the net interfaces in the host, map them to
> eth0 or something in the guest and let the guest handle upper network layers ?
> 
> lo0 would just be exposed relying on skbuff tagging to discriminate traffic
> between guests.

This seems to me the preferable way.  We create a full virtual net
device for each new container, and fully virtualize the device
namespace.

> host                  |  guest 0  |  guest 1  |  guest2
> ----------------------+-----------+-----------+--------------
>   |                   |           |           |
>   |-> l0      <-------+-> lo0 ... | lo0       | lo0
>   |                   |           |           |
>   |-> bar0   <--------+-> eth0    |           |
>   |                   |           |           |
>   |-> foo0   <--------+-----------+-----------+-> eth0
>   |                   |           |           |
>   `-> foo0:1  <-------+-----------+-> eth0    |
>                       |           |           |
> 
> 
> is that clear ? stupid ? reinventing the wheel ?

The last one in your diagram confuses me - why foo0:1?  I would
have thought it'd be

host                  |  guest 0  |  guest 1  |  guest2
----------------------+-----------+-----------+--------------
  |                   |           |           |
  |-> l0      <-------+-> lo0 ... | lo0       | lo0
  |                   |           |           |
  |-> eth0            |           |           |
  |                   |           |           |
  |-> veth0  <--------+-> eth0    |           |
  |                   |           |           |
  |-> veth1  <--------+-----------+-----------+-> eth0
  |                   |           |           |
  |-> veth2   <-------+-----------+-> eth0    |

I think we should avoid using device aliases, as trying to do
something like giving eth0:1 to guest1 and eth0:2 to guest2
while hiding eth0:1 from guest2 requires some uglier code (as
I recall) than working with full devices.  In other words,
if a namespace can see eth0, and eth0:2 exists, it should always
see eth0:2.

So conceptually using a full virtual net device per container
certainly seems cleaner to me, and it seems like it should be
simpler by way of statistics gathering etc, but are there actually
any real gains?  Or is the support for multiple IPs per device
actually enough?

Herbert, is this basically how ngnet is supposed to work?

-serge

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30  2:39                                           ` Serge E. Hallyn
@ 2006-06-30  2:49                                             ` Sam Vilain
  2006-07-03 14:53                                               ` Andrey Savochkin
  2006-06-30  8:56                                             ` Cedric Le Goater
  2006-06-30 12:23                                             ` Daniel Lezcano
  2 siblings, 1 reply; 113+ messages in thread
From: Sam Vilain @ 2006-06-30  2:49 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, hadi, Herbert Poetzl, Alexey Kuznetsov, viro,
	devel, dev, Andrew Morton, netdev, linux-kernel,
	Andrey Savochkin, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

Serge E. Hallyn wrote:
> The last one in your diagram confuses me - why foo0:1?  I would
> have thought it'd be
>
> host                  |  guest 0  |  guest 1  |  guest2
> ----------------------+-----------+-----------+--------------
>   |                   |           |           |
>   |-> l0      <-------+-> lo0 ... | lo0       | lo0
>   |                   |           |           |
>   |-> eth0            |           |           |
>   |                   |           |           |
>   |-> veth0  <--------+-> eth0    |           |
>   |                   |           |           |
>   |-> veth1  <--------+-----------+-----------+-> eth0
>   |                   |           |           |
>   |-> veth2   <-------+-----------+-> eth0    |
>
> [...]
>
> So conceptually using a full virtual net device per container
> certainly seems cleaner to me, and it seems like it should be
> simpler by way of statistics gathering etc, but are there actually
> any real gains?  Or is the support for multiple IPs per device
> actually enough?
>   

Why special case loopback?

Why not:

host                  |  guest 0  |  guest 1  |  guest2
----------------------+-----------+-----------+--------------
  |                   |           |           |
  |-> lo              |           |           |
  |                   |           |           |
  |-> vlo0  <---------+-> lo      |           |
  |                   |           |           |
  |-> vlo1  <---------+-----------+-----------+-> lo
  |                   |           |           |
  |-> vlo2   <--------+-----------+-> lo      |
  |                   |           |           |
  |-> eth0            |           |           |
  |                   |           |           |
  |-> veth0  <--------+-> eth0    |           |
  |                   |           |           |
  |-> veth1  <--------+-----------+-----------+-> eth0
  |                   |           |           |
  |-> veth2   <-------+-----------+-> eth0    |


Sam.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30  0:15                                         ` [patch 2/6] [Network namespace] Network device sharing by view jamal
@ 2006-06-30  3:35                                           ` Herbert Poetzl
  2006-06-30  7:45                                           ` Andrey Savochkin
  1 sibling, 0 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-06-30  3:35 UTC (permalink / raw)
  To: jamal
  Cc: Sam Vilain, Alexey Kuznetsov, viro, devel, dev, Andrew Morton,
	clg, serue, netdev, linux-kernel, Andrey Savochkin,
	Daniel Lezcano, Ben Greear, Dave Hansen, Alexey Kuznetsov,
	Eric W. Biederman

On Thu, Jun 29, 2006 at 08:15:52PM -0400, jamal wrote:
> On Fri, 2006-30-06 at 09:07 +1200, Sam Vilain wrote:
> > jamal wrote:
> 
> > > Makes sense for the host side to have naming convention tied
> > > to the guest. Example as a prefix: guest0-eth0. Would it not
> > > be interesting to have the host also manage these interfaces
> > > via standard tools like ip or ifconfig etc? i.e if i admin up
> > > guest0-eth0, then the user in guest0 will see its eth0 going
> > > up.
> > 
> > That particular convention only works if you have network namespaces
> > and UTS namespaces tightly bound.
> 
> that would be one approach. Another less sophisticated approach is to
> have no binding whatsoever, rather some translation table to map two
> unrelated devices. 
> 
> >  We plan to have them separate - so for
> > that to work, each network namespace could have an arbitrary
> > "prefix" that determines what the interface name will look like from
> > the outside when combined. We'd have to be careful about length
> > limits.
> >
> > And guest0-eth0 doesn't necessarily make sense; it's not really an
> > ethernet interface, more like a tun or something.
> 
> it wouldn't quite fit as a tun device. More like a mirror side of the 
> guest eth0 created on the host side 
> i.e. a sort of passthrough device with one side visible on the host (send
> from guest0-eth0 is received on eth0 in the guest and vice-versa).
> 
> Note this is radically different from what i have heard Andrey and co
> talk about and i dont wanna disturb any shit because there seems to be
> some agreement. But if you address me i respond because it is very
> interesting a topic;->

thing is, we have several things we should care about
and some of them 'look' or 'sound' similar, although
they are not really ... I'll try to clarify

 first, we want to have 'per guest' interfaces, which
 do not clash with any interfaces on the host or in
 other guests

 then, we want to 'connect' them, implicitly or 
 explicitly with 'other' interfaces or devices inside
 other guests or on the host, here we have the following
 cases (some are a little special):

 - lo interface, guest and host private (by default)
 - tap/tun interfaces, again host/guest private
 - tun like interfaces between host and guests
 - tun like interfaces between guests
 - 'normal' interfaces mapped into guests

 on the traffic side we have the following cases:

 - local traffic on the host
 - local traffic on the guest
 - local traffic between host and guest
 - local traffic between guests
 - routed traffic from guest via host
 - bridged traffic from guest via host

 special cases here would be tun/tap traffic inside
 a guest, but that can be considered local too

> > So, an equally good convention might be to use sequential prefixes
> > on the host, like "tun", "dummy", or a new prefix - then a property
> > of that is what the name of the interface is perceived to be to
> > those who are in the corresponding network namespace.
> >
> > Then the pragmatic question becomes how to correlate what you see
> > from `ip addr list' to guests.
> 
> on the host ip addr and the one seen on the guest side are the same.
> Except one is seen (on the host) on guest0-eth0 and another is seen 
> on eth0 (on guest).

this depends on the way the interfaces are handled
and how they actually work, which means:

 if the interfaces _solely_ work via routing or
 bridging, then the 'host' end has to exist and be
 visible similar to 'normal' interfaces

 if the traffic is (magically) mapped from guest
 interfaces to real (outside) host interfaces, we
 might want the same view as the guest has 
 (i.e. basically a 'copy' which is not real)

> Anyways, ignore what i am saying if it is disrupting the discussion.

IMHO input is always welcome .. helps the folks to
do better thinking :)

> cheers,
> jamal 
> 
> 
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30  0:15                                         ` [patch 2/6] [Network namespace] Network device sharing by view jamal
  2006-06-30  3:35                                           ` Herbert Poetzl
@ 2006-06-30  7:45                                           ` Andrey Savochkin
  2006-06-30 13:50                                             ` jamal
  1 sibling, 1 reply; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-30  7:45 UTC (permalink / raw)
  To: hadi
  Cc: Sam Vilain, Herbert Poetzl, viro, devel, dev, Andrew Morton, clg,
	serue, netdev, linux-kernel, Daniel Lezcano, Ben Greear,
	Dave Hansen, Alexey Kuznetsov, Eric W. Biederman

Hi Jamal,

On Thu, Jun 29, 2006 at 08:15:52PM -0400, jamal wrote:
> On Fri, 2006-30-06 at 09:07 +1200, Sam Vilain wrote:
[snip]
> >  We plan to have them separate - so for
> > that to work, each network namespace could have an arbitrary "prefix"
> > that determines what the interface name will look like from the outside
> > when combined.  We'd have to be careful about length limits.
> > 
> > And guest0-eth0 doesn't necessarily make sense; it's not really an
> > ethernet interface, more like a tun or something.
> > 
> 
> it wouldn't quite fit as a tun device. More like a mirror side of the 
> guest eth0 created on the host side 
> i.e. a sort of passthrough device with one side visible on the host (send
> from guest0-eth0 is received on eth0 in the guest and vice-versa).
> 
> Note this is radically different from what i have heard Andrey and co
> talk about and i dont wanna disturb any shit because there seems to be
> some agreement. But if you address me i respond because it is very
> interesting a topic;->

I do not have anything against guest-eth0 - eth0 pairs _if_ they are set up
by the host administrators explicitly for some purpose.
For example, if these guest-eth0 and eth0 devices stay as pure virtual ones,
i.e. they don't have any physical NIC, host administrator may route traffic
to guestXX-eth0 interfaces to pass it to the guests.

However, I oppose the idea of automatic mirroring of _all_ devices appearing
inside some namespaces ("guests") to another namespace (the "host").
This clearly goes against the concept of namespaces as independent realms,
and creates a lot of problems with applications running in the host, hotplug
scripts and so on.

> 
> > So, an equally good convention might be to use sequential prefixes on
> > the host, like "tun", "dummy", or a new prefix - then a property of that
> > is what the name of the interface is perceived to be to those who are in
> > the corresponding network namespace.
> >
> > Then the pragmatic question becomes how to correlate what you see from
> > `ip addr list' to guests.
> 
> on the host ip addr and the one seen on the guest side are the same.
> Except one is seen (on the host) on guest0-eth0 and another is seen 
> on eth0 (on guest).

Then what to do if the host system has 10.0.0.1 as a private address on eth3,
and then interfaces guest1-tun0 and guest2-tun0 both get address 10.0.0.1
when each guest has added 10.0.0.1 to their tun0 device?

Regards,

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30  2:39                                           ` Serge E. Hallyn
  2006-06-30  2:49                                             ` Sam Vilain
@ 2006-06-30  8:56                                             ` Cedric Le Goater
  2006-07-03 13:36                                               ` Herbert Poetzl
  2006-06-30 12:23                                             ` Daniel Lezcano
  2 siblings, 1 reply; 113+ messages in thread
From: Cedric Le Goater @ 2006-06-30  8:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sam Vilain, hadi, Herbert Poetzl, Alexey Kuznetsov, viro, devel,
	dev, Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Daniel Lezcano, Ben Greear, Dave Hansen, Alexey Kuznetsov,
	Eric W. Biederman

Serge E. Hallyn wrote:
> 
> The last one in your diagram confuses me - why foo0:1?  I would
> have thought it'd be

just thinking aloud. I thought that any kind/type of interface could be
mapped from host to guest.

> host                  |  guest 0  |  guest 1  |  guest2
> ----------------------+-----------+-----------+--------------
>   |                   |           |           |
>   |-> l0      <-------+-> lo0 ... | lo0       | lo0
>   |                   |           |           |
>   |-> eth0            |           |           |
>   |                   |           |           |
>   |-> veth0  <--------+-> eth0    |           |
>   |                   |           |           |
>   |-> veth1  <--------+-----------+-----------+-> eth0
>   |                   |           |           |
>   |-> veth2   <-------+-----------+-> eth0    |
> 
> I think we should avoid using device aliases, as trying to do
> something like giving eth0:1 to guest1 and eth0:2 to guest2
> while hiding eth0:1 from guest2 requires some uglier code (as
> I recall) than working with full devices.  In other words,
> if a namespace can see eth0, and eth0:2 exists, it should always
> see eth0:2.
> 
> So conceptually using a full virtual net device per container
> certainly seems cleaner to me, and it seems like it should be
> simpler by way of statistics gathering etc, but are there actually
> any real gains?  Or is the support for multiple IPs per device
> actually enough?
> 
> Herbert, is this basically how ngnet is supposed to work?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30  2:39                                           ` Serge E. Hallyn
  2006-06-30  2:49                                             ` Sam Vilain
  2006-06-30  8:56                                             ` Cedric Le Goater
@ 2006-06-30 12:23                                             ` Daniel Lezcano
  2006-06-30 14:20                                               ` Eric W. Biederman
  2006-06-30 18:09                                               ` Eric W. Biederman
  2 siblings, 2 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-30 12:23 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, Sam Vilain, hadi, Herbert Poetzl,
	Alexey Kuznetsov, viro, devel, dev, Andrew Morton, netdev,
	linux-kernel, Andrey Savochkin, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

Serge E. Hallyn wrote:
> Quoting Cedric Le Goater (clg@fr.ibm.com):
> 
>>we could work on virtualizing the net interfaces in the host, map them to
>>eth0 or something in the guest and let the guest handle upper network layers ?
>>
>>lo0 would just be exposed relying on skbuff tagging to discriminate traffic
>>between guests.
> 
> 
> This seems to me the preferable way.  We create a full virtual net
> device for each new container, and fully virtualize the device
> namespace.

I have a few questions about all the network isolation stuff:

   * What level of isolation is wanted for the network ? network devices 
? IPv4/IPv6 ? TCP/UDP ?

   * How are the incoming packets from the network handled ? I mean, what 
will be the mechanism to dispatch the packet to the right virtual device ?

   * How to handle the SO_BINDTODEVICE socket option ?

   * Does the virtual device have a different MAC address ? How is it managed 
alongside the real MAC address on the system ? How to manage ARP, ICMP, 
multicasting and IP ?

It seems to me, IMHO, that this will require a lot of translation and 
table lookups. It will probably add very significant overhead.

    * How to handle NFS access when the mount was done outside of the container ?

    * How to handle ICMP_REDIRECT ?

Regards








^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30  7:45                                           ` Andrey Savochkin
@ 2006-06-30 13:50                                             ` jamal
  2006-06-30 15:01                                               ` Andrey Savochkin
  2006-06-30 18:22                                               ` Eric W. Biederman
  0 siblings, 2 replies; 113+ messages in thread
From: jamal @ 2006-06-30 13:50 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: Eric W. Biederman, Alexey Kuznetsov, Dave Hansen, Ben Greear,
	Daniel Lezcano, linux-kernel, netdev, serue, clg, Andrew Morton,
	dev, devel, viro, Herbert Poetzl, Sam Vilain

Hi Andrey,

BTW - I was just looking at openvz, very impressive. To the other folks,
I am not putting down any of your approaches - just haven't
had time to study them. Andrey, this is the same thing you guys have
been working on for a few years now, you just changed the name, correct?

Ok, since you guys are encouraging me to speak, here goes ;->
Hopefully this addresses the other email from Herbert et al.

On Fri, 2006-30-06 at 11:45 +0400, Andrey Savochkin wrote: 
> Hi Jamal,
> 
> On Thu, Jun 29, 2006 at 08:15:52PM -0400, jamal wrote:
> > On Fri, 2006-30-06 at 09:07 +1200, Sam Vilain wrote:
> [snip]

> 
> I do not have anything against guest-eth0 - eth0 pairs _if_ they are set up
> by the host administrators explicitly for some purpose.
> For example, if these guest-eth0 and eth0 devices stay as pure virtual ones,
> i.e. they don't have any physical NIC, host administrator may route traffic
> to guestXX-eth0 interfaces to pass it to the guests.
> 

Well, they will be purely virtual of course.  Something along these lines
for openvz:

// create the guest
[host-node]# vzctl create 101 --ostemplate fedora-core-5-minimal 
// create guest101::eth0, seems to only create config to boot up with 
[host-node]# vzctl create 101 --netdev eth0
// bootup guest101
[host-node]# vzctl start 101

As soon as bootup of guest101 happens, creating guest101::eth0 should activate 
creation of the host side netdevice. This could be triggered for example by
the netlink event message seen on the host, which is a result of creating guest101::eth0.
Which means control sits purely in user space.

at that point if i do ifconfig on the host i see g101-eth0;
on guest101 i see just the name eth0.

My earlier suggestion was that instead of:
[host-node]# vzctl set 101 --ipadd 10.1.2.3

you do:
[host-node]# ip addr add 10.1.2.3/32 dev g101-eth0
you would still use vzctl to save the config for the next bootup
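
roughly, the sort of userspace listener that would drive this; it is only
an illustrative sketch that prints RTM_NEWLINK events, and the part that
would actually create and configure the host-side device (and save the
guest config) is left out:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>

int main(void)
{
    struct sockaddr_nl sa;
    char buf[8192];
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return 1;
    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = RTMGRP_LINK;        /* link add/remove/up/down events */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        return 1;

    for (;;) {
        int len = recv(fd, buf, sizeof(buf), 0);
        struct nlmsghdr *nh;

        if (len <= 0)
            break;
        for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
             nh = NLMSG_NEXT(nh, len)) {
            struct ifinfomsg *ifi = NLMSG_DATA(nh);
            char name[IF_NAMESIZE] = "?";

            if (nh->nlmsg_type != RTM_NEWLINK)
                continue;
            if_indextoname(ifi->ifi_index, name);
            printf("RTM_NEWLINK: %s (ifindex %d)\n", name, ifi->ifi_index);
            /* here the tool would create/rename/configure the
               host-side device and save the guest's config */
        }
    }
    close(fd);
    return 0;
}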

> However, I oppose the idea of automatic mirroring of _all_ devices appearing
> inside some namespaces ("guests") to another namespace (the "host").
> This clearly goes against the concept of namespaces as independent realms,
> and creates a lot of problems with applications running in the host, hotplug
> scripts and so on.
> 

I was thinking that the host side is the master, i.e. you can peek at
namespaces in the guest from the host.
Also note that having the pass-through device allows for guests to be
connected via standard linux schemes on the host side (bridge, point
routes, tc redirect etc); so you don't need a special device to hook
them together.

> > > Then the pragmatic question becomes how to correlate what you see from
> > > `ip addr list' to guests.
> > 
> > on the host ip addr and the one seen on the guest side are the same.
> > Except one is seen (on the host) on guest0-eth0 and another is seen 
> > on eth0 (on guest).
> 
> Then what to do if the host system has 10.0.0.1 as a private address on eth3,
> and then interfaces guest1-tun0 and guest2-tun0 both get address 10.0.0.1
> when each guest has added 10.0.0.1 to their tun0 device?
> 

Yes, that would be a conflict that needs to be resolved. If you look at
ip addresses as also belonging to namespaces, then it should work, no?
i am assuming a tag at the ifa table level.

cheers,
jamal



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30 12:23                                             ` Daniel Lezcano
@ 2006-06-30 14:20                                               ` Eric W. Biederman
  2006-06-30 15:22                                                 ` Daniel Lezcano
  2006-06-30 16:14                                                 ` Serge E. Hallyn
  2006-06-30 18:09                                               ` Eric W. Biederman
  1 sibling, 2 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-30 14:20 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Serge E. Hallyn, Cedric Le Goater, Sam Vilain, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Ben Greear, Dave Hansen, Alexey Kuznetsov

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

> Serge E. Hallyn wrote:
>> Quoting Cedric Le Goater (clg@fr.ibm.com):
>>
>>>we could work on virtualizing the net interfaces in the host, map them to
>>>eth0 or something in the guest and let the guest handle upper network layers ?
>>>
>>>lo0 would just be exposed relying on skbuff tagging to discriminate traffic
>>>between guests.
>> This seems to me the preferable way.  We create a full virtual net
>> device for each new container, and fully virtualize the device
>> namespace.
>
> I have a few questions about all the network isolation stuff:

So far I have seen two viable possibilities on the table,
neither of which involves multiple names for a network device.

layer 3 (filtering the allowed ip addresses at bind time, roughly the current vserver).
  - implementable as a security hook (a rough sketch follows below).
  - Benefit no measurable performance impact.
  - Downside not many things we can do.

layer 2 (What appears to applications a separate instance of the network stack).
  - Implementable as a namespace.
  - Each network namespace would have dedicated network devices.
  - Benefit extremely flexible.
  - Downside since at least the slow path must examine the packet
    it has the possibility of slowing down the networking stack.
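
To make the layer 3 / security hook option above a little more concrete,
a rough sketch and nothing more: the function below follows the shape of
the LSM socket_bind hook, ipv4_allowed_for() is an invented placeholder,
and all of the registration glue is omitted.

#include <linux/errno.h>
#include <linux/in.h>
#include <linux/net.h>
#include <linux/sched.h>
#include <linux/types.h>

/* placeholder: a real implementation would consult the caller's
 * per-container list of allowed addresses here */
static int ipv4_allowed_for(struct task_struct *tsk, __be32 addr)
{
        return 1;
}

/* shaped like the LSM socket_bind hook: refuse binds to IPv4 addresses
 * outside the caller's allowed set */
static int l3_socket_bind(struct socket *sock, struct sockaddr *address,
                          int addrlen)
{
        struct sockaddr_in *sin = (struct sockaddr_in *)address;

        if (addrlen < (int)sizeof(*sin) || sin->sin_family != AF_INET)
                return 0;                 /* only IPv4 is filtered here   */
        if (sin->sin_addr.s_addr == 0)    /* INADDR_ANY: narrowed at lookup time */
                return 0;
        return ipv4_allowed_for(current, sin->sin_addr.s_addr) ? 0 : -EPERM;
}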


For me the important characteristics are:
- Allows for application migration, when we take our ip address with us.
  In particular it allows for importation of address assignments
  made on other machines.

- No measurable impact on the existing networking when the code
  is compiled in.

- Clean predictable semantics.


This whole debate on network devices showing up in multiple network namespaces
is just silly.  The only reason for wanting that appears to be better management.
We have deeper issues, like whether we can do a reasonable implementation without
a network device showing up in multiple namespaces.

I think the reason the debate exists at all is that it is a very approachable
topic, as opposed to the fundamentals here.

If we can get layer 2 level isolation working without measurable overhead
with one namespace per device it may be worth revisiting things.  Until
then it is a side issue at best.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30 13:50                                             ` jamal
@ 2006-06-30 15:01                                               ` Andrey Savochkin
  2006-06-30 18:22                                               ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-06-30 15:01 UTC (permalink / raw)
  To: hadi
  Cc: Eric W. Biederman, Alexey Kuznetsov, Dave Hansen, Ben Greear,
	Daniel Lezcano, linux-kernel, netdev, serue, clg, Andrew Morton,
	dev, devel, viro, Herbert Poetzl, Sam Vilain

Jamal,

On Fri, Jun 30, 2006 at 09:50:52AM -0400, jamal wrote:
> 
> BTW - I was just looking at openvz, very impressive. To the other folks,

Thanks!

> I am not putting down any of your approaches - just havent
> had time to study them. Andrey, this is the same thing you guys have
> been working on for a few years now, you just changed the name, correct?

The relations are more complicated than just the change of name,
but yes, OpenVZ represents the result of our work for a few years.

> 
> Ok, since you guys are encouraging me to speak, here goes ;->
> Hopefully this addresses the other email from Herbert et al.
> 
[snip]
> // create the guest
> [host-node]# vzctl create 101 --ostemplate fedora-core-5-minimal 
> // create guest101::eth0, seems to only create config to boot up with 
> [host-node]# vzctl create 101 --netdev eth0
> // bootup guest101
> [host-node]# vzctl start 101
> 
> As soon as bootup of guest101 happens, creating guest101::eth0 should activate 
> creation of the host side netdevice. This could be triggered for example by
> > the netlink event message seen on the host, which is a result of creating guest101::eth0.
> Which means control sits purely in user space.

I'd like to clarify your idea: whether this host-side device is a real
device capable of receiving and transmitting packets (by moving them between
namespaces), or it's a fake device creating only a view of another namespace's
devices?

[snip]
> > However, I oppose the idea of automatic mirroring of _all_ devices appearing
> > inside some namespaces ("guests") to another namespace (the "host").
> > This clearly goes against the concept of namespaces as independent realms,
> > and creates a lot of problems with applications running in the host, hotplug
> > scripts and so on.
> > 
> 
> I was thinking that the host side is the master i.e you can peek at
> namespaces in the guest from the host.

"Host(master)-guest" relations is a valid and useful scheme.
However, I'm thinking about broader application of network namespaces,
when they can form an arbitrary tree and may not be in "host-guest" relations.

> Also note that having the pass through device allows for guests to be
> connected via standard linux schemes in the host side (bridge, point
> routes, tc redirect etc); so you don't need a special device to hook
> them together.

What do you mean by a pass-through device?
Do you mean using guest1-tun0 as a backdoor to talk to the guest?

> 
> > > > Then the pragmatic question becomes how to correlate what you see from
> > > > `ip addr list' to guests.
> > > 
> > > on the host ip addr and the one seen on the guest side are the same.
> > > Except one is seen (on the host) on guest0-eth0 and another is seen 
> > > on eth0 (on guest).
> > 
> > Then what to do if the host system has 10.0.0.1 as a private address on eth3,
> > and then interfaces guest1-tun0 and guest2-tun0 both get address 10.0.0.1
> > when each guest has added 10.0.0.1 to their tun0 device?
> > 
> 
> Yes, that would be a conflict that needs to be resolved. If you look at
> ip addresses as also belonging to namespaces, then it should work, no?
> i am assuming a tag at the ifa table level.

I'm not sure, it's complicated.
You wouldn't want automatic local routes to be added for IP addresses on
the host-side interfaces, right?
Do you expect these IP addresses to act as local addresses in other places,
like answering arp requests for these IPs on all physical devices?

But anyway, you'll have conflicts on the application level.
Many programs like ntpd, bind, and others fetch the device list using the
same ioctls as ifconfig, and make (un)intelligent decisions based on what
they see.
Mirroring may have some advantages if I am both host and guest administrator.
But if I create a namespace for my friend Joe to play with IPv6 and sit
tunnels, why should I face inconveniences because of what he does there?
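
To be concrete about the kind of call involved: this is roughly all such a
program does to build its view (SIOCGIFCONF reports every interface that
has an IPv4 address, so mirrored per-guest devices would all show up in
this list):

#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct ifreq reqs[32];
    struct ifconf ifc;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    ifc.ifc_len = sizeof(reqs);
    ifc.ifc_req = reqs;
    if (fd < 0 || ioctl(fd, SIOCGIFCONF, &ifc) < 0)
        return 1;

    for (int i = 0; i < ifc.ifc_len / (int)sizeof(struct ifreq); i++) {
        struct sockaddr_in *sin = (struct sockaddr_in *)&reqs[i].ifr_addr;

        printf("%-10s %s\n", reqs[i].ifr_name, inet_ntoa(sin->sin_addr));
    }
    close(fd);
    return 0;
}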

Best regards

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30 14:20                                               ` Eric W. Biederman
@ 2006-06-30 15:22                                                 ` Daniel Lezcano
  2006-06-30 17:58                                                   ` Eric W. Biederman
  2006-06-30 16:14                                                 ` Serge E. Hallyn
  1 sibling, 1 reply; 113+ messages in thread
From: Daniel Lezcano @ 2006-06-30 15:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, Sam Vilain, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Ben Greear, Dave Hansen, Alexey Kuznetsov

Eric W. Biederman wrote:
> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
> 
> 
>>Serge E. Hallyn wrote:
>>
>>>Quoting Cedric Le Goater (clg@fr.ibm.com):
>>>
>>>
>>>>we could work on virtualizing the net interfaces in the host, map them to
>>>>eth0 or something in the guest and let the guest handle upper network layers ?
>>>>
>>>>lo0 would just be exposed relying on skbuff tagging to discriminate traffic
>>>>between guests.
>>>
>>>This seems to me the preferable way.  We create a full virtual net
>>>device for each new container, and fully virtualize the device
>>>namespace.
>>
>>I have a few questions about all the network isolation stuff:
> 

It seems these questions are not important.

> 
> So far I have seen two viable possibilities on the table,
> neither of them involve multiple names for a network device.
> 
> layer 3 (filtering the allowed ip addresses at bind time roughly the current vserver).
>   - implementable as a security hook.
>   - Benefit no measurable performance impact.
>   - Downside not many things we can do.

What things ? Can you elaborate please ? Can you give some examples ?

> 
> layer 2 (What appears to applications a separate instance of the network stack).
>   - Implementable as a namespace.

what about accessing an NFS mount made outside the container ?

>   - Each network namespace would have dedicated network devices.
>   - Benefit extremely flexible.

For what ? For who ? Do you have examples ?

>   - Downside since at least the slow path must examine the packet
>     it has the possibility of slowing down the networking stack.

What is/are the slow path(s) you identified ?

> For me the important characteristics.
> - Allows for application migration, when we take our ip address with us.
>   In particular it allows for importation of addresses assignments
>   mad on other machines.

Ok for the two methods no ?

> - No measurable impact on the existing networking when the code
>   is compiled in.

You contradict ...

> - Clean predictable semantics.

What does that mean ? Can you explain, please ?

> This whole debate on network devices show up in multiple network namespaces
> is just silly.  

The debate is not about how the network devices show up. The debate is whether 
we can have network isolation that is ___usable for everybody___, not only for 
the beauty of having namespaces and for something like a system container.

I am not against the network device virtualization or against the 
namespaces. I am just asking if the namespace is the solution for all 
the network isolation. Should we nest layer 2 and layer 3 virtualization 
into namespaces or separate them in order to have the flexibility to 
choose isolation/performance ?

> The only reason for wanting that appears to be better management.
> We have deeper issues like can we do a reasonable implementation without a
> network device showing up in multiple namespaces.

Again, I am not against having the network device virtualization. It is 
a good idea.

> I think the reason the debate exists at all is that it is a very approachable
> topic, as opposed to the fundamentals here.
> 
> If we can get layer 2 level isolation working without measurable overhead
> with one namespace per device it may be worth revisiting things.  Until
> then it is a side issue at best.

I agree, so where are the answers to the questions I asked in my 
previous email ? You said you did some implementation of network 
isolation with and without namespaces, so you should be able to answer...


   -- Daniel

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30 14:20                                               ` Eric W. Biederman
  2006-06-30 15:22                                                 ` Daniel Lezcano
@ 2006-06-30 16:14                                                 ` Serge E. Hallyn
  2006-06-30 17:41                                                   ` Eric W. Biederman
  1 sibling, 1 reply; 113+ messages in thread
From: Serge E. Hallyn @ 2006-06-30 16:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Serge E. Hallyn, Cedric Le Goater, Sam Vilain,
	hadi, Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Ben Greear, Dave Hansen, Alexey Kuznetsov

Quoting Eric W. Biederman (ebiederm@xmission.com):
> This whole debate on network devices show up in multiple network namespaces
> is just silly.  The only reason for wanting that appears to be better management.

A damned good reason.  Clearly we want the parent namespace to be able
to control what the child can do.  So whatever interface a child gets,
the parent should somehow be able to address it.  Simple iptables rules
controlling traffic between its own netdevice and the one it hands its
children seem a good option.

> We have deeper issues like can we do a reasonable implementation without a
> network device showing up in multiple namespaces.

Isn't that the same issue?

> If we can get layer 2 level isolation working without measurable overhead
> with one namespace per device it may be worth revisiting things.  Until
> then it is a side issue at best.

Ok, and in the meantime we can all use the network part of the bsdjail
lsm?  :)

-serge

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30 16:14                                                 ` Serge E. Hallyn
@ 2006-06-30 17:41                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-30 17:41 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Daniel Lezcano, Cedric Le Goater, Sam Vilain, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Ben Greear, Dave Hansen, Alexey Kuznetsov

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> This whole debate on network devices show up in multiple network namespaces
>> is just silly.  The only reason for wanting that appears to be better
>> management.
>
> A damned good reason.  

Better management is a good reason.  But constructing the management in 
a way that hampers the implementation and confuses existing applications is
a problem.

Things are much easier if namespaces are completely independent.

Among other things the semantics are clear and obvious.

> Clearly we want the parent namespace to be able
> to control what the child can do.  So whatever interface a child gets,
> the parent should be able to somehow address.  Simple iptables rules
> controlling traffic between it's own netdevice and the one it hands it's
> children seem a good option.

That or we setup the child and then drop CAP_NET_ADMIN.

>> We have deeper issues like can we do a reasonable implementation without a
>> network device showing up in multiple namespaces.
>
> Isn't that the same issue?

I guess I was thinking from the performance and cleanliness point of
view.

>> If we can get layer 2 level isolation working without measurable overhead
>> with one namespace per device it may be worth revisiting things.  Until
>> then it is a side issue at best.
>
> Ok, and in the meantime we can all use the network part of the bsdjail
> lsm?  :)

If necessary.  But mostly we concentrate on the fundamentals and figure
out what it takes to get the layer 2 stuff working.

Eric


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30 15:22                                                 ` Daniel Lezcano
@ 2006-06-30 17:58                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-30 17:58 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Serge E. Hallyn, Cedric Le Goater, Sam Vilain, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Ben Greear, Dave Hansen, Alexey Kuznetsov

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
>>
>>>Serge E. Hallyn wrote:
>>>
>>>>Quoting Cedric Le Goater (clg@fr.ibm.com):
>>>>
>>>>
>>>>>we could work on virtualizing the net interfaces in the host, map them to
>>>>>eth0 or something in the guest and let the guest handle upper network layers
> ?
>>>>>
>>>>>lo0 would just be exposed relying on skbuff tagging to discriminate traffic
>>>>>between guests.
>>>>
>>>>This seems to me the preferable way.  We create a full virtual net
>>>>device for each new container, and fully virtualize the device
>>>>namespace.
>>>
>>>I have a few questions about all the network isolation stuff:
>>
>
> It seems these questions are not important.

I'm just trying to get us back to a productive topic.

>> So far I have seen two viable possibilities on the table,
>> neither of them involve multiple names for a network device.
>> layer 3 (filtering the allowed ip addresses at bind time roughly the current
>> vserver).
>>   - implementable as a security hook.
>>   - Benefit no measurable performance impact.
>>   - Downside not many things we can do.
>
> What things ? Can you develop please ? Can you give some examples ?

DHCP, tcpdump, ...  Probably a bad way of phrasing it.  But there
is a lot more that we can do using a pure layer 2 approach.

>> layer 2 (What appears to applications a separate instance of the network
>> stack).
>>   - Implementable as a namespace.
>
> what about accessing a NFS mounted outside the container ?

As I replied earlier it isn't a problem.  If you get to it through the
filesystem namespace it uses the network namespace it was mounted with
for its connection.

>>   - Each network namespace would have dedicated network devices.
>>   - Benefit extremely flexible.
>
> For what ? For who ? Do you have examples ?

See above.

>>   - Downside since at least the slow path must examine the packet
>>     it has the possibility of slowing down the networking stack.
>
> What is/are the slow path(s) you identified ?

Grr.  I put that badly.  Basically at least on the slow path you need to
look at a per network namespace data structure.  The extra pointer
indirection could slow things down.  The point is that we may be
able to have a fast path that is exactly the same as the rest
of the network stack.

If the obvious approach does not work, my gut feeling is that the
network stack fast path will give us an implementation without overhead.

>> For me the important characteristics.
>> - Allows for application migration, when we take our ip address with us.
>>   In particular it allows for importation of addresses assignments
>>   mad on other machines.
>
> Ok for the two methods no ?

So far.

>> - No measurable impact on the existing networking when the code
>>   is compiled in.
>
> You contradict ...

How so?  As far as I can tell this is a basic requirement to get
merged.

>> - Clean predictable semantics.
>
> What that means ? Can you explain, please ?

>> This whole debate on network devices show up in multiple network namespaces
>> is just silly.
>
> The debate is not on the network device show up. The debate is can we have a
> network isolation ___usable for everybody___ not only for the beauty of having
> namespaces and for a system container like.

This subthread talking about devices showing up in multiple namespaces seemed
 to be very much about exactly how network devices show up.

> I am not against the network device virtualization or against the namespaces. I
> am just asking if the namespace is the solution for all the network
> isolation. Should we nest layer 2 and layer 3 vitualization into namespaces or
> separate them in order to have the flexibility to choose isolation/performance.

I believe I addressed Herbert Poetzl's concerns earlier.  To me the question
is whether we can implement an acceptable layer 2 solution that distributions and
other people who do not need isolation would have no problem compiling in
by default.

The joy of namespaces is that if you don't want it you don't have to use it.
Layer 2 can do everything and is likely usable by everyone iff the performance
is acceptable.

>> The only reason for wanting that appears to be better management.
>> We have deeper issues like can we do a reasonable implementation without a
>> network device showing up in multiple namespaces.
>
> Again, I am not against having the network device virtualization. It is a good
> idea.
>
>> I think the reason the debate exists at all is that it is a very approachable
>> topic, as opposed to the fundamentals here.
>> If we can get layer 2 level isolation working without measurable overhead
>> with one namespace per device it may be worth revisiting things.  Until
>> then it is a side issue at best.
>
> I agree, so where are the answers of the questions I asked in my previous email
> ? You said you did some implementation of network isolation with and without
> namespaces, so you should be able to answer...

Sorry.  More than anything those questions looked rhetorical and aimed
at disarming some of the silliness.  I will go back and try to
answer them.  Fundamentally, when we have one namespace that includes
network devices, network sockets, and all of the data structures necessary
to use them (routing tables and the like), and we have a tunnel device
that can connect namespaces, the answers are trivial and, I thought, obvious.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30 12:23                                             ` Daniel Lezcano
  2006-06-30 14:20                                               ` Eric W. Biederman
@ 2006-06-30 18:09                                               ` Eric W. Biederman
  1 sibling, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-30 18:09 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Serge E. Hallyn, Cedric Le Goater, Sam Vilain, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Ben Greear, Dave Hansen, Alexey Kuznetsov, Eric W. Biederman

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

> Serge E. Hallyn wrote:
>> Quoting Cedric Le Goater (clg@fr.ibm.com):
>>
>>>we could work on virtualizing the net interfaces in the host, map them to
>>>eth0 or something in the guest and let the guest handle upper network layers ?
>>>
>>>lo0 would just be exposed relying on skbuff tagging to discriminate traffic
>>>between guests.
>> This seems to me the preferable way.  We create a full virtual net
>> device for each new container, and fully virtualize the device
>> namespace.
>

Answers with respect to how I see layer 2 isolation,
with network devices and sockets as well as the associated routing
information given per namespace.

> I have a few questions about all the network isolation stuff:
>
>   * What level of isolation is wanted for the network ? network devices ?
> IPv4/IPv6 ? TCP/UDP ?
>
>   * How are the incoming packets from the network handled ? I mean what will be
> the mechanism to dispatch the packet to the right virtual device ?

Wrong question.  A better question is to ask how you know which namespace
a packet is in.
Answer:  By looking at which device or socket it just came from.

How do you get a packet into a non-default namespace?
Either you move a real network interface into that namespace,
or you use a tunnel device that shows up as two network interfaces in
two different namespaces.

Then you route, or bridge, packets between the two.  Trivial.
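
A tiny sketch of the first rule above, with made-up type names (this is
not the actual patch code):

    struct net_ns;                              /* opaque namespace object */

    struct dev_sketch { struct net_ns *ns; };   /* every device is owned by one namespace */
    struct skb_sketch { struct dev_sketch *dev; };

    /* The namespace of a packet is simply the namespace of the device
     * (or socket) it arrived on; nothing in the packet itself is tagged. */
    static struct net_ns *packet_ns(const struct skb_sketch *skb)
    {
            return skb->dev->ns;
    }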

>   * How to handle the SO_BINDTODEVICE socket option ?

Just like we do now.
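
For reference, this is how SO_BINDTODEVICE is already used from userspace
(the device name is illustrative; the call needs CAP_NET_RAW).  With
per-namespace device lists the name would simply be resolved among the
caller's own devices:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            const char ifname[] = "eth0";   /* illustrative device name */

            if (fd < 0) {
                    perror("socket");
                    return 1;
            }
            /* Bind the socket to a device by name; requires CAP_NET_RAW. */
            if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                           ifname, sizeof(ifname)) < 0)
                    perror("setsockopt(SO_BINDTODEVICE)");
            close(fd);
            return 0;
    }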

>   * Has the virtual device a different MAC address ? 

All network devices are abstractions of the hardware, so they are all
sort of virtual.  My implementation of a tunnel device has a MAC
address so I can use it with Ethernet bridging, but that isn't a hard
requirement.  And yes, the MAC address is different, because you can't
do layer 2 switching if everyone has the same MAC address.

But there is no special ``virtual'' device.

> How to manage it with the real MAC address on the system ? 
Manage?

> How to manage ARP, ICMP, multicasting and IP ?

Like you always do.  It would be a terrible implementation if
we had to change that logic.  There is a little bit of it
where we need to detect which network namespace we are going to, because
the answers can differ, but that is pretty straightforward.

> It seems for me, IMHO that will require a lot of translation and browsing
> table. It will probably add a very significant overhead.

Then look at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-ns.git#proof-of-concept
or the OpenVZ implementation.  

It isn't serious overhead.

>    * How to handle NFS access mounted outside of the container ?

The socket should remember its network namespace.
It works fine.
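
A conceptual sketch of what "remember" means here (names are hypothetical,
this is not the posted code): the namespace reference is taken when the
socket is created, so the NFS transport keeps using the namespace the
mount was made in, no matter which task later touches the files:

    struct net_ns;                              /* opaque namespace object */

    struct sock_sketch {
            struct net_ns *ns;                  /* captured at socket creation */
    };

    /* Stub standing in for "the namespace of the creating task". */
    static struct net_ns *current_net_ns(void) { return 0; }

    static void sock_init(struct sock_sketch *sk)
    {
            /* All later transmits consult sk->ns, not the namespace of
             * whoever happens to be accessing the NFS mount. */
            sk->ns = current_net_ns();
    }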

>    * How to handle ICMP_REDIRECT ?

Just like we always do?

Eric



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30 13:50                                             ` jamal
  2006-06-30 15:01                                               ` Andrey Savochkin
@ 2006-06-30 18:22                                               ` Eric W. Biederman
  2006-06-30 21:51                                                 ` jamal
  1 sibling, 1 reply; 113+ messages in thread
From: Eric W. Biederman @ 2006-06-30 18:22 UTC (permalink / raw)
  To: hadi
  Cc: Andrey Savochkin, Alexey Kuznetsov, Dave Hansen, Ben Greear,
	Daniel Lezcano, linux-kernel, netdev, serue, clg, Andrew Morton,
	dev, devel, viro, Herbert Poetzl, Sam Vilain

jamal <hadi@cyberus.ca> writes:

>> > > Then the pragmatic question becomes how to correlate what you see from
>> > > `ip addr list' to guests.
>> > 
>> > on the host ip addr and the one seen on the guest side are the same.
>> > Except one is seen (on the host) on guest0-eth0 and another is seen 
>> > on eth0 (on guest).
>> 
>> Then what to do if the host system has 10.0.0.1 as a private address on eth3,
>> and then interfaces guest1-tun0 and guest2-tun0 both get address 10.0.0.1
>> when each guest has added 10.0.0.1 to their tun0 device?
>
> Yes, that would be a conflict that needs to be resolved. If you look at
> ip addresses as also belonging to namespaces, then it should work, no?
> i am assuming a tag at the ifa table level.

Yes.  The concept is that everything belongs to the namespace,
so it looks like you have multiple instances of the network stack.

Which means that, through the existing interfaces, it would be a real problem
if a network device showed up in more than one network stack, as
that would confuse things.

Basically the reading and configuration through existing interfaces
is expected to be in the namespace as well, which is where the difficulty
shows up.
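
To illustrate the address tagging mentioned above (all names here are
hypothetical): once addresses carry a namespace tag, the same 10.0.0.1 in
two guests is just two distinct entries, and a lookup never crosses
namespaces:

    #include <netinet/in.h>
    #include <stddef.h>

    struct net_ns;                              /* opaque namespace object */

    struct ifa_sketch {
            struct net_ns *ns;                  /* namespace tag on the address */
            struct in_addr addr;
            struct ifa_sketch *next;
    };

    static struct ifa_sketch *ifa_find(struct ifa_sketch *head,
                                       const struct net_ns *ns,
                                       struct in_addr addr)
    {
            struct ifa_sketch *e;

            for (e = head; e; e = e->next)
                    if (e->ns == ns && e->addr.s_addr == addr.s_addr)
                            return e;
            return NULL;
    }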

When you get serious about splitting up root's powers this becomes a real
advantage, because you might want to have one person responsible for
what would normally be eth0 and another person responsible for eth1.

Anyway Jamal can you see the problem the aliases present to the implementation?

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30 18:22                                               ` Eric W. Biederman
@ 2006-06-30 21:51                                                 ` jamal
  2006-07-01  0:50                                                   ` Eric W. Biederman
  0 siblings, 1 reply; 113+ messages in thread
From: jamal @ 2006-06-30 21:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sam Vilain, Herbert Poetzl, viro, devel, dev, Andrew Morton, clg,
	serue, netdev, linux-kernel, Daniel Lezcano, Ben Greear,
	Dave Hansen, Alexey Kuznetsov, Andrey Savochkin

On Fri, 2006-30-06 at 12:22 -0600, Eric W. Biederman wrote:

> 
> Anyway Jamal can you see the problem the aliases present to the implementation?
> 

I think more than anything I may have a different view of things and no
code ;-> And you are trying to restore order in the discussion - so my
wild ideas don't help. If you guys have a meeting somewhere like this
coming OLS I will come over and disrupt your meeting ;-> I actually have
attempted to implement things like virtual routers but you guys seem way
ahead of anywhere I was.


cheers,
jamal



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [patch 2/6] [Network namespace] Network device sharing by view
  2006-06-30 21:51                                                 ` jamal
@ 2006-07-01  0:50                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 113+ messages in thread
From: Eric W. Biederman @ 2006-07-01  0:50 UTC (permalink / raw)
  To: hadi
  Cc: Sam Vilain, Herbert Poetzl, viro, devel, dev, Andrew Morton, clg,
	serue, netdev, linux-kernel, Daniel Lezcano, Ben Greear,
	Dave Hansen, Alexey Kuznetsov, Andrey Savochkin

jamal <hadi@cyberus.ca> writes:

> On Fri, 2006-30-06 at 12:22 -0600, Eric W. Biederman wrote:
>
>> 
>> Anyway Jamal can you see the problem the aliases present to the
>> implementation?
>>
>
> I think more than anything I may have a different view of things and no
> code ;-> And you are trying to restore order in the discussion - so my
> wild ideas don't help. If you guys have a meeting somewhere like this
> coming OLS I will come over and disrupt your meeting ;-> I actually have
> attempted to implement things like virtual routers but you guys seem way
> ahead of anywhere I was.

Currently I think we have both a talk and a BOF at OLS, plus probably
a little time at the kernel summit.

Eric

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30  8:56                                             ` Cedric Le Goater
@ 2006-07-03 13:36                                               ` Herbert Poetzl
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Poetzl @ 2006-07-03 13:36 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Serge E. Hallyn, Sam Vilain, hadi, Alexey Kuznetsov, viro, devel,
	dev, Andrew Morton, netdev, linux-kernel, Andrey Savochkin,
	Daniel Lezcano, Ben Greear, Dave Hansen, Alexey Kuznetsov,
	Eric W. Biederman

On Fri, Jun 30, 2006 at 10:56:13AM +0200, Cedric Le Goater wrote:
> Serge E. Hallyn wrote:
> > 
> > The last one in your diagram confuses me - why foo0:1?  I would
> > have thought it'd be
> 
> just thinking aloud. I thought that any kind/type of interface could be
> mapped from host to guest.
> 
> > host                  |  guest 0  |  guest 1  |  guest2
> > ----------------------+-----------+-----------+--------------
> >   |                   |           |           |
> >   |-> l0      <-------+-> lo0 ... | lo0       | lo0
> >   |                   |           |           |
> >   |-> eth0            |           |           |
> >   |                   |           |           |
> >   |-> veth0  <--------+-> eth0    |           |
> >   |                   |           |           |
> >   |-> veth1  <--------+-----------+-----------+-> eth0
> >   |                   |           |           |
> >   |-> veth2   <-------+-----------+-> eth0    |
> > 
> > I think we should avoid using device aliases, as trying to do
> > something like giving eth0:1 to guest1 and eth0:2 to guest2
> > while hiding eth0:1 from guest2 requires some uglier code (as
> > I recall) than working with full devices.  In other words,
> > if a namespace can see eth0, and eth0:2 exists, it should always
> > see eth0:2.
> > 
> > So conceptually using a full virtual net device per container
> > certainly seems cleaner to me, and it seems like it should be
> > simpler by way of statistics gathering etc, but are there actually
> > any real gains?  Or is the support for multiple IPs per device
> > actually enough?
> > 
> > Herbert, is this basically how ngnet is supposed to work?

hard to tell, we have at least three ngnet prototypes,
and basically all variants are covered there, from
separate interfaces which map to real ones to perfect
isolation of addresses assigned to global interfaces

IMHO the 'virtual' interface per guest is fine, as
the overhead and consumed resources are non-critical
and it will definitely simplify handling on the
guest side

I'd really appreciate it if we could find a solution which
allows both isolation and virtualization, and if the
bridge scenario is as fast as a direct mapping, I'm
perfectly fine with a big bridge + ebtables to handle
the security issues

best,
Herbert


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-06-30  2:49                                             ` Sam Vilain
@ 2006-07-03 14:53                                               ` Andrey Savochkin
  2006-07-04  3:00                                                 ` Sam Vilain
  2006-07-04 12:29                                                 ` Daniel Lezcano
  0 siblings, 2 replies; 113+ messages in thread
From: Andrey Savochkin @ 2006-07-03 14:53 UTC (permalink / raw)
  To: Sam Vilain, Serge E. Hallyn, Cedric Le Goater
  Cc: hadi, Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Daniel Lezcano, Ben Greear,
	Dave Hansen, Alexey Kuznetsov, Eric W. Biederman

Sam, Serge, Cedric,

On Fri, Jun 30, 2006 at 02:49:05PM +1200, Sam Vilain wrote:
> Serge E. Hallyn wrote:
> > The last one in your diagram confuses me - why foo0:1?  I would
> > have thought it'd be
> >
> > host                  |  guest 0  |  guest 1  |  guest2
> > ----------------------+-----------+-----------+--------------
> >   |                   |           |           |
> >   |-> l0      <-------+-> lo0 ... | lo0       | lo0
> >   |                   |           |           |
> >   |-> eth0            |           |           |
> >   |                   |           |           |
> >   |-> veth0  <--------+-> eth0    |           |
> >   |                   |           |           |
> >   |-> veth1  <--------+-----------+-----------+-> eth0
> >   |                   |           |           |
> >   |-> veth2   <-------+-----------+-> eth0    |
> >
> > [...]
> >
> > So conceptually using a full virtual net device per container
> > certainly seems cleaner to me, and it seems like it should be
> > simpler by way of statistics gathering etc, but are there actually
> > any real gains?  Or is the support for multiple IPs per device
> > actually enough?
> >   
> 
> Why special case loopback?
> 
> Why not:
> 
> host                  |  guest 0  |  guest 1  |  guest2
> ----------------------+-----------+-----------+--------------
>   |                   |           |           |
>   |-> lo              |           |           |
>   |                   |           |           |
>   |-> vlo0  <---------+-> lo      |           |
>   |                   |           |           |
>   |-> vlo1  <---------+-----------+-----------+-> lo
>   |                   |           |           |
>   |-> vlo2   <--------+-----------+-> lo      |
>   |                   |           |           |
>   |-> eth0            |           |           |
>   |                   |           |           |
>   |-> veth0  <--------+-> eth0    |           |
>   |                   |           |           |
>   |-> veth1  <--------+-----------+-----------+-> eth0
>   |                   |           |           |
>   |-> veth2   <-------+-----------+-> eth0    |

I still can't completely understand your direction of thoughts.
Could you elaborate on IP address assignment in your diagram, please?  For
example, guest0 wants 127.0.0.1 and 192.168.0.1 addresses on its lo
interface, and 10.1.1.1 on its eth0 interface.
Does this diagram assume any local IP addresses on v* interfaces in the
"host"?

And the second question.
Are vlo0, veth0, etc. devices supposed to have hard_xmit routines?

Best regards

Andrey

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-07-03 14:53                                               ` Andrey Savochkin
@ 2006-07-04  3:00                                                 ` Sam Vilain
  2006-07-04 12:29                                                 ` Daniel Lezcano
  1 sibling, 0 replies; 113+ messages in thread
From: Sam Vilain @ 2006-07-04  3:00 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: Serge E. Hallyn, Cedric Le Goater, hadi, Herbert Poetzl,
	Alexey Kuznetsov, viro, devel, dev, Andrew Morton, netdev,
	linux-kernel, Daniel Lezcano, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

Andrey Savochkin wrote:
>> Why special case loopback?
>>
>> Why not:
>>
>> host                  |  guest 0  |  guest 1  |  guest2
>> ----------------------+-----------+-----------+--------------
>>   |                   |           |           |
>>   |-> lo              |           |           |
>>   |                   |           |           |
>>   |-> vlo0  <---------+-> lo      |           |
>>   |                   |           |           |
>>   |-> vlo1  <---------+-----------+-----------+-> lo
>>   |                   |           |           |
>>   |-> vlo2   <--------+-----------+-> lo      |
>>   |                   |           |           |
>>   |-> eth0            |           |           |
>>   |                   |           |           |
>>   |-> veth0  <--------+-> eth0    |           |
>>   |                   |           |           |
>>   |-> veth1  <--------+-----------+-----------+-> eth0
>>   |                   |           |           |
>>   |-> veth2   <-------+-----------+-> eth0    |
>>     
>
> I still can't completely understand your direction of thoughts.
> Could you elaborate on IP address assignment in your diagram, please?  For
> example, guest0 wants 127.0.0.1 and 192.168.0.1 addresses on its lo
> interface, and 10.1.1.1 on its eth0 interface.
> Does this diagram assume any local IP addresses on v* interfaces in the
> "host"?
>   

Well, Eric already pointed out some pretty good reasons why this thread
should die.

The idea is that each "lo" interface would have the same set of
addresses. Which would make routing on the host confusing. Yet another
reason to kill this idea. Let's just make better tools instead.

Sam.

> And the second question.
> Are vlo0, veth0, etc. devices supposed to have hard_xmit routines?
>   


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-07-03 14:53                                               ` Andrey Savochkin
  2006-07-04  3:00                                                 ` Sam Vilain
@ 2006-07-04 12:29                                                 ` Daniel Lezcano
  2006-07-04 13:13                                                   ` Sam Vilain
  1 sibling, 1 reply; 113+ messages in thread
From: Daniel Lezcano @ 2006-07-04 12:29 UTC (permalink / raw)
  To: Andrey Savochkin
  Cc: Sam Vilain, Serge E. Hallyn, Cedric Le Goater, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman

Andrey Savochkin wrote:
> 
> I still can't completely understand your direction of thoughts.
> Could you elaborate on IP address assignment in your diagram, please?  For
> example, guest0 wants 127.0.0.1 and 192.168.0.1 addresses on its lo
> interface, and 10.1.1.1 on its eth0 interface.
> Does this diagram assume any local IP addresses on v* interfaces in the
> "host"?
> 
> And the second question.
> Are vlo0, veth0, etc. devices supposed to have hard_xmit routines?


Andrey,

some people are interested in full network isolation/virtualization,
like you did with the layer 2 isolation, and other people are
interested in a lighter network isolation done at layer 3. The latter is
intended to implement "application containers", aka "lightweight containers".

In the case of layer 3 isolation, the network interface is not totally
isolated, and the debate here is to find a way to have something
intuitive to manage the network devices.

IMHO, all the discussion we had convinced me of the need to have the
possibility to choose between a layer 2 or a layer 3 isolation.

If it is ok for you, we can collaborate to merge the two solutions into
one. I will focus on the layer 3 isolation and you on the layer 2.

Regards

   - Daniel

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-07-04 12:29                                                 ` Daniel Lezcano
@ 2006-07-04 13:13                                                   ` Sam Vilain
  2006-07-04 13:19                                                     ` Daniel Lezcano
  0 siblings, 1 reply; 113+ messages in thread
From: Sam Vilain @ 2006-07-04 13:13 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrey Savochkin, Serge E. Hallyn, Cedric Le Goater, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman, Russel Coker

Daniel Lezcano wrote:
> 
> If it is ok for you, we can collaborate to merge the two solutions in
> one. I will focus on layer 3 isolation and you on the layer 2.

So, you're writing a LSM module or adapting the BSD Jail LSM, right? :)

Sam.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: strict isolation of net interfaces
  2006-07-04 13:13                                                   ` Sam Vilain
@ 2006-07-04 13:19                                                     ` Daniel Lezcano
  0 siblings, 0 replies; 113+ messages in thread
From: Daniel Lezcano @ 2006-07-04 13:19 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Andrey Savochkin, Serge E. Hallyn, Cedric Le Goater, hadi,
	Herbert Poetzl, Alexey Kuznetsov, viro, devel, dev,
	Andrew Morton, netdev, linux-kernel, Ben Greear, Dave Hansen,
	Alexey Kuznetsov, Eric W. Biederman, Russel Coker

Sam Vilain wrote:
> Daniel Lezcano wrote:
> 
>>If it is ok for you, we can collaborate to merge the two solutions in
>>one. I will focus on layer 3 isolation and you on the layer 2.
> 
> 
> So, you're writing a LSM module or adapting the BSD Jail LSM, right? :)
> 
> Sam.

No. I am adapting a prototype of a network application container we did.

   -- Daniel

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Routing tables (Re: [patch 2/6] [Network namespace] Network device sharing by view)
  2006-06-27  9:34             ` Daniel Lezcano
  2006-06-27  9:38               ` Andrey Savochkin
  2006-06-27  9:54               ` Kirill Korotaev
@ 2006-07-06  9:45               ` Kari Hurtta
  2 siblings, 0 replies; 113+ messages in thread
From: Kari Hurtta @ 2006-07-06  9:45 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Andrey Savochkin, linux-kernel, netdev, serue, haveblue, clg,
	Andrew Morton, dev, herbert, devel, sam, ebiederm, viro,
	Alexey Kuznetsov

> Andrey Savochkin wrote:
> > Daniel,
> > 
> > On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
> > 
> >>>Then you lose the ability for each namespace to have its own routing entries.
> >>>Which implies that you'll have difficulties with devices that should exist
> >>>and be visible in one namespace only (like tunnels), as they require IP
> >>>addresses and route.
> >>
> >>I mean instead of having the route tables private to the namespace, the 
> >>routes have the information to which namespace they are associated.
> > 
> > 
> > I think I understand what you're talking about: you want to make routing
> > responsible for determining destination namespace ID in addition to route
> > type (local, unicast etc), nexthop information, and so on.  Right?
> 
> Yes.
> 
> > 
> > My point is that if you make namespace tagging at routing time, and
> > your packets are being routed only once, you lose the ability
> > to have separate routing tables in each namespace.
> 
> Right. What is the advantage of having separate routing tables ?

One application may be the following. Consider a firewall:

                       (isp1)            (isp2)


                         I                 I
          +-----------  red0------------- red1 ---------+
          |     +                                 +     |
          |     |    red routing deamon  (BGP)    |     |
          |     |                                 |     |
          |     |    red routing tables           |     |
          |     |                                 |     |
          |     +----------tun(?)-----------------+     |
          |                                             |
          |     +----------tun(?)-----------------+     |
          |     |                                 |     |
          |     |     green routing tables        |     |
     I mana0    |                                 |     |
          |     |     green routing deamon (ospf) |     |
          |     +                                 +     |
          +--------- green0 ---------- green1 ----------+
                       I                 I

               

That would allow running a different routing daemon on
the red and the green side. That is possible if they manage
different routing tables in the kernel.  They do not need to
communicate with each other when the route between them is static.

/ Kari Hurtta


^ permalink raw reply	[flat|nested] 113+ messages in thread

end of thread, other threads:[~2006-07-06  9:46 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-09 21:02 [RFC] [patch 0/6] [Network namespace] introduction dlezcano
2006-06-09 21:02 ` [RFC] [patch 1/6] [Network namespace] Network namespace structure dlezcano
2006-06-09 21:02 ` [RFC] [patch 2/6] [Network namespace] Network device sharing by view dlezcano
2006-06-11 10:18   ` Andrew Morton
2006-06-18 18:53   ` Al Viro
2006-06-26  9:47   ` Andrey Savochkin
2006-06-26 13:02     ` Herbert Poetzl
2006-06-26 14:05       ` Eric W. Biederman
2006-06-26 14:08       ` Andrey Savochkin
2006-06-26 18:28         ` Herbert Poetzl
2006-06-26 18:59           ` Eric W. Biederman
2006-06-26 14:56     ` Daniel Lezcano
2006-06-26 15:21       ` Eric W. Biederman
2006-06-26 15:27       ` Andrey Savochkin
2006-06-26 15:49         ` Daniel Lezcano
2006-06-26 16:40           ` Eric W. Biederman
2006-06-26 18:36             ` Herbert Poetzl
2006-06-26 19:35               ` Eric W. Biederman
2006-06-26 20:02                 ` Herbert Poetzl
2006-06-26 20:37                   ` Eric W. Biederman
2006-06-26 21:26                     ` Herbert Poetzl
2006-06-26 21:59                       ` Ben Greear
2006-06-26 22:11                       ` Eric W. Biederman
2006-06-27  9:09                   ` Andrey Savochkin
2006-06-27 15:48                     ` Herbert Poetzl
2006-06-27 16:19                       ` Andrey Savochkin
2006-06-27 16:40                       ` Eric W. Biederman
2006-06-26 22:13                 ` Ben Greear
2006-06-26 22:54                   ` Herbert Poetzl
2006-06-26 23:08                     ` Ben Greear
2006-06-27 16:07                       ` Ben Greear
2006-06-27 22:48                         ` Herbert Poetzl
2006-06-27  9:11           ` Andrey Savochkin
2006-06-27  9:34             ` Daniel Lezcano
2006-06-27  9:38               ` Andrey Savochkin
2006-06-27 11:21                 ` Daniel Lezcano
2006-06-27 11:52                   ` Eric W. Biederman
2006-06-27 16:02                     ` Herbert Poetzl
2006-06-27 16:47                       ` Eric W. Biederman
2006-06-27 17:19                         ` Ben Greear
2006-06-27 22:52                           ` Herbert Poetzl
2006-06-27 23:12                             ` Dave Hansen
2006-06-27 23:42                               ` Alexey Kuznetsov
2006-06-28  3:38                                 ` Eric W. Biederman
2006-06-28 13:36                                   ` Herbert Poetzl
2006-06-28 13:53                                     ` jamal
2006-06-28 14:19                                       ` Andrey Savochkin
2006-06-28 16:17                                         ` jamal
2006-06-28 16:58                                           ` Andrey Savochkin
2006-06-28 17:17                                           ` Eric W. Biederman
2006-06-28 17:04                                         ` Herbert Poetzl
2006-06-28 14:39                                       ` Eric W. Biederman
2006-06-30  1:41                                         ` Sam Vilain
2006-06-29 21:07                                       ` Sam Vilain
2006-06-29 22:14                                         ` strict isolation of net interfaces Cedric Le Goater
2006-06-30  2:39                                           ` Serge E. Hallyn
2006-06-30  2:49                                             ` Sam Vilain
2006-07-03 14:53                                               ` Andrey Savochkin
2006-07-04  3:00                                                 ` Sam Vilain
2006-07-04 12:29                                                 ` Daniel Lezcano
2006-07-04 13:13                                                   ` Sam Vilain
2006-07-04 13:19                                                     ` Daniel Lezcano
2006-06-30  8:56                                             ` Cedric Le Goater
2006-07-03 13:36                                               ` Herbert Poetzl
2006-06-30 12:23                                             ` Daniel Lezcano
2006-06-30 14:20                                               ` Eric W. Biederman
2006-06-30 15:22                                                 ` Daniel Lezcano
2006-06-30 17:58                                                   ` Eric W. Biederman
2006-06-30 16:14                                                 ` Serge E. Hallyn
2006-06-30 17:41                                                   ` Eric W. Biederman
2006-06-30 18:09                                               ` Eric W. Biederman
2006-06-30  0:15                                         ` [patch 2/6] [Network namespace] Network device sharing by view jamal
2006-06-30  3:35                                           ` Herbert Poetzl
2006-06-30  7:45                                           ` Andrey Savochkin
2006-06-30 13:50                                             ` jamal
2006-06-30 15:01                                               ` Andrey Savochkin
2006-06-30 18:22                                               ` Eric W. Biederman
2006-06-30 21:51                                                 ` jamal
2006-07-01  0:50                                                   ` Eric W. Biederman
2006-06-28 14:21                                     ` Eric W. Biederman
2006-06-28 14:51                               ` Eric W. Biederman
2006-06-27 16:49                       ` Alexey Kuznetsov
2006-06-27 11:55                   ` Andrey Savochkin
2006-06-27  9:54               ` Kirill Korotaev
2006-06-27 16:09                 ` Herbert Poetzl
2006-06-27 16:29                   ` Eric W. Biederman
2006-06-27 23:07                     ` Herbert Poetzl
2006-06-28  4:07                       ` Eric W. Biederman
2006-06-28  6:31                         ` Sam Vilain
2006-06-28 14:15                           ` Herbert Poetzl
2006-06-28 15:36                             ` Eric W. Biederman
2006-06-28 17:18                               ` Herbert Poetzl
2006-06-28 10:14                         ` Cedric Le Goater
2006-06-28 14:11                         ` Herbert Poetzl
2006-06-28 16:10                           ` Eric W. Biederman
2006-07-06  9:45               ` Routing tables (Re: [patch 2/6] [Network namespace] Network device sharing by view) Kari Hurtta
2006-06-09 21:02 ` [RFC] [patch 3/6] [Network namespace] Network devices isolation dlezcano
2006-06-18 18:57   ` Al Viro
2006-06-09 21:02 ` [RFC] [patch 4/6] [Network namespace] Network inet " dlezcano
2006-06-09 21:02 ` [RFC] [patch 5/6] [Network namespace] ipv4 isolation dlezcano
2006-06-10  0:23   ` James Morris
2006-06-10  0:27     ` Rick Jones
2006-06-10  0:47       ` James Morris
2006-06-09 21:02 ` [RFC] [patch 6/6] [Network namespace] Network namespace debugfs dlezcano
2006-06-10  7:16 ` [RFC] [patch 0/6] [Network namespace] introduction Kari Hurtta
2006-06-16  4:23 ` Eric W. Biederman
2006-06-16  9:06   ` Daniel Lezcano
2006-06-16  9:22     ` Eric W. Biederman
2006-06-18 18:47 ` Al Viro
2006-06-20 21:21   ` Daniel Lezcano
2006-06-20 21:25     ` Al Viro
2006-06-20 22:45       ` Daniel Lezcano
2006-06-26 23:38 ` Patrick McHardy
