* Network device and namespace checkpoint/restart (v3)
@ 2010-02-16 16:03 Dan Smith
  2010-02-16 16:03 ` [PATCH 1/5] Add checkpoint and collect hooks to net_device_ops Dan Smith
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:03 UTC (permalink / raw)
  To: containers, netdev

This patch set adds checkpoint/restart support for network namespaces,
as well as the network devices within them.  It currently supports the
veth and loopback device types.

This version has minor changes based on feedback from the previous
posting.  Changes are detailed per patch, and a new patch at the end
adds a checkpoint handler for SIT devices (which does the bare minimum
so that we can fail when a device has no support).



* [PATCH 1/5] Add checkpoint and collect hooks to net_device_ops
  2010-02-16 16:03 Network device and namespace checkpoint/restart (v3) Dan Smith
@ 2010-02-16 16:03 ` Dan Smith
  2010-02-16 16:03 ` [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4) Dan Smith
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:03 UTC (permalink / raw)
  To: containers, netdev

These hooks will be implemented per-driver, by the drivers that support
checkpoint and collect operations.
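
For illustration only, here is a small userspace model of how an optional
per-driver hook like this can be dispatched.  Every mock_* name is
hypothetical and is not part of the kernel or of this series.  The
refusal with -ENOSYS when the hook is missing mirrors what
checkpoint_netdev() does in patch 2.

```c
/* Userspace sketch: optional per-driver checkpoint hook.
 * All mock_* names are hypothetical stand-ins, not kernel APIs. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_ctx {
	int written;			/* counts devices written out */
};

struct mock_dev;

/* Stand-in for the two hooks added to net_device_ops */
struct mock_dev_ops {
	int (*ndo_collect)(struct mock_ctx *ctx, struct mock_dev *dev);
	int (*ndo_checkpoint)(struct mock_ctx *ctx, struct mock_dev *dev);
};

struct mock_dev {
	const char *name;
	const struct mock_dev_ops *ops;
};

/* A driver that supports checkpoint fills in the hook */
int mock_veth_checkpoint(struct mock_ctx *ctx, struct mock_dev *dev)
{
	(void)dev;
	ctx->written++;
	return 0;
}

const struct mock_dev_ops mock_veth_ops = {
	.ndo_collect = NULL,
	.ndo_checkpoint = mock_veth_checkpoint,
};

/* A driver without support leaves both hooks NULL */
const struct mock_dev_ops mock_plain_ops = {
	.ndo_collect = NULL,
	.ndo_checkpoint = NULL,
};

/* Core-side dispatch: refuse devices whose driver has no hook */
int mock_checkpoint_netdev(struct mock_ctx *ctx, struct mock_dev *dev)
{
	assert(ctx != NULL && dev != NULL && dev->ops != NULL);

	if (!dev->ops->ndo_checkpoint)
		return -ENOSYS;

	return dev->ops->ndo_checkpoint(ctx, dev);
}
```

Keeping the hooks optional means existing drivers need no changes, and
the core can report unsupported devices rather than skipping them.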

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 include/linux/netdevice.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a3fccc8..415a791 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -673,6 +673,12 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int			(*ndo_collect)(struct ckpt_ctx *ctx,
+					       struct net_device *dev);
+	int			(*ndo_checkpoint)(struct ckpt_ctx *ctx,
+						  struct net_device *dev);
+#endif
 };
 
 /*
-- 
1.6.2.5



* [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4)
  2010-02-16 16:03 Network device and namespace checkpoint/restart (v3) Dan Smith
  2010-02-16 16:03 ` [PATCH 1/5] Add checkpoint and collect hooks to net_device_ops Dan Smith
@ 2010-02-16 16:03 ` Dan Smith
       [not found]   ` <20100222194523.GA13135@us.ibm.com>
  2010-02-16 16:03 ` [PATCH 3/5] Add checkpoint support for veth devices (v2) Dan Smith
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:03 UTC (permalink / raw)
  To: containers, netdev

When checkpointing a task tree with network namespaces, we hook into
do_checkpoint_ns() along with the other namespaces.  Any devices in a
given namespace are checkpointed sequentially (including their peer, in
the case of veth).  Each network device stores a list of protocol
addresses, as well as other information, such as its hardware address.
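
As a rough userspace sketch of the address collection step, the
following walks a linked list of addresses into a flat array, growing
the buffer one page at a time and giving up with -E2BIG past four pages.
It is a simplified variant and not the code in this patch: the mock_*
names and MOCK_* constants are hypothetical, realloc() stands in for
krealloc(), and locking is left out entirely.

```c
/* Userspace sketch: collect a linked list of IPv4 addresses into a
 * flat array.  All mock_* / MOCK_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MOCK_PAGE_SIZE	4096
#define MOCK_MAX_PAGES	4

struct mock_ifaddr {			/* stand-in for struct in_ifaddr */
	unsigned int local, address, mask, broadcast;
	struct mock_ifaddr *next;
};

struct mock_addr_rec {			/* stand-in for ckpt_netdev_addr */
	unsigned int local, address, mask, broadcast;
};

/* Returns the number of records stored in *out, or a negative errno.
 * On error *out is NULL and nothing is left allocated. */
int mock_collect_addrs(const struct mock_ifaddr *list,
		       struct mock_addr_rec **out)
{
	struct mock_addr_rec *buf = NULL;
	const struct mock_ifaddr *a;
	size_t pages = 0, max = 0;
	int n = 0;

	assert(out != NULL);
	*out = NULL;

	for (a = list; a; a = a->next) {
		if ((size_t)n >= max) {
			struct mock_addr_rec *tmp;

			/* Out of room: grow by one page, up to the cap */
			if (++pages > MOCK_MAX_PAGES) {
				free(buf);
				return -E2BIG;
			}
			tmp = realloc(buf, pages * MOCK_PAGE_SIZE);
			if (!tmp) {
				free(buf);
				return -ENOMEM;
			}
			buf = tmp;
			max = pages * MOCK_PAGE_SIZE / sizeof(*buf);
		}
		buf[n].local = a->local;
		buf[n].address = a->address;
		buf[n].mask = a->mask;
		buf[n].broadcast = a->broadcast;
		n++;
	}

	*out = buf;
	return n;
}
```

Collecting everything first and writing it out afterwards keeps the
record count known before the device header is written.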

This patch supports veth pairs, as well as the loopback adapter.  The
loopback support is there to make sure that any additional addresses and
state (such as up/down) are copied to the loopback adapter that we are
given in the new network namespace.

On restart, we instantiate new network namespaces and veth pairs as
necessary.  Any device we encounter that isn't in a network namespace
that was checkpointed as part of a task is left in the namespace of the
restarting process.  This will be the case for a veth half that exists
in the init netns to provide network access to a container.
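
The placement rule above can be modelled in a few lines of userspace C.
This is only a sketch under assumptions: the mock_* names and the fixed
size table are hypothetical stand-ins for the objhash, and a reference
of 0 is taken to mean "no checkpointed netns", as in the netns_ref
field of the device header.

```c
/* Userspace sketch: choose the namespace a restored device lands in.
 * All mock_* / MOCK_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define MOCK_HASH_SLOTS 8

struct mock_net {
	int id;
};

/* Hypothetical table mapping an objref to a restored namespace */
struct mock_objhash {
	struct mock_net *slots[MOCK_HASH_SLOTS];
};

/* ref == 0: the device was not in a checkpointed netns, so it stays
 * in the namespace of the restarting process.  Otherwise look up the
 * restored namespace; NULL means the reference could not be resolved. */
struct mock_net *mock_pick_netns(const struct mock_objhash *hash,
				 int netns_ref,
				 struct mock_net *current_ns)
{
	assert(hash != NULL && current_ns != NULL);

	if (netns_ref == 0)
		return current_ns;
	if (netns_ref < 0 || netns_ref >= MOCK_HASH_SLOTS)
		return NULL;

	return hash->slots[netns_ref];
}
```

A caller would treat a NULL result as a restart failure and would not
fall back to the current namespace.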

Still to do are:

  1. Routes
  2. Netfilter rules
  3. IPv6 addresses
  4. Other virtual device types (e.g. bridges)
  5. Multicast
  6. Device config info (ipv4_devconf)
  7. Additional ipv4 address attributes

Changes in v4:
 - Fix allocation under lock in ckpt_netdev_inet_addrs()
 - Add comment for case where there is no netns info in checkpoint image
 - Fix inner structure alignment in netdev_addr header
 - Fix instances of kfree(skb)
 - Remove init_netns_ref from container header and checkpoint context
 - Add 'extern' to checkpoint.h prototypes
 - Swizzle do_restore_netns() to handle netns more like the others
 - Return E2BIG for failure case when collecting inet addrs
 - Report case where device doesn't support checkpoint
 - Remove nested netns check from may_checkpoint_task()
 - Move veth-specific netdev attributes into unioned struct to set an
   example for specific attributes of additional device types
 - Add 'sit' device restore path that doesn't really do anything
 - Fail instead of skip when encountering a device with no checkpoint
   support

Changes in v3:
 - Use dev->checkpoint() for per-device checkpoint operation
 - Use RTNL for veth pair creation on restart
 - Export some of the functions that will be needed by dev->ndo_checkpoint()

Changes in v2:
 - Add CONFIG_CHECKPOINT_NETNS that is dependent on NET, NET_NS, and
   CHECKPOINT.  Conditionally compile the checkpoint_dev code based on it.
 - Updated comment on should_checkpoint_netdev()
 - Updated checkpoint_netdev() to explicitly check for "veth" in name
 - Changed checkpoint_netns() to use BUG() for impossible condition
 - Fixed a bug on restart with all devices in the init netns
 - Lock the dev_base_lock while traversing interface addresses
 - Collect all addresses for an interface before writing out in one
   single pass

Signed-off-by: Dan Smith <danms@us.ibm.com>
Cc: netdev@vger.kernel.org
---
 checkpoint/checkpoint.c          |   18 +-
 checkpoint/objhash.c             |   48 +++
 include/linux/checkpoint.h       |   23 ++
 include/linux/checkpoint_hdr.h   |   54 +++
 include/linux/checkpoint_types.h |    1 +
 kernel/nsproxy.c                 |   24 ++-
 net/Kconfig                      |    4 +
 net/Makefile                     |    1 +
 net/checkpoint_dev.c             |  703 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 864 insertions(+), 12 deletions(-)
 create mode 100644 net/checkpoint_dev.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index b4e0021..119f093 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -185,11 +185,10 @@ static int checkpoint_container(struct ckpt_ctx *ctx)
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CONTAINER);
 	if (!h)
 		return -ENOMEM;
-	ret = ckpt_write_obj(ctx, &h->h);
-	ckpt_hdr_put(ctx, h);
 
+	ret = ckpt_write_obj(ctx, &h->h);
 	if (ret < 0)
-		return ret;
+		goto out;
 
 	memset(ctx->lsm_name, 0, CHECKPOINT_LSM_NAME_MAX + 1);
 	strlcpy(ctx->lsm_name, security_get_lsm_name(),
@@ -197,9 +196,13 @@ static int checkpoint_container(struct ckpt_ctx *ctx)
 	ret = ckpt_write_buffer(ctx, ctx->lsm_name,
 				CHECKPOINT_LSM_NAME_MAX + 1);
 	if (ret < 0)
-		return ret;
+		goto out;
 
-	return security_checkpoint_header(ctx);
+	ret = security_checkpoint_header(ctx);
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
 }
 
 /* write the checkpoint trailer */
@@ -285,11 +288,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		_ckpt_err(ctx, -EPERM, "%(T)Nested mnt_ns unsupported\n");
 		ret = -EPERM;
 	}
-	/* no support for >1 private netns */
-	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
-		_ckpt_err(ctx, -EPERM, "%(T)Nested net_ns unsupported\n");
-		ret = -EPERM;
-	}
 	/* no support for >1 private pidns */
 	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns) {
 		_ckpt_err(ctx, -EPERM, "%(T)Nested pid_ns unsupported\n");
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 4ca7799..729fbe5 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -348,6 +348,36 @@ static void lsm_string_drop(void *ptr, int lastref)
 	kref_put(&s->kref, lsm_string_free);
 }
 
+static int netns_grab(void *ptr)
+{
+	struct net *net = ptr;
+
+	get_net(net);
+	return 0;
+}
+
+static void netns_drop(void *ptr, int lastref)
+{
+	struct net *net = ptr;
+
+	put_net(net);
+}
+
+static int netdev_grab(void *ptr)
+{
+	struct net_device *dev = ptr;
+
+	dev_hold(dev);
+	return 0;
+}
+
+static void netdev_drop(void *ptr, int lastref)
+{
+	struct net_device *dev = ptr;
+
+	dev_put(dev);
+}
+
 /* security context strings */
 static int checkpoint_lsm_string(struct ckpt_ctx *ctx, void *ptr);
 static struct ckpt_lsm_string *restore_lsm_string(struct ckpt_ctx *ctx);
@@ -550,6 +580,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_lsm_string,
 		.restore = restore_lsm_string_wrap,
 	},
+	/* Network Namespace Object */
+	{
+		.obj_name = "NET_NS",
+		.obj_type = CKPT_OBJ_NET_NS,
+		.ref_grab = netns_grab,
+		.ref_drop = netns_drop,
+		.checkpoint = checkpoint_netns,
+		.restore = restore_netns,
+	},
+	/* Network Device Object */
+	{
+		.obj_name = "NET_DEV",
+		.obj_type = CKPT_OBJ_NETDEV,
+		.ref_grab = netdev_grab,
+		.ref_drop = netdev_drop,
+		.checkpoint = checkpoint_netdev,
+		.restore = restore_netdev,
+	},
 };
 
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7101d6f..a25bac1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -35,6 +35,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <linux/inetdevice.h>
 #include <net/sock.h>
 
 /* sycall helpers */
@@ -119,6 +120,28 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx, struct sock *sk);
 extern void sock_listening_list_free(struct list_head *head);
 
+#ifdef CONFIG_CHECKPOINT_NETNS
+extern int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netns(struct ckpt_ctx *ctx);
+extern int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netdev(struct ckpt_ctx *ctx);
+
+extern int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx,
+				     struct net_device *dev);
+extern int ckpt_netdev_inet_addrs(struct in_device *indev,
+				  struct ckpt_netdev_addr *list[]);
+extern int ckpt_netdev_hwaddr(struct net_device *dev,
+			      struct ckpt_hdr_netdev *h);
+extern struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+						struct net_device *dev,
+						struct ckpt_netdev_addr *addrs[]);
+#else
+# define checkpoint_netns NULL
+# define restore_netns NULL
+# define checkpoint_netdev NULL
+# define restore_netdev NULL
+#endif
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e591fd1..1fa0482 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -181,6 +181,12 @@ enum {
 #define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
 	CKPT_HDR_SOCKET_INET,
 #define CKPT_HDR_SOCKET_INET CKPT_HDR_SOCKET_INET
+	CKPT_HDR_NET_NS,
+#define CKPT_HDR_NET_NS CKPT_HDR_NET_NS
+	CKPT_HDR_NETDEV,
+#define CKPT_HDR_NETDEV CKPT_HDR_NETDEV
+	CKPT_HDR_NETDEV_ADDR,
+#define CKPT_HDR_NETDEV_ADDR CKPT_HDR_NETDEV_ADDR
 
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
@@ -253,6 +259,10 @@ enum obj_type {
 #define CKPT_OBJ_SECURITY_PTR CKPT_OBJ_SECURITY_PTR
 	CKPT_OBJ_SECURITY,
 #define CKPT_OBJ_SECURITY CKPT_OBJ_SECURITY
+	CKPT_OBJ_NET_NS,
+#define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
+	CKPT_OBJ_NETDEV,
+#define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -313,6 +323,7 @@ struct ckpt_hdr_tail {
 /* container configuration section header */
 struct ckpt_hdr_container {
 	struct ckpt_hdr h;
+	__s32 init_netns_ref;
 	/*
 	 * the header is followed by the string:
 	 *   char lsm_name[SECURITY_NAME_MAX + 1]
@@ -434,6 +445,7 @@ struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
 	__s32 ipc_objref;
+	__s32 net_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -758,6 +770,48 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_netns {
+	struct ckpt_hdr h;
+	__s32 this_ref;
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_types {
+	CKPT_NETDEV_LO,
+	CKPT_NETDEV_VETH,
+	CKPT_NETDEV_SIT,
+};
+
+struct ckpt_hdr_netdev {
+	struct ckpt_hdr h;
+ 	__s32 netns_ref;
+	union {
+		struct {
+			__s32 this_ref;
+			__s32 peer_ref;
+		} veth;
+	};
+	__u32 inet_addrs;
+	__u16 type;
+	__u16 flags;
+	__u8 hwaddr[6];
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_addr_types {
+	CKPT_NETDEV_ADDR_IPV4,
+};
+
+struct ckpt_netdev_addr {
+	__u16 type;
+	union {
+		struct {
+			__u32 inet4_local;
+			__u32 inet4_address;
+			__u32 inet4_mask;
+			__u32 inet4_broadcast;
+		};
+	} __attribute__((aligned(8)));
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 51efd5a..e646ec6 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -86,6 +86,7 @@ struct ckpt_ctx {
 	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
 	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 	struct list_head listen_sockets;/* listening parent sockets */
+	int init_netns_ref;             /* Objref of root net namespace */
 
 	struct ckpt_stats stats;	/* statistics */
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b0e71f2..e6c84cd 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,11 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+#ifdef CONFIG_CHECKPOINT_NETNS
+	ret = ckpt_obj_collect(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+#endif
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 	if (ret < 0)
 		goto out;
@@ -288,6 +293,12 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ret < 0)
 		goto out;
 	h->ipc_objref = ret;
+#ifdef CONFIG_CHECKPOINT_NETNS
+	ret = checkpoint_obj(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+	h->net_objref = ret;
+#endif
 
 	/* FIXME: for now, only marked visited to pacify leaks */
 	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
@@ -312,6 +323,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	struct nsproxy *nsproxy = NULL;
 	struct uts_namespace *uts_ns;
 	struct ipc_namespace *ipc_ns;
+	struct net *net_ns;
 	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
@@ -333,6 +345,14 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(ipc_ns);
 		goto out;
 	}
+	if (h->net_objref == 0)
+		net_ns = current->nsproxy->net_ns;
+	else
+		net_ns = ckpt_obj_fetch(ctx, h->net_objref, CKPT_OBJ_NET_NS);
+	if (IS_ERR(net_ns)) {
+		ret = PTR_ERR(net_ns);
+		goto out;
+	}
 
 #if defined(COFNIG_UTS_NS) || defined(CONFIG_IPC_NS)
 	ret = -ENOMEM;
@@ -344,13 +364,13 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	nsproxy->uts_ns = uts_ns;
 	get_ipc_ns(ipc_ns);
 	nsproxy->ipc_ns = ipc_ns;
+	get_net(net_ns);
+	nsproxy->net_ns = net_ns;
 
 	get_pid_ns(current->nsproxy->pid_ns);
 	nsproxy->pid_ns = current->nsproxy->pid_ns;
 	get_mnt_ns(current->nsproxy->mnt_ns);
 	nsproxy->mnt_ns = current->nsproxy->mnt_ns;
-	get_net(current->nsproxy->net_ns);
-	nsproxy->net_ns = current->nsproxy->net_ns;
 #else
 	nsproxy = current->nsproxy;
 	get_nsproxy(nsproxy);
diff --git a/net/Kconfig b/net/Kconfig
index 041c35e..64dd3cd 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -276,4 +276,8 @@ source "net/wimax/Kconfig"
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config CHECKPOINT_NETNS
+       bool
+       default y if NET && NET_NS && CHECKPOINT
+
 endif   # if NET
diff --git a/net/Makefile b/net/Makefile
index 74b038f..570ee98 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -67,3 +67,4 @@ endif
 obj-$(CONFIG_WIMAX)		+= wimax/
 
 obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
+obj-$(CONFIG_CHECKPOINT_NETNS)	+= checkpoint_dev.o
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
new file mode 100644
index 0000000..3515560
--- /dev/null
+++ b/net/checkpoint_dev.c
@@ -0,0 +1,703 @@
+/*
+ *  Copyright 2010 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/sched.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/inetdevice.h>
+#include <linux/veth.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/deferqueue.h>
+
+#include <net/net_namespace.h>
+#include <net/sch_generic.h>
+
+struct dq_netdev {
+	struct net_device *dev;
+	struct ckpt_ctx *ctx;
+};
+
+static int __kern_devinet_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = devinet_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = dev_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static struct socket *rtnl_open(void)
+{
+	struct socket *sock;
+	int ret;
+
+	ret = sock_create(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE, &sock);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	return sock;
+}
+
+static int rtnl_close(struct socket *rtnl)
+{
+	if (rtnl)
+		return kernel_sock_shutdown(rtnl, SHUT_RDWR);
+	else
+		return 0;
+}
+
+static struct nlmsghdr *rtnl_get_response(struct socket *rtnl,
+					  struct sk_buff **skb)
+{
+	int ret;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+	struct nlmsghdr *nlh;
+
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (!ret)
+		return ERR_PTR(-EPIPE);
+
+	*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	if (!*skb)
+		return ERR_PTR(-EPIPE);
+
+	ret = -EINVAL;
+	nlh = nlmsg_hdr(*skb);
+	if (!nlh)
+		goto err;
+
+	if (nlh->nlmsg_type == NLMSG_ERROR) {
+		struct nlmsgerr *errmsg = nlmsg_data(nlh);
+		ret = errmsg->error;
+		goto err;
+	}
+
+	return nlh;
+ err:
+	kfree_skb(*skb);
+	*skb = NULL;
+
+	return ERR_PTR(ret);
+}
+
+int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	return dev->nd_net == current->nsproxy->net_ns;
+}
+
+int ckpt_netdev_hwaddr(struct net_device *dev, struct ckpt_hdr_netdev *h)
+{
+	struct net *net = dev->nd_net;
+	struct ifreq req;
+	int ret;
+
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
+	h->flags = req.ifr_flags;
+	if (ret < 0)
+		return ret;
+
+	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
+	if (ret < 0)
+		return ret;
+
+	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
+
+	return 0;
+}
+
+int ckpt_netdev_inet_addrs(struct in_device *indev,
+			   struct ckpt_netdev_addr *_abuf[])
+{
+	struct ckpt_netdev_addr *abuf = NULL;
+	struct in_ifaddr *addr = indev->ifa_list;
+	int pages = 0;
+	int addrs = 0;
+	int max;
+
+ retry:
+	if (++pages > 4) {
+		addrs = -E2BIG;
+		goto out;
+	}
+
+	*_abuf = krealloc(abuf, PAGE_SIZE * pages, GFP_KERNEL);
+	if (*_abuf == NULL) {
+		addrs = -ENOMEM;
+		goto out;
+	}
+	abuf = *_abuf;
+
+	read_lock(&dev_base_lock);
+
+	max = (pages * PAGE_SIZE) / sizeof(*abuf);
+	while (addr) {
+		abuf[addrs].type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 now */
+		abuf[addrs].inet4_local = addr->ifa_local;
+		abuf[addrs].inet4_address = addr->ifa_address;
+		abuf[addrs].inet4_mask = addr->ifa_mask;
+		abuf[addrs].inet4_broadcast = addr->ifa_broadcast;
+
+		addr = addr->ifa_next;
+		if (++addrs >= max) {
+			read_unlock(&dev_base_lock);
+			goto retry;
+		}
+	}
+
+	read_unlock(&dev_base_lock);
+ out:
+	if (addrs < 0) {
+		kfree(abuf);
+		*_abuf = NULL;
+	}
+
+	return addrs;
+}
+
+struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+					 struct net_device *dev,
+					 struct ckpt_netdev_addr *addrs[])
+{
+	struct ckpt_hdr_netdev *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_netdev_hwaddr(dev, h);
+	if (ret < 0)
+		goto out;
+
+	*addrs = NULL;
+	ret = h->inet_addrs = ckpt_netdev_inet_addrs(dev->ip_ptr, addrs);
+	if (ret < 0) {
+		if (ret == -E2BIG)
+			ckpt_err(ctx, ret,
+				 "Too many inet addresses on interface %s\n",
+				 dev->name);
+		goto out;
+	}
+
+	if (ckpt_netdev_in_init_netns(ctx, dev))
+		ret = h->netns_ref = 0;
+	else
+		ret = h->netns_ref = checkpoint_obj(ctx, dev->nd_net,
+						    CKPT_OBJ_NET_NS);
+ out:
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+		if (*addrs)
+			kfree(*addrs);
+	}
+
+	return h;
+}
+
+int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net_device *dev = (struct net_device *)ptr;
+
+	if (!dev->netdev_ops->ndo_checkpoint) {
+		ckpt_err(ctx, -ENOSYS,
+			 "Device %s does not support checkpoint\n", dev->name);
+		return -ENOSYS;
+	}
+
+	ckpt_debug("checkpointing netdev %s\n", dev->name);
+
+	return dev->netdev_ops->ndo_checkpoint(ctx, dev);
+}
+
+int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net *net = ptr;
+	struct net_device *dev;
+	struct ckpt_hdr_netns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	BUG_ON(h->this_ref == 0);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	for_each_netdev(net, dev) {
+		if (!dev->netdev_ops->ndo_checkpoint) {
+			ckpt_debug("Device %s does not support checkpoint\n",
+				   dev->name);
+			ret = -ENOSYS;
+			break;
+		}
+
+		ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static int restore_in_addrs(struct ckpt_ctx *ctx,
+			    __u32 naddrs,
+			    struct net *net,
+			    struct net_device *dev)
+{
+	__u32 i;
+	int ret = 0;
+	int len = naddrs * sizeof(struct ckpt_netdev_addr);
+	struct ckpt_netdev_addr *addrs = NULL;
+
+	addrs = kmalloc(len, GFP_KERNEL);
+	if (!addrs)
+		return -ENOMEM;
+
+	ret = _ckpt_read_buffer(ctx, addrs, len);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < naddrs; i++) {
+		struct ckpt_netdev_addr *addr = &addrs[i];
+		struct ifreq req;
+		struct sockaddr_in *inaddr;
+
+		if (addr->type != CKPT_NETDEV_ADDR_IPV4) {
+			ret = -EINVAL;
+			ckpt_err(ctx, ret, "Unsupported netdev addr type %i\n",
+				 addr->type);
+			break;
+		}
+
+		ckpt_debug("restoring %s: %x/%x/%x\n", dev->name,
+			   addr->inet4_address,
+			   addr->inet4_mask,
+			   addr->inet4_broadcast);
+
+		memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = addr->inet4_address;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set address\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = addr->inet4_mask;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFNETMASK, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set netmask\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = addr->inet4_broadcast;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFBRDADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set broadcast\n");
+			break;
+		}
+	}
+
+ out:
+	kfree(addrs);
+
+	return ret;
+}
+
+static int veth_peer_data(struct sk_buff *skb, char *peer_name)
+{
+	struct nlattr *linkdata;
+	struct ifinfomsg ifm = { 0 };	/* don't send stack garbage */
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata)
+		return -ENOMEM;
+
+	nla_put(skb, VETH_INFO_PEER, sizeof(ifm), &ifm);
+	nla_put_string(skb, IFLA_IFNAME, peer_name);
+
+	nla_nest_end(skb, linkdata);
+
+	return 0;
+}
+
+static struct sk_buff *new_link_message(char *this_name, char *peer_name)
+{
+	int ret = -ENOMEM;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb;
+	struct ifinfomsg *ifm;
+	struct nlattr *linkinfo;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		goto out;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWLINK, sizeof(*ifm), flags);
+	if (!nlh)
+		goto out;
+
+	ifm = nlmsg_data(nlh);
+	memset(ifm, 0, sizeof(*ifm));
+
+	ret = nla_put_string(skb, IFLA_IFNAME, this_name);
+	if (ret)
+		goto out;
+
+	ret = -ENOMEM;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo)
+		goto out;
+
+	if (nla_put_string(skb, IFLA_INFO_KIND, "veth") < 0)
+		goto out;
+
+	ret = veth_peer_data(skb, peer_name);
+	if (ret < 0)
+		goto out;
+
+	nla_nest_end(skb, linkinfo);
+	nlmsg_end(skb, nlh);
+
+ out:
+	if (ret < 0) {
+		kfree_skb(skb);
+		skb = ERR_PTR(ret);
+	}
+
+	return skb;
+}
+
+static struct net_device *new_veth_pair(char *this_name, char *peer_name)
+{
+	int ret = -ENOMEM;
+	struct socket *rtnl = NULL;
+	struct sk_buff *skb = NULL;
+	struct nlmsghdr *nlh;
+	struct msghdr msg;
+	struct kvec kvec;
+
+	skb = new_link_message(this_name, peer_name);
+	if (IS_ERR(skb)) {
+		ret = PTR_ERR(skb);
+		ckpt_debug("failed to create new link message: %i\n", ret);
+		skb = NULL;
+		goto out;
+	}
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	rtnl = rtnl_open();
+	if (IS_ERR(rtnl)) {
+		ret = PTR_ERR(rtnl);
+		ckpt_debug("Unable to open rtnetlink socket: %i\n", ret);
+		goto out_noclose;
+	}
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if (ret < 0)
+		goto out;
+	else if (ret != skb->len) {
+		ret = -EIO;
+		goto out;
+	}
+
+	/* Free the send skb to make room for the receive skb */
+	kfree_skb(skb);
+
+	nlh = rtnl_get_response(rtnl, &skb);
+	if (IS_ERR(nlh)) {
+		ret = PTR_ERR(nlh);
+		ckpt_debug("RTNETLINK said: %i\n", ret);
+	}
+ out:
+	rtnl_close(rtnl);
+ out_noclose:
+	kfree_skb(skb);
+
+	if (ret < 0)
+		return ERR_PTR(ret);
+	else
+		return dev_get_by_name(current->nsproxy->net_ns, this_name);
+}
+
+static int netdev_noop(void *data)
+{
+	return 0;
+}
+
+static int netdev_cleanup(void *data)
+{
+	struct dq_netdev *dq = data;
+
+	dev_put(dq->dev);
+
+	if (dq->ctx->errno) {
+		ckpt_debug("Unregistering netdev %s\n", dq->dev->name);
+		unregister_netdev(dq->dev);
+	}
+
+	return 0;
+}
+
+static struct net_device *restore_veth(struct ckpt_ctx *ctx,
+				       struct ckpt_hdr_netdev *h,
+				       struct net *net)
+{
+	int ret;
+	char this_name[IFNAMSIZ];
+	char peer_name[IFNAMSIZ];
+	struct net_device *dev;
+	struct net_device *peer;
+	int didreg = 0;
+	struct ifreq req;
+	struct dq_netdev dq;
+
+	dq.ctx = ctx;
+
+	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, peer_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ckpt_debug("restored veth netdev %s:%s\n", this_name, peer_name);
+
+	peer = ckpt_obj_try_fetch(ctx, h->veth.peer_ref, CKPT_OBJ_NETDEV);
+	if (IS_ERR(peer)) {
+		/* We're first: allocate the veth pair */
+		didreg = 1;
+		dev = new_veth_pair(this_name, peer_name);
+		if (IS_ERR(dev))
+			return dev;
+
+		peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
+		if (!peer) {
+			ret = -EINVAL;
+			goto err_dev;
+		}
+
+		dq.dev = peer;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			goto err_peer;
+
+		ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
+				      CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+		dev_put(peer);
+
+		dq.dev = dev;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+
+	} else {
+		/* We're second: get our dev from the hash */
+		dev = ckpt_obj_fetch(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
+		if (IS_ERR(dev))
+			return dev;
+	}
+
+	/* Move to our new netns */
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+	if (ret < 0)
+		goto out;
+
+	/* Restore MAC address */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
+	req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+	ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
+ out:
+	if (ret)
+		dev = ERR_PTR(ret);
+
+	return dev;
+
+ err_peer:
+	dev_put(peer);
+	unregister_netdev(peer);
+ err_dev:
+	dev_put(dev);
+	unregister_netdev(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_lo(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_netdev *h,
+				     struct net *net)
+{
+	struct net_device *dev;
+	char name[IFNAMSIZ+1];
+	int ret;
+
+	dev = dev_get_by_name(net, "lo");
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	ret = _ckpt_read_buffer(ctx, name, IFNAMSIZ);
+	if (ret < 0)
+		goto err;
+
+	if (strncmp(dev->name, name, IFNAMSIZ) != 0) {
+		ret = dev_change_name(dev, name);
+		if (ret < 0)
+			goto err;
+	}
+
+	return dev;
+ err:
+	dev_put(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_sit(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_netdev *h,
+				      struct net *net)
+{
+	/* Don't actually do anything for SIT devices yet */
+	return dev_get_by_name(net, "sit0");
+}
+
+void *restore_netdev(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netdev *h;
+	struct net_device *dev = NULL;
+	struct ifreq req;
+	struct net *net;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netdev\n");
+		return h;
+	}
+
+	if (h->netns_ref != 0) {
+		net = ckpt_obj_try_fetch(ctx, h->netns_ref, CKPT_OBJ_NET_NS);
+		if (IS_ERR(net)) {
+			ckpt_debug("failed to get net for %i\n", h->netns_ref);
+			ret = PTR_ERR(net);
+			net = current->nsproxy->net_ns;
+			goto out;
+		}
+	} else
+		net = current->nsproxy->net_ns;
+
+	if (h->type == CKPT_NETDEV_VETH)
+		dev = restore_veth(ctx, h, net);
+	else if (h->type == CKPT_NETDEV_LO)
+		dev = restore_lo(ctx, h, net);
+	else if (h->type == CKPT_NETDEV_SIT)
+		dev = restore_sit(ctx, h, net);
+	else
+		dev = ERR_PTR(-EINVAL);
+
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		ckpt_err(ctx, ret, "Netdev type %i not supported\n", h->type);
+		goto out;
+	}
+
+	/* Restore flags (which will likely bring the interface up) */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	req.ifr_flags = h->flags;
+	ret = __kern_dev_ioctl(net, SIOCSIFFLAGS, &req);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0)
+		ret = restore_in_addrs(ctx, h->inet_addrs, net, dev);
+ out:
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to restore netdevice\n");
+		if ((h->type == CKPT_NETDEV_VETH) && !IS_ERR(dev)) {
+			dev_put(dev);
+		}
+		dev = ERR_PTR(ret);
+	} else
+		ckpt_debug("restored netdev %s\n", dev->name);
+
+	ckpt_hdr_put(ctx, h);
+
+	return dev;
+}
+
+void *restore_netns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netns *h;
+	struct net *net;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netns\n");
+		return h;
+	}
+
+	if (h->this_ref != 0) {
+		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
+		if (IS_ERR(net))
+			goto out;
+	} else
+		net = current->nsproxy->net_ns;
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return net;
+}
-- 
1.6.2.5



* [PATCH 3/5] Add checkpoint support for veth devices (v2)
  2010-02-16 16:03 Network device and namespace checkpoint/restart (v3) Dan Smith
  2010-02-16 16:03 ` [PATCH 1/5] Add checkpoint and collect hooks to net_device_ops Dan Smith
  2010-02-16 16:03 ` [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4) Dan Smith
@ 2010-02-16 16:03 ` Dan Smith
       [not found]   ` <1266336187-19105-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-02-16 16:03 ` [PATCH 4/5] Add loopback checkpoint support Dan Smith
  2010-02-16 16:03 ` [PATCH 5/5] Add a checkpoint handler to the 'sit' device Dan Smith
  4 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:03 UTC (permalink / raw)
  To: containers, netdev

Adds an ndo_checkpoint() handler for veth devices to checkpoint themselves.
Writes out the pairing information, addresses, and initiates a checkpoint
on the peer if the peer won't be reached from another netns.  Throws an
error if our peer's netns isn't already in the hash (i.e., a tree leak).

Changes in v2:
 - Fix check detecting if peer is in the init netns

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 drivers/net/veth.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3a15de5..db92de8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -16,6 +16,9 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
@@ -284,6 +287,76 @@ static void veth_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int veth_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer = priv->peer;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+	int n;
+
+	if (!peer) {
+		ckpt_err(ctx, -EINVAL, "veth device has no peer!\n");
+		return -EINVAL;
+	}
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_VETH;
+
+	ret = h->veth.this_ref = ckpt_obj_lookup_add(ctx, dev,
+						     CKPT_OBJ_NETDEV, &n);
+	if (ret < 0)
+		goto out;
+
+	ret = h->veth.peer_ref = ckpt_obj_lookup_add(ctx, peer,
+						     CKPT_OBJ_NETDEV, &n);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, peer->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+		if (ret)
+			goto out;
+	}
+
+	/* Only checkpoint peer if we're not going to arrive at it
+	 * via another task's netns.  Fail if the pipe exits
+	 * our container to a netns not already in the hash
+	 */
+	if (ckpt_netdev_in_init_netns(ctx, peer))
+		ret = checkpoint_obj(ctx, peer, CKPT_OBJ_NETDEV);
+	else if (!ckpt_obj_lookup(ctx, peer->nd_net, CKPT_OBJ_NET_NS)) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret,
+			 "Peer %s of %s not in checkpointed namespaces\n",
+			 peer->name, dev->name);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -292,6 +365,9 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_change_mtu      = veth_change_mtu,
 	.ndo_get_stats       = veth_get_stats,
 	.ndo_set_mac_address = eth_mac_addr,
+#ifdef CONFIG_CHECKPOINT
+	.ndo_checkpoint      = veth_checkpoint,
+#endif
 };
 
 static void veth_setup(struct net_device *dev)
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] Add loopback checkpoint support
  2010-02-16 16:03 Network device and namespace checkpoint/restart (v3) Dan Smith
                   ` (2 preceding siblings ...)
  2010-02-16 16:03 ` [PATCH 3/5] Add checkpoint support for veth devices (v2) Dan Smith
@ 2010-02-16 16:03 ` Dan Smith
  2010-02-16 16:09   ` Eric Dumazet
  2010-02-16 16:03 ` [PATCH 5/5] Add a checkpoint handler to the 'sit' device Dan Smith
  4 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:03 UTC (permalink / raw)
  To: containers, netdev

Adds a small ndo_checkpoint() handler for loopback devices to write the
name and addresses like other interfaces.

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 drivers/net/loopback.c |   41 ++++++++++++++++++++++++++++++++++++++---
 1 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index b9fcc98..816a527 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -57,6 +57,8 @@
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/percpu.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
 #include <net/net_namespace.h>
 
 struct pcpu_lstats {
@@ -153,10 +155,43 @@ static void loopback_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+static int loopback_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_LO;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+
 static const struct net_device_ops loopback_ops = {
-	.ndo_init      = loopback_dev_init,
-	.ndo_start_xmit= loopback_xmit,
-	.ndo_get_stats = loopback_get_stats,
+	.ndo_init       = loopback_dev_init,
+	.ndo_start_xmit = loopback_xmit,
+	.ndo_get_stats  = loopback_get_stats,
+	.ndo_checkpoint = loopback_checkpoint,
 };
 
 /*
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] Add a checkpoint handler to the 'sit' device
  2010-02-16 16:03 Network device and namespace checkpoint/restart (v3) Dan Smith
                   ` (3 preceding siblings ...)
  2010-02-16 16:03 ` [PATCH 4/5] Add loopback checkpoint support Dan Smith
@ 2010-02-16 16:03 ` Dan Smith
  4 siblings, 0 replies; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:03 UTC (permalink / raw)
  To: containers, netdev

This handler doesn't really do much to checkpoint the device, other
than the minimum required to support the restart process.  When we
add IPv6 support to this, then we can fill this out.

This allows us to avoid skipping unsupported interfaces on a normal
system.

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 net/ipv6/sit.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 976e682..a9fc331 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -32,6 +32,8 @@
 #include <linux/init.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/if_ether.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
 
 #include <net/sock.h>
 #include <net/snmp.h>
@@ -1085,11 +1087,39 @@ static int ipip6_tunnel_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+static int ipip6_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_SIT;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+
 static const struct net_device_ops ipip6_netdev_ops = {
 	.ndo_uninit	= ipip6_tunnel_uninit,
 	.ndo_start_xmit	= ipip6_tunnel_xmit,
 	.ndo_do_ioctl	= ipip6_tunnel_ioctl,
 	.ndo_change_mtu	= ipip6_tunnel_change_mtu,
+	.ndo_checkpoint	= ipip6_checkpoint,
 };
 
 static void ipip6_tunnel_setup(struct net_device *dev)
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] Add loopback checkpoint support
  2010-02-16 16:03 ` [PATCH 4/5] Add loopback checkpoint support Dan Smith
@ 2010-02-16 16:09   ` Eric Dumazet
  2010-02-16 16:13     ` Dan Smith
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2010-02-16 16:09 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, netdev

On Tuesday, 16 February 2010 at 08:03 -0800, Dan Smith wrote:
> Adds a small ndo_checkpoint() handler for loopback devices to write the
> name and addresses like other interfaces.
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>
> ---
>  drivers/net/loopback.c |   41 ++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index b9fcc98..816a527 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -57,6 +57,8 @@
>  #include <linux/ip.h>
>  #include <linux/tcp.h>
>  #include <linux/percpu.h>
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
>  #include <net/net_namespace.h>
>  
>  struct pcpu_lstats {
> @@ -153,10 +155,43 @@ static void loopback_dev_free(struct net_device *dev)
>  	free_netdev(dev);
>  }
>  

Don't you have an #ifdef CONFIG_CHECKPOINT or something to avoid this for
small machines?

> +static int loopback_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
> +{
> +	struct ckpt_hdr_netdev *h;
> +	struct ckpt_netdev_addr *addrs;
> +	int ret;
> +
> +	h = ckpt_netdev_base(ctx, dev, &addrs);
> +	if (IS_ERR(h))
> +		return PTR_ERR(h);
> +
> +	h->type = CKPT_NETDEV_LO;
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (h->inet_addrs > 0) {
> +		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
> +		ret = ckpt_write_buffer(ctx, addrs, len);
> +	}
> +
> + out:
> +	ckpt_hdr_put(ctx, h);
> +	kfree(addrs);
> +
> +	return ret;
> +}
> +
>  static const struct net_device_ops loopback_ops = {
> -	.ndo_init      = loopback_dev_init,
> -	.ndo_start_xmit= loopback_xmit,
> -	.ndo_get_stats = loopback_get_stats,
> +	.ndo_init       = loopback_dev_init,
> +	.ndo_start_xmit = loopback_xmit,
> +	.ndo_get_stats  = loopback_get_stats,
> +	.ndo_checkpoint = loopback_checkpoint,
>  };
>  
>  /*



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] Add loopback checkpoint support
  2010-02-16 16:09   ` Eric Dumazet
@ 2010-02-16 16:13     ` Dan Smith
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Smith @ 2010-02-16 16:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: containers, netdev

ED> Don't you have an #ifdef CONFIG_CHECKPOINT or something to avoid
ED> this for small machines?

Yes, and the veth patch used it appropriately.  It should look like
this:

Add loopback checkpoint support (v2)

Adds a small ndo_checkpoint() handler for loopback devices to write the
name and addresses like other interfaces.

Changes in v2:
 - Add CONFIG_CHECKPOINT around the handler

Signed-off-by: Dan Smith <danms@us.ibm.com>

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index b9fcc98..77023a7 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -57,6 +57,8 @@
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/percpu.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
 #include <net/net_namespace.h>
 
 struct pcpu_lstats {
@@ -153,10 +155,46 @@ static void loopback_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int loopback_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_LO;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
 static const struct net_device_ops loopback_ops = {
-	.ndo_init      = loopback_dev_init,
-	.ndo_start_xmit= loopback_xmit,
-	.ndo_get_stats = loopback_get_stats,
+	.ndo_init       = loopback_dev_init,
+	.ndo_start_xmit = loopback_xmit,
+	.ndo_get_stats  = loopback_get_stats,
+#ifdef CONFIG_CHECKPOINT
+	.ndo_checkpoint = loopback_checkpoint,
+#endif
 };
 
 /*


-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] Add checkpoint support for veth devices (v2)
       [not found]   ` <1266336187-19105-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-02-22 19:56     ` Serge E. Hallyn
  2010-02-22 20:25       ` Dan Smith
  0 siblings, 1 reply; 16+ messages in thread
From: Serge E. Hallyn @ 2010-02-22 19:56 UTC (permalink / raw)
  To: Dan Smith
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, netdev-u79uwXL29TY76Z2rM5mHXA

Quoting Dan Smith (danms@us.ibm.com):
> Adds an ndo_checkpoint() handler for veth devices to checkpoint themselves.
> Writes out the pairing information, addresses, and initiates a checkpoint
> on the peer if the peer won't be reached from another netns.  Throws an
> error if our peer's netns isn't already in the hash (i.e., a tree leak).
> 
> Changes in v2:
>  - Fix check detecting if peer is in the init netns
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>
> ---
>  drivers/net/veth.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 76 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 3a15de5..db92de8 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -16,6 +16,9 @@
>  #include <net/xfrm.h>
>  #include <linux/veth.h>
> 
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
>  #define DRV_NAME	"veth"
>  #define DRV_VERSION	"1.0"
> 
> @@ -284,6 +287,76 @@ static void veth_dev_free(struct net_device *dev)
>  	free_netdev(dev);
>  }
> 
> +#ifdef CONFIG_CHECKPOINT
> +static int veth_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
> +{
> +	struct ckpt_hdr_netdev *h;
> +	struct veth_priv *priv = netdev_priv(dev);
> +	struct net_device *peer = priv->peer;
> +	struct ckpt_netdev_addr *addrs;
> +	int ret;
> +	int n;
> +
> +	if (!peer) {
> +		ckpt_err(ctx, -EINVAL, "veth device has no peer!\n");
> +		return -EINVAL;
> +	}
> +
> +	h = ckpt_netdev_base(ctx, dev, &addrs);
> +	if (IS_ERR(h))
> +		return PTR_ERR(h);
> +
> +	h->type = CKPT_NETDEV_VETH;
> +
> +	ret = h->veth.this_ref = ckpt_obj_lookup_add(ctx, dev,
> +						     CKPT_OBJ_NETDEV, &n);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = h->veth.peer_ref = ckpt_obj_lookup_add(ctx, peer,
> +						     CKPT_OBJ_NETDEV, &n);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_buffer(ctx, peer->name, IFNAMSIZ);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (h->inet_addrs > 0) {
> +		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
> +		ret = ckpt_write_buffer(ctx, addrs, len);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	/* Only checkpoint peer if we're not going to arrive at it
> +	 * via another task's netns.  Fail if the pipe exits
> +	 * our container to a netns not already in the hash
> +	 */
> +	if (ckpt_netdev_in_init_netns(ctx, peer))
> +		ret = checkpoint_obj(ctx, peer, CKPT_OBJ_NETDEV);
> +	else if (!ckpt_obj_lookup(ctx, peer->nd_net, CKPT_OBJ_NET_NS)) {
> +		ret = -EINVAL;
> +		ckpt_err(ctx, ret,
> +			 "Peer %s of %s not in checkpointed namespaces\n",
> +			 peer->name, dev->name);

I'm not sure this check does what you think it does:  note that
ckpt_netdev_base(), defined in the previous patch, and called higher
up in this function, is going to checkpoint peer->nd_net.   :)

(right?)

> +	}
> + out:
> +	ckpt_hdr_put(ctx, h);
> +	kfree(addrs);
> +
> +	return ret;
> +}
> +#endif
> +
>  static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_init            = veth_dev_init,
>  	.ndo_open            = veth_open,
> @@ -292,6 +365,9 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_change_mtu      = veth_change_mtu,
>  	.ndo_get_stats       = veth_get_stats,
>  	.ndo_set_mac_address = eth_mac_addr,
> +#ifdef CONFIG_CHECKPOINT
> +	.ndo_checkpoint      = veth_checkpoint,
> +#endif
>  };
> 
>  static void veth_setup(struct net_device *dev)
> -- 
> 1.6.2.5
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] Add checkpoint support for veth devices (v2)
  2010-02-22 19:56     ` Serge E. Hallyn
@ 2010-02-22 20:25       ` Dan Smith
  2010-02-22 20:57         ` Serge E. Hallyn
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2010-02-22 20:25 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers, netdev

>> +	else if (!ckpt_obj_lookup(ctx, peer->nd_net, CKPT_OBJ_NET_NS)) {
>> +		ret = -EINVAL;
>> +		ckpt_err(ctx, ret,
>> +			 "Peer %s of %s not in checkpointed namespaces\n",
>> +			 peer->name, dev->name);

SH> I'm not sure this check does what you think it does: note that
SH> ckpt_netdev_base(), defined in the previous patch, and called
SH> higher up in this function, is going to checkpoint peer->nd_net.
SH> :)

Actually, no, ckpt_netdev_base() can't checkpoint peer->nd_net because
it's device-agnostic and has no knowledge of dev->peer.

The idea here was that we checkpoint a netns when we arrive at it via
nsproxy.  Doing that, we checkpoint the devices within.  We encounter
a veth device, which has a peer, so we decide if:

 1. We won't arrive at the peer later because it is in the init
    namespace, so we checkpoint it now.
 2. We will arrive at it later because the peer's netns is in the list
    we've already collected, so checkpoint the peer with its namespace
 3. Neither is true, so we won't arrive at it later and therefore
    can't allow the checkpoint to continue

#2 depends on the collect process having put all of the tasks' netns's in
 the hash ahead of time.
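
The three cases above can be sketched as a small userspace model.  This is
only an illustration under stated assumptions: `struct netns`,
`struct netdev`, `enum peer_action`, `netns_collected()` and
`classify_peer()` are hypothetical stand-ins, not the kernel structures or
the helpers from this series, and the "hash" is modelled as a plain array
of collected namespaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct netns { int id; };
struct netdev { const char *name; struct netns *ns; };

enum peer_action {
	PEER_CHECKPOINT_NOW,	/* case 1: peer lives in the "init" netns */
	PEER_DEFER,		/* case 2: peer's netns was already collected */
	PEER_LEAK,		/* case 3: peer is unreachable, so fail */
};

/* Return true if ns is among the namespaces gathered by the collect pass. */
static bool netns_collected(struct netns *const *collected, size_t n,
			    const struct netns *ns)
{
	for (size_t i = 0; i < n; i++)
		if (collected[i] == ns)
			return true;
	return false;
}

/* Decide what to do with a veth peer, checking the cases in order. */
static enum peer_action classify_peer(const struct netdev *peer,
				      const struct netns *init_ns,
				      struct netns *const *collected, size_t n)
{
	assert(peer != NULL && peer->ns != NULL);

	if (peer->ns == init_ns)
		return PEER_CHECKPOINT_NOW;
	if (netns_collected(collected, n, peer->ns))
		return PEER_DEFER;
	return PEER_LEAK;
}
```

In this model only PEER_LEAK would map to the -EINVAL path in
veth_checkpoint(); PEER_DEFER needs no action because the peer is written
out when its own namespace is reached.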

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] Add checkpoint support for veth devices (v2)
  2010-02-22 20:25       ` Dan Smith
@ 2010-02-22 20:57         ` Serge E. Hallyn
  2010-02-22 21:01           ` Dan Smith
  0 siblings, 1 reply; 16+ messages in thread
From: Serge E. Hallyn @ 2010-02-22 20:57 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, netdev

Quoting Dan Smith (danms@us.ibm.com):
> >> +	else if (!ckpt_obj_lookup(ctx, peer->nd_net, CKPT_OBJ_NET_NS)) {
> >> +		ret = -EINVAL;
> >> +		ckpt_err(ctx, ret,
> >> +			 "Peer %s of %s not in checkpointed namespaces\n",
> >> +			 peer->name, dev->name);
> 
> SH> I'm not sure this check does what you think it does: note that
> SH> ckpt_netdev_base(), defined in the previous patch, and called
> SH> higher up in this function, is going to checkpoint peer->nd_net.
> SH> :)
> 
> Actually, no, ckpt_netdev_base() can't checkpoint peer->nd_net because
> it's device-agnostic and has no knowledge of dev->peer.

Oh, ok.

> The idea here was that we checkpoint a netns when we arrive at it via
> nsproxy.  Doing that, we checkpoint the devices within.  We encounter
> a veth device, which has a peer, so we decide if:
> 
>  1. We won't arrive at the peer later because it is in the init
>     namespace, so we checkpoint it now.
>  2. We will arrive at it later because the peer's netns is in the list
>     we've already collected, so checkpoint the peer with its namespace
>  3. Neither is true, so we won't arrive at it later and therefore
>     can't allow the checkpoint to continue
> 
> #2 depends on the collect process having put all of the tasks' netns's in
>  the hash ahead of time.

Right, that was what I was originally starting to hunt down when I
thought I saw ckpt_netdev_base() checkpointing peer's netns.

So do you actually know that the peer's netns will have been
checkpointed?  I'm a little fuzzy about where netns and netdevs
are checkpointed.  If you have two private netns's in a container,
with a veth connecting them, and you checkpoint a task in netns 1,
will you fail because netns 2 hasn't been checkpointed yet, since no
task in it has been checkpointed yet?

-serge

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] Add checkpoint support for veth devices (v2)
  2010-02-22 20:57         ` Serge E. Hallyn
@ 2010-02-22 21:01           ` Dan Smith
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Smith @ 2010-02-22 21:01 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers, netdev

SH> So do you actually know that the peer's netns will have been
SH> checkpointed?  I'm a little fuzzy about where netns and netdevs
SH> are checkpointed.  If you have two private netns's in a container,
SH> with a veth connecting them, and you checkpoint a task in netns 1,
SH> will you fail because netns 2 hasn't been checkpointed yet, since
SH> no task in it has been checkpointed yet?

Nope, because they're collect()'ed before we checkpoint().
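
A rough userspace sketch of why the ordering matters, assuming the hash can
be modelled as a small set of netns ids.  `struct ns_hash`,
`collect_all()` and `peer_reachable()` are hypothetical names used only for
illustration; they are not the checkpoint API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NS 8

/* Hypothetical model of the object hash: just a small set of netns ids. */
struct ns_hash {
	int ids[MAX_NS];
	size_t count;
};

static bool ns_hash_contains(const struct ns_hash *h, int ns_id)
{
	for (size_t i = 0; i < h->count; i++)
		if (h->ids[i] == ns_id)
			return true;
	return false;
}

/* Phase 1: record the netns of every task before anything is written. */
static void collect_all(struct ns_hash *h, const int *task_ns, size_t ntasks)
{
	for (size_t i = 0; i < ntasks; i++) {
		if (ns_hash_contains(h, task_ns[i]))
			continue;	/* already recorded by an earlier task */
		assert(h->count < MAX_NS);
		h->ids[h->count++] = task_ns[i];
	}
}

/* Phase 2 check: a veth peer is acceptable if its netns was collected. */
static bool peer_reachable(const struct ns_hash *h, int peer_ns)
{
	return ns_hash_contains(h, peer_ns);
}
```

With tasks in netns 1 and netns 2, the collect pass records both before the
first task is checkpointed, so a veth peer in netns 2 passes the lookup
even though nothing in netns 2 has been written yet.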

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4)
       [not found]   ` <20100222194523.GA13135@us.ibm.com>
@ 2010-02-23 16:35     ` Dan Smith
  2010-02-23 16:47       ` Serge E. Hallyn
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2010-02-23 16:35 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers, netdev

SH> the above two hunks change the flow in checkpoint_container(), but
SH> they don't seem to actually add anything.  And I don't see (with a
SH> quick browse) any later patch in this series changing this either.
SH> Is this just noise?

Ah, yeah, I think that's left over from a previous version where I had
to insert something there.  Sorry about that :)

>> +int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
>> +{
>> +	return dev->nd_net == current->nsproxy->net_ns;
>> +}

SH> You are comparing it to the net_ns of the checkpointing task.  I'm
SH> not sure that makes sense - but I'm also not sure what if anything
SH> makes more sense.

SH> What exactly do you mean by the 'init' netns here?  Do you mean
SH> the init_net_ns for the container, or that it is the net_ns of
SH> whatever task created the container?

In this case, 'current' is the task doing the checkpoint, right?  So,
we're treating the netns that it is in as the "top level" and will
restore the tree, as visible from that task, relative to the netns of
the restart process.  We had an IRC conversation about this, I believe :)
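
To make the "viewpoint" dependence concrete, here is a minimal userspace
analogue of the quoted helper.  It is a sketch only: the structs are
hypothetical stand-ins, and the checkpointing task is passed explicitly
instead of being read from 'current'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields needed for the comparison. */
struct netns { int id; };
struct task { struct netns *net_ns; };
struct netdev { struct netns *nd_net; };

/*
 * Userspace analogue of the quoted helper: "init" means the netns of the
 * task performing the checkpoint, which is passed explicitly here.
 */
static bool netdev_in_init_netns(const struct task *checkpointer,
				 const struct netdev *dev)
{
	assert(checkpointer != NULL && dev != NULL);
	return dev->nd_net == checkpointer->net_ns;
}
```

The same device is classified differently depending on who runs the
checkpoint: a veth end in the host netns counts as being in the init netns
for a checkpointer on the host, but not for one running inside the
container.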

SH> How about a
SH> 			ckpt_err(ctx, -ENOSYS,
SH> 				 "Device %s does not support checkpoint\n",
SH> 				 dev->name);

SH> here to put a meaningful msg in the user's log?

Yep, definitely.

Thanks!

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4)
  2010-02-23 16:35     ` Dan Smith
@ 2010-02-23 16:47       ` Serge E. Hallyn
  2010-02-23 17:27         ` Dan Smith
  0 siblings, 1 reply; 16+ messages in thread
From: Serge E. Hallyn @ 2010-02-23 16:47 UTC (permalink / raw)
  To: Dan Smith; +Cc: Serge E. Hallyn, containers, netdev

Quoting Dan Smith (danms@us.ibm.com):
> SH> the above two hunks change the flow in checkpoint_container(), but
> SH> they don't seem to actually add anything.  And I don't see (with a
> SH> quick browse) any later patch in this series changing this either.
> SH> Is this just noise?
> 
> Ah, yeah, I think that's left over from a previous version where I had
> to insert something there.  Sorry about that :)
> 
> >> +int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
> >> +{
> >> +	return dev->nd_net == current->nsproxy->net_ns;
> >> +}
> 
> SH> You are comparing it to the net_ns of the checkpointing task.  I'm
> SH> not sure that makes sense - but I'm also not sure what if anything
> SH> makes more sense.
> 
> SH> What exactly do you mean by the 'init' netns here?  Do you mean
> SH> the init_net_ns for the container, or that it is the net_ns of
> SH> whatever task created the container?
> 
> In this case, 'current' is the task doing the checkpoint, right?  So,
> we're treating the netns that it is in as the "top level" and will
> restore the tree, as visible from that task, relative to the netns of
> the restart process.  We had an IRC conversation about this, I believe :)

But there is no guarantee that the checkpointer is in the netns which
we would call the 'top level' netns.  That means that, at restart, whether
the devices in what we call the top level netns are in fact inherited
will depend on conditions of the checkpointer.  Do
we care?  (I thought we did, but maybe we don't... it's unlikely to happen
anyway)

> SH> How about a
> SH> 			ckpt_err(ctx, -ENOSYS,
> SH> 				 "Device %s does not support checkpoint\n",
> SH> 				 dev->name);
> 
> SH> here to put a meaningful msg in the user's log?
> 
> Yep, definitely.
> 
> Thanks!
> 
> -- 
> Dan Smith
> IBM Linux Technology Center
> email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4)
  2010-02-23 16:47       ` Serge E. Hallyn
@ 2010-02-23 17:27         ` Dan Smith
  2010-02-23 18:49           ` Serge E. Hallyn
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2010-02-23 17:27 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Serge E. Hallyn, containers, netdev

SH> But there is no guarantee that the checkpointer is in the netns
SH> which we would call the 'top level' netns.  That means that, at
SH> restart, whether the devices in what we call the top level netns
SH> are in fact inherited will depend on
SH> conditions of the checkpointer.  Do we care?  (I thought we did,
SH> but maybe we don't... it's unlikely to happen anyway)

Well, when we discussed this on IRC with Oren, I think we came to the
conclusion that, since network namespaces aren't hierarchical, we
would restore things from the "viewpoint" of the process that
checkpointed them.  It gives us a sane way to ensure that the peer
devices residing in the init netns can be put back there, even though we
don't checkpoint everything in the init netns (like eth0).

If you checkpoint a veth from within the container and you have a peer
device that is outside the container (but not in a netns that is
checkpointed as part of a task), it's going to fail and tell you that
one of your peers leaked to the outside.  I think that's sane and
preferred behavior, no?  If you're using macvlan and you checkpoint
from within the container, I think you should be okay, as long as
there is an appropriately named device to base the restored devices on
in whatever netns your restore process is in.

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] C/R: Basic support for network namespaces and devices (v4)
  2010-02-23 17:27         ` Dan Smith
@ 2010-02-23 18:49           ` Serge E. Hallyn
  0 siblings, 0 replies; 16+ messages in thread
From: Serge E. Hallyn @ 2010-02-23 18:49 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, netdev

Quoting Dan Smith (danms@us.ibm.com):
> SH> But there is no guarantee that the checkpointer is in the netns
> SH> which we would call the 'top level' netns.  That means that, at
> SH> restart, whether the devices in what we call the top level netns
> SH> are in fact inherited will depend on
> SH> conditions of the checkpointer.  Do we care?  (I thought we did,
> SH> but maybe we don't... it's unlikely to happen anyway)
> 
> Well, when we discussed this on IRC with Oren, I think we came to the
> conclusion that, since network namespaces aren't hierarchical, we
> would restore things from the "viewpoint" of the process that
> checkpointed them.  It gives us a sane way to ensure that the peer
> devices residing in the init netns can be put back there, even though we
> don't checkpoint everything in the init netns (like eth0).
> 
> If you checkpoint a veth from within the container and you have a peer
> device that is outside the container (but not in a netns that is
> checkpointed as part of a task), it's going to fail and tell you that
> one of your peers leaked to the outside.  I think that's sane and
> preferred behavior, no?

Well I don't think it is, but it's a fine starting point, so let's
worry about it later.

thanks,
-serge

> If you're using macvlan and you checkpoint
> from within the container, I think you should be okay, as long as
> there is an appropriately named device to base the restored devices on
> in whatever netns your restore process is in.
> 
> -- 
> Dan Smith
> IBM Linux Technology Center
> email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
