All of lore.kernel.org
 help / color / mirror / Atom feed
* C/R: Checkpoint and restore network namespaces and devices
@ 2010-02-25 20:43 Dan Smith
  2010-02-25 20:43 ` [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops Dan Smith
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

This patch set adds checkpoint/restart support for network namespaces,
as well as the network devices within.  Currently supports veth, macvlan,
and loopback device types.

Includes the latest feedback and the new macvlan bits.  Rebased on Oren's
official v19.  Changes detailed per patch.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
@ 2010-02-25 20:43 ` Dan Smith
  2010-02-26 12:08   ` David Miller
  2010-02-25 20:43 ` [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5) Dan Smith
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

These will be implemented per-driver by those that support such
operations.

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 include/linux/netdevice.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a3fccc8..415a791 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -673,6 +673,12 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int			(*ndo_collect)(struct ckpt_ctx *ctx,
+					       struct net_device *dev);
+	int			(*ndo_checkpoint)(struct ckpt_ctx *ctx,
+						  struct net_device *dev);
+#endif
 };
 
 /*
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
  2010-02-25 20:43 ` [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops Dan Smith
@ 2010-02-25 20:43 ` Dan Smith
  2010-02-26 12:08   ` David Miller
  2010-03-06  3:53   ` Oren Laadan
  2010-02-25 20:43 ` [PATCH 3/6] C/R: Add checkpoint support for veth devices (v2) Dan Smith
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

When checkpointing a task tree with network namespaces, we hook into
do_checkpoint_ns() along with the others.  Any devices in a given namespace
are checkpointed (including their peer, in the case of veth) sequentially.
Each network device stores a list of protocol addresses, as well as other
information, such as hardware address.

This patch supports veth pairs, as well as the loopback adapter.  The
loopback support is there to make sure that any additional addresses and
state (such as up/down) is copied to the loopback adapter that we are
given in the new network namespace.

On restart, we instantiate new network namespaces and veth pairs as
necessary.  Any device we encounter that isn't in a network namespace
that was checkpointed as part of a task is left in the namespace of the
restarting process.  This will be the case for a veth half that exists
in the init netns to provide network access to a container.

Still to do are:

  1. Routes
  2. Netfilter rules
  3. IPv6 addresses
  4. Other virtual device types (e.g. bridges)
  5. Multicast
  6. Device config info (ipv4_devconf)
  7. Additional ipv4 address attributes

Changes in v5:
 - Rebase
 - Remove checkpoint_container() noise
 - Factor out some common bits of the RTNL newlink operations
 - Add macvlan support

Changes in v4:
 - Fix allocation under lock in ckpt_netdev_inet_addrs()
 - Add comment for case where there is no netns info in checkpoint image
 - Fix inner structure alignment in netdev_addr header
 - Fix instances of kfree(skb)
 - Remove init_netns_ref from container header and checkpoint context
 - Add 'extern' to checkpoint.h prototypes
 - Swizzle do_restore_netns() to handle netns more like the others
 - Return E2BIG for failure case when collecting inet addrs
 - Report case where device doesn't support checkpoint
 - Remove nested netns check from may_checkpoint_task()
 - Move veth-specific netdev attributes into unioned struct to set an
   example for specific attributes of additional device types
 - Add 'sit' device restore path that doesn't really do anything
 - Fail instead of skip when encountering a device with no checkpoint
   support

Changes in v3:
 - Use dev->checkpoint() for per-device checkpoint operation
 - Use RTNL for veth pair creation on restart
 - Export some of the functions that will be needed by dev->ndo_checkpoint()

Changes in v2:
 - Add CONFIG_CHECKPOINT_NETNS that is dependent on NET, NET_NS, and
   CHECKPOINT.  Conditionally compile the checkpoint_dev code based on it.
 - Updated comment on should_checkpoint_netdev()
 - Updated checkpoint_netdev() to explicitly check for "veth" in name
 - Changed checkpoint_netns() to use BUG() for impossible condition
 - Fixed a bug on restart with all devices in the init netns
 - Lock the dev_base_lock while traversing interface addresses
 - Collect all addresses for an interface before writing out in one
   single pass

Signed-off-by: Dan Smith <danms@us.ibm.com>
Cc: netdev@vger.kernel.org
---
 checkpoint/checkpoint.c          |    6 +-
 checkpoint/objhash.c             |   48 +++
 include/linux/checkpoint.h       |   23 ++
 include/linux/checkpoint_hdr.h   |   58 +++
 include/linux/checkpoint_types.h |    1 +
 kernel/nsproxy.c                 |   20 +-
 net/Kconfig                      |    4 +
 net/Makefile                     |    1 +
 net/checkpoint_dev.c             |  815 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 970 insertions(+), 6 deletions(-)
 create mode 100644 net/checkpoint_dev.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index b3c1c4f..466f594 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -184,6 +184,7 @@ static int checkpoint_container(struct ckpt_ctx *ctx)
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CONTAINER);
 	if (!h)
 		return -ENOMEM;
+
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
 
@@ -284,11 +285,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		_ckpt_err(ctx, -EPERM, "%(T)Nested mnt_ns unsupported\n");
 		ret = -EPERM;
 	}
-	/* no support for >1 private netns */
-	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
-		_ckpt_err(ctx, -EPERM, "%(T)Nested net_ns unsupported\n");
-		ret = -EPERM;
-	}
 	/* no support for >1 private pidns */
 	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns) {
 		_ckpt_err(ctx, -EPERM, "%(T)Nested pid_ns unsupported\n");
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index fbc58ea..16f2c43 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -348,6 +348,36 @@ static void lsm_string_drop(void *ptr, int lastref)
 	kref_put(&s->kref, lsm_string_free);
 }
 
+static int netns_grab(void *ptr)
+{
+	struct net *net = ptr;
+
+	get_net(net);
+	return 0;
+}
+
+static void netns_drop(void *ptr, int lastref)
+{
+	struct net *net = ptr;
+
+	put_net(net);
+}
+
+static int netdev_grab(void *ptr)
+{
+	struct net_device *dev = ptr;
+
+	dev_hold(dev);
+	return 0;
+}
+
+static void netdev_drop(void *ptr, int lastref)
+{
+	struct net_device *dev = ptr;
+
+	dev_put(dev);
+}
+
 /* security context strings */
 static int checkpoint_lsm_string(struct ckpt_ctx *ctx, void *ptr);
 static struct ckpt_lsm_string *restore_lsm_string(struct ckpt_ctx *ctx);
@@ -550,6 +580,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_lsm_string,
 		.restore = restore_lsm_string_wrap,
 	},
+	/* Network Namespace Object */
+	{
+		.obj_name = "NET_NS",
+		.obj_type = CKPT_OBJ_NET_NS,
+		.ref_grab = netns_grab,
+		.ref_drop = netns_drop,
+		.checkpoint = checkpoint_netns,
+		.restore = restore_netns,
+	},
+	/* Network Device Object */
+	{
+		.obj_name = "NET_DEV",
+		.obj_type = CKPT_OBJ_NETDEV,
+		.ref_grab = netdev_grab,
+		.ref_drop = netdev_drop,
+		.checkpoint = checkpoint_netdev,
+		.restore = restore_netdev,
+	},
 };
 
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7101d6f..a25bac1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -35,6 +35,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <linux/inetdevice.h>
 #include <net/sock.h>
 
 /* sycall helpers */
@@ -119,6 +120,28 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx, struct sock *sk);
 extern void sock_listening_list_free(struct list_head *head);
 
+#ifdef CONFIG_CHECKPOINT_NETNS
+extern int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netns(struct ckpt_ctx *ctx);
+extern int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netdev(struct ckpt_ctx *ctx);
+
+extern int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx,
+				     struct net_device *dev);
+extern int ckpt_netdev_inet_addrs(struct in_device *indev,
+				  struct ckpt_netdev_addr *list[]);
+extern int ckpt_netdev_hwaddr(struct net_device *dev,
+			      struct ckpt_hdr_netdev *h);
+extern struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+						struct net_device *dev,
+						struct ckpt_netdev_addr *addrs[]);
+#else
+# define checkpoint_netns NULL
+# define restore_netns NULL
+# define checkpoint_netdev NULL
+# define restore_netdev NULL
+#endif
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 41412d1..c065739 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -181,6 +181,12 @@ enum {
 #define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
 	CKPT_HDR_SOCKET_INET,
 #define CKPT_HDR_SOCKET_INET CKPT_HDR_SOCKET_INET
+	CKPT_HDR_NET_NS,
+#define CKPT_HDR_NET_NS CKPT_HDR_NET_NS
+	CKPT_HDR_NETDEV,
+#define CKPT_HDR_NETDEV CKPT_HDR_NETDEV
+	CKPT_HDR_NETDEV_ADDR,
+#define CKPT_HDR_NETDEV_ADDR CKPT_HDR_NETDEV_ADDR
 
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
@@ -253,6 +259,10 @@ enum obj_type {
 #define CKPT_OBJ_SECURITY_PTR CKPT_OBJ_SECURITY_PTR
 	CKPT_OBJ_SECURITY,
 #define CKPT_OBJ_SECURITY CKPT_OBJ_SECURITY
+	CKPT_OBJ_NET_NS,
+#define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
+	CKPT_OBJ_NETDEV,
+#define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -313,6 +323,7 @@ struct ckpt_hdr_tail {
 /* container configuration section header */
 struct ckpt_hdr_container {
 	struct ckpt_hdr h;
+	__s32 init_netns_ref;
 	/*
 	 * the header is followed by the string:
 	 *   char lsm_name[SECURITY_NAME_MAX + 1]
@@ -434,6 +445,7 @@ struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
 	__s32 ipc_objref;
+	__s32 net_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -758,6 +770,52 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_netns {
+	struct ckpt_hdr h;
+	__s32 this_ref;
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_types {
+	CKPT_NETDEV_LO,
+	CKPT_NETDEV_VETH,
+	CKPT_NETDEV_SIT,
+	CKPT_NETDEV_MACVLAN,
+};
+
+struct ckpt_hdr_netdev {
+	struct ckpt_hdr h;
+ 	__s32 netns_ref;
+	union {
+		struct {
+			__s32 this_ref;
+			__s32 peer_ref;
+		} veth;
+		struct {
+			__u32 mode;
+		} macvlan;
+	};
+	__u32 inet_addrs;
+	__u16 type;
+	__u16 flags;
+	__u8 hwaddr[6];
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_addr_types {
+	CKPT_NETDEV_ADDR_IPV4,
+};
+
+struct ckpt_netdev_addr {
+	__u16 type;
+	union {
+		struct {
+			__u32 inet4_local;
+			__u32 inet4_address;
+			__u32 inet4_mask;
+			__u32 inet4_broadcast;
+		};
+	} __attribute__((aligned(8)));
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 5d5e00d..efc9357 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -82,6 +82,7 @@ struct ckpt_ctx {
 	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
 	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 	struct list_head listen_sockets;/* listening parent sockets */
+	int init_netns_ref;             /* Objref of root net namespace */
 
 	struct ckpt_stats stats;	/* statistics */
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 0da0d83..b0e67ff 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,11 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+#ifdef CONFIG_CHECKPOINT_NETNS
+	ret = ckpt_obj_collect(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+#endif
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 	if (ret < 0)
 		goto out;
@@ -288,6 +293,12 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ret < 0)
 		goto out;
 	h->ipc_objref = ret;
+#ifdef CONFIG_CHECKPOINT_NETNS
+	ret = checkpoint_obj(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+	h->net_objref = ret;
+#endif
 
 	/* FIXME: for now, only marked visited to pacify leaks */
 	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
@@ -328,6 +339,14 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(uts_ns);
 		goto out;
 	}
+	if (h->net_objref == 0)
+		net_ns = current->nsproxy->net_ns;
+	else
+		net_ns = ckpt_obj_fetch(ctx, h->net_objref, CKPT_OBJ_NET_NS);
+	if (IS_ERR(net_ns)) {
+		ret = PTR_ERR(net_ns);
+		goto out;
+	}
 
 	if (h->ipc_objref == 0)
 		ipc_ns = ctx->root_nsproxy->ipc_ns;
@@ -339,7 +358,6 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	}
 
 	mnt_ns = ctx->root_nsproxy->mnt_ns;
-	net_ns = ctx->root_nsproxy->net_ns;
 
 	if (uts_ns == current->nsproxy->uts_ns &&
 	    ipc_ns == current->nsproxy->ipc_ns &&
diff --git a/net/Kconfig b/net/Kconfig
index 041c35e..64dd3cd 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -276,4 +276,8 @@ source "net/wimax/Kconfig"
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config CHECKPOINT_NETNS
+       bool
+       default y if NET && NET_NS && CHECKPOINT
+
 endif   # if NET
diff --git a/net/Makefile b/net/Makefile
index 74b038f..570ee98 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -67,3 +67,4 @@ endif
 obj-$(CONFIG_WIMAX)		+= wimax/
 
 obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
+obj-$(CONFIG_CHECKPOINT_NETNS)	+= checkpoint_dev.o
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
new file mode 100644
index 0000000..5479419
--- /dev/null
+++ b/net/checkpoint_dev.c
@@ -0,0 +1,815 @@
+/*
+ *  Copyright 2010 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/sched.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/inetdevice.h>
+#include <linux/veth.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/deferqueue.h>
+
+#include <net/net_namespace.h>
+#include <net/sch_generic.h>
+
+struct dq_netdev {
+	struct net_device *dev;
+	struct ckpt_ctx *ctx;
+};
+
+struct veth_newlink {
+	char *peer;
+};
+
+struct mvl_newlink {
+	char this[IFNAMSIZ+1];
+	char base[IFNAMSIZ+1];
+	int mode;
+	__u8 *hwaddr;
+};
+
+typedef int (*new_link_fn)(struct sk_buff *, void *);
+
+static int __kern_devinet_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = devinet_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = dev_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static struct socket *rtnl_open(void)
+{
+	struct socket *sock;
+	int ret;
+
+	ret = sock_create(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE, &sock);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	return sock;
+}
+
+static int rtnl_close(struct socket *rtnl)
+{
+	if (rtnl)
+		return kernel_sock_shutdown(rtnl, SHUT_RDWR);
+	else
+		return 0;
+}
+
+static struct nlmsghdr *rtnl_get_response(struct socket *rtnl,
+					  struct sk_buff **skb)
+{
+	int ret;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+	struct nlmsghdr *nlh;
+
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (!ret)
+		return ERR_PTR(-EPIPE);
+
+	*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	if (!*skb)
+		return ERR_PTR(-EPIPE);
+
+	ret = -EINVAL;
+	nlh = nlmsg_hdr(*skb);
+	if (!nlh)
+		goto err;
+
+	if (nlh->nlmsg_type == NLMSG_ERROR) {
+		struct nlmsgerr *errmsg = nlmsg_data(nlh);
+		ret = errmsg->error;
+		goto err;
+	}
+
+	return nlh;
+ err:
+	kfree_skb(*skb);
+	*skb = NULL;
+
+	return ERR_PTR(ret);
+}
+
+int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	return dev->nd_net == current->nsproxy->net_ns;
+}
+
+int ckpt_netdev_hwaddr(struct net_device *dev, struct ckpt_hdr_netdev *h)
+{
+	struct net *net = dev->nd_net;
+	struct ifreq req;
+	int ret;
+
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
+	h->flags = req.ifr_flags;
+	if (ret < 0)
+		return ret;
+
+	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
+	if (ret < 0)
+		return ret;
+
+	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
+
+	return 0;
+}
+
+int ckpt_netdev_inet_addrs(struct in_device *indev,
+			   struct ckpt_netdev_addr *_abuf[])
+{
+	struct ckpt_netdev_addr *abuf = NULL;
+	struct in_ifaddr *addr = indev->ifa_list;
+	int pages = 0;
+	int addrs = 0;
+	int max;
+
+ retry:
+	if (++pages > 4) {
+		addrs = -E2BIG;
+		goto out;
+	}
+
+	*_abuf = krealloc(abuf, PAGE_SIZE * pages, GFP_KERNEL);
+	if (*_abuf == NULL) {
+		addrs = -ENOMEM;
+		goto out;
+	}
+	abuf = *_abuf;
+
+	read_lock(&dev_base_lock);
+
+	max = (pages * PAGE_SIZE) / sizeof(*abuf);
+	while (addr) {
+		abuf[addrs].type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 now */
+		abuf[addrs].inet4_local = addr->ifa_local;
+		abuf[addrs].inet4_address = addr->ifa_address;
+		abuf[addrs].inet4_mask = addr->ifa_mask;
+		abuf[addrs].inet4_broadcast = addr->ifa_broadcast;
+
+		addr = addr->ifa_next;
+		if (++addrs >= max) {
+			read_unlock(&dev_base_lock);
+			goto retry;
+		}
+	}
+
+	read_unlock(&dev_base_lock);
+ out:
+	if (addrs < 0) {
+		kfree(abuf);
+		*_abuf = NULL;
+	}
+
+	return addrs;
+}
+
+struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+					 struct net_device *dev,
+					 struct ckpt_netdev_addr *addrs[])
+{
+	struct ckpt_hdr_netdev *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_netdev_hwaddr(dev, h);
+	if (ret < 0)
+		goto out;
+
+	*addrs = NULL;
+	ret = h->inet_addrs = ckpt_netdev_inet_addrs(dev->ip_ptr, addrs);
+	if (ret < 0) {
+		if (ret == -E2BIG)
+			ckpt_err(ctx, ret,
+				 "Too many inet addresses on interface %s\n",
+				 dev->name);
+		goto out;
+	}
+
+	if (ckpt_netdev_in_init_netns(ctx, dev))
+		ret = h->netns_ref = 0;
+	else
+		ret = h->netns_ref = checkpoint_obj(ctx, dev->nd_net,
+						    CKPT_OBJ_NET_NS);
+ out:
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+		if (*addrs)
+			kfree(*addrs);
+	}
+
+	return h;
+}
+
+int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net_device *dev = (struct net_device *)ptr;
+
+	if (!dev->netdev_ops->ndo_checkpoint) {
+		ckpt_err(ctx, -ENOSYS,
+			 "Device %s does not support checkpoint\n", dev->name);
+		return -ENOSYS;
+	}
+
+	ckpt_debug("checkpointing netdev %s\n", dev->name);
+
+	return dev->netdev_ops->ndo_checkpoint(ctx, dev);
+}
+
+int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net *net = ptr;
+	struct net_device *dev;
+	struct ckpt_hdr_netns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	BUG_ON(h->this_ref == 0);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	for_each_netdev(net, dev) {
+		if (!dev->netdev_ops->ndo_checkpoint) {
+			ret = -ENOSYS;
+			ckpt_err(ctx, ret,
+				 "Device %s does not support checkpoint\n",
+				 dev->name);
+			break;
+		}
+
+		ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static int restore_in_addrs(struct ckpt_ctx *ctx,
+			    __u32 naddrs,
+			    struct net *net,
+			    struct net_device *dev)
+{
+	__u32 i;
+	int ret = 0;
+	int len = naddrs * sizeof(struct ckpt_netdev_addr);
+	struct ckpt_netdev_addr *addrs = NULL;
+
+	addrs = kmalloc(len, GFP_KERNEL);
+	if (!addrs)
+		return -ENOMEM;
+
+	ret = _ckpt_read_buffer(ctx, addrs, len);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < naddrs; i++) {
+		struct ckpt_netdev_addr *addr = &addrs[i];
+		struct ifreq req;
+		struct sockaddr_in *inaddr;
+
+		if (addr->type != CKPT_NETDEV_ADDR_IPV4) {
+			ret = -EINVAL;
+			ckpt_err(ctx, ret, "Unsupported netdev addr type %i\n",
+				 addr->type);
+			break;
+		}
+
+		ckpt_debug("restoring %s: %x/%x/%x\n", dev->name,
+			   addr->inet4_address,
+			   addr->inet4_mask,
+			   addr->inet4_broadcast);
+
+		memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = addr->inet4_address;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set address\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = addr->inet4_mask;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFNETMASK, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set netmask\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = addr->inet4_broadcast;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFBRDADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set broadcast\n");
+			break;
+		}
+	}
+
+ out:
+	kfree(addrs);
+
+	return ret;
+}
+
+static int veth_new_link_msg(struct sk_buff *skb, void *data)
+{
+	struct nlattr *linkinfo;
+	struct nlattr *linkdata;
+	struct ifinfomsg ifm;
+	int ret = -ENOMEM;
+	struct veth_newlink *d = data;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo)
+		goto out;
+
+	ret = nla_put_string(skb, IFLA_INFO_KIND, "veth");
+	if (ret)
+		goto out;
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put(skb, VETH_INFO_PEER, sizeof(ifm), &ifm);
+	if (!ret)
+		ret = nla_put_string(skb, IFLA_IFNAME, d->peer);
+
+	nla_nest_end(skb, linkdata);
+ out:
+	nla_nest_end(skb, linkinfo);
+
+	return ret;
+}
+
+static int mvl_new_link_msg(struct sk_buff *skb, void *data)
+{
+	struct mvl_newlink *d = data;
+	struct nlattr *linkinfo;
+	struct nlattr *linkdata;
+	struct net_device *lowerdev;
+	int ret;
+
+	lowerdev = dev_get_by_name(current->nsproxy->net_ns, d->base);
+	if (!lowerdev)
+		return -ENOENT;
+
+	ret = nla_put(skb, IFLA_ADDRESS, ETH_ALEN, d->hwaddr);
+	if (ret)
+		goto out_put;
+
+	ret = nla_put_u32(skb, IFLA_LINK, lowerdev->ifindex);
+	if (ret)
+		goto out_put;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put_string(skb, IFLA_INFO_KIND, "macvlan");
+	if (ret)
+		goto out;
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put_u32(skb, IFLA_MACVLAN_MODE, d->mode);
+	nla_nest_end(skb, linkdata);
+ out:
+	nla_nest_end(skb, linkinfo);
+ out_put:
+	dev_put(lowerdev);
+
+	return ret;
+}
+
+static struct sk_buff *new_link_msg(new_link_fn fn, void *data, char *name)
+{
+	int ret = -ENOMEM;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb;
+	struct ifinfomsg *ifm;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		goto out;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWLINK, sizeof(*ifm), flags);
+	if (!nlh)
+		goto out;
+
+	ifm = nlmsg_data(nlh);
+	memset(ifm, 0, sizeof(*ifm));
+
+	ret = nla_put_string(skb, IFLA_IFNAME, name);
+	if (ret)
+		goto out;
+
+	ret = fn(skb, data);
+
+	nlmsg_end(skb, nlh);
+
+ out:
+	if (ret < 0) {
+		kfree_skb(skb);
+		skb = ERR_PTR(ret);
+	}
+
+	return skb;
+}
+
+static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
+{
+	int ret = -ENOMEM;
+	struct socket *rtnl = NULL;
+	struct sk_buff *skb = NULL;
+	struct nlmsghdr *nlh;
+	struct msghdr msg;
+	struct kvec kvec;
+
+	skb = new_link_msg(fn, data, name);
+	if (IS_ERR(skb)) {
+		ret = PTR_ERR(skb);
+		ckpt_debug("failed to create new link message: %i\n", ret);
+		skb = NULL;
+		goto out;
+	}
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	rtnl = rtnl_open();
+	if (IS_ERR(rtnl)) {
+		ret = PTR_ERR(rtnl);
+		ckpt_debug("Unable to open rtnetlink socket: %i\n", ret);
+		goto out_noclose;
+	}
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if (ret < 0)
+		goto out;
+	else if (ret != skb->len) {
+		ret = -EIO;
+		goto out;
+	}
+
+	/* Free the send skb to make room for the receive skb */
+	kfree_skb(skb);
+
+	nlh = rtnl_get_response(rtnl, &skb);
+	if (IS_ERR(nlh)) {
+		ret = PTR_ERR(nlh);
+		ckpt_debug("RTNETLINK said: %i\n", ret);
+	}
+ out:
+	rtnl_close(rtnl);
+ out_noclose:
+	kfree_skb(skb);
+
+	if (ret < 0)
+		return ERR_PTR(ret);
+	else
+		return dev_get_by_name(current->nsproxy->net_ns, name);
+}
+
+static int netdev_noop(void *data)
+{
+	return 0;
+}
+
+static int netdev_cleanup(void *data)
+{
+	struct dq_netdev *dq = data;
+
+	dev_put(dq->dev);
+
+	if (dq->ctx->errno) {
+		ckpt_debug("Unregistering netdev %s\n", dq->dev->name);
+		unregister_netdev(dq->dev);
+	}
+
+	return 0;
+}
+
+static struct net_device *restore_veth(struct ckpt_ctx *ctx,
+				       struct ckpt_hdr_netdev *h,
+				       struct net *net)
+{
+	int ret;
+	char this_name[IFNAMSIZ];
+	char peer_name[IFNAMSIZ];
+	struct net_device *dev;
+	struct net_device *peer;
+	int didreg = 0;
+	struct ifreq req;
+	struct dq_netdev dq;
+
+	dq.ctx = ctx;
+
+	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, peer_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ckpt_debug("restored veth netdev %s:%s\n", this_name, peer_name);
+
+	peer = ckpt_obj_try_fetch(ctx, h->veth.peer_ref, CKPT_OBJ_NETDEV);
+	if (IS_ERR(peer)) {
+		struct veth_newlink veth = {
+			.peer = peer_name,
+		};
+
+		/* We're first: allocate the veth pair */
+		didreg = 1;
+
+		dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
+		if (IS_ERR(dev))
+			return dev;
+
+		peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
+		if (!peer) {
+			ret = -EINVAL;
+			goto err_dev;
+		}
+
+		dq.dev = peer;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			goto err_peer;
+
+		ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
+				      CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+		dev_put(peer);
+
+		dq.dev = dev;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+
+	} else {
+		/* We're second: get our dev from the hash */
+		dev = ckpt_obj_fetch(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
+		if (IS_ERR(dev))
+			return dev;
+	}
+
+	/* Move to our new netns */
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+	if (ret < 0)
+		goto out;
+
+	/* Restore MAC address */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
+	req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+	ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
+ out:
+	if (ret)
+		dev = ERR_PTR(ret);
+
+	return dev;
+
+ err_peer:
+	dev_put(peer);
+	unregister_netdev(peer);
+ err_dev:
+	dev_put(dev);
+	unregister_netdev(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_lo(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_netdev *h,
+				     struct net *net)
+{
+	struct net_device *dev;
+	char name[IFNAMSIZ+1];
+	int ret;
+
+	dev = dev_get_by_name(net, "lo");
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	ret = _ckpt_read_buffer(ctx, name, IFNAMSIZ);
+	if (ret < 0)
+		goto err;
+
+	if (strncmp(dev->name, name, IFNAMSIZ) != 0) {
+		ret = dev_change_name(dev, name);
+		if (ret < 0)
+			goto err;
+	}
+
+	return dev;
+ err:
+	dev_put(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_sit(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_netdev *h,
+				      struct net *net)
+{
+	/* Don't actually do anything for SIT devices yet */
+	return dev_get_by_name(net, "sit0");
+}
+
+static struct net_device *restore_macvlan(struct ckpt_ctx *ctx,
+					  struct ckpt_hdr_netdev *h,
+					  struct net *net)
+{
+	struct net_device *dev;
+	struct mvl_newlink mvl = {
+		.mode = h->macvlan.mode,
+		.hwaddr = h->hwaddr,
+	};
+	int ret;
+
+	ret = _ckpt_read_buffer(ctx, mvl.this, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, mvl.base, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	dev = rtnl_newlink(mvl_new_link_msg, &mvl, mvl.this);
+	if (IS_ERR(dev)) {
+		ckpt_err(ctx, PTR_ERR(dev),
+			 "Failed to create macvlan device %s:%s",
+			 mvl.this, mvl.base);
+		goto out;
+	}
+
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to change netns of %s:%s\n",
+			 mvl.this, mvl.base);
+		dev_put(dev);
+		unregister_netdev(dev);
+		dev = ERR_PTR(ret);
+	}
+ out:
+	return dev;
+}
+
+void *restore_netdev(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netdev *h;
+	struct net_device *dev = NULL;
+	struct ifreq req;
+	struct net *net;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netdev\n");
+		return h;
+	}
+
+	if (h->netns_ref != 0) {
+		net = ckpt_obj_try_fetch(ctx, h->netns_ref, CKPT_OBJ_NET_NS);
+		if (IS_ERR(net)) {
+			ckpt_debug("failed to get net for %i\n", h->netns_ref);
+			ret = PTR_ERR(net);
+			net = current->nsproxy->net_ns;
+			goto out;
+		}
+	} else
+		net = current->nsproxy->net_ns;
+
+	if (h->type == CKPT_NETDEV_VETH)
+		dev = restore_veth(ctx, h, net);
+	else if (h->type == CKPT_NETDEV_LO)
+		dev = restore_lo(ctx, h, net);
+	else if (h->type == CKPT_NETDEV_SIT)
+		dev = restore_sit(ctx, h, net);
+	else if (h->type == CKPT_NETDEV_MACVLAN)
+		dev = restore_macvlan(ctx, h, net);
+	else
+		dev = ERR_PTR(-EINVAL);
+
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		ckpt_err(ctx, ret, "Netdev type %i not supported\n", h->type);
+		goto out;
+	}
+
+	/* Restore flags (which will likely bring the interface up) */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	req.ifr_flags = h->flags;
+	ret = __kern_dev_ioctl(net, SIOCSIFFLAGS, &req);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0)
+		ret = restore_in_addrs(ctx, h->inet_addrs, net, dev);
+ out:
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to restore netdevice\n");
+		if ((h->type == CKPT_NETDEV_VETH) && !IS_ERR(dev)) {
+			dev_put(dev);
+		}
+		dev = ERR_PTR(ret);
+	} else
+		ckpt_debug("restored netdev %s\n", dev->name);
+
+	ckpt_hdr_put(ctx, h);
+
+	return dev;
+}
+
+void *restore_netns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netns *h;
+	struct net *net;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netns\n");
+		return h;
+	}
+
+	if (h->this_ref != 0) {
+		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
+		if (IS_ERR(net))
+			goto out;
+	} else
+		net = current->nsproxy->net_ns;
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return net;
+}
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/6] C/R: Add checkpoint support for veth devices (v2)
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
  2010-02-25 20:43 ` [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops Dan Smith
  2010-02-25 20:43 ` [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5) Dan Smith
@ 2010-02-25 20:43 ` Dan Smith
  2010-02-26 12:09   ` David Miller
  2010-02-25 20:43 ` [PATCH 4/6] C/R: Add loopback checkpoint support (v2) Dan Smith
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

Adds an ndo_checkpoint() handler for veth devices to checkpoint themselves.
Writes out the pairing information, addresses, and initiates a checkpoint
on the peer if the peer won't be reached from another netns.  Throws an
error of our peer's netns isn't already in the hash (i.e., a tree leak).

Changes in v2:
 - Fix check detecting if peer is in the init netns

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 drivers/net/veth.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3a15de5..db92de8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -16,6 +16,9 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
@@ -284,6 +287,76 @@ static void veth_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int veth_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer = priv->peer;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+	int n;
+
+	if (!peer) {
+		ckpt_err(ctx, -EINVAL, "veth device has no peer!\n");
+		return -EINVAL;
+	}
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_VETH;
+
+	ret = h->veth.this_ref = ckpt_obj_lookup_add(ctx, dev,
+						     CKPT_OBJ_NETDEV, &n);
+	if (ret < 0)
+		goto out;
+
+	ret = h->veth.peer_ref = ckpt_obj_lookup_add(ctx, peer,
+						     CKPT_OBJ_NETDEV, &n);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, peer->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+		if (ret)
+			goto out;
+	}
+
+	/* Only checkpoint peer if we're not going to arrive at it
+	 * via another task's netns.  Fail if the pipe exits
+	 * our container to a netns not already in the hash
+	 */
+	if (ckpt_netdev_in_init_netns(ctx, peer))
+		ret = checkpoint_obj(ctx, peer, CKPT_OBJ_NETDEV);
+	else if (!ckpt_obj_lookup(ctx, peer->nd_net, CKPT_OBJ_NET_NS)) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret,
+			 "Peer %s of %s not in checkpointed namespaces\n",
+			 peer->name, dev->name);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -292,6 +365,9 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_change_mtu      = veth_change_mtu,
 	.ndo_get_stats       = veth_get_stats,
 	.ndo_set_mac_address = eth_mac_addr,
+#ifdef CONFIG_CHECKPOINT
+	.ndo_checkpoint      = veth_checkpoint,
+#endif
 };
 
 static void veth_setup(struct net_device *dev)
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/6] C/R: Add loopback checkpoint support (v2)
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
                   ` (2 preceding siblings ...)
  2010-02-25 20:43 ` [PATCH 3/6] C/R: Add checkpoint support for veth devices (v2) Dan Smith
@ 2010-02-25 20:43 ` Dan Smith
  2010-02-26 12:09   ` David Miller
  2010-02-25 20:43 ` [PATCH 5/6] C/R: Add a checkpoint handler to the 'sit' device Dan Smith
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

Adds a small ndo_checkpoint() handler for loopback devices to write the
name and addresses like other interfaces.

Changes in v2:
 - Add CONFIG_CHECKPOINT around the handler

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 drivers/net/loopback.c |   44 +++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index b9fcc98..77023a7 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -57,6 +57,8 @@
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/percpu.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
 #include <net/net_namespace.h>
 
 struct pcpu_lstats {
@@ -153,10 +155,46 @@ static void loopback_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int loopback_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_LO;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
 static const struct net_device_ops loopback_ops = {
-	.ndo_init      = loopback_dev_init,
-	.ndo_start_xmit= loopback_xmit,
-	.ndo_get_stats = loopback_get_stats,
+	.ndo_init       = loopback_dev_init,
+	.ndo_start_xmit = loopback_xmit,
+	.ndo_get_stats  = loopback_get_stats,
+#ifdef CONFIG_CHECKPOINT
+	.ndo_checkpoint = loopback_checkpoint,
+#endif
 };
 
 /*
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/6] C/R: Add a checkpoint handler to the 'sit' device
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
                   ` (3 preceding siblings ...)
  2010-02-25 20:43 ` [PATCH 4/6] C/R: Add loopback checkpoint support (v2) Dan Smith
@ 2010-02-25 20:43 ` Dan Smith
  2010-02-26 12:09   ` David Miller
  2010-02-25 20:43 ` [PATCH 6/6] C/R: Add checkpoint support to macvlan driver Dan Smith
  2010-03-15  2:49 ` C/R: Checkpoint and restore network namespaces and devices Oren Laadan
  6 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

This handler doesn't really do much to checkpoint the device, other
than the minimum required to support the restart process.  When we
add IPv6 support to this, then we can fill this out.

This allows us to avoid skipping unsupported interfaces on a normal
system.

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 net/ipv6/sit.c |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 976e682..cf9fcb8 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -32,6 +32,8 @@
 #include <linux/init.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/if_ether.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
 
 #include <net/sock.h>
 #include <net/snmp.h>
@@ -1085,11 +1087,43 @@ static int ipip6_tunnel_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ipip6_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_SIT;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static const struct net_device_ops ipip6_netdev_ops = {
 	.ndo_uninit	= ipip6_tunnel_uninit,
 	.ndo_start_xmit	= ipip6_tunnel_xmit,
 	.ndo_do_ioctl	= ipip6_tunnel_ioctl,
 	.ndo_change_mtu	= ipip6_tunnel_change_mtu,
+#ifdef CONFIG_CHECKPOINT
+	.ndo_checkpoint	= ipip6_checkpoint,
+#endif
 };
 
 static void ipip6_tunnel_setup(struct net_device *dev)
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 6/6] C/R: Add checkpoint support to macvlan driver
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
                   ` (4 preceding siblings ...)
  2010-02-25 20:43 ` [PATCH 5/6] C/R: Add a checkpoint handler to the 'sit' device Dan Smith
@ 2010-02-25 20:43 ` Dan Smith
  2010-02-26 12:09   ` David Miller
  2010-03-15  2:49 ` C/R: Checkpoint and restore network namespaces and devices Oren Laadan
  6 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-02-25 20:43 UTC (permalink / raw)
  To: containers; +Cc: benjamin.thery, den, ebiederm, davem, netdev

This has an small hidden gotcha.  Since the macvlan device is moved
completely into a container's network namespace, the init netns cannot
freeze traffic to and from it in order to migrate live network connections
by simply utilizing its netfilter tables.  A helper process in the container
or something like what was recently discussed on the containers list[1]
would provide a way to do this.

[1]: https://lists.linux-foundation.org/pipermail/containers/2010-February/023001.html

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 drivers/net/macvlan.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 21a9c9a..07d006a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -28,6 +28,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_link.h>
 #include <linux/if_macvlan.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
 
@@ -493,6 +495,49 @@ static struct net_device_stats *macvlan_dev_get_stats(struct net_device *dev)
 	return stats;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int macvlan_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	ckpt_debug("Checkpointing macvlan %s:%s\n",
+		   dev->name, vlan->lowerdev->name);
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_MACVLAN;
+
+	h->macvlan.mode = vlan->mode;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, vlan->lowerdev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static void macvlan_ethtool_get_drvinfo(struct net_device *dev,
 					struct ethtool_drvinfo *drvinfo)
 {
@@ -539,6 +584,9 @@ static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_set_multicast_list	= macvlan_set_multicast_list,
 	.ndo_get_stats		= macvlan_dev_get_stats,
 	.ndo_validate_addr	= eth_validate_addr,
+#ifdef CONFIG_CHECKPOINT
+	.ndo_checkpoint		= macvlan_checkpoint,
+#endif
 };
 
 static void macvlan_setup(struct net_device *dev)
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops
  2010-02-25 20:43 ` [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops Dan Smith
@ 2010-02-26 12:08   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2010-02-26 12:08 UTC (permalink / raw)
  To: danms; +Cc: containers, benjamin.thery, den, ebiederm, netdev

From: Dan Smith <danms@us.ibm.com>
Date: Thu, 25 Feb 2010 12:43:10 -0800

> These will be implemented per-driver by those that support such
> operations.
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-02-25 20:43 ` [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5) Dan Smith
@ 2010-02-26 12:08   ` David Miller
  2010-02-26 14:56     ` Dan Smith
  2010-03-06  3:53   ` Oren Laadan
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2010-02-26 12:08 UTC (permalink / raw)
  To: danms; +Cc: containers, benjamin.thery, den, ebiederm, netdev

From: Dan Smith <danms@us.ibm.com>
Date: Thu, 25 Feb 2010 12:43:11 -0800

> When checkpointing a task tree with network namespaces, we hook into
> do_checkpoint_ns() along with the others.  Any devices in a given namespace
> are checkpointed (including their peer, in the case of veth) sequentially.
> Each network device stores a list of protocol addresses, as well as other
> information, such as hardware address.
> 
> This patch supports veth pairs, as well as the loopback adapter.  The
> loopback support is there to make sure that any additional addresses and
> state (such as up/down) is copied to the loopback adapter that we are
> given in the new network namespace.
> 
> On restart, we instantiate new network namespaces and veth pairs as
> necessary.  Any device we encounter that isn't in a network namespace
> that was checkpointed as part of a task is left in the namespace of the
> restarting process.  This will be the case for a veth half that exists
> in the init netns to provide network access to a container.

To be safe you should probably use __be32 and store the IP
addresses in network byte order.

But other than that:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/6] C/R: Add checkpoint support for veth devices (v2)
  2010-02-25 20:43 ` [PATCH 3/6] C/R: Add checkpoint support for veth devices (v2) Dan Smith
@ 2010-02-26 12:09   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2010-02-26 12:09 UTC (permalink / raw)
  To: danms; +Cc: containers, benjamin.thery, den, ebiederm, netdev

From: Dan Smith <danms@us.ibm.com>
Date: Thu, 25 Feb 2010 12:43:12 -0800

> Adds an ndo_checkpoint() handler for veth devices to checkpoint themselves.
> Writes out the pairing information, addresses, and initiates a checkpoint
> on the peer if the peer won't be reached from another netns.  Throws an
> error of our peer's netns isn't already in the hash (i.e., a tree leak).
> 
> Changes in v2:
>  - Fix check detecting if peer is in the init netns
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/6] C/R: Add loopback checkpoint support (v2)
  2010-02-25 20:43 ` [PATCH 4/6] C/R: Add loopback checkpoint support (v2) Dan Smith
@ 2010-02-26 12:09   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2010-02-26 12:09 UTC (permalink / raw)
  To: danms; +Cc: containers, benjamin.thery, den, ebiederm, netdev

From: Dan Smith <danms@us.ibm.com>
Date: Thu, 25 Feb 2010 12:43:13 -0800

> Adds a small ndo_checkpoint() handler for loopback devices to write the
> name and addresses like other interfaces.
> 
> Changes in v2:
>  - Add CONFIG_CHECKPOINT around the handler
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] C/R: Add a checkpoint handler to the 'sit' device
  2010-02-25 20:43 ` [PATCH 5/6] C/R: Add a checkpoint handler to the 'sit' device Dan Smith
@ 2010-02-26 12:09   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2010-02-26 12:09 UTC (permalink / raw)
  To: danms; +Cc: containers, benjamin.thery, den, ebiederm, netdev

From: Dan Smith <danms@us.ibm.com>
Date: Thu, 25 Feb 2010 12:43:14 -0800

> This handler doesn't really do much to checkpoint the device, other
> than the minimum required to support the restart process.  When we
> add IPv6 support to this, then we can fill this out.
> 
> This allows us to avoid skipping unsupported interfaces on a normal
> system.
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] C/R: Add checkpoint support to macvlan driver
  2010-02-25 20:43 ` [PATCH 6/6] C/R: Add checkpoint support to macvlan driver Dan Smith
@ 2010-02-26 12:09   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2010-02-26 12:09 UTC (permalink / raw)
  To: danms; +Cc: containers, benjamin.thery, den, ebiederm, netdev

From: Dan Smith <danms@us.ibm.com>
Date: Thu, 25 Feb 2010 12:43:15 -0800

> This has an small hidden gotcha.  Since the macvlan device is moved
> completely into a container's network namespace, the init netns cannot
> freeze traffic to and from it in order to migrate live network connections
> by simply utilizing its netfilter tables.  A helper process in the container
> or something like what was recently discussed on the containers list[1]
> would provide a way to do this.
> 
> [1]: https://lists.linux-foundation.org/pipermail/containers/2010-February/023001.html
> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-02-26 12:08   ` David Miller
@ 2010-02-26 14:56     ` Dan Smith
  2010-03-06  3:55       ` Oren Laadan
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-02-26 14:56 UTC (permalink / raw)
  To: David Miller; +Cc: containers, benjamin.thery, den, ebiederm, netdev

DM> To be safe you should probably use __be32 and store the IP
DM> addresses in network byte order.

Okay, yeah, good call.

DM> Acked-by: David S. Miller <davem@davemloft.net>

Thanks Dave!

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-02-25 20:43 ` [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5) Dan Smith
  2010-02-26 12:08   ` David Miller
@ 2010-03-06  3:53   ` Oren Laadan
       [not found]     ` <4B91D1A3.9030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Oren Laadan @ 2010-03-06  3:53 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, den, netdev, davem, ebiederm, benjamin.thery

Hi Dan,

I finally got to look at it - see comments inline:

Dan Smith wrote:
> When checkpointing a task tree with network namespaces, we hook into
> do_checkpoint_ns() along with the others.  Any devices in a given namespace
> are checkpointed (including their peer, in the case of veth) sequentially.
> Each network device stores a list of protocol addresses, as well as other
> information, such as hardware address.
> 
> This patch supports veth pairs, as well as the loopback adapter.  The
> loopback support is there to make sure that any additional addresses and
> state (such as up/down) is copied to the loopback adapter that we are
> given in the new network namespace.
> 
> On restart, we instantiate new network namespaces and veth pairs as
> necessary.  Any device we encounter that isn't in a network namespace
> that was checkpointed as part of a task is left in the namespace of the
> restarting process.  This will be the case for a veth half that exists
> in the init netns to provide network access to a container.

I think that this documentation (and other pieces from q&a on your
patches, e.g. your reply to Serge on 2/22) deserve an honorable
mention either in the source files, or in Documentation/checkpoint/.

As it turns out, the network related c/r logic - sockets, netns and
netdev - all of these are nontrivial and we need to explain them for
reviewers, future coders and ... ourselves :)

[...]

> 
> Signed-off-by: Dan Smith <danms@us.ibm.com>
> Cc: netdev@vger.kernel.org
> ---
>  checkpoint/checkpoint.c          |    6 +-
>  checkpoint/objhash.c             |   48 +++
>  include/linux/checkpoint.h       |   23 ++
>  include/linux/checkpoint_hdr.h   |   58 +++
>  include/linux/checkpoint_types.h |    1 +
>  kernel/nsproxy.c                 |   20 +-
>  net/Kconfig                      |    4 +
>  net/Makefile                     |    1 +
>  net/checkpoint_dev.c             |  815 ++++++++++++++++++++++++++++++++++++++
>  9 files changed, 970 insertions(+), 6 deletions(-)
>  create mode 100644 net/checkpoint_dev.c
> 
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index b3c1c4f..466f594 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -184,6 +184,7 @@ static int checkpoint_container(struct ckpt_ctx *ctx)
>  	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CONTAINER);
>  	if (!h)
>  		return -ENOMEM;
> +

[nit] noise ?

>  	ret = ckpt_write_obj(ctx, &h->h);
>  	ckpt_hdr_put(ctx, h);
>  
> @@ -284,11 +285,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
>  		_ckpt_err(ctx, -EPERM, "%(T)Nested mnt_ns unsupported\n");
>  		ret = -EPERM;
>  	}
> -	/* no support for >1 private netns */
> -	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
> -		_ckpt_err(ctx, -EPERM, "%(T)Nested net_ns unsupported\n");
> -		ret = -EPERM;
> -	}
>  	/* no support for >1 private pidns */
>  	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns) {
>  		_ckpt_err(ctx, -EPERM, "%(T)Nested pid_ns unsupported\n");
> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> index fbc58ea..16f2c43 100644
> --- a/checkpoint/objhash.c
> +++ b/checkpoint/objhash.c
> @@ -348,6 +348,36 @@ static void lsm_string_drop(void *ptr, int lastref)
>  	kref_put(&s->kref, lsm_string_free);
>  }
>  
> +static int netns_grab(void *ptr)
> +{
> +	struct net *net = ptr;
> +
> +	get_net(net);
> +	return 0;
> +}
> +
> +static void netns_drop(void *ptr, int lastref)
> +{
> +	struct net *net = ptr;
> +
> +	put_net(net);
> +}
> +
> +static int netdev_grab(void *ptr)
> +{
> +	struct net_device *dev = ptr;
> +
> +	dev_hold(dev);
> +	return 0;
> +}
> +
> +static void netdev_drop(void *ptr, int lastref)
> +{
> +	struct net_device *dev = ptr;
> +
> +	dev_put(dev);
> +}
> +
>  /* security context strings */
>  static int checkpoint_lsm_string(struct ckpt_ctx *ctx, void *ptr);
>  static struct ckpt_lsm_string *restore_lsm_string(struct ckpt_ctx *ctx);
> @@ -550,6 +580,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
>  		.checkpoint = checkpoint_lsm_string,
>  		.restore = restore_lsm_string_wrap,
>  	},
> +	/* Network Namespace Object */
> +	{
> +		.obj_name = "NET_NS",
> +		.obj_type = CKPT_OBJ_NET_NS,
> +		.ref_grab = netns_grab,
> +		.ref_drop = netns_drop,
> +		.checkpoint = checkpoint_netns,
> +		.restore = restore_netns,
> +	},
> +	/* Network Device Object */
> +	{
> +		.obj_name = "NET_DEV",
> +		.obj_type = CKPT_OBJ_NETDEV,
> +		.ref_grab = netdev_grab,
> +		.ref_drop = netdev_drop,
> +		.checkpoint = checkpoint_netdev,
> +		.restore = restore_netdev,
> +	},
>  };

What about leak detection ?
Aren't we missing {netns,netdev}_users()?

>  
>  
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 7101d6f..a25bac1 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -35,6 +35,7 @@
>  #include <linux/checkpoint_types.h>
>  #include <linux/checkpoint_hdr.h>
>  #include <linux/err.h>
> +#include <linux/inetdevice.h>
>  #include <net/sock.h>
>  
>  /* sycall helpers */
> @@ -119,6 +120,28 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
>  extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx, struct sock *sk);
>  extern void sock_listening_list_free(struct list_head *head);
>  
> +#ifdef CONFIG_CHECKPOINT_NETNS
> +extern int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr);
> +extern void *restore_netns(struct ckpt_ctx *ctx);
> +extern int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr);
> +extern void *restore_netdev(struct ckpt_ctx *ctx);
> +
> +extern int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx,
> +				     struct net_device *dev);
> +extern int ckpt_netdev_inet_addrs(struct in_device *indev,
> +				  struct ckpt_netdev_addr *list[]);
> +extern int ckpt_netdev_hwaddr(struct net_device *dev,
> +			      struct ckpt_hdr_netdev *h);
> +extern struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
> +						struct net_device *dev,
> +						struct ckpt_netdev_addr *addrs[]);
> +#else
> +# define checkpoint_netns NULL
> +# define restore_netns NULL
> +# define checkpoint_netdev NULL
> +# define restore_netdev NULL
> +#endif
> +
>  /* ckpt kflags */
>  #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
>  	set_bit(__kflag##_BIT, &(__ctx)->kflags)
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 41412d1..c065739 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -181,6 +181,12 @@ enum {
>  #define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
>  	CKPT_HDR_SOCKET_INET,
>  #define CKPT_HDR_SOCKET_INET CKPT_HDR_SOCKET_INET
> +	CKPT_HDR_NET_NS,
> +#define CKPT_HDR_NET_NS CKPT_HDR_NET_NS
> +	CKPT_HDR_NETDEV,
> +#define CKPT_HDR_NETDEV CKPT_HDR_NETDEV
> +	CKPT_HDR_NETDEV_ADDR,
> +#define CKPT_HDR_NETDEV_ADDR CKPT_HDR_NETDEV_ADDR
>  
>  	CKPT_HDR_TAIL = 9001,
>  #define CKPT_HDR_TAIL CKPT_HDR_TAIL
> @@ -253,6 +259,10 @@ enum obj_type {
>  #define CKPT_OBJ_SECURITY_PTR CKPT_OBJ_SECURITY_PTR
>  	CKPT_OBJ_SECURITY,
>  #define CKPT_OBJ_SECURITY CKPT_OBJ_SECURITY
> +	CKPT_OBJ_NET_NS,
> +#define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
> +	CKPT_OBJ_NETDEV,
> +#define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
>  	CKPT_OBJ_MAX
>  #define CKPT_OBJ_MAX CKPT_OBJ_MAX
>  };
> @@ -313,6 +323,7 @@ struct ckpt_hdr_tail {
>  /* container configuration section header */
>  struct ckpt_hdr_container {
>  	struct ckpt_hdr h;
> +	__s32 init_netns_ref;
>  	/*
>  	 * the header is followed by the string:
>  	 *   char lsm_name[SECURITY_NAME_MAX + 1]
> @@ -434,6 +445,7 @@ struct ckpt_hdr_ns {
>  	struct ckpt_hdr h;
>  	__s32 uts_objref;
>  	__s32 ipc_objref;
> +	__s32 net_objref;
>  } __attribute__((aligned(8)));
>  
>  /* cannot include <linux/tty.h> from userspace, so define: */
> @@ -758,6 +770,52 @@ struct ckpt_hdr_file_socket {
>  	__s32 sock_objref;
>  } __attribute__((aligned(8)));
>  
> +struct ckpt_hdr_netns {
> +	struct ckpt_hdr h;
> +	__s32 this_ref;
> +} __attribute__((aligned(8)));
> +
> +enum ckpt_netdev_types {
> +	CKPT_NETDEV_LO,
> +	CKPT_NETDEV_VETH,
> +	CKPT_NETDEV_SIT,
> +	CKPT_NETDEV_MACVLAN,
> +};
> +
> +struct ckpt_hdr_netdev {
> +	struct ckpt_hdr h;
> + 	__s32 netns_ref;
> +	union {
> +		struct {
> +			__s32 this_ref;
> +			__s32 peer_ref;
> +		} veth;
> +		struct {
> +			__u32 mode;
> +		} macvlan;
> +	};
> +	__u32 inet_addrs;
> +	__u16 type;
> +	__u16 flags;
> +	__u8 hwaddr[6];
> +} __attribute__((aligned(8)));
> +
> +enum ckpt_netdev_addr_types {
> +	CKPT_NETDEV_ADDR_IPV4,
> +};
> +
> +struct ckpt_netdev_addr {
> +	__u16 type;

Bad alignments ...

> +	union {
> +		struct {
> +			__u32 inet4_local;
> +			__u32 inet4_address;
> +			__u32 inet4_mask;
> +			__u32 inet4_broadcast;
> +		};
> +	} __attribute__((aligned(8)));
> +} __attribute__((aligned(8)));
> +
>  struct ckpt_hdr_eventpoll_items {
>  	struct ckpt_hdr h;
>  	__s32  epfile_objref;
> diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
> index 5d5e00d..efc9357 100644
> --- a/include/linux/checkpoint_types.h
> +++ b/include/linux/checkpoint_types.h
> @@ -82,6 +82,7 @@ struct ckpt_ctx {
>  	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
>  	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
>  	struct list_head listen_sockets;/* listening parent sockets */
> +	int init_netns_ref;             /* Objref of root net namespace */
>  
>  	struct ckpt_stats stats;	/* statistics */
>  
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 0da0d83..b0e67ff 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -248,6 +248,11 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
>  	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
>  	if (ret < 0)
>  		goto out;

I'm unsure why you need a separate configuration option for this ?
If you just want a shortcut to
	"#if defined(CHECKPOINT) && defined(NETNS)"
then that's ok, but then in Kconfig (below) it should be tunable by
the user.

> +#ifdef CONFIG_CHECKPOINT_NETNS
> +	ret = ckpt_obj_collect(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
> +	if (ret < 0)
> +		goto out;
> +#endif
>  	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
>  	if (ret < 0)
>  		goto out;
> @@ -288,6 +293,12 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
>  	if (ret < 0)
>  		goto out;
>  	h->ipc_objref = ret;
> +#ifdef CONFIG_CHECKPOINT_NETNS
> +	ret = checkpoint_obj(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
> +	if (ret < 0)
> +		goto out;
> +	h->net_objref = ret;
> +#endif
>  
>  	/* FIXME: for now, only marked visited to pacify leaks */
>  	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
> @@ -328,6 +339,14 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
>  		ret = PTR_ERR(uts_ns);
>  		goto out;
>  	}
> +	if (h->net_objref == 0)
> +		net_ns = current->nsproxy->net_ns;
> +	else
> +		net_ns = ckpt_obj_fetch(ctx, h->net_objref, CKPT_OBJ_NET_NS);
> +	if (IS_ERR(net_ns)) {
> +		ret = PTR_ERR(net_ns);
> +		goto out;
> +	}
>  
>  	if (h->ipc_objref == 0)
>  		ipc_ns = ctx->root_nsproxy->ipc_ns;
> @@ -339,7 +358,6 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
>  	}
>  
>  	mnt_ns = ctx->root_nsproxy->mnt_ns;
> -	net_ns = ctx->root_nsproxy->net_ns;
>  
>  	if (uts_ns == current->nsproxy->uts_ns &&
>  	    ipc_ns == current->nsproxy->ipc_ns &&
> diff --git a/net/Kconfig b/net/Kconfig
> index 041c35e..64dd3cd 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -276,4 +276,8 @@ source "net/wimax/Kconfig"
>  source "net/rfkill/Kconfig"
>  source "net/9p/Kconfig"
>  
> +config CHECKPOINT_NETNS
> +       bool
> +       default y if NET && NET_NS && CHECKPOINT
> +

Did you mean this to be visible (settable) by the user ?

>  endif   # if NET
> diff --git a/net/Makefile b/net/Makefile
> index 74b038f..570ee98 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -67,3 +67,4 @@ endif
>  obj-$(CONFIG_WIMAX)		+= wimax/
>  
>  obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
> +obj-$(CONFIG_CHECKPOINT_NETNS)	+= checkpoint_dev.o
> diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
> new file mode 100644
> index 0000000..5479419
> --- /dev/null
> +++ b/net/checkpoint_dev.c
> @@ -0,0 +1,815 @@
> +/*
> + *  Copyright 2010 IBM Corporation
> + *
> + *  Author(s): Dan Smith <danms@us.ibm.com>
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License as
> + *  published by the Free Software Foundation, version 2 of the
> + *  License.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/if.h>
> +#include <linux/if_arp.h>
> +#include <linux/inetdevice.h>
> +#include <linux/veth.h>
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +#include <linux/deferqueue.h>
> +
> +#include <net/net_namespace.h>
> +#include <net/sch_generic.h>
> +
> +struct dq_netdev {
> +	struct net_device *dev;
> +	struct ckpt_ctx *ctx;
> +};
> +
> +struct veth_newlink {
> +	char *peer;
> +};
> +
> +struct mvl_newlink {
> +	char this[IFNAMSIZ+1];
> +	char base[IFNAMSIZ+1];
> +	int mode;
> +	__u8 *hwaddr;
> +};
> +
> +typedef int (*new_link_fn)(struct sk_buff *, void *);
> +
> +static int __kern_devinet_ioctl(struct net *net, unsigned int cmd, void *arg)
> +{
> +	mm_segment_t fs;
> +	int ret;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = devinet_ioctl(net, cmd, arg);
> +	set_fs(fs);
> +
> +	return ret;
> +}
> +
> +static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
> +{
> +	mm_segment_t fs;
> +	int ret;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = dev_ioctl(net, cmd, arg);
> +	set_fs(fs);
> +
> +	return ret;
> +}
> +
> +static struct socket *rtnl_open(void)
> +{
> +	struct socket *sock;
> +	int ret;
> +
> +	ret = sock_create(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE, &sock);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	return sock;
> +}
> +
> +static int rtnl_close(struct socket *rtnl)
> +{
> +	if (rtnl)
> +		return kernel_sock_shutdown(rtnl, SHUT_RDWR);
> +	else
> +		return 0;
> +}
> +
> +static struct nlmsghdr *rtnl_get_response(struct socket *rtnl,
> +					  struct sk_buff **skb)
> +{
> +	int ret;
> +	long timeo = MAX_SCHEDULE_TIMEOUT;
> +	struct nlmsghdr *nlh;
> +
> +	ret = sk_wait_data(rtnl->sk, &timeo);
> +	if (!ret)
> +		return ERR_PTR(-EPIPE);
> +
> +	*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
> +	if (!*skb)
> +		return ERR_PTR(-EPIPE);
> +
> +	ret = -EINVAL;
> +	nlh = nlmsg_hdr(*skb);
> +	if (!nlh)
> +		goto err;
> +
> +	if (nlh->nlmsg_type == NLMSG_ERROR) {
> +		struct nlmsgerr *errmsg = nlmsg_data(nlh);
> +		ret = errmsg->error;
> +		goto err;
> +	}
> +
> +	return nlh;
> + err:
> +	kfree_skb(*skb);
> +	*skb = NULL;
> +
> +	return ERR_PTR(ret);
> +}
> +
> +int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
> +{
> +	return dev->nd_net == current->nsproxy->net_ns;
> +}

This has raised questions before - probably worth a comment that explain
what's the rationale here.

> +
> +int ckpt_netdev_hwaddr(struct net_device *dev, struct ckpt_hdr_netdev *h)
> +{
> +	struct net *net = dev->nd_net;
> +	struct ifreq req;
> +	int ret;
> +
> +	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
> +	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
> +	h->flags = req.ifr_flags;
> +	if (ret < 0)
> +		return ret;

[nit] is it intentional that you assign h->flags before testing ret ?

> +
> +	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
> +	if (ret < 0)
> +		return ret;
> +
> +	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
> +
> +	return 0;
> +}
> +
> +int ckpt_netdev_inet_addrs(struct in_device *indev,
> +			   struct ckpt_netdev_addr *_abuf[])
> +{
> +	struct ckpt_netdev_addr *abuf = NULL;
> +	struct in_ifaddr *addr = indev->ifa_list;
> +	int pages = 0;
> +	int addrs = 0;
> +	int max;
> +
> + retry:
> +	if (++pages > 4) {
> +		addrs = -E2BIG;
> +		goto out;
> +	}

Why 4 ?

This makes me wonder if it's worth to allocate a temp buf over the
checkpoint ctx (e.g. ctx->tmpbuf) to be used for scratch ?

> +
> +	*_abuf = krealloc(abuf, PAGE_SIZE * pages, GFP_KERNEL);
> +	if (*_abuf == NULL) {
> +		addrs = -ENOMEM;
> +		goto out;
> +	}
> +	abuf = *_abuf;
> +
> +	read_lock(&dev_base_lock);
> +
> +	max = (pages * PAGE_SIZE) / sizeof(*abuf);
> +	while (addr) {
> +		abuf[addrs].type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 now */
> +		abuf[addrs].inet4_local = addr->ifa_local;
> +		abuf[addrs].inet4_address = addr->ifa_address;
> +		abuf[addrs].inet4_mask = addr->ifa_mask;
> +		abuf[addrs].inet4_broadcast = addr->ifa_broadcast;
> +
> +		addr = addr->ifa_next;
> +		if (++addrs >= max) {
> +			read_unlock(&dev_base_lock);
> +			goto retry;
> +		}
> +	}
> +
> +	read_unlock(&dev_base_lock);
> + out:
> +	if (addrs < 0) {
> +		kfree(abuf);
> +		*_abuf = NULL;
> +	}
> +
> +	return addrs;
> +}
> +
> +struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
> +					 struct net_device *dev,
> +					 struct ckpt_netdev_addr *addrs[])
> +{
> +	struct ckpt_hdr_netdev *h;
> +	int ret;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
> +	if (!h)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = ckpt_netdev_hwaddr(dev, h);
> +	if (ret < 0)
> +		goto out;
> +
> +	*addrs = NULL;
> +	ret = h->inet_addrs = ckpt_netdev_inet_addrs(dev->ip_ptr, addrs);
> +	if (ret < 0) {
> +		if (ret == -E2BIG)
> +			ckpt_err(ctx, ret,
> +				 "Too many inet addresses on interface %s\n",
> +				 dev->name);

Do we really need this special case ?  I'd be happy with a ckpt_err()
for any error - and the actual error number would be useful to tell
which case it was.

> +		goto out;
> +	}
> +
> +	if (ckpt_netdev_in_init_netns(ctx, dev))
> +		ret = h->netns_ref = 0;
> +	else
> +		ret = h->netns_ref = checkpoint_obj(ctx, dev->nd_net,
> +						    CKPT_OBJ_NET_NS);
> + out:
> +	if (ret < 0) {
> +		ckpt_hdr_put(ctx, h);
> +		h = ERR_PTR(ret);
> +		if (*addrs)
> +			kfree(*addrs);
> +	}
> +
> +	return h;
> +}
> +
> +int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
> +{
> +	struct net_device *dev = (struct net_device *)ptr;
> +
> +	if (!dev->netdev_ops->ndo_checkpoint) {
> +		ckpt_err(ctx, -ENOSYS,
> +			 "Device %s does not support checkpoint\n", dev->name);
> +		return -ENOSYS;
> +	}
> +
> +	ckpt_debug("checkpointing netdev %s\n", dev->name);
> +
> +	return dev->netdev_ops->ndo_checkpoint(ctx, dev);
> +}
> +
> +int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
> +{
> +	struct net *net = ptr;
> +	struct net_device *dev;
> +	struct ckpt_hdr_netns *h;
> +	int ret;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
> +	BUG_ON(h->this_ref == 0);

How about s/==/<=/ ?

> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	if (ret < 0)
> +		goto out;
> +
> +	for_each_netdev(net, dev) {
> +		if (!dev->netdev_ops->ndo_checkpoint) {

Isn't this check redundant ?  I expect it to fail promptly in
checkpoint_netdev() above.

> +			ret = -ENOSYS;
> +			ckpt_err(ctx, ret,
> +				 "Device %s does not support checkpoint\n",
> +				 dev->name);
> +			break;
> +		}
> +
> +		ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
> +		if (ret < 0)
> +			break;
> +	}
> + out:
> +	ckpt_hdr_put(ctx, h);
> +
> +	return ret;
> +}
> +
> +static int restore_in_addrs(struct ckpt_ctx *ctx,
> +			    __u32 naddrs,
> +			    struct net *net,
> +			    struct net_device *dev)
> +{
> +	__u32 i;
> +	int ret = 0;
> +	int len = naddrs * sizeof(struct ckpt_netdev_addr);
> +	struct ckpt_netdev_addr *addrs = NULL;
> +
> +	addrs = kmalloc(len, GFP_KERNEL);
> +	if (!addrs)
> +		return -ENOMEM;
> +
> +	ret = _ckpt_read_buffer(ctx, addrs, len);
> +	if (ret < 0)
> +		goto out;

Could you use ckpt_read_payload() instead of the above ?

> +
> +	for (i = 0; i < naddrs; i++) {
> +		struct ckpt_netdev_addr *addr = &addrs[i];
> +		struct ifreq req;
> +		struct sockaddr_in *inaddr;
> +
> +		if (addr->type != CKPT_NETDEV_ADDR_IPV4) {
> +			ret = -EINVAL;
> +			ckpt_err(ctx, ret, "Unsupported netdev addr type %i\n",
> +				 addr->type);
> +			break;
> +		}
> +
> +		ckpt_debug("restoring %s: %x/%x/%x\n", dev->name,
> +			   addr->inet4_address,
> +			   addr->inet4_mask,
> +			   addr->inet4_broadcast);
> +
> +		memcpy(req.ifr_name, dev->name, IFNAMSIZ);
> +
> +		inaddr = (struct sockaddr_in *)&req.ifr_addr;
> +		inaddr->sin_addr.s_addr = addr->inet4_address;
> +		inaddr->sin_family = AF_INET;
> +		ret = __kern_devinet_ioctl(net, SIOCSIFADDR, &req);
> +		if (ret < 0) {
> +			ckpt_err(ctx, ret, "Failed to set address\n");
> +			break;
> +		}
> +
> +		inaddr = (struct sockaddr_in *)&req.ifr_addr;
> +		inaddr->sin_addr.s_addr = addr->inet4_mask;
> +		inaddr->sin_family = AF_INET;
> +		ret = __kern_devinet_ioctl(net, SIOCSIFNETMASK, &req);
> +		if (ret < 0) {
> +			ckpt_err(ctx, ret, "Failed to set netmask\n");
> +			break;
> +		}
> +
> +		inaddr = (struct sockaddr_in *)&req.ifr_addr;
> +		inaddr->sin_addr.s_addr = addr->inet4_broadcast;
> +		inaddr->sin_family = AF_INET;
> +		ret = __kern_devinet_ioctl(net, SIOCSIFBRDADDR, &req);
> +		if (ret < 0) {
> +			ckpt_err(ctx, ret, "Failed to set broadcast\n");
> +			break;
> +		}
> +	}
> +
> + out:
> +	kfree(addrs);
> +
> +	return ret;
> +}
> +
> +static int veth_new_link_msg(struct sk_buff *skb, void *data)
> +{
> +	struct nlattr *linkinfo;
> +	struct nlattr *linkdata;
> +	struct ifinfomsg ifm;
> +	int ret = -ENOMEM;
> +	struct veth_newlink *d = data;
> +
> +	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
> +	if (!linkinfo)
> +		goto out;
> +
> +	ret = nla_put_string(skb, IFLA_INFO_KIND, "veth");
> +	if (ret)
> +		goto out;
> +
> +	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
> +	if (!linkdata) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = nla_put(skb, VETH_INFO_PEER, sizeof(ifm), &ifm);
> +	if (!ret)
> +		ret = nla_put_string(skb, IFLA_IFNAME, d->peer);
> +
> +	nla_nest_end(skb, linkdata);
> + out:
> +	nla_nest_end(skb, linkinfo);
> +
> +	return ret;
> +}
> +
> +static int mvl_new_link_msg(struct sk_buff *skb, void *data)
> +{
> +	struct mvl_newlink *d = data;
> +	struct nlattr *linkinfo;
> +	struct nlattr *linkdata;
> +	struct net_device *lowerdev;
> +	int ret;
> +
> +	lowerdev = dev_get_by_name(current->nsproxy->net_ns, d->base);
> +	if (!lowerdev)
> +		return -ENOENT;
> +
> +	ret = nla_put(skb, IFLA_ADDRESS, ETH_ALEN, d->hwaddr);
> +	if (ret)
> +		goto out_put;
> +
> +	ret = nla_put_u32(skb, IFLA_LINK, lowerdev->ifindex);
> +	if (ret)
> +		goto out_put;
> +
> +	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
> +	if (!linkinfo) {
> +		ret = -ENOMEM;
> +		goto out;

s/out/out_put/   ??

> +	}
> +
> +	ret = nla_put_string(skb, IFLA_INFO_KIND, "macvlan");
> +	if (ret)
> +		goto out;
> +
> +	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
> +	if (!linkdata) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = nla_put_u32(skb, IFLA_MACVLAN_MODE, d->mode);
> +	nla_nest_end(skb, linkdata);
> + out:
> +	nla_nest_end(skb, linkinfo);
> + out_put:
> +	dev_put(lowerdev);
> +
> +	return ret;
> +}
> +
> +static struct sk_buff *new_link_msg(new_link_fn fn, void *data, char *name)
> +{
> +	int ret = -ENOMEM;
> +	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
> +	struct nlmsghdr *nlh;
> +	struct sk_buff *skb;
> +	struct ifinfomsg *ifm;
> +
> +	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!skb)
> +		goto out;
> +
> +	nlh = nlmsg_put(skb, 0, 0, RTM_NEWLINK, sizeof(*ifm), flags);
> +	if (!nlh)
> +		goto out;
> +
> +	ifm = nlmsg_data(nlh);
> +	memset(ifm, 0, sizeof(*ifm));
> +
> +	ret = nla_put_string(skb, IFLA_IFNAME, name);
> +	if (ret)
> +		goto out;
> +
> +	ret = fn(skb, data);
> +
> +	nlmsg_end(skb, nlh);
> +
> + out:
> +	if (ret < 0) {
> +		kfree_skb(skb);
> +		skb = ERR_PTR(ret);
> +	}
> +
> +	return skb;
> +}
> +
> +static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
> +{
> +	int ret = -ENOMEM;
> +	struct socket *rtnl = NULL;
> +	struct sk_buff *skb = NULL;
> +	struct nlmsghdr *nlh;
> +	struct msghdr msg;
> +	struct kvec kvec;
> +
> +	skb = new_link_msg(fn, data, name);
> +	if (IS_ERR(skb)) {
> +		ret = PTR_ERR(skb);
> +		ckpt_debug("failed to create new link message: %i\n", ret);
> +		skb = NULL;
> +		goto out;

Is this right - you didn't open the rtnl yet ?
Perhaps only ckpt_debug() and then return PTR_ERR(skb).

> +	}
> +
> +	memset(&msg, 0, sizeof(msg));
> +	kvec.iov_len = skb->len;
> +	kvec.iov_base = skb->head;
> +
> +	rtnl = rtnl_open();
> +	if (IS_ERR(rtnl)) {
> +		ret = PTR_ERR(rtnl);
> +		ckpt_debug("Unable to open rtnetlink socket: %i\n", ret);
> +		goto out_noclose;
> +	}
> +
> +	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
> +	if (ret < 0)
> +		goto out;
> +	else if (ret != skb->len) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	/* Free the send skb to make room for the receive skb */
> +	kfree_skb(skb);
> +
> +	nlh = rtnl_get_response(rtnl, &skb);
> +	if (IS_ERR(nlh)) {

If this happens, would rtnl_get_response() place a NULL in skb ?
If not, then the kfree_skb() below will free the previous skb again.

> +		ret = PTR_ERR(nlh);
> +		ckpt_debug("RTNETLINK said: %i\n", ret);
> +	}
> + out:
> +	rtnl_close(rtnl);
> + out_noclose:
> +	kfree_skb(skb);
> +
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +	else
> +		return dev_get_by_name(current->nsproxy->net_ns, name);
> +}
> +
> +static int netdev_noop(void *data)
> +{
> +	return 0;
> +}
> +
> +static int netdev_cleanup(void *data)
> +{
> +	struct dq_netdev *dq = data;
> +
> +	dev_put(dq->dev);
> +
> +	if (dq->ctx->errno) {
> +		ckpt_debug("Unregistering netdev %s\n", dq->dev->name);
> +		unregister_netdev(dq->dev);
> +	}
> +
> +	return 0;
> +}
> +
> +static struct net_device *restore_veth(struct ckpt_ctx *ctx,
> +				       struct ckpt_hdr_netdev *h,
> +				       struct net *net)
> +{
> +	int ret;
> +	char this_name[IFNAMSIZ];
> +	char peer_name[IFNAMSIZ];
> +	struct net_device *dev;
> +	struct net_device *peer;
> +	int didreg = 0;
> +	struct ifreq req;
> +	struct dq_netdev dq;
> +
> +	dq.ctx = ctx;
> +
> +	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	ret = _ckpt_read_buffer(ctx, peer_name, IFNAMSIZ);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	ckpt_debug("restored veth netdev %s:%s\n", this_name, peer_name);
> +
> +	peer = ckpt_obj_try_fetch(ctx, h->veth.peer_ref, CKPT_OBJ_NETDEV);
> +	if (IS_ERR(peer)) {
> +		struct veth_newlink veth = {
> +			.peer = peer_name,
> +		};
> +
> +		/* We're first: allocate the veth pair */
> +		didreg = 1;
> +
> +		dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
> +		if (IS_ERR(dev))
> +			return dev;
> +
> +		peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
> +		if (!peer) {
> +			ret = -EINVAL;
> +			goto err_dev;
> +		}
> +
> +		dq.dev = peer;
> +		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
> +				     netdev_noop, netdev_cleanup);
> +		if (ret)
> +			goto err_peer;
> +
> +		ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
> +				      CKPT_OBJ_NETDEV);
> +		if (ret < 0)
> +			/* Can't recall peer dq, so let it cleanup peer */
> +			goto err_dev;

This may be a bit simpler if you move the first deferqueue_add()
forward to just before the other one. Or better: change dq_netdev
to have two pointers, dev and peer (if any is null, the cleanup
function will skip).

> +		dev_put(peer);
> +
> +		dq.dev = dev;
> +		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
> +				     netdev_noop, netdev_cleanup);
> +		if (ret)
> +			/* Can't recall peer dq, so let it cleanup peer */
> +			goto err_dev;
> +
> +	} else {
> +		/* We're second: get our dev from the hash */
> +		dev = ckpt_obj_fetch(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
> +		if (IS_ERR(dev))
> +			return dev;
> +	}
> +
> +	/* Move to our new netns */
> +	rtnl_lock();
> +	ret = dev_change_net_namespace(dev, net, dev->name);
> +	rtnl_unlock();
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Restore MAC address */
> +	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
> +	memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
> +	req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
> +	ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
> + out:
> +	if (ret)
> +		dev = ERR_PTR(ret);
> +
> +	return dev;
> +
> + err_peer:
> +	dev_put(peer);
> +	unregister_netdev(peer);
> + err_dev:
> +	dev_put(dev);
> +	unregister_netdev(dev);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static struct net_device *restore_lo(struct ckpt_ctx *ctx,
> +				     struct ckpt_hdr_netdev *h,
> +				     struct net *net)
> +{
> +	struct net_device *dev;
> +	char name[IFNAMSIZ+1];
> +	int ret;
> +
> +	dev = dev_get_by_name(net, "lo");
> +	if (!dev)
> +		return ERR_PTR(-EINVAL);
> +
> +	ret = _ckpt_read_buffer(ctx, name, IFNAMSIZ);
> +	if (ret < 0)
> +		goto err;
> +
> +	if (strncmp(dev->name, name, IFNAMSIZ) != 0) {
> +		ret = dev_change_name(dev, name);
> +		if (ret < 0)
> +			goto err;
> +	}
> +
> +	return dev;
> + err:
> +	dev_put(dev);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static struct net_device *restore_sit(struct ckpt_ctx *ctx,
> +				      struct ckpt_hdr_netdev *h,
> +				      struct net *net)
> +{
> +	/* Don't actually do anything for SIT devices yet */
> +	return dev_get_by_name(net, "sit0");
> +}
> +
> +static struct net_device *restore_macvlan(struct ckpt_ctx *ctx,
> +					  struct ckpt_hdr_netdev *h,
> +					  struct net *net)
> +{
> +	struct net_device *dev;
> +	struct mvl_newlink mvl = {
> +		.mode = h->macvlan.mode,
> +		.hwaddr = h->hwaddr,
> +	};
> +	int ret;
> +
> +	ret = _ckpt_read_buffer(ctx, mvl.this, IFNAMSIZ);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	ret = _ckpt_read_buffer(ctx, mvl.base, IFNAMSIZ);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	dev = rtnl_newlink(mvl_new_link_msg, &mvl, mvl.this);
> +	if (IS_ERR(dev)) {
> +		ckpt_err(ctx, PTR_ERR(dev),
> +			 "Failed to create macvlan device %s:%s",
> +			 mvl.this, mvl.base);
> +		goto out;
> +	}
> +
> +	rtnl_lock();
> +	ret = dev_change_net_namespace(dev, net, dev->name);
> +	rtnl_unlock();
> +
> +	if (ret) {
> +		ckpt_err(ctx, ret, "Failed to change netns of %s:%s\n",
> +			 mvl.this, mvl.base);
> +		dev_put(dev);
> +		unregister_netdev(dev);
> +		dev = ERR_PTR(ret);
> +	}
> + out:
> +	return dev;
> +}
> +
> +void *restore_netdev(struct ckpt_ctx *ctx)
> +{
> +	struct ckpt_hdr_netdev *h;
> +	struct net_device *dev = NULL;
> +	struct ifreq req;
> +	struct net *net;
> +	int ret;
> +
> +	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
> +	if (IS_ERR(h)) {
> +		ckpt_err(ctx, PTR_ERR(h), "failed to read netdev\n");

This ckpt_err() is redundant. ckpt_read_obj_type() already calls
ckpt_err() if it fails.

> +		return h;
> +	}
> +

How about some comment to explain this following snippet ?

> +	if (h->netns_ref != 0) {
> +		net = ckpt_obj_try_fetch(ctx, h->netns_ref, CKPT_OBJ_NET_NS);
> +		if (IS_ERR(net)) {
> +			ckpt_debug("failed to get net for %i\n", h->netns_ref);
> +			ret = PTR_ERR(net);
> +			net = current->nsproxy->net_ns;
			^^^^^^^^^^^^^^^^^
Why this ?

> +			goto out;
> +		}
> +	} else
> +		net = current->nsproxy->net_ns;
> +
> +	if (h->type == CKPT_NETDEV_VETH)
> +		dev = restore_veth(ctx, h, net);
> +	else if (h->type == CKPT_NETDEV_LO)
> +		dev = restore_lo(ctx, h, net);
> +	else if (h->type == CKPT_NETDEV_SIT)
> +		dev = restore_sit(ctx, h, net);
> +	else if (h->type == CKPT_NETDEV_MACVLAN)
> +		dev = restore_macvlan(ctx, h, net);
> +	else
> +		dev = ERR_PTR(-EINVAL);

This is ugly. How about a dispatch table intead ?  It will also
allow in the future for kernel modules to register their restore
functions.

> +
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		ckpt_err(ctx, ret, "Netdev type %i not supported\n", h->type);
> +		goto out;
> +	}
> +
> +	/* Restore flags (which will likely bring the interface up) */
> +	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
> +	req.ifr_flags = h->flags;
> +	ret = __kern_dev_ioctl(net, SIOCSIFFLAGS, &req);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (h->inet_addrs > 0)
> +		ret = restore_in_addrs(ctx, h->inet_addrs, net, dev);
> + out:
> +	if (ret) {
> +		ckpt_err(ctx, ret, "Failed to restore netdevice\n");
> +		if ((h->type == CKPT_NETDEV_VETH) && !IS_ERR(dev)) {
> +			dev_put(dev);
> +		}
> +		dev = ERR_PTR(ret);
> +	} else
> +		ckpt_debug("restored netdev %s\n", dev->name);
> +
> +	ckpt_hdr_put(ctx, h);
> +
> +	return dev;
> +}
> +
> +void *restore_netns(struct ckpt_ctx *ctx)
> +{
> +	struct ckpt_hdr_netns *h;
> +	struct net *net;
> +
> +	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
> +	if (IS_ERR(h)) {
> +		ckpt_err(ctx, PTR_ERR(h), "failed to read netns\n");

Here, too, no need for ckpt_err().

> +		return h;
> +	}
> +

Here, too, a comment to explain the snippet will be helpful.

> +	if (h->this_ref != 0) {
> +		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
> +		if (IS_ERR(net))
> +			goto out;

[nit] this test isn't necessary.

> +	} else
> +		net = current->nsproxy->net_ns;
> + out:
> +	ckpt_hdr_put(ctx, h);
> +
> +	return net;
> +}

Oren.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-02-26 14:56     ` Dan Smith
@ 2010-03-06  3:55       ` Oren Laadan
       [not found]         ` <4B91D234.2020003-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Oren Laadan @ 2010-03-06  3:55 UTC (permalink / raw)
  To: Dan Smith; +Cc: David Miller, containers, den, netdev, ebiederm, benjamin.thery


Ditto for the existing ipv4 (socket) c/r code ... ?

Dan Smith wrote:
> DM> To be safe you should probably use __be32 and store the IP
> DM> addresses in network byte order.
> 
> Okay, yeah, good call.
> 
> DM> Acked-by: David S. Miller <davem@davemloft.net>
> 
> Thanks Dave!
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
       [not found]     ` <4B91D1A3.9030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-06 17:08       ` Dan Smith
       [not found]         ` <87bpf1idic.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Smith @ 2010-03-06 17:08 UTC (permalink / raw)
  To: Oren Laadan
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	containers-qjLDD68F18O7TbgM5vRIOg, den-GEFAQzZX7r8dnm+yROfE0A,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, benjamin.thery-6ktuUTfB/bM

OL> What about leak detection ?
OL> Aren't we missing {netns,netdev}_users()?

This is something I need to give more thought to, but it's not as easy
as it sounds.  Network devices aren't released at the last put() like
a lot of other things, and my initial attempts to reconcile the
refcount after a checkpoint operation have not been successful.

However, I'm not sure that it's as important here, because AFAIK, a
network device can only exist in one network namespace at a time.  If
we're checkpointing a netdev, it's because we are checkpointing the
namespace that it lives in.  Making sure the netns isn't leaked out of
the process tree would be much easier and just as effective, no?

>> +config CHECKPOINT_NETNS
>> +       bool
>> +       default y if NET && NET_NS && CHECKPOINT
>> +

OL> Did you mean this to be visible (settable) by the user ?

No, it was specifically supposed to enable itself when those other
items are enabled, but not be a user adjustable toggle.  I had a
discussion with Serge about it and we came to this as a solution,
although I don't remember what the problem we started with was.  I'll
dig through my IRC logs to see if I can figure it out.

>> + retry:
>> +	if (++pages > 4) {
>> +		addrs = -E2BIG;
>> +		goto out;
>> +	}

OL> Why 4 ?

It's just a sanity limit.

OL> Do we really need this special case ?  I'd be happy with a ckpt_err()
OL> for any error - and the actual error number would be useful to tell
OL> which case it was.

Unless I'm missing something, you asked for this specifically:

https://lists.linux-foundation.org/pipermail/containers/2010-February/022844.html

OL> Isn't this check redundant ?  I expect it to fail promptly in
OL> checkpoint_netdev() above.

No, as I've said a couple of times previously, this isn't the only way
we can arrive at a netdev for checkpointing.  This case is the one
where we're marching through the netns and find a netdev that is not
supported.  The other is where we arrive at a device as a peer of
another device, so the other check may come into play at times where
this one doesn't and vice versa.

OL> This may be a bit simpler if you move the first deferqueue_add()
OL> forward to just before the other one. Or better: change dq_netdev
OL> to have two pointers, dev and peer (if any is null, the cleanup
OL> function will skip).

The reason it is this messy is because of the way network devices are
deallocated.  Since they don't destroy themselves on the final put(),
we have to explicitly call unregister_netdev() on them when we know
they're no longer used (else we block).  Once we've added them to the
deferqueue, we can no longer destroy them here because a reference is
held and the deferqueue will run afterwards.

The ordering of this is a result of me injecting failures at each step
and working it out until I got it to not block on unregistering either
of the devices in all of the error paths.  That's not to say it's the
best way, but there is a reason it's ordered the way it is.

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
       [not found]         ` <4B91D234.2020003-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-06 17:09           ` Dan Smith
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Smith @ 2010-03-06 17:09 UTC (permalink / raw)
  To: Oren Laadan
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	containers-qjLDD68F18O7TbgM5vRIOg, den-GEFAQzZX7r8dnm+yROfE0A,
	David Miller, benjamin.thery-6ktuUTfB/bM

OL> Ditto for the existing ipv4 (socket) c/r code ... ?

Yes.  I've got a couple of header format changes lined up for IPv6
stuff, so I can add a patch to do this in that set, if it's okay.

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
       [not found]         ` <87bpf1idic.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2010-03-06 22:21           ` Oren Laadan
  2010-03-08 17:36             ` Dan Smith
  0 siblings, 1 reply; 25+ messages in thread
From: Oren Laadan @ 2010-03-06 22:21 UTC (permalink / raw)
  To: Dan Smith
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	containers-qjLDD68F18O7TbgM5vRIOg, den-GEFAQzZX7r8dnm+yROfE0A,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, benjamin.thery-6ktuUTfB/bM



Dan Smith wrote:
> OL> What about leak detection ?
> OL> Aren't we missing {netns,netdev}_users()?
> 
> This is something I need to give more thought to, but it's not as easy
> as it sounds.  Network devices aren't released at the last put() like
> a lot of other things, and my initial attempts to reconcile the
> refcount after a checkpoint operation have not been successful.
> 
> However, I'm not sure that it's as important here, because AFAIK, a
> network device can only exist in one network namespace at a time.  If
> we're checkpointing a netdev, it's because we are checkpointing the
> namespace that it lives in.  Making sure the netns isn't leaked out of
> the process tree would be much easier and just as effective, no?

We should guarantee that neither netns nor netdev leaks outside
the container; currently none is. If a netdev can only belong to
a single netns, then it suffices to only care about netns.

> 
>>> +config CHECKPOINT_NETNS
>>> +       bool
>>> +       default y if NET && NET_NS && CHECKPOINT
>>> +
> 
> OL> Did you mean this to be visible (settable) by the user ?
> 
> No, it was specifically supposed to enable itself when those other
> items are enabled, but not be a user adjustable toggle.  I had a
> discussion with Serge about it and we came to this as a solution,
> although I don't remember what the problem we started with was.  I'll
> dig through my IRC logs to see if I can figure it out.

Duh.. my bad, I misinterpreted the code. That's fine.

BTW, there is a similar SYSVIPC_CHECKPOINT - we should decide
if we do X_CHECKPOINT or CHECKPOINT_X for a subsystem X, and
stick to that convention. I prefer the latter - what you did...

> 
>>> + retry:
>>> +	if (++pages > 4) {
>>> +		addrs = -E2BIG;
>>> +		goto out;
>>> +	}
> 
> OL> Why 4 ?
> 
> It's just a sanity limit.

Hmm... let me be more explicit:  why not keep trying until it
realloc fails ?  or switch to vmalloc() at some point ?

> 
> OL> Do we really need this special case ?  I'd be happy with a ckpt_err()
> OL> for any error - and the actual error number would be useful to tell
> OL> which case it was.
> 
> Unless I'm missing something, you asked for this specifically:
> 
> https://lists.linux-foundation.org/pipermail/containers/2010-February/022844.html

Lol .. that was me :o  But looking at the code it feels wrong,
because the errno already reveals the type of the problem.

I'm thinking - wouldn't it make sense to do error reporting
in checkpoint_netdev() if the call to ->ndo_checkpoint() fails ?

> 
> OL> Isn't this check redundant ?  I expect it to fail promptly in
> OL> checkpoint_netdev() above.
> 
> No, as I've said a couple of times previously, this isn't the only way
> we can arrive at a netdev for checkpointing.  This case is the one
> where we're marching through the netns and find a netdev that is not
> supported.  The other is where we arrive at a device as a peer of
> another device, so the other check may come into play at times where
> this one doesn't and vice versa.

I'm confused: in checkpoint_ns() inside the for_each_netdev()
loop you first test for dev->netdev_ops->ndo_checkpoint and
then call checkpoint_obj(... CKPT_OBJ_NETDEV) - which in turn
will call checkpoint_netdev(), which will again test for
dev->netdev_ops->ndo_checkpoint ...  am I reading it wrongly ?

> 
> OL> This may be a bit simpler if you move the first deferqueue_add()
> OL> forward to just before the other one. Or better: change dq_netdev
> OL> to have two pointers, dev and peer (if any is null, the cleanup
> OL> function will skip).
> 
> The reason it is this messy is because of the way network devices are
> deallocated.  Since they don't destroy themselves on the final put(),
> we have to explicitly call unregister_netdev() on them when we know
> they're no longer used (else we block).  Once we've added them to the
> deferqueue, we can no longer destroy them here because a reference is
> held and the deferqueue will run afterwards.
> 
> The ordering of this is a result of me injecting failures at each step
> and working it out until I got it to not block on unregistering either
> of the devices in all of the error paths.  That's not to say it's the
> best way, but there is a reason it's ordered the way it is.
> 

How about this - to me it feels simpler:

	dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
	if (IS_ERR(dev))
		return dev;

	peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
	if (!peer) {
		ret = -EINVAL;
		goto err_dev;
	}
	ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
			      CKPT_OBJ_NETDEV);
	if (ret < 0)
		goto err_peer;

	dev_put(peer);

	dq.dev = dev;
	dq.peer = peer;
	ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
			     netdev_noop, netdev_cleanup);
	if (ret)
		goto err_peer;

(yes, you need to adjust struct dq_netdev and netdev_cleanup).

BTW, the variable "didreg" should disappear from restore_veth().

Oren.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-03-06 22:21           ` Oren Laadan
@ 2010-03-08 17:36             ` Dan Smith
  2010-03-08 17:53               ` Eric W. Biederman
  2010-03-08 18:36               ` Oren Laadan
  0 siblings, 2 replies; 25+ messages in thread
From: Dan Smith @ 2010-03-08 17:36 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, den, netdev, davem, ebiederm, benjamin.thery

OL> I'm confused: in checkpoint_ns() inside the for_each_netdev() loop
OL> you first test for dev->netdev_ops->ndo_checkpoint and then call
OL> checkpoint_obj(... CKPT_OBJ_NETDEV) - which in turn will call
OL> checkpoint_netdev(), which will again test for
dev-> netdev_ops->ndo_checkpoint ...  am I reading it wrongly ?

In the case of veth, yes.  It goes something like this:

checkpoint_netns() {
  foreach netdev in netns {
    checkpoint_netdev {
      if netdev is veth {
        checkpoint_peer(); // Will call checkpoint_netdev again
      }
    }
  }
}

It shouldn't happen, but it seems like since we could potentially add
another checkpoint_obj(mydev) somewhere other than in
checkpoint_netdev(), it is reasonable to check that there is actually
something to call before we call it.

Would you prefer a BUG()?

OL> How about this - to me it feels simpler:

OL> 	dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
OL> 	if (IS_ERR(dev))
OL> 		return dev;

OL> 	peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
OL> 	if (!peer) {
OL> 		ret = -EINVAL;
OL> 		goto err_dev;
OL> 	}
OL> 	ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
OL> 			      CKPT_OBJ_NETDEV);
OL> 	if (ret < 0)
OL> 		goto err_peer;

OL> 	dev_put(peer);

OL> 	dq.dev = dev;
OL> 	dq.peer = peer;
OL> 	ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
OL> 			     netdev_noop, netdev_cleanup);
OL> 	if (ret)
OL> 		goto err_peer;

If you fail here you need to unregister_netdev() because the dev_put()
that the objhash will not cause it to happen.  Unless we add something
to allow you to remove your object from the hash, you can't prevent
that final put, so you have to have it in the deferqueue for
later.  You can't check the refcount in the objhash function because it
will differ depending on the number of addresses and protocols the
device has, and those don't get released until unregister_netdev()
which will block if you call it before you've released all of your
references.  If the objhash put function could examine ctx->errno,
then it could drop its reference and then call unregister_netdev(),
but that would involve changing all the drop functions.  What am I
missing?

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-03-08 17:36             ` Dan Smith
@ 2010-03-08 17:53               ` Eric W. Biederman
  2010-03-08 18:07                   ` Dan Smith
  2010-03-08 18:36               ` Oren Laadan
  1 sibling, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2010-03-08 17:53 UTC (permalink / raw)
  To: Dan Smith; +Cc: Oren Laadan, containers, den, netdev, davem, benjamin.thery

Dan Smith <danms@us.ibm.com> writes:

> OL> I'm confused: in checkpoint_ns() inside the for_each_netdev() loop
> OL> you first test for dev->netdev_ops->ndo_checkpoint and then call
> OL> checkpoint_obj(... CKPT_OBJ_NETDEV) - which in turn will call
> OL> checkpoint_netdev(), which will again test for
> dev-> netdev_ops->ndo_checkpoint ...  am I reading it wrongly ?
>
> In the case of veth, yes.  It goes something like this:
>
> checkpoint_netns() {
>   foreach netdev in netns {
>     checkpoint_netdev {
>       if netdev is veth {
>         checkpoint_peer(); // Will call checkpoint_netdev again
>       }
>     }
>   }
> }
>
> It shouldn't happen, but it seems like since we could potentially add
> another checkpoint_obj(mydev) somewhere other than in
> checkpoint_netdev(), it is reasonable to check that there is actually
> something to call before we call it.
>
> Would you prefer a BUG()?
>
> OL> How about this - to me it feels simpler:
>
> OL> 	dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
> OL> 	if (IS_ERR(dev))
> OL> 		return dev;
>
> OL> 	peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
> OL> 	if (!peer) {
> OL> 		ret = -EINVAL;
> OL> 		goto err_dev;
> OL> 	}
> OL> 	ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
> OL> 			      CKPT_OBJ_NETDEV);
> OL> 	if (ret < 0)
> OL> 		goto err_peer;
>
> OL> 	dev_put(peer);
>
> OL> 	dq.dev = dev;
> OL> 	dq.peer = peer;
> OL> 	ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
> OL> 			     netdev_noop, netdev_cleanup);
> OL> 	if (ret)
> OL> 		goto err_peer;
>
> If you fail here you need to unregister_netdev() because the dev_put()
> that the objhash will not cause it to happen.  Unless we add something
> to allow you to remove your object from the hash, you can't prevent
> that final put, so you have to have it in the deferqueue for
> later.  You can't check the refcount in the objhash function because it
> will differ depending on the number of addresses and protocols the
> device has, and those don't get released until unregister_netdev()
> which will block if you call it before you've released all of your
> references.  If the objhash put function could examine ctx->errno,
> then it could drop its reference and then call unregister_netdev(),
> but that would involve changing all the drop functions.  What am I
> missing?

Can we take advantage of the fact that when you destroy a network
namespace the virtual devices in that network namespace are also destroyed?

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-03-08 17:53               ` Eric W. Biederman
@ 2010-03-08 18:07                   ` Dan Smith
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Smith @ 2010-03-08 18:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oren Laadan, containers, den, netdev, davem, benjamin.thery

EB> Can we take advantage of the fact that when you destroy a network
EB> namespace the virtual devices in that network namespace are also
EB> destroyed?

Hmm, that kinda seems like cheating, but maybe so.  I'll take a look :)

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
@ 2010-03-08 18:07                   ` Dan Smith
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Smith @ 2010-03-08 18:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oren Laadan, containers, den, netdev, davem, benjamin.thery

EB> Can we take advantage of the fact that when you destroy a network
EB> namespace the virtual devices in that network namespace are also
EB> destroyed?

Hmm, that kinda seems like cheating, but maybe so.  I'll take a look :)

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5)
  2010-03-08 17:36             ` Dan Smith
  2010-03-08 17:53               ` Eric W. Biederman
@ 2010-03-08 18:36               ` Oren Laadan
  1 sibling, 0 replies; 25+ messages in thread
From: Oren Laadan @ 2010-03-08 18:36 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, den, netdev, davem, ebiederm, benjamin.thery



Dan Smith wrote:
> OL> I'm confused: in checkpoint_ns() inside the for_each_netdev() loop
> OL> you first test for dev->netdev_ops->ndo_checkpoint and then call
> OL> checkpoint_obj(... CKPT_OBJ_NETDEV) - which in turn will call
> OL> checkpoint_netdev(), which will again test for
> dev-> netdev_ops->ndo_checkpoint ...  am I reading it wrongly ?
> 
> In the case of veth, yes.  It goes something like this:
> 
> checkpoint_netns() {
>   foreach netdev in netns {
>     checkpoint_netdev {
>       if netdev is veth {
>         checkpoint_peer(); // Will call checkpoint_netdev again
>       }
>     }
>   }
> }
> 
> It shouldn't happen, but it seems like since we could potentially add
> another checkpoint_obj(mydev) somewhere other than in
> checkpoint_netdev(), it is reasonable to check that there is actually
> something to call before we call it.
> 
> Would you prefer a BUG()?

Ok.. so this is solved over IRC - the test was redundant :)

> 
> OL> How about this - to me it feels simpler:
> 
> OL> 	dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
> OL> 	if (IS_ERR(dev))
> OL> 		return dev;
> 
> OL> 	peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
> OL> 	if (!peer) {
> OL> 		ret = -EINVAL;
> OL> 		goto err_dev;
> OL> 	}
> OL> 	ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
> OL> 			      CKPT_OBJ_NETDEV);
> OL> 	if (ret < 0)
> OL> 		goto err_peer;
> 
> OL> 	dev_put(peer);
> 
> OL> 	dq.dev = dev;
> OL> 	dq.peer = peer;
> OL> 	ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
> OL> 			     netdev_noop, netdev_cleanup);
> OL> 	if (ret)
> OL> 		goto err_peer;
> 
> If you fail here you need to unregister_netdev() because the dev_put()
> that the objhash will not cause it to happen.  Unless we add something
> to allow you to remove your object from the hash, you can't prevent
> that final put, so you have to have it in the deferqueue for
> later.  You can't check the refcount in the objhash function because it
> will differ depending on the number of addresses and protocols the
> device has, and those don't get released until unregister_netdev()
> which will block if you call it before you've released all of your
> references.  If the objhash put function could examine ctx->errno,
> then it could drop its reference and then call unregister_netdev(),
> but that would involve changing all the drop functions.  What am I
> missing?
> 

Oh ..  I see - I missed that point that a ref is taken once it's
inserted to the objhash, so insert must be preceeded by the call
to deferqueue. Thanks for the explanation.

It still makes sense to have a single call to deferqueue that
relates to both the veth and the peer, instead of two separate
calls, no ?

Oren.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: C/R: Checkpoint and restore network namespaces and devices
  2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
                   ` (5 preceding siblings ...)
  2010-02-25 20:43 ` [PATCH 6/6] C/R: Add checkpoint support to macvlan driver Dan Smith
@ 2010-03-15  2:49 ` Oren Laadan
  6 siblings, 0 replies; 25+ messages in thread
From: Oren Laadan @ 2010-03-15  2:49 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, den, netdev, davem, ebiederm, benjamin.thery

Hi Dan,

With this series I get this error when NETNS isn't configured:
(ckpt-v20-rc1)

drivers/net/loopback.c: In function ‘loopback_checkpoint’:
drivers/net/loopback.c:165: error: implicit declaration of function
‘ckpt_netdev_base’
drivers/net/loopback.c:165: warning: assignment makes pointer from integer
without a cast
make[2]: *** [drivers/net/loopback.o] Error 1
make[1]: *** [drivers/net] Error 2
make: *** [drivers] Error 2

Oren.

Dan Smith wrote:
> This patch set adds checkpoint/restart support for network namespaces,
> as well as the network devices within.  Currently supports veth, macvlan,
> and loopback device types.
> 
> Includes the latest feedback and the new macvlan bits.  Rebased on Oren's
> official v19.  Changes detailed per patch.
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2010-03-15  2:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-25 20:43 C/R: Checkpoint and restore network namespaces and devices Dan Smith
2010-02-25 20:43 ` [PATCH 1/6] C/R: Add checkpoint and collect hooks to net_device_ops Dan Smith
2010-02-26 12:08   ` David Miller
2010-02-25 20:43 ` [PATCH 2/6] C/R: Basic support for network namespaces and devices (v5) Dan Smith
2010-02-26 12:08   ` David Miller
2010-02-26 14:56     ` Dan Smith
2010-03-06  3:55       ` Oren Laadan
     [not found]         ` <4B91D234.2020003-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-06 17:09           ` Dan Smith
2010-03-06  3:53   ` Oren Laadan
     [not found]     ` <4B91D1A3.9030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-06 17:08       ` Dan Smith
     [not found]         ` <87bpf1idic.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2010-03-06 22:21           ` Oren Laadan
2010-03-08 17:36             ` Dan Smith
2010-03-08 17:53               ` Eric W. Biederman
2010-03-08 18:07                 ` Dan Smith
2010-03-08 18:07                   ` Dan Smith
2010-03-08 18:36               ` Oren Laadan
2010-02-25 20:43 ` [PATCH 3/6] C/R: Add checkpoint support for veth devices (v2) Dan Smith
2010-02-26 12:09   ` David Miller
2010-02-25 20:43 ` [PATCH 4/6] C/R: Add loopback checkpoint support (v2) Dan Smith
2010-02-26 12:09   ` David Miller
2010-02-25 20:43 ` [PATCH 5/6] C/R: Add a checkpoint handler to the 'sit' device Dan Smith
2010-02-26 12:09   ` David Miller
2010-02-25 20:43 ` [PATCH 6/6] C/R: Add checkpoint support to macvlan driver Dan Smith
2010-02-26 12:09   ` David Miller
2010-03-15  2:49 ` C/R: Checkpoint and restore network namespaces and devices Oren Laadan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.