linux-kernel.vger.kernel.org archive mirror
* [PATCH 01/38] cred: #include init.h in cred.h
@ 2009-05-22  4:54 Alexey Dobriyan
  2009-05-22  4:54 ` [PATCH 02/38] utsns: extract create_uts_ns() Alexey Dobriyan
                   ` (37 more replies)
  0 siblings, 38 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:54 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

cred.h can't be included as the first header because it uses __init but
doesn't include init.h, which is enough to break compilation on at least
ia64.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/cred.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 3282ee4..4fa9996 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -13,6 +13,7 @@
 #define _LINUX_CRED_H
 
 #include <linux/capability.h>
+#include <linux/init.h>
 #include <linux/key.h>
 #include <asm/atomic.h>
 
-- 
1.5.6.5



* [PATCH 02/38] utsns: extract create_uts_ns()
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
@ 2009-05-22  4:54 ` Alexey Dobriyan
  2009-05-24 22:37   ` Serge E. Hallyn
  2009-05-22  4:54 ` [PATCH 03/38] ipcns 1/4: remove useless get/put while CLONE_NEWIPC Alexey Dobriyan
                   ` (36 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:54 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

create_uts_ns() will be used by C/R (checkpoint/restart) to create a fresh
uts_ns.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 kernel/utsname.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/utsname.c b/kernel/utsname.c
index 815237a..8a82b4b 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -15,6 +15,16 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 
+static struct uts_namespace *create_uts_ns(void)
+{
+	struct uts_namespace *uts_ns;
+
+	uts_ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);
+	if (uts_ns)
+		kref_init(&uts_ns->kref);
+	return uts_ns;
+}
+
 /*
  * Clone a new ns copying an original utsname, setting refcount to 1
  * @old_ns: namespace to clone
@@ -24,14 +34,13 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 {
 	struct uts_namespace *ns;
 
-	ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);
+	ns = create_uts_ns();
 	if (!ns)
 		return ERR_PTR(-ENOMEM);
 
 	down_read(&uts_sem);
 	memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
 	up_read(&uts_sem);
-	kref_init(&ns->kref);
 	return ns;
 }
 
-- 
1.5.6.5



* [PATCH 03/38] ipcns 1/4: remove useless get/put while CLONE_NEWIPC
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
  2009-05-22  4:54 ` [PATCH 02/38] utsns: extract create_uts_ns() Alexey Dobriyan
@ 2009-05-22  4:54 ` Alexey Dobriyan
  2009-05-22  9:00   ` Amerigo Wang
  2009-05-22  4:54 ` [PATCH 04/38] ipcns 2/4: extract create_ipc_ns() Alexey Dobriyan
                   ` (35 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:54 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

copy_ipcs() doesn't actually copy anything. If a new ipcns is created, it's
created from scratch, so in that case get/put on the old ipcns isn't needed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
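A user-space sketch of why the old and new shapes of copy_ipcs() leave the
same reference counts behind. Every name here (struct ns, ns_get(), ns_put(),
make_ns(), NEW_NS) is a stand-in invented for illustration, and the counter is
a plain int rather than a kref, so this shows the reasoning only, not the
kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct ipc_namespace: just a reference count. */
struct ns {
	int count;
};

/* Hypothetical flag standing in for CLONE_NEWIPC. */
#define NEW_NS 0x1UL

static struct ns fresh_ns = { .count = 1 };

/* Take a reference and hand the same pointer back. */
static struct ns *ns_get(struct ns *ns)
{
	if (ns)
		ns->count++;
	return ns;
}

/* Drop a reference (no freeing in this sketch). */
static void ns_put(struct ns *ns)
{
	if (ns)
		ns->count--;
}

/* Stand-in for "create a namespace from scratch"; never looks at the old one. */
static struct ns *make_ns(void)
{
	return &fresh_ns;
}

/* Old shape: pin the old namespace up front, unpin it after creating a new one. */
static struct ns *copy_ns_old(unsigned long flags, struct ns *old)
{
	struct ns *new_ns;

	ns_get(old);
	if (!(flags & NEW_NS))
		return old;
	new_ns = make_ns();
	ns_put(old);
	return new_ns;
}

/* New shape: only take a reference on the path that actually returns "old". */
static struct ns *copy_ns_new(unsigned long flags, struct ns *old)
{
	if (!(flags & NEW_NS))
		return ns_get(old);
	return make_ns();
}
```

With the flag clear, both variants return the old namespace with its count
raised by one. With the flag set, both return the fresh namespace and the old
count is unchanged, so the get/put pair in the old shape was a net no-op.
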
 ipc/namespace.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index 4a5e752..a56fc59 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -50,15 +50,11 @@ struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
 {
 	struct ipc_namespace *new_ns;
 
-	BUG_ON(!ns);
-	get_ipc_ns(ns);
-
 	if (!(flags & CLONE_NEWIPC))
-		return ns;
+		return get_ipc_ns(ns);
 
 	new_ns = clone_ipc_ns(ns);
 
-	put_ipc_ns(ns);
 	return new_ns;
 }
 
-- 
1.5.6.5



* [PATCH 04/38] ipcns 2/4: extract create_ipc_ns()
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
  2009-05-22  4:54 ` [PATCH 02/38] utsns: extract create_uts_ns() Alexey Dobriyan
  2009-05-22  4:54 ` [PATCH 03/38] ipcns 1/4: remove useless get/put while CLONE_NEWIPC Alexey Dobriyan
@ 2009-05-22  4:54 ` Alexey Dobriyan
  2009-05-22  8:59   ` Amerigo Wang
  2009-05-22  4:54 ` [PATCH 05/38] ipcns 3/4: make free_ipc_ns() static Alexey Dobriyan
                   ` (34 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:54 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

clone_ipc_ns() is misnamed: it doesn't clone anything and doesn't use the
passed parameter. Rename it to create_ipc_ns() and drop the parameter.

create_ipc_ns() will be used by C/R to create a fresh ipcns.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 ipc/namespace.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index a56fc59..231ee53 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,7 +14,7 @@
 
 #include "util.h"
 
-static struct ipc_namespace *clone_ipc_ns(struct ipc_namespace *old_ns)
+static struct ipc_namespace *create_ipc_ns(void)
 {
 	struct ipc_namespace *ns;
 	int err;
@@ -48,14 +48,9 @@ static struct ipc_namespace *clone_ipc_ns(struct ipc_namespace *old_ns)
 
 struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
 {
-	struct ipc_namespace *new_ns;
-
 	if (!(flags & CLONE_NEWIPC))
 		return get_ipc_ns(ns);
-
-	new_ns = clone_ipc_ns(ns);
-
-	return new_ns;
+	return create_ipc_ns();
 }
 
 /*
-- 
1.5.6.5



* [PATCH 05/38] ipcns 3/4: make free_ipc_ns() static
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (2 preceding siblings ...)
  2009-05-22  4:54 ` [PATCH 04/38] ipcns 2/4: extract create_ipc_ns() Alexey Dobriyan
@ 2009-05-22  4:54 ` Alexey Dobriyan
  2009-05-24 22:40   ` Serge E. Hallyn
  2009-05-22  4:55 ` [PATCH 06/38] ipcns 4/2: move free_ipcs() proto Alexey Dobriyan
                   ` (33 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:54 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/ipc_namespace.h |    1 -
 ipc/namespace.c               |   48 ++++++++++++++++++++--------------------
 2 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 3bf40e2..804e4e4 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -94,7 +94,6 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }
 #endif
 
 #if defined(CONFIG_IPC_NS)
-extern void free_ipc_ns(struct ipc_namespace *ns);
 extern struct ipc_namespace *copy_ipcs(unsigned long flags,
 				       struct ipc_namespace *ns);
 extern void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 231ee53..a1094ff 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -83,6 +83,30 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 	up_write(&ids->rw_mutex);
 }
 
+static void free_ipc_ns(struct ipc_namespace *ns)
+{
+	/*
+	 * Unregistering the hotplug notifier at the beginning guarantees
+	 * that the ipc namespace won't be freed while we are inside the
+	 * callback routine. Since the blocking_notifier_chain_XXX routines
+	 * hold a rw lock on the notifier list, unregister_ipcns_notifier()
+	 * won't take the rw lock before blocking_notifier_call_chain() has
+	 * released the rd lock.
+	 */
+	unregister_ipcns_notifier(ns);
+	sem_exit_ns(ns);
+	msg_exit_ns(ns);
+	shm_exit_ns(ns);
+	kfree(ns);
+	atomic_dec(&nr_ipc_ns);
+
+	/*
+	 * Do the ipcns removal notification after decrementing nr_ipc_ns in
+	 * order to have a correct value when recomputing msgmni.
+	 */
+	ipcns_notify(IPCNS_REMOVED);
+}
+
 /*
  * put_ipc_ns - drop a reference to an ipc namespace.
  * @ns: the namespace to put
@@ -108,27 +132,3 @@ void put_ipc_ns(struct ipc_namespace *ns)
 		free_ipc_ns(ns);
 	}
 }
-
-void free_ipc_ns(struct ipc_namespace *ns)
-{
-	/*
-	 * Unregistering the hotplug notifier at the beginning guarantees
-	 * that the ipc namespace won't be freed while we are inside the
-	 * callback routine. Since the blocking_notifier_chain_XXX routines
-	 * hold a rw lock on the notifier list, unregister_ipcns_notifier()
-	 * won't take the rw lock before blocking_notifier_call_chain() has
-	 * released the rd lock.
-	 */
-	unregister_ipcns_notifier(ns);
-	sem_exit_ns(ns);
-	msg_exit_ns(ns);
-	shm_exit_ns(ns);
-	kfree(ns);
-	atomic_dec(&nr_ipc_ns);
-
-	/*
-	 * Do the ipcns removal notification after decrementing nr_ipc_ns in
-	 * order to have a correct value when recomputing msgmni.
-	 */
-	ipcns_notify(IPCNS_REMOVED);
-}
-- 
1.5.6.5



* [PATCH 06/38] ipcns 4/2: move free_ipcs() proto
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (3 preceding siblings ...)
  2009-05-22  4:54 ` [PATCH 05/38] ipcns 3/4: make free_ipc_ns() static Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-24 22:49   ` Serge E. Hallyn
  2009-05-22  4:55 ` [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns Alexey Dobriyan
                   ` (32 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

The function is really private to ipc/, and moving its prototype to
ipc/util.h avoids a forward declaration of struct kern_ipc_perm.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/ipc_namespace.h |    4 ----
 ipc/util.h                    |    3 ++-
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 804e4e4..e408722 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -96,10 +96,6 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }
 #if defined(CONFIG_IPC_NS)
 extern struct ipc_namespace *copy_ipcs(unsigned long flags,
 				       struct ipc_namespace *ns);
-extern void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
-		      void (*free)(struct ipc_namespace *,
-				   struct kern_ipc_perm *));
-
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 {
 	if (ns)
diff --git a/ipc/util.h b/ipc/util.h
index 1187332..7a2861b 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -171,5 +171,6 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 			struct ipc_ops *ops, struct ipc_params *params);
-
+void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
+		void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 #endif
-- 
1.5.6.5



* [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (4 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 06/38] ipcns 4/2: move free_ipcs() proto Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  9:20   ` Amerigo Wang
                     ` (2 more replies)
  2009-05-22  4:55 ` [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns() Alexey Dobriyan
                   ` (31 subsequent siblings)
  37 siblings, 3 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

create_pid_namespace() creates everything, but the caller has to assign the
parent pidns by hand, which is unnatural. At the time of the call, the new
->level has to be taken from somewhere, and the parent pidns is already
available, so pass the parent and derive both from it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
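A user-space sketch of the calling convention this patch moves to: the
constructor takes the parent and fills in both the level and the parent link
itself. The names (struct pidns, pidns_get(), pidns_create()) are hypothetical
and the refcount is a plain int; this is not the kernel implementation.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct pid_namespace. */
struct pidns {
	int refs;
	unsigned int level;
	struct pidns *parent;
};

/* Take a reference and return the same pointer, so it can be assigned directly. */
static struct pidns *pidns_get(struct pidns *ns)
{
	ns->refs++;
	return ns;
}

/*
 * Create a child namespace. The level and the parent link both come from
 * the parent argument, so the caller has nothing left to patch up.
 */
static struct pidns *pidns_create(struct pidns *parent)
{
	struct pidns *ns;

	ns = calloc(1, sizeof(*ns));
	if (!ns)
		return NULL;
	ns->refs = 1;
	ns->level = parent->level + 1;
	ns->parent = pidns_get(parent);
	return ns;
}
```

A caller then only writes child = pidns_create(&root) and checks the result;
it no longer needs to know that a child holds a reference on its parent.
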
 kernel/pid_namespace.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 2d1001b..495d5de 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -67,9 +67,10 @@ err_alloc:
 	return NULL;
 }
 
-static struct pid_namespace *create_pid_namespace(unsigned int level)
+static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_pid_ns)
 {
 	struct pid_namespace *ns;
+	unsigned int level = parent_pid_ns->level + 1;
 	int i;
 
 	ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
@@ -86,6 +87,7 @@ static struct pid_namespace *create_pid_namespace(unsigned int level)
 
 	kref_init(&ns->kref);
 	ns->level = level;
+	ns->parent = get_pid_ns(parent_pid_ns);
 
 	set_bit(0, ns->pidmap[0].page);
 	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
@@ -125,9 +127,7 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old
 	if (flags & CLONE_THREAD)
 		goto out_put;
 
-	new_ns = create_pid_namespace(old_ns->level + 1);
-	if (!IS_ERR(new_ns))
-		new_ns->parent = get_pid_ns(old_ns);
+	new_ns = create_pid_namespace(old_ns);
 
 out_put:
 	put_pid_ns(old_ns);
-- 
1.5.6.5



* [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns()
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (5 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  9:14   ` Amerigo Wang
                     ` (2 more replies)
  2009-05-22  4:55 ` [PATCH 09/38] netns 1/2: don't get/put old netns on CLONE_NEWNET Alexey Dobriyan
                   ` (30 subsequent siblings)
  37 siblings, 3 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

copy_pid_ns() is a perfect example of a case where unwinding leads to more
code and makes it less clear. Watch the diffstat.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 kernel/pid_namespace.c |   18 +++---------------
 1 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 495d5de..821722a 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -116,23 +116,11 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
 
 struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
 {
-	struct pid_namespace *new_ns;
-
-	BUG_ON(!old_ns);
-	new_ns = get_pid_ns(old_ns);
 	if (!(flags & CLONE_NEWPID))
-		goto out;
-
-	new_ns = ERR_PTR(-EINVAL);
+		return get_pid_ns(old_ns);
 	if (flags & CLONE_THREAD)
-		goto out_put;
-
-	new_ns = create_pid_namespace(old_ns);
-
-out_put:
-	put_pid_ns(old_ns);
-out:
-	return new_ns;
+		return ERR_PTR(-EINVAL);
+	return create_pid_namespace(old_ns);
 }
 
 void free_pid_ns(struct kref *kref)
-- 
1.5.6.5



* [PATCH 09/38] netns 1/2: don't get/put old netns on CLONE_NEWNET
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (6 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns() Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  6:30   ` David Miller
  2009-05-22  4:55 ` [PATCH 10/38] netns 2/2: extract net_create() Alexey Dobriyan
                   ` (29 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

copy_net_ns() doesn't copy anything; it creates a fresh netns, so get/put of
the old netns isn't needed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 net/core/net_namespace.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index e3bebd3..4488010 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -120,10 +120,8 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	struct net *new_net = NULL;
 	int err;
 
-	get_net(old_net);
-
 	if (!(flags & CLONE_NEWNET))
-		return old_net;
+		return get_net(old_net);
 
 	err = -ENOMEM;
 	new_net = net_alloc();
@@ -142,7 +140,6 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	if (err)
 		goto out_free;
 out:
-	put_net(old_net);
 	return new_net;
 
 out_free:
-- 
1.5.6.5



* [PATCH 10/38] netns 2/2: extract net_create()
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (7 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 09/38] netns 1/2: don't get/put old netns on CLONE_NEWNET Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  6:30   ` David Miller
  2009-05-22  4:55 ` [PATCH 11/38] nsproxy: extract create_nsproxy() Alexey Dobriyan
                   ` (28 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

net_create() will be used by C/R to create a fresh netns on restart.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
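A user-space sketch of the error-pointer convention that lets net_create()
return either a valid pointer or an errno through a single return value. The
helpers err_ptr(), ptr_err() and is_err() are local imitations of the kernel's
ERR_PTR(), PTR_ERR() and IS_ERR(); struct net_sketch, setup_sketch() and its
"fail" switch are invented for illustration, and locking and list insertion
are left out.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Errors are encoded in the top 4095 values of the address space. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct net. */
struct net_sketch {
	int ready;
};

/* Hypothetical setup step; "fail" forces the error path. */
static int setup_sketch(struct net_sketch *net, int fail)
{
	if (fail)
		return -EINVAL;
	net->ready = 1;
	return 0;
}

/* Allocate, set up, and free on failure: no goto-based unwinding needed. */
static struct net_sketch *net_create_sketch(int fail)
{
	struct net_sketch *net;
	int rv;

	net = malloc(sizeof(*net));
	if (!net)
		return err_ptr(-ENOMEM);
	rv = setup_sketch(net, fail);
	if (rv < 0) {
		free(net);
		return err_ptr(rv);
	}
	return net;
}
```

The caller tests the result with is_err() and recovers the errno with
ptr_err(), which is why copy_net_ns() can shrink to a flag test plus one
call.
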
 net/core/net_namespace.c |   44 ++++++++++++++++++++------------------------
 1 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 4488010..6b3edc9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -115,38 +115,34 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
-struct net *copy_net_ns(unsigned long flags, struct net *old_net)
+static struct net *net_create(void)
 {
-	struct net *new_net = NULL;
-	int err;
-
-	if (!(flags & CLONE_NEWNET))
-		return get_net(old_net);
-
-	err = -ENOMEM;
-	new_net = net_alloc();
-	if (!new_net)
-		goto out_err;
+	struct net *net;
+	int rv;
 
+	net = net_alloc();
+	if (!net)
+		return ERR_PTR(-ENOMEM);
 	mutex_lock(&net_mutex);
-	err = setup_net(new_net);
-	if (!err) {
+	rv = setup_net(net);
+	if (rv == 0) {
 		rtnl_lock();
-		list_add_tail(&new_net->list, &net_namespace_list);
+		list_add_tail(&net->list, &net_namespace_list);
 		rtnl_unlock();
 	}
 	mutex_unlock(&net_mutex);
+	if (rv < 0) {
+		net_free(net);
+		return ERR_PTR(rv);
+	}
+	return net;
+}
 
-	if (err)
-		goto out_free;
-out:
-	return new_net;
-
-out_free:
-	net_free(new_net);
-out_err:
-	new_net = ERR_PTR(err);
-	goto out;
+struct net *copy_net_ns(unsigned long flags, struct net *old_net)
+{
+	if (!(flags & CLONE_NEWNET))
+		return get_net(old_net);
+	return net_create();
 }
 
 static void cleanup_net(struct work_struct *work)
-- 
1.5.6.5



* [PATCH 11/38] nsproxy: extract create_nsproxy()
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (8 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 10/38] netns 2/2: extract net_create() Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 12/38] i386: ifdef out struct thread_struct::fs Alexey Dobriyan
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

clone_nsproxy() does useless copying of the old nsproxy: every pointer will
be rewritten to point at either a new ns or the old ns. Remove the copying and
rename clone_nsproxy() to create_nsproxy(), which will be used by C/R code to
create a fresh nsproxy on restart.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 kernel/nsproxy.c |   19 +++++++------------
 1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 63598dc..09b4ff9 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -26,19 +26,14 @@ static struct kmem_cache *nsproxy_cachep;
 
 struct nsproxy init_nsproxy = INIT_NSPROXY(init_nsproxy);
 
-/*
- * creates a copy of "orig" with refcount 1.
- */
-static inline struct nsproxy *clone_nsproxy(struct nsproxy *orig)
+static inline struct nsproxy *create_nsproxy(void)
 {
-	struct nsproxy *ns;
+	struct nsproxy *nsproxy;
 
-	ns = kmem_cache_alloc(nsproxy_cachep, GFP_KERNEL);
-	if (ns) {
-		memcpy(ns, orig, sizeof(struct nsproxy));
-		atomic_set(&ns->count, 1);
-	}
-	return ns;
+	nsproxy = kmem_cache_alloc(nsproxy_cachep, GFP_KERNEL);
+	if (nsproxy)
+		atomic_set(&nsproxy->count, 1);
+	return nsproxy;
 }
 
 /*
@@ -52,7 +47,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	struct nsproxy *new_nsp;
 	int err;
 
-	new_nsp = clone_nsproxy(tsk->nsproxy);
+	new_nsp = create_nsproxy();
 	if (!new_nsp)
 		return ERR_PTR(-ENOMEM);
 
-- 
1.5.6.5



* [PATCH 12/38] i386: ifdef out struct thread_struct::fs
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (9 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 11/38] nsproxy: extract create_nsproxy() Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 13/38] x86_64: ifdef out struct thread_struct::ip Alexey Dobriyan
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

After commit 464d1a78fbf8cf6c7fd970e7b3e2db50a320ce28 aka
"[PATCH] i386: Convert i386 PDA code to use %fs"
the %fs value saved during context switch moved from thread_struct to pt_regs,
and the value in thread_struct became unused.

I don't understand why the .fs (OK, .gs first) initializer was added; the
kernel clearly stopped touching it after that patch.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/processor.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c2cceae..a6732ff 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -428,7 +428,9 @@ struct thread_struct {
 	unsigned short		gsindex;
 #endif
 	unsigned long		ip;
+#ifdef CONFIG_X86_64
 	unsigned long		fs;
+#endif
 	unsigned long		gs;
 	/* Hardware debugging registers: */
 	unsigned long		debugreg0;
@@ -874,7 +876,6 @@ static inline void spin_lock_prefetch(const void *x)
 	.vm86_info		= NULL,					  \
 	.sysenter_cs		= __KERNEL_CS,				  \
 	.io_bitmap_ptr		= NULL,					  \
-	.fs			= __KERNEL_PERCPU,			  \
 }
 
 /*
-- 
1.5.6.5



* [PATCH 13/38] x86_64: ifdef out struct thread_struct::ip
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (10 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 12/38] i386: ifdef out struct thread_struct::fs Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 14/38] Remove struct mm_struct::exe_file et al Alexey Dobriyan
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

struct thread_struct::ip isn't used on x86_64; struct pt_regs::ip is used
instead.

kgdb should always be reading 0, but I can't check it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/processor.h |    2 ++
 arch/x86/kernel/kgdb.c           |    2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a6732ff..a9ba743 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -427,7 +427,9 @@ struct thread_struct {
 	unsigned short		fsindex;
 	unsigned short		gsindex;
 #endif
+#ifdef CONFIG_X86_32
 	unsigned long		ip;
+#endif
 #ifdef CONFIG_X86_64
 	unsigned long		fs;
 #endif
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index b1f4dff..8d82a77 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -142,7 +142,7 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
 	gdb_regs32[GDB_PS]	= *(unsigned long *)(p->thread.sp + 8);
 	gdb_regs32[GDB_CS]	= __KERNEL_CS;
 	gdb_regs32[GDB_SS]	= __KERNEL_DS;
-	gdb_regs[GDB_PC]	= p->thread.ip;
+	gdb_regs[GDB_PC]	= 0;
 	gdb_regs[GDB_R8]	= 0;
 	gdb_regs[GDB_R9]	= 0;
 	gdb_regs[GDB_R10]	= 0;
-- 
1.5.6.5



* [PATCH 14/38] Remove struct mm_struct::exe_file et al
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (11 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 13/38] x86_64: ifdef out struct thread_struct::ip Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 15/38] dcache: extract and use d_unlinked() Alexey Dobriyan
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Commit 925d1c401fa6cfd0df5d2e37da8981494ccdec07 aka "procfs task exe symlink"
introduced struct mm_struct::exe_file and struct mm_struct::num_exe_file_vmas.

The rationale is weak: unifying the MMU and no-MMU versions of the /proc/*/exe
code. For this, a) struct mm_struct becomes bigger, b) mmap/munmap/exit become
slower, and c) the patch in fact adds more code than it removes.

After commit 8feae13110d60cc6287afabc2887366b0eb226c2 aka
"NOMMU: Make VMAs per MM as for MMU-mode linux" no-MMU kernels also
maintain a list of VMAs in ->mmap, so we can switch back to the MMU version
of /proc/*/exe.

This also helps C/R: there is no need to save and restore ->exe_file or to
count additional references when checking whether a struct file has leaked
outside the group of checkpointed resources.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
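A user-space sketch of the lookup that the restored proc_exe_link() performs:
walk the per-mm VMA list and take the file of the first mapping that is
flagged executable and is file-backed. The types and the flag (struct
vma_sketch, struct file_sketch, VM_EXEC_SKETCH) are invented stand-ins; the
mmap_sem locking and the path_get() reference taken by the real code are
deliberately left out.

```c
#include <stddef.h>

/* Hypothetical flag standing in for VM_EXECUTABLE. */
#define VM_EXEC_SKETCH 0x1UL

/* Hypothetical stand-in for struct file: only the path matters here. */
struct file_sketch {
	const char *path;
};

/* Hypothetical stand-in for struct vm_area_struct on a singly linked list. */
struct vma_sketch {
	unsigned long flags;
	struct file_sketch *file;
	struct vma_sketch *next;
};

/*
 * Return the path behind the first executable, file-backed mapping,
 * or NULL if the list has none (the kernel code returns -ENOENT there).
 */
static const char *find_exe_path(const struct vma_sketch *head)
{
	const struct vma_sketch *vma;

	for (vma = head; vma; vma = vma->next) {
		if ((vma->flags & VM_EXEC_SKETCH) && vma->file)
			return vma->file->path;
	}
	return NULL;
}
```

The trade-off is a linear walk on each /proc/<pid>/exe lookup in exchange for
dropping two fields from struct mm_struct and the bookkeeping on every
mmap/munmap of an executable mapping.
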
 fs/exec.c                |    2 -
 fs/proc/base.c           |  105 +++++++++++++---------------------------------
 include/linux/mm.h       |   12 -----
 include/linux/mm_types.h |    6 ---
 include/linux/proc_fs.h  |   20 ---------
 kernel/fork.c            |    3 -
 mm/mmap.c                |   22 ++--------
 mm/nommu.c               |   16 +------
 8 files changed, 36 insertions(+), 150 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 895823d..04fbe3e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -938,8 +938,6 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	set_mm_exe_file(bprm->mm, bprm->file);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
diff --git a/fs/proc/base.c b/fs/proc/base.c
index fb45615..4284cc6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -201,6 +201,36 @@ static int proc_root_link(struct inode *inode, struct path *path)
 	return result;
 }
 
+static int proc_exe_link(struct inode *inode, struct path *path)
+{
+	struct task_struct *tsk;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+
+	tsk = get_proc_task(inode);
+	if (!tsk)
+		return -ENOENT;
+	mm = get_task_mm(tsk);
+	put_task_struct(tsk);
+	if (!mm)
+		return -ENOENT;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) {
+			*path = vma->vm_file->f_path;
+			path_get(&vma->vm_file->f_path);
+
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			return 0;
+		}
+	}
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return -ENOENT;
+}
+
 /*
  * Return zero if current may access user memory in @task, -error if not.
  */
@@ -1251,81 +1281,6 @@ static const struct file_operations proc_pid_sched_operations = {
 
 #endif
 
-/*
- * We added or removed a vma mapping the executable. The vmas are only mapped
- * during exec and are not mapped with the mmap system call.
- * Callers must hold down_write() on the mm's mmap_sem for these
- */
-void added_exe_file_vma(struct mm_struct *mm)
-{
-	mm->num_exe_file_vmas++;
-}
-
-void removed_exe_file_vma(struct mm_struct *mm)
-{
-	mm->num_exe_file_vmas--;
-	if ((mm->num_exe_file_vmas == 0) && mm->exe_file){
-		fput(mm->exe_file);
-		mm->exe_file = NULL;
-	}
-
-}
-
-void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
-{
-	if (new_exe_file)
-		get_file(new_exe_file);
-	if (mm->exe_file)
-		fput(mm->exe_file);
-	mm->exe_file = new_exe_file;
-	mm->num_exe_file_vmas = 0;
-}
-
-struct file *get_mm_exe_file(struct mm_struct *mm)
-{
-	struct file *exe_file;
-
-	/* We need mmap_sem to protect against races with removal of
-	 * VM_EXECUTABLE vmas */
-	down_read(&mm->mmap_sem);
-	exe_file = mm->exe_file;
-	if (exe_file)
-		get_file(exe_file);
-	up_read(&mm->mmap_sem);
-	return exe_file;
-}
-
-void dup_mm_exe_file(struct mm_struct *oldmm, struct mm_struct *newmm)
-{
-	/* It's safe to write the exe_file pointer without exe_file_lock because
-	 * this is called during fork when the task is not yet in /proc */
-	newmm->exe_file = get_mm_exe_file(oldmm);
-}
-
-static int proc_exe_link(struct inode *inode, struct path *exe_path)
-{
-	struct task_struct *task;
-	struct mm_struct *mm;
-	struct file *exe_file;
-
-	task = get_proc_task(inode);
-	if (!task)
-		return -ENOENT;
-	mm = get_task_mm(task);
-	put_task_struct(task);
-	if (!mm)
-		return -ENOENT;
-	exe_file = get_mm_exe_file(mm);
-	mmput(mm);
-	if (exe_file) {
-		*exe_path = exe_file->f_path;
-		path_get(&exe_file->f_path);
-		fput(exe_file);
-		return 0;
-	} else
-		return -ENOENT;
-}
-
 static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..b3b61a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1121,18 +1121,6 @@ extern void exit_mmap(struct mm_struct *);
 extern int mm_take_all_locks(struct mm_struct *mm);
 extern void mm_drop_all_locks(struct mm_struct *mm);
 
-#ifdef CONFIG_PROC_FS
-/* From fs/proc/base.c. callers must _not_ hold the mm's exe_file_lock */
-extern void added_exe_file_vma(struct mm_struct *mm);
-extern void removed_exe_file_vma(struct mm_struct *mm);
-#else
-static inline void added_exe_file_vma(struct mm_struct *mm)
-{}
-
-static inline void removed_exe_file_vma(struct mm_struct *mm)
-{}
-#endif /* CONFIG_PROC_FS */
-
 extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
 extern int install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0e80e26..90786ea 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -269,12 +269,6 @@ struct mm_struct {
 	 */
 	struct task_struct *owner;
 #endif
-
-#ifdef CONFIG_PROC_FS
-	/* store ref to file /proc/<pid>/exe symlink points to */
-	struct file *exe_file;
-	unsigned long num_exe_file_vmas;
-#endif
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index fbfa3d4..64ed076 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -188,12 +188,6 @@ extern void proc_net_remove(struct net *net, const char *name);
 extern struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
 	struct proc_dir_entry *parent);
 
-/* While the {get|set|dup}_mm_exe_file functions are for mm_structs, they are
- * only needed to implement /proc/<pid>|self/exe so we define them here. */
-extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
-extern struct file *get_mm_exe_file(struct mm_struct *mm);
-extern void dup_mm_exe_file(struct mm_struct *oldmm, struct mm_struct *newmm);
-
 #else
 
 #define proc_net_fops_create(net, name, mode, fops)  ({ (void)(mode), NULL; })
@@ -240,20 +234,6 @@ static inline int pid_ns_prepare_proc(struct pid_namespace *ns)
 static inline void pid_ns_release_proc(struct pid_namespace *ns)
 {
 }
-
-static inline void set_mm_exe_file(struct mm_struct *mm,
-				   struct file *new_exe_file)
-{}
-
-static inline struct file *get_mm_exe_file(struct mm_struct *mm)
-{
-	return NULL;
-}
-
-static inline void dup_mm_exe_file(struct mm_struct *oldmm,
-	       			   struct mm_struct *newmm)
-{}
-
 #endif /* CONFIG_PROC_FS */
 
 #if !defined(CONFIG_PROC_KCORE)
diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..ed377ad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -488,7 +488,6 @@ void mmput(struct mm_struct *mm)
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
 		exit_mmap(mm);
-		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
@@ -611,8 +610,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 	if (init_new_context(tsk, mm))
 		goto fail_nocontext;
 
-	dup_mm_exe_file(oldmm, mm);
-
 	err = dup_mmap(mm, oldmm);
 	if (err)
 		goto free_pt;
diff --git a/mm/mmap.c b/mm/mmap.c
index 6b7b1a9..76faabc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -232,11 +232,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file) {
+	if (vma->vm_file)
 		fput(vma->vm_file);
-		if (vma->vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(vma->vm_mm);
-	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
@@ -633,11 +630,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 		spin_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			fput(file);
-			if (next->vm_flags & VM_EXECUTABLE)
-				removed_exe_file_vma(mm);
-		}
 		mm->map_count--;
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
@@ -1192,8 +1186,6 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
-		if (vm_flags & VM_EXECUTABLE)
-			added_exe_file_vma(mm);
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
@@ -1849,11 +1841,8 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (new->vm_file) {
+	if (new->vm_file)
 		get_file(new->vm_file);
-		if (vma->vm_flags & VM_EXECUTABLE)
-			added_exe_file_vma(mm);
-	}
 
 	if (new->vm_ops && new->vm_ops->open)
 		new->vm_ops->open(new);
@@ -2200,11 +2189,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			new_vma->vm_start = addr;
 			new_vma->vm_end = addr + len;
 			new_vma->vm_pgoff = pgoff;
-			if (new_vma->vm_file) {
+			if (new_vma->vm_file)
 				get_file(new_vma->vm_file);
-				if (vma->vm_flags & VM_EXECUTABLE)
-					added_exe_file_vma(mm);
-			}
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
 			vma_link(mm, new_vma, prev, rb_link, rb_parent);
diff --git a/mm/nommu.c b/mm/nommu.c
index b571ef7..352aac5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -721,11 +721,8 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
 	kenter("%p", vma);
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file) {
+	if (vma->vm_file)
 		fput(vma->vm_file);
-		if (vma->vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(mm);
-	}
 	put_nommu_region(vma->vm_region);
 	kmem_cache_free(vm_area_cachep, vma);
 }
@@ -1218,10 +1215,6 @@ unsigned long do_mmap_pgoff(struct file *file,
 		get_file(file);
 		vma->vm_file = file;
 		get_file(file);
-		if (vm_flags & VM_EXECUTABLE) {
-			added_exe_file_vma(current->mm);
-			vma->vm_mm = current->mm;
-		}
 	}
 
 	down_write(&nommu_region_sem);
@@ -1359,11 +1352,8 @@ share:
 error_put_region:
 	__put_nommu_region(region);
 	if (vma) {
-		if (vma->vm_file) {
+		if (vma->vm_file)
 			fput(vma->vm_file);
-			if (vma->vm_flags & VM_EXECUTABLE)
-				removed_exe_file_vma(vma->vm_mm);
-		}
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 	kleave(" = %d [pr]", ret);
@@ -1375,8 +1365,6 @@ error:
 	fput(region->vm_file);
 	kmem_cache_free(vm_region_jar, region);
 	fput(vma->vm_file);
-	if (vma->vm_flags & VM_EXECUTABLE)
-		removed_exe_file_vma(vma->vm_mm);
 	kmem_cache_free(vm_area_cachep, vma);
 	kleave(" = %d", ret);
 	return ret;
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 15/38] dcache: extract and use d_unlinked()
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (12 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 14/38] Remove struct mm_struct::exe_file et al Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 16/38] x86: ptrace debugreg checks rewrite Alexey Dobriyan
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

d_unlinked() will be used in the medium term to refuse checkpointing when an
opened but unlinked file is detected, and in the long term to detect that
situation and special-case it.
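
For illustration only, here is a user-space sketch of the predicate being
extracted. It assumes a toy dentry model in which a root dentry is its own
parent (as IS_ROOT() tests in the kernel); struct toy_dentry, TOY_UNHASHED and
the toy_* helpers are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct dentry; only the fields the predicate needs. */
struct toy_dentry {
	struct toy_dentry *parent;	/* a root dentry is its own parent */
	unsigned int flags;
};

#define TOY_UNHASHED 0x1u	/* hypothetical stand-in for DCACHE_UNHASHED */

static bool toy_is_root(const struct toy_dentry *d)
{
	return d->parent == d;
}

static bool toy_d_unhashed(const struct toy_dentry *d)
{
	return (d->flags & TOY_UNHASHED) != 0;
}

/* Same shape as d_unlinked() in the patch: unhashed and not a root. */
static bool toy_d_unlinked(const struct toy_dentry *d)
{
	assert(d != NULL);
	return toy_d_unhashed(d) && !toy_is_root(d);
}
```

The callers converted below use either the predicate or its negation, which
is why both "!IS_ROOT() && d_unhashed()" and "IS_ROOT() || !d_unhashed()"
collapse into one helper.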

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/dcache.c            |    7 +++----
 fs/namespace.c         |    8 ++++----
 include/linux/dcache.h |    5 +++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 75659a6..9e5cd3c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1910,7 +1910,7 @@ char *__d_path(const struct path *path, struct path *root,
 
 	spin_lock(&vfsmount_lock);
 	prepend(&end, &buflen, "\0", 1);
-	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
+	if (d_unlinked(dentry) &&
 		(prepend(&end, &buflen, " (deleted)", 10) != 0))
 			goto Elong;
 
@@ -2035,7 +2035,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 
 	spin_lock(&dcache_lock);
 	prepend(&end, &buflen, "\0", 1);
-	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
+	if (d_unlinked(dentry) &&
 		(prepend(&end, &buflen, "//deleted", 9) != 0))
 			goto Elong;
 	if (buflen < 1)
@@ -2097,9 +2097,8 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 	read_unlock(&current->fs->lock);
 
 	error = -ENOENT;
-	/* Has the current directory has been unlinked? */
 	spin_lock(&dcache_lock);
-	if (IS_ROOT(pwd.dentry) || !d_unhashed(pwd.dentry)) {
+	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		struct path tmp = root;
 		char * cwd;
diff --git a/fs/namespace.c b/fs/namespace.c
index 134d494..be33bfc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1430,7 +1430,7 @@ static int graft_tree(struct vfsmount *mnt, struct path *path)
 		goto out_unlock;
 
 	err = -ENOENT;
-	if (IS_ROOT(path->dentry) || !d_unhashed(path->dentry))
+	if (!d_unlinked(path->dentry))
 		err = attach_recursive_mnt(mnt, path, NULL);
 out_unlock:
 	mutex_unlock(&path->dentry->d_inode->i_mutex);
@@ -1612,7 +1612,7 @@ static int do_move_mount(struct path *path, char *old_name)
 	if (IS_DEADDIR(path->dentry->d_inode))
 		goto out1;
 
-	if (!IS_ROOT(path->dentry) && d_unhashed(path->dentry))
+	if (d_unlinked(path->dentry))
 		goto out1;
 
 	err = -EINVAL;
@@ -2175,9 +2175,9 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	error = -ENOENT;
 	if (IS_DEADDIR(new.dentry->d_inode))
 		goto out2;
-	if (d_unhashed(new.dentry) && !IS_ROOT(new.dentry))
+	if (d_unlinked(new.dentry))
 		goto out2;
-	if (d_unhashed(old.dentry) && !IS_ROOT(old.dentry))
+	if (d_unlinked(old.dentry))
 		goto out2;
 	error = -EBUSY;
 	if (new.mnt == root.mnt ||
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1515636..f8488bd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -351,6 +351,11 @@ static inline int d_unhashed(struct dentry *dentry)
 	return (dentry->d_flags & DCACHE_UNHASHED);
 }
 
+static inline int d_unlinked(struct dentry *dentry)
+{
+	return d_unhashed(dentry) && !IS_ROOT(dentry);
+}
+
 static inline struct dentry *dget_parent(struct dentry *dentry)
 {
 	struct dentry *ret;
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 16/38] x86: ptrace debugreg checks rewrite
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (13 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 15/38] dcache: extract and use d_unlinked() Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-26 23:25   ` Andrew Morton
  2009-05-22  4:55 ` [PATCH 17/38] groups: move code to kernel/groups.c Alexey Dobriyan
                   ` (22 subsequent siblings)
  37 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

This is a mess.

Pre-unified x86 code checked that the breakpoint address was
"< TASK_SIZE - 3 (or 7)". This was fine from a security POV,
but banned valid breakpoint usage when the address is close to TASK_SIZE.
E.g. a 1-byte breakpoint at TASK_SIZE - 1 should be allowed, but it wasn't.
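
A small sketch of the difference, assuming a made-up limit TOY_TASK_SIZE (the
real TASK_SIZE is per-architecture) and hypothetical helper names. The old
style ignores the breakpoint length; a length-aware check only requires that
the watched range stay below the limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limit for illustration; the real TASK_SIZE is per-arch. */
#define TOY_TASK_SIZE 0xC0000000UL

/* Old style: reject anything within 3 bytes of the limit, whatever the length. */
static bool old_check_ok(unsigned long addr)
{
	return addr < TOY_TASK_SIZE - 3;
}

/* Length-aware style: [addr, addr + len) must stay below the limit. */
static bool new_check_ok(unsigned long addr, unsigned long len)
{
	assert(len == 1 || len == 2 || len == 4 || len == 8);
	if (addr >= TOY_TASK_SIZE)
		return false;
	/* No overflow: addr is below the limit and len is at most 8. */
	return addr + len <= TOY_TASK_SIZE;
}
```

With these definitions a 1-byte breakpoint at TOY_TASK_SIZE - 1 is rejected by
old_check_ok() but accepted by new_check_ok(), while a 4-byte breakpoint at
the same address is still rejected.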

Then came commit 84929801e14d968caeb84795bfbb88f04283fbd9
("[PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes")
which for some reason touched ptrace as well and made the effective
TASK_SIZE of a 32-bit process depend on IA32_PAGE_OFFSET,
which is not a constant!:

	#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? 0xc0000000 : 0xFFFFe000)
				   ^^^^^^^
The maximum breakpoint address became dependent on the personality of the ptracer.
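
To make the trap concrete, here is a user-space sketch under stated
assumptions: struct toy_task, TOY_ADDR_LIMIT_3GB and the toy_* functions are
hypothetical, and only the two constants come from the macro quoted above.
The limit is computed from whoever is running the check, not from the task
whose debug registers are being set.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task with just a personality word; not the kernel's task_struct. */
struct toy_task {
	unsigned long personality;
};

#define TOY_ADDR_LIMIT_3GB 0x1UL	/* hypothetical flag value */

/* Mirrors the shape of IA32_PAGE_OFFSET: evaluated against the caller. */
static unsigned long toy_ia32_page_offset(const struct toy_task *current_task)
{
	assert(current_task != NULL);
	return (current_task->personality & TOY_ADDR_LIMIT_3GB) ?
		0xc0000000UL : 0xFFFFe000UL;
}

/* Old-style limit: the tracee is ignored, the tracer decides. */
static unsigned long toy_debugreg_limit(const struct toy_task *tracer,
					const struct toy_task *tracee)
{
	(void)tracee;
	return toy_ia32_page_offset(tracer) - 3;
}
```

Two tracers with different personalities get different limits for the same
tracee, which is the behaviour the patch removes.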

The commit also relaxed the danger zone for 32-bit processes from 8 bytes to 4,
not taking into account that 8-byte wide breakpoints are possible even
for 32-bit processes. This was fine, however, because 64-bit kernel
addresses are too far from 32-bit ones.

Then came utrace with commit 2047b08be67b70875d8765fc81d34ce28041bec3
("x86: x86 ptrace getreg/putreg merge") which copy-pasted and ifdeffed the
32-bit part of TASK_SIZE_OF(), leaving the 8-byte issue as-is.

So, what does the patch fix?
1) Too strict logic near the TASK_SIZE boundary -- as long as we don't cross
   TASK_SIZE_MAX, we're fine.
2) Too smart logic about breakpoints near a non-existent kernel
   boundary -- we should only protect against breakpoints reaching past
   TASK_SIZE_MAX, the rest is none of the kernel's business. This fixes
   the IA32_PAGE_OFFSET beartrap as well.

As a bonus, remove the uberhack and the big comment determining DR7 validity,
and rewrite it as a clear algorithm where it's obvious what's going on.
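
As a rough user-space sketch of that decoding (helper names are hypothetical;
the field layout and the three rejection rules follow
ptrace_check_debugreg() in the diff below): each breakpoint slot n owns four
bits of DR7 starting at bit 16 + 4n, two for the type and two for the length.

```c
#include <assert.h>

/* Type field R/Wn: 00 execute, 01 write, 10 undefined, 11 read/write. */
static unsigned int dr7_rw(unsigned long dr7, unsigned int n)
{
	assert(n < 4);
	return (dr7 >> (16 + n * 4)) & 0x3;
}

/* Length field LENn decoded to bytes: 00 -> 1, 01 -> 2, 10 -> 8, 11 -> 4. */
static unsigned int dr7_len_bytes(unsigned long dr7, unsigned int n)
{
	static const unsigned int bytes[4] = { 1, 2, 8, 4 };

	assert(n < 4);
	return bytes[(dr7 >> (16 + n * 4 + 2)) & 0x3];
}

/* Return 1 if slot n has an acceptable type/length pair, 0 otherwise. */
static int dr7_slot_valid(unsigned long dr7, unsigned int n, int is_32bit)
{
	unsigned int rw = dr7_rw(dr7, n);
	unsigned int len = dr7_len_bytes(dr7, n);

	if (rw == 0x2)			/* undefined type */
		return 0;
	if (rw == 0x0 && len != 1)	/* execute breakpoints are 1 byte */
		return 0;
	if (is_32bit && len == 8)	/* 8-byte length only for 64-bit */
		return 0;
	return 1;
}
```

For example, a slot nibble of 0x9 (write, 8 bytes) passes for a 64-bit task
and fails for a 32-bit one, while 0x4 (execute, 2 bytes) fails for both.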

Make the DR validity checker suitable for C/R. On restart, DR registers
must be checked the same way they are checked on PTRACE_POKEUSR.

Question 1: TIF_DEBUG can be set even if none of the breakpoints is turned on;
should this be optimized?

Question 2: Breakpoints are allowed to be globally enabled; is this a
security risk?

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/kernel/ptrace.c |  175 +++++++++++++++++++++++++++-------------------
 1 files changed, 103 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 23b7c8f..afd2b7d 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -135,11 +135,6 @@ static int set_segment_reg(struct task_struct *task,
 	return 0;
 }
 
-static unsigned long debugreg_addr_limit(struct task_struct *task)
-{
-	return TASK_SIZE - 3;
-}
-
 #else  /* CONFIG_X86_64 */
 
 #define FLAG_MASK		(FLAG_MASK_32 | X86_EFLAGS_NT)
@@ -263,16 +258,6 @@ static int set_segment_reg(struct task_struct *task,
 
 	return 0;
 }
-
-static unsigned long debugreg_addr_limit(struct task_struct *task)
-{
-#ifdef CONFIG_IA32_EMULATION
-	if (test_tsk_thread_flag(task, TIF_IA32))
-		return IA32_PAGE_OFFSET - 3;
-#endif
-	return TASK_SIZE_MAX - 7;
-}
-
 #endif	/* CONFIG_X86_32 */
 
 static unsigned long get_flags(struct task_struct *task)
@@ -480,77 +465,123 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
 	return 0;
 }
 
+static int ptrace_check_debugreg(int _32bit,
+				 unsigned long dr0, unsigned long dr1,
+				 unsigned long dr2, unsigned long dr3,
+				 unsigned long dr6, unsigned long dr7)
+{
+	/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
+	unsigned int rw[4];
+	/* Breakpoint length: 00: 1 byte, 01: 2 bytes, 10: 8 bytes, 11: 4 bytes */
+	unsigned int len[4];
+	int n;
+
+	if (dr0 >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr1 >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr2 >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr3 >= TASK_SIZE_MAX)
+		return -EINVAL;
+
+	for (n = 0; n < 4; n++) {
+		rw[n] = (dr7 >> (16 + n * 4)) & 0x3;
+		len[n] = (dr7 >> (16 + n * 4 + 2)) & 0x3;
+
+		if (rw[n] == 0x2)
+			return -EINVAL;
+		if (rw[n] == 0x0 && len[n] != 0x0)
+			return -EINVAL;
+		if (_32bit && len[n] == 0x2)
+			return -EINVAL;
+
+		if (len[n] == 0x0)
+			len[n] = 1;
+		else if (len[n] == 0x1)
+			len[n] = 2;
+		else if (len[n] == 0x2)
+			len[n] = 8;
+		else if (len[n] == 0x3)
+			len[n] = 4;
+		/* From now breakpoint length is in bytes. */
+	}
+
+	if (dr6 & ~0xFFFFFFFFUL)
+		return -EINVAL;
+	if (dr7 & ~0xFFFFFFFFUL)
+		return -EINVAL;
+
+	if (dr7 == 0)
+		return 0;
+
+	if (dr0 + len[0] > TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr1 + len[1] > TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr2 + len[2] > TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr3 + len[3] > TASK_SIZE_MAX)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int ptrace_set_debugreg(struct task_struct *child,
 			       int n, unsigned long data)
 {
-	int i;
+	unsigned long dr0, dr1, dr2, dr3, dr6, dr7;
+	int _32bit;
 
 	if (unlikely(n == 4 || n == 5))
 		return -EIO;
 
-	if (n < 4 && unlikely(data >= debugreg_addr_limit(child)))
-		return -EIO;
-
+	dr0 = child->thread.debugreg0;
+	dr1 = child->thread.debugreg1;
+	dr2 = child->thread.debugreg2;
+	dr3 = child->thread.debugreg3;
+	dr6 = child->thread.debugreg6;
+	dr7 = child->thread.debugreg7;
 	switch (n) {
-	case 0:		child->thread.debugreg0 = data; break;
-	case 1:		child->thread.debugreg1 = data; break;
-	case 2:		child->thread.debugreg2 = data; break;
-	case 3:		child->thread.debugreg3 = data; break;
-
+	case 0:
+		dr0 = data;
+		break;
+	case 1:
+		dr1 = data;
+		break;
+	case 2:
+		dr2 = data;
+		break;
+	case 3:
+		dr3 = data;
+		break;
 	case 6:
-		if ((data & ~0xffffffffUL) != 0)
-			return -EIO;
-		child->thread.debugreg6 = data;
+		dr6 = data;
 		break;
-
 	case 7:
-		/*
-		 * Sanity-check data. Take one half-byte at once with
-		 * check = (val >> (16 + 4*i)) & 0xf. It contains the
-		 * R/Wi and LENi bits; bits 0 and 1 are R/Wi, and bits
-		 * 2 and 3 are LENi. Given a list of invalid values,
-		 * we do mask |= 1 << invalid_value, so that
-		 * (mask >> check) & 1 is a correct test for invalid
-		 * values.
-		 *
-		 * R/Wi contains the type of the breakpoint /
-		 * watchpoint, LENi contains the length of the watched
-		 * data in the watchpoint case.
-		 *
-		 * The invalid values are:
-		 * - LENi == 0x10 (undefined), so mask |= 0x0f00.	[32-bit]
-		 * - R/Wi == 0x10 (break on I/O reads or writes), so
-		 *   mask |= 0x4444.
-		 * - R/Wi == 0x00 && LENi != 0x00, so we have mask |=
-		 *   0x1110.
-		 *
-		 * Finally, mask = 0x0f00 | 0x4444 | 0x1110 == 0x5f54.
-		 *
-		 * See the Intel Manual "System Programming Guide",
-		 * 15.2.4
-		 *
-		 * Note that LENi == 0x10 is defined on x86_64 in long
-		 * mode (i.e. even for 32-bit userspace software, but
-		 * 64-bit kernel), so the x86_64 mask value is 0x5454.
-		 * See the AMD manual no. 24593 (AMD64 System Programming)
-		 */
-#ifdef CONFIG_X86_32
-#define	DR7_MASK	0x5f54
-#else
-#define	DR7_MASK	0x5554
-#endif
-		data &= ~DR_CONTROL_RESERVED;
-		for (i = 0; i < 4; i++)
-			if ((DR7_MASK >> ((data >> (16 + 4*i)) & 0xf)) & 1)
-				return -EIO;
-		child->thread.debugreg7 = data;
-		if (data)
-			set_tsk_thread_flag(child, TIF_DEBUG);
-		else
-			clear_tsk_thread_flag(child, TIF_DEBUG);
+		dr7 = data & ~DR_CONTROL_RESERVED;
 		break;
 	}
 
+	_32bit = (sizeof(unsigned long) == 4);
+#ifdef CONFIG_COMPAT
+	if (test_tsk_thread_flag(child, TIF_IA32))
+		_32bit = 1;
+#endif
+	if (ptrace_check_debugreg(_32bit, dr0, dr1, dr2, dr3, dr6, dr7))
+		return -EIO;
+
+	child->thread.debugreg0 = dr0;
+	child->thread.debugreg1 = dr1;
+	child->thread.debugreg2 = dr2;
+	child->thread.debugreg3 = dr3;
+	child->thread.debugreg6 = dr6;
+	child->thread.debugreg7 = dr7;
+	if (dr7)
+		set_tsk_thread_flag(child, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(child, TIF_DEBUG);
+
 	return 0;
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 17/38] groups: move code to kernel/groups.c
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (14 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 16/38] x86: ptrace debugreg checks rewrite Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-25  0:53   ` Serge E. Hallyn
  2009-05-26 14:48   ` Serge E. Hallyn
  2009-05-22  4:55 ` [PATCH 18/38] C/R: core stuff Alexey Dobriyan
                   ` (21 subsequent siblings)
  37 siblings, 2 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Move the supplementary groups implementation to kernel/groups.c.
kernel/sys.c has already accumulated quite a lot of random stuff.

This is strictly copy/paste plus the headers required to compile.
Compile-tested on many configs and archs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 kernel/Makefile |    1 +
 kernel/groups.c |  288 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c    |  283 ------------------------------------------------------
 3 files changed, 289 insertions(+), 283 deletions(-)
 create mode 100644 kernel/groups.c

diff --git a/kernel/Makefile b/kernel/Makefile
index 4242366..705ad3d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -11,6 +11,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
 	    async.o
+obj-y += groups.o
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace debug files and internal ftrace files
diff --git a/kernel/groups.c b/kernel/groups.c
new file mode 100644
index 0000000..1b95b2f
--- /dev/null
+++ b/kernel/groups.c
@@ -0,0 +1,288 @@
+/*
+ * Supplementary group IDs
+ */
+#include <linux/cred.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/security.h>
+#include <linux/syscalls.h>
+#include <asm/uaccess.h>
+
+/* init to 2 - one for init_task, one to ensure it is never freed */
+struct group_info init_groups = { .usage = ATOMIC_INIT(2) };
+
+struct group_info *groups_alloc(int gidsetsize)
+{
+	struct group_info *group_info;
+	int nblocks;
+	int i;
+
+	nblocks = (gidsetsize + NGROUPS_PER_BLOCK - 1) / NGROUPS_PER_BLOCK;
+	/* Make sure we always allocate at least one indirect block pointer */
+	nblocks = nblocks ? : 1;
+	group_info = kmalloc(sizeof(*group_info) + nblocks*sizeof(gid_t *), GFP_USER);
+	if (!group_info)
+		return NULL;
+	group_info->ngroups = gidsetsize;
+	group_info->nblocks = nblocks;
+	atomic_set(&group_info->usage, 1);
+
+	if (gidsetsize <= NGROUPS_SMALL)
+		group_info->blocks[0] = group_info->small_block;
+	else {
+		for (i = 0; i < nblocks; i++) {
+			gid_t *b;
+			b = (void *)__get_free_page(GFP_USER);
+			if (!b)
+				goto out_undo_partial_alloc;
+			group_info->blocks[i] = b;
+		}
+	}
+	return group_info;
+
+out_undo_partial_alloc:
+	while (--i >= 0) {
+		free_page((unsigned long)group_info->blocks[i]);
+	}
+	kfree(group_info);
+	return NULL;
+}
+
+EXPORT_SYMBOL(groups_alloc);
+
+void groups_free(struct group_info *group_info)
+{
+	if (group_info->blocks[0] != group_info->small_block) {
+		int i;
+		for (i = 0; i < group_info->nblocks; i++)
+			free_page((unsigned long)group_info->blocks[i]);
+	}
+	kfree(group_info);
+}
+
+EXPORT_SYMBOL(groups_free);
+
+/* export the group_info to a user-space array */
+static int groups_to_user(gid_t __user *grouplist,
+			  const struct group_info *group_info)
+{
+	int i;
+	unsigned int count = group_info->ngroups;
+
+	for (i = 0; i < group_info->nblocks; i++) {
+		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
+		unsigned int len = cp_count * sizeof(*grouplist);
+
+		if (copy_to_user(grouplist, group_info->blocks[i], len))
+			return -EFAULT;
+
+		grouplist += NGROUPS_PER_BLOCK;
+		count -= cp_count;
+	}
+	return 0;
+}
+
+/* fill a group_info from a user-space array - it must be allocated already */
+static int groups_from_user(struct group_info *group_info,
+    gid_t __user *grouplist)
+{
+	int i;
+	unsigned int count = group_info->ngroups;
+
+	for (i = 0; i < group_info->nblocks; i++) {
+		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
+		unsigned int len = cp_count * sizeof(*grouplist);
+
+		if (copy_from_user(group_info->blocks[i], grouplist, len))
+			return -EFAULT;
+
+		grouplist += NGROUPS_PER_BLOCK;
+		count -= cp_count;
+	}
+	return 0;
+}
+
+/* a simple Shell sort */
+static void groups_sort(struct group_info *group_info)
+{
+	int base, max, stride;
+	int gidsetsize = group_info->ngroups;
+
+	for (stride = 1; stride < gidsetsize; stride = 3 * stride + 1)
+		; /* nothing */
+	stride /= 3;
+
+	while (stride) {
+		max = gidsetsize - stride;
+		for (base = 0; base < max; base++) {
+			int left = base;
+			int right = left + stride;
+			gid_t tmp = GROUP_AT(group_info, right);
+
+			while (left >= 0 && GROUP_AT(group_info, left) > tmp) {
+				GROUP_AT(group_info, right) =
+				    GROUP_AT(group_info, left);
+				right = left;
+				left -= stride;
+			}
+			GROUP_AT(group_info, right) = tmp;
+		}
+		stride /= 3;
+	}
+}
+
+/* a simple bsearch */
+int groups_search(const struct group_info *group_info, gid_t grp)
+{
+	unsigned int left, right;
+
+	if (!group_info)
+		return 0;
+
+	left = 0;
+	right = group_info->ngroups;
+	while (left < right) {
+		unsigned int mid = (left+right)/2;
+		int cmp = grp - GROUP_AT(group_info, mid);
+		if (cmp > 0)
+			left = mid + 1;
+		else if (cmp < 0)
+			right = mid;
+		else
+			return 1;
+	}
+	return 0;
+}
+
+/**
+ * set_groups - Change a group subscription in a set of credentials
+ * @new: The newly prepared set of credentials to alter
+ * @group_info: The group list to install
+ *
+ * Validate a group subscription and, if valid, insert it into a set
+ * of credentials.
+ */
+int set_groups(struct cred *new, struct group_info *group_info)
+{
+	int retval;
+
+	retval = security_task_setgroups(group_info);
+	if (retval)
+		return retval;
+
+	put_group_info(new->group_info);
+	groups_sort(group_info);
+	get_group_info(group_info);
+	new->group_info = group_info;
+	return 0;
+}
+
+EXPORT_SYMBOL(set_groups);
+
+/**
+ * set_current_groups - Change current's group subscription
+ * @group_info: The group list to impose
+ *
+ * Validate a group subscription and, if valid, impose it upon current's task
+ * security record.
+ */
+int set_current_groups(struct group_info *group_info)
+{
+	struct cred *new;
+	int ret;
+
+	new = prepare_creds();
+	if (!new)
+		return -ENOMEM;
+
+	ret = set_groups(new, group_info);
+	if (ret < 0) {
+		abort_creds(new);
+		return ret;
+	}
+
+	return commit_creds(new);
+}
+
+EXPORT_SYMBOL(set_current_groups);
+
+SYSCALL_DEFINE2(getgroups, int, gidsetsize, gid_t __user *, grouplist)
+{
+	const struct cred *cred = current_cred();
+	int i;
+
+	if (gidsetsize < 0)
+		return -EINVAL;
+
+	/* no need to grab task_lock here; it cannot change */
+	i = cred->group_info->ngroups;
+	if (gidsetsize) {
+		if (i > gidsetsize) {
+			i = -EINVAL;
+			goto out;
+		}
+		if (groups_to_user(grouplist, cred->group_info)) {
+			i = -EFAULT;
+			goto out;
+		}
+	}
+out:
+	return i;
+}
+
+/*
+ *	SMP: Our groups are copy-on-write. We can set them safely
+ *	without another task interfering.
+ */
+ 
+SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
+{
+	struct group_info *group_info;
+	int retval;
+
+	if (!capable(CAP_SETGID))
+		return -EPERM;
+	if ((unsigned)gidsetsize > NGROUPS_MAX)
+		return -EINVAL;
+
+	group_info = groups_alloc(gidsetsize);
+	if (!group_info)
+		return -ENOMEM;
+	retval = groups_from_user(group_info, grouplist);
+	if (retval) {
+		put_group_info(group_info);
+		return retval;
+	}
+
+	retval = set_current_groups(group_info);
+	put_group_info(group_info);
+
+	return retval;
+}
+
+/*
+ * Check whether we're fsgid/egid or in the supplemental group..
+ */
+int in_group_p(gid_t grp)
+{
+	const struct cred *cred = current_cred();
+	int retval = 1;
+
+	if (grp != cred->fsgid)
+		retval = groups_search(cred->group_info, grp);
+	return retval;
+}
+
+EXPORT_SYMBOL(in_group_p);
+
+int in_egroup_p(gid_t grp)
+{
+	const struct cred *cred = current_cred();
+	int retval = 1;
+
+	if (grp != cred->egid)
+		retval = groups_search(cred->group_info, grp);
+	return retval;
+}
+
+EXPORT_SYMBOL(in_egroup_p);
diff --git a/kernel/sys.c b/kernel/sys.c
index e7998cf..4edcf51 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1112,289 +1112,6 @@ out:
 	return err;
 }
 
-/*
- * Supplementary group IDs
- */
-
-/* init to 2 - one for init_task, one to ensure it is never freed */
-struct group_info init_groups = { .usage = ATOMIC_INIT(2) };
-
-struct group_info *groups_alloc(int gidsetsize)
-{
-	struct group_info *group_info;
-	int nblocks;
-	int i;
-
-	nblocks = (gidsetsize + NGROUPS_PER_BLOCK - 1) / NGROUPS_PER_BLOCK;
-	/* Make sure we always allocate at least one indirect block pointer */
-	nblocks = nblocks ? : 1;
-	group_info = kmalloc(sizeof(*group_info) + nblocks*sizeof(gid_t *), GFP_USER);
-	if (!group_info)
-		return NULL;
-	group_info->ngroups = gidsetsize;
-	group_info->nblocks = nblocks;
-	atomic_set(&group_info->usage, 1);
-
-	if (gidsetsize <= NGROUPS_SMALL)
-		group_info->blocks[0] = group_info->small_block;
-	else {
-		for (i = 0; i < nblocks; i++) {
-			gid_t *b;
-			b = (void *)__get_free_page(GFP_USER);
-			if (!b)
-				goto out_undo_partial_alloc;
-			group_info->blocks[i] = b;
-		}
-	}
-	return group_info;
-
-out_undo_partial_alloc:
-	while (--i >= 0) {
-		free_page((unsigned long)group_info->blocks[i]);
-	}
-	kfree(group_info);
-	return NULL;
-}
-
-EXPORT_SYMBOL(groups_alloc);
-
-void groups_free(struct group_info *group_info)
-{
-	if (group_info->blocks[0] != group_info->small_block) {
-		int i;
-		for (i = 0; i < group_info->nblocks; i++)
-			free_page((unsigned long)group_info->blocks[i]);
-	}
-	kfree(group_info);
-}
-
-EXPORT_SYMBOL(groups_free);
-
-/* export the group_info to a user-space array */
-static int groups_to_user(gid_t __user *grouplist,
-			  const struct group_info *group_info)
-{
-	int i;
-	unsigned int count = group_info->ngroups;
-
-	for (i = 0; i < group_info->nblocks; i++) {
-		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
-		unsigned int len = cp_count * sizeof(*grouplist);
-
-		if (copy_to_user(grouplist, group_info->blocks[i], len))
-			return -EFAULT;
-
-		grouplist += NGROUPS_PER_BLOCK;
-		count -= cp_count;
-	}
-	return 0;
-}
-
-/* fill a group_info from a user-space array - it must be allocated already */
-static int groups_from_user(struct group_info *group_info,
-    gid_t __user *grouplist)
-{
-	int i;
-	unsigned int count = group_info->ngroups;
-
-	for (i = 0; i < group_info->nblocks; i++) {
-		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
-		unsigned int len = cp_count * sizeof(*grouplist);
-
-		if (copy_from_user(group_info->blocks[i], grouplist, len))
-			return -EFAULT;
-
-		grouplist += NGROUPS_PER_BLOCK;
-		count -= cp_count;
-	}
-	return 0;
-}
-
-/* a simple Shell sort */
-static void groups_sort(struct group_info *group_info)
-{
-	int base, max, stride;
-	int gidsetsize = group_info->ngroups;
-
-	for (stride = 1; stride < gidsetsize; stride = 3 * stride + 1)
-		; /* nothing */
-	stride /= 3;
-
-	while (stride) {
-		max = gidsetsize - stride;
-		for (base = 0; base < max; base++) {
-			int left = base;
-			int right = left + stride;
-			gid_t tmp = GROUP_AT(group_info, right);
-
-			while (left >= 0 && GROUP_AT(group_info, left) > tmp) {
-				GROUP_AT(group_info, right) =
-				    GROUP_AT(group_info, left);
-				right = left;
-				left -= stride;
-			}
-			GROUP_AT(group_info, right) = tmp;
-		}
-		stride /= 3;
-	}
-}
-
-/* a simple bsearch */
-int groups_search(const struct group_info *group_info, gid_t grp)
-{
-	unsigned int left, right;
-
-	if (!group_info)
-		return 0;
-
-	left = 0;
-	right = group_info->ngroups;
-	while (left < right) {
-		unsigned int mid = (left+right)/2;
-		int cmp = grp - GROUP_AT(group_info, mid);
-		if (cmp > 0)
-			left = mid + 1;
-		else if (cmp < 0)
-			right = mid;
-		else
-			return 1;
-	}
-	return 0;
-}
-
-/**
- * set_groups - Change a group subscription in a set of credentials
- * @new: The newly prepared set of credentials to alter
- * @group_info: The group list to install
- *
- * Validate a group subscription and, if valid, insert it into a set
- * of credentials.
- */
-int set_groups(struct cred *new, struct group_info *group_info)
-{
-	int retval;
-
-	retval = security_task_setgroups(group_info);
-	if (retval)
-		return retval;
-
-	put_group_info(new->group_info);
-	groups_sort(group_info);
-	get_group_info(group_info);
-	new->group_info = group_info;
-	return 0;
-}
-
-EXPORT_SYMBOL(set_groups);
-
-/**
- * set_current_groups - Change current's group subscription
- * @group_info: The group list to impose
- *
- * Validate a group subscription and, if valid, impose it upon current's task
- * security record.
- */
-int set_current_groups(struct group_info *group_info)
-{
-	struct cred *new;
-	int ret;
-
-	new = prepare_creds();
-	if (!new)
-		return -ENOMEM;
-
-	ret = set_groups(new, group_info);
-	if (ret < 0) {
-		abort_creds(new);
-		return ret;
-	}
-
-	return commit_creds(new);
-}
-
-EXPORT_SYMBOL(set_current_groups);
-
-SYSCALL_DEFINE2(getgroups, int, gidsetsize, gid_t __user *, grouplist)
-{
-	const struct cred *cred = current_cred();
-	int i;
-
-	if (gidsetsize < 0)
-		return -EINVAL;
-
-	/* no need to grab task_lock here; it cannot change */
-	i = cred->group_info->ngroups;
-	if (gidsetsize) {
-		if (i > gidsetsize) {
-			i = -EINVAL;
-			goto out;
-		}
-		if (groups_to_user(grouplist, cred->group_info)) {
-			i = -EFAULT;
-			goto out;
-		}
-	}
-out:
-	return i;
-}
-
-/*
- *	SMP: Our groups are copy-on-write. We can set them safely
- *	without another task interfering.
- */
- 
-SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
-{
-	struct group_info *group_info;
-	int retval;
-
-	if (!capable(CAP_SETGID))
-		return -EPERM;
-	if ((unsigned)gidsetsize > NGROUPS_MAX)
-		return -EINVAL;
-
-	group_info = groups_alloc(gidsetsize);
-	if (!group_info)
-		return -ENOMEM;
-	retval = groups_from_user(group_info, grouplist);
-	if (retval) {
-		put_group_info(group_info);
-		return retval;
-	}
-
-	retval = set_current_groups(group_info);
-	put_group_info(group_info);
-
-	return retval;
-}
-
-/*
- * Check whether we're fsgid/egid or in the supplemental group..
- */
-int in_group_p(gid_t grp)
-{
-	const struct cred *cred = current_cred();
-	int retval = 1;
-
-	if (grp != cred->fsgid)
-		retval = groups_search(cred->group_info, grp);
-	return retval;
-}
-
-EXPORT_SYMBOL(in_group_p);
-
-int in_egroup_p(gid_t grp)
-{
-	const struct cred *cred = current_cred();
-	int retval = 1;
-
-	if (grp != cred->egid)
-		retval = groups_search(cred->group_info, grp);
-	return retval;
-}
-
-EXPORT_SYMBOL(in_egroup_p);
-
 DECLARE_RWSEM(uts_sem);
 
 SYSCALL_DEFINE1(newuname, struct new_utsname __user *, name)
-- 
1.5.6.5



* [PATCH 18/38] C/R: core stuff
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (15 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 17/38] groups: move code to kernel/groups.c Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-26 13:16   ` Serge E. Hallyn
  2009-05-27 16:28   ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 19/38] C/R: multiple tasks Alexey Dobriyan
                   ` (20 subsequent siblings)
  37 siblings, 2 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Introduction
------------
Checkpoint/restart (C/R from now on) allows dumping a group of processes to
disk, for example to save process state in case of box failure or to restore
the group of processes later on the same or another machine.

Unlike, let's say, hypervisor-style C/R, which only needs to freeze the guest
kernel and dump more or less raw pages, the proposed C/R doesn't require a
hypervisor. For that, C/R code needs to know about all the little and big
intimate kernel details.

The good thing is that not all details need to be serialized and saved
(readahead state, for example). The bad thing is that quite a few things
still need to be.

How C/R works
-------------
The user passes to the system call the pid of the process which is the root of
the process hierarchy to be saved. The hierarchy is formed wrt ->real_parent.

Processes which belong to this hierarchy are frozen.
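
The membership test behind "belongs to this hierarchy" can be pictured with a
small userspace sketch. struct task and is_descendant() are hypothetical
stand-ins for task_struct and the child_of() helper added by this patch; only
the walk over ->real_parent is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical stand-in for task_struct: only the parent link matters. */
struct task {
	const char *comm;
	struct task *real_parent;	/* NULL for the top-most task */
};

/* Return 1 if 'tsk' is 'root' or a descendant of 'root' in any generation. */
static int is_descendant(const struct task *root, const struct task *tsk)
{
	while (tsk) {
		if (tsk == root)
			return 1;
		tsk = tsk->real_parent;
	}
	return 0;
}
```

Every task for which such a test succeeds against the root would be frozen
before collection starts.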

C/R code walks in-kernel data structures starting from task_structs and collects
references to them in one place for a dump. This includes at least mm_struct,
nsproxy and its belongings, opened files, CPU registers, etc. More or less
anything which is userspace-visible must be dumped in one form or another.

Collected data structures are checked for sanity wrt several things:
- still unsupported features (hugetlb mappings) in which case
  checkpointing is aborted
- structure "leaks" to outside of selected group of processes.

  This is done by maintaining a mirror refcount from the structures collected
  and comparing it with the real refcount. If a discrepancy is noticed, the
  structure is used by someone else who is not frozen and thus we're trying to
  checkpoint a live data structure. Depending on the data structure, dumping
  it live can range from pretty easy (uts_namespace requires only uts_sem)
  to pretty hard (net_namespace sockets require... actually nobody knows
  what net_namespace will require).

  Note! There are multiple levels of correct operation implied:
	- kernel shouldn't oops by dumping live data structure
	- kernel shouldn't write inconsistent state by dumping live data structure
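
The leak check can be sketched as follows. This is a simplified model, not
the code from this patch: struct collected and find_leak() are hypothetical
names, and the real code has to account for references the collector itself
holds (the file check in this patch compares o_count + 1 with f_count).

```c
#include <stddef.h>

/* Hypothetical model of one collected object. */
struct collected {
	unsigned long real_count;	/* refcount stored in the kernel object */
	unsigned long mirror_count;	/* references seen from collected objects */
};

/*
 * Return the index of the first object referenced from outside the
 * collected set, or -1 if nothing leaks.
 */
static long find_leak(const struct collected *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (objs[i].mirror_count != objs[i].real_count)
			return (long)i;
	}
	return -1;
}
```

A non-negative result would mean the checkpoint has to be aborted.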

If all checks for all data structures pass, C/R code walks the collected
references in a certain order, does the actual serializing and writes the
image to the passed file descriptor.

This results in a file which is believed to contain enough information
to restore the group of processes to exactly the same user-visible state as
before checkpointing, modulo inevitable issues like time.

Image format design
-------------------
Image consists of image header, object images one after another and
terminator which is formally an object.

Image header consists of magic ("LinuxC/R") and image version (__le32).
This is the immutable part of an image. The rest, even the rest of the image
header, is defined strictly by the image version.

	Nobody is making guarantees that image format is immutable!

Once again, image format will change, however it's guaranteed that
magic+version part will remain and image version will be bumped.
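
A reader of the image can rely only on that prefix. A minimal userspace
sketch of parsing it, assuming the 8-byte magic is immediately followed by
the little-endian version as in struct kstate_image_header;
parse_image_prefix() and get_le32() are hypothetical helper names:

```c
#include <stdint.h>
#include <string.h>

#define IMAGE_MAGIC	"LinuxC/R"
#define IMAGE_MAGIC_LEN	8

/* Decode a little-endian 32-bit value regardless of host endianness. */
static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse the immutable part of an image header.
 * Returns 0 and stores the version on success, -1 if the buffer is too
 * short or the magic doesn't match.
 */
static int parse_image_prefix(const unsigned char *buf, size_t len,
			      uint32_t *version)
{
	if (len < IMAGE_MAGIC_LEN + 4)
		return -1;
	if (memcmp(buf, IMAGE_MAGIC, IMAGE_MAGIC_LEN) != 0)
		return -1;
	*version = get_le32(buf + IMAGE_MAGIC_LEN);
	return 0;
}
```

Everything after these 12 bytes would then be interpreted according to the
version that was read.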

So far, image header consists of
a) arch of the kernel which dumped the image, to signal that you can't restore
   a powerpc image on an i386 kernel and to hint the code when restoring an
   i386 image on an x86_64 kernel.

b) kernel version as found in utsname.

   This is done for distributions which eventually may want to support C/R.
   While the expected way to support migration from older kernels is
   to write a userspace converter which knows everything about two image
   formats, distro kernels may want to keep all backward-compat code
   in the kernel, which can be small or big depending on the amount of
   changes they pull into their kernel.

   Image version alone isn't realistically sufficient: distributions are
   expected to leave image version alone and demultiplex backward-compat
   code depending on utsname, which in the case of distro kernels is pretty
   well known.

Object image
------------
Object images are direct projection of in-kernel data structures which
can be shared inside kernel to disk :-) Example: struct mm_struct is
dumped to an object of KSTATE_OBJ_MM_STRUCT type, struct cred is dumped
to KSTATE_OBJ_CRED type.

There are so far 3 exceptions: VMA, page content and fd. These are formal
objects with type and length to simplify reading and restoration of VMAs
and file descriptors, respectively. VMA and pages attributed are even
variable-sized.

Any object image starts with an object header, which is object type,
object length including header and a globally unique object id (per-image,
of course). Type and length are used to verify that the image is not
malformed, and object id is used in references from one object to another
and also in verification.

VMA, pages and file descriptors don't get object id, because they aren't
directly collected. But this is fine, as they get invalid object id (0)
which is not checked.
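
How type and length help to detect a malformed image can be sketched in
userspace. struct obj_header mirrors struct kstate_object_header from this
patch (three 32-bit fields in host byte order); count_objects() is a
hypothetical helper which expects a buffer that starts right after the image
header.

```c
#include <stdint.h>
#include <string.h>

#define OBJ_TERMINATOR	0

/* Mirrors struct kstate_object_header: 12 bytes, no padding. */
struct obj_header {
	uint32_t obj_type;
	uint32_t obj_len;	/* in bytes, including this header */
	uint32_t obj_id;
};

/*
 * Walk object images in 'buf' until the terminator.
 * Returns the number of non-terminator objects, or -1 if malformed.
 */
static int count_objects(const unsigned char *buf, size_t len)
{
	size_t pos = 0;
	int n = 0;

	while (len - pos >= sizeof(struct obj_header)) {
		struct obj_header hdr;

		memcpy(&hdr, buf + pos, sizeof(hdr));
		/* Length must cover the header and fit in what is left. */
		if (hdr.obj_len < sizeof(hdr) || hdr.obj_len > len - pos)
			return -1;
		if (hdr.obj_type == OBJ_TERMINATOR)
			return n;
		n++;
		pos += hdr.obj_len;
	}
	return -1;	/* ran out of data before the terminator */
}
```

The walk never trusts obj_len before checking it against the remaining
bytes, which is the kind of checking restart(2) has to do on every object.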

Relations in image
------------------
To serialize a reference from object A to object B (task_struct::mm), the
image of object A gets a field of kstate_ref_t type:

	struct kstate_image_a {
		struct kstate_object_header hdr;

		kstate_ref_t	ref_b;
	};

which corresponds to a->b pointer.

Reference consists of position in a dumpfile (if it's known by the time B
is dumped) and object id of B (which _is_ known).

If there is a loop in pointers A => B => A, position is dumped as 0 and attention
is required at restore time. In each loop case it will be dealt individually.
(so far, there is one loop: user->user_ns->creator)
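
Filling in a reference could look roughly like the sketch below. struct ref
follows kstate_ref_t from this patch; struct obj and make_ref() are
hypothetical. It assumes that position 0 can serve as "not known" because
offset 0 of a dumpfile is occupied by the image header, so no object image
can start there.

```c
#include <stdint.h>

struct ref {
	uint64_t pos;	/* position of target in dumpfile, 0 if not known */
	uint32_t id;	/* object id of target, 0 means undefined */
};

/* Hypothetical bookkeeping for one collected object. */
struct obj {
	uint32_t id;
	uint64_t pos;	/* 0 until the object has been written */
};

/* Build a reference to 'target' as seen at this moment of the dump. */
static struct ref make_ref(const struct obj *target)
{
	struct ref r = { 0, 0 };

	if (target) {
		r.id = target->id;
		r.pos = target->pos;	/* stays 0 if target isn't dumped yet */
	}
	return r;
}

/* Same test as kstate_ref_undefined() in this patch. */
static int ref_undefined(const struct ref *r)
{
	return r->pos == 0 && r->id == 0;
}
```

In the loop case the reference carries a valid id with position 0, and
restore code has to find the target by id.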

C/R dump code tries hard to maintain the streamable property of the dump
process: if the dumpfile is opened with O_APPEND, it should work.


Changes in kernel internals:
- add struct file_operations::checkpoint hook

  Code which operates on the file knows better what information needs to be
  saved to allow successful restoration. For usual files on an on-disk
  filesystem add generic_file_checkpoint().

  Add ext3 opened regular files and directories for a start.

  If an opened or referenced file doesn't have a ->checkpoint hook,
  checkpointing is aborted -- this is a deny-by-default policy.

- add struct vm_operations_struct::checkpoint

  Same as for files. Will be used more by vDSO code.
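
The deny-by-default policy for both hooks boils down to the pattern sketched
below. struct ops, struct ctx, dump_ok() and checkpoint_one() are
hypothetical userspace stand-ins; the error value follows check_file() in
this patch, which returns -EINVAL for a missing hook.

```c
#include <stddef.h>
#include <errno.h>

struct ctx;	/* opaque dump context, hypothetical */

/* Stand-in for file_operations/vm_operations_struct. */
struct ops {
	int (*checkpoint)(void *obj, struct ctx *ctx);	/* NULL: unsupported */
};

/* Example hook of an object type which opted in. */
static int dump_ok(void *obj, struct ctx *ctx)
{
	(void)obj;
	(void)ctx;
	return 0;
}

/* Refuse anything that hasn't opted in by providing a hook. */
static int checkpoint_one(const struct ops *ops, void *obj, struct ctx *ctx)
{
	if (!ops || !ops->checkpoint)
		return -EINVAL;
	return ops->checkpoint(obj, ctx);
}
```

With this shape a new file or VMA type is not checkpointable until somebody
has thought about what has to be saved for it.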

checkpoint(2), restart(2)
-------------------------
Exact number and semantics of system calls is WIP. It was correctly noticed
that the 'freeze' and 'dump' parts need to be split apart to allow filesystem
sync after freeze. For now leave 2 syscalls for people to play with.

Splitting freeze/dump implies persistent C/R context state BTW which is not
the case now.

Checkpoint semantics
--------------------
Checkpointing is done on per-container level without "leaks" to outside.
This way the kernel can promise to dump coherent state and to do it
without major games with data structure locking.

Work to allow reliable checkpointing of live data structures has been started,
however it's unclear what the result will be because of even more things one
has to keep in one's head simultaneously.

As a semi-direct consequence checkpointing is not allowed for ordinary
users. While it would be a nice feature to allow it, they formally can't even
create a container for themselves (CAP_SYS_ADMIN during nsproxy tweaks).

Regardless, once enabled this will present several security risks:
- anything in the image is controlled by an untrusted user who will try to
  check how well the checking code in restart(2) is written.

  And, yes, nobody is hiding that restart(2) involves honest parsing of
  a file format slightly more complex than a string inside the kernel.

- a user can effectively turn off ASLR, because he controls VMA boundaries.

These two directions (checkpointing live and restart(2) for everyone)
aren't explored in this patchset because they are artificially not
independent right now and each one is hard enough for already hard C/R work.

Sorry.

However, code is kept good enough to add each feature later.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/ext3/dir.c                  |    3 +
 fs/ext3/file.c                 |    3 +
 include/linux/Kbuild           |    1 +
 include/linux/fs.h             |   12 +-
 include/linux/kstate-image.h   |  118 +++++++++
 include/linux/kstate.h         |  144 ++++++++++
 include/linux/mm.h             |    4 +
 include/linux/syscalls.h       |    3 +
 init/Kconfig                   |    2 +
 kernel/Makefile                |    1 +
 kernel/kstate/Kconfig          |    7 +
 kernel/kstate/Makefile         |    8 +
 kernel/kstate/cpt-sys.c        |  196 ++++++++++++++
 kernel/kstate/kstate-context.c |   49 ++++
 kernel/kstate/kstate-file.c    |  204 +++++++++++++++
 kernel/kstate/kstate-image.c   |  116 ++++++++
 kernel/kstate/kstate-mm.c      |  563 ++++++++++++++++++++++++++++++++++++++++
 kernel/kstate/kstate-object.c  |  100 +++++++
 kernel/kstate/kstate-task.c    |  287 ++++++++++++++++++++
 kernel/kstate/rst-sys.c        |   91 +++++++
 kernel/sys_ni.c                |    3 +
 mm/filemap.c                   |    3 +
 22 files changed, 1916 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/kstate-image.h
 create mode 100644 include/linux/kstate.h
 create mode 100644 kernel/kstate/Kconfig
 create mode 100644 kernel/kstate/Makefile
 create mode 100644 kernel/kstate/cpt-sys.c
 create mode 100644 kernel/kstate/kstate-context.c
 create mode 100644 kernel/kstate/kstate-file.c
 create mode 100644 kernel/kstate/kstate-image.c
 create mode 100644 kernel/kstate/kstate-mm.c
 create mode 100644 kernel/kstate/kstate-object.c
 create mode 100644 kernel/kstate/kstate-task.c
 create mode 100644 kernel/kstate/rst-sys.c

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..ee4d4df 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,9 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..6cc26f5 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -126,6 +126,9 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 3f0eaa3..353a218 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -50,6 +50,7 @@ header-y += coff.h
 header-y += comstats.h
 header-y += const.h
 header-y += cgroupstats.h
+header-y += kstate-image.h
 header-y += cramfs_fs.h
 header-y += cycx_cfm.h
 header-y += dcbnl.h
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b534e5..e4f33e0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -387,6 +387,7 @@ struct poll_table_struct;
 struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
+struct kstate_context;
 struct cred;
 
 extern void __init inode_init(void);
@@ -1508,6 +1509,9 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct file *file, struct kstate_context *ctx);
+#endif
 };
 
 struct inode_operations {
@@ -2079,7 +2083,9 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
-
+#ifdef CONFIG_CHECKPOINT
+int filemap_checkpoint(struct vm_area_struct *vma, struct kstate_context *ctx);
+#endif
 extern int vfs_fsync(struct file *file, struct dentry *dentry, int datasync);
 extern void sync_supers(void);
 extern void sync_filesystems(int wait);
@@ -2201,7 +2207,9 @@ extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, lof
 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 extern int generic_segment_checks(const struct iovec *iov,
 		unsigned long *nr_segs, size_t *count, int access_flags);
-
+#ifdef CONFIG_CHECKPOINT
+int generic_file_checkpoint(struct file *file, struct kstate_context *ctx);
+#endif
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
new file mode 100644
index 0000000..ac3c81d
--- /dev/null
+++ b/include/linux/kstate-image.h
@@ -0,0 +1,118 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#ifndef __INCLUDE_LINUX_KSTATE_IMAGE_H
+#define __INCLUDE_LINUX_KSTATE_IMAGE_H
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+typedef __u64 kstate_pos_t;	/* position of another object in a dumpfile */
+typedef __u32 kstate_id_t;	/* object id */
+struct kstate_ref {
+	kstate_pos_t	pos;
+	kstate_id_t	id;
+} __packed;
+typedef struct kstate_ref kstate_ref_t;
+
+#define KSTATE_REF_UNDEF	((kstate_ref_t){ .pos = 0, .id = 0 })
+static inline int kstate_ref_undefined(kstate_ref_t *ref)
+{
+	return ref->pos == 0 && ref->id == 0;
+}
+
+struct kstate_image_header {
+	/* Immutable part except version bumps. */
+#define KSTATE_IMAGE_MAGIC	"LinuxC/R"
+	__u8	image_magic[8];
+#define KSTATE_IMAGE_VERSION	1
+	__le32	image_version;
+
+	/* Mutable part. */
+	/* Arch of the kernel which dumped the image. */
+	__le32	kernel_arch;
+	/*
+	 * Distributions are expected to leave image version alone and
+	 * demultiplex by this field on restart.
+	 */
+	__u8	uts_release[64];
+} __packed;
+
+#define KSTATE_OBJ_TERMINATOR	0
+#define KSTATE_OBJ_TASK_STRUCT	1
+#define KSTATE_OBJ_MM_STRUCT	2
+#define KSTATE_OBJ_FILE		3
+#define KSTATE_OBJ_VMA		4
+#define KSTATE_OBJ_PAGE		5
+
+struct kstate_object_header {
+	__u32		obj_type;
+	__u32		obj_len;	/* in bytes including this header */
+	kstate_id_t	obj_id;
+} __packed;
+
+/*
+ * 1. struct kstate_object_header MUST start object image.
+ * 2. Every member which refers to position of another object image in
+ *    a dumpfile MUST have kstate_ref_t type and SHOULD additionally use
+ *    'ref_' prefix.
+ * 3. Size and layout of every object type image MUST be the same on all
+ *    architectures.
+ */
+
+struct kstate_image_task_struct {
+	struct kstate_object_header hdr;
+
+	kstate_ref_t	ref_mm;
+
+	__u8		comm[16];
+
+	/* Native arch of task, one of KSTATE_ARCH_*. */
+	__u32		tsk_arch;
+} __packed;
+
+struct kstate_image_mm_struct {
+	struct kstate_object_header hdr;
+
+	__u64		def_flags;
+	__u64		start_code;
+	__u64		end_code;
+	__u64		start_data;
+	__u64		end_data;
+	__u64		start_brk;
+	__u64		brk;
+	__u64		start_stack;
+	__u64		arg_start;
+	__u64		arg_end;
+	__u64		env_start;
+	__u64		env_end;
+	__u64		flags;
+	__u8		saved_auxv[416];
+} __packed;
+
+struct kstate_image_vma {
+	struct kstate_object_header hdr;
+
+	__u64		vm_start;
+	__u64		vm_end;
+	__u64		vm_page_prot;
+	__u64		vm_flags;
+	__u64		vm_pgoff;
+	kstate_ref_t	ref_vm_file;
+} __packed;
+
+struct kstate_image_page {
+	struct kstate_object_header hdr;
+
+	__u64		start_addr;
+	__u32		page_size;
+	/* __u8 data[page_size]; */
+} __packed;
+
+struct kstate_image_file {
+	struct kstate_object_header hdr;
+
+	__u32		i_mode;
+	__u32		f_flags;
+	__u64		f_pos;
+	__u32		name_len;	/* including NUL */
+	/* __u8	name[name_len] */
+} __packed;
+#endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
new file mode 100644
index 0000000..3ae9e28
--- /dev/null
+++ b/include/linux/kstate.h
@@ -0,0 +1,144 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#ifndef __INCLUDE_LINUX_KSTATE_H
+#define __INCLUDE_LINUX_KSTATE_H
+#include <linux/list.h>
+
+#include <linux/kstate-image.h>
+
+struct file;
+struct mm_struct;
+struct task_struct;
+
+struct kstate_object {
+	/* entry in struct kstate_context::obj lists */
+	struct list_head	o_list;
+	/* number of references from collected objects */
+	unsigned long		o_count;
+	kstate_ref_t		o_ref;
+	/* pointer to object being collected/dumped */
+	void			*o_obj;
+};
+
+/* Not visible to userspace! */
+enum kstate_context_obj_type {
+	KSTATE_CTX_FILE,
+	KSTATE_CTX_MM_STRUCT,
+	KSTATE_CTX_TASK_STRUCT,
+	NR_KSTATE_CTX_TYPES
+};
+
+struct kstate_context {
+	struct task_struct	*init_tsk;
+	struct file		*dump_file;
+	struct list_head	obj[NR_KSTATE_CTX_TYPES];
+};
+
+#define for_each_kstate_object(ctx, obj, type)				\
+	list_for_each_entry(obj, &ctx->obj[type], o_list)
+#define for_each_kstate_object_safe(ctx, obj, tmp, type)		\
+	list_for_each_entry_safe(obj, tmp, &ctx->obj[type], o_list)
+struct kstate_object *find_kstate_obj_by_ptr(struct kstate_context *ctx, const void *ptr, enum kstate_context_obj_type type);
+struct kstate_object *find_kstate_obj_by_ref(struct kstate_context *ctx, kstate_ref_t *ref, enum kstate_context_obj_type type);
+struct kstate_object *find_kstate_obj_by_id(struct kstate_context *ctx, kstate_ref_t *ref, enum kstate_context_obj_type type);
+
+int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_context_obj_type type);
+int kstate_restore_object(struct kstate_context *ctx, void *p, enum kstate_context_obj_type type, kstate_ref_t *ref);
+
+struct kstate_context *kstate_context_create(struct task_struct *tsk, struct file *file);
+void kstate_context_destroy(struct kstate_context *ctx);
+
+int kstate_pread(struct kstate_context *ctx, void *buf, unsigned int count, kstate_pos_t pos);
+int kstate_write(struct kstate_context *ctx, const void *buf, unsigned int count);
+
+void *kstate_prepare_image(__u32 type, unsigned int len);
+void *kstate_read_image(struct kstate_context *ctx, kstate_ref_t *ref, __u32 type, unsigned int len);
+int kstate_write_image(struct kstate_context *ctx, void *i, unsigned int len, struct kstate_object *obj);
+
+int kstate_collect_all_task_struct(struct kstate_context *ctx);
+int kstate_dump_all_task_struct(struct kstate_context *ctx);
+int kstate_restore_task_struct(struct kstate_context *ctx, kstate_ref_t *ref);
+
+int kstate_collect_all_mm_struct(struct kstate_context *ctx);
+int kstate_dump_all_mm_struct(struct kstate_context *ctx);
+int kstate_restore_mm_struct(struct kstate_context *ctx, kstate_ref_t *ref, unsigned int *len);
+int kstate_restore_vma(struct kstate_context *ctx, kstate_pos_t pos);
+
+int kstate_collect_all_file(struct kstate_context *ctx);
+int kstate_dump_all_file(struct kstate_context *ctx);
+int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref);
+
+#if 0
+extern const __u32 kstate_kernel_arch;
+int kstate_arch_check_image_header(struct kstate_image_header *i);
+
+__u32 kstate_task_struct_arch(struct task_struct *tsk);
+int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *i);
+
+unsigned int kstate_arch_len_task_struct(struct task_struct *tsk);
+int kstate_arch_check_task_struct(struct task_struct *tsk);
+int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct *tsk, void *arch_i);
+int kstate_arch_restore_task_struct(struct task_struct *tsk, struct kstate_image_task_struct *i);
+
+unsigned int kstate_arch_len_mm_struct(struct mm_struct *mm);
+int kstate_arch_check_mm_struct(struct mm_struct *mm);
+int kstate_arch_dump_mm_struct(struct kstate_context *ctx, struct mm_struct *mm, void *arch_i);
+int kstate_arch_restore_mm_struct(struct kstate_context *ctx, struct kstate_image_mm_struct *i);
+#else
+#define kstate_kernel_arch 0
+
+static inline int kstate_arch_check_image_header(struct kstate_image_header *i)
+{
+	return -ENOSYS;
+}
+
+static inline __u32 kstate_task_struct_arch(struct task_struct *tsk)
+{
+	return 0;
+}
+
+static inline int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *i)
+{
+	return -ENOSYS;
+}
+
+static inline unsigned int kstate_arch_len_task_struct(struct task_struct *tsk)
+{
+	return 0;
+}
+
+static inline int kstate_arch_check_task_struct(struct task_struct *tsk)
+{
+	return -ENOSYS;
+}
+
+static inline int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct *tsk, void *arch_i)
+{
+	return -ENOSYS;
+}
+
+static inline int kstate_arch_restore_task_struct(struct task_struct *tsk, struct kstate_image_task_struct *i)
+{
+	return -ENOSYS;
+}
+
+static inline unsigned int kstate_arch_len_mm_struct(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline int kstate_arch_check_mm_struct(struct mm_struct *mm)
+{
+	return -ENOSYS;
+}
+
+static inline int kstate_arch_dump_mm_struct(struct kstate_context *ctx, struct mm_struct *mm, void *arch_i)
+{
+	return -ENOSYS;
+}
+
+static inline int kstate_arch_restore_mm_struct(struct kstate_context *ctx, struct kstate_image_mm_struct *i)
+{
+	return -ENOSYS;
+}
+#endif
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b3b61a6..96c206b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -16,6 +16,7 @@
 
 struct mempolicy;
 struct anon_vma;
+struct kstate_context;
 struct file_ra_state;
 struct user_struct;
 struct writeback_control;
@@ -220,6 +221,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct vm_area_struct *vma, struct kstate_context *ctx);
+#endif
 };
 
 struct mmu_gather;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3052084..eddd210 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -753,6 +753,9 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 asmlinkage long sys_pipe2(int __user *, int);
 asmlinkage long sys_pipe(int __user *);
 
+asmlinkage long sys_checkpoint(pid_t pid, int fd, int flags);
+asmlinkage long sys_restart(int fd, int flags);
+
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..bc3b7cb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -608,6 +608,8 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+source "kernel/kstate/Kconfig"
+
 config MM_OWNER
 	bool
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 705ad3d..9e0d9e9 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -56,6 +56,7 @@ obj-$(CONFIG_FREEZER) += power/
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
+obj-$(CONFIG_CHECKPOINT) += kstate/
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
diff --git a/kernel/kstate/Kconfig b/kernel/kstate/Kconfig
new file mode 100644
index 0000000..6155043
--- /dev/null
+++ b/kernel/kstate/Kconfig
@@ -0,0 +1,7 @@
+config CHECKPOINT
+	bool "Container checkpoint/restart"
+	select FREEZER
+	help
+	  Container checkpoint/restart.
+
+	  Say N.
diff --git a/kernel/kstate/Makefile b/kernel/kstate/Makefile
new file mode 100644
index 0000000..eacd3cf
--- /dev/null
+++ b/kernel/kstate/Makefile
@@ -0,0 +1,8 @@
+obj-$(CONFIG_CHECKPOINT) += kstate.o
+kstate-y := cpt-sys.o rst-sys.o
+kstate-y += kstate-context.o
+kstate-y += kstate-file.o
+kstate-y += kstate-image.o
+kstate-y += kstate-mm.o
+kstate-y += kstate-object.o
+kstate-y += kstate-task.o
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
new file mode 100644
index 0000000..6bc1d0a
--- /dev/null
+++ b/kernel/kstate/cpt-sys.c
@@ -0,0 +1,196 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+/* checkpoint(2) */
+#include <linux/capability.h>
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/fs.h>
+#include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/utsname.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+/* 'tsk' is child of 'parent' in some generation. */
+static int child_of(struct task_struct *parent, struct task_struct *tsk)
+{
+	struct task_struct *tmp = tsk;
+
+	while (tmp != &init_task) {
+		if (tmp == parent)
+			return 1;
+		tmp = tmp->real_parent;
+	}
+	/* In case 'parent' is 'init_task'. */
+	return tmp == parent;
+}
+
+static int freeze_tasks(struct task_struct *init_tsk)
+{
+	struct task_struct *tmp, *tsk;
+
+	read_lock(&tasklist_lock);
+	do_each_thread(tmp, tsk) {
+		if (child_of(init_tsk, tsk)) {
+			if (!freeze_task(tsk, 1)) {
+				pr_err("%s: freezing '%s' failed\n", __func__, tsk->comm);
+				read_unlock(&tasklist_lock);
+				return -EBUSY;
+			}
+		}
+	} while_each_thread(tmp, tsk);
+	read_unlock(&tasklist_lock);
+	return 0;
+}
+
+static void thaw_tasks(struct task_struct *init_tsk)
+{
+	struct task_struct *tmp, *tsk;
+
+	read_lock(&tasklist_lock);
+	do_each_thread(tmp, tsk) {
+		if (child_of(init_tsk, tsk))
+			thaw_process(tsk);
+	} while_each_thread(tmp, tsk);
+	read_unlock(&tasklist_lock);
+}
+
+static int kstate_collect(struct kstate_context *ctx)
+{
+	int rv;
+
+	rv = kstate_collect_all_task_struct(ctx);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_all_mm_struct(ctx);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_all_file(ctx);
+	if (rv < 0)
+		return rv;
+	return 0;
+}
+
+static void kstate_assign_object_ids(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	kstate_id_t id;
+	int type;
+
+	/* 0 means 'undefined'. */
+	id = 1;
+	for (type = 0; type < NR_KSTATE_CTX_TYPES; type++) {
+		for_each_kstate_object(ctx, obj, type) {
+			obj->o_ref.id = id;
+			id++;
+		}
+	}
+}
+
+static int kstate_dump_image_header(struct kstate_context *ctx)
+{
+	struct kstate_image_header hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	memcpy(hdr.image_magic, KSTATE_IMAGE_MAGIC, 8);
+	hdr.image_version = cpu_to_le32(KSTATE_IMAGE_VERSION);
+	hdr.kernel_arch = cpu_to_le32(kstate_kernel_arch);
+	strlcpy((char *)&hdr.uts_release, (const char *)init_uts_ns.name.release, sizeof(hdr.uts_release));
+	return kstate_write(ctx, &hdr, sizeof(hdr));
+}
+
+static int kstate_dump_terminator(struct kstate_context *ctx)
+{
+	struct kstate_object_header hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.obj_type = KSTATE_OBJ_TERMINATOR;
+	hdr.obj_len = sizeof(hdr);
+	return kstate_write(ctx, &hdr, sizeof(hdr));
+}
+
+static int kstate_dump(struct kstate_context *ctx)
+{
+	int rv;
+
+	rv = kstate_dump_image_header(ctx);
+	if (rv < 0)
+		return rv;
+	rv = kstate_dump_all_file(ctx);
+	if (rv < 0)
+		return rv;
+	rv = kstate_dump_all_mm_struct(ctx);
+	if (rv < 0)
+		return rv;
+	rv = kstate_dump_all_task_struct(ctx);
+	if (rv < 0)
+		return rv;
+	return kstate_dump_terminator(ctx);
+}
+
+SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, int, flags)
+{
+	struct kstate_context *ctx;
+	struct file *file;
+	struct task_struct *init_tsk = NULL, *tsk;
+	int rv = 0;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+
+	/* Determine root of hierarchy to be checkpointed. */
+	rcu_read_lock();
+	tsk = find_task_by_vpid(pid);
+	if (tsk) {
+		struct nsproxy *nsproxy;
+
+		nsproxy = task_nsproxy(tsk);
+		if (nsproxy) {
+			init_tsk = nsproxy->pid_ns->child_reaper;
+			if (init_tsk != tsk)
+				init_tsk = NULL;
+		} else
+			init_tsk = NULL;
+		if (init_tsk)
+			get_task_struct(init_tsk);
+	}
+	rcu_read_unlock();
+	if (!init_tsk) {
+		rv = -ESRCH;
+		goto out_no_init_tsk;
+	}
+
+	ctx = kstate_context_create(init_tsk, file);
+	if (!ctx) {
+		rv = -ENOMEM;
+		goto out_ctx_create;
+	}
+
+	rv = freeze_tasks(init_tsk);
+	if (rv < 0)
+		goto out_freeze;
+	rv = kstate_collect(ctx);
+	if (rv < 0)
+		goto out_collect;
+	kstate_assign_object_ids(ctx);
+	rv = kstate_dump(ctx);
+
+out_collect:
+	/* FIXME: kill_tasks() */
+	thaw_tasks(init_tsk);
+out_freeze:
+	kstate_context_destroy(ctx);
+out_ctx_create:
+	put_task_struct(init_tsk);
+out_no_init_tsk:
+	fput(file);
+	return rv;
+}
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
new file mode 100644
index 0000000..85d1514
--- /dev/null
+++ b/kernel/kstate/kstate-context.c
@@ -0,0 +1,49 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+
+#include <linux/kstate.h>
+
+/*
+ * During checkpoint ->init_tsk is root of process hierarchy.
+ * During restart ->init_tsk is task which does restart(2).
+ */
+struct kstate_context *kstate_context_create(struct task_struct *tsk, struct file *file)
+{
+	struct kstate_context *ctx;
+
+	ctx = kmalloc(sizeof(struct kstate_context), GFP_KERNEL);
+	if (ctx) {
+		int type;
+
+		ctx->init_tsk = tsk;
+		ctx->dump_file = file;
+		for (type = 0; type < NR_KSTATE_CTX_TYPES; type++)
+			INIT_LIST_HEAD(&ctx->obj[type]);
+	}
+	return ctx;
+}
+
+void kstate_context_destroy(struct kstate_context *ctx)
+{
+	struct kstate_object *obj, *tmp;
+
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_FILE) {
+		fput((struct file *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_MM_STRUCT) {
+		mmput((struct mm_struct *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_TASK_STRUCT) {
+		put_task_struct((struct task_struct *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
+	kfree(ctx);
+}
diff --git a/kernel/kstate/kstate-file.c b/kernel/kstate/kstate-file.c
new file mode 100644
index 0000000..8f678cd
--- /dev/null
+++ b/kernel/kstate/kstate-file.c
@@ -0,0 +1,204 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/stat.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_file(struct file *file)
+{
+	if (!file->f_op) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (file->f_op && !file->f_op->checkpoint) {
+		WARN(1, "file %pS isn't checkpointable\n", file->f_op);
+		return -EINVAL;
+	}
+	if (d_unlinked(file->f_path.dentry)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#ifdef CONFIG_SECURITY
+	if (file->f_security) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+#ifdef CONFIG_EPOLL
+	spin_lock(&file->f_lock);
+	if (!list_empty(&file->f_ep_links)) {
+		spin_unlock(&file->f_lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	spin_unlock(&file->f_lock);
+#endif
+	return 0;
+}
+
+static int collect_file(struct kstate_context *ctx, struct file *file)
+{
+	int rv;
+
+	rv = check_file(file);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, file, KSTATE_CTX_FILE);
+	pr_debug("collect file %p: rv %d\n", file, rv);
+	return rv;
+}
+
+int kstate_collect_all_file(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_MM_STRUCT) {
+		struct mm_struct *mm = obj->o_obj;
+		struct vm_area_struct *vma;
+
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (vma->vm_file) {
+				rv = collect_file(ctx, vma->vm_file);
+				if (rv < 0)
+					return rv;
+			}
+		}
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_FILE) {
+		struct file *file = obj->o_obj;
+		unsigned long cnt = atomic_long_read(&file->f_count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("file %p/%pS has external references %lu:%lu\n", file, file->f_op, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+int generic_file_checkpoint(struct file *file, struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	struct kstate_image_file *i;
+	struct kstat stat;
+	char *buf, *name;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_FILE, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	rv = vfs_getattr(file->f_path.mnt, file->f_path.dentry, &stat);
+	if (rv < 0)
+		goto out_free_image;
+	i->i_mode = stat.mode;
+	i->f_flags = file->f_flags;
+	/* Assume seeking has no side effects: saving ->f_pos is enough. */
+	i->f_pos = file->f_pos;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	name = d_path(&file->f_path, buf, PAGE_SIZE);
+	if (IS_ERR(name)) {
+		rv = PTR_ERR(name);
+		goto out_free_buf;
+	}
+	i->name_len = buf + PAGE_SIZE - name - 1;
+	i->hdr.obj_len += i->name_len + 1;
+
+	obj = find_kstate_obj_by_ptr(ctx, file, KSTATE_CTX_FILE);
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	if (rv == 0)
+		rv = kstate_write(ctx, name, i->name_len);
+	if (rv == 0)
+		rv = kstate_write(ctx, &rv, 1);	/* rv is 0 here, so its first byte is the NUL */
+	pr_debug("dump file %p: name_len %u, '%.*s', ->f_op %pS\n", file, i->name_len, i->name_len, name, file->f_op);
+
+out_free_buf:
+	kfree(buf);
+out_free_image:
+	kfree(i);
+	return rv;
+}
+EXPORT_SYMBOL_GPL(generic_file_checkpoint);
+
+static int dump_file(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct file *file = obj->o_obj;
+	int rv;
+
+	rv = file->f_op->checkpoint(file, ctx);
+	pr_debug("dump file %p: ref {%llu, %u}, rv %d\n", file, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_file(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_FILE) {
+		rv = dump_file(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_file *i;
+	struct file *file;
+	char *name;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_FILE, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+	if (i->name_len >= i->hdr.obj_len - sizeof(*i)) {	/* need name + NUL */
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	name = (char *)(i + 1);
+	if (name[i->name_len] != '\0') {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	file = filp_open(name, i->f_flags, 0);
+	if (IS_ERR(file)) {
+		rv = PTR_ERR(file);
+		goto out_free_image;
+	}
+	if (file->f_dentry->d_inode->i_mode != i->i_mode) {
+		rv = -EINVAL;
+		goto out_fput;
+	}
+	/* Assume seeking has no side effects: restoring ->f_pos is enough. */
+	if (vfs_llseek(file, i->f_pos, SEEK_SET) != i->f_pos) {
+		rv = -EINVAL;
+		goto out_fput;
+	}
+
+	rv = kstate_restore_object(ctx, file, KSTATE_CTX_FILE, ref);
+	if (rv < 0)
+		fput(file);
+	pr_debug("restore file %p: ref {%llu, %u}, rv %d: '%s'\n", file, (unsigned long long)ref->pos, ref->id, rv, name);
+	kfree(i);
+	return rv;
+
+out_fput:
+	fput(file);
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
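
kstate_restore_file() reads name[i->name_len], so an object image has to hold
the fixed part, name_len bytes of name and one terminating NUL. A minimal
userspace sketch of such a bounds check follows. name_is_valid() and its
parameters are hypothetical names used only for illustration and are not part
of this patch; the comparison is written with subtraction so that a very large
name_len cannot wrap it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Return 1 if an object of obj_len bytes, whose fixed part takes fixed_len
 * bytes, has room for a name of name_len bytes plus its trailing NUL, and
 * that NUL is actually present. Return 0 otherwise.
 */
static int name_is_valid(const unsigned char *obj, uint32_t obj_len,
			 uint32_t fixed_len, uint32_t name_len)
{
	if (obj_len < fixed_len)
		return 0;	/* object shorter than its fixed part */
	if (name_len >= obj_len - fixed_len)
		return 0;	/* no room for name plus NUL */
	return obj[fixed_len + name_len] == '\0';
}
```

With a 4-byte fixed part followed by "abc" and a NUL, only obj_len >= 8 and
name_len == 3 pass; shorter objects or a name_len that points at a non-NUL
byte are rejected.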
diff --git a/kernel/kstate/kstate-image.c b/kernel/kstate/kstate-image.c
new file mode 100644
index 0000000..b04cafc
--- /dev/null
+++ b/kernel/kstate/kstate-image.c
@@ -0,0 +1,116 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <asm/uaccess.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+int kstate_pread(struct kstate_context *ctx, void *buf, unsigned int count, kstate_pos_t pos)
+{
+	struct file *file = ctx->dump_file;
+	mm_segment_t old_fs;
+	ssize_t rv;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	BUILD_BUG_ON(sizeof(kstate_pos_t) != sizeof(loff_t));
+	rv = vfs_read(file, (char __user *)buf, count, (loff_t *)&pos);
+	set_fs(old_fs);
+	if (rv != count)
+		return (rv < 0) ? rv : -EIO;
+	return 0;
+}
+
+int kstate_write(struct kstate_context *ctx, const void *buf, unsigned int count)
+{
+	struct file *file = ctx->dump_file;
+	mm_segment_t old_fs;
+	ssize_t rv;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+write_more:
+	rv = vfs_write(file, (const char __user *)buf, count, &file->f_pos);
+	if (rv > 0 && rv < count) {
+		buf += rv;
+		count -= rv;
+		goto write_more;
+	}
+	set_fs(old_fs);
+	return (rv < 0) ? rv : (rv == 0 && count) ? -EIO : 0;
+}
+
+void *kstate_prepare_image(__u32 type, unsigned int len)
+{
+	void *p;
+
+	p = kzalloc(len, GFP_KERNEL);
+	if (p) {
+		/* Every object image must start with a header. */
+		struct kstate_object_header *hdr = p;
+
+		hdr->obj_type = type;
+		hdr->obj_len = len;
+		hdr->obj_id = 0;
+	}
+	return p;
+}
+
+void *kstate_read_image(struct kstate_context *ctx, kstate_ref_t *ref, __u32 type, unsigned int len)
+{
+	struct kstate_object_header hdr;
+	void *i;
+	int rv;
+
+	/* The image header is not a restorable object. */
+	if (ref->pos < sizeof(struct kstate_image_header))
+		return ERR_PTR(-EINVAL);
+
+	rv = kstate_pread(ctx, &hdr, sizeof(hdr), ref->pos);
+	if (rv < 0)
+		return ERR_PTR(rv);
+
+	if (hdr.obj_type != type) {
+		pr_debug("%s: object {%u, %u, %u} at %llu of wrong type, expected {%u, >=%u, %u}\n",
+			 __func__,
+			 hdr.obj_type, hdr.obj_len, hdr.obj_id,
+			 (unsigned long long)ref->pos,
+			 type, len, ref->id);
+		return ERR_PTR(-EINVAL);
+	}
+	if (hdr.obj_len < sizeof(hdr) || hdr.obj_len < len) {
+		pr_debug("%s: object {%u, %u, %u} at %llu too small, expected {%u, >=%u, %u}\n",
+			 __func__,
+			 hdr.obj_type, hdr.obj_len, hdr.obj_id,
+			 (unsigned long long)ref->pos,
+			 type, len, ref->id);
+		return ERR_PTR(-EINVAL);
+	}
+	if (hdr.obj_id != ref->id) {
+		pr_debug("%s: object {%u, %u, %u} at %llu has incorrect id, expected {%u, >=%u, %u}\n",
+			 __func__,
+			 hdr.obj_type, hdr.obj_len, hdr.obj_id,
+			 (unsigned long long)ref->pos,
+			 type, len, ref->id);
+		return ERR_PTR(-EINVAL);
+	}
+
+	i = kzalloc(hdr.obj_len, GFP_KERNEL);
+	if (!i)
+		return ERR_PTR(-ENOMEM);
+	rv = kstate_pread(ctx, i, hdr.obj_len, ref->pos);
+	if (rv < 0) {
+		kfree(i);
+		return ERR_PTR(rv);
+	}
+	return i;
+}
+
+int kstate_write_image(struct kstate_context *ctx, void *i, unsigned int len, struct kstate_object *obj)
+{
+	/* Every object image must start with a header. */
+	((struct kstate_object_header *)i)->obj_id = obj->o_ref.id;
+	obj->o_ref.pos = ctx->dump_file->f_pos;
+	return kstate_write(ctx, i, len);
+}
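
kstate_read_image() validates the object header in three steps (type, minimum
length, id) before it allocates and reads the whole object. The same order of
checks can be sketched in userspace as below. struct obj_header, enum
hdr_status and check_obj_header() are hypothetical stand-ins; the real layouts
are defined in kstate-image.h, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kstate_object_header. */
struct obj_header {
	uint32_t obj_type;
	uint32_t obj_len;
	uint32_t obj_id;
};

enum hdr_status { HDR_OK = 0, HDR_BAD_TYPE, HDR_TOO_SMALL, HDR_BAD_ID };

/*
 * Apply the checks in the same order as kstate_read_image():
 * wrong type, then too small, then wrong id.
 */
static enum hdr_status check_obj_header(const struct obj_header *hdr,
					uint32_t want_type, uint32_t min_len,
					uint32_t want_id)
{
	if (hdr->obj_type != want_type)
		return HDR_BAD_TYPE;
	/* An object can never be shorter than its own header. */
	if (hdr->obj_len < sizeof(*hdr) || hdr->obj_len < min_len)
		return HDR_TOO_SMALL;
	if (hdr->obj_id != want_id)
		return HDR_BAD_ID;
	return HDR_OK;
}
```

Checking the length against both sizeof(header) and the caller's minimum means
later code may touch every fixed field of the image without further tests.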
diff --git a/kernel/kstate/kstate-mm.c b/kernel/kstate/kstate-mm.c
new file mode 100644
index 0000000..d3045f3
--- /dev/null
+++ b/kernel/kstate/kstate-mm.c
@@ -0,0 +1,563 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched.h>
+#include <asm/elf.h>
+#include <asm/mman.h>
+#include <asm/mmu_context.h>
+#include <asm/pgalloc.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_vma(struct vm_area_struct *vma)
+{
+	unsigned long vm_flags;
+
+	if (vma->vm_ops && !vma->vm_ops->checkpoint) {
+		WARN(1, "vma %08lx-%08lx %pS isn't checkpointable\n",
+			vma->vm_start, vma->vm_end, vma->vm_ops);
+		return -EINVAL;
+	}
+
+	vm_flags = vma->vm_flags;
+	/* Clear flags known to be safe; any flag left set is unsupported. */
+	vm_flags &= ~VM_READ;
+	vm_flags &= ~VM_WRITE;
+	vm_flags &= ~VM_EXEC;
+//	vm_flags &= ~VM_SHARED;
+	vm_flags &= ~VM_MAYREAD;
+	vm_flags &= ~VM_MAYWRITE;
+	vm_flags &= ~VM_MAYEXEC;
+//	vm_flags &= ~VM_MAYSHARE;
+	vm_flags &= ~VM_GROWSDOWN;
+//	vm_flags &= ~VM_GROWSUP;
+//	vm_flags &= ~VM_PFNMAP;
+	vm_flags &= ~VM_DENYWRITE;
+	vm_flags &= ~VM_EXECUTABLE;
+//	vm_flags &= ~VM_LOCKED;
+//	vm_flags &= ~VM_IO;
+//	vm_flags &= ~VM_SEQ_READ;
+//	vm_flags &= ~VM_RAND_READ;
+//	vm_flags &= ~VM_DONTCOPY;
+	vm_flags &= ~VM_DONTEXPAND;
+//	vm_flags &= ~VM_RESERVED;
+	vm_flags &= ~VM_ACCOUNT;
+//	vm_flags &= ~VM_NORESERVE;
+//	vm_flags &= ~VM_HUGETLB;
+//	vm_flags &= ~VM_NONLINEAR;
+//	vm_flags &= ~VM_MAPPED_COPY;
+//	vm_flags &= ~VM_INSERTPAGE;
+	vm_flags &= ~VM_ALWAYSDUMP;
+	vm_flags &= ~VM_CAN_NONLINEAR;
+//	vm_flags &= ~VM_MIXEDMAP;
+//	vm_flags &= ~VM_SAO;
+//	vm_flags &= ~VM_PFN_AT_MMAP;
+
+	if (vm_flags) {
+		WARN(1, "vma %08lx-%08lx %pS has uncheckpointable flags %08lx\n",
+			vma->vm_start, vma->vm_end, vma->vm_ops, vm_flags);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int dump_vma_pages(struct kstate_context *ctx, struct vm_area_struct *vma)
+{
+	unsigned long addr;
+	int rv;
+
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+		struct page *page;
+
+again:
+		cond_resched();
+		page = follow_page(vma, addr, FOLL_ANON|FOLL_GET);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+		if (page == ZERO_PAGE(0)) {
+			put_page(page);
+			continue;
+		}
+		if (!page) {
+			rv = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+			if (rv & VM_FAULT_ERROR) {
+				if (rv & VM_FAULT_OOM)
+					return -ENOMEM;
+				if (rv & VM_FAULT_SIGBUS)
+					return -EFAULT;
+				BUG();
+			}
+			goto again;
+		}
+
+		if (PageAnon(page) || !page_mapping(page)) {
+			struct kstate_image_page i;
+			void *data;
+
+			pr_debug("dump vma %p: addr %08lx, page %p\n",
+				 vma, addr, page);
+
+			i.hdr.obj_type = KSTATE_OBJ_PAGE;
+			i.hdr.obj_len = sizeof(i) + PAGE_SIZE;
+			i.hdr.obj_id = 0;
+
+			i.start_addr = addr;
+			i.page_size = PAGE_SIZE;
+			rv = kstate_write(ctx, &i, sizeof(i));
+			if (rv < 0) {
+				put_page(page);
+				return rv;
+			}
+
+			data = kmap(page);
+			rv = kstate_write(ctx, data, PAGE_SIZE);
+			kunmap(page);
+			if (rv < 0) {
+				put_page(page);
+				return rv;
+			}
+		}
+		put_page(page);
+	}
+	return 0;
+}
+
+static int dump_anonvma(struct kstate_context *ctx, struct vm_area_struct *vma)
+{
+	struct kstate_image_vma *i;
+	int rv;
+
+	pr_debug("dump vma %p: %08lx-%08lx %c%c%c%c vm_flags %08lx, vm_pgoff %08lx\n",
+		vma, vma->vm_start, vma->vm_end,
+		vma->vm_flags & VM_READ ? 'r' : '-',
+		vma->vm_flags & VM_WRITE ? 'w' : '-',
+		vma->vm_flags & VM_EXEC ? 'x' : '-',
+		vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
+		vma->vm_flags,
+		vma->vm_pgoff);
+
+	i = kstate_prepare_image(KSTATE_OBJ_VMA, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+	/*
+	 * A VMA gets no id of its own because it can't be shared by itself,
+	 * only its mm_struct can. Assign a fixed, deterministic id.
+	 */
+	i->hdr.obj_id = 0;
+
+	i->vm_start = vma->vm_start;
+	i->vm_end = vma->vm_end;
+	i->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	i->vm_flags = vma->vm_flags;
+	i->vm_pgoff = vma->vm_pgoff;
+	i->ref_vm_file = KSTATE_REF_UNDEF;
+
+	rv = kstate_write(ctx, i, sizeof(*i));
+	kfree(i);
+	if (rv < 0)
+		return rv;
+	return dump_vma_pages(ctx, vma);
+}
+
+int filemap_checkpoint(struct vm_area_struct *vma, struct kstate_context *ctx)
+{
+	struct kstate_image_vma *i;
+	struct kstate_object *tmp;
+	int rv;
+
+	pr_debug("dump vma %p: %08lx-%08lx %c%c%c%c vm_flags %08lx, vm_pgoff %08lx, vm_ops %pS\n",
+		vma, vma->vm_start, vma->vm_end,
+		vma->vm_flags & VM_READ ? 'r' : '-',
+		vma->vm_flags & VM_WRITE ? 'w' : '-',
+		vma->vm_flags & VM_EXEC ? 'x' : '-',
+		vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
+		vma->vm_flags,
+		vma->vm_pgoff,
+		vma->vm_ops);
+
+	i = kstate_prepare_image(KSTATE_OBJ_VMA, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+	/*
+	 * A VMA gets no id of its own because it can't be shared by itself,
+	 * only its mm_struct can. Assign a fixed, deterministic id.
+	 */
+	i->hdr.obj_id = 0;
+
+	i->vm_start = vma->vm_start;
+	i->vm_end = vma->vm_end;
+	i->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	i->vm_flags = vma->vm_flags;
+	i->vm_pgoff = vma->vm_pgoff;
+	tmp = find_kstate_obj_by_ptr(ctx, vma->vm_file, KSTATE_CTX_FILE);
+	i->ref_vm_file = tmp->o_ref;
+
+	rv = kstate_write(ctx, i, sizeof(*i));
+	kfree(i);
+	if (rv < 0)
+		return rv;
+	return dump_vma_pages(ctx, vma);
+}
+
+static int dump_vma(struct kstate_context *ctx, struct vm_area_struct *vma)
+{
+	if (!vma->vm_ops)
+		return dump_anonvma(ctx, vma);
+	if (vma->vm_ops->checkpoint)
+		return vma->vm_ops->checkpoint(vma, ctx);
+	BUG();
+}
+
+static int dump_all_vma(struct kstate_context *ctx, struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	int rv;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		rv = dump_vma(ctx, vma);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+static int restore_page(struct kstate_context *ctx, kstate_pos_t pos)
+{
+	struct kstate_image_page i;
+	struct page *page;
+	void *addr;
+	int rv;
+
+	rv = kstate_pread(ctx, &i, sizeof(i), pos);
+	if (rv < 0)
+		return rv;
+	if (i.hdr.obj_type != KSTATE_OBJ_PAGE)
+		return -EINVAL;
+	if (i.hdr.obj_len != sizeof(i) + PAGE_SIZE)
+		return -EINVAL;
+
+	rv = get_user_pages(current, current->mm, i.start_addr, 1, 1, 1, &page, NULL);
+	if (rv != 1)
+		return (rv < 0) ? rv : -EFAULT;
+	addr = kmap(page);
+	rv = kstate_pread(ctx, addr, PAGE_SIZE, pos + sizeof(i));
+	set_page_dirty_lock(page);
+	kunmap(page);
+	put_page(page);
+	return rv;
+}
+
+static int restore_pages(struct kstate_context *ctx, kstate_pos_t pos)
+{
+	while (1) {
+		struct kstate_object_header hdr;
+		int rv;
+
+		rv = kstate_pread(ctx, &hdr, sizeof(hdr), pos);
+		if (rv < 0)
+			return rv;
+		switch (hdr.obj_type) {
+		case KSTATE_OBJ_PAGE:
+			rv = restore_page(ctx, pos);
+			if (rv < 0)
+				return rv;
+			break;
+		default:
+			return 0;
+		}
+		pos += hdr.obj_len;
+	}
+}
+
+static int make_prot(struct kstate_image_vma *i)
+{
+	unsigned long prot = PROT_NONE;
+
+	if (i->vm_flags & VM_READ)
+		prot |= PROT_READ;
+	if (i->vm_flags & VM_WRITE)
+		prot |= PROT_WRITE;
+	if (i->vm_flags & VM_EXEC)
+		prot |= PROT_EXEC;
+	return prot;
+}
+
+static int make_flags(struct kstate_image_vma *i)
+{
+	unsigned long flags = MAP_FIXED;
+
+	flags |= MAP_PRIVATE;
+	if (kstate_ref_undefined(&i->ref_vm_file))
+		flags |= MAP_ANONYMOUS;
+
+	if (i->vm_flags & VM_GROWSDOWN)
+		flags |= MAP_GROWSDOWN;
+#ifdef MAP_GROWSUP
+	if (i->vm_flags & VM_GROWSUP)
+		flags |= MAP_GROWSUP;
+#endif
+	if (i->vm_flags & VM_EXECUTABLE)
+		flags |= MAP_EXECUTABLE;
+	if (i->vm_flags & VM_DENYWRITE)
+		flags |= MAP_DENYWRITE;
+	return flags;
+}
+
+int kstate_restore_vma(struct kstate_context *ctx, kstate_pos_t pos)
+{
+	kstate_ref_t ref = { .pos = pos, .id = 0 };
+	struct kstate_image_vma *i;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct file *file;
+	unsigned long addr, prot, flags;
+	int rv;
+
+	i = kstate_read_image(ctx, &ref, KSTATE_OBJ_VMA, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	if (kstate_ref_undefined(&i->ref_vm_file))
+		file = NULL;
+	else {
+		struct kstate_object *tmp;
+
+		tmp = find_kstate_obj_by_ref(ctx, &i->ref_vm_file, KSTATE_CTX_FILE);
+		if (!tmp) {
+			rv = kstate_restore_file(ctx, &i->ref_vm_file);
+			if (rv < 0)
+				goto out_free_image;
+			tmp = find_kstate_obj_by_ref(ctx, &i->ref_vm_file, KSTATE_CTX_FILE);
+		}
+		file = tmp->o_obj;
+	}
+
+	prot = make_prot(i);
+	flags = make_flags(i);
+	addr = do_mmap_pgoff(file, i->vm_start, i->vm_end - i->vm_start, prot, flags, i->vm_pgoff);
+	if (addr != i->vm_start) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	vma = find_vma(mm, addr);
+	if (!vma) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	if (vma->vm_start != i->vm_start || vma->vm_end != i->vm_end) {
+		pr_debug("%s: vma %08lx-%08lx should be %08lx-%08lx\n",
+			 __func__, vma->vm_start, vma->vm_end,
+			 (unsigned long)i->vm_start, (unsigned long)i->vm_end);
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	pr_debug("restore vma: %08lx-%08lx, vm_flags %08lx, pgprot %016llx, vm_pgoff 0x%lx, vm_file {%llu, %u}\n",
+		 vma->vm_start, vma->vm_end, vma->vm_flags,
+		 (unsigned long long)pgprot_val(vma->vm_page_prot),
+		 vma->vm_pgoff,
+		 (unsigned long long)i->ref_vm_file.pos, i->ref_vm_file.id);
+	if (vma->vm_flags != i->vm_flags)
+		pr_debug("restore vma: vm_flags %08lx, i->vm_flags %08lx\n",
+			 vma->vm_flags, (unsigned long)i->vm_flags);
+	if (pgprot_val(vma->vm_page_prot) != i->vm_page_prot)
+		pr_debug("restore vma: prot %016llx, i->vm_page_prot %016llx\n",
+			 (unsigned long long)pgprot_val(vma->vm_page_prot),
+			 (unsigned long long)i->vm_page_prot);
+	kfree(i);
+	return restore_pages(ctx, pos + sizeof(*i));
+
+out_free_image:
+	kfree(i);
+	return rv;
+}
+
+static int check_mm_struct(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	int rv;
+
+	down_read(&mm->mmap_sem);
+	if (mm->core_state) {
+		up_read(&mm->mmap_sem);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	up_read(&mm->mmap_sem);
+#ifdef CONFIG_AIO
+	spin_lock(&mm->ioctx_lock);
+	if (!hlist_empty(&mm->ioctx_list)) {
+		spin_unlock(&mm->ioctx_lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	spin_unlock(&mm->ioctx_lock);
+#endif
+#ifdef CONFIG_MMU_NOTIFIER
+	down_read(&mm->mmap_sem);
+	if (mm_has_notifiers(mm)) {
+		up_read(&mm->mmap_sem);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	up_read(&mm->mmap_sem);
+#endif
+	rv = kstate_arch_check_mm_struct(mm);
+	if (rv < 0)
+		return rv;
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		rv = check_vma(vma);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+static int collect_mm_struct(struct kstate_context *ctx, struct mm_struct *mm)
+{
+	int rv;
+
+	rv = check_mm_struct(mm);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, mm, KSTATE_CTX_MM_STRUCT);
+	pr_debug("collect mm_struct %p: rv %d\n", mm, rv);
+	return rv;
+}
+
+int kstate_collect_all_mm_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = collect_mm_struct(ctx, tsk->mm);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_MM_STRUCT) {
+		struct mm_struct *mm = obj->o_obj;
+		unsigned int cnt = atomic_read(&mm->mm_users);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("mm_struct %p has external references %lu:%u\n", mm, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_mm_struct(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct mm_struct *mm = obj->o_obj;
+	struct kstate_image_mm_struct *i;
+	unsigned int image_len;
+	int rv;
+
+	image_len = sizeof(*i) + kstate_arch_len_mm_struct(mm);
+	i = kstate_prepare_image(KSTATE_OBJ_MM_STRUCT, image_len);
+	if (!i)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	i->def_flags = mm->def_flags;
+	i->start_code = mm->start_code;
+	i->end_code = mm->end_code;
+	i->start_data = mm->start_data;
+	i->end_data = mm->end_data;
+	i->start_brk = mm->start_brk;
+	i->brk = mm->brk;
+	i->start_stack = mm->start_stack;
+	i->arg_start = mm->arg_start;
+	i->arg_end = mm->arg_end;
+	i->env_start = mm->env_start;
+	i->env_end = mm->env_end;
+	i->flags = mm->flags;
+	BUILD_BUG_ON(sizeof(i->saved_auxv) < sizeof(mm->saved_auxv));
+	memcpy(i->saved_auxv, mm->saved_auxv, sizeof(mm->saved_auxv));
+
+	rv = kstate_arch_dump_mm_struct(ctx, mm, i + 1);
+	up_read(&mm->mmap_sem);
+	if (rv == 0)
+		rv = kstate_write_image(ctx, i, image_len, obj);
+	kfree(i);
+	pr_debug("dump mm_struct %p: ref {%llu, %u}, rv %d\n", mm, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_mm_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_MM_STRUCT) {
+		struct mm_struct *mm = obj->o_obj;
+
+		rv = dump_mm_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+		rv = dump_all_vma(ctx, mm);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_mm_struct(struct kstate_context *ctx, kstate_ref_t *ref, unsigned int *len)
+{
+	struct kstate_image_mm_struct *i;
+	struct mm_struct *mm;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_MM_STRUCT, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	mm = mm_alloc();
+	if (!mm) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	rv = init_new_context(current, mm);
+	if (rv < 0)
+		goto out_mm_put;
+
+	mm->get_unmapped_area = arch_get_unmapped_area_topdown;
+	mm->unmap_area = arch_unmap_area_topdown;
+
+	mm->def_flags = i->def_flags;
+	mm->start_code = i->start_code;
+	mm->end_code = i->end_code;
+	mm->start_data = i->start_data;
+	mm->end_data = i->end_data;
+	mm->start_brk = i->start_brk;
+	mm->brk = i->brk;
+	mm->start_stack = i->start_stack;
+	mm->arg_start = i->arg_start;
+	mm->arg_end = i->arg_end;
+	mm->env_start = i->env_start;
+	mm->env_end = i->env_end;
+	mm->flags = i->flags;
+	memcpy(mm->saved_auxv, i->saved_auxv, sizeof(mm->saved_auxv));
+
+	*len = i->hdr.obj_len;
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, mm, KSTATE_CTX_MM_STRUCT, ref);
+	if (rv < 0)
+		mmdrop(mm);
+	pr_debug("restore mm_struct %p: ref {%llu, %u}, rv %d\n", mm, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_mm_put:
+	mmdrop(mm);
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
new file mode 100644
index 0000000..f9f2f33
--- /dev/null
+++ b/kernel/kstate/kstate-object.c
@@ -0,0 +1,100 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/fs.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_context_obj_type type)
+{
+	struct kstate_object *obj;
+
+	BUG_ON(type >= NR_KSTATE_CTX_TYPES);
+
+	obj = find_kstate_obj_by_ptr(ctx, p, type);
+	if (obj) {
+		obj->o_count++;
+		return 0;
+	}
+	obj = kzalloc(sizeof(struct kstate_object), GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+	obj->o_count = 1;
+	obj->o_ref.pos = 0;	/* not yet dumped */
+	obj->o_ref.id = 0;	/* not yet assigned */
+	obj->o_obj = p;
+	list_add(&obj->o_list, &ctx->obj[type]);
+
+	switch (type) {
+	case KSTATE_CTX_FILE:
+		get_file((struct file *)obj->o_obj);
+		break;
+	case KSTATE_CTX_MM_STRUCT:
+		atomic_inc(&((struct mm_struct *)obj->o_obj)->mm_users);
+		break;
+	case KSTATE_CTX_TASK_STRUCT:
+		get_task_struct((struct task_struct *)obj->o_obj);
+		break;
+	default:
+		BUG();
+	}
+	return 0;
+}
+
+int kstate_restore_object(struct kstate_context *ctx, void *p, enum kstate_context_obj_type type, kstate_ref_t *ref)
+{
+	struct kstate_object *obj;
+
+	obj = kzalloc(sizeof(struct kstate_object), GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+	/* ->o_count isn't used on restart. */
+	obj->o_ref = *ref;
+	obj->o_obj = p;
+	list_add(&obj->o_list, &ctx->obj[type]);
+
+	if (type == KSTATE_CTX_TASK_STRUCT)
+		get_task_struct((struct task_struct *)obj->o_obj);
+	return 0;
+}
+
+struct kstate_object *find_kstate_obj_by_ptr(struct kstate_context *ctx, const void *ptr, enum kstate_context_obj_type type)
+{
+	struct kstate_object *obj;
+
+	BUG_ON(type >= NR_KSTATE_CTX_TYPES);
+
+	for_each_kstate_object(ctx, obj, type) {
+		if (obj->o_obj == ptr)
+			return obj;
+	}
+	return NULL;
+}
+
+struct kstate_object *find_kstate_obj_by_ref(struct kstate_context *ctx, kstate_ref_t *ref, enum kstate_context_obj_type type)
+{
+	struct kstate_object *obj;
+
+	BUG_ON(type >= NR_KSTATE_CTX_TYPES);
+
+	for_each_kstate_object(ctx, obj, type) {
+		if (obj->o_ref.pos == ref->pos && obj->o_ref.id == ref->id)
+			return obj;
+	}
+	return NULL;
+}
+
+struct kstate_object *find_kstate_obj_by_id(struct kstate_context *ctx, kstate_ref_t *ref, enum kstate_context_obj_type type)
+{
+	struct kstate_object *obj;
+
+	BUG_ON(type >= NR_KSTATE_CTX_TYPES);
+
+	for_each_kstate_object(ctx, obj, type) {
+		if (obj->o_ref.id == ref->id)
+			return obj;
+	}
+	return NULL;
+}
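
The collect stage counts how many times each object is reached and then
compares that count with the object's real reference count. Because
kstate_collect_object() takes one reference of its own, an object without
outside users satisfies refcount == o_count + 1. A minimal userspace sketch of
that bookkeeping follows. struct tracked, struct registry, collect() and
has_external_refs() are hypothetical names, and a fixed array replaces the
kernel list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16

/* Hypothetical stand-in for struct kstate_object. */
struct tracked {
	const void *ptr;	/* collected object */
	unsigned long count;	/* times reached during collection */
};

struct registry {
	struct tracked objs[MAX_OBJS];
	size_t nr;
};

/* Collect ptr: bump the count if already known, else add a new entry. */
static int collect(struct registry *r, const void *ptr)
{
	size_t n;

	for (n = 0; n < r->nr; n++) {
		if (r->objs[n].ptr == ptr) {
			r->objs[n].count++;
			return 0;
		}
	}
	if (r->nr == MAX_OBJS)
		return -1;	/* registry full */
	r->objs[r->nr].ptr = ptr;
	r->objs[r->nr].count = 1;
	r->nr++;
	return 0;
}

/*
 * The registry holds one reference itself, so an object used only by the
 * collected set has refcount == count + 1. Anything else means a user
 * outside the set.
 */
static int has_external_refs(const struct tracked *t, unsigned long refcount)
{
	return t->count + 1 != refcount;
}
```

An object collected twice passes with a reference count of 3 and is rejected
with 4, which is the condition the "has external references" messages report.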
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
new file mode 100644
index 0000000..aec97c2
--- /dev/null
+++ b/kernel/kstate/kstate-task.c
@@ -0,0 +1,287 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/kthread.h>
+#include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
+#include <linux/sched.h>
+#include <asm/mmu_context.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_task_struct(struct task_struct *tsk)
+{
+	read_lock_irq(&tasklist_lock);
+	if (!list_empty(&tsk->children)) {
+		read_unlock_irq(&tasklist_lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (!list_empty(&tsk->thread_group)) {
+		read_unlock_irq(&tasklist_lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	/* ptrace */
+	if (tsk->parent != tsk->real_parent) {
+		read_unlock_irq(&tasklist_lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	read_unlock_irq(&tasklist_lock);
+	if (tsk->exit_state) {
+		WARN(1, "exit_state %08x\n", tsk->exit_state);
+		return -EINVAL;
+	}
+	if (!tsk->mm || !tsk->active_mm || tsk->mm != tsk->active_mm) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#ifdef CONFIG_MM_OWNER
+	if (tsk->mm->owner != tsk) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	if (!tsk->nsproxy) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (!tsk->sighand) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (!tsk->signal) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	return kstate_arch_check_task_struct(tsk);
+}
+
+static int collect_task_struct(struct kstate_context *ctx, struct task_struct *tsk)
+{
+	int rv;
+
+	/* task_struct is never shared. */
+	BUG_ON(find_kstate_obj_by_ptr(ctx, tsk, KSTATE_CTX_TASK_STRUCT));
+
+	rv = check_task_struct(tsk);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, tsk, KSTATE_CTX_TASK_STRUCT);
+	pr_debug("collect task_struct %p: rv %d, '%s'\n", tsk, rv, tsk->comm);
+	return rv;
+}
+
+int kstate_collect_all_task_struct(struct kstate_context *ctx)
+{
+	/* Seed task list. */
+	return collect_task_struct(ctx, ctx->init_tsk);
+}
+
+static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct task_struct *tsk = obj->o_obj;
+	struct kstate_image_task_struct *i;
+	unsigned int image_len;
+	struct kstate_object *tmp;
+	int rv;
+
+	image_len = sizeof(*i) + kstate_arch_len_task_struct(tsk);
+	i = kstate_prepare_image(KSTATE_OBJ_TASK_STRUCT, image_len);
+	if (!i)
+		return -ENOMEM;
+
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->mm, KSTATE_CTX_MM_STRUCT);
+	i->ref_mm = tmp->o_ref;
+
+	BUILD_BUG_ON(sizeof(i->comm) != sizeof(tsk->comm));
+	strlcpy((char *)i->comm, (const char *)tsk->comm, sizeof(i->comm));
+
+	i->tsk_arch = kstate_task_struct_arch(tsk);
+
+	rv = kstate_arch_dump_task_struct(ctx, tsk, i + 1);
+	if (rv == 0)
+		rv = kstate_write_image(ctx, i, image_len, obj);
+	kfree(i);
+	pr_debug("dump task_struct %p: ref {%llu, %u}, rv %d: '%s'\n", tsk, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv, tsk->comm);
+	return rv;
+}
+
+int kstate_dump_all_task_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		rv = dump_task_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+static int task_restore_all_vma(struct kstate_context *ctx, kstate_pos_t pos)
+{
+	while (1) {
+		struct kstate_object_header hdr;
+		int rv;
+
+		rv = kstate_pread(ctx, &hdr, sizeof(hdr), pos);
+		if (rv < 0)
+			return rv;
+		if (hdr.obj_len < sizeof(hdr))
+			return -EINVAL;
+
+		switch (hdr.obj_type) {
+		case KSTATE_OBJ_VMA:
+			down_write(&current->mm->mmap_sem);
+			rv = kstate_restore_vma(ctx, pos);
+			up_write(&current->mm->mmap_sem);
+			if (rv < 0)
+				return rv;
+			break;
+		case KSTATE_OBJ_PAGE:
+			break;
+		default:
+			return 0;
+		}
+		pos += hdr.obj_len;
+	}
+}
+
+static int restore_mm(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct task_struct *tsk = current;
+	struct mm_struct *mm, *prev_mm;
+	unsigned int len = 0;
+	int restore_vma;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_MM_STRUCT);
+	if (!tmp) {
+		rv = kstate_restore_mm_struct(ctx, ref, &len);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_MM_STRUCT);
+		restore_vma = 1;
+	} else
+		restore_vma = 0;
+	mm = tmp->o_obj;
+
+	atomic_inc(&mm->mm_users);
+	task_lock(tsk);
+	prev_mm = tsk->active_mm;
+	tsk->mm = tsk->active_mm = mm;
+	activate_mm(prev_mm, mm);
+	tsk->flags &= ~PF_KTHREAD;
+	task_unlock(tsk);
+
+	if (restore_vma)
+		return task_restore_all_vma(ctx, ref->pos + len);
+	return 0;
+}
+
+struct task_struct_restore_context {
+	struct kstate_context *ctx;
+	struct kstate_image_task_struct *i;
+	struct completion c;
+	int rv;
+};
+
+/*
+ * Restore runs in the context of the new task: drop the pieces it does not
+ * need, then read and create the needed ones or take references to those
+ * already created. Registers are restored by the task which called restart(2).
+ */
+static int task_struct_restorer(void *_tsk_ctx)
+{
+	struct task_struct_restore_context *tsk_ctx = _tsk_ctx;
+	struct kstate_image_task_struct *i = tsk_ctx->i;
+	struct kstate_context *ctx = tsk_ctx->ctx;
+	/* In the name of symmetry. */
+	struct task_struct *tsk = current, *real_parent;
+	int rv;
+
+	pr_debug("%s: ENTER tsk %p/%s\n", __func__, tsk, tsk->comm);
+
+	write_lock_irq(&tasklist_lock);
+	real_parent = ctx->init_tsk->nsproxy->pid_ns->child_reaper;
+	tsk->real_parent = tsk->parent = real_parent;
+	list_move_tail(&tsk->sibling, &tsk->real_parent->children);
+	write_unlock_irq(&tasklist_lock);
+
+	rv = restore_mm(ctx, &i->ref_mm);
+	if (rv < 0)
+		goto out;
+
+out:
+	tsk_ctx->rv = rv;
+	complete(&tsk_ctx->c);
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	schedule();
+	pr_debug("%s: return %d\n", __func__, rv);
+	return rv;
+}
+
+int kstate_restore_task_struct(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_task_struct *i;
+	struct task_struct_restore_context tsk_ctx;
+	struct task_struct *tsk;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_TASK_STRUCT, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	rv = kstate_arch_check_image_task_struct(i);
+	if (rv < 0)
+		goto out_free_image;
+
+	tsk_ctx.ctx = ctx;
+	tsk_ctx.i = i;
+	init_completion(&tsk_ctx.c);
+	/* Restore ->comm for free. */
+	tsk = kthread_run(task_struct_restorer, &tsk_ctx, "%.*s", (int)sizeof(i->comm) - 1, i->comm);
+	if (IS_ERR(tsk)) {
+		rv = PTR_ERR(tsk);
+		goto out_free_image;
+	}
+	wait_for_completion(&tsk_ctx.c);
+	wait_task_inactive(tsk, 0);
+	if (tsk_ctx.rv < 0) {
+		rv = tsk_ctx.rv;
+		goto out_kill;
+	}
+
+	rv = kstate_arch_restore_task_struct(tsk, i);
+	if (rv < 0)
+		goto out_kill;
+
+#ifdef CONFIG_PREEMPT
+	task_thread_info(tsk)->preempt_count--;
+#endif
+
+	rv = kstate_restore_object(ctx, tsk, KSTATE_CTX_TASK_STRUCT, ref);
+	if (rv < 0)
+		goto out_kill;
+
+	kfree(i);
+
+	pr_debug("restore task_struct %p: ref {%llu, %u}, rv %d: '%s'\n", tsk, (unsigned long long)ref->pos, ref->id, rv, tsk->comm);
+	return 0;
+
+out_kill:
+	send_sig(SIGKILL, tsk, 1);
+	spin_lock_irq(&tsk->sighand->siglock);
+	sigfillset(&tsk->blocked);
+	sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
+	set_tsk_thread_flag(tsk, TIF_SIGPENDING);
+	spin_unlock_irq(&tsk->sighand->siglock);
+	wake_up_process(tsk);
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
diff --git a/kernel/kstate/rst-sys.c b/kernel/kstate/rst-sys.c
new file mode 100644
index 0000000..4c88716
--- /dev/null
+++ b/kernel/kstate/rst-sys.c
@@ -0,0 +1,91 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+/* restart(2) */
+#include <linux/capability.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int kstate_check_image_header(struct kstate_context *ctx)
+{
+	struct kstate_image_header hdr;
+	int rv;
+
+	rv = kstate_pread(ctx, &hdr, sizeof(hdr), 0);
+	if (rv < 0)
+		return rv;
+	pr_debug("%s: image version %u, arch %u\n", __func__, hdr.image_version, hdr.kernel_arch);
+	if (memcmp(hdr.image_magic, KSTATE_IMAGE_MAGIC, 8) != 0)
+		return -EINVAL;
+	if (hdr.image_version != cpu_to_le32(KSTATE_IMAGE_VERSION))
+		return -EINVAL;
+	return kstate_arch_check_image_header(&hdr);
+}
+
+static int kstate_restart(struct kstate_context *ctx)
+{
+	kstate_pos_t pos;
+	struct kstate_object *obj;
+	int rv;
+
+	rv = kstate_check_image_header(ctx);
+	if (rv < 0)
+		return rv;
+	pos = sizeof(struct kstate_image_header);
+	do {
+		struct kstate_object_header hdr;
+		kstate_ref_t ref;
+
+		rv = kstate_pread(ctx, &hdr, sizeof(hdr), pos);
+		if (rv < 0)
+			return rv;
+		if (hdr.obj_type == KSTATE_OBJ_TERMINATOR)
+			break;
+
+		ref.pos = pos;
+		ref.id = hdr.obj_id;
+		if (hdr.obj_type == KSTATE_OBJ_TASK_STRUCT) {
+			rv = kstate_restore_task_struct(ctx, &ref);
+			if (rv < 0)
+				return rv;
+		}
+		pos += hdr.obj_len;
+	} while (rv == 0);
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		pr_debug("%s: wake up task %p/%s\n", __func__, tsk, tsk->comm);
+		wake_up_process(tsk);
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE2(restart, int, fd, int, flags)
+{
+	struct kstate_context *ctx;
+	struct file *file;
+	int rv;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+	ctx = kstate_context_create(current, file);
+	if (!ctx) {
+		rv = -ENOMEM;
+		goto out_ctx_create;
+	}
+
+	rv = kstate_restart(ctx);
+
+	kstate_context_destroy(ctx);
+out_ctx_create:
+	fput(file);
+	return rv;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 27dad29..da4fbf6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,3 +175,6 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..ec6889d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1627,6 +1627,9 @@ EXPORT_SYMBOL(filemap_fault);
 
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
-- 
1.5.6.5



* [PATCH 19/38] C/R: multiple tasks
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (16 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 18/38] C/R: core stuff Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 20/38] C/R: i386 support Alexey Dobriyan
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Restore task hierarchy wrt ->real_parent.

The first thing a soon-to-be-restored task does after birth is find its
->real_parent and glue itself to the parent's lists.

For this to work, the parent is dumped first and restored first, so that by
the time a child starts restoration, the parent is already up and running as
a task_struct, and we get a cheap loop-prevention check.
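
As a rough illustration of the ordering argument (a userspace sketch, not
code from this series): tasks are collected breadth-first by appending
children at the tail of the list being iterated, and restore then rejects
any parent reference that does not resolve to an already restored task.
struct toy_task, toy_collect() and toy_restore() are hypothetical names
made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TASKS 16

/* Hypothetical stand-in for task_struct: only the parent link matters. */
struct toy_task {
	int parent;	/* index of the real parent, -1 for the root */
};

/*
 * Collect tasks starting from root.  Children are appended at the tail of
 * order[] while order[] is being walked, so every parent lands before its
 * children.  order[] must have room for n entries.  Returns the number of
 * collected tasks.
 */
static size_t toy_collect(const struct toy_task *t, size_t n, int root, int *order)
{
	size_t head, tail = 0;

	assert(n <= TOY_MAX_TASKS);
	order[tail++] = root;
	for (head = 0; head < tail; head++) {
		size_t c;

		for (c = 0; c < n; c++) {
			if ((int)c != root && t[c].parent == order[head])
				order[tail++] = (int)c;
		}
	}
	return tail;
}

/*
 * Walk the dump order and check that each parent reference resolves to a
 * task restored earlier.  Returns 0 on success, -1 if the order (the
 * "image") is malformed.
 */
static int toy_restore(const struct toy_task *t, size_t n, const int *order, size_t len, int root)
{
	int restored[TOY_MAX_TASKS] = { 0 };
	size_t k;

	assert(n <= TOY_MAX_TASKS);
	for (k = 0; k < len; k++) {
		int id = order[k];
		int parent;

		if (id < 0 || (size_t)id >= n)
			return -1;
		parent = t[id].parent;
		if (id != root) {
			if (parent < 0 || (size_t)parent >= n || !restored[parent])
				return -1;
		}
		restored[id] = 1;
	}
	return 0;
}
```

With this ordering, a child listed before its parent (or a parent loop) is
caught by a single lookup, with no separate cycle detection pass.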

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h  |    2 +
 kernel/kstate/kstate-object.c |    3 +-
 kernel/kstate/kstate-task.c   |   53 +++++++++++++++++++++++++++++++++++-----
 3 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index ac3c81d..348f59f 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -60,6 +60,8 @@ struct kstate_object_header {
 struct kstate_image_task_struct {
 	struct kstate_object_header hdr;
 
+	kstate_ref_t	ref_real_parent;
+
 	kstate_ref_t	ref_mm;
 
 	__u8		comm[16];
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index f9f2f33..60ba70d 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -25,7 +25,8 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	obj->o_ref.pos = 0;	/* not yet dumped */
 	obj->o_ref.id = 0;	/* not yet assigned */
 	obj->o_obj = p;
-	list_add(&obj->o_list, &ctx->obj[type]);
+	/* task_struct collecting relies on "_tail" part. */
+	list_add_tail(&obj->o_list, &ctx->obj[type]);
 
 	switch (type) {
 	case KSTATE_CTX_FILE:
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index aec97c2..4f48c32 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -11,11 +11,6 @@
 static int check_task_struct(struct task_struct *tsk)
 {
 	read_lock_irq(&tasklist_lock);
-	if (!list_empty(&tsk->children)) {
-		read_unlock_irq(&tasklist_lock);
-		WARN_ON(1);
-		return -EINVAL;
-	}
 	if (!list_empty(&tsk->thread_group)) {
 		read_unlock_irq(&tasklist_lock);
 		WARN_ON(1);
@@ -74,8 +69,32 @@ static int collect_task_struct(struct kstate_context *ctx, struct task_struct *t
 
 int kstate_collect_all_task_struct(struct kstate_context *ctx)
 {
+	struct kstate_object *obj;
+	int rv;
+
 	/* Seed task list. */
-	return collect_task_struct(ctx, ctx->init_tsk);
+	rv = collect_task_struct(ctx, ctx->init_tsk);
+	if (rv < 0)
+		return rv;
+	/*
+	 * Children are added after parent as iteration goes on:
+	 * - parent is dumped before child, child knows position of parent's
+	 *   image and can reference it,
+	 * - task_struct restore is done in the same order: parent first.
+	 *   This is cheap loop prevention wrt "->real_parent": if real_parent
+	 *   reference can't be resolved at the time ->real_parent rewrite is
+	 *   done, image is malformed.
+	 */
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj, *child;
+
+		list_for_each_entry(child, &tsk->children, sibling) {
+			rv = collect_task_struct(ctx, child);
+			if (rv < 0)
+				return rv;
+		}
+	}
+	return 0;
 }
 
 static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *obj)
@@ -91,6 +110,13 @@ static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *ob
 	if (!i)
 		return -ENOMEM;
 
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->real_parent, KSTATE_CTX_TASK_STRUCT);
+	if (tmp)
+		i->ref_real_parent = tmp->o_ref;
+	else
+		/* Root of hierarchy to be checkpointed. */
+		i->ref_real_parent = KSTATE_REF_UNDEF;
+
 	tmp = find_kstate_obj_by_ptr(ctx, tsk->mm, KSTATE_CTX_MM_STRUCT);
 	i->ref_mm = tmp->o_ref;
 
@@ -206,7 +232,20 @@ static int task_struct_restorer(void *_tsk_ctx)
 	pr_debug("%s: ENTER tsk %p/%s\n", __func__, tsk, tsk->comm);
 
 	write_lock_irq(&tasklist_lock);
-	real_parent = ctx->init_tsk->nsproxy->pid_ns->child_reaper;
+	if (kstate_ref_undefined(&i->ref_real_parent))
+		real_parent = ctx->init_tsk->nsproxy->pid_ns->child_reaper;
+	else {
+		struct kstate_object *tmp;
+
+		/* Parent as task_struct should be restored already. */
+		tmp = find_kstate_obj_by_ref(ctx, &i->ref_real_parent, KSTATE_CTX_TASK_STRUCT);
+		if (!tmp) {
+			write_unlock_irq(&tasklist_lock);
+			rv = -EINVAL;
+			goto out;
+		}
+		real_parent = tmp->o_obj;
+	}
 	tsk->real_parent = tsk->parent = real_parent;
 	list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
 	write_unlock_irq(&tasklist_lock);
-- 
1.5.6.5



* [PATCH 20/38] C/R: i386 support
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (17 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 19/38] C/R: multiple tasks Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 21/38] C/R: i386 debug registers Alexey Dobriyan
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Segment registers are abstracted to allow i386 => x86_64
migration. (BTW, I'm not so sure whether just making the 32-bit
selectors the same would achieve the same effect.)
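
To make the abstraction concrete, here is a minimal userspace sketch of the
selector <=> image-value mapping. The KSTATE_SEG_* values are the ones
defined in this patch; the TOY_* constants are assumed i386 GDT values
chosen for illustration, and toy_decode_segment() returns -1 where the
kernel code would BUG() or reject the image earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Image-side encodings, as defined in the patch. */
#define KSTATE_SEG_NULL		0
#define KSTATE_SEG_USER32_CS	1
#define KSTATE_SEG_USER32_DS	2
#define KSTATE_SEG_TLS		0x4000
#define KSTATE_SEG_LDT		0x8000

/* Assumed i386 GDT layout, illustrative only. */
#define TOY_USER_CS		0x73
#define TOY_USER_DS		0x7b
#define TOY_GDT_TLS_MIN		6
#define TOY_GDT_TLS_MAX		8

/* Hardware selector -> arch-neutral image value. */
static uint16_t toy_encode_segment(uint16_t sel)
{
	uint16_t idx = sel >> 3;

	if (sel == 0)
		return KSTATE_SEG_NULL;
	assert((sel & 3) == 3);		/* user selectors carry RPL 3 */
	if (sel & 4)			/* TI bit set: LDT selector */
		return KSTATE_SEG_LDT | idx;
	if (sel == TOY_USER_CS)
		return KSTATE_SEG_USER32_CS;
	if (sel == TOY_USER_DS)
		return KSTATE_SEG_USER32_DS;
	assert(idx >= TOY_GDT_TLS_MIN && idx <= TOY_GDT_TLS_MAX);
	return KSTATE_SEG_TLS | (idx - TOY_GDT_TLS_MIN);
}

/* Image value -> hardware selector, or -1 if the value is not valid. */
static int toy_decode_segment(uint16_t seg)
{
	if (seg == KSTATE_SEG_NULL)
		return 0;
	if (seg == KSTATE_SEG_USER32_CS)
		return TOY_USER_CS;
	if (seg == KSTATE_SEG_USER32_DS)
		return TOY_USER_DS;
	if (seg & KSTATE_SEG_TLS) {
		uint16_t idx = seg & ~KSTATE_SEG_TLS;

		if (idx > TOY_GDT_TLS_MAX - TOY_GDT_TLS_MIN)
			return -1;
		return ((TOY_GDT_TLS_MIN + idx) << 3) | 3;
	}
	if (seg & KSTATE_SEG_LDT) {
		uint16_t idx = seg & ~KSTATE_SEG_LDT;

		if (idx > 0x1fff)
			return -1;
		return (idx << 3) | 7;
	}
	return -1;
}
```

Because the image stores "user code segment" or "TLS slot N" rather than a
raw GDT index, a restoring kernel with a different GDT layout can map the
value back to its own selectors.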

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 include/linux/kstate-image.h       |   30 ++++
 include/linux/kstate.h             |    2 +-
 kernel/kstate/Makefile             |    1 +
 kernel/kstate/kstate-x86_32.c      |  294 ++++++++++++++++++++++++++++++++++++
 6 files changed, 330 insertions(+), 1 deletions(-)
 create mode 100644 kernel/kstate/kstate-x86_32.c

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..48557e1 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,8 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_checkpoint		335
+#define __NR_restart		336
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..70d5441 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,5 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_checkpoint		/* 335 */
+	.long sys_restart
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 348f59f..8df5c4a 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -27,6 +27,7 @@ struct kstate_image_header {
 
 	/* Mutable part. */
 	/* Arch of the kernel which dumped the image. */
+#define KSTATE_ARCH_I386	1
 	__le32	kernel_arch;
 	/*
 	 * Distributions are expected to leave image version alone and
@@ -70,6 +71,35 @@ struct kstate_image_task_struct {
 	__u32		tsk_arch;
 } __packed;
 
+#define KSTATE_SEG_NULL		0
+#define KSTATE_SEG_USER32_CS	1
+#define KSTATE_SEG_USER32_DS	2
+#define KSTATE_SEG_TLS		0x4000	/* 0100 0000 0000 00xx */
+#define KSTATE_SEG_LDT		0x8000	/* 100x xxxx xxxx xxxx */
+
+struct kstate_image_task_struct_i386 {
+	__u32		ebx;
+	__u32		ecx;
+	__u32		edx;
+	__u32		esi;
+	__u32		edi;
+	__u32		ebp;
+	__u32		eax;
+	__u32		orig_eax;
+	__u32		eip;
+	__u32		eflags;
+	__u32		esp;
+
+	__u16		cs;
+	__u16		ds;
+	__u16		es;
+	__u16		fs;
+	__u16		gs;
+	__u16		ss;
+
+	__u64		tls_array[3];
+} __packed;
+
 struct kstate_image_mm_struct {
 	struct kstate_object_header hdr;
 
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 3ae9e28..c4b55b6 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -67,7 +67,7 @@ int kstate_collect_all_file(struct kstate_context *ctx);
 int kstate_dump_all_file(struct kstate_context *ctx);
 int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref);
 
-#if 0
+#if defined(CONFIG_X86_32)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
 
diff --git a/kernel/kstate/Makefile b/kernel/kstate/Makefile
index eacd3cf..ca19a22 100644
--- a/kernel/kstate/Makefile
+++ b/kernel/kstate/Makefile
@@ -6,3 +6,4 @@ kstate-y += kstate-image.o
 kstate-y += kstate-mm.o
 kstate-y += kstate-object.o
 kstate-y += kstate-task.o
+kstate-$(CONFIG_X86_32) += kstate-x86_32.o
diff --git a/kernel/kstate/kstate-x86_32.c b/kernel/kstate/kstate-x86_32.c
new file mode 100644
index 0000000..809242c
--- /dev/null
+++ b/kernel/kstate/kstate-x86_32.c
@@ -0,0 +1,294 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/sched.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+const __u32 kstate_kernel_arch = KSTATE_ARCH_I386;
+
+int kstate_arch_check_image_header(struct kstate_image_header *i)
+{
+	if (i->kernel_arch == cpu_to_le32(KSTATE_ARCH_I386))
+		return 0;
+	return -EINVAL;
+}
+
+__u32 kstate_task_struct_arch(struct task_struct *tsk)
+{
+	return KSTATE_ARCH_I386;
+}
+
+static int check_eflags(__u32 eflags)
+{
+	eflags &= ~X86_EFLAGS_CF;
+	eflags &= ~X86_EFLAGS_PF;
+	eflags &= ~X86_EFLAGS_AF;
+	eflags &= ~X86_EFLAGS_ZF;
+	eflags &= ~X86_EFLAGS_SF;
+	eflags &= ~X86_EFLAGS_TF;
+	eflags &= ~X86_EFLAGS_DF;
+	eflags &= ~X86_EFLAGS_OF;
+	eflags &= ~X86_EFLAGS_NT;
+	eflags &= ~X86_EFLAGS_AC;
+	eflags &= ~X86_EFLAGS_ID;
+	if (eflags != (X86_EFLAGS_IF|0x2)) {
+		pr_debug("%s: eflags %08x\n", __func__, eflags);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int check_segment(__u16 seg)
+{
+	switch (seg) {
+	case KSTATE_SEG_NULL:
+	case KSTATE_SEG_USER32_CS:
+	case KSTATE_SEG_USER32_DS:
+		return 0;
+	}
+	if (seg & KSTATE_SEG_TLS) {
+		if ((seg & ~KSTATE_SEG_TLS) > GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN) {
+			pr_debug("%s: seg %04x, GDT_ENTRY_TLS_MIN %u, GDT_ENTRY_TLS_MAX %u\n", __func__, seg, GDT_ENTRY_TLS_MIN, GDT_ENTRY_TLS_MAX);
+			return -EINVAL;
+		}
+		return 0;
+	}
+	if (seg & KSTATE_SEG_LDT) {
+		if ((seg & ~KSTATE_SEG_LDT) > 0x1fff) {
+			pr_debug("%s: seg %04x\n", __func__, seg);
+			return -EINVAL;
+		}
+		return 0;
+	}
+	pr_debug("%s: seg %04x\n", __func__, seg);
+	return -EINVAL;
+}
+
+static int check_tls(struct desc_struct *desc)
+{
+	if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+		return -EINVAL;
+	return 0;
+}
+
+int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *tsk_i)
+{
+	struct kstate_image_task_struct_i386 *i = (void *)(tsk_i + 1);
+	int rv;
+
+	if (tsk_i->tsk_arch != KSTATE_ARCH_I386)
+		return -EINVAL;
+	if (tsk_i->hdr.obj_len < sizeof(*tsk_i) + sizeof(*i))
+		return -EINVAL;
+
+	rv = check_eflags(i->eflags);
+	if (rv < 0)
+		return rv;
+
+	if (i->cs == KSTATE_SEG_NULL)
+		return -EINVAL;
+	rv = check_segment(i->cs);
+	if (rv < 0)
+		return rv;
+	rv = check_segment(i->ds);
+	if (rv < 0)
+		return rv;
+	rv = check_segment(i->es);
+	if (rv < 0)
+		return rv;
+	rv = check_segment(i->fs);
+	if (rv < 0)
+		return rv;
+	rv = check_segment(i->gs);
+	if (rv < 0)
+		return rv;
+	rv = check_segment(i->ss);
+	if (rv < 0)
+		return rv;
+
+	if (i->tls_array[0]) {
+		rv = check_tls((struct desc_struct *)&i->tls_array[0]);
+		if (rv < 0)
+			return rv;
+	}
+	if (i->tls_array[1]) {
+		rv = check_tls((struct desc_struct *)&i->tls_array[1]);
+		if (rv < 0)
+			return rv;
+	}
+	if (i->tls_array[2]) {
+		rv = check_tls((struct desc_struct *)&i->tls_array[2]);
+		if (rv < 0)
+			return rv;
+	}
+
+	return 0;
+}
+
+unsigned int kstate_arch_len_task_struct(struct task_struct *tsk)
+{
+	return sizeof(struct kstate_image_task_struct_i386);
+}
+
+int kstate_arch_check_task_struct(struct task_struct *tsk)
+{
+	struct restart_block *rb;
+
+	if (tsk->thread.xstate) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (test_tsk_thread_flag(tsk, TIF_DEBUG)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	rb = &task_thread_info(tsk)->restart_block;
+	if (rb->fn != current_thread_info()->restart_block.fn) {
+		WARN(1, "rb->fn = %pF\n", rb->fn);
+		return -EINVAL;
+	}
+	if (tsk->thread.vm86_info) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (tsk->thread.io_bitmap_ptr) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#ifdef CONFIG_X86_DS
+	if (tsk->thread.ds_ctx) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
+static __u16 encode_segment(u16 seg)
+{
+	if (seg == 0)
+		return KSTATE_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+	if (seg & 4)
+		return KSTATE_SEG_LDT | (seg >> 3);
+
+	if (seg == __USER_CS)
+		return KSTATE_SEG_USER32_CS;
+	if (seg == __USER_DS)
+		return KSTATE_SEG_USER32_DS;
+
+	if (GDT_ENTRY_TLS_MIN <= (seg >> 3) && (seg >> 3) <= GDT_ENTRY_TLS_MAX)
+		return KSTATE_SEG_TLS | ((seg >> 3) - GDT_ENTRY_TLS_MIN);
+	BUG();
+}
+
+static u16 decode_segment(__u16 seg)
+{
+	if (seg == KSTATE_SEG_NULL)
+		return 0;
+	if (seg == KSTATE_SEG_USER32_CS)
+		return __USER_CS;
+	if (seg == KSTATE_SEG_USER32_DS)
+		return __USER_DS;
+
+	if (seg & KSTATE_SEG_TLS) {
+		seg &= ~KSTATE_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & KSTATE_SEG_LDT) {
+		seg &= ~KSTATE_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct *tsk, void *arch_i)
+{
+	struct kstate_image_task_struct_i386 *i = arch_i;
+	struct pt_regs *regs = task_pt_regs(tsk);
+
+	i->ebx = regs->bx;
+	i->ecx = regs->cx;
+	i->edx = regs->dx;
+	i->esi = regs->si;
+	i->edi = regs->di;
+	i->ebp = regs->bp;
+	i->eax = regs->ax;
+	i->orig_eax = regs->orig_ax;
+	i->eip = regs->ip;
+	i->eflags = regs->flags;
+	i->esp = regs->sp;
+
+	i->cs = encode_segment(regs->cs);
+	i->ds = encode_segment(regs->ds);
+	i->es = encode_segment(regs->es);
+	i->fs = encode_segment(regs->fs);
+	i->gs = encode_segment(tsk->thread.gs);
+	i->ss = encode_segment(regs->ss);
+
+	BUILD_BUG_ON(sizeof(tsk->thread.tls_array[0]) != 8);
+	BUILD_BUG_ON(sizeof(tsk->thread.tls_array) != 3 * 8);
+	memcpy(i->tls_array, tsk->thread.tls_array, sizeof(i->tls_array));
+
+	return 0;
+}
+
+asmlinkage void ret_from_fork(void);
+static int restore_task_struct_i386(struct task_struct *tsk, struct kstate_image_task_struct_i386 *i)
+{
+	struct pt_regs *regs = task_pt_regs(tsk);
+
+	tsk->thread.sp = (unsigned long)regs;
+	tsk->thread.sp0 = (unsigned long)(regs + 1);
+	tsk->thread.ip = (unsigned long)ret_from_fork;
+
+	regs->bx = i->ebx;
+	regs->cx = i->ecx;
+	regs->dx = i->edx;
+	regs->si = i->esi;
+	regs->di = i->edi;
+	regs->bp = i->ebp;
+	regs->ax = i->eax;
+	regs->orig_ax = i->orig_eax;
+	regs->ip = i->eip;
+	regs->flags = i->eflags;
+	regs->sp = i->esp;
+
+	regs->cs = decode_segment(i->cs);
+	regs->ds = decode_segment(i->ds);
+	regs->es = decode_segment(i->es);
+	regs->fs = decode_segment(i->fs);
+	tsk->thread.gs = decode_segment(i->gs);
+	regs->ss = decode_segment(i->ss);
+
+	memcpy(tsk->thread.tls_array, i->tls_array, 3 * 8);
+
+	return 0;
+}
+
+int kstate_arch_restore_task_struct(struct task_struct *tsk, struct kstate_image_task_struct *i)
+{
+	return restore_task_struct_i386(tsk, (void *)(i + 1));
+}
+
+int kstate_arch_check_mm_struct(struct mm_struct *mm)
+{
+	mutex_lock(&mm->context.lock);
+	if (mm->context.ldt || mm->context.size != 0) {
+		mutex_unlock(&mm->context.lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	mutex_unlock(&mm->context.lock);
+	return 0;
+}
+
+unsigned int kstate_arch_len_mm_struct(struct mm_struct *mm)
+{
+	return 0;
+}
+
+int kstate_arch_dump_mm_struct(struct kstate_context *ctx, struct mm_struct *mm, void *arch_i)
+{
+	return 0;
+}
-- 
1.5.6.5



* [PATCH 21/38] C/R: i386 debug registers
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (18 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 20/38] C/R: i386 support Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 22/38] C/R: i386 xstate Alexey Dobriyan
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Make extensive checks to not allow restoration of breakpoints
inside kernel code.
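
The kind of check meant here can be sketched in userspace as follows. This
is not the body of ptrace_check_debugreg(); toy_check_debugreg() and
TOY_TASK_SIZE (an assumed 3G/1G user/kernel split) are hypothetical and
only show the idea: every breakpoint enabled in DR7 must point below the
user address limit and must not use the undefined type encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed user address space limit, illustrative only. */
#define TOY_TASK_SIZE	0xC0000000u

/*
 * dr[0..3] are breakpoint addresses, dr7 is the control register.
 * DR7 layout used here: bits 2n and 2n+1 are the local/global enable bits
 * of breakpoint n, bits 16+4n and 17+4n are its R/W type field.
 * Returns 0 if the set is acceptable, -1 otherwise.
 */
static int toy_check_debugreg(const uint32_t *dr, uint32_t dr7)
{
	int n;

	assert(dr != NULL);
	for (n = 0; n < 4; n++) {
		uint32_t enabled = (dr7 >> (2 * n)) & 3;
		uint32_t rw = (dr7 >> (16 + 4 * n)) & 3;

		if (!enabled)
			continue;	/* disabled breakpoints are harmless */
		if (rw == 2)		/* type 10 is treated as undefined */
			return -1;
		if (dr[n] >= TOY_TASK_SIZE)
			return -1;	/* would trap inside the kernel */
	}
	return 0;
}
```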

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/ptrace.h |    5 +++++
 arch/x86/kernel/ptrace.c      |    8 ++++----
 include/linux/kstate-image.h  |    7 +++++++
 kernel/kstate/kstate-x86_32.c |   26 ++++++++++++++++++++++----
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 624f133..2b40c3c 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -243,6 +243,11 @@ extern void x86_ptrace_fork(struct task_struct *child,
 #define arch_ptrace_untrace(tsk) x86_ptrace_untrace(tsk)
 #define arch_ptrace_fork(child, flags) x86_ptrace_fork(child, flags)
 
+extern int ptrace_check_debugreg(int _32bit,
+				 unsigned long dr0, unsigned long dr1,
+				 unsigned long dr2, unsigned long dr3,
+				 unsigned long dr6, unsigned long dr7);
+
 #endif /* __KERNEL__ */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index afd2b7d..23941a7 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -465,10 +465,10 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
 	return 0;
 }
 
-static int ptrace_check_debugreg(int _32bit,
-				 unsigned long dr0, unsigned long dr1,
-				 unsigned long dr2, unsigned long dr3,
-				 unsigned long dr6, unsigned long dr7)
+int ptrace_check_debugreg(int _32bit,
+			  unsigned long dr0, unsigned long dr1,
+			  unsigned long dr2, unsigned long dr3,
+			  unsigned long dr6, unsigned long dr7)
 {
 	/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
 	unsigned int rw[4];
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 8df5c4a..6f11b4d 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -97,6 +97,13 @@ struct kstate_image_task_struct_i386 {
 	__u16		gs;
 	__u16		ss;
 
+	__u32		dr0;
+	__u32		dr1;
+	__u32		dr2;
+	__u32		dr3;
+	__u32		dr6;
+	__u32		dr7;
+
 	__u64		tls_array[3];
 } __packed;
 
diff --git a/kernel/kstate/kstate-x86_32.c b/kernel/kstate/kstate-x86_32.c
index 809242c..c738e16 100644
--- a/kernel/kstate/kstate-x86_32.c
+++ b/kernel/kstate/kstate-x86_32.c
@@ -106,6 +106,10 @@ int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *tsk_i)
 	if (rv < 0)
 		return rv;
 
+	rv = ptrace_check_debugreg(1, i->dr0, i->dr1, i->dr2, i->dr3, i->dr6, i->dr7);
+	if (rv < 0)
+		return rv;
+
 	if (i->tls_array[0]) {
 		rv = check_tls((struct desc_struct *)&i->tls_array[0]);
 		if (rv < 0)
@@ -138,10 +142,6 @@ int kstate_arch_check_task_struct(struct task_struct *tsk)
 		WARN_ON(1);
 		return -EINVAL;
 	}
-	if (test_tsk_thread_flag(tsk, TIF_DEBUG)) {
-		WARN_ON(1);
-		return -EINVAL;
-	}
 	rb = &task_thread_info(tsk)->restart_block;
 	if (rb->fn != current_thread_info()->restart_block.fn) {
 		WARN(1, "rb->fn = %pF\n", rb->fn);
@@ -226,6 +226,13 @@ int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct
 	i->gs = encode_segment(tsk->thread.gs);
 	i->ss = encode_segment(regs->ss);
 
+	i->dr0 = tsk->thread.debugreg0;
+	i->dr1 = tsk->thread.debugreg1;
+	i->dr2 = tsk->thread.debugreg2;
+	i->dr3 = tsk->thread.debugreg3;
+	i->dr6 = tsk->thread.debugreg6;
+	i->dr7 = tsk->thread.debugreg7;
+
 	BUILD_BUG_ON(sizeof(tsk->thread.tls_array[0]) != 8);
 	BUILD_BUG_ON(sizeof(tsk->thread.tls_array) != 3 * 8);
 	memcpy(i->tls_array, tsk->thread.tls_array, sizeof(i->tls_array));
@@ -261,6 +268,17 @@ static int restore_task_struct_i386(struct task_struct *tsk, struct kstate_image
 	tsk->thread.gs = decode_segment(i->gs);
 	regs->ss = decode_segment(i->ss);
 
+	tsk->thread.debugreg0 = i->dr0;
+	tsk->thread.debugreg1 = i->dr1;
+	tsk->thread.debugreg2 = i->dr2;
+	tsk->thread.debugreg3 = i->dr3;
+	tsk->thread.debugreg6 = i->dr6;
+	tsk->thread.debugreg7 = i->dr7;
+	if (i->dr7)
+		set_tsk_thread_flag(tsk, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(tsk, TIF_DEBUG);
+
 	memcpy(tsk->thread.tls_array, i->tls_array, 3 * 8);
 
 	return 0;
-- 
1.5.6.5



* [PATCH 22/38] C/R: i386 xstate
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (19 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 21/38] C/R: i386 debug registers Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 23/38] C/R: x86_64 support Alexey Dobriyan
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

The only check is that the xstate length in the image matches the
kernel's. This is insufficient, but the patch is posted anyway because
glibc manages to do FP calculations and create xstate, which would
otherwise prevent checkpointing.
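
For reference, the length validation amounts to roughly the following
(a userspace sketch; toy_check_xstate_len() and its parameters are
hypothetical, with fixed_len standing for the combined size of the generic
and i386-specific image structures):

```c
#include <assert.h>
#include <stdint.h>

/*
 * obj_len:            total object length from the image header
 * fixed_len:          size of the fixed-size part preceding xstate
 * len_xstate:         xstate length claimed by the image, 0 if none
 * kernel_xstate_size: xstate size of the restoring kernel
 * Returns 0 if the lengths are consistent, -1 otherwise.
 */
static int toy_check_xstate_len(uint32_t obj_len, uint32_t fixed_len,
				uint32_t len_xstate, uint32_t kernel_xstate_size)
{
	assert(kernel_xstate_size > 0);
	/* Checked first so the subtraction below cannot wrap around. */
	if (obj_len < fixed_len)
		return -1;
	/* The claimed xstate must fit inside the object. */
	if (obj_len - fixed_len < len_xstate)
		return -1;
	/* An image without xstate is fine; otherwise sizes must match. */
	if (len_xstate > 0 && len_xstate != kernel_xstate_size)
		return -1;
	return 0;
}
```

Note that this only validates sizes; the contents of the xstate area are
copied as-is, which is why the check is described as insufficient.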

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h  |    3 ++
 kernel/kstate/kstate-x86_32.c |   44 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 6f11b4d..3c93432 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -105,6 +105,9 @@ struct kstate_image_task_struct_i386 {
 	__u32		dr7;
 
 	__u64		tls_array[3];
+
+	__u32		len_xstate;
+	/* __u8	xstate[len_xstate]; */
 } __packed;
 
 struct kstate_image_mm_struct {
diff --git a/kernel/kstate/kstate-x86_32.c b/kernel/kstate/kstate-x86_32.c
index c738e16..d5c162b 100644
--- a/kernel/kstate/kstate-x86_32.c
+++ b/kernel/kstate/kstate-x86_32.c
@@ -1,5 +1,6 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/sched.h>
+#include <asm/i387.h>
 
 #include <linux/kstate.h>
 #include <linux/kstate-image.h>
@@ -74,12 +75,16 @@ static int check_tls(struct desc_struct *desc)
 int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *tsk_i)
 {
 	struct kstate_image_task_struct_i386 *i = (void *)(tsk_i + 1);
+	unsigned int len_xstate;
 	int rv;
 
 	if (tsk_i->tsk_arch != KSTATE_ARCH_I386)
 		return -EINVAL;
 	if (tsk_i->hdr.obj_len < sizeof(*tsk_i) + sizeof(*i))
 		return -EINVAL;
+	len_xstate = i->len_xstate;
+	if (tsk_i->hdr.obj_len - sizeof(*tsk_i) - sizeof(*i) < len_xstate)
+		return -EINVAL;
 
 	rv = check_eflags(i->eflags);
 	if (rv < 0)
@@ -126,22 +131,28 @@ int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *tsk_i)
 			return rv;
 	}
 
+	if (len_xstate > 0 && len_xstate != xstate_size) {
+		WARN(1, "xstate size mismatch %u:%u\n", len_xstate, xstate_size);
+		return -EINVAL;
+	}
+
 	return 0;
 }
 
 unsigned int kstate_arch_len_task_struct(struct task_struct *tsk)
 {
-	return sizeof(struct kstate_image_task_struct_i386);
+	unsigned int len;
+
+	len = sizeof(struct kstate_image_task_struct_i386);
+	if (tsk->thread.xstate)
+		len += xstate_size;
+	return len;
 }
 
 int kstate_arch_check_task_struct(struct task_struct *tsk)
 {
 	struct restart_block *rb;
 
-	if (tsk->thread.xstate) {
-		WARN_ON(1);
-		return -EINVAL;
-	}
 	rb = &task_thread_info(tsk)->restart_block;
 	if (rb->fn != current_thread_info()->restart_block.fn) {
 		WARN(1, "rb->fn = %pF\n", rb->fn);
@@ -237,13 +248,30 @@ int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct
 	BUILD_BUG_ON(sizeof(tsk->thread.tls_array) != 3 * 8);
 	memcpy(i->tls_array, tsk->thread.tls_array, sizeof(i->tls_array));
 
+	i->len_xstate = 0;
+	if (tsk->thread.xstate) {
+		i->len_xstate = xstate_size;
+		memcpy(i + 1, tsk->thread.xstate, xstate_size);
+	}
+
 	return 0;
 }
 
+static int restore_xstate(struct task_struct *tsk, void *xstate, unsigned int len)
+{
+	int rv;
+
+	rv = init_fpu(tsk);
+	if (rv == 0)
+		memcpy(tsk->thread.xstate, xstate, len);
+	return rv;
+}
+
 asmlinkage void ret_from_fork(void);
 static int restore_task_struct_i386(struct task_struct *tsk, struct kstate_image_task_struct_i386 *i)
 {
 	struct pt_regs *regs = task_pt_regs(tsk);
+	int rv;
 
 	tsk->thread.sp = (unsigned long)regs;
 	tsk->thread.sp0 = (unsigned long)(regs + 1);
@@ -281,6 +309,12 @@ static int restore_task_struct_i386(struct task_struct *tsk, struct kstate_image
 
 	memcpy(tsk->thread.tls_array, i->tls_array, 3 * 8);
 
+	if (i->len_xstate) {
+		rv = restore_xstate(tsk, i + 1, i->len_xstate);
+		if (rv < 0)
+			return rv;
+	}
+
 	return 0;
 }
 
-- 
1.5.6.5



* [PATCH 23/38] C/R: x86_64 support
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (20 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 22/38] C/R: i386 xstate Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 24/38] C/R: x86_64 debug registers Alexey Dobriyan
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

In theory, an x86_64 COMPAT=y kernel will restore i386 images, and the
other way around. There are still small problems and it doesn't work yet,
but it is worth mentioning anyway.

Right now x86_64 kernel restores only x86_64 images and 64-bit tasks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/ia32/ia32entry.S        |    2 +
 arch/x86/include/asm/unistd_64.h |    4 +
 include/linux/kstate-image.h     |   36 ++++
 include/linux/kstate.h           |    2 +-
 kernel/kstate/Makefile           |    1 +
 kernel/kstate/kstate-x86_64.c    |  336 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 380 insertions(+), 1 deletions(-)
 create mode 100644 kernel/kstate/kstate-x86_64.c

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..b12e911 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,6 @@ ia32_sys_call_table:
 	.quad sys_inotify_init1
 	.quad compat_sys_preadv
 	.quad compat_sys_pwritev
+	.quad sys_checkpoint		/* 335 */
+	.quad sys_restart
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..a839c66 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,10 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
 __SYSCALL(__NR_preadv, sys_preadv)
 #define __NR_pwritev				296
 __SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_checkpoint				297
+__SYSCALL(__NR_checkpoint, sys_checkpoint)
+#define __NR_restart				298
+__SYSCALL(__NR_restart, sys_restart)
 
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 3c93432..d697d97 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -28,6 +28,7 @@ struct kstate_image_header {
 	/* Mutable part. */
 	/* Arch of the kernel which dumped the image. */
 #define KSTATE_ARCH_I386	1
+#define KSTATE_ARCH_X86_64	2
 	__le32	kernel_arch;
 	/*
 	 * Distributions are expected to leave image version alone and
@@ -74,6 +75,8 @@ struct kstate_image_task_struct {
 #define KSTATE_SEG_NULL		0
 #define KSTATE_SEG_USER32_CS	1
 #define KSTATE_SEG_USER32_DS	2
+#define KSTATE_SEG_USER64_CS	3
+#define KSTATE_SEG_USER64_DS	4
 #define KSTATE_SEG_TLS		0x4000	/* 0100 0000 0000 00xx */
 #define KSTATE_SEG_LDT		0x8000	/* 100x xxxx xxxx xxxx */
 
@@ -110,6 +113,39 @@ struct kstate_image_task_struct_i386 {
 	/* __u8	xstate[len_xstate]; */
 } __packed;
 
+struct kstate_image_task_struct_x86_64 {
+	__u64		r15;
+	__u64		r14;
+	__u64		r13;
+	__u64		r12;
+	__u64		rbp;
+	__u64		rbx;
+	__u64		r11;
+	__u64		r10;
+	__u64		r9;
+	__u64		r8;
+	__u64		rax;
+	__u64		rcx;
+	__u64		rdx;
+	__u64		rsi;
+	__u64		rdi;
+	__u64		orig_rax;
+	__u64		rip;
+	__u64		rflags;
+	__u64		rsp;
+
+	__u64		fs;
+	__u64		gs;
+	__u16		cs;
+	__u16		ds;
+	__u16		es;
+	__u16		fsindex;
+	__u16		gsindex;
+	__u16		ss;
+
+	__u64		tls_array[3];
+} __packed;
+
 struct kstate_image_mm_struct {
 	struct kstate_object_header hdr;
 
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index c4b55b6..95898ec 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -67,7 +67,7 @@ int kstate_collect_all_file(struct kstate_context *ctx);
 int kstate_dump_all_file(struct kstate_context *ctx);
 int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref);
 
-#if defined(CONFIG_X86_32)
+#if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
 
diff --git a/kernel/kstate/Makefile b/kernel/kstate/Makefile
index ca19a22..0678fc9 100644
--- a/kernel/kstate/Makefile
+++ b/kernel/kstate/Makefile
@@ -7,3 +7,4 @@ kstate-y += kstate-mm.o
 kstate-y += kstate-object.o
 kstate-y += kstate-task.o
 kstate-$(CONFIG_X86_32) += kstate-x86_32.o
+kstate-$(CONFIG_X86_64) += kstate-x86_64.o
diff --git a/kernel/kstate/kstate-x86_64.c b/kernel/kstate/kstate-x86_64.c
new file mode 100644
index 0000000..0d85704
--- /dev/null
+++ b/kernel/kstate/kstate-x86_64.c
@@ -0,0 +1,336 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/sched.h>
+
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+const __u32 kstate_kernel_arch = KSTATE_ARCH_X86_64;
+
+int kstate_arch_check_image_header(struct kstate_image_header *i)
+{
+	if (i->kernel_arch == cpu_to_le32(KSTATE_ARCH_X86_64))
+		return 0;
+	return -EINVAL;
+}
+
+__u32 kstate_task_struct_arch(struct task_struct *tsk)
+{
+	return KSTATE_ARCH_X86_64;
+}
+
+static int check_rflags(__u64 rflags)
+{
+	rflags &= ~X86_EFLAGS_CF;
+	rflags &= ~X86_EFLAGS_PF;
+	rflags &= ~X86_EFLAGS_AF;
+	rflags &= ~X86_EFLAGS_ZF;
+	rflags &= ~X86_EFLAGS_SF;
+	rflags &= ~X86_EFLAGS_TF;
+	rflags &= ~X86_EFLAGS_DF;
+	rflags &= ~X86_EFLAGS_OF;
+	rflags &= ~X86_EFLAGS_NT;
+	rflags &= ~X86_EFLAGS_AC;
+	rflags &= ~X86_EFLAGS_ID;
+	if (rflags != (X86_EFLAGS_IF|0x2)) {
+		pr_debug("%s: rflags %016llx\n", __func__, (unsigned long long)rflags);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int check_segment64(__u16 seg)
+{
+	switch (seg) {
+	case KSTATE_SEG_NULL:
+	case KSTATE_SEG_USER64_CS:
+	case KSTATE_SEG_USER64_DS:
+		return 0;
+	}
+	if (seg & KSTATE_SEG_TLS) {
+		if ((seg & ~KSTATE_SEG_TLS) > GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN) {
+			pr_debug("%s: seg %04x, GDT_ENTRY_TLS_MIN %u, GDT_ENTRY_TLS_MAX %u\n", __func__, seg, GDT_ENTRY_TLS_MIN, GDT_ENTRY_TLS_MAX);
+			return -EINVAL;
+		}
+		return 0;
+	}
+	if (seg & KSTATE_SEG_LDT) {
+		if ((seg & ~KSTATE_SEG_LDT) > 0x1fff) {
+			pr_debug("%s: seg %04x\n", __func__, seg);
+			return -EINVAL;
+		}
+		return 0;
+	}
+	pr_debug("%s: seg %04x\n", __func__, seg);
+	return -EINVAL;
+}
+
+static int check_tls(struct desc_struct *desc)
+{
+	if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+		return -EINVAL;
+	return 0;
+}
+
+static int check_image_task_struct_x86_64(struct kstate_image_task_struct *tsk_i)
+{
+	struct kstate_image_task_struct_x86_64 *i = (void *)(tsk_i + 1);
+	int rv;
+
+	if (tsk_i->hdr.obj_len < sizeof(*tsk_i) + sizeof(*i))
+		return -EINVAL;
+
+	rv = check_rflags(i->rflags);
+	if (rv < 0)
+		return rv;
+
+	if (i->fs >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (i->gs >= TASK_SIZE_MAX)
+		return -EINVAL;
+
+	if (i->cs == KSTATE_SEG_NULL)
+		return -EINVAL;
+	rv = check_segment64(i->cs);
+	if (rv < 0)
+		return rv;
+	rv = check_segment64(i->ds);
+	if (rv < 0)
+		return rv;
+	rv = check_segment64(i->es);
+	if (rv < 0)
+		return rv;
+	rv = check_segment64(i->fsindex);
+	if (rv < 0)
+		return rv;
+	rv = check_segment64(i->gsindex);
+	if (rv < 0)
+		return rv;
+	rv = check_segment64(i->ss);
+	if (rv < 0)
+		return rv;
+
+	if (i->tls_array[0]) {
+		rv = check_tls((struct desc_struct *)&i->tls_array[0]);
+		if (rv < 0)
+			return rv;
+	}
+	if (i->tls_array[1]) {
+		rv = check_tls((struct desc_struct *)&i->tls_array[1]);
+		if (rv < 0)
+			return rv;
+	}
+	if (i->tls_array[2]) {
+		rv = check_tls((struct desc_struct *)&i->tls_array[2]);
+		if (rv < 0)
+			return rv;
+	}
+
+	return 0;
+}
+
+int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *i)
+{
+	if (i->tsk_arch == KSTATE_ARCH_X86_64)
+		return check_image_task_struct_x86_64(i);
+	return -EINVAL;
+}
+
+unsigned int kstate_arch_len_task_struct(struct task_struct *tsk)
+{
+	return sizeof(struct kstate_image_task_struct_x86_64);
+}
+
+int kstate_arch_check_task_struct(struct task_struct *tsk)
+{
+	struct restart_block *rb;
+
+#ifdef CONFIG_COMPAT
+	if (test_tsk_thread_flag(tsk, TIF_IA32)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	if (test_tsk_thread_flag(tsk, TIF_DEBUG)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (tsk->thread.xstate) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	rb = &task_thread_info(tsk)->restart_block;
+	if (rb->fn != current_thread_info()->restart_block.fn) {
+		WARN(1, "rb->fn = %pF\n", rb->fn);
+		return -EINVAL;
+	}
+	if (tsk->thread.io_bitmap_ptr) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#ifdef CONFIG_X86_DS
+	if (tsk->thread.ds_ctx) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
+static __u16 encode_segment(u16 seg)
+{
+	if (seg == 0)
+		return KSTATE_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+	if (seg & 4)
+		return KSTATE_SEG_LDT | (seg >> 3);
+
+	if (seg == __USER_CS)
+		return KSTATE_SEG_USER64_CS;
+	if (seg == __USER_DS)
+		return KSTATE_SEG_USER64_DS;
+
+	if (GDT_ENTRY_TLS_MIN <= (seg >> 3) && (seg >> 3) <= GDT_ENTRY_TLS_MAX)
+		return KSTATE_SEG_TLS | ((seg >> 3) - GDT_ENTRY_TLS_MIN);
+	BUG();
+}
+
+static u16 decode_segment(__u16 seg)
+{
+	if (seg == KSTATE_SEG_NULL)
+		return 0;
+	if (seg == KSTATE_SEG_USER64_CS)
+		return __USER_CS;
+	if (seg == KSTATE_SEG_USER64_DS)
+		return __USER_DS;
+
+	BUILD_BUG_ON(GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN + 1 != 3);
+	if ((seg & KSTATE_SEG_TLS) == KSTATE_SEG_TLS) {
+		seg &= ~KSTATE_SEG_TLS;
+		if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+			return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if ((seg & KSTATE_SEG_LDT) == KSTATE_SEG_LDT) {
+		seg &= ~KSTATE_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+static int dump_task_struct_x86_64(struct kstate_context *ctx, struct task_struct *tsk, void *arch_i)
+{
+	struct kstate_image_task_struct_x86_64 *i = arch_i;
+	struct pt_regs *regs = task_pt_regs(tsk);
+
+	i->r15 = regs->r15;
+	i->r14 = regs->r14;
+	i->r13 = regs->r13;
+	i->r12 = regs->r12;
+	i->rbp = regs->bp;
+	i->rbx = regs->bx;
+	i->r11 = regs->r11;
+	i->r10 = regs->r10;
+	i->r9 = regs->r9;
+	i->r8 = regs->r8;
+	i->rax = regs->ax;
+	i->rcx = regs->cx;
+	i->rdx = regs->dx;
+	i->rsi = regs->si;
+	i->rdi = regs->di;
+	i->orig_rax = regs->orig_ax;
+	i->rip = regs->ip;
+	i->rflags = regs->flags;
+	i->rsp = regs->sp;
+
+	i->fs = tsk->thread.fs;
+	i->gs = tsk->thread.gs;
+	i->cs = encode_segment(regs->cs);
+	i->ds = encode_segment(tsk->thread.ds);
+	i->es = encode_segment(tsk->thread.es);
+	i->fsindex = encode_segment(tsk->thread.fsindex);
+	i->gsindex = encode_segment(tsk->thread.gsindex);
+	i->ss = encode_segment(regs->ss);
+
+	BUILD_BUG_ON(sizeof(tsk->thread.tls_array[0]) != 8);
+	BUILD_BUG_ON(sizeof(tsk->thread.tls_array) != 3 * 8);
+	memcpy(i->tls_array, tsk->thread.tls_array, sizeof(i->tls_array));
+
+	return 0;
+}
+
+int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct *tsk, void *arch_i)
+{
+	return dump_task_struct_x86_64(ctx, tsk, arch_i);
+}
+
+static int restore_task_struct_x86_64(struct task_struct *tsk, struct kstate_image_task_struct_x86_64 *i)
+{
+	struct pt_regs *regs = task_pt_regs(tsk);
+
+	tsk->thread.sp = (unsigned long)regs;
+	tsk->thread.sp0 = (unsigned long)(regs + 1);
+
+	regs->r15 = i->r15;
+	regs->r14 = i->r14;
+	regs->r13 = i->r13;
+	regs->r12 = i->r12;
+	regs->bp = i->rbp;
+	regs->bx = i->rbx;
+	regs->r11 = i->r11;
+	regs->r10 = i->r10;
+	regs->r9 = i->r9;
+	regs->r8 = i->r8;
+	regs->ax = i->rax;
+	regs->cx = i->rcx;
+	regs->dx = i->rdx;
+	regs->si = i->rsi;
+	regs->di = i->rdi;
+	regs->orig_ax = i->orig_rax;
+	regs->ip = i->rip;
+	regs->flags = i->rflags;
+	regs->sp = i->rsp;
+	tsk->thread.usersp = regs->sp;
+
+	tsk->thread.fs = i->fs;
+	tsk->thread.gs = i->gs;
+	regs->cs = decode_segment(i->cs);
+	tsk->thread.ds = decode_segment(i->ds);
+	tsk->thread.es = decode_segment(i->es);
+	tsk->thread.fsindex = decode_segment(i->fsindex);
+	tsk->thread.gsindex = decode_segment(i->gsindex);
+	regs->ss = decode_segment(i->ss);
+
+	memcpy(tsk->thread.tls_array, i->tls_array, sizeof(i->tls_array));
+
+	set_tsk_thread_flag(tsk, TIF_FORK);
+	return 0;
+}
+
+int kstate_arch_restore_task_struct(struct task_struct *tsk, struct kstate_image_task_struct *i)
+{
+	if (i->tsk_arch == KSTATE_ARCH_X86_64) {
+		return restore_task_struct_x86_64(tsk, (void *)(i + 1));
+	}
+	BUG();
+}
+
+int kstate_arch_check_mm_struct(struct mm_struct *mm)
+{
+	mutex_lock(&mm->context.lock);
+	if (mm->context.ldt || mm->context.size != 0) {
+		mutex_unlock(&mm->context.lock);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	mutex_unlock(&mm->context.lock);
+	return 0;
+}
+
+unsigned int kstate_arch_len_mm_struct(struct mm_struct *mm)
+{
+	return 0;
+}
+
+int kstate_arch_dump_mm_struct(struct kstate_context *ctx, struct mm_struct *mm, void *arch_i)
+{
+	return 0;
+}
-- 
1.5.6.5



* [PATCH 24/38] C/R: x86_64 debug registers
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (21 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 23/38] C/R: x86_64 support Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 25/38] C/R: x86_64 xstate Alexey Dobriyan
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h  |    7 +++++++
 kernel/kstate/kstate-x86_64.c |   26 ++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index d697d97..7c54711 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -143,6 +143,13 @@ struct kstate_image_task_struct_x86_64 {
 	__u16		gsindex;
 	__u16		ss;
 
+	__u64		dr0;
+	__u64		dr1;
+	__u64		dr2;
+	__u64		dr3;
+	__u64		dr6;
+	__u64		dr7;
+
 	__u64		tls_array[3];
 } __packed;
 
diff --git a/kernel/kstate/kstate-x86_64.c b/kernel/kstate/kstate-x86_64.c
index 0d85704..83ed1b5 100644
--- a/kernel/kstate/kstate-x86_64.c
+++ b/kernel/kstate/kstate-x86_64.c
@@ -109,6 +109,10 @@ static int check_image_task_struct_x86_64(struct kstate_image_task_struct *tsk_i
 	if (rv < 0)
 		return rv;
 
+	rv = ptrace_check_debugreg(0, i->dr0, i->dr1, i->dr2, i->dr3, i->dr6, i->dr7);
+	if (rv < 0)
+		return rv;
+
 	if (i->tls_array[0]) {
 		rv = check_tls((struct desc_struct *)&i->tls_array[0]);
 		if (rv < 0)
@@ -150,10 +154,6 @@ int kstate_arch_check_task_struct(struct task_struct *tsk)
 		return -EINVAL;
 	}
 #endif
-	if (test_tsk_thread_flag(tsk, TIF_DEBUG)) {
-		WARN_ON(1);
-		return -EINVAL;
-	}
 	if (tsk->thread.xstate) {
 		WARN_ON(1);
 		return -EINVAL;
@@ -250,6 +250,13 @@ static int dump_task_struct_x86_64(struct kstate_context *ctx, struct task_struc
 	i->gsindex = encode_segment(tsk->thread.gsindex);
 	i->ss = encode_segment(regs->ss);
 
+	i->dr0 = tsk->thread.debugreg0;
+	i->dr1 = tsk->thread.debugreg1;
+	i->dr2 = tsk->thread.debugreg2;
+	i->dr3 = tsk->thread.debugreg3;
+	i->dr6 = tsk->thread.debugreg6;
+	i->dr7 = tsk->thread.debugreg7;
+
 	BUILD_BUG_ON(sizeof(tsk->thread.tls_array[0]) != 8);
 	BUILD_BUG_ON(sizeof(tsk->thread.tls_array) != 3 * 8);
 	memcpy(i->tls_array, tsk->thread.tls_array, sizeof(i->tls_array));
@@ -299,6 +306,17 @@ static int restore_task_struct_x86_64(struct task_struct *tsk, struct kstate_ima
 	tsk->thread.gsindex = decode_segment(i->gsindex);
 	regs->ss = decode_segment(i->ss);
 
+	tsk->thread.debugreg0 = i->dr0;
+	tsk->thread.debugreg1 = i->dr1;
+	tsk->thread.debugreg2 = i->dr2;
+	tsk->thread.debugreg3 = i->dr3;
+	tsk->thread.debugreg6 = i->dr6;
+	tsk->thread.debugreg7 = i->dr7;
+	if (i->dr7)
+		set_tsk_thread_flag(tsk, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(tsk, TIF_DEBUG);
+
 	memcpy(tsk->thread.tls_array, i->tls_array, sizeof(i->tls_array));
 
 	set_tsk_thread_flag(tsk, TIF_FORK);
-- 
1.5.6.5



* [PATCH 25/38] C/R: x86_64 xstate
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (22 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 24/38] C/R: x86_64 debug registers Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 26/38] C/R: nsproxy Alexey Dobriyan
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Again, checks aren't enough on all counts.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
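For illustration only (not part of the patch): a minimal userspace sketch
of the kind of length validation this patch adds for the trailing xstate
blob. The struct layouts, the check_object() name and the 32-bit length
field are assumptions made for the example, not the kstate image format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layouts; field names are illustrative only. */
struct obj_header {
	uint32_t obj_type;
	uint32_t obj_len;	/* total length: header + fixed part + blob */
};

struct fixed_part {
	uint64_t regs[4];
	uint32_t len_blob;	/* length of the blob that follows */
};

/*
 * Return 0 if the object is large enough for its fixed part and for the
 * blob it claims to carry, and the blob is either absent or exactly
 * expected_len bytes long. Return -1 otherwise.
 */
static int check_object(const struct obj_header *hdr,
			const struct fixed_part *fixed,
			uint32_t expected_len)
{
	size_t min_len = sizeof(*hdr) + sizeof(*fixed);

	if (hdr->obj_len < min_len)
		return -1;
	/* Safe subtraction: obj_len >= min_len was checked above. */
	if (hdr->obj_len - min_len < fixed->len_blob)
		return -1;
	if (fixed->len_blob != 0 && fixed->len_blob != expected_len)
		return -1;
	return 0;
}
```

The order matters in this sketch: the minimum-size check comes first so
that the subtraction in the second check cannot wrap around.
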
 include/linux/kstate-image.h  |    3 ++
 kernel/kstate/kstate-x86_64.c |   44 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 7c54711..d956d05 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -151,6 +151,9 @@ struct kstate_image_task_struct_x86_64 {
 	__u64		dr7;
 
 	__u64		tls_array[3];
+
+	__u32		len_xstate;
+	/* __u8 xstate[len_xstate]; */
 } __packed;
 
 struct kstate_image_mm_struct {
diff --git a/kernel/kstate/kstate-x86_64.c b/kernel/kstate/kstate-x86_64.c
index 83ed1b5..e18cca1 100644
--- a/kernel/kstate/kstate-x86_64.c
+++ b/kernel/kstate/kstate-x86_64.c
@@ -1,5 +1,6 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/sched.h>
+#include <asm/i387.h>
 
 #include <linux/kstate.h>
 #include <linux/kstate-image.h>
@@ -74,10 +75,14 @@ static int check_tls(struct desc_struct *desc)
 static int check_image_task_struct_x86_64(struct kstate_image_task_struct *tsk_i)
 {
 	struct kstate_image_task_struct_x86_64 *i = (void *)(tsk_i + 1);
+	unsigned int len_xstate;
 	int rv;
 
 	if (tsk_i->hdr.obj_len < sizeof(*tsk_i) + sizeof(*i))
 		return -EINVAL;
+	len_xstate = i->len_xstate;
+	if (tsk_i->hdr.obj_len - sizeof(*tsk_i) - sizeof(*i) < len_xstate)
+		return -EINVAL;
 
 	rv = check_rflags(i->rflags);
 	if (rv < 0)
@@ -129,6 +134,11 @@ static int check_image_task_struct_x86_64(struct kstate_image_task_struct *tsk_i
 			return rv;
 	}
 
+	if (len_xstate > 0 && len_xstate != xstate_size) {
+		WARN(1, "xstate size mismatch %u:%u\n", len_xstate, xstate_size);
+		return -EINVAL;
+	}
+
 	return 0;
 }
 
@@ -141,7 +151,12 @@ int kstate_arch_check_image_task_struct(struct kstate_image_task_struct *i)
 
 unsigned int kstate_arch_len_task_struct(struct task_struct *tsk)
 {
-	return sizeof(struct kstate_image_task_struct_x86_64);
+	unsigned int len;
+
+	len = sizeof(struct kstate_image_task_struct_x86_64);
+	if (tsk->thread.xstate)
+		len += xstate_size;
+	return len;
 }
 
 int kstate_arch_check_task_struct(struct task_struct *tsk)
@@ -154,10 +169,6 @@ int kstate_arch_check_task_struct(struct task_struct *tsk)
 		return -EINVAL;
 	}
 #endif
-	if (tsk->thread.xstate) {
-		WARN_ON(1);
-		return -EINVAL;
-	}
 	rb = &task_thread_info(tsk)->restart_block;
 	if (rb->fn != current_thread_info()->restart_block.fn) {
 		WARN(1, "rb->fn = %pF\n", rb->fn);
@@ -261,6 +272,12 @@ static int dump_task_struct_x86_64(struct kstate_context *ctx, struct task_struc
 	BUILD_BUG_ON(sizeof(tsk->thread.tls_array) != 3 * 8);
 	memcpy(i->tls_array, tsk->thread.tls_array, sizeof(i->tls_array));
 
+	i->len_xstate = 0;
+	if (tsk->thread.xstate) {
+		i->len_xstate = xstate_size;
+		memcpy(i + 1, tsk->thread.xstate, xstate_size);
+	}
+
 	return 0;
 }
 
@@ -269,9 +286,20 @@ int kstate_arch_dump_task_struct(struct kstate_context *ctx, struct task_struct
 	return dump_task_struct_x86_64(ctx, tsk, arch_i);
 }
 
+static int restore_xstate(struct task_struct *tsk, void *xstate, unsigned int len)
+{
+	int rv;
+
+	rv = init_fpu(tsk);
+	if (rv == 0)
+		memcpy(tsk->thread.xstate, xstate, len);
+	return rv;
+}
+
 static int restore_task_struct_x86_64(struct task_struct *tsk, struct kstate_image_task_struct_x86_64 *i)
 {
 	struct pt_regs *regs = task_pt_regs(tsk);
+	int rv;
 
 	tsk->thread.sp = (unsigned long)regs;
 	tsk->thread.sp0 = (unsigned long)(regs + 1);
@@ -319,6 +347,12 @@ static int restore_task_struct_x86_64(struct task_struct *tsk, struct kstate_ima
 
 	memcpy(tsk->thread.tls_array, i->tls_array, sizeof(i->tls_array));
 
+	if (i->len_xstate) {
+		rv = restore_xstate(tsk, i + 1, i->len_xstate);
+		if (rv < 0)
+			return rv;
+	}
+
 	set_tsk_thread_flag(tsk, TIF_FORK);
 	return 0;
 }
-- 
1.5.6.5



* [PATCH 26/38] C/R: nsproxy
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (23 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 25/38] C/R: x86_64 xstate Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 27/38] C/R: checkpoint/restore struct uts_namespace Alexey Dobriyan
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

It can be argued that nsproxy shouldn't be dumped at all,
because only the pointers inside nsproxy are interesting. This is correct;
however, not dumping it requires additional code on restart, and frankly,
more than a few nsproxies are rare, so nobody cares.

The current algorithm is the simplest one:
- dump all namespaces
- dump nsproxy as KSTATE_OBJ_NSPROXY type filled with references
- on restore, read reference, read and restore namespace if needed,
  bump refcount, glue namespace to nsproxy.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
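For illustration only (not part of the patch): a minimal userspace sketch
of the restore step described in the commit message, i.e. look the
namespace up by its image reference, restore it only if it is not there
yet, bump the refcount and glue it to the proxy. All names here (struct
ns, ref_table, restore_ns(), glue_ns()) are hypothetical stand-ins and
are not the kstate API.

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
struct ns { int refcount; };

struct ref_entry {
	unsigned int id;	/* reference stored in the image */
	struct ns *obj;		/* object restored for that reference */
};

#define NR_REFS 4
static struct ns ns_pool[NR_REFS];
static struct ref_entry ref_table[NR_REFS];
static int nr_restored;		/* how many objects were actually rebuilt */

/* Look up an already restored object by its image reference. */
static struct ns *find_by_ref(unsigned int id)
{
	for (int n = 0; n < nr_restored; n++)
		if (ref_table[n].id == id)
			return ref_table[n].obj;
	return NULL;
}

/* "Read and restore namespace if needed": rebuild once, then reuse. */
static struct ns *restore_ns(unsigned int id)
{
	struct ns *ns = find_by_ref(id);

	if (!ns) {
		if (nr_restored == NR_REFS)
			return NULL;
		ns = &ns_pool[nr_restored];
		ns->refcount = 1;	/* reference held by the lookup table */
		ref_table[nr_restored].id = id;
		ref_table[nr_restored].obj = ns;
		nr_restored++;
	}
	return ns;
}

struct proxy { struct ns *uts_ns; };

/* "Bump refcount, glue namespace to nsproxy." */
static int glue_ns(struct proxy *p, unsigned int id)
{
	struct ns *ns = restore_ns(id);

	if (!ns)
		return -1;
	ns->refcount++;
	p->uts_ns = ns;
	return 0;
}
```

Two proxies that carry the same reference end up sharing one restored
object in this sketch, which is the point of restoring through references
rather than copying.
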
 include/linux/kstate-image.h   |    6 ++
 include/linux/kstate.h         |    5 ++
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    6 ++
 kernel/kstate/kstate-object.c  |    4 +
 kernel/kstate/kstate-task.c    |   26 ++++++++
 kernel/nsproxy.c               |  129 +++++++++++++++++++++++++++++++++++++++-
 7 files changed, 181 insertions(+), 1 deletions(-)
 create mode 100644 kernel/kstate/kstate-nsproxy.c

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index d956d05..b8263c8 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -43,6 +43,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_FILE		3
 #define KSTATE_OBJ_VMA		4
 #define KSTATE_OBJ_PAGE		5
+#define KSTATE_OBJ_NSPROXY	6
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -65,6 +66,7 @@ struct kstate_image_task_struct {
 	kstate_ref_t	ref_real_parent;
 
 	kstate_ref_t	ref_mm;
+	kstate_ref_t	ref_nsproxy;
 
 	__u8		comm[16];
 
@@ -203,4 +205,8 @@ struct kstate_image_file {
 	__u32		name_len;	/* including NUL */
 	/* __u8	name[name_len] */
 } __packed;
+
+struct kstate_image_nsproxy {
+	struct kstate_object_header hdr;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 95898ec..bd5c9f1 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -23,6 +23,7 @@ struct kstate_object {
 enum kstate_context_obj_type {
 	KSTATE_CTX_FILE,
 	KSTATE_CTX_MM_STRUCT,
+	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_TASK_STRUCT,
 	NR_KSTATE_CTX_TYPES
 };
@@ -67,6 +68,10 @@ int kstate_collect_all_file(struct kstate_context *ctx);
 int kstate_dump_all_file(struct kstate_context *ctx);
 int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_nsproxy(struct kstate_context *ctx);
+int kstate_dump_all_nsproxy(struct kstate_context *ctx);
+int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 6bc1d0a..1d5e79d 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -65,6 +65,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_task_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_nsproxy(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
@@ -127,6 +130,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_nsproxy(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_task_struct(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 85d1514..8c728f5 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -1,6 +1,7 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/file.h>
 #include <linux/list.h>
+#include <linux/nsproxy.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 
@@ -40,6 +41,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_NSPROXY) {
+		put_nsproxy((struct nsproxy *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_TASK_STRUCT) {
 		put_task_struct((struct task_struct *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-nsproxy.c b/kernel/kstate/kstate-nsproxy.c
new file mode 100644
index 0000000..e69de29
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 60ba70d..078bd36 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -1,6 +1,7 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/fs.h>
 #include <linux/mm_types.h>
+#include <linux/nsproxy.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 
@@ -35,6 +36,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_MM_STRUCT:
 		atomic_inc(&((struct mm_struct *)obj->o_obj)->mm_users);
 		break;
+	case KSTATE_CTX_NSPROXY:
+		get_nsproxy((struct nsproxy *)obj->o_obj);
+		break;
 	case KSTATE_CTX_TASK_STRUCT:
 		get_task_struct((struct task_struct *)obj->o_obj);
 		break;
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index 4f48c32..de876fe 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -120,6 +120,9 @@ static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *ob
 	tmp = find_kstate_obj_by_ptr(ctx, tsk->mm, KSTATE_CTX_MM_STRUCT);
 	i->ref_mm = tmp->o_ref;
 
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->nsproxy, KSTATE_CTX_NSPROXY);
+	i->ref_nsproxy = tmp->o_ref;
+
 	BUILD_BUG_ON(sizeof(i->comm) != sizeof(tsk->comm));
 	strlcpy((char *)i->comm, (const char *)tsk->comm, sizeof(i->comm));
 
@@ -208,6 +211,26 @@ static int restore_mm(struct kstate_context *ctx, kstate_ref_t *ref)
 	return 0;
 }
 
+static int restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct nsproxy *nsproxy;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_NSPROXY);
+	if (!tmp) {
+		rv = kstate_restore_nsproxy(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_NSPROXY);
+	}
+	nsproxy = tmp->o_obj;
+
+	get_nsproxy(nsproxy);
+	switch_task_namespaces(current, nsproxy);
+	return 0;
+}
+
 struct task_struct_restore_context {
 	struct kstate_context *ctx;
 	struct kstate_image_task_struct *i;
@@ -253,6 +276,9 @@ static int task_struct_restorer(void *_tsk_ctx)
 	rv = restore_mm(ctx, &i->ref_mm);
 	if (rv < 0)
 		goto out;
+	rv = restore_nsproxy(ctx, &i->ref_nsproxy);
+	if (rv < 0)
+		goto out;
 
 out:
 	tsk_ctx->rv = rv;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..b7ccd68 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -1,5 +1,6 @@
 /*
  *  Copyright (C) 2006 IBM Corporation
+ *  Copyright (C) 2009 Parallels Holdings, Ltd.
  *
  *  Author: Serge Hallyn <serue@us.ibm.com>
  *
@@ -26,7 +27,7 @@ static struct kmem_cache *nsproxy_cachep;
 
 struct nsproxy init_nsproxy = INIT_NSPROXY(init_nsproxy);
 
-static inline struct nsproxy *create_nsproxy(void)
+static struct nsproxy *create_nsproxy(void)
 {
 	struct nsproxy *nsproxy;
 
@@ -228,3 +229,129 @@ static int __init nsproxy_cache_init(void)
 }
 
 module_init(nsproxy_cache_init);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_nsproxy(struct kstate_context *ctx, struct nsproxy *nsproxy)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, nsproxy, KSTATE_CTX_NSPROXY);
+	pr_debug("collect nsproxy %p: rv %d\n", nsproxy, rv);
+	return rv;
+}
+
+int kstate_collect_all_nsproxy(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = collect_nsproxy(ctx, tsk->nsproxy);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+		unsigned int cnt = atomic_read(&nsproxy->count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("nsproxy %p has external references %lu:%u\n", nsproxy, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_nsproxy(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct nsproxy *nsproxy = obj->o_obj;
+	struct kstate_image_nsproxy *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_NSPROXY, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump nsproxy %p: ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_nsproxy(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		rv = dump_nsproxy(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_nsproxy *i;
+	struct nsproxy *nsproxy;
+	struct uts_namespace *uts_ns;
+#ifdef CONFIG_IPC_NS
+	struct ipc_namespace *ipc_ns;
+#endif
+	struct mnt_namespace *mnt_ns;
+	struct pid_namespace *pid_ns;
+#ifdef CONFIG_NET_NS
+	struct net *net_ns;
+#endif
+	int rv;
+
+	i = kstate_read_image(ctx, ref,  KSTATE_OBJ_NSPROXY, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	nsproxy = create_nsproxy();
+	if (!nsproxy) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+
+	uts_ns = ctx->init_tsk->nsproxy->uts_ns;
+	get_uts_ns(uts_ns);
+	nsproxy->uts_ns = uts_ns;
+
+#ifdef CONFIG_IPC_NS
+	ipc_ns = ctx->init_tsk->nsproxy->ipc_ns;
+	nsproxy->ipc_ns = get_ipc_ns(ipc_ns);
+#endif
+
+	mnt_ns = ctx->init_tsk->nsproxy->mnt_ns;
+	get_mnt_ns(mnt_ns);
+	nsproxy->mnt_ns = mnt_ns;
+
+	pid_ns = ctx->init_tsk->nsproxy->pid_ns;
+	nsproxy->pid_ns = get_pid_ns(pid_ns);
+
+#ifdef CONFIG_NET_NS
+	net_ns = ctx->init_tsk->nsproxy->net_ns;
+	nsproxy->net_ns = get_net(net_ns);
+#endif
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, nsproxy, KSTATE_CTX_NSPROXY, ref);
+	if (rv < 0)
+		put_nsproxy(nsproxy);
+	pr_debug("restore nsproxy %p, ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5



* [PATCH 27/38] C/R: checkpoint/restore struct uts_namespace
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (24 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 26/38] C/R: nsproxy Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 28/38] C/R: formally checkpoint/restore struct ipc_namespace Alexey Dobriyan
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

The simplest thing.

Note that no locking is done, because we know that:
a) tasks are frozen and thus can't change it
b) the number of collected references is correct, so nobody else
   is using the uts_namespace and thus nobody can change it during the dump

If live checkpointing is ever implemented, the code will change
to something like:

	if (ctx->dump_live)
		down_read(&uts_sem);
		...
	if (ctx->dump_live)
		up_read(&uts_sem);

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
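For illustration only (not part of the patch): a minimal userspace sketch
of the conditional read-locking idea from the commit message. The
fake_rwsem type merely counts acquisitions so the behaviour can be
observed; it is a hypothetical stand-in for a real read/write semaphore,
and dump_ctx, uts_image and dump_uts() are likewise invented for the
example.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a read/write semaphore such as uts_sem. */
struct fake_rwsem {
	int readers;		/* readers currently inside */
	int acquisitions;	/* total read-side acquisitions so far */
};

static void fake_down_read(struct fake_rwsem *sem)
{
	sem->readers++;
	sem->acquisitions++;
}

static void fake_up_read(struct fake_rwsem *sem)
{
	sem->readers--;
}

struct dump_ctx {
	bool dump_live;		/* true if tasks keep running during the dump */
};

struct uts_image {
	char nodename[64];
};

static struct fake_rwsem uts_sem;

/* Take the lock only when the data can change underneath the dump. */
static void dump_uts(const struct dump_ctx *ctx, const char *nodename,
		     struct uts_image *img)
{
	if (ctx->dump_live)
		fake_down_read(&uts_sem);

	memset(img, 0, sizeof(*img));
	strncpy(img->nodename, nodename, sizeof(img->nodename) - 1);

	if (ctx->dump_live)
		fake_up_read(&uts_sem);
}
```

With frozen tasks the lock is never touched; only a live dump pays for
the read-side acquisition.
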
 include/linux/kstate-image.h   |   14 +++++
 include/linux/kstate.h         |    5 ++
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    6 ++
 kernel/kstate/kstate-object.c  |    4 ++
 kernel/nsproxy.c               |   33 +++++++++++--
 kernel/utsname.c               |  106 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 170 insertions(+), 4 deletions(-)
 delete mode 100644 kernel/kstate/kstate-nsproxy.c
 create mode 100644 kernel/kstate/kstate-uts_ns.c

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index b8263c8..2eef50b 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -44,6 +44,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_VMA		4
 #define KSTATE_OBJ_PAGE		5
 #define KSTATE_OBJ_NSPROXY	6
+#define KSTATE_OBJ_UTS_NS	7
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -208,5 +209,18 @@ struct kstate_image_file {
 
 struct kstate_image_nsproxy {
 	struct kstate_object_header hdr;
+
+	kstate_ref_t	ref_uts_ns;
+} __packed;
+
+struct kstate_image_uts_ns {
+	struct kstate_object_header hdr;
+
+	__u8		sysname[64];
+	__u8		nodename[64];
+	__u8		release[64];
+	__u8		version[64];
+	__u8		machine[64];
+	__u8		domainname[64];
 } __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index bd5c9f1..dba4803 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -25,6 +25,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_MM_STRUCT,
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_TASK_STRUCT,
+	KSTATE_CTX_UTS_NS,
 	NR_KSTATE_CTX_TYPES
 };
 
@@ -72,6 +73,10 @@ int kstate_collect_all_nsproxy(struct kstate_context *ctx);
 int kstate_dump_all_nsproxy(struct kstate_context *ctx);
 int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_uts_ns(struct kstate_context *ctx);
+int kstate_dump_all_uts_ns(struct kstate_context *ctx);
+int kstate_restore_uts_ns(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 1d5e79d..7d4681a 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -68,6 +68,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_nsproxy(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_uts_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
@@ -130,6 +133,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_uts_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_nsproxy(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 8c728f5..98dc0c0 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -4,6 +4,7 @@
 #include <linux/nsproxy.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/utsname.h>
 
 #include <linux/kstate.h>
 
@@ -51,5 +52,10 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_UTS_NS) {
+		put_uts_ns((struct uts_namespace *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	kfree(ctx);
 }
diff --git a/kernel/kstate/kstate-nsproxy.c b/kernel/kstate/kstate-nsproxy.c
deleted file mode 100644
index e69de29..0000000
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 078bd36..aeadd95 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -4,6 +4,7 @@
 #include <linux/nsproxy.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/utsname.h>
 
 #include <linux/kstate.h>
 #include <linux/kstate-image.h>
@@ -42,6 +43,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_TASK_STRUCT:
 		get_task_struct((struct task_struct *)obj->o_obj);
 		break;
+	case KSTATE_CTX_UTS_NS:
+		get_uts_ns((struct uts_namespace *)obj->o_obj);
+		break;
 	default:
 		BUG();
 	}
diff --git a/kernel/kstate/kstate-uts_ns.c b/kernel/kstate/kstate-uts_ns.c
new file mode 100644
index 0000000..e69de29
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b7ccd68..b2a7ed3 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -271,12 +271,16 @@ static int dump_nsproxy(struct kstate_context *ctx, struct kstate_object *obj)
 {
 	struct nsproxy *nsproxy = obj->o_obj;
 	struct kstate_image_nsproxy *i;
+	struct kstate_object *tmp;
 	int rv;
 
 	i = kstate_prepare_image(KSTATE_OBJ_NSPROXY, sizeof(*i));
 	if (!i)
 		return -ENOMEM;
 
+	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->uts_ns, KSTATE_CTX_UTS_NS);
+	i->ref_uts_ns = tmp->o_ref;
+
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
 	pr_debug("dump nsproxy %p: ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
@@ -296,11 +300,30 @@ int kstate_dump_all_nsproxy(struct kstate_context *ctx)
 	return 0;
 }
 
+static int restore_uts_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	struct uts_namespace *uts_ns;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_UTS_NS);
+	if (!tmp) {
+		rv = kstate_restore_uts_ns(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_UTS_NS);
+	}
+	uts_ns = tmp->o_obj;
+
+	get_uts_ns(uts_ns);
+	nsproxy->uts_ns = uts_ns;
+	return 0;
+}
+
 int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_nsproxy *i;
 	struct nsproxy *nsproxy;
-	struct uts_namespace *uts_ns;
 #ifdef CONFIG_IPC_NS
 	struct ipc_namespace *ipc_ns;
 #endif
@@ -321,9 +344,9 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 		goto out_free_image;
 	}
 
-	uts_ns = ctx->init_tsk->nsproxy->uts_ns;
-	get_uts_ns(uts_ns);
-	nsproxy->uts_ns = uts_ns;
+	rv = restore_uts_ns(ctx, &i->ref_uts_ns, nsproxy);
+	if (rv < 0)
+		goto out_uts_ns;
 
 #ifdef CONFIG_IPC_NS
 	ipc_ns = ctx->init_tsk->nsproxy->ipc_ns;
@@ -349,6 +372,8 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore nsproxy %p, ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_uts_ns:
+	kmem_cache_free(nsproxy_cachep, nsproxy);
 out_free_image:
 	kfree(i);
 	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..5b25e8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -1,5 +1,6 @@
 /*
  *  Copyright (C) 2004 IBM Corporation
+ *  Copyright (C) 2009 Parallels Holdings, Ltd.
  *
  *  Author: Serge Hallyn <serue@us.ibm.com>
  *
@@ -73,3 +74,108 @@ void free_uts_ns(struct kref *kref)
 	ns = container_of(kref, struct uts_namespace, kref);
 	kfree(ns);
 }
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_uts_ns(struct kstate_context *ctx, struct uts_namespace *uts_ns)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, uts_ns, KSTATE_CTX_UTS_NS);
+	pr_debug("collect uts_ns %p: rv %d\n", uts_ns, rv);
+	return rv;
+}
+
+int kstate_collect_all_uts_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = collect_uts_ns(ctx, nsproxy->uts_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_UTS_NS) {
+		struct uts_namespace *uts_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&uts_ns->kref.refcount);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("uts_ns %p has external references %lu:%u\n", uts_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_uts_ns(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct uts_namespace *uts_ns = obj->o_obj;
+	struct kstate_image_uts_ns *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_UTS_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	strncpy(i->sysname, (const char *)uts_ns->name.sysname, 64);
+	strncpy(i->nodename, (const char *)uts_ns->name.nodename, 64);
+	strncpy(i->release, (const char *)uts_ns->name.release, 64);
+	strncpy(i->version, (const char *)uts_ns->name.version, 64);
+	strncpy(i->machine, (const char *)uts_ns->name.machine, 64);
+	strncpy(i->domainname, (const char *)uts_ns->name.domainname, 64);
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump uts_ns %p: ref {%llu, %u}, rv %d\n", uts_ns, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_uts_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_UTS_NS) {
+		rv = dump_uts_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_uts_ns(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_uts_ns *i;
+	struct uts_namespace *uts_ns;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_UTS_NS, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	uts_ns = create_uts_ns();
+	if (!uts_ns) {
+		kfree(i);
+		return -ENOMEM;
+	}
+
+	strlcpy(uts_ns->name.sysname, i->sysname, sizeof(uts_ns->name.sysname));
+	strlcpy(uts_ns->name.nodename, i->nodename, sizeof(uts_ns->name.nodename));
+	strlcpy(uts_ns->name.release, i->release, sizeof(uts_ns->name.release));
+	strlcpy(uts_ns->name.version, i->version, sizeof(uts_ns->name.version));
+	strlcpy(uts_ns->name.machine, i->machine, sizeof(uts_ns->name.machine));
+	strlcpy(uts_ns->name.domainname, i->domainname, sizeof(uts_ns->name.domainname));
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, uts_ns, KSTATE_CTX_UTS_NS, ref);
+	if (rv < 0)
+		put_uts_ns(uts_ns);
+	pr_debug("restore uts_ns %p: ref {%llu, %u}, rv %d\n", uts_ns, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+}
+#endif
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 28/38] C/R: formally checkpoint/restore struct ipc_namespace
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (25 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 27/38] C/R: checkpoint/restore struct uts_namespace Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 29/38] C/R: formally checkpoint/restore struct mnt_namespace Alexey Dobriyan
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

No IPC objects are handled yet; only struct ipc_namespace itself and
the tsk->nsproxy->ipc_ns skeleton are checkpointed and restored.
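
The restore path in this patch (restore_ipc_ns() in kernel/nsproxy.c)
distinguishes four cases, depending on whether the image was dumped with
IPC_NS and whether the restoring kernel has it. A minimal userspace
sketch of that decision table follows. The enum, the function name and
the boolean parameters are hypothetical and only illustrate the logic;
ref_undefined stands in for a ref_ipc_ns equal to KSTATE_REF_UNDEF.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical action names, introduced for illustration only. */
enum ipc_ns_restore_action {
	IPC_NS_RESTORE_NOTHING,		/* IPC_NS=n => IPC_NS=n: nothing to do */
	IPC_NS_RESTORE_FRESH,		/* IPC_NS=n => IPC_NS=y: create a fresh ipc_ns */
	IPC_NS_RESTORE_FROM_IMAGE,	/* IPC_NS=y => IPC_NS=y: look up or restore by ref */
};

/*
 * ref_undefined:     the image carries an undefined ref_ipc_ns
 *                    (it was dumped by an IPC_NS=n kernel).
 * kernel_has_ipc_ns: the restoring kernel is built with CONFIG_IPC_NS.
 *
 * Returns 0 and sets *action, or -EINVAL for the IPC_NS=y => IPC_NS=n case.
 */
int ipc_ns_restore_decision(bool ref_undefined, bool kernel_has_ipc_ns,
			    enum ipc_ns_restore_action *action)
{
	assert(action != NULL);

	if (kernel_has_ipc_ns) {
		*action = ref_undefined ? IPC_NS_RESTORE_FRESH
					: IPC_NS_RESTORE_FROM_IMAGE;
		return 0;
	}
	if (ref_undefined) {
		*action = IPC_NS_RESTORE_NOTHING;
		return 0;
	}
	/* The image expects an ipc_ns this kernel cannot provide. */
	return -EINVAL;
}
```

Only the IPC_NS=y => IPC_NS=n combination is refused; the other three
all succeed.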

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/ipc_namespace.h  |    3 +
 include/linux/kstate-image.h   |    6 ++
 include/linux/kstate.h         |   19 ++++++++
 ipc/namespace.c                |  101 +++++++++++++++++++++++++++++++++++++++-
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    8 +++
 kernel/kstate/kstate-object.c  |    6 ++
 kernel/nsproxy.c               |   66 +++++++++++++++++++++++---
 8 files changed, 206 insertions(+), 9 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index e408722..2f75a2d 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -94,6 +94,7 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }
 #endif
 
 #if defined(CONFIG_IPC_NS)
+extern struct ipc_namespace *create_ipc_ns(void);
 extern struct ipc_namespace *copy_ipcs(unsigned long flags,
 				       struct ipc_namespace *ns);
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
@@ -105,6 +106,8 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
+#include <linux/sched.h>
+
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
 		struct ipc_namespace *ns)
 {
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 2eef50b..43a1458 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -45,6 +45,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_PAGE		5
 #define KSTATE_OBJ_NSPROXY	6
 #define KSTATE_OBJ_UTS_NS	7
+#define KSTATE_OBJ_IPC_NS	8
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -211,6 +212,7 @@ struct kstate_image_nsproxy {
 	struct kstate_object_header hdr;
 
 	kstate_ref_t	ref_uts_ns;
+	kstate_ref_t	ref_ipc_ns;	/* KSTATE_REF_UNDEF if IPC_NS=n */
 } __packed;
 
 struct kstate_image_uts_ns {
@@ -223,4 +225,8 @@ struct kstate_image_uts_ns {
 	__u8		machine[64];
 	__u8		domainname[64];
 } __packed;
+
+struct kstate_image_ipc_ns {
+	struct kstate_object_header hdr;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index dba4803..61bed98 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -22,6 +22,9 @@ struct kstate_object {
 /* Not visible to userspace! */
 enum kstate_context_obj_type {
 	KSTATE_CTX_FILE,
+#ifdef CONFIG_IPC_NS
+	KSTATE_CTX_IPC_NS,
+#endif
 	KSTATE_CTX_MM_STRUCT,
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_TASK_STRUCT,
@@ -77,6 +80,22 @@ int kstate_collect_all_uts_ns(struct kstate_context *ctx);
 int kstate_dump_all_uts_ns(struct kstate_context *ctx);
 int kstate_restore_uts_ns(struct kstate_context *ctx, kstate_ref_t *ref);
 
+#ifdef CONFIG_IPC_NS
+int kstate_collect_all_ipc_ns(struct kstate_context *ctx);
+int kstate_dump_all_ipc_ns(struct kstate_context *ctx);
+int kstate_restore_ipc_ns(struct kstate_context *ctx, kstate_ref_t *ref);
+#else
+static inline int kstate_collect_all_ipc_ns(struct kstate_context *ctx)
+{
+	return 0;
+}
+
+static inline int kstate_dump_all_ipc_ns(struct kstate_context *ctx)
+{
+	return 0;
+}
+#endif
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..bfdc3dc 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -1,6 +1,7 @@
 /*
  * linux/ipc/namespace.c
  * Copyright (C) 2006 Pavel Emelyanov <xemul@openvz.org> OpenVZ, SWsoft Inc.
+ * Copyright (C) 2009 Parallels Holdings, Ltd.
  */
 
 #include <linux/ipc.h>
@@ -14,7 +15,7 @@
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(void)
+struct ipc_namespace *create_ipc_ns(void)
 {
 	struct ipc_namespace *ns;
 	int err;
@@ -132,3 +133,101 @@ void put_ipc_ns(struct ipc_namespace *ns)
 		free_ipc_ns(ns);
 	}
 }
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_ipc_ns(struct kstate_context *ctx, struct ipc_namespace *ipc_ns)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, ipc_ns, KSTATE_CTX_IPC_NS);
+	pr_debug("collect ipc_ns %p: rv %d\n", ipc_ns, rv);
+	return rv;
+}
+
+int kstate_collect_all_ipc_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = collect_ipc_ns(ctx, nsproxy->ipc_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_IPC_NS) {
+		struct ipc_namespace *ipc_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&ipc_ns->count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("ipc_ns %p has external references %lu:%u\n", ipc_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_ipc_ns(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct ipc_namespace *ipc_ns = obj->o_obj;
+	struct kstate_image_ipc_ns *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_IPC_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump ipc_ns %p: ref {%llu, %u}, rv %d\n", ipc_ns, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_ipc_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_IPC_NS) {
+		rv = dump_ipc_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_ipc_ns(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_ipc_ns *i;
+	struct ipc_namespace *ipc_ns;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_IPC_NS, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	ipc_ns = create_ipc_ns();
+	if (IS_ERR(ipc_ns)) {
+		rv = PTR_ERR(ipc_ns);
+		goto out_free_image;
+	}
+
+	/* FIXME */
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, ipc_ns, KSTATE_CTX_IPC_NS, ref);
+	if (rv < 0)
+		put_ipc_ns(ipc_ns);
+	pr_debug("restore ipc_ns %p: ref {%llu, %u}, rv %d\n", ipc_ns, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 7d4681a..1b03de5 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -71,6 +71,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_uts_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_ipc_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
@@ -133,6 +136,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_ipc_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_uts_ns(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 98dc0c0..c2449d5 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -1,5 +1,6 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/file.h>
+#include <linux/ipc_namespace.h>
 #include <linux/list.h>
 #include <linux/nsproxy.h>
 #include <linux/sched.h>
@@ -37,6 +38,13 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+#ifdef CONFIG_IPC_NS
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_IPC_NS) {
+		put_ipc_ns((struct ipc_namespace *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
+#endif
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_MM_STRUCT) {
 		mmput((struct mm_struct *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index aeadd95..0056572 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -1,5 +1,6 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/fs.h>
+#include <linux/ipc_namespace.h>
 #include <linux/mm_types.h>
 #include <linux/nsproxy.h>
 #include <linux/sched.h>
@@ -34,6 +35,11 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_FILE:
 		get_file((struct file *)obj->o_obj);
 		break;
+#ifdef CONFIG_IPC_NS
+	case KSTATE_CTX_IPC_NS:
+		get_ipc_ns((struct ipc_namespace *)obj->o_obj);
+		break;
+#endif
 	case KSTATE_CTX_MM_STRUCT:
 		atomic_inc(&((struct mm_struct *)obj->o_obj)->mm_users);
 		break;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b2a7ed3..7cb82e3 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -280,6 +280,12 @@ static int dump_nsproxy(struct kstate_context *ctx, struct kstate_object *obj)
 
 	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->uts_ns, KSTATE_CTX_UTS_NS);
 	i->ref_uts_ns = tmp->o_ref;
+#ifdef CONFIG_IPC_NS
+	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->ipc_ns, KSTATE_CTX_IPC_NS);
+	i->ref_ipc_ns = tmp->o_ref;
+#else
+	i->ref_ipc_ns = KSTATE_REF_UNDEF;
+#endif
 
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
@@ -320,13 +326,57 @@ static int restore_uts_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct
 	return 0;
 }
 
+#ifdef CONFIG_IPC_NS
+static int restore_ipc_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	struct ipc_namespace *ipc_ns;
+	struct kstate_object *tmp;
+	int rv;
+
+	if (kstate_ref_undefined(ref)) {
+		/*
+		 * IPC_NS=n => IPC_NS=y case: hope nobody is crazy enough
+		 * to depend on IPC_NS absence.
+		 */
+		ipc_ns = create_ipc_ns();
+		if (IS_ERR(ipc_ns))
+			return PTR_ERR(ipc_ns);
+		rv = kstate_restore_object(ctx, ipc_ns, KSTATE_CTX_IPC_NS, ref);
+		if (rv < 0) {
+			put_ipc_ns(ipc_ns);
+			return rv;
+		}
+		nsproxy->ipc_ns = ipc_ns;
+		return 0;
+	}
+	/* IPC_NS=y => IPC_NS=y case. */
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_IPC_NS);
+	if (!tmp) {
+		rv = kstate_restore_ipc_ns(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_IPC_NS);
+	}
+	ipc_ns = tmp->o_obj;
+
+	nsproxy->ipc_ns = get_ipc_ns(ipc_ns);
+	return 0;
+}
+#else
+static int restore_ipc_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	/* IPC_NS=n => IPC_NS=n case. */
+	if (kstate_ref_undefined(ref))
+		return 0;
+	/* IPC_NS=y => IPC_NS=n case. */
+	return -EINVAL;
+}
+#endif
+
 int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_nsproxy *i;
 	struct nsproxy *nsproxy;
-#ifdef CONFIG_IPC_NS
-	struct ipc_namespace *ipc_ns;
-#endif
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns;
 #ifdef CONFIG_NET_NS
@@ -347,11 +397,9 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	rv = restore_uts_ns(ctx, &i->ref_uts_ns, nsproxy);
 	if (rv < 0)
 		goto out_uts_ns;
-
-#ifdef CONFIG_IPC_NS
-	ipc_ns = ctx->init_tsk->nsproxy->ipc_ns;
-	nsproxy->ipc_ns = get_ipc_ns(ipc_ns);
-#endif
+	rv = restore_ipc_ns(ctx, &i->ref_ipc_ns, nsproxy);
+	if (rv < 0)
+		goto out_ipc_ns;
 
 	mnt_ns = ctx->init_tsk->nsproxy->mnt_ns;
 	get_mnt_ns(mnt_ns);
@@ -372,6 +420,8 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore nsproxy %p, ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_ipc_ns:
+	put_uts_ns(nsproxy->uts_ns);
 out_uts_ns:
 	kmem_cache_free(nsproxy_cachep, nsproxy);
 out_free_image:
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 29/38] C/R: formally checkpoint/restore struct mnt_namespace
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (26 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 28/38] C/R: formally checkpoint/restore struct ipc_namespace Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 30/38] C/R: checkpoint/restore struct pid_namespace Alexey Dobriyan
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/namespace.c                 |   70 ++++++++++++++++++++++++++++++++++++++++
 include/linux/kstate-image.h   |    6 +++
 include/linux/kstate.h         |    4 ++
 kernel/kstate/cpt-sys.c        |    6 +++
 kernel/kstate/kstate-context.c |    6 +++
 kernel/kstate/kstate-object.c  |    4 ++
 kernel/nsproxy.c               |   24 +++++++++++---
 7 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index be33bfc..ccd7b54 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3,6 +3,7 @@
  *
  * (C) Copyright Al Viro 2000, 2001
  *	Released under GPL v2.
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  *
  * Based on code from fs/super.c, copyright Linus Torvalds and others.
  * Heavily rewritten.
@@ -2309,3 +2310,72 @@ void __put_mnt_ns(struct mnt_namespace *ns)
 	release_mounts(&umount_list);
 	kfree(ns);
 }
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_mnt_ns(struct kstate_context *ctx, struct mnt_namespace *mnt_ns)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, mnt_ns, KSTATE_CTX_MNT_NS);
+	pr_debug("collect mnt_ns %p: rv %d\n", mnt_ns, rv);
+	return rv;
+}
+
+int kstate_collect_all_mnt_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = collect_mnt_ns(ctx, nsproxy->mnt_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_MNT_NS) {
+		struct mnt_namespace *mnt_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&mnt_ns->count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("mnt_ns %p has external references %lu:%u\n", mnt_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_mnt_ns(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct mnt_namespace *mnt_ns = obj->o_obj;
+	struct kstate_image_mnt_ns *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_MNT_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump mnt_ns %p: ref {%llu, %u}, rv %d\n", mnt_ns, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_mnt_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_MNT_NS) {
+		rv = dump_mnt_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+/* FIXME kstate_restore_mnt_ns() */
+#endif
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 43a1458..fa4921b 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -46,6 +46,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_NSPROXY	6
 #define KSTATE_OBJ_UTS_NS	7
 #define KSTATE_OBJ_IPC_NS	8
+#define KSTATE_OBJ_MNT_NS	9
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -213,6 +214,7 @@ struct kstate_image_nsproxy {
 
 	kstate_ref_t	ref_uts_ns;
 	kstate_ref_t	ref_ipc_ns;	/* KSTATE_REF_UNDEF if IPC_NS=n */
+	kstate_ref_t	ref_mnt_ns;
 } __packed;
 
 struct kstate_image_uts_ns {
@@ -229,4 +231,8 @@ struct kstate_image_uts_ns {
 struct kstate_image_ipc_ns {
 	struct kstate_object_header hdr;
 } __packed;
+
+struct kstate_image_mnt_ns {
+	struct kstate_object_header hdr;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 61bed98..43e2556 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -26,6 +26,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_IPC_NS,
 #endif
 	KSTATE_CTX_MM_STRUCT,
+	KSTATE_CTX_MNT_NS,
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_TASK_STRUCT,
 	KSTATE_CTX_UTS_NS,
@@ -96,6 +97,9 @@ static inline int kstate_dump_all_ipc_ns(struct kstate_context *ctx)
 }
 #endif
 
+int kstate_collect_all_mnt_ns(struct kstate_context *ctx);
+int kstate_dump_all_mnt_ns(struct kstate_context *ctx);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 1b03de5..0e3c74c 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -74,6 +74,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_ipc_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_mnt_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
@@ -136,6 +139,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_mnt_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_ipc_ns(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index c2449d5..700c0f4 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -2,6 +2,7 @@
 #include <linux/file.h>
 #include <linux/ipc_namespace.h>
 #include <linux/list.h>
+#include <linux/mnt_namespace.h>
 #include <linux/nsproxy.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -50,6 +51,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_MNT_NS) {
+		put_mnt_ns((struct mnt_namespace *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_NSPROXY) {
 		put_nsproxy((struct nsproxy *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 0056572..6b1ab4a 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -2,6 +2,7 @@
 #include <linux/fs.h>
 #include <linux/ipc_namespace.h>
 #include <linux/mm_types.h>
+#include <linux/mnt_namespace.h>
 #include <linux/nsproxy.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -43,6 +44,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_MM_STRUCT:
 		atomic_inc(&((struct mm_struct *)obj->o_obj)->mm_users);
 		break;
+	case KSTATE_CTX_MNT_NS:
+		get_mnt_ns((struct mnt_namespace *)obj->o_obj);
+		break;
 	case KSTATE_CTX_NSPROXY:
 		get_nsproxy((struct nsproxy *)obj->o_obj);
 		break;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 7cb82e3..4e22ec4 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -286,6 +286,8 @@ static int dump_nsproxy(struct kstate_context *ctx, struct kstate_object *obj)
 #else
 	i->ref_ipc_ns = KSTATE_REF_UNDEF;
 #endif
+	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->mnt_ns, KSTATE_CTX_MNT_NS);
+	i->ref_mnt_ns = tmp->o_ref;
 
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
@@ -373,11 +375,22 @@ static int restore_ipc_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct
 }
 #endif
 
+static int restore_mnt_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	struct mnt_namespace *mnt_ns;
+
+	/* FIXME */
+	mnt_ns = ctx->init_tsk->nsproxy->mnt_ns;
+
+	get_mnt_ns(mnt_ns);
+	nsproxy->mnt_ns = mnt_ns;
+	return 0;
+}
+
 int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_nsproxy *i;
 	struct nsproxy *nsproxy;
-	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns;
 #ifdef CONFIG_NET_NS
 	struct net *net_ns;
@@ -400,10 +413,9 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	rv = restore_ipc_ns(ctx, &i->ref_ipc_ns, nsproxy);
 	if (rv < 0)
 		goto out_ipc_ns;
-
-	mnt_ns = ctx->init_tsk->nsproxy->mnt_ns;
-	get_mnt_ns(mnt_ns);
-	nsproxy->mnt_ns = mnt_ns;
+	rv = restore_mnt_ns(ctx, &i->ref_mnt_ns, nsproxy);
+	if (rv < 0)
+		goto out_mnt_ns;
 
 	pid_ns = ctx->init_tsk->nsproxy->pid_ns;
 	nsproxy->pid_ns = get_pid_ns(pid_ns);
@@ -420,6 +432,8 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore nsproxy %p, ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_mnt_ns:
+	put_ipc_ns(nsproxy->ipc_ns);
 out_ipc_ns:
 	put_uts_ns(nsproxy->uts_ns);
 out_uts_ns:
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 30/38] C/R: checkpoint/restore struct pid_namespace
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (27 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 29/38] C/R: formally checkpoint/restore struct mnt_namespace Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 31/38] C/R: formally checkpoint/restore struct net_namespace Alexey Dobriyan
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Pid namespaces are hierarchical, so the parent pidns is dumped first.
On restore, the parent task is restored first and restores the parent
pidns for itself, so by the time a child restores its pidns, the
parent pidns already exists.

This fulfills loop protection and, given that every pidns has an init
task which pins it, even works reliably. :-)
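
The parent-first ordering can be illustrated with a small userspace
sketch. This is not the code from this patch: struct toy_pid_ns,
struct toy_image and toy_dump_pid_ns() are hypothetical stand-ins, and
the recursion is just one assumed way to guarantee that every ancestor
is written before its descendants.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct pid_namespace; not the kernel structure. */
struct toy_pid_ns {
	struct toy_pid_ns *parent;	/* NULL for the root pid_ns */
	int dumped;			/* set once written to the image */
	int id;
};

/* Hypothetical image: records only the order in which ids were dumped. */
struct toy_image {
	int order[16];
	size_t n;
};

/*
 * Dump ns after all of its ancestors.
 * Returns 0 on success, -1 if the toy image is full.
 */
int toy_dump_pid_ns(struct toy_image *img, struct toy_pid_ns *ns)
{
	assert(img != NULL && ns != NULL);

	if (ns->dumped)
		return 0;
	if (ns->parent) {
		/* The parent goes into the image first. */
		int rv = toy_dump_pid_ns(img, ns->parent);

		if (rv < 0)
			return rv;
	}
	if (img->n >= sizeof(img->order) / sizeof(img->order[0]))
		return -1;
	img->order[img->n++] = ns->id;
	ns->dumped = 1;
	return 0;
}
```

Because a parent always precedes its children in the image, a reader
walking the image front to back never meets a reference to a pidns that
has not been seen yet.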

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h   |    9 ++
 include/linux/kstate.h         |    5 +
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    6 ++
 kernel/kstate/kstate-object.c  |    4 +
 kernel/nsproxy.c               |   30 ++++++-
 kernel/pid_namespace.c         |  168 +++++++++++++++++++++++++++++++++++++++-
 7 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index fa4921b..64328e1 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -47,6 +47,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_UTS_NS	7
 #define KSTATE_OBJ_IPC_NS	8
 #define KSTATE_OBJ_MNT_NS	9
+#define KSTATE_OBJ_PID_NS	10
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -215,6 +216,7 @@ struct kstate_image_nsproxy {
 	kstate_ref_t	ref_uts_ns;
 	kstate_ref_t	ref_ipc_ns;	/* KSTATE_REF_UNDEF if IPC_NS=n */
 	kstate_ref_t	ref_mnt_ns;
+	kstate_ref_t	ref_pid_ns;
 } __packed;
 
 struct kstate_image_uts_ns {
@@ -235,4 +237,11 @@ struct kstate_image_ipc_ns {
 struct kstate_image_mnt_ns {
 	struct kstate_object_header hdr;
 } __packed;
+
+struct kstate_image_pid_ns {
+	struct kstate_object_header hdr;
+
+	kstate_ref_t	ref_parent;	/* KSTATE_REF_UNDEF if root pid_ns */
+	__u32		last_pid;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 43e2556..c925cef 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -28,6 +28,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_MM_STRUCT,
 	KSTATE_CTX_MNT_NS,
 	KSTATE_CTX_NSPROXY,
+	KSTATE_CTX_PID_NS,
 	KSTATE_CTX_TASK_STRUCT,
 	KSTATE_CTX_UTS_NS,
 	NR_KSTATE_CTX_TYPES
@@ -100,6 +101,10 @@ static inline int kstate_dump_all_ipc_ns(struct kstate_context *ctx)
 int kstate_collect_all_mnt_ns(struct kstate_context *ctx);
 int kstate_dump_all_mnt_ns(struct kstate_context *ctx);
 
+int kstate_collect_all_pid_ns(struct kstate_context *ctx);
+int kstate_dump_all_pid_ns(struct kstate_context *ctx);
+int kstate_restore_pid_ns(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 0e3c74c..cbaf038 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -77,6 +77,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_mnt_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_pid_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
@@ -133,6 +136,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_image_header(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_pid_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_file(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 700c0f4..7cd1f45 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -4,6 +4,7 @@
 #include <linux/list.h>
 #include <linux/mnt_namespace.h>
 #include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/utsname.h>
@@ -61,6 +62,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_PID_NS) {
+		put_pid_ns((struct pid_namespace *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_TASK_STRUCT) {
 		put_task_struct((struct task_struct *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 6b1ab4a..13bb75c 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -4,6 +4,7 @@
 #include <linux/mm_types.h>
 #include <linux/mnt_namespace.h>
 #include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/utsname.h>
@@ -50,6 +51,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_NSPROXY:
 		get_nsproxy((struct nsproxy *)obj->o_obj);
 		break;
+	case KSTATE_CTX_PID_NS:
+		get_pid_ns((struct pid_namespace *)obj->o_obj);
+		break;
 	case KSTATE_CTX_TASK_STRUCT:
 		get_task_struct((struct task_struct *)obj->o_obj);
 		break;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4e22ec4..0b1f66d 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -288,6 +288,8 @@ static int dump_nsproxy(struct kstate_context *ctx, struct kstate_object *obj)
 #endif
 	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->mnt_ns, KSTATE_CTX_MNT_NS);
 	i->ref_mnt_ns = tmp->o_ref;
+	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->pid_ns, KSTATE_CTX_PID_NS);
+	i->ref_pid_ns = tmp->o_ref;
 
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
@@ -387,11 +389,29 @@ static int restore_mnt_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct
 	return 0;
 }
 
+static int restore_pid_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	struct pid_namespace *pid_ns;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID_NS);
+	if (!tmp) {
+		rv = kstate_restore_pid_ns(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID_NS);
+	}
+	pid_ns = tmp->o_obj;
+
+	nsproxy->pid_ns = get_pid_ns(pid_ns);
+	return 0;
+}
+
 int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_nsproxy *i;
 	struct nsproxy *nsproxy;
-	struct pid_namespace *pid_ns;
 #ifdef CONFIG_NET_NS
 	struct net *net_ns;
 #endif
@@ -416,9 +436,9 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	rv = restore_mnt_ns(ctx, &i->ref_mnt_ns, nsproxy);
 	if (rv < 0)
 		goto out_mnt_ns;
-
-	pid_ns = ctx->init_tsk->nsproxy->pid_ns;
-	nsproxy->pid_ns = get_pid_ns(pid_ns);
+	rv = restore_pid_ns(ctx, &i->ref_pid_ns, nsproxy);
+	if (rv < 0)
+		goto out_pid_ns;
 
 #ifdef CONFIG_NET_NS
 	net_ns = ctx->init_tsk->nsproxy->net_ns;
@@ -432,6 +452,8 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore nsproxy %p, ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_pid_ns:
+	put_mnt_ns(nsproxy->mnt_ns);
 out_mnt_ns:
 	put_ipc_ns(nsproxy->ipc_ns);
 out_ipc_ns:
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 821722a..104bccb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -5,11 +5,12 @@
  *    (C) 2007 Pavel Emelyanov <xemul@openvz.org>, OpenVZ, SWsoft Inc.
  *    (C) 2007 Sukadev Bhattiprolu <sukadev@us.ibm.com>, IBM
  *     Many thanks to Oleg Nesterov for comments and help
- *
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  */
 
 #include <linux/pid.h>
 #include <linux/pid_namespace.h>
+#include <linux/proc_fs.h>
 #include <linux/syscalls.h>
 #include <linux/err.h>
 #include <linux/acct.h>
@@ -191,3 +192,168 @@ static __init int pid_namespaces_init(void)
 }
 
 __initcall(pid_namespaces_init);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_pid_ns(struct pid_namespace *pid_ns)
+{
+#ifdef CONFIG_BSD_PROCESS_ACCT
+	if (pid_ns->bacct) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
+static int collect_pid_ns(struct kstate_context *ctx, struct pid_namespace *pid_ns)
+{
+	int rv;
+
+	rv = check_pid_ns(pid_ns);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, pid_ns, KSTATE_CTX_PID_NS);
+	pr_debug("collect pid_ns %p: rv %d\n", pid_ns, rv);
+	return rv;
+}
+
+int kstate_collect_all_pid_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = collect_pid_ns(ctx, nsproxy->pid_ns);
+		if (rv < 0)
+			return rv;
+	}
+	/* FIXME check for external references */
+	return 0;
+}
+
+static int dump_pid_ns(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct pid_namespace *pid_ns = obj->o_obj;
+	struct kstate_image_pid_ns *i;
+	struct kstate_object *tmp;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_PID_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	if (pid_ns == ctx->init_tsk->nsproxy->pid_ns)
+		i->ref_parent = KSTATE_REF_UNDEF;
+	else {
+		tmp = find_kstate_obj_by_ptr(ctx, pid_ns->parent, KSTATE_CTX_PID_NS);
+		i->ref_parent = tmp->o_ref;
+	}
+	i->last_pid = pid_ns->last_pid;	/* see /proc/loadavg */
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump pid_ns %p: ref {%llu, %u}, rv %d\n", pid_ns, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_pid_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	unsigned int level, nr_dumped;
+	int rv;
+
+	/*
+	 * Unlike other namespaces, pid_ns are hierarchical via ->parent.
+	 *
+	 * Dump pid_ns in a fixed order so loops are cheap to detect on restore:
+	 * lower ->level goes first. For every pid_ns being restored (except
+	 * the root one), its parent pid_ns has already been restored.
+	 * An unresolved ->ref_parent reference is treated as an invalid image.
+	 *
+	 * pid_ns pins parent one, which means there are no holes in hierarchy
+	 * wrt ->level, which means loop is finite.
+	 */
+	level = ctx->init_tsk->nsproxy->pid_ns->level;
+	do {
+		nr_dumped = 0;
+		for_each_kstate_object(ctx, obj, KSTATE_CTX_PID_NS) {
+			struct pid_namespace *pid_ns = obj->o_obj;
+
+			if (pid_ns->level != level)
+				continue;
+			rv = dump_pid_ns(ctx, obj);
+			if (rv < 0)
+				return rv;
+			nr_dumped++;
+		}
+		level++;
+	} while (nr_dumped > 0);
+	return 0;
+}
+
+int kstate_restore_pid_ns(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_pid_ns *i;
+	struct pid_namespace *parent_pid_ns, *pid_ns;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_PID_NS, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	if (kstate_ref_undefined(&i->ref_parent))
+		parent_pid_ns = ctx->init_tsk->nsproxy->pid_ns;
+	else {
+		struct kstate_object *tmp;
+
+		/*
+		 * Every pid_ns had its child reaper alive, so the parent is
+		 * already restored. Otherwise restore would have to restore
+		 * the parent pid_ns here, check for loops, fight recursion...
+		 */
+		tmp = find_kstate_obj_by_ref(ctx, &i->ref_parent, KSTATE_CTX_PID_NS);
+		if (!tmp) {
+			rv = -EINVAL;
+			goto out_free_image;
+		}
+		parent_pid_ns = tmp->o_obj;
+	}
+
+	pid_ns = create_pid_namespace(parent_pid_ns);
+	if (IS_ERR(pid_ns)) {
+		rv = PTR_ERR(pid_ns);
+		goto out_free_image;
+	}
+	rv = pid_ns_prepare_proc(pid_ns);
+	if (rv < 0)
+		goto out_put_pid_ns;
+
+	pid_ns->last_pid = i->last_pid;
+	/*
+	 * Depend on child reaper being the first task restoring pid_ns and
+	 * pid_ns restore being done in current context.
+	 */
+	pid_ns->child_reaper = current;
+
+	rv = kstate_restore_object(ctx, pid_ns, KSTATE_CTX_PID_NS, ref);
+	if (rv < 0)
+		goto out_release_proc;
+	kfree(i);
+	pr_debug("restore pid_ns %p: ref {%llu, %u}, rv %d\n", pid_ns, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_release_proc:
+	pid_ns_release_proc(pid_ns);
+out_put_pid_ns:
+	put_pid_ns(pid_ns);
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5



* [PATCH 31/38] C/R: formally checkpoint/restore struct net_namespace
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (28 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 30/38] C/R: checkpoint/restore struct pid_namespace Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 32/38] C/R: checkpoint/restore struct cred Alexey Dobriyan
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
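A note on the config matrix: restore_net_ns() in this patch distinguishes four
combinations of CONFIG_NET_NS at checkpoint time and at restore time. The
user-space sketch below only summarises them; the enum and the function name
are made up for illustration and are not part of the patch.

```c
/* Hypothetical names, for illustration only; not part of the patch. */
enum net_ns_action {
	NET_NS_NOTHING,		/* NET_NS=n => NET_NS=n: nothing to restore */
	NET_NS_CREATE_FRESH,	/* NET_NS=n => NET_NS=y: fresh net_create() */
	NET_NS_FROM_IMAGE,	/* NET_NS=y => NET_NS=y: look up or restore by ref */
	NET_NS_REJECT,		/* NET_NS=y => NET_NS=n: -EINVAL */
};

/*
 * image_has_ref:     the image carries a defined ref_net_ns
 *                    (i.e. it was taken on a NET_NS=y kernel).
 * kernel_has_net_ns: the restoring kernel is built with NET_NS=y.
 */
static enum net_ns_action net_ns_restore_action(int image_has_ref,
						int kernel_has_net_ns)
{
	if (kernel_has_net_ns)
		return image_has_ref ? NET_NS_FROM_IMAGE : NET_NS_CREATE_FRESH;
	return image_has_ref ? NET_NS_REJECT : NET_NS_NOTHING;
}
```

Only the NET_NS=y => NET_NS=n direction is refused; the opposite direction
gets a fresh namespace on the assumption that nobody depends on NET_NS being
absent.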
 include/linux/kstate-image.h   |    6 ++
 include/linux/kstate.h         |   19 +++++++
 include/net/net_namespace.h    |    1 +
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    8 +++
 kernel/kstate/kstate-object.c  |    6 ++
 kernel/nsproxy.c               |   66 +++++++++++++++++++++++---
 net/core/net_namespace.c       |  102 +++++++++++++++++++++++++++++++++++++++-
 8 files changed, 205 insertions(+), 9 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 64328e1..8119f2b 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -48,6 +48,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_IPC_NS	8
 #define KSTATE_OBJ_MNT_NS	9
 #define KSTATE_OBJ_PID_NS	10
+#define KSTATE_OBJ_NET_NS	11
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -217,6 +218,7 @@ struct kstate_image_nsproxy {
 	kstate_ref_t	ref_ipc_ns;	/* KSTATE_REF_UNDEF if IPC_NS=n */
 	kstate_ref_t	ref_mnt_ns;
 	kstate_ref_t	ref_pid_ns;
+	kstate_ref_t	ref_net_ns;	/* KSTATE_REF_UNDEF if NET_NS=n */
 } __packed;
 
 struct kstate_image_uts_ns {
@@ -244,4 +246,8 @@ struct kstate_image_pid_ns {
 	kstate_ref_t	ref_parent;	/* KSTATE_REF_UNDEF if root pid_ns */
 	__u32		last_pid;
 } __packed;
+
+struct kstate_image_net_ns {
+	struct kstate_object_header hdr;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index c925cef..a2bea1a 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -27,6 +27,9 @@ enum kstate_context_obj_type {
 #endif
 	KSTATE_CTX_MM_STRUCT,
 	KSTATE_CTX_MNT_NS,
+#ifdef CONFIG_NET_NS
+	KSTATE_CTX_NET_NS,
+#endif
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_PID_NS,
 	KSTATE_CTX_TASK_STRUCT,
@@ -105,6 +108,22 @@ int kstate_collect_all_pid_ns(struct kstate_context *ctx);
 int kstate_dump_all_pid_ns(struct kstate_context *ctx);
 int kstate_restore_pid_ns(struct kstate_context *ctx, kstate_ref_t *ref);
 
+#ifdef CONFIG_NET_NS
+int kstate_collect_all_net_ns(struct kstate_context *ctx);
+int kstate_dump_all_net_ns(struct kstate_context *ctx);
+int kstate_restore_net_ns(struct kstate_context *ctx, kstate_ref_t *ref);
+#else
+static inline int kstate_collect_all_net_ns(struct kstate_context *ctx)
+{
+	return 0;
+}
+
+static inline int kstate_dump_all_net_ns(struct kstate_context *ctx)
+{
+	return 0;
+}
+#endif
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index ded434b..a48e765 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -107,6 +107,7 @@ static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
 extern struct list_head net_namespace_list;
 
 #ifdef CONFIG_NET_NS
+extern struct net *net_create(void);
 extern void __put_net(struct net *net);
 
 static inline struct net *get_net(struct net *net)
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index cbaf038..620998a 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -80,6 +80,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_pid_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_net_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
@@ -145,6 +148,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_net_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_mnt_ns(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 7cd1f45..423b45b 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/utsname.h>
+#include <net/net_namespace.h>
 
 #include <linux/kstate.h>
 
@@ -57,6 +58,13 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+#ifdef CONFIG_NET_NS
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_NET_NS) {
+		put_net((struct net *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
+#endif
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_NSPROXY) {
 		put_nsproxy((struct nsproxy *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 13bb75c..d9b36fa 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/utsname.h>
+#include <net/net_namespace.h>
 
 #include <linux/kstate.h>
 #include <linux/kstate-image.h>
@@ -48,6 +49,11 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_MNT_NS:
 		get_mnt_ns((struct mnt_namespace *)obj->o_obj);
 		break;
+#ifdef CONFIG_NET_NS
+	case KSTATE_CTX_NET_NS:
+		get_net((struct net *)obj->o_obj);
+		break;
+#endif
 	case KSTATE_CTX_NSPROXY:
 		get_nsproxy((struct nsproxy *)obj->o_obj);
 		break;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 0b1f66d..85c6677 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -290,6 +290,12 @@ static int dump_nsproxy(struct kstate_context *ctx, struct kstate_object *obj)
 	i->ref_mnt_ns = tmp->o_ref;
 	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->pid_ns, KSTATE_CTX_PID_NS);
 	i->ref_pid_ns = tmp->o_ref;
+#ifdef CONFIG_NET_NS
+	tmp = find_kstate_obj_by_ptr(ctx, nsproxy->net_ns, KSTATE_CTX_NET_NS);
+	i->ref_net_ns = tmp->o_ref;
+#else
+	i->ref_net_ns = KSTATE_REF_UNDEF;
+#endif
 
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
@@ -408,13 +414,57 @@ static int restore_pid_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct
 	return 0;
 }
 
+#ifdef CONFIG_NET_NS
+static int restore_net_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	struct net *net_ns;
+	struct kstate_object *tmp;
+	int rv;
+
+	if (kstate_ref_undefined(ref)) {
+		/*
+		 * NET_NS=n => NET_NS=y case: hope nobody is crazy enough
+		 * to depend on NET_NS absence.
+		 */
+		net_ns = net_create();
+		if (IS_ERR(net_ns))
+			return PTR_ERR(net_ns);
+		rv = kstate_restore_object(ctx, net_ns, KSTATE_CTX_NET_NS, ref);
+		if (rv < 0) {
+			put_net(net_ns);
+			return rv;
+		}
+		nsproxy->net_ns = net_ns;
+		return 0;
+	}
+	/* NET_NS=y => NET_NS=y case. */
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_NET_NS);
+	if (!tmp) {
+		rv = kstate_restore_net_ns(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_NET_NS);
+	}
+	net_ns = tmp->o_obj;
+
+	nsproxy->net_ns = get_net(net_ns);
+	return 0;
+}
+#else
+static int restore_net_ns(struct kstate_context *ctx, kstate_ref_t *ref, struct nsproxy *nsproxy)
+{
+	/* NET_NS=n => NET_NS=n case. */
+	if (kstate_ref_undefined(ref))
+		return 0;
+	/* NET_NS=y => NET_NS=n case. */
+	return -EINVAL;
+}
+#endif
+
 int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_nsproxy *i;
 	struct nsproxy *nsproxy;
-#ifdef CONFIG_NET_NS
-	struct net *net_ns;
-#endif
 	int rv;
 
 	i = kstate_read_image(ctx, ref,  KSTATE_OBJ_NSPROXY, sizeof(*i));
@@ -439,11 +489,9 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	rv = restore_pid_ns(ctx, &i->ref_pid_ns, nsproxy);
 	if (rv < 0)
 		goto out_pid_ns;
-
-#ifdef CONFIG_NET_NS
-	net_ns = ctx->init_tsk->nsproxy->net_ns;
-	nsproxy->net_ns = get_net(net_ns);
-#endif
+	rv = restore_net_ns(ctx, &i->ref_net_ns, nsproxy);
+	if (rv < 0)
+		goto out_net_ns;
 	kfree(i);
 
 	rv = kstate_restore_object(ctx, nsproxy, KSTATE_CTX_NSPROXY, ref);
@@ -452,6 +500,8 @@ int kstate_restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore nsproxy %p, ref {%llu, %u}, rv %d\n", nsproxy, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_net_ns:
+	put_pid_ns(nsproxy->pid_ns);
 out_pid_ns:
 	put_mnt_ns(nsproxy->mnt_ns);
 out_mnt_ns:
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6b3edc9..f0073db 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -1,8 +1,10 @@
+/* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #include <linux/workqueue.h>
 #include <linux/rtnetlink.h>
 #include <linux/cache.h>
 #include <linux/slab.h>
 #include <linux/list.h>
+#include <linux/nsproxy.h>
 #include <linux/delay.h>
 #include <linux/sched.h>
 #include <linux/idr.h>
@@ -115,7 +117,7 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
-static struct net *net_create(void)
+struct net *net_create(void)
 {
 	struct net *net;
 	int rv;
@@ -500,3 +502,101 @@ assign:
 	return 0;
 }
 EXPORT_SYMBOL_GPL(net_assign_generic);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_net_ns(struct kstate_context *ctx, struct net *net_ns)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, net_ns, KSTATE_CTX_NET_NS);
+	pr_debug("collect net_ns %p: rv %d\n", net_ns, rv);
+	return rv;
+}
+
+int kstate_collect_all_net_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = collect_net_ns(ctx, nsproxy->net_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NET_NS) {
+		struct net *net_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&net_ns->count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("net_ns %p has external references %lu:%u\n", net_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_net_ns(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct net *net_ns = obj->o_obj;
+	struct kstate_image_net_ns *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_NET_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump net_ns %p: ref {%llu, %u}, rv %d\n", net_ns, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_net_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_NET_NS) {
+		rv = dump_net_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_net_ns(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_net_ns *i;
+	struct net *net_ns;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_NET_NS, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	net_ns = net_create();
+	if (IS_ERR(net_ns)) {
+		rv = PTR_ERR(net_ns);
+		goto out_free_image;
+	}
+
+	/* FIXME */
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, net_ns, KSTATE_CTX_NET_NS, ref);
+	if (rv < 0)
+		put_net(net_ns);
+	pr_debug("restore net_ns %p: ref {%llu, %u}, rv %d\n", net_ns, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5



* [PATCH 32/38] C/R: checkpoint/restore struct cred
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (29 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 31/38] C/R: formally checkpoint/restore struct net_namespace Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 33/38] C/R: checkpoint/restore aux groups (structy group_info) Alexey Dobriyan
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

->group_info and ->user will be handled in the next patches.

FIXME: the checks are insufficient, at least for capabilities.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h   |   25 ++++++
 include/linux/kstate.h         |    5 +
 kernel/cred.c                  |  164 ++++++++++++++++++++++++++++++++++++++++
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    5 +
 kernel/kstate/kstate-file.c    |   30 +++++++
 kernel/kstate/kstate-object.c  |    3 +
 kernel/kstate/kstate-task.c    |   55 +++++++++++++
 8 files changed, 293 insertions(+), 0 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 8119f2b..3e768ca 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -49,6 +49,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_MNT_NS	9
 #define KSTATE_OBJ_PID_NS	10
 #define KSTATE_OBJ_NET_NS	11
+#define KSTATE_OBJ_CRED		12
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -73,6 +74,9 @@ struct kstate_image_task_struct {
 	kstate_ref_t	ref_mm;
 	kstate_ref_t	ref_nsproxy;
 
+	kstate_ref_t	ref_real_cred;
+	kstate_ref_t	ref_cred;
+
 	__u8		comm[16];
 
 	/* Native arch of task, one of KSTATE_ARCH_*. */
@@ -207,6 +211,7 @@ struct kstate_image_file {
 	__u32		i_mode;
 	__u32		f_flags;
 	__u64		f_pos;
+	kstate_ref_t	ref_f_cred;
 	__u32		name_len;	/* including NUL */
 	/* __u8	name[name_len] */
 } __packed;
@@ -250,4 +255,24 @@ struct kstate_image_pid_ns {
 struct kstate_image_net_ns {
 	struct kstate_object_header hdr;
 } __packed;
+
+typedef __u64 kstate_cap_t;
+struct kstate_image_cred {
+	struct kstate_object_header hdr;
+
+	__u32		uid;
+	__u32		gid;
+	__u32		suid;
+	__u32		sgid;
+	__u32		euid;
+	__u32		egid;
+	__u32		fsuid;
+	__u32		fsgid;
+	__u32		securebits;
+
+	kstate_cap_t	cap_inheritable;
+	kstate_cap_t	cap_permitted;
+	kstate_cap_t	cap_effective;
+	kstate_cap_t	cap_bset;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index a2bea1a..bbed87b 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -21,6 +21,7 @@ struct kstate_object {
 
 /* Not visible to userspace! */
 enum kstate_context_obj_type {
+	KSTATE_CTX_CRED,
 	KSTATE_CTX_FILE,
 #ifdef CONFIG_IPC_NS
 	KSTATE_CTX_IPC_NS,
@@ -124,6 +125,10 @@ static inline int kstate_dump_all_net_ns(struct kstate_context *ctx)
 }
 #endif
 
+int kstate_collect_all_cred(struct kstate_context *ctx);
+int kstate_dump_all_cred(struct kstate_context *ctx);
+int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/cred.c b/kernel/cred.c
index 3a03918..9152c90 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -2,6 +2,7 @@
  *
  * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public Licence
@@ -589,3 +590,166 @@ int set_create_files_as(struct cred *new, struct inode *inode)
 	return security_kernel_create_files_as(new, inode);
 }
 EXPORT_SYMBOL(set_create_files_as);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_cred(struct cred *cred)
+{
+#ifdef CONFIG_KEYS
+	if (cred->thread_keyring || cred->request_key_auth || cred->tgcred) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+#ifdef CONFIG_SECURITY
+	if (cred->security) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
+static int collect_cred(struct kstate_context *ctx, struct cred *cred)
+{
+	int rv;
+
+	rv = check_cred(cred);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, cred, KSTATE_CTX_CRED);
+	pr_debug("collect cred %p: rv %d\n", cred, rv);
+	return rv;
+}
+
+int kstate_collect_all_cred(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = collect_cred(ctx, (struct cred *)tsk->real_cred);
+		if (rv < 0)
+			return rv;
+		rv = collect_cred(ctx, (struct cred *)tsk->cred);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_FILE) {
+		struct file *file = obj->o_obj;
+
+		rv = collect_cred(ctx, (struct cred *)file->f_cred);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_CRED) {
+		struct cred *cred = obj->o_obj;
+		unsigned int cnt = atomic_read(&cred->usage);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("cred %p has external references %lu:%u\n", cred, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_cred(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct cred *cred = obj->o_obj;
+	struct kstate_image_cred *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_CRED, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	i->uid = cred->uid;
+	i->gid = cred->gid;
+	i->suid = cred->suid;
+	i->sgid = cred->sgid;
+	i->euid = cred->euid;
+	i->egid = cred->egid;
+	i->fsuid = cred->fsuid;
+	i->fsgid = cred->fsgid;
+	i->securebits = cred->securebits;
+
+	BUILD_BUG_ON(sizeof(kstate_cap_t) != sizeof(kernel_cap_t));
+	memcpy(&i->cap_inheritable, &cred->cap_inheritable, sizeof(kstate_cap_t));
+	memcpy(&i->cap_permitted, &cred->cap_permitted, sizeof(kstate_cap_t));
+	memcpy(&i->cap_effective, &cred->cap_effective, sizeof(kstate_cap_t));
+	memcpy(&i->cap_bset, &cred->cap_bset, sizeof(kstate_cap_t));
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump cred %p: ref {%llu, %u}, rv %d\n", cred, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_cred(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_CRED) {
+		rv = dump_cred(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_cred *i;
+	struct cred *cred;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_CRED, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	cred = kmem_cache_zalloc(cred_jar, GFP_KERNEL);
+	if (!cred) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	atomic_set(&cred->usage, 1);
+
+	cred->uid = i->uid;
+	cred->gid = i->gid;
+	cred->suid = i->suid;
+	cred->sgid = i->sgid;
+	cred->euid = i->euid;
+	cred->egid = i->egid;
+	cred->fsuid = i->fsuid;
+	cred->fsgid = i->fsgid;
+	cred->securebits = i->securebits;
+
+	memcpy(&cred->cap_inheritable, &i->cap_inheritable, sizeof(i->cap_inheritable));
+	memcpy(&cred->cap_permitted, &i->cap_permitted, sizeof(i->cap_permitted));
+	memcpy(&cred->cap_effective, &i->cap_effective, sizeof(i->cap_effective));
+	memcpy(&cred->cap_bset, &i->cap_bset, sizeof(i->cap_bset));
+
+	/* FIXME */
+	cred->group_info = get_group_info(&init_groups);
+	/* FIXME */
+	cred->user = get_uid(&root_user);
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, cred, KSTATE_CTX_CRED, ref);
+	if (rv < 0)
+		put_cred(cred);
+	pr_debug("restore cred %p: ref {%llu, %u}, rv %d\n", cred, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 620998a..b990fd3 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -89,6 +89,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_file(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_cred(ctx);
+	if (rv < 0)
+		return rv;
 	return 0;
 }
 
@@ -142,6 +145,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_pid_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_cred(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_file(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 423b45b..01b6e5f 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -36,6 +36,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 {
 	struct kstate_object *obj, *tmp;
 
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_CRED) {
+		put_cred((struct cred *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_FILE) {
 		fput((struct file *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-file.c b/kernel/kstate/kstate-file.c
index 8f678cd..f378be3 100644
--- a/kernel/kstate/kstate-file.c
+++ b/kernel/kstate/kstate-file.c
@@ -86,6 +86,7 @@ int generic_file_checkpoint(struct file *file, struct kstate_context *ctx)
 {
 	struct kstate_object *obj;
 	struct kstate_image_file *i;
+	struct kstate_object *tmp;
 	struct kstat stat;
 	char *buf, *name;
 	int rv;
@@ -102,6 +103,9 @@ int generic_file_checkpoint(struct file *file, struct kstate_context *ctx)
 	/* Assume seeking over file doesn't have history. */
 	i->f_pos = file->f_pos;
 
+	tmp = find_kstate_obj_by_ptr(ctx, file->f_cred, KSTATE_CTX_CRED);
+	i->ref_f_cred = tmp->o_ref;
+
 	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!buf) {
 		rv = -ENOMEM;
@@ -154,6 +158,28 @@ int kstate_dump_all_file(struct kstate_context *ctx)
 	return 0;
 }
 
+static int file_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref, struct file *file)
+{
+	struct cred *cred;
+	const struct cred *old_cred;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_CRED);
+	if (!tmp) {
+		rv = kstate_restore_cred(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_CRED);
+	}
+	cred = tmp->o_obj;
+
+	old_cred = file->f_cred;
+	rcu_assign_pointer(file->f_cred, get_cred(cred));
+	put_cred(old_cred);
+	return 0;
+}
+
 int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_file *i;
@@ -182,6 +208,10 @@ int kstate_restore_file(struct kstate_context *ctx, kstate_ref_t *ref)
 		rv = -EINVAL;
 		goto out_fput;
 	}
+
+	rv = file_restore_cred(ctx, &i->ref_f_cred, file);
+	if (rv < 0)
+		goto out_fput;
 	/* Assume seeking over file doesn't have history. */
 	if (vfs_llseek(file, i->f_pos, SEEK_SET) != i->f_pos) {
 		rv = -EINVAL;
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index d9b36fa..8000124 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -35,6 +35,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	list_add_tail(&obj->o_list, &ctx->obj[type]);
 
 	switch (type) {
+	case KSTATE_CTX_CRED:
+		get_cred((struct cred *)obj->o_obj);
+		break;
 	case KSTATE_CTX_FILE:
 		get_file((struct file *)obj->o_obj);
 		break;
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index de876fe..dc2387b 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -123,6 +123,11 @@ static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *ob
 	tmp = find_kstate_obj_by_ptr(ctx, tsk->nsproxy, KSTATE_CTX_NSPROXY);
 	i->ref_nsproxy = tmp->o_ref;
 
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->real_cred, KSTATE_CTX_CRED);
+	i->ref_real_cred = tmp->o_ref;
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->cred, KSTATE_CTX_CRED);
+	i->ref_cred = tmp->o_ref;
+
 	BUILD_BUG_ON(sizeof(i->comm) != sizeof(tsk->comm));
 	strlcpy((char *)i->comm, (const char *)tsk->comm, sizeof(i->comm));
 
@@ -149,6 +154,50 @@ int kstate_dump_all_task_struct(struct kstate_context *ctx)
 	return 0;
 }
 
+static int restore_real_cred(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct cred *cred;
+	const struct cred *old_cred;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_CRED);
+	if (!tmp) {
+		rv = kstate_restore_cred(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_CRED);
+	}
+	cred = tmp->o_obj;
+
+	old_cred = current->real_cred;
+	rcu_assign_pointer(current->real_cred, get_cred(cred));
+	put_cred(old_cred);
+	return 0;
+}
+
+static int restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct cred *cred;
+	const struct cred *old_cred;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_CRED);
+	if (!tmp) {
+		rv = kstate_restore_cred(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_CRED);
+	}
+	cred = tmp->o_obj;
+
+	old_cred = current->cred;
+	rcu_assign_pointer(current->cred, get_cred(cred));
+	put_cred(old_cred);
+	return 0;
+}
+
 static int task_restore_all_vma(struct kstate_context *ctx, kstate_pos_t pos)
 {
 	while (1) {
@@ -279,6 +328,12 @@ static int task_struct_restorer(void *_tsk_ctx)
 	rv = restore_nsproxy(ctx, &i->ref_nsproxy);
 	if (rv < 0)
 		goto out;
+	rv = restore_real_cred(ctx, &i->ref_real_cred);
+	if (rv < 0)
+		goto out;
+	rv = restore_cred(ctx, &i->ref_cred);
+	if (rv < 0)
+		goto out;
 
 out:
 	tsk_ctx->rv = rv;
-- 
1.5.6.5



* [PATCH 33/38] C/R: checkpoint/restore aux groups (struct group_info)
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (30 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 32/38] C/R: checkpoint/restore struct cred Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 34/38] C/R: checkpoint/restore struct user Alexey Dobriyan
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
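For readers following the image format, here is a minimal userspace sketch
of the group_info record (a fixed part followed by ngroups gids) and of the
checks the restore path in this patch applies: a bound on the count, an
exact match on record length, and strictly ascending gids. The ordering
check presumably matters because group lookups search a sorted list. All
sketch_* names and SKETCH_NGROUPS_MAX are illustrative assumptions, not
kernel identifiers, and the header is reduced to the one length field the
check needs.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NGROUPS_MAX 65536u /* assumption: stands in for NGROUPS_MAX */

/* Hypothetical, simplified mirror of the on-image record. */
struct sketch_group_image {
	uint32_t obj_len;	/* total record length in bytes */
	uint32_t ngroups;	/* number of gids that follow */
	/* uint32_t gid[ngroups]; */
};

/* Return 0 if the record is well formed, -1 otherwise. */
static int sketch_validate_group_image(const struct sketch_group_image *i)
{
	const uint32_t *gid = (const uint32_t *)(i + 1);
	uint32_t n;

	/* Reject absurd counts before using them in arithmetic. */
	if (i->ngroups > SKETCH_NGROUPS_MAX)
		return -1;
	/* The record must be exactly the fixed part plus the gid array. */
	if (i->obj_len != sizeof(*i) + (size_t)i->ngroups * sizeof(uint32_t))
		return -1;
	/* Gids must be strictly ascending: no duplicates, no disorder. */
	for (n = 1; n < i->ngroups; n++) {
		if (gid[n - 1] >= gid[n])
			return -1;
	}
	return 0;
}
```

Checking the count first keeps the length arithmetic small, and an empty
list (ngroups == 0) passes because the ordering loop never runs.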
 include/linux/kstate-image.h   |   10 +++
 include/linux/kstate.h         |    5 ++
 kernel/cred.c                  |   30 +++++++++-
 kernel/groups.c                |  125 ++++++++++++++++++++++++++++++++++++++++
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    5 ++
 kernel/kstate/kstate-object.c  |    3 +
 7 files changed, 182 insertions(+), 2 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 3e768ca..dfd71ea 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -50,6 +50,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_PID_NS	10
 #define KSTATE_OBJ_NET_NS	11
 #define KSTATE_OBJ_CRED		12
+#define KSTATE_OBJ_GROUP_INFO	13
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -274,5 +275,14 @@ struct kstate_image_cred {
 	kstate_cap_t	cap_permitted;
 	kstate_cap_t	cap_effective;
 	kstate_cap_t	cap_bset;
+
+	kstate_ref_t	ref_group_info;
+} __packed;
+
+struct kstate_image_group_info {
+	struct kstate_object_header hdr;
+
+	__u32		ngroups;
+	/* __u32	gid[ngroups]; */
 } __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index bbed87b..4df3bfa 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -23,6 +23,7 @@ struct kstate_object {
 enum kstate_context_obj_type {
 	KSTATE_CTX_CRED,
 	KSTATE_CTX_FILE,
+	KSTATE_CTX_GROUP_INFO,
 #ifdef CONFIG_IPC_NS
 	KSTATE_CTX_IPC_NS,
 #endif
@@ -129,6 +130,10 @@ int kstate_collect_all_cred(struct kstate_context *ctx);
 int kstate_dump_all_cred(struct kstate_context *ctx);
 int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_group_info(struct kstate_context *ctx);
+int kstate_dump_all_group_info(struct kstate_context *ctx);
+int kstate_restore_group_info(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/cred.c b/kernel/cred.c
index 9152c90..8543360 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -662,6 +662,7 @@ static int dump_cred(struct kstate_context *ctx, struct kstate_object *obj)
 {
 	struct cred *cred = obj->o_obj;
 	struct kstate_image_cred *i;
+	struct kstate_object *tmp;
 	int rv;
 
 	i = kstate_prepare_image(KSTATE_OBJ_CRED, sizeof(*i));
@@ -684,6 +685,9 @@ static int dump_cred(struct kstate_context *ctx, struct kstate_object *obj)
 	memcpy(&i->cap_effective, &cred->cap_effective, sizeof(kstate_cap_t));
 	memcpy(&i->cap_bset, &cred->cap_bset, sizeof(kstate_cap_t));
 
+	tmp = find_kstate_obj_by_ptr(ctx, cred->group_info, KSTATE_CTX_GROUP_INFO);
+	i->ref_group_info = tmp->o_ref;
+
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
 	pr_debug("dump cred %p: ref {%llu, %u}, rv %d\n", cred, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
@@ -703,6 +707,25 @@ int kstate_dump_all_cred(struct kstate_context *ctx)
 	return 0;
 }
 
+static int restore_group_info(struct kstate_context *ctx, kstate_ref_t *ref, struct cred *cred)
+{
+	struct group_info *gi;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_GROUP_INFO);
+	if (!tmp) {
+		rv = kstate_restore_group_info(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_GROUP_INFO);
+	}
+	gi = tmp->o_obj;
+
+	cred->group_info = get_group_info(gi);
+	return 0;
+}
+
 int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_cred *i;
@@ -735,8 +758,9 @@ int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
 	memcpy(&cred->cap_effective, &i->cap_effective, sizeof(i->cap_effective));
 	memcpy(&cred->cap_bset, &i->cap_bset, sizeof(i->cap_bset));
 
-	/* FIXME */
-	cred->group_info = get_group_info(&init_groups);
+	rv = restore_group_info(ctx, &i->ref_group_info, cred);
+	if (rv < 0)
+		goto out_group_info;
 	/* FIXME */
 	cred->user = get_uid(&root_user);
 	kfree(i);
@@ -747,6 +771,8 @@ int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore cred %p: ref {%llu, %u}, rv %d\n", cred, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_group_info:
+	kmem_cache_free(cred_jar, cred);
 out_free_image:
 	kfree(i);
 	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
diff --git a/kernel/groups.c b/kernel/groups.c
index 1b95b2f..706681a 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -1,5 +1,6 @@
 /*
  * Supplementary group IDs
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  */
 #include <linux/cred.h>
 #include <linux/module.h>
@@ -286,3 +287,127 @@ int in_egroup_p(gid_t grp)
 }
 
 EXPORT_SYMBOL(in_egroup_p);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_group_info(struct kstate_context *ctx, struct group_info *gi)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, gi, KSTATE_CTX_GROUP_INFO);
+	pr_debug("collect group_info %p: rv %d\n", gi, rv);
+	return rv;
+}
+
+int kstate_collect_all_group_info(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_CRED) {
+		struct cred *cred = obj->o_obj;
+
+		rv = collect_group_info(ctx, cred->group_info);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_GROUP_INFO) {
+		struct group_info *gi = obj->o_obj;
+		unsigned int cnt = atomic_read(&gi->usage);
+
+		if (obj->o_count + 1 != cnt && gi != &init_groups) {
+			pr_err("group_info %p has external references %lu:%u\n", gi, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_group_info(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct group_info *gi = obj->o_obj;
+	struct kstate_image_group_info *i;
+	__u32 *gid;
+	unsigned int image_len;
+	int n;
+	int rv;
+
+	image_len = sizeof(*i) + gi->ngroups * sizeof(__u32);
+	i = kstate_prepare_image(KSTATE_OBJ_GROUP_INFO, image_len);
+	if (!i)
+		return -ENOMEM;
+
+	i->ngroups = gi->ngroups;
+	gid = (__u32 *)(i + 1);
+	for (n = 0; n < gi->ngroups; n++)
+		gid[n] = GROUP_AT(gi, n);
+
+	rv = kstate_write_image(ctx, i, image_len, obj);
+	kfree(i);
+	pr_debug("dump group_info %p: ref {%llu, %u}, rv %d\n", gi, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_group_info(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_GROUP_INFO) {
+		rv = dump_group_info(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_group_info(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_group_info *i;
+	struct group_info *gi;
+	__u32 *gid;
+	int n;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_GROUP_INFO, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+	if (i->ngroups > NGROUPS_MAX) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	if (i->hdr.obj_len != sizeof(*i) + i->ngroups * sizeof(__u32)) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	gid = (__u32 *)(i + 1);
+	for (n = 1; n < i->ngroups; n++) {
+		if (gid[n - 1] >= gid[n]) {
+			rv = -EINVAL;
+			goto out_free_image;
+		}
+	}
+
+	gi = groups_alloc(i->ngroups);
+	if (!gi) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	for (n = 0; n < i->ngroups; n++)
+		GROUP_AT(gi, n) = gid[n];
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, gi, KSTATE_CTX_GROUP_INFO, ref);
+	if (rv < 0)
+		put_group_info(gi);
+	pr_debug("restore group_info %p: ref {%llu, %u}, rv %d\n", gi, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index b990fd3..1b352c8 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -92,6 +92,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_cred(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_group_info(ctx);
+	if (rv < 0)
+		return rv;
 	return 0;
 }
 
@@ -145,6 +148,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_pid_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_group_info(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_cred(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 01b6e5f..51b28c0 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -46,6 +46,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_GROUP_INFO) {
+		put_group_info((struct group_info *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 #ifdef CONFIG_IPC_NS
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_IPC_NS) {
 		put_ipc_ns((struct ipc_namespace *)obj->o_obj);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 8000124..f6ce7a2 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -41,6 +41,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_FILE:
 		get_file((struct file *)obj->o_obj);
 		break;
+	case KSTATE_CTX_GROUP_INFO:
+		get_group_info((struct group_info *)obj->o_obj);
+		break;
 #ifdef CONFIG_IPC_NS
 	case KSTATE_CTX_IPC_NS:
 		get_ipc_ns((struct ipc_namespace *)obj->o_obj);
-- 
1.5.6.5



* [PATCH 34/38] C/R: checkpoint/restore struct user
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (31 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 33/38] C/R: checkpoint/restore aux groups (struct group_info) Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 35/38] C/R: checkpoint/restore struct user_namespace Alexey Dobriyan
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Note: user_namespace is handled in the next patch, as is the ->creator loop
issue.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
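The restore helpers in this series (restore_user() here, restore_group_info()
and the cred helpers earlier) share one pattern: look the object up by its
image reference, restore it on a miss, then look it up again. A minimal
userspace sketch of that pattern follows. The sketch_* names, the linked
list, and matching on both pos and id are assumptions made for illustration;
they are not the kstate implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_ref {
	uint64_t pos;	/* position of the object image in the dump */
	uint32_t id;	/* object id */
};

struct sketch_obj {
	struct sketch_ref ref;
	struct sketch_obj *next;
};

struct sketch_ctx {
	struct sketch_obj *head;	/* objects restored so far */
	unsigned int nr_restored;	/* how many times restore really ran */
};

/* Look up an already restored object; NULL on a miss. */
static struct sketch_obj *sketch_find(struct sketch_ctx *ctx,
				      const struct sketch_ref *ref)
{
	struct sketch_obj *obj;

	for (obj = ctx->head; obj; obj = obj->next) {
		if (obj->ref.pos == ref->pos && obj->ref.id == ref->id)
			return obj;
	}
	return NULL;
}

/* Stand-in for reading the image and building the object. */
static int sketch_restore(struct sketch_ctx *ctx, const struct sketch_ref *ref)
{
	struct sketch_obj *obj = malloc(sizeof(*obj));

	if (!obj)
		return -1;
	obj->ref = *ref;
	obj->next = ctx->head;
	ctx->head = obj;
	ctx->nr_restored++;
	return 0;
}

/* Find by reference, restoring on first use only. */
static struct sketch_obj *sketch_get(struct sketch_ctx *ctx,
				     const struct sketch_ref *ref)
{
	struct sketch_obj *obj = sketch_find(ctx, ref);

	if (!obj) {
		if (sketch_restore(ctx, ref) < 0)
			return NULL;
		obj = sketch_find(ctx, ref);
	}
	return obj;
}
```

Because every user goes through the lookup first, two creds that referenced
the same object at checkpoint time end up sharing one restored object again.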
 include/linux/kstate-image.h   |    8 +++
 include/linux/kstate.h         |    5 ++
 kernel/cred.c                  |   28 +++++++++-
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    5 ++
 kernel/kstate/kstate-object.c  |    3 +
 kernel/user.c                  |  121 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 174 insertions(+), 2 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index dfd71ea..605a2b5 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -51,6 +51,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_NET_NS	11
 #define KSTATE_OBJ_CRED		12
 #define KSTATE_OBJ_GROUP_INFO	13
+#define KSTATE_OBJ_USER_STRUCT	14
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -277,6 +278,7 @@ struct kstate_image_cred {
 	kstate_cap_t	cap_bset;
 
 	kstate_ref_t	ref_group_info;
+	kstate_ref_t	ref_user;
 } __packed;
 
 struct kstate_image_group_info {
@@ -285,4 +287,10 @@ struct kstate_image_group_info {
 	__u32		ngroups;
 	/* __u32	gid[ngroups]; */
 } __packed;
+
+struct kstate_image_user_struct {
+	struct kstate_object_header hdr;
+
+	__u32		uid;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 4df3bfa..dd6b982 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -35,6 +35,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_PID_NS,
 	KSTATE_CTX_TASK_STRUCT,
+	KSTATE_CTX_USER_STRUCT,
 	KSTATE_CTX_UTS_NS,
 	NR_KSTATE_CTX_TYPES
 };
@@ -134,6 +135,10 @@ int kstate_collect_all_group_info(struct kstate_context *ctx);
 int kstate_dump_all_group_info(struct kstate_context *ctx);
 int kstate_restore_group_info(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_user_struct(struct kstate_context *ctx);
+int kstate_dump_all_user_struct(struct kstate_context *ctx);
+int kstate_restore_user_struct(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/cred.c b/kernel/cred.c
index 8543360..a16f357 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -687,6 +687,8 @@ static int dump_cred(struct kstate_context *ctx, struct kstate_object *obj)
 
 	tmp = find_kstate_obj_by_ptr(ctx, cred->group_info, KSTATE_CTX_GROUP_INFO);
 	i->ref_group_info = tmp->o_ref;
+	tmp = find_kstate_obj_by_ptr(ctx, cred->user, KSTATE_CTX_USER_STRUCT);
+	i->ref_user = tmp->o_ref;
 
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
 	kfree(i);
@@ -726,6 +728,25 @@ static int restore_group_info(struct kstate_context *ctx, kstate_ref_t *ref, str
 	return 0;
 }
 
+static int restore_user(struct kstate_context *ctx, kstate_ref_t *ref, struct cred *cred)
+{
+	struct user_struct *user;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_USER_STRUCT);
+	if (!tmp) {
+		rv = kstate_restore_user_struct(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_USER_STRUCT);
+	}
+	user = tmp->o_obj;
+
+	cred->user = get_uid(user);
+	return 0;
+}
+
 int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_cred *i;
@@ -761,8 +782,9 @@ int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
 	rv = restore_group_info(ctx, &i->ref_group_info, cred);
 	if (rv < 0)
 		goto out_group_info;
-	/* FIXME */
-	cred->user = get_uid(&root_user);
+	rv = restore_user(ctx, &i->ref_user, cred);
+	if (rv < 0)
+		goto out_user;
 	kfree(i);
 
 	rv = kstate_restore_object(ctx, cred, KSTATE_CTX_CRED, ref);
@@ -771,6 +793,8 @@ int kstate_restore_cred(struct kstate_context *ctx, kstate_ref_t *ref)
 	pr_debug("restore cred %p: ref {%llu, %u}, rv %d\n", cred, (unsigned long long)ref->pos, ref->id, rv);
 	return rv;
 
+out_user:
+	put_group_info(cred->group_info);
 out_group_info:
 	kmem_cache_free(cred_jar, cred);
 out_free_image:
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 1b352c8..a409577 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -95,6 +95,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_group_info(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_user_struct(ctx);
+	if (rv < 0)
+		return rv;
 	return 0;
 }
 
@@ -148,6 +151,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_pid_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_user_struct(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_group_info(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 51b28c0..854f971 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -90,6 +90,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_USER_STRUCT) {
+		free_uid((struct user_struct *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_UTS_NS) {
 		put_uts_ns((struct uts_namespace *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index f6ce7a2..75facda 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -69,6 +69,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_TASK_STRUCT:
 		get_task_struct((struct task_struct *)obj->o_obj);
 		break;
+	case KSTATE_CTX_USER_STRUCT:
+		get_uid((struct user_struct *)obj->o_obj);
+		break;
 	case KSTATE_CTX_UTS_NS:
 		get_uts_ns((struct uts_namespace *)obj->o_obj);
 		break;
diff --git a/kernel/user.c b/kernel/user.c
index 850e0ba..9fda1f0 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -2,6 +2,7 @@
  * The "user cache".
  *
  * (C) Copyright 1991-2000 Linus Torvalds
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  *
  * We have a per-user structure to keep track of how many
  * processes, files etc the user has claimed, in order to be
@@ -497,3 +498,123 @@ static int __init uid_cache_init(void)
 }
 
 module_init(uid_cache_init);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_user_struct(struct user_struct *user)
+{
+#ifdef CONFIG_INOTIFY_USER
+	if (atomic_read(&user->inotify_watches) != 0) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (atomic_read(&user->inotify_devs) != 0) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+#ifdef CONFIG_EPOLL
+	if (atomic_read(&user->epoll_watches) != 0) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+#ifdef CONFIG_KEYS
+	if (user->uid_keyring || user->session_keyring) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
+static int collect_user_struct(struct kstate_context *ctx, struct user_struct *user)
+{
+	int rv;
+
+	rv = check_user_struct(user);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, user, KSTATE_CTX_USER_STRUCT);
+	pr_debug("collect user_struct %p: rv %d\n", user, rv);
+	return rv;
+}
+
+int kstate_collect_all_user_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_CRED) {
+		struct cred *cred = obj->o_obj;
+
+		rv = collect_user_struct(ctx, cred->user);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+static int dump_user_struct(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct user_struct *user = obj->o_obj;
+	struct kstate_image_user_struct *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_USER_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	i->uid = user->uid;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump user_struct %p: ref {%llu, %u}, rv %d\n", user, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_user_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_USER_STRUCT) {
+		rv = dump_user_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_user_struct(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_user_struct *i;
+	struct user_struct *user;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_USER_STRUCT, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	/* FIXME */
+	user = alloc_uid(&init_user_ns, i->uid);
+	if (!user) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, user, KSTATE_CTX_USER_STRUCT, ref);
+	if (rv < 0)
+		free_uid(user);
+	pr_debug("restore user_struct %p: ref {%llu, %u}, rv %d\n", user, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5



* [PATCH 35/38] C/R: checkpoint/restore struct user_namespace
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (32 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 34/38] C/R: checkpoint/restore struct user Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 36/38] C/R: checkpoint/restore struct pid Alexey Dobriyan
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

This is the first reference loop: user->user_ns->creator, where ->creator
is itself a struct user_struct.

The user_ns image references the ->creator image, but only partially (by
id, without a position), because user_namespaces are dumped before
user_structs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
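The refcount check that this patch moves into kstate_collect_all_user_ns()
can be sketched in userspace as below. The sketch assumes that o_count holds
the number of references found inside the container and that collection
itself holds one more reference, which is where the "+ 1" comes from. All
sketch_* names are illustrative assumptions, not kernel identifiers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a struct user_struct seen during collection. */
struct sketch_user {
	unsigned int refcount;	/* live reference count on the object */
	unsigned long o_count;	/* references found inside the container */
	int collected;		/* non-zero if the user belongs to the container */
};

/* Hypothetical stand-in for a collected struct user_namespace. */
struct sketch_user_ns {
	struct sketch_user *creator;	/* may live outside the container */
};

/* Return 0 if no collected user has outside references, -1 otherwise. */
static int sketch_check_users(struct sketch_user_ns *ns, size_t nr_ns,
			      struct sketch_user *users, size_t nr_users)
{
	size_t n;

	/* Count each in-container creator once more; skip outside creators. */
	for (n = 0; n < nr_ns; n++) {
		if (ns[n].creator && ns[n].creator->collected)
			ns[n].creator->o_count++;
	}
	/* "+ 1" is the reference assumed to be held by collection itself. */
	for (n = 0; n < nr_users; n++) {
		if (users[n].o_count + 1 != users[n].refcount)
			return -1;	/* something outside holds a reference */
	}
	return 0;
}
```

Running the comparison before the creator pass would report a false leak
for every user that created a namespace, which is why the check cannot sit
in the user_struct collection step.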
 include/linux/kstate-image.h   |   12 +++
 include/linux/kstate.h         |    5 ++
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    6 ++
 kernel/kstate/kstate-object.c  |    4 +
 kernel/user.c                  |   21 +++++-
 kernel/user_namespace.c        |  146 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 198 insertions(+), 2 deletions(-)
 delete mode 100644 kernel/kstate/kstate-uts_ns.c

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 605a2b5..a573833 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -52,6 +52,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_CRED		12
 #define KSTATE_OBJ_GROUP_INFO	13
 #define KSTATE_OBJ_USER_STRUCT	14
+#define KSTATE_OBJ_USER_NS	15
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -291,6 +292,17 @@ struct kstate_image_group_info {
 struct kstate_image_user_struct {
 	struct kstate_object_header hdr;
 
+	kstate_ref_t	ref_user_ns;
 	__u32		uid;
 } __packed;
+
+struct kstate_image_user_ns {
+	struct kstate_object_header hdr;
+
+	/*
+	 * KSTATE_REF_UNDEF if user_ns creator user was outside of container,
+	 * otherwise partial {0, id} reference.
+	 */
+	kstate_ref_t	ref_creator;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index dd6b982..f0c8e09 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -35,6 +35,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_PID_NS,
 	KSTATE_CTX_TASK_STRUCT,
+	KSTATE_CTX_USER_NS,
 	KSTATE_CTX_USER_STRUCT,
 	KSTATE_CTX_UTS_NS,
 	NR_KSTATE_CTX_TYPES
@@ -139,6 +140,10 @@ int kstate_collect_all_user_struct(struct kstate_context *ctx);
 int kstate_dump_all_user_struct(struct kstate_context *ctx);
 int kstate_restore_user_struct(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_user_ns(struct kstate_context *ctx);
+int kstate_dump_all_user_ns(struct kstate_context *ctx);
+int kstate_restore_user_ns(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index a409577..3df776e 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -98,6 +98,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_user_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_user_ns(ctx);
+	if (rv < 0)
+		return rv;
 	return 0;
 }
 
@@ -151,6 +154,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_pid_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_user_ns(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_user_struct(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 854f971..f8168cc 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -7,6 +7,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/user_namespace.h>
 #include <linux/utsname.h>
 #include <net/net_namespace.h>
 
@@ -90,6 +91,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_USER_NS) {
+		put_user_ns((struct user_namespace *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_USER_STRUCT) {
 		free_uid((struct user_struct *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index 75facda..eb77027 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -7,6 +7,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/user_namespace.h>
 #include <linux/utsname.h>
 #include <net/net_namespace.h>
 
@@ -69,6 +70,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_TASK_STRUCT:
 		get_task_struct((struct task_struct *)obj->o_obj);
 		break;
+	case KSTATE_CTX_USER_NS:
+		get_user_ns((struct user_namespace *)obj->o_obj);
+		break;
 	case KSTATE_CTX_USER_STRUCT:
 		get_uid((struct user_struct *)obj->o_obj);
 		break;
diff --git a/kernel/kstate/kstate-uts_ns.c b/kernel/kstate/kstate-uts_ns.c
deleted file mode 100644
index e69de29..0000000
diff --git a/kernel/user.c b/kernel/user.c
index 9fda1f0..508c05d 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -554,6 +554,10 @@ int kstate_collect_all_user_struct(struct kstate_context *ctx)
 		if (rv < 0)
 			return rv;
 	}
+	/*
+	 * No refcount check here: user_ns->creator references aren't counted
+	 * yet, so the check would fire every time CLONE_NEWUSER is used.
+	 */
 	return 0;
 }
 
@@ -561,12 +565,15 @@ static int dump_user_struct(struct kstate_context *ctx, struct kstate_object *ob
 {
 	struct user_struct *user = obj->o_obj;
 	struct kstate_image_user_struct *i;
+	struct kstate_object *tmp;
 	int rv;
 
 	i = kstate_prepare_image(KSTATE_OBJ_USER_STRUCT, sizeof(*i));
 	if (!i)
 		return -ENOMEM;
 
+	tmp = find_kstate_obj_by_ptr(ctx, user->user_ns, KSTATE_CTX_USER_NS);
+	i->ref_user_ns = tmp->o_ref;
 	i->uid = user->uid;
 
 	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
@@ -592,14 +599,24 @@ int kstate_restore_user_struct(struct kstate_context *ctx, kstate_ref_t *ref)
 {
 	struct kstate_image_user_struct *i;
 	struct user_struct *user;
+	struct user_namespace *user_ns;
+	struct kstate_object *tmp;
 	int rv;
 
 	i = kstate_read_image(ctx, ref, KSTATE_OBJ_USER_STRUCT, sizeof(*i));
 	if (IS_ERR(i))
 		return PTR_ERR(i);
 
-	/* FIXME */
-	user = alloc_uid(&init_user_ns, i->uid);
+	tmp = find_kstate_obj_by_ref(ctx, &i->ref_user_ns, KSTATE_CTX_USER_NS);
+	if (!tmp) {
+		rv = kstate_restore_user_ns(ctx, &i->ref_user_ns);
+		if (rv < 0)
+			goto out_free_image;
+		tmp = find_kstate_obj_by_ref(ctx, &i->ref_user_ns, KSTATE_CTX_USER_NS);
+	}
+	user_ns = tmp->o_obj;
+
+	user = alloc_uid(user_ns, i->uid);
 	if (!user) {
 		rv = -ENOMEM;
 		goto out_free_image;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 076c7c8..04ef11d 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1,4 +1,6 @@
 /*
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
+ *
  *  This program is free software; you can redistribute it and/or
  *  modify it under the terms of the GNU General Public License as
  *  published by the Free Software Foundation, version 2 of the
@@ -82,3 +84,147 @@ void free_user_ns(struct kref *kref)
 	schedule_work(&ns->destroyer);
 }
 EXPORT_SYMBOL(free_user_ns);
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_user_ns(struct kstate_context *ctx, struct user_namespace *user_ns)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, user_ns, KSTATE_CTX_USER_NS);
+	pr_debug("collect user_ns %p: rv %d\n", user_ns, rv);
+	return rv;
+}
+
+int kstate_collect_all_user_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_USER_STRUCT) {
+		struct user_struct *user = obj->o_obj;
+
+		rv = collect_user_ns(ctx, user->user_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_USER_NS) {
+		struct user_namespace *user_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&user_ns->kref.refcount);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("user_ns %p has external references %lu:%u\n", user_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	/*
+	 * A user pins its user_ns, which pins user_ns->creator; that's why we
+	 * don't check for user refcount leaks right after collecting users.
+	 * Do it here, after counting each user_ns creator one more time,
+	 * except creators which are legitimately outside of the container.
+	 */
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_USER_NS) {
+		struct user_namespace *user_ns = obj->o_obj;
+		struct kstate_object *tmp;
+
+		tmp = find_kstate_obj_by_ptr(ctx, user_ns->creator, KSTATE_CTX_USER_STRUCT);
+		if (tmp)
+			tmp->o_count++;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_USER_STRUCT) {
+		struct user_struct *user = obj->o_obj;
+		unsigned int cnt = atomic_read(&user->__count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("user_struct %p has external references %lu:%u\n", user, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_user_ns(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct user_namespace *user_ns = obj->o_obj;
+	struct kstate_image_user_ns *i;
+	struct kstate_object *tmp;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_USER_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	tmp = find_kstate_obj_by_ptr(ctx, user_ns->creator, KSTATE_CTX_USER_STRUCT);
+	if (!tmp)
+		i->ref_creator = KSTATE_REF_UNDEF;
+	else
+		i->ref_creator = tmp->o_ref;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump user_ns %p: ref {%llu, %u}, rv %d\n", user_ns, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_user_ns(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_USER_NS) {
+		rv = dump_user_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_user_ns(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_user_ns *i;
+	struct user_namespace *user_ns;
+	int n;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_USER_NS, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	user_ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
+	if (!user_ns) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	kref_init(&user_ns->kref);
+	for (n = 0; n < UIDHASH_SZ; ++n)
+		INIT_HLIST_HEAD(user_ns->uidhash_table + n);
+	user_ns->creator = NULL;
+
+	if (kstate_ref_undefined(&i->ref_creator)) {
+		user_ns->creator = ctx->init_tsk->cred->user;
+	} else {
+		struct kstate_object *tmp;
+
+		tmp = find_kstate_obj_by_id(ctx, &i->ref_creator, KSTATE_CTX_USER_STRUCT);
+		if (!tmp) {
+			rv = -EINVAL;
+			goto out_free_image;
+		}
+		user_ns->creator = tmp->o_obj;
+	}
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, user_ns, KSTATE_CTX_USER_NS, ref);
+	if (rv < 0)
+		kfree(user_ns);
+	pr_debug("restore user_ns %p: ref {%llu, %u}, rv %d\n", user_ns, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5
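
The leak checks in kstate_collect_all_user_ns() above all compare
"o_count + 1" against the object's own reference count. A minimal userspace
model of that comparison is sketched below. The struct and function names are
hypothetical, and reading the "+ 1" as the reference taken by the checkpoint
context when it collected the object is an assumption, not something the patch
states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a collected object and its refcount. */
struct model_obj {
	unsigned long o_count;	/* references found while walking the container */
	unsigned int refcount;	/* value read from the object itself */
};

/*
 * Returns 1 if somebody outside the collected set still holds a reference.
 * The "+ 1" models the reference assumed to be held by the checkpoint
 * context itself.
 */
int has_external_refs(const struct model_obj *obj)
{
	assert(obj != NULL);
	return obj->o_count + 1 != obj->refcount;
}

/* Returns the index of the first leaking object, or -1 if none leaks. */
int first_leak(const struct model_obj *objs, size_t n)
{
	size_t k;

	for (k = 0; k < n; k++)
		if (has_external_refs(&objs[k]))
			return (int)k;
	return -1;
}
```

In this model, an object referenced twice inside the container with a
refcount of 3 passes, while the same object with a refcount of 4 is reported
as leaking.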


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 36/38] C/R: checkpoint/restore struct pid
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (33 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 35/38] C/R: checkpoint/restore struct user_namespace Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 37/38] C/R: checkpoint/restore opened files Alexey Dobriyan
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Deal with struct pid in general and task pids in particular.

Guess what: references to pids outside the container are banned. This means
that if a child is created with plain CLONE_NEWPID, its PIDTYPE_PGID
and PIDTYPE_SID pids will be outside of the newborn pidns.

On restore we don't know where to attach them, and they weren't saved at all,
so abort checkpointing in this case.

Newborn container inits should use setpgrp(2) and setsid(2)!
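
A minimal userspace sketch of what such an init could do first is given below.
It is an illustration only: fork() stands in for clone(CLONE_NEWPID), which
needs privileges, and the function names detach_session() and
child_detaches_ok() are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start a new session and process group, as a freshly created init
 * might do.  Returns 0 on success, -1 on failure.
 */
int detach_session(void)
{
	pid_t self = getpid();

	assert(self > 0);
	if (setsid() < 0)	/* fails if we already lead a process group */
		return -1;
	/* After setsid() both the session ID and the pgid equal our pid. */
	if (getsid(0) != self || getpgrp() != self)
		return -1;
	return 0;
}

/*
 * Run detach_session() in a forked child.
 * Returns 1 if the child managed to detach, 0 otherwise.
 */
int child_detaches_ok(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return 0;
	if (pid == 0)
		_exit(detach_session() == 0 ? 0 : 1);
	if (waitpid(pid, &status, 0) != pid)
		return 0;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

After setsid(2) succeeds, the caller leads both a new session and a new
process group, so setsid(2) alone is enough in this sketch.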

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h   |   13 +++
 include/linux/kstate.h         |    5 +
 include/linux/pid.h            |    2 +-
 kernel/fork.c                  |    2 +-
 kernel/kstate/cpt-sys.c        |    6 +
 kernel/kstate/kstate-context.c |    5 +
 kernel/kstate/kstate-object.c  |    3 +
 kernel/kstate/kstate-task.c    |   80 ++++++++++++++++
 kernel/pid.c                   |  199 +++++++++++++++++++++++++++++++++++++++-
 9 files changed, 308 insertions(+), 7 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index a573833..108bb2d 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -53,6 +53,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_GROUP_INFO	13
 #define KSTATE_OBJ_USER_STRUCT	14
 #define KSTATE_OBJ_USER_NS	15
+#define KSTATE_OBJ_PID		16
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -80,6 +81,10 @@ struct kstate_image_task_struct {
 	kstate_ref_t	ref_real_cred;
 	kstate_ref_t	ref_cred;
 
+	kstate_ref_t	ref_pid;
+	kstate_ref_t	ref_pgid;
+	kstate_ref_t	ref_sid;
+
 	__u8		comm[16];
 
 	/* Native arch of task, one of KSTATE_ARCH_*. */
@@ -305,4 +310,12 @@ struct kstate_image_user_ns {
 	 */
 	kstate_ref_t	ref_creator;
 } __packed;
+
+struct kstate_image_pid {
+	struct kstate_object_header hdr;
+
+	kstate_ref_t	ref_pid_ns;	/* last-level pid_ns */
+	__u32		level;
+	__u32		nr[1];
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index f0c8e09..99a4345 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -33,6 +33,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_NET_NS,
 #endif
 	KSTATE_CTX_NSPROXY,
+	KSTATE_CTX_PID,
 	KSTATE_CTX_PID_NS,
 	KSTATE_CTX_TASK_STRUCT,
 	KSTATE_CTX_USER_NS,
@@ -144,6 +145,10 @@ int kstate_collect_all_user_ns(struct kstate_context *ctx);
 int kstate_dump_all_user_ns(struct kstate_context *ctx);
 int kstate_restore_user_ns(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_pid(struct kstate_context *ctx);
+int kstate_dump_all_pid(struct kstate_context *ctx);
+int kstate_restore_pid(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..f775a85 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int *nr, unsigned int level);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index ed377ad..97521ab 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1117,7 +1117,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	if (pid != &init_struct_pid) {
 		retval = -ENOMEM;
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, NULL, 0);
 		if (!pid)
 			goto bad_fork_cleanup_io;
 
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 3df776e..119940d 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -101,6 +101,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_user_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_pid(ctx);
+	if (rv < 0)
+		return rv;
 	return 0;
 }
 
@@ -154,6 +157,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_pid_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_pid(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_user_ns(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index f8168cc..9acb441 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -81,6 +81,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_PID) {
+		put_pid((struct pid *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_PID_NS) {
 		put_pid_ns((struct pid_namespace *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index eb77027..ab026f0 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -64,6 +64,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_NSPROXY:
 		get_nsproxy((struct nsproxy *)obj->o_obj);
 		break;
+	case KSTATE_CTX_PID:
+		get_pid((struct pid *)obj->o_obj);
+		break;
 	case KSTATE_CTX_PID_NS:
 		get_pid_ns((struct pid_namespace *)obj->o_obj);
 		break;
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index dc2387b..4a3524e 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -128,6 +128,13 @@ static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *ob
 	tmp = find_kstate_obj_by_ptr(ctx, tsk->cred, KSTATE_CTX_CRED);
 	i->ref_cred = tmp->o_ref;
 
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->pids[PIDTYPE_PID].pid, KSTATE_CTX_PID);
+	i->ref_pid = tmp->o_ref;
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->pids[PIDTYPE_PGID].pid, KSTATE_CTX_PID);
+	i->ref_pgid = tmp->o_ref;
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->pids[PIDTYPE_SID].pid, KSTATE_CTX_PID);
+	i->ref_sid = tmp->o_ref;
+
 	BUILD_BUG_ON(sizeof(i->comm) != sizeof(tsk->comm));
 	strlcpy((char *)i->comm, (const char *)tsk->comm, sizeof(i->comm));
 
@@ -280,6 +287,70 @@ static int restore_nsproxy(struct kstate_context *ctx, kstate_ref_t *ref)
 	return 0;
 }
 
+static int restore_pid(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct pid *pid;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID);
+	if (!tmp) {
+		rv = kstate_restore_pid(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID);
+	}
+	pid = tmp->o_obj;
+
+	write_lock_irq(&tasklist_lock);
+	change_pid(current, PIDTYPE_PID, get_pid(pid));
+	current->pid = current->tgid = pid_nr(pid);
+	write_unlock_irq(&tasklist_lock);
+	return 0;
+}
+
+static int restore_pgid(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct pid *pid;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID);
+	if (!tmp) {
+		rv = kstate_restore_pid(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID);
+	}
+	pid = tmp->o_obj;
+
+	write_lock_irq(&tasklist_lock);
+	change_pid(current, PIDTYPE_PGID, pid);
+	write_unlock_irq(&tasklist_lock);
+	return 0;
+}
+
+static int restore_sid(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct pid *pid;
+	struct kstate_object *tmp;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID);
+	if (!tmp) {
+		rv = kstate_restore_pid(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_PID);
+	}
+	pid = tmp->o_obj;
+
+	write_lock_irq(&tasklist_lock);
+	change_pid(current, PIDTYPE_SID, pid);
+	write_unlock_irq(&tasklist_lock);
+	return 0;
+}
+
 struct task_struct_restore_context {
 	struct kstate_context *ctx;
 	struct kstate_image_task_struct *i;
@@ -334,6 +405,15 @@ static int task_struct_restorer(void *_tsk_ctx)
 	rv = restore_cred(ctx, &i->ref_cred);
 	if (rv < 0)
 		goto out;
+	rv = restore_pid(ctx, &i->ref_pid);
+	if (rv < 0)
+		goto out;
+	rv = restore_pgid(ctx, &i->ref_pgid);
+	if (rv < 0)
+		goto out;
+	rv = restore_sid(ctx, &i->ref_sid);
+	if (rv < 0)
+		goto out;
 
 out:
 	tsk_ctx->rv = rv;
diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..bacf279 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -23,6 +23,7 @@
  *    (C) 2007 Pavel Emelyanov <xemul@openvz.org>, OpenVZ, SWsoft Inc.
  *    (C) 2007 Sukadev Bhattiprolu <sukadev@us.ibm.com>, IBM
  *     Many thanks to Oleg Nesterov for comments and help
+ * Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  *
  */
 
@@ -182,6 +183,36 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+	if (!map->page) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it.
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (map->page)
+			kfree(page);
+		else
+			map->page = page;
+		spin_unlock_irq(&pidmap_lock);
+		if (unlikely(!map->page))
+			return -ENOMEM;
+	}
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+	atomic_dec(&map->nr_free);
+	return pid;
+}
+#endif
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
@@ -239,11 +270,12 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+/* Last level + 1 pid numbers are predefined. */
+struct pid *alloc_pid(struct pid_namespace *ns, int *nr, unsigned int level)
 {
 	struct pid *pid;
 	enum pid_type type;
-	int i, nr;
+	int i, pid_nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
 
@@ -253,11 +285,16 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
-		if (nr < 0)
+#ifdef CONFIG_CHECKPOINT
+		if (nr && ns->level - i <= level)
+			pid_nr = set_pidmap(tmp, nr[ns->level - i]);
+		else
+#endif
+			pid_nr = alloc_pidmap(tmp);
+		if (pid_nr < 0)
 			goto out_free;
 
-		pid->numbers[i].nr = nr;
+		pid->numbers[i].nr = pid_nr;
 		pid->numbers[i].ns = tmp;
 		tmp = tmp->parent;
 	}
@@ -537,3 +574,155 @@ void __init pidmap_init(void)
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC);
 }
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int collect_pid(struct kstate_context *ctx, struct pid *pid)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, pid, KSTATE_CTX_PID);
+	pr_debug("collect pid %p: rv %d\n", pid, rv);
+	return rv;
+}
+
+static int collect_task_pid(struct kstate_context *ctx, struct pid *pid)
+{
+	unsigned int level0, level;
+
+	level0 = ctx->init_tsk->nsproxy->pid_ns->level;
+	if (pid->level < level0) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	for (level = level0; level <= pid->level; level++) {
+		struct pid_namespace *pid_ns;
+		struct kstate_object *tmp;
+
+		pid_ns = pid->numbers[level].ns;
+		tmp = find_kstate_obj_by_ptr(ctx, pid_ns, KSTATE_CTX_PID_NS);
+		if (!tmp) {
+			WARN_ON(1);
+			return -EINVAL;
+		}
+	}
+	return collect_pid(ctx, pid);
+}
+
+int kstate_collect_all_pid(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = collect_task_pid(ctx, tsk->pids[PIDTYPE_PID].pid);
+		if (rv < 0)
+			return rv;
+		rv = collect_task_pid(ctx, tsk->pids[PIDTYPE_PGID].pid);
+		if (rv < 0)
+			return rv;
+		rv = collect_task_pid(ctx, tsk->pids[PIDTYPE_SID].pid);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+static int dump_pid(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct pid *pid = obj->o_obj;
+	struct kstate_image_pid *i;
+	struct kstate_object *tmp;
+	unsigned int level0, level;
+	unsigned int image_len;
+	int rv;
+
+	level0 = ctx->init_tsk->nsproxy->pid_ns->level;
+	image_len = sizeof(*i) + (pid->level - level0 + 1) * sizeof(__u32);
+	i = kstate_prepare_image(KSTATE_OBJ_PID, image_len);
+	if (!i)
+		return -ENOMEM;
+
+	tmp = find_kstate_obj_by_ptr(ctx, pid->numbers[pid->level].ns, KSTATE_CTX_PID_NS);
+	i->ref_pid_ns = tmp->o_ref;
+
+	i->level = pid->level - level0;
+	for (level = level0; level <= pid->level; level++)
+		i->nr[level - level0] = pid->numbers[level].nr;
+
+	rv = kstate_write_image(ctx, i, image_len, obj);
+	kfree(i);
+	pr_debug("dump pid %p: ref {%llu, %u}, rv %d\n", pid, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_pid(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_PID) {
+		rv = dump_pid(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_pid(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_pid *i;
+	struct pid *pid;
+	struct pid_namespace *pid_ns;
+	struct kstate_object *tmp;
+	unsigned int level0;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_PID, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+	if (i->level > ((__u32)-1 - sizeof(*i)) / sizeof(__u32) - 1) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+	if (i->hdr.obj_len != sizeof(*i) + (i->level + 1) * sizeof(__u32)) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+
+	tmp = find_kstate_obj_by_ref(ctx, &i->ref_pid_ns, KSTATE_CTX_PID_NS);
+	if (!tmp) {
+		rv = kstate_restore_pid_ns(ctx, &i->ref_pid_ns);
+		if (rv < 0)
+			goto out_free_image;
+		tmp = find_kstate_obj_by_ref(ctx, &i->ref_pid_ns, KSTATE_CTX_PID_NS);
+	}
+	pid_ns = tmp->o_obj;
+
+	level0 = ctx->init_tsk->nsproxy->pid_ns->level;
+	if (i->level >= pid_ns->level - level0) {
+		rv = -EINVAL;
+		goto out_free_image;
+	}
+
+	pid = alloc_pid(pid_ns, i->nr, i->level);
+	kfree(i);
+	if (!pid)
+		return -ENOMEM;
+
+	rv = kstate_restore_object(ctx, pid, KSTATE_CTX_PID, ref);
+	if (rv < 0)
+		put_pid(pid);
+	pr_debug("restore pid %p: ref {%llu, %u}, rv %d\n", pid, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 37/38] C/R: checkpoint/restore opened files
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (34 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 36/38] C/R: checkpoint/restore struct pid Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  4:55 ` [PATCH 38/38] C/R: checkpoint/restart struct sighand_struct Alexey Dobriyan
  2009-05-22  5:02 ` [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

A file descriptor is formally dumped as an object, even though it can't be
shared by itself; only files_struct can.

Restore opens the file, verifies that we opened the right file, restores the
file position, and nothing more.
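
A rough userspace analogue of those restore steps is sketched below:
reopen the file, check that it is the expected one, put it at the saved
descriptor number, then restore the offset and the close-on-exec flag. This is
an illustration, not the kernel path. The function names are hypothetical, the
inode comparison is an assumed way of "verifying", and dup2() silently
replaces an occupied slot, whereas the patch fails if alloc_fd() returns a
different number.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns want_fd on success, -1 on failure. */
int reopen_at_fd(const char *path, ino_t want_ino, int want_fd,
		 off_t pos, int cloexec)
{
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	/* Verify that we opened the right file (assumed check: inode). */
	if (fstat(fd, &st) < 0 || st.st_ino != want_ino)
		goto fail;
	/* Move the open file to the saved descriptor number. */
	if (fd != want_fd) {
		if (dup2(fd, want_fd) < 0)
			goto fail;
		close(fd);
		fd = want_fd;
	}
	/* Restore the file position and the close-on-exec flag. */
	if (lseek(fd, pos, SEEK_SET) != pos)
		goto fail;
	if (cloexec && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0)
		goto fail;
	return fd;
fail:
	close(fd);
	return -1;
}

/* Self-check on a scratch file; returns 1 if every step worked. */
int demo_reopen(void)
{
	char path[] = "/tmp/kstate-fd-XXXXXX";
	struct stat st;
	int ok, fd, tmp = mkstemp(path);

	if (tmp < 0)
		return 0;
	if (write(tmp, "abcdef", 6) != 6 || fstat(tmp, &st) < 0) {
		close(tmp);
		unlink(path);
		return 0;
	}
	close(tmp);
	fd = reopen_at_fd(path, st.st_ino, 100, 3, 1);
	ok = fd == 100 && lseek(fd, 0, SEEK_CUR) == 3 &&
	     (fcntl(fd, F_GETFD) & FD_CLOEXEC);
	if (fd >= 0)
		close(fd);
	unlink(path);
	return ok;
}
```

demo_reopen() runs the steps against a temporary file and reports whether
descriptor 100 ended up at offset 3 with FD_CLOEXEC set.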

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/kstate-image.h   |   16 +++
 include/linux/kstate.h         |    6 +
 kernel/kstate/cpt-sys.c        |    6 +
 kernel/kstate/kstate-context.c |    6 +
 kernel/kstate/kstate-file.c    |  201 ++++++++++++++++++++++++++++++++++++++++
 kernel/kstate/kstate-object.c  |    4 +
 kernel/kstate/kstate-task.c    |   54 +++++++++++
 7 files changed, 293 insertions(+), 0 deletions(-)

diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 108bb2d..700fc62 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -54,6 +54,8 @@ struct kstate_image_header {
 #define KSTATE_OBJ_USER_STRUCT	14
 #define KSTATE_OBJ_USER_NS	15
 #define KSTATE_OBJ_PID		16
+#define KSTATE_OBJ_FILES_STRUCT	17
+#define KSTATE_OBJ_FD		18
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -77,6 +79,7 @@ struct kstate_image_task_struct {
 
 	kstate_ref_t	ref_mm;
 	kstate_ref_t	ref_nsproxy;
+	kstate_ref_t	ref_files;
 
 	kstate_ref_t	ref_real_cred;
 	kstate_ref_t	ref_cred;
@@ -318,4 +321,17 @@ struct kstate_image_pid {
 	__u32		level;
 	__u32		nr[1];
 } __packed;
+
+struct kstate_image_files_struct {
+	struct kstate_object_header hdr;
+} __packed;
+
+struct kstate_image_fd {
+	struct kstate_object_header hdr;
+
+	kstate_ref_t	ref_file;
+	__u32		fd;
+#define KSTATE_FD_FLAGS_CLOEXEC		(1 << 0)
+	__u32		fd_flags;
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 99a4345..2473381 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -23,6 +23,7 @@ struct kstate_object {
 enum kstate_context_obj_type {
 	KSTATE_CTX_CRED,
 	KSTATE_CTX_FILE,
+	KSTATE_CTX_FILES_STRUCT,
 	KSTATE_CTX_GROUP_INFO,
 #ifdef CONFIG_IPC_NS
 	KSTATE_CTX_IPC_NS,
@@ -149,6 +150,11 @@ int kstate_collect_all_pid(struct kstate_context *ctx);
 int kstate_dump_all_pid(struct kstate_context *ctx);
 int kstate_restore_pid(struct kstate_context *ctx, kstate_ref_t *ref);
 
+int kstate_collect_all_files_struct(struct kstate_context *ctx);
+int kstate_dump_all_files_struct(struct kstate_context *ctx);
+int kstate_restore_files_struct(struct kstate_context *ctx, kstate_ref_t *ref);
+int kstate_restore_fd(struct kstate_context *ctx, kstate_pos_t pos);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 119940d..05fc9d8 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -86,6 +86,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_files_struct(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_file(ctx);
 	if (rv < 0)
 		return rv;
@@ -175,6 +178,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_file(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_files_struct(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_mm_struct(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 9acb441..3e1589f 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -1,4 +1,5 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/ipc_namespace.h>
 #include <linux/list.h>
@@ -47,6 +48,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_FILES_STRUCT) {
+		put_files_struct((struct files_struct *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_GROUP_INFO) {
 		put_group_info((struct group_info *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-file.c b/kernel/kstate/kstate-file.c
index f378be3..b11828b 100644
--- a/kernel/kstate/kstate-file.c
+++ b/kernel/kstate/kstate-file.c
@@ -58,6 +58,20 @@ int kstate_collect_all_file(struct kstate_context *ctx)
 	struct kstate_object *obj;
 	int rv;
 
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_FILES_STRUCT) {
+		struct files_struct *files = obj->o_obj;
+		struct file *file;
+		int fd;
+
+		for (fd = 0; fd < files->fdt->max_fds; fd++) {
+			file = fcheck_files(files, fd);
+			if (!file)
+				continue;
+			rv = collect_file(ctx, file);
+			if (rv < 0)
+				return rv;
+		}
+	}
 	for_each_kstate_object(ctx, obj, KSTATE_CTX_MM_STRUCT) {
 		struct mm_struct *mm = obj->o_obj;
 		struct vm_area_struct *vma;
@@ -232,3 +246,190 @@ out_free_image:
 	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
 	return rv;
 }
+
+static int collect_files_struct(struct kstate_context *ctx, struct files_struct *files)
+{
+	int rv;
+
+	rv = kstate_collect_object(ctx, files, KSTATE_CTX_FILES_STRUCT);
+	pr_debug("collect files_struct %p: rv %d\n", files, rv);
+	return rv;
+}
+
+int kstate_collect_all_files_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = collect_files_struct(ctx, tsk->files);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_FILES_STRUCT) {
+		struct files_struct *files = obj->o_obj;
+		unsigned int cnt = atomic_read(&files->count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("files_struct %p has external references %lu:%u\n", files, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_fd(struct kstate_context *ctx, struct files_struct *files, int fd, struct file *file)
+{
+	struct kstate_image_fd *i;
+	struct kstate_object *tmp;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_FD, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+	/*
+	 * fd doesn't get id because it can't be shared by itself,
+	 * only files_struct can. Assign some deterministic id.
+	 */
+	i->hdr.obj_id = 0;
+
+	tmp = find_kstate_obj_by_ptr(ctx, file, KSTATE_CTX_FILE);
+	i->ref_file = tmp->o_ref;
+	i->fd = fd;
+	i->fd_flags = 0;
+	if (FD_ISSET(fd, files->fdt->close_on_exec))
+		i->fd_flags |= KSTATE_FD_FLAGS_CLOEXEC;
+
+	rv = kstate_write(ctx, i, sizeof(*i));
+	kfree(i);
+	pr_debug("dump fd %d: rv %d, files_struct %p, file %p\n", fd, rv, files, file);
+	return rv;
+}
+
+static int dump_all_fd(struct kstate_context *ctx, struct files_struct *files)
+{
+	struct file *file;
+	int fd;
+	int rv;
+
+	for (fd = 0; fd < files->fdt->max_fds; fd++) {
+		file = fcheck_files(files, fd);
+		if (!file)
+			continue;
+		rv = dump_fd(ctx, files, fd, file);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+static int dump_files_struct(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct files_struct *files = obj->o_obj;
+	struct kstate_image_files_struct *i;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_FILES_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump files_struct %p: ref {%llu, %u}, rv %d\n", files, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_files_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_FILES_STRUCT) {
+		struct files_struct *files = obj->o_obj;
+
+		rv = dump_files_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+		rv = dump_all_fd(ctx, files);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_fd(struct kstate_context *ctx, kstate_pos_t pos)
+{
+	kstate_ref_t ref = { .pos = pos, .id = 0 };
+	struct kstate_image_fd *i;
+	struct file *file;
+	int fd;
+	unsigned int flags;
+	struct kstate_object *tmp;
+	int rv;
+
+	i = kstate_read_image(ctx, &ref, KSTATE_OBJ_FD, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	tmp = find_kstate_obj_by_ref(ctx, &i->ref_file, KSTATE_CTX_FILE);
+	if (!tmp) {
+		rv = kstate_restore_file(ctx, &i->ref_file);
+		if (rv < 0)
+			goto out_free_image;
+		tmp = find_kstate_obj_by_ref(ctx, &i->ref_file, KSTATE_CTX_FILE);
+	}
+	file = tmp->o_obj;
+
+	flags = 0;
+	if (i->fd_flags & KSTATE_FD_FLAGS_CLOEXEC)
+		flags |= O_CLOEXEC;
+	fd = alloc_fd(i->fd, flags);
+	if (fd != i->fd)
+		rv = (fd < 0) ? fd : -EINVAL;
+	else
+		rv = 0;
+	kfree(i);
+	if (rv < 0)
+		return rv;
+
+	get_file(file);
+	fd_install(fd, file);
+	return 0;
+
+out_free_image:
+	kfree(i);
+	return rv;
+}
+
+int kstate_restore_files_struct(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_files_struct *i;
+	struct files_struct *files;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_FILES_STRUCT, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	files = kmem_cache_zalloc(files_cachep, GFP_KERNEL);
+	if (!files) {
+		kfree(i);
+		return -ENOMEM;
+	}
+	atomic_set(&files->count, 1);
+	files->fdt = &files->fdtab;
+	files->fdt->max_fds = NR_OPEN_DEFAULT;
+	files->fdt->fd = files->fd_array;
+	files->fdt->close_on_exec = (fd_set *)&files->close_on_exec_init;
+	files->fdt->open_fds = (fd_set *)&files->open_fds_init;
+	spin_lock_init(&files->file_lock);
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, files, KSTATE_CTX_FILES_STRUCT, ref);
+	if (rv < 0)
+		kmem_cache_free(files_cachep, files);
+	pr_debug("restore files_struct %p: ref {%llu, %u}, rv %d\n", files, (unsigned long long)ref->pos, ref->id,  rv);
+	return rv;
+}
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index ab026f0..bc27985 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -1,4 +1,5 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/fdtable.h>
 #include <linux/fs.h>
 #include <linux/ipc_namespace.h>
 #include <linux/mm_types.h>
@@ -42,6 +43,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_FILE:
 		get_file((struct file *)obj->o_obj);
 		break;
+	case KSTATE_CTX_FILES_STRUCT:
+		atomic_inc(&((struct files_struct *)obj->o_obj)->count);
+		break;
 	case KSTATE_CTX_GROUP_INFO:
 		get_group_info((struct group_info *)obj->o_obj);
 		break;
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index 4a3524e..101fcb8 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -1,4 +1,5 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
+#include <linux/fdtable.h>
 #include <linux/kthread.h>
 #include <linux/nsproxy.h>
 #include <linux/pid_namespace.h>
@@ -135,6 +136,9 @@ static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *ob
 	tmp = find_kstate_obj_by_ptr(ctx, tsk->pids[PIDTYPE_SID].pid, KSTATE_CTX_PID);
 	i->ref_sid = tmp->o_ref;
 
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->files, KSTATE_CTX_FILES_STRUCT);
+	i->ref_files = tmp->o_ref;
+
 	BUILD_BUG_ON(sizeof(i->comm) != sizeof(tsk->comm));
 	strlcpy((char *)i->comm, (const char *)tsk->comm, sizeof(i->comm));
 
@@ -351,6 +355,53 @@ static int restore_sid(struct kstate_context *ctx, kstate_ref_t *ref)
 	return 0;
 }
 
+static int restore_all_fd(struct kstate_context *ctx, kstate_pos_t pos)
+{
+	struct kstate_object_header hdr;
+	int rv;
+
+	while (1) {
+		rv = kstate_pread(ctx, &hdr, sizeof(hdr), pos);
+		if (rv < 0)
+			return rv;
+		if (hdr.obj_len < sizeof(hdr))
+			return -EINVAL;
+
+		if (hdr.obj_type == KSTATE_OBJ_FD) {
+			rv = kstate_restore_fd(ctx, pos);
+			if (rv < 0)
+				return rv;
+		} else
+			return 0;
+		pos += hdr.obj_len;
+	}
+}
+
+static int restore_files(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct files_struct *files;
+	struct kstate_object *tmp;
+	int restore_fd;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_FILES_STRUCT);
+	if (!tmp) {
+		rv = kstate_restore_files_struct(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_FILES_STRUCT);
+		restore_fd = 1;
+	} else
+		restore_fd = 0;
+	files = tmp->o_obj;
+
+	atomic_inc(&files->count);
+	reset_files_struct(files);
+	if (restore_fd)
+		return restore_all_fd(ctx, ref->pos + sizeof(struct kstate_image_files_struct));
+	return 0;
+}
+
 struct task_struct_restore_context {
 	struct kstate_context *ctx;
 	struct kstate_image_task_struct *i;
@@ -414,6 +465,9 @@ static int task_struct_restorer(void *_tsk_ctx)
 	rv = restore_sid(ctx, &i->ref_sid);
 	if (rv < 0)
 		goto out;
+	rv = restore_files(ctx, &i->ref_files);
+	if (rv < 0)
+		goto out;
 
 out:
 	tsk_ctx->rv = rv;
-- 
1.5.6.5
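
A note on restore_all_fd() above: the fd objects sit back to back
after the files_struct image, each starting with a kstate_object_header,
and the loop advances by hdr.obj_len until it meets a header of another
type. Below is a minimal userspace sketch of that walk, not the kernel
code. The names obj_header and walk_fd_records() and the 32-bit width of
obj_len are assumptions made for illustration. The bounds check against
the image length, and treating a clean end of image as the end of the
run, are additions of this sketch; the patch relies on kstate_pread()
failing instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Object type values as defined in kstate-image.h in this series. */
#define OBJ_FILES_STRUCT	17
#define OBJ_FD			18

/* Hypothetical mirror of struct kstate_object_header (field widths assumed). */
struct obj_header {
	uint32_t obj_type;
	uint32_t obj_len;	/* whole record, header included */
};

/*
 * Count the consecutive OBJ_FD records that start at pos.
 * Returns the count, or -EINVAL for a truncated or malformed record.
 */
static int walk_fd_records(const uint8_t *img, size_t img_len, size_t pos)
{
	struct obj_header hdr;
	int n = 0;

	assert(img != NULL);
	for (;;) {
		/* Sketch-only choice: a clean end of image ends the run. */
		if (pos == img_len)
			return n;
		if (pos > img_len || img_len - pos < sizeof(hdr))
			return -EINVAL;
		memcpy(&hdr, img + pos, sizeof(hdr));
		/* A length shorter than the header would stall the walk. */
		if (hdr.obj_len < sizeof(hdr) || hdr.obj_len > img_len - pos)
			return -EINVAL;
		/* The first record of any other type terminates the run. */
		if (hdr.obj_type != OBJ_FD)
			return n;
		n++;
		pos += hdr.obj_len;
	}
}
```

The obj_len < sizeof(hdr) test is the one that matters most: without it
a zero-length record would keep pos fixed and the loop would never end.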



* [PATCH 38/38] C/R: checkpoint/restart struct sighand_struct
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (35 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 37/38] C/R: checkpoint/restore opened files Alexey Dobriyan
@ 2009-05-22  4:55 ` Alexey Dobriyan
  2009-05-22  5:02 ` [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  4:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, Alexey Dobriyan

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/signal.h  |    2 +
 include/linux/kstate-image.h   |   17 +++++
 include/linux/kstate.h         |   29 ++++++++
 kernel/kstate/cpt-sys.c        |    6 ++
 kernel/kstate/kstate-context.c |    5 ++
 kernel/kstate/kstate-object.c  |    3 +
 kernel/kstate/kstate-task.c    |   32 +++++++++
 kernel/signal.c                |  140 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 234 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 7761a5d..f7c9040 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -140,6 +140,7 @@ struct sigaction {
 	__sigrestore_t sa_restorer;
 	sigset_t sa_mask;		/* mask last for extensibility */
 };
+#define sa_restorer sa_restorer
 
 struct k_sigaction {
 	struct sigaction sa;
@@ -170,6 +171,7 @@ struct sigaction {
 	__sigrestore_t sa_restorer;
 	sigset_t sa_mask;		/* mask last for extensibility */
 };
+#define sa_restorer sa_restorer
 
 struct k_sigaction {
 	struct sigaction sa;
diff --git a/include/linux/kstate-image.h b/include/linux/kstate-image.h
index 700fc62..38fb1a4 100644
--- a/include/linux/kstate-image.h
+++ b/include/linux/kstate-image.h
@@ -18,6 +18,9 @@ static inline int kstate_ref_undefined(kstate_ref_t *ref)
 	return ref->pos == 0 && ref->id == 0;
 }
 
+typedef __u8 kstate_sigset_t[16];
+typedef __u64 kstate_ptr_t;
+
 struct kstate_image_header {
 	/* Immutable part except version bumps. */
 #define KSTATE_IMAGE_MAGIC	"LinuxC/R"
@@ -56,6 +59,7 @@ struct kstate_image_header {
 #define KSTATE_OBJ_PID		16
 #define KSTATE_OBJ_FILES_STRUCT	17
 #define KSTATE_OBJ_FD		18
+#define KSTATE_OBJ_SIGHAND_STRUCT	19
 
 struct kstate_object_header {
 	__u32		obj_type;
@@ -88,6 +92,8 @@ struct kstate_image_task_struct {
 	kstate_ref_t	ref_pgid;
 	kstate_ref_t	ref_sid;
 
+	kstate_ref_t	ref_sighand;
+
 	__u8		comm[16];
 
 	/* Native arch of task, one of KSTATE_ARCH_*. */
@@ -334,4 +340,15 @@ struct kstate_image_fd {
 #define KSTATE_FD_FLAGS_CLOEXEC		(1 << 0)
 	__u32		fd_flags;
 } __packed;
+
+struct kstate_image_sighand_struct {
+	struct kstate_object_header hdr;
+
+	struct {
+		kstate_ptr_t	sa_handler;
+		__u64		sa_flags;
+		kstate_ptr_t	sa_restorer;
+		kstate_sigset_t	sa_mask;
+	} sa[128];
+} __packed;
 #endif
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 2473381..c958d1d 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -1,7 +1,9 @@
 /* Copyright (C) 2000-2009 Parallels Holdings, Ltd. */
 #ifndef __INCLUDE_LINUX_KSTATE_H
 #define __INCLUDE_LINUX_KSTATE_H
+#include <linux/compiler.h>
 #include <linux/list.h>
+#include <asm/signal.h>
 
 #include <linux/kstate-image.h>
 
@@ -36,6 +38,7 @@ enum kstate_context_obj_type {
 	KSTATE_CTX_NSPROXY,
 	KSTATE_CTX_PID,
 	KSTATE_CTX_PID_NS,
+	KSTATE_CTX_SIGHAND_STRUCT,
 	KSTATE_CTX_TASK_STRUCT,
 	KSTATE_CTX_USER_NS,
 	KSTATE_CTX_USER_STRUCT,
@@ -70,6 +73,28 @@ void *kstate_prepare_image(__u32 type, unsigned int len);
 void *kstate_read_image(struct kstate_context *ctx, kstate_ref_t *ref, __u32 type, unsigned int len);
 int kstate_write_image(struct kstate_context *ctx, void *i, unsigned int len, struct kstate_object *obj);
 
+static inline kstate_ptr_t kstate_export_ptr(const void __user *ptr)
+{
+	BUILD_BUG_ON(sizeof(kstate_ptr_t) < sizeof(ptr));
+	return (unsigned long)ptr;
+}
+
+static inline const void __user *kstate_import_ptr(kstate_ptr_t i)
+{
+	return (const void __user *)(unsigned long)i;
+}
+
+static inline void kstate_dump_sigset(kstate_sigset_t *i, sigset_t *sig)
+{
+	BUILD_BUG_ON(sizeof(sigset_t) > sizeof(kstate_sigset_t));
+	memcpy(i, sig, sizeof(sigset_t));
+}
+
+static inline void kstate_restore_sigset(sigset_t *sig, kstate_sigset_t *i)
+{
+	memcpy(sig, i, sizeof(sigset_t));
+}
+
 int kstate_collect_all_task_struct(struct kstate_context *ctx);
 int kstate_dump_all_task_struct(struct kstate_context *ctx);
 int kstate_restore_task_struct(struct kstate_context *ctx, kstate_ref_t *ref);
@@ -155,6 +180,10 @@ int kstate_dump_all_files_struct(struct kstate_context *ctx);
 int kstate_restore_files_struct(struct kstate_context *ctx, kstate_ref_t *ref);
 int kstate_restore_fd(struct kstate_context *ctx, kstate_pos_t pos);
 
+int kstate_collect_all_sighand_struct(struct kstate_context *ctx);
+int kstate_dump_all_sighand_struct(struct kstate_context *ctx);
+int kstate_restore_sighand_struct(struct kstate_context *ctx, kstate_ref_t *ref);
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
 extern const __u32 kstate_kernel_arch;
 int kstate_arch_check_image_header(struct kstate_image_header *i);
diff --git a/kernel/kstate/cpt-sys.c b/kernel/kstate/cpt-sys.c
index 05fc9d8..8225804 100644
--- a/kernel/kstate/cpt-sys.c
+++ b/kernel/kstate/cpt-sys.c
@@ -104,6 +104,9 @@ static int kstate_collect(struct kstate_context *ctx)
 	rv = kstate_collect_all_user_ns(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_collect_all_sighand_struct(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_collect_all_pid(ctx);
 	if (rv < 0)
 		return rv;
@@ -163,6 +166,9 @@ static int kstate_dump(struct kstate_context *ctx)
 	rv = kstate_dump_all_pid(ctx);
 	if (rv < 0)
 		return rv;
+	rv = kstate_dump_all_sighand_struct(ctx);
+	if (rv < 0)
+		return rv;
 	rv = kstate_dump_all_user_ns(ctx);
 	if (rv < 0)
 		return rv;
diff --git a/kernel/kstate/kstate-context.c b/kernel/kstate/kstate-context.c
index 3e1589f..44111d1 100644
--- a/kernel/kstate/kstate-context.c
+++ b/kernel/kstate/kstate-context.c
@@ -97,6 +97,11 @@ void kstate_context_destroy(struct kstate_context *ctx)
 		list_del(&obj->o_list);
 		kfree(obj);
 	}
+	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_SIGHAND_STRUCT) {
+		__cleanup_sighand((struct sighand_struct *)obj->o_obj);
+		list_del(&obj->o_list);
+		kfree(obj);
+	}
 	for_each_kstate_object_safe(ctx, obj, tmp, KSTATE_CTX_TASK_STRUCT) {
 		put_task_struct((struct task_struct *)obj->o_obj);
 		list_del(&obj->o_list);
diff --git a/kernel/kstate/kstate-object.c b/kernel/kstate/kstate-object.c
index bc27985..f871463 100644
--- a/kernel/kstate/kstate-object.c
+++ b/kernel/kstate/kstate-object.c
@@ -74,6 +74,9 @@ int kstate_collect_object(struct kstate_context *ctx, void *p, enum kstate_conte
 	case KSTATE_CTX_PID_NS:
 		get_pid_ns((struct pid_namespace *)obj->o_obj);
 		break;
+	case KSTATE_CTX_SIGHAND_STRUCT:
+		atomic_inc(&((struct sighand_struct *)obj->o_obj)->count);
+		break;
 	case KSTATE_CTX_TASK_STRUCT:
 		get_task_struct((struct task_struct *)obj->o_obj);
 		break;
diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index 101fcb8..2cfcb8f 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -139,6 +139,9 @@ static int dump_task_struct(struct kstate_context *ctx, struct kstate_object *ob
 	tmp = find_kstate_obj_by_ptr(ctx, tsk->files, KSTATE_CTX_FILES_STRUCT);
 	i->ref_files = tmp->o_ref;
 
+	tmp = find_kstate_obj_by_ptr(ctx, tsk->sighand, KSTATE_CTX_SIGHAND_STRUCT);
+	i->ref_sighand = tmp->o_ref;
+
 	BUILD_BUG_ON(sizeof(i->comm) != sizeof(tsk->comm));
 	strlcpy((char *)i->comm, (const char *)tsk->comm, sizeof(i->comm));
 
@@ -402,6 +405,32 @@ static int restore_files(struct kstate_context *ctx, kstate_ref_t *ref)
 	return 0;
 }
 
+static int restore_sighand(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct sighand_struct *sighand;
+	struct kstate_object *tmp;
+	struct sighand_struct *old_sighand;
+	unsigned long flags;
+	int rv;
+
+	tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_SIGHAND_STRUCT);
+	if (!tmp) {
+		rv = kstate_restore_sighand_struct(ctx, ref);
+		if (rv < 0)
+			return rv;
+		tmp = find_kstate_obj_by_ref(ctx, ref, KSTATE_CTX_SIGHAND_STRUCT);
+	}
+	sighand = tmp->o_obj;
+
+	atomic_inc(&sighand->count);
+	old_sighand = current->sighand;
+	spin_lock_irqsave(&old_sighand->siglock, flags);
+	current->sighand = sighand;
+	spin_unlock_irqrestore(&old_sighand->siglock, flags);
+	__cleanup_sighand(old_sighand);
+	return 0;
+}
+
 struct task_struct_restore_context {
 	struct kstate_context *ctx;
 	struct kstate_image_task_struct *i;
@@ -468,6 +497,9 @@ static int task_struct_restorer(void *_tsk_ctx)
 	rv = restore_files(ctx, &i->ref_files);
 	if (rv < 0)
 		goto out;
+	rv = restore_sighand(ctx, &i->ref_sighand);
+	if (rv < 0)
+		goto out;
 
 out:
 	tsk_ctx->rv = rv;
diff --git a/kernel/signal.c b/kernel/signal.c
index d803473..5d5dc13 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2,6 +2,7 @@
  *  linux/kernel/signal.c
  *
  *  Copyright (C) 1991, 1992  Linus Torvalds
+ *  Copyright (C) 2000-2009 Parallels Holdings, Ltd.
  *
  *  1997-11-02  Modified for POSIX.1b signals by Richard Henderson
  *
@@ -2656,3 +2657,142 @@ void __init signals_init(void)
 {
 	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
 }
+
+#ifdef CONFIG_CHECKPOINT
+#include <linux/kstate.h>
+#include <linux/kstate-image.h>
+
+static int check_sighand_struct(struct sighand_struct *sighand)
+{
+#ifdef CONFIG_SIGNALFD
+	if (waitqueue_active(&sighand->signalfd_wqh)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
+static int collect_sighand_struct(struct kstate_context *ctx, struct sighand_struct *sighand)
+{
+	int rv;
+
+	rv = check_sighand_struct(sighand);
+	if (rv < 0)
+		return rv;
+	rv = kstate_collect_object(ctx, sighand, KSTATE_CTX_SIGHAND_STRUCT);
+	pr_debug("collect sighand_struct %p: rv %d\n", sighand, rv);
+	return rv;
+}
+
+int kstate_collect_all_sighand_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = collect_sighand_struct(ctx, tsk->sighand);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_SIGHAND_STRUCT) {
+		struct sighand_struct *sighand = obj->o_obj;
+		unsigned int cnt = atomic_read(&sighand->count);
+
+		if (obj->o_count + 1 != cnt) {
+			pr_err("sighand_struct %p has external references %lu:%u\n", sighand, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int dump_sighand_struct(struct kstate_context *ctx, struct kstate_object *obj)
+{
+	struct sighand_struct *sighand = obj->o_obj;
+	struct kstate_image_sighand_struct *i;
+	unsigned long flags;
+	int sig;
+	int rv;
+
+	i = kstate_prepare_image(KSTATE_OBJ_SIGHAND_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	BUILD_BUG_ON(_NSIG > ARRAY_SIZE(i->sa));
+	spin_lock_irqsave(&sighand->siglock, flags);
+	for (sig = 0; sig < _NSIG; sig++) {
+		struct sigaction *sa = &sighand->action[sig].sa;
+
+		i->sa[sig].sa_handler = kstate_export_ptr(sa->sa_handler);
+		i->sa[sig].sa_flags = sa->sa_flags;
+#ifdef sa_restorer
+		i->sa[sig].sa_restorer = kstate_export_ptr(sa->sa_restorer);
+#endif
+		kstate_dump_sigset(&i->sa[sig].sa_mask, &sa->sa_mask);
+	}
+	spin_unlock_irqrestore(&sighand->siglock, flags);
+
+	rv = kstate_write_image(ctx, i, sizeof(*i), obj);
+	kfree(i);
+	pr_debug("dump sighand_struct %p: ref {%llu, %u}, rv %d\n", sighand, (unsigned long long)obj->o_ref.pos, obj->o_ref.id, rv);
+	return rv;
+}
+
+int kstate_dump_all_sighand_struct(struct kstate_context *ctx)
+{
+	struct kstate_object *obj;
+	int rv;
+
+	for_each_kstate_object(ctx, obj, KSTATE_CTX_SIGHAND_STRUCT) {
+		rv = dump_sighand_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+int kstate_restore_sighand_struct(struct kstate_context *ctx, kstate_ref_t *ref)
+{
+	struct kstate_image_sighand_struct *i;
+	struct sighand_struct *sighand;
+	int sig;
+	int rv;
+
+	i = kstate_read_image(ctx, ref, KSTATE_OBJ_SIGHAND_STRUCT, sizeof(*i));
+	if (IS_ERR(i))
+		return PTR_ERR(i);
+
+	sighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+	if (!sighand) {
+		rv = -ENOMEM;
+		goto out_free_image;
+	}
+	atomic_set(&sighand->count, 1);
+
+	for (sig = 0; sig < _NSIG; sig++) {
+		struct sigaction *sa = &sighand->action[sig].sa;
+
+		sa->sa_handler = kstate_import_ptr(i->sa[sig].sa_handler);
+		sa->sa_flags = i->sa[sig].sa_flags;
+#ifdef sa_restorer
+		sa->sa_restorer = kstate_import_ptr(i->sa[sig].sa_restorer);
+#endif
+		kstate_restore_sigset(&sa->sa_mask, &i->sa[sig].sa_mask);
+	}
+	kfree(i);
+
+	rv = kstate_restore_object(ctx, sighand, KSTATE_CTX_SIGHAND_STRUCT, ref);
+	if (rv < 0)
+		kmem_cache_free(sighand_cachep, sighand);
+	pr_debug("restore sighand_struct %p: ref {%llu, %u}, rv %d\n", sighand, (unsigned long long)ref->pos, ref->id, rv);
+	return rv;
+
+out_free_image:
+	kfree(i);
+	pr_debug("%s: return %d, ref {%llu, %u}\n", __func__, rv, (unsigned long long)ref->pos, ref->id);
+	return rv;
+}
+#endif
-- 
1.5.6.5
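
A note on the helpers added to kstate.h above: they keep the image
layout independent of the native word size. User pointers are widened to
a 64-bit kstate_ptr_t, and each signal mask occupies a fixed 16-byte
kstate_sigset_t. A rough userspace sketch of the same idea follows.
Everything named here (native_sigset_t, IMG_SIGSET_BYTES, export_ptr(),
import_ptr(), dump_sigset(), restore_sigset()) is hypothetical. The
explicit zero fill of the unused mask bytes and the range check on
import are additions of the sketch, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IMG_SIGSET_BYTES 16	/* same size as kstate_sigset_t */
typedef uint64_t img_ptr_t;	/* same width as kstate_ptr_t */

/* Stand-in for the kernel's sigset_t: 64 signal bits packed into longs. */
typedef struct {
	unsigned long sig[64 / (8 * sizeof(unsigned long))];
} native_sigset_t;

_Static_assert(sizeof(img_ptr_t) >= sizeof(void *), "image pointer too narrow");
_Static_assert(sizeof(native_sigset_t) <= IMG_SIGSET_BYTES, "image mask too small");

/* Widen a native pointer to the fixed image width. */
static img_ptr_t export_ptr(const void *p)
{
	return (img_ptr_t)(uintptr_t)p;
}

/* Narrow back; returns -1 if the stored value cannot be a native pointer. */
static int import_ptr(img_ptr_t v, const void **out)
{
	assert(out != NULL);
	if (v > UINTPTR_MAX)
		return -1;
	*out = (const void *)(uintptr_t)v;
	return 0;
}

/* Copy the native mask into the fixed field, zeroing the unused tail. */
static void dump_sigset(uint8_t dst[IMG_SIGSET_BYTES], const native_sigset_t *src)
{
	memset(dst, 0, IMG_SIGSET_BYTES);
	memcpy(dst, src, sizeof(*src));
}

/* Copy back only as many bytes as the native mask holds. */
static void restore_sigset(native_sigset_t *dst, const uint8_t src[IMG_SIGSET_BYTES])
{
	memcpy(dst, src, sizeof(*dst));
}
```

Because the mask is copied with memcpy(), the field keeps the native
byte order, so an image written this way is only meaningful on a machine
with the same endianness.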



* Re: [PATCH 01/38] cred: #include init.h in cred.h
  2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
                   ` (36 preceding siblings ...)
  2009-05-22  4:55 ` [PATCH 38/38] C/R: checkpoint/restart struct sighand_struct Alexey Dobriyan
@ 2009-05-22  5:02 ` Alexey Dobriyan
  37 siblings, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-22  5:02 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave, mingo

Most patches before "[PATCH 18/38] C/R: core stuff" were already
sent via the usual channels, and some are already in -next.
They are being repeated to keep the series clean against mainline.

I forgot to mention that the ->checkpoint hook was not added to the vDSO
mapping, for several reasons, so if someone wants to test, boot with
vdso=0 for now.


* Re: [PATCH 09/38] netns 1/2: don't get/put old netns on CLONE_NEWNET
  2009-05-22  4:55 ` [PATCH 09/38] netns 1/2: don't get/put old netns on CLONE_NEWNET Alexey Dobriyan
@ 2009-05-22  6:30   ` David Miller
  0 siblings, 0 replies; 76+ messages in thread
From: David Miller @ 2009-05-22  6:30 UTC (permalink / raw)
  To: adobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Fri, 22 May 2009 08:55:03 +0400

> copy_net_ns() doesn't copy anything, it creates fresh netns, so get/put of old
> netns isn't needed.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> Acked-by: Serge Hallyn <serue@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH 10/38] netns 2/2: extract net_create()
  2009-05-22  4:55 ` [PATCH 10/38] netns 2/2: extract net_create() Alexey Dobriyan
@ 2009-05-22  6:30   ` David Miller
  0 siblings, 0 replies; 76+ messages in thread
From: David Miller @ 2009-05-22  6:30 UTC (permalink / raw)
  To: adobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Fri, 22 May 2009 08:55:04 +0400

> net_create() will be used by C/R to create fresh netns on restart.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> Acked-by: Serge Hallyn <serue@us.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH 04/38] ipcns 2/4: extract create_ipc_ns()
  2009-05-22  4:54 ` [PATCH 04/38] ipcns 2/4: extract create_ipc_ns() Alexey Dobriyan
@ 2009-05-22  8:59   ` Amerigo Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Amerigo Wang @ 2009-05-22  8:59 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

On Fri, May 22, 2009 at 08:54:58AM +0400, Alexey Dobriyan wrote:
>clone_ipc_ns() is misnamed, it doesn't clone anything and doesn't use passed
>parameter. Rename it.
>
>create_ipc_ns() will be used by C/R to create fresh ipcns.
>
>Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
>Acked-by: Serge Hallyn <serue@us.ibm.com>
>---
> ipc/namespace.c |    9 ++-------
> 1 files changed, 2 insertions(+), 7 deletions(-)
>
>diff --git a/ipc/namespace.c b/ipc/namespace.c
>index a56fc59..231ee53 100644
>--- a/ipc/namespace.c
>+++ b/ipc/namespace.c
>@@ -14,7 +14,7 @@
> 
> #include "util.h"
> 
>-static struct ipc_namespace *clone_ipc_ns(struct ipc_namespace *old_ns)
>+static struct ipc_namespace *create_ipc_ns(void)


Great!! I am wondering why nobody else found this... :)

Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>



> {
> 	struct ipc_namespace *ns;
> 	int err;
>@@ -48,14 +48,9 @@ static struct ipc_namespace *clone_ipc_ns(struct ipc_namespace *old_ns)
> 
> struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
> {
>-	struct ipc_namespace *new_ns;
>-
> 	if (!(flags & CLONE_NEWIPC))
> 		return get_ipc_ns(ns);
>-
>-	new_ns = clone_ipc_ns(ns);
>-
>-	return new_ns;
>+	return create_ipc_ns();
> }
> 
> /*
>-- 
>1.5.6.5
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/


* Re: [PATCH 03/38] ipcns 1/4: remove useless get/put while CLONE_NEWIPC
  2009-05-22  4:54 ` [PATCH 03/38] ipcns 1/4: remove useless get/put while CLONE_NEWIPC Alexey Dobriyan
@ 2009-05-22  9:00   ` Amerigo Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Amerigo Wang @ 2009-05-22  9:00 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

On Fri, May 22, 2009 at 08:54:57AM +0400, Alexey Dobriyan wrote:
>copy_ipcs() doesn't actually copy anything. If new ipcns is created, it's
>created from scratch, in this case get/put on old ipcns isn't needed.
>
>Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
>Acked-by: Serge Hallyn <serue@us.ibm.com>


Looks good to me!

Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>


>---
> ipc/namespace.c |    6 +-----
> 1 files changed, 1 insertions(+), 5 deletions(-)
>
>diff --git a/ipc/namespace.c b/ipc/namespace.c
>index 4a5e752..a56fc59 100644
>--- a/ipc/namespace.c
>+++ b/ipc/namespace.c
>@@ -50,15 +50,11 @@ struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
> {
> 	struct ipc_namespace *new_ns;
> 
>-	BUG_ON(!ns);
>-	get_ipc_ns(ns);
>-
> 	if (!(flags & CLONE_NEWIPC))
>-		return ns;
>+		return get_ipc_ns(ns);
> 
> 	new_ns = clone_ipc_ns(ns);
> 
>-	put_ipc_ns(ns);
> 	return new_ns;
> }
> 
>-- 
>1.5.6.5
>

* Re: [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns()
  2009-05-22  4:55 ` [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns() Alexey Dobriyan
@ 2009-05-22  9:14   ` Amerigo Wang
  2009-05-24 22:45   ` Serge E. Hallyn
  2009-06-04  0:17   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 76+ messages in thread
From: Amerigo Wang @ 2009-05-22  9:14 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

On Fri, May 22, 2009 at 08:55:02AM +0400, Alexey Dobriyan wrote:
>copy_pid_ns() is a perfect example of a case where unwinding leads to more
>code and makes it less clear. Watch the diffstat.
>

Fully agreed...

>Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>

Btw, which tree is this patch against? It doesn't look like Linus' tree.


>---
> kernel/pid_namespace.c |   18 +++---------------
> 1 files changed, 3 insertions(+), 15 deletions(-)
>
>diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>index 495d5de..821722a 100644
>--- a/kernel/pid_namespace.c
>+++ b/kernel/pid_namespace.c
>@@ -116,23 +116,11 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
> 
> struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
> {
>-	struct pid_namespace *new_ns;
>-
>-	BUG_ON(!old_ns);
>-	new_ns = get_pid_ns(old_ns);
> 	if (!(flags & CLONE_NEWPID))
>-		goto out;
>-
>-	new_ns = ERR_PTR(-EINVAL);
>+		return get_pid_ns(old_ns);
> 	if (flags & CLONE_THREAD)
>-		goto out_put;
>-
>-	new_ns = create_pid_namespace(old_ns);
>-
>-out_put:
>-	put_pid_ns(old_ns);
>-out:
>-	return new_ns;
>+		return ERR_PTR(-EINVAL);
>+	return create_pid_namespace(old_ns);
> }
> 
> void free_pid_ns(struct kref *kref)
>-- 
>1.5.6.5
>

* Re: [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns
  2009-05-22  4:55 ` [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns Alexey Dobriyan
@ 2009-05-22  9:20   ` Amerigo Wang
  2009-05-24 22:44   ` Serge E. Hallyn
  2009-06-04  0:20   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 76+ messages in thread
From: Amerigo Wang @ 2009-05-22  9:20 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

On Fri, May 22, 2009 at 08:55:01AM +0400, Alexey Dobriyan wrote:
>create_pid_namespace() creates everything, but caller has to assign parent
>pidns by hand, which is unnatural. At the moment of call new ->level has
>to be taken from somewhere and parent pidns is already available.
>
>Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Nice cleanup!

Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>


>---
> kernel/pid_namespace.c |    8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
>diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>index 2d1001b..495d5de 100644
>--- a/kernel/pid_namespace.c
>+++ b/kernel/pid_namespace.c
>@@ -67,9 +67,10 @@ err_alloc:
> 	return NULL;
> }
> 
>-static struct pid_namespace *create_pid_namespace(unsigned int level)
>+static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_pid_ns)
> {
> 	struct pid_namespace *ns;
>+	unsigned int level = parent_pid_ns->level + 1;
> 	int i;
> 
> 	ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
>@@ -86,6 +87,7 @@ static struct pid_namespace *create_pid_namespace(unsigned int level)
> 
> 	kref_init(&ns->kref);
> 	ns->level = level;
>+	ns->parent = get_pid_ns(parent_pid_ns);
> 
> 	set_bit(0, ns->pidmap[0].page);
> 	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
>@@ -125,9 +127,7 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old
> 	if (flags & CLONE_THREAD)
> 		goto out_put;
> 
>-	new_ns = create_pid_namespace(old_ns->level + 1);
>-	if (!IS_ERR(new_ns))
>-		new_ns->parent = get_pid_ns(old_ns);
>+	new_ns = create_pid_namespace(old_ns);
> 
> out_put:
> 	put_pid_ns(old_ns);
>-- 
>1.5.6.5
>

* Re: [PATCH 02/38] utsns: extract create_uts_ns()
  2009-05-22  4:54 ` [PATCH 02/38] utsns: extract create_uts_ns() Alexey Dobriyan
@ 2009-05-24 22:37   ` Serge E. Hallyn
  0 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-24 22:37 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> create_uts_ns() will be used by C/R to create fresh uts_ns.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Acked-by: Serge Hallyn <serue@us.ibm.com>

> ---
>  kernel/utsname.c |   13 +++++++++++--
>  1 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/utsname.c b/kernel/utsname.c
> index 815237a..8a82b4b 100644
> --- a/kernel/utsname.c
> +++ b/kernel/utsname.c
> @@ -15,6 +15,16 @@
>  #include <linux/err.h>
>  #include <linux/slab.h>
>  
> +static struct uts_namespace *create_uts_ns(void)
> +{
> +	struct uts_namespace *uts_ns;
> +
> +	uts_ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);
> +	if (uts_ns)
> +		kref_init(&uts_ns->kref);
> +	return uts_ns;
> +}
> +
>  /*
>   * Clone a new ns copying an original utsname, setting refcount to 1
>   * @old_ns: namespace to clone
> @@ -24,14 +34,13 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
>  {
>  	struct uts_namespace *ns;
>  
> -	ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);
> +	ns = create_uts_ns();
>  	if (!ns)
>  		return ERR_PTR(-ENOMEM);
>  
>  	down_read(&uts_sem);
>  	memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
>  	up_read(&uts_sem);
> -	kref_init(&ns->kref);
>  	return ns;
>  }
>  
> -- 
> 1.5.6.5
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers


* Re: [PATCH 05/38] ipcns 3/4: make free_ipc_ns() static
  2009-05-22  4:54 ` [PATCH 05/38] ipcns 3/4: make free_ipc_ns() static Alexey Dobriyan
@ 2009-05-24 22:40   ` Serge E. Hallyn
  0 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-24 22:40 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Acked-by: Serge Hallyn <serue@us.ibm.com>


* Re: [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns
  2009-05-22  4:55 ` [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns Alexey Dobriyan
  2009-05-22  9:20   ` Amerigo Wang
@ 2009-05-24 22:44   ` Serge E. Hallyn
  2009-06-04  0:20   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-24 22:44 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> create_pid_namespace() creates everything, but caller has to assign parent
> pidns by hand, which is unnatural. At the moment of call new ->level has
> to be taken from somewhere and parent pidns is already available.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Reviewed-by: Serge Hallyn <serue@us.ibm.com>

(I actually did so on both this and the next one, which are both quite
nice, on May 08)

> ---
>  kernel/pid_namespace.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 2d1001b..495d5de 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -67,9 +67,10 @@ err_alloc:
>  	return NULL;
>  }
>  
> -static struct pid_namespace *create_pid_namespace(unsigned int level)
> +static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_pid_ns)
>  {
>  	struct pid_namespace *ns;
> +	unsigned int level = parent_pid_ns->level + 1;
>  	int i;
>  
>  	ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
> @@ -86,6 +87,7 @@ static struct pid_namespace *create_pid_namespace(unsigned int level)
>  
>  	kref_init(&ns->kref);
>  	ns->level = level;
> +	ns->parent = get_pid_ns(parent_pid_ns);
>  
>  	set_bit(0, ns->pidmap[0].page);
>  	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
> @@ -125,9 +127,7 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old
>  	if (flags & CLONE_THREAD)
>  		goto out_put;
>  
> -	new_ns = create_pid_namespace(old_ns->level + 1);
> -	if (!IS_ERR(new_ns))
> -		new_ns->parent = get_pid_ns(old_ns);
> +	new_ns = create_pid_namespace(old_ns);
>  
>  out_put:
>  	put_pid_ns(old_ns);
> -- 
> 1.5.6.5
> 

* Re: [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns()
  2009-05-22  4:55 ` [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns() Alexey Dobriyan
  2009-05-22  9:14   ` Amerigo Wang
@ 2009-05-24 22:45   ` Serge E. Hallyn
  2009-06-04  0:17   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-24 22:45 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> copy_pid_ns() is a perfect example of a case where unwinding leads to more
> code and makes it less clear. Watch the diffstat.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Reviewed-by: Serge Hallyn <serue@us.ibm.com>

> ---
>  kernel/pid_namespace.c |   18 +++---------------
>  1 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 495d5de..821722a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -116,23 +116,11 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
>  
>  struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
>  {
> -	struct pid_namespace *new_ns;
> -
> -	BUG_ON(!old_ns);
> -	new_ns = get_pid_ns(old_ns);
>  	if (!(flags & CLONE_NEWPID))
> -		goto out;
> -
> -	new_ns = ERR_PTR(-EINVAL);
> +		return get_pid_ns(old_ns);
>  	if (flags & CLONE_THREAD)
> -		goto out_put;
> -
> -	new_ns = create_pid_namespace(old_ns);
> -
> -out_put:
> -	put_pid_ns(old_ns);
> -out:
> -	return new_ns;
> +		return ERR_PTR(-EINVAL);
> +	return create_pid_namespace(old_ns);
>  }
>  
>  void free_pid_ns(struct kref *kref)
> -- 
> 1.5.6.5
> 

* Re: [PATCH 06/38] ipcns 4/2: move free_ipcs() proto
  2009-05-22  4:55 ` [PATCH 06/38] ipcns 4/2: move free_ipcs() proto Alexey Dobriyan
@ 2009-05-24 22:49   ` Serge E. Hallyn
  0 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-24 22:49 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> Function is really private to ipc/ and avoid struct kern_ipc_perm
> forward declaration.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Reviewed-by: Serge Hallyn <serue@us.ibm.com>

> ---
>  include/linux/ipc_namespace.h |    4 ----
>  ipc/util.h                    |    3 ++-
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index 804e4e4..e408722 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -96,10 +96,6 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }
>  #if defined(CONFIG_IPC_NS)
>  extern struct ipc_namespace *copy_ipcs(unsigned long flags,
>  				       struct ipc_namespace *ns);
> -extern void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
> -		      void (*free)(struct ipc_namespace *,
> -				   struct kern_ipc_perm *));
> -
>  static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>  {
>  	if (ns)
> diff --git a/ipc/util.h b/ipc/util.h
> index 1187332..7a2861b 100644
> --- a/ipc/util.h
> +++ b/ipc/util.h
> @@ -171,5 +171,6 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
>  struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
>  int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
>  			struct ipc_ops *ops, struct ipc_params *params);
> -
> +void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
> +		void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
>  #endif
> -- 
> 1.5.6.5
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 17/38] groups: move code to kernel/groups.c
  2009-05-22  4:55 ` [PATCH 17/38] groups: move code to kernel/groups.c Alexey Dobriyan
@ 2009-05-25  0:53   ` Serge E. Hallyn
  2009-05-26 14:48   ` Serge E. Hallyn
  1 sibling, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-25  0:53 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> Move supplementary groups implementation to kernel/groups.c .
> kernel/sys.c already accumulated quite a few random stuff.
> 
> Do strictly copy/paste + add required headers to compile.
> Compile-tested on many configs and archs.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Agreed, I was thinking I'd do the same thing for groups c/r with Oren's set.

Acked-by: Serge Hallyn <serue@us.ibm.com>
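
For anyone reading groups_alloc() and the copy loops in the patch quoted
below, the block arithmetic can be tried out in userspace. This is only a
sketch under assumptions: SK_PER_BLOCK is a deliberately tiny stand-in for
NGROUPS_PER_BLOCK, sk_gid_t stands in for gid_t, sk_group_at() reflects my
reading of GROUP_AT() (which is defined outside this patch), and the
small_block shortcut is not modelled at all.

```c
#include <assert.h>

#define SK_PER_BLOCK 4	/* deliberately tiny stand-in for NGROUPS_PER_BLOCK */

typedef unsigned int sk_gid_t;	/* stand-in for gid_t */

/* Blocks needed for n IDs: ceiling division, but never fewer than one. */
int sk_nblocks(int n)
{
	int nblocks = (n + SK_PER_BLOCK - 1) / SK_PER_BLOCK;

	return nblocks ? nblocks : 1;
}

/* Two-level lookup: pick the block, then the slot inside that block. */
sk_gid_t *sk_group_at(sk_gid_t **blocks, int i)
{
	return &blocks[i / SK_PER_BLOCK][i % SK_PER_BLOCK];
}

/* How many IDs sit in block b when n IDs are stored in total. */
int sk_block_count(int n, int b)
{
	int remaining = n - b * SK_PER_BLOCK;

	if (remaining <= 0)
		return 0;
	return remaining < SK_PER_BLOCK ? remaining : SK_PER_BLOCK;
}

/* Small self-check tying the three helpers together. */
int sk_layout_selfcheck(void)
{
	sk_gid_t b0[SK_PER_BLOCK] = { 0 };
	sk_gid_t b1[SK_PER_BLOCK] = { 0 };
	sk_gid_t b2[SK_PER_BLOCK] = { 0 };
	sk_gid_t *blocks[] = { b0, b1, b2 };
	int n = 9, i;

	assert(sk_nblocks(0) == 1);	/* an empty set still gets one block pointer */
	assert(sk_nblocks(SK_PER_BLOCK) == 1);
	assert(sk_nblocks(n) == 3);

	for (i = 0; i < n; i++)
		*sk_group_at(blocks, i) = 100 + i;
	assert(b1[2] == 106);	/* index 6 -> block 1, slot 2 */
	assert(b2[0] == 108);	/* index 8 -> block 2, slot 0 */

	assert(sk_block_count(n, 0) == 4);
	assert(sk_block_count(n, 2) == 1);	/* last block is only partly used */
	return 0;
}
```

sk_block_count() plays the role of the min(NGROUPS_PER_BLOCK, count) step in
groups_to_user() and groups_from_user(): every block except possibly the
last is copied in full.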

> ---
>  kernel/Makefile |    1 +
>  kernel/groups.c |  288 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sys.c    |  283 ------------------------------------------------------
>  3 files changed, 289 insertions(+), 283 deletions(-)
>  create mode 100644 kernel/groups.c
> 
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4242366..705ad3d 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -11,6 +11,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>  	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>  	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
>  	    async.o
> +obj-y += groups.o
>  
>  ifdef CONFIG_FUNCTION_TRACER
>  # Do not trace debug files and internal ftrace files
> diff --git a/kernel/groups.c b/kernel/groups.c
> new file mode 100644
> index 0000000..1b95b2f
> --- /dev/null
> +++ b/kernel/groups.c
> @@ -0,0 +1,288 @@
> +/*
> + * Supplementary group IDs
> + */
> +#include <linux/cred.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/security.h>
> +#include <linux/syscalls.h>
> +#include <asm/uaccess.h>
> +
> +/* init to 2 - one for init_task, one to ensure it is never freed */
> +struct group_info init_groups = { .usage = ATOMIC_INIT(2) };
> +
> +struct group_info *groups_alloc(int gidsetsize)
> +{
> +	struct group_info *group_info;
> +	int nblocks;
> +	int i;
> +
> +	nblocks = (gidsetsize + NGROUPS_PER_BLOCK - 1) / NGROUPS_PER_BLOCK;
> +	/* Make sure we always allocate at least one indirect block pointer */
> +	nblocks = nblocks ? : 1;
> +	group_info = kmalloc(sizeof(*group_info) + nblocks*sizeof(gid_t *), GFP_USER);
> +	if (!group_info)
> +		return NULL;
> +	group_info->ngroups = gidsetsize;
> +	group_info->nblocks = nblocks;
> +	atomic_set(&group_info->usage, 1);
> +
> +	if (gidsetsize <= NGROUPS_SMALL)
> +		group_info->blocks[0] = group_info->small_block;
> +	else {
> +		for (i = 0; i < nblocks; i++) {
> +			gid_t *b;
> +			b = (void *)__get_free_page(GFP_USER);
> +			if (!b)
> +				goto out_undo_partial_alloc;
> +			group_info->blocks[i] = b;
> +		}
> +	}
> +	return group_info;
> +
> +out_undo_partial_alloc:
> +	while (--i >= 0) {
> +		free_page((unsigned long)group_info->blocks[i]);
> +	}
> +	kfree(group_info);
> +	return NULL;
> +}
> +
> +EXPORT_SYMBOL(groups_alloc);
> +
> +void groups_free(struct group_info *group_info)
> +{
> +	if (group_info->blocks[0] != group_info->small_block) {
> +		int i;
> +		for (i = 0; i < group_info->nblocks; i++)
> +			free_page((unsigned long)group_info->blocks[i]);
> +	}
> +	kfree(group_info);
> +}
> +
> +EXPORT_SYMBOL(groups_free);
> +
> +/* export the group_info to a user-space array */
> +static int groups_to_user(gid_t __user *grouplist,
> +			  const struct group_info *group_info)
> +{
> +	int i;
> +	unsigned int count = group_info->ngroups;
> +
> +	for (i = 0; i < group_info->nblocks; i++) {
> +		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
> +		unsigned int len = cp_count * sizeof(*grouplist);
> +
> +		if (copy_to_user(grouplist, group_info->blocks[i], len))
> +			return -EFAULT;
> +
> +		grouplist += NGROUPS_PER_BLOCK;
> +		count -= cp_count;
> +	}
> +	return 0;
> +}
> +
> +/* fill a group_info from a user-space array - it must be allocated already */
> +static int groups_from_user(struct group_info *group_info,
> +    gid_t __user *grouplist)
> +{
> +	int i;
> +	unsigned int count = group_info->ngroups;
> +
> +	for (i = 0; i < group_info->nblocks; i++) {
> +		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
> +		unsigned int len = cp_count * sizeof(*grouplist);
> +
> +		if (copy_from_user(group_info->blocks[i], grouplist, len))
> +			return -EFAULT;
> +
> +		grouplist += NGROUPS_PER_BLOCK;
> +		count -= cp_count;
> +	}
> +	return 0;
> +}
> +
> +/* a simple Shell sort */
> +static void groups_sort(struct group_info *group_info)
> +{
> +	int base, max, stride;
> +	int gidsetsize = group_info->ngroups;
> +
> +	for (stride = 1; stride < gidsetsize; stride = 3 * stride + 1)
> +		; /* nothing */
> +	stride /= 3;
> +
> +	while (stride) {
> +		max = gidsetsize - stride;
> +		for (base = 0; base < max; base++) {
> +			int left = base;
> +			int right = left + stride;
> +			gid_t tmp = GROUP_AT(group_info, right);
> +
> +			while (left >= 0 && GROUP_AT(group_info, left) > tmp) {
> +				GROUP_AT(group_info, right) =
> +				    GROUP_AT(group_info, left);
> +				right = left;
> +				left -= stride;
> +			}
> +			GROUP_AT(group_info, right) = tmp;
> +		}
> +		stride /= 3;
> +	}
> +}
> +
> +/* a simple bsearch */
> +int groups_search(const struct group_info *group_info, gid_t grp)
> +{
> +	unsigned int left, right;
> +
> +	if (!group_info)
> +		return 0;
> +
> +	left = 0;
> +	right = group_info->ngroups;
> +	while (left < right) {
> +		unsigned int mid = (left+right)/2;
> +		int cmp = grp - GROUP_AT(group_info, mid);
> +		if (cmp > 0)
> +			left = mid + 1;
> +		else if (cmp < 0)
> +			right = mid;
> +		else
> +			return 1;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * set_groups - Change a group subscription in a set of credentials
> + * @new: The newly prepared set of credentials to alter
> + * @group_info: The group list to install
> + *
> + * Validate a group subscription and, if valid, insert it into a set
> + * of credentials.
> + */
> +int set_groups(struct cred *new, struct group_info *group_info)
> +{
> +	int retval;
> +
> +	retval = security_task_setgroups(group_info);
> +	if (retval)
> +		return retval;
> +
> +	put_group_info(new->group_info);
> +	groups_sort(group_info);
> +	get_group_info(group_info);
> +	new->group_info = group_info;
> +	return 0;
> +}
> +
> +EXPORT_SYMBOL(set_groups);
> +
> +/**
> + * set_current_groups - Change current's group subscription
> + * @group_info: The group list to impose
> + *
> + * Validate a group subscription and, if valid, impose it upon current's task
> + * security record.
> + */
> +int set_current_groups(struct group_info *group_info)
> +{
> +	struct cred *new;
> +	int ret;
> +
> +	new = prepare_creds();
> +	if (!new)
> +		return -ENOMEM;
> +
> +	ret = set_groups(new, group_info);
> +	if (ret < 0) {
> +		abort_creds(new);
> +		return ret;
> +	}
> +
> +	return commit_creds(new);
> +}
> +
> +EXPORT_SYMBOL(set_current_groups);
> +
> +SYSCALL_DEFINE2(getgroups, int, gidsetsize, gid_t __user *, grouplist)
> +{
> +	const struct cred *cred = current_cred();
> +	int i;
> +
> +	if (gidsetsize < 0)
> +		return -EINVAL;
> +
> +	/* no need to grab task_lock here; it cannot change */
> +	i = cred->group_info->ngroups;
> +	if (gidsetsize) {
> +		if (i > gidsetsize) {
> +			i = -EINVAL;
> +			goto out;
> +		}
> +		if (groups_to_user(grouplist, cred->group_info)) {
> +			i = -EFAULT;
> +			goto out;
> +		}
> +	}
> +out:
> +	return i;
> +}
> +
> +/*
> + *	SMP: Our groups are copy-on-write. We can set them safely
> + *	without another task interfering.
> + */
> + 
> +SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
> +{
> +	struct group_info *group_info;
> +	int retval;
> +
> +	if (!capable(CAP_SETGID))
> +		return -EPERM;
> +	if ((unsigned)gidsetsize > NGROUPS_MAX)
> +		return -EINVAL;
> +
> +	group_info = groups_alloc(gidsetsize);
> +	if (!group_info)
> +		return -ENOMEM;
> +	retval = groups_from_user(group_info, grouplist);
> +	if (retval) {
> +		put_group_info(group_info);
> +		return retval;
> +	}
> +
> +	retval = set_current_groups(group_info);
> +	put_group_info(group_info);
> +
> +	return retval;
> +}
> +
> +/*
> + * Check whether we're fsgid/egid or in the supplemental group..
> + */
> +int in_group_p(gid_t grp)
> +{
> +	const struct cred *cred = current_cred();
> +	int retval = 1;
> +
> +	if (grp != cred->fsgid)
> +		retval = groups_search(cred->group_info, grp);
> +	return retval;
> +}
> +
> +EXPORT_SYMBOL(in_group_p);
> +
> +int in_egroup_p(gid_t grp)
> +{
> +	const struct cred *cred = current_cred();
> +	int retval = 1;
> +
> +	if (grp != cred->egid)
> +		retval = groups_search(cred->group_info, grp);
> +	return retval;
> +}
> +
> +EXPORT_SYMBOL(in_egroup_p);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index e7998cf..4edcf51 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1112,289 +1112,6 @@ out:
>  	return err;
>  }
>  
> -/*
> - * Supplementary group IDs
> - */
> -
> -/* init to 2 - one for init_task, one to ensure it is never freed */
> -struct group_info init_groups = { .usage = ATOMIC_INIT(2) };
> -
> -struct group_info *groups_alloc(int gidsetsize)
> -{
> -	struct group_info *group_info;
> -	int nblocks;
> -	int i;
> -
> -	nblocks = (gidsetsize + NGROUPS_PER_BLOCK - 1) / NGROUPS_PER_BLOCK;
> -	/* Make sure we always allocate at least one indirect block pointer */
> -	nblocks = nblocks ? : 1;
> -	group_info = kmalloc(sizeof(*group_info) + nblocks*sizeof(gid_t *), GFP_USER);
> -	if (!group_info)
> -		return NULL;
> -	group_info->ngroups = gidsetsize;
> -	group_info->nblocks = nblocks;
> -	atomic_set(&group_info->usage, 1);
> -
> -	if (gidsetsize <= NGROUPS_SMALL)
> -		group_info->blocks[0] = group_info->small_block;
> -	else {
> -		for (i = 0; i < nblocks; i++) {
> -			gid_t *b;
> -			b = (void *)__get_free_page(GFP_USER);
> -			if (!b)
> -				goto out_undo_partial_alloc;
> -			group_info->blocks[i] = b;
> -		}
> -	}
> -	return group_info;
> -
> -out_undo_partial_alloc:
> -	while (--i >= 0) {
> -		free_page((unsigned long)group_info->blocks[i]);
> -	}
> -	kfree(group_info);
> -	return NULL;
> -}
> -
> -EXPORT_SYMBOL(groups_alloc);
> -
> -void groups_free(struct group_info *group_info)
> -{
> -	if (group_info->blocks[0] != group_info->small_block) {
> -		int i;
> -		for (i = 0; i < group_info->nblocks; i++)
> -			free_page((unsigned long)group_info->blocks[i]);
> -	}
> -	kfree(group_info);
> -}
> -
> -EXPORT_SYMBOL(groups_free);
> -
> -/* export the group_info to a user-space array */
> -static int groups_to_user(gid_t __user *grouplist,
> -			  const struct group_info *group_info)
> -{
> -	int i;
> -	unsigned int count = group_info->ngroups;
> -
> -	for (i = 0; i < group_info->nblocks; i++) {
> -		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
> -		unsigned int len = cp_count * sizeof(*grouplist);
> -
> -		if (copy_to_user(grouplist, group_info->blocks[i], len))
> -			return -EFAULT;
> -
> -		grouplist += NGROUPS_PER_BLOCK;
> -		count -= cp_count;
> -	}
> -	return 0;
> -}
> -
> -/* fill a group_info from a user-space array - it must be allocated already */
> -static int groups_from_user(struct group_info *group_info,
> -    gid_t __user *grouplist)
> -{
> -	int i;
> -	unsigned int count = group_info->ngroups;
> -
> -	for (i = 0; i < group_info->nblocks; i++) {
> -		unsigned int cp_count = min(NGROUPS_PER_BLOCK, count);
> -		unsigned int len = cp_count * sizeof(*grouplist);
> -
> -		if (copy_from_user(group_info->blocks[i], grouplist, len))
> -			return -EFAULT;
> -
> -		grouplist += NGROUPS_PER_BLOCK;
> -		count -= cp_count;
> -	}
> -	return 0;
> -}
> -
> -/* a simple Shell sort */
> -static void groups_sort(struct group_info *group_info)
> -{
> -	int base, max, stride;
> -	int gidsetsize = group_info->ngroups;
> -
> -	for (stride = 1; stride < gidsetsize; stride = 3 * stride + 1)
> -		; /* nothing */
> -	stride /= 3;
> -
> -	while (stride) {
> -		max = gidsetsize - stride;
> -		for (base = 0; base < max; base++) {
> -			int left = base;
> -			int right = left + stride;
> -			gid_t tmp = GROUP_AT(group_info, right);
> -
> -			while (left >= 0 && GROUP_AT(group_info, left) > tmp) {
> -				GROUP_AT(group_info, right) =
> -				    GROUP_AT(group_info, left);
> -				right = left;
> -				left -= stride;
> -			}
> -			GROUP_AT(group_info, right) = tmp;
> -		}
> -		stride /= 3;
> -	}
> -}
> -
> -/* a simple bsearch */
> -int groups_search(const struct group_info *group_info, gid_t grp)
> -{
> -	unsigned int left, right;
> -
> -	if (!group_info)
> -		return 0;
> -
> -	left = 0;
> -	right = group_info->ngroups;
> -	while (left < right) {
> -		unsigned int mid = (left+right)/2;
> -		int cmp = grp - GROUP_AT(group_info, mid);
> -		if (cmp > 0)
> -			left = mid + 1;
> -		else if (cmp < 0)
> -			right = mid;
> -		else
> -			return 1;
> -	}
> -	return 0;
> -}
> -
> -/**
> - * set_groups - Change a group subscription in a set of credentials
> - * @new: The newly prepared set of credentials to alter
> - * @group_info: The group list to install
> - *
> - * Validate a group subscription and, if valid, insert it into a set
> - * of credentials.
> - */
> -int set_groups(struct cred *new, struct group_info *group_info)
> -{
> -	int retval;
> -
> -	retval = security_task_setgroups(group_info);
> -	if (retval)
> -		return retval;
> -
> -	put_group_info(new->group_info);
> -	groups_sort(group_info);
> -	get_group_info(group_info);
> -	new->group_info = group_info;
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL(set_groups);
> -
> -/**
> - * set_current_groups - Change current's group subscription
> - * @group_info: The group list to impose
> - *
> - * Validate a group subscription and, if valid, impose it upon current's task
> - * security record.
> - */
> -int set_current_groups(struct group_info *group_info)
> -{
> -	struct cred *new;
> -	int ret;
> -
> -	new = prepare_creds();
> -	if (!new)
> -		return -ENOMEM;
> -
> -	ret = set_groups(new, group_info);
> -	if (ret < 0) {
> -		abort_creds(new);
> -		return ret;
> -	}
> -
> -	return commit_creds(new);
> -}
> -
> -EXPORT_SYMBOL(set_current_groups);
> -
> -SYSCALL_DEFINE2(getgroups, int, gidsetsize, gid_t __user *, grouplist)
> -{
> -	const struct cred *cred = current_cred();
> -	int i;
> -
> -	if (gidsetsize < 0)
> -		return -EINVAL;
> -
> -	/* no need to grab task_lock here; it cannot change */
> -	i = cred->group_info->ngroups;
> -	if (gidsetsize) {
> -		if (i > gidsetsize) {
> -			i = -EINVAL;
> -			goto out;
> -		}
> -		if (groups_to_user(grouplist, cred->group_info)) {
> -			i = -EFAULT;
> -			goto out;
> -		}
> -	}
> -out:
> -	return i;
> -}
> -
> -/*
> - *	SMP: Our groups are copy-on-write. We can set them safely
> - *	without another task interfering.
> - */
> - 
> -SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
> -{
> -	struct group_info *group_info;
> -	int retval;
> -
> -	if (!capable(CAP_SETGID))
> -		return -EPERM;
> -	if ((unsigned)gidsetsize > NGROUPS_MAX)
> -		return -EINVAL;
> -
> -	group_info = groups_alloc(gidsetsize);
> -	if (!group_info)
> -		return -ENOMEM;
> -	retval = groups_from_user(group_info, grouplist);
> -	if (retval) {
> -		put_group_info(group_info);
> -		return retval;
> -	}
> -
> -	retval = set_current_groups(group_info);
> -	put_group_info(group_info);
> -
> -	return retval;
> -}
> -
> -/*
> - * Check whether we're fsgid/egid or in the supplemental group..
> - */
> -int in_group_p(gid_t grp)
> -{
> -	const struct cred *cred = current_cred();
> -	int retval = 1;
> -
> -	if (grp != cred->fsgid)
> -		retval = groups_search(cred->group_info, grp);
> -	return retval;
> -}
> -
> -EXPORT_SYMBOL(in_group_p);
> -
> -int in_egroup_p(gid_t grp)
> -{
> -	const struct cred *cred = current_cred();
> -	int retval = 1;
> -
> -	if (grp != cred->egid)
> -		retval = groups_search(cred->group_info, grp);
> -	return retval;
> -}
> -
> -EXPORT_SYMBOL(in_egroup_p);
> -
>  DECLARE_RWSEM(uts_sem);
>  
>  SYSCALL_DEFINE1(newuname, struct new_utsname __user *, name)
> -- 
> 1.5.6.5
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
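
The sort-then-search pairing in the moved code (set_groups() sorts the list
so that groups_search() can bisect it) can be exercised in userspace. The
sketch below is not the kernel code: it works on a flat array instead of
GROUP_AT(), and sk_gid_t and the sk_* names are made up for illustration.
The patch computes "int cmp = grp - GROUP_AT(...)"; the sketch compares the
two values directly instead, which is my choice rather than something the
patch does, so it does not depend on how an unsigned difference larger than
INT_MAX converts to int.

```c
#include <assert.h>

typedef unsigned int sk_gid_t;	/* stand-in for gid_t */

/* Shell sort with the 1, 4, 13, ... stride series, as in groups_sort(). */
void sk_groups_sort(sk_gid_t *g, int n)
{
	int base, max, stride;

	for (stride = 1; stride < n; stride = 3 * stride + 1)
		; /* find the first stride in the series that is >= n */
	stride /= 3;

	while (stride) {
		max = n - stride;
		for (base = 0; base < max; base++) {
			int left = base;
			int right = left + stride;
			sk_gid_t tmp = g[right];

			/* Insertion step over elements 'stride' apart. */
			while (left >= 0 && g[left] > tmp) {
				g[right] = g[left];
				right = left;
				left -= stride;
			}
			g[right] = tmp;
		}
		stride /= 3;
	}
}

/* Binary search over a sorted array; returns 1 if grp is present, else 0. */
int sk_groups_search(const sk_gid_t *g, unsigned int n, sk_gid_t grp)
{
	unsigned int left = 0, right = n;

	while (left < right) {
		unsigned int mid = left + (right - left) / 2;

		if (grp > g[mid])
			left = mid + 1;
		else if (grp < g[mid])
			right = mid;
		else
			return 1;
	}
	return 0;
}

/* Sort a small list with duplicates and a very large ID, then look things up. */
int sk_groups_selfcheck(void)
{
	sk_gid_t g[] = { 1000, 5, 4000000000u, 0, 42, 5 };
	int n = sizeof(g) / sizeof(g[0]);
	int i;

	sk_groups_sort(g, n);
	for (i = 1; i < n; i++)
		assert(g[i - 1] <= g[i]);

	assert(sk_groups_search(g, n, 4000000000u));
	assert(sk_groups_search(g, n, 0));
	assert(!sk_groups_search(g, n, 7));
	assert(!sk_groups_search(g, 0, 5));	/* empty list never matches */
	return 0;
}
```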

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-22  4:55 ` [PATCH 18/38] C/R: core stuff Alexey Dobriyan
@ 2009-05-26 13:16   ` Serge E. Hallyn
  2009-05-26 19:35     ` Alexey Dobriyan
  2009-05-27 16:28   ` Alexey Dobriyan
  1 sibling, 1 reply; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-26 13:16 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, xemul, containers, linux-kernel

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> Introduction
> ------------
> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
> for various reasons like saving process state in case of box failure or
> restoration of group of processes on another or same machine later.
> 
> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
> For that C/R code needs to know about all little and big intimate kernel details.
> 
> The good thing is that not all details needs to be serialized and saved
> like, say, readahead state. The bad things is still quite a few things
> need to be.

Hi Alexey,

the last time you posted this, I went through and tried to discern the
meaningful differences between yours and Oren's patchsets.  Then I sent some
patches to Oren to make his set configurable to act more like yours.  And Oren
took them!  But now you resend this patchset with no real changelog, no
acknowledgment that Oren's set even exists - or is much farther along and
pretty widely reviewed and tested (which is only because he started earlier
and, when we asked for your counterpatches at an earlier stage, you would never
reply) - or, most importantly, what it is that you think your patchset does
that his does not and cannot.  *Why* are you spending your time on this instead
of helping with Oren's set?  The code really isn't all that different...  Maybe
you just think that two independently written patchsets will expose more
gotchas that we'll need to catch, so you're continuing on this effort under the
expectation that eventually we'll merge the two sets?

Honestly, I have great respect for your coding abilities.  And if 'voices
from on high' tell us to base upon your code, I'd be fine with that, I
have no real problems with what I see on yet another cursory look.  But
given the amount of collective time that's been spent developing, reviewing,
and testing Oren's set, it wouldn't make any sense to just jump.  So
I'd still just like to know how you see this proceeding.

thanks,
-serge

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 17/38] groups: move code to kernel/groups.c
  2009-05-22  4:55 ` [PATCH 17/38] groups: move code to kernel/groups.c Alexey Dobriyan
  2009-05-25  0:53   ` Serge E. Hallyn
@ 2009-05-26 14:48   ` Serge E. Hallyn
  2009-05-26 18:34     ` Alexey Dobriyan
  1 sibling, 1 reply; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-26 14:48 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> Move supplementary groups implementation to kernel/groups.c .
> kernel/sys.c already accumulated quite a few random stuff.
> 
> Do strictly copy/paste + add required headers to compile.
> Compile-tested on many configs and archs.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Actually I needed the following change:

From 81e683ae07cfe24d98d6df384bbd37f1f7f80c38 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <serue@us.ibm.com>
Date: Tue, 26 May 2009 10:39:47 -0400
Subject: [PATCH 1/1] groups: allow compilation on s390x

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 kernel/groups.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index 1b95b2f..14ebc6a 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -1,6 +1,7 @@
 /*
  * Supplementary group IDs
  */
+#include <linux/init.h>
 #include <linux/cred.h>
 #include <linux/module.h>
 #include <linux/slab.h>
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 17/38] groups: move code to kernel/groups.c
  2009-05-26 14:48   ` Serge E. Hallyn
@ 2009-05-26 18:34     ` Alexey Dobriyan
  2009-05-26 23:25       ` Serge E. Hallyn
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-26 18:34 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

On Tue, May 26, 2009 at 09:48:19AM -0500, Serge E. Hallyn wrote:
> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > Move supplementary groups implementation to kernel/groups.c .
> > kernel/sys.c already accumulated quite a few random stuff.
> > 
> > Do strictly copy/paste + add required headers to compile.
> > Compile-tested on many configs and archs.
> > 
> > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> 
> Actually I needed the following change:

This is dealt with by the first patch:
"[PATCH 01/38] cred: #include init.h in cred.h"
which is in -mm already.

> --- a/kernel/groups.c
> +++ b/kernel/groups.c
> @@ -1,6 +1,7 @@
>  /*
>   * Supplementary group IDs
>   */
> +#include <linux/init.h>

Wrong, groups.c doesn't use __init at all.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-26 13:16   ` Serge E. Hallyn
@ 2009-05-26 19:35     ` Alexey Dobriyan
  2009-05-26 23:14       ` Serge E. Hallyn
                         ` (3 more replies)
  0 siblings, 4 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-26 19:35 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: akpm, xemul, containers, linux-kernel

On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > Introduction
> > ------------
> > Checkpoint/restart (C/R from now) allows to dump group of processes to disk
> > for various reasons like saving process state in case of box failure or
> > restoration of group of processes on another or same machine later.
> > 
> > Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
> > and dump more or less raw pages, proposed C/R doesn't require hypervisor.
> > For that C/R code needs to know about all little and big intimate kernel details.
> > 
> > The good thing is that not all details needs to be serialized and saved
> > like, say, readahead state. The bad things is still quite a few things
> > need to be.
> 
> Hi Alexey,
> 
> the last time you posted this, I went through and tried to discern the
> meaningful differences between yours and Oren's patchsets.  Then I sent some
> patches to Oren to make his set configurable to act more like yours.  And Oren
> took them!  But now you resend this patchset with no real changelog, no
> acknowledgment that Oren's set even exists

Is this a requirement? Everybody following the topic already knows about
Oren's patchset.

> - or is much farther along and pretty widely reviewed and tested (which is
> only because he started earlier and, when we asked for your counterpatches
> at an earlier stage, you would never reply) - or, most importantly, what
> it is that you think your patchset does that his does not and cannot.

There are differences. And they're not small like you're trying to describe
but pretty big compared to the scale of the problem.

> *Why* are you spending your time on this instead of helping with Oren's set?

Because we disagree with some core directions Oren chose.
ANK literally said: "I don't know how to dump live netns".

So, partly, the patchset was created so that absolutely nobody can tell us
to shut up and show the code.

The other part is that I looked at Oren's patchset, found quite a lot of
suspicious, broken and unclean places and decided that it'd be faster
to start from scratch, because sending patches would overhaul like 85% of
the code.

One example: why do CKPT_HDR_CPU and CKPT_RESTART_BLOCK exist at all?
Should objects in the image be only what sharable objects are in the kernel
(except VMAs, pages and possibly file descriptors)? pt_regs don't exist
by themselves after all.

And since you guys showed that just idea of in-kernel checkpointing is not
rejected outright, it doesn't mean that you can drag every single idea too.
Because history shows, that once something (especially user-visible,
like restart syscall semantics) is in kernel it's nearly impossible
to cut it out, so it's very-very important to get it right from the very
beginning.

Now here goes the second version, with prefixes fixed ("kstate_") like Ingo
suggested and so Linus could look at the code, with C/R code moved
close to the usual code, and with more checks added (which you should have
already!) to not restore a null selector in %cs, for example.

> The code really isn't all that different...

> Maybe you just think that two independently written patchsets will expose
> more gotchas that we'll need to catch, so you're continuing on this effort
> under the expectation that eventually we'll merge the two sets?

Well, it already exposes. Just print both, and watch for differences.

> Honestly, I have great respect for your coding abilities.  And if 'voices
> from on high' tell us to base upon your code, I'd be fine with that, I
> have no real problems with what I see on yet another cursory look.  But
> given the amount of collective time that's been spent developing, reviewing,
> and testing Oren's set, it wouldn't make any sense to just jump.  So
> I'd still just like to know how you see this proceeding.

Yes, please, someone decide on "checkpoint semi-live container" issue.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-26 19:35     ` Alexey Dobriyan
@ 2009-05-26 23:14       ` Serge E. Hallyn
  2009-05-26 23:44       ` Serge E. Hallyn
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-26 23:14 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, xemul, containers, linux-kernel

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
> > Honestly, I have great respect for your coding abilities.  And if 'voices
> > from on high' tell us to base upon your code, I'd be fine with that, I
> > have no real problems with what I see on yet another cursory look.  But
> > given the amount of collective time that's been spent developing, reviewing,
> > and testing Oren's set, it wouldn't make any sense to just jump.  So
> > I'd still just like to know how you see this proceeding.
> 
> Yes, please, someone decide on "checkpoint semi-live container" issue.

Well, *I* for one did decide :)  I decided that checkpointing a semi-live
container will not bother the kernel, and so the user gets to decide by
passing a flag into sys_checkpoint() whether he is ok with a checkpoint
if the container was semi-live.

BTW - it looks like your ISP is causing your emails to be flagged as
spam by several servers?  Several people don't seem to be receiving
your patchsets.

-serge

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 17/38] groups: move code to kernel/groups.c
  2009-05-26 18:34     ` Alexey Dobriyan
@ 2009-05-26 23:25       ` Serge E. Hallyn
  0 siblings, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-26 23:25 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, xemul, containers, linux-kernel, dave, mingo, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> On Tue, May 26, 2009 at 09:48:19AM -0500, Serge E. Hallyn wrote:
> > Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > > Move supplementary groups implementation to kernel/groups.c .
> > > kernel/sys.c already accumulated quite a few random stuff.
> > > 
> > > Do strictly copy/paste + add required headers to compile.
> > > Compile-tested on many configs and archs.
> > > 
> > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> > 
> > Actually I needed the following change:
> 
> This is dealt with by the first patch:
> "[PATCH 01/38] cred: #include init.h in cred.h"
> which is in -mm already.
> 
> > --- a/kernel/groups.c
> > +++ b/kernel/groups.c
> > @@ -1,6 +1,7 @@
> >  /*
> >   * Supplementary group IDs
> >   */
> > +#include <linux/init.h>
> 
> Wrong, groups.c doesn't use __init at all.

Right you are.

thanks,
-serge

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 16/38] x86: ptrace debugreg checks rewrite
  2009-05-22  4:55 ` [PATCH 16/38] x86: ptrace debugreg checks rewrite Alexey Dobriyan
@ 2009-05-26 23:25   ` Andrew Morton
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Morton @ 2009-05-26 23:25 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave,
	mingo, adobriyan

On Fri, 22 May 2009 08:55:10 +0400
Alexey Dobriyan <adobriyan@gmail.com> wrote:

> This is a mess.

heh.  I'm going to treat it as Ingo's mess :)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-26 19:35     ` Alexey Dobriyan
  2009-05-26 23:14       ` Serge E. Hallyn
@ 2009-05-26 23:44       ` Serge E. Hallyn
  2009-05-28 15:38         ` Alexey Dobriyan
  2009-05-27 18:52       ` Dave Hansen
  2009-05-27 20:56       ` Oren Laadan
  3 siblings, 1 reply; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-26 23:44 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, xemul, containers, linux-kernel

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> And since you guys showed that just idea of in-kernel checkpointing is not
> rejected outright, it doesn't mean that you can drag every single idea too.

Can you rephrase here?  I have no idea what you mean by 'drag every single
idea'

> Because history shows, that once something (especially user-visible,
> like restart syscall semantics) is in kernel it's nearly impossible
> to cut it out, so it's very-very important to get it right from the very
> beginning.

Absolutely agree (at least in terms of the API for sys_restart() and
sys_checkpoint(), not for inane other stuff like whether there is a
CKPT_OBJ_RESTART_BLOCK).  That is why 6 months ago we were trying to
get to an agreement with Andrey Mirkin about a precise API that would
definitely suit both approaches (in-kernel and userspace task recreation
at sys_restart).  That way the basics could go upstream and the rest
could get settled later as we see the fallout from either decision.

-serge


* Re: [PATCH 18/38] C/R: core stuff
  2009-05-22  4:55 ` [PATCH 18/38] C/R: core stuff Alexey Dobriyan
  2009-05-26 13:16   ` Serge E. Hallyn
@ 2009-05-27 16:28   ` Alexey Dobriyan
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-27 16:28 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, containers, torvalds, xemul, orenl, serue, dave, mingo

On Fri, May 22, 2009 at 08:55:12AM +0400, Alexey Dobriyan wrote:
> +static int task_struct_restorer(void *_tsk_ctx)
> +{
> +	struct task_struct_restore_context *tsk_ctx = _tsk_ctx;
> +	struct kstate_image_task_struct *i = tsk_ctx->i;
> +	struct kstate_context *ctx = tsk_ctx->ctx;
> +	/* In the name of symmetry. */
> +	struct task_struct *tsk = current, *real_parent;
> +	int rv;
> +
> +	pr_debug("%s: ENTER tsk %p/%s\n", __func__, tsk, tsk->comm);
> +
> +	write_lock_irq(&tasklist_lock);
> +	real_parent = ctx->init_tsk->nsproxy->pid_ns->child_reaper;
> +	tsk->real_parent = tsk->parent = real_parent;
> +	list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
							 ^^^^^^^
> +	write_unlock_irq(&tasklist_lock);

Eek, what a stupid bug here


commit 2c4b5f5d606a1892b702d95a0e4d29f207685381
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date:   Wed May 27 20:21:59 2009 +0400

    C/R: fix stupid bug in reparenting
    
    Child process should be added to ->children list of course
    
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index 9ed5a19..6df7d25 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -490,7 +490,7 @@ static int task_struct_restorer(void *_tsk_ctx)
 		real_parent = tmp->o_obj;
 	}
 	tsk->real_parent = tsk->parent = real_parent;
-	list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
+	list_move_tail(&tsk->sibling, &tsk->real_parent->children);
 	write_unlock_irq(&tasklist_lock);
 
 	rv = restore_mm(ctx, &i->ref_mm);
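
For readers who don't have the task lists in their head, the difference between the two list members can be sketched in userspace. This is only an illustration, not kernel code: struct task, reparent(), reparent_buggy() and the small list helpers are made-up stand-ins that mimic what list_move_tail() does on struct list_head. A child is linked through its own ->sibling node into the parent's ->children head; linking it into the parent's ->sibling puts it on the wrong list.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular, doubly linked struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Unlink e from whatever list it is on, then append it to head. */
static void move_tail(struct list_head *e, struct list_head *head)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;

	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

/* Hypothetical, heavily trimmed task: only the two list members matter. */
struct task {
	struct list_head children;	/* head of the list of my children */
	struct list_head sibling;	/* my link in my parent's children list */
};

static void task_init(struct task *t)
{
	init_list_head(&t->children);
	init_list_head(&t->sibling);
}

/* Count the entries hanging off a list head. */
static int list_count(const struct list_head *head)
{
	const struct list_head *p;
	int n = 0;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Fixed variant: the child shows up on the parent's ->children list. */
static void reparent(struct task *child, struct task *parent)
{
	assert(child != parent);
	move_tail(&child->sibling, &parent->children);
}

/* Buggy variant: the child becomes a "sibling" of its own parent. */
static void reparent_buggy(struct task *child, struct task *parent)
{
	assert(child != parent);
	move_tail(&child->sibling, &parent->sibling);
}
```

With the buggy variant list_count(&parent.children) stays at 0 after reparenting, so anything that walks the parent's children would never see the restored task.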


* Re: [PATCH 18/38] C/R: core stuff
  2009-05-26 19:35     ` Alexey Dobriyan
  2009-05-26 23:14       ` Serge E. Hallyn
  2009-05-26 23:44       ` Serge E. Hallyn
@ 2009-05-27 18:52       ` Dave Hansen
  2009-05-27 20:56       ` Oren Laadan
  3 siblings, 0 replies; 76+ messages in thread
From: Dave Hansen @ 2009-05-27 18:52 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul

On Tue, 2009-05-26 at 23:35 +0400, Alexey Dobriyan wrote:
> The other part, is that I looked at Oren patchset, found quite a lot of
> suspicious, broken and unclean places and decided that it'd be faster
> to start from scratch because sending patches will overhaul like 85% of
> the code.

I know the feeling.  I get sudden urges to rewrite the whole patch set,
but I'm working on getting past them too. :)

As long as we have two patch sets, *nobody* is going to get their
patches in, that's virtually guaranteed.  Just look at the poor I/O
controller.

The OpenVZ users are almost certainly the most important container and
c/r users out there today.  Meeting their needs with whatever we come up
with should be a top priority and I know I'm counting on you to help us
do that.

But, I'm having a really hard time culling the OpenVZ user needs from
your patch set.  If we really need to rewrite 85% of Oren's stuff to
meet the OpenVZ needs, then by all means let's do it.  I'm even willing
to help you.  But, I honestly don't know what you need.

Can we talk about specifics?

-- Dave



* Re: [PATCH 18/38] C/R: core stuff
  2009-05-26 19:35     ` Alexey Dobriyan
                         ` (2 preceding siblings ...)
  2009-05-27 18:52       ` Dave Hansen
@ 2009-05-27 20:56       ` Oren Laadan
  2009-05-27 22:17         ` Alexey Dobriyan
  2009-05-27 22:25         ` Alexey Dobriyan
  3 siblings, 2 replies; 76+ messages in thread
From: Oren Laadan @ 2009-05-27 20:56 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul



Alexey Dobriyan wrote:
> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
>>> Introduction
>>> ------------
>>> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
>>> for various reasons like saving process state in case of box failure or
>>> restoration of group of processes on another or same machine later.
>>>
>>> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
>>> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
>>> For that C/R code needs to know about all little and big intimate kernel details.
>>>
>>> The good thing is that not all details needs to be serialized and saved
>>> like, say, readahead state. The bad things is still quite a few things
>>> need to be.
>> Hi Alexey,
>>
>> the last time you posted this, I went through and tried to discern the
>> meaningful differences between yours and Oren's patchsets.  Then I sent some
>> patches to Oren to make his set configurable to act more like yours.  And Oren
>> took them!  But now you resend this patchset with no real changelog, no
>> acknowledgment that Oren's set even exists
> 
> Is this a requirement? Everybody following topic already knows about
> Oren's patchset.

Some people do ack other people's work. See for example patches #1
and #24 in my recent post. You're welcome.

> 
>> - or is much farther along and pretty widely reviewed and tested (which is
>> only because he started earlier and, when we asked for your counterpatches
>> at an earlier stage, you would never reply) - or, most importantly, what
>> it is that you think your patchset does that his does not and cannot.
> 
> There are differences. And they're not small like you're trying to describe
> but pretty big compared the scale of the problem.

I've asked before, and I repeat now: can you enumerate these "big"
scary differences that make it such a "big" problem ?

So far, we identified two main "design" issues -

1) Whether or not allow c/r of sub-container (partial hierarchy)

2) Creation of restarting process hierarchy in kernel or in userspace

As for #1, you are the _only_ one who advocates restricting c/r to
a full container only. I guess you have your reasons, but I'm unsure
what they may be.

On the other hand, there has been a handful of use-cases and opinions
in favor of allowing both capabilities to co-exist. Not to mention
that nearly no additional code is necessary, on the contrary.

As for #2, you didn't even bother to reply to the discussion that I
had started about it. This decision is important to allow future
flexibility of the mechanism, and to address the needs of several
potential users, as seen in that discussion and others. Here, too,
you are the _only_ one that advocates that direction.

And the funniest thing -- *both* decisions can be *easily* overturned
in my patchset. In fact, regarding #2 - either way can be easily done
in it.

So I wonder, what are the "big" issues that bother you so much ?
"if there is a will, there is a way".

> 
>> *Why* are you spending your time on this instead of helping with Oren's set?
> 
> Because we disagree with some core directions Oren chose.
> ANK literally said: "I don't know how to dump live netns".

Eh... and you have it all sorted out ?  (yeah, I do, but not in
this patchset).

> 
> So, partly patchset was created so that absolutely nobody will tell us
> to shut up and show the code.

Oh well ... "the" code meaning "your" code I suppose.

> 
> The other part, is that I looked at Oren patchset, found quite a lot of
> suspicious, broken and unclean places and decided that it'd be faster
> to start from scratch because sending patches will overhaul like 85% of
> the code.

So you actually took the time to read and review. And then you spent
even more time in .... calculating this number !  Feedback appreciated.

If you looked closely you would have seen that we do address your
concerns over time.

> 
> One example, is why CKPT_HDR_CPU and CKPT_RESTART_BLOCK exist at all?
> Should objects in image be only what sharable objects are in kernel
> (except VMAs, pages and possibly file descriptors)? pt_regs don't exist
> by themselves after all.

A good reason to break it into small pieces is for ease of maintenance
and debugging, as well as in the future easier transition between
incompatible kernel versions. I think it's better than a few-pages-long
single struct. And it encourages more naming of things.

But ... I'm confused ... is this your "big" concern ?  Oh well, if
that's what stands in your way, we could even rework that (~1.3% of
the code ? I reckon...).

> 
> And since you guys showed that just idea of in-kernel checkpointing is not
> rejected outright, it doesn't mean that you can drag every single idea too.
> Because history shows, that once something (especially user-visible,
> like restart syscall semantics) is in kernel it's nearly impossible
> to cut it out, so it's very-very important to get it right from the very
> beginning.
> 

Yes. Let's indeed talk about how to get it "right". Please do
participate in the public discussions and efforts. Working together
would be better for everyone.

> Now here goes second version, with prefixes fixed ("kstate_") like Ingo
> suggested and so Linus could look at the code and with C/R code moved
> close to usual code and with more checks added (which you should have
> already!) to not restore null selector in %cs for example.

It is far from perfect. In fact, it's even clearly commented as such,
and exactly there.  It would have been helpful if you pointed that
out in a review, or even - god forbid - sent a patch to improve it.

But it works, and it lets people play with a more-than-a-toy
implementation and provide us with important feedback. Oh, and by
the way, it doesn't require that people use containers to try it out.
Pretty handy, don't you think ?

> 
>> The code really isn't all that different...
> 
>> Maybe you just think that two independently written patchsets will expose
>> more gotchas that we'll need to catch, so you're continuing on this effort
>> under the expectation that eventually we'll merge the two sets?
> 
> Well, it already exposes. Just print both, and watch for differences.
> 
>> Honestly, I have great respect for your coding abilities.  And if 'voices
>> from on high' tell us to base upon your code, I'd be fine with that, I
>> have no real problems with what I see on yet another cursory look.  But
>> given the amount of collective time that's been spent developing, reviewing,
>> and testing Oren's set, it wouldn't make any sense to just jump.  So
>> I'd still just like to know how you see this proceeding.
> 
> Yes, please, someone decide on "checkpoint semi-live container" issue.

Have you not yet read enough opinion of users that would like to see
this capability ?  What would be enough to convince you ?  And really,
why not ?

Cheers,

Oren.



* Re: [PATCH 18/38] C/R: core stuff
  2009-05-27 20:56       ` Oren Laadan
@ 2009-05-27 22:17         ` Alexey Dobriyan
  2009-05-27 22:40           ` Andrew Morton
  2009-05-27 22:45           ` Oren Laadan
  2009-05-27 22:25         ` Alexey Dobriyan
  1 sibling, 2 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-27 22:17 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul

On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
> Alexey Dobriyan wrote:
> > On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
> >> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> >>> Introduction
> >>> ------------
> >>> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
> >>> for various reasons like saving process state in case of box failure or
> >>> restoration of group of processes on another or same machine later.
> >>>
> >>> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
> >>> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
> >>> For that C/R code needs to know about all little and big intimate kernel details.
> >>>
> >>> The good thing is that not all details needs to be serialized and saved
> >>> like, say, readahead state. The bad things is still quite a few things
> >>> need to be.
> >> Hi Alexey,
> >>
> >> the last time you posted this, I went through and tried to discern the
> >> meaningful differences between yours and Oren's patchsets.  Then I sent some
> >> patches to Oren to make his set configurable to act more like yours.  And Oren
> >> took them!  But now you resend this patchset with no real changelog, no
> >> acknowledgment that Oren's set even exists
> > 
> > Is this a requirement? Everybody following topic already knows about
> > Oren's patchset.
> 
> Some people do ack other people's work. See for example patches #1
> and #24 in my recent post. You're welcome.
> 
> > 
> >> - or is much farther along and pretty widely reviewed and tested (which is
> >> only because he started earlier and, when we asked for your counterpatches
> >> at an earlier stage, you would never reply) - or, most importantly, what
> >> it is that you think your patchset does that his does not and cannot.
> > 
> > There are differences. And they're not small like you're trying to describe
> > but pretty big compared the scale of the problem.
> 
> I've asked before, and I repeat now: can you enumerate these "big"
> scary differences that make it such a "big" problem ?
> 
> So far, we identified two main "design" issues -

Why in quotes? Yes, they are high-level design issues.

> 1) Whether or not allow c/r of sub-container (partial hierarchy)
> 
> 2) Creation of restarting process hierarchy in kernel or in userspace
> 
> As for #1, you are the _only_ one who advocates restricting c/r to
> a full container only. I guess you have your reasons, but I'm unsure
> what they may be.

The reason is that checkpointing half-frozen, half-live container is
essentially equivalent to live container which adds much complexity
to code fundamentally preventing kernel from taking coherent snapshot.

In such situations kernel will do its job badly.

Manpage will be filled with strings like "if $FOO is shared then $BAR is
not guaranteed".

What to do if user simply doesn't know if container is bounded?
Checkpoint and to hell with consequences?

If two tasks share an mm_struct and only one of them is frozen, you can't
even detect whether the pages you dump are being filled with garbage by the
second task in the meantime.

If two tasks share an mm_struct, the other task can issue AIO indefinitely,
preventing even a coherent filesystem snapshot from being taken.
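
The shared-mm case can be shown with a deliberately simplified, single-threaded simulation. Everything here is hypothetical and invented for the illustration (struct fake_mm, dump_mm(), the tiny 16-byte "pages"); the real dump code is of course different. The "other task" is simply called at a fixed point in the middle of the dump so that the outcome is deterministic.

```c
#include <assert.h>
#include <string.h>

#define NPAGES	4
#define PAGE_SZ	16	/* tiny "pages" keep the demo small */

/* Hypothetical stand-in for an address space shared by two tasks. */
struct fake_mm {
	unsigned char pages[NPAGES][PAGE_SZ];
};

/* The second task: stamps every page with a new generation number. */
static void other_task_writes(struct fake_mm *mm, unsigned char gen)
{
	memset(mm->pages, gen, sizeof(mm->pages));
}

/*
 * Dump the pages one by one into image.  If other_task_live is nonzero,
 * the task that was not frozen gets to run once, halfway through the dump.
 */
static void dump_mm(struct fake_mm *mm, struct fake_mm *image,
		    int other_task_live)
{
	int i;

	assert(mm != image);
	for (i = 0; i < NPAGES; i++) {
		memcpy(image->pages[i], mm->pages[i], PAGE_SZ);
		if (other_task_live && i == NPAGES / 2 - 1)
			other_task_writes(mm, 2);
	}
}

/* An image is coherent here if all pages carry the same generation. */
static int image_is_coherent(const struct fake_mm *image)
{
	int i;

	for (i = 1; i < NPAGES; i++)
		if (image->pages[i][0] != image->pages[0][0])
			return 0;
	return 1;
}
```

When the other task is frozen too, the image passes the check. When it is live, the first half of the image is from generation 1 and the second half from generation 2, and nothing in the dump loop itself can notice that.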

That's why I raise this issue again to hear from people what they think
and these people shouldn't be containers and C/R people, because the
latter already made up their minds.

This is super-important issue to get right from the beginning.

> On the other hand, there has been a handful of use-cases and opinions
> in favor of allowing both capabilities to co-exist. Not to mention
> that nearly no additional code is necessary, on the contrary.
> 
> As for #2, you didn't even bother to reply to the discussion that I
> had started about it. This decision is important to allow future
> flexibility of the mechanism, and to address the needs of several
> potential users, as seen in that discussion and others. Here, too,
> you are the _only_ one that advocates that direction.

Are you going to fork to-become-zombies, make them call restart(2) and
zombify?
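
To spell out what "zombify" means from userspace: the sketch below only covers the fork-and-become-zombie part, since restart(2) exists in neither mainline nor libc. The helper names (make_zombie(), is_zombie(), reap_child()) are made up for the example. It relies on waitid() with WNOWAIT, which reports an exited child without reaping it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits at once; until reaped it stays a zombie. */
static pid_t make_zombie(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(0);
	return pid;	/* -1 if fork() failed */
}

/*
 * Wait until pid has exited, but leave it unreaped (WNOWAIT).
 * Returns 1 if it is an exited, still waitable child, 0 if it ended
 * some other way, -1 if there is no such child.
 */
static int is_zombie(pid_t pid)
{
	siginfo_t info;

	assert(pid > 0);
	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
		return -1;
	return info.si_pid == pid && info.si_code == CLD_EXITED;
}

/* Really reap the child; returns 1 if it had exited normally. */
static int reap_child(pid_t pid)
{
	int status;

	return waitpid(pid, &status, 0) == pid && WIFEXITED(status);
}
```

Calling is_zombie() twice in a row gives the same answer, because the task stays in the process table until reap_child() collects it.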

> And the funniest thing -- *both* decisions can be *easily* overturned
> in my patchset. In fact, regarding #2 - either way can be easily done
> in it.
> 
> So I wonder, what are the "big" issues that bother you so much ?
> "if there is a will, there is a way".

Oren, don't you really understand?

Users want millions of things, but every thing has price.

Some think hardlinking of directories should be implemented. You can ask
VFS guys how hard would it be and how hard would it be to do reliably
without races/deadlocks et al.


* Re: [PATCH 18/38] C/R: core stuff
  2009-05-27 20:56       ` Oren Laadan
  2009-05-27 22:17         ` Alexey Dobriyan
@ 2009-05-27 22:25         ` Alexey Dobriyan
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-27 22:25 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul

On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
> > Now here goes second version, with prefixes fixed ("kstate_") like Ingo
> > suggested and so Linus could look at the code and with C/R code moved
> > close to usual code and with more checks added (which you should have
> > already!) to not restore null selector in %cs for example.
> 
> It is far from perfect. In fact, it's even clearly commented as such,
> and exactly there.  It would have been helpful if you pointed that
> out in a review, or even - god forbid - sent a patch to improve it.

This is ridiculous.

First, you declare that restart(2) should be allowed for anyone(!),
and then send a patchset for inclusion in -mm which doesn't even check
if selectors are right!

> But it works, and it lets people play with a more-than-a-toy
> implementation and provide us with important feedback. Oh, and by
> the way, it doesn't require that people use containers to try it out.

Setting up container for playing is not hard:

	CONFIG_UTS_NS=y		(CLONE_NEWUTS)
	CONFIG_IPC_NS=y		(CLONE_NEWIPC)
	CONFIG_PID_NS=y		(CLONE_NEWPID)
	CONFIG_USER_NS=y	(CLONE_NEWUSER)
	CONFIG_NET_NS=y		(CLONE_NEWNET)

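#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>

/* Fallbacks for libc headers that don't know these flags yet. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS	0x00020000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS	0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC	0x08000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER	0x10000000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID	0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET	0x40000000
#endif

#define STACK_SIZE	(4 * 4096)

/* First task inside the new namespaces: exec the requested program. */
static int fn(void *_argv)
{
	char **argv = _argv;

	setsid();
	setpgid(getpid(), getpid());

	execve(argv[0], argv, environ);
	perror("execve");	/* only reached if execve() failed */
	return 1;
}

int main(int argc, char *argv[])
{
	unsigned long flags = 0;
	int status;
	pid_t pid;
	char *stack;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /path/to/program [args...]\n", argv[0]);
		return 1;
	}

	flags |= CLONE_NEWNS;
	flags |= CLONE_NEWUTS;
	flags |= CLONE_NEWIPC;
	flags |= CLONE_NEWUSER;
	flags |= CLONE_NEWPID;
	flags |= CLONE_NEWNET;

	stack = malloc(STACK_SIZE);
	if (!stack)
		return 1;
	/* The stack grows down, so clone() gets the top of the allocation. */
	pid = clone(fn, stack + STACK_SIZE, flags, argv + 1);
	fprintf(stderr, "pid = %d\n", pid);
	if (pid == -1)
		return 1;
	/* No exit signal was requested, so __WALL is needed to wait for it. */
	waitpid(pid, &status, __WALL);
	return 0;
}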


* Re: [PATCH 18/38] C/R: core stuff
  2009-05-27 22:17         ` Alexey Dobriyan
@ 2009-05-27 22:40           ` Andrew Morton
  2009-05-27 22:45           ` Oren Laadan
  1 sibling, 0 replies; 76+ messages in thread
From: Andrew Morton @ 2009-05-27 22:40 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: orenl, serge, linux-kernel, containers, xemul

On Thu, 28 May 2009 02:17:53 +0400
Alexey Dobriyan <adobriyan@gmail.com> wrote:

> > 1) Whether or not allow c/r of sub-container (partial hierarchy)
> > 
> > 2) Creation of restarting process hierarchy in kernel or in userspace
> > 
> > As for #1, you are the _only_ one who advocates restricting c/r to
> > a full container only. I guess you have your reasons, but I'm unsure
> > what they may be.
> 
> The reason is that checkpointing half-frozen, half-live container is
> essentially equivalent to live container which adds much complexity
> to code fundamentally preventing kernel from taking coherent snapshot.
> 
> In such situations kernel will do its job badly.
> 
> Manpage will be filled with strings like "if $FOO is shared then $BAR is
> not guaranteed".
> 
> What to do if user simply doesn't know if container is bounded?
> Checkpoint and to hell with consequences?
> 
> If two tasks share mm_struct you can't even detect that pages you dump
> aren't filled with garbage meanwhile from second task.
> 
> If two tasks share mm_struct, other task can issue AIO indefinitely
> preventing from taking even coherent filesystem snapshot.
> 
> That's why I raise this issue again to hear from people what they think
> and these people shouldn't be containers and C/R people, because the
> latter already made up their minds.
> 
> This is super-important issue to get right from the beginning.

<pipes up>

yeah, checkpointing a partial hierarchy at this stage sounds like
overreach.  Get full-container working usably first, think about
sub-containers in version 2.

<pipes down again>


* Re: [PATCH 18/38] C/R: core stuff
  2009-05-27 22:17         ` Alexey Dobriyan
  2009-05-27 22:40           ` Andrew Morton
@ 2009-05-27 22:45           ` Oren Laadan
  2009-05-28 15:33             ` Alexey Dobriyan
  1 sibling, 1 reply; 76+ messages in thread
From: Oren Laadan @ 2009-05-27 22:45 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul



Alexey Dobriyan wrote:
> On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
>> Alexey Dobriyan wrote:
>>> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
>>>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
>>>>> Introduction
>>>>> ------------
>>>>> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
>>>>> for various reasons like saving process state in case of box failure or
>>>>> restoration of group of processes on another or same machine later.
>>>>>
>>>>> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
>>>>> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
>>>>> For that C/R code needs to know about all little and big intimate kernel details.
>>>>>
>>>>> The good thing is that not all details needs to be serialized and saved
>>>>> like, say, readahead state. The bad things is still quite a few things
>>>>> need to be.
>>>> Hi Alexey,
>>>>
>>>> the last time you posted this, I went through and tried to discern the
>>>> meaningful differences between yours and Oren's patchsets.  Then I sent some
>>>> patches to Oren to make his set configurable to act more like yours.  And Oren
>>>> took them!  But now you resend this patchset with no real changelog, no
>>>> acknowledgment that Oren's set even exists
>>> Is this a requirement? Everybody following topic already knows about
>>> Oren's patchset.
>> Some people do ack other people's work. See for example patches #1
>> and #24 in my recent post. You're welcome.
>>
>>>> - or is much farther along and pretty widely reviewed and tested (which is
>>>> only because he started earlier and, when we asked for your counterpatches
>>>> at an earlier stage, you would never reply) - or, most importantly, what
>>>> it is that you think your patchset does that his does not and cannot.
>>> There are differences. And they're not small like you're trying to describe
>>> but pretty big compared the scale of the problem.
>> I've asked before, and I repeat now: can you enumerate these "big"
>> scary differences that make it such a "big" problem ?
>>
>> So far, we identified two main "design" issues -
> 
> Why in quotes? Yes, they are high-level design issues.
> 

In quotes, because I argued further on that, although my patchset
takes a stand on both issues, it can be easily reverted _within_
that patchset. Moreover, I argue that they can co-exist.

>> 1) Whether or not allow c/r of sub-container (partial hierarchy)
>>
>> 2) Creation of restarting process hierarchy in kernel or in userspace
>>
>> As for #1, you are the _only_ one who advocates restricting c/r to
>> a full container only. I guess you have your reasons, but I'm unsure
>> what they may be.
> 
> The reason is that checkpointing half-frozen, half-live container is
> essentially equivalent to live container which adds much complexity
> to code fundamentally preventing kernel from taking coherent snapshot.
> 
> In such situations kernel will do its job badly.

In such situation the kernel will do a bad job if the user is asking
for a bad job. Just like checkpointing without snapshotting the
file system and expecting it to always work.

But if the user is a bit more careful (and even then, not that much),
she can enjoy the wonderful benefits of c/r without the wonderful
benefits of containers.

If useful, it's easy to pass a flag to checkpoint() that will ask
to enforce, say, shared memory "leaks" but not nsproxy or file "leaks".

In fact, even shared memory "leaks" may be useful for some users (e.g.
what the guys from kerlabs pointed out).

> 
> Manpage will be filled with strings like "if $FOO is shared then $BAR is
> not guaranteed".
> 
> What to do if user simply doesn't know if container is bounded?
> Checkpoint and to hell with consequences?
> 
> If two tasks share mm_struct you can't even detect that pages you dump
> aren't filled with garbage meanwhile from second task.
> 
> If two tasks share mm_struct, other task can issue AIO indefinitely
> preventing from taking even coherent filesystem snapshot.
> 
> That's why I raise this issue again to hear from people what they think
> and these people shouldn't be containers and C/R people, because the
> latter already made up their minds.

Lol .. and disagreement persists among us :)

And indeed, I have heard and seen already a few opinions in favor
of permitting non-container checkpoint. From potential users (not
c/r people).

> 
> This is super-important issue to get right from the beginning.
> 
>> On the other hand, there has been a handful of use-cases and opinions
>> in favor of allowing both capabilities to co-exist. Not to mention
>> that nearly no additional code is necessary, on the contrary.
>>
>> As for #2, you didn't even bother to reply to the discussion that I
>> had started about it. This decision is important to allow future
>> flexibility of the mechanism, and to address the needs of several
>> potential users, as seen in that discussion and others. Here, too,
>> you are the _only_ one that advocates that direction.
> 
> Are you going to fork to-become-zombies, make them call restart(2) and
> zombify?

Yes.

> 
>> And the funniest thing -- *both* decisions can be *easily* overturned
>> in my patchset. In fact, regarding #2 - either way can be easily done
>> in it.
>>
>> So I wonder, what are the "big" issues that bother you so much ?
>> "if there is a will, there is a way".
> 
> Oren, don't you really understand?
> 
> Users want millions of things, but every thing has price.

I beg to differ: there is marginal price to support both -- in fact,
enforcing the container requirement (e.g. leaks detection - which,
btw, is imperfect and cannot be made race-free) *adds* code over
the non-container case. So in a sense, we get the no-container case
for free.

Oren.



* Re: [PATCH 18/38] C/R: core stuff
  2009-05-27 22:45           ` Oren Laadan
@ 2009-05-28 15:33             ` Alexey Dobriyan
  2009-05-28 22:20               ` Oren Laadan
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-28 15:33 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul

On Wed, May 27, 2009 at 06:45:04PM -0400, Oren Laadan wrote:
> Alexey Dobriyan wrote:
> > On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
> >> Alexey Dobriyan wrote:
> >>> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
> >>>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> >>>>> Introduction
> >>>>> ------------
> >>>>> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
> >>>>> for various reasons like saving process state in case of box failure or
> >>>>> restoration of group of processes on another or same machine later.
> >>>>>
> >>>>> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
> >>>>> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
> >>>>> For that C/R code needs to know about all little and big intimate kernel details.
> >>>>>
> >>>>> The good thing is that not all details needs to be serialized and saved
> >>>>> like, say, readahead state. The bad things is still quite a few things
> >>>>> need to be.
> >>>> Hi Alexey,
> >>>>
> >>>> the last time you posted this, I went through and tried to discern the
> >>>> meaningful differences between yours and Oren's patchsets.  Then I sent some
> >>>> patches to Oren to make his set configurable to act more like yours.  And Oren
> >>>> took them!  But now you resend this patchset with no real changelog, no
> >>>> acknowledgment that Oren's set even exists
> >>> Is this a requirement? Everybody following topic already knows about
> >>> Oren's patchset.
> >> Some people do ack other people's work. See for example patches #1
> >> and #24 in my recent post. You're welcome.
> >>
> >>>> - or is much farther along and pretty widely reviewed and tested (which is
> >>>> only because he started earlier and, when we asked for your counterpatches
> >>>> at an earlier stage, you would never reply) - or, most importantly, what
> >>>> it is that you think your patchset does that his does not and cannot.
> >>> There are differences. And they're not small like you're trying to describe
> >>> but pretty big compared the scale of the problem.
> >> I've asked before, and I repeat now: can you enumerate these "big"
> >> scary differences that make it such a "big" problem ?
> >>
> >> So far, we identified two main "design" issues -
> > 
> > Why in quotes? Yes, they are high-level design issues.
> > 
> 
> In quotes, because I argued further on that, although my patchset
> takes a stand on both issues, it can be easily reverted _within_
> that patchset. Moreover, I argue that they can co-exist.
> 
> >> 1) Whether or not allow c/r of sub-container (partial hierarchy)
> >>
> >> 2) Creation of restarting process hierarchy in kernel or in userspace
> >>
> >> As for #1, you are the _only_ one who advocates restricting c/r to
> >> a full container only. I guess you have your reasons, but I'm unsure
> >> what they may be.
> > 
> > The reason is that checkpointing half-frozen, half-live container is
> > essentially equivalent to live container which adds much complexity
> > to code fundamentally preventing kernel from taking coherent snapshot.
> > 
> > In such situations kernel will do its job badly.
> 
> In such situation the kernel will do a bad job if the user is asking
> for a bad job.

User doesn't even understand why we're discussing this issue so hard.

> Just like checkpointing without snapshotting the file system and expecting
> it to always work.

This is different.

The kernel can't do anything about a not-synced fs, because nobody is
advocating that the kernel should sync the fs. Consequently, a screwup in fs
sync is clearly a user failure. Any (yours, mine) in-kernel C/R has this
failure mode, so we skip it and discuss what's left.

Now, kernel CAN do something about tasks and other data structures
because it easily controls them.

Your procedure for checkpointing starts with "kill -STOP".

To make anything reliable, you have to ban "kill -CONT" for the duration of
checkpointing. Is this done BTW? I don't remember new flags being added
to task_struct. Or is this going to be skipped on the grounds that it's
a user screwup (potentially oopsable)?

That's why OpenVZ relies solely on the suspend-to-RAM (STR) freezer: userspace
can't arbitrarily send suspend and freeze notifications. We only need to
protect against an untimely STR unfreeze, which adds code only to the C/R
code, not to task_struct.

But you aren't going to stop at task_struct; you'll add such flags all
over the place. Or you're going to declare that it's all the user's fault.

> But if the user is a bit more careful (and even then, not that much),
> she can enjoy the wonderful benefits of c/r without the wonderful
> benefits of containers.
> 
> If useful, it's easy to pass a flag to checkpoint() that will ask
> to enforce, say, shared memory "leaks" but not nsproxy or file "leaks".
> 
> In fact, even shared memory "leaks" may be useful for some users (e.g.
> what the guys from kerlabs pointed out).
> 
> > 
> > Manpage will be filled with strings like "if $FOO is shared then $BAR is
> > not guaranteed".
> > 
> > What to do if user simply doesn't know if container is bounded?
> > Checkpoint and to hell with consequences?
> > 
> > If two tasks share mm_struct you can't even detect that pages you dump
> > aren't filled with garbage meanwhile from second task.
> > 
> > If two tasks share mm_struct, other task can issue AIO indefinitely
> > preventing from taking even coherent filesystem snapshot.
> > 
> > That's why I raise this issue again to hear from people what they think
> > and these people shouldn't be containers and C/R people, because the
> > latter already made up their minds.
> 
> Lol .. and disagreement persists among us :)
> 
> And indeed, I have heard and seen already a few opinions in favor
> of permitting non-container checkpoint. From potential users (not
> c/r people).
> 
> > 
> > This is super-important issue to get right from the beginning.
> > 
> >> On the other hand, there has been a handful of use-cases and opinions
> >> in favor of allowing both capabilities to co-exist. Not to mention
> >> that nearly no additional code is necessary, on the contrary.
> >>
> >> As for #2, you didn't even bother to reply to the discussion that I
> >> had started about it. This decision is important to allow future
> >> flexibility of the mechanism, and to address the needs of several
> >> potential users, as seen in that discussion and others. Here, too,
> >> you are the _only_ one that advocates that direction.
> > 
> > Are you going to fork to-become-zombies, make them call restart(2) and
> > zombify?
> 
> Yes.
> 
> > 
> >> And the funniest thing -- *both* decisions can be *easily* overturned
> >> in my patchset. In fact, regarding #2 - either way can be easily done
> >> in it.
> >>
> >> So I wonder, what are the "big" issues that bother you so much ?
> >> "if there is a will, there is a way".
> > 
> > Oren, don't you really understand?
> > 
> > Users want millions of things, but every thing has price.
> 
> I beg to differ: there is marginal price to support both --

Wrong, try to do netns with sockets without deadlocks.

The price in the loose case is strictly bigger than the price in the strict
case, because the set of loose states is a strict superset of the set of
strict states.

> in fact, enforcing the container requirement (e.g. leaks detection -
> which, btw, is imperfect and cannot be made race-free)

It'll only generate false positives when say /proc/$PID/maps is being
read from outside, so mm->mm_users is elevated. I can return -EAGAIN.

If a file is opened via /proc/*/fd, only proc_fd_access_allowed() needs
to be tweaked to use a task_struct field, not a new struct file
field.

->mmap_sem is taken only for reading in /proc, so fine.

/proc/*/mem -- tweak mem_open().

struct pid itself -- refcount can be elevated because someone chdir'ed
into /proc/*, but I don't do refcount checks on struct pid, because it's
hard (inode in memory pins struct pid) and pid numbers themselves don't
change once taken. So, if struct pid is somehow leaked to outside,
there will be NO incoherency in image and no problems with locking or
whatever.

For netns elevated refcount, get_proc_task_net() will also reject
open(2) request based on task_struct field, not netns field!

I'm sorry, but all bases re refcount checks are pretty well covered
and are easily closed with tiny bit of code.

Please show counter-examples where, after all data structures are
collected, tasks are frozen and refcount checks have passed,
userspace can still access one of them for writing.

> *adds* code over the non-container case.

Whole 8 lines per data-structure:

	for_each_kstate_object(ctx, obj, KSTATE_CTX_MM_STRUCT) {
		struct mm_struct *mm = obj->o_obj;
		unsigned int cnt = atomic_read(&mm->mm_users);

		if (obj->o_count + 1 != cnt) {
			pr_err("mm_struct %p has external references %lu:%u\n", mm, obj->o_count, cnt);
			return -EINVAL;
		}
	}

No locking, no nothing, just counting refcounts.
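
To make the counting argument concrete outside the kernel, here is a minimal
userspace sketch of the same check. Everything in it is hypothetical
scaffolding (struct object, struct collected and check_leaks() are not from
the patchset), and the "+ 1" is assumed to stand for a single reference held
by the collecting context, which is one reading of the snippet above rather
than something it states:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted kernel object (think mm_struct). */
struct object {
	unsigned int refcount;		/* all references, like mm->mm_users */
};

/* Hypothetical stand-in for one entry in the collected set. */
struct collected {
	const struct object *obj;
	unsigned long seen;		/* references found from frozen tasks */
};

/*
 * Return 0 if every object is referenced only from inside the collected
 * set (plus the collector's own reference), -1 on the first "leak".
 */
int check_leaks(const struct collected *set, size_t n)
{
	size_t i;

	assert(set != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Assumption: the collector itself holds exactly one reference. */
		if (set[i].seen + 1 != set[i].obj->refcount)
			return -1;
	}
	return 0;
}
```

With an object whose refcount is 3 and two references seen from the frozen
tasks, check_leaks() returns 0. Bump the refcount to 4 (say, an outside
reader of /proc/$PID/maps) and it returns -1, which is the false-positive
case where the real check could return -EAGAIN instead.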

> So in a sense, we get the no-container case for free.

Not for free definitely.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-26 23:44       ` Serge E. Hallyn
@ 2009-05-28 15:38         ` Alexey Dobriyan
  2009-05-28 18:17           ` Serge E. Hallyn
  2009-05-28 22:42           ` Oren Laadan
  0 siblings, 2 replies; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-28 15:38 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: akpm, xemul, containers, linux-kernel

On Tue, May 26, 2009 at 06:44:36PM -0500, Serge E. Hallyn wrote:
> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > And since you guys showed that just idea of in-kernel checkpointing is not
> > rejected outright, it doesn't mean that you can drag every single idea too.
> 
> Can you rephrase here?  I have no idea what you mean by 'drag every single
> idea'

  									complexity
 +-|------------------------------------------------|----------------------|-->
hypervisor C/R					in-kernel C/R	C/R for unprivileged
								C/R with "leaks"

I personally thought in-kernel C/R would be rejected outright, but it wasn't.
That doesn't mean the other two ideas should be accepted as well.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-28 15:38         ` Alexey Dobriyan
@ 2009-05-28 18:17           ` Serge E. Hallyn
  2009-05-28 22:42           ` Oren Laadan
  1 sibling, 0 replies; 76+ messages in thread
From: Serge E. Hallyn @ 2009-05-28 18:17 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, xemul, containers, linux-kernel

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> On Tue, May 26, 2009 at 06:44:36PM -0500, Serge E. Hallyn wrote:
> > Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > > And since you guys showed that just idea of in-kernel checkpointing is not
> > > rejected outright, it doesn't mean that you can drag every single idea too.
> > 
> > Can you rephrase here?  I have no idea what you mean by 'drag every single
> > idea'
> 
>   									complexity
>  +-|------------------------------------------------|----------------------|-->
> hypervisor C/R					in-kernel C/R	C/R for unprivileged
> 								C/R with "leaks"
> 
> I personally thought in-kernel C/R will be rejected outright, but it wasn't.
> This in theory doesn't mean other two issues should be accepted.

Note again that leaving C/R unprivileged was suggested (by Arnd)
as a way to keep us on our toes.  When it actually goes upstream
I would prefer it be under a new CAP_CHECKPOINT_RESTART capability,
not CAP_SYS_ADMIN.

-serge

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-28 15:33             ` Alexey Dobriyan
@ 2009-05-28 22:20               ` Oren Laadan
  2009-05-28 22:33                 ` Matt Helsley
  2009-05-29  6:01                 ` Alexey Dobriyan
  0 siblings, 2 replies; 76+ messages in thread
From: Oren Laadan @ 2009-05-28 22:20 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul



Alexey Dobriyan wrote:
> On Wed, May 27, 2009 at 06:45:04PM -0400, Oren Laadan wrote:
>> Alexey Dobriyan wrote:
>>> On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
>>>> Alexey Dobriyan wrote:
>>>>> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
>>>>>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
>>>>>>> Introduction
>>>>>>> ------------
>>>>>>> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
>>>>>>> for various reasons like saving process state in case of box failure or
>>>>>>> restoration of group of processes on another or same machine later.
>>>>>>>
>>>>>>> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
>>>>>>> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
>>>>>>> For that C/R code needs to know about all little and big intimate kernel details.
>>>>>>>
>>>>>>> The good thing is that not all details needs to be serialized and saved
>>>>>>> like, say, readahead state. The bad things is still quite a few things
>>>>>>> need to be.
>>>>>> Hi Alexey,
>>>>>>
>>>>>> the last time you posted this, I went through and tried to discern the
>>>>>> meaningful differences between yours and Oren's patchsets.  Then I sent some
>>>>>> patches to Oren to make his set configurable to act more like yours.  And Oren
>>>>>> took them!  But now you resend this patchset with no real changelog, no
>>>>>> acknowledgment that Oren's set even exists
>>>>> Is this a requirement? Everybody following topic already knows about
>>>>> Oren's patchset.
>>>> Some people do ack other people's work. See for example patches #1
>>>> and #24 in my recent post. You're welcome.
>>>>
>>>>>> - or is much farther along and pretty widely reviewed and tested (which is
>>>>>> only because he started earlier and, when we asked for your counterpatches
>>>>>> at an earlier stage, you would never reply) - or, most importantly, what
>>>>>> it is that you think your patchset does that his does not and cannot.
>>>>> There are differences. And they're not small like you're trying to describe
>>>>> but pretty big compared to the scale of the problem.
>>>> I've asked before, and I repeat now: can you enumerate these "big"
>>>> scary differences that make it such a "big" problem ?
>>>>
>>>> So far, we identified two main "design" issues -
>>> Why in "? Yes, they are high-level design issues.
>>>
>> In quotes, because I argued further on that, although my patchset
>> takes a stand on both issues, it can be easily reverted _within_
>> that patchset. Moreover, I argue that they can co-exist.
>>
>>>> 1) Whether or not allow c/r of sub-container (partial hierarchy)
>>>>
>>>> 2) Creation of restarting process hierarchy in kernel or in userspace
>>>>
>>>> As for #1, you are the _only_ one who advocates restricting c/r to
>>>> a full container only. I guess you have your reasons, but I'm unsure
>>>> what they may be.
>>> The reason is that checkpointing half-frozen, half-live container is
>>> essentially equivalent to live container which adds much complexity
>>> to code fundamentally preventing kernel from taking coherent snapshot.
>>>
>>> In such situations kernel will do its job badly.
>> In such situation the kernel will do a bad job if the user is asking
>> for a bad job.
> 
> User doesn't even understand why we're discussing this issue so hard.
> 
>> Just like checkpointing without snapshotting the file system and expecting
>> it to always work.
> 
> This is different.
> 
> Kernel can't do anything about not-synced fs, because nobody is
> advocating that kernel should sync fs. Consequently, screwup in fs sync is
> clearly user failure. Any (yours, mine) in-kernel C/R has this failure mode,
> so we skip it and discuss what's left.
> 
> Now, kernel CAN do something about tasks and other data structures
> because it easily controls them.
> 
> Your procedure for checkpointing starts with "kill -STOP".

Wrong. It requires the processes to be frozen.

> To make anything reliable, you have to ban "kill -CONT" for the duration of
> checkpointing. Is this done BTW? I don't remember new flags added
> in task_struct. Or this is going to be skipped on grounds that it's
> user screwup (potentially oopsable).
> 
> That's why, OpenVZ relies on suspend-to-ram freezer solely, because userspace
> can't arbitrarily send suspend and freeze notifications. We only need to
> protect against untimely STR unfreeze which only adds code in C/R code
> not in task_struct.

Same principle for both patchsets: tasks must *not* be permitted to
execute while being checkpointed.

For this I suggested a CHECKPOINTING freezer state: transition to/from
this state is done _only_ by sys_checkpoint(), so that checkpointed
processes cannot be unfrozen. Matt Helsley already posted a patch to
implement this.
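
Roughly, the rule can be sketched as the state check below. This is a
userspace illustration under assumptions, not Matt's patch: the enum names,
the requester argument and freezer_change_state() are all made up for the
example, and it assumes CHECKPOINTING is entered from, and left to, FROZEN
only:

```c
#include <errno.h>

/* Hypothetical state names; the real freezer may spell them differently. */
enum freezer_state { THAWED, FREEZING, FROZEN, CHECKPOINTING };

/* Who is asking for the transition (hypothetical). */
enum requester { REQ_USER, REQ_CHECKPOINT };

/* Returns 0 and updates *cur on success, a negative errno otherwise. */
int freezer_change_state(enum freezer_state *cur, enum freezer_state next,
			 enum requester who)
{
	/* Only sys_checkpoint() may move a group into or out of CHECKPOINTING. */
	if ((*cur == CHECKPOINTING || next == CHECKPOINTING) &&
	    who != REQ_CHECKPOINT)
		return -EBUSY;

	/* Assumption: checkpointing starts from, and returns to, FROZEN. */
	if (next == CHECKPOINTING && *cur != FROZEN)
		return -EINVAL;
	if (*cur == CHECKPOINTING && next != FROZEN)
		return -EINVAL;

	*cur = next;
	return 0;
}
```

The point of the sketch is only that a user-initiated thaw gets -EBUSY for
as long as the checkpoint is in progress, so no extra per-task flag is
needed to keep the tasks frozen.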

> 
> But you aren't going to stop with task_struct, and add such flags all
> over the place. Or you're going to declare that it's all user fault.

I'm unsure what sort of flags you have in mind, or what problem those
alleged flags would solve. So far I haven't planned any.

> 
>> But if the user is a bit more careful (and even then, not that much),
>> she can enjoy the wonderful benefits of c/r without the wonderful
>> benefits of containers.
>>
>> If useful, it's easy to pass a flag to checkpoint() that will ask
>> to enforce, say, shared memory "leaks" but not nsproxy or file "leaks".
>>
>> In fact, even shared memory "leaks" may be useful for some users (e.g.
>> what the guys from kerlabs pointed out).
>>
>>> Manpage will be filled with strings like "if $FOO is shared then $BAR is
>>> not guaranteed".
>>>
>>> What to do if user simply doesn't know if container is bounded?
>>> Checkpoint and to hell with consequences?
>>>
>>> If two tasks share mm_struct you can't even detect that pages you dump
>>> aren't filled with garbage meanwhile from second task.
>>>
>>> If two tasks share mm_struct, other task can issue AIO indefinitely
>>> preventing from taking even coherent filesystem snapshot.
>>>
>>> That's why I raise this issue again to hear from people what they think
>>> and these people shouldn't be containers and C/R people, because the
>>> latter already made up their minds.
>> Lol .. and disagreement persists among us :)
>>
>> And indeed, I have heard and seen already a few opinions in favor
>> of permitting non-container checkpoint. From potential users (not
>> c/r people).
>>
>>> This is super-important issue to get right from the beginning.
>>>
>>>> On the other hand, there has been a handful of use-cases and opinions
>>>> in favor of allowing both capabilities to co-exist. Not to mention
>>>> that nearly no additional code is necessary, on the contrary.
>>>>
>>>> As for #2, you didn't even bother to reply to the discussion that I
>>>> had started about it. This decision is important to allow future
>>>> flexibility of the mechanism, and to address the needs of several
>>>> potential users, as seen in that discussion and others. Here, too,
>>>> you are the _only_ one that advocates that direction.
>>> Are you going to fork to-become-zombies, make them call restart(2) and
>>> zombify?
>> Yes.
>>
>>>> And the funniest thing -- *both* decisions can be *easily* overturned
>>>> in my patchset. In fact, regarding #2 - either way can be easily done
>>>> in it.
>>>>
>>>> So I wonder, what are the "big" issues that bother you so much ?
>>>> "if there is a will, there is a way".
>>> Oren, don't you really understand?
>>>
>>> Users want millions of things, but every thing has price.
>> I beg to differ: there is marginal price to support both --
> 
> Wrong, try to do netns with sockets without deadlocks.

Let's indeed discuss netns - that's an important issue. But it's
also huge - "do netns" may mean different things to different
people. How about you start a new thread explaining why it will
be such a problem ?

> 
> Price in loose case is strictly bigger than price in strict case
> because set of loose states is strict superset of set of strict cases.

The code to enforce full-container approach is a *superset* of the
code without this requirement.

If you can checkpoint a full container, surely you can checkpoint a
sub-hierarchy of processes.

We both agree that the sub-hierarchy case is not idiot-proof: it's
garbage in, garbage out. In the worst case, the restart will simply
fail, but no other harm done.

(And if you are concerned about security, well - that's a different
topic - and one can disable c/r for non-privileged users).

> 
>> in fact, enforcing the container requirement (e.g. leaks detection -
>> which, btw, is imperfect and cannot be made race-free)
> 
> It'll only generate false positives when say /proc/$PID/maps is being
> read from outside, so mm->mm_users is elevated. I can return -EAGAIN.
> 
> If file is opened via /proc/*/fd, proc_fd_access_allowed() is only
> needed to be tweaked and use task_struct field, not new struct file
> field.
> 
> ->mmap_sem is taken only for reading in /proc, so fine.
> 
> /proc/*/mem -- tweak mem_open().
> 
> struct pid, itself -- refcount can be elevated because someone chdir'ed
> into /proc/*, but I don't do refcount checks on struct pid, because it's
> hard (inode in memory pins struct pid) and pids numbers themselves don't
> change once taken. So, if struct pid is somehow leaked to outside,
> there will be NO incoherency in image and no problems with locking or
> whatever.
> 
> For netns elevated refcount, get_proc_task_net() will also reject
> open(2) request based on task_struct field, not netns field!
> 
> I'm sorry, but all bases re refcount checks are pretty well covered
> and are easily closed with tiny bit of code.
> 
> Please show counter-examples where after all data structures are
> collected and tasks being frozen and refcount checks passed,
> userspace can still access one of them for writing.

ptrace?  (Sure, it can also be "tweaked" to disallow access while the
task is being checkpointed. Everything can :)

> 
>> *adds* code over the non-container case.
> 
> Whole 8 lines per data-structure:
> 
> 	for_each_kstate_object(ctx, obj, KSTATE_CTX_MM_STRUCT) {
>                 struct mm_struct *mm = obj->o_obj;
>                 unsigned int cnt = atomic_read(&mm->mm_users);
> 
>                 if (obj->o_count + 1 != cnt) {
>                         pr_err("mm_struct %p has external references %lu:%u\n", mm, obj->o_count, cnt);
>                         return -EINVAL;
>                 }
>         }
> 
> No locking, no nothing, just counting refcounts.
> 
>> So in a sense, we get the no-container case for free.
> 
> Not for free definitely.
> 

s/Not//    :)

If you take out (or simply skip) the isolation/coherency tests for
a full-container, you are left with ... sub-hierarchy support. Voila.

Oren.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-28 22:20               ` Oren Laadan
@ 2009-05-28 22:33                 ` Matt Helsley
  2009-05-29  6:01                 ` Alexey Dobriyan
  1 sibling, 0 replies; 76+ messages in thread
From: Matt Helsley @ 2009-05-28 22:33 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Alexey Dobriyan, containers, akpm, xemul, linux-kernel

On Thu, May 28, 2009 at 06:20:25PM -0400, Oren Laadan wrote:
> 
> 
> Alexey Dobriyan wrote:
> > On Wed, May 27, 2009 at 06:45:04PM -0400, Oren Laadan wrote:
> >> Alexey Dobriyan wrote:
> >>> On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
> >>>> Alexey Dobriyan wrote:
> >>>>> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
> >>>>>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> >>>>>>> Introduction
> >>>>>>> ------------
> >>>>>>> Checkpoint/restart (C/R from now) allows to dump group of processes to disk
> >>>>>>> for various reasons like saving process state in case of box failure or
> >>>>>>> restoration of group of processes on another or same machine later.
> >>>>>>>
> >>>>>>> Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
> >>>>>>> and dump more or less raw pages, proposed C/R doesn't require hypervisor.
> >>>>>>> For that C/R code needs to know about all little and big intimate kernel details.
> >>>>>>>
> >>>>>>> The good thing is that not all details needs to be serialized and saved
> >>>>>>> like, say, readahead state. The bad things is still quite a few things
> >>>>>>> need to be.
> >>>>>> Hi Alexey,
> >>>>>>
> >>>>>> the last time you posted this, I went through and tried to discern the
> >>>>>> meaningful differences between yours and Oren's patchsets.  Then I sent some
> >>>>>> patches to Oren to make his set configurable to act more like yours.  And Oren
> >>>>>> took them!  But now you resend this patchset with no real changelog, no
> >>>>>> acknowledgment that Oren's set even exists
> >>>>> Is this a requirement? Everybody following topic already knows about
> >>>>> Oren's patchset.
> >>>> Some people do ack other people's work. See for example patches #1
> >>>> and #24 in my recent post. You're welcome.
> >>>>
> >>>>>> - or is much farther along and pretty widely reviewed and tested (which is
> >>>>>> only because he started earlier and, when we asked for your counterpatches
> >>>>>> at an earlier stage, you would never reply) - or, most importantly, what
> >>>>>> it is that you think your patchset does that his does not and cannot.
> >>>>> There are differences. And they're not small like you're trying to describe
> >>>>> but pretty big compared to the scale of the problem.
> >>>> I've asked before, and I repeat now: can you enumerate these "big"
> >>>> scary differences that make it such a "big" problem ?
> >>>>
> >>>> So far, we identified two main "design" issues -
> >>> Why in "? Yes, they are high-level design issues.
> >>>
> >> In quotes, because I argued further on that, although my patchset
> >> takes a stand on both issues, it can be easily reverted _within_
> >> that patchset. Moreover, I argue that they can co-exist.
> >>
> >>>> 1) Whether or not allow c/r of sub-container (partial hierarchy)
> >>>>
> >>>> 2) Creation of restarting process hierarchy in kernel or in userspace
> >>>>
> >>>> As for #1, you are the _only_ one who advocates restricting c/r to
> >>>> a full container only. I guess you have your reasons, but I'm unsure
> >>>> what they may be.
> >>> The reason is that checkpointing half-frozen, half-live container is
> >>> essentially equivalent to live container which adds much complexity
> >>> to code fundamentally preventing kernel from taking coherent snapshot.
> >>>
> >>> In such situations kernel will do its job badly.
> >> In such situation the kernel will do a bad job if the user is asking
> >> for a bad job.
> > 
> > User doesn't even understand why we're discussing this issue so hard.
> > 
> >> Just like checkpointing without snapshotting the file system and expecting
> >> it to always work.
> > 
> > This is different.
> > 
> > Kernel can't do anything about not-synced fs, because nobody is
> > advocating that kernel should sync fs. Consequently, screwup in fs sync is
> > clearly user failure. Any (yours, mine) in-kernel C/R has this failure mode,
> > so we skip it and discuss what's left.
> > 
> > Now, kernel CAN do something about tasks and other data structures
> > because it easily controls them.
> > 
> > Your procedure for checkpointing starts with "kill -STOP".
> 
> Wrong. It requires the processes to be frozen.
> 
> > To make anything reliable, you have to ban "kill -CONT" for the duration of
> > checkpointing. Is this done BTW? I don't remember new flags added
> > in task_struct. Or this is going to be skipped on grounds that it's
> > user screwup (potentially oopsable).
> > 
> > That's why, OpenVZ relies on suspend-to-ram freezer solely, because userspace
> > can't arbitrarily send suspend and freeze notifications. We only need to
> > protect against untimely STR unfreeze which only adds code in C/R code
> > not in task_struct.
> 
> Same principle for both patchsets:  tasks may *not* be permitted to
> execute while being checkpointed.
> 
> For this I suggested a CHECKPOINTING freezer state: transition to/from
> this state is done _only_ by sys_checkpoint(), so that checkpointed
> processes cannot be unfrozen. Matt Helsley already posted a patch to
> implement this.

In case it helps, here's the patch and some feedback Oren gave me:

https://lists.linux-foundation.org/pipermail/containers/2009-May/017586.html

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-28 15:38         ` Alexey Dobriyan
  2009-05-28 18:17           ` Serge E. Hallyn
@ 2009-05-28 22:42           ` Oren Laadan
  1 sibling, 0 replies; 76+ messages in thread
From: Oren Laadan @ 2009-05-28 22:42 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul



Alexey Dobriyan wrote:
> On Tue, May 26, 2009 at 06:44:36PM -0500, Serge E. Hallyn wrote:
>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
>>> And since you guys showed that just idea of in-kernel checkpointing is not
>>> rejected outright, it doesn't mean that you can drag every single idea too.
>> Can you rephrase here?  I have no idea what you mean by 'drag every single
>> idea'
> 
>   									complexity
>  +-|------------------------------------------------|----------------------|-->
> hypervisor C/R					in-kernel C/R	C/R for unprivileged
> 								C/R with "leaks"
> 

c/r for unprivileged users is indeed more complex - it requires more care.
Keep in mind, however, that most of that extra code is sanity checks that
we would like to have in the code anyway.

c/r with "leaks" is actually a subset of the code that prevents leaks:
simply comment out the leak detection :)

Oren.


> I personally thought in-kernel C/R will be rejected outright, but it wasn't.
> This in theory doesn't mean other two issues should be accepted.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-28 22:20               ` Oren Laadan
  2009-05-28 22:33                 ` Matt Helsley
@ 2009-05-29  6:01                 ` Alexey Dobriyan
  2009-05-29 17:26                   ` Dave Hansen
  1 sibling, 1 reply; 76+ messages in thread
From: Alexey Dobriyan @ 2009-05-29  6:01 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Serge E. Hallyn, linux-kernel, akpm, containers, xemul

On Thu, May 28, 2009 at 06:20:25PM -0400, Oren Laadan wrote:
> Alexey Dobriyan wrote:
> > Price in loose case is strictly bigger than price in strict case
> > because set of loose states is strict superset of set of strict cases.
> 
> The code to enforce full-container approach is a *superset* of the
> code without this requirement.
> 
> If you can checkpoint a full container, surely you can checkpoint a
> sub-hierarchy of processes.

No!

In the strict case, one doesn't even need uts_sem to checkpoint uts_ns
reliably.

In the loose case, one needs uts_sem.

	static int dump_uts_ns(struct kstate_context *ctx, struct kstate_object *obj)
	{
		struct uts_namespace *uts_ns = obj->o_obj;
		struct kstate_image_uts_ns *i;
		int rv;
	
		i = kstate_prepare_image(KSTATE_OBJ_UTS_NS, sizeof(*i));
		if (!i)
			return -ENOMEM;
	
		if (ctx->dump_live)
			down_read(&uts_sem);
		strncpy(i->sysname, (const char *)uts_ns->name.sysname, 64);
		strncpy(i->nodename, (const char *)uts_ns->name.nodename, 64);
		strncpy(i->release, (const char *)uts_ns->name.release, 64);
		strncpy(i->version, (const char *)uts_ns->name.version, 64);
		strncpy(i->machine, (const char *)uts_ns->name.machine, 64);
		strncpy(i->domainname, (const char *)uts_ns->name.domainname, 64);
		if (ctx->dump_live)
			up_read(&uts_sem);

		rv = kstate_write_image(ctx, i, sizeof(*i), obj);
		kfree(i);
		return rv;
	}

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/38] C/R: core stuff
  2009-05-29  6:01                 ` Alexey Dobriyan
@ 2009-05-29 17:26                   ` Dave Hansen
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Hansen @ 2009-05-29 17:26 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Oren Laadan, containers, akpm, xemul, linux-kernel

On Fri, 2009-05-29 at 10:01 +0400, Alexey Dobriyan wrote:
>                 if (ctx->dump_live)
>                         down_read(&uts_sem);
>                 strncpy(i->sysname, (const char *)uts_ns->name.sysname, 64);
>                 strncpy(i->nodename, (const char *)uts_ns->name.nodename, 64);
>                 strncpy(i->release, (const char *)uts_ns->name.release, 64);
>                 strncpy(i->version, (const char *)uts_ns->name.version, 64);
>                 strncpy(i->machine, (const char *)uts_ns->name.machine, 64);
>                 strncpy(i->domainname, (const char *)uts_ns->name.domainname, 64);
>                 if (ctx->dump_live)
>                         up_read(&uts_sem);

Doesn't this turn a minor bug (like a task getting accidentally
unfrozen) into an oopsable bug?

I mean, doing this without a lock is nice and all, but it adds code to
make it selectable.  What's the purpose of avoiding the semaphore
anyway?  Performance?

-- Dave


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns()
  2009-05-22  4:55 ` [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns() Alexey Dobriyan
  2009-05-22  9:14   ` Amerigo Wang
  2009-05-24 22:45   ` Serge E. Hallyn
@ 2009-06-04  0:17   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 76+ messages in thread
From: Sukadev Bhattiprolu @ 2009-06-04  0:17 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

Alexey Dobriyan wrote:
> copy_pid_ns() is a perfect example of a case where unwinding leads to more
> code and makes it less clear. Watch the diffstat.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

> ---
>  kernel/pid_namespace.c |   18 +++---------------
>  1 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 495d5de..821722a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -116,23 +116,11 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
>  
>  struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
>  {
> -	struct pid_namespace *new_ns;
> -
> -	BUG_ON(!old_ns);
> -	new_ns = get_pid_ns(old_ns);
>  	if (!(flags & CLONE_NEWPID))
> -		goto out;
> -
> -	new_ns = ERR_PTR(-EINVAL);
> +		return get_pid_ns(old_ns);
>  	if (flags & CLONE_THREAD)
> -		goto out_put;
> -
> -	new_ns = create_pid_namespace(old_ns);
> -
> -out_put:
> -	put_pid_ns(old_ns);
> -out:
> -	return new_ns;
> +		return ERR_PTR(-EINVAL);
> +	return create_pid_namespace(old_ns);
>  }
>  
>  void free_pid_ns(struct kref *kref)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns
  2009-05-22  4:55 ` [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns Alexey Dobriyan
  2009-05-22  9:20   ` Amerigo Wang
  2009-05-24 22:44   ` Serge E. Hallyn
@ 2009-06-04  0:20   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 76+ messages in thread
From: Sukadev Bhattiprolu @ 2009-06-04  0:20 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, containers, torvalds, xemul, orenl, serue,
	dave, mingo

Alexey Dobriyan wrote:
> create_pid_namespace() creates everything, but caller has to assign parent
> pidns by hand, which is unnatural. At the moment of call new ->level has
> to be taken from somewhere and parent pidns is already available.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

> ---
>  kernel/pid_namespace.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 2d1001b..495d5de 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -67,9 +67,10 @@ err_alloc:
>  	return NULL;
>  }
>  
> -static struct pid_namespace *create_pid_namespace(unsigned int level)
> +static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_pid_ns)
>  {
>  	struct pid_namespace *ns;
> +	unsigned int level = parent_pid_ns->level + 1;
>  	int i;
>  
>  	ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
> @@ -86,6 +87,7 @@ static struct pid_namespace *create_pid_namespace(unsigned int level)
>  
>  	kref_init(&ns->kref);
>  	ns->level = level;
> +	ns->parent = get_pid_ns(parent_pid_ns);
>  
>  	set_bit(0, ns->pidmap[0].page);
>  	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
> @@ -125,9 +127,7 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old
>  	if (flags & CLONE_THREAD)
>  		goto out_put;
>  
> -	new_ns = create_pid_namespace(old_ns->level + 1);
> -	if (!IS_ERR(new_ns))
> -		new_ns->parent = get_pid_ns(old_ns);
> +	new_ns = create_pid_namespace(old_ns);
>  
>  out_put:
>  	put_pid_ns(old_ns);


end of thread, other threads:[~2009-06-04  0:22 UTC | newest]

Thread overview: 76+ messages
2009-05-22  4:54 [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
2009-05-22  4:54 ` [PATCH 02/38] utsns: extract create_uts_ns() Alexey Dobriyan
2009-05-24 22:37   ` Serge E. Hallyn
2009-05-22  4:54 ` [PATCH 03/38] ipcns 1/4: remove useless get/put while CLONE_NEWIPC Alexey Dobriyan
2009-05-22  9:00   ` Amerigo Wang
2009-05-22  4:54 ` [PATCH 04/38] ipcns 2/4: extract create_ipc_ns() Alexey Dobriyan
2009-05-22  8:59   ` Amerigo Wang
2009-05-22  4:54 ` [PATCH 05/38] ipcns 3/4: make free_ipc_ns() static Alexey Dobriyan
2009-05-24 22:40   ` Serge E. Hallyn
2009-05-22  4:55 ` [PATCH 06/38] ipcns 4/2: move free_ipcs() proto Alexey Dobriyan
2009-05-24 22:49   ` Serge E. Hallyn
2009-05-22  4:55 ` [PATCH 07/38] pidns 1/2: make create_pid_namespace() accept parent pidns Alexey Dobriyan
2009-05-22  9:20   ` Amerigo Wang
2009-05-24 22:44   ` Serge E. Hallyn
2009-06-04  0:20   ` Sukadev Bhattiprolu
2009-05-22  4:55 ` [PATCH 08/38] pidns 2/2: rewrite copy_pid_ns() Alexey Dobriyan
2009-05-22  9:14   ` Amerigo Wang
2009-05-24 22:45   ` Serge E. Hallyn
2009-06-04  0:17   ` Sukadev Bhattiprolu
2009-05-22  4:55 ` [PATCH 09/38] netns 1/2: don't get/put old netns on CLONE_NEWNET Alexey Dobriyan
2009-05-22  6:30   ` David Miller
2009-05-22  4:55 ` [PATCH 10/38] netns 2/2: extract net_create() Alexey Dobriyan
2009-05-22  6:30   ` David Miller
2009-05-22  4:55 ` [PATCH 11/38] nsproxy: extract create_nsproxy() Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 12/38] i386: ifdef out struct thread_struct::fs Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 13/38] x86_64: ifdef out struct thread_struct::ip Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 14/38] Remove struct mm_struct::exe_file et al Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 15/38] dcache: extract and use d_unlinked() Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 16/38] x86: ptrace debugreg checks rewrite Alexey Dobriyan
2009-05-26 23:25   ` Andrew Morton
2009-05-22  4:55 ` [PATCH 17/38] groups: move code to kernel/groups.c Alexey Dobriyan
2009-05-25  0:53   ` Serge E. Hallyn
2009-05-26 14:48   ` Serge E. Hallyn
2009-05-26 18:34     ` Alexey Dobriyan
2009-05-26 23:25       ` Serge E. Hallyn
2009-05-22  4:55 ` [PATCH 18/38] C/R: core stuff Alexey Dobriyan
2009-05-26 13:16   ` Serge E. Hallyn
2009-05-26 19:35     ` Alexey Dobriyan
2009-05-26 23:14       ` Serge E. Hallyn
2009-05-26 23:44       ` Serge E. Hallyn
2009-05-28 15:38         ` Alexey Dobriyan
2009-05-28 18:17           ` Serge E. Hallyn
2009-05-28 22:42           ` Oren Laadan
2009-05-27 18:52       ` Dave Hansen
2009-05-27 20:56       ` Oren Laadan
2009-05-27 22:17         ` Alexey Dobriyan
2009-05-27 22:40           ` Andrew Morton
2009-05-27 22:45           ` Oren Laadan
2009-05-28 15:33             ` Alexey Dobriyan
2009-05-28 22:20               ` Oren Laadan
2009-05-28 22:33                 ` Matt Helsley
2009-05-29  6:01                 ` Alexey Dobriyan
2009-05-29 17:26                   ` Dave Hansen
2009-05-27 22:25         ` Alexey Dobriyan
2009-05-27 16:28   ` Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 19/38] C/R: multiple tasks Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 20/38] C/R: i386 support Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 21/38] C/R: i386 debug registers Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 22/38] C/R: i386 xstate Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 23/38] C/R: x86_64 support Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 24/38] C/R: x86_64 debug registers Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 25/38] C/R: x86_64 xstate Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 26/38] C/R: nsproxy Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 27/38] C/R: checkpoint/restore struct uts_namespace Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 28/38] C/R: formally checkpoint/restore struct ipc_namespace Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 29/38] C/R: formally checkpoint/restore struct mnt_namespace Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 30/38] C/R: checkpoint/restore struct pid_namespace Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 31/38] C/R: formally checkpoint/restore struct net_namespace Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 32/38] C/R: checkpoint/restore struct cred Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 33/38] C/R: checkpoint/restore aux groups (structy group_info) Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 34/38] C/R: checkpoint/restore struct user Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 35/38] C/R: checkpoint/restore struct user_namespace Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 36/38] C/R: checkpoint/restore struct pid Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 37/38] C/R: checkpoint/restore opened files Alexey Dobriyan
2009-05-22  4:55 ` [PATCH 38/38] C/R: checkpoint/restart struct sighand_struct Alexey Dobriyan
2009-05-22  5:02 ` [PATCH 01/38] cred: #include init.h in cred.h Alexey Dobriyan
