* [RFC 00/10] container-based checkpoint/restart prototype
@ 2011-02-28 23:40 ntl
  2011-02-28 23:40 ` [PATCH 01/10] Make exec_mmap extern ntl
                   ` (10 more replies)
  0 siblings, 11 replies; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Checkpoint/restart is a facility by which one can save the state of a
job to a file and restart it later under the right conditions.  This
is a C/R prototype intended to illustrate how well (or poorly) it
would fit into the Linux kernel.  It is basically a fork of the
"linux-cr" patch set by Oren Laadan and others, but it is more limited
in scope and has a different system call interface.  I believe what I
have here is a decent starting point for a C/R implementation that can
go upstream, but I'm releasing early with the hope of receiving some
feedback/review on the overall approach before pursuing it too much
further.

The intended users are HPC sites and other big homogeneous clusters:
environments with long-running jobs that are not easily interrupted
without losing work, for whatever reason (perhaps you've misplaced
the source code for your program and can't modify it to checkpoint
and restore its own state).  In these situations checkpoint/restart
provides a rollback mechanism to mitigate the effects of
hardware/system failures, as well as a means of migrating jobs
between nodes.


How it works:

Only a process with PID 1 ("init") can call checkpoint or restart.

Checkpoint freezes the rest of the pidns and dumps the state of all
the other tasks in the PID namespace to the specified file
descriptor.  The state of the caller is not recorded.

Before calling restart, init is expected to set up the environment
(mounts, net devices and such) in accord with the checkpointed job's
"expectations".  The restart system call recreates the task tree
(except for init itself) and the tasks resume execution; init can
then wait(2) for tasks to exit in the normal fashion.
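
To make that concrete, here is a rough sketch of the intended usage
from user space.  It is illustrative only: __NR_checkpoint stands in
for whatever syscall number the patched kernel assigns, and the
program must run with the privilege needed to create a new pid
namespace.

/* Hypothetical example, not part of this series. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_checkpoint
#define __NR_checkpoint 0	/* placeholder; see the patched unistd.h */
#endif

static char init_stack[64 * 1024];

static int container_init(void *arg)		/* pid 1 in the new pidns */
{
	int fd = *(int *)arg;
	pid_t job = fork();			/* the job to be checkpointed */

	if (job == 0) {
		execlp("sleep", "sleep", "1000", (char *)NULL);
		_exit(1);
	}
	sleep(1);				/* let the job settle */
	if (syscall(__NR_checkpoint, fd, 0))	/* dump everything except us */
		perror("checkpoint");
	kill(job, SIGKILL);
	return waitpid(job, NULL, 0) == job ? 0 : 1;
}

int main(void)
{
	int fd = open("image.ckpt", O_WRONLY | O_CREAT | O_TRUNC, 0600);
	pid_t init = clone(container_init, init_stack + sizeof(init_stack),
			   CLONE_NEWPID | SIGCHLD, &fd);

	return waitpid(init, NULL, 0) == init ? 0 : 1;
}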


Limitations:

This implementation is limited to containers by design (and this
prototype is limited to checkpoint/restore of a single simple task).
A Linux "container" doesn't have a universally agreed-upon
definition, but in this context we are referring to a group of
processes whose PID namespace (and possibly other namespaces) is
isolated from the rest of the system (see clone(2)).  This is the
tradeoff we ask users to make: the ability to C/R and migrate is
provided in exchange for accepting some isolation and slightly
reduced ease of use.  A tool such as lxc (http://lxc.sourceforge.net)
can be used to isolate jobs; a patch against lxc that adds C/R
capability is available.

The user must ensure that a restarted job's view of the filesystem is
effectively the same as it was at the time of checkpoint.

Processes that map device memory and other such hardware-dependent
things will probably not be supported.


To do:

Multiple tasks
Signal state
System call restart blocks
More code cleanup/simplification
Other architecture support
System V IPC
Network/sockets
And much more


 Documentation/filesystems/vfs.txt  |   13 +-
 arch/x86/Kconfig                   |    4 +
 arch/x86/include/asm/checkpoint.h  |   17 +
 arch/x86/include/asm/elf.h         |    5 +
 arch/x86/include/asm/ldt.h         |    7 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/kernel/Makefile           |    2 +
 arch/x86/kernel/checkpoint.c       |  677 +++++++++++++++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    2 +
 arch/x86/vdso/vdso32-setup.c       |   25 +-
 drivers/char/mem.c                 |    6 +
 drivers/char/random.c              |    6 +
 fs/Makefile                        |    1 +
 fs/aio.c                           |   27 ++
 fs/checkpoint.c                    |  695 +++++++++++++++++++++++++++
 fs/exec.c                          |    2 +-
 fs/ext2/dir.c                      |    3 +
 fs/ext2/file.c                     |    6 +
 fs/ext3/dir.c                      |    3 +
 fs/ext3/file.c                     |    3 +
 fs/ext4/dir.c                      |    3 +
 fs/ext4/file.c                     |    6 +
 fs/fcntl.c                         |   21 +-
 fs/locks.c                         |   35 ++
 include/linux/aio.h                |    2 +
 include/linux/checkpoint.h         |  347 ++++++++++++++
 include/linux/fs.h                 |   15 +
 include/linux/magic.h              |    3 +
 include/linux/mm.h                 |   15 +
 init/Kconfig                       |    2 +
 kernel/Makefile                    |    1 +
 kernel/checkpoint/Kconfig          |   15 +
 kernel/checkpoint/Makefile         |    9 +
 kernel/checkpoint/checkpoint.c     |  437 +++++++++++++++++
 kernel/checkpoint/objhash.c        |  368 +++++++++++++++
 kernel/checkpoint/restart.c        |  651 ++++++++++++++++++++++++++
 kernel/checkpoint/sys.c            |  208 +++++++++
 kernel/sys_ni.c                    |    4 +
 mm/Makefile                        |    1 +
 mm/checkpoint.c                    |  906 ++++++++++++++++++++++++++++++++++++
 mm/filemap.c                       |    4 +
 mm/mmap.c                          |    3 +
 42 files changed, 4549 insertions(+), 15 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint.h
 create mode 100644 arch/x86/kernel/checkpoint.c
 create mode 100644 fs/checkpoint.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 kernel/checkpoint/Kconfig
 create mode 100644 kernel/checkpoint/Makefile
 create mode 100644 kernel/checkpoint/checkpoint.c
 create mode 100644 kernel/checkpoint/objhash.c
 create mode 100644 kernel/checkpoint/restart.c
 create mode 100644 kernel/checkpoint/sys.c
 create mode 100644 mm/checkpoint.c

-- 
1.7.4



* [PATCH 01/10] Make exec_mmap extern
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
@ 2011-02-28 23:40 ` ntl
  2011-04-03 16:56   ` Serge E. Hallyn
  2011-02-28 23:40 ` [PATCH 02/10] Introduce mm_has_pending_aio() helper ntl
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Restoration of process state from a checkpoint image is similar to
exec in that the calling task's mm is replaced.  Make exec_mmap
available for this purpose.
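
As a rough illustration (not part of this patch), a hypothetical
restore-side caller could look like this, assuming the new mm has
already been populated from the checkpoint image:

/* Hypothetical sketch only. */
static int install_restored_mm(struct mm_struct *mm)
{
	int err;

	err = exec_mmap(mm);	/* replaces current->mm, as execve() does */
	if (err)
		mmput(mm);	/* on failure we still own our reference */
	return err;
}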

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
[ntl: extracted from Oren's "c/r: dump memory address space (private memory)"]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 fs/exec.c          |    2 +-
 include/linux/mm.h |    3 +++
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c62efcb..9d8c27a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -767,7 +767,7 @@ int kernel_read(struct file *file, loff_t offset,
 
 EXPORT_SYMBOL(kernel_read);
 
-static int exec_mmap(struct mm_struct *mm)
+int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct * old_mm, *active_mm;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721f451..5397237 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1321,6 +1321,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
+/* fs/exec.c */
+extern int exec_mmap(struct mm_struct *mm);
+
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
-- 
1.7.4



* [PATCH 02/10] Introduce mm_has_pending_aio() helper
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
  2011-02-28 23:40 ` [PATCH 01/10] Make exec_mmap extern ntl
@ 2011-02-28 23:40 ` ntl
  2011-03-01 15:40   ` Jeff Moyer
  2011-02-28 23:40 ` [PATCH 03/10] Introduce has_locks_with_owner() helper ntl
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Support for AIO is on the to-do list, but until that is implemented,
checkpoint will have to fail if a mm_struct has outstanding AIO
contexts.  Add a mm_has_pending_aio() helper function for this
purpose.
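
As a rough illustration (not part of this patch), a hypothetical
caller on the checkpoint path could look like:

/* Hypothetical sketch only. */
static int may_checkpoint_mm(struct mm_struct *mm)
{
	/* in-flight AIO could complete and touch pages behind our back */
	if (mm_has_pending_aio(mm))
		return -EBUSY;
	return 0;
}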

Based on original "check_for_outstanding_aio" patch by Serge Hallyn.

Signed-off-by: Serge E. Hallyn <serge@hallyn.com>
[ntl: changed name and return type to clearly express semantics]
[ntl: added kerneldoc]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 fs/aio.c            |   27 +++++++++++++++++++++++++++
 include/linux/aio.h |    2 ++
 2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8c8f6c5..1acbc99 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1847,3 +1847,30 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
 	asmlinkage_protect(5, ret, ctx_id, min_nr, nr, events, timeout);
 	return ret;
 }
+
+/**
+ * mm_has_pending_aio() - check for outstanding AIO operations
+ * @mm:		The mm_struct to check.
+ *
+ * Returns true if there is at least one non-dead kioctx on
+ * @mm->ioctx_list.  Note that the result of this function is
+ * unreliable unless the caller has ensured that new requests cannot
+ * be submitted against @mm (e.g. through freezing the associated
+ * tasks).
+ */
+bool mm_has_pending_aio(struct mm_struct *mm)
+{
+	struct kioctx *ctx;
+	struct hlist_node *n;
+	bool has_aio = false;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
+		if (!ctx->dead) {
+			has_aio = true;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return has_aio;
+}
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7a8db41..39d9936 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -214,6 +214,7 @@ struct mm_struct;
 extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
+extern bool mm_has_pending_aio(struct mm_struct *mm);
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline int aio_put_req(struct kiocb *iocb) { return 0; }
@@ -224,6 +225,7 @@ static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
 				struct iocb __user * __user *iocbpp,
 				bool compat) { return 0; }
+static inline bool mm_has_pending_aio(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_AIO */
 
 static inline struct kiocb *list_kiocb(struct list_head *h)
-- 
1.7.4



* [PATCH 03/10] Introduce has_locks_with_owner() helper
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
  2011-02-28 23:40 ` [PATCH 01/10] Make exec_mmap extern ntl
  2011-02-28 23:40 ` [PATCH 02/10] Introduce mm_has_pending_aio() helper ntl
@ 2011-02-28 23:40 ` ntl
  2011-04-03 18:55   ` Serge E. Hallyn
  2011-02-28 23:40 ` [PATCH 04/10] Introduce vfs_fcntl() helper ntl
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Support for file locks is in the works, but until that is done
checkpoint needs to fail when an open file has locks.
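
As a rough illustration (not part of this patch), a hypothetical
caller on the checkpoint path could look like the sketch below; for
POSIX locks the owner is the files_struct of the locking task:

/* Hypothetical sketch only. */
static int may_checkpoint_file(struct task_struct *t, struct file *file)
{
	if (has_locks_with_owner(file, t->files))
		return -EBUSY;	/* lock state is not saved yet */
	return 0;
}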

Based on original "find_locks_with_owner" patch by Dave Hansen.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
[ntl: changed name and return type to clearly express semantics]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 fs/locks.c         |   35 +++++++++++++++++++++++++++++++++++
 include/linux/fs.h |    6 ++++++
 2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 8729347..961e17f 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2037,6 +2037,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+bool has_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	bool ret = false;
+
+	lock_flocks();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = true;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = true;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_flocks();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..315ded4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1138,6 +1138,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern bool has_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1208,6 +1209,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline bool has_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return false;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
-- 
1.7.4



* [PATCH 04/10] Introduce vfs_fcntl() helper
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (2 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 03/10] Introduce has_locks_with_owner() helper ntl
@ 2011-02-28 23:40 ` ntl
  2011-04-03 18:57   ` Serge E. Hallyn
  2011-02-28 23:40 ` [PATCH 05/10] Core checkpoint/restart support code ntl
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

When restoring process state from a checkpoint image, it will be
necessary to restore file status flags; add vfs_fcntl() for this
purpose.
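
As a rough illustration (not part of this patch), a hypothetical
restore-side caller could look like this, with the saved flags taken
from the checkpoint image:

/* Hypothetical sketch only. */
static int restore_file_flags(int fd, struct file *file, unsigned int saved_flags)
{
	/* F_SETFL honours only the settable status flags (O_APPEND,
	 * O_NONBLOCK, ...); access mode bits are ignored */
	return vfs_fcntl(fd, F_SETFL, saved_flags, file);
}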

Based on original code by Oren Laadan.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
[ntl: extracted from "c/r: checkpoint and restart open file descriptors"]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 fs/fcntl.c         |   21 +++++++++++++--------
 include/linux/fs.h |    2 ++
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ecc8b39..8e797b7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -426,6 +426,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {	
 	struct file *filp;
@@ -435,14 +447,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
  	fput(filp);
 out:
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 315ded4..175bb75 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1112,6 +1112,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 #ifdef CONFIG_FILE_LOCKING
-- 
1.7.4



* [PATCH 05/10] Core checkpoint/restart support code
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (3 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 04/10] Introduce vfs_fcntl() helper ntl
@ 2011-02-28 23:40 ` ntl
  2011-04-03 19:03   ` Serge E. Hallyn
  2011-02-28 23:40 ` [PATCH 06/10] Checkpoint/restart mm support ntl
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch, Alexey Dobriyan

From: Nathan Lynch <ntl@pobox.com>

Add a pair of system calls to save and restore the state of an
isolated (via clone/unshare) set of tasks and resources:

long checkpoint(int fd, unsigned int flags);
long restart(int fd, unsigned int flags);

Only a pid namespace init task - the child process produced by a call
to clone(2) with CLONE_NEWPID - is allowed to call these.  The state
of the calling task itself is not saved or altered by these system
calls.  Checkpoint dumps the state (CPU registers, open files, memory
map) of the tasks in the pid namespace to the supplied file
descriptor.  Restart is intended to be called by a pidns init in an
otherwise unpopulated pid namespace; it repopulates the caller's pidns
from the stream supplied by the file descriptor argument.

The flags argument to both syscalls must be zero at this time.  The
file descriptor argument may refer to a pipe or socket, i.e. it need
not be seekable.

On success both checkpoint and restart return 0.
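
As an illustration (not part of this patch), the restart side is
expected to be driven from user space roughly as sketched below.
__NR_restart is a placeholder for the assigned syscall number, and
restart_init() is meant to be the child function handed to clone(2)
with CLONE_NEWPID so that it runs as pid 1 of an otherwise empty pid
namespace:

/* Hypothetical sketch only. */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_restart
#define __NR_restart 0	/* placeholder; see the patched unistd.h */
#endif

static int restart_init(void *arg)
{
	int fd = *(int *)arg;		/* checkpoint image; may be a pipe */

	/* set up mounts, network devices, etc. expected by the image here */

	if (syscall(__NR_restart, fd, 0))
		return 1;		/* pidns was not repopulated */

	while (wait(NULL) > 0)		/* reap the restored task(s) normally */
		;
	return 0;
}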

Restart operations use the kthread API to restore tasks[1].  This
necessarily involves some ugly stuff like messing with task->parent,
real_parent, signal disposition etc. but provides a known consistent
state to start with.

This patch is based on original code written by Oren Laadan.

NOTE: This version of the code supports C/R of a single task only.
Pid 1 can call checkpoint while there is a single other task in its
pidns.  Restart can restore just one task into the caller's pidns.

[1] credit to A. Dobriyan for this technique; all bugs are ntl's

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
[ntl: aggregated various C/R patches from Oren]
[ntl: removed deferqueue]
[ntl: clean up CKPT_VMA_NOT_SUPPORTED]
[ntl: remove logfd argument from syscalls]
[ntl: bugfix: correct locking when looking up task by pid]
[ntl: remove superfluous #define CKPT_FOO CKPT_FOO]
[ntl: decouple various objhash APIs from checkpoint context]
[ntl: s/ckpt_err/ckpt_debug/]
[ntl: remove ckpt_msg and associated APIs]
[ntl: remove pid argument from syscalls]
[ntl: make sys_restart freeze current's pidns]
[ntl: make C/R constrained to containers/pidns]
[ntl: implement task restore entirely in-kernel]
[ntl: remove CONFIG_CHECKPOINT_DEBUG; just use #define DEBUG]
[ntl: remove various non-essential APIs]
[ntl: consolidate related headers into checkpoint.h]
[ntl: remove various unneeded symbol exports]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 include/linux/checkpoint.h     |  347 +++++++++++++++++++++
 include/linux/magic.h          |    3 +
 init/Kconfig                   |    2 +
 kernel/Makefile                |    1 +
 kernel/checkpoint/Kconfig      |   15 +
 kernel/checkpoint/Makefile     |    9 +
 kernel/checkpoint/checkpoint.c |  437 +++++++++++++++++++++++++++
 kernel/checkpoint/objhash.c    |  368 +++++++++++++++++++++++
 kernel/checkpoint/restart.c    |  651 ++++++++++++++++++++++++++++++++++++++++
 kernel/checkpoint/sys.c        |  208 +++++++++++++
 kernel/sys_ni.c                |    4 +
 11 files changed, 2045 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 kernel/checkpoint/Kconfig
 create mode 100644 kernel/checkpoint/Makefile
 create mode 100644 kernel/checkpoint/checkpoint.c
 create mode 100644 kernel/checkpoint/objhash.c
 create mode 100644 kernel/checkpoint/restart.c
 create mode 100644 kernel/checkpoint/sys.c

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..9129860
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,347 @@
+#ifndef _LINUX_CHECKPOINT_H_
+#define _LINUX_CHECKPOINT_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/list.h>
+#include <linux/path.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+/*
+ * header format: 'struct ckpt_hdr' must prefix all other
+ * headers. Therefore when a header is passed around, the information
+ * about it (type, size) is readily available. Structs that include a
+ * struct ckpt_hdr are named struct ckpt_hdr_* by convention (usually
+ * the struct ckpt_hdr is the first member).
+ */
+struct ckpt_hdr {
+	__u32 type;
+	__u32 len;
+};
+
+/* header types */
+enum {
+	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_HEADER_ARCH,
+	CKPT_HDR_BUFFER,
+	CKPT_HDR_STRING,
+	CKPT_HDR_OBJREF,
+
+	CKPT_HDR_TASK = 101,
+	CKPT_HDR_TASK_OBJS,
+	CKPT_HDR_THREAD,
+	CKPT_HDR_CPU,
+
+	/* 201-299: reserved for arch-dependent */
+
+	CKPT_HDR_FILE_TABLE = 301,
+	CKPT_HDR_FILE_DESC,
+	CKPT_HDR_FILE_NAME,
+	CKPT_HDR_FILE,
+
+	CKPT_HDR_MM = 401,
+	CKPT_HDR_VMA,
+	CKPT_HDR_MM_CONTEXT,
+	CKPT_HDR_PAGE,
+
+	CKPT_HDR_TAIL = 9001,
+};
+
+/* architecture */
+enum {
+	CKPT_ARCH_X86_32 = 1,
+};
+
+/* shared objects (objref) */
+struct ckpt_hdr_objref {
+	struct ckpt_hdr h;
+	__u32 objtype;
+	__s32 objref;
+};
+
+/* shared objects types */
+enum obj_type {
+	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_FILE_TABLE,
+	CKPT_OBJ_FILE,
+	CKPT_OBJ_MM,
+	CKPT_OBJ_MAX
+};
+
+/* kernel constants */
+struct ckpt_const {
+	/* task */
+	__u16 task_comm_len;
+	/* mm */
+	__u16 at_vector_size;
+	/* uts */
+	__u16 uts_release_len;
+	__u16 uts_version_len;
+	__u16 uts_machine_len;
+};
+
+/* checkpoint image header */
+struct ckpt_hdr_header {
+	struct ckpt_hdr h;
+	__u64 magic;
+
+	__u16 arch_id;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+
+	struct ckpt_const constants;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 uflags;	/* uflags from checkpoint */
+
+	/*
+	 * the header is followed by three strings:
+	 *   char release[const.uts_release_len];
+	 *   char version[const.uts_version_len];
+	 *   char machine[const.uts_machine_len];
+	 */
+};
+
+/* checkpoint image trailer */
+struct ckpt_hdr_tail {
+	struct ckpt_hdr h;
+	__u64 magic;
+};
+
+/* task data */
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__u64 set_child_tid;
+	__u64 clear_child_tid;
+};
+
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+	__s32 mm_objref;
+};
+
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+};
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+};
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+	CKPT_FILE_GENERIC,
+	CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+};
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+};
+
+/* memory layout */
+struct ckpt_hdr_mm {
+	struct ckpt_hdr h;
+	__u32 map_count;
+	__s32 exe_objref;
+
+	__u64 def_flags;
+	__u64 flags;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+};
+
+/* vma subtypes - index into restore_vma_dispatch[] */
+enum vma_type {
+	CKPT_VMA_IGNORE = 0,
+	CKPT_VMA_VDSO,		/* special vdso vma */
+	CKPT_VMA_ANON,		/* private anonymous */
+	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_MAX
+};
+
+/* vma descriptor */
+struct ckpt_hdr_vma {
+	struct ckpt_hdr h;
+	__u32 vma_type;
+	__s32 vma_objref;	/* objref of backing file */
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+};
+
+/* page */
+struct ckpt_hdr_page {
+	struct ckpt_hdr hdr;
+#define CKPT_VMA_LAST_PAGE (~0UL)
+	__u64 vaddr;
+};
+
+struct ckpt_ctx {
+	struct ckpt_obj_hash *obj_hash; /* repository for shared objects */
+	struct task_struct *root_task;  /* pidns init and caller */
+	struct path root_fs_path;       /* container root */
+	struct task_struct *tsk;        /* checkpoint: current target task */
+	struct file *file;              /* input/output file */
+};
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, size_t count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, size_t count);
+
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, size_t len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, size_t len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, size_t len);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, size_t len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, size_t len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, size_t len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, size_t len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, size_t max, int type);
+extern int ckpt_read_payload(struct ckpt_ctx *ctx,
+			     void **ptr, size_t max, int type);
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
+/* obj_hash */
+extern void ckpt_obj_hash_free(struct ckpt_obj_hash *obj_hash);
+extern struct ckpt_obj_hash *ckpt_obj_hash_alloc(void);
+
+extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
+extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
+			  enum obj_type type);
+extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			       enum obj_type type, int *first);
+extern void *ckpt_obj_try_fetch(struct ckpt_ctx *ctx, int objref,
+				enum obj_type type);
+extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
+			    enum obj_type type);
+
+extern int do_checkpoint(struct ckpt_ctx *ctx);
+extern int do_restart(struct ckpt_ctx *ctx);
+
+/* arch hooks */
+extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
+extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
+
+extern int restore_read_header_arch(struct ckpt_ctx *ctx);
+extern int restore_thread(struct ckpt_ctx *ctx);
+extern int restore_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
+
+/* file table */
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			       struct ckpt_hdr_file *h);
+
+/* memory */
+struct vm_area_struct;
+extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+
+extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
+
+#define CKPT_VMA_NOT_SUPPORTED (		\
+		VM_HUGETLB |			\
+		VM_INSERTPAGE |			\
+		VM_IO |				\
+		VM_MAPPED_COPY |		\
+		VM_MAYSHARE |			\
+		VM_MIXEDMAP |			\
+		VM_NONLINEAR |			\
+		VM_NORESERVE |			\
+		VM_PFNMAP |			\
+		VM_RESERVED |			\
+		VM_SAO |			\
+		VM_SHARED |			\
+		0)
+
+#define __ckpt_debug(fmt, args...)					\
+	do {								\
+		pr_devel("[%d:%d:c/r:%s:%d] " fmt,			\
+			 current->pid,					\
+			 current->nsproxy ?				\
+			 task_pid_vnr(current) : -1,			\
+			 __func__, __LINE__, ## args);			\
+	} while (0)
+
+#define ckpt_debug(fmt, args...)  \
+	__ckpt_debug(fmt, ## args)
+
+/* object operations */
+struct ckpt_obj_ops {
+	char *obj_name;
+	int obj_type;
+	void (*ref_drop)(void *ptr, int lastref);
+	int (*ref_grab)(void *ptr);
+	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
+	void *(*restore)(struct ckpt_ctx *ctx);
+};
+
+#ifdef CONFIG_CHECKPOINT
+extern int register_checkpoint_obj(const struct ckpt_obj_ops *ops);
+#else /* CONFIG_CHECKPOINT */
+static inline int register_checkpoint_obj(const struct ckpt_obj_ops *ops)
+{
+	return 0;
+}
+#endif /* CONFIG_CHECKPOINT */
+
+#endif /* _LINUX_CHECKPOINT_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index ff690d0..30cd986 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -59,4 +59,7 @@
 #define SOCKFS_MAGIC		0x534F434B
 #define V9FS_MAGIC		0x01021997
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/Kconfig b/init/Kconfig
index c972899..cf6ce1f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -793,6 +793,8 @@ config RELAY
 
 	  If unsure, say N.
 
+source "kernel/checkpoint/Kconfig"
+
 config BLK_DEV_INITRD
 	bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
 	depends on BROKEN || !FRV
diff --git a/kernel/Makefile b/kernel/Makefile
index 0b5ff08..3f6238c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_PERF_EVENTS) += perf_event.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint/
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/checkpoint/Kconfig b/kernel/checkpoint/Kconfig
new file mode 100644
index 0000000..21fc86b
--- /dev/null
+++ b/kernel/checkpoint/Kconfig
@@ -0,0 +1,15 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	depends on CGROUP_FREEZER
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/kernel/checkpoint/Makefile b/kernel/checkpoint/Makefile
new file mode 100644
index 0000000..3431310
--- /dev/null
+++ b/kernel/checkpoint/Makefile
@@ -0,0 +1,9 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += \
+	sys.o \
+	objhash.o \
+	checkpoint.o \
+	restart.o
diff --git a/kernel/checkpoint/checkpoint.c b/kernel/checkpoint/checkpoint.c
new file mode 100644
index 0000000..bef1d30
--- /dev/null
+++ b/kernel/checkpoint/checkpoint.c
@@ -0,0 +1,437 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define DEBUG
+
+#include <linux/checkpoint.h>
+#include <linux/dcache.h>
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/fs.h>
+#include <linux/fs_struct.h>
+#include <linux/magic.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/pid_namespace.h>
+#include <linux/ptrace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/time.h>
+#include <linux/version.h>
+#include <linux/utsname.h>
+
+#include <asm/checkpoint.h>
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	ckpt_debug("type %d len %d\n", h->type, h->len);
+	return ckpt_kwrite(ctx, h, h->len);
+}
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ *
+ * If @ptr is NULL, then write only the header (payload to follow)
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, size_t len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = kzalloc(sizeof(*h), GFP_KERNEL);
+	if (!h)
+		return -ENOMEM;
+
+	h->type = type;
+	h->len = len + sizeof(*h);
+
+	ckpt_debug("type %d len %d\n", h->type, h->len);
+	ret = ckpt_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		goto out;
+	if (ptr)
+		ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+	kfree(h);
+	return ret;
+}
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, size_t len)
+{
+	return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, size_t len)
+{
+	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+static void fill_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	h->task_comm_len = sizeof(tsk->comm);
+	/* mm->saved_auxv size */
+	h->at_vector_size = AT_VECTOR_SIZE;
+	/* uts */
+	h->uts_release_len = sizeof(uts->release);
+	h->uts_version_len = sizeof(uts->version);
+	h->uts_machine_len = sizeof(uts->machine);
+}
+
+/* write the checkpoint header */
+static int checkpoint_write_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (!h)
+		return -ENOMEM;
+
+	do_gettimeofday(&ktv);
+	uts = utsname();
+
+	h->arch_id = cpu_to_le16(CKPT_ARCH_ID);
+
+	h->magic = CHECKPOINT_MAGIC_HEAD;
+	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	h->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	h->time = ktv.tv_sec;
+
+	fill_kernel_const(&h->constants);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+	if (ret < 0)
+		return ret;
+
+	down_read(&uts_sem);
+	ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+ up:
+	up_read(&uts_sem);
+	if (ret < 0)
+		return ret;
+
+	return checkpoint_write_header_arch(ctx);
+}
+
+/* write the checkpoint trailer */
+static int checkpoint_write_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (!h)
+		return -ENOMEM;
+
+	h->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	h->state = t->state;
+	h->exit_state = t->exit_state;
+	h->exit_code = t->exit_code;
+	h->exit_signal = t->exit_signal;
+
+	h->set_child_tid = (unsigned long) t->set_child_tid;
+	h->clear_child_tid = (unsigned long) t->clear_child_tid;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+	if (ret < 0)
+		return ret;
+
+	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump a given task's shared resources (files, mm) */
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int mm_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0)
+		return files_objref;
+
+	mm_objref = checkpoint_obj_mm(ctx, t);
+	ckpt_debug("mm: objref %d\n", mm_objref);
+	if (mm_objref < 0)
+		return mm_objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	h->mm_objref = mm_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+
+	return ret;
+}
+
+static bool task_is_descendant(struct task_struct *tsk)
+{
+	while (tsk != &init_task) {
+		if (tsk == current)
+			return true;
+		tsk = tsk->real_parent;
+	}
+	return false;
+}
+
+static bool task_checkpointable(struct task_struct *tsk)
+{
+	if (is_container_init(tsk)) {
+		pr_err("checkpoint of nested namespaces not supported\n");
+		return false;
+	}
+
+	if (!task_is_descendant(tsk)) {
+		pr_err("checkpoint of unrelated tasks not supported\n");
+		return false;
+	}
+
+	if (get_nr_threads(tsk) > 1) {
+		pr_err("checkpoint of multithreaded tasks not yet supported\n");
+		return false;
+	}
+
+	return true;
+}
+
+/* dump the entire state of a given task */
+static int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	if (!task_checkpointable(t))
+		return -ENOSYS;
+
+	ctx->tsk = t;
+
+	ret = checkpoint_task_struct(ctx, t);
+	ckpt_debug("task %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_thread(ctx, t);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_cpu(ctx, t);
+	ckpt_debug("cpu %d\n", ret);
+ out:
+	ctx->tsk = NULL;
+	return ret;
+}
+
+/**
+ * freeze_pidns() - freeze all other tasks in current pid namespace
+ *
+ * Attempts to freeze all other tasks in the caller's pid namespace.
+ * Only the init process of the pid namespace is allowed to call this.
+ * Will busy-loop trying to freeze tasks unless interrupted by a
+ * signal.
+ *
+ * Returns 0 on success, -EINTR if interrupted.  In all cases, the
+ * caller must call thaw_pidns() to ensure that the current pid
+ * namespace is completely unfrozen.
+ */
+static int freeze_pidns(void)
+{
+	struct task_struct *t, *p;
+	bool try_again;
+	int rc = 0;
+
+	BUG_ON(!is_container_init(current));
+	ckpt_debug("\n");
+again:
+	cond_resched();
+	if (signal_pending(current))
+		return -EINTR;
+	try_again = false;
+
+	read_lock(&tasklist_lock);
+
+	do_each_thread(t, p) {
+		if (p == current)
+			continue;
+
+		if (!task_is_descendant(p))
+			continue;
+
+		freeze_task(p, true);
+		try_again |= !frozen(p);
+	} while_each_thread(t, p);
+
+	read_unlock(&tasklist_lock);
+
+	if (try_again)
+		goto again;
+
+	return rc;
+}
+
+/**
+ * thaw_pidns() - unfreeze all other tasks in the current pid namespace
+ *
+ * Unfreeze all other processes in caller's pid namespace.  Only the
+ * init process of the pid namespace is allowed to call this.
+ */
+static void thaw_pidns(void)
+{
+	struct task_struct *t, *p;
+
+	BUG_ON(!is_container_init(current));
+
+	read_lock(&tasklist_lock);
+
+	do_each_thread(t, p) {
+		if (p == current)
+			continue;
+
+		if (!task_is_descendant(p))
+			continue;
+
+		if (!frozen(p))
+			continue;
+
+		thaw_process(p);
+
+	} while_each_thread(t, p);
+
+	read_unlock(&tasklist_lock);
+}
+
+/**
+ * do_checkpoint() - checkpoint the caller's pid namespace
+ * @ctx: checkpoint context
+ *
+ * Freeze, checkpoint, and thaw the current pid namespace.  The
+ * checkpoint image is written to @ctx->file.  Only the init process
+ * of the pid namespace is allowed to call this.
+ */
+int do_checkpoint(struct ckpt_ctx *ctx)
+{
+	struct task_struct *target = NULL;
+	struct task_struct *child;
+	unsigned int nr;
+	int err;
+
+	if (!is_container_init(current))
+		return -EPERM;
+
+	err = freeze_pidns();
+	if (err)
+		goto thaw;
+
+	err = checkpoint_write_header(ctx);
+	if (err)
+		goto thaw;
+
+	nr = 0;
+	read_lock(&tasklist_lock);
+	list_for_each_entry(child, &current->children, sibling) {
+		nr++;
+		if (target) /* more than one process; abort */
+			break;
+		target = child;
+		get_task_struct(target);
+	}
+	read_unlock(&tasklist_lock);
+
+	if (nr == 0) {
+		err = -ESRCH;
+		goto thaw;
+	}
+
+	if (nr > 1) {
+		pr_err("checkpoint of >1 process not yet implemented\n");
+		err = -EBUSY;
+		goto thaw;
+	}
+
+	err = checkpoint_task(ctx, target);
+	if (err)
+		goto thaw;
+
+	err = checkpoint_write_tail(ctx);
+thaw:
+	/* Thaw regardless of status; some tasks could be frozen even
+	 * if freeze_pidns returns an error.
+	 */
+	thaw_pidns();
+
+	if (target)
+		put_task_struct(target);
+
+	return err;
+}
diff --git a/kernel/checkpoint/objhash.c b/kernel/checkpoint/objhash.c
new file mode 100644
index 0000000..45d4e67
--- /dev/null
+++ b/kernel/checkpoint/objhash.c
@@ -0,0 +1,368 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define DEBUG
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct ckpt_obj {
+	int objref;
+	int flags;
+	void *ptr;
+	const struct ckpt_obj_ops *ops;
+	struct hlist_node hash;
+};
+
+/* object internal flags */
+#define CKPT_OBJ_CHECKPOINTED		0x1   /* object already checkpointed */
+
+struct ckpt_obj_hash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+/* ignored object */
+static const struct ckpt_obj_ops ckpt_obj_ignored_ops = {
+	.obj_name = "IGNORED",
+	.obj_type = CKPT_OBJ_IGNORE,
+	.ref_drop = NULL,
+	.ref_grab = NULL,
+};
+
+static const struct ckpt_obj_ops *ckpt_obj_ops[CKPT_OBJ_MAX] = {
+	[CKPT_OBJ_IGNORE] = &ckpt_obj_ignored_ops,
+};
+
+int register_checkpoint_obj(const struct ckpt_obj_ops *ops)
+{
+	if (ops->obj_type < 0 || ops->obj_type >= CKPT_OBJ_MAX)
+		return -EINVAL;
+	if (ckpt_obj_ops[ops->obj_type] != NULL)
+		return -EINVAL;
+	ckpt_obj_ops[ops->obj_type] = ops;
+	return 0;
+}
+
+#define CKPT_OBJ_HASH_NBITS  10
+#define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
+
+static void obj_hash_clear(struct ckpt_obj_hash *obj_hash)
+{
+	struct hlist_head *h = obj_hash->head;
+	struct hlist_node *n, *t;
+	struct ckpt_obj *obj;
+	int i;
+
+	for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			if (obj->ops->ref_drop)
+				obj->ops->ref_drop(obj->ptr, 1);
+			kfree(obj);
+		}
+	}
+}
+
+void ckpt_obj_hash_free(struct ckpt_obj_hash *obj_hash)
+{
+	obj_hash_clear(obj_hash);
+	kfree(obj_hash->head);
+	kfree(obj_hash);
+}
+
+struct ckpt_obj_hash *ckpt_obj_hash_alloc(void)
+{
+	size_t size = CKPT_OBJ_HASH_TOTAL * sizeof(struct hlist_head);
+	struct ckpt_obj_hash *obj_hash;
+
+	obj_hash = kzalloc(sizeof(*obj_hash), GFP_KERNEL);
+	if (!obj_hash)
+		return NULL;
+
+	obj_hash->head = kzalloc(size, GFP_KERNEL);
+	if (!obj_hash->head) {
+		kfree(obj_hash);
+		obj_hash = NULL;
+	} else {
+		obj_hash->next_free_objref = 1;
+	}
+
+	return obj_hash;
+}
+
+static struct ckpt_obj *obj_find_by_ptr(const struct ckpt_obj_hash *obj_hash, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &obj_hash->head[hash_ptr(ptr, CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct ckpt_obj *obj_find_by_objref(const struct ckpt_obj_hash *obj_hash, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &obj_hash->head[hash_long((unsigned long)objref,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+static int obj_alloc_objref(struct ckpt_obj_hash *obj_hash)
+{
+	return obj_hash->next_free_objref++;
+}
+
+/**
+ * obj_new - add an object to the obj_hash
+ * @ptr: pointer to object
+ * @objref: object unique id
+ * @type: object type
+ *
+ * Add the object to the obj_hash. If @objref is zero, assign a unique
+ * object id and use @ptr as a hash key [checkpoint]. Else use @objref
+ * as a key [restart].
+ */
+static struct ckpt_obj *obj_new(struct ckpt_obj_hash *obj_hash, void *ptr,
+				int objref, enum obj_type type)
+{
+	const struct ckpt_obj_ops *ops = ckpt_obj_ops[type];
+	struct ckpt_obj *obj;
+	int i, ret;
+
+	if (WARN_ON_ONCE(!ptr))
+		return ERR_PTR(-EINVAL);
+
+	/* make sure we don't change this accidentally */
+	if (WARN_ON_ONCE(ops->obj_type != type))
+		return ERR_PTR(-EINVAL);
+
+	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	obj->ptr = ptr;
+	obj->ops = ops;
+
+	if (!objref) {
+		/* use @obj->ptr to index, assign objref (checkpoint) */
+		obj->objref = obj_alloc_objref(obj_hash);
+		i = hash_ptr(ptr, CKPT_OBJ_HASH_NBITS);
+	} else {
+		/* use @obj->objref to index (restart) */
+		obj->objref = objref;
+		i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS);
+	}
+
+	ret = ops->ref_grab ? ops->ref_grab(obj->ptr) : 0;
+	if (ret < 0) {
+		kfree(obj);
+		obj = ERR_PTR(ret);
+	} else {
+		hlist_add_head(&obj->hash, &obj_hash->head[i]);
+	}
+
+	return obj;
+}
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * obj_lookup_add - lookup object and add if not in objhash
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, add the object, and allocate a unique object
+ * id. Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ */
+static struct ckpt_obj *obj_lookup_add(struct ckpt_obj_hash *obj_hash, void *ptr,
+				       enum obj_type type, int *first)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(obj_hash, ptr);
+	if (!obj) {
+		obj = obj_new(obj_hash, ptr, 0, type);
+		*first = 1;
+	} else {
+		BUG_ON(obj->ops->obj_type != type);
+		*first = 0;
+	}
+	return obj;
+}
+
+/**
+ * checkpoint_obj - if not already in hash, add object and checkpoint
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Use obj_lookup_add() to lookup (and possibly add) the object to the
+ * hash table. If the CKPT_OBJ_CHECKPOINTED flag isn't set, then also
+ * save the object's state using its ops->checkpoint().
+ *
+ * [This is used during checkpoint].
+ * Returns: objref
+ */
+int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_hdr_objref *h;
+	struct ckpt_obj *obj;
+	int new, ret = 0;
+
+	obj = obj_lookup_add(ctx->obj_hash, ptr, type, &new);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
+		if (!h)
+			return -ENOMEM;
+
+		h->objtype = type;
+		h->objref = obj->objref;
+		ret = ckpt_write_obj(ctx, &h->h);
+		kfree(h);
+
+		if (ret < 0)
+			return ret;
+
+		/* invoke callback to actually dump the state */
+		if (obj->ops->checkpoint)
+			ret = obj->ops->checkpoint(ctx, ptr);
+
+		obj->flags |= CKPT_OBJ_CHECKPOINTED;
+	}
+	return (ret < 0 ? ret : obj->objref);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_obj - read in and restore a (first seen) shared object
+ * @ctx: checkpoint context
+ * @h: ckpt_hdr of shared object
+ *
+ * Read in the header payload (struct ckpt_hdr_objref). Look up the
+ * object to verify it isn't already there.  Then restore the object's
+ * state and add it to the objhash. No need to explicitly grab a
+ * reference - we hold the initial instance of this object. (Object is
+ * maintained until the entire hash is freed).
+ *
+ * [This is used during restart].
+ */
+int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h)
+{
+	const struct ckpt_obj_ops *ops;
+	struct ckpt_obj *obj;
+	void *ptr = ERR_PTR(-EINVAL);
+
+	ckpt_debug("len %d ref %d type %d\n", h->h.len, h->objref, h->objtype);
+	if (h->objtype >= CKPT_OBJ_MAX)
+		return -EINVAL;
+	if (h->objref <= 0)
+		return -EINVAL;
+
+	ops = ckpt_obj_ops[h->objtype];
+	if (!ops)
+		return -ENOSYS;
+
+	BUG_ON(ops->obj_type != h->objtype);
+
+	if (ops->restore)
+		ptr = ops->restore(ctx);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	obj = obj_find_by_objref(ctx->obj_hash, h->objref);
+	if (!obj) {
+		obj = obj_new(ctx->obj_hash, ptr, h->objref, h->objtype);
+		/*
+		 * Drop an extra reference to the object returned by
+		 * ops->restore to balance the one taken by obj_new()
+		 */
+		if (!IS_ERR(obj) && ops->ref_drop)
+			ops->ref_drop(ptr, 0);
+	} else if ((obj->ptr != ptr) || (obj->ops->obj_type != h->objtype)) {
+		/* Normally, we expect an object to not already exist
+		 * in the hash.  However, for some special scenarios
+		 * where we're restoring sets of objects that must be
+		 * co-allocated (such as veth netdev pairs) we need
+		 * to tolerate this case if the second restore returns
+		 * the correct type and pointer, as specified in the
+		 * existing object.  If either of those doesn't match,
+		 * we fail.
+		 */
+		obj = ERR_PTR(-EINVAL);
+	}
+
+	if (IS_ERR(obj)) {
+		/* This releases our final reference on the object
+		 * returned by ops->restore()
+		 */
+		if (ops->ref_drop)
+			ops->ref_drop(ptr, 1);
+		return PTR_ERR(obj);
+	}
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_try_fetch - fetch an object by its identifier
+ * @ctx: checkpoint context
+ * @objref: object id
+ * @type: object type
+ *
+ * Look up the object identified by @objref in the hash table. Return
+ * an error if it is not found.
+ *
+ * [This is used during restart].
+ */
+void *ckpt_obj_try_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_objref(ctx->obj_hash, objref);
+	if (!obj)
+		return ERR_PTR(-EINVAL);
+	ckpt_debug("%s ref %d\n", obj->ops->obj_name, obj->objref);
+	if (obj->ops->obj_type == type)
+		return obj->ptr;
+	return ERR_PTR(-ENOMSG);
+}
+
+void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type)
+{
+	void *ret = ckpt_obj_try_fetch(ctx, objref, type);
+
+	if (unlikely(IS_ERR(ret)))
+		ckpt_debug("objref=%d type=%u ret=%ld\n",
+			   objref, type, PTR_ERR(ret));
+	return ret;
+}
diff --git a/kernel/checkpoint/restart.c b/kernel/checkpoint/restart.c
new file mode 100644
index 0000000..51f580f
--- /dev/null
+++ b/kernel/checkpoint/restart.c
@@ -0,0 +1,651 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define DEBUG
+
+#include <linux/checkpoint.h>
+#include <linux/completion.h>
+#include <linux/elf.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/magic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_context.h>
+#include <linux/module.h>
+#include <linux/nsproxy.h>
+#include <linux/pid.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/utsname.h>
+#include <linux/version.h>
+
+#include <asm/checkpoint.h>
+#include <asm/mmu_context.h>
+#include <asm/syscall.h>
+
+/**
+ * _ckpt_read_objref - dispatch handling of a shared object
+ * @ctx: checkpoint context
+ * @hh: object descriptor
+ */
+static int _ckpt_read_objref(struct ckpt_ctx *ctx, struct ckpt_hdr *hh)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = kzalloc(hh->len, GFP_KERNEL);
+	if (!h)
+		return -ENOMEM;
+
+	*h = *hh;	/* yay ! */
+
+	ckpt_debug("shared len %d type %d\n", h->len, h->type);
+	ret = ckpt_kread(ctx, (h + 1), hh->len - sizeof(struct ckpt_hdr));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj(ctx, (struct ckpt_hdr_objref *) h);
+ out:
+	kfree(h);
+	return ret;
+}
+
+/**
+ * ckpt_read_obj_dispatch - dispatch OBJREFs; don't return them
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ */
+static int ckpt_read_obj_dispatch(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	int ret;
+
+	while (1) {
+		ret = ckpt_kread(ctx, h, sizeof(*h));
+		if (ret < 0)
+			return ret;
+		ckpt_debug("type %d len %d\n", h->type, h->len);
+		if (h->len < sizeof(*h))
+			return -EINVAL;
+
+		if (h->type == CKPT_HDR_OBJREF) {
+			ret = _ckpt_read_objref(ctx, h);
+			if (ret < 0)
+				return ret;
+		} else
+			return 0;
+	}
+}
+
+/**
+ * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ * @ptr: desired buffer
+ * @len: desired object length (if 0, flexible)
+ * @max: maximum object length (if 0, flexible)
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			  void *ptr, int len, int max)
+{
+	int ret;
+
+	ret = ckpt_read_obj_dispatch(ctx, h);
+	if (ret < 0)
+		return ret;
+	ckpt_debug("type %d len %d(%d,%d)\n", h->type, h->len, len, max);
+
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && h->len != len) || (!len && max && h->len > max))
+		return -EINVAL;
+
+	if (ptr)
+		ret = ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr));
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj_type - read an object of some type
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ * @type: buffer type
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: actual _payload_ length
+ */
+int _ckpt_read_obj_type(struct ckpt_ctx *ctx, void *ptr, size_t len, int type)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	if (len)
+		len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+	if (ret < 0)
+		return ret;
+	if (h.type != type)
+		return -EINVAL;
+	return h.len - sizeof(h);
+}
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: _payload_ length.
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, size_t len)
+{
+	BUG_ON(!len);
+	return _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length (including '\0')
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, size_t len)
+{
+	int ret;
+
+	BUG_ON(!len);
+	ret = _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_STRING);
+	if (ret < 0)
+		return ret;
+	if (ptr)
+		((char *) ptr)[len - 1] = '\0';	/* always play it safe */
+	return 0;
+}
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ * @len: desired total length (if 0, flexible)
+ * @max: maximum total length
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+	struct ckpt_hdr hh;
+	struct ckpt_hdr *h;
+	int ret;
+
+	ret = ckpt_read_obj_dispatch(ctx, &hh);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	ckpt_debug("type %d len %d(%d,%d)\n", hh.type, hh.len, len, max);
+
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && hh.len != len) || (!len && max && hh.len > max))
+		return ERR_PTR(-EINVAL);
+
+	h = kzalloc(hh.len, GFP_KERNEL);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	*h = hh;	/* yay ! */
+
+	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	if (ret < 0) {
+		kfree(h);
+		h = ERR_PTR(ret);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, size_t len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, len, len);
+	if (IS_ERR(h)) {
+		ckpt_debug("len=%d type=%d ret=%ld\n", len, type, PTR_ERR(h));
+		return h;
+	}
+
+	if (h->type != type) {
+		ckpt_debug("expected type %d but got %d\n", type, h->type);
+		kfree(h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flexible)
+ * @ctx: checkpoint context
+ * @max: maximum payload length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @max;
+ * unlimited if @max is 0), as determined by the ckpt_hdr data.
+ *
+ * NOTE: for symmetry with checkpoint, @max is the maximum _payload_
+ * size, excluding the header.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, size_t max, int type)
+{
+	struct ckpt_hdr *h;
+
+	if (max)
+		max += sizeof(struct ckpt_hdr);
+
+	h = ckpt_read_obj(ctx, 0, max);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		kfree(h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_payload - allocate and read the payload of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to buffer to be allocated (caller must free)
+ * @max: maximum payload length
+ * @type: desired object type
+ *
+ * This can be used to read a variable-length _payload_ from the checkpoint
+ * stream. @max limits the size of the resulting buffer.
+ *
+ * Return: actual _payload_ length
+ */
+int ckpt_read_payload(struct ckpt_ctx *ctx, void **ptr, size_t max, int type)
+{
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, type);
+	if (len < 0)
+		return len;
+	else if (len > max)
+		return -EINVAL;
+
+	*ptr = kmalloc(len, GFP_KERNEL);
+	if (!*ptr)
+		return -ENOMEM;
+
+	ret = ckpt_kread(ctx, *ptr, len);
+	if (ret < 0) {
+		kfree(*ptr);
+		return ret;
+	}
+
+	return len;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+static int check_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	if (h->task_comm_len != sizeof(tsk->comm))
+		return -EINVAL;
+	/* mm->saved_auxv size */
+	if (h->at_vector_size != AT_VECTOR_SIZE)
+		return -EINVAL;
+	/* uts */
+	if (h->uts_release_len != sizeof(uts->release))
+		return -EINVAL;
+	if (h->uts_version_len != sizeof(uts->version))
+		return -EINVAL;
+	if (h->uts_machine_len != sizeof(uts->machine))
+		return -EINVAL;
+
+	return 0;
+}
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (le16_to_cpu(h->arch_id) != CKPT_ARCH_ID) {
+		ckpt_debug("incompatible architecture id");
+		goto out;
+	}
+	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
+	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    h->patch != ((LINUX_VERSION_CODE) & 0xff)) {
+		ckpt_debug("incompatible kernel version");
+		goto out;
+	}
+	if (h->uflags) {
+		ckpt_debug("incompatible restart user flags");
+		goto out;
+	}
+
+	ret = check_kernel_const(&h->constants);
+	if (ret < 0) {
+		ckpt_debug("incompatible kernel constants");
+		goto out;
+	}
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_read_header_arch(ctx);
+ out:
+	kfree(uts);
+	kfree(h);
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int restore_read_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->magic != CHECKPOINT_MAGIC_TAIL)
+		ret = -EINVAL;
+
+	kfree(h);
+	return ret;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx)
+{
+	ctx->root_task = current;
+	return 0;
+}
+
+/* read the task_struct into the current task */
+static int restore_task_struct(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memset(t->comm, 0, TASK_COMM_LEN);
+	ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
+	if (ret < 0)
+		goto out;
+
+	t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid;
+	t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid;
+	/* return 1 for zombie, 0 otherwise */
+	ret = (h->state == TASK_DEAD ? 1 : 0);
+ out:
+	kfree(h);
+	return ret;
+}
+
+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_objs *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = restore_obj_file_table(ctx, h->files_objref);
+	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj_mm(ctx, h->mm_objref);
+	ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+ out:
+	kfree(h);
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int restore_task(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	ret = restore_task_struct(ctx);
+	ckpt_debug("task %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_thread(ctx);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_task_objs(ctx);
+	ckpt_debug("objs %d\n", ret);
+ out:
+	return ret;
+}
+
+struct task_restart_info {
+	struct ckpt_ctx *ctx;
+	struct completion completion;
+	int status;
+};
+
+static void task_restart_info_init(struct task_restart_info *info, struct ckpt_ctx *ctx)
+{
+	info->ctx = ctx;
+	init_completion(&info->completion);
+	info->status = 0;
+}
+
+static int restore_task_fn(void *work)
+{
+	struct task_restart_info *info;
+	struct mm_struct *prev_mm;
+	struct mm_struct *new_mm;
+	struct ckpt_ctx *ctx;
+
+	info = work;
+	ctx = info->ctx;
+
+	/* FIXME: Move this stuff into a helper in kernel/fork.c so we
+	 * can correctly handle errors (free_mm, mm_free_pgd).
+	 */
+	BUG_ON(!(current->flags & PF_KTHREAD));
+	BUG_ON(current->mm);
+
+	info->status = sys_unshare(CLONE_FILES | CLONE_FS);
+	if (info->status)
+		return info->status;
+
+	current->flags &= ~(PF_KTHREAD | PF_NOFREEZE | PF_FREEZER_NOSIG);
+
+	info->status = -ENOMEM;
+	new_mm = mm_alloc();
+	if (!new_mm)
+		return info->status;
+
+	prev_mm = current->active_mm;
+	current->mm = new_mm;
+	current->active_mm = new_mm;
+
+	/* activate_mm/switch_mm need to execute atomically */
+	preempt_disable();
+	activate_mm(prev_mm, new_mm);
+	preempt_enable();
+
+	arch_pick_mmap_layout(new_mm);
+
+	if (init_new_context(current, new_mm))
+		goto err_out;
+
+	info->status = restore_task(ctx);
+	if (info->status < 0)
+		pr_err("restore task failed (%i)\n", info->status);
+
+	spin_lock_irq(&current->sighand->siglock);
+	flush_signal_handlers(current, 1);
+	spin_unlock_irq(&current->sighand->siglock);
+
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	info->status = 0;
+	complete(&info->completion);
+
+	/* vfork_done points to stack data which will no longer be valid;
+	 * see kthread.c:kthread().
+	 */
+	current->vfork_done = NULL;
+
+	schedule();
+	WARN_ON(true);
+	return info->status;
+err_out:
+	WARN_ONCE(true, "Leaking mm, sorry");
+	return info->status;
+}
+
+static int restore_task_tree(struct ckpt_ctx *ctx)
+{
+	struct task_restart_info *info;
+	struct task_struct *tsk;
+	struct pid *pid;
+	int err;
+
+	err = -ENOMEM;
+	info = kmalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		goto err_out;
+
+	task_restart_info_init(info, ctx);
+
+	tsk = kthread_run(restore_task_fn, info, "krestart");
+	if (IS_ERR(tsk)) {
+		err = PTR_ERR(tsk);
+		goto err_out;
+	}
+
+	wait_for_completion(&info->completion);
+	wait_task_inactive(tsk, 0);
+	err = info->status;
+	if (err != 0) {
+		kthread_stop(tsk);
+		goto err_out;
+	}
+	err = restore_cpu(ctx, tsk);
+	ckpt_debug("cpu %d\n", err);
+	if (WARN_ON_ONCE(err < 0)) {
+		/* FIXME: kicking the task at this point is not a good
+		 * idea as its register state may have been changed.
+		 */
+		/* kthread_stop(); */
+		goto err_out;
+	}
+	write_lock_irq(&tasklist_lock);
+	tsk->parent = tsk->real_parent = ctx->root_task; /* this is current */
+	list_move_tail(&tsk->sibling, &tsk->parent->children);
+	write_unlock_irq(&tasklist_lock);
+#ifdef CONFIG_PREEMPT
+	task_thread_info(tsk)->preempt_count--;
+#endif
+	get_nsproxy(current->nsproxy);
+	switch_task_namespaces(tsk, current->nsproxy);
+	pid = alloc_pid(tsk->nsproxy->pid_ns);
+	if (WARN_ON_ONCE(!pid)) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+	ckpt_debug("new pid: level=%u, nr=%d, vnr=%d\n", pid->level,
+		   pid_nr(pid), pid_vnr(pid));
+	tsk->pid = pid_nr(pid);
+	tsk->tgid = tsk->pid;
+	detach_pid(tsk, PIDTYPE_PID);
+	attach_pid(tsk, PIDTYPE_PID, pid);
+	wake_up_process(tsk);
+err_out:
+	kfree(info);
+	return err;
+}
+
+/**
+ * do_restart() - restore the caller's pid namespace
+ * @ctx: checkpoint context
+ *
+ * The checkpoint image is read from @ctx->file.  Only the init
+ * process of the pid namespace is allowed to call this, and only when
+ * the caller is the sole task in the pid namespace.
+ */
+int do_restart(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	ret = init_restart_ctx(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = restore_task_tree(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = restore_read_tail(ctx);
+
+	return ret;
+}
diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
new file mode 100644
index 0000000..11ed6fd
--- /dev/null
+++ b/kernel/checkpoint/sys.c
@@ -0,0 +1,208 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/nsproxy.h>
+#include <linux/kernel.h>
+#include <linux/cgroup.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   _ckpt_kwrite() - write a kernel-space buffer to a file
+ *   _ckpt_kread() - read from a file to a kernel-space buffer
+ *
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *
+ * The latter two succeed only if the entire read or write succeeds,
+ * and return 0, or a negative error otherwise.
+ */
+
+static ssize_t _ckpt_kwrite(struct file *file, void *addr, size_t count)
+{
+	mm_segment_t old_fs;
+	ssize_t ret;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = vfs_write(file, (void __user *)addr, count, &file->f_pos);
+	set_fs(old_fs);
+
+	/* Catch unhandled short writes */
+	if (WARN_ON_ONCE(ret >= 0 && ret < count))
+		ret = -EIO;
+
+	return ret;
+}
+
+/* returns 0 on success */
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, size_t count)
+{
+	int ret;
+
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+static ssize_t _ckpt_kread(struct file *file, void *addr, size_t count)
+{
+	mm_segment_t old_fs;
+	ssize_t ret;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = vfs_read(file, (void __user *)addr, count, &file->f_pos);
+	set_fs(old_fs);
+
+	return ret;
+}
+
+/* returns 0 on success */
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, size_t count)
+{
+	int ret;
+
+	ret = _ckpt_kread(ctx->file, addr, count);
+	if (ret < 0)
+		return ret;
+	if (ret != count)
+		return -EPIPE;
+
+	return 0;
+}
+
+/**
+ * ckpt_hdr_get_type - allocate a header of a given length and type
+ * @ctx: checkpoint context
+ * @len: number of bytes to allocate (header included)
+ * @type: header type
+ *
+ * Returns a pointer to the newly allocated, zeroed header, or NULL on failure
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = kzalloc(len, GFP_KERNEL);
+	if (!h)
+		return NULL;
+
+	h->type = type;
+	h->len = len;
+	return h;
+}
+
+/*
+ * Helpers to manage c/r contexts: a context is allocated for each checkpoint
+ * and/or restart operation, and persists until the operation is completed.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+
+	if (ctx->obj_hash)
+		ckpt_obj_hash_free(ctx->obj_hash);
+
+	path_put(&ctx->root_fs_path);
+
+	kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->obj_hash = ckpt_obj_hash_alloc();
+	if (!ctx->obj_hash)
+		goto err;
+
+	return ctx;
+ err:
+	ckpt_ctx_free(ctx);
+	return ERR_PTR(err);
+}
+
+/**
+ * sys_checkpoint - checkpoint the caller's pidns and associated resources
+ * @fd: destination for the checkpoint image; need not be seekable
+ * @flags: checkpoint operation flags (no flags defined yet)
+ *
+ * Returns 0 on success, negated errno value otherwise.
+ */
+SYSCALL_DEFINE2(checkpoint, int, fd, unsigned int, flags)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	if (flags)
+		return -EINVAL;
+
+	ctx = ckpt_ctx_alloc(fd);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	err = do_checkpoint(ctx);
+
+	ckpt_ctx_free(ctx);
+
+	return err;
+}
+
+/**
+ * sys_restart - restore a pidns from a checkpoint image
+ * @fd: source for checkpoint image; need not be seekable
+ * @flags: restart operation flags (no flags defined yet)
+ *
+ * Returns 0 on success, negated errno value otherwise.
+ */
+SYSCALL_DEFINE2(restart, int, fd, unsigned int, flags)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	if (flags)
+		return -EINVAL;
+
+	ctx = ckpt_ctx_alloc(fd);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	err = do_restart(ctx);
+
+	ckpt_ctx_free(ctx);
+
+	return err;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c782fe9..b73a106 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -186,3 +186,7 @@ cond_syscall(sys_perf_event_open);
 /* fanotify! */
 cond_syscall(sys_fanotify_init);
 cond_syscall(sys_fanotify_mark);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 06/10] Checkpoint/restart mm support
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (4 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 05/10] Core checkpoint/restart support code ntl
@ 2011-02-28 23:40 ` ntl
  2011-02-28 23:40 ` [PATCH 07/10] Checkpoint/restart vfs support ntl
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Add a checkpoint() method to vm_operations_struct; this is responsible
for dumping the attributes and contents of a VMA.  For each vma there
is a 'struct ckpt_hdr_vma', followed by the actual contents, one page
at a time.

Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific logic
to dump the contents of the pages.

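As a rough illustration (not part of this patch), a mapping type that
only needs the generic metadata dump could hook up the new method as in
the sketch below; example_vma_checkpoint/example_vm_ops are made-up
names, and the use of CKPT_VMA_IGNORE with objref 0 is purely for the
example:

	static int example_vma_checkpoint(struct ckpt_ctx *ctx,
					  struct vm_area_struct *vma)
	{
		/* metadata only; no page contents follow for this vma */
		return generic_vma_checkpoint(ctx, vma, CKPT_VMA_IGNORE, 0);
	}

	static const struct vm_operations_struct example_vm_ops = {
		.fault      = filemap_fault,
	#ifdef CONFIG_CHECKPOINT
		.checkpoint = example_vma_checkpoint,
	#endif
	};
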
Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
Call do_mmap_pgoff() for each vma and then read in the data.

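For reference, the image stream produced for one private vma looks
roughly like this (inferred from generic_vma_checkpoint() and
checkpoint_memory_contents() below; field lists abbreviated):

	struct ckpt_hdr_vma       vma type, objref, start/end, prot, flags, pgoff
	struct ckpt_hdr_page      { vaddr }, one per page that can be dumped
	  <PAGE_SIZE bytes of page contents>
	  ...
	struct ckpt_hdr_page      { vaddr = CKPT_VMA_LAST_PAGE }, terminator
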
Based on original code by Oren Laadan.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
[ntl: remove page array chain code; dump/restore VMAs one page at a time]
[ntl: move special_mapping_checkpoint/restore() to mm/checkpoint.c]
[ntl: move filemap_checkpoint/restore() to mm/checkpoint.c]
[ntl: do without custom __get_dirty_page API; use get_user_pages]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 include/linux/mm.h |   12 +
 mm/Makefile        |    1 +
 mm/checkpoint.c    |  906 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c       |    4 +
 mm/mmap.c          |    3 +
 5 files changed, 926 insertions(+), 0 deletions(-)
 create mode 100644 mm/checkpoint.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5397237..14ff613 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -20,6 +20,7 @@ struct anon_vma;
 struct file_ra_state;
 struct user_struct;
 struct writeback_control;
+struct ckpt_ctx;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -229,6 +230,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
@@ -1333,10 +1337,18 @@ extern void truncate_inode_pages_range(struct address_space *,
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
+/* generic vm_area_ops exported for mapped files checkpoint */
+extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
+
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
 
+
+/* checkpoint/restart */
+extern int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma);
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/mm/Makefile b/mm/Makefile
index f73f75a..657a9e0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
diff --git a/mm/checkpoint.c b/mm/checkpoint.c
new file mode 100644
index 0000000..f26c23d
--- /dev/null
+++ b/mm/checkpoint.c
@@ -0,0 +1,906 @@
+/*
+ *  Checkpoint/restart memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/aio.h>
+#include <linux/highmem.h>
+#include <linux/elf.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/proc_fs.h>
+#include <linux/checkpoint.h>
+
+#include "internal.h" /* __get_user_pages */
+
+/**************************************************************************
+ * Checkpoint
+ *
+ * Checkpoint is outside the context of the checkpointee, so one
+ * cannot simply read pages from user-space. Instead, we scan the
+ * address space of the target to cherry-pick pages of interest.  To
+ * save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+static int dump_page_header(struct ckpt_ctx *ctx, unsigned long addr)
+{
+	struct ckpt_hdr_page *hdr;
+	int err;
+
+	hdr = ckpt_hdr_get_type(ctx, sizeof(*hdr), CKPT_HDR_PAGE);
+	if (!hdr)
+		return -ENOMEM;
+
+	hdr->vaddr = addr;
+	err = ckpt_write_obj(ctx, &hdr->hdr);
+	kfree(hdr);
+
+	return err;
+}
+
+static int dump_page_contents(struct ckpt_ctx *ctx, struct page *page)
+{
+	void *ptr;
+	int err;
+
+	ptr = kmap(page);
+	err = ckpt_kwrite(ctx, ptr, PAGE_SIZE);
+	kunmap(page);
+
+	return err;
+}
+
+static int dump_vma_page_terminator(struct ckpt_ctx *ctx)
+{
+	return dump_page_header(ctx, CKPT_VMA_LAST_PAGE);
+}
+
+/**
+ * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * @ctx: checkpoint context
+ * @vma: vma to scan
+ */
+static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	unsigned long addr;
+	int err = 0;
+
+	/* We don't hold mmap_sem - the mm's tasks are frozen. */
+
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+		struct page *page;
+		int nr_pages;
+		int flags;
+
+		if (fatal_signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		cond_resched();
+
+		nr_pages = 1;
+		flags = FOLL_FORCE | FOLL_DUMP | FOLL_GET;
+		nr_pages = __get_user_pages(ctx->tsk, vma->vm_mm, addr,
+					    nr_pages, flags, &page, NULL);
+		if (nr_pages == -EFAULT)
+			continue;
+
+		if (nr_pages != 1) {
+			WARN_ON_ONCE(nr_pages == 0);
+			err = nr_pages ? nr_pages : -EFAULT;
+			break;
+		}
+
+		err = dump_page_header(ctx, addr);
+		if (!err)
+			err = dump_page_contents(ctx, page);
+
+		page_cache_release(page);
+
+		if (err)
+			break;
+	}
+
+	if (!err)
+		err = dump_vma_page_terminator(ctx);
+
+	return err;
+}
+
+/**
+ * generic_vma_checkpoint - dump metadata of vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	struct ckpt_hdr_vma *h;
+	int ret;
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d\n",
+		 vma->vm_start, vma->vm_end, vma->vm_flags, type);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (!h)
+		return -ENOMEM;
+
+	h->vma_type = type;
+	h->vma_objref = vma_objref;
+	h->vm_start = vma->vm_start;
+	h->vm_end = vma->vm_end;
+	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	h->vm_flags = vma->vm_flags;
+	h->vm_pgoff = vma->vm_pgoff;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+
+	return ret;
+}
+
+/**
+ * private_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+static int private_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type, int vma_objref)
+{
+	int ret;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma);
+ out:
+	return ret;
+}
+
+int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	int vma_objref;
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!file);
+
+	vma_objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	if (vma_objref < 0)
+		return vma_objref;
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+}
+
+/*
+ * FIX:
+ *   - checkpoint vdso pages (once per distinct vdso is enough)
+ *   - check for compatibility between saved and current vdso
+ *   - accommodate dynamic kernel data in the vdso page
+ *
+ * Currently we require COMPAT_VDSO, which somewhat mitigates the issue
+ */
+int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	const char *name;
+
+	/*
+	 * FIX:
+	 * Currently we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic - we just skip the contents and
+	 * hope for the best in terms of compatibility upon restart.
+	 */
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	name = arch_vma_name(vma);
+	if (!name || strcmp(name, "[vdso]"))
+		return -ENOSYS;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+}
+
+/**
+ * anonymous_checkpoint - dump contents of private-anonymous vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ */
+static int anonymous_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma)
+{
+	/* should be private anonymous ... verify that this is the case */
+	BUG_ON(vma->vm_flags & VM_MAYSHARE);
+	BUG_ON(vma->vm_file);
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
+}
+
+static int checkpoint_vmas(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct vm_area_struct *vma, *next;
+	int map_count = 0;
+	int ret = 0;
+
+	vma = kzalloc(sizeof(*vma), GFP_KERNEL);
+	if (!vma)
+		return -ENOMEM;
+
+	/*
+	 * Must not hold mm->mmap_sem when writing to image file, so
+	 * can't simply traverse the vma list. Instead, use find_vma()
+	 * to get the @next and make a local "copy" of it.
+	 */
+	while (1) {
+		down_read(&mm->mmap_sem);
+		next = find_vma(mm, vma->vm_end);
+		if (!next) {
+			up_read(&mm->mmap_sem);
+			break;
+		}
+		if (vma->vm_file)
+			fput(vma->vm_file);
+		*vma = *next;
+		if (vma->vm_file)
+			get_file(vma->vm_file);
+		up_read(&mm->mmap_sem);
+
+		map_count++;
+
+		ckpt_debug("vma %#lx-%#lx flags %#lx\n",
+			 vma->vm_start, vma->vm_end, vma->vm_flags);
+
+		if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+			ckpt_debug("vma: bad flags (%#lx)\n", vma->vm_flags);
+			ret = -ENOSYS;
+			break;
+		}
+
+		if (!vma->vm_ops)
+			ret = anonymous_checkpoint(ctx, vma);
+		else if (vma->vm_ops->checkpoint)
+			ret = (*vma->vm_ops->checkpoint)(ctx, vma);
+		else
+			ret = -ENOSYS;
+		if (ret < 0) {
+			ckpt_debug("vma checkpoint failed\n");
+			break;
+		}
+	}
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+
+	kfree(vma);
+
+	return ret < 0 ? ret : map_count;
+}
+
+#define CKPT_AT_SZ (AT_VECTOR_SIZE * sizeof(u64))
+/*
+ * We always write saved_auxv out as an array of u64s, though it is
+ * an array of u32s on 32-bit arch.
+ */
+static int ckpt_write_auxv(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	int i, ret;
+	u64 *buf = kzalloc(CKPT_AT_SZ, GFP_KERNEL);
+
+	if (!buf)
+		return -ENOMEM;
+	for (i = 0; i < AT_VECTOR_SIZE; i++)
+		buf[i] = mm->saved_auxv[i];
+	ret = ckpt_write_buffer(ctx, buf, CKPT_AT_SZ);
+	kfree(buf);
+	return ret;
+}
+
+static int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct mm_struct *mm = ptr;
+	struct ckpt_hdr_mm *h;
+	struct file *exe_file = NULL;
+	int ret;
+
+	if (mm_has_pending_aio(mm)) {
+		ckpt_debug("Outstanding aio\n");
+		return -EBUSY;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+
+	h->flags = mm->flags;
+	h->def_flags = mm->def_flags;
+
+	h->start_code = mm->start_code;
+	h->end_code = mm->end_code;
+	h->start_data = mm->start_data;
+	h->end_data = mm->end_data;
+	h->start_brk = mm->start_brk;
+	h->brk = mm->brk;
+	h->start_stack = mm->start_stack;
+	h->arg_start = mm->arg_start;
+	h->arg_end = mm->arg_end;
+	h->env_start = mm->env_start;
+	h->env_end = mm->env_end;
+
+	h->map_count = mm->map_count;
+
+	if (mm->exe_file) {  /* checkpoint the ->exe_file */
+		exe_file = mm->exe_file;
+		get_file(exe_file);
+	}
+
+	/*
+	 * Drop mm->mmap_sem before writing data to checkpoint image
+	 * to avoid reverse locking order (inode must come before mm).
+	 */
+	up_read(&mm->mmap_sem);
+
+	if (exe_file) {
+		h->exe_objref = checkpoint_obj(ctx, exe_file, CKPT_OBJ_FILE);
+		if (h->exe_objref < 0) {
+			ret = h->exe_objref;
+			goto out;
+		}
+	}
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_auxv(ctx, mm);
+	if (ret < 0)
+		goto out;
+
+	ret = checkpoint_vmas(ctx, mm);
+	if (ret != h->map_count && ret >= 0)
+		ret = -EBUSY; /* checkpoint mm leak */
+	if (ret < 0)
+		goto out;
+
+	ret = checkpoint_mm_context(ctx, mm);
+ out:
+	if (exe_file)
+		fput(exe_file);
+	kfree(h);
+	return ret;
+}
+
+int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int objref;
+
+	mm = get_task_mm(t);
+	objref = checkpoint_obj(ctx, mm, CKPT_OBJ_MM);
+	mmput(mm);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Restart
+ *
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+static int restore_page(struct ckpt_ctx *ctx, unsigned long addr)
+{
+	struct page *page;
+	void *ptr;
+	int err;
+
+	down_read(&current->mm->mmap_sem);
+
+	err = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+	if (err != 1) {
+		if (WARN_ON_ONCE(err >= 0))
+			err = -EFAULT;
+		goto out_unlock;
+	}
+
+	ptr = kmap(page);
+	err = ckpt_kread(ctx, ptr, PAGE_SIZE);
+	kunmap(page);
+
+	page_cache_release(page);
+
+out_unlock:
+	up_read(&current->mm->mmap_sem);
+
+	return err;
+}
+
+/**
+ * restore_memory_contents - restore contents of a VMA with private memory
+ * @ctx: restart context
+ */
+static int restore_memory_contents(struct ckpt_ctx *ctx)
+{
+	int err = 0;
+
+	while (true) {
+		struct ckpt_hdr_page *hdr;
+		unsigned long addr;
+
+		if (fatal_signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		cond_resched();
+
+		hdr = ckpt_read_obj_type(ctx, sizeof(*hdr), CKPT_HDR_PAGE);
+		if (IS_ERR(hdr)) {
+			err = PTR_ERR(hdr);
+			break;
+		}
+
+		addr = hdr->vaddr;
+		kfree(hdr);
+
+		if (addr == CKPT_VMA_LAST_PAGE)
+			break;
+
+		err = restore_page(ctx, addr);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+/**
+ * calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+/**
+ * generic_vma_restore - restore a vma
+ * @mm: address space
+ * @file: file to map (NULL for anonymous)
+ * @h: vma header data
+ */
+static unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h)
+{
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+
+	if (h->vm_end < h->vm_start)
+		return -EINVAL;
+	if (h->vma_objref < 0)
+		return -EINVAL;
+
+	vm_start = h->vm_start;
+	vm_pgoff = h->vm_pgoff;
+	vm_size = h->vm_end - h->vm_start;
+	vm_prot = calc_map_prot_bits(h->vm_flags);
+	vm_flags = calc_map_flags_bits(h->vm_flags);
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	ckpt_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	return addr;
+}
+
+/**
+ * private_vma_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @file: file to use for mapping
+ * @h: vma header data
+ */
+static int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			       struct file *file, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+
+	if (h->vm_flags & (VM_SHARED | VM_MAYSHARE))
+		return -EINVAL;
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	return restore_memory_contents(ctx);
+}
+
+/**
+ * anon_private_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @h: vma header data
+ */
+static int anon_private_restore(struct ckpt_ctx *ctx,
+				     struct mm_struct *mm,
+				     struct ckpt_hdr_vma *h)
+{
+	/*
+	 * vm_pgoff for anonymous mapping is the "global" page
+	 * offset (namely from addr 0x0), so we force a zero
+	 */
+	h->vm_pgoff = 0;
+
+	return private_vma_restore(ctx, mm, NULL, h);
+}
+
+static int filemap_restore(struct ckpt_ctx *ctx,
+		    struct mm_struct *mm,
+		    struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int ret;
+
+	if (h->vma_type == CKPT_VMA_FILE &&
+	    (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+		return -EINVAL;
+
+	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = private_vma_restore(ctx, mm, file, h);
+	return ret;
+}
+
+#ifndef arch_restore_vdso
+#define arch_restore_vdso arch_restore_vdso
+#warning "arch_restore_vdso not implemented"
+static inline int arch_restore_vdso(unsigned long addr)
+{
+	return -ENOSYS;
+}
+#endif
+
+static int special_mapping_restore(struct ckpt_ctx *ctx,
+				   struct mm_struct *mm,
+				   struct ckpt_hdr_vma *h)
+{
+	BUG_ON(h->vma_type != CKPT_VMA_VDSO);
+
+	return arch_restore_vdso(h->vm_start);
+}
+
+/* callbacks to restore vma per its type: */
+struct restore_vma_ops {
+	char *vma_name;
+	enum vma_type vma_type;
+	int (*restore) (struct ckpt_ctx *ctx,
+			struct mm_struct *mm,
+			struct ckpt_hdr_vma *ptr);
+};
+
+static struct restore_vma_ops restore_vma_ops[] = {
+	/* ignored vma */
+	{
+		.vma_name = "IGNORE",
+		.vma_type = CKPT_VMA_IGNORE,
+		.restore = NULL,
+	},
+	/* special mapping (vdso) */
+	{
+		.vma_name = "VDSO",
+		.vma_type = CKPT_VMA_VDSO,
+		.restore = special_mapping_restore,
+	},
+	/* anonymous private */
+	{
+		.vma_name = "ANON PRIVATE",
+		.vma_type = CKPT_VMA_ANON,
+		.restore = anon_private_restore,
+	},
+	/* file-mapped private */
+	{
+		.vma_name = "FILE PRIVATE",
+		.vma_type = CKPT_VMA_FILE,
+		.restore = filemap_restore,
+	},
+};
+
+/**
+ * restore_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ */
+static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_vma *h;
+	struct restore_vma_ops *ops;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+		   (unsigned long) h->vm_start, (unsigned long) h->vm_end,
+		   (unsigned long) h->vm_flags, (int) h->vma_type,
+		   (int) h->vma_objref);
+
+	ret = -EINVAL;
+	if (h->vm_end < h->vm_start)
+		goto out;
+	if (h->vma_objref < 0)
+		goto out;
+	if (h->vma_type >= CKPT_VMA_MAX)
+		goto out;
+	if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	ops = &restore_vma_ops[h->vma_type];
+
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->vma_type != h->vma_type);
+
+	if (ops->restore) {
+		ckpt_debug("vma type %s\n", ops->vma_name);
+		ret = ops->restore(ctx, mm, h);
+	} else {
+		ckpt_debug("vma ignored\n");
+		ret = 0;
+	}
+ out:
+	kfree(h);
+	return ret;
+}
+
+static int ckpt_read_auxv(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	int i, ret;
+	u64 *buf = kmalloc(CKPT_AT_SZ, GFP_KERNEL);
+
+	if (!buf)
+		return -ENOMEM;
+	ret = _ckpt_read_buffer(ctx, buf, CKPT_AT_SZ);
+	if (ret < 0)
+		goto out;
+
+	ret = -E2BIG;
+	for (i = 0; i < AT_VECTOR_SIZE; i++)
+		if (buf[i] > (u64) ULONG_MAX)
+			goto out;
+
+	for (i = 0; i < AT_VECTOR_SIZE - 1; i++)
+		mm->saved_auxv[i] = buf[i];
+	/* sanitize the input: force AT_NULL in last entry  */
+	mm->saved_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
+
+	ret = 0;
+ out:
+	kfree(buf);
+	return ret;
+}
+
+static int destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start,
+				vma->vm_end - vma->vm_start);
+		if (ret < 0) {
+			pr_warning("%s: failed munmap (%d)\n", __func__, ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+static void *restore_mm(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_mm *h;
+	struct mm_struct *mm = NULL;
+	struct file *file;
+	unsigned int nr;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (IS_ERR(h))
+		return (void *) h;
+
+	ckpt_debug("map_count %d\n", h->map_count);
+
+	ret = -EINVAL;
+	if ((h->start_code > h->end_code) ||
+	    (h->start_data > h->end_data))
+		goto out;
+	if (h->exe_objref < 0)
+		goto out;
+	if (h->def_flags & ~VM_LOCKED)
+		goto out;
+	if (h->flags & ~(MMF_DUMP_FILTER_MASK |
+			 ((1 << MMF_DUMP_FILTER_BITS) - 1)))
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+
+	mm->flags = h->flags;
+	mm->def_flags = h->def_flags;
+
+	mm->start_code = h->start_code;
+	mm->end_code = h->end_code;
+	mm->start_data = h->start_data;
+	mm->end_data = h->end_data;
+	mm->start_brk = h->start_brk;
+	mm->brk = h->brk;
+	mm->start_stack = h->start_stack;
+	mm->arg_start = h->arg_start;
+	mm->arg_end = h->arg_end;
+	mm->env_start = h->env_start;
+	mm->env_end = h->env_end;
+
+	/* restore the ->exe_file */
+	if (h->exe_objref) {
+		file = ckpt_obj_fetch(ctx, h->exe_objref, CKPT_OBJ_FILE);
+		if (IS_ERR(file)) {
+			up_write(&mm->mmap_sem);
+			ret = PTR_ERR(file);
+			goto out;
+		}
+		set_mm_exe_file(mm, file);
+	}
+	up_write(&mm->mmap_sem);
+
+	ret = ckpt_read_auxv(ctx, mm);
+	if (ret < 0) {
+		ckpt_debug("Error restoring auxv (%d)\n", ret);
+		goto out;
+	}
+
+	for (nr = h->map_count; nr; nr--) {
+		ret = restore_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_mm_context(ctx, mm);
+ out:
+	kfree(h);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	/* restore_obj() expects an extra reference */
+	atomic_inc(&mm->mm_users);
+	return (void *)mm;
+}
+
+int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = ckpt_obj_fetch(ctx, mm_objref, CKPT_OBJ_MM);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+
+	if (mm == current->mm)
+		return 0;
+
+	ret = exec_mmap(mm);
+	if (ret < 0)
+		return ret;
+
+	atomic_inc(&mm->mm_users);
+	return 0;
+}
+
+/*
+ * mm-related checkpoint objects
+ */
+
+static int obj_mm_grab(void *ptr)
+{
+	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
+	return 0;
+}
+
+static void obj_mm_drop(void *ptr, int lastref)
+{
+	mmput((struct mm_struct *) ptr);
+}
+
+/* mm object */
+static const struct ckpt_obj_ops ckpt_obj_mm_ops = {
+	.obj_name = "MM",
+	.obj_type = CKPT_OBJ_MM,
+	.ref_drop = obj_mm_drop,
+	.ref_grab = obj_mm_grab,
+	.checkpoint = checkpoint_mm,
+	.restore = restore_mm,
+};
+
+static int __init checkpoint_register_mm(void)
+{
+	return register_checkpoint_obj(&ckpt_obj_mm_ops);
+}
+late_initcall(checkpoint_register_mm);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6b9aee2..410b4fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/checkpoint.h>
 #include "internal.h"
 
 /*
@@ -1651,6 +1652,9 @@ EXPORT_SYMBOL(filemap_fault);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 50a4aa0..cb47b58 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2447,6 +2447,9 @@ static void special_mapping_close(struct vm_area_struct *vma)
 static const struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = special_mapping_checkpoint,
+#endif
 };
 
 /*
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 07/10] Checkpoint/restart vfs support
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (5 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 06/10] Checkpoint/restart mm support ntl
@ 2011-02-28 23:40 ` ntl
  2011-02-28 23:40 ` [PATCH 08/10] Add generic '->checkpoint' f_op to ext filesystems ntl
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.

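For instance, a filesystem whose regular files need nothing special
could hook up as in this sketch ("examplefs" is made up; the real
ext[234] hunks appear later in the series):

	static const struct file_operations examplefs_file_operations = {
		.llseek		= generic_file_llseek,
		.read		= do_sync_read,
		.write		= do_sync_write,
		.mmap		= generic_file_mmap,
	#ifdef CONFIG_CHECKPOINT
		.checkpoint	= generic_file_checkpoint,
	#endif
	};
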
Also adds a new 'file_operations' function for 'collecting' a file for
leak-detection during full-container checkpoint. This is useful for
those files that hold references to other "collectable" objects. Two
examples are pty files that point to corresponding tty objects, and
eventpoll files that refer to the files they are monitoring.

Checkpoint: dump the file table with 'struct ckpt_hdr_file_table',
followed by all open file descriptors. Because the 'struct file'
corresponding to an fd can be shared, they are assigned an objref and
registered in the object hash. A reference to the 'file *' is kept for
as long as it lives in the hash (the hash is only cleaned up at the
end of the checkpoint).

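Roughly, the per-task file table portion of the image therefore looks
like this (as produced by checkpoint_file_table() and
checkpoint_file_desc() below):

	struct ckpt_hdr_file_table { fdt_nfds }
	for each open fd:
		struct ckpt_hdr_file ...     (only on the first occurrence of
					      this 'struct file' in the objhash)
		struct ckpt_hdr_file_desc { fd_objref, fd_descriptor,
					    fd_close_on_exec }
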
Also provide generic_file_checkpoint() and generic_file_restore(),
which are suitable for normal files and directories.  Unlinked files
and directories are not yet supported.

Restart: for each fd, read a 'struct ckpt_hdr_file_desc' and look up
the objref in the hash table; if it is not found (first occurrence),
read in a 'struct ckpt_hdr_file', create a new file and register it
in the hash.  Otherwise, attach the file pointer from the hash as an
FD.

Based on original code by Oren Laadan.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
[ntl: remove unused obj_file_users]
[ntl: rearrange error path, prevent null pointer deref in checkpoint_file_desc]
[ntl: pass file, not dentry, to fsnotify_open]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 Documentation/filesystems/vfs.txt |   13 +-
 fs/Makefile                       |    1 +
 fs/checkpoint.c                   |  695 +++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |    7 +
 4 files changed, 715 insertions(+), 1 deletions(-)
 create mode 100644 fs/checkpoint.c

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 20899e0..23025bb 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -722,7 +722,7 @@ struct file_operations
 ----------------------
 
 This describes how the VFS can manipulate an open file. As of kernel
-2.6.22, the following members are defined:
+2.6.38, the following members are defined:
 
 struct file_operations {
 	struct module *owner;
@@ -752,6 +752,10 @@ struct file_operations {
 	int (*flock) (struct file *, int, struct file_lock *);
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
+#endif
 };
 
 Again, all methods are called without any locks being held, unless
@@ -820,6 +824,13 @@ otherwise noted.
   splice_read: called by the VFS to splice data from file to a pipe. This
 	       method is used by the splice(2) system call
 
+  checkpoint: called by checkpoint(2) system call to checkpoint the
+              state of a file descriptor.
+
+  collect: called by the checkpoint(2) system call to track references to
+           file descriptors, to detect leaks in full-container checkpoint
+	   (see Documentation/checkpoint/readme.txt).
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
diff --git a/fs/Makefile b/fs/Makefile
index a7f7cef..d7a49b7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_NFSD_DEPRECATED)	+= nfsctl.o
+obj-$(CONFIG_CHECKPOINT)        += checkpoint.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
 obj-$(CONFIG_BINFMT_EM86)	+= binfmt_em86.o
 obj-$(CONFIG_BINFMT_MISC)	+= binfmt_misc.o
diff --git a/fs/checkpoint.c b/fs/checkpoint.c
new file mode 100644
index 0000000..9b7c5ab
--- /dev/null
+++ b/fs/checkpoint.c
@@ -0,0 +1,695 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define DEBUG
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	fname = __d_path(path, &tmp, buf, *len);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container, this shouldn't happen; warn and proceed.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		ckpt_debug("file %s was opened in an alien mnt_ns, "
+			   "proceeding anyway\n", fname);
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_debug("ckpt_fill_fname failed (%s)\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s", file->f_dentry->d_name.name);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_debug("Unlinked files unsupported\n");
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	kfree(h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+static int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_debug("f_op %ps lacks checkpoint\n", file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_debug("file checkpoint failed\n");
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise dumps the file state too (via checkpoint_obj).
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+
+	/* sanity check (although this shouldn't happen) */
+	if (WARN_ON(!file)) {
+		rcu_read_unlock();
+		ckpt_debug("fd %d gone?\n", fd);
+		ret = -EBADF;
+		goto out;
+	}
+
+	coe = FD_ISSET(fd, fdt->close_on_exec);
+	get_file(file);
+	rcu_read_unlock();
+
+	if (has_locks_with_owner(file, files)) {
+		ret = -EBADF;
+		ckpt_debug("fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_debug("fd %d has an owner\n", fd);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d)\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	kfree(h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+static int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct files_struct *files = ptr;
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		kfree(h);
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+	struct file *file;
+	char *fname;
+	int len;
+
+	/* prevent bad input from doing bad things */
+	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+		return ERR_PTR(-EINVAL);
+
+	len = ckpt_read_payload(ctx, (void **) &fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return ERR_PTR(len);
+	fname[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+	file = filp_open(fname, flags, 0);
+	kfree(fname);
+
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	fmode_t new_mode = file->f_mode;
+	fmode_t saved_mode = (__force fmode_t) h->f_mode;
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Normally f_mode is set by open, and modified only via
+	 * fcntl(), so its value now should match that at checkpoint.
+	 * However, a file may be downgraded from (read-)write to
+	 * read-only, e.g.:
+	 *  - mark_files_ro() unsets FMODE_WRITE
+	 *  - nfs4_file_downgrade() does too, and also sets FMODE_READ
+	 * Validate the new f_mode against saved f_mode, allowing:
+	 *  - new with FMODE_WRITE, saved without FMODE_WRITE
+	 *  - new without FMODE_READ, saved with FMODE_READ
+	 */
+	if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+		new_mode &= ~FMODE_WRITE;
+		if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+			new_mode |= FMODE_READ;
+	}
+	/* finally, at this point new mode should match saved mode */
+	if (new_mode ^ saved_mode)
+		return -EINVAL;
+
+	if (file->f_mode & FMODE_LSEEK)
+		ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = restore_open_fname(ctx, ptr->f_flags);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static void *restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (void *)h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		 h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	kfree(h);
+	return (void *)file;
+}
+
+/**
+ * restore_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		 h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
+	ret = 0;
+ out:
+	kfree(h);
+	return ret;
+}
+
+/* restore callback for file table */
+static void *restore_file_table(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_table *h;
+	struct files_struct *files;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (IS_ERR(h))
+		return (void *)h;
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/*
+	 * We assume that restarting tasks, as created in user space,
+	 * each have a distinct files_struct object. If not, we need to
+	 * call dup_fd() to make sure we don't overwrite an already
+	 * restored one.
+	 */
+
+	/* point of no return -- close all file descriptors */
+	ret = close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->fdt_nfds; i++) {
+		ret = restore_file_desc(ctx);
+		if (ret < 0)
+			break;
+	}
+ out:
+	kfree(h);
+	if (!ret) {
+		files = current->files;
+		atomic_inc(&files->count);
+	} else {
+		files = ERR_PTR(ret);
+	}
+	return (void *)files;
+}
+
+int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
+{
+	struct files_struct *files;
+
+	files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE);
+	if (IS_ERR(files))
+		return PTR_ERR(files);
+
+	if (files != current->files) {
+		struct files_struct *prev;
+
+		task_lock(current);
+		prev = current->files;
+		current->files = files;
+		atomic_inc(&files->count);
+		task_unlock(current);
+
+		put_files_struct(prev);
+	}
+
+	return 0;
+}
+
+/*
+ * fs-related checkpoint objects
+ */
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+/* files_struct object */
+static const struct ckpt_obj_ops ckpt_obj_files_struct_ops = {
+	.obj_name = "FILE_TABLE",
+	.obj_type = CKPT_OBJ_FILE_TABLE,
+	.ref_drop = obj_file_table_drop,
+	.ref_grab = obj_file_table_grab,
+	.checkpoint = checkpoint_file_table,
+	.restore = restore_file_table,
+};
+
+/* file object */
+static const struct ckpt_obj_ops ckpt_obj_file_ops = {
+	.obj_name = "FILE",
+	.obj_type = CKPT_OBJ_FILE,
+	.ref_drop = obj_file_drop,
+	.ref_grab = obj_file_grab,
+	.checkpoint = checkpoint_file,
+	.restore = restore_file,
+};
+
+static __init int checkpoint_register_fs(void)
+{
+	int ret;
+
+	ret = register_checkpoint_obj(&ckpt_obj_files_struct_ops);
+	if (ret < 0)
+		return ret;
+	ret = register_checkpoint_obj(&ckpt_obj_file_ops);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+late_initcall(checkpoint_register_fs);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 175bb75..b7c088f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -407,6 +407,7 @@ struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
 struct cred;
+struct ckpt_ctx;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1551,6 +1552,10 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
+#endif
 };
 
 struct inode_operations {
@@ -2367,6 +2372,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(const char __user *, struct kstat *);
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 08/10] Add generic '->checkpoint' f_op to ext filesystems
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (6 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 07/10] Checkpoint/restart vfs support ntl
@ 2011-02-28 23:40 ` ntl
  2011-02-28 23:40 ` [PATCH 09/10] Add generic '->checkpoint()' f_op to simple char devices ntl
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Dave Hansen

From: Dave Hansen <dave@linux.vnet.ibm.com>

This marks ext[234] as being checkpointable.
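
(Not part of this patch; purely an illustration of the pattern: any
filesystem whose files can simply be reopened by name at restart can
opt in the same way.  A sketch for a hypothetical "foofs" would be:)

const struct file_operations foofs_file_operations = {
	.read		= do_sync_read,		/* assumed generic helpers */
	.write		= do_sync_write,
	.mmap		= generic_file_mmap,
#ifdef CONFIG_CHECKPOINT
	.checkpoint	= generic_file_checkpoint,
#endif
};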

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 fs/ext2/dir.c  |    3 +++
 fs/ext2/file.c |    6 ++++++
 fs/ext3/dir.c  |    3 +++
 fs/ext3/file.c |    3 +++
 fs/ext4/dir.c  |    3 +++
 fs/ext4/file.c |    6 ++++++
 6 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2709b34..7aefb74 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -721,4 +721,7 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= ext2_fsync,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 49eec94..c8991c8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -76,6 +76,9 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= ext2_fsync,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif /* CONFIG_CHECKPOINT */
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -91,6 +94,9 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif /* CONFIG_CHECKPOINT */
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index e2e72c3..e2f5948 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,9 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..2cf4ef2 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -68,6 +68,9 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index ece76fb..0101873 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,9 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5a5c55d..142dde6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -86,6 +86,9 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -188,6 +191,9 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/10] Add generic '->checkpoint()' f_op to simple char devices
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (7 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 08/10] Add generic '->checkpoint' f_op to ext filesystems ntl
@ 2011-02-28 23:40 ` ntl
  2011-02-28 23:40 ` [PATCH 10/10] x86_32 support for checkpoint/restart ntl
  2011-03-01  1:08 ` [RFC 00/10] container-based checkpoint/restart prototype Nathan Lynch
  10 siblings, 0 replies; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan

From: Oren Laadan <orenl@cs.columbia.edu>

* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 drivers/char/mem.c    |    6 ++++++
 drivers/char/random.c |    6 ++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 1256454..3452d1f 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -767,6 +767,9 @@ static const struct file_operations null_fops = {
 	.read		= read_null,
 	.write		= write_null,
 	.splice_write	= splice_write_null,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 #ifdef CONFIG_DEVPORT
@@ -783,6 +786,9 @@ static const struct file_operations zero_fops = {
 	.read		= read_zero,
 	.write		= write_zero,
 	.mmap		= mmap_zero,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 5a1aa64..67d00b8 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1166,6 +1166,9 @@ const struct file_operations random_fops = {
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
 	.llseek = noop_llseek,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = generic_file_checkpoint,
+#endif
 };
 
 const struct file_operations urandom_fops = {
@@ -1174,6 +1177,9 @@ const struct file_operations urandom_fops = {
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
 	.llseek = noop_llseek,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = generic_file_checkpoint,
+#endif
 };
 
 /***************************************************************
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/10] x86_32 support for checkpoint/restart
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (8 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 09/10] Add generic '->checkpoint()' f_op to simple char devices ntl
@ 2011-02-28 23:40 ` ntl
  2011-03-01  1:08 ` [RFC 00/10] container-based checkpoint/restart prototype Nathan Lynch
  10 siblings, 0 replies; 41+ messages in thread
From: ntl @ 2011-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Add logic to save and restore architecture-specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an
architecture-specific extension of the header (ckpt_hdr_header_arch).

Based on original code by Oren Laadan.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
[ntl: aggregated arch/x86 bits spread through various c/r patches]
Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/x86/Kconfig                   |    4 +
 arch/x86/include/asm/checkpoint.h  |   17 +
 arch/x86/include/asm/elf.h         |    5 +
 arch/x86/include/asm/ldt.h         |    7 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/kernel/Makefile           |    2 +
 arch/x86/kernel/checkpoint.c       |  677 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    2 +
 arch/x86/vdso/vdso32-setup.c       |   25 ++-
 9 files changed, 738 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint.h
 create mode 100644 arch/x86/kernel/checkpoint.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e330da2..7a2a64d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config MMU
 	def_bool y
 
diff --git a/arch/x86/include/asm/checkpoint.h b/arch/x86/include/asm/checkpoint.h
new file mode 100644
index 0000000..334d3be
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint.h
@@ -0,0 +1,17 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifdef CONFIG_X86_32
+#define CKPT_ARCH_ID	CKPT_ARCH_X86_32
+#endif
+
+#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index f2ad216..8a6c45e 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -320,4 +320,9 @@ extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
 extern unsigned long arch_randomize_brk(struct mm_struct *mm);
 #define arch_randomize_brk arch_randomize_brk
 
+#ifdef CONFIG_X86_32
+#define arch_restore_vdso arch_restore_vdso
+extern int arch_restore_vdso(unsigned long addr);
+#endif /* CONFIG_X86_32 */
+
 #endif /* _ASM_X86_ELF_H */
diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
index 46727eb..f2845f9 100644
--- a/arch/x86/include/asm/ldt.h
+++ b/arch/x86/include/asm/ldt.h
@@ -37,4 +37,11 @@ struct user_desc {
 #define MODIFY_LDT_CONTENTS_CODE	2
 
 #endif /* !__ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#include <linux/linkage.h>
+asmlinkage int sys_modify_ldt(int func, void __user *ptr,
+			      unsigned long bytecount);
+#endif
+
 #endif /* _ASM_X86_LDT_H */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index b766a5e..a2d589f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,12 @@
 #define __NR_fanotify_init	338
 #define __NR_fanotify_mark	339
 #define __NR_prlimit64		340
+#define __NR_checkpoint		341
+#define __NR_restart		342
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 341
+#define NR_syscalls 343
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 1e99475..f44a19d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -111,6 +111,8 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/checkpoint.c b/arch/x86/kernel/checkpoint.c
new file mode 100644
index 0000000..ecb458a
--- /dev/null
+++ b/arch/x86/kernel/checkpoint.c
@@ -0,0 +1,677 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/checkpoint.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/preempt.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include <asm/checkpoint.h>
+#include <asm/desc_defs.h>
+#include <asm/desc.h>
+#include <asm/i387.h>
+#include <asm/ldt.h>
+#include <asm/syscalls.h>
+#include <asm/thread_info.h>
+
+/* arch dependent header types */
+enum {
+	CKPT_HDR_CPU_FPU = 201,
+#define CKPT_HDR_CPU_FPU CKPT_HDR_CPU_FPU
+	CKPT_HDR_MM_CONTEXT_LDT,
+#define CKPT_HDR_MM_CONTEXT_LDT CKPT_HDR_MM_CONTEXT_LDT
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	/* FIXME: add HAVE_HWFP */
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+};
+
+struct ckpt_hdr_thread {
+	struct ckpt_hdr h;
+	__u32 thread_info_flags;
+	__u16 gdt_entry_tls_entries;
+	__u16 sizeof_tls_array;
+};
+
+/* designed to work for both x86_32 and x86_64 */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	/* see struct pt_regs (x86_64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 sp;
+
+	__u64 flags;
+
+	/* segment registers */
+	__u64 fs;
+	__u64 gs;
+
+	__u16 fsindex;
+	__u16 gsindex;
+	__u16 cs;
+	__u16 ss;
+	__u16 ds;
+	__u16 es;
+
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+};
+
+#define CKPT_X86_SEG_NULL	0
+#define CKPT_X86_SEG_USER32_CS	1
+#define CKPT_X86_SEG_USER32_DS	2
+#define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
+#define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	__u64 vdso;
+	__u32 ldt_entry_size;
+	__u32 nldt;
+};
+
+#ifdef CONFIG_X86_32
+
+static int check_segment(__u16 seg)
+{
+	int ret = 0;
+
+	switch (seg) {
+	case CKPT_X86_SEG_NULL:
+	case CKPT_X86_SEG_USER32_CS:
+	case CKPT_X86_SEG_USER32_DS:
+		return 1;
+	}
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+			ret = 1;
+	} else if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		if (seg <= 0x1fff)
+			ret = 1;
+	}
+	return ret;
+}
+
+static __u16 encode_segment(unsigned short seg)
+{
+	if (seg == 0)
+		return CKPT_X86_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+
+	if (seg == __USER_CS)
+		return CKPT_X86_SEG_USER32_CS;
+	if (seg == __USER_DS)
+		return CKPT_X86_SEG_USER32_DS;
+
+	if (seg & 4)
+		return CKPT_X86_SEG_LDT | (seg >> 3);
+
+	seg >>= 3;
+	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+	printk(KERN_ERR "c/r: (encode) bad segment %#hx\n", seg);
+	BUG();
+}
+
+static unsigned short decode_segment(__u16 seg)
+{
+	if (seg == CKPT_X86_SEG_NULL)
+		return 0;
+	if (seg == CKPT_X86_SEG_USER32_CS)
+		return __USER_CS;
+	if (seg == CKPT_X86_SEG_USER32_DS)
+		return __USER_DS;
+
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	unsigned long _gs;
+
+	h->bp = regs->bp;
+	h->bx = regs->bx;
+	h->ax = regs->ax;
+	h->cx = regs->cx;
+	h->dx = regs->dx;
+	h->si = regs->si;
+	h->di = regs->di;
+	h->orig_ax = regs->orig_ax;
+	h->ip = regs->ip;
+
+	h->flags = regs->flags;
+	h->sp = regs->sp;
+
+	h->cs = encode_segment(regs->cs);
+	h->ss = encode_segment(regs->ss);
+	h->ds = encode_segment(regs->ds);
+	h->es = encode_segment(regs->es);
+
+	_gs = task_user_gs(t);
+
+	h->fsindex = encode_segment(regs->fs);
+	h->gsindex = encode_segment(_gs);
+}
+
+asmlinkage void ret_from_fork(void);
+int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	if (h->cs == CKPT_X86_SEG_NULL)
+		return -EINVAL;
+	if (!check_segment(h->cs) || !check_segment(h->ds) ||
+	    !check_segment(h->es) || !check_segment(h->ss) ||
+	    !check_segment(h->fsindex) || !check_segment(h->gsindex))
+		return -EINVAL;
+
+	regs->bp = h->bp;
+	regs->bx = h->bx;
+	regs->ax = h->ax;
+	regs->cx = h->cx;
+	regs->dx = h->dx;
+	regs->si = h->si;
+	regs->di = h->di;
+	regs->orig_ax = h->orig_ax;
+	regs->ip = h->ip;
+
+	regs->sp = h->sp;
+
+	regs->ds = decode_segment(h->ds);
+	regs->es = decode_segment(h->es);
+	regs->cs = decode_segment(h->cs);
+	regs->ss = decode_segment(h->ss);
+
+	regs->fs = decode_segment(h->fsindex);
+	regs->gs = decode_segment(h->gsindex);
+
+	thread->sp = (unsigned long)regs;
+	thread->sp0 = (unsigned long)(regs + 1);
+	thread->ip = (unsigned long)ret_from_fork;
+	thread->gs = regs->gs;
+	lazy_load_gs(regs->gs);
+
+	return 0;
+}
+
+#endif /* CONFIG_X86_32 */
+
+static int check_tls(struct desc_struct *desc)
+{
+	if (!desc->a && !desc->b)
+		return 1;
+	if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+		return 0;
+	return 1;
+}
+
+#define CKPT_X86_TIF_UNSUPPORTED   (_TIF_SECCOMP | _TIF_IO_BITMAP)
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+#ifdef CONFIG_X86_32
+	if (t->thread.vm86_info) {
+		ckpt_debug("Task in VM86 mode\n");
+		return -EBUSY;
+	}
+#endif
+
+	/* debugregs not (yet) supported */
+	if (test_tsk_thread_flag(t, TIF_DEBUG)) {
+		ckpt_debug("Task with debugreg set\n");
+		return -EBUSY;
+	}
+
+	if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) {
+		ckpt_debug("Bad thread info flags %#lx\n",
+			 (unsigned long)task_thread_info(t)->flags);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_thread *h;
+	int tls_size;
+	int ret;
+
+	BUG_ON(t == current);
+
+	ret = may_checkpoint_thread(ctx, t);
+	if (ret < 0)
+		return ret;
+
+	tls_size = sizeof(t->thread.tls_array);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (!h)
+		return -ENOMEM;
+
+	h->thread_info_flags =
+		task_thread_info(t)->flags & ~CKPT_X86_TIF_UNSUPPORTED;
+	h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	h->sizeof_tls_array = tls_size;
+
+	/* For simplicity dump the entire array */
+	memcpy(h + 1, t->thread.tls_array, tls_size);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+	return ret;
+}
+
+static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	h->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, xstate_size + sizeof(*h),
+			      CKPT_HDR_CPU_FPU);
+	if (!h)
+		return -ENOMEM;
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIX: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(h + 1, t->thread.fpu.state, xstate_size);
+
+	ret = ckpt_write_obj(ctx, h);
+	kfree(h);
+
+	return ret;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	BUG_ON(t == current);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	save_cpu_regs(h, t);
+	save_cpu_fpu(h, t);
+
+	ckpt_debug("math %d\n", h->used_math);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = checkpoint_cpu_fpu(ctx, t);
+ out:
+	kfree(h);
+	return ret;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	/* FPU capabilities */
+	h->has_fxsr = cpu_has_fxsr;
+	h->has_xsave = cpu_has_xsave;
+	h->xstate_size = xstate_size;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+
+	return ret;
+}
+
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	BUG_ON(mm == current->mm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	mutex_lock(&mm->context.lock);
+
+	h->vdso = (unsigned long) mm->context.vdso;
+	h->ldt_entry_size = LDT_ENTRY_SIZE;
+	h->nldt = mm->context.size;
+
+	ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	kfree(h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj_type(ctx, mm->context.ldt,
+				  mm->context.size * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_thread *h;
+	struct thread_struct *thread = &current->thread;
+	struct desc_struct *desc;
+	int tls_size;
+	int i, cpu, ret;
+
+	tls_size = sizeof(thread->tls_array);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->thread_info_flags & CKPT_X86_TIF_UNSUPPORTED)
+		goto out;
+	if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+	if (h->sizeof_tls_array != tls_size)
+		goto out;
+
+	/*
+	 * Restore TLS by hand: why convert to struct user_desc if
+	 * sys_set_thread_area() would just convert it back?
+	 */
+	desc = (struct desc_struct *) (h + 1);
+
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		if (!check_tls(&desc[i]))
+			goto out;
+	}
+
+	cpu = get_cpu();
+	memcpy(thread->tls_array, desc, tls_size);
+	load_TLS(thread, cpu);
+	put_cpu();
+
+	/* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */
+
+	ret = 0;
+ out:
+	kfree(h);
+	return ret;
+}
+
+static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!h->used_math)
+		clear_used_math();
+
+	return 0;
+}
+
+static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	/* init_fpu() eventually also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_read_obj_type(ctx, xstate_size + sizeof(*h),
+			       CKPT_HDR_CPU_FPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memcpy(t->thread.fpu.state, h + 1, xstate_size);
+
+	kfree(h);
+	return ret;
+}
+
+static int check_eflags(__u32 eflags)
+{
+#define X86_EFLAGS_CKPT_MASK  \
+	(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \
+	 X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \
+	 X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID | X86_EFLAGS_RF)
+
+	if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2))
+		return 0;
+	return 1;
+}
+
+static void restore_eflags(struct pt_regs *regs, __u32 eflags)
+{
+	/*
+	 * A task may have had X86_EFLAGS_RF set at checkpoint, e.g.:
+	 * 1) It ran in a KVM guest, and the guest was being debugged,
+	 * 2) The kernel was debugged using kgdb,
+	 * 3) From Intel's manual: "When calling an event handler,
+	 *    Intel 64 and IA-32 processors establish the value of the
+	 *    RF flag in the EFLAGS image pushed on the stack:
+	 *  - For any fault-class exception except a debug exception
+	 *    generated in response to an instruction breakpoint, the
+	 *    value pushed for RF is 1.
+	 *  - For any interrupt arriving after any iteration of a
+	 *    repeated string instruction but the last iteration, the
+	 *    value pushed for RF is 1.
+	 *  - For any trap-class exception generated by any iteration
+	 *    of a repeated string instruction but the last iteration,
+	 *    the value pushed for RF is 1.
+	 *  - For other cases, the value pushed for RF is the value
+	 *    that was in EFLAG.RF at the time the event handler was
+	 *    called.
+	 *  [from: http://www.intel.com/Assets/PDF/manual/253668.pdf]
+	 *
+	 * The RF flag may be set in EFLAGS by the hardware, or by
+	 * kvm/kgdb, or even by the user with ptrace or by setting a
+	 * suitable context when returning from a signal handler.
+	 *
+	 * Therefore, on restart we (1) preserve X86_EFLAGS_RF from
+	 * checkpoint time, and (2) keep X86_EFLAGS_RF of the restarting
+	 * process if it is already set in its current EFLAGS.
+	 */
+	eflags |= (regs->flags & X86_EFLAGS_RF);
+	regs->flags = eflags;
+}
+
+static int load_cpu_eflags(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+
+	if (!check_eflags(h->flags))
+		return -EINVAL;
+	restore_eflags(regs, h->flags);
+	return 0;
+}
+
+/* read the cpu state and registers for a restarting task */
+int restore_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	BUG_ON(t == current);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("math %d\n", h->used_math);
+
+	ret = load_cpu_regs(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_eflags(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_fpu(h, t);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = restore_cpu_fpu(ctx, t);
+ out:
+	kfree(h);
+	return ret;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (h->has_fxsr != cpu_has_fxsr ||
+	    h->has_xsave != cpu_has_xsave ||
+	    h->xstate_size != xstate_size) {
+		ret = -EINVAL;
+		ckpt_debug("incompatible FPU capabilities");
+	}
+
+	kfree(h);
+	return ret;
+}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	unsigned int n;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("nldt %d vdso %#lx (%p)\n",
+		 h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+	/* FIXME: CONFIG_COMPAT_VDSO=y makes this fail */
+	ret = -EINVAL;
+	if (h->vdso != (unsigned long) mm->context.vdso)
+		goto out;
+	if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	ret = _ckpt_read_obj_type(ctx, NULL,
+				  h->nldt * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * To utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc', reversing the logic of include/asm/desc.h:fill_ldt()
+	 */
+	for (n = 0; n < h->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			break;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			break;
+	}
+ out:
+	kfree(h);
+	return ret;
+}
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index b35786d..07f48b6 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,5 @@ ENTRY(sys_call_table)
 	.long sys_fanotify_init
 	.long sys_fanotify_mark
 	.long sys_prlimit64		/* 340 */
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 36df991..267aa64 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -309,11 +309,9 @@ int __init sysenter_setup(void)
 	return 0;
 }
 
-/* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+static int __arch_setup_additional_pages(unsigned long addr)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long addr;
 	int ret = 0;
 	bool compat;
 
@@ -326,12 +324,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	   changes it via sysctl */
 	compat = (vdso_enabled == VDSO_COMPAT);
 
+	/* We don't know how to handle compat with sys_restart yet */
+	if (WARN_ON_ONCE(compat && addr != 0)) {
+		ret = -ENOSYS;
+		goto up_fail;
+	}
+
 	map_compat_vdso(compat);
 
 	if (compat)
 		addr = VDSO_HIGH_BASE;
 	else {
-		addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+		addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
 		if (IS_ERR_VALUE(addr)) {
 			ret = addr;
 			goto up_fail;
@@ -372,6 +376,19 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	return ret;
 }
 
+/* Setup a VMA at program startup for the vsyscall page */
+int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+{
+	return __arch_setup_additional_pages(0);
+}
+
+#ifdef CONFIG_X86_32
+int arch_restore_vdso(unsigned long addr)
+{
+	return __arch_setup_additional_pages(addr);
+}
+#endif /* CONFIG_X86_32 */
+
 #ifdef CONFIG_X86_64
 
 subsys_initcall(sysenter_setup);
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC 00/10] container-based checkpoint/restart prototype
  2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
                   ` (9 preceding siblings ...)
  2011-02-28 23:40 ` [PATCH 10/10] x86_32 support for checkpoint/restart ntl
@ 2011-03-01  1:08 ` Nathan Lynch
  10 siblings, 0 replies; 41+ messages in thread
From: Nathan Lynch @ 2011-03-01  1:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Oren Laadan

On Mon, 2011-02-28 at 17:40 -0600, ntl@pobox.com wrote:
> This is the tradeoff we ask users
> to make - the ability to C/R and migrate is provided in exchange for
> accepting some isolation and slightly reduced ease of use.  A tool
> such as lxc (http://lxc.sourceforge.net) can be used to isolate jobs.
> A patch against lxc is available which adds C/R capability.

Below is that patch (against the lxc-0.7.3 tag) and a usage example.

# export LXC_CMD_SOCK_ABSTRACT=test
# lxc-execute -n foo -- /bin/cat </dev/zero &>/dev/null &
# ps
  PID TTY          TIME CMD
 8736 pts/1    00:00:00 bash
 8842 pts/1    00:00:00 lxc-execute
 8843 pts/1    00:00:00 lxc-init
 8844 pts/1    00:00:01 cat
 8845 pts/1    00:00:00 ps
# lxc-checkpoint -S /tmp/ckpt.img -n foo -k
[1]+  Exit 137                lxc-execute -n foo -- /bin/cat < /dev/zero &>/dev/null
# ps
  PID TTY          TIME CMD
 8736 pts/1    00:00:00 bash
 8849 pts/1    00:00:00 ps
# lxc-restart -n foo -S /tmp/ckpt.img

[whee, watch resurrected /bin/cat eat cpu]
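
For the curious, here is roughly what that boils down to at the syscall
level.  This is only an illustrative sketch (not part of the lxc patch);
it hardcodes the x86_32 syscall number that the patch's cr.h defines
(342 for restart) and the image path used above:

/* sketch: restart a checkpointed job from inside a fresh container,
 * running as pid 1 of the new pid namespace (cf. lxc-init below)
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define SYS_restart 342		/* x86_32, per cr.h in the patch below */

int main(void)
{
	int fd, ret;

	if (getpid() != 1)	/* only the pidns init may call restart */
		return 1;

	fd = open("/tmp/ckpt.img", O_RDONLY);
	if (fd < 0)
		return 1;

	ret = syscall(SYS_restart, fd, 0);	/* recreate the task tree */
	if (ret != 0) {
		perror("restart");
		return 1;
	}

	/* the restarted tasks are now children of init; reap as usual */
	while (wait(NULL) > 0)
		;
	return 0;
}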

 doc/rootfs/Makefile.am |    2 +-
 lxc.spec.in            |    2 +-
 src/lxc/Makefile.am    |    3 +-
 src/lxc/checkpoint.c   |   59 ++++++-
 src/lxc/cr.h           |   71 ++++++++
 src/lxc/lxc_init.c     |  459 ++++++++++++++++++++++++++++++++++++++----------
 src/lxc/lxc_restart.c  |   25 ++-
 src/lxc/restart.c      |   77 --------
 src/lxc/start.c        |    3 +-
 templates/Makefile.am  |    2 +-
 10 files changed, 515 insertions(+), 188 deletions(-)

diff --git a/doc/rootfs/Makefile.am b/doc/rootfs/Makefile.am
index 98fb0e0..832bb4a 100644
--- a/doc/rootfs/Makefile.am
+++ b/doc/rootfs/Makefile.am
@@ -1,3 +1,3 @@
-READMEdir=@LXCROOTFSMOUNT@
+READMEdir=$(pkglibdir)/rootfs
 
 README_DATA=README
\ No newline at end of file
diff --git a/lxc.spec.in b/lxc.spec.in
index 379b53d..1ca6326 100644
--- a/lxc.spec.in
+++ b/lxc.spec.in
@@ -57,7 +57,7 @@ development of the linux containers.
 %setup
 %build
 PATH=$PATH:/usr/sbin:/sbin %configure
-make %{?_smp_mflags}
+make %{?_smp_mflags} CFLAGS='-Wall -Werror -g'
 
 %install
 %makeinstall
diff --git a/src/lxc/Makefile.am b/src/lxc/Makefile.am
index d2ee4d9..9e4d4c9 100644
--- a/src/lxc/Makefile.am
+++ b/src/lxc/Makefile.am
@@ -25,8 +25,7 @@ liblxc_so_SOURCES = \
 	monitor.c monitor.h \
 	console.c \
 	freezer.c \
-	checkpoint.c \
-	restart.c \
+	checkpoint.c cr.h\
 	error.h error.c \
 	parse.c parse.h \
 	cgroup.c cgroup.h \
diff --git a/src/lxc/checkpoint.c b/src/lxc/checkpoint.c
index a2d0d8a..b0c62a8 100644
--- a/src/lxc/checkpoint.c
+++ b/src/lxc/checkpoint.c
@@ -22,11 +22,64 @@
  */
 #include <lxc/lxc.h>
 #include <lxc/log.h>
+#include <stdlib.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+#include "af_unix.h"
+#include "cr.h"
 
 lxc_log_define(lxc_checkpoint, lxc);
 
-int lxc_checkpoint(const char *name, int sfd, int flags)
+int lxc_checkpoint(const char *name, int statefd, int flags)
 {
-	ERROR("'checkpoint' function not implemented");
-	return -1;
+	struct lxc_cr_cmd cmd = { .code = LXC_COMMAND_CHECKPOINT, };
+	struct lxc_cr_response response;
+	const char *cmd_sock_path;
+	char sun_path[sizeof(((struct sockaddr_un *)0)->sun_path)] = { 0 };
+	ssize_t ret;
+	int sockfd;
+
+	cmd_sock_path = getenv("LXC_CMD_SOCK_ABSTRACT");
+	if (!cmd_sock_path) {
+		ERROR("LXC_CMD_SOCK_ABSTRACT not set");
+		return -1;
+	}
+
+	strncpy(&sun_path[1], cmd_sock_path, sizeof(sun_path) - 2);
+
+	sockfd = lxc_af_unix_connect(sun_path);
+	if (sockfd == -1) {
+		ERROR("sock connect");
+		return -1;
+	}
+
+	ret = lxc_af_unix_send_fd(sockfd, statefd, &cmd, sizeof(cmd));
+	if (ret != sizeof(cmd)) {
+		ERROR("send fd");
+		return -1;
+	}
+
+	ret = recv(sockfd, &response, sizeof(response), 0);
+	if (ret != sizeof(response)) {
+		ERROR("recv");
+		return -1;
+	}
+
+	close(sockfd);
+
+	if (response.code != LXC_RESPONSE_SUCCESS) {
+		ERROR("checkpoint command failed (%u)", response.code);
+		return -1;
+	}
+
+	/* This is racy - we'd rather the container have no chance to
+	 * run between checkpoint and the stop request - but hopefully
+	 * it will do for now.
+	 */
+	if (flags & LXC_FLAG_HALT)
+		return lxc_stop(name);
+
+	return 0;
 }
diff --git a/src/lxc/cr.h b/src/lxc/cr.h
new file mode 100644
index 0000000..244c8f9
--- /dev/null
+++ b/src/lxc/cr.h
@@ -0,0 +1,71 @@
+/*
+ * lxc: Linux Container library
+ *
+ * Copyright IBM Corp. 2011
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef _LXC_CR_H
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+enum {
+	LXC_COMMAND_CHECKPOINT = 1,
+};
+
+enum {
+	LXC_RESPONSE_SUCCESS,
+	LXC_RESPONSE_FAILURE,
+};
+
+struct lxc_cr_cmd { unsigned int code; };
+struct lxc_cr_response { unsigned int code; };
+
+#ifndef SYS_checkpoint
+#if defined(__i386__)
+#define SYS_checkpoint 341
+#elif defined(__x86_64__)
+#define SYS_checkpoint 303
+#else
+#warning SYS_checkpoint not defined for this architecture
+#define SYS_checkpoint -1
+#endif
+#endif
+
+static inline int checkpoint(int fd, unsigned int flags)
+{
+	return syscall(SYS_checkpoint, fd, flags);
+}
+
+#ifndef SYS_restart
+#if defined(__i386__)
+#define SYS_restart 342
+#elif defined(__x86_64__)
+#define SYS_restart 304
+#else
+#warning SYS_restart not defined for this architecture
+#define SYS_restart -1
+#endif
+#endif
+
+static inline int restart(int fd, unsigned int flags)
+{
+	return syscall(SYS_restart, fd, flags);
+}
+
+#endif /* _LXC_CR_H */
diff --git a/src/lxc/lxc_init.c b/src/lxc/lxc_init.c
index a534b51..2e8f08a 100644
--- a/src/lxc/lxc_init.c
+++ b/src/lxc/lxc_init.c
@@ -1,10 +1,11 @@
 /*
  * lxc: linux Container library
  *
- * (C) Copyright IBM Corp. 2007, 2008
+ * (C) Copyright IBM Corp. 2007, 2008, 2011
  *
  * Authors:
  * Daniel Lezcano <dlezcano at fr.ibm.com>
+ * Nathan Lynch
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -21,21 +22,29 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  */
 
-#include <stdio.h>
-#include <unistd.h>
-#include <stdlib.h>
+#define _GNU_SOURCE
+#include <assert.h>
 #include <errno.h>
-#include <signal.h>
+#include <getopt.h>
 #include <libgen.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/epoll.h>
+#include <sys/signalfd.h>
+#include <sys/socket.h>
 #include <sys/stat.h>
 #include <sys/types.h>
+#include <sys/un.h>
 #include <sys/wait.h>
-#define _GNU_SOURCE
-#include <getopt.h>
 
-#include "log.h"
+#include "af_unix.h"
 #include "caps.h"
+#include "cr.h"
 #include "error.h"
+#include "log.h"
 #include "utils.h"
 
 lxc_log_define(lxc_init, lxc);
@@ -47,23 +56,305 @@ static struct option options[] = {
 	{ 0, 0, 0, 0 },
 };
 
-static	int was_interrupted = 0;
+static bool pidns_is_empty(void)
+{
+	assert(getpid() == (pid_t)1);
+
+	if (kill(-1, 0) == 0)
+		return false;
+	assert(errno == ESRCH);
+	return true;
+}
 
-int main(int argc, char *argv[])
+static struct lxc_init_state {
+	pid_t child_pid;             /* the child we initially created */
+	bool shutting_down;          /* we've received a request to exit */
+	bool child_status_collected; /* we've retrieved child_pid's status */
+	int exit_code;               /* status code to return from main() */
+	size_t nr_waited;            /* # children waited post-restart */
+} state;
+
+static void handle_sigterm(struct lxc_init_state *state)
 {
+	if (state->shutting_down)
+		return;
+
+	state->shutting_down = true;
+	kill(-1, SIGTERM);
+	alarm(1);
+}
+
+static void handle_sigalrm(struct lxc_init_state *state)
+{
+	kill(-1, SIGKILL);
+}
+
+static void handle_sigchld(struct lxc_init_state *state)
+{
+	pid_t pid;
+	int status;
+
+	while ((pid = waitpid(-1, &status, WNOHANG)) != 0) {
+
+		if (pid == (pid_t)-1)
+			return;
+
+		/* reset timer each time a process exits */
+		if (state->shutting_down)
+			alarm(1);
+
+		ERROR("collected pid %lu\n", (unsigned long)pid);
+
+		state->nr_waited++; /* for restart */
+
+		if (state->child_status_collected)
+			continue; /* don't care */
+
+		if (pid != state->child_pid)
+			continue; /* don't care */
+
+		state->child_status_collected = true;
+		state->exit_code = lxc_error_set_and_log(pid, status);
+	}
+}
+
+typedef void (*sigfd_handler_t)(struct lxc_init_state *);
+
+static const sigfd_handler_t sig_dispatch_table[NSIG] =
+{
+	[SIGTERM] = handle_sigterm,
+	[SIGALRM] = handle_sigalrm,
+	[SIGCHLD] = handle_sigchld,
+};
+
+static int epoll_sigfd_handler(struct lxc_init_state *state, int fd, uint32_t events)
+{
+	struct signalfd_siginfo siginfo;
+	int ret;
+
+	ret = read(fd, &siginfo, sizeof(struct signalfd_siginfo));
+	if (ret != sizeof(struct signalfd_siginfo)) {
+		ERROR("read signalfd");
+		return -1;
+	}
+
+	if (sig_dispatch_table[siginfo.ssi_signo] != NULL)
+		sig_dispatch_table[siginfo.ssi_signo](state);
+	else
+		kill(state->child_pid, siginfo.ssi_signo);
+
+	return 0;
+}
+
+static int epoll_cmd_handler(struct lxc_init_state *state, int listenfd, uint32_t events)
+{
+	struct lxc_cr_response response = { .code = LXC_RESPONSE_FAILURE, };
+	struct lxc_cr_cmd cmd;
+	int saved_errno;
+	ssize_t bytes;
+	int acceptfd;
+	int statefd;
+	int flags;
+	int rc;
+
+	acceptfd = accept(listenfd, NULL, NULL);
+	if (acceptfd == -1) {
+		ERROR("accept");
+		goto out;
+	}
+
+	statefd = -1;
+	bytes = lxc_af_unix_recv_fd(acceptfd, &statefd, &cmd, sizeof(cmd));
+
+	if (bytes == -1) {
+		ERROR("recv fd");
+		goto out;
+	}
+
+	if (cmd.code != LXC_COMMAND_CHECKPOINT) {
+		ERROR("unknown command %i", cmd.code);
+		goto out;
+	}
+
+	flags = 0;
+	rc = checkpoint(statefd, flags);
+	saved_errno = errno;
+	close(statefd);
+
+	if (rc == 0)
+		response.code = LXC_RESPONSE_SUCCESS;
+	else
+		ERROR("checkpoint error: %s", strerror(saved_errno));
+
+	bytes = send(acceptfd, &response, sizeof(response), 0);
+	if (bytes != sizeof(response))
+		ERROR("send (bytes = %zd)", bytes);
+out:
+	close(acceptfd);
+	return 0;
+}
+
+typedef int (*epoll_handler_t)(struct lxc_init_state *, int fd, uint32_t events);
+
+struct epoll_info;
+typedef int (*epoll_info_ctor_t)(struct epoll_info *);
+
+struct epoll_info {
+	const char *desc;        /* human-friendly string for debug/logging */
+	epoll_info_ctor_t ctor;  /* initializes fd */
+	epoll_handler_t handler; /* handles events from epoll_wait */
+	int fd;                  /* fd passed to epoll_ctl */
+	uint32_t events;         /* events relevant for this resource */
+};
+
+static int sigfd_ctor(struct epoll_info *info)
+{
+	sigset_t mask;
+
+	sigfillset(&mask);
+
+	return signalfd(-1, &mask, SFD_CLOEXEC);
+}
+
+static int cmd_sock_ctor(struct epoll_info *info)
+{
+	char sun_path[sizeof(((struct sockaddr_un *)0)->sun_path)] = { 0 };
+	const char *cmd_sock_path;
+	int fd;
+
+	cmd_sock_path = getenv("LXC_CMD_SOCK_ABSTRACT");
+	if (!cmd_sock_path) {
+		ERROR("LXC_CMD_SOCK_ABSTRACT not set");
+		return -1;
+	}
+
+	strncpy(&sun_path[1], cmd_sock_path, sizeof(sun_path) - 2);
+
+	fd = lxc_af_unix_open(sun_path, SOCK_STREAM, 0);
+	if (fd == -1)
+		return -1;
 
-	void interrupt_handler(int sig)
+	return fd;
+}
+
+static struct epoll_info epoll_info_table[] = {
+	{
+		.desc = "signalfd",
+		.ctor = sigfd_ctor,
+		.handler = epoll_sigfd_handler,
+		.events = EPOLLIN,
+	},
 	{
-		if (!was_interrupted)
-			was_interrupted = sig;
+		.desc = "command socket",
+		.ctor = cmd_sock_ctor,
+		.handler = epoll_cmd_handler,
+		.events = EPOLLIN | EPOLLPRI,
+	},
+};
+
+static const size_t epoll_info_table_size =
+	sizeof(epoll_info_table) / sizeof(epoll_info_table[0]);
+
+static int epoll_info_init_one(struct epoll_info *info, int epollfd)
+{
+	struct epoll_event event;
+
+	info->fd = info->ctor(info);
+	if (info->fd == -1)
+		return -1;
+
+	event.events = info->events;
+	event.data.ptr = info;
+
+	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, info->fd, &event) == -1) {
+		ERROR("epoll_ctl");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int epoll_setup(void)
+{
+	int epollfd;
+	int i;
+
+	epollfd = epoll_create1(O_CLOEXEC);
+	if  (epollfd == -1) {
+		ERROR("epoll_create1");
+		return epollfd;
+	}
+
+	for (i = 0; i < epoll_info_table_size; i++) {
+		struct epoll_info *info = &epoll_info_table[i];
+
+		if (epoll_info_init_one(info, epollfd) == -1) {
+			ERROR("%s failed", info->desc);
+			return -EINVAL;
+		}
 	}
 
+	return epollfd;
+}
+
+static int epoll_loop(int epollfd)
+{
+	do {
+		struct epoll_info *info;
+		struct epoll_event event;
+		int epollrc;
+
+		epollrc = epoll_wait(epollfd, &event, 1, -1);
+		if (epollrc == -1) {
+			/* e.g. SIGCONT from ptrace attach */
+			assert(errno == EINTR);
+			continue;
+		}
+
+		assert(event.data.ptr != NULL);
+
+		info = event.data.ptr;
+
+		assert(event.events & info->events);
+
+		if (info->handler(&state, info->fd, event.events) == -1)
+			return -1;
+	} while (!pidns_is_empty());
+
+	return 0;
+}
+
+static int get_restart_fd(void)
+{
+	const char *str;
+	int fd = -1;
+
+	str = getenv("LXC_RESTART_FD");
+	if (str) {
+		errno = 0;
+		fd = strtol(str, NULL, 0);
+		if (errno) {
+			ERROR("LXC_RESTART_FD has bad value '%s'", str);
+			fd = -1;
+		}
+	}
+
+	return fd;
+}
+
+int main(int argc, char *argv[])
+{
+	int restartfd;
+	int epollfd;
 	pid_t pid;
 	int nbargs = 0;
 	int err = -1;
 	char **aargv;
 	sigset_t mask, omask;
-	int i, shutdown = 0;
+	int i;
+
+	state.exit_code = EXIT_FAILURE;
+	state.nr_waited = 0;
 
 	while (1) {
 		int ret = getopt_long_only(argc, argv, "", options, NULL);
@@ -82,7 +373,9 @@ int main(int argc, char *argv[])
 	if (lxc_log_init(NULL, 0, basename(argv[0]), quiet))
 		exit(err);
 
-	if (!argv[optind]) {
+	restartfd = get_restart_fd();
+
+	if (!argv[optind] && restartfd == -1) {
 		ERROR("missing command to launch");
 		exit(err);
 	}
@@ -91,113 +384,89 @@ int main(int argc, char *argv[])
 	argc -= nbargs;
 
         /*
-	 * mask all the signals so we are safe to install a
+	 * mask most signals so we are safe to install a
 	 * signal handler and to fork
 	 */
 	sigfillset(&mask);
+	sigdelset(&mask, SIGILL);
+	sigdelset(&mask, SIGSEGV);
+	sigdelset(&mask, SIGBUS);
+	sigdelset(&mask, SIGFPE);
 	sigprocmask(SIG_SETMASK, &mask, &omask);
 
-	for (i = 1; i < NSIG; i++) {
-		struct sigaction act;
-
-		sigfillset(&act.sa_mask);
-		sigdelset(&mask, SIGILL);
-		sigdelset(&mask, SIGSEGV);
-		sigdelset(&mask, SIGBUS);
-		act.sa_flags = 0;
-		act.sa_handler = interrupt_handler;
-		sigaction(i, &act, NULL);
-	}
-
 	if (lxc_setup_fs())
 		exit(err);
 
 	if (lxc_caps_reset())
 		exit(err);
 
-	pid = fork();
-
-	if (pid < 0)
-		exit(err);
-
-	if (!pid) {
+	assert(pidns_is_empty());
 
-		/* restore default signal handlers */
-		for (i = 1; i < NSIG; i++)
-			signal(i, SIG_DFL);
+	/* restart */
+	if (restartfd != -1) {
+		unsigned int flags;
+		int ret;
 
-		sigprocmask(SIG_SETMASK, &omask, NULL);
+		epollfd = epoll_setup();
+		if (epollfd < 0)
+			exit(err);
 
-		NOTICE("about to exec '%s'", aargv[0]);
+		flags = 0;
+		ret = restart(restartfd, flags);
+		if (ret != 0) {
+			ERROR("restart: %s", strerror(errno));
+			goto out;
+		}
 
-		execvp(aargv[0], aargv);
-		ERROR("failed to exec: '%s' : %m", aargv[0]);
-		exit(err);
-	}
+		state.exit_code = epoll_loop(epollfd);
 
-	/* let's process the signals now */
-	sigdelset(&omask, SIGALRM);
-	sigprocmask(SIG_SETMASK, &omask, NULL);
+		/* FIXME: we don't know which pid's status should be
+		 * lxc-init's exit code
+		 */
 
-	/* no need of other inherited fds but stderr */
-	close(fileno(stdin));
-	close(fileno(stdout));
+		if (state.nr_waited > 1) {
+			ERROR("multiple task restart not supported yet");
+			state.exit_code = EXIT_FAILURE;
+		} else if (state.nr_waited == 0) {
+			ERROR("no tasks restarted?");
+			state.exit_code = EXIT_FAILURE;
+		}
 
-	err = 0;
-	for (;;) {
-		int status;
-		int orphan = 0;
-		pid_t waited_pid;
+		goto out;
+	} else { /* initial startup e.g. lxc-execute */
+		pid = fork();
 
-		switch (was_interrupted) {
+		if (pid < 0)
+			exit(err);
 
-		case 0:
-			break;
+		if (!pid) {
+			/* restore default signal handlers */
+			for (i = 1; i < NSIG; i++)
+				signal(i, SIG_DFL);
 
-		case SIGTERM:
-			if (!shutdown) {
-				shutdown = 1;
-				kill(-1, SIGTERM);
-				alarm(1);
-			}
-			break;
+			sigprocmask(SIG_SETMASK, &omask, NULL);
 
-		case SIGALRM:
-			kill(-1, SIGKILL);
-			break;
+			NOTICE("about to exec '%s'", aargv[0]);
 
-		default:
-			kill(pid, was_interrupted);
-			break;
+			execvp(aargv[0], aargv);
+			ERROR("failed to exec: '%s' : %m", aargv[0]);
+			exit(err);
 		}
 
-		was_interrupted = 0;
-		waited_pid = wait(&status);
-		if (waited_pid < 0) {
-			if (errno == ECHILD)
-				goto out;
-			if (errno == EINTR)
-				continue;
+		epollfd = epoll_setup();
+		if (epollfd < 0)
+			exit(err);
 
-			ERROR("failed to wait child : %s",
-			      strerror(errno));
-			goto out;
-		}
+		state.child_pid = pid;
+	}
 
-		/* reset timer each time a process exited */
-		if (shutdown)
-			alarm(1);
+/* wait: */
+	/* no need of other inherited fds but stderr */
+	close(fileno(stdin));
+	close(fileno(stdout));
+
+	epoll_loop(epollfd);
 
-		/*
-		 * keep the exit code of started application
-		 * (not wrapped pid) and continue to wait for
-		 * the end of the orphan group.
-		 */
-		if ((waited_pid != pid) || (orphan ==1))
-			continue;
-		orphan = 1;
-		err = lxc_error_set_and_log(waited_pid, status);
-	}
 out:
-	return err;
+	return state.exit_code;
 }
diff --git a/src/lxc/lxc_restart.c b/src/lxc/lxc_restart.c
index 7548682..3687429 100644
--- a/src/lxc/lxc_restart.c
+++ b/src/lxc/lxc_restart.c
@@ -37,7 +37,7 @@
 #include "confile.h"
 #include "arguments.h"
 
-lxc_log_define(lxc_restart_ui, lxc_restart);
+lxc_log_define(lxc_restart, lxc);
 
 static struct lxc_list defines;
 
@@ -109,8 +109,9 @@ Options :\n\
 
 int main(int argc, char *argv[])
 {
+	char *envstr;
+	static char **args;
 	int sfd = -1;
-	int ret;
 	char *rcfile = NULL;
 	struct lxc_conf *conf;
 
@@ -126,6 +127,10 @@ int main(int argc, char *argv[])
 			 my_args.progname, my_args.quiet))
 		return -1;
 
+	args = lxc_arguments_dup(LXCINITDIR "/lxc-init", &my_args);
+	if (!args)
+		return -1;
+
 	/* rcfile is specified in the cli option */
 	if (my_args.rcfile)
 		rcfile = (char *)my_args.rcfile;
@@ -162,7 +167,7 @@ int main(int argc, char *argv[])
 	if (my_args.statefd != -1)
 		sfd = my_args.statefd;
 
-#define OPEN_READ_MODE O_RDONLY | O_CLOEXEC | O_LARGEFILE
+#define OPEN_READ_MODE (O_RDONLY | O_LARGEFILE)
 	if (my_args.statefile) {
 		sfd = open(my_args.statefile, OPEN_READ_MODE, 0);
 		if (sfd < 0) {
@@ -171,9 +176,15 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	ret = lxc_restart(my_args.name, sfd, conf, my_args.flags);
+	if (asprintf(&envstr, "LXC_RESTART_FD=%i", sfd) == -1) {
+		SYSERROR("asprintf: %s", strerror(errno));
+		return -1;
+	}
+
+	if (putenv(envstr) != 0) {
+		SYSERROR("putenv: %s", strerror(errno));
+		return -1;
+	}
 
-	if (my_args.statefile)
-		close(sfd);
-	return ret;
+	return lxc_start(my_args.name, args, conf);
 }
diff --git a/src/lxc/restart.c b/src/lxc/restart.c
deleted file mode 100644
index c947b81..0000000
--- a/src/lxc/restart.c
+++ /dev/null
@@ -1,77 +0,0 @@
-/*
- * lxc: linux Container library
- *
- * (C) Copyright IBM Corp. 2007, 2010
- *
- * Authors:
- * Daniel Lezcano <dlezcano at fr.ibm.com>
- *
- * This library is free software; you can redistribute it and/or
- * modify it under the terms of the GNU Lesser General Public
- * License as published by the Free Software Foundation; either
- * version 2.1 of the License, or (at your option) any later version.
- *
- * This library is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * Lesser General Public License for more details.
- *
- * You should have received a copy of the GNU Lesser General Public
- * License along with this library; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
-
-#include "../config.h"
-#include <stdio.h>
-#undef _GNU_SOURCE
-#include <string.h>
-#include <stdlib.h>
-#include <errno.h>
-#include <unistd.h>
-
-#include <lxc/log.h>
-#include <lxc/start.h>	/* for struct lxc_handler */
-#include <lxc/utils.h>
-#include <lxc/error.h>
-
-lxc_log_define(lxc_restart, lxc);
-
-struct restart_args {
-	int sfd;
-	int flags;
-};
-
-static int restart(struct lxc_handler *handler, void* data)
-{
-	struct restart_args *arg __attribute__ ((unused)) = data;
-
-	ERROR("'restart' function not implemented");
-	return -1;
-}
-
-static int post_restart(struct lxc_handler *handler, void* data)
-{
-	struct restart_args *arg __attribute__ ((unused)) = data;
-
-	NOTICE("'%s' container restarting with pid '%d'", handler->name,
-	       handler->pid);
-	return 0;
-}
-
-static struct lxc_operations restart_ops = {
-	.start = restart,
-	.post_start = post_restart
-};
-
-int lxc_restart(const char *name, int sfd, struct lxc_conf *conf, int flags)
-{
-	struct restart_args restart_arg = {
-		.sfd = sfd,
-		.flags = flags
-	};
-
-	if (lxc_check_inherited(sfd))
-		return -1;
-
-	return __lxc_start(name, conf, &restart_ops, &restart_arg);
-}
diff --git a/src/lxc/start.c b/src/lxc/start.c
index b963b85..8ff738f 100644
--- a/src/lxc/start.c
+++ b/src/lxc/start.c
@@ -629,8 +629,9 @@ int lxc_start(const char *name, char *const argv[], struct lxc_conf *conf)
 		.argv = argv,
 	};
 
+	/* At restart we allow lxc-init to inherit the fd for the image */
 	if (lxc_check_inherited(-1))
-		return -1;
+		ERROR("lxc_check_inherited failed; proceeding anyway");
 
 	return __lxc_start(name, conf, &start_ops, &start_arg);
 }
diff --git a/templates/Makefile.am b/templates/Makefile.am
index d55f53a..31de984 100644
--- a/templates/Makefile.am
+++ b/templates/Makefile.am
@@ -1,4 +1,4 @@
-templatesdir=@LXCTEMPLATEDIR@
+templatesdir=$(pkglibdir)
 
 templates_SCRIPTS = \
 	lxc-debian \

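The lxc_restart.c hunk above hands the checkpoint image to the container's
init purely through the environment: it exports LXC_RESTART_FD and then goes
through the normal lxc_start() path with lxc-init as the command, relying on
the relaxed lxc_check_inherited() in start.c so the fd survives the exec.  A
minimal sketch of the consuming side, assuming lxc-init only has to parse the
variable to choose between the restart branch and the ordinary fork/exec
branch (the helper name below is made up for illustration):

#include <stdlib.h>

/* Sketch only: how lxc-init might detect a restart request from the
 * LXC_RESTART_FD variable exported by lxc-restart above.  The helper
 * name is hypothetical; the real branch lives in lxc-init's main().
 */
static int get_restart_fd(void)
{
	const char *env = getenv("LXC_RESTART_FD");
	char *end;
	long fd;

	if (!env)
		return -1;	/* normal startup, e.g. via lxc-execute */

	fd = strtol(env, &end, 10);
	if (*end != '\0' || fd < 0)
		return -1;	/* malformed value; treat as normal startup */

	return (int)fd;		/* inherited fd holding the checkpoint image */
}

With a valid fd, lxc-init would take the restart branch shown at the top of
the patch (driving epoll_loop() and reporting the restarted task's status)
instead of forking and exec'ing the requested command.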


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/10] Introduce mm_has_pending_aio() helper
  2011-02-28 23:40 ` [PATCH 02/10] Introduce mm_has_pending_aio() helper ntl
@ 2011-03-01 15:40   ` Jeff Moyer
  2011-03-01 16:04     ` Nathan Lynch
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff Moyer @ 2011-03-01 15:40 UTC (permalink / raw)
  To: ntl; +Cc: linux-kernel, containers, Oren Laadan, linux-aio


[added linux-aio@kvack.org to the cc list]

ntl@pobox.com writes:

> From: Nathan Lynch <ntl@pobox.com>
>
> Support for AIO is on the to-do list, but until that is implemented,
> checkpoint will have to fail if a mm_struct has outstanding AIO
> contexts.  Add a mm_has_pending_aio() helper function for this
> purpose.

Just because a process has an io context, doesn't mean that the process
has active outstanding requests.  So, is this really what you wanted to
test?

Cheers,
Jeff

> Based on original "check_for_outstanding_aio" patch by Serge Hallyn.
>
> Signed-off-by: Serge E. Hallyn <serge@hallyn.com>
> [ntl: changed name and return type to clearly express semantics]
> [ntl: added kerneldoc]
> Signed-off-by: Nathan Lynch <ntl@pobox.com>
> ---
>  fs/aio.c            |   27 +++++++++++++++++++++++++++
>  include/linux/aio.h |    2 ++
>  2 files changed, 29 insertions(+), 0 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 8c8f6c5..1acbc99 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1847,3 +1847,30 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>  	asmlinkage_protect(5, ret, ctx_id, min_nr, nr, events, timeout);
>  	return ret;
>  }
> +
> +/**
> + * mm_has_pending_aio() - check for outstanding AIO operations
> + * @mm:		The mm_struct to check.
> + *
> + * Returns true if there is at least one non-dead kioctx on
> + * @mm->ioctx_list.  Note that the result of this function is
> + * unreliable unless the caller has ensured that new requests cannot
> + * be submitted against @mm (e.g. through freezing the associated
> + * tasks).
> + */
> +bool mm_has_pending_aio(struct mm_struct *mm)
> +{
> +	struct kioctx *ctx;
> +	struct hlist_node *n;
> +	bool has_aio = false;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
> +		if (!ctx->dead) {
> +			has_aio = true;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +	return has_aio;
> +}
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index 7a8db41..39d9936 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -214,6 +214,7 @@ struct mm_struct;
>  extern void exit_aio(struct mm_struct *mm);
>  extern long do_io_submit(aio_context_t ctx_id, long nr,
>  			 struct iocb __user *__user *iocbpp, bool compat);
> +extern bool mm_has_pending_aio(struct mm_struct *mm);
>  #else
>  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>  static inline int aio_put_req(struct kiocb *iocb) { return 0; }
> @@ -224,6 +225,7 @@ static inline void exit_aio(struct mm_struct *mm) { }
>  static inline long do_io_submit(aio_context_t ctx_id, long nr,
>  				struct iocb __user * __user *iocbpp,
>  				bool compat) { return 0; }
> +static inline bool mm_has_pending_aio(struct mm_struct *mm) { return false; }
>  #endif /* CONFIG_AIO */
>  
>  static inline struct kiocb *list_kiocb(struct list_head *h)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/10] Introduce mm_has_pending_aio() helper
  2011-03-01 15:40   ` Jeff Moyer
@ 2011-03-01 16:04     ` Nathan Lynch
  0 siblings, 0 replies; 41+ messages in thread
From: Nathan Lynch @ 2011-03-01 16:04 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, containers, Oren Laadan, linux-aio

On Tue, 2011-03-01 at 10:40 -0500, Jeff Moyer wrote:
> 
> ntl@pobox.com writes:
> 
> > From: Nathan Lynch <ntl@pobox.com>
> >
> > Support for AIO is on the to-do list, but until that is implemented,
> > checkpoint will have to fail if a mm_struct has outstanding AIO
> > contexts.  Add a mm_has_pending_aio() helper function for this
> > purpose.
> 
> Just because a process has an io context, doesn't mean that the process
> has active outstanding requests.  So, is this really what you wanted to
> test?

As a temporary measure, yeah.  We haven't settled on code to
record/restore the io context objects themselves, so we do want to bail
if we encounter any.  I realize now the name of the function doesn't
actually express this well.  Will try to come up with something better
for the next round.

Thanks!



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/10] Make exec_mmap extern
  2011-02-28 23:40 ` [PATCH 01/10] Make exec_mmap extern ntl
@ 2011-04-03 16:56   ` Serge E. Hallyn
  0 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-03 16:56 UTC (permalink / raw)
  To: ntl; +Cc: linux-kernel, containers

Quoting ntl@pobox.com (ntl@pobox.com):
> From: Nathan Lynch <ntl@pobox.com>
> 
> Restoration of process state from a checkpoint image is similar to
> exec in that the calling task's mm is replaced.  Make exec_mmap
> available for this purpose.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> [ntl: extracted from Oren's "c/r: dump memory address space (private memory)"]
> Signed-off-by: Nathan Lynch <ntl@pobox.com>

Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>

> ---
>  fs/exec.c          |    2 +-
>  include/linux/mm.h |    3 +++
>  2 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c62efcb..9d8c27a 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -767,7 +767,7 @@ int kernel_read(struct file *file, loff_t offset,
>  
>  EXPORT_SYMBOL(kernel_read);
>  
> -static int exec_mmap(struct mm_struct *mm)
> +int exec_mmap(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk;
>  	struct mm_struct * old_mm, *active_mm;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 721f451..5397237 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1321,6 +1321,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
>  
>  extern unsigned long do_brk(unsigned long, unsigned long);
>  
> +/* fs/exec.c */
> +extern int exec_mmap(struct mm_struct *mm);
> +
>  /* filemap.c */
>  extern unsigned long page_unuse(struct page *);
>  extern void truncate_inode_pages(struct address_space *, loff_t);
> -- 
> 1.7.4
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
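
To make the intended use a bit more concrete, here is a rough sketch of the
restart side, assuming the address space is rebuilt from scratch and then
installed the same way execve() does it; the restore_mm() wrapper and its
error handling are invented for illustration, only exec_mmap() itself comes
from the patch:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical restart-side caller of exec_mmap(). */
static int restore_mm(void)
{
	struct mm_struct *mm;
	int err;

	mm = mm_alloc();	/* fresh, empty address space */
	if (!mm)
		return -ENOMEM;

	err = exec_mmap(mm);	/* retire the old mm, as execve() does */
	if (err) {
		mmput(mm);	/* never installed; drop our reference */
		return err;
	}

	/*
	 * The task now runs on the new mm: recreate the VMAs (e.g. with
	 * do_mmap()/do_brk()) and fill in page contents from the image.
	 */
	return 0;
}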

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/10] Introduce has_locks_with_owner() helper
  2011-02-28 23:40 ` [PATCH 03/10] Introduce has_locks_with_owner() helper ntl
@ 2011-04-03 18:55   ` Serge E. Hallyn
  0 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-03 18:55 UTC (permalink / raw)
  To: ntl; +Cc: linux-kernel, containers

Quoting ntl@pobox.com (ntl@pobox.com):
> From: Nathan Lynch <ntl@pobox.com>
> 
> Support for file locks is in the works, but until that is done
> checkpoint needs to fail when an open file has locks.
> 
> Based on original "find_locks_with_owner" patch by Dave Hansen.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> [ntl: changed name and return type to clearly express semantics]
> Signed-off-by: Nathan Lynch <ntl@pobox.com>

Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>

> ---
>  fs/locks.c         |   35 +++++++++++++++++++++++++++++++++++
>  include/linux/fs.h |    6 ++++++
>  2 files changed, 41 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 8729347..961e17f 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2037,6 +2037,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
>  
>  EXPORT_SYMBOL(locks_remove_posix);
>  
> +bool has_locks_with_owner(struct file *filp, fl_owner_t owner)
> +{
> +	struct inode *inode = filp->f_path.dentry->d_inode;
> +	struct file_lock **inode_fl;
> +	bool ret = false;
> +
> +	lock_flocks();
> +	for_each_lock(inode, inode_fl) {
> +		struct file_lock *fl = *inode_fl;
> +		/*
> +		 * We could use posix_same_owner() along with a 'fake'
> +		 * file_lock.  But, the fake file will never have the
> +		 * same fl_lmops as the fl that we are looking for and
> +		 * posix_same_owner() would just fall back to this
> +		 * check anyway.
> +		 */
> +		if (IS_POSIX(fl)) {
> +			if (fl->fl_owner == owner) {
> +				ret = true;
> +				break;
> +			}
> +		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
> +			if (fl->fl_file == filp) {
> +				ret = true;
> +				break;
> +			}
> +		} else {
> +			WARN(1, "unknown file lock type, fl_flags: %x",
> +				fl->fl_flags);
> +		}
> +	}
> +	unlock_flocks();
> +	return ret;
> +}
> +
>  /*
>   * This function is called on the last close of an open file.
>   */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 090f0ea..315ded4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1138,6 +1138,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
>  extern void locks_remove_flock(struct file *);
>  extern void locks_release_private(struct file_lock *);
>  extern void posix_test_lock(struct file *, struct file_lock *);
> +extern bool has_locks_with_owner(struct file *filp, fl_owner_t owner);
>  extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
>  extern int posix_lock_file_wait(struct file *, struct file_lock *);
>  extern int posix_unblock_lock(struct file *, struct file_lock *);
> @@ -1208,6 +1209,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
>  	return;
>  }
>  
> +static inline bool has_locks_with_owner(struct file *filp, fl_owner_t owner)
> +{
> +	return false;
> +}
> +
>  static inline void locks_remove_flock(struct file *filp)
>  {
>  	return;
> -- 
> 1.7.4
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
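
The checkpoint-side caller is not part of this patch, but the intended shape
is presumably something like the sketch below, where t is a frozen task whose
file table is being dumped; the wrapper, the use of t->files as the POSIX
lock owner, and the -EBUSY return value are all assumptions for illustration:

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/sched.h>

/* Hypothetical checkpoint-side check built on has_locks_with_owner(). */
static int may_checkpoint_file(struct task_struct *t, struct file *file)
{
	/* POSIX lock owners are files_struct pointers, so pass t->files */
	if (has_locks_with_owner(file, t->files))
		return -EBUSY;	/* file locks are not supported yet */
	return 0;
}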

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 04/10] Introduce vfs_fcntl() helper
  2011-02-28 23:40 ` [PATCH 04/10] Introduce vfs_fcntl() helper ntl
@ 2011-04-03 18:57   ` Serge E. Hallyn
  0 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-03 18:57 UTC (permalink / raw)
  To: ntl; +Cc: linux-kernel, containers

Quoting ntl@pobox.com (ntl@pobox.com):
> From: Nathan Lynch <ntl@pobox.com>
> 
> When restoring process state from a checkpoint image, it will be
> necessary to restore file status flags; add vfs_fcntl() for this
> purpose.
> 
> Based on original code by Oren Laadan.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> [ntl: extracted from "c/r: checkpoint and restart open file descriptors"]
> Signed-off-by: Nathan Lynch <ntl@pobox.com>

Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>

> ---
>  fs/fcntl.c         |   21 +++++++++++++--------
>  include/linux/fs.h |    2 ++
>  2 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index ecc8b39..8e797b7 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -426,6 +426,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	return err;
>  }
>  
> +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
> +{
> +	int err;
> +
> +	err = security_file_fcntl(filp, cmd, arg);
> +	if (err)
> +		goto out;
> +	err = do_fcntl(fd, cmd, arg, filp);
> + out:
> +	return err;
> +}
> +
>  SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
>  {	
>  	struct file *filp;
> @@ -435,14 +447,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
>  	if (!filp)
>  		goto out;
>  
> -	err = security_file_fcntl(filp, cmd, arg);
> -	if (err) {
> -		fput(filp);
> -		return err;
> -	}
> -
> -	err = do_fcntl(fd, cmd, arg, filp);
> -
> +	err = vfs_fcntl(fd, cmd, arg, filp);
>   	fput(filp);
>  out:
>  	return err;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 315ded4..175bb75 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1112,6 +1112,8 @@ struct file_lock {
>  
>  #include <linux/fcntl.h>
>  
> +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
> +
>  extern void send_sigio(struct fown_struct *fown, int fd, int band);
>  
>  #ifdef CONFIG_FILE_LOCKING
> -- 
> 1.7.4
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
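
On the restore side, the helper would presumably be used along the lines of
the sketch below, once the file has been re-opened and installed at its
original descriptor; the wrapper name and the source of saved_flags (the
checkpoint image) are assumptions:

#include <linux/fcntl.h>
#include <linux/fs.h>

/* Hypothetical restore-side caller: re-apply saved status flags. */
static int restore_file_flags(int fd, struct file *filp,
			      unsigned int saved_flags)
{
	/* F_SETFL only honours status flags such as O_APPEND/O_NONBLOCK */
	return vfs_fcntl(fd, F_SETFL, saved_flags, filp);
}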

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-02-28 23:40 ` [PATCH 05/10] Core checkpoint/restart support code ntl
@ 2011-04-03 19:03   ` Serge E. Hallyn
  2011-04-04 15:00     ` Nathan Lynch
  0 siblings, 1 reply; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-03 19:03 UTC (permalink / raw)
  To: ntl; +Cc: linux-kernel, containers, Oren Laadan, Alexey Dobriyan

Quoting ntl@pobox.com (ntl@pobox.com):
> Only a pid namespace init task - the child process produced by a call
> to clone(2) with CLONE_NEWPID - is allowed to call these.  The state

So you make this useful for your cases by only using this with
application containers - created using lxc-execute, or, more precisely,
using lxc-init as the container's init.  So a container running a stock
distro can't be checkpointed.

Is this just to keep the patch simple for now, or is there some reason
to keep this limitation in place?

-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-03 19:03   ` Serge E. Hallyn
@ 2011-04-04 15:00     ` Nathan Lynch
  2011-04-04 15:10       ` Serge E. Hallyn
  0 siblings, 1 reply; 41+ messages in thread
From: Nathan Lynch @ 2011-04-04 15:00 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: linux-kernel, containers, Oren Laadan, Alexey Dobriyan

On Sun, 2011-04-03 at 14:03 -0500, Serge E. Hallyn wrote:
> Quoting ntl@pobox.com (ntl@pobox.com):
> > Only a pid namespace init task - the child process produced by a call
> > to clone(2) with CLONE_NEWPID - is allowed to call these.  The state
> 
> So you make this useful for your cases by only using this with
> application containers - created using lxc-execute, or, more precisely,
> using lxc-init as the container's init.  So a container running a stock
> distro can't be checkpointed.

Correct, a conventional distro init won't work, and application
containers are my focus for now, at least.


> Is this just to keep the patch simple for now, or is there some reason
> to keep this limitation in place?

I guess you're asking whether non-pid-init processes could be allowed to
use the syscalls?  I don't think so... almost certainly not restart(2).

I think that restriction keeps the implementation simple and the
semantics clear.  And init is uniquely positioned to carry out any setup
required (mounts, networking) before calling restart.
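
For readers trying to picture the calling convention being described, a
userspace sketch follows.  The restart() wrapper stands in for the new system
call (its exact name and prototype are not spelled out here, so treat them as
assumptions), and the setup step is whatever mount/network preparation the
checkpointed job expects:

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed wrapper around the new restart system call; the real
 * prototype may differ.
 */
extern int restart(int image_fd, int flags);

static int container_init(void *arg)
{
	int image_fd = *(int *)arg;

	/* pid 1 of the new namespace: recreate mounts, network devices,
	 * etc. to match the checkpointed job, then hand over to restart.
	 */
	if (restart(image_fd, 0) < 0)
		return 1;

	/* the restarted tasks are now this init's children; reap them */
	for (;;) {
		if (wait(NULL) < 0 && errno == ECHILD)
			break;
	}
	return 0;
}

int run_restarted_job(int image_fd)	/* caller name is made up */
{
	const size_t stack_sz = 64 * 1024;
	char *stack = malloc(stack_sz);
	int status = -1;
	pid_t pid;

	if (!stack)
		return -1;

	/* the CLONE_NEWPID child becomes init of the new pid namespace */
	pid = clone(container_init, stack + stack_sz,
		    CLONE_NEWPID | SIGCHLD, &image_fd);
	if (pid > 0)
		waitpid(pid, &status, 0);
	free(stack);
	return status;
}

In the lxc patches posted with this series, lxc-init effectively plays the
role of container_init above, with the image fd handed over via
LXC_RESTART_FD.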



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 15:00     ` Nathan Lynch
@ 2011-04-04 15:10       ` Serge E. Hallyn
  2011-04-04 15:40         ` Nathan Lynch
  0 siblings, 1 reply; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 15:10 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linux-kernel, containers, Oren Laadan, Alexey Dobriyan

Quoting Nathan Lynch (ntl@pobox.com):
> On Sun, 2011-04-03 at 14:03 -0500, Serge E. Hallyn wrote:
> > Quoting ntl@pobox.com (ntl@pobox.com):
> > > Only a pid namespace init task - the child process produced by a call
> > > to clone(2) with CLONE_NEWPID - is allowed to call these.  The state
> > 
> > So you make this useful for your cases by only using this with
> > application containers - created using lxc-execute, or, more precisely,
> > using lxc-init as the container's init.  So a container running a stock
> > distro can't be checkpointed.
> 
> Correct, a conventional distro init won't work, and application
> containers are my focus for now, at least.
> 
> 
> > Is this just to keep the patch simple for now, or is there some reason
> > to keep this limitation in place?
> 
> I guess you're asking whether non-pid-init processes could be allowed to
> use the syscalls?

No.  I'm asking whether you are intending to later on change the checkpoint
API to allow an external task to checkpoint a pid-init process, rather than
the pid-init process having to initiate it itself.


-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 15:10       ` Serge E. Hallyn
@ 2011-04-04 15:40         ` Nathan Lynch
  2011-04-04 16:27           ` Serge E. Hallyn
  0 siblings, 1 reply; 41+ messages in thread
From: Nathan Lynch @ 2011-04-04 15:40 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: linux-kernel, containers, Oren Laadan, Alexey Dobriyan

On Mon, 2011-04-04 at 10:10 -0500, Serge E. Hallyn wrote:
> Quoting Nathan Lynch (ntl@pobox.com):
> > On Sun, 2011-04-03 at 14:03 -0500, Serge E. Hallyn wrote:
> > > Quoting ntl@pobox.com (ntl@pobox.com):
> > > > Only a pid namespace init task - the child process produced by a call
> > > > to clone(2) with CLONE_NEWPID - is allowed to call these.  The state
> > > 
> > > So you make this useful for your cases by only using this with
> > > application containers - created using lxc-execute, or, more precisely,
> > > using lxc-init as the container's init.  So a container running a stock
> > > distro can't be checkpointed.
> > 
> > Correct, a conventional distro init won't work, and application
> > containers are my focus for now, at least.
> > 
> > 
> > > Is this just to keep the patch simple for now, or is there some reason
> > > to keep this limitation in place?
> > 
> > I guess you're asking whether non-pid-init processes could be allowed to
> > use the syscalls?
> 
> No.  I'm asking whether you are intending to later on change the checkpoint
> API to allow an external task to checkpoint a pid-init process, rather than
> the pid-init process having to initiate it itself.

No, that is not the intention.  I can see how that would be problematic
for those wanting to run minimally-modified distro containers, but I
think running a patched pid-init is a reasonable tradeoff to ask users
to make in order to get c/r.  And there's nothing to keep the standard
distro inits from growing c/r capability.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 15:40         ` Nathan Lynch
@ 2011-04-04 16:27           ` Serge E. Hallyn
  2011-04-04 17:32             ` Oren Laadan
                               ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 16:27 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: linux-kernel, containers, Oren Laadan, Andrew Morton, Alexey Dobriyan

Quoting Nathan Lynch (ntl@pobox.com):
> On Mon, 2011-04-04 at 10:10 -0500, Serge E. Hallyn wrote:
> > Quoting Nathan Lynch (ntl@pobox.com):
> > > On Sun, 2011-04-03 at 14:03 -0500, Serge E. Hallyn wrote:
> > > > Quoting ntl@pobox.com (ntl@pobox.com):
> > > > > Only a pid namespace init task - the child process produced by a call
> > > > > to clone(2) with CLONE_NEWPID - is allowed to call these.  The state
> > > > 
> > > > So you make this useful for your cases by only using this with
> > > > application containers - created using lxc-execute, or, more precisely,
> > > > using lxc-init as the container's init.  So a container running a stock
> > > > distro can't be checkpointed.
> > > 
> > > Correct, a conventional distro init won't work, and application
> > > containers are my focus for now, at least.
> > > 
> > > 
> > > > Is this just to keep the patch simple for now, or is there some reason
> > > > to keep this limitation in place?
> > > 
> > > I guess you're asking whether non-pid-init processes could be allowed to
> > > use the syscalls?
> > 
> > No.  I'm asking whether you are intending to later on change the checkpoint
> > API to allow an external task to checkpoint a pid-init process, rather than
> > the pid-init process having to initiate it itself.
> 
> No, that is not the intention.  I can see how that would be problematic
> for those wanting to run minimally-modified distro containers, but I
> think running a patched pid-init is a reasonable tradeoff to ask users
> to make in order to get c/r.  And there's nothing to keep the standard
> distro inits from growing c/r capability.

It's not necessarily a dealbreaker, since presumably I can hack the
needed support into upstart, triggered by a boot option so it isn't
activated on a host.  But especially given the lack of interest in
this thread so far, I don't see a point in pushing this, an API-incompatible
less-capable version of the linux-cr tree.  If it can gain traction
better than linux-cr, that'd be one thing.  But given the amount of
review and testing the other tree has gotten - and I realize you're
able to piggy-back on much of that - and, again, the lack of responses
so far, I just don't see this as worth pushing for.

I'd really prefer that everyone was using the same tree, and sending
any and all patches which they need, no matter how ugly they fear
they are, upstream.  To that end, I think it would be appropriate
for you or Dan to get write access to Oren's tree or to move to a
newly cloned copy of his tree to which one of you has acces.

Andrew (Cc:d), did you see this thread go by, and did it look
in any way more palatable to you?  Have you had any thoughts on
checkpoint/restart in the last few months?  Or did that horse quietly
die over winter?

thanks,
-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 16:27           ` Serge E. Hallyn
@ 2011-04-04 17:32             ` Oren Laadan
  2011-04-04 21:43               ` Nathan Lynch
  2011-04-04 17:41             ` Andrew Morton
  2011-04-04 21:20             ` Nathan Lynch
  2 siblings, 1 reply; 41+ messages in thread
From: Oren Laadan @ 2011-04-04 17:32 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Nathan Lynch, linux-kernel, containers, Andrew Morton, Alexey Dobriyan



On 04/04/2011 12:27 PM, Serge E. Hallyn wrote:
> Quoting Nathan Lynch (ntl@pobox.com):
>> On Mon, 2011-04-04 at 10:10 -0500, Serge E. Hallyn wrote:
>>> Quoting Nathan Lynch (ntl@pobox.com):
>>>> On Sun, 2011-04-03 at 14:03 -0500, Serge E. Hallyn wrote:
>>>>> Quoting ntl@pobox.com (ntl@pobox.com):
>>>>>> Only a pid namespace init task - the child process produced by a call
>>>>>> to clone(2) with CLONE_NEWPID - is allowed to call these.  The state
>>>>>
>>>>> So you make this useful for your cases by only using this with
>>>>> application containers - created using lxc-execute, or, more precisely,
>>>>> using lxc-init as the container's init.  So a container running a stock
>>>>> distro can't be checkpointed.
>>>>
>>>> Correct, a conventional distro init won't work, and application
>>>> containers are my focus for now, at least.
>>>>
>>>>
>>>>> Is this just to keep the patch simple for now, or is there some reason
>>>>> to keep this limitation in place?
>>>>
>>>> I guess you're asking whether non-pid-init processes could be allowed to
>>>> use the syscalls?
>>>
>>> No.  I'm asking whether you are intending to later on change the checkpoint
>>> API to allow an external task to checkpoint a pid-init process, rather than
>>> the pid-init process having to initiate it itself.
>>
>> No, that is not the intention.  I can see how that would be problematic
>> for those wanting to run minimally-modified distro containers, but I
>> think running a patched pid-init is a reasonable tradeoff to ask users
>> to make in order to get c/r.  And there's nothing to keep the standard
>> distro inits from growing c/r capability.
> 
> It's not necessarily a dealbreaker, since presumably I can hack the
> needed support into upstart, triggered by a boot option so it isn't
> activated on a host.  But especially given the lack of interest in
> this thread so far, I don't see a point in pushing this, an API-incompatible
> less-capable version of the linux-cr tree.  If it can gain traction
> better than linux-cr, that'd be one thing.  But given the amount of
> review and testing the other tree has gotten - and I realize you're
> able to piggy-back on much of that - and, again, the lack of responses
> so far, I just don't see this as worth pushing for.

First, thanks to Nathan for cleaning up and re-producing a "minimal"
patchset for review.

From the technical point of view it *is* a big problem:  there are
very good reasons why we chose a certain design. 

If Nathan is suggesting in-kernel tree creation as a temporary thing
to simplify the code for review - then, given that this patch handles
a single process, doing so adds lots of unnecessary code, all of it
in the kernel.

If this is the beginning of a permanent approach, then it is totally
incompatible with what we have done so far, and severely restricts 
the kind of use cases of the project, potentially making it too
unattractive for many natural adopters, like HPC users. Sorry, nack.

Thanks,

Oren.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 16:27           ` Serge E. Hallyn
  2011-04-04 17:32             ` Oren Laadan
@ 2011-04-04 17:41             ` Andrew Morton
  2011-04-04 18:51               ` Serge E. Hallyn
  2011-04-04 21:20             ` Nathan Lynch
  2 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2011-04-04 17:41 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Nathan Lynch, linux-kernel, containers, Oren Laadan, Alexey Dobriyan

On Mon, 4 Apr 2011 11:27:53 -0500 "Serge E. Hallyn" <serge@hallyn.com> wrote:

> Andrew (Cc:d), did you see this thread go by, and it did it look
> in any way more palatable to you?  Have you had any thoughts on
> checkpoint/restart in the last few months?  Or did that horse quietly
> die over winter?

argh, it was the victim of LIFO.

All I can say at this stage is that I'll be interested next time it
comes past, sorry.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 17:41             ` Andrew Morton
@ 2011-04-04 18:51               ` Serge E. Hallyn
  2011-04-04 19:42                 ` Andrew Morton
  0 siblings, 1 reply; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 18:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Serge E. Hallyn, containers, Nathan Lynch, linux-kernel, Alexey Dobriyan

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

Quoting Andrew Morton (akpm@linux-foundation.org):
> On Mon, 4 Apr 2011 11:27:53 -0500 "Serge E. Hallyn" <serge@hallyn.com> wrote:
> 
> > Andrew (Cc:d), did you see this thread go by, and it did it look
> > in any way more palatable to you?  Have you had any thoughts on
> > checkpoint/restart in the last few months?  Or did that horse quietly
> > die over winter?
> 
> argh, it was the victim of LIFO.
> 
> All I can say at this stage is that I'll be interested next time it
> comes past, sorry.

Thanks, that's good to know.

As you know, we started with a minimal patchset, then grew it over time
to answer the "but how will you (xyz) without uglifying the kernel".
Would you recommend we go back to keeping a separate minimal patchset,
or that we develop on the current, pretty feature-full version?  I'm not
convinced there will be bandwidth to keep two trees and do both
justice.

thanks,
-serge

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 18:51               ` Serge E. Hallyn
@ 2011-04-04 19:42                 ` Andrew Morton
  2011-04-04 20:29                   ` Serge E. Hallyn
                                     ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Andrew Morton @ 2011-04-04 19:42 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge E. Hallyn, containers, Nathan Lynch, linux-kernel, Alexey Dobriyan

On Mon, 4 Apr 2011 13:51:20 -0500 "Serge E. Hallyn" <serge.hallyn@ubuntu.com> wrote:

> Quoting Andrew Morton (akpm@linux-foundation.org):
> > On Mon, 4 Apr 2011 11:27:53 -0500 "Serge E. Hallyn" <serge@hallyn.com> wrote:
> > 
> > > Andrew (Cc:d), did you see this thread go by, and it did it look
> > > in any way more palatable to you?  Have you had any thoughts on
> > > checkpoint/restart in the last few months?  Or did that horse quietly
> > > die over winter?
> > 
> > argh, it was the victim of LIFO.
> > 
> > All I can say at this stage is that I'll be interested next time it
> > comes past, sorry.
> 
> Thanks, that's good to know.
> 
> As you know, we started with a minimal patchset, then grew it over time
> to answer the "but how will you (xyz) without uglifying the kernel".
> Would you recommend we go back to keeping a separate minimal patchset,
> or that we develop on the current, pretty feature-full version?  I'm not
> convinced believe there will be bandwidth to keep two trees and do both
> justice.

The minimal patchset is too minimal for Oren's use and the maximal
patchset seems to have run aground on general kernel sentiment.  So I
guess you either take the minimal patchset and make it less minimal or
take the maximal patchset and make it less maximal, ending up with the
same thing.  How's that for hand-waving useless obviousnesses :)

One obvious approach is to merge the minimal patchset then, over time,
sneak more stuff into it so we end up with the maximal patchset which
people didn't like.  Don't do that :)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 19:42                 ` Andrew Morton
@ 2011-04-04 20:29                   ` Serge E. Hallyn
  2011-04-04 21:55                   ` Matt Helsley
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 20:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Serge E. Hallyn, Serge E. Hallyn, containers, Nathan Lynch,
	linux-kernel, Alexey Dobriyan

Quoting Andrew Morton (akpm@linux-foundation.org):
> On Mon, 4 Apr 2011 13:51:20 -0500 "Serge E. Hallyn" <serge.hallyn@ubuntu.com> wrote:
> 
> > Quoting Andrew Morton (akpm@linux-foundation.org):
> > > On Mon, 4 Apr 2011 11:27:53 -0500 "Serge E. Hallyn" <serge@hallyn.com> wrote:
> > > 
> > > > Andrew (Cc:d), did you see this thread go by, and it did it look
> > > > in any way more palatable to you?  Have you had any thoughts on
> > > > checkpoint/restart in the last few months?  Or did that horse quietly
> > > > die over winter?
> > > 
> > > argh, it was the victim of LIFO.
> > > 
> > > All I can say at this stage is that I'll be interested next time it
> > > comes past, sorry.
> > 
> > Thanks, that's good to know.
> > 
> > As you know, we started with a minimal patchset, then grew it over time
> > to answer the "but how will you (xyz) without uglifying the kernel".
> > Would you recommend we go back to keeping a separate minimal patchset,
> > or that we develop on the current, pretty feature-full version?  I'm not
> > convinced believe there will be bandwidth to keep two trees and do both
> > justice.
> 
> The minimal patchset is too minimal for Oren's use and the maximal
> patchset seems to have run aground on general kernel sentiment.  So I
> guess you either take the minimal patchset and make it less minimal or
> take the maximal patchset and make it less maximal, ending up with the
> same thing.  How's that for hand-waving useless obviousnesses :)

Perfect, thanks :)

> One obvious approach is to merge the minimal patchset then, over time,
> sneak more stuff into it so we end up with the maximal patchset which
> people didn't like.  Don't do that :)

Hoping that "which people didn't like" is purely conjecture.

Ok, I'll advocate for proceeding with the full patch-set as long as we
can.  Thanks, Andrew.

-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 16:27           ` Serge E. Hallyn
  2011-04-04 17:32             ` Oren Laadan
  2011-04-04 17:41             ` Andrew Morton
@ 2011-04-04 21:20             ` Nathan Lynch
  2011-04-04 21:53               ` Serge E. Hallyn
  2 siblings, 1 reply; 41+ messages in thread
From: Nathan Lynch @ 2011-04-04 21:20 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-kernel, containers, Oren Laadan, Andrew Morton, Alexey Dobriyan

On Mon, 2011-04-04 at 11:27 -0500, Serge E. Hallyn wrote:
> Quoting Nathan Lynch (ntl@pobox.com):
> > On Mon, 2011-04-04 at 10:10 -0500, Serge E. Hallyn wrote:
> > > I'm asking whether you are intending to later on change the checkpoint
> > > API to allow an external task to checkpoint a pid-init process, rather than
> > > the pid-init process having to initiate it itself.
> > 
> > No, that is not the intention.  I can see how that would be problematic
> > for those wanting to run minimally-modified distro containers, but I
> > think running a patched pid-init is a reasonable tradeoff to ask users
> > to make in order to get c/r.  And there's nothing to keep the standard
> > distro inits from growing c/r capability.
> 
> It's not necessarily a dealbreaker, since presumably I can hack the
> needed support into upstart, triggered by a boot option so it isn't
> activated on a host.  But especially given the lack of interest in
> this thread so far, I don't see a point in pushing this, an API-incompatible
> less-capable version of the linux-cr tree.

The apparent lack of interest was discouraging, but I appreciate that
you've been looking it over.


>   If it can gain traction
> better than linux-cr, that'd be one thing.  But given the amount of
> review and testing the other tree has gotten

How much traction do you think linux-cr has?  It doesn't seem any closer
to mainline than it was a year ago, and it barely has any users.  I
don't think posting this little proof-of-concept patch set is disrupting
linux-cr's progress toward mainline.


>  - and I realize you're
> able to piggy-back on much of that - and, again, the lack of responses
> so far, I just don't see this as worth pushing for.

Sure, the lack of response sucks, but it's not unexpected, and the code
here is pretty rough (especially the stuff I wrote).  What I hoped to
highlight and discuss were the differences in system call interfaces and
goals, and to gauge interest from the larger community.  Certainly what
I posted here isn't anywhere close to merge quality and I didn't intend
it to be taken that way.  I don't think it's hurting anything to explore
an alternative approach with more modest goals (and, one hopes, less of
a maintenance footprint on the rest of the kernel).


> I'd really prefer that everyone was using the same tree, and sending
> any and all patches which they need, no matter how ugly they fear
> they are, upstream.  To that end, I think it would be appropriate
> for you or Dan to get write access to Oren's tree or to move to a
> newly cloned copy of his tree to which one of you has acces.

Oren and I disagree on some fundamental aspects of how kernel c/r should
be implemented (hence this patch set), so I'm not sure how this would
work.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 17:32             ` Oren Laadan
@ 2011-04-04 21:43               ` Nathan Lynch
  2011-04-04 22:03                 ` Serge E. Hallyn
  2011-04-04 22:29                 ` Matt Helsley
  0 siblings, 2 replies; 41+ messages in thread
From: Nathan Lynch @ 2011-04-04 21:43 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Serge E. Hallyn, linux-kernel, containers, Andrew Morton,
	Alexey Dobriyan

On Mon, 2011-04-04 at 13:32 -0400, Oren Laadan wrote:
> From the technical point of view it *is* a big problem:  there are
> very good reasons why we chose a certain design. 
> 
> If Natahan is suggesting in-kernel tree creation as a temporary thing
> to simplify the code for review - then, given that this patch handles
> a single process, doing so add lots of unnecessary code, all of which
> in the kernel.
> 
> If this is the beginning of a permanent approach, then it is totally
> incompatible with what we have done so far, and severely restricts 
> the kind of use--cases of the project, potentially making it too
> unattractive for many natural adaptors, like HPC users. Sorry, nack.

It's not a stopgap measure to "ease review" or whatever; recreating the
task tree in-kernel is a fundamental - and simplifying - part of the
design.  I have earned through painful experience the opinion that
recreating the task tree in userspace is pretty much insane, as is
exposing the pid allocator to userspace via eclone(2), as is attempting
to support c/r of any resource that isn't isolated/virtualized, as is
having every recreated task "rendezvous" in the kernel by having them
all call restart(2), even though little significant work can be done in
parallel.

Time to try something different.

I don't see anything about in-kernel task tree creation that would
interfere with real-world use cases.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 21:20             ` Nathan Lynch
@ 2011-04-04 21:53               ` Serge E. Hallyn
  0 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 21:53 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Serge E. Hallyn, containers, linux-kernel, Alexey Dobriyan

[-- Attachment #1: Type: text/plain, Size: 1843 bytes --]

Quoting Nathan Lynch (ntl@pobox.com):
> >   If it can gain traction
> > better than linux-cr, that'd be one thing.  But given the amount of
> > review and testing the other tree has gotten
> 
> How much traction do you think linux-cr has?  It doesn't seem any closer
> to mainline than it was a year ago, and it barely has any users.  I
> don't think posting this little proof-of-concept patch set is disrupting
> linux-cr's progress toward mainline.

No, I agree with you there.  I appreciate your attempt, and it would have
been great if it had worked.  My comments are only about going forward
from today onward.  And, going forward, I don't believe that this API
simplification (and regression in functionality) is going to pay off
the way you'd hoped.

> > I'd really prefer that everyone was using the same tree, and sending
> > any and all patches which they need, no matter how ugly they fear
> > they are, upstream.  To that end, I think it would be appropriate
> > for you or Dan to get write access to Oren's tree or to move to a
> > newly cloned copy of his tree to which one of you has acces.
> 
> Oren and I disagree on some fundamental aspects of how kernel c/r should
> be implemented (hence this patch set), so I'm not sure how this would
> work.

Ok, not you then :)

I'm willing to do it, but since I won't be able to spend full time
reviewing it, I'd have to set some ground-rules, like:  I'll pull in
any patch as soon as it has an ack from (Oren, Dan Smith, Matt
Helsley) which is not also from the submitter.  Any regression in
automated tests causes the patch which caused it to get kicked out.

If you want to discuss the technical advantages of not allowing a task
to call checkpoint on another task, let's start a new thread to do that.
So far, I'm against it.

thanks,
-serge

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 19:42                 ` Andrew Morton
  2011-04-04 20:29                   ` Serge E. Hallyn
@ 2011-04-04 21:55                   ` Matt Helsley
  2011-04-04 23:15                     ` Andrew Morton
  2011-04-04 23:16                     ` Valdis.Kletnieks
  2011-04-04 22:11                   ` Serge E. Hallyn
  2011-04-04 22:53                   ` Serge E. Hallyn
  3 siblings, 2 replies; 41+ messages in thread
From: Matt Helsley @ 2011-04-04 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Serge E. Hallyn, containers, Nathan Lynch, linux-kernel, Alexey Dobriyan

On Mon, Apr 04, 2011 at 12:42:22PM -0700, Andrew Morton wrote:
> On Mon, 4 Apr 2011 13:51:20 -0500 "Serge E. Hallyn" <serge.hallyn@ubuntu.com> wrote:
> 
> > Quoting Andrew Morton (akpm@linux-foundation.org):
> > > On Mon, 4 Apr 2011 11:27:53 -0500 "Serge E. Hallyn" <serge@hallyn.com> wrote:
> > > 
> > > > Andrew (Cc:d), did you see this thread go by, and it did it look
> > > > in any way more palatable to you?  Have you had any thoughts on
> > > > checkpoint/restart in the last few months?  Or did that horse quietly
> > > > die over winter?
> > > 
> > > argh, it was the victim of LIFO.
> > > 
> > > All I can say at this stage is that I'll be interested next time it
> > > comes past, sorry.
> > 
> > Thanks, that's good to know.
> > 
> > As you know, we started with a minimal patchset, then grew it over time
> > to answer the "but how will you (xyz) without uglifying the kernel".
> > Would you recommend we go back to keeping a separate minimal patchset,
> > or that we develop on the current, pretty feature-full version?  I'm not
> > convinced believe there will be bandwidth to keep two trees and do both
> > justice.
> 
> The minimal patchset is too minimal for Oren's use and the maximal
> patchset seems to have run aground on general kernel sentiment.  So I
> guess you either take the minimal patchset and make it less minimal or
> take the maximal patchset and make it less maximal, ending up with the
> same thing.  How's that for hand-waving useless obviousnesses :)
> 
> One obvious approach is to merge the minimal patchset then, over time,
> sneak more stuff into it so we end up with the maximal patchset which
> people didn't like.  Don't do that :)

Yes, merging this minimal patch set early is obviously premature.

It seems clear from your statement above that "the maximal patchset seems to
have run aground on  general kernel sentiment" -- pushing that set isn't
going to make any progress. So I think we're left with modifying the new
minimal patch set.

However I think we need some review before we continue modifying it. We
had a minimal patch set which evolved into the current maximal set. It
never really got the reviews outside our little group that it needed.
Now we're back with a new minimal patch set. You're asking us to do the same
thing and expect different results -- stack more patches on top and expect to
get it reviewed. OK, but what reason do we have to believe this time will be
any different?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 21:43               ` Nathan Lynch
@ 2011-04-04 22:03                 ` Serge E. Hallyn
  2011-04-04 23:42                   ` Dan Smith
  2011-04-04 22:29                 ` Matt Helsley
  1 sibling, 1 reply; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 22:03 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Oren Laadan, containers, Alexey Dobriyan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1058 bytes --]

Quoting Nathan Lynch (ntl@pobox.com):
> On Mon, 2011-04-04 at 13:32 -0400, Oren Laadan wrote:
> > From the technical point of view it *is* a big problem:  there are
> > very good reasons why we chose a certain design. 
> > 
> > If Natahan is suggesting in-kernel tree creation as a temporary thing
> > to simplify the code for review - then, given that this patch handles
> > a single process, doing so add lots of unnecessary code, all of which
> > in the kernel.
> > 
> > If this is the beginning of a permanent approach, then it is totally
> > incompatible with what we have done so far, and severely restricts 
> > the kind of use--cases of the project, potentially making it too
> > unattractive for many natural adaptors, like HPC users. Sorry, nack.
> 
> It's not a stopgap measure to "ease review" or whatever; recreating the
> task tree in-kernel is a fundamental - and simplifying - part of the

I hadn't gotten to that part yet, so I'm on the fence.

The API for starting a checkpoint, that I'm not on the fence on.

-serge

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 19:42                 ` Andrew Morton
  2011-04-04 20:29                   ` Serge E. Hallyn
  2011-04-04 21:55                   ` Matt Helsley
@ 2011-04-04 22:11                   ` Serge E. Hallyn
  2011-04-04 22:53                   ` Serge E. Hallyn
  3 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 22:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Serge E. Hallyn, Serge E. Hallyn, containers, Nathan Lynch,
	linux-kernel, Alexey Dobriyan

Quoting Andrew Morton (akpm@linux-foundation.org):
> > As you know, we started with a minimal patchset, then grew it over time
> > to answer the "but how will you (xyz) without uglifying the kernel".
> > Would you recommend we go back to keeping a separate minimal patchset,
> > or that we develop on the current, pretty feature-full version?  I'm not
> > convinced believe there will be bandwidth to keep two trees and do both
> > justice.
> 
> The minimal patchset is too minimal for Oren's use and the maximal
> patchset seems to have run aground on general kernel sentiment.  So I

Sorry, when you say 'minimal patchset', are you referring to Nathan's tree?
Or a truly minimal patchset like what we originally started with?

thanks,
-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 21:43               ` Nathan Lynch
  2011-04-04 22:03                 ` Serge E. Hallyn
@ 2011-04-04 22:29                 ` Matt Helsley
  1 sibling, 0 replies; 41+ messages in thread
From: Matt Helsley @ 2011-04-04 22:29 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Oren Laadan, containers, Alexey Dobriyan, linux-kernel

On Mon, Apr 04, 2011 at 04:43:29PM -0500, Nathan Lynch wrote:
> On Mon, 2011-04-04 at 13:32 -0400, Oren Laadan wrote:
> > From the technical point of view it *is* a big problem:  there are
> > very good reasons why we chose a certain design. 
> > 
> > If Natahan is suggesting in-kernel tree creation as a temporary thing
> > to simplify the code for review - then, given that this patch handles
> > a single process, doing so add lots of unnecessary code, all of which
> > in the kernel.
> > 
> > If this is the beginning of a permanent approach, then it is totally
> > incompatible with what we have done so far, and severely restricts 
> > the kind of use--cases of the project, potentially making it too
> > unattractive for many natural adaptors, like HPC users. Sorry, nack.
> 
> It's not a stopgap measure to "ease review" or whatever; recreating the
> task tree in-kernel is a fundamental - and simplifying - part of the
> design.  I have earned through painful experience the opinion that
> recreating the task tree in userspace is pretty much insane, as is
> exposing the pid allocator to userspace via eclone(2), as is attempting
> to support c/r of any resource that isn't isolated/virtualized, as is
> having every recreated task "rendezvous" in the kernel by having them
> all call restart(2), even though little significant work can be done in
> parallel.

So far we've been proceeding under the assumption that some userspace
code ugliness was acceptable if it simplified the kernel code. With
ghost issues and the stuff you've mentioned above, I think it's become
questionable whether that choice has simplified the kernel code enough,
and trying something different is valuable.

At this point the only advantage I still see in userspace task creation for
restart is the reviewability of it. eclone is a small piece of code that
can be reviewed independently of restart and thus will prove a lot easier to
review for correctness and security than in-kernel task creation for restart.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 19:42                 ` Andrew Morton
                                     ` (2 preceding siblings ...)
  2011-04-04 22:11                   ` Serge E. Hallyn
@ 2011-04-04 22:53                   ` Serge E. Hallyn
  3 siblings, 0 replies; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-04 22:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Serge E. Hallyn, Serge E. Hallyn, containers, Nathan Lynch,
	linux-kernel, Alexey Dobriyan

Quoting Andrew Morton (akpm@linux-foundation.org):
> One obvious approach is to merge the minimal patchset then, over time,
> sneak more stuff into it so we end up with the maximal patchset which
> people didn't like.  Don't do that :)

Sorry, a second clarification question - you say 'which people didn't
like'.  I didn't get the impression that there were ever any complaints.
Do you remember what they were?

thanks,
-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 21:55                   ` Matt Helsley
@ 2011-04-04 23:15                     ` Andrew Morton
  2011-04-04 23:16                     ` Valdis.Kletnieks
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2011-04-04 23:15 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Serge E. Hallyn, containers, Nathan Lynch, linux-kernel, Alexey Dobriyan

On Mon, 4 Apr 2011 14:55:11 -0700 Matt Helsley <matthltc@us.ibm.com> wrote:

> However I think we need some review before we continue modifying it. We
> had a minimal patch set which evolved into the current maximal set. It
> never really got the reviews outside our little group that it needed.
> Now we're back with a new minimal patch set. You're asking us to do the same
> thing and expect different results -- stack more patches on top and expect to
> get it reviewed. OK, but what reason do we have to believe this time will be
> any different?

None whatsoever.  It could be that the two sets "a sufficiently useful
c/r implementation" and "a c/r implementation which will be acceptable"
have no intersection.  IOW, there is no solution.

But I haven't looked at c/r patches in quite some time, hence the
hand-waving and useless platitudes.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 21:55                   ` Matt Helsley
  2011-04-04 23:15                     ` Andrew Morton
@ 2011-04-04 23:16                     ` Valdis.Kletnieks
  2011-04-04 23:43                       ` Matt Helsley
  1 sibling, 1 reply; 41+ messages in thread
From: Valdis.Kletnieks @ 2011-04-04 23:16 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Andrew Morton, Serge E. Hallyn, containers, Nathan Lynch,
	linux-kernel, Alexey Dobriyan

[-- Attachment #1: Type: text/plain, Size: 484 bytes --]

On Mon, 04 Apr 2011 14:55:11 PDT, Matt Helsley said:

> Now we're back with a new minimal patch set. You're asking us to do the same
> thing and expect different results -- stack more patches on top and expect to
> get it reviewed. OK, but what reason do we have to believe this time will be
> any different?

Has the terrain changed any since last time? In particular, ISTR a bunch of
activity in namespace support since last time - does that change what your
patch set needs to do?

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 22:03                 ` Serge E. Hallyn
@ 2011-04-04 23:42                   ` Dan Smith
  2011-04-05  2:17                     ` Serge E. Hallyn
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Smith @ 2011-04-04 23:42 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Nathan Lynch, containers, Alexey Dobriyan, linux-kernel

SH> The API for starting a checkpoint, that I'm not on the fence on.

Is that just because it requires a C/R aware container init or for some
other reason?  I think the stricter API is a lot easier to understand,
but maybe there's something we can do to avoid that as a hard
requirement?  At least until "C/R support" becomes a desirable feature
of $INIT_DE_JOUR :)

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 23:16                     ` Valdis.Kletnieks
@ 2011-04-04 23:43                       ` Matt Helsley
  0 siblings, 0 replies; 41+ messages in thread
From: Matt Helsley @ 2011-04-04 23:43 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Matt Helsley, Andrew Morton, Serge E. Hallyn, containers,
	Nathan Lynch, linux-kernel, Alexey Dobriyan

On Mon, Apr 04, 2011 at 07:16:50PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 04 Apr 2011 14:55:11 PDT, Matt Helsley said:
> 
> > Now we're back with a new minimal patch set. You're asking us to do the same
> > thing and expect different results -- stack more patches on top and expect to
> > get it reviewed. OK, but what reason do we have to believe this time will be
> > any different?
> 
> Has the terrain changed at all since last time? In particular, ISTR a bunch
> of activity in namespace support since then - does that change what your
> patch set needs to do?

Good question.

Unfortunately it doesn't reduce what our patch set needs to do at this
point. The primary namespace changes, which landed 10+ kernel versions
ago (circa 2.6.28), are essential for containers and reliable
checkpoint/restart. Since then, the namespace improvements have had
little or nothing to do with enabling checkpoint/restart -- they have
all been for containers.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-04 23:42                   ` Dan Smith
@ 2011-04-05  2:17                     ` Serge E. Hallyn
  2011-04-05 19:18                       ` Nathan Lynch
  0 siblings, 1 reply; 41+ messages in thread
From: Serge E. Hallyn @ 2011-04-05  2:17 UTC (permalink / raw)
  To: Dan Smith
  Cc: Serge E. Hallyn, Nathan Lynch, containers, Alexey Dobriyan, linux-kernel

Quoting Dan Smith (danms@us.ibm.com):
> SH> The API for starting a checkpoint, that I'm not on the fence on.
> 
> Is that just because it requires a C/R aware container init or for some

Yup, that's it.

Which is why I'd be fine with it as a short-term workaround, if it might
actually help with upstreaming.

-serge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] Core checkpoint/restart support code
  2011-04-05  2:17                     ` Serge E. Hallyn
@ 2011-04-05 19:18                       ` Nathan Lynch
  0 siblings, 0 replies; 41+ messages in thread
From: Nathan Lynch @ 2011-04-05 19:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dan Smith, Serge E. Hallyn, containers, Alexey Dobriyan, linux-kernel

On Mon, 2011-04-04 at 21:17 -0500, Serge E. Hallyn wrote:
> Quoting Dan Smith (danms@us.ibm.com):
> > SH> The API for starting a checkpoint, that I'm not on the fence on.
> > 
> > Is that just because it requires a C/R aware container init or for some
> 
> Yup, that's it.
> 
> Which is why I'd be fine with it as a short-term workaround, if it might
> actually help with upstreaming.

Okay, I'll look at making unmodified init work.  One possibility is to
run the distro init as a child of the c/r-aware init, I suppose.
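
A rough sketch of how that wrapper might look -- purely illustrative:
the __NR_checkpoint number, its (fd, flags) argument list, and the use
of SIGUSR1 as the checkpoint trigger below are placeholders, not the
prototype's actual interface:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_checkpoint
#define __NR_checkpoint -1	/* placeholder syscall number */
#endif

static volatile sig_atomic_t want_checkpoint;

static void on_sigusr1(int sig)
{
	want_checkpoint = 1;
}

int main(void)
{
	pid_t child;

	/* This process is pid 1 of the container's pid namespace. */
	signal(SIGUSR1, on_sigusr1);

	child = fork();
	if (child == 0) {
		/* The unmodified distro init runs as our child. */
		execl("/sbin/init", "init", (char *)NULL);
		_exit(127);
	}

	for (;;) {
		int status;
		pid_t pid = waitpid(-1, &status, 0);

		if (pid < 0) {
			/* Interrupted -- most likely by the trigger signal. */
			if (want_checkpoint) {
				int fd = open("/var/tmp/container.img",
					      O_CREAT | O_WRONLY | O_TRUNC,
					      0600);
				if (fd >= 0) {
					/* Hypothetical call shape: dump the
					 * rest of the pid namespace to fd. */
					if (syscall(__NR_checkpoint, fd, 0) < 0)
						perror("checkpoint");
					close(fd);
				}
				want_checkpoint = 0;
			}
			continue;
		}

		/* Reap whatever gets reparented to us; leave when the
		 * real init does. */
		if (pid == child && (WIFEXITED(status) || WIFSIGNALED(status)))
			break;
	}
	return 0;
}

Whether something this thin keeps an unmodified distro init happy
(orphan reaping, shutdown signalling and so on) is exactly the sort of
thing that making unmodified init work would have to answer.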



^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2011-04-05 19:19 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-28 23:40 [RFC 00/10] container-based checkpoint/restart prototype ntl
2011-02-28 23:40 ` [PATCH 01/10] Make exec_mmap extern ntl
2011-04-03 16:56   ` Serge E. Hallyn
2011-02-28 23:40 ` [PATCH 02/10] Introduce mm_has_pending_aio() helper ntl
2011-03-01 15:40   ` Jeff Moyer
2011-03-01 16:04     ` Nathan Lynch
2011-02-28 23:40 ` [PATCH 03/10] Introduce has_locks_with_owner() helper ntl
2011-04-03 18:55   ` Serge E. Hallyn
2011-02-28 23:40 ` [PATCH 04/10] Introduce vfs_fcntl() helper ntl
2011-04-03 18:57   ` Serge E. Hallyn
2011-02-28 23:40 ` [PATCH 05/10] Core checkpoint/restart support code ntl
2011-04-03 19:03   ` Serge E. Hallyn
2011-04-04 15:00     ` Nathan Lynch
2011-04-04 15:10       ` Serge E. Hallyn
2011-04-04 15:40         ` Nathan Lynch
2011-04-04 16:27           ` Serge E. Hallyn
2011-04-04 17:32             ` Oren Laadan
2011-04-04 21:43               ` Nathan Lynch
2011-04-04 22:03                 ` Serge E. Hallyn
2011-04-04 23:42                   ` Dan Smith
2011-04-05  2:17                     ` Serge E. Hallyn
2011-04-05 19:18                       ` Nathan Lynch
2011-04-04 22:29                 ` Matt Helsley
2011-04-04 17:41             ` Andrew Morton
2011-04-04 18:51               ` Serge E. Hallyn
2011-04-04 19:42                 ` Andrew Morton
2011-04-04 20:29                   ` Serge E. Hallyn
2011-04-04 21:55                   ` Matt Helsley
2011-04-04 23:15                     ` Andrew Morton
2011-04-04 23:16                     ` Valdis.Kletnieks
2011-04-04 23:43                       ` Matt Helsley
2011-04-04 22:11                   ` Serge E. Hallyn
2011-04-04 22:53                   ` Serge E. Hallyn
2011-04-04 21:20             ` Nathan Lynch
2011-04-04 21:53               ` Serge E. Hallyn
2011-02-28 23:40 ` [PATCH 06/10] Checkpoint/restart mm support ntl
2011-02-28 23:40 ` [PATCH 07/10] Checkpoint/restart vfs support ntl
2011-02-28 23:40 ` [PATCH 08/10] Add generic '->checkpoint' f_op to ext filesystems ntl
2011-02-28 23:40 ` [PATCH 09/10] Add generic '->checkpoint()' f_op to simple char devices ntl
2011-02-28 23:40 ` [PATCH 10/10] x86_32 support for checkpoint/restart ntl
2011-03-01  1:08 ` [RFC 00/10] container-based checkpoint/restart prototype Nathan Lynch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).