* [RFC v14-rc2][PATCH 00/29] Kernel based checkpoint/restart
@ 2009-03-31  5:28 Oren Laadan
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Checkpoint-restart (c/r): two major changes are support for PPC arch
(by Nathan Lynch) and first part of refactoring file-checkpoint to use
f_ops (file operations). Tested against kernel v2.6.29-rc8 on x86_32.
Requires update of userspace tools too.

The git tree tracking v14-rc2, branch 'ckpt-v14-rc2' (and past versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires 'mktree' userspace tool with
the matching branch (v14-rc2):
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Oren.


Changelog:

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--


* [RFC v14-rc2][PATCH 01/29] Create syscalls: sys_checkpoint, sys_restart
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-03-31  5:28   ` Oren Laadan
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 02/29] Checkpoint/restart: initial documentation Oren Laadan
                     ` (27 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
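
As a rough sketch of that convention (not part of this patch), a caller
could sort the return value into the three cases as below. The enum and
the helper name cr_classify() are hypothetical and used only for
illustration:

```c
/* Hypothetical names, for illustration only; not part of the patch. */
enum cr_outcome {
	CR_ERROR,		/* negative return: the call failed */
	CR_RESTARTED,		/* zero: execution resumed from a restart */
	CR_CHECKPOINTED,	/* positive: checkpoint taken, value is its id */
};

/* Classify the value returned by the checkpoint syscall. */
static enum cr_outcome cr_classify(long ret)
{
	if (ret < 0)
		return CR_ERROR;
	if (ret == 0)
		return CR_RESTARTED;
	return CR_CHECKPOINTED;
}
```

This is the same shape as the usual switch on the result of fork(): one
call site, and the return value tells the caller which side of the
operation it is on.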

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.

Changelog[v14]:
  - Change CONFIG_CHECKPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)

Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/Kconfig                   |    4 +++
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   14 ++++++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 9 files changed, 76 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bc2fbad..246e26b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -71,6 +71,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config FAST_CMPXCHG_LOCAL
 	bool
 	default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..a5f9e09 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index e2e86a0..9f8c398 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..1761b0a
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Enable checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..375129c
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f9f900c..b96b61b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -691,6 +691,8 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 			  size_t);
 asmlinkage long sys_pipe2(int __user *, int);
 asmlinkage long sys_pipe(int __user *);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index 6a5c5fe..42355df 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -952,6 +952,8 @@ config MARKERS
 
 source "arch/Kconfig"
 
+source "checkpoint/Kconfig"
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 27dad29..e9e749d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,3 +175,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 02/29] Checkpoint/restart: initial documentation
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 01/29] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 03/29] Make file_pos_read/write() public Oren Laadan
                     ` (26 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
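
To give a feel for the record format that internals.txt describes (a
'struct cr_hdr' pre-header with 'type' and 'len', followed by a
payload), here is a minimal userspace sketch that walks a buffer of such
records. It is not code from this series: cr_count_records() is a
hypothetical helper, the fixed-width types stand in for __s16, and it
assumes that 'len' counts payload bytes only and that records are
packed back to back.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the pre-header shown in internals.txt (two __s16 fields). */
struct cr_hdr {
	int16_t type;
	int16_t len;	/* assumed: payload bytes, header excluded */
};

/*
 * Hypothetical helper: count the records in buf[0..size).
 * Returns the number of records, or -1 if the buffer is malformed
 * (truncated header, negative length, or a payload that runs past
 * the end of the buffer).
 */
static int cr_count_records(const unsigned char *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		struct cr_hdr h;

		if (size - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));	/* avoid unaligned reads */
		off += sizeof(h);

		if (h.len < 0 || (size_t)h.len > size - off)
			return -1;
		off += (size_t)h.len;
		n++;
	}
	return n;
}
```

A real reader would dispatch on 'type' instead of just counting, but the
bounds checks are the part worth getting right, since restart may be fed
an image from an untrusted source.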

Changelog[v14]:
  - Discard the 'h.parent' field

Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  127 +++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 598 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		printf("usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		printf("invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		printf("checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..c741b6c
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,127 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+
+* Process forest: [TBD] tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+};
+
+'type' identifies the type of the payload, and 'len' tells its length in
+bytes.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			cr_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_hdr_vma
+			cr_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+  of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+  if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+  object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object with given unique object reference
+  (objref), pointed to by ptr to the hash table. [restart]
+
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories, they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of a shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security consideration for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, then definitely the ability of the task that calls
+sys_restart() to setresuid/setresgid to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory. This will be useful to restart
+    from a checkpoint image that remains in kernel memory.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint usually is a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() would return -EINVAL. (This is because
+with nested containers a task may belong to more than one container).
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+-----------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 03/29] Make file_pos_read/write() public
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 01/29] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 02/29] Checkpoint/restart: initial documentation Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 04/29] General infrastructure for checkpoint restart Oren Laadan
                     ` (25 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

These two helpers are used by the next patch when calling vfs_read()
and vfs_write().
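
As a rough userspace model of the calling pattern (not kernel code), the
sketch below fetches the file offset, passes it by reference to a write
routine that advances it, and stores it back, looping over short writes.
'struct mock_file', mock_vfs_write() and mock_write_all() are hypothetical
names invented for this illustration; only the get/write/store sequence is
taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct file: an offset plus a small backing store. */
struct mock_file {
	long long f_pos;
	char data[64];
};

static long long mock_pos_read(struct mock_file *f)
{
	return f->f_pos;
}

static void mock_pos_write(struct mock_file *f, long long pos)
{
	f->f_pos = pos;
}

/*
 * Models the vfs_write() calling convention only: store up to 4 bytes at
 * *pos, advance *pos, and return the number of bytes stored (or -1).
 */
static long mock_vfs_write(struct mock_file *f, const char *buf,
			   size_t count, long long *pos)
{
	size_t n = count < 4 ? count : 4;

	if (*pos < 0 || (size_t)*pos + n > sizeof(f->data))
		return -1;
	memcpy(f->data + *pos, buf, n);
	*pos += (long long)n;
	return (long)n;
}

/* Caller-side pattern: fetch the offset, write, store the offset back. */
static int mock_write_all(struct mock_file *f, const char *buf, size_t count)
{
	while (count) {
		long long pos = mock_pos_read(f);
		long n = mock_vfs_write(f, buf, count, &pos);

		mock_pos_write(f, pos);
		if (n < 0)
			return -1;
		buf += n;
		count -= (size_t)n;
	}
	return 0;
}
```

The mock deliberately writes at most 4 bytes per call so that the loop
and the offset bookkeeping are both exercised.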

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 fs/read_write.c    |   10 ----------
 include/linux/fs.h |   10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 400fe81..4f9264a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92734c0..3bf5057 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1372,6 +1372,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 				struct iovec *fast_pointer,
 				struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 04/29] General infrastructure for checkpoint restart
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 03/29] Make file_pos_read/write() public Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 05/29] x86 support for checkpoint/restart Oren Laadan
                     ` (24 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
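
To make the record layout concrete, here is a minimal userspace sketch of
the framing used by the output and input wrappers: a small header with a
type and a length, followed by that many bytes of payload. It assumes an
in-memory image instead of a file; 'struct mem_image', rec_write() and
rec_read() are hypothetical names for this illustration and are not part
of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors the two 16-bit fields of struct cr_hdr in the patch. */
struct rec_hdr {
	uint16_t type;
	uint16_t len;
};

/* Hypothetical in-memory stand-in for the checkpoint image file. */
struct mem_image {
	unsigned char data[256];
	size_t wpos;	/* next byte to write */
	size_t rpos;	/* next byte to read */
};

static int img_write(struct mem_image *img, const void *buf, size_t n)
{
	if (n > sizeof(img->data) - img->wpos)
		return -1;
	memcpy(img->data + img->wpos, buf, n);
	img->wpos += n;
	return 0;
}

static int img_read(struct mem_image *img, void *buf, size_t n)
{
	if (n > img->wpos - img->rpos)
		return -1;	/* unexpected end of image */
	memcpy(buf, img->data + img->rpos, n);
	img->rpos += n;
	return 0;
}

/* Write one record: header first, then 'len' bytes of payload. */
static int rec_write(struct mem_image *img, uint16_t type,
		     const void *buf, uint16_t len)
{
	struct rec_hdr h = { .type = type, .len = len };

	if (img_write(img, &h, sizeof(h)) < 0)
		return -1;
	return img_write(img, buf, len);
}

/*
 * Read one record of an expected type into a buffer of 'cap' bytes.
 * Returns the payload length, or -1 on a short image, an oversized
 * payload, or a type mismatch.
 */
static int rec_read(struct mem_image *img, uint16_t type,
		    void *buf, size_t cap)
{
	struct rec_hdr h;

	if (img_read(img, &h, sizeof(h)) < 0)
		return -1;
	if (h.len > cap)
		return -1;
	if (img_read(img, buf, h.len) < 0)
		return -1;
	if (h.type != type)
		return -1;
	return h.len;
}
```

As in the patch, the length is checked against the destination buffer
before the payload is read, so a corrupt header cannot overrun the buffer.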

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
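
The temporary header buffer managed by cr_hbuf_get() and cr_hbuf_put()
behaves like a small stack: nested callers reserve space on top of each
other and release it in reverse order. The userspace sketch below shows
that discipline under stated assumptions: 'struct hbuf', hbuf_get() and
hbuf_put() are hypothetical names, the buffer size is arbitrary, and
assert() stands in for the kernel's BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 64	/* hypothetical size; the patch uses 8 * 4096 */

struct hbuf {
	unsigned char buf[HBUF_TOTAL];
	size_t pos;	/* bytes currently reserved */
};

/* Reserve n bytes on top of whatever is already reserved. */
static void *hbuf_get(struct hbuf *h, size_t n)
{
	void *ptr;

	assert(n <= HBUF_TOTAL - h->pos);	/* stands in for BUG_ON() */
	ptr = h->buf + h->pos;
	h->pos += n;
	return ptr;
}

/* Release the most recent n bytes; calls mirror hbuf_get() in reverse. */
static void hbuf_put(struct hbuf *h, size_t n)
{
	assert(h->pos >= n);
	h->pos -= n;
}
```

Because the reservations depend only on static header sizes and call
depth, running out of space indicates a programming error rather than bad
input, which is why the sketch asserts instead of returning an error.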

Changelog[v14]:
  - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
  - Revert use of 'pr_fmt' to avoid tainting whom includes us (Nathan Lynch)
  - Explicitly indicate length of UTS fields in header
  - Discard field 'h->parent'
  - Check whether calls to cr_hbuf_get() fail

Changelog[v12]:
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special cr_debug()

Changelog[v10]:
  - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
  - force end-of-string in cr_read_string() (fix possible DoS)

Changelog[v9]:
  - cr_kwrite/cr_kread() use file->f_op->write() directly
  - Drop cr_uwrite/cr_uread() since they aren't used anywhere

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)

Changelog[v5]:
  - Rename headers files s/ckpt/checkpoint/

Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  206 +++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  260 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  220 +++++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   58 +++++++++
 include/linux/checkpoint_hdr.h |   92 ++++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 836 insertions(+), 7 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index 2e2f4a4..126ff52 100644
--- a/Makefile
+++ b/Makefile
@@ -630,7 +630,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..364c326 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..4e4c3fc
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,206 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t cr_ctx_count = ATOMIC_INIT(0);
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_buffer - write a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer pointer
+ * @len: buffer size
+ */
+int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_BUFFER;
+	h.len = len;
+
+	return cr_write_obj(ctx, &h, buf);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh;
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	do_gettimeofday(&ktv);
+	uts = utsname();
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	hh->uts_release_len = sizeof(uts->release);
+	hh->uts_version_len = sizeof(uts->version);
+	hh->uts_machine_len = sizeof(uts->machine);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_write_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		return ret;
+	ret = cr_write_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		return ret;
+	ret = cr_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh;
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh;
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..d6f98d8
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,260 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @len: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d\n", h->type, h->len);
+
+	if (h->len > len)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type and size
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: expected record size
+ * @type: expected record type
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, len);
+	if (ret < 0)
+		return ret;
+
+	if (h.len != len || h.type != type)
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * cr_read_buf_type - read a whole record of expected type (unknown size)
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: available buffer size (output: actual record size)
+ * @type: expected record type
+ */
+int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, *len);
+	if (ret < 0)
+		return ret;
+
+	if (h.type != type)
+		return -EINVAL;
+
+	*len = h.len;
+	return 0;
+}
+
+/**
+ * cr_read_buffer - read a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer
+ * @len: buffer size (output actual record size)
+ */
+int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
+{
+	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string length
+ */
+int cr_read_string(struct cr_ctx *ctx, char *str, int len)
+{
+	int ret;
+
+	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
+	if (ret < 0)
+		return ret;
+
+	if (len > 0)
+		str[len - 1] = '\0';	/* always play it safe */
+
+	return ret;
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+	if (hh->flags & ~CR_CTX_CKPT)
+		goto out;
+	if (hh->uts_release_len != sizeof(uts->release) ||
+	    hh->uts_version_len != sizeof(uts->version) ||
+	    hh->uts_machine_len != sizeof(uts->machine))
+		goto out;
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+	ret = cr_read_obj_type(ctx, uts->release,
+			       sizeof(uts->release), CR_HDR_BUFFER);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_obj_type(ctx, uts->version,
+			       sizeof(uts->version), CR_HDR_BUFFER);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_obj_type(ctx, uts->machine,
+			       sizeof(uts->machine), CR_HDR_BUFFER);
+
+ out:
+	kfree(uts);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		goto out;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh;
+	struct task_struct *t = current;
+	char *buf;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (hh->task_comm_len > TASK_COMM_LEN)
+		goto out;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, hh->task_comm_len);
+	}
+	kfree(buf);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..337c160 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -1,7 +1,7 @@
 /*
  *  Generic container checkpoint-restart
  *
- *  Copyright (C) 2008 Oren Laadan
+ *  Copyright (C) 2008-2009 Oren Laadan
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
@@ -10,6 +10,180 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _cr_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _cr_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _cr_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _cr_kread(ctx->file , addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use cr_hbuf_get() to reserve space
+ * in the buffer, then cr_hbuf_put() when you no longer need that space.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
+
+/*
+ * helpers to manage C/R contexts: a context is allocated for each
+ * checkpoint and/or restart operation, and persists until it completes.
+ */
+
+static void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	cr_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +196,28 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	if (pid == 0)
+		pid = current->pid;
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +229,23 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..97f4af5
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,58 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
+extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
+extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
+extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..224457c
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,92 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(cr_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__u16 type;
+	__u16 len;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_BUFFER,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	__u16 uts_release_len;
+	__u16 uts_version_len;
+	__u16 uts_machine_len;
+	__u16 _padding;
+
+	/*
+	 * the header is followed by three strings:
+	 *   char release[__NEW_UTS_LEN];
+	 *   char version[__NEW_UTS_LEN];
+	 *   char machine[__NEW_UTS_LEN];
+	 */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__u32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 0b4df7e..f2c777a 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -49,4 +49,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 05/29] x86 support for checkpoint/restart
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an architecture
specific extension of the header (cr_hdr_head_arch); currently this
includes only FPU capabilities.
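
As a rough illustration of how these capabilities are used at restart,
here is a minimal userspace sketch of the comparison that
cr_read_head_arch() performs below. The struct mirrors cr_hdr_head_arch
from this patch; the names head_arch and head_arch_compatible() are
hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Userspace mirror of struct cr_hdr_head_arch (name is hypothetical). */
struct head_arch {
	uint16_t has_fxsr;
	uint16_t has_xsave;
	uint16_t xstate_size;
	uint16_t _padding;
} __attribute__((aligned(8)));

/*
 * Return 0 if an image taken with capabilities 'image' may be restarted
 * on a host with capabilities 'host', -1 otherwise. As in the patch,
 * any difference in FPU capabilities is treated as incompatible.
 */
static int head_arch_compatible(const struct head_arch *image,
				const struct head_arch *host)
{
	if (image->has_fxsr != host->has_fxsr ||
	    image->has_xsave != host->has_xsave ||
	    image->xstate_size != host->xstate_size)
		return -1;
	return 0;
}
```

The check is deliberately strict: the FPU state is dumped as an opaque
blob of xstate_size bytes, so the sketch (like the patch) refuses the
restart on any mismatch rather than trying to convert between formats.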

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Changelog[v14]:
  - Remove preempt_disable/enable() around init_fpu() and fix leak
  - Revert change to pr_debug(), back to cr_debug()
  - Use only unsigned fields in checkpoint headers
  - Discard field 'h->parent'
  - Check whether calls to cr_hbuf_get() fail

Changelog[v12]:
  - A couple of missed calls to cr_hbuf_put()
  - Replace obsolete cr_debug() with pr_debug()

Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in cr_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space

Changelog[v7]:
  - Fix save/restore state of FPU

Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers

Changelog[v4]:
  - Fix header structure alignment

Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |   98 +++++++++++++
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  242 +++++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c                 |  223 ++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |   19 ++-
 checkpoint/checkpoint_arch.h          |    9 ++
 checkpoint/restart.c                  |   17 ++-
 include/linux/checkpoint_hdr.h        |    2 +
 8 files changed, 608 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/checkpoint_arch.h

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..ffdb5f5
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,98 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(cr_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+struct cr_hdr_head_arch {
+	/* FIXME: add HAVE_HWFP */
+
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _pading;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_thread {
+	/* FIXME: restart blocks */
+
+	__u16 gdt_entry_tls_entries;
+	__u16 sizeof_tls_array;
+	__u16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u32 uses_debug;
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index d8cc96a..e1cb5f8 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -17,3 +17,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..946fac1
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,242 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh;
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("ntls %d\n", ntls);
+	if (ntls == 0)
+		return 0;
+
+	/*
+	 * For simplicity dump the entire array, cherry-pick upon restart
+	 * FIXME: the TLS descriptors in the GDT should be called out and
+	 * not tied to the in-kernel representation.
+	 */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS and FS registers should be saved from the hardware;
+	 * otherwise they are already saved on the thread structure
+	 */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long) regs->orig_ax < 0);
+		hh->ax = 0;
+	}
+}
+
+static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+}
+
+static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+	 * has been cleared when the task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIXME: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(xstate_buf, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	ret = cr_kwrite(ctx, xstate_buf, xstate_size);
+	cr_hbuf_put(ctx, xstate_size);
+
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh;
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	cr_save_cpu_regs(hh, t);
+	cr_save_cpu_debug(hh, t);
+	cr_save_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_write_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head_arch *hh;
+	int ret;
+
+	h.type = CR_HDR_HEAD_ARCH;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	/* FPU capabilities */
+	hh->has_fxsr = cpu_has_fxsr;
+	hh->has_xsave = cpu_has_xsave;
+	hh->xstate_size = xstate_size;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..9353ae2
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,223 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh;
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (ret < 0)
+		goto out;
+
+	cr_debug("ntls %d\n", hh->ntls);
+
+	ret = -EINVAL;
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+
+	if (hh->ntls > 0) {
+		struct desc_struct *desc;
+		int size, cpu;
+
+		/*
+		 * restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ?
+		 */
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = cr_kread(ctx, desc, size);
+		if (ret == 0) {
+			/*
+			 * FIX: add sanity checks (e.g. that values make
+			 * sense, that we don't overwrite old values, etc.)
+			 */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	return 0;
+}
+
+static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	ret = cr_kread(ctx, xstate_buf, xstate_size);
+	if (ret < 0)
+		goto out;
+
+	/* init_fpu() eventually also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		goto out;
+
+	memcpy(t->thread.xstate, xstate_buf, xstate_size);
+ out:
+	cr_hbuf_put(ctx, xstate_size);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh;
+	struct task_struct *t = current;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (ret < 0)
+		goto out;
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	ret = cr_load_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_fpu(hh, t);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_read_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head_arch *hh;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
+	if (ret < 0)
+		goto out;
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (hh->has_fxsr != cpu_has_fxsr ||
+	    hh->has_xsave != cpu_has_xsave ||
+	    hh->xstate_size != xstate_size)
+		ret = -EINVAL;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4e4c3fc..422ceff 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /* unique checkpoint identifier (FIXME: should be per-container ?) */
 static atomic_t cr_ctx_count = ATOMIC_INIT(0);
 
@@ -106,6 +108,7 @@ static int cr_write_head(struct cr_ctx *ctx)
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
+
 	if (ret < 0)
 		return ret;
 
@@ -116,8 +119,10 @@ static int cr_write_head(struct cr_ctx *ctx)
 	if (ret < 0)
 		return ret;
 	ret = cr_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+	if (ret < 0)
+		return ret;
 
-	return ret;
+	return cr_write_head_arch(ctx);
 }
 
 /* write the checkpoint trailer */
@@ -178,8 +183,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
-
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
 
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..ada1369
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,9 @@
+#include <linux/checkpoint.h>
+
+extern int cr_write_head_arch(struct cr_ctx *ctx);
+extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_head_arch(struct cr_ctx *ctx);
+extern int cr_read_thread(struct cr_ctx *ctx);
+extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d6f98d8..6cf0b41 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /**
  * cr_read_obj - read a whole record (cr_hdr followed by payload)
  * @ctx: checkpoint context
@@ -160,6 +162,10 @@ static int cr_read_head(struct cr_ctx *ctx)
 		goto out;
 	ret = cr_read_obj_type(ctx, uts->machine,
 			       sizeof(uts->machine), CR_HDR_BUFFER);
+	if (ret < 0)
+		goto out;
+
+	ret = cr_read_head_arch(ctx);
 
  out:
 	kfree(uts);
@@ -235,8 +241,17 @@ static int cr_read_task(struct cr_ctx *ctx)
 	int ret;
 
 	ret = cr_read_task_struct(ctx);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	cr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 224457c..0629c66 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/utsname.h>
+#include <asm/checkpoint_hdr.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -38,6 +39,7 @@ struct cr_hdr {
 /* header types */
 enum {
 	CR_HDR_HEAD = 1,
+	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
 
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 06/29] Dump memory address space
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it is followed by the file name. Then come the actual contents, in
one or more chunks: each chunk begins with a header that specifies
how many pages it holds, then the virtual addresses of all the dumped
pages in that chunk, followed by the actual contents of all dumped
pages. A header with a page count of zero marks the end of the contents.
Then comes the next VMA and so on.
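
To make the chunk layout concrete, below is a minimal userspace sketch
that walks the contents section of one VMA and counts the dumped pages.
The on-disk details are assumptions made for illustration only: a bare
32-bit page count as the chunk header, 64-bit virtual addresses and
4096-byte pages. The real records in this patch are framed by
'struct cr_hdr', and count_vma_pages() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/*
 * Walk the chunks stored in buf[0..len). Assumed layout of one chunk:
 *   uint32_t nr_pages;
 *   uint64_t vaddrs[nr_pages];
 *   unsigned char pages[nr_pages][SKETCH_PAGE_SIZE];
 * A chunk with nr_pages == 0 terminates the contents.
 * Returns the total number of pages, or -1 if the data is truncated.
 */
static long count_vma_pages(const unsigned char *buf, size_t len)
{
	const size_t per_page = sizeof(uint64_t) + SKETCH_PAGE_SIZE;
	size_t pos = 0;
	long total = 0;

	for (;;) {
		uint32_t nr;

		if (len - pos < sizeof(nr))
			return -1;	/* missing chunk header */
		memcpy(&nr, buf + pos, sizeof(nr));
		pos += sizeof(nr);

		if (nr == 0)
			return total;	/* terminating chunk */

		if (nr > (len - pos) / per_page)
			return -1;	/* vaddrs or pages cut short */
		pos += (size_t)nr * per_page;
		total += nr;
	}
}
```

Keeping all the addresses of a chunk ahead of its page contents lets a
reader learn where every page belongs before it touches the bulk data,
which is the property the chunked format above is designed to provide.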

Changelog[v14]:
  - Revert change to pr_debug(), back to cr_debug()
  - Save new field 'vdso' in mm_context
  - Discard field 'h->parent'
  - Check whether calls to cr_hbuf_get() fail

Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>))

Changelog[v12]:
  - Hide pgarr management inside cr_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete cr_debug() with pr_debug()

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in cr_fill_fname()

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now cr_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for cr_pgarr

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    6 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |   87 +++++
 checkpoint/checkpoint_arch.h          |    1 +
 checkpoint/checkpoint_mem.h           |   41 +++
 checkpoint/ckpt_mem.c                 |  558 +++++++++++++++++++++++++++++++++
 checkpoint/sys.c                      |   11 +
 include/linux/checkpoint.h            |   13 +
 include/linux/checkpoint_hdr.h        |   32 ++
 10 files changed, 782 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index ffdb5f5..54d3a41 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -95,4 +95,10 @@ struct cr_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__u64 vdso;
+	__u32 ldt_entry_size;
+	__u32 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 946fac1..92926e1 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -240,3 +240,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+
+	mutex_lock(&mm->context.lock);
+
+	hh->vdso = (unsigned long) mm->context.vdso;
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d vdso %#llx\n", hh->nldt, hh->vdso);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 364c326..6924ef4 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
+	 ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 422ceff..422e1a3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -13,6 +13,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -73,6 +74,65 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *n);
+	spin_unlock(&dcache_lock);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	/*
+	 * FIXME: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -186,6 +246,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -196,10 +260,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index ada1369..5168765 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,7 @@
 extern int cr_write_head_arch(struct cr_ctx *ctx);
 extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
 
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..3e48bc4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..3d3c5f5
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,558 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr_list: list head of populated page-array chain
+ *   ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (by releasing all
+ * pages) but not freed; instead, empty descriptors are kept in the pool.
+ *
+ * The head of the page-array chain ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct cr_pgarr *cr_pgarr_from_pool(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	if (list_empty(&ctx->pgarr_pool))
+		return NULL;
+	pgarr = list_first_entry(&ctx->pgarr_pool, struct cr_pgarr, list);
+	list_del(&pgarr->list);
+	return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+	cr_debug("nr_used %d\n", pgarr->nr_used);
+	/*
+	 * both checkpoint and restart use 'nr_used', however we only
+	 * collect pages during checkpoint; in restart we simply return
+	 * because pgarr->pages remains NULL.
+	 */
+	if (pgarr->pages) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nr_used;
+
+		while (nr--)
+			page_cache_release(pages[nr]);
+	}
+
+	pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CR_CTX_CKPT) {
+		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				       GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+
+ nomem:
+	cr_pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the chain if it still has space.
+ * Otherwise, takes an empty page-array from the pool (or allocates a
+ * new one if the pool is empty) and pushes it to the front of the chain.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = cr_pgarr_first(ctx);
+	if (pgarr && !cr_pgarr_is_full(pgarr))
+		return pgarr;
+
+	pgarr = cr_pgarr_from_pool(ctx);
+	if (!pgarr)
+		pgarr = cr_pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		return NULL;
+
+	list_add(&pgarr->list, &ctx->pgarr_list);
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+		cr_pgarr_release_pages(pgarr);
+	list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_consider_private_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_consider_private_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * Only care about dirty pages: either anonymous non-zero pages,
+	 * or file-backed COW (copy-on-write) pages that were modified.
+	 * A clean COW page is not interesting because its contents are
+	 * identical to the backing file; ignore such pages.
+	 * A file-backed broken COW is identified by its page_mapping()
+	 * being unset (NULL) because the page will no longer be mapped
+	 * to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct vm_area_struct *vma,
+			  unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct cr_pgarr *pgarr;
+	int nr_used;
+	int cnt = 0;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	do {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+
+		nr_used = pgarr->nr_used;
+
+		while (addr < end) {
+			struct page *page;
+
+			page = cr_consider_private_page(vma, addr);
+			if (IS_ERR(page))
+				return PTR_ERR(page);
+
+			if (page) {
+				pgarr->pages[pgarr->nr_used] = page;
+				pgarr->vaddrs[pgarr->nr_used] = addr;
+				pgarr->nr_used++;
+			}
+
+			addr += PAGE_SIZE;
+
+			if (cr_pgarr_is_full(pgarr))
+				break;
+		}
+
+		cnt += pgarr->nr_used - nr_used;
+
+	} while ((cnt < CR_PGARR_CHUNK) && (addr < end));
+
+	*start = addr;
+	return cnt;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	void *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = (void *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	free_page((unsigned long) buf);
+	return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_pgarr *hh;
+	unsigned long addr = vma->vm_start;
+	int cnt, ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+	 * in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	h.type = CR_HDR_PGARR;
+	h.len = sizeof(*hh);
+
+	while (addr < vma->vm_end) {
+		cnt = cr_private_vma_fill_pgarr(ctx, vma, &addr);
+		if (cnt == 0)
+			break;
+		else if (cnt < 0)
+			return cnt;
+
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		if (!hh)
+			return -ENOMEM;
+
+		hh->nr_pages = cnt;
+		ret = cr_write_obj(ctx, &h, hh);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		if (ret < 0)
+			return ret;
+
+		ret = cr_vma_dump_pages(ctx, cnt);
+		if (ret < 0)
+			return ret;
+
+		cr_pgarr_reset_all(ctx);
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+	hh->nr_pages = 0;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+/**
+ * cr_write_vma - classify the vma and dump its contents
+ * @ctx: checkpoint context
+ * @vma: vma object
+ *
+ * (see vma subtypes in checkpoint_hdr.h)
+ */
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh;
+	int vma_type, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+#define CR_BAD_VM_FLAGS  \
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+
+	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -ENOSYS;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/*
+	 * if there is a backing file, assume private-mapped
+	 * (FIXME: check if the file is unlinked)
+	 */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* save the file name */
+	/* FIXME: files should be deposited and sought in the objhash */
+	if (vma->vm_file) {
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			return ret;
+	}
+
+	return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 337c160..d403731 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
 
+#include "checkpoint_mem.h"
+
 /*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
@@ -153,7 +155,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -168,6 +176,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+	INIT_LIST_HEAD(&ctx->pgarr_pool);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 97f4af5..56442ab 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -25,6 +28,11 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -42,6 +50,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+			  struct path *path, struct path *root);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
@@ -50,7 +60,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+extern int cr_read_mm(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0629c66..2a06a2f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -42,6 +42,7 @@ enum {
 	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -49,6 +50,7 @@ enum {
 
 	CR_HDR_MM = 201,
 	CR_HDR_VMA,
+	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
 	CR_HDR_TAIL = 5001
@@ -91,4 +93,34 @@ struct cr_hdr_task {
 	__u32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__s32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum cr_vma_type {
+	CR_VMA_ANON = 1,	/* private anonymous */
+	CR_VMA_FILE,		/* private mapped file */
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3
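
For readers following the page-array bookkeeping in ckpt_mem.c, the chain and pool behaviour described in the comment block there can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: all names (struct pgarr, struct chain, chain_*) are hypothetical, a plain singly linked list stands in for list_head, the array holds only addresses (no struct page pointers), and the array size is shrunk to 4 entries so the reuse is easy to observe.

```c
/* Userspace sketch of the page-array chain and pool; names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PGARR_TOTAL 4	/* tiny on purpose; the patch uses PAGE_SIZE / sizeof(void *) */

struct pgarr {
	unsigned long vaddrs[PGARR_TOTAL];
	unsigned int nr_used;
	struct pgarr *next;
};

struct chain {
	struct pgarr *list;	/* populated descriptors, newest first */
	struct pgarr *pool;	/* empty descriptors kept for reuse */
	unsigned int nr_allocs;	/* how many times calloc() was needed */
};

/* return a descriptor with room: head of list, else from pool, else allocate */
static struct pgarr *chain_current(struct chain *c)
{
	struct pgarr *p = c->list;

	if (p && p->nr_used < PGARR_TOTAL)
		return p;

	p = c->pool;
	if (p) {
		c->pool = p->next;
	} else {
		p = calloc(1, sizeof(*p));
		if (!p)
			return NULL;
		c->nr_allocs++;
	}
	p->next = c->list;
	c->list = p;
	return p;
}

/* record one address; returns 0, or -1 on allocation failure */
static int chain_add(struct chain *c, unsigned long vaddr)
{
	struct pgarr *p = chain_current(c);

	if (!p)
		return -1;
	assert(p->nr_used < PGARR_TOTAL);
	p->vaddrs[p->nr_used++] = vaddr;
	return 0;
}

/* empty every descriptor and move the whole list into the pool */
static void chain_reset(struct chain *c)
{
	struct pgarr *p;

	while ((p = c->list)) {
		c->list = p->next;
		p->nr_used = 0;
		p->next = c->pool;
		c->pool = p;
	}
}

/* free both the populated list and the pool */
static void chain_free(struct chain *c)
{
	struct pgarr *p;

	chain_reset(c);
	while ((p = c->pool)) {
		c->pool = p->next;
		free(p);
	}
}
```

The point of the pool is that a second round of chain_add() calls after chain_reset() reuses the descriptors from the first round instead of allocating again, which is what the per-chunk reset in the dump loop relies on.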

* [RFC v14-rc2][PATCH 07/29] Restore memory address space
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 06/29] Dump memory address space Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 08/29] Infrastructure for shared objects Oren Laadan
                     ` (21 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.

Changelog[v14]:
  - Revert change to pr_debug(), back to cr_debug()
  - Compare saved 'vdso' field of mm_context with current value
  - Discard field 'h->parent'
  - Check whether calls to cr_hbuf_get() fail

Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vma() (or it
    may crash if restart fails after having removed all vma's)

Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()

Changelog[v4]:
  - Use standard list_... for cr_pgarr


Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/restart.c                 |   66 ++++++
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint_arch.h          |    1 +
 checkpoint/checkpoint_mem.h           |    5 +
 checkpoint/restart.c                  |   51 +++++
 checkpoint/rstr_mem.c                 |  391 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    4 +
 mm/mmap.c                             |    2 +-
 9 files changed, 525 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c
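
As an aid to reviewing the restore loop, the chunked <vaddrs, pages> layout mentioned in Changelog[v5] can be sketched in userspace as below. This is a simplified illustration, not the actual image format: it omits the generic 'struct cr_hdr' that precedes each cr_hdr_pgarr, stores addresses as 64-bit values, and replaces get_user_pages() with a copy into a flat buffer. All sk_* names and SK_PAGE_SIZE are hypothetical.

```c
/* Userspace sketch of the chunked <vaddrs, pages> stream; names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u

/* append one chunk: page count, then all addresses, then all page contents */
static size_t sk_put_chunk(uint8_t *out, const uint64_t *vaddrs,
			   const uint8_t *pages, uint64_t nr)
{
	size_t off = 0;

	memcpy(out + off, &nr, sizeof(nr));
	off += sizeof(nr);
	memcpy(out + off, vaddrs, nr * sizeof(*vaddrs));
	off += nr * sizeof(*vaddrs);
	memcpy(out + off, pages, nr * SK_PAGE_SIZE);
	off += nr * SK_PAGE_SIZE;
	return off;
}

/* append the terminator: a chunk header announcing 0 pages */
static size_t sk_put_end(uint8_t *out)
{
	uint64_t zero = 0;

	memcpy(out, &zero, sizeof(zero));
	return sizeof(zero);
}

/*
 * Walk the stream and copy each page to mem + (vaddr - base).
 * Returns the total number of pages restored, or -1 if the stream is
 * truncated or an address falls outside [base, base + mem_len).
 */
static long sk_restore(const uint8_t *in, size_t len,
		       uint8_t *mem, size_t mem_len, uint64_t base)
{
	size_t off = 0;
	long total = 0;

	if (mem_len < SK_PAGE_SIZE)
		return -1;

	for (;;) {
		const uint8_t *vaddrs, *data;
		uint64_t nr, i;

		assert(off <= len);
		if (len - off < sizeof(nr))
			return -1;
		memcpy(&nr, in + off, sizeof(nr));
		off += sizeof(nr);
		if (nr == 0)
			return total;	/* "0 pages" header marks the end */

		/* each page costs one address plus one page of data */
		if (nr > (len - off) / (sizeof(uint64_t) + SK_PAGE_SIZE))
			return -1;
		vaddrs = in + off;
		data = vaddrs + nr * sizeof(uint64_t);

		for (i = 0; i < nr; i++) {
			uint64_t va;

			memcpy(&va, vaddrs + i * sizeof(va), sizeof(va));
			if (va < base || va - base > mem_len - SK_PAGE_SIZE)
				return -1;
			memcpy(mem + (va - base), data + i * SK_PAGE_SIZE,
			       SK_PAGE_SIZE);
		}
		off += nr * (sizeof(uint64_t) + SK_PAGE_SIZE);
		total += (long) nr;
	}
}
```

Keeping all addresses of a chunk ahead of the page data is what lets the reader size its work before touching any contents; the sketch mirrors that by validating the whole chunk length before copying the first page.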

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 54d3a41..e9eb40c 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -101,4 +101,9 @@ struct cr_hdr_mm_context {
 	__u32 nldt;
 } __attribute__((aligned(8)));
 
+#ifdef __KERNEL__
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+#endif
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 9353ae2..fca5cd8 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -10,10 +10,12 @@
 
 #include <asm/desc.h>
 #include <asm/i387.h>
+#include <asm/elf.h>
 
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+
 /* read the thread_struct into the current task */
 int cr_read_thread(struct cr_ctx *ctx)
 {
@@ -221,3 +223,67 @@ int cr_read_head_arch(struct cr_ctx *ctx)
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_mm_context *hh;
+	unsigned int n;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	if (ret < 0)
+		goto out;
+
+	cr_debug("nldt %d vdso %#lx (%p)\n",
+		 hh->nldt, (unsigned long) hh->vdso, mm->context.vdso);
+
+	ret = -EINVAL;
+	if (hh->vdso != (unsigned long) mm->context.vdso)
+		goto out;
+	if (hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			goto out;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 6924ef4..c2c16e0 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
-	 ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index 5168765..e43b7fe 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -8,3 +8,4 @@ extern int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
 extern int cr_read_cpu(struct cr_ctx *ctx);
+extern int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
index 3e48bc4..de1d4c8 100644
--- a/checkpoint/checkpoint_mem.h
+++ b/checkpoint/checkpoint_mem.h
@@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
 	return (pgarr->nr_used == CR_PGARR_TOTAL);
 }
 
+static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+	return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 6cf0b41..665894f 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -116,6 +116,44 @@ int cr_read_string(struct cr_ctx *ctx, char *str, int len)
 	return ret;
 }
 
+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, char *fname, int flen)
+{
+	return cr_read_buf_type(ctx, fname, &flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int ret;
+
+	fname = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, PATH_MAX);
+	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
 /* read the checkpoint header */
 static int cr_read_head(struct cr_ctx *ctx)
 {
@@ -244,6 +282,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -255,10 +297,19 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* setup restart-specific parts of ctx */
+static int cr_ctx_restart(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
 int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_restart(ctx);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..003b391
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,391 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * cr_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+	int ret;
+
+	ret = cr_kread(ctx, buf, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, buf, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * cr_read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int cr_read_pages_contents(struct cr_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = cr_page_read(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pgarr *hh;
+	unsigned long nr_pages;
+	int ret;
+
+	while (1) {
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		if (!hh)
+			return -ENOMEM;
+		ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
+		if (ret < 0) {
+			cr_hbuf_put(ctx, sizeof(*hh));
+			break;
+		}
+
+		cr_debug("nr_pages %ld\n", (unsigned long) hh->nr_pages);
+
+		nr_pages = hh->nr_pages;
+		cr_hbuf_put(ctx, sizeof(*hh));
+
+		if (!nr_pages)
+			break;
+
+		ret = cr_read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = cr_read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		cr_pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+/**
+ * cr_read_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ *
+ * (see vma subtypes in checkpoint_hdr.h)
+ */
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh;
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	enum cr_vma_type vma_type;
+	struct file *file = NULL;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (ret < 0)
+		goto err;
+
+	cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
+		 (unsigned long) hh->vm_end, (int) hh->vma_type);
+
+	ret = -EINVAL;
+	if (hh->vm_end < hh->vm_start)
+		goto err;
+
+	vm_start = hh->vm_start;
+	vm_pgoff = hh->vm_pgoff;
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+	vma_type = hh->vma_type;
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	switch (vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		if (vm_flags & VM_SHARED)
+			goto err;
+		/*
+		 * vm_pgoff for anonymous mapping is the "global" page
+		 * offset (namely from addr 0x0), so we force a zero
+		 */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		if (vm_flags & VM_SHARED)
+			goto err;
+		/*
+		 * for private mapping using 'read-only' is sufficient
+		 */
+		file = cr_read_open_fname(ctx, O_RDONLY, 0);
+		if (IS_ERR(file)) {
+			ret = PTR_ERR(file);
+			goto err;
+		}
+		break;
+
+	default:
+		goto err;
+
+	}
+
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_read_private_vma_contents(ctx);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	cr_debug("vma retval %d\n", ret);
+	return 0;
+
+ err:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			cr_debug("c/r: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh;
+	struct mm_struct *mm;
+	unsigned int nr;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (ret < 0)
+		goto out;
+
+	cr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+
+	ret = -EINVAL;
+	if ((hh->start_code > hh->end_code) ||
+	    (hh->start_data > hh->end_data))
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_read_mm_context(ctx, mm);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 56442ab..31124ca 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -59,6 +59,10 @@ extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
 extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
+extern int cr_read_fname(struct cr_ctx *ctx, char *fname, int n);
+extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
+				       int flags, int mode);
+
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 00ced3e..fb4df8f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2106,7 +2106,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb = tlb_gather_mmu(mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 08/29] Infrastructure for shared objects
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 07/29] Restore " Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 09/29] Dump open file descriptors Oren Laadan
                     ` (20 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
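
The lookup rules above can be sketched in plain userspace C. This is
only an illustration, not the code in this patch: the names hash_key(),
obj_new(), obj_add_ptr() and obj_get_by_ref() below are hypothetical
stand-ins, hash_key() merely approximates what hash_long() does, and
the per-type reference counting is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBITS    10
#define NBUCKETS (1u << NBITS)

struct objref {
	int objref;		/* unique identifier written to the image */
	void *ptr;		/* the shared object itself */
	struct objref *next;	/* bucket chain */
};

struct objhash {
	struct objref *bucket[NBUCKETS];
	int next_free_objref;
};

/* Hypothetical stand-in for hash_long(): fold, multiply, keep top bits. */
static unsigned int hash_key(uint64_t key)
{
	uint32_t k = (uint32_t)(key ^ (key >> 32));

	return (k * 2654435761u) >> (32 - NBITS);
}

static void objhash_init(struct objhash *h)
{
	unsigned int i;

	for (i = 0; i < NBUCKETS; i++)
		h->bucket[i] = NULL;
	h->next_free_objref = 1;	/* 0 is reserved for "no objref" */
}

/* One-way hash: entries are only released when the whole hash goes. */
static void objhash_free(struct objhash *h)
{
	unsigned int i;

	for (i = 0; i < NBUCKETS; i++) {
		while (h->bucket[i]) {
			struct objref *obj = h->bucket[i];

			h->bucket[i] = obj->next;
			free(obj);
		}
	}
}

/* objref == 0: index by address (checkpoint); else by objref (restart). */
static struct objref *obj_new(struct objhash *h, void *ptr, int objref)
{
	struct objref *obj = malloc(sizeof(*obj));
	unsigned int i;

	if (!obj)
		return NULL;
	obj->ptr = ptr;
	if (objref) {
		obj->objref = objref;
		i = hash_key((uint64_t)objref);
	} else {
		obj->objref = h->next_free_objref++;
		i = hash_key((uint64_t)(uintptr_t)ptr);
	}
	obj->next = h->bucket[i];
	h->bucket[i] = obj;
	return obj;
}

/* Checkpoint side: 1 if newly added, 0 if already known, -1 on error. */
static int obj_add_ptr(struct objhash *h, void *ptr, int *objref)
{
	struct objref *obj = h->bucket[hash_key((uint64_t)(uintptr_t)ptr)];
	int added = 0;

	while (obj && obj->ptr != ptr)
		obj = obj->next;
	if (!obj) {
		obj = obj_new(h, ptr, 0);
		if (!obj)
			return -1;
		added = 1;
	}
	*objref = obj->objref;
	return added;
}

/* Restart side: NULL means the object's state still has to be read. */
static void *obj_get_by_ref(struct objhash *h, int objref)
{
	struct objref *obj = h->bucket[hash_key((uint64_t)objref)];

	while (obj && obj->objref != objref)
		obj = obj->next;
	return obj ? obj->ptr : NULL;
}

/* One checkpoint pass followed by one restart pass. */
static void objhash_demo(void)
{
	static struct objhash h;
	int file_a, file_b, ref;	/* stand-ins for shared objects */

	objhash_init(&h);
	assert(obj_add_ptr(&h, &file_a, &ref) == 1 && ref == 1);
	assert(obj_add_ptr(&h, &file_b, &ref) == 1 && ref == 2);
	assert(obj_add_ptr(&h, &file_a, &ref) == 0 && ref == 1);
	objhash_free(&h);

	objhash_init(&h);
	assert(obj_get_by_ref(&h, 1) == NULL);	/* first sight: create */
	assert(obj_new(&h, &file_a, 1) != NULL);
	assert(obj_get_by_ref(&h, 1) == &file_a);
	objhash_free(&h);
}
```

A return value of 1 from obj_add_ptr() is the cue to dump the full
state; a return value of 0 means only the identifier needs to be
written. The same table type serves both directions because only the
choice of key differs.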

Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
    (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>)

Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime

Changelog[v4]:
  - Fix calculation of hash table size

Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/objhash.c       |  280 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c           |    4 +
 include/linux/checkpoint.h |   20 +++
 4 files changed, 305 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index c2c16e0..8368a03 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o objhash.o \
 		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..17f43fc
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,280 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->next_free_objref = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_long((unsigned long) ptr,
+					  CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_long((unsigned long) objref,
+					  CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+/**
+ * cr_obj_new - allocate an object and add to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Allocate an object referring to @ptr and add to the hash table.
+ * If @objref is zero, assign a unique object reference and use @ptr
+ * as a hash key [checkpoint]. Else use @objref as a key [restart].
+ * In both cases, grab a reference (depending on @type) to said object.
+ */
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int i;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return NULL;
+
+	obj->ptr = ptr;
+	obj->type = type;
+	obj->flags = flags;
+
+	if (objref) {
+		/* use @objref to index (restart) */
+		obj->objref = objref;
+		i = hash_long((unsigned long) objref, CR_OBJHASH_NBITS);
+	} else {
+		/* use @ptr to index, assign objref (checkpoint) */
+		obj->objref = ctx->objhash->next_free_objref++;
+		i = hash_long((unsigned long) ptr, CR_OBJHASH_NBITS);
+	}
+
+	hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+	cr_obj_ref_grab(obj);
+	return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, then add the object to the table, and allocate a
+ * fresh unique object reference (objref). Grab a reference to every
+ * object that is added, and maintain the reference until the entire
+ * hash is free.
+ *
+ * Fills the unique objref of the object into @objref.
+ *
+ * [This is used during checkpoint].
+ *
+ * Returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique objref to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Add the object pointed to by @ptr and identified by unique object
+ * reference given by @objref to the hash table (indexed by @objref).
+ * Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is free.
+ *
+ * [This is used during restart].
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique object reference of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the unique object reference (objref) of the object pointed
+ * to by @ptr; return it, -ESRCH if not found, -EINVAL on type mismatch.
+ *
+ * [This is used during checkpoint].
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj)
+		return -ESRCH;
+	if (obj->type != type)
+		return -EINVAL;
+	return obj->objref;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique object reference
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ *
+ * Look up the object that is identified by the unique object reference
+ * specified by @objref, and return a pointer to the matching object,
+ * NULL if not found, or ERR_PTR(-EINVAL) if the type does not match.
+ *
+ * [This is used during restart].
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return NULL;
+	if (obj->type != type)
+		return ERR_PTR(-EINVAL);
+	return obj->ptr;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d403731..eef774e 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -161,6 +161,7 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
 
 	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);
 
 	kfree(ctx);
 }
@@ -189,6 +190,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto err;
 
+	if (cr_objhash_alloc(ctx) < 0)
+		goto err;
+
 	return ctx;
 
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 31124ca..88854a9 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct cr_objhash *objhash;	/* hash for shared objects */
+
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
@@ -45,6 +47,24 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+extern void cr_objhash_free(struct cr_ctx *ctx);
+extern int cr_objhash_alloc(struct cr_ctx *ctx);
+extern void *cr_obj_get_by_ref(struct cr_ctx *ctx,
+			       int objref, unsigned short type);
+extern int cr_obj_get_by_ptr(struct cr_ctx *ctx,
+			     void *ptr, unsigned short type);
+extern int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+			  unsigned short type, unsigned short flags);
+extern int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+			  unsigned short type, unsigned short flags);
+
 struct cr_hdr;
 
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 09/29] Dump open file descriptors
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 08/29] Infrastructure for shared objects Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-10-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code Oren Laadan
                     ` (19 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Dump the files_struct of a task with 'struct cr_hdr_fd_table', followed by
all open file descriptors. Because the 'struct file' corresponding to an
FD can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it lives
in the hash (the hash is only cleaned up at the end of the checkpoint).

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
close-on-exec property, and the objref of the corresponding 'file *'.
If the file is seen for the first time, this is followed by a
'struct cr_hdr_file' with the file state. Then comes the next FD,
and so on.
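
The resulting record order can be sketched in plain userspace C. This
is a simplified illustration, not the code in this patch: emit_fd_table(),
struct rec and the REC_* names are hypothetical, a linear scan stands in
for the object hash, and the file name that follows the file state in
the real image is omitted.

```c
#include <assert.h>
#include <stddef.h>

enum rec_type { REC_FD_TABLE, REC_FD_ENT, REC_FILE };

struct rec {
	enum rec_type type;
	int fd;		/* REC_FD_TABLE: number of fds; otherwise the fd */
	int objref;	/* identifier of the underlying file, 0 for table */
};

#define MAX_FILES 64	/* arbitrary limit for this sketch */

/*
 * Lay out the records for one descriptor table. files[i] is the open
 * file behind fds[i]. Returns the number of records written to out[],
 * or -1 if out[] or the lookup table is too small.
 */
static int emit_fd_table(const int *fds, const void *const *files, int nfds,
			 struct rec *out, int max)
{
	const void *seen[MAX_FILES];	/* seen[k] was given objref k + 1 */
	int nseen = 0, n = 0, i, k;

	if (max < 1)
		return -1;
	out[n++] = (struct rec){ REC_FD_TABLE, nfds, 0 };

	for (i = 0; i < nfds; i++) {
		/* has this file been dumped already? */
		for (k = 0; k < nseen; k++)
			if (seen[k] == files[i])
				break;

		if (n == max)
			return -1;
		out[n++] = (struct rec){ REC_FD_ENT, fds[i], k + 1 };

		if (k < nseen)
			continue;	/* shared file: the objref suffices */

		/* first encounter: remember it and dump its state */
		if (nseen == MAX_FILES || n == max)
			return -1;
		seen[nseen++] = files[i];
		out[n++] = (struct rec){ REC_FILE, fds[i], k + 1 };
	}
	return n;
}

/* stdin, stdout and stderr share one file; fd 5 is a separate one. */
static void fd_table_demo(void)
{
	int tty, log;	/* stand-ins for two 'struct file' objects */
	const int fds[] = { 0, 1, 2, 5 };
	const void *const files[] = { &tty, &tty, &tty, &log };
	struct rec out[16];

	/* table, ent+file, ent, ent, ent+file */
	assert(emit_fd_table(fds, files, 4, out, 16) == 7);
	assert(out[1].type == REC_FD_ENT && out[2].type == REC_FILE);
	assert(out[3].type == REC_FD_ENT && out[3].objref == 1);
	assert(out[5].objref == 2 && out[6].type == REC_FILE);
}
```

A reader of such a stream only needs the objref in each entry to know
whether a file record follows: an identifier it has not seen before
means the state comes next.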

Recall that it is assumed that all tasks possibly sharing the file table
are frozen. If this assumption breaks, then the behavior is *undefined*:
checkpoint may fail, or restart from the resulting image file will fail.

This patch only handles basic FDs - regular files, directories.

Changelog[v14]:
  - Revert change to pr_debug(), back to cr_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  cr_write_files() => cr_write_fd_table()
  - Rename:  cr_write_fd_data() => cr_write_file()
  - Discard field 'h->parent'
  - Check whether calls to cr_hbuf_get() fail
  - Use one CR_FD_GENERIC for both regular files and dirs
  - Put code for generic file descriptors in a separate function

Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()

Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - cr_scan_fds() retries from scratch if hits size limits

Changelog[v9]:
  - Fix a couple of leaks in cr_write_files()
  - Drop useless kfree from cr_scan_fds()

Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    2 +-
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint.c               |    4 +
 checkpoint/checkpoint_file.h          |   17 +++
 checkpoint/ckpt_file.c                |  247 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    3 +-
 include/linux/checkpoint_hdr.h        |   30 ++++-
 7 files changed, 301 insertions(+), 4 deletions(-)
 create mode 100644 checkpoint/checkpoint_file.h
 create mode 100644 checkpoint/ckpt_file.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index e9eb40c..1efdf24 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -15,7 +15,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned (8))) for the entire structure.
  *
  * Quoting Arnd Bergmann:
  *   "This structure has an odd multiple of 32-bit members, which means
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8368a03..1d92ed2 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 422e1a3..d4e0007 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -250,6 +250,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_fd_table(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..9c344c7
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,247 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i, n;
+	int tot = CR_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (n = 0, i = 0; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fds;
+	return n;
+}
+
+static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
+				 struct cr_hdr_file *hh)
+{
+	struct cr_hdr h;
+	int ret;
+
+	/*
+	 * FIX: check if the file/dir/link is unlinked
+	 *
+	 * Or, pass up something like in hh->flags to tell
+	 * the higher-level code that it needs to bring
+	 * along the file contents too.
+	 */
+
+	h.type = CR_HDR_FILE;
+	h.len = sizeof(*hh);
+
+	hh->fd_type = CR_FD_GENERIC;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
+}
+
+/* cr_write_file - dump the state of a given file pointer */
+static int cr_write_file(struct cr_ctx *ctx, struct file *file)
+{
+	struct cr_hdr_file *hh;
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+		ret = cr_write_file_generic(ctx, file, hh);
+		break;
+	default:
+		ret = -EBADF;
+		break;
+	}
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls cr_write_file to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh;
+	struct file *file;
+	struct fdtable *fdt;
+	int objref, new, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file)
+		return -EBADF;
+
+	/* adding 'file' to the hash will keep a reference to it */
+	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+	cr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+	if (new < 0)
+		return new;
+
+	h.type = CR_HDR_FD_ENT;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh) {
+		fput(file);
+		return -ENOMEM;
+	}
+
+	hh->objref = objref;
+	hh->fd = fd;
+	hh->close_on_exec = coe;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new)
+		ret = cr_write_file(ctx, file);
+
+out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (file)
+		fput(file);
+	return ret;
+}
+
+int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_table *hh;
+	struct files_struct *files;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FD_TABLE;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	files = get_files_struct(t);
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	cr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ out:
+	kfree(fdtable);
+	put_files_struct(files);
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 88854a9..9489ea5 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,7 +13,7 @@
 #include <linux/path.h>
 #include <linux/fs.h>
 
-#define CR_VERSION  1
+#define CR_VERSION  2
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -85,6 +85,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 2a06a2f..a6b6dce 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -17,7 +17,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned (8))) for the entire structure.
  *
  * Quoting Arnd Bergmann:
  *   "This structure has an odd multiple of 32-bit members, which means
@@ -53,6 +53,10 @@ enum {
 	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
+	CR_HDR_FD_TABLE = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FILE,
+
 	CR_HDR_TAIL = 5001
 };
 
@@ -123,4 +127,28 @@ struct cr_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_fd_table {
+	__s32 objref;		/* identifier for shared objects */
+	__s32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__s32 objref;		/* identifier for shared objects */
+	__s32 fd;
+	__u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum  fd_type {
+	CR_FD_GENERIC = 1
+};
+
+struct cr_hdr_file {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 09/29] Dump open file descriptors Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 11/29] add generic checkpoint f_op to ext fses Oren Laadan
                     ` (18 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Right now, we assume all normal files and directories
can be checkpointed.  However, as usual in the VFS, there
are specialized places that will always need an ability
to override these defaults.  We could do this completely
in the checkpoint code, but that would bitrot quickly.

This adds a new 'file_operations' function for
checkpointing a file.  I did this under the assumption
that we should have a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] and /proc patches, all
that we have to do to support a simple file type is
add a single "generic" f_op entry.
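
As a rough illustration of the dispatch this adds, here is a userspace
sketch, not the kernel code: 'struct demo_file', 'struct demo_file_ops'
and the demo_* functions are hypothetical stand-ins for 'struct file',
'struct file_operations' and the cr_* helpers. A missing hook means
"not checkpointable" and yields -EBADF, as cr_write_file() does in
this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct demo_file;

struct demo_file_ops {
	/* NULL means "this kind of file cannot be checkpointed" */
	int (*checkpoint)(struct demo_file *file);
};

struct demo_file {
	const struct demo_file_ops *f_op;
	int checkpointed;	/* set by the generic handler */
};

/* Plays the role of generic_file_checkpoint() */
int demo_generic_checkpoint(struct demo_file *file)
{
	file->checkpointed = 1;
	return 0;
}

/* Opt-in: a filesystem adds one line to its ops table */
const struct demo_file_ops demo_ext_ops = {
	.checkpoint = demo_generic_checkpoint,
};

/* No hook: a file type that nobody has taught to checkpoint */
const struct demo_file_ops demo_special_ops = {
	.checkpoint = NULL,
};

/* Mirrors the dispatch in cr_write_file(): refuse unless the hook exists */
int demo_write_file(struct demo_file *file)
{
	assert(file && file->f_op);
	if (!file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(file);
}
```

A file type opts in by filling in the hook; leaving it NULL keeps that
type unsupported without touching the checkpoint code.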

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 checkpoint/ckpt_file.c |   31 +++++++++++++++----------------
 include/linux/fs.h     |   11 +++++++++++
 2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index 9c344c7..0fe68bf 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -91,6 +91,11 @@ static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
 
 	hh->fd_type = CR_FD_GENERIC;
 
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+
 	ret = cr_write_obj(ctx, &h, hh);
 	if (ret < 0)
 		return ret;
@@ -98,12 +103,16 @@ static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
 	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
 }
 
+int generic_file_checkpoint(struct cr_ctx *ctx, struct file *file,
+			    struct cr_hdr_file *hh)
+{
+	return cr_write_file_generic(ctx, file, hh);
+}
+
 /* cr_write_file - dump the state of a given file pointer */
 static int cr_write_file(struct cr_ctx *ctx, struct file *file)
 {
 	struct cr_hdr_file *hh;
-	struct dentry *dent = file->f_dentry;
-	struct inode *inode = dent->d_inode;
 	int ret;
 
 	hh = cr_hbuf_get(ctx, sizeof(*hh));
@@ -116,21 +125,11 @@ static int cr_write_file(struct cr_ctx *ctx, struct file *file)
 	hh->f_version = file->f_version;
 	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
 
-	/*
-	 * FIXME: when we'll add support for unlinked files/dirs, we'll
-	 * need to distinguish between unlinked filed and unlinked dirs.
-	 */
-	switch (inode->i_mode & S_IFMT) {
-	case S_IFREG:
-	case S_IFDIR:
-		ret = cr_write_file_generic(ctx, file, hh);
-		break;
-	default:
-		ret = -EBADF;
-		break;
-	}
-	cr_hbuf_put(ctx, sizeof(*hh));
+	ret = -EBADF;
+	if (file->f_op->checkpoint)
+		ret = file->f_op->checkpoint(ctx, file, hh);
 
+	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3bf5057..835ee9e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1296,6 +1296,14 @@ int generic_osync_inode(struct inode *, struct address_space *, int);
 typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
 struct block_device_operations;
 
+#ifdef CONFIG_CHECKPOINT
+struct cr_ctx;
+struct cr_hdr_file;
+int generic_file_checkpoint(struct cr_ctx *, struct file *, struct cr_hdr_file *);
+#else
+#define generic_file_checkpoint NULL
+#endif
+
 /* These macros are for out of kernel modules to test that
  * the kernel supports the unlocked_ioctl and compat_ioctl
  * fields in struct file_operations. */
@@ -1334,6 +1342,9 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct cr_ctx *, struct file *file, struct cr_hdr_file *);
+#endif
 };
 
 struct inode_operations {
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 11/29] add generic checkpoint f_op to ext fses
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 12/29] Restore open file descriptors Oren Laadan
                     ` (17 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This marks ext[234] as being checkpointable.  Many more filesystems
will need the same treatment, but this is a start.

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    1 +
 6 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2999d72..4f1dd79 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -721,4 +721,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= ext2_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 45ed071..e1731c5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= ext2_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 5853f44..aa579e1 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 3be1e06..45f73fa 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -122,6 +122,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 2df2e40..9baf728 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f731cb5..bd84dce 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -156,6 +156,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 12/29] Restore open file descriptors
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 11/29] add generic checkpoint f_op to ext fses Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-13-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 13/29] External checkpoint of a task other than ourself Oren Laadan
                     ` (16 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and look up objref in the hash table; if not found (first occurrence), read
in 'struct cr_hdr_file', create a new FD and register the file in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.
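
The objref lookup described above can be sketched in userspace roughly
as follows. This is an illustration under assumptions, not the patch's
code: 'struct demo_file', 'struct demo_ctx' and demo_restore_fd() are
hypothetical names, a small direct-mapped array stands in for the
objhash, and the extra reference that the real hash keeps is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_MAX_OBJREF 16	/* arbitrary bound for the sketch */

/* Hypothetical stand-in for 'struct file' */
struct demo_file {
	int refcount;
};

/* Stand-in for the objref hash: a direct-mapped table is enough here */
struct demo_ctx {
	struct demo_file *by_ref[DEMO_MAX_OBJREF];
	int files_created;
};

/*
 * Return the file for 'objref': reuse it if this objref was seen before,
 * otherwise create it and register it. Either way the file gains one
 * reference, as it would for each FD attached to it.
 */
struct demo_file *demo_restore_fd(struct demo_ctx *ctx, int objref)
{
	struct demo_file *file;

	assert(ctx);
	if (objref <= 0 || objref >= DEMO_MAX_OBJREF)
		return NULL;

	file = ctx->by_ref[objref];
	if (!file) {
		/* first occurrence: "open" the file and register it */
		file = calloc(1, sizeof(*file));
		if (!file)
			return NULL;
		ctx->by_ref[objref] = file;
		ctx->files_created++;
	}
	file->refcount++;	/* one reference per restored FD */
	return file;
}
```

Two FD entries carrying the same objref end up sharing one file (and
therefore one file position), which is what a dup()'ed or inherited
descriptor needs after restart.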

Changelog[v14]:
  - Revert change to pr_debug(), back to cr_debug()
  - Rename:  cr_read_files() => cr_read_fd_table()
  - Rename:  cr_read_fd_data() => cr_read_file()
  - Discard field 'hh->parent'
  - Check whether calls to cr_hbuf_get() fail

Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  236 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 242 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 1d92ed2..607d864 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 665894f..da239fd 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -286,6 +286,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_fd_table(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..1031915
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,236 @@
+/*
+ *  Restart file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+/**
+ * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_get_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		get_file(file);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_file - restore the state of a given file pointer */
+static int cr_read_file(struct cr_ctx *ctx, int objref)
+{
+	struct cr_hdr_file *hh;
+	struct file *file;
+	int fd = 0;	/* pacify gcc warning */
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILE);
+	cr_debug("flags %#x mode %#x how %d\n",
+		 hh->f_flags, hh->f_mode, hh->fd_type);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_GENERIC:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		goto out;
+	}
+
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* adding <objref,file> to the hash will keep a reference to it */
+	ret = cr_obj_add_ref(ctx, file, objref, CR_OBJ_FILE, 0);
+	if (ret < 0) {
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		ret = fd;
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls cr_read_file to restore the file too.
+ */
+static int cr_read_fd_ent(struct cr_ctx *ctx)
+{
+	struct cr_hdr_fd_ent *hh;
+	struct file *file;
+	int newfd, ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	if (ret < 0)
+		goto out;
+
+	cr_debug("ref %d fd %d c.o.e %d\n",
+		 hh->objref, hh->fd, hh->close_on_exec);
+
+	ret = -EINVAL;
+	if (hh->objref <= 0 || hh->fd < 0)
+		goto out;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	if (file) {
+		/* reuse file pointer found in the hash table */
+		newfd = cr_attach_get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_file(ctx, hh->objref);
+	}
+
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't desired fd then reposition it */
+	if (newfd != hh->fd) {
+		ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_fd_table(struct cr_ctx *ctx)
+{
+	struct cr_hdr_fd_table *hh;
+	int i, ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_TABLE);
+	if (ret < 0)
+		goto out;
+
+	cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+
+	if (hh->nfds < 0 || hh->nfds > sysctl_nr_open) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < hh->nfds; i++) {
+		ret = cr_read_fd_ent(ctx);
+		if (ret < 0)
+			goto out;	/* don't let 'ret = 0' below hide the error */
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 9489ea5..ad4322d 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -89,6 +89,7 @@ extern int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_read_fd_table(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 13/29] External checkpoint of a task other than ourself
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 12/29] Restore open file descriptors Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-14-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 14/29] Checkpoint multiple processes Oren Laadan
                     ` (15 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container.

sys_restart() remains the same, as the restart is always done in the
context of the restarting task.

Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen

Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them

Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/checkpoint.c    |   73 ++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c       |    4 +-
 checkpoint/sys.c           |    6 +++
 include/linux/checkpoint.h |    2 +
 4 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index d4e0007..25229d3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -10,6 +10,8 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -242,6 +244,11 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -264,22 +271,84 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	/* verify that the task is frozen (unless self) */
+	if (task != current && !frozen(task))
+		return -EBUSY;
+
+	rcu_read_lock();
+	if (task_nsproxy(task)) {
+		nsproxy = task_nsproxy(task);
+		get_nsproxy(nsproxy);
+	}
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
 static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = cr_get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -294,7 +363,7 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, current);
+	ret = cr_write_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = cr_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index da239fd..96d4d45 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -302,7 +302,7 @@ static int cr_read_task(struct cr_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int cr_ctx_restart(struct cr_ctx *ctx)
+static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -311,7 +311,7 @@ int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = cr_ctx_restart(ctx);
+	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = cr_read_head(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index eef774e..b1c60b1 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -163,6 +164,11 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ad4322d..3a6cef9 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,8 @@ struct cr_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 14/29] Checkpoint multiple processes
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 13/29] External checkpoint of a task other than ourself Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-15-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 15/29] Restart " Oren Laadan
                     ` (14 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Checkpointing of multiple processes works by recording the tasks tree
structure below a given task (usually this task is the container init).

For a given task, do a DFS scan of the tasks tree and collect the tasks
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
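
The two-pass scan can be sketched in userspace like this. It is only an
illustration under assumptions: 'struct demo_task' and the demo_*
functions are hypothetical, explicit child/sibling pointers replace the
kernel's list heads, and both tasklist_lock and the per-task references
are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for task_struct: just the tree links and a pid */
struct demo_task {
	int pid;
	struct demo_task *parent;
	struct demo_task *first_child;
	struct demo_task *next_sibling;
};

/*
 * Walk the tree below 'root' in DFS (pre-order). With arr == NULL only
 * count; otherwise also store each task, failing with -EAGAIN if the
 * tree holds more than 'cap' tasks (it grew since it was counted).
 */
int demo_tree_scan(struct demo_task *root, struct demo_task **arr, int cap)
{
	struct demo_task *task = root;
	int nr = 0;

	assert(root);
	while (1) {
		if (arr) {
			if (nr == cap)
				return -EAGAIN;
			arr[nr] = task;
		}
		nr++;

		/* descend to the first child if there is one */
		if (task->first_child) {
			task = task->first_child;
			continue;
		}
		/* else climb until a sibling exists or we are back at root */
		while (task != root && !task->next_sibling)
			task = task->parent;
		if (task == root)
			break;
		task = task->next_sibling;
	}
	return nr;
}

/* Pass 1 counts, pass 2 fills; returns the count or a negative errno */
int demo_build_tree(struct demo_task *root, struct demo_task ***out)
{
	struct demo_task **arr;
	int n, m;

	n = demo_tree_scan(root, NULL, 0);
	arr = calloc(n, sizeof(*arr));
	if (!arr)
		return -ENOMEM;
	m = demo_tree_scan(root, arr, n);
	if (m != n) {		/* tree changed between the passes */
		free(arr);
		return m < 0 ? m : -EBUSY;
	}
	*out = arr;
	return n;
}
```

Because the array is in pre-order, every task appears after its parent,
so a restart can create the tasks by walking the array front to back.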

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies, as well as session ids.

Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
  - Revert change to pr_debug(), back to cr_debug()
  - Use only unsigned fields in checkpoint headers
  - Check retval of cr_tree_count_tasks() in cr_build_tree()
  - Discard 'h.parent' field
  - Check whether calls to cr_hbuf_get() fail

Changelog[v13]:
  - Release tasklist_lock in error path in cr_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in cr_write_pids()

Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/checkpoint.c        |  228 ++++++++++++++++++++++++++++++++++++++--
 checkpoint/sys.c               |   16 +++
 include/linux/checkpoint.h     |    3 +
 include/linux/checkpoint_hdr.h |   13 ++-
 4 files changed, 248 insertions(+), 12 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 25229d3..7f5eee6 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -244,11 +244,6 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
-	if (t->state == TASK_DEAD) {
-		pr_warning("c/r: task may not be in state TASK_DEAD\n");
-		return -EAGAIN;
-	}
-
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -271,6 +266,211 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int cr_write_all_tasks(struct cr_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		cr_debug("dumping task #%d\n", n);
+		ret = cr_write_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+static int cr_may_checkpoint_task(struct task_struct *t, struct cr_ctx *ctx)
+{
+	cr_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+		return -EAGAIN;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_READ))
+		return -EPERM;
+
+	/* verify that the task is frozen (unless self) */
+	if (t != current && !frozen(t))
+		return -EBUSY;
+
+	/* FIXME: change this for nested containers */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CR_HDR_PIDS_CHUNK	256
+
+static int cr_write_pids(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pids *hh;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int tasks_nr, n, pos = 0, ret = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	tasks_nr = ctx->tasks_nr;
+	BUG_ON(tasks_nr <= 0);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+	if (!hh)
+		return -ENOMEM;
+
+	do {
+		rcu_read_lock();
+		for (n = 0; n < min(tasks_nr, CR_HDR_PIDS_CHUNK); n++) {
+			task = tasks_arr[pos];
+
+			/* is this task cool ? */
+			ret = cr_may_checkpoint_task(task, ctx);
+			if (ret < 0) {
+				rcu_read_unlock();
+				goto out;
+			}
+			hh[n].vpid = task_pid_nr_ns(task, ns);
+			hh[n].vtgid = task_tgid_nr_ns(task, ns);
+			hh[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			cr_debug("task[%d]: vpid %d vtgid %d parent %d\n", pos,
+				 hh[n].vpid, hh[n].vtgid, hh[n].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(tasks_nr, CR_HDR_PIDS_CHUNK);
+		ret = cr_kwrite(ctx, hh, n * sizeof(*hh));
+		if (ret < 0)
+			break;
+
+		tasks_nr -= n;
+	} while (tasks_nr > 0);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill tasks into array) */
+static int cr_tree_count_tasks(struct cr_ctx *ctx)
+{
+	struct task_struct *root = ctx->root_task;
+	struct task_struct *task = root;
+	struct task_struct *parent = NULL;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int tasks_nr = ctx->tasks_nr;
+	int nr = 0;
+
+	read_lock(&tasklist_lock);
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		if (tasks_arr) {
+			/* unlikely... but if so then try again later */
+			if (nr == tasks_nr) {
+				nr = -EAGAIN;	/* cleanup in cr_ctx_free() */
+				break;
+			}
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+
+		nr++;
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+
+		if (task == root)
+			break;
+	}
+
+	read_unlock(&tasklist_lock);
+	return nr;
+}
+
+/*
+ * cr_build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->tasks_nr will hold the total count.
+ * The array is cleaned up by cr_ctx_free().
+ */
+static int cr_build_tree(struct cr_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = cr_tree_count_tasks(ctx);
+	if (n < 0)
+		return n;
+
+	ctx->tasks_nr = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = cr_tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in cr_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int cr_write_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tree *hh;
+	int ret;
+
+	h.type = CR_HDR_TREE;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	hh->tasks_nr = ctx->tasks_nr;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_write_pids(ctx);
+	return ret;
+}
+
 static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task = NULL;
@@ -288,7 +488,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!task)
 		goto out;
 
-#if 0	/* enable to use containers */
+#if 0	/* enable with containers */
 	if (!is_container_init(task)) {
 		err = -EINVAL;
 		goto out;
@@ -300,10 +500,6 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 		goto out;
 	}
 
-	/* verify that the task is frozen (unless self) */
-	if (task != current && !frozen(task))
-		return -EBUSY;
-
 	rcu_read_lock();
 	if (task_nsproxy(task)) {
 		nsproxy = task_nsproxy(task);
@@ -360,12 +556,22 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, ctx->root_task);
+	ret = cr_write_tree(ctx);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_write_all_tasks(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b1c60b1..8630144 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -152,6 +152,19 @@ void cr_hbuf_put(struct cr_ctx *ctx, int n)
  * restart operation, and persists until the operation is completed.
  */
 
+static void cr_task_arr_free(struct cr_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
@@ -164,6 +177,9 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->tasks_arr)
+		cr_task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3a6cef9..c946320 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -31,6 +31,9 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int tasks_nr;			/* size of tasks array */
+
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index a6b6dce..18c9f5d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -44,7 +44,8 @@ enum {
 	CR_HDR_STRING,
 	CR_HDR_FNAME,
 
-	CR_HDR_TASK = 101,
+	CR_HDR_TREE = 101,
+	CR_HDR_TASK,
 	CR_HDR_THREAD,
 	CR_HDR_CPU,
 
@@ -88,6 +89,16 @@ struct cr_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_tree {
+	__s32 tasks_nr;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pids {
+	__s32 vpid;
+	__s32 vtgid;
+	__s32 vppid;
+} __attribute__((aligned(8)));
+
 struct cr_hdr_task {
 	__u32 state;
 	__u32 exit_state;
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 15/29] Restart multiple processes
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 14/29] Checkpoint multiple processes Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-16-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup Oren Laadan
                     ` (13 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order in which the tasks were saved. The internals of
the syscall will take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx), and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them the
common restart context. Once everyone is ready, it begins to restart
itself.

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.
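
The start-up rendezvous described above can be pictured as a countdown:
the init task sets a counter to the number of other tasks, each arriving
task decrements it, and whoever brings it to zero signals the init task.
What follows is only a userspace sketch using C11 atomics in place of the
kernel's atomic_t and completion; the names 'rendezvous',
'rendezvous_init' and 'rendezvous_arrive' are hypothetical and do not
appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the ctx fields used at start-up. */
struct rendezvous {
	atomic_int tasks_count;	/* tasks that have not yet arrived */
	atomic_bool complete;	/* set once by the last arrival */
};

/* Called by the coordinator: expect (nr_tasks - 1) other tasks. */
static void rendezvous_init(struct rendezvous *r, int nr_tasks)
{
	assert(nr_tasks >= 1);
	atomic_init(&r->tasks_count, nr_tasks - 1);
	atomic_init(&r->complete, false);
}

/*
 * Called by every non-coordinator task on arrival.
 * Returns true only for the task whose arrival brought the count to zero.
 */
static bool rendezvous_arrive(struct rendezvous *r)
{
	/* atomic_fetch_sub() returns the value before the subtraction */
	if (atomic_fetch_sub(&r->tasks_count, 1) == 1) {
		atomic_store(&r->complete, true);
		return true;
	}
	return false;
}
```

In this model the coordinator would block until 'complete' becomes true;
the patch itself uses wait_for_completion_interruptible() for that step.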

When a task (init or not) completes its restart, it hands the control
over to the next in line, by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
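
The hand-off described above can be modelled in a few lines. This is a
single-threaded userspace sketch of the bookkeeping only (no sleeping or
waking); 'turn_state', 'turn_init', 'turn_is_mine' and 'turn_pass' are
hypothetical names chosen for illustration, not identifiers from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the ordering state kept in the ctx. */
struct turn_state {
	const int *pids;	/* pids in the order they were saved */
	size_t nr;		/* number of entries in pids[] */
	size_t pos;		/* index of the currently active task */
	bool done;		/* set once the last task has finished */
};

static void turn_init(struct turn_state *t, const int *pids, size_t nr)
{
	assert(nr > 0);
	t->pids = pids;
	t->nr = nr;
	t->pos = 0;		/* the first entry is the init task */
	t->done = false;
}

/* A task may restore itself only while it is the active one. */
static bool turn_is_mine(const struct turn_state *t, int pid)
{
	return !t->done && t->pids[t->pos] == pid;
}

/*
 * Called by the active task when it has finished restoring.
 * Returns the pid to wake next, or -1 if this was the last task.
 */
static int turn_pass(struct turn_state *t)
{
	if (t->pos + 1 == t->nr) {
		t->done = true;
		return -1;
	}
	t->pos++;
	return t->pids[t->pos];
}
```

A waiting task would sleep until turn_is_mine() holds for its own pid,
restore itself, and then call turn_pass() to find out whom to wake.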

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v14]:
  - Revert change to pr_debug(), back to cr_debug()
  - Discard field 'h.parent'
  - Check whether calls to cr_hbuf_get() fail

Changelog[v13]:
  - Clear root_task->checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restart_task() prototype
  - Remove unused member 'pids_err' from 'struct cr_ctx'

Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/restart.c       |  224 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   24 ++++-
 include/linux/sched.h      |    4 +
 4 files changed, 272 insertions(+), 14 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 96d4d45..adebc1c 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -301,30 +302,245 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh;
+	int size, ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	cr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	cr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(pid_t pid)
+{
+	struct task_struct *root_task;
+	struct cr_ctx *ctx = NULL;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+/**
+ * cr_wait_all_tasks_start - wait for all tasks to enter sys_restart()
+ * @ctx: checkpoint context
+ *
+ * Called by the container root to wait until all restarting tasks
+ * are ready to restore their state. Temporarily advertises the 'ctx'
+ * on 'current->checkpoint_ctx' so that others can grab a reference
+ * to it, and clears it once synchronization completes. See also the
+ * related code in do_restart_task().
+ */
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return ret;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 8630144..3a925ae 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -167,6 +167,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -185,6 +187,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -199,8 +203,10 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -215,6 +221,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -222,6 +229,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -251,7 +269,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -266,7 +284,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -274,15 +292,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c946320..cede30e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -12,8 +12,11 @@
 
 #include <linux/path.h>
 #include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -31,8 +34,7 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
@@ -40,6 +42,19 @@ struct cr_ctx {
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position in pids array */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of tasks: used to coordinate */
+	struct completion complete;	/* container root and other tasks on */
+	wait_queue_head_t waitq;	/* start, end, and restart ordering */
 };
 
 /* cr_ctx: flags */
@@ -52,6 +67,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 011db2f..3c14c93 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1417,6 +1417,10 @@ struct task_struct {
 	/* state flags for use by tracers */
 	unsigned long trace;
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+	struct cr_ctx *checkpoint_ctx;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 15/29] Restart " Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-17-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 17/29] Checkpoint open pipes Oren Laadan
                     ` (12 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

While file pointers are shared objects, they may share an underlying
object themselves. For instance, file pointers of both ends of a pipe
that share the same pipe inode. In this case, the shared entity to
handle is the inode that is shared among two file pointers (e.g. read-
and write- ends). In this sort of "nested sharing" we need only save
the underlying object once (upon first encounter) on checkpoint, and
restore it only once during restart.

To checkpoint a file descriptor of this sort, we first lookup the
inode in the hash table:

If not found, it is the first encounter of this inode. Here, besides
the file descriptor data, we also (a) register the inode in the hash
and save the corresponding 'objref' of this inode in '->fd_objref' of
the file descriptor. We then also (b) save the inode data, as per the
inode type (this is not implemented in this patch, as it depends on
the object). The file descriptor type will indicate the type of that
object (e.g. for a pipe, when supported, CR_FD_PIPE).

If found, it is the second encounter of this inode, e.g. in the case
of a pipe, as we hit the other end of the same pipe. At this point we
need only record the reference ('objref') to the inode that we had
saved before, and the file descriptor type is changed to CR_FD_OBJREF.
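
The checkpoint-side decision above reduces to "look the inode up,
register it if absent, and pick the descriptor type from the result".
Below is a minimal userspace sketch of that logic, assuming a small
fixed-size table in place of the real object hash; 'obj_table',
'obj_add_ptr', 'classify_fd' and the 'fd_kind' labels are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum fd_kind { FD_FIRST_SEEN, FD_OBJREF };	/* hypothetical labels */

#define MAX_OBJS 16

/* Hypothetical stand-in for the object hash: pointer -> objref. */
struct obj_table {
	const void *ptr[MAX_OBJS];
	int nr;
};

/*
 * Look up @ptr; register it on first encounter.
 * Stores the objref (1-based) in @objref and returns 1 if newly added,
 * 0 if already present, -1 if the table is full.
 */
static int obj_add_ptr(struct obj_table *t, const void *ptr, int *objref)
{
	int i;

	assert(ptr != NULL);

	for (i = 0; i < t->nr; i++) {
		if (t->ptr[i] == ptr) {
			*objref = i + 1;
			return 0;
		}
	}
	if (t->nr == MAX_OBJS)
		return -1;
	t->ptr[t->nr++] = ptr;
	*objref = t->nr;
	return 1;
}

/* Decide how a descriptor backed by shared object @inode is recorded. */
static int classify_fd(struct obj_table *t, const void *inode,
		       enum fd_kind *kind, int *objref)
{
	int new = obj_add_ptr(t, inode, objref);

	if (new < 0)
		return -1;
	*kind = new ? FD_FIRST_SEEN : FD_OBJREF;
	return 0;
}
```

On a first encounter the caller would additionally save the inode data;
on a later encounter only the 'objref' needs to be written.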

The logic during restart is similar: the '->fd_objref' is looked up in
the hash table. Unlike checkpoint, during restart the object that is
placed (and sought) in the hash table is the _file_ pointer, rather
than the _inode_.

If not found, it is the first encounter of this inode. Therefore we
(a) restore the inode data. Specifically, we construct a matching
object and end up with multiple file pointers (e.g. if the object is a
pipe, we will have both read- and write- ends). One of those is used
for the file descriptor in question; the other(s) will be deposited in
the hash table, to be retrieved and used later on. We also (b) register
the newly created inode in the hash table using the given 'objref'.

If found, then we can skip the setup of the underlying object that
is represented by the inode.

The type CR_FD_OBJREF indicates, on restart, that the corresponding
file descriptor is already setup and registered in the hash under the
'->fd_objref' that it had been assigned.
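
The restart-side counterpart can be sketched in the same spirit. This
userspace model assumes integer "handles" in place of file pointers and
a hypothetical make_pair() that builds both ends of a two-ended object;
none of these names ('ref_table', 'make_pair', 'restore_fd') come from
the patch. It also folds the two descriptor types into one lookup for
brevity.

```c
#include <assert.h>

#define MAX_REFS 16

/* Hypothetical stand-in for the restart-time hash: objref -> handle. */
struct ref_table {
	int handle[MAX_REFS];	/* 0 means "no entry" */
};

/* Hypothetical factory for a two-ended object such as a pipe. */
static void make_pair(int objref, int ends[2])
{
	ends[0] = objref * 10 + 1;	/* e.g. read end */
	ends[1] = objref * 10 + 2;	/* e.g. write end */
}

/*
 * Return the handle to install for a descriptor that carries @objref.
 * First encounter: build the object, keep one end, deposit the other.
 * Later encounter: take the end that was deposited earlier.
 */
static int restore_fd(struct ref_table *t, int objref)
{
	int ends[2];

	assert(objref > 0 && objref < MAX_REFS);

	if (t->handle[objref])
		return t->handle[objref];	/* already set up */

	make_pair(objref, ends);
	t->handle[objref] = ends[1];		/* deposit for later */
	return ends[0];
}
```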

The next two patches use CR_FD_OBJREF to implement support for pipes.

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/objhash.c           |   30 +++++++++++---
 checkpoint/rstr_file.c         |   84 ++++++++++++++++++++++++++++++---------
 include/linux/checkpoint.h     |    1 +
 include/linux/checkpoint_hdr.h |    9 +++-
 4 files changed, 94 insertions(+), 30 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 17f43fc..25916c1 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -35,20 +35,31 @@ static void cr_obj_ref_drop(struct cr_objref *obj)
 	case CR_OBJ_FILE:
 		fput((struct file *) obj->ptr);
 		break;
+	case CR_OBJ_INODE:
+		iput((struct inode *) obj->ptr);
+		break;
 	default:
 		BUG();
 	}
 }
 
-static void cr_obj_ref_grab(struct cr_objref *obj)
+static int cr_obj_ref_grab(struct cr_objref *obj)
 {
+	int ret = 0;
+
 	switch (obj->type) {
 	case CR_OBJ_FILE:
 		get_file((struct file *) obj->ptr);
 		break;
+	case CR_OBJ_INODE:
+		if (!igrab((struct inode *) obj->ptr))
+			ret = -EBADF;
+		break;
 	default:
 		BUG();
 	}
+
+	return ret;
 }
 
 static void cr_objhash_clear(struct cr_objhash *objhash)
@@ -144,16 +155,22 @@ static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
 				    unsigned short type, unsigned short flags)
 {
 	struct cr_objref *obj;
-	int i;
+	int i, ret;
 
 	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
 	if (!obj)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	obj->ptr = ptr;
 	obj->type = type;
 	obj->flags = flags;
 
+	ret = cr_obj_ref_grab(obj);
+	if (ret < 0) {
+		kfree(obj);
+		return ERR_PTR(ret);
+	}
+
 	if (objref) {
 		/* use @objref to index (restart) */
 		obj->objref = objref;
@@ -165,7 +182,6 @@ static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
 	}
 
 	hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
-	cr_obj_ref_grab(obj);
 	return obj;
 }
 
@@ -198,8 +214,8 @@ int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
 	obj = cr_obj_find_by_ptr(ctx, ptr);
 	if (!obj) {
 		obj = cr_obj_new(ctx, ptr, 0, type, flags);
-		if (!obj)
-			return -ENOMEM;
+		if (IS_ERR(obj))
+			return PTR_ERR(obj);
 		else
 			ret = 1;
 	} else if (obj->type != type)	/* sanity check */
@@ -229,7 +245,7 @@ int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
 	struct cr_objref *obj;
 
 	obj = cr_obj_new(ctx, ptr, objref, type, flags);
-	return obj ? 0 : -ENOMEM;
+	return IS_ERR(obj) ? PTR_ERR(obj) : 0;
 }
 
 /**
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
index 1031915..af9756b 100644
--- a/checkpoint/rstr_file.c
+++ b/checkpoint/rstr_file.c
@@ -65,6 +65,53 @@ static int cr_attach_get_file(struct file *file)
 	return fd;
 }
 
+/**
+ * cr_obj_add_file - register a file pointer of a given fd in hash table
+ * @ctx: checkpoint context
+ * @fd: file descriptor
+ * @objref: object reference
+ *
+ * Return the file pointer (will be safely referenced in the hash table)
+ */
+static struct file *cr_obj_add_file(struct cr_ctx *ctx, int fd, int objref)
+{
+	struct file *file;
+	int ret;
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+	ret = cr_obj_add_ref(ctx, file, objref, CR_OBJ_FILE, 0);
+	fput(file);
+	return (ret < 0 ? ERR_PTR(ret) : file);
+}
+
+/* return a new fd associated with the file referenced by @hh->fd_objref */
+static int cr_read_fd_objref(struct cr_ctx *ctx, struct cr_hdr_file *hh)
+{
+	struct file *file;
+
+	file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+	return cr_attach_get_file(file);
+}
+
+/* return a new fd associated with a new open file/directory */
+static int cr_read_fd_generic(struct cr_ctx *ctx, struct cr_hdr_file *hh)
+{
+	struct file *file;
+	int fd;
+
+	file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+	fd = cr_attach_file(file);
+	if (fd < 0)
+		filp_close(file, NULL);
+	return fd;
+}
+
 #define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
 
 /* cr_read_file - restore the state of a given file pointer */
@@ -72,8 +119,7 @@ static int cr_read_file(struct cr_ctx *ctx, int objref)
 {
 	struct cr_hdr_file *hh;
 	struct file *file;
-	int fd = 0;	/* pacify gcc warning */
-	int ret;
+	int fd, ret;
 
 	hh = cr_hbuf_get(ctx, sizeof(*hh));
 	if (!hh)
@@ -86,46 +132,44 @@ static int cr_read_file(struct cr_ctx *ctx, int objref)
 		goto out;
 
 	ret = -EINVAL;
+	if (hh->fd_objref < 0)
+		goto out;
 
 	/* FIX: more sanity checks on f_flags, f_mode etc */
 
 	switch (hh->fd_type) {
 	case CR_FD_GENERIC:
-		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		fd = cr_read_fd_generic(ctx, hh);
+		break;
+	case CR_FD_OBJREF:
+		fd = cr_read_fd_objref(ctx, hh);
 		break;
 	default:
 		goto out;
 	}
 
-	if (IS_ERR(file)) {
-		ret = PTR_ERR(file);
+	if (fd < 0) {
+		ret = fd;
 		goto out;
 	}
 
 	/* FIX: need to restore uid, gid, owner etc */
 
-	/* adding <objref,file> to the hash will keep a reference to it */
-	ret = cr_obj_add_ref(ctx, file, objref, CR_OBJ_FILE, 0);
-	if (ret < 0) {
-		filp_close(file, NULL);
-		goto out;
-	}
-
-	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
-	if (fd < 0) {
-		ret = fd;
-		filp_close(file, NULL);
+	/* register new <objref, file> tuple in hash table */
+	file = cr_obj_add_file(ctx, fd, objref);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
 		goto out;
 	}
 
-	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
-	if (ret < 0)
-		goto out;
 	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
 	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
 		ret = 0;
 
-	ret = 0;
+	if (ret < 0)
+		goto out;
+
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret < 0 ? ret : fd;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index cede30e..3be3902 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -74,6 +74,7 @@ extern void cr_ctx_put(struct cr_ctx *ctx);
 
 enum {
 	CR_OBJ_FILE = 1,
+	CR_OBJ_INODE,
 	CR_OBJ_MAX
 };
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 18c9f5d..9ad845d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -151,12 +151,15 @@ struct cr_hdr_fd_ent {
 
 /* fd types */
 enum  fd_type {
-	CR_FD_GENERIC = 1
+	CR_FD_OBJREF = 1,
+	CR_FD_GENERIC
 };
 
 struct cr_hdr_file {
-	__u16 fd_type;
-	__u16 f_mode;
+	__u32 fd_type;
+	__s32 fd_objref;
+
+	__u32 f_mode;
 	__u32 f_flags;
 	__u64 f_pos;
 	__u64 f_version;
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 17/29] Checkpoint open pipes
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-18-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 18/29] Restore " Oren Laadan
                     ` (11 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

A pipe is essentially a double-headed inode with a buffer attached to
it. We checkpoint the pipe buffer only once, as soon as we hit one
side of the pipe, regardless of whether it is the read- or write- end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first lookup the inode in the hash table:

If not found, it is the first encounter of this pipe. Besides the file
descriptor, we also (a) save the pipe data, and (b) register the pipe
inode in the hash. We save the 'objref' of the inode in '->fd_objref'
of the file descriptor. The file descriptor type becomes CR_FD_PIPE.

If found, it is the second encounter of this pipe, namely, as we hit
the other end of the same pipe. In this case we need only record the
reference ('objref') to the inode that we had saved before, and the
file descriptor type is changed to CR_FD_OBJREF.

The type CR_FD_PIPE will indicate to the kernel to create a new pipe;
since both ends are created at the same time, one end will be used,
and the other end will be deposited in the hash table for later use.
The type CR_FD_OBJREF will indicate that the corresponding file
descriptor is already setup and registered in the hash using the
'->fd_objref' that it had been assigned.

The format of the pipe data is as follows:

struct cr_hdr_fd_pipe {
       __u32 nr_bufs;
};

cr_hdr + cr_hdr_fd_ent
	cr_hdr + cr_hdr_fd_data
		cr_hdr + cr_hdr_fd_pipe		-> # buffers
			cr_hdr + cr_hdr_buffer	-> 1st buffer
			cr_hdr + cr_hdr_buffer	-> 2nd buffer
			cr_hdr + cr_hdr_buffer	-> 3rd buffer
			...
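
The nesting above amounts to a count record followed by one
header-plus-payload record per buffer. The following userspace sketch
serializes such a sequence into a byte array. The 'rec_hdr' layout, the
REC_* values and the helper names are assumptions made for illustration;
the real 'struct cr_hdr' and CR_HDR_* constants may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the real 'struct cr_hdr' may differ. */
struct rec_hdr {
	int16_t type;
	int16_t len;	/* payload length in bytes */
};

enum { REC_FD_PIPE = 1, REC_BUFFER = 2 };	/* illustrative values only */

/* Append one header+payload record; returns bytes written or -1. */
static int put_rec(uint8_t *out, size_t cap, size_t *pos,
		   int16_t type, const void *payload, int16_t len)
{
	struct rec_hdr h = { .type = type, .len = len };

	if (len < 0 || *pos + sizeof(h) + (size_t)len > cap)
		return -1;
	memcpy(out + *pos, &h, sizeof(h));
	memcpy(out + *pos + sizeof(h), payload, (size_t)len);
	*pos += sizeof(h) + (size_t)len;
	return (int)(sizeof(h) + (size_t)len);
}

/* Emit the pipe record (buffer count), then one record per buffer. */
static int put_pipe(uint8_t *out, size_t cap, size_t *pos,
		    const char *const bufs[], uint32_t nr_bufs)
{
	uint32_t i;

	if (put_rec(out, cap, pos, REC_FD_PIPE, &nr_bufs,
		    (int16_t)sizeof(nr_bufs)) < 0)
		return -1;
	for (i = 0; i < nr_bufs; i++) {
		size_t len = strlen(bufs[i]);

		if (len > INT16_MAX ||
		    put_rec(out, cap, pos, REC_BUFFER, bufs[i],
			    (int16_t)len) < 0)
			return -1;
	}
	return 0;
}
```

A reader of this stream would first read the count record and then loop
exactly that many times over buffer records.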

Changelog[v14]:
  - Use 'fd_type' instead of 'hh->fd_objref' in cr_write_fd_data()
  - Revert change to pr_debug(), back to cr_debug()
  - Discard the 'h.parent' field
  - Check whether calls to cr_hbuf_get() fail
  - Test that a pipe's inode != ctx->file's inode to prevent deadlock

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/ckpt_file.c         |    2 +
 fs/pipe.c                      |  113 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    8 +++-
 3 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index 0fe68bf..dd26b3d 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -72,6 +73,7 @@ int cr_scan_fds(struct files_struct *files, int **fdtable)
 	return n;
 }
 
+
 static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
 				 struct cr_hdr_file *hh)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 14f502b..0c3f391 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -22,6 +22,9 @@
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
 
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
 /*
  * We use a start+len construction, which provides full use of the 
  * allocated memory.
@@ -771,6 +774,113 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+/* cr_write_pipebuf - dump contents of a pipe/fifo (assume i_mutex taken) */
+static int cr_write_pipebuf(struct cr_ctx *ctx, struct pipe_inode_info *pipe)
+{
+	struct cr_hdr h;
+	void *kbuf, *addr;
+	int i, ret = 0;
+
+	kbuf = (void *) __get_free_page(GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	/* this is a simplified fs/pipe.c:read_pipe() */
+
+	for (i = 0; i < pipe->nrbufs; i++) {
+		int nn = (pipe->curbuf + i) & (PIPE_BUFFERS-1);
+		struct pipe_buffer *pbuf = pipe->bufs + nn;
+		const struct pipe_buf_operations *ops = pbuf->ops;
+
+		ret = ops->confirm(pipe, pbuf);
+		if (ret < 0)
+			break;
+
+		addr = ops->map(pipe, pbuf, 1);
+		memcpy(kbuf, addr + pbuf->offset, pbuf->len);
+		ops->unmap(pipe, pbuf, addr);
+
+		h.type = CR_HDR_BUFFER;
+		h.len = pbuf->len;
+
+		ret = cr_write_obj(ctx, &h, kbuf);
+		if (ret < 0)
+			break;
+	}
+
+	free_page((unsigned long) kbuf);
+	return ret;
+}
+
+/* cr_write_pipe - dump pipe (assume i_mutex taken) */
+static int cr_write_pipe(struct cr_ctx *ctx, struct inode *inode)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_pipe *hh;
+	struct pipe_inode_info *pipe = inode->i_pipe;
+	int ret;
+
+	h.type = CR_HDR_FD_PIPE;
+	h.len = sizeof(*hh);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	hh->nr_bufs = pipe->nrbufs;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_pipebuf(ctx, pipe);
+}
+
+static int pipe_file_checkpoint(struct cr_ctx *ctx,
+				struct file *file, struct cr_hdr_file *hh)
+{
+	struct cr_hdr h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int new, objref;
+	int ret;
+
+	/*
+	 * We take the inode's mutex and later will call vfs_write(),
+	 * which also takes an inode's mutex. To avoid deadlock, make
+	 * sure that the two inodes are distinct.
+	 */
+	if (ctx->file->f_dentry->d_inode == inode) {
+		pr_warning("c/r: writing to pipe that is checkpointed "
+			   "may result in a deadlock ... aborting\n");
+		return -EDEADLK;
+	}
+
+	h.type = CR_HDR_FILE;
+	h.len = sizeof(*hh);
+
+	new = cr_obj_add_ptr(ctx, inode, &objref, CR_OBJ_INODE, 0);
+	cr_debug("objref %d inode %p new %d\n", objref, inode, new);
+	if (new < 0)
+		return new;
+
+	hh->fd_type = (new ? CR_FD_PIPE : CR_FD_OBJREF);
+	hh->fd_objref = objref;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		return ret;
+
+	if (new) {
+		mutex_lock(&inode->i_mutex);
+		ret = cr_write_pipe(ctx, inode);
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	return ret;
+}
+
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -787,6 +897,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open		= pipe_read_open,
 	.release	= pipe_read_release,
 	.fasync		= pipe_read_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations write_pipefifo_fops = {
@@ -799,6 +910,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open		= pipe_write_open,
 	.release	= pipe_write_release,
 	.fasync		= pipe_write_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations rdwr_pipefifo_fops = {
@@ -812,6 +924,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open		= pipe_rdwr_open,
 	.release	= pipe_rdwr_release,
 	.fasync		= pipe_rdwr_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 9ad845d..ce5d880 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -57,6 +57,7 @@ enum {
 	CR_HDR_FD_TABLE = 301,
 	CR_HDR_FD_ENT,
 	CR_HDR_FILE,
+	CR_HDR_FD_PIPE,
 
 	CR_HDR_TAIL = 5001
 };
@@ -152,7 +153,8 @@ struct cr_hdr_fd_ent {
 /* fd types */
 enum  fd_type {
 	CR_FD_OBJREF = 1,
-	CR_FD_GENERIC
+	CR_FD_GENERIC,
+	CR_FD_PIPE,
 };
 
 struct cr_hdr_file {
@@ -165,4 +167,8 @@ struct cr_hdr_file {
 	__u64 f_version;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_fd_pipe {
+	__s32 nr_bufs;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 18/29] Restore open pipes
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 17/29] Checkpoint open pipes Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-19-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 19/29] Record 'struct file' object instead of the file name for VMAs Oren Laadan
                     ` (10 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

When seeing a CR_FD_PIPE file type, we create a new pipe and thus
have two file pointers (read- and write- ends). We only use one of
them, depending on which side was checkpointed first. We register the
file pointer of the other end in the hash table, with the 'objref'
given for this pipe from the checkpoint, deposited for later use. At
this point we also restore the contents of the pipe buffers.

When the other end arrives, it will have file type CR_FD_OBJREF. We
will then use the corresponding 'objref' to retrieve the file pointer
from the hash table, and attach it to the process.

Note the difference from the checkpoint logic: during checkpoint we
placed the _inode_ of the pipe in the hash table, while during restart
we place the resulting _file_ in the hash table.

We restore the pipe contents by manually allocating buffers and
attaching them to the pipe (alternatively, we could read the data from
the image file and then write it into the pipe, or use the splice()
syscall).
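
The two-step restore described above can be sketched in userspace as
follows. This is only an analogy under stated assumptions: the table
holds plain fds rather than 'struct file' references, and every sk_*
name is hypothetical. Only pipe(2) is a real interface here.

```c
#include <unistd.h>

#define SK_MAX_OBJS 16

/* Hypothetical objref -> fd table standing in for the restart hash. */
struct sk_objhash {
	int objref[SK_MAX_OBJS];
	int fd[SK_MAX_OBJS];
	int count;
};

/* Return the fd deposited for 'objref', or -1 if it was never seen. */
int sk_lookup(const struct sk_objhash *h, int objref)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->objref[i] == objref)
			return h->fd[i];
	}
	return -1;
}

/*
 * Restore one end of a pipe.
 * First encounter: create the pipe, deposit the other end under
 * 'objref', and return the end selected by 'want_write'.
 * Later encounter: return the end deposited earlier.
 */
int sk_restore_pipe_end(struct sk_objhash *h, int objref, int want_write)
{
	int fds[2], which, fd;

	fd = sk_lookup(h, objref);
	if (fd >= 0)
		return fd;

	if (h->count == SK_MAX_OBJS || pipe(fds) < 0)
		return -1;

	which = want_write ? 1 : 0;
	h->objref[h->count] = objref;
	h->fd[h->count] = fds[1 - which];
	h->count++;
	return fds[which];
}
```

Calling sk_restore_pipe_end() twice with the same objref yields the two
ends of a single pipe, whichever end is requested first.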

Changelog[v14]:
  - Discard the 'h.parent' field
  - Check whether calls to cr_hbuf_get() fail

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint_file.h |    2 +
 checkpoint/rstr_file.c       |  124 +++++++++++++++++++++++++++++++++++++++++-
 fs/pipe.c                    |    2 +-
 3 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
index 9dc3eba..cac1a5d 100644
--- a/checkpoint/checkpoint_file.h
+++ b/checkpoint/checkpoint_file.h
@@ -14,4 +14,6 @@
 
 int cr_scan_fds(struct files_struct *files, int **fdtable);
 
+extern const struct pipe_buf_operations anon_pipe_buf_ops;
+
 #endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
index af9756b..db2fcf2 100644
--- a/checkpoint/rstr_file.c
+++ b/checkpoint/rstr_file.c
@@ -14,6 +14,8 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pagemap.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -86,6 +88,121 @@ static struct file *cr_obj_add_file(struct cr_ctx *ctx, int fd, int objref)
 	return (ret < 0 ? ERR_PTR(ret) : file);
 }
 
+/* cr_read_pipebuf - restore contents of a pipe/fifo (assume i_mutex taken) */
+static int
+cr_read_pipebuf(struct cr_ctx *ctx, struct pipe_inode_info *pipe, int nbufs)
+{
+	void *kbuf, *addr;
+	int i, ret = 0;
+
+	kbuf = (void *) __get_free_page(GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	for (i = 0; i < nbufs; i++) {
+		struct pipe_buffer *pbuf = pipe->bufs + i;
+		struct page *page;
+		int len = PAGE_SIZE;
+
+		ret = cr_read_buffer(ctx, kbuf, &len);
+		if (ret < 0)
+			break;
+		page = alloc_page(GFP_HIGHUSER);
+		if (!page) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		addr = kmap_atomic(page, KM_USER0);
+		memcpy(addr, kbuf, len);
+		kunmap_atomic(addr, KM_USER0);
+
+		pbuf->page = page;
+		pbuf->ops = &anon_pipe_buf_ops;
+		pbuf->offset = 0;
+		pbuf->len = len;
+		pipe->nrbufs++;
+		pipe->tmp_page = NULL;
+	}
+
+	free_page((unsigned long) kbuf);
+	return ret;
+}
+
+/* cr_read_pipe - restore pipe (assume i_mutex taken) */
+static int cr_read_pipe(struct cr_ctx *ctx, int pipefd)
+{
+	struct cr_hdr_fd_pipe *hh;
+	struct file *file;
+	struct inode *inode;
+	int nbufs, ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_PIPE);
+	nbufs = hh->nr_bufs;
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	if (ret < 0)
+		return ret;
+	if (nbufs < 0 || nbufs > PIPE_BUFFERS)
+		return -EINVAL;
+
+	file = fget(pipefd);
+	if (!file)
+		return -EIO;
+
+	inode = file->f_dentry->d_inode;
+	mutex_lock(&inode->i_mutex);
+	ret = cr_read_pipebuf(ctx, inode->i_pipe, nbufs);
+	mutex_unlock(&inode->i_mutex);
+
+	fput(file);
+	return ret;
+}
+
+/* restore a pipe */
+static int cr_read_fd_pipe(struct cr_ctx *ctx, struct cr_hdr_file *hh)
+{
+	struct file *file;
+	int fds[2], which, ret;
+
+	file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+	else if (file)
+		return cr_attach_get_file(file);
+
+	/* first encounter of this pipe: create it */
+	ret = do_pipe(fds);
+	if (ret < 0)
+		return ret;
+
+	which = (hh->f_flags & O_WRONLY ? 1 : 0);
+
+	/*
+	 * Below we return the fd corresponding to one side of the pipe
+	 * for our caller to use. Now register the other side of the pipe
+	 * in the hash, to be picked up when that side is to be restored.
+	 */
+	file = cr_obj_add_file(ctx, fds[1-which], hh->fd_objref);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	ret = cr_read_pipe(ctx, fds[which]);
+ out:
+	sys_close(fds[1-which]);	/* this side isn't used anymore */
+	if (ret < 0)
+		sys_close(fds[which]);
+	else
+		ret = fds[which];
+	return ret;
+}
+
 /* return a new fd associated with a the file referenced by @hh->objref */
 static int cr_read_fd_objref(struct cr_ctx *ctx, struct cr_hdr_file *hh)
 {
@@ -138,11 +255,14 @@ static int cr_read_file(struct cr_ctx *ctx, int objref)
 	/* FIX: more sanity checks on f_flags, f_mode etc */
 
 	switch (hh->fd_type) {
+	case CR_FD_OBJREF:
+		fd = cr_read_fd_objref(ctx, hh);
+		break;
 	case CR_FD_GENERIC:
 		fd = cr_read_fd_generic(ctx, hh);
 		break;
-	case CR_FD_OBJREF:
-		fd = cr_read_fd_objref(ctx, hh);
+	case CR_FD_PIPE:
+		fd = cr_read_fd_pipe(ctx, hh);
 		break;
 	default:
 		goto out;
diff --git a/fs/pipe.c b/fs/pipe.c
index 0c3f391..0636f5f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -271,7 +271,7 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *info,
 	return 0;
 }
 
-static const struct pipe_buf_operations anon_pipe_buf_ops = {
+const struct pipe_buf_operations anon_pipe_buf_ops = {
 	.can_merge = 1,
 	.map = generic_pipe_buf_map,
 	.unmap = generic_pipe_buf_unmap,
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 19/29] Record 'struct file' object instead of the file name for VMAs
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (17 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 18/29] Restore " Oren Laadan
@ 2009-03-31  5:28   ` Oren Laadan
       [not found]     ` <1238477349-11029-20-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 20/29] Prepare to support shared memory Oren Laadan
                     ` (9 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

The vma->vm_file can be an arbitrary file pointer, including one that
is in use by a process as well and provided originally via the mmap()
syscall.

Thus, when dumping the state of a VMA, save a file object instead
of only the file name. As with other file objects, if it is seen for
the first time it is dumped entirely; otherwise only the 'objref' is
saved. The restart logic is updated accordingly.
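
The "dump once, reference afterwards" rule can be illustrated with a
small pointer table. This is a sketch, not the objhash implementation:
sk_obj_add_ptr() is a hypothetical stand-in that only mirrors the
convention of returning 1 for a newly added object and 0 for a known
one, and the 1-based objref numbering is an assumption.

```c
#include <stddef.h>

#define SK_MAX_PTRS 32

/* Hypothetical table of pointers already seen during checkpoint. */
struct sk_ptrhash {
	const void *ptr[SK_MAX_PTRS];
	int count;
};

/*
 * Look up 'ptr' and store its 1-based reference in '*objref'.
 * Returns 1 if the pointer was newly added (caller dumps the full
 * object), 0 if it was already known (caller saves only the objref),
 * or -1 if the table is full.
 */
int sk_obj_add_ptr(struct sk_ptrhash *h, const void *ptr, int *objref)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->ptr[i] == ptr) {
			*objref = i + 1;
			return 0;
		}
	}
	if (h->count == SK_MAX_PTRS)
		return -1;

	h->ptr[h->count++] = ptr;
	*objref = h->count;
	return 1;
}
```

Two VMAs mapping the same file therefore produce one full file record
and two VMA records that carry the same objref.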

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c        |    5 +++
 checkpoint/ckpt_file.c         |    2 +-
 checkpoint/ckpt_mem.c          |   49 ++++++++++++++++---------
 checkpoint/rstr_file.c         |    2 +-
 checkpoint/rstr_mem.c          |   79 +++++++++++++++++++++++++++-------------
 include/linux/checkpoint.h     |    2 +
 include/linux/checkpoint_hdr.h |    2 +-
 7 files changed, 94 insertions(+), 47 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 7f5eee6..ef35754 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -118,6 +118,11 @@ int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
 	char *buf, *fname;
 	int ret, flen;
 
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
 	flen = PATH_MAX;
 	buf = kmalloc(flen, GFP_KERNEL);
 	if (!buf)
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index dd26b3d..5de83ab 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -112,7 +112,7 @@ int generic_file_checkpoint(struct cr_ctx *ctx, struct file *file,
 }
 
 /* cr_write_file - dump the state of a given file pointer */
-static int cr_write_file(struct cr_ctx *ctx, struct file *file)
+int cr_write_file(struct cr_ctx *ctx, struct file *file)
 {
 	struct cr_hdr_file *hh;
 	int ret;
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
index 3d3c5f5..7a10e03 100644
--- a/checkpoint/ckpt_mem.c
+++ b/checkpoint/ckpt_mem.c
@@ -447,7 +447,10 @@ static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
 {
 	struct cr_hdr h;
 	struct cr_hdr_vma *hh;
-	int vma_type, ret;
+	int vma_type;
+	int objref = 0;
+	int new = 0;
+	int ret;
 
 	h.type = CR_HDR_VMA;
 	h.len = sizeof(*hh);
@@ -467,36 +470,46 @@ static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
 
 	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
 		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
-		cr_hbuf_put(ctx, sizeof(*hh));
-		return -ENOSYS;
+		ret = -ENOSYS;
+		goto out;
 	}
 
-	/* by default assume anon memory */
-	vma_type = CR_VMA_ANON;
+	vma_type = CR_VMA_ANON;  /* by default assume anon memory */
 
-	/*
-	 * if there is a backing file, assume private-mapped
-	 * (FIXME: check if the file is unlinked)
-	 */
 	if (vma->vm_file)
-		vma_type = CR_VMA_FILE;
+		vma_type = CR_VMA_FILE;		/* assume private-mapped */
+
+	/* if file-backed, add 'file' to the hash (will keep a reference) */
+	if (vma->vm_file) {
+		new = cr_obj_add_ptr(ctx, vma->vm_file,
+				     &objref, CR_OBJ_FILE, 0);
+		cr_debug("vma %p objref %d file %p\n",
+			 vma, objref, vma->vm_file);
+		if (new < 0) {
+			ret = new;
+			goto out;
+		}
+	}
 
 	hh->vma_type = vma_type;
+	hh->vma_objref = objref;
 
 	ret = cr_write_obj(ctx, &h, hh);
-	cr_hbuf_put(ctx, sizeof(*hh));
 	if (ret < 0)
-		return ret;
+		goto out;
 
-	/* save the file name */
-	/* FIXME: files should be deposited and sought in the objhash */
-	if (vma->vm_file) {
-		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new) {
+		ret = cr_write_file(ctx, vma->vm_file);
 		if (ret < 0)
-			return ret;
+			goto out;
 	}
 
-	return cr_write_private_vma_contents(ctx, vma);
+	ret = cr_write_private_vma_contents(ctx, vma);
+
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
 }
 
 int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
index db2fcf2..de260dc 100644
--- a/checkpoint/rstr_file.c
+++ b/checkpoint/rstr_file.c
@@ -232,7 +232,7 @@ static int cr_read_fd_generic(struct cr_ctx *ctx, struct cr_hdr_file *hh)
 #define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
 
 /* cr_read_file - restore the state of a given file pointer */
-static int cr_read_file(struct cr_ctx *ctx, int objref)
+int cr_read_file(struct cr_ctx *ctx, int objref)
 {
 	struct cr_hdr_file *hh;
 	struct file *file;
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
index 003b391..a72189b 100644
--- a/checkpoint/rstr_mem.c
+++ b/checkpoint/rstr_mem.c
@@ -18,6 +18,7 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 #include <linux/err.h>
+#include <linux/syscalls.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -204,6 +205,40 @@ static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
 	return vm_flags;
 }
 
+/*
+ * cr_vma_read_file - prepare a file object required for a vma
+ * @ctx - restart context
+ * @objref - objref of file object
+ *
+ * If the file object is found, will grab a reference to the pointer
+ * that the caller will need to release.
+ */
+static struct file *cr_vma_read_file(struct cr_ctx *ctx, int objref)
+{
+	struct file *file;
+	int fd;
+
+	file = cr_obj_get_by_ref(ctx, objref, CR_OBJ_FILE);
+	if (IS_ERR(file))
+		return file;
+
+	/* if object found in objhash - use it */
+	if (file) {
+		get_file(file);
+		return file;
+	}
+
+	/* get (or construct) the respective file object */
+	fd = cr_read_file(ctx, objref);
+	if (fd < 0)
+		return ERR_PTR(fd);
+
+	file = fget(fd);
+	sys_close(fd);
+
+	return file;
+}
+
 /**
  * cr_read_vma - read vma data, recreate it and read contents
  * @ctx: checkpoint context
@@ -225,14 +260,16 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 		return -ENOMEM;
 	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
 	if (ret < 0)
-		goto err;
+		goto out;
 
 	cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
 		 (unsigned long) hh->vm_end, (int) hh->vma_type);
 
 	ret = -EINVAL;
 	if (hh->vm_end < hh->vm_start)
-		goto err;
+		goto out;
+	if (hh->vma_objref <= 0)
+		goto out;
 
 	vm_start = hh->vm_start;
 	vm_pgoff = hh->vm_pgoff;
@@ -241,13 +278,11 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
 	vma_type = hh->vma_type;
 
-	cr_hbuf_put(ctx, sizeof(*hh));
-
 	switch (vma_type) {
 
 	case CR_VMA_ANON:		/* anonymous private mapping */
 		if (vm_flags & VM_SHARED)
-			goto err;
+			goto out;
 		/*
 		 * vm_pgoff for anonymous mapping is the "global" page
 		 * offset (namely from addr 0x0), so we force a zero
@@ -257,23 +292,20 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 
 	case CR_VMA_FILE:		/* private mapping from a file */
 		if (vm_flags & VM_SHARED)
-			goto err;
-		/*
-		 * for private mapping using 'read-only' is sufficient
-		 */
-		file = cr_read_open_fname(ctx, O_RDONLY, 0);
+			goto out;
+		file = cr_vma_read_file(ctx, hh->vma_objref);
 		if (IS_ERR(file)) {
 			ret = PTR_ERR(file);
-			goto err;
+			file = NULL;
+			goto out;
 		}
 		break;
 
 	default:
-		goto err;
+		goto out;
 
 	}
 
-
 	down_write(&mm->mmap_sem);
 	addr = do_mmap_pgoff(file, vm_start, vm_size,
 			     vm_prot, vm_flags, vm_pgoff);
@@ -281,12 +313,10 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
 		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
 
-	/* the file (if opened) is now referenced by the vma */
-	if (file)
-		filp_close(file, NULL);
-
-	if (IS_ERR((void *) addr))
-		return PTR_ERR((void *) addr);
+	if (IS_ERR((void *) addr)) {
+		ret = PTR_ERR((void *) addr);
+		goto out;
+	}
 
 	/*
 	 * CR_VMA_ANON: read in memory as is
@@ -302,14 +332,11 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 		break;
 	}
 
-	if (ret < 0)
-		return ret;
-
-	cr_debug("vma retval %d\n", ret);
-	return 0;
-
- err:
+ out:
+	if (file)
+		fput(file);
 	cr_hbuf_put(ctx, sizeof(*hh));
+	cr_debug("vma retval %d\n", ret);
 	return ret;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3be3902..69d14c4 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -110,10 +110,12 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_file(struct cr_ctx *ctx, struct file *file);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
 extern int cr_read_fd_table(struct cr_ctx *ctx);
+extern int cr_read_file(struct cr_ctx *ctx, int objref);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ce5d880..8623d3b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -126,7 +126,7 @@ enum cr_vma_type {
 
 struct cr_hdr_vma {
 	__u32 vma_type;
-	__u32 _padding;
+	__u32 vma_objref;	/* for vma->vm_file */
 
 	__u64 vm_start;
 	__u64 vm_end;
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 20/29] Prepare to support shared memory
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (18 preceding siblings ...)
  2009-03-31  5:28   ` [RFC v14-rc2][PATCH 19/29] Record 'struct file' object instead of the file name for VMAs Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- " Oren Laadan
                     ` (8 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.

Handling of shared memory depends on the type of a vma; to classify a
vma we extend 'struct vm_operations_struct' with a new function,
'cr_vma_type()', through which a vma reports an integer that
reflects its type.

mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h
- 'struct vm_operations_struct' extended with '->cr_vma_type' function
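
A minimal sketch of how such an optional callback might be consulted,
assuming a caller that falls back to a default type when the operation
is absent. The sk_* types, the enum values and the fallback policy are
illustrative assumptions, not taken from the patch.

```c
#include <stddef.h>

struct sk_vma;

/* Hypothetical ops table with one optional classification hook. */
struct sk_vm_ops {
	int (*cr_vma_type)(struct sk_vma *vma);
};

struct sk_vma {
	const struct sk_vm_ops *ops;
};

/* Illustrative type values only. */
enum { SK_VMA_ANON = 1, SK_VMA_SHM_ANON = 3 };

/* Example hook, as a shmem-like mapping might provide it. */
int sk_shmem_vma_type(struct sk_vma *vma)
{
	(void) vma;
	return SK_VMA_SHM_ANON;
}

/* Ask the vma for its type if it can tell; otherwise use 'fallback'. */
int sk_classify(struct sk_vma *vma, int fallback)
{
	if (vma->ops && vma->ops->cr_vma_type)
		return vma->ops->cr_vma_type(vma);
	return fallback;
}
```

Keeping the hook optional means mappings that know nothing about
checkpoint need no changes.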

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/mm.h |   14 ++++++++++++++
 mm/shmem.c         |   15 ++-------------
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..e9bdc00 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -218,6 +218,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*cr_vma_type)(struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
@@ -323,6 +326,17 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+	SGP_READ,	/* don't exceed i_size, don't allocate page */
+	SGP_CACHE,	/* don't exceed i_size, may allocate page */
+	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
+	SGP_WRITE,	/* may exceed i_size, may allocate page */
+};
+
+extern int shmem_getpage(struct inode *inode, unsigned long idx,
+			 struct page **pagep, enum sgp_type sgp, int *type);
+
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
diff --git a/mm/shmem.c b/mm/shmem.c
index 4103a23..53118f0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -83,14 +83,6 @@ static struct vfsmount *shm_mnt;
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
-	SGP_READ,	/* don't exceed i_size, don't allocate page */
-	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
-	SGP_WRITE,	/* may exceed i_size, may allocate page */
-};
-
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -103,9 +95,6 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			 struct page **pagep, enum sgp_type sgp, int *type);
-
 static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
 {
 	/*
@@ -1187,8 +1176,8 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			struct page **pagep, enum sgp_type sgp, int *type)
+int shmem_getpage(struct inode *inode, unsigned long idx,
+		  struct page **pagep, enum sgp_type sgp, int *type)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- shared memory
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (19 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 20/29] Prepare to support shared memory Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
       [not found]     ` <1238477349-11029-22-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 22/29] Restore " Oren Laadan
                     ` (7 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend cr_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CR_VMA_SHM_FILE.

Anonymous shared memory is always backed by an inode in the shmem filesystem.
We use that inode to look up the VMA in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CR_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CR_VMA_SHM_ANON_SKIP and skip it.

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have
been partially re-mapped. The format itself remains the same as for
private VMAs, except that instead of addresses we record _indices_
(page numbers) into the backing inode.
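
For reference, the usual relation between a virtual address inside a
mapping and the page index in its backing object can be sketched as
below. This is a standalone illustration: the sk_* names are
hypothetical and 4 KiB pages are assumed.

```c
#define SK_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Only the fields needed for the conversion. */
struct sk_vma {
	unsigned long vm_start;	/* first address of the mapping */
	unsigned long vm_end;	/* first address past the mapping */
	unsigned long vm_pgoff;	/* offset into the object, in pages */
};

/* Page index in the backing object for an address inside the vma. */
unsigned long sk_addr_to_index(const struct sk_vma *vma, unsigned long addr)
{
	return ((addr - vma->vm_start) >> SK_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Inverse: address inside the vma for a page index it covers. */
unsigned long sk_index_to_addr(const struct sk_vma *vma, unsigned long idx)
{
	return vma->vm_start + ((idx - vma->vm_pgoff) << SK_PAGE_SHIFT);
}
```

Recording indices keeps the saved contents independent of where, or how
partially, each process happens to map the object.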

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/ckpt_mem.c          |  273 ++++++++++++++++++++++++++++++++++------
 checkpoint/rstr_mem.c          |    4 +
 include/linux/checkpoint.h     |    2 +
 include/linux/checkpoint_hdr.h |   15 ++-
 mm/shmem.c                     |   11 ++
 5 files changed, 266 insertions(+), 39 deletions(-)

diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
index 7a10e03..9315d1b 100644
--- a/checkpoint/ckpt_mem.c
+++ b/checkpoint/ckpt_mem.c
@@ -13,6 +13,7 @@
 #include <linux/slab.h>
 #include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/swap.h>
 #include <linux/mm_types.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -182,11 +183,11 @@ void cr_pgarr_reset_all(struct cr_ctx *ctx)
 
 
 /**
- * cr_private_follow_page - return page pointer for dirty pages
+ * cr_consider_private_page - return page pointer for dirty pages
  * @vma - target vma
  * @addr - page address
  *
- * Looks up the page that correspond to the address in the vma, and
+ * Looks up the page that corresponds to the address in the vma, and
  * returns the page if it was modified (and grabs a reference to it),
  * or otherwise returns NULL (or error).
  *
@@ -252,25 +253,79 @@ cr_consider_private_page(struct vm_area_struct *vma, unsigned long addr)
 }
 
 /**
- * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * cr_consider_shared_page - return page pointer for dirty pages
+ * @inode - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for shared vma's.
+ */
+static struct page *
+cr_consider_shared_page(struct inode *inode, unsigned long idx)
+{
+	struct page *page = NULL;
+	int ret;
+
+	/*
+	 * Inspired by do_shmem_file_read(): very simplified version.
+	 *
+	 * FIXME: consolidate with do_shmem_file_read()
+	 */
+
+	ret = shmem_getpage(inode, idx, &page, SGP_READ, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/*
+	 * Only care about dirty pages; shmem_getpage() only returns
+	 * pages that have been allocated, so they must be dirty. The
+	 * pages returned are locked and referenced.
+	 */
+
+	if (page) {
+		unlock_page(page);
+		/*
+		 * If users can be writing to this page using arbitrary
+		 * virtual addresses, take care about potential aliasing
+		 * before reading the page on the kernel side.
+		 */
+		if (mapping_writably_mapped(inode->i_mapping))
+			flush_dcache_page(page);
+		/*
+		 * Mark the page accessed if we read the beginning.
+		 */
+		mark_page_accessed(page);
+	}
+
+	return page;
+}
+
+/**
+ * cr_vma_fill_pgarr - fill a page-array with addr/page tuples
  * @ctx - checkpoint context
  * @vma - vma to scan
  * @start - start address (updated)
+ * @end - end address
  *
+ * For private vma, records addr/page tuples. For shared vma, records
+ * index/page (index is the index of the page in the shmem object).
  * Returns the number of pages collected
  */
-static int
-cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct vm_area_struct *vma,
-			  unsigned long *start)
+static int cr_vma_fill_pgarr(struct cr_ctx *ctx, int shm,
+			     struct vm_area_struct *vma, struct inode *ino,
+			     unsigned long *start, unsigned long end)
 {
-	unsigned long end = vma->vm_end;
 	unsigned long addr = *start;
 	struct cr_pgarr *pgarr;
 	int nr_used;
 	int cnt = 0;
 
 	/* this function is only for private memory (anon or file-mapped) */
-	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+	BUG_ON((vma && ino) || (ino && !shm) || (vma && shm));
+	BUG_ON(vma && (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)));
 
 	do {
 		pgarr = cr_pgarr_current(ctx);
@@ -282,7 +337,11 @@ cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct vm_area_struct *vma,
 		while (addr < end) {
 			struct page *page;
 
-			page = cr_consider_private_page(vma, addr);
+			if (shm)
+				page = cr_consider_shared_page(ino, addr);
+			else
+				page = cr_consider_private_page(vma, addr);
+
 			if (IS_ERR(page))
 				return PTR_ERR(page);
 
@@ -292,7 +351,10 @@ cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct vm_area_struct *vma,
 				pgarr->nr_used++;
 			}
 
-			addr += PAGE_SIZE;
+			if (shm)
+				addr++;
+			else
+				addr += PAGE_SIZE;
 
 			if (cr_pgarr_is_full(pgarr))
 				break;
@@ -359,7 +421,7 @@ static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
 }
 
 /**
- * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * cr_write_vma_contents - dump contents of a VMA
  * @ctx - checkpoint context
  * @vma - vma to scan
  *
@@ -367,17 +429,18 @@ static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int
-cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+static int cr_write_vma_contents(struct cr_ctx *ctx, int shm,
+				 struct vm_area_struct *vma, struct inode *ino,
+				 unsigned long start, unsigned long end)
 {
 	struct cr_hdr h;
 	struct cr_hdr_pgarr *hh;
-	unsigned long addr = vma->vm_start;
+	unsigned long addr = start;
 	int cnt, ret;
 
 	/*
 	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
-	 * in each round. Each iterations is divided into two steps:
+	 * in each round. Each iteration is divided into two steps:
 	 *
 	 * (1) scan: scan through the PTEs of the vma to collect the pages
 	 * to dump (later we'll also make them COW), while keeping a list
@@ -394,15 +457,16 @@ cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
 	 * the actual write-out of the data to after the application is
 	 * allowed to resume execution).
 	 *
-	 * After dumpting the entire contents, conclude with a header that
+	 * After dumping the entire contents, conclude with a header that
 	 * specifies 0 pages to mark the end of the contents.
 	 */
 
 	h.type = CR_HDR_PGARR;
 	h.len = sizeof(*hh);
 
-	while (addr < vma->vm_end) {
-		cnt = cr_private_vma_fill_pgarr(ctx, vma, &addr);
+	while (addr < end) {
+
+		cnt = cr_vma_fill_pgarr(ctx, shm, vma, ino, &addr, end);
 		if (cnt == 0)
 			break;
 		else if (cnt < 0)
@@ -437,6 +501,101 @@ cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
 }
 
 /**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ */
+static int cr_write_private_vma_contents(struct cr_ctx *ctx,
+					 struct vm_area_struct *vma)
+{
+	return cr_write_vma_contents(ctx, 0, vma, NULL,
+				     vma->vm_start, vma->vm_end);
+}
+
+int cr_write_shmem_contents(struct cr_ctx *ctx, struct inode *inode)
+{
+	unsigned long end;
+
+	end = PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT;
+	return cr_write_vma_contents(ctx, 1, NULL, inode, 0, end);
+}
+
+/**
+ * cr_write_shared_vma_contents - dump contents of a VMA with shared memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ */
+static int cr_write_shared_vma_contents(struct cr_ctx *ctx,
+					struct vm_area_struct *vma,
+					enum cr_vma_type vma_type)
+{
+	struct inode *inode;
+	int ret = 0;
+
+	/*
+	 * Citing mmap(2): "Updates to the mapping are visible to other
+	 * processes that map this file, and are carried through to the
+	 * underlying file. The file may not actually be updated until
+	 * msync(2) or munmap(2) is called"
+	 *
+	 * Citing msync(2): "Without use of this call there is no guarantee
+	 * that changes are written back before munmap(2) is called."
+	 *
+	 * Force msync for region of shared mapped files, to ensure
+	 * that the file system is consistent with the checkpoint image.
+	 * (inspired by sys_msync).
+	 *
+	 * [FIXME: call vfs_fsync only once per shared segment]
+	 */
+
+	switch (vma_type) {
+	case CR_VMA_SHM_FILE:
+		/* no need for contents that are stored in the file system */
+		ret = vfs_fsync(vma->vm_file, vma->vm_file->f_path.dentry, 0);
+		break;
+	case CR_VMA_SHM_ANON:
+		/* save the contents of this region */
+		inode = vma->vm_file->f_dentry->d_inode;
+		ret = cr_write_shmem_contents(ctx, inode);
+		break;
+	case CR_VMA_SHM_ANON_SKIP:
+	case CR_VMA_SHM_FILE_SKIP:
+		/* already saved before .. skip now */
+		break;
+	default:
+		BUG();
+	}
+
+	return ret;
+}
+
+/* return the subtype of a private vma segment */
+static enum cr_vma_type cr_private_vma_type(struct vm_area_struct *vma)
+{
+	if (vma->vm_file)
+		return CR_VMA_FILE;
+	else
+		return CR_VMA_ANON;
+}
+
+/*
+ * cr_shared_vma_type - return the subtype of a shared vma
+ * @vma: target vma
+ * @old: 0 if shared segment seen first time, else 1
+ */
+static enum cr_vma_type cr_shared_vma_type(struct vm_area_struct *vma, int old)
+{
+	enum cr_vma_type vma_type = -ENOSYS;
+
+	if (vma->vm_ops && vma->vm_ops->cr_vma_type) {
+		vma_type = (*vma->vm_ops->cr_vma_type)(vma);
+		if (old)
+			vma_type = cr_vma_type_skip(vma_type);
+	}
+	return vma_type;
+}
+
+/**
  * cr_write_vma - classify the vma and dump its contents
  * @ctx: checkpoint context
  * @vma: vma object
@@ -447,9 +606,8 @@ static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
 {
 	struct cr_hdr h;
 	struct cr_hdr_vma *hh;
-	int vma_type;
-	int objref = 0;
-	int new = 0;
+	enum cr_vma_type vma_type;
+	int objref, new;
 	int ret;
 
 	h.type = CR_HDR_VMA;
@@ -457,7 +615,7 @@ static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
 
 	hh = cr_hbuf_get(ctx, sizeof(*hh));
 	if (!hh)
-		return -EBUSY;
+		return -ENOMEM;
 
 	hh->vm_start = vma->vm_start;
 	hh->vm_end = vma->vm_end;
@@ -465,8 +623,7 @@ static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
 	hh->vm_flags = vma->vm_flags;
 	hh->vm_pgoff = vma->vm_pgoff;
 
-#define CR_BAD_VM_FLAGS  \
-	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+#define CR_BAD_VM_FLAGS  (VM_IO | VM_HUGETLB | VM_NONLINEAR)
 
 	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
 		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
@@ -474,38 +631,78 @@ static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
 		goto out;
 	}
 
-	vma_type = CR_VMA_ANON;  /* by default assume anon memory */
+	/*
+	 * Classify the vma as shared or private. If shared, deposit
+	 * the backing inode in the objhash, so that the contents are only
+	 * dumped once.
+	 */
+	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		struct inode *inode = vma->vm_file->f_dentry->d_inode;
+		new = cr_obj_add_ptr(ctx, inode, &objref, CR_OBJ_INODE, 0);
+		if (new < 0) {
+			ret = new;
+			goto out;
+		}
+		hh->shm_objref = objref;
+		hh->shm_size = i_size_read(inode);
+		vma_type = cr_shared_vma_type(vma, !new);
+	} else {
+		hh->shm_objref = 0;
+		hh->shm_size = 0;
+		vma_type = cr_private_vma_type(vma);
+	}
 
-	if (vma->vm_file)
-		vma_type = CR_VMA_FILE;		/* assume private-mapped */
-
-	/* if file-backed, add 'file' to the hash (will keep a reference) */
-	if (vma->vm_file) {
-		new = cr_obj_add_ptr(ctx, vma->vm_file,
-				     &objref, CR_OBJ_FILE, 0);
-		cr_debug("vma %p objref %d file %p)\n",
-			 vma, objref, vma->vm_file);
+	if (vma_type < 0) {
+		ret = vma_type;
+		goto out;
+	}
+
+	hh->vma_type = vma_type;
+
+	/*
+	 * If the vma is file-backed (private or shared) we need to save
+	 * the corresponding file object. As the file object can be shared,
+	 * we follow the same logic as when handling file descriptors.
+	 */
+	if (vma_type == CR_VMA_FILE || vma_type == CR_VMA_SHM_FILE) {
+		struct file *file = vma->vm_file;
+		new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+		cr_debug("vma %p objref %d file %p\n", vma, objref, file);
 		if (new < 0) {
 			ret  = new;
 			goto out;
 		}
+		hh->vma_objref = objref;
+	} else {
+		hh->vma_objref = 0;
+		new = 0;
 	}
 
-	hh->vma_type = vma_type;
-	hh->vma_objref = objref;
+	cr_debug("vma %#lx-%#lx flags %#lx f_objref %d s_objref %d type %d\n",
+		 (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
+		 (unsigned long) hh->vm_flags, (int) hh->vma_objref,
+		 (int) hh->shm_objref, (int) hh->vma_type);
 
+	/* at last, the vma header is ready: write it out */
 	ret = cr_write_obj(ctx, &h, hh);
 	if (ret < 0)
 		goto out;
 
-	/* new==1 if-and-only-if file was newly added to hash */
+	/*
+	 * new==1 if-and-only-if file was newly added to hash; in that
+	 * case we need to dump its state as well
+	 */
 	if (new) {
 		ret = cr_write_file(ctx, vma->vm_file);
 		if (ret < 0)
 			goto out;
 	}
 
-	ret = cr_write_private_vma_contents(ctx, vma);
+	/* finally, dump the actual contents of this vma */
+	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE))
+		ret = cr_write_shared_vma_contents(ctx, vma, vma_type);
+	else
+		ret = cr_write_private_vma_contents(ctx, vma);
 
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
index a72189b..cdf08cd 100644
--- a/checkpoint/rstr_mem.c
+++ b/checkpoint/rstr_mem.c
@@ -330,6 +330,10 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 		/* standard case: read the data into the memory */
 		ret = cr_read_private_vma_contents(ctx);
 		break;
+	default:
+		/* pacifcy gcc (the default will be caught above) */
+		ret = -EINVAL;
+		break;
 	}
 
  out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 69d14c4..8cd94b3 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -107,6 +107,8 @@ extern int cr_read_fname(struct cr_ctx *ctx, char *fname, int n);
 extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 				       int flags, int mode);
 
+extern int cr_write_shmem_contents(struct cr_ctx *ctx, struct inode *inode);
+
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8623d3b..22b40a2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -122,11 +122,24 @@ struct cr_hdr_mm {
 enum cr_vma_type {
 	CR_VMA_ANON = 1,	/* private anonymous */
 	CR_VMA_FILE,		/* private mapped file */
+	CR_VMA_SHM_ANON,	/* shared anonymous */
+	CR_VMA_SHM_ANON_SKIP,	/* shared anonymous, skip contents */
+	CR_VMA_SHM_FILE,	/* shared mapped file, only msync */
+	CR_VMA_SHM_FILE_SKIP,	/* shared mapped file, skip msync */
 };
 
+/* ATTN! for a shared vma type X above, the matching X_SKIP must follow */
+static inline enum cr_vma_type cr_vma_type_skip(enum cr_vma_type vma_type)
+{
+	return vma_type + 1;
+}
+
 struct cr_hdr_vma {
 	__u32 vma_type;
-	__u32 vma_objref;	/* for vma->vm_file */
+	__s32 vma_objref;	/* objref of backing file */
+	__s32 shm_objref;	/* objref of shared segment */
+	__u32 _padding;
+	__u64 shm_size;		/* size of shared segment */
 
 	__u64 vm_start;
 	__u64 vm_end;
diff --git a/mm/shmem.c b/mm/shmem.c
index 53118f0..06aeda5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -28,6 +28,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/swap.h>
+#include <linux/checkpoint_hdr.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -1470,6 +1471,13 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 }
 #endif
 
+#ifdef CONFIG_CHECKPOINT
+static int shmem_cr_vma_type(struct vm_area_struct *vma)
+{
+	return CR_VMA_SHM_ANON;
+}
+#endif
+
 int shmem_lock(struct file *file, int lock, struct user_struct *user)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
@@ -2477,6 +2485,9 @@ static struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_CHECKPOINT
+	.cr_vma_type	= shmem_cr_vma_type,
+#endif
 };
 
 
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 22/29] Restore anonymous- and file-mapped- shared memory
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (20 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- " Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
       [not found]     ` <1238477349-11029-23-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 23/29] s390: Expose a constant for the number of words representing the CRs Oren Laadan
                     ` (6 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen

The bulk of the work is in cr_read_vma(), which has been refactored:
the part that creates the suitable 'struct file *' for the mapping is
now larger and has moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.

Both anonymous shared VMAs that have been read earlier (as indicated
by a lookup in the objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read directly into the backing inode, indexed by page number
(as opposed to virtual address).
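
The rules above can be sketched in plain userspace C. This is only an
illustration, not code from the patch: the enum mirrors 'enum cr_vma_type'
from this series, while vma_has_contents(), entry_offset() and
SKETCH_PAGE_SHIFT (4 KiB pages assumed) are hypothetical names introduced
here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors enum cr_vma_type from checkpoint_hdr.h in this series. */
enum vma_type {
	VMA_ANON = 1,		/* private anonymous */
	VMA_FILE,		/* private mapped file */
	VMA_SHM_ANON,		/* shared anonymous */
	VMA_SHM_ANON_SKIP,	/* shared anonymous, seen before */
	VMA_SHM_FILE,		/* shared mapped file */
	VMA_SHM_FILE_SKIP,	/* shared mapped file, seen before */
};

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical helper: does restart read page contents for this type? */
bool vma_has_contents(enum vma_type type)
{
	return type == VMA_ANON || type == VMA_FILE || type == VMA_SHM_ANON;
}

/*
 * Hypothetical helper: byte offset of a page-array entry within its
 * target. For shared anonymous memory the entry is a page index into
 * the shmem object; otherwise it is a virtual address inside the vma.
 */
uint64_t entry_offset(enum vma_type type, uint64_t entry, uint64_t vm_start)
{
	if (type == VMA_SHM_ANON)
		return entry << SKETCH_PAGE_SHIFT;
	return entry - vm_start;
}
```

Keying the shared case on page indices means a segment mapped at
different addresses (or only partially) by several tasks can still be
restored once, into the object itself.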

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/rstr_mem.c      |  219 +++++++++++++++++++++++++++++++++-----------
 include/linux/checkpoint.h |    1 +
 2 files changed, 167 insertions(+), 53 deletions(-)

diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
index cdf08cd..414d6a9 100644
--- a/checkpoint/rstr_mem.c
+++ b/checkpoint/rstr_mem.c
@@ -75,13 +75,37 @@ static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
 	return 0;
 }
 
+static struct page *cr_bring_private_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+	if (ret < 0)
+		page = ERR_PTR(ret);
+	return page;
+}
+
+static struct page *cr_bring_shared_page(unsigned long idx, struct inode *ino)
+{
+	struct page *page = NULL;
+	int ret;
+
+	ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (page)
+		unlock_page(page);
+	return page;
+}
+
 /**
  * cr_read_pages_contents - read in data of pages in page-array chain
  * @ctx - restart context
+ * @inode - inode of shmem object
  */
-static int cr_read_pages_contents(struct cr_ctx *ctx)
+static int cr_read_pages_contents(struct cr_ctx *ctx, struct inode *inode)
 {
-	struct mm_struct *mm = current->mm;
 	struct cr_pgarr *pgarr;
 	unsigned long *vaddrs;
 	char *buf;
@@ -91,16 +115,21 @@ static int cr_read_pages_contents(struct cr_ctx *ctx)
 	if (!buf)
 		return -ENOMEM;
 
-	down_read(&mm->mmap_sem);
+	down_read(&current->mm->mmap_sem);
 	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		vaddrs = pgarr->vaddrs;
 		for (i = 0; i < pgarr->nr_used; i++) {
 			struct page *page;
 
-			ret = get_user_pages(current, mm, vaddrs[i],
-					     1, 1, 1, &page, NULL);
-			if (ret < 0)
+			if (inode)
+				page = cr_bring_shared_page(vaddrs[i], inode);
+			else
+				page = cr_bring_private_page(vaddrs[i]);
+
+			if (IS_ERR(page)) {
+				ret = PTR_ERR(page);
 				goto out;
+			}
 
 			ret = cr_page_read(ctx, page, buf);
 			page_cache_release(page);
@@ -111,14 +140,15 @@ static int cr_read_pages_contents(struct cr_ctx *ctx)
 	}
 
  out:
-	up_read(&mm->mmap_sem);
+	up_read(&current->mm->mmap_sem);
 	kfree(buf);
 	return 0;
 }
 
 /**
- * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * cr_read_vma_contents - restore contents of a VMA
  * @ctx - restart context
+ * @inode - inode of shmem object (NULL for private memory)
  *
  * Reads a header that specifies how many pages will follow, then reads
  * a list of virtual addresses into ctx->pgarr_list page-array chain,
@@ -126,7 +156,7 @@ static int cr_read_pages_contents(struct cr_ctx *ctx)
  * these steps until reaching a header specifying "0" pages, which marks
  * the end of the contents.
  */
-static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+static int cr_read_vma_contents(struct cr_ctx *ctx, struct inode *inode)
 {
 	struct cr_hdr_pgarr *hh;
 	unsigned long nr_pages;
@@ -153,7 +183,7 @@ static int cr_read_private_vma_contents(struct cr_ctx *ctx)
 		ret = cr_read_pages_vaddrs(ctx, nr_pages);
 		if (ret < 0)
 			break;
-		ret = cr_read_pages_contents(ctx);
+		ret = cr_read_pages_contents(ctx, inode);
 		if (ret < 0)
 			break;
 		cr_pgarr_reset_all(ctx);
@@ -162,6 +192,39 @@ static int cr_read_private_vma_contents(struct cr_ctx *ctx)
 	return ret;
 }
 
+int cr_read_shmem_contents(struct cr_ctx *ctx, struct inode *inode)
+{
+	return cr_read_vma_contents(ctx, inode);
+}
+
+/* restore contents of a VMA with private memory */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+	/*
+	 * CR_VMA_ANON: read contents into memory
+	 * CR_VMA_FILE: read contents into memory
+	 */
+
+	return cr_read_vma_contents(ctx, NULL);
+}
+
+/* restore contents of a VMA with shared memory */
+static int cr_read_shared_vma_contents(struct cr_ctx *ctx,
+				      struct file *file,
+				      enum cr_vma_type vma_type)
+{
+	/*
+	 * CR_VMA_SHM_ANON: read contents into shmem object
+	 * CR_VMA_SHM_ANON_SKIP: skip (has been read before)
+	 * CR_VMA_SHM_FILE: skip (contents already in file system)
+	 */
+
+	if (vma_type == CR_VMA_SHM_ANON)
+		return cr_read_shmem_contents(ctx, file->f_dentry->d_inode);
+	else
+		return 0;
+}
+
 /**
  * cr_calc_map_prot_bits - convert vm_flags to mmap protection
  * orig_vm_flags: source vm_flags
@@ -239,6 +302,72 @@ static struct file *cr_vma_read_file(struct cr_ctx *ctx, int objref)
 	return file;
 }
 
+static struct file *cr_vma_prep_file(struct cr_ctx *ctx, struct cr_hdr_vma *hh)
+{
+	struct file *file = ERR_PTR(-EINVAL);
+	unsigned long vm_flags = hh->vm_flags;
+	int add = 0;
+	int ret;
+
+	switch (hh->vma_type) {
+	case CR_VMA_ANON:		/* private anonymous mapping */
+		if (hh->shm_objref || hh->vma_objref)
+			break;
+		file = NULL;
+		break;
+	case CR_VMA_FILE:		/* private mapping from a file */
+		if (hh->shm_objref || !hh->vma_objref)
+			break;
+		file = cr_vma_read_file(ctx, hh->vma_objref);
+		break;
+	case CR_VMA_SHM_ANON:		/* shared anonymous mapping */
+		if (!hh->shm_objref || hh->vma_objref)
+			break;
+		/*
+		 * We could leave file==NULL and let mmap (below) do the
+		 * work. However, if 'shm_size != vm_end - vm_start', or if
+		 * 'vm_pgoff != 0', then this vma reflects only a portion
+		 * of the shm object. In this case we need to "manually"
+		 * create the full shm object. So we do it anyway ...
+		 */
+		file = shmem_file_setup("/dev/zero", hh->shm_size, vm_flags);
+		add = 1;
+		break;
+	case CR_VMA_SHM_ANON_SKIP:	/* shared anonymous mapping skipped */
+		if (!hh->shm_objref || hh->vma_objref)
+			break;
+		file = cr_obj_get_by_ref(ctx, hh->shm_objref, CR_OBJ_FILE);
+		if (!file)
+			file = ERR_PTR(-EINVAL);
+		if (!IS_ERR(file))
+			get_file(file);
+		break;
+	case CR_VMA_SHM_FILE:		/* shared mapping of a file */
+		if (!hh->shm_objref || !hh->vma_objref)
+			break;
+		file = cr_vma_read_file(ctx, hh->vma_objref);
+		break;
+	default:
+		file = ERR_PTR(-EINVAL);
+		break;
+	}
+
+	if (IS_ERR(file))
+		return file;
+
+	if (add) {
+		ret = cr_obj_add_ref(ctx, file,
+				     hh->shm_objref, CR_OBJ_FILE, 0);
+		if (ret < 0) {
+			if (file)
+				fput(file);
+			file = ERR_PTR(ret);
+		}
+	}
+
+	return file;
+}
+
 /**
  * cr_read_vma - read vma data, recreate it and read contents
  * @ctx: checkpoint context
@@ -253,22 +382,29 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 	unsigned long addr;
 	enum cr_vma_type vma_type;
 	struct file *file = NULL;
-	int ret;
+	int shm, ret;
 
 	hh = cr_hbuf_get(ctx, sizeof(*hh));
 	if (!hh)
 		return -ENOMEM;
+
 	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
 	if (ret < 0)
 		goto out;
 
-	cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
-		 (unsigned long) hh->vm_end, (int) hh->vma_type);
+	cr_debug("vma %#lx-%#lx flags %#lx objref %d type %d\n",
+		 (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
+		 (unsigned long) hh->vm_flags, (int) hh->shm_objref,
+		 (int) hh->vma_type);
 
 	ret = -EINVAL;
 	if (hh->vm_end < hh->vm_start)
 		goto out;
-	if (hh->vma_objref <= 0)
+	if (hh->vma_objref < 0 || hh->shm_objref < 0)
+		goto out;
+
+	shm = !!hh->shm_objref;
+	if (!(hh->vm_flags & VM_SHARED) ^ !shm)
 		goto out;
 
 	vm_start = hh->vm_start;
@@ -278,34 +414,22 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
 	vma_type = hh->vma_type;
 
-	switch (vma_type) {
-
-	case CR_VMA_ANON:		/* anonymous private mapping */
-		if (vm_flags & VM_SHARED)
-			goto out;
-		/*
-		 * vm_pgoff for anonymous mapping is the "global" page
-		 * offset (namely from addr 0x0), so we force a zero
-		 */
+	/*
+	 * vm_pgoff for anonymous mapping is the "global" page
+	 * offset (namely from addr 0x0), so we force a zero
+	 */
+	if (vma_type == CR_VMA_ANON)
 		vm_pgoff = 0;
-		break;
-
-	case CR_VMA_FILE:		/* private mapping from a file */
-		if (vm_flags & VM_SHARED)
-			goto out;
-		file = cr_vma_read_file(ctx, hh->vma_objref);
-		if (IS_ERR(file)) {
-			ret = PTR_ERR(file);
-			file = NULL;
-			goto out;
-		}
-		break;
 
-	default:
+	/* prepare the file for this vma */
+	file = cr_vma_prep_file(ctx, hh);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		file = NULL;
 		goto out;
-
 	}
 
+	/* create a new vma */
 	down_write(&mm->mmap_sem);
 	addr = do_mmap_pgoff(file, vm_start, vm_size,
 			     vm_prot, vm_flags, vm_pgoff);
@@ -318,23 +442,11 @@ static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
 		goto out;
 	}
 
-	/*
-	 * CR_VMA_ANON: read in memory as is
-	 * CR_VMA_FILE: read in memory as is
-	 * (more to follow ...)
-	 */
-
-	switch (vma_type) {
-	case CR_VMA_ANON:
-	case CR_VMA_FILE:
-		/* standard case: read the data into the memory */
+	/* read in the contents of this vma */
+	if (shm)
+		ret = cr_read_shared_vma_contents(ctx, file, vma_type);
+	else
 		ret = cr_read_private_vma_contents(ctx);
-		break;
-	default:
-		/* pacifcy gcc (the default will be caught above) */
-		ret = -EINVAL;
-		break;
-	}
 
  out:
 	if (file)
@@ -372,6 +484,7 @@ int cr_read_mm(struct cr_ctx *ctx)
 	hh = cr_hbuf_get(ctx, sizeof(*hh));
 	if (!hh)
 		return -ENOMEM;
+
 	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
 	if (ret < 0)
 		goto out;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8cd94b3..031e414 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -108,6 +108,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 				       int flags, int mode);
 
 extern int cr_write_shmem_contents(struct cr_ctx *ctx, struct inode *inode);
+extern int cr_read_shmem_contents(struct cr_ctx *ctx, struct inode *inode);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 23/29] s390: Expose a constant for the number of words representing the CRs
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (21 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 22/29] Restore " Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4) Oren Laadan
                     ` (5 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Dave Hansen

We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
    Mar 30:
            . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
    Mar 03:
            . Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/Kconfig                |    4 ++++
 arch/s390/include/asm/ptrace.h   |    4 +++-
 arch/s390/kernel/compat_ptrace.h |    3 ++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 6b0a353..98d339e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if 64BIT
+
 config GENERIC_BUG
 	bool
 	depends on BUG
diff --git a/arch/s390/include/asm/ptrace.h b/arch/s390/include/asm/ptrace.h
index 8920025..f1b0516 100644
--- a/arch/s390/include/asm/ptrace.h
+++ b/arch/s390/include/asm/ptrace.h
@@ -172,6 +172,8 @@
 #define NUM_CRS		16
 #define NUM_ACRS	16
 
+#define NUM_CR_WORDS	3
+
 #define FPR_SIZE	8
 #define FPC_SIZE	4
 #define FPC_PAD_SIZE	4 /* gcc insists on aligning the fpregs */
@@ -334,7 +336,7 @@ struct pt_regs
  */
 typedef struct
 {
-	unsigned long cr[3];
+	unsigned long cr[NUM_CR_WORDS];
 } per_cr_words;
 
 #define PER_EM_MASK 0xE8000000UL
diff --git a/arch/s390/kernel/compat_ptrace.h b/arch/s390/kernel/compat_ptrace.h
index a2be3a9..123dd66 100644
--- a/arch/s390/kernel/compat_ptrace.h
+++ b/arch/s390/kernel/compat_ptrace.h
@@ -1,10 +1,11 @@
 #ifndef _PTRACE32_H
 #define _PTRACE32_H
 
+#include <asm/ptrace.h>    /* needed for NUM_CR_WORDS */
 #include "compat_linux.h"  /* needed for psw_compat_t */
 
 typedef struct {
-	__u32 cr[3];
+	__u32 cr[NUM_CR_WORDS];
 } per_cr_words32;
 
 typedef struct {
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4)
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (22 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 23/29] s390: Expose a constant for the number of words representing the CRs Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
       [not found]     ` <1238477349-11029-25-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 25/29] s390: define s390-specific checkpoint-restart code (v7) Oren Laadan
                     ` (4 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric.  CR_COPY_ARRAY() provides us a way to do
the same thing but for arrays.  It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
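
To illustrate the symmetry, here is a minimal userspace sketch, not the
kernel code itself: the macros follow the ones added below, except that
the kernel-only BUILD_BUG_ON() and __must_be_array() checks are replaced
by a C11 _Static_assert. 'struct live_regs', 'struct saved_regs' and
copy_regs() are hypothetical names used only for this example.

```c
#include <stdint.h>
#include <string.h>

#define CR_CPT 1
#define CR_RST 2

/* Copy one field: LIVE -> SAVE on checkpoint, SAVE -> LIVE on restart. */
#define CR_COPY(op, SAVE, LIVE)				\
	do {						\
		if ((op) == CR_CPT)			\
			SAVE = LIVE;			\
		else					\
			LIVE = SAVE;			\
	} while (0)

/* Same idea for arrays; element sizes must match (checked at build time). */
#define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		_Static_assert(sizeof(*(SAVE)) == sizeof(*(LIVE)),	\
			       "element size mismatch");		\
		if ((op) == CR_CPT)					\
			memcpy(SAVE, LIVE, (count) * sizeof(*(SAVE)));	\
		else							\
			memcpy(LIVE, SAVE, (count) * sizeof(*(SAVE)));	\
	} while (0)

/* Hypothetical stand-ins for live task state and its image header. */
struct live_regs  { uint64_t gprs[4]; uint64_t psw; };
struct saved_regs { uint64_t gprs[4]; uint64_t psw; };

/* One function serves both directions; 'op' selects which way to copy. */
void copy_regs(int op, struct saved_regs *hh, struct live_regs *regs)
{
	CR_COPY(op, hh->psw, regs->psw);
	CR_COPY_ARRAY(op, hh->gprs, regs->gprs, 4);
}
```

Because the field list is written once, a field added for checkpoint
cannot be forgotten on the restart side.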

Changelog:
    Mar 04:
            . Removed semicolons
            . Added build-time check for __must_be_array in CR_COPY_ARRAY
    Feb 27:
            . Changed CR_COPY() to use assignment, eliminating the need
              for the CR_COPY_BIT() macro
            . Add CR_COPY_ARRAY() macro to help copying register arrays,
              etc
            . Move the macro definitions inside the CR #ifdef
    Feb 25:
            . Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
 include/linux/checkpoint.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 031e414..59ec563 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -120,6 +120,32 @@ extern int cr_read_mm(struct cr_ctx *ctx);
 extern int cr_read_fd_table(struct cr_ctx *ctx);
 extern int cr_read_file(struct cr_ctx *ctx, int objref);
 
+/* useful macros to copy fields and buffers to/from cr_hdr_xxx structures */
+#define CR_CPT 1
+#define CR_RST 2
+
+#define CR_COPY(op, SAVE, LIVE)				        \
+	do {							\
+		if (op == CR_CPT)				\
+			SAVE = LIVE;				\
+		else						\
+			LIVE = SAVE;				\
+	} while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CR_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
+	do {								\
+		BUILD_BUG_ON(sizeof(*(SAVE)) != sizeof(*(LIVE)));	\
+		if ((op) == CR_CPT)					\
+			memcpy(SAVE, LIVE, (count) * sizeof(*(SAVE)));	\
+		else							\
+			memcpy(LIVE, SAVE, (count) * sizeof(*(SAVE)));	\
+	} while (__must_be_array(SAVE) && __must_be_array(LIVE) && 0)
+
+
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
 
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 25/29] s390: define s390-specific checkpoint-restart code (v7)
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (23 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4) Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 26/29] powerpc: provide APIs for validating and updating DABR Oren Laadan
                     ` (3 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.

Changelog:
    Feb 27:
            . Add checkpoint_s390.h
            . Fixed up save and restore of PSW, with the non-address bits
              properly masked out
    Feb 25:
            . Make checkpoint_hdr.h safe for inclusion in userspace
            . Replace comment about vdso code
            . Add comment about restoring access registers
            . Write and read an empty cr_hdr_head_arch record to appease
              code (mktree) that expects it to be there
            . Utilize NUM_CR_WORDS in checkpoint_hdr.h
    Feb 24:
            . Use CR_COPY() to unify the un/loading of cpu and mm state
            . Fix fprs definition in cr_hdr_cpu
            . Remove debug WARN_ON() from checkpoint.c
    Feb 23:
            . Macro-ize the un/packing of trace flags
            . Fix the crash when externally-linked
            . Break out the restart functions into restart.c
            . Remove unneeded s390_enable_sie() call
    Jan 30:
            . Switched types in cr_hdr_cpu to __u64 etc.
              (Per Oren's suggestion)
            . Replaced direct inclusion of structs in
              cr_hdr_cpu with the struct members.
              (Per Oren's suggestion)
            . Also ended up adding a bunch of new things
              into restart (mm_segment, ksp, etc) in a vain
              attempt to get code using fpu to not segfault
              after restart.

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/include/asm/checkpoint_hdr.h |   88 +++++++++++++++++++++++
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/compat_wrapper.S      |   12 +++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint.c              |  121 ++++++++++++++++++++++++++++++++
 arch/s390/mm/checkpoint_s390.h         |   22 ++++++
 arch/s390/mm/restart.c                 |   83 ++++++++++++++++++++++
 8 files changed, 332 insertions(+), 1 deletions(-)
 create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
 create mode 100644 arch/s390/mm/checkpoint.c
 create mode 100644 arch/s390/mm/checkpoint_s390.h
 create mode 100644 arch/s390/mm/restart.c

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..0a405c2
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,88 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef __s390x__
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_ACRS defined in <asm/ptrace.h> to be 16
+ * NUM_CR_WORDS defined in <asm/ptrace.h> to be 3
+ */
+struct cr_hdr_cpu {
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CR_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct cr_hdr_mm_context {
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct cr_hdr_head_arch {
+};
+
+#ifdef __KERNEL__
+/* Functions for copying to/from the header structs */
+extern void cr_s390_regs(int op, struct cr_hdr_cpu *hh, struct task_struct *t);
+extern void cr_s390_mm(int op, struct cr_hdr_mm_context *hh,
+		       struct mm_struct *mm);
+#endif
+
+#endif /* __s390x__ */
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index c8ad350..ffe64a0 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -265,7 +265,9 @@
 #define __NR_pipe2		325
 #define __NR_dup3		326
 #define __NR_epoll_create1	327
-#define NR_syscalls 328
+#define __NR_checkpoint		328
+#define __NR_restart		329
+#define NR_syscalls 330
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 62c706e..2b85f3b 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1805,3 +1805,17 @@ compat_sys_keyctl_wrapper:
 	llgfr	%r5,%r5			# u32
 	llgfr	%r6,%r6			# u32
 	jg	compat_sys_keyctl	# branch to system call
+
+	.globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	sys_checkpoint		# branch to system call
+
+	.globl sys_restart_wrapper
+sys_restart_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	sys_restart		# branch to system call
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index fe5b25a..f1cf5fb 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -336,3 +336,5 @@ SYSCALL(sys_inotify_init1,sys_inotify_init1,sys_inotify_init1_wrapper)
 SYSCALL(sys_pipe2,sys_pipe2,sys_pipe2_wrapper) /* 325 */
 SYSCALL(sys_dup3,sys_dup3,sys_dup3_wrapper)
 SYSCALL(sys_epoll_create1,sys_epoll_create1,sys_epoll_create1_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
+SYSCALL(sys_restart,sys_restart,sys_restart_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index 2a74581..b16161e 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_PAGE_STATES) += page-states.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o restart.o
diff --git a/arch/s390/mm/checkpoint.c b/arch/s390/mm/checkpoint.c
new file mode 100644
index 0000000..263d8bd
--- /dev/null
+++ b/arch/s390/mm/checkpoint.c
@@ -0,0 +1,121 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+
+#include "checkpoint_s390.h"
+
+void cr_s390_regs(int op, struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	struct thread_struct *thr = &t->thread;
+
+	/* Save the whole PSW to facilitate forensic debugging, but only
+	 * restore the address portion to avoid letting userspace do
+	 * bad things by manipulating its value.
+	 */
+	if (op == CR_CPT) {
+		CR_COPY(op, hh->psw_t_addr, regs->psw.addr);
+	} else {
+		regs->psw.addr &= ~PSW_ADDR_INSN;
+		regs->psw.addr |= hh->psw_t_addr;
+	}
+
+	CR_COPY(op, hh->args[0], regs->args[0]);
+	CR_COPY(op, hh->orig_gpr2, regs->orig_gpr2);
+	CR_COPY(op, hh->svcnr, regs->svcnr);
+	CR_COPY(op, hh->ilc, regs->ilc);
+	CR_COPY(op, hh->ieee_instruction_pointer,
+		thr->ieee_instruction_pointer);
+	CR_COPY(op, hh->psw_t_mask, regs->psw.mask);
+	CR_COPY(op, hh->fpc, thr->fp_regs.fpc);
+	CR_COPY(op, hh->starting_addr, thr->per_info.starting_addr);
+	CR_COPY(op, hh->ending_addr, thr->per_info.ending_addr);
+	CR_COPY(op, hh->address, thr->per_info.lowcore.words.address);
+	CR_COPY(op, hh->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+	CR_COPY(op, hh->access_id, thr->per_info.lowcore.words.access_id);
+	CR_COPY(op, hh->single_step, thr->per_info.single_step);
+	CR_COPY(op, hh->instruction_fetch, thr->per_info.instruction_fetch);
+
+	CR_COPY_ARRAY(op, hh->gprs, regs->gprs, NUM_GPRS);
+	CR_COPY_ARRAY(op, hh->fprs, thr->fp_regs.fprs, NUM_FPRS);
+	CR_COPY_ARRAY(op, hh->acrs, thr->acrs, NUM_ACRS);
+	CR_COPY_ARRAY(op, hh->per_control_regs,
+		      thr->per_info.control_regs.words.cr, NUM_CR_WORDS);
+}
+
+void cr_s390_mm(int op, struct cr_hdr_mm_context *hh, struct mm_struct *mm)
+{
+	CR_COPY(op, hh->noexec, mm->context.noexec);
+	CR_COPY(op, hh->has_pgste, mm->context.has_pgste);
+	CR_COPY(op, hh->alloc_pgste, mm->context.alloc_pgste);
+	CR_COPY(op, hh->asce_bits, mm->context.asce_bits);
+	CR_COPY(op, hh->asce_limit, mm->context.asce_limit);
+}
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+
+	cr_s390_regs(CR_CPT, hh, t);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_HEAD_ARCH;
+	h.len = sizeof(*hh);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+
+	cr_s390_mm(CR_CPT, hh, mm);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..52a5e6f
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,22 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void cr_s390_regs(int op, struct cr_hdr_cpu *hh, struct task_struct *t);
+extern void cr_s390_mm(int op, struct cr_hdr_mm_context *hh,
+		       struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
diff --git a/arch/s390/mm/restart.c b/arch/s390/mm/restart.c
new file mode 100644
index 0000000..7131c22
--- /dev/null
+++ b/arch/s390/mm/restart.c
@@ -0,0 +1,83 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+
+#include "checkpoint_s390.h"
+
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if  (ret < 0)
+		goto out;
+
+	cr_s390_regs(CR_RST, hh, current);
+
+	/* s390 does not restore the access registers after a syscall,
+	 * but does on a task switch.  Since we're switching tasks (in
+	 * a way), we need to replicate that behavior here.
+	 */
+	restore_access_regs(hh->acrs);
+out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head_arch *hh;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_mm_context *hh;
+	int ret;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	if (!hh)
+		return -ENOMEM;
+
+	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	if (ret < 0)
+		goto out;
+
+	cr_s390_mm(CR_RST, hh, mm);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 26/29] powerpc: provide APIs for validating and updating DABR
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (24 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 25/29] s390: define s390-specific checkpoint-restart code (v7) Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 27/29] powerpc: checkpoint/restart implementation Oren Laadan
                     ` (2 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register).  The restart code needs to validate this
value before making any changes to the current task.

ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR.  Split this into "validate"
(debugreg_valid) and "update" (debugreg_update) functions, and make
them available for use outside of the ptrace code.

Also ptrace_set_debugreg has extern linkage, but no users outside of
ptrace.c.  Make it static.
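
The validate/update split described above can be sketched in userspace
C as follows.  This is only an illustration of the non-BookE (DABR)
path: the sketch_* names, the struct and the SKETCH_TASK_SIZE value are
assumptions, and SKETCH_TRANSLATION is assumed to correspond to the
DABR translation flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_TASK_SIZE	UINT64_C(0x100000000000)
#define SKETCH_TRANSLATION	UINT64_C(0x4)

/* Stand-in for the part of thread_struct that holds the DABR. */
struct sketch_thread { uint64_t dabr; };

/* Pure check: never touches task state. */
bool sketch_debugreg_valid(uint64_t val, unsigned int index)
{
	/* only one debug register is supported */
	if (index != 0)
		return false;

	/* the bottom 3 bits are flags; the rest must be a user address */
	if ((val & ~UINT64_C(0x7)) >= SKETCH_TASK_SIZE)
		return false;

	/* a non-zero value must have the translation bit set */
	if (val && !(val & SKETCH_TRANSLATION))
		return false;

	return true;
}

/* Unconditional update: the caller must have validated 'val'. */
void sketch_debugreg_update(struct sketch_thread *t, uint64_t val)
{
	t->dabr = val;
}

/* The ptrace-style entry point is then a thin wrapper around both. */
int sketch_set_debugreg(struct sketch_thread *t, uint64_t val)
{
	if (!sketch_debugreg_valid(val, 0))
		return -EIO;

	sketch_debugreg_update(t, val);
	return 0;
}
```

Keeping the check free of side effects is what lets restart code call
it during a validation pass without modifying the task.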

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/include/asm/ptrace.h |    7 +++
 arch/powerpc/kernel/ptrace.c      |   88 +++++++++++++++++++++++++------------
 2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index c9c678f..79bc816 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -81,6 +81,8 @@ struct pt_regs {
 
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 #define instruction_pointer(regs) ((regs)->nip)
 #define user_stack_pointer(regs) ((regs)->gpr[1])
 #define regs_return_value(regs) ((regs)->gpr[3])
@@ -138,6 +140,11 @@ do {									      \
 extern void user_enable_single_step(struct task_struct *);
 extern void user_disable_single_step(struct task_struct *);
 
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+			    unsigned int index);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 3635be6..0b6cf84 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -735,22 +735,25 @@ void user_disable_single_step(struct task_struct *task)
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
-int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
-			       unsigned long data)
+/**
+ * debugreg_valid() - validate the value to be written to a debug register
+ * @val:	The prospective contents of the register.
+ * @index:	Must be zero.
+ *
+ * Returns true if @val is an acceptable value for the register indicated by
+ * @index, false otherwise.
+ */
+bool debugreg_valid(unsigned long val, unsigned int index)
 {
-	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
-	 *  For embedded processors we support one DAC and no IAC's at the
-	 *  moment.
-	 */
-	if (addr > 0)
-		return -EINVAL;
+	/* We support only one debug register for now */
+	if (index != 0)
+		return false;
 
 	/* The bottom 3 bits in dabr are flags */
-	if ((data & ~0x7UL) >= TASK_SIZE)
-		return -EIO;
+	if ((val & ~0x7UL) >= TASK_SIZE)
+		return false;
 
 #ifndef CONFIG_BOOKE
-
 	/* For processors using DABR (i.e. 970), the bottom 3 bits are flags.
 	 *  It was assumed, on previous implementations, that 3 bits were
 	 *  passed together with the data address, fitting the design of the
@@ -764,47 +767,74 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	 */
 
 	/* Ensure breakpoint translation bit is set */
-	if (data && !(data & DABR_TRANSLATION))
-		return -EIO;
-
-	/* Move contents to the DABR register */
-	task->thread.dabr = data;
-
-#endif
-#if defined(CONFIG_BOOKE)
-
+	if (val && !(val & DABR_TRANSLATION))
+		return false;
+#else
 	/* As described above, it was assumed 3 bits were passed with the data
 	 *  address, but we will assume only the mode bits will be passed
 	 *  as to not cause alignment restrictions for DAC-based processors.
 	 */
 
+	/* Read or Write bits must be set */
+	if (!(val & 0x3UL))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * debugreg_update() - update a debug register associated with a task
+ * @task:	The task whose register state is to be modified.
+ * @val:	The value to be written to the debug register.
+ * @index:	Specifies the debug register.  Currently unused.
+ *
+ * Set a task's DABR/DAC to @val, which should be validated with
+ * debugreg_valid() beforehand.
+ */
+void debugreg_update(struct task_struct *task, unsigned long val,
+		     unsigned int index)
+{
+#ifndef CONFIG_BOOKE
+	task->thread.dabr = val;
+#else
 	/* DAC's hold the whole address without any mode flags */
-	task->thread.dabr = data & ~0x3UL;
+	task->thread.dabr = val & ~0x3UL;
 
 	if (task->thread.dabr == 0) {
 		task->thread.dbcr0 &= ~(DBSR_DAC1R | DBSR_DAC1W | DBCR0_IDM);
 		task->thread.regs->msr &= ~MSR_DE;
-		return 0;
 	}
 
-	/* Read or Write bits must be set */
-
-	if (!(data & 0x3UL))
-		return -EINVAL;
-
 	/* Set the Internal Debugging flag (IDM bit 1) for the DBCR0
 	   register */
 	task->thread.dbcr0 = DBCR0_IDM;
 
 	/* Check for write and read flags and set DBCR0
 	   accordingly */
-	if (data & 0x1UL)
+	if (val & 0x1UL)
 		task->thread.dbcr0 |= DBSR_DAC1R;
-	if (data & 0x2UL)
+	if (val & 0x2UL)
 		task->thread.dbcr0 |= DBSR_DAC1W;
 
 	task->thread.regs->msr |= MSR_DE;
 #endif
+}
+
+static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
+			       unsigned long data)
+{
+	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
+	 * For embedded processors we support one DAC and no IAC's at the
+	 * moment.
+	 */
+	if (addr > 0)
+		return -EINVAL;
+
+	if (!debugreg_valid(data, 0))
+		return -EIO;
+
+	debugreg_update(task, data, 0);
+
 	return 0;
 }
 
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 27/29] powerpc: checkpoint/restart implementation
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (25 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 26/29] powerpc: provide APIs for validating and updating DABR Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 28/29] powerpc: wire up checkpoint and restart syscalls Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 29/29] powerpc: enable checkpoint support in Kconfig Oren Laadan
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for GPRs, so migrating between ppc32 and ppc64 won't work yet.
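
As a rough illustration of the image layout described above, the
following userspace sketch shows a feature bitmask plus a recorded
field size, and the two checks a restarting kernel could make against
them.  All sketch_* names are hypothetical, and the "possible" mask
models a build without Altivec purely as an example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit numbers. */
enum sketch_feature {
	SKETCH_USED_FP,
	SKETCH_USED_DEBUG,
	SKETCH_USED_ALTIVEC,
};

/* Features this example build can restore (no Altivec here). */
#define SKETCH_FTRS_POSSIBLE \
	((UINT32_C(1) << SKETCH_USED_FP) | (UINT32_C(1) << SKETCH_USED_DEBUG))

/* Leading part of the image: feature bits and one recorded size. */
struct sketch_hdr {
	uint32_t features_used;
	uint32_t fpr_size;
};

void sketch_feature_set(struct sketch_hdr *h, enum sketch_feature f)
{
	h->features_used |= UINT32_C(1) << f;
}

bool sketch_feature_isset(const struct sketch_hdr *h, enum sketch_feature f)
{
	return (h->features_used >> f) & 1u;
}

/* True if the image uses a feature this build cannot restore. */
bool sketch_features_unknown(const struct sketch_hdr *h)
{
	return (h->features_used & ~SKETCH_FTRS_POSSIBLE) != 0;
}

/* True if the recorded FP area size matches the local layout. */
bool sketch_fpr_size_ok(const struct sketch_hdr *h, size_t local_size)
{
	return h->fpr_size == local_size;
}
```

An image that fails either check can be rejected before any task state
is touched.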

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).
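
The "validate everything first, then commit" approach can be sketched
in userspace C as below.  The structs, the two restore callbacks and
the use of a two-pass loop (the patch itself uses a goto) are
assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the checkpoint image and the task. */
struct sketch_image { int a; int b; };
struct sketch_task  { int a; int b; };

typedef int (*sketch_restore_fn)(const struct sketch_image *,
				 struct sketch_task *, bool update);

/* Each callback validates its field; it writes only when 'update' is set. */
int sketch_restore_a(const struct sketch_image *img, struct sketch_task *t,
		     bool update)
{
	if (img->a < 0)
		return -EINVAL;
	if (update)
		t->a = img->a;
	return 0;
}

int sketch_restore_b(const struct sketch_image *img, struct sketch_task *t,
		     bool update)
{
	if (img->b < 0)
		return -EINVAL;
	if (update)
		t->b = img->b;
	return 0;
}

static const sketch_restore_fn sketch_funcs[] = {
	sketch_restore_a,
	sketch_restore_b,
};

/* Pass 0 checks every field; pass 1 commits.  A failure in pass 0
 * returns before the task has been modified at all. */
int sketch_restore(const struct sketch_image *img, struct sketch_task *t)
{
	size_t n = sizeof(sketch_funcs) / sizeof(sketch_funcs[0]);
	int pass;
	size_t i;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			int rc = sketch_funcs[i](img, t, pass == 1);

			if (rc)
				return rc;
		}
	}
	return 0;
}
```

With this shape, an image whose second field is bad leaves the first
field of the task untouched, which a single-pass restore would not
guarantee.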

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa

Untested:
* ppc32 (but it builds)

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/include/asm/checkpoint_hdr.h |   15 +
 arch/powerpc/mm/Makefile                  |    1 +
 arch/powerpc/mm/checkpoint.c              |  500 +++++++++++++++++++++++++++++
 3 files changed, 516 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/mm/checkpoint.c

diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..9f0d099
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,15 @@
+#ifndef __ASM_PPC_CKPT_HDR_H
+#define __ASM_PPC_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers ppc
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* nothing to see here */
+
+#endif /* __ASM_PPC_CKPT_HDR_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 953cc4a..02ecbcb 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -26,3 +26,4 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
new file mode 100644
index 0000000..731874b
--- /dev/null
+++ b/arch/powerpc/mm/checkpoint.c
@@ -0,0 +1,500 @@
+/*
+ *  Checkpoint/restart - architecture specific support for powerpc.
+ *  Based on x86 implementation.
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *  Copyright 2009 IBM Corp.
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define DEBUG 1 /* for pr_debug */
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum cr_cpu_feature {
+	CKPT_USED_FP,
+	CKPT_USED_DEBUG,
+	CKPT_USED_ALTIVEC,
+	CKPT_USED_SPE,
+	CKPT_USED_VSX,
+	CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << ftr)
+
+/* features this kernel can handle for restart */
+enum {
+	CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+	x(CKPT_USED_FP) |
+#endif
+	x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+	x(CKPT_USED_ALTIVEC) |
+#endif
+#ifdef CONFIG_SPE
+	x(CKPT_USED_SPE) |
+#endif
+#ifdef CONFIG_VSX
+	x(CKPT_USED_VSX) |
+#endif
+	0,
+};
+
+#undef x
+
+struct cr_hdr_cpu {
+	u32 features_used;
+	u32 pt_regs_size;
+	u32 fpr_size;
+	struct pt_regs pt_regs;
+	/* relevant fields from thread_struct */
+	double fpr[32][TS_FPRWIDTH];
+	u32 fpscr;
+	s32 fpexc_mode;
+	u64 dabr;
+	/* Altivec/VMX state */
+	vector128 vr[32];
+	vector128 vscr;
+	u64 vrsave;
+	/* SPE state */
+	u32 evr[32];
+	u64 acc;
+	u32 spefscr;
+};
+
+static void cr_cpu_feature_set(struct cr_hdr_cpu *hdr, enum cr_cpu_feature ftr)
+{
+	hdr->features_used |= 1ULL << ftr;
+}
+
+static bool cr_cpu_feature_isset(const struct cr_hdr_cpu *hdr,
+				 enum cr_cpu_feature ftr)
+{
+	return hdr->features_used & (1ULL << ftr);
+}
+
+/* determine whether an image has feature bits set that this kernel
+ * does not support */
+static bool cr_cpu_features_unknown(const struct cr_hdr_cpu *hdr)
+{
+	return hdr->features_used & ~CKPT_FTRS_POSSIBLE;
+}
+
+static void checkpoint_gprs(struct cr_hdr_cpu *cpu_hdr,
+			    struct task_struct *task)
+{
+	struct pt_regs *pt_regs;
+
+	pr_debug("%s: saving GPRs\n", __func__);
+
+	cpu_hdr->pt_regs_size = sizeof(*pt_regs);
+	pt_regs = task_pt_regs(task);
+	cpu_hdr->pt_regs = *pt_regs;
+}
+
+#ifdef CONFIG_PPC_FPU
+static void checkpoint_fpu(struct cr_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	/* easiest to save FP state unconditionally */
+
+	pr_debug("%s: saving FPU state\n", __func__);
+
+	if (task == current)
+		flush_fp_to_thread(task);
+
+	cpu_hdr->fpr_size = sizeof(cpu_hdr->fpr);
+	cpu_hdr->fpscr = task->thread.fpscr.val;
+	cpu_hdr->fpexc_mode = task->thread.fpexc_mode;
+
+	memcpy(cpu_hdr->fpr, task->thread.fpr, sizeof(cpu_hdr->fpr));
+
+	cr_cpu_feature_set(cpu_hdr, CKPT_USED_FP);
+}
+#else
+static void checkpoint_fpu(struct cr_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_ALTIVEC
+static void checkpoint_altivec(struct cr_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		return;
+
+	if (!task->thread.used_vr)
+		return;
+
+	pr_debug("%s: saving Altivec state\n", __func__);
+
+	if (task == current)
+		flush_altivec_to_thread(task);
+
+	cpu_hdr->vrsave = task->thread.vrsave;
+	memcpy(cpu_hdr->vr, task->thread.vr, sizeof(cpu_hdr->vr));
+	cr_cpu_feature_set(cpu_hdr, CKPT_USED_ALTIVEC);
+}
+#else
+static void checkpoint_altivec(struct cr_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static void checkpoint_spe(struct cr_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		return;
+
+	if (!task->thread.used_spe)
+		return;
+
+	pr_debug("%s: saving SPE state\n", __func__);
+
+	if (task == current)
+		flush_spe_to_thread(task);
+
+	cpu_hdr->acc = task->thread.acc;
+	cpu_hdr->spefscr = task->thread.spefscr;
+	memcpy(cpu_hdr->evr, task->thread.evr, sizeof(cpu_hdr->evr));
+	cr_cpu_feature_set(cpu_hdr, CKPT_USED_SPE);
+}
+#else
+static void checkpoint_spe(struct cr_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+static void checkpoint_dabr(struct cr_hdr_cpu *cpu_hdr,
+			    const struct task_struct *task)
+{
+	if (!task->thread.dabr)
+		return;
+
+	cpu_hdr->dabr = task->thread.dabr;
+	cr_cpu_feature_set(cpu_hdr, CKPT_USED_DEBUG);
+}
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr_cpu *cpu_hdr;
+	struct cr_hdr cr_hdr;
+	int rc;
+
+	cr_hdr.type = CR_HDR_CPU;
+	cr_hdr.len = sizeof(*cpu_hdr);
+
+	rc = -ENOMEM;
+	cpu_hdr = kzalloc(sizeof(*cpu_hdr), GFP_KERNEL);
+	if (!cpu_hdr)
+		goto err;
+
+	checkpoint_gprs(cpu_hdr, t);
+	checkpoint_fpu(cpu_hdr, t);
+	checkpoint_dabr(cpu_hdr, t);
+	checkpoint_altivec(cpu_hdr, t);
+	checkpoint_spe(cpu_hdr, t);
+
+	rc = cr_write_obj(ctx, &cr_hdr, cpu_hdr);
+err:
+	kfree(cpu_hdr);
+	return rc;
+}
+
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
+
+/* restart APIs */
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
+/* Based on the MSR value from a checkpoint image, produce an MSR
+ * value that is appropriate for the restored task.  Right now we only
+ * check for MSR_SF (64-bit) for PPC64.
+ */
+static unsigned long sanitize_msr(unsigned long msr_ckpt)
+{
+#ifdef CONFIG_PPC32
+	return MSR_USER;
+#else
+	if (msr_ckpt & MSR_SF)
+		return MSR_USER64;
+	return MSR_USER32;
+#endif
+}
+
+static int restore_gprs(const struct cr_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	struct pt_regs *regs;
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->pt_regs_size != sizeof(*regs))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	regs = task_pt_regs(task);
+	*regs = cpu_hdr->pt_regs;
+
+	regs->msr = sanitize_msr(regs->msr);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_PPC_FPU
+static int restore_fpu(const struct cr_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->fpr_size != sizeof(task->thread.fpr))
+		goto out;
+
+	rc = 0;
+	if (!update || !cr_cpu_feature_isset(cpu_hdr, CKPT_USED_FP))
+		goto out;
+
+	task->thread.fpscr.val = cpu_hdr->fpscr;
+	task->thread.fpexc_mode = cpu_hdr->fpexc_mode;
+
+	memcpy(task->thread.fpr, cpu_hdr->fpr, sizeof(task->thread.fpr));
+out:
+	return rc;
+}
+#else
+static int restore_fpu(const struct cr_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(cr_cpu_feature_isset(cpu_hdr, CKPT_USED_FP));
+	return 0;
+}
+#endif
+
+static int restore_dabr(const struct cr_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!cr_cpu_feature_isset(cpu_hdr, CKPT_USED_DEBUG))
+		goto out;
+
+	rc = -EINVAL;
+	if (!debugreg_valid(cpu_hdr->dabr, 0))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	debugreg_update(task, cpu_hdr->dabr, 0);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_ALTIVEC
+static int restore_altivec(const struct cr_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!cr_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.vrsave = cpu_hdr->vrsave;
+	task->thread.used_vr = 1;
+
+	memcpy(task->thread.vr, cpu_hdr->vr, sizeof(cpu_hdr->vr));
+out:
+	return rc;
+}
+#else
+static int restore_altivec(const struct cr_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(cr_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC));
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static int restore_spe(const struct cr_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!cr_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.acc = cpu_hdr->acc;
+	task->thread.spefscr = cpu_hdr->spefscr;
+	task->thread.used_spe = 1;
+
+	memcpy(task->thread.evr, cpu_hdr->evr, sizeof(cpu_hdr->evr));
+out:
+	return rc;
+}
+#else
+static int restore_spe(const struct cr_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(cr_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE));
+	return 0;
+}
+#endif
+
+struct restore_func_desc {
+	int (*func)(const struct cr_hdr_cpu *, struct task_struct *, bool);
+	const char *info;
+};
+
+typedef int (*restore_func_t)(const struct cr_hdr_cpu *,
+			      struct task_struct *, bool);
+
+static const restore_func_t restore_funcs[] = {
+	restore_gprs,
+	restore_fpu,
+	restore_dabr,
+	restore_altivec,
+	restore_spe,
+};
+
+static bool bitness_match(const struct cr_hdr_cpu *cpu_hdr,
+			  const struct task_struct *task)
+{
+	/* 64-bit image */
+	if (cpu_hdr->pt_regs.msr & MSR_SF) {
+		if (task->thread.regs->msr & MSR_SF)
+			return true;
+		else
+			return false;
+	}
+
+	/* 32-bit image */
+	if (task->thread.regs->msr & MSR_SF)
+		return false;
+
+	return true;
+}
+
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *cpu_hdr;
+	bool update;
+	int rc;
+	int i;
+
+	rc = -ENOMEM;
+	cpu_hdr = kzalloc(sizeof(*cpu_hdr), GFP_KERNEL);
+	if (!cpu_hdr)
+		goto err;
+
+	rc = cr_read_obj_type(ctx, cpu_hdr, sizeof(*cpu_hdr), CR_HDR_CPU);
+	if (rc < 0)
+		goto err;
+
+	rc = -EINVAL;
+	if (cr_cpu_features_unknown(cpu_hdr))
+		goto err;
+
+	/* temporary: restoring a 32-bit image from a 64-bit task and
+	 * vice-versa is known not to work (probably not restoring
+	 * thread_info correctly); detect this and fail gracefully.
+	 */
+	if (!bitness_match(cpu_hdr, current))
+		goto err;
+
+	/* We want to determine whether there's anything wrong with
+	 * the checkpoint image before changing the task at all.  Run
+	 * a "check" phase (update = false) first.
+	 */
+	update = false;
+commit:
+	for (i = 0; i < ARRAY_SIZE(restore_funcs); i++) {
+		rc = restore_funcs[i](cpu_hdr, current, update);
+		if (rc == 0)
+			continue;
+		pr_debug("%s: restore_func[%i] failed\n", __func__, i);
+		WARN_ON_ONCE(update);
+		goto err;
+	}
+
+	if (!update) {
+		update = true;
+		goto commit;
+	}
+
+err:
+	kfree(cpu_hdr);
+	return rc;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
-- 
1.5.4.3

* [RFC v14-rc2][PATCH 28/29] powerpc: wire up checkpoint and restart syscalls
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (26 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 27/29] powerpc: checkpoint/restart implementation Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 29/29] powerpc: enable checkpoint support in Kconfig Oren Laadan
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/include/asm/systbl.h |    2 ++
 arch/powerpc/include/asm/unistd.h |    4 +++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 72353f6..8d8dd68 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -322,3 +322,5 @@ SYSCALL_SPU(epoll_create1)
 SYSCALL_SPU(dup3)
 SYSCALL_SPU(pipe2)
 SYSCALL(inotify_init1)
+SYSCALL(checkpoint)
+SYSCALL(restart)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e07d0c7..2e333a1 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -341,10 +341,12 @@
 #define __NR_dup3		316
 #define __NR_pipe2		317
 #define __NR_inotify_init1	318
+#define __NR_checkpoint		319
+#define __NR_restart		320
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		319
+#define __NR_syscalls		321
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
-- 
1.5.4.3


* [RFC v14-rc2][PATCH 29/29] powerpc: enable checkpoint support in Kconfig
       [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (27 preceding siblings ...)
  2009-03-31  5:29   ` [RFC v14-rc2][PATCH 28/29] powerpc: wire up checkpoint and restart syscalls Oren Laadan
@ 2009-03-31  5:29   ` Oren Laadan
  28 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31  5:29 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/Kconfig |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 74cc312..ff7d598 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -26,6 +26,9 @@ config MMU
 	bool
 	default y
 
+config CHECKPOINT_SUPPORT
+	def_bool y
+
 config GENERIC_CMOS_UPDATE
 	def_bool y
 
-- 
1.5.4.3


* Re: [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code
       [not found]     ` <1238477349-11029-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-03-31 18:31       ` Oren Laadan
  2009-04-01 18:54       ` Serge E. Hallyn
  2009-04-07  3:29       ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-03-31 18:31 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Dave Hansen


If ext2/3/4 is compiled as a kernel module, apply this patch to
successfully compile this c/r patchset.

Oren.

diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index 0fe68bf..df6bb3d 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -109,6 +109,8 @@ int generic_file_checkpoint(struct cr_ctx *ctx, struct file *file,
 	return cr_write_file_generic(ctx, file, hh);
 }
 
+EXPORT_SYMBOL(generic_file_checkpoint);
+
 /* cr_write_file - dump the state of a given file pointer */
 static int cr_write_file(struct cr_ctx *ctx, struct file *file)
 {


On Tue, 31 Mar 2009, Oren Laadan wrote:

> From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> 
> Right now, we assume all normal files and directories
> can be checkpointed.  However, as usual in the VFS, there
> are specialized places that will always need an ability
> to override these defaults.  We could do this completely
> in the checkpoint code, but that would bitrot quickly.
> 
> This adds a new 'file_operations' function for
> checkpointing a file.  I did this under the assumption
> that we should have a dirt-simple way to make something
> (un)checkpointable that fits in with current code.
> 
> As you can see in the ext[234] and /proc patches, all
> that we have to do to make something simple be
> supported is add a single "generic" f_op entry.
> 
> Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
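
The dispatch described in the quoted commit message can be sketched in
userspace C roughly as follows. This is only an illustration under
stated assumptions: 'struct file_ops', 'struct file' and write_file()
are cut-down, hypothetical stand-ins for the kernel's file_operations,
struct file and cr_write_file(), and the 'dumped' flag exists only so
the effect is observable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct file;

/* Hypothetical cut-down file_operations with only the new hook. */
struct file_ops {
	int (*checkpoint)(struct file *f);
};

struct file {
	const struct file_ops *f_op;
	int dumped;	/* illustration only: set once the file is saved */
};

/* Stand-in for generic_file_checkpoint(). */
static int generic_checkpoint(struct file *f)
{
	f->dumped = 1;
	return 0;
}

/* A file type that opts in, and one that leaves the hook unset. */
static const struct file_ops regular_fops = { .checkpoint = generic_checkpoint };
static const struct file_ops opaque_fops  = { .checkpoint = NULL };

/* Files without a checkpoint hook are refused with -EBADF. */
static int write_file(struct file *f)
{
	int ret = -EBADF;

	assert(f && f->f_op);
	if (f->f_op->checkpoint)
		ret = f->f_op->checkpoint(f);
	return ret;
}
```

The point of the design is that a file type becomes checkpointable by
setting one pointer in its own ops table, and stays uncheckpointable by
default if nobody touches it.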


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]     ` <1238477349-11029-17-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 13:59       ` Serge E. Hallyn
       [not found]         ` <20090401135952.GA16973-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-04-01 18:36       ` Serge E. Hallyn
  2009-04-03 15:46       ` Dan Smith
  2 siblings, 1 reply; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 13:59 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> While file pointers are shared objects, they may share an underlying
> object themselves. For instance, file pointers of both ends of a pipe
> that share the same pipe inode. In this case, the shared entity to
> handle is the inode that is shared among two file pointers (e.g read-
> and write- ends). In this sort of "nested sharing" we need only save
> the underlying object once (upon first encounter) on checkpoint, and
> restore it only once during restart.
> 
> To checkpoint a file descriptor of this sort, we first lookup the
> inode in the hash table:

Sorry I've not followed well on irc.  What is the plan and timeline
with respect to this and Dave's fops approach?  Is someone rewriting
the pipes patches on top of that?  Will that replace this patch as
well?  Who is doing it, and when will we see that patch?

I'm just wondering how closely to review the next 3 patches.

thanks,
-serge


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]         ` <20090401135952.GA16973-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-01 14:13           ` Oren Laadan
  0 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-04-01 14:13 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>> While file pointers are shared objects, they may share an underlying
>> object themselves. For instance, file pointers of both ends of a pipe
>> that share the same pipe inode. In this case, the shared entity to
>> handle is the inode that is shared among two file pointers (e.g read-
>> and write- ends). In this sort of "nested sharing" we need only save
>> the underlying object once (upon first encounter) on checkpoint, and
>> restore it only once during restart.
>>
>> To checkpoint a file descriptor of this sort, we first lookup the
>> inode in the hash table:
> 
> Sorry I've not followed well on irc.  What is the plan and timeline
> with respect to this and Dave's fops approach?  Is someone rewriting
> the pipes patches on top of that?  Will that replace this patch as
> well?  Who is doing it, and when will we see that patch?
> 

The fops approach is already implemented for checkpoint. I already
modified the pipe implementation accordingly.

So there are 'generic_file_checkpoint()' and 'pipe_file_checkpoint()'
both in place.

The restart remains the same for all file types.

I think Dave is working on another change that will modify the format
of the checkpoint.

> I'm just wondering how closely to review the next 3 patches.

Please do :)

Oren.


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]     ` <1238477349-11029-17-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-01 13:59       ` Serge E. Hallyn
@ 2009-04-01 18:36       ` Serge E. Hallyn
  2009-04-03 15:46       ` Dan Smith
  2 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 18:36 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> While file pointers are shared objects, they may share an underlying
> object themselves. For instance, file pointers of both ends of a pipe
> that share the same pipe inode. In this case, the shared entity to
> handle is the inode that is shared among two file pointers (e.g read-
> and write- ends). In this sort of "nested sharing" we need only save
> the underlying object once (upon first encounter) on checkpoint, and
> restore it only once during restart.
> 
> To checkpoint a file descriptor of this sort, we first lookup the
> inode in the hash table:
> 
> If not found, it is the first encounter of this inode. Here, besides
> the file descriptor data, we also (a) register the inode in the hash
> and save the corresponding 'objref' of this inode in '->fd_objref' of
> the file descriptor. We then also (b) save the inode data, as per the
> inode type (this is not implemented in this patch, as it depends on
> the object). The file descriptor type will indicate the type of that
> object (e.g. for a pipe, when supported, CR_FD_PIPE).
> 
> If found, it is the second encounter of this inode, e.g. in the case
> of a pipe, as we hit the other end of the same pipe. At this point we
> need only record the reference ('objref') to the inode that we had
> saved before, and the file descriptor type is changed to CR_FD_OBJREF.
> 
> The logic during restart is similar: the '->fd_objref' is looked up in
> the hash table. Unlike checkpoint, during restart the object that is
> placed (and sought) in the hash table is the _file_ pointer, rather
> than the _inode_.
> 
> If not found, it is the first encounter of this inode. Therefore we
> (a) restore the inode data. Specifically, we construct a matching
> object and end up with multiple file pointers (e.g. if the object is a
> pipe, we will have both read- and write- ends). One of those is used
> for the file descriptor in question; the other(s) will be deposited in
> the hash table, to be retrieved and used later on. We also (b) register
> the newly created inode in the hash table using the given 'objref'.
> 
> If found, then we can skip the setup of the underlying object that
> is represented by the inode.
> 
> The type CR_FD_OBJREF indicates, on restart, that the corresponding
> file descriptor is already setup and registered in the hash under the
> '->fd_objref' that it had been assigned.
> 
> The next two patches use CR_FD_OBJREF to implement support for pipes.
> 
> Changelog[v14]:
>   - Introduce patch
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

-serge
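
As an aid to readers, the checkpoint-side logic in the quoted
description (first encounter registers the inode and saves it, second
encounter records only the reference) can be sketched as below. This is
a simplified userspace model, not the patch's code: it uses a small
linear table instead of a real hash, and 'struct objtable',
obj_add_ptr() and classify_fd() are hypothetical names that only loosely
mirror cr_obj_add_ptr() and the CR_FD_* types.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16

/* Loosely mirrors the CR_FD_* values in the patch series. */
enum fd_type { FD_OBJREF = 1, FD_GENERIC, FD_PIPE };

/* Linear table standing in for the object hash (an assumption). */
struct objtable {
	const void *ptr[MAX_OBJS];
	int count;
};

/*
 * Look up ptr; add it if missing. Returns 1 if newly added, 0 if it
 * was already present, -1 if the table is full. *objref is 1-based.
 */
static int obj_add_ptr(struct objtable *t, const void *ptr, int *objref)
{
	int i;

	assert(t->count >= 0 && t->count <= MAX_OBJS);
	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*objref = i + 1;
			return 0;
		}
	}
	if (t->count == MAX_OBJS)
		return -1;
	t->ptr[t->count++] = ptr;
	*objref = t->count;
	return 1;
}

/* First encounter: full object type. Later encounters: reference only. */
static int classify_fd(struct objtable *t, const void *inode, int *objref)
{
	int new = obj_add_ptr(t, inode, objref);

	if (new < 0)
		return new;
	return new ? FD_PIPE : FD_OBJREF;
}
```

Both ends of one pipe share an inode, so the second end gets the same
objref and the FD_OBJREF type, and the pipe data is written only once.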


* Re: [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code
       [not found]     ` <1238477349-11029-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31 18:31       ` Oren Laadan
@ 2009-04-01 18:54       ` Serge E. Hallyn
  2009-04-07  3:29       ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 18:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> 
> Right now, we assume all normal files and directories
> can be checkpointed.  However, as usual in the VFS, there
> are specialized places that will always need an ability
> to override these defaults.  We could do this completely
> in the checkpoint code, but that would bitrot quickly.
> 
> This adds a new 'file_operations' function for
> checkpointing a file.  I did this under the assumption
> that we should have a dirt-simple way to make something
> (un)checkpointable that fits in with current code.
> 
> As you can see in the ext[234] and /proc patches, all
> that we have to do to make something simple be
> supported is add a single "generic" f_op entry.
> 
> Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Oooh, I see - for some reason I was convinced you'd put this patch
further back in the stack.

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

(of course that is asymmetric with restart)

> ---
>  checkpoint/ckpt_file.c |   31 +++++++++++++++----------------
>  include/linux/fs.h     |   11 +++++++++++
>  2 files changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> index 9c344c7..0fe68bf 100644
> --- a/checkpoint/ckpt_file.c
> +++ b/checkpoint/ckpt_file.c
> @@ -91,6 +91,11 @@ static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
> 
>  	hh->fd_type = CR_FD_GENERIC;
> 
> +	/*
> +	 * FIXME: when we'll add support for unlinked files/dirs, we'll
> +	 * need to distinguish between unlinked filed and unlinked dirs.
> +	 */
> +
>  	ret = cr_write_obj(ctx, &h, hh);
>  	if (ret < 0)
>  		return ret;
> @@ -98,12 +103,16 @@ static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
>  	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
>  }
> 
> +int generic_file_checkpoint(struct cr_ctx *ctx, struct file *file,
> +			    struct cr_hdr_file *hh)
> +{
> +	return cr_write_file_generic(ctx, file, hh);
> +}
> +
>  /* cr_write_file - dump the state of a given file pointer */
>  static int cr_write_file(struct cr_ctx *ctx, struct file *file)
>  {
>  	struct cr_hdr_file *hh;
> -	struct dentry *dent = file->f_dentry;
> -	struct inode *inode = dent->d_inode;
>  	int ret;
> 
>  	hh = cr_hbuf_get(ctx, sizeof(*hh));
> @@ -116,21 +125,11 @@ static int cr_write_file(struct cr_ctx *ctx, struct file *file)
>  	hh->f_version = file->f_version;
>  	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
> 
> -	/*
> -	 * FIXME: when we'll add support for unlinked files/dirs, we'll
> -	 * need to distinguish between unlinked filed and unlinked dirs.
> -	 */
> -	switch (inode->i_mode & S_IFMT) {
> -	case S_IFREG:
> -	case S_IFDIR:
> -		ret = cr_write_file_generic(ctx, file, hh);
> -		break;
> -	default:
> -		ret = -EBADF;
> -		break;
> -	}
> -	cr_hbuf_put(ctx, sizeof(*hh));
> +	ret = -EBADF;
> +	if (file->f_op->checkpoint)
> +		ret = file->f_op->checkpoint(ctx, file, hh);
> 
> +	cr_hbuf_put(ctx, sizeof(*hh));
>  	return ret;
>  }
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3bf5057..835ee9e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1296,6 +1296,14 @@ int generic_osync_inode(struct inode *, struct address_space *, int);
>  typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
>  struct block_device_operations;
> 
> +#ifdef CONFIG_CHECKPOINT
> +struct cr_ctx;
> +struct cr_hdr_file;
> +int generic_file_checkpoint(struct cr_ctx *, struct file *, struct cr_hdr_file *);
> +#else
> +#define generic_file_checkpoint NULL
> +#endif
> +
>  /* These macros are for out of kernel modules to test that
>   * the kernel supports the unlocked_ioctl and compat_ioctl
>   * fields in struct file_operations. */
> @@ -1334,6 +1342,9 @@ struct file_operations {
>  	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
>  	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
>  	int (*setlease)(struct file *, long, struct file_lock **);
> +#ifdef CONFIG_CHECKPOINT
> +	int (*checkpoint)(struct cr_ctx *, struct file *file, struct cr_hdr_file *);
> +#endif
>  };
> 
>  struct inode_operations {
> -- 
> 1.5.4.3


* Re: [RFC v14-rc2][PATCH 17/29] Checkpoint open pipes
       [not found]     ` <1238477349-11029-18-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 19:47       ` Serge E. Hallyn
  0 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 19:47 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> A pipe is essentially a double-headed inode with a buffer attached to
> it. We checkpoint the pipe buffer only once, as soon as we hit one
> side of the pipe, regardless of whether it is the read- or write- end.
> 
> To checkpoint a file descriptor that refers to a pipe (either end), we
> first lookup the inode in the hash table:
> 
> If not found, it is the first encounter of this pipe. Besides the file
> descriptor, we also (a) save the pipe data, and (b) register the pipe
> inode in the hash. We save the 'objref' of the inode in '->fd_objref'
> of the file descriptor. The file descriptor type becomes CR_FD_PIPE.
> 
> If found, it is the second encounter of this pipe, namely, as we hit
> the other end of the same pipe. In this case we need only record the
> reference ('objref') to the inode that we had saved before, and the
> file descriptor type is changed to CR_FD_OBJREF.
> 
> The type CR_FD_PIPE will indicate to the kernel to create a new pipe;
> since both ends are created at the same time, one end will be used,
> and the other end will be deposited in the hash table for later use.
> The type CR_FD_OBJREF will indicate that the corresponding file
> descriptor is already setup and registered in the hash using the
> '->fd_objref' that it had been assigned.
> 
> The format of the pipe data is as follows:
> 
> struct cr_hdr_fd_pipe {
>        __s32 nr_bufs;
> };
> 
> cr_hdr + cr_hdr_fd_ent
> 	cr_hdr + cr_hdr_fd_data
> 		cr_hdr + cr_hdr_fd_pipe		-> # buffers
> 			cr_hdr + cr_hdr_buffer	-> 1st buffer
> 			cr_hdr + cr_hdr_buffer	-> 2nd buffer
> 			cr_hdr + cr_hdr_buffer	-> 3rd buffer
> 			...
> 
> Changelog[v14]:
>   - Use 'fd_type' instead of 'hh->fd_objref' in cr_write_fd_data()
>   - Revert change to pr_debug(), back to cr_debug()
>   - Discard the 'h.parent' field
>   - Check whether calls to cr_hbuf_get() fail
>   - Test that a pipe's inode != ctx->file's inode to prevent deadlock
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

But:

> ---
>  checkpoint/ckpt_file.c         |    2 +
>  fs/pipe.c                      |  113 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/checkpoint_hdr.h |    8 +++-
>  3 files changed, 122 insertions(+), 1 deletions(-)
> 
> diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> index 0fe68bf..dd26b3d 100644
> --- a/checkpoint/ckpt_file.c
> +++ b/checkpoint/ckpt_file.c
> @@ -12,6 +12,7 @@
>  #include <linux/sched.h>
>  #include <linux/file.h>
>  #include <linux/fdtable.h>
> +#include <linux/pipe_fs_i.h>
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
> 
> @@ -72,6 +73,7 @@ int cr_scan_fds(struct files_struct *files, int **fdtable)
>  	return n;
>  }
> 
> +
>  static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
>  				 struct cr_hdr_file *hh)
>  {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 14f502b..0c3f391 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -22,6 +22,9 @@
>  #include <asm/uaccess.h>
>  #include <asm/ioctls.h>
> 
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
>  /*
>   * We use a start+len construction, which provides full use of the 
>   * allocated memory.
> @@ -771,6 +774,113 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
>  	return 0;
>  }
> 
> +/* cr_write_pipebuf - dump contents of a pipe/fifo (assume i_mutex taken) */
> +static int cr_write_pipebuf(struct cr_ctx *ctx, struct pipe_inode_info *pipe)
> +{
> +	struct cr_hdr h;
> +	void *kbuf, *addr;
> +	int i, ret = 0;
> +
> +	kbuf = (void *) __get_free_page(GFP_KERNEL);
> +	if (!kbuf)
> +		return -ENOMEM;
> +
> +	/* this is a simplified fs/pipe.c:read_pipe() */

pipe_read() actually :)

> +
> +	for (i = 0; i < pipe->nrbufs; i++) {
> +		int nn = (pipe->curbuf + i) & (PIPE_BUFFERS-1);
> +		struct pipe_buffer *pbuf = pipe->bufs + nn;
> +		const struct pipe_buf_operations *ops = pbuf->ops;
> +
> +		ret = ops->confirm(pipe, pbuf);
> +		if (ret < 0)
> +			break;

not that it seems to matter, but pipe_read() returns error
also if ret > 0.

> +
> +		addr = ops->map(pipe, pbuf, 1);
> +		memcpy(kbuf, addr + pbuf->offset, pbuf->len);
> +		ops->unmap(pipe, pbuf, addr);
> +
> +		h.type = CR_HDR_BUFFER;
> +		h.len = pbuf->len;
> +
> +		ret = cr_write_obj(ctx, &h, kbuf);
> +		if (ret < 0)
> +			break;
> +	}
> +
> +	free_page((unsigned long) kbuf);
> +	return ret;
> +}
> +
> +/* cr_write_pipe - dump pipe (assume i_mutex taken) */
> +static int cr_write_pipe(struct cr_ctx *ctx, struct inode *inode)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_fd_pipe *hh;
> +	struct pipe_inode_info *pipe = inode->i_pipe;
> +	int ret;
> +
> +	h.type = CR_HDR_FD_PIPE;
> +	h.len = sizeof(*hh);
> +
> +	hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	if (!hh)
> +		return -ENOMEM;
> +
> +	hh->nr_bufs = pipe->nrbufs;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	return cr_write_pipebuf(ctx, pipe);
> +}
> +
> +static int pipe_file_checkpoint(struct cr_ctx *ctx,
> +				struct file *file, struct cr_hdr_file *hh)
> +{
> +	struct cr_hdr h;
> +	struct inode *inode = file->f_dentry->d_inode;
> +	int new, objref;
> +	int ret;
> +
> +	/*
> +	 * We take the inode's mutex and later will call vfs_write(),
> +	 * which also takes an inode's mutex. To avoid deadlock, make
> +	 * sure that the two inodes are distinct.
> +	 */
> +	if (ctx->file->f_dentry->d_inode == inode) {
> +		pr_warning("c/r: writing to pipe that is checkpointed "
> +			   "may result in a deadlock ... aborting\n");
> +		return -EDEADLK;
> +	}
> +
> +	h.type = CR_HDR_FILE;
> +	h.len = sizeof(*hh);
> +
> +	new = cr_obj_add_ptr(ctx, inode, &objref, CR_OBJ_INODE, 0);
> +	cr_debug("objref %d inode %p new %d\n", objref, inode, new);
> +	if (new < 0)
> +		return new;
> +
> +	hh->fd_type = (new ? CR_FD_PIPE : CR_FD_OBJREF);

The git commit msg has a good explanation, but it's worth a comment
in the code as well, that on first instance we call it
CR_FD_PIPE and second time CR_FD_OBJREF.

> +	hh->fd_objref = objref;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (new) {
> +		mutex_lock(&inode->i_mutex);
> +		ret = cr_write_pipe(ctx, inode);
> +		mutex_unlock(&inode->i_mutex);
> +	}
> +
> +	return ret;
> +}
> +
> +
>  /*
>   * The file_operations structs are not static because they
>   * are also used in linux/fs/fifo.c to do operations on FIFOs.
> @@ -787,6 +897,7 @@ const struct file_operations read_pipefifo_fops = {
>  	.open		= pipe_read_open,
>  	.release	= pipe_read_release,
>  	.fasync		= pipe_read_fasync,
> +	.checkpoint	= pipe_file_checkpoint,
>  };
> 
>  const struct file_operations write_pipefifo_fops = {
> @@ -799,6 +910,7 @@ const struct file_operations write_pipefifo_fops = {
>  	.open		= pipe_write_open,
>  	.release	= pipe_write_release,
>  	.fasync		= pipe_write_fasync,
> +	.checkpoint	= pipe_file_checkpoint,
>  };
> 
>  const struct file_operations rdwr_pipefifo_fops = {
> @@ -812,6 +924,7 @@ const struct file_operations rdwr_pipefifo_fops = {
>  	.open		= pipe_rdwr_open,
>  	.release	= pipe_rdwr_release,
>  	.fasync		= pipe_rdwr_fasync,
> +	.checkpoint	= pipe_file_checkpoint,
>  };
> 
>  struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 9ad845d..ce5d880 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -57,6 +57,7 @@ enum {
>  	CR_HDR_FD_TABLE = 301,
>  	CR_HDR_FD_ENT,
>  	CR_HDR_FILE,
> +	CR_HDR_FD_PIPE,
> 
>  	CR_HDR_TAIL = 5001
>  };
> @@ -152,7 +153,8 @@ struct cr_hdr_fd_ent {
>  /* fd types */
>  enum  fd_type {
>  	CR_FD_OBJREF = 1,
> -	CR_FD_GENERIC
> +	CR_FD_GENERIC,
> +	CR_FD_PIPE,
>  };
> 
>  struct cr_hdr_file {
> @@ -165,4 +167,8 @@ struct cr_hdr_file {
>  	__u64 f_version;
>  } __attribute__((aligned(8)));
> 
> +struct cr_hdr_fd_pipe {
> +	__s32 nr_bufs;
> +} __attribute__((aligned(8)));
> +
>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */
> -- 
> 1.5.4.3
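
For anyone puzzling over the index arithmetic in cr_write_pipebuf(): the
pipe's buffers form a ring, and the expression
(curbuf + i) & (PIPE_BUFFERS-1) walks the occupied slots in order,
wrapping at the end of the array. A minimal userspace sketch follows.
'struct ring' and ring_collect() are hypothetical, and NBUFS = 16 is an
assumed stand-in for PIPE_BUFFERS; the mask trick only works when the
size is a power of two.

```c
#include <assert.h>
#include <stddef.h>

#define NBUFS 16	/* assumed stand-in for PIPE_BUFFERS; power of two */

/* Hypothetical ring: 'curbuf' is the oldest slot, 'nrbufs' slots in use. */
struct ring {
	int slot[NBUFS];
	unsigned int curbuf;
	unsigned int nrbufs;
};

/* Copy the occupied slots, oldest first, into out[]. Returns the count. */
static size_t ring_collect(const struct ring *r, int *out)
{
	unsigned int i;

	assert(r->nrbufs <= NBUFS);
	for (i = 0; i < r->nrbufs; i++) {
		/* same wrap-around as in cr_write_pipebuf() */
		unsigned int nn = (r->curbuf + i) & (NBUFS - 1);

		out[i] = r->slot[nn];
	}
	return r->nrbufs;
}
```

Starting at slot 14 with four buffers in use, the walk visits slots 14,
15, 0 and 1, which is the order in which the data was written.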


* Re: [RFC v14-rc2][PATCH 18/29] Restore open pipes
       [not found]     ` <1238477349-11029-19-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 20:34       ` Serge E. Hallyn
  0 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 20:34 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> When seeing a CR_FD_PIPE file type, we create a new pipe and thus
> have two file pointers (read- and write- ends). We only use one of
> them, depending on which side was checkpointed first. We register the
> file pointer of the other end in the hash table, with the 'objref'
> given for this pipe from the checkpoint, deposited for later use. At
> this point we also restore the contents of the pipe buffers.
> 
> When the other end arrives, it will have file type CR_FD_OBJREF. We
> will then use the corresponding 'objref' to retrieve the file pointer
> from the hash table, and attach it to the process.
> 
> Note the difference from the checkpoint logic: during checkpoint we
> placed the _inode_ of the pipe in the hash table, while during restart
> we place the resulting _file_ in the hash table.
> 
> We restore the pipe contents by manually allocating and attaching
> buffers to the pipe (alternatively, we could read the data from the
> image file and then write it into the pipe, or use the splice() syscall).
> 
> Changelog[v14]:
>   - Discard the 'h.parent' field
>   - Check whether calls to cr_hbuf_get() fail
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

...

> +/* restore a pipe */
> +static int cr_read_fd_pipe(struct cr_ctx *ctx, struct cr_hdr_file *hh)
> +{
> +	struct file *file;
> +	int fds[2], which, ret;
> +
> +	file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +	else if (file)
> +		return cr_attach_get_file(file);

I think the casual reader would be helped by a comment like:

	/*
	 * if cr_obj_get_by_ref returned a file, then one end
	 * of the pipe has been restored, so we have
	 * cr_attach_get_file() attach the other end to a new
	 * fd, and we return that fd.
	 */

> +
> +	/* first encounter of this pipe: create it */
> +	ret = do_pipe(fds);
> +	if (ret < 0)
> +		return ret;

thanks,
-serge
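
To illustrate the restart flow discussed above (create both ends on
first encounter, hand out one, deposit the other under the pipe's
objref for later), here is a rough userspace analogue using POSIX
pipe(2). It is a sketch under assumptions, not the kernel code: the
table holds file descriptors rather than file pointers, and
'struct reftable', reftable_init() and restore_pipe_fd() are
hypothetical names.

```c
#include <assert.h>
#include <unistd.h>

#define MAX_REFS 16

/* Deposited "other ends", indexed by objref; -1 means empty. */
struct reftable {
	int other_end[MAX_REFS];
};

static void reftable_init(struct reftable *t)
{
	int i;

	for (i = 0; i < MAX_REFS; i++)
		t->other_end[i] = -1;
}

/*
 * which_end: 0 if the read end was checkpointed first, 1 for the write
 * end. It is only used on the first encounter of a given objref.
 * Returns a file descriptor, or -1 on error.
 */
static int restore_pipe_fd(struct reftable *t, int objref, int which_end)
{
	int fds[2];
	int fd;

	assert(which_end == 0 || which_end == 1);
	if (objref < 0 || objref >= MAX_REFS)
		return -1;

	/* second encounter: the other end is already waiting for us */
	fd = t->other_end[objref];
	if (fd >= 0) {
		t->other_end[objref] = -1;
		return fd;
	}

	/* first encounter of this pipe: create it */
	if (pipe(fds) < 0)
		return -1;
	t->other_end[objref] = fds[!which_end];
	return fds[which_end];
}
```

Calling restore_pipe_fd() twice with the same objref yields the two
ends of a single pipe, so data written to the first descriptor can be
read back from the second.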


* Re: [RFC v14-rc2][PATCH 19/29] Record 'struct file' object instead of the file name for VMAs
       [not found]     ` <1238477349-11029-20-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 21:45       ` Serge E. Hallyn
  0 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 21:45 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> The vma->vm_file can be an arbitrary file pointer, including one that
> is in use by a process as well and provided originally via the mmap()
> syscall.
> 
> Thus, when dumping the state of a VMA, save a file object instead
> of only the file name. As with other file objects, if it's seen for
> the first time it is dumped entirely, otherwise only the 'objref' is
> saved. The restart logic is updated accordingly.
> 
> Changelog[v14]:
>   - Introduce patch
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

(I assume this was on your todo list before Alexey pointed it
out?  If not, it might be gracious to mention him)

-serge


* Re: [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- shared memory
       [not found]     ` <1238477349-11029-22-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 23:06       ` Serge E. Hallyn
       [not found]         ` <20090401230657.GB27725-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 23:06 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> We now handle anonymous and file-mapped shared memory. Support for IPC
> shared memory requires support for IPC first. We extend cr_write_vma()
> to detect shared memory VMAs and handle them separately from private
> memory.
> 
> There is not much to do for file-mapped shared memory, except to force
> msync() on the region to ensure that the file system is consistent
> with the checkpoint image. Use our internal type CR_VMA_SHM_FILE.
> 
> Anonymous shared memory is always backed by an inode in the shmem filesystem.
> We use that inode to look up the VMA in the objhash and register it if
> not found (on first encounter). In this case, the type of the VMA is
> CR_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
> found there, we must have already saved it before, so we change the
> type to CR_VMA_SHM_ANON_SKIP and skip it.
> 
> To dump the contents of a shmem VMA, we loop through the pages of the
> inode in the shmem filesystem, and dump the contents of each dirty
> (allocated) page - unallocated pages must be clean.
> 
> Note that we save the original size of a shmem VMA because it may have
> been re-mapped partially. The format itself remains like with private
> VMAs, except that instead of addresses we record _indices_ (page nr)
> into the backing inode.
> 
> Changelog[v14]:
>   - Introduce patch
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Some nits though:

...

> +/**
> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples
>   * @ctx - checkpoint context

@shm - 

>   * @vma - vma to scan
>   * @start - start address (updated)
> + * @end - end address
>   *
> + * For private vma, records addr/page tuples. For shared vma, records
> + * index/page (index is the index of the page in the shmem object).
>   * Returns the number of pages collected
>   */
> -static int
> -cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct vm_area_struct *vma,
> -			  unsigned long *start)
> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, int shm,
> +			     struct vm_area_struct *vma, struct inode *ino,
> +			     unsigned long *start, unsigned long end)

...

>  /**
> - * cr_write_private_vma_contents - dump contents of a VMA with private memory
> + * cr_write_vma_contents - dump contents of a VMA
>   * @ctx - checkpoint context
>   * @vma - vma to scan

again lots of new args to comment

>   *
> @@ -367,17 +429,18 @@ static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
>   * virtual addresses into ctx->pgarr_list page-array chain. Then dump
>   * the addresses, followed by the page contents.
>   */
> -static int
> -cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +static int cr_write_vma_contents(struct cr_ctx *ctx, int shm,
> +				 struct vm_area_struct *vma, struct inode *ino,
> +				 unsigned long start, unsigned long end)

...

> +/**
> + * cr_write_shared_vma_contents - dump contents of a VMA with shared memory
> + * @ctx - checkpoint context
> + * @vma - vma to scan
> + */
> +static int cr_write_shared_vma_contents(struct cr_ctx *ctx,
> +					struct vm_area_struct *vma,
> +					enum cr_vma_type vma_type)
> +{
> +	struct inode *inode;
> +	int ret = 0;
> +
> +	/*
> +	 * Citing mmap(2): "Updates to the mapping are visible to other
> +	 * processes that map this file, and are carried through to the
> +	 * underlying file. The file may not actually be updated until
> +	 * msync(2) or munmap(2) is called"
> +	 *
> +	 * Citing msync(2): "Without use of this call there is no guarantee
> +	 * that changes are written back before munmap(2) is called."
> +	 *
> +	 * Force msync for region of shared mapped files, to ensure that
> +	 * the file system is consistent with the checkpoint image.
> +	 * (inspired by sys_msync).
> +	 *
> +	 * [FIXME: call vfs_sync only once per shared segment]
> +	 */
> +
> +	switch (vma_type) {
> +	case CR_VMA_SHM_FILE:
> +		/* no need for contents that are stored in the file system */
> +		ret = vfs_fsync(vma->vm_file, vma->vm_file->f_path.dentry, 0);
> +		break;
> +	case CR_VMA_SHM_ANON:
> +		/* save the contents of this region */
> +		inode = vma->vm_file->f_dentry->d_inode;
> +		ret = cr_write_shmem_contents(ctx, inode);
> +		break;
> +	case CR_VMA_SHM_ANON_SKIP:
> +	case CR_VMA_SHM_FILE_SKIP:
> +		/* already saved before .. skip now */
> +		break;
> +	default:
> +		BUG();

Well, no - since the user can feed in whatever crap they want,
this isn't a *bug*, right?

> +	}
> +
> +	return ret;
> +}
> +
> +/* return the subtype of a private vma segment */
> +static enum cr_vma_type cr_private_vma_type(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_file)
> +		return CR_VMA_FILE;
> +	else
> +		return CR_VMA_ANON;
> +}
> +
> +/*
> + * cr_shared_vma_type - return the subtype of a shared vma
> + * @vma: target vma
> + * @old: 0 if shared segment seen first time, else 1
> + */
> +static enum cr_vma_type cr_shared_vma_type(struct vm_area_struct *vma, int old)
> +{
> +	enum cr_vma_type vma_type = -ENOSYS;
> +
> +	if (vma->vm_ops && vma->vm_ops->cr_vma_type) {
> +		vma_type = (*vma->vm_ops->cr_vma_type)(vma);
> +		if (old)
> +			vma_type = cr_vma_type_skip(vma_type);

Heh, well that seems a little more obtuse than it needs to be...  Seems
like just doing vma_type++ would keep the reader more grounded about
what is going on.  But I'm not asking you to change it (bc I'm sure
someone likes it and would ask to change it back)

...

>  struct cr_hdr_vma {
>  	__u32 vma_type;
> -	__u32 vma_objref;	/* for vma->vm_file */
> +	__s32 vma_objref;	/* objref of backing file */
> +	__s32 shm_objref;	/* objref of shared segment */

You're going to upset Alexey again with the signeds, aren't you?

> +	__u32 _padding;
> +	__u64 shm_size;		/* size of shared segment */
> 
>  	__u64 vm_start;
>  	__u64 vm_end;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 53118f0..06aeda5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -28,6 +28,7 @@
>  #include <linux/mm.h>
>  #include <linux/module.h>
>  #include <linux/swap.h>
> +#include <linux/checkpoint_hdr.h>
> 
>  static struct vfsmount *shm_mnt;
> 
> @@ -1470,6 +1471,13 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
>  }
>  #endif
> 
> +#ifdef CONFIG_CHECKPOINT
> +static int shmem_cr_vma_type(struct vm_area_struct *vma)
> +{
> +	return CR_VMA_SHM_ANON;
> +}
> +#endif
> +
>  int shmem_lock(struct file *file, int lock, struct user_struct *user)
>  {
>  	struct inode *inode = file->f_path.dentry->d_inode;
> @@ -2477,6 +2485,9 @@ static struct vm_operations_struct shmem_vm_ops = {
>  	.set_policy     = shmem_set_policy,
>  	.get_policy     = shmem_get_policy,
>  #endif
> +#ifdef CONFIG_CHECKPOINT
> +	.cr_vma_type	= shmem_cr_vma_type,
> +#endif
>  };
> 
> 
> -- 
> 1.5.4.3


* Re: [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- shared memory
       [not found]         ` <20090401230657.GB27725-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-01 23:18           ` Oren Laadan
       [not found]             ` <49D3F636.1070303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-04-01 23:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen


Thanks for the review ...

Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>> We now handle anonymous and file-mapped shared memory. Support for IPC
>> shared memory requires support for IPC first. We extend cr_write_vma()
>> to detect shared memory VMAs and handle them separately from private
>> memory.

[...]

>> +	switch (vma_type) {
>> +	case CR_VMA_SHM_FILE:
>> +		/* no need for contents that are stored in the file system */
>> +		ret = vfs_fsync(vma->vm_file, vma->vm_file->f_path.dentry, 0);
>> +		break;
>> +	case CR_VMA_SHM_ANON:
>> +		/* save the contents of this region */
>> +		inode = vma->vm_file->f_dentry->d_inode;
>> +		ret = cr_write_shmem_contents(ctx, inode);
>> +		break;
>> +	case CR_VMA_SHM_ANON_SKIP:
>> +	case CR_VMA_SHM_FILE_SKIP:
>> +		/* already saved before .. skip now */
>> +		break;
>> +	default:
>> +		BUG();
> 
> Well, no - since the user can feed in whatever crap they want,
> this isn't a *bug*, right?

... this is during checkpoint, no user input; it makes sure we don't
add a new type of VMA that we don't handle. On restart we complain
with -EINVAL.

[...]

> 
>>  struct cr_hdr_vma {
>>  	__u32 vma_type;
>> -	__u32 vma_objref;	/* for vma->vm_file */
>> +	__s32 vma_objref;	/* objref of backing file */
>> +	__s32 shm_objref;	/* objref of shared segment */
> 
> You're going to upset Alexey again with the signeds, aren't you?

Yes, I put a comment about signed values somewhere. I cleaned up most of
the unsigned instances following Alexey's comment, but I think in some
cases it makes sense.

In particular, assume I take a pid, or an objref, which is an 'int' in
the code, and save it with __u32. During restart I need to test for a
valid value before blindly converting back to (signed) int, like:
	
	ret = -EINVAL;
	if (hh->pid > INT_MAX)
		goto out;

in that case, I can just as well leave it signed and test

	ret = -EINVAL;
	if (hh->pid < 0)
		goto out;

which is much more readable, and less error-prone if sometime later
we change objref type from (signed) int to (signed) long and forget
to change INT_MAX to LONG_MAX everywhere ...
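
To make the comparison concrete, here is a minimal userspace sketch of the
two validation styles. The helper names (demo_check_unsigned,
demo_check_signed) are made up for illustration and are not part of the
patchset; the fixed-width types stand in for __u32/__s32.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * Field saved as unsigned: the range check must name the limit of the
 * in-kernel type explicitly, so it has to be kept in sync by hand if
 * that type ever changes.
 */
static int demo_check_unsigned(uint32_t saved, int *out)
{
	if (saved > INT_MAX)
		return -EINVAL;
	*out = (int)saved;
	return 0;
}

/*
 * Field saved as signed: a negative value is invalid by definition, so
 * the check does not mention any type-specific limit.
 */
static int demo_check_signed(int32_t saved, int *out)
{
	if (saved < 0)
		return -EINVAL;
	*out = saved;
	return 0;
}
```

Both helpers reject the same bit pattern (0xffffffff, i.e. -1 when read as
signed); the signed variant just does so without referring to INT_MAX.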

Oren.


* Re: [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4)
       [not found]     ` <1238477349-11029-25-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 23:20       ` Serge E. Hallyn
       [not found]         ` <20090401232013.GA31361-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 23:20 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dan Smith, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> 
> As suggested by Dave[1], this provides us a way to make the copy-in and
> copy-out processes symmetric.  CR_COPY_ARRAY() provides us a way to do
> the same thing but for arrays.  It's not critical, but it helps us unify
> the checkpoint and restart paths for some things.
> 
> Changelog:
>     Mar 04:
>             . Removed semicolons
>             . Added build-time check for __must_be_array in CR_COPY_ARRAY
>     Feb 27:
>             . Changed CR_COPY() to use assignment, eliminating the need
>               for the CR_COPY_BIT() macro
>             . Add CR_COPY_ARRAY() macro to help copying register arrays,
>               etc
>             . Move the macro definitions inside the CR #ifdef
>     Feb 25:
>             . Changed WARN_ON() to BUILD_BUG_ON()
> 
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> 
> 1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
> ---
>  include/linux/checkpoint.h |   26 ++++++++++++++++++++++++++
>  1 files changed, 26 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 031e414..59ec563 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -120,6 +120,32 @@ extern int cr_read_mm(struct cr_ctx *ctx);
>  extern int cr_read_fd_table(struct cr_ctx *ctx);
>  extern int cr_read_file(struct cr_ctx *ctx, int objref);
> 
> +/* useful macros to copy fields and buffers to/from cr_hdr_xxx structures */
> +#define CR_CPT 1
> +#define CR_RST 2
> +
> +#define CR_COPY(op, SAVE, LIVE)				        \
> +	do {							\
> +		if (op == CR_CPT)				\
> +			SAVE = LIVE;				\
> +		else						\
> +			LIVE = SAVE;				\
> +	} while (0)

Yes the above (SAVE+LIVE) is so much more useful to future coders,
thanks.
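
For readers skimming the thread, a minimal sketch of how one helper can
serve both directions with CR_COPY(). The macro is as quoted above (with
the op argument parenthesized); the structs demo_hdr_cpu and demo_regs and
the function demo_copy_cpu are hypothetical stand-ins, not names from the
patchset.

```c
#include <assert.h>

#define CR_CPT 1
#define CR_RST 2

/* Copy LIVE into SAVE on checkpoint, SAVE into LIVE on restart. */
#define CR_COPY(op, SAVE, LIVE)				\
	do {						\
		if ((op) == CR_CPT)			\
			SAVE = LIVE;			\
		else					\
			LIVE = SAVE;			\
	} while (0)

/* Hypothetical saved-image record and live register set. */
struct demo_hdr_cpu {
	unsigned long ip;
	unsigned long sp;
};

struct demo_regs {
	unsigned long ip;
	unsigned long sp;
};

/* One function body covers both the checkpoint and the restart path. */
static void demo_copy_cpu(int op, struct demo_hdr_cpu *hh,
			  struct demo_regs *regs)
{
	CR_COPY(op, hh->ip, regs->ip);
	CR_COPY(op, hh->sp, regs->sp);
}
```

Calling demo_copy_cpu(CR_CPT, ...) fills the record from the live state,
and demo_copy_cpu(CR_RST, ...) does the reverse, so the field list only has
to be written once.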

> +
> +/*
> + * Copy @count items from @LIVE to @SAVE if op is CR_CPT (otherwise,
> + * copy in the reverse direction)
> + */
> +#define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
> +	do {								\
> +		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
> +		if (op == CR_CPT)					\
> +			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
> +		else							\
> +			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
> +	} while (__must_be_array(SAVE) && __must_be_array(LIVE) && 0)

It doesn't really matter I guess, but I'd prefer to see:

#define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		__must_be_array(SAVE);					\
		__must_be_array(LIVE);					\
		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
		if (op == CR_CPT)					\
			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
		else							\
			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
	} while (0)

Putting the __must_be_array()s inside the condition seems really weird.


> +
> +
>  #define cr_debug(fmt, args...)  \
>  	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
> 
> -- 
> 1.5.4.3


* Re: [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- shared memory
       [not found]             ` <49D3F636.1070303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-01 23:32               ` Serge E. Hallyn
  0 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-01 23:32 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> 
> Thanks for the review ...
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> >> We now handle anonymous and file-mapped shared memory. Support for IPC
> >> shared memory requires support for IPC first. We extend cr_write_vma()
> >> to detect shared memory VMAs and handle them separately from private
> >> memory.
> 
> [...]
> 
> >> +	switch (vma_type) {
> >> +	case CR_VMA_SHM_FILE:
> >> +		/* no need for contents that are stored in the file system */
> >> +		ret = vfs_fsync(vma->vm_file, vma->vm_file->f_path.dentry, 0);
> >> +		break;
> >> +	case CR_VMA_SHM_ANON:
> >> +		/* save the contents of this region */
> >> +		inode = vma->vm_file->f_dentry->d_inode;
> >> +		ret = cr_write_shmem_contents(ctx, inode);
> >> +		break;
> >> +	case CR_VMA_SHM_ANON_SKIP:
> >> +	case CR_VMA_SHM_FILE_SKIP:
> >> +		/* already saved before .. skip now */
> >> +		break;
> >> +	default:
> >> +		BUG();
> > 
> > Well, no - since the user can feed in whatever crap they want,
> > this isn't a *bug*, right?
> 
> ... this is during checkpoint

Oh, heh. never mind then.

> , no user input; it makes sure we don't
> add a new type of VMA that we don't handle. On restart we complain
> with -EINVAL.
> 
> [...]
> 
> > 
> >>  struct cr_hdr_vma {
> >>  	__u32 vma_type;
> >> -	__u32 vma_objref;	/* for vma->vm_file */
> >> +	__s32 vma_objref;	/* objref of backing file */
> >> +	__s32 shm_objref;	/* objref of shared segment */
> > 
> > You're going to upset Alexey again with the signeds, aren't you?
> 
> Yes, I put a comment about signed values somewhere. I cleaned up most of
> the unsigned instances following Alexey's comment, but I think in some
> cases it makes sense.
> 
> In particular, assume I take a pid, or an objref, which is an 'int' in
> the code, and save it with __u32. During restart I need to test for a
> valid value before blindly converting back to (signed) int, like:
> 	
> 	ret = -EINVAL;
> 	if (hh->pid > INT_MAX)
> 		goto out;
> 
> in that case, I can just as well leave it signed and test
> 
> 	ret = -EINVAL;
> 	if (hh->pid < 0)
> 		goto out;
> 
> which is much more readable, and less error-prone if sometime later
> we change objref type from (signed) int to (signed) long and forget
> to change INT_MAX to LONG_MAX everywhere ...
> 
> Oren.

Makes sense.  Just wanted to make sure it didn't accidentally slip
in.

thanks,
-serge


* Re: [RFC v14-rc2][PATCH 22/29] Restore anonymous- and file-mapped- shared memory
       [not found]     ` <1238477349-11029-23-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-02 16:59       ` Serge E. Hallyn
  0 siblings, 0 replies; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-02 16:59 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> The bulk of the work is in cr_read_vma(), which has been refactored:
> the part that creates the suitable 'struct file *' for the mapping is
> now larger and moved to a separate function. What's left is to read
> the VMA description, get the file pointer, create the mapping, and
> proceed to read the contents in.
> 
> Both anonymous shared VMAs that have been read earlier (as indicated
> by a look up to objhash) and file-mapped shared VMAs are skipped.
> Anonymous shared VMAs seen for the first time have their contents
> read in directly to the backing inode, as indexed by the page numbers
> (as opposed to virtual addresses).
> 
> Changelog[v14]:
>   - Introduce patch
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge


* Re: [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4)
       [not found]         ` <20090401232013.GA31361-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-02 19:00           ` Dan Smith
       [not found]             ` <87vdpmnan2.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Smith @ 2009-04-02 19:00 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

>> +
>> +/*
>> + * Copy @count items from @LIVE to @SAVE if op is CR_CPT (otherwise,
>> + * copy in the reverse direction)
>> + */
>> +#define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
>> +	do {								\
>> +		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
>> +		if (op == CR_CPT)					\
>> +			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
>> +		else							\
>> +			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
>> +	} while (__must_be_array(SAVE) && __must_be_array(LIVE) && 0)

SH> It doesn't really matter I guess, but I'd prefer to see:

SH> #define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
SH> 	do {								\
SH> 		__must_be_array(SAVE);					\
SH> 		__must_be_array(LIVE);					\
SH> 		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
SH> 		if (op == CR_CPT)					\
SH> 			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
SH> 		else							\
SH> 			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
SH> 	} while (0)

SH> Putting the __must_be_array()s inside the condition seems really weird.

I thought I explained this somewhere.  You'll get a compile warning if
you use __must_be_array() as a bare statement and discard its value.
If you try to stuff the result into a variable that you don't use,
you'll get a warning about an unused variable.  I did what I did
because it seemed like a sane way to sidestep both of those issues.

Maybe a comment is warranted? :)

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org


* Re: [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4)
       [not found]             ` <87vdpmnan2.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-04-02 19:06               ` Serge E. Hallyn
       [not found]                 ` <20090402190612.GA24390-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Serge E. Hallyn @ 2009-04-02 19:06 UTC (permalink / raw)
  To: Dan Smith
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Quoting Dan Smith (danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> SH> #define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
> SH> 	do {								\
> SH> 		__must_be_array(SAVE);					\
> SH> 		__must_be_array(LIVE);					\
> SH> 		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
> SH> 		if (op == CR_CPT)					\
> SH> 			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
> SH> 		else							\
> SH> 			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
> SH> 	} while (0)
> 
> SH> Putting the __must_be_array()s inside the condition seems really weird.
> 
> I thought I explained this somewhere.  You'll get a compile warning if

Oh, probably on irc...

> you make __must_be_array() a statement without an lvalue.  If you try
> to stuff the result into a variable that you don't use, you'll get a
> warning about an unused variable.  I did what I did because it seemed
> like a sane way to sidestep both of those issues.
> 
> Maybe a comment is warranted? :)

That's sucky...  yeah i would say a comment, though of course it could
be one of those cases where everyone but me already knows...

-serge


* Re: [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4)
       [not found]                 ` <20090402190612.GA24390-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-02 20:22                   ` Dan Smith
       [not found]                     ` <87r60an6us.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Smith @ 2009-04-02 20:22 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

SH> That's sucky...  yeah i would say a comment, though of course it
SH> could be one of those cases where everyone but me already knows...

Here's a nice fix brought to us by Mr. Lynch...

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 59ec563..a8d758f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -138,12 +138,14 @@ extern int cr_read_file(struct cr_ctx *ctx, int objref);
  */
 #define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
 	do {								\
+		(void)__must_be_array(SAVE);				\
+		(void)__must_be_array(LIVE);				\
 		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
 		if (op == CR_CPT)					\
 			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
 		else							\
 			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
-	} while (__must_be_array(SAVE) && __must_be_array(LIVE) && 0)
+	} while (0)
 
 
 #define cr_debug(fmt, args...)  \
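
A userspace sketch of the same idea, for anyone who wants to try it outside
the kernel tree. MUST_BE_ARRAY and BUILD_BUG_ON_ZERO below are stand-ins
written for this example (GCC/Clang extensions assumed), not the kernel
definitions, and the demo_* structs and demo_copy_regs are hypothetical.
The (void) cast marks the compile-time check as deliberately unused, so it
can be a plain statement without triggering an unused-value warning.

```c
#include <assert.h>
#include <string.h>

#define CR_CPT 1
#define CR_RST 2

/* Evaluates to 0, or fails to compile if e is non-zero (stand-in). */
#define BUILD_BUG_ON_ZERO(e) (sizeof(char[1 - 2 * !!(e)]) - 1)

/* Fails to compile if a is a pointer rather than an array (stand-in). */
#define MUST_BE_ARRAY(a) \
	BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(__typeof__(a), \
						       __typeof__(&(a)[0])))

#define CR_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		(void)MUST_BE_ARRAY(SAVE);				\
		(void)MUST_BE_ARRAY(LIVE);				\
		(void)BUILD_BUG_ON_ZERO(sizeof(*(SAVE)) != sizeof(*(LIVE))); \
		if ((op) == CR_CPT)					\
			memcpy(SAVE, LIVE, (count) * sizeof(*(SAVE)));	\
		else							\
			memcpy(LIVE, SAVE, (count) * sizeof(*(SAVE)));	\
	} while (0)

/* Hypothetical live and saved register arrays. */
struct demo_live_regs {
	unsigned long gprs[4];
};

struct demo_saved_regs {
	unsigned long gprs[4];
};

static void demo_copy_regs(int op, struct demo_saved_regs *save,
			   struct demo_live_regs *live)
{
	CR_COPY_ARRAY(op, save->gprs, live->gprs, 4);
}
```

Passing a bare pointer (for example save->gprs + 0) as SAVE or LIVE makes
the array size in BUILD_BUG_ON_ZERO negative, so the build fails instead of
silently copying with the wrong element count.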


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]     ` <1238477349-11029-17-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-01 13:59       ` Serge E. Hallyn
  2009-04-01 18:36       ` Serge E. Hallyn
@ 2009-04-03 15:46       ` Dan Smith
       [not found]         ` <87y6uhyc3j.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  2 siblings, 1 reply; 66+ messages in thread
From: Dan Smith @ 2009-04-03 15:46 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

OL> @@ -86,46 +132,44 @@ static int cr_read_file(struct cr_ctx *ctx, int objref)
OL>  		goto out;

OL>  	ret = -EINVAL;
OL> +	if (hh->fd_objref < 0)
OL> +		goto out;

As far as I can tell, hh->fd_objref never gets set anywhere.  On my
system, this causes restart to always fail because there is garbage in
that field, thus triggering the above check.  If I remove this,
restart completes successfully.  The following grep tells me that
maybe this check isn't valid:

  % grep fd_objref checkpoint/*.c include/linux/checkpoint*.h
  checkpoint/rstr_file.c: file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
  checkpoint/rstr_file.c: file = cr_obj_add_file(ctx, fds[1-which], hh->fd_objref);
  checkpoint/rstr_file.c:static int cr_read_fd_objref(struct cr_ctx *ctx, struct cr_hdr_file *hh)
  checkpoint/rstr_file.c: file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
  checkpoint/rstr_file.c: if (hh->fd_objref < 0)
  checkpoint/rstr_file.c:	fd = cr_read_fd_objref(ctx, hh);
  include/linux/checkpoint_hdr.h:	__s32 fd_objref;

I haven't looked into the surrounding bits yet, so maybe I'm missing
something, but this seems to be causing a spurious failure on s390 at
least.

I'm doing this on a clone of your repository's ckpt-v14-rc2 branch.
Perhaps that repo is missing a patch?

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]         ` <87y6uhyc3j.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-04-03 16:25           ` Oren Laadan
       [not found]             ` <49D63865.1030807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-04-03 16:25 UTC (permalink / raw)
  To: Dan Smith
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen




Dan Smith wrote:
> OL> @@ -86,46 +132,44 @@ static int cr_read_file(struct cr_ctx *ctx, int objref)
> OL>  		goto out;
> 
> OL>  	ret = -EINVAL;
> OL> +	if (hh->fd_objref < 0)
> OL> +		goto out;
> 
> As far as I can tell, hh->fd_objref never gets set anywhere.  On my
> system, this causes restart to always fail because there is garbage in
> that field, thus triggering the above check.  If I remove this,
> restart completes successfully.  The following grep tells me that
> maybe this check isn't valid:
> 
>   % grep fd_objref checkpoint/*.c include/linux/checkpoint*.h
>   checkpoint/rstr_file.c: file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
>   checkpoint/rstr_file.c: file = cr_obj_add_file(ctx, fds[1-which], hh->fd_objref);
>   checkpoint/rstr_file.c:static int cr_read_fd_objref(struct cr_ctx *ctx, struct cr_hdr_file *hh)
>   checkpoint/rstr_file.c: file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
>   checkpoint/rstr_file.c: if (hh->fd_objref < 0)
>   checkpoint/rstr_file.c:	fd = cr_read_fd_objref(ctx, hh);
>   include/linux/checkpoint_hdr.h:	__s32 fd_objref;

hh->fd_objref is set, for pipes, in fs/pipe.c (outcome of the move to f_ops).
So the problem is that the field isn't explicitly zeroed otherwise. I'll fix
that for the next round. Meanwhile, you can add:

	hh->fd_objref = 0;

in cr_write_file() before the call to file->f_ops->checkpoint().
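
A small userspace sketch of why the explicit zeroing matters. Everything
here is a simplified stand-in: demo_hdr_file, the two demo callbacks and
demo_write_file/demo_read_file are hypothetical and only mimic the shape of
the real code paths.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the per-file header. */
struct demo_hdr_file {
	int32_t fd_objref;
	uint32_t f_flags;
};

typedef int (*demo_checkpoint_fn)(struct demo_hdr_file *hh);

/* Pipe-like callback: the only one that sets fd_objref. */
static int demo_pipe_checkpoint(struct demo_hdr_file *hh)
{
	hh->fd_objref = 7;
	return 0;
}

/* Generic callback: leaves fd_objref alone. */
static int demo_generic_checkpoint(struct demo_hdr_file *hh)
{
	(void)hh;
	return 0;
}

/* Checkpoint side: clear the field before the per-type callback runs. */
static int demo_write_file(struct demo_hdr_file *hh, demo_checkpoint_fn fn)
{
	hh->fd_objref = 0;
	return fn(hh);
}

/* Restart side: the sanity check that tripped on the garbage value. */
static int demo_read_file(const struct demo_hdr_file *hh)
{
	if (hh->fd_objref < 0)
		return -EINVAL;
	return 0;
}
```

With a header buffer that still holds stale bytes, demo_read_file() fails
unless demo_write_file() cleared fd_objref first; callbacks that do set the
field are unaffected.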

Thanks,

Oren.

> 
> I haven't looked into the surrounding bits yet, so maybe I'm missing
> something, but this seems to be causing a spurious failure on s390 at
> least.
> 
> I'm doing this on a clone of your repository's ckpt-v14-rc2 branch.
> Perhaps that repo is missing a patch?
> 


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]             ` <49D63865.1030807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-03 16:30               ` Dan Smith
  2009-04-03 16:54               ` Dave Hansen
  1 sibling, 0 replies; 66+ messages in thread
From: Dan Smith @ 2009-04-03 16:30 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

OL> hh->fd_objref is set, for pipes, in fs/pipe.c (outcome of the move
OL> to f_ops). So the problem is that the field isn't explicitly zeroed
OL> otherwise.

Ah, gotcha.

Thanks!

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org


* Re: [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup
       [not found]             ` <49D63865.1030807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-03 16:30               ` Dan Smith
@ 2009-04-03 16:54               ` Dave Hansen
  1 sibling, 0 replies; 66+ messages in thread
From: Dave Hansen @ 2009-04-03 16:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dan Smith, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, 2009-04-03 at 12:25 -0400, Oren Laadan wrote:
> > As far as I can tell, hh->fd_objref never gets set anywhere.  On my
> > system, this causes restart to always fail because there is garbage in
> > that field, thus triggering the above check.  If I remove this,
> > restart completes successfully.  The following grep tells me that
> > maybe this check isn't valid:
> > 
> >   % grep fd_objref checkpoint/*.c include/linux/checkpoint*.h
> >   checkpoint/rstr_file.c: file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
> >   checkpoint/rstr_file.c: file = cr_obj_add_file(ctx, fds[1-which], hh->fd_objref);
> >   checkpoint/rstr_file.c:static int cr_read_fd_objref(struct cr_ctx *ctx, struct cr_hdr_file *hh)
> >   checkpoint/rstr_file.c: file = cr_obj_get_by_ref(ctx, hh->fd_objref, CR_OBJ_FILE);
> >   checkpoint/rstr_file.c: if (hh->fd_objref < 0)
> >   checkpoint/rstr_file.c:     fd = cr_read_fd_objref(ctx, hh);
> >   include/linux/checkpoint_hdr.h:     __s32 fd_objref;
> 
> hh->fd_objref is set, for pipes, in fs/pipe.c (outcome of the move to f_ops).
> So the problem is that the field isn't explicitly zeroed otherwise. I'll fix
> that for the next round. Meanwhile, you can add:
> 
>         hh->fd_objref = 0;
> 
> in cr_write_file() before the call to file->f_ops->checkpoint().

If fd_objref isn't necessary for all fds, then why do we have it in the
'cr_hdr_file' object?  It would make a heck of a lot more sense to me
if 'cr_hdr_file' truly contained only the common pieces for all fds.

This all just seems misdesigned to me.  The hh->fd_type should always
say 'PIPE'.  Then we call into the pipe write function.  It can either
write out a real, whole pipe record or it can write out a reference to a
pipe it has already seen.

That way, we can have pipe->inode_objref that actually explains what
kind of objref we're looking for.  We can have something like a
CR_HDR_FD_PIPE_REF or a CR_HDR_FD_INODE_REF object in place of the full
pipe the second time we see it.

-- Dave


* Re: [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4)
       [not found]                     ` <87r60an6us.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-04-05 20:25                       ` Oren Laadan
  0 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-04-05 20:25 UTC (permalink / raw)
  To: Dan Smith
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen


ok, thanks.

Dan Smith wrote:
> SH> That's sucky...  yeah i would say a comment, though of course it
> SH> could be one of those cases where everyone but me already knows...
> 
> Here's a nice fix brought to us by Mr. Lynch...
> 


* Re: [RFC v14-rc2][PATCH 02/29] Checkpoint/restart: initial documentation
       [not found]     ` <1238477349-11029-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:22       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:22 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen


Just a nit:

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:

| From 6f5483b085b1fb675a8445c65ddbeb7b38187865 Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 10:45:53 -0400
| Subject: [PATCH 02/29] Checkpoint/restart: initial documentation
| 
| Covers application checkpoint/restart, overall design, interfaces,
| usage, shared objects, and checkpoint image format.
| 
| Changelog[v14]:
|   - Discard the 'h.parent' field
| 
| Changelog[v8]:
|   - Split into multiple files in Documentation/checkpoint/...
|   - Extend documentation, fix typos and comments from feedback
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  Documentation/checkpoint/ckpt.c        |   32 ++++++
|  Documentation/checkpoint/internals.txt |  127 +++++++++++++++++++++++
|  Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
|  Documentation/checkpoint/rstr.c        |   20 ++++
|  Documentation/checkpoint/security.txt  |   38 +++++++
|  Documentation/checkpoint/self.c        |   57 +++++++++++
|  Documentation/checkpoint/test.c        |   48 +++++++++
|  Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
|  8 files changed, 598 insertions(+), 0 deletions(-)
|  create mode 100644 Documentation/checkpoint/ckpt.c
|  create mode 100644 Documentation/checkpoint/internals.txt
|  create mode 100644 Documentation/checkpoint/readme.txt
|  create mode 100644 Documentation/checkpoint/rstr.c
|  create mode 100644 Documentation/checkpoint/security.txt
|  create mode 100644 Documentation/checkpoint/self.c
|  create mode 100644 Documentation/checkpoint/test.c
|  create mode 100644 Documentation/checkpoint/usage.txt
| 
| diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
| new file mode 100644
| index 0000000..094408c
| --- /dev/null
| +++ b/Documentation/checkpoint/ckpt.c
| @@ -0,0 +1,32 @@
| +#include <stdio.h>
| +#include <stdlib.h>
| +#include <errno.h>
| +#include <unistd.h>
| +#include <sys/syscall.h>
| +
| +int main(int argc, char *argv[])
| +{
| +	pid_t pid;
| +	int ret;
| +
| +	if (argc != 2) {
| +		printf("usage: ckpt PID\n");
| +		exit(1);
| +	}
| +
| +	pid = atoi(argv[1]);
| +	if (pid <= 0) {
| +		printf("invalid pid\n");
| +		exit(1);
| +	}
| +
| +	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
| +
| +	if (ret < 0)
| +		perror("checkpoint");
| +	else
| +		printf("checkpoint id %d\n", ret);
| +
| +	return (ret > 0 ? 0 : 1);
| +}
| +
| diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
| new file mode 100644
| index 0000000..c741b6c
| --- /dev/null
| +++ b/Documentation/checkpoint/internals.txt
| @@ -0,0 +1,127 @@
| +
| +	===== Internals of Checkpoint-Restart =====
| +
| +
| +(1) Order of state dump
| +
| +The order of operations, both save and restore, is as follows:
| +
| +* Header section: header, container information, etc.
| +
| +* Global section: [TBD] global resources such as IPC, UTS, etc.
| +
| +* Process forest: [TBD] tasks and their relationships
| +
| +* Per task data (for each task):
| +  -> task state: elements of task_struct
| +  -> thread state: elements of thread_struct and thread_info
| +  -> CPU state: registers etc, including FPU
| +  -> memory state: memory address space layout and contents
| +  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
| +  -> files state: open file descriptors and their state
| +  -> signals state: [TBD] pending signals and signal handling state
| +  -> credentials state: [TBD] user and group state, statistics
| +
| +
| +(2) Checkpoint image format
| +
| +The checkpoint image format is composed of records consisting of a
| +pre-header that identifies its contents, followed by a payload. (The
| +idea here is to enable parallel checkpointing in the future in which
| +multiple threads interleave data from multiple processes into a single
| +stream).
| +
| +The pre-header is defined by "struct cr_hdr" as follows:
| +
| +struct cr_hdr {
| +	__s16 type;
| +	__s16 len;
| +};
| +
| +'type' identifies the type of the payload, 'len' tells its length in
| +bytes, and 'parent' identifies the owner object instance.

Nit: Remove reference to 'parent'.
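
With 'parent' gone, each record is just the two-field pre-header followed
by 'len' bytes of payload. A rough userspace sketch of that framing, only
to illustrate the description quoted above: rec_hdr, rec_put() and
rec_get() are hypothetical names, not code from the patch, and the sketch
assumes host byte order and a single in-memory buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the two-field pre-header quoted above (no 'parent' field). */
struct rec_hdr {
	int16_t type;	/* what kind of payload follows */
	int16_t len;	/* payload length in bytes */
};

/*
 * Append one record (pre-header + payload) at 'out'.
 * Returns the number of bytes written, or -1 if it does not fit in 'cap'.
 */
static int rec_put(unsigned char *out, size_t cap, int16_t type,
		   const void *payload, int16_t len)
{
	struct rec_hdr h = { .type = type, .len = len };

	assert(out && payload);
	if (len < 0 || sizeof(h) + (size_t)len > cap)
		return -1;
	memcpy(out, &h, sizeof(h));
	memcpy(out + sizeof(h), payload, (size_t)len);
	return (int)(sizeof(h) + (size_t)len);
}

/*
 * Parse one record at 'in': fill *h and copy the payload into 'buf'.
 * Returns the number of bytes consumed, or -1 if the record is truncated
 * or its payload is larger than 'buflen'.
 */
static int rec_get(const unsigned char *in, size_t avail, struct rec_hdr *h,
		   void *buf, size_t buflen)
{
	assert(in && h && buf);
	if (avail < sizeof(*h))
		return -1;
	memcpy(h, in, sizeof(*h));
	if (h->len < 0 || (size_t)h->len > buflen ||
	    sizeof(*h) + (size_t)h->len > avail)
		return -1;
	memcpy(buf, in + sizeof(*h), (size_t)h->len);
	return (int)(sizeof(*h) + (size_t)h->len);
}
```

The reader checks 'len' against the caller's buffer before copying, which
is the same idea as the "h->len > len" test in cr_read_obj().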

Suka


* Re: [RFC v14-rc2][PATCH 04/29] General infrastructure for checkpoint restart
       [not found]     ` <1238477349-11029-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:24       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:24 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Minor comment:

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:

| From 26e7a012d3ff04d64a59e629f2427dfa2b49792b Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 11:14:06 -0400
| Subject: [PATCH 04/29] General infrastructure for checkpoint restart
| 
| Add those interfaces, as well as helpers needed to easily manage the
| file format. The code is roughly broken out as follows:
| 
| checkpoint/sys.c - user/kernel data transfer, as well as setup of the
|   CR context (a per-checkpoint data structure for housekeeping)
| checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
| checkpoint/restart.c - input wrappers and basic restart handling
| 
| For now, we can only checkpoint the 'current' task ("self" checkpoint),
| and the 'pid' argument to the syscall is ignored.
| 
| Patches to add the per-architecture support as well as the actual
| work to do the memory checkpoint follow in subsequent patches.
| 
| Changelog[v14]:
|   - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
|   - Revert use of 'pr_fmt' to avoid tainting whoever includes us (Nathan Lynch)
|   - Explicitly indicate length of UTS fields in header
|   - Discard field 'h->parent'
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v12]:
|   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
|   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
|   - Befriend with sparse : explicit conversion to 'void __user *'
|   - Redefine 'pr_fmt' instead of using special cr_debug()
| 
| Changelog[v10]:
|   - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
|   - force end-of-string in cr_read_string() (fix possible DoS)
| 
| Changelog[v9]:
|   - cr_kwrite/cr_kread() use file->f_op->write() directly
|   - Drop cr_uwrite/cr_uread() since they aren't used anywhere
| 
| Changelog[v6]:
|   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
|     (although it's not really needed)
| 
| Changelog[v5]:
|   - Rename headers files s/ckpt/checkpoint/
| 
| Changelog[v2]:
|   - Added utsname->{release,version,machine} to checkpoint header
|   - Pad header structures to 64 bits to ensure compatibility
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  Makefile                       |    2 +-
|  checkpoint/Makefile            |    2 +-
|  checkpoint/checkpoint.c        |  206 +++++++++++++++++++++++++++++++
|  checkpoint/restart.c           |  260 ++++++++++++++++++++++++++++++++++++++++
|  checkpoint/sys.c               |  220 +++++++++++++++++++++++++++++++++-
|  include/linux/checkpoint.h     |   58 +++++++++
|  include/linux/checkpoint_hdr.h |   92 ++++++++++++++
|  include/linux/magic.h          |    3 +
|  8 files changed, 836 insertions(+), 7 deletions(-)
|  create mode 100644 checkpoint/checkpoint.c
|  create mode 100644 checkpoint/restart.c
|  create mode 100644 include/linux/checkpoint.h
|  create mode 100644 include/linux/checkpoint_hdr.h
| 
| diff --git a/Makefile b/Makefile
| index 2e2f4a4..126ff52 100644
| --- a/Makefile
| +++ b/Makefile
| @@ -630,7 +630,7 @@ export mod_strip_cmd
|  
|  
|  ifeq ($(KBUILD_EXTMOD),)
| -core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
| +core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
|  
|  vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
|  		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
| diff --git a/checkpoint/Makefile b/checkpoint/Makefile
| index 8a32c6f..364c326 100644
| --- a/checkpoint/Makefile
| +++ b/checkpoint/Makefile
| @@ -2,4 +2,4 @@
|  # Makefile for linux checkpoint/restart.
|  #
|  
| -obj-$(CONFIG_CHECKPOINT) += sys.o
| +obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| new file mode 100644
| index 0000000..4e4c3fc
| --- /dev/null
| +++ b/checkpoint/checkpoint.c
| @@ -0,0 +1,206 @@
| +/*
| + *  Checkpoint logic and helpers
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/version.h>
| +#include <linux/sched.h>
| +#include <linux/time.h>
| +#include <linux/fs.h>
| +#include <linux/file.h>
| +#include <linux/dcache.h>
| +#include <linux/mount.h>
| +#include <linux/utsname.h>
| +#include <linux/magic.h>
| +#include <linux/checkpoint.h>
| +#include <linux/checkpoint_hdr.h>
| +
| +/* unique checkpoint identifier (FIXME: should be per-container ?) */
| +static atomic_t cr_ctx_count = ATOMIC_INIT(0);
| +
| +/**
| + * cr_write_obj - write a record described by a cr_hdr
| + * @ctx: checkpoint context
| + * @h: record descriptor
| + * @buf: record buffer
| + */
| +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
| +{
| +	int ret;
| +
| +	ret = cr_kwrite(ctx, h, sizeof(*h));
| +	if (ret < 0)
| +		return ret;
| +	return cr_kwrite(ctx, buf, h->len);
| +}
| +
| +/**
| + * cr_write_buffer - write a buffer
| + * @ctx: checkpoint context
| + * @buf: buffer pointer
| + * @len: buffer size
| + */
| +int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
| +{
| +	struct cr_hdr h;
| +
| +	h.type = CR_HDR_BUFFER;
| +	h.len = len;
| +
| +	return cr_write_obj(ctx, &h, buf);
| +}
| +
| +/**
| + * cr_write_string - write a string
| + * @ctx: checkpoint context
| + * @str: string pointer
| + * @len: string length
| + */
| +int cr_write_string(struct cr_ctx *ctx, char *str, int len)
| +{
| +	struct cr_hdr h;
| +
| +	h.type = CR_HDR_STRING;
| +	h.len = len;
| +
| +	return cr_write_obj(ctx, &h, str);
| +}
| +
| +/* write the checkpoint header */
| +static int cr_write_head(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_head *hh;
| +	struct new_utsname *uts;
| +	struct timeval ktv;
| +	int ret;
| +
| +	h.type = CR_HDR_HEAD;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	do_gettimeofday(&ktv);
| +	uts = utsname();
| +
| +	hh->magic = CHECKPOINT_MAGIC_HEAD;
| +	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
| +	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
| +	hh->patch = (LINUX_VERSION_CODE) & 0xff;
| +
| +	hh->rev = CR_VERSION;
| +
| +	hh->flags = ctx->flags;
| +	hh->time = ktv.tv_sec;
| +
| +	hh->uts_release_len = sizeof(uts->release);
| +	hh->uts_version_len = sizeof(uts->version);
| +	hh->uts_machine_len = sizeof(uts->machine);
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (ret < 0)
| +		return ret;
| +
| +	ret = cr_write_buffer(ctx, uts->release, sizeof(uts->release));
| +	if (ret < 0)
| +		return ret;
| +	ret = cr_write_buffer(ctx, uts->version, sizeof(uts->version));
| +	if (ret < 0)
| +		return ret;
| +	ret = cr_write_buffer(ctx, uts->machine, sizeof(uts->machine));
| +
| +	return ret;
| +}
| +
| +/* write the checkpoint trailer */
| +static int cr_write_tail(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_tail *hh;
| +	int ret;
| +
| +	h.type = CR_HDR_TAIL;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	hh->magic = CHECKPOINT_MAGIC_TAIL;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +/* dump the task_struct of a given task */
| +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_task *hh;
| +	int ret;
| +
| +	h.type = CR_HDR_TASK;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	hh->state = t->state;
| +	hh->exit_state = t->exit_state;
| +	hh->exit_code = t->exit_code;
| +	hh->exit_signal = t->exit_signal;
| +
| +	hh->task_comm_len = TASK_COMM_LEN;
| +
| +	/* FIXME: save remaining relevant task_struct fields */
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (ret < 0)
| +		return ret;
| +
| +	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
| +}
| +
| +/* dump the entire state of a given task */
| +static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
| +{
| +	int ret;
| +
| +	ret = cr_write_task_struct(ctx, t);
| +	cr_debug("ret %d\n", ret);
| +
| +	return ret;
| +}
| +
| +int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
| +{
| +	int ret;
| +
| +	ret = cr_write_head(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_write_task(ctx, current);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_write_tail(ctx);
| +	if (ret < 0)
| +		goto out;
| +
| +	ctx->crid = atomic_inc_return(&cr_ctx_count);
| +
| +	/* on success, return (unique) checkpoint identifier */
| +	ret = ctx->crid;
| + out:
| +	return ret;
| +}
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| new file mode 100644
| index 0000000..d6f98d8
| --- /dev/null
| +++ b/checkpoint/restart.c
| @@ -0,0 +1,260 @@
| +/*
| + *  Restart logic and helpers
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/version.h>
| +#include <linux/sched.h>
| +#include <linux/file.h>
| +#include <linux/magic.h>
| +#include <linux/checkpoint.h>
| +#include <linux/checkpoint_hdr.h>
| +
| +/**
| + * cr_read_obj - read a whole record (cr_hdr followed by payload)
| + * @ctx: checkpoint context
| + * @h: record descriptor
| + * @buf: record buffer
| + * @len: available buffer size
| + */
| +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
| +{
| +	int ret;
| +
| +	ret = cr_kread(ctx, h, sizeof(*h));
| +	if (ret < 0)
| +		return ret;
| +
| +	cr_debug("type %d len %d\n", h->type, h->len);
| +
| +	if (h->len > len)
| +		return -EINVAL;
| +
| +	return cr_kread(ctx, buf, h->len);
| +}
| +
| +/**
| + * cr_read_obj_type - read a whole record of expected type and size
| + * @ctx: checkpoint context
| + * @buf: record buffer
| + * @len: expected record size
| + * @type: expected record type
| + */
| +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
| +{
| +	struct cr_hdr h;
| +	int ret;
| +
| +	ret = cr_read_obj(ctx, &h, buf, len);
| +	if (ret < 0)
| +		return ret;
| +
| +	if (h.len != len || h.type != type)
| +		return -EINVAL;
| +
| +	return 0;
| +}
| +
| +/**
| + * cr_read_buf_type - read a whole record of expected type (unknown size)
| + * @ctx: checkpoint context
| + * @buf: record buffer
| + * @len: available buffer size (output: actual record size)
| + * @type: expected record type
| + */
| +int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
| +{
| +	struct cr_hdr h;
| +	int ret;
| +
| +	ret = cr_read_obj(ctx, &h, buf, *len);
| +	if (ret < 0)
| +		return ret;
| +
| +	if (h.type != type)
| +		return -EINVAL;
| +
| +	*len = h.len;
| +	return 0;
| +}
| +
| +/**
| + * cr_read_buffer - read a buffer
| + * @ctx: checkpoint context
| + * @buf: buffer
| + * @len: buffer size (output actual record size)
| + */
| +int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
| +{
| +	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
| +}
| +
| +/**
| + * cr_read_string - read a string
| + * @ctx: checkpoint context
| + * @str: string buffer
| + * @len: string length
| + */
| +int cr_read_string(struct cr_ctx *ctx, char *str, int len)
| +{
| +	int ret;
| +
| +	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
| +	if (ret < 0)
| +		return ret;
| +
| +	if (len > 0)
| +		str[len - 1] = '\0';	/* always play it safe */
| +
| +	return ret;
| +}
| +
| +/* read the checkpoint header */
| +static int cr_read_head(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_head *hh;
| +	struct new_utsname *uts = NULL;
| +	int ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = -EINVAL;
| +	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
| +	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
| +	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
| +	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
| +		goto out;
| +	if (hh->flags & ~CR_CTX_CKPT)
| +		goto out;
| +	if (hh->uts_release_len != sizeof(uts->release) ||
| +	    hh->uts_version_len != sizeof(uts->version) ||
| +	    hh->uts_machine_len != sizeof(uts->machine))
| +		goto out;
| +
| +	ret = -ENOMEM;
| +	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
| +	if (!uts)
| +		goto out;
| +
| +	ctx->oflags = hh->flags;
| +
| +	/* FIX: verify compatibility of release, version and machine */
| +	ret = cr_read_obj_type(ctx, uts->release,
| +			       sizeof(uts->release), CR_HDR_BUFFER);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_obj_type(ctx, uts->version,
| +			       sizeof(uts->version), CR_HDR_BUFFER);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_obj_type(ctx, uts->machine,
| +			       sizeof(uts->machine), CR_HDR_BUFFER);
| +
| + out:
| +	kfree(uts);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +/* read the checkpoint trailer */
| +static int cr_read_tail(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_tail *hh;
| +	int ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = -EINVAL;
| +	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
| +		goto out;
| +
| +	ret = 0;
| + out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +/* read the task_struct into the current task */
| +static int cr_read_task_struct(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_task *hh;
| +	struct task_struct *t = current;
| +	char *buf;
| +	int ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = -EINVAL;
| +	if (hh->task_comm_len > TASK_COMM_LEN)
| +		goto out;
| +
| +	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
| +	if (!buf) {
| +		ret = -ENOMEM;
| +		goto out;
| +	}
| +	ret = cr_read_string(ctx, buf, hh->task_comm_len);
| +	if (!ret) {
| +		memset(t->comm, 0, TASK_COMM_LEN);
| +		memcpy(t->comm, buf, hh->task_comm_len);
| +	}
| +	kfree(buf);
| +
| +	/* FIXME: restore remaining relevant task_struct fields */
| + out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +/* read the entire state of the current task */
| +static int cr_read_task(struct cr_ctx *ctx)
| +{
| +	int ret;
| +
| +	ret = cr_read_task_struct(ctx);
| +	cr_debug("ret %d\n", ret);
| +
| +	return ret;
| +}
| +
| +int do_restart(struct cr_ctx *ctx, pid_t pid)
| +{
| +	int ret;
| +
| +	ret = cr_read_head(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_task(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_tail(ctx);
| +	if (ret < 0)
| +		goto out;
| +
| +	/* on success, adjust the return value if needed [TODO] */
| + out:
| +	return ret;
| +}
| diff --git a/checkpoint/sys.c b/checkpoint/sys.c
| index 375129c..337c160 100644
| --- a/checkpoint/sys.c
| +++ b/checkpoint/sys.c
| @@ -1,7 +1,7 @@
|  /*
|   *  Generic container checkpoint-restart
|   *
| - *  Copyright (C) 2008 Oren Laadan
| + *  Copyright (C) 2008-2009 Oren Laadan
|   *
|   *  This file is subject to the terms and conditions of the GNU General Public
|   *  License.  See the file COPYING in the main directory of the Linux
| @@ -10,6 +10,180 @@
|  
|  #include <linux/sched.h>
|  #include <linux/kernel.h>
| +#include <linux/fs.h>
| +#include <linux/file.h>
| +#include <linux/uaccess.h>
| +#include <linux/capability.h>
| +#include <linux/checkpoint.h>
| +
| +/*
| + * Helpers to write(read) from(to) kernel space to(from) the checkpoint
| + * image file descriptor (similar to how a core-dump is performed).
| + *
| + *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
| + *   cr_kread() - read from the checkpoint image to a kernel-space buffer
| + */
| +
| +static inline int _cr_kwrite(struct file *file, void *addr, int count)
| +{
| +	void __user *uaddr = (__force void __user *) addr;
| +	ssize_t nwrite;
| +	int nleft;
| +
| +	for (nleft = count; nleft; nleft -= nwrite) {
| +		loff_t pos = file_pos_read(file);
| +		nwrite = vfs_write(file, uaddr, nleft, &pos);
| +		file_pos_write(file, pos);
| +		if (nwrite < 0) {
| +			if (nwrite == -EAGAIN)
| +				nwrite = 0;
| +			else
| +				return nwrite;
| +		}
| +		uaddr += nwrite;
| +	}
| +	return 0;
| +}
| +
| +int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
| +{
| +	mm_segment_t fs;
| +	int ret;
| +
| +	fs = get_fs();
| +	set_fs(KERNEL_DS);
| +	ret = _cr_kwrite(ctx->file, addr, count);
| +	set_fs(fs);
| +
| +	ctx->total += count;
| +	return ret;
| +}
| +
| +static inline int _cr_kread(struct file *file, void *addr, int count)
| +{
| +	void __user *uaddr = (__force void __user *) addr;
| +	ssize_t nread;
| +	int nleft;
| +
| +	for (nleft = count; nleft; nleft -= nread) {
| +		loff_t pos = file_pos_read(file);
| +		nread = vfs_read(file, uaddr, nleft, &pos);
| +		file_pos_write(file, pos);
| +		if (nread <= 0) {
| +			if (nread == -EAGAIN) {
| +				nread = 0;
| +				continue;
| +			} else if (nread == 0)
| +				nread = -EPIPE;		/* unexpected EOF */
| +			return nread;
| +		}
| +		uaddr += nread;
| +	}
| +	return 0;
| +}
| +
| +int cr_kread(struct cr_ctx *ctx, void *addr, int count)
| +{
| +	mm_segment_t fs;
| +	int ret;
| +
| +	fs = get_fs();
| +	set_fs(KERNEL_DS);
| +	ret = _cr_kread(ctx->file , addr, count);
| +	set_fs(fs);
| +
| +	ctx->total += count;
| +	return ret;
| +}
| +
| +/*
| + * During checkpoint and restart the code writes out/reads in data
| + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
| + * Because operations can be nested, use cr_hbuf_get() to reserve space
| + * in the buffer, then cr_hbuf_put() when you no longer need that space.
| + */

Maybe mention that we expect only one thread to use ctx->hbuf at a time,
so no locking is needed?
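
To make the assumption concrete, here is how I read the get/put
discipline, as a small userspace model. This is not the patch's
cr_hbuf_get()/cr_hbuf_put() (their bodies are not quoted here):
hbuf_ctx, hbuf_get(), hbuf_put() and HBUF_TOTAL are made-up names, and
the stack-like behaviour is my assumption based on the comment above.

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 4096		/* hypothetical buffer size */

struct hbuf_ctx {
	unsigned char hbuf[HBUF_TOTAL];
	size_t hpos;		/* next free offset; deliberately unlocked */
};

/*
 * Reserve 'n' bytes at the current top of the buffer.
 * Returns NULL when the buffer is exhausted.
 */
static void *hbuf_get(struct hbuf_ctx *ctx, size_t n)
{
	void *p;

	if (n > HBUF_TOTAL - ctx->hpos)
		return NULL;
	p = &ctx->hbuf[ctx->hpos];
	ctx->hpos += n;
	return p;
}

/*
 * Release the most recent 'n' bytes. Calls must mirror hbuf_get() in
 * reverse order, which is what makes nesting work without a free list.
 */
static void hbuf_put(struct hbuf_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}
```

With two threads sharing one context, the unlocked update of 'hpos' in
this model would race, which is why the single-user expectation seems
worth stating in the comment.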

Sukadev


* Re: [RFC v14-rc2][PATCH 05/29] x86 support for checkpoint/restart
       [not found]     ` <1238477349-11029-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:25       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:25 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Couple of nits:

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From f5dec38baa6a2cc2a88783db3a9afd676821d293 Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 11:14:32 -0400
| Subject: [PATCH 05/29] x86 support for checkpoint/restart
| 
| Add logic to save and restore architecture specific state, including
| thread-specific state, CPU registers and FPU state.
| 
| In addition, architecture capabilities are saved in an architecture
| specific extension of the header (cr_hdr_head_arch); Currently this
| includes only FPU capabilities.
| 
| Currently only x86-32 is supported. Compiling on x86-64 will trigger
| an explicit error.
| 
| Changelog[v14]:
|   - Remove preempt_disable/enable() around init_fpu() and fix leak
|   - Revert change to pr_debug(), back to cr_debug()
|   - Use only unsigned fields in checkpoint headers
|   - Discard field 'h->parent'
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v12]:
|   - A couple of missed calls to cr_hbuf_put()
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Changelog[v9]:
|   - Add arch-specific header that details architecture capabilities;
|     split FPU restore to send capabilities only once.
|   - Test for zero TLS entries in cr_write_thread()
|   - Fix asm/checkpoint_hdr.h so it can be included from user-space
| 
| Changelog[v7]:
|   - Fix save/restore state of FPU
| 
| Changelog[v5]:
|   - Remove preempt_disable() when restoring debug registers
| 
| Changelog[v4]:
|   - Fix header structure alignment
| 
| Changelog[v2]:
|   - Pad header structures to 64 bits to ensure compatibility
|   - Follow Dave Hansen's refactoring of the original post
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  arch/x86/include/asm/checkpoint_hdr.h |   98 +++++++++++++
|  arch/x86/mm/Makefile                  |    2 +
|  arch/x86/mm/checkpoint.c              |  242 +++++++++++++++++++++++++++++++++
|  arch/x86/mm/restart.c                 |  223 ++++++++++++++++++++++++++++++
|  checkpoint/checkpoint.c               |   19 ++-
|  checkpoint/checkpoint_arch.h          |    9 ++
|  checkpoint/restart.c                  |   17 ++-
|  include/linux/checkpoint_hdr.h        |    2 +
|  8 files changed, 608 insertions(+), 4 deletions(-)
|  create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
|  create mode 100644 arch/x86/mm/checkpoint.c
|  create mode 100644 arch/x86/mm/restart.c
|  create mode 100644 checkpoint/checkpoint_arch.h
| 
| diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
| new file mode 100644
| index 0000000..ffdb5f5
| --- /dev/null
| +++ b/arch/x86/include/asm/checkpoint_hdr.h
| @@ -0,0 +1,98 @@
| +#ifndef __ASM_X86_CKPT_HDR_H
| +#define __ASM_X86_CKPT_HDR_H
| +/*
| + *  Checkpoint/restart - architecture specific headers x86
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/types.h>
| +
| +/*
| + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
| + * keep data 64-bit aligned: use padding for structure members, and use
| + * __attribute__ ((aligned (8))) for the entire structure.
| + *
| + * Quoting Arnd Bergmann:
| + *   "This structure has an odd multiple of 32-bit members, which means
| + *   that if you put it into a larger structure that also contains 64-bit
| + *   members, the larger structure may get different alignment on x86-32
| + *   and x86-64, which you might want to avoid. I can't tell if this is
| + *   an actual problem here. ... In this case, I'm pretty sure that
| + *   sizeof(cr_hdr_task) on x86-32 is different from x86-64, since it
| + *   will be 32-bit aligned on x86-32."
| + */
| +
| +/* i387 structure seen from kernel/userspace */
| +#ifdef __KERNEL__
| +#include <asm/processor.h>
| +#else
| +#include <sys/user.h>
| +#endif
| +
| +struct cr_hdr_head_arch {
| +	/* FIXME: add HAVE_HWFP */
| +
| +	__u16 has_fxsr;
| +	__u16 has_xsave;
| +	__u16 xstate_size;
| +	__u16 _pading;
| +} __attribute__((aligned(8)));
| +
| +struct cr_hdr_thread {
| +	/* FIXME: restart blocks */
| +
| +	__u16 gdt_entry_tls_entries;
| +	__u16 sizeof_tls_array;
| +	__u16 ntls;	/* number of TLS entries to follow */
| +} __attribute__((aligned(8)));
| +
| +struct cr_hdr_cpu {
| +	/* see struct pt_regs (x86-64) */
| +	__u64 r15;
| +	__u64 r14;
| +	__u64 r13;
| +	__u64 r12;
| +	__u64 bp;
| +	__u64 bx;
| +	__u64 r11;
| +	__u64 r10;
| +	__u64 r9;
| +	__u64 r8;
| +	__u64 ax;
| +	__u64 cx;
| +	__u64 dx;
| +	__u64 si;
| +	__u64 di;
| +	__u64 orig_ax;
| +	__u64 ip;
| +	__u64 cs;
| +	__u64 flags;
| +	__u64 sp;
| +	__u64 ss;
| +
| +	/* segment registers */
| +	__u64 ds;
| +	__u64 es;
| +	__u64 fs;
| +	__u64 gs;
| +
| +	/* debug registers */
| +	__u64 debugreg0;
| +	__u64 debugreg1;
| +	__u64 debugreg2;
| +	__u64 debugreg3;
| +	__u64 debugreg6;
| +	__u64 debugreg7;
| +
| +	__u32 uses_debug;
| +	__u32 used_math;
| +
| +	/* thread_xstate contents follow (if used_math) */
| +} __attribute__((aligned(8)));
| +
| +#endif /* __ASM_X86_CKPT_HDR_H */
| diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
| index d8cc96a..e1cb5f8 100644
| --- a/arch/x86/mm/Makefile
| +++ b/arch/x86/mm/Makefile
| @@ -17,3 +17,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
|  obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
|  
|  obj-$(CONFIG_MEMTEST)		+= memtest.o
| +
| +obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o restart.o
| diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
| new file mode 100644
| index 0000000..946fac1
| --- /dev/null
| +++ b/arch/x86/mm/checkpoint.c
| @@ -0,0 +1,242 @@
| +/*
| + *  Checkpoint/restart - architecture specific support for x86
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <asm/desc.h>
| +#include <asm/i387.h>
| +
| +#include <linux/checkpoint.h>
| +#include <linux/checkpoint_hdr.h>
| +
| +/* dump the thread_struct of a given task */
| +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_thread *hh;
| +	struct thread_struct *thread;
| +	struct desc_struct *desc;
| +	int ntls = 0;
| +	int n, ret;
| +
| +	h.type = CR_HDR_THREAD;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	thread = &t->thread;
| +
| +	/* calculate no. of TLS entries that follow */
| +	desc = thread->tls_array;
| +	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
| +		if (desc->a || desc->b)
| +			ntls++;
| +	}
| +
| +	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
| +	hh->sizeof_tls_array = sizeof(thread->tls_array);
| +	hh->ntls = ntls;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (ret < 0)
| +		return ret;
| +
| +	cr_debug("ntls %d\n", ntls);
| +	if (ntls == 0)
| +		return 0;
| +
| +	/*
| +	 * For simplicity dump the entire array, cherry-pick upon restart
| +	 * FIXME: the TLS descriptors in the GDT should be called out and
| +	 * not tied to the in-kernel representation.
| +	 */
| +	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
| +
| +	/* IGNORE RESTART BLOCKS FOR NOW ... */
| +
| +	return ret;
| +}
| +
| +#ifdef CONFIG_X86_64
| +
| +#error "CONFIG_X86_64 unsupported yet."
| +
| +#else	/* !CONFIG_X86_64 */
| +
| +static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
| +{
| +	struct thread_struct *thread = &t->thread;
| +	struct pt_regs *regs = task_pt_regs(t);
| +
| +	hh->bp = regs->bp;
| +	hh->bx = regs->bx;
| +	hh->ax = regs->ax;
| +	hh->cx = regs->cx;
| +	hh->dx = regs->dx;
| +	hh->si = regs->si;
| +	hh->di = regs->di;
| +	hh->orig_ax = regs->orig_ax;
| +	hh->ip = regs->ip;
| +	hh->cs = regs->cs;
| +	hh->flags = regs->flags;
| +	hh->sp = regs->sp;
| +	hh->ss = regs->ss;
| +
| +	hh->ds = regs->ds;
| +	hh->es = regs->es;
| +
| +	/*
| +	 * for checkpoint in process context (from within a container)
| +	 * the GS and FS registers should be saved from the hardware;
| +	 * otherwise they are already saved on the thread structure
| +	 */
| +	if (t == current) {
| +		savesegment(gs, hh->gs);
| +		savesegment(fs, hh->fs);
| +	} else {
| +		hh->gs = thread->gs;
| +		hh->fs = thread->fs;
| +	}
| +
| +	/*
| +	 * for checkpoint in process context (from within a container),
| +	 * the actual syscall is taking place at this very moment; so
| +	 * we (optimistically) substitute the future return value (0) of
| +	 * this syscall into the orig_eax, so that upon restart it will
| +	 * succeed (or it will endlessly retry checkpoint...)
| +	 */
| +	if (t == current) {
| +		BUG_ON(hh->orig_ax < 0);
| +		hh->ax = 0;
| +	}
| +}
| +
| +static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
| +{
| +	struct thread_struct *thread = &t->thread;
| +
| +	/* debug regs */
| +
| +	/*
| +	 * for checkpoint in process context (from within a container),
| +	 * get the actual registers; otherwise get the saved values.
| +	 */
| +
| +	if (t == current) {
| +		get_debugreg(hh->debugreg0, 0);
| +		get_debugreg(hh->debugreg1, 1);
| +		get_debugreg(hh->debugreg2, 2);
| +		get_debugreg(hh->debugreg3, 3);
| +		get_debugreg(hh->debugreg6, 6);
| +		get_debugreg(hh->debugreg7, 7);
| +	} else {
| +		hh->debugreg0 = thread->debugreg0;
| +		hh->debugreg1 = thread->debugreg1;
| +		hh->debugreg2 = thread->debugreg2;
| +		hh->debugreg3 = thread->debugreg3;
| +		hh->debugreg6 = thread->debugreg6;
| +		hh->debugreg7 = thread->debugreg7;
| +	}
| +
| +	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
| +}
| +
| +static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
| +{
| +	hh->used_math = tsk_used_math(t) ? 1 : 0;
| +}
| +
| +static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
| +{
| +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
| +	int ret;
| +
| +	/* i387 + MMU + SSE logic */
| +	preempt_disable();	/* needed if (t == current) */
| +
| +	/*
| +	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
| +	 * have been cleared when task was context-switched out...

Nit: s/have been/was/

| +	 * except if we are in process context, in which case we do
| +	 */
| +	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
| +		unlazy_fpu(current);
| +
| +	/*
| +	 * For simplicity dump the entire structure.
| +	 * FIXME: need to be deliberate about what registers we are
| +	 * dumping for traceability and compatibility.
| +	 */
| +	memcpy(xstate_buf, t->thread.xstate, xstate_size);
| +	preempt_enable();	/* needed it (t == current) */

Nit: s/it/if/
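
Separately, and not a request for a change: the alignment comment in
asm/checkpoint_hdr.h quoted above can be checked with a small userspace
sketch. It reuses the three member names of cr_hdr_thread, but
hdr_unaligned, hdr_aligned and hdr_tail_padding() are hypothetical, and
it assumes GCC or Clang (for the attribute) on a common ABI where a
16-bit integer has 2-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three 16-bit members, natural alignment only. */
struct hdr_unaligned {
	uint16_t gdt_entry_tls_entries;
	uint16_t sizeof_tls_array;
	uint16_t ntls;
};

/* Same members, with the whole structure forced to 8-byte alignment. */
struct hdr_aligned {
	uint16_t gdt_entry_tls_entries;
	uint16_t sizeof_tls_array;
	uint16_t ntls;
} __attribute__((aligned(8)));

/* Bytes of tail padding that the aligned(8) attribute adds. */
static size_t hdr_tail_padding(void)
{
	assert(sizeof(struct hdr_aligned) >= sizeof(struct hdr_unaligned));
	return sizeof(struct hdr_aligned) - sizeof(struct hdr_unaligned);
}
```

Under those assumptions the unaligned variant is 6 bytes and the aligned
one is rounded up to 8, so the record length written with sizeof(*hh)
does not depend on what the structure is embedded in.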


* Re: [RFC v14-rc2][PATCH 06/29] Dump memory address space
       [not found]     ` <1238477349-11029-7-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:26       ` Sukadev Bhattiprolu
       [not found]         ` <20090407032636.GD12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:26 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen


One comment below.

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From eed3f074ed035c93eb49d05cc1491ee680956906 Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 13:57:11 -0400
| Subject: [PATCH 06/29] Dump memory address space
| 
| For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
| it will be followed by the file name. Then comes the actual contents,
| in one or more chunk: each chunk begins with a header that specifies
| how many pages it holds, then the virtual addresses of all the dumped
| pages in that chunk, followed by the actual contents of all dumped
| pages. A header with zero number of pages marks the end of the contents.
| Then comes the next VMA and so on.
| 
| Changelog[v14]:
|   - Revert change to pr_debug(), back to cr_debug()
|   - Save new field 'vdso' in mm_context
|   - Discard field 'h->parent'
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v13]:
|   - pgprot_t is an abstract type; use the proper accessor (fix for
|     64-bit powerpc (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>)
| 
| Changelog[v12]:
|   - Hide pgarr management inside cr_private_vma_fill_pgarr()
|   - Fix management of pgarr chain reset and alloc/expand: keep empty
|     pgarr in a pool chain
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Changelog[v11]:
|   - Copy contents of 'init->fs->root' instead of pointing to them.
|   - Add missing test for VM_MAYSHARE when dumping memory
| 
| Changelog[v10]:
|   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
| 
| Changelog[v9]:
|   - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
|   - Test if __d_path() changes mnt/dentry (when crossing filesystem
|     namespace boundary). for now cr_fill_fname() fails the checkpoint.
| 
| Changelog[v7]:
|   - Fix argument given to kunmap_atomic() in memory dump/restore
| 
| Changelog[v6]:
|   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
|     (even though it's not really needed)
| 
| Changelog[v5]:
|   - Improve memory dump code (following Dave Hansen's comments)
|   - Change dump format (and code) to allow chunks of <vaddrs, pages>
|     instead of one long list of each
|   - Fix use of follow_page() to avoid faulting in non-present pages
| 
| Changelog[v4]:
|   - Use standard list_... for cr_pgarr
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  arch/x86/include/asm/checkpoint_hdr.h |    6 +
|  arch/x86/mm/checkpoint.c              |   31 ++
|  checkpoint/Makefile                   |    3 +-
|  checkpoint/checkpoint.c               |   87 +++++
|  checkpoint/checkpoint_arch.h          |    1 +
|  checkpoint/checkpoint_mem.h           |   41 +++
|  checkpoint/ckpt_mem.c                 |  558 +++++++++++++++++++++++++++++++++
|  checkpoint/sys.c                      |   11 +
|  include/linux/checkpoint.h            |   13 +
|  include/linux/checkpoint_hdr.h        |   32 ++
|  10 files changed, 782 insertions(+), 1 deletions(-)
|  create mode 100644 checkpoint/checkpoint_mem.h
|  create mode 100644 checkpoint/ckpt_mem.c
| 
| diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
| index ffdb5f5..54d3a41 100644
| --- a/arch/x86/include/asm/checkpoint_hdr.h
| +++ b/arch/x86/include/asm/checkpoint_hdr.h
| @@ -95,4 +95,10 @@ struct cr_hdr_cpu {
|  	/* thread_xstate contents follow (if used_math) */
|  } __attribute__((aligned(8)));
|  
| +struct cr_hdr_mm_context {
| +	__u64 vdso;
| +	__u32 ldt_entry_size;
| +	__u32 nldt;
| +} __attribute__((aligned(8)));
| +
|  #endif /* __ASM_X86_CKPT_HDR__H */
| diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
| index 946fac1..92926e1 100644
| --- a/arch/x86/mm/checkpoint.c
| +++ b/arch/x86/mm/checkpoint.c
| @@ -240,3 +240,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
|  
|  	return ret;
|  }
| +
| +/* dump the mm->context state */
| +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	int ret;
| +
| +	h.type = CR_HDR_MM_CONTEXT;
| +	h.len = sizeof(*hh);
| +
| +	mutex_lock(&mm->context.lock);
| +
| +	hh->vdso = (unsigned long) mm->context.vdso;
| +	hh->ldt_entry_size = LDT_ENTRY_SIZE;
| +	hh->nldt = mm->context.size;
| +
| +	cr_debug("nldt %d vdso %#llx\n", hh->nldt, hh->vdso);
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = cr_kwrite(ctx, mm->context.ldt,
| +			mm->context.size * LDT_ENTRY_SIZE);
| +
| + out:
| +	mutex_unlock(&mm->context.lock);
| +	return ret;
| +}
| diff --git a/checkpoint/Makefile b/checkpoint/Makefile
| index 364c326..6924ef4 100644
| --- a/checkpoint/Makefile
| +++ b/checkpoint/Makefile
| @@ -2,4 +2,5 @@
|  # Makefile for linux checkpoint/restart.
|  #
|  
| -obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o
| +obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
| +	 ckpt_mem.o
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 422ceff..422e1a3 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -13,6 +13,7 @@
|  #include <linux/time.h>
|  #include <linux/fs.h>
|  #include <linux/file.h>
| +#include <linux/fdtable.h>
|  #include <linux/dcache.h>
|  #include <linux/mount.h>
|  #include <linux/utsname.h>
| @@ -73,6 +74,65 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
|  	return cr_write_obj(ctx, &h, str);
|  }
|  
| +/**
| + * cr_fill_fname - return pathname of a given file
| + * @path: path name
| + * @root: relative root
| + * @buf: buffer for pathname
| + * @n: buffer length (in) and pathname length (out)
| + */
| +static char *
| +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
| +{
| +	struct path tmp = *root;
| +	char *fname;
| +
| +	BUG_ON(!buf);
| +	spin_lock(&dcache_lock);
| +	fname = __d_path(path, &tmp, buf, *n);
| +	spin_unlock(&dcache_lock);
| +	if (!IS_ERR(fname))
| +		*n = (buf + (*n) - fname);
| +	/*
| +	 * FIXME: if __d_path() changed these, it must have stepped out of
| +	 * init's namespace. Since currently we require a unified namespace
| +	 * within the container: simply fail.
| +	 */
| +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
| +		fname = ERR_PTR(-EBADF);
| 

Shouldn't this check be under if (!IS_ERR(fname))? 'tmp' may be uninitialized
if __d_path() fails with ENAMETOOLONG. Even otherwise, it may be better
to report the error from __d_path() first?
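
To make the suggested ordering concrete, here is a rough userspace sketch,
not a replacement hunk. mock_d_path(), the two-int 'struct path' and
fill_fname_sketch() are hypothetical stand-ins invented for illustration;
only the order of the two checks is the point. ERR_PTR()/IS_ERR()/PTR_ERR()
are re-created locally so that it builds outside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, simplified 'struct path': two ints instead of mnt/dentry. */
struct path {
	int mnt;
	int dentry;
};

/*
 * Hypothetical stand-in for __d_path(): places the name at the end of buf
 * and fails with -ENAMETOOLONG when it does not fit. When 'escape' is set
 * it rewrites *root, mimicking a walk that left the namespace.
 */
static char *mock_d_path(const char *name, struct path *root,
			 char *buf, int buflen, int escape)
{
	int len = (int)strlen(name) + 1;

	if (len > buflen)
		return ERR_PTR(-ENAMETOOLONG);
	if (escape)
		root->mnt = -1;
	memcpy(buf + buflen - len, name, len);
	return buf + buflen - len;
}

char *fill_fname_sketch(const char *name, struct path *root,
			char *buf, int *n, int escape)
{
	struct path tmp = *root;
	char *fname;

	assert(buf != NULL);	/* BUG_ON(!buf) in the patch */
	fname = mock_d_path(name, &tmp, buf, *n, escape);
	if (IS_ERR(fname))
		return fname;	/* report the lookup error first */

	/* look at 'tmp' only once the lookup is known to have succeeded */
	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
		return ERR_PTR(-EBADF);

	*n = (int)(buf + *n - fname);
	return fname;
}
```

With this ordering a too-small buffer is reported as -ENAMETOOLONG, and
-EBADF is returned only when the lookup succeeded but the root changed.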


* Re: [RFC v14-rc2][PATCH 09/29] Dump open file descriptors
       [not found]     ` <1238477349-11029-10-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:28       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:28 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen


A few minor comments.

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From 6f0a1dc1db8fdac766b00f90e04e06a5827af459 Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 13:59:34 -0400
| Subject: [PATCH 09/29] Dump open file descriptors
| 
| Dump the files_struct of a task with 'struct cr_hdr_files', followed by
| all open file descriptors. Because the 'struct file' corresponding to an
| FD can be shared, each they are assigned an objref and registered in the
| object hash. A reference to the 'file *' is kept for as long as it lives
| in the hash (the hash is only cleaned up at the end of the checkpoint).
| 
| For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
| close-on-exec property, and the objref of the corresponding 'file *'.
| If the FD is to be saved (first time) then this is followed by a
| 'struct cr_hdr_fd_data' with the FD state. Then will come the next FD
| and so on.
| 
| Recall that it is assumed that all tasks possibly sharing the file table
| are frozen. If this assumption breaks, then the behavior is *undefined*:
| checkpoint may fail, or restart from the resulting image file will fail.
| 
| This patch only handles basic FDs - regular files, directories.
| 
| Changelog[v14]:
|   - Revert change to pr_debug(), back to cr_debug()
|   - Use only unsigned fields in checkpoint headers
|   - Rename:  cr_write_files() => cr_write_fd_table()
|   - Rename:  cr_write_fd_data() => cr_write_file()
|   - Discard field 'h->parent'
|   - Check whether calls to cr_hbuf_get() fail
|   - Use one CR_FD_GENERIC for both regular files and dirs
|   - Put code for generic file descriptors in a separate function
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Changelog[v11]:
|   - Discard handling of opened symlinks (there is no such thing)
|   - cr_scan_fds() retries from scratch if hits size limits
| 
| Changelog[v9]:
|   - Fix a couple of leaks in cr_write_files()
|   - Drop useless kfree from cr_scan_fds()
| 
| Changelog[v8]:
|   - initialize 'coe' to workaround gcc false warning
| 
| Changelog[v6]:
|   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
|     (even though it's not really needed)
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  arch/x86/include/asm/checkpoint_hdr.h |    2 +-
|  checkpoint/Makefile                   |    2 +-
|  checkpoint/checkpoint.c               |    4 +
|  checkpoint/checkpoint_file.h          |   17 +++
|  checkpoint/ckpt_file.c                |  247 +++++++++++++++++++++++++++++++++
|  include/linux/checkpoint.h            |    3 +-
|  include/linux/checkpoint_hdr.h        |   30 ++++-
|  7 files changed, 301 insertions(+), 4 deletions(-)
|  create mode 100644 checkpoint/checkpoint_file.h
|  create mode 100644 checkpoint/ckpt_file.c
| 
| diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
| index e9eb40c..1efdf24 100644
| --- a/arch/x86/include/asm/checkpoint_hdr.h
| +++ b/arch/x86/include/asm/checkpoint_hdr.h
| @@ -15,7 +15,7 @@
|  /*
|   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
|   * keep data 64-bit aligned: use padding for structure members, and use
| - * __attribute__ ((aligned (8))) for the entire structure.
| + * __attribute__((aligned (8))) for the entire structure.
|   *
|   * Quoting Arnd Bergmann:
|   *   "This structure has an odd multiple of 32-bit members, which means
| diff --git a/checkpoint/Makefile b/checkpoint/Makefile
| index 8368a03..1d92ed2 100644
| --- a/checkpoint/Makefile
| +++ b/checkpoint/Makefile
| @@ -3,4 +3,4 @@
|  #
|  
|  obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o objhash.o \
| -		ckpt_mem.o rstr_mem.o
| +		ckpt_mem.o rstr_mem.o ckpt_file.o
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 422e1a3..d4e0007 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -250,6 +250,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
|  	cr_debug("memory: ret %d\n", ret);
|  	if (ret < 0)
|  		goto out;
| +	ret = cr_write_fd_table(ctx, t);
| +	cr_debug("files: ret %d\n", ret);
| +	if (ret < 0)
| +		goto out;
|  	ret = cr_write_thread(ctx, t);
|  	cr_debug("thread: ret %d\n", ret);
|  	if (ret < 0)
| diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
| new file mode 100644
| index 0000000..9dc3eba
| --- /dev/null
| +++ b/checkpoint/checkpoint_file.h
| @@ -0,0 +1,17 @@
| +#ifndef _CHECKPOINT_CKPT_FILE_H_
| +#define _CHECKPOINT_CKPT_FILE_H_
| +/*
| + *  Checkpoint file descriptors
| + *
| + *  Copyright (C) 2008 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/fdtable.h>
| +
| +int cr_scan_fds(struct files_struct *files, int **fdtable);
| +
| +#endif /* _CHECKPOINT_CKPT_FILE_H_ */
| diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
| new file mode 100644
| index 0000000..9c344c7
| --- /dev/null
| +++ b/checkpoint/ckpt_file.c
| @@ -0,0 +1,247 @@
| +/*
| + *  Checkpoint file descriptors
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/kernel.h>
| +#include <linux/sched.h>
| +#include <linux/file.h>
| +#include <linux/fdtable.h>
| +#include <linux/checkpoint.h>
| +#include <linux/checkpoint_hdr.h>
| +
| +#include "checkpoint_file.h"
| +
| +#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
| +
| +/**
| + * cr_scan_fds - scan file table and construct array of open fds
| + * @files: files_struct pointer
| + * @fdtable: (output) array of open fds
| + *
| + * Returns the number of open fds found, and also the file table
| + * array via *fdtable. The caller should free the array.
| + *
| + * The caller must validate the file descriptors collected in the
| + * array before using them, e.g. by using fcheck_files(), in case
| + * the task's fdtable changes in the meantime.
| + */
| +int cr_scan_fds(struct files_struct *files, int **fdtable)
| +{
| +	struct fdtable *fdt;
| +	int *fds = NULL;
| +	int i, n;
| +	int tot = CR_DEFAULT_FDTABLE;
| +
| +	/*
| +	 * We assume that all tasks possibly sharing the file table are
| +	 * frozen (or we our a single process and we checkpoint ourselves).

Nit: s/our/are/

| +	 * Therefore, we can safely proceed after krealloc() from where we
| +	 * left off. Otherwise the file table may be modified by another
| +	 * task after we scan it. The behavior is this case is undefined,
| +	 * and either and either checkpoint or restart will likely fail.

Nit: s/and either//

| +	 */
| + retry:
| +	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
| +	if (!fds)
| +		return -ENOMEM;
| +
| +	spin_lock(&files->file_lock);
| +	rcu_read_lock();
| +	fdt = files_fdtable(files);
| +	for (n = 0, i = 0; i < fdt->max_fds; i++) {

Hmm, if we want to resume where we left off before the krealloc(), shouldn't
we initialize 'i' and 'n' to 0 before the 'retry:' label? As written, the for
loop resets both on every retry. Or maybe I misunderstand what you mean by the
"where we left off" comment above.
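
For illustration only, this is roughly what I would expect a "resume" variant
to look like. It is a userspace sketch under loud assumptions: 'open_map' is a
hypothetical byte array standing in for the fdtable, realloc() stands in for
krealloc(), and the locking is reduced to comments. scan_fds_sketch() is a
made-up name, not something from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Collect the indices of all non-zero entries of open_map[0..max_fds) into
 * a growing array. Returns the count (or -ENOMEM) and hands the array back
 * via *fdtable; the caller frees it.
 */
int scan_fds_sketch(const unsigned char *open_map, int max_fds, int **fdtable)
{
	int *fds = NULL, *tmp;
	int tot = 4;		/* deliberately small initial guess */
	int i = 0, n = 0;	/* set once, before 'retry', so they survive it */

	assert(open_map != NULL && fdtable != NULL);

 retry:
	/* realloc() keeps fds[0..n) intact, so earlier results are not lost */
	tmp = realloc(fds, tot * sizeof(*fds));
	if (!tmp) {
		free(fds);	/* the old block is still allocated on failure */
		return -ENOMEM;
	}
	fds = tmp;

	/* the kernel code would retake file_lock and rcu_read_lock() here */
	for (; i < max_fds; i++) {
		if (!open_map[i])
			continue;
		if (n == tot) {
			/* ... and drop both before growing the array */
			tot *= 2;
			goto retry;	/* 'i' is not advanced: entry i is rechecked */
		}
		fds[n++] = i;
	}

	*fdtable = fds;
	return n;
}
```

Resuming like this is only safe under the stated assumption that nobody else
can modify the table while the lock is dropped. If the intent is to rescan
from scratch on every retry, then the code is fine and only the comment needs
rewording.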



| +		if (!fcheck_files(files, i))
| +			continue;
| +		if (n == tot) {
| +			spin_unlock(&files->file_lock);
| +			rcu_read_unlock();
| +			tot *= 2;	/* won't overflow: kmalloc will fail */
| +			goto retry;
| +		}
| +		fds[n++] = i;
| +	}
| +	rcu_read_unlock();
| +	spin_unlock(&files->file_lock);
| +
| +	*fdtable = fds;
| +	return n;
| +}
| +
| +static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
| +				 struct cr_hdr_file *hh)
| +{
| +	struct cr_hdr h;
| +	int ret;
| +
| +	/*
| +	 * FIX: check if the file/dir/link is unlinked
| +	 *
| +	 * Or, pass up somthing like in hh->flags to tell
| +	 * the higher-level code that it needs to bring
| +	 * along the file contents too.
| +	 */
| +
| +	h.type = CR_HDR_FILE;
| +	h.len = sizeof(*hh);
| +
| +	hh->fd_type = CR_FD_GENERIC;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	if (ret < 0)
| +		return ret;
| +
| +	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
| +}
| +
| +/* cr_write_file - dump the state of a given file pointer */
| +static int cr_write_file(struct cr_ctx *ctx, struct file *file)
| +{
| +	struct cr_hdr_file *hh;
| +	struct dentry *dent = file->f_dentry;
| +	struct inode *inode = dent->d_inode;
| +	int ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	hh->f_flags = file->f_flags;
| +	hh->f_mode = file->f_mode;
| +	hh->f_pos = file->f_pos;
| +	hh->f_version = file->f_version;
| +	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
| +
| +	/*
| +	 * FIXME: when we'll add support for unlinked files/dirs, we'll
| +	 * need to distinguish between unlinked filed and unlinked dirs.
| +	 */
| +	switch (inode->i_mode & S_IFMT) {
| +	case S_IFREG:
| +	case S_IFDIR:
| +		ret = cr_write_file_generic(ctx, file, hh);
| +		break;
| +	default:
| +		ret = -EBADF;
| +		break;
| +	}
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +
| +	return ret;
| +}
| +
| +/**
| + * cr_write_fd_ent - dump the state of a given file descriptor
| + * @ctx: checkpoint context
| + * @files: files_struct pointer
| + * @fd: file descriptor
| + *
| + * Saves the state of the file descriptor; looks up the actual file
| + * pointer in the hash table, and if found saves the matching objref,
| + * otherwise calls cr_write_file to dump the file pointer too.
| + */
| +static int
| +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_fd_ent *hh;
| +	struct file *file;
| +	struct fdtable *fdt;
| +	int objref, new, ret;
| +	int coe = 0;	/* avoid gcc warning */
| +
| +	rcu_read_lock();
| +	fdt = files_fdtable(files);
| +	file = fcheck_files(files, fd);
| +	if (file) {
| +		coe = FD_ISSET(fd, fdt->close_on_exec);
| +		get_file(file);
| +	}
| +	rcu_read_unlock();
| +
| +	/* sanity check (although this shouldn't happen) */
| +	if (!file)
| +		return -EBADF;
| +
| +	/* adding 'file' to the hash will keep a reference to it */
| +	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
| +	cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
| +
| +	if (new < 0)
| +		return new;
| +
| +	h.type = CR_HDR_FD_ENT;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh) {
| +		fput(file);
| +		return -ENOMEM;
| +	}
| +
| +	hh->objref = objref;
| +	hh->fd = fd;
| +	hh->close_on_exec = coe;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	if (ret < 0)
| +		goto out;
| +
| +	/* new==1 if-and-only-if file was newly added to hash */
| +	if (new)
| +		ret = cr_write_file(ctx, file);
| +
| +out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (file)
| +		fput(file);
| +	return ret;
| +}
| +
| +int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_fd_table *hh;
| +	struct files_struct *files;
| +	int *fdtable = NULL;
| +	int nfds, n, ret;
| +
| +	h.type = CR_HDR_FD_TABLE;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	files = get_files_struct(t);
| +
| +	nfds = cr_scan_fds(files, &fdtable);
| +	if (nfds < 0) {
| +		ret = nfds;
| +		goto out;
| +	}
| +
| +	hh->objref = 0;	/* will be meaningful with multiple processes */
| +	hh->nfds = nfds;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (ret < 0)
| +		goto out;
| +
| +	cr_debug("nfds %d\n", nfds);
| +	for (n = 0; n < nfds; n++) {
| +		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
| +		if (ret < 0)
| +			break;
| +	}
| +
| + out:
| +	kfree(fdtable);
| +	put_files_struct(files);
| +	return ret;
| +}
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index 88854a9..9489ea5 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -13,7 +13,7 @@
|  #include <linux/path.h>
|  #include <linux/fs.h>
|  
| -#define CR_VERSION  1
| +#define CR_VERSION  2
|  
|  struct cr_ctx {
|  	int crid;		/* unique checkpoint id */
| @@ -85,6 +85,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
|  
|  extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
|  extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
| +extern int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t);
|  
|  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
|  extern int cr_read_mm(struct cr_ctx *ctx);
| diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
| index 2a06a2f..a6b6dce 100644
| --- a/include/linux/checkpoint_hdr.h
| +++ b/include/linux/checkpoint_hdr.h
| @@ -17,7 +17,7 @@
|  /*
|   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
|   * keep data 64-bit aligned: use padding for structure members, and use
| - * __attribute__ ((aligned (8))) for the entire structure.
| + * __attribute__((aligned (8))) for the entire structure.

Nit: fix this in the earlier patch?

Sukadev


* Re: [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code
       [not found]     ` <1238477349-11029-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-03-31 18:31       ` Oren Laadan
  2009-04-01 18:54       ` Serge E. Hallyn
@ 2009-04-07  3:29       ` Sukadev Bhattiprolu
       [not found]         ` <20090407032912.GF12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:29 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

A minor comment and a nit.

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From d832bfba9a50789fbfadf8486fbdfbd8b498a9ea Mon Sep 17 00:00:00 2001
| From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| Date: Fri, 27 Mar 2009 12:50:47 -0700
| Subject: [PATCH 10/29] actually use f_op in checkpoint code
| 
| 
| Right now, we assume all normal files and directories
| can be checkpointed.  However, as usual in the VFS, there
| are specialized places that will always need an ability
| to override these defaults.  We could do this completely
| in the checkpoint code, but that would bitrot quickly.
| 
| This adds a new 'file_operations' function for
| checkpointing a file.  I did this under the assumption
| that we should have a dirt-simple way to make something
| (un)checkpointable that fits in with current code.
| 
| As you can see in the ext[234] and /proc patches, all
| that we have to do to make something simple be
| supported is add a single "generic" f_op entry.
| 
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  checkpoint/ckpt_file.c |   31 +++++++++++++++----------------
|  include/linux/fs.h     |   11 +++++++++++
|  2 files changed, 26 insertions(+), 16 deletions(-)
| 
| diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
| index 9c344c7..0fe68bf 100644
| --- a/checkpoint/ckpt_file.c
| +++ b/checkpoint/ckpt_file.c
| @@ -91,6 +91,11 @@ static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
|  
|  	hh->fd_type = CR_FD_GENERIC;
|  
| +	/*
| +	 * FIXME: when we'll add support for unlinked files/dirs, we'll
| +	 * need to distinguish between unlinked filed and unlinked dirs.
| +	 */
| +
|  	ret = cr_write_obj(ctx, &h, hh);
|  	if (ret < 0)
|  		return ret;
| @@ -98,12 +103,16 @@ static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
|  	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
|  }
|  
| +int generic_file_checkpoint(struct cr_ctx *ctx, struct file *file,
| +			    struct cr_hdr_file *hh)
| +{
| +	return cr_write_file_generic(ctx, file, hh);
| +}
| +
|  /* cr_write_file - dump the state of a given file pointer */
|  static int cr_write_file(struct cr_ctx *ctx, struct file *file)
|  {
|  	struct cr_hdr_file *hh;
| -	struct dentry *dent = file->f_dentry;
| -	struct inode *inode = dent->d_inode;
|  	int ret;
|  
|  	hh = cr_hbuf_get(ctx, sizeof(*hh));
| @@ -116,21 +125,11 @@ static int cr_write_file(struct cr_ctx *ctx, struct file *file)
|  	hh->f_version = file->f_version;
|  	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
|  
| -	/*
| -	 * FIXME: when we'll add support for unlinked files/dirs, we'll
| -	 * need to distinguish between unlinked filed and unlinked dirs.
| -	 */
| -	switch (inode->i_mode & S_IFMT) {
| -	case S_IFREG:
| -	case S_IFDIR:
| -		ret = cr_write_file_generic(ctx, file, hh);
| -		break;
| -	default:
| -		ret = -EBADF;
| -		break;
| -	}
| -	cr_hbuf_put(ctx, sizeof(*hh));
| +	ret = -EBADF;
| +	if (file->f_op->checkpoint)
| +		ret = file->f_op->checkpoint(ctx, file, hh);

Minor: this is not bisect-safe for checkpoint. FWIW, with the previous patch
we could checkpoint a process with an open file, but with this change alone we
can't (presumably nothing sets ->checkpoint until the next patch)? How about
merging patches 10 and 11?

|  
| +	cr_hbuf_put(ctx, sizeof(*hh));
|  	return ret;
|  }
|  
| diff --git a/include/linux/fs.h b/include/linux/fs.h
| index 3bf5057..835ee9e 100644
| --- a/include/linux/fs.h
| +++ b/include/linux/fs.h
| @@ -1296,6 +1296,14 @@ int generic_osync_inode(struct inode *, struct address_space *, int);
|  typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
|  struct block_device_operations;
|  
| +#ifdef CONFIG_CHECKPOINT
| +struct cr_ctx;
| +struct cr_hdr_file;
| +int generic_file_checkpoint(struct cr_ctx *, struct file *, struct cr_hdr_file *);
| +#else
| +#define generic_file_checkpoint NULL
| +#endif
| +
|  /* These macros are for out of kernel modules to test that
|   * the kernel supports the unlocked_ioctl and compat_ioctl
|   * fields in struct file_operations. */
| @@ -1334,6 +1342,9 @@ struct file_operations {
|  	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
|  	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
|  	int (*setlease)(struct file *, long, struct file_lock **);
| +#ifdef CONFIG_CHECKPOINT
| +	int (*checkpoint)(struct cr_ctx *, struct file *file, struct cr_hdr_file *);

Nit :-) s/file *file/file */

| +#endif
|  };
|  
|  struct inode_operations {
| -- 
| 1.5.2.5
| 


* Re: [RFC v14-rc2][PATCH 12/29] Restore open file descriptors
       [not found]     ` <1238477349-11029-13-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:29       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:29 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Minor comment.

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From 7bb32901eb8cefba38bd06bea8a1630ac0dd5051 Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 15:05:55 -0400
| Subject: [PATCH 12/29] Restore open file descriptors
| 
| Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
| and lookup objref in the hash table; if not found (first occurence), read
| in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
| Otherwise attach the file pointer from the hash as an FD.
| 
| This patch only handles basic FDs - regular files, directories and also
| symbolic links.
| 
| Changelog[v14]:
|   - Revert change to pr_debug(), back to cr_debug()
|   - Rename:  cr_read_files() => cr_read_fd_table()
|   - Rename:  cr_read_fd_data() => cr_read_file()
|   - Discard field 'hh->parent'
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Changelog[v6]:
|   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
|     (even though it's not really needed)
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| ---
|  checkpoint/Makefile        |    2 +-
|  checkpoint/restart.c       |    4 +
|  checkpoint/rstr_file.c     |  236 ++++++++++++++++++++++++++++++++++++++++++++
|  include/linux/checkpoint.h |    1 +
|  4 files changed, 242 insertions(+), 1 deletions(-)
|  create mode 100644 checkpoint/rstr_file.c
| 
| diff --git a/checkpoint/Makefile b/checkpoint/Makefile
| index 1d92ed2..607d864 100644
| --- a/checkpoint/Makefile
| +++ b/checkpoint/Makefile
| @@ -3,4 +3,4 @@
|  #
|  
|  obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o objhash.o \
| -		ckpt_mem.o rstr_mem.o ckpt_file.o
| +		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| index 665894f..da239fd 100644
| --- a/checkpoint/restart.c
| +++ b/checkpoint/restart.c
| @@ -286,6 +286,10 @@ static int cr_read_task(struct cr_ctx *ctx)
|  	cr_debug("memory: ret %d\n", ret);
|  	if (ret < 0)
|  		goto out;
| +	ret = cr_read_fd_table(ctx);
| +	cr_debug("files: ret %d\n", ret);
| +	if (ret < 0)
| +		goto out;
|  	ret = cr_read_thread(ctx);
|  	cr_debug("thread: ret %d\n", ret);
|  	if (ret < 0)
| diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
| new file mode 100644
| index 0000000..1031915
| --- /dev/null
| +++ b/checkpoint/rstr_file.c
| @@ -0,0 +1,236 @@
| +/*
| + *  Checkpoint file descriptors
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/kernel.h>
| +#include <linux/sched.h>
| +#include <linux/fs.h>
| +#include <linux/file.h>
| +#include <linux/fdtable.h>
| +#include <linux/fsnotify.h>
| +#include <linux/syscalls.h>
| +#include <linux/checkpoint.h>
| +#include <linux/checkpoint_hdr.h>
| +
| +#include "checkpoint_file.h"
| +
| +static int cr_close_all_fds(struct files_struct *files)
| +{
| +	int *fdtable;
| +	int nfds;
| +
| +	nfds = cr_scan_fds(files, &fdtable);
| +	if (nfds < 0)
| +		return nfds;
| +	while (nfds--)
| +		sys_close(fdtable[nfds]);
| +	kfree(fdtable);
| +	return 0;
| +}
| +
| +/**
| + * cr_attach_file - attach a lonely file ptr to a file descriptor
| + * @file: lonely file pointer
| + */
| +static int cr_attach_file(struct file *file)
| +{
| +	int fd = get_unused_fd_flags(0);
| +
| +	if (fd >= 0) {
| +		fsnotify_open(file->f_path.dentry);
| +		fd_install(fd, file);
| +	}
| +	return fd;
| +}
| +
| +/**
| + * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
| + * @file: lonely file pointer
| + */
| +static int cr_attach_get_file(struct file *file)
| +{
| +	int fd = get_unused_fd_flags(0);
| +
| +	if (fd >= 0) {
| +		fsnotify_open(file->f_path.dentry);
| +		get_file(file);
| +		fd_install(fd, file);
| +	}
| +	return fd;
| +}
| +
| +#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
| +
| +/* cr_read_file - restore the state of a given file pointer */
| +static int cr_read_file(struct cr_ctx *ctx, int objref)
| +{
| +	struct cr_hdr_file *hh;
| +	struct file *file;
| +	int fd = 0;	/* pacify gcc warning */
| +	int ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILE);
| +	cr_debug("flags %#x mode %#x how %d\n",
| +		 hh->f_flags, hh->f_mode, hh->fd_type);
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = -EINVAL;
| +
| +	/* FIX: more sanity checks on f_flags, f_mode etc */
| +
| +	switch (hh->fd_type) {
| +	case CR_FD_GENERIC:
| +		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
| +		break;
| +	default:
| +		goto out;
| +	}
| +
| +	if (IS_ERR(file)) {
| +		ret = PTR_ERR(file);
| +		goto out;
| +	}
| +
| +	/* FIX: need to restore uid, gid, owner etc */
| +
| +	/* adding <objref,file> to the hash will keep a reference to it */
| +	ret = cr_obj_add_ref(ctx, file, objref, CR_OBJ_FILE, 0);
| +	if (ret < 0) {
| +		filp_close(file, NULL);
| +		goto out;
| +	}
| +
| +	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
| +	if (fd < 0) {
| +		ret = fd;
| +		filp_close(file, NULL);
| +		goto out;
| +	}
| +
| +	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
| +	if (ret < 0)
| +		goto out;
| +	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
| +	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
| +		ret = 0;
| +
| +	ret = 0;
| + out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret < 0 ? ret : fd;
| +}
| +
| +/**
| + * cr_read_fd_ent - restore the state of a given file descriptor
| + * @ctx: checkpoint context
| + *
| + * Restores the state of a file descriptor; looks up the objref (in the
| + * header) in the hash table, and if found picks the matching file and
| + * use it; otherwise calls cr_read_file to restore the file too.
| + */
| +static int cr_read_fd_ent(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_fd_ent *hh;
| +	struct file *file;
| +	int newfd, ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
| +	if (ret < 0)
| +		goto out;
| +
| +	cr_debug("ref %d fd %d c.o.e %d\n",
| +		 hh->objref, hh->fd, hh->close_on_exec);
| +
| +	ret = -EINVAL;
| +	if (hh->objref <= 0 || hh->fd < 0)
| +		goto out;
| +
| +	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
| +	if (IS_ERR(file)) {
| +		ret = PTR_ERR(file);
| +		goto out;
| +	}
| +
| +	if (file) {
| +		/* reuse file descriptor found in the hash table */

Nit: s/file descriptor/file pointer/

Sukadev


* Re: [RFC v14-rc2][PATCH 13/29] External checkpoint of a task other than ourself
       [not found]     ` <1238477349-11029-14-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:30       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:30 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Minor comment.

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From 0fd2795c29fec51eec75f76ea21394367b6801db Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Tue, 21 Oct 2008 16:26:10 -0400
| Subject: [PATCH 13/29] External checkpoint of a task other than ourself
| 
| Now we can do "external" checkpoint, i.e. act on another task.
| 
| sys_checkpoint() now looks up the target pid (in our namespace) and
| checkpoints that corresponding task. That task should be the root of
| a container.
| 
| sys_restart() remains the same, as the restart is always done in the
| context of the restarting task.
| 
| Changelog[v14]:
|   - Refuse non-self checkpoint if target task isn't frozen
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Changelog[v11]:
|   - Copy contents of 'init->fs->root' instead of pointing to them
| 
| Changelog[v10]:
|   - Grab vfs root of container init, rather than current process
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| ---
|  checkpoint/checkpoint.c    |   73 ++++++++++++++++++++++++++++++++++++++++++-
|  checkpoint/restart.c       |    4 +-
|  checkpoint/sys.c           |    6 +++
|  include/linux/checkpoint.h |    2 +
|  4 files changed, 81 insertions(+), 4 deletions(-)
| 
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index d4e0007..25229d3 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -10,6 +10,8 @@
|  
|  #include <linux/version.h>
|  #include <linux/sched.h>
| +#include <linux/freezer.h>
| +#include <linux/ptrace.h>
|  #include <linux/time.h>
|  #include <linux/fs.h>
|  #include <linux/file.h>
| @@ -242,6 +244,11 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
|  {
|  	int ret;
|  
| +	if (t->state == TASK_DEAD) {
| +		pr_warning("c/r: task may not be in state TASK_DEAD\n");
| +		return -EAGAIN;
| +	}
| +
|  	ret = cr_write_task_struct(ctx, t);
|  	cr_debug("task_struct: ret %d\n", ret);
|  	if (ret < 0)
| @@ -264,22 +271,84 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
|  	return ret;
|  }
|  
| +static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
| +{
| +	struct task_struct *task = NULL;
| +	struct nsproxy *nsproxy = NULL;
| +	int err = -ESRCH;
| +
| +	ctx->root_pid = pid;
| +
| +	read_lock(&tasklist_lock);
| +	task = find_task_by_vpid(pid);
| +	if (task)
| +		get_task_struct(task);
| +	read_unlock(&tasklist_lock);
| +
| +	if (!task)
| +		goto out;
| +
| +#if 0	/* enable to use containers */
| +	if (!is_container_init(task)) {
| +		err = -EINVAL;
| +		goto out;
| +	}
| +#endif
| +
| +	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
| +		err = -EPERM;
| +		goto out;
| +	}
| +
| +	/* verify that the task is frozen (unless self) */
| +	if (task != current && !frozen(task))
| +		return -EBUSY;
| +
| +	rcu_read_lock();
| +	if (task_nsproxy(task)) {
| +		nsproxy = task_nsproxy(task);

Nit: why call task_nsproxy() twice ?

| +		get_nsproxy(nsproxy);
| +	}
| +	rcu_read_unlock();
| +
| +	if (!nsproxy)
| +		goto out;
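
To spell out the nit above, here is a minimal userspace sketch (not the
patch itself) of reading the pointer once and reusing the local. The
*_stub helpers and the cut-down structs are made up for illustration; in
the kernel the read would still sit inside rcu_read_lock().

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down, hypothetical stand-ins for the kernel structures. */
struct nsproxy { int count; };
struct task { struct nsproxy *nsproxy; };

/* Models task_nsproxy(): returns the task's current nsproxy, or NULL. */
struct nsproxy *task_nsproxy_stub(struct task *t)
{
	return t->nsproxy;
}

/* Models get_nsproxy(): takes one reference. */
void get_nsproxy_stub(struct nsproxy *ns)
{
	ns->count++;
}

/* Read the pointer once, then test and pin that same value. */
struct nsproxy *grab_nsproxy(struct task *t)
{
	struct nsproxy *ns;

	assert(t != NULL);
	ns = task_nsproxy_stub(t);
	if (ns)
		get_nsproxy_stub(ns);
	return ns;
}
```

With a single read, the value that is tested for NULL is by construction
the value that gets pinned.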

Sukadev


* Re: [RFC v14-rc2][PATCH 14/29] Checkpoint multiple processes
       [not found]     ` <1238477349-11029-15-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:31       ` Sukadev Bhattiprolu
       [not found]         ` <20090407033111.GI12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From ee2f3b5c8548136229cc2f41c5271b0a81ab8a4d Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 15:06:13 -0400
| Subject: [PATCH 14/29] Checkpoint multiple processes
| 
| Checkpointing of multiple processes works by recording the tasks tree
| structure below a given task (usually this task is the container init).
| 
| For a given task, do a DFS scan of the tasks tree and collect them
| into an array (keeping a reference to each task). Using DFS simplifies
| the recreation of tasks either in user space or kernel space. For each
| task collected, test if it can be checkpointed, and save its pid, tgid,
| and ppid.
| 
| The actual work is divided into two passes: a first scan counts the
| tasks, then memory is allocated and a second scan fills the array.
| 
| The logic is suitable for creation of processes during restart either
| in userspace or by the kernel.
| 
| Currently we ignore threads and zombies, as well as session ids.
| 
| Changelog[v14]:
|   - Refuse non-self checkpoint if target task isn't frozen
|   - Revert change to pr_debug(), back to cr_debug()
|   - Use only unsigned fields in checkpoint headers
|   - Check retval of cr_tree_count_tasks() in cr_build_tree()
|   - Discard 'h.parent' field
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v13]:
|   - Release tasklist_lock in error path in cr_tree_count_tasks()
|   - Use separate index for 'tasks_arr' and 'hh' in cr_write_pids()
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| ---
|  checkpoint/checkpoint.c        |  228 ++++++++++++++++++++++++++++++++++++++--
|  checkpoint/sys.c               |   16 +++
|  include/linux/checkpoint.h     |    3 +
|  include/linux/checkpoint_hdr.h |   13 ++-
|  4 files changed, 248 insertions(+), 12 deletions(-)
| 
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 25229d3..7f5eee6 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -244,11 +244,6 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
|  {
|  	int ret;
|  
| -	if (t->state == TASK_DEAD) {
| -		pr_warning("c/r: task may not be in state TASK_DEAD\n");
| -		return -EAGAIN;
| -	}
| -
|  	ret = cr_write_task_struct(ctx, t);
|  	cr_debug("task_struct: ret %d\n", ret);
|  	if (ret < 0)
| @@ -271,6 +266,211 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
|  	return ret;
|  }
|  
| +/* dump all tasks in ctx->tasks_arr[] */
| +static int cr_write_all_tasks(struct cr_ctx *ctx)
| +{
| +	int n, ret = 0;
| +
| +	for (n = 0; n < ctx->tasks_nr; n++) {
| +		cr_debug("dumping task #%d\n", n);
| +		ret = cr_write_task(ctx, ctx->tasks_arr[n]);
| +		if (ret < 0)
| +			break;
| +	}
| +
| +	return ret;
| +}
| +
| +static int cr_may_checkpoint_task(struct task_struct *t, struct cr_ctx *ctx)
| +{
| +	cr_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
| +
| +	if (t->state == TASK_DEAD) {
| +		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
| +		return -EAGAIN;
| +	}
| +
| +	if (!ptrace_may_access(t, PTRACE_MODE_READ))
| +		return -EPERM;
| +
| +	/* verify that the task is frozen (unless self) */
| +	if (t != current && !frozen(t))
| +		return -EBUSY;
| +
| +	/* FIXME: change this for nested containers */
| +	if (task_nsproxy(t) != ctx->root_nsproxy)
| +		return -EPERM;
| +
| +	return 0;
| +}
| +
| +#define CR_HDR_PIDS_CHUNK	256
| +
| +static int cr_write_pids(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_pids *hh;
| +	struct pid_namespace *ns;
| +	struct task_struct *task;
| +	struct task_struct **tasks_arr;
| +	int tasks_nr, n, pos = 0, ret = 0;
| +
| +	ns = ctx->root_nsproxy->pid_ns;
| +	tasks_arr = ctx->tasks_arr;
| +	tasks_nr = ctx->tasks_nr;
| +	BUG_ON(tasks_nr <= 0);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	do {
| +		rcu_read_lock();
| +		for (n = 0; n < min(tasks_nr, CR_HDR_PIDS_CHUNK); n++) {
| +			task = tasks_arr[pos];
| +
| +			/* is this task cool ? */
| +			ret = cr_may_checkpoint_task(task, ctx);
| +			if (ret < 0) {
| +				rcu_read_unlock();
| +				goto out;
| +			}
| +			hh[n].vpid = task_pid_nr_ns(task, ns);
| +			hh[n].vtgid = task_tgid_nr_ns(task, ns);
| +			hh[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
| +			cr_debug("task[%d]: vpid %d vtgid %d parent %d\n", pos,
| +				 hh[n].vpid, hh[n].vtgid, hh[n].vppid);
| +			pos++;
| +		}
| +		rcu_read_unlock();
| +
| +		n = min(tasks_nr, CR_HDR_PIDS_CHUNK);
| +		ret = cr_kwrite(ctx, hh, n * sizeof(*hh));
| +		if (ret < 0)
| +			break;
| +
| +		tasks_nr -= n;
| +	} while (tasks_nr > 0);
| + out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +/* count number of tasks in tree (and optionally fill pid's in array) */
| +static int cr_tree_count_tasks(struct cr_ctx *ctx)
| +{
| +	struct task_struct *root = ctx->root_task;
| +	struct task_struct *task = root;
| +	struct task_struct *parent = NULL;
| +	struct task_struct **tasks_arr = ctx->tasks_arr;
| +	int tasks_nr = ctx->tasks_nr;
| +	int nr = 0;
| +
| +	read_lock(&tasklist_lock);
| +
| +	/* count tasks via DFS scan of the tree */
| +	while (1) {
| +		if (tasks_arr) {
| +			/* unlikely... but if so then try again later */
| +			if (nr == tasks_nr) {
| +				nr = -EAGAIN;	/* cleanup in cr_ctx_free() */
| +				break;
| +			}
| +			tasks_arr[nr] = task;
| +			get_task_struct(task);

Can we do an early cr_may_checkpoint_task() here ?
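
Roughly what I have in mind, as a userspace sketch under simplifying
assumptions: the walk is recursive here (the patch uses an iterative
scan), 'struct node', may_checkpoint() and walk() are hypothetical
stand-ins, and locking and reference counting are left out. The same
function serves as the counting pass (arr == NULL) and the filling pass.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_KIDS 4

/* Hypothetical stand-in for task_struct: just what the walk needs. */
struct node {
	int pid;
	int frozen;			/* models frozen(task) */
	int nr_kids;
	struct node *kids[MAX_KIDS];
};

/* Models cr_may_checkpoint_task(): 0 if eligible, negative errno if not. */
int may_checkpoint(const struct node *n)
{
	return n->frozen ? 0 : -EBUSY;
}

/*
 * Pre-order DFS. With arr == NULL only count; otherwise fill arr[] and
 * vet each node as it is collected. Returns the running number of nodes
 * visited, or a negative errno.
 */
int walk(struct node *n, struct node **arr, int cap, int nr)
{
	int i, ret;

	assert(n != NULL);
	if (arr) {
		if (nr == cap)
			return -EAGAIN;	/* tree grew since the counting pass */
		ret = may_checkpoint(n);
		if (ret < 0)
			return ret;	/* fail early, before anything is written */
		arr[nr] = n;
	}
	nr++;
	for (i = 0; i < n->nr_kids; i++) {
		nr = walk(n->kids[i], arr, cap, nr);
		if (nr < 0)
			return nr;
	}
	return nr;
}
```

A caller would run walk(root, NULL, 0, 0) to size the array, allocate it,
and then run walk(root, arr, count, 0) to fill it; an ineligible task is
then reported during collection rather than later while writing pids.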

Sukadev


* Re: [RFC v14-rc2][PATCH 15/29] Restart multiple processes
       [not found]     ` <1238477349-11029-16-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07  3:33       ` Sukadev Bhattiprolu
       [not found]         ` <20090407033315.GJ12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07  3:33 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Couple of nits and couple of not-so minor comments 

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| From 7162fef93ee3d9fd30a457dd7b0c7ad0200d5bcb Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Date: Mon, 30 Mar 2009 15:06:13 -0400
| Subject: [PATCH 15/29] Restart multiple processes
| 
| Restarting of multiple processes expects all restarting tasks to call
| sys_restart(). Once inside the system call, each task will restart
| itself in the same order in which the tasks were saved. The internals of
| the syscall take care of in-kernel synchronization between tasks.
| 
| This patch does _not_ create the task tree in the kernel. Instead it
| assumes that all tasks are created in some way and then invoke the
| restart syscall. You can use the userspace mktree.c program to do
| that.
| 
| The init task (*) has a special role: it allocates the restart context
| (ctx), and coordinates the operation. In particular, it first waits
| until all participating tasks enter the kernel, and provides them the
| common restart context. Once everyone is ready, it begins to restart
| itself.
| 
| In contrast, the other tasks enter the kernel, locate the init task (*)
| and grab its restart context, and then wait for their turn to restore.
| 
| When a task (init or not) completes its restart, it hands the control
| over to the next in line, by waking that task.
| 
| An array of pids (the one saved during the checkpoint) is used to
| synchronize the operation. The first task in the array is the init
| task (*). The restart context (ctx) maintains a "current position" in
| the array, which indicates which task is currently active. Once the
| currently active task completes its own restart, it increments that
| position and wakes up the next task.
| 
| Restart assumes that userspace provides meaningful data, otherwise
| it's garbage-in-garbage-out. In this case, the syscall may block
| indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
| otherwise kill the stray restarting tasks.
| 
| In terms of security, restart runs as the user that invokes it, so it
| will not allow a user to do more than is otherwise permitted by the
| usual system semantics and policy.
| 
| Currently we ignore threads and zombies, as well as session ids.
| Add support for multiple processes
| 
| (*) For containers, restart should be called inside a fresh container
| by the init task of that container. However, it is also possible to
| restart applications not necessarily inside a container, and without
| restoring the original pids of the processes (that is, provided that
| the application can tolerate such behavior). This is useful to allow
| multi-process restart of tasks not isolated inside a container, and
| also for debugging.
| 
| Changelog[v14]:
|   - Revert change to pr_debug(), back to cr_debug()
|   - Discard field 'h.parent'
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v13]:
|   - Clear root_task->checkpoint_ctx regardless of error condition
|   - Remove unused argument 'ctx' from do_restart_task() prototype
|   - Remove unused member 'pids_err' from 'struct cr_ctx'
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
| ---
|  checkpoint/restart.c       |  224 +++++++++++++++++++++++++++++++++++++++++++-
|  checkpoint/sys.c           |   34 ++++++--
|  include/linux/checkpoint.h |   24 ++++-
|  include/linux/sched.h      |    4 +
|  4 files changed, 272 insertions(+), 14 deletions(-)
| 
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| index 96d4d45..adebc1c 100644
| --- a/checkpoint/restart.c
| +++ b/checkpoint/restart.c
| @@ -10,6 +10,7 @@
|  
|  #include <linux/version.h>
|  #include <linux/sched.h>
| +#include <linux/wait.h>
|  #include <linux/file.h>
|  #include <linux/magic.h>
|  #include <linux/checkpoint.h>
| @@ -301,30 +302,245 @@ static int cr_read_task(struct cr_ctx *ctx)
|  	return ret;
|  }
|  
| +/* cr_read_tree - read the tasks tree into the checkpoint context */
| +static int cr_read_tree(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_tree *hh;
| +	int size, ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = -EINVAL;
| +	if (hh->tasks_nr < 0)
| +		goto out;
| +
| +	ctx->pids_nr = hh->tasks_nr;
| +	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
| +	if (size < 0)		/* overflow ? */
| +		goto out;
| +
| +	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
| +	if (!ctx->pids_arr) {
| +		ret = -ENOMEM;
| +		goto out;
| +	}
| +	ret = cr_kread(ctx, ctx->pids_arr, size);
| + out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +static int cr_wait_task(struct cr_ctx *ctx)
| +{
| +	pid_t pid = task_pid_vnr(current);
| +
| +	cr_debug("pid %d waiting\n", pid);
| +	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
| +}
| +
| +static int cr_next_task(struct cr_ctx *ctx)
| +{
| +	struct task_struct *tsk;
| +
| +	ctx->pids_pos++;
| +
| +	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
| +	if (ctx->pids_pos == ctx->pids_nr) {
| +		complete(&ctx->complete);
| +		return 0;
| +	}
| +
| +	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
| +
| +	cr_debug("pids_next %d\n", ctx->pids_active);
| +
| +	rcu_read_lock();
| +	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
| +	if (tsk)
| +		wake_up_process(tsk);
| +	rcu_read_unlock();
| +
| +	if (!tsk) {
| +		complete(&ctx->complete);
| +		return -ESRCH;
| +	}
| +
| +	return 0;
| +}
| +
| +/* FIXME: this should be per container */
| +DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
| +
| +static int do_restart_task(pid_t pid)
| +{
| +	struct task_struct *root_task;
| +	struct cr_ctx *ctx = NULL;
| +	int ret;
| +
| +	rcu_read_lock();
| +	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
| +	if (root_task)
| +		get_task_struct(root_task);
| +	rcu_read_unlock();
| +
| +	if (!root_task)
| +		return -EINVAL;
| +
| +	/*
| +	 * wait for container init to initialize the restart context, then
| +	 * grab a reference to that context, and if we're the last task to
| +	 * do it, notify the container init.
| +	 */
| +	ret = wait_event_interruptible(cr_restart_waitq,
| +				       root_task->checkpoint_ctx);
| +	if (ret < 0)
| +		goto out;
| +
| +	task_lock(root_task);
| +	ctx = root_task->checkpoint_ctx;
| +	if (ctx)
| +		cr_ctx_get(ctx);
| +	task_unlock(root_task);
| +
| +	if (!ctx) {
| +		ret = -EAGAIN;
| +		goto out;
| +	}
| +
| +	if (atomic_dec_and_test(&ctx->tasks_count))
| +		complete(&ctx->complete);
| +
| +	/* wait for our turn, do the restore, and tell next task in line */
| +	ret = cr_wait_task(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_task(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_next_task(ctx);
| +
| + out:
| +	cr_ctx_put(ctx);
| +	put_task_struct(root_task);
| +	return ret;
| +}
| +
| +/**
| + * cr_wait_all_tasks_start - wait for all tasks to enter sys_restart()
| + * @ctx: checkpoint context
| + *
| + * Called by the container root to wait until all restarting tasks
| + * are ready to restore their state. Temporarily advertises the 'ctx'
| + * on 'current->checkpoint_ctx' so that others can grab a reference
| + * to it, and clears it once synchronization completes. See also the
| + * related code in do_restart_task().
| + */
| +static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
| +{
| +	int ret;
| +
| +	if (ctx->pids_nr == 1)
| +		return 0;
| +
| +	init_completion(&ctx->complete);
| +	current->checkpoint_ctx = ctx;
| +
| +	wake_up_all(&cr_restart_waitq);
| +
| +	ret = wait_for_completion_interruptible(&ctx->complete);
| +
| +	task_lock(current);
| +	current->checkpoint_ctx = NULL;
| +	task_unlock(current);
| +
| +	return ret;
| +}
| +
| +static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
| +{
| +	int ret;
| +
| +	if (ctx->pids_nr == 1)
| +		return 0;
| +
| +	init_completion(&ctx->complete);
| +
| +	ret = cr_next_task(ctx);
| +	if (ret < 0)
| +		return ret;
| +
| +	ret = wait_for_completion_interruptible(&ctx->complete);
| +	if (ret < 0)
| +		return ret;
| +
| +	return 0;
| +}
| +
|  /* setup restart-specific parts of ctx */
|  static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
|  {
| +	ctx->root_pid = pid;
| +	ctx->root_task = current;
| +	ctx->root_nsproxy = current->nsproxy;
| +
| +	get_task_struct(ctx->root_task);
| +	get_nsproxy(ctx->root_nsproxy);
| +
| +	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
| +
|  	return 0;
|  }
|  
| -int do_restart(struct cr_ctx *ctx, pid_t pid)
| +static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
|  {
|  	int ret;
|  
| +	ret = cr_read_head(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_tree(ctx);
| +	if (ret < 0)
| +		goto out;
| +
|  	ret = cr_ctx_restart(ctx, pid);
|  	if (ret < 0)
|  		goto out;
| -	ret = cr_read_head(ctx);
| +
| +	/* wait for all other tasks to enter do_restart_task() */
| +	ret = cr_wait_all_tasks_start(ctx);
|  	if (ret < 0)
|  		goto out;
| +
|  	ret = cr_read_task(ctx);
|  	if (ret < 0)
|  		goto out;
| -	ret = cr_read_tail(ctx);
| +
| +	/* wait for all other tasks to complete do_restart_task() */
| +	ret = cr_wait_all_tasks_finish(ctx);
|  	if (ret < 0)
|  		goto out;
|  
| -	/* on success, adjust the return value if needed [TODO] */
| +	ret = cr_read_tail(ctx);
| +
|   out:
|  	return ret;
|  }
| +
| +int do_restart(struct cr_ctx *ctx, pid_t pid)
| +{
| +	int ret;
| +
| +	if (ctx)
| +		ret = do_restart_root(ctx, pid);
| +	else
| +		ret = do_restart_task(pid);
| +
| +	/* on success, adjust the return value if needed [TODO] */
| +	return ret;
| +}
| diff --git a/checkpoint/sys.c b/checkpoint/sys.c
| index 8630144..3a925ae 100644
| --- a/checkpoint/sys.c
| +++ b/checkpoint/sys.c
| @@ -167,6 +167,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
|  
|  static void cr_ctx_free(struct cr_ctx *ctx)
|  {
| +	BUG_ON(atomic_read(&ctx->refcount));
| +
|  	if (ctx->file)
|  		fput(ctx->file);
|  
| @@ -185,6 +187,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
|  	if (ctx->root_task)
|  		put_task_struct(ctx->root_task);
|  
| +	kfree(ctx->pids_arr);
| +
|  	kfree(ctx);
|  }
|  
| @@ -199,8 +203,10 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|  
|  	ctx->flags = flags;
|  
| +	atomic_set(&ctx->refcount, 0);
|  	INIT_LIST_HEAD(&ctx->pgarr_list);
|  	INIT_LIST_HEAD(&ctx->pgarr_pool);
| +	init_waitqueue_head(&ctx->waitq);
|  
|  	err = -EBADF;
|  	ctx->file = fget(fd);
| @@ -215,6 +221,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|  	if (cr_objhash_alloc(ctx) < 0)
|  		goto err;
|  
| +	atomic_inc(&ctx->refcount);
|  	return ctx;
|  
|   err:
| @@ -222,6 +229,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|  	return ERR_PTR(err);
|  }
|  
| +void cr_ctx_get(struct cr_ctx *ctx)
| +{
| +	atomic_inc(&ctx->refcount);
| +}
| +
| +void cr_ctx_put(struct cr_ctx *ctx)
| +{
| +	if (ctx && atomic_dec_and_test(&ctx->refcount))
| +		cr_ctx_free(ctx);
| +}
| +
|  /**
|   * sys_checkpoint - checkpoint a container
|   * @pid: pid of the container init(1) process
| @@ -251,7 +269,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
|  	if (!ret)
|  		ret = ctx->crid;
|  
| -	cr_ctx_free(ctx);
| +	cr_ctx_put(ctx);
|  	return ret;
|  }
|  
| @@ -266,7 +284,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
|   */
|  asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
|  {
| -	struct cr_ctx *ctx;
| +	struct cr_ctx *ctx = NULL;
|  	pid_t pid;
|  	int ret;
|  
| @@ -274,15 +292,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
|  	if (flags)
|  		return -EINVAL;
|  
| -	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
| -	if (IS_ERR(ctx))
| -		return PTR_ERR(ctx);
| -
|  	/* FIXME: for now, we use 'crid' as a pid */
|  	pid = (pid_t) crid;
|  
| +	if (pid == task_pid_vnr(current))
| +		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
| +
| +	if (IS_ERR(ctx))
| +		return PTR_ERR(ctx);
| +
|  	ret = do_restart(ctx, pid);
|  
| -	cr_ctx_free(ctx);
| +	cr_ctx_put(ctx);
|  	return ret;
|  }
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index c946320..cede30e 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -12,8 +12,11 @@
|  
|  #include <linux/path.h>
|  #include <linux/fs.h>
| +#include <linux/path.h>
| +#include <linux/sched.h>
| +#include <asm/atomic.h>
|  
| -#define CR_VERSION  2
| +#define CR_VERSION  3
|  
|  struct cr_ctx {
|  	int crid;		/* unique checkpoint id */
| @@ -31,8 +34,7 @@ struct cr_ctx {
|  	void *hbuf;		/* temporary buffer for headers */
|  	int hpos;		/* position in headers buffer */
|  
| -	struct task_struct **tasks_arr;	/* array of all tasks in container */
| -	int tasks_nr;			/* size of tasks array */
| +	atomic_t refcount;
|  
|  	struct cr_objhash *objhash;	/* hash for shared objects */
|  
| @@ -40,6 +42,19 @@ struct cr_ctx {
|  	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
|  
|  	struct path fs_mnt;	/* container root (FIXME) */
| +
| +	/* [multi-process checkpoint] */
| +	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
| +	int tasks_nr;                   /* size of tasks array */
| +
| +	/* [multi-process restart] */
| +	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
| +	int pids_nr;			/* size of pids array */

Nit: Since we already have a pid_nr() that refers to something different,
can we call this 'nr_pids' (and nr_tasks above) like mm_context->nr_threads ?
Of course, there is no convention, so it's easy to argue the other way.

Secondly, isn't pids_nr same as tasks_nr ? If so do we need both ?

Or is this intended to address the issue of multiple pid_nr values that a
task in a nested container can have ? If so, pids_nr is > tasks_nr and that
brings up two comments :-)

First, mktree.c and cr_next_task() are using 'ctx->pids_nr' to determine how
many tasks to start. If we are talking about nested containers, pids_nr
will be greater than tasks_nr, so mktree and cr_next_task() should
use 'ctx->tasks_nr' to determine how many tasks to create. Also, when
checkpointing a nested container, we should view the multiple nested pid
values of a process as an attribute of the task and maybe save them in
cr_write_task() rather than in cr_write_tree().

My second comment is more an orthogonal question. Suppose init_pid_ns is at
level 0 and we have a container that is nested at level 3. If we checkpoint
just this container, we would want to be able to restore it at any level > 0,
right ?

| +	int pids_pos;			/* position pids array */
| +	pid_t pids_active;		/* pid of (next) active task */

Do we need both pids_pos and pids_active in the ctx ? Can pids_active
just be a local variable in cr_next_task() and cr_wait_task() ?
IOW, isn't this always true:

	pids_arr[pids_pos].vpid == pids_active

| +	atomic_t tasks_count;		/* sync of tasks: used to coordinate */

Name is a bit confusing with 'tasks_nr', but the comment helps and I can't
think of a better name.

| +	struct completion complete;	/* container root and other tasks on */
| +	wait_queue_head_t waitq;	/* start, end, and restart ordering */
|  };
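
On the pids_pos / pids_active question above, a small single-threaded
sketch of what deriving the active pid from the position could look like.
'struct pid_ent', 'struct ctx', active_pid() and next_task() are
simplified, hypothetical stand-ins, and the waking and waiting are left
out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Models struct cr_hdr_pids: one saved entry per task. */
struct pid_ent { int vpid, vtgid, vppid; };

/* Only the fields of the restart context that the ordering needs. */
struct ctx {
	const struct pid_ent *pids_arr;
	int pids_nr;
	int pids_pos;
};

/* Derived on demand, so it cannot drift out of sync with pids_pos. */
int active_pid(const struct ctx *ctx)
{
	assert(ctx->pids_pos >= 0);
	if (ctx->pids_pos >= ctx->pids_nr)
		return -1;	/* everyone has had a turn */
	return ctx->pids_arr[ctx->pids_pos].vpid;
}

/* Models the hand-off in cr_next_task(): returns the pid to wake, or -1. */
int next_task(struct ctx *ctx)
{
	ctx->pids_pos++;
	return active_pid(ctx);
}
```

The catch is that the waiters evaluate the condition from other tasks, so
whether an unlocked read of the position plus an array lookup is as safe
as reading one cached pid_t would need a closer look.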

Sukadev


* Re: [RFC v14-rc2][PATCH 06/29] Dump memory address space
       [not found]         ` <20090407032636.GD12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-07  4:57           ` Oren Laadan
  0 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-04-07  4:57 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen



Sukadev Bhattiprolu wrote:
> One comment below.

Thanks .. (and the other reviews as well - I fixed all of them)

[...]

> | +static char *
> | +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
> | +{
> | +	struct path tmp = *root;
> | +	char *fname;
> | +
> | +	BUG_ON(!buf);
> | +	spin_lock(&dcache_lock);
> | +	fname = __d_path(path, &tmp, buf, *n);
> | +	spin_unlock(&dcache_lock);
> | +	if (!IS_ERR(fname))
> | +		*n = (buf + (*n) - fname);
> | +	/*
> | +	 * FIXME: if __d_path() changed these, it must have stepped out of
> | +	 * init's namespace. Since currently we require a unified namespace
> | +	 * within the container: simply fail.
> | +	 */
> | +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> | +		fname = ERR_PTR(-EBADF);
> | 
> 
> Shouldn't this be under if (!IS_ERR(fname)) ? 'tmp' may be uninitialized
> if __d_path() fails with ENAMETOOLONG. Even otherwise, it may be better
> to report the error from __d_path() first ?
> 

True, fixed.
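
For reference, the reordering amounts to something like the userspace
sketch below. It is only an illustration: resolve_stub() is a made-up
stand-in for __d_path(), 'struct root' models the two fields of 'struct
path' that are compared, the ERR_PTR helpers are re-implemented locally,
and the code that ends up in the tree may look different.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct root { int mnt; int dentry; };

/*
 * Hypothetical resolver: copies 'name' to the end of buf. Fails with
 * -ENAMETOOLONG if it does not fit; rewrites *root when 'escaped' is
 * set, mimicking a walk that stepped outside the given root.
 */
char *resolve_stub(const char *name, int escaped, struct root *root,
		   char *buf, int buflen)
{
	int len = (int)strlen(name) + 1;

	if (len > buflen)
		return ERR_PTR(-ENAMETOOLONG);
	if (escaped)
		root->mnt = root->dentry = -1;
	return memcpy(buf + buflen - len, name, (size_t)len);
}

/* Same shape as cr_fill_fname(), with the root check under !IS_ERR(). */
char *fill_fname(const char *name, int escaped, const struct root *root,
		 char *buf, int *n)
{
	struct root tmp = *root;
	char *fname;

	assert(buf != NULL);
	fname = resolve_stub(name, escaped, &tmp, buf, *n);
	if (IS_ERR(fname))
		return fname;		/* report the resolver's own error first */
	*n = (int)(buf + *n - fname);
	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
		return ERR_PTR(-EBADF);	/* stepped outside the root */
	return fname;
}
```

This way a too-long name comes back as -ENAMETOOLONG, and -EBADF is
reserved for the case where the lookup succeeded but left the root.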

Oren.


* Re: [RFC v14-rc2][PATCH 14/29] Checkpoint multiple processes
       [not found]         ` <20090407033111.GI12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-07  5:12           ` Oren Laadan
  0 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-04-07  5:12 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | From ee2f3b5c8548136229cc2f41c5271b0a81ab8a4d Mon Sep 17 00:00:00 2001
> | From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> | Date: Mon, 30 Mar 2009 15:06:13 -0400
> | Subject: [PATCH 14/29] Checkpoint multiple processes

[...]

> | +/* count number of tasks in tree (and optionally fill pid's in array) */
> | +static int cr_tree_count_tasks(struct cr_ctx *ctx)
> | +{
> | +	struct task_struct *root = ctx->root_task;
> | +	struct task_struct *task = root;
> | +	struct task_struct *parent = NULL;
> | +	struct task_struct **tasks_arr = ctx->tasks_arr;
> | +	int tasks_nr = ctx->tasks_nr;
> | +	int nr = 0;
> | +
> | +	read_lock(&tasklist_lock);
> | +
> | +	/* count tasks via DFS scan of the tree */
> | +	while (1) {
> | +		if (tasks_arr) {
> | +			/* unlikely... but if so then try again later */
> | +			if (nr == tasks_nr) {
> | +				nr = -EAGAIN;	/* cleanup in cr_ctx_free() */
> | +				break;
> | +			}
> | +			tasks_arr[nr] = task;
> | +			get_task_struct(task);
> 
> Can we do an early cr_may_checkpoint_task() here ?

Sure, moved the test to here.

Oren.

> 
> Sukadev
> 


* Re: [RFC v14-rc2][PATCH 15/29] Restart multiple processes
       [not found]         ` <20090407033315.GJ12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-07  5:31           ` Oren Laadan
       [not found]             ` <49DAE526.6010900-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Oren Laadan @ 2009-04-07  5:31 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen



Sukadev Bhattiprolu wrote:
> Couple of nits and couple of not-so minor comments 
> 
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | From 7162fef93ee3d9fd30a457dd7b0c7ad0200d5bcb Mon Sep 17 00:00:00 2001
> | From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> | Date: Mon, 30 Mar 2009 15:06:13 -0400
> | Subject: [PATCH 15/29] Restart multiple processes
> | 
> | Restarting of multiple processes expects all restarting tasks to call
> | sys_restart(). Once inside the system call, each task will restart
> | itself in the same order in which the tasks were saved. The internals of
> | the syscall take care of in-kernel synchronization between tasks.
> | 

[...]

> |  
> |  struct cr_ctx {
> |  	int crid;		/* unique checkpoint id */
> | @@ -31,8 +34,7 @@ struct cr_ctx {
> |  	void *hbuf;		/* temporary buffer for headers */
> |  	int hpos;		/* position in headers buffer */
> |  
> | -	struct task_struct **tasks_arr;	/* array of all tasks in container */
> | -	int tasks_nr;			/* size of tasks array */
> | +	atomic_t refcount;
> |  
> |  	struct cr_objhash *objhash;	/* hash for shared objects */
> |  
> | @@ -40,6 +42,19 @@ struct cr_ctx {
> |  	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
> |  
> |  	struct path fs_mnt;	/* container root (FIXME) */
> | +
> | +	/* [multi-process checkpoint] */
> | +	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
> | +	int tasks_nr;                   /* size of tasks array */
> | +
> | +	/* [multi-process restart] */
> | +	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
> | +	int pids_nr;			/* size of pids array */
> 
> Nit: Since we already have a pid_nr() that refers to something different,
> can we call this 'nr_pids' (and nr_tasks above)  like mm_context->nr_threads ?
> Of course, there is no convention, so it's easy to argue the other way.

Ok.

> 
> Secondly, isn't pids_nr same as tasks_nr ? If so do we need both ?

As the comment says: one is used exclusively for checkpoint and the
other exclusively for restart.
So we don't strictly need both. I thought that for readability it's
useful to have @pids_nr (ok, @nr_pids ...) when dealing with a @pids_arr,
and a @tasks_nr (ok .. @nr_tasks ...) when dealing with @tasks_arr.

> 
> Or is this intended to address the issue of multiple pid_nr values that a
> task in a nested container can have ? If so, pids_nr is > tasks_nr and that
> brings up two comments :-)

Ugh. This topic is TBD.

> 
> First, mktree.c and cr_next_task() are using 'ctx->pids_nr' to determine how
> many tasks to start. If we are talking about nested containers, pids_nr
> will be greater than tasks_nr, so mktree and cr_next_task() should
> use 'ctx->tasks_nr' to determine how many tasks to create. Also if
> checkpointing a nested container we should view the multiple nested pid
> values of a process as an attribute of the task and maybe save them in
> cr_write_task() rather than in cr_write_tree().

Lol .. who's talking about nested containers ?   ;)

(seriously: I'm not considering that now; my gut feeling is that it may
be useful to do pid_ns in userspace, like task creation - and in that
case it makes sense to keep it in cr_write_tree(). then again, I have
not looked at it in depth).

> 
> My second comment is more an orthogonal question. Suppose init_pid_ns = level
>  0 and we have a container that is nested at level 3.  If we checkpoint just
> this container, we would want to be able to restore this container at any level
> 0 right ?

True. Do you see any limitation in the current code that prevents this ?

> 
> | +	int pids_pos;			/* position pids array */
> | +	pid_t pids_active;		/* pid of (next) active task */
> 
> Do we need both pids_pos and pids_active in the ctx ? Can pids_active
> just be a local variable in cr_next_task() and cr_wait_task() ?
> IOW, isn't this always true
> 
> 	pids_arr[pids_pos] == pids_active

Ok.

Oren.
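
To illustrate the point agreed on above (that the active pid can be derived from the position instead of being stored), here is a small single-threaded sketch. It assumes `pids_arr` is a plain array of `pid_t`, whereas the patch uses `struct cr_hdr_pids`; the `demo_*` names and the -1 "all done" value are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, reduced model of the restart fields in struct cr_ctx. */
struct demo_restart_ctx {
	const pid_t *pids_arr;	/* pids in the order they were saved */
	int pids_nr;		/* size of pids array */
	int pids_pos;		/* position in pids array */
};

/* The active pid is always pids_arr[pids_pos]; -1 once every task is done. */
static pid_t demo_active_pid(const struct demo_restart_ctx *ctx)
{
	if (ctx->pids_pos >= ctx->pids_nr)
		return -1;
	return ctx->pids_arr[ctx->pids_pos];
}

/* A restarting task would sleep until this becomes true for its own pid. */
static int demo_is_my_turn(const struct demo_restart_ctx *ctx, pid_t pid)
{
	return pid > 0 && demo_active_pid(ctx) == pid;
}

/* Called by the task that just finished: hand over to the next saved pid. */
static pid_t demo_next_task(struct demo_restart_ctx *ctx)
{
	if (ctx->pids_pos < ctx->pids_nr)
		ctx->pids_pos++;
	return demo_active_pid(ctx);
}
```

In the kernel the hand-over would go through the wait queue in the ctx. This sketch only shows that a separate `pids_active` field carries no extra information.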

> 
> | +	atomic_t tasks_count;		/* sync of tasks: used to coordinate */
> 
> Name is a bit confusing with 'tasks_nr', but the comment helps and I can't
> think of a better name.
> 
> | +	struct completion complete;	/* container root and other tasks on */
> | +	wait_queue_head_t waitq;	/* start, end, and restart ordering */
> |  };
> 
> Sukadev
> 


* Re: [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code
       [not found]         ` <20090407032912.GF12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-07  5:36           ` Oren Laadan
  0 siblings, 0 replies; 66+ messages in thread
From: Oren Laadan @ 2009-04-07  5:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen



Sukadev Bhattiprolu wrote:
> A minor comment and a nit.
> 
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | From d832bfba9a50789fbfadf8486fbdfbd8b498a9ea Mon Sep 17 00:00:00 2001
> | From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> | Date: Fri, 27 Mar 2009 12:50:47 -0700
> | Subject: [PATCH 10/29] actually use f_op in checkpoint code
> | 
> | 
> | Right now, we assume all normal files and directories
> | can be checkpointed.  However, as usual in the VFS, there
> | are specialized places that will always need an ability
> | to override these defaults.  We could do this completely
> | in the checkpoint code, but that would bitrot quickly.
> | 
> | This adds a new 'file_operations' function for
> | checkpointing a file.  I did this under the assumption
> | that we should have a dirt-simple way to make something
> | (un)checkpointable that fits in with current code.
> | 
> | As you can see in the ext[234] and /proc patches, all
> | that we have to do to make something simple be
> | supported is add a single "generic" f_op entry.

[...]

> |  /* cr_write_file - dump the state of a given file pointer */
> |  static int cr_write_file(struct cr_ctx *ctx, struct file *file)
> |  {
> |  	struct cr_hdr_file *hh;
> | -	struct dentry *dent = file->f_dentry;
> | -	struct inode *inode = dent->d_inode;
> |  	int ret;
> |  
> |  	hh = cr_hbuf_get(ctx, sizeof(*hh));
> | @@ -116,21 +125,11 @@ static int cr_write_file(struct cr_ctx *ctx, struct file *file)
> |  	hh->f_version = file->f_version;
> |  	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
> |  
> | -	/*
> | -	 * FIXME: when we'll add support for unlinked files/dirs, we'll
> | -	 * need to distinguish between unlinked filed and unlinked dirs.
> | -	 */
> | -	switch (inode->i_mode & S_IFMT) {
> | -	case S_IFREG:
> | -	case S_IFDIR:
> | -		ret = cr_write_file_generic(ctx, file, hh);
> | -		break;
> | -	default:
> | -		ret = -EBADF;
> | -		break;
> | -	}
> | -	cr_hbuf_put(ctx, sizeof(*hh));
> | +	ret = -EBADF;
> | +	if (file->f_op->checkpoint)
> | +		ret = file->f_op->checkpoint(ctx, file, hh);
> 
> Minor: not bisect safe for checkpoint - fwiw, with previous patch we could
> checkpoint a process with open file, but with this change, we can't ? How
> about merge patches 10 and 11 ?

It will remain not-bisect-safe for anyone who is using file systems
other than ext2/3/4.

I think it's a good way of separating the concept and its implementation.

[...]

Oren.
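
The dispatch pattern under discussion can be sketched in userspace C as follows: a per-file operations table with an optional checkpoint hook, and a caller that defaults to -EBADF when the hook is absent. All `demo_*` types and names are hypothetical stand-ins for the kernel structures, and the header argument (`hh`) of the real hook is omitted for brevity.

```c
#include <errno.h>
#include <stddef.h>

struct demo_ctx {
	int files_written;	/* counts files the sketch "checkpointed" */
};

struct demo_file;

/* Hypothetical stand-in for file_operations with the new hook. */
struct demo_file_operations {
	int (*checkpoint)(struct demo_ctx *ctx, struct demo_file *file);
};

struct demo_file {
	const struct demo_file_operations *f_op;
};

/* Stand-in for a "generic" hook that a filesystem opts into. */
static int demo_generic_checkpoint(struct demo_ctx *ctx, struct demo_file *file)
{
	(void)file;
	ctx->files_written++;
	return 0;
}

/* A filesystem that supports checkpoint sets one f_op entry... */
static const struct demo_file_operations demo_ext_fops = {
	.checkpoint = demo_generic_checkpoint,
};

/* ...and one that does not simply leaves the entry NULL. */
static const struct demo_file_operations demo_unsupported_fops = {
	.checkpoint = NULL,
};

/* Mirrors the shape of the new cr_write_file() body: default to -EBADF. */
static int demo_write_file(struct demo_ctx *ctx, struct demo_file *file)
{
	int ret = -EBADF;

	if (file->f_op && file->f_op->checkpoint)
		ret = file->f_op->checkpoint(ctx, file);
	return ret;
}
```

This also shows the bisect concern raised above: until a filesystem sets the hook (patch 11 for the ext family), every file on it takes the -EBADF path.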


* Re: [RFC v14-rc2][PATCH 15/29] Restart multiple processes
       [not found]             ` <49DAE526.6010900-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-07 16:29               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 66+ messages in thread
From: Sukadev Bhattiprolu @ 2009-04-07 16:29 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| Sukadev Bhattiprolu wrote:
| > 
| > Secondly, isn't pids_nr same as tasks_nr ? If so do we need both ?
| 
| As the comment says: one is used exclusively for checkpoint and the
| other exclusively for restart.
| So we don't strictly need both. I thought that for readability it's
| useful to have @pids_nr (ok, @nr_pids ...) when dealing with a @pids_arr,
| and a @tasks_nr (ok .. @nr_tasks ...) when dealing with @tasks_arr.

| 
| > 
| > Or is this intended to address the issue of multiple pid_nr values that a
| > task in a nested container can have ? If so, pids_nr is > tasks_nr and that
| > brings up two comments :-)
| 
| Ugh. This topic is TBD.
| 
| > 
| > First, mktree.c and cr_next_task() are using 'ctx->pids_nr' to determine how
| > many tasks to start. If we are talking about nested containers, pids_nr
| > will be greater than tasks_nr, so mktree and cr_next_task() should
| > use 'ctx->tasks_nr' to determine how many tasks to create. Also if
| > checkpointing a nested container we should view the multiple nested pid
| > values of a process as an attribute of the task and maybe save them in
| > cr_write_task() rather than in cr_write_tree().
| 
| Lol .. who's talking about nested containers ?   ;)

:-) I guess the presence of both pids_nr and tasks_nr in the same structure
threw me off. Yes, ignoring nested containers for now is really good :-)

Maybe we can add a check in cr_may_checkpoint() to fail if any task in
the process tree has :

	pid->level != task_pid(current)->level + 1 

so nested containers fail cleanly.

| 
| (seriously: I'm not considering that now; my gut feeling is that it may
| be useful to do pid_ns in userspace, like task creation - and in that
| case it makes sense to keep it in cr_write_tree(). then again, I have
| not looked at it in depth).
| 
| > 
| > My second comment is more an orthogonal question. Suppose init_pid_ns = level
| >  0 and we have a container that is nested at level 3.  If we checkpoint just
| > this container, we would want to be able to restore this container at any level
| > 0 right ?
| 
| True. Do you see any limitation in the current code that prevents this ?

No, I did not see anything. I mentioned it because the number of pids
associated with the task will change and we have to discard some pids
during restart. But let's not worry about it now :-)

Sukadev
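
The level check proposed in this message could look roughly like the userspace sketch below. `struct demo_pid` models only the `level` field of the kernel's `struct pid`; the function names and the choice of -EINVAL as the error are assumptions, since the thread does not say which error to return.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pid: only the namespace depth. */
struct demo_pid {
	unsigned int level;
};

/*
 * Accept a task only if it lives exactly one pid namespace level below
 * the checkpointer, i.e. pid->level == task_pid(current)->level + 1.
 */
static int demo_check_not_nested(const struct demo_pid *task,
				 const struct demo_pid *checkpointer)
{
	if (task->level != checkpointer->level + 1)
		return -EINVAL;	/* assumed error code */
	return 0;
}

/* Apply the check to every task in the tree, stopping at the first failure. */
static int demo_may_checkpoint_tree(const struct demo_pid *tasks, size_t nr,
				    const struct demo_pid *checkpointer)
{
	size_t i;
	int ret;

	for (i = 0; i < nr; i++) {
		ret = demo_check_not_nested(&tasks[i], checkpointer);
		if (ret < 0)
			return ret;
	}
	return 0;
}
```

With a checkpointer at level 0, a tree whose tasks are all at level 1 passes, while a tree containing a level-2 task is refused before any state is written.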


Thread overview: 66+ messages
2009-03-31  5:28 [RFC v14-rc2][PATCH 00/29] Kernel based checkpoint/restart Oren Laadan
     [not found] ` <1238477349-11029-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 01/29] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 02/29] Checkpoint/restart: initial documentation Oren Laadan
     [not found]     ` <1238477349-11029-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:22       ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 03/29] Make file_pos_read/write() public Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 04/29] General infrastructure for checkpoint restart Oren Laadan
     [not found]     ` <1238477349-11029-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:24       ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 05/29] x86 support for checkpoint/restart Oren Laadan
     [not found]     ` <1238477349-11029-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:25       ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 06/29] Dump memory address space Oren Laadan
     [not found]     ` <1238477349-11029-7-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:26       ` Sukadev Bhattiprolu
     [not found]         ` <20090407032636.GD12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-07  4:57           ` Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 07/29] Restore " Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 08/29] Infrastructure for shared objects Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 09/29] Dump open file descriptors Oren Laadan
     [not found]     ` <1238477349-11029-10-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:28       ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 10/29] actually use f_op in checkpoint code Oren Laadan
     [not found]     ` <1238477349-11029-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-03-31 18:31       ` Oren Laadan
2009-04-01 18:54       ` Serge E. Hallyn
2009-04-07  3:29       ` Sukadev Bhattiprolu
     [not found]         ` <20090407032912.GF12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-07  5:36           ` Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 11/29] add generic checkpoint f_op to ext fses Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 12/29] Restore open file descriptors Oren Laadan
     [not found]     ` <1238477349-11029-13-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:29       ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 13/29] External checkpoint of a task other than ourself Oren Laadan
     [not found]     ` <1238477349-11029-14-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:30       ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 14/29] Checkpoint multiple processes Oren Laadan
     [not found]     ` <1238477349-11029-15-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:31       ` Sukadev Bhattiprolu
     [not found]         ` <20090407033111.GI12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-07  5:12           ` Oren Laadan
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 15/29] Restart " Oren Laadan
     [not found]     ` <1238477349-11029-16-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07  3:33       ` Sukadev Bhattiprolu
     [not found]         ` <20090407033315.GJ12316-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-07  5:31           ` Oren Laadan
     [not found]             ` <49DAE526.6010900-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-07 16:29               ` Sukadev Bhattiprolu
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 16/29] A new file type (CR_FD_OBJREF) for a file descriptor already setup Oren Laadan
     [not found]     ` <1238477349-11029-17-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 13:59       ` Serge E. Hallyn
     [not found]         ` <20090401135952.GA16973-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-01 14:13           ` Oren Laadan
2009-04-01 18:36       ` Serge E. Hallyn
2009-04-03 15:46       ` Dan Smith
     [not found]         ` <87y6uhyc3j.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-04-03 16:25           ` Oren Laadan
     [not found]             ` <49D63865.1030807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-03 16:30               ` Dan Smith
2009-04-03 16:54               ` Dave Hansen
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 17/29] Checkpoint open pipes Oren Laadan
     [not found]     ` <1238477349-11029-18-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 19:47       ` Serge E. Hallyn
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 18/29] Restore " Oren Laadan
     [not found]     ` <1238477349-11029-19-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 20:34       ` Serge E. Hallyn
2009-03-31  5:28   ` [RFC v14-rc2][PATCH 19/29] Record 'struct file' object instead of the file name for VMAs Oren Laadan
     [not found]     ` <1238477349-11029-20-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 21:45       ` Serge E. Hallyn
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 20/29] Prepare to support shared memory Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 21/29] Dump anonymous- and file-mapped- " Oren Laadan
     [not found]     ` <1238477349-11029-22-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 23:06       ` Serge E. Hallyn
     [not found]         ` <20090401230657.GB27725-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-01 23:18           ` Oren Laadan
     [not found]             ` <49D3F636.1070303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 23:32               ` Serge E. Hallyn
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 22/29] Restore " Oren Laadan
     [not found]     ` <1238477349-11029-23-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-02 16:59       ` Serge E. Hallyn
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 23/29] s390: Expose a constant for the number of words representing the CRs Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 24/29] c/r: Add CR_COPY() macro (v4) Oren Laadan
     [not found]     ` <1238477349-11029-25-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-01 23:20       ` Serge E. Hallyn
     [not found]         ` <20090401232013.GA31361-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-02 19:00           ` Dan Smith
     [not found]             ` <87vdpmnan2.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-04-02 19:06               ` Serge E. Hallyn
     [not found]                 ` <20090402190612.GA24390-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-02 20:22                   ` Dan Smith
     [not found]                     ` <87r60an6us.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-04-05 20:25                       ` Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 25/29] s390: define s390-specific checkpoint-restart code (v7) Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 26/29] powerpc: provide APIs for validating and updating DABR Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 27/29] powerpc: checkpoint/restart implementation Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 28/29] powerpc: wire up checkpoint and restart syscalls Oren Laadan
2009-03-31  5:29   ` [RFC v14-rc2][PATCH 29/29] powerpc: enable checkpoint support in Kconfig Oren Laadan
