* [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
@ 2009-04-28 23:23 Oren Laadan
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Here is the latest and greatest checkpoint/restart (c/r) patchset.
The logic and image format were reworked and simplified, the code was
refactored, and support was added for PPC, s390, sysvipc, shared memory
of all sorts, and namespaces (uts and ipc).
The userspace tool 'mktree' was extended to handle more complicated
process trees and to correctly account for process relationships and
session ID (sid). It should also handle threads correctly.
Hey, it even went through some massive renaming of files and functions...

Signals and timers are not supported yet, so programs that rely on
their behavior may fail to operate correctly after a restart (e.g.
may lose signals pending at time of checkpoint, and so on).

However, this one can actually be used for simple batch jobs (pipes,
too), a whole container or just a subtree of tasks. Try it:

create the freezer cgroup:
  $ mount -t cgroup -ofreezer freezer /freezer
  $ mkdir /freezer/0

run the test, freeze it:  
  $ test/multitask &
  [1] 2754
  $ for i in `pidof multitask`; do echo $i > /freezer/0/tasks; done
  $ echo FROZEN > /freezer/0/freezer.state

checkpoint:
  $ ./ckpt 2754 > ckpt.out

restart:
  $ ./mktree < ckpt.out

voila :)

To do all this, you'll need:

The git tree tracking v14, branch 'ckpt-v14' (and past versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires 'mktree' userspace tool with
the matching branch (v14):
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git


Oren.


Changelog:

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string) record in the output
  - Merged PPC architecture (by Nathan Lynch)
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--


* [RFC v14][PATCH 01/54] Create syscalls: sys_checkpoint, sys_restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 02/54] Checkpoint/restart: initial documentation Oren Laadan
                     ` (54 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
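
As a rough illustration of the fork()-like convention described above,
a caller could sort the return value into three cases. This is only a
sketch: the enum and the function names are hypothetical and not part
of this patch. The three-way meaning of the value (positive identifier,
zero after restart, negative on error) follows the sys_checkpoint()
comment in checkpoint/sys.c.

```c
/* Hypothetical names, for illustration only; the patch defines none of them. */
enum ckpt_outcome {
	CKPT_OUTCOME_ERROR,	/* negative: the call failed */
	CKPT_OUTCOME_RESTARTED,	/* zero: execution resumed via restart */
	CKPT_OUTCOME_TAKEN,	/* positive: checkpoint identifier */
};

/* Sort a sys_checkpoint() style return value into one of three cases. */
static enum ckpt_outcome classify_checkpoint_ret(long ret)
{
	if (ret < 0)
		return CKPT_OUTCOME_ERROR;
	if (ret == 0)
		return CKPT_OUTCOME_RESTARTED;
	return CKPT_OUTCOME_TAKEN;
}

/* Human-readable label, e.g. for logging next to the raw value. */
static const char *ckpt_outcome_name(enum ckpt_outcome outcome)
{
	switch (outcome) {
	case CKPT_OUTCOME_ERROR:
		return "error";
	case CKPT_OUTCOME_RESTARTED:
		return "resumed after restart";
	case CKPT_OUTCOME_TAKEN:
		return "checkpoint taken";
	}
	return "unknown";
}
```

A self-checkpointing task would pass the value returned by
syscall(__NR_checkpoint, ...) to classify_checkpoint_ret(), much as a
fork() caller branches on parent, child and error. Note that the libc
syscall() wrapper reports failure as -1 with errno set, which still
falls in the "negative" case.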

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.

Changelog[v14]:
  - Change CONFIG_CHECKPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)

Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/Kconfig                   |    4 +++
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   14 ++++++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 9 files changed, 76 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c9086e6..8dfe0c0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,6 +77,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config FAST_CMPXCHG_LOCAL
 	bool
 	default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..48557e1 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,8 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_checkpoint		335
+#define __NR_restart		336
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..e70b7ee 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,5 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..1761b0a
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Enable checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..375129c
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..2aa0943 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -751,6 +751,8 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 			  size_t);
 asmlinkage long sys_pipe2(int __user *, int);
 asmlinkage long sys_pipe(int __user *);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..adb4260 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1042,6 +1042,8 @@ config SLOW_WORK
 
 	  See Documentation/slow-work.txt.
 
+source "checkpoint/Kconfig"
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 27dad29..e9e749d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,3 +175,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3


* [RFC v14][PATCH 02/54] Checkpoint/restart: initial documentation
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 01/54] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 03/54] Make file_pos_read/write() public Oren Laadan
                     ` (53 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
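
To make the image format concrete: internals.txt in this patch
describes each record as a small pre-header ('struct ckpt_hdr' with
@type and @len) followed by a payload. The sketch below packs and walks
such records in a memory buffer. It is an illustration under stated
assumptions, not the kernel code: the names rec_hdr, rec_put() and
rec_get() are made up, type values are arbitrary, and @len is assumed
to count payload bytes only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the documented 'struct ckpt_hdr' layout (hypothetical name). */
struct rec_hdr {
	int16_t type;	/* identifies the payload */
	int16_t len;	/* assumed here: payload length in bytes */
};

/* Append one record at @buf. Returns bytes written, or 0 if it won't fit. */
static size_t rec_put(uint8_t *buf, size_t cap, int16_t type,
		      const void *payload, int16_t len)
{
	struct rec_hdr h = { type, len };

	if (len < 0 || cap < sizeof(h) + (size_t)len)
		return 0;
	memcpy(buf, &h, sizeof(h));
	if (len > 0)
		memcpy(buf + sizeof(h), payload, (size_t)len);
	return sizeof(h) + (size_t)len;
}

/*
 * Parse one record at @buf. Returns bytes consumed, or 0 if the buffer
 * is too short for the pre-header or for the payload it announces.
 */
static size_t rec_get(const uint8_t *buf, size_t avail,
		      struct rec_hdr *h, const uint8_t **payload)
{
	if (avail < sizeof(*h))
		return 0;
	memcpy(h, buf, sizeof(*h));
	if (h->len < 0 || avail - sizeof(*h) < (size_t)h->len)
		return 0;
	*payload = buf + sizeof(*h);
	return sizeof(*h) + (size_t)h->len;
}
```

A reader that does not understand a given @type can still skip the
record by advancing by the value rec_get() returns, which is what makes
a stream of such records easy to walk.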

Changelog[v14]:
  - Discard the 'h.parent' field
  - New image format (shared objects appear before they are referenced
    unless they are compound)

Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  128 ++++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 599 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt
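
The shared-object bookkeeping that internals.txt describes for the
checkpoint side (look an object up by its address, dump its state only
on first encounter, write just the identifier afterwards) can be
sketched in userspace C roughly as follows. All names and sizes here
are hypothetical. Unlike the real ctx->objhash, this sketch takes no
reference on the objects and uses a fixed pool.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_BUCKETS	64	/* arbitrary sizes for the sketch */
#define OBJ_MAX		256

struct obj_entry {
	const void *ptr;	/* key: address of the shared object */
	int tag;		/* unique identifier written to the image */
	struct obj_entry *next;	/* bucket chain */
};

struct obj_hash {
	struct obj_entry *bucket[OBJ_BUCKETS];
	struct obj_entry pool[OBJ_MAX];	/* entries are never deleted */
	int used;
};

static void obj_hash_init(struct obj_hash *h)
{
	memset(h, 0, sizeof(*h));
}

/*
 * Look up @ptr, adding it on first encounter. Returns the object's tag
 * (>= 1) and sets *first to 1 if the caller must dump the full state,
 * or to 0 if only the tag needs to be written. Returns -1 when full.
 */
static int obj_hash_lookup_add(struct obj_hash *h, const void *ptr, int *first)
{
	size_t b = ((uintptr_t)ptr >> 4) % OBJ_BUCKETS;
	struct obj_entry *e;

	for (e = h->bucket[b]; e; e = e->next) {
		if (e->ptr == ptr) {
			*first = 0;
			return e->tag;
		}
	}
	if (h->used == OBJ_MAX)
		return -1;

	e = &h->pool[h->used++];
	e->ptr = ptr;
	e->tag = h->used;	/* tags start at 1 */
	e->next = h->bucket[b];
	h->bucket[b] = e;
	*first = 1;
	return e->tag;
}
```

On restart the same idea applies with the roles swapped: the key is
the identifier read from the image and the value is the newly created
object.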

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		printf("usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		printf("invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..de2eead
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,128 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources
+
+* Process forest: tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by 'struct ckpt_hdr' as follows. The @type
+identifies the type of the payload, @len tells its length in bytes.
+
+struct ckpt_hdr {
+	__s16 type;
+	__s16 len;
+};
+
+It must be the first field in all other headers. For instance, the
+task data is saved in 'struct ckpt_hdr_task', which looks something
+like this:
+
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 pid;
+	...
+};
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct ckpt_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+ckpt_hdr + ckpt_hdr_head
+ckpt_hdr + ckpt_hdr_task
+	ckpt_hdr + ckpt_hdr_mm
+		ckpt_hdr + ckpt_hdr_vma + ckpt_hdr + string
+			ckpt_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			ckpt_hdr_pgarr (nr_pages = 0)
+		ckpt_hdr + ckpt_hdr_vma
+			ckpt_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			ckpt_hdr_pgarr (nr_pages = 0)
+		ckpt_hdr + ckpt_mm_context
+	ckpt_hdr + ckpt_hdr_thread
+	ckpt_hdr + ckpt_hdr_cpu
+ckpt_hdr + ckpt_hdr_tail
+
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+Shared objects are thus saved when they are first seen, and _before_
+the parent object that uses them. Therefore by the time the parent
+objects need them, they should already be in the objhash. The one
+exception is when more than a single shared resource will be restarted
+at once (e.g. the two ends of a pipe, or all the namespaces in an
+nsproxy). In this case the parent object is dumped first, followed by
+the individual sub-resources.
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories, they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of a shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security consideration for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, the ability of the task that calls sys_restart()
+to setresuid/setresgid to those values must be checked.
+
+Keeping the restart procedure within the limits of the caller's
+credentials means that there are various scenarios that cannot be
+supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory. This will be useful to restart
+    from a checkpoint image that remains in kernel memory.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint usually is a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() would return -EINVAL. (This is because
+with nested containers a task may belong to more than one container).
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+-----------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
-- 
1.5.4.3
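
The fork()-like return convention described in usage.txt above can be
sketched in user space as follows. This is only an illustration: the
enum and the ckpt_classify() helper are hypothetical names that are not
part of the patchset, and the sketch classifies a return value rather
than issuing the system call.

```c
#include <assert.h>

/* Hypothetical names, for illustration only (not in the patchset). */
enum ckpt_outcome {
	CKPT_FAILED,	/* -1: the call failed, errno holds the reason */
	CKPT_RESTARTED,	/*  0: execution resumed here after a restart */
	CKPT_TAKEN,	/* >0: checkpoint written, the value is the crid */
};

/* Map the return value of sys_checkpoint() to one of the three cases. */
static enum ckpt_outcome ckpt_classify(long ret)
{
	if (ret < 0)
		return CKPT_FAILED;
	if (ret == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}
```

A caller would pass in the result of
syscall(__NR_checkpoint, pid, fd, 0), as self.c does; __NR_checkpoint
is only defined on a kernel carrying this patchset.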

* [RFC v14][PATCH 03/54] Make file_pos_read/write() public
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 01/54] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 02/54] Checkpoint/restart: initial documentation Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart Oren Laadan
                     ` (52 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

These two are used in the next patch when calling vfs_read/write()

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 fs/read_write.c    |   10 ----------
 include/linux/fs.h |   10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9d1e76b..ed63ea3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..6e00db0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1546,6 +1546,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 				struct iovec *fast_pointer,
 				struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.5.4.3

* [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 03/54] Make file_pos_read/write() public Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
       [not found]     ` <1240961064-13991-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 05/54] x86 support for checkpoint/restart Oren Laadan
                     ` (51 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
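
The data transfer helpers in checkpoint/sys.c loop until the whole
buffer has been moved, retry on -EAGAIN, and treat an early end of
file on read as -EPIPE. A rough user-space analogue, assuming plain
read(2)/write(2) on a file descriptor instead of vfs_read()/vfs_write(),
might look like this. The names write_all() and read_all() are made up
for the sketch, and the EINTR retry is an addition that the kernel
helpers do not have.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write exactly count bytes; returns 0 or a negative errno value. */
static int write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;

	while (count) {
		ssize_t n = write(fd, p, count);

		if (n < 0) {
			/* retry (spins on EAGAIN, like the kernel loop) */
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		p += n;
		count -= n;
	}
	return 0;
}

/* Read exactly count bytes; an early EOF is reported as -EPIPE. */
static int read_all(int fd, void *buf, size_t count)
{
	char *p = buf;

	while (count) {
		ssize_t n = read(fd, p, count);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EPIPE;	/* unexpected end of image */
		p += n;
		count -= n;
	}
	return 0;
}
```

Returning 0 rather than a byte count mirrors the kernel helpers: a
caller either gets the whole object or an error, never a partial one.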

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
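
Every object in the image is written as a small header carrying a type
and a total length (header plus payload), followed by the payload; the
reader rejects objects whose length or type does not match what it
expects. The sketch below models that framing on an in-memory buffer.
It is an assumption-laden illustration: struct obj_hdr stands in for
struct ckpt_hdr (whose exact layout is defined in checkpoint_hdr.h),
and obj_write()/obj_read() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-in for struct ckpt_hdr; the real field layout may differ. */
struct obj_hdr {
	int type;
	int len;	/* header plus payload, in bytes */
};

/* Append one object to buf; returns bytes used or a negative errno. */
static int obj_write(char *buf, int room, int type, const void *data, int len)
{
	struct obj_hdr h = { .type = type, .len = len + (int)sizeof(h) };

	if (len < 0)
		return -EINVAL;
	if (h.len > room)
		return -ENOSPC;
	memcpy(buf, &h, sizeof(h));
	memcpy(buf + sizeof(h), data, len);
	return h.len;
}

/* Read one object of an expected type and exact payload length. */
static int obj_read(const char *buf, int avail, int type, void *data, int len)
{
	struct obj_hdr h;

	if (len < 0 || avail < (int)sizeof(h))
		return -EINVAL;
	memcpy(&h, buf, sizeof(h));
	/* a length shorter than the header, or past the input, is bogus */
	if (h.len < (int)sizeof(h) || h.len > avail)
		return -EINVAL;
	if (h.len != len + (int)sizeof(h) || h.type != type)
		return -EINVAL;
	memcpy(data, buf + sizeof(h), len);
	return h.len;
}
```

Unlike the kernel readers, this sketch checks the type before copying
the payload; the outcome for a mismatched object is the same (-EINVAL).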

Changelog[v14]:
  - Cleanup interface to get/put hdr buffers
  - Merge checkpoint and restart code into a single file (per subsystem)
  - Take uts_sem around access to uts->{release,version,machine}
  - Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
  - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
  - Revert use of 'pr_fmt' to avoid tainting whoever includes us (Nathan Lynch)
  - Explicitly indicate length of UTS fields in header
  - Discard field 'h->parent' from ckpt_hdr

Changelog[v12]:
  - ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
  - Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
  - Befriend sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special ckpt_debug()

Changelog[v10]:
  - add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
  - force end-of-string in ckpt_read_string() (fix possible DoS)

Changelog[v9]:
  - ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
  - Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (although it's not really needed)

Changelog[v5]:
  - Rename headers files s/ckpt/checkpoint/

Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 Makefile                         |    2 +-
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |  177 ++++++++++++++++++++++
 checkpoint/process.c             |   97 ++++++++++++
 checkpoint/restart.c             |  298 ++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c                 |  281 +++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h       |   82 +++++++++++
 include/linux/checkpoint_hdr.h   |  103 +++++++++++++
 include/linux/checkpoint_types.h |   35 +++++
 include/linux/magic.h            |    4 +
 lib/Kconfig.debug                |   13 ++
 11 files changed, 1088 insertions(+), 7 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/process.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h
 create mode 100644 include/linux/checkpoint_types.h

diff --git a/Makefile b/Makefile
index 9e5dc8f..f278909 100644
--- a/Makefile
+++ b/Makefile
@@ -646,7 +646,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..5d2c083 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
+	process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..6ac3571
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,177 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t ctx_count = ATOMIC_INIT(0);
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	return ckpt_kwrite(ctx, h, h->len);
+}
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h));
+	if (!h)
+		return -ENOMEM;
+
+	h->type = type;
+	h->len = len + sizeof(*h);
+
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	ret = ckpt_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		goto out;
+	ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+	_ckpt_hdr_put(ctx, h, sizeof(*h));
+	return ret;
+}
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
+{
+	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+/* write the checkpoint header */
+static int checkpoint_write_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (!h)
+		return -ENOMEM;
+
+	do_gettimeofday(&ktv);
+	uts = utsname();
+
+	h->magic = CHECKPOINT_MAGIC_HEAD;
+	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	h->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	h->rev = CKPT_VERSION;
+
+	h->flags = ctx->flags;
+	h->time = ktv.tv_sec;
+
+	h->uts_release_len = sizeof(uts->release);
+	h->uts_version_len = sizeof(uts->version);
+	h->uts_machine_len = sizeof(uts->machine);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	down_read(&uts_sem);
+	ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+ up:
+	up_read(&uts_sem);
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int checkpoint_write_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (!h)
+		return -ENOMEM;
+
+	h->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = checkpoint_write_header(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, return (unique) checkpoint identifier */
+	ctx->crid = atomic_inc_return(&ctx_count);
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
new file mode 100644
index 0000000..bf7545e
--- /dev/null
+++ b/checkpoint/process.c
@@ -0,0 +1,97 @@
+/*
+ *  Checkpoint task structure
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+/* dump the task_struct of a given task */
+static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	h->state = t->state;
+	h->exit_state = t->exit_state;
+	h->exit_code = t->exit_code;
+	h->exit_signal = t->exit_signal;
+
+	h->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = checkpoint_task_struct(ctx, t);
+	ckpt_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+/* read the task_struct into the current task */
+static int restore_task_struct(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->task_comm_len > TASK_COMM_LEN)
+		goto out;
+
+	memset(t->comm, 0, TASK_COMM_LEN);
+	ret = _ckpt_read_string(ctx, t->comm, h->task_comm_len);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the entire state of the current task */
+int restore_task(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	ret = restore_task_struct(ctx);
+	ckpt_debug("ret %d\n", ret);
+
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..d71c0f0
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,298 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/utsname.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ * @ptr: desired buffer
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length
+ *
+ * Return: 0 on success, negative error code otherwise
+ */
+static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			void *ptr, int len, int max)
+{
+	int ret;
+
+	ret = ckpt_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    h->type, h->len, len, max);
+	if (h->len < sizeof(*h))
+		return -EINVAL;
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && h->len != len) || (!len && max && h->len > max))
+		return -EINVAL;
+
+	return ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr));
+}
+
+/**
+ * _ckpt_read_nbuffer - read an object of type buffer (variable length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * Returns: actual buffer length (bounded by @len)
+ */
+int _ckpt_read_nbuffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	BUG_ON(!len);
+
+	len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, 0, len);
+	if (ret < 0)
+		return ret;
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h.type, h.len);
+	if (h.type != CKPT_HDR_BUFFER)
+		return -EINVAL;
+	return h.len;
+}
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	BUG_ON(!len);
+
+	len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+	if (ret < 0)
+		return ret;
+	if (h.type != CKPT_HDR_BUFFER)
+		return -EINVAL;
+	return 0;
+}
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	BUG_ON(!len);
+
+	ret = _ckpt_read_obj(ctx, &h, ptr, len + sizeof(h), len + sizeof(h));
+	if (ret < 0)
+		return ret;
+	if (h.type != CKPT_HDR_STRING)
+		return -EINVAL;
+
+	((char *) ptr)[len - 1] = '\0';	/* always play it safe */
+	return 0;
+}
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @len: desired object length, header included (if 0, flexible)
+ * @max: maximum object length, header included (if 0, unbounded)
+ *
+ * Return: new buffer allocated on success, error pointer otherwise.
+ * The buffer is released with ckpt_hdr_put().
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+	struct ckpt_hdr hh;
+	struct ckpt_hdr *h;
+	int ret;
+
+	ret = ckpt_kread(ctx, &hh, sizeof(hh));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    hh.type, hh.len, len, max);
+	if (hh.len < sizeof(*h))
+		return ERR_PTR(-EINVAL);
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && hh.len != len) || (!len && max && hh.len > max))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_hdr_get(ctx, hh.len);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	*h = hh;	/* yay ! */
+
+	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, len, len);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flexible)
+ * @ctx: checkpoint context
+ * @len: maximum object length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @len),
+ * as determined by the ckpt_hdr data.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, 0, len);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->magic != CHECKPOINT_MAGIC_HEAD || h->rev != CKPT_VERSION ||
+	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+	if (h->flags & ~CKPT_CTX_CHECKPOINT)
+		goto out;
+	if (h->uts_release_len != sizeof(uts->release) ||
+	    h->uts_version_len != sizeof(uts->version) ||
+	    h->uts_machine_len != sizeof(uts->machine))
+		goto out;
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ctx->oflags = h->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+	ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+ out:
+	kfree(uts);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int restore_read_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->magic != CHECKPOINT_MAGIC_TAIL)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int do_restart(struct ckpt_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_task(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tail(ctx);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..a99cd51 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -1,15 +1,233 @@
 /*
  *  Generic container checkpoint-restart
  *
- *  Copyright (C) 2008 Oren Laadan
+ *  Copyright (C) 2008-2009 Oren Laadan
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
 
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file , addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use ckpt_hdr_get() to reserve space
+ * in the buffer, then ckpt_hdr_put() when you no longer need that space.
+ *
+ * Checkpoint is performed by a single task, and restart is performed
+ * one task at a time; thus, we expect only one thread to be using
+ * the ctx->hbuf at a time, so no locking is needed.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CKPT_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * ckpt_hdr_get - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: desired length
+ *
+ * Returns pointer to header (on hbuf)
+ */
+void *ckpt_hdr_get(struct ckpt_ctx *ctx, int len)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + len > CKPT_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += len;
+	return ptr;
+}
+
+/**
+ * _ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ * @len: header length
+ *
+ * (requiring 'ptr' makes it easily interchangeable with kmalloc/kfree)
+ */
+void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	BUG_ON(ctx->hpos < len);
+	ctx->hpos -= len;
+}
+
+/**
+ * ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ *
+ * It is assumed that @ptr begins with a 'struct ckpt_hdr'.
+ */
+void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr *h = ptr;
+	_ckpt_hdr_put(ctx, ptr, h->len);
+}
+
+/**
+ * ckpt_hdr_get_type - get a hdr of certain size and type
+ * @ctx: checkpoint context
+ * @len: number of bytes to reserve
+ *
+ * Returns pointer to reserved space on hbuf
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_hdr_get(ctx, len);
+	if (!h)
+		return NULL;
+
+	h->type = type;
+	h->len = len;
+	return h;
+}
+
+
+/*
+ * helpers to manage C/R contexts: allocated for each checkpoint and/or
+ * restart operation, and persist until the operation is completed.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CKPT_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	ckpt_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +240,28 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct ckpt_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	if (pid == 0)
+		pid = current->pid;
+	ctx = ckpt_ctx_alloc(fd, flags | CKPT_CTX_CHECKPOINT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	ckpt_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +273,40 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct ckpt_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = ckpt_ctx_alloc(fd, flags | CKPT_CTX_RESTART);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	ckpt_ctx_free(ctx);
+	return ret;
 }
+
+
+/* 'ckpt_debug_level' controls the verbosity level of c/r code */
+#ifdef CONFIG_CHECKPOINT_DEBUG
+
+/* FIX: allow to change during runtime */
+unsigned int __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
+
+static __init int ckpt_debug_setup(char *s)
+{
+	ckpt_debug_level = simple_strtoul(s, NULL, 0);
+	return 0;
+}
+
+__setup("ckpt_debug=", ckpt_debug_setup);
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..1433290
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,82 @@
+#ifndef _LINUX_CHECKPOINT_H_
+#define _LINUX_CHECKPOINT_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
+
+extern void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int n);
+extern void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr);
+extern void *ckpt_hdr_get(struct ckpt_ctx *ctx, int n);
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
+
+extern int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
+extern int do_restart(struct ckpt_ctx *ctx, pid_t pid);
+
+/* task */
+extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task(struct ckpt_ctx *ctx);
+
+
+/* debugging flags */
+#define CKPT_DBASE	0x1		/* anything */
+#define CKPT_DSYS	0x2		/* generic (system) */
+#define CKPT_DRW	0x4		/* image read/write */
+
+#define CKPT_DDEFAULT	0x7		/* default debug level */
+
+#ifndef CKPT_DFLAG
+#define CKPT_DFLAG	0x0		/* nothing */
+#endif
+
+#ifdef CONFIG_CHECKPOINT_DEBUG
+extern unsigned int ckpt_debug_level;
+
+/* use this to select a specific debug level */
+#define _ckpt_debug(level, fmt, args...)			\
+	do {						\
+		if (ckpt_debug_level & (level))		\
+			pr_debug("[%d:c/r:%s] " fmt,	\
+				task_pid_vnr(current),	\
+				 __func__, ## args);	\
+	} while (0)
+
+/*
+ * CKPT_DBASE is the base flag and does not change
+ * CKPT_DFLAG is to be redefined in each source file
+ */
+#define ckpt_debug(fmt, args...)  \
+	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
+
+#else
+
+#define _ckpt_debug(level, fmt, args...)	do { } while (0)
+#define ckpt_debug(fmt, args...)		do { } while (0)
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
+
+#endif /* _LINUX_CHECKPOINT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..45378aa
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,103 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/*
+ * header format: 'struct ckpt_hdr' must prefix all other headers. Therefore
+ * when a header is passed around, the information about it (type, size)
+ * is readily available.
+ */
+struct ckpt_hdr {
+	__u32 type;
+	__u32 len;
+} __attribute__((aligned(8)));
+
+/* header types */
+enum {
+	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_BUFFER,
+	CKPT_HDR_STRING,
+
+	CKPT_HDR_TASK = 101,
+	CKPT_HDR_THREAD,
+	CKPT_HDR_CPU,
+
+	CKPT_HDR_MM = 201,
+	CKPT_HDR_VMA,
+	CKPT_HDR_MM_CONTEXT,
+
+	CKPT_HDR_TAIL = 5001
+};
+
+/* checkpoint image header */
+struct ckpt_hdr_header {
+	struct ckpt_hdr h;
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	__u16 uts_release_len;
+	__u16 uts_version_len;
+	__u16 uts_machine_len;
+	__u16 _padding;
+
+	/*
+	 * the header is followed by three strings:
+	 *   char release[__NEW_UTS_LEN];
+	 *   char version[__NEW_UTS_LEN];
+	 *   char machine[__NEW_UTS_LEN];
+	 */
+} __attribute__((aligned(8)));
+
+
+/* checkpoint image trailer */
+struct ckpt_hdr_tail {
+	struct ckpt_hdr h;
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+
+/* task data */
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__u32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
new file mode 100644
index 0000000..b04090f
--- /dev/null
+++ b/include/linux/checkpoint_types.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_CHECKPOINT_TYPES_H_
+#define _LINUX_CHECKPOINT_TYPES_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CKPT_VERSION  1
+
+struct ckpt_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* ckpt_ctx: flags */
+#define CKPT_CTX_CHECKPOINT	0x1
+#define CKPT_CTX_RESTART	0x2
+
+
+#endif /* _LINUX_CHECKPOINT_TYPES_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 5b4e28b..73a9d02 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -50,4 +50,8 @@
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
 #define STACK_END_MAGIC		0x57AC6E9D
+
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c6e854f..68b4eab 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -962,6 +962,19 @@ config DMA_API_DEBUG
 	  This option causes a performance degredation.  Use only if you want
 	  to debug device drivers. If unsure, say N.
 
+config CHECKPOINT_DEBUG
+	bool "Checkpoint/restart debugging (EXPERIMENTAL)"
+	depends on CHECKPOINT
+	default y
+	help
+	  This option turns on the debugging output of checkpoint/restart.
+	  The level of verbosity is controlled by 'ckpt_debug_level' and can
+	  be set at boot time with "ckpt_debug=" option.
+
+	  Turning this option off will reduce the size of the c/r code. If
+	  turned on, it is unlikely to incur visible overhead if the debug
+	  level is set to zero.
+
 source "samples/Kconfig"
 
 source "lib/Kconfig.kgdb"
-- 
1.5.4.3
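
A note on the ctx->hbuf scheme in checkpoint/sys.c above: ckpt_hdr_get()
and ckpt_hdr_put() treat the buffer as a stack, so nested reservations
must be released in reverse order. The following is only a userspace
sketch of that idea, not the kernel code: struct hbuf_ctx, hbuf_init(),
hbuf_destroy(), hbuf_get() and hbuf_put() are hypothetical names, malloc()
stands in for kmalloc(), and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HBUF_TOTAL (8 * 4096)	/* mirrors CKPT_HBUF_TOTAL */

/* hypothetical stand-in for the hbuf/hpos fields of struct ckpt_ctx */
struct hbuf_ctx {
	char *hbuf;	/* backing buffer */
	int hpos;	/* first free byte: grows on get, shrinks on put */
};

static int hbuf_init(struct hbuf_ctx *ctx)
{
	ctx->hbuf = malloc(HBUF_TOTAL);
	ctx->hpos = 0;
	return ctx->hbuf ? 0 : -1;
}

static void hbuf_destroy(struct hbuf_ctx *ctx)
{
	free(ctx->hbuf);
	ctx->hbuf = NULL;
	ctx->hpos = 0;
}

/* reserve len bytes on top of the stack; overflow is a logic error */
static void *hbuf_get(struct hbuf_ctx *ctx, int len)
{
	void *ptr;

	assert(len >= 0 && ctx->hpos + len <= HBUF_TOTAL);
	ptr = ctx->hbuf + ctx->hpos;
	ctx->hpos += len;
	return ptr;
}

/* release the most recent reservation (reverse order of hbuf_get) */
static void hbuf_put(struct hbuf_ctx *ctx, int len)
{
	assert(len >= 0 && ctx->hpos >= len);
	ctx->hpos -= len;
}
```

With this layout, an outer header reserved first stays valid while an
inner buffer is reserved and released on top of it, which is the pattern
checkpoint_cpu() and checkpoint_cpu_fpu() follow in the x86 patch.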


* [RFC v14][PATCH 05/54] x86 support for checkpoint/restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
       [not found]     ` <1240961064-13991-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 06/54] Introduce method 'checkpoint' in struct vm_operations_struct Oren Laadan
                     ` (50 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an architecture
specific extension of the header (ckpt_hdr_header_arch); currently this
includes only FPU capabilities.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.
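
As an illustration of the capability check described above, here is a
minimal userspace sketch, assuming the three FPU fields are simply
compared for equality on restart (which is what restore_read_header_arch()
in this patch does). struct fpu_caps and fpu_caps_compatible() are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* hypothetical mirror of the payload of struct ckpt_hdr_header_arch */
struct fpu_caps {
	uint16_t has_fxsr;
	uint16_t has_xsave;
	uint16_t xstate_size;
};

/*
 * Return 0 if an image recorded with 'saved' capabilities may be
 * restored on a host with 'host' capabilities, -EINVAL otherwise.
 */
static int fpu_caps_compatible(const struct fpu_caps *saved,
			       const struct fpu_caps *host)
{
	if (saved->has_fxsr != host->has_fxsr ||
	    saved->has_xsave != host->has_xsave ||
	    saved->xstate_size != host->xstate_size)
		return -EINVAL;
	return 0;
}
```

Strict equality is the conservative choice: the raw xstate area is copied
verbatim, so any difference in its layout or size makes the saved bytes
meaningless on the restoring host.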

Changelog[v14]:
  - Use new interface ckpt_hdr_get/put()
  - Embed struct ckpt_hdr in struct ckpt_hdr...
  - Remove preempt_disable/enable() around init_fpu() and fix leak
  - Revert change to pr_debug(), back to ckpt_debug()
  - Move code related to task_struct to checkpoint/process.c

Changelog[v12]:
  - A couple of missed calls to ckpt_hbuf_put()
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in ckpt_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space

Changelog[v7]:
  - Fix save/restore state of FPU

Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers

Changelog[v4]:
  - Fix header structure alignment

Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |  100 ++++++++
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  445 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |    7 +-
 checkpoint/checkpoint_arch.h          |    9 +
 checkpoint/process.c                  |   22 ++-
 checkpoint/restart.c                  |    6 +
 include/linux/checkpoint_hdr.h        |    1 +
 8 files changed, 589 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 checkpoint/checkpoint_arch.h

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..035abbb
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,100 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	/* FIXME: add HAVE_HWFP */
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _padding;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_thread {
+	struct ckpt_hdr h;
+	/* FIXME: restart blocks */
+	__u16 gdt_entry_tls_entries;
+	__u16 sizeof_tls_array;
+	__u16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u32 uses_debug;
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fdd30d0..7d894a5 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -19,3 +19,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..86ca916
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,445 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <asm/checkpoint_hdr.h>
+#include <linux/checkpoint.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_thread *h;
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_THREAD);
+	if (!h)
+		return -ENOMEM;
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	h->sizeof_tls_array = sizeof(thread->tls_array);
+	h->ntls = ntls;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("ntls %d\n", ntls);
+	if (ntls == 0)
+		return 0;
+
+	/*
+	 * For simplicity dump the entire array, cherry-pick upon restart
+	 * FIXME: the TLS descriptors in the GDT should be called out and
+	 * not tied to the in-kernel representation.
+	 */
+	ret = ckpt_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	h->bp = regs->bp;
+	h->bx = regs->bx;
+	h->ax = regs->ax;
+	h->cx = regs->cx;
+	h->dx = regs->dx;
+	h->si = regs->si;
+	h->di = regs->di;
+	h->orig_ax = regs->orig_ax;
+	h->ip = regs->ip;
+	h->cs = regs->cs;
+	h->flags = regs->flags;
+	h->sp = regs->sp;
+	h->ss = regs->ss;
+
+	h->ds = regs->ds;
+	h->es = regs->es;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS and FS registers should be saved from the hardware;
+	 * otherwise they are already saved on the thread structure
+	 */
+	if (t == current) {
+		savesegment(gs, h->gs);
+		savesegment(fs, h->fs);
+	} else {
+		h->gs = thread->gs;
+		h->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into the saved ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long) regs->orig_ax < 0);
+		h->ax = 0;
+	}
+}
+
+static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(h->debugreg0, 0);
+		get_debugreg(h->debugreg1, 1);
+		get_debugreg(h->debugreg2, 2);
+		get_debugreg(h->debugreg3, 3);
+		get_debugreg(h->debugreg6, 6);
+		get_debugreg(h->debugreg7, 7);
+	} else {
+		h->debugreg0 = thread->debugreg0;
+		h->debugreg1 = thread->debugreg1;
+		h->debugreg2 = thread->debugreg2;
+		h->debugreg3 = thread->debugreg3;
+		h->debugreg6 = thread->debugreg6;
+		h->debugreg7 = thread->debugreg7;
+	}
+
+	h->uses_debug = !!test_tsk_thread_flag(t, TIF_DEBUG);
+}
+
+static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	h->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf;
+	int ret;
+
+	xstate_buf = ckpt_hdr_get(ctx, xstate_size);
+	if (!xstate_buf)
+		return -ENOMEM;
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+	 * was cleared when task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIX: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(xstate_buf, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	ret = ckpt_kwrite(ctx, xstate_buf, xstate_size);
+	_ckpt_hdr_put(ctx, xstate_buf, xstate_size);
+
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	save_cpu_regs(h, t);
+	save_cpu_debug(h, t);
+	save_cpu_fpu(h, t);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, h->uses_debug);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = checkpoint_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	/* FPU capabilities */
+	h->has_fxsr = cpu_has_fxsr;
+	h->has_xsave = cpu_has_xsave;
+	h->xstate_size = xstate_size;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_thread *h;
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_THREAD);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("ntls %d\n", h->ntls);
+
+	ret = -EINVAL;
+	if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    h->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    h->ntls > GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+
+	if (h->ntls > 0) {
+		struct desc_struct *desc;
+		int size, cpu;
+
+		/*
+		 * restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ?
+		 */
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = ckpt_kread(ctx, desc, size);
+		if (ret == 0) {
+			/*
+			 * FIX: add sanity checks (e.g. that values make
+			 * sense, that we don't overwrite old values, etc.)
+			 */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	} else {
+		ret = 0;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = h->bx;
+	regs->cx = h->cx;
+	regs->dx = h->dx;
+	regs->si = h->si;
+	regs->di = h->di;
+	regs->bp = h->bp;
+	regs->ax = h->ax;
+	regs->ds = h->ds;
+	regs->es = h->es;
+	regs->orig_ax = h->orig_ax;
+	regs->ip = h->ip;
+	regs->cs = h->cs;
+	regs->flags = h->flags;
+	regs->sp = h->sp;
+	regs->ss = h->ss;
+
+	thread->gs = h->gs;
+	thread->fs = h->fs;
+	loadsegment(gs, h->gs);
+	loadsegment(fs, h->fs);
+
+	return 0;
+}
+
+static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	/* debug regs */
+
+	if (h->uses_debug) {
+		set_debugreg(h->debugreg0, 0);
+		set_debugreg(h->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(h->debugreg2, 2);
+		set_debugreg(h->debugreg3, 3);
+		set_debugreg(h->debugreg6, 6);
+		set_debugreg(h->debugreg7, 7);
+	}
+
+	return 0;
+}
+
+static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!h->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf;
+	int ret;
+
+	xstate_buf = ckpt_hdr_get(ctx, xstate_size);
+	if (!xstate_buf)
+		return -ENOMEM;
+
+	ret = ckpt_kread(ctx, xstate_buf, xstate_size);
+	if (ret < 0)
+		goto out;
+
+	/* init_fpu() eventually also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		goto out;
+
+	memcpy(t->thread.xstate, xstate_buf, xstate_size);
+ out:
+	_ckpt_hdr_put(ctx, xstate_buf, xstate_size);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, h->uses_debug);
+
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	ret = load_cpu_regs(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_debug(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_fpu(h, t);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = restore_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (h->has_fxsr != cpu_has_fxsr ||
+	    h->has_xsave != cpu_has_xsave ||
+	    h->xstate_size != xstate_size)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 6ac3571..62ba0a6 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -22,6 +22,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /* unique checkpoint identifier (FIXME: should be per-container ?) */
 static atomic_t ctx_count = ATOMIC_INIT(0);
 
@@ -135,7 +137,10 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
  up:
 	up_read(&uts_sem);
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return checkpoint_write_header_arch(ctx);
 }
 
 /* write the checkpoint trailer */
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..2ee4d7f
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,9 @@
+#include <linux/checkpoint.h>
+
+extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
+extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int restore_read_header_arch(struct ckpt_ctx *ctx);
+extern int restore_thread(struct ckpt_ctx *ctx);
+extern int restore_cpu(struct ckpt_ctx *ctx);
diff --git a/checkpoint/process.c b/checkpoint/process.c
index bf7545e..0578182 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -15,6 +15,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /***********************************************************************
  * Checkpoint
  */
@@ -53,7 +55,15 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	ret = checkpoint_task_struct(ctx, t);
 	ckpt_debug("ret %d\n", ret);
-
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_thread(ctx, t);
+	ckpt_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_cpu(ctx, t);
+	ckpt_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
 
@@ -92,6 +102,14 @@ int restore_task(struct ckpt_ctx *ctx)
 
 	ret = restore_task_struct(ctx);
 	ckpt_debug("ret %d\n", ret);
-
+	if (ret < 0)
+		goto out;
+	ret = restore_thread(ctx);
+	ckpt_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_cpu(ctx);
+	ckpt_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d71c0f0..9adcc90 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -19,6 +19,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /**
  * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
  * @ctx: checkpoint context
@@ -258,6 +260,10 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_read_header_arch(ctx);
  out:
 	kfree(uts);
 	ckpt_hdr_put(ctx, h);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 45378aa..9716f4b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -41,6 +41,7 @@ struct ckpt_hdr {
 /* header types */
 enum {
 	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
-- 
1.5.4.3


* [RFC v14][PATCH 06/54] Introduce method 'checkpoint' in struct vm_operations_struct
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 05/54] x86 support for checkpoint/restart Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages() Oren Laadan
                     ` (49 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/mm.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..05f0ed9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -14,6 +14,8 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 
+#include <linux/checkpoint_types.h>
+
 struct mempolicy;
 struct anon_vma;
 struct file_ra_state;
@@ -220,6 +222,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
-- 
1.5.4.3


* [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages()
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 06/54] Introduce method 'checkpoint' in struct vm_operations_struct Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
       [not found]     ` <1240961064-13991-8-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 08/54] Dump memory address space Oren Laadan
                     ` (48 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Add a "start" argument to request that the vDSO be mapped at a specific
address, and fail the operation if it cannot be mapped there.

This is useful for restart(2) to ensure that the memory layout is restored
exactly as needed.
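
The contract described above can be sketched in plain userspace C. This is
only an illustration under stated assumptions, not the kernel code:
place_vdso(), alloc_fn and the two stub allocators are hypothetical names,
the stub addresses are arbitrary, and returning -EINVAL on a mismatch is a
choice made for this sketch. A start of 0 means "no preference".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an allocator that takes an address hint. */
typedef unsigned long (*alloc_fn)(unsigned long hint);

/* Stub allocator that honors any non-zero hint. */
static unsigned long alloc_honors_hint(unsigned long hint)
{
	return hint ? hint : 0x40000000UL;
}

/* Stub allocator that ignores the hint, as if the range were occupied. */
static unsigned long alloc_ignores_hint(unsigned long hint)
{
	(void)hint;
	return 0x50000000UL;
}

/*
 * Pick a location for the mapping.
 *   start == 0: accept whatever the allocator returns (normal exec path).
 *   start != 0: the result must equal start, otherwise fail.
 * The -EINVAL error code is an assumption of this sketch.
 */
static int place_vdso(alloc_fn alloc, unsigned long start, unsigned long *out)
{
	unsigned long addr;

	assert(alloc != NULL && out != NULL);
	addr = alloc(start);
	if (start && addr != start)
		return -EINVAL;
	*out = addr;
	return 0;
}
```

Calling place_vdso(alloc_honors_hint, 0, &addr) models the exec path, where
any address is acceptable; place_vdso(alloc_ignores_hint, 0x7fff0000UL, &addr)
models a restart that cannot get the requested address and therefore fails
without touching addr.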

Signed-off-by: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/powerpc/include/asm/elf.h     |    1 +
 arch/powerpc/kernel/vdso.c         |   11 ++++++++++-
 arch/s390/include/asm/elf.h        |    2 +-
 arch/s390/kernel/vdso.c            |   11 ++++++++++-
 arch/sh/include/asm/elf.h          |    1 +
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/include/asm/elf.h         |    3 ++-
 arch/x86/vdso/vdso32-setup.c       |    9 +++++++--
 arch/x86/vdso/vma.c                |    9 +++++++--
 fs/binfmt_elf.c                    |    2 +-
 10 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 1a856b1..aa020f5 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -269,6 +269,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index ad06d5c..48beff6 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -184,7 +184,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -211,6 +212,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_base = VDSO32_MBASE;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	current->mm->context.vdso_base = 0;
 
 	/* vDSO has a problem and was disabled, just don't "enable" it for the
@@ -234,6 +239,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto fail_mmapsem;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start)
+		goto fail_mmapsem;
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process isn't
 	 * allowed to write those pages.
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 74d0bbb..54235bc 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -205,6 +205,6 @@ do {									    \
 struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);
 
 #endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 89b2e7f..34b6e0c 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -182,7 +182,8 @@ static void vdso_init_cr5(void)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -213,6 +214,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_pages = vdso32_pages;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	/*
 	 * vDSO has a problem and was disabled, just don't "enable" it for
 	 * the process
@@ -235,6 +240,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto out_up;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto out_up;
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process
 	 * isn't allowed to write those pages.
diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h
index ccb1d93..6c27b1f 100644
--- a/arch/sh/include/asm/elf.h
+++ b/arch/sh/include/asm/elf.h
@@ -202,6 +202,7 @@ do {									\
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 
 extern unsigned int vdso_enabled;
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 3f7e415..64c70e5 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -59,7 +59,7 @@ int __init vsyscall_init(void)
 }
 
 /* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm, unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 83c1bc8..a4398c8 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -336,9 +336,10 @@ struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 
-extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
+extern int syscall32_setup_pages(struct linux_binprm *, unsigned long start, int exstack);
 #define compat_arch_setup_additional_pages	syscall32_setup_pages
 
 extern unsigned long arch_randomize_brk(struct mm_struct *mm);
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 1241f11..9c72a23 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -310,7 +310,8 @@ int __init sysenter_setup(void)
 }
 
 /* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
@@ -331,13 +332,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	if (compat)
 		addr = VDSO_HIGH_BASE;
 	else {
-		addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+		addr = get_unmapped_area(NULL, start, PAGE_SIZE, 0, 0);
 		if (IS_ERR_VALUE(addr)) {
 			ret = addr;
 			goto up_fail;
 		}
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto up_fail;
+
 	if (compat_uses_vma || !compat) {
 		/*
 		 * MAYWRITE to allow gdb to COW and set breakpoints
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 7133cdf..81fce6d 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -98,7 +98,8 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
 
 /* Setup a VMA at program startup for the vsyscall page.
    Not called for compat tasks */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
@@ -108,13 +109,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return 0;
 
 	down_write(&mm->mmap_sem);
-	addr = vdso_addr(mm->start_stack, vdso_size);
+	addr = start ? : vdso_addr(mm->start_stack, vdso_size);
 	addr = get_unmapped_area(NULL, addr, vdso_size, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
 		goto up_fail;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto up_fail;
+
 	ret = install_special_mapping(mm, addr, vdso_size,
 				      VM_READ|VM_EXEC|
 				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 40381df..aa9a802 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -945,7 +945,7 @@ static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
 	set_binfmt(&elf_format);
 
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
-	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
+	retval = arch_setup_additional_pages(bprm, 0, !!elf_interpreter);
 	if (retval < 0) {
 		send_sig(SIGKILL, current, 0);
 		goto out;
-- 
1.5.4.3


* [RFC v14][PATCH 08/54] Dump memory address space
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages() Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
       [not found]     ` <1240961064-13991-9-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 09/54] Restore " Oren Laadan
                     ` (47 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

For each VMA, there is a 'struct ckpt_vma'; if the VMA is file-mapped,
it will be followed by the file name. Then comes the actual contents,
in one or more chunks: each chunk begins with a header that specifies
how many pages it holds, then the virtual addresses of all the dumped
pages in that chunk, followed by the actual contents of all dumped
pages. A header with zero number of pages marks the end of the contents.
Then comes the next VMA and so on.
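
The chunk layout described above can be illustrated with a small userspace
encoder and reader. This is a sketch under assumptions, not the image format
of this patch: struct chunk_hdr with a single nr_pages field, 64-bit virtual
addresses, the helper names and the tiny 16-byte "page" are all hypothetical
and chosen only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny "page" so the example stays small */

/* Hypothetical chunk header: only the number of pages in the chunk. */
struct chunk_hdr {
	uint32_t nr_pages;
};

/* Append len bytes at *off; returns 0, or -1 if the buffer is too small. */
static int put(uint8_t *buf, size_t cap, size_t *off,
	       const void *src, size_t len)
{
	if (len > cap - *off)
		return -1;
	memcpy(buf + *off, src, len);
	*off += len;
	return 0;
}

/* One chunk: header, then all vaddrs, then the contents of all pages. */
static int write_chunk(uint8_t *buf, size_t cap, size_t *off,
		       const uint64_t *vaddrs, const uint8_t *pages,
		       uint32_t nr)
{
	struct chunk_hdr h = { .nr_pages = nr };

	assert(nr > 0);
	if (put(buf, cap, off, &h, sizeof(h)) < 0)
		return -1;
	if (put(buf, cap, off, vaddrs, nr * sizeof(*vaddrs)) < 0)
		return -1;
	return put(buf, cap, off, pages, (size_t)nr * DEMO_PAGE_SIZE);
}

/* A header with zero pages marks the end of the contents of a VMA. */
static int write_end(uint8_t *buf, size_t cap, size_t *off)
{
	struct chunk_hdr h = { .nr_pages = 0 };

	return put(buf, cap, off, &h, sizeof(h));
}

/* Walk the chunks; return the total page count, or -1 if truncated. */
static long count_pages(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		struct chunk_hdr h;
		size_t body;

		if (len - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);
		if (h.nr_pages == 0)
			return total;	/* end marker reached */
		body = (size_t)h.nr_pages * (sizeof(uint64_t) + DEMO_PAGE_SIZE);
		if (len - off < body)
			return -1;
		off += body;
		total += h.nr_pages;
	}
}
```

Keeping the addresses of a chunk together, ahead of the page contents, lets a
reader fetch all vaddrs of the chunk in one read before it touches any page
data.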

To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.
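
The dispatch described above can be sketched as follows. All demo_* names
are hypothetical stand-ins for the kernel structures, and returning -ENOSYS
for a vma without a checkpoint method is an assumption of this sketch rather
than something this patch description states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical context: counts what has been "written" so far. */
struct demo_ctx {
	int descs_written;
	int pages_written;
};

struct demo_vma;

/* Hypothetical per-vma operations table with a checkpoint method. */
struct demo_vm_ops {
	int (*checkpoint)(struct demo_ctx *ctx, struct demo_vma *vma);
};

struct demo_vma {
	const struct demo_vm_ops *ops;
	int nr_dirty;		/* pages this vma would dump */
};

/* Stand-in for the generic helper: write the vma description first. */
static int demo_generic_vma_checkpoint(struct demo_ctx *ctx,
				       struct demo_vma *vma)
{
	(void)vma;
	ctx->descs_written++;
	return 0;
}

/* A per-vma method: generic description, then type-specific contents. */
static int demo_private_checkpoint(struct demo_ctx *ctx,
				   struct demo_vma *vma)
{
	int ret = demo_generic_vma_checkpoint(ctx, vma);

	if (ret < 0)
		return ret;
	ctx->pages_written += vma->nr_dirty;
	return 0;
}

static const struct demo_vm_ops demo_private_ops = {
	.checkpoint = demo_private_checkpoint,
};

/* Checkpoint one vma by calling its own method. */
static int demo_checkpoint_vma(struct demo_ctx *ctx, struct demo_vma *vma)
{
	assert(ctx != NULL && vma != NULL);
	if (!vma->ops || !vma->ops->checkpoint)
		return -ENOSYS;	/* assumption: unsupported vma type */
	return vma->ops->checkpoint(ctx, vma);
}
```

Putting the method in the operations table means each mapping type decides
how to dump its own contents, while the description of the vma is still
written by one shared helper.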

Currently, for privately mapped file-backed memory, we save the pathname
of the mapped file (restart will use it to re-open the file and then map
it). A later patch changes this to reference a file object instead.

Changelog[v14]:
  - Modify the ops->checkpoint method to be much more powerful
  - Improve support for VDSO (with special_mapping checkpoint callback)
  - Save new field 'vdso' in mm_context
  - Revert change to pr_debug(), back to ckpt_debug()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'

Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc) (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>)

Changelog[v12]:
  - Hide pgarr management inside ckpt_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in ckpt_fill_name()

Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now ckpt_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for ckpt_pgarr

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/checkpoint_hdr.h |    7 +
 arch/x86/mm/checkpoint.c              |   32 ++
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint.c               |   24 ++
 checkpoint/checkpoint_arch.h          |    1 +
 checkpoint/files.c                    |   88 +++++
 checkpoint/memory.c                   |  600 +++++++++++++++++++++++++++++++++
 checkpoint/process.c                  |    4 +
 checkpoint/sys.c                      |    9 +
 include/linux/checkpoint.h            |   25 ++-
 include/linux/checkpoint_hdr.h        |   39 +++
 include/linux/checkpoint_types.h      |   10 +
 mm/filemap.c                          |   30 ++
 mm/mmap.c                             |   30 ++
 15 files changed, 900 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c
 create mode 100644 checkpoint/memory.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8dfe0c0..3245e9d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -79,6 +79,7 @@ config HAVE_LATENCYTOP_SUPPORT
 
 config CHECKPOINT_SUPPORT
 	bool
+	depends on COMPAT_VDSO
 	default y if X86_32
 
 config FAST_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 035abbb..bad7b29 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -97,4 +97,11 @@ struct ckpt_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	__u64 vdso;
+	__u32 ldt_entry_size;
+	__u32 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 86ca916..ede7045 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -14,6 +14,7 @@
 #include <asm/desc.h>
 #include <asm/i387.h>
 
+#include <linux/checkpoint_types.h>
 #include <asm/checkpoint_hdr.h>
 #include <linux/checkpoint.h>
 
@@ -240,6 +241,37 @@ int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	mutex_lock(&mm->context.lock);
+
+	h->vdso = (unsigned long) mm->context.vdso;
+	h->ldt_entry_size = LDT_ENTRY_SIZE;
+	h->nldt = mm->context.size;
+
+	ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5d2c083..a33ab77 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
-	process.o
+	process.o memory.o files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 62ba0a6..9abdf73 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -15,6 +15,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -160,10 +161,33 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int ckpt_ctx_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = ckpt_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index 2ee4d7f..d168b9c 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,7 @@
 extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
 extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..1718526
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,88 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+static char *fill_fname(struct path *path, struct path *root,
+			char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * dump_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+static int dump_fname(struct ckpt_ctx *ctx,
+		      struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname))
+		ret = ckpt_write_obj_type(ctx, fname, flen, CKPT_HDR_FNAME);
+	else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
+int checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	return dump_fname(ctx, &file->f_path, &ctx->fs_mnt);
+}
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
new file mode 100644
index 0000000..668d883
--- /dev/null
+++ b/checkpoint/memory.c
@@ -0,0 +1,600 @@
+/*
+ *  Checkpoint/restart memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DMEM
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+
+/*
+ * page-array chains: each ckpt_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct ckpt_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CKPT_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CKPT_PGARR_CHUNK  (4 * CKPT_PGARR_TOTAL)
+
+static inline int pgarr_is_full(struct ckpt_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CKPT_PGARR_TOTAL);
+}
+
+static inline int pgarr_nr_free(struct ckpt_pgarr *pgarr)
+{
+	return CKPT_PGARR_TOTAL - pgarr->nr_used;
+}
+
+/*
+ * utilities to alloc, free, and handle 'struct ckpt_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr_list: list head of populated page-array chain
+ *   ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (by dereferencing
+ * all pages) but not freed; instead, empty descriptors are kept in the pool.
+ *
+ * The head of the chain page-array ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct ckpt_pgarr *pgarr_first(struct ckpt_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct ckpt_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	if (list_empty(&ctx->pgarr_pool))
+		return NULL;
+	pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
+	list_del(&pgarr->list);
+	return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void pgarr_release_pages(struct ckpt_pgarr *pgarr)
+{
+	ckpt_debug("total pages %d\n", pgarr->nr_used);
+	/*
+	 * both checkpoint and restart use 'nr_used', however we only
+	 * collect pages during checkpoint; in restart we simply return
+	 * because pgarr->pages remains NULL.
+	 */
+	if (pgarr->pages) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nr_used;
+
+		while (nr--)
+			page_cache_release(pages[nr]);
+	}
+
+	pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void pgarr_free_one(struct ckpt_pgarr *pgarr)
+{
+	pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void ckpt_pgarr_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct ckpt_pgarr *pgarr_alloc_one(unsigned long flags)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+	pgarr->vaddrs = kmalloc(CKPT_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CKPT_CTX_CHECKPOINT) {
+		pgarr->pages = kmalloc(CKPT_PGARR_TOTAL *
+				       sizeof(struct page *), GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+ nomem:
+	pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the chain if it has space. Otherwise,
+ * takes an empty page-array from the pool (or allocates a new one) and
+ * adds it to the front of the chain. Returns NULL if allocation fails.
+ */
+static struct ckpt_pgarr *pgarr_current(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = pgarr_first(ctx);
+	if (pgarr && !pgarr_is_full(pgarr))
+		return pgarr;
+
+	pgarr = pgarr_from_pool(ctx);
+	if (!pgarr)
+		pgarr = pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		return NULL;
+
+	list_add(&pgarr->list, &ctx->pgarr_list);
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+static void pgarr_reset_all(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+		pgarr_release_pages(pgarr);
+	list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/*
+ * Checkpoint
+ *
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * consider_private_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_private_page(struct vm_area_struct *vma,
+					  unsigned long addr)
+{
+	struct page *page;
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * Only care about dirty pages: either anonymous non-zero pages,
+	 * or file-backed COW (copy-on-write) pages that were modified.
+	 * A clean COW page is not interesting because its contents are
+	 * identical to the backing file; ignore such pages.
+	 * A file-backed broken COW is identified by its page_mapping()
+	 * being unset (NULL) because the page will no longer be mapped
+	 * to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int private_vma_fill_pgarr(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct ckpt_pgarr *pgarr;
+	int nr_used;
+	int cnt = 0;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	do {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+
+		nr_used = pgarr->nr_used;
+
+		while (addr < end) {
+			struct page *page;
+
+			page = consider_private_page(vma, addr);
+			if (IS_ERR(page))
+				return PTR_ERR(page);
+
+			if (page) {
+				_ckpt_debug(CKPT_DPAGE,
+					    "got page %#lx\n", addr);
+				pgarr->pages[pgarr->nr_used] = page;
+				pgarr->vaddrs[pgarr->nr_used] = addr;
+				pgarr->nr_used++;
+			}
+
+			addr += PAGE_SIZE;
+
+			if (pgarr_is_full(pgarr))
+				break;
+		}
+
+		cnt += pgarr->nr_used - nr_used;
+
+	} while ((cnt < CKPT_PGARR_CHUNK) && (addr < end));
+
+	*start = addr;
+	return cnt;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int checkpoint_dump_page(struct ckpt_ctx *ctx,
+				struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return ckpt_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
+{
+	struct ckpt_pgarr *pgarr;
+	void *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = ckpt_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = (void *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = checkpoint_dump_page(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	free_page((unsigned long) buf);
+	return ret;
+}
+
+/**
+ * checkpoint_private_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int checkpoint_private_contents(struct ckpt_ctx *ctx,
+				       struct vm_area_struct *vma)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long addr = vma->vm_start;
+	int cnt, ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CKPT_PGARR_CHUNK
+	 * in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	while (addr < vma->vm_end) {
+		cnt = private_vma_fill_pgarr(ctx, vma, &addr);
+		if (cnt == 0)
+			break;
+		else if (cnt < 0)
+			return cnt;
+
+		ckpt_debug("collected %d pages\n", cnt);
+
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (!h)
+			return -ENOMEM;
+
+		h->nr_pages = cnt;
+		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+
+		ret = vma_dump_pages(ctx, cnt);
+		if (ret < 0)
+			return ret;
+
+		pgarr_reset_all(ctx);
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+	if (!h)
+		return -ENOMEM;
+	h->nr_pages = 0;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * generic_vma_checkpoint - dump metadata of vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ */
+int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+			   struct vm_area_struct *vma, enum vma_type type)
+{
+	struct ckpt_hdr_vma *h;
+	int ret;
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d\n",
+		 vma->vm_start, vma->vm_end, vma->vm_flags, type);
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (!h)
+		return -ENOMEM;
+
+	h->vma_type = type;
+	h->vm_start = vma->vm_start;
+	h->vm_end = vma->vm_end;
+	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	h->vm_flags = vma->vm_flags;
+	h->vm_pgoff = vma->vm_pgoff;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * private_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ */
+int private_vma_checkpoint(struct ckpt_ctx *ctx,
+			   struct vm_area_struct *vma,
+			   enum vma_type type)
+{
+	int ret;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	ret = generic_vma_checkpoint(ctx, vma, type);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_private_contents(ctx, vma);
+ out:
+	return ret;
+}
+
+/**
+ * anonymous_checkpoint - dump contents of anonymous vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ */
+static int anonymous_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma)
+{
+	/* should be private anonymous ... verify that this is the case */
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(vma->vm_file);
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON);
+}
+
+int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_mm *h;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (!h)
+		return -ENOMEM;
+
+	mm = get_task_mm(t);
+
+	down_read(&mm->mmap_sem);
+
+	h->start_code = mm->start_code;
+	h->end_code = mm->end_code;
+	h->start_data = mm->start_data;
+	h->end_data = mm->end_data;
+	h->start_brk = mm->start_brk;
+	h->brk = mm->brk;
+	h->start_stack = mm->start_stack;
+	h->arg_start = mm->arg_start;
+	h->arg_end = mm->arg_end;
+	h->env_start = mm->env_start;
+	h->env_end = mm->env_end;
+
+	h->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ckpt_debug("vma %#lx-%#lx flags %#lx\n",
+			 vma->vm_start, vma->vm_end, vma->vm_flags);
+		if (!vma->vm_ops)
+			ret = anonymous_checkpoint(ctx, vma);
+		else if (vma->vm_ops->checkpoint)
+			ret = (*vma->vm_ops->checkpoint)(ctx, vma);
+		else
+			ret = -ENOSYS;
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = checkpoint_mm_context(ctx, mm);
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 0578182..64deb76 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -57,6 +57,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ckpt_debug("ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = checkpoint_mm(ctx, t);
+	ckpt_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_thread(ctx, t);
 	ckpt_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index a99cd51..5ebbac9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -197,7 +197,13 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	ckpt_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -212,6 +218,9 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+	INIT_LIST_HEAD(&ctx->pgarr_pool);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 1433290..108e6a1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -41,13 +41,36 @@ extern int do_restart(struct ckpt_ctx *ctx, pid_t pid);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
+/* memory */
+extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
+
+extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type);
+extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type);
+
+extern int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+
+#define CKPT_VMA_NOT_SUPPORTED					\
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
+	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE	\
+	 | VM_MAPPED_COPY |					\
+	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+
+/* files */
+extern int checkpoint_file(struct ckpt_ctx *ctx, struct file *file);
+
 
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
+#define CKPT_DMEM	0x8		/* memory state */
+#define CKPT_DPAGE	0x10		/* memory pages */
 
-#define CKPT_DDEFAULT	0x7		/* default debug level */
+#define CKPT_DDEFAULT	0xf		/* default debug level */
 
 #ifndef CKPT_DFLAG
 #define CKPT_DFLAG	0x0		/* nothing */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 9716f4b..dab6b7f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -44,6 +44,7 @@ enum {
 	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
+	CKPT_HDR_FNAME,
 
 	CKPT_HDR_TASK = 101,
 	CKPT_HDR_THREAD,
@@ -51,6 +52,7 @@ enum {
 
 	CKPT_HDR_MM = 201,
 	CKPT_HDR_VMA,
+	CKPT_HDR_PGARR,
 	CKPT_HDR_MM_CONTEXT,
 
 	CKPT_HDR_TAIL = 5001
@@ -101,4 +103,41 @@ struct ckpt_hdr_task {
 	__u32 task_comm_len;
 } __attribute__((aligned(8)));
 
+/* memory layout */
+struct ckpt_hdr_mm {
+	struct ckpt_hdr h;
+	__u32 map_count;
+	__u32 _padding;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vma_type {
+	CKPT_VMA_VDSO = 1,	/* special vdso vma */
+	CKPT_VMA_ANON,		/* private anonymous */
+	CKPT_VMA_FILE,		/* private mapped file */
+};
+
+/* vma descriptor */
+struct ckpt_hdr_vma {
+	struct ckpt_hdr h;
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+/* page array */
+struct ckpt_hdr_pgarr {
+	struct ckpt_hdr h;
+	__u64 nr_pages;		/* number of pages to save */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index b04090f..84b4ef4 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -10,6 +10,10 @@
  *  distribution for more details.
  */
 
+#include <linux/list.h>
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CKPT_VERSION  1
 
 struct ckpt_ctx {
@@ -25,8 +29,14 @@ struct ckpt_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
+
 /* ckpt_ctx: flags */
 #define CKPT_CTX_CHECKPOINT	0x1
 #define CKPT_CTX_RESTART	0x2
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..2b58027 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,10 @@
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/checkpoint.h>
+
 /*
  * FIXME: remove all knowledge of the buffer layer from the core VM
  */
@@ -1625,8 +1629,34 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_CHECKPOINT
+static int filemap_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma)
+{
+	int ret;
+
+	/* should be private file-mapped ... verify that this is the case */
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!vma->vm_file);
+
+	ret = checkpoint_file(ctx, vma->vm_file);
+	if (ret < 0)
+		goto out;
+	ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE);
+ out:
+	return ret;
+}
+#else
+#define filemap_checkpoint NULL
+#endif /* CONFIG_CHECKPOINT */
+
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.checkpoint	= filemap_checkpoint,
 };
 
 /* This is used for a general mmap of a disk file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 3303d1b..6b75359 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -34,6 +34,10 @@
 #include <asm/tlb.h>
 #include <asm/mmu_context.h>
 
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/checkpoint.h>
+
 #include "internal.h"
 
 #ifndef arch_mmap_check
@@ -2268,9 +2272,35 @@ static void special_mapping_close(struct vm_area_struct *vma)
 {
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	const char *name;
+
+	/*
+	 * Currently, the only special mappings we handle are VDSO/vsyscall.
+	 * Even that is very basic - we just skip the contents and
+	 * hope for the best in terms of compatibility upon restart.
+	 */
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	name = arch_vma_name(vma);
+	if (!name || strcmp(name, "[vdso]"))
+		return -ENOSYS;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO);
+}
+#else
+#define special_mapping_checkpoint NULL
+#endif /* CONFIG_CHECKPOINT */
+
 static struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
+	.checkpoint = special_mapping_checkpoint,
 };
 
 /*
-- 
1.5.4.3


* [RFC v14][PATCH 09/54] Restore memory address space
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 08/54] Dump memory address space Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
       [not found]     ` <1240961064-13991-10-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 10/54] Infrastructure for shared objects Oren Laadan
                     ` (46 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.
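
For illustration, the translation from saved vm_flags to mmap() arguments
that precedes the do_mmap_pgoff() call can be sketched in userspace roughly
as below. This is only a sketch under assumptions: the SK_* constants and
sk_* helpers are hypothetical stand-ins (not the kernel's VM_* definitions),
and only the read/write/exec and shared/private bits are covered.

```c
#include <assert.h>
#include <sys/mman.h>

/* Hypothetical stand-ins for the vm_flags bits saved in the image. */
#define SK_VM_READ	0x1UL
#define SK_VM_WRITE	0x2UL
#define SK_VM_EXEC	0x4UL
#define SK_VM_MAYSHARE	0x80UL

/* Convert saved vm_flags to an mmap() protection mask. */
static int sk_prot_bits(unsigned long vm_flags)
{
	int prot = PROT_NONE;

	if (vm_flags & SK_VM_READ)
		prot |= PROT_READ;
	if (vm_flags & SK_VM_WRITE)
		prot |= PROT_WRITE;
	if (vm_flags & SK_VM_EXEC)
		prot |= PROT_EXEC;
	return prot;
}

/* Convert saved vm_flags to mmap() flags; the address is always fixed. */
static int sk_map_flags(unsigned long vm_flags)
{
	int flags = MAP_FIXED;

	if (vm_flags & SK_VM_MAYSHARE)
		flags |= MAP_SHARED;
	else
		flags |= MAP_PRIVATE;
	return flags;
}
```

MAP_FIXED is always set in the sketch because a restored VMA has to land at
exactly the address recorded in the image.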

Currently, to restore private file-mapped memory, we use the saved
pathname to open a new file and pass it to do_mmap_pgoff(). A later
patch changes this to reference a file object.

Changelog[v14]:
  - Introduce per vma-type restore() function
  - Merge restart code into same file as checkpoint (memory.c)
  - Compare saved 'vdso' field of mm_context with current value
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
  - Revert change to pr_debug(), back to ckpt_debug()

Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vma() (or it
    may crash if restart fails after having removed all vma's)

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()

Changelog[v4]:
  - Use standard list_... for ckpt_pgarr


Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/checkpoint.c              |   59 +++++
 checkpoint/checkpoint_arch.h          |    1 +
 checkpoint/files.c                    |   33 +++
 checkpoint/memory.c                   |  407 +++++++++++++++++++++++++++++++++
 checkpoint/process.c                  |    4 +
 checkpoint/restart.c                  |    9 +
 include/linux/checkpoint.h            |    5 +
 include/linux/checkpoint_hdr.h        |    6 +-
 include/linux/mm.h                    |    9 +
 mm/filemap.c                          |   18 ++
 mm/mmap.c                             |   30 ++-
 12 files changed, 580 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index bad7b29..d61653c 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -104,4 +104,9 @@ struct ckpt_hdr_mm_context {
 	__u32 nldt;
 } __attribute__((aligned(8)));
 
+#ifdef __KERNEL__
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+#endif
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index ede7045..a475a30 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -13,6 +13,7 @@
 
 #include <asm/desc.h>
 #include <asm/i387.h>
+#include <asm/elf.h>
 
 #include <linux/checkpoint_types.h>
 #include <asm/checkpoint_hdr.h>
@@ -475,3 +476,61 @@ int restore_read_header_arch(struct ckpt_ctx *ctx)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	unsigned int n;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("nldt %d vdso %#lx (%p)\n",
+		 h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+	ret = -EINVAL;
+	if (h->vdso != (unsigned long) mm->context.vdso)
+		goto out;
+	if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+	ret = 0;
+	for (n = 0; n < h->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			break;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16) | (desc.base2 << 24);
+		info.limit = desc.limit0 | (desc.limit << 16);
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index d168b9c..4b9b6bf 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -8,3 +8,4 @@ extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
 extern int restore_cpu(struct ckpt_ctx *ctx);
+extern int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1718526..a7cf6c3 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -86,3 +86,36 @@ int checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
 {
 	return dump_fname(ctx, &file->f_path, &ctx->fs_mnt);
 }
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+static struct file *read_open_fname(struct ckpt_ctx *ctx, int flags, int mode)
+{
+	struct ckpt_hdr *h;
+	struct file *file;
+	char *fname;
+
+	h = ckpt_read_buf_type(ctx, PATH_MAX, CKPT_HDR_FNAME);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	fname = (char *) (h + 1);
+	ckpt_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+
+	file = filp_open(fname, flags, mode);
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+struct file *restore_file(struct ckpt_ctx *ctx)
+{
+	/* currently only called for mapped files; O_RDONLY works */
+	return read_open_fname(ctx, O_RDONLY, 0);
+}
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 668d883..c725519 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -15,6 +15,9 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/file.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/checkpoint.h>
@@ -598,3 +601,407 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 	mmput(mm);
 	return ret;
 }
+
+/*
+ * Restart
+ *
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int read_pages_vaddrs(struct ckpt_ctx *ctx, unsigned long nr_pages)
+{
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = ckpt_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
+{
+	void *ptr;
+	int ret;
+
+	ret = ckpt_kread(ctx, p, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, p, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int read_pages_contents(struct ckpt_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = restore_read_page(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * restore_private_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int restore_private_contents(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long nr_pages;
+	int ret = 0;
+
+	while (1) {
+		h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (IS_ERR(h))
+			return PTR_ERR(h);
+
+		ckpt_debug("total pages %ld\n", (unsigned long) h->nr_pages);
+
+		nr_pages = h->nr_pages;
+		ckpt_hdr_put(ctx, h);
+
+		if (!nr_pages)
+			break;
+
+		ret = read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+/**
+ * generic_vma_restore - restore a vma
+ * @mm - address space
+ * @file - file to map (NULL for anonymous)
+ * @h - vma header data
+ */
+static unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h)
+{
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+
+	if (h->vm_end < h->vm_start)
+		return -EINVAL;
+	if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	vm_start = h->vm_start;
+	vm_pgoff = h->vm_pgoff;
+	vm_size = h->vm_end - h->vm_start;
+	vm_prot = calc_map_prot_bits(h->vm_flags);
+	vm_flags = calc_map_flags_bits(h->vm_flags);
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	ckpt_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	return addr;
+}
+
+/**
+ * private_vma_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @file: file to use for mapping
+ * @h - vma header data
+ */
+int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			struct file *file, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+
+	if (h->vm_flags & VM_SHARED)
+		return -EINVAL;
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	return restore_private_contents(ctx);
+}
+
+/**
+ * anon_private_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @h - vma header data
+ */
+static int anon_private_restore(struct ckpt_ctx *ctx,
+				     struct mm_struct *mm,
+				     struct ckpt_hdr_vma *h)
+{
+	/*
+	 * vm_pgoff for anonymous mapping is the "global" page
+	 * offset (namely from addr 0x0), so we force a zero
+	 */
+	h->vm_pgoff = 0;
+
+	return private_vma_restore(ctx, mm, NULL, h);
+}
+
+/* callbacks to restore vma per its type: */
+struct restore_vma_ops {
+	char *vma_name;
+	enum vma_type vma_type;
+	int (*restore) (struct ckpt_ctx *ctx,
+			struct mm_struct *mm,
+			struct ckpt_hdr_vma *ptr);
+};
+
+static struct restore_vma_ops restore_vma_ops[] = {
+	/* ignored vma */
+	{
+		.vma_name = "IGNORE",
+		.vma_type = CKPT_VMA_IGNORE,
+		.restore = NULL,
+	},
+	/* special mapping (vdso) */
+	{
+		.vma_name = "VDSO",
+		.vma_type = CKPT_VMA_VDSO,
+		.restore = special_mapping_restore,
+	},
+	/* anonymous private */
+	{
+		.vma_name = "ANON PRIVATE",
+		.vma_type = CKPT_VMA_ANON,
+		.restore = anon_private_restore,
+	},
+	/* file-mapped private */
+	{
+		.vma_name = "FILE PRIVATE",
+		.vma_type = CKPT_VMA_FILE,
+		.restore = filemap_restore,
+	},
+};
+
+/**
+ * restore_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ */
+static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_vma *h;
+	struct restore_vma_ops *ops;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("vma %#lx-%#lx type %d\n", (unsigned long) h->vm_start,
+		 (unsigned long) h->vm_end, (int) h->vma_type);
+
+	ret = -EINVAL;
+	if (h->vm_end < h->vm_start)
+		goto out;
+	if (h->vma_type >= CKPT_VMA_MAX)
+		goto out;
+
+	ops = &restore_vma_ops[h->vma_type];
+
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->vma_type != h->vma_type);
+
+	if (ops->restore) {
+		ckpt_debug("vma type %s\n", ops->vma_name);
+		ret = ops->restore(ctx, mm, h);
+	} else {
+		ckpt_debug("vma ignored\n");
+		ret = 0;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_warning("c/r: failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int restore_mm(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_mm *h;
+	struct mm_struct *mm;
+	unsigned int nr;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("map_count %d\n", h->map_count);
+
+	/* XXX need more sanity checks */
+
+	ret = -EINVAL;
+	if ((h->start_code > h->end_code) ||
+	    (h->start_data > h->end_data))
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+	mm->start_code = h->start_code;
+	mm->end_code = h->end_code;
+	mm->start_data = h->start_data;
+	mm->end_data = h->end_data;
+	mm->start_brk = h->start_brk;
+	mm->brk = h->brk;
+	mm->start_stack = h->start_stack;
+	mm->arg_start = h->arg_start;
+	mm->arg_end = h->arg_end;
+	mm->env_start = h->env_start;
+	mm->env_end = h->env_end;
+	up_write(&mm->mmap_sem);
+
+	/* FIX: need also mm->flags */
+
+	for (nr = h->map_count; nr; nr--) {
+		ret = restore_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 64deb76..7adb842 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -108,6 +108,10 @@ int restore_task(struct ckpt_ctx *ctx)
 	ckpt_debug("ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = restore_mm(ctx);
+	ckpt_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = restore_thread(ctx);
 	ckpt_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 9adcc90..a1ab0a1 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -287,10 +287,19 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* setup restart-specific parts of ctx */
+static int ckpt_ctx_restart(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
 int do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = ckpt_ctx_restart(ctx);
+	if (ret < 0)
+		return ret;
 	ret = restore_read_header(ctx);
 	if (ret < 0)
 		return ret;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 108e6a1..73b34af 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -51,7 +51,11 @@ extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type);
 
+extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			       struct file *file, struct ckpt_hdr_vma *h);
+
 extern int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_mm(struct ckpt_ctx *ctx);
 
 #define CKPT_VMA_NOT_SUPPORTED					\
 	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
@@ -61,6 +65,7 @@ extern int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 
 /* files */
 extern int checkpoint_file(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *restore_file(struct ckpt_ctx *ctx);
 
 
 /* debugging flags */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index dab6b7f..5266e4b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -114,11 +114,13 @@ struct ckpt_hdr_mm {
 	__u64 arg_start, arg_end, env_start, env_end;
 } __attribute__((aligned(8)));
 
-/* vma subtypes */
+/* vma subtypes - index into restore_vma_ops[] */
 enum vma_type {
-	CKPT_VMA_VDSO = 1,	/* special vdso vma */
+	CKPT_VMA_IGNORE = 0,
+	CKPT_VMA_VDSO,		/* special vdso vma */
 	CKPT_VMA_ANON,		/* private anonymous */
 	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_MAX,
 };
 
 /* vma decsriptor */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05f0ed9..585d398 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1182,6 +1182,15 @@ extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
 
+
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			   struct ckpt_hdr_vma *hh);
+extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+				   struct ckpt_hdr_vma *hh);
+#endif
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b58027..ef5680b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1650,6 +1650,24 @@ static int filemap_checkpoint(struct ckpt_ctx *ctx,
  out:
 	return ret;
 }
+
+int filemap_restore(struct ckpt_ctx *ctx,
+		    struct mm_struct *mm,
+		    struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int ret;
+
+	/* for private mapping using 'read-only' is sufficient */
+	file = restore_file(ctx);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = private_vma_restore(ctx, mm, file, h);
+
+	fput(file);
+	return ret;
+}
 #else
 #define filemap_checkpoint NULL
 #endif /* CONFIG_CHECKPOINT */
diff --git a/mm/mmap.c b/mm/mmap.c
index 6b75359..3b6356c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2114,7 +2114,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb = tlb_gather_mmu(mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
@@ -2272,13 +2272,22 @@ static void special_mapping_close(struct vm_area_struct *vma)
 {
 }
 
-#if CONFIG_CHEKCPOINT
+#ifdef CONFIG_CHECKPOINT
+/*
+ * FIX:
+ *   - checkpoint vdso pages (once per distinct vdso is enough)
+ *   - check for compatibility between saved and current vdso
+ *   - accommodate for dynamic kernel data in vdso page
+ *
+ * Currently, we require COMPAT_VDSO which somewhat mitigates the issue
+ */
 static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 				      struct vm_area_struct *vma)
 {
-	char *name;
+	const char *name;
 
 	/*
+	 * FIX:
 	 * Currently, we only handle VDSO/vsyscall special handling.
 	 * Even that, is very basic - we just skip the contents and
 	 * hope for the best in terms of compatilibity upon restart.
@@ -2288,11 +2297,24 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 		return -ENOSYS;
 
 	name = arch_vma_name(vma);
-	if (!name || strcmp(vma_name, "[vdso]"))
+	if (!name || strcmp(name, "[vdso]"))
 		return -ENOSYS;
 
 	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO);
 }
+
+int special_mapping_restore(struct ckpt_ctx *ctx,
+			    struct mm_struct *mm,
+			    struct ckpt_hdr_vma *h)
+{
+	/*
+	 * FIX:
+	 * Currently, we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic - call arch_setup_additional_pages
+	 * requiring the same mapping (start address) as before.
+	 */
+	return arch_setup_additional_pages(NULL, h->vm_start, 0);
+}
 #else
 #define special_mapping_checkpoint NULL
 #endif /* CONFIG_CHECKPOINT */
-- 
1.5.4.3

* [RFC v14][PATCH 10/54] Infrastructure for shared objects
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 09/54] Restore " Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
       [not found]     ` <1240961064-13991-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:23   ` [RFC v14][PATCH 11/54] Introduce 'checkpoint' method in 'struct file_operations' Oren Laadan
                     ` (45 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
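
The two lookup directions described above can be sketched in user space
as follows. This is only an illustration under stated assumptions, not
the code from this patch: the sketch_* names, the tiny table size and
the toy hash function are hypothetical (the patch itself uses hlists
and hash_long()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NBITS   4
#define NBUCKET (1u << NBITS)

struct obj {
	int objref;		/* identifier written to the image */
	void *ptr;		/* address of the shared object */
	struct obj *next;	/* bucket chain */
};

struct objhash {
	struct obj *bucket[NBUCKET];
	int next_free_objref;	/* starts at 1, as in ckpt_obj_hash_alloc() */
};

/* toy hash (assumption); the patch uses hash_long() */
static unsigned int bucket_of(uintptr_t key)
{
	return (unsigned int)((key * 2654435761u) >> 8) & (NBUCKET - 1);
}

/* checkpoint: keyed by pointer; *first is set on the first encounter */
static int sketch_checkpoint_lookup_add(struct objhash *h, void *ptr,
					int *first)
{
	unsigned int i = bucket_of((uintptr_t)ptr);
	struct obj *o;

	for (o = h->bucket[i]; o; o = o->next) {
		if (o->ptr == ptr) {
			*first = 0;	/* seen before: only objref is saved */
			return o->objref;
		}
	}
	o = malloc(sizeof(*o));
	if (!o)
		return -1;
	o->ptr = ptr;
	o->objref = h->next_free_objref++;
	o->next = h->bucket[i];
	h->bucket[i] = o;
	*first = 1;			/* caller dumps the state now */
	return o->objref;
}

/* restart: keyed by objref; NULL means "not restored yet" */
static void *sketch_restart_fetch(struct objhash *h, int objref)
{
	struct obj *o;

	for (o = h->bucket[bucket_of((uintptr_t)objref)]; o; o = o->next)
		if (o->objref == objref)
			return o->ptr;
	return NULL;
}

/* restart: register a freshly restored object under its saved objref */
static int sketch_restart_insert(struct objhash *h, void *ptr, int objref)
{
	unsigned int i = bucket_of((uintptr_t)objref);
	struct obj *o;

	assert(objref > 0);	/* objref 0 is never handed out */
	o = malloc(sizeof(*o));
	if (!o)
		return -1;
	o->ptr = ptr;
	o->objref = objref;
	o->next = h->bucket[i];
	h->bucket[i] = o;
	return 0;
}
```

In the sketch the same table layout serves both phases; only the key
differs (pointer at checkpoint, objref at restart).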

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
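
The lifetime rule can be illustrated with a small user-space sketch.
It assumes a hypothetical refcounted 'struct shared' and sketch_*
helpers that are not part of this patch; only the ref_grab/ref_drop
callback pair mirrors 'struct ckpt_obj_ops'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* hypothetical refcounted object standing in for a file, mm, etc. */
struct shared {
	int refcount;
};

struct sketch_ops {
	int  (*ref_grab)(void *ptr);
	void (*ref_drop)(void *ptr);
};

static int shared_grab(void *ptr)
{
	((struct shared *)ptr)->refcount++;
	return 0;
}

static void shared_drop(void *ptr)
{
	((struct shared *)ptr)->refcount--;
}

static const struct sketch_ops shared_ops = {
	.ref_grab = shared_grab,
	.ref_drop = shared_drop,
};

struct entry {
	void *ptr;
	const struct sketch_ops *ops;
	struct entry *next;
};

struct table {
	struct entry *head;
};

/* adding an object takes a reference that the table owns */
static int sketch_table_add(struct table *t, void *ptr,
			    const struct sketch_ops *ops)
{
	struct entry *e;
	int ret;

	assert(ops->ref_grab && ops->ref_drop);
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	ret = ops->ref_grab(ptr);
	if (ret < 0) {
		free(e);
		return ret;
	}
	e->ptr = ptr;
	e->ops = ops;
	e->next = t->head;
	t->head = e;
	return 0;
}

/* entries are never removed one by one; all references drop at the end */
static void sketch_table_clear(struct table *t)
{
	struct entry *e = t->head, *next;

	while (e) {
		next = e->next;
		e->ops->ref_drop(e->ptr);
		free(e);
		e = next;
	}
	t->head = NULL;
}
```

Because the table keeps its own reference, the object in the sketch
stays valid even if its original owner drops its reference before the
table is cleared.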

Changelog[v14]:
  - Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
  - Replace long 'switch' statements with table lookups and callbacks.
  - Introduce checkpoint_obj() and restart_obj() helpers
  - Shared objects now dumped/saved right before they are referenced
  - Cleanup interface of shared objects

Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
    (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>)

Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime

Changelog[v4]:
  - Fix calculation of hash table size

Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/Makefile              |   10 +-
 checkpoint/objhash.c             |  372 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c             |   46 +++++
 checkpoint/sys.c                 |    5 +-
 include/linux/checkpoint.h       |   16 ++
 include/linux/checkpoint_hdr.h   |   14 ++
 include/linux/checkpoint_types.h |    2 +
 7 files changed, 462 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index a33ab77..2026607 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,11 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o \
-	process.o memory.o files.o
+obj-$(CONFIG_CHECKPOINT) += \
+	sys.o \
+	objhash.o \
+	checkpoint.o \
+	restart.o \
+	process.o \
+	memory.o \
+	files.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..076a3a3
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,372 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DOBJ
+
+#include <linux/kernel.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct ckpt_obj;
+struct ckpt_obj_ops;
+
+/* object operations */
+struct ckpt_obj_ops {
+	char *obj_name;
+	enum obj_type obj_type;
+	void (*ref_drop)(void *ptr);
+	int (*ref_grab)(void *ptr);
+	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
+	void *(*restore)(struct ckpt_ctx *ctx);
+};
+
+struct ckpt_obj {
+	int objref;
+	void *ptr;
+	struct ckpt_obj_ops *ops;
+	struct hlist_node hash;
+};
+
+struct ckpt_obj_hash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+/*
+ * helper grab/drop functions:
+ *   obj_no_{drop,grab}: for objects ignored/skipped
+ */
+
+static void obj_no_drop(void *ptr)
+{
+	return;
+}
+
+static int obj_no_grab(void *ptr)
+{
+	return 0;
+}
+
+static struct ckpt_obj_ops ckpt_obj_ops[] = {
+	/* ignored object */
+	{
+		.obj_name = "IGNORED",
+		.obj_type = CKPT_OBJ_IGNORE,
+		.ref_drop = obj_no_drop,
+		.ref_grab = obj_no_grab,
+	},
+};
+
+
+#define CKPT_OBJ_HASH_NBITS  10
+#define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
+
+static void obj_hash_clear(struct ckpt_obj_hash *obj_hash)
+{
+	struct hlist_head *h = obj_hash->head;
+	struct hlist_node *n, *t;
+	struct ckpt_obj *obj;
+	int i;
+
+	for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			obj->ops->ref_drop(obj->ptr);
+			kfree(obj);
+		}
+	}
+}
+
+void ckpt_obj_hash_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj_hash *obj_hash = ctx->obj_hash;
+
+	if (obj_hash) {
+		obj_hash_clear(obj_hash);
+		kfree(obj_hash->head);
+		kfree(ctx->obj_hash);
+		ctx->obj_hash = NULL;
+	}
+}
+
+int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj_hash *obj_hash;
+	struct hlist_head *head;
+
+	obj_hash = kzalloc(sizeof(*obj_hash), GFP_KERNEL);
+	if (!obj_hash)
+		return -ENOMEM;
+	head = kzalloc(CKPT_OBJ_HASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(obj_hash);
+		return -ENOMEM;
+	}
+
+	obj_hash->head = head;
+	obj_hash->next_free_objref = 1;
+
+	ctx->obj_hash = obj_hash;
+	return 0;
+}
+
+static struct ckpt_obj *obj_find_by_ptr(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &ctx->obj_hash->head[hash_long((unsigned long) ptr,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct ckpt_obj *obj_find_by_objref(struct ckpt_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &ctx->obj_hash->head[hash_long((unsigned long) objref,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+/**
+ * obj_new - add an object to the obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: object unique id
+ * @ops: object operations
+ *
+ * Returns: objref
+ *
+ * Add the object to the obj_hash. If @objref is zero, assign a unique
+ * object id and use @ptr as a hash key [checkpoint]. Else use @objref
+ * as a key [restart].
+ */
+static int obj_new(struct ckpt_ctx *ctx, void *ptr, int objref,
+		   struct ckpt_obj_ops *ops)
+{
+	struct ckpt_obj *obj;
+	int i, ret;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+
+	obj->ptr = ptr;
+	obj->ops = ops;
+
+	if (objref) {
+		/* use @obj->objref to index (restart) */
+		obj->objref = objref;
+		i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS);
+	} else {
+		/* use @obj->ptr to index, assign objref (checkpoint) */
+		obj->objref = ctx->obj_hash->next_free_objref++;
+		i = hash_long((unsigned long) ptr, CKPT_OBJ_HASH_NBITS);
+	}
+
+	ret = ops->ref_grab(obj->ptr);
+	if (ret < 0)
+		kfree(obj);
+	else
+		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+
+	return (ret < 0 ? ret : obj->objref);
+}
+
+/**
+ * ckpt_obj_lookup_add - lookup object and add if not in obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, add the object, and allocate a unique object
+ * id. Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ *
+ * [This is used during checkpoint].
+ *
+ * Return: objref
+ */
+int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			enum obj_type type, int *first)
+{
+	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
+	struct ckpt_obj *obj;
+	int objref;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		objref = obj_new(ctx, ptr, 0, ops);
+		if (objref < 0)
+			return objref;
+		*first = 1;
+	} else if (obj->ops->obj_type != type) {   /* sanity check */
+		return -EINVAL;
+	} else {
+		objref = obj->objref;
+		*first = 0;
+	}
+
+	ckpt_debug("%s objref %d first %d\n", ops->obj_name, objref, *first);
+	return objref;
+}
+
+/**
+ * checkpoint_obj - if not already in hash, add object and checkpoint
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it
+ * isn't already there, then add the object to the table, allocate a
+ * fresh unique id (objref) and save the object's state, and grab a
+ * reference to every object that is added. (Maintain the reference
+ * until the entire hash is freed).
+ *
+ * [This is used during checkpoint].
+ *
+ * Returns: objref
+ */
+int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
+	struct ckpt_hdr_objref *h;
+	struct ckpt_obj *obj;
+	int objref, ret;
+
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->obj_type != type);
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (obj) {
+		BUG_ON(obj->ops->obj_type != type);
+		return obj->objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
+	if (!h)
+		return -ENOMEM;
+
+	objref = obj_new(ctx, ptr, 0, ops);
+	if (objref < 0)
+		return objref;
+
+	h->objtype = type;
+	h->objref = objref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* invoke callback to actually dump the state */
+	if (ops->checkpoint)
+		ret = ops->checkpoint(ctx, ptr);
+	if (ret < 0)
+		return ret;
+
+	return objref;
+}
+
+/**
+ * restore_obj - read in and restore a (first seen) shared object
+ * @ctx: checkpoint context
+ * @h: ckpt_hdr of shared object
+ *
+ * Read in the header payload (struct ckpt_hdr_objref). Look up the
+ * object to verify it isn't there.  Then restore the object's state
+ * and add it to the obj_hash. No need to explicitly grab a reference -
+ * we hold the initial instance of this object. (Object maintained
+ * until the entire hash is freed).
+ *
+ * [This is used during restart].
+ */
+int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h)
+{
+	struct ckpt_obj_ops *ops;
+	void *ptr = NULL;
+	int ret;
+
+	ckpt_debug("len %d ref %d type %d\n", h->h.len, h->objref, h->objtype);
+	if (obj_find_by_objref(ctx, h->objref))
+		return -EINVAL;
+
+	if (h->objtype >= CKPT_OBJ_MAX)
+		return -EINVAL;
+
+	ops = &ckpt_obj_ops[h->objtype];
+	BUG_ON(ops->obj_type != h->objtype);
+
+	if (ops->restore)
+		ptr = ops->restore(ctx);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	ret = obj_new(ctx, ptr, h->objref, ops);
+	if (ret < 0)
+		ops->ref_drop(ptr);
+
+	return ret;
+}
+
+/**
+ * ckpt_obj_insert - add an object with a given objref to obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object id
+ * @type: object type
+ *
+ * Add the object pointed to by @ptr and identified by unique object id
+ * @objref to the hash table (indexed by @objref).  Grab a reference to
+ * every object added, and maintain it until the entire hash is freed.
+ */
+
+int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
+		    enum obj_type type)
+{
+	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
+
+	ckpt_debug("%s objref %d\n", ops->obj_name, objref);
+	return (obj_new(ctx, ptr, objref, ops) ? : 1);
+}
+
+/**
+ * ckpt_obj_fetch - fetch an object by its identifier
+ * @ctx: checkpoint context
+ * @objref: object id
+ * @type: object type
+ *
+ * Look up the object identified by @objref in the hash table. Return
+ * NULL if it is not found, or ERR_PTR(-EINVAL) if its type is not @type.
+ *
+ * [This is used during restart].
+ */
+void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return NULL;
+	ckpt_debug("%s ref %d\n", obj->ops->obj_name, obj->objref);
+	return (obj->ops->obj_type == type ? obj->ptr : ERR_PTR(-EINVAL));
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a1ab0a1..06224fd 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -22,6 +22,34 @@
 #include "checkpoint_arch.h"
 
 /**
+ * _ckpt_read_objref - dispatch handling of a shared object
+ * @ctx: checkpoint context
+ * @hh: object descriptor
+ */
+static int _ckpt_read_objref(struct ckpt_ctx *ctx, struct ckpt_hdr *hh)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, hh->len);
+	if (!h)
+		return -ENOMEM;
+
+	*h = *hh;	/* yay ! */
+
+	_ckpt_debug(CKPT_DOBJ, "shared len %d type %d\n", h->len, h->type);
+	ret = ckpt_kread(ctx, (h + 1), hh->len - sizeof(struct ckpt_hdr));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj(ctx, (struct ckpt_hdr_objref *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+/**
  * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
  * @ctx: checkpoint context
  * @h: desired ckpt_hdr
@@ -36,6 +64,7 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
 {
 	int ret;
 
+ again:
 	ret = ckpt_kread(ctx, h, sizeof(*h));
 	if (ret < 0)
 		return ret;
@@ -43,7 +72,15 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
 		    h->type, h->len, len, max);
 	if (h->len < sizeof(*h))
 		return -EINVAL;
+
+	if (h->type == CKPT_HDR_OBJREF) {
+		ret = _ckpt_read_objref(ctx, h);
+		if (ret < 0)
+			return ret;
+		goto again;
+	}
+
 	/* if len specified, enforce, else if maximum specified, enforce */
 	if ((len && h->len != len) || (!len && max && h->len > max))
 		return -EINVAL;
 
@@ -135,6 +172,7 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 	struct ckpt_hdr *h;
 	int ret;
 
+ again:
 	ret = ckpt_kread(ctx, &hh, sizeof(hh));
 	if (ret < 0)
 		return ERR_PTR(ret);
@@ -142,6 +180,14 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 		    hh.type, hh.len, len, max);
 	if (hh.len < sizeof(*h))
 		return ERR_PTR(-EINVAL);
+
+	if (hh.type == CKPT_HDR_OBJREF) {
+		ret = _ckpt_read_objref(ctx, &hh);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		goto again;
+	}
+
 	/* if len specified, enforce, else if maximum specified, enforce */
 	if ((len && hh.len != len) || (!len && max && hh.len > max))
 		return ERR_PTR(-EINVAL);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 5ebbac9..76d5d66 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -203,6 +203,7 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
 
 	ckpt_pgarr_free(ctx);
+	ckpt_obj_hash_free(ctx);
 
 	kfree(ctx);
 }
@@ -231,8 +232,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto err;
 
-	return ctx;
+	if (ckpt_obj_hash_alloc(ctx) < 0)
+		goto err;
 
+	return ctx;
  err:
 	ckpt_ctx_free(ctx);
 	return ERR_PTR(err);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 73b34af..7845172 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,11 +29,26 @@ extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
 
 extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
 			       void *ptr, int len, int type);
+extern int _ckpt_read_nbuffer(struct ckpt_ctx *ctx, void *ptr, int len);
 extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
 extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+
 extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
 extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
 
+/* obj_hash */
+extern void ckpt_obj_hash_free(struct ckpt_ctx *ctx);
+extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
+
+extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
+extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type);
+extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
+			    enum obj_type type);
+extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			       enum obj_type type, int *first);
+extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
+			   enum obj_type type);
+
 extern int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
 extern int do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
@@ -74,6 +89,7 @@ extern struct file *restore_file(struct ckpt_ctx *ctx);
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DMEM	0x8		/* memory state */
 #define CKPT_DPAGE	0x10		/* memory pages */
+#define CKPT_DOBJ	0x20		/* shared objects */
 
 #define CKPT_DDEFAULT	0xf		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 5266e4b..0eb4acb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -45,6 +45,7 @@ enum {
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 	CKPT_HDR_FNAME,
+	CKPT_HDR_OBJREF,
 
 	CKPT_HDR_TASK = 101,
 	CKPT_HDR_THREAD,
@@ -58,6 +59,19 @@ enum {
 	CKPT_HDR_TAIL = 5001
 };
 
+/* shared objects (objref) */
+struct ckpt_hdr_objref {
+	struct ckpt_hdr h;
+	__u32 objtype;
+	__s32 objref;
+} __attribute__((aligned(8)));
+
+/* shared objects types */
+enum obj_type {
+	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_MAX
+};
+
 /* checkpoint image header */
 struct ckpt_hdr_header {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 84b4ef4..5a365a3 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -30,6 +30,8 @@ struct ckpt_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
-- 
1.5.4.3

* [RFC v14][PATCH 11/54] Introduce 'checkpoint' method in 'struct file_operations'
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 10/54] Infrastructure for shared objects Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 12/54] Dump open file descriptors Oren Laadan
                     ` (44 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.

Also introduce vfs_fcntl() so that it can be called from restart (see
patch adding restart of files).
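
How a per-file 'checkpoint' method lets a file type opt in or out can
be sketched in user space as below. Everything here is a mock under
stated assumptions: the sketch_* names and the reduced structures are
hypothetical, and returning -EBADF for a file without the method is an
assumption of this sketch, not something this patch defines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* reduced stand-ins for the kernel structures (assumption) */
struct ckpt_ctx {
	int nr_saved;
};

struct file;

struct sketch_file_operations {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
};

struct file {
	const struct sketch_file_operations *f_op;
};

/* stands in for a generic method good for regular files and dirs */
static int sketch_generic_file_checkpoint(struct ckpt_ctx *ctx,
					  struct file *file)
{
	(void)file;
	ctx->nr_saved++;
	return 0;
}

/* a filesystem opts in with a single f_op entry */
static const struct sketch_file_operations sketch_regular_fops = {
	.checkpoint = sketch_generic_file_checkpoint,
};

/* a file type that leaves the method unset is not checkpointable */
static const struct sketch_file_operations sketch_device_fops = {
	.checkpoint = NULL,
};

/* dispatch through f_op instead of a switch in the checkpoint code */
static int sketch_checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
{
	assert(ctx && file);
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;	/* assumed error code for "unsupported" */
	return file->f_op->checkpoint(ctx, file);
}
```

With this shape the checkpoint code needs no knowledge of individual
file types; each one either supplies the method or is rejected.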

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 fs/fcntl.c                       |   23 +++++++++++++++--------
 include/linux/checkpoint_types.h |    2 ++
 include/linux/fs.h               |    6 ++++++
 3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index cc8e4de..2d02259 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -335,6 +335,19 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {	
 	struct file *filp;
@@ -344,19 +357,13 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
  	fput(filp);
 out:
 	return err;
 }
 
+
 #if BITS_PER_LONG == 32
 SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
 		unsigned long, arg)
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 5a365a3..12f0ec5 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -10,6 +10,8 @@
  *  distribution for more details.
  */
 
+struct ckpt_ctx;
+
 #include <linux/list.h>
 #include <linux/path.h>
 #include <linux/fs.h>
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e00db0..8ff37b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -8,6 +8,7 @@
 
 #include <linux/limits.h>
 #include <linux/ioctl.h>
+#include <linux/checkpoint_types.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -1082,6 +1083,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 /* fs/sync.c */
@@ -1508,6 +1511,7 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2305,6 +2309,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
-- 
1.5.4.3

* [RFC v14][PATCH 12/54] Dump open file descriptors
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 11/54] Introduce 'checkpoint' method in 'struct file_operations' Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 13/54] add generic checkpoint f_op to ext fses Oren Laadan
                     ` (43 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Dump the files_struct of a task with 'struct ckpt_hdr_files', followed by
all open file descriptors. Because the 'struct file' corresponding to an
FD can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it lives
in the hash (the hash is only cleaned up at the end of the checkpoint).

For each open FD there is a 'struct ckpt_hdr_fd_ent' with the FD, its
close-on-exec property, and the objref of the corresponding 'file *'.
If the FD is to be saved (first time) then this is followed by a
'struct ckpt_hdr_fd_data' with the FD state. Then will come the next FD
and so on.

Recall that it is assumed that all tasks possibly sharing the file table
are frozen. If this assumption breaks, then the behavior is *undefined*:
checkpoint may fail, or restart from the resulting image file will fail.

This patch provides generic_checkpoint_file(), which is good for
normal files and directories. It does not yet support unlinked files
or directories.

Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if it hits size limits

Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()

Changelog[v8]:
  - initialize 'coe' to work around gcc false warning

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/files.c             |  210 +++++++++++++++++++++++++++++++++++++++-
 checkpoint/objhash.c           |   23 ++++-
 checkpoint/process.c           |    4 +
 include/linux/checkpoint.h     |   10 ++-
 include/linux/checkpoint_hdr.h |   39 ++++++++
 include/linux/fs.h             |    4 +
 6 files changed, 285 insertions(+), 5 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index a7cf6c3..47e5f61 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -8,9 +8,13 @@
  *  distribution for more details.
  */
 
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -82,9 +86,211 @@ static int dump_fname(struct ckpt_ctx *ctx,
 	return ret;
 }
 
-int checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+	ret = dump_fname(ctx, &file->f_path, &ctx->fs_mnt);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* checkpoint_file - dump the state of a given file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+
+	if (!file->f_op->checkpoint)
+		return -EBADF;
+	return file->f_op->checkpoint(ctx, file);
+}
+
+/**
+ * checkpoint_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file to dump the file pointer too.
+ */
+static int
+checkpoint_fd_ent(struct ckpt_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_fd_ent *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FD_ENT);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file)
+		goto out;
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+int checkpoint_fd_table(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return dump_fname(ctx, &file->f_path, &ctx->fs_mnt);
+	struct ckpt_hdr_fd_table *h;
+	struct files_struct *files;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FD_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	files = get_files_struct(t);
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ out:
+	kfree(fdtable);
+	put_files_struct(files);
+	return ret;
 }
 
 /**************************************************************************
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 076a3a3..9565bcb 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,7 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -44,6 +45,7 @@ struct ckpt_obj_hash {
 /*
  * helper grab/drop functions:
  *   obj_no_{drop,grab}: for objects ignored/skipped
+ *   obj_file_{drop,grab}: for file objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -56,6 +58,18 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+/* helper drop/grab functions */
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr)
+{
+	fput((struct file *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -64,9 +78,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.checkpoint = checkpoint_file,
+	},
 };
 
-
 #define CKPT_OBJ_HASH_NBITS  10
 #define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 7adb842..640a27c 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -61,6 +61,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ckpt_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = checkpoint_fd_table(ctx, t);
+	ckpt_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_thread(ctx, t);
 	ckpt_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7845172..d6644f0 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -79,9 +79,14 @@ extern int restore_mm(struct ckpt_ctx *ctx);
 	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 /* files */
-extern int checkpoint_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
 extern struct file *restore_file(struct ckpt_ctx *ctx);
 
+extern int checkpoint_fd_table(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
@@ -90,8 +95,9 @@ extern struct file *restore_file(struct ckpt_ctx *ctx);
 #define CKPT_DMEM	0x8		/* memory state */
 #define CKPT_DPAGE	0x10		/* memory pages */
 #define CKPT_DOBJ	0x20		/* shared objects */
+#define CKPT_DFILE	0x40		/* files and filesystem */
 
-#define CKPT_DDEFAULT	0xf		/* default debug level */
+#define CKPT_DDEFAULT	0x4f		/* default debug level */
 
 #ifndef CKPT_DFLAG
 #define CKPT_DFLAG	0x0		/* nothing */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0eb4acb..a957e6c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -56,6 +56,10 @@ enum {
 	CKPT_HDR_PGARR,
 	CKPT_HDR_MM_CONTEXT,
 
+	CKPT_HDR_FD_TABLE = 301,
+	CKPT_HDR_FD_ENT,
+	CKPT_HDR_FILE,
+
 	CKPT_HDR_TAIL = 5001
 };
 
@@ -69,6 +73,7 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_FILE,
 	CKPT_OBJ_MAX
 };
 
@@ -156,4 +161,38 @@ struct ckpt_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+/* file system */
+struct ckpt_hdr_fd_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_fd_ent {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_GENERIC = 1,
+	CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8ff37b3..2c9ff62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2309,7 +2309,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.5.4.3


* [RFC v14][PATCH 13/54] add generic checkpoint f_op to ext fses
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This marks ext[234] as being checkpointable.  There will be many
more to do this to, but this is a start.

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    1 +
 6 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2999d72..4f1dd79 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -721,4 +721,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= ext2_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 45ed071..e1731c5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= ext2_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..54b05d2 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a421e07 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -126,6 +126,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index b647899..2787fdb 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 588af8c..c2dab33 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -161,6 +161,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.5.4.3


* [RFC v14][PATCH 14/54] Restore open file descriptors
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Restore open file descriptors: for each FD, read 'struct ckpt_hdr_fd_ent'
and look up the objref in the hash table; if not found (first occurrence),
read in 'struct ckpt_hdr_fd_data', create a new FD and register it in the
hash. Otherwise attach the file pointer from the hash as an FD.
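
The last step, putting the file on the descriptor number recorded in
the image, has a close userspace analogue. The sketch below assumes a
POSIX system and uses a hypothetical helper, place_fd(); the kernel
code in this patch uses sys_dup2(), sys_close() and
set_close_on_exec() instead, so treat this as an illustration of the
technique only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Make descriptor @want refer to the open file currently held by @fd,
 * releasing @fd if the two differ, then optionally mark @want
 * close-on-exec. Returns 0 on success or a negative errno value.
 */
int place_fd(int fd, int want, int coe)
{
	int flags;

	assert(fd >= 0 && want >= 0);

	/* reposition if the descriptor we got is not the desired one */
	if (fd != want) {
		if (dup2(fd, want) < 0)
			return -errno;
		close(fd);
	}

	/* dup2() leaves close-on-exec clear on @want; set it if asked */
	if (coe) {
		flags = fcntl(want, F_GETFD);
		if (flags < 0)
			return -errno;
		if (fcntl(want, F_SETFD, flags | FD_CLOEXEC) < 0)
			return -errno;
	}
	return 0;
}
```

Note that dup2() silently closes whatever @want referred to before,
which is why the restore path first closes all descriptors of the
restarting task.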

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename:  restore_files() => restore_fd_table()
  - Rename:  ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh->parent'

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/files.c             |  221 +++++++++++++++++++++++++++++++++++++++-
 checkpoint/objhash.c           |    2 +
 checkpoint/process.c           |    4 +
 checkpoint/restart.c           |    2 +-
 include/linux/checkpoint.h     |    7 +-
 include/linux/checkpoint_hdr.h |    3 +-
 mm/filemap.c                   |    1 -
 7 files changed, 232 insertions(+), 8 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 47e5f61..80e1c02 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -15,10 +15,11 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
-
 /**************************************************************************
  * Checkpoint
  */
@@ -320,8 +321,220 @@ static struct file *read_open_fname(struct ckpt_ctx *ctx, int flags, int mode)
 	return file;
 }
 
-struct file *restore_file(struct ckpt_ctx *ctx)
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		goto out;
+
+	ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+ out:
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = read_open_fname(ctx, ptr->f_flags, ptr->f_mode);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		 h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+void *restore_file(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file(ctx);
+}
+
+/**
+ * restore_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_fd_ent(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_fd_ent *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FD_ENT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		 h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (!file)
+		goto out;
+	else if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	if (h->fd_close_on_exec)
+		set_close_on_exec(h->fd_descriptor, 1);
+
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_fd_table(struct ckpt_ctx *ctx)
 {
-	/* currently only called for mapped files; O_RDONLY works */
-	return read_open_fname(ctx, O_RDONLY, 0);
+	struct ckpt_hdr_fd_table *h;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FD_TABLE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/* point of no return -- close all file descriptors */
+	ret = close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->fdt_nfds; i++) {
+		ret = restore_fd_ent(ctx);
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
 }
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 9565bcb..5476b0a 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -85,9 +85,11 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_file_drop,
 		.ref_grab = obj_file_grab,
 		.checkpoint = checkpoint_file,
+		.restore = restore_file,
 	},
 };
 
+
 #define CKPT_OBJ_HASH_NBITS  10
 #define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 640a27c..a0e8163 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -116,6 +116,10 @@ int restore_task(struct ckpt_ctx *ctx)
 	ckpt_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = restore_fd_table(ctx);
+	ckpt_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = restore_thread(ctx);
 	ckpt_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 06224fd..ecf2cf0 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -251,7 +251,7 @@ void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type)
 
 	BUG_ON(!len);
 
-	h = ckpt_read_obj(ctx, len, len);
+	h = ckpt_read_obj(ctx, 0, len);
 	if (IS_ERR(h))
 		return h;
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d6644f0..527a84f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,8 @@
  *  distribution for more details.
  */
 
+struct ckpt_ctx;
+
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -80,12 +82,15 @@ extern int restore_mm(struct ckpt_ctx *ctx);
 
 /* files */
 extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
-extern struct file *restore_file(struct ckpt_ctx *ctx);
+extern void *restore_file(struct ckpt_ctx *ctx);
 
 extern int checkpoint_fd_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_fd_table(struct ckpt_ctx *ctx);
 
 extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 				  struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			       struct ckpt_hdr_file *h);
 
 
 /* debugging flags */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index a957e6c..7c87bf8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -176,7 +176,8 @@ struct ckpt_hdr_fd_ent {
 } __attribute__((aligned(8)));
 
 enum file_type {
-	CKPT_FILE_GENERIC = 1,
+	CKPT_FILE_IGNORE = 0,
+	CKPT_FILE_GENERIC,
 	CKPT_FILE_MAX
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index ef5680b..f51b537 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1658,7 +1658,6 @@ int filemap_restore(struct ckpt_ctx *ctx,
 	struct file *file;
 	int ret;
 
-	/* for private mapping using 'read-only' is sufficient */
 	file = restore_file(ctx);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
-- 
1.5.4.3


* [RFC v14][PATCH 15/54] Record 'struct file' object instead of the file name for VMAs
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

The vma->vm_file can be an arbitrary file pointer, including one that
is in use by a process as well and provided originally via the mmap()
syscall.

Thus, when dumping the state of a VMA, save a file object instead
of only the file name. As with other file objects, if it's seen for
the first time it is dumped entirely, otherwise only the 'objref' is
saved. The restart logic is updated accordingly.

(Also suggested by Alexey Dobriyan)

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c            |   63 +++++++++++++++++++++++++++++++++-------
 include/linux/checkpoint.h     |    6 ++-
 include/linux/checkpoint_hdr.h |    4 +-
 mm/filemap.c                   |   22 ++++++--------
 mm/mmap.c                      |    2 +-
 5 files changed, 68 insertions(+), 29 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index c725519..4fa634a 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
+#include <linux/proc_fs.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -472,9 +473,10 @@ static int checkpoint_private_contents(struct ckpt_ctx *ctx,
  * @ctx: checkpoint context
  * @vma: vma object
  * @type: vma type
+ * @vma_objref: vma object id
  */
-int generic_vma_checkpoint(struct ckpt_ctx *ctx,
-			   struct vm_area_struct *vma, enum vma_type type)
+int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
 {
 	struct ckpt_hdr_vma *h;
 	int ret;
@@ -492,6 +494,8 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx,
 		return -ENOMEM;
 
 	h->vma_type = type;
+	h->vma_objref = vma_objref;
+
 	h->vm_start = vma->vm_start;
 	h->vm_end = vma->vm_end;
 	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
@@ -509,16 +513,17 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx,
  * @ctx: checkpoint context
  * @vma: vma object
  * @type: vma type
+ * @vma_objref: vma object id
  */
 int private_vma_checkpoint(struct ckpt_ctx *ctx,
 			   struct vm_area_struct *vma,
-			   enum vma_type type)
+			   enum vma_type type, int vma_objref)
 {
 	int ret;
 
 	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
 
-	ret = generic_vma_checkpoint(ctx, vma, type);
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_private_contents(ctx, vma);
@@ -542,7 +547,7 @@ static int anonymous_checkpoint(struct ckpt_ctx *ctx,
 
 	BUG_ON(vma->vm_file);
 
-	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON);
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
 }
 
 int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
@@ -550,6 +555,7 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_mm *h;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
+	int exe_objref = 0;
 	int ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
@@ -560,6 +566,8 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	down_read(&mm->mmap_sem);
 
+	/* FIX: need also mm->flags */
+
 	h->start_code = mm->start_code;
 	h->end_code = mm->end_code;
 	h->start_data = mm->start_data;
@@ -574,10 +582,18 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	h->map_count = mm->map_count;
 
-	/* FIX: need also mm->flags */
+	/* checkpoint the ->exe_file */
+	if (mm->exe_file) {
+		exe_objref = checkpoint_obj(ctx, mm->exe_file, CKPT_OBJ_FILE);
+		if (exe_objref < 0) {
+			ret = exe_objref;
+			goto out;
+		}
+	}
+
+	h->exefile_objref = exe_objref;
 
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
-	ckpt_hdr_put(ctx, h);
 	if (ret < 0)
 		goto out;
 
@@ -597,6 +613,7 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	ret = checkpoint_mm_context(ctx, mm);
  out:
+	ckpt_hdr_put(ctx, h);
 	up_read(&mm->mmap_sem);
 	mmput(mm);
 	return ret;
@@ -795,6 +812,8 @@ static unsigned long generic_vma_restore(struct mm_struct *mm,
 
 	if (h->vm_end < h->vm_start)
 		return -EINVAL;
+	if (h->vma_objref < 0)
+		return -EINVAL;
 	if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED)
 		return -ENOSYS;
 
@@ -906,12 +925,16 @@ static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	ckpt_debug("vma %#lx-%#lx type %d\n", (unsigned long) h->vm_start,
-		 (unsigned long) h->vm_end, (int) h->vma_type);
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+		 (unsigned long) h->vm_start, (unsigned long) h->vm_end,
+		 (unsigned long) h->vm_flags, (int) h->vma_type,
+		 (int) h->vma_objref);
 
 	ret = -EINVAL;
 	if (h->vm_end < h->vm_start)
 		goto out;
+	if (h->vma_objref < 0)
+		goto out;
 	if (h->vma_type >= CKPT_VMA_MAX)
 		goto out;
 
@@ -954,6 +977,7 @@ int restore_mm(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_mm *h;
 	struct mm_struct *mm;
+	struct file *file;
 	unsigned int nr;
 	int ret;
 
@@ -969,6 +993,8 @@ int restore_mm(struct ckpt_ctx *ctx)
 	if ((h->start_code > h->end_code) ||
 	    (h->start_data > h->end_data))
 		goto out;
+	if (h->exefile_objref < 0)
+		goto out;
 
 	mm = current->mm;
 
@@ -979,6 +1005,9 @@ int restore_mm(struct ckpt_ctx *ctx)
 		up_write(&mm->mmap_sem);
 		goto out;
 	}
+
+	/* FIX: need also mm->flags */
+
 	mm->start_code = h->start_code;
 	mm->end_code = h->end_code;
 	mm->start_data = h->start_data;
@@ -990,9 +1019,21 @@ int restore_mm(struct ckpt_ctx *ctx)
 	mm->arg_end = h->arg_end;
 	mm->env_start = h->env_start;
 	mm->env_end = h->env_end;
-	up_write(&mm->mmap_sem);
 
-	/* FIX: need also mm->flags */
+	/* restore the ->exe_file */
+	if (h->exefile_objref) {
+		file = ckpt_obj_fetch(ctx, h->exefile_objref, CKPT_OBJ_FILE);
+		if (!file)
+			file = ERR_PTR(-EINVAL);
+		if (IS_ERR(file)) {
+			up_write(&mm->mmap_sem);
+			ret = PTR_ERR(file);
+			goto out;
+		}
+		set_mm_exe_file(mm, file);
+	}
+
+	up_write(&mm->mmap_sem);
 
 	for (nr = h->map_count; nr; nr--) {
 		ret = restore_vma(ctx, mm);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 527a84f..dcc3840 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -63,10 +63,12 @@ extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
 extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
-				  enum vma_type type);
+				  enum vma_type type,
+				  int vma_objref);
 extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
-				  enum vma_type type);
+				  enum vma_type type,
+				  int vma_objref);
 
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7c87bf8..0045c22 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -126,7 +126,7 @@ struct ckpt_hdr_task {
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
 	__u32 map_count;
-	__u32 _padding;
+	__u32 exefile_objref;	/* objref for the exefile */
 
 	__u64 start_code, end_code, start_data, end_data;
 	__u64 start_brk, brk, start_stack;
@@ -146,7 +146,7 @@ enum vma_type {
 struct ckpt_hdr_vma {
 	struct ckpt_hdr h;
 	__u32 vma_type;
-	__u32 _padding;
+	__u32 vma_objref;	/* for vma->vm_file */
 
 	__u64 vm_start;
 	__u64 vm_end;
diff --git a/mm/filemap.c b/mm/filemap.c
index f51b537..e515845 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1633,7 +1633,7 @@ EXPORT_SYMBOL(filemap_fault);
 static int filemap_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma)
 {
-	int ret;
+	int vma_objref;
 
 	/* should be private anonymous ... verify that this is the case */
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
@@ -1643,12 +1643,12 @@ static int filemap_checkpoint(struct ckpt_ctx *ctx,
 
 	BUG_ON(!vma->vm_file);
 
-	ret = checkpoint_file(ctx, vma->vm_file);
-	if (ret < 0)
-		goto out;
-	ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE);
- out:
-	return ret;
+	/* checkpoint the file object first (will add to objhash) */
+	vma_objref = checkpoint_obj(ctx, vma->vm_file, CKPT_OBJ_FILE);
+	if (vma_objref < 0)
+		return vma_objref;
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
 }
 
 int filemap_restore(struct ckpt_ctx *ctx,
@@ -1656,16 +1656,12 @@ int filemap_restore(struct ckpt_ctx *ctx,
 		    struct ckpt_hdr_vma *h)
 {
 	struct file *file;
-	int ret;
 
-	file = restore_file(ctx);
+	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
-	ret = private_vma_restore(ctx, mm, file, h);
-
-	fput(file);
-	return ret;
+	return private_vma_restore(ctx, mm, file, h);
 }
 #else
 #define filemap_checkpoint NULL
diff --git a/mm/mmap.c b/mm/mmap.c
index 3b6356c..0c65512 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2300,7 +2300,7 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 	if (!name || strcmp(name, "[vdso]"))
 		return -ENOSYS;
 
-	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO);
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
 }
 
 int special_mapping_restore(struct ckpt_ctx *ctx,
-- 
1.5.4.3

* [RFC v14][PATCH 16/54] External checkpoint of a task other than ourself
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container.

sys_restart() remains the same, as the restart is always done in the
context of the restarting task.
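
The order of the checks made on the target can be modelled in userspace
as below. This is only a sketch: 'struct model_task', its fields and
model_get_container() are made up for illustration and are not part of
the patch. The model assumes that a reference taken on the target should
be dropped on every failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* hypothetical stand-in for a task; not the kernel's task_struct */
struct model_task {
	int pid;
	bool frozen;	/* task sits in a frozen cgroup */
	bool ptraced;	/* task is being traced */
	bool readable;	/* caller may read the task's state */
	int refs;	/* models get_task_struct()/put_task_struct() */
};

/* look up 'pid' and validate it as a checkpoint root */
static int model_get_container(struct model_task *tasks, size_t n, int pid,
			       const struct model_task *self,
			       struct model_task **root)
{
	struct model_task *task = NULL;
	int err = -ESRCH;
	size_t i;

	assert(tasks && root);

	for (i = 0; i < n; i++) {
		if (tasks[i].pid == pid) {
			task = &tasks[i];
			break;
		}
	}
	if (!task)
		goto out;
	task->refs++;

	err = -EPERM;
	if (!task->readable)
		goto out;

	/* a target other than ourself must be frozen, and not traced */
	err = -EBUSY;
	if (task != self && !task->frozen)
		goto out;
	if (task->ptraced)
		goto out;

	*root = task;	/* the caller now owns the reference */
	return 0;
 out:
	if (task)
		task->refs--;
	return err;
}
```

A self-checkpoint passes without being frozen, while any other target
that is still running is refused with -EBUSY.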

Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them

Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c          |   77 ++++++++++++++++++++++++++++++++++++--
 checkpoint/restart.c             |    4 +-
 checkpoint/sys.c                 |    6 +++
 include/linux/checkpoint_types.h |    2 +
 4 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 9abdf73..b741557 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -12,7 +12,11 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
+#include <linux/fs_struct.h>
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
@@ -161,22 +165,87 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
-static int ckpt_ctx_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+static int get_container(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	/* verify that the task is frozen (unless self) */
+	if (task != current && !frozen(task))
+		return -EBUSY;
+
+	/* FIX: add support for ptraced tasks */
+	if (task_ptrace(task))
+		return -EBUSY;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(task);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
+static int ctx_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -185,13 +254,13 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = ckpt_ctx_checkpoint(ctx, pid);
+	ret = ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, current);
+	ret = checkpoint_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index ecf2cf0..637de90 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -334,7 +334,7 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int ckpt_ctx_restart(struct ckpt_ctx *ctx)
+static int ctx_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -343,7 +343,7 @@ int do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = ckpt_ctx_restart(ctx);
+	ret = ctx_restart(ctx, pid);
 	if (ret < 0)
 		return ret;
 	ret = restore_read_header(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 76d5d66..0b7245a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -12,6 +12,7 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -205,6 +206,11 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	ckpt_pgarr_free(ctx);
 	ckpt_obj_hash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 12f0ec5..d98ba71 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,8 @@ struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3

* [RFC v14][PATCH 17/54] c/r of restart-blocks: export functionality used in next patch
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

To support c/r of restart-blocks (system calls that need to be
restarted because they were interrupted without any userspace-visible
side-effect), export the restart-block callbacks for the poll()
and futex() syscalls.

More details on c/r of restart-blocks, and how it works, are in the
following patch.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/select.c                  |    2 +-
 include/linux/futex.h        |   10 ++++++++++
 include/linux/poll.h         |    3 +++
 include/linux/posix-timers.h |    2 ++
 kernel/futex.c               |   11 +----------
 kernel/posix-timers.c        |    2 +-
 6 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..e64ddc6 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -833,7 +833,7 @@ out_fds:
 	return err;
 }
 
-static long do_restart_poll(struct restart_block *restart_block)
+long do_restart_poll(struct restart_block *restart_block)
 {
 	struct pollfd __user *ufds = restart_block->poll.ufds;
 	int nfds = restart_block->poll.nfds;
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 3bf5bb5..dd0e06b 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -130,6 +130,16 @@ extern int
 handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi);
 
 /*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'flags' shared capability
+ */
+#define FLAGS_SHARED		0x01
+#define FLAGS_CLOCKRT		0x02
+
+/* referenced by checkpoint/restart */
+extern long futex_wait_restart(struct restart_block *restart);
+
+/*
  * Futexes are matched on equal values of this key.
  * The key type depends on whether it's a shared or private mapping.
  * Don't rearrange members without looking at hash_futex().
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 8c24ef8..97f95a7 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -131,6 +131,9 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);
 
+/* used by checkpoint/restart */
+extern long do_restart_poll(struct restart_block *restart_block);
+
 #endif /* KERNEL */
 
 #endif /* _LINUX_POLL_H */
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 4f71bf4..3d0e946 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -119,4 +119,6 @@ long clock_nanosleep_restart(struct restart_block *restart_block);
 
 void update_rlimit_cpu(unsigned long rlim_new);
 
+int invalid_clockid(const clockid_t which_clock);
+
 #endif
diff --git a/kernel/futex.c b/kernel/futex.c
index eef8cd2..4618b36 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1111,15 +1111,6 @@ handle_fault:
 	goto retry;
 }
 
-/*
- * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'flags' shared capability
- */
-#define FLAGS_SHARED		0x01
-#define FLAGS_CLOCKRT		0x02
-
-static long futex_wait_restart(struct restart_block *restart);
-
 static int futex_wait(u32 __user *uaddr, int fshared,
 		      u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
 {
@@ -1284,7 +1275,7 @@ out:
 }
 
 
-static long futex_wait_restart(struct restart_block *restart)
+long futex_wait_restart(struct restart_block *restart)
 {
 	u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
 	int fshared = 0;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 052ec4d..a734f4e 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -205,7 +205,7 @@ static int no_timer_create(struct k_itimer *new_timer)
 /*
  * Return nonzero if we know a priori this clockid_t value is bogus.
  */
-static inline int invalid_clockid(const clockid_t which_clock)
+inline int invalid_clockid(const clockid_t which_clock)
 {
 	if (which_clock < 0)	/* CPU clock, posix_cpu_* will check it */
 		return 0;
-- 
1.5.4.3

* [RFC v14][PATCH 18/54] c/r of restart-blocks
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

(Paraphrasing what's said in this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used to cause a system call to be restarted
with the arguments specified in the system call restart block. They are
useful for system calls that are not idempotent, i.e. the argument(s)
might be a relative timeout, where some adjustments are required when
restarting the system call. The mechanism relies on the system call
itself to set up its restart point and the argument save area. They are
rare: an actual signal would turn that into an EINTR. The only case that
should ever trigger this is some kernel action that interrupts the
system call, but does not actually result in any user-visible state
changes - like freeze and thaw.

So restart blocks are about the time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute and relative. Here, I chose to save the relative
timeout, measured from the beginning of the checkpoint. The time when
the checkpoint (and restart) begins is also saved. This information is
sufficient to restart in either model (absolute or relative).

Which model to use should eventually be a per-application choice (and
possibly configurable via cradvise() or something of that sort). For
now, we adopt the relative model, namely, at restart the timeout is set
relative to the beginning of the restart.

To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that it has to wait/sleep, and the type
of the restart block.
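
The arithmetic of the relative model can be sketched in a few lines of
plain C. The helper names below are made up for illustration; in the
patch the same computation is done inline on 'expire' and on
ctx->ktime_begin, with the remaining time stored in h->arg_4.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Checkpoint side: time left until 'expire', measured from 'base'
 * (the time the checkpoint began), in nanoseconds. A timeout that
 * already passed is recorded as zero remaining time.
 */
static int64_t ckpt_remaining_ns(int64_t base, int64_t expire)
{
	return base < expire ? expire - base : 0;
}

/*
 * Restart side: the new absolute expiry is the saved remaining time,
 * re-based at the time the restart began.
 */
static int64_t restart_expire_ns(int64_t restart_begin, int64_t remaining)
{
	assert(remaining >= 0);
	return restart_begin + remaining;
}
```

For example, a sleep with 500ns left at checkpoint time expires 500ns
after the restart begins, no matter how much wall-clock time passed in
between.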

To restart, we fill in the data required at the proper place in the
thread information. If the system call returned an error (possibly
-ERESTARTSYS, for example), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart requests gracefully.
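
The decision taken at the end of restart can be modelled in userspace
as follows. 'struct fake_task' and model_restore_retval() are invented
for this sketch; the patch works on pt_regs through syscall_get_nr() and
syscall_get_error() and sets TIF_SIGPENDING. ERESTARTSYS is defined
locally because userspace headers do not export it; 512 is its
kernel-internal value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ERESTARTSYS 512	/* kernel-internal value, not in userspace errno.h */

/* hypothetical snapshot of what the saved registers tell us */
struct fake_task {
	long syscall_nr;	/* negative if the task was in user space */
	long syscall_err;	/* 0 on success, negative error otherwise */
	bool sigpending;	/* models TIF_SIGPENDING */
};

static long model_restore_retval(struct fake_task *t)
{
	long ret = 0;

	assert(t != NULL);

	/* were we in a system call? if so, pick up the old error */
	if (t->syscall_nr >= 0)
		ret = t->syscall_err;
	/* old error? make sure the signal handling code gets to run */
	if (ret < 0)
		t->sigpending = true;
	return ret;
}
```

A task frozen in user space restarts with a return value of zero and no
faked signal; a task interrupted inside a sleeping system call gets both
the old error and a pass through the signal handling code.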

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/mm/checkpoint.c         |    8 +-
 checkpoint/checkpoint.c          |    1 +
 checkpoint/process.c             |  219 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c             |   35 ++++++-
 checkpoint/sys.c                 |    2 +
 include/linux/checkpoint.h       |    4 +
 include/linux/checkpoint_hdr.h   |   22 ++++
 include/linux/checkpoint_types.h |    3 +
 8 files changed, 288 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index a475a30..bd9449b 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -64,10 +64,10 @@ int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * not tied to the in-kernel representation.
 	 */
 	ret = ckpt_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+	if (ret < 0)
+		return ret;
 
-	/* IGNORE RESTART BLOCKS FOR NOW ... */
-
-	return ret;
+	return checkpoint_restart_block(ctx, t);
 }
 
 #ifdef CONFIG_X86_64
@@ -327,7 +327,7 @@ int restore_thread(struct ckpt_ctx *ctx)
 		kfree(desc);
 	}
 
-	ret = 0;
+	ret = restore_restart_block(ctx);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index b741557..07901c1 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -24,6 +24,7 @@
 #include <linux/mount.h>
 #include <linux/utsname.h>
 #include <linux/magic.h>
+#include <linux/hrtimer.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index a0e8163..d5ee6fd 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/posix-timers.h>
+#include <linux/futex.h>
+#include <linux/poll.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -48,6 +51,117 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+/* dump the restart block of a given task */
+int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_restart_block *h;
+	struct restart_block *restart_block;
+	long (*fn)(struct restart_block *);
+	s64 base, expire = 0;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+	if (!h)
+		return -ENOMEM;
+
+	base = ktime_to_ns(ctx->ktime_begin);
+	restart_block = &task_thread_info(t)->restart_block;
+	fn = restart_block->fn;
+
+	/* FIX: enumerate clockid_t so we're immune to changes */
+
+	if (fn == do_no_restart_syscall) {
+
+		h->fn = CKPT_RESTART_BLOCK_NONE;
+		ckpt_debug("restart_block: none\n");
+
+	} else if (fn == hrtimer_nanosleep_restart) {
+
+		h->fn = CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long) restart_block->nanosleep.rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: hrtimer expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == posix_cpu_nsleep_restart) {
+		struct timespec ts;
+
+		h->fn = CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP;
+		h->arg_0 = restart_block->arg0;
+		h->arg_1 = restart_block->arg1;
+		ts.tv_sec = restart_block->arg2;
+		ts.tv_nsec = restart_block->arg3;
+		expire = timespec_to_ns(&ts);
+		ckpt_debug("restart_block: posix_cpu expire %lld now %lld\n",
+			 expire, base);
+
+#ifdef CONFIG_COMPAT
+	} else if (fn == compat_nanosleep_restart) {
+
+		h->fn = CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+		h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: compat expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == compat_clock_nanosleep_restart) {
+
+		h->fn = CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+		h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: compat_clock expire %lld now %lld\n",
+			 expire, base);
+
+#endif
+	} else if (fn == futex_wait_restart) {
+
+		h->fn = CKPT_RESTART_BLOCK_FUTEX;
+		h->arg_0 = (unsigned long) restart_block->futex.uaddr;
+		h->arg_1 = restart_block->futex.val;
+		h->arg_2 = restart_block->futex.flags;
+		h->arg_3 = restart_block->futex.bitset;
+		expire = restart_block->futex.time;
+		ckpt_debug("restart_block: futex expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == do_restart_poll) {
+		struct timespec ts;
+
+		h->fn = CKPT_RESTART_BLOCK_POLL;
+		h->arg_0 = (unsigned long) restart_block->poll.ufds;
+		h->arg_1 = restart_block->poll.nfds;
+		h->arg_2 = restart_block->poll.has_timeout;
+		ts.tv_sec = restart_block->poll.tv_sec;
+		ts.tv_nsec = restart_block->poll.tv_nsec;
+		expire = timespec_to_ns(&ts);
+		ckpt_debug("restart_block: poll expire %lld now %lld\n",
+			 expire, base);
+
+	} else {
+
+		BUG();
+
+	}
+
+	/* common to all restart blocks: */
+	if (base < expire)
+		h->arg_4 = (expire - base);
+
+	ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+		 h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	ckpt_debug("restart_block ret %d\n", ret);
+	return ret;
+}
+
 /* dump the entire state of a given task */
 int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -103,6 +217,111 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+int restore_restart_block(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_restart_block *h;
+	struct restart_block restart_block;
+	struct timespec ts;
+	clockid_t clockid;
+	s64 expire;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	expire = ktime_to_ns(ctx->ktime_begin) + h->arg_4;
+	restart_block.fn = NULL;
+
+	ckpt_debug("restart_block: expire %lld begin %lld\n",
+		 expire, ktime_to_ns(ctx->ktime_begin));
+	ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+		 h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+	switch (h->fn) {
+	case CKPT_RESTART_BLOCK_NONE:
+		restart_block.fn = do_no_restart_syscall;
+		break;
+	case CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = hrtimer_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.expires = expire;
+		break;
+	case CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = posix_cpu_nsleep_restart;
+		restart_block.arg0 = clockid;
+		restart_block.arg1 = h->arg_1;
+		ts = ns_to_timespec(expire);
+		restart_block.arg2 = ts.tv_sec;
+		restart_block.arg3 = ts.tv_nsec;
+		break;
+#ifdef CONFIG_COMPAT
+	case CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = compat_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.compat_rmtp =
+			(struct compat_timespec __user *)
+				(unsigned long) h->arg_2;
+		restart_block.nanosleep.expires = expire;
+		break;
+	case CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = compat_clock_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.compat_rmtp =
+			(struct compat_timespec __user *)
+				(unsigned long) h->arg_2;
+		restart_block.nanosleep.expires = expire;
+		break;
+#endif
+	case CKPT_RESTART_BLOCK_FUTEX:
+		restart_block.fn = futex_wait_restart;
+		restart_block.futex.uaddr = (u32 *) (unsigned long) h->arg_0;
+		restart_block.futex.val = h->arg_1;
+		restart_block.futex.flags = h->arg_2;
+		restart_block.futex.bitset = h->arg_3;
+		restart_block.futex.time = expire;
+		break;
+	case CKPT_RESTART_BLOCK_POLL:
+		restart_block.fn = do_restart_poll;
+		restart_block.poll.ufds =
+			(struct pollfd __user *) (unsigned long) h->arg_0;
+		restart_block.poll.nfds = h->arg_1;
+		restart_block.poll.has_timeout = h->arg_2;
+		ts = ns_to_timespec(expire);
+		restart_block.poll.tv_sec = ts.tv_sec;
+		restart_block.poll.tv_nsec = ts.tv_nsec;
+		break;
+	default:
+		break;
+	}
+
+	if (restart_block.fn)
+		task_thread_info(current)->restart_block = restart_block;
+	else
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 637de90..cf11b5a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -16,6 +16,8 @@
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <asm/syscall.h>
+#include <linux/elf.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -339,6 +341,34 @@ static int ctx_restart(struct ckpt_ctx *ctx, pid_t pid)
 	return 0;
 }
 
+static int restore_retval(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	int ret = 0;
+
+	/*
+	 * The retval should be either zero if the checkpointed task
+	 * had been in user-space when frozen, or the retval from the
+	 * syscall that had been interrupted then.
+	 *
+	 * In the latter, if the syscall succeeded (perhaps partially)
+	 * then the retval is non-negative. If it failed, the error
+	 * may be one of -ERESTART... gang, interpreted in the signal
+	 * handling code. In restart it must happen, too.
+	 *
+	 * To force execution of the signal handler now, too, we fake
+	 * a signal to ourselves (a la freeze/thaw) when ret < 0.
+	 */
+
+	/* were we from a system call?  if so, get old error/retval */
+	if (syscall_get_nr(current, regs) >= 0)
+		ret = syscall_get_error(current, regs);
+	/* old error ?  if so, make sure signal handling kicks in */
+	if (ret < 0)
+		set_tsk_thread_flag(current, TIF_SIGPENDING);
+	return ret;
+}
+
 int do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	int ret;
@@ -353,7 +383,8 @@ int do_restart(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		return ret;
 	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		return ret;
 
-	/* on success, adjust the return value if needed [TODO] */
-	return ret;
+	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 0b7245a..6ba0446 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -224,6 +224,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
 		return ERR_PTR(-ENOMEM);
 
 	ctx->flags = flags;
+	ctx->ktime_begin = ktime_get();
 
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
@@ -241,6 +242,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+
 	return ctx;
  err:
 	ckpt_ctx_free(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index dcc3840..8964a12 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -58,6 +58,10 @@ extern int do_restart(struct ckpt_ctx *ctx, pid_t pid);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
+extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
+				    struct task_struct *t);
+extern int restore_restart_block(struct ckpt_ctx *ctx);
+
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0045c22..84403f2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -48,6 +48,7 @@ enum {
 	CKPT_HDR_OBJREF,
 
 	CKPT_HDR_TASK = 101,
+	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 
@@ -122,6 +123,27 @@ struct ckpt_hdr_task {
 	__u32 task_comm_len;
 } __attribute__((aligned(8)));
 
+/* (thread) restart blocks */
+struct ckpt_hdr_restart_block {
+	struct ckpt_hdr h;
+	__u64 fn;
+	__u64 arg_0;
+	__u64 arg_1;
+	__u64 arg_2;
+	__u64 arg_3;
+	__u64 arg_4;
+} __attribute__((aligned(8)));
+
+enum restart_block_type {
+	CKPT_RESTART_BLOCK_NONE = 1,
+	CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP,
+	CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP,
+	CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP,
+	CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP,
+	CKPT_RESTART_BLOCK_POLL,
+	CKPT_RESTART_BLOCK_FUTEX
+};
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index d98ba71..f59f749 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,12 +15,15 @@ struct ckpt_ctx;
 #include <linux/list.h>
 #include <linux/path.h>
 #include <linux/fs.h>
+#include <linux/ktime.h>
 
 #define CKPT_VERSION  1
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
+	ktime_t ktime_begin;	/* checkpoint start time */
+
 	pid_t root_pid;		/* container identifier */
 	struct task_struct *root_task;	/* container root task */
 	struct nsproxy *root_nsproxy;	/* container root nsproxy */
-- 
1.5.4.3

* [RFC v14][PATCH 19/54] Checkpoint multiple processes
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Checkpointing of multiple processes works by recording the task tree
structure below a given task (usually this task is the container init).

For a given task, do a DFS scan of the task tree and collect the tasks
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
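
The two-pass scheme can be sketched in userspace as follows. This is
only an illustration under stated assumptions, not the kernel code:
'struct node', tree_walk() and tree_collect() are hypothetical names,
and a first-child/next-sibling tree stands in for the kernel's
children/sibling lists (no locking, no reference counting).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a task in the tree. */
struct node {
	int pid;
	struct node *parent;
	struct node *child;	/* first child, or NULL */
	struct node *sibling;	/* next sibling, or NULL */
};

/*
 * Walk the subtree under 'root' in DFS (pre-order) without recursion.
 * If 'arr' is NULL only count; otherwise also store up to 'max' nodes.
 * Returns the number of nodes visited, or -1 if 'arr' is too small.
 */
static int tree_walk(struct node *root, struct node **arr, int max)
{
	struct node *n = root;
	int nr = 0;

	while (n) {
		if (arr) {
			if (nr == max)
				return -1;	/* tree grew between passes */
			arr[nr] = n;
		}
		nr++;

		/* if it has children, proceed with the first child */
		if (n->child) {
			n = n->child;
			continue;
		}
		/* else climb until a sibling exists or we reach the root */
		while (n != root && !n->sibling)
			n = n->parent;
		if (n == root)
			break;
		n = n->sibling;
	}
	return nr;
}

/* Pass 1 counts, then allocate, then pass 2 fills. Caller frees *out. */
static int tree_collect(struct node *root, struct node ***out)
{
	int n, m;

	assert(root != NULL);

	n = tree_walk(root, NULL, 0);
	*out = calloc(n, sizeof(**out));
	if (!*out)
		return -1;

	m = tree_walk(root, *out, n);
	if (m != n) {		/* tree changed between the two passes */
		free(*out);
		*out = NULL;
		return -1;
	}
	return n;
}
```

For a root with pid 1 whose children are 2 and 3, where 2 has a child
4, the array is filled in the order 1, 2, 4, 3: every task appears
after its parent, which is what lets restart create tasks in array
order.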

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies, as well as session ids.

Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
  - Refuse checkpoint (for now) if task is ptraced
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
  - Discard 'h.parent' field
  - Check whether calls to ckpt_hbuf_get() fail
  - Disallow threads or siblings to container init

Changelog[v13]:
  - Release tasklist_lock in error path in ckpt_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c          |  256 ++++++++++++++++++++++++++++++++++---
 checkpoint/sys.c                 |   16 +++
 include/linux/checkpoint_hdr.h   |   16 +++-
 include/linux/checkpoint_types.h |   10 +-
 4 files changed, 273 insertions(+), 25 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 07901c1..0299046 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -166,6 +166,233 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		ckpt_debug("dumping task #%d\n", n);
+		ret = checkpoint_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+static int may_checkpoint_task(struct task_struct *t, struct ckpt_ctx *ctx)
+{
+	struct pid_namespace *ns = ctx->root_nsproxy->pid_ns;
+
+	ckpt_debug("check %d\n", task_pid_nr_ns(t, ns));
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+		return -EAGAIN;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_READ))
+		return -EPERM;
+
+	/* verify that the task is frozen (unless self) */
+	if (t != current && !frozen(t))
+		return -EBUSY;
+
+	/* FIX: add support for ptraced tasks */
+	if (task_ptrace(t))
+		return -EBUSY;
+
+	/*
+	 * FIX: for now, disallow siblings of container init created
+	 * via CLONE_PARENT (unclear if they will remain possible)
+	 */
+	if (ctx->root_init && t != ctx->root_task &&
+	    t->real_parent == ctx->root_task->real_parent)
+		return -EINVAL;
+
+	/* FIX: change this for nested containers */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CKPT_HDR_PIDS_CHUNK	256
+
+static int checkpoint_pids(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_pids *h;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int nr_tasks, n, pos = 0, ret = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	nr_tasks = ctx->nr_tasks;
+	BUG_ON(nr_tasks <= 0);
+
+	h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	if (!h)
+		return -ENOMEM;
+
+	do {
+		rcu_read_lock();
+		for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
+			task = tasks_arr[pos];
+
+			h[n].vpid = task_pid_nr_ns(task, ns);
+			h[n].vtgid = task_tgid_nr_ns(task, ns);
+			h[n].vpgid = task_pgrp_nr_ns(task, ns);
+			h[n].vsid = task_session_nr_ns(task, ns);
+			h[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
+				   pos, h[n].vpid, h[n].vtgid, h[n].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK);
+		ret = ckpt_kwrite(ctx, h, n * sizeof(*h));
+		if (ret < 0)
+			break;
+
+		nr_tasks -= n;
+	} while (nr_tasks > 0);
+
+	_ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int tree_count_tasks(struct ckpt_ctx *ctx)
+{
+	struct task_struct *root;
+	struct task_struct *task;
+	struct task_struct *parent;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int nr_tasks = ctx->nr_tasks;
+	int nr = 0;
+	int ret;
+
+	read_lock(&tasklist_lock);
+
+	/* we hold the lock, so root_task->real_parent can't change */
+	task = ctx->root_task;
+	if (ctx->root_init) {
+		/* container-init: start from container parent */
+		parent = task->real_parent;
+		root = parent;
+	} else {
+		/* non-container-init: start from root task and down */
+		parent = NULL;
+		root = task;
+	}
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		/* is this task cool ? */
+		ret = may_checkpoint_task(task, ctx);
+		if (ret < 0) {
+			nr = ret;
+			break;
+		}
+		if (tasks_arr) {
+			/* unlikely... but if so then try again later */
+			if (nr == nr_tasks) {
+				nr = -EAGAIN; /* cleanup in ckpt_ctx_free() */
+				break;
+			}
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+		nr++;
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root)
+			break;
+	}
+
+	read_unlock(&tasklist_lock);
+	return nr;
+}
+
+/*
+ * build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->nr_tasks will hold the total count.
+ * The array is cleaned up by ckpt_ctx_free().
+ */
+static int build_tree(struct ckpt_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = tree_count_tasks(ctx);
+	if (n < 0)
+		return n;
+
+	ctx->nr_tasks = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in ckpt_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int checkpoint_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (!h)
+		return -ENOMEM;
+
+	h->nr_tasks = ctx->nr_tasks;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ret = checkpoint_pids(ctx);
+	return ret;
+}
+
+
 static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task = NULL;
@@ -183,26 +410,6 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 	if (!task)
 		goto out;
 
-#if 0	/* enable to use containers */
-	if (!is_container_init(task)) {
-		err = -EINVAL;
-		goto out;
-	}
-#endif
-
-	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
-		err = -EPERM;
-		goto out;
-	}
-
-	/* verify that the task is frozen (unless self) */
-	if (task != current && !frozen(task))
-		return -EBUSY;
-
-	/* FIX: add support for ptraced tasks */
-	if (task_ptrace(task))
-		return -EBUSY;
-
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
 	if (nsproxy)
@@ -214,6 +421,7 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
+	ctx->root_init = is_container_init(task);
 
 	return 0;
 
@@ -258,10 +466,16 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	ret = ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
+	ret = build_tree(ctx);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, ctx->root_task);
+	ret = checkpoint_tree(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_all_tasks(ctx);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 6ba0446..6484c03 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -194,6 +194,19 @@ void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
  * restart operation, and persists until the operation is completed.
  */
 
+static void task_arr_free(struct ckpt_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
@@ -206,6 +219,9 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	ckpt_pgarr_free(ctx);
 	ckpt_obj_hash_free(ctx);
 
+	if (ctx->tasks_arr)
+		task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 84403f2..03846ca 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -47,7 +47,8 @@ enum {
 	CKPT_HDR_FNAME,
 	CKPT_HDR_OBJREF,
 
-	CKPT_HDR_TASK = 101,
+	CKPT_HDR_TREE = 101,
+	CKPT_HDR_TASK,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
@@ -111,6 +112,19 @@ struct ckpt_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+/* task tree */
+struct ckpt_hdr_tree {
+	struct ckpt_hdr h;
+	__s32 nr_tasks;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_pids {
+	__s32 vpid;
+	__s32 vppid;
+	__s32 vtgid;
+	__s32 vpgid;
+	__s32 vsid;
+} __attribute__((aligned(8)));
 
 /* task data */
 struct ckpt_hdr_task {
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f59f749..c3040a7 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -24,9 +24,10 @@ struct ckpt_ctx {
 
 	ktime_t ktime_begin;	/* checkpoint start time */
 
-	pid_t root_pid;		/* container identifier */
-	struct task_struct *root_task;	/* container root task */
-	struct nsproxy *root_nsproxy;	/* container root nsproxy */
+	int root_init;		/* is root a container init ? */
+	pid_t root_pid;		/* (container) root identifier */
+	struct task_struct *root_task;	/* (container) root task */
+	struct nsproxy *root_nsproxy;	/* (container) root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
@@ -37,6 +38,9 @@ struct ckpt_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int nr_tasks;			/* size of tasks array */
+
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [RFC v14][PATCH 20/54] Restart multiple processes
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (18 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 19/54] Checkpoint multiple processes Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 21/54] Define subtree flag and unpriv_allowed sysctl Oren Laadan
                     ` (35 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order in which the tasks were saved. The internals of
the syscall will take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx), and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them the
common restart context. Once everyone is ready, it begins to restart
itself.

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.
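
The "wait until everyone has entered" step amounts to a countdown: the
coordinator sets a counter to the number of other tasks, and the last
task to check in signals it. The sketch below is a single-threaded
model using C11 atomics, not the kernel code. 'struct start_sync' and
both helpers are hypothetical names, and a plain flag stands in for
the completion the coordinator would sleep on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the start-of-restart synchronization. */
struct start_sync {
	atomic_int tasks_count;	/* tasks that have not checked in yet */
	bool all_ready;		/* stands in for a completion */
};

static void start_sync_init(struct start_sync *s, int nr_pids)
{
	assert(nr_pids >= 1);
	/* everyone except the coordinator has to check in */
	atomic_init(&s->tasks_count, nr_pids - 1);
	s->all_ready = (nr_pids == 1);
}

/* Called once per non-coordinator task; true for the last arrival. */
static bool start_sync_checkin(struct start_sync *s)
{
	/* atomic_fetch_sub() returns the value before the decrement */
	if (atomic_fetch_sub(&s->tasks_count, 1) == 1) {
		s->all_ready = true;	/* would wake the coordinator */
		return true;
	}
	return false;
}
```

With three pids in the image, the first check-in returns false and the
second returns true. With a single pid there is nothing to wait for,
so the coordinator is ready right after initialization.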

When a task (init or not) completes its restart, it hands the control
over to the next in line, by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
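
The turn-taking on the pids array can be modeled in a few lines. This
is a simplified, single-threaded sketch: 'struct turn_ctx' and the
turn_*() helpers are hypothetical names, and a task whose turn has not
come gets -EAGAIN where the kernel code would sleep on a wait queue.

```c
#include <assert.h>
#include <errno.h>

#define MAX_PIDS 8

/* Hypothetical, simplified model of the restart context. */
struct turn_ctx {
	int pids[MAX_PIDS];	/* pids in the order saved at checkpoint */
	int nr_pids;
	int active;		/* index of the task whose turn it is */
	int done;		/* set once the last task has finished */
};

/* pid of the currently active task, or -1 once all have finished */
static int turn_active_pid(const struct turn_ctx *ctx)
{
	return ctx->active < ctx->nr_pids ? ctx->pids[ctx->active] : -1;
}

/*
 * Attempt to restore task 'pid'. Returns -EAGAIN if it is not this
 * task's turn, or 0 after the task restored itself and handed
 * control over to the next task in line.
 */
static int turn_restore(struct turn_ctx *ctx, int pid)
{
	assert(ctx->nr_pids <= MAX_PIDS);

	if (turn_active_pid(ctx) != pid)
		return -EAGAIN;

	/* ... the task would restore its own state here ... */

	ctx->active++;			/* hand over to the next task */
	if (ctx->active == ctx->nr_pids)
		ctx->done = 1;		/* last task notifies the coordinator */
	return 0;
}
```

Given pids {10, 11, 12}, a call for pid 12 fails with -EAGAIN until
both 10 and 11 have gone through, which mirrors the ordering
guarantee described above.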

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v14]:
  - Revert change to pr_debug(), back to ckpt_debug()
  - Discard field 'h.parent'
  - Check whether calls to ckpt_hbuf_get() fail

Changelog[v13]:
  - Clear root_task->checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restore_task() prototype
  - Remove unused member 'pids_err' from 'struct ckpt_ctx'

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/restart.c             |  242 ++++++++++++++++++++++++++++++++++++--
 checkpoint/sys.c                 |   35 ++++--
 include/linux/checkpoint.h       |    3 +
 include/linux/checkpoint_types.h |   17 +++-
 include/linux/sched.h            |    4 +
 5 files changed, 281 insertions(+), 20 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index cf11b5a..edc89ba 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -13,6 +13,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
@@ -335,12 +336,233 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* restore_read_tree - read the tasks tree into the checkpoint context */
+static int restore_read_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int size, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->nr_tasks < 0)
+		goto out;
+
+	ctx->nr_pids = h->nr_tasks;
+	size = sizeof(*ctx->pids_arr) * ctx->nr_pids;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = ckpt_kread(ctx, ctx->pids_arr, size);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static inline pid_t active_pid(struct ckpt_ctx *ctx)
+{
+	return ctx->pids_arr[ctx->active_pid].vpid;
+}
+
+static int restore_wait_task(struct ckpt_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	ckpt_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, active_pid(ctx) == pid);
+}
+
+static int restore_next_task(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task;
+
+	ctx->active_pid++;
+
+	ckpt_debug("active_pid %d of %d\n", ctx->active_pid, ctx->nr_pids);
+	if (ctx->active_pid == ctx->nr_pids) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ckpt_debug("pids_next %d\n", active_pid(ctx));
+
+	rcu_read_lock();
+	task = find_task_by_pid_ns(active_pid(ctx), ctx->root_nsproxy->pid_ns);
+	if (task)
+		wake_up_process(task);
+	rcu_read_unlock();
+
+	if (!task) {
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(restore_waitq);
+
+static int do_restore_task(pid_t pid)
+{
+	struct task_struct *root_task;
+	struct ckpt_ctx *ctx = NULL;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(restore_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		ckpt_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = restore_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_next_task(ctx);
+ out:
+	ckpt_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+/**
+ * wait_all_tasks_start - wait for all tasks to enter sys_restart()
+ * @ctx: checkpoint context
+ *
+ * Called by the container root to wait until all restarting tasks
+ * are ready to restore their state. Temporarily advertises the 'ctx'
+ * on 'current->checkpoint_ctx' so that others can grab a reference
+ * to it, and clears it once synchronization completes. See also the
+ * related code in do_restore_task().
+ */
+static int wait_all_tasks_start(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->nr_pids == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&restore_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return ret;
+}
+
+static int wait_all_tasks_finish(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->nr_pids == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = restore_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int ctx_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->nr_pids - 1);
+
 	return 0;
 }
 
+static int do_restore_root(struct ckpt_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tree(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = ctx_restart(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	/* wait for all other tasks to enter do_restore_task() */
+	ret = wait_all_tasks_start(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = restore_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	/* wait for all other tasks to complete do_restore_task() */
+	ret = wait_all_tasks_finish(ctx);
+	if (ret < 0)
+		return ret;
+
+	return restore_read_tail(ctx);
+}
+
 static int restore_retval(void)
 {
 	struct pt_regs *regs = task_pt_regs(current);
@@ -373,18 +595,18 @@ int do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = ctx_restart(ctx, pid);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_header(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_task(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_tail(ctx);
+	if (ctx)
+		ret = do_restore_root(ctx, pid);
+	else
+		ret = do_restore_task(pid);
+
 	if (ret < 0)
 		return ret;
 
+	/*
+	 * The retval from either is what we return to the caller when all
+	 * goes well: this is the retval from the original syscall that was
+	 * interrupted during checkpoint (zero if the task was in userspace).
+	 */
 	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 6484c03..a613748 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -209,6 +209,8 @@ static void task_arr_free(struct ckpt_ctx *ctx)
 
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -227,6 +229,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -242,8 +246,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
 	ctx->flags = flags;
 	ctx->ktime_begin = ktime_get();
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -258,13 +264,24 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long flags)
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
-
+	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
 	ckpt_ctx_free(ctx);
 	return ERR_PTR(err);
 }
 
+void ckpt_ctx_get(struct ckpt_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void ckpt_ctx_put(struct ckpt_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		ckpt_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -294,7 +311,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
@@ -309,7 +326,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct ckpt_ctx *ctx;
+	struct ckpt_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -317,16 +334,18 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = ckpt_ctx_alloc(fd, flags | CKPT_CTX_RESTART);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = ckpt_ctx_alloc(fd, flags | CKPT_CTX_RESTART);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8964a12..859897f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -51,6 +51,9 @@ extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
 			   enum obj_type type);
 
+extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
+extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
+
 extern int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
 extern int do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c3040a7..85eb184 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -16,6 +16,8 @@ struct ckpt_ctx;
 #include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
 
 #define CKPT_VERSION  1
 
@@ -38,8 +40,7 @@ struct ckpt_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int nr_tasks;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
 
@@ -47,6 +48,18 @@ struct ckpt_ctx {
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int nr_tasks;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct ckpt_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int nr_pids;			/* size of pids array */
+	int active_pid;			/* (next) position in pids array */
+	atomic_t tasks_count;		/* sync of tasks: used to coordinate */
+	struct completion complete;	/* container root and other tasks on */
+	wait_queue_head_t waitq;	/* start, end, and restart ordering */
 };
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..d057e7a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1429,6 +1429,10 @@ struct task_struct {
 	/* state flags for use by tracers */
 	unsigned long trace;
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+	struct ckpt_ctx *checkpoint_ctx;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [RFC v14][PATCH 21/54] Define subtree flag and unpriv_allowed sysctl
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (19 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 20/54] Restart " Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 22/54] Checkpoint open pipes Oren Laadan
                     ` (34 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Define a sysctl 'ckpt_unpriv_allowed' which determines whether all
checkpoints and restarts require CAP_SYS_ADMIN.  If it is 1 (the
default), unprivileged users may checkpoint and restart, and the
regular permission checks are relied upon to prevent privilege
escalation. Setting it to 0 requires CAP_SYS_ADMIN, which prevents
unprivileged users from exploiting any privilege escalation bugs.

Define a CHECKPOINT_SUBTREE flag for sys_checkpoint() which allows
checkpointing a subtree of processes. Without it, the syscall expects to
checkpoint an entire container (in the sense of a pid namespace),
starting with the container init task.
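
Taken together, the checks behave roughly like the sketch below. It is
a userspace model, not the kernel code: checkpoint_precheck() is a
hypothetical helper, and its boolean arguments stand in for the sysctl
value, capable(CAP_SYS_ADMIN) and is_container_init(). In the patch
the subtree check lives in get_container() rather than at syscall
entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define CHECKPOINT_SUBTREE	0x4
#define CKPT_USER_FLAGS		(CHECKPOINT_SUBTREE)

/*
 * Hypothetical model of the checks applied to a checkpoint request:
 *   unpriv_allowed - value of the ckpt_unpriv_allowed sysctl
 *   is_admin       - whether the caller has CAP_SYS_ADMIN
 *   root_is_init   - whether the target task is a container init
 */
static int checkpoint_precheck(unsigned long flags, bool unpriv_allowed,
			       bool is_admin, bool root_is_init)
{
	/* reject flags that userspace is not allowed to pass */
	if (flags & ~(unsigned long)CKPT_USER_FLAGS)
		return -EINVAL;

	/* sysctl at 0: only privileged callers may proceed */
	if (!unpriv_allowed && !is_admin)
		return -EPERM;

	/* without the subtree flag, the root must be a container init */
	if (!(flags & CHECKPOINT_SUBTREE) && !root_is_init)
		return -EBUSY;

	return 0;
}
```

For example, an unprivileged checkpoint of a non-init task returns
-EBUSY with flags 0 and succeeds with CHECKPOINT_SUBTREE, as long as
the sysctl is left at 1.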

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c          |    4 ++++
 checkpoint/restart.c             |    2 +-
 checkpoint/sys.c                 |   17 +++++++++++++++--
 include/linux/checkpoint_types.h |   12 +++++++++++-
 kernel/sysctl.c                  |   19 +++++++++++++++++++
 5 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 0299046..6305e5d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -423,6 +423,10 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 	ctx->root_nsproxy = nsproxy;
 	ctx->root_init = is_container_init(task);
 
+	/* FIX: does this error code make sense here ? */
+	if (!(ctx->flags & CHECKPOINT_SUBTREE) && !ctx->root_init)
+		return -EBUSY;
+
 	return 0;
 
  out:
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index edc89ba..e5a29fb 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -287,7 +287,7 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
 	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
 		goto out;
-	if (h->flags & ~CKPT_CTX_CHECKPOINT)
+	if (h->flags & ~(CKPT_CTX_CHECKPOINT | CKPT_USER_FLAGS))
 		goto out;
 	if (h->uts_release_len != sizeof(uts->release) ||
 	    h->uts_version_len != sizeof(uts->version) ||
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index a613748..e3f7012 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -21,6 +21,13 @@
 #include <linux/checkpoint.h>
 
 /*
+ * ckpt_unpriv_allowed - sysctl_controlled, do not allow checkpoint of
+ * a set of tasks which do not form a fully isolated container, if 0.
+ */
+int ckpt_unpriv_allowed = 1;	/* default: yes */
+
+
+/*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
  *
@@ -296,10 +303,13 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	struct ckpt_ctx *ctx;
 	int ret;
 
-	/* no flags for now */
-	if (flags)
+	/* check user flags */
+	if (flags & ~CKPT_USER_FLAGS)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (pid == 0)
 		pid = current->pid;
 	ctx = ckpt_ctx_alloc(fd, flags | CKPT_CTX_CHECKPOINT);
@@ -334,6 +344,9 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 85eb184..09d3238 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -10,6 +10,13 @@
  *  distribution for more details.
  */
 
+#define CKPT_VERSION  1
+
+#define CHECKPOINT_SUBTREE	0x4
+
+
+#ifdef __KERNEL__
+
 struct ckpt_ctx;
 
 #include <linux/list.h>
@@ -19,7 +26,6 @@ struct ckpt_ctx;
 #include <linux/sched.h>
 #include <asm/atomic.h>
 
-#define CKPT_VERSION  1
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
@@ -67,5 +73,9 @@ struct ckpt_ctx {
 #define CKPT_CTX_CHECKPOINT	0x1
 #define CKPT_CTX_RESTART	0x2
 
+#define CKPT_USER_FLAGS		(CHECKPOINT_SUBTREE)
+
+
+#endif /* __KERNEL__ */
 
 #endif /* _LINUX_CHECKPOINT_TYPES_H_ */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e3d2c7d..21f9c48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -192,6 +192,10 @@ int sysctl_legacy_va_layout;
 extern int prove_locking;
 extern int lock_stat;
 
+#ifdef CONFIG_CHECKPOINT
+extern int ckpt_unpriv_allowed;
+#endif
+
 /* The default sysctl tables: */
 
 static struct ctl_table root_table[] = {
@@ -910,6 +914,20 @@ static struct ctl_table kern_table[] = {
 		.child		= slow_work_sysctls,
 	},
 #endif
+#ifdef CONFIG_CHECKPOINT
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "ckpt_unpriv_allowed",
+		.data		= &ckpt_unpriv_allowed,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
@@ -1302,6 +1320,7 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= &scan_unevictable_handler,
 	},
 #endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [RFC v14][PATCH 22/54] Checkpoint open pipes
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (20 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 21/54] Define subtree flag and unpriv_allowed sysctl Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 23/54] Restore " Oren Laadan
                     ` (33 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

A pipe is essentially a double-headed inode with a buffer attached to
it. We checkpoint the pipe buffer only once, as soon as we hit one
side of the pipe, regardless of whether it is the read or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table:

If not found, it is the first encounter of this pipe. Besides the file
descriptor, we also (a) save the pipe data, and (b) register the pipe
inode in the hash. We save the 'objref' of the inode in '->fd_objref'
of the file descriptor. The file descriptor type becomes CKPT_FD_PIPE.

If found, it is the second encounter of this pipe, meaning that we have
hit the other end of the same pipe. In this case we need only record the
reference ('objref') to the inode that we had saved before, and the
file descriptor type is changed to CKPT_FD_OBJREF.

The type CKPT_FD_PIPE will indicate to the kernel to create a new pipe;
since both ends are created at the same time, one end will be used,
and the other end will be deposited in the hash table for later use.
The type CKPT_FD_OBJREF will indicate that the corresponding file
descriptor is already set up and registered in the hash using the
'->fd_objref' that it had been assigned.
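
The first/second encounter rule above can be sketched in plain userspace
C. This is only an illustration under stated assumptions: 'struct objhash',
'obj_lookup_add()', 'classify_pipe_end()' and the FD_* names are
hypothetical stand-ins, not the kernel objhash API, and the table is a
linear array rather than a real hash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's objhash; names are illustrative. */
enum fd_kind { FD_PIPE, FD_OBJREF };

#define MAX_OBJS 64

struct objhash {
	const void *ptr[MAX_OBJS];	/* registered objects (e.g. pipe inodes) */
	int count;			/* number of objrefs handed out so far */
};

/*
 * Look up 'ptr' and register it if absent. Returns a 1-based objref, or
 * -1 if the table is full. Sets *first to 1 only on the first encounter.
 */
static int obj_lookup_add(struct objhash *h, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->count == MAX_OBJS)
		return -1;
	h->ptr[h->count++] = ptr;
	*first = 1;
	return h->count;
}

/* Decide how one end of a pipe is recorded in the checkpoint image. */
static enum fd_kind classify_pipe_end(struct objhash *h, const void *inode,
				      int *objref)
{
	int first;

	*objref = obj_lookup_add(h, inode, &first);
	assert(*objref > 0);
	/* First encounter: the pipe data is saved with this descriptor. */
	return first ? FD_PIPE : FD_OBJREF;
}
```

Both ends of one pipe share an inode, so the second call with the same
pointer yields the same objref and the "reference only" kind.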

The format of the pipe data is as follows:

struct ckpt_hdr_fd_pipe {
       __u32 nr_bufs;
}

ckpt_hdr + ckpt_hdr_fd_ent
	ckpt_hdr + ckpt_hdr_fd_data
		ckpt_hdr + ckpt_hdr_fd_pipe		-> # buffers
			ckpt_hdr + ckpt_hdr_buffer	-> 1st buffer
			ckpt_hdr + ckpt_hdr_buffer	-> 2nd buffer
			ckpt_hdr + ckpt_hdr_buffer	-> 3rd buffer
			...

Changelog[v14]:
  - Revert change to pr_debug(), back to ckpt_debug()
  - Test that a pipe's inode != ctx->file's inode to prevent deadlock
  - Discard the 'h.parent' field

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/files.c               |    4 +-
 checkpoint/objhash.c             |   30 ++++++++++
 fs/pipe.c                        |  111 ++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h   |   13 +++++
 include/linux/checkpoint_types.h |    3 +
 include/linux/fs.h               |    2 +
 6 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 80e1c02..835e39c 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -373,8 +373,8 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 	return ret;
 }
 
-static struct file *generic_file_restore(struct ckpt_ctx *ctx,
-					 struct ckpt_hdr_file *ptr)
+struct file *generic_file_restore(struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr)
 {
 	struct file *file;
 	int ret;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 5476b0a..8e43432 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -42,10 +42,21 @@ struct ckpt_obj_hash {
 	int next_free_objref;
 };
 
+int checkpoint_bad(struct ckpt_ctx *ctx, void *ptr)
+{
+	BUG();
+}
+
+void *restore_bad(struct ckpt_ctx *ctx)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 /*
  * helper grab/drop functions:
  *   obj_no_{drop,grab}: for objects ignored/skipped
  *   obj_file_{drop,grab}: for file objects
+ *   obj_inode_{drop,grab}: for inode objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -70,6 +81,16 @@ static void obj_file_drop(void *ptr)
 	fput((struct file *) ptr);
 }
 
+static int obj_inode_grab(void *ptr)
+{
+	return (igrab((struct inode *) ptr) ? 0 : -EBADF);
+}
+
+static void obj_inode_drop(void *ptr)
+{
+	iput((struct inode *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -87,6 +108,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_file,
 		.restore = restore_file,
 	},
+	/* inode object */
+	{
+		.obj_name = "INODE",
+		.obj_type = CKPT_OBJ_INODE,
+		.ref_drop = obj_inode_drop,
+		.ref_grab = obj_inode_grab,
+		.checkpoint = checkpoint_bad,	/* no c/r at inode level */
+		.restore = restore_bad,		/* no c/r at inode level */
+	},
 };
 
 
diff --git a/fs/pipe.c b/fs/pipe.c
index 13414ec..651a7fc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -22,6 +22,9 @@
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
 
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
 /*
  * We use a start+len construction, which provides full use of the 
  * allocated memory.
@@ -795,6 +798,111 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+/* checkpoint_pipebuf - dump contents of a pipe/fifo (assume i_mutex taken) */
+static int checkpoint_pipebuf(struct ckpt_ctx *ctx,
+			      struct pipe_inode_info *pipe)
+{
+	void *kbuf, *addr;
+	int i, ret = 0;
+
+	kbuf = (void *) __get_free_page(GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	/* this is a simplified pipe_read() */
+
+	for (i = 0; i < pipe->nrbufs; i++) {
+		int nn = (pipe->curbuf + i) & (PIPE_BUFFERS-1);
+		struct pipe_buffer *pbuf = pipe->bufs + nn;
+		const struct pipe_buf_operations *ops = pbuf->ops;
+
+		ret = ops->confirm(pipe, pbuf);
+		if (ret < 0)
+			break;
+
+		addr = ops->map(pipe, pbuf, 1);
+		memcpy(kbuf, addr + pbuf->offset, pbuf->len);
+		ops->unmap(pipe, pbuf, addr);
+
+		ret = ckpt_write_buffer(ctx, kbuf, pbuf->len);
+		if (ret < 0)
+			break;
+	}
+
+	free_page((unsigned long) kbuf);
+	return ret;
+}
+
+/* checkpoint_pipe - dump pipe (assume i_mutex taken) */
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+	struct ckpt_hdr_file_pipe_state *h;
+	struct pipe_inode_info *pipe = inode->i_pipe;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_PIPE);
+	if (!h)
+		return -ENOMEM;
+
+	h->pipe_nrbufs = pipe->nrbufs;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	return checkpoint_pipebuf(ctx, pipe);
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe *h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int objref, first, ret;
+
+	/*
+	 * We take the inode's mutex and later will call vfs_write(),
+	 * which also takes an inode's mutex. To avoid deadlock, make
+	 * sure that the two inodes are distinct.
+	 */
+	if (ctx->file->f_dentry->d_inode == inode) {
+		pr_warning("c/r: writing to pipe that is checkpointed "
+			   "may result in a deadlock ... aborting\n");
+		return -EDEADLK;
+	}
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_PIPE;
+	h->pipe_objref = objref;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (first) {
+		mutex_lock(&inode->i_mutex);
+		ret = checkpoint_pipe(ctx, inode);
+		mutex_unlock(&inode->i_mutex);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+#else
+#define pipe_file_checkpoint  NULL
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -811,6 +919,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open		= pipe_read_open,
 	.release	= pipe_read_release,
 	.fasync		= pipe_read_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations write_pipefifo_fops = {
@@ -823,6 +932,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open		= pipe_write_open,
 	.release	= pipe_write_release,
 	.fasync		= pipe_write_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations rdwr_pipefifo_fops = {
@@ -836,6 +946,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open		= pipe_rdwr_open,
 	.release	= pipe_rdwr_release,
 	.fasync		= pipe_rdwr_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 03846ca..555bbf3 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -61,6 +61,7 @@ enum {
 	CKPT_HDR_FD_TABLE = 301,
 	CKPT_HDR_FD_ENT,
 	CKPT_HDR_FILE,
+	CKPT_HDR_FILE_PIPE,
 
 	CKPT_HDR_TAIL = 5001
 };
@@ -76,6 +77,7 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 	CKPT_OBJ_FILE,
+	CKPT_OBJ_INODE,
 	CKPT_OBJ_MAX
 };
 
@@ -214,6 +216,7 @@ struct ckpt_hdr_fd_ent {
 enum file_type {
 	CKPT_FILE_IGNORE = 0,
 	CKPT_FILE_GENERIC,
+	CKPT_FILE_PIPE,
 	CKPT_FILE_MAX
 };
 
@@ -232,4 +235,14 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_pipe {
+	struct ckpt_hdr_file common;
+	__s32 pipe_objref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_pipe_state {
+	struct ckpt_hdr h;
+	__s32 pipe_nrbufs;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 09d3238..a8dc5b3 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -18,6 +18,9 @@
 #ifdef __KERNEL__
 
 struct ckpt_ctx;
+struct ckpt_hdr;
+struct ckpt_hdr_vma;
+struct ckpt_hdr_file;
 
 #include <linux/list.h>
 #include <linux/path.h>
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c9ff62..8db8b8e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2311,6 +2311,8 @@ void inode_set_bytes(struct inode *inode, loff_t bytes);
 
 #ifdef CONFIG_CHECKPOINT
 extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr);
 #else
 #define generic_file_checkpoint NULL
 #endif
-- 
1.5.4.3


* [RFC v14][PATCH 23/54] Restore open pipes
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (21 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 22/54] Checkpoint open pipes Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 24/54] Prepare to support shared memory Oren Laadan
                     ` (32 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

When seeing a CKPT_FD_PIPE file type, we create a new pipe and thus
have two file pointers (read- and write- ends). We only use one of
them, depending on which side was checkpointed first. We register the
file pointer of the other end in the hash table, with the 'objref'
given for this pipe from the checkpoint, deposited for later use. At
this point we also restore the contents of the pipe buffers.

When the other end arrives, it will have file type CKPT_FD_OBJREF. We
will then use the corresponding 'objref' to retrieve the file pointer
from the hash table, and attach it to the process.

Note the difference from the checkpoint logic: during checkpoint we
placed the _inode_ of the pipe in the hash table, while during restart
we place the resulting _file_ in the hash table.

We restore the pipe contents by manually allocating and attaching
buffers to the pipe (alternatively we could read the data from the
image file and then write it into the pipe, or use the splice() syscall).
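
The "use one end, deposit the other" scheme described above can be
sketched in userspace with pipe(2). This is a hedged illustration, not the
kernel code: 'struct deposit_table' and 'restore_pipe_end()' are
hypothetical names, file descriptors stand in for 'struct file' pointers,
and restoring the buffered pipe data is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_REFS 64

/* Hypothetical table mapping an objref to the deposited "other" end. */
struct deposit_table {
	int fd[MAX_REFS];	/* -1 means nothing is deposited */
};

static void deposit_init(struct deposit_table *t)
{
	int i;

	for (i = 0; i < MAX_REFS; i++)
		t->fd[i] = -1;
}

/*
 * Restore one pipe end identified by 'objref'. On the first encounter a
 * new pipe is created, the end selected by 'want_write' is returned and
 * the other end is deposited. On the second encounter the deposited end
 * is handed out. Returns a file descriptor, or -1 on error.
 */
static int restore_pipe_end(struct deposit_table *t, int objref, int want_write)
{
	int fds[2], fd;

	assert(t != NULL);
	if (objref <= 0 || objref >= MAX_REFS)
		return -1;

	if (t->fd[objref] >= 0) {
		/* second encounter: pick up the deposited end */
		fd = t->fd[objref];
		t->fd[objref] = -1;
		return fd;
	}

	/* first encounter: create the pipe, deposit the other end */
	if (pipe(fds) < 0)
		return -1;
	t->fd[objref] = fds[want_write ? 0 : 1];
	return fds[want_write ? 1 : 0];
}
```

In the patch itself the pipe contents are also restored during the first
encounter, before the caller receives its end.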

Changelog[v14]:
  - Discard the 'h.parent' field
  - Check whether calls to ckpt_hbuf_get() fail

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/files.c        |    7 ++
 fs/pipe.c                 |  136 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pipe_fs_i.h |    9 +++
 3 files changed, 152 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 835e39c..c6a946b 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
 #include <linux/syscalls.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -415,6 +416,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_GENERIC,
 		.restore = generic_file_restore,
 	},
+	/* pipe */
+	{
+		.file_name = "PIPE",
+		.file_type = CKPT_FILE_PIPE,
+		.restore = pipe_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 651a7fc..ab2de3c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -899,6 +899,142 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+/* restore_pipebuf - restore contents of a pipe/fifo (assume i_mutex taken) */
+static int restore_pipebuf(struct ckpt_ctx *ctx,
+			   struct pipe_inode_info *pipe, int nbufs)
+{
+	void *kbuf, *addr;
+	int i, len, ret = 0;
+
+	kbuf = (void *) __get_free_page(GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	for (i = 0; i < nbufs; i++) {
+		struct pipe_buffer *pbuf = pipe->bufs + i;
+		struct page *page;
+
+		len = _ckpt_read_nbuffer(ctx, kbuf, PAGE_SIZE);
+		if (len < 0) {
+			ret = len;
+			break;
+		}
+		page = alloc_page(GFP_HIGHUSER);
+		if (!page) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		addr = kmap_atomic(page, KM_USER0);
+		memcpy(addr, kbuf, len);
+		kunmap_atomic(addr, KM_USER0);
+
+		pbuf->page = page;
+		pbuf->ops = &anon_pipe_buf_ops;
+		pbuf->offset = 0;
+		pbuf->len = len;
+		pipe->nrbufs++;
+		pipe->tmp_page = NULL;
+	}
+
+	free_page((unsigned long) kbuf);
+	return ret;
+}
+
+/* restore_pipe - restore pipe (takes the inode's i_mutex) */
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe_state *h;
+	struct inode *inode;
+	int nbufs, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_PIPE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	nbufs = h->pipe_nrbufs;
+	ckpt_hdr_put(ctx, h);
+
+	if (nbufs < 0 || nbufs > PIPE_BUFFERS)
+		return -EINVAL;
+
+	inode = file->f_dentry->d_inode;
+	mutex_lock(&inode->i_mutex);
+	ret = restore_pipebuf(ctx, inode->i_pipe, nbufs);
+	mutex_unlock(&inode->i_mutex);
+
+	return ret;
+}
+
+/* restore a pipe */
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int fds[2], which, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	file = ckpt_obj_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return file;
+	/*
+	 * If ckpt_obj_fetch() returned NULL, then this is the first
+	 * time we see this pipe so need to restore the contents.
+	 * Otherwise, use the file pointer and skip forward.
+	 */
+	if (!file) {
+		/* first encounter of this pipe: create it */
+		ret = do_pipe_flags(fds, 0);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+
+		/*
+		 * Below we return the file corresponding to one side
+		 * of the pipe for our caller to use. Now insert the
+		 * other side of the pipe to the hash, to be picked up
+		 * when that side is restored.
+		 */
+		file = fget(fds[1-which]);	/* the 'other' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+		ret = ckpt_obj_insert(ctx, file,
+				      h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0) {
+			fput(file);
+			return ERR_PTR(ret);
+		}
+
+		ret = restore_pipe(ctx, file);
+		fput(file);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		file = fget(fds[which]);	/* 'this' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+
+		/* get rid of the file descriptors (caller sets that) */
+		sys_close(fds[which]);
+		sys_close(fds[1-which]);
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
 #else
 #define pipe_file_checkpoint  NULL
 #endif /* CONFIG_CHECKPOINT */
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index c8f0385..453d048 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_PIPE_FS_I_H
 #define _LINUX_PIPE_FS_I_H
 
+#include <linux/checkpoint_hdr.h>
+
 #define PIPEFS_MAGIC 0x50495045
 
 #define PIPE_BUFFERS (16)
@@ -153,4 +155,11 @@ void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
+#endif
+
 #endif
+
-- 
1.5.4.3


* [RFC v14][PATCH 24/54] Prepare to support shared memory
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (22 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 23/54] Restore " Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 25/54] Dump anonymous- and file-mapped- " Oren Laadan
                     ` (31 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.

Handling of shared memory depends on the type of a vma; to classify a
vma we extend the 'struct vm_operations_struct' with a new function
- 'ckpt_vma_type()' - through which a vma will report an integer that
reflects its type.

mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h
- 'struct vm_operations_struct' extended with '->ckpt_vma_type' function

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/mm.h |   11 +++++++++++
 mm/shmem.c         |   15 ++-------------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 585d398..7d2f93a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -330,6 +330,17 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+	SGP_READ,	/* don't exceed i_size, don't allocate page */
+	SGP_CACHE,	/* don't exceed i_size, may allocate page */
+	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
+	SGP_WRITE,	/* may exceed i_size, may allocate page */
+};
+
+extern int shmem_getpage(struct inode *inode, unsigned long idx,
+			 struct page **pagep, enum sgp_type sgp, int *type);
+
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
diff --git a/mm/shmem.c b/mm/shmem.c
index f9cb20e..e24da02 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -99,14 +99,6 @@ static struct vfsmount *shm_mnt;
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
-	SGP_READ,	/* don't exceed i_size, don't allocate page */
-	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
-	SGP_WRITE,	/* may exceed i_size, may allocate page */
-};
-
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -119,9 +111,6 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			 struct page **pagep, enum sgp_type sgp, int *type);
-
 static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
 {
 	/*
@@ -1202,8 +1191,8 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			struct page **pagep, enum sgp_type sgp, int *type)
+int shmem_getpage(struct inode *inode, unsigned long idx,
+		  struct page **pagep, enum sgp_type sgp, int *type)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-- 
1.5.4.3


* [RFC v14][PATCH 25/54] Dump anonymous- and file-mapped- shared memory
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (23 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 24/54] Prepare to support shared memory Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 26/54] Restore " Oren Laadan
                     ` (30 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.

Anonymous shared memory is always backed by an inode in the shmem filesystem.
We use that inode to look up the VMA in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have
been partially re-mapped. The format itself remains the same as for
private VMAs, except that instead of addresses we record _indices_
(page numbers) into the backing inode.
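
As a rough userspace sketch of the index-based scan described above: the
helper names, the fixed 4 KiB page size and the array of page pointers are
assumptions made for illustration only; the patch itself walks the inode
with shmem_getpage().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch only (4 KiB). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Number of page indices needed to cover 'size' bytes, rounded up. */
static unsigned long nr_page_indices(unsigned long size)
{
	return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Walk the backing object by page index rather than by virtual address.
 * 'pages[idx]' is NULL for a page that was never allocated; such pages
 * are clean and skipped. Indices of allocated pages are stored in 'out'
 * (at most 'out_max' of them). Returns the number of indices stored.
 */
static size_t collect_dirty_indices(void *const *pages, unsigned long size,
				    unsigned long *out, size_t out_max)
{
	unsigned long idx, end = nr_page_indices(size);
	size_t n = 0;

	for (idx = 0; idx < end && n < out_max; idx++) {
		if (pages[idx])
			out[n++] = idx;
	}
	return n;
}
```

Recording indices instead of addresses keeps the dump independent of where
(and how partially) each process happens to map the segment.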

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c            |  155 ++++++++++++++++++++++++++++++++++-----
 include/linux/checkpoint.h     |   15 +++--
 include/linux/checkpoint_hdr.h |    8 ++-
 mm/filemap.c                   |   45 +++++++++++-
 mm/mmap.c                      |    2 +-
 mm/shmem.c                     |   35 +++++++++
 6 files changed, 228 insertions(+), 32 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 4fa634a..f96a50f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
+#include <linux/swap.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -283,6 +284,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
 }
 
 /**
+ * consider_shared_page - return page pointer for dirty pages
+ * @ino - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_shared_page(struct inode *ino, unsigned long idx)
+{
+	struct page *page = NULL;
+	int ret;
+
+	/*
+	 * Inspired by do_shmem_file_read(): very simplified version.
+	 *
+	 * FIXME: consolidate with do_shmem_file_read()
+	 */
+
+	ret = shmem_getpage(ino, idx, &page, SGP_READ, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/*
+	 * Only care about dirty pages; shmem_getpage() only returns
+	 * pages that have been allocated, so they must be dirty. The
+	 * pages returned are locked and referenced.
+	 */
+
+	if (page) {
+		unlock_page(page);
+		/*
+		 * If users can be writing to this page using arbitrary
+		 * virtual addresses, take care about potential aliasing
+		 * before reading the page on the kernel side.
+		 */
+		if (mapping_writably_mapped(ino->i_mapping))
+			flush_dcache_page(page);
+		/*
+		 * Mark the page accessed if we read the beginning.
+		 */
+		mark_page_accessed(page);
+	}
+
+	return page;
+}
+
+/**
  * private_vma_fill_pgarr - fill a page-array with addr/page tuples
  * @ctx - checkpoint context
  * @vma - vma to scan
@@ -290,18 +339,17 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
  *
  * Returns the number of pages collected
  */
-static int private_vma_fill_pgarr(struct ckpt_ctx *ctx,
-				  struct vm_area_struct *vma,
-				  unsigned long *start)
+static int vma_fill_pgarr(struct ckpt_ctx *ctx,
+			  struct vm_area_struct *vma, struct inode *inode,
+			  unsigned long *start, unsigned long end)
 {
-	unsigned long end = vma->vm_end;
 	unsigned long addr = *start;
 	struct ckpt_pgarr *pgarr;
 	int nr_used;
 	int cnt = 0;
 
 	/* takes a vma (private) or an inode (shmem), but never both */
-	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+	BUG_ON(inode && vma);
 
 	do {
 		pgarr = pgarr_current(ctx);
@@ -313,7 +361,11 @@ static int private_vma_fill_pgarr(struct ckpt_ctx *ctx,
 		while (addr < end) {
 			struct page *page;
 
-			page = consider_private_page(vma, addr);
+			if (vma)
+				page = consider_private_page(vma, addr);
+			else
+				page = consider_shared_page(inode, addr);
+
 			if (IS_ERR(page))
 				return PTR_ERR(page);
 
@@ -325,7 +377,10 @@ static int private_vma_fill_pgarr(struct ckpt_ctx *ctx,
 				pgarr->nr_used++;
 			}
 
-			addr += PAGE_SIZE;
+			if (vma)
+				addr += PAGE_SIZE;
+			else
+				addr++;
 
 			if (pgarr_is_full(pgarr))
 				break;
@@ -393,24 +448,36 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
 }
 
 /**
- * checkpoint_private_contents - dump contents of a VMA with private memory
+ * checkpoint_memory_contents - dump contents of a memory region
  * @ctx - checkpoint context
- * @vma - vma to scan
+ * @vma - vma to scan (--or--)
+ * @inode - inode to scan
  *
  * Collect lists of pages that needs to be dumped, and corresponding
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int checkpoint_private_contents(struct ckpt_ctx *ctx,
-				       struct vm_area_struct *vma)
+static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma,
+				      struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
-	unsigned long addr = vma->vm_start;
+	unsigned long addr, end;
 	int cnt, ret;
 
+	BUG_ON(vma && inode);
+
+	if (vma) {
+		addr = vma->vm_start;
+		end = vma->vm_end;
+	} else {
+		addr = 0;
+		end = PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT;
+	}
+
 	/*
 	 * Work iteratively, collecting and dumping at most CKPT_PGARR_CHUNK
-	 * in each round. Each iterations is divided into two steps:
+	 * in each round. Each iteration is divided into two steps:
 	 *
 	 * (1) scan: scan through the PTEs of the vma to collect the pages
 	 * to dump (later we'll also make them COW), while keeping a list
@@ -427,12 +494,12 @@ static int checkpoint_private_contents(struct ckpt_ctx *ctx,
 	 * the actual write-out of the data to after the application is
 	 * allowed to resume execution).
 	 *
-	 * After dumpting the entire contents, conclude with a header that
+	 * After dumping the entire contents, conclude with a header that
 	 * specifies 0 pages to mark the end of the contents.
 	 */
 
-	while (addr < vma->vm_end) {
-		cnt = private_vma_fill_pgarr(ctx, vma, &addr);
+	while (addr < end) {
+		cnt = vma_fill_pgarr(ctx, vma, inode, &addr, end);
 		if (cnt == 0)
 			break;
 		else if (cnt < 0)
@@ -476,7 +543,7 @@ static int checkpoint_private_contents(struct ckpt_ctx *ctx,
  * @objref: vma object id
  */
 int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
-			   enum vma_type type, int vma_objref)
+			   enum vma_type type, int vma_objref, int ino_objref)
 {
 	struct ckpt_hdr_vma *h;
 	int ret;
@@ -495,6 +562,12 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
 
 	h->vma_type = type;
 	h->vma_objref = vma_objref;
+	h->ino_objref = ino_objref;
+
+	if (vma->vm_file)
+		h->ino_size = i_size_read(vma->vm_file->f_dentry->d_inode);
+	else
+		h->ino_size = 0;
 
 	h->vm_start = vma->vm_start;
 	h->vm_end = vma->vm_end;
@@ -523,16 +596,43 @@ int private_vma_checkpoint(struct ckpt_ctx *ctx,
 
 	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
 
-	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref, 0);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma, NULL);
+ out:
+	return ret;
+}
+
+/**
+ * shmem_vma_checkpoint - dump contents of shared memory (shmem) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @ino_objref: inode object id
+ */
+int shmem_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			 enum vma_type type, int ino_objref)
+{
+	struct file *file = vma->vm_file;
+	int ret;
+
+	ckpt_debug("type %d, ino_ref %d\n", type, ino_objref);
+	BUG_ON(!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)));
+	BUG_ON(!file);
+
+	ret = generic_vma_checkpoint(ctx, vma, type, 0, ino_objref);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_private_contents(ctx, vma);
+	if (type == CKPT_VMA_SHM_ANON_SKIP)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, NULL, file->f_dentry->d_inode);
  out:
 	return ret;
 }
 
 /**
- * anonymous_checkpoint - dump contents of anonymous vma
+ * anonymous_checkpoint - dump contents of private-anonymous vma
  * @ctx: checkpoint context
  * @vma: vma object
  */
@@ -908,6 +1008,21 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_FILE,
 		.restore = filemap_restore,
 	},
+	/* anonymous shared */
+	{
+		.vma_name = "ANON SHARED",
+		.vma_type = CKPT_VMA_SHM_ANON,
+	},
+	/* anonymous shared (skipped) */
+	{
+		.vma_name = "ANON SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+	},
+	/* file-mapped shared */
+	{
+		.vma_name = "FILE SHARED",
+		.vma_type = CKPT_VMA_SHM_FILE,
+	},
 };
 
 /**
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 859897f..53399f8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -71,11 +71,15 @@ extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
-				  int vma_objref);
+				  int vma_objref, int ino_objref);
 extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
 				  int vma_objref);
+extern int shmem_vma_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma,
+				enum vma_type type,
+				int ino_objref);
 
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
@@ -83,11 +87,10 @@ extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 extern int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_mm(struct ckpt_ctx *ctx);
 
-#define CKPT_VMA_NOT_SUPPORTED					\
-	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
-	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE	\
-	 | VM_HUGETLB | VM_NONLINEAR | VM_MAPPED_COPY |		\
-	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+#define CKPT_VMA_NOT_SUPPORTED						\
+	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
+	 VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR |	\
+	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 /* files */
 extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 555bbf3..59fab62 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -177,6 +177,9 @@ enum vma_type {
 	CKPT_VMA_VDSO,		/* special vdso vma */
 	CKPT_VMA_ANON,		/* private anonymous */
 	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_SHM_ANON,	/* shared anonymous */
+	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
+	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
 	CKPT_VMA_MAX,
 };
 
@@ -184,7 +187,10 @@ enum vma_type {
 struct ckpt_hdr_vma {
 	struct ckpt_hdr h;
 	__u32 vma_type;
-	__u32 vma_objref;	/* for vma->vm_file */
+	__s32 vma_objref;	/* objref of backing file */
+	__s32 ino_objref;	/* objref of shared segment */
+	__u32 _padding;
+	__u64 ino_size;		/* size of shared segment */
 
 	__u64 vm_start;
 	__u64 vm_end;
diff --git a/mm/filemap.c b/mm/filemap.c
index e515845..e9499d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1630,10 +1630,12 @@ page_not_uptodate:
 EXPORT_SYMBOL(filemap_fault);
 
 #ifdef CONFIG_CHECKPOINT
-static int filemap_checkpoint(struct ckpt_ctx *ctx,
-				  struct vm_area_struct *vma)
+static int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
+	struct file *file = vma->vm_file;
 	int vma_objref;
+	int ino_objref;
+	int first, ret;
 
 	/* should be private anonymous ... verify that this is the case */
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
@@ -1641,14 +1643,49 @@ static int filemap_checkpoint(struct ckpt_ctx *ctx,
 		return -ENOSYS;
 	}
 
-	BUG_ON(!vma->vm_file);
+	BUG_ON(!file);
 
 	/* checkpoint the file object first (will add to objhash) */
 	vma_objref = checkpoint_obj(ctx, vma->vm_file, CKPT_OBJ_FILE);
 	if (vma_objref < 0)
 		return vma_objref;
 
-	return  private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		/*
+		 * Citing mmap(2): "Updates to the mapping are visible
+		 * to other processes that map this file, and are
+		 * carried through to the underlying file. The file
+		 * may not actually be updated until msync(2) or
+		 * munmap(2) is called"
+		 *
+		 * Citing msync(2): "Without use of this call there is
+		 * no guarantee that changes are written back before
+		 * munmap(2) is called."
+		 *
+		 * Force msync for region of shared mapped files, to
+		 * ensure that the file system is consistent with
+		 * the checkpoint image.  (inspired by sys_msync).
+		 */
+
+		ino_objref = ckpt_obj_lookup_add(ctx, file->f_dentry->d_inode,
+					       CKPT_OBJ_INODE, &first);
+		if (ino_objref < 0)
+			return ino_objref;
+
+		if (first) {
+			ret = vfs_fsync(file, file->f_path.dentry, 0);
+			if (ret < 0)
+				return ret;
+		}
+
+		ret = generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_FILE,
+					     vma_objref, ino_objref);
+	} else {
+		ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE,
+					     vma_objref);
+	}
+
+	return ret;
 }
 
 int filemap_restore(struct ckpt_ctx *ctx,
diff --git a/mm/mmap.c b/mm/mmap.c
index 0c65512..555a6a3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2300,7 +2300,7 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 	if (!name || strcmp(name, "[vdso]"))
 		return -ENOSYS;
 
-	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0, 0);
 }
 
 int special_mapping_restore(struct ckpt_ctx *ctx,
diff --git a/mm/shmem.c b/mm/shmem.c
index e24da02..17847b0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -31,6 +31,10 @@
 #include <linux/swap.h>
 #include <linux/ima.h>
 
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/checkpoint.h>
+
 static struct vfsmount *shm_mnt;
 
 #ifdef CONFIG_SHMEM
@@ -2377,6 +2381,34 @@ static void shmem_destroy_inode(struct inode *inode)
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	enum vma_type vma_type;
+	int ino_objref;
+	int first;
+
+	/* should be shared memory ... verify that the vma is supported */
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!vma->vm_file);
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+					 CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	vma_type = (first ? CKPT_VMA_SHM_ANON : CKPT_VMA_SHM_ANON_SKIP);
+
+	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
+}
+#else
+#define shmem_checkpoint NULL
+#endif /* CONFIG_CHECKPOINT */
+
 static void init_once(void *foo)
 {
 	struct shmem_inode_info *p = (struct shmem_inode_info *) foo;
@@ -2492,6 +2524,9 @@ static struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= shmem_checkpoint,
+#endif
 };
 
 
-- 
1.5.4.3


* [RFC v14][PATCH 26/54] Restore anonymous- and file-mapped- shared memory
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (24 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 25/54] Dump anonymous- and file-mapped- " Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 27/54] s390: Expose a constant for the number of words representing the CRs Oren Laadan
                     ` (29 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that creates the suitable 'struct file *' for the mapping is
now larger and moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.

Both anonymous shared VMAs that have been read earlier (as indicated
by a lookup in objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read in directly to the backing inode, as indexed by the page numbers
(as opposed to virtual addresses).
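
For readers following the restore logic, here is a rough userspace model
of the two anonymous-shared cases described above. It is only a sketch
under stated assumptions: every name prefixed demo_ is hypothetical, a
plain array stands in for objhash, and a calloc'ed buffer stands in for
the shmem inode. What it tries to show is that a segment seen for the
first time is created at its full size and filled by page index, while a
later reference to the same objref only checks that the segment exists.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE	16	/* tiny "page", for illustration only */
#define DEMO_MAX_OBJS	8	/* capacity of the stand-in objhash */

enum demo_vma_type { DEMO_SHM_ANON, DEMO_SHM_ANON_SKIP };

struct demo_segment {		/* stands in for the backing shmem inode */
	size_t nr_pages;
	unsigned char *data;
};

struct demo_ctx {		/* stands in for objhash: objref -> object */
	struct demo_segment *objs[DEMO_MAX_OBJS];
};

struct demo_page {		/* one saved page: index within the segment */
	unsigned long idx;
	unsigned char bytes[DEMO_PAGE_SIZE];
};

/* Returns 0 on success, -1 on malformed input or allocation failure. */
static int demo_shmem_restore(struct demo_ctx *ctx, int objref,
			      enum demo_vma_type type, size_t nr_pages,
			      const struct demo_page *pages, size_t nr_saved)
{
	struct demo_segment *seg;
	size_t i;

	if (objref < 0 || objref >= DEMO_MAX_OBJS || nr_pages == 0)
		return -1;

	seg = ctx->objs[objref];
	if (seg)	/* seen before: contents are already in place */
		return type == DEMO_SHM_ANON_SKIP ? 0 : -1;

	if (type != DEMO_SHM_ANON)	/* a skip needs an earlier premiere */
		return -1;

	seg = malloc(sizeof(*seg));
	if (!seg)
		return -1;
	seg->nr_pages = nr_pages;
	seg->data = calloc(nr_pages, DEMO_PAGE_SIZE);
	if (!seg->data)
		goto fail;

	/* contents are placed by page index, not by virtual address */
	for (i = 0; i < nr_saved; i++) {
		if (pages[i].idx >= nr_pages)
			goto fail;
		memcpy(seg->data + pages[i].idx * DEMO_PAGE_SIZE,
		       pages[i].bytes, DEMO_PAGE_SIZE);
	}

	ctx->objs[objref] = seg;
	return 0;
fail:
	free(seg->data);
	free(seg);
	return -1;
}
```

Indexing by page number is what lets a VMA that maps only part of the
segment (non-zero vm_pgoff, or a shorter length) coexist with other
mappings of the same object.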

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c        |   66 ++++++++++++++++++++++++++++++++-----------
 include/linux/checkpoint.h |    6 ++++
 include/linux/mm.h         |    2 +
 mm/filemap.c               |   14 ++++++++-
 mm/shmem.c                 |   47 +++++++++++++++++++++++++++++++
 5 files changed, 116 insertions(+), 19 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index f96a50f..f5f8fcf 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -771,13 +771,36 @@ static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
 	return 0;
 }
 
+static struct page *bring_private_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+	if (ret < 0)
+		page = ERR_PTR(ret);
+	return page;
+}
+
+static struct page *bring_shared_page(unsigned long idx, struct inode *ino)
+{
+	struct page *page = NULL;
+	int ret;
+
+	ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (page)
+		unlock_page(page);
+	return page;
+}
+
 /**
  * read_pages_contents - read in data of pages in page-array chain
  * @ctx - restart context
  */
-static int read_pages_contents(struct ckpt_ctx *ctx)
+static int read_pages_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
-	struct mm_struct *mm = current->mm;
 	struct ckpt_pgarr *pgarr;
 	unsigned long *vaddrs;
 	char *buf;
@@ -787,17 +810,22 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	if (!buf)
 		return -ENOMEM;
 
-	down_read(&mm->mmap_sem);
+	down_read(&current->mm->mmap_sem);
 	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		vaddrs = pgarr->vaddrs;
 		for (i = 0; i < pgarr->nr_used; i++) {
 			struct page *page;
 
 			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
-			ret = get_user_pages(current, mm, vaddrs[i],
-					     1, 1, 1, &page, NULL);
-			if (ret < 0)
+			if (inode)
+				page = bring_shared_page(vaddrs[i], inode);
+			else
+				page = bring_private_page(vaddrs[i]);
+
+			if (IS_ERR(page)) {
+				ret = PTR_ERR(page);
 				goto out;
+			}
 
 			ret = restore_read_page(ctx, page, buf);
 			page_cache_release(page);
@@ -808,14 +836,15 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	}
 
  out:
-	up_read(&mm->mmap_sem);
+	up_read(&current->mm->mmap_sem);
 	kfree(buf);
 	return 0;
 }
 
 /**
- * restore_private_contents - restore contents of a VMA with private memory
+ * restore_memory_contents - restore contents of a memory region
  * @ctx - restart context
+ * @inode - backing inode
  *
  * Reads a header that specifies how many pages will follow, then reads
  * a list of virtual addresses into ctx->pgarr_list page-array chain,
@@ -823,7 +852,7 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
  * these steps until reaching a header specifying "0" pages, which marks
  * the end of the contents.
  */
-static int restore_private_contents(struct ckpt_ctx *ctx)
+int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long nr_pages;
@@ -845,7 +874,7 @@ static int restore_private_contents(struct ckpt_ctx *ctx)
 		ret = read_pages_vaddrs(ctx, nr_pages);
 		if (ret < 0)
 			break;
-		ret = read_pages_contents(ctx);
+		ret = read_pages_contents(ctx, inode);
 		if (ret < 0)
 			break;
 		pgarr_reset_all(ctx);
@@ -903,9 +932,9 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
  * @file - file to map (NULL for anonymous)
  * @h - vma header data
  */
-static unsigned long generic_vma_restore(struct mm_struct *mm,
-					 struct file *file,
-					 struct ckpt_hdr_vma *h)
+unsigned long generic_vma_restore(struct mm_struct *mm,
+				  struct file *file,
+				  struct ckpt_hdr_vma *h)
 {
 	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
 	unsigned long addr;
@@ -952,7 +981,7 @@ int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 	if (IS_ERR((void *) addr))
 		return PTR_ERR((void *) addr);
 
-	return restore_private_contents(ctx);
+	return restore_memory_contents(ctx, NULL);
 }
 
 /**
@@ -1012,16 +1041,19 @@ static struct restore_vma_ops restore_vma_ops[] = {
 	{
 		.vma_name = "ANON SHARED",
 		.vma_type = CKPT_VMA_SHM_ANON,
+		.restore = shmem_restore,
 	},
 	/* anonymous shared (skipped) */
 	{
 		.vma_name = "ANON SHARED (skip)",
 		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+		.restore = shmem_restore,
 	},
 	/* file-mapped shared */
 	{
 		.vma_name = "FILE SHARED",
 		.vma_type = CKPT_VMA_SHM_FILE,
+		.restore = filemap_restore,
 	},
 };
 
@@ -1040,15 +1072,15 @@ static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d inoref %d\n",
 		 (unsigned long) h->vm_start, (unsigned long) h->vm_end,
 		 (unsigned long) h->vm_flags, (int) h->vma_type,
-		 (int) h->vma_objref);
+		 (int) h->vma_objref, (int) h->ino_objref);
 
 	ret = -EINVAL;
 	if (h->vm_end < h->vm_start)
 		goto out;
-	if (h->vma_objref < 0)
+	if (h->vma_objref < 0 || h->ino_objref < 0)
 		goto out;
 	if (h->vma_type >= CKPT_VMA_MAX)
 		goto out;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 53399f8..7f359ac 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -81,9 +81,15 @@ extern int shmem_vma_checkpoint(struct ckpt_ctx *ctx,
 				enum vma_type type,
 				int ino_objref);
 
+extern unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *hh);
+
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
+
 extern int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_mm(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7d2f93a..3a40968 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1200,6 +1200,8 @@ extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			   struct ckpt_hdr_vma *hh);
 extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 				   struct ckpt_hdr_vma *hh);
+extern int shmem_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			 struct ckpt_hdr_vma *hh);
 #endif
 
 /* readahead.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index e9499d9..af83da7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1694,11 +1694,21 @@ int filemap_restore(struct ckpt_ctx *ctx,
 {
 	struct file *file;
 
+	if (h->vma_type == CKPT_VMA_FILE && (h->vm_flags & VM_SHARED))
+		return -EINVAL;
+	if (h->vma_type == CKPT_VMA_SHM_FILE && !(h->vm_flags & VM_SHARED))
+		return -EINVAL;
+
 	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
-	if (IS_ERR(file))
+	if (!file)
+		return -EINVAL;
+	else if (IS_ERR(file))
 		return PTR_ERR(file);
 
-	return private_vma_restore(ctx, mm, file, h);
+	if (h->vma_type == CKPT_VMA_FILE)
+		return private_vma_restore(ctx, mm, file, h);
+	else
+		return generic_vma_restore(mm, file, h);
 }
 #else
 #define filemap_checkpoint NULL
diff --git a/mm/shmem.c b/mm/shmem.c
index 17847b0..fbb2528 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2405,6 +2405,53 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 
 	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
 }
+
+int shmem_restore(struct ckpt_ctx *ctx,
+		  struct mm_struct *mm, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+	struct file *file;
+	int ret = 0;
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	/* if file is NULL, this is the premiere - create and insert */
+	if (!file) {
+		if (h->vma_type != CKPT_VMA_SHM_ANON)
+			return -EINVAL;
+		/*
+		 * in theory could pass NULL to mmap and let it create
+		 * the file. But, if 'shm_size != vm_end - vm_start',
+		 * or if 'vm_pgoff != 0', then the vma reflects only a
+		 * portion of the shm object and we need to "manually"
+		 * create the full shm object.
+		 */
+		file = shmem_file_setup("/dev/zero", h->ino_size, h->vm_flags);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		ret = ckpt_obj_insert(ctx, file, h->ino_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+	} else {
+		if (h->vma_type != CKPT_VMA_SHM_ANON_SKIP)
+			return -EINVAL;
+		/* Already need fput() for the file above; keep path simple */
+		get_file(file);
+	}
+
+	/* on failure, fall through to 'out' to drop the file reference */
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		ret = PTR_ERR((void *) addr);
+	else if (h->vma_type == CKPT_VMA_SHM_ANON)
+		ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ out:
+	fput(file);
+	return ret;
+}
+
 #else
 #define shmem_checkpoint NULL
 #endif /* CONFIG_CHECKPOINT */
-- 
1.5.4.3


* [RFC v14][PATCH 27/54] s390: Expose a constant for the number of words representing the CRs
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (25 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 26/54] Restore " Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 28/54] c/r: Add CKPT_COPY() macro (v4) Oren Laadan
                     ` (28 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Alexey Dobriyan, Dave Hansen

We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
    Mar 30:
            . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
    Mar 03:
            . Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/Kconfig |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2eca5fe..bf62cad 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if 64BIT
+
 config GENERIC_BUG
 	bool
 	depends on BUG
-- 
1.5.4.3


* [RFC v14][PATCH 28/54] c/r: Add CKPT_COPY() macro (v4)
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (26 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 27/54] s390: Expose a constant for the number of words representing the CRs Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:23   ` [RFC v14][PATCH 29/54] s390: define s390-specific checkpoint-restart code (v7) Oren Laadan
                     ` (27 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Alexey Dobriyan, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric.  CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays.  It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
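
To illustrate the intended use, here is a minimal userspace sketch. The
two macros are adapted from the ones in this patch (arguments
parenthesized, and the __must_be_array()/BUILD_BUG_ON() checks dropped
because they are kernel-only); the demo_* structures and the function
are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <string.h>

#define CKPT_CPT 1
#define CKPT_RST 2

/* Adapted from the patch: assignment direction depends on @op */
#define CKPT_COPY(op, SAVE, LIVE)				\
	do {							\
		if ((op) == CKPT_CPT)				\
			(SAVE) = (LIVE);			\
		else						\
			(LIVE) = (SAVE);			\
	} while (0)

/* Simplified: the kernel version also has build-time array/size checks */
#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		if ((op) == CKPT_CPT)					\
			memcpy((SAVE), (LIVE),				\
			       (count) * sizeof(*(SAVE)));		\
		else							\
			memcpy((LIVE), (SAVE),				\
			       (count) * sizeof(*(SAVE)));		\
	} while (0)

#define DEMO_NUM_REGS 4		/* hypothetical register count */

struct demo_live {		/* stands in for live task state */
	unsigned long pc;
	unsigned long regs[DEMO_NUM_REGS];
};

struct demo_save {		/* stands in for a ckpt_hdr_xxx record */
	unsigned long pc;
	unsigned long regs[DEMO_NUM_REGS];
};

/* One field list serves both checkpoint and restart */
static void demo_copy_regs(int op, struct demo_save *h, struct demo_live *t)
{
	assert(op == CKPT_CPT || op == CKPT_RST);
	CKPT_COPY(op, h->pc, t->pc);
	CKPT_COPY_ARRAY(op, h->regs, t->regs, DEMO_NUM_REGS);
}
```

Calling demo_copy_regs(CKPT_CPT, &img, &task) fills the image from the
task, and demo_copy_regs(CKPT_RST, &img, &task) does the reverse, so a
field added to the list is handled in both directions at once.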

Changelog:
    Mar 04:
            . Removed semicolons
            . Added build-time check for __must_be_array in CKPT_COPY_ARRAY
    Feb 27:
            . Changed CKPT_COPY() to use assignment, eliminating the need
              for the CKPT_COPY_BIT() macro
            . Add CKPT_COPY_ARRAY() macro to help copying register arrays,
              etc
            . Move the macro definitions inside the CR #ifdef
    Feb 25:
            . Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
 include/linux/checkpoint.h |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7f359ac..a662ea7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -111,6 +111,34 @@ extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
 
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE)				        \
+	do {							\
+		if (op == CKPT_CPT)				\
+			SAVE = LIVE;				\
+		else						\
+			LIVE = SAVE;				\
+	} while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
+	do {								\
+		(void)__must_be_array(SAVE);				\
+		(void)__must_be_array(LIVE);				\
+		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
+		if (op == CKPT_CPT)					\
+			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
+		else							\
+			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
+	} while (0)
+
+
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
@@ -142,6 +170,7 @@ extern unsigned int ckpt_debug_level;
  * CKPT_DBASE is the base flags, doesn't change
  * CKPT_DFLAG is to be redfined in each source file
  */
+
 #define ckpt_debug(fmt, args...)  \
 	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
 
-- 
1.5.4.3


* [RFC v14][PATCH 29/54] s390: define s390-specific checkpoint-restart code (v7)
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (27 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 28/54] c/r: Add CKPT_COPY() macro (v4) Oren Laadan
@ 2009-04-28 23:23   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 30/54] powerpc: provide APIs for validating and updating DABR Oren Laadan
                     ` (26 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:23 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Alexey Dobriyan, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.
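
For anyone who wants to script the freeze step rather than use the shell
commands from the patch intro, the following is a minimal sketch in C.
The file names 'tasks' and 'freezer.state' and the FROZEN keyword come
from the intro; the helper names are hypothetical, and the directory is
passed in (e.g. "/freezer/0") rather than hard-coded. It does not wait
for the group to actually reach the FROZEN state, which a real tool
would presumably need to check by reading freezer.state back.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Write @value plus a newline to @dir/@name; 0 on success, -1 on error */
static int demo_write_line(const char *dir, const char *name,
			   const char *value)
{
	char path[4096];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%s\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Move @pid into the freezer cgroup at @cgroup_dir, then request a freeze */
static int demo_freeze_task(const char *cgroup_dir, pid_t pid)
{
	char buf[32];

	assert(pid > 0);
	snprintf(buf, sizeof(buf), "%ld", (long)pid);
	if (demo_write_line(cgroup_dir, "tasks", buf) < 0)
		return -1;
	return demo_write_line(cgroup_dir, "freezer.state", "FROZEN");
}
```

With the cgroup mounted as in the intro, demo_freeze_task("/freezer/0",
pid) would be called once per task before invoking the checkpoint.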

Changelog:
    Apr 11:
            . Introduce ckpt_arch_vdso()
    Feb 27:
            . Add checkpoint_s390.h
            . Fixed up save and restore of PSW, with the non-address bits
              properly masked out
    Feb 25:
            . Make checkpoint_hdr.h safe for inclusion in userspace
            . Replace comment about vdso code
            . Add comment about restoring access registers
            . Write and read an empty ckpt_hdr_head_arch record to appease
              code (mktree) that expects it to be there
            . Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
    Feb 24:
            . Use CKPT_COPY() to unify the un/loading of cpu and mm state
            . Fix fprs definition in ckpt_hdr_cpu
            . Remove debug WARN_ON() from checkpoint.c
    Feb 23:
            . Macro-ize the un/packing of trace flags
            . Fix the crash when externally-linked
            . Break out the restart functions into restart.c
            . Remove unneeded s390_enable_sie() call
    Jan 30:
            . Switched types in ckpt_hdr_cpu to __u64 etc.
              (Per Oren suggestion)
            . Replaced direct inclusion of structs in
              ckpt_hdr_cpu with the struct members.
              (Per Oren suggestion)
            . Also ended up adding a bunch of new things
              into restart (mm_segment, ksp, etc) in vain
              attempt to get code using fpu to not segfault
              after restart.

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/include/asm/checkpoint_hdr.h |   93 ++++++++++++++++
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/compat_wrapper.S      |   12 ++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint.c              |  184 ++++++++++++++++++++++++++++++++
 arch/s390/mm/checkpoint_s390.h         |   23 ++++
 7 files changed, 318 insertions(+), 1 deletions(-)
 create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
 create mode 100644 arch/s390/mm/checkpoint.c
 create mode 100644 arch/s390/mm/checkpoint_s390.h

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..9324655
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,93 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/checkpoint_hdr.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef __s390x__
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_ACRS defined in <asm/ptrace.h> to be 16
+ * NUM_CKPT_WORDS defined in <asm/ptrace.h> to be 3
+ */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CKPT_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+};
+
+#ifdef __KERNEL__
+/* Functions for copying to/from the header structs */
+extern void ckpt_s390_regs(int op, struct ckpt_hdr_cpu *h,
+			   struct task_struct *t);
+extern void ckpt_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+			 struct mm_struct *mm);
+#endif
+
+#endif /* __s390x__ */
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index f0f19e6..3d22f17 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -267,7 +267,9 @@
 #define __NR_epoll_create1	327
 #define	__NR_preadv		328
 #define	__NR_pwritev		329
-#define NR_syscalls 330
+#define __NR_checkpoint		330
+#define __NR_restart		331
+#define NR_syscalls 332
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index fb38af6..ece87c8 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1823,3 +1823,15 @@ compat_sys_pwritev_wrapper:
 	llgfr	%r5,%r5			# u32
 	llgfr	%r6,%r6			# u32
 	jg	compat_sys_pwritev	# branch to system call
+
+	.globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+
+	.globl sys_restore_wrapper
+sys_restore_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 2c7739f..e755e93 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -338,3 +338,5 @@ SYSCALL(sys_dup3,sys_dup3,sys_dup3_wrapper)
 SYSCALL(sys_epoll_create1,sys_epoll_create1,sys_epoll_create1_wrapper)
 SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper) /* 330 */
+SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index 2a74581..b16161e 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_PAGE_STATES) += page-states.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/arch/s390/mm/checkpoint.c b/arch/s390/mm/checkpoint.c
new file mode 100644
index 0000000..127acdf
--- /dev/null
+++ b/arch/s390/mm/checkpoint.c
@@ -0,0 +1,184 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+
+#include "checkpoint_s390.h"
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+void s390_regs(int op, struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	struct thread_struct *thr = &t->thread;
+
+	/* Save the whole PSW to facilitate forensic debugging, but only
+	 * restore the address portion to avoid letting userspace do
+	 * bad things by manipulating its value.
+	 */
+	if (op == CKPT_CPT) {
+		CKPT_COPY(op, h->psw_t_addr, regs->psw.addr);
+	} else {
+		regs->psw.addr &= ~PSW_ADDR_INSN;
+		regs->psw.addr |= h->psw_t_addr;
+	}
+
+	CKPT_COPY(op, h->args[0], regs->args[0]);
+	CKPT_COPY(op, h->orig_gpr2, regs->orig_gpr2);
+	CKPT_COPY(op, h->svcnr, regs->svcnr);
+	CKPT_COPY(op, h->ilc, regs->ilc);
+	CKPT_COPY(op, h->ieee_instruction_pointer,
+		thr->ieee_instruction_pointer);
+	CKPT_COPY(op, h->psw_t_mask, regs->psw.mask);
+	CKPT_COPY(op, h->fpc, thr->fp_regs.fpc);
+	CKPT_COPY(op, h->starting_addr, thr->per_info.starting_addr);
+	CKPT_COPY(op, h->ending_addr, thr->per_info.ending_addr);
+	CKPT_COPY(op, h->address, thr->per_info.lowcore.words.address);
+	CKPT_COPY(op, h->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+	CKPT_COPY(op, h->access_id, thr->per_info.lowcore.words.access_id);
+	CKPT_COPY(op, h->single_step, thr->per_info.single_step);
+	CKPT_COPY(op, h->instruction_fetch, thr->per_info.instruction_fetch);
+
+	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, NUM_GPRS);
+	CKPT_COPY_ARRAY(op, h->fprs, thr->fp_regs.fprs, NUM_FPRS);
+	CKPT_COPY_ARRAY(op, h->acrs, thr->acrs, NUM_ACRS);
+	CKPT_COPY_ARRAY(op, h->per_control_regs,
+		      thr->per_info.control_regs.words.cr, NUM_CKPT_WORDS);
+}
+
+void s390_mm(int op, struct ckpt_hdr_mm_context *h, struct mm_struct *mm)
+{
+	CKPT_COPY(op, h->noexec, mm->context.noexec);
+	CKPT_COPY(op, h->has_pgste, mm->context.has_pgste);
+	CKPT_COPY(op, h->alloc_pgste, mm->context.alloc_pgste);
+	CKPT_COPY(op, h->asce_bits, mm->context.asce_bits);
+	CKPT_COPY(op, h->asce_limit, mm->context.asce_limit);
+}
+
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_write_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	s390_regs(CKPT_CPT, h, t);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	s390_mm(CKPT_CPT, h, mm);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+int restore_read_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	checkpoint_s390_regs(CKPT_RST, h, current);
+
+	/* s390 does not restore the access registers after a syscall,
+	 * but does on a task switch.  Since we're switching tasks (in
+	 * a way), we need to replicate that behavior here.
+	 */
+	restore_access_regs(h->acrs);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_mm(CKPT_RST, h, mm);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..c3bf24d
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,23 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void checkpoint_s390_regs(int op, struct ckpt_hdr_cpu *h,
+				 struct task_struct *t);
+extern void checkpoint_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+			       struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
-- 
1.5.4.3

* [RFC v14][PATCH 30/54] powerpc: provide APIs for validating and updating DABR
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (28 preceding siblings ...)
  2009-04-28 23:23   ` [RFC v14][PATCH 29/54] s390: define s390-specific checkpoint-restart code (v7) Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation Oren Laadan
                     ` (25 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Alexey Dobriyan, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register).  The restart code needs to validate this
value before making any changes to the current task.

ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR.  Split this into "validate"
(debugreg_valid) and "update" (debugreg_update) functions, and make
them available for use outside of the ptrace code.

Also ptrace_set_debugreg has extern linkage, but no users outside of
ptrace.c.  Make it static.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
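Note for readers (not part of the commit message): the split described
above lets a caller reject a bad value before touching any task state.
A minimal user-space sketch of that pattern follows. Every sketch_*
name and the SKETCH_* constants are hypothetical stand-ins chosen for
illustration; the real limits and flag bits come from the kernel
headers, and only the non-BookE (DABR) rules are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TASK_SIZE and the DABR flag bits. */
#define SKETCH_TASK_SIZE	UINT64_C(0x0000100000000000)
#define SKETCH_DABR_FLAGS	UINT64_C(0x7)	/* bottom 3 bits are flags */
#define SKETCH_DABR_TRANSLATION	UINT64_C(0x4)

/* Stand-in for the part of thread_struct that holds the DABR. */
struct sketch_task {
	uint64_t dabr;
};

/* "validate": pure check, never modifies the task */
static bool sketch_debugreg_valid(uint64_t val, unsigned int index)
{
	/* only one debug register is modelled */
	if (index != 0)
		return false;
	/* the address part must lie inside the user address space */
	if ((val & ~SKETCH_DABR_FLAGS) >= SKETCH_TASK_SIZE)
		return false;
	/* a non-zero value must have the translation bit set */
	if (val && !(val & SKETCH_DABR_TRANSLATION))
		return false;
	return true;
}

/* "update": assumes the caller validated val beforehand */
static void sketch_debugreg_update(struct sketch_task *task, uint64_t val,
				   unsigned int index)
{
	assert(sketch_debugreg_valid(val, index));
	task->dabr = val;
}

/* Returns 0 on success, -1 if rejected; the task is untouched on failure. */
static int sketch_restore_dabr(struct sketch_task *task, uint64_t val)
{
	if (!sketch_debugreg_valid(val, 0))
		return -1;
	sketch_debugreg_update(task, val, 0);
	return 0;
}
```

A restart path would call the equivalent of sketch_restore_dabr() once
per task; ptrace can keep its own error codes by calling the two halves
directly, as ptrace_set_debugreg does in the diff below.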
 arch/powerpc/include/asm/ptrace.h |    7 +++
 arch/powerpc/kernel/ptrace.c      |   88 +++++++++++++++++++++++++------------
 2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index c9c678f..79bc816 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -81,6 +81,8 @@ struct pt_regs {
 
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 #define instruction_pointer(regs) ((regs)->nip)
 #define user_stack_pointer(regs) ((regs)->gpr[1])
 #define regs_return_value(regs) ((regs)->gpr[3])
@@ -138,6 +140,11 @@ do {									      \
 extern void user_enable_single_step(struct task_struct *);
 extern void user_disable_single_step(struct task_struct *);
 
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+			    unsigned int index);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 3635be6..0b6cf84 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -735,22 +735,25 @@ void user_disable_single_step(struct task_struct *task)
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
-int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
-			       unsigned long data)
+/**
+ * debugreg_valid() - validate the value to be written to a debug register
+ * @val:	The prospective contents of the register.
+ * @index:	Must be zero.
+ *
+ * Returns true if @val is an acceptable value for the register indicated by
+ * @index, false otherwise.
+ */
+bool debugreg_valid(unsigned long val, unsigned int index)
 {
-	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
-	 *  For embedded processors we support one DAC and no IAC's at the
-	 *  moment.
-	 */
-	if (addr > 0)
-		return -EINVAL;
+	/* We support only one debug register for now */
+	if (index != 0)
+		return false;
 
 	/* The bottom 3 bits in dabr are flags */
-	if ((data & ~0x7UL) >= TASK_SIZE)
-		return -EIO;
+	if ((val & ~0x7UL) >= TASK_SIZE)
+		return false;
 
 #ifndef CONFIG_BOOKE
-
 	/* For processors using DABR (i.e. 970), the bottom 3 bits are flags.
 	 *  It was assumed, on previous implementations, that 3 bits were
 	 *  passed together with the data address, fitting the design of the
@@ -764,47 +767,74 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	 */
 
 	/* Ensure breakpoint translation bit is set */
-	if (data && !(data & DABR_TRANSLATION))
-		return -EIO;
-
-	/* Move contents to the DABR register */
-	task->thread.dabr = data;
-
-#endif
-#if defined(CONFIG_BOOKE)
-
+	if (val && !(val & DABR_TRANSLATION))
+		return false;
+#else
 	/* As described above, it was assumed 3 bits were passed with the data
 	 *  address, but we will assume only the mode bits will be passed
 	 *  as to not cause alignment restrictions for DAC-based processors.
 	 */
 
+	/* Read or Write bits must be set */
+	if (!(val & 0x3UL))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * debugreg_update() - update a debug register associated with a task
+ * @task:	The task whose register state is to be modified.
+ * @val:	The value to be written to the debug register.
+ * @index:	Specifies the debug register.  Currently unused.
+ *
+ * Set a task's DABR/DAC to @val, which should be validated with
+ * debugreg_valid() beforehand.
+ */
+void debugreg_update(struct task_struct *task, unsigned long val,
+		     unsigned int index)
+{
+#ifndef CONFIG_BOOKE
+	task->thread.dabr = val;
+#else
 	/* DAC's hold the whole address without any mode flags */
-	task->thread.dabr = data & ~0x3UL;
+	task->thread.dabr = val & ~0x3UL;
 
 	if (task->thread.dabr == 0) {
 		task->thread.dbcr0 &= ~(DBSR_DAC1R | DBSR_DAC1W | DBCR0_IDM);
 		task->thread.regs->msr &= ~MSR_DE;
-		return 0;
 	}
 
-	/* Read or Write bits must be set */
-
-	if (!(data & 0x3UL))
-		return -EINVAL;
-
 	/* Set the Internal Debugging flag (IDM bit 1) for the DBCR0
 	   register */
 	task->thread.dbcr0 = DBCR0_IDM;
 
 	/* Check for write and read flags and set DBCR0
 	   accordingly */
-	if (data & 0x1UL)
+	if (val & 0x1UL)
 		task->thread.dbcr0 |= DBSR_DAC1R;
-	if (data & 0x2UL)
+	if (val & 0x2UL)
 		task->thread.dbcr0 |= DBSR_DAC1W;
 
 	task->thread.regs->msr |= MSR_DE;
 #endif
+}
+
+static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
+			       unsigned long data)
+{
+	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
+	 * For embedded processors we support one DAC and no IAC's at the
+	 * moment.
+	 */
+	if (addr > 0)
+		return -EINVAL;
+
+	if (!debugreg_valid(data, 0))
+		return -EIO;
+
+	debugreg_update(task, data, 0);
+
 	return 0;
 }
 
-- 
1.5.4.3

* [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (29 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 30/54] powerpc: provide APIs for validating and updating DABR Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
       [not found]     ` <1240961064-13991-32-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:24   ` [RFC v14][PATCH 32/54] powerpc: wire up checkpoint and restart syscalls Oren Laadan
                     ` (24 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Alexey Dobriyan, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for GPRs, so migrating between ppc32 and ppc64 won't work yet.

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa

Untested:
* ppc32 (but it builds)

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
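Note for readers (not part of the commit message): the "validate
first, modify later" behaviour described above can be sketched in
user space as two passes over a table of restore functions. All
sketch_* names, the reduced structures and the assumption that this
build lacks Altivec are hypothetical and only illustrate the control
flow, not the actual image layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sketch_feature { SKETCH_USED_FP, SKETCH_USED_DEBUG, SKETCH_USED_ALTIVEC };
#define SKETCH_BIT(f)	(UINT32_C(1) << (f))

/* Assumption: this build can restore FP and debug state, not Altivec. */
#define SKETCH_FTRS_POSSIBLE \
	(SKETCH_BIT(SKETCH_USED_FP) | SKETCH_BIT(SKETCH_USED_DEBUG))

struct sketch_image {		/* reduced stand-in for ckpt_hdr_cpu */
	uint32_t features_used;
	uint32_t fpr_size;
	double fpr[4];
	uint64_t dabr;
};

struct sketch_task {		/* reduced stand-in for thread_struct */
	double fpr[4];
	uint64_t dabr;
};

typedef int (*sketch_restore_fn)(const struct sketch_image *,
				 struct sketch_task *, bool);

static int sketch_restore_fpu(const struct sketch_image *img,
			      struct sketch_task *task, bool update)
{
	/* the recorded size must match what this build expects */
	if (img->fpr_size != sizeof(task->fpr))
		return -1;
	if (!update || !(img->features_used & SKETCH_BIT(SKETCH_USED_FP)))
		return 0;
	memcpy(task->fpr, img->fpr, sizeof(task->fpr));
	return 0;
}

static int sketch_restore_dabr(const struct sketch_image *img,
			       struct sketch_task *task, bool update)
{
	if (!(img->features_used & SKETCH_BIT(SKETCH_USED_DEBUG)))
		return 0;
	/* a non-zero value must carry the (assumed) translation bit 0x4 */
	if (img->dabr && !(img->dabr & UINT64_C(0x4)))
		return -1;
	if (update)
		task->dabr = img->dabr;
	return 0;
}

static int sketch_restore_cpu(const struct sketch_image *img,
			      struct sketch_task *task)
{
	static const sketch_restore_fn funcs[] = {
		sketch_restore_fpu,
		sketch_restore_dabr,
	};
	size_t i;
	int pass;

	/* reject feature bits this build does not know how to restore */
	if (img->features_used & ~SKETCH_FTRS_POSSIBLE)
		return -1;

	/* pass 0 only checks the image, pass 1 modifies the task */
	for (pass = 0; pass < 2; pass++) {
		bool update = (pass == 1);

		for (i = 0; i < sizeof(funcs) / sizeof(funcs[0]); i++) {
			int rc = funcs[i](img, task, update);

			if (rc != 0) {
				/* pass 0 validated everything already */
				assert(!update);
				return rc;
			}
		}
	}
	return 0;
}
```

The point of the first pass is that a failure in a later function (here
the DABR check) leaves state handled by an earlier function (the FP
registers) unmodified.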
 arch/powerpc/include/asm/checkpoint_hdr.h |   17 +
 arch/powerpc/mm/Makefile                  |    1 +
 arch/powerpc/mm/checkpoint.c              |  499 +++++++++++++++++++++++++++++
 3 files changed, 517 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/mm/checkpoint.c

diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..a147a06
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,17 @@
+#ifndef __ASM_PPC_CKPT_HDR_H
+#define __ASM_PPC_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers ppc
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/checkpoint_hdr.h>
+
+/* nothing to see here */
+
+#endif /* __ASM_PPC_CKPT_HDR_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 17290bc..fe64d50 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -26,3 +26,4 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
new file mode 100644
index 0000000..7bb6594
--- /dev/null
+++ b/arch/powerpc/mm/checkpoint.c
@@ -0,0 +1,499 @@
+/*
+ *  Checkpoint/restart - architecture specific support for powerpc.
+ *  Based on x86 implementation.
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *  Copyright 2009 IBM Corp.
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define DEBUG 1 /* for pr_debug */
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum ckpt_cpu_feature {
+	CKPT_USED_FP,
+	CKPT_USED_DEBUG,
+	CKPT_USED_ALTIVEC,
+	CKPT_USED_SPE,
+	CKPT_USED_VSX,
+	CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << ftr)
+
+/* features this kernel can handle for restart */
+enum {
+	CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+	x(CKPT_USED_FP) |
+#endif
+	x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+	x(CKPT_USED_ALTIVEC) |
+#endif
+#ifdef CONFIG_SPE
+	x(CKPT_USED_SPE) |
+#endif
+#ifdef CONFIG_VSX
+	x(CKPT_USED_VSX) |
+#endif
+	0,
+};
+
+#undef x
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	u32 features_used;
+	u32 pt_regs_size;
+	u32 fpr_size;
+	struct pt_regs pt_regs;
+	/* relevant fields from thread_struct */
+	double fpr[32][TS_FPRWIDTH];
+	u32 fpscr;
+	s32 fpexc_mode;
+	u64 dabr;
+	/* Altivec/VMX state */
+	vector128 vr[32];
+	vector128 vscr;
+	u64 vrsave;
+	/* SPE state */
+	u32 evr[32];
+	u64 acc;
+	u32 spefscr;
+};
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void ckpt_cpu_feature_set(struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	hdr->features_used |= 1ULL << ftr;
+}
+
+static bool ckpt_cpu_feature_isset(const struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	return hdr->features_used & (1ULL << ftr);
+}
+
+/* determine whether an image has feature bits set that this kernel
+ * does not support */
+static bool ckpt_cpu_features_unknown(const struct ckpt_hdr_cpu *hdr)
+{
+	return hdr->features_used & ~CKPT_FTRS_POSSIBLE;
+}
+
+static void checkpoint_gprs(struct ckpt_hdr_cpu *cpu_hdr,
+			    struct task_struct *task)
+{
+	struct pt_regs *pt_regs;
+
+	pr_debug("%s: saving GPRs\n", __func__);
+
+	cpu_hdr->pt_regs_size = sizeof(*pt_regs);
+	pt_regs = task_pt_regs(task);
+	cpu_hdr->pt_regs = *pt_regs;
+}
+
+#ifdef CONFIG_PPC_FPU
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	/* easiest to save FP state unconditionally */
+
+	pr_debug("%s: saving FPU state\n", __func__);
+
+	if (task == current)
+		flush_fp_to_thread(task);
+
+	cpu_hdr->fpr_size = sizeof(cpu_hdr->fpr);
+	cpu_hdr->fpscr = task->thread.fpscr.val;
+	cpu_hdr->fpexc_mode = task->thread.fpexc_mode;
+
+	memcpy(cpu_hdr->fpr, task->thread.fpr, sizeof(cpu_hdr->fpr));
+
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_FP);
+}
+#else
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_ALTIVEC
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		return;
+
+	if (!task->thread.used_vr)
+		return;
+
+	pr_debug("%s: saving Altivec state\n", __func__);
+
+	if (task == current)
+		flush_altivec_to_thread(task);
+
+	cpu_hdr->vrsave = task->thread.vrsave;
+	memcpy(cpu_hdr->vr, task->thread.vr, sizeof(cpu_hdr->vr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_ALTIVEC);
+}
+#else
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		return;
+
+	if (!task->thread.used_spe)
+		return;
+
+	pr_debug("%s: saving SPE state\n", __func__);
+
+	if (task == current)
+		flush_spe_to_thread(task);
+
+	cpu_hdr->acc = task->thread.acc;
+	cpu_hdr->spefscr = task->thread.spefscr;
+	memcpy(cpu_hdr->evr, task->thread.evr, sizeof(cpu_hdr->evr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_SPE);
+}
+#else
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+static void checkpoint_dabr(struct ckpt_hdr_cpu *cpu_hdr,
+			    const struct task_struct *task)
+{
+	if (!task->thread.dabr)
+		return;
+
+	cpu_hdr->dabr = task->thread.dabr;
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_DEBUG);
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_write_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	int rc;
+
+	rc = -ENOMEM;
+	cpu_hdr = ckpt_hdr_get_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (!cpu_hdr)
+		goto err;
+
+	checkpoint_gprs(cpu_hdr, t);
+	checkpoint_fpu(cpu_hdr, t);
+	checkpoint_dabr(cpu_hdr, t);
+	checkpoint_altivec(cpu_hdr, t);
+	checkpoint_spe(cpu_hdr, t);
+
+	rc = ckpt_write_obj(ctx, (struct ckpt_hdr *) cpu_hdr);
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+/* Based on the MSR value from a checkpoint image, produce an MSR
+ * value that is appropriate for the restored task.  Right now we only
+ * check for MSR_SF (64-bit) for PPC64.
+ */
+static unsigned long sanitize_msr(unsigned long msr_ckpt)
+{
+#ifdef CONFIG_PPC32
+	return MSR_USER;
+#else
+	if (msr_ckpt & MSR_SF)
+		return MSR_USER64;
+	return MSR_USER32;
+#endif
+}
+
+static int restore_gprs(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	struct pt_regs *regs;
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->pt_regs_size != sizeof(*regs))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	regs = task_pt_regs(task);
+	*regs = cpu_hdr->pt_regs;
+
+	regs->msr = sanitize_msr(regs->msr);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_PPC_FPU
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->fpr_size != sizeof(task->thread.fpr))
+		goto out;
+
+	rc = 0;
+	if (!update || !ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP))
+		goto out;
+
+	task->thread.fpscr.val = cpu_hdr->fpscr;
+	task->thread.fpexc_mode = cpu_hdr->fpexc_mode;
+
+	memcpy(task->thread.fpr, cpu_hdr->fpr, sizeof(task->thread.fpr));
+out:
+	return rc;
+}
+#else
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP));
+	return 0;
+}
+#endif
+
+static int restore_dabr(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_DEBUG))
+		goto out;
+
+	rc = -EINVAL;
+	if (!debugreg_valid(cpu_hdr->dabr, 0))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	debugreg_update(task, cpu_hdr->dabr, 0);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_ALTIVEC
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.vrsave = cpu_hdr->vrsave;
+	task->thread.used_vr = 1;
+
+	memcpy(task->thread.vr, cpu_hdr->vr, sizeof(cpu_hdr->vr));
+out:
+	return rc;
+}
+#else
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC));
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.acc = cpu_hdr->acc;
+	task->thread.spefscr = cpu_hdr->spefscr;
+	task->thread.used_spe = 1;
+
+	memcpy(task->thread.evr, cpu_hdr->evr, sizeof(cpu_hdr->evr));
+out:
+	return rc;
+}
+#else
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE));
+	return 0;
+}
+#endif
+
+struct restore_func_desc {
+	int (*func)(const struct ckpt_hdr_cpu *, struct task_struct *, bool);
+	const char *info;
+};
+
+typedef int (*restore_func_t)(const struct ckpt_hdr_cpu *,
+			      struct task_struct *, bool);
+
+static const restore_func_t restore_funcs[] = {
+	restore_gprs,
+	restore_fpu,
+	restore_dabr,
+	restore_altivec,
+	restore_spe,
+};
+
+static bool bitness_match(const struct ckpt_hdr_cpu *cpu_hdr,
+			  const struct task_struct *task)
+{
+	/* 64-bit image */
+	if (cpu_hdr->pt_regs.msr & MSR_SF) {
+		if (task->thread.regs->msr & MSR_SF)
+			return true;
+		else
+			return false;
+	}
+
+	/* 32-bit image */
+	if (task->thread.regs->msr & MSR_SF)
+		return false;
+
+	return true;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	bool update;
+	int rc;
+	int i;
+
+	cpu_hdr = ckpt_read_obj_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (IS_ERR(cpu_hdr))
+		return PTR_ERR(cpu_hdr);
+
+	rc = -EINVAL;
+	if (ckpt_cpu_features_unknown(cpu_hdr))
+		goto err;
+
+	/* temporary: restoring a 32-bit image from a 64-bit task and
+	 * vice-versa is known not to work (probably not restoring
+	 * thread_info correctly); detect this and fail gracefully.
+	 */
+	if (!bitness_match(cpu_hdr, current))
+		goto err;
+
+	/* We want to determine whether there's anything wrong with
+	 * the checkpoint image before changing the task at all.  Run
+	 * a "check" phase (update = false) first.
+	 */
+	update = false;
+commit:
+	for (i = 0; i < ARRAY_SIZE(restore_funcs); i++) {
+		rc = restore_funcs[i](cpu_hdr, current, update);
+		if (rc == 0)
+			continue;
+		pr_debug("%s: restore_func[%i] failed\n", __func__, i);
+		WARN_ON_ONCE(update);
+		goto err;
+	}
+
+	if (!update) {
+		update = true;
+		goto commit;
+	}
+
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
-- 
1.5.4.3

* [RFC v14][PATCH 32/54] powerpc: wire up checkpoint and restart syscalls
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (30 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 33/54] powerpc: enable checkpoint support in Kconfig Oren Laadan
                     ` (23 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Alexey Dobriyan, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/include/asm/systbl.h |    2 ++
 arch/powerpc/include/asm/unistd.h |    4 +++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index d98a30d..d2828c2 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -325,3 +325,5 @@ SYSCALL(inotify_init1)
 SYSCALL(ni_syscall)
 COMPAT_SYS_SPU(preadv)
 COMPAT_SYS_SPU(pwritev)
+SYSCALL(checkpoint)
+SYSCALL(restart)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 3f06f8e..c3f515c 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -343,10 +343,12 @@
 #define __NR_inotify_init1	318
 #define __NR_preadv		320
 #define __NR_pwritev		321
+#define __NR_checkpoint		322
+#define __NR_restart		323
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		322
+#define __NR_syscalls		324
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
-- 
1.5.4.3

* [RFC v14][PATCH 33/54] powerpc: enable checkpoint support in Kconfig
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (31 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 32/54] powerpc: wire up checkpoint and restart syscalls Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 34/54] Export fs/exec.c:exec_mmap() Oren Laadan
                     ` (22 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nathan Lynch, Alexey Dobriyan, Dave Hansen

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/Kconfig |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4c78045..a7e50d5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -26,6 +26,9 @@ config MMU
 	bool
 	default y
 
+config CHECKPOINT_SUPPORT
+	def_bool y
+
 config GENERIC_CMOS_UPDATE
 	def_bool y
 
-- 
1.5.4.3

* [RFC v14][PATCH 34/54] Export fs/exec.c:exec_mmap()
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (32 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 33/54] powerpc: enable checkpoint support in Kconfig Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 35/54] Support for share memory address spaces Oren Laadan
                     ` (21 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Used in the next patch to attach an existing mm descriptor to a
restarting process.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/exec.c          |    2 +-
 include/linux/mm.h |    3 +++
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 052a961..17f5222 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -722,7 +722,7 @@ int kernel_read(struct file *file, unsigned long offset,
 
 EXPORT_SYMBOL(kernel_read);
 
-static int exec_mmap(struct mm_struct *mm)
+int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct * old_mm, *active_mm;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3a40968..c8e8972 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1180,6 +1180,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
+/* fs/exec.c */
+extern int exec_mmap(struct mm_struct *mm);
+
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
-- 
1.5.4.3

* [RFC v14][PATCH 35/54] Support for share memory address spaces
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (33 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 34/54] Export fs/exec.c:exec_mmap() Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
       [not found]     ` <1240961064-13991-36-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:24   ` [RFC v14][PATCH 36/54] Make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
                     ` (20 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

The task address space (task->mm) may be shared between processes if
CLONE_VM is used, and particularly among threads. Accordingly, treat
'task->mm' as a shared object: during checkpoint check against the
objhash and only dump the contents if seen for the first time. During
restart, likewise, only restore if it's a new instance, otherwise use
the one already registered in the objhash.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
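Note for readers (not part of the commit message): the "dump only if
seen for the first time" rule described above can be sketched with a
small lookup table. The sketch below uses a linear array rather than a
real hash, and every sketch_* name is hypothetical; it only shows how
two tasks sharing one mm end up with the same objref and a single dump.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_OBJS 16

/* Linear stand-in for the objhash: slot i holds the object with objref i+1. */
struct sketch_objtab {
	const void *ptr[SKETCH_MAX_OBJS];
	int count;
	int dumps;	/* how many times object contents were written */
};

/*
 * Look up ptr, registering it if it is new.  Returns the 1-based objref,
 * or -1 if the table is full.  *first is set to 1 only for a new object.
 */
static int sketch_obj_lookup_add(struct sketch_objtab *tab, const void *ptr,
				 int *first)
{
	int i;

	assert(tab->count >= 0 && tab->count <= SKETCH_MAX_OBJS);

	for (i = 0; i < tab->count; i++) {
		if (tab->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (tab->count == SKETCH_MAX_OBJS)
		return -1;

	tab->ptr[tab->count++] = ptr;
	*first = 1;
	return tab->count;
}

/* Checkpoint side: every task reports its mm, only the first one dumps it. */
static int sketch_checkpoint_mm_obj(struct sketch_objtab *tab, const void *mm)
{
	int first;
	int objref = sketch_obj_lookup_add(tab, mm, &first);

	if (objref < 0)
		return objref;
	if (first)
		tab->dumps++;	/* stands in for writing the mm contents */
	return objref;
}
```

On restart the same idea runs in reverse: the first task to meet an
objref rebuilds the mm and registers it, later tasks with the same
objref fetch the registered instance and attach to it.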
 checkpoint/memory.c            |   59 ++++++++++++++++++++++++++++++++++-----
 checkpoint/objhash.c           |   21 ++++++++++++++
 checkpoint/process.c           |   46 ++++++++++++++++++++++++++++---
 include/linux/checkpoint.h     |    7 +++-
 include/linux/checkpoint_hdr.h |    7 +++++
 5 files changed, 126 insertions(+), 14 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index f5f8fcf..7a6e3f4 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -650,10 +650,9 @@ static int anonymous_checkpoint(struct ckpt_ctx *ctx,
 	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
 }
 
-int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
 {
 	struct ckpt_hdr_mm *h;
-	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int exe_objref = 0;
 	int ret;
@@ -662,8 +661,6 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (!h)
 		return -ENOMEM;
 
-	mm = get_task_mm(t);
-
 	down_read(&mm->mmap_sem);
 
 	/* FIX: need also mm->flags */
@@ -715,10 +712,26 @@ int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t)
  out:
 	ckpt_hdr_put(ctx, h);
 	up_read(&mm->mmap_sem);
-	mmput(mm);
 	return ret;
 }
 
+int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_mm(ctx, (struct mm_struct *) ptr);
+}
+
+int checkpoint_mm_obj(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int objref;
+
+	mm = get_task_mm(t);
+	objref = checkpoint_obj(ctx, mm, CKPT_OBJ_MM);
+	mmput(mm);
+
+	return objref;
+}
+
 /*
  * Restart
  *
@@ -1120,7 +1133,7 @@ static int destroy_mm(struct mm_struct *mm)
 	return 0;
 }
 
-int restore_mm(struct ckpt_ctx *ctx)
+static struct mm_struct *do_restore_mm(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_mm *h;
 	struct mm_struct *mm;
@@ -1130,7 +1143,7 @@ int restore_mm(struct ckpt_ctx *ctx)
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
 	if (IS_ERR(h))
-		return PTR_ERR(h);
+		return (struct mm_struct *) h;
 
 	ckpt_debug("map_count %d\n", h->map_count);
 
@@ -1142,6 +1155,8 @@ int restore_mm(struct ckpt_ctx *ctx)
 		goto out;
 	if (h->exefile_objref < 0)
 		goto out;
+	if (h->map_count <= 0)
+		goto out;
 
 	mm = current->mm;
 
@@ -1191,5 +1206,33 @@ int restore_mm(struct ckpt_ctx *ctx)
 	ret = restore_mm_context(ctx, mm);
  out:
 	ckpt_hdr_put(ctx, h);
-	return ret;
+	return (ret < 0 ? ERR_PTR(ret) : mm);
 }
+
+void *restore_mm(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_mm(ctx);
+}
+
+int restore_mm_obj(struct ckpt_ctx *ctx, int mm_objref)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = ckpt_obj_fetch(ctx, mm_objref, CKPT_OBJ_MM);
+	if (!mm)
+		return -EINVAL;
+	else if (IS_ERR(mm))
+		return -EINVAL;
+
+	if (mm == current->mm)
+		return 0;
+
+	ret = exec_mmap(mm);
+	if (ret < 0)
+		return ret;
+
+	atomic_inc(&mm->mm_users);
+	return 0;
+}
+
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 8e43432..4fb5afa 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -57,6 +57,7 @@ void *restore_bad(struct ckpt_ctx *ctx)
  *   obj_no_{drop,grab}: for objects ignored/skipped
  *   obj_file_{drop,grab}: for file objects
  *   obj_inode_{drop,grab}: for inode objects
+ *   obj_mm_{drop,grab}: for mm_struct objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -91,6 +92,17 @@ static void obj_inode_drop(void *ptr)
 	iput((struct inode *) ptr);
 }
 
+static int obj_mm_grab(void *ptr)
+{
+	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
+	return 0;
+}
+
+static void obj_mm_drop(void *ptr)
+{
+	mmput((struct mm_struct *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -117,6 +129,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_bad,	/* no c/r at inode level */
 		.restore = restore_bad,		/* no c/r at inode level */
 	},
+	/* mm object */
+	{
+		.obj_name = "MM",
+		.obj_type = CKPT_OBJ_MM,
+		.ref_drop = obj_mm_drop,
+		.ref_grab = obj_mm_grab,
+		.checkpoint = checkpoint_mm,
+		.restore = restore_mm,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index d5ee6fd..0bd4845 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -162,6 +162,28 @@ int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int mm_objref;
+	int ret;
+
+	mm_objref = checkpoint_mm_obj(ctx, t);
+	ckpt_debug("memory: objref %d\n", mm_objref);
+	if (mm_objref < 0)
+		return mm_objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+
+	h->mm_objref = mm_objref;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 /* dump the entire state of a given task */
 int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -171,8 +193,8 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ckpt_debug("ret %d\n", ret);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_mm(ctx, t);
-	ckpt_debug("memory: ret %d\n", ret);
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_fd_table(ctx, t);
@@ -322,6 +344,22 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_objs *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, CKPT_HDR_TASK_OBJS, sizeof(*h));
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = restore_mm_obj(ctx, h->mm_objref);
+	ckpt_debug("memory: ret %d\n", ret);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
@@ -331,8 +369,8 @@ int restore_task(struct ckpt_ctx *ctx)
 	ckpt_debug("ret %d\n", ret);
 	if (ret < 0)
 		goto out;
-	ret = restore_mm(ctx);
-	ckpt_debug("memory: ret %d\n", ret);
+	ret = restore_task_objs(ctx);
+	ckpt_debug("objs: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
 	ret = restore_fd_table(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index a662ea7..d554776 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -90,8 +90,11 @@ extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
-extern int checkpoint_mm(struct ckpt_ctx *ctx, struct task_struct *t);
-extern int restore_mm(struct ckpt_ctx *ctx);
+extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_mm(struct ckpt_ctx *ctx);
+
+extern int checkpoint_mm_obj(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_mm_obj(struct ckpt_ctx *ctx, int objref);
 
 #define CKPT_VMA_NOT_SUPPORTED						\
 	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 59fab62..8b00fb8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -49,6 +49,7 @@ enum {
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
+	CKPT_HDR_TASK_OBJS,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
@@ -78,6 +79,7 @@ enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_INODE,
+	CKPT_OBJ_MM,
 	CKPT_OBJ_MAX
 };
 
@@ -139,6 +141,11 @@ struct ckpt_hdr_task {
 	__u32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 mm_objref;
+} __attribute__((aligned(8)));
+
 /* (thread) restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
-- 
1.5.4.3

* [RFC v14][PATCH 36/54] Make ckpt_may_checkpoint_task() check each namespace individually
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (34 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 35/54] Support for share memory address spaces Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 37/54] c/r: Add UTS support (v6) Oren Laadan
                     ` (19 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Alexey Dobriyan, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c        |   24 ++++++++++++--
 checkpoint/objhash.c           |   21 ++++++++++++
 checkpoint/process.c           |   67 ++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |    4 ++
 include/linux/checkpoint_hdr.h |    9 +++++
 5 files changed, 119 insertions(+), 6 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 6305e5d..64b5b45 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -184,6 +184,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
 static int may_checkpoint_task(struct task_struct *t, struct ckpt_ctx *ctx)
 {
 	struct pid_namespace *ns = ctx->root_nsproxy->pid_ns;
+	struct nsproxy *nsproxy;
+	int ret = 0;
 
 	ckpt_debug("check %d\n", task_pid_nr_ns(t, ns));
 
@@ -211,11 +213,25 @@ static int may_checkpoint_task(struct task_struct *t, struct ckpt_ctx *ctx)
 	    t->real_parent == ctx->root_task->real_parent)
 		return -EINVAL;
 
-	/* FIX: change this for nested containers */
-	if (task_nsproxy(t) != ctx->root_nsproxy)
-		return -EPERM;
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	if (!nsproxy) {
+		ret = -ENOSYS;
+	} else {
+		if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
+			ret = -EPERM;
+		if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
+			ret = -EPERM;
+		if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
+			ret = -EPERM;
+		if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns)
+			ret = -EPERM;
+		if (nsproxy->net_ns != ctx->root_nsproxy->net_ns)
+			ret = -EPERM;
+	}
+	rcu_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 #define CKPT_HDR_PIDS_CHUNK	256
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 4fb5afa..819a1be 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -58,6 +58,7 @@ void *restore_bad(struct ckpt_ctx *ctx)
  *   obj_file_{drop,grab}: for file objects
  *   obj_inode_{drop,grab}: for inode objects
  *   obj_mm_{drop,grab}: for mm_struct objects
+ *   obj_ns_{drop,grab}: for nsproxy objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -103,6 +104,17 @@ static void obj_mm_drop(void *ptr)
 	mmput((struct mm_struct *) ptr);
 }
 
+static int obj_ns_grab(void *ptr)
+{
+	get_nsproxy((struct nsproxy *) ptr);
+	return 0;
+}
+
+static void obj_ns_drop(void *ptr)
+{
+	put_nsproxy((struct nsproxy *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -138,6 +150,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* ns object */
+	{
+		.obj_name = "NSPROXY",
+		.obj_type = CKPT_OBJ_NS,
+		.ref_drop = obj_ns_drop,
+		.ref_grab = obj_ns_grab,
+		.checkpoint = checkpoint_ns,
+		.restore = restore_ns,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 0bd4845..2c489fd 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -162,12 +162,43 @@ int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+
+static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
+{
+	return 0;
+}
+
+int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ns(ctx, (struct nsproxy *) ptr);
+}
+
+int checkpoint_ns_obj(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct nsproxy *nsproxy;
+	int ret;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	ret = checkpoint_obj(ctx, nsproxy, CKPT_OBJ_NS);
+	put_nsproxy(nsproxy);
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
 	int mm_objref;
+	int ns_objref;
 	int ret;
 
+	ns_objref = checkpoint_ns_obj(ctx, t);
+	ckpt_debug("nsproxy: objref %d\n", ns_objref);
+	if (ns_objref < 0)
+		return ns_objref;
 	mm_objref = checkpoint_mm_obj(ctx, t);
 	ckpt_debug("memory: objref %d\n", mm_objref);
 	if (mm_objref < 0)
@@ -178,6 +209,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -ENOMEM;
 
 	h->mm_objref = mm_objref;
+	h->ns_objref = ns_objref;
 
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	ckpt_hdr_put(ctx, h);
@@ -344,18 +376,49 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
+{
+	return task_nsproxy(current);
+}
+
+void *restore_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_ns(ctx);
+}
+
+static int restore_ns_obj(struct ckpt_ctx *ctx, int ns_objref)
+{
+	struct nsproxy *nsproxy;
+
+	nsproxy = ckpt_obj_fetch(ctx, ns_objref, CKPT_OBJ_NS);
+	if (!nsproxy)
+		return -EINVAL;
+	else if (IS_ERR(nsproxy))
+		return PTR_ERR(nsproxy);
+
+	if (nsproxy != task_nsproxy(current))
+		switch_task_namespaces(current, nsproxy);
+
+	return 0;
+}
+
 static int restore_task_objs(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_task_objs *h;
 	int ret;
 
-	h = ckpt_read_obj_type(ctx, CKPT_HDR_TASK_OBJS, sizeof(*h));
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
+	ret = restore_ns_obj(ctx, h->ns_objref);
+	ckpt_debug("nsproxy: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
 	ret = restore_mm_obj(ctx, h->mm_objref);
 	ckpt_debug("memory: ret %d\n", ret);
-
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d554776..2cdd94f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -65,6 +65,10 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* namespaces */
+extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ns(struct ckpt_ctx *ctx);
+
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8b00fb8..405d3bc 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -80,6 +80,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_INODE,
 	CKPT_OBJ_MM,
+	CKPT_OBJ_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -144,6 +145,7 @@ struct ckpt_hdr_task {
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
 	__s32 mm_objref;
+	__s32 ns_objref;
 } __attribute__((aligned(8)));
 
 /* (thread) restart blocks */
@@ -167,6 +169,13 @@ enum restart_block_type {
 	CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* namespaces */
+struct ckpt_hdr_ns {
+	struct ckpt_hdr h;
+	__u32 flags;
+	__u32 uts_ref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
-- 
1.5.4.3

* [RFC v14][PATCH 37/54] c/r: Add UTS support (v6)
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (35 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 36/54] Make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 38/54] Stub implementation of IPC namespace c/r Oren Laadan
                     ` (18 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Alexey Dobriyan, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

This patch adds a "phase" of checkpoint that saves out information about any
namespaces the task(s) may have.  Do this by tracking the namespace objects
of the tasks and making sure that tasks with the same namespace that follow
get properly referenced in the checkpoint stream.

Changes:
  - Take uts_sem around access to uts data
  - Remove the kernel restore path
  - Punt on nested namespaces
  - Use __NEW_UTS_LEN in nodename and domainname buffers
  - Add a note to Documentation/checkpoint/internals.txt to indicate where
    in the save/restore process the UTS information is kept
  - Store (and track) the objref of the namespace itself instead of the
    nsproxy (based on comments from Dave on IRC)
  - Remove explicit check for non-root nsproxy
  - Store the nodename and domainname lengths and use ckpt_write_string()
    to store the actual name strings
  - Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
  - Remove "types" bitfield and use the "is this new" flag to determine
    whether or not we should write out a new ns descriptor
  - Replace kernel restore path
  - Move the namespace information to be directly after the task
    information record
  - Update Documentation to reflect new location of namespace info
  - Support checkpoint and restart of nested UTS namespaces
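
To make the "is this new" flag concrete, here is a small userspace
sketch of the consistency rule applied on restart: a uts namespace
reference is acceptable only if it is either already known, or marked
as new with CLONE_NEWUTS, but not both and not neither. The helper name
uts_ref_is_consistent() and the SKETCH_ prefix are hypothetical; the
flag value is copied from <linux/sched.h> so the sketch stands alone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same value as CLONE_NEWUTS in <linux/sched.h>. */
#define SKETCH_CLONE_NEWUTS 0x04000000

/*
 * @known_ns: namespace already registered for this objref, or NULL
 *            if this objref has not been seen yet.
 * @flags:    namespace flags read from the checkpoint image.
 *
 * Returns 0 if exactly one of "already seen" and "marked as new"
 * holds, -EINVAL otherwise.
 */
int uts_ref_is_consistent(const void *known_ns, unsigned int flags)
{
	int seen = (known_ns != NULL);
	int is_new = (flags & SKETCH_CLONE_NEWUTS) != 0;

	return (seen != is_new) ? 0 : -EINVAL;
}
```

A first-time reference must carry the flag, and a repeated reference
must not; any other combination suggests a malformed image.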

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 Documentation/checkpoint/internals.txt |    1 +
 checkpoint/checkpoint.c                |    2 -
 checkpoint/objhash.c                   |   21 ++++
 checkpoint/process.c                   |  160 +++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h         |    9 ++
 5 files changed, 189 insertions(+), 4 deletions(-)

diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
index de2eead..41f0861 100644
--- a/Documentation/checkpoint/internals.txt
+++ b/Documentation/checkpoint/internals.txt
@@ -17,6 +17,7 @@ The order of operations, both save and restore, is as follows:
   -> thread state: elements of thread_struct and thread_info
   -> CPU state: registers etc, including FPU
   -> memory state: memory address space layout and contents
+  -> namespace information
   -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
   -> files state: open file descriptors and their state
   -> signals state: [TBD] pending signals and signal handling state
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 64b5b45..88dee51 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -218,8 +218,6 @@ static int may_checkpoint_task(struct task_struct *t, struct ckpt_ctx *ctx)
 	if (!nsproxy) {
 		ret = -ENOSYS;
 	} else {
-		if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
-			ret = -EPERM;
 		if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
 			ret = -EPERM;
 		if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 819a1be..abf2e47 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -59,6 +59,7 @@ void *restore_bad(struct ckpt_ctx *ctx)
  *   obj_inode_{drop,grab}: for inode objects
  *   obj_mm_{drop,grab}: for mm_struct objects
  *   obj_ns_{drop,grab}: for nsproxy objects
+ *   obj_uts_ns_{drop,grab}: for uts_namespace objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -115,6 +116,17 @@ static void obj_ns_drop(void *ptr)
 	put_nsproxy((struct nsproxy *) ptr);
 }
 
+static int obj_uts_ns_grab(void *ptr)
+{
+	get_uts_ns((struct uts_namespace *) ptr);
+	return 0;
+}
+
+static void obj_uts_ns_drop(void *ptr)
+{
+	put_uts_ns((struct uts_namespace *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -159,6 +171,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ns,
 		.restore = restore_ns,
 	},
+	/* uts_ns object */
+	{
+		.obj_name = "UTS_NS",
+		.obj_type = CKPT_OBJ_UTS_NS,
+		.ref_drop = obj_uts_ns_drop,
+		.ref_grab = obj_uts_ns_grab,
+		.checkpoint = checkpoint_bad,
+		.restore = restore_bad,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 2c489fd..13dd48b 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -15,8 +15,11 @@
 #include <linux/posix-timers.h>
 #include <linux/futex.h>
 #include <linux/poll.h>
+#include <linux/nsproxy.h>
+#include <linux/utsname.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/syscalls.h>
 
 #include "checkpoint_arch.h"
 
@@ -162,10 +165,69 @@ int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int checkpoint_uts_ns(struct ckpt_ctx *ctx, struct uts_namespace *uts_ns)
+{
+	struct ckpt_hdr_utsns *h;
+	int domainname_len;
+	int nodename_len;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (!h)
+		return -ENOMEM;
+
+	nodename_len = sizeof(uts_ns->name.nodename);
+	domainname_len = sizeof(uts_ns->name.domainname);
+
+	h->nodename_len = nodename_len;
+	h->domainname_len = domainname_len;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	down_read(&uts_sem);
+	ret = ckpt_write_string(ctx, uts_ns->name.nodename, nodename_len);
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_string(ctx, uts_ns->name.domainname, domainname_len);
+ up:
+	up_read(&uts_sem);
+	return ret;
+}
 
 static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 {
-	return 0;
+	struct ckpt_hdr_ns *h;
+	int ns_flags = 0;
+	int uts_objref;
+	int first, ret;
+
+	uts_objref = ckpt_obj_lookup_add(ctx, nsproxy->uts_ns,
+					 CKPT_OBJ_UTS_NS, &first);
+	if (uts_objref < 0)
+		return uts_objref;
+	if (first)
+		ns_flags |= CLONE_NEWUTS;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->flags = ns_flags;
+	h->uts_ref = uts_objref;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	if (ns_flags & CLONE_NEWUTS)
+		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
+
+	/* FIX: Write other namespaces here */
+	return ret;
 }
 
 int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
@@ -376,9 +438,103 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int do_restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_utsns *h;
+	struct uts_namespace *ns;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->nodename_len > sizeof(ns->name.nodename) ||
+	    h->domainname_len > sizeof(ns->name.domainname))
+		goto out;
+
+	ns = current->nsproxy->uts_ns;
+
+	/* no need to take uts_sem because we are the sole users */
+
+	memset(ns->name.nodename, 0, sizeof(ns->name.nodename));
+	ret = _ckpt_read_string(ctx, ns->name.nodename, h->nodename_len);
+	if (ret < 0)
+		goto out;
+	memset(ns->name.domainname, 0, sizeof(ns->name.domainname));
+	ret = _ckpt_read_string(ctx, ns->name.domainname, h->domainname_len);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int restore_uts_ns(struct ckpt_ctx *ctx, int ns_objref, int flags)
+{
+	struct uts_namespace *uts_ns;
+	int ret = 0;
+
+	uts_ns = ckpt_obj_fetch(ctx, ns_objref, CKPT_OBJ_UTS_NS);
+	if (IS_ERR(uts_ns))
+		return PTR_ERR(uts_ns);
+
+	/* sanity: CLONE_NEWUTS if-and-only-if uts_ns is NULL (first timer) */
+	if (!!uts_ns ^ !(flags & CLONE_NEWUTS))
+		return -EINVAL;
+
+	if (!uts_ns) {
+		ret = do_restore_uts_ns(ctx);
+		if (ret < 0)
+			return ret;
+		ret = ckpt_obj_insert(ctx, current->nsproxy->uts_ns,
+				    ns_objref, CKPT_OBJ_UTS_NS);
+	} else {
+		struct uts_namespace *old_uts_ns;
+
+		/* safe because nsproxy->count must be 1 ... */
+		BUG_ON(atomic_read(&current->nsproxy->count) != 1);
+
+		old_uts_ns = current->nsproxy->uts_ns;
+		current->nsproxy->uts_ns = uts_ns;
+		get_uts_ns(uts_ns);
+		put_uts_ns(old_uts_ns);
+	}
+
+	return ret;
+}
+
 static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 {
-	return task_nsproxy(current);
+	struct ckpt_hdr_ns *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
+	if (IS_ERR(h))
+		return (struct nsproxy *) h;
+
+	ret = -EINVAL;
+	if (h->uts_ref < 0)
+		goto out;
+	if (h->flags & ~CLONE_NEWUTS)
+		goto out;
+
+	/* each unseen-before namespace will be un-shared now */
+	ret = sys_unshare(h->flags);
+	if (ret)
+		goto out;
+
+	/*
+	 * For each unseen-before namespace 'xxx', it is now safe to
+	 * modify the nsproxy->xxx_ns without locking because unshare()
+	 * gave a brand new nsproxy and nsproxy->xxx_ns, and we're the
+	 * sole users at this point.
+	 */
+	ret = restore_uts_ns(ctx, h->uts_ref, h->flags);
+	ckpt_debug("uts ns: %d\n", ret);
+
+	/* FIX: add more namespaces here */
+ out:
+	ckpt_hdr_put(ctx, h);
+	return (ret < 0 ? ERR_PTR(ret) : task_nsproxy(current));
 }
 
 void *restore_ns(struct ckpt_ctx *ctx)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 405d3bc..4945de6 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -53,6 +53,8 @@ enum {
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
+	CKPT_HDR_NS,
+	CKPT_HDR_UTS_NS,
 
 	CKPT_HDR_MM = 201,
 	CKPT_HDR_VMA,
@@ -81,6 +83,7 @@ enum obj_type {
 	CKPT_OBJ_INODE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
+	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -176,6 +179,12 @@ struct ckpt_hdr_ns {
 	__u32 uts_ref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_utsns {
+	struct ckpt_hdr h;
+	__u32 nodename_len;
+	__u32 domainname_len;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
-- 
1.5.4.3

* [RFC v14][PATCH 38/54] Stub implementation of IPC namespace c/r
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (36 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 37/54] c/r: Add UTS support (v6) Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 39/54] deferqueue: generic queue to defer work Oren Laadan
                     ` (17 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Dan Smith, Alexey Dobriyan, Dave Hansen

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Changes:
 - Update to match UTS changes

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c        |    2 --
 checkpoint/objhash.c           |   23 +++++++++++++++++++++++
 checkpoint/process.c           |   21 ++++++++++++++++++++-
 include/linux/checkpoint.h     |   15 +++++++++++++++
 include/linux/checkpoint_hdr.h |    3 +++
 5 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 88dee51..4319976 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -218,8 +218,6 @@ static int may_checkpoint_task(struct task_struct *t, struct ckpt_ctx *ctx)
 	if (!nsproxy) {
 		ret = -ENOSYS;
 	} else {
-		if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
-			ret = -EPERM;
 		if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
 			ret = -EPERM;
 		if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index abf2e47..bdc719e 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -14,6 +14,8 @@
 #include <linux/kernel.h>
 #include <linux/hash.h>
 #include <linux/file.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -60,6 +62,7 @@ void *restore_bad(struct ckpt_ctx *ctx)
  *   obj_mm_{drop,grab}: for mm_struct objects
  *   obj_ns_{drop,grab}: for nsproxy objects
  *   obj_uts_ns_{drop,grab}: for uts_namespace objects
+ *   obj_ipc_ns_{drop,grab}: for ipc_namespace objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -127,6 +130,17 @@ static void obj_uts_ns_drop(void *ptr)
 	put_uts_ns((struct uts_namespace *) ptr);
 }
 
+static int obj_ipc_ns_grab(void *ptr)
+{
+	get_ipc_ns((struct ipc_namespace *) ptr);
+	return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr)
+{
+	put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -180,6 +194,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_bad,
 		.restore = restore_bad,
 	},
+	/* ipc_ns object */
+	{
+		.obj_name = "IPC_NS",
+		.obj_type = CKPT_OBJ_IPC_NS,
+		.ref_drop = obj_ipc_ns_drop,
+		.ref_grab = obj_ipc_ns_grab,
+		.checkpoint = checkpoint_bad,
+		.restore = restore_bad,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 13dd48b..966aa93 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -202,6 +202,7 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	struct ckpt_hdr_ns *h;
 	int ns_flags = 0;
 	int uts_objref;
+	int ipc_objref;
 	int first, ret;
 
 	uts_objref = ckpt_obj_lookup_add(ctx, nsproxy->uts_ns,
@@ -211,12 +212,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (first)
 		ns_flags |= CLONE_NEWUTS;
 
+	ipc_objref = ckpt_obj_lookup_add(ctx, nsproxy->ipc_ns,
+					 CKPT_OBJ_IPC_NS, &first);
+	if (ipc_objref < 0)
+		return ipc_objref;
+	if (first)
+		ns_flags |= CLONE_NEWIPC;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
 	if (!h)
 		return -ENOMEM;
 
 	h->flags = ns_flags;
 	h->uts_ref = uts_objref;
+	h->ipc_ref = ipc_objref;
 
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	ckpt_hdr_put(ctx, h);
@@ -225,6 +234,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 
 	if (ns_flags & CLONE_NEWUTS)
 		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
+#if 0
+	if (!ret && (ns_flags & CLONE_NEWIPC))
+		ret = checkpoint_ipc_ns(ctx, nsproxy->ipc_ns);
+#endif
 
 	/* FIX: Write other namespaces here */
 	return ret;
@@ -514,7 +527,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	ret = -EINVAL;
 	if (h->uts_ref < 0)
 		goto out;
-	if (h->flags & ~CLONE_NEWUTS)
+	if (h->flags & ~(CLONE_NEWUTS | CLONE_NEWIPC))
 		goto out;
 
 	/* each unseen-before namespace will be un-shared now */
@@ -530,6 +543,12 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	 */
 	ret = restore_uts_ns(ctx, h->uts_ref, h->flags);
 	ckpt_debug("uts ns: %d\n", ret);
+	if (ret < 0)
+		goto out;
+#if 0
+	ret = restore_ipc_ns(ctx, h->ipc_ref, h->flags);
+	ckpt_debug("ipc ns: %d\n", ret);
+#endif
 
 	/* FIX: add more namespaces here */
  out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2cdd94f..867033c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -69,6 +69,21 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
+#if 0
+/* ipc-ns */
+#ifdef CONFIG_SYSVIPC
+extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
+			     struct ipc_namespace *ipc_ns);
+extern int restore_ipc_ns(struct ckpt_ctx *ctx, int ns_objref, int flags);
+#else
+static inline int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
+				    struct ipc_namespace *ipc_ns)
+{ return 0; }
+static inline int restore_ipc_ns(struct ckpt_ctx *ctx, int ns_objref, int flags)
+{ return 0; }
+#endif /* CONFIG_SYSVIPC */
+#endif
+
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4945de6..3051031 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -55,6 +55,7 @@ enum {
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
 	CKPT_HDR_UTS_NS,
+	CKPT_HDR_IPC_NS,
 
 	CKPT_HDR_MM = 201,
 	CKPT_HDR_VMA,
@@ -84,6 +85,7 @@ enum obj_type {
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
+	CKPT_OBJ_IPC_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -177,6 +179,7 @@ struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__u32 flags;
 	__u32 uts_ref;
+	__u32 ipc_ref;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_utsns {
-- 
1.5.4.3

* [RFC v14][PATCH 39/54] deferqueue: generic queue to defer work
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (37 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 38/54] Stub implementation of IPC namespace c/r Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 40/54] ipc: allow allocation of an ipc object with desired identifier Oren Laadan
                     ` (16 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when an operation
cannot be performed in place during the scan of tasks, because it
avoids the need for a second scan.

One use case is restoring an ipc shared memory region that has been
deleted (but is still attached): during restart it needs to be
created, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion cannot be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches).

This interface allows chronic procrastination in the kernel:

deferqueue_create(void):
    Allocates and returns a new deferqueue.

deferqueue_run(deferqueue):
    Executes all the pending works in the queue. Returns the number
    of works executed, or an error upon the first error reported by
    a deferred work.

deferqueue_add(deferqueue, data, size, func, dtor):
    Enqueue a deferred work. @func is the callback function to
    do the work, which will be called with @data as an argument.
    @size tells the size of data. @dtor is a destructor callback
    that is invoked for deferred works remaining in the queue when
    the queue is destroyed. NOTE: for a given deferred work, @dtor
    is _not_ called if @func was already called (regardless of the
    return value of the latter).

deferqueue_destroy(deferqueue):
    Free the deferqueue and any queued items while invoking the
    @dtor callback for each queued item.
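
To make the intended semantics concrete, here is a minimal userspace
sketch of the four calls above. It is not the kernel code: the dq_*
names and the demo callbacks are hypothetical, locking is left out, and
malloc() stands in for kmalloc(). It follows the description of
deferqueue_run() given above by stopping at the first failing work and
returning its error; that is one reading of the text, and the posted
deferqueue_run() may behave differently on failure.

```c
#include <stdlib.h>
#include <string.h>

typedef int (*dq_func_t)(void *);

struct dq_entry {
	dq_func_t function;
	dq_func_t destructor;
	struct dq_entry *next;
	char data[];		/* private copy of the caller's data */
};

struct dq_head {
	struct dq_entry *first;
	struct dq_entry **tail;	/* where the next entry gets linked */
};

static struct dq_head *dq_create(void)
{
	struct dq_head *h = malloc(sizeof(*h));

	if (h) {
		h->first = NULL;
		h->tail = &h->first;
	}
	return h;
}

static int dq_add(struct dq_head *h, const void *data, size_t size,
		  dq_func_t func, dq_func_t dtor)
{
	/* entry header plus payload: sizeof(*e), not sizeof(e) */
	struct dq_entry *e = malloc(sizeof(*e) + size);

	if (!e)
		return -1;
	e->function = func;
	e->destructor = dtor;
	e->next = NULL;
	memcpy(e->data, data, size);
	*h->tail = e;		/* append, so works run in FIFO order */
	h->tail = &e->next;
	return 0;
}

/* Run queued works in order; stop at the first failure and return it. */
static int dq_run(struct dq_head *h)
{
	int nr = 0, ret = 0;

	while (h->first) {
		struct dq_entry *e = h->first;

		h->first = e->next;
		ret = e->function(e->data);	/* dtor is not called here */
		free(e);
		if (ret < 0)
			break;
		nr++;
	}
	if (!h->first)
		h->tail = &h->first;
	return ret < 0 ? ret : nr;
}

/* Free the queue; works that never ran get their destructor instead. */
static void dq_destroy(struct dq_head *h)
{
	struct dq_entry *e = h->first;

	while (e) {
		struct dq_entry *next = e->next;

		if (e->destructor)
			e->destructor(e->data);
		free(e);
		e = next;
	}
	free(h);
}

/* Demo callbacks (hypothetical): sum payloads, count destructor calls. */
static int dq_demo_sum, dq_demo_dtors;

static int dq_demo_work(void *data)
{
	int v;
	memcpy(&v, data, sizeof(v));
	dq_demo_sum += v;
	return 0;
}

static int dq_demo_dtor(void *data)
{
	(void)data;
	dq_demo_dtors++;
	return 0;
}
```

A caller would create one queue per operation, add works while
scanning tasks, call dq_run() once at the end, and then dq_destroy().
Only works still queued at destroy time see their destructor.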

Why aren't we using the existing kernel workqueue mechanism?  We need
to defer work until the end of the operation: not earlier, since we
need other things to be in place; not later, to not block waiting for
it. However, the workqueue schedules the work for 'some time later'.
Also, the kernel workqueue may run in any task context, but we often
require that an operation be run in the context of some specific
restarting task (e.g., restoring IPC state of a certain ipc_ns).

Instead, this mechanism is a simple way for the c/r operation as a
whole, and later a task in particular, to defer some action until
later (but not arbitrarily later) _in the restore_ operation.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/Kconfig         |    5 ++
 include/linux/deferqueue.h |   58 +++++++++++++++++++++++
 kernel/Makefile            |    1 +
 kernel/deferqueue.c        |  109 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 173 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/deferqueue.h
 create mode 100644 kernel/deferqueue.c

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index 1761b0a..53ed6fa 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -2,9 +2,14 @@
 # implemented the hooks for processor state etc. needed by the
 # core checkpoint/restart code.
 
+config DEFERQUEUE
+	bool
+	default n
+
 config CHECKPOINT
 	bool "Enable checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	select DEFERQUEUE
 	help
 	  Application checkpoint/restart is the ability to save the
 	  state of a running application so that it can later resume
diff --git a/include/linux/deferqueue.h b/include/linux/deferqueue.h
new file mode 100644
index 0000000..2eb58cf
--- /dev/null
+++ b/include/linux/deferqueue.h
@@ -0,0 +1,58 @@
+/*
+ * deferqueue.h --- deferred work queue handling for Linux.
+ */
+
+#ifndef _LINUX_DEFERQUEUE_H
+#define _LINUX_DEFERQUEUE_H
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/*
+ * This interface allows chronic procrastination in the kernel:
+ *
+ * deferqueue_create(void):
+ *     Allocates and returns a new deferqueue.
+ *
+ * deferqueue_run(deferqueue):
+ *     Executes all the pending works in the queue. Returns the number
+ *     of works executed, or an error upon the first error reported by
+ *     a deferred work.
+ *
+ * deferqueue_add(deferqueue, data, size, func, dtor):
+ *      Enqueue a deferred work. @func is the callback function to
+ *      do the work, which will be called with @data as an argument.
+ *      @size tells the size of data. @dtor is a destructor callback
+ *      that is invoked for deferred works remaining in the queue when
+ *      the queue is destroyed. NOTE: for a given deferred work, @dtor
+ *      is _not_ called if @func was already called (regardless of the
+ *      return value of the latter).
+ *
+ * deferqueue_destroy(deferqueue):
+ *      Free the deferqueue and any queued items while invoking the
+ *      @dtor callback for each queued item.
+ */
+
+
+typedef int (*deferqueue_func_t)(void *);
+
+struct deferqueue_entry {
+	deferqueue_func_t function;
+	deferqueue_func_t destructor;
+	struct list_head list;
+	char data[0];
+};
+
+struct deferqueue_head {
+	spinlock_t lock;
+	struct list_head list;
+};
+
+struct deferqueue_head *deferqueue_create(void);
+void deferqueue_destroy(struct deferqueue_head *head);
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+		   deferqueue_func_t func, deferqueue_func_t dtor);
+int deferqueue_run(struct deferqueue_head *head);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 4242366..6bc638d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -22,6 +22,7 @@ CFLAGS_REMOVE_cgroup-debug.o = -pg
 CFLAGS_REMOVE_sched_clock.o = -pg
 endif
 
+obj-$(CONFIG_DEFERQUEUE) += deferqueue.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
diff --git a/kernel/deferqueue.c b/kernel/deferqueue.c
new file mode 100644
index 0000000..efd99d5
--- /dev/null
+++ b/kernel/deferqueue.c
@@ -0,0 +1,109 @@
+/*
+ *  Infrastructure to manage deferred work
+ *
+ *  This differs from a workqueue in that the work must be deferred
+ *  until specifically run by the caller.
+ *
+ *  As the only user currently is checkpoint/restart, which has
+ *  very simple usage, the locking is kept simple.  Adding entries
+ *  is protected by the head->lock.  But deferqueue_run() is only
+ *  called once, after all entries have been added.  So it is not
+ *  protected.  Similarly, _destroy is only called once when the
+ *  ckpt_ctx is released, so it is not locked or refcounted.  These
+ *  can of course be added if needed by other users.
+ *
+ *  Why not use workqueue ?  We need to defer work until the end of an
+ *  operation: not earlier, since we need other things to be in place;
+ *  not later, to not block waiting for it. However, the workqueue
+ *  schedules the work for 'some time later'. Also, workqueue may run
+ *  in any task context, but we often require that an operation
+ *  be run in the context of some specific restarting task (e.g.,
+ *  restoring IPC state of a certain ipc_ns).
+ *
+ *  Instead, this mechanism is a simple way for the c/r operation as a
+ *  whole, and later a task in particular, to defer some action until
+ *  later (but not arbitrarily later) _in the restore_ operation.
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/deferqueue.h>
+
+struct deferqueue_head *deferqueue_create(void)
+{
+	struct deferqueue_head *h = kmalloc(sizeof(*h), GFP_KERNEL);
+	if (h) {
+		spin_lock_init(&h->lock);
+		INIT_LIST_HEAD(&h->list);
+	}
+	return h;
+}
+
+void deferqueue_destroy(struct deferqueue_head *h)
+{
+	if (!list_empty(&h->list)) {
+		struct deferqueue_entry *dq, *n;
+
+		pr_debug("%s: freeing non-empty queue\n", __func__);
+		list_for_each_entry_safe(dq, n, &h->list, list) {
+			dq->destructor(dq->data);
+			list_del(&dq->list);
+			kfree(dq);
+		}
+	}
+	kfree(h);
+}
+
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+		   deferqueue_func_t func, deferqueue_func_t dtor)
+{
+	struct deferqueue_entry *dq;
+
+	dq = kmalloc(sizeof(*dq) + size, GFP_KERNEL);
+	if (!dq)
+		return -ENOMEM;
+
+	dq->function = func;
+	dq->destructor = dtor;
+	memcpy(dq->data, data, size);
+
+	pr_debug("%s: adding work %p func %p dtor %p\n",
+		 __func__, dq, func, dtor);
+	spin_lock(&head->lock);
+	list_add_tail(&dq->list, &head->list);
+	spin_unlock(&head->lock);
+	return 0;
+}
+
+/*
+ * deferqueue_run - perform all work in the work queue
+ * @head: deferqueue_head from which to run
+ *
+ * returns: number of works performed, or < 0 on error
+ */
+int deferqueue_run(struct deferqueue_head *head)
+{
+	struct deferqueue_entry *dq, *n;
+	int nr = 0;
+	int ret;
+
+	list_for_each_entry_safe(dq, n, &head->list, list) {
+		pr_debug("doing work %p function %p\n", dq, dq->function);
+		/* don't call destructor - function callback should do it */
+		ret = dq->function(dq->data);
+		if (ret < 0)
+			pr_debug("wq function failed %d\n", ret);
+		list_del(&dq->list);
+		kfree(dq);
+		nr++;
+	}
+
+	return nr;
+}
-- 
1.5.4.3


* [RFC v14][PATCH 40/54] ipc: allow allocation of an ipc object with desired identifier
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (38 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 39/54] deferqueue: generic queue to defer work Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 41/54] ipc: helpers to save and restore kern_ipc_perm structures Oren Laadan
                     ` (15 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.
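
As a rough illustration of what requesting an identifier involves, the
userspace sketch below models an ipc id as a slot index plus a sequence
number scaled by SEQ_MULTIPLIER, and fails when the requested slot is
already taken instead of falling back to another one. The table, the
NSLOTS size and the helper names are hypothetical, SEQ_MULTIPLIER is
assumed to be 32768 (the value of IPCMNI), and sequence wrap-around is
left out.

```c
#include <errno.h>

#define SEQ_MULTIPLIER	32768	/* assumption: value of IPCMNI */
#define NSLOTS		8	/* hypothetical table size for the demo */

static int ipc_idx_of(int id) { return id % SEQ_MULTIPLIER; }
static int ipc_seq_of(int id) { return id / SEQ_MULTIPLIER; }
static int ipc_build(int idx, int seq) { return SEQ_MULTIPLIER * seq + idx; }

/* Lowest free slot at or above @start, or -1 if there is none. */
static int first_free_from(const unsigned char *used, int start)
{
	int i;

	for (i = start; i < NSLOTS; i++)
		if (!used[i])
			return i;
	return -1;
}

/*
 * Allocate an id. req_id < 0 means "don't care"; otherwise the exact
 * id is wanted, and a busy slot is an error rather than a fallback.
 */
static int add_id(unsigned char *used, int req_id, int *next_seq)
{
	int want = req_id >= 0 ? ipc_idx_of(req_id) : 0;
	int idx = first_free_from(used, want);
	int seq;

	if (idx < 0)
		return -ENOSPC;
	if (req_id >= 0) {
		if (idx != want)
			return -EBUSY;	/* requested slot already taken */
		seq = ipc_seq_of(req_id);
	} else {
		seq = (*next_seq)++;	/* wrap-around omitted here */
	}
	used[idx] = 1;
	return ipc_build(idx, seq);
}
```

Passing -1 keeps the ordinary behaviour (lowest free slot, next
sequence number), which is what the unchanged system calls do.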

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 ipc/msg.c  |   17 ++++++++++++-----
 ipc/sem.c  |   17 ++++++++++++-----
 ipc/shm.c  |   19 +++++++++++++------
 ipc/util.c |   42 +++++++++++++++++++++++++++++-------------
 ipc/util.h |   12 +++++++++---
 5 files changed, 75 insertions(+), 32 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 2ceab7f..1db7c45 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
 static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
@@ -174,10 +174,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, struct msg_queue *s)
  * newque - Create a new msg queue
  * @ns: namespace
  * @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with msg_ids.rw_mutex held (writer)
  */
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	struct msg_queue *msq;
 	int id, retval;
@@ -201,7 +203,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	/*
 	 * ipc_addid() locks msq
 	 */
-	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
 	if (id < 0) {
 		security_msg_queue_free(msq);
 		ipc_rcu_putref(msq);
@@ -309,7 +311,7 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+int do_msgget(key_t key, int msgflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
@@ -324,7 +326,12 @@ SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 	msg_params.key = key;
 	msg_params.flg = msgflg;
 
-	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params, req_id);
+}
+
+SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+{
+	return do_msgget(key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/sem.c b/ipc/sem.c
index 16a2189..207dbbb 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,7 +92,7 @@
 #define sem_unlock(sma)		ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
-static int newary(struct ipc_namespace *, struct ipc_params *);
+static int newary(struct ipc_namespace *, struct ipc_params *, int);
 static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
@@ -227,11 +227,13 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * newary - Create a new semaphore set
  * @ns: namespace
  * @params: ptr to the structure that contains key, semflg and nsems
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with sem_ids.rw_mutex held (as a writer)
  */
 
-static int newary(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newary(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	int id;
 	int retval;
@@ -263,7 +265,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 		return retval;
 	}
 
-	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
+	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni, req_id);
 	if (id < 0) {
 		security_sem_free(sma);
 		ipc_rcu_putref(sma);
@@ -308,7 +310,7 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+int do_semget(key_t key, int nsems, int semflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
@@ -327,7 +329,12 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 	sem_params.flg = semflg;
 	sem_params.u.nsems = nsems;
 
-	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params, req_id);
+}
+
+SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+{
+	return do_semget(key, nsems, semflg, -1);
 }
 
 /*
diff --git a/ipc/shm.c b/ipc/shm.c
index faa46da..7dd5f0c 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,7 +62,7 @@ static struct vm_operations_struct shm_vm_ops;
 #define shm_unlock(shp)			\
 	ipc_unlock(&(shp)->shm_perm)
 
-static int newseg(struct ipc_namespace *, struct ipc_params *);
+static int newseg(struct ipc_namespace *, struct ipc_params *, int);
 static void shm_open(struct vm_area_struct *vma);
 static void shm_close(struct vm_area_struct *vma);
 static void shm_destroy (struct ipc_namespace *ns, struct shmid_kernel *shp);
@@ -83,7 +83,7 @@ void shm_init_ns(struct ipc_namespace *ns)
  * Called with shm_ids.rw_mutex (writer) and the shp structure locked.
  * Only shm_ids.rw_mutex remains locked on exit.
  */
-static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct shmid_kernel *shp;
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
@@ -326,11 +326,13 @@ static struct vm_operations_struct shm_vm_ops = {
  * newseg - Create a new shared memory segment
  * @ns: namespace
  * @params: ptr to the structure that contains key, size and shmflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with shm_ids.rw_mutex held as a writer.
  */
 
-static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newseg(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	key_t key = params->key;
 	int shmflg = params->flg;
@@ -386,7 +388,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 		goto no_file;
 	ima_shm_check(file);
 
-	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
+	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni, req_id);
 	if (id < 0) {
 		error = id;
 		goto no_id;
@@ -444,7 +446,7 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+int do_shmget(key_t key, size_t size, int shmflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
@@ -460,7 +462,12 @@ SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
 	shm_params.flg = shmflg;
 	shm_params.u.size = size;
 
-	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params, req_id);
+}
+
+SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+{
+	return do_shmget(key, size, shmflg, -1);
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
diff --git a/ipc/util.c b/ipc/util.c
index b8e4ba9..ca248ec 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -247,10 +247,12 @@ int ipc_get_maxid(struct ipc_ids *ids)
  *	Called with ipc_ids.rw_mutex held as a writer.
  */
  
-int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
+int
+ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size, int req_id)
 {
 	uid_t euid;
 	gid_t egid;
+	int lid = 0;
 	int id, err;
 
 	if (size > IPCMNI)
@@ -259,28 +261,41 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
 	if (ids->in_use >= size)
 		return -ENOSPC;
 
+	if (req_id >= 0)
+		lid = ipcid_to_idx(req_id);
+
 	spin_lock_init(&new->lock);
 	new->deleted = 0;
 	rcu_read_lock();
 	spin_lock(&new->lock);
 
-	err = idr_get_new(&ids->ipcs_idr, new, &id);
+	err = idr_get_new_above(&ids->ipcs_idr, new, lid, &id);
 	if (err) {
 		spin_unlock(&new->lock);
 		rcu_read_unlock();
 		return err;
 	}
 
+	if (req_id >= 0) {
+		if (id != lid) {
+			idr_remove(&ids->ipcs_idr, id);
+			spin_unlock(&new->lock);
+			rcu_read_unlock();
+			return -EBUSY;
+		}
+		new->seq = req_id / SEQ_MULTIPLIER;
+	} else {
+		new->seq = ids->seq++;
+		if (ids->seq > ids->seq_max)
+			ids->seq = 0;
+	}
+
 	ids->in_use++;
 
 	current_euid_egid(&euid, &egid);
 	new->cuid = new->uid = euid;
 	new->gid = new->cgid = egid;
 
-	new->seq = ids->seq++;
-	if(ids->seq > ids->seq_max)
-		ids->seq = 0;
-
 	new->id = ipc_buildid(id, new->seq);
 	return id;
 }
@@ -296,7 +311,7 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
  *	when the key is IPC_PRIVATE.
  */
 static int ipcget_new(struct ipc_namespace *ns, struct ipc_ids *ids,
-		struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	int err;
 retry:
@@ -306,7 +321,7 @@ retry:
 		return -ENOMEM;
 
 	down_write(&ids->rw_mutex);
-	err = ops->getnew(ns, params);
+	err = ops->getnew(ns, params, req_id);
 	up_write(&ids->rw_mutex);
 
 	if (err == -EAGAIN)
@@ -351,6 +366,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
  *	@ids: IPC identifer set
  *	@ops: the actual creation routine to call
  *	@params: its parameters
+ *	@req_id: request desired id if available (-1 if don't care)
  *
  *	This routine is called by sys_msgget, sys_semget() and sys_shmget()
  *	when the key is not IPC_PRIVATE.
@@ -360,7 +376,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
  *	On success, the ipc id is returned.
  */
 static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
-		struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	struct kern_ipc_perm *ipcp;
 	int flg = params->flg;
@@ -381,7 +397,7 @@ retry:
 		else if (!err)
 			err = -ENOMEM;
 		else
-			err = ops->getnew(ns, params);
+			err = ops->getnew(ns, params, req_id);
 	} else {
 		/* ipc object has been locked by ipc_findkey() */
 
@@ -742,12 +758,12 @@ struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
  * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
  */
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-			struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	if (params->key == IPC_PRIVATE)
-		return ipcget_new(ns, ids, ops, params);
+		return ipcget_new(ns, ids, ops, params, req_id);
 	else
-		return ipcget_public(ns, ids, ops, params);
+		return ipcget_public(ns, ids, ops, params, req_id);
 }
 
 /**
diff --git a/ipc/util.h b/ipc/util.h
index 1187332..c75e3b2 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -70,7 +70,7 @@ struct ipc_params {
  *      . routine to call for an extra check if needed
  */
 struct ipc_ops {
-	int (*getnew) (struct ipc_namespace *, struct ipc_params *);
+	int (*getnew) (struct ipc_namespace *, struct ipc_params *, int);
 	int (*associate) (struct kern_ipc_perm *, int);
 	int (*more_checks) (struct kern_ipc_perm *, struct ipc_params *);
 };
@@ -93,7 +93,7 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
 #define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
 
 /* must be called with ids->rw_mutex acquired for writing */
-int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
+int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int, int);
 
 /* must be called with ids->rw_mutex acquired for reading */
 int ipc_get_maxid(struct ipc_ids *);
@@ -170,6 +170,12 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-			struct ipc_ops *ops, struct ipc_params *params);
+		struct ipc_ops *ops, struct ipc_params *params, int req_id);
+
+/* for checkpoint/restart */
+extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
+
+extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
 
 #endif
-- 
1.5.4.3


* [RFC v14][PATCH 41/54] ipc: helpers to save and restore kern_ipc_perm structures
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (39 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 40/54] ipc: allow allocation of an ipc object with desired identifier Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 42/54] ipc namespace: save and restore ipc namespace basics Oren Laadan
                     ` (14 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Add the helpers to save and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Put
place-holders to save and restore ipc state.

TODO:
This patch does _not_ address the issues of users/groups and the
related security issues. For now, it saves the old user/group of
ipc objects, but does not restore them during restart.
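
The restore helper has to reject image values that do not fit the
in-kernel field they are loaded into. The userspace sketch below shows
the idea behind the CKPT_TST_OVERFLOW_16() test: the range check only
applies when the saved field is wider than its destination. The struct
layouts and function names are hypothetical, SHRT_MAX stands in for the
kernel's SHORT_MAX, and 0777 stands in for S_IRWXUGO.

```c
#include <limits.h>
#include <stdint.h>

/* Same shape as the macro in the patch, with SHRT_MAX from limits.h. */
#define TST_OVERFLOW_16(a, b) \
	((sizeof(a) > sizeof(b)) && ((a) > SHRT_MAX))

/* Image side: fixed-width fields, independent of the running kernel. */
struct saved_perms {
	uint32_t uid;
	uint32_t mode;
};

/* Hypothetical destination with 16-bit fields. */
struct narrow_perms {
	uint16_t uid;
	uint16_t mode;
};

/* Hypothetical destination with 32-bit fields. */
struct wide_perms {
	uint32_t uid;
	uint32_t mode;
};

static int load_narrow(const struct saved_perms *h, struct narrow_perms *p)
{
	/* saved fields are wider here, so the range check is active */
	if (TST_OVERFLOW_16(h->uid, p->uid) ||
	    TST_OVERFLOW_16(h->mode, p->mode))
		return -1;
	if (h->mode & ~0777u)	/* only permission bits are allowed */
		return -1;
	p->uid = (uint16_t)h->uid;
	p->mode = (uint16_t)h->mode;
	return 0;
}

static int load_wide(const struct saved_perms *h, struct wide_perms *p)
{
	/* same width on both sides: sizeof test is false, check is skipped */
	if (TST_OVERFLOW_16(h->uid, p->uid) ||
	    TST_OVERFLOW_16(h->mode, p->mode))
		return -1;
	if (h->mode & ~0777u)
		return -1;
	p->uid = h->uid;
	p->mode = h->mode;
	return 0;
}
```

Note that the test is conservative: it rejects anything above the
signed 16-bit maximum even when the destination field is unsigned.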

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/checkpoint.h     |    7 +++-
 include/linux/checkpoint_hdr.h |   29 ++++++++++++++++
 ipc/Makefile                   |    1 +
 ipc/checkpoint.c               |   73 ++++++++++++++++++++++++++++++++++++++++
 ipc/util.h                     |    8 ++++
 5 files changed, 117 insertions(+), 1 deletions(-)
 create mode 100644 ipc/checkpoint.c

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 867033c..8a7fe9a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -12,6 +12,10 @@
 
 struct ckpt_ctx;
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -169,8 +173,9 @@ extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 #define CKPT_DPAGE	0x10		/* memory pages */
 #define CKPT_DOBJ	0x20		/* shared objects */
 #define CKPT_DFILE	0x40		/* files and filesystem */
+#define CKPT_DIPC	0x80		/* sysvipc */
 
-#define CKPT_DDEFAULT	0x4f		/* default debug level */
+#define CKPT_DDEFAULT	0xcf		/* default debug level */
 
 #ifndef CKPT_DFLAG
 #define CKPT_DFLAG	0x0		/* nothing */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3051031..0af5532 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -67,6 +67,11 @@ enum {
 	CKPT_HDR_FILE,
 	CKPT_HDR_FILE_PIPE,
 
+	CKPT_HDR_IPC = 401,
+	CKPT_HDR_IPC_SHM,
+	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_SEM,
+
 	CKPT_HDR_TAIL = 5001
 };
 
@@ -279,4 +284,28 @@ struct ckpt_hdr_file_pipe_state {
 	__s32 pipe_nrbufs;
 } __attribute__((aligned(8)));
 
+/* ipc commons */
+struct ckpt_hdr_ipc_perms {
+	__s32 id;
+	__u32 key;
+	__u32 uid;
+	__u32 gid;
+	__u32 cuid;
+	__u32 cgid;
+	__u32 mode;
+	__u32 _padding;
+	__u64 seq;
+} __attribute__((aligned(8)));
+
+
+#define CKPT_TST_OVERFLOW_16(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
+
+#define CKPT_TST_OVERFLOW_32(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > INT_MAX))
+
+#define CKPT_TST_OVERFLOW_64(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index 4e1955e..aa6c8dd 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
new file mode 100644
index 0000000..9a6cd9d
--- /dev/null
+++ b/ipc/checkpoint.c
@@ -0,0 +1,73 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/ipc.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+	return 0;
+}
+
+int restore_ipcns(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+void checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			       struct kern_ipc_perm *perm)
+{
+	h->id = perm->id;
+	h->key = perm->key;
+	h->uid = perm->uid;
+	h->gid = perm->gid;
+	h->cuid = perm->cuid;
+	h->cgid = perm->cgid;
+	h->mode = perm->mode & S_IRWXUGO;
+	h->seq = perm->seq;
+}
+
+int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			   struct kern_ipc_perm *perm)
+{
+	if (h->id < 0)
+		return -EINVAL;
+	if (CKPT_TST_OVERFLOW_16(h->uid, perm->uid) ||
+	    CKPT_TST_OVERFLOW_16(h->gid, perm->gid) ||
+	    CKPT_TST_OVERFLOW_16(h->cuid, perm->cuid) ||
+	    CKPT_TST_OVERFLOW_16(h->cgid, perm->cgid) ||
+	    CKPT_TST_OVERFLOW_16(h->mode, perm->mode))
+		return -EINVAL;
+	if (h->seq >= USHORT_MAX)
+		return -EINVAL;
+	if (h->mode & ~S_IRWXUGO)
+		return -EINVAL;
+
+	/* FIX: verify the ->mode field makes sense */
+
+	perm->id = h->id;
+	perm->key = h->key;
+#if 0 /* FIX: requires security checks */
+	perm->uid = h->uid;
+	perm->gid = h->gid;
+	perm->cuid = h->cuid;
+	perm->cgid = h->cgid;
+#endif
+	perm->mode = h->mode;
+	perm->seq = h->seq;
+
+	return 0;
+}
diff --git a/ipc/util.h b/ipc/util.h
index c75e3b2..d3f7367 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -11,6 +11,7 @@
 #define _IPC_UTIL_H
 
 #include <linux/err.h>
+#include <linux/checkpoint_hdr.h>
 
 #define SEQ_MULTIPLIER	(IPCMNI)
 
@@ -177,5 +178,12 @@ extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
 
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
+#ifdef CONFIG_CHECKPOINT
+extern void checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *hh,
+				      struct kern_ipc_perm *perm);
+extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *hh,
+				  struct kern_ipc_perm *perm);
+#endif
+
 
 #endif
-- 
1.5.4.3


* [RFC v14][PATCH 42/54] ipc namespace: save and restore ipc namespace basics
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (40 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 41/54] ipc: helpers to save and restore kern_ipc_perm structures Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 43/54] sysvipc-shm: checkpoint Oren Laadan
                     ` (13 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Save and restore the common state (parameters) of the ipc namespace.

Also add logic to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.
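
The iteration follows a simple pattern: write a header announcing how
many objects follow, then call a per-object callback for each entry and
stop at the first non-zero return. The userspace sketch below models
that pattern with a plain array in place of the idr. All names in it
are hypothetical, and the fail_at field exists only to exercise the
error path.

```c
#define NSLOTS 8	/* hypothetical table size */

struct obj_table {
	void *slot[NSLOTS];	/* a null pointer means unused */
	int in_use;
};

struct image {
	int count_hdr;	/* count announced in the header record */
	int nwritten;	/* object records written so far */
	int fail_at;	/* id whose save fails, or -1 (test hook) */
};

typedef int (*save_fn_t)(int id, void *p, void *data);

/* Visit every used slot; a non-zero return from @fn ends the walk. */
static int table_for_each(struct obj_table *t, save_fn_t fn, void *data)
{
	int i, ret;

	for (i = 0; i < NSLOTS; i++) {
		if (!t->slot[i])
			continue;
		ret = fn(i, t->slot[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Per-object callback: pretend to write one record to the image. */
static int save_one(int id, void *p, void *data)
{
	struct image *img = data;

	(void)p;
	if (id == img->fail_at)
		return -1;
	img->nwritten++;
	return 0;
}

/* Header first (object count), then one record per object. */
static int save_any(struct obj_table *t, struct image *img, save_fn_t fn)
{
	img->count_hdr = t->in_use;
	img->nwritten = 0;
	return table_for_each(t, fn, img);
}
```

Writing the count up front lets the restore side know how many records
to read back before moving on to the next object type.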

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/process.c           |    4 -
 include/linux/checkpoint.h     |    5 +-
 include/linux/checkpoint_hdr.h |   22 +++++
 ipc/checkpoint.c               |  204 ++++++++++++++++++++++++++++++++++++++--
 4 files changed, 221 insertions(+), 14 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index 966aa93..b731891 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -234,10 +234,8 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 
 	if (ns_flags & CLONE_NEWUTS)
 		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
-#if 0
 	if (!ret && (ns_flags & CLONE_NEWIPC))
 		ret = checkpoint_ipc_ns(ctx, nsproxy->ipc_ns);
-#endif
 
 	/* FIX: Write other namespaces here */
 	return ret;
@@ -545,10 +543,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	ckpt_debug("uts ns: %d\n", ret);
 	if (ret < 0)
 		goto out;
-#if 0
 	ret = restore_ipc_ns(ctx, h->ipc_ref, h->flags);
 	ckpt_debug("ipc ns: %d\n", ret);
-#endif
 
 	/* FIX: add more namespaces here */
  out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8a7fe9a..f915564 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -73,7 +73,6 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
-#if 0
 /* ipc-ns */
 #ifdef CONFIG_SYSVIPC
 extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
@@ -86,7 +85,9 @@ static inline int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 static inline int restore_ipc_ns(struct ckpt_ctx *ctx)
 { return 0; }
 #endif /* CONFIG_SYSVIPC */
-#endif
+
+extern int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns);
+extern int restore_ipcns(struct ckpt_ctx *ctx);
 
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0af5532..2ad3cb7 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -285,6 +285,28 @@ struct ckpt_hdr_file_pipe_state {
 } __attribute__((aligned(8)));
 
 /* ipc commons */
+struct ckpt_hdr_ipcns {
+	struct ckpt_hdr h;
+	__u64 shm_ctlmax;
+	__u64 shm_ctlall;
+	__s32 shm_ctlmni;
+
+	__s32 msg_ctlmax;
+	__s32 msg_ctlmnb;
+	__s32 msg_ctlmni;
+
+	__s32 sem_ctl_msl;
+	__s32 sem_ctl_mns;
+	__s32 sem_ctl_opm;
+	__s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+	struct ckpt_hdr h;
+	__u32 ipc_type;
+	__u32 ipc_count;
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_ipc_perms {
 	__s32 id;
 	__u32 key;
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 9a6cd9d..e00e524 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -17,15 +17,15 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
-int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
-{
-	return 0;
-}
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
 
-int restore_ipcns(struct ckpt_ctx *ctx)
-{
-	return 0;
-}
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
+
+/**************************************************************************
+ * Checkpoint
+ */
 
 void checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 			       struct kern_ipc_perm *perm)
@@ -40,6 +40,82 @@ void checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 	h->seq = perm->seq;
 }
 
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+			      struct ipc_namespace *ipc_ns,
+			      int ipc_ind, int ipc_type,
+			      int (*func)(int id, void *p, void *data))
+{
+	struct ckpt_hdr_ipc *h;
+	struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+	int ret = -ENOMEM;
+
+	down_read(&ipc_ids->rw_mutex);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (!h)
+		goto out;
+
+	h->ipc_type = ipc_type;
+	h->ipc_count = ipc_ids->in_use;
+	ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+	ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+	up_read(&ipc_ids->rw_mutex);
+	return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->shm_ctlmax = ipc_ns->shm_ctlmax;
+	h->shm_ctlall = ipc_ns->shm_ctlall;
+	h->shm_ctlmni = ipc_ns->shm_ctlmni;
+
+	h->msg_ctlmax = ipc_ns->msg_ctlmax;
+	h->msg_ctlmnb = ipc_ns->msg_ctlmnb;
+	h->msg_ctlmni = ipc_ns->msg_ctlmni;
+
+	h->sem_ctl_msl = ipc_ns->sem_ctls[0];
+	h->sem_ctl_mns = ipc_ns->sem_ctls[1];
+	h->sem_ctl_opm = ipc_ns->sem_ctls[2];
+	h->sem_ctl_mni = ipc_ns->sem_ctls[3];
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
+#endif
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
 int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 			   struct kern_ipc_perm *perm)
 {
@@ -71,3 +147,115 @@ int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 
 	return 0;
 }
+
+static int restore_ipc_any(struct ckpt_ctx *ctx, int ipc_ind, int ipc_type,
+			   int (*func)(struct ckpt_ctx *ctx))
+{
+	struct ckpt_hdr_ipc *h;
+	int n, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("ipc-%s: count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = -EINVAL;
+	if (h->ipc_type != ipc_type)
+		goto out;
+
+	ret = 0;
+	for (n = 0; n < h->ipc_count; n++) {
+		ret = (*func)(ctx);
+		if (ret < 0)
+			goto out;
+	}
+ out:
+	ckpt_debug("ipc-%s: ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int do_restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->shm_ctlmax < 0 || h->shm_ctlall < 0 || h->shm_ctlmni < 0)
+		goto out;
+	if (h->msg_ctlmax < 0 || h->msg_ctlmnb < 0 || h->msg_ctlmni < 0)
+		goto out;
+	if (h->sem_ctl_msl < 0 || h->sem_ctl_mns < 0 ||
+	    h->sem_ctl_opm < 0 || h->sem_ctl_mni < 0)
+		goto out;
+
+	/* this is a brand new ipc_ns: safe to rewrite its properties */
+	ipc_ns->shm_ctlmax = h->shm_ctlmax;
+	ipc_ns->shm_ctlall = h->shm_ctlall;
+	ipc_ns->shm_ctlmni = h->shm_ctlmni;
+
+	ipc_ns->msg_ctlmax = h->msg_ctlmax;
+	ipc_ns->msg_ctlmnb = h->msg_ctlmnb;
+	ipc_ns->msg_ctlmni = h->msg_ctlmni;
+
+	ipc_ns->sem_ctls[0] = h->sem_ctl_msl;
+	ipc_ns->sem_ctls[1] = h->sem_ctl_mns;
+	ipc_ns->sem_ctls[2] = h->sem_ctl_opm;
+	ipc_ns->sem_ctls[3] = h->sem_ctl_mni;
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = restore_ipc_any(ctx, IPC_SHM_IDS,
+			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_read_ipc_any(ctx, IPC_MSG_IDS,
+			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, IPC_SEM_IDS,
+			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
+#endif
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_ipc_ns(struct ckpt_ctx *ctx, int ns_objref, int flags)
+{
+	struct ipc_namespace *ipc_ns;
+	int ret = 0;
+
+	ipc_ns = ckpt_obj_fetch(ctx, ns_objref, CKPT_OBJ_IPC_NS);
+	if (IS_ERR(ipc_ns))
+		return PTR_ERR(ipc_ns);
+
+	/* sanity: CLONE_NEWIPC if-and-only-if ipc_ns is NULL (first time) */
+	if (!!ipc_ns ^ !(flags & CLONE_NEWIPC))
+		return -EINVAL;
+
+	if (!ipc_ns) {
+		ret = do_restore_ipc_ns(ctx);
+		if (ret < 0)
+			return ret;
+		ret = ckpt_obj_insert(ctx, current->nsproxy->ipc_ns,
+				      ns_objref, CKPT_OBJ_IPC_NS);
+	} else {
+		struct ipc_namespace *old_ipc_ns;
+
+		/* safe because nsproxy->count must be 1 ... */
+		BUG_ON(atomic_read(&current->nsproxy->count) != 1);
+
+		old_ipc_ns = current->nsproxy->ipc_ns;
+		current->nsproxy->ipc_ns = ipc_ns;
+		get_ipc_ns(ipc_ns);
+		put_ipc_ns(old_ipc_ns);
+	}
+
+	return ret;
+}
-- 
1.5.4.3

* [RFC v14][PATCH 43/54] sysvipc-shm: checkpoint
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (41 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 42/54] ipc namespace: save and restore ipc namespace basics Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
       [not found]     ` <1240961064-13991-44-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:24   ` [RFC v14][PATCH 44/54] sysvipc-shm: restart Oren Laadan
                     ` (12 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.

(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.
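
To illustrate the two-step scheme described above, here is a minimal
userspace sketch. It is not the kernel code: the toy_* names, the fixed-size
table and the 1-based objrefs are assumptions made for illustration only, and
where the patch uses BUG_ON() for a segment seen twice, the sketch simply
returns -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_OBJS 16

/* Toy stand-in for the objhash: maps a pointer to a small integer objref. */
struct toy_objhash {
	const void *ptr[TOY_MAX_OBJS];
	int count;
};

/*
 * Return the (1-based) objref for ptr, adding it if it is not known yet.
 * *first is set to 1 if ptr was newly added, 0 if it was already present.
 */
static int toy_obj_lookup_add(struct toy_objhash *hash, const void *ptr,
			      int *first)
{
	int i;

	assert(hash && ptr && first);
	for (i = 0; i < hash->count; i++) {
		if (hash->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (hash->count == TOY_MAX_OBJS)
		return -ENOMEM;
	hash->ptr[hash->count++] = ptr;
	*first = 1;
	return hash->count;
}

/* Step 1: while dumping the ipc namespace, register each shm inode. */
static int toy_checkpoint_shm(struct toy_objhash *hash, const void *inode)
{
	int first;
	int objref = toy_obj_lookup_add(hash, inode, &first);

	if (objref < 0)
		return objref;
	/* a segment must not show up twice during the namespace dump */
	return first ? objref : -EINVAL;
}

/* Step 2: for a vma backed by ipc shm, the inode must already be known. */
static int toy_checkpoint_vma(struct toy_objhash *hash, const void *inode)
{
	int first;
	int objref = toy_obj_lookup_add(hash, inode, &first);

	if (objref < 0)
		return objref;
	/* a new inode here means the namespace changed after step 1 */
	return first ? -EBUSY : objref;
}
```

In the sketch, a vma whose inode was registered in step 1 gets back the same
objref, while a vma whose inode was never registered is rejected with -EBUSY,
which mirrors the check in ipcshm_checkpoint() in this patch.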

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c            |    6 +-
 include/linux/checkpoint.h     |    3 +
 include/linux/checkpoint_hdr.h |   17 ++++++
 ipc/Makefile                   |    2 +-
 ipc/checkpoint.c               |    2 +-
 ipc/checkpoint_shm.c           |  112 ++++++++++++++++++++++++++++++++++++++++
 ipc/shm.c                      |   29 ++++++++++
 ipc/util.h                     |    3 +-
 8 files changed, 168 insertions(+), 6 deletions(-)
 create mode 100644 ipc/checkpoint_shm.c

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 7a6e3f4..ee26254 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -457,9 +457,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma,
-				      struct inode *inode)
+int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+			       struct vm_area_struct *vma,
+			       struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index f915564..65838b4 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -112,6 +112,9 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma,
+				      struct inode *inode);
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
 extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 2ad3cb7..12a808c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -213,6 +213,8 @@ enum vma_type {
 	CKPT_VMA_SHM_ANON,	/* shared anonymous */
 	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
 	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
+	CKPT_VMA_SHM_IPC,	/* shared sysvipc */
+	CKPT_VMA_SHM_IPC_SKIP,	/* shared sysvipc (skip contents) */
 	CKPT_VMA_MAX,
 };
 
@@ -308,6 +310,7 @@ struct ckpt_hdr_ipc {
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_ipc_perms {
+	struct ckpt_hdr h;
 	__s32 id;
 	__u32 key;
 	__u32 uid;
@@ -319,6 +322,20 @@ struct ckpt_hdr_ipc_perms {
 	__u64 seq;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_shm {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 shm_segsz;
+	__u64 shm_atim;
+	__u64 shm_dtim;
+	__u64 shm_ctim;
+	__s32 shm_cprid;
+	__s32 shm_lprid;
+	__u32 mlock_uid;
+	__u32 flags;
+	__u32 objref;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index aa6c8dd..7e23683 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,5 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_CHECKPOINT) += checkpoint.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index e00e524..127eb63 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -97,9 +97,9 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
 	if (ret < 0)
 		return ret;
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
new file mode 100644
index 0000000..16d8c9d
--- /dev/null
+++ b/ipc/checkpoint_shm.c
@@ -0,0 +1,112 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc shm
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/shm.h>
+#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&shp->shm_perm);
+
+	checkpoint_fill_ipc_perms(&h->perms, &shp->shm_perm);
+
+	h->shm_segsz = shp->shm_segsz;
+	h->shm_atim = shp->shm_atim;
+	h->shm_dtim = shp->shm_dtim;
+	h->shm_ctim = shp->shm_ctim;
+	h->shm_cprid = shp->shm_cprid;
+	h->shm_lprid = shp->shm_lprid;
+
+	if (shp->mlock_user)
+		h->mlock_uid = shp->mlock_user->uid;
+	else
+		h->mlock_uid = (unsigned int) -1;
+
+	h->flags = 0;
+	/* check if shm was setup with SHM_NORESERVE */
+	if (SHMEM_I(shp->shm_file->f_dentry->d_inode)->flags & VM_NORESERVE)
+		h->flags |= SHM_NORESERVE;
+	/* check if shm was setup with SHM_HUGETLB (unsupported yet) */
+	if (is_file_hugepages(shp->shm_file)) {
+		pr_warning("c/r: unsupported SHM_HUGETLB\n");
+		ret = -ENOSYS;
+	}
+
+	ipc_unlock(&shp->shm_perm);
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	return ret;
+}
+
+int checkpoint_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+	struct inode *inode;
+	int first, objref;
+	int ret;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	inode = shp->shm_file->f_dentry->d_inode;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+	/* this must be the first time we see this region */
+	BUG_ON(!first);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto out;
+
+	h->objref = objref;
+	ckpt_debug("shm: objref %d\n", h->objref);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	ret = checkpoint_memory_contents(ctx, NULL, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/shm.c b/ipc/shm.c
index 7dd5f0c..c521e95 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -40,6 +40,8 @@
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
 #include <linux/ima.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 
@@ -305,6 +307,32 @@ int is_file_shm_hugepages(struct file *file)
 	return ret;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	int ino_objref;
+	int first;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+				       CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	/*
+	 * This shouldn't happen, because all IPC regions should have
+	 * been already dumped by now via ipc namespaces; It means
+	 * the ipc_ns has been modified recently during checkpoint.
+	 */
+	if (first)
+		return -EBUSY;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_IPC_SKIP,
+				      0, ino_objref);
+}
+#else
+define ipcshm_checkpoint NULL
+#endif
+
 static const struct file_operations shm_file_operations = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
@@ -320,6 +348,7 @@ static struct vm_operations_struct shm_vm_ops = {
 	.set_policy = shm_set_policy,
 	.get_policy = shm_get_policy,
 #endif
+	.checkpoint = ipcshm_checkpoint,
 };
 
 /**
diff --git a/ipc/util.h b/ipc/util.h
index d3f7367..e4799cb 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -183,7 +183,8 @@ extern void checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *hh,
 				      struct kern_ipc_perm *perm);
 extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *hh,
 				  struct kern_ipc_perm *perm);
-#endif
 
+extern int checkpoint_ipc_shm(int id, void *p, void *data);
+#endif
 
 #endif
-- 
1.5.4.3

* [RFC v14][PATCH 44/54] sysvipc-shm: restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (42 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 43/54] sysvipc-shm: checkpoint Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 45/54] sysvipc-shm: export interface from ipc/shm.c to delete ipc shm Oren Laadan
                     ` (11 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Like checkpoint, restart of sysvipc shared memory is also performed in
two steps: first, the entire ipc namespace is restored as a whole, by
restoring each shm object read from the checkpoint image. The shmem's
file pointer is registered in the objhash. Second, for each vma that
refers to ipc shared memory, we use the objref to find the file in the
objhash, and use that file in calling do_mmap_pgoff().

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.

ipc shm segments that are locked (via SHM_LOCK) are also not supported
at the moment: restart of such a segment fails with -ENOSYS.
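
The restart side of the scheme can be sketched in the same spirit. Again,
this is a userspace illustration and not the kernel code: the toy_* names and
the array indexed directly by objref are assumptions. The real objhash is
accessed through ckpt_obj_insert() and ckpt_obj_fetch() as shown in the diff
below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_REFS 16

/* Toy objhash for restart: the objref read from the image indexes a table. */
struct toy_restart_hash {
	void *obj[TOY_MAX_REFS];
};

/* Step 1: after recreating a shm segment, deposit its file under objref. */
static int toy_obj_insert(struct toy_restart_hash *hash, int objref,
			  void *file)
{
	assert(hash);
	if (objref <= 0 || objref >= TOY_MAX_REFS || !file)
		return -EINVAL;
	if (hash->obj[objref])
		return -EEXIST;	/* each objref may be registered only once */
	hash->obj[objref] = file;
	return 0;
}

/* Step 2: a vma record carries the objref; NULL means it is unknown. */
static void *toy_obj_fetch(const struct toy_restart_hash *hash, int objref)
{
	assert(hash);
	if (objref <= 0 || objref >= TOY_MAX_REFS)
		return NULL;
	return hash->obj[objref];
}
```

A vma record whose objref was never deposited in step 1 yields NULL in the
sketch. The patch treats that case as an invalid image and fails with
-EINVAL.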

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c  |   22 +++++++++++
 include/linux/shm.h  |    9 ++++
 ipc/checkpoint.c     |    2 +-
 ipc/checkpoint_shm.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++
 ipc/shm.c            |   46 +++++++++++++++++++++++
 ipc/util.h           |    1 +
 6 files changed, 179 insertions(+), 1 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index ee26254..7637c1e 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
+#include <linux/shm.h>
 #include <linux/proc_fs.h>
 #include <linux/swap.h>
 #include <linux/checkpoint.h>
@@ -1016,6 +1017,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx,
 	return private_vma_restore(ctx, mm, NULL, h);
 }
 
+static int bad_vma_restore(struct ckpt_ctx *ctx,
+			   struct mm_struct *mm,
+			   struct ckpt_hdr_vma *h)
+{
+	return -EINVAL;
+}
+
 /* callbacks to restore vma per its type: */
 struct restore_vma_ops {
 	char *vma_name;
@@ -1068,6 +1076,20 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_SHM_FILE,
 		.restore = filemap_restore,
 	},
+	/* sysvipc shared */
+	{
+		.vma_name = "IPC SHARED",
+		.vma_type = CKPT_VMA_SHM_IPC,
+		/* ipc inode itself is restored by restore_ipc_ns()... */
+		.restore = bad_vma_restore,
+
+	},
+	/* sysvipc shared (skip) */
+	{
+		.vma_name = "IPC SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_IPC_SKIP,
+		.restore = ipcshm_restore,
+	},
 };
 
 /**
diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..1122197 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,15 @@ static inline int is_file_shm_hugepages(struct file *file)
 }
 #endif
 
+#ifdef CONFIG_CHECKPOINT
+#ifdef CONFIG_SYSVIPC
+extern int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			  struct ckpt_hdr_vma *h);
+#else
+#define ipcshm_restore NULL
+#endif
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 127eb63..f951c1e 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -209,9 +209,9 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx)
 	ipc_ns->sem_ctls[2] = h->sem_ctl_opm;
 	ipc_ns->sem_ctls[3] = h->sem_ctl_mni;
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = restore_ipc_any(ctx, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = ckpt_read_ipc_any(ctx, IPC_MSG_IDS,
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
index 16d8c9d..9e0d028 100644
--- a/ipc/checkpoint_shm.c
+++ b/ipc/checkpoint_shm.c
@@ -110,3 +110,103 @@ int checkpoint_ipc_shm(int id, void *p, void *data)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret;
+
+	ret = restore_load_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	if (h->shm_cprid < 0 || h->shm_lprid < 0)
+		return -EINVAL;
+
+	shp->shm_segsz = h->shm_segsz;
+	shp->shm_atim = h->shm_atim;
+	shp->shm_dtim = h->shm_dtim;
+	shp->shm_ctim = h->shm_ctim;
+	shp->shm_cprid = h->shm_cprid;
+	shp->shm_lprid = h->shm_lprid;
+
+	return 0;
+}
+
+int restore_ipc_shm(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct kern_ipc_perm *perms;
+	struct shmid_kernel *shp;
+	struct ipc_ids *shm_ids = &current->nsproxy->ipc_ns->ids[IPC_SHM_IDS];
+	struct file *file;
+	int shmflag;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+#define CKPT_SHMFL_MASK  (SHM_NORESERVE | SHM_HUGETLB)
+	if (h->flags & ~CKPT_SHMFL_MASK)
+		goto out;
+
+	ret = -ENOSYS;
+	if (h->mlock_uid != (unsigned int) -1)	/* FIXME: support SHM_LOCK */
+		goto out;
+	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
+		goto out;
+
+	/* FIXME: this will fail for deleted ipc shm segments */
+
+	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
+		 h->shm_segsz, shmflag, h->perms.id);
+	ret = do_shmget(h->perms.key, h->shm_segsz, shmflag, h->perms.id);
+	ckpt_debug("shm: do_shmget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&shm_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(shm_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	shp = container_of(perms, struct shmid_kernel, shm_perm);
+	file = shp->shm_file;
+	get_file(file);
+
+	ret = load_ipc_shm_hdr(ctx, h, shp);
+	ipc_unlock(perms);
+	if (ret < 0)
+		goto mutex;
+
+	/* deposit in objhash and read contents in */
+	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
+	if (ret < 0)
+		goto mutex;
+	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ mutex:
+	fput(file);
+	if (ret < 0) {
+		ckpt_debug("shm: need to remove (%d)\n", ret);
+		do_shm_rmid(current->nsproxy->ipc_ns, perms);
+	}
+	up_write(&shm_ids->rw_mutex);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/shm.c b/ipc/shm.c
index c521e95..5625dbf 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -329,6 +329,52 @@ static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_IPC_SKIP,
 				      0, ino_objref);
 }
+
+int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+		   struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int shmid, shmflg = 0;
+	mm_segment_t old_fs;
+	unsigned long start;
+	unsigned long addr;
+	int ret;
+
+	if (!h->ino_objref)
+		return -EINVAL;
+	/* FIX: verify the vm_flags too */
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (!file)
+		return -EINVAL;
+	else if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	shmid = file->f_dentry->d_inode->i_ino;
+
+	if (!(h->vm_flags & VM_WRITE))
+		shmflg |= SHM_RDONLY;
+
+	/*
+	 * FIX: do_shmat() has limited interface: all-or-nothing
+	 * mapping. If the vma, however, reflects a partial mapping
+	 * then we need to modify that function to accomplish the
+	 * desired outcome.  Partial mapping can exist due to the user
+	 * calling shmat() and then unmapping part of the region.
+	 * Currently, we at least detect this and call it foul play.
+	 */
+	if (((h->vm_end - h->vm_start) != h->ino_size) || h->vm_pgoff)
+		return -ENOSYS;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	start = h->vm_start;
+	ret = do_shmat(shmid, (char __user *) start, shmflg, &addr);
+	set_fs(old_fs);
+
+	BUG_ON(ret >= 0 && addr != h->vm_start);
+	return ret;
+}
 #else
 define ipcshm_checkpoint NULL
 #endif
diff --git a/ipc/util.h b/ipc/util.h
index e4799cb..be2fb94 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -185,6 +185,7 @@ extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *hh,
 				  struct kern_ipc_perm *perm);
 
 extern int checkpoint_ipc_shm(int id, void *p, void *data);
+extern int restore_ipc_shm(struct ckpt_ctx *ctx);
 #endif
 
 #endif
-- 
1.5.4.3

* [RFC v14][PATCH 45/54] sysvipc-shm: export interface from ipc/shm.c to delete ipc shm
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (43 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 44/54] sysvipc-shm: restart Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 46/54] sysvipc-shm: correctly handle deleted (active) ipc shared memory Oren Laadan
                     ` (10 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Export shmctl_down() which will be used in the next patch during
restart to delete an ipc shm (the shm is mapped already, so it
won't be lost).

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/shm.h |    4 ++++
 ipc/shm.c           |    4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/shm.h b/include/linux/shm.h
index 1122197..524fb3b 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -127,6 +127,10 @@ define ipcshm_restart NULL
 #endif
 #endif
 
+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		       struct shmid_ds __user *buf, int version);
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/shm.c b/ipc/shm.c
index 5625dbf..70e0651 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -673,8 +673,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
-		       struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		struct shmid_ds __user *buf, int version)
 {
 	struct kern_ipc_perm *ipcp;
 	struct shmid64_ds shmid64;
-- 
1.5.4.3

* [RFC v14][PATCH 46/54] sysvipc-shm: correctly handle deleted (active) ipc shared memory
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (44 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 45/54] sysvipc-shm: export interface from ipc/shm.c to delete ipc shm Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 47/54] sysvipc-msg: make 'struct msg_msgseg' visible in ipc/util.h Oren Laadan
                     ` (9 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

During restart, an ipc shared region may have SHM_DEST set, indicating
that it had been deleted (while still attached) before the checkpoint.
In this case, deletion of the region is postponed until the end of the
restart; deleting it right after it is restored would be quite silly,
because it would be gone before any task could attach to it :o
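
The deferral idea can be sketched in userspace as follows. This is only an
illustration under stated assumptions: the toy_* names are made up, the
queue is a plain singly-linked list, and unlike deferqueue_add() as called in
this patch it takes no destructor argument.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*toy_defer_fn)(void *data);

struct toy_defer_entry {
	struct toy_defer_entry *next;
	toy_defer_fn func;
	char data[];	/* private copy of the caller's data */
};

struct toy_deferqueue {
	struct toy_defer_entry *head, *tail;
};

/* Queue func to run later on a copy of data; returns 0, or -1 on failure. */
static int toy_defer_add(struct toy_deferqueue *q, const void *data,
			 size_t size, toy_defer_fn func)
{
	struct toy_defer_entry *e = malloc(sizeof(*e) + size);

	if (!e)
		return -1;
	e->next = NULL;
	e->func = func;
	memcpy(e->data, data, size);
	if (q->tail)
		q->tail->next = e;
	else
		q->head = e;
	q->tail = e;
	return 0;
}

/* Run and free all queued entries in FIFO order; return how many ran. */
static int toy_defer_run(struct toy_deferqueue *q)
{
	struct toy_defer_entry *e;
	int nr = 0;

	while ((e = q->head) != NULL) {
		q->head = e->next;
		e->func(e->data);
		free(e);
		nr++;
	}
	q->tail = NULL;
	return nr;
}

/* Example deferred work: mark a toy segment as deleted. */
struct toy_shm_del {
	int *segments;	/* nonzero entry means the segment exists */
	int id;
};

static int toy_shm_delete(void *data)
{
	struct toy_shm_del dq;

	/* copy out of the queue entry before using it */
	memcpy(&dq, data, sizeof(dq));
	dq.segments[dq.id] = 0;
	return 0;
}
```

Because the queue keeps its own copy of the data, the caller may pass a
structure that lives on its stack, as restore_ipc_shm() does with
'struct dq_ipcshm_del' in the diff below. The segment stays in place until
the queue is run.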

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/sys.c                 |   10 ++++++++
 include/linux/checkpoint_types.h |    1 +
 ipc/checkpoint_shm.c             |   48 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 58 insertions(+), 1 deletions(-)

diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index e3f7012..536f649 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -19,6 +19,7 @@
 #include <linux/uaccess.h>
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
 
 /*
  * ckpt_unpriv_allowed - sysctl_controlled, do not allow checkpoint of
@@ -216,8 +217,17 @@ static void task_arr_free(struct ckpt_ctx *ctx)
 
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
+	int ret;
+
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->deferqueue) {
+		ret = deferqueue_run(ctx->deferqueue);
+		if (ret != 0)
+			pr_warning("c/r: deferqueue had %d entries\n", ret);
+		deferqueue_destroy(ctx->deferqueue);
+	}
+
 	if (ctx->file)
 		fput(ctx->file);
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index a8dc5b3..8d30dbb 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -52,6 +52,7 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *deferqueue;	/* queue of deferred work */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
index 9e0d028..d265351 100644
--- a/ipc/checkpoint_shm.c
+++ b/ipc/checkpoint_shm.c
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/deferqueue.h>
 
 #include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
 #include "util.h"
@@ -115,6 +116,30 @@ int checkpoint_ipc_shm(int id, void *p, void *data)
  * ipc restart
  */
 
+struct dq_ipcshm_del {
+	/*
+	 * XXX: always keep ->ipcns first so that put_ipc_ns() can
+	 * be safely provided as the dtor for this deferqueue object
+	 */
+	struct ipc_namespace *ipcns;
+	int id;
+};
+
+static int ipc_shm_delete(void *data)
+{
+	struct dq_ipcshm_del *dq = (struct dq_ipcshm_del *) data;
+	mm_segment_t old_fs;
+	int ret;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	ret = shmctl_down(dq->ipcns, dq->id, IPC_RMID, NULL, 0);
+	set_fs(old_fs);
+
+	put_ipc_ns(dq->ipcns);
+	return ret;
+}
+
 static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
 			    struct ckpt_hdr_ipc_shm *h,
 			    struct shmid_kernel *shp)
@@ -169,7 +194,28 @@ int restore_ipc_shm(struct ckpt_ctx *ctx)
 	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
 		goto out;
 
-	/* FIXME: this will fail for deleted ipc shm segments */
+	/*
+	 * SHM_DEST means that the shm is to be deleted after creation.
+	 * However, deleting before it's actually attached is quite silly.
+	 * Instead, we defer this task until restart has succeeded.
+	 */
+	if (h->perms.mode & SHM_DEST) {
+		struct dq_ipcshm_del dq;
+
+		/* to not confuse the rest of the code */
+		h->perms.mode &= ~SHM_DEST;
+
+		dq.id = h->perms.id;
+		dq.ipcns = current->nsproxy->ipc_ns;
+		get_ipc_ns(dq.ipcns);
+
+		/* XXX can safely use put_ipc_ns() as dtor, see above */
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     (deferqueue_func_t) ipc_shm_delete,
+				     (deferqueue_func_t) put_ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
 
 	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
 	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
-- 
1.5.4.3


* [RFC v14][PATCH 47/54] sysvipc-msg: make 'struct msg_msgseg' visible in ipc/util.h
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (45 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 46/54] sysvipc-shm: correctly handle deleted (active) ipc shared memory Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 48/54] sysvipc-msq: checkpoint Oren Laadan
                     ` (8 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

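Move the definitions of 'struct msg_msgseg' and the DATALEN_* constants
to ipc/util.h, where they are visible to ipc/checkpoint_msg.c. Also make
freeque() non-static and declare it, together with do_msgget(), in
ipc/util.h.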

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 ipc/msg.c     |    3 +--
 ipc/msgutil.c |    8 --------
 ipc/util.h    |   10 ++++++++++
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..1d5d087 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {
 
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct list_head *tmp;
 	struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
 
-struct msg_msgseg {
-	struct msg_msgseg* next;
-	/* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
-
 struct msg_msg *load_msg(const void __user *src, int len)
 {
 	struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index be2fb94..09b9bdf 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -140,6 +140,14 @@ extern void free_msg(struct msg_msg *msg);
 extern struct msg_msg *load_msg(const void __user *src, int len);
 extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
 
+struct msg_msgseg {
+	struct msg_msgseg *next;
+	/* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
+
 extern void recompute_msgmni(struct ipc_namespace *);
 
 static inline int ipc_buildid(int id, int seq)
@@ -175,6 +183,8 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 /* for checkpoint/restart */
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
+extern int do_msgget(key_t key, int msgflg, int req_id);
+extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
-- 
1.5.4.3


* [RFC v14][PATCH 48/54] sysvipc-msq: checkpoint
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (46 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 47/54] sysvipc-msg: make 'struct msg_msgseg' visible in ipc/util.h Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 49/54] sysvipc-msq: restart Oren Laadan
                     ` (7 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Checkpoint of sysvipc message queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The messages
queued on each 'msq' are dumped with that object.

Messages of a specific queue are written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
someone (who?) writing or reading. To do that we grab the lock, then
hijack the entire chain of messages from the queue, drop the lock,
and then safely dump them in a loop. Finally, with the lock held, we
re-attach the chain while verifying that there isn't other (new) data
on that queue.
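
The detach/dump/re-attach pattern described above can be illustrated with
a minimal userspace sketch. It is not the kernel code: 'struct msg',
'struct queue' and the three helpers are hypothetical stand-ins, a pthread
mutex plays the role of the ipc lock, and "dumping" is reduced to summing
the message types.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's msg_msg and msq. */
struct msg {
	struct msg *next;
	int type;
};

struct queue {
	pthread_mutex_t lock;
	struct msg *head;	/* singly linked chain of queued messages */
};

/* Detach the whole chain while holding the lock; the queue is left empty. */
static struct msg *queue_detach_all(struct queue *q)
{
	struct msg *chain;

	assert(q != NULL);
	pthread_mutex_lock(&q->lock);
	chain = q->head;
	q->head = NULL;
	pthread_mutex_unlock(&q->lock);
	return chain;
}

/* Walk the detached chain without the lock; "dumping" is just a sum here. */
static int dump_chain(const struct msg *chain)
{
	int sum = 0;

	for (; chain; chain = chain->next)
		sum += chain->type;
	return sum;
}

/*
 * Put the chain back under the lock. Returns 0 on success, or -1 if new
 * messages showed up in the meantime; in that case the old chain is still
 * placed ahead of the new ones so the original ordering is kept.
 */
static int queue_reattach(struct queue *q, struct msg *chain)
{
	struct msg *tail;
	int ret = 0;

	assert(q != NULL);
	pthread_mutex_lock(&q->lock);
	if (q->head)
		ret = -1;
	if (chain) {
		for (tail = chain; tail->next; tail = tail->next)
			;
		tail->next = q->head;
		q->head = chain;
	}
	pthread_mutex_unlock(&q->lock);
	return ret;
}
```

A caller would run queue_detach_all(), then dump_chain() with no lock
held, then queue_reattach(), and treat a -1 from the last step as a sign
that the queue was in use during the dump.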

Writing the message contents themselves is straightforward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.
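
As an illustration of how a message body is split into a first chunk of
at most DATALEN_MSG bytes followed by chunks of at most DATALEN_SEG bytes,
here is a small standalone sketch. The EX_* sizes are assumptions picked
for the example (a 4096-byte page and made-up header sizes); the kernel
derives the real values from PAGE_SIZE and the actual structure sizes.
split_message() is a hypothetical helper, not part of this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes, for illustration only. */
#define EX_PAGE_SIZE	4096
#define EX_MSG_HDR	48	/* stand-in for sizeof(struct msg_msg) */
#define EX_SEG_HDR	8	/* stand-in for sizeof(struct msg_msgseg) */
#define EX_DATALEN_MSG	(EX_PAGE_SIZE - EX_MSG_HDR)
#define EX_DATALEN_SEG	(EX_PAGE_SIZE - EX_SEG_HDR)

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Fill lens[] with the length of each chunk that a message of 'total'
 * bytes is split into. Returns the number of chunks, or 0 if lens[]
 * (with room for 'max' entries) is too small.
 */
static size_t split_message(size_t total, size_t *lens, size_t max)
{
	size_t n = 0;
	size_t len;

	assert(lens != NULL);
	if (max == 0)
		return 0;

	/* the first chunk lives right after the message header */
	len = min_size(total, EX_DATALEN_MSG);
	lens[n++] = len;
	total -= len;

	/* the remaining chunks live in chained segments */
	while (total) {
		if (n == max)
			return 0;
		len = min_size(total, EX_DATALEN_SEG);
		lens[n++] = len;
		total -= len;
	}
	return n;
}
```

With these assumed sizes, a 5000-byte message yields two chunks, one of
4048 bytes and one of 952 bytes. The checkpoint and restart loops follow
the same walk, writing or reading one buffer per chunk.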

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/checkpoint_hdr.h |   21 +++++-
 ipc/Makefile                   |    2 +-
 ipc/checkpoint.c               |    2 +-
 ipc/checkpoint_msg.c           |  164 ++++++++++++++++++++++++++++++++++++++++
 ipc/util.h                     |    2 +
 5 files changed, 188 insertions(+), 3 deletions(-)
 create mode 100644 ipc/checkpoint_msg.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 12a808c..bf34b08 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -70,6 +70,7 @@ enum {
 	CKPT_HDR_IPC = 401,
 	CKPT_HDR_IPC_SHM,
 	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_TAIL = 5001
@@ -336,6 +337,25 @@ struct ckpt_hdr_ipc_shm {
 	__u32 objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_msg {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 q_stime;
+	__u64 q_rtime;
+	__u64 q_ctime;
+	__u64 q_cbytes;
+	__u64 q_qnum;
+	__u64 q_qbytes;
+	__s32 q_lspid;
+	__s32 q_lrpid;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_msg_msg {
+	struct ckpt_hdr h;
+	__s32 m_type;
+	__u32 m_ts;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
@@ -346,5 +366,4 @@ struct ckpt_hdr_ipc_shm {
 #define CKPT_TST_OVERFLOW_64(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
 
-
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index 7e23683..ca408ff 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,5 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o checkpoint_msg.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index f951c1e..5f63300 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -99,11 +99,11 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
 
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
new file mode 100644
index 0000000..f0b0921
--- /dev/null
+++ b/ipc/checkpoint_msg.c
@@ -0,0 +1,164 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc msg
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/msg.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&msq->q_perm);
+
+	checkpoint_fill_ipc_perms(&h->perms, &msq->q_perm);
+
+	h->q_stime = msq->q_stime;
+	h->q_rtime = msq->q_rtime;
+	h->q_ctime = msq->q_ctime;
+	h->q_cbytes = msq->q_cbytes;
+	h->q_qnum = msq->q_qnum;
+	h->q_qbytes = msq->q_qbytes;
+	h->q_lspid = msq->q_lspid;
+	h->q_lrpid = msq->q_lrpid;
+
+	ipc_unlock(&msq->q_perm);
+
+	ckpt_debug("msg: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	return ret;
+}
+
+static int checkpoint_msg_contents(struct ckpt_ctx *ctx, struct msg_msg *msg)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msgseg *seg;
+	int total, len;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	h->m_type = msg->m_type;
+	h->m_ts = msg->m_ts;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	total = msg->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	ret = ckpt_write_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		return ret;
+
+	seg = msg->next;
+	total -= len;
+
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		ret = ckpt_write_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			break;
+		seg = seg->next;
+		total -= len;
+	}
+
+	return ret;
+}
+
+static int checkpoint_msg_queue(struct ckpt_ctx *ctx, struct msg_queue *msq)
+{
+	LIST_HEAD(messages);	/* must be valid even if we bail out early */
+	struct msg_msg *msg;
+	int ret = -EBUSY;
+
+	/*
+	 * Scanning the msq requires the lock, but then we can't write
+	 * data out from inside. Instead, we grab the lock, remove all
+	 * messages to our own list, drop the lock, write the messages,
+	 * and finally re-attach them to the msq with the lock taken.
+	 */
+	ipc_lock_by_ptr(&msq->q_perm);
+	if (!list_empty(&msq->q_receivers))
+		goto unlock;
+	if (!list_empty(&msq->q_senders))
+		goto unlock;
+	if (list_empty(&msq->q_messages))
+		goto unlock;
+	/* temporarily take out all messages */
+	INIT_LIST_HEAD(&messages);
+	list_splice_init(&msq->q_messages, &messages);
+ unlock:
+	ipc_unlock(&msq->q_perm);
+
+	list_for_each_entry(msg, &messages, m_list) {
+		ret = checkpoint_msg_contents(ctx, msg);
+		if (ret < 0)
+			break;
+	}
+
+	/* put all the messages back in */
+	ipc_lock_by_ptr(&msq->q_perm);
+	list_splice(&messages, &msq->q_messages);
+	ipc_unlock(&msq->q_perm);
+
+	return ret;
+}
+
+int checkpoint_ipc_msg(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct msg_queue *msq;
+	int ret;
+
+	msq = container_of(perm, struct msg_queue, q_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_msg_hdr(ctx, h, msq);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->q_qnum)
+		ret = checkpoint_msg_queue(ctx, msq);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/util.h b/ipc/util.h
index 09b9bdf..ab83ca3 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -196,6 +196,8 @@ extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *hh,
 
 extern int checkpoint_ipc_shm(int id, void *p, void *data);
 extern int restore_ipc_shm(struct ckpt_ctx *ctx);
+
+extern int checkpoint_ipc_msg(int id, void *p, void *data);
 #endif
 
 #endif
-- 
1.5.4.3


* [RFC v14][PATCH 49/54] sysvipc-msq: restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (47 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 48/54] sysvipc-msq: checkpoint Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 50/54] sysvipc-sem: export interface from ipc/sem.c to cleanup ipc sem Oren Laadan
                     ` (6 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

The namespace is restored by creating each 'msq' object read from
the checkpoint image.

Messages of a specific queue are first read and chained together on
a temporary list, and once done are attached atomically as a whole
to the newly created message queue ('msq').
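
The "build on a temporary list, then attach as a whole" idea can be
sketched in userspace as follows. Everything here is hypothetical
('struct msg', 'struct queue', restore_queue(), free_chain()); the input
is an array of message types rather than a checkpoint image, and the only
validation is the same "type must be at least 1" rule that the restart
code applies to m_type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's msg_msg and msq. */
struct msg {
	struct msg *next;
	long type;
};

struct queue {
	struct msg *head;
	unsigned long qnum;
};

static void free_chain(struct msg *m)
{
	while (m) {
		struct msg *next = m->next;

		free(m);
		m = next;
	}
}

/*
 * Build a temporary chain from types[0..n-1]. On success the chain is
 * attached to 'q' in a single step and 0 is returned. On any failure the
 * partial chain is freed and 'q' is left untouched.
 */
static int restore_queue(struct queue *q, const long *types, unsigned long n)
{
	struct msg *head = NULL, **ptail = &head;
	unsigned long i;

	assert(q != NULL);
	for (i = 0; i < n; i++) {
		struct msg *m;

		if (types[i] < 1)	/* invalid record in the "image" */
			goto fail;
		m = malloc(sizeof(*m));
		if (!m)
			goto fail;
		m->next = NULL;
		m->type = types[i];
		*ptail = m;
		ptail = &m->next;
	}

	/* nothing failed: publish the whole chain at once */
	q->head = head;
	q->qnum = n;
	return 0;
 fail:
	free_chain(head);
	return -1;
}
```

The point of the temporary list is that a corrupt or truncated image
never leaves a half-populated queue behind: either every message is
attached, or none is.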

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 ipc/checkpoint.c     |    4 +-
 ipc/checkpoint_msg.c |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++
 ipc/util.h           |    1 +
 3 files changed, 199 insertions(+), 2 deletions(-)

diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 5f63300..dfd3286 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -211,11 +211,11 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx)
 
 	ret = restore_ipc_any(ctx, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
-	ret = ckpt_read_ipc_any(ctx, IPC_MSG_IDS,
+	ret = restore_ipc_any(ctx, IPC_MSG_IDS,
 			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
index f0b0921..a4099e5 100644
--- a/ipc/checkpoint_msg.c
+++ b/ipc/checkpoint_msg.c
@@ -162,3 +162,199 @@ int checkpoint_ipc_msg(int id, void *p, void *data)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("msq: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	if (h->q_lspid < 0 || h->q_lrpid < 0)
+		return -EINVAL;
+
+	msq->q_stime = h->q_stime;
+	msq->q_rtime = h->q_rtime;
+	msq->q_ctime = h->q_ctime;
+	msq->q_lspid = h->q_lspid;
+	msq->q_lrpid = h->q_lrpid;
+
+	return 0;
+}
+
+static struct msg_msg *restore_msg_contents_one(struct ckpt_ctx *ctx, int *clen)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msg *msg = NULL;
+	struct msg_msgseg *seg, **pseg;
+	int total, len;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (IS_ERR(h))
+		return (struct msg_msg *) h;
+
+	ret = -EINVAL;
+	if (h->m_type < 1)
+		goto out;
+	if (h->m_ts > current->nsproxy->ipc_ns->msg_ctlmax)
+		goto out;
+
+	total = h->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	msg = kmalloc(sizeof(*msg) + len, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	msg->next = NULL;
+	pseg = &msg->next;
+
+	ret = _ckpt_read_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		goto out;
+
+	total -= len;
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		seg = kmalloc(sizeof(*seg) + len, GFP_KERNEL);
+		if (!seg) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		seg->next = NULL;
+		*pseg = seg;
+		pseg = &seg->next;
+
+		ret = _ckpt_read_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			goto out;
+		total -= len;
+	}
+
+	msg->m_type = h->m_type;
+	msg->m_ts = h->m_ts;
+	*clen = h->m_ts;
+ out:
+	if (ret < 0 && msg) {
+		free_msg(msg);
+		msg = ERR_PTR(ret);
+	}
+	ckpt_hdr_put(ctx, h);
+	return msg;
+}
+
+static inline void free_msg_list(struct list_head *queue)
+{
+	struct msg_msg *msg, *tmp;
+
+	list_for_each_entry_safe(msg, tmp, queue, m_list)
+		free_msg(msg);
+}
+
+static int restore_msg_contents(struct ckpt_ctx *ctx, struct list_head *queue,
+				unsigned long qnum, unsigned long *cbytes)
+{
+	struct msg_msg *msg;
+	int clen = 0;
+	int ret = 0;
+
+	INIT_LIST_HEAD(queue);
+
+	*cbytes = 0;
+	while (qnum--) {
+		msg = restore_msg_contents_one(ctx, &clen);
+		if (IS_ERR(msg))
+			goto fail;
+		list_add_tail(&msg->m_list, queue);
+		*cbytes += clen;
+	}
+	return 0;
+ fail:
+	ret = PTR_ERR(msg);
+	free_msg_list(queue);
+	return ret;
+}
+
+int restore_ipc_msg(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct kern_ipc_perm *perms;
+	struct msg_queue *msq;
+	struct ipc_ids *msg_ids = &current->nsproxy->ipc_ns->ids[IPC_MSG_IDS];
+	struct list_head messages;
+	unsigned long cbytes;
+	int msgflag;
+	int ret;
+
+	INIT_LIST_HEAD(&messages);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+	/* read queued messages into temporary queue */
+	ret = restore_msg_contents(ctx, &messages, h->q_qnum, &cbytes);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (h->q_cbytes != cbytes)
+		goto out;
+
+	/* restore the message queue */
+	msgflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("msg: do_msgget key %d flag %#x id %d\n",
+		 h->perms.key, msgflag, h->perms.id);
+	ret = do_msgget(h->perms.key, msgflag, h->perms.id);
+	ckpt_debug("msg: do_msgget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&msg_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(msg_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));	/* ipc_ns is private to us */
+
+	msq = container_of(perms, struct msg_queue, q_perm);
+	BUG_ON(!list_empty(&msq->q_messages));	/* ipc_ns is private to us */
+
+	/* attach queued messages we read before */
+	list_splice_init(&messages, &msq->q_messages);
+
+	/* adjust msq and namespace statistics */
+	atomic_add(h->q_cbytes, &current->nsproxy->ipc_ns->msg_bytes);
+	atomic_add(h->q_qnum, &current->nsproxy->ipc_ns->msg_hdrs);
+	msq->q_cbytes = h->q_cbytes;
+	msq->q_qbytes = h->q_qbytes;
+	msq->q_qnum = h->q_qnum;
+
+	ret = load_ipc_msg_hdr(ctx, h, msq);
+	ipc_unlock(perms);
+
+	if (ret < 0) {
+		ckpt_debug("msq: need to remove (%d)\n", ret);
+		freeque(current->nsproxy->ipc_ns, perms);
+	}
+	up_write(&msg_ids->rw_mutex);
+ out:
+	free_msg_list(&messages);  /* no-op if all ok, else cleanup msgs */
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/util.h b/ipc/util.h
index ab83ca3..1e464fd 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -198,6 +198,7 @@ extern int checkpoint_ipc_shm(int id, void *p, void *data);
 extern int restore_ipc_shm(struct ckpt_ctx *ctx);
 
 extern int checkpoint_ipc_msg(int id, void *p, void *data);
+extern int restore_ipc_msg(struct ckpt_ctx *ctx);
 #endif
 
 #endif
-- 
1.5.4.3


* [RFC v14][PATCH 50/54] sysvipc-sem: export interface from ipc/sem.c to cleanup ipc sem
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (48 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 49/54] sysvipc-msq: restart Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 51/54] sysvipc-sem: checkpoint Oren Laadan
                     ` (5 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Export freeary(), which will be used in the next patch during restart
to clean up an ipc sem.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 ipc/sem.c  |    3 +--
 ipc/util.h |    1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 207dbbb..c60076e 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -93,7 +93,6 @@
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
 static int newary(struct ipc_namespace *, struct ipc_params *, int);
-static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -521,7 +520,7 @@ static void free_un(struct rcu_head *head)
  * as a writer and the spinlock for this semaphore set hold. sem_ids.rw_mutex
  * remains locked on exit.
  */
-static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct sem_undo *un, *tu;
 	struct sem_queue *q, *tq;
diff --git a/ipc/util.h b/ipc/util.h
index 1e464fd..a4016e7 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -185,6 +185,7 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
 extern int do_msgget(key_t key, int msgflg, int req_id);
 extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+extern void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
-- 
1.5.4.3


* [RFC v14][PATCH 51/54] sysvipc-sem: checkpoint
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (49 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 50/54] sysvipc-sem: export interface from ipc/sem.c to cleanup ipc sem Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 52/54] sysvipc-sem: restart Oren Laadan
                     ` (4 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.

The semaphore array (sem->sem_base) holds an array of 'struct sem',
which is a {int, int}. Because this translates into the same format
on 32- and 64-bit architectures, the checkpoint format is simply the
dump of this array as is.
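
A minimal userspace sketch of the "dump the array as is" idea follows.
'struct ex_sem' is a hypothetical mirror of the {int, int} layout
described above, and dump_sem_array() is an illustrative helper, not the
function added by this patch. The sketch assumes a 32-bit 'int' and does
not address byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the {int, int} layout described above. */
struct ex_sem {
	int semval;
	int sempid;
};

/* Compile-time check: the in-memory entry is exactly two 32-bit fields. */
typedef char ex_sem_layout_check[
	(sizeof(struct ex_sem) == 2 * sizeof(int32_t)) ? 1 : -1];

/*
 * Copy 'nsems' entries verbatim into 'out'. Returns the number of bytes
 * written, or 0 if 'out' (of 'outlen' bytes) is too small.
 */
static size_t dump_sem_array(const struct ex_sem *base, size_t nsems,
			     void *out, size_t outlen)
{
	size_t bytes = nsems * sizeof(*base);

	assert(base != NULL && out != NULL);
	if (bytes > outlen)
		return 0;
	memcpy(out, base, bytes);
	return bytes;
}
```

Because every entry is two 32-bit values regardless of word size, the
image of an array with N semaphores is always 8 * N bytes under these
assumptions, so no per-field conversion is needed on the way out.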

TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/checkpoint_hdr.h |    8 +++
 ipc/Makefile                   |    3 +-
 ipc/checkpoint.c               |    2 -
 ipc/checkpoint_sem.c           |  101 ++++++++++++++++++++++++++++++++++++++++
 ipc/util.h                     |    2 +
 5 files changed, 113 insertions(+), 3 deletions(-)
 create mode 100644 ipc/checkpoint_sem.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index bf34b08..0e15f3f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -356,6 +356,14 @@ struct ckpt_hdr_ipc_msg_msg {
 	__u32 m_ts;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_sem {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 sem_otime;
+	__u64 sem_ctime;
+	__u32 sem_nsems;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index ca408ff..81af168 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,5 +9,6 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o checkpoint_msg.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o \
+			checkpoint_shm.o checkpoint_msg.o checkpoint_sem.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index dfd3286..7a2f4a5 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -103,12 +103,10 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
 				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
-#endif
 	return ret;
 }
 
diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c
new file mode 100644
index 0000000..fc6ea44
--- /dev/null
+++ b/ipc/checkpoint_sem.c
@@ -0,0 +1,101 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc sem
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/sem.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			       struct ckpt_hdr_ipc_sem *h,
+			       struct sem_array *sem)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&sem->sem_perm);
+
+	checkpoint_fill_ipc_perms(&h->perms, &sem->sem_perm);
+
+	h->sem_otime = sem->sem_otime;
+	h->sem_ctime = sem->sem_ctime;
+	h->sem_nsems = sem->sem_nsems;
+
+	ipc_unlock(&sem->sem_perm);
+
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+	return ret;
+}
+
+/**
+ * checkpoint_sem_array - dump the state of a semaphore array
+ * @ctx: checkpoint context
+ * @sem: semaphore array
+ *
+ * The state of a semaphore is an array of 'struct sem'. This structure
+ * is {int, int}, which translates to the same format {32 bits, 32 bits}
+ * on both 32- and 64-bit architectures. So we simply dump the array.
+ *
+ * The sem-undo information is not saved per ipc_ns, but rather per task.
+ */
+static int checkpoint_sem_array(struct ckpt_ctx *ctx, struct sem_array *sem)
+{
+	/* this is a "best-effort" test, so lock not needed */
+	if (!list_empty(&sem->sem_pending))
+		return -EBUSY;
+
+	/* our caller holds the mutex, so this is safe */
+	return ckpt_write_buffer(ctx, sem->sem_base,
+			       sem->sem_nsems * sizeof(*sem->sem_base));
+}
+
+int checkpoint_ipc_sem(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_sem *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct sem_array *sem;
+	int ret;
+
+	sem = container_of(perm, struct sem_array, sem_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_sem_hdr(ctx, h, sem);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->sem_nsems)
+		ret = checkpoint_sem_array(ctx, sem);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/util.h b/ipc/util.h
index a4016e7..5b7cead 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -200,6 +200,8 @@ extern int restore_ipc_shm(struct ckpt_ctx *ctx);
 
 extern int checkpoint_ipc_msg(int id, void *p, void *data);
 extern int restore_ipc_msg(struct ckpt_ctx *ctx);
+
+extern int checkpoint_ipc_sem(int id, void *p, void *data);
 #endif
 
 #endif
-- 
1.5.4.3


* [RFC v14][PATCH 52/54] sysvipc-sem: restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (50 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 51/54] sysvipc-sem: checkpoint Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-28 23:24   ` [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint Oren Laadan
                     ` (3 subsequent siblings)
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

The semaphores are restored by creating each 'sem' object read from the
checkpoint image. Each semaphore array (sem->sem_base) is checked for
validity of contents before it is copied to the corresponding semaphore.
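
The validate-then-copy step can be sketched in userspace like this.
'struct ex_sem' and both helpers are hypothetical; the validity rule
(no negative semval or sempid) is the one applied by the restart code in
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the {int, int} semaphore entry. */
struct ex_sem {
	int semval;
	int sempid;
};

/* Returns 0 if every entry is acceptable, -1 otherwise. */
static int validate_sem_array(const struct ex_sem *sma, size_t nsems)
{
	size_t i;

	for (i = 0; i < nsems; i++) {
		if (sma[i].semval < 0 || sma[i].sempid < 0)
			return -1;
	}
	return 0;
}

/*
 * Copy 'src' into 'dst' only if the whole array validates.
 * 'dst' is left untouched on failure.
 */
static int restore_sem_array_checked(struct ex_sem *dst,
				     const struct ex_sem *src, size_t nsems)
{
	assert(dst != NULL && src != NULL);
	if (validate_sem_array(src, nsems) < 0)
		return -1;
	memcpy(dst, src, nsems * sizeof(*src));
	return 0;
}
```

Checking the whole array before the copy means that data from a bad
image never reaches the live semaphore, not even partially.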

TODO: this patch does not handle semaphore-undo -- this data should be
restored per-task while iterating through the tasks.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 ipc/checkpoint.c     |    2 -
 ipc/checkpoint_sem.c |  116 ++++++++++++++++++++++++++++++++++++++++++++++++++
 ipc/util.h           |    3 +
 3 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 7a2f4a5..e14dea6 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -213,12 +213,10 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_ipc_any(ctx, IPC_MSG_IDS,
 			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, IPC_SEM_IDS,
 			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
-#endif
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c
index fc6ea44..0065202 100644
--- a/ipc/checkpoint_sem.c
+++ b/ipc/checkpoint_sem.c
@@ -99,3 +99,119 @@ int checkpoint_ipc_sem(int id, void *p, void *data)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			       struct ckpt_hdr_ipc_sem *h,
+			       struct sem_array *sem)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &sem->sem_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+	sem->sem_otime = h->sem_otime;
+	sem->sem_ctime = h->sem_ctime;
+	sem->sem_nsems = h->sem_nsems;
+
+	return 0;
+}
+
+/**
+ * restore_sem_array - read the state of a semaphore array
+ * @ctx: checkpoint context
+ * @nsems: number of semaphores in the array
+ *
+ * Expect the data in an array of 'struct sem': {32 bit, 32 bit}.
+ * See comment in checkpoint_sem_array().
+ *
+ * The sem-undo information is not restored per ipc_ns, but rather per task.
+ */
+static struct sem *restore_sem_array(struct ckpt_ctx *ctx, int nsems)
+{
+	struct sem *sma;
+	int i, ret;
+
+	sma = kmalloc(nsems * sizeof(*sma), GFP_KERNEL);
+	ret = sma ? _ckpt_read_buffer(ctx, sma, nsems * sizeof(*sma)) : -ENOMEM;
+	if (ret < 0)
+		goto out;
+
+	/* validate sem array contents */
+	for (i = 0; i < nsems; i++) {
+		if (sma[i].semval < 0 || sma[i].sempid < 0) {
+			ret = -EINVAL;
+			break;
+		}
+	}
+ out:
+	if (ret < 0) {
+		kfree(sma);
+		sma = ERR_PTR(ret);
+	}
+	return sma;
+}
+
+int restore_ipc_sem(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_ipc_sem *h;
+	struct kern_ipc_perm *perms;
+	struct sem_array *sem;
+	struct sem *sma = NULL;
+	struct ipc_ids *sem_ids = &current->nsproxy->ipc_ns->ids[IPC_SEM_IDS];
+	int semflag, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+	if (h->sem_nsems < 0)
+		goto out;
+
+	/* read semaphore array state */
+	sma = restore_sem_array(ctx, h->sem_nsems);
+	if (IS_ERR(sma)) {
+		ret = PTR_ERR(sma);
+		goto out;
+	}
+
+	/* restore the semaphore array now */
+	semflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("sem: do_semget key %d flag %#x id %d\n",
+		 h->perms.key, semflag, h->perms.id);
+	ret = do_semget(h->perms.key, h->sem_nsems, semflag, h->perms.id);
+	ckpt_debug("sem: do_semget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&sem_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(sem_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	sem = container_of(perms, struct sem_array, sem_perm);
+	memcpy(sem->sem_base, sma, sem->sem_nsems * sizeof(*sma));
+
+	ret = load_ipc_sem_hdr(ctx, h, sem);
+	ipc_unlock(perms);
+
+	if (ret < 0) {
+		ckpt_debug("sem: need to remove (%d)\n", ret);
+		freeary(current->nsproxy->ipc_ns, perms);
+	}
+	up_write(&sem_ids->rw_mutex);
+ out:
+	kfree(sma);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/util.h b/ipc/util.h
index 5b7cead..54f9acb 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -184,6 +184,8 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 /* for checkpoint/restart */
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
 extern int do_msgget(key_t key, int msgflg, int req_id);
+extern int do_semget(key_t key, int nsems, int semflg, int req_id);
+
 extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 extern void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 
@@ -202,6 +204,7 @@ extern int checkpoint_ipc_msg(int id, void *p, void *data);
 extern int restore_ipc_msg(struct ckpt_ctx *ctx);
 
 extern int checkpoint_ipc_sem(int id, void *p, void *data);
+extern int restore_ipc_sem(struct ckpt_ctx *ctx);
 #endif
 
 #endif
-- 
1.5.4.3

* [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (51 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 52/54] sysvipc-sem: restart Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
       [not found]     ` <1240961064-13991-54-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-28 23:24   ` [RFC v14][PATCH 54/54] Report failures during checkpoint as an object in the output stream Oren Laadan
                     ` (2 subsequent siblings)
  55 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed).  Of course, by this time most of the checkpoint
image has been written out to disk, so this is purely advisory.  But
then, it's probably naive to argue that anything more than an advisory
'this went wrong' error code is useful.

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.
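
The comparison described above can be illustrated with a small userspace
sketch.  It is only an approximation under stated assumptions: 'struct
tracked_obj' and first_leak() are hypothetical names that stand in for an
objhash entry and for the loop in ckpt_obj_contained(); none of this is
the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an objhash entry (not struct ckpt_obj). */
struct tracked_obj {
	const char *name;
	int users;	/* references accounted for while checkpointing */
	int refcount;	/* reference count reported for the object */
};

/*
 * Return the index of the first object whose reported reference count
 * differs from the references accounted for during checkpoint, or -1
 * if the counts match for every object (nothing leaked).
 */
static int first_leak(const struct tracked_obj *objs, size_t n)
{
	size_t i;

	assert(objs != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (objs[i].refcount != objs[i].users)
			return (int) i;
	return -1;
}
```

An object referenced by a task outside the checkpointed set shows up as
a reported count that the checkpoint pass never accounted for, which is
the condition the sketch flags.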

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c    |    8 +++
 checkpoint/memory.c        |    2 +
 checkpoint/objhash.c       |  108 +++++++++++++++++++++++++++++++++++++++----
 include/linux/checkpoint.h |    2 +
 4 files changed, 110 insertions(+), 10 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4319976..32a0a8e 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -498,6 +498,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
+		/* verify that all objects are contained (no leaks) */
+		if (!ckpt_obj_contained(ctx)) {
+			ret = -EBUSY;
+			goto out;
+		}
+	}
+
 	/* on success, return (unique) checkpoint identifier */
 	ctx->crid = atomic_inc_return(&ctx_count);
 	ret = ctx->crid;
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 7637c1e..5ae2b41 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
 			ret = exe_objref;
 			goto out;
 		}
+		/* account for all references through vma/exe_file */
+		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
 	}
 
 	h->exefile_objref = exe_objref;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index bdc719e..11522b2 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -28,19 +28,23 @@ struct ckpt_obj_ops {
 	enum obj_type obj_type;
 	void (*ref_drop)(void *ptr);
 	int (*ref_grab)(void *ptr);
+	int (*ref_users)(void *ptr);
 	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
 	void *(*restore)(struct ckpt_ctx *ctx);
 };
 
 struct ckpt_obj {
+	int users;
 	int objref;
 	void *ptr;
 	struct ckpt_obj_ops *ops;
 	struct hlist_node hash;
+	struct hlist_node next;
 };
 
 struct ckpt_obj_hash {
 	struct hlist_head *head;
+	struct hlist_head list;
 	int next_free_objref;
 };
 
@@ -56,13 +60,13 @@ void *restore_bad(struct ckpt_ctx *ctx)
 
 /*
  * helper grab/drop functions:
- *   obj_no_{drop,grab}: for objects ignored/skipped
- *   obj_file_{drop,grab}: for file objects
- *   obj_inode_{drop,grab}: for inode objects
- *   obj_mm_{drop,grab}: for mm_struct objects
- *   obj_ns_{drop,grab}: for nsproxy objects
- *   obj_uts_ns_{drop,grab}: for uts_namespace objects
- *   obj_ipc_ns_{drop,grab}: for ipc_namespace objects
+ *   obj_no_{drop,grab,users}: for objects ignored/skipped
+ *   obj_file_{drop,grab,users}: for file objects
+ *   obj_inode_{drop,grab,users}: for inode objects
+ *   obj_mm_{drop,grab,users}: for mm_struct objects
+ *   obj_ns_{drop,grab,users}: for nsproxy objects
+ *   obj_uts_ns_{drop,grab,users}: for uts_namespace objects
+ *   obj_ipc_ns_{drop,grab,users}: for ipc_namespace objects
  */
 
 static void obj_no_drop(void *ptr)
@@ -75,7 +79,12 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
-/* helper drop/grab functions */
+
+/*
+ * helper drop/grab functions
+ */
+
+/* file object */
 static int obj_file_grab(void *ptr)
 {
 	get_file((struct file *) ptr);
@@ -87,6 +96,12 @@ static void obj_file_drop(void *ptr)
 	fput((struct file *) ptr);
 }
 
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
+/* inode object */
 static int obj_inode_grab(void *ptr)
 {
 	return (igrab((struct inode *) ptr) ? 0 : -EBADF);
@@ -97,6 +112,7 @@ static void obj_inode_drop(void *ptr)
 	iput((struct inode *) ptr);
 }
 
+/* mm_struct object */
 static int obj_mm_grab(void *ptr)
 {
 	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
@@ -108,6 +124,12 @@ static void obj_mm_drop(void *ptr)
 	mmput((struct mm_struct *) ptr);
 }
 
+static int obj_mm_users(void *ptr)
+{
+	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
+/* nsproxy object */
 static int obj_ns_grab(void *ptr)
 {
 	get_nsproxy((struct nsproxy *) ptr);
@@ -119,6 +141,7 @@ static void obj_ns_drop(void *ptr)
 	put_nsproxy((struct nsproxy *) ptr);
 }
 
+/* uts_namespace object */
 static int obj_uts_ns_grab(void *ptr)
 {
 	get_uts_ns((struct uts_namespace *) ptr);
@@ -130,6 +153,12 @@ static void obj_uts_ns_drop(void *ptr)
 	put_uts_ns((struct uts_namespace *) ptr);
 }
 
+static int obj_uts_ns_users(void *ptr)
+{
+	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
+}
+
+/* ipc_namespace object */
 static int obj_ipc_ns_grab(void *ptr)
 {
 	get_ipc_ns((struct ipc_namespace *) ptr);
@@ -141,6 +170,11 @@ static void obj_ipc_ns_drop(void *ptr)
 	put_ipc_ns((struct ipc_namespace *) ptr);
 }
 
+static int obj_ipc_ns_users(void *ptr)
+{
+	return atomic_read(&((struct ipc_namespace *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -155,6 +189,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_FILE,
 		.ref_drop = obj_file_drop,
 		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
 		.checkpoint = checkpoint_file,
 		.restore = restore_file,
 	},
@@ -173,6 +208,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_MM,
 		.ref_drop = obj_mm_drop,
 		.ref_grab = obj_mm_grab,
+		.ref_users = obj_mm_users,
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
@@ -191,6 +227,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_UTS_NS,
 		.ref_drop = obj_uts_ns_drop,
 		.ref_grab = obj_uts_ns_grab,
+		.ref_users = obj_uts_ns_users,
 		.checkpoint = checkpoint_bad,
 		.restore = restore_bad,
 	},
@@ -200,6 +237,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_IPC_NS,
 		.ref_drop = obj_ipc_ns_drop,
 		.ref_grab = obj_ipc_ns_grab,
+		.ref_users = obj_ipc_ns_users,
 		.checkpoint = checkpoint_bad,
 		.restore = restore_bad,
 	},
@@ -252,6 +290,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
 
 	obj_hash->head = head;
 	obj_hash->next_free_objref = 1;
+	INIT_HLIST_HEAD(&obj_hash->list);
 
 	ctx->obj_hash = obj_hash;
 	return 0;
@@ -310,6 +349,7 @@ static int obj_new(struct ckpt_ctx *ctx, void *ptr, int objref,
 
 	obj->ptr = ptr;
 	obj->ops = ops;
+	obj->users = 2;  /* extra reference that objhash itself takes */
 
 	if (objref) {
 		/* use @obj->objref to index (restart) */
@@ -322,10 +362,12 @@ static int obj_new(struct ckpt_ctx *ctx, void *ptr, int objref,
 	}
 
 	ret = ops->ref_grab(obj->ptr);
-	if (ret < 0)
+	if (ret < 0) {
 		kfree(obj);
-	else
+	} else {
 		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+		hlist_add_head(&obj->next, &ctx->obj_hash->list);
+	}
 
 	return (ret < 0 ? : obj->objref);
 }
@@ -363,6 +405,7 @@ int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 		return -EINVAL;
 	} else {
 		objref = obj->objref;
+		obj->users++;
 		*first = 0;
 	}
 
@@ -370,6 +413,50 @@ int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 	return objref;
 }
 
+/* increment the 'users' count of an object */
+void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (obj)
+		obj->users += increment;
+}
+
+/**
+ * ckpt_obj_contained - test if shared objects are "contained" in checkpoint
+ * @ctx: checkpoint
+ *
+ * Loops through all objects in the table and compares the number of
+ * references accumulated during checkpoint, with the reference count
+ * reported by the kernel.
+ *
+ * Return 1 if respective counts match for all objects, 0 otherwise.
+ */
+int ckpt_obj_contained(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj *obj;
+	struct hlist_node *node;
+
+	/* account for ctx->file reference (if in the table already) */
+	ckpt_obj_users_inc(ctx, ctx->file, 1);
+
+	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+		if (!obj->ops->ref_users)
+			continue;
+		if (obj->ops->ref_users(obj->ptr) != obj->users) {
+			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
+			printk(KERN_NOTICE "c/r: %s users %d != count %d\n",
+			       obj->ops->obj_name,
+			       obj->ops->ref_users(obj->ptr),
+			       obj->users);
+			return 0;
+		}
+	}
+
+	return 1;
+}
+
 /**
  * checkpoint_obj - if not already in hash, add object and checkpoint
  * @ctx: checkpoint context
@@ -399,6 +486,7 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 	obj = obj_find_by_ptr(ctx, ptr);
 	if (obj) {
 		BUG_ON(obj->ops->obj_type != type);
+		obj->users++;
 		return obj->objref;
 	}
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 65838b4..2a09244 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -48,10 +48,12 @@ extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
 
 extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
 extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type);
+extern int ckpt_obj_contained(struct ckpt_ctx *ctx);
 extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
 			    enum obj_type type);
 extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 			       enum obj_type type, int *first);
+extern void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment);
 extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
 			   enum obj_type type);
 
-- 
1.5.4.3

* [RFC v14][PATCH 54/54] Report failures during checkpoint as an object in the output stream
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (52 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint Oren Laadan
@ 2009-04-28 23:24   ` Oren Laadan
  2009-04-29  8:18   ` [RFC v14][PATCH 00/54] Kernel based checkpoint/restart Louis Rilling
  2009-05-04 19:13   ` Oren Laadan
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-28 23:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Alexey Dobriyan, Dave Hansen

One way to report why a checkpoint failed is to write the reason as a
regular record to the output stream.

Specifically, if an error is detected, then we write a special 'struct
ckpt_hdr_error' record to the output file, followed by a string that
describes the details of why the checkpoint failed.
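
The string for such a record has to be formatted before its length is
known.  A common way to do that is a two-pass vsnprintf(): try a small
stack buffer first and fall back to a heap buffer of the exact size.
The following is a minimal userspace sketch of that pattern only;
format_err() and STACK_BUF are hypothetical names, and malloc() stands
in for the kernel allocator.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STACK_BUF 128	/* assumed size of the on-stack scratch buffer */

/*
 * Format an error string and return it in a malloc'ed buffer.
 * Returns NULL on formatting or allocation failure.  The caller
 * frees the result.
 */
static char *format_err(const char *fmt, ...)
{
	char small[STACK_BUF];
	va_list args;
	char *out;
	int len;

	assert(fmt != NULL);

	/* first pass: learn the full length, keep the text if it fits */
	va_start(args, fmt);
	len = vsnprintf(small, sizeof(small), fmt, args);
	va_end(args);
	if (len < 0)
		return NULL;

	out = malloc((size_t) len + 1);
	if (!out)
		return NULL;

	if ((size_t) len < sizeof(small)) {
		memcpy(out, small, (size_t) len + 1);
	} else {
		/* first pass was truncated: format again with enough room */
		va_start(args, fmt);
		vsnprintf(out, (size_t) len + 1, fmt, args);
		va_end(args);
	}
	return out;
}
```

The va_list is restarted for the second pass because it cannot be
reused after vsnprintf() has consumed it.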

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c        |   55 ++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |   10 ++++++-
 2 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 32a0a8e..7f5c18c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -95,6 +95,50 @@ int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
 	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
 }
 
+/**
+ * ckpt_write_err - write an object describing an error
+ * @ctx: checkpoint context
+ * @fmt: error string format
+ * @...: error string arguments
+ */
+int ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...)
+{
+	va_list args;
+	char str[128];
+	char *ptr = NULL;
+	int len, ret;
+
+	ret = ckpt_write_obj_type(ctx, NULL, sizeof(struct ckpt_hdr),
+				  CKPT_HDR_ERROR);
+	if (ret < 0)
+		return ret;
+
+	va_start(args, fmt);
+	len = vsnprintf(str, 128, fmt, args) + 1;
+	va_end(args);
+
+	if (len > 128) {
+		/* doesn't fit on stack, allocate memory */
+		ptr = kmalloc(len + 1, GFP_KERNEL);
+		/* if malloc failed, fallback to truncated string */
+		if (ptr) {
+			va_start(args, fmt);
+			len = vsnprintf(ptr, len, fmt, args) + 1;
+			va_end(args);
+		} else {
+			len = 128;
+			printk(KERN_NOTICE "c/r: error message truncated\n");
+		}
+	}
+
+	ckpt_debug("c/r: checkpoint error: %s\n", ptr ? : str);
+	ret = ckpt_write_string(ctx, ptr ? : str, len);
+
+	kfree(ptr);
+	return ret;
+}
+
+
 /***********************************************************************
  * Checkpoint
  */
@@ -198,8 +242,11 @@ static int may_checkpoint_task(struct task_struct *t, struct ckpt_ctx *ctx)
 		return -EPERM;
 
 	/* verify that the task is frozen (unless self) */
-	if (t != current && !frozen(t))
+	if (t != current && !frozen(t)) {
+		ckpt_write_err(ctx, "task %d(%s) not frozen\n",
+			       task_pid_vnr(t), t->comm);
 		return -EBUSY;
+	}
 
 	/* FIX: add support for ptraced tasks */
 	if (task_ptrace(t))
@@ -436,8 +483,11 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 	ctx->root_init = is_container_init(task);
 
 	/* FIX: does this error code makes sense here ? */
-	if (!(ctx->flags & CHECKPOINT_SUBTREE) && !ctx->root_init)
+	if (!(ctx->flags & CHECKPOINT_SUBTREE) && !ctx->root_init) {
+		ckpt_write_err(ctx, "task %d(%s) not container init\n",
+			       task_pid_vnr(task), task->comm);
 		return -EBUSY;
+	}
 
 	return 0;
 
@@ -501,6 +551,7 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
 		/* verify that all objects are contained (no leaks) */
 		if (!ckpt_obj_contained(ctx)) {
+			ckpt_write_err(ctx, "container is not isolated\n");
 			ret = -EBUSY;
 			goto out;
 		}
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0e15f3f..058412c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -73,9 +73,17 @@ enum {
 	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
-	CKPT_HDR_TAIL = 5001
+	CKPT_HDR_TAIL = 5001,
+
+	CKPT_HDR_ERROR = 9999
 };
 
+/* error report */
+struct ckpt_hdr_error {
+	struct ckpt_hdr h;
+	/* followed by the error string */
+} __attribute__((aligned(8)));
+
 /* shared objrects (objref) */
 struct ckpt_hdr_objref {
 	struct ckpt_hdr h;
-- 
1.5.4.3

* Re: [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found]     ` <1240961064-13991-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29  0:58       ` Serge E. Hallyn
       [not found]         ` <20090429005826.GA23583-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-04-29 17:12       ` Serge E. Hallyn
  2009-05-06 20:39       ` Sukadev Bhattiprolu
  2 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29  0:58 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
...
> +static int checkpoint_write_header(struct ckpt_ctx *ctx)
> +{
> +	struct ckpt_hdr_header *h;
> +	struct new_utsname *uts;
> +	struct timeval ktv;
> +	int ret;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);

...
> +	struct ckpt_hdr_tail *h;
> +	int ret;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);

...
> +	struct ckpt_hdr_task *h;
> +	int ret;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);

...
> +/**
> + * ckpt_hdr_get_type - get a hdr of certain size
> + * @ctx: checkpoint context
> + * @len: number of bytes to reserve
> + *
> + * Returns pointer to reserved space on hbuf
> + */
> +void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
> +{

Observation (based on all callers in later patches as well): the second
argument (len) appears to be superfluous.  You should be able to determine
the length from the type.

(The callers would look much friendlier without the 2nd arg imo)
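
To illustrate the idea, here is a rough userspace sketch in which the
size is looked up from the type.  Everything in it is hypothetical (the
rec_* names, the record layouts, the use of calloc()); it only shows the
shape such a helper could take, and records with a variable-length
payload would still need an explicit length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record types and layouts, for illustration only. */
enum rec_type { REC_HEADER, REC_TAIL, REC_TASK, REC_MAX };

struct rec_hdr { int type; int len; };
struct rec_header { struct rec_hdr h; long magic; };
struct rec_tail { struct rec_hdr h; long magic; };
struct rec_task { struct rec_hdr h; int state; int exit_code; };

/* one size per type, so callers no longer pass sizeof(*h) */
static const size_t rec_size[REC_MAX] = {
	[REC_HEADER] = sizeof(struct rec_header),
	[REC_TAIL]   = sizeof(struct rec_tail),
	[REC_TASK]   = sizeof(struct rec_task),
};

/* Allocate a zeroed record of the given type and fill in its header. */
static void *rec_get_type(enum rec_type type)
{
	struct rec_hdr *h;

	assert((unsigned) type < (unsigned) REC_MAX);

	h = calloc(1, rec_size[type]);
	if (!h)
		return NULL;
	h->type = type;
	h->len = (int) rec_size[type];
	return h;
}
```

A caller would then write something like
'struct rec_task *t = rec_get_type(REC_TASK);' with no size argument.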

-serge

* Re: [RFC v14][PATCH 10/54] Infrastructure for shared objects
       [not found]     ` <1240961064-13991-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29  1:03       ` Serge E. Hallyn
  2009-04-29 16:21       ` Serge E. Hallyn
  1 sibling, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29  1:03 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> +/**
> + * checkpoint_obj - if not already in hash, add object and checkpoint
> + * @ctx: checkpoint context
> + * @ptr: pointer to object
> + * @type: object type
> + *
> + * Look up the object pointed to by @ptr in the hash table. If it
> + * isn't already there, then add the object to the table, allocate a
> + * fresh unique id (objref) and save the object's state, and grab a
> + * reference to every object that is added. (Maintain the reference
> + * until the entire hash is free).
> + *
> + * [This is used during checkpoint].
> + *
> + * Returns: objref
> + */
> +int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
> +{
> +	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
> +	struct ckpt_hdr_objref *h;
> +	struct ckpt_obj *obj;
> +	int objref, ret;
> +
> +	/* make sure we don't change this accidentally */
> +	BUG_ON(ops->obj_type != type);
> +
> +	obj = obj_find_by_ptr(ctx, ptr);
> +	if (obj) {
> +		BUG_ON(obj->ops->obj_type != type);
> +		return obj->objref;
> +	}
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	objref = obj_new(ctx, ptr, 0, ops);
> +	if (objref < 0)

	ckpt_hdr_put(ctx, h);    ?

> +		return objref;
> +
> +	h->objtype = type;
> +	h->objref = objref;
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	ckpt_hdr_put(ctx, h);
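
For illustration, the usual way to keep get/put balanced on every path
is a single exit label.  This is a userspace sketch with made-up helpers
(buf_get(), buf_put(), register_obj(), write_ref()), not a proposed
replacement for the function quoted above:

```c
#include <assert.h>
#include <stdlib.h>

static int outstanding;	/* buffers obtained but not yet released */

/* hypothetical get/put pair that tracks outstanding buffers */
static void *buf_get(size_t len)
{
	void *p = malloc(len);

	if (p)
		outstanding++;
	return p;
}

static void buf_put(void *p)
{
	assert(p != NULL && outstanding > 0);
	outstanding--;
	free(p);
}

/* hypothetical step that can fail, like adding an object to a table */
static int register_obj(int fail)
{
	return fail ? -1 : 1;
}

static int write_ref(int fail)
{
	int *h;
	int ret;

	h = buf_get(sizeof(*h));
	if (!h)
		return -1;

	ret = register_obj(fail);
	if (ret < 0)
		goto out;	/* error path releases h as well */

	*h = ret;
	ret = 0;
 out:
	buf_put(h);
	return ret;
}
```

With the early return replaced by 'goto out', the counter is back to
zero after both the success and the failure case.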

-serge

* Re: [RFC v14][PATCH 08/54] Dump memory address space
       [not found]     ` <1240961064-13991-9-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29  4:11       ` Serge E. Hallyn
       [not found]         ` <20090429041128.GA28018-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-04-30  4:54       ` Matt Helsley
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29  4:11 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> +#if CONFIG_CHEKCPOINT
> +static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
> +				      struct vm_area_struct *vma)
> +{
> +	char *name;
> +
> +	/*
> +	 * Currently, we only handle VDSO/vsyscall special handling.
> +	 * Even that, is very basic - we just skip the contents and
> +	 * hope for the best in terms of compatilibity upon restart.
> +	 */
> +
> +	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
> +		return -ENOSYS;
> +
> +	name = arch_vma_name(vma);
> +	if (!name || strcmp(vma_name, "[vdso]"))

Not important except for bisect-safety, as it's fixed in the next
patch, but this should be name, not vma_name.

-serge

* Re: [RFC v14][PATCH 08/54] Dump memory address space
       [not found]         ` <20090429041128.GA28018-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-29  6:42           ` Guenter Roeck
       [not found]             ` <20090429064241.GA17482-gvzKVTG1yJJBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Guenter Roeck @ 2009-04-29  6:42 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

On Tue, Apr 28, 2009 at 09:11:28PM -0700, Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > +#if CONFIG_CHEKCPOINT
                ^^^^^^^^^^
CONFIG_CHECKPOINT ? 

Guenter

> > +static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
> > +				      struct vm_area_struct *vma)
> > +{
> > +	char *name;
> > +
> > +	/*
> > +	 * Currently, we only handle VDSO/vsyscall special handling.
> > +	 * Even that, is very basic - we just skip the contents and
> > +	 * hope for the best in terms of compatilibity upon restart.
> > +	 */
> > +
> > +	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
> > +		return -ENOSYS;
> > +
> > +	name = arch_vma_name(vma);
> > +	if (!name || strcmp(vma_name, "[vdso]"))
> 
> Not important except for bisect-safety, as it's fixed in the next
> patch, but this should be name, not vma_name.
> 
> -serge
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

* Re: [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found]     ` <1240961064-13991-32-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29  6:54       ` Nathan Lynch
       [not found]         ` <m34ow8ueyk.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Nathan Lynch @ 2009-04-29  6:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Hello Oren,

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:

> From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
>
> Support for checkpointing and restarting GPRs, FPU state, DABR, and
> Altivec state.

...

> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

...

> +/* dump the cpu state and registers of a given task */
> +int checkpoint_write_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
> +{
> +	struct ckpt_hdr_cpu *cpu_hdr;
> +	int rc;
> +
> +	rc = -ENOMEM;
> +	cpu_hdr = ckpt_hdr_get(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);

This won't build (should be ckpt_hdr_get_type?).

I didn't write this code (I used kzalloc).

In the code I did write, I deliberately preferred the slab allocator to
the checkpoint-specific APIs.  I do not see the advantage of using an
arbitrarily fixed size special allocation stack that is prone to
overflow or, worse, data corruption if someone improperly interleaves
their gets and puts.
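
To make the concern concrete, here is a small userspace sketch of a
fixed-size LIFO scratch buffer.  It is a hypothetical model (struct hbuf,
hbuf_get(), hbuf_put() and HBUF_SIZE are invented for this example), not
the patchset's allocator.  This version detects an out-of-order put; a
version that merely subtracted the length would release the wrong region
and later hand out memory that is still in use.

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_SIZE 256	/* assumed fixed capacity */

/* Hypothetical fixed-size LIFO scratch buffer. */
struct hbuf {
	unsigned char data[HBUF_SIZE];
	size_t pos;	/* offset of the first free byte */
};

/* Reserve len bytes on top of the stack; NULL if they do not fit. */
static void *hbuf_get(struct hbuf *b, size_t len)
{
	void *p;

	assert(b != NULL);
	if (len > HBUF_SIZE - b->pos)
		return NULL;
	p = b->data + b->pos;
	b->pos += len;
	return p;
}

/*
 * Release the most recent reservation.  Returns 0 on success, or -1 if
 * ptr is not the top of the stack (an out-of-order put), in which case
 * nothing is released.
 */
static int hbuf_put(struct hbuf *b, void *ptr, size_t len)
{
	assert(b != NULL);
	if (len > b->pos ||
	    b->data + (b->pos - len) != (unsigned char *) ptr)
		return -1;
	b->pos -= len;
	return 0;
}
```

The sketch ignores alignment, which a real allocator of this kind would
also have to handle.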

I don't believe you were acting in bad faith here, and I'm not sure
there's an established etiquette.  But my signed-off-by line is on this
patch, and I don't think it belongs there unless I've actually written
the code or agreed to the modifications.

If you insist on replacing kzalloc with ckpt_hdr_get, then please do so
in a separate commit with an explanation in the changelog.  I'd have no
objection to that -- it's your tree, after all.  Or if you want to munge
my patch in place, just replace my signoff with yours and note "based on
work by Nathan Lynch" or something.

Which brings me to the subject of tree management... it's rather
difficult for interested parties to follow development of a tree that is
frequently rewritten.  It would be much easier to base work on a linear
"append-only" branch.  The guesswork involved in tracking down
regressions in C/R function would be reduced because bisection would
work.  And we would have an accurate history of the changes made over
time.  The cost would be that the checkpoint/restart work would not have
an easily-reviewable native form, but I think it would be possible to
generate comprehensible diffs for review since the majority of the code
is in self-contained files.

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (53 preceding siblings ...)
  2009-04-28 23:24   ` [RFC v14][PATCH 54/54] Report failures during checkpoint as an object in the output stream Oren Laadan
@ 2009-04-29  8:18   ` Louis Rilling
       [not found]     ` <20090429081815.GA1813-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  2009-05-04 19:13   ` Oren Laadan
  55 siblings, 1 reply; 107+ messages in thread
From: Louis Rilling @ 2009-04-29  8:18 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen


Hi,

On 28/04/09 19:23 -0400, Oren Laadan wrote:
> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
> The logic and image format reworked and simplified, code refactored,
> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
> (uts and ipc).

I should have asked before, but what are the reasons to checkpoint SYSV IPCs
in the same file/stream as tasks? Would it be better to checkpoint them
independently, like the file system state?

In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
system state, because the lifetime of SYSV IPC objects does not depend on the
lifetime of tasks, and we gain more flexibility this way. In particular, we
envision cases in which two applications share state in a SYSV SHM segment
(something like a producer-consumer scheme) but do not need to be checkpointed
together. In such a case the SYSV SHM segment itself may even need higher
availability (using active replication) than a checkpoint/restart facility
provides.

Louis


> The userspace tool 'mktree' was extended to handle more complicated
> process tree and correctly account for process relationships and 
> session ID (sid). Should correctly handle threads.
> Hey, it even went through some massive renaming of files and functions...
> 
> Signals and timers are not supported yet, so programs that rely on
> their behavior may fail to operate correctly after a restart (e.g.
> may lose signals pending at time of checkpoint, and so on).
> 
> However, this one can actually be used for simple batch jobs (pipes,
> too), a whole container or just a subtree of tasks. Try it:
> 
> create the freezer cgroup:
>   $ mount -t cgroup -ofreezer freezer /freezer
>   $ mkdir /freezer/0
> 
> run the test, freeze it:  
>   $ test/multitask &
>   [1] 2754
>   $ for i in `pidof multitask`; do echo $i > /freezer/0/tasks; done
>   $ echo FROZEN > /freezer/0/freezer.state
> 
> checkpoint:
>   $ ./ckpt 2754 > ckpt.out
> 
> restart:
>   $ ./mktree < ckpt.out
> 
> voila :)
> 
> To do all this, you'll need:
> 
> The git tree tracking v14, branch 'ckpt-v14' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
> 
> Restarting multiple processes requires 'mktree' userspace tool with
> the matching branch (v14):
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
> 
> 
> Oren.
> 
> 
> Changelog:
> 
> [2009-Apr-28] v14
>   - Tested against kernel v2.6.30-rc3 on x86_32.
>   - Refactor files checkpoint to use f_ops (file operations)
>   - Refactor mm/vma to use vma_ops
>   - Explicitly handle VDSO vma (and require compat mode)
>   - Added code to c/r restart-blocks (restart timeout related syscalls)
>   - Added code to c/r namespaces: uts, ipc (with Dan Smith)
>   - Added code to c/r sysvipc (shm, msg, sem)
>   - Support for VM_CLONE shared memory
>   - Added resource leak detection for whole-container checkpoint
>   - Added sysctl gauge to allow unprivileged restart/checkpoint
>   - Improve and simplify the code and logic of shared objects
>   - Rework image format: shared objects appear prior to their use
>   - Merge checkpoint and restart functionality into same files
>   - Massive renaming of functions: prefix "ckpt_" for generics,
>     "checkpoint_" for checkpoint, and "restore_" for restart.
>   - Report checkpoint errors as a valid (string record) in the output
>   - Merged PPC architecture (by Nathan Lynch),
>   - Requires updates to userspace tools too.
>   - Misc nits and bug fixes
> 
> [2009-Mar-31] v14-rc2
>   - Change along Dave's suggestion to use f_ops->checkpoint() for files
>   - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
>   - Merge support for PPC arch (Nathan Lynch)
>   - Misc cleanups and fixes in response to comments
> 
> [2009-Mar-20] v14-rc1:
>   - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
>   - Check whether calls to cr_hbuf_get() succeed or fail.
>   - Fixes for pipe c/r code
>   - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
>   - Refuse non-self checkpoint if a task isn't frozen
>   - Use unsigned fields in checkpoint headers unless otherwise required
>   - Rename functions in files c/r to better reflect their role
>   - Add support for anonymous shared memory
>   - Merge support for s390 arch (Dan Smith, Serge Hallyn)
>     
> [2008-Dec-03] v13:
>   - Cleanups of 'struct cr_ctx' - remove unused fields
>   - Misc fixes for comments
>   
> [2008-Dec-17] v12:
>   - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
>     (empty pgarr are saved in a separate pool chain)
>   - Add a couple of missed calls to cr_hbuf_put()
>   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
>   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
>   - Befriend with sparse: explicit conversion to 'void __user *'
>   - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
> 
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
>  
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
> 
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
> 
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
> 
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
> 
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
> 
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename headers files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
> 
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
> 
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
> 
> [2008-Aug-20] v2:
>   - Added Dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
> 
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
> 
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
> 
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
> 
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
> 
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
> 
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
> 
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.
> --
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found]         ` <m34ow8ueyk.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2009-04-29 15:49           ` Serge E. Hallyn
  2009-04-29 18:05           ` Oren Laadan
  2009-04-29 18:18           ` Oren Laadan
  2 siblings, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29 15:49 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> Which brings me to the subject of tree management... it's rather
> difficult for interested parties to follow development of a tree that is
> frequently rewritten.  It would be much easier to base work on a linear
> "append-only" branch.  The guesswork involved in tracking down
> regressions in C/R function would be reduced because bisection would
> work.  And we would have an accurate history of the changes made over
> time.  The cost would be that the checkpoint/restart work would not have
> an easily-reviewable native form, but I think it would be possible to
> generate comprehensible diffs for review since the majority of the code
> is in self-contained files.

It would make this set much easier for us to review (and bisect).  Late
last night I decided re-reviewing the patches just isn't going to work
for me.  I guess I'll just review the checkpoint/*.c files individually,
and generate one big c/r diff for the rest of the kernel tree.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 10/54] Infrastructure for shared objects
       [not found]     ` <1240961064-13991-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-29  1:03       ` Serge E. Hallyn
@ 2009-04-29 16:21       ` Serge E. Hallyn
  1 sibling, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29 16:21 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> +/**
> + * ckpt_obj_new - add an object to the obj_hash
> + * @ctx: checkpoint context
> + * @ptr: pointer to object
> + * @objref: object unique id
> + * @ops: object operations
> + *
> + * Returns: objref
> + *
> + * Add the object to the obj_hash. If @objref is zero, assign a unique
> + * object id and use @ptr as a hash key [checkpoint]. Else use @objref
> + * as a key [restart].
> + */
> +static int obj_new(struct ckpt_ctx *ctx, void *ptr, int objref,
> +		   struct ckpt_obj_ops *ops)
> +{
> +	struct ckpt_obj *obj;
> +	int i, ret;
> +
> +	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
> +	if (!obj)
> +		return -ENOMEM;
> +
> +	obj->ptr = ptr;
> +	obj->ops = ops;
> +
> +	if (objref) {
> +		/* use @obj->objref to index (restart) */
> +		obj->objref = objref;
> +		i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS);
> +	} else {
> +		/* use @obj->ptr to index, assign objref (checkpoint) */
> +		obj->objref = ctx->obj_hash->next_free_objref++;;
> +		i = hash_long((unsigned long) ptr, CKPT_OBJ_HASH_NBITS);
> +	}
> +
> +	ret = ops->ref_grab(obj->ptr);
> +	if (ret < 0)
> +		kfree(obj);
> +	else
> +		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
> +
> +	return (ret < 0 ? : obj->objref);
> +}

...

> +/**
> +* ckpt_obj_insert - add an object with a given objref to obj_hash
> +* @ctx: checkpoint context
> +* @ptr: pointer to object
> +* @objref: unique object id
> +* @type: object type
> +*
> +* Add the object pointed to by @ptr and identified by unique object id
> +* @objref to the hash table (indexed by @objref).  Grab a reference to
> +* every object added, and maintain it until the entire hash is freed.
> +*/
> +
> +int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
> +		    enum obj_type type)
> +{
> +	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
> +
> +	ckpt_debug("%s objref %d\n", ops->obj_name, objref);
> +	return (obj_new(ctx, ptr, objref, ops) ? : 1);

This line doesn't make sense - obj_new can't return 0 ?

Also, the line isn't in this patch, but when you add the
obj_mm_* to objhash.c, the comment right above it claims
	/* inode object */

-serge
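
For readers unfamiliar with the construct under discussion: "x ? : y" is the
GNU C conditional with an omitted middle operand, which yields x when x is
nonzero and y otherwise. The sketch below is a userspace illustration only,
assuming a compiler with GNU extensions (gcc or clang); the function names are
hypothetical and merely mirror the shape of the two return statements quoted
above.

```c
#include <assert.h>

/* Same shape as "return (ret < 0 ? : obj->objref);" in obj_new().
 * The tested expression is the comparison itself, so on failure the
 * result is 1 (the value of "ret < 0"), not the negative error code. */
static int obj_new_as_posted(int ret, int objref)
{
	return (ret < 0 ? : objref);
}

/* Same shape as "return (obj_new(...) ? : 1);" in ckpt_obj_insert().
 * The fallback 1 is reached only if the first operand is exactly 0. */
static int insert_as_posted(int obj_new_result)
{
	return (obj_new_result ? : 1);
}

/* Hypothetical alternative that propagates the error code instead. */
static int obj_new_propagating(int ret, int objref)
{
	assert(objref > 0);	/* assumption: assigned object ids are positive */
	return ret < 0 ? ret : objref;
}
```

Under these semantics the ": 1" branch in ckpt_obj_insert() can only trigger
if obj_new() returns 0, which is the case the review comment questions.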

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found]     ` <1240961064-13991-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-29  0:58       ` Serge E. Hallyn
@ 2009-04-29 17:12       ` Serge E. Hallyn
  2009-05-06 20:39       ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29 17:12 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>  asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>  {
> -	pr_debug("sys_checkpoint not implemented yet\n");
> -	return -ENOSYS;
> +	struct ckpt_ctx *ctx;
> +	int ret;
> +
> +	/* no flags for now */
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (pid == 0)
> +		pid = current->pid;

This uses the global pid, but get_container() later will do
	task = find_task_by_vpid(pid);

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found]         ` <20090429005826.GA23583-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-04-29 17:49           ` Oren Laadan
       [not found]             ` <49F8932D.4040506-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-29 17:49 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> ...
>> +static int checkpoint_write_header(struct ckpt_ctx *ctx)
>> +{
>> +	struct ckpt_hdr_header *h;
>> +	struct new_utsname *uts;
>> +	struct timeval ktv;
>> +	int ret;
>> +
>> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
> 
> ...
>> +	struct ckpt_hdr_tail *h;
>> +	int ret;
>> +
>> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
> 
> ...
>> +	struct ckpt_hdr_task *h;
>> +	int ret;
>> +
>> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
> 
> ...
>> +/**
>> + * ckpt_hdr_get_type - get a hdr of certain size
>> + * @ctx: checkpoint context
>> + * @len: number of bytes to reserve
>> + *
>> + * Returns pointer to reserved space on hbuf
>> + */
>> +void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
>> +{
> 
> Observation (based on all callers in later patches as well): the second
> argument appears to be superfluous?  You should be able to determine
> based on type.

Not always: for instance, ckpt_hdr_file_xxxx all use CKPT_HDR_FILE,
but have different sizes (all share a common 'struct ckpt_hdr_file',
but actual payload differs).

(Besides, that would require adding a big table to decide the length
based on a type... which I don't really like).

Oren.
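
The point about one record type covering several sizes can be sketched in
userspace C as follows. This is only an illustration under stated
assumptions: the struct layouts, field names, and hdr_get_type() are
hypothetical stand-ins (with calloc() in place of the kernel allocator), not
the definitions from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical common record header: every record starts with it. */
struct hdr {
	int type;
	int len;
};

enum { HDR_FILE = 1 };	/* stand-in for CKPT_HDR_FILE */

/* Two records that share one type but carry different payloads. */
struct hdr_file_generic {
	struct hdr h;
	int flags;
};

struct hdr_file_pipe {
	struct hdr h;
	int flags;
	int pipe_objref;
	int pipe_len;
};

/* The caller supplies the length because the type alone does not fix it. */
static void *hdr_get_type(int len, int type)
{
	struct hdr *h;

	assert(len >= (int)sizeof(*h));
	h = calloc(1, (size_t)len);
	if (!h)
		return NULL;
	h->type = type;
	h->len = len;
	return h;
}
```

A lookup table keyed by type could not distinguish the two records above,
which is why the length stays an explicit argument in this sketch.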

> 
> (The callers would look much friendlier without the 2nd arg imo)

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found]         ` <m34ow8ueyk.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2009-04-29 15:49           ` Serge E. Hallyn
@ 2009-04-29 18:05           ` Oren Laadan
       [not found]             ` <49F896E8.7020802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-29 18:18           ` Oren Laadan
  2 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-29 18:05 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen



Nathan Lynch wrote:
> Hello Oren,
> 
> Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
> 
>> From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
>>
>> Support for checkpointing and restarting GPRs, FPU state, DABR, and
>> Altivec state.
> 
> ...
> 
>> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> 
> ...
> 
>> +/* dump the cpu state and registers of a given task */
>> +int checkpoint_write_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
>> +{
>> +	struct ckpt_hdr_cpu *cpu_hdr;
>> +	int rc;
>> +
>> +	rc = -ENOMEM;
>> +	cpu_hdr = ckpt_hdr_get(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
> 
> This won't build (should be ckpt_hdr_get_type?).
> 
> I didn't write this code (I used kzalloc).
> 
> In the code I did write, I deliberately preferred the slab allocator to
> the checkpoint-specific APIs.  I do not see the advantage of using an
> arbitrarily fixed size special allocation stack that is prone to
> overflow or, worse, data corruption if someone improperly interleaves
> their gets and puts.
> 
> I don't believe you were acting in bad faith here, and I'm not sure
> there's an established etiquette.  But my signed-off-by line is on this
> patch, and I don't think it belongs there unless I've actually written
> the code or agreed to the modifications.
> 
> If you insist on replacing kzalloc with ckpt_hdr_get, then please do so
> in a separate commit with an explanation in the changelog.  I'd have no
> objection to that -- it's your tree, after all.  Or if you want to munge
> my patch in place, just replace my signoff with yours and note "based on
> work by Nathan Lynch" or something.

You are correct. Originally I was unsure how to modify that, and
later in the flood of other changes I forgot to get back to that
commit message. I apologize for that.

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found]             ` <49F8932D.4040506-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29 18:15               ` Serge E. Hallyn
  0 siblings, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2009-04-29 18:15 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > ...
> >> +static int checkpoint_write_header(struct ckpt_ctx *ctx)
> >> +{
> >> +	struct ckpt_hdr_header *h;
> >> +	struct new_utsname *uts;
> >> +	struct timeval ktv;
> >> +	int ret;
> >> +
> >> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
> > 
> > ...
> >> +	struct ckpt_hdr_tail *h;
> >> +	int ret;
> >> +
> >> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
> > 
> > ...
> >> +	struct ckpt_hdr_task *h;
> >> +	int ret;
> >> +
> >> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
> > 
> > ...
> >> +/**
> >> + * ckpt_hdr_get_type - get a hdr of certain size
> >> + * @ctx: checkpoint context
> >> + * @len: number of bytes to reserve
> >> + *
> >> + * Returns pointer to reserved space on hbuf
> >> + */
> >> +void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
> >> +{
> > 
> > Observation (based on all callers in later patches as well): the second
> > argument appears to be superfluous?  You should be able to determine
> > based on type.
> 
> Not always: for instance, ckpt_hdr_file_xxxx all use CKPT_HDR_FILE,
> but have different sizes (all share a common 'struct ckpt_hdr_file',
> but actual payload differs).
> 
> (Besides, that would require adding a big table to decide the length
> based on a type... which I don't really like).

But only in one place, instead of having 'sizeof(*h)' at every
caller.

But ok, if they differ in the file case then nm.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found]         ` <m34ow8ueyk.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2009-04-29 15:49           ` Serge E. Hallyn
  2009-04-29 18:05           ` Oren Laadan
@ 2009-04-29 18:18           ` Oren Laadan
       [not found]             ` <49F899E1.2030207-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-29 18:18 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen



Nathan Lynch wrote:
> Hello Oren,
> 
> Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
> 
>> From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
>>
>> Support for checkpointing and restarting GPRs, FPU state, DABR, and
>> Altivec state.
> 
> ...
> 
>> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> 
> ...
> 
>> +/* dump the cpu state and registers of a given task */
>> +int checkpoint_write_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
>> +{
>> +	struct ckpt_hdr_cpu *cpu_hdr;
>> +	int rc;
>> +
>> +	rc = -ENOMEM;
>> +	cpu_hdr = ckpt_hdr_get(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
> 
> This won't build (should be ckpt_hdr_get_type?).
> 
> I didn't write this code (I used kzalloc).
> 
> In the code I did write, I deliberately preferred the slab allocator to
> the checkpoint-specific APIs.  I do not see the advantage of using an
> arbitrarily fixed size special allocation stack that is prone to
> overflow or, worse, data corruption if someone improperly interleaves
> their gets and puts.
> 

There is a reason I insist on it: I plan to optimize c/r app downtime
by buffering data in the kernel while apps are frozen, and write back
the output after they resume execution. To do this efficiently without
an extra data copy you need something smarter than just kmalloc/kfree().

Since such an API and implementation will be used later, it makes sense
to me to enforce the API requirement already. IOW, I want the code to
use ckpt_hdr_get(), ckpt_hdr_put() and ckpt_hdr_get_type().

How they are implemented, now, doesn't really matter. Your point about
using kmalloc() is correct, in particular at this stage of development.

So here's what I'll do:  I'll keep the interface requirement, and
change the implementation behind it to use kmalloc/kfree().

Oren.
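
A minimal userspace sketch of that compromise, assuming hypothetical names
(struct ctx, hdr_get(), hdr_put()) and calloc()/free() standing in for
kmalloc()/kfree(): callers keep a get/put interface, while the buffers come
from the general-purpose allocator, so puts need not mirror the order of the
gets.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context; only tracks how many buffers are outstanding. */
struct ctx {
	int outstanding;
};

/* Interface stays "get a header buffer"; the backing store is the heap. */
static void *hdr_get(struct ctx *ctx, size_t len)
{
	void *p = calloc(1, len);

	if (p)
		ctx->outstanding++;
	return p;
}

/* Release a buffer obtained from hdr_get(); any order is acceptable. */
static void hdr_put(struct ctx *ctx, void *p)
{
	if (!p)
		return;
	assert(ctx->outstanding > 0);	/* catches a put without a get */
	free(p);
	ctx->outstanding--;
}
```

Because only the bodies of the two helpers know about the allocator, a later
switch to a buffering implementation would not have to touch the callers.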

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 08/54] Dump memory address space
       [not found]             ` <20090429064241.GA17482-gvzKVTG1yJJBDgjK7y7TUQ@public.gmane.org>
@ 2009-04-29 20:00               ` Oren Laadan
  0 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-04-29 20:00 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Hansen



Guenter Roeck wrote:
> On Tue, Apr 28, 2009 at 09:11:28PM -0700, Serge E. Hallyn wrote:
>> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>>> +#if CONFIG_CHEKCPOINT
>                 ^^^^^^^^^^
> CONFIG_CHECKPOINT ? 
> 

Hmmm ... :(

Will fix, thanks.

Oren.

> Guenter
> 
>>> +static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
>>> +				      struct vm_area_struct *vma)
>>> +{
>>> +	char *name;
>>> +
>>> +	/*
>>> +	 * Currently, we only handle VDSO/vsyscall special handling.
>>> +	 * Even that, is very basic - we just skip the contents and
>>> +	 * hope for the best in terms of compatilibity upon restart.
>>> +	 */
>>> +
>>> +	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
>>> +		return -ENOSYS;
>>> +
>>> +	name = arch_vma_name(vma);
>>> +	if (!name || strcmp(vma_name, "[vdso]"))
>> Not important except for bisect-safety, as it's fixed in the next
>> patch, but this should be name, not vma_name.
>>
>> -serge
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found]             ` <49F899E1.2030207-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29 20:33               ` Nathan Lynch
  0 siblings, 0 replies; 107+ messages in thread
From: Nathan Lynch @ 2009-04-29 20:33 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
> Nathan Lynch wrote:
>> 
>> Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
>>> +/* dump the cpu state and registers of a given task */
>>> +int checkpoint_write_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
>>> +{
>>> +	struct ckpt_hdr_cpu *cpu_hdr;
>>> +	int rc;
>>> +
>>> +	rc = -ENOMEM;
>>> +	cpu_hdr = ckpt_hdr_get(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
>> 
>> This won't build (should be ckpt_hdr_get_type?).
>> 
>> I didn't write this code (I used kzalloc).
>> 
>> In the code I did write, I deliberately preferred the slab allocator to
>> the checkpoint-specific APIs.  I do not see the advantage of using an
>> arbitrarily fixed size special allocation stack that is prone to
>> overflow or, worse, data corruption if someone improperly interleaves
>> their gets and puts.
>> 
>
> There is a reason I insist on it: I plan to optimize c/r app downtime
> by buffering data in the kernel while apps are frozen, and write back
> the output after they resume execution. To do this efficiently without
> an extra data copy you need something smarter than just kmalloc/kfree().
>
> Since such an API and implementation will be used later, it makes sense
> to me to enforce the API requirement already. IOW, I want the code to
> use ckpt_hdr_get(), ckpt_hdr_put() and ckpt_hdr_get_type().
>
> How they are implemented, now, doesn't really matter. Your point about
> using kmalloc() is correct, in particular at this stage of development.
>
> So here's what I'll do:  I'll keep the interface requirement, and
> change the implementation behind it to use kmalloc/kfree().

That sounds reasonable to me, thanks.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation
       [not found]             ` <49F896E8.7020802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-29 20:55               ` Nathan Lynch
  0 siblings, 0 replies; 107+ messages in thread
From: Nathan Lynch @ 2009-04-29 20:55 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
> You are correct. Originally I was unsure how to modify that, and
> later in the flood of  other changes I forgot to get back to that
> commit message. I apologize for that.

No apology necessary; I understand certain parties may have been
impatiently harassing you to get the patch series posted :) Thanks for
understanding.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]     ` <20090429081815.GA1813-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
@ 2009-04-29 22:47       ` Oren Laadan
       [not found]         ` <49F8D8FC.8010400-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-04-29 22:47 UTC (permalink / raw)
  To: Oren Laadan,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Hi Louis,

Louis Rilling wrote:
> Hi,
> 
> On 28/04/09 19:23 -0400, Oren Laadan wrote:
>> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
>> The logic and image format reworked and simplified, code refactored,
>> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
>> (uts and ipc).
> 
> I should have asked before, but what are the reasons to checkpoint SYSV IPCs
> in the same file/stream as tasks? Would it be better to checkpoint them
> independently, like the file system state?
> 
> In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
> system state, because SYSV IPC objects' lifetimes do not depend on task
> lifetimes, and we can gain more flexibility this way. In particular we envision
> cases in which two applications share a state in a SYSV SHM (something like a
> producer-consumer scheme), but do not need to be checkpointed together. In such
> a case the SYSV SHM itself could even need more high-availability (using
> active replication) than a checkpoint/restart facility.
> 

Thanks for the feedback, this is actually an interesting idea.

Indeed in the past I also considered SYSV IPC to be a "global" resource
that was checkpointed before iterating through the tasks.

However, in the presence of namespaces, the lifetime of an IPC namespace
does depend on task lifetimes: when the last task referring to a
given namespace exits, that namespace is destroyed. Of course, the
root namespace is truly global, because init(1) never exits.

What would 'checkpoint them independently' mean in this case ?

In your use-case, can you restart either application without first
restoring the relevant SYSVIPC ?

Can you think of other use-cases for such a division ?  Am I right to
guess that your use case is specific to the distributed (and SSI-)
nature of your system ?  (Active-replication of SYSV_SHM sounds
awfully related to DSM :)


While not focusing on such use cases, I want to keep the design flexible
enough to not exclude them a priori, and be able to address them later
on. Indeed, the code is split such that the function to save a given
IPC namespace does not depend on the task that uses it. Future code
could easily use the same functionality.

One way to be flexible enough to support your use case is to have some
mechanism in place to select whether a resource (virtually any) is
to be checkpointed/restored.

For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
to checkpoint (also) IPC, and not checkpoint IPC in its absence.

So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
want only IPC, you'd use CHECKPOINT_SYSVIPC only.
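
The flag combinations above can be sketched as plain bitmask arithmetic. This
is a userspace illustration only: the names CHECKPOINT_SYSVIPC and
CHECKPOINT_ALL come from the proposal, but the bit values and the helper
functions are assumptions made for the example (the posted sys_checkpoint()
currently rejects any nonzero flags).

```c
#include <assert.h>

/* Hypothetical bit assignments; the thread proposes the names only. */
#define CHECKPOINT_SYSVIPC	0x1UL
#define CHECKPOINT_ALL		(~0UL)

/* Drop one resource class from a selection mask. */
static unsigned long checkpoint_without(unsigned long flags,
					unsigned long resource)
{
	assert(resource != 0);
	return flags & ~resource;
}

/* Does the mask ask for SYSV IPC state? */
static int selects_sysvipc(unsigned long flags)
{
	return (flags & CHECKPOINT_SYSVIPC) != 0;
}

/* Does the mask ask for anything besides SYSV IPC state? */
static int selects_other(unsigned long flags)
{
	return (flags & ~CHECKPOINT_SYSVIPC) != 0;
}
```

With these definitions, checkpoint_without(CHECKPOINT_ALL, CHECKPOINT_SYSVIPC)
corresponds to "everything but IPC", and passing CHECKPOINT_SYSVIPC alone
corresponds to "only IPC".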

Same thing for restart, only that it will get trickier in the "only IPC"
case, since you will need to tell which IPC namespace is affected.

Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
telling the kernel to not c/r its IPC namespace. (Or any other
resource). Again there would need to be a way to add a restored
namespace.

Does this address your concerns ?

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 08/54] Dump memory address space
       [not found]     ` <1240961064-13991-9-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-29  4:11       ` Serge E. Hallyn
@ 2009-04-30  4:54       ` Matt Helsley
  2009-05-01 15:25       ` Dave Hansen
  2009-05-01 15:27       ` Dave Hansen
  3 siblings, 0 replies; 107+ messages in thread
From: Matt Helsley @ 2009-04-30  4:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

On Tue, Apr 28, 2009 at 07:23:38PM -0400, Oren Laadan wrote:
> For each VMA, there is a 'struct ckpt_vma'; if the VMA is file-mapped,
> it will be followed by the file name. Then comes the actual contents,
> in one or more chunks: each chunk begins with a header that specifies
> how many pages it holds, then the virtual addresses of all the dumped
> pages in that chunk, followed by the actual contents of all dumped
> pages. A header with zero number of pages marks the end of the contents.
> Then comes the next VMA and so on.
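
The chunk layout described above (header with a page count, then all virtual
addresses, then all page contents, ending with a zero-count header) can be
sketched in userspace C. Everything here is an assumption for illustration:
the struct chunk_hdr layout, the 4096-byte page size, and the helper names
are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u	/* assumed page size */

/* Hypothetical chunk header; the real layout is defined by the patch. */
struct chunk_hdr {
	uint32_t nr_pages;
};

/* Append one chunk: header, then addresses, then contents.
 * A chunk with nr == 0 is the end-of-contents marker.
 * Returns the number of bytes written. */
static size_t put_chunk(uint8_t *out, const uint64_t *vaddrs,
			const uint8_t *pages, uint32_t nr)
{
	struct chunk_hdr h = { nr };
	size_t off = 0;

	assert(out != NULL);
	memcpy(out + off, &h, sizeof(h));
	off += sizeof(h);
	if (nr) {
		memcpy(out + off, vaddrs, nr * sizeof(*vaddrs));
		off += nr * sizeof(*vaddrs);
		memcpy(out + off, pages, (size_t)nr * PAGE_SZ);
		off += (size_t)nr * PAGE_SZ;
	}
	return off;
}

/* Walk chunks until the zero-count header; return total pages seen. */
static uint32_t count_pages(const uint8_t *in)
{
	uint32_t total = 0;

	for (;;) {
		struct chunk_hdr h;

		memcpy(&h, in, sizeof(h));
		in += sizeof(h);
		if (h.nr_pages == 0)
			return total;
		in += (size_t)h.nr_pages * (sizeof(uint64_t) + PAGE_SZ);
		total += h.nr_pages;
	}
}
```

Grouping the addresses ahead of the contents lets a reader learn where every
page of a chunk belongs before it touches the page data.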
> 
> To checkpoint a vma, call the ops->checkpoint() method of that vma.
> Normally the per-vma function will invoke generic_vma_checkpoint()
> which first writes the vma description, followed by the specific
> logic to dump the contents of the pages.
> 
> Currently for private mapped memory we save the pathname of the file
> that is mapped (restart will use it to re-open it and then map it).
> Later we change that to reference a file object.
> 
> Changelog[v14]:
>   - Modify the ops->checkpoint method to be much more powerful
>   - Improve support for VDSO (with special_mapping checkpoint callback)
>   - Save new field 'vdso' in mm_context
>   - Revert change to pr_debug(), back to ckpt_debug()
>   - Check whether calls to ckpt_hbuf_get() fail
>   - Discard field 'h->parent'

<snipped previous bits of changelog...>

> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> ---
>  arch/x86/Kconfig                      |    1 +
>  arch/x86/include/asm/checkpoint_hdr.h |    7 +
>  arch/x86/mm/checkpoint.c              |   32 ++
>  checkpoint/Makefile                   |    2 +-
>  checkpoint/checkpoint.c               |   24 ++
>  checkpoint/checkpoint_arch.h          |    1 +
>  checkpoint/files.c                    |   88 +++++
>  checkpoint/memory.c                   |  600 +++++++++++++++++++++++++++++++++
>  checkpoint/process.c                  |    4 +
>  checkpoint/sys.c                      |    9 +
>  include/linux/checkpoint.h            |   25 ++-
>  include/linux/checkpoint_hdr.h        |   39 +++
>  include/linux/checkpoint_types.h      |   10 +
>  mm/filemap.c                          |   30 ++
>  mm/mmap.c                             |   30 ++
>  15 files changed, 900 insertions(+), 2 deletions(-)
>  create mode 100644 checkpoint/files.c
>  create mode 100644 checkpoint/memory.c

<snipped lots of patch>
 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3303d1b..6b75359 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -34,6 +34,10 @@
>  #include <asm/tlb.h>
>  #include <asm/mmu_context.h>
> 
> +#include <linux/checkpoint_types.h>
> +#include <linux/checkpoint_hdr.h>
> +#include <linux/checkpoint.h>
> +
>  #include "internal.h"
> 
>  #ifndef arch_mmap_check
> @@ -2268,9 +2272,35 @@ static void special_mapping_close(struct vm_area_struct *vma)
>  {
>  }
> 
> +#ifdef CONFIG_CHECKPOINT
> +static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
> +				      struct vm_area_struct *vma)
> +{
> +	const char *name;
> +
> +	/*
> +	 * Currently, we only handle VDSO/vsyscall special mappings.
> +	 * Even that is very basic - we just skip the contents and
> +	 * hope for the best in terms of compatibility upon restart.
> +	 */
> +
> +	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
> +		return -ENOSYS;
> +
> +	name = arch_vma_name(vma);
> +	if (!name || strcmp(name, "[vdso]"))
> +		return -ENOSYS;
> +
> +	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO);
> +}
> +#else
> +#define special_mapping_checkpoint NULL
> +#endif /* CONFIG_CHECKPOINT */
> +
>  static struct vm_operations_struct special_mapping_vmops = {
>  	.close = special_mapping_close,
>  	.fault = special_mapping_fault,

This doesn't compile when CONFIG_CHECKPOINT is not defined. The .checkpoint op
initialization needs to be surrounded with:

#ifdef CONFIG_CHECKPOINT

> +	.checkpoint = special_mapping_checkpoint,

#endif /* CONFIG_CHECKPOINT */

Alternatively, anything that accesses it needs to use suitably-defined
empty, static inline functions.
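
A minimal user-space sketch of both options, assuming the .checkpoint
member of the ops structure is itself declared only under
CONFIG_CHECKPOINT (the struct definition is not shown in this excerpt).
The names struct vm_ops, special_vmops and vma_checkpoint() are made up
for illustration and are not taken from the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Uncomment to model a kernel built with checkpoint support. */
/* #define CONFIG_CHECKPOINT 1 */

struct vma;			/* stand-in for struct vm_area_struct */

/* Hypothetical stand-in for struct vm_operations_struct. */
struct vm_ops {
	void (*close)(struct vma *vma);
#ifdef CONFIG_CHECKPOINT
	int (*checkpoint)(void *ctx, struct vma *vma);
#endif
};

static void special_close(struct vma *vma)
{
	(void)vma;
}

#ifdef CONFIG_CHECKPOINT
static int special_checkpoint(void *ctx, struct vma *vma)
{
	(void)ctx;
	(void)vma;
	return 0;
}
#endif

/* Option 1: guard the initializer, so the member is named only if it exists. */
static const struct vm_ops special_vmops = {
	.close = special_close,
#ifdef CONFIG_CHECKPOINT
	.checkpoint = special_checkpoint,
#endif
};

/* Option 2: callers go through a helper that compiles either way. */
static inline int vma_checkpoint(const struct vm_ops *ops, void *ctx,
				 struct vma *vma)
{
#ifdef CONFIG_CHECKPOINT
	if (ops->checkpoint)
		return ops->checkpoint(ctx, vma);
#else
	(void)ops;
	(void)ctx;
	(void)vma;
#endif
	return -ENOSYS;
}
```

With CONFIG_CHECKPOINT left undefined, as above, the helper simply
reports -ENOSYS and nothing references the missing member.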

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]         ` <49F8D8FC.8010400-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-04-30  9:41           ` Louis Rilling
       [not found]             ` <20090430094106.GC13896-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Louis Rilling @ 2009-04-30  9:41 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen


[-- Attachment #1.1: Type: text/plain, Size: 3944 bytes --]

On 29/04/09 18:47 -0400, Oren Laadan wrote:
> Hi Louis,
> 
> Louis Rilling wrote:
> > Hi,
> > 
> > On 28/04/09 19:23 -0400, Oren Laadan wrote:
> >> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
> >> The logic and image format reworked and simplified, code refactored,
> >> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
> >> (uts and ipc).
> > 
> > I should have asked before, but what are the reasons to checkpoint SYSV IPCs
> > in the same file/stream as tasks? Would it be better to checkpoint them
> > independently, like the file system state?
> > 
> > In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
> > system state, because SYSV IPCs objects' lifetime do not depend on tasks
> > lifetime, and we can gain more flexibility this way. In particular we envision
> > cases in which two applications share a state in a SYSV SHM (something like a
> > producer-consumer scheme), but do not need to be checkpointed together. In such
> > a case the SYSV SHM itself could even need more high-availability (using
> > active replication) than a checkpoint/restart facility.
> > 
> 
> Thanks for the feedback, this is actually an interesting idea.
> 
> Indeed in the past I also considered SYSV IPC to be a "global" resource
> that was checkpointed before iterating through the tasks.
> 
> However, in the presence of namespaces, the lifetime of an IPC namespace
> does depend on tasks lifetime - when the last task referring to a
> given namespace exits - that namespace is destroyed. Of course, the
> root namespace is truly global, because init(1) never exits.
> 
> What would 'checkpoint them independently' mean in this case ?

I mean that the producer and the consumer could have separate checkpointing
policies (if any), and the IPC SHM as well.

> 
> In your use-case, can you restart either application without first
> restoring the relevant SYSVIPC ?

Probably not.

> 
> Can you think of other use-cases for such a division ?  Am I right to
> guess that your use case is specific to the distributed (and SSI-)
> nature of your system ?  (Active-replication of SYSV_SHM sounds
> awfully related to DSM :)

The case of active-replication may be specific to DSM-based systems, but the
case of independent policies is already interesting in standalone boxes.

> 
> 
> While not focusing on such use cases, I want to keep the design flexible
> enough to not exclude them a-priori, and be able to address them later
> on. Indeed, the code is split such that the function to save a given
> IPC namespace does not depend on the task that uses it. Future code
> could easily use the same functionality.
> 
> One way to be flexible to support your use case, is by having some
> mechanism in place to select whether a resource (virtually any) is
> to be checkpointed/restored.
> 
> For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
> to checkpoint (also) IPC, and not checkpoint IPC in its absence.
> 
> So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
> want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
> want only IPC, you'd use CHECKPOINT_SYSVIPC only.
> 
> Same thing for restart, only that it will get trickier in the "only IPC"
> case, since you will need to tell which IPC namespace is affected.
> 
> Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
> telling the kernel to not c/r its IPC namespace. (Or any other
> resource). Again there would need to be a way to add a restored
> namespace.
> 
> Does this address your concerns ?

Yes this sounds flexible enough. Thanks for taking this into account.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 05/54] x86 support for checkpoint/restart
       [not found]     ` <1240961064-13991-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-01 15:12       ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 15:12 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else  /* !CONFIG_X86_64 */

I thought we've talked about this before, but this is a job for Kconfig,
not for poor saps on x86_64 compiling with allyesconfig.

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages()
       [not found]     ` <1240961064-13991-8-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-01 15:13       ` Dave Hansen
  2009-05-01 15:42         ` Serge E. Hallyn
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 15:13 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> 
> Add "start" argument, to request to map vDSO to a specific place,
> and fail the operation if not.
> 
> This is useful for restart(2) to ensure that memory layout is restored
> exactly as needed.

This needs to go up at the start of the series.  It can get merged
before everything else.

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 08/54] Dump memory address space
       [not found]     ` <1240961064-13991-9-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-29  4:11       ` Serge E. Hallyn
  2009-04-30  4:54       ` Matt Helsley
@ 2009-05-01 15:25       ` Dave Hansen
  2009-05-01 15:27       ` Dave Hansen
  3 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 15:25 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> +struct ckpt_pgarr {
> +       unsigned long *vaddrs;
> +       struct page **pages;
> +       unsigned int nr_used;
> +       struct list_head list;
> +};
> +
> +#define CKPT_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
> +#define CKPT_PGARR_CHUNK  (4 * CKPT_PGARR_TOTAL)

I seem to get irrationally angry in the presence of 'clumps', 'chunks',
and other non-descriptive variable names in mm code. :)

Anyway, this is only used in one place, but it might be better to call
it:

	CKPT_PAGES_AT_ONCE
or
	CKPT_PGARR_BATCH

Batching up 4 at once doesn't seem that great of a win to me.  Why
bother adding another loop and another variable unless the batch size is
bigger?
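
For scale, a small user-space sketch of the arithmetic behind those two
macros. It assumes a 4 KiB page (SKETCH_PAGE_SIZE is a stand-in for the
kernel's PAGE_SIZE), uses the CKPT_PGARR_BATCH name suggested above, and
adds two hypothetical helpers; how the patch actually consumes the
value is not visible in this excerpt:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed; stand-in for PAGE_SIZE */

/* Entries that fit in one page-sized array of pointers. */
#define CKPT_PGARR_TOTAL (SKETCH_PAGE_SIZE / sizeof(void *))
/* Entries handled per batch: four page-arrays' worth. */
#define CKPT_PGARR_BATCH (4 * CKPT_PGARR_TOTAL)

/* Hypothetical helper: page-arrays needed to hold one full batch. */
static inline size_t pgarrs_per_batch(void)
{
	return CKPT_PGARR_BATCH / CKPT_PGARR_TOTAL;
}

/* Hypothetical helper: batches needed to cover nr_pages (rounded up). */
static inline size_t batches_for(size_t nr_pages)
{
	return (nr_pages + CKPT_PGARR_BATCH - 1) / CKPT_PGARR_BATCH;
}
```

Under these assumptions and with 8-byte pointers, one page-array holds
512 entries and one batch covers 2048 of them, so the factor of 4
counts page-arrays rather than individual pages.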

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 08/54] Dump memory address space
       [not found]     ` <1240961064-13991-9-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                         ` (2 preceding siblings ...)
  2009-05-01 15:25       ` Dave Hansen
@ 2009-05-01 15:27       ` Dave Hansen
  2009-05-04  7:58         ` Oren Laadan
  3 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 15:27 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> +/* return (and detach) first empty page-array in the pool, if exists */
> +static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
> +{
> +       struct ckpt_pgarr *pgarr;
> +
> +       if (list_empty(&ctx->pgarr_pool))
> +               return NULL;
> +       pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
> +       list_del(&pgarr->list);
> +       return pgarr;
> +}

What's the pool for, again?  If we're alloc/freeing a bunch of these,
I'd vote for a slab cache rather than managing our own pool.

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 09/54] Restore memory address space
       [not found]     ` <1240961064-13991-10-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-01 15:28       ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 15:28 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> index bad7b29..d61653c 100644
> --- a/arch/x86/include/asm/checkpoint_hdr.h
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -104,4 +104,9 @@ struct ckpt_hdr_mm_context {
>         __u32 nldt;
>  } __attribute__((aligned(8)));
> 
> +#ifdef __KERNEL__
> +/* misc prototypes from kernel (not defined elsewhere) */
> +asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
> +#endif

This really needs to go somewhere more appropriate.  What about
asm/ldt.h?

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages()
  2009-05-01 15:13       ` Dave Hansen
@ 2009-05-01 15:42         ` Serge E. Hallyn
       [not found]           ` <20090501154220.GA26771-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2009-05-01 15:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> > From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > 
> > Add "start" argument, to request to map vDSO to a specific place,
> > and fail the operation if not.
> > 
> > This is useful for restart(2) to ensure that memory layout is restored
> > exactly as needed.
> 
> This needs to go up at the start of the series.  It can get merged
> before everything else.
> 
> -- Dave

It does, but now we are sending Oren mixed signals.

I thought we were talking yesterday on irc about Oren doing append-only
git tree from now on, and someone else (you? :) was going to do patch
integration and sorting and prettifying?

/me steps aside to let Nathan and Dave duke it out.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages()
       [not found]           ` <20090501154220.GA26771-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-01 15:57             ` Dave Hansen
  2009-05-01 16:18               ` Serge E. Hallyn
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 15:57 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Fri, 2009-05-01 at 10:42 -0500, Serge E. Hallyn wrote:
> Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> > On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> > > From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > > 
> > > Add "start" argument, to request to map vDSO to a specific place,
> > > and fail the operation if not.
> > > 
> > > This is useful for restart(2) to ensure that memory layout is restored
> > > exactly as needed.
> > 
> > This needs to go up at the start of the series.  It can get merged
> > before everything else.
> 
> It does, but now we are sending Oren mixed signals.
> 
> I thought we were talking yesterday on irc about Oren doing append-only
> git tree from now on, and someone else (you? :) was going to do patch
> integration and sorting and prettifying?

I was going to just go and start doing these, but I figured I'd have a
run through the patches first.  I probably also shouldn't be doing
anything to which Oren (or anyone else) has really strong objections.
So, I'll give everyone a chance to discuss and object before I go
screwing the patches up. ;)

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages()
  2009-05-01 15:57             ` Dave Hansen
@ 2009-05-01 16:18               ` Serge E. Hallyn
       [not found]                 ` <20090501161813.GA27516-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2009-05-01 16:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> On Fri, 2009-05-01 at 10:42 -0500, Serge E. Hallyn wrote:
> > Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> > > On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
> > > > From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > > > 
> > > > Add "start" argument, to request to map vDSO to a specific place,
> > > > and fail the operation if not.
> > > > 
> > > > This is useful for restart(2) to ensure that memory layout is restored
> > > > exactly as needed.
> > > 
> > > This needs to go up at the start of the series.  It can get merged
> > > before everything else.
> > 
> > It does, but now we are sending Oren mixed signals.
> > 
> > I thought we were talking yesterday on irc about Oren doing append-only
> > git tree from now on, and someone else (you? :) was going to do patch
> > integration and sorting and prettifying?
> 
> I was going to just go and start doing these, but I figured I'd have a
> run through the patches first.  I probably also shouldn't be doing
> anything to which Oren (or anyone else) has really strong objections.
> So, I'll give everyone a chance to discuss and object before I go
> screwing the patches up. ;)

True, part of this may be able to go up on its own.  In that case

	1. please put the part of my s390 fix patch which applies to
	   this patch in

	2. as I asked when Alexey posted this, can we remove the
	   first argument to arch_setup_additional_pages()?  No one uses
	   it.

thanks,
-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]     ` <1240961064-13991-54-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-01 17:26       ` Dave Hansen
  2009-05-07  3:50       ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2009-05-01 17:26 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:24 -0400, Oren Laadan wrote:
>  /*
>   * helper grab/drop functions:
> - *   obj_no_{drop,grab}: for objects ignored/skipped
> - *   obj_file_{drop,grab}: for file objects
> - *   obj_inode_{drop,grab}: for inode objects
> - *   obj_mm_{drop,grab}: for mm_struct objects
> - *   obj_ns_{drop,grab}: for nsproxy objects
> - *   obj_uts_ns_{drop,grab}: for uts_namespace objects
> - *   obj_ipc_ns_{drop,grab}: for ipc_namespace objects
> + *   obj_no_{drop,grab,users}: for objects ignored/skipped
> + *   obj_file_{drop,grab,users}: for file objects
> + *   obj_inode_{drop,grab,users}: for inode objects
> + *   obj_mm_{drop,grab,users}: for mm_struct objects
> + *   obj_ns_{drop,grab,users}: for nsproxy objects
> + *   obj_uts_ns_{drop,grab,users}: for uts_namespace objects
> + *   obj_ipc_ns_{drop,grab,users}: for ipc_namespace objects
>   */

I think some of this stuff is over-commented.  This is a perfect
example.  It doesn't buy us *anything* except for comments that get
easily stale.  These are away from the function declarations and they
won't even show up in greps or cscope searches for the functions.  If
anyone reads this:

+static void obj_file_drop(void *ptr)
+{
+       fput((struct file *) ptr);
+}

and can't tell that this is 'for file objects' well...  maybe they
should consider a new career in politics or something.

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages()
       [not found]                 ` <20090501161813.GA27516-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-04  7:25                   ` Oren Laadan
  0 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-04  7:25 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen



Serge E. Hallyn wrote:
> Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
>> On Fri, 2009-05-01 at 10:42 -0500, Serge E. Hallyn wrote:
>>> Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
>>>> On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
>>>>> From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>>>>
>>>>> Add "start" argument, to request to map vDSO to a specific place,
>>>>> and fail the operation if not.
>>>>>
>>>>> This is useful for restart(2) to ensure that memory layout is restored
>>>>> exactly as needed.
>>>> This needs to go up at the start of the series.  It can get merged
>>>> before everything else.
>>> It does, but now we are sending Oren mixed signals.
>>>
>>> I thought we were talking yesterday on irc about Oren doing append-only
>>> git tree from now on, and someone else (you? :) was going to do patch
>>> integration and sorting and prettifying?
>> I was going to just go and start doing these, but I figured I'd have a
>> run through the patches first.  I probably also shouldn't be doing
>> anything to which Oren (or anyone else) has really strong objections.
>> So, I'll give everyone a chance to discuss and object before I go
>> screwing the patches up. ;)
> 
> True, part of this may be able to go up on its own.  In that case
> 
> 	1. please put the part of my s390 fix patch which applies to
> 	   this patch in
> 

Fixed.

> 	2. as I asked when Alexey posted this, can we remove the
> 	   first argument to arch_setup_additional_pages()?  No one uses
> 	   it.

I have no objections.

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 08/54] Dump memory address space
  2009-05-01 15:27       ` Dave Hansen
@ 2009-05-04  7:58         ` Oren Laadan
  0 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-04  7:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan



Dave Hansen wrote:
> On Tue, 2009-04-28 at 19:23 -0400, Oren Laadan wrote:
>> +/* return (and detach) first empty page-array in the pool, if exists */
>> +static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
>> +{
>> +       struct ckpt_pgarr *pgarr;
>> +
>> +       if (list_empty(&ctx->pgarr_pool))
>> +               return NULL;
>> +       pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
>> +       list_del(&pgarr->list);
>> +       return pgarr;
>> +}
> 
> What's the pool for, again?  If we're alloc/freeing a bunch of these,
> I'd vote for a slab cache rather than managing our own pool.

It's not about the 'struct ckpt_pgarr' per se. Each ckpt_pgarr
itself points to two page-size buffers that are allocated as well. The
pool avoids redundant alloc/dealloc of those buffers while iterating
through all VMAs of all tasks.
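
A rough user-space sketch of that idea, not the code from the patch:
malloc() stands in for page allocation, a singly linked free list
stands in for the list_head pool, and struct pgarr, struct pgarr_pool,
pgarr_get(), pgarr_put() and the nr_allocs counter are all
illustrative names:

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed; stand-in for PAGE_SIZE */

struct pgarr {
	unsigned long *vaddrs;	/* page-sized buffer of addresses */
	void **pages;		/* page-sized buffer of page pointers */
	unsigned int nr_used;
	struct pgarr *next;	/* free-list link */
};

struct pgarr_pool {
	struct pgarr *free;	/* empty page-arrays ready for reuse */
	unsigned int nr_allocs;	/* how often the buffers were really allocated */
};

/* Take an empty page-array from the pool, allocating only if none is left. */
static struct pgarr *pgarr_get(struct pgarr_pool *pool)
{
	struct pgarr *pgarr = pool->free;

	if (pgarr) {
		pool->free = pgarr->next;
		pgarr->next = NULL;
		pgarr->nr_used = 0;
		return pgarr;
	}

	pgarr = calloc(1, sizeof(*pgarr));
	if (!pgarr)
		return NULL;
	pgarr->vaddrs = malloc(SKETCH_PAGE_SIZE);
	pgarr->pages = malloc(SKETCH_PAGE_SIZE);
	if (!pgarr->vaddrs || !pgarr->pages) {
		free(pgarr->vaddrs);
		free(pgarr->pages);
		free(pgarr);
		return NULL;
	}
	pool->nr_allocs++;
	return pgarr;
}

/* Hand a page-array back; its two buffers stay allocated for the next user. */
static void pgarr_put(struct pgarr_pool *pool, struct pgarr *pgarr)
{
	pgarr->next = pool->free;
	pool->free = pgarr;
}

/* Release everything once, when the whole checkpoint is done. */
static void pgarr_pool_destroy(struct pgarr_pool *pool)
{
	while (pool->free) {
		struct pgarr *pgarr = pool->free;

		pool->free = pgarr->next;
		free(pgarr->vaddrs);
		free(pgarr->pages);
		free(pgarr);
	}
}
```

A slab cache would cover the small descriptor, but in this sketch the
saving comes from keeping the two large buffers attached to it between
VMAs.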

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]             ` <20090430094106.GC13896-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
@ 2009-05-04  8:03               ` Matthieu Fertré
       [not found]                 ` <49FEA136.2040406-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Matthieu Fertré @ 2009-05-04  8:03 UTC (permalink / raw)
  To: Oren Laadan, Louis Rilling
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen


[-- Attachment #1.1: Type: text/plain, Size: 4524 bytes --]

Hi,

Louis Rilling a écrit :
> On 29/04/09 18:47 -0400, Oren Laadan wrote:
>> Hi Louis,
>>
>> Louis Rilling wrote:
>>> Hi,
>>>
>>> On 28/04/09 19:23 -0400, Oren Laadan wrote:
>>>> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
>>>> The logic and image format reworked and simplified, code refactored,
>>>> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
>>>> (uts and ipc).
>>> I should have asked before, but what are the reasons to checkpoint SYSV IPCs
>>> in the same file/stream as tasks? Would it be better to checkpoint them
>>> independently, like the file system state?
>>>
>>> In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
>>> system state, because SYSV IPCs objects' lifetime do not depend on tasks
>>> lifetime, and we can gain more flexibility this way. In particular we envision
>>> cases in which two applications share a state in a SYSV SHM (something like a
>>> producer-consumer scheme), but do not need to be checkpointed together. In such
>>> a case the SYSV SHM itself could even need more high-availability (using
>>> active replication) than a checkpoint/restart facility.
>>>
>> Thanks for the feedback, this is actually an interesting idea.
>>
>> Indeed in the past I also considered SYSV IPC to be a "global" resource
>> that was checkpointed before iterating through the tasks.
>>
>> However, in the presence of namespaces, the lifetime of an IPC namespace
>> does depend on tasks lifetime - when the last task referring to a
>> given namespace exits - that namespace is destroyed. Of course, the
>> root namespace is truly global, because init(1) never exits.
>>
>> What would 'checkpoint them independently' mean in this case ?
> 
> I mean that the producer and the consumer could have separate checkpointing
> policies (if any), and the IPC SHM as well.
> 
>> In your use-case, can you restart either application without first
>> restoring the relevant SYSVIPC ?
> 
> Probably not.
> 

Well, it depends. It makes no sense to restart the application without
restoring the relevant SHM, but it may make sense for a message queue
(this is application-specific, of course). A message queue is not tied
to the process; it can disappear during the life of the application.

>> Can you think of other use-cases for such a division ?  Am I right to
>> guess that your use case is specific to the distributed (and SSI-)
>> nature of your system ?  (Active-replication of SYSV_SHM sounds
>> awfully related to DSM :)
> 
> The case of active-replication may be specific to DSM-based systems, but the
> case of independent policies is already interesting in standalone boxes.
> 
>>
>> While not focusing on such use cases, I want to keep the design flexible
>> enough to not exclude them a-priori, and be able to address them later
>> on. Indeed, the code is split such that the function to save a given
>> IPC namespace does not depend on the task that uses it. Future code
>> could easily use the same functionality.
>>
>> One way to be flexible to support your use case, is by having some
>> mechanism in place to select whether a resource (virtually any) is
>> to be checkpointed/restored.
>>
>> For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
>> to checkpoint (also) IPC, and not checkpoint IPC in its absence.
>>
>> So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
>> want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
>> want only IPC, you'd use CHECKPOINT_SYSVIPC only.
>>
>> Same thing for restart, only that it will get trickier in the "only IPC"
>> case, since you will need to tell which IPC namespace is affected.
>>
>> Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
>> telling the kernel to not c/r its IPC namespace. (Or any other
>> resource). Again there would need to be a way to add a restored
>> namespace.
>>
>> Does this address your concerns ?
> 
> Yes this sounds flexible enough. Thanks for taking this into account.

I see one drawback with this approach if you allow checkpointing an
application that is not isolated in a container. In that case, you may
want to select which IPC objects to dump, rather than dumping all the
IPC objects living in the system. Indeed, this is why we chose in
Kerrighed to checkpoint IPC objects independently of tasks: we
currently have no container/namespace support.

Regards,

Matthieu


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                 ` <49FEA136.2040406-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
@ 2009-05-04  9:06                   ` Oren Laadan
       [not found]                     ` <49FEB01B.208-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-05-04  9:06 UTC (permalink / raw)
  To: Matthieu Fertré
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen



Matthieu Fertré wrote:
> Hi,
> 
> Louis Rilling a écrit :
>> On 29/04/09 18:47 -0400, Oren Laadan wrote:
>>> Hi Louis,
>>>
>>> Louis Rilling wrote:
>>>> Hi,
>>>>
>>>> On 28/04/09 19:23 -0400, Oren Laadan wrote:
>>>>> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
>>>>> The logic and image format reworked and simplified, code refactored,
>>>>> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
>>>>> (uts and ipc).
>>>> I should have asked before, but what are the reasons to checkpoint SYSV IPCs
>>>> in the same file/stream as tasks? Would it be better to checkpoint them
>>>> independently, like the file system state?
>>>>
>>>> In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
>>>> system state, because SYSV IPCs objects' lifetime do not depend on tasks
>>>> lifetime, and we can gain more flexibility this way. In particular we envision
>>>> cases in which two applications share a state in a SYSV SHM (something like a
>>>> producer-consumer scheme), but do not need to be checkpointed together. In such
>>>> a case the SYSV SHM itself could even need more high-availability (using
>>>> active replication) than a checkpoint/restart facility.
>>>>
>>> Thanks for the feedback, this is actually an interesting idea.
>>>
>>> Indeed in the past I also considered SYSV IPC to be a "global" resource
>>> that was checkpointed before iterating through the tasks.
>>>
>>> However, in the presence of namespaces, the lifetime of an IPC namespace
>>> does depend on tasks lifetime - when the last task referring to a
>>> given namespace exits - that namespace is destroyed. Of course, the
>>> root namespace is truly global, because init(1) never exits.
>>>
>>> What would 'checkpoint them independently' mean in this case ?
>> I mean that the producer and the consumer could have separate checkpointing
>> policies (if any), and the IPC SHM as well.
>>
>>> In your use-case, can you restart either application without first
>>> restoring the relevant SYSVIPC ?
>> Probably not.
>>
> 
> Well, it depends. It makes no sense to restart the application without
> restoring the relevant SHM, but it may make sense for a message queue
> (this is application-specific, of course). A message queue is not tied
> to the process; it can disappear during the life of the application.

Agreed - the concern regards mainly the SHM case.

> 
>>> Can you think of other use-cases for such a division ?  Am I right to
>>> guess that your use case is specific to the distributed (and SSI-)
>>> nature of your system ?  (Active-replication of SYSV_SHM sounds
>>> awfully related to DSM :)
>> The case of active-replication may be specific to DSM-based systems, but the
>> case of independent policies is already interesting in standalone boxes.
>>
>>> While not focusing on such use cases, I want to keep the design flexible
>>> enough to not exclude them a-priori, and be able to address them later
>>> on. Indeed, the code is split such that the function to save a given
>>> IPC namespace does not depend on the task that uses it. Future code
>>> could easily use the same functionality.
>>>
>>> One way to be flexible to support your use case, is by having some
>>> mechanism in place to select whether a resource (virtually any) is
>>> to be checkpointed/restored.
>>>
>>> For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
>>> to checkpoint (also) IPC, and not checkpoint IPC in its absence.
>>>
>>> So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
>>> want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
>>> want only IPC, you'd use CHECKPOINT_SYSVIPC only.
>>>
>>> Same thing for restart, only that it will get trickier in the "only IPC"
>>> case, since you will need to tell which IPC namespace is affected.
>>>
>>> Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
>>> telling the kernel to not c/r its IPC namespace. (Or any other
>>> resource). Again there would need to be a way to add a restored
>>> namespace.
>>>
>>> Does this address your concerns ?
>> Yes this sounds flexible enough. Thanks for taking this into account.
> 
> I see one drawback with this approach if you allow checkpoint of
> application that is not isolated in a container. In that case, you may
> want to select which IPC objects to dump to not dump all the IPC objects
> living in the system. Indeed, this is why we have chosen in Kerrighed to
> checkpoint IPC objects independently of tasks, since we have no
> container/namespaces support currently.

I assume that in this case it will be the application itself that
will somehow tell the system which specific sysvipc objects (ids) it
cares about.

(I'm not sure how the system would otherwise know what to dump and
what to leave out).

I originally proposed the construct of cradvise() syscall to handle
exactly those cases where the application would like to advise the
kernel about certain resources. So, extending the previous example,
a task may call something like:

   cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
   cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */

or:
   cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
   cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */

Anyway, these are just examples of the concept and what sort of generic
interface can be used to implement it; don't pick on the details...
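
To make the "general default plus per-id exception" idea concrete,
here is a minimal userspace sketch. It only models the bookkeeping
that such advice implies. The struct, the function names and the
fixed-size table are all assumptions made for illustration; none of
this is an existing kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_OVERRIDES 16	/* arbitrary limit for this sketch */

/* Hypothetical per-task advice state for SYSV SHM segments. */
struct shm_advice {
	bool include_by_default;	/* models cradvise(CHECKPOINT_SYSVIPC_SHM, ...) */
	size_t nr_overrides;
	struct {
		int id;
		bool include;
	} overrides[MAX_OVERRIDES];
};

/* Models cradvise(CHECKPOINT_SYSVIPC_SHMID, id, include).
 * Returns false when the override table is full. */
static bool shm_advise_id(struct shm_advice *a, int id, bool include)
{
	for (size_t i = 0; i < a->nr_overrides; i++) {
		if (a->overrides[i].id == id) {
			a->overrides[i].include = include;
			return true;
		}
	}
	if (a->nr_overrides == MAX_OVERRIDES)
		return false;
	a->overrides[a->nr_overrides].id = id;
	a->overrides[a->nr_overrides].include = include;
	a->nr_overrides++;
	return true;
}

/* A per-id override wins; otherwise the class-wide default applies. */
static bool shm_should_checkpoint(const struct shm_advice *a, int id)
{
	for (size_t i = 0; i < a->nr_overrides; i++)
		if (a->overrides[i].id == id)
			return a->overrides[i].include;
	return a->include_by_default;
}
```

With include_by_default set to false and one override for a given id,
only that segment would be selected, which matches the first pair of
calls above; flipping both booleans gives the second pair.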

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                     ` <49FEB01B.208-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-04  9:17                       ` Matthieu Fertré
  2009-05-04 13:01                       ` Serge E. Hallyn
  1 sibling, 0 replies; 107+ messages in thread
From: Matthieu Fertré @ 2009-05-04  9:17 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen


[-- Attachment #1.1: Type: text/plain, Size: 5832 bytes --]

Oren Laadan wrote:
> 
> Matthieu Fertré wrote:
>> Hi,
>>
>> Louis Rilling wrote:
>>> On 29/04/09 18:47 -0400, Oren Laadan wrote:
>>>> Hi Louis,
>>>>
>>>> Louis Rilling wrote:
>>>>> Hi,
>>>>>
>>>>> On 28/04/09 19:23 -0400, Oren Laadan wrote:
>>>>>> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
>>>>>> The logic and image format reworked and simplified, code refactored,
>>>>>> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
>>>>>> (uts and ipc).
>>>>> I should have asked before, but what are the reasons to checkpoint SYSV IPCs
>>>>> in the same file/stream as tasks? Would it be better to checkpoint them
>>>>> independently, like the file system state?
>>>>>
>>>>> In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
>>>>> system state, because SYSV IPCs objects' lifetime do not depend on tasks
>>>>> lifetime, and we can gain more flexibility this way. In particular we envision
>>>>> cases in which two applications share a state in a SYSV SHM (something like a
>>>>> producer-consumer scheme), but do not need to be checkpointed together. In such
>>>>> a case the SYSV SHM itself could even need more high-availability (using
>>>>> active replication) than a checkpoint/restart facility.
>>>>>
>>>> Thanks for the feedback, this is actually an interesting idea.
>>>>
>>>> Indeed in the past I also considered SYSV IPC to be a "global" resource
>>>> that was checkpointed before iterating through the tasks.
>>>>
>>>> However, in the presence of namespaces, the lifetime of an IPC namespace
>>>> does depend on task lifetime - when the last task referring to a
>>>> given namespace exits - that namespace is destroyed. Of course, the
>>>> root namespace is truly global, because init(1) never exits.
>>>>
>>>> What would 'checkpoint them independently' mean in this case ?
>>> I mean that the producer and the consumer could have separate checkpointing
>>> policies (if any), and the IPC SHM as well.
>>>
>>>> In your use-case, can you restart either application without first
>>>> restoring the relevant SYSVIPC ?
>>> Probably not.
>>>
>> Well, it depends. It makes no sense to restart the application without
>> restoring the relevant SHM, but it may for a message queue (this is
>> application specific of course). A message queue is not linked to the
>> process; it can disappear during the life of the application.
> 
> Agreed - the concern regards mainly the SHM case.
> 
>>>> Can you think of other use-cases for such a division ?  Am I right to
>>>> guess that your use case is specific to the distributed (and SSI-)
>>>> nature of your system ?  (Active-replication of SYSV_SHM sounds
>>>> awfully related to DSM :)
>>> The case of active-replication may be specific to DSM-based systems, but the
>>> case of independent policies is already interesting in standalone boxes.
>>>
>>>> While not focusing on such use cases, I want to keep the design flexible
>>>> enough to not exclude them a-priori, and be able to address them later
>>>> on. Indeed, the code is split such that the function to save a given
>>>> IPC namespace does not depend on the task that uses it. Future code
>>>> could easily use the same functionality.
>>>>
>>>> One way to be flexible to support your use case, is by having some
>>>> mechanism in place to select whether a resource (virtually any) is
>>>> to be checkpointed/restored.
>>>>
>>>> For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
>>>> to checkpoint (also) IPC, and not checkpoint IPC in its absence.
>>>>
>>>> So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
>>>> want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
>>>> want only IPC, you'd use CHECKPOINT_SYSVIPC only.
>>>>
>>>> Same thing for restart, only that it will get trickier in the "only IPC"
>>>> case, since you will need to tell which IPC namespace is affected.
>>>>
>>>> Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
>>>> telling the kernel to not c/r its IPC namespace. (Or any other
>>>> resource). Again there would need to be a way to add a restored
>>>> namespace.
>>>>
>>>> Does this address your concerns ?
>>> Yes this sounds flexible enough. Thanks for taking this into account.
>> I see one drawback with this approach if you allow checkpoint of
>> application that is not isolated in a container. In that case, you may
>> want to select which IPC objects to dump to not dump all the IPC objects
>> living in the system. Indeed, this is why we have chosen in Kerrighed to
>> checkpoint IPC objects independently of tasks, since we have no
>> container/namespaces support currently.
> 
> I assume that in this case it will be the application itself that
> will somehow tell the system which specific sysvipc objects (ids) it
> cares about.

Sure, the system cannot know that.

> 
> (I'm not sure how would the system otherwise know what to dump and
> what to leave out).
> 
> I originally proposed the construct of cradvise() syscall to handle
> exactly those cases where the application would like to advise the
> kernel about certain resources. So, extending the previous example,
> a task may call something like:
> 
>    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
> 
> or:
>    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
> 
> Anyway, these are just examples of the concept and what sort of generic
> interface can be used to implement it; don't pick on the details...


Ok, seems good :)

Thanks,

Matthieu



[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

[-- Attachment #2: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                     ` <49FEB01B.208-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-05-04  9:17                       ` Matthieu Fertré
@ 2009-05-04 13:01                       ` Serge E. Hallyn
       [not found]                         ` <20090504130108.GA21521-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2009-05-04 13:01 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > I see one drawback with this approach if you allow checkpoint of
> > application that is not isolated in a container. In that case, you may
> > want to select which IPC objects to dump to not dump all the IPC objects
> > living in the system. Indeed, this is why we have chosen in Kerrighed to
> > checkpoint IPC objects independently of tasks, since we have no
> > container/namespaces support currently.
> 
> I assume that in this case it will be the application itself that
> will somehow tell the system which specific sysvipc objects (ids) it
> cares about.
> 
> (I'm not sure how would the system otherwise know what to dump and
> what to leave out).
> 
> I originally proposed the construct of cradvise() syscall to handle
> exactly those cases where the application would like to advise the
> kernel about certain resources. So, extending the previous example,
> a task may call something like:
> 
>    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
> 
> or:
>    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
> 
> Anyway, these are just examples of the concept and what sort of generic
> interface can be used to implement it; don't pick on the details...
> 
> Oren.

Oren, I have to be honest:  I could of course be wrong, but imo there
is 0 chance of such a bigger-and-uglier-than-ioctl syscall as cradvise
being accepted upstream.  There may be good uses for it, but I think
it's worthwhile thinking of ways around it whenever possible.

In this particular case, wouldn't it be better to do something like:

	1. freeze + checkpoint full application + container (== C1)
	2. continue application, which does a clone(CLONE_COPYIPC) (*1)
	3. application removes all shms except the one to be
	checkpointed
	4. freeze + checkpoint application again ( == C2)
	5. restart application from C1

This requires an ability to clone an ipc namespace while copying its
contents, but that seems more viable upstream, and more generally
useful, than yet another use for cradvise().

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (54 preceding siblings ...)
  2009-04-29  8:18   ` [RFC v14][PATCH 00/54] Kernel based checkpoint/restart Louis Rilling
@ 2009-05-04 19:13   ` Oren Laadan
  55 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-04 19:13 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Dave,

I've addressed the feedback on ckpt-v14 and pushed it to ckpt-v15.
(There is no point in reposting the entire chain).

The main changes are: use kmalloc/kfree in the ckpt_hdr_get/put()
implementation (discarding hbuf/hpos), move the vdso patch earlier in the
series, fixes to s390, and added support for /dev/null, /dev/zero,
/dev/random and /dev/urandom (a trivial patch). Also, the powerpc patches
are still there; however, Nathan is working on fixing them, so they will
eventually be replaced.

Oren.

Oren Laadan wrote:
> Here is the latest and greatest of checkpoint/restart (c/r) patchset.
> The logic and image format reworked and simplified, code refactored,
> support for PPC, s390, sysvipc, shared memory of all sorts, namespaces
> (uts and ipc).
> The userspace tool 'mktree' was extended to handle more complicated
> process tree and correctly account for process relationships and 
> session ID (sid). Should correctly handle threads.
> Hey, it even went through some massive renaming of files and functions...
> 
> Signals and timers are not supported yet, so programs that rely on
> their behavior may fail to operate correctly after a restart (e.g.
> may lose signals pending at time of checkpoint, and so on).
> 
> However, this one can actually be used for simple batch jobs (pipes,
> too), a whole container or just a subtree of tasks. Try it:
> 
> create the freezer cgroup:
>   $ mount -t cgroup -ofreezer freezer /freezer
>   $ mkdir /freezer/0
> 
> run the test, freeze it:  
>   $ test/multitask &
>   [1] 2754
>   $ for i in `pidof multitask`; do echo $i > /freezer/0/tasks; done
>   $ echo FROZEN > /freezer/0/freezer.state
> 
> checkpoint:
>   $ ./ckpt 2754 > ckpt.out
> 
> restart:
>   $ ./mktree < ckpt.out
> 
> voila :)
> 
> To do all this, you'll need:
> 
> The git tree tracking v14, branch 'ckpt-v14' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
> 
> Restarting multiple processes requires 'mktree' userspace tool with
> the matching branch (v14):
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
> 
> 
> Oren.
> 
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                         ` <20090504130108.GA21521-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-04 20:13                           ` Oren Laadan
  2009-05-05  8:20                           ` Louis Rilling
  1 sibling, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-04 20:13 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Alexey Dobriyan, Dave Hansen



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>>> I see one drawback with this approach if you allow checkpoint of
>>> application that is not isolated in a container. In that case, you may
>>> want to select which IPC objects to dump to not dump all the IPC objects
>>> living in the system. Indeed, this is why we have chosen in Kerrighed to
>>> checkpoint IPC objects independently of tasks, since we have no
>>> container/namespaces support currently.
>> I assume that in this case it will be the application itself that
>> will somehow tell the system which specific sysvipc objects (ids) it
>> cares about.
>>
>> (I'm not sure how would the system otherwise know what to dump and
>> what to leave out).
>>
>> I originally proposed the construct of cradvise() syscall to handle
>> exactly those cases where the application would like to advise the
>> kernel about certain resources. So, extending the previous example,
>> a task may call something like:
>>
>>    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
>>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
>>
>> or:
>>    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
>>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
>>
>> Anyway, these are just examples of the concept and what sort of generic
>> interface can be used to implement it; don't pick on the details...
>>
>> Oren.
> 
> Oren, I have to be honest:  I could of course be wrong, but imo there
> is 0 chance of such a bigger-and-uglier-than-ioctl syscall as cradvise
> being accepted upstream.  There may be good uses for it, but I think
> it's worthwhile thinking of ways around it whenever possible.

Clearly there is a tradeoff between the flexibility and granularity
of control that one can have over how checkpoint/restart is done, vs.
the complexity of the interface.

Unlike ioctl(), which is a dumping ground for any _type_ of device, what
I'd expect from a cradvise()-like mechanism is to allow control over any
_class_ of resource in the kernel. One can easily enumerate the existing
ones now in the kernel: mostly open file descriptors, namespaces, sysvipc,
memory descriptors, memory contents, etc. I don't expect cradvise() to
be specific to a particular device - that'll be userspace's responsibility.

IOW, while we need to think carefully about what the interface would be,
I don't expect it to be bigger and uglier than ioctl(), because of its
focused scope, besides the fact that ioctl() is hard to compete with to
begin with...
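
As a rough sketch of what "a small, enumerable set of resource classes"
could look like, the classes can be modelled as bits in a mask, in the
spirit of the CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC example quoted
earlier in the thread. Apart from CHECKPOINT_ALL and CHECKPOINT_SYSVIPC,
the names and all bit values below are invented for illustration only.

```c
#include <stdbool.h>

/* Hypothetical resource classes; names and bit values are illustrative. */
enum ckpt_class {
	CHECKPOINT_FILES   = 1 << 0,	/* open file descriptors */
	CHECKPOINT_NS      = 1 << 1,	/* namespaces */
	CHECKPOINT_SYSVIPC = 1 << 2,	/* SYSV IPC objects */
	CHECKPOINT_MM      = 1 << 3,	/* memory descriptors and contents */
	CHECKPOINT_ALL     = (1 << 4) - 1,
};

/* True when the caller's flags select the given resource class. */
static bool ckpt_class_selected(unsigned int flags, enum ckpt_class c)
{
	return (flags & (unsigned int)c) != 0;
}
```

A caller that wants everything except IPC would pass
CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC, and one that wants only IPC
would pass CHECKPOINT_SYSVIPC alone.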

> 
> In this particular case, wouldn't it be better to do something like:
> 
> 	1. freeze + checkpoint full application + container (== C1)
> 	2. continue application, which does a clone(CLONE_COPYIPC) (*1)
> 	3. application removes all shms except the one to be
> 	checkpointed
> 	4. freeze + checkpoint application again ( == C2)
> 	5. restart application from C1
> 
> This requires an ability to clone an ipc namespace while copying its
> contents, but that seems more viable upstream, and more generally
> useful, than yet another use for cradvise().

Sure, and indeed possibly useful outside the c/r domain.

Note that for performance (speed, memory) reasons it will require
that the clone be done in COW style - not trivial for SHM.

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                         ` <20090504130108.GA21521-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-05-04 20:13                           ` Oren Laadan
@ 2009-05-05  8:20                           ` Louis Rilling
       [not found]                             ` <20090505082057.GA11377-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 107+ messages in thread
From: Louis Rilling @ 2009-05-05  8:20 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Alexey Dobriyan, Dave Hansen


[-- Attachment #1.1: Type: text/plain, Size: 3006 bytes --]

On 04/05/09  8:01 -0500, Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > > I see one drawback with this approach if you allow checkpoint of
> > > application that is not isolated in a container. In that case, you may
> > > want to select which IPC objects to dump to not dump all the IPC objects
> > > living in the system. Indeed, this is why we have chosen in Kerrighed to
> > > checkpoint IPC objects independently of tasks, since we have no
> > > container/namespaces support currently.
> > 
> > I assume that in this case it will be the application itself that
> > will somehow tell the system which specific sysvipc objects (ids) it
> > cares about.
> > 
> > (I'm not sure how would the system otherwise know what to dump and
> > what to leave out).
> > 
> > I originally proposed the construct of cradvise() syscall to handle
> > exactly those cases where the application would like to advise the
> > kernel about certain resources. So, extending the previous example,
> > a task may call something like:
> > 
> >    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
> >    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
> > 
> > or:
> >    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
> >    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
> > 
> > Anyway, these are just examples of the concept and what sort of generic
> > interface can be used to implement it; don't pick on the details...
> > 
> > Oren.
> 
> Oren, I have to be honest:  I could of course be wrong, but imo there
> is 0 chance of such a bigger-and-uglier-than-ioctl syscall as cradvise
> being accepted upstream.  There may be good uses for it, but I think
> it's worthwhile thinking of ways around it whenever possible.
> 
> In this particular case, wouldn't it be better to do something like:
> 
> 	1. freeze + checkpoint full application + container (== C1)
> 	2. continue application, which does a clone(CLONE_COPYIPC) (*1)
> 	3. application removes all shms except the one to be
> 	checkpointed
> 	4. freeze + checkpoint application again ( == C2)
> 	5. restart application from C1
> 

Besides the COW issues mentioned by Oren in his reply, this approach does not
seem to provide the required flexibility. The point is to avoid checkpointing
some IPC objects together with the application, but we still need those IPC
objects, and the application still uses them. Moreover, on restart the
administrator should be able to first install the required IPC objects, e.g.
re-create them from scratch, or restore them from another checkpoint, and second
restart the application, linking it to the previously
re-created/restored/whatever SHMs.

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                             ` <20090505082057.GA11377-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
@ 2009-05-05 13:49                               ` Serge E. Hallyn
       [not found]                                 ` <20090505134920.GB10136-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2009-05-05 13:49 UTC (permalink / raw)
  To: Oren Laadan,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Alexey Dobriyan

Quoting Louis Rilling (Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> On 04/05/09  8:01 -0500, Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > > > I see one drawback with this approach if you allow checkpoint of
> > > > application that is not isolated in a container. In that case, you may
> > > > want to select which IPC objects to dump to not dump all the IPC objects
> > > > living in the system. Indeed, this is why we have chosen in Kerrighed to
> > > > checkpoint IPC objects independently of tasks, since we have no
> > > > container/namespaces support currently.
> > > 
> > > I assume that in this case it will be the application itself that
> > > will somehow tell the system which specific sysvipc objects (ids) it
> > > cares about.
> > > 
> > > (I'm not sure how would the system otherwise know what to dump and
> > > what to leave out).
> > > 
> > > I originally proposed the construct of cradvise() syscall to handle
> > > exactly those cases where the application would like to advise the
> > > kernel about certain resources. So, extending the previous example,
> > > a task may call something like:
> > > 
> > >    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
> > >    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
> > > 
> > > or:
> > >    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
> > >    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
> > > 
> > > Anyway, these are just examples of the concept and what sort of generic
> > > interface can be used to implement it; don't pick on the details...
> > > 
> > > Oren.
> > 
> > Oren, I have to be honest:  I could of course be wrong, but imo there
> > is 0 chance of such a bigger-and-uglier-than-ioctl syscall as cradvise
> > being accepted upstream.  There may be good uses for it, but I think
> > it's worthwhile thinking of ways around it whenever possible.
> > 
> > In this particular case, wouldn't it be better to do something like:
> > 
> > 	1. freeze + checkpoint full application + container (== C1)
> > 	2. continue application, which does a clone(CLONE_COPYIPC) (*1)
> > 	3. application removes all shms except the one to be
> > 	checkpointed
> > 	4. freeze + checkpoint application again ( == C2)
> > 	5. restart application from C1
> > 
> 
> Besides COW issues mentioned by Oren in his reply, this approach does not
> seem to provide the required flexibility. The point is to avoid checkpointing
> some IPC objects together with the application,

... avoided at step 3 ...

> but we still need those IPC
> objects, and the application still uses them.

... step 5 ...

> Moreover, on restart the
> administrator should be able to first install the required IPC objects, e.g.
> re-create them from scratch, or restore them from another checkpoint, and second
> restart the application, linking it to the previously
> re-created/restored/whatever SHMs.

Of course he can do that.

Anyway I'm not setting off to implement the clone(CLONE_COPYIPC)
functionality, and Oren might be right that cradvise would
be deemed different from ioctl.  I just thought I'd give a
warning, and (being a productive type :) give an alternative...

By the way, another alternative to all of the cr_advise()
stuff is to have userspace programs carve up your checkpoint
images.  It's been talked about before, but I believe Nathan
in particular is worried about what this says about kernel-user
API.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart
       [not found]                                 ` <20090505134920.GB10136-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-05 14:26                                   ` Louis Rilling
  0 siblings, 0 replies; 107+ messages in thread
From: Louis Rilling @ 2009-05-05 14:26 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Alexey Dobriyan, Dave Hansen


[-- Attachment #1.1: Type: text/plain, Size: 5039 bytes --]

On 05/05/09  8:49 -0500, Serge E. Hallyn wrote:
> Quoting Louis Rilling (Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > On 04/05/09  8:01 -0500, Serge E. Hallyn wrote:
> > > Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > > > > I see one drawback with this approach if you allow checkpoint of
> > > > > application that is not isolated in a container. In that case, you may
> > > > > want to select which IPC objects to dump to not dump all the IPC objects
> > > > > living in the system. Indeed, this is why we have chosen in Kerrighed to
> > > > > checkpoint IPC objects independently of tasks, since we have no
> > > > > container/namespaces support currently.
> > > > 
> > > > I assume that in this case it will be the application itself that
> > > > will somehow tell the system which specific sysvipc objects (ids) it
> > > > cares about.
> > > > 
> > > > (I'm not sure how would the system otherwise know what to dump and
> > > > what to leave out).
> > > > 
> > > > I originally proposed the construct of cradvise() syscall to handle
> > > > exactly those cases where the application would like to advise the
> > > > kernel about certain resources. So, extending the previous example,
> > > > a task may call something like:
> > > > 
> > > >    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
> > > >    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
> > > > 
> > > > or:
> > > >    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
> > > >    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
> > > > 
> > > > Anyway, these are just examples of the concept and what sort of generic
> > > > interface can be used to implement it; don't pick on the details...
> > > > 
> > > > Oren.
> > > 
> > > Oren, I have to be honest:  I could of course be wrong, but imo there
> > > is 0 chance of such a bigger-and-uglier-than-ioctl syscall as cradvise
> > > being accepted upstream.  There may be good uses for it, but I think
> > > it's worthwhile thinking of ways around it whenever possible.
> > > 
> > > In this particular case, wouldn't it be better to do something like:
> > > 
> > > 	1. freeze + checkpoint full application + container (== C1)
> > > 	2. continue application, which does a clone(CLONE_COPYIPC) (*1)
> > > 	3. application removes all shms except the one to be
> > > 	checkpointed
> > > 	4. freeze + checkpoint application again ( == C2)
> > > 	5. restart application from C1
> > > 
> > 
> > Besides COW issues mentioned by Oren in his reply, this approach does not
> > seem to provide the required flexibility. The point is to avoid checkpointing
> > some IPC objects together with the application,
> 
> ... avoided at step 3 ...
> 
> > but we still need those IPC
> > objects, and the application still uses them.
> 
> ... step 5 ...

But this involves changing the application. What I describe requires that the
application is not changed (that is, checkpoint/restart is transparent). The
actual policy is handled by some helper configured by the sysadmin, for instance.

> 
> > Moreover, on restart the
> > administrator should be able to first install the required IPC objects, e.g.
> > re-create them from scratch, or restore them from another checkpoint, and second
> > restart the application, linking it to the previously
> > re-created/restored/whatever SHMs.
> 
> Of course he can do that.
> 
> Anyway I'm not setting off to implement the clone(CLONE_COPYIPC)
> functionality, and Oren might be right that cradvise would
> be deemed different from ioctl.  I just thought I'd give a
> warning, and (being a productive type :) give an alternative...

Sure, I never doubted the constructiveness of your reply :) I just pointed
out that this alternative may not be acceptable.

> 
> By the way, another alternative to all of the cr_advise()
> stuff is to have userspace programs carve up your checkpoint
> images.  It's been talked about before, but I believe Nathan
> in particular is worried about what this says about kernel-user
> API.

With large IPC SHMs, as a user I wouldn't like paying the price of
checkpointing them if I do not need it.

If an approach à la cr_advise() still looks too close to ioctl(), I would argue
in favor of implementing as many syscalls as needed to "cleanly" obtain the same
flexibility.

To me cr_advise() looks closer to fcntl() or prctl() than to ioctl(). There are
not so many types of objects for which optional checkpoint should be considered
(at least not as many as device types that could be invented), and the set of
advice values for a given object will probably be limited to {CHECKPOINT,
DO_NOT_CHECKPOINT} for checkpoint, and {RESTART, REPLACE_WITH} for restart.
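
To illustrate how small that advice space would be, here is a sketch
that encodes the two sets as enums and checks that a value is
meaningful for a given phase. The identifiers (cr_phase, cr_advice,
cr_advice_valid) are hypothetical and are not a proposed ABI.

```c
#include <stdbool.h>

/* The phase in which the advice is consulted. */
enum cr_phase {
	CR_PHASE_CHECKPOINT,
	CR_PHASE_RESTART,
};

/* The four advice values named above. */
enum cr_advice {
	CR_CHECKPOINT,
	CR_DO_NOT_CHECKPOINT,
	CR_RESTART,
	CR_REPLACE_WITH,
};

/* Accept only the advice values that make sense for the phase. */
static bool cr_advice_valid(enum cr_phase phase, enum cr_advice advice)
{
	switch (phase) {
	case CR_PHASE_CHECKPOINT:
		return advice == CR_CHECKPOINT ||
		       advice == CR_DO_NOT_CHECKPOINT;
	case CR_PHASE_RESTART:
		return advice == CR_RESTART ||
		       advice == CR_REPLACE_WITH;
	}
	return false;
}
```

An interface of this shape rejects, for example, CR_REPLACE_WITH at
checkpoint time, which keeps the set of valid (phase, advice) pairs
fixed at four regardless of how many object types are covered.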

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found]     ` <1240961064-13991-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-04-29  0:58       ` Serge E. Hallyn
  2009-04-29 17:12       ` Serge E. Hallyn
@ 2009-05-06 20:39       ` Sukadev Bhattiprolu
       [not found]         ` <20090506203955.GA6003-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2 siblings, 1 reply; 107+ messages in thread
From: Sukadev Bhattiprolu @ 2009-05-06 20:39 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Oren Laadan [orenl@cs.columbia.edu] wrote:
| +
| +/* 'ckpt_debug_level' controls the verbosity level of c/r code */
| +#ifdef CONFIG_CHECKPOINT_DEBUG
| +
| +/* FIX: allow to change during runtime */
| +unsigned int __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
| +
| +static __init int ckpt_debug_setup(char *s)
| +{
| +	ckpt_debug_level = strict_strtoul(s, NULL, 0);
| +	return 0;
| +}

Nit: Interchange NULL and 0 to suppress this ?

checkpoint/sys.c:384: warning: passing argument 2 of ‘strict_strtoul’
makes integer from pointer without a cast


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart
       [not found]         ` <20090506203955.GA6003-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-05-06 20:57           ` Oren Laadan
  0 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-06 20:57 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen




Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@cs.columbia.edu] wrote:
> | +
> | +/* 'ckpt_debug_level' controls the verbosity level of c/r code */
> | +#ifdef CONFIG_CHECKPOINT_DEBUG
> | +
> | +/* FIX: allow to change during runtime */
> | +unsigned int __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
> | +
> | +static __init int ckpt_debug_setup(char *s)
> | +{
> | +	ckpt_debug_level = strict_strtoul(s, NULL, 0);
> | +	return 0;
> | +}
> 
> Nit: Interchange NULL and 0 to suppress this ?
> 
> checkpoint/sys.c:384: warning: passing argument 2 of ‘strict_strtoul’
> makes integer from pointer without a cast

heh .. this was wrong to begin with. fixed in ckpt-v15.
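
For context, the kernel helper of that era is declared as
int strict_strtoul(const char *cp, unsigned int base, unsigned long *res):
it returns an error code and stores the parsed value through the third
argument, so swapping NULL and 0 would silence the warning without fixing
the call. Below is a minimal userspace sketch of the intended usage.
strict_strtoul_sketch(), debug_level and debug_setup_sketch() are
illustrative names only, and this is not the code that went into ckpt-v15.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel helper: 0 on success, -EINVAL otherwise. */
static int strict_strtoul_sketch(const char *cp, unsigned int base,
				 unsigned long *res)
{
	char *end;
	unsigned long val;

	assert(res != NULL);	/* caller must supply a result slot */
	if (!cp || *cp == '\0')
		return -EINVAL;

	errno = 0;
	val = strtoul(cp, &end, (int)base);
	if (errno != 0 || end == cp)
		return -EINVAL;
	/* tolerate a single trailing newline, reject any other trailing text */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*res = val;
	return 0;
}

static unsigned int debug_level = 1;	/* hypothetical default level */

/* Parse the option string; leave debug_level untouched on failure. */
static int debug_setup_sketch(const char *s)
{
	unsigned long val;
	int ret;

	ret = strict_strtoul_sketch(s, 0, &val);
	if (ret < 0)
		return ret;
	if (val > UINT_MAX)
		return -EINVAL;
	debug_level = (unsigned int)val;
	return 0;
}
```

The point of the sketch is that the return value carries only the status,
while the parsed number comes back through the pointer argument.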

Oren.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]     ` <1240961064-13991-54-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-05-01 17:26       ` Dave Hansen
@ 2009-05-07  3:50       ` Sukadev Bhattiprolu
       [not found]         ` <20090507035026.GB6003-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  1 sibling, 1 reply; 107+ messages in thread
From: Sukadev Bhattiprolu @ 2009-05-07  3:50 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
| checkpoint, return an error code if the actual objects' counts are
| higher, indicating leaks (references to the objects from a task not
| being checkpointed).  Of course, by this time most of the checkpoint
| image has been written out to disk, so this is purely advisory.  But
| then, it's probably naive to argue that anything more than an advisory
| 'this went wrong' error code is useful.
| 
| The comparison of the objhash user counts to object refcounts as a
| basis for checking for leaks comes from Alexey's OpenVZ-based c/r
| patchset.
| 
| Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| ---
|  checkpoint/checkpoint.c    |    8 +++
|  checkpoint/memory.c        |    2 +
|  checkpoint/objhash.c       |  108 +++++++++++++++++++++++++++++++++++++++----
|  include/linux/checkpoint.h |    2 +
|  4 files changed, 110 insertions(+), 10 deletions(-)
| 
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 4319976..32a0a8e 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -498,6 +498,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
|  	if (ret < 0)
|  		goto out;
| 
| +	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
| +		/* verify that all objects are contained (no leaks) */
| +		if (!ckpt_obj_contained(ctx)) {
| +			ret = -EBUSY;
| +			goto out;
| +		}
| +	}
| +
|  	/* on success, return (unique) checkpoint identifier */
|  	ctx->crid = atomic_inc_return(&ctx_count);
|  	ret = ctx->crid;
| diff --git a/checkpoint/memory.c b/checkpoint/memory.c
| index 7637c1e..5ae2b41 100644
| --- a/checkpoint/memory.c
| +++ b/checkpoint/memory.c
| @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
|  			ret = exe_objref;
|  			goto out;
|  		}
| +		/* account for all references through vma/exe_file */
| +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);

Do we really need to add num_exe_file_vmas here ?

A quick look at all callers for added_exe_file_vma() seems to show that
those callers also do a get_file().

Anyway, when I try to C/R a simple process tree, I get -EBUSY because
ckpt_obj_contained() finds a ref-count mismatch for the executable file.

I suspect the above increment of 'num_exe_file_vmas' is causing 
'obj->users' to exceed the value returned by the file's ->ref_users().

Or, when incrementing by 'num_exe_file_vmas', should we also call
obj_file_grab() ('num_exe_file_vmas' times) to keep the ref counts
in sync ?

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]         ` <20090507035026.GB6003-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-05-07  4:11           ` Oren Laadan
       [not found]             ` <4A025F7D.3050403-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 107+ messages in thread
From: Oren Laadan @ 2009-05-07  4:11 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
> | checkpoint, return an error code if the actual objects' counts are
> | higher, indicating leaks (references to the objects from a task not
> | being checkpointed).  Of course, by this time most of the checkpoint
> | image has been written out to disk, so this is purely advisory.  But
> | then, it's probably naive to argue that anything more than an advisory
> | 'this went wrong' error code is useful.
> | 
> | The comparison of the objhash user counts to object refcounts as a
> | basis for checking for leaks comes from Alexey's OpenVZ-based c/r
> | patchset.
> | 
> | Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> | ---
> |  checkpoint/checkpoint.c    |    8 +++
> |  checkpoint/memory.c        |    2 +
> |  checkpoint/objhash.c       |  108 +++++++++++++++++++++++++++++++++++++++----
> |  include/linux/checkpoint.h |    2 +
> |  4 files changed, 110 insertions(+), 10 deletions(-)
> | 
> | diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> | index 4319976..32a0a8e 100644
> | --- a/checkpoint/checkpoint.c
> | +++ b/checkpoint/checkpoint.c
> | @@ -498,6 +498,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
> |  	if (ret < 0)
> |  		goto out;
> | 
> | +	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
> | +		/* verify that all objects are contained (no leaks) */
> | +		if (!ckpt_obj_contained(ctx)) {
> | +			ret = -EBUSY;
> | +			goto out;
> | +		}
> | +	}
> | +
> |  	/* on success, return (unique) checkpoint identifier */
> |  	ctx->crid = atomic_inc_return(&ctx_count);
> |  	ret = ctx->crid;
> | diff --git a/checkpoint/memory.c b/checkpoint/memory.c
> | index 7637c1e..5ae2b41 100644
> | --- a/checkpoint/memory.c
> | +++ b/checkpoint/memory.c
> | @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
> |  			ret = exe_objref;
> |  			goto out;
> |  		}
> | +		/* account for all references through vma/exe_file */
> | +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
> 
> Do we really need to add num_exe_file_vmas here ?
> 
> A quick look at all callers for added_exe_file_vma() seems to show that
> those callers also do a get_file().

Each vma whose file is the same as mm->exe_file causes the refcount
of that file to increase by 2: once by vma->vm_file, and once via
added_exe_file_vma(). The c/r code calls ckpt_obj_checkpoint() only
once, thus only one obj_file_grab() for that file. The code above
accounts for the missing count.

> 
> Anyway, when I try to C/R a simple process tree, I get -EBUSY because
> ckpt_obj_contained() finds a ref-count mismatch for the executable file.

Are you certain the culprit is this count and not, possibly, some
other "leak" ?  If so, do you know what's the count difference ?
Posting the test program and a description of what you did would
be useful...

> 
> I suspect the above increment of 'num_exe_file_vmas' is causing 
> 'obj->users' to exceed the value returned by the file's ->ref_users().
> 
> Or, when incrementing by 'num_exe_file_vmas', should we also call
> obj_file_grab() ('num_exe_file_vmas' times) to keep the ref counts
> in sync ?
> 

The objhash only keeps a single refcount on each object that it holds
(already accounted for in the comparison). It also keeps a users count
that records the number of times the object was observed while doing the
checkpoint. If there are no leaks, then "users + 1 = real-ref-count".
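
The invariant stated above can be sketched as a small check. This is only
an illustration under assumptions: struct obj_sketch and both helper names
are hypothetical, and the real ckpt_obj_contained() in the patch may differ
in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one objhash entry; not the patch's struct. */
struct obj_sketch {
	const char *name;
	long users;	/* references seen while walking the checkpointed tasks */
	long refcount;	/* the object's real reference count */
};

/* No leak if every real reference is explained: users + 1 == refcount. */
static bool obj_contained_sketch(const struct obj_sketch *obj)
{
	assert(obj != NULL);
	return obj->users + 1 == obj->refcount;
}

/* A whole-container checkpoint passes only if every object is contained. */
static bool all_contained_sketch(const struct obj_sketch *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!obj_contained_sketch(&objs[i]))
			return false;
	return true;
}
```

The "+ 1" is the single reference that the objhash itself holds on the
object.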

Oren.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]             ` <4A025F7D.3050403-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-07  6:13               ` Sukadev Bhattiprolu
       [not found]                 ` <20090507061321.GA13725-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2009-05-08  4:56               ` Sukadev Bhattiprolu
  1 sibling, 1 reply; 107+ messages in thread
From: Sukadev Bhattiprolu @ 2009-05-07  6:13 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| 
| 
| Sukadev Bhattiprolu wrote:
| > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| > | Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
| > | checkpoint, return an error code if the actual objects' counts are
| > | higher, indicating leaks (references to the objects from a task not
| > | being checkpointed).  Of course, by this time most of the checkpoint
| > | image has been written out to disk, so this is purely advisory.  But
| > | then, it's probably naive to argue that anything more than an advisory
| > | 'this went wrong' error code is useful.
| > | 
| > | The comparison of the objhash user counts to object refcounts as a
| > | basis for checking for leaks comes from Alexey's OpenVZ-based c/r
| > | patchset.
| > | 
| > | Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| > | ---
| > |  checkpoint/checkpoint.c    |    8 +++
| > |  checkpoint/memory.c        |    2 +
| > |  checkpoint/objhash.c       |  108 +++++++++++++++++++++++++++++++++++++++----
| > |  include/linux/checkpoint.h |    2 +
| > |  4 files changed, 110 insertions(+), 10 deletions(-)
| > | 
| > | diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| > | index 4319976..32a0a8e 100644
| > | --- a/checkpoint/checkpoint.c
| > | +++ b/checkpoint/checkpoint.c
| > | @@ -498,6 +498,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
| > |  	if (ret < 0)
| > |  		goto out;
| > | 
| > | +	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
| > | +		/* verify that all objects are contained (no leaks) */
| > | +		if (!ckpt_obj_contained(ctx)) {
| > | +			ret = -EBUSY;
| > | +			goto out;
| > | +		}
| > | +	}
| > | +
| > |  	/* on success, return (unique) checkpoint identifier */
| > |  	ctx->crid = atomic_inc_return(&ctx_count);
| > |  	ret = ctx->crid;
| > | diff --git a/checkpoint/memory.c b/checkpoint/memory.c
| > | index 7637c1e..5ae2b41 100644
| > | --- a/checkpoint/memory.c
| > | +++ b/checkpoint/memory.c
| > | @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
| > |  			ret = exe_objref;
| > |  			goto out;
| > |  		}
| > | +		/* account for all references through vma/exe_file */
| > | +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
| > 
| > Do we really need to add num_exe_file_vmas here ?
| > 
| > A quick look at all callers for added_exe_file_vma() seems to show that
| > those callers also do a get_file().
| 
| Each vma whose file is the same as mm->exe_file causes the refcount
| of that file to increase by 2: once by vma->vm_file, and once via
| added_exe_file_vma(). The c/r code calls ckpt_obj_checkpoint() only
| once, thus once one obj_file_grab() for that file. The code above
| accounts for the missing count.
| 
| > 
| > Anyway, when I try to C/R a simple process tree, I get -EBUSY because
| > ckpt_obj_contained() finds a ref-count mismatch for the executable file.
| 
| Are you certain the culprit is this count and not, possibly, some
| other "leak" ?

Well, I will look some more tomorrow.

| If so, do you know what's the count difference ?

Yes, dmesg had:

	c/r: FILE users 10 != count 16

The process tree had 3 processes (parent, child, grandchild) all executing
the same file (i.e. none of them called exec()). The difference of 6 (or 2
per process) made me suspect num_exe_file_vmas.

| Posting the test program and a description of what you did would
| be useful...

Attached (ptree2.c).  I ran it as:

	$ ./ptree2 -n 1 -d 2

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]                 ` <20090507061321.GA13725-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-05-07  6:24                   ` Sukadev Bhattiprolu
  2009-05-07 21:45                   ` Matt Helsley
  1 sibling, 0 replies; 107+ messages in thread
From: Sukadev Bhattiprolu @ 2009-05-07  6:24 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

[-- Attachment #1: Type: text/plain, Size: 94 bytes --]

| 
| Attached (ptree2.c).  I ran it as:
| 
| 	$ ./ptree2 -n 1 -d 2

Now, really attached :-)


[-- Attachment #2: ptree2.c --]
[-- Type: text/x-csrc, Size: 4371 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <errno.h>
#include <string.h>

int max_depth = 3;
int num_children = 3;

#define CKPT_READY		"checkpoint-ready"
#define CKPT_DONE		"checkpoint-done"
#define TEST_DONE		"test-done"
#define LOG_FILE		"log-ptree2"

#define SYS_GETGPID		1

#ifdef SYS_GETGPID
static inline int sys_getgpid()
{
#define	__NR_getgpid	335
        return syscall(__NR_getgpid);
}
#endif

FILE *logfp;

void do_exit(int status)
{
	if (logfp) {
		fflush(logfp);
		fclose(logfp);
	}
	_Exit(status);
}

int get_my_global_pid()
{
}

int test_done()
{
	int rc;

	rc = access(TEST_DONE, F_OK);
	if (rc == 0)
		return 1;
	else if (errno == ENOENT)
		return 0;

	fprintf(logfp, "access(%s) failed, %s\n", TEST_DONE, strerror(errno));
	do_exit(1);
}

int checkpoint_done()
{
	int rc;

	rc = access(CKPT_DONE, F_OK);
	if (rc == 0)
		return 1;
	else if (errno == ENOENT)
		return 0;

	fprintf(logfp, "access(%s) failed, %s\n", CKPT_DONE, strerror(errno));
	do_exit(1);
}

void checkpoint_ready()
{
	int fd;

	fd = creat(CKPT_READY, 0666);
	if (fd < 0) {
		fprintf(logfp, "creat(%s) failed, %s\n", CKPT_READY,
				strerror(errno));
		do_exit(1);
	}
	close(fd);
}

void print_exit_status(int pid, int status)
{
	fprintf(logfp, "Pid %d unexpected exit - ", pid);
	if (WIFEXITED(status)) {
		fprintf(logfp, "exit status %d\n", WEXITSTATUS(status));
	} else if (WIFSIGNALED(status)) {
		fprintf(logfp, "got signal %d\n", WTERMSIG(status));
	} else {
		fprintf(logfp, "stopped/continued ?\n");
	}
}

void do_wait()
{
	int rc;
	int n;
	int status;

	n = 0;
	while(1) {
		rc = waitpid(-1, &status, 0);
		if (rc < 0)
			break;

		n++;
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			print_exit_status(rc, status);	
	}

	if (errno != ECHILD) {
		fprintf(logfp, "waitpid(%d) failed, error %s\n",
					rc, strerror(errno));
		do_exit(1);
	}

	if (getpid() == 1 && n != num_children * max_depth) {
		fprintf(logfp, "Only %d of %d children exited ?\n",
			n, num_children * max_depth);
		do_exit(1);
	}

	do_exit(0);
}

static void do_child(int depth, char *suffix);

void create_children(int depth, char *parent_suffix)
{
	int i;
	int child_pid;
	char suffix[1024];

	for (i = 0; i < num_children; i++) {
		sprintf(suffix, "%s-%d", parent_suffix, i);

		child_pid = fork();
		if (child_pid == 0)
			do_child(depth, suffix);
		else if (child_pid < 0) {
			fprintf(logfp, "fork() failed, depth %d, "
				"child %d, error %s\n", depth, i,
				strerror(errno));
			do_exit(1);
		}
	}
}

static void do_child(int depth, char *suffix)
{
	int i;
	FILE *cfp;
	char cfile[256];
	char *mode = "w";

	/*
	 * Recursively calls do_child() and both parent and child
	 * execute the code below
	 */
	if (depth < max_depth)
		create_children(depth+1, suffix);

	sprintf(cfile, "%s%s", LOG_FILE, suffix);

	i = 0;
	while (!test_done()) {
		/* truncate the first time, append after that */
		cfp = fopen(cfile, mode);
		mode = "a";
		if (!cfp) {
			fprintf(logfp, "fopen(%s) failed, error %s\n", cfile,
					strerror(errno));
			do_exit(1);
		}
		fprintf(cfp, "gpid %d, pid %d: i %d\n", sys_getgpid(),
				getpid(), i++);
		fflush(cfp);
		sleep(1);
		fprintf(cfp, "gpid %d: woke up from sleep(1)\n", sys_getgpid());
		fflush(cfp);
		fclose(cfp);
	}

	/* Wait for any children that pre-deceased us */
	do_wait();

	do_exit(0);
}

static void usage(char *argv[])
{
	printf("%s [-h] [-d max-depth] [-n num-children]\n", argv[0]);
	printf("\t <max-depth> max depth of process tree, default 3\n");
	printf("\t <num-children> # of children per process, default 3\n");
	do_exit(1);
}

int main(int argc, char *argv[])
{
	int c;
	int i;
	int status;

	if (test_done()) {
		printf("Remove %s before running test\n", TEST_DONE);
		do_exit(1);
	}

	while ((c = getopt(argc, argv, "hd:n:")) != EOF) {
		switch (c) {
		case 'd': max_depth = atoi(optarg); break;
		case 'n': num_children = atoi(optarg); break;
		case 'h':
		default:
			usage(argv);
		}
	};


	logfp = fopen(LOG_FILE, "w");
	if (!logfp) {
		fprintf(stderr, "fopen(%s) failed, %s\n", LOG_FILE,
					strerror(errno));
		fflush(stderr);
		do_exit(1);
	}
	close(0);close(1);close(2);

	create_children(1, "");

	/*
 	 * Now that we closed the special files and created process tree
	 * tell any wrapper scripts, we are ready for checkpoint
	 */
	checkpoint_ready();

#if 0
	while(!checkpoint_done())
		sleep(1);
#endif

	do_wait();
}


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]                 ` <20090507061321.GA13725-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2009-05-07  6:24                   ` Sukadev Bhattiprolu
@ 2009-05-07 21:45                   ` Matt Helsley
       [not found]                     ` <20090507214501.GA29671-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 107+ messages in thread
From: Matt Helsley @ 2009-05-07 21:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

On Wed, May 06, 2009 at 11:13:21PM -0700, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:

<snip>

> | > | diff --git a/checkpoint/memory.c b/checkpoint/memory.c
> | > | index 7637c1e..5ae2b41 100644
> | > | --- a/checkpoint/memory.c
> | > | +++ b/checkpoint/memory.c
> | > | @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
> | > |  			ret = exe_objref;
> | > |  			goto out;
> | > |  		}
> | > | +		/* account for all references through vma/exe_file */
> | > | +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
> | > 
> | > Do we really need to add num_exe_file_vmas here ?
> | > 
> | > A quick look at all callers for added_exe_file_vma() seems to show that
> | > those callers also do a get_file().
> | 
> | Each vma whose file is the same as mm->exe_file causes the refcount
> | of that file to increase by 2: once by vma->vm_file, and once via
> | added_exe_file_vma(). The c/r code calls ckpt_obj_checkpoint() only
> | once, thus once one obj_file_grab() for that file. The code above
> | accounts for the missing count.

Perhaps I'm misreading Oren's explanation, but the refcount on the file
should not be:

	2*#vmas(vm_file==mm->exe_file) + #fds(filp==mm->exe_file)

It should be:

	#vmas(vm_file==mm->exe_file) + #fds(filp==mm->exe_file) + 1(for mm->exe_file).

because added_exe_file_vma() increments num_exe_file_vmas but does not
change the file reference count. So incrementing the obj count while
walking the vmas and fds should bring the count 1 short of matching.
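
As a worked instance of the formula above, here is a small sketch. The
function names are hypothetical, and the inputs in the usage note (three
processes, two executable vmas each, no descriptors open on the
executable, one extra reference held by the objhash) are assumptions taken
from the test described in this thread, not measured values.

```c
#include <assert.h>

/* Per-mm references to the executable file, following the formula above. */
static long exe_file_refs_per_mm(long exe_vmas, long exe_fds)
{
	assert(exe_vmas >= 0 && exe_fds >= 0);
	return exe_vmas + exe_fds + 1;	/* +1 for mm->exe_file itself */
}

/* Total over nr_mms identical mms, plus any extra holders (e.g. the objhash). */
static long exe_file_refs_total(long nr_mms, long exe_vmas, long exe_fds,
				long extra_refs)
{
	assert(nr_mms >= 0 && extra_refs >= 0);
	return nr_mms * exe_file_refs_per_mm(exe_vmas, exe_fds) + extra_refs;
}
```

With those assumed inputs, exe_file_refs_per_mm(2, 0) gives 3 per process
and exe_file_refs_total(3, 2, 0, 1) gives 3 * 3 + 1 = 10.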

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]             ` <4A025F7D.3050403-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-05-07  6:13               ` [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint Sukadev Bhattiprolu
@ 2009-05-08  4:56               ` Sukadev Bhattiprolu
       [not found]                 ` <20090508045622.GA31731-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  1 sibling, 1 reply; 107+ messages in thread
From: Sukadev Bhattiprolu @ 2009-05-08  4:56 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| 
| 
| Sukadev Bhattiprolu wrote:
| > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| > | Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
| > | checkpoint, return an error code if the actual objects' counts are
| > | higher, indicating leaks (references to the objects from a task not
| > | being checkpointed).  Of course, by this time most of the checkpoint
| > | image has been written out to disk, so this is purely advisory.  But
| > | then, it's probably naive to argue that anything more than an advisory
| > | 'this went wrong' error code is useful.
| > | 
| > | The comparison of the objhash user counts to object refcounts as a
| > | basis for checking for leaks comes from Alexey's OpenVZ-based c/r
| > | patchset.
| > | 
| > | Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
| > | ---
| > |  checkpoint/checkpoint.c    |    8 +++
| > |  checkpoint/memory.c        |    2 +
| > |  checkpoint/objhash.c       |  108 +++++++++++++++++++++++++++++++++++++++----
| > |  include/linux/checkpoint.h |    2 +
| > |  4 files changed, 110 insertions(+), 10 deletions(-)
| > | 
| > | diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| > | index 4319976..32a0a8e 100644
| > | --- a/checkpoint/checkpoint.c
| > | +++ b/checkpoint/checkpoint.c
| > | @@ -498,6 +498,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
| > |  	if (ret < 0)
| > |  		goto out;
| > | 
| > | +	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
| > | +		/* verify that all objects are contained (no leaks) */
| > | +		if (!ckpt_obj_contained(ctx)) {
| > | +			ret = -EBUSY;
| > | +			goto out;
| > | +		}
| > | +	}
| > | +
| > |  	/* on success, return (unique) checkpoint identifier */
| > |  	ctx->crid = atomic_inc_return(&ctx_count);
| > |  	ret = ctx->crid;
| > | diff --git a/checkpoint/memory.c b/checkpoint/memory.c
| > | index 7637c1e..5ae2b41 100644
| > | --- a/checkpoint/memory.c
| > | +++ b/checkpoint/memory.c
| > | @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
| > |  			ret = exe_objref;
| > |  			goto out;
| > |  		}
| > | +		/* account for all references through vma/exe_file */
| > | +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
| > 
| > Do we really need to add num_exe_file_vmas here ?
| > 
| > A quick look at all callers for added_exe_file_vma() seems to show that
| > those callers also do a get_file().
| 
| Each vma whose file is the same as mm->exe_file causes the refcount
| of that file to increase by 2: once by vma->vm_file, and once via
| added_exe_file_vma(). The c/r code calls ckpt_obj_checkpoint() only
| once, thus once one obj_file_grab() for that file. The code above
| accounts for the missing count.

If the executable is shared between a parent and child (as in fork()/dup_mm)
do we still need to account for the 'added_exe_file_vma()' in the child
process ?

i.e. I can trace a call to added_exe_file_vma() when loading/mmapping a binary.
But I can't trace a call to added_exe_file_vma() during fork()/dup_mm().

Here is how I can account for the 16 in the obj->users :-)

	Parent:
		do_checkpoint_mm: +2	= 2	(first time/obj_new())
		num_exe_vmas: +2	= 4

		filemap_checkpoint: +1	= 5	(text section)
		filemap_checkpoint: +1	= 6	(data section)

	Child:
		do_checkpoint_mm: +1	= 7
		num_exe_file_vmas: +2	= 9

		filemap_checkpoint: +1	= 10	(text section)
		filemap_checkpoint: +1	= 11	(data section)

	Grand child:

		do_checkpoint_mm: +1	= 12
		num_exe_file_vmas: +2	= 14

		filemap_checkpoint: +1	= 15	(text section)
		filemap_checkpoint: +1	= 16	(data section)

Even if we were to drop the num_exe_file_vmas for the child and
grand-child, we would be off by 2 :-(

As of now, I can account for 9 of the 10 found in file->f_count.


	Parent:
		load_a.out/do_mmap: +2	= 2	(text)
		load_aout/do_mmap(): +2	= 4	(data)

	Child:
		dup_mm()/dup_mmap(): +1	= 5	(text)
		dup_mm()/dup_mmap(): +1	= 6	(data)

	Grand Child:
		dup_mm()/dup_mmap(): +1	= 7	(text)
		dup_mm()/dup_mmap(): +1	= 8	(data)

	Checkpoint/Objhash:

		obj_new/obj_file_grab: +1 = 9

Another question is regarding the obj->users = 2 in obj_new():

	- one of these references is for the get_file() in obj_file_grab()
	  called near the end of obj_new(), right ?

	- where can I find the other get_file() ?

(again with reference to the file the three processes are executing, ptree2)

Thanks,

Sukadev

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]                 ` <20090508045622.GA31731-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-05-08  8:12                   ` Matt Helsley
  0 siblings, 0 replies; 107+ messages in thread
From: Matt Helsley @ 2009-05-08  8:12 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

On Thu, May 07, 2009 at 09:56:22PM -0700, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | 
> | 
> | Sukadev Bhattiprolu wrote:
> | > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | > | Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
> | > | checkpoint, return an error code if the actual objects' counts are
> | > | higher, indicating leaks (references to the objects from a task not
> | > | being checkpointed).  Of course, by this time most of the checkpoint
> | > | image has been written out to disk, so this is purely advisory.  But
> | > | then, it's probably naive to argue that anything more than an advisory
> | > | 'this went wrong' error code is useful.
> | > | 
> | > | The comparison of the objhash user counts to object refcounts as a
> | > | basis for checking for leaks comes from Alexey's OpenVZ-based c/r
> | > | patchset.
> | > | 
> | > | Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> | > | ---
> | > |  checkpoint/checkpoint.c    |    8 +++
> | > |  checkpoint/memory.c        |    2 +
> | > |  checkpoint/objhash.c       |  108 +++++++++++++++++++++++++++++++++++++++----
> | > |  include/linux/checkpoint.h |    2 +
> | > |  4 files changed, 110 insertions(+), 10 deletions(-)
> | > | 
> | > | diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> | > | index 4319976..32a0a8e 100644
> | > | --- a/checkpoint/checkpoint.c
> | > | +++ b/checkpoint/checkpoint.c
> | > | @@ -498,6 +498,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
> | > |  	if (ret < 0)
> | > |  		goto out;
> | > | 
> | > | +	if (!(ctx->flags & CHECKPOINT_SUBTREE)) {
> | > | +		/* verify that all objects are contained (no leaks) */
> | > | +		if (!ckpt_obj_contained(ctx)) {
> | > | +			ret = -EBUSY;
> | > | +			goto out;
> | > | +		}
> | > | +	}
> | > | +
> | > |  	/* on success, return (unique) checkpoint identifier */
> | > |  	ctx->crid = atomic_inc_return(&ctx_count);
> | > |  	ret = ctx->crid;
> | > | diff --git a/checkpoint/memory.c b/checkpoint/memory.c
> | > | index 7637c1e..5ae2b41 100644
> | > | --- a/checkpoint/memory.c
> | > | +++ b/checkpoint/memory.c
> | > | @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
> | > |  			ret = exe_objref;
> | > |  			goto out;
> | > |  		}
> | > | +		/* account for all references through vma/exe_file */
> | > | +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
> | > 
> | > Do we really need to add num_exe_file_vmas here ?
> | > 
> | > A quick look at all callers for added_exe_file_vma() seems to show that
> | > those callers also do a get_file().
> | 
> | Each vma whose file is the same as mm->exe_file causes the refcount
> | of that file to increase by 2: once by vma->vm_file, and once via
> | added_exe_file_vma(). The c/r code calls ckpt_obj_checkpoint() only
> | once, thus once one obj_file_grab() for that file. The code above
> | accounts for the missing count.
> 
> If the executable is shared between a parent and child (as in fork()/dup_mm)
> do we still need to account for the 'added_exe_file_vma()' in the child
> process ?
> 
> i.e I can trace a call to added_exe_file_vma() when loading/mmaping a biniary.
> But I can't trace a call to added_exe_file_vma() during fork()/dup_mm()).

Yes, this is consistent with the common case. The count should usually
only change when exec'ing. That said, it also changes as the original vmas
are split or merged as protection/permission bits change or sections are
unmapped altogether.
 
> Here is how I can account for the 16 in the obj->users :-)
> 
> 	Parent:
> 		do_checkpoint_mm: +2	= 2	(first time/obj_new())
> 		num_exe_vmas: +2	= 4

You mean mm->exe_file, not "num_exe_file_vmas"? num_exe_file_vmas
is irrelevant when it comes to checkpoint.

This seems odd: +4 to the obj->users count seems like 2 too many. It
should just be +2 -- once for mm->exe_file and once for the objhash
itself.

> 
> 		filemap_checkpoint: +1	= 5	(text section)
> 		filemap_checkpoint: +1	= 6	(data section)

These filemap_checkpoint counts make perfect sense.

> 
> 	Child:
> 		do_checkpoint_mm: +1	= 7

OK (for the child's mm->exe_file ref)

> 		num_exe_file_vmas: +2	= 9

Bad

> 
> 		filemap_checkpoint: +1	= 10	(text section)
> 		filemap_checkpoint: +1	= 11	(data section)
> 
> 	Grand child:
> 
> 		do_checkpoint_mm: +1	= 12

OK (for the grandchild's mm->exe_file ref)

> 		num_exe_file_vmas: +2	= 14

Bad

> 
> 		filemap_checkpoint: +1	= 15	(text section)
> 		filemap_checkpoint: +1	= 16	(data section)
> 
> Even if we were to drop the num_exe_file_vmas for the child and
> grand-child, we would be off by 2 :-(

Drop all the "num_exe_vmas" increments.

> As of now, I can account for 9 of the 10 found in file->f_count.
> 
> 
> 	Parent:
> 		load_a.out/do_mmap: +2	= 2	(text)
> 		load_aout/do_mmap(): +2	= 4	(data)

(aside: I don't know why these add two each... are you sure there
weren't two more refs by dup_mm_exe_file() in child and grandchild
below?)

These happen during exec. Are you missing this one:

	void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
	{
		if (new_exe_file)
			get_file(new_exe_file);
	...

	int flush_old_exec(struct linux_binprm * bprm)
	{
		char * name;
		int i, ch, retval;
		char tcomm[sizeof(current->comm)];

		/*
		 * Make sure we have a private signal table and that
		 * we are unassociated from the previous thread group.
		 */
		retval = de_thread(current);
		if (retval)
			goto out;

		set_mm_exe_file(bprm->mm, bprm->file); <----- *here*

		/*
		 * Release all of the old mmap stuff
		 */
		retval = exec_mmap(bprm->mm);
		if (retval)
			goto out;
	...

> 
> 	Child:
> 		dup_mm()/dup_mmap(): +1	= 5	(text)
> 		dup_mm()/dup_mmap(): +1	= 6	(data)

Odd. Should be += 3 for each fork-induced dup_mm():
	1 for text VMA
	1 for data VMA
	1 for mm->exe_file (see dup_mm_exe_file() in fs/proc/base.c)

> 
> 	Grand Child:
> 		dup_mm()/dup_mmap(): +1	= 7	(text)
> 		dup_mm()/dup_mmap(): +1	= 8	(data)
>
> 	Checkpoint/Objhash:
> 
> 		obj_new/obj_file_grab: +1 = 9

By dropping the "num_exe_vmas" additions to obj->users and assuming
I'm right about file->f_count being referenced three times for each
fork, I think that accounts for everything -- file->f_count == obj->users.
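To make the bookkeeping concrete, here is a rough userspace model of the
accounting described above. It is only a sketch: every name in it is made
up for illustration, and it assumes that exec takes one reference per
mapped section plus one in set_mm_exe_file(), that each fork takes one per
copied vma plus one in dup_mm_exe_file(), and that obj_new() starts
obj->users at 2 while obj_file_grab() pins the file once.

```c
/* Toy model of the two counters; not kernel code. */
struct counts {
	int f_count;	/* models file->f_count */
	int users;	/* models obj->users in the objhash */
};

/* exec: set_mm_exe_file() takes one ref, each mapped section one more */
static void model_exec(struct counts *c, int nr_vmas)
{
	c->f_count += 1 + nr_vmas;
}

/* fork: dup_mm_exe_file() takes one ref, dup_mmap() one per copied vma */
static void model_fork(struct counts *c, int nr_vmas)
{
	c->f_count += 1 + nr_vmas;
}

/* checkpoint of one mm, with no num_exe_file_vmas increment */
static void model_checkpoint_mm(struct counts *c, int nr_vmas, int first)
{
	if (first) {
		c->f_count += 1;	/* obj_file_grab() pins the file */
		c->users = 2;		/* objhash + this mm->exe_file */
	} else {
		c->users += 1;		/* this mm->exe_file */
	}
	c->users += nr_vmas;		/* one filemap_checkpoint() per vma */
}

/* parent execs, then forks a child and a grandchild; all are checkpointed */
struct counts model_three_generations(int nr_vmas)
{
	struct counts c = { 0, 0 };

	model_exec(&c, nr_vmas);
	model_fork(&c, nr_vmas);
	model_fork(&c, nr_vmas);

	model_checkpoint_mm(&c, nr_vmas, 1);
	model_checkpoint_mm(&c, nr_vmas, 0);
	model_checkpoint_mm(&c, nr_vmas, 0);
	return c;
}
```

With two vmas (text and data), model_three_generations(2) ends with both
counters at 10, which is the f_count value reported earlier in the thread.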

> Another question is regarding the obj->users = 2 in obj_new():
> 
> 	- one of these references is for the get_file() in obj_file_grab()
> 	  called near the end of obj_new() right ?
> 
> 	- where can I find the other get_file() ?

I think in flush_old_exec() above.

Cheers,
	-Matt


* Re: [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint
       [not found]                     ` <20090507214501.GA29671-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-08 13:44                       ` Oren Laadan
  0 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-08 13:44 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Sukadev Bhattiprolu, Alexey Dobriyan, Dave Hansen

On Thu, 7 May 2009, Matt Helsley wrote:

> On Wed, May 06, 2009 at 11:13:21PM -0700, Sukadev Bhattiprolu wrote:
> > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> 
> <snip>
> 
> > | > | diff --git a/checkpoint/memory.c b/checkpoint/memory.c
> > | > | index 7637c1e..5ae2b41 100644
> > | > | --- a/checkpoint/memory.c
> > | > | +++ b/checkpoint/memory.c
> > | > | @@ -687,6 +687,8 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
> > | > |  			ret = exe_objref;
> > | > |  			goto out;
> > | > |  		}
> > | > | +		/* account for all references through vma/exe_file */
> > | > | +		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
> > | > 
> > | > Do we really need to add num_exe_file_vmas here ?
> > | > 
> > | > A quick look at all callers for added_exe_file_vma() seems to show that
> > | > those callers also do a get_file().
> > | 
> > | Each vma whose file is the same as mm->exe_file causes the refcount
> > | of that file to increase by 2: once by vma->vm_file, and once via
> > | added_exe_file_vma(). The c/r code calls ckpt_obj_checkpoint() only
> > | once, thus only one obj_file_grab() for that file. The code above
> > | accounts for the missing count.
> 
> Perhaps I'm misreading Oren's explanation, but the refcount on the file
> should not be:
> 
> 	2*#vmas(vm_file==mm->exe_file) + #fds(filp==mm->exe_file)

I left the #fds out of the explanation, but of course they are counted.

> 
> It should be:
> 
> 	#vmas(vm_file==mm->exe_file) + #fds(filp==mm->exe_file) + 1(for mm->exe_file).

The current code counts #vmas (twice), #fds (once) and 1 (for 
mm->exe_file), for each process. Then there is +1 for the copy
that is kept in the objhash...
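The difference between the two formulas can be sketched in a few lines of
userspace C. This is an illustration only: the function names are made up,
the objhash's own reference is left out because it appears on both sides,
and treating the leak test as a plain equality check is an assumption
about what ckpt_obj_contained() does.

```c
#include <errno.h>

/* references the kernel actually holds on the file, per process */
int model_f_count(int nr_vmas, int nr_fds)
{
	return nr_vmas + nr_fds + 1;	/* +1 for mm->exe_file */
}

/* what the v14 checkpoint code adds up: the vmas are counted twice */
int model_users_v14(int nr_vmas, int nr_fds)
{
	return 2 * nr_vmas + nr_fds + 1;
}

/* assumed leak test: the counts must match or the checkpoint fails */
int model_leak_check(int users, int f_count)
{
	return users == f_count ? 0 : -EBUSY;
}
```

For a process with two vmas and no fds on the executable, the v14 count is
5 against an f_count of 3, so the check would report -EBUSY even though
nothing leaked.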

> 
> because added_exe_file_vma() increments num_exe_file_vmas but does not
> change the file reference count. So incrementing the obj count while
> walking the vmas and fds should bring the count 1 short of matching.
> 

Duh !  I recall having looked at it :/

The patch below should fix it:

From 9574933bbdafcbc6bee9d42fd215e80d0fb25348 Mon Sep 17 00:00:00 2001
From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Date: Fri, 8 May 2009 02:41:19 -0400
Subject: [PATCH] c/r: fix users count of files that are pointed to by mm->exe_file

Drop the code that adds mm->num_exe_file_vmas to the users count of
the mm->exe_file. This is unnecessary because added_exe_file_vma()
increments num_exe_file_vmas but does not change the file reference
count. So incrementing the obj count while walking the vmas and fds
should bring the count 1 short of matching.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c        |    2 --
 checkpoint/objhash.c       |    2 +-
 include/linux/checkpoint.h |    1 -
 3 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index fcf6481..e0ff4c1 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -687,8 +687,6 @@ static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
 			ret = exe_objref;
 			goto out;
 		}
-		/* account for all references through vma/exe_file */
-		ckpt_obj_users_inc(ctx, mm->exe_file, mm->num_exe_file_vmas);
 	}
 
 	h->exefile_objref = exe_objref;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 87bc5e8..dc55047 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -394,7 +394,7 @@ int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 }
 
 /* increment the 'users' count of an object */
-void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
+static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
 {
 	struct ckpt_obj *obj;
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8ee6304..d966f19 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -53,7 +53,6 @@ extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
 			    enum obj_type type);
 extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 			       enum obj_type type, int *first);
-extern void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment);
 extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
 			   enum obj_type type);
 
-- 
1.6.0.4


* Re: [RFC v14][PATCH 43/54] sysvipc-shm: checkpoint
       [not found]     ` <1240961064-13991-44-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-15 19:20       ` Serge E. Hallyn
  0 siblings, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2009-05-15 19:20 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> +int checkpoint_ipc_shm(int id, void *p, void *data)
> +{
> +	struct ckpt_hdr_ipc_shm *h;
> +	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
> +	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
> +	struct shmid_kernel *shp;
> +	struct inode *inode;
> +	int first, objref;
> +	int ret;
> +
> +	shp = container_of(perm, struct shmid_kernel, shm_perm);
> +	inode = shp->shm_file->f_dentry->d_inode;
> +
> +	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
> +	if (objref < 0)
> +		return objref;
> +	/* this must be the first time we see this region */
> +	BUG_ON(!first);
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	ret = fill_ipc_shm_hdr(ctx, h, shp);
> +	if (ret < 0)
> +		goto out;
> +
> +	h->objref = objref;
> +	ckpt_debug("shm: objref %d\n", h->objref);
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = checkpoint_memory_contents(ctx, NULL, inode);

Of course all of the ipc checkpointing will have to actually
use ipcperms() to check access rights.  Until that's done
we might need to just disable unprivileged checkpoints...

-serge


* Re: [RFC v14][PATCH 35/54] Support for share memory address spaces
       [not found]     ` <1240961064-13991-36-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-20 17:55       ` Dave Hansen
  2009-05-20 18:23         ` Oren Laadan
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2009-05-20 17:55 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan

On Tue, 2009-04-28 at 19:24 -0400, Oren Laadan wrote:
> +static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
> +{
> +       struct ckpt_hdr_task_objs *h;
> +       int mm_objref;
> +       int ret;
> +
> +       mm_objref = checkpoint_mm_obj(ctx, t);
> +       ckpt_debug("memory: objref %d\n", mm_objref);
> +       if (mm_objref < 0)
> +               return mm_objref;
> +
> +       h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
> +       if (!h)
> +               return -ENOMEM;
> +
> +       h->mm_objref = mm_objref;
> +
> +       ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +       ckpt_hdr_put(ctx, h);
> +       return ret;
> +}

I wonder if this gets easier or harder to parse if you do this instead:

	ret = ckpt_write_obj(ctx, &h->h);

It is kinda what we already do for things that use container_of().  
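
A standalone sketch of the idea, for comparison with the cast. The struct
layouts are simplified and assumed (in particular that the generic header
is the first member and is named 'h'), and the model_* names are made up.
Since 'h' is a pointer in checkpoint_task_objs(), the member address is
spelled &h->h.

```c
#include <stddef.h>

/* simplified stand-ins for the real headers; field layout is assumed */
struct ckpt_hdr {
	int type;
	int len;
};

struct ckpt_hdr_task_objs {
	struct ckpt_hdr h;	/* generic header embedded as first member */
	int mm_objref;
};

/* userspace equivalent of the kernel's container_of() */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* hypothetical writer: only sees the generic header, returns its length */
int model_write_obj(const struct ckpt_hdr *hdr)
{
	return hdr->len;
}

/* no cast needed: pass the address of the embedded header */
int model_write_task_objs(struct ckpt_hdr_task_objs *h)
{
	return model_write_obj(&h->h);
}
```

The compiler then checks that the struct really embeds a ckpt_hdr, which
the cast does not, and model_container_of() gets back to the full record
from the header pointer.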

-- Dave


* Re: [RFC v14][PATCH 35/54] Support for share memory address spaces
  2009-05-20 17:55       ` Dave Hansen
@ 2009-05-20 18:23         ` Oren Laadan
  0 siblings, 0 replies; 107+ messages in thread
From: Oren Laadan @ 2009-05-20 18:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alexey Dobriyan



Dave Hansen wrote:
> On Tue, 2009-04-28 at 19:24 -0400, Oren Laadan wrote:
>> +static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
>> +{
>> +       struct ckpt_hdr_task_objs *h;
>> +       int mm_objref;
>> +       int ret;
>> +
>> +       mm_objref = checkpoint_mm_obj(ctx, t);
>> +       ckpt_debug("memory: objref %d\n", mm_objref);
>> +       if (mm_objref < 0)
>> +               return mm_objref;
>> +
>> +       h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
>> +       if (!h)
>> +               return -ENOMEM;
>> +
>> +       h->mm_objref = mm_objref;
>> +
>> +       ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
>> +       ckpt_hdr_put(ctx, h);
>> +       return ret;
>> +}
> 
> I wonder if this gets easier or harder to parse if you do this instead:
> 
> 	ret = ckpt_write_obj(ctx, &h->h);
> 
> It is kinda what we already do for things that use container_of().  
> 

Fine with me.

Oren


Thread overview: 107+ messages
2009-04-28 23:23 [RFC v14][PATCH 00/54] Kernel based checkpoint/restart Oren Laadan
     [not found] ` <1240961064-13991-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-28 23:23   ` [RFC v14][PATCH 01/54] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 02/54] Checkpoint/restart: initial documentation Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 03/54] Make file_pos_read/write() public Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 04/54] General infrastructure for checkpoint restart Oren Laadan
     [not found]     ` <1240961064-13991-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29  0:58       ` Serge E. Hallyn
     [not found]         ` <20090429005826.GA23583-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-29 17:49           ` Oren Laadan
     [not found]             ` <49F8932D.4040506-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29 18:15               ` Serge E. Hallyn
2009-04-29 17:12       ` Serge E. Hallyn
2009-05-06 20:39       ` Sukadev Bhattiprolu
     [not found]         ` <20090506203955.GA6003-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-05-06 20:57           ` Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 05/54] x86 support for checkpoint/restart Oren Laadan
     [not found]     ` <1240961064-13991-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-01 15:12       ` Dave Hansen
2009-04-28 23:23   ` [RFC v14][PATCH 06/54] Introduce method 'checkpoint' in struct vm_operations_struct Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 07/54] cr: extend arch_setup_additional_pages() Oren Laadan
     [not found]     ` <1240961064-13991-8-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-01 15:13       ` Dave Hansen
2009-05-01 15:42         ` Serge E. Hallyn
     [not found]           ` <20090501154220.GA26771-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-01 15:57             ` Dave Hansen
2009-05-01 16:18               ` Serge E. Hallyn
     [not found]                 ` <20090501161813.GA27516-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-04  7:25                   ` Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 08/54] Dump memory address space Oren Laadan
     [not found]     ` <1240961064-13991-9-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29  4:11       ` Serge E. Hallyn
     [not found]         ` <20090429041128.GA28018-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-04-29  6:42           ` Guenter Roeck
     [not found]             ` <20090429064241.GA17482-gvzKVTG1yJJBDgjK7y7TUQ@public.gmane.org>
2009-04-29 20:00               ` Oren Laadan
2009-04-30  4:54       ` Matt Helsley
2009-05-01 15:25       ` Dave Hansen
2009-05-01 15:27       ` Dave Hansen
2009-05-04  7:58         ` Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 09/54] Restore " Oren Laadan
     [not found]     ` <1240961064-13991-10-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-01 15:28       ` Dave Hansen
2009-04-28 23:23   ` [RFC v14][PATCH 10/54] Infrastructure for shared objects Oren Laadan
     [not found]     ` <1240961064-13991-11-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29  1:03       ` Serge E. Hallyn
2009-04-29 16:21       ` Serge E. Hallyn
2009-04-28 23:23   ` [RFC v14][PATCH 11/54] Introduce 'checkpoint' method in 'struct file_operations' Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 12/54] Dump open file descriptors Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 13/54] add generic checkpoint f_op to ext fses Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 14/54] Restore open file descriptors Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 15/54] Record 'struct file' object instead of the file name for VMAs Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 16/54] External checkpoint of a task other than ourself Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 17/54] c/r of restart-blocks: export functionality used in next patch Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 18/54] c/r of restart-blocks Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 19/54] Checkpoint multiple processes Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 20/54] Restart " Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 21/54] Define subtree flag and unpriv_allowed sysctl Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 22/54] Checkpoint open pipes Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 23/54] Restore " Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 24/54] Prepare to support shared memory Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 25/54] Dump anonymous- and file-mapped- " Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 26/54] Restore " Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 27/54] s390: Expose a constant for the number of words representing the CRs Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 28/54] c/r: Add CKPT_COPY() macro (v4) Oren Laadan
2009-04-28 23:23   ` [RFC v14][PATCH 29/54] s390: define s390-specific checkpoint-restart code (v7) Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 30/54] powerpc: provide APIs for validating and updating DABR Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 31/54] powerpc: checkpoint/restart implementation Oren Laadan
     [not found]     ` <1240961064-13991-32-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29  6:54       ` Nathan Lynch
     [not found]         ` <m34ow8ueyk.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-04-29 15:49           ` Serge E. Hallyn
2009-04-29 18:05           ` Oren Laadan
     [not found]             ` <49F896E8.7020802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29 20:55               ` Nathan Lynch
2009-04-29 18:18           ` Oren Laadan
     [not found]             ` <49F899E1.2030207-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-29 20:33               ` Nathan Lynch
2009-04-28 23:24   ` [RFC v14][PATCH 32/54] powerpc: wire up checkpoint and restart syscalls Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 33/54] powerpc: enable checkpoint support in Kconfig Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 34/54] Export fs/exec.c:exec_mmap() Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 35/54] Support for share memory address spaces Oren Laadan
     [not found]     ` <1240961064-13991-36-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-20 17:55       ` Dave Hansen
2009-05-20 18:23         ` Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 36/54] Make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 37/54] c/r: Add UTS support (v6) Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 38/54] Stub implementation of IPC namespace c/r Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 39/54] deferqueue: generic queue to defer work Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 40/54] ipc: allow allocation of an ipc object with desired identifier Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 41/54] ipc: helpers to save and restore kern_ipc_perm structures Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 42/54] ipc namespace: save and restore ipc namespace basics Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 43/54] sysvipc-shm: checkpoint Oren Laadan
     [not found]     ` <1240961064-13991-44-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-15 19:20       ` Serge E. Hallyn
2009-04-28 23:24   ` [RFC v14][PATCH 44/54] sysvipc-shm: restart Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 45/54] sysvipc-shm: export interface from ipc/shm.c to delete ipc shm Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 46/54] sysvipc-shm: correctly handle deleted (active) ipc shared memory Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 47/54] sysvipc-msg: make 'struct msg_msgseg' visible in ipc/util.h Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 48/54] sysvipc-msq: checkpoint Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 49/54] sysvipc-msq: restart Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 50/54] sysvipc-sem: export interface from ipc/sem.c to cleanup ipc sem Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 51/54] sysvipc-sem: checkpoint Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 52/54] sysvipc-sem: restart Oren Laadan
2009-04-28 23:24   ` [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint Oren Laadan
     [not found]     ` <1240961064-13991-54-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-01 17:26       ` Dave Hansen
2009-05-07  3:50       ` Sukadev Bhattiprolu
     [not found]         ` <20090507035026.GB6003-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-05-07  4:11           ` Oren Laadan
     [not found]             ` <4A025F7D.3050403-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-07  6:13               ` [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint Sukadev Bhattiprolu
     [not found]                 ` <20090507061321.GA13725-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-05-07  6:24                   ` Sukadev Bhattiprolu
2009-05-07 21:45                   ` Matt Helsley
     [not found]                     ` <20090507214501.GA29671-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-08 13:44                       ` Oren Laadan
2009-05-08  4:56               ` Sukadev Bhattiprolu
     [not found]                 ` <20090508045622.GA31731-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-05-08  8:12                   ` [RFC v14][PATCH 53/54] Detect resource leaks for whole-container checkpoint Matt Helsley
2009-04-28 23:24   ` [RFC v14][PATCH 54/54] Report failures during checkpoint as an object in the output stream Oren Laadan
2009-04-29  8:18   ` [RFC v14][PATCH 00/54] Kernel based checkpoint/restart Louis Rilling
     [not found]     ` <20090429081815.GA1813-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-04-29 22:47       ` Oren Laadan
     [not found]         ` <49F8D8FC.8010400-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-04-30  9:41           ` Louis Rilling
     [not found]             ` <20090430094106.GC13896-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-05-04  8:03               ` Matthieu Fertré
     [not found]                 ` <49FEA136.2040406-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
2009-05-04  9:06                   ` Oren Laadan
     [not found]                     ` <49FEB01B.208-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-04  9:17                       ` Matthieu Fertré
2009-05-04 13:01                       ` Serge E. Hallyn
     [not found]                         ` <20090504130108.GA21521-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-04 20:13                           ` Oren Laadan
2009-05-05  8:20                           ` Louis Rilling
     [not found]                             ` <20090505082057.GA11377-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-05-05 13:49                               ` Serge E. Hallyn
     [not found]                                 ` <20090505134920.GB10136-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-05 14:26                                   ` Louis Rilling
2009-05-04 19:13   ` Oren Laadan
