linux-kernel.vger.kernel.org archive mirror
* [RFC v4][PATCH 0/9] Kernel based checkpoint/restart`
@ 2008-09-09  7:42 Oren Laadan
  2008-09-09  7:42 ` [RFC v4][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
                   ` (9 more replies)
  0 siblings, 10 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

These patches implement checkpoint-restart [CR v4]. This version is
aimed at addressing feedback and eliminating bugs, after having added
save and restore of open files state (regular files and directories)
which makes it more usable.

Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAP_SYS_ADMIN files restore may fail)

Changelog:

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
(Dave Hansen's announcement)

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.

--
(Original announcement)

In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.

Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, as well as show its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).

The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space

In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current task, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to handle restart-block logic when saving/restoring the
thread data).

It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.

As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.

Below are two example programs: one uses checkpoint (called ckpt) and
one uses restart (called rstr). Note the use of "dup2" to create a 
copy of an open file and show how shared objects are treated. Execute
like this (as a superuser):

orenl:~/test$ ./ckpt > out.1
				<-- ctrl-c
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
world, hello!
(ret = 1)

orenl:~/test$ ./ckpt > out.1
				<-- ctrl-c
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
world, hello!
(ret = 2)

				<-- now change the contents of the file
orenl:~/test$ sed -i 's/world, hello!/xxxx/' /tmp/cr-test.out
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
xxxx
(ret = 2)

				<-- and do the restart
orenl:~/test$ ./rstr < out.1
				<-- ctrl-c
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
world, hello!
(ret = 0)

(if you check the output of ps, you'll see that "rstr" changed its
name to "ckpt", as expected). 

Oren.


============================== ckpt.c ================================

#define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>

#define OUTFILE "/tmp/cr-test.out"

int main(int argc, char *argv[])
{
	pid_t pid = getpid();
	FILE *file;
	int ret;

	close(0);
	close(2);

	unlink(OUTFILE);
	file = fopen(OUTFILE, "w+");
	if (!file) {
		perror("fopen");
		exit(1);
	}

	if (dup2(0,2) < 0) {
		perror("dup2");
		exit(1);
	}

	fprintf(file, "hello, world!\n");
	fflush(file);

	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
	if (ret < 0) {
		perror("checkpoint");
		exit(2);
	}

	fprintf(file, "world, hello!\n");
	fprintf(file, "(ret = %d)\n", ret);
	fflush(file);

	while (1)
		;

	return 0;
}

============================== rstr.c ================================

#define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
	pid_t pid = getpid();
	int ret;

	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
	if (ret < 0)
		perror("restart");

	printf("should not reach here !\n");

	return 0;
}


* [RFC v4][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09  7:42 ` [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/kernel/syscall_table_32.S |    2 ++
 checkpoint/Kconfig                 |   11 +++++++++++
 checkpoint/Makefile                |    5 +++++
 checkpoint/sys.c                   |   35 +++++++++++++++++++++++++++++++++++
 include/asm-x86/unistd_32.h        |    2 ++
 include/linux/syscalls.h           |    2 ++
 init/Kconfig                       |    2 ++
 kernel/sys_ni.c                    |    4 ++++
 8 files changed, 63 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..a9f22ef
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool y
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..07d018b
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..b9018a4
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,35 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d6ff145..edc218b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index c11da38..fd5f7bf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -779,6 +779,8 @@ config MARKERS
 
 source "arch/Kconfig"
 
+source "checkpoint/Kconfig"
+
 config PROC_PAGE_MONITOR
  	default y
 	depends on PROC_FS && MMU
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 08d6e1b..ca95c25 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3



* [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
  2008-09-09  7:42 ` [RFC v4][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-10  6:10   ` MinChan Kim
  2008-09-09  7:42 ` [RFC v4][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Add the checkpoint and restart interfaces, as well as the helpers needed
to easily manage the file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Makefile                 |    2 +-
 checkpoint/Makefile      |    2 +-
 checkpoint/checkpoint.c  |  188 +++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c     |  189 +++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c         |  218 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/ckpt.h     |   60 +++++++++++++
 include/linux/ckpt_hdr.h |   84 ++++++++++++++++++
 include/linux/magic.h    |    3 +
 8 files changed, 740 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/ckpt.h
 create mode 100644 include/linux/ckpt_hdr.h

diff --git a/Makefile b/Makefile
index f448e00..a558ad2 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..ad1099f
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->utime = t->utime;
+	hh->stime = t->stime;
+	hh->utimescaled = t->utimescaled;
+	hh->stimescaled = t->stimescaled;
+	hh->gtime = t->gtime;
+	hh->prev_utime = t->prev_utime;
+	hh->prev_stime = t->prev_stime;
+	hh->nvcsw = t->nvcsw;
+	hh->nivcsw = t->nivcsw;
+	hh->start_time_sec = t->start_time.tv_sec;
+	hh->start_time_nsec = t->start_time.tv_nsec;
+	hh->real_start_time_sec = t->real_start_time.tv_sec;
+	hh->real_start_time_nsec = t->real_start_time.tv_nsec;
+	hh->min_flt = t->min_flt;
+	hh->maj_flt = t->maj_flt;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret ;
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("CR: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx)
+{
+	int ret;
+
+	/* FIX: need to test whether container is checkpointable */
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..171cd2d
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,189 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @n: available buffer size
+ *
+ * @return: size of payload
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > n)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @n: available buffer size
+ * @type: expected record type
+ *
+ * @return: object reference of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, n);
+	if (!ret) {
+		if (h.type == type)
+			ret = h.parent;
+		else
+			ret = -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string buffer length
+ */
+int cr_read_string(struct cr_ctx *ctx, void *str, int len)
+{
+	return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		return -EINVAL;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		return -EINVAL;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		return -EINVAL;
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	/* FIXME: for now, only restore t->comm */
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		return -EINVAL;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b9018a4..113e0df 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,189 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/ckpt.h>
+
+/*
+ * helpers to write/read to/from the image file descriptor
+ *
+ *   cr_uwrite() - write a user-space buffer to the checkpoint image
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_uread() - read from the checkpoint image to a user-space buffer
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ *
+ */
+
+/* (temporarily added file_pos_read() and file_pos_write() because they
+ * are static in fs/read_write.c... should cleanup and remove later) */
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+	struct file *file = ctx->file;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite <= 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		buf += nwrite;
+	}
+
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+	mm_segment_t oldfs;
+	int ret;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = cr_uwrite(ctx, buf, count);
+	set_fs(oldfs);
+
+	return ret;
+}
+
+int cr_uread(struct cr_ctx *ctx, void *buf, int count)
+{
+	struct file *file = ctx->file;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, (char __user *) buf, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN)
+				nread = 0;
+			else
+				return nread;
+		}
+		buf += nread;
+	}
+
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *buf, int count)
+{
+	mm_segment_t oldfs;
+	int ret;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = cr_uread(ctx, buf, count);
+	set_fs(oldfs);
+
+	return ret;
+}
+
+
+/*
+ * helpers to manage CR contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+/* unique checkpoint identifier (FIXME: should be per-container) */
+static atomic_t cr_ctx_count;
+
+void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+
+	free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
+
+	kfree(ctx);
+}
+
+struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->file = fget(fd);
+	if (!ctx->file) {
+		cr_ctx_free(ctx);
+		return ERR_PTR(-EBADF);
+	}
+	get_file(ctx->file);
+
+	ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
+	if (!ctx->hbuf) {
+		cr_ctx_free(ctx);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pid = pid;
+	ctx->flags = flags;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	return ctx;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, one should call cr_hbuf_get() to
+ * reserve space in the buffer, and then cr_hbuf_put() when that space
+ * is no longer needed.
+ */
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * @return: pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -19,9 +202,23 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -30,6 +227,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_restart(ctx);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
new file mode 100644
index 0000000..91f4998
--- /dev/null
+++ b/include/linux/ckpt.h
@@ -0,0 +1,60 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	pid_t pid;		/* container identifier */
+	int crid;		/* unique checkpoint id */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+/* allocation defaults */
+#define CR_HBUF_ORDER  1
+#define CR_HBUF_TOTAL  (PAGE_SIZE << CR_HBUF_ORDER)
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+int cr_uread(struct cr_ctx *ctx, void *buf, int count);
+int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
+int cr_read_string(struct cr_ctx *ctx, void *str, int len);
+
+int do_checkpoint(struct cr_ctx *ctx);
+int do_restart(struct cr_ctx *ctx);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[CR:%s] " fmt, __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
new file mode 100644
index 0000000..dd05ecc
--- /dev/null
+++ b/include/linux/ckpt_hdr.h
@@ -0,0 +1,84 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u64 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+	__u32 _padding0;
+
+	__u64 utime, stime, utimescaled, stimescaled;
+	__u64 gtime;
+	__u64 prev_utime, prev_stime;
+	__u64 nvcsw, nivcsw;
+	__u64 start_time_sec, start_time_nsec;
+	__u64 real_start_time_sec, real_start_time_nsec;
+	__u64 min_flt, maj_flt;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1fa0c2c..c2b811c 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -42,4 +42,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3



* [RFC v4][PATCH 3/9] x86 support for checkpoint/restart
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
  2008-09-09  7:42 ` [RFC v4][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
  2008-09-09  7:42 ` [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09  8:17   ` Ingo Molnar
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.
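
As a rough user-space illustration of the register handling (not the
kernel code itself): 32-bit register values are widened into fixed
64-bit fields of the image record, and narrowed again on restart. All
names prefixed with sketch_ are hypothetical, only a handful of
registers are shown, and the range check on restore is an assumption
of this sketch rather than something the patch implements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit register snapshot (a small subset of pt_regs). */
struct sketch_regs32 {
	uint32_t ax, orig_ax, ip, sp, flags;
};

/* Hypothetical image record: every field is 64 bits wide, so the
 * layout does not depend on the word size of the saving kernel. */
struct sketch_hdr_cpu {
	uint64_t ax, orig_ax, ip, sp, flags;
};

/* Save: zero-extend each 32-bit register into its 64-bit slot. */
static void sketch_save_regs(struct sketch_hdr_cpu *hh,
			     const struct sketch_regs32 *regs, int self)
{
	assert(hh && regs);

	hh->ax = regs->ax;
	hh->orig_ax = regs->orig_ax;
	hh->ip = regs->ip;
	hh->sp = regs->sp;
	hh->flags = regs->flags;

	/* a task checkpointing itself records the future return
	 * value (0) of the syscall it is currently executing */
	if (self)
		hh->ax = 0;
}

/* Restore: narrow back to 32 bits; returns -1 if a value does not
 * fit (this check is an assumption of the sketch). */
static int sketch_restore_regs(struct sketch_regs32 *regs,
			       const struct sketch_hdr_cpu *hh)
{
	assert(hh && regs);

	if ((hh->ax | hh->orig_ax | hh->ip | hh->sp | hh->flags) >> 32)
		return -1;

	regs->ax = (uint32_t)hh->ax;
	regs->orig_ax = (uint32_t)hh->orig_ax;
	regs->ip = (uint32_t)hh->ip;
	regs->sp = (uint32_t)hh->sp;
	regs->flags = (uint32_t)hh->flags;
	return 0;
}
```

Keeping every field 64 bits wide costs a few bytes per record, but it
means a later x86-64 implementation could share the same record
layout.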

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/Makefile       |    2 +
 arch/x86/mm/checkpoint.c   |  194 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c      |  178 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c    |   13 +++-
 checkpoint/ckpt_arch.h     |    7 ++
 checkpoint/restart.c       |   13 +++-
 include/asm-x86/ckpt_hdr.h |   72 ++++++++++++++++
 include/linux/ckpt_hdr.h   |    1 +
 8 files changed, 478 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/ckpt_arch.h
 create mode 100644 include/asm-x86/ckpt_hdr.h

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dfb932d..58fe072 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,3 +22,5 @@ endif
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..71d21e6
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* for simplicity dump the entire array, cherry-pick upon restart */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	cr_debug("ntls %d\n", ntls);
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+void cr_write_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/* for checkpoint in process context (from within a container)
+	   the GS and FS registers should be saved from the hardware;
+	   otherwise they are already saved on the thread structure */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON(hh->orig_ax < 0);
+		hh->ax = 0;
+	}
+}
+
+void cr_write_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	preempt_disable();
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->debugreg4 = 0;
+	hh->debugreg5 = 0;
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+
+	preempt_enable();
+}
+
+void cr_write_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct thread_info *thread_info = task_thread_info(t);
+
+	/* i387 + MMX + SSE logic */
+
+	preempt_disable();
+
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+	if (hh->used_math) {
+		/* normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+		 * has been cleared when the task was context-switched out...
+		 * except if we are in process context, in which case we do */
+		if (thread_info->status & TS_USEDFPU)
+			unlazy_fpu(current);
+
+		hh->has_fxsr = cpu_has_fxsr;
+		memcpy(&hh->xstate, &thread->xstate, sizeof(thread->xstate));
+	}
+
+	preempt_enable();
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	cr_write_cpu_regs(hh, t);
+	cr_write_cpu_debug(hh, t);
+	cr_write_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..883a163
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,178 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int parent;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		return -EINVAL;
+#endif
+	cr_debug("ntls %d\n", hh->ntls);
+
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		return -EINVAL;
+
+	if (hh->ntls > 0) {
+
+		/* restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ? */
+
+		struct desc_struct *desc;
+		int size, cpu, ret;
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc)
+			return -ENOMEM;
+
+		ret = cr_kread(ctx, desc, size);
+		if (ret >= 0) {
+			/* FIX: add sanity checks (e.g. that values make
+			 * sense, that we don't overwrite old values, etc.) */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	}
+
+	return 0;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+int cr_read_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	preempt_disable();
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	preempt_enable();
+
+	return 0;
+}
+
+int cr_read_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* i387 + MMX + SSE */
+
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+	else {
+		if (hh->has_fxsr != cpu_has_fxsr) {
+			force_sig(SIGFPE, t);
+			return -EINVAL;
+		}
+		memcpy(&thread->xstate, &hh->xstate, sizeof(thread->xstate));
+		set_used_math();
+	}
+
+	preempt_enable();
+
+	return 0;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		return -EINVAL;
+#endif
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	ret = cr_read_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+ out:
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ad1099f..d34a691 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
 #include <linux/ckpt.h>
 #include <linux/ckpt_hdr.h>
 
+#include "ckpt_arch.h"
+
 /**
  * cr_write_obj - write a record described by a cr_hdr
  * @ctx: checkpoint context
@@ -159,8 +161,17 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	}
 
 	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
new file mode 100644
index 0000000..5bd4703
--- /dev/null
+++ b/checkpoint/ckpt_arch.h
@@ -0,0 +1,7 @@
+#include <linux/ckpt.h>
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+int cr_read_thread(struct cr_ctx *ctx);
+int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 171cd2d..5226994 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
 #include <linux/ckpt.h>
 #include <linux/ckpt_hdr.h>
 
+#include "ckpt_arch.h"
+
 /**
  * cr_read_obj - read a whole record (cr_hdr followed by payload)
  * @ctx: checkpoint context
@@ -164,8 +166,17 @@ static int cr_read_task(struct cr_ctx *ctx)
 	int ret;
 
 	ret = cr_read_task_struct(ctx);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	cr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
new file mode 100644
index 0000000..44a903c
--- /dev/null
+++ b/include/asm-x86/ckpt_hdr.h
@@ -0,0 +1,72 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/processor.h>
+
+struct cr_hdr_thread {
+	/* NEED: restart blocks */
+
+	__s16 gdt_entry_tls_entries;
+	__s16 sizeof_tls_array;
+	__s16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg4;
+	__u64 debugreg5;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u16 uses_debug;
+	__u16 used_math;
+	__u16 has_fxsr;
+	__u16 _padding;
+
+	union thread_xstate xstate;	/* i387 */
+
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index dd05ecc..e66f322 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/utsname.h>
+#include <asm/ckpt_hdr.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
-- 
1.5.4.3



* [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (2 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09  9:22   ` Vegard Nossum
                     ` (4 more replies)
  2008-09-09  7:42 ` [RFC v4][PATCH 5/9] Memory managemnet (restore) Oren Laadan
                   ` (5 subsequent siblings)
  9 siblings, 5 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

For each VMA, there is a 'struct cr_hdr_vma'; if the VMA is file-mapped,
it is followed by the file name.  The nr_pages field of that header
tells how many pages were dumped for this VMA.  The header is followed
by the actual data: first the addresses of all dumped pages (nr_pages
entries), then the contents of all dumped pages (nr_pages pages).
The next VMA follows in the same format, and so on.
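
The per-VMA layout described above can be sketched in user space as
follows. This is only an illustration under stated assumptions: the
sketch_ names are hypothetical, the header carries just three fields,
addresses are stored as 64-bit values, the page size is fixed at 4096,
and the optional file name record is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Hypothetical stand-in for the per-VMA header (subset of fields). */
struct sketch_vma_hdr {
	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t nr_pages;	/* number of pages dumped for this VMA */
};

/* Size of one record: header, address table, then page contents. */
static size_t sketch_vma_record_size(uint64_t nr_pages)
{
	return sizeof(struct sketch_vma_hdr)
		+ nr_pages * sizeof(uint64_t)
		+ nr_pages * SKETCH_PAGE_SIZE;
}

/* Write header, then all addresses, then all page contents.
 * Returns the number of bytes written to out. */
static size_t sketch_vma_write(unsigned char *out,
			       const struct sketch_vma_hdr *h,
			       const uint64_t *vaddrs,
			       const unsigned char *pages)
{
	size_t off = 0;

	assert(out && h);

	memcpy(out + off, h, sizeof(*h));
	off += sizeof(*h);
	memcpy(out + off, vaddrs, h->nr_pages * sizeof(uint64_t));
	off += h->nr_pages * sizeof(uint64_t);
	memcpy(out + off, pages, h->nr_pages * SKETCH_PAGE_SIZE);
	off += h->nr_pages * SKETCH_PAGE_SIZE;
	return off;
}

/* Find the dumped contents for vaddr: the i-th address in the table
 * corresponds to the i-th page in the data area. NULL if not dumped. */
static const unsigned char *sketch_vma_find_page(const unsigned char *rec,
						 uint64_t vaddr)
{
	struct sketch_vma_hdr h;
	const unsigned char *addrs, *data;
	uint64_t i, a;

	memcpy(&h, rec, sizeof(h));
	addrs = rec + sizeof(h);
	data = addrs + h.nr_pages * sizeof(uint64_t);

	for (i = 0; i < h.nr_pages; i++) {
		memcpy(&a, addrs + i * sizeof(a), sizeof(a));
		if (a == vaddr)
			return data + i * SKETCH_PAGE_SIZE;
	}
	return NULL;
}
```

Writing the whole address table before any page contents lets a
reader learn which addresses are present without touching the page
data.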

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/checkpoint.c   |   30 +++
 arch/x86/mm/restart.c      |    1 +
 checkpoint/Makefile        |    3 +-
 checkpoint/checkpoint.c    |   53 ++++++
 checkpoint/ckpt_arch.h     |    1 +
 checkpoint/ckpt_mem.c      |  448 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/ckpt_mem.h      |   35 ++++
 checkpoint/sys.c           |   23 ++-
 include/asm-x86/ckpt_hdr.h |    5 +
 include/linux/ckpt.h       |   12 ++
 include/linux/ckpt_hdr.h   |   30 +++
 11 files changed, 635 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/ckpt_mem.c
 create mode 100644 checkpoint/ckpt_mem.h

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 71d21e6..50cfd29 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
+ out:
+	mutex_unlock(&mm->context.lock);
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 883a163..d7fb89a 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -8,6 +8,7 @@
  *  distribution for more details.
  */
 
+#include <linux/unistd.h>
 #include <asm/desc.h>
 #include <asm/i387.h>
 
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index d34a691..4dae775 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -55,6 +55,55 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	char *fname;
+
+	BUG_ON(!buf);
+	fname = __d_path(path, root, buf, *n);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -164,6 +213,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index 5bd4703..9bd0ba4 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -2,6 +2,7 @@
 
 int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
 
 int cr_read_thread(struct cr_ctx *ctx);
 int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..2c93447
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,448 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr: list head of the page-array chain
+ *   ctx->pgcur: tracks the "current" position in the chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the "current" page-array (ctx->pgcur). The "current"
+ * page-array advances as necessary, and new page-array descriptors are
+ * allocated on-demand. Before the next MM, the chain is reset but not
+ * freed (that is, page references are dropped and ctx->pgcur is reset).
+ */
+
+#define CR_PGARR_ORDER  0
+#define CR_PGARR_TOTAL  ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
+
+/* release pages referenced by a page-array */
+void cr_pgarr_unref_pages(struct cr_pgarr *pgarr)
+{
+	int n;
+
+	/* only checkpoint keeps references to pages */
+	if (pgarr->pages) {
+		cr_debug("nr_used %d\n", pgarr->nr_used);
+		for (n = pgarr->nr_used; n--; )
+			page_cache_release(pgarr->pages[n]);
+	}
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_unref_pages(pgarr);
+	if (pgarr->pages)
+		free_pages((unsigned long) pgarr->pages, CR_PGARR_ORDER);
+	if (pgarr->vaddrs)
+		free_pages((unsigned long) pgarr->vaddrs, CR_PGARR_ORDER);
+	kfree(pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+	ctx->pgcur = NULL;
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(void)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->nr_free = CR_PGARR_TOTAL;
+	pgarr->nr_used = 0;
+
+	pgarr->pages = (struct page **)
+		__get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+	pgarr->vaddrs = (unsigned long *)
+		__get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+	if (!pgarr->pages || !pgarr->vaddrs) {
+		cr_pgarr_free_one(pgarr);
+		return NULL;
+	}
+
+	return pgarr;
+}
+
+/* cr_pgarr_alloc - return the next available pgarr in the page-array chain
+ * @ctx: checkpoint context
+ *
+ * Return the page-array following ctx->pgcur, extending the chain if needed
+ */
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	/* can reuse next element after ctx->pgcur ? */
+	pgarr = ctx->pgcur;
+	if (pgarr && !list_is_last(&pgarr->list, &ctx->pgarr)) {
+		pgarr = list_entry(pgarr->list.next, struct cr_pgarr, list);
+		goto out;
+	}
+
+	/* nope, need to extend the page-array chain */
+	pgarr = cr_pgarr_alloc_one();
+	if (!pgarr)
+		return NULL;
+
+	list_add_tail(&pgarr->list, &ctx->pgarr);
+ out:
+	ctx->pgcur = pgarr;
+	return pgarr;
+
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+		cr_pgarr_unref_pages(pgarr);
+		pgarr->nr_free = CR_PGARR_TOTAL;
+		pgarr->nr_used = 0;
+	}
+	ctx->pgcur = NULL;
+}
+
+
+/* return current page-array (and allocate if needed) */
+struct cr_pgarr *
+cr_pgarr_prep(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr = ctx->pgcur;
+
+	if (!pgarr || !pgarr->nr_free)
+		pgarr = cr_pgarr_alloc(ctx);
+	return pgarr;
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+/**
+ * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ */
+static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+			     struct vm_area_struct *vma, unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct page **pagep;
+	unsigned long *addrp;
+	int cow, nr, ret = 0;
+
+	nr = pgarr->nr_free;
+	pagep = &pgarr->pages[pgarr->nr_used];
+	addrp = &pgarr->vaddrs[pgarr->nr_used];
+	cow = !!vma->vm_file;
+
+	while (addr < end) {
+		struct page *page;
+
+		/*
+		 * simplified version of get_user_pages(): already have vma,
+		 * only need FOLL_TOUCH, and (for now) ignore fault stats.
+		 *
+		 * FIXME: consolidate with get_user_pages()
+		 */
+
+		cond_resched();
+		while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
+			ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+			if (ret & VM_FAULT_ERROR) {
+				if (ret & VM_FAULT_OOM)
+					ret = -ENOMEM;
+				else if (ret & VM_FAULT_SIGBUS)
+					ret = -EFAULT;
+				else
+					BUG();
+				break;
+			}
+			cond_resched();
+			ret = 0;
+		}
+
+		if (IS_ERR(page))
+			ret = PTR_ERR(page);
+
+		if (ret < 0)
+			break;
+
+		if (page == ZERO_PAGE(0)) {
+			page = NULL;	/* zero page: ignore */
+		} else if (cow && page_mapping(page) != NULL) {
+			page = NULL;	/* clean cow: ignore */
+		} else {
+			get_page(page);
+			*(addrp++) = addr;
+			*(pagep++) = page;
+			if (--nr == 0) {
+				addr += PAGE_SIZE;
+				break;
+			}
+		}
+
+		addr += PAGE_SIZE;
+	}
+
+	if (unlikely(ret < 0)) {
+		nr = pgarr->nr_free - nr;
+		while (nr--)
+			page_cache_release(*(--pagep));
+		return ret;
+	}
+
+	*start = addr;
+	return pgarr->nr_free - nr;
+}
+
+/**
+ * cr_vma_scan_pages - scan vma for pages that will need to be dumped
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * lists of page pointers and corresponding virtual addresses are tracked
+ * inside ctx->pgarr page-array chain
+ */
+static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	struct cr_pgarr *pgarr;
+	int nr, total = 0;
+
+	while (addr < end) {
+		pgarr = cr_pgarr_prep(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+		if (nr < 0)
+			return nr;
+		pgarr->nr_free -= nr;
+		pgarr->nr_used += nr;
+		total += nr;
+	}
+
+	cr_debug("total %d\n", total);
+	return total;
+}
+
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	char *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	kfree(buf);
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, nr, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = vma->vm_page_prot.pgprot;
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+	if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
+		pr_warning("CR: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ETXTBSY;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/* if there is a backing file, assume private-mapped */
+	/* (FIX: check if the file is unlinked) */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	/*
+	 * it seems redundant now, but we do it in 3 steps because:
+	 * first, the logic is simpler when we know how many pages before
+	 * dumping them; second, a future optimization will defer the
+	 * writeout (dump, and free) to a later step; in which case all
+	 * the pages to be dumped will be aggregated on the checkpoint ctx
+	 */
+
+	/* (1) scan: scan through the PTEs of the vma to count the pages
+	 * to dump (and later make those pages COW), and keep the list of
+	 * pages (and a reference to each page) on the checkpoint ctx */
+	nr = cr_vma_scan_pages(ctx, vma);
+	if (nr < 0)
+		return nr;
+
+	hh->nr_pages = nr;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+	/* save the file name, if relevant */
+	if (vma->vm_file)
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, ctx->vfsroot);
+
+	if (ret < 0)
+		return ret;
+
+	/* (2) dump: write out the addresses of all pages in the list (on
+	 * the checkpoint ctx) followed by the contents of all pages */
+	ret = cr_vma_dump_pages(ctx, nr);
+
+	/* (3) free: release the extra references to the pages in the list */
+	cr_pgarr_reset(ctx);
+
+	return ret;
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/ckpt_mem.h b/checkpoint/ckpt_mem.h
new file mode 100644
index 0000000..8ee211d
--- /dev/null
+++ b/checkpoint/ckpt_mem.h
@@ -0,0 +1,35 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;	/* how many entries already used */
+	unsigned int nr_free;	/* how many entries still free */
+	struct list_head list;
+};
+
+void cr_pgarr_reset(struct cr_ctx *ctx);
+void cr_pgarr_free(struct cr_ctx *ctx);
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx);
+struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 113e0df..8141161 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/ckpt.h>
 
+#include "ckpt_mem.h"
+
 /*
  * helpers to write/read to/from the image file descriptor
  *
@@ -110,7 +112,6 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count)
 	return ret;
 }
 
-
 /*
  * helpers to manage CR contexts: allocated for each checkpoint and/or
  * restart operation, and persists until the operation is completed.
@@ -126,6 +127,11 @@ void cr_ctx_free(struct cr_ctx *ctx)
 
 	free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
 
+	if (ctx->vfsroot)
+		path_put(ctx->vfsroot);
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -145,10 +151,13 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
 	get_file(ctx->file);
 
 	ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
-	if (!ctx->hbuf) {
-		cr_ctx_free(ctx);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!ctx->hbuf)
+		goto nomem;
+
+	/* assume checkpointer is in container's root vfs */
+	/* FIXME: this works for now, but will change with real containers */
+	ctx->vfsroot = &current->fs->root;
+	path_get(ctx->vfsroot);
 
 	ctx->pid = pid;
 	ctx->flags = flags;
@@ -156,6 +165,10 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
 	ctx->crid = atomic_inc_return(&cr_ctx_count);
 
 	return ctx;
+
+ nomem:
+	cr_ctx_free(ctx);
+	return ERR_PTR(-ENOMEM);
 }
 
 /*
diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
index 44a903c..6bc61ac 100644
--- a/include/asm-x86/ckpt_hdr.h
+++ b/include/asm-x86/ckpt_hdr.h
@@ -69,4 +69,9 @@ struct cr_hdr_cpu {
 
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index 91f4998..5c62a90 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -24,6 +27,11 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr;	/* page array for dumping VMA contents */
+	struct cr_pgarr *pgcur;	/* current position in page array */
+
+	struct path *vfsroot;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -46,11 +54,15 @@ struct cr_hdr;
 
 int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);
 
 int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
 int cr_read_string(struct cr_ctx *ctx, void *str, int len);
 
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+int cr_read_mm(struct cr_ctx *ctx);
+
 int do_checkpoint(struct cr_ctx *ctx);
 int do_restart(struct cr_ctx *ctx);
 
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index e66f322..ac77d7d 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -32,6 +32,7 @@ struct cr_hdr {
 enum {
 	CR_HDR_HEAD = 1,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -82,4 +83,33 @@ struct cr_hdr_task {
 	__s32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+	__s64 nr_pages;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC v4][PATCH 5/9] Memory management (restore)
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (3 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09 16:07   ` Serge E. Hallyn
  2008-09-10 19:31   ` Dave Hansen
  2008-09-09  7:42 ` [RFC v4][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
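For readers who want a userspace picture of the flow described above
(recreate the region at its recorded address, then copy the saved pages
into it), here is a minimal sketch. It is not the kernel code in this
patch: 'struct vma_image' and restore_vma() are hypothetical names, only
anonymous private regions are handled, and mmap(MAP_FIXED) plus memcpy()
stand in for do_mmap_pgoff() and cr_uread().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical userspace description of one saved region. */
struct vma_image {
	void *start;			/* address the region must occupy */
	size_t size;			/* length in bytes, page multiple */
	size_t nr_pages;		/* number of saved pages */
	const uintptr_t *vaddrs;	/* address of each saved page */
	const char *data;		/* nr_pages pages of contents */
};

/* Map the region at its recorded address, then fill in saved pages. */
static int restore_vma(const struct vma_image *img)
{
	size_t page = (size_t) sysconf(_SC_PAGESIZE);
	uintptr_t lo = (uintptr_t) img->start;
	size_t i;
	void *base;

	assert(img->size % page == 0);

	/* MAP_FIXED replaces whatever is currently mapped there */
	base = mmap(img->start, img->size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	for (i = 0; i < img->nr_pages; i++) {
		uintptr_t va = img->vaddrs[i];

		/* reject addresses outside the region or unaligned */
		if (va < lo || va % page || va - lo > img->size - page)
			return -1;
		memcpy((void *) va, img->data + i * page, page);
	}
	return 0;
}
```

Pages that are not listed in vaddrs stay zero-filled, which matches the
idea that only pages actually dumped need to be read back.
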
 arch/x86/mm/checkpoint.c   |    5 +-
 arch/x86/mm/restart.c      |   54 +++++++
 checkpoint/Makefile        |    2 +-
 checkpoint/ckpt_arch.h     |    2 +
 checkpoint/restart.c       |   43 ++++++
 checkpoint/rstr_mem.c      |  351 ++++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/ckpt_hdr.h |    4 +
 include/linux/ckpt.h       |    2 +
 include/linux/ckpt_hdr.h   |    2 +-
 9 files changed, 460 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 50cfd29..534684f 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -208,17 +208,16 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
 
 	hh->ldt_entry_size = LDT_ENTRY_SIZE;
 	hh->nldt = mm->context.size;
-
 	cr_debug("nldt %d\n", hh->nldt);
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
 	if (ret < 0)
-		return ret;
+		goto out;
 
 	ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
 
+ out:
 	mutex_unlock(&mm->context.lock);
-
 	return ret;
 }
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index d7fb89a..be5e0cd 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -177,3 +177,57 @@ int cr_read_cpu(struct cr_ctx *ctx)
  out:
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, rparent;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		return -EINVAL;
+
+	/* to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt() */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+		int ret;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			return ret;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, &info, sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			return ret;
+	}
+
+	load_LDT(&mm->context);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
-		ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index 9bd0ba4..29dd326 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -6,3 +6,5 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
 
 int cr_read_thread(struct cr_ctx *ctx);
 int cr_read_cpu(struct cr_ctx *ctx);
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
+
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 5226994..f8c919d 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -77,6 +77,45 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len)
 	return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
 }
 
+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, void *fname, int flen)
+{
+	return cr_read_obj_type(ctx, fname, flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int flen, ret;
+
+	flen = PATH_MAX;
+	fname = kmalloc(flen, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, flen);
+	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
 /* read the checkpoint header */
 static int cr_read_head(struct cr_ctx *ctx)
 {
@@ -169,6 +208,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..106b635
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,351 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <asm/cacheflush.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read in directly to the address space of the current process
+ */
+
+/**
+ * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @npages - number of pages
+ */
+static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
+{
+	struct cr_pgarr *pgarr;
+	int nr, ret;
+
+	while (npages) {
+		pgarr = cr_pgarr_prep(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = min(npages, (int) pgarr->nr_free);
+		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_free -= nr;
+		pgarr->nr_used += nr;
+		npages -= nr;
+	}
+	return 0;
+}
+
+/**
+ * cr_vma_read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ * @npages - number of pages
+ */
+static int cr_vma_read_pages_contents(struct cr_ctx *ctx, int npages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrs;
+	int i, ret;
+
+	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			void *ptr = (void *) vaddrs[i];
+			ret = cr_uread(ctx, ptr, PAGE_SIZE);
+			if (ret < 0)
+				return ret;
+		}
+		npages -= pgarr->nr_used;
+	}
+	return 0;
+}
+
+/* change the protection of an address range to be writable/non-writable.
+ * this is useful when restoring the memory of a read-only vma */
+static int cr_vma_set_writable(struct mm_struct *mm, unsigned long start,
+			       unsigned long end, int writable)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long flags = 0;
+	int ret = -EINVAL;
+
+	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+	if (!vma || vma->vm_start > end || vma->vm_end < start)
+		goto out;
+	if (writable && !(vma->vm_flags & VM_WRITE))
+		flags = vma->vm_flags | VM_WRITE;
+	else if (!writable && (vma->vm_flags & VM_WRITE))
+		flags = vma->vm_flags & ~VM_WRITE;
+	cr_debug("flags %#lx\n", flags);
+	if (flags)
+		ret = mprotect_fixup(vma, &prev, vma->vm_start,
+				     vma->vm_end, flags);
+ out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
+/**
+ * cr_vma_read_pages - read in pages to restore a vma
+ * @ctx - restart context
+ * @hh - vma descriptor from restart
+ */
+static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *hh)
+{
+	struct mm_struct *mm = current->mm;
+	int ret = 0;
+
+	if (!hh->nr_pages)
+		return 0;
+
+	/* in the unlikely case that this vma is read-only */
+	if (!(hh->vm_flags & VM_WRITE))
+		ret = cr_vma_set_writable(mm, hh->vm_start, hh->vm_end, 1);
+	if (ret < 0)
+		goto out;
+	ret = cr_vma_read_pages_vaddrs(ctx, hh->nr_pages);
+	if (ret < 0)
+		goto out;
+	ret = cr_vma_read_pages_contents(ctx, hh->nr_pages);
+	if (ret < 0)
+		goto out;
+
+	cr_pgarr_reset(ctx);	/* reset page-array chain */
+
+	/* restore original protection for this vma */
+	if (!(hh->vm_flags & VM_WRITE))
+		ret = cr_vma_set_writable(mm, hh->vm_start, hh->vm_end, 0);
+
+ out:
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	unsigned long flags;
+	struct file *file = NULL;
+	int parent, ret = 0;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	cr_debug("vma %#lx-%#lx type %d nr_pages %d\n",
+		 (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
+		 (int) hh->vma_type, (int) hh->nr_pages);
+
+	if (hh->vm_end < hh->vm_start || hh->nr_pages < 0)
+		return -EINVAL;
+
+	vm_start = hh->vm_start;
+	vm_pgoff = hh->vm_pgoff;
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+
+	switch (hh->vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		/* vm_pgoff for anonymous mapping is the "global" page
+		   offset (namely from addr 0x0), so we force a zero */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		/* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
+		flags = hh->vm_flags;
+		if ((flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))
+			flags = O_RDWR;
+		else
+			flags = O_RDONLY;
+		file = cr_read_open_fname(ctx, flags, 0);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		break;
+
+	default:
+		return -EINVAL;
+
+	}
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (hh->vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_vma_read_pages(ctx, hh);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	cr_debug("vma retval %d\n", ret);
+	return 0;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_debug("CR: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	int nr, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		return -EINVAL;
+#endif
+	cr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+	if (hh->start_code > hh->end_code ||
+	    hh->start_data > hh->end_data || hh->map_count < 0)
+		return -EINVAL;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		return ret;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = cr_read_mm_context(ctx, mm, hh->objref);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
index 6bc61ac..f8eee6a 100644
--- a/include/asm-x86/ckpt_hdr.h
+++ b/include/asm-x86/ckpt_hdr.h
@@ -74,4 +74,8 @@ struct cr_hdr_mm_context {
 	__s16 nldt;
 } __attribute__((aligned(8)));
 
+
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index 5c62a90..9305e7b 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -59,6 +59,8 @@ int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);
 int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
 int cr_read_string(struct cr_ctx *ctx, void *str, int len);
+int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode);
 
 int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 int cr_read_mm(struct cr_ctx *ctx);
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index ac77d7d..f064cbb 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -102,7 +102,7 @@ enum vm_type {
 struct cr_hdr_vma {
 	__u32 vma_type;
 	__u32 _padding;
-	__s64 nr_pages;
+	__s64 nr_pages;		/* number of pages saved */
 
 	__u64 vm_start;
 	__u64 vm_end;
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC v4][PATCH 6/9] Checkpoint/restart: initial documentation
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (4 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 5/9] Memory management (restore) Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-10  7:13   ` MinChan Kim
  2008-09-09  7:42 ` [RFC v4][PATCH 7/9] Infrastructure for shared objects Oren Laadan
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
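To illustrate the record layout that the documentation below describes
(a 'struct cr_hdr' pre-header followed by 'len' bytes of payload), here
is a rough userspace sketch that walks a buffer of such records. It is
an illustration only: cr_count_records() is a hypothetical helper, the
struct mirrors the fields as documented in this patch, and it assumes
payloads directly follow their pre-header with no extra padding. Raw
data that is written without a pre-header (page addresses and page
contents) is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the pre-header as documented in this patch. */
struct cr_hdr {
	int16_t type;	/* payload type */
	int16_t len;	/* payload length in bytes */
	uint32_t id;	/* owner object instance */
};

/*
 * Count the well-formed records in buf[0..size). Returns the number
 * of records, or -1 if a record is truncated or has a negative length.
 */
static int cr_count_records(const unsigned char *buf, size_t size)
{
	size_t pos = 0;
	int count = 0;

	assert(buf != NULL || size == 0);

	while (pos < size) {
		struct cr_hdr h;

		if (size - pos < sizeof(h))
			return -1;
		/* copy out to avoid unaligned access into the stream */
		memcpy(&h, buf + pos, sizeof(h));
		pos += sizeof(h);

		if (h.len < 0 || (size_t) h.len > size - pos)
			return -1;
		pos += (size_t) h.len;
		count++;
	}
	return count;
}
```

A real reader would dispatch on h.type instead of just counting, and
would know from the VMA header how much raw page data to skip.
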
 Documentation/checkpoint.txt |  187 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 187 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..f67aef1
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,187 @@
+
+	=== Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint.
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime.
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial CR products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide CR: sys_checkpoint and
+sys_restart.  The checkpoint code basically serializes internal kernel
+state and writes it out to a file descriptor, and the resulting image
+is stream-able. More specifically, it consists of 5 steps:
+  1. Pre-dump
+  2. Freeze the container
+  3. Dump
+  4. Thaw (or kill) the container
+  5. Post-dump
+Steps 1 and 5 are an optimization to reduce application downtime:
+"pre-dump" works before freezing the container, e.g. the pre-copy for
+live migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a
+file descriptor, and re-creates the tasks and the resources they need
+to resume execution. The restart code is executed by each task that
+is restored in a new container to reconstruct its own state.
+
+
+=== Interfaces
+
+int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+  Checkpoint a container whose init task is identified by pid, to the
+  file designated by fd. Flags will have future meaning (should be 0
+  for now).
+  Returns: a positive integer that identifies the checkpoint image
+  (for future reference in case it is kept in memory) upon success,
+  0 if it returns from a restart, and -1 if an error occurs.
+
+int sys_restart(int crid, int fd, unsigned long flags);
+  Restart a container from a checkpoint image identified by crid, or
+  from the blob stored in the file designated by fd. Flags will have
+  future meaning (should be 0 for now).
+  Returns: 0 on success and -1 if an error occurs.
+
+Thus, if checkpoint is initiated by a process in the container, one
+can use logic similar to fork():
+	...
+	crid = checkpoint(...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+
+=== Checkpoint image format
+
+The checkpoint image format is composed of records, each consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 id;
+};
+
+Here, the 'type' field identifies the type of the payload and 'len' is its
+length in bytes. The 'id' identifies the owner object instance. The
+meaning of the 'id' field varies depending on the type. For example,
+for type CR_HDR_MM, the 'id' identifies the task to which this MM
+belongs. The payload also varies depending on the type, for instance,
+the data describing a task_struct is given by a 'struct cr_hdr_task'
+(type CR_HDR_TASK) and so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_hdr_vma'; if the VMA is file-mapped, it is followed by the file
+name. The cr_hdr_vma->nr_pages indicates how many pages were dumped for this
+VMA. Following comes the actual data: first the addresses of all the
+dumped pages, followed by the contents of all the dumped pages (nr_pages
+entries each). Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			addr1, addr2
+			page1, page2
+		cr_hdr + cr_hdr_vma
+			addr3, addr4, addr5
+			page3, page4, page5
+		cr_hdr + cr_hdr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+=== Changelog
+
+[2008-Sep-04] v4:
+* Fix calculation of hash table size
+* Fix header structure alignment
+* Use standard list_... for cr_pgarr
+
+[2008-Aug-20] v3:
+* Various fixes and clean-ups
+* Use standard hlist_... for hash table
+* Better use of standard kmalloc/kfree
+
+[2008-Aug-09] v2:
+* Added utsname->{release,version,machine} to checkpoint header
+* Pad header structures to 64 bits to ensure compatibility
+* Address comments from LKML and linux-containers mailing list
+
+[2008-Jul-29] v1:
+In this incarnation, CR only works on single task. The address space
+may consist of only private, simple VMAs - anonymous or file-mapped.
+Both checkpoint and restart will ignore the first argument (pid/crid)
+and instead act on themselves.
-- 
1.5.4.3



* [RFC v4][PATCH 7/9] Infrastructure for shared objects
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (5 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09  7:42 ` [RFC v4][PATCH 8/9] File descriprtors (dump) Oren Laadan
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address spaces,
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
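
The two lookup directions described above can be sketched in userspace
C. This is only an illustration, not the code in checkpoint/objhash.c:
the names (objtable, obj_add_ptr, obj_add_ref, obj_get_by_ref), the
fixed-size pool and the bucket function are all assumptions made to
keep the sketch self-contained. A given table is used in one mode only,
keyed by pointer at checkpoint and by objref at restart.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16	/* illustrative; not the size used by the patch */
#define MAXOBJS  64	/* fixed pool keeps the sketch allocation-free */

struct objref_ent {
	int objref;			/* identifier stored in the image */
	void *ptr;			/* address of the live object */
	struct objref_ent *next;	/* bucket chain */
};

struct objtable {
	struct objref_ent pool[MAXOBJS];
	struct objref_ent *bucket[NBUCKETS];
	int used;
	int next_ref;			/* next identifier to hand out */
};

static void objtable_init(struct objtable *t)
{
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		t->bucket[i] = NULL;
	t->used = 0;
	t->next_ref = 1;		/* 0 is never a valid objref here */
}

/* hypothetical bucket function; any reasonable mixing would do */
static unsigned int bucket_of(uintptr_t key)
{
	return (unsigned int)((key ^ (key >> 4)) % NBUCKETS);
}

static struct objref_ent *ent_new(struct objtable *t, unsigned int b)
{
	struct objref_ent *e;

	if (t->used == MAXOBJS)
		return NULL;
	e = &t->pool[t->used++];
	e->next = t->bucket[b];
	t->bucket[b] = e;
	return e;
}

/*
 * Checkpoint side, keyed by pointer. Returns 1 if ptr was newly added
 * (caller dumps the full state), 0 if it was already known (caller
 * dumps only *objref), or -1 if the pool is exhausted.
 */
static int obj_add_ptr(struct objtable *t, void *ptr, int *objref)
{
	unsigned int b = bucket_of((uintptr_t)ptr);
	struct objref_ent *e;

	for (e = t->bucket[b]; e; e = e->next) {
		if (e->ptr == ptr) {
			*objref = e->objref;
			return 0;
		}
	}
	e = ent_new(t, b);
	if (!e)
		return -1;
	e->ptr = ptr;
	e->objref = t->next_ref++;
	*objref = e->objref;
	return 1;
}

/* Restart side, keyed by objref. Returns 0 on success, -1 if full. */
static int obj_add_ref(struct objtable *t, void *ptr, int objref)
{
	struct objref_ent *e = ent_new(t, bucket_of((uintptr_t)objref));

	if (!e)
		return -1;
	e->ptr = ptr;
	e->objref = objref;
	return 0;
}

/* Restart side: the object registered for objref, or NULL if unknown. */
static void *obj_get_by_ref(struct objtable *t, int objref)
{
	struct objref_ent *e;

	for (e = t->bucket[bucket_of((uintptr_t)objref)]; e; e = e->next)
		if (e->objref == objref)
			return e->ptr;
	return NULL;
}
```

At checkpoint a caller would dump the full state only when obj_add_ptr()
returns 1; at restart a NULL from obj_get_by_ref() means the state still
has to be read and registered with obj_add_ref().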

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint.txt |   38 +++++++
 checkpoint/Makefile          |    2 +-
 checkpoint/objhash.c         |  237 ++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c             |    4 +
 include/linux/ckpt.h         |   18 +++
 5 files changed, 298 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index f67aef1..97eee24 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -163,6 +163,44 @@ cr_hdr + cr_hdr_task
 cr_hdr + cr_hdr_tail
 
 
+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects in the following manner.
+
+On the first encounter, the state is dumped and the object is assigned
+a unique identifier and also stored in the hash table (indexed by its
+physical kernel address). From then on the object will be found in the
+hash and only its identifier is saved.
+
+On restart the identifier is looked up in the hash table; if not found
+then the state is read, the object is created, and added to the hash
+table (this time indexed by its identifier). Otherwise, the object in
+the hash table is used.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr - find the unique identifier - object reference (objref)
+  of the object that is pointed to by ptr (or -ESRCH if not found) [checkpoint]
+
+cr_obj_add_ptr - add the object pointed to by ptr to the hash table if
+  it isn't already there, and fill its unique identifier (objref); will
+  return 0 if already found in the hash, or 1 otherwise [checkpoint]
+
+cr_obj_get_by_ref - return the pointer to the object whose unique identifier
+  is equal to objref [restart]
+
+cr_obj_add_ref - add the object with unique identifier objref, pointed to by
+  ptr to the hash table; will return 0 on success, or a negative error
+  code otherwise [restart]
+
+
 === Changelog
 
 [2008-Sep-04] v4:
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
 		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..79d5b70
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,237 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/ckpt.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int objref_index;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->objref_index = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr((void *) objref, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (obj) {
+		int i;
+
+		obj->ptr = ptr;
+		obj->type = type;
+		obj->flags = flags;
+
+		if (objref) {
+			/* use 'objref' to index (restart) */
+			obj->objref = objref;
+			i = hash_ptr((void *) objref, CR_OBJHASH_NBITS);
+		} else {
+			/* use 'ptr' to index, assign objref (checkpoint) */
+			obj->objref = ctx->objhash->objref_index++;
+			i = hash_ptr(ptr, CR_OBJHASH_NBITS);
+		}
+
+		hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+		cr_obj_ref_grab(obj);
+	}
+	return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Fills the unique identifier of the object into @objref
+ *
+ * returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique identifier to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique identifier (objref) of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (obj)
+		return obj->type == type ? obj->objref : -EINVAL;
+	else
+		return -ESRCH;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique identifier (objref)
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (obj)
+		return obj->type == type ? obj->ptr : ERR_PTR(-EINVAL);
+	else
+		return NULL;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 8141161..4f33ac4 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -131,6 +131,7 @@ void cr_ctx_free(struct cr_ctx *ctx)
 		path_put(ctx->vfsroot);
 
 	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);
 
 	kfree(ctx);
 }
@@ -154,6 +155,9 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto nomem;
 
+	if (cr_objhash_alloc(ctx) < 0)
+		goto nomem;
+
 	/* assume checkpointer is in container's root vfs */
 	/* FIXME: this works for now, but will change with real containers */
 	ctx->vfsroot = &current->fs->root;
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index 9305e7b..d73f79e 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -28,6 +28,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct cr_objhash *objhash;	/* hash for shared objects */
+
 	struct list_head pgarr;	/* page array for dumping VMA contents */
 	struct cr_pgarr *pgcur;	/* current position in page array */
 
@@ -50,6 +52,22 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+void cr_objhash_free(struct cr_ctx *ctx);
+int cr_objhash_alloc(struct cr_ctx *ctx);
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type);
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags);
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags);
+
 struct cr_hdr;
 
 int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3



* [RFC v4][PATCH 8/9] File descriprtors (dump)
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (6 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 7/9] Infrastructure for shared objects Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09  8:06   ` Vegard Nossum
                     ` (2 more replies)
  2008-09-09  7:42 ` [RFC v4][PATCH 9/9] File descriprtors (restore) Oren Laadan
  2008-09-09 18:06 ` [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Dave Hansen
  9 siblings, 3 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned an
objref and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its objref
and its close-on-exec property. If the underlying file is seen for the
first time, this is followed by a 'struct cr_hdr_fd_data' with the FD
state. Then comes the next FD, and so on.
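
To make the resulting record order concrete, here is a small userspace
sketch that lays out the records for a set of descriptors. It is an
illustration under assumptions, not the code in ckpt_file.c: struct rec,
the REC_* constants and layout_fd_records() are hypothetical names, and
a linear scan stands in for the object hash.

```c
#include <stddef.h>

/* stand-ins for the CR_HDR_FD_ENT / CR_HDR_FD_DATA record types */
enum rec_type { REC_FD_ENT = 1, REC_FD_DATA };

struct rec {
	enum rec_type type;
	int fd;		/* descriptor number; -1 for REC_FD_DATA */
	int objref;	/* identifier of the file behind the descriptor */
};

/*
 * Lay out the records for nfds descriptors, where files[i] is the open
 * file behind fds[i]. objrefs[] (nfds entries) receives the identifier
 * chosen for each descriptor; out[] must hold up to 2 * nfds records.
 * Returns the number of records written.
 */
static size_t layout_fd_records(const int *fds, void *const *files,
				int *objrefs, size_t nfds, struct rec *out)
{
	size_t i, j, n = 0;
	int next_ref = 1;

	for (i = 0; i < nfds; i++) {
		int is_new = 1;

		/* has an earlier descriptor already referenced this file? */
		for (j = 0; j < i; j++) {
			if (files[j] == files[i]) {
				objrefs[i] = objrefs[j];
				is_new = 0;
				break;
			}
		}
		if (is_new)
			objrefs[i] = next_ref++;

		/* every descriptor gets an entry record */
		out[n].type = REC_FD_ENT;
		out[n].fd = fds[i];
		out[n].objref = objrefs[i];
		n++;

		/* the file state follows only on the first encounter */
		if (is_new) {
			out[n].type = REC_FD_DATA;
			out[n].fd = -1;
			out[n].objref = objrefs[i];
			n++;
		}
	}
	return n;
}
```

For three descriptors where the first two share one open file, the
sketch yields five records: entry, data, entry, entry, data.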

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile      |    2 +-
 checkpoint/checkpoint.c  |    4 +
 checkpoint/ckpt_file.c   |  221 ++++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/ckpt_file.h   |   17 ++++
 include/linux/ckpt.h     |    7 +-
 include/linux/ckpt_hdr.h |   34 +++++++-
 6 files changed, 280 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/ckpt_file.c
 create mode 100644 checkpoint/ckpt_file.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4dae775..aebbf22 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -217,6 +217,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..ca58b28
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,221 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_file.h"
+
+#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ * @return: the number of open fds found
+ *
+ * Allocates the file descriptors array (*fdtable), caller should free
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fdlist;
+	int i, n, max;
+
+	n = 0;
+	max = CR_DEFAULT_FDTABLE;
+
+	fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
+	if (!fdlist)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	for (i = 0; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == max) {
+			/* fcheck_files() is safe with drop/re-acquire
+			 * of the lock, as it tests:  fd < max_fds */
+			spin_unlock(&files->file_lock);
+			max *= 2;
+			if (max < 0) {	/* overflow ? */
+				n = -EMFILE;
+				goto out;
+			}
+			fdlist = krealloc(fdlist, max, GFP_KERNEL);
+			if (!fdlist) {
+				n = -ENOMEM;
+				goto out;
+			}
+			spin_lock(&files->file_lock);
+		}
+		fdlist[n++] = i;
+	}
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fdlist;
+ out:
+	return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	enum fd_type fd_type;
+	int ret;
+
+	h.type = CR_HDR_FD_DATA;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_uid = file->f_uid;
+	hh->f_gid = file->f_gid;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->f_owner */
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		fd_type = CR_FD_FILE;
+		break;
+	case S_IFDIR:
+		fd_type = CR_FD_DIR;
+		break;
+	case S_IFLNK:
+		fd_type = CR_FD_LINK;
+		break;
+	default:
+		return -EBADF;
+	}
+
+	/* FIX: check if the file/dir/link is unlinked */
+	hh->fd_type = fd_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Save the state of the file descriptor; look up the actual file pointer
+ * in the hash table, and if found save the matching objref, otherwise call
+ * cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int coe, objref, new, ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file)
+		return -EBADF;
+
+	new = cr_obj_add_ptr(ctx, (void *) file, &objref, CR_OBJ_FILE, 0);
+	cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
+
+	if (new < 0)
+		return new;
+
+	h.type = CR_HDR_FD_ENT;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->objref = objref;
+	hh->fd = fd;
+	hh->close_on_exec = coe;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new)
+		ret = cr_write_fd_data(ctx, file, objref);
+
+	fput(file);
+	return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files;
+	int *fdtable;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FILES;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	files = get_files_struct(t);
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto clean;
+
+	cr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, n);
+		if (ret < 0)
+			break;
+	}
+
+ clean:
+	kfree(fdtable);
+ out:
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/ckpt_file.h b/checkpoint/ckpt_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/ckpt_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index d73f79e..ad46baf 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -13,7 +13,7 @@
 #include <linux/path.h>
 #include <linux/fs.h>
 
-#define CR_VERSION  1
+#define CR_VERSION  2
 
 struct cr_ctx {
 	pid_t pid;		/* container identifier */
@@ -80,11 +80,12 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len);
 int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
 struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode);
 
+int do_checkpoint(struct cr_ctx *ctx);
 int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
-int cr_read_mm(struct cr_ctx *ctx);
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
-int do_checkpoint(struct cr_ctx *ctx);
 int do_restart(struct cr_ctx *ctx);
+int cr_read_mm(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[CR:%s] " fmt, __func__, ## args)
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index f064cbb..f868dce 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -17,7 +17,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned(8))) for the entire structure.
  */
 
 /* records: generic header */
@@ -42,6 +42,10 @@ enum {
 	CR_HDR_VMA,
 	CR_HDR_MM_CONTEXT,
 
+	CR_HDR_FILES = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FD_DATA,
+
 	CR_HDR_TAIL = 5001
 };
 
@@ -112,4 +116,32 @@ struct cr_hdr_vma {
 
 } __attribute__((aligned(8)));
 
+struct cr_hdr_files {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__u32 objref;		/* identifier for shared objects */
+	__s32 fd;
+	__u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum  fd_type {
+	CR_FD_FILE = 1,
+	CR_FD_DIR,
+	CR_FD_LINK
+};
+
+struct cr_hdr_fd_data {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u32 f_uid;
+	__u32 f_gid;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3



* [RFC v4][PATCH 9/9] File descriprtors (restore)
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (7 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 8/9] File descriprtors (dump) Oren Laadan
@ 2008-09-09  7:42 ` Oren Laadan
  2008-09-09 16:26   ` Dave Hansen
  2008-09-09 18:06 ` [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Dave Hansen
  9 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  7:42 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and look up objref in the hash table; if not found (first occurrence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.
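
A freshly attached file lands on whatever descriptor number happens to
be free, so the restore code in this patch then moves it to the number
recorded in the image with dup2() and reapplies close-on-exec. A
userspace analogue of that last step, using only POSIX calls, might
look like the sketch below; place_fd() is a hypothetical helper name,
not something the patch defines.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Move 'newfd' to descriptor number 'wanted' and, if requested, mark it
 * close-on-exec. Returns 'wanted' on success, or -1 with errno set.
 */
static int place_fd(int newfd, int wanted, int close_on_exec)
{
	if (newfd != wanted) {
		/* dup2() silently closes 'wanted' first if it was open */
		if (dup2(newfd, wanted) < 0)
			return -1;
		close(newfd);
	}

	/*
	 * dup2() does not carry FD_CLOEXEC over to the new descriptor,
	 * so the flag has to be set after the move.
	 */
	if (close_on_exec) {
		int flags = fcntl(wanted, F_GETFD);

		if (flags < 0)
			return -1;
		if (fcntl(wanted, F_SETFD, flags | FD_CLOEXEC) < 0)
			return -1;
	}
	return wanted;
}
```

The ordering is the same as in cr_read_fd_ent(): relocate first, then
set close-on-exec on the final descriptor number.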

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile    |    2 +-
 checkpoint/restart.c   |    4 +
 checkpoint/rstr_file.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ckpt.h   |    1 +
 4 files changed, 211 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f8c919d..bc49523 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -212,6 +212,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..28c4109
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,205 @@
+/*
+ *  Restart file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int n;
+
+	do {
+		n = cr_scan_fds(files, &fdtable);
+		if (n < 0)
+			return n;
+		while (n--)
+			sys_close(fdtable[n]);
+		kfree(fdtable);
+	} while (n != -1);
+
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int parent)
+{
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int fd, rparent, ret;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+	cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+		 rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_FILE:
+	case CR_FD_DIR:
+	case CR_FD_LINK:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		file = ERR_PTR(-EINVAL);
+		break;
+	}
+
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		filp_close(file, NULL);
+		return fd;
+	}
+
+	/* register new <objref, file> tuple in hash table */
+	ret = cr_obj_add_ref(ctx, (void *) file, parent, CR_OBJ_FILE, 0);
+	if (ret < 0)
+		goto out;
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @parent: parent objref
+ *
+ * Restore the state of a file descriptor; look up the objref (in the header)
+ * in the hash table, and if found pick the matching file pointer and use
+ * it; otherwise call cr_read_fd_data to restore the file pointer too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int parent)
+{
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int newfd, rparent;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	cr_debug("rparent %d parent %d ref %d\n", rparent, parent, hh->objref);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+	cr_debug("fd %d coe %d\n", hh->fd, hh->close_on_exec);
+	if (hh->objref <= 0)
+		return -EINVAL;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	if (file) {
+		newfd = cr_attach_file(file);
+		if (newfd < 0)
+			return newfd;
+		get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_fd_data(ctx, files, hh->objref);
+		if (newfd < 0)
+			return newfd;
+	}
+
+	cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't the desired fd, use dup2() to relocate it */
+	if (newfd != hh->fd) {
+		int ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			return ret;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files = current->files;
+	int n, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		return -EINVAL;
+#endif
+	cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+	if (hh->objref < 0 || hh->nfds < 0)
+		return -EINVAL;
+
+	if (hh->nfds > sysctl_nr_open)
+		return -EMFILE;
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(files);
+	if (ret < 0)
+		return ret;
+
+	for (n = 0; n < hh->nfds; n++) {
+		ret = cr_read_fd_ent(ctx, files, hh->objref);
+		if (ret < 0)
+			break;
+	}
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index ad46baf..1086670 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -86,6 +86,7 @@ int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 int do_restart(struct cr_ctx *ctx);
 int cr_read_mm(struct cr_ctx *ctx);
+int cr_read_files(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[CR:%s] " fmt, __func__, ## args)
-- 
1.5.4.3



* Re: [RFC v4][PATCH 8/9] File descriprtors (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 8/9] File descriprtors (dump) Oren Laadan
@ 2008-09-09  8:06   ` Vegard Nossum
  2008-09-09  8:23   ` Vegard Nossum
  2008-09-11  5:02   ` MinChan Kim
  2 siblings, 0 replies; 43+ messages in thread
From: Vegard Nossum @ 2008-09-09  8:06 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Tue, Sep 9, 2008 at 9:42 AM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> +       fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
> +       if (!fdlist)
> +               return -ENOMEM;
> +
> +       spin_lock(&files->file_lock);
> +       fdt = files_fdtable(files);
> +       for (i = 0; i < fdt->max_fds; i++) {
> +               if (!fcheck_files(files, i))
> +                       continue;
> +               if (n == max) {
> +                       /* fcheck_files() is safe with drop/re-acquire
> +                        * of the lock, as it tests:  fd < max_fds */
> +                       spin_unlock(&files->file_lock);
> +                       max *= 2;
> +                       if (max < 0) {  /* overflow ? */
> +                               n = -EMFILE;
> +                               goto out;
> +                       }
> +                       fdlist = krealloc(fdlist, max, GFP_KERNEL);

Hm, shouldn't this be max * sizeof(*fdlist) like the kmalloc above?
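
For illustration, a minimal user-space sketch of the sizing fix suggested
above. It is not code from the patch: collect_fds() is a hypothetical
name, and malloc()/realloc() stand in for kmalloc()/krealloc(). The
capacity is tracked in elements and converted to bytes at every
allocation; the overflow test is done before doubling, and the old block
is freed if the resize fails.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogue of the fd-collection loop: append
 * `count` values to a heap array, doubling its capacity when it is full.
 * Returns the number of entries stored, or a negative errno value.
 */
static int collect_fds(int **out, int count)
{
	int max = 4;	/* capacity in elements, not bytes */
	int n = 0;
	int *fdlist, *tmp;
	int i;

	fdlist = malloc(max * sizeof(*fdlist));
	if (!fdlist)
		return -ENOMEM;

	for (i = 0; i < count; i++) {
		if (n == max) {
			if (max > INT_MAX / 2) {	/* doubling would overflow */
				free(fdlist);
				return -EMFILE;
			}
			max *= 2;
			/* size in bytes, matching the initial allocation */
			tmp = realloc(fdlist, max * sizeof(*fdlist));
			if (!tmp) {
				free(fdlist);	/* old block is still valid on failure */
				return -ENOMEM;
			}
			fdlist = tmp;
		}
		fdlist[n++] = i;
	}

	*out = fdlist;
	return n;
}
```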


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

* Re: [RFC v4][PATCH 3/9] x86 support for checkpoint/restart
  2008-09-09  7:42 ` [RFC v4][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
@ 2008-09-09  8:17   ` Ingo Molnar
  2008-09-09 23:23     ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-09-09  8:17 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers


* Oren Laadan <orenl@cs.columbia.edu> wrote:

> +	/* for checkpoint in process context (from within a container)
> +	   the GS and FS registers should be saved from the hardware;
> +	   otherwise they are already sabed on the thread structure */

please use the correct comment style consistently throughout your 
patches. The correct one is like this one:

> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * the actual syscall is taking place at this very moment; so
> +	 * we (optimistically) subtitute the future return value (0) of
> +	 * this syscall into the orig_eax, so that upon restart it will
> +	 * succeed (or it will endlessly retry checkpoint...)
> +	 */

incorrect/inconsistent ones are like these:

> +		/* normally, no need to unlazy_fpu(), since TS_USEDFPU flag
> +		 * have been cleared when task was conexted-switched out...
> +		 * except if we are in process context, in which case we do */

> +		/* restore TLS by hand: why convert to struct user_desc if
> +		 * sys_set_thread_entry() will convert it back ? */

> +			/* FIX: add sanity checks (eg. that values makes
> +			 * sense, that we don't overwrite old values, etc */

(and there are many more examples throughout the series)

> +int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	/* debug regs */
> +
> +	preempt_disable();
> +
> +	if (hh->uses_debug) {
> +		set_debugreg(hh->debugreg0, 0);
> +		set_debugreg(hh->debugreg1, 1);
> +		/* ignore 4, 5 */
> +		set_debugreg(hh->debugreg2, 2);
> +		set_debugreg(hh->debugreg3, 3);
> +		set_debugreg(hh->debugreg6, 6);
> +		set_debugreg(hh->debugreg7, 7);
> +	}
> +
> +	preempt_enable();
> +
> +	return 0;
> +}

hm, the preemption disabling seems pointless here. What does it protect 
against?

> +++ b/checkpoint/ckpt_arch.h
> @@ -0,0 +1,7 @@
> +#include <linux/ckpt.h>
> +
> +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
> +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +
> +int cr_read_thread(struct cr_ctx *ctx);
> +int cr_read_cpu(struct cr_ctx *ctx);

please add 'extern' to prototypes in include files.

> @@ -15,6 +15,8 @@
>  #include <linux/ckpt.h>
>  #include <linux/ckpt_hdr.h>
>  
> +#include "ckpt_arch.h"
> +

plsdntuseannyngabbrvtsngnrcd. [1]

"checkpoint_" should be just fine in most cases.

	Ingo

[1] (please don't use annoying abbreviations in generic code)

* Re: [RFC v4][PATCH 8/9] File descriprtors (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 8/9] File descriprtors (dump) Oren Laadan
  2008-09-09  8:06   ` Vegard Nossum
@ 2008-09-09  8:23   ` Vegard Nossum
  2008-09-10  2:01     ` Oren Laadan
  2008-09-11  5:02   ` MinChan Kim
  2 siblings, 1 reply; 43+ messages in thread
From: Vegard Nossum @ 2008-09-09  8:23 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

Hi,

Below are some concerns; I would be grateful for explanations (or
pointers if I missed them before).

On Tue, Sep 9, 2008 at 9:42 AM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> +/* cr_write_fd_data - dump the state of a given file pointer */
> +static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct dentry *dent = file->f_dentry;
> +       struct inode *inode = dent->d_inode;
> +       enum fd_type fd_type;
> +       int ret;
> +
> +       h.type = CR_HDR_FD_DATA;
> +       h.len = sizeof(*hh);
> +       h.parent = parent;
> +
> +       hh->f_flags = file->f_flags;
> +       hh->f_mode = file->f_mode;
> +       hh->f_pos = file->f_pos;
> +       hh->f_uid = file->f_uid;
> +       hh->f_gid = file->f_gid;
> +       hh->f_version = file->f_version;
> +       /* FIX: need also file->f_owner */
> +
> +       switch (inode->i_mode & S_IFMT) {
> +       case S_IFREG:
> +               fd_type = CR_FD_FILE;
> +               break;
> +       case S_IFDIR:
> +               fd_type = CR_FD_DIR;
> +               break;
> +       case S_IFLNK:
> +               fd_type = CR_FD_LINK;
> +               break;
> +       default:
> +               return -EBADF;
> +       }

Should cr_hbuf_put() come before the return here?

As far as I've understood, "leaking" the buffer size/data isn't
critical (1. because it's just some extra space, and/or 2. the buffer
is discarded on error anyway). The code looks really unbalanced
without it, though. I guess it should at least be documented?
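
One way to keep the get/put pair balanced is a single exit label, so the
error path releases the buffer too. The following is only a user-space
sketch under assumptions: struct hbuf, hbuf_get(), hbuf_put() and
write_fd_data() are hypothetical stand-ins for the context buffer and
the function quoted above, and the toy allocator does no bounds checking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the context's header buffer: a bump allocator. */
struct hbuf {
	char data[256];
	size_t pos;
};

static void *hbuf_get(struct hbuf *hb, size_t n)
{
	void *p = &hb->data[hb->pos];

	hb->pos += n;	/* no bounds check: sketch only */
	return p;
}

static void hbuf_put(struct hbuf *hb, size_t n)
{
	hb->pos -= n;
}

struct fd_data {
	int fd_type;
};

/* Every path out of the function, including the error path, does the put. */
static int write_fd_data(struct hbuf *hb, int mode)
{
	struct fd_data *hh = hbuf_get(hb, sizeof(*hh));
	int ret = 0;

	switch (mode) {
	case 1:
		hh->fd_type = 1;
		break;
	default:
		ret = -EBADF;	/* unsupported type */
		goto out;
	}

	/* ... the record would be written out here ... */
 out:
	hbuf_put(hb, sizeof(*hh));
	return ret;
}
```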

> +
> +       /* FIX: check if the file/dir/link is unlinked */
> +       hh->fd_type = fd_type;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;
> +
> +       return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
> +}
> +
> +/**
> + * cr_write_fd_ent - dump the state of a given file descriptor
> + * @ctx: checkpoint context
> + * @files: files_struct pointer
> + * @fd: file descriptor
> + *
> + * Save the state of the file descriptor; look up the actual file pointer
> + * in the hash table, and if found save the matching objref, otherwise call
> + * cr_write_fd_data to dump the file pointer too.
> + */
> +static int
> +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct file *file = NULL;
> +       struct fdtable *fdt;
> +       int coe, objref, new, ret;
> +
> +       rcu_read_lock();
> +       fdt = files_fdtable(files);
> +       file = fcheck_files(files, fd);
> +       if (file) {
> +               coe = FD_ISSET(fd, fdt->close_on_exec);
> +               get_file(file);
> +       }
> +       rcu_read_unlock();
> +
> +       /* sanity check (although this shouldn't happen) */
> +       if (!file)
> +               return -EBADF;
> +
> +       new = cr_obj_add_ptr(ctx, (void *) file, &objref, CR_OBJ_FILE, 0);
> +       cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
> +
> +       if (new < 0)
> +               return new;

fput() and/or cr_hbuf_put()?

> +
> +       h.type = CR_HDR_FD_ENT;
> +       h.len = sizeof(*hh);
> +       h.parent = 0;
> +
> +       hh->objref = objref;
> +       hh->fd = fd;
> +       hh->close_on_exec = coe;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;
> +
> +       /* new==1 if-and-only-if file was newly added to hash */
> +       if (new)
> +               ret = cr_write_fd_data(ctx, file, objref);
> +
> +       fput(file);
> +       return ret;
> +}
> +
> +int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct files_struct *files;
> +       int *fdtable;
> +       int nfds, n, ret;
> +
> +       h.type = CR_HDR_FILES;
> +       h.len = sizeof(*hh);
> +       h.parent = task_pid_vnr(t);
> +
> +       files = get_files_struct(t);
> +
> +       hh->objref = 0; /* will be meaningful with multiple processes */
> +
> +       nfds = cr_scan_fds(files, &fdtable);
> +       if (nfds < 0) {
> +               ret = nfds;

cr_hbuf_put()?

> +               goto out;
> +       }
> +
> +       hh->nfds = nfds;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               goto clean;
> +
> +       cr_debug("nfds %d\n", nfds);
> +       for (n = 0; n < nfds; n++) {
> +               ret = cr_write_fd_ent(ctx, files, n);
> +               if (ret < 0)
> +                       break;
> +       }
> +
> + clean:
> +       kfree(fdtable);
> + out:
> +       put_files_struct(files);
> +
> +       return ret;
> +}


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
@ 2008-09-09  9:22   ` Vegard Nossum
  2008-09-10  7:51   ` MinChan Kim
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 43+ messages in thread
From: Vegard Nossum @ 2008-09-09  9:22 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Tue, Sep 9, 2008 at 9:42 AM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
> it will be followed by the file name.  The cr_vma->npages will tell
> how many pages were dumped for this VMA.  Then it will be followed
> by the actual data: first a dump of the addresses of all dumped
> pages (npages entries) followed by a dump of the contents of all
> dumped pages (npages pages). Then will come the next VMA and so on.
>
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>  arch/x86/mm/checkpoint.c   |   30 +++
>  arch/x86/mm/restart.c      |    1 +
>  checkpoint/Makefile        |    3 +-
>  checkpoint/checkpoint.c    |   53 ++++++
>  checkpoint/ckpt_arch.h     |    1 +
>  checkpoint/ckpt_mem.c      |  448 ++++++++++++++++++++++++++++++++++++++++++++
>  checkpoint/ckpt_mem.h      |   35 ++++
>  checkpoint/sys.c           |   23 ++-
>  include/asm-x86/ckpt_hdr.h |    5 +
>  include/linux/ckpt.h       |   12 ++
>  include/linux/ckpt_hdr.h   |   30 +++
>  11 files changed, 635 insertions(+), 6 deletions(-)
>  create mode 100644 checkpoint/ckpt_mem.c
>  create mode 100644 checkpoint/ckpt_mem.h
>
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> index 71d21e6..50cfd29 100644
> --- a/arch/x86/mm/checkpoint.c
> +++ b/arch/x86/mm/checkpoint.c
> @@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
>        cr_hbuf_put(ctx, sizeof(*hh));
>        return ret;
>  }
> +
> +/* dump the mm->context state */
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int ret;
> +
> +       h.type = CR_HDR_MM_CONTEXT;
> +       h.len = sizeof(*hh);
> +       h.parent = parent;
> +
> +       mutex_lock(&mm->context.lock);
> +
> +       hh->ldt_entry_size = LDT_ENTRY_SIZE;
> +       hh->nldt = mm->context.size;
> +
> +       cr_debug("nldt %d\n", hh->nldt);
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;

mutex_unlock(&mm->context.lock) before return, I think?
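
A sketch of the shape this would take, with the unlock placed at a label
that both the success and the failure path pass through. Everything here
is hypothetical and user-space only: struct toy_lock replaces the mutex
so the example runs single-threaded, and write_step() stands in for
cr_write_obj()/cr_kwrite().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical toy lock so the sketch runs single-threaded in user space. */
struct toy_lock {
	int held;
};

static void toy_acquire(struct toy_lock *l)
{
	l->held = 1;
}

static void toy_release(struct toy_lock *l)
{
	l->held = 0;
}

/* Stand-in for a write that may fail. */
static int write_step(int fail)
{
	return fail ? -EIO : 0;
}

static int write_mm_context(struct toy_lock *lock, int fail_header)
{
	int ret;

	toy_acquire(lock);

	ret = write_step(fail_header);	/* header */
	if (ret < 0)
		goto out_unlock;

	ret = write_step(0);		/* payload */

 out_unlock:
	/* the label sits before the unlock, so the error path releases too */
	toy_release(lock);
	return ret;
}
```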

> +
> +       ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
> +
> +       mutex_unlock(&mm->context.lock);
> +
> +       return ret;
> +}


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-09  7:42 ` [RFC v4][PATCH 5/9] Memory managemnet (restore) Oren Laadan
@ 2008-09-09 16:07   ` Serge E. Hallyn
  2008-09-09 23:35     ` Oren Laadan
  2008-09-10 19:31   ` Dave Hansen
  1 sibling, 1 reply; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-09 16:07 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd

Quoting Oren Laadan (orenl@cs.columbia.edu):
> Restoring the memory address space begins with nuking the existing one
> of the current process, and then reading the VMA state and contents.
> Call do_mmap_pgoffset() for each VMA and then read in the data.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>  arch/x86/mm/checkpoint.c   |    5 +-
>  arch/x86/mm/restart.c      |   54 +++++++
>  checkpoint/Makefile        |    2 +-
>  checkpoint/ckpt_arch.h     |    2 +
>  checkpoint/restart.c       |   43 ++++++
>  checkpoint/rstr_mem.c      |  351 ++++++++++++++++++++++++++++++++++++++++++++
>  include/asm-x86/ckpt_hdr.h |    4 +
>  include/linux/ckpt.h       |    2 +
>  include/linux/ckpt_hdr.h   |    2 +-
>  9 files changed, 460 insertions(+), 5 deletions(-)
>  create mode 100644 checkpoint/rstr_mem.c
> 
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> index 50cfd29..534684f 100644
> --- a/arch/x86/mm/checkpoint.c
> +++ b/arch/x86/mm/checkpoint.c
> @@ -208,17 +208,16 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> 
>  	hh->ldt_entry_size = LDT_ENTRY_SIZE;
>  	hh->nldt = mm->context.size;
> -
>  	cr_debug("nldt %d\n", hh->nldt);
> 
>  	ret = cr_write_obj(ctx, &h, hh);
>  	cr_hbuf_put(ctx, sizeof(*hh));
>  	if (ret < 0)
> -		return ret;
> +		goto out;
> 
>  	ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
> 
>  	mutex_unlock(&mm->context.lock);
> -
> + out:
>  	return ret;
>  }
> diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
> index d7fb89a..be5e0cd 100644
> --- a/arch/x86/mm/restart.c
> +++ b/arch/x86/mm/restart.c
> @@ -177,3 +177,57 @@ int cr_read_cpu(struct cr_ctx *ctx)
>   out:
>  	return ret;
>  }
> +
> +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int n, rparent;
> +
> +	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
> +	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
> +	if (rparent < 0)
> +		return rparent;
> +	if (rparent != parent)
> +		return -EINVAL;
> +
> +	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
> +		return -EINVAL;
> +
> +	/* to utilize the syscall modify_ldt() we first convert the data
> +	 * in the checkpoint image from 'struct desc_struct' to 'struct
> +	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt() */
> +
> +	for (n = 0; n < hh->nldt; n++) {
> +		struct user_desc info;
> +		struct desc_struct desc;
> +		mm_segment_t old_fs;
> +		int ret;
> +
> +		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
> +		if (ret < 0)
> +			return ret;
> +
> +		info.entry_number = n;
> +		info.base_addr = desc.base0 | (desc.base1 << 16);
> +		info.limit = desc.limit0;
> +		info.seg_32bit = desc.d;
> +		info.contents = desc.type >> 2;
> +		info.read_exec_only = (desc.type >> 1) ^ 1;
> +		info.limit_in_pages = desc.g;
> +		info.seg_not_present = desc.p ^ 1;
> +		info.useable = desc.avl;
> +
> +		old_fs = get_fs();
> +		set_fs(get_ds());
> +		ret = sys_modify_ldt(1, &info, sizeof(info));
> +		set_fs(old_fs);
> +
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	load_LDT(&mm->context);
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return 0;
> +}
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index 3a0df6d..ac35033 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -3,4 +3,4 @@
>  #
> 
>  obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
> -		ckpt_mem.o
> +		ckpt_mem.o rstr_mem.o
> diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
> index 9bd0ba4..29dd326 100644
> --- a/checkpoint/ckpt_arch.h
> +++ b/checkpoint/ckpt_arch.h
> @@ -6,3 +6,5 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
> 
>  int cr_read_thread(struct cr_ctx *ctx);
>  int cr_read_cpu(struct cr_ctx *ctx);
> +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
> +
> diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> index 5226994..f8c919d 100644
> --- a/checkpoint/restart.c
> +++ b/checkpoint/restart.c
> @@ -77,6 +77,45 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len)
>  	return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
>  }
> 
> +/**
> + * cr_read_fname - read a file name
> + * @ctx: checkpoint context
> + * @fname: buffer
> + * @n: buffer length
> + */
> +int cr_read_fname(struct cr_ctx *ctx, void *fname, int flen)
> +{
> +	return cr_read_obj_type(ctx, fname, flen, CR_HDR_FNAME);
> +}
> +
> +/**
> + * cr_read_open_fname - read a file name and open a file
> + * @ctx: checkpoint context
> + * @flags: file flags
> + * @mode: file mode
> + */
> +struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
> +{
> +	struct file *file;
> +	char *fname;
> +	int flen, ret;
> +
> +	flen = PATH_MAX;
> +	fname = kmalloc(flen, GFP_KERNEL);
> +	if (!fname)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = cr_read_fname(ctx, fname, flen);
> +	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
> +	if (ret >= 0)
> +		file = filp_open(fname, flags, mode);
> +	else
> +		file = ERR_PTR(ret);
> +
> +	kfree(fname);
> +	return file;
> +}
> +
>  /* read the checkpoint header */
>  static int cr_read_head(struct cr_ctx *ctx)
>  {
> @@ -169,6 +208,10 @@ static int cr_read_task(struct cr_ctx *ctx)
>  	cr_debug("task_struct: ret %d\n", ret);
>  	if (ret < 0)
>  		goto out;
> +	ret = cr_read_mm(ctx);
> +	cr_debug("memory: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_read_thread(ctx);
>  	cr_debug("thread: ret %d\n", ret);
>  	if (ret < 0)
> diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
> new file mode 100644
> index 0000000..106b635
> --- /dev/null
> +++ b/checkpoint/rstr_mem.c
> @@ -0,0 +1,351 @@
> +/*
> + *  Restart memory contents
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/fcntl.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/uaccess.h>
> +#include <linux/mm_types.h>
> +#include <linux/mman.h>
> +#include <linux/mm.h>
> +#include <linux/err.h>
> +#include <asm/cacheflush.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +#include "ckpt_arch.h"
> +#include "ckpt_mem.h"
> +
> +/*
> + * Unlike checkpoint, restart is executed in the context of each restarting
> + * process: vma regions are restored via a call to mmap(), and the data is
> + * read in directly to the address space of the current process
> + */
> +
> +/**
> + * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
> + * @ctx - restart context
> + * @npages - number of pages
> + */
> +static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
> +{
> +	struct cr_pgarr *pgarr;
> +	int nr, ret;
> +
> +	while (npages) {
> +		pgarr = cr_pgarr_prep(ctx);
> +		if (!pgarr)
> +			return -ENOMEM;
> +		nr = min(npages, (int) pgarr->nr_free);
> +		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
> +		if (ret < 0)
> +			return ret;
> +		pgarr->nr_free -= nr;
> +		pgarr->nr_used += nr;
> +		npages -= nr;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * cr_vma_read_pages_contents - read in data of pages in page-array chain
> + * @ctx - restart context
> + * @npages - number of pages
> + */
> +static int cr_vma_read_pages_contents(struct cr_ctx *ctx, int npages)
> +{
> +	struct cr_pgarr *pgarr;
> +	unsigned long *vaddrs;
> +	int i, ret;
> +
> +	list_for_each_entry(pgarr, &ctx->pgarr, list) {
> +		vaddrs = pgarr->vaddrs;
> +		for (i = 0; i < pgarr->nr_used; i++) {
> +			void *ptr = (void *) vaddrs[i];
> +			ret = cr_uread(ctx, ptr, PAGE_SIZE);
> +			if (ret < 0)
> +				return ret;
> +		}
> +		npages -= pgarr->nr_used;
> +	}
> +	return 0;
> +}
> +
> +/* change the protection of an address range to be writable/non-writable.
> + * this is useful when restoring the memory of a read-only vma */
> +static int cr_vma_set_writable(struct mm_struct *mm, unsigned long start,
> +			       unsigned long end, int writable)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long flags = 0;
> +	int ret = -EINVAL;
> +
> +	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
> +
> +	down_write(&mm->mmap_sem);
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (!vma || vma->vm_start > end || vma->vm_end < start)
> +		goto out;
> +	if (writable && !(vma->vm_flags & VM_WRITE))
> +		flags = vma->vm_flags | VM_WRITE;
> +	else if (!writable && (vma->vm_flags & VM_WRITE))
> +		flags = vma->vm_flags & ~VM_WRITE;
> +	cr_debug("flags %#lx\n", flags);
> +	if (flags)
> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
> +				     vma->vm_end, flags);

As Dave has pointed out, this appears to be a security problem.  I think
what you need to do is create a new helper, mprotect_fixup_withchecks(),
which does all the DAC+MAC checks that are done in the sys_mprotect()
loop starting with "for (nstart = start ; ; ) {...".  Otherwise an
unprivileged user can create a checkpoint image of a program which has
done a read-only shared file mmap, edit the checkpoint, then restart it
and (I assume) cause the modified contents to be written to the file.
This could violate both DAC checks and SELinux checks.

So create that helper which does the security checks, and use it
both here and in the sys_mprotect() loop, please.
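
To make the idea concrete, here is a rough user-space model of one of
the checks in that loop: the test that every requested access bit has
its VM_MAY* counterpart set. The flag values match include/linux/mm.h,
but vm_flags_allowed() and set_writable_checked() are hypothetical
names, and the LSM hook (security_file_mprotect()) is not modeled at
all, so this is a sketch of the shape rather than the helper itself.

```c
#include <assert.h>
#include <errno.h>

/* Flag values as in the kernel's include/linux/mm.h */
#define VM_READ		0x00000001UL
#define VM_WRITE	0x00000002UL
#define VM_EXEC		0x00000004UL
#define VM_MAYREAD	0x00000010UL
#define VM_MAYWRITE	0x00000020UL
#define VM_MAYEXEC	0x00000040UL

/*
 * Hypothetical helper: return 0 if `newflags` only requests access bits
 * whose matching VM_MAY* bit is set, -EACCES otherwise. Each VM_MAY* bit
 * sits four positions above its access bit, so shifting right by four
 * lines them up.
 */
static int vm_flags_allowed(unsigned long newflags)
{
	unsigned long access = VM_READ | VM_WRITE | VM_EXEC;

	if ((newflags & ~(newflags >> 4)) & access)
		return -EACCES;
	return 0;
}

/* Hypothetical caller: only make the mapping writable if that is permitted. */
static int set_writable_checked(unsigned long *vm_flags)
{
	unsigned long newflags = *vm_flags | VM_WRITE;
	int ret = vm_flags_allowed(newflags);

	if (ret < 0)
		return ret;
	*vm_flags = newflags;
	return 0;
}
```

With a check like this in the restart path, a mapping that lacks
VM_MAYWRITE (such as a shared mapping of a file opened read-only) would
be refused rather than silently made writable.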

> + out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
> +/**
> + * cr_vma_read_pages - read in pages for to restore a vma
> + * @ctx - restart context
> + * @cr_vma - vma descriptor from restart
> + */
> +static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *hh)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret = 0;
> +
> +	if (!hh->nr_pages)
> +		return 0;
> +
> +	/* in the unlikely case that this vma is read-only */
> +	if (!(hh->vm_flags & VM_WRITE))
> +		ret = cr_vma_set_writable(mm, hh->vm_start, hh->vm_end, 1);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_vma_read_pages_vaddrs(ctx, hh->nr_pages);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_vma_read_pages_contents(ctx, hh->nr_pages);
> +	if (ret < 0)
> +		goto out;
> +
> +	cr_pgarr_reset(ctx);	/* reset page-array chain */
> +
> +	/* restore original protection for this vma */
> +	if (!(hh->vm_flags & VM_WRITE))
> +		ret = cr_vma_set_writable(mm, hh->vm_start, hh->vm_end, 0);
> +
> + out:
> +	return ret;
> +}
> +
> +/**
> + * cr_calc_map_prot_bits - convert vm_flags to mmap protection
> + * orig_vm_flags: source vm_flags
> + */
> +static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
> +{
> +	unsigned long vm_prot = 0;
> +
> +	if (orig_vm_flags & VM_READ)
> +		vm_prot |= PROT_READ;
> +	if (orig_vm_flags & VM_WRITE)
> +		vm_prot |= PROT_WRITE;
> +	if (orig_vm_flags & VM_EXEC)
> +		vm_prot |= PROT_EXEC;
> +	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
> +		vm_prot |= PROT_SEM;
> +
> +	return vm_prot;
> +}
> +
> +/**
> + * cr_calc_map_flags_bits - convert vm_flags to mmap flags
> + * orig_vm_flags: source vm_flags
> + */
> +static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
> +{
> +	unsigned long vm_flags = 0;
> +
> +	vm_flags = MAP_FIXED;
> +	if (orig_vm_flags & VM_GROWSDOWN)
> +		vm_flags |= MAP_GROWSDOWN;
> +	if (orig_vm_flags & VM_DENYWRITE)
> +		vm_flags |= MAP_DENYWRITE;
> +	if (orig_vm_flags & VM_EXECUTABLE)
> +		vm_flags |= MAP_EXECUTABLE;
> +	if (orig_vm_flags & VM_MAYSHARE)
> +		vm_flags |= MAP_SHARED;
> +	else
> +		vm_flags |= MAP_PRIVATE;
> +
> +	return vm_flags;
> +}
> +
> +static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
> +{
> +	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
> +	unsigned long addr;
> +	unsigned long flags;
> +	struct file *file = NULL;
> +	int parent, ret = 0;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
> +	if (parent < 0)
> +		return parent;
> +	else if (parent != 0)
> +		return -EINVAL;
> +
> +	cr_debug("vma %#lx-%#lx type %d nr_pages %d\n",
> +		 (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
> +		 (int) hh->vma_type, (int) hh->nr_pages);
> +
> +	if (hh->vm_end < hh->vm_start || hh->nr_pages < 0)
> +		return -EINVAL;
> +
> +	vm_start = hh->vm_start;
> +	vm_pgoff = hh->vm_pgoff;
> +	vm_size = hh->vm_end - hh->vm_start;
> +	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
> +	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
> +
> +	switch (hh->vma_type) {
> +
> +	case CR_VMA_ANON:		/* anonymous private mapping */
> +		/* vm_pgoff for anonymous mapping is the "global" page
> +		   offset (namely from addr 0x0), so we force a zero */
> +		vm_pgoff = 0;
> +		break;
> +
> +	case CR_VMA_FILE:		/* private mapping from a file */
> +		/* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
> +		flags = hh->vm_flags;
> +		if ((flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))
> +			flags = O_RDWR;
> +		else
> +			flags = O_RDONLY;
> +		file = cr_read_open_fname(ctx, flags, 0);
> +		if (IS_ERR(file))
> +			return PTR_ERR(file);
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +
> +	}
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap_pgoff(file, vm_start, vm_size,
> +			     vm_prot, vm_flags, vm_pgoff);
> +	up_write(&mm->mmap_sem);
> +	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
> +		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
> +
> +	/* the file (if opened) is now referenced by the vma */
> +	if (file)
> +		filp_close(file, NULL);
> +
> +	if (IS_ERR((void *) addr))
> +		return PTR_ERR((void *) addr);
> +
> +	/*
> +	 * CR_VMA_ANON: read in memory as is
> +	 * CR_VMA_FILE: read in memory as is
> +	 * (more to follow ...)
> +	 */
> +
> +	switch (hh->vma_type) {
> +	case CR_VMA_ANON:
> +	case CR_VMA_FILE:
> +		/* standard case: read the data into the memory */
> +		ret = cr_vma_read_pages(ctx, hh);
> +		break;
> +	}
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	cr_debug("vma retval %d\n", ret);
> +	return 0;
> +}
> +
> +static int cr_destroy_mm(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vmnext = mm->mmap;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	while (vmnext) {
> +		vma = vmnext;
> +		vmnext = vmnext->vm_next;
> +		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
> +		if (ret < 0) {
> +			pr_debug("CR: restart failed do_munmap (%d)\n", ret);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
> +int cr_read_mm(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct mm_struct *mm;
> +	int nr, parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
> +	if (parent < 0)
> +		return parent;
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(current))
> +		return -EINVAL;
> +#endif
> +	cr_debug("map_count %d\n", hh->map_count);
> +
> +	/* XXX need more sanity checks */
> +	if (hh->start_code > hh->end_code ||
> +	    hh->start_data > hh->end_data || hh->map_count < 0)
> +		return -EINVAL;
> +
> +	mm = current->mm;
> +
> +	/* point of no return -- destruct current mm */
> +	down_write(&mm->mmap_sem);
> +	ret = cr_destroy_mm(mm);
> +	if (ret < 0) {
> +		up_write(&mm->mmap_sem);
> +		return ret;
> +	}
> +	mm->start_code = hh->start_code;
> +	mm->end_code = hh->end_code;
> +	mm->start_data = hh->start_data;
> +	mm->end_data = hh->end_data;
> +	mm->start_brk = hh->start_brk;
> +	mm->brk = hh->brk;
> +	mm->start_stack = hh->start_stack;
> +	mm->arg_start = hh->arg_start;
> +	mm->arg_end = hh->arg_end;
> +	mm->env_start = hh->env_start;
> +	mm->env_end = hh->env_end;
> +	up_write(&mm->mmap_sem);
> +
> +
> +	/* FIX: need also mm->flags */
> +
> +	for (nr = hh->map_count; nr; nr--) {
> +		ret = cr_read_vma(ctx, mm);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	ret = cr_read_mm_context(ctx, mm, hh->objref);
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
> index 6bc61ac..f8eee6a 100644
> --- a/include/asm-x86/ckpt_hdr.h
> +++ b/include/asm-x86/ckpt_hdr.h
> @@ -74,4 +74,8 @@ struct cr_hdr_mm_context {
>  	__s16 nldt;
>  } __attribute__((aligned(8)));
> 
> +
> +/* misc prototypes from kernel (not defined elsewhere) */
> +asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
> +
>  #endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
> index 5c62a90..9305e7b 100644
> --- a/include/linux/ckpt.h
> +++ b/include/linux/ckpt.h
> @@ -59,6 +59,8 @@ int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);
>  int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>  int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
>  int cr_read_string(struct cr_ctx *ctx, void *str, int len);
> +int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
> +struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode);
> 
>  int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
>  int cr_read_mm(struct cr_ctx *ctx);
> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> index ac77d7d..f064cbb 100644
> --- a/include/linux/ckpt_hdr.h
> +++ b/include/linux/ckpt_hdr.h
> @@ -102,7 +102,7 @@ enum vm_type {
>  struct cr_hdr_vma {
>  	__u32 vma_type;
>  	__u32 _padding;
> -	__s64 nr_pages;
> +	__s64 nr_pages;		/* number of pages saved */
> 
>  	__u64 vm_start;
>  	__u64 vm_end;
> -- 
> 1.5.4.3
> 

* Re: [RFC v4][PATCH 9/9] File descriprtors (restore)
  2008-09-09  7:42 ` [RFC v4][PATCH 9/9] File descriprtors (restore) Oren Laadan
@ 2008-09-09 16:26   ` Dave Hansen
  2008-09-10  1:49     ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-09 16:26 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> 
> +static int cr_close_all_fds(struct files_struct *files)
> +{
> +       int *fdtable;
> +       int n;
> +
> +       do {
> +               n = cr_scan_fds(files, &fdtable);
> +               if (n < 0)
> +                       return n;
> +               while (n--)
> +                       sys_close(fdtable[n]);
> +               kfree(fdtable);
> +       } while (n != -1);
> +
> +       return 0;
> +}

cr_scan_fds() needs to return the array via ERR_PTR() instead.  That
will save using the double-pointer.
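
For readers unfamiliar with the idiom, a user-space sketch follows. The
three helpers imitate the ones in <linux/err.h>; scan_fds() and its
open_map argument are hypothetical and only show the calling
convention, with the array (or an encoded errno) as the return value and
the count passed back through a plain int pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* User-space stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/*
 * Hypothetical variant of the scan helper: returns the allocated array,
 * or an errno encoded in the pointer; the count comes back through *nfds.
 */
static int *scan_fds(const unsigned char *open_map, int max_fds, int *nfds)
{
	int *fdlist;
	int i, n = 0;

	if (max_fds <= 0)
		return ERR_PTR(-EINVAL);

	fdlist = malloc(max_fds * sizeof(*fdlist));
	if (!fdlist)
		return ERR_PTR(-ENOMEM);

	for (i = 0; i < max_fds; i++)
		if (open_map[i])
			fdlist[n++] = i;

	*nfds = n;
	return fdlist;
}
```

The caller then tests the result with IS_ERR() and extracts the error
with PTR_ERR() before touching the array.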

-- Dave


* Re: [RFC v4][PATCH 0/9] Kernel based checkpoint/restart`
  2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
                   ` (8 preceding siblings ...)
  2008-09-09  7:42 ` [RFC v4][PATCH 9/9] File descriprtors (restore) Oren Laadan
@ 2008-09-09 18:06 ` Dave Hansen
  9 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-09 18:06 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> These patches implement checkpoint-restart [CR v3]. This version is
> aimed at addressing feedback and eliminating bugs, after having added
> save and restore of open files state (regular files and directories)
> which makes it more usable.

Cool, I can actually apply these! :)

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 3/9] x86 support for checkpoint/restart
  2008-09-09  8:17   ` Ingo Molnar
@ 2008-09-09 23:23     ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-09 23:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: dave, arnd, jeremy, linux-kernel, containers



Ingo Molnar wrote:
> * Oren Laadan <orenl@cs.columbia.edu> wrote:
> 
>> +	/* for checkpoint in process context (from within a container)
>> +	   the GS and FS registers should be saved from the hardware;
>> +	   otherwise they are already saved on the thread structure */
> 
> please use the correct comment style consistently throughout your 
> patches. The correct one is like this one:
> 
>> +	/*
>> +	 * for checkpoint in process context (from within a container),
>> +	 * the actual syscall is taking place at this very moment; so
>> +	 * we (optimistically) substitute the future return value (0) of
>> +	 * this syscall into the orig_eax, so that upon restart it will
>> +	 * succeed (or it will endlessly retry checkpoint...)
>> +	 */
> 
> incorrect/inconsistent ones are like these:
> 
>> +		/* normally, no need to unlazy_fpu(), since TS_USEDFPU flag
>> +		 * has been cleared when task was context-switched out...
>> +		 * except if we are in process context, in which case we do */
> 
>> +		/* restore TLS by hand: why convert to struct user_desc if
>> +		 * sys_set_thread_entry() will convert it back ? */
> 
>> +			/* FIX: add sanity checks (eg. that values makes
>> +			 * sense, that we don't overwrite old values, etc */
> 
> (and there's many more examples throughout the series)
> 
>> +int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
>> +{
>> +	/* debug regs */
>> +
>> +	preempt_disable();
>> +
>> +	if (hh->uses_debug) {
>> +		set_debugreg(hh->debugreg0, 0);
>> +		set_debugreg(hh->debugreg1, 1);
>> +		/* ignore 4, 5 */
>> +		set_debugreg(hh->debugreg2, 2);
>> +		set_debugreg(hh->debugreg3, 3);
>> +		set_debugreg(hh->debugreg6, 6);
>> +		set_debugreg(hh->debugreg7, 7);
>> +	}
>> +
>> +	preempt_enable();
>> +
>> +	return 0;
>> +}
> 
> hm, the preemption disabling seems pointless here. What does it protect 
> against?

This is leftover from recovering; will clean up.

> 
>> +++ b/checkpoint/ckpt_arch.h
>> @@ -0,0 +1,7 @@
>> +#include <linux/ckpt.h>
>> +
>> +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>> +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
>> +
>> +int cr_read_thread(struct cr_ctx *ctx);
>> +int cr_read_cpu(struct cr_ctx *ctx);
> 
> please add 'extern' to prototypes in include files.
> 
>> @@ -15,6 +15,8 @@
>>  #include <linux/ckpt.h>
>>  #include <linux/ckpt_hdr.h>
>>  
>> +#include "ckpt_arch.h"
>> +
> 
> plsdntuseannyngabbrvtsngnrcd. [1]
> 
> "checkpoint_" should be just fine in most cases.
> 
> 	Ingo
> 
> [1] (please dont use annoying abbreviations in generic code)

:)

Oren.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-09 16:07   ` Serge E. Hallyn
@ 2008-09-09 23:35     ` Oren Laadan
  2008-09-10 15:00       ` Serge E. Hallyn
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-09 23:35 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: dave, containers, jeremy, linux-kernel, arnd



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@cs.columbia.edu):

[...]

>> +/* change the protection of an address range to be writable/non-writable.
>> + * this is useful when restoring the memory of a read-only vma */
>> +static int cr_vma_set_writable(struct mm_struct *mm, unsigned long start,
>> +			       unsigned long end, int writable)
>> +{
>> +	struct vm_area_struct *vma, *prev;
>> +	unsigned long flags = 0;
>> +	int ret = -EINVAL;
>> +
>> +	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
>> +
>> +	down_write(&mm->mmap_sem);
>> +	vma = find_vma_prev(mm, start, &prev);
>> +	if (!vma || vma->vm_start > end || vma->vm_end < start)
>> +		goto out;
>> +	if (writable && !(vma->vm_flags & VM_WRITE))
>> +		flags = vma->vm_flags | VM_WRITE;
>> +	else if (!writable && (vma->vm_flags & VM_WRITE))
>> +		flags = vma->vm_flags & ~VM_WRITE;
>> +	cr_debug("flags %#lx\n", flags);
>> +	if (flags)
>> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
>> +				     vma->vm_end, flags);
> 
> As Dave has pointed out, this appears to be a security problem.  I think

As I replied to Dave, I don't see why this would be a security problem.

This handles private memory only. In particular, the uncommon case of a
read-only VMA that has modified contents. This _cannot_ affect the file
from which this VMA may have been mapped.

Shared memory (not file-mapped) will be handled differently: since it is
always backed up by an inode in shmfs, the restart will populate the
relevant pages directly. Besides, non-file-mapped shared memory is again
not a security concern.

Finally, shared memory that maps to a file is simply _not saved_ at all;
it is part of the file system, and belongs to the (future) file system
snapshot capability. Since the contents are always available in the file
system, we don't need to save it (like we don't save shared libraries).

That said, the code must ensure that the vm_flags of a VMA of a private
type, e.g. CR_VMA_ANON/CR_VMA_FILE, indeed match that type (i.e., don't
have VM_MAYSHARE/VM_SHARED). I'll add that.
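
To make the intended check concrete, here is a minimal userspace sketch.
The helper name cr_check_private_vma() is hypothetical and not part of
the patch, and the flag values are written out only so that the snippet
compiles on its own; kernel code would take them from <linux/mm.h>.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative values only; kernel code uses the definitions in <linux/mm.h>. */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYSHARE 0x00000080UL

/*
 * Hypothetical helper: reject vm_flags read from the image if they
 * claim a private VMA type but carry any of the shared-mapping bits.
 */
static int cr_check_private_vma(unsigned long vm_flags)
{
	if (vm_flags & (VM_SHARED | VM_MAYSHARE))
		return -EINVAL;
	return 0;
}
```

With such a check in place before the mprotect_fixup() call, an edited
image could not request the temporary write permission on anything
other than a private mapping.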

> what you need to do is create a new helper mprotect_fixup_withchecks(),
> which does all the DAC+MAC checks which are done in the sys_mprotect()
> loop starting with "for (nstart = start ; ; ) {...  Otherwise an
> unprivileged user can create a checkpoint image of a program which has
> done a ro shared file mmap, edit the checkpoint, then restart it and (i
> assume) cause the modified contents to be written to the file.  This
> could violate both DAC checks and selinux checks.
> 
> So create that helper which does the security checks, and use it
> both here and in the sys_mprotect() loop, please.
> 

[...]

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 9/9] File descriprtors (restore)
  2008-09-09 16:26   ` Dave Hansen
@ 2008-09-10  1:49     ` Oren Laadan
  2008-09-10 16:09       ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-10  1:49 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
>> +static int cr_close_all_fds(struct files_struct *files)
>> +{
>> +       int *fdtable;
>> +       int n;
>> +
>> +       do {
>> +               n = cr_scan_fds(files, &fdtable);
>> +               if (n < 0)
>> +                       return n;
>> +               while (n--)
>> +                       sys_close(fdtable[n]);
>> +               kfree(fdtable);
>> +       } while (n != -1);
>> +
>> +       return 0;
>> +}
> 
> This needs to use an ERR_PTR().  It will save using the double-pointer.

I suppose you are referring to the call to cr_scan_fds(): either 'fdtable'
or 'n' will have to be passed by reference. Would you prefer it to be
	fdtable = cr_scan_fds(files, &n);
?
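
For illustration, a userspace sketch of what that calling convention
could look like. err_ptr(), is_err() and ptr_err() are local stand-ins
for the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), and scan_fds_sketch()
is a hypothetical model of cr_scan_fds(), not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical variant: the table is the return value (or an encoded
 * error), and only the count is passed by reference.
 */
static int *scan_fds_sketch(const int *open_fds, int nr_open, int *n)
{
	int *table;
	int i;

	if (nr_open < 0)
		return err_ptr(-EINVAL);

	table = malloc((nr_open ? nr_open : 1) * sizeof(*table));
	if (!table)
		return err_ptr(-ENOMEM);

	for (i = 0; i < nr_open; i++)
		table[i] = open_fds[i];
	*n = nr_open;
	return table;
}
```

The caller would then test the returned pointer with is_err() and
extract the error with ptr_err(), instead of checking a negative count.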

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 8/9] File descriprtors (dump)
  2008-09-09  8:23   ` Vegard Nossum
@ 2008-09-10  2:01     ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-10  2:01 UTC (permalink / raw)
  To: Vegard Nossum; +Cc: dave, arnd, jeremy, linux-kernel, containers



Vegard Nossum wrote:
> Hi,
> 
> Below are some concerns, I would be grateful for explanations (or
> pointers if I missed them before).

Thanks for the review !

> 
> On Tue, Sep 9, 2008 at 9:42 AM, Oren Laadan <orenl@cs.columbia.edu> wrote:
>> +/* cr_write_fd_data - dump the state of a given file pointer */
>> +static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
>> +{
>> +       struct cr_hdr h;
>> +       struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +       struct dentry *dent = file->f_dentry;
>> +       struct inode *inode = dent->d_inode;
>> +       enum fd_type fd_type;
>> +       int ret;
>> +
>> +       h.type = CR_HDR_FD_DATA;
>> +       h.len = sizeof(*hh);
>> +       h.parent = parent;
>> +
>> +       hh->f_flags = file->f_flags;
>> +       hh->f_mode = file->f_mode;
>> +       hh->f_pos = file->f_pos;
>> +       hh->f_uid = file->f_uid;
>> +       hh->f_gid = file->f_gid;
>> +       hh->f_version = file->f_version;
>> +       /* FIX: need also file->f_owner */
>> +
>> +       switch (inode->i_mode & S_IFMT) {
>> +       case S_IFREG:
>> +               fd_type = CR_FD_FILE;
>> +               break;
>> +       case S_IFDIR:
>> +               fd_type = CR_FD_DIR;
>> +               break;
>> +       case S_IFLNK:
>> +               fd_type = CR_FD_LINK;
>> +               break;
>> +       default:
>> +               return -EBADF;
>> +       }
> 
> Should cr_hbuf_put() come before the return here?
> 
> As far as I've understood, "leaking" the buffer size/data isn't
> critical (1. because it's just some extra space, and/or 2. the buffer
> is discarded on error anyway). The code looks really unbalanced
> without it, though. I guess it should at least be documented?

You are right on the money: the space is allocated on a temporary
buffer that is part of the checkpoint context, and is discarded on
error (and success) anyway.

Although the code may seem somewhat unbalanced, I personally find it
useful in that it simplifies the error paths in the code. "Balancing"
the code by adding cr_hbuf_put() calls is not functionally necessary,
and would clutter the code and add to its (source and compiled) size.

Certainly it could use better documentation, probably in sys.c where
they are defined. Will add.
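
As a rough illustration of why the missing put is harmless, here is a
userspace model of the hbuf bookkeeping. The names (hbuf_ctx, hbuf_get,
hbuf_put, write_record) and the buffer size are made up for the sketch;
assert() stands in for BUG_ON().

```c
#include <assert.h>

#define HBUF_TOTAL 8192	/* illustrative size, stands in for CR_HBUF_TOTAL */

struct hbuf_ctx {
	char hbuf[HBUF_TOTAL];
	int hpos;		/* next free byte in hbuf */
};

/* Reserve n bytes at the current position (stack-like allocation). */
static void *hbuf_get(struct hbuf_ctx *ctx, int n)
{
	void *ptr;

	assert(ctx->hpos + n <= HBUF_TOTAL);
	ptr = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return ptr;
}

/* Release the most recently reserved n bytes. */
static void hbuf_put(struct hbuf_ctx *ctx, int n)
{
	assert(ctx->hpos >= n);
	ctx->hpos -= n;
}

/* A writer that bails out early on error without calling hbuf_put(). */
static int write_record(struct hbuf_ctx *ctx, int fail)
{
	char *rec = hbuf_get(ctx, 64);

	rec[0] = 0;
	if (fail)
		return -1;	/* space stays reserved until ctx is discarded */
	hbuf_put(ctx, 64);
	return 0;
}
```

On the error path hpos simply stays advanced; since the whole context
(and with it the buffer) is freed once the operation fails, nothing is
actually leaked.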

> 
>> +
>> +       /* FIX: check if the file/dir/link is unlinked */
>> +       hh->fd_type = fd_type;
>> +
>> +       ret = cr_write_obj(ctx, &h, hh);
>> +       cr_hbuf_put(ctx, sizeof(*hh));
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
>> +}
>> +
>> +/**
>> + * cr_write_fd_ent - dump the state of a given file descriptor
>> + * @ctx: checkpoint context
>> + * @files: files_struct pointer
>> + * @fd: file descriptor
>> + *
>> + * Save the state of the file descriptor; look up the actual file pointer
>> + * in the hash table, and if found save the matching objref, otherwise call
>> + * cr_write_fd_data to dump the file pointer too.
>> + */
>> +static int
>> +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
>> +{
>> +       struct cr_hdr h;
>> +       struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +       struct file *file = NULL;
>> +       struct fdtable *fdt;
>> +       int coe, objref, new, ret;
>> +
>> +       rcu_read_lock();
>> +       fdt = files_fdtable(files);
>> +       file = fcheck_files(files, fd);
>> +       if (file) {
>> +               coe = FD_ISSET(fd, fdt->close_on_exec);
>> +               get_file(file);
>> +       }
>> +       rcu_read_unlock();
>> +
>> +       /* sanity check (although this shouldn't happen) */
>> +       if (!file)
>> +               return -EBADF;
>> +
>> +       new = cr_obj_add_ptr(ctx, (void *) file, &objref, CR_OBJ_FILE, 0);
>> +       cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
>> +
>> +       if (new < 0)
>> +               return new;
> 
> fput() and/or cr_hbuf_put()?

Certainly; the same goes for the "return ret" below.
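
One way to handle both places is a single exit label, sketched below in
userspace. The struct and function names are hypothetical, the counters
only model get_file()/fput() and cr_hbuf_get()/cr_hbuf_put(), and the
sketch assumes the function drops its own file reference on every path.

```c
#include <assert.h>
#include <errno.h>

struct file_sketch {
	int refcount;
};

struct ctx_sketch {
	int hpos;
};

/*
 * Model of the fixed flow: every return after the reference is taken
 * goes through "out", so the reference and the hbuf space are released
 * on both the success and the error paths.
 */
static int write_fd_ent_sketch(struct ctx_sketch *ctx,
			       struct file_sketch *file, int add_result)
{
	int ret;

	ctx->hpos += 16;	/* stands in for cr_hbuf_get() */
	file->refcount++;	/* stands in for get_file() */

	if (add_result < 0) {
		ret = add_result;
		goto out;
	}
	ret = 0;
 out:
	file->refcount--;	/* stands in for fput() */
	ctx->hpos -= 16;	/* stands in for cr_hbuf_put() */
	return ret;
}
```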

> 
>> +

[...]

Thanks,

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-09  7:42 ` [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
@ 2008-09-10  6:10   ` MinChan Kim
  2008-09-10 18:36     ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: MinChan Kim @ 2008-09-10  6:10 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> Add those interfaces, as well as helpers needed to easily manage the
> file format. The code is roughly broken out as follows:
>
> checkpoint/sys.c - user/kernel data transfer, as well as setup of the
> checkpoint/restart context (a per-checkpoint data structure for
> housekeeping)
>
> checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
>
> checkpoint/restart.c - input wrappers and basic restart handling
>
> Patches to add the per-architecture support as well as the actual
> work to do the memory checkpoint follow in subsequent patches.
>
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>  Makefile                 |    2 +-
>  checkpoint/Makefile      |    2 +-
>  checkpoint/checkpoint.c  |  188 +++++++++++++++++++++++++++++++++++++++
>  checkpoint/restart.c     |  189 +++++++++++++++++++++++++++++++++++++++
>  checkpoint/sys.c         |  218 +++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/ckpt.h     |   60 +++++++++++++
>  include/linux/ckpt_hdr.h |   84 ++++++++++++++++++
>  include/linux/magic.h    |    3 +
>  8 files changed, 740 insertions(+), 6 deletions(-)
>  create mode 100644 checkpoint/checkpoint.c
>  create mode 100644 checkpoint/restart.c
>  create mode 100644 include/linux/ckpt.h
>  create mode 100644 include/linux/ckpt_hdr.h
>
> diff --git a/Makefile b/Makefile
> index f448e00..a558ad2 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -619,7 +619,7 @@ export mod_strip_cmd
>
>
>  ifeq ($(KBUILD_EXTMOD),)
> -core-y         += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> +core-y         += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
>
>  vmlinux-dirs   := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
>                     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index 07d018b..d2df68c 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,4 +2,4 @@
>  # Makefile for linux checkpoint/restart.
>  #
>
> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> new file mode 100644
> index 0000000..ad1099f
> --- /dev/null
> +++ b/checkpoint/checkpoint.c
> @@ -0,0 +1,188 @@
> +/*
> + *  Checkpoint logic and helpers
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/time.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/dcache.h>
> +#include <linux/mount.h>
> +#include <linux/utsname.h>
> +#include <linux/magic.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +/**
> + * cr_write_obj - write a record described by a cr_hdr
> + * @ctx: checkpoint context
> + * @h: record descriptor
> + * @buf: record buffer
> + */
> +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
> +{
> +       int ret;
> +
> +       ret = cr_kwrite(ctx, h, sizeof(*h));
> +       if (ret < 0)
> +               return ret;
> +       return cr_kwrite(ctx, buf, h->len);
> +}
> +
> +/**
> + * cr_write_string - write a string
> + * @ctx: checkpoint context
> + * @str: string pointer
> + * @len: string length
> + */
> +int cr_write_string(struct cr_ctx *ctx, char *str, int len)
> +{
> +       struct cr_hdr h;
> +
> +       h.type = CR_HDR_STRING;
> +       h.len = len;
> +       h.parent = 0;
> +
> +       return cr_write_obj(ctx, &h, str);
> +}
> +
> +/* write the checkpoint header */
> +static int cr_write_head(struct cr_ctx *ctx)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct new_utsname *uts;
> +       struct timeval ktv;
> +       int ret;
> +
> +       h.type = CR_HDR_HEAD;
> +       h.len = sizeof(*hh);
> +       h.parent = 0;
> +
> +       do_gettimeofday(&ktv);
> +
> +       hh->magic = CHECKPOINT_MAGIC_HEAD;
> +       hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
> +       hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
> +       hh->patch = (LINUX_VERSION_CODE) & 0xff;
> +
> +       hh->rev = CR_VERSION;
> +
> +       hh->flags = ctx->flags;
> +       hh->time = ktv.tv_sec;
> +
> +       uts = utsname();
> +       memcpy(hh->release, uts->release, __NEW_UTS_LEN);
> +       memcpy(hh->version, uts->version, __NEW_UTS_LEN);
> +       memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       return ret;
> +}
> +
> +/* write the checkpoint trailer */
> +static int cr_write_tail(struct cr_ctx *ctx)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int ret;
> +
> +       h.type = CR_HDR_TAIL;
> +       h.len = sizeof(*hh);
> +       h.parent = 0;
> +
> +       hh->magic = CHECKPOINT_MAGIC_TAIL;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       return ret;
> +}
> +
> +/* dump the task_struct of a given task */
> +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int ret;
> +
> +       h.type = CR_HDR_TASK;
> +       h.len = sizeof(*hh);
> +       h.parent = 0;
> +
> +       hh->state = t->state;
> +       hh->exit_state = t->exit_state;
> +       hh->exit_code = t->exit_code;
> +       hh->exit_signal = t->exit_signal;
> +
> +       hh->utime = t->utime;
> +       hh->stime = t->stime;
> +       hh->utimescaled = t->utimescaled;
> +       hh->stimescaled = t->stimescaled;
> +       hh->gtime = t->gtime;
> +       hh->prev_utime = t->prev_utime;
> +       hh->prev_stime = t->prev_stime;
> +       hh->nvcsw = t->nvcsw;
> +       hh->nivcsw = t->nivcsw;
> +       hh->start_time_sec = t->start_time.tv_sec;
> +       hh->start_time_nsec = t->start_time.tv_nsec;
> +       hh->real_start_time_sec = t->real_start_time.tv_sec;
> +       hh->real_start_time_nsec = t->real_start_time.tv_nsec;
> +       hh->min_flt = t->min_flt;
> +       hh->maj_flt = t->maj_flt;
> +
> +       hh->task_comm_len = TASK_COMM_LEN;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;
> +
> +       return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
> +}
> +
> +/* dump the entire state of a given task */
> +static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +       int ret;
> +
> +       if (t->state == TASK_DEAD) {
> +               pr_warning("CR: task may not be in state TASK_DEAD\n");
> +               return -EAGAIN;
> +       }
> +
> +       ret = cr_write_task_struct(ctx, t);
> +       cr_debug("ret %d\n", ret);
> +
> +       return ret;
> +}
> +
> +int do_checkpoint(struct cr_ctx *ctx)
> +{
> +       int ret;
> +
> +       /* FIX: need to test whether container is checkpointable */
> +
> +       ret = cr_write_head(ctx);
> +       if (ret < 0)
> +               goto out;
> +       ret = cr_write_task(ctx, current);
> +       if (ret < 0)
> +               goto out;
> +       ret = cr_write_tail(ctx);
> +       if (ret < 0)
> +               goto out;
> +
> +       /* on success, return (unique) checkpoint identifier */
> +       ret = ctx->crid;
> +
> + out:
> +       return ret;
> +}
> diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> new file mode 100644
> index 0000000..171cd2d
> --- /dev/null
> +++ b/checkpoint/restart.c
> @@ -0,0 +1,189 @@
> +/*
> + *  Restart logic and helpers
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +/**
> + * cr_read_obj - read a whole record (cr_hdr followed by payload)
> + * @ctx: checkpoint context
> + * @h: record descriptor
> + * @buf: record buffer
> + * @n: available buffer size
> + *
> + * @return: size of payload
> + */
> +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
> +{
> +       int ret;
> +
> +       ret = cr_kread(ctx, h, sizeof(*h));
> +       if (ret < 0)
> +               return ret;
> +
> +       cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
> +
> +       if (h->len < 0 || h->len > n)
> +               return -EINVAL;
> +
> +       return cr_kread(ctx, buf, h->len);
> +}
> +
> +/**
> + * cr_read_obj_type - read a whole record of expected type
> + * @ctx: checkpoint context
> + * @buf: record buffer
> + * @n: available buffer size
> + * @type: expected record type
> + *
> + * @return: object reference of the parent object
> + */
> +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
> +{
> +       struct cr_hdr h;
> +       int ret;
> +
> +       ret = cr_read_obj(ctx, &h, buf, n);
> +       if (!ret) {
> +               if (h.type == type)
> +                       ret = h.parent;
> +               else
> +                       ret = -EINVAL;
> +       }
> +       return ret;
> +}
> +
> +/**
> + * cr_read_string - read a string
> + * @ctx: checkpoint context
> + * @str: string buffer
> + * @len: buffer length
> + */
> +int cr_read_string(struct cr_ctx *ctx, void *str, int len)
> +{
> +       return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
> +}
> +
> +/* read the checkpoint header */
> +static int cr_read_head(struct cr_ctx *ctx)
> +{
> +       struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int parent;
> +
> +       parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
> +       if (parent < 0)
> +               return parent;
> +       else if (parent != 0)
> +               return -EINVAL;
> +
> +       if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
> +           hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
> +           hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
> +           hh->patch != ((LINUX_VERSION_CODE) & 0xff))
> +               return -EINVAL;
> +
> +       if (hh->flags & ~CR_CTX_CKPT)
> +               return -EINVAL;
> +
> +       ctx->oflags = hh->flags;
> +
> +       /* FIX: verify compatibility of release, version and machine */
> +
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       return 0;
> +}
> +
> +/* read the checkpoint trailer */
> +static int cr_read_tail(struct cr_ctx *ctx)
> +{
> +       struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int parent;
> +
> +       parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
> +       if (parent < 0)
> +               return parent;
> +       else if (parent != 0)
> +               return -EINVAL;
> +
> +       if (hh->magic != CHECKPOINT_MAGIC_TAIL)
> +               return -EINVAL;
> +
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       return 0;
> +}
> +
> +/* read the task_struct into the current task */
> +static int cr_read_task_struct(struct cr_ctx *ctx)
> +{
> +       struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct task_struct *t = current;
> +       char *buf;
> +       int parent, ret;
> +
> +       parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
> +       if (parent < 0)
> +               return parent;
> +       else if (parent != 0)
> +               return -EINVAL;
> +
> +       /* FIXME: for now, only restore t->comm */
> +
> +       /* upper limit for task_comm_len to prevent DoS */
> +       if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
> +               return -EINVAL;
> +
> +       buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
> +       if (!buf)
> +               return -ENOMEM;
> +       ret = cr_read_string(ctx, buf, hh->task_comm_len);
> +       if (!ret) {
> +               /* if t->comm is too long, silently truncate */
> +               memset(t->comm, 0, TASK_COMM_LEN);
> +               memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
> +       }
> +       kfree(buf);
> +
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       return ret;
> +}
> +
> +/* read the entire state of the current task */
> +static int cr_read_task(struct cr_ctx *ctx)
> +{
> +       int ret;
> +
> +       ret = cr_read_task_struct(ctx);
> +       cr_debug("ret %d\n", ret);
> +
> +       return ret;
> +}
> +
> +int do_restart(struct cr_ctx *ctx)
> +{
> +       int ret;
> +
> +       ret = cr_read_head(ctx);
> +       if (ret < 0)
> +               goto out;
> +       ret = cr_read_task(ctx);
> +       if (ret < 0)
> +               goto out;
> +       ret = cr_read_tail(ctx);
> +       if (ret < 0)
> +               goto out;
> +
> +       /* on success, adjust the return value if needed [TODO] */
> + out:
> +       return ret;
> +}
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index b9018a4..113e0df 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -10,6 +10,189 @@
>
>  #include <linux/sched.h>
>  #include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/uaccess.h>
> +#include <linux/capability.h>
> +#include <linux/ckpt.h>
> +
> +/*
> + * helpers to write/read to/from the image file descriptor
> + *
> + *   cr_uwrite() - write a user-space buffer to the checkpoint image
> + *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
> + *   cr_uread() - read from the checkpoint image to a user-space buffer
> + *   cr_kread() - read from the checkpoint image to a kernel-space buffer
> + *
> + */
> +
> +/* (temporarily added file_pos_read() and file_pos_write() because they
> + * are static in fs/read_write.c... should cleanup and remove later) */
> +static inline loff_t file_pos_read(struct file *file)
> +{
> +       return file->f_pos;
> +}
> +
> +static inline void file_pos_write(struct file *file, loff_t pos)
> +{
> +       file->f_pos = pos;
> +}
> +
> +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> +       struct file *file = ctx->file;
> +       ssize_t nwrite;
> +       int nleft;
> +
> +       for (nleft = count; nleft; nleft -= nwrite) {
> +               loff_t pos = file_pos_read(file);
> +               nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
> +               file_pos_write(file, pos);
> +               if (nwrite <= 0) {
> +                       if (nwrite == -EAGAIN)
> +                               nwrite = 0;
> +                       else
> +                               return nwrite;
> +               }
> +               buf += nwrite;
> +       }
> +
> +       ctx->total += count;
> +       return 0;
> +}
> +
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> +       mm_segment_t oldfs;
> +       int ret;
> +
> +       oldfs = get_fs();
> +       set_fs(KERNEL_DS);
> +       ret = cr_uwrite(ctx, buf, count);
> +       set_fs(oldfs);
> +
> +       return ret;
> +}
> +
> +int cr_uread(struct cr_ctx *ctx, void *buf, int count)
> +{
> +       struct file *file = ctx->file;
> +       ssize_t nread;
> +       int nleft;
> +
> +       for (nleft = count; nleft; nleft -= nread) {
> +               loff_t pos = file_pos_read(file);
> +               nread = vfs_read(file, (char __user *) buf, nleft, &pos);
> +               file_pos_write(file, pos);
> +               if (nread <= 0) {
> +                       if (nread == -EAGAIN)
> +                               nread = 0;
> +                       else
> +                               return nread;
> +               }
> +               buf += nread;
> +       }
> +
> +       ctx->total += count;
> +       return 0;
> +}
> +
> +int cr_kread(struct cr_ctx *ctx, void *buf, int count)
> +{
> +       mm_segment_t oldfs;
> +       int ret;
> +
> +       oldfs = get_fs();
> +       set_fs(KERNEL_DS);
> +       ret = cr_uread(ctx, buf, count);
> +       set_fs(oldfs);
> +
> +       return ret;
> +}
> +
> +
> +/*
> + * helpers to manage CR contexts: allocated for each checkpoint and/or
> + * restart operation, and persists until the operation is completed.
> + */
> +
> +/* unique checkpoint identifier (FIXME: should be per-container) */
> +static atomic_t cr_ctx_count;
> +
> +void cr_ctx_free(struct cr_ctx *ctx)
> +{
> +       if (ctx->file)
> +               fput(ctx->file);
> +
> +       free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
> +
> +       kfree(ctx);
> +}
> +
> +struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
> +{
> +       struct cr_ctx *ctx;
> +
> +       ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +       if (!ctx)
> +               return ERR_PTR(-ENOMEM);
> +
> +       ctx->file = fget(fd);
> +       if (!ctx->file) {
> +               cr_ctx_free(ctx);
> +               return ERR_PTR(-EBADF);
> +       }
> +       get_file(ctx->file);

Why do you need get_file?
You already called fget.
Am I missing something ?
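
To spell out the concern, a small userspace model of the reference
counting. The names are made up, and it assumes (as in the quoted
cr_ctx_free()) that the context is torn down with a single fput().

```c
#include <assert.h>

struct file_model {
	long f_count;
};

/* fget() looks up the file and takes a reference. */
static struct file_model *fget_model(struct file_model *f)
{
	f->f_count++;
	return f;
}

/* get_file() takes one more reference on a file already held. */
static void get_file_model(struct file_model *f)
{
	f->f_count++;
}

/* fput() drops one reference. */
static void fput_model(struct file_model *f)
{
	f->f_count--;
}

/*
 * Returns the net change in the reference count after one
 * alloc/free cycle of the context.
 */
static long ctx_lifecycle(struct file_model *f, int extra_get)
{
	long before = f->f_count;

	fget_model(f);
	if (extra_get)
		get_file_model(f);
	fput_model(f);		/* the only put, as in cr_ctx_free() */
	return f->f_count - before;
}
```

In this model the cycle is balanced without the extra get, and leaves
one reference behind with it.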

> +       ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
> +       if (!ctx->hbuf) {
> +               cr_ctx_free(ctx);
> +               return ERR_PTR(-ENOMEM);
> +       }
> +
> +       ctx->pid = pid;
> +       ctx->flags = flags;
> +
> +       ctx->crid = atomic_inc_return(&cr_ctx_count);
> +
> +       return ctx;
> +}
> +
> +/*
> + * During checkpoint and restart the code writes out/reads in data
> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
> + * Because operations can be nested, one should call cr_hbuf_get() to
> + * reserve space in the buffer, and then cr_hbuf_put() when that space
> + * is no longer needed.
> + */
> +
> +/**
> + * cr_hbuf_get - reserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to reserve
> + *
> + * @return: pointer to reserved space
> + */
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
> +{
> +       void *ptr;
> +
> +       BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
> +       ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
> +       ctx->hpos += n;
> +       return ptr;
> +}
> +
> +/**
> + * cr_hbuf_put - unreserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to unreserve
> + */
> +void cr_hbuf_put(struct cr_ctx *ctx, int n)
> +{
> +       BUG_ON(ctx->hpos < n);
> +       ctx->hpos -= n;
> +}
>
>  /**
>  * sys_checkpoint - checkpoint a container
> @@ -19,9 +202,23 @@
>  */
>  asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>  {
> -       pr_debug("sys_checkpoint not implemented yet\n");
> -       return -ENOSYS;
> +       struct cr_ctx *ctx;
> +       int ret;
> +
> +       /* no flags for now */
> +       if (flags)
> +               return -EINVAL;
> +
> +       ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
> +       if (IS_ERR(ctx))
> +               return PTR_ERR(ctx);
> +
> +       ret = do_checkpoint(ctx);
> +
> +       cr_ctx_free(ctx);
> +       return ret;
>  }
> +
>  /**
>  * sys_restart - restart a container
>  * @crid: checkpoint image identifier
> @@ -30,6 +227,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>  */
>  asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
>  {
> -       pr_debug("sys_restart not implemented yet\n");
> -       return -ENOSYS;
> +       struct cr_ctx *ctx;
> +       int ret;
> +
> +       /* no flags for now */
> +       if (flags)
> +               return -EINVAL;
> +
> +       ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
> +       if (IS_ERR(ctx))
> +               return PTR_ERR(ctx);
> +
> +       ret = do_restart(ctx);
> +
> +       cr_ctx_free(ctx);
> +       return ret;
>  }
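Both syscalls rely on the kernel's error-pointer idiom (IS_ERR/PTR_ERR) for the context allocation. A user-space sketch of that control flow follows, assuming the usual encoding of small negative errno values in the top 4095 pointer values. All sk_* names are hypothetical, and the body of the "real work" is a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_MAX_ERRNO 4095

/* user-space stand-ins for ERR_PTR(), PTR_ERR() and IS_ERR() */
static void *sk_err_ptr(long err) { return (void *)(intptr_t)err; }
static long sk_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int sk_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SK_MAX_ERRNO;
}

struct sk_ctx {
	int fd;
	unsigned long flags;
};

/* returns a valid context, or an errno encoded as a pointer */
static struct sk_ctx *sk_ctx_alloc(int fd, unsigned long flags)
{
	struct sk_ctx *ctx;

	if (fd < 0)
		return sk_err_ptr(-EBADF);
	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return sk_err_ptr(-ENOMEM);
	ctx->fd = fd;
	ctx->flags = flags;
	return ctx;
}

/* same shape as sys_checkpoint(): validate, allocate, work, free */
static long sk_checkpoint(int fd, unsigned long flags)
{
	struct sk_ctx *ctx;
	long ret;

	if (flags)                      /* no flags defined yet */
		return -EINVAL;
	ctx = sk_ctx_alloc(fd, flags | 0x1);
	if (sk_is_err(ctx))
		return sk_ptr_err(ctx);
	ret = 0;                        /* placeholder for do_checkpoint() */
	free(ctx);
	return ret;
}
```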
> diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
> new file mode 100644
> index 0000000..91f4998
> --- /dev/null
> +++ b/include/linux/ckpt.h
> @@ -0,0 +1,60 @@
> +#ifndef _CHECKPOINT_CKPT_H_
> +#define _CHECKPOINT_CKPT_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#define CR_VERSION  1
> +
> +struct cr_ctx {
> +       pid_t pid;              /* container identifier */
> +       int crid;               /* unique checkpoint id */
> +
> +       unsigned long flags;
> +       unsigned long oflags;   /* restart: old flags */
> +
> +       struct file *file;
> +       int total;              /* total read/written */
> +
> +       void *hbuf;             /* temporary buffer for headers */
> +       int hpos;               /* position in headers buffer */
> +};
> +
> +/* cr_ctx: flags */
> +#define CR_CTX_CKPT    0x1
> +#define CR_CTX_RSTR    0x2
> +
> +/* allocation defaults */
> +#define CR_HBUF_ORDER  1
> +#define CR_HBUF_TOTAL  (PAGE_SIZE << CR_HBUF_ORDER)
> +
> +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
> +int cr_uread(struct cr_ctx *ctx, void *buf, int count);
> +int cr_kread(struct cr_ctx *ctx, void *buf, int count);
> +
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n);
> +void cr_hbuf_put(struct cr_ctx *ctx, int n);
> +
> +struct cr_hdr;
> +
> +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
> +int cr_write_string(struct cr_ctx *ctx, char *str, int len);
> +
> +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
> +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
> +int cr_read_string(struct cr_ctx *ctx, void *str, int len);
> +
> +int do_checkpoint(struct cr_ctx *ctx);
> +int do_restart(struct cr_ctx *ctx);
> +
> +#define cr_debug(fmt, args...)  \
> +       pr_debug("[CR:%s] " fmt, __func__, ## args)
> +
> +#endif /* _CHECKPOINT_CKPT_H_ */
> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> new file mode 100644
> index 0000000..dd05ecc
> --- /dev/null
> +++ b/include/linux/ckpt_hdr.h
> @@ -0,0 +1,84 @@
> +#ifndef _CHECKPOINT_CKPT_HDR_H_
> +#define _CHECKPOINT_CKPT_HDR_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/utsname.h>
> +
> +/*
> + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> + * keep data 64-bit aligned: use padding for structure members, and use
> + * __attribute__ ((aligned (8))) for the entire structure.
> + */
> +
> +/* records: generic header */
> +
> +struct cr_hdr {
> +       __s16 type;
> +       __s16 len;
> +       __u32 parent;
> +};
> +
> +/* header types */
> +enum {
> +       CR_HDR_HEAD = 1,
> +       CR_HDR_STRING,
> +
> +       CR_HDR_TASK = 101,
> +       CR_HDR_THREAD,
> +       CR_HDR_CPU,
> +
> +       CR_HDR_MM = 201,
> +       CR_HDR_VMA,
> +       CR_HDR_MM_CONTEXT,
> +
> +       CR_HDR_TAIL = 5001
> +};
> +
> +struct cr_hdr_head {
> +       __u64 magic;
> +
> +       __u16 major;
> +       __u16 minor;
> +       __u16 patch;
> +       __u16 rev;
> +
> +       __u64 time;     /* when checkpoint taken */
> +       __u64 flags;    /* checkpoint options */
> +
> +       char release[__NEW_UTS_LEN];
> +       char version[__NEW_UTS_LEN];
> +       char machine[__NEW_UTS_LEN];
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_tail {
> +       __u64 magic;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_task {
> +       __u64 state;
> +       __u32 exit_state;
> +       __u32 exit_code;
> +       __u32 exit_signal;
> +       __u32 _padding0;
> +
> +       __u64 utime, stime, utimescaled, stimescaled;
> +       __u64 gtime;
> +       __u64 prev_utime, prev_stime;
> +       __u64 nvcsw, nivcsw;
> +       __u64 start_time_sec, start_time_nsec;
> +       __u64 real_start_time_sec, real_start_time_nsec;
> +       __u64 min_flt, maj_flt;
> +
> +       __s32 task_comm_len;
> +} __attribute__((aligned(8)));
> +
> +#endif /* _CHECKPOINT_CKPT_HDR_H_ */
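The alignment comment in this header can be checked at compile time. The sketch below mirrors the generic record header and a shortened, hypothetical version of the task record using <stdint.h> types in user space. The sizes are what I would expect on x86 (32-bit and 64-bit) given the explicit padding and aligned(8), and the sk_* names are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mirrors struct cr_hdr: 2 + 2 + 4 bytes, no implicit padding */
struct sk_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/* shortened stand-in for struct cr_hdr_task (fields omitted) */
struct sk_hdr_task {
	uint64_t state;
	uint32_t exit_state;
	uint32_t exit_code;
	uint32_t exit_signal;
	uint32_t _padding0;     /* keeps the next 64-bit member aligned */
	uint64_t utime;
	int32_t task_comm_len;
} __attribute__((aligned(8)));

_Static_assert(sizeof(struct sk_hdr) == 8, "generic header is 8 bytes");
_Static_assert(offsetof(struct sk_hdr_task, utime) == 24,
	       "explicit padding fixes the offset of 64-bit members");
_Static_assert(sizeof(struct sk_hdr_task) % 8 == 0,
	       "aligned(8) rounds the record size up to 8 bytes");

/* runtime accessor so the size can also be checked from a test */
static size_t sk_task_size(void)
{
	return sizeof(struct sk_hdr_task);
}
```

Without the attribute, a 32-bit x86 build would end the structure at 36 bytes, while a 64-bit build would round it to 40, so the two would disagree on where the next record starts.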
> diff --git a/include/linux/magic.h b/include/linux/magic.h
> index 1fa0c2c..c2b811c 100644
> --- a/include/linux/magic.h
> +++ b/include/linux/magic.h
> @@ -42,4 +42,7 @@
>  #define FUTEXFS_SUPER_MAGIC    0xBAD1DEA
>  #define INOTIFYFS_SUPER_MAGIC  0x2BAD1DEA
>
> +#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
> +#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
> +
>  #endif /* __LINUX_MAGIC_H__ */
> --
> 1.5.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
Kind regards,
MinChan Kim


* Re: [RFC v4][PATCH 6/9] Checkpoint/restart: initial documentation
  2008-09-09  7:42 ` [RFC v4][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
@ 2008-09-10  7:13   ` MinChan Kim
  0 siblings, 0 replies; 43+ messages in thread
From: MinChan Kim @ 2008-09-10  7:13 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> Covers application checkpoint/restart, overall design, interfaces
> and checkpoint image format.
>
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>  Documentation/checkpoint.txt |  187 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 187 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/checkpoint.txt
>
> diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
> new file mode 100644
> index 0000000..f67aef1
> --- /dev/null
> +++ b/Documentation/checkpoint.txt
> @@ -0,0 +1,187 @@
> +
> +       === Checkpoint-Restart support in the Linux kernel ===
> +
> +Copyright (C) 2008 Oren Laadan
> +
> +Author:                Oren Laadan <orenl@cs.columbia.edu>
> +
> +License:       The GNU Free Documentation License, Version 1.2
> +               (dual licensed under the GPL v2)
> +Reviewers:
> +
> +Application checkpoint/restart [CR] is the ability to save the state
> +of a running application so that it can later resume its execution
> +from the time at which it was checkpointed. An application can be
> +migrated by checkpointing it on one machine and restarting it on
> +another. CR can provide many potential benefits:
> +
> +* Failure recovery: by rolling back to a previous checkpoint
> +
> +* Improved response time: by restarting applications from checkpoints
> +  instead of from scratch.
> +
> +* Improved system utilization: by suspending long running CPU
> +  intensive jobs and resuming them when load decreases.
> +
> +* Fault resilience: by migrating applications off of faulty hosts.
> +
> +* Dynamic load balancing: by migrating applications to less loaded
> +  hosts.
> +
> +* Improved service availability and administration: by migrating
> +  applications before host maintenance so that they continue to run
> +  with minimal downtime
> +
> +* Time-travel: by taking periodic checkpoints and restarting from
> +  any previous checkpoint.
> +
> +
> +=== Overall design
> +
> +Checkpoint and restart is done in the kernel as much as possible. The
> +kernel exports a relatively opaque 'blob' of data to userspace which can
> +then be handed to the new kernel at restore time.  The 'blob' contains
> +data and state of select portions of kernel structures such as VMAs
> +and mm_structs, as well as copies of the actual memory that the tasks
> +use. Any changes in this blob's format between kernel revisions can be
> +handled by an in-userspace conversion program. The approach is similar
> +to virtually all of the commercial CR products out there, as well as
> +the research project Zap.
> +
> +Two new system calls are introduced to provide CR: sys_checkpoint and
> +sys_restart.  The checkpoint code basically serializes internal kernel
> +state and writes it out to a file descriptor, and the resulting image
> +is stream-able. More specifically, it consists of 5 steps:
> +  1. Pre-dump
> +  2. Freeze the container
> +  3. Dump
> +  4. Thaw (or kill) the container
> +  5. Post-dump
> +Steps 1 and 5 are an optimization to reduce application downtime:
> +"pre-dump" works before freezing the container, e.g. the pre-copy for
> +live migration, and "post-dump" works after the container resumes
> +execution, e.g. write-back the data to secondary storage.
> +
> +The restart code basically reads the saved kernel state from a
> +file descriptor, and re-creates the tasks and the resources they need
> +to resume execution. The restart code is executed by each task that
> +is restored in a new container to reconstruct its own state.
> +
> +
> +=== Interfaces
> +
> +int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> +  Checkpoint a container whose init task is identified by pid, to the
> +  file designated by fd. Flags will have future meaning (should be 0
> +  for now).
> +  Returns: a positive integer that identifies the checkpoint image
> +  (for future reference in case it is kept in memory) upon success,
> +  0 if it returns from a restart, and -1 if an error occurs.
> +
> +int sys_restart(int crid, int fd, unsigned long flags);
> +  Restart a container from a checkpoint image identified by crid, or
> +  from the blob stored in the file designated by fd. Flags will have
> +  future meaning (should be 0 for now).
> +  Returns: 0 on success and -1 if an error occurs.
> +
> +Thus, if checkpoint is initiated by a process in the container, one
> +can use logic similar to fork():
> +       ...
> +       crid = checkpoint(...);
> +       switch (crid) {
> +       case -1:
> +               perror("checkpoint failed");
> +               break;
> +       default:
> +               fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
> +               /* proceed with execution after checkpoint */
> +               ...
> +               break;
> +       case 0:
> +               fprintf(stderr, "returned after restart\n");
> +               /* proceed with action required following a restart */
> +               ...
> +               break;
> +       }
> +       ...
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> +       ...
> +       if (restart(crid, ...) < 0)
> +               perror("restart failed");
> +       /* only get here if restart failed */
> +       ...
> +
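The three-way return convention described above (positive crid, 0 after a restart, -1 on error) can be sketched in user space as follows. The document describes the libc-style view, while the kernel side returns a negative errno, so the sketch also shows the usual translation between the two. The sk_* names and the enum are hypothetical, and no real checkpoint syscall is invoked.

```c
#include <assert.h>
#include <errno.h>

enum ckpt_outcome { CKPT_FAILED, CKPT_RESTARTED, CKPT_TAKEN };

/*
 * Translate a raw kernel return value into the -1/errno convention
 * that a libc-style wrapper would present to the application.
 */
static long sk_syscall_ret(long raw)
{
	if (raw < 0 && raw >= -4095) {
		errno = (int)-raw;
		return -1;
	}
	return raw;
}

/* classify the wrapper's return value, as the switch in the text does */
static enum ckpt_outcome sk_classify(long crid)
{
	if (crid < 0)
		return CKPT_FAILED;     /* errno says why */
	if (crid == 0)
		return CKPT_RESTARTED;  /* we are the restarted copy */
	return CKPT_TAKEN;              /* crid identifies the image */
}
```

As with fork(), the 0 case is the one that is easy to forget: the same call site is reached a second time when the image is restarted.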
> +
> +=== Checkpoint image format
> +
> +The checkpoint image format is composed of records consisting of a
> +pre-header that identifies its contents, followed by a payload. (The
> +idea here is to enable parallel checkpointing in the future in which
> +multiple threads interleave data from multiple processes into a single
> +stream).
> +
> +The pre-header is defined by "struct cr_hdr" as follows:
> +
> +struct cr_hdr {
> +       __s16 type;
> +       __s16 len;
> +       __u32 id;
> +};
> +
> +Here, 'type' field identifies the type of the payload, 'len' tells its
> +length in byes. The 'id' identifies the owner object instance. The

byes => bytes ?? :)

> +meaning of the 'id' field varies depending on the type. For example,
> +for type CR_HDR_MM, the 'id' identifies the task to which this MM
> +belongs. The payload also varies depending on the type, for instance,
> +the data describing a task_struct is given by a 'struct cr_hdr_task'
> +(type CR_HDR_TASK) and so on.
> +
> +The format of the memory dump is as follows: for each VMA, there is a
> +'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
> +name. The cr_vma->npages indicates how many pages were dumped for this
> +VMA. Following comes the actual data: first the addresses of all the
> +dumped pages, followed by the contents of all the dumped pages (npages
> +entries each). Then comes the next VMA and so on.
> +
> +To illustrate this, consider a single simple task with two VMAs: one
> +is file mapped with two dumped pages, and the other is anonymous with
> +three dumped pages. The checkpoint image will look like this:
> +
> +cr_hdr + cr_hdr_head
> +cr_hdr + cr_hdr_task
> +       cr_hdr + cr_hdr_mm
> +               cr_hdr + cr_hdr_vma + cr_hdr + string
> +                       addr1, addr2
> +                       page1, page2
> +               cr_hdr + cr_hdr_vma
> +                       addr3, addr4, addr5
> +                       page3, page4, page5
> +               cr_hdr + cr_hdr_mm_context
> +       cr_hdr + cr_hdr_thread
> +       cr_hdr + cr_hdr_cpu
> +cr_hdr + cr_hdr_tail
> +
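To make the record layout concrete, here is a small user-space sketch that appends "pre-header + payload" records to a memory buffer and then walks them back. It is only an illustration under assumptions: sk_rec_append and sk_rec_count are hypothetical helpers, the payloads are dummies, and the type values are copied from the enum in ckpt_hdr.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct sk_hdr {
	int16_t type;
	int16_t len;            /* payload length in bytes */
	uint32_t parent;
};

enum { SK_HEAD = 1, SK_STRING = 2, SK_TAIL = 5001 };

/* append one record at pos; returns the new position or -1 */
static int sk_rec_append(unsigned char *img, int cap, int pos, int type,
			 uint32_t parent, const void *payload, int len)
{
	struct sk_hdr h;

	if (len < 0 || len > INT16_MAX)
		return -1;
	if (pos < 0 || (int)sizeof(h) + len > cap - pos)
		return -1;
	h.type = (int16_t)type;
	h.len = (int16_t)len;
	h.parent = parent;
	memcpy(img + pos, &h, sizeof(h));
	if (len)
		memcpy(img + pos + sizeof(h), payload, (size_t)len);
	return pos + (int)sizeof(h) + len;
}

/* walk the image; returns the number of records up to the tail, or -1 */
static int sk_rec_count(const unsigned char *img, int end)
{
	struct sk_hdr h;
	int pos = 0, n = 0;

	while (pos + (int)sizeof(h) <= end) {
		memcpy(&h, img + pos, sizeof(h));
		if (h.len < 0 || pos + (int)sizeof(h) + h.len > end)
			return -1;      /* truncated or corrupt record */
		pos += (int)sizeof(h) + h.len;
		n++;
		if (h.type == SK_TAIL)
			break;
	}
	return n;
}
```

A reader never needs to understand a payload to skip it, which is what makes the stream self-describing. Note that the page data following a VMA record is not framed this way in the patch, so a real reader would need the page count from the VMA record as well.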
> +
> +=== Changelog
> +
> +[2008-Sep-04] v4:
> +* Fix calculation of hash table size
> +* Fix header structure alignment
> +* Use standard list_... for cr_pgarr
> +
> +[2008-Aug-20] v3:
> +* Various fixes and clean-ups
> +* Use standard hlist_... for hash table
> +* Better use of standard kmalloc/kfree
> +
> +[2008-Aug-09] v2:
> +* Added utsname->{release,version,machine} to checkpoint header
> +* Pad header structures to 64 bits to ensure compatibility
> +* Address comments from LKML and linux-containers mailing list
> +
> +[2008-Jul-29] v1:
> +In this incarnation, CR only works on single task. The address space
> +may consist of only private, simple VMAs - anonymous or file-mapped.
> +Both checkpoint and restart will ignore the first argument (pid/crid)
> +and instead act on themselves.
> --
> 1.5.4.3
>



-- 
Kind regards,
MinChan Kim


* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
  2008-09-09  9:22   ` Vegard Nossum
@ 2008-09-10  7:51   ` MinChan Kim
  2008-09-10 23:49     ` MinChan Kim
  2008-09-10 16:55   ` Dave Hansen
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: MinChan Kim @ 2008-09-10  7:51 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
> it will be followed by the file name.  The cr_vma->npages will tell
> how many pages were dumped for this VMA.  Then it will be followed
> by the actual data: first a dump of the addresses of all dumped
> pages (npages entries) followed by a dump of the contents of all
> dumped pages (npages pages). Then will come the next VMA and so on.
>
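The per-VMA payload described above (all addresses first, then all page contents) can be sketched in user space like this. sk_vma_dump and SK_PAGE_SIZE are hypothetical names, the 4096-byte page size is an assumption, and the sketch ignores arithmetic overflow for very large page counts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE 4096       /* assumed page size */

/*
 * Serialize npages addresses followed by npages page bodies.
 * Returns the number of bytes written, or 0 if there is nothing
 * to write or the output buffer is too small.
 */
static size_t sk_vma_dump(unsigned char *out, size_t cap,
			  const unsigned long *vaddrs,
			  unsigned char *const *pages, size_t npages)
{
	size_t addr_bytes = npages * sizeof(*vaddrs);
	size_t need = addr_bytes + npages * SK_PAGE_SIZE;
	size_t i;

	if (npages == 0 || need > cap)
		return 0;

	/* first the table of virtual addresses ... */
	memcpy(out, vaddrs, addr_bytes);

	/* ... then the page contents, in the same order */
	for (i = 0; i < npages; i++)
		memcpy(out + addr_bytes + i * SK_PAGE_SIZE,
		       pages[i], SK_PAGE_SIZE);
	return need;
}
```

Keeping the address table ahead of the data means a restart can learn where every page belongs before it reads any page contents.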
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>  arch/x86/mm/checkpoint.c   |   30 +++
>  arch/x86/mm/restart.c      |    1 +
>  checkpoint/Makefile        |    3 +-
>  checkpoint/checkpoint.c    |   53 ++++++
>  checkpoint/ckpt_arch.h     |    1 +
>  checkpoint/ckpt_mem.c      |  448 ++++++++++++++++++++++++++++++++++++++++++++
>  checkpoint/ckpt_mem.h      |   35 ++++
>  checkpoint/sys.c           |   23 ++-
>  include/asm-x86/ckpt_hdr.h |    5 +
>  include/linux/ckpt.h       |   12 ++
>  include/linux/ckpt_hdr.h   |   30 +++
>  11 files changed, 635 insertions(+), 6 deletions(-)
>  create mode 100644 checkpoint/ckpt_mem.c
>  create mode 100644 checkpoint/ckpt_mem.h
>
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> index 71d21e6..50cfd29 100644
> --- a/arch/x86/mm/checkpoint.c
> +++ b/arch/x86/mm/checkpoint.c
> @@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
>        cr_hbuf_put(ctx, sizeof(*hh));
>        return ret;
>  }
> +
> +/* dump the mm->context state */
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int ret;
> +
> +       h.type = CR_HDR_MM_CONTEXT;
> +       h.len = sizeof(*hh);
> +       h.parent = parent;
> +
> +       mutex_lock(&mm->context.lock);
> +
> +       hh->ldt_entry_size = LDT_ENTRY_SIZE;
> +       hh->nldt = mm->context.size;
> +
> +       cr_debug("nldt %d\n", hh->nldt);
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               goto out;
> +
> +       ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
> + out:
> +       mutex_unlock(&mm->context.lock);
> +
> +       return ret;
> +}
> diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
> index 883a163..d7fb89a 100644
> --- a/arch/x86/mm/restart.c
> +++ b/arch/x86/mm/restart.c
> @@ -8,6 +8,7 @@
>  *  distribution for more details.
>  */
>
> +#include <linux/unistd.h>
>  #include <asm/desc.h>
>  #include <asm/i387.h>
>
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index d2df68c..3a0df6d 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,4 +2,5 @@
>  # Makefile for linux checkpoint/restart.
>  #
>
> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
> +               ckpt_mem.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index d34a691..4dae775 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -55,6 +55,55 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
>        return cr_write_obj(ctx, &h, str);
>  }
>
> +/**
> + * cr_fill_fname - return pathname of a given file
> + * @path: path name
> + * @root: relative root
> + * @buf: buffer for pathname
> + * @n: buffer length (in) and pathname length (out)
> + */
> +static char *
> +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
> +{
> +       char *fname;
> +
> +       BUG_ON(!buf);
> +       fname = __d_path(path, root, buf, *n);
> +       if (!IS_ERR(fname))
> +               *n = (buf + (*n) - fname);
> +       return fname;
> +}
> +
> +/**
> + * cr_write_fname - write a file name
> + * @ctx: checkpoint context
> + * @path: path name
> + * @root: relative root
> + */
> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
> +{
> +       struct cr_hdr h;
> +       char *buf, *fname;
> +       int ret, flen;
> +
> +       flen = PATH_MAX;
> +       buf = kmalloc(flen, GFP_KERNEL);
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       fname = cr_fill_fname(path, root, buf, &flen);
> +       if (!IS_ERR(fname)) {
> +               h.type = CR_HDR_FNAME;
> +               h.len = flen;
> +               h.parent = 0;
> +               ret = cr_write_obj(ctx, &h, fname);
> +       } else
> +               ret = PTR_ERR(fname);
> +
> +       kfree(buf);
> +       return ret;
> +}
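The length arithmetic in cr_fill_fname() depends on the path being built backward from the end of the buffer, as __d_path() does, so that buf + n - fname is the string length including the terminating NUL. A user-space sketch of that layout follows. sk_path_fill_backward is a hypothetical helper that assembles a path from components; it is not how the dcache walk is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build "/c0/c1/..." at the END of buf, filling backward.
 * On entry *n is the buffer size; on success *n becomes the path
 * length including the trailing NUL, and the start of the string
 * (somewhere inside buf) is returned. Returns NULL if it does not fit.
 */
static char *sk_path_fill_backward(char *buf, int *n,
				   const char *const *comps, int ncomps)
{
	char *end = buf + *n;
	char *p = end;
	int i;

	if (*n < 1)
		return NULL;
	*--p = '\0';
	for (i = ncomps - 1; i >= 0; i--) {
		size_t len = strlen(comps[i]);

		if ((size_t)(p - buf) < len + 1)
			return NULL;    /* no room for "/" + component */
		p -= len;
		memcpy(p, comps[i], len);
		*--p = '/';
	}
	*n = (int)(end - p);            /* same formula as buf + n - fname */
	return p;
}
```

This is also why cr_write_fname() must pass fname, not buf, to the writer and still kfree(buf): the returned pointer is generally not the start of the allocation.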
> +
>  /* write the checkpoint header */
>  static int cr_write_head(struct cr_ctx *ctx)
>  {
> @@ -164,6 +213,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>        cr_debug("task_struct: ret %d\n", ret);
>        if (ret < 0)
>                goto out;
> +       ret = cr_write_mm(ctx, t);
> +       cr_debug("memory: ret %d\n", ret);
> +       if (ret < 0)
> +               goto out;
>        ret = cr_write_thread(ctx, t);
>        cr_debug("thread: ret %d\n", ret);
>        if (ret < 0)
> diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
> index 5bd4703..9bd0ba4 100644
> --- a/checkpoint/ckpt_arch.h
> +++ b/checkpoint/ckpt_arch.h
> @@ -2,6 +2,7 @@
>
>  int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>  int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
>
>  int cr_read_thread(struct cr_ctx *ctx);
>  int cr_read_cpu(struct cr_ctx *ctx);
> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> new file mode 100644
> index 0000000..2c93447
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.c
> @@ -0,0 +1,448 @@
> +/*
> + *  Checkpoint memory contents
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/mm_types.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +#include "ckpt_arch.h"
> +#include "ckpt_mem.h"
> +
> +/*
> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
> + * (common to ckpt_mem.c and rstr_mem.c).
> + *
> + * The checkpoint context structure has two members for page-arrays:
> + *   ctx->pgarr: list head of the page-array chain
> + *   ctx->pgcur: tracks the "current" position in the chain
> + *
> + * During checkpoint (and restart) the chain tracks the dirty pages (page
> + * pointer and virtual address) of each MM. For a particular MM, these are
> + * always added to the "current" page-array (ctx->pgcur). The "current"
> + * page-array advances as necessary, and new page-array descriptors are
> + * allocated on-demand. Before the next MM, the chain is reset but not
> + * freed (that is, dereference page pointers and reset ctx->pgcur).
> + */
> +
> +#define CR_PGARR_ORDER  0
> +#define CR_PGARR_TOTAL  ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
> +
> +/* release pages referenced by a page-array */
> +void cr_pgarr_unref_pages(struct cr_pgarr *pgarr)
> +{
> +       int n;
> +
> +       /* only checkpoint keeps references to pages */
> +       if (pgarr->pages) {
> +               cr_debug("nr_used %d\n", pgarr->nr_used);
> +               for (n = pgarr->nr_used; n--; )
> +                       page_cache_release(pgarr->pages[n]);
> +       }
> +}
> +
> +/* free a single page-array object */
> +static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
> +{
> +       cr_pgarr_unref_pages(pgarr);
> +       if (pgarr->pages)
> +               free_pages((unsigned long) pgarr->pages, CR_PGARR_ORDER);
> +       if (pgarr->vaddrs)
> +               free_pages((unsigned long) pgarr->vaddrs, CR_PGARR_ORDER);
> +       kfree(pgarr);
> +}
> +
> +/* free a chain of page-arrays */
> +void cr_pgarr_free(struct cr_ctx *ctx)
> +{
> +       struct cr_pgarr *pgarr, *tmp;
> +
> +       list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr, list) {
> +               list_del(&pgarr->list);
> +               cr_pgarr_free_one(pgarr);
> +       }
> +       ctx->pgcur = NULL;
> +}
> +
> +/* allocate a single page-array object */
> +static struct cr_pgarr *cr_pgarr_alloc_one(void)
> +{
> +       struct cr_pgarr *pgarr;
> +
> +       pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
> +       if (!pgarr)
> +               return NULL;
> +
> +       pgarr->nr_free = CR_PGARR_TOTAL;
> +       pgarr->nr_used = 0;
> +
> +       pgarr->pages = (struct page **)
> +               __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
> +       pgarr->vaddrs = (unsigned long *)
> +               __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
> +       if (!pgarr->pages || !pgarr->vaddrs) {
> +               cr_pgarr_free_one(pgarr);
> +               return NULL;
> +       }
> +
> +       return pgarr;
> +}
> +
> +/* cr_pgarr_alloc - return the next available pgarr in the page-array chain
> + * @ctx: checkpoint context
> + *
> + * Return the page-array following ctx->pgcur, extending the chain if needed
> + */
> +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx)
> +{
> +       struct cr_pgarr *pgarr;
> +
> +       /* can reuse next element after ctx->pgcur ? */
> +       pgarr = ctx->pgcur;
> +       if (pgarr && !list_is_last(&pgarr->list, &ctx->pgarr)) {
> +               pgarr = list_entry(pgarr->list.next, struct cr_pgarr, list);
> +               goto out;
> +       }
> +
> +       /* nope, need to extend the page-array chain */
> +       pgarr = cr_pgarr_alloc_one();
> +       if (!pgarr)
> +               return NULL;
> +
> +       list_add_tail(&pgarr->list, &ctx->pgarr);
> + out:
> +       ctx->pgcur = pgarr;
> +       return pgarr;
> +
> +}
> +
> +/* reset the page-array chain (dropping page references if necessary) */
> +void cr_pgarr_reset(struct cr_ctx *ctx)
> +{
> +       struct cr_pgarr *pgarr;
> +
> +       list_for_each_entry(pgarr, &ctx->pgarr, list) {
> +               cr_pgarr_unref_pages(pgarr);
> +               pgarr->nr_free = CR_PGARR_TOTAL;
> +               pgarr->nr_used = 0;
> +       }
> +       ctx->pgcur = NULL;
> +}
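The reuse behaviour of the page-array chain (grow on demand, reset without freeing) can be sketched in user space as below. This is a simplified model under stated assumptions: a singly linked list replaces list_head, only addresses are tracked, and, unlike the patch as posted, an empty "current" pointer is treated as "start again from the head". All sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_ARR_CAP 4            /* tiny capacity to make chaining visible */

struct sk_arr {
	struct sk_arr *next;
	int nr_used;
	unsigned long vaddrs[SK_ARR_CAP];
};

struct sk_chain {
	struct sk_arr *head;    /* first array in the chain */
	struct sk_arr *cur;     /* array currently being filled */
	int nr_arrays;          /* arrays allocated so far */
};

/* advance to the next array, reusing an existing one when possible */
static struct sk_arr *sk_chain_next(struct sk_chain *c)
{
	struct sk_arr *a = c->cur ? c->cur->next : c->head;

	if (!a) {
		a = calloc(1, sizeof(*a));
		if (!a)
			return NULL;
		if (c->cur)
			c->cur->next = a;
		else
			c->head = a;
		c->nr_arrays++;
	}
	c->cur = a;
	return a;
}

/* record one address, moving to the next array when the current is full */
static int sk_chain_add(struct sk_chain *c, unsigned long vaddr)
{
	struct sk_arr *a = c->cur;

	if (!a || a->nr_used == SK_ARR_CAP) {
		a = sk_chain_next(c);
		if (!a)
			return -1;
	}
	a->vaddrs[a->nr_used++] = vaddr;
	return 0;
}

/* empty every array but keep the allocations for the next MM */
static void sk_chain_reset(struct sk_chain *c)
{
	struct sk_arr *a;

	for (a = c->head; a; a = a->next)
		a->nr_used = 0;
	c->cur = NULL;
}

static void sk_chain_free(struct sk_chain *c)
{
	struct sk_arr *a = c->head, *next;

	while (a) {
		next = a->next;
		free(a);
		a = next;
	}
	c->head = c->cur = NULL;
	c->nr_arrays = 0;
}
```

The point of reset-without-free is that the second and later address spaces cost no allocations as long as they dirty no more pages than the largest one seen so far.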
> +
> +
> +/* return current page-array (and allocate if needed) */
> +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx
> +)

The closing parenthesis should be on the line above. :)
> +{
> +       struct cr_pgarr *pgarr = ctx->pgcur;
> +
> +       if (!pgarr->nr_free)
> +               pgarr = cr_pgarr_alloc(ctx);
> +       return pgarr;
> +}
> +
> +/*
> + * Checkpoint is outside the context of the checkpointee, so one cannot
> + * simply read pages from user-space. Instead, we scan the address space
> + * of the target to cherry-pick pages of interest. Selected pages are
> + * enlisted in a page-array chain (attached to the checkpoint context).
> + * To save their contents, each page is mapped to kernel memory and then
> + * dumped to the file descriptor.
> + */
> +
> +/**
> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
> + * @ctx - checkpoint context
> + * @pgarr - page-array to fill
> + * @vma - vma to scan
> + * @start - start address (updated)
> + */
> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> +                            struct vm_area_struct *vma, unsigned long *start)
> +{
> +       unsigned long end = vma->vm_end;
> +       unsigned long addr = *start;
> +       struct page **pagep;
> +       unsigned long *addrp;
> +       int cow, nr, ret = 0;
> +
> +       nr = pgarr->nr_free;
> +       pagep = &pgarr->pages[pgarr->nr_used];
> +       addrp = &pgarr->vaddrs[pgarr->nr_used];
> +       cow = !!vma->vm_file;
> +
> +       while (addr < end) {
> +               struct page *page;
> +
> +               /*
> +                * simplified version of get_user_pages(): already have vma,
> +                * only need FOLL_TOUCH, and (for now) ignore fault stats.
> +                *
> +                * FIXME: consolidate with get_user_pages()
> +                */
> +
> +               cond_resched();
> +               while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> +                       ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> +                       if (ret & VM_FAULT_ERROR) {
> +                               if (ret & VM_FAULT_OOM)
> +                                       ret = -ENOMEM;
> +                               else if (ret & VM_FAULT_SIGBUS)
> +                                       ret = -EFAULT;
> +                               else
> +                                       BUG();
> +                               break;
> +                       }
> +                       cond_resched();
> +                       ret = 0;
> +               }
> +
> +               if (IS_ERR(page))
> +                       ret = PTR_ERR(page);
> +
> +               if (ret < 0)
> +                       break;
> +
> +               if (page == ZERO_PAGE(0)) {
> +                       page = NULL;    /* zero page: ignore */
> +               } else if (cow && page_mapping(page) != NULL) {
> +                       page = NULL;    /* clean cow: ignore */
> +               } else {
> +                       get_page(page);
> +                       *(addrp++) = addr;
> +                       *(pagep++) = page;
> +                       if (--nr == 0) {
> +                               addr += PAGE_SIZE;
> +                               break;
> +                       }
> +               }
> +
> +               addr += PAGE_SIZE;
> +       }
> +
> +       if (unlikely(ret < 0)) {
> +               nr = pgarr->nr_free - nr;
> +               while (nr--)
> +                       page_cache_release(*(--pagep));
> +               return ret;
> +       }
> +
> +       *start = addr;
> +       return pgarr->nr_free - nr;
> +}
> +
> +/**
> + * cr_vma_scan_pages - scan vma for pages that will need to be dumped
> + * @ctx - checkpoint context
> + * @vma - vma to scan
> + *
> + * lists of page pointers and corresponding virtual addresses are tracked
> + * inside ctx->pgarr page-array chain
> + */
> +static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +       unsigned long addr = vma->vm_start;
> +       unsigned long end = vma->vm_end;
> +       struct cr_pgarr *pgarr;
> +       int nr, total = 0;
> +
> +       while (addr < end) {
> +               pgarr = cr_pgarr_prep(ctx);
> +               if (!pgarr)
> +                       return -ENOMEM;
> +               nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr);
> +               if (nr < 0)
> +                       return nr;
> +               pgarr->nr_free -= nr;
> +               pgarr->nr_used += nr;
> +               total += nr;
> +       }
> +
> +       cr_debug("total %d\n", total);
> +       return total;
> +}
> +
> +static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
> +{
> +       void *ptr;
> +
> +       ptr = kmap_atomic(page, KM_USER1);
> +       memcpy(buf, ptr, PAGE_SIZE);
> +       kunmap_atomic(ptr, KM_USER1);
> +
> +       return cr_kwrite(ctx, buf, PAGE_SIZE);
> +}
> +
> +/**
> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
> + * @ctx - checkpoint context
> + * @total - total number of pages
> + *
> + * First dump all virtual addresses, followed by the contents of all pages
> + */
> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
> +{
> +       struct cr_pgarr *pgarr;
> +       char *buf;
> +       int i, ret = 0;
> +
> +       if (!total)
> +               return 0;
> +
> +       list_for_each_entry(pgarr, &ctx->pgarr, list) {
> +               ret = cr_kwrite(ctx, pgarr->vaddrs,
> +                               pgarr->nr_used * sizeof(*pgarr->vaddrs));
> +               if (ret < 0)
> +                       return ret;
> +       }
> +
> +       buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       list_for_each_entry(pgarr, &ctx->pgarr, list) {
> +               for (i = 0; i < pgarr->nr_used; i++) {
> +                       ret = cr_page_write(ctx, pgarr->pages[i], buf);
> +                       if (ret < 0)
> +                               goto out;
> +               }
> +       }
> +
> + out:
> +       kfree(buf);
> +       return ret;
> +}
> +
> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       int vma_type, nr, ret;
> +
> +       h.type = CR_HDR_VMA;
> +       h.len = sizeof(*hh);
> +       h.parent = 0;
> +
> +       hh->vm_start = vma->vm_start;
> +       hh->vm_end = vma->vm_end;
> +       hh->vm_page_prot = vma->vm_page_prot.pgprot;
> +       hh->vm_flags = vma->vm_flags;
> +       hh->vm_pgoff = vma->vm_pgoff;
> +
> +       if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
> +               pr_warning("CR: unsupported VMA %#lx\n", vma->vm_flags);
> +               return -ETXTBSY;
> +       }
> +
> +       /* by default assume anon memory */
> +       vma_type = CR_VMA_ANON;
> +
> +       /* if there is a backing file, assume private-mapped */
> +       /* (FIX: check if the file is unlinked) */
> +       if (vma->vm_file)
> +               vma_type = CR_VMA_FILE;
> +
> +       hh->vma_type = vma_type;
> +
> +       /*
> +        * it seems redundant now, but we do it in 3 steps because:
> +        * first, the logic is simpler when we know how many pages before
> +        * dumping them; second, a future optimization will defer the
> +        * writeout (dump, and free) to a later step; in which case all
> +        * the pages to be dumped will be aggregated on the checkpoint ctx
> +        */
> +
> +       /* (1) scan: scan through the PTEs of the vma to count the pages
> +        * to dump (and later make those pages COW), and keep the list of
> +        * pages (and a reference to each page) on the checkpoint ctx */
> +       nr = cr_vma_scan_pages(ctx, vma);
> +       if (nr < 0)
> +               return nr;
> +
> +       hh->nr_pages = nr;
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;
> +       /* save the file name, if relevant */
> +       if (vma->vm_file)
> +               ret = cr_write_fname(ctx, &vma->vm_file->f_path, ctx->vfsroot);
> +
> +       if (ret < 0)
> +               return ret;
> +
> +       /* (2) dump: write out the addresses of all pages in the list (on
> +        * the checkpoint ctx) followed by the contents of all pages */
> +       ret = cr_vma_dump_pages(ctx, nr);
> +
> +       /* (3) free: release the extra references to the pages in the list */
> +       cr_pgarr_reset(ctx);
> +
> +       return ret;
> +}
> +
> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct mm_struct *mm;
> +       struct vm_area_struct *vma;
> +       int objref, ret;
> +
> +       h.type = CR_HDR_MM;
> +       h.len = sizeof(*hh);
> +       h.parent = task_pid_vnr(t);
> +
> +       mm = get_task_mm(t);
> +
> +       objref = 0;     /* will be meaningful with multiple processes */
> +       hh->objref = objref;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       hh->start_code = mm->start_code;
> +       hh->end_code = mm->end_code;
> +       hh->start_data = mm->start_data;
> +       hh->end_data = mm->end_data;
> +       hh->start_brk = mm->start_brk;
> +       hh->brk = mm->brk;
> +       hh->start_stack = mm->start_stack;
> +       hh->arg_start = mm->arg_start;
> +       hh->arg_end = mm->arg_end;
> +       hh->env_start = mm->env_start;
> +       hh->env_end = mm->env_end;
> +
> +       hh->map_count = mm->map_count;
> +
> +       /* FIX: need also mm->flags */
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               goto out;
> +
> +       /* write the vma's */
> +       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +               ret = cr_write_vma(ctx, vma);
> +               if (ret < 0)
> +                       goto out;
> +       }
> +
> +       ret = cr_write_mm_context(ctx, mm, objref);
> +
> + out:
> +       up_read(&mm->mmap_sem);
> +       mmput(mm);
> +       return ret;
> +}
> diff --git a/checkpoint/ckpt_mem.h b/checkpoint/ckpt_mem.h
> new file mode 100644
> index 0000000..8ee211d
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.h
> @@ -0,0 +1,35 @@
> +#ifndef _CHECKPOINT_CKPT_MEM_H_
> +#define _CHECKPOINT_CKPT_MEM_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/mm_types.h>
> +
> +/*
> + * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
> + * tuples (where vaddr is the virtual address of a page in a particular mm).
> + * Specifically, we use separate arrays so that all vaddrs can be written
> + * and read at once.
> + */
> +
> +struct cr_pgarr {
> +       unsigned long *vaddrs;
> +       struct page **pages;
> +       unsigned int nr_used;   /* how many entries already used */
> +       unsigned int nr_free;   /* how many entries still free */
> +       struct list_head list;
> +};
> +
> +void cr_pgarr_reset(struct cr_ctx *ctx);
> +void cr_pgarr_free(struct cr_ctx *ctx);
> +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx);
> +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
> +
> +#endif /* _CHECKPOINT_CKPT_MEM_H_ */
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index 113e0df..8141161 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -16,6 +16,8 @@
>  #include <linux/capability.h>
>  #include <linux/ckpt.h>
>
> +#include "ckpt_mem.h"
> +
>  /*
>  * helpers to write/read to/from the image file descriptor
>  *
> @@ -110,7 +112,6 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count)
>        return ret;
>  }
>
> -
>  /*
>  * helpers to manage CR contexts: allocated for each checkpoint and/or
>  * restart operation, and persists until the operation is completed.
> @@ -126,6 +127,11 @@ void cr_ctx_free(struct cr_ctx *ctx)
>
>        free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
>
> +       if (ctx->vfsroot)
> +               path_put(ctx->vfsroot);
> +
> +       cr_pgarr_free(ctx);
> +
>        kfree(ctx);
>  }
>
> @@ -145,10 +151,13 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>        get_file(ctx->file);
>
>        ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
> -       if (!ctx->hbuf) {
> -               cr_ctx_free(ctx);
> -               return ERR_PTR(-ENOMEM);
> -       }
> +       if (!ctx->hbuf)
> +               goto nomem;
> +
> +       /* assume checkpointer is in container's root vfs */
> +       /* FIXME: this works for now, but will change with real containers */
> +       ctx->vfsroot = &current->fs->root;
> +       path_get(ctx->vfsroot);
>
>        ctx->pid = pid;
>        ctx->flags = flags;
> @@ -156,6 +165,10 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>        ctx->crid = atomic_inc_return(&cr_ctx_count);
>
>        return ctx;
> +
> + nomem:
> +       cr_ctx_free(ctx);
> +       return ERR_PTR(-ENOMEM);
>  }
>
>  /*
> diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
> index 44a903c..6bc61ac 100644
> --- a/include/asm-x86/ckpt_hdr.h
> +++ b/include/asm-x86/ckpt_hdr.h
> @@ -69,4 +69,9 @@ struct cr_hdr_cpu {
>
>  } __attribute__((aligned(8)));
>
> +struct cr_hdr_mm_context {
> +       __s16 ldt_entry_size;
> +       __s16 nldt;
> +} __attribute__((aligned(8)));
> +
>  #endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
> index 91f4998..5c62a90 100644
> --- a/include/linux/ckpt.h
> +++ b/include/linux/ckpt.h
> @@ -10,6 +10,9 @@
>  *  distribution for more details.
>  */
>
> +#include <linux/path.h>
> +#include <linux/fs.h>
> +
>  #define CR_VERSION  1
>
>  struct cr_ctx {
> @@ -24,6 +27,11 @@ struct cr_ctx {
>
>        void *hbuf;             /* temporary buffer for headers */
>        int hpos;               /* position in headers buffer */
> +
> +       struct list_head pgarr; /* page array for dumping VMA contents */
> +       struct cr_pgarr *pgcur; /* current position in page array */
> +
> +       struct path *vfsroot;   /* container root (FIXME) */
>  };
>
>  /* cr_ctx: flags */
> @@ -46,11 +54,15 @@ struct cr_hdr;
>
>  int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
>  int cr_write_string(struct cr_ctx *ctx, char *str, int len);
> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);
>
>  int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>  int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
>  int cr_read_string(struct cr_ctx *ctx, void *str, int len);
>
> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
> +int cr_read_mm(struct cr_ctx *ctx);
> +
>  int do_checkpoint(struct cr_ctx *ctx);
>  int do_restart(struct cr_ctx *ctx);
>
> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> index e66f322..ac77d7d 100644
> --- a/include/linux/ckpt_hdr.h
> +++ b/include/linux/ckpt_hdr.h
> @@ -32,6 +32,7 @@ struct cr_hdr {
>  enum {
>        CR_HDR_HEAD = 1,
>        CR_HDR_STRING,
> +       CR_HDR_FNAME,
>
>        CR_HDR_TASK = 101,
>        CR_HDR_THREAD,
> @@ -82,4 +83,33 @@ struct cr_hdr_task {
>        __s32 task_comm_len;
>  } __attribute__((aligned(8)));
>
> +struct cr_hdr_mm {
> +       __u32 objref;           /* identifier for shared objects */
> +       __u32 map_count;
> +
> +       __u64 start_code, end_code, start_data, end_data;
> +       __u64 start_brk, brk, start_stack;
> +       __u64 arg_start, arg_end, env_start, env_end;
> +
> +} __attribute__((aligned(8)));
> +
> +/* vma subtypes */
> +enum vm_type {
> +       CR_VMA_ANON = 1,
> +       CR_VMA_FILE
> +};
> +
> +struct cr_hdr_vma {
> +       __u32 vma_type;
> +       __u32 _padding;
> +       __s64 nr_pages;
> +
> +       __u64 vm_start;
> +       __u64 vm_end;
> +       __u64 vm_page_prot;
> +       __u64 vm_flags;
> +       __u64 vm_pgoff;
> +
> +} __attribute__((aligned(8)));
> +
>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */
> --
> 1.5.4.3
>



-- 
Kind regards,
MinChan Kim

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-09 23:35     ` Oren Laadan
@ 2008-09-10 15:00       ` Serge E. Hallyn
  0 siblings, 0 replies; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-10 15:00 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> [...]
> 
> >> +/* change the protection of an address range to be writable/non-writable.
> >> + * this is useful when restoring the memory of a read-only vma */
> >> +static int cr_vma_set_writable(struct mm_struct *mm, unsigned long start,
> >> +			       unsigned long end, int writable)
> >> +{
> >> +	struct vm_area_struct *vma, *prev;
> >> +	unsigned long flags = 0;
> >> +	int ret = -EINVAL;
> >> +
> >> +	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
> >> +
> >> +	down_write(&mm->mmap_sem);
> >> +	vma = find_vma_prev(mm, start, &prev);
> >> +	if (!vma || vma->vm_start > end || vma->vm_end < start)
> >> +		goto out;
> >> +	if (writable && !(vma->vm_flags & VM_WRITE))
> >> +		flags = vma->vm_flags | VM_WRITE;
> >> +	else if (!writable && (vma->vm_flags & VM_WRITE))
> >> +		flags = vma->vm_flags & ~VM_WRITE;
> >> +	cr_debug("flags %#lx\n", flags);
> >> +	if (flags)
> >> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
> >> +				     vma->vm_end, flags);
> > 
> > As Dave has pointed out, this appears to be a security problem.  I think
> 
> As I replied to Dave, I don't see why this would be a security problem.
> 
> This handles private memory only. In particular, the uncommon case of a
> read-only VMA that has modified contents. This _cannot_ affect the file
> from which this VMA may have been mapped.
> 
> Shared memory (not file-mapped) will be handled differently: since it is
> always backed up by an inode in shmfs, the restart will populate the
> relevant pages directly. Besides, non-file-mapped shared memory is again
> not a security concern.
> 
> Finally, shared memory that maps to a file is simply _not saved_ at all;
> it is part of the file system, and belongs to the (future) file system
> snapshot capability. Since the contents are always available in the file
> system, we don't need to save it (like we don't save shared libraries).
> 
> That said, it is necessary that the code ensures that the vm_flags that
> belong to a VMA of a private type, e.g. CR_VMA_ANON/CR_VMA_FILE, indeed
> match it (ie, don't have VM_MAYSHARE/VM_SHARED). I'll add that.

Cool.  That sounds good and I'll look for that in the next version.

There still may be objections about bypassing selinux execmem/execheap
permission checks, but I think that's ok for now.  Long-term I expect
we'll want the security_file_mprotect checks there, and selinux users
will have to use a policy where restart is started in a privileged
restart_t domain or somesuch (and eventually transitions back to the
checkpointed selinux type if possible).

thanks,
-serge

> > what you need to do is create a new helper mprotect_fixup_withchecks(),
> > which does all the DAC+MAC checks which are done in the sys_mprotect()
> > loop starting with "for (nstart = start ; ; ) {...  Otherwise an
> > unprivileged user can create a checkpoint image of a program which has
> > done a ro shared file mmap, edit the checkpoint, then restart it and (I
> > assume) cause the modified contents to be written to the file.  This
> > could violate both DAC checks and selinux checks.
> > 
> > So create that helper which does the security checks, and use it
> > both here and in the sys_mprotect() loop, please.
> > 
> 
> [...]
> 
> Oren.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 9/9] File descriprtors (restore)
  2008-09-10  1:49     ` Oren Laadan
@ 2008-09-10 16:09       ` Dave Hansen
  2008-09-10 18:55         ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 16:09 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, arnd, linux-kernel

On Tue, 2008-09-09 at 21:49 -0400, Oren Laadan wrote:
> 
> Dave Hansen wrote:
> > On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> >> +static int cr_close_all_fds(struct files_struct *files)
> >> +{
> >> +       int *fdtable;
> >> +       int n;
> >> +
> >> +       do {
> >> +               n = cr_scan_fds(files, &fdtable);
> >> +               if (n < 0)
> >> +                       return n;
> >> +               while (n--)
> >> +                       sys_close(fdtable[n]);
> >> +               kfree(fdtable);
> >> +       } while (n != -1);
> >> +
> >> +       return 0;
> >> +}
> > 
> > This needs to use an ERR_PTR().  It will save using the double-pointer.
> 
> I suppose you refer to the call to cr_scan_fds(): either 'fdtable'
> or 'n' will have to pass-by-reference. Is it that you prefer it to be
> 	fdtable = cr_scan_fds(files, &n);
> ?

I was misreading the use of 'n'.  Can you really not use close_files()
for this operation?  You'd need to add some locking around it, but I
think it does what you need here.

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
  2008-09-09  9:22   ` Vegard Nossum
  2008-09-10  7:51   ` MinChan Kim
@ 2008-09-10 16:55   ` Dave Hansen
  2008-09-10 17:45     ` Dave Hansen
  2008-09-10 18:28     ` Oren Laadan
  2008-09-10 21:38   ` [RFC v4][PATCH " Dave Hansen
  2008-09-12 16:57   ` Dave Hansen
  4 siblings, 2 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 16:55 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> +       while (addr < end) {
> +               struct page *page;
> +
> +               /*
> +                * simplified version of get_user_pages(): already have vma,
> +                * only need FOLL_TOUCH, and (for now) ignore fault stats.
> +                *
> +                * FIXME: consolidate with get_user_pages()
> +                */
> +
> +               cond_resched();
> +               while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> +                       ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> +                       if (ret & VM_FAULT_ERROR) {
> +                               if (ret & VM_FAULT_OOM)
> +                                       ret = -ENOMEM;
> +                               else if (ret & VM_FAULT_SIGBUS)
> +                                       ret = -EFAULT;
> +                               else
> +                                       BUG();
> +                               break;
> +                       }
> +                       cond_resched();
> +                       ret = 0;
> +               }

get_user_pages() is really the wrong thing to use here.  It makes pages
*present* so that we can do things like hand them off to a driver.  For
checkpointing, we really don't care about that.  It's a waste of time,
for instance to perform faults to fill the mappings up with zero pages
and page tables.  Just think of what will happen the first time we touch
a very large, very sparse anonymous area.  We'll probably kill the
system just allocating page tables.  Take a look at the comment in
follow_page().  This is a similar operation to core dumping, and we need
to be careful.

This might be fine for a proof of concept, but it needs to be thought
out much more thoroughly before getting merged.  I guess I'm
volunteering to go do that.

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-10 16:55   ` Dave Hansen
@ 2008-09-10 17:45     ` Dave Hansen
  2008-09-10 18:28     ` Oren Laadan
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 17:45 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, arnd, linux-kernel

On Wed, 2008-09-10 at 09:55 -0700, Dave Hansen wrote:
> 
> > +               cond_resched();
> > +               while (!(page = follow_page(vma, addr, FOLL_TOUCH))) 

Why is the FOLL_TOUCH required here?

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-10 16:55   ` Dave Hansen
  2008-09-10 17:45     ` Dave Hansen
@ 2008-09-10 18:28     ` Oren Laadan
  2008-09-10 21:03       ` Cleanups for [PATCH " Dave Hansen
  1 sibling, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-10 18:28 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
>> +       while (addr < end) {
>> +               struct page *page;
>> +
>> +               /*
>> +                * simplified version of get_user_pages(): already have vma,
>> +                * only need FOLL_TOUCH, and (for now) ignore fault stats.
>> +                *
>> +                * FIXME: consolidate with get_user_pages()
>> +                */
>> +
>> +               cond_resched();
>> +               while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
>> +                       ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
>> +                       if (ret & VM_FAULT_ERROR) {
>> +                               if (ret & VM_FAULT_OOM)
>> +                                       ret = -ENOMEM;
>> +                               else if (ret & VM_FAULT_SIGBUS)
>> +                                       ret = -EFAULT;
>> +                               else
>> +                                       BUG();
>> +                               break;
>> +                       }
>> +                       cond_resched();
>> +                       ret = 0;
>> +               }
> 
> get_user_pages() is really the wrong thing to use here.  It makes pages
> *present* so that we can do things like hand them off to a driver.  For
> checkpointing, we really don't care about that.  It's a waste of time,
> for instance to perform faults to fill the mappings up with zero pages
> and page tables.  Just think of what will happen the first time we touch
> a very large, very sparse anonymous area.  We'll probably kill the
> system just allocating page tables.  Take a look at the comment in
> follow_page().  This is a similar operation to core dumping, and we need
> to be careful.
> 
> This might be fine for a proof of concept, but it needs to be thought
> out much more thoroughly before getting merged.  I guess I'm
> volunteering to go do that.

The intention is not to allocate unallocated pages, but to get the page
pointer and bring in swapped out pages if necessary. (Avoiding swap-in
is possible, but left for future optimization).

Indeed, follow_page() does the work just fine; of course, it should be
called with FOLL_ANON instead of FOLL_TOUCH. Thanks for pointing that out.

Oren.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-10  6:10   ` MinChan Kim
@ 2008-09-10 18:36     ` Oren Laadan
  2008-09-10 22:54       ` MinChan Kim
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-10 18:36 UTC (permalink / raw)
  To: MinChan Kim; +Cc: dave, arnd, jeremy, linux-kernel, containers



MinChan Kim wrote:
> On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:

[...]

>> +struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>> +{
>> +       struct cr_ctx *ctx;
>> +
>> +       ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +       if (!ctx)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       ctx->file = fget(fd);
>> +       if (!ctx->file) {
>> +               cr_ctx_free(ctx);
>> +               return ERR_PTR(-EBADF);
>> +       }
>> +       get_file(ctx->file);
> 
> Why do you need get_file?
> You already called fget.
> Am I missing something ?

This was meant for when we will restart multiple processes, each would
have access to the checkpoint-context, such that the checkpoint-context
may outlive the task that created it and initiated the restart. Thus
the file-pointer will need to stay around longer than that task.

Of course, restart of multiple processes _can_ be coded such that this
first task will always terminate last - either after restart completes
successfully, or after all the other tasks aborted and won't use the
checkpoint-context anymore.

Because that code is not part of this patch-set, I considered it
safer to grab a reference of the file pointer, making it less likely
that we forget about it later.

> 
>> +       ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
>> +       if (!ctx->hbuf) {
>> +               cr_ctx_free(ctx);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +
>> +       ctx->pid = pid;
>> +       ctx->flags = flags;
>> +
>> +       ctx->crid = atomic_inc_return(&cr_ctx_count);
>> +
>> +       return ctx;
>> +}

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 9/9] File descriprtors (restore)
  2008-09-10 16:09       ` Dave Hansen
@ 2008-09-10 18:55         ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-10 18:55 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, arnd, linux-kernel



Dave Hansen wrote:
> On Tue, 2008-09-09 at 21:49 -0400, Oren Laadan wrote:
>> Dave Hansen wrote:
>>> On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
>>>> +static int cr_close_all_fds(struct files_struct *files)
>>>> +{
>>>> +       int *fdtable;
>>>> +       int n;
>>>> +
>>>> +       do {
>>>> +               n = cr_scan_fds(files, &fdtable);
>>>> +               if (n < 0)
>>>> +                       return n;
>>>> +               while (n--)
>>>> +                       sys_close(fdtable[n]);
>>>> +               kfree(fdtable);
>>>> +       } while (n != -1);
>>>> +
>>>> +       return 0;
>>>> +}
>>> This needs to use an ERR_PTR().  It will save using the double-pointer.
>> I suppose you refer to the call to cr_scan_fds(): either 'fdtable'
>> or 'n' will have to pass-by-reference. Is it that you prefer it to be
>> 	fdtable = cr_scan_fds(files, &n);
>> ?
> 
> I was misreading the use of 'n'.  Can you really not use close_files()
> for this operation?  You'd need to add some locking around it, but I
> think it does what you need here.

I thought about that. However, close_files() assumes that the files_struct
will be discarded thereafter, so it does not reset ->fd_open->fd_bits[] bits,
does not adjust ->next_fd field, and does not use rcu_assign_pointer(). And
then, even if we adjust, we'll have to watch future differences between
sys_close() and close_files(), so using sys_close() is more future-proof.

Besides, cr_scan_fds() is used by the checkpoint logic already, so it's easy
to reuse for restart as well.

Oren.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-09  7:42 ` [RFC v4][PATCH 5/9] Memory managemnet (restore) Oren Laadan
  2008-09-09 16:07   ` Serge E. Hallyn
@ 2008-09-10 19:31   ` Dave Hansen
  2008-09-10 19:48     ` Oren Laadan
  1 sibling, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 19:31 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> +/**
> + * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
> + * @ctx - restart context
> + * @npages - number of pages
> + */
> +static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
> +{
> +	struct cr_pgarr *pgarr;
> +	int nr, ret;
> +
> +	while (npages) {
> +		pgarr = cr_pgarr_prep(ctx);
> +		if (!pgarr)
> +			return -ENOMEM;
> +		nr = min(npages, (int) pgarr->nr_free);
> +		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
> +		if (ret < 0)
> +			return ret;
> +		pgarr->nr_free -= nr;
> +		pgarr->nr_used += nr;
> +		npages -= nr;
> +	}
> +	return 0;
> +}

cr_pgarr_prep() can return a partially full pgarr, right?  Won't the
cr_kread() always start at the beginning of the pgarr->vaddrs[] array?
Seems to me like it will clobber things from the last call.

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-10 19:31   ` Dave Hansen
@ 2008-09-10 19:48     ` Oren Laadan
  2008-09-10 20:49       ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-10 19:48 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
>> +/**
>> + * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
>> + * @ctx - restart context
>> + * @npages - number of pages
>> + */
>> +static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
>> +{
>> +	struct cr_pgarr *pgarr;
>> +	int nr, ret;
>> +
>> +	while (npages) {
>> +		pgarr = cr_pgarr_prep(ctx);
>> +		if (!pgarr)
>> +			return -ENOMEM;
>> +		nr = min(npages, (int) pgarr->nr_free);
>> +		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
>> +		if (ret < 0)
>> +			return ret;
>> +		pgarr->nr_free -= nr;
>> +		pgarr->nr_used += nr;
>> +		npages -= nr;
>> +	}
>> +	return 0;
>> +}
> 
> cr_pgarr_prep() can return a partially full pgarr, right?  Won't the
> cr_kread() always start at the beginning of the pgarr->vaddrs[] array?
> Seems to me like it will clobber things from the last call.

Note that 'nr' is either equal to ->nr_free - in which case we consume
the entire 'pgarr' vaddr array such that the next call to cr_pgarr_prep()
will get a fresh one, or is smaller than ->nr_free - in which case that
is the last iteration of the loop anyhow, so it won't be clobbered.

Also, after we return - our caller, cr_vma_read_pages(), resets the state
of the page-array chain by calling cr_pgarr_reset().

Oren.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-10 19:48     ` Oren Laadan
@ 2008-09-10 20:49       ` Dave Hansen
  2008-09-11  6:59         ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 20:49 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Wed, 2008-09-10 at 15:48 -0400, Oren Laadan wrote:
> Dave Hansen wrote:
> > On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> >> +/**
> >> + * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
> >> + * @ctx - restart context
> >> + * @npages - number of pages
> >> + */
> >> +static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
> >> +{
> >> +	struct cr_pgarr *pgarr;
> >> +	int nr, ret;
> >> +
> >> +	while (npages) {
> >> +		pgarr = cr_pgarr_prep(ctx);
> >> +		if (!pgarr)
> >> +			return -ENOMEM;
> >> +		nr = min(npages, (int) pgarr->nr_free);
> >> +		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		pgarr->nr_free -= nr;
> >> +		pgarr->nr_used += nr;
> >> +		npages -= nr;
> >> +	}
> >> +	return 0;
> >> +}
> > 
> > cr_pgarr_prep() can return a partially full pgarr, right?  Won't the
> > cr_kread() always start at the beginning of the pgarr->vaddrs[] array?
> > Seems to me like it will clobber things from the last call.
> 
> Note that 'nr' is either equal to ->nr_free - in which case we consume
> the entire 'pgarr' vaddr array such that the next call to cr_pgarr_prep()
> will get a fresh one, or is smaller than ->nr_free - in which case that
> is the last iteration of the loop anyhow, so it won't be clobbered.
> 
> Also, after we return - our caller, cr_vma_read_pages(), resets the state
> of the page-array chain by calling cr_pgarr_reset().

Man, that's awfully subtle for something which is so simple.

I think it is a waste of memory to have to hold *all* of the vaddrs in
memory at once.  Is there a real requirement for that somehow?  The code
would look a lot simpler and use less memory if it was done (for instance)
using a single 'struct pgaddr' at a time.  There are an awful lot of HPC
apps that have nearly all physical memory in the machine allocated and
mapped into a single VMA.  This approach could be quite painful there.

I know it's being done this way because that's what the dump format
looks like.  Would you consider changing the dump format to have blocks
of pages and vaddrs together?  That should also parallelize a bit more
naturally.

Anyway, this either needs a big fat comment or something that is
self-describing like this:

+       while (npages) {
+               pgarr = alloc_fresh_pgarr(...);
+               if (!pgarr)
+                       return -ENOMEM;
+               nr = min(npages, (int) pgarr->nr_free);
+               ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
+               if (ret < 0)
+                       return ret;
+               pgarr->nr_free -= nr;
+               pgarr->nr_used += nr;
+               npages -= nr;
+               add_pgarr_to_ctx(ctx, pgarr);
+       }
+       return 0;

When someone is looking at that, it is painfully obvious that they're
not writing over anyone else's vaddrs since the pgarr is fresh.  

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Cleanups for [PATCH 4/9] Memory management (dump)
  2008-09-10 18:28     ` Oren Laadan
@ 2008-09-10 21:03       ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 21:03 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, arnd, linux-kernel

This is a lot of changes.  But, they're all kinda intertwined
so it's hard to make individual patches out of them.  I've
tried to explain each of the changes as you look through
the patch sequentially.

Note that this patch removes more code than it adds, and I think it
makes everything more readable.  There are a few things I need to fix up
in the restore patch (like the use of nr_free), but nothing really
fundamental.

- Remove use of get_free_pages() for buffers, use kmalloc()
  for its debugging advantages, and not having to deal with
  "orders".
- Now that we use kfree(), don't check for NULL pages/vaddr
  since kfree() does it for us.
- Zero out the pgarr as we remove things from it.  Bug
  hunters will thank us.
- Change ctx->pgarr name to pgarr_list.
- Change the ordering of the pgarr_list so that the first
  entry is always the old "pgcurr".  Make a function to
  find the first entry, and kill "pgcurr".
- Get rid of pgarr->nr_free.  It's redundant with nr_used
  and the fixed-size allocation of all pgarrs.  Create
  a helper function (pgarr_is_full()) to replace it.
- Remove cr_pgarr_prep(), just use cr_pgarr_alloc().
- Create cr_add_to_pgarr() helper which also does some
  checking of the page states that come back from
  follow_page().
- Rename cr_vma_fill_pgarr() to cr_private_vma() to make
  it painfully obvious that it does not deal with
  shared memory of any kind.
- Don't fault in pages with handle_mm_fault(), we do not
  need them to be present, nor should we be wasting
  space on pagetables that need to be created for sparse
  memory areas.
- Add parentheses around 'page_mapping(page) != NULL' check
- Don't bother releasing pgarr pages for a VMA since it
  will never free all of the VMA's pages anyway
- Don't track total pages.  If we really need this, we can
  track the number of full 'pgarr's on the list, and add
  the used pages from the first one.
- Reverse list iteration since we changed the list order
- Give cr_pgarr_reset() a better name: cr_reset_all_pgarrs()
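
To make the list-ordering and nr_free changes above easier to follow,
here is a rough userspace model of the idea.  It is a sketch, not the
kernel code: the struct layout, get_pgarr_with_space(), add_vaddr() and
total_pages() are made-up names, PGARR_TOTAL is shrunk to 4, and unlike
the patch it guards against an empty list.  It shows that the first
entry can serve as the "current" pgarr, that fullness follows from
nr_used alone, and that the total can be derived from the number of
full pgarrs plus the used entries in the first one.

```c
#include <assert.h>
#include <stdlib.h>

#define PGARR_TOTAL 4	/* hypothetical; the patch uses PAGE_SIZE / sizeof(void *) */

struct pgarr {
	unsigned long vaddrs[PGARR_TOTAL];
	unsigned int nr_used;
	struct pgarr *next;
};

struct ctx {
	struct pgarr *first;		/* always the "current" pgarr */
	unsigned long nr_full_pgarrs;	/* full pgarrs behind the first one */
};

/* nr_free is redundant: a fixed-size array is full when nr_used hits the cap */
static int pgarr_is_full(const struct pgarr *pgarr)
{
	return pgarr->nr_used >= PGARR_TOTAL;
}

/* return the first pgarr if it has room, else push a new one at the head */
static struct pgarr *get_pgarr_with_space(struct ctx *ctx)
{
	struct pgarr *pgarr = ctx->first;

	if (pgarr && !pgarr_is_full(pgarr))
		return pgarr;

	pgarr = calloc(1, sizeof(*pgarr));
	if (!pgarr)
		return NULL;
	if (ctx->first)
		ctx->nr_full_pgarrs++;	/* the old head was full */
	pgarr->next = ctx->first;
	ctx->first = pgarr;
	return pgarr;
}

static int add_vaddr(struct ctx *ctx, unsigned long vaddr)
{
	struct pgarr *pgarr = get_pgarr_with_space(ctx);

	if (!pgarr)
		return -1;
	assert(!pgarr_is_full(pgarr));
	pgarr->vaddrs[pgarr->nr_used++] = vaddr;
	return 0;
}

/* no running total needed: derive it from the list state */
static unsigned long total_pages(const struct ctx *ctx)
{
	unsigned long total = ctx->nr_full_pgarrs * PGARR_TOTAL;

	if (ctx->first)
		total += ctx->first->nr_used;
	return total;
}
```

Because new pgarrs go on the head, anything that needs the entries in
insertion order has to walk the list backwards, which is why the dump
loops switch to reverse iteration.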



---

 linux-2.6.git-dave/checkpoint/ckpt_mem.c |  262 +++++++++++++------------------
 linux-2.6.git-dave/checkpoint/ckpt_mem.h |    3 
 linux-2.6.git-dave/include/linux/ckpt.h  |    4 
 3 files changed, 115 insertions(+), 154 deletions(-)

diff -puN checkpoint/ckpt_mem.c~p4-dave checkpoint/ckpt_mem.c
--- linux-2.6.git/checkpoint/ckpt_mem.c~p4-dave	2008-09-10 13:58:55.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/ckpt_mem.c	2008-09-10 14:00:04.000000000 -0700
@@ -25,41 +25,41 @@
  * (common to ckpt_mem.c and rstr_mem.c).
  *
  * The checkpoint context structure has two members for page-arrays:
- *   ctx->pgarr: list head of the page-array chain
- *   ctx->pgcur: tracks the "current" position in the chain
+ *   ctx->pgarr_list: list head of the page-array chain
  *
  * During checkpoint (and restart) the chain tracks the dirty pages (page
  * pointer and virtual address) of each MM. For a particular MM, these are
- * always added to the "current" page-array (ctx->pgcur). The "current"
+ * always added to the first entry in the ctx->pgarr_list.  This "current"
  * page-array advances as necessary, and new page-array descriptors are
  * allocated on-demand. Before the next MM, the chain is reset but not
- * freed (that is, dereference page pointers and reset ctx->pgcur).
+ * freed (that is, dereference page pointers).
  */
 
-#define CR_PGARR_ORDER  0
-#define CR_PGARR_TOTAL  ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
 
 /* release pages referenced by a page-array */
-void cr_pgarr_unref_pages(struct cr_pgarr *pgarr)
+void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
 {
 	int n;
 
-	/* only checkpoint keeps references to pages */
-	if (pgarr->pages) {
-		cr_debug("nr_used %d\n", pgarr->nr_used);
-		for (n = pgarr->nr_used; n--; )
-			page_cache_release(pgarr->pages[n]);
+	/*
+	 * No need to check pgarr->pages, since
+	 * nr_used will be 0 if it is NULL.
+	 */
+	cr_debug("nr_used %d\n", pgarr->nr_used);
+	for (n = (int)pgarr->nr_used - 1; n >= 0; n--) {
+		page_cache_release(pgarr->pages[n]);
+		pgarr->pages[n] = NULL;
+		pgarr->vaddrs[n] = 0;
 	}
 }
 
 /* free a single page-array object */
 static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
 {
-	cr_pgarr_unref_pages(pgarr);
-	if (pgarr->pages)
-		free_pages((unsigned long) pgarr->pages, CR_PGARR_ORDER);
-	if (pgarr->vaddrs)
-		free_pages((unsigned long) pgarr->vaddrs, CR_PGARR_ORDER);
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
 	kfree(pgarr);
 }
 
@@ -68,11 +68,10 @@ void cr_pgarr_free(struct cr_ctx *ctx)
 {
 	struct cr_pgarr *pgarr, *tmp;
 
-	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr, list) {
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
 		list_del(&pgarr->list);
 		cr_pgarr_free_one(pgarr);
 	}
-	ctx->pgcur = NULL;
 }
 
 /* allocate a single page-array object */
@@ -84,13 +83,11 @@ static struct cr_pgarr *cr_pgarr_alloc_o
 	if (!pgarr)
 		return NULL;
 
-	pgarr->nr_free = CR_PGARR_TOTAL;
 	pgarr->nr_used = 0;
-
-	pgarr->pages = (struct page **)
-		__get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
-	pgarr->vaddrs = (unsigned long *)
-		__get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+	pgarr->pages  = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				GFP_KERNEL);
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
 	if (!pgarr->pages || !pgarr->vaddrs) {
 		cr_pgarr_free_one(pgarr);
 		return NULL;
@@ -99,57 +96,56 @@ static struct cr_pgarr *cr_pgarr_alloc_o
 	return pgarr;
 }
 
-/* cr_pgarr_alloc - return the next available pgarr in the page-array chain
+static int pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	if (pgarr->nr_used >= CR_PGARR_TOTAL)
+		return 1;
+	return 0;
+}
+
+static struct cr_pgarr *cr_first_pgarr(struct cr_ctx *ctx)
+{
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* cr_get_empty_pgarr - return the next available pgarr in the page-array chain
  * @ctx: checkpoint context
  *
- * Return the page-array following ctx->pgcur, extending the chain if needed
+ * Return the first page-array in the list with space.  Extend the
+ * list if none has space.
  */
-struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx)
+struct cr_pgarr *cr_get_empty_pgarr(struct cr_ctx *ctx)
 {
 	struct cr_pgarr *pgarr;
 
-	/* can reuse next element after ctx->pgcur ? */
-	pgarr = ctx->pgcur;
-	if (pgarr && !list_is_last(&pgarr->list, &ctx->pgarr)) {
-		pgarr = list_entry(pgarr->list.next, struct cr_pgarr, list);
+	/*
+	 * This could just as easily be a list_for_each() if we
+	 * need to do a more comprehensive search.
+	 */
+	pgarr = cr_first_pgarr(ctx);
+	if (!pgarr_is_full(pgarr))
 		goto out;
-	}
 
-	/* nope, need to extend the page-array chain */
+	ctx->nr_full_pgarrs++;
 	pgarr = cr_pgarr_alloc_one();
 	if (!pgarr)
 		return NULL;
 
-	list_add_tail(&pgarr->list, &ctx->pgarr);
+	list_add(&pgarr->list, &ctx->pgarr_list);
  out:
-	ctx->pgcur = pgarr;
 	return pgarr;
 
 }
 
 /* reset the page-array chain (dropping page references if necessary) */
-void cr_pgarr_reset(struct cr_ctx *ctx)
+void cr_reset_all_pgarrs(struct cr_ctx *ctx)
 {
 	struct cr_pgarr *pgarr;
 
-	list_for_each_entry(pgarr, &ctx->pgarr, list) {
-		cr_pgarr_unref_pages(pgarr);
-		pgarr->nr_free = CR_PGARR_TOTAL;
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
+		cr_pgarr_release_pages(pgarr);
 		pgarr->nr_used = 0;
 	}
-	ctx->pgcur = NULL;
-}
-
-
-/* return current page-array (and allocate if needed) */
-struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx
-)
-{
-	struct cr_pgarr *pgarr = ctx->pgcur;
-
-	if (!pgarr->nr_free)
-		pgarr = cr_pgarr_alloc(ctx);
-	return pgarr;
 }
 
 /*
@@ -161,116 +157,84 @@ struct cr_pgarr *cr_pgarr_prep(struct cr
  * dumped to the file descriptor.
  */
 
+/*
+ * You must ensure that the pgarr has space before
+ * calling this function.
+ */
+static inline void cr_add_to_pgarr(struct cr_pgarr *pgarr, struct page *page,
+				  unsigned long vaddr)
+{
+	/*
+	 * We're really just handing the result of the
+	 * follow_page() here.
+	 */
+	if (page == NULL)
+		return;
+	if (page == ZERO_PAGE(0))
+		return;
+
+	get_page(page);
+	pgarr->pages[pgarr->nr_used] = page;
+	pgarr->vaddrs[pgarr->nr_used] = vaddr;
+	pgarr->nr_used++;
+}
+
+static inline void cr_pgarr_release_index(struct cr_pgarr *pgarr, int index)
+{
+	page_cache_release(pgarr->pages[index]);
+	pgarr->pages[index] = NULL;
+	pgarr->vaddrs[index] = 0;
+	pgarr->nr_used--;
+}
+
 /**
- * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
+ * cr_private_vma - fill the ctx structure with pgarrs containing
+ * 		    the contents of this VMA
  * @ctx - checkpoint context
- * @pgarr - page-array to fill
  * @vma - vma to scan
- * @start - start address (updated)
  */
-static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
-			     struct vm_area_struct *vma, unsigned long *start)
+static int cr_private_vma(struct cr_ctx *ctx,
+			struct vm_area_struct *vma)
 {
-	unsigned long end = vma->vm_end;
-	unsigned long addr = *start;
-	struct page **pagep;
-	unsigned long *addrp;
-	int cow, nr, ret = 0;
-
-	nr = pgarr->nr_free;
-	pagep = &pgarr->pages[pgarr->nr_used];
-	addrp = &pgarr->vaddrs[pgarr->nr_used];
-	cow = !!vma->vm_file;
+	struct cr_pgarr *pgarr;
+	unsigned long addr = vma->vm_start;
+	int ret = 0;
+	int cow = 0;
+	int orig_nr_used;
 
-	while (addr < end) {
-		struct page *page;
+reload:
+	pgarr = cr_get_empty_pgarr(ctx);
+	if (!pgarr)
+		return -ENOMEM;
 
-		/*
-		 * simplified version of get_user_pages(): already have vma,
-		 * only need FOLL_TOUCH, and (for now) ignore fault stats.
-		 *
-		 * FIXME: consolidate with get_user_pages()
-		 */
+	orig_nr_used = pgarr->nr_used;
+	/*
+	 * This function is only for private mappings.  If
+	 * the vma is file backed, it must be a cow.
+	 */
+	if (vma->vm_file)
+		cow = 1;
 
-		cond_resched();
-		while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
-			ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
-			if (ret & VM_FAULT_ERROR) {
-				if (ret & VM_FAULT_OOM)
-					ret = -ENOMEM;
-				else if (ret & VM_FAULT_SIGBUS)
-					ret = -EFAULT;
-				else
-					BUG();
-				break;
-			}
-			cond_resched();
-			ret = 0;
-		}
+	while (addr < vma->vm_end) {
+		struct page *page;
 
+		cond_resched();
+		page = follow_page(vma, addr, FOLL_TOUCH);
 		if (IS_ERR(page))
 			ret = PTR_ERR(page);
 
 		if (ret < 0)
 			break;
 
-		if (page == ZERO_PAGE(0)) {
-			page = NULL;	/* zero page: ignore */
-		} else if (cow && page_mapping(page) != NULL) {
-			page = NULL;	/* clean cow: ignore */
-		} else {
-			get_page(page);
-			*(addrp++) = addr;
-			*(pagep++) = page;
-			if (--nr == 0) {
-				addr += PAGE_SIZE;
-				break;
-			}
-		}
-
+		if (cow && (page_mapping(page) != NULL))
+			page = NULL;
+		cr_add_to_pgarr(pgarr, page, addr);
 		addr += PAGE_SIZE;
+		if (pgarr_is_full(pgarr))
+			goto reload;
 	}
 
-	if (unlikely(ret < 0)) {
-		nr = pgarr->nr_free - nr;
-		while (nr--)
-			page_cache_release(*(--pagep));
-		return ret;
-	}
-
-	*start = addr;
-	return pgarr->nr_free - nr;
-}
-
-/**
- * cr_vma_scan_pages - scan vma for pages that will need to be dumped
- * @ctx - checkpoint context
- * @vma - vma to scan
- *
- * lists of page pointes and corresponding virtual addresses are tracked
- * inside ctx->pgarr page-array chain
- */
-static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
-{
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	struct cr_pgarr *pgarr;
-	int nr, total = 0;
-
-	while (addr < end) {
-		pgarr = cr_pgarr_prep(ctx);
-		if (!pgarr)
-			return -ENOMEM;
-		nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr);
-		if (nr < 0)
-			return nr;
-		pgarr->nr_free -= nr;
-		pgarr->nr_used += nr;
-		total += nr;
-	}
-
-	cr_debug("total %d\n", total);
-	return total;
+	return ret;
 }
 
 static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
@@ -300,7 +264,7 @@ static int cr_vma_dump_pages(struct cr_c
 	if (!total)
 		return 0;
 
-	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		ret = cr_kwrite(ctx, pgarr->vaddrs,
 				pgarr->nr_used * sizeof(*pgarr->vaddrs));
 		if (ret < 0)
@@ -311,7 +275,7 @@ static int cr_vma_dump_pages(struct cr_c
 	if (!buf)
 		return -ENOMEM;
 
-	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		for (i = 0; i < pgarr->nr_used; i++) {
 			ret = cr_page_write(ctx, pgarr->pages[i], buf);
 			if (ret < 0)
@@ -366,7 +330,7 @@ static int cr_write_vma(struct cr_ctx *c
 	/* (1) scan: scan through the PTEs of the vma to count the pages
 	 * to dump (and later make those pages COW), and keep the list of
 	 * pages (and a reference to each page) on the checkpoint ctx */
-	nr = cr_vma_scan_pages(ctx, vma);
+	nr = cr_private_vma(ctx, vma);
 	if (nr < 0)
 		return nr;
 
@@ -387,7 +351,7 @@ static int cr_write_vma(struct cr_ctx *c
 	ret = cr_vma_dump_pages(ctx, nr);
 
 	/* (3) free: release the extra references to the pages in the list */
-	cr_pgarr_reset(ctx);
+	cr_reset_all_pgarrs(ctx);
 
 	return ret;
 }
diff -puN checkpoint/ckpt_mem.h~p4-dave checkpoint/ckpt_mem.h
--- linux-2.6.git/checkpoint/ckpt_mem.h~p4-dave	2008-09-10 13:58:55.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/ckpt_mem.h	2008-09-10 13:58:55.000000000 -0700
@@ -23,13 +23,10 @@ struct cr_pgarr {
 	unsigned long *vaddrs;
 	struct page **pages;
 	unsigned int nr_used;	/* how many entries already used */
-	unsigned int nr_free;	/* how many entries still free */
 	struct list_head list;
 };
 
-void cr_pgarr_reset(struct cr_ctx *ctx);
 void cr_pgarr_free(struct cr_ctx *ctx);
-struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx);
 struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
 
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff -puN include/linux/ckpt.h~p4-dave include/linux/ckpt.h
--- linux-2.6.git/include/linux/ckpt.h~p4-dave	2008-09-10 13:58:55.000000000 -0700
+++ linux-2.6.git-dave/include/linux/ckpt.h	2008-09-10 13:58:55.000000000 -0700
@@ -28,8 +28,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct list_head pgarr;	/* page array for dumping VMA contents */
-	struct cr_pgarr *pgcur;	/* current position in page array */
+	struct list_head pgarr_list;	/* page array for dumping VMA contents */
+	unsigned long nr_full_pgarrs;
 
 	struct path *vfsroot;	/* container root (FIXME) */
 };
_



-- Dave



* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
                     ` (2 preceding siblings ...)
  2008-09-10 16:55   ` Dave Hansen
@ 2008-09-10 21:38   ` Dave Hansen
  2008-09-12 16:57   ` Dave Hansen
  4 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 21:38 UTC (permalink / raw)
  To: Oren Laadan; +Cc: arnd, jeremy, linux-kernel, containers

On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
> array chain
> + */
> +static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +       unsigned long addr = vma->vm_start;
> +       unsigned long end = vma->vm_end;
> +       struct cr_pgarr *pgarr;
> +       int nr, total = 0;
> +
> +       while (addr < end) {
> +               pgarr = cr_pgarr_prep(ctx);
> +               if (!pgarr)
> +                       return -ENOMEM;
> +               nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr);
> +               if (nr < 0)
> +                       return nr;
> +               pgarr->nr_free -= nr;
> +               pgarr->nr_used += nr;
> +               total += nr;
> +       }
> +
> +       cr_debug("total %d\n", total);
> +       return total;
> +}

This confuses me.  If cr_vma_fill_pgarr() runs into an error, it attempts
to free up the pgarr references from the current pgarr that was just
filled.  But, that could only be a portion of a large VMA.  If it can't
free up the entire VMA worth of references (at least), why does it even
try to free a portion?  Why not just return since the upper levels need
to clean up the other portions anyway?

Also, is it really necessary to track the total amount filled in here?
It kinda gums up the code.
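
Roughly what I have in mind, as a userspace model rather than kernel
code.  Everything here is hypothetical: scan_pages() stands in for the
fill function, reset_all() for the chain reset, and the fail_at
parameter only exists to inject an error.  The callee simply returns on
error and the caller has one cleanup point that handles both the
success and the failure path.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8	/* hypothetical capacity for the model */

struct ctx {
	int refs[NPAGES];	/* 1 = reference held on that page */
	size_t nr_held;
};

/* take references page by page; on error just return, no partial undo */
static int scan_pages(struct ctx *ctx, size_t npages, size_t fail_at)
{
	size_t i;

	assert(npages <= NPAGES);
	for (i = 0; i < npages; i++) {
		if (i == fail_at)
			return -1;
		ctx->refs[ctx->nr_held++] = 1;
	}
	return 0;
}

/* the single cleanup point: drop every reference that was taken */
static void reset_all(struct ctx *ctx)
{
	while (ctx->nr_held)
		ctx->refs[--ctx->nr_held] = 0;
}

static int dump_vma(struct ctx *ctx, size_t npages, size_t fail_at)
{
	int ret = scan_pages(ctx, npages, fail_at);

	/* the actual dump would go here when ret == 0 */
	reset_all(ctx);
	return ret;
}
```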

-- Dave



* Re: [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-10 18:36     ` Oren Laadan
@ 2008-09-10 22:54       ` MinChan Kim
  2008-09-11  6:44         ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: MinChan Kim @ 2008-09-10 22:54 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

Hi, Oren.

On Thu, Sep 11, 2008 at 3:36 AM, Oren Laadan <orenl@cs.columbia.edu> wrote:
>
>
> MinChan Kim wrote:
>> On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
>
> [...]
>
>>> +struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>>> +{
>>> +       struct cr_ctx *ctx;
>>> +
>>> +       ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>>> +       if (!ctx)
>>> +               return ERR_PTR(-ENOMEM);
>>> +
>>> +       ctx->file = fget(fd);
>>> +       if (!ctx->file) {
>>> +               cr_ctx_free(ctx);
>>> +               return ERR_PTR(-EBADF);
>>> +       }
>>> +       get_file(ctx->file);
>>
>> Why do you need get_file?
>> You already called fget.
>> Am I missing something ?
>
> This was meant for when we will restart multiple processes, each would
> have access to the checkpoint-context, such that the checkpoint-context
> may outlive the task that created it and initiated the restart. Thus
> the file-pointer will need to stay around longer than that task.

OK. Thanks for your explanation.
You should have added that explanation as a comment.

> Of course, restart of multiple processes _can_ be coded such that this
> first task will always terminate last - either after restart completes
> successfully, or after all the other tasks aborted and won't use the
> checkpoint-context anymore.
>
> Because that code is not part of this patch-set, I considered it
> safer to grab a reference of the file pointer, making it less likely
> that we forget about it later.

What do you mean by that? Isn't it part of your code?
When the last checkpoint-context ends, who frees the file?
I mean, how are the calls paired (fget/fput and get_file)?
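
To make the pairing question concrete, here is a tiny userspace model
of the reference count.  It is not the kernel API: struct file_model
and the model_*() helpers are invented for illustration, and it assumes
the free path does a single fput().  It only shows that fget() and
get_file() each take a reference, so each needs its own fput().

```c
#include <assert.h>

/* toy model of a file reference count; not the kernel's struct file */
struct file_model {
	int count;	/* outstanding references */
	int released;	/* set once the last reference is dropped */
};

/* like fget(): looking the file up takes a reference */
static struct file_model *model_fget(struct file_model *f)
{
	f->count++;
	return f;
}

/* like get_file(): takes one more reference */
static void model_get_file(struct file_model *f)
{
	f->count++;
}

/* like fput(): drops one reference, releases the file on the last one */
static void model_fput(struct file_model *f)
{
	assert(f->count > 0);
	if (--f->count == 0)
		f->released = 1;
}
```

Starting from the one reference held by the fd table, fget() plus
get_file() brings the count to three.  If the free path drops only one
and the fd is closed, one reference is still outstanding and the file
is never released until someone does the second fput().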

>>
>>> +       ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
>>> +       if (!ctx->hbuf) {
>>> +               cr_ctx_free(ctx);
>>> +               return ERR_PTR(-ENOMEM);
>>> +       }
>>> +
>>> +       ctx->pid = pid;
>>> +       ctx->flags = flags;
>>> +
>>> +       ctx->crid = atomic_inc_return(&cr_ctx_count);
>>> +
>>> +       return ctx;
>>> +}
>
> Oren.
>
>



-- 
Kind regards,
MinChan Kim


* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-10  7:51   ` MinChan Kim
@ 2008-09-10 23:49     ` MinChan Kim
  0 siblings, 0 replies; 43+ messages in thread
From: MinChan Kim @ 2008-09-10 23:49 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

One more thing.

On Wed, Sep 10, 2008 at 4:51 PM, MinChan Kim <minchan.kim@gmail.com> wrote:
> On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
>> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
>> it will be followed by the file name.  The cr_vma->npages will tell
>> how many pages were dumped for this VMA.  Then it will be followed
>> by the actual data: first a dump of the addresses of all dumped
>> pages (npages entries) followed by a dump of the contents of all
>> dumped pages (npages pages). Then will come the next VMA and so on.
>>
>> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
>> ---
>>  arch/x86/mm/checkpoint.c   |   30 +++
>>  arch/x86/mm/restart.c      |    1 +
>>  checkpoint/Makefile        |    3 +-
>>  checkpoint/checkpoint.c    |   53 ++++++
>>  checkpoint/ckpt_arch.h     |    1 +
>>  checkpoint/ckpt_mem.c      |  448 ++++++++++++++++++++++++++++++++++++++++++++
>>  checkpoint/ckpt_mem.h      |   35 ++++
>>  checkpoint/sys.c           |   23 ++-
>>  include/asm-x86/ckpt_hdr.h |    5 +
>>  include/linux/ckpt.h       |   12 ++
>>  include/linux/ckpt_hdr.h   |   30 +++
>>  11 files changed, 635 insertions(+), 6 deletions(-)
>>  create mode 100644 checkpoint/ckpt_mem.c
>>  create mode 100644 checkpoint/ckpt_mem.h
>>
>> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
>> index 71d21e6..50cfd29 100644
>> --- a/arch/x86/mm/checkpoint.c
>> +++ b/arch/x86/mm/checkpoint.c
>> @@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
>>        cr_hbuf_put(ctx, sizeof(*hh));
>>        return ret;
>>  }
>> +
>> +/* dump the mm->context state */
>> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
>> +{
>> +       struct cr_hdr h;
>> +       struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +       int ret;
>> +
>> +       h.type = CR_HDR_MM_CONTEXT;
>> +       h.len = sizeof(*hh);
>> +       h.parent = parent;
>> +
>> +       mutex_lock(&mm->context.lock);
>> +
>> +       hh->ldt_entry_size = LDT_ENTRY_SIZE;
>> +       hh->nldt = mm->context.size;
>> +
>> +       cr_debug("nldt %d\n", hh->nldt);
>> +
>> +       ret = cr_write_obj(ctx, &h, hh);
>> +       cr_hbuf_put(ctx, sizeof(*hh));
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
>> +
>> +       mutex_unlock(&mm->context.lock);
>> +
>> +       return ret;
>> +}
>> diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
>> index 883a163..d7fb89a 100644
>> --- a/arch/x86/mm/restart.c
>> +++ b/arch/x86/mm/restart.c
>> @@ -8,6 +8,7 @@
>>  *  distribution for more details.
>>  */
>>
>> +#include <linux/unistd.h>
>>  #include <asm/desc.h>
>>  #include <asm/i387.h>
>>
>> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
>> index d2df68c..3a0df6d 100644
>> --- a/checkpoint/Makefile
>> +++ b/checkpoint/Makefile
>> @@ -2,4 +2,5 @@
>>  # Makefile for linux checkpoint/restart.
>>  #
>>
>> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
>> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
>> +               ckpt_mem.o
>> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
>> index d34a691..4dae775 100644
>> --- a/checkpoint/checkpoint.c
>> +++ b/checkpoint/checkpoint.c
>> @@ -55,6 +55,55 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
>>        return cr_write_obj(ctx, &h, str);
>>  }
>>
>> +/**
>> + * cr_fill_fname - return pathname of a given file
>> + * @path: path name
>> + * @root: relative root
>> + * @buf: buffer for pathname
>> + * @n: buffer length (in) and pathname length (out)
>> + */
>> +static char *
>> +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
>> +{
>> +       char *fname;
>> +
>> +       BUG_ON(!buf);
>> +       fname = __d_path(path, root, buf, *n);
>> +       if (!IS_ERR(fname))
>> +               *n = (buf + (*n) - fname);
>> +       return fname;
>> +}
>> +
>> +/**
>> + * cr_write_fname - write a file name
>> + * @ctx: checkpoint context
>> + * @path: path name
>> + * @root: relative root
>> + */
>> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
>> +{
>> +       struct cr_hdr h;
>> +       char *buf, *fname;
>> +       int ret, flen;
>> +
>> +       flen = PATH_MAX;
>> +       buf = kmalloc(flen, GFP_KERNEL);
>> +       if (!buf)
>> +               return -ENOMEM;
>> +
>> +       fname = cr_fill_fname(path, root, buf, &flen);
>> +       if (!IS_ERR(fname)) {
>> +               h.type = CR_HDR_FNAME;
>> +               h.len = flen;
>> +               h.parent = 0;
>> +               ret = cr_write_obj(ctx, &h, fname);
>> +       } else
>> +               ret = PTR_ERR(fname);
>> +
>> +       kfree(buf);
>> +       return ret;
>> +}
>> +
>>  /* write the checkpoint header */
>>  static int cr_write_head(struct cr_ctx *ctx)
>>  {
>> @@ -164,6 +213,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>>        cr_debug("task_struct: ret %d\n", ret);
>>        if (ret < 0)
>>                goto out;
>> +       ret = cr_write_mm(ctx, t);
>> +       cr_debug("memory: ret %d\n", ret);
>> +       if (ret < 0)
>> +               goto out;
>>        ret = cr_write_thread(ctx, t);
>>        cr_debug("thread: ret %d\n", ret);
>>        if (ret < 0)
>> diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
>> index 5bd4703..9bd0ba4 100644
>> --- a/checkpoint/ckpt_arch.h
>> +++ b/checkpoint/ckpt_arch.h
>> @@ -2,6 +2,7 @@
>>
>>  int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>>  int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
>> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
>>
>>  int cr_read_thread(struct cr_ctx *ctx);
>>  int cr_read_cpu(struct cr_ctx *ctx);
>> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
>> new file mode 100644
>> index 0000000..2c93447
>> --- /dev/null
>> +++ b/checkpoint/ckpt_mem.c
>> @@ -0,0 +1,448 @@
>> +/*
>> + *  Checkpoint memory contents
>> + *
>> + *  Copyright (C) 2008 Oren Laadan
>> + *
>> + *  This file is subject to the terms and conditions of the GNU General Public
>> + *  License.  See the file COPYING in the main directory of the Linux
>> + *  distribution for more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/file.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/mm_types.h>
>> +#include <linux/ckpt.h>
>> +#include <linux/ckpt_hdr.h>
>> +
>> +#include "ckpt_arch.h"
>> +#include "ckpt_mem.h"
>> +
>> +/*
>> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
>> + * (common to ckpt_mem.c and rstr_mem.c).
>> + *
>> + * The checkpoint context structure has two members for page-arrays:
>> + *   ctx->pgarr: list head of the page-array chain
>> + *   ctx->pgcur: tracks the "current" position in the chain
>> + *
>> + * During checkpoint (and restart) the chain tracks the dirty pages (page
>> + * pointer and virtual address) of each MM. For a particular MM, these are
>> + * always added to the "current" page-array (ctx->pgcur). The "current"
>> + * page-array advances as necessary, and new page-array descriptors are
>> + * allocated on-demand. Before the next MM, the chain is reset but not
>> + * freed (that is, dereference page pointers and reset ctx->pgcur).
>> + */
>> +
>> +#define CR_PGARR_ORDER  0
>> +#define CR_PGARR_TOTAL  ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
>> +
>> +/* release pages referenced by a page-array */
>> +void cr_pgarr_unref_pages(struct cr_pgarr *pgarr)
>> +{
>> +       int n;
>> +
>> +       /* only checkpoint keeps references to pages */
>> +       if (pgarr->pages) {
>> +               cr_debug("nr_used %d\n", pgarr->nr_used);
>> +               for (n = pgarr->nr_used; n--; )
>> +                       page_cache_release(pgarr->pages[n]);
>> +       }
>> +}
>> +
>> +/* free a single page-array object */
>> +static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
>> +{
>> +       cr_pgarr_unref_pages(pgarr);
>> +       if (pgarr->pages)
>> +               free_pages((unsigned long) pgarr->pages, CR_PGARR_ORDER);
>> +       if (pgarr->vaddrs)
>> +               free_pages((unsigned long) pgarr->vaddrs, CR_PGARR_ORDER);
>> +       kfree(pgarr);
>> +}
>> +
>> +/* free a chain of page-arrays */
>> +void cr_pgarr_free(struct cr_ctx *ctx)
>> +{
>> +       struct cr_pgarr *pgarr, *tmp;
>> +
>> +       list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr, list) {
>> +               list_del(&pgarr->list);
>> +               cr_pgarr_free_one(pgarr);
>> +       }
>> +       ctx->pgcur = NULL;
>> +}
>> +
>> +/* allocate a single page-array object */
>> +static struct cr_pgarr *cr_pgarr_alloc_one(void)
>> +{
>> +       struct cr_pgarr *pgarr;
>> +
>> +       pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
>> +       if (!pgarr)
>> +               return NULL;
>> +
>> +       pgarr->nr_free = CR_PGARR_TOTAL;
>> +       pgarr->nr_used = 0;
>> +
>> +       pgarr->pages = (struct page **)
>> +               __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
>> +       pgarr->vaddrs = (unsigned long *)
>> +               __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
>> +       if (!pgarr->pages || !pgarr->vaddrs) {
>> +               cr_pgarr_free_one(pgarr);
>> +               return NULL;
>> +       }
>> +
>> +       return pgarr;
>> +}
>> +
>> +/* cr_pgarr_alloc - return the next available pgarr in the page-array chain
>> + * @ctx: checkpoint context
>> + *
>> + * Return the page-array following ctx->pgcur, extending the chain if needed
>> + */
>> +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx)
>> +{
>> +       struct cr_pgarr *pgarr;
>> +
>> +       /* can reuse next element after ctx->pgcur ? */
>> +       pgarr = ctx->pgcur;
>> +       if (pgarr && !list_is_last(&pgarr->list, &ctx->pgarr)) {
>> +               pgarr = list_entry(pgarr->list.next, struct cr_pgarr, list);
>> +               goto out;
>> +       }
>> +
>> +       /* nope, need to extend the page-array chain */
>> +       pgarr = cr_pgarr_alloc_one();
>> +       if (!pgarr)
>> +               return NULL;
>> +
>> +       list_add_tail(&pgarr->list, &ctx->pgarr);
>> + out:
>> +       ctx->pgcur = pgarr;
>> +       return pgarr;
>> +
>> +}
>> +
>> +/* reset the page-array chain (dropping page references if necessary) */
>> +void cr_pgarr_reset(struct cr_ctx *ctx)
>> +{
>> +       struct cr_pgarr *pgarr;
>> +
>> +       list_for_each_entry(pgarr, &ctx->pgarr, list) {
>> +               cr_pgarr_unref_pages(pgarr);
>> +               pgarr->nr_free = CR_PGARR_TOTAL;
>> +               pgarr->nr_used = 0;
>> +       }
>> +       ctx->pgcur = NULL;
>> +}
>> +
>> +
>> +/* return current page-array (and allocate if needed) */
>> +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx
>> +)
>
> The closing parenthesis should be on the line above. :)
>> +{
>> +       struct cr_pgarr *pgarr = ctx->pgcur;
>> +
>> +       if (!pgarr->nr_free)

On the first call, ctx->pgcur is NULL,
so this dereference may cause an oops.

>> +               pgarr = cr_pgarr_alloc(ctx);
>> +       return pgarr;
>> +}
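
One possible way to guard cr_pgarr_prep() against that, sketched as a
userspace model.  The structs, pgarr_alloc() and pgarr_prep() below are
simplified stand-ins, not the functions from the patch, and
PGARR_TOTAL is an arbitrary small value.  The only point is the extra
NULL test before nr_free is read.

```c
#include <assert.h>
#include <stdlib.h>

#define PGARR_TOTAL 4	/* hypothetical capacity */

struct pgarr {
	unsigned int nr_free;
	struct pgarr *next;
};

struct ctx {
	struct pgarr *chain;	/* all allocated pgarrs */
	struct pgarr *pgcur;	/* current pgarr, NULL before first use */
};

/* stand-in for cr_pgarr_alloc(): extend the chain, update pgcur */
static struct pgarr *pgarr_alloc(struct ctx *ctx)
{
	struct pgarr *pgarr = calloc(1, sizeof(*pgarr));

	if (!pgarr)
		return NULL;
	pgarr->nr_free = PGARR_TOTAL;
	pgarr->next = ctx->chain;
	ctx->chain = pgarr;
	ctx->pgcur = pgarr;
	return pgarr;
}

/* return the current pgarr, allocating when there is none or it is full */
static struct pgarr *pgarr_prep(struct ctx *ctx)
{
	struct pgarr *pgarr;

	assert(ctx != NULL);
	pgarr = ctx->pgcur;
	if (!pgarr || !pgarr->nr_free)	/* NULL check comes first */
		pgarr = pgarr_alloc(ctx);
	return pgarr;
}
```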
>> +
>> +/*
>> + * Checkpoint is outside the context of the checkpointee, so one cannot
>> + * simply read pages from user-space. Instead, we scan the address space
>> + * of the target to cherry-pick pages of interest. Selected pages are
>> + * enlisted in a page-array chain (attached to the checkpoint context).
>> + * To save their contents, each page is mapped to kernel memory and then
>> + * dumped to the file descriptor.
>> + */
>> +
>> +/**
>> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
>> + * @ctx - checkpoint context
>> + * @pgarr - page-array to fill
>> + * @vma - vma to scan
>> + * @start - start address (updated)
>> + */
>> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
>> +                            struct vm_area_struct *vma, unsigned long *start)
>> +{
>> +       unsigned long end = vma->vm_end;
>> +       unsigned long addr = *start;
>> +       struct page **pagep;
>> +       unsigned long *addrp;
>> +       int cow, nr, ret = 0;
>> +
>> +       nr = pgarr->nr_free;
>> +       pagep = &pgarr->pages[pgarr->nr_used];
>> +       addrp = &pgarr->vaddrs[pgarr->nr_used];
>> +       cow = !!vma->vm_file;
>> +
>> +       while (addr < end) {
>> +               struct page *page;
>> +
>> +               /*
>> +                * simplified version of get_user_pages(): already have vma,
>> +                * only need FOLL_TOUCH, and (for now) ignore fault stats.
>> +                *
>> +                * FIXME: consolidate with get_user_pages()
>> +                */
>> +
>> +               cond_resched();
>> +               while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
>> +                       ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
>> +                       if (ret & VM_FAULT_ERROR) {
>> +                               if (ret & VM_FAULT_OOM)
>> +                                       ret = -ENOMEM;
>> +                               else if (ret & VM_FAULT_SIGBUS)
>> +                                       ret = -EFAULT;
>> +                               else
>> +                                       BUG();
>> +                               break;
>> +                       }
>> +                       cond_resched();
>> +                       ret = 0;
>> +               }
>> +
>> +               if (IS_ERR(page))
>> +                       ret = PTR_ERR(page);
>> +
>> +               if (ret < 0)
>> +                       break;
>> +
>> +               if (page == ZERO_PAGE(0)) {
>> +                       page = NULL;    /* zero page: ignore */
>> +               } else if (cow && page_mapping(page) != NULL) {
>> +                       page = NULL;    /* clean cow: ignore */
>> +               } else {
>> +                       get_page(page);
>> +                       *(addrp++) = addr;
>> +                       *(pagep++) = page;
>> +                       if (--nr == 0) {
>> +                               addr += PAGE_SIZE;
>> +                               break;
>> +                       }
>> +               }
>> +
>> +               addr += PAGE_SIZE;
>> +       }
>> +
>> +       if (unlikely(ret < 0)) {
>> +               nr = pgarr->nr_free - nr;
>> +               while (nr--)
>> +                       page_cache_release(*(--pagep));
>> +               return ret;
>> +       }
>> +
>> +       *start = addr;
>> +       return pgarr->nr_free - nr;
>> +}
>> +
>> +/**
>> + * cr_vma_scan_pages - scan vma for pages that will need to be dumped
>> + * @ctx - checkpoint context
>> + * @vma - vma to scan
>> + *
>> + * lists of page pointes and corresponding virtual addresses are tracked
>> + * inside ctx->pgarr page-array chain
>> + */
>> +static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
>> +{
>> +       unsigned long addr = vma->vm_start;
>> +       unsigned long end = vma->vm_end;
>> +       struct cr_pgarr *pgarr;
>> +       int nr, total = 0;
>> +
>> +       while (addr < end) {
>> +               pgarr = cr_pgarr_prep(ctx);
>> +               if (!pgarr)
>> +                       return -ENOMEM;
>> +               nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr);
>> +               if (nr < 0)
>> +                       return nr;
>> +               pgarr->nr_free -= nr;
>> +               pgarr->nr_used += nr;
>> +               total += nr;
>> +       }
>> +
>> +       cr_debug("total %d\n", total);
>> +       return total;
>> +}
>> +
>> +static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
>> +{
>> +       void *ptr;
>> +
>> +       ptr = kmap_atomic(page, KM_USER1);
>> +       memcpy(buf, ptr, PAGE_SIZE);
>> +       kunmap_atomic(page, KM_USER1);
>> +
>> +       return cr_kwrite(ctx, buf, PAGE_SIZE);
>> +}
>> +
>> +/**
>> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
>> + * @ctx - checkpoint context
>> + * @total - total number of pages
>> + *
>> + * First dump all virtual addresses, followed by the contents of all pages
>> + */
>> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
>> +{
>> +       struct cr_pgarr *pgarr;
>> +       char *buf;
>> +       int i, ret = 0;
>> +
>> +       if (!total)
>> +               return 0;
>> +
>> +       list_for_each_entry(pgarr, &ctx->pgarr, list) {
>> +               ret = cr_kwrite(ctx, pgarr->vaddrs,
>> +                               pgarr->nr_used * sizeof(*pgarr->vaddrs));
>> +               if (ret < 0)
>> +                       return ret;
>> +       }
>> +
>> +       buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> +       if (!buf)
>> +               return -ENOMEM;
>> +
>> +       list_for_each_entry(pgarr, &ctx->pgarr, list) {
>> +               for (i = 0; i < pgarr->nr_used; i++) {
>> +                       ret = cr_page_write(ctx, pgarr->pages[i], buf);
>> +                       if (ret < 0)
>> +                               goto out;
>> +               }
>> +       }
>> +
>> + out:
>> +       kfree(buf);
>> +       return ret;
>> +}
>> +
>> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
>> +{
>> +       struct cr_hdr h;
>> +       struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +       int vma_type, nr, ret;
>> +
>> +       h.type = CR_HDR_VMA;
>> +       h.len = sizeof(*hh);
>> +       h.parent = 0;
>> +
>> +       hh->vm_start = vma->vm_start;
>> +       hh->vm_end = vma->vm_end;
>> +       hh->vm_page_prot = vma->vm_page_prot.pgprot;
>> +       hh->vm_flags = vma->vm_flags;
>> +       hh->vm_pgoff = vma->vm_pgoff;
>> +
>> +       if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
>> +               pr_warning("CR: unsupported VMA %#lx\n", vma->vm_flags);
>> +               return -ETXTBSY;
>> +       }
>> +
>> +       /* by default assume anon memory */
>> +       vma_type = CR_VMA_ANON;
>> +
>> +       /* if there is a backing file, assume private-mapped */
>> +       /* (FIX: check if the file is unlinked) */
>> +       if (vma->vm_file)
>> +               vma_type = CR_VMA_FILE;
>> +
>> +       hh->vma_type = vma_type;
>> +
>> +       /*
>> +        * it seems redundant now, but we do it in 3 steps for because:
>> +        * first, the logic is simpler when we how many pages before
>> +        * dumping them; second, a future optimization will defer the
>> +        * writeout (dump, and free) to a later step; in which case all
>> +        * the pages to be dumped will be aggregated on the checkpoint ctx
>> +        */
>> +
>> +       /* (1) scan: scan through the PTEs of the vma to count the pages
>> +        * to dump (and later make those pages COW), and keep the list of
>> +        * pages (and a reference to each page) on the checkpoint ctx */
>> +       nr = cr_vma_scan_pages(ctx, vma);
>> +       if (nr < 0)
>> +               return nr;
>> +
>> +       hh->nr_pages = nr;
>> +       ret = cr_write_obj(ctx, &h, hh);
>> +       cr_hbuf_put(ctx, sizeof(*hh));
>> +       if (ret < 0)
>> +               return ret;
>> +       /* save the file name, if relevant */
>> +       if (vma->vm_file)
>> +               ret = cr_write_fname(ctx, &vma->vm_file->f_path, ctx->vfsroot);
>> +
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       /* (2) dump: write out the addresses of all pages in the list (on
>> +        * the checkpoint ctx) followed by the contents of all pages */
>> +       ret = cr_vma_dump_pages(ctx, nr);
>> +
>> +       /* (3) free: release the extra references to the pages in the list */
>> +       cr_pgarr_reset(ctx);
>> +
>> +       return ret;
>> +}
>> +
>> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
>> +{
>> +       struct cr_hdr h;
>> +       struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +       struct mm_struct *mm;
>> +       struct vm_area_struct *vma;
>> +       int objref, ret;
>> +
>> +       h.type = CR_HDR_MM;
>> +       h.len = sizeof(*hh);
>> +       h.parent = task_pid_vnr(t);
>> +
>> +       mm = get_task_mm(t);
>> +
>> +       objref = 0;     /* will be meaningful with multiple processes */
>> +       hh->objref = objref;
>> +
>> +       down_read(&mm->mmap_sem);
>> +
>> +       hh->start_code = mm->start_code;
>> +       hh->end_code = mm->end_code;
>> +       hh->start_data = mm->start_data;
>> +       hh->end_data = mm->end_data;
>> +       hh->start_brk = mm->start_brk;
>> +       hh->brk = mm->brk;
>> +       hh->start_stack = mm->start_stack;
>> +       hh->arg_start = mm->arg_start;
>> +       hh->arg_end = mm->arg_end;
>> +       hh->env_start = mm->env_start;
>> +       hh->env_end = mm->env_end;
>> +
>> +       hh->map_count = mm->map_count;
>> +
>> +       /* FIX: need also mm->flags */
>> +
>> +       ret = cr_write_obj(ctx, &h, hh);
>> +       cr_hbuf_put(ctx, sizeof(*hh));
>> +       if (ret < 0)
>> +               goto out;
>> +
>> +       /* write the vma's */
>> +       for (vma = mm->mmap; vma; vma = vma->vm_next) {
>> +               ret = cr_write_vma(ctx, vma);
>> +               if (ret < 0)
>> +                       goto out;
>> +       }
>> +
>> +       ret = cr_write_mm_context(ctx, mm, objref);
>> +
>> + out:
>> +       up_read(&mm->mmap_sem);
>> +       mmput(mm);
>> +       return ret;
>> +}
>> diff --git a/checkpoint/ckpt_mem.h b/checkpoint/ckpt_mem.h
>> new file mode 100644
>> index 0000000..8ee211d
>> --- /dev/null
>> +++ b/checkpoint/ckpt_mem.h
>> @@ -0,0 +1,35 @@
>> +#ifndef _CHECKPOINT_CKPT_MEM_H_
>> +#define _CHECKPOINT_CKPT_MEM_H_
>> +/*
>> + *  Generic container checkpoint-restart
>> + *
>> + *  Copyright (C) 2008 Oren Laadan
>> + *
>> + *  This file is subject to the terms and conditions of the GNU General Public
>> + *  License.  See the file COPYING in the main directory of the Linux
>> + *  distribution for more details.
>> + */
>> +
>> +#include <linux/mm_types.h>
>> +
>> +/*
>> + * page-array chains: each cr_pgarr describes a set of <strcut page *,vaddr>
>> + * tuples (where vaddr is the virtual address of a page in a particular mm).
>> + * Specifically, we use separate arrays so that all vaddrs can be written
>> + * and read at once.
>> + */
>> +
>> +struct cr_pgarr {
>> +       unsigned long *vaddrs;
>> +       struct page **pages;
>> +       unsigned int nr_used;   /* how many entries already used */
>> +       unsigned int nr_free;   /* how many entries still free */
>> +       struct list_head list;
>> +};
>> +
>> +void cr_pgarr_reset(struct cr_ctx *ctx);
>> +void cr_pgarr_free(struct cr_ctx *ctx);
>> +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx);
>> +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
>> +
>> +#endif /* _CHECKPOINT_CKPT_MEM_H_ */
>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>> index 113e0df..8141161 100644
>> --- a/checkpoint/sys.c
>> +++ b/checkpoint/sys.c
>> @@ -16,6 +16,8 @@
>>  #include <linux/capability.h>
>>  #include <linux/ckpt.h>
>>
>> +#include "ckpt_mem.h"
>> +
>>  /*
>>  * helpers to write/read to/from the image file descriptor
>>  *
>> @@ -110,7 +112,6 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count)
>>        return ret;
>>  }
>>
>> -
>>  /*
>>  * helpers to manage CR contexts: allocated for each checkpoint and/or
>>  * restart operation, and persists until the operation is completed.
>> @@ -126,6 +127,11 @@ void cr_ctx_free(struct cr_ctx *ctx)
>>
>>        free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
>>
>> +       if (ctx->vfsroot)
>> +               path_put(ctx->vfsroot);
>> +
>> +       cr_pgarr_free(ctx);
>> +
>>        kfree(ctx);
>>  }
>>
>> @@ -145,10 +151,13 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>>        get_file(ctx->file);
>>
>>        ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
>> -       if (!ctx->hbuf) {
>> -               cr_ctx_free(ctx);
>> -               return ERR_PTR(-ENOMEM);
>> -       }
>> +       if (!ctx->hbuf)
>> +               goto nomem;
>> +
>> +       /* assume checkpointer is in container's root vfs */
>> +       /* FIXME: this works for now, but will change with real containers */
>> +       ctx->vfsroot = &current->fs->root;
>> +       path_get(ctx->vfsroot);
>>
>>        ctx->pid = pid;
>>        ctx->flags = flags;
>> @@ -156,6 +165,10 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>>        ctx->crid = atomic_inc_return(&cr_ctx_count);
>>
>>        return ctx;
>> +
>> + nomem:
>> +       cr_ctx_free(ctx);
>> +       return ERR_PTR(-ENOMEM);
>>  }
>>
>>  /*
>> diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
>> index 44a903c..6bc61ac 100644
>> --- a/include/asm-x86/ckpt_hdr.h
>> +++ b/include/asm-x86/ckpt_hdr.h
>> @@ -69,4 +69,9 @@ struct cr_hdr_cpu {
>>
>>  } __attribute__((aligned(8)));
>>
>> +struct cr_hdr_mm_context {
>> +       __s16 ldt_entry_size;
>> +       __s16 nldt;
>> +} __attribute__((aligned(8)));
>> +
>>  #endif /* __ASM_X86_CKPT_HDR__H */
>> diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
>> index 91f4998..5c62a90 100644
>> --- a/include/linux/ckpt.h
>> +++ b/include/linux/ckpt.h
>> @@ -10,6 +10,9 @@
>>  *  distribution for more details.
>>  */
>>
>> +#include <linux/path.h>
>> +#include <linux/fs.h>
>> +
>>  #define CR_VERSION  1
>>
>>  struct cr_ctx {
>> @@ -24,6 +27,11 @@ struct cr_ctx {
>>
>>        void *hbuf;             /* temporary buffer for headers */
>>        int hpos;               /* position in headers buffer */
>> +
>> +       struct list_head pgarr; /* page array for dumping VMA contents */
>> +       struct cr_pgarr *pgcur; /* current position in page array */
>> +
>> +       struct path *vfsroot;   /* container root (FIXME) */
>>  };
>>
>>  /* cr_ctx: flags */
>> @@ -46,11 +54,15 @@ struct cr_hdr;
>>
>>  int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
>>  int cr_write_string(struct cr_ctx *ctx, char *str, int len);
>> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);
>>
>>  int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>>  int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
>>  int cr_read_string(struct cr_ctx *ctx, void *str, int len);
>>
>> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
>> +int cr_read_mm(struct cr_ctx *ctx);
>> +
>>  int do_checkpoint(struct cr_ctx *ctx);
>>  int do_restart(struct cr_ctx *ctx);
>>
>> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
>> index e66f322..ac77d7d 100644
>> --- a/include/linux/ckpt_hdr.h
>> +++ b/include/linux/ckpt_hdr.h
>> @@ -32,6 +32,7 @@ struct cr_hdr {
>>  enum {
>>        CR_HDR_HEAD = 1,
>>        CR_HDR_STRING,
>> +       CR_HDR_FNAME,
>>
>>        CR_HDR_TASK = 101,
>>        CR_HDR_THREAD,
>> @@ -82,4 +83,33 @@ struct cr_hdr_task {
>>        __s32 task_comm_len;
>>  } __attribute__((aligned(8)));
>>
>> +struct cr_hdr_mm {
>> +       __u32 objref;           /* identifier for shared objects */
>> +       __u32 map_count;
>> +
>> +       __u64 start_code, end_code, start_data, end_data;
>> +       __u64 start_brk, brk, start_stack;
>> +       __u64 arg_start, arg_end, env_start, env_end;
>> +
>> +} __attribute__((aligned(8)));
>> +
>> +/* vma subtypes */
>> +enum vm_type {
>> +       CR_VMA_ANON = 1,
>> +       CR_VMA_FILE
>> +};
>> +
>> +struct cr_hdr_vma {
>> +       __u32 vma_type;
>> +       __u32 _padding;
>> +       __s64 nr_pages;
>> +
>> +       __u64 vm_start;
>> +       __u64 vm_end;
>> +       __u64 vm_page_prot;
>> +       __u64 vm_flags;
>> +       __u64 vm_pgoff;
>> +
>> +} __attribute__((aligned(8)));
>> +
>>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */
>> --
>> 1.5.4.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>
>
>
> --
> Kinds regards,
> MinChan Kim
>



-- 
Kinds regards,
MinChan Kim


* Re: [RFC v4][PATCH 8/9] File descriprtors (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 8/9] File descriprtors (dump) Oren Laadan
  2008-09-09  8:06   ` Vegard Nossum
  2008-09-09  8:23   ` Vegard Nossum
@ 2008-09-11  5:02   ` MinChan Kim
  2008-09-11  6:37     ` Oren Laadan
  2 siblings, 1 reply; 43+ messages in thread
From: MinChan Kim @ 2008-09-11  5:02 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
> Dump the files_struct of a task with 'struct cr_hdr_files', followed by
> all open file descriptors. Since FDs can be shared, they are assigned an
> objref and registered in the object hash.
>
> For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its objref
> and its close-on-exec property. If the FD is to be saved (first time)
> then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
> Then will come the next FD and so on.
>
> This patch only handles basic FDs - regular files, directories and also
> symbolic links.
>
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>  checkpoint/Makefile      |    2 +-
>  checkpoint/checkpoint.c  |    4 +
>  checkpoint/ckpt_file.c   |  221 ++++++++++++++++++++++++++++++++++++++++++++++
>  checkpoint/ckpt_file.h   |   17 ++++
>  include/linux/ckpt.h     |    7 +-
>  include/linux/ckpt_hdr.h |   34 +++++++-
>  6 files changed, 280 insertions(+), 5 deletions(-)
>  create mode 100644 checkpoint/ckpt_file.c
>  create mode 100644 checkpoint/ckpt_file.h
>
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index 9843fb9..7496695 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -3,4 +3,4 @@
>  #
>
>  obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
> -               ckpt_mem.o rstr_mem.o
> +               ckpt_mem.o rstr_mem.o ckpt_file.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index 4dae775..aebbf22 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -217,6 +217,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>        cr_debug("memory: ret %d\n", ret);
>        if (ret < 0)
>                goto out;
> +       ret = cr_write_files(ctx, t);
> +       cr_debug("files: ret %d\n", ret);
> +       if (ret < 0)
> +               goto out;
>        ret = cr_write_thread(ctx, t);
>        cr_debug("thread: ret %d\n", ret);
>        if (ret < 0)
> diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> new file mode 100644
> index 0000000..ca58b28
> --- /dev/null
> +++ b/checkpoint/ckpt_file.c
> @@ -0,0 +1,221 @@
> +/*
> + *  Checkpoint file descriptors
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +#include <linux/fdtable.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +#include "ckpt_file.h"
> +
> +#define CR_DEFAULT_FDTABLE  256                /* an initial guess */
> +
> +/**
> + * cr_scan_fds - scan file table and construct array of open fds
> + * @files: files_struct pointer
> + * @fdtable: (output) array of open fds
> + * @return: the number of open fds found
> + *
> + * Allocates the file descriptors array (*fdtable), caller should free
> + */
> +int cr_scan_fds(struct files_struct *files, int **fdtable)
> +{
> +       struct fdtable *fdt;
> +       int *fdlist;
> +       int i, n, max;
> +
> +       n = 0;
> +       max = CR_DEFAULT_FDTABLE;

max looks like a read-only variable, so you don't need to declare a local
variable. You can use the macro directly.

> +       fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
> +       if (!fdlist)
> +               return -ENOMEM;
> +
> +       spin_lock(&files->file_lock);
> +       fdt = files_fdtable(files);
> +       for (i = 0; i < fdt->max_fds; i++) {
> +               if (!fcheck_files(files, i))
> +                       continue;
> +               if (n == max) {
> +                       /* fcheck_files() is safe with drop/re-acquire
> +                        * of the lock, as it tests:  fd < max_fds */
> +                       spin_unlock(&files->file_lock);
> +                       max *= 2;
> +                       if (max < 0) {  /* overflow ? */
> +                               n = -EMFILE;
> +                               goto out;
> +                       }
> +                       fdlist = krealloc(fdlist, max, GFP_KERNEL);
> +                       if (!fdlist) {
> +                               n = -ENOMEM;
> +                               goto out;
> +                       }
> +                       spin_lock(&files->file_lock);
> +               }
> +               fdlist[n++] = i;
> +       }
> +       spin_unlock(&files->file_lock);
> +
> +       *fdtable = fdlist;
> + out:
> +       return n;
> +}
> +
> +/* cr_write_fd_data - dump the state of a given file pointer */
> +static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct dentry *dent = file->f_dentry;
> +       struct inode *inode = dent->d_inode;
> +       enum fd_type fd_type;
> +       int ret;
> +
> +       h.type = CR_HDR_FD_DATA;
> +       h.len = sizeof(*hh);
> +       h.parent = parent;
> +
> +       hh->f_flags = file->f_flags;
> +       hh->f_mode = file->f_mode;
> +       hh->f_pos = file->f_pos;
> +       hh->f_uid = file->f_uid;
> +       hh->f_gid = file->f_gid;
> +       hh->f_version = file->f_version;
> +       /* FIX: need also file->f_owner */
> +
> +       switch (inode->i_mode & S_IFMT) {
> +       case S_IFREG:
> +               fd_type = CR_FD_FILE;
> +               break;
> +       case S_IFDIR:
> +               fd_type = CR_FD_DIR;
> +               break;
> +       case S_IFLNK:
> +               fd_type = CR_FD_LINK;
> +               break;
> +       default:
> +               return -EBADF;
> +       }
> +
> +       /* FIX: check if the file/dir/link is unlinked */
> +       hh->fd_type = fd_type;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;
> +
> +       return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
> +}
> +
> +/**
> + * cr_write_fd_ent - dump the state of a given file descriptor
> + * @ctx: checkpoint context
> + * @files: files_struct pointer
> + * @fd: file descriptor
> + *
> + * Save the state of the file descriptor; look up the actual file pointer
> + * in the hash table, and if found save the matching objref, otherwise call
> + * cr_write_fd_data to dump the file pointer too.
> + */
> +static int
> +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct file *file = NULL;
> +       struct fdtable *fdt;
> +       int coe, objref, new, ret;
> +
> +       rcu_read_lock();
> +       fdt = files_fdtable(files);
> +       file = fcheck_files(files, fd);
> +       if (file) {
> +               coe = FD_ISSET(fd, fdt->close_on_exec);
> +               get_file(file);
> +       }
> +       rcu_read_unlock();
> +
> +       /* sanity check (although this shouldn't happen) */
> +       if (!file)
> +               return -EBADF;
> +
> +       new = cr_obj_add_ptr(ctx, (void *) file, &objref, CR_OBJ_FILE, 0);
> +       cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
> +
> +       if (new < 0)
> +               return new;
> +
> +       h.type = CR_HDR_FD_ENT;
> +       h.len = sizeof(*hh);
> +       h.parent = 0;
> +
> +       hh->objref = objref;
> +       hh->fd = fd;
> +       hh->close_on_exec = coe;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               return ret;
> +
> +       /* new==1 if-and-only-if file was newly added to hash */
> +       if (new)
> +               ret = cr_write_fd_data(ctx, file, objref);
> +
> +       fput(file);
> +       return ret;
> +}
> +
> +int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +       struct cr_hdr h;
> +       struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +       struct files_struct *files;
> +       int *fdtable;
> +       int nfds, n, ret;
> +
> +       h.type = CR_HDR_FILES;
> +       h.len = sizeof(*hh);
> +       h.parent = task_pid_vnr(t);
> +
> +       files = get_files_struct(t);
> +
> +       hh->objref = 0; /* will be meaningful with multiple processes */
> +
> +       nfds = cr_scan_fds(files, &fdtable);
> +       if (nfds < 0) {
> +               ret = nfds;
> +               goto out;
> +       }
> +
> +       hh->nfds = nfds;
> +
> +       ret = cr_write_obj(ctx, &h, hh);
> +       cr_hbuf_put(ctx, sizeof(*hh));
> +       if (ret < 0)
> +               goto clean;
> +
> +       cr_debug("nfds %d\n", nfds);
> +       for (n = 0; n < nfds; n++) {
> +               ret = cr_write_fd_ent(ctx, files, n);

I think you meant to pass 'fdtable[n]' as the argument here, not 'n'.

> +               if (ret < 0)
> +                       break;
> +       }
> +
> + clean:
> +       kfree(fdtable);
> + out:
> +       put_files_struct(files);
> +
> +       return ret;
> +}
> diff --git a/checkpoint/ckpt_file.h b/checkpoint/ckpt_file.h
> new file mode 100644
> index 0000000..9dc3eba
> --- /dev/null
> +++ b/checkpoint/ckpt_file.h
> @@ -0,0 +1,17 @@
> +#ifndef _CHECKPOINT_CKPT_FILE_H_
> +#define _CHECKPOINT_CKPT_FILE_H_
> +/*
> + *  Checkpoint file descriptors
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/fdtable.h>
> +
> +int cr_scan_fds(struct files_struct *files, int **fdtable);
> +
> +#endif /* _CHECKPOINT_CKPT_FILE_H_ */
> diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
> index d73f79e..ad46baf 100644
> --- a/include/linux/ckpt.h
> +++ b/include/linux/ckpt.h
> @@ -13,7 +13,7 @@
>  #include <linux/path.h>
>  #include <linux/fs.h>
>
> -#define CR_VERSION  1
> +#define CR_VERSION  2
>
>  struct cr_ctx {
>        pid_t pid;              /* container identifier */
> @@ -80,11 +80,12 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len);
>  int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
>  struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode);
>
> +int do_checkpoint(struct cr_ctx *ctx);
>  int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
> -int cr_read_mm(struct cr_ctx *ctx);
> +int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
>
> -int do_checkpoint(struct cr_ctx *ctx);
>  int do_restart(struct cr_ctx *ctx);
> +int cr_read_mm(struct cr_ctx *ctx);
>
>  #define cr_debug(fmt, args...)  \
>        pr_debug("[CR:%s] " fmt, __func__, ## args)
> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> index f064cbb..f868dce 100644
> --- a/include/linux/ckpt_hdr.h
> +++ b/include/linux/ckpt_hdr.h
> @@ -17,7 +17,7 @@
>  /*
>  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
>  * keep data 64-bit aligned: use padding for structure members, and use
> - * __attribute__ ((aligned (8))) for the entire structure.
> + * __attribute__((aligned(8))) for the entire structure.
>  */
>
>  /* records: generic header */
> @@ -42,6 +42,10 @@ enum {
>        CR_HDR_VMA,
>        CR_HDR_MM_CONTEXT,
>
> +       CR_HDR_FILES = 301,
> +       CR_HDR_FD_ENT,
> +       CR_HDR_FD_DATA,
> +
>        CR_HDR_TAIL = 5001
>  };
>
> @@ -112,4 +116,32 @@ struct cr_hdr_vma {
>
>  } __attribute__((aligned(8)));
>
> +struct cr_hdr_files {
> +       __u32 objref;           /* identifier for shared objects */
> +       __u32 nfds;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_fd_ent {
> +       __u32 objref;           /* identifier for shared objects */
> +       __s32 fd;
> +       __u32 close_on_exec;
> +} __attribute__((aligned(8)));
> +
> +/* fd types */
> +enum  fd_type {
> +       CR_FD_FILE = 1,
> +       CR_FD_DIR,
> +       CR_FD_LINK
> +};
> +
> +struct cr_hdr_fd_data {
> +       __u16 fd_type;
> +       __u16 f_mode;
> +       __u32 f_flags;
> +       __u32 f_uid;
> +       __u32 f_gid;
> +       __u64 f_pos;
> +       __u64 f_version;
> +} __attribute__((aligned(8)));
> +
>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */
> --
> 1.5.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
Kinds regards,
MinChan Kim


* Re: [RFC v4][PATCH 8/9] File descriprtors (dump)
  2008-09-11  5:02   ` MinChan Kim
@ 2008-09-11  6:37     ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-11  6:37 UTC (permalink / raw)
  To: MinChan Kim; +Cc: dave, arnd, jeremy, linux-kernel, containers



MinChan Kim wrote:
> On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:

[...]

>> +#define CR_DEFAULT_FDTABLE  256                /* an initial guess */
>> +
>> +/**
>> + * cr_scan_fds - scan file table and construct array of open fds
>> + * @files: files_struct pointer
>> + * @fdtable: (output) array of open fds
>> + * @return: the number of open fds found
>> + *
>> + * Allocates the file descriptors array (*fdtable), caller should free
>> + */
>> +int cr_scan_fds(struct files_struct *files, int **fdtable)
>> +{
>> +       struct fdtable *fdt;
>> +       int *fdlist;
>> +       int i, n, max;
>> +
>> +       n = 0;
>> +       max = CR_DEFAULT_FDTABLE;
> 
> max looks like a read-only variable, so you don't need to declare a local
> variable. You can use the macro directly.

It's actually modified below: it tracks the array size in case of krealloc().
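
A minimal user-space sketch of that growth pattern, with realloc() standing
in for krealloc() and a byte map standing in for the fd table. The names
(scan_open, DEFAULT_FDTABLE) are hypothetical. Two details are worth making
explicit: the size passed to the allocator is in bytes, i.e.
max * sizeof(*list), and the old pointer is kept until the reallocation
succeeds so that it can still be freed on failure.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define DEFAULT_FDTABLE 4	/* hypothetical small initial guess */

/*
 * Collect every index i in [0, nbits) for which is_open[i] is nonzero.
 * On success, returns the count and stores the array in *out (caller
 * frees). On failure, returns a negative errno and frees the array.
 */
static int scan_open(const unsigned char *is_open, int nbits, int **out)
{
	int max = DEFAULT_FDTABLE;	/* current capacity, in entries */
	int n = 0;
	int *list = malloc(max * sizeof(*list));

	if (!list)
		return -ENOMEM;

	for (int i = 0; i < nbits; i++) {
		if (!is_open[i])
			continue;
		if (n == max) {
			int *tmp;

			if (max > INT_MAX / 2) {	/* doubling would overflow */
				free(list);
				return -EMFILE;
			}
			max *= 2;
			tmp = realloc(list, max * sizeof(*list));
			if (!tmp) {
				free(list);	/* old block is still valid */
				return -ENOMEM;
			}
			list = tmp;
		}
		list[n++] = i;
	}

	*out = list;
	return n;
}
```

The sketch checks for overflow before doubling rather than testing for a
negative value afterwards, since signed overflow is undefined in C.
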

> 
>> +       fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
>> +       if (!fdlist)
>> +               return -ENOMEM;
>> +
>> +       spin_lock(&files->file_lock);
>> +       fdt = files_fdtable(files);
>> +       for (i = 0; i < fdt->max_fds; i++) {
>> +               if (!fcheck_files(files, i))
>> +                       continue;
>> +               if (n == max) {
>> +                       /* fcheck_files() is safe with drop/re-acquire
>> +                        * of the lock, as it tests:  fd < max_fds */
>> +                       spin_unlock(&files->file_lock);
>> +                       max *= 2;
>> +                       if (max < 0) {  /* overflow ? */
>> +                               n = -EMFILE;
>> +                               goto out;
>> +                       }
>> +                       fdlist = krealloc(fdlist, max, GFP_KERNEL);
>> +                       if (!fdlist) {
>> +                               n = -ENOMEM;
>> +                               goto out;
>> +                       }
>> +                       spin_lock(&files->file_lock);
>> +               }
>> +               fdlist[n++] = i;
>> +       }
>> +       spin_unlock(&files->file_lock);
>> +
>> +       *fdtable = fdlist;
>> + out:
>> +       return n;
>> +}

[...]

>> +int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
>> +{
>> +       struct cr_hdr h;
>> +       struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +       struct files_struct *files;
>> +       int *fdtable;
>> +       int nfds, n, ret;
>> +
>> +       h.type = CR_HDR_FILES;
>> +       h.len = sizeof(*hh);
>> +       h.parent = task_pid_vnr(t);
>> +
>> +       files = get_files_struct(t);
>> +
>> +       hh->objref = 0; /* will be meaningful with multiple processes */
>> +
>> +       nfds = cr_scan_fds(files, &fdtable);
>> +       if (nfds < 0) {
>> +               ret = nfds;
>> +               goto out;
>> +       }
>> +
>> +       hh->nfds = nfds;
>> +
>> +       ret = cr_write_obj(ctx, &h, hh);
>> +       cr_hbuf_put(ctx, sizeof(*hh));
>> +       if (ret < 0)
>> +               goto clean;
>> +
>> +       cr_debug("nfds %d\n", nfds);
>> +       for (n = 0; n < nfds; n++) {
>> +               ret = cr_write_fd_ent(ctx, files, n);
> 
> I think you meant to pass 'fdtable[n]' as the argument here, not 'n'.

Oops ... yes. Thanks.
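
A small user-space sketch of the corrected loop, assuming a sparse set of
open descriptors. write_fd_ent() here is a hypothetical stand-in that only
records which descriptor it was handed; it is not the patch's
cr_write_fd_ent().

```c
#include <assert.h>

#define MAX_VISITED 8	/* hypothetical bound for the recording buffer */

static int visited[MAX_VISITED];
static int nvisited;

/* Stand-in for the per-descriptor dump: remember the fd we were given. */
static int write_fd_ent(int fd)
{
	if (nvisited >= MAX_VISITED)
		return -1;
	visited[nvisited++] = fd;
	return 0;
}

/* Walk the scanned list: fdtable[n] is the descriptor, n only its index. */
static int write_all(const int *fdtable, int nfds)
{
	int n, ret = 0;

	for (n = 0; n < nfds; n++) {
		ret = write_fd_ent(fdtable[n]);
		if (ret < 0)
			break;
	}
	return ret;
}
```

With open descriptors {0, 1, 2, 5, 9}, passing n would visit 0..4 and try
to dump descriptors 3 and 4, which are not open, while missing 5 and 9.
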

[...]

Oren.




* Re: [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-10 22:54       ` MinChan Kim
@ 2008-09-11  6:44         ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-11  6:44 UTC (permalink / raw)
  To: MinChan Kim; +Cc: dave, arnd, jeremy, linux-kernel, containers



MinChan Kim wrote:
> Hi, Oren.
> 
> On Thu, Sep 11, 2008 at 3:36 AM, Oren Laadan <orenl@cs.columbia.edu> wrote:
>>
>> MinChan Kim wrote:
>>> On Tue, Sep 9, 2008 at 4:42 PM, Oren Laadan <orenl@cs.columbia.edu> wrote:
>> [...]
>>
>>>> +struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
>>>> +{
>>>> +       struct cr_ctx *ctx;
>>>> +
>>>> +       ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>>>> +       if (!ctx)
>>>> +               return ERR_PTR(-ENOMEM);
>>>> +
>>>> +       ctx->file = fget(fd);
>>>> +       if (!ctx->file) {
>>>> +               cr_ctx_free(ctx);
>>>> +               return ERR_PTR(-EBADF);
>>>> +       }
>>>> +       get_file(ctx->file);
>>> Why do you need get_file?
>>> You already called fget.
>>> Am I missing something ?
>> This was meant for when we restart multiple processes: each would
>> have access to the checkpoint-context, so the checkpoint-context
>> may outlive the task that created it and initiated the restart. Thus
>> the file pointer will need to stay around longer than that task.
> 
> OK. Thanks for your explanation.
> You should have added the explanation above as a code comment.
> 
>> Of course, restart of multiple processes _can_ be coded such that this
>> first task will always terminate last - either after restart completes
>> successfully, or after all the other tasks aborted and won't use the
>> checkpoint-context anymore.
>>
>> Because that code is not part of this patch-set, I considered it
>> safer to grab a reference of the file pointer, making it less likely
>> that we forget about it later.
> 
> What do you mean by that? Isn't it part of your code?
> When the last checkpoint-context ends, who frees the file?
> I mean, how do the calls pair up (fget/fput and get_file)?

I meant that future code would make that clear, but that creates more
confusion than it resolves.

Instead, I'll leave that to the future and for now just clean up the
code as you suggested.

Oren.
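
The pairing question raised above can be made concrete with a toy reference counter. This is only a userspace model under stated assumptions: `toy_fget()`, `toy_get_file()` and `toy_fput()` are hypothetical stand-ins that mimic the counting behaviour of fget(), get_file() and fput(), and `struct toy_ctx` is an invented placeholder for the checkpoint context. The sketch shows that a context which takes a single reference at setup and drops a single reference at teardown leaves the count balanced, while an extra get without a matching put leaves one reference behind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a refcounted file and the checkpoint context. */
struct toy_file {
	int refcount;
};

struct toy_ctx {
	struct toy_file *file;
};

/* Models fget(): look up the file and take one reference. */
static struct toy_file *toy_fget(struct toy_file *f)
{
	if (f)
		f->refcount++;
	return f;
}

/* Models get_file(): take one more reference on a file already held. */
static void toy_get_file(struct toy_file *f)
{
	f->refcount++;
}

/* Models fput(): drop one reference. */
static void toy_fput(struct toy_file *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Context setup with a single reference, taken via toy_fget(). */
static int toy_ctx_init(struct toy_ctx *ctx, struct toy_file *f)
{
	ctx->file = toy_fget(f);
	return ctx->file ? 0 : -1;
}

/* Teardown drops exactly the one reference that setup took. */
static void toy_ctx_free(struct toy_ctx *ctx)
{
	if (ctx->file)
		toy_fput(ctx->file);
	ctx->file = NULL;
}
```

If a later multi-process restart needs the file to outlive the initiating task, each additional holder would take its own reference and drop it when done, so every get still has exactly one put.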




* Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)
  2008-09-10 20:49       ` Dave Hansen
@ 2008-09-11  6:59         ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-11  6:59 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Wed, 2008-09-10 at 15:48 -0400, Oren Laadan wrote:
>> Dave Hansen wrote:
>>> On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote:
>>>> +/**
>>>> + * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
>>>> + * @ctx - restart context
>>>> + * @npages - number of pages
>>>> + */
>>>> +static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
>>>> +{
>>>> +	struct cr_pgarr *pgarr;
>>>> +	int nr, ret;
>>>> +
>>>> +	while (npages) {
>>>> +		pgarr = cr_pgarr_prep(ctx);
>>>> +		if (!pgarr)
>>>> +			return -ENOMEM;
>>>> +		nr = min(npages, (int) pgarr->nr_free);
>>>> +		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
>>>> +		if (ret < 0)
>>>> +			return ret;
>>>> +		pgarr->nr_free -= nr;
>>>> +		pgarr->nr_used += nr;
>>>> +		npages -= nr;
>>>> +	}
>>>> +	return 0;
>>>> +}
>>> cr_pgarr_prep() can return a partially full pgarr, right?  Won't the
>>> cr_kread() always start at the beginning of the pgarr->vaddrs[] array?
>>> Seems to me like it will clobber things from the last call.
>> Note that 'nr' is either equal to ->nr_free - in which case we consume
>> the entire 'pgarr' vaddr array such that the next call to cr_pgarr_prep()
>> will get a fresh one, or is smaller than ->nr_free - in which case that
>> is the last iteration of the loop anyhow, so it won't be clobbered.
>>
>> Also, after we return - our caller, cr_vma_read_pages(), resets the state
>> of the page-array chain by calling cr_pgarr_reset().
> 
> Man, that's awfully subtle for something which is so simple.
> 
> I think it is a waste of memory to have to hold *all* of the vaddrs in
> memory at once.  Is there a real requirement for that somehow?  The code
> would look a lot simpler and use less memory if it was done (for instance)
> using a single 'struct pgaddr' at a time.  There are an awful lot of HPC
> apps that have nearly all physical memory in the machine allocated and
> mapped into a single VMA.  This approach could be quite painful there.
> 
> I know it's being done this way because that's what the dump format
> looks like.  Would you consider changing the dump format to have blocks
> of pages and vaddrs together?  That should also parallelize a bit more
> naturally.

It's being done this way to allow for a future optimization that will aim
at reducing downtime of the application by buffering all the data that is
to be saved while the container is frozen, so that the write-back of the
buffer happens after the container resumes execution.

(It is this reasoning that dictates the dump format and the code, not the
other way around).

That said, the point about reducing memory footprint of checkpoint/restart
is valid as well. Moreover, it conflicts with the above in requiring small
buffering, if any.

To enable both modes of operation, I'll modify the dump format to allow
multiple blocks, each an address list followed by the page contents.

Oren.
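
A rough userspace sketch of such a block-wise layout follows. It is an illustration only, not the format the patches ended up with: the per-block count field, the zero-count terminator, `struct toy_stream` and every `toy_*` name and constant are assumptions made for this example. The point it shows is that restore can reuse one small address buffer per block instead of holding every address of a VMA in memory at once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE	16	/* tiny "pages" keep the example readable */
#define TOY_BLOCK_MAX	2	/* addresses held in memory at any one time */

/* Hypothetical in-memory stream standing in for the checkpoint image. */
struct toy_stream {
	unsigned char buf[512];
	size_t len;	/* bytes written so far */
	size_t pos;	/* read cursor */
};

static int toy_write(struct toy_stream *s, const void *p, size_t n)
{
	if (n > sizeof(s->buf) - s->len)
		return -1;
	memcpy(s->buf + s->len, p, n);
	s->len += n;
	return 0;
}

static int toy_read(struct toy_stream *s, void *p, size_t n)
{
	if (n > s->len - s->pos)
		return -1;
	memcpy(p, s->buf + s->pos, n);
	s->pos += n;
	return 0;
}

/* Dump: emit [count][addresses][page contents] per block, then a 0 count. */
static int toy_dump(struct toy_stream *s, const unsigned long *vaddrs,
		    const unsigned char *pages, size_t npages)
{
	unsigned long nr;

	while (npages) {
		nr = npages < TOY_BLOCK_MAX ? npages : TOY_BLOCK_MAX;
		if (toy_write(s, &nr, sizeof(nr)) ||
		    toy_write(s, vaddrs, nr * sizeof(*vaddrs)) ||
		    toy_write(s, pages, nr * TOY_PAGE_SIZE))
			return -1;
		vaddrs += nr;
		pages += nr * TOY_PAGE_SIZE;
		npages -= nr;
	}
	nr = 0;
	return toy_write(s, &nr, sizeof(nr));
}

/* Restore: one small address buffer is reused for every block. */
static long toy_restore(struct toy_stream *s, unsigned char *mem, size_t memsz)
{
	unsigned long vaddrs[TOY_BLOCK_MAX];
	unsigned long nr;
	long total = 0;

	if (memsz < TOY_PAGE_SIZE)
		return -1;
	for (;;) {
		if (toy_read(s, &nr, sizeof(nr)) || nr > TOY_BLOCK_MAX)
			return -1;
		if (nr == 0)
			return total;	/* terminator block */
		if (toy_read(s, vaddrs, nr * sizeof(*vaddrs)))
			return -1;
		for (unsigned long i = 0; i < nr; i++) {
			if (vaddrs[i] > memsz - TOY_PAGE_SIZE)
				return -1;
			if (toy_read(s, mem + vaddrs[i], TOY_PAGE_SIZE))
				return -1;
		}
		total += nr;
	}
}
```

A writer that wants to buffer everything while the container is frozen could simply use a large block size, while a memory-constrained one could use a small block size; the reader in this sketch handles both as long as the block size stays within its buffer.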



* Re: [RFC v4][PATCH 4/9] Memory management (dump)
  2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
                     ` (3 preceding siblings ...)
  2008-09-10 21:38   ` [RFC v4][PATCH " Dave Hansen
@ 2008-09-12 16:57   ` Dave Hansen
  4 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-12 16:57 UTC (permalink / raw)
  To: Oren Laadan; +Cc: arnd, jeremy, linux-kernel, containers

On Tue, 2008-09-09 at 03:42 -0400, Oren Laadan wrote: 
> +	if (unlikely(ret < 0)) {
> +		nr = pgarr->nr_free - nr;
> +		while (nr--)
> +			page_cache_release(*(--pagep));
> +		return ret;
> +	}

Oren, please take a good, hard look through all these patches (e.g.
with grep), find all of the (un)likely() calls, and toss them out.
They're unneeded.

-- Dave
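
For illustration, an error path of the same shape can be written with a plain condition and no branch hint. The sketch below is a userspace toy under stated assumptions: `struct toy_page`, `toy_page_release()` and `toy_grab_pages()` are hypothetical names, and the `fail_at` parameter exists only to simulate a failure part-way through. It shows the unwind releasing, in reverse order, exactly the references taken before the failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted page for a userspace model. */
struct toy_page {
	int refcount;
};

static void toy_page_release(struct toy_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Take a reference on each of 'want' pages from 'pool' and record them in
 * 'pagep'. A failure is simulated at index 'fail_at' (use -1 for none).
 * Returns the number of pages held, or -1 after undoing partial work.
 */
static int toy_grab_pages(struct toy_page *pool, struct toy_page **pagep,
			  int want, int fail_at)
{
	int nr;

	for (nr = 0; nr < want; nr++) {
		if (nr == fail_at) {
			/* plain test, no unlikely(): unwind in reverse order */
			while (nr--)
				toy_page_release(pagep[nr]);
			return -1;
		}
		pool[nr].refcount++;
		pagep[nr] = &pool[nr];
	}
	return nr;
}
```

The behaviour is identical with or without a branch hint; the hint only affects how the compiler lays out the code.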



end of thread, other threads:[~2008-09-12 16:57 UTC | newest]

Thread overview: 43+ messages
2008-09-09  7:42 [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
2008-09-10  6:10   ` MinChan Kim
2008-09-10 18:36     ` Oren Laadan
2008-09-10 22:54       ` MinChan Kim
2008-09-11  6:44         ` Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
2008-09-09  8:17   ` Ingo Molnar
2008-09-09 23:23     ` Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 4/9] Memory management (dump) Oren Laadan
2008-09-09  9:22   ` Vegard Nossum
2008-09-10  7:51   ` MinChan Kim
2008-09-10 23:49     ` MinChan Kim
2008-09-10 16:55   ` Dave Hansen
2008-09-10 17:45     ` Dave Hansen
2008-09-10 18:28     ` Oren Laadan
2008-09-10 21:03       ` Cleanups for [PATCH " Dave Hansen
2008-09-10 21:38   ` [RFC v4][PATCH " Dave Hansen
2008-09-12 16:57   ` Dave Hansen
2008-09-09  7:42 ` [RFC v4][PATCH 5/9] Memory managemnet (restore) Oren Laadan
2008-09-09 16:07   ` Serge E. Hallyn
2008-09-09 23:35     ` Oren Laadan
2008-09-10 15:00       ` Serge E. Hallyn
2008-09-10 19:31   ` Dave Hansen
2008-09-10 19:48     ` Oren Laadan
2008-09-10 20:49       ` Dave Hansen
2008-09-11  6:59         ` Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
2008-09-10  7:13   ` MinChan Kim
2008-09-09  7:42 ` [RFC v4][PATCH 7/9] Infrastructure for shared objects Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 8/9] File descriprtors (dump) Oren Laadan
2008-09-09  8:06   ` Vegard Nossum
2008-09-09  8:23   ` Vegard Nossum
2008-09-10  2:01     ` Oren Laadan
2008-09-11  5:02   ` MinChan Kim
2008-09-11  6:37     ` Oren Laadan
2008-09-09  7:42 ` [RFC v4][PATCH 9/9] File descriprtors (restore) Oren Laadan
2008-09-09 16:26   ` Dave Hansen
2008-09-10  1:49     ` Oren Laadan
2008-09-10 16:09       ` Dave Hansen
2008-09-10 18:55         ` Oren Laadan
2008-09-09 18:06 ` [RFC v4][PATCH 0/9] Kernel based checkpoint/restart` Dave Hansen
