linux-kernel.vger.kernel.org archive mirror
* [RFC v3][PATCH 0/9] Kernel based checkpoint/restart
@ 2008-09-04  7:57 Oren Laadan
  2008-09-04  8:02 ` [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
                   ` (8 more replies)
  0 siblings, 9 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  7:57 UTC
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers, Oren Laadan


These patches implement checkpoint-restart [CR v3]. This version
addresses review feedback and fixes bugs. The previous version (v2)
added save and restore of open-file state (regular files and
directories), which makes the patches more usable.

Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAP_SYS_ADMIN, file restore may fail)

Changelog:

[2008-Aug-29] v3:
   - Various fixes and clean-ups
   - Use standard hlist_... for hash table
   - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
   - Added Dump and restore of open files (regular and directories)
   - Added basic handling of shared objects, and improve handling of
     'parent tag' concept
   - Added documentation
   - Improved ABI, 64bit padding for image data
   - Improved locking when saving/restoring memory
   - Added UTS information to header (release, version, machine)
   - Cleanup extraction of filename from a file pointer
   - Refactor to allow easier reviewing
   - Remove requirement for CAP_SYS_ADMIN until we come up with a
     security policy (this means that file restore may fail)
   - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
   - Initial version: support a single task with address space of only
     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
     argument and act on current process.

--
(Dave Hansen's announcement)

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.
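
As a rough illustration of how such a conversion program might decide
whether an image needs converting at all, the sketch below unpacks a
kernel version code into major/minor/patch fields and compares two
versions. Only the 16/8/0-bit packing follows what the image header in
patch 2 records; the struct and function names are invented for this
example and are not part of the patches.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the version fields in the image header. */
struct img_version {
	uint16_t major;
	uint16_t minor;
	uint16_t patch;
	uint16_t rev;	/* checkpoint format revision */
};

/* Same packing as the kernel's KERNEL_VERSION(a, b, c) macro. */
static uint32_t version_code(unsigned a, unsigned b, unsigned c)
{
	return (a << 16) | (b << 8) | c;
}

/* Split a packed version code into its three 8-bit fields. */
static struct img_version unpack_version(uint32_t code, uint16_t rev)
{
	struct img_version v;

	v.major = (code >> 16) & 0xff;
	v.minor = (code >> 8) & 0xff;
	v.patch = code & 0xff;
	v.rev = rev;
	return v;
}

/* Returns 1 if the image matches the running kernel, 0 if it would
 * have to be converted first. */
static int image_matches(const struct img_version *img,
			 const struct img_version *running)
{
	return img->major == running->major &&
	       img->minor == running->minor &&
	       img->patch == running->patch &&
	       img->rev == running->rev;
}
```

A converter would presumably call image_matches() first and rewrite the
image only on a mismatch; how the rewrite itself works is outside the
scope of this sketch.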

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a single
task. The task's address space may consist of only private, simple
vma's - anonymous or file-mapped. The open files may consist of only
simple files and directories.

--
(Original announcement)

In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.

Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, and shows its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).
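
To make the image layout a bit more concrete, here is a small userspace
sketch of the record idea: every object is written as a fixed header
(type, payload length, parent tag) followed by its payload, and the
reader checks the type before accepting the payload. The header layout
mirrors struct cr_hdr from patch 2; everything else (the in-memory
"struct image", the function names) is made up for illustration and is
not the kernel code.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors struct cr_hdr from patch 2: every record starts with this. */
struct rec_hdr {
	int16_t type;
	int16_t len;		/* payload length in bytes */
	uint32_t parent;	/* tag of the enclosing object, 0 if none */
};

/* Hypothetical in-memory stand-in for the image file descriptor. */
struct image {
	unsigned char data[4096];
	size_t wpos;		/* next byte to write */
	size_t rpos;		/* next byte to read */
};

static int img_write(struct image *img, const void *buf, size_t n)
{
	if (n > sizeof(img->data) - img->wpos)
		return -1;
	memcpy(img->data + img->wpos, buf, n);
	img->wpos += n;
	return 0;
}

static int img_read(struct image *img, void *buf, size_t n)
{
	if (n > img->wpos - img->rpos)
		return -1;
	memcpy(buf, img->data + img->rpos, n);
	img->rpos += n;
	return 0;
}

/* Write one record: header first, then payload. */
static int write_obj(struct image *img, int16_t type, uint32_t parent,
		     const void *payload, int16_t len)
{
	struct rec_hdr h = { .type = type, .len = len, .parent = parent };

	if (len < 0 || img_write(img, &h, sizeof(h)) < 0)
		return -1;
	return img_write(img, payload, (size_t)len);
}

/*
 * Read one record of the expected type into buf (capacity n).
 * Returns the payload length, or -1 on any mismatch or short image.
 */
static int read_obj_type(struct image *img, int16_t type, void *buf, int n)
{
	struct rec_hdr h;

	if (img_read(img, &h, sizeof(h)) < 0)
		return -1;
	if (h.type != type || h.len < 0 || h.len > n)
		return -1;
	if (img_read(img, buf, (size_t)h.len) < 0)
		return -1;
	return h.len;
}
```

For example, writing a record of type 201 with parent 0 followed by a
record of type 202 with parent 1 mimics an mm record followed by one of
its vma records (201 and 202 are the values CR_HDR_MM and CR_HDR_VMA
take in patch 2; using 1 as the mm's tag is an assumption of this
example).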

The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space

In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current task, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to handle restart-block logic when saving/restoring the
thread data).

It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.

As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.

Below are two example programs: one uses checkpoint (called ckpt) and
one uses restart (called rstr). Note the use of "dup2" to create a 
copy of an open file and show how shared objects are treated. Execute
like this (as a superuser):

orenl:~/test$ ./ckpt > out.1
 				<-- ctrl-c
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
world, hello!
(ret = 1)

orenl:~/test$ ./ckpt > out.1
 				<-- ctrl-c
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
world, hello!
(ret = 2)

 				<-- now change the contents of the file
orenl:~/test$ sed -i 's/world, hello!/xxxx/' /tmp/cr-test.out
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
xxxx
(ret = 2)

 				<-- and do the restart
orenl:~/test$ ./rstr < out.1
 				<-- ctrl-c
orenl:~/test$ cat /tmp/cr-test.out
hello, world!
world, hello!
(ret = 0)

(if you check the output of ps, you'll see that "rstr" changed its
name to "ckpt", as expected).
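
A note on why duplicated descriptors make a useful test for shared
objects: the two descriptors refer to one open file description and so
share a single file offset, which a checkpoint that treated them as two
independent files would lose. The sketch below demonstrates the sharing
in plain userspace. It uses dup() rather than dup2() so the kernel
picks the descriptor number; the function name and the temporary file
template are invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for mkstemp() under strict -std modes */
#endif
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Write 5 bytes through one descriptor and report the file offset as
 * seen through its duplicate. Returns -1 on error.
 */
static off_t offset_seen_by_dup(void)
{
	char path[] = "/tmp/cr-dup-XXXXXX";
	int fd, copy;
	off_t pos = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* file goes away once both fds are closed */

	copy = dup(fd);		/* same open file description as fd */
	if (copy >= 0) {
		if (write(fd, "hello", 5) == 5)
			pos = lseek(copy, 0, SEEK_CUR);
		close(copy);
	}
	close(fd);
	return pos;
}
```

If the offset were private to each descriptor, the duplicate would
still report 0 after the write; because it is shared, it reports the
position just past the written bytes.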

Oren.


============================== ckpt.c ================================

#define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>

#define OUTFILE "/tmp/cr-test.out"

int main(int argc, char *argv[])
{
 	pid_t pid = getpid();
 	FILE *file;
 	int ret;

 	close(0);
 	close(2);

 	unlink(OUTFILE);
 	file = fopen(OUTFILE, "w+");
 	if (!file) {
 		perror("open");
 		exit(1);
 	}

 	if (dup2(0,2) < 0) {
 		perror("dups");
 		exit(1);
 	}

 	fprintf(file, "hello, world!\n");
 	fflush(file);

 	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
 	if (ret < 0) {
 		perror("checkpoint");
 		exit(2);
 	}

 	fprintf(file, "world, hello!\n");
 	fprintf(file, "(ret = %d)\n", ret);
 	fflush(file);

 	while (1)
 		;

 	return 0;
}

============================== rstr.c ================================

#define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
 	pid_t pid = getpid();
 	int ret;

 	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
 	if (ret < 0)
 		perror("restart");

 	printf("should not reach here !\n");

 	return 0;
}


* [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
@ 2008-09-04  8:02 ` Oren Laadan
  2008-09-04  8:37   ` Cedric Le Goater
  2008-09-04 14:42   ` Serge E. Hallyn
  2008-09-04  8:02 ` [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:02 UTC
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.
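
For readers who want to experiment from userspace, thin wrappers along
the following lines may be convenient. This is only a sketch: the
syscall numbers are the ones this patch assigns for 32-bit x86, and on
any kernel without the patch those numbers may belong to unrelated
system calls. For that reason the wrappers only issue the call when the
macro CR_PATCHED_KERNEL (a name made up here) is defined at build time,
and otherwise fail with ENOSYS. The wrapper names are likewise
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Numbers taken from this patch (x86-32 only); not valid elsewhere. */
#define CR_NR_CHECKPOINT 333
#define CR_NR_RESTART    334

/* pid identifies the target container, fd is the image file. */
static long cr_checkpoint(pid_t pid, int fd, unsigned long flags)
{
#ifdef CR_PATCHED_KERNEL
	return syscall(CR_NR_CHECKPOINT, pid, fd, flags);
#else
	(void)pid; (void)fd; (void)flags;
	errno = ENOSYS;		/* refuse to guess on an unpatched kernel */
	return -1;
#endif
}

/* crid identifies the checkpoint image, fd is the image file. */
static long cr_restart(int crid, int fd, unsigned long flags)
{
#ifdef CR_PATCHED_KERNEL
	return syscall(CR_NR_RESTART, crid, fd, flags);
#else
	(void)crid; (void)fd; (void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

Built without -DCR_PATCHED_KERNEL, both wrappers behave like the stubs
in this patch: they return -1 with errno set to ENOSYS.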

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  arch/x86/kernel/syscall_table_32.S |    2 ++
  checkpoint/Kconfig                 |   11 +++++++++++
  checkpoint/Makefile                |    5 +++++
  checkpoint/sys.c                   |   35 +++++++++++++++++++++++++++++++++++
  include/asm-x86/unistd_32.h        |    2 ++
  include/linux/syscalls.h           |    2 ++
  init/Kconfig                       |    2 ++
  kernel/sys_ni.c                    |    4 ++++
  8 files changed, 63 insertions(+), 0 deletions(-)
  create mode 100644 checkpoint/Kconfig
  create mode 100644 checkpoint/Makefile
  create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
  	.long sys_dup3			/* 330 */
  	.long sys_pipe2
  	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..a9f22ef
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool y
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..07d018b
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..b9018a4
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,35 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
  #define __NR_dup3		330
  #define __NR_pipe2		331
  #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334

  #ifdef __KERNEL__

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d6ff145..edc218b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
  asmlinkage long sys_eventfd(unsigned int count);
  asmlinkage long sys_eventfd2(unsigned int count, int flags);
  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);

  int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

diff --git a/init/Kconfig b/init/Kconfig
index c11da38..fd5f7bf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -779,6 +779,8 @@ config MARKERS

  source "arch/Kconfig"

+source "checkpoint/Kconfig"
+
  config PROC_PAGE_MONITOR
   	default y
  	depends on PROC_FS && MMU
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 08d6e1b..ca95c25 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
  cond_syscall(compat_sys_timerfd_gettime);
  cond_syscall(sys_eventfd);
  cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3



* [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
  2008-09-04  8:02 ` [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-09-04  8:02 ` Oren Laadan
  2008-09-04  9:12   ` Louis Rilling
  2008-09-04 16:03   ` Serge E. Hallyn
  2008-09-04  8:03 ` [RFC v3][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:02 UTC
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Implement the sys_checkpoint and sys_restart interfaces created in the
previous patch, and add the helpers needed to manage the image file
format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling
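
The core job of the data transfer wrappers is to move exactly the
requested number of bytes or fail. A userspace analogue of that loop,
using plain read(2)/write(2), might look like the sketch below. It is
not the kernel code: the function names are invented, and unlike the
helpers in this patch it treats a zero-byte transfer (for example
end-of-file in the middle of a record) as an error.

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly count bytes to fd. Returns 0 or a negative errno. */
static int write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;

	while (count > 0) {
		ssize_t n = write(fd, p, count);

		if (n < 0) {
			/* transient: retry (a real tool would poll on EAGAIN) */
			if (errno == EINTR || errno == EAGAIN)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EIO;	/* no progress: give up */
		p += n;
		count -= (size_t)n;
	}
	return 0;
}

/* Read exactly count bytes from fd; early end-of-file is an error. */
static int read_all(int fd, void *buf, size_t count)
{
	char *p = buf;

	while (count > 0) {
		ssize_t n = read(fd, p, count);

		if (n < 0) {
			if (errno == EINTR || errno == EAGAIN)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EIO;	/* image ended mid-record */
		p += n;
		count -= (size_t)n;
	}
	return 0;
}
```

Failing on a short image rather than returning success keeps a
truncated checkpoint file from being mistaken for a complete record.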

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
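
One piece of the per-checkpoint context worth a short illustration is
its scratch buffer for record headers. Because record writers can nest,
space is reserved and released in stack order: reserve, use, possibly
call a nested writer that reserves more, then release in reverse. The
userspace sketch below shows the idea; the names are invented, assert()
stands in for the kernel's BUG_ON(), and the 8192-byte size assumes two
4 KB pages.

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 8192		/* assumed: two 4 KB pages */

struct hbuf {
	unsigned char data[HBUF_TOTAL];
	size_t pos;		/* bytes currently reserved */
};

/* Reserve n bytes at the current top of the buffer. */
static void *hbuf_get(struct hbuf *b, size_t n)
{
	void *ptr;

	assert(n <= HBUF_TOTAL - b->pos);	/* overflow is a caller bug */
	ptr = b->data + b->pos;
	b->pos += n;
	return ptr;
}

/* Release the most recent n bytes; calls must mirror hbuf_get()
 * in reverse order. */
static void hbuf_put(struct hbuf *b, size_t n)
{
	assert(b->pos >= n);
	b->pos -= n;
}
```

This sketch does nothing about alignment of the returned pointer; a
caller that mixes differently sized reservations would need to round n
up itself.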

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  Makefile                 |    2 +-
  checkpoint/Makefile      |    2 +-
  checkpoint/checkpoint.c  |  188 ++++++++++++++++++++++++++++++++++++++
  checkpoint/restart.c     |  189 ++++++++++++++++++++++++++++++++++++++
  checkpoint/sys.c         |  226 +++++++++++++++++++++++++++++++++++++++++++++-
  include/linux/ckpt.h     |   65 +++++++++++++
  include/linux/ckpt_hdr.h |   82 +++++++++++++++++
  include/linux/magic.h    |    3 +
  8 files changed, 751 insertions(+), 6 deletions(-)
  create mode 100644 checkpoint/checkpoint.c
  create mode 100644 checkpoint/restart.c
  create mode 100644 include/linux/ckpt.h
  create mode 100644 include/linux/ckpt_hdr.h

diff --git a/Makefile b/Makefile
index f448e00..a558ad2 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd


  ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

  vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
  		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
  # Makefile for linux checkpoint/restart.
  #

-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..ad1099f
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->utime = t->utime;
+	hh->stime = t->stime;
+	hh->utimescaled = t->utimescaled;
+	hh->stimescaled = t->stimescaled;
+	hh->gtime = t->gtime;
+	hh->prev_utime = t->prev_utime;
+	hh->prev_stime = t->prev_stime;
+	hh->nvcsw = t->nvcsw;
+	hh->nivcsw = t->nivcsw;
+	hh->start_time_sec = t->start_time.tv_sec;
+	hh->start_time_nsec = t->start_time.tv_nsec;
+	hh->real_start_time_sec = t->real_start_time.tv_sec;
+	hh->real_start_time_nsec = t->real_start_time.tv_nsec;
+	hh->min_flt = t->min_flt;
+	hh->maj_flt = t->maj_flt;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret ;
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("CR: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx)
+{
+	int ret;
+
+	/* FIX: need to test whether container is checkpointable */
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..171cd2d
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,189 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @n: available buffer size
+ *
+ * @return: 0 on success, negative error code on failure
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > n)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @n: available buffer size
+ * @type: expected record type
+ *
+ * @return: object reference of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, n);
+	if (!ret) {
+		if (h.type == type)
+			ret = h.parent;
+		else
+			ret = -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string buffer length
+ */
+int cr_read_string(struct cr_ctx *ctx, void *str, int len)
+{
+	return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		return -EINVAL;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		return -EINVAL;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		return -EINVAL;
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	/* FIXME: for now, only restore t->comm */
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		return -EINVAL;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b9018a4..4268bae 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,197 @@

  #include <linux/sched.h>
  #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/ckpt.h>
+
+/*
+ * helpers to write/read to/from the image file descriptor
+ *
+ *   cr_uwrite() - write a user-space buffer to the checkpoint image
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_uread() - read from the checkpoint image to a user-space buffer
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ *
+ */
+
+/* (temporarily added file_pos_read() and file_pos_write() because they
+ * are static in fs/read_write.c... should cleanup and remove later) */
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+	struct file *file = ctx->file;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite <= 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		buf += nwrite;
+	}
+
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+	mm_segment_t oldfs;
+	int ret;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = cr_uwrite(ctx, buf, count);
+	set_fs(oldfs);
+
+	return ret;
+}
+
+int cr_uread(struct cr_ctx *ctx, void *buf, int count)
+{
+	struct file *file = ctx->file;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, (char __user *) buf, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN)
+				nread = 0;
+			else
+				return nread;
+		}
+		buf += nread;
+	}
+
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *buf, int count)
+{
+	mm_segment_t oldfs;
+	int ret;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = cr_uread(ctx, buf, count);
+	set_fs(oldfs);
+
+	return ret;
+}
+
+
+/*
+ * helpers to manage CR contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+/* unique checkpoint identifier (FIXME: should be per-container) */
+static atomic_t cr_ctx_count;
+
+void cr_ctx_free(struct cr_ctx *ctx)
+{
+
+	if (ctx->file)
+		fput(ctx->file);
+	if (ctx->vfsroot)
+		path_put(ctx->vfsroot);
+
+	free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
+
+	kfree(ctx);
+}
+
+struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->file = fget(fd);
+	if (!ctx->file) {
+		cr_ctx_free(ctx);
+		return ERR_PTR(-EBADF);
+	}
+	get_file(ctx->file);
+
+	ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
+	if (!ctx->hbuf) {
+		cr_ctx_free(ctx);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pid = pid;
+	ctx->flags = flags;
+
+	/* assume checkpointer is in container's root vfs */
+	/* FIXME: this works for now, but will change with real containers */
+	ctx->vfsroot = &current->fs->root;
+	path_get(ctx->vfsroot);
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	return ctx;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, one should call cr_hbuf_get() to
+ * reserve space in the buffer, and then cr_hbuf_put() when that space
+ * is no longer needed.
+ */
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * @return: pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}

  /**
   * sys_checkpoint - checkpoint a container
@@ -19,9 +210,23 @@
   */
  asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx);
+
+	cr_ctx_free(ctx);
+	return ret;
  }
+
  /**
   * sys_restart - restart a container
   * @crid: checkpoint image identifier
@@ -30,6 +235,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
   */
  asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
  {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_restart(ctx);
+
+	cr_ctx_free(ctx);
+	return ret;
  }
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
new file mode 100644
index 0000000..1bb2b09
--- /dev/null
+++ b/include/linux/ckpt.h
@@ -0,0 +1,65 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/path.h>
+#include <linux/fs.h>
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	pid_t pid;		/* container identifier */
+	int crid;		/* unique checkpoint id */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+
+	struct path *vfsroot;	/* container root */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+/* allocation defaults */
+#define CR_HBUF_ORDER  1
+#define CR_HBUF_TOTAL  (PAGE_SIZE << CR_HBUF_ORDER)
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+int cr_uread(struct cr_ctx *ctx, void *buf, int count);
+int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
+int cr_read_string(struct cr_ctx *ctx, void *str, int len);
+
+int do_checkpoint(struct cr_ctx *ctx);
+int do_restart(struct cr_ctx *ctx);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[CR:%s] " fmt, __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
new file mode 100644
index 0000000..629ad5a
--- /dev/null
+++ b/include/linux/ckpt_hdr.h
@@ -0,0 +1,82 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u64 state;
+	__u32 exit_state;
+	__u32 exit_code, exit_signal;
+
+	__u64 utime, stime, utimescaled, stimescaled;
+	__u64 gtime;
+	__u64 prev_utime, prev_stime;
+	__u64 nvcsw, nivcsw;
+	__u64 start_time_sec, start_time_nsec;
+	__u64 real_start_time_sec, real_start_time_nsec;
+	__u64 min_flt, maj_flt;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1fa0c2c..c2b811c 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -42,4 +42,7 @@
  #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
  #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA

+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
  #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3



* [RFC v3][PATCH 3/9] x86 support for checkpoint/restart
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
  2008-09-04  8:02 ` [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
  2008-09-04  8:02 ` [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
@ 2008-09-04  8:03 ` Oren Laadan
  2008-09-04  8:03 ` [RFC v3][PATCH 4/9] Memory management (dump) Oren Laadan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:03 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.
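
As an illustration of the thread-specific part, the checkpoint side
counts how many TLS descriptors are in use before dumping the array.
The following is only a user-space sketch of that counting rule, not
the kernel code: 'struct desc_model' and count_tls_in_use() are
hypothetical names standing in for struct desc_struct and the loop in
cr_write_thread(), and it assumes a descriptor is "in use" when either
of its two 32-bit words is non-zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two 32-bit words of a GDT descriptor. */
struct desc_model {
	uint32_t a;
	uint32_t b;
};

/* Count descriptors that have at least one non-zero word. */
static int count_tls_in_use(const struct desc_model *desc, int nentries)
{
	int ntls = 0;
	int n;

	for (n = 0; n < nentries; n++) {
		if (desc[n].a || desc[n].b)
			ntls++;
	}
	return ntls;
}
```

The count is informational only: the whole array is still dumped, and
restart decides what to do with the entries.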

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  arch/x86/mm/Makefile       |    2 +
  arch/x86/mm/checkpoint.c   |  194 ++++++++++++++++++++++++++++++++++++++++++++
  arch/x86/mm/restart.c      |  178 ++++++++++++++++++++++++++++++++++++++++
  checkpoint/checkpoint.c    |   13 +++-
  checkpoint/ckpt_arch.h     |    7 ++
  checkpoint/restart.c       |   13 +++-
  include/asm-x86/ckpt_hdr.h |   72 ++++++++++++++++
  include/linux/ckpt_hdr.h   |    1 +
  8 files changed, 478 insertions(+), 2 deletions(-)
  create mode 100644 arch/x86/mm/checkpoint.c
  create mode 100644 arch/x86/mm/restart.c
  create mode 100644 checkpoint/ckpt_arch.h
  create mode 100644 include/asm-x86/ckpt_hdr.h

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dfb932d..58fe072 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,3 +22,5 @@ endif
  obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o

  obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..71d21e6
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* for simplicity dump the entire array, cherry-pick upon restart */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	cr_debug("ntls %d\n", ntls);
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 is not supported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+void cr_write_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/* for checkpoint in process context (from within a container)
+	   the GS and FS registers should be saved from the hardware;
+	   otherwise they are already saved in the thread structure */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long)regs->orig_ax < 0);
+		hh->ax = 0;
+	}
+}
+
+void cr_write_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	preempt_disable();
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->debugreg4 = 0;
+	hh->debugreg5 = 0;
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+
+	preempt_enable();
+}
+
+void cr_write_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct thread_info *thread_info = task_thread_info(t);
+
+	/* i387 + MMX + SSE logic */
+
+	preempt_disable();
+
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+	if (hh->used_math) {
+		/* normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+		 * has been cleared when the task was context-switched out...
+		 * except if we are in process context, in which case we do */
+		if (thread_info->status & TS_USEDFPU)
+			unlazy_fpu(current);
+
+		hh->has_fxsr = cpu_has_fxsr;
+		memcpy(&hh->xstate, &thread->xstate, sizeof(thread->xstate));
+	}
+
+	preempt_enable();
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	cr_write_cpu_regs(hh, t);
+	cr_write_cpu_debug(hh, t);
+	cr_write_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..883a163
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,178 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int parent;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		return -EINVAL;
+#endif
+	cr_debug("ntls %d\n", hh->ntls);
+
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		return -EINVAL;
+
+	if (hh->ntls > 0) {
+		/* restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ? */
+		struct desc_struct *desc;
+		int size, cpu, ret;
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc)
+			return -ENOMEM;
+
+		ret = cr_kread(ctx, desc, size);
+		if (ret >= 0) {
+			/* FIX: add sanity checks (eg. that values make
+			 * sense, that we don't overwrite old values, etc) */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 is not supported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+int cr_read_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	preempt_disable();
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	preempt_enable();
+
+	return 0;
+}
+
+int cr_read_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* i387 + MMX + SSE */
+
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+	else {
+		if (hh->has_fxsr != cpu_has_fxsr) {
+			preempt_enable();	/* balance before bailing out */
+			force_sig(SIGFPE, t);
+			return -EINVAL;
+		}
+		memcpy(&thread->xstate, &hh->xstate, sizeof(thread->xstate));
+		set_used_math();
+	}
+	preempt_enable();
+
+	return 0;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		return -EINVAL;
+#endif
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	ret = cr_read_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+ out:
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ad1099f..d34a691 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
  #include <linux/ckpt.h>
  #include <linux/ckpt_hdr.h>

+#include "ckpt_arch.h"
+
  /**
   * cr_write_obj - write a record described by a cr_hdr
   * @ctx: checkpoint context
@@ -159,8 +161,17 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
  	}

  	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);

+ out:
  	return ret;
  }

diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
new file mode 100644
index 0000000..5bd4703
--- /dev/null
+++ b/checkpoint/ckpt_arch.h
@@ -0,0 +1,7 @@
+#include <linux/ckpt.h>
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+int cr_read_thread(struct cr_ctx *ctx);
+int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 171cd2d..5226994 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
  #include <linux/ckpt.h>
  #include <linux/ckpt_hdr.h>

+#include "ckpt_arch.h"
+
  /**
   * cr_read_obj - read a whole record (cr_hdr followed by payload)
   * @ctx: checkpoint context
@@ -164,8 +166,17 @@ static int cr_read_task(struct cr_ctx *ctx)
  	int ret;

  	ret = cr_read_task_struct(ctx);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	cr_debug("cpu: ret %d\n", ret);

+ out:
  	return ret;
  }

diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
new file mode 100644
index 0000000..44a903c
--- /dev/null
+++ b/include/asm-x86/ckpt_hdr.h
@@ -0,0 +1,72 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/processor.h>
+
+struct cr_hdr_thread {
+	/* NEED: restart blocks */
+
+	__s16 gdt_entry_tls_entries;
+	__s16 sizeof_tls_array;
+	__s16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg4;
+	__u64 debugreg5;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u16 uses_debug;
+	__u16 used_math;
+	__u16 has_fxsr;
+	__u16 _padding;
+
+	union thread_xstate xstate;	/* i387 */
+
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index 629ad5a..3257720 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -12,6 +12,7 @@

  #include <linux/types.h>
  #include <linux/utsname.h>
+#include <asm/ckpt_hdr.h>

  /*
   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
-- 
1.5.4.3



* [RFC v3][PATCH 4/9] Memory management (dump)
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
                   ` (2 preceding siblings ...)
  2008-09-04  8:03 ` [RFC v3][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
@ 2008-09-04  8:03 ` Oren Laadan
  2008-09-04 18:25   ` Dave Hansen
  2008-09-04  8:04 ` [RFC v3][PATCH 5/9] Memory managemnet (restore) Oren Laadan
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:03 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


For each VMA, there is a 'struct cr_hdr_vma'; if the VMA is file-mapped,
it is followed by the file name.  The nr_pages field of that header
tells how many pages were dumped for this VMA.  The header is followed
by the actual data: first the addresses of all dumped pages (nr_pages
entries), then the contents of all dumped pages (nr_pages pages). The
next VMA follows in the same format, and so on.
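
To make the layout concrete, here is a small user-space sketch (not
part of the patch) that computes where the two per-VMA arrays sit once
the VMA header and the optional file name have been consumed.  The
names 'struct vma_payload_layout' and vma_payload_layout() are
hypothetical; addr_size is assumed to be sizeof(unsigned long) on the
checkpointed kernel (4 on x86-32) and page_size its PAGE_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the data that follows one VMA header. */
struct vma_payload_layout {
	uint64_t addrs_off;	/* offset of the page-address array */
	uint64_t pages_off;	/* offset of the page-contents array */
	uint64_t total;		/* total payload bytes for this VMA */
};

/* Addresses come first (one entry per dumped page), then page contents. */
static struct vma_payload_layout
vma_payload_layout(uint64_t nr_pages, size_t addr_size, size_t page_size)
{
	struct vma_payload_layout l;

	l.addrs_off = 0;
	l.pages_off = nr_pages * addr_size;
	l.total = l.pages_off + nr_pages * page_size;
	return l;
}
```

For example, a VMA with 3 dumped pages on x86-32 would carry 12 bytes
of addresses followed by 3 * 4096 bytes of page contents.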

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  arch/x86/mm/checkpoint.c   |   30 ++++
  arch/x86/mm/restart.c      |    1 +
  checkpoint/Makefile        |    3 +-
  checkpoint/checkpoint.c    |   53 ++++++
  checkpoint/ckpt_arch.h     |    1 +
  checkpoint/ckpt_mem.c      |  409 ++++++++++++++++++++++++++++++++++++++++++++
  checkpoint/ckpt_mem.h      |   30 ++++
  checkpoint/sys.c           |   19 ++-
  include/asm-x86/ckpt_hdr.h |    5 +
  include/linux/ckpt.h       |    9 +-
  include/linux/ckpt_hdr.h   |   30 ++++
  11 files changed, 582 insertions(+), 8 deletions(-)
  create mode 100644 checkpoint/ckpt_mem.c
  create mode 100644 checkpoint/ckpt_mem.h

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 71d21e6..50cfd29 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
  	cr_hbuf_put(ctx, sizeof(*hh));
  	return ret;
  }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
+ out:
+	mutex_unlock(&mm->context.lock);
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 883a163..d7fb89a 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -8,6 +8,7 @@
   *  distribution for more details.
   */

+#include <linux/unistd.h>
  #include <asm/desc.h>
  #include <asm/i387.h>

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
  # Makefile for linux checkpoint/restart.
  #

-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index d34a691..4dae775 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -55,6 +55,55 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
  	return cr_write_obj(ctx, &h, str);
  }

+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	char *fname;
+
+	BUG_ON(!buf);
+	fname = __d_path(path, root, buf, *n);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
  /* write the checkpoint header */
  static int cr_write_head(struct cr_ctx *ctx)
  {
@@ -164,6 +213,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
  	cr_debug("task_struct: ret %d\n", ret);
  	if (ret < 0)
  		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
  	ret = cr_write_thread(ctx, t);
  	cr_debug("thread: ret %d\n", ret);
  	if (ret < 0)
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index 5bd4703..9bd0ba4 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -2,6 +2,7 @@

  int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
  int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);

  int cr_read_thread(struct cr_ctx *ctx);
  int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..47ba701
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,409 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr'
+ * (common to ckpt_mem.c and rstr_mem.c)
+ */
+
+#define CR_PGARR_ORDER  0
+#define CR_PGARR_TOTAL  ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
+
+/* release pages referenced by a page-array */
+void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr)
+{
+	int n;
+
+	/* only checkpoint keeps references to pages */
+	if (ctx->flags & CR_CTX_CKPT) {
+		cr_debug("nused %d\n", pgarr->nused);
+		for (n = pgarr->nused; n--; )
+			page_cache_release(pgarr->pages[n]);
+	}
+	pgarr->nused = 0;
+	pgarr->nleft = CR_PGARR_TOTAL;
+}
+
+/* release pages referenced by chain of page-arrays */
+void cr_pgarr_release(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next)
+		_cr_pgarr_release(ctx, pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *pgnxt;
+
+	for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
+		_cr_pgarr_release(ctx, pgarr);
+		free_pages((unsigned long) pgarr->addrs, CR_PGARR_ORDER);
+		free_pages((unsigned long) pgarr->pages, CR_PGARR_ORDER);
+		pgnxt = pgarr->next;
+		kfree(pgarr);
+	}
+}
+
+/* allocate and add a new page-array to chain */
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew)
+{
+	struct cr_pgarr *pgarr = ctx->pgcur;
+
+	if (pgarr && pgarr->next) {
+		ctx->pgcur = pgarr->next;
+		return pgarr->next;
+	}
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->nused = 0;
+	pgarr->nleft = CR_PGARR_TOTAL;
+	pgarr->addrs = (unsigned long *)
+		__get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+	pgarr->pages = (struct page **)
+		__get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+	if (pgarr->addrs && pgarr->pages) {
+		*pgnew = pgarr;
+		ctx->pgcur = pgarr;
+		return pgarr;
+	}
+	/* else ... */
+	if (pgarr->addrs)
+		free_pages((unsigned long) pgarr->addrs, CR_PGARR_ORDER);
+	if (pgarr->pages)
+		free_pages((unsigned long) pgarr->pages, CR_PGARR_ORDER);
+	kfree(pgarr);
+	return NULL;
+}
+
+/* return current page-array (and allocate if needed) */
+struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx
+)
+{
+	struct cr_pgarr *pgarr = ctx->pgcur;
+
+	if (unlikely(!pgarr->nleft))
+		pgarr = cr_pgarr_alloc(ctx, &pgarr->next);
+	return pgarr;
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+/**
+ * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ */
+static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+			     struct vm_area_struct *vma, unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct page **pagep;
+	unsigned long *addrp;
+	int cow, nr, ret = 0;
+
+	nr = pgarr->nleft;
+	pagep = &pgarr->pages[pgarr->nused];
+	addrp = &pgarr->addrs[pgarr->nused];
+	cow = !!vma->vm_file;
+
+	while (addr < end) {
+		struct page *page;
+
+		/*
+		 * simplified version of get_user_pages(): already have vma,
+		 * only need FOLL_TOUCH, and (for now) ignore fault stats.
+		 *
+		 * FIXME: consolidate with get_user_pages()
+		 */
+
+		cond_resched();
+		while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
+			ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+			if (ret & VM_FAULT_ERROR) {
+				if (ret & VM_FAULT_OOM)
+					ret = -ENOMEM;
+				else if (ret & VM_FAULT_SIGBUS)
+					ret = -EFAULT;
+				else
+					BUG();
+				break;
+			}
+			cond_resched();
+			ret = 0;
+		}
+
+		if (IS_ERR(page))
+			ret = PTR_ERR(page);
+
+		if (ret < 0)
+			break;
+
+		if (page == ZERO_PAGE(0)) {
+			page = NULL;	/* zero page: ignore */
+		} else if (cow && page_mapping(page) != NULL) {
+			page = NULL;	/* clean cow: ignore */
+		} else {
+			get_page(page);
+			*(addrp++) = addr;
+			*(pagep++) = page;
+			if (--nr == 0) {
+				addr += PAGE_SIZE;
+				break;
+			}
+		}
+
+		addr += PAGE_SIZE;
+	}
+
+	if (unlikely(ret < 0)) {
+		nr = pgarr->nleft - nr;
+		while (nr--)
+			page_cache_release(*(--pagep));
+		return ret;
+	}
+
+	*start = addr;
+	return pgarr->nleft - nr;
+}
+
+/**
+ * cr_vma_scan_pages - scan vma for pages that will need to be dumped
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * a list of addr/page tuples is kept in ctx->pgarr page-array chain
+ */
+static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	struct cr_pgarr *pgarr;
+	int nr, total = 0;
+
+	while (addr < end) {
+		pgarr = cr_pgarr_prep(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+		if (nr < 0)
+			return nr;
+		pgarr->nleft -= nr;
+		pgarr->nused += nr;
+		total += nr;
+	}
+
+	cr_debug("total %d\n", total);
+	return total;
+}
+
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	char *buf;
+	int ret;
+
+	if (!total)
+		return 0;
+
+	for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
+		ret = cr_kwrite(ctx, pgarr->addrs,
+			       pgarr->nused * sizeof(*pgarr->addrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nused;
+
+		while (nr--) {
+			ret = cr_page_write(ctx, *pages, buf);
+			if (ret < 0)
+				goto out;
+			pages++;
+		}
+	}
+
+	ret = total;
+ out:
+	kfree(buf);
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, nr, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = vma->vm_page_prot.pgprot;
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+	if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
+		pr_warning("CR: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ETXTBSY;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/* if there is a backing file, assume private-mapped */
+	/* (FIX: check if the file is unlinked) */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	/*
+	 * it seems redundant now, but we do it in 3 steps because:
+	 * first, the logic is simpler when we know the number of pages
+	 * before dumping them; second, a future optimization will defer
+	 * the writeout (dump, and free) to a later step; in which case all
+	 * the pages to be dumped will be aggregated on the checkpoint ctx
+	 */
+
+	/* (1) scan: scan through the PTEs of the vma to count the pages
+	 * to dump (and later make those pages COW), and keep the list of
+	 * pages (and a reference to each page) on the checkpoint ctx */
+	nr = cr_vma_scan_pages(ctx, vma);
+	if (nr < 0)
+		return nr;
+
+	hh->nr_pages = nr;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+	/* save the file name, if relevant */
+	if (vma->vm_file)
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, ctx->vfsroot);
+
+	if (ret < 0)
+		return ret;
+
+	/* (2) dump: write out the addresses of all pages in the list (on
+	 * the checkpoint ctx) followed by the contents of all pages */
+	ret = cr_vma_dump_pages(ctx, nr);
+
+	/* (3) free: free the extra references to the pages in the list */
+	cr_pgarr_release(ctx);
+
+	return ret;
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/ckpt_mem.h b/checkpoint/ckpt_mem.h
new file mode 100644
index 0000000..83d1cfc
--- /dev/null
+++ b/checkpoint/ckpt_mem.h
@@ -0,0 +1,30 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/* page-array chains: each pgarr holds a list of <addr,page> tuples */
+struct cr_pgarr {
+	unsigned long *addrs;
+	struct page **pages;
+	struct cr_pgarr *next;
+	unsigned short nleft;
+	unsigned short nused;
+};
+
+void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr);
+void cr_pgarr_release(struct cr_ctx *ctx);
+void cr_pgarr_free(struct cr_ctx *ctx);
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew);
+struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 4268bae..263fb8a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
  #include <linux/capability.h>
  #include <linux/ckpt.h>

+#include "ckpt_mem.h"
+
  /*
   * helpers to write/read to/from the image file descriptor
   *
@@ -110,7 +112,6 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count)
  	return ret;
  }

-
  /*
   * helpers to manage CR contexts: allocated for each checkpoint and/or
   * restart operation, and persists until the operation is completed.
@@ -121,7 +122,6 @@ static atomic_t cr_ctx_count;

  void cr_ctx_free(struct cr_ctx *ctx)
  {
-
  	if (ctx->file)
  		fput(ctx->file);
  	if (ctx->vfsroot)
@@ -129,6 +129,8 @@ void cr_ctx_free(struct cr_ctx *ctx)

  	free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);

+	cr_pgarr_free(ctx);
+
  	kfree(ctx);
  }

@@ -148,10 +150,11 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
  	get_file(ctx->file);

  	ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
-	if (!ctx->hbuf) {
-		cr_ctx_free(ctx);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!ctx->hbuf)
+		goto nomem;
+
+	if (!cr_pgarr_alloc(ctx, &ctx->pgarr))
+		goto nomem;

  	ctx->pid = pid;
  	ctx->flags = flags;
@@ -164,6 +167,10 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
  	ctx->crid = atomic_inc_return(&cr_ctx_count);

  	return ctx;
+
+ nomem:
+	cr_ctx_free(ctx);
+	return ERR_PTR(-ENOMEM);
  }

  /*
diff --git a/include/asm-x86/ckpt_hdr.h b/include/asm-x86/ckpt_hdr.h
index 44a903c..6bc61ac 100644
--- a/include/asm-x86/ckpt_hdr.h
+++ b/include/asm-x86/ckpt_hdr.h
@@ -69,4 +69,9 @@ struct cr_hdr_cpu {

  } __attribute__((aligned(8)));

+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
  #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index 1bb2b09..c834f3c 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -28,7 +28,10 @@ struct cr_ctx {
  	void *hbuf;		/* temporary buffer for headers */
  	int hpos;		/* position in headers buffer */

-	struct path *vfsroot;	/* container root */
+	struct cr_pgarr *pgarr;	/* page array for dumping VMA contents */
+	struct cr_pgarr *pgcur;	/* current position in page array */
+
+	struct path *vfsroot;	/* container root (FIXME) */
  };

  /* cr_ctx: flags */
@@ -51,11 +54,15 @@ struct cr_hdr;

  int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
  int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);

  int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
  int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
  int cr_read_string(struct cr_ctx *ctx, void *str, int len);

+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+int cr_read_mm(struct cr_ctx *ctx);
+
  int do_checkpoint(struct cr_ctx *ctx);
  int do_restart(struct cr_ctx *ctx);

diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index 3257720..322ade5 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -32,6 +32,7 @@ struct cr_hdr {
  enum {
  	CR_HDR_HEAD = 1,
  	CR_HDR_STRING,
+	CR_HDR_FNAME,

  	CR_HDR_TASK = 101,
  	CR_HDR_THREAD,
@@ -80,4 +81,33 @@ struct cr_hdr_task {
  	__s32 task_comm_len;
  } __attribute__((aligned(8)));

+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+	__s64 nr_pages;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+
+} __attribute__((aligned(8)));
+
  #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC v3][PATCH 5/9] Memory management (restore)
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
                   ` (3 preceding siblings ...)
  2008-09-04  8:03 ` [RFC v3][PATCH 4/9] Memory management (dump) Oren Laadan
@ 2008-09-04  8:04 ` Oren Laadan
  2008-09-04 18:08   ` Dave Hansen
  2008-09-04  8:04 ` [RFC v3][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:04 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
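For illustration only: a minimal userspace sketch of the per-VMA read
order used on restart, assuming the layout written by the dump side
(nr_pages addresses first, then nr_pages page images). The names
demo_restore_pages and DEMO_PAGE_SIZE, and the flat image/mem buffers,
are hypothetical stand-ins for cr_kread()/cr_uread() and the task's
address space; the kernel code additionally batches the addresses
through the cr_pgarr chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny "page" to keep the demo small */

/*
 * Restore nr_pages pages from image into mem.
 * Assumed image layout: nr_pages addresses (uint64_t each), followed by
 * nr_pages page images of DEMO_PAGE_SIZE bytes each.
 * Addresses are byte offsets into mem and must be page aligned.
 * Returns 0 on success, -1 on truncated or invalid input.
 */
static int demo_restore_pages(const unsigned char *image, size_t image_len,
			      size_t nr_pages,
			      unsigned char *mem, size_t mem_len)
{
	size_t addrs_len = nr_pages * sizeof(uint64_t);
	size_t data_len = nr_pages * DEMO_PAGE_SIZE;
	const unsigned char *data = image + addrs_len;
	size_t i;

	if (image_len < addrs_len + data_len)
		return -1;
	if (mem_len < DEMO_PAGE_SIZE)
		return -1;

	for (i = 0; i < nr_pages; i++) {
		uint64_t addr;

		/* step 1: the address comes from the address array */
		memcpy(&addr, image + i * sizeof(addr), sizeof(addr));
		if (addr % DEMO_PAGE_SIZE || addr > mem_len - DEMO_PAGE_SIZE)
			return -1;

		/* step 2: the i-th page image is copied to that address */
		memcpy(mem + addr, data + i * DEMO_PAGE_SIZE, DEMO_PAGE_SIZE);
	}
	return 0;
}
```

Keeping all addresses ahead of the data means the reader knows every
destination before it touches any page contents, which matches the
two-pass structure of cr_vma_read_pages_addr() and
cr_vma_read_pages_data() in this patch.
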
  arch/x86/mm/restart.c  |   56 ++++++++
  checkpoint/Makefile    |    2 +-
  checkpoint/ckpt_arch.h |    1 +
  checkpoint/restart.c   |   43 ++++++
  checkpoint/rstr_mem.c  |  351 ++++++++++++++++++++++++++++++++++++++++++++++++
  include/linux/ckpt.h   |    2 +
  6 files changed, 454 insertions(+), 1 deletions(-)
  create mode 100644 checkpoint/rstr_mem.c

diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index d7fb89a..7c5a7d7 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -177,3 +177,59 @@ int cr_read_cpu(struct cr_ctx *ctx)
   out:
  	return ret;
  }
+
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, rparent;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		return -EINVAL;
+
+	/* to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt() */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+		int ret;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			return ret;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, &info, sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			return ret;
+	}
+
+	load_LDT(&mm->context);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
  #

  obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
-		ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index 9bd0ba4..acfa101 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -6,3 +6,4 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);

  int cr_read_thread(struct cr_ctx *ctx);
  int cr_read_cpu(struct cr_ctx *ctx);
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 5226994..f8c919d 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -77,6 +77,45 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len)
  	return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
  }

+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, void *fname, int flen)
+{
+	return cr_read_obj_type(ctx, fname, flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int flen, ret;
+
+	flen = PATH_MAX;
+	fname = kmalloc(flen, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, flen);
+	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
  /* read the checkpoint header */
  static int cr_read_head(struct cr_ctx *ctx)
  {
@@ -169,6 +208,10 @@ static int cr_read_task(struct cr_ctx *ctx)
  	cr_debug("task_struct: ret %d\n", ret);
  	if (ret < 0)
  		goto out;
+	ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
  	ret = cr_read_thread(ctx);
  	cr_debug("thread: ret %d\n", ret);
  	if (ret < 0)
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..2437e2e
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,351 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <asm/cacheflush.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read in directly to the address space of the current process
+ */
+
+/**
+ * cr_vma_read_pages_addr - read addresses of pages to page-array chain
+ * @ctx: restart context
+ * @npages: number of pages
+ */
+static int cr_vma_read_pages_addr(struct cr_ctx *ctx, int npages)
+{
+	struct cr_pgarr *pgarr;
+	int nr, ret;
+
+	while (npages) {
+		pgarr = cr_pgarr_prep(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = min(npages, (int) pgarr->nleft);
+		ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nleft -= nr;
+		pgarr->nused += nr;
+		npages -= nr;
+	}
+	return 0;
+}
+
+/**
+ * cr_vma_read_pages_data - read in data of pages in page-array chain
+ * @ctx: restart context
+ * @npages: number of pages
+ */
+static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *addrs;
+	int nr, ret;
+
+	for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) {
+		addrs = pgarr->addrs;
+		nr = pgarr->nused;
+		npages -= nr;
+		while (nr--) {
+			ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE);
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+/* change the protection of an address range to be writable/non-writable.
+ * this is useful when restoring the memory of a read-only vma */
+static int cr_vma_writable(struct mm_struct *mm, unsigned long start,
+			   unsigned long end, int writable)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long flags = 0;
+	int ret = -EINVAL;
+
+	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+	if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start))
+		goto out;
+	if (writable && !(vma->vm_flags & VM_WRITE))
+		flags = vma->vm_flags | VM_WRITE;
+	else if (!writable && (vma->vm_flags & VM_WRITE))
+		flags = vma->vm_flags & ~VM_WRITE;
+	cr_debug("flags %#lx\n", flags);
+	if (flags)
+		ret = mprotect_fixup(vma, &prev, vma->vm_start,
+				     vma->vm_end, flags);
+ out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
+/**
+ * cr_vma_read_pages - read in pages to restore a vma
+ * @ctx: restart context
+ * @cr_vma: vma descriptor from restart
+ */
+static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma)
+{
+	struct mm_struct *mm = current->mm;
+	int ret = 0;
+
+	if (!cr_vma->nr_pages)
+		return 0;
+
+	/* in the unlikely case that this vma is read-only */
+	if (!(cr_vma->vm_flags & VM_WRITE))
+		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 1);
+	if (ret < 0)
+		goto out;
+	ret = cr_vma_read_pages_addr(ctx, cr_vma->nr_pages);
+	if (ret < 0)
+		goto out;
+	ret = cr_vma_read_pages_data(ctx, cr_vma->nr_pages);
+	if (ret < 0)
+		goto out;
+
+	cr_pgarr_release(ctx);	/* reset page-array chain */
+
+	/* restore original protection for this vma */
+	if (!(cr_vma->vm_flags & VM_WRITE))
+		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
+
+ out:
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	unsigned long vm_size, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	unsigned long flags;
+	struct file *file = NULL;
+	int parent, ret = 0;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (parent < 0)
+		return parent;
+	else if (parent != 0)
+		return -EINVAL;
+
+	cr_debug("vma %#lx-%#lx type %d nr_pages %d\n",
+		 (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
+		 (int) hh->vma_type, (int) hh->nr_pages);
+
+	if (hh->vm_end < hh->vm_start || hh->nr_pages < 0)
+		return -EINVAL;
+
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+	vm_pgoff = hh->vm_pgoff;
+
+	switch (hh->vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		/* vm_pgoff for anonymous mapping is the "global" page
+		   offset (namely from addr 0x0), so we force a zero */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		/* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
+		flags = hh->vm_flags & (VM_WRITE | VM_SHARED);
+		flags = (flags == (VM_WRITE | VM_SHARED) ? O_RDWR : O_RDONLY);
+		file = cr_read_open_fname(ctx, flags, 0);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		break;
+
+	default:
+		return -EINVAL;
+
+	}
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, (unsigned long) hh->vm_start,
+			     vm_size, vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (hh->vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_vma_read_pages(ctx, hh);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	if (vm_prot & PROT_EXEC)
+		flush_icache_range(hh->vm_start, hh->vm_end);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	cr_debug("vma retval %d\n", ret);
+	return 0;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_debug("CR: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	int nr, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		return -EINVAL;
+#endif
+	cr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+	if (hh->start_code > hh->end_code ||
+	    hh->start_data > hh->end_data || hh->map_count < 0)
+		return -EINVAL;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		return ret;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = cr_read_mm_context(ctx, mm, hh->objref);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index c834f3c..83c61a4 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -59,6 +59,8 @@ int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root);
  int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
  int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
  int cr_read_string(struct cr_ctx *ctx, void *str, int len);
+int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode);

  int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
  int cr_read_mm(struct cr_ctx *ctx);
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC v3][PATCH 6/9] Checkpoint/restart: initial documentation
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
                   ` (4 preceding siblings ...)
  2008-09-04  8:04 ` [RFC v3][PATCH 5/9] Memory management (restore) Oren Laadan
@ 2008-09-04  8:04 ` Oren Laadan
  2008-09-04  8:05 ` [RFC v3][PATCH 7/9] Infrastructure for shared objects Oren Laadan
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:04 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
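For illustration only: a minimal userspace sketch of walking the
record stream that the documentation below describes, assuming every
item in the buffer is a pre-header followed by 'len' bytes of payload.
The struct mirrors the 'struct cr_hdr' shown in checkpoint.txt; the
names demo_cr_hdr, demo_count_records and demo_put_record are
hypothetical. Real images also carry raw address and page arrays after
each VMA record, which this walker does not handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* mirrors the pre-header shown in Documentation/checkpoint.txt */
struct demo_cr_hdr {
	int16_t type;
	int16_t len;	/* payload length in bytes */
	uint32_t id;
};

/*
 * Walk the records in buf and count those whose type equals want_type.
 * Returns the count, or -1 if a record is truncated or has a negative
 * length.
 */
static int demo_count_records(const unsigned char *buf, size_t buflen,
			      int16_t want_type)
{
	size_t pos = 0;
	int count = 0;

	while (pos < buflen) {
		struct demo_cr_hdr h;

		if (buflen - pos < sizeof(h))
			return -1;
		memcpy(&h, buf + pos, sizeof(h));
		pos += sizeof(h);

		if (h.len < 0 || buflen - pos < (size_t)h.len)
			return -1;
		if (h.type == want_type)
			count++;
		pos += (size_t)h.len;	/* skip the payload */
	}
	return count;
}

/* append one record to buf at *pos; the caller guarantees there is room */
static void demo_put_record(unsigned char *buf, size_t *pos, int16_t type,
			    uint32_t id, const void *payload, int16_t len)
{
	struct demo_cr_hdr h = { type, len, id };

	memcpy(buf + *pos, &h, sizeof(h));
	*pos += sizeof(h);
	memcpy(buf + *pos, payload, (size_t)len);
	*pos += (size_t)len;
}
```

Because each record states its own length, a reader can skip payload
types it does not understand, which is what would let a userspace
conversion program process images from other kernel revisions.
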
  Documentation/checkpoint.txt |  182 ++++++++++++++++++++++++++++++++++++++++++
  1 files changed, 182 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..71930af
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,182 @@
+
+	=== Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint.
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial CR products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide CR: sys_checkpoint and
+sys_restart.  The checkpoint code basically serializes internal kernel
+state and writes it out to a file descriptor, and the resulting image
+is stream-able. More specifically, it consists of 5 steps:
+  1. Pre-dump
+  2. Freeze the container
+  3. Dump
+  4. Thaw (or kill) the container
+  5. Post-dump
+Steps 1 and 5 are an optimization to reduce application downtime:
+"pre-dump" works before freezing the container, e.g. the pre-copy for
+live migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a
+file descriptor, and re-creates the tasks and the resources they need
+to resume execution. The restart code is executed by each task that
+is restored in a new container to reconstruct its own state.
+
+
+=== Interfaces
+
+int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+  Checkpoint a container whose init task is identified by pid, to the
+  file designated by fd. Flags will have future meaning (should be 0
+  for now).
+  Returns: a positive integer that identifies the checkpoint image
+  (for future reference in case it is kept in memory) upon success,
+  0 if it returns from a restart, and -1 if an error occurs.
+
+int sys_restart(int crid, int fd, unsigned long flags);
+  Restart a container from a checkpoint image identified by crid, or
+  from the blob stored in the file designated by fd. Flags will have
+  future meaning (should be 0 for now).
+  Returns: 0 on success and -1 if an error occurs.
+
+Thus, if checkpoint is initiated by a process in the container, one
+can use logic similar to fork():
+	...
+	crid = checkpoint(...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+
+=== Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 id;
+};
+
+Here, the 'type' field identifies the type of the payload and 'len' gives
+its length in bytes. The 'id' identifies the owner object instance. The
+meaning of the 'id' field varies depending on the type. For example,
+for type CR_HDR_MM, the 'id' identifies the task to which this MM
+belongs. The payload also varies depending on the type, for instance,
+the data describing a task_struct is given by a 'struct cr_hdr_task'
+(type CR_HDR_TASK) and so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_hdr_vma'; if the VMA is file-mapped, it is followed by the
+file name. The cr_hdr_vma->nr_pages field indicates how many pages were
+dumped for this VMA. The actual data comes next: first the addresses of
+all the dumped pages, then the contents of all the dumped pages
+(nr_pages entries each). Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			addr1, addr2
+			page1, page2
+		cr_hdr + cr_hdr_vma
+			addr3, addr4, addr5
+			page3, page4, page5
+		cr_hdr + cr_hdr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+=== Changelog
+
+[2008-Jul-29] v1:
+In this incarnation, CR only works on a single task. The address space
+may consist of only private, simple VMAs - anonymous or file-mapped.
+Both checkpoint and restart will ignore the first argument (pid/crid)
+and instead act on themselves.
+
+[2008-Aug-09] v2:
+* Added utsname->{release,version,machine} to checkpoint header
+* Pad header structures to 64 bits to ensure compatibility
+* Address comments from LKML and linux-containers mailing list
+
+[2008-Aug-29] v3:
+* Various fixes and clean-ups
+* Use standard hlist_... for hash table
+* Better use of standard kmalloc/kfree
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC v3][PATCH 7/9] Infrastructure for shared objects
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
                   ` (5 preceding siblings ...)
  2008-09-04  8:04 ` [RFC v3][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
@ 2008-09-04  8:05 ` Oren Laadan
  2008-09-04  9:38   ` Louis Rilling
  2008-09-04 18:14   ` Dave Hansen
  2008-09-04  8:05 ` [RFC v3][PATCH 8/9] File descriptors (dump) Oren Laadan
  2008-09-04  8:06 ` [RFC v3][PATCH 9/9] File descriptors (restore) Oren Laadan
  8 siblings, 2 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:05 UTC (permalink / raw)
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier and also
stored in a hash table (indexed by its physical kernel address). From
then on the object will be found in the hash and only its identifier is
saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
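For illustration only: a minimal userspace sketch of the checkpoint
side of the scheme described above, assuming a fixed-size chained hash
keyed by object address. The demo_* names, the pool allocator and the
table sizes are hypothetical and are not the objhash.c implementation;
only the return convention (1 when newly added, 0 when already
present) follows the interface documented in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HASH_SIZE 16
#define DEMO_MAX_OBJS  32

struct demo_obj {
	const void *ptr;	/* key: address of the shared object */
	int objref;		/* unique identifier written to the image */
	struct demo_obj *next;	/* bucket chain */
};

struct demo_objhash {
	struct demo_obj *bucket[DEMO_HASH_SIZE];
	struct demo_obj pool[DEMO_MAX_OBJS];
	int next_free;
	int next_objref;	/* identifiers start at 1; 0 means "not found" */
};

static void demo_objhash_init(struct demo_objhash *h)
{
	memset(h, 0, sizeof(*h));
	h->next_objref = 1;
}

static unsigned int demo_hash_ptr(const void *ptr)
{
	return (unsigned int)(((uintptr_t)ptr >> 4) % DEMO_HASH_SIZE);
}

/* return the objref of ptr, or 0 if it has not been seen yet */
static int demo_obj_get_by_ptr(const struct demo_objhash *h, const void *ptr)
{
	const struct demo_obj *o;

	for (o = h->bucket[demo_hash_ptr(ptr)]; o; o = o->next)
		if (o->ptr == ptr)
			return o->objref;
	return 0;
}

/*
 * Look up ptr and add it if missing; the identifier is stored in *objref.
 * Returns 1 if newly added (the caller must dump the object's state),
 * 0 if already present (only the identifier is saved), -1 if full.
 */
static int demo_obj_add_ptr(struct demo_objhash *h, const void *ptr,
			    int *objref)
{
	struct demo_obj *o;
	unsigned int b;
	int ref = demo_obj_get_by_ptr(h, ptr);

	if (ref) {
		*objref = ref;
		return 0;
	}
	if (h->next_free >= DEMO_MAX_OBJS)
		return -1;

	o = &h->pool[h->next_free++];
	b = demo_hash_ptr(ptr);
	o->ptr = ptr;
	o->objref = h->next_objref++;
	o->next = h->bucket[b];
	h->bucket[b] = o;
	*objref = o->objref;
	return 1;
}
```

On restart the same table shape would be keyed by objref instead of by
address, so that the first reader of an identifier creates the object
and later readers reuse it.
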
  Documentation/checkpoint.txt |   44 +++++++++
  checkpoint/Makefile          |    2 +-
  checkpoint/objhash.c         |  205 ++++++++++++++++++++++++++++++++++++++++++
  checkpoint/sys.c             |    4 +
  include/linux/ckpt.h         |   18 ++++
  5 files changed, 272 insertions(+), 1 deletions(-)
  create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index 71930af..18725e6 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -163,6 +163,50 @@ cr_hdr + cr_hdr_task
  cr_hdr + cr_hdr_tail


+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects in the following manner.
+
+On the first encounter, the state is dumped and the object is assigned
+a unique identifier and also stored in the hash table (indexed by its
+kernel address). From then on the object will be found in the
+hash and only its identifier is saved.
+
+On restart the identifier is looked up in the hash table; if not found
+then the state is read, the object is created, and added to the hash
+table (this time indexed by its identifier). Otherwise, the object in
+the hash table is used.
+
+The interface for the hash table is the following:
+
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
+  [checkpoint] find the unique identifier - object reference (objref)
+  - of the object that is pointed to by ptr (or -ESRCH if not found).
+
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags);
+  [checkpoint] add the object pointed to by ptr to the hash table if
+  it isn't already there, and fill its unique identifier (objref); will
+  return 0 if already found in the hash, or 1 otherwise.
+
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type);
+  [restart] return the pointer to the object whose unique identifier
+  is equal to objref.
+
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags);
+  [restart] add the object with unique identifier objref, pointed to by
+  ptr, to the hash table; will return 0 on success, or -ENOMEM if the
+  entry cannot be allocated.
+
+
  === Changelog

  [2008-Jul-29] v1:
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
  # Makefile for linux checkpoint/restart.
  #

-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
  		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..442b08c
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,205 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/ckpt.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int objref_index;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->objref_index = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr((void *) objref, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (obj) {
+		int i;
+
+		obj->ptr = ptr;
+		obj->type = type;
+		obj->flags = flags;
+
+		if (objref) {
+			/* use 'objref' to index (restart) */
+			obj->objref = objref;
+			i = hash_ptr((void *) objref, CR_OBJHASH_NBITS);
+		} else {
+			/* use 'ptr' to index, assign objref (checkpoint) */
+			obj->objref = ctx->objhash->objref_index++;
+			i = hash_ptr(ptr, CR_OBJHASH_NBITS);
+		}
+
+		hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+		cr_obj_ref_grab(obj);
+	}
+	return obj;
+}
+
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return (obj ? 0 : -ENOMEM);
+}
+
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (obj)
+		return (obj->type == type ? obj->objref : -EINVAL);
+	else
+		return -ESRCH;
+}
+
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (obj)
+		return (obj->type == type ? obj->ptr : ERR_PTR(-EINVAL));
+	else
+		return NULL;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 263fb8a..1857010 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -130,6 +130,7 @@ void cr_ctx_free(struct cr_ctx *ctx)
  	free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);

  	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);

  	kfree(ctx);
  }
@@ -156,6 +157,9 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
  	if (!cr_pgarr_alloc(ctx, &ctx->pgarr))
  		goto nomem;

+	if (cr_objhash_alloc(ctx) < 0)
+		goto nomem;
+
  	ctx->pid = pid;
  	ctx->flags = flags;

diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index 83c61a4..e8be58c 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -28,6 +28,8 @@ struct cr_ctx {
  	void *hbuf;		/* temporary buffer for headers */
  	int hpos;		/* position in headers buffer */

+	struct cr_objhash *objhash;	/* hash for shared objects */
+
  	struct cr_pgarr *pgarr;	/* page array for dumping VMA contents */
  	struct cr_pgarr *pgcur;	/* current position in page array */

@@ -50,6 +52,22 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count);
  void *cr_hbuf_get(struct cr_ctx *ctx, int n);
  void cr_hbuf_put(struct cr_ctx *ctx, int n);

+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+void cr_objhash_free(struct cr_ctx *ctx);
+int cr_objhash_alloc(struct cr_ctx *ctx);
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type);
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags);
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags);
+
  struct cr_hdr;

  int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3



* [RFC v3][PATCH 8/9] File descriptors (dump)
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
                   ` (6 preceding siblings ...)
  2008-09-04  8:05 ` [RFC v3][PATCH 7/9] Infrastructure for shared objects Oren Laadan
@ 2008-09-04  8:05 ` Oren Laadan
  2008-09-04  9:47   ` Louis Rilling
                     ` (2 more replies)
  2008-09-04  8:06 ` [RFC v3][PATCH 9/9] File descriptors (restore) Oren Laadan
  8 siblings, 3 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:05 UTC
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned a
tag and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
and its close-on-exec property. If the FD state is to be saved (first
time its file is seen), this is followed by a 'struct cr_hdr_fd_data'
with the FD state. The next FD then follows, and so on.
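
The record order described above can be illustrated with a small
userspace sketch. It is not the patch code: the SK_REC_* tags, the
sk_open_fd struct and sk_dump_order() are hypothetical names, and a
linear search stands in for the object hash. It only shows that every FD
yields an fd_ent record, and that an fd_data record follows the first
time a given file is seen.

```c
/* Userspace sketch of the record order; sk_* names are illustrative. */
#include <assert.h>
#include <stddef.h>

enum sk_rec { SK_REC_FD_ENT = 1, SK_REC_FD_DATA = 2 };

struct sk_open_fd {
	int fd;			/* descriptor number */
	const void *file;	/* stands in for the kernel's struct file * */
};

/*
 * Write the sequence of record types for 'nfds' descriptors into 'out'.
 * Returns the number of records, or -1 if 'out' (capacity 'cap') is too
 * small to hold them.
 */
static int sk_dump_order(const struct sk_open_fd *fds, int nfds,
			 enum sk_rec *out, int cap)
{
	int i, j, n = 0;

	assert(nfds >= 0 && cap >= 0);

	for (i = 0; i < nfds; i++) {
		int seen = 0;

		/* linear search keeps the sketch short; the patch uses a hash */
		for (j = 0; j < i; j++) {
			if (fds[j].file == fds[i].file) {
				seen = 1;
				break;
			}
		}
		if (n + (seen ? 1 : 2) > cap)
			return -1;
		out[n++] = SK_REC_FD_ENT;	/* always: fd, tag, close-on-exec */
		if (!seen)
			out[n++] = SK_REC_FD_DATA;	/* first time only: file state */
	}
	return n;
}
```

For two descriptors sharing one file plus a third with its own file, the
sketch produces five records: ent, data, ent, ent, data.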

This patch only handles basic FDs - regular files, directories and also
symbolic links.
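
As a rough userspace analogue of that restriction, the file type can be
derived from the mode bits with S_IFMT, and anything else rejected. The
sk_* names below are made up for illustration, and the sketch returns -1
where the patch returns -EBADF.

```c
/* Userspace sketch; sk_* names are illustrative, not part of the patch. */
#define _XOPEN_SOURCE 700	/* for S_IFMT and friends under strict modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

enum sk_fd_type { SK_FD_FILE = 1, SK_FD_DIR, SK_FD_LINK };

/* Map mode bits to one of the supported types, or -1 if unsupported. */
static int sk_classify(mode_t mode)
{
	switch (mode & S_IFMT) {
	case S_IFREG:
		return SK_FD_FILE;
	case S_IFDIR:
		return SK_FD_DIR;
	case S_IFLNK:
		return SK_FD_LINK;
	default:
		return -1;	/* pipes, sockets, devices: not handled */
	}
}

/* Classify an open descriptor; -1 if fstat() fails or type is unsupported. */
static int sk_classify_fd(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return sk_classify(st.st_mode);
}
```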

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile      |    2 +-
  checkpoint/checkpoint.c  |    4 +
  checkpoint/ckpt_file.c   |  224 ++++++++++++++++++++++++++++++++++++++++++++++
  checkpoint/ckpt_file.h   |   17 ++++
  include/linux/ckpt.h     |    7 +-
  include/linux/ckpt_hdr.h |   34 +++++++-
  6 files changed, 283 insertions(+), 5 deletions(-)
  create mode 100644 checkpoint/ckpt_file.c
  create mode 100644 checkpoint/ckpt_file.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
  #

  obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4dae775..aebbf22 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -217,6 +217,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
  	cr_debug("memory: ret %d\n", ret);
  	if (ret < 0)
  		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
  	ret = cr_write_thread(ctx, t);
  	cr_debug("thread: ret %d\n", ret);
  	if (ret < 0)
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..34df371
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,224 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_file.h"
+
+#define CR_DEFAULT_FDTABLE  256
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ * @return: the number of open fds found
+ *
+ * Allocates the file descriptors array (*fdtable), caller should free
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fdlist;
+	int i, n, max;
+
+	max = CR_DEFAULT_FDTABLE;
+
+ repeat:
+	n = 0;
+	fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
+	if (!fdlist)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	for (i = 0; i < fdt->max_fds; i++) {
+		if (fcheck_files(files, i)) {
+			if (n == max) {
+				spin_unlock(&files->file_lock);
+				kfree(fdlist);
+				max *= 2;
+				if (max < 0) {	/* overflow ? */
+					/* lock dropped and fdlist freed above */
+					return -EMFILE;
+				}
+				goto repeat;
+			}
+			fdlist[n++] = i;
+		}
+	}
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fdlist;
+	return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	enum fd_type fd_type;
+	int ret;
+
+	h.type = CR_HDR_FD_DATA;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	BUG_ON(!inode);
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_uid = file->f_uid;
+	hh->f_gid = file->f_gid;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->f_owner */
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		fd_type = CR_FD_FILE;
+		break;
+	case S_IFDIR:
+		fd_type = CR_FD_DIR;
+		break;
+	case S_IFLNK:
+		fd_type = CR_FD_LINK;
+		break;
+	default:
+		return -EBADF;
+	}
+
+	/* FIX: check if the file/dir/link is unlinked */
+	hh->fd_type = fd_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Save the state of the file descriptor; look up the actual file pointer
+ * in the hash table, and if found save the matching objref, otherwise call
+ * cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int coe, objref, ret;
+
+	/* make sure hh->fd (that is of type __u16) doesn't overflow */
+	if (fd > USHORT_MAX) {
+		pr_warning("CR: open files table too big (%d)\n", USHORT_MAX);
+		return -EMFILE;
+	}
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file)
+		return -EBADF;
+
+	ret = cr_obj_add_ptr(ctx, (void *) file, &objref, CR_OBJ_FILE, 0);
+	cr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+	if (ret >= 0) {
+		int new = ret;
+
+		h.type = CR_HDR_FD_ENT;
+		h.len = sizeof(*hh);
+		h.parent = 0;
+
+		hh->objref = objref;
+		hh->fd = fd;
+		hh->close_on_exec = coe;
+
+		ret = cr_write_obj(ctx, &h, hh);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		if (ret < 0)
+			return ret;
+
+		/* new==1 if-and-only-if file was new and added to hash */
+		if (new)
+			ret = cr_write_fd_data(ctx, file, objref);
+	}
+
+	fput(file);
+	return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files;
+	int *fdtable;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FILES;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	files = get_files_struct(t);
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto clean;
+
+	cr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ clean:
+	kfree(fdtable);
+ out:
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/ckpt_file.h b/checkpoint/ckpt_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/ckpt_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index e8be58c..ea57fe6 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -13,7 +13,7 @@
  #include <linux/path.h>
  #include <linux/fs.h>

-#define CR_VERSION  1
+#define CR_VERSION  2

  struct cr_ctx {
  	pid_t pid;		/* container identifier */
@@ -80,11 +80,12 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len);
  int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
  struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode);

+int do_checkpoint(struct cr_ctx *ctx);
  int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
-int cr_read_mm(struct cr_ctx *ctx);
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);

-int do_checkpoint(struct cr_ctx *ctx);
  int do_restart(struct cr_ctx *ctx);
+int cr_read_mm(struct cr_ctx *ctx);

  #define cr_debug(fmt, args...)  \
  	pr_debug("[CR:%s] " fmt, __func__, ## args)
diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
index 322ade5..1ce1dbc 100644
--- a/include/linux/ckpt_hdr.h
+++ b/include/linux/ckpt_hdr.h
@@ -17,7 +17,7 @@
  /*
   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
   * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned(8))) for the entire structure.
   */

  /* records: generic header */
@@ -42,6 +42,10 @@ enum {
  	CR_HDR_VMA,
  	CR_HDR_MM_CONTEXT,

+	CR_HDR_FILES = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FD_DATA,
+
  	CR_HDR_TAIL = 5001
  };

@@ -110,4 +114,32 @@ struct cr_hdr_vma {

  } __attribute__((aligned(8)));

+struct cr_hdr_files {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__u32 objref;		/* identifier for shared objects */
+	__u16 fd;
+	__u16 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum fd_type {
+	CR_FD_FILE = 1,
+	CR_FD_DIR,
+	CR_FD_LINK
+};
+
+struct cr_hdr_fd_data {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u32 f_uid;
+	__u32 f_gid;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
  #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3



* [RFC v3][PATCH 9/9] File descriptors (restore)
  2008-09-04  7:57 [RFC v3][PATCH 0/9] Kernel based checkpoint/restart Oren Laadan
                   ` (7 preceding siblings ...)
  2008-09-04  8:05 ` [RFC v3][PATCH 8/9] File descriptors (dump) Oren Laadan
@ 2008-09-04  8:06 ` Oren Laadan
  8 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04  8:06 UTC
  To: dave; +Cc: arnd, jeremy, linux-kernel, containers


Restore open file descriptors: for each FD, read 'struct cr_hdr_fd_ent'
and look up its tag in the hash table; if not found (first occurrence),
read in 'struct cr_hdr_fd_data', create a new FD and register it in the
hash. Otherwise attach the file pointer from the hash as an FD.
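
The restore decision can be sketched in userspace as below. This is an
illustration only, with hypothetical sk_* names and small fixed-size
arrays standing in for the object hash and the FD table; "opening" a
file is simulated by handing out a counter value.

```c
/* Userspace sketch of the restore decision; sk_* names are illustrative. */
#include <assert.h>

#define SK_MAX_REF 16
#define SK_MAX_FD  16

struct sk_fd_ent {
	int objref;	/* tag saved at checkpoint */
	int fd;		/* descriptor number to restore */
};

struct sk_restore {
	int file_of_ref[SK_MAX_REF];	/* objref -> file handle, -1 if unseen */
	int file_of_fd[SK_MAX_FD];	/* fd -> file handle, -1 if closed */
	int opened;			/* number of files "opened" so far */
};

static void sk_restore_init(struct sk_restore *r)
{
	int i;

	for (i = 0; i < SK_MAX_REF; i++)
		r->file_of_ref[i] = -1;
	for (i = 0; i < SK_MAX_FD; i++)
		r->file_of_fd[i] = -1;
	r->opened = 0;
}

/* Restore one entry; returns 0 on success, -1 on an invalid record. */
static int sk_restore_ent(struct sk_restore *r, const struct sk_fd_ent *e)
{
	int file;

	if (e->objref <= 0 || e->objref >= SK_MAX_REF)
		return -1;
	if (e->fd < 0 || e->fd >= SK_MAX_FD)
		return -1;

	file = r->file_of_ref[e->objref];
	if (file < 0) {
		/* first occurrence: this is where fd_data would be read */
		file = r->opened++;
		r->file_of_ref[e->objref] = file;
	}
	/* otherwise the already restored file is shared by this FD */
	r->file_of_fd[e->fd] = file;
	return 0;
}
```

Feeding the entries (objref 1, fd 0), (objref 1, fd 1) and (objref 2,
fd 5) through sk_restore_ent() opens two files, with FDs 0 and 1 sharing
the first.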

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile    |    2 +-
  checkpoint/restart.c   |    4 +
  checkpoint/rstr_file.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++
  include/linux/ckpt.h   |    1 +
  4 files changed, 211 insertions(+), 1 deletions(-)
  create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
  #

  obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f8c919d..bc49523 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -212,6 +212,10 @@ static int cr_read_task(struct cr_ctx *ctx)
  	cr_debug("memory: ret %d\n", ret);
  	if (ret < 0)
  		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
  	ret = cr_read_thread(ctx);
  	cr_debug("thread: ret %d\n", ret);
  	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..56f4f38
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,205 @@
+/*
+ *  Restart file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+#include "ckpt_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int n;
+
+	do {
+		n = cr_scan_fds(files, &fdtable);
+		if (n < 0)
+			return n;
+		while (n--)
+			sys_close(fdtable[n]);
+		kfree(fdtable);
+	} while (n != -1);
+
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int parent)
+{
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int fd, rparent, ret;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+	cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+		 rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_FILE:
+	case CR_FD_DIR:
+	case CR_FD_LINK:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		file = ERR_PTR(-EINVAL);
+		break;
+	}
+
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		filp_close(file, NULL);
+		return fd;
+	}
+
+	/* register new <objref, file> tuple in hash table */
+	ret = cr_obj_add_ref(ctx, (void *) file, parent, CR_OBJ_FILE, 0);
+	if (ret < 0)
+		goto out;
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return (ret < 0 ? ret : fd);
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @parent: parent objref
+ *
+ * Restore the state of a file descriptor; look up the objref (in the header)
+ * in the hash table, and if found pick the matching file pointer and use
+ * it; otherwise call cr_read_fd_data to restore the file pointer too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int parent)
+{
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int newfd, rparent;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	cr_debug("rparent %d parent %d ref %d\n", rparent, parent, hh->objref);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+	cr_debug("fd %d coe %d\n", hh->fd, hh->close_on_exec);
+	if (hh->objref <= 0)
+		return -EINVAL;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	if (file) {
+		newfd = cr_attach_file(file);
+		if (newfd < 0)
+			return newfd;
+		get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_fd_data(ctx, files, hh->objref);
+		if (newfd < 0)
+			return newfd;
+	}
+
+	cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't desired fd, use dup2() to relocate it */
+	if (newfd != hh->fd) {
+		int ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			return ret;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return 0;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files = current->files;
+	int n, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+	if (parent < 0)
+		return parent;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		return -EINVAL;
+#endif
+	cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+	if (hh->objref < 0 || hh->nfds < 0)
+		return -EINVAL;
+
+	if (hh->nfds > sysctl_nr_open)
+		return -EMFILE;
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(files);
+	if (ret < 0)
+		return ret;
+
+	for (n = 0; n < hh->nfds; n++) {
+		ret = cr_read_fd_ent(ctx, files, hh->objref);
+		if (ret < 0)
+			break;
+	}
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
index ea57fe6..3eb64a0 100644
--- a/include/linux/ckpt.h
+++ b/include/linux/ckpt.h
@@ -86,6 +86,7 @@ int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);

  int do_restart(struct cr_ctx *ctx);
  int cr_read_mm(struct cr_ctx *ctx);
+int cr_read_files(struct cr_ctx *ctx);

  #define cr_debug(fmt, args...)  \
  	pr_debug("[CR:%s] " fmt, __func__, ## args)
-- 
1.5.4.3



* Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04  8:02 ` [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-09-04  8:37   ` Cedric Le Goater
  2008-09-04 14:42   ` Serge E. Hallyn
  1 sibling, 0 replies; 43+ messages in thread
From: Cedric Le Goater @ 2008-09-04  8:37 UTC
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd

Oren Laadan wrote:
> Create trivial sys_checkpoint and sys_restart system calls. They will
> make it possible to checkpoint and restart an entire container, to and
> from a checkpoint image file descriptor.
> 
> The syscalls take a file descriptor (for the image file) and flags as
> arguments. For sys_checkpoint the first argument identifies the target
> container; for sys_restart it will identify the checkpoint image.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>   arch/x86/kernel/syscall_table_32.S |    2 ++
>   checkpoint/Kconfig                 |   11 +++++++++++
>   checkpoint/Makefile                |    5 +++++
>   checkpoint/sys.c                   |   35 +++++++++++++++++++++++++++++++++++
>   include/asm-x86/unistd_32.h        |    2 ++
>   include/linux/syscalls.h           |    2 ++
>   init/Kconfig                       |    2 ++
>   kernel/sys_ni.c                    |    4 ++++
>   8 files changed, 63 insertions(+), 0 deletions(-)
>   create mode 100644 checkpoint/Kconfig
>   create mode 100644 checkpoint/Makefile
>   create mode 100644 checkpoint/sys.c
> 
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index d44395f..5543136 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -332,3 +332,5 @@ ENTRY(sys_call_table)
>   	.long sys_dup3			/* 330 */
>   	.long sys_pipe2
>   	.long sys_inotify_init1

  ^
  |

There are some spaces at the beginning of this line, which makes
the patch fail to apply for me. This needs a fix for v4, I think.

Thanks for the patchset anyhow; I'll play with it.

C.

> +	.long sys_checkpoint
> +	.long sys_restart
> diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
> new file mode 100644
> index 0000000..a9f22ef
> --- /dev/null
> +++ b/checkpoint/Kconfig
> @@ -0,0 +1,11 @@
> +config CHECKPOINT_RESTART
> +	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
> +	def_bool y
> +	depends on X86_32 && EXPERIMENTAL
> +	help
> +	  Application checkpoint/restart is the ability to save the
> +	  state of a running application so that it can later resume
> +	  its execution from the time at which it was checkpointed.
> +
> +	  Turning this option on will enable checkpoint and restart
> +	  functionality in the kernel.
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> new file mode 100644
> index 0000000..07d018b
> --- /dev/null
> +++ b/checkpoint/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for linux checkpoint/restart.
> +#
> +
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> new file mode 100644
> index 0000000..b9018a4
> --- /dev/null
> +++ b/checkpoint/sys.c
> @@ -0,0 +1,35 @@
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +
> +/**
> + * sys_checkpoint - checkpoint a container
> + * @pid: pid of the container init(1) process
> + * @fd: file to which dump the checkpoint image
> + * @flags: checkpoint operation flags
> + */
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> +{
> +	pr_debug("sys_checkpoint not implemented yet\n");
> +	return -ENOSYS;
> +}
> +/**
> + * sys_restart - restart a container
> + * @crid: checkpoint image identifier
> + * @fd: file from which read the checkpoint image
> + * @flags: restart operation flags
> + */
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> +{
> +	pr_debug("sys_restart not implemented yet\n");
> +	return -ENOSYS;
> +}
> diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
> index d739467..88bdec4 100644
> --- a/include/asm-x86/unistd_32.h
> +++ b/include/asm-x86/unistd_32.h
> @@ -338,6 +338,8 @@
>   #define __NR_dup3		330
>   #define __NR_pipe2		331
>   #define __NR_inotify_init1	332
> +#define __NR_checkpoint		333
> +#define __NR_restart		334
> 
>   #ifdef __KERNEL__
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index d6ff145..edc218b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>   asmlinkage long sys_eventfd(unsigned int count);
>   asmlinkage long sys_eventfd2(unsigned int count, int flags);
>   asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
> 
>   int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
> 
> diff --git a/init/Kconfig b/init/Kconfig
> index c11da38..fd5f7bf 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -779,6 +779,8 @@ config MARKERS
> 
>   source "arch/Kconfig"
> 
> +source "checkpoint/Kconfig"
> +
>   config PROC_PAGE_MONITOR
>    	default y
>   	depends on PROC_FS && MMU
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 08d6e1b..ca95c25 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
>   cond_syscall(compat_sys_timerfd_gettime);
>   cond_syscall(sys_eventfd);
>   cond_syscall(sys_eventfd2);
> +
> +/* checkpoint/restart */
> +cond_syscall(sys_checkpoint);
> +cond_syscall(sys_restart);


* Re: [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-04  8:02 ` [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
@ 2008-09-04  9:12   ` Louis Rilling
  2008-09-04 16:00     ` Serge E. Hallyn
  2008-09-04 16:03   ` Serge E. Hallyn
  1 sibling, 1 reply; 43+ messages in thread
From: Louis Rilling @ 2008-09-04  9:12 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Thu, Sep 04, 2008 at 04:02:38AM -0400, Oren Laadan wrote:
>
> Add those interfaces, as well as helpers needed to easily manage the
> file format. The code is roughly broken out as follows:
>
> checkpoint/sys.c - user/kernel data transfer, as well as setup of the
> checkpoint/restart context (a per-checkpoint data structure for
> housekeeping)
>
> checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
>
> checkpoint/restart.c - input wrappers and basic restart handling
>
> Patches to add the per-architecture support as well as the actual
> work to do the memory checkpoint follow in subsequent patches.
>
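
A minimal userspace sketch of the record framing such helpers would manage,
assuming each record is a struct cr_hdr (quoted further down in this message)
followed by `len` bytes of payload. The names rec_hdr, rec_write() and
rec_read() are hypothetical and not part of the patch; the actual wrappers in
checkpoint/checkpoint.c and checkpoint/restart.c are not shown in this reply.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same field widths as struct cr_hdr in the patch: 8 bytes total. */
struct rec_hdr {
	int16_t type;
	int16_t len;		/* payload length in bytes */
	uint32_t parent;
};

/*
 * Append one record (header + payload) at buf.
 * Returns the number of bytes written, or -1 if it does not fit.
 */
long rec_write(unsigned char *buf, size_t cap, int16_t type,
	       uint32_t parent, const void *payload, int16_t len)
{
	struct rec_hdr h = { type, len, parent };

	if (len < 0 || cap < sizeof(h) + (size_t)len)
		return -1;
	memcpy(buf, &h, sizeof(h));
	if (len > 0)
		memcpy(buf + sizeof(h), payload, (size_t)len);
	return (long)(sizeof(h) + (size_t)len);
}

/*
 * Parse one record at buf. On success, stores the header in *h, points
 * *payload at the payload bytes and returns the bytes consumed.
 * Returns -1 if the buffer is too short for the header or the payload.
 */
long rec_read(const unsigned char *buf, size_t avail,
	      struct rec_hdr *h, const unsigned char **payload)
{
	if (avail < sizeof(*h))
		return -1;
	memcpy(h, buf, sizeof(*h));
	if (h->len < 0 || avail < sizeof(*h) + (size_t)h->len)
		return -1;
	*payload = buf + sizeof(*h);
	return (long)(sizeof(*h) + (size_t)h->len);
}
```

Writing a 5-byte string record (type 2, CR_HDR_STRING in the enum below) takes
13 bytes: 8 for the header and 5 for the payload. A reader can skip record
types it does not understand by advancing `len` bytes past the header.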

[...]

> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> new file mode 100644
> index 0000000..629ad5a
> --- /dev/null
> +++ b/include/linux/ckpt_hdr.h
> @@ -0,0 +1,82 @@
> +#ifndef _CHECKPOINT_CKPT_HDR_H_
> +#define _CHECKPOINT_CKPT_HDR_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/utsname.h>
> +
> +/*
> + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> + * keep data 64-bit aligned: use padding for structure members, and use
> + * __attribute__ ((aligned (8))) for the entire structure.
> + */
> +
> +/* records: generic header */
> +
> +struct cr_hdr {
> +	__s16 type;
> +	__s16 len;
> +	__u32 parent;
> +};
> +
> +/* header types */
> +enum {
> +	CR_HDR_HEAD = 1,
> +	CR_HDR_STRING,
> +
> +	CR_HDR_TASK = 101,
> +	CR_HDR_THREAD,
> +	CR_HDR_CPU,
> +
> +	CR_HDR_MM = 201,
> +	CR_HDR_VMA,
> +	CR_HDR_MM_CONTEXT,
> +
> +	CR_HDR_TAIL = 5001
> +};
> +
> +struct cr_hdr_head {
> +	__u64 magic;
> +
> +	__u16 major;
> +	__u16 minor;
> +	__u16 patch;
> +	__u16 rev;
> +
> +	__u64 time;	/* when checkpoint taken */
> +	__u64 flags;	/* checkpoint options */
> +
> +	char release[__NEW_UTS_LEN];
> +	char version[__NEW_UTS_LEN];
> +	char machine[__NEW_UTS_LEN];
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_tail {
> +	__u64 magic;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_task {
> +	__u64 state;
> +	__u32 exit_state;
> +	__u32 exit_code, exit_signal;

64-bit alignment issue?
I probably missed it in previous versions...

Louis
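
To make the concern concrete: the three __u32 members after `state` end at
offset 20, so the following __u64 lands at offset 20 on i386 (where 64-bit
members are only 4-byte aligned inside structs) and at offset 24 on x86-64.
A small userspace sketch, assuming GCC or Clang; the `pad` member is a
hypothetical fix for illustration, not something taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First members of struct cr_hdr_task as posted. */
struct task_as_posted {
	uint64_t state;
	uint32_t exit_state;
	uint32_t exit_code, exit_signal;
	uint64_t utime;		/* offset 20 on i386, 24 on x86-64 */
} __attribute__((aligned(8)));

/* Hypothetical fix: explicit padding pins utime at offset 24. */
struct task_padded {
	uint64_t state;
	uint32_t exit_state;
	uint32_t exit_code, exit_signal;
	uint32_t pad;		/* assumption: name and placement are illustrative */
	uint64_t utime;
} __attribute__((aligned(8)));

size_t posted_utime_offset(void)
{
	return offsetof(struct task_as_posted, utime);
}

size_t padded_utime_offset(void)
{
	return offsetof(struct task_padded, utime);
}
```

The aligned(8) attribute on the whole structure only affects where the
structure starts and its total size; it does not move individual members, so
it does not make the two layouts agree on its own.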

> +
> +	__u64 utime, stime, utimescaled, stimescaled;
> +	__u64 gtime;
> +	__u64 prev_utime, prev_stime;
> +	__u64 nvcsw, nivcsw;
> +	__u64 start_time_sec, start_time_nsec;
> +	__u64 real_start_time_sec, real_start_time_nsec;
> +	__u64 min_flt, maj_flt;
> +
> +	__s32 task_comm_len;
> +} __attribute__((aligned(8)));
> +
> +#endif /* _CHECKPOINT_CKPT_HDR_H_ */

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

* Re: [RFC v3][PATCH 7/9] Infrastructure for shared objects
  2008-09-04  8:05 ` [RFC v3][PATCH 7/9] Infrastructure for shared objects Oren Laadan
@ 2008-09-04  9:38   ` Louis Rilling
  2008-09-04 14:23     ` Oren Laadan
  2008-09-04 18:14   ` Dave Hansen
  1 sibling, 1 reply; 43+ messages in thread
From: Louis Rilling @ 2008-09-04  9:38 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Thu, Sep 04, 2008 at 04:05:22AM -0400, Oren Laadan wrote:
>
> Infrastructure to handle objects that may be shared and referenced by
> multiple tasks or other objects, e.g. open files, memory address space
> etc.
>
> The state of shared objects is saved once. On the first encounter, the
> state is dumped and the object is assigned a unique identifier and also
> stored in a hash table (indexed by its physical kernel address). From
> then on the object will be found in the hash and only its identifier is
> saved.
>
> On restart the identifier is looked up in the hash table; if not found
> then the state is read, the object is created, and added to the hash
> table (this time indexed by its identifier). Otherwise, the object in
> the hash table is used.
>
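
The checkpoint side of the scheme described above can be sketched in
userspace roughly as follows. This is an illustration only: objtable,
objtable_init() and objtable_lookup_or_add() are hypothetical names, the hash
is a toy, and reference counting on the stored objects (get_file()/fput() in
the patch) is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OBJ_NBITS	4
#define OBJ_NBUCKETS	(1U << OBJ_NBITS)	/* one chain head per hash value */
#define OBJ_MAX		64

struct objref {
	const void *ptr;	/* lookup key at checkpoint time */
	int id;			/* identifier written to the image */
	int next;		/* next entry in the same chain, or -1 */
};

struct objtable {
	int head[OBJ_NBUCKETS];
	struct objref pool[OBJ_MAX];
	int used;
	int next_id;
};

void objtable_init(struct objtable *t)
{
	unsigned int i;

	for (i = 0; i < OBJ_NBUCKETS; i++)
		t->head[i] = -1;
	t->used = 0;
	t->next_id = 1;		/* the patch also starts objref_index at 1 */
}

static unsigned int obj_bucket(const void *ptr)
{
	/* Toy hash: drop low alignment bits, keep OBJ_NBITS bits. */
	return (unsigned int)(((uintptr_t)ptr >> 3) & (OBJ_NBUCKETS - 1));
}

/*
 * Look ptr up and add it on first encounter. Sets *first to 1 when the
 * caller must dump the full state, 0 when only the id needs saving.
 * Returns the id, or -1 when the pool is full.
 */
int objtable_lookup_or_add(struct objtable *t, const void *ptr, int *first)
{
	unsigned int b = obj_bucket(ptr);
	int i;

	for (i = t->head[b]; i != -1; i = t->pool[i].next) {
		if (t->pool[i].ptr == ptr) {
			*first = 0;
			return t->pool[i].id;
		}
	}
	if (t->used == OBJ_MAX)
		return -1;
	i = t->used++;
	t->pool[i].ptr = ptr;
	t->pool[i].id = t->next_id++;
	t->pool[i].next = t->head[b];
	t->head[b] = i;
	*first = 1;
	return t->pool[i].id;
}
```

On restart the same table shape would be keyed by the identifier instead of
the pointer: a miss means "read the state and create the object", a hit means
"reuse the object already created".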

[...]

> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> new file mode 100644
> index 0000000..442b08c
> --- /dev/null
> +++ b/checkpoint/objhash.c
> @@ -0,0 +1,205 @@
> +/*
> + *  Checkpoint-restart - object hash infrastructure to manage shared objects
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/file.h>
> +#include <linux/hash.h>
> +#include <linux/ckpt.h>
> +
> +struct cr_objref {
> +	int objref;
> +	void *ptr;
> +	unsigned short type;
> +	unsigned short flags;
> +	struct hlist_node hash;
> +};
> +
> +struct cr_objhash {
> +	struct hlist_head *head;
> +	int objref_index;
> +};
> +
> +#define CR_OBJHASH_NBITS  10
> +#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS - 1)

Why -1? This makes a total number of 512 entries, which will break below with
hashes in range 0..1023.
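
For reference, the 512 comes from operator precedence: in C, binary '-' binds
tighter than '<<', so the posted macro shifts by 9, not 10. A small sketch
(macro and function names are made up for the illustration; the parsed form is
written with explicit parentheses to avoid compiler warnings):

```c
#include <assert.h>

#define NBITS 10

/* How C parses the posted "1UL << NBITS - 1". */
#define TOTAL_AS_PARSED	(1UL << (NBITS - 1))
/* Probably intended: one bucket per possible NBITS-bit hash value. */
#define TOTAL_BUCKETS	(1UL << NBITS)
/* Index mask, in case that is what the "- 1" was meant to build. */
#define BUCKET_MASK	((1UL << NBITS) - 1)

unsigned long total_as_parsed(void)
{
	return TOTAL_AS_PARSED;
}

unsigned long total_buckets(void)
{
	return TOTAL_BUCKETS;
}

unsigned long bucket_mask(void)
{
	return BUCKET_MASK;
}
```

These evaluate to 512, 1024 and 1023 respectively.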

> +
> +static void cr_obj_ref_drop(struct cr_objref *obj)
> +{
> +	switch (obj->type) {
> +	case CR_OBJ_FILE:
> +		fput((struct file *) obj->ptr);
> +		break;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static void cr_obj_ref_grab(struct cr_objref *obj)
> +{
> +	switch (obj->type) {
> +	case CR_OBJ_FILE:
> +		get_file((struct file *) obj->ptr);
> +		break;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static void cr_objhash_clear(struct cr_objhash *objhash)
> +{
> +	struct hlist_head *h = objhash->head;
> +	struct hlist_node *n, *t;
> +	struct cr_objref *obj;
> +	int i;
> +
> +	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
> +		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
> +			cr_obj_ref_drop(obj);
> +			kfree(obj);
> +		}
> +	}
> +}
> +
> +void cr_objhash_free(struct cr_ctx *ctx)
> +{
> +	struct cr_objhash *objhash = ctx->objhash;
> +
> +	if (objhash) {
> +		cr_objhash_clear(objhash);
> +		kfree(objhash->head);
> +		kfree(ctx->objhash);
> +		ctx->objhash = NULL;
> +	}
> +}
> +
> +int cr_objhash_alloc(struct cr_ctx *ctx)
> +{
> +	struct cr_objhash *objhash;
> +	struct hlist_head *head;
> +
> +	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
> +	if (!objhash)
> +		return -ENOMEM;
> +	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);

512 entries allocated

> +	if (!head) {
> +		kfree(objhash);
> +		return -ENOMEM;
> +	}
> +
> +	objhash->head = head;
> +	objhash->objref_index = 1;
> +
> +	ctx->objhash = objhash;
> +	return 0;
> +}
> +
> +static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
> +{
> +	struct hlist_head *h;
> +	struct hlist_node *n;
> +	struct cr_objref *obj;
> +
> +	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];

access to entries 0..1023

Louis
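
A quick userspace check of why an n-bit hash needs 2^n chain heads. toy_hash()
below is only a stand-in for hash_ptr(): it uses the same general idea
(multiply by a constant, keep the top bits) but is not the kernel
implementation, and the constant is an assumption chosen for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for hash_ptr(): multiply, then keep the top 'bits' bits.
 * Valid for 1 <= bits <= 63. Not the kernel's implementation.
 */
unsigned int toy_hash(uint64_t key, unsigned int bits)
{
	return (unsigned int)((key * 0x9E3779B97F4A7C15ULL) >> (64 - bits));
}

/* Largest bucket index produced for keys in [0, nkeys). */
unsigned int toy_hash_max(uint64_t nkeys, unsigned int bits)
{
	unsigned int max = 0, h;
	uint64_t k;

	for (k = 0; k < nkeys; k++) {
		h = toy_hash(k, bits);
		if (h > max)
			max = h;
	}
	return max;
}
```

With bits = 10 every result is below 1024, but results of 512 and above show
up right away, so an array of 512 heads is indexed out of bounds.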

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

* Re: [RFC v3][PATCH 8/9] File descriprtors (dump)
  2008-09-04  8:05 ` [RFC v3][PATCH 8/9] File descriprtors (dump) Oren Laadan
@ 2008-09-04  9:47   ` Louis Rilling
  2008-09-04 14:43     ` Oren Laadan
  2008-09-04 15:01   ` Dave Hansen
  2008-09-04 18:41   ` Dave Hansen
  2 siblings, 1 reply; 43+ messages in thread
From: Louis Rilling @ 2008-09-04  9:47 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, arnd, jeremy, linux-kernel, containers

On Thu, Sep 04, 2008 at 04:05:50AM -0400, Oren Laadan wrote:
>
> Dump the files_struct of a task with 'struct cr_hdr_files', followed by
> all open file descriptors. Since FDs can be shared, they are assigned a
> tag and registered in the object hash.
>
> For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
> and its close-on-exec property. If the FD is to be saved (first time)
> then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
> Then will come the next FD and so on.
>
> This patch only handles basic FDs - regular files, directories and also
> symbolic links.
>

[...]

> diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> new file mode 100644
> index 0000000..34df371
> --- /dev/null
> +++ b/checkpoint/ckpt_file.c
> @@ -0,0 +1,224 @@
> +/*
> + *  Checkpoint file descriptors
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +#include <linux/fdtable.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +#include "ckpt_file.h"
> +
> +#define CR_DEFAULT_FDTABLE  256
> +
> +/**
> + * cr_scan_fds - scan file table and construct array of open fds
> + * @files: files_struct pointer
> + * @fdtable: (output) array of open fds
> + * @return: the number of open fds found
> + *
> + * Allocates the file descriptors array (*fdtable), caller should free
> + */
> +int cr_scan_fds(struct files_struct *files, int **fdtable)
> +{
> +	struct fdtable *fdt;
> +	int *fdlist;
> +	int i, n, max;
> +
> +	max = CR_DEFAULT_FDTABLE;
> +
> + repeat:
> +	n = 0;
> +	fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
> +	if (!fdlist)
> +		return -ENOMEM;
> +
> +	spin_lock(&files->file_lock);
> +	fdt = files_fdtable(files);
> +	for (i = 0; i < fdt->max_fds; i++) {
> +		if (fcheck_files(files, i)) {
> +			if (n == max) {
> +				spin_unlock(&files->file_lock);
> +				kfree(fdlist);
> +				max *= 2;
> +				if (max < 0) {	/* overflow ? */
> +					n = -EMFILE;
> +					break;
> +				}
> +				goto repeat;

				fdlist = krealloc(fdlist, max * sizeof(*fdlist), GFP_KERNEL)?

Sorry, I should have suggested this in my first review.

Louis
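
A userspace sketch of the grow-in-place variant, with realloc() standing in
for krealloc(). Two things to keep in mind for the kernel version: the size
argument is in bytes, and a GFP_KERNEL allocation may sleep, so it cannot be
made while files->file_lock is held. The function name scan_open_slots() and
the byte-map input are assumptions made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define DEFAULT_CAP 256

/*
 * Collect every index i with open_map[i] != 0 into a heap array.
 * Returns the number of entries (caller frees *out), or a negative
 * errno value on failure.
 */
int scan_open_slots(const unsigned char *open_map, int nslots, int **out)
{
	int *list, *tmp;
	int i, n = 0, cap = DEFAULT_CAP;

	list = malloc(cap * sizeof(*list));
	if (!list)
		return -ENOMEM;

	for (i = 0; i < nslots; i++) {
		if (!open_map[i])
			continue;
		if (n == cap) {
			/* refuse to double past what int and size_t can hold */
			if (cap > INT_MAX / 2 / (int)sizeof(*list)) {
				free(list);
				return -EMFILE;
			}
			cap *= 2;
			tmp = realloc(list, cap * sizeof(*list));
			if (!tmp) {
				free(list);
				return -ENOMEM;
			}
			list = tmp;
		}
		list[n++] = i;
	}

	*out = list;
	return n;
}
```

Because the entries collected so far survive the realloc(), the scan continues
where it stopped instead of restarting from index 0. In the kernel the lock
would still have to be dropped around the allocation, which is where the
"is the table still the same?" question comes back.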

> +			}
> +			fdlist[n++] = i;
> +		}
> +	}
> +	spin_unlock(&files->file_lock);
> +
> +	*fdtable = fdlist;
> +	return n;
> +}
> +

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

* Re: [RFC v3][PATCH 7/9] Infrastructure for shared objects
  2008-09-04  9:38   ` Louis Rilling
@ 2008-09-04 14:23     ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04 14:23 UTC (permalink / raw)
  To: Louis.Rilling; +Cc: dave, arnd, jeremy, linux-kernel, containers



Louis Rilling wrote:
> On Thu, Sep 04, 2008 at 04:05:22AM -0400, Oren Laadan wrote:
>> Infrastructure to handle objects that may be shared and referenced by
>> multiple tasks or other objects, e.g. open files, memory address space
>> etc.
>>
>> The state of shared objects is saved once. On the first encounter, the
>> state is dumped and the object is assigned a unique identifier and also
>> stored in a hash table (indexed by its physical kernel address). From
>> then on the object will be found in the hash and only its identifier is
>> saved.
>>
>> On restart the identifier is looked up in the hash table; if not found
>> then the state is read, the object is created, and added to the hash
>> table (this time indexed by its identifier). Otherwise, the object in
>> the hash table is used.
>>
> 
> [...]
> 
>> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
>> new file mode 100644
>> index 0000000..442b08c
>> --- /dev/null
>> +++ b/checkpoint/objhash.c
>> @@ -0,0 +1,205 @@
>> +/*
>> + *  Checkpoint-restart - object hash infrastructure to manage shared objects
>> + *
>> + *  Copyright (C) 2008 Oren Laadan
>> + *
>> + *  This file is subject to the terms and conditions of the GNU General Public
>> + *  License.  See the file COPYING in the main directory of the Linux
>> + *  distribution for more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/file.h>
>> +#include <linux/hash.h>
>> +#include <linux/ckpt.h>
>> +
>> +struct cr_objref {
>> +	int objref;
>> +	void *ptr;
>> +	unsigned short type;
>> +	unsigned short flags;
>> +	struct hlist_node hash;
>> +};
>> +
>> +struct cr_objhash {
>> +	struct hlist_head *head;
>> +	int objref_index;
>> +};
>> +
>> +#define CR_OBJHASH_NBITS  10
>> +#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS - 1)
> 
> Why -1? This makes a total number of 512 entries, which will break below with
> hashes in range 0..1023.

Ugh !!!  Was fixed and tested, but sneaked back in :(
Thanks for spotting, will resend the patchset.

Oren.

> 
>> +
>> +static void cr_obj_ref_drop(struct cr_objref *obj)
>> +{
>> +	switch (obj->type) {
>> +	case CR_OBJ_FILE:
>> +		fput((struct file *) obj->ptr);
>> +		break;
>> +	default:
>> +		BUG();
>> +	}
>> +}
>> +
>> +static void cr_obj_ref_grab(struct cr_objref *obj)
>> +{
>> +	switch (obj->type) {
>> +	case CR_OBJ_FILE:
>> +		get_file((struct file *) obj->ptr);
>> +		break;
>> +	default:
>> +		BUG();
>> +	}
>> +}
>> +
>> +static void cr_objhash_clear(struct cr_objhash *objhash)
>> +{
>> +	struct hlist_head *h = objhash->head;
>> +	struct hlist_node *n, *t;
>> +	struct cr_objref *obj;
>> +	int i;
>> +
>> +	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
>> +		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
>> +			cr_obj_ref_drop(obj);
>> +			kfree(obj);
>> +		}
>> +	}
>> +}
>> +
>> +void cr_objhash_free(struct cr_ctx *ctx)
>> +{
>> +	struct cr_objhash *objhash = ctx->objhash;
>> +
>> +	if (objhash) {
>> +		cr_objhash_clear(objhash);
>> +		kfree(objhash->head);
>> +		kfree(ctx->objhash);
>> +		ctx->objhash = NULL;
>> +	}
>> +}
>> +
>> +int cr_objhash_alloc(struct cr_ctx *ctx)
>> +{
>> +	struct cr_objhash *objhash;
>> +	struct hlist_head *head;
>> +
>> +	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
>> +	if (!objhash)
>> +		return -ENOMEM;
>> +	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
> 
> 512 entries allocated
> 
>> +	if (!head) {
>> +		kfree(objhash);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	objhash->head = head;
>> +	objhash->objref_index = 1;
>> +
>> +	ctx->objhash = objhash;
>> +	return 0;
>> +}
>> +
>> +static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
>> +{
>> +	struct hlist_head *h;
>> +	struct hlist_node *n;
>> +	struct cr_objref *obj;
>> +
>> +	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
> 
> access to entries 0..1023
> 
> Louis
> 

* Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04  8:02 ` [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
  2008-09-04  8:37   ` Cedric Le Goater
@ 2008-09-04 14:42   ` Serge E. Hallyn
  2008-09-04 17:32     ` Oren Laadan
  2008-09-08 15:02     ` [Devel] " Andrey Mirkin
  1 sibling, 2 replies; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-04 14:42 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd, Andrey Mirkin

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> Create trivial sys_checkpoint and sys_restart system calls. They will
> enable checkpointing and restarting an entire container, to and from a
> checkpoint image file descriptor.
> 
> The syscalls take a file descriptor (for the image file) and flags as
> arguments. For sys_checkpoint the first argument identifies the target
> container; for sys_restart it will identify the checkpoint image.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
>   arch/x86/kernel/syscall_table_32.S |    2 ++
>   checkpoint/Kconfig                 |   11 +++++++++++
>   checkpoint/Makefile                |    5 +++++
>   checkpoint/sys.c                   |   35 +++++++++++++++++++++++++++++++++++
>   include/asm-x86/unistd_32.h        |    2 ++
>   include/linux/syscalls.h           |    2 ++
>   init/Kconfig                       |    2 ++
>   kernel/sys_ni.c                    |    4 ++++
>   8 files changed, 63 insertions(+), 0 deletions(-)
>   create mode 100644 checkpoint/Kconfig
>   create mode 100644 checkpoint/Makefile
>   create mode 100644 checkpoint/sys.c
> 
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index d44395f..5543136 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -332,3 +332,5 @@ ENTRY(sys_call_table)
>   	.long sys_dup3			/* 330 */
>   	.long sys_pipe2
>   	.long sys_inotify_init1
> +	.long sys_checkpoint
> +	.long sys_restart
> diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
> new file mode 100644
> index 0000000..a9f22ef
> --- /dev/null
> +++ b/checkpoint/Kconfig
> @@ -0,0 +1,11 @@
> +config CHECKPOINT_RESTART
> +	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
> +	def_bool y
> +	depends on X86_32 && EXPERIMENTAL
> +	help
> +	  Application checkpoint/restart is the ability to save the
> +	  state of a running application so that it can later resume
> +	  its execution from the time at which it was checkpointed.
> +
> +	  Turning this option on will enable checkpoint and restart
> +	  functionality in the kernel.
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> new file mode 100644
> index 0000000..07d018b
> --- /dev/null
> +++ b/checkpoint/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for linux checkpoint/restart.
> +#
> +
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> new file mode 100644
> index 0000000..b9018a4
> --- /dev/null
> +++ b/checkpoint/sys.c
> @@ -0,0 +1,35 @@
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +
> +/**
> + * sys_checkpoint - checkpoint a container
> + * @pid: pid of the container init(1) process
> + * @fd: file to which to dump the checkpoint image
> + * @flags: checkpoint operation flags
> + */
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> +{
> +	pr_debug("sys_checkpoint not implemented yet\n");
> +	return -ENOSYS;
> +}
> +/**
> + * sys_restart - restart a container
> + * @crid: checkpoint image identifier

So can we compare your api to Andrey's?

You've explained before that crid is used to tie together multiple
calls to checkpoint, but why do you have to specify it for restart?
Can't it just come from the fd?  Or, the fd will be passed in
seek()d to the right position for the data for this task, so the crid
won't be available there?

Andrey, how will the 'ctid' in your patchset be used?  It sounds
like it's actually going to set some integer id on the created
container?  We actually don't have container ids (or even
containers) right now, so we probably don't want that in our api,
right?

> + * @fd: file from which to read the checkpoint image
> + * @flags: restart operation flags
> + */
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> +{
> +	pr_debug("sys_restart not implemented yet\n");
> +	return -ENOSYS;
> +}
> diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
> index d739467..88bdec4 100644
> --- a/include/asm-x86/unistd_32.h
> +++ b/include/asm-x86/unistd_32.h
> @@ -338,6 +338,8 @@
>   #define __NR_dup3		330
>   #define __NR_pipe2		331
>   #define __NR_inotify_init1	332
> +#define __NR_checkpoint		333
> +#define __NR_restart		334
> 
>   #ifdef __KERNEL__
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index d6ff145..edc218b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>   asmlinkage long sys_eventfd(unsigned int count);
>   asmlinkage long sys_eventfd2(unsigned int count, int flags);
>   asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
> 
>   int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
> 
> diff --git a/init/Kconfig b/init/Kconfig
> index c11da38..fd5f7bf 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -779,6 +779,8 @@ config MARKERS
> 
>   source "arch/Kconfig"
> 
> +source "checkpoint/Kconfig"
> +
>   config PROC_PAGE_MONITOR
>    	default y
>   	depends on PROC_FS && MMU
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 08d6e1b..ca95c25 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
>   cond_syscall(compat_sys_timerfd_gettime);
>   cond_syscall(sys_eventfd);
>   cond_syscall(sys_eventfd2);
> +
> +/* checkpoint/restart */
> +cond_syscall(sys_checkpoint);
> +cond_syscall(sys_restart);
> -- 
> 1.5.4.3
> 

* Re: [RFC v3][PATCH 8/9] File descriprtors (dump)
  2008-09-04  9:47   ` Louis Rilling
@ 2008-09-04 14:43     ` Oren Laadan
  0 siblings, 0 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-04 14:43 UTC (permalink / raw)
  To: Louis.Rilling; +Cc: dave, arnd, jeremy, linux-kernel, containers



Louis Rilling wrote:
> On Thu, Sep 04, 2008 at 04:05:50AM -0400, Oren Laadan wrote:
>> Dump the files_struct of a task with 'struct cr_hdr_files', followed by
>> all open file descriptors. Since FDs can be shared, they are assigned a
>> tag and registered in the object hash.
>>
>> For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
>> and its close-on-exec property. If the FD is to be saved (first time)
>> then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
>> Then will come the next FD and so on.
>>
>> This patch only handles basic FDs - regular files, directories and also
>> symbolic links.
>>
> 
> [...]
> 
>> diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
>> new file mode 100644
>> index 0000000..34df371
>> --- /dev/null
>> +++ b/checkpoint/ckpt_file.c
>> @@ -0,0 +1,224 @@
>> +/*
>> + *  Checkpoint file descriptors
>> + *
>> + *  Copyright (C) 2008 Oren Laadan
>> + *
>> + *  This file is subject to the terms and conditions of the GNU General Public
>> + *  License.  See the file COPYING in the main directory of the Linux
>> + *  distribution for more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/sched.h>
>> +#include <linux/file.h>
>> +#include <linux/fdtable.h>
>> +#include <linux/ckpt.h>
>> +#include <linux/ckpt_hdr.h>
>> +
>> +#include "ckpt_file.h"
>> +
>> +#define CR_DEFAULT_FDTABLE  256
>> +
>> +/**
>> + * cr_scan_fds - scan file table and construct array of open fds
>> + * @files: files_struct pointer
>> + * @fdtable: (output) array of open fds
>> + * @return: the number of open fds found
>> + *
>> + * Allocates the file descriptors array (*fdtable), caller should free
>> + */
>> +int cr_scan_fds(struct files_struct *files, int **fdtable)
>> +{
>> +	struct fdtable *fdt;
>> +	int *fdlist;
>> +	int i, n, max;
>> +
>> +	max = CR_DEFAULT_FDTABLE;
>> +
>> + repeat:
>> +	n = 0;
>> +	fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
>> +	if (!fdlist)
>> +		return -ENOMEM;
>> +
>> +	spin_lock(&files->file_lock);
>> +	fdt = files_fdtable(files);
>> +	for (i = 0; i < fdt->max_fds; i++) {
>> +		if (fcheck_files(files, i)) {
>> +			if (n == max) {
>> +				spin_unlock(&files->file_lock);
>> +				kfree(fdlist);
>> +				max *= 2;
>> +				if (max < 0) {	/* overflow ? */
>> +					n = -EMFILE;
>> +					break;
>> +				}
>> +				goto repeat;
> 
> 				fdlist = krealloc(fdlist, max * sizeof(*fdlist), GFP_KERNEL)?
> 
> Sorry, I should have suggested this in my first review.

That's a good point; I did it this way to be paranoid, even though the
checkpointee is supposed to be frozen (e.g., if the checkpointee is
forcefully killed by, say, OOM, and its fdt->max_fds goes to zero). But
now I notice that fcheck_files() already tests for this. I'm not sure it
makes the code simpler, but I'll fix that.

Oren.

> 
> Louis
> 
>> +			}
>> +			fdlist[n++] = i;
>> +		}
>> +	}
>> +	spin_unlock(&files->file_lock);
>> +
>> +	*fdtable = fdlist;
>> +	return n;
>> +}
>> +
> 

* Re: [RFC v3][PATCH 8/9] File descriprtors (dump)
  2008-09-04  8:05 ` [RFC v3][PATCH 8/9] File descriprtors (dump) Oren Laadan
  2008-09-04  9:47   ` Louis Rilling
@ 2008-09-04 15:01   ` Dave Hansen
  2008-09-04 18:41   ` Dave Hansen
  2 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-04 15:01 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Thu, 2008-09-04 at 04:05 -0400, Oren Laadan wrote:
> 
> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> index 322ade5..1ce1dbc 100644
> --- a/include/linux/ckpt_hdr.h
> +++ b/include/linux/ckpt_hdr.h
> @@ -17,7 +17,7 @@
>   /*
>    * To maintain compatibility between 32-bit and 64-bit architecture flavors,
>    * keep data 64-bit aligned: use padding for structure members, and use
> - * __attribute__ ((aligned (8))) for the entire structure.
> + * __attribute__((aligned(8))) for the entire structure.
>    */

Have you tried emailing these to yourself and trying to apply them?
That's usually a good way to iron out these whitespace issues.

I think this set is still munged.

-- Dave


* Re: [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-04  9:12   ` Louis Rilling
@ 2008-09-04 16:00     ` Serge E. Hallyn
  0 siblings, 0 replies; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-04 16:00 UTC (permalink / raw)
  To: Louis Rilling; +Cc: Oren Laadan, containers, jeremy, linux-kernel, arnd, dave

Quoting Louis Rilling (Louis.Rilling@kerlabs.com):
> On Thu, Sep 04, 2008 at 04:02:38AM -0400, Oren Laadan wrote:
> >
> > Add those interfaces, as well as helpers needed to easily manage the
> > file format. The code is roughly broken out as follows:
> >
> > checkpoint/sys.c - user/kernel data transfer, as well as setup of the
> > checkpoint/restart context (a per-checkpoint data structure for
> > housekeeping)
> >
> > checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
> >
> > checkpoint/restart.c - input wrappers and basic restart handling
> >
> > Patches to add the per-architecture support as well as the actual
> > work to do the memory checkpoint follow in subsequent patches.
> >
> 
> [...]
> 
> > diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> > new file mode 100644
> > index 0000000..629ad5a
> > --- /dev/null
> > +++ b/include/linux/ckpt_hdr.h
> > @@ -0,0 +1,82 @@
> > +#ifndef _CHECKPOINT_CKPT_HDR_H_
> > +#define _CHECKPOINT_CKPT_HDR_H_
> > +/*
> > + *  Generic container checkpoint-restart
> > + *
> > + *  Copyright (C) 2008 Oren Laadan
> > + *
> > + *  This file is subject to the terms and conditions of the GNU General Public
> > + *  License.  See the file COPYING in the main directory of the Linux
> > + *  distribution for more details.
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/utsname.h>
> > +
> > +/*
> > + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> > + * keep data 64-bit aligned: use padding for structure members, and use
> > + * __attribute__ ((aligned (8))) for the entire structure.
> > + */
> > +
> > +/* records: generic header */
> > +
> > +struct cr_hdr {
> > +	__s16 type;
> > +	__s16 len;
> > +	__u32 parent;
> > +};
> > +
> > +/* header types */
> > +enum {
> > +	CR_HDR_HEAD = 1,
> > +	CR_HDR_STRING,
> > +
> > +	CR_HDR_TASK = 101,
> > +	CR_HDR_THREAD,
> > +	CR_HDR_CPU,
> > +
> > +	CR_HDR_MM = 201,
> > +	CR_HDR_VMA,
> > +	CR_HDR_MM_CONTEXT,
> > +
> > +	CR_HDR_TAIL = 5001
> > +};
> > +
> > +struct cr_hdr_head {
> > +	__u64 magic;
> > +
> > +	__u16 major;
> > +	__u16 minor;
> > +	__u16 patch;
> > +	__u16 rev;
> > +
> > +	__u64 time;	/* when checkpoint taken */
> > +	__u64 flags;	/* checkpoint options */
> > +
> > +	char release[__NEW_UTS_LEN];
> > +	char version[__NEW_UTS_LEN];
> > +	char machine[__NEW_UTS_LEN];
> > +} __attribute__((aligned(8)));
> > +
> > +struct cr_hdr_tail {
> > +	__u64 magic;
> > +} __attribute__((aligned(8)));
> > +
> > +struct cr_hdr_task {
> > +	__u64 state;
> > +	__u32 exit_state;
> > +	__u32 exit_code, exit_signal;
> 
> > 64-bit alignment issue?
> I probably missed it in previous versions...

In the first version it was followed by two __u16's (pid and tgid)...

-serge

* Re: [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-04  8:02 ` [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
  2008-09-04  9:12   ` Louis Rilling
@ 2008-09-04 16:03   ` Serge E. Hallyn
  2008-09-04 16:09     ` Dave Hansen
  1 sibling, 1 reply; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-04 16:03 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> Add those interfaces, as well as helpers needed to easily manage the
> file format. The code is roughly broken out as follows:
> 
> checkpoint/sys.c - user/kernel data transfer, as well as setup of the
> checkpoint/restart context (a per-checkpoint data structure for
> housekeeping)
> 
> checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
> 
> checkpoint/restart.c - input wrappers and basic restart handling
> 
> Patches to add the per-architecture support as well as the actual
> work to do the memory checkpoint follow in subsequent patches.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>

This really looks good to me - nothing particularly exotic, nice and
simple.

Dave, are you happy with the allocations here, or were you objecting
to cr_hbuf_get() and cr_hbuf_put()?

thanks,
-serge

> ---
>   Makefile                 |    2 +-
>   checkpoint/Makefile      |    2 +-
>   checkpoint/checkpoint.c  |  188 ++++++++++++++++++++++++++++++++++++++
>   checkpoint/restart.c     |  189 ++++++++++++++++++++++++++++++++++++++
>   checkpoint/sys.c         |  226 +++++++++++++++++++++++++++++++++++++++++++++-
>   include/linux/ckpt.h     |   65 +++++++++++++
>   include/linux/ckpt_hdr.h |   82 +++++++++++++++++
>   include/linux/magic.h    |    3 +
>   8 files changed, 751 insertions(+), 6 deletions(-)
>   create mode 100644 checkpoint/checkpoint.c
>   create mode 100644 checkpoint/restart.c
>   create mode 100644 include/linux/ckpt.h
>   create mode 100644 include/linux/ckpt_hdr.h
> 
> diff --git a/Makefile b/Makefile
> index f448e00..a558ad2 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -619,7 +619,7 @@ export mod_strip_cmd
> 
> 
>   ifeq ($(KBUILD_EXTMOD),)
> -core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> +core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
> 
>   vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
>   		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index 07d018b..d2df68c 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,4 +2,4 @@
>   # Makefile for linux checkpoint/restart.
>   #
> 
> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> new file mode 100644
> index 0000000..ad1099f
> --- /dev/null
> +++ b/checkpoint/checkpoint.c
> @@ -0,0 +1,188 @@
> +/*
> + *  Checkpoint logic and helpers
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/time.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/dcache.h>
> +#include <linux/mount.h>
> +#include <linux/utsname.h>
> +#include <linux/magic.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +/**
> + * cr_write_obj - write a record described by a cr_hdr
> + * @ctx: checkpoint context
> + * @h: record descriptor
> + * @buf: record buffer
> + */
> +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
> +{
> +	int ret;
> +
> +	ret = cr_kwrite(ctx, h, sizeof(*h));
> +	if (ret < 0)
> +		return ret;
> +	return cr_kwrite(ctx, buf, h->len);
> +}
> +
> +/**
> + * cr_write_string - write a string
> + * @ctx: checkpoint context
> + * @str: string pointer
> + * @len: string length
> + */
> +int cr_write_string(struct cr_ctx *ctx, char *str, int len)
> +{
> +	struct cr_hdr h;
> +
> +	h.type = CR_HDR_STRING;
> +	h.len = len;
> +	h.parent = 0;
> +
> +	return cr_write_obj(ctx, &h, str);
> +}
> +
> +/* write the checkpoint header */
> +static int cr_write_head(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct new_utsname *uts;
> +	struct timeval ktv;
> +	int ret;
> +
> +	h.type = CR_HDR_HEAD;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	do_gettimeofday(&ktv);
> +
> +	hh->magic = CHECKPOINT_MAGIC_HEAD;
> +	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
> +	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
> +	hh->patch = (LINUX_VERSION_CODE) & 0xff;
> +
> +	hh->rev = CR_VERSION;
> +
> +	hh->flags = ctx->flags;
> +	hh->time = ktv.tv_sec;
> +
> +	uts = utsname();
> +	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
> +	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
> +	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +/* write the checkpoint trailer */
> +static int cr_write_tail(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_TAIL;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	hh->magic = CHECKPOINT_MAGIC_TAIL;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +/* dump the task_struct of a given task */
> +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_TASK;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	hh->state = t->state;
> +	hh->exit_state = t->exit_state;
> +	hh->exit_code = t->exit_code;
> +	hh->exit_signal = t->exit_signal;
> +
> +	hh->utime = t->utime;
> +	hh->stime = t->stime;
> +	hh->utimescaled = t->utimescaled;
> +	hh->stimescaled = t->stimescaled;
> +	hh->gtime = t->gtime;
> +	hh->prev_utime = t->prev_utime;
> +	hh->prev_stime = t->prev_stime;
> +	hh->nvcsw = t->nvcsw;
> +	hh->nivcsw = t->nivcsw;
> +	hh->start_time_sec = t->start_time.tv_sec;
> +	hh->start_time_nsec = t->start_time.tv_nsec;
> +	hh->real_start_time_sec = t->real_start_time.tv_sec;
> +	hh->real_start_time_nsec = t->real_start_time.tv_nsec;
> +	hh->min_flt = t->min_flt;
> +	hh->maj_flt = t->maj_flt;
> +
> +	hh->task_comm_len = TASK_COMM_LEN;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
> +}
> +
> +/* dump the entire state of a given task */
> +static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	int ret ;
> +
> +	if (t->state == TASK_DEAD) {
> +		pr_warning("CR: task may not be in state TASK_DEAD\n");
> +		return -EAGAIN;
> +	}
> +
> +	ret = cr_write_task_struct(ctx, t);
> +	cr_debug("ret %d\n", ret);
> +
> +	return ret;
> +}
> +
> +int do_checkpoint(struct cr_ctx *ctx)
> +{
> +	int ret;
> +
> +	/* FIX: need to test whether container is checkpointable */
> +
> +	ret = cr_write_head(ctx);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_task(ctx, current);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_tail(ctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* on success, return (unique) checkpoint identifier */
> +	ret = ctx->crid;
> +
> + out:
> +	return ret;
> +}
> diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> new file mode 100644
> index 0000000..171cd2d
> --- /dev/null
> +++ b/checkpoint/restart.c
> @@ -0,0 +1,189 @@
> +/*
> + *  Restart logic and helpers
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/ckpt.h>
> +#include <linux/ckpt_hdr.h>
> +
> +/**
> + * cr_read_obj - read a whole record (cr_hdr followed by payload)
> + * @ctx: checkpoint context
> + * @h: record descriptor
> + * @buf: record buffer
> + * @n: available buffer size
> + *
> + * @return: size of payload
> + */
> +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
> +{
> +	int ret;
> +
> +	ret = cr_kread(ctx, h, sizeof(*h));
> +	if (ret < 0)
> +		return ret;
> +
> +	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
> +
> +	if (h->len < 0 || h->len > n)
> +		return -EINVAL;
> +
> +	return cr_kread(ctx, buf, h->len);
> +}
> +
> +/**
> + * cr_read_obj_type - read a whole record of expected type
> + * @ctx: checkpoint context
> + * @buf: record buffer
> + * @n: available buffer size
> + * @type: expected record type
> + *
> + * @return: object reference of the parent object
> + */
> +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
> +{
> +	struct cr_hdr h;
> +	int ret;
> +
> +	ret = cr_read_obj(ctx, &h, buf, n);
> +	if (!ret) {
> +		if (h.type == type)
> +			ret = h.parent;
> +		else
> +			ret = -EINVAL;
> +	}
> +	return ret;
> +}
> +
> +/**
> + * cr_read_string - read a string
> + * @ctx: checkpoint context
> + * @str: string buffer
> + * @len: string buffer length
> + */
> +int cr_read_string(struct cr_ctx *ctx, void *str, int len)
> +{
> +	return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
> +}
> +
> +/* read the checkpoint header */
> +static int cr_read_head(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int parent;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
> +	if (parent < 0)
> +		return parent;
> +	else if (parent != 0)
> +		return -EINVAL;
> +
> +	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
> +	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
> +	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
> +	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
> +		return -EINVAL;
> +
> +	if (hh->flags & ~CR_CTX_CKPT)
> +		return -EINVAL;
> +
> +	ctx->oflags = hh->flags;
> +
> +	/* FIX: verify compatibility of release, version and machine */
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return 0;
> +}
> +
> +/* read the checkpoint trailer */
> +static int cr_read_tail(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int parent;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
> +	if (parent < 0)
> +		return parent;
> +	else if (parent != 0)
> +		return -EINVAL;
> +
> +	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
> +		return -EINVAL;
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return 0;
> +}
> +
> +/* read the task_struct into the current task */
> +static int cr_read_task_struct(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	char *buf;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
> +	if (parent < 0)
> +		return parent;
> +	else if (parent != 0)
> +		return -EINVAL;
> +
> +	/* FIXME: for now, only restore t->comm */
> +
> +	/* upper limit for task_comm_len to prevent DoS */
> +	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
> +		return -EINVAL;
> +
> +	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +	ret = cr_read_string(ctx, buf, hh->task_comm_len);
> +	if (!ret) {
> +		/* if t->comm is too long, silently truncate */
> +		memset(t->comm, 0, TASK_COMM_LEN);
> +		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
> +	}
> +	kfree(buf);
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +/* read the entire state of the current task */
> +static int cr_read_task(struct cr_ctx *ctx)
> +{
> +	int ret;
> +
> +	ret = cr_read_task_struct(ctx);
> +	cr_debug("ret %d\n", ret);
> +
> +	return ret;
> +}
> +
> +int do_restart(struct cr_ctx *ctx)
> +{
> +	int ret;
> +
> +	ret = cr_read_head(ctx);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_task(ctx);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_tail(ctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* on success, adjust the return value if needed [TODO] */
> + out:
> +	return ret;
> +}
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index b9018a4..4268bae 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -10,6 +10,197 @@
> 
>   #include <linux/sched.h>
>   #include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/uaccess.h>
> +#include <linux/capability.h>
> +#include <linux/ckpt.h>
> +
> +/*
> + * helpers to write/read to/from the image file descriptor
> + *
> + *   cr_uwrite() - write a user-space buffer to the checkpoint image
> + *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
> + *   cr_uread() - read from the checkpoint image to a user-space buffer
> + *   cr_kread() - read from the checkpoint image to a kernel-space buffer
> + *
> + */
> +
> +/* (temporarily added file_pos_read() and file_pos_write() because they
> + * are static in fs/read_write.c... should cleanup and remove later) */
> +static inline loff_t file_pos_read(struct file *file)
> +{
> +	return file->f_pos;
> +}
> +
> +static inline void file_pos_write(struct file *file, loff_t pos)
> +{
> +	file->f_pos = pos;
> +}
> +
> +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> +	struct file *file = ctx->file;
> +	ssize_t nwrite;
> +	int nleft;
> +
> +	for (nleft = count; nleft; nleft -= nwrite) {
> +		loff_t pos = file_pos_read(file);
> +		nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
> +		file_pos_write(file, pos);
> +		if (nwrite <= 0) {
> +			if (nwrite == -EAGAIN)
> +				nwrite = 0;
> +			else
> +				return nwrite;
> +		}
> +		buf += nwrite;
> +	}
> +
> +	ctx->total += count;
> +	return 0;
> +}
> +
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> +	mm_segment_t oldfs;
> +	int ret;
> +
> +	oldfs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = cr_uwrite(ctx, buf, count);
> +	set_fs(oldfs);
> +
> +	return ret;
> +}
> +
> +int cr_uread(struct cr_ctx *ctx, void *buf, int count)
> +{
> +	struct file *file = ctx->file;
> +	ssize_t nread;
> +	int nleft;
> +
> +	for (nleft = count; nleft; nleft -= nread) {
> +		loff_t pos = file_pos_read(file);
> +		nread = vfs_read(file, (char __user *) buf, nleft, &pos);
> +		file_pos_write(file, pos);
> +		if (nread <= 0) {
> +			if (nread == -EAGAIN)
> +				nread = 0;
> +			else
> +				return nread;
> +		}
> +		buf += nread;
> +	}
> +
> +	ctx->total += count;
> +	return 0;
> +}
> +
> +int cr_kread(struct cr_ctx *ctx, void *buf, int count)
> +{
> +	mm_segment_t oldfs;
> +	int ret;
> +
> +	oldfs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = cr_uread(ctx, buf, count);
> +	set_fs(oldfs);
> +
> +	return ret;
> +}
> +
> +
> +/*
> + * helpers to manage CR contexts: allocated for each checkpoint and/or
> + * restart operation, and persists until the operation is completed.
> + */
> +
> +/* unique checkpoint identifier (FIXME: should be per-container) */
> +static atomic_t cr_ctx_count;
> +
> +void cr_ctx_free(struct cr_ctx *ctx)
> +{
> +
> +	if (ctx->file)
> +		fput(ctx->file);
> +	if (ctx->vfsroot)
> +		path_put(ctx->vfsroot);
> +
> +	free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
> +
> +	kfree(ctx);
> +}
> +
> +struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
> +{
> +	struct cr_ctx *ctx;
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ctx->file = fget(fd);
> +	if (!ctx->file) {
> +		cr_ctx_free(ctx);
> +		return ERR_PTR(-EBADF);
> +	}
> +	get_file(ctx->file);
> +
> +	ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
> +	if (!ctx->hbuf) {
> +		cr_ctx_free(ctx);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	ctx->pid = pid;
> +	ctx->flags = flags;
> +
> +	/* assume checkpointer is in container's root vfs */
> +	/* FIXME: this works for now, but will change with real containers */
> +	ctx->vfsroot = &current->fs->root;
> +	path_get(ctx->vfsroot);
> +
> +	ctx->crid = atomic_inc_return(&cr_ctx_count);
> +
> +	return ctx;
> +}
> +
> +/*
> + * During checkpoint and restart the code writes out/reads in data
> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
> + * Because operations can be nested, one should call cr_hbuf_get() to
> + * reserve space in the buffer, and then cr_hbuf_put() when that space
> + * is no longer needed.
> + */
> +
> +/**
> + * cr_hbuf_get - reserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to reserve
> + *
> + * @return: pointer to reserved space
> + */
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
> +{
> +	void *ptr;
> +
> +	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
> +	ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
> +	ctx->hpos += n;
> +	return ptr;
> +}
> +
> +/**
> + * cr_hbuf_put - unreserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to unreserve
> + */
> +void cr_hbuf_put(struct cr_ctx *ctx, int n)
> +{
> +	BUG_ON(ctx->hpos < n);
> +	ctx->hpos -= n;
> +}
> 
>   /**
>    * sys_checkpoint - checkpoint a container
> @@ -19,9 +210,23 @@
>    */
>   asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>   {
> -	pr_debug("sys_checkpoint not implemented yet\n");
> -	return -ENOSYS;
> +	struct cr_ctx *ctx;
> +	int ret;
> +
> +	/* no flags for now */
> +	if (flags)
> +		return -EINVAL;
> +
> +	ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	ret = do_checkpoint(ctx);
> +
> +	cr_ctx_free(ctx);
> +	return ret;
>   }
> +
>   /**
>    * sys_restart - restart a container
>    * @crid: checkpoint image identifier
> @@ -30,6 +235,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>    */
>   asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
>   {
> -	pr_debug("sys_restart not implemented yet\n");
> -	return -ENOSYS;
> +	struct cr_ctx *ctx;
> +	int ret;
> +
> +	/* no flags for now */
> +	if (flags)
> +		return -EINVAL;
> +
> +	ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	ret = do_restart(ctx);
> +
> +	cr_ctx_free(ctx);
> +	return ret;
>   }
> diff --git a/include/linux/ckpt.h b/include/linux/ckpt.h
> new file mode 100644
> index 0000000..1bb2b09
> --- /dev/null
> +++ b/include/linux/ckpt.h
> @@ -0,0 +1,65 @@
> +#ifndef _CHECKPOINT_CKPT_H_
> +#define _CHECKPOINT_CKPT_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/path.h>
> +#include <linux/fs.h>
> +
> +#define CR_VERSION  1
> +
> +struct cr_ctx {
> +	pid_t pid;		/* container identifier */
> +	int crid;		/* unique checkpoint id */
> +
> +	unsigned long flags;
> +	unsigned long oflags;	/* restart: old flags */
> +
> +	struct file *file;
> +	int total;		/* total read/written */
> +
> +	void *hbuf;		/* temporary buffer for headers */
> +	int hpos;		/* position in headers buffer */
> +
> +	struct path *vfsroot;	/* container root */
> +};
> +
> +/* cr_ctx: flags */
> +#define CR_CTX_CKPT	0x1
> +#define CR_CTX_RSTR	0x2
> +
> +/* allocation defaults */
> +#define CR_HBUF_ORDER  1
> +#define CR_HBUF_TOTAL  (PAGE_SIZE << CR_HBUF_ORDER)
> +
> +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
> +int cr_uread(struct cr_ctx *ctx, void *buf, int count);
> +int cr_kread(struct cr_ctx *ctx, void *buf, int count);
> +
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n);
> +void cr_hbuf_put(struct cr_ctx *ctx, int n);
> +
> +struct cr_hdr;
> +
> +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
> +int cr_write_string(struct cr_ctx *ctx, char *str, int len);
> +
> +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
> +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
> +int cr_read_string(struct cr_ctx *ctx, void *str, int len);
> +
> +int do_checkpoint(struct cr_ctx *ctx);
> +int do_restart(struct cr_ctx *ctx);
> +
> +#define cr_debug(fmt, args...)  \
> +	pr_debug("[CR:%s] " fmt, __func__, ## args)
> +
> +#endif /* _CHECKPOINT_CKPT_H_ */
> diff --git a/include/linux/ckpt_hdr.h b/include/linux/ckpt_hdr.h
> new file mode 100644
> index 0000000..629ad5a
> --- /dev/null
> +++ b/include/linux/ckpt_hdr.h
> @@ -0,0 +1,82 @@
> +#ifndef _CHECKPOINT_CKPT_HDR_H_
> +#define _CHECKPOINT_CKPT_HDR_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/utsname.h>
> +
> +/*
> + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> + * keep data 64-bit aligned: use padding for structure members, and use
> + * __attribute__ ((aligned (8))) for the entire structure.
> + */
> +
> +/* records: generic header */
> +
> +struct cr_hdr {
> +	__s16 type;
> +	__s16 len;
> +	__u32 parent;
> +};
> +
> +/* header types */
> +enum {
> +	CR_HDR_HEAD = 1,
> +	CR_HDR_STRING,
> +
> +	CR_HDR_TASK = 101,
> +	CR_HDR_THREAD,
> +	CR_HDR_CPU,
> +
> +	CR_HDR_MM = 201,
> +	CR_HDR_VMA,
> +	CR_HDR_MM_CONTEXT,
> +
> +	CR_HDR_TAIL = 5001
> +};
> +
> +struct cr_hdr_head {
> +	__u64 magic;
> +
> +	__u16 major;
> +	__u16 minor;
> +	__u16 patch;
> +	__u16 rev;
> +
> +	__u64 time;	/* when checkpoint taken */
> +	__u64 flags;	/* checkpoint options */
> +
> +	char release[__NEW_UTS_LEN];
> +	char version[__NEW_UTS_LEN];
> +	char machine[__NEW_UTS_LEN];
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_tail {
> +	__u64 magic;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_task {
> +	__u64 state;
> +	__u32 exit_state;
> +	__u32 exit_code, exit_signal;
> +
> +	__u64 utime, stime, utimescaled, stimescaled;
> +	__u64 gtime;
> +	__u64 prev_utime, prev_stime;
> +	__u64 nvcsw, nivcsw;
> +	__u64 start_time_sec, start_time_nsec;
> +	__u64 real_start_time_sec, real_start_time_nsec;
> +	__u64 min_flt, maj_flt;
> +
> +	__s32 task_comm_len;
> +} __attribute__((aligned(8)));
> +
> +#endif /* _CHECKPOINT_CKPT_HDR_H_ */
> diff --git a/include/linux/magic.h b/include/linux/magic.h
> index 1fa0c2c..c2b811c 100644
> --- a/include/linux/magic.h
> +++ b/include/linux/magic.h
> @@ -42,4 +42,7 @@
>   #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
>   #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
> 
> +#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
> +#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
> +
>   #endif /* __LINUX_MAGIC_H__ */
> -- 
> 1.5.4.3
> 

* Re: [RFC v3][PATCH 2/9] General infrastructure for checkpoint restart
  2008-09-04 16:03   ` Serge E. Hallyn
@ 2008-09-04 16:09     ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-04 16:09 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Oren Laadan, linux-kernel, containers, jeremy, arnd

On Thu, 2008-09-04 at 11:03 -0500, Serge E. Hallyn wrote:
> Dave, are you happy with the allocations here, or were you objecting
> to cr_hbuf_get() and cr_hbuf_put()?

I still don't think there's really enough justification as it stands,
but don't let me get in the way.  If it ends up being an issue, it's
pretty straightforward to rip them out or put back.  The code is very
nice that way.

-- Dave
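
For readers weighing the get/put pair, here is a minimal user-space sketch of the same stack-like reservation discipline. It is an illustration only: `struct hbuf`, `hbuf_get()`, `hbuf_put()` and `HBUF_TOTAL` are hypothetical stand-ins for `ctx->hbuf`, `cr_hbuf_get()`, `cr_hbuf_put()` and `CR_HBUF_TOTAL`, and it returns errors where the quoted patch uses BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 8192	/* stand-in for CR_HBUF_TOTAL; the value is arbitrary */

struct hbuf {
	char data[HBUF_TOTAL];
	size_t pos;		/* bytes currently reserved */
};

/* Reserve n bytes at the top of the buffer; NULL if they do not fit. */
void *hbuf_get(struct hbuf *hb, size_t n)
{
	void *ptr;

	if (n > HBUF_TOTAL - hb->pos)
		return NULL;
	ptr = hb->data + hb->pos;
	hb->pos += n;
	return ptr;
}

/* Release the n most recently reserved bytes; -1 if more than reserved. */
int hbuf_put(struct hbuf *hb, size_t n)
{
	if (hb->pos < n)
		return -1;
	hb->pos -= n;
	return 0;
}
```

Reservations nest like stack frames, so each put must mirror its get in reverse order. An early return between the two calls leaves the position unbalanced, which is the main cost of this scheme compared with a plain kmalloc()/kfree() pair.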


* Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04 14:42   ` Serge E. Hallyn
@ 2008-09-04 17:32     ` Oren Laadan
  2008-09-04 20:37       ` Serge E. Hallyn
  2008-09-08 15:02     ` [Devel] " Andrey Mirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-04 17:32 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: dave, containers, jeremy, linux-kernel, arnd, Andrey Mirkin



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@cs.columbia.edu):
>> Create trivial sys_checkpoint and sys_restart system calls. They will
>> enable checkpointing and restarting an entire container, to and from a
>> checkpoint image file descriptor.
>>
>> The syscalls take a file descriptor (for the image file) and flags as
>> arguments. For sys_checkpoint the first argument identifies the target
>> container; for sys_restart it will identify the checkpoint image.
>>
>> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
>> ---

[...]

>> +/**
>> + * sys_checkpoint - checkpoint a container
>> + * @pid: pid of the container init(1) process
>> + * @fd: file to which dump the checkpoint image
>> + * @flags: checkpoint operation flags
>> + */
>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>> +{
>> +	pr_debug("sys_checkpoint not implemented yet\n");
>> +	return -ENOSYS;
>> +}
>> +/**
>> + * sys_restart - restart a container
>> + * @crid: checkpoint image identifier
> 
> So can we compare your api to Andrey's?
> 
> You've explained before that crid is used to tie together multiple
> calls to checkpoint, but why do you have to specify it for restart?
> Can't it just come from the fd?  Or, the fd will be passed in
> seek()d to the right position for the data for this task, so the crid
> won't be available there?

I added the 'crid' to support a mode of operation in which we would
like the checkpoint data to remain in memory across multiple system
calls. Here are example scenarios:

1) We will want to reduce down time by first buffering the checkpoint
image in memory, then resuming the container, and only then writing
the data back to a (the) file descriptor.
So instead of:
  freeze -> checkpoint and write back -> unfreeze
We want:
  freeze -> checkpoint to buffer -> unfreeze -> write back
I envision each of these steps as a separate invocation of a syscall,
so the 'crid' returned by sys_checkpoint() at the 2nd step will be
used to identify that data in the 4th step. (Note that between the
unfreeze and the write-back, another checkpoint may already have been
taken.)

2) A task may want to take a checkpoint (e.g. of itself, or a whole
container) and keep that checkpoint in memory; at a later time it may
want to revert to that checkpoint. Moreover, it may keep multiple such
checkpoints (to where it may want to return). 'crid' tells sys_restart
which one to use.

Note that this 'crid' will in fact be tied to resources that are kept
by the kernel - e.g. references to COW pages (when we add that).
Louis suggested using a specialized FD instead of a numeric 'crid'
(that is: create an anonymous inode and a struct file that represent
that checkpoint in the kernel, and return an FD to it). This approach
has pros and cons compared to 'crid' (see the archives of the containers
mailing list). For now I kept 'crid', but I'm definitely open to changing
it to an FD.

Oren.
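
The two scenarios above can be modelled in user space as a small table of in-memory checkpoints keyed by 'crid'. This is only a sketch of the idea, not the kernel design: `ckpt_to_buffer()`, `ckpt_write_back()`, the fixed-size table and the use of malloc() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CKPT 8		/* arbitrary limit for this sketch */

struct ckpt {
	int crid;		/* 0 means the slot is free */
	char *data;		/* buffered checkpoint image */
	size_t len;
};

static struct ckpt table[MAX_CKPT];
static int next_crid = 1;

/* "checkpoint to buffer": copy the state into memory and return its
 * crid, or -1 if no slot or no memory is available. */
int ckpt_to_buffer(const void *state, size_t len)
{
	for (int i = 0; i < MAX_CKPT; i++) {
		if (table[i].crid != 0)
			continue;
		table[i].data = malloc(len ? len : 1);
		if (!table[i].data)
			return -1;
		memcpy(table[i].data, state, len);
		table[i].len = len;
		table[i].crid = next_crid++;
		return table[i].crid;
	}
	return -1;
}

/* "write back": copy the image identified by crid into out and release
 * it. Returns the image length, or -1 for an unknown crid or a buffer
 * that is too small. */
long ckpt_write_back(int crid, void *out, size_t outlen)
{
	for (int i = 0; i < MAX_CKPT; i++) {
		size_t len;

		if (crid <= 0 || table[i].crid != crid)
			continue;
		len = table[i].len;
		if (len > outlen)
			return -1;
		memcpy(out, table[i].data, len);
		free(table[i].data);
		table[i].data = NULL;
		table[i].crid = 0;
		return (long)len;
	}
	return -1;
}
```

Because each call to `ckpt_to_buffer()` returns a distinct id, a second checkpoint can be taken between the unfreeze and the write-back of the first, and each can be written back independently.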

> 
> Andrey, how will the 'ctid' in your patchset be used?  It sounds
> like it's actually going to set some integer id on the created
> container?  We actually don't have container ids (or even
> containers) right now, so we probably don't want that in our api,
> right?
> 
>> + * @fd: file from which read the checkpoint image
>> + * @flags: restart operation flags
>> + */
>> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
>> +{
>> +	pr_debug("sys_restart not implemented yet\n");
>> +	return -ENOSYS;
>> +}
>> diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
>> index d739467..88bdec4 100644
>> --- a/include/asm-x86/unistd_32.h
>> +++ b/include/asm-x86/unistd_32.h
>> @@ -338,6 +338,8 @@
>>   #define __NR_dup3		330
>>   #define __NR_pipe2		331
>>   #define __NR_inotify_init1	332
>> +#define __NR_checkpoint		333
>> +#define __NR_restart		334
>>
>>   #ifdef __KERNEL__
>>
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index d6ff145..edc218b 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>>   asmlinkage long sys_eventfd(unsigned int count);
>>   asmlinkage long sys_eventfd2(unsigned int count, int flags);
>>   asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
>> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
>>
>>   int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
>>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index c11da38..fd5f7bf 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -779,6 +779,8 @@ config MARKERS
>>
>>   source "arch/Kconfig"
>>
>> +source "checkpoint/Kconfig"
>> +
>>   config PROC_PAGE_MONITOR
>>    	default y
>>   	depends on PROC_FS && MMU
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 08d6e1b..ca95c25 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
>>   cond_syscall(compat_sys_timerfd_gettime);
>>   cond_syscall(sys_eventfd);
>>   cond_syscall(sys_eventfd2);
>> +
>> +/* checkpoint/restart */
>> +cond_syscall(sys_checkpoint);
>> +cond_syscall(sys_restart);
>> -- 
>> 1.5.4.3
>>

* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-04  8:04 ` [RFC v3][PATCH 5/9] Memory managemnet (restore) Oren Laadan
@ 2008-09-04 18:08   ` Dave Hansen
  2008-09-07  3:09     ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-04 18:08 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Thu, 2008-09-04 at 04:04 -0400, Oren Laadan wrote:
> +asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);

This needs to go into a header.

> +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int n, rparent;
> +
> +	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
> +	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
> +	if (rparent < 0)
> +		return rparent;
> +	if (rparent != parent)
> +		return -EINVAL;
> +
> +	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
> +		return -EINVAL;
> +
> +	/* to utilize the syscall modify_ldt() we first convert the data
> +	 * in the checkpoint image from 'struct desc_struct' to 'struct
> +	 * user_desc' with reverse logic of inclue/asm/desc.h:fill_ldt() */

Typo in the filename there ^^.

> +	for (n = 0; n < hh->nldt; n++) {
> +		struct user_desc info;
> +		struct desc_struct desc;
> +		mm_segment_t old_fs;
> +		int ret;
> +
> +		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
> +		if (ret < 0)
> +			return ret;
> +
> +		info.entry_number = n;
> +		info.base_addr = desc.base0 | (desc.base1 << 16);
> +		info.limit = desc.limit0;
> +		info.seg_32bit = desc.d;
> +		info.contents = desc.type >> 2;
> +		info.read_exec_only = (desc.type >> 1) ^ 1;
> +		info.limit_in_pages = desc.g;
> +		info.seg_not_present = desc.p ^ 1;
> +		info.useable = desc.avl;

Wouldn't it just be better to save the checkpoint image in the format
that the syscall takes in the first place?

> +		old_fs = get_fs();
> +		set_fs(get_ds());
> +		ret = sys_modify_ldt(1, &info, sizeof(info));
> +		set_fs(old_fs);
> +
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	load_LDT(&mm->context);
> +
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return 0;
> +}
...
> +int cr_read_fname(struct cr_ctx *ctx, void *fname, int flen)
> +{
> +	return cr_read_obj_type(ctx, fname, flen, CR_HDR_FNAME);
> +}
> +
> +/**
> + * cr_read_open_fname - read a file name and open a file
> + * @ctx: checkpoint context
> + * @flags: file flags
> + * @mode: file mode
> + */
> +struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
> +{
> +	struct file *file;
> +	char *fname;
> +	int flen, ret;
> +
> +	flen = PATH_MAX;
> +	fname = kmalloc(flen, GFP_KERNEL);
> +	if (!fname)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = cr_read_fname(ctx, fname, flen);
> +	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
> +	if (ret >= 0)
> +		file = filp_open(fname, flags, mode);
> +	else
> +		file = ERR_PTR(ret);
> +
> +	kfree(fname);
> +	return file;
> +}

This looks much better than what was being used before.  Nice!

> +/*
> + * Unlike checkpoint, restart is executed in the context of each restarting
> + * process: vma regions are restored via a call to mmap(), and the data is
> + * read in directly to the address space of the current process
> + */
> +
> +/**
> + * cr_vma_read_pages_addr - read addresses of pages to page-array chain
> + * @ctx - restart context
> + * @npages - number of pages
> + */
> +static int cr_vma_read_pages_addr(struct cr_ctx *ctx, int npages)
> +{
> +	struct cr_pgarr *pgarr;
> +	int nr, ret;
> +
> +	while (npages) {
> +		pgarr = cr_pgarr_prep(ctx);
> +		if (!pgarr)
> +			return -ENOMEM;
> +		nr = min(npages, (int) pgarr->nleft);

Do you find it any easier to read this as:

	nr = npages;
	if (nr > pgarr->nleft)
		nr = pgarr->nleft;

?

> +		ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long));
> +		if (ret < 0)
> +			return ret;
> +		pgarr->nleft -= nr;
> +		pgarr->nused += nr;
> +		npages -= nr;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * cr_vma_read_pages_data - read in data of pages in page-array chain
> + * @ctx - restart context
> + * @npages - number of pages
> + */
> +static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages)
> +{
> +	struct cr_pgarr *pgarr;
> +	unsigned long *addrs;
> +	int nr, ret;
> +
> +	for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) {
> +		addrs = pgarr->addrs;
> +		nr = pgarr->nused;
> +		npages -= nr;
> +		while (nr--) {
> +			ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE);

The void cast is unnecessary, right?

> +			if (ret < 0)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}

I'm having some difficulty parsing this function.  Could we
s/data/contents/ in the function name?  It also looks like addrs is
being used like an array here.  Can we use it explicitly that way?  I'd
also like to see it called vaddr or something explicit about what kinds
of addresses they are.
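
As a rough illustration of the renames and the explicit array indexing
suggested above, here is a minimal userspace sketch. It is not the
patch's code: struct cr_image and cr_uread_stub() are hypothetical
stand-ins for the checkpoint context and cr_uread(), SKETCH_PAGE_SIZE
stands in for PAGE_SIZE, and the 'vaddrs' field name is only the
proposed rename.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* stand-in for PAGE_SIZE */

/* simplified page-array: only the fields this loop needs */
struct cr_pgarr {
	unsigned long *vaddrs;	/* virtual addresses of the pages */
	struct cr_pgarr *next;
	int nused;		/* number of valid entries in vaddrs[] */
};

/* hypothetical stand-in for the image stream behind cr_uread() */
struct cr_image {
	const unsigned char *data;
	size_t pos, len;
};

/* hypothetical replacement for cr_uread(): copy one chunk from the image */
static int cr_uread_stub(struct cr_image *img, void *dst, size_t count)
{
	if (img->len - img->pos < count)
		return -1;
	memcpy(dst, img->data + img->pos, count);
	img->pos += count;
	return 0;
}

/* read the contents of every page listed in the page-array chain */
static int cr_vma_read_pages_contents(struct cr_image *img,
				      struct cr_pgarr *pgarr, int npages)
{
	int i, ret;

	for (; npages > 0 && pgarr; pgarr = pgarr->next) {
		for (i = 0; i < pgarr->nused; i++) {
			ret = cr_uread_stub(img, (void *) pgarr->vaddrs[i],
					    SKETCH_PAGE_SIZE);
			if (ret < 0)
				return ret;
		}
		npages -= pgarr->nused;
	}
	return 0;
}
```

The inner loop now indexes vaddrs[] directly instead of walking a
post-incremented pointer, which is the readability point being made.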

> +/* change the protection of an address range to be writable/non-writable.
> + * this is useful when restoring the memory of a read-only vma */
> +static int cr_vma_writable(struct mm_struct *mm, unsigned long start,
> +			   unsigned long end, int writable)

"cr_vma_writable" reads like a question to me.  This needs to be
"cr_vma_make_writable" or something to indicate that it is modifying the
vma.

> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long flags = 0;
> +	int ret = -EINVAL;
> +
> +	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
> +
> +	down_write(&mm->mmap_sem);
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start))
> +		goto out;

Kill the unlikely(), please.  It's unnecessary and tends to make things
slower when not used correctly.  Can you please check all the future
patches and make sure that you don't accidentally introduce these later?

> +	if (writable && !(vma->vm_flags & VM_WRITE))
> +		flags = vma->vm_flags | VM_WRITE;
> +	else if (!writable && (vma->vm_flags & VM_WRITE))
> +		flags = vma->vm_flags & ~VM_WRITE;
> +	cr_debug("flags %#lx\n", flags);
> +	if (flags)
> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
> +				     vma->vm_end, flags);

Is this to fixup the same things that setup_arg_pages() uses it for?  We
should probably consolidate those calls somehow.  

> + out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
> +/**
> + * cr_vma_read_pages - read in pages for to restore a vma
> + * @ctx - restart context
> + * @cr_vma - vma descriptor from restart
> + */
> +static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret = 0;
> +
> +	if (!cr_vma->nr_pages)
> +		return 0;

Looking at this code, I can now tell that we need to be more explicit
about what nr_pages is.  Is it the nr_pages that the vma spans,
contains, maps....??  Why do we need to check it here?

> +	/* in the unlikely case that this vma is read-only */
> +	if (!(cr_vma->vm_flags & VM_WRITE))
> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 1);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_vma_read_pages_addr(ctx, cr_vma->nr_pages);

The English here is a bit funky.  I think this needs to be
cr_vma_read_page_addrs().  The other way makes it sound like you're
reading the "page's addr", meaning a singular page.  Same for data.

> +	if (ret < 0)
> +		goto out;
> +	ret = cr_vma_read_pages_data(ctx, cr_vma->nr_pages);
> +	if (ret < 0)
> +		goto out;
> +
> +	cr_pgarr_release(ctx);	/* reset page-array chain */

Where did this sucker get allocated?  This is going to be a bit
difficult to audit since it isn't allocated and freed (or released) at
the same level.  Seems like it would be much nicer if it was allocated
at the beginning of this function.

> +	/* restore original protection for this vma */
> +	if (!(cr_vma->vm_flags & VM_WRITE))
> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
> +
> + out:
> +	return ret;
> +}

Ugh.  Is this a security hole?  What if the user was not allowed to
write to the file being mmap()'d by this VMA?  Is this a window where
someone could come in and (using ptrace or something similar) write to
the file?

We copy into the process address space all the time when not in its
context explicitly.  

> +/**
> + * cr_calc_map_prot_bits - convert vm_flags to mmap protection
> + * orig_vm_flags: source vm_flags
> + */
> +static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
> +{
> +	unsigned long vm_prot = 0;
> +
> +	if (orig_vm_flags & VM_READ)
> +		vm_prot |= PROT_READ;
> +	if (orig_vm_flags & VM_WRITE)
> +		vm_prot |= PROT_WRITE;
> +	if (orig_vm_flags & VM_EXEC)
> +		vm_prot |= PROT_EXEC;
> +	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
> +		vm_prot |= PROT_SEM;
> +
> +	return vm_prot;
> +}
> +
> +/**
> + * cr_calc_map_flags_bits - convert vm_flags to mmap flags
> + * orig_vm_flags: source vm_flags
> + */
> +static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
> +{
> +	unsigned long vm_flags = 0;
> +
> +	vm_flags = MAP_FIXED;
> +	if (orig_vm_flags & VM_GROWSDOWN)
> +		vm_flags |= MAP_GROWSDOWN;
> +	if (orig_vm_flags & VM_DENYWRITE)
> +		vm_flags |= MAP_DENYWRITE;
> +	if (orig_vm_flags & VM_EXECUTABLE)
> +		vm_flags |= MAP_EXECUTABLE;
> +	if (orig_vm_flags & VM_MAYSHARE)
> +		vm_flags |= MAP_SHARED;
> +	else
> +		vm_flags |= MAP_PRIVATE;
> +
> +	return vm_flags;
> +}
> +
> +static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
> +{
> +	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	unsigned long vm_size, vm_flags, vm_prot, vm_pgoff;
> +	unsigned long addr;
> +	unsigned long flags;
> +	struct file *file = NULL;
> +	int parent, ret = 0;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
> +	if (parent < 0)
> +		return parent;
> +	else if (parent != 0)
> +		return -EINVAL;
> +
> +	cr_debug("vma %#lx-%#lx type %d nr_pages %d\n",
> +		 (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
> +		 (int) hh->vma_type, (int) hh->nr_pages);
> +
> +	if (hh->vm_end < hh->vm_start || hh->nr_pages < 0)
> +		return -EINVAL;
> +
> +	vm_size = hh->vm_end - hh->vm_start;
> +	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
> +	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
> +	vm_pgoff = hh->vm_pgoff;
> +
> +	switch (hh->vma_type) {
> +
> +	case CR_VMA_ANON:		/* anonymous private mapping */
> +		/* vm_pgoff for anonymous mapping is the "global" page
> +		   offset (namely from addr 0x0), so we force a zero */
> +		vm_pgoff = 0;
> +		break;
> +
> +	case CR_VMA_FILE:		/* private mapping from a file */
> +		/* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
> +		flags = hh->vm_flags & (VM_WRITE | VM_SHARED);
> +		flags = (flags == (VM_WRITE | VM_SHARED) ? O_RDWR : O_RDONLY);

Man, that's hard to parse.  Could you break that up a little bit to make
it easier to read?
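
One possible way to break it up is a small helper, sketched below for
userspace. The helper name cr_vma_open_flags() is hypothetical, and the
VM_WRITE/VM_SHARED values are defined locally only so the sketch
compiles on its own; in the kernel they come from linux/mm.h.

```c
#include <fcntl.h>

/* local definitions for the sketch; the kernel takes these from mm.h */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL

/* O_RDWR is only needed when the mapping is both writable and shared */
static int cr_vma_open_flags(unsigned long vm_flags)
{
	int writable = (vm_flags & VM_WRITE) != 0;
	int shared = (vm_flags & VM_SHARED) != 0;

	if (writable && shared)
		return O_RDWR;
	return O_RDONLY;
}
```

The call site would then shrink to something like
"flags = cr_vma_open_flags(hh->vm_flags);".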

> +		file = cr_read_open_fname(ctx, flags, 0);
> +		if (IS_ERR(file))
> +			return PTR_ERR(file);
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +
> +	}
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap_pgoff(file, (unsigned long) hh->vm_start,
> +			     vm_size, vm_prot, vm_flags, vm_pgoff);

I'd probably just make a local vm_start to make these all consistent and
do all the ugly casting in one place.

> +	up_write(&mm->mmap_sem);
> +	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
> +		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
> +
> +	/* the file (if opened) is now referenced by the vma */
> +	if (file)
> +		filp_close(file, NULL);
> +
> +	if (IS_ERR((void *) addr))
> +		return PTR_ERR((void *) addr);
> +
> +	/*
> +	 * CR_VMA_ANON: read in memory as is
> +	 * CR_VMA_FILE: read in memory as is
> +	 * (more to follow ...)
> +	 */
> +
> +	switch (hh->vma_type) {
> +	case CR_VMA_ANON:
> +	case CR_VMA_FILE:
> +		/* standard case: read the data into the memory */
> +		ret = cr_vma_read_pages(ctx, hh);
> +		break;
> +	}
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (vm_prot & PROT_EXEC)
> +		flush_icache_range(hh->vm_start, hh->vm_end);

Why the heck is this here?  Isn't this a fresh mm?  We shouldn't have to
do this unless we had a VMA here previously.  Maybe it would be more
efficient to do this when tearing down the old vmas.

> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	cr_debug("vma retval %d\n", ret);
> +	return 0;
> +}
> +
> +static int cr_destroy_mm(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vmnext = mm->mmap;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	while (vmnext) {
> +		vma = vmnext;
> +		vmnext = vmnext->vm_next;
> +		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
> +		if (ret < 0) {
> +			pr_debug("CR: restart failed do_munmap (%d)\n", ret);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
> +int cr_read_mm(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct mm_struct *mm;
> +	int nr, parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
> +	if (parent < 0)
> +		return parent;
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(current))
> +		return -EINVAL;
> +#endif
> +	cr_debug("map_count %d\n", hh->map_count);
> +
> +	/* XXX need more sanity checks */
> +	if (hh->start_code > hh->end_code ||
> +	    hh->start_data > hh->end_data || hh->map_count < 0)
> +		return -EINVAL;
> +
> +	mm = current->mm;
> +
> +	/* point of no return -- destruct current mm */
> +	down_write(&mm->mmap_sem);
> +	ret = cr_destroy_mm(mm);

The other approach would be to do something more analogous to exec():
create the entire new mm and switch to it in the end. 

-- Dave



* Re: [RFC v3][PATCH 7/9] Infrastructure for shared objects
  2008-09-04  8:05 ` [RFC v3][PATCH 7/9] Infrastructure for shared objects Oren Laadan
  2008-09-04  9:38   ` Louis Rilling
@ 2008-09-04 18:14   ` Dave Hansen
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-04 18:14 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Thu, 2008-09-04 at 04:05 -0400, Oren Laadan wrote:
> +=== Shared resources (objects)
> +
> +Many resources used by tasks may be shared by more than one task (e.g.
> +file descriptors, memory address space, etc), or even have multiple
> +references from other resources (e.g. a single inode that represents
> +two ends of a pipe).
> +
> +Clearly, the state of shared objects need only be saved once, even if
> +they occur multiple times. We use a hash table (ctx->objhash) to keep
> +track of shared objects in the following manner.
> +
> +On the first encounter, the state is dumped and the object is assigned
> +a unique identifier and also stored in the hash table (indexed by its
> +physical kenrel address). From then on the object will be found in the
> +hash and only its identifier is saved.
> +
> +On restart the identifier is looked up in the hash table; if not found
> +then the state is read, the object is created, and added to the hash
> +table (this time indexed by its identifier). Otherwise, the object in
> +the hash table is used.
> +
> +The interface for the hash table is the following:
> +
> +int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
> +  [checkpoint] find the unique identifier - object reference (objref)
> +  - of the object that is pointer to by ptr (or 0 if not found).
> +
> +int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
> +                  unsigned short type, unsigned short flags);
> +  [checkpoint] add the object pointed to by ptr to the hash table if
> +  it isn't already there, and fill its unique identifier (objref); will
> +  return 0 if already found in the has, or 1 otherwise.
> +
> +void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type);
> +  [restart] return the pointer to the object whose unique identifier
> +  is equal to objref.
> +
> +int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
> +                  unsigned short type, unsigned short flags);
> +  [restart] add the object with unique identifier objref, pointed to by
> +  ptr to the hash table if it isn't already there; will return 0 if
> +  already found in the has, or 1 otherwise.

Once you get to the point of putting function prototypes in
Documentation/, it's probably a good time to start using kerneldocs. :)
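
For illustration, a kerneldoc-style version of two of the checkpoint
side helpers might look roughly like the sketch below. This is a
userspace toy, not the patch: struct cr_objtable is a hypothetical flat
table standing in for ctx->objhash, the lookup is a linear search rather
than a hash, the 'flags' argument is dropped, and the -1 return for a
full table is an addition of the sketch.

```c
#include <stddef.h>

#define CR_OBJ_MAX 64	/* arbitrary capacity for this sketch */

struct cr_obj {
	void *ptr;
	int objref;
	unsigned short type;
};

/* simplified stand-in for ctx->objhash: flat table, linear search */
struct cr_objtable {
	struct cr_obj objs[CR_OBJ_MAX];
	int count;
};

/**
 * cr_obj_get_by_ptr - find the objref of an already-registered object
 * @tbl: object table
 * @ptr: address of the object
 * @type: object type
 *
 * Return: the object's objref, or 0 if it is not in the table.
 */
static int cr_obj_get_by_ptr(struct cr_objtable *tbl, void *ptr,
			     unsigned short type)
{
	int i;

	for (i = 0; i < tbl->count; i++)
		if (tbl->objs[i].ptr == ptr && tbl->objs[i].type == type)
			return tbl->objs[i].objref;
	return 0;
}

/**
 * cr_obj_add_ptr - register an object unless it is already known
 * @tbl: object table
 * @ptr: address of the object
 * @objref: (output) unique identifier of the object
 * @type: object type
 *
 * Return: 1 if the object was added, 0 if it was already present,
 * -1 if the table is full.
 */
static int cr_obj_add_ptr(struct cr_objtable *tbl, void *ptr, int *objref,
			  unsigned short type)
{
	int ref = cr_obj_get_by_ptr(tbl, ptr, type);

	if (ref) {
		*objref = ref;
		return 0;
	}
	if (tbl->count == CR_OBJ_MAX)
		return -1;
	ref = tbl->count + 1;	/* objrefs start at 1; 0 means "not found" */
	tbl->objs[tbl->count].ptr = ptr;
	tbl->objs[tbl->count].objref = ref;
	tbl->objs[tbl->count].type = type;
	tbl->count++;
	*objref = ref;
	return 1;
}
```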

-- Dave



* Re: [RFC v3][PATCH 4/9] Memory management (dump)
  2008-09-04  8:03 ` [RFC v3][PATCH 4/9] Memory management (dump) Oren Laadan
@ 2008-09-04 18:25   ` Dave Hansen
  2008-09-07  1:54     ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-04 18:25 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Thu, 2008-09-04 at 04:03 -0400, Oren Laadan wrote:
> +/* free a chain of page-arrays */
> +void cr_pgarr_free(struct cr_ctx *ctx)
> +{
> +       struct cr_pgarr *pgarr, *pgnxt;
> +
> +       for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
> +               _cr_pgarr_release(ctx, pgarr);
> +               free_pages((unsigned long) ctx->pgarr->addrs, CR_PGARR_ORDER);
> +               free_pages((unsigned long) ctx->pgarr->pages, CR_PGARR_ORDER);
> +               pgnxt = pgarr->next;
> +               kfree(pgarr);
> +       }
> +}

What we effectively have here is:

void *addrs[CR_PGARR_TOTAL];
void *pages[CR_PGARR_TOTAL];

right?

Would any of this get simpler if we just had:

struct cr_page {
	struct page *page;
	unsigned long vaddr;
};

struct cr_pgarr {
       struct cr_page *cr_pages;
       struct cr_pgarr *next;
       unsigned short nleft;
       unsigned short nused;
};

Also, we do have lots of linked list implementations in the kernel.
They do lots of fun stuff like poisoning and checking for
initialization.  We should use them instead of rolling our own.  It lets
us do other fun stuff like list_for_each().

Also, just looking at this structure 'nleft' and 'nused' sound a bit
redundant.  I know from looking at the code that this is how many have
been filled and read back at restore time, but that is not very obvious
looking at the structure.  I think we can do a bit better in the
structure itself.

The length of the arrays is fixed at compile-time, right?  Should we
just make that explicit as well?  
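
Putting those suggestions together, one possible layout is sketched
below in userspace C. The capacity CR_PGARR_TOTAL, the single 'nr_used'
counter and the helper names are assumptions of the sketch, not
something taken from the patch; in the kernel the 'next' pointer would
presumably become a struct list_head.

```c
#include <stddef.h>

#define CR_PGARR_TOTAL 512	/* hypothetical fixed capacity per array */

struct page;			/* opaque here, as in most kernel users */

struct cr_page {
	struct page *page;
	unsigned long vaddr;
};

struct cr_pgarr {
	struct cr_pgarr *next;	/* a struct list_head in kernel code */
	unsigned int nr_used;	/* filled entries; free = TOTAL - nr_used */
	struct cr_page pages[CR_PGARR_TOTAL];	/* length explicit in the type */
};

/* the old 'nleft' becomes a derived value instead of a second counter */
static unsigned int cr_pgarr_nr_free(const struct cr_pgarr *pgarr)
{
	return CR_PGARR_TOTAL - pgarr->nr_used;
}

/* record one page; returns 0 on success, -1 if the array is full */
static int cr_pgarr_add(struct cr_pgarr *pgarr, struct page *page,
			unsigned long vaddr)
{
	if (pgarr->nr_used == CR_PGARR_TOTAL)
		return -1;
	pgarr->pages[pgarr->nr_used].page = page;
	pgarr->pages[pgarr->nr_used].vaddr = vaddr;
	pgarr->nr_used++;
	return 0;
}
```

With one counter there is no way for 'nleft' and 'nused' to drift out
of sync, and freeing an array is a single free of the structure.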

-- Dave



* Re: [RFC v3][PATCH 8/9] File descriprtors (dump)
  2008-09-04  8:05 ` [RFC v3][PATCH 8/9] File descriprtors (dump) Oren Laadan
  2008-09-04  9:47   ` Louis Rilling
  2008-09-04 15:01   ` Dave Hansen
@ 2008-09-04 18:41   ` Dave Hansen
  2008-09-07  4:52     ` Oren Laadan
  2 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-04 18:41 UTC (permalink / raw)
  To: Oren Laadan; +Cc: arnd, jeremy, linux-kernel, containers

On Thu, 2008-09-04 at 04:05 -0400, Oren Laadan wrote:
> +/**
> + * cr_scan_fds - scan file table and construct array of open fds
> + * @files: files_struct pointer
> + * @fdtable: (output) array of open fds
> + * @return: the number of open fds found
> + *
> + * Allocates the file descriptors array (*fdtable), caller should free
> + */
> +int cr_scan_fds(struct files_struct *files, int **fdtable)
> +{
> +	struct fdtable *fdt;
> +	int *fdlist;
> +	int i, n, max;
> +
> +	max = CR_DEFAULT_FDTABLE;
> +
> + repeat:
> +	n = 0;
> +	fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
> +	if (!fdlist)
> +		return -ENOMEM;
> +
> +	spin_lock(&files->file_lock);
> +	fdt = files_fdtable(files);
> +	for (i = 0; i < fdt->max_fds; i++) {
> +		if (fcheck_files(files, i)) {
> +			if (n == max) {
> +				spin_unlock(&files->file_lock);
> +				kfree(fdlist);
> +				max *= 2;
> +				if (max < 0) {	/* overflow ? */
> +					n = -EMFILE;
> +					break;
> +				}
> +				goto repeat;
> +			}
> +			fdlist[n++] = i;
> +		}
> +	}
> +	spin_unlock(&files->file_lock);
> +
> +	*fdtable = fdlist;
> +	return n;
> +}

That loop needs some love.  At least save us from one level of
indenting:

> +	for (i = 0; i < fdt->max_fds; i++) {
> +		if (!fcheck_files(files, i))
> 			continue;
> 		if (n == max) {
> +			spin_unlock(&files->file_lock);
> +			kfree(fdlist);
> +			max *= 2;
> +			if (max < 0) {	/* overflow ? */
> +				n = -EMFILE;
> +				break;
> +			}
> +			goto repeat;
> +		}
> +		fdlist[n++] = i;
> +	}

My gut also says that there has to be a better way to find a good size
for fdlist than growing it this way.

Why do we even have a fixed size for this?

+#define CR_DEFAULT_FDTABLE  256
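
One alternative to a fixed starting size is to count the open fds first
and allocate exactly that many entries. The userspace sketch below shows
the idea only; struct fake_fdtable and both helpers are hypothetical. In
the kernel the allocation cannot happen under files->file_lock, so the
table may change between the two passes and some retry would presumably
still be needed; the sketch ignores that.

```c
#include <stdlib.h>

/* stand-in for the fdtable: is_open[i] != 0 means fd i is open */
struct fake_fdtable {
	const unsigned char *is_open;
	int max_fds;
};

/* first pass: how many fds are open? */
static int cr_count_fds(const struct fake_fdtable *fdt)
{
	int i, n = 0;

	for (i = 0; i < fdt->max_fds; i++)
		if (fdt->is_open[i])
			n++;
	return n;
}

/*
 * second pass: fill an exactly-sized array.  Returns the number of open
 * fds and stores a malloc'ed array in *fdlist (caller frees), or -1 if
 * the allocation fails.
 */
static int cr_scan_fds_sketch(const struct fake_fdtable *fdt, int **fdlist)
{
	int i, n = 0, total = cr_count_fds(fdt);
	int *list;

	list = malloc((total ? total : 1) * sizeof(*list));
	if (!list)
		return -1;
	for (i = 0; i < fdt->max_fds && n < total; i++)
		if (fdt->is_open[i])
			list[n++] = i;
	*fdlist = list;
	return n;
}
```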

> +/* cr_write_fd_data - dump the state of a given file pointer */
> +static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct dentry *dent = file->f_dentry;
> +	struct inode *inode = dent->d_inode;
> +	enum fd_type fd_type;
> +	int ret;
> +
> +	h.type = CR_HDR_FD_DATA;
> +	h.len = sizeof(*hh);
> +	h.parent = parent;
> +
> +	BUG_ON(!inode);

Why a BUG_ON()?  We'll deref it in just a sec anyway.  We prefer to just
get the NULL dereference rather than an explicit BUG_ON().

> +	hh->f_flags = file->f_flags;
> +	hh->f_mode = file->f_mode;
> +	hh->f_pos = file->f_pos;
> +	hh->f_uid = file->f_uid;
> +	hh->f_gid = file->f_gid;

Is there a plan to save off the 'struct user' here instead?  Nested user
namespaces in one checkpoint image might get confused otherwise.

> +	hh->f_version = file->f_version;
> +	/* FIX: need also file->f_owner */
> +
> +	switch (inode->i_mode & S_IFMT) {
> +	case S_IFREG:
> +		fd_type = CR_FD_FILE;
> +		break;
> +	case S_IFDIR:
> +		fd_type = CR_FD_DIR;
> +		break;
> +	case S_IFLNK:
> +		fd_type = CR_FD_LINK;
> +		break;
> +	default:
> +		return -EBADF;
> +	}

Why don't we just store (and use) (inode->i_mode & S_IFMT) in fd_type
instead of making our own types?

> +	/* FIX: check if the file/dir/link is unlinked */
> +	hh->fd_type = fd_type;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
> +}
> +
> +/**
> + * cr_write_fd_ent - dump the state of a given file descriptor
> + * @ctx: checkpoint context
> + * @files: files_struct pointer
> + * @fd: file descriptor
> + *
> + * Save the state of the file descriptor; look up the actual file pointer
> + * in the hash table, and if found save the matching objref, otherwise call
> + * cr_write_fd_data to dump the file pointer too.
> + */
> +static int
> +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct file *file = NULL;
> +	struct fdtable *fdt;
> +	int coe, objref, ret;
> +
> +	/* make sure hh->fd (that is of type __u16) doesn't overflow */
> +	if (fd > USHORT_MAX) {
> +		pr_warning("CR: open files table too big (%d)\n", USHORT_MAX);
> +		return -EMFILE;
> +	}

Since the kernel always seems to make fds integers, it would make sense
to me to store them as integers in the checkpoint image.  Why bother to
shrink them down to a 16-bit type?

> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	file = fcheck_files(files, fd);
> +	if (file) {
> +		coe = FD_ISSET(fd, fdt->close_on_exec);
> +		get_file(file);
> +	}
> +	rcu_read_unlock();
> +
> +	/* sanity check (although this shouldn't happen) */
> +	if (!file)
> +		return -EBADF;
> +
> +	ret = cr_obj_add_ptr(ctx, (void *) file, &objref, CR_OBJ_FILE, 0);
> +	cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
> +
> +	if (ret >= 0) {
> +		int new = ret;
> +
> +		h.type = CR_HDR_FD_ENT;
> +		h.len = sizeof(*hh);
> +		h.parent = 0;
> +
> +		hh->objref = objref;
> +		hh->fd = fd;
> +		hh->close_on_exec = coe;
> +
> +		ret = cr_write_obj(ctx, &h, hh);
> +		cr_hbuf_put(ctx, sizeof(*hh));
> +		if (ret < 0)
> +			return ret;
> +
> +		/* new==1 if-and-only-if file was new and added to hash */
> +		if (new)
> +			ret = cr_write_fd_data(ctx, file, objref);
> +	}

This if() block is in the normal flow path of the function and should go
at the top indentation level.  You can just do this:

	  if (ret < 0)
		goto out;
  	  // if block contents here...

   out:
> +	fput(file);
> +	return ret;
> +}
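
Spelled out as a userspace sketch, the flattened control flow could look
like the code below. Everything here is a stand-in: struct fake_file and
the *_stub helpers replace struct file, get_file() and fput(), and the
add_ret/write_ret parameters simply inject the return values that
cr_obj_add_ptr() and cr_write_obj() would produce. Note that, as quoted,
the early "return ret" after cr_write_obj() appears to skip the fput();
routing every exit through one label avoids that.

```c
#include <stddef.h>

struct fake_file {
	int refcount;
};

static void get_file_stub(struct fake_file *f) { f->refcount++; }
static void fput_stub(struct fake_file *f) { f->refcount--; }

/*
 * add_ret mimics cr_obj_add_ptr(): <0 error, 0 already known, 1 new.
 * write_ret mimics cr_write_obj(): <0 error, otherwise success.
 * *wrote_data is set when the file state itself would have been dumped.
 */
static int cr_write_fd_ent_sketch(struct fake_file *file, int add_ret,
				  int write_ret, int *wrote_data)
{
	int ret, is_new;

	*wrote_data = 0;
	get_file_stub(file);

	ret = add_ret;
	if (ret < 0)
		goto out;
	is_new = ret;

	ret = write_ret;
	if (ret < 0)
		goto out;

	/* only a newly added file needs its state dumped */
	if (is_new)
		*wrote_data = 1;
	ret = 0;
 out:
	fput_stub(file);	/* single exit: the reference is always dropped */
	return ret;
}
```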
-- Dave



* Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04 17:32     ` Oren Laadan
@ 2008-09-04 20:37       ` Serge E. Hallyn
  2008-09-04 21:05         ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-04 20:37 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd, Andrey Mirkin

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl@cs.columbia.edu):
> >> Create trivial sys_checkpoint and sys_restore system calls. They will
> >> enable to checkpoint and restart an entire container, to and from a
> >> checkpoint image file descriptor.
> >>
> >> The syscalls take a file descriptor (for the image file) and flags as
> >> arguments. For sys_checkpoint the first argument identifies the target
> >> container; for sys_restart it will identify the checkpoint image.
> >>
> >> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> >> ---
> 
> [...]
> 
> >> +/**
> >> + * sys_checkpoint - checkpoint a container
> >> + * @pid: pid of the container init(1) process
> >> + * @fd: file to which dump the checkpoint image
> >> + * @flags: checkpoint operation flags
> >> + */
> >> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> >> +{
> >> +	pr_debug("sys_checkpoint not implemented yet\n");
> >> +	return -ENOSYS;
> >> +}
> >> +/**
> >> + * sys_restart - restart a container
> >> + * @crid: checkpoint image identifier
> > 
> > So can we compare your api to Andrey's?
> > 
> > You've explained before that crid is used to tie together multiple
> > calls to checkpoint, but why do you have to specify it for restart?
> > Can't it just come from the fd?  Or, the fd will be passed in
> > seek()d to the right position for the data for this task, so the crid
> > won't be available there?
> 
> I added the 'crid' inside to support a mode of operation in which we
> would like the checkpoint data to remain in memory across multiple
> system calls. Here are example scenarios:
> 
> 1) We will want to reduce down time by first buffering the checkpoint
> image in memory, then resuming the container, and only then writing
> the data back to a (the) file descriptor.
> So instead of:
>   freeze -> checkpoint and write back -> unfreeze
> We want:
>   freeze -> checkpoint to buffer -> unfreeze -> write back
> I envision each of these steps to be a separate invocation of a syscall.
> to the 'crid' returned by the sys_checkpoint() at the 2nd step, will be
> used to identify that data in the 4th step. (Note, that between the
> unfreeze and the write-back, another checkpoint may be already taken).
> 
> 2) A task may want to take a checkpoint (e.g. of itself, or a whole
> container) and keep that checkpoint in memory; at a later time it may
> want to revert to that checkpoint. Moreover, it may keep multiple such
> checkpoints (to where it may want to return). 'crid' tells sys_restart
> which one to use.
> 
> Note that this 'crid' will in fact be tied to resources that are kept
> by the kernel - e.g. references to COW pages (when we add that).
> Louis suggested to use a specialized FD instead of a numeric 'crid'
> (that is: create a anonymous inode and a struct file that represent
> that checkpoint in the kernel, and return an FD to it). This approach
> has pros and cons of 'crid' (see the archives of the containers
> mailing list). For now I kept 'crid', but I'm definitely open to change
> it to a FD.
> 
> Oren.

Oh, so the crid identifies one checkpoint inside the file - the single
file can store multiple checkpoints?

> > Andrey, how will the 'ctid' in your patchset be used?  It sounds
> > like it's actually going to set some integer id on the created
> > container?  We actually don't have container ids (or even
> > containers) right now, so we probably don't want that in our api,
> > right?

-serge


* Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04 20:37       ` Serge E. Hallyn
@ 2008-09-04 21:05         ` Oren Laadan
  2008-09-04 22:03           ` Serge E. Hallyn
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-04 21:05 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: dave, containers, jeremy, linux-kernel, arnd, Andrey Mirkin



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@cs.columbia.edu):
>>
>> Serge E. Hallyn wrote:
>>> Quoting Oren Laadan (orenl@cs.columbia.edu):
>>>> Create trivial sys_checkpoint and sys_restore system calls. They will
>>>> enable to checkpoint and restart an entire container, to and from a
>>>> checkpoint image file descriptor.
>>>>
>>>> The syscalls take a file descriptor (for the image file) and flags as
>>>> arguments. For sys_checkpoint the first argument identifies the target
>>>> container; for sys_restart it will identify the checkpoint image.
>>>>
>>>> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
>>>> ---
>> [...]
>>
>>>> +/**
>>>> + * sys_checkpoint - checkpoint a container
>>>> + * @pid: pid of the container init(1) process
>>>> + * @fd: file to which dump the checkpoint image
>>>> + * @flags: checkpoint operation flags
>>>> + */
>>>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>>>> +{
>>>> +	pr_debug("sys_checkpoint not implemented yet\n");
>>>> +	return -ENOSYS;
>>>> +}
>>>> +/**
>>>> + * sys_restart - restart a container
>>>> + * @crid: checkpoint image identifier
>>> So can we compare your api to Andrey's?
>>>
>>> You've explained before that crid is used to tie together multiple
>>> calls to checkpoint, but why do you have to specify it for restart?
>>> Can't it just come from the fd?  Or, the fd will be passed in
>>> seek()d to the right position for the data for this task, so the crid
>>> won't be available there?
>> I added the 'crid' inside to support a mode of operation in which we
>> would like the checkpoint data to remain in memory across multiple
>> system calls. Here are example scenarios:
>>
>> 1) We will want to reduce down time by first buffering the checkpoint
>> image in memory, then resuming the container, and only then writing
>> the data back to a (the) file descriptor.
>> So instead of:
>>   freeze -> checkpoint and write back -> unfreeze
>> We want:
>>   freeze -> checkpoint to buffer -> unfreeze -> write back
>> I envision each of these steps to be a separate invocation of a syscall.
>> to the 'crid' returned by the sys_checkpoint() at the 2nd step, will be
>> used to identify that data in the 4th step. (Note, that between the
>> unfreeze and the write-back, another checkpoint may be already taken).
>>
>> 2) A task may want to take a checkpoint (e.g. of itself, or a whole
>> container) and keep that checkpoint in memory; at a later time it may
>> want to revert to that checkpoint. Moreover, it may keep multiple such
>> checkpoints (to where it may want to return). 'crid' tells sys_restart
>> which one to use.
>>
>> Note that this 'crid' will in fact be tied to resources that are kept
>> by the kernel - e.g. references to COW pages (when we add that).
>> Louis suggested to use a specialized FD instead of a numeric 'crid'
>> (that is: create a anonymous inode and a struct file that represent
>> that checkpoint in the kernel, and return an FD to it). This approach
>> has pros and cons of 'crid' (see the archives of the containers
>> mailing list). For now I kept 'crid', but I'm definitely open to change
>> it to a FD.
>>
>> Oren.
> 
> Oh, so the crid identifies one checkpoint inside the file - the single
> file can store multiple checkpoints?

Not quite. Let me rephrase the motivation first:

There are occasions when we would like to keep the checkpoint data in the
kernel for some (relatively long) time, between syscalls. By "checkpoint
data" I mean references to memory contents (pages) and all the other data.

The two scenarios above are two examples: between the syscall to checkpoint
and the syscall to unfreeze and then write-back the data to a file (first
example), and for some time until a task may want to "go back in time"
(second example, useful for ultra fast "undo" for a task).

Note that in both cases when I say "keep in kernel" I mean before it is
written to a file, or to the network. Simply in memory, in some efficient
manner.

Subsequent syscalls will need to refer to a specific checkpoint data that
is kept in memory - e.g. to write-back to a file-descriptor, or to clean
up, or to restart from it. (At any single time a specific container may
have multiple checkpoints associated with it - e.g. because they have not
yet been written back to storage but already taken).

Once the data is written back to a file descriptor, the in-kernel data can
be discarded and cleaned-up.

The main reason why I want to keep the data in the kernel and not instead
copy to user space, is efficiency: most of the checkpoint data is the memory
footprint; by keeping the data in the kernel, one can merely keep a COW
reference instead of a whole copy of everything (save space and copy time).

So, if we have to keep data in the kernel between syscalls, then we must have a
way to refer to it. The current implementation uses a very simple 'crid'
value to do that - although, clearly, at the moment it isn't used.
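
To make the role of such an identifier concrete, here is a toy userspace
model of the "checkpoint to buffer, write back later" flow. It is not
how the patches implement anything: struct cr_registry and all three
functions are hypothetical, and the buffered state is a plain memcpy()
copy where the kernel would keep COW references instead.

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define CR_MAX_CHECKPOINTS 8	/* arbitrary limit for the sketch */

struct cr_saved {
	int crid;		/* 0 means the slot is free */
	unsigned char *data;
	size_t len;
};

struct cr_registry {
	struct cr_saved slots[CR_MAX_CHECKPOINTS];
	int next_crid;
};

/* buffer a checkpoint in memory; returns its crid, or -1 on failure */
static int cr_checkpoint_to_buffer(struct cr_registry *reg,
				   const void *state, size_t len)
{
	int i;

	for (i = 0; i < CR_MAX_CHECKPOINTS; i++) {
		struct cr_saved *s = &reg->slots[i];

		if (s->crid)
			continue;
		s->data = malloc(len ? len : 1);
		if (!s->data)
			return -1;
		memcpy(s->data, state, len);
		s->len = len;
		s->crid = ++reg->next_crid;
		return s->crid;
	}
	return -1;
}

/* find the buffered checkpoint that a crid refers to */
static struct cr_saved *cr_lookup(struct cr_registry *reg, int crid)
{
	int i;

	for (i = 0; i < CR_MAX_CHECKPOINTS; i++)
		if (crid > 0 && reg->slots[i].crid == crid)
			return &reg->slots[i];
	return NULL;
}

/* copy the buffered data out, then discard it; returns bytes or -1 */
static long cr_write_back(struct cr_registry *reg, int crid,
			  void *out, size_t outlen)
{
	struct cr_saved *s = cr_lookup(reg, crid);
	long n;

	if (!s || s->len > outlen)
		return -1;
	memcpy(out, s->data, s->len);
	n = (long) s->len;
	free(s->data);
	s->data = NULL;
	s->crid = 0;
	return n;
}
```

Two checkpoints taken back to back get distinct crids, so a later
write-back or restart can name the one it wants, which is the situation
described in the first scenario.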

I hope this explains better.

Oren.

> 
>>> Andrey, how will the 'ctid' in your patchset be used?  It sounds
>>> like it's actually going to set some integer id on the created
>>> container?  We actually don't have container ids (or even
>>> containers) right now, so we probably don't want that in our api,
>>> right?
> 
> -serge


* Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04 21:05         ` Oren Laadan
@ 2008-09-04 22:03           ` Serge E. Hallyn
  0 siblings, 0 replies; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-04 22:03 UTC (permalink / raw)
  To: Oren Laadan; +Cc: dave, containers, jeremy, linux-kernel, arnd, Andrey Mirkin

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl@cs.columbia.edu):
> >>
> >> Serge E. Hallyn wrote:
> >>> Quoting Oren Laadan (orenl@cs.columbia.edu):
> >>>> Create trivial sys_checkpoint and sys_restart system calls. They will
> >>>> make it possible to checkpoint and restart an entire container, to and
> >>>> from a checkpoint image file descriptor.
> >>>>
> >>>> The syscalls take a file descriptor (for the image file) and flags as
> >>>> arguments. For sys_checkpoint the first argument identifies the target
> >>>> container; for sys_restart it will identify the checkpoint image.
> >>>>
> >>>> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> >>>> ---
> >> [...]
> >>
> >>>> +/**
> >>>> + * sys_checkpoint - checkpoint a container
> >>>> + * @pid: pid of the container init(1) process
> >>>> + * @fd: file to which dump the checkpoint image
> >>>> + * @flags: checkpoint operation flags
> >>>> + */
> >>>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> >>>> +{
> >>>> +	pr_debug("sys_checkpoint not implemented yet\n");
> >>>> +	return -ENOSYS;
> >>>> +}
> >>>> +/**
> >>>> + * sys_restart - restart a container
> >>>> + * @crid: checkpoint image identifier
> >>> So can we compare your api to Andrey's?
> >>>
> >>> You've explained before that crid is used to tie together multiple
> >>> calls to checkpoint, but why do you have to specify it for restart?
> >>> Can't it just come from the fd?  Or, the fd will be passed in
> >>> seek()d to the right position for the data for this task, so the crid
> >>> won't be available there?
> >> I added the 'crid' inside to support a mode of operation in which we
> >> would like the checkpoint data to remain in memory across multiple
> >> system calls. Here are example scenarios:
> >>
> >> 1) We will want to reduce down time by first buffering the checkpoint
> >> image in memory, then resuming the container, and only then writing
> >> the data back to a (the) file descriptor.
> >> So instead of:
> >>   freeze -> checkpoint and write back -> unfreeze
> >> We want:
> >>   freeze -> checkpoint to buffer -> unfreeze -> write back
> >> I envision each of these steps to be a separate invocation of a syscall.
> >> The 'crid' returned by the sys_checkpoint() at the 2nd step will be
> >> used to identify that data in the 4th step. (Note, that between the
> >> unfreeze and the write-back, another checkpoint may be already taken).
> >>
> >> 2) A task may want to take a checkpoint (e.g. of itself, or a whole
> >> container) and keep that checkpoint in memory; at a later time it may
> >> want to revert to that checkpoint. Moreover, it may keep multiple such
> >> checkpoints (to where it may want to return). 'crid' tells sys_restart
> >> which one to use.
> >>
> >> Note that this 'crid' will in fact be tied to resources that are kept
> >> by the kernel - e.g. references to COW pages (when we add that).
> >> Louis suggested to use a specialized FD instead of a numeric 'crid'
> >> (that is: create an anonymous inode and a struct file that represent
> >> that checkpoint in the kernel, and return an FD to it). This approach
> >> has pros and cons compared to 'crid' (see the archives of the containers
> >> mailing list). For now I kept 'crid', but I'm definitely open to change
> >> it to a FD.
> >>
> >> Oren.
> > 
> > Oh, so the crid identifies one checkpoint inside the file - the single
> > file can store multiple checkpoints?
> 
> Not quite. Let me rephrase the motivation first:
> 
> There are occasions when we would like to keep the checkpoint data in the
> kernel for some (relatively long) time, between syscalls. By "checkpoint
> data" I mean references to memory contents (pages) and all the other data.
> 
> The two scenarios above are two examples: between the syscall to checkpoint
> and the syscall to unfreeze and then write-back the data to a file (first
> example), and for some time until a task may want to "go back in time"
> (second example, useful for ultra fast "undo" for a task).
> 
> Note that in both cases when I say "keep in kernel" I mean before it is
> written to a file, or to the network. Simply in memory, in some efficient
> manner.
> 
> Subsequent syscalls will need to refer to specific checkpoint data that
> is kept in memory - e.g. to write it back to a file descriptor, to clean
> it up, or to restart from it. (At any single time a specific container may
> have multiple checkpoints associated with it - e.g. because they have
> already been taken but not yet written back to storage).
> 
> Once the data is written back to a file descriptor, the in-kernel data can
> be discarded and cleaned-up.
> 
> The main reason why I want to keep the data in the kernel and not instead
> copy to user space, is efficiency: most of the checkpoint data is the memory
> footprint; by keeping the data in the kernel, one can merely keep a COW
> reference instead of a whole copy of everything (save space and copy time).
> 
> So, if we have to keep data in the kernel between syscalls, then we must have a
> way to refer to it. The current implementation uses a very simple 'crid'
> value to do that - although, clearly, at the moment it isn't used.
> 
> I hope this explains better.

Ah, ok.  So we're either using an fd or a crid.

Personally I'd then prefer two syscalls, which are wrappers around
a more flexible in-kernel api.  That way we can start with just
	sys_restart(int fd, long flags)
and add
	sys_restart_crid(int crid, long flags)
later.

-serge

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v3][PATCH 4/9] Memory management (dump)
  2008-09-04 18:25   ` Dave Hansen
@ 2008-09-07  1:54     ` Oren Laadan
  2008-09-08 15:55       ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-07  1:54 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Thu, 2008-09-04 at 04:03 -0400, Oren Laadan wrote:
>> +/* free a chain of page-arrays */
>> +void cr_pgarr_free(struct cr_ctx *ctx)
>> +{
>> +       struct cr_pgarr *pgarr, *pgnxt;
>> +
>> +       for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
>> +               _cr_pgarr_release(ctx, pgarr);
>> +               free_pages((unsigned long) ctx->pgarr->addrs, CR_PGARR_ORDER);
>> +               free_pages((unsigned long) ctx->pgarr->pages, CR_PGARR_ORDER);
>> +               pgnxt = pgarr->next;
>> +               kfree(pgarr);
>> +       }
>> +}
> 
> What we effectively have here is:
> 
> void *addrs[CR_PGARR_TOTAL];
> void *pages[CR_PGARR_TOTAL];
> 
> right?
> 
> Would any of this get simpler if we just had:
> 
> struct cr_page {
> 	struct page *page;
> 	unsigned long vaddr;
> };
> 
> struct cr_pgarr {
>        struct cr_page *cr_pages;
>        struct cr_pgarr *next;
>        unsigned short nleft;
>        unsigned short nused;
> };

The reason I use separate arrays instead of an array of tuples is that
the logic is to write all vaddr at once - simply by dumping the array
of vaddrs.
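
To make the trade-off concrete, here is a minimal user-space sketch,
assuming a tiny fixed array length and hypothetical names (pgarr_sketch,
pgarr_dump_addrs). With the addresses held in their own array, dumping
them is a single contiguous copy; with an array of (page, vaddr) tuples
the same step would need a loop or a gather.

```c
#include <stddef.h>
#include <string.h>

#define PGARR_SKETCH_LEN 4	/* tiny, for illustration only */

/* Separate arrays: addresses and page handles live side by side. */
struct pgarr_sketch {
	unsigned long addrs[PGARR_SKETCH_LEN];	/* virtual addresses */
	void *pages[PGARR_SKETCH_LEN];		/* matching page handles */
	unsigned short nused;			/* slots filled so far */
};

/*
 * Copy all used addresses into 'out' with one memcpy().
 * Returns the number of bytes written, or 0 if 'out' is too small.
 */
size_t pgarr_dump_addrs(const struct pgarr_sketch *pa, void *out, size_t outlen)
{
	size_t len = pa->nused * sizeof(pa->addrs[0]);

	if (len > outlen)
		return 0;
	memcpy(out, pa->addrs, len);
	return len;
}
```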

> 
> Also, we do have lots of linked list implementations in the kernel.
> They do lots of fun stuff like poisoning and checking for
> initialization.  We should use them instead of rolling our own.  It lets
> us do other fun stuff like list_for_each().
> 
> Also, just looking at this structure 'nleft' and 'nused' sound a bit
> redundant.  I know from looking at the code that this is how many have
> been filled and read back at restore time, but that is not very obvious
> looking at the structure.  I think we can do a bit better in the
> structure itself.
> 
> The length of the arrays is fixed at compile-time, right?  Should we
> just make that explicit as well?  

The length of the array may be tunable, or even adaptive (e.g. based
on statistics from recent checkpoints), in the future.

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-04 18:08   ` Dave Hansen
@ 2008-09-07  3:09     ` Oren Laadan
  2008-09-08 16:49       ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-07  3:09 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Thu, 2008-09-04 at 04:04 -0400, Oren Laadan wrote:
>> +asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
> 
> This needs to go into a header.
> 
>> +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
>> +{
>> +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +	int n, rparent;
>> +
>> +	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
>> +	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
>> +	if (rparent < 0)
>> +		return rparent;
>> +	if (rparent != parent)
>> +		return -EINVAL;
>> +
>> +	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
>> +		return -EINVAL;
>> +
>> +	/* to utilize the syscall modify_ldt() we first convert the data
>> +	 * in the checkpoint image from 'struct desc_struct' to 'struct
>> +	 * user_desc' with reverse logic of inclue/asm/desc.h:fill_ldt() */
> 
> Typo in the filename there ^^.
> 
>> +	for (n = 0; n < hh->nldt; n++) {
>> +		struct user_desc info;
>> +		struct desc_struct desc;
>> +		mm_segment_t old_fs;
>> +		int ret;
>> +
>> +		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
>> +		if (ret < 0)
>> +			return ret;
>> +
>> +		info.entry_number = n;
>> +		info.base_addr = desc.base0 | (desc.base1 << 16);
>> +		info.limit = desc.limit0;
>> +		info.seg_32bit = desc.d;
>> +		info.contents = desc.type >> 2;
>> +		info.read_exec_only = (desc.type >> 1) ^ 1;
>> +		info.limit_in_pages = desc.g;
>> +		info.seg_not_present = desc.p ^ 1;
>> +		info.useable = desc.avl;
> 
> Wouldn't it just be better to save the checkpoint image in the format
> that the syscall takes in the first place?

Because the syscall accepts a different format (user_desc) than it gives
back (desc_struct, which is the kernel format), conversion must occur,
either during checkpoint or during restart.
I prefer to convert in restart, for three reasons: checkpoint is more
performance critical; it allows us - in the future - to insert the data
directly, bypassing the syscall (like openvz); and we may need conversions
anyway in restart, e.g. to restart a 32 bit app on a 64 bit kernel.
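
For reference, the conversion in question is the inverse of fill_ldt().
The sketch below decodes a raw 8-byte x86 segment descriptor (taken as
one little-endian 64-bit value) into user_desc-style fields using plain
shifts and masks. The struct user_desc_sketch and the function name are
stand-ins so the example compiles outside the kernel; unlike the quoted
loop, the sketch also folds in the upper base and limit bits and masks
the type bits explicitly, which is this sketch's own choice.

```c
#include <stdint.h>

/* Mirrors the fields of struct user_desc; redeclared for a stand-alone build. */
struct user_desc_sketch {
	unsigned int entry_number;
	unsigned int base_addr;
	unsigned int limit;
	unsigned int seg_32bit:1;
	unsigned int contents:2;
	unsigned int read_exec_only:1;
	unsigned int limit_in_pages:1;
	unsigned int seg_not_present:1;
	unsigned int useable:1;
};

/*
 * Descriptor bit layout: 0-15 limit[15:0], 16-39 base[23:0], 40-43 type,
 * 44 s, 45-46 dpl, 47 p, 48-51 limit[19:16], 52 avl, 53 l, 54 d, 55 g,
 * 56-63 base[31:24].
 */
struct user_desc_sketch desc_to_user_desc(uint64_t raw, unsigned int entry)
{
	struct user_desc_sketch info = {0};
	unsigned int type = (raw >> 40) & 0xf;

	info.entry_number = entry;
	info.base_addr = ((raw >> 16) & 0xffffff) | (((raw >> 56) & 0xff) << 24);
	info.limit = (raw & 0xffff) | (((raw >> 48) & 0xf) << 16);
	info.seg_32bit = (raw >> 54) & 1;
	info.contents = (type >> 2) & 3;
	info.read_exec_only = ((type >> 1) & 1) ^ 1;	/* type bit 1 set means writable */
	info.limit_in_pages = (raw >> 55) & 1;
	info.seg_not_present = ((raw >> 47) & 1) ^ 1;
	info.useable = (raw >> 52) & 1;
	return info;
}
```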

> 

[...]

>> +static int cr_vma_read_pages_addr(struct cr_ctx *ctx, int npages)
>> +{
>> +	struct cr_pgarr *pgarr;
>> +	int nr, ret;
>> +
>> +	while (npages) {
>> +		pgarr = cr_pgarr_prep(ctx);
>> +		if (!pgarr)
>> +			return -ENOMEM;
>> +		nr = min(npages, (int) pgarr->nleft);
> 
> Do you find it any easier to read this as:
> 
> 	nr = npages;
> 	if (nr > pgarr->nleft)
> 		nr = pgarr->nleft;
> 
> ?

Just shorter.

>> +		ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long));
>> +		if (ret < 0)
>> +			return ret;
>> +		pgarr->nleft -= nr;
>> +		pgarr->nused += nr;
>> +		npages -= nr;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/**
>> + * cr_vma_read_pages_data - read in data of pages in page-array chain
>> + * @ctx - restart context
>> + * @npages - number of pages
>> + */
>> +static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages)
>> +{
>> +	struct cr_pgarr *pgarr;
>> +	unsigned long *addrs;
>> +	int nr, ret;
>> +
>> +	for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) {
>> +		addrs = pgarr->addrs;
>> +		nr = pgarr->nused;
>> +		npages -= nr;
>> +		while (nr--) {
>> +			ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE);
> 
> The void cast is unnecessary, right?

It is necessary: cr_uread() expects "void *", while "*addrs" is "unsigned long".

> 
>> +			if (ret < 0)
>> +				return ret;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
> 
> I'm having some difficulty parsing this function.  Could we
> s/data/contents/ in the function name?  It also looks like addrs is
> being used like an array here.  Can we use it explicitly that way?  I'd
> also like to see it called vaddr or something explicit about what kinds
> of addresses they are.
> 
>> +/* change the protection of an address range to be writable/non-writable.
>> + * this is useful when restoring the memory of a read-only vma */
>> +static int cr_vma_writable(struct mm_struct *mm, unsigned long start,
>> +			   unsigned long end, int writable)
> 
> "cr_vma_writable" is a question to me.  This needs to be
> "cr_vma_make_writable" or something to indicate that it is modifying the
> vma.
> 
>> +{
>> +	struct vm_area_struct *vma, *prev;
>> +	unsigned long flags = 0;
>> +	int ret = -EINVAL;
>> +
>> +	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
>> +
>> +	down_write(&mm->mmap_sem);
>> +	vma = find_vma_prev(mm, start, &prev);
>> +	if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start))
>> +		goto out;
> 
> Kill the unlikely(), please.  It's unnecessary and tends to make things
> slower when not used correctly.  Can you please check all the future
> patches and make sure that you don't accidentally introduce these later?
> 
>> +	if (writable && !(vma->vm_flags & VM_WRITE))
>> +		flags = vma->vm_flags | VM_WRITE;
>> +	else if (!writable && (vma->vm_flags & VM_WRITE))
>> +		flags = vma->vm_flags & ~VM_WRITE;
>> +	cr_debug("flags %#lx\n", flags);
>> +	if (flags)
>> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
>> +				     vma->vm_end, flags);
> 
> Is this to fixup the same things that setup_arg_pages() uses it for?  We
> should probably consolidate those calls somehow.  

This is needed for a VMA that has modified pages but is read-only: e.g. a
task modifies memory then calls mprotect() to make it read-only.
Since during restart we will recreate this VMA as read-only, we need to
temporarily make it read-write to be able to read the saved contents into
it, and then restore the read-only protection.
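
A user-space analogue of this sequence, as a minimal sketch: mprotect(2)
stands in for the in-kernel mprotect_fixup() call, and the function name
fill_readonly_region is made up for the example. The region starts out
read-only, write permission is lifted only for the duration of the copy,
and the original protection is put back afterwards.

```c
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS with strict -std= modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Fill a read-only, page-aligned region with saved contents.
 * Returns 0 on success, -1 if either protection change fails.
 */
int fill_readonly_region(void *start, size_t len, const void *src)
{
	/* temporarily make the region writable */
	if (mprotect(start, len, PROT_READ | PROT_WRITE) < 0)
		return -1;

	memcpy(start, src, len);

	/* restore the original read-only protection */
	return mprotect(start, len, PROT_READ);
}
```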

setup_arg_pages() is unrelated, and the code doesn't seem to have much in
common.

> 
>> + out:
>> +	up_write(&mm->mmap_sem);
>> +	return ret;
>> +}
>> +
>> +/**
>> + * cr_vma_read_pages - read in pages to restore a vma
>> + * @ctx - restart context
>> + * @cr_vma - vma descriptor from restart
>> + */
>> +static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	int ret = 0;
>> +
>> +	if (!cr_vma->nr_pages)
>> +		return 0;
> 
> Looking at this code, I can now tell that we need to be more explicit
> about what nr_pages is.  Is it the nr_pages that the vma spans,
> contains, maps....??  Why do we need to check it here?

nr_pages is the number of pages _saved_ for this VMA, that is, the number
of pages to follow to read in. If it's zero, we don't read anything (the
VMA had no dirty pages). Otherwise, we need to read that many addresses
followed by that many pages.
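
The layout described here (nr_pages addresses, then nr_pages pages) can
be sketched as a small user-space accessor. This is only an illustration
of the ordering: the function name, the fixed 4096-byte page size and the
flat in-memory buffer are assumptions, and the real image also carries
headers that are left out.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for the example */

/*
 * 'buf' holds nr_pages addresses followed by nr_pages pages of data.
 * Fetch the i-th address and a pointer to the i-th page.
 * Returns 0 on success, -1 if 'i' is out of range or 'buf' is too short.
 */
int vma_record_get(const unsigned char *buf, size_t buflen, size_t nr_pages,
		   size_t i, unsigned long *vaddr, const unsigned char **page)
{
	size_t addrs_len = nr_pages * sizeof(unsigned long);
	size_t total = addrs_len + nr_pages * SKETCH_PAGE_SIZE;

	if (i >= nr_pages || total > buflen)
		return -1;

	/* addresses come first, packed back to back */
	memcpy(vaddr, buf + i * sizeof(unsigned long), sizeof(*vaddr));
	/* page contents follow the whole address array */
	*page = buf + addrs_len + i * SKETCH_PAGE_SIZE;
	return 0;
}
```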

> 
>> +	/* in the unlikely case that this vma is read-only */
>> +	if (!(cr_vma->vm_flags & VM_WRITE))
>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 1);
>> +	if (ret < 0)
>> +		goto out;
>> +	ret = cr_vma_read_pages_addr(ctx, cr_vma->nr_pages);
> 
> The english here is a bit funky.  I think this needs to be
> cr_vma_read_page_addrs().  The other way makes it sound like you're
> reading the "page's addr", meaning a singular page.  Same for data.
> 
>> +	if (ret < 0)
>> +		goto out;
>> +	ret = cr_vma_read_pages_data(ctx, cr_vma->nr_pages);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	cr_pgarr_release(ctx);	/* reset page-array chain */
> 
> Where did this sucker get allocated?  This is going to be a bit
> difficult to audit since it isn't allocated and freed (or released) at
> the same level.  Seems like it would be much nicer if it was allocated
> at the beginning of this function.

The name is misleading, should be:  cr_pgarr_reset().  What happens is
that we allocate the chain of cr_pgarr on demand. Once done with one MM,
we reset the state and reuse the same chain until we need to expand it.
I renamed and added comments about it in the next version.
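
The allocate-on-demand, reset-and-reuse pattern can be sketched in user
space as follows. The names (chain, chain_node, chain_prep, chain_reset)
and the per-node capacity are hypothetical and only loosely modeled on
the cr_pgarr helpers discussed here.

```c
#include <stdlib.h>

#define NODE_CAP 8	/* hypothetical number of slots per node */

struct chain_node {
	struct chain_node *next;
	unsigned short nleft;	/* free slots */
	unsigned short nused;	/* filled slots */
};

struct chain {
	struct chain_node *head;
	int nalloc;		/* how many nodes were ever allocated */
};

/* Return a node with free slots, allocating a new one only if all are full. */
struct chain_node *chain_prep(struct chain *c)
{
	struct chain_node **pp = &c->head;

	for (; *pp; pp = &(*pp)->next)
		if ((*pp)->nleft)
			return *pp;

	struct chain_node *n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	n->nleft = NODE_CAP;
	*pp = n;		/* append at the tail */
	c->nalloc++;
	return n;
}

/* Mark every node empty again without freeing anything. */
void chain_reset(struct chain *c)
{
	for (struct chain_node *n = c->head; n; n = n->next) {
		n->nleft = NODE_CAP;
		n->nused = 0;
	}
}
```

After a reset, the next round of chain_prep() calls walks the existing
nodes first, so the chain only grows when a later MM needs more slots
than any earlier one did.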

> 
>> +	/* restore original protection for this vma */
>> +	if (!(cr_vma->vm_flags & VM_WRITE))
>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
>> +
>> + out:
>> +	return ret;
>> +}
> 
> Ugh.  Is this a security hole?  What if the user was not allowed to
> write to the file being mmap()'d by this VMA?  Is this a window where
> someone could come in and (using ptrace or something similar) write to
> the file?

Not a security hole: this is only for private memory, so it never
modifies the underlying file. This is related to what I explained before
about read-only VMAs that have modified pages.

The process is restarting, inside a container that is restarting. All
tasks inside should be calling sys_restart() (by design) and no other
process from outside should be allowed to ptrace them at this point.

(In any case, if some other task ptraces this task, it can make it do
anything anyway).

> 
> We copy into the process address space all the time when not in its
> context explicitly.  

Huh ?

> 
>> +/**
>> + * cr_calc_map_prot_bits - convert vm_flags to mmap protection
>> + * orig_vm_flags: source vm_flags

[...]

>> +	switch (hh->vma_type) {
>> +	case CR_VMA_ANON:
>> +	case CR_VMA_FILE:
>> +		/* standard case: read the data into the memory */
>> +		ret = cr_vma_read_pages(ctx, hh);
>> +		break;
>> +	}
>> +
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (vm_prot & PROT_EXEC)
>> +		flush_icache_range(hh->vm_start, hh->vm_end);
> 
> Why the heck is this here?  Isn't this a fresh mm?  We shouldn't have to
> do this unless we had a VMA here previously.  Maybe it would be more
> efficient to do this when tearing down the old vmas.

Good point, thanks.

[...]

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v3][PATCH 8/9] File descriprtors (dump)
  2008-09-04 18:41   ` Dave Hansen
@ 2008-09-07  4:52     ` Oren Laadan
  2008-09-08 16:57       ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-07  4:52 UTC (permalink / raw)
  To: Dave Hansen; +Cc: arnd, jeremy, linux-kernel, containers



Dave Hansen wrote:
> On Thu, 2008-09-04 at 04:05 -0400, Oren Laadan wrote:
>> +/**
>> + * cr_scan_fds - scan file table and construct array of open fds
>> + * @files: files_struct pointer
>> + * @fdtable: (output) array of open fds
>> + * @return: the number of open fds found
>> + *
>> + * Allocates the file descriptors array (*fdtable), caller should free
>> + */
>> +int cr_scan_fds(struct files_struct *files, int **fdtable)
>> +{
>> +	struct fdtable *fdt;
>> +	int *fdlist;
>> +	int i, n, max;
>> +
>> +	max = CR_DEFAULT_FDTABLE;
>> +
>> + repeat:
>> +	n = 0;
>> +	fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
>> +	if (!fdlist)
>> +		return -ENOMEM;
>> +
>> +	spin_lock(&files->file_lock);
>> +	fdt = files_fdtable(files);
>> +	for (i = 0; i < fdt->max_fds; i++) {
>> +		if (fcheck_files(files, i)) {
>> +			if (n == max) {
>> +				spin_unlock(&files->file_lock);
>> +				kfree(fdlist);
>> +				max *= 2;
>> +				if (max < 0) {	/* overflow ? */
>> +					n = -EMFILE;
>> +					break;
>> +				}
>> +				goto repeat;
>> +			}
>> +			fdlist[n++] = i;
>> +		}
>> +	}
>> +	spin_unlock(&files->file_lock);
>> +
>> +	*fdtable = fdlist;
>> +	return n;
>> +}
> 
> That loop needs some love.  At least save us from one level of
> indenting:
> 
>> +	for (i = 0; i < fdt->max_fds; i++) {
>> +		if (!fcheck_files(files, i))
>> 			continue;
>> 		if (n == max) {
>> +			spin_unlock(&files->file_lock);
>> +			kfree(fdlist);
>> +			max *= 2;
>> +			if (max < 0) {	/* overflow ? */
>> +				n = -EMFILE;
>> +				break;
>> +			}
>> +			goto repeat;
>> +		}
>> +		fdlist[n++] = i;
>> +	}
> 
> My gut also says that there has to be a better way to find a good size
> for fdlist than growing it this way.

Can you suggest a better way to find the open files of a task ?

Either I loop twice (loop to count, then allocate, then loop to fill),
or optimistically try an initial guess and expand on demand.
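
The "initial guess, expand on demand" variant can be sketched in user
space like this. A plain byte map stands in for the file table, there is
no locking, and the name scan_open and the starting capacity are made up.
The retry is structured as an outer loop so that the buffer is freed and
the overflow check is done in one place; in the quoted kernel loop the
overflow path appears to break out after the lock was already dropped and
fdlist already freed, which this shape avoids.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Collect every index i with is_open[i] != 0 into a newly allocated
 * array stored in *out (the caller frees it).
 * Returns the number of entries, or a negative errno value.
 */
int scan_open(const unsigned char *is_open, int max_fds, int **out)
{
	int cap = 4;	/* deliberately small initial guess */

	for (;;) {
		int *list = malloc(cap * sizeof(*list));
		int n = 0, i;

		if (!list)
			return -ENOMEM;

		for (i = 0; i < max_fds; i++) {
			if (!is_open[i])
				continue;
			if (n == cap)
				break;	/* guess was too small */
			list[n++] = i;
		}

		if (i == max_fds) {	/* scanned everything */
			*out = list;
			return n;
		}

		/* retry with twice the capacity, guarding against overflow */
		free(list);
		if (cap > INT_MAX / 2)
			return -EMFILE;
		cap *= 2;
	}
}
```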

> 
> Why do we even have a fixed size for this?
> 
> +#define CR_DEFAULT_FDTABLE  256
> 
>> +/* cr_write_fd_data - dump the state of a given file pointer */
>> +static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
>> +{
>> +	struct cr_hdr h;
>> +	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> +	struct dentry *dent = file->f_dentry;
>> +	struct inode *inode = dent->d_inode;
>> +	enum fd_type fd_type;
>> +	int ret;
>> +
>> +	h.type = CR_HDR_FD_DATA;
>> +	h.len = sizeof(*hh);
>> +	h.parent = parent;
>> +
>> +	BUG_ON(!inode);
> 
> Why a BUG_ON()?  We'll deref it in just a sec anyway.  We prefer to just
> get the NULL dereference rather than an explicit BUG_ON().
> 
>> +	hh->f_flags = file->f_flags;
>> +	hh->f_mode = file->f_mode;
>> +	hh->f_pos = file->f_pos;
>> +	hh->f_uid = file->f_uid;
>> +	hh->f_gid = file->f_gid;
> 
> Is there a plan to save off the 'struct user' here instead?  Nested user
> namespaces in one checkpoint image might get confused otherwise.

Of course. Eventually, 'struct user' will be another shared object that
is encountered and saved with the checkpoint image.

> 
>> +	hh->f_version = file->f_version;
>> +	/* FIX: need also file->f_owner */
>> +
>> +	switch (inode->i_mode & S_IFMT) {
>> +	case S_IFREG:
>> +		fd_type = CR_FD_FILE;
>> +		break;
>> +	case S_IFDIR:
>> +		fd_type = CR_FD_DIR;
>> +		break;
>> +	case S_IFLNK:
>> +		fd_type = CR_FD_LINK;
>> +		break;
>> +	default:
>> +		return -EBADF;
>> +	}
> 
> Why don't we just store (and use) (inode->i_mode & S_IFMT) in fd_type
> instead of making our own types?

There will be others that cannot be inferred from inode->i_mode,
e.g. CR_FD_FILE_UNLINKED, CR_FD_DIR_UNLINKED, CR_FD_SOCK_UNIX,
CR_FD_SOCK_INET_V4, CR_FD_EVENTPOLL etc.

> 
>> +	/* FIX: check if the file/dir/link is unlinked */
>> +	hh->fd_type = fd_type;

[...]

Oren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Devel] Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-04 14:42   ` Serge E. Hallyn
  2008-09-04 17:32     ` Oren Laadan
@ 2008-09-08 15:02     ` Andrey Mirkin
  2008-09-08 16:07       ` Cedric Le Goater
  1 sibling, 1 reply; 43+ messages in thread
From: Andrey Mirkin @ 2008-09-08 15:02 UTC (permalink / raw)
  To: devel
  Cc: Serge E. Hallyn, Oren Laadan, jeremy, arnd, containers,
	linux-kernel, dave, Andrey Mirkin

On Thursday 04 September 2008 18:42 Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@cs.columbia.edu):
> > Create trivial sys_checkpoint and sys_restart system calls. They will
> > make it possible to checkpoint and restart an entire container, to and
> > from a checkpoint image file descriptor.
> >
> > The syscalls take a file descriptor (for the image file) and flags as
> > arguments. For sys_checkpoint the first argument identifies the target
> > container; for sys_restart it will identify the checkpoint image.
> >
> > Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> > ---
> >   arch/x86/kernel/syscall_table_32.S |    2 ++
> >   checkpoint/Kconfig                 |   11 +++++++++++
> >   checkpoint/Makefile                |    5 +++++
> >   checkpoint/sys.c                   |   35 +++++++++++++++++++++++++++++++++++
> >   include/asm-x86/unistd_32.h        |    2 ++
> >   include/linux/syscalls.h           |    2 ++
> >   init/Kconfig                       |    2 ++
> >   kernel/sys_ni.c                    |    4 ++++
> >   8 files changed, 63 insertions(+), 0 deletions(-)
> >   create mode 100644 checkpoint/Kconfig
> >   create mode 100644 checkpoint/Makefile
> >   create mode 100644 checkpoint/sys.c
> >
> > diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> > index d44395f..5543136 100644
> > --- a/arch/x86/kernel/syscall_table_32.S
> > +++ b/arch/x86/kernel/syscall_table_32.S
> > @@ -332,3 +332,5 @@ ENTRY(sys_call_table)
> >   	.long sys_dup3			/* 330 */
> >   	.long sys_pipe2
> >   	.long sys_inotify_init1
> > +	.long sys_checkpoint
> > +	.long sys_restart
> > diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
> > new file mode 100644
> > index 0000000..a9f22ef
> > --- /dev/null
> > +++ b/checkpoint/Kconfig
> > @@ -0,0 +1,11 @@
> > +config CHECKPOINT_RESTART
> > +	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
> > +	def_bool y
> > +	depends on X86_32 && EXPERIMENTAL
> > +	help
> > +	  Application checkpoint/restart is the ability to save the
> > +	  state of a running application so that it can later resume
> > +	  its execution from the time at which it was checkpointed.
> > +
> > +	  Turning this option on will enable checkpoint and restart
> > +	  functionality in the kernel.
> > diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> > new file mode 100644
> > index 0000000..07d018b
> > --- /dev/null
> > +++ b/checkpoint/Makefile
> > @@ -0,0 +1,5 @@
> > +#
> > +# Makefile for linux checkpoint/restart.
> > +#
> > +
> > +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
> > diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> > new file mode 100644
> > index 0000000..b9018a4
> > --- /dev/null
> > +++ b/checkpoint/sys.c
> > @@ -0,0 +1,35 @@
> > +/*
> > + *  Generic container checkpoint-restart
> > + *
> > + *  Copyright (C) 2008 Oren Laadan
> > + *
> > + *  This file is subject to the terms and conditions of the GNU General Public
> > + *  License.  See the file COPYING in the main directory of the Linux
> > + *  distribution for more details.
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/kernel.h>
> > +
> > +/**
> > + * sys_checkpoint - checkpoint a container
> > + * @pid: pid of the container init(1) process
> > + * @fd: file to which dump the checkpoint image
> > + * @flags: checkpoint operation flags
> > + */
> > +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> > +{
> > +	pr_debug("sys_checkpoint not implemented yet\n");
> > +	return -ENOSYS;
> > +}
> > +/**
> > + * sys_restart - restart a container
> > + * @crid: checkpoint image identifier
>
> So can we compare your api to Andrey's?
>
> You've explained before that crid is used to tie together multiple
> calls to checkpoint, but why do you have to specify it for restart?
> Can't it just come from the fd?  Or, the fd will be passed in
> seek()d to the right position for the data for this task, so the crid
> won't be available there?
>
> Andrey, how will the 'ctid' in your patchset be used?  It sounds
> like it's actually going to set some integer id on the created
> container?  We actually don't have container ids (or even
> containers) right now, so we probably don't want that in our api,
> right?

'ctid' in my patchset will be used later (once we have container
infrastructure) to specify the container ID. For now we can drop 'ctid'
and add it back when the container infrastructure is in place.

>
> > + * @fd: file from which read the checkpoint image
> > + * @flags: restart operation flags
> > + */
> > +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> > +{
> > +	pr_debug("sys_restart not implemented yet\n");
> > +	return -ENOSYS;
> > +}
> > diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
> > index d739467..88bdec4 100644
> > --- a/include/asm-x86/unistd_32.h
> > +++ b/include/asm-x86/unistd_32.h
> > @@ -338,6 +338,8 @@
> >   #define __NR_dup3		330
> >   #define __NR_pipe2		331
> >   #define __NR_inotify_init1	332
> > +#define __NR_checkpoint		333
> > +#define __NR_restart		334
> >
> >   #ifdef __KERNEL__
> >
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index d6ff145..edc218b 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
> >   asmlinkage long sys_eventfd(unsigned int count);
> >   asmlinkage long sys_eventfd2(unsigned int count, int flags);
> >   asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> > +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> > +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
> >
> >   int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index c11da38..fd5f7bf 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -779,6 +779,8 @@ config MARKERS
> >
> >   source "arch/Kconfig"
> >
> > +source "checkpoint/Kconfig"
> > +
> >   config PROC_PAGE_MONITOR
> >    	default y
> >   	depends on PROC_FS && MMU
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 08d6e1b..ca95c25 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
> >   cond_syscall(compat_sys_timerfd_gettime);
> >   cond_syscall(sys_eventfd);
> >   cond_syscall(sys_eventfd2);
> > +
> > +/* checkpoint/restart */
> > +cond_syscall(sys_checkpoint);
> > +cond_syscall(sys_restart);
> > --
> > 1.5.4.3
> >
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linux-foundation.org/mailman/listinfo/containers
>
> _______________________________________________
> Devel mailing list
> Devel@openvz.org
> https://openvz.org/mailman/listinfo/devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC v3][PATCH 4/9] Memory management (dump)
  2008-09-07  1:54     ` Oren Laadan
@ 2008-09-08 15:55       ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-08 15:55 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Sat, 2008-09-06 at 21:54 -0400, Oren Laadan wrote:
> The length of the array may be tunable, or even adaptive (e.g. based
> on statistics from recent checkpoints), in the future.

I'm not sure "tuning" it to make those arrays longer than PAGE_SIZE will
ever be a good idea.  Seems like we should just keep the structures at
PAGE_SIZE to me.

-- Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Devel] Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
  2008-09-08 15:02     ` [Devel] " Andrey Mirkin
@ 2008-09-08 16:07       ` Cedric Le Goater
  0 siblings, 0 replies; 43+ messages in thread
From: Cedric Le Goater @ 2008-09-08 16:07 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: devel, jeremy, arnd, containers, linux-kernel, dave, Andrey Mirkin

>>> --- /dev/null
>>> +++ b/checkpoint/sys.c
>>> @@ -0,0 +1,35 @@
>>> +/*
>>> + *  Generic container checkpoint-restart
>>> + *
>>> + *  Copyright (C) 2008 Oren Laadan
>>> + *
>>> + *  This file is subject to the terms and conditions of the GNU General Public
>>> + *  License.  See the file COPYING in the main directory of the Linux
>>> + *  distribution for more details.
>>> + */
>>> +
>>> +#include <linux/sched.h>
>>> +#include <linux/kernel.h>
>>> +
>>> +/**
>>> + * sys_checkpoint - checkpoint a container
>>> + * @pid: pid of the container init(1) process
>>> + * @fd: file to which dump the checkpoint image
>>> + * @flags: checkpoint operation flags
>>> + */
>>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>>> +{
>>> +	pr_debug("sys_checkpoint not implemented yet\n");
>>> +	return -ENOSYS;
>>> +}
>>> +/**
>>> + * sys_restart - restart a container
>>> + * @crid: checkpoint image identifier
>> So can we compare your api to Andrey's?

Jumping into the API thread: how will this API interact with namespaces?

I think the exact question is how we see the restart sequence: should
we (1) restart from inside a set of pre-established namespaces, or
(2) restore the state of the namespaces upon restart?

I think (1) is the best option semantically, because it's closer to what
the kernel does: create a directory (a container) and then fill it with
files (tasks). That's how the cgroup framework works, and I have the
feeling we will be using this framework to build the 'super' container
object, no?

This direction has an impact on the API because the restart sequence
will depend on a set of preliminary settings to create an 'empty'
container which can then be used to exec() tasks or restart() tasks. This
is a very different API from a magical restart() syscall creating
hundreds of namespaces and zillions of tasks from scratch using an
opaque binary blob. Less attractive for sure, but it feels more kernel
friendly :)


But maybe you have addressed this topic at the summit and the question
is closed?

C. 


* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-07  3:09     ` Oren Laadan
@ 2008-09-08 16:49       ` Dave Hansen
  2008-09-09  6:01         ` Oren Laadan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2008-09-08 16:49 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Sat, 2008-09-06 at 23:09 -0400, Oren Laadan wrote:
> >> +		ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long));
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		pgarr->nleft -= nr;
> >> +		pgarr->nused += nr;
> >> +		npages -= nr;
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * cr_vma_read_pages_data - read in data of pages in page-array chain
> >> + * @ctx - restart context
> >> + * @npages - number of pages
> >> + */
> >> +static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages)
> >> +{
> >> +	struct cr_pgarr *pgarr;
> >> +	unsigned long *addrs;
> >> +	int nr, ret;
> >> +
> >> +	for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) {
> >> +		addrs = pgarr->addrs;
> >> +		nr = pgarr->nused;
> >> +		npages -= nr;
> >> +		while (nr--) {
> >> +			ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE);
> > 
> > The void cast is unnecessary, right?
> 
> It is necessary: cr_uread() expects "void *", while "*addrs" is "unsigned long".

I'd suggest not storing virtual addresses in 'unsigned long'.  It's
passable when you're doing lots of arithmetic on the addresses, but that
isn't happening here.  That probably means cascading back and changing
the type of pgarr->addrs[]. 

> > 
> >> +			if (ret < 0)
> >> +				return ret;
> >> +		}
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> > 
> > I'm having some difficulty parsing this function.  Could we
> > s/data/contents/ in the function name?  It also looks like addrs is
> > being used like an array here.  Can we use it explicitly that way?  I'd
> > also like to see it called vaddr or something explicit about what kinds
> > of addresses they are.
> > 
> >> +/* change the protection of an address range to be writable/non-writable.
> >> + * this is useful when restoring the memory of a read-only vma */
> >> +static int cr_vma_writable(struct mm_struct *mm, unsigned long start,
> >> +			   unsigned long end, int writable)
> > 
> > "cr_vma_writable" is a question to me.  This needs to be
> > "cr_vma_make_writable" or something to indicate that it is modifying the
> > vma.
> > 
> >> +{
> >> +	struct vm_area_struct *vma, *prev;
> >> +	unsigned long flags = 0;
> >> +	int ret = -EINVAL;
> >> +
> >> +	cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
> >> +
> >> +	down_write(&mm->mmap_sem);
> >> +	vma = find_vma_prev(mm, start, &prev);
> >> +	if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start))
> >> +		goto out;
> > 
> > Kill the unlikely(), please.  It's unnecessary and tends to make things
> > slower when not used correctly.  Can you please check all the future
> > patches and make sure that you don't accidentally introduce these later?
> > 
> >> +	if (writable && !(vma->vm_flags & VM_WRITE))
> >> +		flags = vma->vm_flags | VM_WRITE;
> >> +	else if (!writable && (vma->vm_flags & VM_WRITE))
> >> +		flags = vma->vm_flags & ~VM_WRITE;
> >> +	cr_debug("flags %#lx\n", flags);
> >> +	if (flags)
> >> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
> >> +				     vma->vm_end, flags);
> > 
> > Is this to fixup the same things that setup_arg_pages() uses it for?  We
> > should probably consolidate those calls somehow.  
> 
> This is needed for a VMA that has modified pages but is read-only: e.g. a
> task modifies memory then calls mprotect() to make it read-only.
> Since during restart we will recreate this VMA as read-only, we need to
> temporarily make it read-write to be able to read the saved contents into
> it, and then restore the read-only protection.
> 
> setup_arg_pages() is unrelated, and the code doesn't seem to have much in
> common.

Have you looked at mprotect_fixup()?  It deals with two things:
1. altering the commit charge against RSS if the mapping is actually
   writable.
2. Merging the VMA with an adjacent one if possible

We don't want to do either of these two things.  Even if we do merge the
VMA, it will be a waste of time and energy since we'll just re-split it
when we mprotect() again.
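
The make-writable / fill / restore-protection sequence under discussion
can be sketched in userspace with mprotect(2).  This is only an analogy:
the patch operates on VMAs inside the kernel via mprotect_fixup(), and
the names sketch_fill_readonly() and sketch_fill_demo() are hypothetical,
not part of the patchset.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Userspace analogy only: fill a read-only private mapping by making it
 * writable for the duration of the copy, then dropping write permission
 * again.  `region` must be page-aligned (e.g. returned by mmap()).
 * Returns 0 on success, -1 on failure.
 */
int sketch_fill_readonly(void *region, size_t region_len,
			 const void *saved, size_t saved_len)
{
	assert(region != NULL && saved != NULL);

	if (saved_len > region_len)
		return -1;
	if (mprotect(region, region_len, PROT_READ | PROT_WRITE) < 0)
		return -1;
	memcpy(region, saved, saved_len);
	/* restore the original (read-only) protection */
	return mprotect(region, region_len, PROT_READ);
}

/* Usage: map a read-only anonymous region, fill it, check the result. */
int sketch_fill_demo(void)
{
	static const char saved[] = "checkpointed contents";
	size_t len = 4096;
	char *region;
	int ok;

	region = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return -1;
	ok = sketch_fill_readonly(region, len, saved, sizeof(saved)) == 0 &&
	     memcmp(region, saved, sizeof(saved)) == 0;
	munmap(region, len);
	return ok ? 0 : -1;
}
```

Unlike this sketch, the mprotect(2) system call goes through the usual
permission checks, which is part of what the thread is arguing about.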

> >> + out:
> >> +	up_write(&mm->mmap_sem);
> >> +	return ret;
> >> +}
> >> +
> >> +/**
> >> + * cr_vma_read_pages - read in pages to restore a vma
> >> + * @ctx - restart context
> >> + * @cr_vma - vma descriptor from restart
> >> + */
> >> +static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma)
> >> +{
> >> +	struct mm_struct *mm = current->mm;
> >> +	int ret = 0;
> >> +
> >> +	if (!cr_vma->nr_pages)
> >> +		return 0;
> > 
> > Looking at this code, I can now tell that we need to be more explicit
> > about what nr_pages is.  Is it the nr_pages that the vma spans,
> > contains, maps....??  Why do we need to check it here?
> 
> nr_pages is the number of pages _saved_ for this VMA, that is, the number
> of pages to follow to read in. If it's zero, we don't read anything (the
> VMA had no dirty pages). Otherwise, we need to read that many addresses
> followed by that many pages.

Right, but that doesn't come out of the code easily.  I'm asking you to
please change those variable names to make it easier to figure out what
those things are used for.  You, as the author, have a good grip but
others reading the code do not.
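
To make the meaning of the count concrete, here is a minimal userspace
sketch of the layout described above: a count of saved pages, then that
many addresses, then that many pages of data.  Everything in it is an
assumption for illustration: the tiny SKETCH_PAGE_SIZE, the use of
offsets into a flat buffer instead of virtual addresses, and the names
sketch_restore_vma() and nr_saved_pages (one possible, more explicit
spelling of nr_pages).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16	/* tiny "page" for illustration only */

/*
 * Parse one record: nr_saved_pages offsets followed by nr_saved_pages
 * pages of data, copying each page to its offset inside `mem`.
 * Returns the number of image bytes consumed, or -1 on a bad record.
 */
long sketch_restore_vma(const unsigned char *image, size_t image_len,
			uint64_t nr_saved_pages,
			unsigned char *mem, size_t mem_len)
{
	const size_t per_page = sizeof(uint64_t) + SKETCH_PAGE_SIZE;
	const unsigned char *page_data;
	size_t addrs_bytes;
	uint64_t i;

	assert(image != NULL && mem != NULL);

	if (nr_saved_pages == 0)
		return 0;	/* no pages were saved for this VMA */
	if (nr_saved_pages > image_len / per_page || mem_len < SKETCH_PAGE_SIZE)
		return -1;

	addrs_bytes = (size_t)nr_saved_pages * sizeof(uint64_t);
	page_data = image + addrs_bytes;
	for (i = 0; i < nr_saved_pages; i++) {
		uint64_t off;

		memcpy(&off, image + i * sizeof(off), sizeof(off));
		if (off % SKETCH_PAGE_SIZE || off > mem_len - SKETCH_PAGE_SIZE)
			return -1;
		memcpy(mem + off, page_data + i * SKETCH_PAGE_SIZE,
		       SKETCH_PAGE_SIZE);
	}
	return (long)(nr_saved_pages * per_page);
}

/* Usage: two saved pages, destined for "pages" 3 and 1 of a 4-page area. */
int sketch_restore_demo(void)
{
	unsigned char image[2 * sizeof(uint64_t) + 2 * SKETCH_PAGE_SIZE];
	unsigned char mem[4 * SKETCH_PAGE_SIZE];
	uint64_t offs[2] = { 3 * SKETCH_PAGE_SIZE, 1 * SKETCH_PAGE_SIZE };

	memset(mem, 0, sizeof(mem));
	memcpy(image, offs, sizeof(offs));
	memset(image + sizeof(offs), 'A', SKETCH_PAGE_SIZE);
	memset(image + sizeof(offs) + SKETCH_PAGE_SIZE, 'B', SKETCH_PAGE_SIZE);

	if (sketch_restore_vma(image, sizeof(image), 2, mem, sizeof(mem)) !=
	    (long)sizeof(image))
		return -1;
	if (mem[3 * SKETCH_PAGE_SIZE] != 'A' || mem[SKETCH_PAGE_SIZE] != 'B')
		return -1;
	if (mem[0] != 0 || mem[2 * SKETCH_PAGE_SIZE] != 0)
		return -1;	/* pages that were not saved stay untouched */
	return 0;
}
```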

> >> +	/* restore original protection for this vma */
> >> +	if (!(cr_vma->vm_flags & VM_WRITE))
> >> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
> >> +
> >> + out:
> >> +	return ret;
> >> +}
> > 
> > Ugh.  Is this a security hole?  What if the user was not allowed to
> > write to the file being mmap()'d by this VMA?  Is this a window where
> > someone could come in and (using ptrace or something similar) write to
> > the file?
> 
> Not a security hole: this is only for private memory, so it never
> modifies the underlying file. This is related to what I explained before
> about read-only VMAs that have modified pages.

OK, so a shared, read-only mmap() should never get into this code path.
What if an attacker modified the checkpoint file to pretend to have
pages for a read-only, but shared mmap().  Would this code be tricked?

> The process is restarting, inside a container that is restarting. All
> tasks inside should be calling sys_restart() (by design) and no other
> process from outside should be allowed to ptrace them at this point.

Are there plans to implement this, or is it already in here somehow?

> (In any case, if some other task ptraces this task, it can make it do
> anything anyhow).

No.  I'm suggesting that since this lets you effectively write to
something that is not writable, it may be a hole with which to bypass
permissions which were set up at an earlier time.  

> > We copy into the process address space all the time when not in its
> > context explicitly.  
> 
> Huh ?

I'm just saying that you don't need to be in a process's context in
order to copy contents into its virtual address space.  Check out
access_process_vm().

-- Dave



* Re: [RFC v3][PATCH 8/9] File descriprtors (dump)
  2008-09-07  4:52     ` Oren Laadan
@ 2008-09-08 16:57       ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-08 16:57 UTC (permalink / raw)
  To: Oren Laadan; +Cc: arnd, jeremy, linux-kernel, containers

On Sun, 2008-09-07 at 00:52 -0400, Oren Laadan wrote:
> >> +	for (i = 0; i < fdt->max_fds; i++) {
> >> +		if (!fcheck_files(files, i))
> >> +			continue;
> >> +		if (n == max) {
> >> +			spin_unlock(&files->file_lock);
> >> +			kfree(fdlist);
> >> +			max *= 2;
> >> +			if (max < 0) {	/* overflow ? */
> >> +				n = -EMFILE;
> >> +				break;
> >> +			}
> >> +			goto repeat;
> >> +		}
> >> +		fdlist[n++] = i;
> >> +	}
> > 
> > My gut also says that there has to be a better way to find a good size
> > for fdlist() than growing it this way.  
> 
> Can you suggest a better way to find the open files of a task ?
> 
> Either I loop twice (loop to count, then allocate, then loop to fill),
> or optimistically try an initial guess and expand on demand.

I'd suggest the double loop.  I think it is much more straightforward
code.
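
A minimal userspace sketch of the double-loop approach, assuming the fd
table can be modeled as an array of pointers where a non-NULL slot means
"open".  The names sketch_collect_open() and table are illustrative and
not from the patch.  In the kernel the table can change between the two
passes unless a lock is held across both, and the allocation may sleep;
neither is modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Two-pass collection: count the occupied slots, allocate exactly that
 * many entries, then fill them in.
 * Returns the number of indices stored in *out (caller frees), or -1
 * if the allocation fails.
 */
int sketch_collect_open(void *const *table, int max_fds, int **out)
{
	int i, n = 0, count = 0;
	int *list = NULL;

	assert(table != NULL && out != NULL && max_fds >= 0);

	/* pass 1: count occupied slots */
	for (i = 0; i < max_fds; i++)
		if (table[i])
			count++;

	if (count) {
		list = malloc(count * sizeof(*list));
		if (!list)
			return -1;
	}

	/* pass 2: record the index of every occupied slot */
	for (i = 0; i < max_fds; i++)
		if (table[i])
			list[n++] = i;

	*out = list;
	return n;
}

/* Usage: slots 0, 2 and 4 are "open". */
int sketch_collect_demo(void)
{
	int marker = 0;
	void *table[5] = { &marker, NULL, &marker, NULL, &marker };
	int *list = NULL;
	int n, ok;

	n = sketch_collect_open(table, 5, &list);
	ok = n == 3 && list[0] == 0 && list[1] == 2 && list[2] == 4;
	free(list);
	return ok ? 0 : -1;
}
```

Compared with the grow-and-retry loop, there is no doubling, no overflow
check on max and no goto, at the cost of walking the table twice.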

> >> +	hh->f_version = file->f_version;
> >> +	/* FIX: need also file->f_owner */
> >> +
> >> +	switch (inode->i_mode & S_IFMT) {
> >> +	case S_IFREG:
> >> +		fd_type = CR_FD_FILE;
> >> +		break;
> >> +	case S_IFDIR:
> >> +		fd_type = CR_FD_DIR;
> >> +		break;
> >> +	case S_IFLNK:
> >> +		fd_type = CR_FD_LINK;
> >> +		break;
> >> +	default:
> >> +		return -EBADF;
> >> +	}
> > 
> > Why don't we just store (and use) (inode->i_mode & S_IFMT) in fd_type
> > instead of making our own types?
> 
> There will be others that cannot be inferred from inode->i_mode,
> e.g. CR_FD_FILE_UNLINKED, CR_FD_DIR_UNLINKED, CR_FD_SOCK_UNIX,
> CR_FD_SOCK_INET_V4, CR_FD_EVENTPOLL etc.

I think we have a basically different philosophy on these.  I'd say
don't define them unless absolutely necessary.  The less you spell out
explicitly, the more flexibility you have in the future, and the less
code you spend doing simple conversions.

Anyway, I see why you're doing it this way.
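
For comparison, the classification that can be inferred from the mode
bits alone looks roughly like this from userspace, using fstat(2) and
S_IFMT.  The enum and function names are hypothetical stand-ins for the
CR_FD_* constants; types such as "unlinked file" or socket families
would still need information beyond st_mode.

```c
#define _DEFAULT_SOURCE		/* for S_IFMT and friends under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names, loosely mirroring CR_FD_FILE / CR_FD_DIR. */
enum sketch_fd_type {
	SKETCH_FD_UNSUPPORTED = 0,
	SKETCH_FD_FILE,
	SKETCH_FD_DIR
};

/* Classify an open descriptor using only the file-type bits of st_mode. */
enum sketch_fd_type sketch_classify_fd(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return SKETCH_FD_UNSUPPORTED;

	switch (st.st_mode & S_IFMT) {
	case S_IFREG:
		return SKETCH_FD_FILE;
	case S_IFDIR:
		return SKETCH_FD_DIR;
	default:
		return SKETCH_FD_UNSUPPORTED;
	}
}

/* Usage: "." is a directory, /dev/null is a character device. */
int sketch_classify_demo(void)
{
	int dir_fd = open(".", O_RDONLY);
	int null_fd = open("/dev/null", O_RDONLY);
	int ok;

	ok = dir_fd >= 0 && null_fd >= 0 &&
	     sketch_classify_fd(dir_fd) == SKETCH_FD_DIR &&
	     sketch_classify_fd(null_fd) == SKETCH_FD_UNSUPPORTED &&
	     sketch_classify_fd(-1) == SKETCH_FD_UNSUPPORTED;

	assert(dir_fd != 0 && null_fd != 0);	/* stdin is expected to be open */
	if (dir_fd >= 0)
		close(dir_fd);
	if (null_fd >= 0)
		close(null_fd);
	return ok ? 0 : -1;
}
```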

-- Dave



* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-08 16:49       ` Dave Hansen
@ 2008-09-09  6:01         ` Oren Laadan
  2008-09-10 21:42           ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Oren Laadan @ 2008-09-09  6:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Sat, 2008-09-06 at 23:09 -0400, Oren Laadan wrote:

[...]

>>>> +	if (writable && !(vma->vm_flags & VM_WRITE))
>>>> +		flags = vma->vm_flags | VM_WRITE;
>>>> +	else if (!writable && (vma->vm_flags & VM_WRITE))
>>>> +		flags = vma->vm_flags & ~VM_WRITE;
>>>> +	cr_debug("flags %#lx\n", flags);
>>>> +	if (flags)
>>>> +		ret = mprotect_fixup(vma, &prev, vma->vm_start,
>>>> +				     vma->vm_end, flags);
>>> Is this to fixup the same things that setup_arg_pages() uses it for?  We
>>> should probably consolidate those calls somehow.  
>> This is needed for a VMA that has modified pages but is read-only: e.g. a
>> task modifies memory then calls mprotect() to make it read-only.
>> Since during restart we will recreate this VMA as read-only, we need to
>> temporarily make it read-write to be able to read the saved contents into
>> it, and then restore the read-only protection.
>>
>> setup_arg_pages() is unrelated, and the code doesn't seem to have much in
>> common.
> 
> Have you looked at mprotect_fixup()?  It deals with two things:
> 1. altering the commit charge against RSS if the mapping is actually
>    writable.
> 2. Merging the VMA with an adjacent one if possible
> 
> We don't want to do either of these two things.  Even if we do merge the
> VMA, it will be a waste of time and energy since we'll just re-split it
> when we mprotect() again.

Your observation is correct; I chose this interface because it's really
simple and handy. I'm not worried about the performance because such VMAs
(read only but modified) are really rare, and the code can be optimized
later on.

[...]

>>>> +	/* restore original protection for this vma */
>>>> +	if (!(cr_vma->vm_flags & VM_WRITE))
>>>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
>>>> +
>>>> + out:
>>>> +	return ret;
>>>> +}
>>> Ugh.  Is this a security hole?  What if the user was not allowed to
>>> write to the file being mmap()'d by this VMA?  Is this a window where
>>> someone could come in and (using ptrace or something similar) write to
>>> the file?
>> Not a security hole: this is only for private memory, so it never
>> modifies the underlying file. This is related to what I explained before
>> about read-only VMAs that have modified pages.
> 
> OK, so a shared, read-only mmap() should never get into this code path.
> What if an attacker modified the checkpoint file to pretend to have
> pages for a read-only, but shared mmap().  Would this code be tricked?

VMAs of shared maps (IPC, anonymous shared) will be treated differently.

VMAs of shared files (mapped shared) are saved without their contents,
as the contents remain available on the file system!  (Yes, for that
we will eventually need file system snapshots.)

As for an attack that provides an altered checkpoint image: since we
(currently) don't escalate privileges, the attacker will not be able
to modify something that it doesn't have access to in the first place.

> 
>> The process is restarting, inside a container that is restarting. All
>> tasks inside should be calling sys_restart() (by design) and no other
>> process from outside should be allowed to ptrace them at this point.
> 
> Are there plans to implement this, or is it already in here somehow?

Once we get positive responses about the current patchset, the next
step is to handle multiple processes: I plan to extend the freezer
with two more states for this purpose (dumping, restarting).

> 
>> (In any case, if some other task ptraces this task, it can make it do
>> anything anyhow).
> 
> No.  I'm suggesting that since this lets you effectively write to
> something that is not writable, it may be a hole with which to bypass
> permissions which were set up at an earlier time.

That's a good comment, but all we are doing here is modifying
privately mapped/anonymous memory.

> 
>>> We copy into the process address space all the time when not in its
>>> context explicitly.  
>> Huh ?
> 
> I'm just saying that you don't need to be in a process's context in
> order to copy contents into its virtual address space.  Check out
> access_process_vm().
> 

That would be the other way to implement the restart. But since restart
executes in the task's context, it's simpler and more efficient to leverage
copy_to_user().
In terms of security, both methods bring about the same end result: the
memory is modified (perhaps bypassing the read-only property of the VMA).

Oren.


* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-09  6:01         ` Oren Laadan
@ 2008-09-10 21:42           ` Dave Hansen
  2008-09-10 22:00             ` Cleanups for: [PATCH " Dave Hansen
  2008-09-11  7:37             ` [RFC v3][PATCH " Oren Laadan
  0 siblings, 2 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 21:42 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, linux-kernel, arnd

On Tue, 2008-09-09 at 02:01 -0400, Oren Laadan wrote: 
> > Have you looked at mprotect_fixup()?  It deals with two things:
> > 1. altering the commit charge against RSS if the mapping is actually
> >    writable.
> > 2. Merging the VMA with an adjacent one if possible
> > 
> > We don't want to do either of these two things.  Even if we do merge the
> > VMA, it will be a waste of time and energy since we'll just re-split it
> > when we mprotect() again.
> 
> Your observation is correct; I chose this interface because it's really
> simple and handy. I'm not worried about the performance because such VMAs
> (read only but modified) are really rare, and the code can be optimized
> later on.

The worry is that it will never get cleaned up, and it is basically
cruft as it stands.  People may think that it is here protecting or
fixing something that it is not.

> >>>> +	/* restore original protection for this vma */
> >>>> +	if (!(cr_vma->vm_flags & VM_WRITE))
> >>>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
> >>>> +
> >>>> + out:
> >>>> +	return ret;
> >>>> +}
> >>> Ugh.  Is this a security hole?  What if the user was not allowed to
> >>> write to the file being mmap()'d by this VMA?  Is this a window where
> >>> someone could come in and (using ptrace or something similar) write to
> >>> the file?
> >> Not a security hole: this is only for private memory, so it never
> >> modifies the underlying file. This is related to what I explained before
> >> about read-only VMAs that have modified pages.
> > 
> > OK, so a shared, read-only mmap() should never get into this code path.
> > What if an attacker modified the checkpoint file to pretend to have
> > pages for a read-only, but shared mmap().  Would this code be tricked?
> 
> VMAs of shared maps (IPC, anonymous shared) will be treated differently.
> 
> VMAs of shared files (mapped shared) are saved without their contents,
> as the contents remain available on the file system!  (yes, for that
> we will eventually need file system snapshots).
> 
> As for an attack that provides an altered checkpoint image: since we
> (currently) don't escalate privileges, the attacker will not be able
> to modify something that it doesn't have access to in the first place.

I bugged Serge about this.  He said that this, at least, bypasses the
SELinux checks that are normally done with an mprotect() system call.
That's a larger design problem that we need to keep in mind: we need to
be careful to keep existing checks in place.

> >> The process is restarting, inside a container that is restarting. All
> >> tasks inside should be calling sys_restart() (by design) and no other
> >> process from outside should be allowed to ptrace them at this point.
> > 
> > Are there plans to implement this, or is it already in here somehow?
> 
> Once we get positive responses about the current patchset, the next
> step is to handle multiple processes: I plan to extend the freezer
> with two more states for this purpose (dumping, restarting).

OK, but I just asked you why a ptrace() of a process during this
elevated privilege operation couldn't potentially do something bad.  You
responded that, by design, we can't ptrace things.  The design is all
well and good, but the patch isn't, because it doesn't implement that
design. :(  Before we get these merged, that needs to get resolved.

> >> (In any case, if some other task ptraces this task, it can make it do
> >> anything anyhow).
> > 
> > No.  I'm suggesting that since this lets you effectively write to
> > something that is not writable, it may be a hole with which to bypass
> > permissions which were set up at an earlier time.
> 
> That's a good comment, but all we are doing here is modifying
> privately mapped/anonymous memory.
> 
> > 
> >>> We copy into the process address space all the time when not in its
> >>> context explicitly.  
> >> Huh ?
> > 
> > I'm just saying that you don't need to be in a process's context in
> > order to copy contents into its virtual address space.  Check out
> > access_process_vm().
> > 
> 
> That would be the other way to implement the restart. But since restart
> executes in the task's context, it's simpler and more efficient to leverage
> copy_to_user().
> In terms of security, both methods bring about the same end result: the
> memory is modified (perhaps bypassing the read-only property of the VMA).

But copy_to_user() is fundamentally different.  It writes *over*
contents and into files.  Simulating a fault fills in those pages, but
it never writes over things or into files.  Faulting is fundamentally
safer.

Faulting today can also handle populating a memory area with pages that
appear read-only via userspace.  That's exactly what we're doing here as
well.

Anyway, I don't expect that you'll agree with this.  I'll prototype
doing it the other way at some point and we can compare how both look.

-- Dave



* Cleanups for: [PATCH 5/9] Memory managemnet (restore)
  2008-09-10 21:42           ` Dave Hansen
@ 2008-09-10 22:00             ` Dave Hansen
  2008-09-11  7:37             ` [RFC v3][PATCH " Oren Laadan
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-10 22:00 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, arnd, linux-kernel

Here's the restore side of my cleanups on top of the v4 patches and the
one against 4/9 I just sent.

This purely makes it compile again.

---

 linux-2.6.git-dave/checkpoint/ckpt_mem.c |    2 --
 linux-2.6.git-dave/checkpoint/ckpt_mem.h |    4 ++++
 linux-2.6.git-dave/checkpoint/rstr_mem.c |   30 +++++++++++++++++++-----------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff -puN checkpoint/rstr_mem.c~p5-dave checkpoint/rstr_mem.c
--- linux-2.6.git/checkpoint/rstr_mem.c~p5-dave	2008-09-10 14:51:26.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/rstr_mem.c	2008-09-10 14:58:53.000000000 -0700
@@ -31,27 +31,35 @@
  * read in directly to the address space of the current process
  */
 
+static int pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+	return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
 /**
- * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
+ * cr_vma_restore_contents - read addresses of pages to page-array chain
  * @ctx - restart context
  * @npages - number of pages
  */
-static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
+static int cr_vma_restore_contents(struct cr_ctx *ctx, int pages_to_read)
 {
 	struct cr_pgarr *pgarr;
 	int nr, ret;
 
-	while (npages) {
-		pgarr = cr_pgarr_prep(ctx);
+	while (pages_to_read) {
+		unsigned long *vaddr_position;
+		pgarr = cr_get_empty_pgarr(ctx);
 		if (!pgarr)
 			return -ENOMEM;
-		nr = min(npages, (int) pgarr->nr_free);
-		ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
+		nr = pages_to_read;
+		if (nr > pgarr_nr_free(pgarr))
+			nr = pgarr_nr_free(pgarr);
+		vaddr_position = &pgarr->vaddrs[pgarr->nr_used];
+		ret = cr_kread(ctx, vaddr_position, nr * sizeof(unsigned long));
 		if (ret < 0)
 			return ret;
-		pgarr->nr_free -= nr;
 		pgarr->nr_used += nr;
-		npages -= nr;
+		pages_to_read -= nr;
 	}
 	return 0;
 }
@@ -67,7 +75,7 @@ static int cr_vma_read_pages_contents(st
 	unsigned long *vaddrs;
 	int i, ret;
 
-	list_for_each_entry(pgarr, &ctx->pgarr, list) {
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
 		vaddrs = pgarr->vaddrs;
 		for (i = 0; i < pgarr->nr_used; i++) {
 			void *ptr = (void *) vaddrs[i];
@@ -126,14 +134,14 @@ static int cr_vma_read_pages(struct cr_c
 		ret = cr_vma_set_writable(mm, hh->vm_start, hh->vm_end, 1);
 	if (ret < 0)
 		goto out;
-	ret = cr_vma_read_pages_vaddrs(ctx, hh->nr_pages);
+	ret = cr_vma_restore_contents(ctx, hh->nr_pages);
 	if (ret < 0)
 		goto out;
 	ret = cr_vma_read_pages_contents(ctx, hh->nr_pages);
 	if (ret < 0)
 		goto out;
 
-	cr_pgarr_reset(ctx);	/* reset page-array chain */
+	cr_reset_all_pgarrs(ctx);	/* reset page-array chain */
 
 	/* restore original protection for this vma */
 	if (!(hh->vm_flags & VM_WRITE))
diff -puN checkpoint/ckpt_mem.c~p5-dave checkpoint/ckpt_mem.c
--- linux-2.6.git/checkpoint/ckpt_mem.c~p5-dave	2008-09-10 14:51:33.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/ckpt_mem.c	2008-09-10 14:51:54.000000000 -0700
@@ -35,8 +35,6 @@
  * freed (that is, dereference page pointers).
  */
 
-#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
-
 /* release pages referenced by a page-array */
 void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
 {
diff -puN checkpoint/ckpt_mem.h~p5-dave checkpoint/ckpt_mem.h
--- linux-2.6.git/checkpoint/ckpt_mem.h~p5-dave	2008-09-10 14:52:17.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/ckpt_mem.h	2008-09-10 14:57:09.000000000 -0700
@@ -26,7 +26,11 @@ struct cr_pgarr {
 	struct list_head list;
 };
 
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+
 void cr_pgarr_free(struct cr_ctx *ctx);
+void cr_reset_all_pgarrs(struct cr_ctx *ctx);
 struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
+struct cr_pgarr *cr_get_empty_pgarr(struct cr_ctx *ctx);
 
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
_


-- Dave



* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-10 21:42           ` Dave Hansen
  2008-09-10 22:00             ` Cleanups for: [PATCH " Dave Hansen
@ 2008-09-11  7:37             ` Oren Laadan
  2008-09-11 15:38               ` Serge E. Hallyn
  2008-09-12 16:34               ` Dave Hansen
  1 sibling, 2 replies; 43+ messages in thread
From: Oren Laadan @ 2008-09-11  7:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, jeremy, linux-kernel, arnd



Dave Hansen wrote:
> On Tue, 2008-09-09 at 02:01 -0400, Oren Laadan wrote: 
>>> Have you looked at mprotect_fixup()?  It deals with two things:
>>> 1. altering the commit charge against RSS if the mapping is actually
>>>    writable.
>>> 2. Merging the VMA with an adjacent one if possible
>>>
>>> We don't want to do either of these two things.  Even if we do merge the
>>> VMA, it will be a waste of time and energy since we'll just re-split it
>>> when we mprotect() again.
>> Your observation is correct; I chose this interface because it's really
>> simple and handy. I'm not worried about the performance because such VMAs
>> (read only but modified) are really rare, and the code can be optimized
>> later on.
> 
> The worry is that it will never get cleaned up, and it is basically
> cruft as it stands.  People may think that it is here protecting or
> fixing something that it is not.

Let me start with the bottom line: since this creates too much confusion,
I'll just switch to the alternative and use get_user_pages() to bring
pages in and copy the data directly. Hopefully this will end the discussion.

(Note that there is a performance penalty in the form of an extra data copy:
instead of reading data directly into the page, we read into a buffer,
kmap_atomic() the page and copy into the page.)

> 
>>>>>> +	/* restore original protection for this vma */
>>>>>> +	if (!(cr_vma->vm_flags & VM_WRITE))
>>>>>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
>>>>>> +
>>>>>> + out:
>>>>>> +	return ret;
>>>>>> +}
>>>>> Ugh.  Is this a security hole?  What if the user was not allowed to
>>>>> write to the file being mmap()'d by this VMA?  Is this a window where
>>>>> someone could come in and (using ptrace or something similar) write to
>>>>> the file?
>>>> Not a security hole: this is only for private memory, so it never
>>>> modifies the underlying file. This is related to what I explained before
>>>> about read-only VMAs that have modified pages.
>>> OK, so a shared, read-only mmap() should never get into this code path.
>>> What if an attacker modified the checkpoint file to pretend to have
>>> pages for a read-only, but shared mmap().  Would this code be tricked?
>> VMAs of shared maps (IPC, anonymous shared) will be treated differently.
>>
>> VMAs of shared files (mapped shared) are saved without their contents,
>> as the contents remain available on the file system!  (yes, for that
>> we will eventually need file system snapshots).
>>
>> As for an attack that provides an altered checkpoint image: since we
>> (currently) don't escalate privileges, the attacker will not be able
>> to modify something that it doesn't have access to in the first place.
> 
> I bugged Serge about this.  He said that this, at least, bypasses the
> SELinux checks that are normally done with an mprotect() system call.
> That's a larger design problem that we need to keep in mind: we need to
> be careful to keep existing checks in place.

I also discussed this with Serge, and I got the impression that he
agreed that there was no security issue because it was all and only
about private memory.

> 
>>>> The process is restarting, inside a container that is restarting. All
>>>> tasks inside should be calling sys_restart() (by design) and no other
>>>> process from outside should be allowed to ptrace them at this point.
>>> Are there plans to implement this, or is it already in here somehow?
>> Once we get positive responses about the current patchset, the next
>> step is to handle multiple processes: I plan to extend the freezer
>> with two more states for this purpose (dumping, restarting).
> 
> OK, but I just asked you why a ptrace() of a process during this
> elevated privilege operation couldn't potentially do something bad.  You
> responded that, by design, we can't ptrace things.  The design is all
> well and good, but the patch isn't, because it doesn't implement that
> design. :(  Before we get these merged, that needs to get resolved.

If a task is ptraced, then the tracer can easily arrange for the tracee
to call mprotect(), or to call sys_restart() with a tampered checkpoint
file, or do other tricks. The call to mprotect_fixup(), on a private vma,
does not make this any worse. That is why I didn't bother implementing
that bit.

> 
>>>> (In any case, if some other task ptraces this task, it can make it do
>>>> anything anyhow).
>>> No.  I'm suggesting that since this lets you effectively write to
>>> something that is not writable, it may be a hole with which to bypass
>>> permissions which were set up at an earlier time.
>> That's a good comment, but all we are doing here is modifying
>> privately mapped/anonymous memory.
>>
>>>>> We copy into the process address space all the time when not in its
>>>>> context explicitly.  
>>>> Huh ?
>>> I'm just saying that you don't need to be in a process's context in
>>> order to copy contents into its virtual address space.  Check out
>>> access_process_vm().
>>>
>> That would be the other way to implement the restart. But since restart
>> executes in the task's context, it's simpler and more efficient to leverage
>> copy_to_user().
>> In terms of security, both methods bring about the same end result: the
>> memory is modified (perhaps bypassing the read-only property of the VMA).
> 
> But copy_to_user() is fundamentally different.  It writes *over*
> contents and into files.  Simulating a fault fills in those pages, but
> it never writes over things or into files.  Faulting is fundamentally
> safer.

copy_to_user() does not write into a file with private VMAs.
copy_to_user() in our case will always trigger a page fault.
copy_to_user() is faster as it does not require an extra copy.
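
The first point can be checked from userspace with a rough analogy: a
store through a MAP_PRIVATE file mapping triggers a copy-on-write fault
and leaves the file itself unchanged.  This sketch assumes a writable
/tmp, and the function name is hypothetical; copy_to_user() itself is
kernel-only and is not exercised here.

```c
#define _DEFAULT_SOURCE		/* for mkstemp() and pread() on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a store through a MAP_PRIVATE file mapping was visible
 * in memory but left the file unchanged, 0 if not, -1 on setup failure.
 */
int sketch_private_write_stays_private(void)
{
	char path[] = "/tmp/cr-sketch-XXXXXX";
	char on_disk[8] = "";
	char *map;
	int fd, result = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives on until fd is closed */

	if (write(fd, "original", 8) != 8)
		goto out;

	map = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	assert(map != NULL);

	memcpy(map, "modified", 8);	/* copy-on-write: page becomes private */

	if (pread(fd, on_disk, 8, 0) == 8)
		result = memcmp(on_disk, "original", 8) == 0 &&
			 memcmp(map, "modified", 8) == 0;
	munmap(map, 8);
out:
	close(fd);
	return result;
}
```

A MAP_SHARED mapping of the same file would behave differently, which
is why shared mappings are handled separately in the discussion above.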

> 
> Faulting today can also handle populating a memory area with pages that
> appear read-only via userspace.  That's exactly what we're doing here as
> well.
> 
> Anyway, I don't expect that you'll agree with this.  I'll prototype
> doing it the other way at some point and we can compare how both look.

Back to bottom line - whether or not I agree - I already changed the code
to use get_user_pages() and got rid of this controversy.

Oren.



* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-11  7:37             ` [RFC v3][PATCH " Oren Laadan
@ 2008-09-11 15:38               ` Serge E. Hallyn
  2008-09-12 16:34               ` Dave Hansen
  1 sibling, 0 replies; 43+ messages in thread
From: Serge E. Hallyn @ 2008-09-11 15:38 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Dave Hansen, containers, jeremy, arnd, linux-kernel

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> 
> Dave Hansen wrote:
> > On Tue, 2008-09-09 at 02:01 -0400, Oren Laadan wrote: 
> >>> Have you looked at mprotect_fixup()?  It deals with two things:
> >>> 1. altering the commit charge against RSS if the mapping is actually
> >>>    writable.
> >>> 2. Merging the VMA with an adjacent one if possible
> >>>
> >>> We don't want to do either of these two things.  Even if we do merge the
> >>> VMA, it will be a waste of time and energy since we'll just re-split it
> >>> when we mprotect() again.
> >> Your observation is correct; I chose this interface because it's really
> >> simple and handy. I'm not worried about the performance because such VMAs
> >> (read only but modified) are really rare, and the code can be optimized
> >> later on.
> > 
> > The worry is that it will never get cleaned up, and it is basically
> > cruft as it stands.  People may think that it is here protecting or
> > fixing something that it is not.
> 
> Let me start with the bottom line - since this creates too much confusion,
> I'll just switch to the alternative: will use get_user_pages() to bring
> pages in and copy the data directly. Hopefully this will end the discussion.
> 
> (Note, there is a performance penalty in the form of an extra data copy:
> instead of reading data directly to the page, we instead read into a buffer,
> kmap_atomic the page and copy into the page).
> 
> > 
> >>>>>> +	/* restore original protection for this vma */
> >>>>>> +	if (!(cr_vma->vm_flags & VM_WRITE))
> >>>>>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
> >>>>>> +
> >>>>>> + out:
> >>>>>> +	return ret;
> >>>>>> +}
> >>>>> Ugh.  Is this a security hole?  What if the user was not allowed to
> >>>>> write to the file being mmap()'d by this VMA?  Is this a window where
> >>>>> someone could come in and (using ptrace or something similar) write to
> >>>>> the file?
> >>>> Not a security hole: this is only for private memory, so it never
> >>>> modifies the underlying file. This is related to what I explained before
> >>>> about read-only VMAs that have modified pages.
> >>> OK, so a shared, read-only mmap() should never get into this code path.
> >>> What if an attacker modified the checkpoint file to pretend to have
> >>> pages for a read-only, but shared mmap().  Would this code be tricked?
> >> VMAs of shared maps (IPC, anonymous shared) will be treated differently.
> >>
> >> VMAs of shared files (mapped shared) are saved without their contents,
> >> as the contents remain available on the file system!  (yes, for that
> >> we will eventually need file system snapshots).
> >>
> >> As for an attack that provides an altered checkpoint image: since we
> >> (currently) don't escalate privileges, the attacker will not be able
> >> to modify something that it doesn't have access to in the first place.
> > 
> > I bugged Serge about this.  He said that this, at least, bypasses the SE
> > Linux checks that are normally done with an mprotect() system call.
> > That's a larger design problem that we need to keep in mind: we need to
> > be careful to keep existing checks in place.
> 
> I also discussed this with Serge, and I got the impression that he
> agreed that there was no security issue because it was all and only
> about private memory.

We will want the checks there eventually.  Now it's going to require new
selinux policy to deal with new denials, so in other words we're
basically going to be adding checks which people will be required to
work their way around :)  But we'll still need checks, because it is
bypassing selinux checks, and just because you need to bypass them to
be able to do restart (or run lisp) doesn't mean we can just drop the
checks.

-serge


* Re: [RFC v3][PATCH 5/9] Memory managemnet (restore)
  2008-09-11  7:37             ` [RFC v3][PATCH " Oren Laadan
  2008-09-11 15:38               ` Serge E. Hallyn
@ 2008-09-12 16:34               ` Dave Hansen
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2008-09-12 16:34 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, jeremy, arnd, linux-kernel

On Thu, 2008-09-11 at 03:37 -0400, Oren Laadan wrote:
> Let me start with the bottom line - since this creates too much confusion,
> I'll just switch to the alternative: will use get_user_pages() to bring
> pages in and copy the data directly. Hopefully this will end the discussion.
> 
> (Note, there is a performance penalty in the form of an extra data copy:
> instead of reading data directly to the page, we instead read into a buffer,
> kmap_atomic the page and copy into the page).

Yep but, as we discussed on IRC, this code needs some optimization for
pages in swap, anyway.  It isn't optimal for those, either.  So, for
this we'll leave it at a minimal amount of code rather than maximal
functionality. :)

> > I bugged Serge about this.  He said that this, at least, bypasses the SE
> > Linux checks that are normally done with an mprotect() system call.
> > That's a larger design problem that we need to keep in mind: we need to
> > be careful to keep existing checks in place.
> 
> I also discussed this with Serge, and I got the impression that he
> agreed that there was no security issue because it was all and only
> about private memory.

Yep, as long as there are some sanity checks to make sure that we're not
overriding permissions, I'm happy with this.

> If a task is ptraced, then the tracer can easily arrange for the tracee
> to call mprotect(), or to call sys_restart() with a tampered checkpoint
> file, or do other tricks. The call to mprotect_fixup(), on a private vma,
> does not make this any worse. That is why I didn't bother implementing
> that bit.

I completely agree that it isn't an issue on a private VMA.  My only
concern is if this is done to any shared memory or could potentially be
abused in such a way that it gets applied to shared memory.
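
[Illustration, not part of the original message.] The pattern under
discussion (make a read-only private area writable, fill it, then restore
the original protection) has a rough userspace analogue built on mmap() and
mprotect(). This is a minimal sketch under that assumption: the function
name is hypothetical, it uses an anonymous private mapping only, and unlike
the in-kernel path it goes through the regular mprotect() system call and
therefore through whatever checks that call performs.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical userspace analogue: populate a read-only, private,
 * anonymous region with len bytes from data.
 * Returns 0 on success and stores the region in *out (mapped PROT_READ
 * again); returns -1 on error.
 */
int fill_readonly_private(const void *data, size_t len, void **out)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t size;
	void *area;

	if (page <= 0 || len == 0)
		return -1;
	/* Round the length up to a whole number of pages. */
	size = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

	area = mmap(NULL, size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return -1;

	/* Temporarily allow writes so the contents can be filled in. */
	if (mprotect(area, size, PROT_READ | PROT_WRITE) != 0)
		goto fail;
	memcpy(area, data, len);

	/* Restore the original (read-only) protection. */
	if (mprotect(area, size, PROT_READ) != 0)
		goto fail;

	*out = area;
	return 0;
fail:
	munmap(area, size);
	return -1;
}
```

Because the mapping is private and anonymous there is no backing file or
other process that could observe the write, which is the property both
sides of the thread agree makes the private-VMA case harmless.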

-- Dave



