linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2)
@ 2008-10-17 23:11 Andrey Mirkin
  2008-10-17 23:11 ` [PATCH 01/10] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

These patchset introduces kernel based checkpointing/restart as it is 
implemented in OpenVZ project. This version (v2) supports multiple
processes with simple private memory and open files (regular files).

Todo:
 - Create processes with the same PID during restart
 - Add support for x86-64
 - Add support for shared objects

Changelog:

18 Oct 2008 (v2):
 - Add support for multiple processes
 - Cleanup and bug fixes

--

This patchset introduces kernel based checkpointing/restart as it is
implemented in OpenVZ project. This patchset has limited functionality and
are able to checkpoint/restart only single process. Recently Oren Laaden
sent another kernel based implementation of checkpoint/restart. The main
differences between this patchset and Oren's patchset are:

* In this patchset checkpointing initiated not from the process
(right now we do not have a container, only namespaces), Oren's patchset
performs checkpointing from the process context.

* Restart in this patchset is initiated from process, which restarts a new
process (in new namespaces) with saved state. Oren's patchset uses the same
process from which restart was initiated and restore saved state over it.

* Checkpoint/restart functionality in this patchset is implemented as a kernel
module


As checkpointing is initiated not from the process which state should be saved
we should freeze a process before saving its state. Right now Container Freezer
from Matt Helsley can be used for this.

This patchset introduce only a concept how kernel based checkpointing/restart
can be implemented and are able to checkpoint/restart only a single process
with simple VMAs. 

I've tried to split my patchset in small patches to make review more easier.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 01/10] Introduce trivial sys_checkpoint and sys_restore system calls
  2008-10-17 23:11 [PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2) Andrey Mirkin
@ 2008-10-17 23:11 ` Andrey Mirkin
  2008-10-17 23:11   ` [PATCH 02/10] Make checkpoint/restart functionality modular Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Right now they just return -ENOSYS. Later they will provide functionality
to checkpoint and restart a container.

Both syscalls take as arguments a file descriptor and flags.
Also sys_checkpoint take as the first argument a PID of container's init
(later it will be container ID); sys_restart takes as the first argument
a container ID (right now it will not be used).

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 Makefile                           |    2 +-
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Makefile                |    1 +
 checkpoint/sys_core.c              |   38 ++++++++++++++++++++++++++++++++++++
 include/asm-x86/unistd_32.h        |    2 +
 5 files changed, 44 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys_core.c

diff --git a/Makefile b/Makefile
index ea413fa..ce49afd 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index fd9d4f4..4a0d7fb 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -333,3 +333,5 @@ ENTRY(sys_call_table)
 	.long sys_pipe2
 	.long sys_inotify_init1
 	.long sys_hijack
+	.long sys_checkpoint
+	.long sys_restart		/* 335 */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..2276fb1
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1 @@
+obj-y += sys_core.o
diff --git a/checkpoint/sys_core.c b/checkpoint/sys_core.c
new file mode 100644
index 0000000..1a97fb6
--- /dev/null
+++ b/checkpoint/sys_core.c
@@ -0,0 +1,38 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+
+/**
+ * sys_checkpoint - checkpoint a container from outside
+ * @pid: pid of the container init(1) process
+ * TODO: should switch to container id later
+ * @fd: file to which save the checkpoint image
+ * @flags: checkpoint operation flags
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_restart - restart a container
+ * @ctid: container id which should be used to restart a container
+ * @fd: file from which read the checkpoint image
+ * @flags: restart operation flags
+ */
+asmlinkage long sys_restart(int ctid, int fd, unsigned long flags)
+{
+	return -ENOSYS;
+}
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index 70280da..1a09604 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -339,6 +339,8 @@
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
 #define __NR_hijack		333
+#define __NR_checkpoint		334
+#define __NR_restart		335
 
 #ifdef __KERNEL__
 
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 02/10] Make checkpoint/restart functionality modular
  2008-10-17 23:11 ` [PATCH 01/10] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin
@ 2008-10-17 23:11   ` Andrey Mirkin
  2008-10-17 23:11     ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Andrey Mirkin
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

A config option CONFIG_CHECKPOINT is introduced.
New structure cpt_operations is introduced to store pointers to
checkpoint/restart functions from module.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/Kconfig      |    7 ++++++
 checkpoint/Makefile     |    4 +++
 checkpoint/checkpoint.h |   19 ++++++++++++++++++
 checkpoint/sys.c        |   48 +++++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys_core.c   |   29 ++++++++++++++++++++++++++-
 init/Kconfig            |    2 +
 6 files changed, 107 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/checkpoint.h
 create mode 100644 checkpoint/sys.c

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..b9bc72d
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,7 @@
+config CHECKPOINT
+	tristate "Checkpoint & restart for containers"
+	depends on EXPERIMENTAL
+	default n
+	help
+	  This option adds module "cptrst", which allow to save a running
+	  container to a file and restart it later using this image file.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 2276fb1..bfe75d5 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1 +1,5 @@
 obj-y += sys_core.o
+
+obj-$(CONFIG_CHECKPOINT) += cptrst.o
+
+cptrst-objs := sys.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
new file mode 100644
index 0000000..381a9bf
--- /dev/null
+++ b/checkpoint/checkpoint.h
@@ -0,0 +1,19 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+struct cpt_operations
+{
+	struct module * owner;
+	int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
+	int (*restart)(int ctid, int fd, unsigned long flags);
+};
+extern struct cpt_operations cpt_ops;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..010e4eb
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,48 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+
+#include "checkpoint.h"
+
+MODULE_LICENSE("GPL");
+
+static int checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	return -ENOSYS; 
+}
+
+static int restart(int ctid, int fd, unsigned long flags)
+{
+	return -ENOSYS; 
+}
+
+static int __init init_cptrst(void)
+{
+	cpt_ops.owner = THIS_MODULE;
+	cpt_ops.checkpoint = checkpoint;
+	cpt_ops.restart = restart;
+	return 0;
+}
+module_init(init_cptrst);
+
+static void __exit exit_cptrst(void)
+{
+	cpt_ops.checkpoint = NULL;
+	cpt_ops.restart = NULL;
+	cpt_ops.owner = NULL;
+}
+module_exit(exit_cptrst);
diff --git a/checkpoint/sys_core.c b/checkpoint/sys_core.c
index 1a97fb6..528aaec 100644
--- a/checkpoint/sys_core.c
+++ b/checkpoint/sys_core.c
@@ -13,6 +13,13 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+
+#include "checkpoint.h"
+
+struct cpt_operations cpt_ops = { NULL, NULL, NULL };
+EXPORT_SYMBOL(cpt_ops);
 
 /**
  * sys_checkpoint - checkpoint a container from outside
@@ -23,7 +30,16 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	return -ENOSYS;
+	int ret;
+
+	ret = -ENOSYS;
+
+	if (try_module_get(cpt_ops.owner)) {
+		if (cpt_ops.checkpoint)
+			ret = cpt_ops.checkpoint(pid, fd, flags);
+		module_put(cpt_ops.owner);
+	}
+	return ret;
 }
 
 /**
@@ -34,5 +50,14 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int ctid, int fd, unsigned long flags)
 {
-	return -ENOSYS;
+	int ret;
+
+	ret = -ENOSYS;
+
+	if (try_module_get(cpt_ops.owner)) {
+		if (cpt_ops.restart)
+			ret = cpt_ops.restart(ctid, fd, flags);
+		module_put(cpt_ops.owner);
+	}
+	return ret;
 }
diff --git a/init/Kconfig b/init/Kconfig
index 4bd4b0c..b10f3cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -344,6 +344,8 @@ config CGROUP_FREEZER
           Provides a way to freeze and unfreeze all tasks in a
 	  cgroup
 
+source "checkpoint/Kconfig"
+	  
 config FAIR_GROUP_SCHED
 	bool "Group scheduling for SCHED_OTHER"
 	depends on GROUP_SCHED
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 03/10] Introduce context structure needed during checkpointing/restart
  2008-10-17 23:11   ` [PATCH 02/10] Make checkpoint/restart functionality modular Andrey Mirkin
@ 2008-10-17 23:11     ` Andrey Mirkin
  2008-10-17 23:11       ` [PATCH 04/10] Introduce container dump function Andrey Mirkin
  2008-10-20 17:02       ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Dave Hansen
  2008-10-20 16:51     ` [PATCH 02/10] Make checkpoint/restart functionality modular Dave Hansen
  2008-10-20 16:59     ` Serge E. Hallyn
  2 siblings, 2 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Add functions for context allocation/destroy.
Introduce functions to read/write image.
Introduce image header and object header.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/checkpoint.h |   40 +++++++++++++++
 checkpoint/cpt_image.h  |   63 ++++++++++++++++++++++++
 checkpoint/sys.c        |  125 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 225 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/cpt_image.h

diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 381a9bf..8ea73f5 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -10,6 +10,8 @@
  *
  */
 
+#include "cpt_image.h"
+
 struct cpt_operations
 {
 	struct module * owner;
@@ -17,3 +19,41 @@ struct cpt_operations
 	int (*restart)(int ctid, int fd, unsigned long flags);
 };
 extern struct cpt_operations cpt_ops;
+
+enum cpt_ctx_state
+{
+	CPT_CTX_ERROR = -1,
+	CPT_CTX_IDLE = 0,
+	CPT_CTX_DUMPING,
+	CPT_CTX_UNDUMPING
+};
+
+typedef struct cpt_context
+{
+	pid_t		pid;		/* should be changed to ctid later */
+	int		ctx_id;		/* context id */
+	struct list_head ctx_list;
+	int		refcount;
+	int		ctx_state;
+	struct semaphore main_sem;
+
+	int		errno;
+
+	struct file	*file;
+	loff_t		current_object;
+
+	struct list_head object_array[CPT_OBJ_MAX];
+
+	int		(*write)(const void *addr, size_t count, struct cpt_context *ctx);
+	int		(*read)(void *addr, size_t count, struct cpt_context *ctx);
+} cpt_context_t;
+
+extern int debug_level;
+
+#define cpt_printk(lvl, fmt, args...)	do {	\
+		if (lvl <= debug_level)		\
+			printk(fmt, ##args);	\
+	} while (0)
+
+#define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
+#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
new file mode 100644
index 0000000..0338dd0
--- /dev/null
+++ b/checkpoint/cpt_image.h
@@ -0,0 +1,63 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#ifndef __CPT_IMAGE_H_
+#define __CPT_IMAGE_H_ 1
+
+enum _cpt_object_type
+{
+	CPT_OBJ_TASK = 0,
+	CPT_OBJ_MAX,
+	/* The objects above are stored in memory while checkpointing */
+
+	CPT_OBJ_HEAD = 1024,
+};
+
+enum _cpt_content_type {
+	CPT_CONTENT_VOID,
+	CPT_CONTENT_ARRAY,
+	CPT_CONTENT_DATA,
+	CPT_CONTENT_NAME,
+	CPT_CONTENT_REF,
+	CPT_CONTENT_MAX
+};
+
+#define CPT_SIGNATURE0 0x79
+#define CPT_SIGNATURE1 0x1c
+#define CPT_SIGNATURE2 0x01
+#define CPT_SIGNATURE3 0x63
+
+struct cpt_head
+{
+	__u8	cpt_signature[4];	/* Magic number */
+	__u32	cpt_hdrlen;		/* Header length */
+	__u16	cpt_image_major;	/* Format of this file */
+	__u16	cpt_image_minor;	/* Format of this file */
+	__u16	cpt_image_sublevel;	/* Format of this file */
+	__u16	cpt_image_extra;	/* Format of this file */
+	__u16	cpt_arch;		/* Architecture */
+#define CPT_ARCH_I386		0
+	__u16	cpt_pad1;
+	__u32	cpt_pad2;
+	__u64	cpt_time;		/* Time */
+} __attribute__ ((aligned (8)));
+
+/* Common object header. */
+struct cpt_object_hdr
+{
+	__u64	cpt_len;		/* Size of current chunk of data */
+	__u32	cpt_hdrlen;		/* Size of header */
+	__u16	cpt_type;		/* Type of object */
+	__u16	cpt_content;		/* Content type: array, reference... */
+} __attribute__ ((aligned (8)));
+
+#endif /* __CPT_IMAGE_H_ */
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 010e4eb..a561a06 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -13,21 +13,140 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/file.h>
-#include <linux/notifier.h>
+#include <linux/uaccess.h>
 #include <linux/module.h>
 
 #include "checkpoint.h"
+#include "cpt_image.h"
 
 MODULE_LICENSE("GPL");
 
+/* Debug level, constant for now */
+int debug_level = 1;
+
+static int file_write(const void *addr, size_t count, struct cpt_context *ctx)
+{
+	mm_segment_t oldfs;
+	ssize_t err = -EBADF;
+	struct file *file = ctx->file;
+
+	oldfs = get_fs(); set_fs(KERNEL_DS);
+	if (file)
+		err = file->f_op->write(file, addr, count, &file->f_pos);
+	set_fs(oldfs);
+	if (err != count)
+		return err >= 0 ? -EIO : err;
+	return 0;
+}
+
+static int file_read(void *addr, size_t count, struct cpt_context *ctx)
+{
+	mm_segment_t oldfs;
+	ssize_t err = -EBADF;
+	struct file *file = ctx->file;
+
+	oldfs = get_fs(); set_fs(KERNEL_DS);
+	if (file)
+		err = file->f_op->read(file, addr, count, &file->f_pos);
+	set_fs(oldfs);
+	if (err != count)
+		return err >= 0 ? -EIO : err;
+	return 0;
+}
+
+struct cpt_context * context_alloc(void)
+{
+	struct cpt_context *ctx;
+	int i;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	init_MUTEX(&ctx->main_sem);
+	ctx->refcount = 1;
+
+	ctx->current_object = -1;
+	ctx->write = file_write;
+	ctx->read = file_read;
+	for (i = 0; i < CPT_OBJ_MAX; i++) {
+		INIT_LIST_HEAD(&ctx->object_array[i]);
+	}
+
+	return ctx;
+}
+
+void context_release(struct cpt_context *ctx)
+{
+	ctx->ctx_state = CPT_CTX_ERROR;
+
+	kfree(ctx);
+}
+
+static void context_put(struct cpt_context *ctx)
+{
+	if (!--ctx->refcount)
+		context_release(ctx);
+}
+
 static int checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	return -ENOSYS; 
+	struct file *file;
+	struct cpt_context *ctx;
+	int err;
+
+	err = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+
+	err = -ENOMEM;
+	ctx = context_alloc();
+	if (!ctx)
+		goto out_file;
+
+	ctx->file = file;
+	ctx->ctx_state = CPT_CTX_DUMPING;
+
+	/* checkpoint */
+	err = -ENOSYS;
+
+	context_put(ctx);
+
+out_file:
+	fput(file);
+out:
+	return err; 
 }
 
 static int restart(int ctid, int fd, unsigned long flags)
 {
-	return -ENOSYS; 
+	struct file *file;
+	struct cpt_context *ctx;
+	int err;
+
+	err = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+
+	err = -ENOMEM;
+	ctx = context_alloc();
+	if (!ctx)
+		goto out_file;
+
+	ctx->file = file;
+	ctx->ctx_state = CPT_CTX_UNDUMPING;
+
+	/* restart */
+	err = -ENOSYS;
+
+	context_put(ctx);
+
+out_file:
+	fput(file);
+out:
+	return err; 
 }
 
 static int __init init_cptrst(void)
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 04/10] Introduce container dump function
  2008-10-17 23:11     ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Andrey Mirkin
@ 2008-10-17 23:11       ` Andrey Mirkin
  2008-10-17 23:11         ` [PATCH 05/10] Introduce function to dump process Andrey Mirkin
  2008-10-20 17:02       ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Dave Hansen
  1 sibling, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Actually right now we are going to dump only one process.
Function for dumping head of image file are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/Makefile     |    2 +-
 checkpoint/checkpoint.c |   79 +++++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.h |    3 ++
 checkpoint/sys.c        |    3 +-
 kernel/fork.c           |    2 +
 5 files changed, 87 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/checkpoint.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bfe75d5..173346b 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o
+cptrst-objs := sys.o checkpoint.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..c4bddce
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,79 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/nsproxy.h>
+
+#include "checkpoint.h"
+
+static int cpt_write_head(struct cpt_context *ctx)
+{
+	struct cpt_head hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.cpt_signature[0] = CPT_SIGNATURE0;
+	hdr.cpt_signature[1] = CPT_SIGNATURE1;
+	hdr.cpt_signature[2] = CPT_SIGNATURE2;
+	hdr.cpt_signature[3] = CPT_SIGNATURE3;
+	hdr.cpt_hdrlen = sizeof(hdr);
+	hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff;
+	hdr.cpt_image_extra = 0;
+#if defined(CONFIG_X86_32)
+	hdr.cpt_arch = CPT_ARCH_I386;
+#else
+#error  Arch is not supported
+#endif
+	return ctx->write(&hdr, sizeof(hdr), ctx);
+}
+
+int dump_container(struct cpt_context *ctx)
+{
+	int err;
+	struct task_struct *root;
+
+	read_lock(&tasklist_lock);
+	root = find_task_by_vpid(ctx->pid);
+	if (root)
+		get_task_struct(root);
+	read_unlock(&tasklist_lock);
+
+	err = -ESRCH;
+	if (!root) {
+		eprintk("can not find root task\n");
+		return err;
+	}
+	rcu_read_lock();
+	ctx->nsproxy = task_nsproxy(root);
+	if (!ctx->nsproxy) {
+		eprintk("nsproxy is null\n");
+		rcu_read_unlock();
+		goto out;
+	}
+	get_nsproxy(ctx->nsproxy);
+	rcu_read_unlock();
+
+	err = cpt_write_head(ctx);
+	
+	/* Dump task here */
+	if (!err)
+		err = -ENOSYS;	
+
+out:
+	ctx->nsproxy = NULL;
+	put_task_struct(root);
+	return err;
+}
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 8ea73f5..6926aa2 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -36,6 +36,7 @@ typedef struct cpt_context
 	int		refcount;
 	int		ctx_state;
 	struct semaphore main_sem;
+	struct nsproxy	*nsproxy;
 
 	int		errno;
 
@@ -57,3 +58,5 @@ extern int debug_level;
 
 #define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
 #define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
+
+int dump_container(struct cpt_context *ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index a561a06..1902fef 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -107,9 +107,10 @@ static int checkpoint(pid_t pid, int fd, unsigned long flags)
 
 	ctx->file = file;
 	ctx->ctx_state = CPT_CTX_DUMPING;
+	ctx->pid = pid;
 
 	/* checkpoint */
-	err = -ENOSYS;
+	err = dump_container(ctx);
 
 	context_put(ctx);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 52b5037..f38b43d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -77,6 +77,7 @@ int max_threads;		/* tunable limit on nr_threads */
 DEFINE_PER_CPU(unsigned long, process_counts) = 0;
 
 __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
+EXPORT_SYMBOL(tasklist_lock);
 
 int nr_processes(void)
 {
@@ -153,6 +154,7 @@ void __put_task_struct(struct task_struct *tsk)
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL(__put_task_struct);
 
 /*
  * macro override instead of weak attribute alias, to workaround
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 05/10] Introduce function to dump process
  2008-10-17 23:11       ` [PATCH 04/10] Introduce container dump function Andrey Mirkin
@ 2008-10-17 23:11         ` Andrey Mirkin
  2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
                             ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Functions to dump task struct, fpu state and registers are added.
All IDs are saved from the POV of process (container) namespace.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/Makefile      |    2 +-
 checkpoint/checkpoint.c  |    2 +-
 checkpoint/checkpoint.h  |    1 +
 checkpoint/cpt_image.h   |  123 ++++++++++++++++++++++++
 checkpoint/cpt_process.c |  236 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 362 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/cpt_process.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 173346b..457cc96 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c4bddce..aae198d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -70,7 +70,7 @@ int dump_container(struct cpt_context *ctx)
 	
 	/* Dump task here */
 	if (!err)
-		err = -ENOSYS;	
+		err = cpt_dump_task(root, ctx);
 
 out:
 	ctx->nsproxy = NULL;
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 6926aa2..9e46b10 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -60,3 +60,4 @@ extern int debug_level;
 #define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
 
 int dump_container(struct cpt_context *ctx);
+int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index 0338dd0..cddfe37 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -13,6 +13,9 @@
 #ifndef __CPT_IMAGE_H_
 #define __CPT_IMAGE_H_ 1
 
+#include <linux/sched.h>
+#include <asm/segment.h>
+
 enum _cpt_object_type
 {
 	CPT_OBJ_TASK = 0,
@@ -20,6 +23,8 @@ enum _cpt_object_type
 	/* The objects above are stored in memory while checkpointing */
 
 	CPT_OBJ_HEAD = 1024,
+	CPT_OBJ_X86_REGS,
+	CPT_OBJ_BITS,
 };
 
 enum _cpt_content_type {
@@ -28,6 +33,8 @@ enum _cpt_content_type {
 	CPT_CONTENT_DATA,
 	CPT_CONTENT_NAME,
 	CPT_CONTENT_REF,
+	CPT_CONTENT_X86_FPUSTATE,
+	CPT_CONTENT_X86_FPUSTATE_OLD,
 	CPT_CONTENT_MAX
 };
 
@@ -60,4 +67,120 @@ struct cpt_object_hdr
 	__u16	cpt_content;		/* Content type: array, reference... */
 } __attribute__ ((aligned (8)));
 
+struct cpt_task_image {
+	__u64	cpt_len;
+	__u32	cpt_hdrlen;
+	__u16	cpt_type;
+	__u16	cpt_content;
+
+	__u64	cpt_state;
+	__u64	cpt_flags;
+#define CPT_PF_EXITING		0
+#define CPT_PF_FORKNOEXEC	1
+#define CPT_PF_SUPERPRIV	2
+#define CPT_PF_DUMPCORE		3
+#define CPT_PF_SIGNALED		4
+#define CPT_PF_USED_MATH	5
+
+	__u64	cpt_thrflags;
+	__u64	cpt_thrstatus;
+	__u32	cpt_pid;
+	__u32	cpt_tgid;
+	__u32	cpt_ppid;
+	__u32	cpt_rppid;
+	__u32	cpt_pgrp;
+	__u32	cpt_session;
+	__u32	cpt_old_pgrp;
+	__u32	cpt_leader;
+	__u64	cpt_set_tid;
+	__u64	cpt_clear_tid;
+	__u32	cpt_exit_code;
+	__u32	cpt_exit_signal;
+	__u32	cpt_pdeath_signal;
+	__u32	cpt_user;
+	__u32	cpt_uid;
+	__u32	cpt_euid;
+	__u32	cpt_suid;
+	__u32	cpt_fsuid;
+	__u32	cpt_gid;
+	__u32	cpt_egid;
+	__u32	cpt_sgid;
+	__u32	cpt_fsgid;
+	__u8	cpt_comm[TASK_COMM_LEN];
+	__u64	cpt_tls[GDT_ENTRY_TLS_ENTRIES];
+	__u64	cpt_utime;
+	__u64	cpt_stime;
+	__u64	cpt_utimescaled;
+	__u64	cpt_stimescaled;
+	__u64	cpt_gtime;
+	__u64	cpt_prev_utime;
+	__u64	cpt_prev_stime;
+	__u64	cpt_start_time;
+	__u64	cpt_real_start_time;
+	__u64	cpt_nvcsw;
+	__u64	cpt_nivcsw;
+	__u64	cpt_min_flt;
+	__u64	cpt_maj_flt;
+} __attribute__ ((aligned (8)));
+
+struct cpt_obj_bits
+{
+	__u64	cpt_len;
+	__u32	cpt_hdrlen;
+	__u16	cpt_type;
+	__u16	cpt_content;
+
+	__u32	cpt_size;
+	__u32	__cpt_pad1;
+} __attribute__ ((aligned (8)));
+
+#define CPT_SEG_ZERO		0
+#define CPT_SEG_TLS1		1
+#define CPT_SEG_TLS2		2
+#define CPT_SEG_TLS3		3
+#define CPT_SEG_USER32_DS       4
+#define CPT_SEG_USER32_CS       5
+#define CPT_SEG_USER64_DS       6
+#define CPT_SEG_USER64_CS       7
+#define CPT_SEG_LDT             256
+
+struct cpt_x86_regs
+{
+	__u64	cpt_len;
+	__u32	cpt_hdrlen;
+	__u16	cpt_type;
+	__u16	cpt_content;
+
+	__u32	cpt_debugreg[8];
+	__u32	cpt_gs;
+
+	__u32	cpt_bx;
+	__u32	cpt_cx;
+	__u32	cpt_dx;
+	__u32	cpt_si;
+	__u32	cpt_di;
+	__u32	cpt_bp;
+	__u32	cpt_ax;
+	__u32	cpt_ds;
+	__u32	cpt_es;
+	__u32	cpt_fs;
+	__u32	cpt_orig_ax;
+	__u32	cpt_ip;
+	__u32	cpt_cs;
+	__u32	cpt_flags;
+	__u32	cpt_sp;
+	__u32	cpt_ss;
+} __attribute__ ((aligned (8)));
+
+static inline __u64 cpt_timespec_export(struct timespec *tv)
+{
+	return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
+}
+
+static inline void cpt_timespec_import(struct timespec *tv, __u64 val)
+{
+	tv->tv_sec = val >> 32;
+	tv->tv_nsec = (val & 0xFFFFFFFF);
+}
+
 #endif /* __CPT_IMAGE_H_ */
diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
new file mode 100644
index 0000000..58f608d
--- /dev/null
+++ b/checkpoint/cpt_process.c
@@ -0,0 +1,236 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/nsproxy.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+static unsigned int encode_task_flags(unsigned int task_flags)
+{
+	unsigned int flags = 0;
+
+	if (task_flags & PF_EXITING)
+		flags |= (1 << CPT_PF_EXITING);
+	if (task_flags & PF_FORKNOEXEC)
+		flags |= (1 << CPT_PF_FORKNOEXEC);
+	if (task_flags & PF_SUPERPRIV)
+		flags |= (1 << CPT_PF_SUPERPRIV);
+	if (task_flags & PF_DUMPCORE)
+		flags |= (1 << CPT_PF_DUMPCORE);
+	if (task_flags & PF_SIGNALED)
+		flags |= (1 << CPT_PF_SIGNALED);
+	if (task_flags & PF_USED_MATH)
+		flags |= (1 << CPT_PF_USED_MATH);
+	
+	return flags;
+		
+}
+
+int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	struct cpt_task_image *t;
+	int i;
+	int err;
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	t->cpt_len = sizeof(*t);
+	t->cpt_type = CPT_OBJ_TASK;
+	t->cpt_hdrlen = sizeof(*t);
+	t->cpt_content = CPT_CONTENT_ARRAY;
+
+	t->cpt_state = tsk->state;
+	t->cpt_flags = encode_task_flags(tsk->flags);
+	t->cpt_exit_code = tsk->exit_code;
+	t->cpt_exit_signal = tsk->exit_signal;
+	t->cpt_pdeath_signal = tsk->pdeath_signal;
+	t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
+	t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
+	t->cpt_ppid = tsk->parent ?
+		task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
+	t->cpt_rppid = tsk->real_parent ?
+		task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
+	t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
+	t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
+	t->cpt_old_pgrp = 0;
+	if (tsk->signal->tty_old_pgrp)
+		t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
+	t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0;
+	t->cpt_utime = tsk->utime;
+	t->cpt_stime = tsk->stime;
+	t->cpt_utimescaled = tsk->utimescaled;
+	t->cpt_stimescaled = tsk->stimescaled;
+	t->cpt_gtime = tsk->gtime;
+	t->cpt_prev_utime = tsk->prev_utime;
+	t->cpt_prev_stime = tsk->prev_stime;
+	t->cpt_nvcsw = tsk->nvcsw;
+	t->cpt_nivcsw = tsk->nivcsw;
+	t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
+	t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
+	t->cpt_min_flt = tsk->min_flt;
+	t->cpt_maj_flt = tsk->maj_flt;
+	memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
+			tsk->thread.tls_array[i].a;
+	}
+	/* TODO: encode thread flags and status like task flags */
+	t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE);
+	t->cpt_thrstatus = task_thread_info(tsk)->status;
+	t->cpt_user = tsk->user->uid;
+	t->cpt_uid = tsk->uid;
+	t->cpt_euid = tsk->euid;
+	t->cpt_suid = tsk->suid;
+	t->cpt_fsuid = tsk->fsuid;
+	t->cpt_gid = tsk->gid;
+	t->cpt_egid = tsk->egid;
+	t->cpt_sgid = tsk->sgid;
+	t->cpt_fsgid = tsk->fsgid;
+
+	err = ctx->write(t, sizeof(*t), ctx);
+
+	kfree(t);
+	return err;
+}
+
+static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	struct cpt_obj_bits hdr;
+	int err;
+	int content;
+	unsigned long size;
+
+	content = CPT_CONTENT_X86_FPUSTATE;
+	size = sizeof(struct i387_fxsave_struct);
+#ifndef CONFIG_X86_64
+	if (!cpu_has_fxsr) {
+		size = sizeof(struct i387_fsave_struct);
+		content = CPT_CONTENT_X86_FPUSTATE_OLD;
+	}
+#endif
+
+	hdr.cpt_len = sizeof(hdr) + size;
+	hdr.cpt_type = CPT_OBJ_BITS;
+	hdr.cpt_hdrlen = sizeof(hdr);
+	hdr.cpt_content = content;
+	hdr.cpt_size = size;
+	err = ctx->write(&hdr, sizeof(hdr), ctx);
+	if (!err)
+		ctx->write(tsk->thread.xstate, size, ctx);
+	return err;
+}
+
+static u32 encode_segment(u32 segreg)
+{
+	segreg &= 0xFFFF;
+
+	if (segreg == 0)
+		return CPT_SEG_ZERO;
+	if ((segreg & 3) != 3) {
+		eprintk("Invalid RPL of a segment reg %x\n", segreg);
+		return CPT_SEG_ZERO;
+	}
+
+	/* LDT descriptor, it is just an index to LDT array */
+	if (segreg & 4)
+		return CPT_SEG_LDT + (segreg >> 3);
+
+	/* TLS descriptor. */
+	if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
+			(segreg >> 3) <= GDT_ENTRY_TLS_MAX)
+		return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN);
+
+	/* One of standard desriptors */
+#ifdef CONFIG_X86_64
+	if (segreg == __USER32_DS)
+		return CPT_SEG_USER32_DS;
+	if (segreg == __USER32_CS)
+		return CPT_SEG_USER32_CS;
+	if (segreg == __USER_DS)
+		return CPT_SEG_USER64_DS;
+	if (segreg == __USER_CS)
+		return CPT_SEG_USER64_CS;
+#else
+	if (segreg == __USER_DS)
+		return CPT_SEG_USER32_DS;
+	if (segreg == __USER_CS)
+		return CPT_SEG_USER32_CS;
+#endif
+	eprintk("Invalid segment reg %x\n", segreg);
+	return CPT_SEG_ZERO;
+}
+
+static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	struct cpt_x86_regs ri;
+	struct pt_regs *pt_regs;
+
+	ri.cpt_len = sizeof(ri);
+	ri.cpt_type = CPT_OBJ_X86_REGS;
+	ri.cpt_hdrlen = sizeof(ri);
+	ri.cpt_content = CPT_CONTENT_VOID;
+
+	ri.cpt_debugreg[0] = tsk->thread.debugreg0;
+	ri.cpt_debugreg[1] = tsk->thread.debugreg1;
+	ri.cpt_debugreg[2] = tsk->thread.debugreg2;
+	ri.cpt_debugreg[3] = tsk->thread.debugreg3;
+	ri.cpt_debugreg[4] = 0;
+	ri.cpt_debugreg[5] = 0;
+	ri.cpt_debugreg[6] = tsk->thread.debugreg6;
+	ri.cpt_debugreg[7] = tsk->thread.debugreg7;
+
+	pt_regs = task_pt_regs(tsk);
+
+	ri.cpt_fs = encode_segment(pt_regs->fs);
+	ri.cpt_gs = encode_segment(tsk->thread.gs);
+
+	ri.cpt_bx = pt_regs->bx;
+	ri.cpt_cx = pt_regs->cx;
+	ri.cpt_dx = pt_regs->dx;
+	ri.cpt_si = pt_regs->si;
+	ri.cpt_di = pt_regs->di;
+	ri.cpt_bp = pt_regs->bp;
+	ri.cpt_ax = pt_regs->ax;
+	ri.cpt_ds = encode_segment(pt_regs->ds);
+	ri.cpt_es = encode_segment(pt_regs->es);
+	ri.cpt_orig_ax = pt_regs->orig_ax;
+	ri.cpt_ip = pt_regs->ip;
+	ri.cpt_cs = encode_segment(pt_regs->cs);
+	ri.cpt_flags = pt_regs->flags;
+	ri.cpt_sp = pt_regs->sp;
+	ri.cpt_ss = encode_segment(pt_regs->ss);
+	
+	return ctx->write(&ri, sizeof(ri), ctx);
+}
+
+int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	int err;
+
+	err = cpt_dump_task_struct(tsk, ctx);
+
+	/* Dump task mm */
+
+	if (!err)
+		cpt_dump_fpustate(tsk, ctx);
+	if (!err)
+		cpt_dump_registers(tsk, ctx);
+
+	return err;
+}
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 06/10] Introduce functions to dump mm
  2008-10-17 23:11         ` [PATCH 05/10] Introduce function to dump process Andrey Mirkin
@ 2008-10-17 23:11           ` Andrey Mirkin
  2008-10-17 23:11             ` [PATCH 07/10] Introduce function for restarting a container Andrey Mirkin
                               ` (2 more replies)
  2008-10-20 11:02           ` [PATCH 05/10] Introduce function to dump process Louis Rilling
  2008-10-20 17:48           ` Serge E. Hallyn
  2 siblings, 3 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Functions to dump mm struct, VMAs and mm context are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 arch/x86/mm/hugetlbpage.c |    2 +
 checkpoint/Makefile       |    2 +-
 checkpoint/checkpoint.h   |    1 +
 checkpoint/cpt_image.h    |   61 +++++++
 checkpoint/cpt_mm.c       |  434 +++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/cpt_process.c  |    8 +-
 mm/memory.c               |    1 +
 7 files changed, 504 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/cpt_mm.c

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..63028e7 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
+#include <linux/module.h>
 #include <asm/mman.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -221,6 +222,7 @@ int pmd_huge(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_PSE);
 }
+EXPORT_SYMBOL(pmd_huge);
 
 int pud_huge(pud_t pud)
 {
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 457cc96..bbb0e37 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o cpt_process.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 9e46b10..e3e6b66 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -61,3 +61,4 @@ extern int debug_level;
 
 int dump_container(struct cpt_context *ctx);
 int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
+int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index cddfe37..160cf85 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -16,13 +16,19 @@
 #include <linux/sched.h>
 #include <asm/segment.h>
 
+#define CPT_NULL (~0ULL)
+
 enum _cpt_object_type
 {
 	CPT_OBJ_TASK = 0,
+	CPT_OBJ_MM,
 	CPT_OBJ_MAX,
 	/* The objects above are stored in memory while checkpointing */
 
 	CPT_OBJ_HEAD = 1024,
+	CPT_OBJ_VMA,
+	CPT_OBJ_PAGES,
+	CPT_OBJ_NAME,
 	CPT_OBJ_X86_REGS,
 	CPT_OBJ_BITS,
 };
@@ -35,6 +41,7 @@ enum _cpt_content_type {
 	CPT_CONTENT_REF,
 	CPT_CONTENT_X86_FPUSTATE,
 	CPT_CONTENT_X86_FPUSTATE_OLD,
+	CPT_CONTENT_MM_CONTEXT,
 	CPT_CONTENT_MAX
 };
 
@@ -123,6 +130,60 @@ struct cpt_task_image {
 	__u64	cpt_maj_flt;
 } __attribute__ ((aligned (8)));
 
+struct cpt_mm_image {
+	__u64	cpt_len;
+	__u32	cpt_hdrlen;
+	__u16	cpt_type;
+	__u16	cpt_content;
+
+	__u64	cpt_start_code;
+	__u64	cpt_end_code;
+	__u64	cpt_start_data;
+	__u64	cpt_end_data;
+	__u64	cpt_start_brk;
+	__u64	cpt_brk;
+	__u64	cpt_start_stack;
+	__u64	cpt_start_arg;
+	__u64	cpt_end_arg;
+	__u64	cpt_start_env;
+	__u64	cpt_end_env;
+	__u64	cpt_def_flags;
+	__u64	cpt_flags;
+	__u64	cpt_map_count;
+} __attribute__ ((aligned (8)));
+
+struct cpt_vma_image
+{
+	__u64	cpt_len;
+	__u32	cpt_hdrlen;
+	__u16	cpt_type;
+	__u16	cpt_content;
+
+	__u64	cpt_file;
+	__u32	cpt_vma_type;
+#define CPT_VMA_TYPE_0		0
+#define CPT_VMA_FILE		1
+	__u32	cpt_pad;
+
+	__u64	cpt_start;
+	__u64	cpt_end;
+	__u64	cpt_flags;
+	__u64	cpt_pgprot;
+	__u64	cpt_pgoff;
+	__u64	cpt_page_num;
+} __attribute__ ((aligned (8)));
+
+struct cpt_page_block
+{
+	__u64	cpt_len;
+	__u32	cpt_hdrlen;
+	__u16	cpt_type;
+	__u16	cpt_content;
+
+	__u64	cpt_start;
+	__u64	cpt_end;
+} __attribute__ ((aligned (8)));
+
 struct cpt_obj_bits
 {
 	__u64	cpt_len;
diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
new file mode 100644
index 0000000..8a22c48
--- /dev/null
+++ b/checkpoint/cpt_mm.c
@@ -0,0 +1,434 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Authors:	Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/errno.h>
+#include <linux/major.h>
+#include <linux/mman.h>
+#include <linux/mnt_namespace.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+#include <asm/ldt.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+struct page_area
+{
+	int type;
+	unsigned long start;
+	unsigned long end;
+	pgoff_t pgoff;
+	loff_t mm;
+	__u64 list[16];
+};
+
+struct page_desc
+{
+	int	type;
+	pgoff_t	index;
+	loff_t	mm;
+	int	shared;
+};
+
+enum {
+	PD_ABSENT,
+	PD_COPY,
+	PD_FUNKEY,
+};
+
+/* 0: page can be obtained from backstore, or still not mapped anonymous  page,
+      or something else, which does not requre copy.
+   1: page requires copy
+   2: page requres copy but its content is zero. Quite useless.
+   3: wp page is shared after fork(). It is to be COWed when modified.
+   4: page is something unsupported... We copy it right now.
+ */
+
+static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
+			  struct page_desc *pdesc, cpt_context_t * ctx)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	struct page *pg = NULL;
+	pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
+
+	pdesc->index = linear_index;
+	pdesc->shared = 0;
+	pdesc->mm = CPT_NULL;
+
+	if (vma->vm_flags & VM_IO) {
+		pdesc->type = PD_ABSENT;
+		return;
+	}
+
+	pgd = pgd_offset(mm, addr);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_absent;
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_absent;
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_absent;
+#ifdef CONFIG_X86
+	if (pmd_huge(*pmd)) {
+		eprintk("page_huge\n");
+		goto out_unsupported;
+	}
+#endif
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	pte = *ptep;
+	pte_unmap(ptep);
+
+	if (pte_none(pte))
+		goto out_absent_unlock;
+
+	if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
+		pdesc->type = PD_COPY;
+		goto out_unlock;
+	}
+
+	get_page(pg);
+	spin_unlock(ptl);
+
+	if (pg->mapping && !PageAnon(pg)) {
+		if (vma->vm_file == NULL) {
+			eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
+			goto out_unsupported;
+		}
+		if (vma->vm_file->f_mapping != pg->mapping) {
+			eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
+				    addr, vma->vm_file->f_mapping, pg->mapping);
+			goto out_unsupported;
+		}
+		pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+		/* Page is in backstore. For us it is like
+		 * it is not present.
+		 */
+		goto out_absent;
+	}
+
+	if (PageReserved(pg)) {
+		/* Special case: ZERO_PAGE is used, when an
+		 * anonymous page is accessed but not written. */
+		if (pg == ZERO_PAGE(addr)) {
+			if (pte_write(pte)) {
+				eprintk("not funny already, writable ZERO_PAGE\n");
+				goto out_unsupported;
+			}
+			/* Just copy it for now */
+			pdesc->type = PD_COPY;
+			goto out_put;
+		}
+		eprintk("reserved page %lu at %08lx\n", pg->index, addr);
+		goto out_unsupported;
+	}
+
+	if (!pg->mapping) {
+		eprintk("page without mapping at %08lx\n", addr);
+		goto out_unsupported;
+	}
+
+	pdesc->type = PD_COPY;
+
+out_put:
+	if (pg)
+		put_page(pg);
+	return;
+
+out_unlock:
+	spin_unlock(ptl);
+	goto out_put;
+
+out_absent_unlock:
+	spin_unlock(ptl);
+
+out_absent:
+	pdesc->type = PD_ABSENT;
+	goto out_put;
+
+out_unsupported:
+	pdesc->type = PD_FUNKEY;
+	goto out_put;
+}
+
+static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx)
+{
+	unsigned long addr;
+	int page_num = 0;
+
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+		struct page_desc pd;
+
+		page_get_desc(vma, addr, &pd, ctx);
+
+		if (pd.type != PD_COPY) {
+			return -EINVAL;
+		} else {
+			page_num += 1;
+		}
+		
+	}
+	return page_num;
+}
+
+/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages()
+ * does not really need this thing. It just stores some page fault stats there.
+ *
+ * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages
+ * before accessing vma.
+ */
+static int dump_pages(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, struct cpt_context *ctx)
+{
+#define MAX_PAGE_BATCH 16
+	struct page *pg[MAX_PAGE_BATCH];
+	int npages = (end - start)/PAGE_SIZE;
+	int count = 0;
+
+	while (count < npages) {
+		int copy = npages - count;
+		int n;
+
+		if (copy > MAX_PAGE_BATCH)
+			copy = MAX_PAGE_BATCH;
+		n = get_user_pages(current, vma->vm_mm, start, copy,
+				   0, 1, pg, NULL);
+		if (n == copy) {
+			int i;
+			for (i=0; i<n; i++) {
+				char *maddr = kmap(pg[i]);
+				ctx->write(maddr, PAGE_SIZE, ctx);
+				kunmap(pg[i]);
+			}
+		} else {
+			eprintk("get_user_pages fault");
+			for ( ; n > 0; n--)
+				page_cache_release(pg[n-1]);
+			return -EFAULT;
+		}
+		start += n*PAGE_SIZE;
+		count += n;
+		for ( ; n > 0; n--)
+			page_cache_release(pg[n-1]);
+	}
+	return 0;
+}
+
+static int dump_page_block(struct vm_area_struct *vma,
+			   struct cpt_page_block *pgb,
+			   struct cpt_context *ctx)
+{
+	int err;
+	pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
+	pgb->cpt_type = CPT_OBJ_PAGES;
+	pgb->cpt_hdrlen = sizeof(*pgb);
+	pgb->cpt_content = CPT_CONTENT_DATA;
+
+	err = ctx->write(pgb, sizeof(*pgb), ctx);
+	if (!err)
+		err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
+
+	return err;
+}
+
+static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
+{
+	int len;
+	char *path;
+	char *buf;
+	struct cpt_object_hdr o;
+
+	buf = (char *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	path = d_path(p, buf, PAGE_SIZE);
+
+	if (IS_ERR(path)) {
+		free_page((unsigned long)buf);
+		return PTR_ERR(path);
+	}
+
+	len = buf + PAGE_SIZE - 1 - path;
+	o.cpt_len = sizeof(o) + len + 1;
+	o.cpt_type = CPT_OBJ_NAME;
+	o.cpt_hdrlen = sizeof(o);
+	o.cpt_content = CPT_CONTENT_NAME;
+	path[len] = 0;
+
+	ctx->write(&o, sizeof(o), ctx);
+	ctx->write(path, len + 1, ctx);
+	free_page((unsigned long)buf);
+
+	return 0;
+}
+
+static int dump_one_vma(struct mm_struct *mm,
+			struct vm_area_struct *vma, struct cpt_context *ctx)
+{
+	struct cpt_vma_image *v;
+	unsigned long addr;
+	int page_num;
+	int err;
+
+	v = kzalloc(sizeof(*v), GFP_KERNEL);
+	if (!v)
+		return -ENOMEM;
+
+	v->cpt_len = sizeof(*v);
+	v->cpt_type = CPT_OBJ_VMA;
+	v->cpt_hdrlen = sizeof(*v);
+	v->cpt_content = CPT_CONTENT_ARRAY;
+
+	v->cpt_start = vma->vm_start;
+	v->cpt_end = vma->vm_end;
+	v->cpt_flags = vma->vm_flags;
+	if (vma->vm_flags & VM_HUGETLB) {
+		eprintk("huge TLB VMAs are still not supported\n");
+		kfree(v);
+		return -EINVAL;
+	}
+	v->cpt_pgprot = vma->vm_page_prot.pgprot;
+	v->cpt_pgoff = vma->vm_pgoff;
+	v->cpt_file = CPT_NULL;
+	v->cpt_vma_type = CPT_VMA_TYPE_0;
+
+	page_num = count_vma_pages(vma, ctx);
+	if (page_num < 0) {
+		kfree(v);
+		return -EINVAL;
+	}
+	v->cpt_page_num = page_num;
+
+	if (vma->vm_file) {
+		v->cpt_file = 0;
+		v->cpt_vma_type = CPT_VMA_FILE;
+	}
+
+	ctx->write(v, sizeof(*v), ctx);
+	kfree(v);
+
+	if (vma->vm_file) {
+		err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
+		if (err < 0)
+			return err;
+	}
+
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+		struct page_desc pd;
+		struct cpt_page_block pgb;
+
+		page_get_desc(vma, addr, &pd, ctx);
+
+		if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
+			eprintk("dump_one_vma: funkey page\n");
+			return -EINVAL;
+		}
+
+		pgb.cpt_start = addr;
+		pgb.cpt_end = addr + PAGE_SIZE;
+		dump_page_block(vma, &pgb, ctx);
+	}
+
+	return 0;
+}
+
+static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
+{
+#ifdef CONFIG_X86
+	if (mm->context.size) {
+		struct cpt_obj_bits b;
+		int size;
+
+		mutex_lock(&mm->context.lock);
+
+		b.cpt_type = CPT_OBJ_BITS;
+		b.cpt_len = sizeof(b);
+		b.cpt_content = CPT_CONTENT_MM_CONTEXT;
+		b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
+
+		ctx->write(&b, sizeof(b), ctx);
+
+		size = mm->context.size * LDT_ENTRY_SIZE;
+
+		ctx->write(mm->context.ldt, size, ctx);
+
+		mutex_unlock(&mm->context.lock);
+	}
+#endif
+	return 0;
+}
+
+int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct cpt_mm_image *v;
+	struct vm_area_struct *vma;
+	int err;
+
+	v = kzalloc(sizeof(*v), GFP_KERNEL);
+	if (!v)
+		return -ENOMEM;
+
+	v->cpt_len = sizeof(*v);
+	v->cpt_type = CPT_OBJ_MM;
+	v->cpt_hdrlen = sizeof(*v);
+	v->cpt_content = CPT_CONTENT_ARRAY;
+
+	down_read(&mm->mmap_sem);
+	v->cpt_start_code = mm->start_code;
+	v->cpt_end_code = mm->end_code;
+	v->cpt_start_data = mm->start_data;
+	v->cpt_end_data = mm->end_data;
+	v->cpt_start_brk = mm->start_brk;
+	v->cpt_brk = mm->brk;
+	v->cpt_start_stack = mm->start_stack;
+	v->cpt_start_arg = mm->arg_start;
+	v->cpt_end_arg = mm->arg_end;
+	v->cpt_start_env = mm->env_start;
+	v->cpt_end_env = mm->env_end;
+	v->cpt_def_flags = mm->def_flags;
+	v->cpt_flags = mm->flags;
+	v->cpt_map_count = mm->map_count;
+
+	err = ctx->write(v, sizeof(*v), ctx);
+	kfree(v);
+	
+	if (err) {
+		eprintk("error during writing mm\n");
+		goto err_up;
+	}
+	
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if ((err = dump_one_vma(mm, vma, ctx)) != 0)
+			goto err_up;
+	}
+
+	err = cpt_dump_mm_context(mm, ctx);
+
+err_up:
+	up_read(&mm->mmap_sem);
+
+	return err;
+}
+
diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
index 58f608d..1f7a54b 100644
--- a/checkpoint/cpt_process.c
+++ b/checkpoint/cpt_process.c
@@ -225,12 +225,12 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
 
 	err = cpt_dump_task_struct(tsk, ctx);
 
-	/* Dump task mm */
-
 	if (!err)
-		cpt_dump_fpustate(tsk, ctx);
+		err = cpt_dump_mm(tsk, ctx);
+	if (!err)
+		err = cpt_dump_fpustate(tsk, ctx);
 	if (!err)
-		cpt_dump_registers(tsk, ctx);
+		err = cpt_dump_registers(tsk, ctx);
 
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 1002f47..479a294 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -481,6 +481,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 out:
 	return pfn_to_page(pfn);
 }
+EXPORT_SYMBOL(vm_normal_page);
 
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 07/10] Introduce function for restarting a container
  2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
@ 2008-10-17 23:11             ` Andrey Mirkin
  2008-10-17 23:11               ` [PATCH 08/10] Introduce functions to restart a process Andrey Mirkin
  2008-10-20 12:25             ` [PATCH 06/10] Introduce functions to dump mm Louis Rilling
  2008-10-20 17:21             ` Dave Hansen
  2 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Actually, right now this function will restart only one process.
Function to read head of dump file is introduced.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/Makefile     |    2 +-
 checkpoint/checkpoint.h |    1 +
 checkpoint/restart.c    |   87 +++++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c        |    2 +-
 4 files changed, 90 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/restart.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bbb0e37..47c7852 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index e3e6b66..0608bb9 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -62,3 +62,4 @@ extern int debug_level;
 int dump_container(struct cpt_context *ctx);
 int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
 int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
+int restart_container(struct cpt_context *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..acfcadb
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,87 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx)
+{
+	int err;
+	struct cpt_object_hdr *hdr = tmp;
+	err = ctx->read(hdr, sizeof(struct cpt_object_hdr), ctx);
+	if (err)
+		return err;
+	if (type > 0 && type != hdr->cpt_type)
+		return -EINVAL;
+	if (hdr->cpt_hdrlen < sizeof(struct cpt_object_hdr))
+		return -EINVAL;
+	if (size < sizeof(struct cpt_object_hdr))
+		return -EINVAL;
+	if (hdr->cpt_len < hdr->cpt_hdrlen)
+		return -EINVAL;
+	if (size > hdr->cpt_hdrlen)
+		size = hdr->cpt_hdrlen;
+	if (size > sizeof(*hdr))
+		err = ctx->read(hdr + 1, size - sizeof(*hdr), ctx);
+	return err;
+}
+
+static int rst_read_head(struct cpt_context *ctx)
+{
+	struct cpt_head hdr;
+	int err;
+
+	err = -EBADF;
+	if (!ctx->file)
+		return err;
+
+	err = ctx->read(&hdr, sizeof(hdr), ctx);
+	if (err < 0)
+		return err;
+
+	if (hdr.cpt_signature[0] != CPT_SIGNATURE0 ||
+	    hdr.cpt_signature[1] != CPT_SIGNATURE1 ||
+	    hdr.cpt_signature[2] != CPT_SIGNATURE2 ||
+	    hdr.cpt_signature[3] != CPT_SIGNATURE3) {
+		return -EINVAL;
+	}
+	if (KERNEL_VERSION(hdr.cpt_image_major, hdr.cpt_image_minor,
+				hdr.cpt_image_sublevel) != LINUX_VERSION_CODE)
+		return -EINVAL;
+
+#if defined(CONFIG_X86_32)
+	if (hdr.cpt_arch != CPT_ARCH_I386)
+		return -ENOSYS;
+#else
+#error  Arch is not supported
+#endif
+
+	return 0;
+}
+
+int restart_container(struct cpt_context *ctx)
+{
+	int err;
+
+	err = rst_read_head(ctx);
+
+	/* Restart process */
+	if (!err)
+		err = -ENOSYS;
+
+	return err;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1902fef..b92312a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -140,7 +140,7 @@ static int restart(int ctid, int fd, unsigned long flags)
 	ctx->ctx_state = CPT_CTX_UNDUMPING;
 
 	/* restart */
-	err = -ENOSYS;
+	err = restart_container(ctx);
 
 	context_put(ctx);
 
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 08/10] Introduce functions to restart a process
  2008-10-17 23:11             ` [PATCH 07/10] Introduce function for restarting a container Andrey Mirkin
@ 2008-10-17 23:11               ` Andrey Mirkin
  2008-10-17 23:11                 ` [PATCH 09/10] Introduce functions to restore mm Andrey Mirkin
                                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Functions to restart process, restore its state, fpu and registers are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 arch/x86/kernel/entry_32.S   |   21 +++
 arch/x86/kernel/process_32.c |    3 +
 checkpoint/Makefile          |    3 +-
 checkpoint/checkpoint.h      |    2 +
 checkpoint/restart.c         |    2 +-
 checkpoint/rst_process.c     |  277 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c               |    1 +
 7 files changed, 307 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/rst_process.c

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 109792b..a4848a3 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
 	GET_THREAD_INFO(%ebp)
 	popl %eax
 	CFI_ADJUST_CFA_OFFSET -4
+ret_from_fork_tail:
 	pushl $0x0202			# Reset kernel eflags
 	CFI_ADJUST_CFA_OFFSET 4
 	popfl
@@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
 	CFI_ENDPROC
 END(ret_from_fork)
 
+ENTRY(i386_ret_from_resume)
+	CFI_STARTPROC
+	pushl %eax
+	CFI_ADJUST_CFA_OFFSET 4
+	call schedule_tail
+	GET_THREAD_INFO(%ebp)
+	popl %eax
+	CFI_ADJUST_CFA_OFFSET -4
+	movl (%esp), %eax
+	testl %eax, %eax
+	jz    1f
+	pushl %esp
+	call  *%eax
+	addl  $4, %esp
+1:
+	addl  $256, %esp
+	jmp   ret_from_fork_tail
+	CFI_ENDPROC
+END(i386_ret_from_resume)
+
 /*
  * Return to user mode is not as complex as all this looks,
  * but we want the default path for a system call return to
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4711eed..1bdec02 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -58,6 +58,9 @@
 #include <asm/kdebug.h>
 
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
+EXPORT_SYMBOL(ret_from_fork);
+asmlinkage void i386_ret_from_resume(void) __asm__("i386_ret_from_resume");
+EXPORT_SYMBOL(i386_ret_from_resume);
 
 DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
 EXPORT_PER_CPU_SYMBOL(current_task);
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 47c7852..689a0eb 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \
+	       rst_process.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 0608bb9..1d0ca49 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -63,3 +63,5 @@ int dump_container(struct cpt_context *ctx);
 int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
 int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
 int restart_container(struct cpt_context *ctx);
+int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx);
+int rst_restart_process(struct cpt_context *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index acfcadb..62cef28 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -81,7 +81,7 @@ int restart_container(struct cpt_context *ctx)
 
 	/* Restart process */
 	if (!err)
-		err = -ENOSYS;
+		err = rst_restart_process(ctx);
 
 	return err;
 }
diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
new file mode 100644
index 0000000..b9f745e
--- /dev/null
+++ b/checkpoint/rst_process.c
@@ -0,0 +1,277 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/module.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+#define HOOK_RESERVE	256
+
+struct thr_context {
+	struct completion complete;
+	int error;
+	struct cpt_context *ctx;
+	struct task_struct *tsk;
+};
+
+int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
+{
+	pid_t ret;
+
+	if (current->fs == NULL) {
+		/* do_fork_pid() hates processes without fs, oopses. */
+		eprintk("local_kernel_thread: current->fs==NULL\n");
+		return -EINVAL;
+	}
+	if (!try_module_get(THIS_MODULE))
+		return -EBUSY;
+	ret = kernel_thread(fn, arg, flags);
+	if (ret < 0)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static unsigned int decode_task_flags(unsigned int task_flags)
+{
+	unsigned int flags = 0;
+
+	if (task_flags & (1 << CPT_PF_EXITING))
+		flags |= PF_EXITING;
+	if (task_flags & (1 << CPT_PF_FORKNOEXEC))
+		flags |= PF_FORKNOEXEC;
+	if (task_flags & (1 << CPT_PF_SUPERPRIV))
+		flags |= PF_SUPERPRIV;
+	if (task_flags & (1 << CPT_PF_DUMPCORE))
+		flags |= PF_DUMPCORE;
+	if (task_flags & (1 << CPT_PF_SIGNALED))
+		flags |= PF_SIGNALED;
+	
+	return flags;
+		
+}
+
+int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti,
+			    struct cpt_context *ctx)
+{
+	int i;
+
+	/* Restore only saved flags, comm and tls for now */
+	tsk->flags = decode_task_flags(ti->cpt_flags);
+	clear_tsk_thread_flag(tsk, TIF_FREEZE);
+	memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
+		tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
+	}
+
+	return 0;
+}
+
+static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti,
+				struct cpt_context *ctx)
+{
+	struct cpt_obj_bits hdr;
+	int err;
+	char *buf;
+
+	clear_stopped_child_used_math(tsk);
+
+	err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
+	if (err < 0)
+		return err;
+
+	buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	err = ctx->read(buf, hdr.cpt_size, ctx);
+	if (err)
+		goto out;
+
+	if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
+		memcpy(&tsk->thread.xstate, buf,
+				sizeof(struct i387_fxsave_struct));
+		if (ti->cpt_flags & CPT_PF_USED_MATH)
+			set_stopped_child_used_math(tsk);
+	}
+#ifndef CONFIG_X86_64
+	else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
+			!cpu_has_fxsr) {		
+		memcpy(&tsk->thread.xstate, buf,
+				sizeof(struct i387_fsave_struct));
+		if (ti->cpt_flags & CPT_PF_USED_MATH)
+			set_stopped_child_used_math(tsk);
+	}
+#endif
+
+out:
+	kfree(buf);
+	return err;
+}
+
+static u32 decode_segment(u32 segid)
+{
+	if (segid == CPT_SEG_ZERO)
+		return 0;
+
+	/* TLS descriptors */
+	if (segid <= CPT_SEG_TLS3)
+		return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
+
+	/* LDT descriptor, it is just an index to LDT array */
+	if (segid >= CPT_SEG_LDT)
+		return ((segid - CPT_SEG_LDT) << 3) | 7;
+
+	/* Check for one of standard descriptors */
+	if (segid == CPT_SEG_USER32_DS)
+		return __USER_DS;
+	if (segid == CPT_SEG_USER32_CS)
+		return __USER_CS;
+
+	eprintk("Invalid segment reg %d\n", segid);
+	return 0;
+}
+
+static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	struct cpt_x86_regs ri;
+	struct pt_regs *regs = task_pt_regs(tsk);
+	extern char i386_ret_from_resume;
+	int err;
+
+	err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
+	if (err < 0)
+		return err;
+
+	tsk->thread.sp = (unsigned long) regs;
+	tsk->thread.sp0 = (unsigned long) (regs+1);
+	tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
+
+	tsk->thread.gs = decode_segment(ri.cpt_gs);
+	tsk->thread.debugreg0 = ri.cpt_debugreg[0];
+	tsk->thread.debugreg1 = ri.cpt_debugreg[1];
+	tsk->thread.debugreg2 = ri.cpt_debugreg[2];
+	tsk->thread.debugreg3 = ri.cpt_debugreg[3];
+	tsk->thread.debugreg6 = ri.cpt_debugreg[6];
+	tsk->thread.debugreg7 = ri.cpt_debugreg[7];
+
+	regs->bx = ri.cpt_bx;
+	regs->cx = ri.cpt_cx;
+	regs->dx = ri.cpt_dx;
+	regs->si = ri.cpt_si;
+	regs->di = ri.cpt_di;
+	regs->bp = ri.cpt_bp;
+	regs->ax = ri.cpt_ax;
+	regs->orig_ax = ri.cpt_orig_ax;
+	regs->ip = ri.cpt_ip;
+	regs->flags = ri.cpt_flags;
+	regs->sp = ri.cpt_sp;
+
+	regs->cs = decode_segment(ri.cpt_cs);
+	regs->ss = decode_segment(ri.cpt_ss);
+	regs->ds = decode_segment(ri.cpt_ds);
+	regs->es = decode_segment(ri.cpt_es);
+	regs->fs = decode_segment(ri.cpt_fs);
+
+	tsk->thread.sp -= HOOK_RESERVE;
+	memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
+
+	return 0;
+}
+
+static int restart_thread(void *arg)
+{
+	struct thr_context *thr_ctx = arg;
+	struct cpt_context *ctx;
+	struct cpt_task_image *ti;
+	int err;
+
+	current->state = TASK_UNINTERRUPTIBLE;
+
+	ctx = thr_ctx->ctx;
+	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
+	if (!ti)
+		return -ENOMEM;
+
+	err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
+	if (!err)
+		err = rst_restore_task_struct(current, ti, ctx);
+	/* Restore mm here */
+	if (!err)
+		err = rst_restore_fpustate(current, ti, ctx);
+	if (!err)
+		err = rst_restore_registers(current, ctx);
+
+	thr_ctx->error = err;
+	complete(&thr_ctx->complete);
+
+	if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
+		do_exit(ti->cpt_exit_code);
+	} else {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+	}
+
+	kfree(ti);
+	schedule();
+
+	eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
+
+	module_put(THIS_MODULE);
+	complete_and_exit(NULL, 0);
+	return 0;
+}
+static int create_root_task(struct cpt_context *ctx,
+			    struct thr_context *thr_ctx)
+{
+	struct task_struct *tsk;
+	int pid;
+
+	thr_ctx->ctx = ctx;
+	thr_ctx->error = 0;
+	init_completion(&thr_ctx->complete);
+
+	/* We should also create container here */ 
+	pid = local_kernel_thread(restart_thread, thr_ctx,
+			CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | 
+			CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0);
+	if (pid < 0)
+		return pid;
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_vpid(pid);
+	if (tsk)
+		get_task_struct(tsk);
+	read_unlock(&tasklist_lock);
+	if (tsk == NULL)
+		return -ESRCH;
+	thr_ctx->tsk = tsk;
+	return 0;
+}
+
+int rst_restart_process(struct cpt_context *ctx)
+{
+	struct thr_context thr_ctx_root;
+	int err;
+
+	err = create_root_task(ctx, &thr_ctx_root);
+	if (err)
+		return err;
+
+	wait_for_completion(&thr_ctx_root.complete);
+	wait_task_inactive(thr_ctx_root.tsk, 0);
+
+	return err;
+}
diff --git a/kernel/sched.c b/kernel/sched.c
index 04160d2..94a23e5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1970,6 +1970,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 
 	return ncsw;
 }
+EXPORT_SYMBOL(wait_task_inactive);
 
 /***
  * kick_process - kick a running thread to enter/exit the kernel
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 09/10] Introduce functions to restore mm
  2008-10-17 23:11               ` [PATCH 08/10] Introduce functions to restart a process Andrey Mirkin
@ 2008-10-17 23:11                 ` Andrey Mirkin
  2008-10-17 23:11                   ` [PATCH 10/10] Add support for multiple processes Andrey Mirkin
  2008-10-20  9:23                 ` [PATCH 08/10] Introduce functions to restart a process Cedric Le Goater
  2008-10-20 13:25                 ` Louis Rilling
  2 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

Functions to restore mm, VMAs and mm context are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/Makefile      |    2 +-
 checkpoint/checkpoint.h  |    1 +
 checkpoint/cpt_image.h   |    5 +
 checkpoint/rst_mm.c      |  320 ++++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/rst_process.c |    3 +-
 mm/mmap.c                |    1 +
 mm/mprotect.c            |    2 +
 7 files changed, 332 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/rst_mm.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 689a0eb..19ca732 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@ obj-y += sys_core.o
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
 cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \
-	       rst_process.o
+	       rst_process.o rst_mm.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 1d0ca49..195fdc6 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -65,3 +65,4 @@ int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
 int restart_container(struct cpt_context *ctx);
 int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx);
 int rst_restart_process(struct cpt_context *ctx);
+int rst_restore_mm(struct cpt_context *ctx);
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index 160cf85..e1fb483 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -233,6 +233,11 @@ struct cpt_x86_regs
 	__u32	cpt_ss;
 } __attribute__ ((aligned (8)));
 
+static inline void __user * cpt_ptr_import(__u64 ptr)
+{
+	return (void*)(unsigned long)ptr;
+}
+
 static inline __u64 cpt_timespec_export(struct timespec *tv)
 {
 	return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
diff --git a/checkpoint/rst_mm.c b/checkpoint/rst_mm.c
new file mode 100644
index 0000000..fe53c45
--- /dev/null
+++ b/checkpoint/rst_mm.c
@@ -0,0 +1,320 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/syscalls.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+static unsigned long make_prot(struct cpt_vma_image *vmai)
+{
+	unsigned long prot = 0;
+
+	if (vmai->cpt_flags & VM_READ)
+		prot |= PROT_READ;
+	if (vmai->cpt_flags & VM_WRITE)
+		prot |= PROT_WRITE;
+	if (vmai->cpt_flags & VM_EXEC)
+		prot |= PROT_EXEC;
+	if (vmai->cpt_flags & VM_GROWSDOWN)
+		prot |= PROT_GROWSDOWN;
+	if (vmai->cpt_flags & VM_GROWSUP)
+		prot |= PROT_GROWSUP;
+	return prot;
+}
+
+static unsigned long make_flags(struct cpt_vma_image *vmai)
+{
+	unsigned long flags = MAP_FIXED;
+
+	if (vmai->cpt_flags&(VM_SHARED|VM_MAYSHARE))
+		flags |= MAP_SHARED;
+	else
+		flags |= MAP_PRIVATE;
+
+	if (vmai->cpt_file == CPT_NULL)
+		flags |= MAP_ANONYMOUS;
+	if (vmai->cpt_flags & VM_GROWSDOWN)
+		flags |= MAP_GROWSDOWN;
+#ifdef MAP_GROWSUP
+	if (vmai->cpt_flags & VM_GROWSUP)
+		flags |= MAP_GROWSUP;
+#endif
+	if (vmai->cpt_flags & VM_DENYWRITE)
+		flags |= MAP_DENYWRITE;
+	if (vmai->cpt_flags & VM_EXECUTABLE)
+		flags |= MAP_EXECUTABLE;
+	if (!(vmai->cpt_flags & VM_ACCOUNT))
+		flags |= MAP_NORESERVE;
+	return flags;
+}
+
+static int rst_restore_one_vma(struct cpt_context *ctx)
+{
+	int err;
+	int i;
+	unsigned long addr;
+	struct mm_struct *mm = current->mm;
+	struct cpt_vma_image vmai;
+	struct vm_area_struct *vma;
+	struct file *file = NULL;
+	unsigned long prot;
+
+	err = rst_get_object(CPT_OBJ_VMA, &vmai, sizeof(vmai), ctx);
+	if (err)
+		return err;
+
+	prot = make_prot(&vmai);
+
+	if (vmai.cpt_vma_type == CPT_VMA_FILE) {
+		struct cpt_object_hdr h;
+		int len;
+		char *path;
+
+		err = rst_get_object(CPT_OBJ_NAME, &h, sizeof(h), ctx);
+		if (err)
+			goto out;
+		len = h.cpt_len - sizeof(h);
+		if (len < 0) {
+			err = -EINVAL;
+			goto out;
+		}
+		path = kmalloc(len, GFP_KERNEL);
+		if (!path) {
+			err = -ENOMEM;
+			goto out;
+		}
+		err = ctx->read(path, len, ctx);
+		if (err) {
+			kfree(path);
+			goto out;
+		}
+
+		/* Just open file
+		   TODO: open with correct flags */
+		file = filp_open(path, O_RDONLY, 0);
+		kfree(path);
+		if (IS_ERR(file)) {
+			err = PTR_ERR(file);
+			goto out;
+		}
+	}
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vmai.cpt_start,
+			     vmai.cpt_end - vmai.cpt_start,
+			     prot, make_flags(&vmai),
+			     vmai.cpt_pgoff);
+
+	if (addr != vmai.cpt_start) {
+		up_write(&mm->mmap_sem);
+
+		err = -EINVAL;
+		if (IS_ERR((void*)addr))
+			err = addr;
+		goto out;
+	}
+
+	vma = find_vma(mm, vmai.cpt_start);
+	if (vma == NULL) {
+		up_write(&mm->mmap_sem);
+		eprintk("cannot find mmapped vma\n");
+		err = -ESRCH;
+		goto out;
+	}
+
+	/* do_mmap_pgoff() can merge new area to previous one (not to the next,
+	 * we mmap in order, the rest of mm is still unmapped). This can happen
+	 * f.e. if flags are to be adjusted later, or if we had different
+	 * anon_vma on two adjacent regions. Split it by brute force. */
+	if (vma->vm_start != vmai.cpt_start) {
+		err = split_vma(mm, vma, (unsigned long)vmai.cpt_start, 0);
+		if (err) {
+			up_write(&mm->mmap_sem);
+			eprintk("cannot split vma\n");
+			goto out;
+		}
+	}
+	up_write(&mm->mmap_sem);
+
+	for (i = 0; i < vmai.cpt_page_num; i++) {
+		struct cpt_page_block pb;
+
+		err = rst_get_object(CPT_OBJ_PAGES, &pb, sizeof(pb), ctx);
+		if (err)
+			goto out;
+		if (!(vmai.cpt_flags & VM_ACCOUNT) && !(prot & PROT_WRITE)) {
+			/* I guess this is get_user_pages() messed things,
+			 * this happens f.e. when gdb inserts breakpoints.
+			 */
+			int j;
+			for (j = 0; j < (pb.cpt_end-pb.cpt_start)/PAGE_SIZE; j++) {
+				struct page *page;
+				void *maddr;
+				err = get_user_pages(current, current->mm,
+						(unsigned long)pb.cpt_start +
+						j * PAGE_SIZE,
+						1, 1, 1, &page, NULL);
+				if (err == 0)
+					err = -EFAULT;
+				if (err < 0) {
+					eprintk("get_user_pages: %d\n", err);
+					goto out;
+				}
+				err = 0;
+				maddr = kmap(page);
+				if (pb.cpt_content == CPT_CONTENT_VOID) {
+					memset(maddr, 0, PAGE_SIZE);
+				} else if (pb.cpt_content == CPT_CONTENT_DATA) {
+					err = ctx->read(maddr, PAGE_SIZE, ctx);
+					if (err) {
+						kunmap(page);
+						goto out;
+					}
+				} else {
+					err = -EINVAL;
+					kunmap(page);
+					goto out;
+				}
+				set_page_dirty_lock(page);
+				kunmap(page);
+				page_cache_release(page);
+			}
+		} else {
+			if (!(prot & PROT_WRITE))
+				sys_mprotect(vmai.cpt_start,
+						vmai.cpt_end - vmai.cpt_start,
+						prot | PROT_WRITE);
+			if (pb.cpt_content == CPT_CONTENT_VOID) {
+				int j;
+				for (j=0; j<(pb.cpt_end-pb.cpt_start)/sizeof(unsigned long); j++) {
+					err = __put_user(0UL, ((unsigned long __user*)(unsigned long)pb.cpt_start) + j);
+					if (err) {
+						eprintk("__put_user 2 %d\n", err);
+						goto out;
+					}
+				}
+			} else if (pb.cpt_content == CPT_CONTENT_DATA) {
+				err = ctx->read(cpt_ptr_import(pb.cpt_start),
+						pb.cpt_end - pb.cpt_start,
+						ctx);
+				if (err)
+					goto out;
+			} else {
+				err = -EINVAL;
+				goto out;
+			}
+			if (!(prot & PROT_WRITE))
+				sys_mprotect(vmai.cpt_start,
+						vmai.cpt_end - vmai.cpt_start,
+						prot);
+		}
+	}
+
+out:
+	if (file)
+		fput(file);
+	return err;
+}
+
+static int rst_restore_mm_context(struct cpt_context *ctx)
+{
+	struct cpt_obj_bits b;
+	struct mm_struct *mm = current->mm;
+	int oldsize = mm->context.size;
+	int err;
+	void *oldldt;
+	void *newldt;
+
+	err = rst_get_object(CPT_OBJ_BITS, &b, sizeof(b), ctx);
+	if (err)
+		return err;
+
+	if (b.cpt_size > PAGE_SIZE)
+		newldt = vmalloc(b.cpt_size);
+	else
+		newldt = kmalloc(b.cpt_size, GFP_KERNEL);
+
+	if (!newldt)
+		return -ENOMEM;
+
+	err = ctx->read(newldt, b.cpt_size, ctx);
+	if (err)
+		return err;
+
+	oldldt = mm->context.ldt;
+	mm->context.ldt = newldt;
+	mm->context.size = b.cpt_size / LDT_ENTRY_SIZE;
+
+	load_LDT(&mm->context);
+
+	if (oldsize) {
+		if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
+			vfree(oldldt);
+		else
+			kfree(oldldt);
+	}
+
+	return 0;
+}
+
+int rst_restore_mm(struct cpt_context *ctx)
+{
+	int err;
+	int i;
+	struct mm_struct *mm = current->mm;
+	struct cpt_mm_image m;
+
+	err = rst_get_object(CPT_OBJ_MM, &m, sizeof(m), ctx);
+	if (err)
+		return err;
+
+	down_write(&mm->mmap_sem);
+	do_munmap(mm, 0, TASK_SIZE);
+
+	mm->start_code = m.cpt_start_code;
+	mm->end_code = m.cpt_end_code;
+	mm->start_data = m.cpt_start_data;
+	mm->end_data = m.cpt_end_data;
+	mm->start_brk = m.cpt_start_brk;
+	mm->brk = m.cpt_brk;
+	mm->start_stack = m.cpt_start_stack;
+	mm->arg_start = m.cpt_start_arg;
+	mm->arg_end = m.cpt_end_arg;
+	mm->env_start = m.cpt_start_env;
+	mm->env_end = m.cpt_end_env;
+	mm->def_flags = m.cpt_def_flags;
+	mm->flags = m.cpt_flags;
+
+	up_write(&mm->mmap_sem);
+
+	for (i = 0; i < m.cpt_map_count; i++) {
+		err = rst_restore_one_vma(ctx);
+		if (err < 0)
+			goto out;
+	}
+
+	err = rst_restore_mm_context(ctx);
+out:
+	return err;
+}
+
diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
index b9f745e..9e448b2 100644
--- a/checkpoint/rst_process.c
+++ b/checkpoint/rst_process.c
@@ -210,7 +210,8 @@ static int restart_thread(void *arg)
 	err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
 	if (!err)
 		err = rst_restore_task_struct(current, ti, ctx);
-	/* Restore mm here */
+	if (!err)
+		err = rst_restore_mm(ctx);
 	if (!err)
 		err = rst_restore_fpustate(current, ti, ctx);
 	if (!err)
diff --git a/mm/mmap.c b/mm/mmap.c
index 971d0ed..98d1ba9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1858,6 +1858,7 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 
 	return 0;
 }
+EXPORT_SYMBOL(split_vma);
 
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
diff --git a/mm/mprotect.c b/mm/mprotect.c
index fded06f..47c7d75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -22,6 +22,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mmu_notifier.h>
+#include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -317,3 +318,4 @@ out:
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
+EXPORT_SYMBOL(sys_mprotect);
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 10/10] Add support for multiple processes
  2008-10-17 23:11                 ` [PATCH 09/10] Introduce functions to restore mm Andrey Mirkin
@ 2008-10-17 23:11                   ` Andrey Mirkin
  2008-10-27 15:58                     ` Oren Laadan
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-17 23:11 UTC (permalink / raw)
  To: containers, linux-kernel; +Cc: Pavel Emelyanov, Andrey Mirkin

The whole tree of processes can be checkpointed and restarted now.
Shared objects are not supported yet.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 checkpoint/cpt_image.h   |    2 +
 checkpoint/cpt_process.c |   24 +++++++++++++
 checkpoint/rst_process.c |   85 +++++++++++++++++++++++++++-------------------
 3 files changed, 76 insertions(+), 35 deletions(-)

diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index e1fb483..f370df2 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -128,6 +128,8 @@ struct cpt_task_image {
 	__u64	cpt_nivcsw;
 	__u64	cpt_min_flt;
 	__u64	cpt_maj_flt;
+	__u32	cpt_children_num;
+	__u32	cpt_pad;
 } __attribute__ ((aligned (8)));
 
 struct cpt_mm_image {
diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
index 1f7a54b..d73ec3c 100644
--- a/checkpoint/cpt_process.c
+++ b/checkpoint/cpt_process.c
@@ -40,6 +40,19 @@ static unsigned int encode_task_flags(unsigned int task_flags)
 		
 }
 
+static int cpt_count_children(struct task_struct *tsk, struct cpt_context *ctx)
+{
+	int num = 0;
+	struct task_struct *child;
+
+	list_for_each_entry(child, &tsk->children, sibling) {
+		if (child->parent != tsk)
+			continue;
+		num++;
+	}
+	return num;
+}
+
 int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
 {
 	struct cpt_task_image *t;
@@ -102,6 +115,7 @@ int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
 	t->cpt_egid = tsk->egid;
 	t->cpt_sgid = tsk->sgid;
 	t->cpt_fsgid = tsk->fsgid;
+	t->cpt_children_num = cpt_count_children(tsk, ctx);
 
 	err = ctx->write(t, sizeof(*t), ctx);
 
@@ -231,6 +245,16 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
 		err = cpt_dump_fpustate(tsk, ctx);
 	if (!err)
 		err = cpt_dump_registers(tsk, ctx);
+	if (!err) {
+		struct task_struct *child;
+		list_for_each_entry(child, &tsk->children, sibling) {
+			if (child->parent != tsk)
+				continue;
+			err = cpt_dump_task(child, ctx);
+			if (err)
+				break;
+		}
+	}
 
 	return err;
 }
diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
index 9e448b2..c088833 100644
--- a/checkpoint/rst_process.c
+++ b/checkpoint/rst_process.c
@@ -25,7 +25,7 @@ struct thr_context {
 	struct completion complete;
 	int error;
 	struct cpt_context *ctx;
-	struct task_struct *tsk;
+	struct cpt_task_image *ti;
 };
 
 int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
@@ -199,17 +199,14 @@ static int restart_thread(void *arg)
 	struct cpt_context *ctx;
 	struct cpt_task_image *ti;
 	int err;
+	int i;
 
 	current->state = TASK_UNINTERRUPTIBLE;
 
 	ctx = thr_ctx->ctx;
-	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
-	if (!ti)
-		return -ENOMEM;
+	ti = thr_ctx->ti;
 
-	err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
-	if (!err)
-		err = rst_restore_task_struct(current, ti, ctx);
+	err = rst_restore_task_struct(current, ti, ctx);
 	if (!err)
 		err = rst_restore_mm(ctx);
 	if (!err)
@@ -217,6 +214,12 @@ static int restart_thread(void *arg)
 	if (!err)
 		err = rst_restore_registers(current, ctx);
 
+	for (i = 0; i < ti->cpt_children_num; i++) {
+		err = rst_restart_process(ctx);
+		if (err)
+			break;
+	}
+
 	thr_ctx->error = err;
 	complete(&thr_ctx->complete);
 
@@ -226,7 +229,6 @@ static int restart_thread(void *arg)
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 	}
 
-	kfree(ti);
 	schedule();
 
 	eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
@@ -235,44 +237,57 @@ static int restart_thread(void *arg)
 	complete_and_exit(NULL, 0);
 	return 0;
 }
-static int create_root_task(struct cpt_context *ctx,
-			    struct thr_context *thr_ctx)
+
+int rst_restart_process(struct cpt_context *ctx)
 {
+	struct thr_context thr_ctx;
 	struct task_struct *tsk;
+	struct cpt_task_image *ti;
 	int pid;
+	int err;
 
-	thr_ctx->ctx = ctx;
-	thr_ctx->error = 0;
-	init_completion(&thr_ctx->complete);
+	thr_ctx.ctx = ctx;
+	thr_ctx.error = 0;
+	init_completion(&thr_ctx.complete);
 
-	/* We should also create container here */ 
-	pid = local_kernel_thread(restart_thread, thr_ctx,
-			CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | 
-			CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0);
-	if (pid < 0)
-		return pid;
+	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
+	if (!ti)
+		return -ENOMEM;
+
+	err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
+	if (err)
+		goto err_free;
+	thr_ctx.ti = ti;
+
+	if (ti->cpt_pid == 1) {
+		/* We should also create container here */ 
+		pid = local_kernel_thread(restart_thread, &thr_ctx,
+				CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | 
+				CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0);
+	} else {
+		/* We should fork here a child with the same pid and
+		   correct flags */
+		pid = local_kernel_thread(restart_thread, &thr_ctx, 0, 0); 
+	}
+	if (pid < 0) {
+		err = pid;
+		goto err_free;
+	}
 	read_lock(&tasklist_lock);
 	tsk = find_task_by_vpid(pid);
 	if (tsk)
 		get_task_struct(tsk);
 	read_unlock(&tasklist_lock);
-	if (tsk == NULL)
-		return -ESRCH;
-	thr_ctx->tsk = tsk;
-	return 0;
-}
-
-int rst_restart_process(struct cpt_context *ctx)
-{
-	struct thr_context thr_ctx_root;
-	int err;
-
-	err = create_root_task(ctx, &thr_ctx_root);
-	if (err)
-		return err;
+	if (tsk == NULL) {
+		err = -ESRCH;
+		goto err_free;
+	}
 
-	wait_for_completion(&thr_ctx_root.complete);
-	wait_task_inactive(thr_ctx_root.tsk, 0);
+	wait_for_completion(&thr_ctx.complete);
+	wait_task_inactive(tsk, 0);
+	err = thr_ctx.error;
 
+err_free:
+	kfree(ti);
 	return err;
 }
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-17 23:11               ` [PATCH 08/10] Introduce functions to restart a process Andrey Mirkin
  2008-10-17 23:11                 ` [PATCH 09/10] Introduce functions to restore mm Andrey Mirkin
@ 2008-10-20  9:23                 ` Cedric Le Goater
  2008-10-22  8:49                   ` [Devel] " Andrey Mirkin
  2008-10-20 13:25                 ` Louis Rilling
  2 siblings, 1 reply; 49+ messages in thread
From: Cedric Le Goater @ 2008-10-20  9:23 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

Hello Andrey !


> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 109792b..a4848a3 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
>  	GET_THREAD_INFO(%ebp)
>  	popl %eax
>  	CFI_ADJUST_CFA_OFFSET -4
> +ret_from_fork_tail:
>  	pushl $0x0202			# Reset kernel eflags
>  	CFI_ADJUST_CFA_OFFSET 4
>  	popfl
> @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
>  	CFI_ENDPROC
>  END(ret_from_fork)
>  
> +ENTRY(i386_ret_from_resume)
> +	CFI_STARTPROC
> +	pushl %eax
> +	CFI_ADJUST_CFA_OFFSET 4
> +	call schedule_tail
> +	GET_THREAD_INFO(%ebp)
> +	popl %eax
> +	CFI_ADJUST_CFA_OFFSET -4
> +	movl (%esp), %eax
> +	testl %eax, %eax
> +	jz    1f
> +	pushl %esp
> +	call  *%eax
> +	addl  $4, %esp
> +1:
> +	addl  $256, %esp
> +	jmp   ret_from_fork_tail
> +	CFI_ENDPROC
> +END(i386_ret_from_resume)

Could you explain why you need to do this  

	call  *%eax

is it related to the freezer code ? 

C.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/10] Introduce function to dump process
  2008-10-17 23:11         ` [PATCH 05/10] Introduce function to dump process Andrey Mirkin
  2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
@ 2008-10-20 11:02           ` Louis Rilling
  2008-10-24  4:15             ` [Devel] " Andrey Mirkin
  2008-10-20 17:48           ` Serge E. Hallyn
  2 siblings, 1 reply; 49+ messages in thread
From: Louis Rilling @ 2008-10-20 11:02 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 8729 bytes --]

Hi,

On Sat, Oct 18, 2008 at 03:11:33AM +0400, Andrey Mirkin wrote:
> Functions to dump task struct, fpu state and registers are added.
> All IDs are saved from the POV of process (container) namespace.

Just a couple of little comments, in case this series should keep on living.

[...]

> diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> new file mode 100644
> index 0000000..58f608d
> --- /dev/null
> +++ b/checkpoint/cpt_process.c
> @@ -0,0 +1,236 @@
> +/*
> + *  Copyright (C) 2008 Parallels, Inc.
> + *
> + *  Author: Andrey Mirkin <major@openvz.org>
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License as
> + *  published by the Free Software Foundation, version 2 of the
> + *  License.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/version.h>
> +#include <linux/nsproxy.h>
> +
> +#include "checkpoint.h"
> +#include "cpt_image.h"
> +
> +static unsigned int encode_task_flags(unsigned int task_flags)
> +{
> +	unsigned int flags = 0;
> +
> +	if (task_flags & PF_EXITING)
> +		flags |= (1 << CPT_PF_EXITING);
> +	if (task_flags & PF_FORKNOEXEC)
> +		flags |= (1 << CPT_PF_FORKNOEXEC);
> +	if (task_flags & PF_SUPERPRIV)
> +		flags |= (1 << CPT_PF_SUPERPRIV);
> +	if (task_flags & PF_DUMPCORE)
> +		flags |= (1 << CPT_PF_DUMPCORE);
> +	if (task_flags & PF_SIGNALED)
> +		flags |= (1 << CPT_PF_SIGNALED);
> +	if (task_flags & PF_USED_MATH)
> +		flags |= (1 << CPT_PF_USED_MATH);
> +	
> +	return flags;
> +		
> +}
> +
> +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	struct cpt_task_image *t;
> +	int i;
> +	int err;
> +
> +	t = kzalloc(sizeof(*t), GFP_KERNEL);
> +	if (!t)
> +		return -ENOMEM;
> +
> +	t->cpt_len = sizeof(*t);
> +	t->cpt_type = CPT_OBJ_TASK;
> +	t->cpt_hdrlen = sizeof(*t);
> +	t->cpt_content = CPT_CONTENT_ARRAY;
> +
> +	t->cpt_state = tsk->state;
> +	t->cpt_flags = encode_task_flags(tsk->flags);
> +	t->cpt_exit_code = tsk->exit_code;
> +	t->cpt_exit_signal = tsk->exit_signal;
> +	t->cpt_pdeath_signal = tsk->pdeath_signal;
> +	t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> +	t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> +	t->cpt_ppid = tsk->parent ?
> +		task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
> +	t->cpt_rppid = tsk->real_parent ?
> +		task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
> +	t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
> +	t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
> +	t->cpt_old_pgrp = 0;
> +	if (tsk->signal->tty_old_pgrp)
> +		t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
> +	t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0;

Why pid_vnr() here, and task_*_nr_ns() above? According to the introducing
comment, I'd expect something like pid_nr_ns(tsk->signal->tty_old_pgrp,
tsk->nsproxy->pid_ns), and the same for tsk->group_leader.

IIUC, pid_vnr() is correct only if ctx->nsproxy->pid_ns == tsk->nsproxy->pid_ns
== current->nsproxy->pid_ns, and I expect current to live in a different pid_ns.

Comments?

> +	t->cpt_utime = tsk->utime;
> +	t->cpt_stime = tsk->stime;
> +	t->cpt_utimescaled = tsk->utimescaled;
> +	t->cpt_stimescaled = tsk->stimescaled;
> +	t->cpt_gtime = tsk->gtime;
> +	t->cpt_prev_utime = tsk->prev_utime;
> +	t->cpt_prev_stime = tsk->prev_stime;
> +	t->cpt_nvcsw = tsk->nvcsw;
> +	t->cpt_nivcsw = tsk->nivcsw;
> +	t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
> +	t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
> +	t->cpt_min_flt = tsk->min_flt;
> +	t->cpt_maj_flt = tsk->maj_flt;
> +	memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
> +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> +		t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
> +			tsk->thread.tls_array[i].a;
> +	}
> +	/* TODO: encode thread flags and status like task flags */
> +	t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE);
> +	t->cpt_thrstatus = task_thread_info(tsk)->status;
> +	t->cpt_user = tsk->user->uid;
> +	t->cpt_uid = tsk->uid;
> +	t->cpt_euid = tsk->euid;
> +	t->cpt_suid = tsk->suid;
> +	t->cpt_fsuid = tsk->fsuid;
> +	t->cpt_gid = tsk->gid;
> +	t->cpt_egid = tsk->egid;
> +	t->cpt_sgid = tsk->sgid;
> +	t->cpt_fsgid = tsk->fsgid;
> +
> +	err = ctx->write(t, sizeof(*t), ctx);
> +
> +	kfree(t);
> +	return err;
> +}
> +
> +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	struct cpt_obj_bits hdr;
> +	int err;
> +	int content;
> +	unsigned long size;
> +
> +	content = CPT_CONTENT_X86_FPUSTATE;
> +	size = sizeof(struct i387_fxsave_struct);
> +#ifndef CONFIG_X86_64
> +	if (!cpu_has_fxsr) {
> +		size = sizeof(struct i387_fsave_struct);
> +		content = CPT_CONTENT_X86_FPUSTATE_OLD;
> +	}
> +#endif
> +
> +	hdr.cpt_len = sizeof(hdr) + size;
> +	hdr.cpt_type = CPT_OBJ_BITS;
> +	hdr.cpt_hdrlen = sizeof(hdr);
> +	hdr.cpt_content = content;
> +	hdr.cpt_size = size;
> +	err = ctx->write(&hdr, sizeof(hdr), ctx);
> +	if (!err)
> +		ctx->write(tsk->thread.xstate, size, ctx);

Should check the error code of the line above, right?

> +	return err;
> +}
> +
> +static u32 encode_segment(u32 segreg)
> +{
> +	segreg &= 0xFFFF;
> +
> +	if (segreg == 0)
> +		return CPT_SEG_ZERO;
> +	if ((segreg & 3) != 3) {
> +		eprintk("Invalid RPL of a segment reg %x\n", segreg);
> +		return CPT_SEG_ZERO;
> +	}
> +
> +	/* LDT descriptor, it is just an index to LDT array */
> +	if (segreg & 4)
> +		return CPT_SEG_LDT + (segreg >> 3);
> +
> +	/* TLS descriptor. */
> +	if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
> +			(segreg >> 3) <= GDT_ENTRY_TLS_MAX)
> +		return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN);
> +
> +	/* One of standard desriptors */
> +#ifdef CONFIG_X86_64
> +	if (segreg == __USER32_DS)
> +		return CPT_SEG_USER32_DS;
> +	if (segreg == __USER32_CS)
> +		return CPT_SEG_USER32_CS;
> +	if (segreg == __USER_DS)
> +		return CPT_SEG_USER64_DS;
> +	if (segreg == __USER_CS)
> +		return CPT_SEG_USER64_CS;
> +#else
> +	if (segreg == __USER_DS)
> +		return CPT_SEG_USER32_DS;
> +	if (segreg == __USER_CS)
> +		return CPT_SEG_USER32_CS;
> +#endif
> +	eprintk("Invalid segment reg %x\n", segreg);
> +	return CPT_SEG_ZERO;
> +}
> +
> +static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	struct cpt_x86_regs ri;
> +	struct pt_regs *pt_regs;
> +
> +	ri.cpt_len = sizeof(ri);
> +	ri.cpt_type = CPT_OBJ_X86_REGS;
> +	ri.cpt_hdrlen = sizeof(ri);
> +	ri.cpt_content = CPT_CONTENT_VOID;
> +
> +	ri.cpt_debugreg[0] = tsk->thread.debugreg0;
> +	ri.cpt_debugreg[1] = tsk->thread.debugreg1;
> +	ri.cpt_debugreg[2] = tsk->thread.debugreg2;
> +	ri.cpt_debugreg[3] = tsk->thread.debugreg3;
> +	ri.cpt_debugreg[4] = 0;
> +	ri.cpt_debugreg[5] = 0;
> +	ri.cpt_debugreg[6] = tsk->thread.debugreg6;
> +	ri.cpt_debugreg[7] = tsk->thread.debugreg7;
> +
> +	pt_regs = task_pt_regs(tsk);
> +
> +	ri.cpt_fs = encode_segment(pt_regs->fs);
> +	ri.cpt_gs = encode_segment(tsk->thread.gs);
> +
> +	ri.cpt_bx = pt_regs->bx;
> +	ri.cpt_cx = pt_regs->cx;
> +	ri.cpt_dx = pt_regs->dx;
> +	ri.cpt_si = pt_regs->si;
> +	ri.cpt_di = pt_regs->di;
> +	ri.cpt_bp = pt_regs->bp;
> +	ri.cpt_ax = pt_regs->ax;
> +	ri.cpt_ds = encode_segment(pt_regs->ds);
> +	ri.cpt_es = encode_segment(pt_regs->es);
> +	ri.cpt_orig_ax = pt_regs->orig_ax;
> +	ri.cpt_ip = pt_regs->ip;
> +	ri.cpt_cs = encode_segment(pt_regs->cs);
> +	ri.cpt_flags = pt_regs->flags;
> +	ri.cpt_sp = pt_regs->sp;
> +	ri.cpt_ss = encode_segment(pt_regs->ss);
> +	
> +	return ctx->write(&ri, sizeof(ri), ctx);
> +}
> +
> +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	int err;
> +
> +	err = cpt_dump_task_struct(tsk, ctx);
> +
> +	/* Dump task mm */
> +
> +	if (!err)
> +		cpt_dump_fpustate(tsk, ctx);

error checking...

> +	if (!err)
> +		cpt_dump_registers(tsk, ctx);

error checking...

> +
> +	return err;
> +}
> -- 
> 1.5.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/10] Introduce functions to dump mm
  2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
  2008-10-17 23:11             ` [PATCH 07/10] Introduce function for restarting a container Andrey Mirkin
@ 2008-10-20 12:25             ` Louis Rilling
  2008-10-22  8:58               ` [Devel] " Andrey Mirkin
  2008-10-20 17:21             ` Dave Hansen
  2 siblings, 1 reply; 49+ messages in thread
From: Louis Rilling @ 2008-10-20 12:25 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 12120 bytes --]

On Sat, Oct 18, 2008 at 03:11:34AM +0400, Andrey Mirkin wrote:
> Functions to dump mm struct, VMAs and mm context are added.

Again, a few little comments.

[...]

> diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
> new file mode 100644
> index 0000000..8a22c48
> --- /dev/null
> +++ b/checkpoint/cpt_mm.c
> @@ -0,0 +1,434 @@
> +/*
> + *  Copyright (C) 2008 Parallels, Inc.
> + *
> + *  Authors:	Andrey Mirkin <major@openvz.org>
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License as
> + *  published by the Free Software Foundation, version 2 of the
> + *  License.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/mm.h>
> +#include <linux/errno.h>
> +#include <linux/major.h>
> +#include <linux/mman.h>
> +#include <linux/mnt_namespace.h>
> +#include <linux/mount.h>
> +#include <linux/namei.h>
> +#include <linux/pagemap.h>
> +#include <linux/hugetlb.h>
> +#include <asm/ldt.h>
> +
> +#include "checkpoint.h"
> +#include "cpt_image.h"
> +
> +struct page_area
> +{
> +	int type;
> +	unsigned long start;
> +	unsigned long end;
> +	pgoff_t pgoff;
> +	loff_t mm;
> +	__u64 list[16];
> +};
> +
> +struct page_desc
> +{
> +	int	type;
> +	pgoff_t	index;
> +	loff_t	mm;
> +	int	shared;
> +};
> +
> +enum {
> +	PD_ABSENT,
> +	PD_COPY,
> +	PD_FUNKEY,
> +};
> +
> +/* 0: page can be obtained from backstore, or still not mapped anonymous  page,
> +      or something else, which does not requre copy.
> +   1: page requires copy
> +   2: page requres copy but its content is zero. Quite useless.
> +   3: wp page is shared after fork(). It is to be COWed when modified.
> +   4: page is something unsupported... We copy it right now.
> + */
> +
> +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
> +			  struct page_desc *pdesc, cpt_context_t * ctx)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *ptep, pte;
> +	spinlock_t *ptl;
> +	struct page *pg = NULL;
> +	pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
> +
> +	pdesc->index = linear_index;
> +	pdesc->shared = 0;
> +	pdesc->mm = CPT_NULL;
> +
> +	if (vma->vm_flags & VM_IO) {
> +		pdesc->type = PD_ABSENT;
> +		return;
> +	}
> +
> +	pgd = pgd_offset(mm, addr);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_absent;
> +	pud = pud_offset(pgd, addr);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_absent;
> +	pmd = pmd_offset(pud, addr);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto out_absent;
> +#ifdef CONFIG_X86
> +	if (pmd_huge(*pmd)) {
> +		eprintk("page_huge\n");
> +		goto out_unsupported;
> +	}
> +#endif
> +	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	pte = *ptep;
> +	pte_unmap(ptep);
> +
> +	if (pte_none(pte))
> +		goto out_absent_unlock;
> +
> +	if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
> +		pdesc->type = PD_COPY;
> +		goto out_unlock;
> +	}
> +
> +	get_page(pg);
> +	spin_unlock(ptl);
> +
> +	if (pg->mapping && !PageAnon(pg)) {
> +		if (vma->vm_file == NULL) {
> +			eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
> +			goto out_unsupported;
> +		}
> +		if (vma->vm_file->f_mapping != pg->mapping) {
> +			eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
> +				    addr, vma->vm_file->f_mapping, pg->mapping);
> +			goto out_unsupported;
> +		}
> +		pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
> +		/* Page is in backstore. For us it is like
> +		 * it is not present.
> +		 */
> +		goto out_absent;
> +	}
> +
> +	if (PageReserved(pg)) {
> +		/* Special case: ZERO_PAGE is used, when an
> +		 * anonymous page is accessed but not written. */
> +		if (pg == ZERO_PAGE(addr)) {
> +			if (pte_write(pte)) {
> +				eprintk("not funny already, writable ZERO_PAGE\n");
> +				goto out_unsupported;
> +			}
> +			/* Just copy it for now */
> +			pdesc->type = PD_COPY;
> +			goto out_put;
> +		}
> +		eprintk("reserved page %lu at %08lx\n", pg->index, addr);
> +		goto out_unsupported;
> +	}
> +
> +	if (!pg->mapping) {
> +		eprintk("page without mapping at %08lx\n", addr);
> +		goto out_unsupported;
> +	}
> +
> +	pdesc->type = PD_COPY;
> +
> +out_put:
> +	if (pg)
> +		put_page(pg);
> +	return;
> +
> +out_unlock:
> +	spin_unlock(ptl);
> +	goto out_put;
> +
> +out_absent_unlock:
> +	spin_unlock(ptl);
> +
> +out_absent:
> +	pdesc->type = PD_ABSENT;
> +	goto out_put;
> +
> +out_unsupported:
> +	pdesc->type = PD_FUNKEY;
> +	goto out_put;
> +}
> +
> +static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx)
> +{
> +	unsigned long addr;
> +	int page_num = 0;
> +
> +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> +		struct page_desc pd;
> +
> +		page_get_desc(vma, addr, &pd, ctx);
> +
> +		if (pd.type != PD_COPY) {
> +			return -EINVAL;
> +		} else {
> +			page_num += 1;
> +		}
> +		
> +	}
> +	return page_num;
> +}
> +
> +/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages()
> + * does not really need this thing. It just stores some page fault stats there.
> + *
> + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages
> + * before accessing vma.
> + */
> +static int dump_pages(struct vm_area_struct *vma, unsigned long start,
> +		unsigned long end, struct cpt_context *ctx)
> +{
> +#define MAX_PAGE_BATCH 16
> +	struct page *pg[MAX_PAGE_BATCH];
> +	int npages = (end - start)/PAGE_SIZE;
> +	int count = 0;
> +
> +	while (count < npages) {
> +		int copy = npages - count;
> +		int n;
> +
> +		if (copy > MAX_PAGE_BATCH)
> +			copy = MAX_PAGE_BATCH;
> +		n = get_user_pages(current, vma->vm_mm, start, copy,
> +				   0, 1, pg, NULL);
> +		if (n == copy) {
> +			int i;
> +			for (i=0; i<n; i++) {
> +				char *maddr = kmap(pg[i]);
> +				ctx->write(maddr, PAGE_SIZE, ctx);
> +				kunmap(pg[i]);

There is no error handling in this inner loop. Should be fixed imho.

> +			}
> +		} else {
> +			eprintk("get_user_pages fault");
> +			for ( ; n > 0; n--)
> +				page_cache_release(pg[n-1]);
> +			return -EFAULT;
> +		}
> +		start += n*PAGE_SIZE;
> +		count += n;
> +		for ( ; n > 0; n--)
> +			page_cache_release(pg[n-1]);
> +	}
> +	return 0;
> +}
> +
> +static int dump_page_block(struct vm_area_struct *vma,
> +			   struct cpt_page_block *pgb,
> +			   struct cpt_context *ctx)
> +{
> +	int err;
> +	pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
> +	pgb->cpt_type = CPT_OBJ_PAGES;
> +	pgb->cpt_hdrlen = sizeof(*pgb);
> +	pgb->cpt_content = CPT_CONTENT_DATA;
> +
> +	err = ctx->write(pgb, sizeof(*pgb), ctx);
> +	if (!err)
> +		err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
> +
> +	return err;
> +}
> +
> +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
> +{
> +	int len;
> +	char *path;
> +	char *buf;
> +	struct cpt_object_hdr o;
> +
> +	buf = (char *)__get_free_page(GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	path = d_path(p, buf, PAGE_SIZE);
> +
> +	if (IS_ERR(path)) {
> +		free_page((unsigned long)buf);
> +		return PTR_ERR(path);
> +	}
> +
> +	len = buf + PAGE_SIZE - 1 - path;
> +	o.cpt_len = sizeof(o) + len + 1;
> +	o.cpt_type = CPT_OBJ_NAME;
> +	o.cpt_hdrlen = sizeof(o);
> +	o.cpt_content = CPT_CONTENT_NAME;
> +	path[len] = 0;
> +
> +	ctx->write(&o, sizeof(o), ctx);
> +	ctx->write(path, len + 1, ctx);

Error handling?

> +	free_page((unsigned long)buf);
> +
> +	return 0;
> +}
> +
> +static int dump_one_vma(struct mm_struct *mm,
> +			struct vm_area_struct *vma, struct cpt_context *ctx)
> +{
> +	struct cpt_vma_image *v;
> +	unsigned long addr;
> +	int page_num;
> +	int err;
> +
> +	v = kzalloc(sizeof(*v), GFP_KERNEL);
> +	if (!v)
> +		return -ENOMEM;
> +
> +	v->cpt_len = sizeof(*v);
> +	v->cpt_type = CPT_OBJ_VMA;
> +	v->cpt_hdrlen = sizeof(*v);
> +	v->cpt_content = CPT_CONTENT_ARRAY;
> +
> +	v->cpt_start = vma->vm_start;
> +	v->cpt_end = vma->vm_end;
> +	v->cpt_flags = vma->vm_flags;
> +	if (vma->vm_flags & VM_HUGETLB) {
> +		eprintk("huge TLB VMAs are still not supported\n");
> +		kfree(v);
> +		return -EINVAL;
> +	}
> +	v->cpt_pgprot = vma->vm_page_prot.pgprot;
> +	v->cpt_pgoff = vma->vm_pgoff;
> +	v->cpt_file = CPT_NULL;
> +	v->cpt_vma_type = CPT_VMA_TYPE_0;
> +
> +	page_num = count_vma_pages(vma, ctx);
> +	if (page_num < 0) {
> +		kfree(v);
> +		return -EINVAL;
> +	}

AFAICS, since count_vma_pages only supports pages with PD_COPY, and since
get_page_desc() tags text segment pages (file-mapped and not anonymous since
not written to) as PD_ABSENT, no executable is checkpointable. So, where is the
trick? Am I completely missing something about page mapping?

> +	v->cpt_page_num = page_num;
> +
> +	if (vma->vm_file) {
> +		v->cpt_file = 0;
> +		v->cpt_vma_type = CPT_VMA_FILE;
> +	}
> +
> +	ctx->write(v, sizeof(*v), ctx);

Error handling?

> +	kfree(v);
> +
> +	if (vma->vm_file) {
> +		err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
> +		if (err < 0)
> +			return err;
> +	}
> +
> +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> +		struct page_desc pd;
> +		struct cpt_page_block pgb;
> +
> +		page_get_desc(vma, addr, &pd, ctx);
> +
> +		if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
> +			eprintk("dump_one_vma: funkey page\n");
> +			return -EINVAL;
> +		}
> +
> +		pgb.cpt_start = addr;
> +		pgb.cpt_end = addr + PAGE_SIZE;
> +		dump_page_block(vma, &pgb, ctx);

Error handling?

> +	}
> +
> +	return 0;
> +}
> +
> +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
> +{
> +#ifdef CONFIG_X86
> +	if (mm->context.size) {
> +		struct cpt_obj_bits b;
> +		int size;
> +
> +		mutex_lock(&mm->context.lock);
> +
> +		b.cpt_type = CPT_OBJ_BITS;
> +		b.cpt_len = sizeof(b);
> +		b.cpt_content = CPT_CONTENT_MM_CONTEXT;
> +		b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
> +
> +		ctx->write(&b, sizeof(b), ctx);
> +
> +		size = mm->context.size * LDT_ENTRY_SIZE;
> +
> +		ctx->write(mm->context.ldt, size, ctx);

Error handling?

> +
> +		mutex_unlock(&mm->context.lock);
> +	}
> +#endif
> +	return 0;
> +}
> +
> +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct cpt_mm_image *v;
> +	struct vm_area_struct *vma;
> +	int err;
> +
> +	v = kzalloc(sizeof(*v), GFP_KERNEL);
> +	if (!v)
> +		return -ENOMEM;
> +
> +	v->cpt_len = sizeof(*v);
> +	v->cpt_type = CPT_OBJ_MM;
> +	v->cpt_hdrlen = sizeof(*v);
> +	v->cpt_content = CPT_CONTENT_ARRAY;
> +
> +	down_read(&mm->mmap_sem);
> +	v->cpt_start_code = mm->start_code;
> +	v->cpt_end_code = mm->end_code;
> +	v->cpt_start_data = mm->start_data;
> +	v->cpt_end_data = mm->end_data;
> +	v->cpt_start_brk = mm->start_brk;
> +	v->cpt_brk = mm->brk;
> +	v->cpt_start_stack = mm->start_stack;
> +	v->cpt_start_arg = mm->arg_start;
> +	v->cpt_end_arg = mm->arg_end;
> +	v->cpt_start_env = mm->env_start;
> +	v->cpt_end_env = mm->env_end;
> +	v->cpt_def_flags = mm->def_flags;
> +	v->cpt_flags = mm->flags;
> +	v->cpt_map_count = mm->map_count;
> +
> +	err = ctx->write(v, sizeof(*v), ctx);
> +	kfree(v);
> +	
> +	if (err) {
> +		eprintk("error during writing mm\n");
> +		goto err_up;
> +	}
> +	
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		if ((err = dump_one_vma(mm, vma, ctx)) != 0)
> +			goto err_up;
> +	}
> +
> +	err = cpt_dump_mm_context(mm, ctx);
> +
> +err_up:
> +	up_read(&mm->mmap_sem);
> +
> +	return err;
> +}
> +

[...]

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-17 23:11               ` [PATCH 08/10] Introduce functions to restart a process Andrey Mirkin
  2008-10-17 23:11                 ` [PATCH 09/10] Introduce functions to restore mm Andrey Mirkin
  2008-10-20  9:23                 ` [PATCH 08/10] Introduce functions to restart a process Cedric Le Goater
@ 2008-10-20 13:25                 ` Louis Rilling
  2008-10-23 10:56                   ` [Devel] " Andrey Mirkin
  2 siblings, 1 reply; 49+ messages in thread
From: Louis Rilling @ 2008-10-20 13:25 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 7530 bytes --]

On Sat, Oct 18, 2008 at 03:11:36AM +0400, Andrey Mirkin wrote:
> Functions to restart process, restore its state, fpu and registers are added.

[...]

> diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
> new file mode 100644
> index 0000000..b9f745e
> --- /dev/null
> +++ b/checkpoint/rst_process.c
> @@ -0,0 +1,277 @@
> +/*
> + *  Copyright (C) 2008 Parallels, Inc.
> + *
> + *  Author: Andrey Mirkin <major@openvz.org>
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License as
> + *  published by the Free Software Foundation, version 2 of the
> + *  License.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/version.h>
> +#include <linux/module.h>
> +
> +#include "checkpoint.h"
> +#include "cpt_image.h"
> +
> +#define HOOK_RESERVE	256
> +
> +struct thr_context {
> +	struct completion complete;
> +	int error;
> +	struct cpt_context *ctx;
> +	struct task_struct *tsk;
> +};
> +
> +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
> +{
> +	pid_t ret;
> +
> +	if (current->fs == NULL) {
> +		/* do_fork_pid() hates processes without fs, oopses. */
> +		eprintk("local_kernel_thread: current->fs==NULL\n");
> +		return -EINVAL;
> +	}
> +	if (!try_module_get(THIS_MODULE))
> +		return -EBUSY;
> +	ret = kernel_thread(fn, arg, flags);
> +	if (ret < 0)
> +		module_put(THIS_MODULE);
> +	return ret;
> +}
> +
> +static unsigned int decode_task_flags(unsigned int task_flags)
> +{
> +	unsigned int flags = 0;
> +
> +	if (task_flags & (1 << CPT_PF_EXITING))
> +		flags |= PF_EXITING;
> +	if (task_flags & (1 << CPT_PF_FORKNOEXEC))
> +		flags |= PF_FORKNOEXEC;
> +	if (task_flags & (1 << CPT_PF_SUPERPRIV))
> +		flags |= PF_SUPERPRIV;
> +	if (task_flags & (1 << CPT_PF_DUMPCORE))
> +		flags |= PF_DUMPCORE;
> +	if (task_flags & (1 << CPT_PF_SIGNALED))
> +		flags |= PF_SIGNALED;
> +	
> +	return flags;
> +		
> +}
> +
> +int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti,
> +			    struct cpt_context *ctx)
> +{
> +	int i;
> +
> +	/* Restore only saved flags, comm and tls for now */
> +	tsk->flags = decode_task_flags(ti->cpt_flags);
> +	clear_tsk_thread_flag(tsk, TIF_FREEZE);
> +	memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
> +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> +		tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
> +		tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
> +	}
> +
> +	return 0;
> +}
> +
> +static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti,
> +				struct cpt_context *ctx)
> +{
> +	struct cpt_obj_bits hdr;
> +	int err;
> +	char *buf;
> +
> +	clear_stopped_child_used_math(tsk);
> +
> +	err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
> +	if (err < 0)
> +		return err;
> +
> +	buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	err = ctx->read(buf, hdr.cpt_size, ctx);
> +	if (err)
> +		goto out;
> +
> +	if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
> +		memcpy(&tsk->thread.xstate, buf,
> +				sizeof(struct i387_fxsave_struct));
> +		if (ti->cpt_flags & CPT_PF_USED_MATH)
> +			set_stopped_child_used_math(tsk);
> +	}
> +#ifndef CONFIG_X86_64
> +	else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
> +			!cpu_has_fxsr) {		
> +		memcpy(&tsk->thread.xstate, buf,
> +				sizeof(struct i387_fsave_struct));
> +		if (ti->cpt_flags & CPT_PF_USED_MATH)
> +			set_stopped_child_used_math(tsk);
> +	}
> +#endif
> +
> +out:
> +	kfree(buf);
> +	return err;
> +}
> +
> +static u32 decode_segment(u32 segid)
> +{
> +	if (segid == CPT_SEG_ZERO)
> +		return 0;
> +
> +	/* TLS descriptors */
> +	if (segid <= CPT_SEG_TLS3)
> +		return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
> +
> +	/* LDT descriptor, it is just an index to LDT array */
> +	if (segid >= CPT_SEG_LDT)
> +		return ((segid - CPT_SEG_LDT) << 3) | 7;
> +
> +	/* Check for one of standard descriptors */
> +	if (segid == CPT_SEG_USER32_DS)
> +		return __USER_DS;
> +	if (segid == CPT_SEG_USER32_CS)
> +		return __USER_CS;
> +
> +	eprintk("Invalid segment reg %d\n", segid);
> +	return 0;
> +}
> +
> +static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	struct cpt_x86_regs ri;
> +	struct pt_regs *regs = task_pt_regs(tsk);
> +	extern char i386_ret_from_resume;
> +	int err;
> +
> +	err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
> +	if (err < 0)
> +		return err;
> +
> +	tsk->thread.sp = (unsigned long) regs;
> +	tsk->thread.sp0 = (unsigned long) (regs+1);
> +	tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
> +
> +	tsk->thread.gs = decode_segment(ri.cpt_gs);
> +	tsk->thread.debugreg0 = ri.cpt_debugreg[0];
> +	tsk->thread.debugreg1 = ri.cpt_debugreg[1];
> +	tsk->thread.debugreg2 = ri.cpt_debugreg[2];
> +	tsk->thread.debugreg3 = ri.cpt_debugreg[3];
> +	tsk->thread.debugreg6 = ri.cpt_debugreg[6];
> +	tsk->thread.debugreg7 = ri.cpt_debugreg[7];
> +
> +	regs->bx = ri.cpt_bx;
> +	regs->cx = ri.cpt_cx;
> +	regs->dx = ri.cpt_dx;
> +	regs->si = ri.cpt_si;
> +	regs->di = ri.cpt_di;
> +	regs->bp = ri.cpt_bp;
> +	regs->ax = ri.cpt_ax;
> +	regs->orig_ax = ri.cpt_orig_ax;
> +	regs->ip = ri.cpt_ip;
> +	regs->flags = ri.cpt_flags;
> +	regs->sp = ri.cpt_sp;
> +
> +	regs->cs = decode_segment(ri.cpt_cs);
> +	regs->ss = decode_segment(ri.cpt_ss);
> +	regs->ds = decode_segment(ri.cpt_ds);
> +	regs->es = decode_segment(ri.cpt_es);
> +	regs->fs = decode_segment(ri.cpt_fs);
> +
> +	tsk->thread.sp -= HOOK_RESERVE;
> +	memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
> +
> +	return 0;
> +}
> +
> +static int restart_thread(void *arg)
> +{
> +	struct thr_context *thr_ctx = arg;
> +	struct cpt_context *ctx;
> +	struct cpt_task_image *ti;
> +	int err;
> +
> +	current->state = TASK_UNINTERRUPTIBLE;
> +
> +	ctx = thr_ctx->ctx;
> +	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
> +	if (!ti)
> +		return -ENOMEM;
> +
> +	err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
> +	if (!err)
> +		err = rst_restore_task_struct(current, ti, ctx);
> +	/* Restore mm here */
> +	if (!err)
> +		err = rst_restore_fpustate(current, ti, ctx);
> +	if (!err)
> +		err = rst_restore_registers(current, ctx);
> +
> +	thr_ctx->error = err;
> +	complete(&thr_ctx->complete);
> +
> +	if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
> +		do_exit(ti->cpt_exit_code);
> +	} else {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +	}
> +
> +	kfree(ti);
> +	schedule();
> +
> +	eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
> +
> +	module_put(THIS_MODULE);

I'm sorry, I still do not understand what you are doing with this self-module
pinning stuff. AFAICS, we should not get here unless there is a bug. So the
checkpoint module ref count is never decreased, right?

Could you detail what is this self-module pinning for? As I already told you,
this looks like a bogus solution to avoid unloading the checkpoint module during
restart.

Thanks!

Louis

[...]

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] Make checkpoint/restart functionality modular
  2008-10-17 23:11   ` [PATCH 02/10] Make checkpoint/restart functionality modular Andrey Mirkin
  2008-10-17 23:11     ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Andrey Mirkin
@ 2008-10-20 16:51     ` Dave Hansen
  2008-10-20 16:59     ` Serge E. Hallyn
  2 siblings, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2008-10-20 16:51 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> +struct cpt_operations
> +{
> +       struct module * owner;
> +       int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
> +       int (*restart)(int ctid, int fd, unsigned long flags);
> +};

I think this is pretty useless obfuscation.  We're not going to have
pluggable checkpoint/restart implementations, are we?  So, why bother
putting it in a module?

I can understand that it's easier to develop your code when it's in a
module and you don't have to reboot the machine to load a new kernel
each time.  But, that's an individual developer thing, and doesn't
belong in an upstream submission.

I know people have given you a hard time for this in the past.  Why is
it still here?

-- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] Make checkpoint/restart functionality modular
  2008-10-17 23:11   ` [PATCH 02/10] Make checkpoint/restart functionality modular Andrey Mirkin
  2008-10-17 23:11     ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Andrey Mirkin
  2008-10-20 16:51     ` [PATCH 02/10] Make checkpoint/restart functionality modular Dave Hansen
@ 2008-10-20 16:59     ` Serge E. Hallyn
  2 siblings, 0 replies; 49+ messages in thread
From: Serge E. Hallyn @ 2008-10-20 16:59 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

Quoting Andrey Mirkin (major@openvz.org):
> A config option CONFIG_CHECKPOINT is introduced.
> New structure cpt_operations is introduced to store pointers to
> checkpoint/restart functions from module.

I thought we had decided not to use a kernel module?

Louis' comments on your patch 8 regarding module pinning suggests that
details about using a module will detract from proper review of the core
c/r functionality...

-serge

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/10] Introduce context structure needed during checkpointing/restart
  2008-10-17 23:11     ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Andrey Mirkin
  2008-10-17 23:11       ` [PATCH 04/10] Introduce container dump function Andrey Mirkin
@ 2008-10-20 17:02       ` Dave Hansen
  2008-10-29 15:30         ` [Devel] " Andrey Mirkin
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-10-20 17:02 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> +typedef struct cpt_context
> +{
> +	pid_t		pid;		/* should be changed to ctid later */
> +	int		ctx_id;		/* context id */
> +	struct list_head ctx_list;
> +	int		refcount;
> +	int		ctx_state;
> +	struct semaphore main_sem;

Does this really need to be a semaphore or is a mutex OK?

> +	int		errno;

Could you hold off on adding these things to the struct until the patch
where they're actually used?  It's hard to judge this without seeing
what you do with it.

> +	struct file	*file;
> +	loff_t		current_object;
> +
> +	struct list_head object_array[CPT_OBJ_MAX];
> +
> +	int		(*write)(const void *addr, size_t count, struct cpt_context *ctx);
> +	int		(*read)(void *addr, size_t count, struct cpt_context *ctx);
> +} cpt_context_t;

Man, this is hard to review.  I was going to try and make sure that your
refcounting was right and atomic, but there's no use of it in this patch
except for the initialization and accessor functions.  Darn.

> +extern int debug_level;

I'm going to go out on a limb here and say that "debug_level" is
probably a wee bit too generic of a variable name.

> +#define cpt_printk(lvl, fmt, args...)	do {	\
> +		if (lvl <= debug_level)		\
> +			printk(fmt, ##args);	\
> +	} while (0)

I think you can use pr_debug() here, too, just like Oren did.

> +struct cpt_context * context_alloc(void)
> +{
> +	struct cpt_context *ctx;
> +	int i;
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return NULL;
> +
> +	init_MUTEX(&ctx->main_sem);
> +	ctx->refcount = 1;
> +
> +	ctx->current_object = -1;
> +	ctx->write = file_write;
> +	ctx->read = file_read;
> +	for (i = 0; i < CPT_OBJ_MAX; i++) {
> +		INIT_LIST_HEAD(&ctx->object_array[i]);
> +	}
> +
> +	return ctx;
> +}
> +
> +void context_release(struct cpt_context *ctx)
> +{
> +	ctx->ctx_state = CPT_CTX_ERROR;
> +
> +	kfree(ctx);
> +}
> +
> +static void context_put(struct cpt_context *ctx)
> +{
> +	if (!--ctx->refcount)
> +		context_release(ctx);
> +}
> +
>  static int checkpoint(pid_t pid, int fd, unsigned long flags)
>  {
> -	return -ENOSYS; 
> +	struct file *file;
> +	struct cpt_context *ctx;
> +	int err;
> +
> +	err = -EBADF;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +
> +	err = -ENOMEM;
> +	ctx = context_alloc();
> +	if (!ctx)
> +		goto out_file;
> +
> +	ctx->file = file;
> +	ctx->ctx_state = CPT_CTX_DUMPING;
> +
> +	/* checkpoint */
> +	err = -ENOSYS;
> +
> +	context_put(ctx);
> +
> +out_file:
> +	fput(file);
> +out:
> +	return err; 
>  }

So, where is context_get()?  Is there only single-threaded access to the
refcount?  If so, why do we even need it?  We should probably just use
context_release() driectly.

If there is multithreaded access to context_put() or the refcount, then
they're unsafe without additional locking.

-- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/10] Introduce functions to dump mm
  2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
  2008-10-17 23:11             ` [PATCH 07/10] Introduce function for restarting a container Andrey Mirkin
  2008-10-20 12:25             ` [PATCH 06/10] Introduce functions to dump mm Louis Rilling
@ 2008-10-20 17:21             ` Dave Hansen
  2008-10-23  8:43               ` [Devel] " Andrey Mirkin
  2 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-10-20 17:21 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
> +                         struct page_desc *pdesc, cpt_context_t * ctx)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *ptep, pte;
> +       spinlock_t *ptl;
> +       struct page *pg = NULL;
> +       pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
> +
> +       pdesc->index = linear_index;
> +       pdesc->shared = 0;
> +       pdesc->mm = CPT_NULL;
> +
> +       if (vma->vm_flags & VM_IO) {
> +               pdesc->type = PD_ABSENT;
> +               return;
> +       }
> +
> +       pgd = pgd_offset(mm, addr);
> +       if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +               goto out_absent;
> +       pud = pud_offset(pgd, addr);
> +       if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +               goto out_absent;
> +       pmd = pmd_offset(pud, addr);
> +       if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +               goto out_absent;
> +#ifdef CONFIG_X86
> +       if (pmd_huge(*pmd)) {
> +               eprintk("page_huge\n");
> +               goto out_unsupported;
> +       }
> +#endif

I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
large pages.  

Since you have the VMA, why not use is_vm_hugetlb_page()?

-- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/10] Introduce function to dump process
  2008-10-17 23:11         ` [PATCH 05/10] Introduce function to dump process Andrey Mirkin
  2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
  2008-10-20 11:02           ` [PATCH 05/10] Introduce function to dump process Louis Rilling
@ 2008-10-20 17:48           ` Serge E. Hallyn
  2008-10-24  4:40             ` [Devel] " Andrey Mirkin
  2 siblings, 1 reply; 49+ messages in thread
From: Serge E. Hallyn @ 2008-10-20 17:48 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov

Quoting Andrey Mirkin (major@openvz.org):
> +	t->cpt_uid = tsk->uid;
> +	t->cpt_euid = tsk->euid;
> +	t->cpt_suid = tsk->suid;
> +	t->cpt_fsuid = tsk->fsuid;
> +	t->cpt_gid = tsk->gid;
> +	t->cpt_egid = tsk->egid;
> +	t->cpt_sgid = tsk->sgid;
> +	t->cpt_fsgid = tsk->fsgid;

I don't see where any of these are restored.  (Obviously, I wanted
to think about how you're verifying the restarter's authorization
to do so)

thanks,
-serge

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-20  9:23                 ` [PATCH 08/10] Introduce functions to restart a process Cedric Le Goater
@ 2008-10-22  8:49                   ` Andrey Mirkin
  2008-10-22  9:25                     ` Louis Rilling
  2008-10-22 12:47                     ` Cedric Le Goater
  0 siblings, 2 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-22  8:49 UTC (permalink / raw)
  To: devel, Pavel Emelyanov; +Cc: Cedric Le Goater, containers, linux-kernel

On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> Hello Andrey !
>
> > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > index 109792b..a4848a3 100644
> > --- a/arch/x86/kernel/entry_32.S
> > +++ b/arch/x86/kernel/entry_32.S
> > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> >  	GET_THREAD_INFO(%ebp)
> >  	popl %eax
> >  	CFI_ADJUST_CFA_OFFSET -4
> > +ret_from_fork_tail:
> >  	pushl $0x0202			# Reset kernel eflags
> >  	CFI_ADJUST_CFA_OFFSET 4
> >  	popfl
> > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> >  	CFI_ENDPROC
> >  END(ret_from_fork)
> >
> > +ENTRY(i386_ret_from_resume)
> > +	CFI_STARTPROC
> > +	pushl %eax
> > +	CFI_ADJUST_CFA_OFFSET 4
> > +	call schedule_tail
> > +	GET_THREAD_INFO(%ebp)
> > +	popl %eax
> > +	CFI_ADJUST_CFA_OFFSET -4
> > +	movl (%esp), %eax
> > +	testl %eax, %eax
> > +	jz    1f
> > +	pushl %esp
> > +	call  *%eax
> > +	addl  $4, %esp
> > +1:
> > +	addl  $256, %esp
> > +	jmp   ret_from_fork_tail
> > +	CFI_ENDPROC
> > +END(i386_ret_from_resume)
>
> Could you explain why you need to do this
>
> 	call  *%eax
>
> is it related to the freezer code ?

It is not related to the freezer code actually.
That is needed to restart syscalls. Right now I don't have a code in my 
patchset which restarts a syscall, but later I plan to add it.
In OpenVZ checkpointing we restart syscalls if process was caught in syscall 
during checkpointing.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm
  2008-10-20 12:25             ` [PATCH 06/10] Introduce functions to dump mm Louis Rilling
@ 2008-10-22  8:58               ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-22  8:58 UTC (permalink / raw)
  To: devel, Louis.Rilling; +Cc: Pavel Emelyanov, containers, linux-kernel

On Monday 20 October 2008 16:25 Louis Rilling wrote:
> On Sat, Oct 18, 2008 at 03:11:34AM +0400, Andrey Mirkin wrote:
> > Functions to dump mm struct, VMAs and mm context are added.
>
> Again, a few little comments.
>
> [...]
>
> > diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
> > new file mode 100644
> > index 0000000..8a22c48
> > --- /dev/null
> > +++ b/checkpoint/cpt_mm.c
> > @@ -0,0 +1,434 @@
> > +/*
> > + *  Copyright (C) 2008 Parallels, Inc.
> > + *
> > + *  Authors:	Andrey Mirkin <major@openvz.org>
> > + *
> > + *  This program is free software; you can redistribute it and/or
> > + *  modify it under the terms of the GNU General Public License as
> > + *  published by the Free Software Foundation, version 2 of the
> > + *  License.
> > + *
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/file.h>
> > +#include <linux/mm.h>
> > +#include <linux/errno.h>
> > +#include <linux/major.h>
> > +#include <linux/mman.h>
> > +#include <linux/mnt_namespace.h>
> > +#include <linux/mount.h>
> > +#include <linux/namei.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/hugetlb.h>
> > +#include <asm/ldt.h>
> > +
> > +#include "checkpoint.h"
> > +#include "cpt_image.h"
> > +
> > +struct page_area
> > +{
> > +	int type;
> > +	unsigned long start;
> > +	unsigned long end;
> > +	pgoff_t pgoff;
> > +	loff_t mm;
> > +	__u64 list[16];
> > +};
> > +
> > +struct page_desc
> > +{
> > +	int	type;
> > +	pgoff_t	index;
> > +	loff_t	mm;
> > +	int	shared;
> > +};
> > +
> > +enum {
> > +	PD_ABSENT,
> > +	PD_COPY,
> > +	PD_FUNKEY,
> > +};
> > +
> > +/* 0: page can be obtained from backstore, or still not mapped anonymous
> >  page, +      or something else, which does not requre copy.
> > +   1: page requires copy
> > +   2: page requres copy but its content is zero. Quite useless.
> > +   3: wp page is shared after fork(). It is to be COWed when modified.
> > +   4: page is something unsupported... We copy it right now.
> > + */
> > +
> > +static void page_get_desc(struct vm_area_struct *vma, unsigned long
> > addr, +			  struct page_desc *pdesc, cpt_context_t * ctx)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +	pte_t *ptep, pte;
> > +	spinlock_t *ptl;
> > +	struct page *pg = NULL;
> > +	pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE +
> > vma->vm_pgoff; +
> > +	pdesc->index = linear_index;
> > +	pdesc->shared = 0;
> > +	pdesc->mm = CPT_NULL;
> > +
> > +	if (vma->vm_flags & VM_IO) {
> > +		pdesc->type = PD_ABSENT;
> > +		return;
> > +	}
> > +
> > +	pgd = pgd_offset(mm, addr);
> > +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > +		goto out_absent;
> > +	pud = pud_offset(pgd, addr);
> > +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > +		goto out_absent;
> > +	pmd = pmd_offset(pud, addr);
> > +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > +		goto out_absent;
> > +#ifdef CONFIG_X86
> > +	if (pmd_huge(*pmd)) {
> > +		eprintk("page_huge\n");
> > +		goto out_unsupported;
> > +	}
> > +#endif
> > +	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +	pte = *ptep;
> > +	pte_unmap(ptep);
> > +
> > +	if (pte_none(pte))
> > +		goto out_absent_unlock;
> > +
> > +	if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
> > +		pdesc->type = PD_COPY;
> > +		goto out_unlock;
> > +	}
> > +
> > +	get_page(pg);
> > +	spin_unlock(ptl);
> > +
> > +	if (pg->mapping && !PageAnon(pg)) {
> > +		if (vma->vm_file == NULL) {
> > +			eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
> > +			goto out_unsupported;
> > +		}
> > +		if (vma->vm_file->f_mapping != pg->mapping) {
> > +			eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
> > +				    addr, vma->vm_file->f_mapping, pg->mapping);
> > +			goto out_unsupported;
> > +		}
> > +		pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
> > +		/* Page is in backstore. For us it is like
> > +		 * it is not present.
> > +		 */
> > +		goto out_absent;
> > +	}
> > +
> > +	if (PageReserved(pg)) {
> > +		/* Special case: ZERO_PAGE is used, when an
> > +		 * anonymous page is accessed but not written. */
> > +		if (pg == ZERO_PAGE(addr)) {
> > +			if (pte_write(pte)) {
> > +				eprintk("not funny already, writable ZERO_PAGE\n");
> > +				goto out_unsupported;
> > +			}
> > +			/* Just copy it for now */
> > +			pdesc->type = PD_COPY;
> > +			goto out_put;
> > +		}
> > +		eprintk("reserved page %lu at %08lx\n", pg->index, addr);
> > +		goto out_unsupported;
> > +	}
> > +
> > +	if (!pg->mapping) {
> > +		eprintk("page without mapping at %08lx\n", addr);
> > +		goto out_unsupported;
> > +	}
> > +
> > +	pdesc->type = PD_COPY;
> > +
> > +out_put:
> > +	if (pg)
> > +		put_page(pg);
> > +	return;
> > +
> > +out_unlock:
> > +	spin_unlock(ptl);
> > +	goto out_put;
> > +
> > +out_absent_unlock:
> > +	spin_unlock(ptl);
> > +
> > +out_absent:
> > +	pdesc->type = PD_ABSENT;
> > +	goto out_put;
> > +
> > +out_unsupported:
> > +	pdesc->type = PD_FUNKEY;
> > +	goto out_put;
> > +}
> > +
> > +static int count_vma_pages(struct vm_area_struct *vma, struct
> > cpt_context *ctx) +{
> > +	unsigned long addr;
> > +	int page_num = 0;
> > +
> > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > +		struct page_desc pd;
> > +
> > +		page_get_desc(vma, addr, &pd, ctx);
> > +
> > +		if (pd.type != PD_COPY) {
> > +			return -EINVAL;
> > +		} else {
> > +			page_num += 1;
> > +		}
> > +
> > +	}
> > +	return page_num;
> > +}
> > +
> > +/* ATTN: We give "current" to get_user_pages(). This is wrong, but
> > get_user_pages() + * does not really need this thing. It just stores some
> > page fault stats there. + *
> > + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache
> > pages + * before accessing vma.
> > + */
> > +static int dump_pages(struct vm_area_struct *vma, unsigned long start,
> > +		unsigned long end, struct cpt_context *ctx)
> > +{
> > +#define MAX_PAGE_BATCH 16
> > +	struct page *pg[MAX_PAGE_BATCH];
> > +	int npages = (end - start)/PAGE_SIZE;
> > +	int count = 0;
> > +
> > +	while (count < npages) {
> > +		int copy = npages - count;
> > +		int n;
> > +
> > +		if (copy > MAX_PAGE_BATCH)
> > +			copy = MAX_PAGE_BATCH;
> > +		n = get_user_pages(current, vma->vm_mm, start, copy,
> > +				   0, 1, pg, NULL);
> > +		if (n == copy) {
> > +			int i;
> > +			for (i=0; i<n; i++) {
> > +				char *maddr = kmap(pg[i]);
> > +				ctx->write(maddr, PAGE_SIZE, ctx);
> > +				kunmap(pg[i]);
>
> There is no error handling in this inner loop. Should be fixed imho.

Yes, you right. Already fixed in next version. I'll try to send it out 
shortly.

>
> > +			}
> > +		} else {
> > +			eprintk("get_user_pages fault");
> > +			for ( ; n > 0; n--)
> > +				page_cache_release(pg[n-1]);
> > +			return -EFAULT;
> > +		}
> > +		start += n*PAGE_SIZE;
> > +		count += n;
> > +		for ( ; n > 0; n--)
> > +			page_cache_release(pg[n-1]);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static int dump_page_block(struct vm_area_struct *vma,
> > +			   struct cpt_page_block *pgb,
> > +			   struct cpt_context *ctx)
> > +{
> > +	int err;
> > +	pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
> > +	pgb->cpt_type = CPT_OBJ_PAGES;
> > +	pgb->cpt_hdrlen = sizeof(*pgb);
> > +	pgb->cpt_content = CPT_CONTENT_DATA;
> > +
> > +	err = ctx->write(pgb, sizeof(*pgb), ctx);
> > +	if (!err)
> > +		err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
> > +
> > +	return err;
> > +}
> > +
> > +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
> > +{
> > +	int len;
> > +	char *path;
> > +	char *buf;
> > +	struct cpt_object_hdr o;
> > +
> > +	buf = (char *)__get_free_page(GFP_KERNEL);
> > +	if (!buf)
> > +		return -ENOMEM;
> > +
> > +	path = d_path(p, buf, PAGE_SIZE);
> > +
> > +	if (IS_ERR(path)) {
> > +		free_page((unsigned long)buf);
> > +		return PTR_ERR(path);
> > +	}
> > +
> > +	len = buf + PAGE_SIZE - 1 - path;
> > +	o.cpt_len = sizeof(o) + len + 1;
> > +	o.cpt_type = CPT_OBJ_NAME;
> > +	o.cpt_hdrlen = sizeof(o);
> > +	o.cpt_content = CPT_CONTENT_NAME;
> > +	path[len] = 0;
> > +
> > +	ctx->write(&o, sizeof(o), ctx);
> > +	ctx->write(path, len + 1, ctx);
>
> Error handling?
Will fix it, thanks.

>
> > +	free_page((unsigned long)buf);
> > +
> > +	return 0;
> > +}
> > +
> > +static int dump_one_vma(struct mm_struct *mm,
> > +			struct vm_area_struct *vma, struct cpt_context *ctx)
> > +{
> > +	struct cpt_vma_image *v;
> > +	unsigned long addr;
> > +	int page_num;
> > +	int err;
> > +
> > +	v = kzalloc(sizeof(*v), GFP_KERNEL);
> > +	if (!v)
> > +		return -ENOMEM;
> > +
> > +	v->cpt_len = sizeof(*v);
> > +	v->cpt_type = CPT_OBJ_VMA;
> > +	v->cpt_hdrlen = sizeof(*v);
> > +	v->cpt_content = CPT_CONTENT_ARRAY;
> > +
> > +	v->cpt_start = vma->vm_start;
> > +	v->cpt_end = vma->vm_end;
> > +	v->cpt_flags = vma->vm_flags;
> > +	if (vma->vm_flags & VM_HUGETLB) {
> > +		eprintk("huge TLB VMAs are still not supported\n");
> > +		kfree(v);
> > +		return -EINVAL;
> > +	}
> > +	v->cpt_pgprot = vma->vm_page_prot.pgprot;
> > +	v->cpt_pgoff = vma->vm_pgoff;
> > +	v->cpt_file = CPT_NULL;
> > +	v->cpt_vma_type = CPT_VMA_TYPE_0;
> > +
> > +	page_num = count_vma_pages(vma, ctx);
> > +	if (page_num < 0) {
> > +		kfree(v);
> > +		return -EINVAL;
> > +	}
>
> AFAICS, since count_vma_pages only supports pages with PD_COPY, and since
> get_page_desc() tags text segment pages (file-mapped and not anonymous
> since not written to) as PD_ABSENT, no executable is checkpointable. So,
> where is the trick? Am I completely missing something about page mapping?
Oh, that's my fault, I have sent wrong version. I will send new patchset with 
correct page mapping today.

>
> > +	v->cpt_page_num = page_num;
> > +
> > +	if (vma->vm_file) {
> > +		v->cpt_file = 0;
> > +		v->cpt_vma_type = CPT_VMA_FILE;
> > +	}
> > +
> > +	ctx->write(v, sizeof(*v), ctx);
>
> Error handling?
Yes, will add it.

>
> > +	kfree(v);
> > +
> > +	if (vma->vm_file) {
> > +		err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
> > +		if (err < 0)
> > +			return err;
> > +	}
> > +
> > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > +		struct page_desc pd;
> > +		struct cpt_page_block pgb;
> > +
> > +		page_get_desc(vma, addr, &pd, ctx);
> > +
> > +		if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
> > +			eprintk("dump_one_vma: funkey page\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		pgb.cpt_start = addr;
> > +		pgb.cpt_end = addr + PAGE_SIZE;
> > +		dump_page_block(vma, &pgb, ctx);
>
> Error handling?
Yeap, thanks.

>
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context
> > *ctx) +{
> > +#ifdef CONFIG_X86
> > +	if (mm->context.size) {
> > +		struct cpt_obj_bits b;
> > +		int size;
> > +
> > +		mutex_lock(&mm->context.lock);
> > +
> > +		b.cpt_type = CPT_OBJ_BITS;
> > +		b.cpt_len = sizeof(b);
> > +		b.cpt_content = CPT_CONTENT_MM_CONTEXT;
> > +		b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
> > +
> > +		ctx->write(&b, sizeof(b), ctx);
> > +
> > +		size = mm->context.size * LDT_ENTRY_SIZE;
> > +
> > +		ctx->write(mm->context.ldt, size, ctx);
>
> Error handling?
Thanks again!

>
> > +
> > +		mutex_unlock(&mm->context.lock);
> > +	}
> > +#endif
> > +	return 0;
> > +}
> > +
> > +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > +	struct mm_struct *mm = tsk->mm;
> > +	struct cpt_mm_image *v;
> > +	struct vm_area_struct *vma;
> > +	int err;
> > +
> > +	v = kzalloc(sizeof(*v), GFP_KERNEL);
> > +	if (!v)
> > +		return -ENOMEM;
> > +
> > +	v->cpt_len = sizeof(*v);
> > +	v->cpt_type = CPT_OBJ_MM;
> > +	v->cpt_hdrlen = sizeof(*v);
> > +	v->cpt_content = CPT_CONTENT_ARRAY;
> > +
> > +	down_read(&mm->mmap_sem);
> > +	v->cpt_start_code = mm->start_code;
> > +	v->cpt_end_code = mm->end_code;
> > +	v->cpt_start_data = mm->start_data;
> > +	v->cpt_end_data = mm->end_data;
> > +	v->cpt_start_brk = mm->start_brk;
> > +	v->cpt_brk = mm->brk;
> > +	v->cpt_start_stack = mm->start_stack;
> > +	v->cpt_start_arg = mm->arg_start;
> > +	v->cpt_end_arg = mm->arg_end;
> > +	v->cpt_start_env = mm->env_start;
> > +	v->cpt_end_env = mm->env_end;
> > +	v->cpt_def_flags = mm->def_flags;
> > +	v->cpt_flags = mm->flags;
> > +	v->cpt_map_count = mm->map_count;
> > +
> > +	err = ctx->write(v, sizeof(*v), ctx);
> > +	kfree(v);
> > +
> > +	if (err) {
> > +		eprintk("error during writing mm\n");
> > +		goto err_up;
> > +	}
> > +
> > +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > +		if ((err = dump_one_vma(mm, vma, ctx)) != 0)
> > +			goto err_up;
> > +	}
> > +
> > +	err = cpt_dump_mm_context(mm, ctx);
> > +
> > +err_up:
> > +	up_read(&mm->mmap_sem);
> > +
> > +	return err;
> > +}
> > +
>
> [...]
>
> Louis

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22  8:49                   ` [Devel] " Andrey Mirkin
@ 2008-10-22  9:25                     ` Louis Rilling
  2008-10-22 10:06                       ` Greg Kurz
  2008-10-22 10:12                       ` Andrey Mirkin
  2008-10-22 12:47                     ` Cedric Le Goater
  1 sibling, 2 replies; 49+ messages in thread
From: Louis Rilling @ 2008-10-22  9:25 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: devel, Pavel Emelyanov, Cedric Le Goater, containers, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2292 bytes --]

On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > Hello Andrey !
> >
> > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > > index 109792b..a4848a3 100644
> > > --- a/arch/x86/kernel/entry_32.S
> > > +++ b/arch/x86/kernel/entry_32.S
> > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > >  	GET_THREAD_INFO(%ebp)
> > >  	popl %eax
> > >  	CFI_ADJUST_CFA_OFFSET -4
> > > +ret_from_fork_tail:
> > >  	pushl $0x0202			# Reset kernel eflags
> > >  	CFI_ADJUST_CFA_OFFSET 4
> > >  	popfl
> > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > >  	CFI_ENDPROC
> > >  END(ret_from_fork)
> > >
> > > +ENTRY(i386_ret_from_resume)
> > > +	CFI_STARTPROC
> > > +	pushl %eax
> > > +	CFI_ADJUST_CFA_OFFSET 4
> > > +	call schedule_tail
> > > +	GET_THREAD_INFO(%ebp)
> > > +	popl %eax
> > > +	CFI_ADJUST_CFA_OFFSET -4
> > > +	movl (%esp), %eax
> > > +	testl %eax, %eax
> > > +	jz    1f
> > > +	pushl %esp
> > > +	call  *%eax
> > > +	addl  $4, %esp
> > > +1:
> > > +	addl  $256, %esp
> > > +	jmp   ret_from_fork_tail
> > > +	CFI_ENDPROC
> > > +END(i386_ret_from_resume)
> >
> > Could you explain why you need to do this
> >
> > 	call  *%eax
> >
> > is it related to the freezer code ?
> 
> It is not related to the freezer code actually.
> That is needed to restart syscalls. Right now I don't have a code in my 
> patchset which restarts a syscall, but later I plan to add it.
> In OpenVZ checkpointing we restart syscalls if process was caught in syscall 
> during checkpointing.

Do you checkpoint uninterruptible syscalls as well? If only interruptible
syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
this syscall does not restart itself when interrupted, and well, this is life,
userspace just sees -EINTR, which is allowed by the syscall spec.
Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
Kerrighed and this works.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22  9:25                     ` Louis Rilling
@ 2008-10-22 10:06                       ` Greg Kurz
  2008-10-22 10:44                         ` Louis Rilling
  2008-10-22 10:12                       ` Andrey Mirkin
  1 sibling, 1 reply; 49+ messages in thread
From: Greg Kurz @ 2008-10-22 10:06 UTC (permalink / raw)
  To: Louis.Rilling
  Cc: Andrey Mirkin, containers, Cedric Le Goater, linux-kernel,
	Pavel Emelyanov

On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
> Do you checkpoint uninterruptible syscalls as well? If only interruptible
> syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
> ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
> this syscall does not restart itself when interrupted, and well, this is life,
> userspace just sees -EINTR, which is allowed by the syscall spec.
> Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
> Kerrighed and this works.
> 
> Louis
> 

I don't know Kerrighed internals but I understand you perform checkpoint
with a signal handler. Right ? This approach has a huge benefit: the
signal handling code do all the arch dependant stuff to save registers
in user memory.

-- 
Gregory Kurz                                     gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22  9:25                     ` Louis Rilling
  2008-10-22 10:06                       ` Greg Kurz
@ 2008-10-22 10:12                       ` Andrey Mirkin
  2008-10-22 10:46                         ` Louis Rilling
  2008-10-22 15:25                         ` Oren Laadan
  1 sibling, 2 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-22 10:12 UTC (permalink / raw)
  To: Louis.Rilling
  Cc: devel, Pavel Emelyanov, Cedric Le Goater, containers, linux-kernel

On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > > Hello Andrey !
> > >
> > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > > > index 109792b..a4848a3 100644
> > > > --- a/arch/x86/kernel/entry_32.S
> > > > +++ b/arch/x86/kernel/entry_32.S
> > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > >  	GET_THREAD_INFO(%ebp)
> > > >  	popl %eax
> > > >  	CFI_ADJUST_CFA_OFFSET -4
> > > > +ret_from_fork_tail:
> > > >  	pushl $0x0202			# Reset kernel eflags
> > > >  	CFI_ADJUST_CFA_OFFSET 4
> > > >  	popfl
> > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > >  	CFI_ENDPROC
> > > >  END(ret_from_fork)
> > > >
> > > > +ENTRY(i386_ret_from_resume)
> > > > +	CFI_STARTPROC
> > > > +	pushl %eax
> > > > +	CFI_ADJUST_CFA_OFFSET 4
> > > > +	call schedule_tail
> > > > +	GET_THREAD_INFO(%ebp)
> > > > +	popl %eax
> > > > +	CFI_ADJUST_CFA_OFFSET -4
> > > > +	movl (%esp), %eax
> > > > +	testl %eax, %eax
> > > > +	jz    1f
> > > > +	pushl %esp
> > > > +	call  *%eax
> > > > +	addl  $4, %esp
> > > > +1:
> > > > +	addl  $256, %esp
> > > > +	jmp   ret_from_fork_tail
> > > > +	CFI_ENDPROC
> > > > +END(i386_ret_from_resume)
> > >
> > > Could you explain why you need to do this
> > >
> > > 	call  *%eax
> > >
> > > is it related to the freezer code ?
> >
> > It is not related to the freezer code actually.
> > That is needed to restart syscalls. Right now I don't have a code in my
> > patchset which restarts a syscall, but later I plan to add it.
> > In OpenVZ checkpointing we restart syscalls if process was caught in
> > syscall during checkpointing.
>
> Do you checkpoint uninterruptible syscalls as well? If only interruptible
> syscalls are checkpointed, I'd say that either this syscall uses
> ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
> does the trick, or this syscall does not restart itself when interrupted,
> and well, this is life, userspace just sees -EINTR, which is allowed by the
> syscall spec.
> Actually this is how we checkpoint/migrate tasks in interruptible syscalls
> in Kerrighed and this works.

We checkpoint only interruptible syscalls. Some syscalls do not restart 
themself, that is why after restarting a process we restart syscall to 
complete it.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 10:06                       ` Greg Kurz
@ 2008-10-22 10:44                         ` Louis Rilling
  2008-10-22 12:44                           ` Greg Kurz
  0 siblings, 1 reply; 49+ messages in thread
From: Louis Rilling @ 2008-10-22 10:44 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Andrey Mirkin, containers, Cedric Le Goater, linux-kernel,
	Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 1642 bytes --]

On Wed, Oct 22, 2008 at 12:06:19PM +0200, Greg Kurz wrote:
> On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
> > Do you checkpoint uninterruptible syscalls as well? If only interruptible
> > syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
> > ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
> > this syscall does not restart itself when interrupted, and well, this is life,
> > userspace just sees -EINTR, which is allowed by the syscall spec.
> > Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
> > Kerrighed and this works.
> > 
> > Louis
> > 
> 
> I don't know Kerrighed internals but I understand you perform checkpoint
> with a signal handler. Right ?

Right. This is an kernel-internal-only signal, so all signals remain available
for userspace.

> This approach has a huge benefit: the
> signal handling code do all the arch dependant stuff to save registers
> in user memory.

Hm, I'm not sure to understand what you mean here. We just rely on arch code
that jumps to signal handling to correctly setup struct pt_regs, which is then
passed to the checkpoint code. So yes, userspace registers are mostly saved by
existing arch code. But in x86-64 for instance, segment registers still need to
be saved by the checkpoint code (a bit like copy_thread() does), and I don't
know arch-independent functions doing this.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 10:12                       ` Andrey Mirkin
@ 2008-10-22 10:46                         ` Louis Rilling
  2008-10-23  8:53                           ` Andrey Mirkin
  2008-10-22 15:25                         ` Oren Laadan
  1 sibling, 1 reply; 49+ messages in thread
From: Louis Rilling @ 2008-10-22 10:46 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: devel, Pavel Emelyanov, Cedric Le Goater, containers, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2971 bytes --]

On Wed, Oct 22, 2008 at 02:12:12PM +0400, Andrey Mirkin wrote:
> On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> > On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> > > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > > > Hello Andrey !
> > > >
> > > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > > > > index 109792b..a4848a3 100644
> > > > > --- a/arch/x86/kernel/entry_32.S
> > > > > +++ b/arch/x86/kernel/entry_32.S
> > > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > > >  	GET_THREAD_INFO(%ebp)
> > > > >  	popl %eax
> > > > >  	CFI_ADJUST_CFA_OFFSET -4
> > > > > +ret_from_fork_tail:
> > > > >  	pushl $0x0202			# Reset kernel eflags
> > > > >  	CFI_ADJUST_CFA_OFFSET 4
> > > > >  	popfl
> > > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > > >  	CFI_ENDPROC
> > > > >  END(ret_from_fork)
> > > > >
> > > > > +ENTRY(i386_ret_from_resume)
> > > > > +	CFI_STARTPROC
> > > > > +	pushl %eax
> > > > > +	CFI_ADJUST_CFA_OFFSET 4
> > > > > +	call schedule_tail
> > > > > +	GET_THREAD_INFO(%ebp)
> > > > > +	popl %eax
> > > > > +	CFI_ADJUST_CFA_OFFSET -4
> > > > > +	movl (%esp), %eax
> > > > > +	testl %eax, %eax
> > > > > +	jz    1f
> > > > > +	pushl %esp
> > > > > +	call  *%eax
> > > > > +	addl  $4, %esp
> > > > > +1:
> > > > > +	addl  $256, %esp
> > > > > +	jmp   ret_from_fork_tail
> > > > > +	CFI_ENDPROC
> > > > > +END(i386_ret_from_resume)
> > > >
> > > > Could you explain why you need to do this
> > > >
> > > > 	call  *%eax
> > > >
> > > > is it related to the freezer code ?
> > >
> > > It is not related to the freezer code actually.
> > > That is needed to restart syscalls. Right now I don't have a code in my
> > > patchset which restarts a syscall, but later I plan to add it.
> > > In OpenVZ checkpointing we restart syscalls if process was caught in
> > > syscall during checkpointing.
> >
> > Do you checkpoint uninterruptible syscalls as well? If only interruptible
> > syscalls are checkpointed, I'd say that either this syscall uses
> > ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
> > does the trick, or this syscall does not restart itself when interrupted,
> > and well, this is life, userspace just sees -EINTR, which is allowed by the
> > syscall spec.
> > Actually this is how we checkpoint/migrate tasks in interruptible syscalls
> > in Kerrighed and this works.
> 
> We checkpoint only interruptible syscalls. Some syscalls do not restart 
> themself, that is why after restarting a process we restart syscall to 
> complete it.

I guess you do that to avoid breaking application that are badly written and do
not handle -EINTR correctly with interruptible syscalls. Right?

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 10:44                         ` Louis Rilling
@ 2008-10-22 12:44                           ` Greg Kurz
  0 siblings, 0 replies; 49+ messages in thread
From: Greg Kurz @ 2008-10-22 12:44 UTC (permalink / raw)
  To: Louis.Rilling
  Cc: Andrey Mirkin, containers, Cedric Le Goater, linux-kernel,
	Pavel Emelyanov

On Wed, 2008-10-22 at 12:44 +0200, Louis Rilling wrote:
> On Wed, Oct 22, 2008 at 12:06:19PM +0200, Greg Kurz wrote:
> > On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
> > > Do you checkpoint uninterruptible syscalls as well? If only interruptible
> > > syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
> > > ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
> > > this syscall does not restart itself when interrupted, and well, this is life,
> > > userspace just sees -EINTR, which is allowed by the syscall spec.
> > > Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
> > > Kerrighed and this works.
> > > 
> > > Louis
> > > 
> > 
> > I don't know Kerrighed internals but I understand you perform checkpoint
> > with a signal handler. Right ?
> 
> Right. This is an kernel-internal-only signal, so all signals remain available
> for userspace.
> 
> > This approach has a huge benefit: the
> > signal handling code do all the arch dependant stuff to save registers
> > in user memory.
> 
> Hm, I'm not sure to understand what you mean here. We just rely on arch code
> that jumps to signal handling to correctly setup struct pt_regs, which is then
> passed to the checkpoint code. So yes, userspace registers are mostly saved by
> existing arch code. But in x86-64 for instance, segment registers still need to
> be saved by the checkpoint code (a bit like copy_thread() does), and I don't
> know arch-independent functions doing this.
> 

You're right, some segment registers need to be saved on x86 also... I
should have written 'most of' in my previous mail.

-- 
Gregory Kurz                                     gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22  8:49                   ` [Devel] " Andrey Mirkin
  2008-10-22  9:25                     ` Louis Rilling
@ 2008-10-22 12:47                     ` Cedric Le Goater
  2008-10-23  9:54                       ` Andrey Mirkin
  1 sibling, 1 reply; 49+ messages in thread
From: Cedric Le Goater @ 2008-10-22 12:47 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: devel, Pavel Emelyanov, containers, linux-kernel

>>> +ENTRY(i386_ret_from_resume)
>>> +	CFI_STARTPROC
>>> +	pushl %eax
>>> +	CFI_ADJUST_CFA_OFFSET 4
>>> +	call schedule_tail
>>> +	GET_THREAD_INFO(%ebp)
>>> +	popl %eax
>>> +	CFI_ADJUST_CFA_OFFSET -4
>>> +	movl (%esp), %eax
>>> +	testl %eax, %eax
>>> +	jz    1f
>>> +	pushl %esp
>>> +	call  *%eax
>>> +	addl  $4, %esp
>>> +1:
>>> +	addl  $256, %esp
>>> +	jmp   ret_from_fork_tail
>>> +	CFI_ENDPROC
>>> +END(i386_ret_from_resume)
>> Could you explain why you need to do this
>>
>> 	call  *%eax
>>
>> is it related to the freezer code ?
> 
> It is not related to the freezer code actually.
> That is needed to restart syscalls. Right now I don't have a code in my 
> patchset which restarts a syscall, but later I plan to add it.
> In OpenVZ checkpointing we restart syscalls if process was caught in syscall 
> during checkpointing.

ok. I get it now. why 256 bytes of extra stack ? I'm sure it's not random. 

Thanks,

C.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 10:12                       ` Andrey Mirkin
  2008-10-22 10:46                         ` Louis Rilling
@ 2008-10-22 15:25                         ` Oren Laadan
  2008-10-23  9:00                           ` Andrey Mirkin
  1 sibling, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2008-10-22 15:25 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: Louis.Rilling, containers, Cedric Le Goater, linux-kernel,
	Pavel Emelyanov



Andrey Mirkin wrote:
> On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
>> On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
>>> On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
>>>> Hello Andrey !
>>>>
>>>>> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
>>>>> index 109792b..a4848a3 100644
>>>>> --- a/arch/x86/kernel/entry_32.S
>>>>> +++ b/arch/x86/kernel/entry_32.S
>>>>> @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
>>>>>  	GET_THREAD_INFO(%ebp)
>>>>>  	popl %eax
>>>>>  	CFI_ADJUST_CFA_OFFSET -4
>>>>> +ret_from_fork_tail:
>>>>>  	pushl $0x0202			# Reset kernel eflags
>>>>>  	CFI_ADJUST_CFA_OFFSET 4
>>>>>  	popfl
>>>>> @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
>>>>>  	CFI_ENDPROC
>>>>>  END(ret_from_fork)
>>>>>
>>>>> +ENTRY(i386_ret_from_resume)
>>>>> +	CFI_STARTPROC
>>>>> +	pushl %eax
>>>>> +	CFI_ADJUST_CFA_OFFSET 4
>>>>> +	call schedule_tail
>>>>> +	GET_THREAD_INFO(%ebp)
>>>>> +	popl %eax
>>>>> +	CFI_ADJUST_CFA_OFFSET -4
>>>>> +	movl (%esp), %eax
>>>>> +	testl %eax, %eax
>>>>> +	jz    1f
>>>>> +	pushl %esp
>>>>> +	call  *%eax
>>>>> +	addl  $4, %esp
>>>>> +1:
>>>>> +	addl  $256, %esp
>>>>> +	jmp   ret_from_fork_tail
>>>>> +	CFI_ENDPROC
>>>>> +END(i386_ret_from_resume)
>>>> Could you explain why you need to do this
>>>>
>>>> 	call  *%eax
>>>>
>>>> is it related to the freezer code ?
>>> It is not related to the freezer code actually.
>>> That is needed to restart syscalls. Right now I don't have a code in my
>>> patchset which restarts a syscall, but later I plan to add it.
>>> In OpenVZ checkpointing we restart syscalls if process was caught in
>>> syscall during checkpointing.
>> Do you checkpoint uninterruptible syscalls as well? If only interruptible
>> syscalls are checkpointed, I'd say that either this syscall uses
>> ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
>> does the trick, or this syscall does not restart itself when interrupted,
>> and well, this is life, userspace just sees -EINTR, which is allowed by the
>> syscall spec.
>> Actually this is how we checkpoint/migrate tasks in interruptible syscalls
>> in Kerrighed and this works.
> 
> We checkpoint only interruptible syscalls. Some syscalls do not restart 
> themself, that is why after restarting a process we restart syscall to 
> complete it.

Can you please elaborate on this ?  I don't recall having had issues
with that.

Thanks,

Oren.

> 
> Andrey
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm
  2008-10-20 17:21             ` Dave Hansen
@ 2008-10-23  8:43               ` Andrey Mirkin
  2008-10-23 13:51                 ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-23  8:43 UTC (permalink / raw)
  To: devel; +Cc: Dave Hansen, containers, linux-kernel

On Monday 20 October 2008 21:21 Dave Hansen wrote:
> On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> > +static void page_get_desc(struct vm_area_struct *vma, unsigned long
> > addr, +                         struct page_desc *pdesc, cpt_context_t *
> > ctx) +{
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       pgd_t *pgd;
> > +       pud_t *pud;
> > +       pmd_t *pmd;
> > +       pte_t *ptep, pte;
> > +       spinlock_t *ptl;
> > +       struct page *pg = NULL;
> > +       pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE +
> > vma->vm_pgoff; +
> > +       pdesc->index = linear_index;
> > +       pdesc->shared = 0;
> > +       pdesc->mm = CPT_NULL;
> > +
> > +       if (vma->vm_flags & VM_IO) {
> > +               pdesc->type = PD_ABSENT;
> > +               return;
> > +       }
> > +
> > +       pgd = pgd_offset(mm, addr);
> > +       if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > +               goto out_absent;
> > +       pud = pud_offset(pgd, addr);
> > +       if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > +               goto out_absent;
> > +       pmd = pmd_offset(pud, addr);
> > +       if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > +               goto out_absent;
> > +#ifdef CONFIG_X86
> > +       if (pmd_huge(*pmd)) {
> > +               eprintk("page_huge\n");
> > +               goto out_unsupported;
> > +       }
> > +#endif
>
> I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
> large pages.
>
> Since you have the VMA, why not use is_vm_hugetlb_page()?
Right now I'm checking VM_HUGETLB flag on VMAs in dump_one_vma().
This checks were added for sanity purpose just to throw out all unsupported 
right now cases.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 10:46                         ` Louis Rilling
@ 2008-10-23  8:53                           ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-23  8:53 UTC (permalink / raw)
  To: Louis.Rilling
  Cc: devel, Pavel Emelyanov, Cedric Le Goater, containers, linux-kernel

On Wednesday 22 October 2008 14:46 Louis Rilling wrote:
> On Wed, Oct 22, 2008 at 02:12:12PM +0400, Andrey Mirkin wrote:
> > On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> > > On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> > > > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > > > > Hello Andrey !
> > > > >
> > > > > > diff --git a/arch/x86/kernel/entry_32.S
> > > > > > b/arch/x86/kernel/entry_32.S index 109792b..a4848a3 100644
> > > > > > --- a/arch/x86/kernel/entry_32.S
> > > > > > +++ b/arch/x86/kernel/entry_32.S
> > > > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > > > >  	GET_THREAD_INFO(%ebp)
> > > > > >  	popl %eax
> > > > > >  	CFI_ADJUST_CFA_OFFSET -4
> > > > > > +ret_from_fork_tail:
> > > > > >  	pushl $0x0202			# Reset kernel eflags
> > > > > >  	CFI_ADJUST_CFA_OFFSET 4
> > > > > >  	popfl
> > > > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > > > >  	CFI_ENDPROC
> > > > > >  END(ret_from_fork)
> > > > > >
> > > > > > +ENTRY(i386_ret_from_resume)
> > > > > > +	CFI_STARTPROC
> > > > > > +	pushl %eax
> > > > > > +	CFI_ADJUST_CFA_OFFSET 4
> > > > > > +	call schedule_tail
> > > > > > +	GET_THREAD_INFO(%ebp)
> > > > > > +	popl %eax
> > > > > > +	CFI_ADJUST_CFA_OFFSET -4
> > > > > > +	movl (%esp), %eax
> > > > > > +	testl %eax, %eax
> > > > > > +	jz    1f
> > > > > > +	pushl %esp
> > > > > > +	call  *%eax
> > > > > > +	addl  $4, %esp
> > > > > > +1:
> > > > > > +	addl  $256, %esp
> > > > > > +	jmp   ret_from_fork_tail
> > > > > > +	CFI_ENDPROC
> > > > > > +END(i386_ret_from_resume)
> > > > >
> > > > > Could you explain why you need to do this
> > > > >
> > > > > 	call  *%eax
> > > > >
> > > > > is it related to the freezer code ?
> > > >
> > > > It is not related to the freezer code actually.
> > > > That is needed to restart syscalls. Right now I don't have a code in
> > > > my patchset which restarts a syscall, but later I plan to add it. In
> > > > OpenVZ checkpointing we restart syscalls if process was caught in
> > > > syscall during checkpointing.
> > >
> > > Do you checkpoint uninterruptible syscalls as well? If only
> > > interruptible syscalls are checkpointed, I'd say that either this
> > > syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> > > handling code already does the trick, or this syscall does not restart
> > > itself when interrupted, and well, this is life, userspace just sees
> > > -EINTR, which is allowed by the syscall spec.
> > > Actually this is how we checkpoint/migrate tasks in interruptible
> > > syscalls in Kerrighed and this works.
> >
> > We checkpoint only interruptible syscalls. Some syscalls do not restart
> > themself, that is why after restarting a process we restart syscall to
> > complete it.
>
> I guess you do that to avoid breaking application that are badly written
> and do not handle -EINTR correctly with interruptible syscalls. Right?

Right, also this is needed to restart some syscalls (like pause) from kernel 
without returning to user space. Let me explain it in more details. There is 
a gap when process will be in user space just before entering syscall again. 
At this time a signal can be delivered to process and it even can be handled. 
So, we will miss a signal which must interrupt pause syscall.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 15:25                         ` Oren Laadan
@ 2008-10-23  9:00                           ` Andrey Mirkin
  2008-10-23 13:57                             ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-23  9:00 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrey Mirkin, Louis.Rilling, containers, Cedric Le Goater,
	linux-kernel, Pavel Emelyanov

On Wednesday 22 October 2008 19:25 Oren Laadan wrote:
> Andrey Mirkin wrote:
> > On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> >> On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> >>> On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> >>>> Hello Andrey !
> >>>>
> >>>>> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> >>>>> index 109792b..a4848a3 100644
> >>>>> --- a/arch/x86/kernel/entry_32.S
> >>>>> +++ b/arch/x86/kernel/entry_32.S
> >>>>> @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> >>>>>  	GET_THREAD_INFO(%ebp)
> >>>>>  	popl %eax
> >>>>>  	CFI_ADJUST_CFA_OFFSET -4
> >>>>> +ret_from_fork_tail:
> >>>>>  	pushl $0x0202			# Reset kernel eflags
> >>>>>  	CFI_ADJUST_CFA_OFFSET 4
> >>>>>  	popfl
> >>>>> @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> >>>>>  	CFI_ENDPROC
> >>>>>  END(ret_from_fork)
> >>>>>
> >>>>> +ENTRY(i386_ret_from_resume)
> >>>>> +	CFI_STARTPROC
> >>>>> +	pushl %eax
> >>>>> +	CFI_ADJUST_CFA_OFFSET 4
> >>>>> +	call schedule_tail
> >>>>> +	GET_THREAD_INFO(%ebp)
> >>>>> +	popl %eax
> >>>>> +	CFI_ADJUST_CFA_OFFSET -4
> >>>>> +	movl (%esp), %eax
> >>>>> +	testl %eax, %eax
> >>>>> +	jz    1f
> >>>>> +	pushl %esp
> >>>>> +	call  *%eax
> >>>>> +	addl  $4, %esp
> >>>>> +1:
> >>>>> +	addl  $256, %esp
> >>>>> +	jmp   ret_from_fork_tail
> >>>>> +	CFI_ENDPROC
> >>>>> +END(i386_ret_from_resume)
> >>>>
> >>>> Could you explain why you need to do this
> >>>>
> >>>> 	call  *%eax
> >>>>
> >>>> is it related to the freezer code ?
> >>>
> >>> It is not related to the freezer code actually.
> >>> That is needed to restart syscalls. Right now I don't have a code in my
> >>> patchset which restarts a syscall, but later I plan to add it.
> >>> In OpenVZ checkpointing we restart syscalls if process was caught in
> >>> syscall during checkpointing.
> >>
> >> Do you checkpoint uninterruptible syscalls as well? If only
> >> interruptible syscalls are checkpointed, I'd say that either this
> >> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> >> handling code already does the trick, or this syscall does not restart
> >> itself when interrupted, and well, this is life, userspace just sees
> >> -EINTR, which is allowed by the syscall spec.
> >> Actually this is how we checkpoint/migrate tasks in interruptible
> >> syscalls in Kerrighed and this works.
> >
> > We checkpoint only interruptible syscalls. Some syscalls do not restart
> > themself, that is why after restarting a process we restart syscall to
> > complete it.
>
> Can you please elaborate on this ?  I don't recall having had issues
> with that.

Right now in 2.6.18 kernel we restarts in such a way pause, rt_sigtimedwait 
and futex syscalls. Recently futex syscall was reworked and we will not need 
such hooks for it.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-22 12:47                     ` Cedric Le Goater
@ 2008-10-23  9:54                       ` Andrey Mirkin
  2008-10-23 13:49                         ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-23  9:54 UTC (permalink / raw)
  To: Cedric Le Goater; +Cc: devel, Pavel Emelyanov, containers, linux-kernel

On Wednesday 22 October 2008 16:47 Cedric Le Goater wrote:
> >>> +ENTRY(i386_ret_from_resume)
> >>> +	CFI_STARTPROC
> >>> +	pushl %eax
> >>> +	CFI_ADJUST_CFA_OFFSET 4
> >>> +	call schedule_tail
> >>> +	GET_THREAD_INFO(%ebp)
> >>> +	popl %eax
> >>> +	CFI_ADJUST_CFA_OFFSET -4
> >>> +	movl (%esp), %eax
> >>> +	testl %eax, %eax
> >>> +	jz    1f
> >>> +	pushl %esp
> >>> +	call  *%eax
> >>> +	addl  $4, %esp
> >>> +1:
> >>> +	addl  $256, %esp
> >>> +	jmp   ret_from_fork_tail
> >>> +	CFI_ENDPROC
> >>> +END(i386_ret_from_resume)
> >>
> >> Could you explain why you need to do this
> >>
> >> 	call  *%eax
> >>
> >> is it related to the freezer code ?
> >
> > It is not related to the freezer code actually.
> > That is needed to restart syscalls. Right now I don't have a code in my
> > patchset which restarts a syscall, but later I plan to add it.
> > In OpenVZ checkpointing we restart syscalls if process was caught in
> > syscall during checkpointing.
>
> ok. I get it now. why 256 bytes of extra stack ? I'm sure it's not random.
We are putting special structure on stack, which is used at the very end of 
the whole restart procedure to restore complex states (ptrace is one of such 
cases). Right now I don't need to use this structure as we have a deal with 
simple cases, but reservation of 256 bytes on stack is needed for future.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-20 13:25                 ` Louis Rilling
@ 2008-10-23 10:56                   ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-23 10:56 UTC (permalink / raw)
  To: devel, Louis.Rilling; +Cc: containers, linux-kernel

On Monday 20 October 2008 17:25 Louis Rilling wrote:
> On Sat, Oct 18, 2008 at 03:11:36AM +0400, Andrey Mirkin wrote:
> > Functions to restart process, restore its state, fpu and registers are
> > added.
>
> [...]
>
> > diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
> > new file mode 100644
> > index 0000000..b9f745e
> > --- /dev/null
> > +++ b/checkpoint/rst_process.c
> > @@ -0,0 +1,277 @@
> > +/*
> > + *  Copyright (C) 2008 Parallels, Inc.
> > + *
> > + *  Author: Andrey Mirkin <major@openvz.org>
> > + *
> > + *  This program is free software; you can redistribute it and/or
> > + *  modify it under the terms of the GNU General Public License as
> > + *  published by the Free Software Foundation, version 2 of the
> > + *  License.
> > + *
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/fs.h>
> > +#include <linux/file.h>
> > +#include <linux/version.h>
> > +#include <linux/module.h>
> > +
> > +#include "checkpoint.h"
> > +#include "cpt_image.h"
> > +
> > +#define HOOK_RESERVE	256
> > +
> > +struct thr_context {
> > +	struct completion complete;
> > +	int error;
> > +	struct cpt_context *ctx;
> > +	struct task_struct *tsk;
> > +};
> > +
> > +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long
> > flags, pid_t pid) +{
> > +	pid_t ret;
> > +
> > +	if (current->fs == NULL) {
> > +		/* do_fork_pid() hates processes without fs, oopses. */
> > +		eprintk("local_kernel_thread: current->fs==NULL\n");
> > +		return -EINVAL;
> > +	}
> > +	if (!try_module_get(THIS_MODULE))
> > +		return -EBUSY;
> > +	ret = kernel_thread(fn, arg, flags);
> > +	if (ret < 0)
> > +		module_put(THIS_MODULE);
> > +	return ret;
> > +}
> > +
> > +static unsigned int decode_task_flags(unsigned int task_flags)
> > +{
> > +	unsigned int flags = 0;
> > +
> > +	if (task_flags & (1 << CPT_PF_EXITING))
> > +		flags |= PF_EXITING;
> > +	if (task_flags & (1 << CPT_PF_FORKNOEXEC))
> > +		flags |= PF_FORKNOEXEC;
> > +	if (task_flags & (1 << CPT_PF_SUPERPRIV))
> > +		flags |= PF_SUPERPRIV;
> > +	if (task_flags & (1 << CPT_PF_DUMPCORE))
> > +		flags |= PF_DUMPCORE;
> > +	if (task_flags & (1 << CPT_PF_SIGNALED))
> > +		flags |= PF_SIGNALED;
> > +
> > +	return flags;
> > +
> > +}
> > +
> > +int rst_restore_task_struct(struct task_struct *tsk, struct
> > cpt_task_image *ti, +			    struct cpt_context *ctx)
> > +{
> > +	int i;
> > +
> > +	/* Restore only saved flags, comm and tls for now */
> > +	tsk->flags = decode_task_flags(ti->cpt_flags);
> > +	clear_tsk_thread_flag(tsk, TIF_FREEZE);
> > +	memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
> > +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> > +		tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
> > +		tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int rst_restore_fpustate(struct task_struct *tsk, struct
> > cpt_task_image *ti, +				struct cpt_context *ctx)
> > +{
> > +	struct cpt_obj_bits hdr;
> > +	int err;
> > +	char *buf;
> > +
> > +	clear_stopped_child_used_math(tsk);
> > +
> > +	err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
> > +	if (err < 0)
> > +		return err;
> > +
> > +	buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
> > +	if (!buf)
> > +		return -ENOMEM;
> > +
> > +	err = ctx->read(buf, hdr.cpt_size, ctx);
> > +	if (err)
> > +		goto out;
> > +
> > +	if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
> > +		memcpy(&tsk->thread.xstate, buf,
> > +				sizeof(struct i387_fxsave_struct));
> > +		if (ti->cpt_flags & CPT_PF_USED_MATH)
> > +			set_stopped_child_used_math(tsk);
> > +	}
> > +#ifndef CONFIG_X86_64
> > +	else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
> > +			!cpu_has_fxsr) {
> > +		memcpy(&tsk->thread.xstate, buf,
> > +				sizeof(struct i387_fsave_struct));
> > +		if (ti->cpt_flags & CPT_PF_USED_MATH)
> > +			set_stopped_child_used_math(tsk);
> > +	}
> > +#endif
> > +
> > +out:
> > +	kfree(buf);
> > +	return err;
> > +}
> > +
> > +static u32 decode_segment(u32 segid)
> > +{
> > +	if (segid == CPT_SEG_ZERO)
> > +		return 0;
> > +
> > +	/* TLS descriptors */
> > +	if (segid <= CPT_SEG_TLS3)
> > +		return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
> > +
> > +	/* LDT descriptor, it is just an index to LDT array */
> > +	if (segid >= CPT_SEG_LDT)
> > +		return ((segid - CPT_SEG_LDT) << 3) | 7;
> > +
> > +	/* Check for one of standard descriptors */
> > +	if (segid == CPT_SEG_USER32_DS)
> > +		return __USER_DS;
> > +	if (segid == CPT_SEG_USER32_CS)
> > +		return __USER_CS;
> > +
> > +	eprintk("Invalid segment reg %d\n", segid);
> > +	return 0;
> > +}
> > +
> > +static int rst_restore_registers(struct task_struct *tsk, struct
> > cpt_context *ctx) +{
> > +	struct cpt_x86_regs ri;
> > +	struct pt_regs *regs = task_pt_regs(tsk);
> > +	extern char i386_ret_from_resume;
> > +	int err;
> > +
> > +	err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
> > +	if (err < 0)
> > +		return err;
> > +
> > +	tsk->thread.sp = (unsigned long) regs;
> > +	tsk->thread.sp0 = (unsigned long) (regs+1);
> > +	tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
> > +
> > +	tsk->thread.gs = decode_segment(ri.cpt_gs);
> > +	tsk->thread.debugreg0 = ri.cpt_debugreg[0];
> > +	tsk->thread.debugreg1 = ri.cpt_debugreg[1];
> > +	tsk->thread.debugreg2 = ri.cpt_debugreg[2];
> > +	tsk->thread.debugreg3 = ri.cpt_debugreg[3];
> > +	tsk->thread.debugreg6 = ri.cpt_debugreg[6];
> > +	tsk->thread.debugreg7 = ri.cpt_debugreg[7];
> > +
> > +	regs->bx = ri.cpt_bx;
> > +	regs->cx = ri.cpt_cx;
> > +	regs->dx = ri.cpt_dx;
> > +	regs->si = ri.cpt_si;
> > +	regs->di = ri.cpt_di;
> > +	regs->bp = ri.cpt_bp;
> > +	regs->ax = ri.cpt_ax;
> > +	regs->orig_ax = ri.cpt_orig_ax;
> > +	regs->ip = ri.cpt_ip;
> > +	regs->flags = ri.cpt_flags;
> > +	regs->sp = ri.cpt_sp;
> > +
> > +	regs->cs = decode_segment(ri.cpt_cs);
> > +	regs->ss = decode_segment(ri.cpt_ss);
> > +	regs->ds = decode_segment(ri.cpt_ds);
> > +	regs->es = decode_segment(ri.cpt_es);
> > +	regs->fs = decode_segment(ri.cpt_fs);
> > +
> > +	tsk->thread.sp -= HOOK_RESERVE;
> > +	memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
> > +
> > +	return 0;
> > +}
> > +
> > +static int restart_thread(void *arg)
> > +{
> > +	struct thr_context *thr_ctx = arg;
> > +	struct cpt_context *ctx;
> > +	struct cpt_task_image *ti;
> > +	int err;
> > +
> > +	current->state = TASK_UNINTERRUPTIBLE;
> > +
> > +	ctx = thr_ctx->ctx;
> > +	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
> > +	if (!ti)
> > +		return -ENOMEM;
> > +
> > +	err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
> > +	if (!err)
> > +		err = rst_restore_task_struct(current, ti, ctx);
> > +	/* Restore mm here */
> > +	if (!err)
> > +		err = rst_restore_fpustate(current, ti, ctx);
> > +	if (!err)
> > +		err = rst_restore_registers(current, ctx);
> > +
> > +	thr_ctx->error = err;
> > +	complete(&thr_ctx->complete);
> > +
> > +	if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
> > +		do_exit(ti->cpt_exit_code);
> > +	} else {
> > +		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +	}
> > +
> > +	kfree(ti);
> > +	schedule();
> > +
> > +	eprintk("leaked %d/%d %p\n", task_pid_nr(current),
> > task_pid_vnr(current), current->mm); +
> > +	module_put(THIS_MODULE);
>
> I'm sorry, I still do not understand what you are doing with this
> self-module pinning stuff. AFAICS, we should not get here unless there is a
> bug. So the checkpoint module ref count is never decreased, right?
>
> Could you detail what is this self-module pinning for? As I already told
> you, this looks like a bogus solution to avoid unloading the checkpoint
> module during restart.

Actually right now module ref count increase/decrease is not needed.
But in some cases restore work should be done only after unfreezing the 
process. So, in this case we should grab ref count during process creation 
and put it after this special work is done.
I will rework this place and send it in next version to make it more clear how 
it will be used in future.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-23  9:54                       ` Andrey Mirkin
@ 2008-10-23 13:49                         ` Dave Hansen
  2008-10-24  4:04                           ` Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-10-23 13:49 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: Cedric Le Goater, containers, linux-kernel, Pavel Emelyanov

On Thu, 2008-10-23 at 13:54 +0400, Andrey Mirkin wrote:
> We are putting special structure on stack, which is used at the very end of 
> the whole restart procedure to restore complex states (ptrace is one of such 
> cases). Right now I don't need to use this structure as we have a deal with 
> simple cases, but reservation of 256 bytes on stack is needed for future.

Wow.  So you're saying that, if this patch is accepted, we simply need
to accept that anything being checkpointed will use an extra 256 bytes
of stack?  Seems like something to perhaps put in the changelog rather
than some completely undocumented assembly nugget.

-- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm
  2008-10-23  8:43               ` [Devel] " Andrey Mirkin
@ 2008-10-23 13:51                 ` Dave Hansen
  2008-10-24  4:07                   ` Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-10-23 13:51 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: devel, containers, linux-kernel

On Thu, 2008-10-23 at 12:43 +0400, Andrey Mirkin wrote:
> > > +#ifdef CONFIG_X86
> > > +       if (pmd_huge(*pmd)) {
> > > +               eprintk("page_huge\n");
> > > +               goto out_unsupported;
> > > +       }
> > > +#endif
> >
> > I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
> > large pages.
> >
> > Since you have the VMA, why not use is_vm_hugetlb_page()?
> Right now I'm checking VM_HUGETLB flag on VMAs in dump_one_vma().
> This checks were added for sanity purpose just to throw out all unsupported 
> right now cases.

I'm telling you that it's no good.  Not only should this path never be
hit.  But, if it is, you'll crash anyway in some cases.

It's a bad check.  At best it misleads the reader to think that you've
covered your bases.

-- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-23  9:00                           ` Andrey Mirkin
@ 2008-10-23 13:57                             ` Dave Hansen
  2008-10-24  3:57                               ` Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-10-23 13:57 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: Oren Laadan, containers, linux-kernel, Louis.Rilling,
	Cedric Le Goater, Andrey Mirkin, Pavel Emelyanov

On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
> 
> > >>> It is not related to the freezer code actually.
> > >>> That is needed to restart syscalls. Right now I don't have a code in my
> > >>> patchset which restarts a syscall, but later I plan to add it.
> > >>> In OpenVZ checkpointing we restart syscalls if process was caught in
> > >>> syscall during checkpointing.
> > >>
> > >> Do you checkpoint uninterruptible syscalls as well? If only
> > >> interruptible syscalls are checkpointed, I'd say that either this
> > >> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> > >> handling code already does the trick, or this syscall does not restart
> > >> itself when interrupted, and well, this is life, userspace just sees
> > >> -EINTR, which is allowed by the syscall spec.
> > >> Actually this is how we checkpoint/migrate tasks in interruptible
> > >> syscalls in Kerrighed and this works.
> > >
> > > We checkpoint only interruptible syscalls. Some syscalls do not restart
> > > themself, that is why after restarting a process we restart syscall to
> > > complete it.
> >
> > Can you please elaborate on this ?  I don't recall having had issues
> > with that.
> 
> Right now in 2.6.18 kernel we restarts in such a way pause, rt_sigtimedwait 
> and futex syscalls. Recently futex syscall was reworked and we will not need 
> such hooks for it.

Could you elaborate on this a bit?

If the futex syscall was reworked, perhaps we can do the same for
rt_sigtimedwait() and get rid of this code completely.

-- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-23 13:57                             ` Dave Hansen
@ 2008-10-24  3:57                               ` Andrey Mirkin
  2008-10-25 21:10                                 ` Oren Laadan
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-24  3:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Oren Laadan, containers, linux-kernel, Louis.Rilling,
	Cedric Le Goater, Pavel Emelyanov

On Thursday 23 October 2008 17:57 Dave Hansen wrote:
> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
> > > >>> It is not related to the freezer code actually.
> > > >>> That is needed to restart syscalls. Right now I don't have a code
> > > >>> in my patchset which restarts a syscall, but later I plan to add
> > > >>> it. In OpenVZ checkpointing we restart syscalls if process was
> > > >>> caught in syscall during checkpointing.
> > > >>
> > > >> Do you checkpoint uninterruptible syscalls as well? If only
> > > >> interruptible syscalls are checkpointed, I'd say that either this
> > > >> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> > > >> handling code already does the trick, or this syscall does not
> > > >> restart itself when interrupted, and well, this is life, userspace
> > > >> just sees -EINTR, which is allowed by the syscall spec.
> > > >> Actually this is how we checkpoint/migrate tasks in interruptible
> > > >> syscalls in Kerrighed and this works.
> > > >
> > > > We checkpoint only interruptible syscalls. Some syscalls do not
> > > > restart themself, that is why after restarting a process we restart
> > > > syscall to complete it.
> > >
> > > Can you please elaborate on this ?  I don't recall having had issues
> > > with that.
> >
> > Right now in 2.6.18 kernel we restarts in such a way pause,
> > rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
> > and we will not need such hooks for it.
>
> Could you elaborate on this a bit?
>
> If the futex syscall was reworked, perhaps we can do the same for
> rt_sigtimedwait() and get rid of this code completely.

Well, we can try to rework rt_sigtimedwait(), but we will still need this code 
in the future to restart pause syscall from kernel without returning to user 
space. Also this code will be needed to restore some complex states.
As concerns pause syscall I have already written to Louis about the problem we 
are trying to solve with this code. There is a gap when process will be in 
user space just before entering syscall again. At this time a signal can be 
delivered to process and it even can be handled. So, we will miss a signal 
which must interrupt pause syscall.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-23 13:49                         ` Dave Hansen
@ 2008-10-24  4:04                           ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-24  4:04 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Cedric Le Goater, containers, linux-kernel, Pavel Emelyanov

On Thursday 23 October 2008 17:49 Dave Hansen wrote:
> On Thu, 2008-10-23 at 13:54 +0400, Andrey Mirkin wrote:
> > We are putting special structure on stack, which is used at the very end
> > of the whole restart procedure to restore complex states (ptrace is one
> > of such cases). Right now I don't need to use this structure as we have a
> > deal with simple cases, but reservation of 256 bytes on stack is needed
> > for future.
>
> Wow.  So you're saying that, if this patch is accepted, we simply need
> to accept that anything being checkpointed will use an extra 256 bytes
> of stack?  Seems like something to perhaps put in the changelog rather
> than some completely undocumented assembly nugget.

This 256 bytes will be used only during restart procedure and only by our 
module. As you can see in i386_ret_from_resume we are restoring it back. So, 
when process will return to user space it will not have extra 256 bytes 
reserved on stack already. I will add information about it to documentation 
and changelog.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm
  2008-10-23 13:51                 ` Dave Hansen
@ 2008-10-24  4:07                   ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-24  4:07 UTC (permalink / raw)
  To: Dave Hansen; +Cc: devel, containers, linux-kernel

On Thursday 23 October 2008 17:51 Dave Hansen wrote:
> On Thu, 2008-10-23 at 12:43 +0400, Andrey Mirkin wrote:
> > > > +#ifdef CONFIG_X86
> > > > +       if (pmd_huge(*pmd)) {
> > > > +               eprintk("page_huge\n");
> > > > +               goto out_unsupported;
> > > > +       }
> > > > +#endif
> > >
> > > I take it you know that this breaks with the 1GB (x86_64) and 16GB
> > > (ppc) large pages.
> > >
> > > Since you have the VMA, why not use is_vm_hugetlb_page()?
> >
> > Right now I'm checking VM_HUGETLB flag on VMAs in dump_one_vma().
> > This checks were added for sanity purpose just to throw out all
> > unsupported right now cases.
>
> I'm telling you that it's no good.  Not only should this path never be
> hit.  But, if it is, you'll crash anyway in some cases.
>
> It's a bad check.  At best it misleads the reader to think that you've
> covered your bases.

Agree, I will rework this.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 05/10] Introduce function to dump process
  2008-10-20 11:02           ` [PATCH 05/10] Introduce function to dump process Louis Rilling
@ 2008-10-24  4:15             ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-24  4:15 UTC (permalink / raw)
  To: devel, Louis.Rilling; +Cc: containers, linux-kernel

On Monday 20 October 2008 15:02 Louis Rilling wrote:
> Hi,
>
> On Sat, Oct 18, 2008 at 03:11:33AM +0400, Andrey Mirkin wrote:
> > Functions to dump task struct, fpu state and registers are added.
> > All IDs are saved from the POV of process (container) namespace.
>
> Just a couple of little comments, in case this series should keep on
> living.
>
> [...]
>
> > diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> > new file mode 100644
> > index 0000000..58f608d
> > --- /dev/null
> > +++ b/checkpoint/cpt_process.c
> > @@ -0,0 +1,236 @@
> > +/*
> > + *  Copyright (C) 2008 Parallels, Inc.
> > + *
> > + *  Author: Andrey Mirkin <major@openvz.org>
> > + *
> > + *  This program is free software; you can redistribute it and/or
> > + *  modify it under the terms of the GNU General Public License as
> > + *  published by the Free Software Foundation, version 2 of the
> > + *  License.
> > + *
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/fs.h>
> > +#include <linux/file.h>
> > +#include <linux/version.h>
> > +#include <linux/nsproxy.h>
> > +
> > +#include "checkpoint.h"
> > +#include "cpt_image.h"
> > +
> > +static unsigned int encode_task_flags(unsigned int task_flags)
> > +{
> > +	unsigned int flags = 0;
> > +
> > +	if (task_flags & PF_EXITING)
> > +		flags |= (1 << CPT_PF_EXITING);
> > +	if (task_flags & PF_FORKNOEXEC)
> > +		flags |= (1 << CPT_PF_FORKNOEXEC);
> > +	if (task_flags & PF_SUPERPRIV)
> > +		flags |= (1 << CPT_PF_SUPERPRIV);
> > +	if (task_flags & PF_DUMPCORE)
> > +		flags |= (1 << CPT_PF_DUMPCORE);
> > +	if (task_flags & PF_SIGNALED)
> > +		flags |= (1 << CPT_PF_SIGNALED);
> > +	if (task_flags & PF_USED_MATH)
> > +		flags |= (1 << CPT_PF_USED_MATH);
> > +
> > +	return flags;
> > +
> > +}
> > +
> > +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context
> > *ctx) +{
> > +	struct cpt_task_image *t;
> > +	int i;
> > +	int err;
> > +
> > +	t = kzalloc(sizeof(*t), GFP_KERNEL);
> > +	if (!t)
> > +		return -ENOMEM;
> > +
> > +	t->cpt_len = sizeof(*t);
> > +	t->cpt_type = CPT_OBJ_TASK;
> > +	t->cpt_hdrlen = sizeof(*t);
> > +	t->cpt_content = CPT_CONTENT_ARRAY;
> > +
> > +	t->cpt_state = tsk->state;
> > +	t->cpt_flags = encode_task_flags(tsk->flags);
> > +	t->cpt_exit_code = tsk->exit_code;
> > +	t->cpt_exit_signal = tsk->exit_signal;
> > +	t->cpt_pdeath_signal = tsk->pdeath_signal;
> > +	t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > +	t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > +	t->cpt_ppid = tsk->parent ?
> > +		task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
> > +	t->cpt_rppid = tsk->real_parent ?
> > +		task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
> > +	t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > +	t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > +	t->cpt_old_pgrp = 0;
> > +	if (tsk->signal->tty_old_pgrp)
> > +		t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
> > +	t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) :
> > 0;
>
> Why pid_vnr() here, and task_*_nr_ns() above? According to the introducing
> comment, I'd expect something like pid_nr_ns(tsk->signal->tty_old_pgrp,
> tsk->nsproxy->pid_ns), and the same for tsk->group_leader.
>
> IIUC, pid_vnr() is correct only if ctx->nsproxy->pid_ns ==
> tsk->nsproxy->pid_ns == current->nsproxy->pid_ns, and I expect current to
> live in a different pid_ns.
>
> Comments?

Oh, you right here. I have already fixed it in my tree, but accidentally wrong 
patch were sent.

>
> > +	t->cpt_utime = tsk->utime;
> > +	t->cpt_stime = tsk->stime;
> > +	t->cpt_utimescaled = tsk->utimescaled;
> > +	t->cpt_stimescaled = tsk->stimescaled;
> > +	t->cpt_gtime = tsk->gtime;
> > +	t->cpt_prev_utime = tsk->prev_utime;
> > +	t->cpt_prev_stime = tsk->prev_stime;
> > +	t->cpt_nvcsw = tsk->nvcsw;
> > +	t->cpt_nivcsw = tsk->nivcsw;
> > +	t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
> > +	t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
> > +	t->cpt_min_flt = tsk->min_flt;
> > +	t->cpt_maj_flt = tsk->maj_flt;
> > +	memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
> > +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> > +		t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
> > +			tsk->thread.tls_array[i].a;
> > +	}
> > +	/* TODO: encode thread flags and status like task flags */
> > +	t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE);
> > +	t->cpt_thrstatus = task_thread_info(tsk)->status;
> > +	t->cpt_user = tsk->user->uid;
> > +	t->cpt_uid = tsk->uid;
> > +	t->cpt_euid = tsk->euid;
> > +	t->cpt_suid = tsk->suid;
> > +	t->cpt_fsuid = tsk->fsuid;
> > +	t->cpt_gid = tsk->gid;
> > +	t->cpt_egid = tsk->egid;
> > +	t->cpt_sgid = tsk->sgid;
> > +	t->cpt_fsgid = tsk->fsgid;
> > +
> > +	err = ctx->write(t, sizeof(*t), ctx);
> > +
> > +	kfree(t);
> > +	return err;
> > +}
> > +
> > +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context
> > *ctx) +{
> > +	struct cpt_obj_bits hdr;
> > +	int err;
> > +	int content;
> > +	unsigned long size;
> > +
> > +	content = CPT_CONTENT_X86_FPUSTATE;
> > +	size = sizeof(struct i387_fxsave_struct);
> > +#ifndef CONFIG_X86_64
> > +	if (!cpu_has_fxsr) {
> > +		size = sizeof(struct i387_fsave_struct);
> > +		content = CPT_CONTENT_X86_FPUSTATE_OLD;
> > +	}
> > +#endif
> > +
> > +	hdr.cpt_len = sizeof(hdr) + size;
> > +	hdr.cpt_type = CPT_OBJ_BITS;
> > +	hdr.cpt_hdrlen = sizeof(hdr);
> > +	hdr.cpt_content = content;
> > +	hdr.cpt_size = size;
> > +	err = ctx->write(&hdr, sizeof(hdr), ctx);
> > +	if (!err)
> > +		ctx->write(tsk->thread.xstate, size, ctx);
>
> Should check the error code of the line above, right?
Right, thanks!

[...]

> > +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > +	int err;
> > +
> > +	err = cpt_dump_task_struct(tsk, ctx);
> > +
> > +	/* Dump task mm */
> > +
> > +	if (!err)
> > +		cpt_dump_fpustate(tsk, ctx);
>
> error checking...
>
> > +	if (!err)
> > +		cpt_dump_registers(tsk, ctx);
>
> error checking...
Thanks again, will fix it.


Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 05/10] Introduce function to dump process
  2008-10-20 17:48           ` Serge E. Hallyn
@ 2008-10-24  4:40             ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-24  4:40 UTC (permalink / raw)
  To: devel; +Cc: Serge E. Hallyn, containers, linux-kernel

On Monday 20 October 2008 21:48 Serge E. Hallyn wrote:
> Quoting Andrey Mirkin (major@openvz.org):
> > +	t->cpt_uid = tsk->uid;
> > +	t->cpt_euid = tsk->euid;
> > +	t->cpt_suid = tsk->suid;
> > +	t->cpt_fsuid = tsk->fsuid;
> > +	t->cpt_gid = tsk->gid;
> > +	t->cpt_egid = tsk->egid;
> > +	t->cpt_sgid = tsk->sgid;
> > +	t->cpt_fsgid = tsk->fsgid;
>
> I don't see where any of these are restored.  (Obviously, I wanted
> to think about how you're verifying the restarter's authorization
> to do so)

Well, right now I don't use them during restore to simplify restart procedure 
and make it more clear for reviewers. In OpenVZ we are doing all restart 
procedure with root's privileges and relying on fact that all such IDs will 
be the same during restart (as we are restarting a container and its file 
system will be the same during restart).

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-24  3:57                               ` Andrey Mirkin
@ 2008-10-25 21:10                                 ` Oren Laadan
  2008-10-29 14:52                                   ` Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2008-10-25 21:10 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: Dave Hansen, containers, linux-kernel, Louis.Rilling,
	Cedric Le Goater, Pavel Emelyanov



Andrey Mirkin wrote:
> On Thursday 23 October 2008 17:57 Dave Hansen wrote:
>> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
>>>>>>> It is not related to the freezer code actually.
>>>>>>> That is needed to restart syscalls. Right now I don't have a code
>>>>>>> in my patchset which restarts a syscall, but later I plan to add
>>>>>>> it. In OpenVZ checkpointing we restart syscalls if process was
>>>>>>> caught in syscall during checkpointing.
>>>>>> Do you checkpoint uninterruptible syscalls as well? If only
>>>>>> interruptible syscalls are checkpointed, I'd say that either this
>>>>>> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
>>>>>> handling code already does the trick, or this syscall does not
>>>>>> restart itself when interrupted, and well, this is life, userspace
>>>>>> just sees -EINTR, which is allowed by the syscall spec.
>>>>>> Actually this is how we checkpoint/migrate tasks in interruptible
>>>>>> syscalls in Kerrighed and this works.
>>>>> We checkpoint only interruptible syscalls. Some syscalls do not
>>>>> restart themself, that is why after restarting a process we restart
>>>>> syscall to complete it.
>>>> Can you please elaborate on this ?  I don't recall having had issues
>>>> with that.
>>> Right now in 2.6.18 kernel we restarts in such a way pause,
>>> rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
>>> and we will not need such hooks for it.
>> Could you elaborate on this a bit?
>>
>> If the futex syscall was reworked, perhaps we can do the same for
>> rt_sigtimedwait() and get rid of this code completely.
> 
> Well, we can try to rework rt_sigtimedwait(), but we will still need this code 
> in the future to restart pause syscall from kernel without returning to user 
> space. Also this code will be needed to restore some complex states.
> As concerns pause syscall I have already written to Louis about the problem we 
> are trying to solve with this code. There is a gap when process will be in 
> user space just before entering syscall again. At this time a signal can be 
> delivered to process and it even can be handled. So, we will miss a signal 
> which must interrupt pause syscall.

I'm not convinced that you a real race exists, and even if it does, I'm not
convinced that hacking the assembly entry/exit code is the best way to do it.

Let me explain:

You are concerned about a race in which a signal is delivered to a task
that resumes from restart to user space and is about to (re)invoke 'pause()'
(because the restart so arranged its EIP and registers).

This almost always means that the user code is buggy and relies on specific
scheduling, because you can usually find a scheduling (without the C/R) where
the intended recipient of the signal was delayed and only calls pause() after
the signal is delivered.

For instance, if the sequence of events is:
	A calls pause() -> checkpoint -> restart ->
		B signals A -> A calls pause() (after restart),
then the following sequence is possible(*) without C/R:
	B signals A -> A calls pause()
because normally B cannot assume anything about when A is actually,
really, is suspended (which means the programmer did an imperfect job).

I said "almost always" and "usually", because there is one case where the
alternative schedule: task B could, prior to sending the signal, "ensure"
that task A is already sleeping within the 'pause()' syscall. While this
is possible, it is definitely unusual, and in fact I never code that does
that. And what if the sysadmin send SIGSTOP followed by SIGCONT ?  In
short, such code is simply broken.

More importantly, if you think about the operation and semantics of the
freezer cgroup - similar behavior is to be expected when you freeze and
then thaw a container.

Specifically, when you freeze the container that has a task in sys_pause(),
then that task will abort the syscall become frozen. As soon as it becomes
unfrozen, it will return to user space (with the EIP "rewinded") only to
re-invoke the syscall. So the same "race" remains even if you only freeze
and then thaw, regardless of C/R.

Moreover, I argue that basically when you return from a sys_restart(), the
entire container should, by default, remain in frozen state - just like it
is with sys_checkpoint(). An explicit thaw will make the container resume
execution.

Therefore, there are two options: the first is to decide that this behavior
- going back to user space to re-invoke the syscall - is valid. In this case
you don't need a special hack for returning from sys_restart(). The second
option is to decide that it is broken, in which case you need to also fix
the freezer code. Personally, I think that this behavior is valid and need
not be fixed.

Finally, even if you do want to fix the behavior for this pathologic case,
I don't see why you'd want to do it in this manner. Instead, you can add a
simple test prior to returning from sys_restart(), something like this:

	...
	/* almost done: now handle special cases: */
	if (our last syscall == __NR_pause) {
		ret = sys_pause();
	} else if (our last syscall == __NR_futex) {
		do some stuff;
		ret = sys_futex();
	} else {
		ret = what-we-want-to-return
	}
	/* finally, return to user space */
	return ret;
}

I'm not quite know what other "complex states" you refer to; but I wonder
whether that code "needed to restore some complex states" could not be
implemented along the same idea.

The upside is clear: the code is less obscure, simple to debug, and not
architecture-dependent. (hehe .. it even runs faster because it saves a
whole kernel->user->kernel switch, what do you know !).

Oren.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/10] Add support for multiple processes
  2008-10-17 23:11                   ` [PATCH 10/10] Add support for multiple processes Andrey Mirkin
@ 2008-10-27 15:58                     ` Oren Laadan
  2008-10-30  4:55                       ` [Devel] " Andrey Mirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2008-10-27 15:58 UTC (permalink / raw)
  To: Andrey Mirkin; +Cc: containers, linux-kernel, Pavel Emelyanov



Andrey Mirkin wrote:
> The whole tree of processes can be checkpointed and restarted now.
> Shared objects are not supported yet.
> 
> Signed-off-by: Andrey Mirkin <major@openvz.org>
> ---
>  checkpoint/cpt_image.h   |    2 +
>  checkpoint/cpt_process.c |   24 +++++++++++++
>  checkpoint/rst_process.c |   85 +++++++++++++++++++++++++++-------------------
>  3 files changed, 76 insertions(+), 35 deletions(-)
> 
> diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
> index e1fb483..f370df2 100644
> --- a/checkpoint/cpt_image.h
> +++ b/checkpoint/cpt_image.h
> @@ -128,6 +128,8 @@ struct cpt_task_image {
>  	__u64	cpt_nivcsw;
>  	__u64	cpt_min_flt;
>  	__u64	cpt_maj_flt;
> +	__u32	cpt_children_num;
> +	__u32	cpt_pad;
>  } __attribute__ ((aligned (8)));
>  
>  struct cpt_mm_image {
> diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> index 1f7a54b..d73ec3c 100644
> --- a/checkpoint/cpt_process.c
> +++ b/checkpoint/cpt_process.c
> @@ -40,6 +40,19 @@ static unsigned int encode_task_flags(unsigned int task_flags)
>  		
>  }
>  
> +static int cpt_count_children(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> +	int num = 0;
> +	struct task_struct *child;
> +
> +	list_for_each_entry(child, &tsk->children, sibling) {
> +		if (child->parent != tsk)
> +			continue;
> +		num++;
> +	}
> +	return num;
> +}
> +

I noticed that don't take the appropriate locks when browsing through
tasks lists (siblings, children, global list). Although I realize that
the container should be frozen at this time, I keep wondering if this
is indeed always safe.

For instance, are you protected against an OOM killer that might just
occur uninvited and kill one of those tasks ?

Can the administrator force an un-freeze of the container ?  Or perhaps
some error condition if the kernel cause that ?

>  int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
>  {
>  	struct cpt_task_image *t;
> @@ -102,6 +115,7 @@ int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
>  	t->cpt_egid = tsk->egid;
>  	t->cpt_sgid = tsk->sgid;
>  	t->cpt_fsgid = tsk->fsgid;
> +	t->cpt_children_num = cpt_count_children(tsk, ctx);
>  
>  	err = ctx->write(t, sizeof(*t), ctx);
>  
> @@ -231,6 +245,16 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
>  		err = cpt_dump_fpustate(tsk, ctx);
>  	if (!err)
>  		err = cpt_dump_registers(tsk, ctx);
> +	if (!err) {
> +		struct task_struct *child;
> +		list_for_each_entry(child, &tsk->children, sibling) {
> +			if (child->parent != tsk)
> +				continue;
> +			err = cpt_dump_task(child, ctx);
> +			if (err)
> +				break;

Here too.

[...]

Oren.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-25 21:10                                 ` Oren Laadan
@ 2008-10-29 14:52                                   ` Andrey Mirkin
  2008-10-30 15:59                                     ` Oren Laadan
  0 siblings, 1 reply; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-29 14:52 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, containers, linux-kernel, Louis.Rilling,
	Cedric Le Goater, Pavel Emelyanov

On Sunday 26 October 2008 01:10 Oren Laadan wrote:
> Andrey Mirkin wrote:
> > On Thursday 23 October 2008 17:57 Dave Hansen wrote:
> >> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
> >>>>>>> It is not related to the freezer code actually.
> >>>>>>> That is needed to restart syscalls. Right now I don't have a code
> >>>>>>> in my patchset which restarts a syscall, but later I plan to add
> >>>>>>> it. In OpenVZ checkpointing we restart syscalls if process was
> >>>>>>> caught in syscall during checkpointing.
> >>>>>>
> >>>>>> Do you checkpoint uninterruptible syscalls as well? If only
> >>>>>> interruptible syscalls are checkpointed, I'd say that either this
> >>>>>> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> >>>>>> handling code already does the trick, or this syscall does not
> >>>>>> restart itself when interrupted, and well, this is life, userspace
> >>>>>> just sees -EINTR, which is allowed by the syscall spec.
> >>>>>> Actually this is how we checkpoint/migrate tasks in interruptible
> >>>>>> syscalls in Kerrighed and this works.
> >>>>>
> >>>>> We checkpoint only interruptible syscalls. Some syscalls do not
> >>>>> restart themself, that is why after restarting a process we restart
> >>>>> syscall to complete it.
> >>>>
> >>>> Can you please elaborate on this ?  I don't recall having had issues
> >>>> with that.
> >>>
> >>> Right now in 2.6.18 kernel we restarts in such a way pause,
> >>> rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
> >>> and we will not need such hooks for it.
> >>
> >> Could you elaborate on this a bit?
> >>
> >> If the futex syscall was reworked, perhaps we can do the same for
> >> rt_sigtimedwait() and get rid of this code completely.
> >
> > Well, we can try to rework rt_sigtimedwait(), but we will still need this
> > code in the future to restart pause syscall from kernel without returning
> > to user space. Also this code will be needed to restore some complex
> > states. As concerns pause syscall I have already written to Louis about
> > the problem we are trying to solve with this code. There is a gap when
> > process will be in user space just before entering syscall again. At this
> > time a signal can be delivered to process and it even can be handled. So,
> > we will miss a signal which must interrupt pause syscall.
>
> I'm not convinced that you a real race exists, and even if it does, I'm not
> convinced that hacking the assembly entry/exit code is the best way to do
> it.

Well, as I already told pause() syscall is is not only one case why we need to 
do some additional job in that place.

> Let me explain:
>
> You are concerned about a race in which a signal is delivered to a task
> that resumes from restart to user space and is about to (re)invoke
> 'pause()' (because the restart so arranged its EIP and registers).
>
> This almost always means that the user code is buggy and relies on specific
> scheduling, because you can usually find a scheduling (without the C/R)
> where the intended recipient of the signal was delayed and only calls
> pause() after the signal is delivered.
>
> For instance, if the sequence of events is:
> 	A calls pause() -> checkpoint -> restart ->
> 		B signals A -> A calls pause() (after restart),
> then the following sequence is possible(*) without C/R:
> 	B signals A -> A calls pause()
> because normally B cannot assume anything about when A is actually,
> really, is suspended (which means the programmer did an imperfect job).

You right here. Both sequences are possible in theory. You will be surprised 
but in practice we found out that probability to miss a signal in case of C/R 
is much higher then during ordinary execution.

> I said "almost always" and "usually", because there is one case where the
> alternative schedule: task B could, prior to sending the signal, "ensure"
> that task A is already sleeping within the 'pause()' syscall. While this
> is possible, it is definitely unusual, and in fact I never code that does
> that. And what if the sysadmin send SIGSTOP followed by SIGCONT ?  In
> short, such code is simply broken.
>
> More importantly, if you think about the operation and semantics of the
> freezer cgroup - similar behavior is to be expected when you freeze and
> then thaw a container.
>
> Specifically, when you freeze the container that has a task in sys_pause(),
> then that task will abort the syscall become frozen. As soon as it becomes
> unfrozen, it will return to user space (with the EIP "rewinded") only to
> re-invoke the syscall. So the same "race" remains even if you only freeze
> and then thaw, regardless of C/R.

Exactly. But during freeze/unfreeze probability to catch such situation is 
very low. In our tests we were tried to checkpoint/restart LTP tests. And 
this "race" were triggered during restart almost in 100% of tests.

> Moreover, I argue that basically when you return from a sys_restart(), the
> entire container should, by default, remain in frozen state - just like it
> is with sys_checkpoint(). An explicit thaw will make the container resume
> execution.

No doubt. That is how we do in OpenVZ, after restart the container remains 
frozen. And we need to thaw it to resume its execution.

> Therefore, there are two options: the first is to decide that this behavior
> - going back to user space to re-invoke the syscall - is valid. In this
> case you don't need a special hack for returning from sys_restart(). The
> second option is to decide that it is broken, in which case you need to
> also fix the freezer code. Personally, I think that this behavior is valid
> and need not be fixed.

I still believe that we need to fix such behaviour during restart as in 
practice it is very easy to trigger it.

> Finally, even if you do want to fix the behavior for this pathologic case,
> I don't see why you'd want to do it in this manner. Instead, you can add a
> simple test prior to returning from sys_restart(), something like this:
>
> 	...
> 	/* almost done: now handle special cases: */
> 	if (our last syscall == __NR_pause) {
> 		ret = sys_pause();
> 	} else if (our last syscall == __NR_futex) {
> 		do some stuff;
> 		ret = sys_futex();
> 	} else {
> 		ret = what-we-want-to-return
> 	}
> 	/* finally, return to user space */
> 	return ret;
> }

This only works if we do not want to stay in frozen state after restart. Or am 
I missed something?

> I'm not quite know what other "complex states" you refer to; but I wonder
> whether that code "needed to restore some complex states" could not be
> implemented along the same idea.

In the same manner we are restoring for instance ptrace.

> The upside is clear: the code is less obscure, simple to debug, and not
> architecture-dependent. (hehe .. it even runs faster because it saves a
> whole kernel->user->kernel switch, what do you know !).
In our case we also do not need a switch and the code actually not very 
complicated.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 03/10] Introduce context structure needed during checkpointing/restart
  2008-10-20 17:02       ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Dave Hansen
@ 2008-10-29 15:30         ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-29 15:30 UTC (permalink / raw)
  To: devel; +Cc: Dave Hansen, containers, linux-kernel

On Monday 20 October 2008 21:02 Dave Hansen wrote:
> On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> > +typedef struct cpt_context
> > +{
> > +	pid_t		pid;		/* should be changed to ctid later */
> > +	int		ctx_id;		/* context id */
> > +	struct list_head ctx_list;
> > +	int		refcount;
> > +	int		ctx_state;
> > +	struct semaphore main_sem;
>
> Does this really need to be a semaphore or is a mutex OK?
Actually mutex is enough here.

> > +	int		errno;
>
> Could you hold off on adding these things to the struct until the patch
> where they're actually used?  It's hard to judge this without seeing
> what you do with it.
I will try not to introduce variables and functions which are not used in 
future.

>
> > +	struct file	*file;
> > +	loff_t		current_object;
> > +
> > +	struct list_head object_array[CPT_OBJ_MAX];
> > +
> > +	int		(*write)(const void *addr, size_t count, struct cpt_context *ctx);
> > +	int		(*read)(void *addr, size_t count, struct cpt_context *ctx);
> > +} cpt_context_t;
>
> Man, this is hard to review.  I was going to try and make sure that your
> refcounting was right and atomic, but there's no use of it in this patch
> except for the initialization and accessor functions.  Darn.
For simplicity I will throw out all this stuff completely.

>
> > +extern int debug_level;
>
> I'm going to go out on a limb here and say that "debug_level" is
> probably a wee bit too generic of a variable name.
I will change it to something else.

>
> > +#define cpt_printk(lvl, fmt, args...)	do {	\
> > +		if (lvl <= debug_level)		\
> > +			printk(fmt, ##args);	\
> > +	} while (0)
>
> I think you can use pr_debug() here, too, just like Oren did.
Will switch to pr_debug().

>
> > +struct cpt_context * context_alloc(void)
> > +{
> > +	struct cpt_context *ctx;
> > +	int i;
> > +
> > +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> > +	if (!ctx)
> > +		return NULL;
> > +
> > +	init_MUTEX(&ctx->main_sem);
> > +	ctx->refcount = 1;
> > +
> > +	ctx->current_object = -1;
> > +	ctx->write = file_write;
> > +	ctx->read = file_read;
> > +	for (i = 0; i < CPT_OBJ_MAX; i++) {
> > +		INIT_LIST_HEAD(&ctx->object_array[i]);
> > +	}
> > +
> > +	return ctx;
> > +}
> > +
> > +void context_release(struct cpt_context *ctx)
> > +{
> > +	ctx->ctx_state = CPT_CTX_ERROR;
> > +
> > +	kfree(ctx);
> > +}
> > +
> > +static void context_put(struct cpt_context *ctx)
> > +{
> > +	if (!--ctx->refcount)
> > +		context_release(ctx);
> > +}
> > +
> >  static int checkpoint(pid_t pid, int fd, unsigned long flags)
> >  {
> > -	return -ENOSYS;
> > +	struct file *file;
> > +	struct cpt_context *ctx;
> > +	int err;
> > +
> > +	err = -EBADF;
> > +	file = fget(fd);
> > +	if (!file)
> > +		goto out;
> > +
> > +	err = -ENOMEM;
> > +	ctx = context_alloc();
> > +	if (!ctx)
> > +		goto out_file;
> > +
> > +	ctx->file = file;
> > +	ctx->ctx_state = CPT_CTX_DUMPING;
> > +
> > +	/* checkpoint */
> > +	err = -ENOSYS;
> > +
> > +	context_put(ctx);
> > +
> > +out_file:
> > +	fput(file);
> > +out:
> > +	return err;
> >  }
>
> So, where is context_get()?  Is there only single-threaded access to the
> refcount?  If so, why do we even need it?  We should probably just use
> context_release() driectly.
The idea is that in future we should be able to keep a context for incremental 
checkpointing. That is why we need context get/put functions. Right now it is 
not used, so I will drop it.

> If there is multithreaded access to context_put() or the refcount, then
> they're unsafe without additional locking.
Access to refcount will be protected with context mutex.

Thanks for comments.

Actually I'm not sure if I will continue with my own patch set, but I will 
take into account all your comments during porting my functionality to Oren's 
tree.

Andrey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 10/10] Add support for multiple processes
  2008-10-27 15:58                     ` Oren Laadan
@ 2008-10-30  4:55                       ` Andrey Mirkin
  0 siblings, 0 replies; 49+ messages in thread
From: Andrey Mirkin @ 2008-10-30  4:55 UTC (permalink / raw)
  To: devel; +Cc: Oren Laadan, containers, linux-kernel

On Monday 27 October 2008 18:58 Oren Laadan wrote:
> Andrey Mirkin wrote:
> > The whole tree of processes can be checkpointed and restarted now.
> > Shared objects are not supported yet.
> >
> > Signed-off-by: Andrey Mirkin <major@openvz.org>
> > ---
> >  checkpoint/cpt_image.h   |    2 +
> >  checkpoint/cpt_process.c |   24 +++++++++++++
> >  checkpoint/rst_process.c |   85
> > +++++++++++++++++++++++++++------------------- 3 files changed, 76
> > insertions(+), 35 deletions(-)
> >
> > diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
> > index e1fb483..f370df2 100644
> > --- a/checkpoint/cpt_image.h
> > +++ b/checkpoint/cpt_image.h
> > @@ -128,6 +128,8 @@ struct cpt_task_image {
> >  	__u64	cpt_nivcsw;
> >  	__u64	cpt_min_flt;
> >  	__u64	cpt_maj_flt;
> > +	__u32	cpt_children_num;
> > +	__u32	cpt_pad;
> >  } __attribute__ ((aligned (8)));
> >
> >  struct cpt_mm_image {
> > diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> > index 1f7a54b..d73ec3c 100644
> > --- a/checkpoint/cpt_process.c
> > +++ b/checkpoint/cpt_process.c
> > @@ -40,6 +40,19 @@ static unsigned int encode_task_flags(unsigned int
> > task_flags)
> >
> >  }
> >
> > +static int cpt_count_children(struct task_struct *tsk, struct
> > cpt_context *ctx) +{
> > +	int num = 0;
> > +	struct task_struct *child;
> > +
> > +	list_for_each_entry(child, &tsk->children, sibling) {
> > +		if (child->parent != tsk)
> > +			continue;
> > +		num++;
> > +	}
> > +	return num;
> > +}
> > +
>
> I noticed that don't take the appropriate locks when browsing through
> tasks lists (siblings, children, global list). Although I realize that
> the container should be frozen at this time, I keep wondering if this
> is indeed always safe.
You right here. We need to take tasklist_lock to be sure that everything will 
be consistent.

> For instance, are you protected against an OOM killer that might just
> occur uninvited and kill one of those tasks ?

OOM killer can't kill one of those tasks as all of them should be frozen at 
that time and be in uninterruptible state. So, we can just do not think about 
OOM at that time. But anyway you right and we need locking around browsing 
tasks list.

> Can the administrator force an un-freeze of the container ?  Or perhaps
> some error condition if the kernel cause that ?

The main idea is that context should be protected with a mutex and only one 
process can do some activity (freeze, dump, unfreeze, kill) with a container. 
Right now it is not implemented at all, but in future it will be added.

Andrey

> >  int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context
> > *ctx) {
> >  	struct cpt_task_image *t;
> > @@ -102,6 +115,7 @@ int cpt_dump_task_struct(struct task_struct *tsk,
> > struct cpt_context *ctx) t->cpt_egid = tsk->egid;
> >  	t->cpt_sgid = tsk->sgid;
> >  	t->cpt_fsgid = tsk->fsgid;
> > +	t->cpt_children_num = cpt_count_children(tsk, ctx);
> >
> >  	err = ctx->write(t, sizeof(*t), ctx);
> >
> > @@ -231,6 +245,16 @@ int cpt_dump_task(struct task_struct *tsk, struct
> > cpt_context *ctx) err = cpt_dump_fpustate(tsk, ctx);
> >  	if (!err)
> >  		err = cpt_dump_registers(tsk, ctx);
> > +	if (!err) {
> > +		struct task_struct *child;
> > +		list_for_each_entry(child, &tsk->children, sibling) {
> > +			if (child->parent != tsk)
> > +				continue;
> > +			err = cpt_dump_task(child, ctx);
> > +			if (err)
> > +				break;
>
> Here too.
>
> [...]
>
> Oren.
>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>
> _______________________________________________
> Devel mailing list
> Devel@openvz.org
> https://openvz.org/mailman/listinfo/devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process
  2008-10-29 14:52                                   ` Andrey Mirkin
@ 2008-10-30 15:59                                     ` Oren Laadan
  0 siblings, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2008-10-30 15:59 UTC (permalink / raw)
  To: Andrey Mirkin
  Cc: Dave Hansen, containers, linux-kernel, Louis.Rilling,
	Cedric Le Goater, Pavel Emelyanov



Andrey Mirkin wrote:
> On Sunday 26 October 2008 01:10 Oren Laadan wrote:
>> Andrey Mirkin wrote:
>>> On Thursday 23 October 2008 17:57 Dave Hansen wrote:
>>>> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
>>>>>>>>> It is not related to the freezer code actually.
>>>>>>>>> That is needed to restart syscalls. Right now I don't have a code
>>>>>>>>> in my patchset which restarts a syscall, but later I plan to add
>>>>>>>>> it. In OpenVZ checkpointing we restart syscalls if process was
>>>>>>>>> caught in syscall during checkpointing.
>>>>>>>> Do you checkpoint uninterruptible syscalls as well? If only
>>>>>>>> interruptible syscalls are checkpointed, I'd say that either this
>>>>>>>> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
>>>>>>>> handling code already does the trick, or this syscall does not
>>>>>>>> restart itself when interrupted, and well, this is life, userspace
>>>>>>>> just sees -EINTR, which is allowed by the syscall spec.
>>>>>>>> Actually this is how we checkpoint/migrate tasks in interruptible
>>>>>>>> syscalls in Kerrighed and this works.
>>>>>>> We checkpoint only interruptible syscalls. Some syscalls do not
>>>>>>> restart themself, that is why after restarting a process we restart
>>>>>>> syscall to complete it.
>>>>>> Can you please elaborate on this ?  I don't recall having had issues
>>>>>> with that.
>>>>> Right now in 2.6.18 kernel we restarts in such a way pause,
>>>>> rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
>>>>> and we will not need such hooks for it.
>>>> Could you elaborate on this a bit?
>>>>
>>>> If the futex syscall was reworked, perhaps we can do the same for
>>>> rt_sigtimedwait() and get rid of this code completely.
>>> Well, we can try to rework rt_sigtimedwait(), but we will still need this
>>> code in the future to restart pause syscall from kernel without returning
>>> to user space. Also this code will be needed to restore some complex
>>> states. As concerns pause syscall I have already written to Louis about
>>> the problem we are trying to solve with this code. There is a gap when
>>> process will be in user space just before entering syscall again. At this
>>> time a signal can be delivered to process and it even can be handled. So,
>>> we will miss a signal which must interrupt pause syscall.
>> I'm not convinced that you a real race exists, and even if it does, I'm not
>> convinced that hacking the assembly entry/exit code is the best way to do
>> it.
> 
> Well, as I already told pause() syscall is is not only one case why we need to 
> do some additional job in that place.
> 
>> Let me explain:
>>
>> You are concerned about a race in which a signal is delivered to a task
>> that resumes from restart to user space and is about to (re)invoke
>> 'pause()' (because the restart so arranged its EIP and registers).
>>
>> This almost always means that the user code is buggy and relies on specific
>> scheduling, because you can usually find a scheduling (without the C/R)
>> where the intended recipient of the signal was delayed and only calls
>> pause() after the signal is delivered.
>>
>> For instance, if the sequence of events is:
>> 	A calls pause() -> checkpoint -> restart ->
>> 		B signals A -> A calls pause() (after restart),
>> then the following sequence is possible(*) without C/R:
>> 	B signals A -> A calls pause()
>> because normally B cannot assume anything about when A is actually,
>> really, is suspended (which means the programmer did an imperfect job).
> 
> You right here. Both sequences are possible in theory. You will be surprised 
> but in practice we found out that probability to miss a signal in case of C/R 
> is much higher then during ordinary execution.

The point is that "missing" a signal because of freeze/thaw (or stop/cont)
is perfectly acceptable behavior. It is supposed to work that way. So I
argue that we don't need a workaround.

> 
>> I said "almost always" and "usually", because there is one case where the
>> alternative schedule: task B could, prior to sending the signal, "ensure"
>> that task A is already sleeping within the 'pause()' syscall. While this
>> is possible, it is definitely unusual, and in fact I never code that does
>> that. And what if the sysadmin send SIGSTOP followed by SIGCONT ?  In
>> short, such code is simply broken.
>>
>> More importantly, if you think about the operation and semantics of the
>> freezer cgroup - similar behavior is to be expected when you freeze and
>> then thaw a container.
>>
>> Specifically, when you freeze the container that has a task in sys_pause(),
>> then that task will abort the syscall become frozen. As soon as it becomes
>> unfrozen, it will return to user space (with the EIP "rewinded") only to
>> re-invoke the syscall. So the same "race" remains even if you only freeze
>> and then thaw, regardless of C/R.
> 
> Exactly. But during freeze/unfreeze probability to catch such situation is 
> very low. In our tests we were tried to checkpoint/restart LTP tests. And 
> this "race" were triggered during restart almost in 100% of tests.

Why is it the case ?

At the end of restart, the container remains frozen until you unfreeze it.
So c/r effectively becomes freeze/unfreeze, except - possible - for page
fault that are more likely to happen when thawing following the restart
(and these may slow down the application and allow a signal to "slip in").

If it's nearly 100% of the tests, that it should be easily reproducible
with merely freeze/thaw pairs, no ?

Still, arguing that LTP "breaks" here is like arguing that LTP "breaks"
if we were to run it while sending SIGSTOP/SIGCONT to its processes...

> 
>> Moreover, I argue that basically when you return from a sys_restart(), the
>> entire container should, by default, remain in frozen state - just like it
>> is with sys_checkpoint(). An explicit thaw will make the container resume
>> execution.
> 
> No doubt. That is how we do in OpenVZ, after restart the container remains 
> frozen. And we need to thaw it to resume its execution.
> 
>> Therefore, there are two options: the first is to decide that this behavior
>> - going back to user space to re-invoke the syscall - is valid. In this
>> case you don't need a special hack for returning from sys_restart(). The
>> second option is to decide that it is broken, in which case you need to
>> also fix the freezer code. Personally, I think that this behavior is valid
>> and need not be fixed.
> 
> I still believe that we need to fix such behaviour during restart as in 
> practice it is very easy to trigger it.

did you see any problem outside LTP ?  I never saw any such problems.

> 
>> Finally, even if you do want to fix the behavior for this pathologic case,
>> I don't see why you'd want to do it in this manner. Instead, you can add a
>> simple test prior to returning from sys_restart(), something like this:
>>
>> 	...
>> 	/* almost done: now handle special cases: */
>> 	if (our last syscall == __NR_pause) {
		do_freeze();

>> 		ret = sys_pause();
>> 	} else if (our last syscall == __NR_futex) {
>> 		do some stuff;
		do_freeze();

>> 		ret = sys_futex();
>> 	} else {
>> 		ret = what-we-want-to-return
>> 	}
>> 	/* finally, return to user space */
>> 	return ret;
>> }
> 
> This only works if we do not want to stay in frozen state after restart. Or am 
> I missed something?

Sure: see addition above.

> 
>> I'm not quite know what other "complex states" you refer to; but I wonder
>> whether that code "needed to restore some complex states" could not be
>> implemented along the same idea.
> 
> In the same manner we are restoring for instance ptrace.

Yes, I mentioned that in the past. Can be addressed in the same manner.

> 
>> The upside is clear: the code is less obscure, simple to debug, and not
>> architecture-dependent. (hehe .. it even runs faster because it saves a
>> whole kernel->user->kernel switch, what do you know !).

> In our case we also do not need a switch and the code actually not very 
> complicated.

Fact is that people wondered what was going on there.

I prefer the arch-independent way, because it is, well, arch-independent,
and because it makes the logic and the exception obvious to the reader,
and it is easily extensible to handle additional special cases (like
ptrace), and makes maintenance easier.

Do you object to doing it this way ?

Oren.


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2008-10-30 16:00 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-17 23:11 [PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2) Andrey Mirkin
2008-10-17 23:11 ` [PATCH 01/10] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin
2008-10-17 23:11   ` [PATCH 02/10] Make checkpoint/restart functionality modular Andrey Mirkin
2008-10-17 23:11     ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Andrey Mirkin
2008-10-17 23:11       ` [PATCH 04/10] Introduce container dump function Andrey Mirkin
2008-10-17 23:11         ` [PATCH 05/10] Introduce function to dump process Andrey Mirkin
2008-10-17 23:11           ` [PATCH 06/10] Introduce functions to dump mm Andrey Mirkin
2008-10-17 23:11             ` [PATCH 07/10] Introduce function for restarting a container Andrey Mirkin
2008-10-17 23:11               ` [PATCH 08/10] Introduce functions to restart a process Andrey Mirkin
2008-10-17 23:11                 ` [PATCH 09/10] Introduce functions to restore mm Andrey Mirkin
2008-10-17 23:11                   ` [PATCH 10/10] Add support for multiple processes Andrey Mirkin
2008-10-27 15:58                     ` Oren Laadan
2008-10-30  4:55                       ` [Devel] " Andrey Mirkin
2008-10-20  9:23                 ` [PATCH 08/10] Introduce functions to restart a process Cedric Le Goater
2008-10-22  8:49                   ` [Devel] " Andrey Mirkin
2008-10-22  9:25                     ` Louis Rilling
2008-10-22 10:06                       ` Greg Kurz
2008-10-22 10:44                         ` Louis Rilling
2008-10-22 12:44                           ` Greg Kurz
2008-10-22 10:12                       ` Andrey Mirkin
2008-10-22 10:46                         ` Louis Rilling
2008-10-23  8:53                           ` Andrey Mirkin
2008-10-22 15:25                         ` Oren Laadan
2008-10-23  9:00                           ` Andrey Mirkin
2008-10-23 13:57                             ` Dave Hansen
2008-10-24  3:57                               ` Andrey Mirkin
2008-10-25 21:10                                 ` Oren Laadan
2008-10-29 14:52                                   ` Andrey Mirkin
2008-10-30 15:59                                     ` Oren Laadan
2008-10-22 12:47                     ` Cedric Le Goater
2008-10-23  9:54                       ` Andrey Mirkin
2008-10-23 13:49                         ` Dave Hansen
2008-10-24  4:04                           ` Andrey Mirkin
2008-10-20 13:25                 ` Louis Rilling
2008-10-23 10:56                   ` [Devel] " Andrey Mirkin
2008-10-20 12:25             ` [PATCH 06/10] Introduce functions to dump mm Louis Rilling
2008-10-22  8:58               ` [Devel] " Andrey Mirkin
2008-10-20 17:21             ` Dave Hansen
2008-10-23  8:43               ` [Devel] " Andrey Mirkin
2008-10-23 13:51                 ` Dave Hansen
2008-10-24  4:07                   ` Andrey Mirkin
2008-10-20 11:02           ` [PATCH 05/10] Introduce function to dump process Louis Rilling
2008-10-24  4:15             ` [Devel] " Andrey Mirkin
2008-10-20 17:48           ` Serge E. Hallyn
2008-10-24  4:40             ` [Devel] " Andrey Mirkin
2008-10-20 17:02       ` [PATCH 03/10] Introduce context structure needed during checkpointing/restart Dave Hansen
2008-10-29 15:30         ` [Devel] " Andrey Mirkin
2008-10-20 16:51     ` [PATCH 02/10] Make checkpoint/restart functionality modular Dave Hansen
2008-10-20 16:59     ` Serge E. Hallyn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).