[RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
@ 2011-07-15 13:45 Pavel Emelyanov
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:45 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

Hi guys!

There have already been made many attempts to have the checkpoint/restore functionality
in Linux, but as far as I can see there's still no final solutions that suits most of
the interested people. The main concern about the previous approaches as I see it was
about - all that stuff was supposed to sit in the kernel thus creating various problems.

I'd like to bring this subject back again proposing the way of how to implement c/r
mostly in the userspace with the reasonable help of a kernel.

That said, I propose to start with very basic set of objects to c/r that can work with

* x86_64 tasks (subtree) which includes
   - registers
   - TLS
   - memory of all kinds (file and anon both shared and private)
* open regular files
* pipes (with data in it)

Core idea:

The core idea of the restore process is to implement the binary handler that can execve-ute
image files recreating the register and the memory state of a task. Restoring the process 
tree and opening files is done completely in the user space, i.e. when restoring the subtree
of processes I first fork all the tasks in respective order, then open required files and 
then call execve() to restore registers and memory.

The checkpointing process is quite simple - all we need about processes can be read from /proc
except for several things - registers and private memory. In current implementation to get 
them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
mappings. It is populated with symbolc links with names equal to vma->vm_start and pointing to
mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
/proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
if required map one and read the contents of anon shared memory.

Other minor stuff is in patches and mostly tools. The set is for linux-2.6.39. The current
implementation is not yet well tested and has many other defects, but demonstrates the idea. 

What do you think? Does the support from kernel of the proposed type suit us?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 0/1] proc: Introduce the /proc/<pid>/mfd/ directory
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-15 13:45   ` Pavel Emelyanov
       [not found]     ` <4E20448A.5010207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:46   ` [PATCH 2/7] vfs: Introduce the fd closing helper Pavel Emelyanov
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:45 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is vma->vm_start, the
target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

This thing is aimed to help checkpointing processes.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

---
 fs/proc/base.c          |  204 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/proc_fs.h |    5 +-
 2 files changed, 208 insertions(+), 1 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..633af12 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1995,6 +1995,49 @@ static int proc_fd_link(struct inode *inode, struct path *path)
 	return proc_fd_info(inode, path, NULL);
 }
 
+static int proc_mfd_get_link(struct inode *inode, struct path *path)
+{
+	struct task_struct *task;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	unsigned long vm_start;
+	int rc = -ENOENT;
+
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm)
+		goto out;
+
+	vm_start = PROC_I(inode)->vm_start;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vm_start < vm_start)
+			continue;
+		if (vma->vm_start > vm_start)
+			break;
+		if (!vma->vm_file)
+			break;
+
+		*path = vma->vm_file->f_path;
+		path_get(path);
+
+		rc = 0;
+		break;
+	}
+
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+
+out:
+	return rc;
+}
+
 static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode;
@@ -2213,6 +2256,115 @@ static const struct file_operations proc_fd_operations = {
 	.llseek		= default_llseek,
 };
 
+static const struct dentry_operations tid_mfd_dentry_operations = {
+	.d_delete	= pid_delete_dentry,
+};
+
+static struct dentry *proc_mfd_instantiate(struct inode *dir, struct dentry *dentry,
+					   struct task_struct *task, const void *ptr)
+{
+	const struct vm_area_struct *vma = ptr;
+	struct file *file = vma->vm_file;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei			= PROC_I(inode);
+	ei->vm_start		= vma->vm_start;
+	ei->op.proc_get_link	= proc_mfd_get_link;
+
+	inode->i_op	= &proc_pid_link_inode_operations;
+	inode->i_size	= 64;
+	inode->i_mode	= S_IFLNK;
+
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR | S_IXUSR;
+	if (file->f_mode & FMODE_WRITE)
+		inode->i_mode |= S_IWUSR | S_IXUSR;
+
+	d_set_d_op(dentry, &tid_mfd_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static int proc_mfd_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned int vmai;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_no_task;
+
+	ret = -EPERM;
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out;
+		filp->f_pos++;
+	default:
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out;
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap, vmai = 2; vma; vma = vma->vm_next) {
+			char name[2 + 16 + 1];
+			int len;
+
+			if (!vma->vm_file)
+				continue;
+
+			vmai++;
+			if (vmai < filp->f_pos)
+				continue;
+
+			filp->f_pos++;
+			len = snprintf(name, sizeof(name), "0x%lx", vma->vm_start);
+			if (proc_fill_cache(filp, dirent, filldir,
+					    name, len, proc_mfd_instantiate,
+					    task, vma) < 0)
+				break;
+		}
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+
+out:
+	put_task_struct(task);
+out_no_task:
+	return ret;
+}
+
+static const struct file_operations proc_mfd_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_mfd_readdir,
+	.llseek		= default_llseek,
+};
+
 /*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
@@ -2240,6 +2392,57 @@ static const struct inode_operations proc_fd_inode_operations = {
 	.setattr	= proc_setattr,
 };
 
+static struct dentry *proc_mfd_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	struct task_struct *task;
+	unsigned long vm_start;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct dentry *result;
+	char *endp;
+
+	result = ERR_PTR(-ENOENT);
+
+	task = get_proc_task(dir);
+	if (!task)
+		goto out_no_task;
+
+	vm_start = simple_strtoul(dentry->d_name.name, &endp, 16);
+	if (*endp != '\0')
+		goto out_no_mm;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_no_mm;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vm_start == vm_start)
+			break;
+		if (vma->vm_start > vm_start)
+			goto out_no_vma;
+	}
+
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_mfd_instantiate(dir, dentry, task, vma);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_no_mm:
+	put_task_struct(task);
+out_no_task:
+	return result;
+}
+
+static const struct inode_operations proc_mfd_inode_operations = {
+	.lookup		= proc_mfd_lookup,
+	.setattr	= proc_setattr,
+};
+
 static struct dentry *proc_fdinfo_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
 {
@@ -2819,6 +3022,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("mfd",        S_IRUSR|S_IXUSR, proc_mfd_inode_operations, proc_mfd_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index eaf4350..c779c74 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -265,7 +265,10 @@ struct ctl_table;
 
 struct proc_inode {
 	struct pid *pid;
-	int fd;
+	union {
+		int fd;
+		unsigned long vm_start;
+	};
 	union proc_op op;
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/7] vfs: Introduce the fd closing helper
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:45   ` [PATCH 0/1] proc: Introduce the /proc/<pid>/mfd/ directory Pavel Emelyanov
@ 2011-07-15 13:46   ` Pavel Emelyanov
       [not found]     ` <4E2044A7.4030103-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:46   ` [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status Pavel Emelyanov
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:46 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

This is nothing but making is possible to call the sys_close from the kernel.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

---
 fs/open.c          |   32 ++++++++++++++++++++------------
 include/linux/fs.h |    1 +
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index b52cf01..126aa8b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1078,17 +1078,11 @@ int filp_close(struct file *filp, fl_owner_t id)
 
 EXPORT_SYMBOL(filp_close);
 
-/*
- * Careful here! We test whether the file pointer is NULL before
- * releasing the fd. This ensures that one clone task can't release
- * an fd while another clone is opening it.
- */
-SYSCALL_DEFINE1(close, unsigned int, fd)
+int do_close(unsigned int fd)
 {
 	struct file * filp;
 	struct files_struct *files = current->files;
 	struct fdtable *fdt;
-	int retval;
 
 	spin_lock(&files->file_lock);
 	fdt = files_fdtable(files);
@@ -1101,7 +1095,25 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	FD_CLR(fd, fdt->close_on_exec);
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
-	retval = filp_close(filp, files);
+
+	return filp_close(filp, files);
+
+out_unlock:
+	spin_unlock(&files->file_lock);
+	return -EBADF;
+}
+EXPORT_SYMBOL_GPL(do_close);
+
+/*
+ * Careful here! We test whether the file pointer is NULL before
+ * releasing the fd. This ensures that one clone task can't release
+ * an fd while another clone is opening it.
+ */
+SYSCALL_DEFINE1(close, unsigned int, fd)
+{
+	int retval;
+
+	retval = do_close(fd);
 
 	/* can't restart close syscall because file table entry was cleared */
 	if (unlikely(retval == -ERESTARTSYS ||
@@ -1111,10 +1123,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 		retval = -EINTR;
 
 	return retval;
-
-out_unlock:
-	spin_unlock(&files->file_lock);
-	return -EBADF;
 }
 EXPORT_SYMBOL(sys_close);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cdf9495..77a5d3e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1991,6 +1991,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
 extern struct file * dentry_open(struct dentry *, struct vfsmount *, int,
 				 const struct cred *);
 extern int filp_close(struct file *, fl_owner_t id);
+extern int do_close(unsigned int fd);
 extern char * getname(const char __user *);
 
 /* fs/ioctl.c */
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:45   ` [PATCH 0/1] proc: Introduce the /proc/<pid>/mfd/ directory Pavel Emelyanov
  2011-07-15 13:46   ` [PATCH 2/7] vfs: Introduce the fd closing helper Pavel Emelyanov
@ 2011-07-15 13:46   ` Pavel Emelyanov
       [not found]     ` <4E2044C3.7050506-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:47   ` [PATCH 4/7] vfs: Add ->statfs callback for pipefs Pavel Emelyanov
                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:46 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

Although we can get the pids of some task's issue, this is just 
more convenient to have them this way.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>.

---
 fs/proc/array.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 5e4f776..f01f480 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -158,6 +158,18 @@ static inline const char *get_task_state(struct task_struct *tsk)
 	return *p;
 }
 
+static void task_children(struct seq_file *m, struct task_struct *p, struct pid_namespace *ns)
+{
+	struct task_struct *c;
+
+	seq_printf(m, "Children:");
+	read_lock(&tasklist_lock);
+	list_for_each_entry(c, &p->children, sibling)
+		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
+	read_unlock(&tasklist_lock);
+	seq_putc(m, '\n');
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *p)
 {
@@ -192,6 +204,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		cred->uid, cred->euid, cred->suid, cred->fsuid,
 		cred->gid, cred->egid, cred->sgid, cred->fsgid);
 
+	task_children(m, p, ns);
+
 	task_lock(p);
 	if (p->files)
 		fdt = files_fdtable(p->files);
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 4/7] vfs: Add ->statfs callback for pipefs
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2011-07-15 13:46   ` [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status Pavel Emelyanov
@ 2011-07-15 13:47   ` Pavel Emelyanov
       [not found]     ` <4E2044D6.3060205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:47   ` [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality Pavel Emelyanov
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:47 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

This is done to make it possible to distinguish pipes from fifos
when opening one via /proc/<pid>/fd/ link.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

---
 fs/pipe.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index da42f7d..5de15de 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1254,6 +1254,7 @@ out:
 
 static const struct super_operations pipefs_ops = {
 	.destroy_inode = free_inode_nonrcu,
+	.statfs = simple_statfs,
 };
 
 /*
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (3 preceding siblings ...)
  2011-07-15 13:47   ` [PATCH 4/7] vfs: Add ->statfs callback for pipefs Pavel Emelyanov
@ 2011-07-15 13:47   ` Pavel Emelyanov
       [not found]     ` <4E2044EB.20001-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:47   ` [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file Pavel Emelyanov
                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:47 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

The respective flag for clone() makes the latter to take the desired
pid of a new process from the child_tidptr. The given pid is used as
the pid for the pid namespace the parent is currently running in.

Needed badly for restoring a process.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

---
 include/linux/pid.h   |    2 +-
 include/linux/sched.h |    1 +
 kernel/fork.c         |   10 ++++++-
 kernel/pid.c          |   70 +++++++++++++++++++++++++++++++++++-------------
 4 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index cdced84..de772ab 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int pid);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 781abd1..5b6c1e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -23,6 +23,7 @@
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
 /* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
    and is now available for re-use. */
+#define CLONE_CHILD_USEPID	0x02000000	/* use the given pid */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..f30fbdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1183,8 +1183,16 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
+		int want_pid = 0;
+
+		if (clone_flags & CLONE_CHILD_USEPID) {
+			retval = get_user(want_pid, child_tidptr);
+			if (retval)
+				goto bad_fork_cleanup_io;
+		}
+
 		retval = -ENOMEM;
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, want_pid);
 		if (!pid)
 			goto bad_fork_cleanup_io;
 	}
diff --git a/kernel/pid.c b/kernel/pid.c
index 57a8346..69ae1be 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,11 +159,55 @@ static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
 	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	if (unlikely(!map->page)) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it:
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (!map->page) {
+			map->page = page;
+			page = NULL;
+		}
+		spin_unlock_irq(&pidmap_lock);
+		kfree(page);
+		if (unlikely(!map->page))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+	if (alloc_pidmap_page(map) < 0)
+		return -ENOMEM;
+
+	if (!test_and_set_bit(offset, map->page)) {
+		atomic_dec(&map->nr_free);
+		return pid;
+	}
+
+	return -EBUSY;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int desired_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
+	if (desired_pid)
+		return set_pidmap(pid_ns, desired_pid);
+
 	pid = last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
@@ -176,22 +220,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	 */
 	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (alloc_pidmap_page(map) < 0)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -277,7 +308,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int this_ns_pid)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -291,13 +322,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, this_ns_pid);
 		if (nr < 0)
 			goto out_free;
 
 		pid->numbers[i].nr = nr;
 		pid->numbers[i].ns = tmp;
 		tmp = tmp->parent;
+		this_ns_pid = 0;
 	}
 
 	get_pid_ns(ns);
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (4 preceding siblings ...)
  2011-07-15 13:47   ` [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality Pavel Emelyanov
@ 2011-07-15 13:47   ` Pavel Emelyanov
       [not found]     ` <4E204500.6040800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:48   ` [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler Pavel Emelyanov
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:47 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

An image read from file contains task's registers and information
about its VM. Later this image can be execve-ed causing recreation
of the previously read task state.

The file format is my own, very simple. Introduced to make the code
as simple as possible. Better file format (if any) is to be discussed.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

---
 fs/proc/Kconfig            |    8 +
 fs/proc/Makefile           |    1 +
 fs/proc/base.c             |    3 +
 fs/proc/img_dump.c         |  397 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/binfmt_img.h |   87 ++++++++++
 include/linux/proc_fs.h    |    2 +
 6 files changed, 498 insertions(+), 0 deletions(-)
 create mode 100644 fs/proc/img_dump.c
 create mode 100644 include/linux/binfmt_img.h

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 15af622..c64bf75 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -67,3 +67,11 @@ config PROC_PAGE_MONITOR
 	  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
 	  /proc/kpagecount, and /proc/kpageflags. Disabling these
           interfaces will reduce the size of the kernel by approximately 4kb.
+
+config PROC_IMG
+	default y
+	depends on PROC_FS
+	bool "Enable /proc/<pid>/dump file"
+	help
+	  Say Y here if you want to be able to produce checkpoint-restore images
+	  for tasks via proc
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index df434c5..3a59cb1 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,3 +27,4 @@ proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
 proc-$(CONFIG_PROC_DEVICETREE)	+= proc_devtree.o
 proc-$(CONFIG_PRINTK)	+= kmsg.o
 proc-$(CONFIG_PROC_PAGE_MONITOR)	+= page.o
+proc-$(CONFIG_PROC_IMG) += img_dump.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 633af12..c01438f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3044,6 +3044,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("cmdline",    S_IRUGO, proc_pid_cmdline),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
+#ifdef CONFIG_PROC_IMG
+	REG("dump",	  S_IRUSR|S_IWUSR, proc_pid_dump_operations),
+#endif
 	ONE("statm",      S_IRUGO, proc_pid_statm),
 	REG("maps",       S_IRUGO, proc_maps_operations),
 #ifdef CONFIG_NUMA
diff --git a/fs/proc/img_dump.c b/fs/proc/img_dump.c
new file mode 100644
index 0000000..7fa52ef
--- /dev/null
+++ b/fs/proc/img_dump.c
@@ -0,0 +1,397 @@
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/binfmt_img.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/highmem.h>
+#include <linux/types.h>
+#include "internal.h"
+
+static int img_dump_buffer(char __user *ubuf, size_t size, void *buf, int len, int pos)
+{
+	int ret;
+	static size_t dumped = 0;
+
+	len -= pos;
+	if (len > size)
+		len = size;
+
+	ret = copy_to_user(ubuf, buf + pos, len);
+	if (ret)
+		return -EFAULT;
+
+	dumped += len;
+	return len;
+}
+
+static int img_dump_header(char __user *buf, size_t size, int pos)
+{
+	struct binfmt_img_header hdr;
+
+	hdr.magic = BINFMT_IMG_MAGIC;
+	hdr.version = BINFMT_IMG_VERS_0;
+
+	return img_dump_buffer(buf, size, &hdr, sizeof(hdr), pos);
+}
+
+static __u16 encode_segment(unsigned short seg)
+{
+	if (seg == 0)
+		return CKPT_X86_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+
+	if (seg == __USER_CS)
+		return CKPT_X86_SEG_USER64_CS;
+	if (seg == __USER_DS)
+		return CKPT_X86_SEG_USER64_DS;
+#ifdef CONFIG_COMPAT
+	if (seg == __USER32_CS)
+		return CKPT_X86_SEG_USER32_CS;
+	if (seg == __USER32_DS)
+		return CKPT_X86_SEG_USER32_DS;
+#endif
+
+	if (seg & 4)
+		return CKPT_X86_SEG_LDT | (seg >> 3);
+
+	seg >>= 3;
+	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+	printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg);
+	BUG();
+}
+
+static __u64 encode_tls(struct desc_struct *d)
+{
+	return ((__u64)d->a << 32) + d->b;
+}
+
+static int img_dump_regs(struct task_struct *p, char __user *buf, size_t size, int pos)
+{
+	struct binfmt_regs_image regi;
+	struct pt_regs *regs;
+	int i;
+
+	regs = task_pt_regs(p);
+
+	regi.r15 = regs->r15;
+	regi.r14 = regs->r14;
+	regi.r13 = regs->r13;
+	regi.r12 = regs->r12;
+	regi.r11 = regs->r11;
+	regi.r10 = regs->r10;
+	regi.r9 = regs->r9;
+	regi.r8 = regs->r8;
+	regi.ax = regs->ax;
+	regi.orig_ax = regs->orig_ax;
+	regi.bx = regs->bx;
+	regi.cx = regs->cx;
+	regi.dx = regs->dx;
+	regi.si = regs->si;
+	regi.di = regs->di;
+	regi.ip = regs->ip;
+	regi.flags = regs->flags;
+	regi.bp = regs->bp;
+	regi.sp = regs->sp;
+
+	/* segments */
+	regi.gsindex = encode_segment(p->thread.gsindex);
+	regi.fsindex = encode_segment(p->thread.fsindex);
+	regi.cs = encode_segment(regs->cs);
+	regi.ss = encode_segment(regs->ss);
+	regi.ds = encode_segment(p->thread.ds);
+	regi.es = encode_segment(p->thread.es);
+
+	BUILD_BUG_ON(GDT_ENTRY_TLS_ENTRIES != CKPT_TLS_ENTRIES);
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
+		regi.tls[i] = encode_tls(&p->thread.tls_array[i]);
+
+	if (p->thread.gsindex)
+		regi.gs = 0;
+	else
+		regi.gs = p->thread.gs;
+
+	if (p->thread.fsindex)
+		regi.fs = 0;
+	else
+		regi.fs = p->thread.fs;
+
+	return img_dump_buffer(buf, size, &regi, sizeof(regi), pos);
+}
+
+static int img_dump_mm(struct mm_struct *mm, char __user *buf, size_t size, int pos)
+{
+	struct binfmt_mm_image mmi;
+
+	mmi.flags = mm->flags;
+	mmi.def_flags = mm->def_flags;
+	mmi.start_code = mm->start_code;
+	mmi.end_code = mm->end_code;
+	mmi.start_data = mm->start_data;
+	mmi.end_data = mm->end_data;
+	mmi.start_brk = mm->start_brk;
+	mmi.brk = mm->brk;
+	mmi.start_stack = mm->start_stack;
+	mmi.arg_start = mm->arg_start;
+	mmi.arg_end = mm->arg_end;
+	mmi.env_start = mm->env_start;
+	mmi.env_end = mm->env_end;
+	mmi.exe_fd = 0;
+
+	return img_dump_buffer(buf, size, &mmi, sizeof(mmi), pos);
+}
+
+static int img_dump_vma(struct vm_area_struct *vma, char __user *buf, size_t size, int pos)
+{
+	struct binfmt_vma_image vmai;
+
+	if (vma == NULL) {
+		memset(&vmai, 0, sizeof(vmai));
+		goto dumpit;
+	}
+
+	printk("Dumping vma %016lx-%016lx %p/%p\n", vma->vm_start, vma->vm_end, vma, vma->vm_mm);
+
+	vmai.fd = 0;
+	vmai.prot = 0;
+	if (vma->vm_flags & VM_READ)
+		vmai.prot |= PROT_READ;
+	if (vma->vm_flags & VM_WRITE)
+		vmai.prot |= PROT_WRITE;
+	if (vma->vm_flags & VM_EXEC)
+		vmai.prot |= PROT_EXEC;
+
+	vmai.flags = 0;
+	if (vma->vm_file == NULL)
+		vmai.flags |= MAP_ANONYMOUS;
+	if (vma->vm_flags & VM_MAYSHARE)
+		vmai.flags |= MAP_SHARED;
+	else
+		vmai.flags |= MAP_PRIVATE;
+
+	vmai.start = vma->vm_start;
+	vmai.end = vma->vm_end;
+	vmai.pgoff = vma->vm_pgoff;
+
+dumpit:
+	return img_dump_buffer(buf, size, &vmai, sizeof(vmai), pos);
+}
+
+static int img_dump_page(unsigned long addr, void *data, char __user *buf, size_t size, int pos)
+{
+	struct binfmt_page_image pgi;
+	int ret = 0, tmp;
+
+	pgi.vaddr = addr;
+
+	if (pos < sizeof(pgi)) {
+		tmp = img_dump_buffer(buf, size, &pgi, sizeof(pgi), pos);
+		if (tmp < 0)
+			return tmp;
+
+		ret = tmp;
+		if (size <= ret)
+			return ret;
+
+		buf += ret;
+		size -= ret;
+		pos = 0;
+	} else
+		pos -= sizeof(pgi);
+
+	tmp = img_dump_buffer(buf, size, data, PAGE_SIZE, pos);
+	if (tmp < 0)
+		return tmp;
+
+	return ret + tmp;
+}
+
+static inline int is_private_vma(struct vm_area_struct *vma)
+{
+	if (vma->vm_file == NULL)
+		return 1;
+	if (!(vma->vm_flags & VM_SHARED))
+		return 1;
+	return 0;
+}
+
+static ssize_t do_produce_dump(struct task_struct *p, char __user *buf,
+		size_t size, loff_t *ppos)
+{
+	size_t img_pos = 0, img_ppos;
+	size_t produced = 0;
+	int len;
+	loff_t pos = *ppos;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+
+#define move_pos();	do {	\
+		buf += len;	\
+		produced += len;\
+		size -= len;	\
+		pos += len;	\
+	} while (0)
+
+#define seek_pos(__size);	do {	\
+		img_ppos = img_pos;	\
+		img_pos += (__size);	\
+	} while (0)
+
+	/* header */
+	seek_pos(sizeof(struct binfmt_img_header));
+	if (pos < img_pos) {
+		len = img_dump_header(buf, size, pos - img_ppos);
+		if (len < 0)
+			goto err;
+
+		move_pos();
+		if (size == 0)
+			goto out;
+	}
+
+	/* registers */
+	seek_pos(sizeof(struct binfmt_regs_image));
+	if (pos < img_pos) {
+		len = img_dump_regs(p, buf, size, pos - img_ppos);
+		if (len < 0)
+			goto err;
+
+		move_pos();
+		if (size == 0)
+			goto out;
+	}
+
+	/* memory */
+	mm = get_task_mm(p);
+	if (mm == NULL)
+		return -EACCES;
+
+	down_read(&mm->mmap_sem);
+
+	seek_pos(sizeof(struct binfmt_mm_image));
+	if (pos < img_pos) {
+		len = img_dump_mm(mm, buf, size, pos - img_ppos);
+		if (len < 0)
+			goto err_mm;
+
+		move_pos();
+		if (size == 0)
+			goto out_mm;
+	}
+
+	vma = mm->mmap;
+	while (1) {
+		seek_pos(sizeof(struct binfmt_vma_image));
+		if (pos < img_pos) {
+			len = img_dump_vma(vma, buf, size, pos - img_ppos);
+			if (len < 0)
+				goto err_mm;
+
+			move_pos();
+			if (size == 0)
+				goto out_mm;
+		}
+
+		if (vma == NULL)
+			break;
+
+		vma = vma->vm_next;
+	}
+
+	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+		/* slow and stupid */
+		unsigned long addr;
+		struct page *page;
+		void *pg_data;
+
+		if (!is_private_vma(vma))
+			continue;
+
+		for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+			page = follow_page(vma, addr, FOLL_FORCE | FOLL_DUMP | FOLL_GET);
+			if (page == NULL)
+				continue;
+			if (IS_ERR(page)) /* huh? */
+				continue;
+
+			seek_pos(sizeof(struct binfmt_page_image) + PAGE_SIZE);
+			if (pos < img_pos) {
+				pg_data = kmap(page);
+				len = img_dump_page(addr, pg_data, buf, size, pos - img_ppos);
+				kunmap(page);
+
+				if (len < 0) {
+					put_page(page);
+					goto err_mm;
+				}
+
+				move_pos();
+				if (size == 0) {
+					put_page(page);
+					goto out_mm;
+				}
+			}
+
+			put_page(page);
+		}
+	}
+
+	seek_pos(sizeof(struct binfmt_page_image));
+	if (pos < img_pos) {
+		struct binfmt_page_image zero;
+
+		memset(&zero, 0, sizeof(zero));
+		len = img_dump_buffer(buf, size, &zero, sizeof(zero), pos - img_ppos);
+		if (len < 0)
+			goto err;
+
+		move_pos();
+	}
+
+out_mm:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out:
+	*ppos = pos;
+	return produced;
+
+err_mm:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+err:
+	return len;
+}
+
+static ssize_t img_dump_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
+{
+	struct task_struct *p;
+
+	p = get_proc_task(file->f_dentry->d_inode);
+	if (p == NULL)
+		return -ESRCH;
+
+	if (!(p->state & TASK_STOPPED)) {
+		put_task_struct(p);
+		return -EINVAL;
+	}
+
+	return do_produce_dump(p, buf, size, ppos);
+}
+
+static int img_dump_open(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static int img_dump_release(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+const struct file_operations proc_pid_dump_operations = {
+	.open		= img_dump_open,
+	.read		= img_dump_read,
+	.release	= img_dump_release,
+};
diff --git a/include/linux/binfmt_img.h b/include/linux/binfmt_img.h
new file mode 100644
index 0000000..a4293af
--- /dev/null
+++ b/include/linux/binfmt_img.h
@@ -0,0 +1,87 @@
+#ifndef __BINFMT_IMG_H__
+#define __BINFMT_IMG_H__
+
+#include <linux/types.h>
+
+struct binfmt_img_header {
+	__u32	magic;
+	__u32	version;
+};
+
+#define CKPT_TLS_ENTRIES	3
+
+struct binfmt_regs_image {
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 orig_ax;
+	__u64 bx;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 ip;
+	__u64 flags;
+	__u64 bp;
+	__u64 sp;
+
+	__u64 gs;
+	__u64 fs;
+	__u64 tls[CKPT_TLS_ENTRIES];
+	__u16 gsindex;
+	__u16 fsindex;
+	__u16 cs;
+	__u16 ss;
+	__u16 ds;
+	__u16 es;
+};
+
+#define CKPT_X86_SEG_NULL       0
+#define CKPT_X86_SEG_USER32_CS  1
+#define CKPT_X86_SEG_USER32_DS  2
+#define CKPT_X86_SEG_USER64_CS  3
+#define CKPT_X86_SEG_USER64_DS  4
+#define CKPT_X86_SEG_TLS        0x4000
+#define CKPT_X86_SEG_LDT        0x8000
+
+struct binfmt_mm_image {
+	__u64	flags;
+	__u64	def_flags;
+	__u64	start_code;
+	__u64	end_code;
+	__u64	start_data;
+	__u64	end_data;
+	__u64	start_brk;
+	__u64	brk;
+	__u64	start_stack;
+	__u64	arg_start;
+	__u64	arg_end;
+	__u64	env_start;
+	__u64	env_end;
+	__u32	exe_fd;
+};
+
+struct binfmt_vma_image {
+	__u32	prot;
+	__u32	flags;
+	__u32	pad;
+	__u32	fd;
+	__u64	start;
+	__u64	end;
+	__u64	pgoff;
+};
+
+struct binfmt_page_image {
+	__u64	vaddr;
+};
+
+#define BINFMT_IMG_MAGIC	0xa75b8d43
+#define BINFMT_IMG_VERS_0	0x00000100
+
+#endif
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index c779c74..686b374 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -102,6 +102,8 @@ struct vmcore {
 
 #ifdef CONFIG_PROC_FS
 
+extern const struct file_operations proc_pid_dump_operations;
+
 extern void proc_root_init(void);
 
 void proc_flush_task(struct task_struct *task);
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (5 preceding siblings ...)
  2011-07-15 13:47   ` [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file Pavel Emelyanov
@ 2011-07-15 13:48   ` Pavel Emelyanov
       [not found]     ` <4E204519.3040804-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 13:49   ` [TOOLS] To make use of the patches Pavel Emelyanov
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:48 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

When being execve-ed the handler reads registers, mappings and provided
memory pages from image and just assigns this state on current task. This
simple functionality can be used to restore a task, whose state whas read
from e.g. /proc/<pid>/dump file before.

As I said before, the mentioned proc file format is designed to be as
simple as possible. Can (and should) be redesigned (ELF?).

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

---
 fs/Kconfig.binfmt |    6 +
 fs/Makefile       |    1 +
 fs/binfmt_img.c   |  324 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 331 insertions(+), 0 deletions(-)
 create mode 100644 fs/binfmt_img.c

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 79e2ca7..0b2f48e 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -161,3 +161,9 @@ config BINFMT_MISC
 	  You may say M here for module support and later load the module when
 	  you have use for it; the module is called binfmt_misc. If you
 	  don't know what to answer at this point, say Y.
+
+config BINFMT_IMG
+	tristate "Kernel support for IMG binaries"
+	depends on X86
+	help
+	  Say M/Y here to enable support for checkpoint-restore images execution
diff --git a/fs/Makefile b/fs/Makefile
index fb68c2b..8221719 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_NFSD_DEPRECATED)	+= nfsctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
 obj-$(CONFIG_BINFMT_EM86)	+= binfmt_em86.o
 obj-$(CONFIG_BINFMT_MISC)	+= binfmt_misc.o
+obj-$(CONFIG_BINFMT_IMG)	+= binfmt_img.o
 
 # binfmt_script is always there
 obj-y				+= binfmt_script.o
diff --git a/fs/binfmt_img.c b/fs/binfmt_img.c
new file mode 100644
index 0000000..9b09797
--- /dev/null
+++ b/fs/binfmt_img.c
@@ -0,0 +1,324 @@
+#include <linux/binfmt_img.h>
+#include <linux/module.h>
+#include <linux/binfmts.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/highmem.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+/*
+ * The binary handler to save and restore a single task state
+ */
+
+static int img_check_header(void *buf)
+{
+	struct binfmt_img_header *hdr = buf;
+
+	if (hdr->magic != BINFMT_IMG_MAGIC)
+		return -ENOEXEC;
+
+	if (hdr->version != BINFMT_IMG_VERS_0)
+		return -EINVAL;
+
+	return sizeof(*hdr);
+}
+
+static unsigned short decode_segment(__u16 seg)
+{
+	if (seg == CKPT_X86_SEG_NULL)
+		return 0;
+
+	if (seg == CKPT_X86_SEG_USER64_CS)
+		return __USER_CS;
+	if (seg == CKPT_X86_SEG_USER64_DS)
+		return __USER_DS;
+#ifdef CONFIG_COMPAT 
+	if (seg == CKPT_X86_SEG_USER32_CS)
+		return __USER32_CS;
+	if (seg == CKPT_X86_SEG_USER32_DS)
+		return __USER32_DS;
+#endif
+
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+static void decode_tls(struct desc_struct *d, __u64 val)
+{
+	d->a = (unsigned int)(val >> 32);
+	d->b = (unsigned int)(val & 0xFFFFFFFF);
+}
+
+static int img_restore_regs(struct linux_binprm *bprm, loff_t off, struct pt_regs *regs)
+{
+	int ret, i;
+	struct binfmt_regs_image regi;
+	struct thread_struct *th = &current->thread;
+	unsigned short seg;
+
+	ret = kernel_read(bprm->file, off, (char *)&regi, sizeof(regi));
+	if (ret != sizeof(regi))
+		return -EIO;
+
+	regs->r15 = regi.r15;
+	regs->r14 = regi.r14;
+	regs->r13 = regi.r13;
+	regs->r12 = regi.r12;
+	regs->r11 = regi.r11;
+	regs->r10 = regi.r10;
+	regs->r9 = regi.r9;
+	regs->r8 = regi.r8;
+	regs->ax = regi.ax;
+	regs->orig_ax = regi.orig_ax;
+	regs->bx = regi.bx;
+	regs->cx = regi.cx;
+	regs->dx = regi.dx;
+	regs->si = regi.si;
+	regs->di = regi.di;
+	regs->ip = regi.ip;
+	regs->flags = regi.flags;
+	regs->bp = regi.bp;
+	regs->sp = regi.sp;
+
+	regs->cs = decode_segment(regi.cs);
+	regs->ss = decode_segment(regi.ss);
+
+	th->usersp = regi.sp;
+	th->ds = decode_segment(regi.ds);
+	th->es = decode_segment(regi.es);
+	th->fsindex = decode_segment(regi.fsindex);
+	th->gsindex = decode_segment(regi.gsindex);
+
+	th->fs = regi.fs;
+	th->gs = regi.gs;
+
+	BUILD_BUG_ON(GDT_ENTRY_TLS_ENTRIES != CKPT_TLS_ENTRIES);
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
+		decode_tls(&th->tls_array[i], regi.tls[i]);
+
+	load_TLS(th, smp_processor_id());
+
+	seg = th->fsindex;
+	loadsegment(fs, seg);
+	savesegment(fs, seg);
+	if (seg != th->fsindex) {
+		printk("ERROR saving fs selector want %x, has %x\n",
+				(unsigned int)th->fsindex, (unsigned int)seg);
+		return -EFAULT;
+	}
+
+	if (th->fs)
+		wrmsrl(MSR_FS_BASE, th->fs);
+	load_gs_index(th->gsindex);
+	if (th->gs)
+		wrmsrl(MSR_KERNEL_GS_BASE, th->gs);
+
+	return sizeof(regi);
+}
+
+static int img_restore_mm(struct linux_binprm *bprm, loff_t off)
+{
+	int ret;
+	struct binfmt_mm_image mmi;
+	struct mm_struct *mm = current->mm;
+
+	ret = kernel_read(bprm->file, off, (char *)&mmi, sizeof(mmi));
+	if (ret != sizeof(mmi))
+		return -EIO;
+
+	mm->flags = mmi.flags;
+	mm->def_flags = mmi.def_flags;
+	mm->start_code = mmi.start_code;
+	mm->end_code = mmi.end_code;
+	mm->start_data = mmi.start_data;
+	mm->end_data = mmi.end_data;
+	mm->start_brk = mmi.start_brk;
+	mm->brk = mmi.brk;
+	mm->start_stack = mmi.start_stack;
+	mm->arg_start = mmi.arg_start;
+	mm->arg_end = mmi.arg_end;
+	mm->env_start = mmi.env_start;
+	mm->env_end = mmi.env_end;
+
+	if (mmi.exe_fd != 0) {
+		struct file *f;
+
+		f = fget(mmi.exe_fd);
+		if (f == NULL)
+			return -EBADF;
+
+		fput(mm->exe_file);
+		mm->exe_file = f;
+	}
+
+	return sizeof(mmi);
+}
+
+static int img_restore_vmas(struct linux_binprm *bprm, loff_t off)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	int len = 0;
+
+	do_munmap(mm, 0, TASK_SIZE);
+
+	while (1) {
+		struct binfmt_vma_image vmai;
+		unsigned long addr;
+		struct file *file = NULL;
+
+		len += sizeof(vmai);
+
+		ret = kernel_read(bprm->file, off, (char *)&vmai, sizeof(vmai));
+		if (ret != sizeof(vmai))
+			return -EIO;
+
+		if (vmai.start == 0 && vmai.end == 0)
+			break;
+
+		if (vmai.fd != 0) {
+			file = fget(vmai.fd);
+			if (file == NULL)
+				return -EBADF;
+		} else
+			vmai.flags |= MAP_ANONYMOUS;
+
+		if (vmai.start <= mm->start_stack && vmai.end >= mm->start_stack)
+			vmai.flags |= MAP_GROWSDOWN;
+
+		addr = do_mmap_pgoff(file, vmai.start, vmai.end - vmai.start,
+				vmai.prot, vmai.flags | MAP_FIXED, vmai.pgoff);
+
+		if (vmai.fd) {
+			fput(file);
+			do_close(vmai.fd);
+		}
+
+		if ((long)addr < 0 || (addr != vmai.start))
+			return -ENXIO;
+
+		off += sizeof(vmai);
+	}
+
+	return len;
+}
+
+static int img_restore_pages(struct linux_binprm *bprm, loff_t off)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	int len = 0;
+
+	while (1) {
+		struct binfmt_page_image pgi;
+		struct vm_area_struct *vma;
+		struct page *page;
+		void *pg_data;
+
+		ret = kernel_read(bprm->file, off, (char *)&pgi, sizeof(pgi));
+		if (ret != sizeof(pgi))
+			return -EIO;
+
+		len += sizeof(pgi);
+		if (pgi.vaddr == 0)
+			break;
+
+		vma = find_vma(mm, pgi.vaddr);
+		if (vma == NULL)
+			return -ESRCH;
+
+		ret = get_user_pages(current, current->mm, (unsigned long)pgi.vaddr,
+				1, 1, 1, &page, NULL);
+		if (ret != 1)
+			return -EFAULT;
+
+		pg_data = kmap(page);
+		ret = kernel_read(bprm->file, off + sizeof(pgi), pg_data, PAGE_SIZE);
+		kunmap(page);
+		put_page(page);
+
+		if (ret != PAGE_SIZE)
+			return -EFAULT;
+
+		len += PAGE_SIZE;
+		off += sizeof(pgi) + PAGE_SIZE;
+	}
+
+	return len;
+}
+
+static int img_restore_mem(struct linux_binprm *bprm, loff_t off)
+{
+	int ret;
+	loff_t len = off;
+
+	ret = img_restore_mm(bprm, len);
+	if (ret < 0)
+		return ret;
+
+	len += ret;
+	ret = img_restore_vmas(bprm, len);
+	if (ret < 0)
+		return ret;
+
+	len += ret;
+	ret = img_restore_pages(bprm, len);
+	if (ret < 0)
+		return ret;
+
+	len += ret;
+	return len;
+
+}
+
+static int img_load_binary(struct linux_binprm * bprm, struct pt_regs * regs)
+{
+	int ret;
+	loff_t len = 0;
+
+	ret = img_check_header(bprm->buf);
+	if (ret < 0)
+		return ret;
+
+	len += ret;
+	ret = img_restore_regs(bprm, len, regs);
+	if (ret < 0)
+		return ret;
+
+	len += ret;
+	ret = img_restore_mem(bprm, len);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+static struct linux_binfmt img_binfmt = {
+	.module = THIS_MODULE,
+	.load_binary = img_load_binary,
+};
+
+static __init int img_binfmt_init(void)
+{
+	return register_binfmt(&img_binfmt);
+}
+
+static __exit void img_binfmt_exit(void)
+{
+	unregister_binfmt(&img_binfmt);
+}
+
+module_init(img_binfmt_init);
+module_exit(img_binfmt_exit);
+MODULE_LICENSE("GPL");
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [TOOLS] To make use of the patches
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (6 preceding siblings ...)
  2011-07-15 13:48   ` [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler Pavel Emelyanov
@ 2011-07-15 13:49   ` Pavel Emelyanov
       [not found]     ` <4E204554.6040901-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-15 15:01   ` [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace Tejun Heo
                     ` (2 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-15 13:49 UTC (permalink / raw)
  To: Nathan Lynch, Oren Laadan, Daniel Lezcano, Serge Hallyn, Tejun Heo
  Cc: Cyrill Gorcunov, Linux Containers, Glauber Costa

[-- Attachment #1: Type: text/plain, Size: 70 bytes --]

Additionally the binfmt_img.h from kernel is required for cr-restore.

[-- Attachment #2: cr-dump.c --]
[-- Type: text/plain, Size: 14228 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <dirent.h>
#include <string.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <errno.h>
#include <linux/kdev_t.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/vfs.h>

#include <linux/types.h>
#include "img_structs.h"

static int fdinfo_img;
static int pages_img;
static int core_img;
static int shmem_img;
static int pipes_img;

#define PIPEFS_MAGIC 0x50495045

static int prep_img_files(int pid)
{
	__u32 type;
	char name[64];

	sprintf(name, "fdinfo-%d.img", pid);
	fdinfo_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fdinfo_img < 0) {
		perror("Can't open fdinfo");
		return 1;
	}

	type = FDINFO_MAGIC;
	write(fdinfo_img, &type, 4);

	sprintf(name, "pages-%d.img", pid);
	pages_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (pages_img < 0) {
		perror("Can't open shmem");
		return 1;
	}

	type = PAGES_MAGIC;
	write(pages_img, &type, 4);

	sprintf(name, "core-%d.img", pid);
	core_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (core_img < 0) {
		perror("Can't open core");
		return 1;
	}

	sprintf(name, "shmem-%d.img", pid);
	shmem_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (shmem_img < 0) {
		perror("Can't open shmem");
		return 1;
	}

	type = SHMEM_MAGIC;
	write(shmem_img, &type, 4);

	sprintf(name, "pipes-%d.img", pid);
	pipes_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (pipes_img < 0) {
		perror("Can't open pipes");
		return 1;
	}

	type = PIPES_MAGIC;
	write(pipes_img, &type, 4);

	return 0;
}

static void kill_imgfiles(int pid)
{
	/* FIXME */
}

static int stop_task(int pid)
{
	return kill(pid, SIGSTOP);
}

static void continue_task(int pid)
{
	if (kill(pid, SIGCONT))
		perror("Can't cont task");
}

static char big_tmp_str[PATH_MAX];

static int read_fd_params(int pid, char *fd, unsigned long *pos, unsigned int *flags)
{
	char fd_str[128];
	int ifd;

	sprintf(fd_str, "/proc/%d/fdinfo/%s", pid, fd);

	printf("\tGetting fdinfo for fd %s\n", fd);
	ifd = open(fd_str, O_RDONLY);
	if (ifd < 0) {
		perror("Can't open fdinfo");
		return 1;
	}

	read(ifd, big_tmp_str, sizeof(big_tmp_str));
	close(ifd);

	sscanf(big_tmp_str, "pos:\t%lli\nflags:\t%o\n", pos, flags);
	return 0;
}

static int dump_one_reg_file(int type, unsigned long fd_name, int lfd,
		int lclose, unsigned long pos, unsigned int flags)
{
	char fd_str[128];
	int len;
	struct fdinfo_entry e;

	sprintf(fd_str, "/proc/self/fd/%d", lfd);
	len = readlink(fd_str, big_tmp_str, sizeof(big_tmp_str) - 1);
	if (len < 0) {
		perror("Can't readlink fd");
		return 1;
	}

	big_tmp_str[len] = '\0';
	printf("\tDumping path for %x fd via self %d [%s]\n", fd_name, lfd, big_tmp_str);

	if (lclose)
		close(lfd);

	e.type = type;
	e.addr = fd_name;
	e.len = len;
	e.pos = pos;
	e.flags = flags;

	write(fdinfo_img, &e, sizeof(e));
	write(fdinfo_img, big_tmp_str, len);

	return 0;
}

#define MAX_PIPE_BUF_SIZE	1024 /* FIXME - this is not so */
#define SPLICE_F_NONBLOCK	0x2

static int dump_pipe_and_data(int lfd, struct pipes_entry *e)
{
	int steal_pipe[2];
	int ret;

	printf("\tDumping data from pipe %x\n", e->pipeid);
	if (pipe(steal_pipe) < 0) {
		perror("Can't create pipe for stealing data");
		return 1;
	}

	ret = tee(lfd, steal_pipe[1], MAX_PIPE_BUF_SIZE, SPLICE_F_NONBLOCK);
	if (ret < 0) {
		if (errno != EAGAIN) {
			perror("Can't pick pipe data");
			return 1;
		}

		ret = 0;
	}

	e->bytes = ret;
	write(pipes_img, e, sizeof(*e));

	if (ret) {
		ret = splice(steal_pipe[0], NULL, pipes_img, NULL, ret, 0);
		if (ret < 0) {
			perror("Can't push pipe data");
			return 1;
		}
	}

	close(steal_pipe[0]);
	close(steal_pipe[1]);
	return 0;
}

static int dump_one_pipe(int fd, int lfd, unsigned int id, unsigned int flags)
{
	struct pipes_entry e;

	printf("\tDumping pipe %d/%x flags %x\n", fd, id, flags);

	e.fd = fd;
	e.pipeid = id;
	e.flags = flags;

	if (flags & O_WRONLY) {
		e.bytes = 0;
		write(pipes_img, &e, sizeof(e));
		return 0;
	}

	return dump_pipe_and_data(lfd, &e);
}

static int dump_one_fd(int dir, char *fd_name, unsigned long pos, unsigned int flags)
{
	int fd;
	struct stat st_buf;
	struct statfs stfs_buf;

	printf("\tDumping fd %s\n", fd_name);
	fd = openat(dir, fd_name, O_RDONLY);
	if (fd == -1) {
		printf("Tried to openat %d/%d %s\n", getpid(), dir, fd_name);
		perror("Can't open fd");
		return 1;
	}

	if (fstat(fd, &st_buf) < 0) {
		perror("Can't stat one");
		return 1;
	}

	if (S_ISREG(st_buf.st_mode))
		return dump_one_reg_file(FDINFO_FD, atoi(fd_name), fd, 1, pos, flags);

	if (S_ISFIFO(st_buf.st_mode)) {
		if (fstatfs(fd, &stfs_buf) < 0) {
			perror("Can't statfs one");
			return 1;
		}

		if (stfs_buf.f_type == PIPEFS_MAGIC)
			return dump_one_pipe(atoi(fd_name), fd, st_buf.st_ino, flags);
	}

	if (!strcmp(fd_name, "0")) {
		printf("\tSkipping stdin\n");
		return 0;
	}

	if (!strcmp(fd_name, "1")) {
		printf("\tSkipping stdout\n");
		return 0;
	}

	if (!strcmp(fd_name, "2")) {
		printf("\tSkipping stderr\n");
		return 0;
	}

	fprintf(stderr, "Can't dump file %s of that type [%x]\n", fd_name, st_buf.st_mode);
	return 1;

}

static int dump_task_files(int pid)
{
	char pid_fd_dir[64];
	DIR *fd_dir;
	struct dirent *de;
	unsigned long pos;
	unsigned int flags;

	printf("Dumping open files for %d\n", pid);

	sprintf(pid_fd_dir, "/proc/%d/fd", pid);
	fd_dir = opendir(pid_fd_dir);
	if (fd_dir == NULL) {
		perror("Can't open fd dir");
		return -1;
	}

	while ((de = readdir(fd_dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;

		if (read_fd_params(pid, de->d_name, &pos, &flags))
			return 1;

		if (dump_one_fd(dirfd(fd_dir), de->d_name, pos, flags))
			return 1;
	}

	closedir(fd_dir);
	return 0;
}

#define PAGE_SIZE	4096
#define PAGE_RSS	0x1

static unsigned long rawhex(char *str, char **end)
{
	unsigned long ret = 0;

	while (1) {
		if (str[0] >= '0' && str[0] <= '9') {
			ret <<= 4;
			ret += str[0] - '0';
		} else if (str[0] >= 'a' && str[0] <= 'f') {
			ret <<= 4;
			ret += str[0] - 'a' + 0xA;
		} else if (str[0] >= 'A' && str[0] <= 'F') {
			ret <<= 4;
			ret += str[0] - 'A' + 0xA;
		} else {
			if (end)
				*end = str;
			return ret;
		}

		str++;
	}
}

static void map_desc_parm(char *desc, unsigned long *pgoff, unsigned long *len)
{
	char *s;
	unsigned long start, end;

	start = rawhex(desc, &s);
	if (*s != '-') {
		goto bug;
	}

	end = rawhex(s + 1, &s);
	if (*s != ' ') {
		goto bug;
	}

	s = strchr(s + 1, ' ');
	*pgoff = rawhex(s + 1, &s);
	if (*s != ' ') {
		goto bug;
	}

	if (start > end)
		goto bug;

	*len = end - start;

	if (*len % PAGE_SIZE) {
		goto bug;
	}
	if (*pgoff % PAGE_SIZE) {
		goto bug;
	}

	return;
bug:
	fprintf(stderr, "BUG\n");
	exit(1);
}

static int dump_map_pages(int lfd, unsigned long start, unsigned long pgoff, unsigned long len)
{
	unsigned int nrpages, pfn;
	void *mem;
	unsigned char *mc;

	printf("\t\tDumping pages start %x len %x off %x\n", start, len, pgoff);
	mem = mmap(NULL, len, PROT_READ, MAP_FILE | MAP_PRIVATE, lfd, pgoff);
	if (mem == MAP_FAILED) {
		perror("Can't map");
		return 1;
	}

	nrpages = len / PAGE_SIZE;
	mc = malloc(nrpages);
	if (mincore(mem, len, mc)) {
		perror("Can't mincore mapping");
		return 1;
	}

	for (pfn = 0; pfn < nrpages; pfn++)
		if (mc[pfn] & PAGE_RSS) {
			__u64 vaddr;

			vaddr = start + pfn * PAGE_SIZE;
			write(pages_img, &vaddr, 8);
			write(pages_img, mem + pfn * PAGE_SIZE, PAGE_SIZE);
		}

	munmap(mem, len);

	return 0;
}

static int dump_anon_private_map(char *start)
{
	printf("\tSkipping anon private mapping at %s\n", start);
	return 0;
}

static int dump_anon_shared_map(char *_start, char *mdesc, int lfd, struct stat *st)
{
	unsigned long pgoff, len;
	struct shmem_entry e;
	unsigned long start;
	struct stat buf;

	map_desc_parm(mdesc, &pgoff, &len);

	start = rawhex(_start, NULL);
	e.start = start;
	e.end = start + len;
	e.shmid = st->st_ino;

	write(shmem_img, &e, sizeof(e));

	if (dump_map_pages(lfd, start, pgoff, len))
		return 1;

	close(lfd);
	return 0;
}

static int dump_file_shared_map(char *start, char *mdesc, int lfd)
{
	printf("\tSkipping file shared mapping at %s\n", start);
	close(lfd);
	return 0;
}

static int dump_file_private_map(char *_start, char *mdesc, int lfd)
{
	unsigned long pgoff, len;
	unsigned long start;

	map_desc_parm(mdesc, &pgoff, &len);

	start = rawhex(_start, NULL);
	if (dump_one_reg_file(FDINFO_MAP, start, lfd, 0, 0, O_RDONLY))
		return 1;

	close(lfd);
	return 0;
}

static int dump_one_mapping(char *mdesc, DIR *mfd_dir)
{
	char *flags, *tmp;
	char map_start[32];
	int lfd;
	struct stat st_buf;

	tmp = strchr(mdesc, '-');
	memset(map_start, 0, sizeof(map_start));
	strncpy(map_start, mdesc, tmp - mdesc);
	flags = strchr(mdesc, ' ');
	flags++;

	printf("\tDumping %s\n", map_start);
	lfd = openat(dirfd(mfd_dir), map_start, O_RDONLY);
	if (lfd == -1) {
		if (errno != ENOENT) {
			perror("Can't open mapping");
			return 1;
		}

		if (flags[3] != 'p') {
			fprintf(stderr, "Bogus mapping [%s]\n", mdesc);
			return 1;
		}

		return dump_anon_private_map(map_start);
	}

	if (fstat(lfd, &st_buf) < 0) {
		perror("Can't stat mapping!");
		return 1;
	}

	if (!S_ISREG(st_buf.st_mode)) {
		perror("Can't handle non-regular mapping");
		return 1;
	}

	if (MAJOR(st_buf.st_dev) == 0) {
		if (flags[3] != 's') {
			fprintf(stderr, "Bogus mapping [%s]\n", mdesc);
			return 1;
		}

		/* FIXME - this can be tmpfs visible file mapping */
		return dump_anon_shared_map(map_start, mdesc, lfd, &st_buf);
	}

	if (flags[3] == 'p')
		return dump_file_private_map(map_start, mdesc, lfd);
	else
		return dump_file_shared_map(map_start, mdesc, lfd);
}

static int dump_task_ext_mm(int pid)
{
	char path[64];
	DIR *mfd_dir;
	FILE *maps;

	printf("Dumping mappings for %d\n", pid);

	sprintf(path, "/proc/%d/mfd", pid);
	mfd_dir = opendir(path);
	if (mfd_dir == NULL) {
		perror("Can't open mfd dir");
		return -1;
	}

	sprintf(path, "/proc/%d/maps", pid);
	maps = fopen(path, "r");
	if (maps == NULL) {
		perror("Can't open maps file");
		return 1;
	}

	while (fgets(big_tmp_str, sizeof(big_tmp_str), maps) != NULL)
		if (dump_one_mapping(big_tmp_str, mfd_dir))
			return 1;

	fclose(maps);
	closedir(mfd_dir);
	return 0;
}

static int dump_task_state(int pid)
{
	char path[64];
	int dump_fd;
	void *mem;

	printf("Dumping task image for %d\n", pid);
	sprintf(path, "/proc/%d/dump", pid);
	dump_fd = open(path, O_RDONLY);
	if (dump_fd < 0) {
		perror("Can't open dump file");
		return 1;
	}

	mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, 0, 0);
	if (mem == MAP_FAILED) {
		perror("Can't get mem");
		return 1;
	}

	while (1) {
		int r, w;

		r = read(dump_fd, mem, 4096);
		if (r == 0)
			break;
		if (r < 0) {
			perror("Can't read dump file");
			return 1;
		}

		w = 0;
		while (w < r) {
			int ret;

			ret = write(core_img, mem + w, r - w);
			if (ret <= 0) {
				perror("Can't write core");
				return 1;
			}

			w += ret;
		}
	}

	munmap(mem, 4096);
	close(dump_fd);

	return 0;
}

static int dump_one_task(int pid, int stop)
{
	printf("Dumping task %d\n", pid);

	if (prep_img_files(pid))
		return 1;

	if (stop && stop_task(pid))
		goto err_task;

	if (dump_task_files(pid))
		goto err;

	if (dump_task_ext_mm(pid))
		goto err;

	if (dump_task_state(pid))
		goto err;

	if (stop)
		continue_task(pid);

	printf("Dump is complete\n");
	return 0;

err:
	if (stop)
		continue_task(pid);
err_task:
	kill_imgfiles(pid);
	return 1;
}

static int pstree_fd;
static char big_tmp_str[4096];
static int *pids, nr_pids;

static char *get_children_pids(int pid)
{
	FILE *f;
	int len;
	char *ret, *tmp;

	sprintf(big_tmp_str, "/proc/%d/status", pid);
	f = fopen(big_tmp_str, "r");
	if (f == NULL)
		return NULL;

	while ((fgets(big_tmp_str, sizeof(big_tmp_str), f)) != NULL) {
		if (strncmp(big_tmp_str, "Children:", 9))
			continue;

		tmp = big_tmp_str + 10;
		len = strlen(tmp);
		ret = malloc(len + 1);
		strcpy(ret, tmp);
		if (len)
			ret[len - 1] = ' ';

		fclose(f);
		return ret;
	}

	fclose(f);
	return NULL;
}

static int dump_pid_and_children(int pid)
{
	struct pstree_entry e;
	char *chlist, *tmp, *tmp2;

	printf("\tReading %d children list\n", pid);
	chlist = get_children_pids(pid);
	if (chlist == NULL)
		return 1;

	printf("\t%d has children %s\n", pid, chlist);

	e.pid = pid;
	e.nr_children = 0;

	pids = realloc(pids, (nr_pids + 1) * sizeof(int));
	pids[nr_pids++] = e.pid;

	tmp = chlist;
	while ((tmp = strchr(tmp, ' ')) != NULL) {
		tmp++;
		e.nr_children++;
	}

	write(pstree_fd, &e, sizeof(e));
	tmp = chlist;
	while (1) {
		__u32 cpid;

		cpid = strtol(tmp, &tmp, 10);
		if (cpid == 0)
			break;
		if (*tmp != ' ') {
			fprintf(stderr, "Error in string with children!\n");
			return 1;
		}

		write(pstree_fd, &cpid, sizeof(cpid));
		tmp++;
	}

	tmp = chlist;
	while ((tmp2 = strchr(tmp, ' ')) != NULL) {
		*tmp2 = '\0';
		if (dump_pid_and_children(atoi(tmp)))
			return 1;
		tmp = tmp2 + 1;
	}

	free(chlist);
	return 0;
}

static int __dump_all_tasks(void)
{
	int i, pid;

	printf("Dumping tasks' images for");
	for (i = 0; i < nr_pids; i++)
		printf(" %d", pids[i]);
	printf("\n");

	printf("Stopping tasks\n");
	for (i = 0; i < nr_pids; i++)
		if (stop_task(pids[i]))
			goto err;

	for (i = 0; i < nr_pids; i++) {
		if (dump_one_task(pids[i], 0))
			goto err;
	}

	printf("Resuming tasks\n");
	for (i = 0; i < nr_pids; i++)
		continue_task(pids[i]);

	return 0;

err:
	for (i = 0; i < nr_pids; i++)
		continue_task(pids[i]);
	return 1;

}

static int dump_all_tasks(int pid)
{
	char *chlist;
	__u32 type;

	pids = NULL;
	nr_pids = 0;

	printf("Dumping process tree, start from %d\n", pid);

	sprintf(big_tmp_str, "pstree-%d.img", pid);
	pstree_fd = open(big_tmp_str, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (pstree_fd < 0) {
		perror("Can't create pstree");
		return 1;
	}

	type = PSTREE_MAGIC;
	write(pstree_fd, &type, sizeof(type));

	if (dump_pid_and_children(pid))
		return 1;

	close(pstree_fd);

	return __dump_all_tasks();
}

int main(int argc, char **argv)
{
	if (argc != 3)
		goto usage;
	if (argv[1][0] != '-')
		goto usage;
	if (argv[1][1] == 'p')
		return dump_one_task(atoi(argv[2]), 1);
	if (argv[1][1] == 't')
		return dump_all_tasks(atoi(argv[2]));

usage:
	printf("Usage: %s (-p|-t) <pid>\n", argv[0]);
	return 1;
}

[-- Attachment #3: cr-restore.c --]
[-- Type: text/plain, Size: 19947 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <dirent.h>
#include <string.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <errno.h>
#include <linux/kdev_t.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/sendfile.h>

#define PAGE_SIZE	4096

#include <linux/types.h>
#include "img_structs.h"
#include "binfmt_img.h"

struct fmap_fd {
	unsigned long start;
	int fd;
	struct fmap_fd *next;
};

static struct fmap_fd *fmap_fds;

struct shmem_info {
	unsigned long start;
	unsigned long end;
	unsigned long id;
	int pid;
	int real_pid;
};

static struct shmem_info *shmems;
static int nr_shmems;

struct pipes_info {
	unsigned int id;
	int pid;
	int real_pid;
	int read_fd;
	int write_fd;
	int users;
};

static struct pipes_info *pipes;
static int nr_pipes;

static void show_saved_shmems(void)
{
	int i;

	printf("\tSaved shmems:\n");
	for (i = 0; i < nr_shmems; i++)
		printf("\t\t%016lx %lx %d\n", shmems[i].start, shmems[i].id, shmems[i].pid);
}

static void show_saved_pipes(void)
{
	int i;

	printf("\tSaved pipes:\n");
	for (i = 0; i < nr_pipes; i++)
		printf("\t\t%x -> %d\n", pipes[i].id, pipes[i].pid);
}

static struct shmem_info *search_shmem(unsigned long addr, unsigned long id)
{
	int i;

	for (i = 0; i < nr_shmems; i++) {
		struct shmem_info *si;

		si = shmems + i;
		if (si->start <= addr && si->end >= addr && si->id == id)
			return si;
	}

	return NULL;
}

static struct pipes_info *search_pipes(unsigned int pipeid)
{
	int i;

	for (i = 0; i < nr_pipes; i++) {
		struct pipes_info *pi;

		pi = pipes + i;
		if (pi->id == pipeid)
			return pi;
	}

	return NULL;
}

static void shmem_update_real_pid(int vpid, int rpid)
{
	int i;

	for (i = 0; i < nr_shmems; i++)
		if (shmems[i].pid == vpid)
			shmems[i].real_pid = rpid;
}

static int shmem_wait_and_open(struct shmem_info *si)
{
	/* FIXME - not good */
	char path[128];
	unsigned long time = 1000;

	sleep(1);

	while (si->real_pid == 0)
		usleep(time);

	sprintf(path, "/proc/%d/mfd/0x%lx", si->real_pid, si->start);
	while (1) {
		int ret;

		ret = open(path, O_RDWR);
		if (ret > 0)
			return ret;

		if (ret < 0 && errno != ENOENT) {
			perror("     Can't stat shmem");
			return -1;
		}

		printf("Waiting for [%s] to appear\n", path);
		if (time < 20000000)
			time <<= 1;
		usleep(time);
	}
}

static int try_to_add_shmem(int pid, struct shmem_entry *e)
{
	int i;

	for (i = 0; i < nr_shmems; i++) {
		if (shmems[i].start != e->start || shmems[i].id != e->shmid)
			continue;

		if (shmems[i].end != e->end) {
			printf("Bogus shmem\n");
			return 1;
		}

		if (shmems[i].pid > pid)
			shmems[i].pid = pid;

		return 0;
	}

	if ((nr_shmems + 1) * sizeof(struct shmem_info) >= 4096) {
		printf("OOM storing shmems\n");
		return 1;
	}

	shmems[nr_shmems].start = e->start;
	shmems[nr_shmems].end = e->end;
	shmems[nr_shmems].id = e->shmid;
	shmems[nr_shmems].pid = pid;
	shmems[nr_shmems].real_pid = 0;
	nr_shmems++;

	return 0;
}

static int try_to_add_pipe(int pid, struct pipes_entry *e, int p_fd)
{
	int i;

	for (i = 0; i < nr_pipes; i++) {
		if (pipes[i].id != e->pipeid)
			continue;

		if (pipes[i].pid > pid)
			pipes[i].pid = pid;
		pipes[i].users++;

		return 0;
	}

	if ((nr_pipes + 1) * sizeof(struct pipes_info) >= 4096) {
		printf("OOM storing pipes\n");
		return 1;
	}

	pipes[nr_pipes].id = e->pipeid;
	pipes[nr_pipes].pid = pid;
	pipes[nr_pipes].real_pid = 0;
	pipes[nr_pipes].read_fd = 0;
	pipes[nr_pipes].write_fd = 0;
	pipes[nr_pipes].users = 1;
	nr_pipes++;

	return 0;
}

static int prepare_shmem_pid(int pid)
{
	char path[64];
	int sh_fd;
	__u32 type = 0;

	sprintf(path, "shmem-%d.img", pid);
	sh_fd = open(path, O_RDONLY);
	if (sh_fd < 0) {
		perror("Can't open shmem info");
		return 1;
	}

	read(sh_fd, &type, sizeof(type));
	if (type != SHMEM_MAGIC) {
		perror("Bad shmem magic");
		return 1;
	}

	while (1) {
		struct shmem_entry e;
		int ret;

		ret = read(sh_fd, &e, sizeof(e));
		if (ret == 0)
			break;
		if (ret != sizeof(e)) {
			perror("Can't read shmem entry");
			return 1;
		}

		if (try_to_add_shmem(pid, &e))
			return 1;
	}

	close(sh_fd);
	return 0;
}

static int prepare_pipes_pid(int pid)
{
	char path[64];
	int p_fd;
	__u32 type = 0;

	sprintf(path, "pipes-%d.img", pid);
	p_fd = open(path, O_RDONLY);
	if (p_fd < 0) {
		perror("Can't open pipes image");
		return 1;
	}

	read(p_fd, &type, sizeof(type));
	if (type != PIPES_MAGIC) {
		perror("Bad pipes magin");
		return 1;
	}

	while (1) {
		struct pipes_entry e;
		int ret;

		ret = read(p_fd, &e, sizeof(e));
		if (ret == 0)
			break;
		if (ret != sizeof(e)) {
			fprintf(stderr, "Read pipes for %s failed %d of %d read\n",
					path, ret, sizeof(e));
			perror("Can't read pipes entry");
			return 1;
		}

		if (try_to_add_pipe(pid, &e, p_fd))
			return 1;

		lseek(p_fd, e.bytes, SEEK_CUR);
	}

	close(p_fd);
	return 0;
}

static int prepare_shared(int ps_fd)
{
	printf("Preparing info about shared resources\n");

	nr_shmems = 0;
	shmems = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANON, 0, 0);
	if (shmems == MAP_FAILED) {
		perror("Can't map shmems");
		return 1;
	}

	pipes = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANON, 0, 0);
	if (pipes == MAP_FAILED) {
		perror("Can't map pipes");
		return 1;
	}

	while (1) {
		struct pstree_entry e;
		int ret;

		ret = read(ps_fd, &e, sizeof(e));
		if (ret == 0)
			break;

		if (ret != sizeof(e)) {
			perror("Can't read ps");
			return 1;
		}

		if (prepare_shmem_pid(e.pid))
			return 1;

		if (prepare_pipes_pid(e.pid))
			return 1;

		lseek(ps_fd, e.nr_children * sizeof(__u32), SEEK_CUR);
	}

	lseek(ps_fd, sizeof(__u32), SEEK_SET);

	show_saved_shmems();
	show_saved_pipes();

	return 0;
}

static struct fmap_fd *pop_fmap_fd(unsigned long start)
{
	struct fmap_fd **p, *r;

	for (p = &fmap_fds; *p != NULL; p = &(*p)->next) {
		if ((*p)->start != start)
			continue;

		r = *p;
		*p = r->next;
		return r;
	}

	return NULL;
}

static int open_fe_fd(struct fdinfo_entry *fe, int fd)
{
	char path[PATH_MAX];
	int tmp;

	if (read(fd, path, fe->len) != fe->len) {
		fprintf(stderr, "Error reading path");
		return -1;
	}

	path[fe->len] = '\0';

	tmp = open(path, fe->flags);
	if (tmp < 0) {
		perror("Can't open file");
		return -1;
	}

	lseek(tmp, fe->pos, SEEK_SET);

	return tmp;
}

static int reopen_fd(int old_fd, int new_fd)
{
	int tmp;

	if (old_fd != new_fd) {
		tmp = dup2(old_fd, new_fd);
		if (tmp < 0)
			return tmp;

		close(old_fd);
	}

	return new_fd;
}

static int open_fd(int pid, struct fdinfo_entry *fe, int *cfd)
{
	int fd, tmp;

	if (*cfd == (int)fe->addr) {
		tmp = dup(*cfd);
		if (tmp < 0) {
			perror("Can't dup file");
			return 1;
		}

		*cfd = tmp;
	}

	tmp = open_fe_fd(fe, *cfd);
	if (tmp < 0)
		return 1;

	fd = reopen_fd(tmp, (int)fe->addr);
	if (fd < 0) {
		perror("Can't dup");
		return 1;
	}

	return 0;
}

static int open_fmap(int pid, struct fdinfo_entry *fe, int fd)
{
	int tmp;
	struct fmap_fd *new;

	tmp = open_fe_fd(fe, fd);
	if (tmp < 0)
		return 1;

	printf("%d:\t\tWill map %x to %d\n", pid, fe->addr, tmp);
	new = malloc(sizeof(*new));
	new->start = fe->addr;
	new->fd = tmp;
	new->next = fmap_fds;
	fmap_fds = new;

	return 0;
}

static int prepare_fds(int pid)
{
	__u32 mag;
	char path[64];
	int fdinfo_fd;

	printf("%d: Opening files\n", pid);

	sprintf(path, "fdinfo-%d.img", pid);
	fdinfo_fd = open(path, O_RDONLY);
	if (fdinfo_fd < 0) {
		perror("Can't open fdinfo");
		return 1;
	}

	read(fdinfo_fd, &mag, 4);
	if (mag != FDINFO_MAGIC) {
		fprintf(stderr, "Bad file\n");
		return 1;
	}

	while (1) {
		int ret;
		struct fdinfo_entry fe;

		ret = read(fdinfo_fd, &fe, sizeof(fe));
		if (ret == 0) {
			close(fdinfo_fd);
			return 0;
		}

		if (ret < 0) {
			perror("Can't read file");
			return 1;
		}
		if (ret != sizeof(fe)) {
			fprintf(stderr, "Error reading\n");
			return 1;
		}

		printf("\t%d: Got fd for %lx type %d namelen %d\n", pid,
				(unsigned long)fe.addr, fe.type, fe.len);
		switch (fe.type) {
		case FDINFO_FD:
			if (open_fd(pid, &fe, &fdinfo_fd))
				return 1;

			break;
		case FDINFO_MAP:
			if (open_fmap(pid, &fe, fdinfo_fd))
				return 1;

			break;
		default:
			fprintf(stderr, "Some bullshit in a file\n");
			return 1;
		}
	}
}

struct shmem_to_id {
	unsigned long addr;
	unsigned long end;
	unsigned long id;
	struct shmem_to_id *next;
};

static struct shmem_to_id *my_shmem_ids;

static unsigned long find_shmem_id(unsigned long addr)
{
	struct shmem_to_id *si;

	for (si = my_shmem_ids; si != NULL; si = si->next)
		if (si->addr <= addr && si->end >= addr)
			return si->id;

	return 0;
}

static void save_shmem_id(struct shmem_entry *e)
{
	struct shmem_to_id *si;

	si = malloc(sizeof(*si));
	si->addr = e->start;
	si->end = e->end;
	si->id = e->shmid;
	si->next = my_shmem_ids;
	my_shmem_ids = si;
}

static int prepare_shmem(int pid)
{
	char path[64];
	int sh_fd;
	__u32 type = 0;

	sprintf(path, "shmem-%d.img", pid);
	sh_fd = open(path, O_RDONLY);
	if (sh_fd < 0) {
		perror("Can't open shmem info");
		return 1;
	}

	read(sh_fd, &type, sizeof(type));
	if (type != SHMEM_MAGIC) {
		perror("Bad shmem magic");
		return 1;
	}

	while (1) {
		struct shmem_entry e;
		int ret;

		ret = read(sh_fd, &e, sizeof(e));
		if (ret == 0)
			break;
		if (ret != sizeof(e)) {
			perror("Can't read shmem entry");
			return 1;
		}

		save_shmem_id(&e);
	}

	close(sh_fd);
	return 0;
}

static int try_fixup_file_map(int pid, struct binfmt_vma_image *vi, int fd)
{
	struct fmap_fd *fmfd;

	fmfd = pop_fmap_fd(vi->start);
	if (fmfd != NULL) {
		printf("%d: Fixing %lx vma to %d fd\n", pid, vi->start, fmfd->fd);
		lseek(fd, -sizeof(*vi), SEEK_CUR);
		vi->fd = fmfd->fd;
		if (write(fd, vi, sizeof(*vi)) != sizeof(*vi)) {
			perror("Can't write img");
			return 1;
		}

		free(fmfd);
	}

	return 0;
}

static int try_fixup_shared_map(int pid, struct binfmt_vma_image *vi, int fd)
{
	struct shmem_info *si;
	unsigned long id;

	id = find_shmem_id(vi->start);
	if (id == 0)
		return 0;

	si = search_shmem(vi->start, id);
	printf("%d: Search for %016lx shmem %p/%d\n", pid, vi->start, si, si ? si->pid : -1);

	if (si == NULL) {
		fprintf(stderr, "Can't find my shmem %016lx\n", vi->start);
		return 1;
	}

	if (si->pid != pid) {
		int sh_fd;

		sh_fd = shmem_wait_and_open(si);
		printf("%d: Fixing %lx vma to %x/%d shmem -> %d\n", pid, vi->start, si->id, si->pid, sh_fd);
		if (fd < 0) {
			perror("Can't open shmem");
			return 1;
		}

		lseek(fd, -sizeof(*vi), SEEK_CUR);
		vi->fd = sh_fd;
		if (write(fd, vi, sizeof(*vi)) != sizeof(*vi)) {
			perror("Can't write img");
			return 1;
		}
	}

	return 0;
}

static int fixup_vma_fds(int pid, int fd)
{
	lseek(fd, sizeof(struct binfmt_img_header) +
			sizeof(struct binfmt_regs_image) +
			sizeof(struct binfmt_mm_image), SEEK_SET);

	while (1) {
		struct binfmt_vma_image vi;

		if (read(fd, &vi, sizeof(vi)) != sizeof(vi)) {
			perror("Can't read");
			return 1;
		}

		if (vi.start == 0 && vi.end == 0)
			return 0;

		printf("%d: Fixing %016lx-%016lx %016lx vma\n", pid, vi.start, vi.end, vi.pgoff);
		if (try_fixup_file_map(pid, &vi, fd))
			return 1;

		if (try_fixup_shared_map(pid, &vi, fd))
			return 1;
	}
}

static inline int should_restore_page(int pid, unsigned long vaddr)
{
	struct shmem_info *si;
	unsigned long id;

	id = find_shmem_id(vaddr);
	if (id == 0)
		return 1;

	si = search_shmem(vaddr, id);
	return si->pid == pid;
}

static int fixup_pages_data(int pid, int fd)
{
	char path[128];
	int shfd;
	__u32 mag;
	__u64 vaddr;

	sprintf(path, "pages-%d.img", pid);
	shfd = open(path, O_RDONLY);
	if (shfd < 0) {
		perror("Can't open shmem image");
		return 1;
	}

	read(shfd, &mag, sizeof(mag));
	if (mag != PAGES_MAGIC) {
		fprintf(stderr, "Bad shmem image\n");
		return 1;
	}

	lseek(fd, -sizeof(struct binfmt_page_image), SEEK_END);
	read(fd, &vaddr, sizeof(vaddr));
	if (vaddr != 0) {
		printf("SHIT %lx\n", (unsigned long)vaddr);
		return 1;
	}
	lseek(fd, -sizeof(struct binfmt_page_image), SEEK_END);

	while (1) {
		int ret;

		ret = read(shfd, &vaddr, sizeof(vaddr));
		if (ret == 0)
			break;

		if (ret < 0 || ret != sizeof(vaddr)) {
			perror("Can't read vaddr");
			return 1;
		}

		if (vaddr == 0)
			break;

		if (!should_restore_page(pid, vaddr)) {
			lseek(shfd, PAGE_SIZE, SEEK_CUR);
			continue;
		}

//		printf("Copy page %lx to image\n", (unsigned long)vaddr);
		write(fd, &vaddr, sizeof(vaddr));
		sendfile(fd, shfd, NULL, PAGE_SIZE);
	}

	close(shfd);
	vaddr = 0;
	write(fd, &vaddr, sizeof(vaddr));
	return 0;
}

static int prepare_image_maps(int fd, int pid)
{
	printf("%d: Fixing maps before executing image\n", pid);

	if (fixup_vma_fds(pid, fd))
		return 1;

	if (fixup_pages_data(pid, fd))
		return 1;

	close(fd);
	return 0;
}

static int execute_image(int pid)
{
	char path[128];
	int fd, fd_new;
	struct stat buf;

	sprintf(path, "core-%d.img", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("Can't open exec image");
		return 1;
	}

	if (fstat(fd, &buf)) {
		perror("Can't stat");
		return 1;
	}

	sprintf(path, "core-%d.img.out", pid);
	fd_new = open(path, O_RDWR | O_CREAT | O_EXCL, 0700);
	if (fd_new < 0) {
		perror("Can't open new image");
		return 1;
	}

	printf("%d: Preparing execution image\n", pid);
	sendfile(fd_new, fd, NULL, buf.st_size);
	close(fd);

	if (fchmod(fd_new, 0700)) {
		perror("Can't prepare exec image");
		return 1;
	}

	if (prepare_image_maps(fd_new, pid))
		return 1;

	printf("%d/%d EXEC IMAGE\n", pid, getpid());
	return execl(path, path, NULL);
}

static int create_pipe(int pid, struct pipes_entry *e, struct pipes_info *pi, int pipes_fd)
{
	int pfd[2], tmp;
	unsigned long time = 1000;

	printf("\t%d: Creating pipe %x\n", pid, e->pipeid);

	if (pipe(pfd) < 0) {
		perror("Can't create pipe");
		return 1;
	}

	if (e->bytes) {
		printf("\t%d: Splicing data to %d\n", pid, pfd[1]);

		tmp = splice(pipes_fd, NULL, pfd[1], NULL, e->bytes, 0);
		if (tmp != e->bytes) {
			fprintf(stderr, "Wanted to restore %ld bytes, but got %ld\n",
					e->bytes, tmp);
			if (tmp < 0)
				perror("Error splicing data");
			return 1;
		}
	}

	pi->read_fd = pfd[0];
	pi->write_fd = pfd[1];
	pi->real_pid = getpid();

	printf("\t%d: Done, waiting for others on %d pid with r:%d w:%d\n",
			pid, pi->real_pid, pfd[0], pfd[1]);

	while (1) {
		if (pi->users == 1) /* only I left */
			break;

		printf("\t%d: Waiting for %x pipe to attach (%d users left)\n",
				pid, e->pipeid, pi->users - 1);
		if (time < 20000000)
			time <<= 1;
		usleep(time);
	}

	printf("\t%d: All is ok - reopening pipe for %d\n", pid, e->fd);
	if (e->flags & O_WRONLY) {
		close(pfd[0]);
		tmp = reopen_fd(pfd[1], e->fd);
	} else {
		close(pfd[1]);
		tmp = reopen_fd(pfd[0], e->fd);
	}

	if (tmp < 0) {
		perror("Can't dup pipe fd");
		return 1;
	}

	return 0;
}

static int attach_pipe(int pid, struct pipes_entry *e, struct pipes_info *pi)
{
	char path[128];
	int tmp, fd;

	printf("\t%d: Wating for pipe %x to appear\n", pid, e->pipeid);

	while (pi->real_pid == 0)
		usleep(1000);

	if (e->flags & O_WRONLY)
		tmp = pi->write_fd;
	else
		tmp = pi->read_fd;

	sprintf(path, "/proc/%d/fd/%d", pi->real_pid, tmp);
	printf("\t%d: Attaching pipe %s\n", pid, path);

	fd = open(path, e->flags);
	if (fd < 0) {
		perror("Can't attach pipe");
		return 1;
	}

	printf("\t%d: Done, reopening for %d\n", pid, e->fd);
	pi->users--;
	tmp = reopen_fd(fd, e->fd);
	if (tmp < 0) {
		perror("Can't dup to attach pipe");
		return 1;
	}

	return 0;

}

static int open_pipe(int pid, struct pipes_entry *e, int *pipes_fd)
{
	struct pipes_info *pi;

	printf("\t%d: Opening pipe %x on fd %d\n", pid, e->pipeid, e->fd);
	if (e->fd == *pipes_fd) {
		int tmp;

		tmp = dup(*pipes_fd);
		if (tmp < 0) {
			perror("Can't dup file");
			return 1;
		}

		*pipes_fd = tmp;
	}

	pi = search_pipes(e->pipeid);
	if (pi == NULL) {
		fprintf(stderr, "BUG: can't find my pipe %x\n", e->pipeid);
		return 1;
	}

	if (pi->pid == pid)
		return create_pipe(pid, e, pi, *pipes_fd);
	else
		return attach_pipe(pid, e, pi);
}

static int prepare_pipes(int pid)
{
	char path[64];
	int pipes_fd;
	__u32 type = 0;

	printf("%d: Opening pipes\n", pid);

	sprintf(path, "pipes-%d.img", pid);
	pipes_fd = open(path, O_RDONLY);
	if (pipes_fd < 0) {
		perror("Can't open pipes img");
		return 1;
	}

	read(pipes_fd, &type, sizeof(type));
	if (type != PIPES_MAGIC) {
		perror("Bad pipes file");
		return 1;
	}

	while (1) {
		struct pipes_entry e;
		int ret;

		ret = read(pipes_fd, &e, sizeof(e));
		if (ret == 0) {
			close(pipes_fd);
			return 0;
		}
		if (ret != sizeof(e)) {
			perror("Bad pipes entry");
			return 1;
		}

		if (open_pipe(pid, &e, &pipes_fd))
			return 1;
	}
}

static int restore_one_task(int pid)
{
	printf("%d: Restoring resources\n", pid);

	if (prepare_pipes(pid))
		return 1;

	if (prepare_fds(pid))
		return 1;

	if (prepare_shmem(pid))
		return 1;

	return execute_image(pid);
}

static int restore_task_with_children(int my_pid, char *pstree_path);

#if 0
static inline int fork_with_pid(int pid, char *pstree_path)
{
	/* FIXME - no such ability now */
	int ret;

	ret = fork();
	if (ret == 0) {
		ret = restore_task_with_children(pid, pstree_path);
		exit(ret);
	}

	return ret;
}
#else
#define CLONE_CHILD_USEPID      0x02000000

static int do_child(void *arg)
{
	return restore_task_with_children(getpid(), arg);
}

static inline int fork_with_pid(int pid, char *pstree_path)
{
	void *stack;

	stack = mmap(0, 4 * 4096, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANON | MAP_GROWSDOWN, 0, 0);
	if (stack == MAP_FAILED)
		return -1;

	stack += 4 * 4096;
	return clone(do_child, stack, SIGCHLD | CLONE_CHILD_USEPID, pstree_path, NULL, NULL, &pid);

}
#endif

static int restore_task_with_children(int my_pid, char *pstree_path)
{
	int *pids;
	int fd, ret, i;
	struct pstree_entry e;

	printf("%d: Starting restore\n", my_pid);

	fd = open(pstree_path, O_RDONLY);
	if (fd < 0) {
		perror("Can't reopen pstree image");
		exit(1);
	}

	lseek(fd, sizeof(__u32), SEEK_SET);
	while (1) {
		ret = read(fd, &e, sizeof(e));
		if (ret != sizeof(e)) {
			fprintf(stderr, "%d: Read returned %d\n", my_pid, ret);
			if (ret < 0)
				perror("Can't read pstree");
			exit(1);
		}

		if (e.pid != my_pid) {
			lseek(fd, e.nr_children * sizeof(__u32), SEEK_CUR);
			continue;
		}
		
		break;
	}

	if (e.nr_children > 0) {
		i = e.nr_children * sizeof(int);
		pids = malloc(i);
		ret = read(fd, pids, i);
		if (ret != i) {
			perror("Can't read children pids");
			exit(1);
		}

		close(fd);

		printf("%d: Restoring %d children:\n", my_pid, e.nr_children);
		for (i = 0; i < e.nr_children; i++) {
			printf("\tFork %d from %d\n", pids[i], my_pid);
			ret = fork_with_pid(pids[i], pstree_path);
			if (ret < 0) {
				perror("Can't fork kid");
				exit(1);
			}
		}
	} else
		close(fd);

	shmem_update_real_pid(my_pid, getpid());

	return restore_one_task(my_pid);
}

static int restore_root_task(char *pstree_path, int fd)
{
	struct pstree_entry e;
	int ret;

	ret = read(fd, &e, sizeof(e));
	if (ret != sizeof(e)) {
		perror("Can't read root pstree entry");
		return 1;
	}

	close(fd);

	printf("Forking root with %d pid\n", e.pid);
	ret = fork_with_pid(e.pid, pstree_path);
	if (ret < 0) {
		perror("Can't fork root");
		return 1;
	}

	wait(NULL);
	return 0;
}

static int restore_all_tasks(char *pid)
{
	char path[128];
	int pstree_fd;
	__u32 type = 0;

	sprintf(path, "pstree-%s.img", pid);
	pstree_fd = open(path, O_RDONLY);
	if (pstree_fd < 0) {
		perror("Can't open pstree image");
		return 1;
	}

	read(pstree_fd, &type, sizeof(type));
	if (type != PSTREE_MAGIC) {
		perror("Bad pstree magic");
		return 1;
	}

	if (prepare_shared(pstree_fd))
		return 1;

	return restore_root_task(path, pstree_fd);
}

int main(int argc, char **argv)
{
	if (argc != 3)
		goto usage;
	if (argv[1][0] != '-')
		goto usage;
	if (argv[1][1] == 'p')
		return restore_one_task(atoi(argv[2]));
	if (argv[1][1] == 't')
		return restore_all_tasks(argv[2]);

usage:
	printf("Usage: %s (-t|-p) <pid>\n", argv[0]);
	return 1;
}

[-- Attachment #4: img-show.c --]
[-- Type: text/plain, Size: 7004 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <linux/types.h>
#include <string.h>
#include "img_structs.h"
#include "binfmt_img.h"

static int show_fdinfo(int fd)
{
	char data[1024];
	struct fdinfo_entry e;

	while (1) {
		int ret;

		ret = read(fd, &e, sizeof(e));
		if (ret == 0)
			break;
		if (ret != sizeof(e)) {
			perror("Can't read");
			return 1;
		}

		ret = read(fd, data, e.len);
		if (ret != e.len) {
			perror("Can't read");
			return 1;
		}

		data[e.len] = '\0';
		switch (e.type) {
		case FDINFO_FD:
			printf("fd %d [%s] pos %lx flags %o\n", (int)e.addr, data, e.pos, e.flags);
			break;
		case FDINFO_MAP:
			printf("map %lx [%s] flags %o\n", e.addr, data, e.flags);
			break;
		default:
			fprintf(stderr, "Unknown fdinfo entry type %d\n", e.type);
			return 1;
		}
	}

	return 0;
}

#define PAGE_SIZE	4096

static int show_mem(int fd)
{
	__u64 vaddr;
	unsigned int data[2];

	while (1) {
		if (read(fd, &vaddr, 8) == 0)
			break;
		if (vaddr == 0)
			break;

		read(fd, &data[0], sizeof(unsigned int));
		lseek(fd, PAGE_SIZE - 2 * sizeof(unsigned int), SEEK_CUR);
		read(fd, &data[1], sizeof(unsigned int));

		printf("\tpage 0x%lx [%x...%x]\n", (unsigned long)vaddr, data[0], data[1]);
	}

	return 0;
}

static int show_pages(int fd)
{
	return show_mem(fd);
}

static int show_shmem(int fd)
{
	int r;
	struct shmem_entry e;

	while (1) {
		r = read(fd, &e, sizeof(e));
		if (r == 0)
			return 0;
		if (r != sizeof(e)) {
			perror("Can't read shmem entry");
			return 1;
		}

		printf("%016lx-%016lx %016x\n", e.start, e.end, e.shmid);
	}
}

static char *segval(__u16 seg)
{
	switch (seg) {
		case CKPT_X86_SEG_NULL:		return "nul";
		case CKPT_X86_SEG_USER32_CS:	return "cs32";
		case CKPT_X86_SEG_USER32_DS:	return "ds32";
		case CKPT_X86_SEG_USER64_CS:	return "cs64";
		case CKPT_X86_SEG_USER64_DS:	return "ds64";
	}

	if (seg & CKPT_X86_SEG_TLS)
		return "tls";
	if (seg & CKPT_X86_SEG_LDT)
		return "ldt";

	return "[unknown]";
}

static int show_regs(int fd)
{
	struct binfmt_regs_image ri;

	if (read(fd, &ri, sizeof(ri)) != sizeof(ri)) {
		perror("Can't read registers from image");
		return 1;
	}

	printf("Registers:\n");

	printf("\tr15:     %016lx\n", ri.r15);
	printf("\tr14:     %016lx\n", ri.r14);
	printf("\tr13:     %016lx\n", ri.r13);
	printf("\tr12:     %016lx\n", ri.r12);
	printf("\tr11:     %016lx\n", ri.r11);
	printf("\tr10:     %016lx\n", ri.r10);
	printf("\tr9:      %016lx\n", ri.r9);
	printf("\tr8:      %016lx\n", ri.r8);
	printf("\tax:      %016lx\n", ri.ax);
	printf("\torig_ax: %016lx\n", ri.orig_ax);
	printf("\tbx:      %016lx\n", ri.bx);
	printf("\tcx:      %016lx\n", ri.cx);
	printf("\tdx:      %016lx\n", ri.dx);
	printf("\tsi:      %016lx\n", ri.si);
	printf("\tdi:      %016lx\n", ri.di);
	printf("\tip:      %016lx\n", ri.ip);
	printf("\tflags:   %016lx\n", ri.flags);
	printf("\tbp:      %016lx\n", ri.bp);
	printf("\tsp:      %016lx\n", ri.sp);
	printf("\tgs:      %016lx\n", ri.gs);
	printf("\tfs:      %016lx\n", ri.fs);
	printf("\tgsindex: %s\n", segval(ri.gsindex));
	printf("\tfsindex: %s\n", segval(ri.fsindex));
	printf("\tcs:      %s\n", segval(ri.cs));
	printf("\tss:      %s\n", segval(ri.ss));
	printf("\tds:      %s\n", segval(ri.ds));
	printf("\tes:      %s\n", segval(ri.es));

	printf("\ttls0     %016lx\n", ri.tls[0]);
	printf("\ttls1     %016lx\n", ri.tls[1]);
	printf("\ttls2     %016lx\n", ri.tls[2]);

	return 0;
}

static int show_mm(int fd, unsigned long *stack)
{
	struct binfmt_mm_image mi;

	if (read(fd, &mi, sizeof(mi)) != sizeof(mi)) {
		perror("Can't read mm from image");
		return 1;
	}

	printf("MM:\n");
	printf("\tflags:       %016lx\n", mi.flags);
	printf("\tdef_flags:   %016lx\n", mi.def_flags);
	printf("\tstart_code:  %016lx\n", mi.start_code);
	printf("\tend_code:    %016lx\n", mi.end_code);
	printf("\tstart_data:  %016lx\n", mi.start_data);
	printf("\tend_data:    %016lx\n", mi.end_data);
	printf("\tstart_brk:   %016lx\n", mi.start_brk);
	printf("\tbrk:         %016lx\n", mi.brk);
	printf("\tstart_stack: %016lx\n", mi.start_stack);
	printf("\targ_start:   %016lx\n", mi.arg_start);
	printf("\targ_end:     %016lx\n", mi.arg_end);
	printf("\tenv_start:   %016lx\n", mi.env_start);
	printf("\tenv_end:     %016lx\n", mi.env_end);

	*stack = mi.start_stack;

	return 0;
}

static int show_vmas(int fd, unsigned long stack)
{
	struct binfmt_vma_image vi;

	printf("VMAs:\n");
	while (1) {
		char *note = "";

		if (read(fd, &vi, sizeof(vi)) != sizeof(vi)) {
			perror("Can't read vma from image");
			return 1;
		}

		if (vi.start == 0 && vi.end == 0)
			return 0;

		if (vi.start <= stack && vi.end >= stack)
			note = "[stack]";

		printf("\t%016lx-%016lx file %d %016lx prot %x flags %x %s\n",
				vi.start, vi.end, vi.fd, vi.pgoff,
				vi.prot, vi.flags, note);
	}
}

static int show_privmem(int fd)
{
	printf("Pages:\n");
	return show_mem(fd);
}

static int show_core(int fd)
{
	__u32 version = 0;
	unsigned long stack;

	read(fd, &version, 4);
	if (version != BINFMT_IMG_VERS_0) {
		printf("Unsupported version %d\n", version);
		return 1;
	}

	printf("Showing version 0\n");

	if (show_regs(fd))
		return 1;

	if (show_mm(fd, &stack))
		return 1;

	if (show_vmas(fd, stack))
		return 1;

	if (show_privmem(fd))
		return 1;

	return 0;
}

static int show_pstree(int fd)
{
	int ret;
	struct pstree_entry e;

	while (1) {
		int i;
		__u32 *ch;

		ret = read(fd, &e, sizeof(e));
		if (ret == 0)
			return 0;
		if (ret != sizeof(e)) {
			perror("Can't read processes entry");
			return 1;
		}

		printf("%d:", e.pid);
		i = e.nr_children * sizeof(__u32);
		ch = malloc(i);
		ret = read(fd, ch, i);
		if (ret != i) {
			perror("Can't read children list");
			return 1;
		}

		for (i = 0; i < e.nr_children; i++)
			printf(" %d", ch[i]);
		printf("\n");
	}
}

static int show_pipes(int fd)
{
	struct pipes_entry e;
	int ret;
	char buf[17];

	while (1) {
		ret = read(fd, &e, sizeof(e));
		if (ret == 0)
			break;
		if (ret != sizeof(e)) {
			perror("Can't read pipe entry");
			return 1;
		}

		printf("%d: %lx %o %d ", e.fd, e.pipeid, e.flags, e.bytes);
		if (e.flags & O_WRONLY) {
			printf("\n");

			if (e.bytes) {
				printf("Bogus pipe\n");
				return 1;
			}

			continue;
		}

		memset(buf, 0, sizeof(buf));
		ret = e.bytes;
		if (ret > 16)
			ret = 16;

		read(fd, buf, ret);
		printf("\t[%s", buf);
		if (ret < e.bytes)
			printf("...");
		printf("]\n");
		lseek(fd, e.bytes - ret, SEEK_CUR);
	}

	return 0;

}

int main(int argc, char **argv)
{
	__u32 type;
	int fd;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("Can't open");
		return 1;
	}

	read(fd, &type, 4);

	if (type == FDINFO_MAGIC)
		return show_fdinfo(fd);
	if (type == PAGES_MAGIC)
		return show_pages(fd);
	if (type == SHMEM_MAGIC)
		return show_shmem(fd);
	if (type == PSTREE_MAGIC)
		return show_pstree(fd);
	if (type == PIPES_MAGIC)
		return show_pipes(fd);
	if (type == BINFMT_IMG_MAGIC)
		return show_core(fd);

	printf("Unknown file type 0x%x\n", type);
	return 1;
}

[-- Attachment #5: img_structs.h --]
[-- Type: text/plain, Size: 494 bytes --]


#define FDINFO_MAGIC	0x01010101

struct fdinfo_entry {
	__u8	type;
	__u8	len;
	__u16	flags;
	__u32	pos;
	__u64	addr;
};

#define FDINFO_FD	1
#define FDINFO_MAP	2

#define PAGES_MAGIC	0x20202020

#define SHMEM_MAGIC	0x03300330

struct shmem_entry {
	__u64	start;
	__u64	end;
	__u64	shmid;
};

#define PSTREE_MAGIC	0x40044004

struct pstree_entry {
	__u32	pid;
	__u32	nr_children;
};

#define PIPES_MAGIC	0x05055050

struct pipes_entry {
	__u32	fd;
	__u32	pipeid;
	__u32	flags;
	__u32	bytes;
};

[-- Attachment #6: Makefile --]
[-- Type: text/plain, Size: 186 bytes --]

all: cr-dump img-show cr-restore

img-show: img-show.c
	gcc -o $@ $<

cr-dump: cr-dump.c
	gcc -o $@ $<

cr-restore: cr-restore.c
	gcc -o $@ $<

clean:
	rm -f cr-dump img-show cr-restore

[-- Attachment #7: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (7 preceding siblings ...)
  2011-07-15 13:49   ` [TOOLS] To make use of the patches Pavel Emelyanov
@ 2011-07-15 15:01   ` Tejun Heo
  2011-07-18 13:27   ` Serge E. Hallyn
  2011-07-23  0:25   ` Matt Helsley
  10 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-15 15:01 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
> There have already been made many attempts to have the
> checkpoint/restore functionality in Linux, but as far as I can see
> there's still no final solutions that suits most of the interested
> people. The main concern about the previous approaches as I see it
> was about - all that stuff was supposed to sit in the kernel thus
> creating various problems.
> 
> I'd like to bring this subject back again proposing the way of how
> to implement c/r mostly in the userspace with the reasonable help of
> a kernel.

I just glanced through the series and it seems well contained but as
it touches process hierarchy, I think it would be better to cc Oleg
Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> and Andrew Morton too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]     ` <4E204500.6040800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-16 22:57       ` Kirill A. Shutemov
       [not found]         ` <20110716225709.GA25606-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
  2011-07-21  6:44       ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2011-07-16 22:57 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:47:44PM +0400, Pavel Emelyanov wrote:
> An image read from file contains task's registers and information
> about its VM. Later this image can be execve-ed causing recreation
> of the previously read task state.
> 
> The file format is my own, very simple. Introduced to make the code
> as simple as possible. Better file format (if any) is to be discussed.

I think file format should be per-binfmt, similar to core dump. So it will
be ELF with ELF binary. Core dumper code can be reused in some way.

> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> 
> ---
>  fs/proc/Kconfig            |    8 +
>  fs/proc/Makefile           |    1 +
>  fs/proc/base.c             |    3 +
>  fs/proc/img_dump.c         |  397 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/binfmt_img.h |   87 ++++++++++
>  include/linux/proc_fs.h    |    2 +
>  6 files changed, 498 insertions(+), 0 deletions(-)
>  create mode 100644 fs/proc/img_dump.c
>  create mode 100644 include/linux/binfmt_img.h
> 
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 15af622..c64bf75 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -67,3 +67,11 @@ config PROC_PAGE_MONITOR
>  	  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
>  	  /proc/kpagecount, and /proc/kpageflags. Disabling these
>            interfaces will reduce the size of the kernel by approximately 4kb.
> +
> +config PROC_IMG
> +	default y
> +	depends on PROC_FS

depends on X86_64 ?

>+	bool "Enable /proc/<pid>/dump file"
> +	help
> +	  Say Y here if you want to be able to produce checkpoint-restore images
> +	  for tasks via proc
> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
> index df434c5..3a59cb1 100644
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -27,3 +27,4 @@ proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
>  proc-$(CONFIG_PROC_DEVICETREE)	+= proc_devtree.o
>  proc-$(CONFIG_PRINTK)	+= kmsg.o
>  proc-$(CONFIG_PROC_PAGE_MONITOR)	+= page.o
> +proc-$(CONFIG_PROC_IMG) += img_dump.o
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 633af12..c01438f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3044,6 +3044,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  	INF("cmdline",    S_IRUGO, proc_pid_cmdline),
>  	ONE("stat",       S_IRUGO, proc_tgid_stat),
> +#ifdef CONFIG_PROC_IMG
> +	REG("dump",	  S_IRUSR|S_IWUSR, proc_pid_dump_operations),
> +#endif

Writable?

>  	ONE("statm",      S_IRUGO, proc_pid_statm),
>  	REG("maps",       S_IRUGO, proc_maps_operations),
>  #ifdef CONFIG_NUMA
> diff --git a/fs/proc/img_dump.c b/fs/proc/img_dump.c
> new file mode 100644
> index 0000000..7fa52ef
> --- /dev/null
> +++ b/fs/proc/img_dump.c
> @@ -0,0 +1,397 @@
> +#include <linux/proc_fs.h>
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <linux/binfmt_img.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/highmem.h>
> +#include <linux/types.h>
> +#include "internal.h"
> +
> +static int img_dump_buffer(char __user *ubuf, size_t size, void *buf, int len, int pos)
> +{
> +	int ret;
> +	static size_t dumped = 0;
> +
> +	len -= pos;
> +	if (len > size)
> +		len = size;
> +
> +	ret = copy_to_user(ubuf, buf + pos, len);
> +	if (ret)
> +		return -EFAULT;
> +
> +	dumped += len;
> +	return len;
> +}
> +
> +static int img_dump_header(char __user *buf, size_t size, int pos)
> +{
> +	struct binfmt_img_header hdr;
> +
> +	hdr.magic = BINFMT_IMG_MAGIC;
> +	hdr.version = BINFMT_IMG_VERS_0;
> +
> +	return img_dump_buffer(buf, size, &hdr, sizeof(hdr), pos);
> +}
> +
> +static __u16 encode_segment(unsigned short seg)
> +{
> +	if (seg == 0)
> +		return CKPT_X86_SEG_NULL;
> +	BUG_ON((seg & 3) != 3);
> +
> +	if (seg == __USER_CS)
> +		return CKPT_X86_SEG_USER64_CS;
> +	if (seg == __USER_DS)
> +		return CKPT_X86_SEG_USER64_DS;
> +#ifdef CONFIG_COMPAT
> +	if (seg == __USER32_CS)
> +		return CKPT_X86_SEG_USER32_CS;
> +	if (seg == __USER32_DS)
> +		return CKPT_X86_SEG_USER32_DS;
> +#endif
> +
> +	if (seg & 4)
> +		return CKPT_X86_SEG_LDT | (seg >> 3);
> +
> +	seg >>= 3;
> +	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
> +		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
> +
> +	printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg);
> +	BUG();
> +}
> +
> +static __u64 encode_tls(struct desc_struct *d)
> +{
> +	return ((__u64)d->a << 32) + d->b;
> +}
> +
> +static int img_dump_regs(struct task_struct *p, char __user *buf, size_t size, int pos)
> +{
> +	struct binfmt_regs_image regi;
> +	struct pt_regs *regs;
> +	int i;
> +
> +	regs = task_pt_regs(p);
> +
> +	regi.r15 = regs->r15;
> +	regi.r14 = regs->r14;
> +	regi.r13 = regs->r13;
> +	regi.r12 = regs->r12;
> +	regi.r11 = regs->r11;
> +	regi.r10 = regs->r10;
> +	regi.r9 = regs->r9;
> +	regi.r8 = regs->r8;
> +	regi.ax = regs->ax;
> +	regi.orig_ax = regs->orig_ax;
> +	regi.bx = regs->bx;
> +	regi.cx = regs->cx;
> +	regi.dx = regs->dx;
> +	regi.si = regs->si;
> +	regi.di = regs->di;
> +	regi.ip = regs->ip;
> +	regi.flags = regs->flags;
> +	regi.bp = regs->bp;
> +	regi.sp = regs->sp;
> +
> +	/* segments */
> +	regi.gsindex = encode_segment(p->thread.gsindex);
> +	regi.fsindex = encode_segment(p->thread.fsindex);
> +	regi.cs = encode_segment(regs->cs);
> +	regi.ss = encode_segment(regs->ss);
> +	regi.ds = encode_segment(p->thread.ds);
> +	regi.es = encode_segment(p->thread.es);
> +
> +	BUILD_BUG_ON(GDT_ENTRY_TLS_ENTRIES != CKPT_TLS_ENTRIES);
> +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
> +		regi.tls[i] = encode_tls(&p->thread.tls_array[i]);
> +
> +	if (p->thread.gsindex)
> +		regi.gs = 0;
> +	else
> +		regi.gs = p->thread.gs;
> +
> +	if (p->thread.fsindex)
> +		regi.fs = 0;
> +	else
> +		regi.fs = p->thread.fs;
> +
> +	return img_dump_buffer(buf, size, &regi, sizeof(regi), pos);
> +}
> +
> +static int img_dump_mm(struct mm_struct *mm, char __user *buf, size_t size, int pos)
> +{
> +	struct binfmt_mm_image mmi;
> +
> +	mmi.flags = mm->flags;
> +	mmi.def_flags = mm->def_flags;
> +	mmi.start_code = mm->start_code;
> +	mmi.end_code = mm->end_code;
> +	mmi.start_data = mm->start_data;
> +	mmi.end_data = mm->end_data;
> +	mmi.start_brk = mm->start_brk;
> +	mmi.brk = mm->brk;
> +	mmi.start_stack = mm->start_stack;
> +	mmi.arg_start = mm->arg_start;
> +	mmi.arg_end = mm->arg_end;
> +	mmi.env_start = mm->env_start;
> +	mmi.env_end = mm->env_end;
> +	mmi.exe_fd = 0;
> +
> +	return img_dump_buffer(buf, size, &mmi, sizeof(mmi), pos);
> +}
> +
> +static int img_dump_vma(struct vm_area_struct *vma, char __user *buf, size_t size, int pos)
> +{
> +	struct binfmt_vma_image vmai;
> +
> +	if (vma == NULL) {
> +		memset(&vmai, 0, sizeof(vmai));
> +		goto dumpit;
> +	}
> +
> +	printk("Dumping vma %016lx-%016lx %p/%p\n", vma->vm_start, vma->vm_end, vma, vma->vm_mm);
> +
> +	vmai.fd = 0;
> +	vmai.prot = 0;
> +	if (vma->vm_flags & VM_READ)
> +		vmai.prot |= PROT_READ;
> +	if (vma->vm_flags & VM_WRITE)
> +		vmai.prot |= PROT_WRITE;
> +	if (vma->vm_flags & VM_EXEC)
> +		vmai.prot |= PROT_EXEC;
> +
> +	vmai.flags = 0;
> +	if (vma->vm_file == NULL)
> +		vmai.flags |= MAP_ANONYMOUS;
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		vmai.flags |= MAP_SHARED;
> +	else
> +		vmai.flags |= MAP_PRIVATE;
> +
> +	vmai.start = vma->vm_start;
> +	vmai.end = vma->vm_end;
> +	vmai.pgoff = vma->vm_pgoff;
> +
> +dumpit:
> +	return img_dump_buffer(buf, size, &vmai, sizeof(vmai), pos);
> +}
> +
> +static int img_dump_page(unsigned long addr, void *data, char __user *buf, size_t size, int pos)
> +{
> +	struct binfmt_page_image pgi;
> +	int ret = 0, tmp;
> +
> +	pgi.vaddr = addr;
> +
> +	if (pos < sizeof(pgi)) {
> +		tmp = img_dump_buffer(buf, size, &pgi, sizeof(pgi), pos);
> +		if (tmp < 0)
> +			return tmp;
> +
> +		ret = tmp;
> +		if (size <= ret)
> +			return ret;
> +
> +		buf += ret;
> +		size -= ret;
> +		pos = 0;
> +	} else
> +		pos -= sizeof(pgi);
> +
> +	tmp = img_dump_buffer(buf, size, data, PAGE_SIZE, pos);
> +	if (tmp < 0)
> +		return tmp;
> +
> +	return ret + tmp;
> +}
> +
> +static inline int is_private_vma(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_file == NULL)
> +		return 1;
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return 1;
> +	return 0;
> +}
> +
> +static ssize_t do_produce_dump(struct task_struct *p, char __user *buf,
> +		size_t size, loff_t *ppos)
> +{
> +	size_t img_pos = 0, img_ppos;
> +	size_t produced = 0;
> +	int len;
> +	loff_t pos = *ppos;
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +
> +#define move_pos();	do {	\
> +		buf += len;	\
> +		produced += len;\
> +		size -= len;	\
> +		pos += len;	\
> +	} while (0)
> +
> +#define seek_pos(__size);	do {	\
> +		img_ppos = img_pos;	\
> +		img_pos += (__size);	\
> +	} while (0)
> +
> +	/* header */
> +	seek_pos(sizeof(struct binfmt_img_header));
> +	if (pos < img_pos) {
> +		len = img_dump_header(buf, size, pos - img_ppos);
> +		if (len < 0)
> +			goto err;
> +
> +		move_pos();
> +		if (size == 0)
> +			goto out;
> +	}
> +
> +	/* registers */
> +	seek_pos(sizeof(struct binfmt_regs_image));
> +	if (pos < img_pos) {
> +		len = img_dump_regs(p, buf, size, pos - img_ppos);
> +		if (len < 0)
> +			goto err;
> +
> +		move_pos();
> +		if (size == 0)
> +			goto out;
> +	}
> +
> +	/* memory */
> +	mm = get_task_mm(p);
> +	if (mm == NULL)
> +		return -EACCES;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	seek_pos(sizeof(struct binfmt_mm_image));
> +	if (pos < img_pos) {
> +		len = img_dump_mm(mm, buf, size, pos - img_ppos);
> +		if (len < 0)
> +			goto err_mm;
> +
> +		move_pos();
> +		if (size == 0)
> +			goto out_mm;
> +	}
> +
> +	vma = mm->mmap;
> +	while (1) {
> +		seek_pos(sizeof(struct binfmt_vma_image));
> +		if (pos < img_pos) {
> +			len = img_dump_vma(vma, buf, size, pos - img_ppos);
> +			if (len < 0)
> +				goto err_mm;
> +
> +			move_pos();
> +			if (size == 0)
> +				goto out_mm;
> +		}
> +
> +		if (vma == NULL)
> +			break;
> +
> +		vma = vma->vm_next;
> +	}
> +
> +	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
> +		/* slow and stupid */
> +		unsigned long addr;
> +		struct page *page;
> +		void *pg_data;
> +
> +		if (!is_private_vma(vma))
> +			continue;
> +
> +		for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> +			page = follow_page(vma, addr, FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> +			if (page == NULL)
> +				continue;
> +			if (IS_ERR(page)) /* huh? */
> +				continue;
> +
> +			seek_pos(sizeof(struct binfmt_page_image) + PAGE_SIZE);
> +			if (pos < img_pos) {
> +				pg_data = kmap(page);
> +				len = img_dump_page(addr, pg_data, buf, size, pos - img_ppos);
> +				kunmap(page);
> +
> +				if (len < 0) {
> +					put_page(page);
> +					goto err_mm;
> +				}
> +
> +				move_pos();
> +				if (size == 0) {
> +					put_page(page);
> +					goto out_mm;
> +				}
> +			}
> +
> +			put_page(page);
> +		}
> +	}
> +
> +	seek_pos(sizeof(struct binfmt_page_image));
> +	if (pos < img_pos) {
> +		struct binfmt_page_image zero;
> +
> +		memset(&zero, 0, sizeof(zero));
> +		len = img_dump_buffer(buf, size, &zero, sizeof(zero), pos - img_ppos);
> +		if (len < 0)
> +			goto err;
> +
> +		move_pos();
> +	}
> +
> +out_mm:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +out:
> +	*ppos = pos;
> +	return produced;
> +
> +err_mm:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +err:
> +	return len;
> +}
> +
> +static ssize_t img_dump_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> +{
> +	struct task_struct *p;
> +
> +	p = get_proc_task(file->f_dentry->d_inode);
> +	if (p == NULL)
> +		return -ESRCH;
> +
> +	if (!(p->state & TASK_STOPPED)) {
> +		put_task_struct(p);
> +		return -EINVAL;
> +	}
> +
> +	return do_produce_dump(p, buf, size, ppos);
> +}
> +
> +static int img_dump_open(struct inode *inode, struct file *filp)
> +{
> +	return 0;
> +}
> +
> +static int img_dump_release(struct inode *inode, struct file *filp)
> +{
> +	return 0;
> +}
> +
> +const struct file_operations proc_pid_dump_operations = {
> +	.open		= img_dump_open,
> +	.read		= img_dump_read,
> +	.release	= img_dump_release,
> +};
> diff --git a/include/linux/binfmt_img.h b/include/linux/binfmt_img.h
> new file mode 100644
> index 0000000..a4293af
> --- /dev/null
> +++ b/include/linux/binfmt_img.h
> @@ -0,0 +1,87 @@
> +#ifndef __BINFMT_IMG_H__
> +#define __BINFMT_IMG_H__
> +
> +#include <linux/types.h>
> +
> +struct binfmt_img_header {
> +	__u32	magic;
> +	__u32	version;
> +};
> +
> +#define CKPT_TLS_ENTRIES	3
> +
> +struct binfmt_regs_image {
> +	__u64 r15;
> +	__u64 r14;
> +	__u64 r13;
> +	__u64 r12;
> +	__u64 r11;
> +	__u64 r10;
> +	__u64 r9;
> +	__u64 r8;
> +	__u64 ax;
> +	__u64 orig_ax;
> +	__u64 bx;
> +	__u64 cx;
> +	__u64 dx;
> +	__u64 si;
> +	__u64 di;
> +	__u64 ip;
> +	__u64 flags;
> +	__u64 bp;
> +	__u64 sp;
> +
> +	__u64 gs;
> +	__u64 fs;
> +	__u64 tls[CKPT_TLS_ENTRIES];
> +	__u16 gsindex;
> +	__u16 fsindex;
> +	__u16 cs;
> +	__u16 ss;
> +	__u16 ds;
> +	__u16 es;
> +};
> +
> +#define CKPT_X86_SEG_NULL       0
> +#define CKPT_X86_SEG_USER32_CS  1
> +#define CKPT_X86_SEG_USER32_DS  2
> +#define CKPT_X86_SEG_USER64_CS  3
> +#define CKPT_X86_SEG_USER64_DS  4
> +#define CKPT_X86_SEG_TLS        0x4000
> +#define CKPT_X86_SEG_LDT        0x8000
> +
> +struct binfmt_mm_image {
> +	__u64	flags;
> +	__u64	def_flags;
> +	__u64	start_code;
> +	__u64	end_code;
> +	__u64	start_data;
> +	__u64	end_data;
> +	__u64	start_brk;
> +	__u64	brk;
> +	__u64	start_stack;
> +	__u64	arg_start;
> +	__u64	arg_end;
> +	__u64	env_start;
> +	__u64	env_end;
> +	__u32	exe_fd;
> +};
> +
> +struct binfmt_vma_image {
> +	__u32	prot;
> +	__u32	flags;
> +	__u32	pad;
> +	__u32	fd;
> +	__u64	start;
> +	__u64	end;
> +	__u64	pgoff;
> +};
> +
> +struct binfmt_page_image {
> +	__u64	vaddr;
> +};
> +
> +#define BINFMT_IMG_MAGIC	0xa75b8d43
> +#define BINFMT_IMG_VERS_0	0x00000100
> +
> +#endif
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index c779c74..686b374 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -102,6 +102,8 @@ struct vmcore {
>  
>  #ifdef CONFIG_PROC_FS
>  
> +extern const struct file_operations proc_pid_dump_operations;
> +
>  extern void proc_root_init(void);
>  
>  void proc_flush_task(struct task_struct *task);
> -- 
> 1.5.5.6
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]         ` <20110716225709.GA25606-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
@ 2011-07-17  8:06           ` Cyrill Gorcunov
  0 siblings, 0 replies; 68+ messages in thread
From: Cyrill Gorcunov @ 2011-07-17  8:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Pavel Emelyanov, Glauber Costa, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On Sun, Jul 17, 2011 at 01:57:09AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jul 15, 2011 at 05:47:44PM +0400, Pavel Emelyanov wrote:
> > An image read from file contains task's registers and information
> > about its VM. Later this image can be execve-ed causing recreation
> > of the previously read task state.
> > 
> > The file format is my own, very simple. Introduced to make the code
> > as simple as possible. Better file format (if any) is to be discussed.
> 
> I think file format should be per-binfmt, similar to core dump. So it will
> be ELF with ELF binary. Core dumper code can be reused in some way.
> 

Don't think so. We could push all data into PT_LOAD, still the restore
procedure is different from how elf loader works so I don't want
at all to change kernel's elf handler code (if I understand you right,
that is what you propose?). In real we tried to isolate the changes
we bring as much as possible.

	Cyrill

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (8 preceding siblings ...)
  2011-07-15 15:01   ` [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace Tejun Heo
@ 2011-07-18 13:27   ` Serge E. Hallyn
       [not found]     ` <20110718132759.GB8127-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2011-07-23  0:25   ` Matt Helsley
  10 siblings, 1 reply; 68+ messages in thread
From: Serge E. Hallyn @ 2011-07-18 13:27 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

Thanks, Pavel.  I will take a look at this when I get a chance.  I'm
a little worried about security implications - this approach should
lend itself (especially with the binfmt handler) to clean handling
of security issues, but given the issues we've had with /proc things
that already exist, I'm worried about the dump files.  If you have
any preemptive comments on that, please do share :)

We did briefly try a binfmt handler at the very end of our foray into
the ptrace checkpoint/restart approach, but your overall set here seems
very nice.

thanks,
-serge

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]     ` <4E204500.6040800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-16 22:57       ` Kirill A. Shutemov
@ 2011-07-21  6:44       ` Tejun Heo
       [not found]         ` <20110721064408.GR3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-21  6:44 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Fri, Jul 15, 2011 at 05:47:44PM +0400, Pavel Emelyanov wrote:
> An image read from file contains task's registers and information
> about its VM. Later this image can be execve-ed causing recreation
> of the previously read task state.
> 
> The file format is my own, very simple. Introduced to make the code
> as simple as possible. Better file format (if any) is to be discussed.

First of all, I don't really think we need to bake in process dumper
into the kernel.  Most of information dumped here is already available
through /proc and ptrace and we can add the missing pieces like the
suggested proc vma fds.

> +static int img_dump_regs(struct task_struct *p, char __user *buf, size_t size, int pos)
> +{
> +	struct binfmt_regs_image regi;
> +	struct pt_regs *regs;
> +	int i;
> +
> +	regs = task_pt_regs(p);
> +
> +	regi.r15 = regs->r15;
> +	regi.r14 = regs->r14;
> +	regi.r13 = regs->r13;
> +	regi.r12 = regs->r12;
> +	regi.r11 = regs->r11;
> +	regi.r10 = regs->r10;
> +	regi.r9 = regs->r9;
> +	regi.r8 = regs->r8;
> +	regi.ax = regs->ax;
> +	regi.orig_ax = regs->orig_ax;
> +	regi.bx = regs->bx;
> +	regi.cx = regs->cx;
> +	regi.dx = regs->dx;
> +	regi.si = regs->si;
> +	regi.di = regs->di;
> +	regi.ip = regs->ip;
> +	regi.flags = regs->flags;
> +	regi.bp = regs->bp;
> +	regi.sp = regs->sp;
> +
> +	/* segments */
> +	regi.gsindex = encode_segment(p->thread.gsindex);
> +	regi.fsindex = encode_segment(p->thread.fsindex);
> +	regi.cs = encode_segment(regs->cs);
> +	regi.ss = encode_segment(regs->ss);
> +	regi.ds = encode_segment(p->thread.ds);
> +	regi.es = encode_segment(p->thread.es);
> +
> +	BUILD_BUG_ON(GDT_ENTRY_TLS_ENTRIES != CKPT_TLS_ENTRIES);
> +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
> +		regi.tls[i] = encode_tls(&p->thread.tls_array[i]);
> +
> +	if (p->thread.gsindex)
> +		regi.gs = 0;
> +	else
> +		regi.gs = p->thread.gs;
> +
> +	if (p->thread.fsindex)
> +		regi.fs = 0;
> +	else
> +		regi.fs = p->thread.fs;
> +
> +	return img_dump_buffer(buf, size, &regi, sizeof(regi), pos);
> +}

Umm... x86_64 code directly under fs/proc?  And the dump image doesn't
have arch marker?

> +static ssize_t do_produce_dump(struct task_struct *p, char __user *buf,
> +		size_t size, loff_t *ppos)
> +{
...
> +	/* registers */
> +	seek_pos(sizeof(struct binfmt_regs_image));
> +	if (pos < img_pos) {
> +		len = img_dump_regs(p, buf, size, pos - img_ppos);
> +		if (len < 0)
> +			goto err;
> +
> +		move_pos();
> +		if (size == 0)
> +			goto out;
> +	}

This is per-thread information.

> +	/* memory */
...
> +	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
> +		/* slow and stupid */
> +		unsigned long addr;
> +		struct page *page;
> +		void *pg_data;
> +
> +		if (!is_private_vma(vma))
> +			continue;
> +
> +		for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> +			page = follow_page(vma, addr, FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> +			if (page == NULL)
> +				continue;
> +			if (IS_ERR(page)) /* huh? */
> +				continue;
> +
> +			seek_pos(sizeof(struct binfmt_page_image) + PAGE_SIZE);
> +			if (pos < img_pos) {
> +				pg_data = kmap(page);
> +				len = img_dump_page(addr, pg_data, buf, size, pos - img_ppos);
> +				kunmap(page);
> +
> +				if (len < 0) {
> +					put_page(page);
> +					goto err_mm;
> +				}
> +
> +				move_pos();
> +				if (size == 0) {
> +					put_page(page);
> +					goto out_mm;
> +				}
> +			}
> +
> +			put_page(page);
> +		}
> +	}
...

These are per-process.  I can't see how this would work out well.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found]     ` <4E204519.3040804-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-21  6:51       ` Tejun Heo
       [not found]         ` <20110721065127.GS3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-21  6:51 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:48:09PM +0400, Pavel Emelyanov wrote:
> When being execve-ed the handler reads registers, mappings and provided
> memory pages from image and just assigns this state on current task. This
> simple functionality can be used to restore a task, whose state whas read
> from e.g. /proc/<pid>/dump file before.

Ummm... iff the process is single threaded. :(

Much more complex machinery is needed to restore full process anyway
which would require some kernel facilities but definitely a lot more
logic in userland.  I really can't see much point in having
dumper/restorer in kernel.  The simplistic dumper/restorer proposed
here isn't really useful - among other things, it's single threaded
only and there's no mechanism to freeze the task being dumped.  It is
almost trivially implementable from userland using existing
facilities.  I wonder what the point is.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]     ` <4E2044C3.7050506-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-21  6:54       ` Tejun Heo
       [not found]         ` <20110721065436.GT3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  2011-07-21 15:54       ` Serge E. Hallyn
  1 sibling, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-21  6:54 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:46:43PM +0400, Pavel Emelyanov wrote:
> Although we can get the pids of some task's issue, this is just 
> more convenient to have them this way.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>.

Umm... The primary aim is dumping whole namespaces, right?  The dumper
would have to build full process tree anyway so I don't see much point
in providing backlink from kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/7] vfs: Add ->statfs callback for pipefs
       [not found]     ` <4E2044D6.3060205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-21  6:59       ` Tejun Heo
  2011-07-21 15:59       ` Serge E. Hallyn
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-21  6:59 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:47:02PM +0400, Pavel Emelyanov wrote:
> This is done to make it possible to distinguish pipes from fifos
> when opening one via /proc/<pid>/fd/ link.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

This sounds like a generally good idea.  Can you please send it to
Andrew?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/1] proc: Introduce the /proc/<pid>/mfd/ directory
       [not found]     ` <4E20448A.5010207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-21  7:21       ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-21  7:21 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:45:46PM +0400, Pavel Emelyanov wrote:
> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is vma->vm_start, the
> target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.
> 
> This thing is aimed to help checkpointing processes.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

Without delving into implementation details (which look fine to me
when compared to proc_fd stuff but I'm not familiar with the code), I
think this is a generally good idea and suggest floating this towards
Andrew.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/7] vfs: Introduce the fd closing helper
       [not found]     ` <4E2044A7.4030103-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-21 15:47       ` Serge E. Hallyn
  0 siblings, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2011-07-21 15:47 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

Quoting Pavel Emelyanov (xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> This is nothing but making is possible to call the sys_close from the kernel.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

I see no problems here, thanks.

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> 
> ---
>  fs/open.c          |   32 ++++++++++++++++++++------------
>  include/linux/fs.h |    1 +
>  2 files changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index b52cf01..126aa8b 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1078,17 +1078,11 @@ int filp_close(struct file *filp, fl_owner_t id)
>  
>  EXPORT_SYMBOL(filp_close);
>  
> -/*
> - * Careful here! We test whether the file pointer is NULL before
> - * releasing the fd. This ensures that one clone task can't release
> - * an fd while another clone is opening it.
> - */
> -SYSCALL_DEFINE1(close, unsigned int, fd)
> +int do_close(unsigned int fd)
>  {
>  	struct file * filp;
>  	struct files_struct *files = current->files;
>  	struct fdtable *fdt;
> -	int retval;
>  
>  	spin_lock(&files->file_lock);
>  	fdt = files_fdtable(files);
> @@ -1101,7 +1095,25 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
>  	FD_CLR(fd, fdt->close_on_exec);
>  	__put_unused_fd(files, fd);
>  	spin_unlock(&files->file_lock);
> -	retval = filp_close(filp, files);
> +
> +	return filp_close(filp, files);
> +
> +out_unlock:
> +	spin_unlock(&files->file_lock);
> +	return -EBADF;
> +}
> +EXPORT_SYMBOL_GPL(do_close);
> +
> +/*
> + * Careful here! We test whether the file pointer is NULL before
> + * releasing the fd. This ensures that one clone task can't release
> + * an fd while another clone is opening it.
> + */
> +SYSCALL_DEFINE1(close, unsigned int, fd)
> +{
> +	int retval;
> +
> +	retval = do_close(fd);
>  
>  	/* can't restart close syscall because file table entry was cleared */
>  	if (unlikely(retval == -ERESTARTSYS ||
> @@ -1111,10 +1123,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
>  		retval = -EINTR;
>  
>  	return retval;
> -
> -out_unlock:
> -	spin_unlock(&files->file_lock);
> -	return -EBADF;
>  }
>  EXPORT_SYMBOL(sys_close);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cdf9495..77a5d3e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1991,6 +1991,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>  extern struct file * dentry_open(struct dentry *, struct vfsmount *, int,
>  				 const struct cred *);
>  extern int filp_close(struct file *, fl_owner_t id);
> +extern int do_close(unsigned int fd);
>  extern char * getname(const char __user *);
>  
>  /* fs/ioctl.c */
> -- 
> 1.5.5.6
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]     ` <4E2044C3.7050506-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-21  6:54       ` Tejun Heo
@ 2011-07-21 15:54       ` Serge E. Hallyn
  1 sibling, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2011-07-21 15:54 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

Quoting Pavel Emelyanov (xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> Although we can get the pids of some task's issue, this is just 
> more convenient to have them this way.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>.

Mind you status is getting long :)  But I see no problem with the
patch technically, and it seems useful if we're going to go the
user-space-checkpoint route.

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> ---
>  fs/proc/array.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 5e4f776..f01f480 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -158,6 +158,18 @@ static inline const char *get_task_state(struct task_struct *tsk)
>  	return *p;
>  }
>  
> +static void task_children(struct seq_file *m, struct task_struct *p, struct pid_namespace *ns)
> +{
> +	struct task_struct *c;
> +
> +	seq_printf(m, "Children:");
> +	read_lock(&tasklist_lock);
> +	list_for_each_entry(c, &p->children, sibling)
> +		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
> +	read_unlock(&tasklist_lock);
> +	seq_putc(m, '\n');
> +}
> +
>  static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  				struct pid *pid, struct task_struct *p)
>  {
> @@ -192,6 +204,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  		cred->uid, cred->euid, cred->suid, cred->fsuid,
>  		cred->gid, cred->egid, cred->sgid, cred->fsgid);
>  
> +	task_children(m, p, ns);
> +
>  	task_lock(p);
>  	if (p->files)
>  		fdt = files_fdtable(p->files);
> -- 
> 1.5.5.6
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/7] vfs: Add ->statfs callback for pipefs
       [not found]     ` <4E2044D6.3060205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-21  6:59       ` Tejun Heo
@ 2011-07-21 15:59       ` Serge E. Hallyn
  1 sibling, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2011-07-21 15:59 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

Quoting Pavel Emelyanov (xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> This is done to make it possible to distinguish pipes from fifos
> when opening one via /proc/<pid>/fd/ link.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> 
> ---
>  fs/pipe.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index da42f7d..5de15de 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1254,6 +1254,7 @@ out:
>  
>  static const struct super_operations pipefs_ops = {
>  	.destroy_inode = free_inode_nonrcu,
> +	.statfs = simple_statfs,
>  };
>  
>  /*
> -- 
> 1.5.5.6
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality
       [not found]     ` <4E2044EB.20001-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-21 16:04       ` Serge E. Hallyn
       [not found]         ` <20110721160459.GD19012-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Serge E. Hallyn @ 2011-07-21 16:04 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Eric W. Biederman, Tejun Heo, Daniel Lezcano

Quoting Pavel Emelyanov (xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> The respective flag for clone() makes the latter to take the desired
> pid of a new process from the child_tidptr. The given pid is used as
> the pid for the pid namespace the parent is currently running in.
> 
> Needed badly for restoring a process.
> 
> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

How do you intend to eventually support multiple pid namespaces?
I don't mind not supporting them now, but if the interface does
cannot be extended to support it, I think that's a simple NACK.

IIRC Eric Biederman (explicity cc:d) in the past has advocated using
/proc/sys/pid_max games to specify a pid.  That is actually
usable cross-namespace, though only serially.  Do you think that
will be too cumbersome?

There is also the clone_with_pid() syscall from Suka.  It
accompanied the in-kernel checkpoint/restart patchset, but should
be perfectly usable without it.  Is there a good reason not to
pursue that?

thanks,
-serge

> ---
>  include/linux/pid.h   |    2 +-
>  include/linux/sched.h |    1 +
>  kernel/fork.c         |   10 ++++++-
>  kernel/pid.c          |   70 +++++++++++++++++++++++++++++++++++-------------
>  4 files changed, 62 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index cdced84..de772ab 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
>  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
>  int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
>  
> -extern struct pid *alloc_pid(struct pid_namespace *ns);
> +extern struct pid *alloc_pid(struct pid_namespace *ns, int pid);
>  extern void free_pid(struct pid *pid);
>  
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 781abd1..5b6c1e2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -23,6 +23,7 @@
>  #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
>  /* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
>     and is now available for re-use. */
> +#define CLONE_CHILD_USEPID	0x02000000	/* use the given pid */
>  #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
>  #define CLONE_NEWIPC		0x08000000	/* New ipcs */
>  #define CLONE_NEWUSER		0x10000000	/* New user namespace */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e7548de..f30fbdb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1183,8 +1183,16 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		goto bad_fork_cleanup_io;
>  
>  	if (pid != &init_struct_pid) {
> +		int want_pid = 0;
> +
> +		if (clone_flags & CLONE_CHILD_USEPID) {
> +			retval = get_user(want_pid, child_tidptr);
> +			if (retval)
> +				goto bad_fork_cleanup_io;
> +		}
> +
>  		retval = -ENOMEM;
> -		pid = alloc_pid(p->nsproxy->pid_ns);
> +		pid = alloc_pid(p->nsproxy->pid_ns, want_pid);
>  		if (!pid)
>  			goto bad_fork_cleanup_io;
>  	}
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 57a8346..69ae1be 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -159,11 +159,55 @@ static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
>  	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
>  }
>  
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap_page(struct pidmap *map)
> +{
> +	if (unlikely(!map->page)) {
> +		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +		/*
> +		 * Free the page if someone raced with us
> +		 * installing it:
> +		 */
> +		spin_lock_irq(&pidmap_lock);
> +		if (!map->page) {
> +			map->page = page;
> +			page = NULL;
> +		}
> +		spin_unlock_irq(&pidmap_lock);
> +		kfree(page);
> +		if (unlikely(!map->page))
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	offset = pid & BITS_PER_PAGE_MASK;
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> +
> +	if (alloc_pidmap_page(map) < 0)
> +		return -ENOMEM;
> +
> +	if (!test_and_set_bit(offset, map->page)) {
> +		atomic_dec(&map->nr_free);
> +		return pid;
> +	}
> +
> +	return -EBUSY;
> +}
> +
> +static int alloc_pidmap(struct pid_namespace *pid_ns, int desired_pid)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
>  	struct pidmap *map;
>  
> +	if (desired_pid)
> +		return set_pidmap(pid_ns, desired_pid);
> +
>  	pid = last + 1;
>  	if (pid >= pid_max)
>  		pid = RESERVED_PIDS;
> @@ -176,22 +220,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	 */
>  	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
> -		if (unlikely(!map->page)) {
> -			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> -			/*
> -			 * Free the page if someone raced with us
> -			 * installing it:
> -			 */
> -			spin_lock_irq(&pidmap_lock);
> -			if (!map->page) {
> -				map->page = page;
> -				page = NULL;
> -			}
> -			spin_unlock_irq(&pidmap_lock);
> -			kfree(page);
> -			if (unlikely(!map->page))
> -				break;
> -		}
> +		if (alloc_pidmap_page(map) < 0)
> +			break;
> +
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
> @@ -277,7 +308,7 @@ void free_pid(struct pid *pid)
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
>  
> -struct pid *alloc_pid(struct pid_namespace *ns)
> +struct pid *alloc_pid(struct pid_namespace *ns, int this_ns_pid)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
> @@ -291,13 +322,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  
>  	tmp = ns;
>  	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +		nr = alloc_pidmap(tmp, this_ns_pid);
>  		if (nr < 0)
>  			goto out_free;
>  
>  		pid->numbers[i].nr = nr;
>  		pid->numbers[i].ns = tmp;
>  		tmp = tmp->parent;
> +		this_ns_pid = 0;
>  	}
>  
>  	get_pid_ns(ns);
> -- 
> 1.5.5.6
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found]         ` <20110721065127.GS3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2011-07-22 22:46           ` Matt Helsley
       [not found]             ` <20110722224617.GA16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-22 22:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Glauber Costa, Cyrill Gorcunov, Nathan Lynch,
	Linux Containers, Serge Hallyn, Daniel Lezcano

On Thu, Jul 21, 2011 at 08:51:27AM +0200, Tejun Heo wrote:
> On Fri, Jul 15, 2011 at 05:48:09PM +0400, Pavel Emelyanov wrote:
> > When being execve-ed the handler reads registers, mappings and provided
> > memory pages from image and just assigns this state on current task. This
> > simple functionality can be used to restore a task, whose state whas read
> > from e.g. /proc/<pid>/dump file before.
> 
> Ummm... iff the process is single threaded. :(
> 
> Much more complex machinery is needed to restore full process anyway
> which would require some kernel facilities but definitely a lot more

Agreed,

> logic in userland.  I really can't see much point in having

I disagree (surprise! ;)).

> dumper/restorer in kernel.  The simplistic dumper/restorer proposed
> here isn't really useful - among other things, it's single threaded
> only and there's no mechanism to freeze the task being dumped.  It is

To be fair Pavel used signals to stop/resume the task. It's not
a good solution but it's a start (more below).

> almost trivially implementable from userland using existing
> facilities.  I wonder what the point is.

No, I think that ultimately an addition to the cgroup freezer will
be needed.

The problem is that another task (perhaps a shell or debugger)
could come in and wake up the tasks. In theory the same problem could
happen with the cgroup freezer -- only the fact that today code
is rarely written to deal with it allows it to be more reliable than
SIGSTOP and SIGCONT.

The task doing the checkpoint *at least* needs to know if the frozen
tasks have been thawed ("notification"). That allows it to report
a warning or an error to the effect that the checkpoint may be
unreliable. Notification alone produces the possibility of indefinite
postponement however.

So it needs some assurance that frozen tasks will not be thawed until
checkpoint is complete. Oren's patches used a new freezer state
for this purpose. That's an in-kernel solution -- we need something
somewhat more elaborate for userspace because we then have to worry
about abuse of a new freezer interface.

We could add the ability to "lock" the freezer in its current state
to userspace. Only the task that set the lock can release it. Of
course if the task died then the lock would need to be released. It
might also be wise to add a timeout...

So we almost want to be able to use a mandatory file lock on the
freezer.state. Or perhaps we can add a freezer.lock file to the
cgroup freezer.

But for now, using the cgroup freezer would be an improvement
over SIGSTOP/SIGCONT.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality
       [not found]         ` <20110721160459.GD19012-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2011-07-22 23:08           ` Matt Helsley
       [not found]             ` <20110722230848.GB16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-22 23:08 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Pavel Emelyanov, Glauber Costa, Cyrill Gorcunov, Tejun Heo,
	Nathan Lynch, Eric W. Biederman, Linux Containers,
	Daniel Lezcano

On Thu, Jul 21, 2011 at 11:04:59AM -0500, Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov (xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> > The respective flag for clone() makes the latter to take the desired
> > pid of a new process from the child_tidptr. The given pid is used as
> > the pid for the pid namespace the parent is currently running in.
> > 
> > Needed badly for restoring a process.
> > 
> > Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> 
> How do you intend to eventually support multiple pid namespaces?
> I don't mind not supporting them now, but if the interface does
> cannot be extended to support it, I think that's a simple NACK.
> 
> IIRC Eric Biederman (explicity cc:d) in the past has advocated using
> /proc/sys/pid_max games to specify a pid.  That is actually
> usable cross-namespace, though only serially.  Do you think that
> will be too cumbersome?

Didn't that already get NACK'd by others?

Regardless, I'm not a fan of that method. If we're going to add
an interface to enable doing something then it's best if we can
do it simply and directly -- not play games with some apparent
tunable in order to "covertly" get the pids we want.

> 
> There is also the clone_with_pid() syscall from Suka.  It
> accompanied the in-kernel checkpoint/restart patchset, but should
> be perfectly usable without it.  Is there a good reason not to
> pursue that?

aka eclone. It enabled extending the clone flags in the future. Suka
did substantial work making sure it worked for multiple architectures
(clone is rather special that way). It's already been tested.

Furthermore, Oren's user-cr code already creates a tree of
multiple processes with correct pids, pid namespaces, session ids,
and process group ids in userspace using only eclone. So we know eclone
can handle all of that today because there's code to do it.

So I think it would be better to incorporate the eclone patch set
unless, as you say, Pavel can see a good reason not to.

Cheers,
	-Matt Helsley

> 
> thanks,
> -serge
> 
> > ---
> >  include/linux/pid.h   |    2 +-
> >  include/linux/sched.h |    1 +
> >  kernel/fork.c         |   10 ++++++-
> >  kernel/pid.c          |   70 +++++++++++++++++++++++++++++++++++-------------
> >  4 files changed, 62 insertions(+), 21 deletions(-)
> > 
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index cdced84..de772ab 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
> >  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> >  int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
> >  
> > -extern struct pid *alloc_pid(struct pid_namespace *ns);
> > +extern struct pid *alloc_pid(struct pid_namespace *ns, int pid);
> >  extern void free_pid(struct pid *pid);
> >  
> >  /*
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 781abd1..5b6c1e2 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -23,6 +23,7 @@
> >  #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
> >  /* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
> >     and is now available for re-use. */
> > +#define CLONE_CHILD_USEPID	0x02000000	/* use the given pid */
> >  #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
> >  #define CLONE_NEWIPC		0x08000000	/* New ipcs */
> >  #define CLONE_NEWUSER		0x10000000	/* New user namespace */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index e7548de..f30fbdb 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1183,8 +1183,16 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  		goto bad_fork_cleanup_io;
> >  
> >  	if (pid != &init_struct_pid) {
> > +		int want_pid = 0;
> > +
> > +		if (clone_flags & CLONE_CHILD_USEPID) {
> > +			retval = get_user(want_pid, child_tidptr);
> > +			if (retval)
> > +				goto bad_fork_cleanup_io;
> > +		}
> > +
> >  		retval = -ENOMEM;
> > -		pid = alloc_pid(p->nsproxy->pid_ns);
> > +		pid = alloc_pid(p->nsproxy->pid_ns, want_pid);
> >  		if (!pid)
> >  			goto bad_fork_cleanup_io;
> >  	}
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 57a8346..69ae1be 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -159,11 +159,55 @@ static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
> >  	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
> >  }
> >  
> > -static int alloc_pidmap(struct pid_namespace *pid_ns)
> > +static int alloc_pidmap_page(struct pidmap *map)
> > +{
> > +	if (unlikely(!map->page)) {
> > +		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +		/*
> > +		 * Free the page if someone raced with us
> > +		 * installing it:
> > +		 */
> > +		spin_lock_irq(&pidmap_lock);
> > +		if (!map->page) {
> > +			map->page = page;
> > +			page = NULL;
> > +		}
> > +		spin_unlock_irq(&pidmap_lock);
> > +		kfree(page);
> > +		if (unlikely(!map->page))
> > +			return -ENOMEM;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> > +{
> > +	int offset;
> > +	struct pidmap *map;
> > +
> > +	offset = pid & BITS_PER_PAGE_MASK;
> > +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> > +
> > +	if (alloc_pidmap_page(map) < 0)
> > +		return -ENOMEM;
> > +
> > +	if (!test_and_set_bit(offset, map->page)) {
> > +		atomic_dec(&map->nr_free);
> > +		return pid;
> > +	}
> > +
> > +	return -EBUSY;
> > +}
> > +
> > +static int alloc_pidmap(struct pid_namespace *pid_ns, int desired_pid)
> >  {
> >  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> >  	struct pidmap *map;
> >  
> > +	if (desired_pid)
> > +		return set_pidmap(pid_ns, desired_pid);
> > +
> >  	pid = last + 1;
> >  	if (pid >= pid_max)
> >  		pid = RESERVED_PIDS;
> > @@ -176,22 +220,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> >  	 */
> >  	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
> >  	for (i = 0; i <= max_scan; ++i) {
> > -		if (unlikely(!map->page)) {
> > -			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > -			/*
> > -			 * Free the page if someone raced with us
> > -			 * installing it:
> > -			 */
> > -			spin_lock_irq(&pidmap_lock);
> > -			if (!map->page) {
> > -				map->page = page;
> > -				page = NULL;
> > -			}
> > -			spin_unlock_irq(&pidmap_lock);
> > -			kfree(page);
> > -			if (unlikely(!map->page))
> > -				break;
> > -		}
> > +		if (alloc_pidmap_page(map) < 0)
> > +			break;
> > +
> >  		if (likely(atomic_read(&map->nr_free))) {
> >  			do {
> >  				if (!test_and_set_bit(offset, map->page)) {
> > @@ -277,7 +308,7 @@ void free_pid(struct pid *pid)
> >  	call_rcu(&pid->rcu, delayed_put_pid);
> >  }
> >  
> > -struct pid *alloc_pid(struct pid_namespace *ns)
> > +struct pid *alloc_pid(struct pid_namespace *ns, int this_ns_pid)
> >  {
> >  	struct pid *pid;
> >  	enum pid_type type;
> > @@ -291,13 +322,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  
> >  	tmp = ns;
> >  	for (i = ns->level; i >= 0; i--) {
> > -		nr = alloc_pidmap(tmp);
> > +		nr = alloc_pidmap(tmp, this_ns_pid);
> >  		if (nr < 0)
> >  			goto out_free;
> >  
> >  		pid->numbers[i].nr = nr;
> >  		pid->numbers[i].ns = tmp;
> >  		tmp = tmp->parent;
> > +		this_ns_pid = 0;
> >  	}
> >  
> >  	get_pid_ns(ns);
> > -- 
> > 1.5.5.6
> > _______________________________________________
> > Containers mailing list
> > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> > https://lists.linux-foundation.org/mailman/listinfo/containers
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [TOOLS] To make use of the patches
       [not found]     ` <4E204554.6040901-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-22 23:45       ` Matt Helsley
       [not found]         ` <20110722234558.GD16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2011-07-23  0:40       ` Reply #2: " Matt Helsley
  1 sibling, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-22 23:45 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:49:08PM +0400, Pavel Emelyanov wrote:
> Additionally the binfmt_img.h from kernel is required for cr-restore.

> #include <stdio.h>
> #include <unistd.h>
> #include <signal.h>
> #include <dirent.h>
> #include <string.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <errno.h>
> #include <linux/kdev_t.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/vfs.h>
> 
> #include <linux/types.h>
> #include "img_structs.h"
> 
> static int fdinfo_img;
> static int pages_img;
> static int core_img;
> static int shmem_img;
> static int pipes_img;
> 
> #define PIPEFS_MAGIC 0x50495045

Shouldn't there be only one MAGIC number for checkpoint contents?

You can always add an additional "type" number following the magic
number. Or make the type a string with the name of the /proc file it's
from... etc.

> 
> static int prep_img_files(int pid)
> {
> 	__u32 type;
> 	char name[64];
> 
> 	sprintf(name, "fdinfo-%d.img", pid);
> 	fdinfo_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
> 	if (fdinfo_img < 0) {
> 		perror("Can't open fdinfo");
> 		return 1;
> 	}
> 
> 	type = FDINFO_MAGIC;
> 	write(fdinfo_img, &type, 4);
> 
> 	sprintf(name, "pages-%d.img", pid);
> 	pages_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
> 	if (pages_img < 0) {
> 		perror("Can't open shmem");
> 		return 1;
> 	}
> 
> 	type = PAGES_MAGIC;
> 	write(pages_img, &type, 4);
> 
> 	sprintf(name, "core-%d.img", pid);
> 	core_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
> 	if (core_img < 0) {
> 		perror("Can't open core");
> 		return 1;
> 	}
> 
> 	sprintf(name, "shmem-%d.img", pid);
> 	shmem_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
> 	if (shmem_img < 0) {
> 		perror("Can't open shmem");
> 		return 1;
> 	}
> 
> 	type = SHMEM_MAGIC;
> 	write(shmem_img, &type, 4);
> 
> 	sprintf(name, "pipes-%d.img", pid);
> 	pipes_img = open(name, O_WRONLY | O_CREAT | O_EXCL, 0600);
> 	if (pipes_img < 0) {
> 		perror("Can't open pipes");
> 		return 1;
> 	}
> 
> 	type = PIPES_MAGIC;
> 	write(pipes_img, &type, 4);
> 
> 	return 0;
> }
> 
> static void kill_imgfiles(int pid)
> {
> 	/* FIXME */
> }
> 
> static int stop_task(int pid)
> {
> 	return kill(pid, SIGSTOP);
> }
> 
> static void continue_task(int pid)
> {
> 	if (kill(pid, SIGCONT))
> 		perror("Can't cont task");
> }

Eventually, I think you should use the cgroup freezer here rather
than signals. Shells and debuggers use these signals so a checkpoint
could easily and quietly be corrupted.

Even if you use the freezer, there needs to be a mechanism to
assure that the frozen cgroup is not thawed before a consistent
checkpoint is complete. Otherwise corruption is always a possibility.

> 
> static char big_tmp_str[PATH_MAX];
> 
> static int read_fd_params(int pid, char *fd, unsigned long *pos, unsigned int *flags)
> {
> 	char fd_str[128];
> 	int ifd;
> 
> 	sprintf(fd_str, "/proc/%d/fdinfo/%s", pid, fd);
> 
> 	printf("\tGetting fdinfo for fd %s\n", fd);
> 	ifd = open(fd_str, O_RDONLY);
> 	if (ifd < 0) {
> 		perror("Can't open fdinfo");
> 		return 1;
> 	}
> 
> 	read(ifd, big_tmp_str, sizeof(big_tmp_str));
> 	close(ifd);
> 
> 	sscanf(big_tmp_str, "pos:\t%lli\nflags:\t%o\n", pos, flags);
> 	return 0;
> }
> 
> static int dump_one_reg_file(int type, unsigned long fd_name, int lfd,
> 		int lclose, unsigned long pos, unsigned int flags)
> {
> 	char fd_str[128];
> 	int len;
> 	struct fdinfo_entry e;
> 
> 	sprintf(fd_str, "/proc/self/fd/%d", lfd);
> 	len = readlink(fd_str, big_tmp_str, sizeof(big_tmp_str) - 1);
> 	if (len < 0) {
> 		perror("Can't readlink fd");
> 		return 1;
> 	}
> 
> 	big_tmp_str[len] = '\0';
> 	printf("\tDumping path for %x fd via self %d [%s]\n", fd_name, lfd, big_tmp_str);
> 
> 	if (lclose)
> 		close(lfd);
> 
> 	e.type = type;
> 	e.addr = fd_name;
> 	e.len = len;
> 	e.pos = pos;
> 	e.flags = flags;
> 
> 	write(fdinfo_img, &e, sizeof(e));
> 	write(fdinfo_img, big_tmp_str, len);
> 
> 	return 0;
> }
> 
> #define MAX_PIPE_BUF_SIZE	1024 /* FIXME - this is not so */
> #define SPLICE_F_NONBLOCK	0x2
> 
> static int dump_pipe_and_data(int lfd, struct pipes_entry *e)
> {
> 	int steal_pipe[2];
> 	int ret;
> 
> 	printf("\tDumping data from pipe %x\n", e->pipeid);
> 	if (pipe(steal_pipe) < 0) {
> 		perror("Can't create pipe for stealing data");
> 		return 1;
> 	}
> 
> 	ret = tee(lfd, steal_pipe[1], MAX_PIPE_BUF_SIZE, SPLICE_F_NONBLOCK);

Neat application of tee().

> 	if (ret < 0) {
> 		if (errno != EAGAIN) {
> 			perror("Can't pick pipe data");
> 			return 1;
> 		}
> 
> 		ret = 0;
> 	}
> 
> 	e->bytes = ret;
> 	write(pipes_img, e, sizeof(*e));
> 
> 	if (ret) {
> 		ret = splice(steal_pipe[0], NULL, pipes_img, NULL, ret, 0);
> 		if (ret < 0) {
> 			perror("Can't push pipe data");
> 			return 1;
> 		}
> 	}
> 
> 	close(steal_pipe[0]);
> 	close(steal_pipe[1]);
> 	return 0;
> }
> 
> static int dump_one_pipe(int fd, int lfd, unsigned int id, unsigned int flags)
> {
> 	struct pipes_entry e;
> 
> 	printf("\tDumping pipe %d/%x flags %x\n", fd, id, flags);
> 
> 	e.fd = fd;
> 	e.pipeid = id;
> 	e.flags = flags;
> 
> 	if (flags & O_WRONLY) {
> 		e.bytes = 0;
> 		write(pipes_img, &e, sizeof(e));
> 		return 0;
> 	}
> 
> 	return dump_pipe_and_data(lfd, &e);
> }
> 
> static int dump_one_fd(int dir, char *fd_name, unsigned long pos, unsigned int flags)
> {
> 	int fd;
> 	struct stat st_buf;
> 	struct statfs stfs_buf;
> 
> 	printf("\tDumping fd %s\n", fd_name);
> 	fd = openat(dir, fd_name, O_RDONLY);
> 	if (fd == -1) {
> 		printf("Tried to openat %d/%d %s\n", getpid(), dir, fd_name);
> 		perror("Can't open fd");
> 		return 1;
> 	}
> 
> 	if (fstat(fd, &st_buf) < 0) {
> 		perror("Can't stat one");
> 		return 1;
> 	}
> 
> 	if (S_ISREG(st_buf.st_mode))
> 		return dump_one_reg_file(FDINFO_FD, atoi(fd_name), fd, 1, pos, flags);
> 
> 	if (S_ISFIFO(st_buf.st_mode)) {
> 		if (fstatfs(fd, &stfs_buf) < 0) {
> 			perror("Can't statfs one");
> 			return 1;
> 		}
> 
> 		if (stfs_buf.f_type == PIPEFS_MAGIC)
> 			return dump_one_pipe(atoi(fd_name), fd, st_buf.st_ino, flags);
> 	}

This is starting to look like a linear search over the set of all
possible types of things file descriptors can refer to. A kernel implementation
doesn't have to do this. Furthermore, if lots of file descriptors are open
this could be alot of fstat() and fstatfs() calls -- will making so many
syscalls force us to an completely in-kernel implementation, like the
set already proposed, just to get usable performance?

> 
> 	if (!strcmp(fd_name, "0")) {
> 		printf("\tSkipping stdin\n");
> 		return 0;
> 	}

Assuming that fd 0 is "stdin" is very very gross. Yes, it's almost always
true. But that does *not* mean that it's a pty. stdin could be a pipe
we need to checkpoint. Really, this is also about the "type" of thing
the fd is referring to -- not about which fd nr it is.

What are your plans for removing this?


> 
> 	if (!strcmp(fd_name, "1")) {
> 		printf("\tSkipping stdout\n");
> 		return 0;
> 	}

Gross again, for the same reasons.

> 
> 	if (!strcmp(fd_name, "2")) {
> 		printf("\tSkipping stderr\n");
> 		return 0;
> 	}

Gross again, for the same reasons.

> 
> 	fprintf(stderr, "Can't dump file %s of that type [%x]\n", fd_name, st_buf.st_mode);
> 	return 1;
> 
> }
> 
> static int dump_task_files(int pid)
> {
> 	char pid_fd_dir[64];
> 	DIR *fd_dir;
> 	struct dirent *de;
> 	unsigned long pos;
> 	unsigned int flags;
> 
> 	printf("Dumping open files for %d\n", pid);
> 
> 	sprintf(pid_fd_dir, "/proc/%d/fd", pid);
> 	fd_dir = opendir(pid_fd_dir);
> 	if (fd_dir == NULL) {
> 		perror("Can't open fd dir");
> 		return -1;
> 	}
> 
> 	while ((de = readdir(fd_dir)) != NULL) {
> 		if (de->d_name[0] == '.')
> 			continue;
> 
> 		if (read_fd_params(pid, de->d_name, &pos, &flags))
> 			return 1;
> 
> 		if (dump_one_fd(dirfd(fd_dir), de->d_name, pos, flags))
> 			return 1;
> 	}
> 
> 	closedir(fd_dir);
> 	return 0;
> }
> 
> #define PAGE_SIZE	4096
> #define PAGE_RSS	0x1
> 
> static unsigned long rawhex(char *str, char **end)
> {
> 	unsigned long ret = 0;
> 
> 	while (1) {
> 		if (str[0] >= '0' && str[0] <= '9') {
> 			ret <<= 4;
> 			ret += str[0] - '0';
> 		} else if (str[0] >= 'a' && str[0] <= 'f') {
> 			ret <<= 4;
> 			ret += str[0] - 'a' + 0xA;
> 		} else if (str[0] >= 'A' && str[0] <= 'F') {
> 			ret <<= 4;
> 			ret += str[0] - 'A' + 0xA;
> 		} else {
> 			if (end)
> 				*end = str;
> 			return ret;
> 		}
> 
> 		str++;
> 	}
> }

nit: I haven't looked closely enough to see where rawhex is being used,
	but is there's no suitable library function for this?

> 
> static void map_desc_parm(char *desc, unsigned long *pgoff, unsigned long *len)
> {
> 	char *s;
> 	unsigned long start, end;
> 
> 	start = rawhex(desc, &s);
> 	if (*s != '-') {
> 		goto bug;
> 	}
> 
> 	end = rawhex(s + 1, &s);
> 	if (*s != ' ') {
> 		goto bug;
> 	}
> 
> 	s = strchr(s + 1, ' ');
> 	*pgoff = rawhex(s + 1, &s);
> 	if (*s != ' ') {
> 		goto bug;
> 	}
> 
> 	if (start > end)
> 		goto bug;
> 
> 	*len = end - start;
> 
> 	if (*len % PAGE_SIZE) {
> 		goto bug;
> 	}
> 	if (*pgoff % PAGE_SIZE) {
> 		goto bug;
> 	}
> 
> 	return;
> bug:
> 	fprintf(stderr, "BUG\n");
> 	exit(1);
> }
> 
> static int dump_map_pages(int lfd, unsigned long start, unsigned long pgoff, unsigned long len)
> {
> 	unsigned int nrpages, pfn;
> 	void *mem;
> 	unsigned char *mc;
> 
> 	printf("\t\tDumping pages start %x len %x off %x\n", start, len, pgoff);
> 	mem = mmap(NULL, len, PROT_READ, MAP_FILE | MAP_PRIVATE, lfd, pgoff);
> 	if (mem == MAP_FAILED) {
> 		perror("Can't map");
> 		return 1;
> 	}
> 
> 	nrpages = len / PAGE_SIZE;
> 	mc = malloc(nrpages);
> 	if (mincore(mem, len, mc)) {
> 		perror("Can't mincore mapping");
> 		return 1;
> 	}
> 
> 	for (pfn = 0; pfn < nrpages; pfn++)
> 		if (mc[pfn] & PAGE_RSS) {
> 			__u64 vaddr;
> 
> 			vaddr = start + pfn * PAGE_SIZE;
> 			write(pages_img, &vaddr, 8);
> 			write(pages_img, mem + pfn * PAGE_SIZE, PAGE_SIZE);
> 		}
> 
> 	munmap(mem, len);
> 
> 	return 0;
> }
> 
> static int dump_anon_private_map(char *start)
> {
> 	printf("\tSkipping anon private mapping at %s\n", start);
> 	return 0;
> }
> 
> static int dump_anon_shared_map(char *_start, char *mdesc, int lfd, struct stat *st)
> {
> 	unsigned long pgoff, len;
> 	struct shmem_entry e;
> 	unsigned long start;
> 	struct stat buf;
> 
> 	map_desc_parm(mdesc, &pgoff, &len);
> 
> 	start = rawhex(_start, NULL);
> 	e.start = start;
> 	e.end = start + len;
> 	e.shmid = st->st_ino;
> 
> 	write(shmem_img, &e, sizeof(e));
> 
> 	if (dump_map_pages(lfd, start, pgoff, len))
> 		return 1;
> 
> 	close(lfd);
> 	return 0;
> }
> 
> static int dump_file_shared_map(char *start, char *mdesc, int lfd)
> {
> 	printf("\tSkipping file shared mapping at %s\n", start);
> 	close(lfd);
> 	return 0;
> }

Shouldn't this be an error since it appears these shared mappings
are currently unsupported?

> 
> static int dump_file_private_map(char *_start, char *mdesc, int lfd)
> {
> 	unsigned long pgoff, len;
> 	unsigned long start;
> 
> 	map_desc_parm(mdesc, &pgoff, &len);
> 
> 	start = rawhex(_start, NULL);
> 	if (dump_one_reg_file(FDINFO_MAP, start, lfd, 0, 0, O_RDONLY))
> 		return 1;
> 
> 	close(lfd);
> 	return 0;
> }
> 
> static int dump_one_mapping(char *mdesc, DIR *mfd_dir)
> {
> 	char *flags, *tmp;
> 	char map_start[32];
> 	int lfd;
> 	struct stat st_buf;
> 
> 	tmp = strchr(mdesc, '-');
> 	memset(map_start, 0, sizeof(map_start));
> 	strncpy(map_start, mdesc, tmp - mdesc);
> 	flags = strchr(mdesc, ' ');
> 	flags++;
> 
> 	printf("\tDumping %s\n", map_start);
> 	lfd = openat(dirfd(mfd_dir), map_start, O_RDONLY);
> 	if (lfd == -1) {
> 		if (errno != ENOENT) {
> 			perror("Can't open mapping");
> 			return 1;
> 		}
> 
> 		if (flags[3] != 'p') {
> 			fprintf(stderr, "Bogus mapping [%s]\n", mdesc);
> 			return 1;
> 		}
> 
> 		return dump_anon_private_map(map_start);
> 	}
> 
> 	if (fstat(lfd, &st_buf) < 0) {
> 		perror("Can't stat mapping!");
> 		return 1;
> 	}
> 
> 	if (!S_ISREG(st_buf.st_mode)) {
> 		perror("Can't handle non-regular mapping");
> 		return 1;
> 	}
> 
> 	if (MAJOR(st_buf.st_dev) == 0) {
> 		if (flags[3] != 's') {
> 			fprintf(stderr, "Bogus mapping [%s]\n", mdesc);
> 			return 1;
> 		}
> 
> 		/* FIXME - this can be tmpfs visible file mapping */
> 		return dump_anon_shared_map(map_start, mdesc, lfd, &st_buf);
> 	}
> 
> 	if (flags[3] == 'p')
> 		return dump_file_private_map(map_start, mdesc, lfd);
> 	else
> 		return dump_file_shared_map(map_start, mdesc, lfd);
> }
> 
> static int dump_task_ext_mm(int pid)
> {
> 	char path[64];
> 	DIR *mfd_dir;
> 	FILE *maps;
> 
> 	printf("Dumping mappings for %d\n", pid);
> 
> 	sprintf(path, "/proc/%d/mfd", pid);
> 	mfd_dir = opendir(path);
> 	if (mfd_dir == NULL) {
> 		perror("Can't open mfd dir");
> 		return -1;
> 	}
> 
> 	sprintf(path, "/proc/%d/maps", pid);
> 	maps = fopen(path, "r");
> 	if (maps == NULL) {
> 		perror("Can't open maps file");
> 		return 1;
> 	}
> 
> 	while (fgets(big_tmp_str, sizeof(big_tmp_str), maps) != NULL)
> 		if (dump_one_mapping(big_tmp_str, mfd_dir))
> 			return 1;
> 
> 	fclose(maps);
> 	closedir(mfd_dir);
> 	return 0;
> }
> 
> static int dump_task_state(int pid)
> {
> 	char path[64];
> 	int dump_fd;
> 	void *mem;
> 
> 	printf("Dumping task image for %d\n", pid);
> 	sprintf(path, "/proc/%d/dump", pid);
> 	dump_fd = open(path, O_RDONLY);
> 	if (dump_fd < 0) {
> 		perror("Can't open dump file");
> 		return 1;
> 	}
> 
> 	mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, 0, 0);
> 	if (mem == MAP_FAILED) {
> 		perror("Can't get mem");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		int r, w;
> 
> 		r = read(dump_fd, mem, 4096);
> 		if (r == 0)
> 			break;
> 		if (r < 0) {
> 			perror("Can't read dump file");
> 			return 1;
> 		}
> 
> 		w = 0;
> 		while (w < r) {
> 			int ret;
> 
> 			ret = write(core_img, mem + w, r - w);
> 			if (ret <= 0) {
> 				perror("Can't write core");
> 				return 1;
> 			}
> 
> 			w += ret;
> 		}
> 	}
> 
> 	munmap(mem, 4096);
> 	close(dump_fd);
> 
> 	return 0;
> }
> 
> static int dump_one_task(int pid, int stop)
> {
> 	printf("Dumping task %d\n", pid);
> 
> 	if (prep_img_files(pid))
> 		return 1;
> 
> 	if (stop && stop_task(pid))
> 		goto err_task;
> 
> 	if (dump_task_files(pid))
> 		goto err;
> 
> 	if (dump_task_ext_mm(pid))
> 		goto err;
> 
> 	if (dump_task_state(pid))
> 		goto err;
> 
> 	if (stop)
> 		continue_task(pid);
> 
> 	printf("Dump is complete\n");
> 	return 0;
> 
> err:
> 	if (stop)
> 		continue_task(pid);
> err_task:
> 	kill_imgfiles(pid);
> 	return 1;
> }
> 
> static int pstree_fd;
> static char big_tmp_str[4096];
> static int *pids, nr_pids;
> 
> static char *get_children_pids(int pid)
> {
> 	FILE *f;
> 	int len;
> 	char *ret, *tmp;
> 
> 	sprintf(big_tmp_str, "/proc/%d/status", pid);
> 	f = fopen(big_tmp_str, "r");
> 	if (f == NULL)
> 		return NULL;
> 
> 	while ((fgets(big_tmp_str, sizeof(big_tmp_str), f)) != NULL) {
> 		if (strncmp(big_tmp_str, "Children:", 9))
> 			continue;
> 
> 		tmp = big_tmp_str + 10;
> 		len = strlen(tmp);
> 		ret = malloc(len + 1);
> 		strcpy(ret, tmp);
> 		if (len)
> 			ret[len - 1] = ' ';
> 
> 		fclose(f);
> 		return ret;
> 	}
> 
> 	fclose(f);
> 	return NULL;
> }
> 
> static int dump_pid_and_children(int pid)
> {
> 	struct pstree_entry e;
> 	char *chlist, *tmp, *tmp2;
> 
> 	printf("\tReading %d children list\n", pid);
> 	chlist = get_children_pids(pid);
> 	if (chlist == NULL)
> 		return 1;
> 
> 	printf("\t%d has children %s\n", pid, chlist);
> 
> 	e.pid = pid;
> 	e.nr_children = 0;
> 
> 	pids = realloc(pids, (nr_pids + 1) * sizeof(int));
> 	pids[nr_pids++] = e.pid;
> 
> 	tmp = chlist;
> 	while ((tmp = strchr(tmp, ' ')) != NULL) {
> 		tmp++;
> 		e.nr_children++;
> 	}
> 
> 	write(pstree_fd, &e, sizeof(e));
> 	tmp = chlist;
> 	while (1) {
> 		__u32 cpid;
> 
> 		cpid = strtol(tmp, &tmp, 10);
> 		if (cpid == 0)
> 			break;
> 		if (*tmp != ' ') {
> 			fprintf(stderr, "Error in string with children!\n");
> 			return 1;
> 		}
> 
> 		write(pstree_fd, &cpid, sizeof(cpid));
> 		tmp++;
> 	}
> 
> 	tmp = chlist;
> 	while ((tmp2 = strchr(tmp, ' ')) != NULL) {
> 		*tmp2 = '\0';
> 		if (dump_pid_and_children(atoi(tmp)))
> 			return 1;
> 		tmp = tmp2 + 1;
> 	}
> 
> 	free(chlist);
> 	return 0;
> }
> 
> static int __dump_all_tasks(void)
> {
> 	int i, pid;
> 
> 	printf("Dumping tasks' images for");
> 	for (i = 0; i < nr_pids; i++)
> 		printf(" %d", pids[i]);
> 	printf("\n");
> 
> 	printf("Stopping tasks\n");
> 	for (i = 0; i < nr_pids; i++)
> 		if (stop_task(pids[i]))
> 			goto err;
> 
> 	for (i = 0; i < nr_pids; i++) {
> 		if (dump_one_task(pids[i], 0))
> 			goto err;
> 	}
> 
> 	printf("Resuming tasks\n");
> 	for (i = 0; i < nr_pids; i++)
> 		continue_task(pids[i]);
> 
> 	return 0;
> 
> err:
> 	for (i = 0; i < nr_pids; i++)
> 		continue_task(pids[i]);
> 	return 1;
> 
> }
> 
> static int dump_all_tasks(int pid)
> {
> 	char *chlist;
> 	__u32 type;
> 
> 	pids = NULL;
> 	nr_pids = 0;
> 
> 	printf("Dumping process tree, start from %d\n", pid);
> 
> 	sprintf(big_tmp_str, "pstree-%d.img", pid);
> 	pstree_fd = open(big_tmp_str, O_WRONLY | O_CREAT | O_EXCL, 0600);
> 	if (pstree_fd < 0) {
> 		perror("Can't create pstree");
> 		return 1;
> 	}
> 
> 	type = PSTREE_MAGIC;
> 	write(pstree_fd, &type, sizeof(type));
> 
> 	if (dump_pid_and_children(pid))
> 		return 1;
> 
> 	close(pstree_fd);
> 
> 	return __dump_all_tasks();
> }
> 
> int main(int argc, char **argv)
> {
> 	if (argc != 3)
> 		goto usage;
> 	if (argv[1][0] != '-')
> 		goto usage;
> 	if (argv[1][1] == 'p')
> 		return dump_one_task(atoi(argv[2]), 1);
> 	if (argv[1][1] == 't')
> 		return dump_all_tasks(atoi(argv[2]));
> 
> usage:
> 	printf("Usage: %s (-p|-t) <pid>\n", argv[0]);
> 	return 1;
> }

> #include <stdio.h>
> #include <unistd.h>
> #include <signal.h>
> #include <dirent.h>
> #include <string.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <errno.h>
> #include <linux/kdev_t.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/sendfile.h>
> 
> #define PAGE_SIZE	4096
> 
> #include <linux/types.h>
> #include "img_structs.h"
> #include "binfmt_img.h"
> 
> struct fmap_fd {
> 	unsigned long start;
> 	int fd;
> 	struct fmap_fd *next;
> };
> 
> static struct fmap_fd *fmap_fds;
> 
> struct shmem_info {
> 	unsigned long start;
> 	unsigned long end;
> 	unsigned long id;
> 	int pid;
> 	int real_pid;
> };
> 
> static struct shmem_info *shmems;
> static int nr_shmems;
> 
> struct pipes_info {
> 	unsigned int id;
> 	int pid;
> 	int real_pid;
> 	int read_fd;
> 	int write_fd;
> 	int users;
> };
> 
> static struct pipes_info *pipes;
> static int nr_pipes;
> 
> static void show_saved_shmems(void)
> {
> 	int i;
> 
> 	printf("\tSaved shmems:\n");
> 	for (i = 0; i < nr_shmems; i++)
> 		printf("\t\t%016lx %lx %d\n", shmems[i].start, shmems[i].id, shmems[i].pid);
> }
> 
> static void show_saved_pipes(void)
> {
> 	int i;
> 
> 	printf("\tSaved pipes:\n");
> 	for (i = 0; i < nr_pipes; i++)
> 		printf("\t\t%x -> %d\n", pipes[i].id, pipes[i].pid);
> }
> 
> static struct shmem_info *search_shmem(unsigned long addr, unsigned long id)
> {
> 	int i;
> 
> 	for (i = 0; i < nr_shmems; i++) {
> 		struct shmem_info *si;
> 
> 		si = shmems + i;
> 		if (si->start <= addr && si->end >= addr && si->id == id)
> 			return si;
> 	}
> 
> 	return NULL;
> }
> 
> static struct pipes_info *search_pipes(unsigned int pipeid)
> {
> 	int i;
> 
> 	for (i = 0; i < nr_pipes; i++) {
> 		struct pipes_info *pi;
> 
> 		pi = pipes + i;
> 		if (pi->id == pipeid)
> 			return pi;
> 	}
> 
> 	return NULL;
> }
> 
> static void shmem_update_real_pid(int vpid, int rpid)
> {
> 	int i;
> 
> 	for (i = 0; i < nr_shmems; i++)
> 		if (shmems[i].pid == vpid)
> 			shmems[i].real_pid = rpid;
> }
> 
> static int shmem_wait_and_open(struct shmem_info *si)
> {
> 	/* FIXME - not good */
> 	char path[128];
> 	unsigned long time = 1000;
> 
> 	sleep(1);
> 
> 	while (si->real_pid == 0)
> 		usleep(time);
> 
> 	sprintf(path, "/proc/%d/mfd/0x%lx", si->real_pid, si->start);
> 	while (1) {
> 		int ret;
> 
> 		ret = open(path, O_RDWR);
> 		if (ret > 0)
> 			return ret;
> 
> 		if (ret < 0 && errno != ENOENT) {
> 			perror("     Can't stat shmem");
> 			return -1;
> 		}
> 
> 		printf("Waiting for [%s] to appear\n", path);
> 		if (time < 20000000)
> 			time <<= 1;
> 		usleep(time);
> 	}
> }
> 
> static int try_to_add_shmem(int pid, struct shmem_entry *e)
> {
> 	int i;
> 
> 	for (i = 0; i < nr_shmems; i++) {
> 		if (shmems[i].start != e->start || shmems[i].id != e->shmid)
> 			continue;
> 
> 		if (shmems[i].end != e->end) {
> 			printf("Bogus shmem\n");
> 			return 1;
> 		}
> 
> 		if (shmems[i].pid > pid)
> 			shmems[i].pid = pid;
> 
> 		return 0;
> 	}
> 
> 	if ((nr_shmems + 1) * sizeof(struct shmem_info) >= 4096) {
> 		printf("OOM storing shmems\n");
> 		return 1;
> 	}
> 
> 	shmems[nr_shmems].start = e->start;
> 	shmems[nr_shmems].end = e->end;
> 	shmems[nr_shmems].id = e->shmid;
> 	shmems[nr_shmems].pid = pid;
> 	shmems[nr_shmems].real_pid = 0;
> 	nr_shmems++;
> 
> 	return 0;
> }
> 
> static int try_to_add_pipe(int pid, struct pipes_entry *e, int p_fd)
> {
> 	int i;
> 
> 	for (i = 0; i < nr_pipes; i++) {
> 		if (pipes[i].id != e->pipeid)
> 			continue;
> 
> 		if (pipes[i].pid > pid)
> 			pipes[i].pid = pid;
> 		pipes[i].users++;
> 
> 		return 0;
> 	}
> 
> 	if ((nr_pipes + 1) * sizeof(struct pipes_info) >= 4096) {
> 		printf("OOM storing pipes\n");
> 		return 1;
> 	}
> 
> 	pipes[nr_pipes].id = e->pipeid;
> 	pipes[nr_pipes].pid = pid;
> 	pipes[nr_pipes].real_pid = 0;
> 	pipes[nr_pipes].read_fd = 0;
> 	pipes[nr_pipes].write_fd = 0;
> 	pipes[nr_pipes].users = 1;
> 	nr_pipes++;
> 
> 	return 0;
> }
> 
> static int prepare_shmem_pid(int pid)
> {
> 	char path[64];
> 	int sh_fd;
> 	__u32 type = 0;
> 
> 	sprintf(path, "shmem-%d.img", pid);
> 	sh_fd = open(path, O_RDONLY);
> 	if (sh_fd < 0) {
> 		perror("Can't open shmem info");
> 		return 1;
> 	}
> 
> 	read(sh_fd, &type, sizeof(type));
> 	if (type != SHMEM_MAGIC) {
> 		perror("Bad shmem magic");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		struct shmem_entry e;
> 		int ret;
> 
> 		ret = read(sh_fd, &e, sizeof(e));
> 		if (ret == 0)
> 			break;
> 		if (ret != sizeof(e)) {
> 			perror("Can't read shmem entry");
> 			return 1;
> 		}
> 
> 		if (try_to_add_shmem(pid, &e))
> 			return 1;
> 	}
> 
> 	close(sh_fd);
> 	return 0;
> }
> 
> static int prepare_pipes_pid(int pid)
> {
> 	char path[64];
> 	int p_fd;
> 	__u32 type = 0;
> 
> 	sprintf(path, "pipes-%d.img", pid);
> 	p_fd = open(path, O_RDONLY);
> 	if (p_fd < 0) {
> 		perror("Can't open pipes image");
> 		return 1;
> 	}
> 
> 	read(p_fd, &type, sizeof(type));
> 	if (type != PIPES_MAGIC) {
> 		perror("Bad pipes magin");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		struct pipes_entry e;
> 		int ret;
> 
> 		ret = read(p_fd, &e, sizeof(e));
> 		if (ret == 0)
> 			break;
> 		if (ret != sizeof(e)) {
> 			fprintf(stderr, "Read pipes for %s failed %d of %d read\n",
> 					path, ret, sizeof(e));
> 			perror("Can't read pipes entry");
> 			return 1;
> 		}
> 
> 		if (try_to_add_pipe(pid, &e, p_fd))
> 			return 1;
> 
> 		lseek(p_fd, e.bytes, SEEK_CUR);
> 	}
> 
> 	close(p_fd);
> 	return 0;
> }
> 
> static int prepare_shared(int ps_fd)
> {
> 	printf("Preparing info about shared resources\n");
> 
> 	nr_shmems = 0;
> 	shmems = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANON, 0, 0);
> 	if (shmems == MAP_FAILED) {
> 		perror("Can't map shmems");
> 		return 1;
> 	}
> 
> 	pipes = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANON, 0, 0);
> 	if (pipes == MAP_FAILED) {
> 		perror("Can't map pipes");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		struct pstree_entry e;
> 		int ret;
> 
> 		ret = read(ps_fd, &e, sizeof(e));
> 		if (ret == 0)
> 			break;
> 
> 		if (ret != sizeof(e)) {
> 			perror("Can't read ps");
> 			return 1;
> 		}
> 
> 		if (prepare_shmem_pid(e.pid))
> 			return 1;
> 
> 		if (prepare_pipes_pid(e.pid))
> 			return 1;
> 
> 		lseek(ps_fd, e.nr_children * sizeof(__u32), SEEK_CUR);
> 	}
> 
> 	lseek(ps_fd, sizeof(__u32), SEEK_SET);
> 
> 	show_saved_shmems();
> 	show_saved_pipes();
> 
> 	return 0;
> }
> 
> static struct fmap_fd *pop_fmap_fd(unsigned long start)
> {
> 	struct fmap_fd **p, *r;
> 
> 	for (p = &fmap_fds; *p != NULL; p = &(*p)->next) {
> 		if ((*p)->start != start)
> 			continue;
> 
> 		r = *p;
> 		*p = r->next;
> 		return r;
> 	}
> 
> 	return NULL;
> }
> 
> static int open_fe_fd(struct fdinfo_entry *fe, int fd)
> {
> 	char path[PATH_MAX];
> 	int tmp;
> 
> 	if (read(fd, path, fe->len) != fe->len) {
> 		fprintf(stderr, "Error reading path");
> 		return -1;
> 	}
> 
> 	path[fe->len] = '\0';
> 
> 	tmp = open(path, fe->flags);
> 	if (tmp < 0) {
> 		perror("Can't open file");
> 		return -1;
> 	}
> 
> 	lseek(tmp, fe->pos, SEEK_SET);
> 
> 	return tmp;
> }
> 
> static int reopen_fd(int old_fd, int new_fd)
> {
> 	int tmp;
> 
> 	if (old_fd != new_fd) {
> 		tmp = dup2(old_fd, new_fd);
> 		if (tmp < 0)
> 			return tmp;
> 
> 		close(old_fd);
> 	}
> 
> 	return new_fd;
> }
> 
> static int open_fd(int pid, struct fdinfo_entry *fe, int *cfd)
> {
> 	int fd, tmp;
> 
> 	if (*cfd == (int)fe->addr) {
> 		tmp = dup(*cfd);
> 		if (tmp < 0) {
> 			perror("Can't dup file");
> 			return 1;
> 		}
> 
> 		*cfd = tmp;
> 	}
> 
> 	tmp = open_fe_fd(fe, *cfd);
> 	if (tmp < 0)
> 		return 1;
> 
> 	fd = reopen_fd(tmp, (int)fe->addr);
> 	if (fd < 0) {
> 		perror("Can't dup");
> 		return 1;
> 	}
> 
> 	return 0;
> }
> 
> static int open_fmap(int pid, struct fdinfo_entry *fe, int fd)
> {
> 	int tmp;
> 	struct fmap_fd *new;
> 
> 	tmp = open_fe_fd(fe, fd);
> 	if (tmp < 0)
> 		return 1;
> 
> 	printf("%d:\t\tWill map %x to %d\n", pid, fe->addr, tmp);
> 	new = malloc(sizeof(*new));
> 	new->start = fe->addr;
> 	new->fd = tmp;
> 	new->next = fmap_fds;
> 	fmap_fds = new;
> 
> 	return 0;
> }
> 
> static int prepare_fds(int pid)
> {
> 	__u32 mag;
> 	char path[64];
> 	int fdinfo_fd;
> 
> 	printf("%d: Opening files\n", pid);
> 
> 	sprintf(path, "fdinfo-%d.img", pid);
> 	fdinfo_fd = open(path, O_RDONLY);
> 	if (fdinfo_fd < 0) {
> 		perror("Can't open fdinfo");
> 		return 1;
> 	}
> 
> 	read(fdinfo_fd, &mag, 4);
> 	if (mag != FDINFO_MAGIC) {
> 		fprintf(stderr, "Bad file\n");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		int ret;
> 		struct fdinfo_entry fe;
> 
> 		ret = read(fdinfo_fd, &fe, sizeof(fe));
> 		if (ret == 0) {
> 			close(fdinfo_fd);
> 			return 0;
> 		}
> 
> 		if (ret < 0) {
> 			perror("Can't read file");
> 			return 1;
> 		}
> 		if (ret != sizeof(fe)) {
> 			fprintf(stderr, "Error reading\n");
> 			return 1;
> 		}
> 
> 		printf("\t%d: Got fd for %lx type %d namelen %d\n", pid,
> 				(unsigned long)fe.addr, fe.type, fe.len);
> 		switch (fe.type) {
> 		case FDINFO_FD:
> 			if (open_fd(pid, &fe, &fdinfo_fd))
> 				return 1;
> 
> 			break;
> 		case FDINFO_MAP:
> 			if (open_fmap(pid, &fe, fdinfo_fd))
> 				return 1;
> 
> 			break;
> 		default:
> 			fprintf(stderr, "Some bullshit in a file\n");
> 			return 1;
> 		}
> 	}
> }
> 
> struct shmem_to_id {
> 	unsigned long addr;
> 	unsigned long end;
> 	unsigned long id;
> 	struct shmem_to_id *next;
> };
> 
> static struct shmem_to_id *my_shmem_ids;
> 
> static unsigned long find_shmem_id(unsigned long addr)
> {
> 	struct shmem_to_id *si;
> 
> 	for (si = my_shmem_ids; si != NULL; si = si->next)
> 		if (si->addr <= addr && si->end >= addr)
> 			return si->id;
> 
> 	return 0;
> }
> 
> static void save_shmem_id(struct shmem_entry *e)
> {
> 	struct shmem_to_id *si;
> 
> 	si = malloc(sizeof(*si));
> 	si->addr = e->start;
> 	si->end = e->end;
> 	si->id = e->shmid;
> 	si->next = my_shmem_ids;
> 	my_shmem_ids = si;
> }
> 
> static int prepare_shmem(int pid)
> {
> 	char path[64];
> 	int sh_fd;
> 	__u32 type = 0;
> 
> 	sprintf(path, "shmem-%d.img", pid);
> 	sh_fd = open(path, O_RDONLY);
> 	if (sh_fd < 0) {
> 		perror("Can't open shmem info");
> 		return 1;
> 	}
> 
> 	read(sh_fd, &type, sizeof(type));
> 	if (type != SHMEM_MAGIC) {
> 		perror("Bad shmem magic");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		struct shmem_entry e;
> 		int ret;
> 
> 		ret = read(sh_fd, &e, sizeof(e));
> 		if (ret == 0)
> 			break;
> 		if (ret != sizeof(e)) {
> 			perror("Can't read shmem entry");
> 			return 1;
> 		}
> 
> 		save_shmem_id(&e);
> 	}
> 
> 	close(sh_fd);
> 	return 0;
> }
> 
> static int try_fixup_file_map(int pid, struct binfmt_vma_image *vi, int fd)
> {
> 	struct fmap_fd *fmfd;
> 
> 	fmfd = pop_fmap_fd(vi->start);
> 	if (fmfd != NULL) {
> 		printf("%d: Fixing %lx vma to %d fd\n", pid, vi->start, fmfd->fd);
> 		lseek(fd, -sizeof(*vi), SEEK_CUR);
> 		vi->fd = fmfd->fd;
> 		if (write(fd, vi, sizeof(*vi)) != sizeof(*vi)) {
> 			perror("Can't write img");
> 			return 1;
> 		}
> 
> 		free(fmfd);
> 	}
> 
> 	return 0;
> }
> 
> static int try_fixup_shared_map(int pid, struct binfmt_vma_image *vi, int fd)
> {
> 	struct shmem_info *si;
> 	unsigned long id;
> 
> 	id = find_shmem_id(vi->start);
> 	if (id == 0)
> 		return 0;
> 
> 	si = search_shmem(vi->start, id);
> 	printf("%d: Search for %016lx shmem %p/%d\n", pid, vi->start, si, si ? si->pid : -1);
> 
> 	if (si == NULL) {
> 		fprintf(stderr, "Can't find my shmem %016lx\n", vi->start);
> 		return 1;
> 	}
> 
> 	if (si->pid != pid) {
> 		int sh_fd;
> 
> 		sh_fd = shmem_wait_and_open(si);
> 		printf("%d: Fixing %lx vma to %x/%d shmem -> %d\n", pid, vi->start, si->id, si->pid, sh_fd);
> 		if (fd < 0) {
> 			perror("Can't open shmem");
> 			return 1;
> 		}
> 
> 		lseek(fd, -sizeof(*vi), SEEK_CUR);
> 		vi->fd = sh_fd;
> 		if (write(fd, vi, sizeof(*vi)) != sizeof(*vi)) {
> 			perror("Can't write img");
> 			return 1;
> 		}
> 	}
> 
> 	return 0;
> }
> 
> static int fixup_vma_fds(int pid, int fd)
> {
> 	lseek(fd, sizeof(struct binfmt_img_header) +
> 			sizeof(struct binfmt_regs_image) +
> 			sizeof(struct binfmt_mm_image), SEEK_SET);
> 
> 	while (1) {
> 		struct binfmt_vma_image vi;
> 
> 		if (read(fd, &vi, sizeof(vi)) != sizeof(vi)) {
> 			perror("Can't read");
> 			return 1;
> 		}
> 
> 		if (vi.start == 0 && vi.end == 0)
> 			return 0;
> 
> 		printf("%d: Fixing %016lx-%016lx %016lx vma\n", pid, vi.start, vi.end, vi.pgoff);
> 		if (try_fixup_file_map(pid, &vi, fd))
> 			return 1;
> 
> 		if (try_fixup_shared_map(pid, &vi, fd))
> 			return 1;
> 	}
> }
> 
> static inline int should_restore_page(int pid, unsigned long vaddr)
> {
> 	struct shmem_info *si;
> 	unsigned long id;
> 
> 	id = find_shmem_id(vaddr);
> 	if (id == 0)
> 		return 1;
> 
> 	si = search_shmem(vaddr, id);
> 	return si->pid == pid;
> }
> 
> static int fixup_pages_data(int pid, int fd)
> {
> 	char path[128];
> 	int shfd;
> 	__u32 mag;
> 	__u64 vaddr;
> 
> 	sprintf(path, "pages-%d.img", pid);
> 	shfd = open(path, O_RDONLY);
> 	if (shfd < 0) {
> 		perror("Can't open shmem image");
> 		return 1;
> 	}
> 
> 	read(shfd, &mag, sizeof(mag));
> 	if (mag != PAGES_MAGIC) {
> 		fprintf(stderr, "Bad shmem image\n");
> 		return 1;
> 	}
> 
> 	lseek(fd, -sizeof(struct binfmt_page_image), SEEK_END);
> 	read(fd, &vaddr, sizeof(vaddr));
> 	if (vaddr != 0) {
> 		printf("SHIT %lx\n", (unsigned long)vaddr);
> 		return 1;
> 	}
> 	lseek(fd, -sizeof(struct binfmt_page_image), SEEK_END);
> 
> 	while (1) {
> 		int ret;
> 
> 		ret = read(shfd, &vaddr, sizeof(vaddr));
> 		if (ret == 0)
> 			break;
> 
> 		if (ret < 0 || ret != sizeof(vaddr)) {
> 			perror("Can't read vaddr");
> 			return 1;
> 		}
> 
> 		if (vaddr == 0)
> 			break;
> 
> 		if (!should_restore_page(pid, vaddr)) {
> 			lseek(shfd, PAGE_SIZE, SEEK_CUR);
> 			continue;
> 		}
> 
> //		printf("Copy page %lx to image\n", (unsigned long)vaddr);
> 		write(fd, &vaddr, sizeof(vaddr));
> 		sendfile(fd, shfd, NULL, PAGE_SIZE);
> 	}
> 
> 	close(shfd);
> 	vaddr = 0;
> 	write(fd, &vaddr, sizeof(vaddr));
> 	return 0;
> }
> 
> static int prepare_image_maps(int fd, int pid)
> {
> 	printf("%d: Fixing maps before executing image\n", pid);
> 
> 	if (fixup_vma_fds(pid, fd))
> 		return 1;
> 
> 	if (fixup_pages_data(pid, fd))
> 		return 1;
> 
> 	close(fd);
> 	return 0;
> }
> 
> static int execute_image(int pid)
> {
> 	char path[128];
> 	int fd, fd_new;
> 	struct stat buf;
> 
> 	sprintf(path, "core-%d.img", pid);
> 	fd = open(path, O_RDONLY);
> 	if (fd < 0) {
> 		perror("Can't open exec image");
> 		return 1;
> 	}
> 
> 	if (fstat(fd, &buf)) {
> 		perror("Can't stat");
> 		return 1;
> 	}
> 
> 	sprintf(path, "core-%d.img.out", pid);
> 	fd_new = open(path, O_RDWR | O_CREAT | O_EXCL, 0700);
> 	if (fd_new < 0) {
> 		perror("Can't open new image");
> 		return 1;
> 	}
> 
> 	printf("%d: Preparing execution image\n", pid);
> 	sendfile(fd_new, fd, NULL, buf.st_size);
> 	close(fd);
> 
> 	if (fchmod(fd_new, 0700)) {
> 		perror("Can't prepare exec image");
> 		return 1;
> 	}
> 
> 	if (prepare_image_maps(fd_new, pid))
> 		return 1;
> 
> 	printf("%d/%d EXEC IMAGE\n", pid, getpid());
> 	return execl(path, path, NULL);

How are you going to restore O_CLOEXEC flags?

> }
> 
> static int create_pipe(int pid, struct pipes_entry *e, struct pipes_info *pi, int pipes_fd)
> {
> 	int pfd[2], tmp;
> 	unsigned long time = 1000;
> 
> 	printf("\t%d: Creating pipe %x\n", pid, e->pipeid);
> 
> 	if (pipe(pfd) < 0) {
> 		perror("Can't create pipe");
> 		return 1;
> 	}
> 
> 	if (e->bytes) {
> 		printf("\t%d: Splicing data to %d\n", pid, pfd[1]);
> 
> 		tmp = splice(pipes_fd, NULL, pfd[1], NULL, e->bytes, 0);
> 		if (tmp != e->bytes) {
> 			fprintf(stderr, "Wanted to restore %ld bytes, but got %ld\n",
> 					e->bytes, tmp);
> 			if (tmp < 0)
> 				perror("Error splicing data");
> 			return 1;
> 		}
> 	}
> 
> 	pi->read_fd = pfd[0];
> 	pi->write_fd = pfd[1];
> 	pi->real_pid = getpid();
> 
> 	printf("\t%d: Done, waiting for others on %d pid with r:%d w:%d\n",
> 			pid, pi->real_pid, pfd[0], pfd[1]);
> 
> 	while (1) {
> 		if (pi->users == 1) /* only I left */
> 			break;
> 
> 		printf("\t%d: Waiting for %x pipe to attach (%d users left)\n",
> 				pid, e->pipeid, pi->users - 1);
> 		if (time < 20000000)
> 			time <<= 1;
> 		usleep(time);
> 	}
> 
> 	printf("\t%d: All is ok - reopening pipe for %d\n", pid, e->fd);
> 	if (e->flags & O_WRONLY) {
> 		close(pfd[0]);
> 		tmp = reopen_fd(pfd[1], e->fd);
> 	} else {
> 		close(pfd[1]);
> 		tmp = reopen_fd(pfd[0], e->fd);
> 	}
> 
> 	if (tmp < 0) {
> 		perror("Can't dup pipe fd");
> 		return 1;
> 	}
> 
> 	return 0;
> }
> 
> static int attach_pipe(int pid, struct pipes_entry *e, struct pipes_info *pi)
> {
> 	char path[128];
> 	int tmp, fd;
> 
> 	printf("\t%d: Wating for pipe %x to appear\n", pid, e->pipeid);
> 
> 	while (pi->real_pid == 0)
> 		usleep(1000);
> 
> 	if (e->flags & O_WRONLY)
> 		tmp = pi->write_fd;
> 	else
> 		tmp = pi->read_fd;
> 
> 	sprintf(path, "/proc/%d/fd/%d", pi->real_pid, tmp);
> 	printf("\t%d: Attaching pipe %s\n", pid, path);
> 
> 	fd = open(path, e->flags);
> 	if (fd < 0) {
> 		perror("Can't attach pipe");
> 		return 1;
> 	}
> 
> 	printf("\t%d: Done, reopening for %d\n", pid, e->fd);
> 	pi->users--;
> 	tmp = reopen_fd(fd, e->fd);
> 	if (tmp < 0) {
> 		perror("Can't dup to attach pipe");
> 		return 1;
> 	}
> 
> 	return 0;
> 
> }
> 
> static int open_pipe(int pid, struct pipes_entry *e, int *pipes_fd)
> {
> 	struct pipes_info *pi;
> 
> 	printf("\t%d: Opening pipe %x on fd %d\n", pid, e->pipeid, e->fd);
> 	if (e->fd == *pipes_fd) {
> 		int tmp;
> 
> 		tmp = dup(*pipes_fd);
> 		if (tmp < 0) {
> 			perror("Can't dup file");
> 			return 1;
> 		}
> 
> 		*pipes_fd = tmp;
> 	}
> 
> 	pi = search_pipes(e->pipeid);
> 	if (pi == NULL) {
> 		fprintf(stderr, "BUG: can't find my pipe %x\n", e->pipeid);
> 		return 1;
> 	}
> 
> 	if (pi->pid == pid)
> 		return create_pipe(pid, e, pi, *pipes_fd);
> 	else
> 		return attach_pipe(pid, e, pi);
> }
> 
> static int prepare_pipes(int pid)
> {
> 	char path[64];
> 	int pipes_fd;
> 	__u32 type = 0;
> 
> 	printf("%d: Opening pipes\n", pid);
> 
> 	sprintf(path, "pipes-%d.img", pid);
> 	pipes_fd = open(path, O_RDONLY);
> 	if (pipes_fd < 0) {
> 		perror("Can't open pipes img");
> 		return 1;
> 	}
> 
> 	read(pipes_fd, &type, sizeof(type));
> 	if (type != PIPES_MAGIC) {
> 		perror("Bad pipes file");
> 		return 1;
> 	}
> 
> 	while (1) {
> 		struct pipes_entry e;
> 		int ret;
> 
> 		ret = read(pipes_fd, &e, sizeof(e));
> 		if (ret == 0) {
> 			close(pipes_fd);
> 			return 0;
> 		}
> 		if (ret != sizeof(e)) {
> 			perror("Bad pipes entry");
> 			return 1;
> 		}
> 
> 		if (open_pipe(pid, &e, &pipes_fd))
> 			return 1;
> 	}
> }
> 
> static int restore_one_task(int pid)
> {
> 	printf("%d: Restoring resources\n", pid);
> 
> 	if (prepare_pipes(pid))
> 		return 1;
> 
> 	if (prepare_fds(pid))
> 		return 1;
> 
> 	if (prepare_shmem(pid))
> 		return 1;
> 
> 	return execute_image(pid);
> }
> 
> static int restore_task_with_children(int my_pid, char *pstree_path);
> 
> #if 0
> static inline int fork_with_pid(int pid, char *pstree_path)
> {
> 	/* FIXME - no such ability now */
> 	int ret;
> 
> 	ret = fork();
> 	if (ret == 0) {
> 		ret = restore_task_with_children(pid, pstree_path);
> 		exit(ret);
> 	}
> 
> 	return ret;
> }
> #else
> #define CLONE_CHILD_USEPID      0x02000000
> 
> static int do_child(void *arg)
> {
> 	return restore_task_with_children(getpid(), arg);
> }
> 
> static inline int fork_with_pid(int pid, char *pstree_path)
> {
> 	void *stack;
> 
> 	stack = mmap(0, 4 * 4096, PROT_READ | PROT_WRITE,
> 			MAP_PRIVATE | MAP_ANON | MAP_GROWSDOWN, 0, 0);
> 	if (stack == MAP_FAILED)
> 		return -1;
> 
> 	stack += 4 * 4096;
> 	return clone(do_child, stack, SIGCHLD | CLONE_CHILD_USEPID, pstree_path, NULL, NULL, &pid);
> 
> }
> #endif
> 
> static int restore_task_with_children(int my_pid, char *pstree_path)
> {
> 	int *pids;
> 	int fd, ret, i;
> 	struct pstree_entry e;
> 
> 	printf("%d: Starting restore\n", my_pid);
> 
> 	fd = open(pstree_path, O_RDONLY);
> 	if (fd < 0) {
> 		perror("Can't reopen pstree image");
> 		exit(1);
> 	}
> 
> 	lseek(fd, sizeof(__u32), SEEK_SET);
> 	while (1) {
> 		ret = read(fd, &e, sizeof(e));
> 		if (ret != sizeof(e)) {
> 			fprintf(stderr, "%d: Read returned %d\n", my_pid, ret);
> 			if (ret < 0)
> 				perror("Can't read pstree");
> 			exit(1);
> 		}
> 
> 		if (e.pid != my_pid) {
> 			lseek(fd, e.nr_children * sizeof(__u32), SEEK_CUR);
> 			continue;
> 		}
> 		
> 		break;
> 	}
> 
> 	if (e.nr_children > 0) {
> 		i = e.nr_children * sizeof(int);
> 		pids = malloc(i);
> 		ret = read(fd, pids, i);
> 		if (ret != i) {
> 			perror("Can't read children pids");
> 			exit(1);
> 		}
> 
> 		close(fd);
> 
> 		printf("%d: Restoring %d children:\n", my_pid, e.nr_children);
> 		for (i = 0; i < e.nr_children; i++) {
> 			printf("\tFork %d from %d\n", pids[i], my_pid);
> 			ret = fork_with_pid(pids[i], pstree_path);
> 			if (ret < 0) {
> 				perror("Can't fork kid");
> 				exit(1);
> 			}
> 		}
> 	} else
> 		close(fd);
> 
> 	shmem_update_real_pid(my_pid, getpid());
> 
> 	return restore_one_task(my_pid);
> }
> 
> static int restore_root_task(char *pstree_path, int fd)
> {
> 	struct pstree_entry e;
> 	int ret;
> 
> 	ret = read(fd, &e, sizeof(e));
> 	if (ret != sizeof(e)) {
> 		perror("Can't read root pstree entry");
> 		return 1;
> 	}
> 
> 	close(fd);
> 
> 	printf("Forking root with %d pid\n", e.pid);
> 	ret = fork_with_pid(e.pid, pstree_path);
> 	if (ret < 0) {
> 		perror("Can't fork root");
> 		return 1;
> 	}
> 
> 	wait(NULL);
> 	return 0;
> }
> 
> static int restore_all_tasks(char *pid)
> {
> 	char path[128];
> 	int pstree_fd;
> 	__u32 type = 0;
> 
> 	sprintf(path, "pstree-%s.img", pid);
> 	pstree_fd = open(path, O_RDONLY);
> 	if (pstree_fd < 0) {
> 		perror("Can't open pstree image");
> 		return 1;
> 	}
> 
> 	read(pstree_fd, &type, sizeof(type));
> 	if (type != PSTREE_MAGIC) {
> 		perror("Bad pstree magic");
> 		return 1;
> 	}
> 
> 	if (prepare_shared(pstree_fd))
> 		return 1;
> 
> 	return restore_root_task(path, pstree_fd);
> }
> 
> int main(int argc, char **argv)
> {
> 	if (argc != 3)
> 		goto usage;
> 	if (argv[1][0] != '-')
> 		goto usage;
> 	if (argv[1][1] == 'p')
> 		return restore_one_task(atoi(argv[2]));
> 	if (argv[1][1] == 't')
> 		return restore_all_tasks(argv[2]);
> 
> usage:
> 	printf("Usage: %s (-t|-p) <pid>\n", argv[0]);
> 	return 1;
> }

> #include <stdio.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <linux/types.h>
> #include <string.h>
> #include "img_structs.h"
> #include "binfmt_img.h"
> 
> static int show_fdinfo(int fd)
> {
> 	char data[1024];
> 	struct fdinfo_entry e;
> 
> 	while (1) {
> 		int ret;
> 
> 		ret = read(fd, &e, sizeof(e));
> 		if (ret == 0)
> 			break;
> 		if (ret != sizeof(e)) {
> 			perror("Can't read");
> 			return 1;
> 		}
> 
> 		ret = read(fd, data, e.len);
> 		if (ret != e.len) {
> 			perror("Can't read");
> 			return 1;
> 		}
> 
> 		data[e.len] = '\0';
> 		switch (e.type) {
> 		case FDINFO_FD:
> 			printf("fd %d [%s] pos %lx flags %o\n", (int)e.addr, data, e.pos, e.flags);
> 			break;
> 		case FDINFO_MAP:
> 			printf("map %lx [%s] flags %o\n", e.addr, data, e.flags);
> 			break;
> 		default:
> 			fprintf(stderr, "Unknown fdinfo entry type %d\n", e.type);
> 			return 1;
> 		}
> 	}
> 
> 	return 0;
> }
> 
> #define PAGE_SIZE	4096
> 
> static int show_mem(int fd)
> {
> 	__u64 vaddr;
> 	unsigned int data[2];
> 
> 	while (1) {
> 		if (read(fd, &vaddr, 8) == 0)
> 			break;
> 		if (vaddr == 0)
> 			break;
> 
> 		read(fd, &data[0], sizeof(unsigned int));
> 		lseek(fd, PAGE_SIZE - 2 * sizeof(unsigned int), SEEK_CUR);
> 		read(fd, &data[1], sizeof(unsigned int));
> 
> 		printf("\tpage 0x%lx [%x...%x]\n", (unsigned long)vaddr, data[0], data[1]);
> 	}
> 
> 	return 0;
> }
> 
> static int show_pages(int fd)
> {
> 	return show_mem(fd);
> }
> 
> static int show_shmem(int fd)
> {
> 	int r;
> 	struct shmem_entry e;
> 
> 	while (1) {
> 		r = read(fd, &e, sizeof(e));
> 		if (r == 0)
> 			return 0;
> 		if (r != sizeof(e)) {
> 			perror("Can't read shmem entry");
> 			return 1;
> 		}
> 
> 		printf("%016lx-%016lx %016x\n", e.start, e.end, e.shmid);
> 	}
> }
> 
> static char *segval(__u16 seg)
> {
> 	switch (seg) {
> 		case CKPT_X86_SEG_NULL:		return "nul";
> 		case CKPT_X86_SEG_USER32_CS:	return "cs32";
> 		case CKPT_X86_SEG_USER32_DS:	return "ds32";
> 		case CKPT_X86_SEG_USER64_CS:	return "cs64";
> 		case CKPT_X86_SEG_USER64_DS:	return "ds64";
> 	}
> 
> 	if (seg & CKPT_X86_SEG_TLS)
> 		return "tls";
> 	if (seg & CKPT_X86_SEG_LDT)
> 		return "ldt";
> 
> 	return "[unknown]";
> }
> 
> static int show_regs(int fd)
> {
> 	struct binfmt_regs_image ri;
> 
> 	if (read(fd, &ri, sizeof(ri)) != sizeof(ri)) {
> 		perror("Can't read registers from image");
> 		return 1;
> 	}
> 
> 	printf("Registers:\n");
> 
> 	printf("\tr15:     %016lx\n", ri.r15);
> 	printf("\tr14:     %016lx\n", ri.r14);
> 	printf("\tr13:     %016lx\n", ri.r13);
> 	printf("\tr12:     %016lx\n", ri.r12);
> 	printf("\tr11:     %016lx\n", ri.r11);
> 	printf("\tr10:     %016lx\n", ri.r10);
> 	printf("\tr9:      %016lx\n", ri.r9);
> 	printf("\tr8:      %016lx\n", ri.r8);
> 	printf("\tax:      %016lx\n", ri.ax);
> 	printf("\torig_ax: %016lx\n", ri.orig_ax);
> 	printf("\tbx:      %016lx\n", ri.bx);
> 	printf("\tcx:      %016lx\n", ri.cx);
> 	printf("\tdx:      %016lx\n", ri.dx);
> 	printf("\tsi:      %016lx\n", ri.si);
> 	printf("\tdi:      %016lx\n", ri.di);
> 	printf("\tip:      %016lx\n", ri.ip);
> 	printf("\tflags:   %016lx\n", ri.flags);
> 	printf("\tbp:      %016lx\n", ri.bp);
> 	printf("\tsp:      %016lx\n", ri.sp);
> 	printf("\tgs:      %016lx\n", ri.gs);
> 	printf("\tfs:      %016lx\n", ri.fs);
> 	printf("\tgsindex: %s\n", segval(ri.gsindex));
> 	printf("\tfsindex: %s\n", segval(ri.fsindex));
> 	printf("\tcs:      %s\n", segval(ri.cs));
> 	printf("\tss:      %s\n", segval(ri.ss));
> 	printf("\tds:      %s\n", segval(ri.ds));
> 	printf("\tes:      %s\n", segval(ri.es));
> 
> 	printf("\ttls0     %016lx\n", ri.tls[0]);
> 	printf("\ttls1     %016lx\n", ri.tls[1]);
> 	printf("\ttls2     %016lx\n", ri.tls[2]);
> 
> 	return 0;
> }
> 
> static int show_mm(int fd, unsigned long *stack)
> {
> 	struct binfmt_mm_image mi;
> 
> 	if (read(fd, &mi, sizeof(mi)) != sizeof(mi)) {
> 		perror("Can't read mm from image");
> 		return 1;
> 	}
> 
> 	printf("MM:\n");
> 	printf("\tflags:       %016lx\n", mi.flags);
> 	printf("\tdef_flags:   %016lx\n", mi.def_flags);
> 	printf("\tstart_code:  %016lx\n", mi.start_code);
> 	printf("\tend_code:    %016lx\n", mi.end_code);
> 	printf("\tstart_data:  %016lx\n", mi.start_data);
> 	printf("\tend_data:    %016lx\n", mi.end_data);
> 	printf("\tstart_brk:   %016lx\n", mi.start_brk);
> 	printf("\tbrk:         %016lx\n", mi.brk);
> 	printf("\tstart_stack: %016lx\n", mi.start_stack);
> 	printf("\targ_start:   %016lx\n", mi.arg_start);
> 	printf("\targ_end:     %016lx\n", mi.arg_end);
> 	printf("\tenv_start:   %016lx\n", mi.env_start);
> 	printf("\tenv_end:     %016lx\n", mi.env_end);
> 
> 	*stack = mi.start_stack;
> 
> 	return 0;
> }
> 
> static int show_vmas(int fd, unsigned long stack)
> {
> 	struct binfmt_vma_image vi;
> 
> 	printf("VMAs:\n");
> 	while (1) {
> 		char *note = "";
> 
> 		if (read(fd, &vi, sizeof(vi)) != sizeof(vi)) {
> 			perror("Can't read vma from image");
> 			return 1;
> 		}
> 
> 		if (vi.start == 0 && vi.end == 0)
> 			return 0;
> 
> 		if (vi.start <= stack && vi.end >= stack)
> 			note = "[stack]";
> 
> 		printf("\t%016lx-%016lx file %d %016lx prot %x flags %x %s\n",
> 				vi.start, vi.end, vi.fd, vi.pgoff,
> 				vi.prot, vi.flags, note);
> 	}
> }
> 
> static int show_privmem(int fd)
> {
> 	printf("Pages:\n");
> 	return show_mem(fd);
> }
> 
> static int show_core(int fd)
> {
> 	__u32 version = 0;
> 	unsigned long stack;
> 
> 	read(fd, &version, 4);
> 	if (version != BINFMT_IMG_VERS_0) {
> 		printf("Unsupported version %d\n", version);
> 		return 1;
> 	}
> 
> 	printf("Showing version 0\n");
> 
> 	if (show_regs(fd))
> 		return 1;
> 
> 	if (show_mm(fd, &stack))
> 		return 1;
> 
> 	if (show_vmas(fd, stack))
> 		return 1;
> 
> 	if (show_privmem(fd))
> 		return 1;
> 
> 	return 0;
> }
> 
> static int show_pstree(int fd)
> {
> 	int ret;
> 	struct pstree_entry e;
> 
> 	while (1) {
> 		int i;
> 		__u32 *ch;
> 
> 		ret = read(fd, &e, sizeof(e));
> 		if (ret == 0)
> 			return 0;
> 		if (ret != sizeof(e)) {
> 			perror("Can't read processes entry");
> 			return 1;
> 		}
> 
> 		printf("%d:", e.pid);
> 		i = e.nr_children * sizeof(__u32);
> 		ch = malloc(i);
> 		ret = read(fd, ch, i);
> 		if (ret != i) {
> 			perror("Can't read children list");
> 			return 1;
> 		}
> 
> 		for (i = 0; i < e.nr_children; i++)
> 			printf(" %d", ch[i]);
> 		printf("\n");
> 	}
> }
> 
> static int show_pipes(int fd)
> {
> 	struct pipes_entry e;
> 	int ret;
> 	char buf[17];
> 
> 	while (1) {
> 		ret = read(fd, &e, sizeof(e));
> 		if (ret == 0)
> 			break;
> 		if (ret != sizeof(e)) {
> 			perror("Can't read pipe entry");
> 			return 1;
> 		}
> 
> 		printf("%d: %lx %o %d ", e.fd, e.pipeid, e.flags, e.bytes);
> 		if (e.flags & O_WRONLY) {
> 			printf("\n");
> 
> 			if (e.bytes) {
> 				printf("Bogus pipe\n");
> 				return 1;
> 			}
> 
> 			continue;
> 		}
> 
> 		memset(buf, 0, sizeof(buf));
> 		ret = e.bytes;
> 		if (ret > 16)
> 			ret = 16;
> 
> 		read(fd, buf, ret);
> 		printf("\t[%s", buf);
> 		if (ret < e.bytes)
> 			printf("...");
> 		printf("]\n");
> 		lseek(fd, e.bytes - ret, SEEK_CUR);
> 	}
> 
> 	return 0;
> 
> }
> 
> int main(int argc, char **argv)
> {
> 	__u32 type;
> 	int fd;
> 
> 	fd = open(argv[1], O_RDONLY);
> 	if (fd < 0) {
> 		perror("Can't open");
> 		return 1;
> 	}
> 
> 	read(fd, &type, 4);
> 
> 	if (type == FDINFO_MAGIC)
> 		return show_fdinfo(fd);
> 	if (type == PAGES_MAGIC)
> 		return show_pages(fd);
> 	if (type == SHMEM_MAGIC)
> 		return show_shmem(fd);
> 	if (type == PSTREE_MAGIC)
> 		return show_pstree(fd);
> 	if (type == PIPES_MAGIC)
> 		return show_pipes(fd);
> 	if (type == BINFMT_IMG_MAGIC)
> 		return show_core(fd);
> 
> 	printf("Unknown file type 0x%x\n", type);
> 	return 1;
> }

> 
> #define FDINFO_MAGIC	0x01010101
> 
> struct fdinfo_entry {
> 	__u8	type;
> 	__u8	len;
> 	__u16	flags;
> 	__u32	pos;
> 	__u64	addr;
> };
> 
> #define FDINFO_FD	1
> #define FDINFO_MAP	2
> 
> #define PAGES_MAGIC	0x20202020
> 
> #define SHMEM_MAGIC	0x03300330
> 
> struct shmem_entry {
> 	__u64	start;
> 	__u64	end;
> 	__u64	shmid;
> };
> 
> #define PSTREE_MAGIC	0x40044004
> 
> struct pstree_entry {
> 	__u32	pid;
> 	__u32	nr_children;
> };
> 
> #define PIPES_MAGIC	0x05055050
> 
> struct pipes_entry {
> 	__u32	fd;
> 	__u32	pipeid;
> 	__u32	flags;
> 	__u32	bytes;
> };

> all: cr-dump img-show cr-restore
> 
> img-show: img-show.c
> 	gcc -o $@ $<
> 
> cr-dump: cr-dump.c
> 	gcc -o $@ $<
> 
> cr-restore: cr-restore.c
> 	gcc -o $@ $<
> 
> clean:
> 	rm -f cr-dump img-show cr-restore

> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

For any subsequent postings could you split this up into multiple
emails -- perhaps one per file? Or perhaps make them patches to the
kernel's tools directory?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (9 preceding siblings ...)
  2011-07-18 13:27   ` Serge E. Hallyn
@ 2011-07-23  0:25   ` Matt Helsley
       [not found]     ` <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  10 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-23  0:25 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
> Hi guys!
> 
> There have already been made many attempts to have the checkpoint/restore functionality
> in Linux, but as far as I can see there's still no final solutions that suits most of
> the interested people. The main concern about the previous approaches as I see it was
> about - all that stuff was supposed to sit in the kernel thus creating various problems.
> 
> I'd like to bring this subject back again proposing the way of how to implement c/r
> mostly in the userspace with the reasonable help of a kernel.
> 
> 
> That said, I propose to start with very basic set of objects to c/r that can work with
> 
> * x86_64 tasks (subtree) which includes
>    - registers
>    - TLS
>    - memory of all kinds (file and anon both shared and private)

Do mixes of 32 and 64-bit tasks present any problems with this
method?

> * open regular files
> * pipes (with data in it)
> 
> Core idea:
> 
> The core idea of the restore process is to implement the binary handler that can execve-ute
> image files recreating the register and the memory state of a task. Restoring the process 

I suspect this can be done with Oren's patches too using binfmt-misc -- without any binfmt
kernel code.

> tree and opening files is done completely in the user space, i.e. when restoring the subtree
> of processes I first fork all the tasks in respective order, then open required files and 

OK. Oren's code also forked all the tasks in userspace prior to completing the restart.

> then call execve() to restore registers and memory.

That's kind of neat, but won't this interfere with restoring O_CLOEXEC
flags? (I also asked this in a reply to the TOOLS email)

> 
> The checkpointing process is quite simple - all we need about processes can be read from /proc
> except for several things - registers and private memory. In current implementation to get 

I put this to Tejun as well: What about stuff like epoll sets? Sure, you
can see the epoll fd in /proc/<pid>/fd, but you can't read it to tell
which fds are in it. Worse, even if you got the fds from the epoll items
via /proc, the way epoll holds onto them does not guarantee they'll refer
to the files the set would actuall wait on.

As best I can tell you can't reliably checkpoint epoll sets from userspace.

Then there's the matter of unlinked files. How do you plan to deal
with those without kernel code?

> them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
> described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
> mappings. It is populated with symbolc links with names equal to vma->vm_start and pointing to
> mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
> /proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
> if required map one and read the contents of anon shared memory.

Finally, I think there's substantial room here for quiet and subtle
races to corrupt checkpoint images. If we add /proc interfaces only to
find they're racy will we need to add yet more /proc interfaces to
maintain backward compatibility yet fix the races? To get the locking
that ensures a consistent subset of information with this /proc-based
approach I think we'll frequently need to change the contents of
existing /proc files.

Imagine trusting the output of top to exactly represent the state of
your system's cpu usage. That's the sort of thing a piecemeal /proc
interface gets us. You're asking us to trust that frequent checkpoints
(say once every five minutes) of large, multiprocess, month-long
program runs won't quietly get corrupted and will leave plenty of
performance to not interfere with the throughput of the work.

A kernel syscall interface has a better chance of allowing us to fix
races without changing the interface. We've fixed a few races with
Oren's tree and none of them required us to change the output format.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Reply #2: [TOOLS] To make use of the patches
       [not found]     ` <4E204554.6040901-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2011-07-22 23:45       ` Matt Helsley
@ 2011-07-23  0:40       ` Matt Helsley
       [not found]         ` <20110723004045.GC21563-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-23  0:40 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On Fri, Jul 15, 2011 at 05:49:08PM +0400, Pavel Emelyanov wrote:

<snip>

> static void kill_imgfiles(int pid)
> {
> 	/* FIXME */
> }
> 
> static int stop_task(int pid)
> {
> 	return kill(pid, SIGSTOP);
> }

Shouldn't you wait() on the task too? Otherwise I think you'll race
with it. Alternately, you could introduce a wait() phase after the loop
calls stop_task() below...

<snip>

> 
> static int __dump_all_tasks(void)
> {
> 	int i, pid;
> 
> 	printf("Dumping tasks' images for");
> 	for (i = 0; i < nr_pids; i++)
> 		printf(" %d", pids[i]);
> 	printf("\n");
> 
> 	printf("Stopping tasks\n");
> 	for (i = 0; i < nr_pids; i++)
> 		if (stop_task(pids[i]))
> 			goto err;

(see the wait() note above)

> 
> 	for (i = 0; i < nr_pids; i++) {
> 		if (dump_one_task(pids[i], 0))
> 			goto err;
> 	}
> 
> 	printf("Resuming tasks\n");
> 	for (i = 0; i < nr_pids; i++)
> 		continue_task(pids[i]);
> 
> 	return 0;
> 
> err:
> 	for (i = 0; i < nr_pids; i++)
> 		continue_task(pids[i]);

nit: Seems like you could simplify this using a variable with
the return value.

> 	return 1;
> 
> }

<snip>
 
> 
> static int fixup_pages_data(int pid, int fd)
> {
> 	char path[128];
> 	int shfd;
> 	__u32 mag;
> 	__u64 vaddr;
> 
> 	sprintf(path, "pages-%d.img", pid);
> 	shfd = open(path, O_RDONLY);
> 	if (shfd < 0) {
> 		perror("Can't open shmem image");
> 		return 1;
> 	}
> 
> 	read(shfd, &mag, sizeof(mag));
> 	if (mag != PAGES_MAGIC) {
> 		fprintf(stderr, "Bad shmem image\n");
> 		return 1;
> 	}
> 
> 	lseek(fd, -sizeof(struct binfmt_page_image), SEEK_END);
> 	read(fd, &vaddr, sizeof(vaddr));
> 	if (vaddr != 0) {
> 		printf("SHIT %lx\n", (unsigned long)vaddr);

Typo?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]     ` <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-23  3:29       ` Matt Helsley
       [not found]         ` <20110723032945.GD21563-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2011-07-23  3:53       ` Tejun Heo
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-23  3:29 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Glauber Costa, Cyrill Gorcunov, Tejun Heo,
	Nathan Lynch, Linux Containers, Serge Hallyn, Daniel Lezcano

On Fri, Jul 22, 2011 at 05:25:58PM -0700, Matt Helsley wrote:

<snip>
 
> As best I can tell you can't reliably checkpoint epoll sets from userspace.

Sorry, I take that back -- you can modify /proc to do it.

Details:

You'd have to output the fd number for each epoll item plus the path to
the file. The fd,file pair in the item is not strictly tied to
the contents of the processes' fd table yet the item fd has to be output
since it's the number userspace will supply to epoll_ctl(EPOLL_CTL_DEL...).

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]     ` <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2011-07-23  3:29       ` Matt Helsley
@ 2011-07-23  3:53       ` Tejun Heo
       [not found]         ` <CAOS58YPqLSYi2xECUk4O5GG3s6aokT=VykmkL6UnAOzyHXNAgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-07-23  5:10       ` Tejun Heo
  2011-07-23  8:39       ` Pavel Emelyanov
  3 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  3:53 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Glauber Costa, Cyrill Gorcunov, Nathan Lynch,
	Linux Containers, Serge Hallyn, Daniel Lezcano

Hello,

On Sat, Jul 23, 2011 at 2:25 AM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> Then there's the matter of unlinked files. How do you plan to deal
> with those without kernel code?

/proc/PID/fd already provides access to deleted files perfectly well
as most avid p0rn watchers would know (you can run mplayer on flash's
deleted temp files). ;)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]         ` <20110723032945.GD21563-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-23  4:58           ` Tejun Heo
       [not found]             ` <20110723045842.GD21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  4:58 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Glauber Costa, Cyrill Gorcunov, Nathan Lynch,
	Linux Containers, Serge Hallyn, Daniel Lezcano

On Fri, Jul 22, 2011 at 08:29:45PM -0700, Matt Helsley wrote:
> On Fri, Jul 22, 2011 at 05:25:58PM -0700, Matt Helsley wrote:
> You'd have to output the fd number for each epoll item plus the path to
> the file. The fd,file pair in the item is not strictly tied to
> the contents of the processes' fd table yet the item fd has to be output
> since it's the number userspace will supply to epoll_ctl(EPOLL_CTL_DEL...).

I haven't really looked at it but if @fd is key, @fd should be enough.
You can determine what @fd is and it attributes from /proc/PID/fd/ and
/proc/PID/fdinfo/.  No reason to list them again.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]     ` <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2011-07-23  3:29       ` Matt Helsley
  2011-07-23  3:53       ` Tejun Heo
@ 2011-07-23  5:10       ` Tejun Heo
       [not found]         ` <20110723051005.GE21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2011-07-23  8:39       ` Pavel Emelyanov
  3 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  5:10 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Glauber Costa, Cyrill Gorcunov, Nathan Lynch,
	Linux Containers, Serge Hallyn, Daniel Lezcano

Hello,

On Fri, Jul 22, 2011 at 05:25:58PM -0700, Matt Helsley wrote:
> Finally, I think there's substantial room here for quiet and subtle
> races to corrupt checkpoint images. If we add /proc interfaces only to
> find they're racy will we need to add yet more /proc interfaces to
> maintain backward compatibility yet fix the races? To get the locking
> that ensures a consistent subset of information with this /proc-based
> approach I think we'll frequently need to change the contents of
> existing /proc files.

The target processes need to be frozen to remove race conditions (be
it SIGSOTP, cgroup freeze or PTRACE trap).  If there are exceptions in
the boundaries between frozen domain and the rest of the system,
they'll need to be dealt with and those need to be dealt with whether
the thing is in kernel or not.

> Imagine trusting the output of top to exactly represent the state of
> your system's cpu usage. That's the sort of thing a piecemeal /proc
> interface gets us. You're asking us to trust that frequent checkpoints
> (say once every five minutes) of large, multiprocess, month-long
> program runs won't quietly get corrupted and will leave plenty of
> performance to not interfere with the throughput of the work.

This is rather bogus.  If you freeze the processes, most of the
information in /proc (the ones which would show up in top anyway)
doesn't change.  What race condition?

> A kernel syscall interface has a better chance of allowing us to fix
> races without changing the interface. We've fixed a few races with
> Oren's tree and none of them required us to change the output format.

Sure, that was completely embedded in the kernel and things can be
implemented and fixed with much less consideration.  I can see how
that would be easier for the specific use case, but that EXACTLY is
why it can't go upstream.  I just can't see it happening and think it
would be far more productive spending the time and energy looking for
and implementing solutions which actually can go mainline.  If you
don't care about mainlining, that's great too, but then there's no
point in talking about it either.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]         ` <20110721065436.GT3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2011-07-23  8:06           ` Pavel Emelyanov
       [not found]             ` <4E2A8116.1040309-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On 07/21/2011 10:54 AM, Tejun Heo wrote:
> On Fri, Jul 15, 2011 at 05:46:43PM +0400, Pavel Emelyanov wrote:
>> Although we can get the pids of some task's issue, this is just 
>> more convenient to have them this way.
>>
>> Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>.
> 
> Umm... The primary aim is dumping whole namespaces, right?  

No. The aim is to dump an arbitrary set of tasks.

> The dumper would have to build full process tree anyway so I don't 
> see much point in providing backlink from kernel.

Hm... Why would a dumper need to build the whole tree? Maybe you're
right with this, so can you elaborate?

This particular patch just helps with collecting a subtree without
scanning the whole /proc/ directory.

> Thanks.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality
       [not found]             ` <20110722230848.GB16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-23  8:09               ` Pavel Emelyanov
  0 siblings, 0 replies; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:09 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Glauber Costa, Cyrill Gorcunov, Tejun Heo, Nathan Lynch,
	Eric W. Biederman, Linux Containers, Daniel Lezcano

> So I think it would be better to incorporate the eclone patch set
> unless, as you say, Pavel can see a good reason not to.

I'm perfectly fine with using the eclone approach. This particular patch
was included in the set just because the tools use one. I will switch to
using eclone next iteration.

> Cheers,
> 	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]         ` <20110721064408.GR3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2011-07-23  8:11           ` Pavel Emelyanov
       [not found]             ` <4E2A8239.5060908-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On 07/21/2011 10:44 AM, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jul 15, 2011 at 05:47:44PM +0400, Pavel Emelyanov wrote:
>> An image read from file contains task's registers and information
>> about its VM. Later this image can be execve-ed causing recreation
>> of the previously read task state.
>>
>> The file format is my own, very simple. Introduced to make the code
>> as simple as possible. Better file format (if any) is to be discussed.
> 
> First of all, I don't really think we need to bake in process dumper
> into the kernel.

Neither do I :) I just didn't have better candidate in mind and wanted
to discuss this part (see below).

> Most of information dumped here is already available
> through /proc and ptrace and we can add the missing pieces like the
> suggested proc vma fds.

Let's start with the simplest things. Can you suggest the best (from you pov)
way for dumping all the registers, tls and the anonymous pages through the 
existing interfaces?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found]             ` <20110722224617.GA16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-23  8:17               ` Pavel Emelyanov
       [not found]                 ` <4E2A83AC.6090504-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:17 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On 07/23/2011 02:46 AM, Matt Helsley wrote:
> On Thu, Jul 21, 2011 at 08:51:27AM +0200, Tejun Heo wrote:
>> On Fri, Jul 15, 2011 at 05:48:09PM +0400, Pavel Emelyanov wrote:
>>> When being execve-ed the handler reads registers, mappings and provided
>>> memory pages from image and just assigns this state on current task. This
>>> simple functionality can be used to restore a task, whose state whas read
>>> from e.g. /proc/<pid>/dump file before.
>>
>> Ummm... iff the process is single threaded. :(

Yup :( This is the weak side.

>> Much more complex machinery is needed to restore full process anyway
>> which would require some kernel facilities but definitely a lot more
> 
> Agreed,

Well, let me describe why I chose the binfmt handler for restore.

The basic idea is very simple - you have to create a process with a defined
values in registers and defined set of memory mappings (with the memory
contents). This creation is obviously done by some (maybe another) process.

Thus we have two ways to go - either we transform the restoring task into
the target one or we freeze the target one and repopulate it "remotely".

The 1st approach seemed to be more elegant to me, and with this one we do
already have an API for turning one VM+regs into another - the execve.

If you can suggest another way - I'm open for discussion.

>> logic in userland.  I really can't see much point in having
> 
> I disagree (surprise! ;)).
> 
>> dumper/restorer in kernel.  The simplistic dumper/restorer proposed
>> here isn't really useful - among other things, it's single threaded
>> only and there's no mechanism to freeze the task being dumped.  It is
> 
> To be fair Pavel used signals to stop/resume the task. It's not
> a good solution but it's a start (more below).
> 
>> almost trivially implementable from userland using existing
>> facilities.  I wonder what the point is.
> 
> No, I think that ultimately an addition to the cgroup freezer will
> be needed.

Sure it will be! I planned to use one in the next iterations and used
the sigstop just for simplicity.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [TOOLS] To make use of the patches
       [not found]         ` <20110722234558.GD16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-23  8:32           ` Pavel Emelyanov
       [not found]             ` <4E2A8704.3030306-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:32 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

>> #define PIPEFS_MAGIC 0x50495045
> 
> Shouldn't there be only one MAGIC number for checkpoint contents?
> 
> You can always add an additional "type" number following the magic
> number. Or make the type a string with the name of the /proc file it's
> from... etc.

Don't get your idea here, can you elaborate please?

>> static void continue_task(int pid)
>> {
>>       if (kill(pid, SIGCONT))
>>               perror("Can't cont task");
>> }
> 
> Eventually, I think you should use the cgroup freezer here rather
> than signals. Shells and debuggers use these signals so a checkpoint
> could easily and quietly be corrupted.

Yes sure! As I told, I will switch to one in the 2nd iteration.

> Even if you use the freezer, there needs to be a mechanism to
> assure that the frozen cgroup is not thawed before a consistent
> checkpoint is complete. Otherwise corruption is always a possibility.

Yes, this is a good point. I'm thinking about it.

>> static int dump_pipe_and_data(int lfd, struct pipes_entry *e)
>> {
>>       int steal_pipe[2];
>>       int ret;
>>
>>       printf("\tDumping data from pipe %x\n", e->pipeid);
>>       if (pipe(steal_pipe) < 0) {
>>               perror("Can't create pipe for stealing data");
>>               return 1;
>>       }
>>
>>       ret = tee(lfd, steal_pipe[1], MAX_PIPE_BUF_SIZE, SPLICE_F_NONBLOCK);
> 
> Neat application of tee().

Thanks! :)

>>       if (ret < 0) {
>>               if (errno != EAGAIN) {
>>                       perror("Can't pick pipe data");
>>                       return 1;
>>               }
>>
>>               ret = 0;
>>       }
>>
>>       e->bytes = ret;
>>       write(pipes_img, e, sizeof(*e));
>>
>>       if (ret) {
>>               ret = splice(steal_pipe[0], NULL, pipes_img, NULL, ret, 0);
>>               if (ret < 0) {
>>                       perror("Can't push pipe data");
>>                       return 1;
>>               }
>>       }
>>
>>       close(steal_pipe[0]);
>>       close(steal_pipe[1]);
>>       return 0;
>> }
>>
>> static int dump_one_pipe(int fd, int lfd, unsigned int id, unsigned int flags)
>> {
>>       struct pipes_entry e;
>>
>>       printf("\tDumping pipe %d/%x flags %x\n", fd, id, flags);
>>
>>       e.fd = fd;
>>       e.pipeid = id;
>>       e.flags = flags;
>>
>>       if (flags & O_WRONLY) {
>>               e.bytes = 0;
>>               write(pipes_img, &e, sizeof(e));
>>               return 0;
>>       }
>>
>>       return dump_pipe_and_data(lfd, &e);
>> }
>>
>> static int dump_one_fd(int dir, char *fd_name, unsigned long pos, unsigned int flags)
>> {
>>       int fd;
>>       struct stat st_buf;
>>       struct statfs stfs_buf;
>>
>>       printf("\tDumping fd %s\n", fd_name);
>>       fd = openat(dir, fd_name, O_RDONLY);
>>       if (fd == -1) {
>>               printf("Tried to openat %d/%d %s\n", getpid(), dir, fd_name);
>>               perror("Can't open fd");
>>               return 1;
>>       }
>>
>>       if (fstat(fd, &st_buf) < 0) {
>>               perror("Can't stat one");
>>               return 1;
>>       }
>>
>>       if (S_ISREG(st_buf.st_mode))
>>               return dump_one_reg_file(FDINFO_FD, atoi(fd_name), fd, 1, pos, flags);
>>
>>       if (S_ISFIFO(st_buf.st_mode)) {
>>               if (fstatfs(fd, &stfs_buf) < 0) {
>>                       perror("Can't statfs one");
>>                       return 1;
>>               }
>>
>>               if (stfs_buf.f_type == PIPEFS_MAGIC)
>>                       return dump_one_pipe(atoi(fd_name), fd, st_buf.st_ino, flags);
>>       }
> 
> This is starting to look like a linear search over the set of all
> possible types of things file descriptors can refer to. A kernel implementation
> doesn't have to do this. Furthermore, if lots of file descriptors are open
> this could be alot of fstat() and fstatfs() calls -- will making so many
> syscalls force us to an completely in-kernel implementation, like the
> set already proposed, just to get usable performance?

A kernel implementation doesn't have to do any syscalls at all. If we're going to
do it in kernel, then we should throw this set away and resurrect the Oren's set.

As far as the many fstats is concerned - yes, some sort of optimization about this
is surely required.

>>
>>       if (!strcmp(fd_name, "0")) {
>>               printf("\tSkipping stdin\n");
>>               return 0;
>>       }
> 
> Assuming that fd 0 is "stdin" is very very gross. Yes, it's almost always
> true. But that does *not* mean that it's a pty. stdin could be a pipe
> we need to checkpoint. Really, this is also about the "type" of thing
> the fd is referring to -- not about which fd nr it is.
> 
> What are your plans for removing this?

This was done just to make it possible to demonstrate what this code can do
checkpointing shell scripts and restoring them in (probably) another session.

The plan for this part is - implement the c/r support for terminals and throw
this explicit check for stdio-s away :)

>> static unsigned long rawhex(char *str, char **end)
>> {
>>       unsigned long ret = 0;
>>
>>       while (1) {
>>               if (str[0] >= '0' && str[0] <= '9') {
>>                       ret <<= 4;
>>                       ret += str[0] - '0';
>>               } else if (str[0] >= 'a' && str[0] <= 'f') {
>>                       ret <<= 4;
>>                       ret += str[0] - 'a' + 0xA;
>>               } else if (str[0] >= 'A' && str[0] <= 'F') {
>>                       ret <<= 4;
>>                       ret += str[0] - 'A' + 0xA;
>>               } else {
>>                       if (end)
>>                               *end = str;
>>                       return ret;
>>               }
>>
>>               str++;
>>       }
>> }
> 
> nit: I haven't looked closely enough to see where rawhex is being used,
>         but is there's no suitable library function for this?

Well, I looked for but did found. All I've met required an 0x to precede the hex number.
If you point me one - I will gladly replace mine with it.

>> static int dump_file_shared_map(char *start, char *mdesc, int lfd)
>> {
>>       printf("\tSkipping file shared mapping at %s\n", start);
>>       close(lfd);
>>       return 0;
>> }
> 
> Shouldn't this be an error since it appears these shared mappings
> are currently unsupported?

Why unsupported? Shared file mappings are fully supported, unless some bug found its
way into the source.

>>       printf("%d/%d EXEC IMAGE\n", pid, getpid());
>>       return execl(path, path, NULL);
> 
> How are you going to restore O_CLOEXEC flags?

Don't know yet. But assuming we have agreed on using execve for restoring tasks, then the solution
is - just set this flag and call exec. Since my binary handler doesn't call the setup_new_exec
(which closes the files) these bits will be preserved


> For any subsequent postings could you split this up into multiple
> emails -- perhaps one per file? 

OK, will do this.

> Or perhaps make them patches to the kernel's tools directory?

Hm... I didn't think about having these tools be the part of the kernel source tree.

Maybe it would be better if I publish the tools in git repo, what do you think?

> Cheers,
>         -Matt Helsley
> .
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Reply #2: [TOOLS] To make use of the patches
       [not found]         ` <20110723004045.GC21563-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-23  8:33           ` Pavel Emelyanov
  0 siblings, 0 replies; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On 07/23/2011 04:40 AM, Matt Helsley wrote:
> On Fri, Jul 15, 2011 at 05:49:08PM +0400, Pavel Emelyanov wrote:
> 
> <snip>
> 
>> static void kill_imgfiles(int pid)
>> {
>> 	/* FIXME */
>> }
>>
>> static int stop_task(int pid)
>> {
>> 	return kill(pid, SIGSTOP);
>> }
> 
> Shouldn't you wait() on the task too? Otherwise I think you'll race
> with it. Alternately, you could introduce a wait() phase after the loop
> calls stop_task() below...

Well, as discussed - I will switch to the freezer cgroup instead of this signalling.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]             ` <4E2A8239.5060908-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-23  8:37               ` Tejun Heo
       [not found]                 ` <20110723083711.GF21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  8:37 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Sat, Jul 23, 2011 at 12:11:37PM +0400, Pavel Emelyanov wrote:
> On 07/21/2011 10:44 AM, Tejun Heo wrote:
> > Most of information dumped here is already available
> > through /proc and ptrace and we can add the missing pieces like the
> > suggested proc vma fds.
> 
> Let's start with the simplest things. Can you suggest the best (from you pov)
> way for dumping all the registers, tls and the anonymous pages through the 
> existing interfaces?

Just use ptrace.  Seizing all threads in a process, gather file and
memory map info from /proc, inject a parasite to dump memory pages and
do whatever else.  There's nothing special about TLS, fs/gs base is
already included in ptrace register dump in x86_64.  Dunno how it's
handled in 32bit but if it's not available exporting it isn't a big
deal.  Rebuilding the process image from the captured information
shouldn't be too hard.

There is simply is no need to put CR into kernel at all.  Just add the
missing pieces to export the necessary information and missing APIs
which are required to restore it (e.g. setting TID like you did in
this patchset).  Approaching it that way would make things useful for
other use cases && is highly more likely to get merged.  There doesn't
even need to be one big merge day.  You can just improve things
piecewise until it works.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]     ` <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
                         ` (2 preceding siblings ...)
  2011-07-23  5:10       ` Tejun Heo
@ 2011-07-23  8:39       ` Pavel Emelyanov
  3 siblings, 0 replies; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:39 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On 07/23/2011 04:25 AM, Matt Helsley wrote:
> On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
>> Hi guys!
>>
>> There have already been made many attempts to have the checkpoint/restore functionality
>> in Linux, but as far as I can see there's still no final solutions that suits most of
>> the interested people. The main concern about the previous approaches as I see it was
>> about - all that stuff was supposed to sit in the kernel thus creating various problems.
>>
>> I'd like to bring this subject back again proposing the way of how to implement c/r
>> mostly in the userspace with the reasonable help of a kernel.
>>
>>
>> That said, I propose to start with very basic set of objects to c/r that can work with
>>
>> * x86_64 tasks (subtree) which includes
>>    - registers
>>    - TLS
>>    - memory of all kinds (file and anon both shared and private)
> 
> Do mixes of 32 and 64-bit tasks present any problems with this
> method?

In theory - no. But in practice I didn't write the 32-bit support yet.

>> * open regular files
>> * pipes (with data in it)
>>
>> Core idea:
>>
>> The core idea of the restore process is to implement the binary handler that can execve-ute
>> image files recreating the register and the memory state of a task. Restoring the process 
> 
> I suspect this can be done with Oren's patches too using binfmt-misc -- without any binfmt
> kernel code.
> 
>> tree and opening files is done completely in the user space, i.e. when restoring the subtree
>> of processes I first fork all the tasks in respective order, then open required files and 
> 
> OK. Oren's code also forked all the tasks in userspace prior to completing the restart.
> 
>> then call execve() to restore registers and memory.
> 
> That's kind of neat, but won't this interfere with restoring O_CLOEXEC
> flags? (I also asked this in a reply to the TOOLS email)
> 
>>
>> The checkpointing process is quite simple - all we need about processes can be read from /proc
>> except for several things - registers and private memory. In current implementation to get 
> 
> I put this to Tejun as well: What about stuff like epoll sets? Sure, you
> can see the epoll fd in /proc/<pid>/fd, but you can't read it to tell
> which fds are in it. Worse, even if you got the fds from the epoll items
> via /proc, the way epoll holds onto them does not guarantee they'll refer
> to the files the set would actuall wait on.
> 
> As best I can tell you can't reliably checkpoint epoll sets from userspace.

With the existing interfaces - yes. My aim was to start the discussion whether we can
extend the kernel APIs to make it possible to do so.

> Then there's the matter of unlinked files. How do you plan to deal
> with those without kernel code?

You will have the same problem even with the c/r in the kernel. Frankly, I don't see
much difference in where to solve this one, can you elaborate?

>> them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
>> described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
>> mappings. It is populated with symbolc links with names equal to vma->vm_start and pointing to
>> mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
>> /proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
>> if required map one and read the contents of anon shared memory.
> 
> Finally, I think there's substantial room here for quiet and subtle
> races to corrupt checkpoint images. If we add /proc interfaces only to
> find they're racy will we need to add yet more /proc interfaces to
> maintain backward compatibility yet fix the races? To get the locking
> that ensures a consistent subset of information with this /proc-based
> approach I think we'll frequently need to change the contents of
> existing /proc files.
> 
> Imagine trusting the output of top to exactly represent the state of
> your system's cpu usage. That's the sort of thing a piecemeal /proc
> interface gets us. You're asking us to trust that frequent checkpoints
> (say once every five minutes) of large, multiprocess, month-long
> program runs won't quietly get corrupted and will leave plenty of
> performance to not interfere with the throughput of the work.
> 
> A kernel syscall interface has a better chance of allowing us to fix
> races without changing the interface. We've fixed a few races with
> Oren's tree and none of them required us to change the output format.

If we all decide, that we do want to have the checkpoint/restart as all-in-kernel approach,
then OK. But my impression is - the community is not happy with it.

> Cheers,
> 	-Matt Helsley
> .
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]             ` <4E2A8116.1040309-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-23  8:41               ` Tejun Heo
       [not found]                 ` <20110723084110.GG21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  8:41 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Sat, Jul 23, 2011 at 12:06:46PM +0400, Pavel Emelyanov wrote:
> > The dumper would have to build full process tree anyway so I don't 
> > see much point in providing backlink from kernel.
> 
> Hm... Why would a dumper need to build the whole tree? Maybe you're
> right with this, so can you elaborate?

It depends on what you want to dump.  Dumping arbitrary task has
issues even at the most basic level due to relationships among family
of processes, which I assume is why the in-kernel CR was primarily
focused on dumping and restoring full namespaces.  If you're doing
that, there is no avoiding walking process tree anyway (which
apparently can even be per-NS using uprocfs).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]     ` <20110718132759.GB8127-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2011-07-23  8:43       ` Pavel Emelyanov
  0 siblings, 0 replies; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:43 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Glauber Costa, Cyrill Gorcunov, Linux Containers, Nathan Lynch,
	Tejun Heo, Serge Hallyn, Daniel Lezcano

On 07/18/2011 11:04 PM, Serge E. Hallyn wrote:
> (sorry, just realized postfix has been messing up my email, hope this
> comes through ok)
> 
> Thanks, Pavel.  I will take a look at this when I get a chance.  I'm
> a little worried about security implications - this approach should
> lend itself (especially with the binfmt handler) to clean handling
> of security issues, but given the issues we've had with /proc things
> that already exist, I'm worried about the dump files.  If you have
> any preemptive comments on that, please do share :)

As far as the security is concerned - yes, this is very tricky question.
Before we find out and fix all the possible security implications, I'd
suggest adding the
  if (!capable(CAP_SYS_ADMIN))
	return -EPERM
check into the execve handler. :)

And I understand your worry about the dump files in /proc. I do not like
this thing either and looking forward for your suggestions. I've asked this
question to Tejun, hopefully we'll work out the good solution.

> We did briefly try a binfmt handler at the very end of our foray into
> the ptrace checkpoint/restart approach, but your overall set here seems
> very nice.
> 
> thanks,
> -serge
> .
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]                 ` <20110723084110.GG21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-23  8:45                   ` Pavel Emelyanov
       [not found]                     ` <4E2A8A0E.5030208-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On 07/23/2011 12:41 PM, Tejun Heo wrote:
> Hello,
> 
> On Sat, Jul 23, 2011 at 12:06:46PM +0400, Pavel Emelyanov wrote:
>>> The dumper would have to build full process tree anyway so I don't 
>>> see much point in providing backlink from kernel.
>>
>> Hm... Why would a dumper need to build the whole tree? Maybe you're
>> right with this, so can you elaborate?
> 
> It depends on what you want to dump.  Dumping arbitrary task has
> issues even at the most basic level due to relationships among family
> of processes, which I assume is why the in-kernel CR was primarily
> focused on dumping and restoring full namespaces.  If you're doing
> that, there is no avoiding walking process tree anyway (which
> apparently can even be per-NS using uprocfs).

OK I see. Then my answer is - typically a container looks like an init task
with everybody else growing from that point. Having a machine with 1000 of
containers building the whole /proc tree in memory to dump a single container
would be MUCH more expensive that having this small like in proc.

> Thanks.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found]                 ` <4E2A83AC.6090504-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-23  8:45                   ` Tejun Heo
       [not found]                     ` <20110723084529.GH21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  8:45 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello, Pavel.

On Sat, Jul 23, 2011 at 12:17:48PM +0400, Pavel Emelyanov wrote:
> The basic idea is very simple - you have to create a process with a defined
> values in registers and defined set of memory mappings (with the memory
> contents). This creation is obviously done by some (maybe another) process.
> 
> Thus we have two ways to go - either we transform the restoring task into
> the target one or we freeze the target one and repopulate it "remotely".
> 
> The 1st approach seemed to be more elegant to me, and with this one we do
> already have an API for turning one VM+regs into another - the execve.
> 
> If you can suggest another way - I'm open for discussion.

Just restore it using the usual system calls - clone, mmap, open....
There is no reason for the kernel to do it and kernel can't even do it
properly without going way outside of the existing exec(2) conventions
unless you're planning to make exec(2) create multi-threaded process
and I don't think that's a wise direction.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]                 ` <20110723083711.GF21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-23  8:49                   ` Pavel Emelyanov
       [not found]                     ` <4E2A8B12.4010709-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On 07/23/2011 12:37 PM, Tejun Heo wrote:
> Hello,
> 
> On Sat, Jul 23, 2011 at 12:11:37PM +0400, Pavel Emelyanov wrote:
>> On 07/21/2011 10:44 AM, Tejun Heo wrote:
>>> Most of information dumped here is already available
>>> through /proc and ptrace and we can add the missing pieces like the
>>> suggested proc vma fds.
>>
>> Let's start with the simplest things. Can you suggest the best (from you pov)
>> way for dumping all the registers, tls and the anonymous pages through the 
>> existing interfaces?
> 
> Just use ptrace.  Seizing all threads in a process, gather file and
> memory map info from /proc, inject a parasite to dump memory pages and
> do whatever else.  There's nothing special about TLS, fs/gs base is
> already included in ptrace register dump in x86_64.  Dunno how it's
> handled in 32bit but if it's not available exporting it isn't a big
> deal.  Rebuilding the process image from the captured information
> shouldn't be too hard.

You're talking about your recent set for ptrace? Can you propose a quick scratch
of how you propose to dump and restore registers and memory with this?

> There is simply is no need to put CR into kernel at all.  

I don't want to! I propose to use small set of APIs for it and the execve handler
is just the way to replace VM+regs of a task with another set.

> Just add the
> missing pieces to export the necessary information and missing APIs
> which are required to restore it (e.g. setting TID like you did in
> this patchset).  Approaching it that way would make things useful for
> other use cases && is highly more likely to get merged.  There doesn't
> even need to be one big merge day.  You can just improve things
> piecewise until it works.

Totally agree.

> Thanks.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]                     ` <4E2A8A0E.5030208-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-23  8:50                       ` Tejun Heo
       [not found]                         ` <20110723085014.GI21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  8:50 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Sat, Jul 23, 2011 at 12:45:02PM +0400, Pavel Emelyanov wrote:
> OK I see. Then my answer is - typically a container looks like an
> init task with everybody else growing from that point. Having a
> machine with 1000 of containers building the whole /proc tree in
> memory to dump a single container would be MUCH more expensive that
> having this small like in proc.

This isn't a major point, so let's leave it alone for now.  If it's
necessary, adding it isn't a big deal; however, I think it would
probably be better solved by per-ns procfs.  If walking /proc becomes
a huge overhead, CR wouldn't be the only one suffering and it calls
for a better solution.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found]                     ` <20110723084529.GH21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-23  8:51                       ` Pavel Emelyanov
       [not found]                         ` <4E2A8B7D.8010807-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On 07/23/2011 12:45 PM, Tejun Heo wrote:
> Hello, Pavel.
> 
> On Sat, Jul 23, 2011 at 12:17:48PM +0400, Pavel Emelyanov wrote:
>> The basic idea is very simple - you have to create a process with a defined
>> values in registers and defined set of memory mappings (with the memory
>> contents). This creation is obviously done by some (maybe another) process.
>>
>> Thus we have two ways to go - either we transform the restoring task into
>> the target one or we freeze the target one and repopulate it "remotely".
>>
>> The 1st approach seemed to be more elegant to me, and with this one we do
>> already have an API for turning one VM+regs into another - the execve.
>>
>> If you can suggest another way - I'm open for discussion.
> 
> Just restore it using the usual system calls - clone, mmap, open....

I can't clone/mmap/open registers (yet, unless we decide to have /proc/pid/regs
file, do we?). Neither can I do it for anonymous private mappings :(

> There is no reason for the kernel to do it and kernel can't even do it
> properly without going way outside of the existing exec(2) conventions
> unless you're planning to make exec(2) create multi-threaded process
> and I don't think that's a wise direction.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status
       [not found]                         ` <20110723085014.GI21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-23  8:51                           ` Pavel Emelyanov
  0 siblings, 0 replies; 68+ messages in thread
From: Pavel Emelyanov @ 2011-07-23  8:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On 07/23/2011 12:50 PM, Tejun Heo wrote:
> On Sat, Jul 23, 2011 at 12:45:02PM +0400, Pavel Emelyanov wrote:
>> OK I see. Then my answer is - typically a container looks like an
>> init task with everybody else growing from that point. Having a
>> machine with 1000 of containers building the whole /proc tree in
>> memory to dump a single container would be MUCH more expensive that
>> having this small like in proc.
> 
> This isn't a major point, so let's leave it alone for now.  If it's
> necessary, adding it isn't a big deal; however, I think it would
> probably be better solved by per-ns procfs.  If walking /proc becomes
> a huge overhead, CR wouldn't be the only one suffering and it calls
> for a better solution.

OK.

> Thanks.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file
       [not found]                     ` <4E2A8B12.4010709-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-23  8:58                       ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  8:58 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Sat, Jul 23, 2011 at 12:49:22PM +0400, Pavel Emelyanov wrote:
> On 07/23/2011 12:37 PM, Tejun Heo wrote:
> > Just use ptrace.  Seizing all threads in a process, gather file and
> > memory map info from /proc, inject a parasite to dump memory pages and
> > do whatever else.  There's nothing special about TLS, fs/gs base is
> > already included in ptrace register dump in x86_64.  Dunno how it's
> > handled in 32bit but if it's not available exporting it isn't a big
> > deal.  Rebuilding the process image from the captured information
> > shouldn't be too hard.
> 
> You're talking about your recent set for ptrace? Can you propose a
> quick scratch of how you propose to dump and restore registers and
> memory with this?

Hmmm... I thought I just did that writing the above paragraph, so even
without the parasite treak, ptrace has access to full memory space and
registers of the tracee, so the basic stuff is already dumpable &&
restorable - at its simplest, dumper can PEEKDATA for each mapped
memory area (as indicated by /proc/PID/maps), use GETREGS and
GETFPREGS to acquire register states, save them.  To restore, launch a
process, attach to it, let it mmap all the recorded areas and fill it
from the saved data and SETREGS/SETFPREGS to restore the register
states and let go.

This of course is over-simplified but this should work.  There are
some missing things like finding out which signals are pending and
with what siginfo but those can be solved with small additions to
/proc.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler
       [not found]                         ` <4E2A8B7D.8010807-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-23  9:04                           ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-23  9:04 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Glauber Costa, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Sat, Jul 23, 2011 at 12:51:09PM +0400, Pavel Emelyanov wrote:
> On 07/23/2011 12:45 PM, Tejun Heo wrote:
> > Just restore it using the usual system calls - clone, mmap, open....
> 
> I can't clone/mmap/open registers (yet, unless we decide to have /proc/pid/regs
> file, do we?). Neither can I do it for anonymous private mappings :(

Hmmm?  When restoring a task, the task is started under restorer's
control.  The restorer can use ptrace to restore registers or feed it
assemblies to restore register states (ie. series of movq's to set
each register to the stored value followed by jmp to the stored RIP).

I don't understand your concern about anonymous private mappings.
What's different about them?  Can't the process being restored mmap
anonymous private mappings and fill it with the saved data?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]             ` <20110723045842.GD21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-26 18:11               ` Matt Helsley
       [not found]                 ` <20110726181128.GD14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-26 18:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Sat, Jul 23, 2011 at 06:58:42AM +0200, Tejun Heo wrote:
> On Fri, Jul 22, 2011 at 08:29:45PM -0700, Matt Helsley wrote:
> > On Fri, Jul 22, 2011 at 05:25:58PM -0700, Matt Helsley wrote:
> > You'd have to output the fd number for each epoll item plus the path to
> > the file. The fd,file pair in the item is not strictly tied to
> > the contents of the processes' fd table yet the item fd has to be output
> > since it's the number userspace will supply to epoll_ctl(EPOLL_CTL_DEL...).
> 
> I haven't really looked at it but if @fd is key, @fd should be enough.
> You can determine what @fd is and it attributes from /proc/PID/fd/ and
> /proc/PID/fdinfo/.  No reason to list them again.

No, you can't use the fd from the epoll item to look it up in the
task's fd table. It may look trivially correct but it is not and that's
why I mentioned it.

This is an example of where the information we already have in /proc
looks like it should be re-used but should not.

EPOLL_CTL_ADD uses the fd to add it to the epoll items as an (fd,file)
pair. Then another thread could change the fd table to close that fd.
close does not update all the epoll sets the fd is a part of. So in order
to remove the epoll item userspace must use the same "fd" it used during
EPOLL_CTL_ADD -- even though that fd no longer refers to anything.
That's why you can't just use the epoll item's fd to do a lookup in the
fd table -- you'll get the wrong struct *file. Unfortunately, most of the
time it will probably appear to work. You need a proper testcase to
demonstrate it.

At least that's what I recall of the code.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]         ` <20110723051005.GE21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-26 22:02           ` Matt Helsley
       [not found]             ` <20110726220215.GE14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-26 22:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Sat, Jul 23, 2011 at 07:10:05AM +0200, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jul 22, 2011 at 05:25:58PM -0700, Matt Helsley wrote:
> > Finally, I think there's substantial room here for quiet and subtle
> > races to corrupt checkpoint images. If we add /proc interfaces only to
> > find they're racy will we need to add yet more /proc interfaces to
> > maintain backward compatibility yet fix the races? To get the locking
> > that ensures a consistent subset of information with this /proc-based
> > approach I think we'll frequently need to change the contents of
> > existing /proc files.
> 
> The target processes need to be frozen to remove race conditions (be
> it SIGSOTP, cgroup freeze or PTRACE trap).  If there are exceptions in

SIGSTOP does not work as I've pointed out several times. I already pointed
out the problem with using the cgroup freezer as-is. As for ptrace
trapping, how would checkpointing a process and its debugger work?
This can happen when checkpointing a container. It seems to me that
they'd interfere with each other by either preventing one another from
attaching (last I checked ptrace was limited this way -- apologies if I
missed some of your work) or one would resume the task 'unexpectedly'
Do we aspire to have these bugs or would we rather plan on having
something that works?

> the boundaries between frozen domain and the rest of the system,
> they'll need to be dealt with and those need to be dealt with whether
> the thing is in kernel or not.

in-kernel we can use existing locks without changing the interface.

What's the plan for userspace? Will it be possible for userspace to
accidentally use the interfaces without holding the userspace "locks"
and thus quietly gather inconsistent information? I think the freezer
is necessary but not sufficient.

> > Imagine trusting the output of top to exactly represent the state of
> > your system's cpu usage. That's the sort of thing a piecemeal /proc
> > interface gets us. You're asking us to trust that frequent checkpoints
> > (say once every five minutes) of large, multiprocess, month-long
> > program runs won't quietly get corrupted and will leave plenty of
> > performance to not interfere with the throughput of the work.
> 
> This is rather bogus.  If you freeze the processes, most of the
> information in /proc (the ones which would show up in top anyway)

"most"... begging the question: which?

What the freezer covers seems very loosely defined in comparison to kernel
lock coverage (kernel locks also have great tool support..).
While the freezer is useful I think we'd be foolish to rely on empirical
observation of which /proc contents don't seem to change while the task is
frozen. As best I can tell the only thing the freezer is guaranteed to
cover is the register state of the frozen task and keep it in-kernel so
only that task cannot execute and produce side-effects. Once you
get to multiple threads/processes it's possible for them to share mm,
fd table, filesystem data, etc. so you have to make sure that everything
that shares those resources is also frozen and remains frozen for the
duration of the checkpoint (the point of a previous post about the freezer).
How will we find all things that share an mm, or an fd table, etc.
in a race-free way, from userspace, and ensure they are and remain frozen?
What about other shared resources like System V Shm, Sems,... ?

> doesn't change.  What race condition?

It's hard to point to specific race conditions when *you* haven't
posted checkpoint code -- just hints and ideas. Until you have something
more substantial the best I can do is review Pavel's code and worry about
what problems might later be uncovered in the future ptrace/proc
interfaces you choose to introduce.

> > A kernel syscall interface has a better chance of allowing us to fix
> > races without changing the interface. We've fixed a few races with
> > Oren's tree and none of them required us to change the output format.
> 
> Sure, that was completely embedded in the kernel and things can be
> implemented and fixed with much less consideration.  I can see how
> that would be easier for the specific use case, but that EXACTLY is
> why it can't go upstream.  I just can't see it happening and think it

It can't go upstream because it's too easy to implement and fix?
It can't go upstream because it has a specific use case?
Is there something that says every interface added to the kernel *must*
be useful for something besides the purpose that originally inspired it?

> would be far more productive spending the time and energy looking for
> and implementing solutions which actually can go mainline.  If you

Oh, you mean stuff that's hard to implement and fix? ;)

> don't care about mainlining, that's great too, but then there's no
> point in talking about it either.

Quite the contrary. How is it a good thing to ignore flaws in a
proposed solution to a problem? You're advocating a bunch of new kernel 
interfaces with the idea that they will be useful for checkpoint/restart.
If they turn out to be racy for the purposes of checkpointing then
kernel maintainers such as yourself will have those interfaces to support
and we will still have no reliable "mainline" checkpoint/restart.

I keep going back to the in-kernel implementation because I believe it
sets the bar -- I think you should do as well or better if you're going
to claim these interfaces are useful for checkpoint/restart. That does not
mean I expect people to like the out-of-tree in-kernel implementation. We
were given a high standard to meet for our checkpoint/restart work and I
don't see why your checkpoint/restart solution should be held to a lower
standard.

So if you don't want me to bring up in-kernel checkpoint/restart then stop
suggesting these interfaces will enable checkpoint/restart or show me
some real code.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]             ` <20110726220215.GE14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-26 22:21               ` Tejun Heo
       [not found]                 ` <20110726222109.GB28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-26 22:21 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Tue, Jul 26, 2011 at 03:02:15PM -0700, Matt Helsley wrote:
> > Sure, that was completely embedded in the kernel and things can be
> > implemented and fixed with much less consideration.  I can see how
> > that would be easier for the specific use case, but that EXACTLY is
> > why it can't go upstream.  I just can't see it happening and think it
> 
> It can't go upstream because it's too easy to implement and fix?
> It can't go upstream because it has a specific use case?
> Is there something that says every interface added to the kernel *must*
> be useful for something besides the purpose that originally inspired it?

You really don't understand what I'm trying to say at all?

> > would be far more productive spending the time and energy looking for
> > and implementing solutions which actually can go mainline.  If you
> 
> Oh, you mean stuff that's hard to implement and fix? ;)

We've talked about this over and over again.  If you wanna pursue
in-kernel implementation, please go ahead and keep at it.

Good luck.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                 ` <20110726181128.GD14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-26 22:45                   ` Tejun Heo
       [not found]                     ` <20110726224525.GC28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-26 22:45 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Tue, Jul 26, 2011 at 11:11:28AM -0700, Matt Helsley wrote:
> No, you can't use the fd from the epoll item to look it up in the
> task's fd table. It may look trivially correct but it is not and that's
> why I mentioned it.
> 
> This is an example of where the information we already have in /proc
> looks like it should be re-used but should not.
> 
> EPOLL_CTL_ADD uses the fd to add it to the epoll items as an (fd,file)
> pair. Then another thread could change the fd table to close that fd.
> close does not update all the epoll sets the fd is a part of. So in order
> to remove the epoll item userspace must use the same "fd" it used during
> EPOLL_CTL_ADD -- even though that fd no longer refers to anything.
> That's why you can't just use the epoll item's fd to do a lookup in the
> fd table -- you'll get the wrong struct *file. Unfortunately, most of the
> time it will probably appear to work. You need a proper testcase to
> demonstrate it.
> 
> At least that's what I recall of the code.

Ummm... I'm a bit confused, are you saying that EPOLL_CTL_DEL may take
fd which is already closed?  I can't see how that would be possible.
epoll uses fget() for fd -> file mapping like everyone else.  If fd is
closed, the mapping doesn't exist.

I think what's confusing here is that, if multiple fd points to the
same file, epoll 'may' report events on an already closed or resued fd
because whole thing is anchored on struct file which doesn't go away
until the last fd is closed, but I don't think that's something we
need to worry about.  It's dangling events on dead fds which is
explicitly described as 'may' happen in the documentation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]         ` <CAOS58YPqLSYi2xECUk4O5GG3s6aokT=VykmkL6UnAOzyHXNAgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-07-26 22:59           ` Matt Helsley
       [not found]             ` <20110726225911.GF14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-26 22:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Sat, Jul 23, 2011 at 05:53:46AM +0200, Tejun Heo wrote:
> Hello,
> 
> On Sat, Jul 23, 2011 at 2:25 AM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> > Then there's the matter of unlinked files. How do you plan to deal
> > with those without kernel code?
> 
> /proc/PID/fd already provides access to deleted files perfectly well
> as most avid p0rn watchers would know (you can run mplayer on flash's
> deleted temp files). ;)

Yup, access to the unlinked file contents. This is an example where
things appear simple and complete in /proc yet it is insufficient.
Here's what you'll need:

The string "(deleted)" in a file name is, strictly speaking, ambiguous --
it does not mean the file is unlinked. You also can't infer that it is
unlinked by stat()'ing that path since a different file could have
been created in the same spot. For something unambiguous you'll
have to add that information to /proc somewhere. fdinfo doesn't seem
to be the right place since fds aren't unlinked -- files are. 

Then you've got to detect when they're the same unlinked file and share
the copy upon restart. Or they could be different unlinked files
with the same path in which case you should not share the copy. I suppose
you'll have to check the device and inode and then see if any other task
being checkpointed has it open... once for each of potentially thousands
of fds being checkpointed.

Then there's the case where you've got one unlinked dentry for the
file but a hardlink elsewhere. The /proc/PID/fd path won't point to the
hardlinked location. So in order for those to be the same file upon
restart you need to find the file somehow during checkpoint and/or
restart.

Finally these files often can be huge. Copying them elsewhere is a huge IO
burden compared to careful relinking of the file. IO that could be better
spent doing actual work.

We solved all that with "relinking". It's possible to make a relink()
syscall. The code I posted some time ago to containers@ can be easily
adapted for that -- I did so for my testing of those patches. I'm not
exactly sure how it would be done from userspace but I suspect it could
be done.

Perhaps you'll find a different and better way to solve all those
problems unlinked files present. I'd sincerely like to hear about it.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                     ` <20110726224525.GC28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-26 23:07                       ` Matt Helsley
  0 siblings, 0 replies; 68+ messages in thread
From: Matt Helsley @ 2011-07-26 23:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Wed, Jul 27, 2011 at 12:45:25AM +0200, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 26, 2011 at 11:11:28AM -0700, Matt Helsley wrote:
> > No, you can't use the fd from the epoll item to look it up in the
> > task's fd table. It may look trivially correct but it is not and that's
> > why I mentioned it.
> > 
> > This is an example of where the information we already have in /proc
> > looks like it should be re-used but should not.
> > 
> > EPOLL_CTL_ADD uses the fd to add it to the epoll items as an (fd,file)
> > pair. Then another thread could change the fd table to close that fd.
> > close does not update all the epoll sets the fd is a part of. So in order
> > to remove the epoll item userspace must use the same "fd" it used during
> > EPOLL_CTL_ADD -- even though that fd no longer refers to anything.
> > That's why you can't just use the epoll item's fd to do a lookup in the
> > fd table -- you'll get the wrong struct *file. Unfortunately, most of the
> > time it will probably appear to work. You need a proper testcase to
> > demonstrate it.
> > 
> > At least that's what I recall of the code.
> 
> Ummm... I'm a bit confused, are you saying that EPOLL_CTL_DEL may take
> fd which is already closed?  I can't see how that would be possible.
> epoll uses fget() for fd -> file mapping like everyone else.  If fd is

Argh, you're right! I thought it just did an fget() on the epoll fd
itself then looked up the target fd in the epoll set..

> closed, the mapping doesn't exist.
> 
> I think what's confusing here is that, if multiple fd points to the
> same file, epoll 'may' report events on an already closed or resued fd
> because whole thing is anchored on struct file which doesn't go away
> until the last fd is closed, but I don't think that's something we
> need to worry about.  It's dangling events on dead fds which is
> explicitly described as 'may' happen in the documentation.

Yup.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]             ` <20110726225911.GF14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-26 23:46               ` Tejun Heo
       [not found]                 ` <20110726234657.GD28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-26 23:46 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello,

On Tue, Jul 26, 2011 at 03:59:11PM -0700, Matt Helsley wrote:
> > /proc/PID/fd already provides access to deleted files perfectly well
> > as most avid p0rn watchers would know (you can run mplayer on flash's
> > deleted temp files). ;)
> 
> Yup, access to the unlinked file contents. This is an example where
> things appear simple and complete in /proc yet it is insufficient.
> Here's what you'll need:
> 
> The string "(deleted)" in a file name is, strictly speaking, ambiguous --
> it does not mean the file is unlinked. You also can't infer that it is
> unlinked by stat()'ing that path since a different file could have
> been created in the same spot. For something unambiguous you'll
> have to add that information to /proc somewhere. fdinfo doesn't seem
> to be the right place since fds aren't unlinked -- files are. 

Hmm... but wouldn't fstat() after open reveal the original inode?  ie.

  $ cat fstat.c
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <stdio.h>
  #include <fcntl.h>
  #include <assert.h>

  int main(int argc, char **argv)
  {
	  int fd;
	  struct stat st = {};

	  assert((fd = open(argv[1], O_RDONLY)) >= 0);
	  assert(!fstat(fd, &st));
	  printf("ino=%lu nlink=%lu\n",
		 (unsigned long)st.st_ino, (unsigned long)st.st_nlink);
	  return 0;
  }
  $ gcc -Wall -o fstat fstat.c
  $ cat > asdf &
  [7] 31908
  $ ./fstat asdf
  ino=9180912 nlink=1
  $ ./fstat /proc/31908/fd/1
  ino=9180912 nlink=1
  $ rm -f asdf
  $ ./fstat /proc/31908/fd/1
  ino=9180912 nlink=0
  $ touch asdf
  $ $ ./fstat asdf
  ino=9180915 nlink=1

I don't think anything is ambiguous.

> Then you've got to detect when they're the same unlinked file and share
> the copy upon restart. Or they could be different unlinked files
> with the same path in which case you should not share the copy. I suppose
> you'll have to check the device and inode and then see if any other task
> being checkpointed has it open... once for each of potentially thousands
> of fds being checkpointed.

Just build a hash table w/ fstat results.  It's O(nr_open_files)
whether you do that or not.

> Then there's the case where you've got one unlinked dentry for the
> file but a hardlink elsewhere. The /proc/PID/fd path won't point to the
> hardlinked location. So in order for those to be the same file upon
> restart you need to find the file somehow during checkpoint and/or
> restart.

You can determine whether search for another hardlink is necessary by
looking at nlink.  Hmm... I wonder whether open_by_handle_at() can be
used for this instead of scanning filesystem for matching inode
number.  Screening by nlink should eliminate most cases but if
open_by_handle_at() can deal with actual cases, it would be much
better.

> Finally these files often can be huge. Copying them elsewhere is a huge IO
> burden compared to careful relinking of the file. IO that could be better
> spent doing actual work.
>
> We solved all that with "relinking". It's possible to make a relink()
> syscall. The code I posted some time ago to containers@ can be easily
> adapted for that -- I did so for my testing of those patches. I'm not
> exactly sure how it would be done from userspace but I suspect it could
> be done.

Yeah, something like flink (like fstat for stat) should do it.  FS
methods operate on dentries anyway so it can be added in the vfs layer
proper if necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                 ` <20110726222109.GB28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-27  0:06                   ` Matt Helsley
       [not found]                     ` <20110727000651.GA15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-27  0:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Wed, Jul 27, 2011 at 12:21:09AM +0200, Tejun Heo wrote:
> On Tue, Jul 26, 2011 at 03:02:15PM -0700, Matt Helsley wrote:
> > > Sure, that was completely embedded in the kernel and things can be
> > > implemented and fixed with much less consideration.  I can see how
> > > that would be easier for the specific use case, but that EXACTLY is
> > > why it can't go upstream.  I just can't see it happening and think it
> > 
> > It can't go upstream because it's too easy to implement and fix?
> > It can't go upstream because it has a specific use case?
> > Is there something that says every interface added to the kernel *must*
> > be useful for something besides the purpose that originally inspired it?
> 
> You really don't understand what I'm trying to say at all?

That's not what I said. I know you're arguing we shouldn't have an
in-kernel implementation.

Your statement above did not seem to support your argument at all --
you seemed to be conceding that an in-kernel implementation ("embedded
in the kernel") would be easier to implement and fix (nit: would've
been nice for you to include a bit more context..).

I know you think we should make use of lots of changes in a variety
of places such as ptrace, new bits in /proc, etc. to avoid an in-kernel
implementation. That's certainly an enticingly simple (non-complex) idea.
However I still question whether the idea will work well for
checkpoint/restart.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                 ` <20110726234657.GD28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2011-07-27  0:53                   ` Matt Helsley
       [not found]                     ` <20110727005341.GB15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-27  0:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Wed, Jul 27, 2011 at 01:46:57AM +0200, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 26, 2011 at 03:59:11PM -0700, Matt Helsley wrote:
> > > /proc/PID/fd already provides access to deleted files perfectly well
> > > as most avid p0rn watchers would know (you can run mplayer on flash's
> > > deleted temp files). ;)
> > 
> > Yup, access to the unlinked file contents. This is an example where
> > things appear simple and complete in /proc yet it is insufficient.
> > Here's what you'll need:
> > 
> > The string "(deleted)" in a file name is, strictly speaking, ambiguous --
> > it does not mean the file is unlinked. You also can't infer that it is
> > unlinked by stat()'ing that path since a different file could have
> > been created in the same spot. For something unambiguous you'll
> > have to add that information to /proc somewhere. fdinfo doesn't seem
> > to be the right place since fds aren't unlinked -- files are. 
> 
> Hmm... but wouldn't fstat() after open reveal the original inode?  ie.
> 
>   $ cat fstat.c
>   #include <sys/types.h>
>   #include <sys/stat.h>
>   #include <unistd.h>
>   #include <stdio.h>
>   #include <fcntl.h>
>   #include <assert.h>
> 
>   int main(int argc, char **argv)
>   {
> 	  int fd;
> 	  struct stat st = {};
> 
> 	  assert((fd = open(argv[1], O_RDONLY)) >= 0);
> 	  assert(!fstat(fd, &st));
> 	  printf("ino=%lu nlink=%lu\n",
> 		 (unsigned long)st.st_ino, (unsigned long)st.st_nlink);
> 	  return 0;
>   }
>   $ gcc -Wall -o fstat fstat.c
>   $ cat > asdf &
>   [7] 31908
>   $ ./fstat asdf
>   ino=9180912 nlink=1
>   $ ./fstat /proc/31908/fd/1
>   ino=9180912 nlink=1
>   $ rm -f asdf
>   $ ./fstat /proc/31908/fd/1
>   ino=9180912 nlink=0
>   $ touch asdf
>   $ $ ./fstat asdf
>   ino=9180915 nlink=1
> 
> I don't think anything is ambiguous.

Good point. Hmm, is it possible nlink could change to/from 0 in some obscure
VFS code though? The cgroup freezer won't cover filesystem activity so
checkpoint would also have to freeze the filesystem using the fs
freezer..

Though with flink() a link count race like that probably wouldn't matter.

> 
> > Then you've got to detect when they're the same unlinked file and share
> > the copy upon restart. Or they could be different unlinked files
> > with the same path in which case you should not share the copy. I suppose
> > you'll have to check the device and inode and then see if any other task
> > being checkpointed has it open... once for each of potentially thousands
> > of fds being checkpointed.
> 
> Just build a hash table w/ fstat results.  It's O(nr_open_files)
> whether you do that or not.

Yup. Still, it would be great if there was some way to avoid the need
for a hash table.

> 
> > Then there's the case where you've got one unlinked dentry for the
> > file but a hardlink elsewhere. The /proc/PID/fd path won't point to the
> > hardlinked location. So in order for those to be the same file upon
> > restart you need to find the file somehow during checkpoint and/or
> > restart.
> 
> You can determine whether search for another hardlink is necessary by
> looking at nlink.  Hmm... I wonder whether open_by_handle_at() can be
> used for this instead of scanning filesystem for matching inode
> number.  Screening by nlink should eliminate most cases but if
> open_by_handle_at() can deal with actual cases, it would be much
> better.

I briefly considered that and it might still be a good idea.
One reason I still went with relink is I was uncertain about what happens
to handles if the kernel reboots. If they become invalid then they don't
seem like a good candidate for checkpointing unlinked files.

> > Finally these files often can be huge. Copying them elsewhere is a huge IO
> > burden compared to careful relinking of the file. IO that could be better
> > spent doing actual work.
> >
> > We solved all that with "relinking". It's possible to make a relink()
> > syscall. The code I posted some time ago to containers@ can be easily
> > adapted for that -- I did so for my testing of those patches. I'm not
> > exactly sure how it would be done from userspace but I suspect it could
> > be done.
> 
> Yeah, something like flink (like fstat for stat) should do it.  FS
> methods operate on dentries anyway so it can be added in the vfs layer
> proper if necessary.

Exactly. I worked on that for a little bit but the security questions
worried me and I haven't picked it back up since. If you or Pavel do pick
up the flink() solution I'd be happy to help review it since it'll probably
be something we can use too.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                     ` <20110727005341.GB15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-27 10:12                       ` Tejun Heo
       [not found]                         ` <20110727101228.GY2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-27 10:12 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello, Matt.

On Tue, Jul 26, 2011 at 05:53:41PM -0700, Matt Helsley wrote:
> Good point. Hmm, is it possible nlink could change to/from 0 in some obscure
> VFS code though? The cgroup freezer won't cover filesystem activity so
> checkpoint would also have to freeze the filesystem using the fs
> freezer..

nlink won't change by itself.  I think it more comes down to policy
and framework decisions - the scope of CR'ing, how filesystems are
snapshotted along and so on.  If those boundaries are well-defined,
setting up mechanisms accordingly shouldn't be too difficult.

> > You can determine whether search for another hardlink is necessary by
> > looking at nlink.  Hmm... I wonder whether open_by_handle_at() can be
> > used for this instead of scanning filesystem for matching inode
> > number.  Screening by nlink should eliminate most cases but if
> > open_by_handle_at() can deal with actual cases, it would be much
> > better.
> 
> I briefly considered that and it might still be a good idea.
> One reason I still went with relink is I was uncertain about what happens
> to handles if the kernel reboots. If they become invalid then they don't
> seem like a good candidate for checkpointing unlinked files.

Hmmm... I _think_ they're persistent but if not I think a better
approach would be investigating why they aren't and update them so
that they're useful for CR too.

> > Yeah, something like flink (like fstat for stat) should do it.  FS
> > methods operate on dentries anyway so it can be added in the vfs layer
> > proper if necessary.
> 
> Exactly. I worked on that for a little bit but the security questions
> worried me and I haven't picked it back up since. If you or Pavel do pick
> up the flink() solution I'd be happy to help review it since it'll probably
> be something we can use too.

Yes, maybe, but the thing is that these are pretty much fringe case
optimizations.  I'm not saying they aren't worth adding but that
missing flink() or open_by_handle_at() support wouldn't hurt coverage
all that much.

I keep raising these similar points for two reasons.  First, CR
doesn't have to be complete (however the 'completeness' is defined) to
be useful.  If CR works for most use cases with existing mechanisms,
going forward with it would be already quite useful.  For HPC
applications, the bar is quite low, actually.

Secondly, once it builds momentum by being actually useful and
deployed, it gets *much* easier to justify addition of new kernel
features for it.  Conditioning whole progress on fringe cases is
counter productive for both the main project and the fringe cases.  If
the order is reversed, both can proceed much more efficiently.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                     ` <20110727000651.GA15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-27 12:01                       ` Tejun Heo
       [not found]                         ` <20110727120114.GZ2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-27 12:01 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello, Matt.

On Tue, Jul 26, 2011 at 05:06:51PM -0700, Matt Helsley wrote:
> On Wed, Jul 27, 2011 at 12:21:09AM +0200, Tejun Heo wrote:
> > On Tue, Jul 26, 2011 at 03:02:15PM -0700, Matt Helsley wrote:
> > > > Sure, that was completely embedded in the kernel and things can be
> > > > implemented and fixed with much less consideration.  I can see how
> > > > that would be easier for the specific use case, but that EXACTLY is
> > > > why it can't go upstream.  I just can't see it happening and think it
> > > 
> > > It can't go upstream because it's too easy to implement and fix?
> > > It can't go upstream because it has a specific use case?
> > > Is there something that says every interface added to the kernel *must*
> > > be useful for something besides the purpose that originally inspired it?
> > 
> > You really don't understand what I'm trying to say at all?
> 
> That's not what I said. I know you're arguing we shouldn't have an
> in-kernel implementation.
> 
> Your statement above did not seem to support your argument at all --
> you seemed to be conceding that an in-kernel implementation ("embedded
> in the kernel") would be easier to implement and fix (nit: would've
> been nice for you to include a bit more context..).

I see.  Probably I was too indirect, so let me try again.  The reason
why in-kernel implementation seems easier for CR itself is because it
has unlimited access to all the internal data structures, locking and
everything, which waivers a lot of efforts.  There's no layering to
consider and no userland visible API to worry about.

Unfortunately, those benefits don't come free.  It ends up adding a
lot of side-way accesses to different subsystems including another
locking vector, which add complexity to all the subsystems.  In short,
it makes CR easier by making everything else more complex.

Analogies are often misleading but in-kernel web server seems useful
to explain the point I'm trying to make (at least some part of it).
If the kernel lacks proper support API, hooking deeply into page
cache, network stack, scheduler and whatnot would make building high
performance web server much easier than trying to devise and implement
proper APIs to support high performance web server, and as a prototype
or probing project, in-kernel implementation sure would have a lot of
usefulness, but that's not how the end result should turn out.  It
makes maintaining and improving kernel subsystems unnecessarily
difficult for quite limited usefulness.

Again, I'm not saying CR is exactly the same and POV can vary greatly
depending on how one perceives various parameters, but I think it at
least illustrates my point clear.

> I know you think we should make use of lots of changes in a variety
> of places such as ptrace, new bits in /proc, etc. to avoid an in-kernel
> implementation. That's certainly an enticingly simple (non-complex) idea.
> However I still question whether the idea will work well for
> checkpoint/restart.

I think the difference in opinions originates from two major factors.
One being scope or completeness and the other perceived difficulties
of doing it from userland.  I think I've already said enough about the
former in another reply.

For the latter, I still can't see what would be so difficult.  We have
properly working ptrace now (and can even transparently inject worker
thread into the target process) so the core functionality is easily
(it takes effort but isn't technically difficult) achievable.  The
specific issues you've raised in this thread don't seem all that
daunting to tackle.  To me, the crux of most issues seems already
half-solved.  Maybe I'm overly optimistic but I don't really see any
missing chunk which is too big or especially difficult.  If you can
think of some, please bring them up.  Let's talk about them.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                         ` <20110727120114.GZ2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2011-07-27 21:35                           ` Matt Helsley
       [not found]                             ` <20110727213510.GC15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-27 21:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Wed, Jul 27, 2011 at 02:01:14PM +0200, Tejun Heo wrote:
> Hello, Matt.
> 
> On Tue, Jul 26, 2011 at 05:06:51PM -0700, Matt Helsley wrote:
> > On Wed, Jul 27, 2011 at 12:21:09AM +0200, Tejun Heo wrote:
> > > On Tue, Jul 26, 2011 at 03:02:15PM -0700, Matt Helsley wrote:
> > > > > Sure, that was completely embedded in the kernel and things can be
> > > > > implemented and fixed with much less consideration.  I can see how
> > > > > that would be easier for the specific use case, but that EXACTLY is
> > > > > why it can't go upstream.  I just can't see it happening and think it
> > > > 
> > > > It can't go upstream because it's too easy to implement and fix?
> > > > It can't go upstream because it has a specific use case?
> > > > Is there something that says every interface added to the kernel *must*
> > > > be useful for something besides the purpose that originally inspired it?
> > > 
> > > You really don't understand what I'm trying to say at all?
> > 
> > That's not what I said. I know you're arguing we shouldn't have an
> > in-kernel implementation.
> > 
> > Your statement above did not seem to support your argument at all --
> > you seemed to be conceding that an in-kernel implementation ("embedded
> > in the kernel") would be easier to implement and fix (nit: would've
> > been nice for you to include a bit more context..).
> 
> I see.  Probably I was too indirect, so let me try again.  The reason
> why in-kernel implementation seems easier for CR itself is because it
> has unlimited access to all the internal data structures, locking and
> everything, which waivers a lot of efforts.  There's no layering to
> consider and no userland visible API to worry about.
> 
> Unfortunately, those benefits don't come free.  It ends up adding a

(Agreed so far..)

> lot of side-way accesses to different subsystems including another
> locking vector, which add complexity to all the subsystems.  In short,
> it makes CR easier by making everything else more complex.

More, but how much more is where we probably disagree. Often  the
"subsystems" that need to be checkpointed already need to be prevent races
with syscalls that do most of what checkpoint/restart needs.
So checkpoint/restart usually doesn't make it any more complex in terms of
locking. In fact I can't think of a single instance where we changed the lock
coverage or locking rules of any subsystem.

> 
> Analogies are often misleading but in-kernel web server seems useful
> to explain the point I'm trying to make (at least some part of it).
> If the kernel lacks proper support API, hooking deeply into page
> cache, network stack, scheduler and whatnot would make building high
> performance web server much easier than trying to devise and implement
> proper APIs to support high performance web server, and as a prototype
> or probing project, in-kernel implementation sure would have a lot of
> usefulness, but that's not how the end result should turn out.  It
> makes maintaining and improving kernel subsystems unnecessarily
> difficult for quite limited usefulness.
> 
> Again, I'm not saying CR is exactly the same and POV can vary greatly
> depending on how one perceives various parameters, but I think it at
> least illustrates my point clear.

I think I see the point you're getting at. There are so many 
differences from c/r that the depth and breadth of the impact are
quite different for an in-kernel webserver though. I'd say c/r has a much
wider impact (involves more kernel/userspace interfaces) but also is
less deep than you seem to suggest -- it doesn't hook into the page cache,
the scheduler, packet rx/tx, etc.

The closest part of your analogy involved the networking code
and made me wonder how you think network sockets and connections
could best be checkpointed and restarted from userspace.

> > I know you think we should make use of lots of changes in a variety
> > of places such as ptrace, new bits in /proc, etc. to avoid an in-kernel
> > implementation. That's certainly an enticingly simple (non-complex) idea.
> > However I still question whether the idea will work well for
> > checkpoint/restart.
> 
> I think the difference in opinions originates from two major factors.
> One being scope or completeness and the other perceived difficulties
> of doing it from userland.  I think I've already said enough about the
> former in another reply.
> 
> For the latter, I still can't see what would be so difficult.  We have
> properly working ptrace now (and can even transparently inject worker
> thread into the target process) so the core functionality is easily
> (it takes effort but isn't technically difficult) achievable.  The
> specific issues you've raised in this thread don't seem all that
> daunting to tackle.  To me, the crux of most issues seems already
> half-solved.  Maybe I'm overly optimistic but I don't really see any
> missing chunk which is too big or especially difficult.  If you can
> think of some, please bring them up.  Let's talk about them.
> 
> Thanks.

Fair enough.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                         ` <20110727101228.GY2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2011-07-27 22:26                           ` Matt Helsley
  0 siblings, 0 replies; 68+ messages in thread
From: Matt Helsley @ 2011-07-27 22:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Wed, Jul 27, 2011 at 12:12:28PM +0200, Tejun Heo wrote:
> Hello, Matt.
> 
> On Tue, Jul 26, 2011 at 05:53:41PM -0700, Matt Helsley wrote:
> > Good point. Hmm, is it possible nlink could change to/from 0 in some obscure
> > VFS code though? The cgroup freezer won't cover filesystem activity so
> > checkpoint would also have to freeze the filesystem using the fs
> > freezer..
> 
> nlink won't change by itself.  I think it more comes down to policy
> and framework decisions - the scope of CR'ing, how filesystems are
> snapshotted along and so on.  If those boundaries are well-defined,
> setting up mechanisms accordingly shouldn't be too difficult.

Yes, policy is a large part of how checkpoint deals with filesystems
and mount namespaces. For instance, with flink where do you link to?
Userspace could conceivably want to link to many places, or even
tradeoff between performance (flink) and simplicity (copying it all to the
same location). That's one reason I might prefer file handles -- they
could be more policy agnostic :).

> 
> > > You can determine whether search for another hardlink is necessary by
> > > looking at nlink.  Hmm... I wonder whether open_by_handle_at() can be
> > > used for this instead of scanning filesystem for matching inode
> > > number.  Screening by nlink should eliminate most cases but if
> > > open_by_handle_at() can deal with actual cases, it would be much
> > > better.
> > 
> > I briefly considered that and it might still be a good idea.
> > One reason I still went with relink is I was uncertain about what happens
> > to handles if the kernel reboots. If they become invalid then they don't
> > seem like a good candidate for checkpointing unlinked files.
> 
> Hmmm... I _think_ they're persistent but if not I think a better
> approach would be investigating why they aren't and update them so
> that they're useful for CR too.

Assuming that's possible and acceptable to others, sure.

> 
> > > Yeah, something like flink (like fstat for stat) should do it.  FS
> > > methods operate on dentries anyway so it can be added in the vfs layer
> > > proper if necessary.
> > 
> > Exactly. I worked on that for a little bit but the security questions
> > worried me and I haven't picked it back up since. If you or Pavel do pick
> > up the flink() solution I'd be happy to help review it since it'll probably
> > be something we can use too.
> 
> Yes, maybe, but the thing is that these are pretty much fringe case
> optimizations.  I'm not saying they aren't worth adding but that
> missing flink() or open_by_handle_at() support wouldn't hurt coverage
> all that much.

Fair enough for now.

> 
> I keep raising these similar points for two reasons.  First, CR
> doesn't have to be complete (however the 'completeness' is defined) to

I agree it doesn't have to be complete to be useful, but...

> be useful.  If CR works for most use cases with existing mechanisms,
> going forward with it would be already quite useful.  For HPC
> applications, the bar is quite low, actually.

the less complete CR is the more critical a (presumably userspace) method
for detecting and identifying what the source(s) of an incomplete CR is
(are). Otherwise we could try to checkpoint a hypothetical "cure for
cancer/save the world" HPC task only to discover the checkpoint is utterly
useless when we attempt to do a restart. Or, worse, it just quietly corrupts
the results! So even if we don't support checkpointing something we still
want to detect if it's being used.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [TOOLS] To make use of the patches
       [not found]             ` <4E2A8704.3030306-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-07-27 23:00               ` Matt Helsley
       [not found]                 ` <20110727230003.GE15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Matt Helsley @ 2011-07-27 23:00 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, Linux Containers, Tejun Heo, Daniel Lezcano

On Sat, Jul 23, 2011 at 12:32:04PM +0400, Pavel Emelyanov wrote:

<snip>

> > For any subsequent postings could you split this up into multiple
> > emails -- perhaps one per file? 
> 
> OK, will do this.
> 
> > Or perhaps make them patches to the kernel's tools directory?
> 
> Hm... I didn't think about having these tools be the part of the kernel source tree.
> 
> Maybe it would be better if I publish the tools in git repo, what do you think?

I honestly don't know which is more appropriate.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                             ` <20110727213510.GC15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-28  7:21                               ` Tejun Heo
       [not found]                                 ` <20110728072141.GB2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-07-28  7:21 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

Hello, Matt.

On Wed, Jul 27, 2011 at 02:35:10PM -0700, Matt Helsley wrote:
> The closest part of your analogy involved the networking code
> and made me wonder how you think network sockets and connections
> could best be checkpointed and restarted from userspace.

My knowledge of the networking stack is rather basic so it probably
would require more research to be complete enough but I have something
on mind.  I'll try to hack up an example code (and many combine it
with the parasite so that it can hijack a socket) this weekend (or
next week :)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                                 ` <20110728072141.GB2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2011-07-28  7:23                                   ` Tejun Heo
  2011-07-28  8:37                                   ` James Bottomley
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-28  7:23 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano

On Thu, Jul 28, 2011 at 09:21:41AM +0200, Tejun Heo wrote:
> My knowledge of the networking stack is rather basic so it probably
> would require more research to be complete enough but I have something
> on mind.  I'll try to hack up an example code (and many combine it
                                                     ^^^^
                                  Ummm... brainfart: maybe
-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [TOOLS] To make use of the patches
       [not found]                 ` <20110727230003.GE15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-07-28  8:23                   ` James Bottomley
  0 siblings, 0 replies; 68+ messages in thread
From: James Bottomley @ 2011-07-28  8:23 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Cyrill Gorcunov, Linux Containers, Tejun Heo, Daniel Lezcano,
	Pavel Emelianov

On Wed, 2011-07-27 at 16:00 -0700, Matt Helsley wrote:
> On Sat, Jul 23, 2011 at 12:32:04PM +0400, Pavel Emelyanov wrote:
> 
> <snip>
> 
> > > For any subsequent postings could you split this up into multiple
> > > emails -- perhaps one per file? 
> > 
> > OK, will do this.
> > 
> > > Or perhaps make them patches to the kernel's tools directory?
> > 
> > Hm... I didn't think about having these tools be the part of the kernel source tree.
> > 
> > Maybe it would be better if I publish the tools in git repo, what do you think?
> 
> I honestly don't know which is more appropriate.

The decision has largely been made for us.  We now use the kernel
repository tools directory for any tools which are specific to (and
bound to) kernel infrastructure.  Checkpoint/Restore seems to fall into
this category.

James

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
       [not found]                                 ` <20110728072141.GB2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  2011-07-28  7:23                                   ` Tejun Heo
@ 2011-07-28  8:37                                   ` James Bottomley
  2011-07-28  9:10                                     ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: James Bottomley @ 2011-07-28  8:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelianov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano, Lars Marowsky-Bree

On Thu, 2011-07-28 at 09:21 +0200, Tejun Heo wrote:
> Hello, Matt.
> 
> On Wed, Jul 27, 2011 at 02:35:10PM -0700, Matt Helsley wrote:
> > The closest part of your analogy involved the networking code
> > and made me wonder how you think network sockets and connections
> > could best be checkpointed and restarted from userspace.
> 
> My knowledge of the networking stack is rather basic so it probably
> would require more research to be complete enough but I have something
> on mind.  I'll try to hack up an example code (and many combine it
> with the parasite so that it can hijack a socket) this weekend (or
> next week :)

So, this is actually a good example of why we don't want this
specifically bound to C/R in the kernel.  I think an individual network
socket can be checkpointed and restored separately specifically by
exporting some of its internal state (basically the current sequence
number and some of the window state).

The benefit to us of finding what this state is and making it available
is not only that we can now save and restore the socket as part of the
checkpoint, its that the High Availability people can use this feature
for individual sockets to build fault tolerant network service failover
on top of.  Thus by making the feature granular and available to
userspace, we've expanded the number of use cases (and hence the amount
of testing) we get for the feature.

James

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
  2011-07-28  8:37                                   ` James Bottomley
@ 2011-07-28  9:10                                     ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-07-28  9:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pavel Emelianov, Cyrill Gorcunov, Nathan Lynch, Linux Containers,
	Serge Hallyn, Daniel Lezcano, Lars Marowsky-Bree

Hey, James.  Nice to see you w/ parallels hat on.

On Thu, Jul 28, 2011 at 08:37:59AM +0000, James Bottomley wrote:
> So, this is actually a good example of why we don't want this
> specifically bound to C/R in the kernel.  I think an individual network
> socket can be checkpointed and restored separately specifically by
> exporting some of its internal state (basically the current sequence
> number and some of the window state).

I actually think network socket would be a pretty well behaving
candidate for userland CR.  Its behavior and interaction are strictly
defined.  The thing is designed to talk to other machines after all so
we basically already have well documented and enforced mechanism to
coerce it.  We need some bits and pieces but I'm expecting kernel side
of it to be very small.

> The benefit to us of finding what this state is and making it available
> is not only that we can now save and restore the socket as part of the
> checkpoint, its that the High Availability people can use this feature
> for individual sockets to build fault tolerant network service failover
> on top of.  Thus by making the feature granular and available to
> userspace, we've expanded the number of use cases (and hence the amount
> of testing) we get for the feature.

Yeah, exactly, and that in turn makes it easy to push the feature into
the kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2011-07-28  9:10 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-15 13:45 [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace Pavel Emelyanov
     [not found] ` <4E204466.8010204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-15 13:45   ` [PATCH 0/1] proc: Introduce the /proc/<pid>/mfd/ directory Pavel Emelyanov
     [not found]     ` <4E20448A.5010207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-21  7:21       ` Tejun Heo
2011-07-15 13:46   ` [PATCH 2/7] vfs: Introduce the fd closing helper Pavel Emelyanov
     [not found]     ` <4E2044A7.4030103-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-21 15:47       ` Serge E. Hallyn
2011-07-15 13:46   ` [PATCH 3/7] proc: Introduce the Children: line in /proc/<pid>/status Pavel Emelyanov
     [not found]     ` <4E2044C3.7050506-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-21  6:54       ` Tejun Heo
     [not found]         ` <20110721065436.GT3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2011-07-23  8:06           ` Pavel Emelyanov
     [not found]             ` <4E2A8116.1040309-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-23  8:41               ` Tejun Heo
     [not found]                 ` <20110723084110.GG21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-23  8:45                   ` Pavel Emelyanov
     [not found]                     ` <4E2A8A0E.5030208-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-23  8:50                       ` Tejun Heo
     [not found]                         ` <20110723085014.GI21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-23  8:51                           ` Pavel Emelyanov
2011-07-21 15:54       ` Serge E. Hallyn
2011-07-15 13:47   ` [PATCH 4/7] vfs: Add ->statfs callback for pipefs Pavel Emelyanov
     [not found]     ` <4E2044D6.3060205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-21  6:59       ` Tejun Heo
2011-07-21 15:59       ` Serge E. Hallyn
2011-07-15 13:47   ` [PATCH 5/7] clone: Introduce the CLONE_CHILD_USEPID functionality Pavel Emelyanov
     [not found]     ` <4E2044EB.20001-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-21 16:04       ` Serge E. Hallyn
     [not found]         ` <20110721160459.GD19012-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2011-07-22 23:08           ` Matt Helsley
     [not found]             ` <20110722230848.GB16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-23  8:09               ` Pavel Emelyanov
2011-07-15 13:47   ` [PATCH 6/7] proc: Introduce the /proc/<pid>/dump file Pavel Emelyanov
     [not found]     ` <4E204500.6040800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-16 22:57       ` Kirill A. Shutemov
     [not found]         ` <20110716225709.GA25606-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
2011-07-17  8:06           ` Cyrill Gorcunov
2011-07-21  6:44       ` Tejun Heo
     [not found]         ` <20110721064408.GR3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2011-07-23  8:11           ` Pavel Emelyanov
     [not found]             ` <4E2A8239.5060908-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-23  8:37               ` Tejun Heo
     [not found]                 ` <20110723083711.GF21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-23  8:49                   ` Pavel Emelyanov
     [not found]                     ` <4E2A8B12.4010709-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-23  8:58                       ` Tejun Heo
2011-07-15 13:48   ` [PATCH 7/7] binfmt: Introduce the binfmt_img exec handler Pavel Emelyanov
     [not found]     ` <4E204519.3040804-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-21  6:51       ` Tejun Heo
     [not found]         ` <20110721065127.GS3455-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2011-07-22 22:46           ` Matt Helsley
     [not found]             ` <20110722224617.GA16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-23  8:17               ` Pavel Emelyanov
     [not found]                 ` <4E2A83AC.6090504-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-23  8:45                   ` Tejun Heo
     [not found]                     ` <20110723084529.GH21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-23  8:51                       ` Pavel Emelyanov
     [not found]                         ` <4E2A8B7D.8010807-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-23  9:04                           ` Tejun Heo
2011-07-15 13:49   ` [TOOLS] To make use of the patches Pavel Emelyanov
     [not found]     ` <4E204554.6040901-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-22 23:45       ` Matt Helsley
     [not found]         ` <20110722234558.GD16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-23  8:32           ` Pavel Emelyanov
     [not found]             ` <4E2A8704.3030306-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-07-27 23:00               ` Matt Helsley
     [not found]                 ` <20110727230003.GE15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-28  8:23                   ` James Bottomley
2011-07-23  0:40       ` Reply #2: " Matt Helsley
     [not found]         ` <20110723004045.GC21563-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-23  8:33           ` Pavel Emelyanov
2011-07-15 15:01   ` [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace Tejun Heo
2011-07-18 13:27   ` Serge E. Hallyn
     [not found]     ` <20110718132759.GB8127-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2011-07-23  8:43       ` Pavel Emelyanov
2011-07-23  0:25   ` Matt Helsley
     [not found]     ` <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-23  3:29       ` Matt Helsley
     [not found]         ` <20110723032945.GD21563-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-23  4:58           ` Tejun Heo
     [not found]             ` <20110723045842.GD21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-26 18:11               ` Matt Helsley
     [not found]                 ` <20110726181128.GD14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-26 22:45                   ` Tejun Heo
     [not found]                     ` <20110726224525.GC28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-26 23:07                       ` Matt Helsley
2011-07-23  3:53       ` Tejun Heo
     [not found]         ` <CAOS58YPqLSYi2xECUk4O5GG3s6aokT=VykmkL6UnAOzyHXNAgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-07-26 22:59           ` Matt Helsley
     [not found]             ` <20110726225911.GF14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-26 23:46               ` Tejun Heo
     [not found]                 ` <20110726234657.GD28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-27  0:53                   ` Matt Helsley
     [not found]                     ` <20110727005341.GB15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-27 10:12                       ` Tejun Heo
     [not found]                         ` <20110727101228.GY2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2011-07-27 22:26                           ` Matt Helsley
2011-07-23  5:10       ` Tejun Heo
     [not found]         ` <20110723051005.GE21089-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-26 22:02           ` Matt Helsley
     [not found]             ` <20110726220215.GE14808-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-26 22:21               ` Tejun Heo
     [not found]                 ` <20110726222109.GB28497-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-07-27  0:06                   ` Matt Helsley
     [not found]                     ` <20110727000651.GA15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-27 12:01                       ` Tejun Heo
     [not found]                         ` <20110727120114.GZ2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2011-07-27 21:35                           ` Matt Helsley
     [not found]                             ` <20110727213510.GC15501-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2011-07-28  7:21                               ` Tejun Heo
     [not found]                                 ` <20110728072141.GB2622-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2011-07-28  7:23                                   ` Tejun Heo
2011-07-28  8:37                                   ` James Bottomley
2011-07-28  9:10                                     ` Tejun Heo
2011-07-23  8:39       ` Pavel Emelyanov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.