linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] - v2 -  Object creation with a specified id
@ 2008-04-18  5:44 Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 1/4] - v2 - Provide a new procfs interface to set next id Nadia.Derbey
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Nadia.Derbey @ 2008-04-18  5:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, orenl, nick



When restarting a process that has been previously checkpointed, that process
should keep on using some of its ids (such as its process id, or sysV ipc ids).

This patch provides a feature that can help ensuring this saved state reuse:
it makes it possible to create an object with a pre-defined id.

A first implementation had been proposed 2 months ago. It consisted in
changing an object's id after it had been created.

Here is a second implementation based on Oren Ladaan's idea: Oren's suggestion
was to force an object's id during its creation, rather than 1. create it,
2. change its id.

A new file is created in procfs: /proc/self/task/<my_tid>/next_id.
This makes it possible to avoid races between several threads belonging to
the same process.

When this file is filled with an id value, a structure pointed to by the
calling task_struct is filled with that id.

Then, when an object supporting this feature is created, the id present in
that new structure is used, instead of the default one.

The syntax is one of:
   . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
     next object to be created will have an id set to XX
   . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
     next object to be created will have its ids set to XX0, ... X<n-1>
     This is particularly useful for processes that may have several ids if
     they belong to nested namespaces.

The objects covered here are ipc objects and processes.
Today, the ids are specified as long, but having a type string specified in
the next_id file makes it possible to cover more types in the future, if
needed.

The patches are against 2.6.25-rc3-mm2, in the following order:

[PATCH 1/4] adds the procfs facility for next object to be created, this
            object being associated to a single id.
[PATCH 2/4] enhances the procfs facility for objects associated to multiple
            ids (like processes).
[PATCH 3/4] makes use of the specified id (if any) to allocate the new IPC
            object (changes the ipc_addid() path).
[PATCH 4/4] uses the specified id(s) (if any) to set the upid nr(s) for a newly
            allocated process (changes the alloc_pid()/alloc_pidmap() paths).

Any comment and/or suggestions are welcome.

Regards,
Nadia

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] - v2 - Provide a new procfs interface to set next id
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
@ 2008-04-18  5:45 ` Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 2/4] - v2 - Provide a new procfs interface to set next upid nr(s) Nadia.Derbey
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Nadia.Derbey @ 2008-04-18  5:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, orenl, nick, Nadia Derbey

[-- Attachment #1: proc_set_next_id.patch --]
[-- Type: text/plain, Size: 8705 bytes --]

[PATCH 01/04]

This patch proposes the procfs facilities needed to feed the id for the
next object to be allocated.

if an
echo "LONG XX" > /proc/self/task/<tid>/next_id
is issued, next object to be created will have XX as its id.

This applies to objects that need a single id, such as ipc objects.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>

---
 fs/exec.c              |    3 +
 fs/proc/base.c         |   76 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h  |    3 +
 include/linux/sysids.h |   24 +++++++++++++
 kernel/Makefile        |    2 -
 kernel/exit.c          |    4 ++
 kernel/fork.c          |    2 +
 kernel/nextid.c        |   86 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 199 insertions(+), 1 deletion(-)

Index: linux-2.6.25-rc8-mm2/include/linux/sysids.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.25-rc8-mm2/include/linux/sysids.h	2008-04-17 13:35:29.000000000 +0200
@@ -0,0 +1,24 @@
+/*
+ * include/linux/sysids.h
+ *
+ * Definitions to support object creation with predefined id.
+ *
+ */
+
+#ifndef _LINUX_SYSIDS_H
+#define _LINUX_SYSIDS_H
+
+struct sys_id {
+	long id;
+};
+
+extern ssize_t get_nextid(struct task_struct *, char *, size_t);
+extern int set_nextid(struct task_struct *, char *);
+extern int reset_nextid(struct task_struct *);
+
+static inline void exit_nextid(struct task_struct *tsk)
+{
+	reset_nextid(tsk);
+}
+
+#endif /* _LINUX_SYSIDS_H */
Index: linux-2.6.25-rc8-mm2/include/linux/sched.h
===================================================================
--- linux-2.6.25-rc8-mm2.orig/include/linux/sched.h	2008-04-17 12:50:22.000000000 +0200
+++ linux-2.6.25-rc8-mm2/include/linux/sched.h	2008-04-17 13:38:04.000000000 +0200
@@ -88,6 +88,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/sysids.h>
 
 #include <asm/processor.h>
 
@@ -1280,6 +1281,8 @@ struct task_struct {
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
 #endif
+	/* Id to assign to the next resource to be created */
+	struct sys_id *next_id;
 };
 
 /*
Index: linux-2.6.25-rc8-mm2/fs/proc/base.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/fs/proc/base.c	2008-04-17 12:50:40.000000000 +0200
+++ linux-2.6.25-rc8-mm2/fs/proc/base.c	2008-04-17 15:19:25.000000000 +0200
@@ -1138,6 +1138,77 @@ static const struct file_operations proc
 #endif
 
 
+static ssize_t next_id_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char *page;
+	ssize_t length;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+
+	if (count >= PAGE_SIZE)
+		count = PAGE_SIZE - 1;
+
+	length = -ENOMEM;
+	page = (char *) __get_free_page(GFP_TEMPORARY);
+	if (!page)
+		goto out;
+
+	length = get_nextid(task, (char *) page, count);
+	if (length >= 0)
+		length = simple_read_from_buffer(buf, count, ppos,
+						(char *)page, length);
+	free_page((unsigned long) page);
+
+out:
+	put_task_struct(task);
+	return length;
+}
+
+static ssize_t next_id_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	char *page;
+	ssize_t length;
+
+	if (pid_task(proc_pid(inode), PIDTYPE_PID) != current)
+		return -EPERM;
+
+	if (count >= PAGE_SIZE)
+		count = PAGE_SIZE - 1;
+
+	if (*ppos != 0) {
+		/* No partial writes. */
+		return -EINVAL;
+	}
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+	length = -EFAULT;
+	if (copy_from_user(page, buf, count))
+		goto out_free_page;
+
+	page[count] = '\0';
+
+	length = set_nextid(current, page);
+	if (!length)
+		length = count;
+
+out_free_page:
+	free_page((unsigned long) page);
+	return length;
+}
+
+static const struct file_operations proc_next_id_operations = {
+	.read		= next_id_read,
+	.write		= next_id_write,
+};
+
+
 #ifdef CONFIG_SCHED_DEBUG
 /*
  * Print out various scheduling related per-task fields:
@@ -2779,6 +2850,11 @@ static const struct pid_entry tid_base_s
 #ifdef CONFIG_FAULT_INJECTION
 	REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
 #endif
+	/*
+	 * NOTE that next_id is not added into tgid_base_stuff[] since it has
+	 * to be specified on a per-thread basis.
+	 */
+	REG("next_id", S_IRUGO|S_IWUSR, next_id),
 };
 
 static int proc_tid_base_readdir(struct file * filp,
Index: linux-2.6.25-rc8-mm2/kernel/Makefile
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/Makefile	2008-04-17 12:50:25.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/Makefile	2008-04-17 13:47:45.000000000 +0200
@@ -9,7 +9,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
-	    notifier.o ksysfs.o pm_qos_params.o
+	    notifier.o ksysfs.o pm_qos_params.o nextid.o
 
 ifdef CONFIG_FTRACE
 # Do not profile debug utilities
Index: linux-2.6.25-rc8-mm2/kernel/nextid.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.25-rc8-mm2/kernel/nextid.c	2008-04-17 15:16:07.000000000 +0200
@@ -0,0 +1,86 @@
+/*
+ * linux/kernel/nextid.c
+ *
+ *
+ * Provide the get_nextid() / set_nextid() routines
+ * (called from fs/proc/base.c).
+ * They allow to specify the id for the next resource to be allocated,
+ * instead of letting the allocator set it for us.
+ */
+
+#include <linux/sched.h>
+#include <linux/ctype.h>
+
+
+
+ssize_t get_nextid(struct task_struct *task, char *buffer, size_t size)
+{
+	struct sys_id *sid;
+
+	sid = task->next_id;
+	if (!sid)
+		return snprintf(buffer, size, "UNSET\n");
+
+	return snprintf(buffer, size, "LONG %ld\n", sid->id);
+}
+
+static int set_single_id(struct task_struct *task, char *buffer)
+{
+	struct sys_id *sid;
+	long next_id;
+	char *end;
+
+	next_id = simple_strtol(buffer, &end, 0);
+	if (end == buffer || (end && *end && !isspace(*end)))
+		return -EINVAL;
+
+	sid = task->next_id;
+	if (!sid) {
+		sid = kzalloc(sizeof(*sid), GFP_KERNEL);
+		if (!sid)
+			return -ENOMEM;
+		task->next_id = sid;
+	}
+
+	sid->id = next_id;
+
+	return 0;
+}
+
+int reset_nextid(struct task_struct *task)
+{
+	struct sys_id *sid;
+
+	sid = task->next_id;
+	if (!sid)
+		return 0;
+
+	task->next_id = NULL;
+	kfree(sid);
+	return 0;
+}
+
+#define LONG_STR    "LONG"
+#define RESET_STR   "RESET"
+
+/*
+ * Parses a line written to /proc/self/task/<my_tid>/next_id.
+ * this line has the following format:
+ * LONG id              --> a single id is specified
+ */
+int set_nextid(struct task_struct *task, char *buffer)
+{
+	char *token, *out = buffer;
+
+	if (!out)
+		return -EINVAL;
+
+	token = strsep(&out, " ");
+
+	if (!strcmp(token, LONG_STR))
+		return set_single_id(task, out);
+	else if (!strncmp(token, RESET_STR, strlen(RESET_STR)))
+		return reset_nextid(task);
+	else
+		return -EINVAL;
+}
Index: linux-2.6.25-rc8-mm2/kernel/fork.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/fork.c	2008-04-17 12:50:25.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/fork.c	2008-04-17 13:49:09.000000000 +0200
@@ -1167,6 +1167,8 @@ static struct task_struct *copy_process(
 	p->blocked_on = NULL; /* not blocked yet */
 #endif
 
+	p->next_id = NULL;
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	sched_fork(p, clone_flags);
 
Index: linux-2.6.25-rc8-mm2/kernel/exit.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/exit.c	2008-04-17 12:50:25.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/exit.c	2008-04-17 13:50:42.000000000 +0200
@@ -987,6 +987,10 @@ NORET_TYPE void do_exit(long code)
 
 	proc_exit_connector(tsk);
 	exit_notify(tsk, group_dead);
+
+	if (unlikely(tsk->next_id))
+		exit_nextid(tsk);
+
 #ifdef CONFIG_NUMA
 	mpol_put(tsk->mempolicy);
 	tsk->mempolicy = NULL;
Index: linux-2.6.25-rc8-mm2/fs/exec.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/fs/exec.c	2008-04-17 12:50:41.000000000 +0200
+++ linux-2.6.25-rc8-mm2/fs/exec.c	2008-04-17 13:51:47.000000000 +0200
@@ -1024,6 +1024,9 @@ int flush_old_exec(struct linux_binprm *
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
 
+	if (unlikely(current->next_id))
+		reset_nextid(current);
+
 	return 0;
 
 mmap_failed:

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/4] - v2 - Provide a new procfs interface to set next upid nr(s)
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 1/4] - v2 - Provide a new procfs interface to set next id Nadia.Derbey
@ 2008-04-18  5:45 ` Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 3/4] - v2 - IPC: use the target ID specified in procfs Nadia.Derbey
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Nadia.Derbey @ 2008-04-18  5:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, orenl, nick, Nadia Derbey

[-- Attachment #1: proc_set_next_ids.patch --]
[-- Type: text/plain, Size: 6993 bytes --]

[PATCH 02/04]

This patch proposes the procfs facilities needed to feed the id(s) for the
next task to be forked.

say n is the number of pids to be provided through procfs:

if an
echo "LONG<n> X0 X1 ... X<n-1>" > /proc/self/task/<tid>/next_id
is issued, the next task to be forked will have its upid nrs set as follows
(say it is forked in a pid ns of level L):

level         upid nr
L ----------> X0
..
L - i ------> Xi
..
L - n + 1 --> X<n-1>

Then, for levels L-n down to level 0, the pids will be left to the kernel
choice.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>

---
 include/linux/sysids.h |   27 ++++++++
 kernel/nextid.c        |  158 ++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 163 insertions(+), 22 deletions(-)

Index: linux-2.6.25-rc8-mm2/include/linux/sysids.h
===================================================================
--- linux-2.6.25-rc8-mm2.orig/include/linux/sysids.h	2008-04-17 13:35:29.000000000 +0200
+++ linux-2.6.25-rc8-mm2/include/linux/sysids.h	2008-04-17 16:03:07.000000000 +0200
@@ -8,8 +8,33 @@
 #ifndef _LINUX_SYSIDS_H
 #define _LINUX_SYSIDS_H
 
+
+#define NIDS_SMALL       32
+#define NIDS_PER_BLOCK   ((unsigned int)(PAGE_SIZE / sizeof(long)))
+
+/* access the ids "array" with this macro */
+#define ID_AT(pi, i)	\
+	((pi)->blocks[(i) / NIDS_PER_BLOCK][(i) % NIDS_PER_BLOCK])
+
+
+/*
+ * List of ids for the next object to be created. This presently applies to
+ * next process to be created.
+ * The next process to be created is associated to a set of upid nrs: one for
+ * each pid namespace level that process belongs to.
+ * upid nrs from level 0 up to level <npids - 1> will be automatically
+ * allocated.
+ * upid nr for level nids will be set to blocks[0][0]
+ * upid nr for level <nids + i> will be set to ID_AT(ids, i);
+ *
+ * If a single id is needed, nids is set to 1 and small_block[0] is set to
+ * that id.
+ */
 struct sys_id {
-	long id;
+	int nids;
+	long small_block[NIDS_SMALL];
+	int nblocks;
+	long *blocks[0];
 };
 
 extern ssize_t get_nextid(struct task_struct *, char *, size_t);
Index: linux-2.6.25-rc8-mm2/kernel/nextid.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/nextid.c	2008-04-17 15:16:07.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/nextid.c	2008-04-17 16:54:30.000000000 +0200
@@ -13,38 +13,146 @@
 
 
 
+static struct sys_id *id_blocks_alloc(int nids)
+{
+	struct sys_id *ids;
+	int nblocks;
+	int i;
+
+	nblocks = (nids + NIDS_PER_BLOCK - 1) / NIDS_PER_BLOCK;
+	BUG_ON(nblocks < 1);
+
+	ids = kmalloc(sizeof(*ids) + nblocks * sizeof(long *), GFP_KERNEL);
+	if (!ids)
+		return NULL;
+	ids->nids = nids;
+	ids->nblocks = nblocks;
+
+	if (nids <= NIDS_SMALL)
+		ids->blocks[0] = ids->small_block;
+	else {
+		for (i = 0; i < nblocks; i++) {
+			long *b;
+			b = (void *)__get_free_page(GFP_KERNEL);
+			if (!b)
+				goto out_undo_partial_alloc;
+			ids->blocks[i] = b;
+		}
+	}
+	return ids;
+
+out_undo_partial_alloc:
+	while (--i >= 0)
+		free_page((unsigned long)ids->blocks[i]);
+
+	kfree(ids);
+	return NULL;
+}
+
+static void id_blocks_free(struct sys_id *ids)
+{
+	if (ids == NULL)
+		return;
+
+	if (ids->blocks[0] != ids->small_block) {
+		int i;
+		for (i = 0; i < ids->nblocks; i++)
+			free_page((unsigned long)ids->blocks[i]);
+	}
+	kfree(ids);
+	return;
+}
+
 ssize_t get_nextid(struct task_struct *task, char *buffer, size_t size)
 {
+	ssize_t rc, count = 0;
 	struct sys_id *sid;
+	char *bufptr = buffer;
+	int i;
 
 	sid = task->next_id;
-	if (!sid)
+	if (!sid || !sid->nids)
 		return snprintf(buffer, size, "UNSET\n");
 
-	return snprintf(buffer, size, "LONG %ld\n", sid->id);
+	count = snprintf(bufptr, size, "LONGS (%d) ", sid->nids);
+
+	for (i = 0; i < sid->nids - 1; i++) {
+		rc = snprintf(&bufptr[count], size - count, "%ld ",
+				ID_AT(sid, i));
+		if (rc >= size - count)
+			return -ENOMEM;
+		count += rc;
+	}
+
+	rc = snprintf(&bufptr[count], size - count, "%ld\n", ID_AT(sid, i));
+	if (rc >= size - count)
+		return -ENOMEM;
+	count += rc;
+
+	return count;
 }
 
-static int set_single_id(struct task_struct *task, char *buffer)
+static int fill_nextid_list(struct task_struct *task, int nids, char *buffer)
 {
-	struct sys_id *sid;
-	long next_id;
+	char *token, *buff = buffer;
 	char *end;
+	struct sys_id *sid;
+	struct sys_id *old_list = task->next_id;
+	int i;
 
-	next_id = simple_strtol(buffer, &end, 0);
-	if (end == buffer || (end && *end && !isspace(*end)))
-		return -EINVAL;
+	sid = id_blocks_alloc(nids);
+	if (!sid)
+		return -ENOMEM;
 
-	sid = task->next_id;
-	if (!sid) {
-		sid = kzalloc(sizeof(*sid), GFP_KERNEL);
-		if (!sid)
-			return -ENOMEM;
-		task->next_id = sid;
+	i = 0;
+	while ((token = strsep(&buff, " ")) != NULL && i < nids) {
+		long id;
+
+		if (!*token)
+			goto out_free;
+		id = simple_strtol(token, &end, 0);
+		if (end == token || (*end && !isspace(*end)))
+			goto out_free;
+		ID_AT(sid, i) = id;
+		i++;
 	}
 
-	sid->id = next_id;
+	if (i != nids)
+		/* Not enough pids compared to npids */
+		goto out_free;
+
+	if (old_list)
+		id_blocks_free(old_list);
 
+	task->next_id = sid;
 	return 0;
+
+out_free:
+	id_blocks_free(sid);
+	return -EINVAL;
+}
+
+/*
+ * Parses a line with the following format:
+ * <x> <id0> ... <idx-1>
+ * and sets <id0> to <idx-1> as the sequence of ids to be used for the next
+ * object to be created by the task.
+ * This applies to processes that need 1 id per namespace level.
+ * Any trailing character on the line is skipped.
+ */
+static int set_multiple_ids(struct task_struct *task, char *nb, char *buffer)
+{
+	int nids;
+	char *end;
+
+	nids = simple_strtol(nb, &end, 0);
+	if (*end)
+		return -EINVAL;
+
+	if (nids <= 0)
+		return -EINVAL;
+
+	return fill_nextid_list(task, nids, buffer);
 }
 
 int reset_nextid(struct task_struct *task)
@@ -55,8 +163,8 @@ int reset_nextid(struct task_struct *tas
 	if (!sid)
 		return 0;
 
+	id_blocks_free(sid);
 	task->next_id = NULL;
-	kfree(sid);
 	return 0;
 }
 
@@ -65,12 +173,14 @@ int reset_nextid(struct task_struct *tas
 
 /*
  * Parses a line written to /proc/self/task/<my_tid>/next_id.
- * this line has the following format:
+ * this line has one of the following formats:
  * LONG id              --> a single id is specified
+ * LONG<x> id0 ... id<x-1> --> a sequence of ids is specified
  */
 int set_nextid(struct task_struct *task, char *buffer)
 {
 	char *token, *out = buffer;
+	size_t sz;
 
 	if (!out)
 		return -EINVAL;
@@ -78,9 +188,15 @@ int set_nextid(struct task_struct *task,
 	token = strsep(&out, " ");
 
 	if (!strcmp(token, LONG_STR))
-		return set_single_id(task, out);
-	else if (!strncmp(token, RESET_STR, strlen(RESET_STR)))
+		return fill_nextid_list(task, 1, out);
+
+	sz = strlen(LONG_STR);
+
+	if (!strncmp(token, LONG_STR, sz))
+		return set_multiple_ids(task, token + sz, out);
+
+	if (!strncmp(token, RESET_STR, strlen(RESET_STR)))
 		return reset_nextid(task);
-	else
-		return -EINVAL;
+
+	return -EINVAL;
 }

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/4] - v2 - IPC: use the target ID specified in procfs
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 1/4] - v2 - Provide a new procfs interface to set next id Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 2/4] - v2 - Provide a new procfs interface to set next upid nr(s) Nadia.Derbey
@ 2008-04-18  5:45 ` Nadia.Derbey
  2008-04-18  5:45 ` [PATCH 4/4] - v2 - PID: " Nadia.Derbey
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Nadia.Derbey @ 2008-04-18  5:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, orenl, nick, Nadia Derbey

[-- Attachment #1: ipc_use_next_id.patch --]
[-- Type: text/plain, Size: 3361 bytes --]

[PATCH 03/04]

This patch makes use of the target id specified by a previous write into
/proc/self/task/<tid>/next_id as the id to use to allocate the next IPC
object.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>

---
 include/linux/sysids.h |    7 +++++++
 ipc/util.c             |   41 +++++++++++++++++++++++++++++++++--------
 kernel/nextid.c        |    2 +-
 3 files changed, 41 insertions(+), 9 deletions(-)

Index: linux-2.6.25-rc8-mm2/include/linux/sysids.h
===================================================================
--- linux-2.6.25-rc8-mm2.orig/include/linux/sysids.h	2008-04-17 16:03:07.000000000 +0200
+++ linux-2.6.25-rc8-mm2/include/linux/sysids.h	2008-04-17 17:05:54.000000000 +0200
@@ -37,9 +37,16 @@ struct sys_id {
 	long *blocks[0];
 };
 
+#define next_ipcid(tsk) ((tsk)->next_id					\
+				? ((tsk)->next_id->nids			\
+					? ID_AT((tsk)->next_id, 0)	\
+					: -1)				\
+				: -1)
+
 extern ssize_t get_nextid(struct task_struct *, char *, size_t);
 extern int set_nextid(struct task_struct *, char *);
 extern int reset_nextid(struct task_struct *);
+extern void id_blocks_free(struct sys_id *);
 
 static inline void exit_nextid(struct task_struct *tsk)
 {
Index: linux-2.6.25-rc8-mm2/kernel/nextid.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/nextid.c	2008-04-17 16:54:30.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/nextid.c	2008-04-17 17:06:43.000000000 +0200
@@ -49,7 +49,7 @@ out_undo_partial_alloc:
 	return NULL;
 }
 
-static void id_blocks_free(struct sys_id *ids)
+void id_blocks_free(struct sys_id *ids)
 {
 	if (ids == NULL)
 		return;
Index: linux-2.6.25-rc8-mm2/ipc/util.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/ipc/util.c	2008-04-17 12:50:37.000000000 +0200
+++ linux-2.6.25-rc8-mm2/ipc/util.c	2008-04-17 17:09:37.000000000 +0200
@@ -260,6 +260,7 @@ int ipc_get_maxid(struct ipc_ids *ids)
 int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
 {
 	int id, err;
+	int next_id;
 
 	if (size > IPCMNI)
 		size = IPCMNI;
@@ -267,20 +268,44 @@ int ipc_addid(struct ipc_ids* ids, struc
 	if (ids->in_use >= size)
 		return -ENOSPC;
 
-	err = idr_get_new(&ids->ipcs_idr, new, &id);
-	if (err)
-		return err;
+	next_id = next_ipcid(current);
+
+	if (next_id >= 0) {
+		/* There is a target id specified, try to use it */
+		int new_lid = next_id % SEQ_MULTIPLIER;
+
+		if (next_id !=
+		    (new_lid + (next_id / SEQ_MULTIPLIER) * SEQ_MULTIPLIER))
+			return -EINVAL;
+
+		err = idr_get_new_above(&ids->ipcs_idr, new, new_lid, &id);
+		if (err)
+			return err;
+		if (id != new_lid) {
+			idr_remove(&ids->ipcs_idr, id);
+			return -EBUSY;
+		}
+
+		new->id = next_id;
+		new->seq = next_id / SEQ_MULTIPLIER;
+		id_blocks_free(current->next_id);
+		current->next_id = NULL;
+	} else {
+		err = idr_get_new(&ids->ipcs_idr, new, &id);
+		if (err)
+			return err;
+
+		new->seq = ids->seq++;
+		if (ids->seq > ids->seq_max)
+			ids->seq = 0;
+		new->id = ipc_buildid(id, new->seq);
+	}
 
 	ids->in_use++;
 
 	new->cuid = new->uid = current->euid;
 	new->gid = new->cgid = current->egid;
 
-	new->seq = ids->seq++;
-	if(ids->seq > ids->seq_max)
-		ids->seq = 0;
-
-	new->id = ipc_buildid(id, new->seq);
 	spin_lock_init(&new->lock);
 	new->deleted = 0;
 	rcu_read_lock();

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 4/4] - v2 - PID: use the target ID specified in procfs
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
                   ` (2 preceding siblings ...)
  2008-04-18  5:45 ` [PATCH 3/4] - v2 - IPC: use the target ID specified in procfs Nadia.Derbey
@ 2008-04-18  5:45 ` Nadia.Derbey
  2008-04-18 17:07 ` [PATCH 0/4] - v2 - Object creation with a specified id Dave Hansen
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Nadia.Derbey @ 2008-04-18  5:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, orenl, nick, Nadia Derbey

[-- Attachment #1: upidnr_use_next_id.patch --]
[-- Type: text/plain, Size: 6397 bytes --]

[PATCH 04/04]

This patch makes use of the target ids specified by a previous write to
/proc/self/task/<tid>/next_id as the ids to use to allocate the next upid nrs.
Upper levels upid nrs that are not specified in next_pids file are left to the
kernel choice.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>

---
 include/linux/pid.h |    2 
 kernel/fork.c       |    3 -
 kernel/pid.c        |  141 +++++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 126 insertions(+), 20 deletions(-)

Index: linux-2.6.25-rc8-mm2/include/linux/pid.h
===================================================================
--- linux-2.6.25-rc8-mm2.orig/include/linux/pid.h	2008-04-17 12:50:22.000000000 +0200
+++ linux-2.6.25-rc8-mm2/include/linux/pid.h	2008-04-17 17:42:11.000000000 +0200
@@ -121,7 +121,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int *retval);
 extern void free_pid(struct pid *pid);
 
 /*
Index: linux-2.6.25-rc8-mm2/kernel/fork.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/fork.c	2008-04-17 13:49:09.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/fork.c	2008-04-17 17:42:48.000000000 +0200
@@ -1200,8 +1200,7 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
-		pid = alloc_pid(task_active_pid_ns(p));
+		pid = alloc_pid(task_active_pid_ns(p), &retval);
 		if (!pid)
 			goto bad_fork_cleanup_io;
 
Index: linux-2.6.25-rc8-mm2/kernel/pid.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/pid.c	2008-04-17 12:50:25.000000000 +0200
+++ linux-2.6.25-rc8-mm2/kernel/pid.c	2008-04-17 17:50:00.000000000 +0200
@@ -122,6 +122,26 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
+static inline int alloc_pidmap_page(struct pidmap *map)
+{
+	if (unlikely(!map->page)) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it:
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (map->page)
+			kfree(page);
+		else
+			map->page = page;
+		spin_unlock_irq(&pidmap_lock);
+		if (unlikely(!map->page))
+			return -1;
+	}
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,21 +154,8 @@ static int alloc_pidmap(struct pid_names
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (unlikely(alloc_pidmap_page(map)))
+			break;
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -182,6 +189,35 @@ static int alloc_pidmap(struct pid_names
 	return -1;
 }
 
+/*
+ * Return a predefined pid value if successful (ID_AT(pid_l, level)),
+ * -errno else
+ */
+static int alloc_fixed_pidmap(struct pid_namespace *pid_ns,
+			struct sys_id *pid_l, int level)
+{
+	int offset, pid;
+	struct pidmap *map;
+
+	pid = ID_AT(pid_l, level);
+	if (pid < RESERVED_PIDS || pid >= pid_max)
+		return -EINVAL;
+
+	map = &pid_ns->pidmap[pid / BITS_PER_PAGE];
+
+	if (unlikely(alloc_pidmap_page(map)))
+		return -ENOMEM;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	atomic_dec(&map->nr_free);
+	pid_ns->last_pid = max(pid_ns->last_pid, pid);
+
+	return pid;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
@@ -243,20 +279,91 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+/*
+ * Called by alloc_pid() to use a list of predefined ids for the calling
+ * process' upper ns levels.
+ * Returns next pid ns to visit if successful (may be NULL if walked through
+ * the entire pid ns hierarchy).
+ * i is filled with next level to be visited (useful for the error cases).
+ */
+static struct pid_namespace *set_predefined_pids(struct pid_namespace *ns,
+						struct pid *pid,
+						struct sys_id *pid_l,
+						int *next_level)
+{
+	struct pid_namespace *tmp;
+	int rel_level, i, nr;
+
+	rel_level = pid_l->nids - 1;
+	if (rel_level > ns->level)
+		return ERR_PTR(-EINVAL);
+
+	tmp = ns;
+
+	/*
+	 * Use the predefined upid nrs for levels ns->level down to
+	 * ns->level - rel_level
+	 */
+	for (i = ns->level ; rel_level >= 0; i--, rel_level--) {
+		nr = alloc_fixed_pidmap(tmp, pid_l, rel_level);
+		if (nr < 0) {
+			tmp = ERR_PTR(nr);
+			goto out;
+		}
+
+		pid->numbers[i].nr = nr;
+		pid->numbers[i].ns = tmp;
+		tmp = tmp->parent;
+	}
+
+	id_blocks_free(pid_l);
+out:
+	*next_level = i;
+	return tmp;
+}
+
+struct pid *alloc_pid(struct pid_namespace *ns, int *retval)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	struct sys_id *pid_l;
 
+	*retval = -ENOMEM;
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		goto out;
 
 	tmp = ns;
-	for (i = ns->level; i >= 0; i--) {
+	i = ns->level;
+
+	/*
+	 * If there is a list of upid nrs specified, use it instead of letting
+	 * the kernel chose the upid nrs for us.
+	 */
+	pid_l = current->next_id;
+	if (pid_l && pid_l->nids) {
+		/*
+		 * returns the next ns to be visited in the following loop
+		 * (or NULL if we are done).
+		 * i is filled in with the next level to be visited. We need
+		 * it to undo things in the error cases.
+		 */
+		tmp = set_predefined_pids(ns, pid, pid_l, &i);
+		if (IS_ERR(tmp)) {
+			*retval = PTR_ERR(tmp);
+			goto out_free;
+		}
+		current->next_id = NULL;
+	}
+
+	*retval = -ENOMEM;
+	/*
+	 * Let the lower levels upid nrs be automatically allocated
+	 */
+	for ( ; i >= 0; i--) {
 		nr = alloc_pidmap(tmp);
 		if (nr < 0)
 			goto out_free;

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] - v2 -  Object creation with a specified id
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
                   ` (3 preceding siblings ...)
  2008-04-18  5:45 ` [PATCH 4/4] - v2 - PID: " Nadia.Derbey
@ 2008-04-18 17:07 ` Dave Hansen
  2008-04-21 11:32   ` Nadia Derbey
  2008-04-22 19:36 ` Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id) Alexey Dobriyan
  2008-04-23 14:23 ` [PATCH 0/4] - v2 - Object creation with a specified id Pavel Machek
  6 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-04-18 17:07 UTC (permalink / raw)
  To: Nadia.Derbey; +Cc: linux-kernel, containers, nick

On Fri, 2008-04-18 at 07:44 +0200, Nadia.Derbey@bull.net wrote:
> The syntax is one of:
>    . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
>      next object to be created will have an id set to XX
>    . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
>      next object to be created will have its ids set to XX0, ... X<n-1>
>      This is particularly useful for processes that may have several ids if
>      they belong to nested namespaces.

I still think, in its current form, this is pretty dangerous.  

It might be nice to have a central place to set ids, but there's no real
safety here for when we get a mishmash of 50 of these things.  We have a
queue and we fix the order in which things are inserted, but how do we
control the order in which the kernel extracts them?  Can we 100%
conform to that order in the future and be consistent?  Won't that also
be dictated by the way userspace behaves?

If we're going to go this route, I think we need to do a few things.
First, these need to be more "type safe".  That means that sticking an
ids in here that you mean to go to an IPC should never, ever, ever be
able to get interpreted as a PID or as an inode number or anything else.

	echo "IPC_SEM_ID NN" > /proc/self/task/<my_tid>/next_id
	echo "PID MM" > /proc/self/task/<my_tid>/next_id

Maybe we should include the types in there as well, but those are kinda
implied by the kind of id you're setting.  Have you thought about 32-bit
userspace with 64-bit kernels?  Does that cause any issues?

I also feel like we should have the kernel able to enumerate which
things in a process *are* settable.  What are the valid values for that
first word being echoed in the file?

Does anyone have a list of these ids that we're going to want to set
like this?  Oren?  I'm a bit worried that we might want to set things
other than ids with an interface like this and that it won't transition
well.

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] - v2 -  Object creation with a specified id
  2008-04-18 17:07 ` [PATCH 0/4] - v2 - Object creation with a specified id Dave Hansen
@ 2008-04-21 11:32   ` Nadia Derbey
  0 siblings, 0 replies; 32+ messages in thread
From: Nadia Derbey @ 2008-04-21 11:32 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers, nick, linux-kernel

Dave Hansen wrote:
> On Fri, 2008-04-18 at 07:44 +0200, Nadia.Derbey@bull.net wrote:
> 
>>The syntax is one of:
>>   . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
>>     next object to be created will have an id set to XX
>>   . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
>>     next object to be created will have its ids set to XX0, ... X<n-1>
>>     This is particularly useful for processes that may have several ids if
>>     they belong to nested namespaces.
> 
> 
> I still think, in its current form, this is pretty dangerous.  
> 
> It might be nice to have a central place to set ids, but there's no real
> safety here for when we get a mishmash of 50 of these things.  We have a
> queue and we fix the order in which things are inserted, but how do we
> control the order in which the kernel extracts them?

I thought that we were the ones who are going to be controlling the 
restart phase?

I think sysV IPCs are particular, since they could be recreated early 
during the restart phase:
we could define a callback routine that would be called by each process 
being restarted and that would do:
for each ipc type:
   1. extract next ipc id for my uid and current ipc type
   2. write(fd_next_id, id, id_sz)
   3. call the XXXget() routine for current ipc_type
   4. remove that id from the checkpoint file

I think this could only be applied to msg queues and sems, since the 
creator pid is stored in the perm structure in the shm case. But in the 
shm case, we can replace in 1. "my uid" by "my tgid_vnr"

Then the processes hierarchy restore would take place, so no confusion 
between the ids.


> Can we 100%
> conform to that order in the future and be consistent?  Won't that also
> be dictated by the way userspace behaves?
> 
> If we're going to go this route, I think we need to do a few things.
> First, these need to be more "type safe".  That means that sticking an
> ids in here that you mean to go to an IPC should never, ever, ever be
> able to get interpreted as a PID or as an inode number or anything else.
> 
> 	echo "IPC_SEM_ID NN" > /proc/self/task/<my_tid>/next_id
> 	echo "PID MM" > /proc/self/task/<my_tid>/next_id
> 
> Maybe we should include the types in there as well, but those are kinda
> implied by the kind of id you're setting.  Have you thought about 32-bit
> userspace with 64-bit kernels? 

No, but the ids are int in the ipc case and in the pid_t case too.

> Does that cause any issues?
> 
> I also feel like we should have the kernel able to enumerate which
> things in a process *are* settable.  What are the valid values for that
> first word being echoed in the file?
> 
> Does anyone have a list of these ids that we're going to want to set
> like this?  Oren?  I'm a bit worried that we might want to set things
> other than ids with an interface like this and that it won't transition
> well.
> 

Regards,
Nadia



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-22 19:36 ` Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id) Alexey Dobriyan
@ 2008-04-22 18:56   ` Dave Hansen
  2008-04-22 19:51     ` Serge E. Hallyn
  2008-04-22 21:01     ` Alexey Dobriyan
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2008-04-22 18:56 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Nadia.Derbey, linux-kernel, containers, orenl, nick

On Tue, 2008-04-22 at 23:36 +0400, Alexey Dobriyan wrote:
> On Fri, Apr 18, 2008 at 07:44:59AM +0200, Nadia.Derbey@bull.net wrote:
> >    . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
> >      next object to be created will have an id set to XX
> >    . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
> >      next object to be created will have its ids set to XX0, ... X<n-1>
> >      This is particularly useful for processes that may have several ids if
> >      they belong to nested namespaces.
> 
> Can we answer the following questions before merging this patch:
> 
> a) should mainline kernel have checkpoint/restart feature at all
> b) if yes, should it be done from kernel- or userspace?
> 
> Until agreement will be "yes/from userspace" such patches don't make
> sense in mainline.

What do you mean by "from kernel" or "from userspace"?  Userspace has to
be involved somewhere, right?  It at least need to open() the checkpoint
file and pass it into the kernel.  		

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
                   ` (4 preceding siblings ...)
  2008-04-18 17:07 ` [PATCH 0/4] - v2 - Object creation with a specified id Dave Hansen
@ 2008-04-22 19:36 ` Alexey Dobriyan
  2008-04-22 18:56   ` Dave Hansen
  2008-04-23 14:23 ` [PATCH 0/4] - v2 - Object creation with a specified id Pavel Machek
  6 siblings, 1 reply; 32+ messages in thread
From: Alexey Dobriyan @ 2008-04-22 19:36 UTC (permalink / raw)
  To: Nadia.Derbey; +Cc: linux-kernel, containers, orenl, nick

On Fri, Apr 18, 2008 at 07:44:59AM +0200, Nadia.Derbey@bull.net wrote:
>    . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
>      next object to be created will have an id set to XX
>    . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
>      next object to be created will have its ids set to XX0, ... X<n-1>
>      This is particularly useful for processes that may have several ids if
>      they belong to nested namespaces.

Can we answer the following questions before merging this patch:

a) should mainline kernel have checkpoint/restart feature at all
b) if yes, should it be done from kernel- or userspace?

Until agreement will be "yes/from userspace" such patches don't make
sense in mainline.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-22 18:56   ` Dave Hansen
@ 2008-04-22 19:51     ` Serge E. Hallyn
  2008-04-22 21:01     ` Alexey Dobriyan
  1 sibling, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-04-22 19:51 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Alexey Dobriyan, containers, nick, Nadia.Derbey, linux-kernel

Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> On Tue, 2008-04-22 at 23:36 +0400, Alexey Dobriyan wrote:
> > On Fri, Apr 18, 2008 at 07:44:59AM +0200, Nadia.Derbey@bull.net wrote:
> > >    . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
> > >      next object to be created will have an id set to XX
> > >    . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
> > >      next object to be created will have its ids set to XX0, ... X<n-1>
> > >      This is particularly useful for processes that may have several ids if
> > >      they belong to nested namespaces.
> > 
> > Can we answer the following questions before merging this patch:
> > 
> > a) should mainline kernel have checkpoint/restart feature at all
> > b) if yes, should it be done from kernel- or userspace?
> > 
> > Until agreement will be "yes/from userspace" such patches don't make
> > sense in mainline.
> 
> What do you mean by "from kernel" or "from userspace"?  Userspace has to
> be involved somewhere, right?  It at least need to open() the checkpoint
> file and pass it into the kernel.  		

Well let's start by answering a) :

	I vote yes  :-)

As for b), good question.  Except as Dave points out I think it's pretty
unlikely that we'll end up with "sys_checkpoint(checkpoint_dirnam)"
and "sys_restart(checkpointed_dirnam)".  Anything more user-space
driven will likely require userspace to set ids on resources, right?

But yes, we have to completely answer b) before this patch goes in, no
doubt about that.

Alexey, are you advocating for a completely kernel-driven approach, or
just making sure we come to a decision?

thanks,
-serge

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-22 18:56   ` Dave Hansen
  2008-04-22 19:51     ` Serge E. Hallyn
@ 2008-04-22 21:01     ` Alexey Dobriyan
  2008-04-22 22:56       ` Dave Hansen
  1 sibling, 1 reply; 32+ messages in thread
From: Alexey Dobriyan @ 2008-04-22 21:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Nadia.Derbey, linux-kernel, containers, orenl, nick

On Tue, Apr 22, 2008 at 11:56:20AM -0700, Dave Hansen wrote:
> On Tue, 2008-04-22 at 23:36 +0400, Alexey Dobriyan wrote:
> > On Fri, Apr 18, 2008 at 07:44:59AM +0200, Nadia.Derbey@bull.net wrote:
> > >    . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
> > >      next object to be created will have an id set to XX
> > >    . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
> > >      next object to be created will have its ids set to XX0, ... X<n-1>
> > >      This is particularly useful for processes that may have several ids if
> > >      they belong to nested namespaces.
> > 
> > Can we answer the following questions before merging this patch:
> > 
> > a) should mainline kernel have checkpoint/restart feature at all
> > b) if yes, should it be done from kernel- or userspace?
> > 
> > Until agreement will be "yes/from userspace" such patches don't make
> > sense in mainline.
> 
> What do you mean by "from kernel" or "from userspace"?

By "from userspace" I mean proposed interfaces and similar: usespace by
special system calls puts some state of future object and then does
normal system call which creates aforementioned object.

This can be used for PIDs, OK.

This can be used for SystemV shmem ids. But SystemV shmem also has
(let's choose) uid/gid, atimes and actual content. How would restoration
look like?

How to restore struct task_struct::did_exec ? Do execve(2)?

A was ptracing B at checkpoint moment...

Netdevices: stats, name, all sorts of flags, hw addresses, MTU

iptables rules.

These next ids are suitable, well, only for ids which is very, very small
part of kernel state needed to restore group of processes.

> Userspace has to be involved somewhere, right?  It at least need to open()
> the checkpoint file and pass it into the kernel.  		

Sure, but opening dumpfile is not an interesting part.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-22 21:01     ` Alexey Dobriyan
@ 2008-04-22 22:56       ` Dave Hansen
  2008-04-23  6:40         ` Kirill Korotaev
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-04-22 22:56 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Nadia.Derbey, linux-kernel, containers, orenl, nick

On Wed, 2008-04-23 at 01:01 +0400, Alexey Dobriyan wrote:
> On Tue, Apr 22, 2008 at 11:56:20AM -0700, Dave Hansen wrote:
> > On Tue, 2008-04-22 at 23:36 +0400, Alexey Dobriyan wrote:> > 
> > > a) should mainline kernel have checkpoint/restart feature at all
> > > b) if yes, should it be done from kernel- or userspace?
> > > 
> > > Until agreement will be "yes/from userspace" such patches don't make
> > > sense in mainline.
> > 
> > What do you mean by "from kernel" or "from userspace"?
> 
> By "from userspace" I mean proposed interfaces and similar: usespace by
> special system calls puts some state of future object and then does
> normal system call which creates aforementioned object.
> 
> This can be used for PIDs, OK.
> 
> This can be used for SystemV shmem ids. But SystemV shmem also has
> (let's choose) uid/gid, atimes and actual content. How would restoration
> look like?
> 
> How to restore struct task_struct::did_exec ? Do execve(2)?
> 
> A was ptracing B at checkpoint moment...
> 
> Netdevices: stats, name, all sorts of flags, hw addresses, MTU
> 
> iptables rules.

Don't we already have interfaces to dump these out and restore them?  

My argument is this:  If we have interfaces that exist (like setting up
iptables rules) we shouldn't make a second interface *just* for
checkpoint/restart.  We have a hard enough time getting *one* interface
right for things, I can't imagine getting two right, and *keeping* them
right.

If the current interface is insufficient, we should first expand it in
such a way that it can be used for checkpoint.  That certainly won't
work in all cases.  fork(), for instance, doesn't take any arguments and
is going to be awfully hard to expand. :)

I'd love to hear some of your insights about how things like the current
iptables interfaces are insufficient for checkpoint/restart.

> These next ids are suitable, well, only for ids which is very, very small
> part of kernel state needed to restore group of processes.

I couldn't agree more.  This id setting mechanism would only be useful
for a small subset of the things we need during a restart.

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-22 22:56       ` Dave Hansen
@ 2008-04-23  6:40         ` Kirill Korotaev
  2008-04-23 15:33           ` Dave Hansen
  2008-04-24  1:19           ` Oren Laadan
  0 siblings, 2 replies; 32+ messages in thread
From: Kirill Korotaev @ 2008-04-23  6:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexey Dobriyan, containers, nick, Nadia.Derbey, linux-kernel,
	Andrew Morton

> If the current interface is insufficient, we should first expand it in
> such a way that it can be used for checkpoint.  That certainly won't
> work in all cases.  fork(), for instance, doesn't take any arguments and
> is going to be awfully hard to expand. :)
> 
> I'd love to hear some of your insights about how things like the current
> iptables interfaces are insufficient for checkpoint/restart.

iptables is a bad example. Luckily for checkpointing - it always had an interface
"load full state", "dump full state".

But even iptables are not working in current form for checkpointing - they can't
save/restore state of conntracks. Do you think netdev@/netfilter@ guys will be happy
to have APIs allowing to set conntracks and have all the pain related
to API stability - cause conntrack state changed a couple of times during last 2 years.

Consider more intimate kernel states like:
a. task statistics
b. task start time
c. load average
d. skb state and it's data.
e. mount tree.

If you think over, e.g. (b) is a bad thing. It was used to be accounted in jiffies, then in timespec.
(a) is another example of dataset which we can't predict. task statistics change over a time.
Why bother with such intimate data in user-space at all?
Why the hell user-space should know about it and be ABLE to modify it?

Why do we need to export ability to set IDs for some objects which none of the
operating systems do and then to have a burden to support it for application compatibility
the rest of our lifes? Do you really believe none of the applications except for checkpointing
will be using it?

My personal vision is that:
1. user space must initialize checkpointing/restore state via some system call,
   supply file descriptor from where data can be read/written to.
2. must call the syscall asking kernel to restore/save different subsytems one by one.
3. finalize cpt/restore state via the syscall
But user-space MUST NOT bother about data content. At least not about the data supplied by the kernel.
It can add additional sections if needed, e.g. about iptables state.

Having all this functionality in a signle syscall we specifically CLAIM a black box,
and that no one can use this interfaces for something different from checkpoint/restore.

So I think we have to know what other maintainers think before we can go.


>> These next ids are suitable, well, only for ids which is very, very small
>> part of kernel state needed to restore group of processes.
> 
> I couldn't agree more.  This id setting mechanism would only be useful
> for a small subset of the things we need during a restart.
> 
> -- Dave
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] - v2 -  Object creation with a specified id
  2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
                   ` (5 preceding siblings ...)
  2008-04-22 19:36 ` Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id) Alexey Dobriyan
@ 2008-04-23 14:23 ` Pavel Machek
  6 siblings, 0 replies; 32+ messages in thread
From: Pavel Machek @ 2008-04-23 14:23 UTC (permalink / raw)
  To: Nadia.Derbey; +Cc: linux-kernel, containers, orenl, nick

Hi!

> When restarting a process that has been previously checkpointed, that process
> should keep on using some of its ids (such as its process id, or sysV ipc ids).
> 
> This patch provides a feature that can help ensuring this saved state reuse:
> it makes it possible to create an object with a pre-defined id.
> 
> A first implementation had been proposed 2 months ago. It consisted in
> changing an object's id after it had been created.
> 
> Here is a second implementation based on Oren Ladaan's idea: Oren's suggestion
> was to force an object's id during its creation, rather than 1. create it,
> 2. change its id.
> 
> A new file is created in procfs: /proc/self/task/<my_tid>/next_id.
> This makes it possible to avoid races between several threads belonging to
> the same process.

Ugly...

> When this file is filled with an id value, a structure pointed to by the
> calling task_struct is filled with that id.
> 
> Then, when an object supporting this feature is created, the id present in
> that new structure is used, instead of the default one.
> 
> The syntax is one of:
>    . echo "LONG XX" > /proc/self/task/<my_tid>/next_id
>      next object to be created will have an id set to XX
>    . echo "LONG<n> X0 ... X<n-1>" > /proc/self/task/<my_tid>/next_id
>      next object to be created will have its ids set to XX0, ... X<n-1>
>      This is particularly useful for processes that may have several ids if
>      they belong to nested namespaces.

So ugly it is not even funny.

Create fork_with_pid(int pid) and corresponding ipc primitives, if
you have to...
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-23  6:40         ` Kirill Korotaev
@ 2008-04-23 15:33           ` Dave Hansen
  2008-04-24  7:00             ` Kirill Korotaev
  2008-04-24  1:19           ` Oren Laadan
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-04-23 15:33 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: containers, linux-kernel, Nadia.Derbey, Andrew Morton, nick,
	Alexey Dobriyan

On Wed, 2008-04-23 at 10:40 +0400, Kirill Korotaev wrote:
> Having all this functionality in a signle syscall we specifically CLAIM a black box,
> and that no one can use this interfaces for something different from checkpoint/restore.
> 
> So I think we have to know what other maintainers think before we can go.

I completely agree.

Have you asked any particular maintainers what they think?

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-23  6:40         ` Kirill Korotaev
  2008-04-23 15:33           ` Dave Hansen
@ 2008-04-24  1:19           ` Oren Laadan
  2008-07-10  1:58             ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Oren Laadan @ 2008-04-24  1:19 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Dave Hansen, containers, linux-kernel, Nadia.Derbey,
	Andrew Morton, nick, Alexey Dobriyan



Kirill Korotaev wrote:
>> If the current interface is insufficient, we should first expand it in
>> such a way that it can be used for checkpoint.  That certainly won't
>> work in all cases.  fork(), for instance, doesn't take any arguments and
>> is going to be awfully hard to expand. :)
>>
>> I'd love to hear some of your insights about how things like the current
>> iptables interfaces are insufficient for checkpoint/restart.
> 
> iptables is a bad example. Luckily for checkpointing - it always had an interface
> "load full state", "dump full state".
> 
> But even iptables are not working in current form for checkpointing - they can't
> save/restore state of conntracks. Do you think netdev@/netfilter@ guys will be happy
> to have APIs allowing to set conntracks and have all the pain related
> to API stability - cause conntrack state changed a couple of times during last 2 years.
> 
> Consider more intimate kernel states like:
> a. task statistics
> b. task start time
> c. load average
> d. skb state and it's data.
> e. mount tree.
> 
> If you think over, e.g. (b) is a bad thing. It was used to be accounted in jiffies, then in timespec.
> (a) is another example of dataset which we can't predict. task statistics change over a time.
> Why bother with such intimate data in user-space at all?
> Why the hell user-space should know about it and be ABLE to modify it?

Agreed.

> Why do we need to export ability to set IDs for some objects which none of the
> operating systems do and then to have a burden to support it for application compatibility
> the rest of our lifes? Do you really believe none of the applications except for checkpointing
> will be using it?
> 
> My personal vision is that:
> 1. user space must initialize checkpointing/restore state via some system call,
>    supply file descriptor from where data can be read/written to.
> 2. must call the syscall asking kernel to restore/save different subsytems one by one.
> 3. finalize cpt/restore state via the syscall
> But user-space MUST NOT bother about data content. At least not about the data supplied by the kernel.
> It can add additional sections if needed, e.g. about iptables state.

I mostly agree with the vision that checkpoint/restart is probably best
implemented as a black box:

* First, much of the work required to restore the state of a process
as well as the state of its resources, requires kernel interfaces that
are lower than the ones available to user-space. Working in user-space
will require that we design new complex interfaces for this purpose only.

* Second, much of the state that needs to be saved was not, is not, and
should probably never be exported to user-space (e.g. interval socket
buffers, t->did_exec and many more). It is only accessible to kernel
code, so an in-kernel module (for checkpoint/restart) makes sense. It is
that sort of internals that may (and will) change as the kernel evolves
- precisely because it is not visible to user-space and not bound to it.

That said, we should still attempt to reuse existing kernel interfaces
and mechanisms as much as possible to save - and restore - the state of
processes, and prefer that over handcrafting special code. This is
especially true for restart: in checkpoint one has to _capture_ the
state by probing it in a passive manner; in contrast, in restart one
has to actively _construct_ new resources and ensure that their state
matches the saved state.

For instance, it is possible to create the process tree in a container
during restart from user-space reusing clone() (I'd argue that it's
even better to do so from user-space). Likewise, it is possible to redo
an open file by opening the file then using dup2() syscall to adjust
the target fd if necessary. The two differ in that the latter allows
to adjust the (so called) resources identifier to the desired value
(because it is privately visible), while the former does not - it gives
a globally visible identifier. And this is precisely why this thread
had started: how to determine the resource identifier when requesting
to allocate a resource (in this example, the pid).

> 
> Having all this functionality in a signle syscall we specifically CLAIM a black box,
> and that no one can use this interfaces for something different from checkpoint/restore.

True, except for what can be done (and is easier to actually that way)
in user space; the most obvious example being clone() and setsid() -
which are a pain to adjust after the fact. In particular, everything
that is private in the kernel now (un-exported) should remain that way,
unless there is an (other) compelling reason to expose it.
(see http://www.ncl.cs.columbia.edu/publications/usenix2007_fordist.pdf)

Regardless of this black-box approach, we need a method to tell the
kernel code that we reuse, that we want a certain resource to have a
certain value. Whether or not this method is exported to user space
depends on whether or not that particular resource is constructed from
user space or not.

For example, Zap uses a special per-task (task_struct) field that if
set designates the next resource identifier expected to be used. It
is set before restore of pid's (clone), IPC (all sorts) and pseudo
terminals. Of these, only clone() is called from user-space, so the
interface allows only that one to be set from user-space.

Oren.

> 
> So I think we have to know what other maintainers think before we can go.
> 
> 
>>> These next ids are suitable, well, only for ids which is very, very small
>>> part of kernel state needed to restore group of processes.
>> I couldn't agree more.  This id setting mechanism would only be useful
>> for a small subset of the things we need during a restart.
>>
>> -- Dave
>>
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-23 15:33           ` Dave Hansen
@ 2008-04-24  7:00             ` Kirill Korotaev
  2008-04-24 18:30               ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill Korotaev @ 2008-04-24  7:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers, linux-kernel, Nadia.Derbey, Andrew Morton, nick,
	Alexey Dobriyan

Yeah... we talked with Andrew yesterday. He mostly agrees with a black box.
He also said an interesting idea, that if you need compatibility
between the kernels (like we for example, support for migration from 2.6.9 to 2.6.18 in OpenVZ)
you can do image conversion in user-space... Though I think we will prefer simply to have
a compatibility patch for specific kernel versions in our kernel tree (not in mainstream).

Definitely, other ideas/opinions are welcome.

Kirill


Dave Hansen wrote:
> On Wed, 2008-04-23 at 10:40 +0400, Kirill Korotaev wrote:
>> Having all this functionality in a signle syscall we specifically CLAIM a black box,
>> and that no one can use this interfaces for something different from checkpoint/restore.
>>
>> So I think we have to know what other maintainers think before we can go.
> 
> I completely agree.
> 
> Have you asked any particular maintainers what they think?
> 
> -- Dave
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-24  7:00             ` Kirill Korotaev
@ 2008-04-24 18:30               ` Dave Hansen
  2008-04-24 23:13                 ` Oren Laadan
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-04-24 18:30 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: containers, linux-kernel, Nadia.Derbey, Andrew Morton, nick,
	Alexey Dobriyan

On Thu, 2008-04-24 at 11:00 +0400, Kirill Korotaev wrote:
> Yeah... we talked with Andrew yesterday. He mostly agrees with a black
> box.
> He also said an interesting idea, that if you need compatibility
> between the kernels (like we for example, support for migration from
> 2.6.9 to 2.6.18 in OpenVZ)
> you can do image conversion in user-space...

That sounds compelling to me.  I'm definitely willing to explore it.

Any thoughts on what we should start with?  What can we get merged, most
easily?

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-24 18:30               ` Dave Hansen
@ 2008-04-24 23:13                 ` Oren Laadan
  0 siblings, 0 replies; 32+ messages in thread
From: Oren Laadan @ 2008-04-24 23:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill Korotaev, containers, linux-kernel, Nadia.Derbey,
	Andrew Morton, nick, Alexey Dobriyan



Dave Hansen wrote:
> On Thu, 2008-04-24 at 11:00 +0400, Kirill Korotaev wrote:
>> Yeah... we talked with Andrew yesterday. He mostly agrees with a black
>> box.
>> He also said an interesting idea, that if you need compatibility
>> between the kernels (like we for example, support for migration from
>> 2.6.9 to 2.6.18 in OpenVZ)
>> you can do image conversion in user-space...
> 
> That sounds compelling to me.  I'm definitely willing to explore it.

Filtering/converting in user-space has long been part of zap: used to
be able to migrate between both minor and major versions of the kernel
(always upwards, of course).

In fact, I already proposed it as part of the original thread that led
eventually to the patch in question:
https://lists.linux-foundation.org/pipermail/containers/2008-February/009849.html

The need to be able to specify a desired ID (from user-space and/or from
in-kernel) remains.

Oren.

> Any thoughts on what we should start with?  What can we get merged, most
> easily?
> 
> -- Dave
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-04-24  1:19           ` Oren Laadan
@ 2008-07-10  1:58             ` Eric W. Biederman
  2008-07-10 17:12               ` Dave Hansen
  2008-07-17 23:09               ` Oren Laadan
  0 siblings, 2 replies; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  1:58 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Kirill Korotaev, containers, linux-kernel, Dave Hansen,
	Alexey Dobriyan, Andrew Morton, nick, Nadia.Derbey, Serge Hallyn

Oren Laadan <orenl@cs.columbia.edu> writes:

>> Consider more intimate kernel states like:
>> a. task statistics
>> b. task start time
>> c. load average
>> d. skb state and it's data.
>> e. mount tree.
>> 
>> If you think over, e.g. (b) is a bad thing. It was used to be accounted in
> jiffies, then in timespec.
>> (a) is another example of dataset which we can't predict. task statistics
> change over a time.
>> Why bother with such intimate data in user-space at all?
>> Why the hell user-space should know about it and be ABLE to modify it?
>
> Agreed.

Almost agreed.  The reason we care is that the data is visible to user
space in some form.  So we need to save it, but hopefully not in it's internal
kernel representation.  If we don't care at all we should not save it.

>> My personal vision is that:
>> 1. user space must initialize checkpointing/restore state via some system
> call,
>>    supply file descriptor from where data can be read/written to.
>> 2. must call the syscall asking kernel to restore/save different subsytems one
> by one.
>> 3. finalize cpt/restore state via the syscall
>> But user-space MUST NOT bother about data content. At least not about the data
> supplied by the kernel.
>> It can add additional sections if needed, e.g. about iptables state.
>
> I mostly agree with the vision that checkpoint/restart is probably best
> implemented as a black box:

I would claim an atomic unit.  Roughly like a coredump is today.  We know
what a core dump does and we don't care how it does it.

> * First, much of the work required to restore the state of a process
> as well as the state of its resources, requires kernel interfaces that
> are lower than the ones available to user-space. Working in user-space
> will require that we design new complex interfaces for this purpose only.

Yes.

> * Second, much of the state that needs to be saved was not, is not, and
> should probably never be exported to user-space (e.g. interval socket
> buffers, t->did_exec and many more). It is only accessible to kernel
> code, so an in-kernel module (for checkpoint/restart) makes sense. It is
> that sort of internals that may (and will) change as the kernel evolves
> - precisely because it is not visible to user-space and not bound to it.

No.  If the state can be inferred from user space it is visible to user
space.  However there is state visible to user space like did_exec
that is not directly manipulatable by user space.

In the worst case today we can restore a checkpoint by replaying all of
the user space actions that took us to get there.  That is a tedious
and slow approach.

> That said, we should still attempt to reuse existing kernel interfaces
> and mechanisms as much as possible to save - and restore - the state of
> processes, and prefer that over handcrafting special code. This is
> especially true for restart: in checkpoint one has to _capture_ the
> state by probing it in a passive manner; in contrast, in restart one
> has to actively _construct_ new resources and ensure that their state
> matches the saved state.

> For instance, it is possible to create the process tree in a container
> during restart from user-space reusing clone() (I'd argue that it's
> even better to do so from user-space). Likewise, it is possible to redo
> an open file by opening the file then using dup2() syscall to adjust
> the target fd if necessary. The two differ in that the latter allows
> to adjust the (so called) resources identifier to the desired value
> (because it is privately visible), while the former does not - it gives
> a globally visible identifier. And this is precisely why this thread
> had started: how to determine the resource identifier when requesting
> to allocate a resource (in this example, the pid).

The limiting factor to me appears to be live migration.

As I understand the concept.   Live migration is where you take you
first take a snapshot of a running system, and restore that snapshot
on another system and don't start it.

Then you repeat with an incremental snapshot and you do an incremental
restore.

Until ideally the changes are small and you can afford to have the
system paused for the amount of time it takes to transfer and restore
the last incremental snapshot.

I expect a design that allows for multiple cpus to work on the
checkpoint/restore paths and that allows for incremental and
thus live migration are going to be the limiting factors in a good
interface design.

Pushing the work into the kernel with an atomic syscall style approach
so that the normal people maintaining the kernel can have visibility
of the checkpoint/restore code paths and can do some of the work
seems healthy.


>> Having all this functionality in a signle syscall we specifically CLAIM a
> black box,
>> and that no one can use this interfaces for something different from
> checkpoint/restore.
>
> True, except for what can be done (and is easier to actually that way)
> in user space; the most obvious example being clone() and setsid() -
> which are a pain to adjust after the fact. In particular, everything
> that is private in the kernel now (un-exported) should remain that way,
> unless there is an (other) compelling reason to expose it.
> (see http://www.ncl.cs.columbia.edu/publications/usenix2007_fordist.pdf)

Yes.  A very good example of something that is user visible but should
not be user settable (except during restore) is the start time for a
monotonic clock.  If you allow someone to set it arbitrarily it is no
longer a monotonic clock and it looses all value.

So here is one suggestion:
sys_processtree_isolate(container_init_pid);
sys_checkpoint(old_dir_fd, new_dir_fd, container_init_pid);
sys_processtree_unisolate(container_init_pid);

container_init_pid = sys_restore(dir_fd, old_container_init_pid);
sys_processtree_unisolate(container_init_pid);

Then save the different components in different files.  And in the non-trivial cases
have tagged data in the files.  The tags allow us to have different data types
easily. 

For data that already have or should have an interface for
persistence.  Filesystems being the primary example we don't handle
this way.  Instead we use the already existing techniques to
backup/snapshot the data and then to restore the data.

Isolation should be used instead of freezing if we can, because
isolation is a much cheaper concept, and it allows for multi-machine
synchronized checkpoints.  On the big scale if I unplug a machines
network cable (isolating it) then I don't care when between the time I
isolate the machine and the time I plug the cable back in.  It all
looks the same to the outside world.  But just the comunications of
the machine were frozen.  Likewise for processes.  SIGSTOP is good
but I suspect it may be excessive, so denies us optimization
opportunities (at least in the interface).

So in summary we want an interface that allows for.
- Existing persistent kernel state to using the preexisting mechanisms.
- The persistence code to be looked after by the people working on the
  kernel subsystem.
- Migration between kernel versions, (possibly with user space intervention).
- Live migration 
- Minimal intervention of the processes being checkpointed.
- Denying persistence to applications that need it.
- Checkpoints coordinated between multiple containers or real
  machines.
- Documented in terms of isolation rather then freezing, as it is the
  isolation that is the important property. 
- Restricted access to user space visible but immutable state.
- Ideally sufficiently general that debuggers can take advantage
  of the checkpoints.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10  1:58             ` Eric W. Biederman
@ 2008-07-10 17:12               ` Dave Hansen
  2008-07-10 17:32                 ` Serge E. Hallyn
  2008-07-17 23:14                 ` Oren Laadan
  2008-07-17 23:09               ` Oren Laadan
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2008-07-10 17:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oren Laadan, Kirill Korotaev, containers, linux-kernel,
	Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan,
	Serge E. Hallyn

On Wed, 2008-07-09 at 18:58 -0700, Eric W. Biederman wrote:
> In the worst case today we can restore a checkpoint by replaying all of
> the user space actions that took us to get there.  That is a tedious
> and slow approach.

Yes, tedious and slow, *and* minimally invasive in the kernel.  Once we
have a tedious and slow process, we'll have some really good points when
we try to push the next set of patches to make it less slow and tedious.
We'll be able to describe an _actual_ set of problems to our fellow
kernel hackers.

So, the checkpoint-as-a-corefile idea sounds good to me, but it
definitely leaves a lot of questions about exactly how we'll need to do
the restore.

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 17:12               ` Dave Hansen
@ 2008-07-10 17:32                 ` Serge E. Hallyn
  2008-07-10 18:55                   ` Eric W. Biederman
  2008-07-17 23:16                   ` Oren Laadan
  2008-07-17 23:14                 ` Oren Laadan
  1 sibling, 2 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-10 17:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Oren Laadan, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan

Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> On Wed, 2008-07-09 at 18:58 -0700, Eric W. Biederman wrote:
> > In the worst case today we can restore a checkpoint by replaying all of
> > the user space actions that took us to get there.  That is a tedious
> > and slow approach.
> 
> Yes, tedious and slow, *and* minimally invasive in the kernel.  Once we
> have a tedious and slow process, we'll have some really good points when
> we try to push the next set of patches to make it less slow and tedious.
> We'll be able to describe an _actual_ set of problems to our fellow
> kernel hackers.
> 
> So, the checkpoint-as-a-corefile idea sounds good to me, but it
> definitely leaves a lot of questions about exactly how we'll need to do
> the restore.

Talking with Dave over irc, I kind of liked the idea of creating a new
fs/binfmt_cr.c that executes a checkpoint-as-a-coredump file.

One thing I do not like about the checkpoint-as-coredump is that it begs
us to dump all memory out into the file.  Our plan/hope was to save
ourselves from writing out most memory by:

	1. associating a separate swapfile with each container
	2. doing a swapfile snapshot at each checkpoint
	3. dumping the pte entries (/proc/self/)

If we do checkpoint-as-a-coredump, then we need userspace to coordinate
a kernel-generated coredump with a user-generated (?) swapfile snapshot.
But I guess we figure that out later.

-serge

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 17:32                 ` Serge E. Hallyn
@ 2008-07-10 18:55                   ` Eric W. Biederman
  2008-07-10 19:06                     ` Dave Hansen
  2008-07-17 23:16                   ` Oren Laadan
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10 18:55 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Oren Laadan, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan

"Serge E. Hallyn" <serue@us.ibm.com> writes:

>> So, the checkpoint-as-a-corefile idea sounds good to me, but it
>> definitely leaves a lot of questions about exactly how we'll need to do
>> the restore.
>
> Talking with Dave over irc, I kind of liked the idea of creating a new
> fs/binfmt_cr.c that executes a checkpoint-as-a-coredump file.
>
> One thing I do not like about the checkpoint-as-coredump is that it begs
> us to dump all memory out into the file.  Our plan/hope was to save
> ourselves from writing out most memory by:
>
> 	1. associating a separate swapfile with each container
> 	2. doing a swapfile snapshot at each checkpoint
> 	3. dumping the pte entries (/proc/self/)
>
> If we do checkpoint-as-a-coredump, then we need userspace to coordinate
> a kernel-generated coredump with a user-generated (?) swapfile snapshot.
> But I guess we figure that out later.

Well it is a matter of which VMAs you dump.  For things that are file backed
you need to dump them.

I don't know that even a binfmt for per process level checkpoints is sufficient
but I do know having something of that granularity looks much easier.  Otherwise
it takes a bazillian little syscalls to do things no one else is interested in doing.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 18:55                   ` Eric W. Biederman
@ 2008-07-10 19:06                     ` Dave Hansen
  2008-07-10 19:21                       ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-07-10 19:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Oren Laadan, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan

On Thu, 2008-07-10 at 11:55 -0700, Eric W. Biederman wrote:
> 
> > If we do checkpoint-as-a-coredump, then we need userspace to coordinate
> > a kernel-generated coredump with a user-generated (?) swapfile snapshot.
> > But I guess we figure that out later.
> 
> Well it is a matter of which VMAs you dump.  For things that are file backed
> you need to dump them.

Are we talking about the VMA itself, or the memory backing the VMA?

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 19:06                     ` Dave Hansen
@ 2008-07-10 19:21                       ` Eric W. Biederman
  2008-07-10 19:47                         ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10 19:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Oren Laadan, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan

Dave Hansen <dave@linux.vnet.ibm.com> writes:

> On Thu, 2008-07-10 at 11:55 -0700, Eric W. Biederman wrote:
>> 
>> > If we do checkpoint-as-a-coredump, then we need userspace to coordinate
>> > a kernel-generated coredump with a user-generated (?) swapfile snapshot.
>> > But I guess we figure that out later.
>> 
>> Well it is a matter of which VMAs you dump.  For things that are file backed
>> you need to dump them.

For things that are file backed you DO NOT need to dump them.  (sorry typo).

> Are we talking about the VMA itself, or the memory backing the VMA?

The memory backing the VMA.  We need to store the page protections
that the memory was mapped with as well now that you point it out.  A
VMA is not user space visible, which is why we can arbitrarily split
and merge VMAs.

Eric



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 19:21                       ` Eric W. Biederman
@ 2008-07-10 19:47                         ` Dave Hansen
  2008-07-11  0:32                           ` Alexey Dobriyan
  2008-07-17 23:19                           ` Oren Laadan
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2008-07-10 19:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Oren Laadan, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan

On Thu, 2008-07-10 at 12:21 -0700, Eric W. Biederman wrote:
> > Are we talking about the VMA itself, or the memory backing the VMA?
> 
> The memory backing the VMA.  We need to store the page protections
> that the memory was mapped with as well now that you point it out.  A
> VMA is not user space visible, which is why we can arbitrarily split
> and merge VMAs.

It is visible with /proc/$pid/{maps,smaps,numamaps?}.  That's the only
efficient way I know of from userspace to figure out where userspace's
memory is and if it *is* file backed, and what the permissions are.

We also can't truly split them arbitrarily because the memory cost of
the VMAs themselves might become too heavy (the remap_file_pages()
problem).

It gets trickier when things are also private mappings in addition to
being in a file-backed VMA.  We *do* need to checkpoint those, but only
the pages to which there was a write.

There's also the problem of restoring things read-only, but VM_MAYWRITE.
If there's a high level of (non-COW'd) sharing of these anonymous areas,
we may not be able to even restore the set of processes unless we can
replicate the sharing.  We might run out of memory if the sharing isn't
replicated.

Memory is fun! :)

-- Dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 19:47                         ` Dave Hansen
@ 2008-07-11  0:32                           ` Alexey Dobriyan
  2008-07-17 23:19                           ` Oren Laadan
  1 sibling, 0 replies; 32+ messages in thread
From: Alexey Dobriyan @ 2008-07-11  0:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Serge E. Hallyn, Oren Laadan, Kirill Korotaev,
	containers, linux-kernel, Nadia.Derbey, Andrew Morton, nick

On Thu, Jul 10, 2008 at 12:47:32PM -0700, Dave Hansen wrote:
> On Thu, 2008-07-10 at 12:21 -0700, Eric W. Biederman wrote:
> > > Are we talking about the VMA itself, or the memory backing the VMA?
> > 
> > The memory backing the VMA.  We need to store the page protections
> > that the memory was mapped with as well now that you point it out.  A
> > VMA is not user space visible, which is why we can arbitrarily split
> > and merge VMAs.
> 
> It is visible with /proc/$pid/{maps,smaps,numamaps?}.  That's the only
> efficient way I know of from userspace to figure out where userspace's
> memory is and if it *is* file backed, and what the permissions are.

None of those files exist if PROC_FS is off.

> We also can't truly split them arbitrarily because the memory cost of
> the VMAs themselves might become too heavy (the remap_file_pages()
> problem).
> 
> It gets trickier when things are also private mappings in addition to
> being in a file-backed VMA.  We *do* need to checkpoint those, but only
> the pages to which there was a write.
> 
> There's also the problem of restoring things read-only, but VM_MAYWRITE.
> If there's a high level of (non-COW'd) sharing of these anonymous areas,
> we may not be able to even restore the set of processes unless we can
> replicate the sharing.  We might run out of memory if the sharing isn't
> replicated.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 -  Object creation with a specified id)
  2008-07-10  1:58             ` Eric W. Biederman
  2008-07-10 17:12               ` Dave Hansen
@ 2008-07-17 23:09               ` Oren Laadan
  1 sibling, 0 replies; 32+ messages in thread
From: Oren Laadan @ 2008-07-17 23:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Korotaev, containers, linux-kernel, Dave Hansen,
	Alexey Dobriyan, Andrew Morton, nick, Nadia.Derbey, Serge Hallyn



Eric W. Biederman wrote:
> Oren Laadan <orenl@cs.columbia.edu> writes:
> 
>>> Consider more intimate kernel states like:
>>> a. task statistics
>>> b. task start time
>>> c. load average
>>> d. skb state and it's data.
>>> e. mount tree.
>>>
>>> If you think over, e.g. (b) is a bad thing. It was used to be accounted in
>> jiffies, then in timespec.
>>> (a) is another example of dataset which we can't predict. task statistics
>> change over a time.
>>> Why bother with such intimate data in user-space at all?
>>> Why the hell user-space should know about it and be ABLE to modify it?
>> Agreed.
> 
> Almost agreed.  The reason we care is that the data is visible to user
> space in some form.  So we need to save it, but hopefully not in it's internal
> kernel representation.  If we don't care at all we should not save it.

yes, we probably prefer an intermediate representation to some extent, mainly
because we (a) don't want garbage data that we don't use and (b) to make the
task of converting from old to new kernel version easier.

But only for these two reasons. I see zero value in avoiding representation
used internally by the kernel for the sake of avoiding it. In fact, if it's
that much easier to stick to that representation in a specific case -- so be
it. There shall be userland utilities that will be able to inspect the contents
(like those that can inspect the contents of internal file system data).

>>> My personal vision is that:
>>> 1. user space must initialize checkpointing/restore state via some system
>> call,
>>>    supply file descriptor from where data can be read/written to.
>>> 2. must call the syscall asking kernel to restore/save different subsytems one
>> by one.
>>> 3. finalize cpt/restore state via the syscall
>>> But user-space MUST NOT bother about data content. At least not about the data
>> supplied by the kernel.
>>> It can add additional sections if needed, e.g. about iptables state.
>> I mostly agree with the vision that checkpoint/restart is probably best
>> implemented as a black box:
> 
> I would claim an atomic unit.  Roughly like a coredump is today.  We know
> what a core dump does and we don't care how it does it.
> 
>> * First, much of the work required to restore the state of a process
>> as well as the state of its resources, requires kernel interfaces that
>> are lower than the ones available to user-space. Working in user-space
>> will require that we design new complex interfaces for this purpose only.
> 
> Yes.
> 
>> * Second, much of the state that needs to be saved was not, is not, and
>> should probably never be exported to user-space (e.g. interval socket
>> buffers, t->did_exec and many more). It is only accessible to kernel
>> code, so an in-kernel module (for checkpoint/restart) makes sense. It is
>> that sort of internals that may (and will) change as the kernel evolves
>> - precisely because it is not visible to user-space and not bound to it.
> 
> No.  If the state can be inferred from user space it is visible to user
> space.  However there is state visible to user space like did_exec
> that is not directly manipulatable by user space.

Not all state can be inferred; and the point is that I don't want to need
to chase the kernel devs and add such interfaces every time some state is
added.

> 
> In the worst case today we can restore a checkpoint by replaying all of
> the user space actions that took us to get there.  That is a tedious
> and slow approach.

ugh ... not only tedious - but not entirely correct: deterministic replay is
a very complicated problem (but thrilling) !  and you'll need to log/record
everything during the execution :(

> 
>> That said, we should still attempt to reuse existing kernel interfaces
>> and mechanisms as much as possible to save - and restore - the state of
>> processes, and prefer that over handcrafting special code. This is
>> especially true for restart: in checkpoint one has to _capture_ the
>> state by probing it in a passive manner; in contrast, in restart one
>> has to actively _construct_ new resources and ensure that their state
>> matches the saved state.
> 
>> For instance, it is possible to create the process tree in a container
>> during restart from user-space reusing clone() (I'd argue that it's
>> even better to do so from user-space). Likewise, it is possible to redo
>> an open file by opening the file then using dup2() syscall to adjust
>> the target fd if necessary. The two differ in that the latter allows
>> to adjust the (so called) resources identifier to the desired value
>> (because it is privately visible), while the former does not - it gives
>> a globally visible identifier. And this is precisely why this thread
>> had started: how to determine the resource identifier when requesting
>> to allocate a resource (in this example, the pid).
> 
> The limiting factor to me appears to be live migration.

limiting in what sense ?

> As I understand the concept.   Live migration is where you take you
> first take a snapshot of a running system, and restore that snapshot
> on another system and don't start it.
> 
> Then you repeat with an incremental snapshot and you do an incremental
> restore.
> 
> Until ideally the changes are small and you can afford to have the
> system paused for the amount of time it takes to transfer and restore
> the last incremental snapshot.

yes;  in practice, you mostly care about memory contents that have
changed in the interim, as saving/transfering memory is by far the
most time consuming component of a checkpoint/migration.

So you must track which memory pages were modified between successive
attempts to complete the migration. As for the rest - you can either
checkpoint all the rest during the final snapshot, or save all of them
at the beginning and track changes to them too.

> 
> I expect a design that allows for multiple cpus to work on the
> checkpoint/restore paths and that allows for incremental and
> thus live migration are going to be the limiting factors in a good
> interface design.

These two are important factors in my reasoning to use in kernel
implementation. Such implementation should allow concurrency (e.g.
using kernel threads), as well as the required atomicity to track
the changes between successive iterations.

> Pushing the work into the kernel with an atomic syscall style approach
> so that the normal people maintaining the kernel can have visibility
> of the checkpoint/restore code paths and can do some of the work
> seems healthy.

So now it seems that we agree on that a kernel approach is suitable.

> 
>>> Having all this functionality in a signle syscall we specifically CLAIM a
>> black box,
>>> and that no one can use this interfaces for something different from
>> checkpoint/restore.
>>
>> True, except for what can be done (and is easier to actually that way)
>> in user space; the most obvious example being clone() and setsid() -
>> which are a pain to adjust after the fact. In particular, everything
>> that is private in the kernel now (un-exported) should remain that way,
>> unless there is an (other) compelling reason to expose it.
>> (see http://www.ncl.cs.columbia.edu/publications/usenix2007_fordist.pdf)
> 
> Yes.  A very good example of something that is user visible but should
> not be user settable (except during restore) is the start time for a
> monotonic clock.  If you allow someone to set it arbitrarily it is no
> longer a monotonic clock and it looses all value.
> 
> So here is one suggestion:
> sys_processtree_isolate(container_init_pid);
> sys_checkpoint(old_dir_fd, new_dir_fd, container_init_pid);
> sys_processtree_unisolate(container_init_pid);
> 
> container_init_pid = sys_restore(dir_fd, old_container_init_pid);
> sys_processtree_unisolate(container_init_pid);

what do you mean by "isolate" and "unisolate" ?

> 
> Then save the different components in different files.  And in the non-trivial cases
> have tagged data in the files.  The tags allow us to have different data types
> easily. 
> 

why different files ?  why not a single file, streamed - beneficial
for live migration, for example, or pass the data through filters
(encryption, signature, conversion between kernel versions, etc).

> For data that already have or should have an interface for
> persistence.  Filesystems being the primary example we don't handle
> this way.  Instead we use the already existing techniques to
> backup/snapshot the data and then to restore the data.
> 
> Isolation should be used instead of freezing if we can, because
> isolation is a much cheaper concept, and it allows for multi-machine
> synchronized checkpoints.  On the big scale if I unplug a machines

if you let the processes in a container to proceed execution (even
if isolated) during a checkpoint, you will be unable to capture a
state that is consistent among all processes in that container. So
you must freeze them somehow.

> network cable (isolating it) then I don't care when between the time I
> isolate the machine and the time I plug the cable back in.  It all
> looks the same to the outside world.  But just the comunications of
> the machine were frozen.  Likewise for processes.  SIGSTOP is good

I am not aware of any simple way that can prevent any sort of interaction
between processes (by simple, I mean like unplugging a cable). For
example, how would you prevent shared memory alterations ?  signals that
are raised due to async IO completion ? and many more examples.

> but I suspect it may be excessive, so denies us optimization
> opportunities (at least in the interface).
> 


> So in summary we want an interface that allows for.
> - Existing persistent kernel state to using the preexisting mechanisms.
> - The persistence code to be looked after by the people working on the
>   kernel subsystem.
> - Migration between kernel versions, (possibly with user space intervention).
> - Live migration 
> - Minimal intervention of the processes being checkpointed.
> - Denying persistence to applications that need it.
> - Checkpoints coordinated between multiple containers or real
>   machines.

you mean distributed checkpoint/restart ?  not an easy job either, and
should not, IMHO, be done at kernel level. Perhaps you'll find this
paper interesting:
http://www.ncl.cs.columbia.edu/publications/cluster2005_fordist.pdf

> - Documented in terms of isolation rather then freezing, as it is the
>   isolation that is the important property. 

can you please define "isolation" in this context ?  "freezing" is very
simple: the processes cannot do anything while they are checkpointed.
(in the case of live migration, they aren't frozen except for very short
time - aka downtime).

> - Restricted access to user space visible but immutable state.

> - Ideally sufficiently general that debuggers can take advantage
>   of the checkpoints.

This last point is interesting and important, but not high priority.
It is always possible to take a checkpoint image and - in user space -
convert it to a format that is readable by debuggers. So it should not
be a guiding factor in designing the checkpoint/restart mechanism.

Oren.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 17:12               ` Dave Hansen
  2008-07-10 17:32                 ` Serge E. Hallyn
@ 2008-07-17 23:14                 ` Oren Laadan
  1 sibling, 0 replies; 32+ messages in thread
From: Oren Laadan @ 2008-07-17 23:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Kirill Korotaev, containers, linux-kernel,
	Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan,
	Serge E. Hallyn



Dave Hansen wrote:
> On Wed, 2008-07-09 at 18:58 -0700, Eric W. Biederman wrote:
>> In the worst case today we can restore a checkpoint by replaying all of
>> the user space actions that took us to get there.  That is a tedious
>> and slow approach.
> 
> Yes, tedious and slow, *and* minimally invasive in the kernel.  Once we
> have a tedious and slow process, we'll have some really good points when

"Replaying all of the user space actions that took us to get there" -
this task is not even always possible without deterministic log/replay
mechanism :(

Oren.

> we try to push the next set of patches to make it less slow and tedious.
> We'll be able to describe an _actual_ set of problems to our fellow
> kernel hackers.
> 
> So, the checkpoint-as-a-corefile idea sounds good to me, but it
> definitely leaves a lot of questions about exactly how we'll need to do
> the restore.
> 
> -- Dave
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 17:32                 ` Serge E. Hallyn
  2008-07-10 18:55                   ` Eric W. Biederman
@ 2008-07-17 23:16                   ` Oren Laadan
  2008-07-18 16:18                     ` Serge E. Hallyn
  1 sibling, 1 reply; 32+ messages in thread
From: Oren Laadan @ 2008-07-17 23:16 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Eric W. Biederman, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan



Serge E. Hallyn wrote:
> Quoting Dave Hansen (dave@linux.vnet.ibm.com):
>> On Wed, 2008-07-09 at 18:58 -0700, Eric W. Biederman wrote:
>>> In the worst case today we can restore a checkpoint by replaying all of
>>> the user space actions that took us to get there.  That is a tedious
>>> and slow approach.
>> Yes, tedious and slow, *and* minimally invasive in the kernel.  Once we
>> have a tedious and slow process, we'll have some really good points when
>> we try to push the next set of patches to make it less slow and tedious.
>> We'll be able to describe an _actual_ set of problems to our fellow
>> kernel hackers.
>>
>> So, the checkpoint-as-a-corefile idea sounds good to me, but it
>> definitely leaves a lot of questions about exactly how we'll need to do
>> the restore.
> 
> Talking with Dave over irc, I kind of liked the idea of creating a new
> fs/binfmt_cr.c that executes a checkpoint-as-a-coredump file.
> 
> One thing I do not like about the checkpoint-as-coredump is that it begs
> us to dump all memory out into the file.  Our plan/hope was to save
> ourselves from writing out most memory by:
> 
> 	1. associating a separate swapfile with each container
> 	2. doing a swapfile snapshot at each checkpoint
> 	3. dumping the pte entries (/proc/self/)
> 
> If we do checkpoint-as-a-coredump, then we need userspace to coordinate
> a kernel-generated coredump with a user-generated (?) swapfile snapshot.
> But I guess we figure that out later.

I'm not sure how this approach integrates with (a) live migration (and
the iterative process of sending over memory modified since previous
iteration), and (b) incremental checkpoint (where except for the first
snapshot, additional snapshots only save what changed since the previous
one).

Oren.

> 
> -serge

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-10 19:47                         ` Dave Hansen
  2008-07-11  0:32                           ` Alexey Dobriyan
@ 2008-07-17 23:19                           ` Oren Laadan
  1 sibling, 0 replies; 32+ messages in thread
From: Oren Laadan @ 2008-07-17 23:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Serge E. Hallyn, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan



Dave Hansen wrote:
> On Thu, 2008-07-10 at 12:21 -0700, Eric W. Biederman wrote:
>>> Are we talking about the VMA itself, or the memory backing the VMA?
>> The memory backing the VMA.  We need to store the page protections
>> that the memory was mapped with as well now that you point it out.  A
>> VMA is not user space visible, which is why we can arbitrarily split
>> and merge VMAs.
> 
> It is visible with /proc/$pid/{maps,smaps,numamaps?}.  That's the only
> efficient way I know of from userspace to figure out where userspace's
> memory is and if it *is* file backed, and what the permissions are.
> 
> We also can't truly split them arbitrarily because the memory cost of
> the VMAs themselves might become too heavy (the remap_file_pages()
> problem).
> 
> It gets trickier when things are also private mappings in addition to
> being in a file-backed VMA.  We *do* need to checkpoint those, but only
> the pages to which there was a write.
> 
> There's also the problem of restoring things read-only, but VM_MAYWRITE.
> If there's a high level of (non-COW'd) sharing of these anonymous areas,
> we may not be able to even restore the set of processes unless we can
> replicate the sharing.  We might run out of memory if the sharing isn't
> replicated.
> 
> Memory is fun! :)

Heh .. and it can get much worse.  By all means, the functions that analyze
and save the VMAs during checkpoint and later reconstruct them during restart
are the most complicated logic. The good news, however, is that it works :)

Oren.

> 
> -- Dave
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id)
  2008-07-17 23:16                   ` Oren Laadan
@ 2008-07-18 16:18                     ` Serge E. Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-18 16:18 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, Eric W. Biederman, Kirill Korotaev, containers,
	linux-kernel, Nadia.Derbey, Andrew Morton, nick, Alexey Dobriyan

[-- Attachment #1: Type: text/plain, Size: 3647 bytes --]

Quoting Oren Laadan (orenl@cs.columbia.edu):
>
>
> Serge E. Hallyn wrote:
>> Quoting Dave Hansen (dave@linux.vnet.ibm.com):
>>> On Wed, 2008-07-09 at 18:58 -0700, Eric W. Biederman wrote:
>>>> In the worst case today we can restore a checkpoint by replaying all of
>>>> the user space actions that took us to get there.  That is a tedious
>>>> and slow approach.
>>> Yes, tedious and slow, *and* minimally invasive in the kernel.  Once we
>>> have a tedious and slow process, we'll have some really good points when
>>> we try to push the next set of patches to make it less slow and tedious.
>>> We'll be able to describe an _actual_ set of problems to our fellow
>>> kernel hackers.
>>>
>>> So, the checkpoint-as-a-corefile idea sounds good to me, but it
>>> definitely leaves a lot of questions about exactly how we'll need to do
>>> the restore.
>>
>> Talking with Dave over irc, I kind of liked the idea of creating a new
>> fs/binfmt_cr.c that executes a checkpoint-as-a-coredump file.
>>
>> One thing I do not like about the checkpoint-as-coredump is that it begs
>> us to dump all memory out into the file.  Our plan/hope was to save
>> ourselves from writing out most memory by:
>>
>> 	1. associating a separate swapfile with each container
>> 	2. doing a swapfile snapshot at each checkpoint
>> 	3. dumping the pte entries (/proc/self/)
>>
>> If we do checkpoint-as-a-coredump, then we need userspace to coordinate
>> a kernel-generated coredump with a user-generated (?) swapfile snapshot.
>> But I guess we figure that out later.
>
> I'm not sure how this approach integrates with (a) live migration (and
> the iterative process of sending over memory modified since previous
> iteration), and (b) incremental checkpoint (where except for the first
> snapshot, additional snapshots only save what changed since the previous
> one).

Oh, well I was seeing them as pretty orthogonal actually.  The reason is
that my checkpoint-as-a-coredump file would NOT include memory contents.
I'm still hoping that we can lock a container into its own nfs-mounted
swapfile, and take a snapshot of that at checkpoint.  Let the snapshot
solution take care of the incremental snapshot.

Attached are one kernel and one cryo patch, purely for a toy
implementation.  It doesn't even *quite* work yet.  Here are the notes
I wrote when I put it down wednesday afternoon:

===============================================================

The kernel patch impelments (only for x86_32) sys_checkpoint.  You
can create a checkpoint with the following test program:

#include <unistd.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <errno.h>

int main()
{
        int ret;

        ret = syscall(327, "output.kckpt");
        printf("ret was %d\n", ret);
        if (ret == -1)
                perror("checkpoint");
        return 0;
}

Make output.kckpt executable and run it.  Works fine.

With the cryo patch, you can do:

SHELL 1:
./tests/sleep
SHELL 2:
./cr -p <sleeppid> -f o1.bin

and you'll see <sleeppid.kckpt> created.  Make that
executable and run it, and you'll see it runs as though
it were the original sleep program.

On the other hand, two problems with using cryo:

        1. the checkpointed app segfaults

        2. if you restart with "cr -r -f o1.bin" it
           fails somewhere.

===============================================================

Matt has already pointed out some coding errors so *really* it's
just to show one way we could integrate a binary format handler
for partial restart with more userspace control.  I'd be too
embarassed to send it out, but figure I should send it out before
the mini-summit.

-serge

[-- Attachment #2: 0001-checkpoint-add-sys_checkpoint-and-binfmt_cr.c.patch --]
[-- Type: text/x-diff, Size: 12578 bytes --]

>From 04800bea92f56c15eb1a1de35de9a54613b5eb9c Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serge@us.ibm.com>
Date: Wed, 16 Jul 2008 13:38:44 -0500
Subject: [PATCH 1/1] checkpoint: add sys_checkpoint() and binfmt_cr.c

Add a do_checkpoint syscall (only for x86_32 right now).  The intent is to
dump process data which isn't userspace-accessible yet.

Introduce fs/binfmt_cr, which executes checkpoint files.  At the moment
all it does is execute the original file using its default binary
handler, and resets tsk->mm->arg_start and tsk->did_exec.

Since binfmt_cr only does part of the necessary restart operations, userspace
will need to do the rest.

Cryo, for instance, will cause the new process to execute this task, then be
ptraced to allow the rest of the restore to take place.

Signed-off-by: Serge Hallyn <serge@us.ibm.com>
---
 arch/x86/kernel/process_32.c       |   15 +++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 fs/Kconfig.binfmt                  |    7 +++
 fs/Makefile                        |    3 +-
 fs/binfmt_cr.c                     |  100 ++++++++++++++++++++++++++++++++++++
 fs/checkpoint.c                    |   79 ++++++++++++++++++++++++++++
 fs/exec.c                          |   21 ++++++++
 include/asm-x86/unistd_32.h        |    1 +
 include/linux/binfmts.h            |    1 +
 include/linux/checkpoint.h         |    5 ++
 include/linux/sched.h              |    2 +
 include/linux/syscalls.h           |    1 +
 kernel/sys_ni.c                    |    2 +
 13 files changed, 237 insertions(+), 1 deletions(-)
 create mode 100644 fs/binfmt_cr.c
 create mode 100644 fs/checkpoint.c
 create mode 100644 include/linux/checkpoint.h

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index e2db9ac..fd55fec 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -767,3 +767,18 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
 	unsigned long range_end = mm->brk + 0x02000000;
 	return randomize_range(mm->brk, range_end, 0) ? : mm->brk;
 }
+
+asmlinkage int sys_checkpoint(struct pt_regs regs)
+{
+	int error;
+	char *filename;
+
+	filename = getname((char __user *) regs.bx);
+	error = PTR_ERR(filename);
+	if (IS_ERR(filename))
+		goto out;
+	error = do_checkpoint(filename, &regs);
+	putname(filename);
+out:
+	return error;
+}
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index adff556..8195a31 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -326,3 +326,4 @@ ENTRY(sys_call_table)
 	.long sys_fallocate
 	.long sys_timerfd_settime	/* 325 */
 	.long sys_timerfd_gettime
+	.long sys_checkpoint
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 3263084..cb0d19b 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -137,3 +137,10 @@ config BINFMT_MISC
 	  You may say M here for module support and later load the module when
 	  you have use for it; the module is called binfmt_misc. If you
 	  don't know what to answer at this point, say Y.
+
+config BINFMT_CR
+	tristate "Kernel support for executing checkpoint files"
+	default n
+	---help---
+	  Checkpoint files (created using sys_checkpoint) can be executed
+	  as though they were binaries using this binary format handler.
diff --git a/fs/Makefile b/fs/Makefile
index 1e7a11b..9230fac 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o
+		stack.o checkpoint.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
@@ -34,6 +34,7 @@ obj-y				+= $(nfsd-y) $(nfsd-m)
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
 obj-$(CONFIG_BINFMT_EM86)	+= binfmt_em86.o
 obj-$(CONFIG_BINFMT_MISC)	+= binfmt_misc.o
+obj-$(CONFIG_BINFMT_CR)		+= binfmt_cr.o
 
 # binfmt_script is always there
 obj-y				+= binfmt_script.o
diff --git a/fs/binfmt_cr.c b/fs/binfmt_cr.c
new file mode 100644
index 0000000..8a0e173
--- /dev/null
+++ b/fs/binfmt_cr.c
@@ -0,0 +1,100 @@
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/stat.h>
+#include <linux/slab.h>
+#include <linux/binfmts.h>
+#include <linux/init.h>
+#include <linux/file.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+
+/*
+ * The pathname is likely to quickly overrun bprm->buf.
+ * I'll need to read the first page of the file.
+ */
+static int load_checkpoint(struct linux_binprm *bprm,struct pt_regs *regs)
+{
+	unsigned long arg_start;
+	short did_exec;
+	char *cp;
+	struct file *file;
+	int retval;
+
+	cp = bprm->buf;
+	if (memcmp(cp, CKPT_ID, strlen(CKPT_ID)))
+		return -ENOEXEC;
+	cp += strlen(CKPT_ID) + 1;
+	printk(KERN_NOTICE "%s: checking version\n", __func__);
+	if (memcmp(cp, CKPT_VERSION, strlen(CKPT_VERSION)))
+		return -EINVAL;
+	/* Grab the pathname of the original, checkpointed executable */
+	cp += strlen(CKPT_VERSION) + 1;
+	if (*cp == ' ') {
+		printk(KERN_NOTICE "Serge: bump by 1\n");
+		cp++;
+	}
+	printk(KERN_NOTICE "%s: reading arg_start\n", __func__);
+	retval = sscanf(cp, "%lu", &arg_start);
+	if (retval != 1)
+		return -EINVAL;
+	printk(KERN_NOTICE "%s: arg_start was %lu\n", __func__, arg_start);
+	printk(KERN_NOTICE "%s: moving cp to did_exec\n", __func__);
+	while (*(++cp) != ' ' && (cp-bprm->buf < BINPRM_BUF_SIZE));
+	cp++;
+	if (cp-bprm->buf >= BINPRM_BUF_SIZE)
+		return -EINVAL;
+	printk(KERN_NOTICE "%s: reading did_exec (cp is %s)\n", __func__, cp);
+	retval = sscanf(cp, "%hu", &did_exec);
+	if (retval != 1)
+		return -EINVAL;
+	printk(KERN_NOTICE "%s: did_exec was %hu\n", __func__, did_exec);
+	printk(KERN_NOTICE "%s: moving cp to fname\n", __func__);
+	while (*(++cp) != ' ' && (cp-bprm->buf < BINPRM_BUF_SIZE));
+	cp++;
+	if (cp-bprm->buf >= BINPRM_BUF_SIZE)
+		return -EINVAL;
+	/*
+	 * OK, now restart the process with the original executable's dentry.
+	 */
+	printk(KERN_NOTICE "%s: opening fname: %s\n", __func__, cp);
+	file = open_exec(cp);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	printk(KERN_NOTICE "%s: calling prepare_binprm %s\n", __func__, cp);
+	bprm->file = file;
+	retval = prepare_binprm(bprm);
+	if (retval < 0)
+		return retval;
+	retval = search_binary_handler(bprm,regs);
+	if (retval >= 0) {
+		/* execve success */
+		printk(KERN_NOTICE "%s: execve succeeded!\n", __func__);
+		current->mm->arg_start = arg_start;
+		current->did_exec = did_exec;
+	} else
+		printk(KERN_NOTICE "%s: execve failed with %d.\n", __func__, retval);
+	return retval;
+}
+
+static struct linux_binfmt cr_format = {
+	.module		= THIS_MODULE,
+	.load_binary	= load_checkpoint,
+};
+
+static int __init init_cr_binfmt(void)
+{
+	return register_binfmt(&cr_format);
+}
+
+static void __exit exit_cr_binfmt(void)
+{
+	unregister_binfmt(&cr_format);
+}
+
+core_initcall(init_cr_binfmt);
+module_exit(exit_cr_binfmt);
+MODULE_LICENSE("GPL");
diff --git a/fs/checkpoint.c b/fs/checkpoint.c
new file mode 100644
index 0000000..784f79a
--- /dev/null
+++ b/fs/checkpoint.c
@@ -0,0 +1,79 @@
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+
+int checkpoint_write(struct file *file, const void *addr, int nr)
+{
+	return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
+}
+
+#ifdef CONFIG_PROC_FS
+char *get_exe_name(char *buf, int buflen)
+{
+	struct file *f = current->mm->exe_file;
+	return dentry_path(f->f_dentry, buf, buflen);
+}
+#else
+char *get_exe_name(char *buf, int buflen)
+{
+	if (buflen < sizeof(current->comm))
+		return -ENAMETOOLONG;
+	return get_task_comm(buf, current);
+#endif
+
+/*
+ * Format of a checkpoint file.
+ *  Version 2008-07-14:
+ *   LX_CKPT2008-07-14
+ *   mm->arg_start (lu)
+ *   current->did_exec (hu)
+ *   filename
+ */
+int dump_checkpoint(struct file *file, struct pt_regs * regs)
+{
+	char buf[MMARGSTR];
+	char *exename, *sret;
+	size_t len;
+	mm_segment_t fs;
+	int retval = 0;
+
+	exename = kmalloc(4096, GFP_KERNEL);
+	if (IS_ERR(exename))
+		return -ENOMEM;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	retval = -EINVAL;
+	printk(KERN_NOTICE "%s: writing a dump file\n", __func__);
+	if (!checkpoint_write(file, CKPT_ID, sizeof(CKPT_ID)))
+		goto out_setfs;
+	printk(KERN_NOTICE "%s: wrote ckpt id\n", __func__);
+	if (!checkpoint_write(file, CKPT_VERSION, sizeof(CKPT_VERSION)))
+		goto out_setfs;
+	len = snprintf(buf, MMARGSTR, " %lu ", current->mm->arg_start);
+	if (!checkpoint_write(file, buf, len))
+		goto out_setfs;
+	len = snprintf(buf, MMARGSTR, "%hu ", current->did_exec);
+	if (!checkpoint_write(file, buf, len))
+		goto out_setfs;
+
+	sret = get_exe_name(exename, 4096);
+	if (IS_ERR(sret)) {
+		retval = PTR_ERR(sret);
+		goto out_setfs;
+	}
+	retval = 0;
+	if (!checkpoint_write(file, sret, strlen(sret)+1))
+		retval = -EINVAL;
+	printk(KERN_NOTICE "%s: returning %d\n", __func__, retval);
+out_setfs:
+	set_fs(fs);
+	kfree(exename);
+	return retval;
+}
diff --git a/fs/exec.c b/fs/exec.c
index fd92343..68ad85c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -51,6 +51,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1790,3 +1791,23 @@ fail_unlock:
 fail:
 	return retval;
 }
+
+int do_checkpoint(char *filename, struct pt_regs * regs)
+{
+	int retval = -EINVAL;
+	struct file * file;
+
+	printk(KERN_NOTICE "%s: called (filename %s)\n", __func__, filename);
+	file = filp_open(filename, O_CREAT|O_NOFOLLOW|O_WRONLY, 0600);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+	printk(KERN_NOTICE "%s: create went ok\n", __func__);
+	if (!file->f_op || !file->f_op->write)
+		goto close_fail;
+
+	retval = dump_checkpoint(file, regs);
+
+close_fail:
+	filp_close(file, NULL);
+	return retval;
+}
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index 8317d94..b367465 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -332,6 +332,7 @@
 #define __NR_fallocate		324
 #define __NR_timerfd_settime	325
 #define __NR_timerfd_gettime	326
+#define __NR_checkpoint		327
 
 #ifdef __KERNEL__
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index ee0ed48..3024e44 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -70,6 +70,7 @@ struct linux_binfmt {
 	int (*load_shlib)(struct file *);
 	int (*core_dump)(long signr, struct pt_regs *regs, struct file *file, unsigned long limit);
 	unsigned long min_coredump;	/* minimal dump size */
+	int (*checkpoint)(struct pt_regs *regs, struct file *file);
 	int hasvdso;
 };
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..5628f0e
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,5 @@
+#define CKPT_ID "LX_CKPT"
+#define CKPT_VERSION "2008-07-14"
+#define MMARGSTR 20
+
+int dump_checkpoint(struct file *file, struct pt_regs * regs);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d3f84..7098822 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1861,6 +1861,8 @@ extern int do_execve(char *, char __user * __user *, char __user * __user *, str
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 
+extern int do_checkpoint(char *, struct pt_regs *);
+
 extern void set_task_comm(struct task_struct *tsk, char *from);
 extern char *get_task_comm(char *to, struct task_struct *tsk);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0522f36..f08877d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -617,5 +617,6 @@ asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
+asmlinkage long sys_checkpoint(const char *filename);
 
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5b9b467..62dcdaa 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -161,3 +161,5 @@ cond_syscall(sys_timerfd_gettime);
 cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
+
+cond_syscall(sys_checkpoint);
-- 
1.5.4.3


[-- Attachment #3: 0001-cryo--sys_checkpoint-first-attempt-at-exploiting.patch --]
[-- Type: text/x-diff, Size: 3190 bytes --]

>From 3b5b5488e4c25a1fff223c9b98883f17af4a40e2 Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serue@us.ibm.com>
Date: Wed, 16 Jul 2008 14:33:31 -0400
Subject: [PATCH 1/1] sys_checkpoint: first attempt at exploiting

First attempt at using the sys_checkpoint and binfmt_cr.c
functionality.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
 cr.c  |   13 +++++++++++--
 sci.h |    5 +++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/cr.c b/cr.c
index ffb7b0b..82ba813 100644
--- a/cr.c
+++ b/cr.c
@@ -932,6 +932,11 @@ static int save_process_data(pid_t pid, int fd, lh_list_t *ptree)
 		write_item(fd, "sigpend", &sigpend, sizeof(sigpend));
 	}
 
+	/* Write the ckpt image */
+	snprintf(fname, sizeof(fname), "%u.kckpt", syscallpid);
+	PT_CKPT(syscallpid, fname);
+	write_item(fd, "ckpt_file", fname, strlen(fname));
+
 	/* file descriptors */
 	write_item(fd, "FD", NULL, 0);
 	t_d(pi->nf);
@@ -1094,6 +1099,7 @@ pid_t restart_first_proc(int fd, char *exe, char *sargv, int la, char *senv,
 		WARN("set_proc_file(%d) errno=%d: %s\n", next_pid,
 			errno, strerror(errno));
 
+	DEBUG("executing the file %s with args %s\n", exe, sargv);
 	if ((pid = fork()) == 0) {
 		if (ptrace_traceme() == -1) exit(1);
 		close(fd);
@@ -1877,6 +1883,7 @@ static int process_restart(int fd, int mode)
 	void *buf = NULL;
 	int ret, la = 0, le = 0;
 	size_t bufsz;
+	char *ckpt_file = NULL;
 	lh_list_t *ptree = NULL, *pt;
 	lh_hash_t hpid;
 	pid_t *pid = NULL, *ppid = NULL, npid = 0;
@@ -1912,6 +1919,7 @@ static int process_restart(int fd, int mode)
 			Free(sigact);
 			Free(sigmask);
 			Free(sigpend);
+			Free(ckpt_file);
 		}
 
 		/* fillup process fields */
@@ -1919,6 +1927,7 @@ static int process_restart(int fd, int mode)
 		else ITEM_SET(ppid, pid_t);
 		else ITEM_SET(exitsig, int);
 		else ITEM_SET(exe, char);
+		else ITEM_SET(ckpt_file, char);
 		else ITEM_SET(cwd, char);
 		else ITEM_SET(regs, struct user_regs_struct);
 		else ITEM_SET(fpregs, struct user_fpregs_struct);
@@ -1931,13 +1940,13 @@ static int process_restart(int fd, int mode)
 		else if (ITEM_IS("FD")) {
 			/* all previous necessary fields ok, ready to fork */
 			if (! ptree) {
-				t_d(npid = restart_first_proc(fd, exe, sargv, la, senv, le, *pid));
+				t_d(npid = restart_first_proc(fd, ckpt_file ? ckpt_file : exe, sargv, la, senv, le, *pid));
 			} else {
 				lh_list_t *p = lh_hash_lookup(&hpid, (unsigned int)*ppid);
 				pid_t nppid = p ? (pid_t) p->data : 0;
 				
 				if (*exitsig == SIGCHLD)
-					t_d(npid = restart_proc(fd, nppid, exe, sargv, la, senv, le, *pid));
+					t_d(npid = restart_proc(fd, nppid, ckpt_file ? ckpt_file : exe, sargv, la, senv, le, *pid));
 				else
 					t_d(npid = restart_thread(nppid, *exitsig, regs->esp));
 			}
diff --git a/sci.h b/sci.h
index 0b32ae4..8717350 100644
--- a/sci.h
+++ b/sci.h
@@ -132,6 +132,11 @@ int call_func(pid_t pid, int scratch, int flag, int funcaddr, int argc, ...);
 			-3, 0, buf,		\
 			0, 0, n)
 
+#define SYS_ckpt 327
+#define PT_CKPT(p, path) \
+	ptrace_syscall(p, 0, 0, SYS_ckpt, 1,	\
+			STRLEN_PTR, 0, path)
+
 #define PT_LSEEK(p, fd, off, w) \
 	ptrace_syscall(p, 0, 0, SYS_lseek, 3,	\
 			0, 0, fd,		\
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2008-07-18 16:18 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-18  5:44 [PATCH 0/4] - v2 - Object creation with a specified id Nadia.Derbey
2008-04-18  5:45 ` [PATCH 1/4] - v2 - Provide a new procfs interface to set next id Nadia.Derbey
2008-04-18  5:45 ` [PATCH 2/4] - v2 - Provide a new procfs interface to set next upid nr(s) Nadia.Derbey
2008-04-18  5:45 ` [PATCH 3/4] - v2 - IPC: use the target ID specified in procfs Nadia.Derbey
2008-04-18  5:45 ` [PATCH 4/4] - v2 - PID: " Nadia.Derbey
2008-04-18 17:07 ` [PATCH 0/4] - v2 - Object creation with a specified id Dave Hansen
2008-04-21 11:32   ` Nadia Derbey
2008-04-22 19:36 ` Checkpoint/restart (was Re: [PATCH 0/4] - v2 - Object creation with a specified id) Alexey Dobriyan
2008-04-22 18:56   ` Dave Hansen
2008-04-22 19:51     ` Serge E. Hallyn
2008-04-22 21:01     ` Alexey Dobriyan
2008-04-22 22:56       ` Dave Hansen
2008-04-23  6:40         ` Kirill Korotaev
2008-04-23 15:33           ` Dave Hansen
2008-04-24  7:00             ` Kirill Korotaev
2008-04-24 18:30               ` Dave Hansen
2008-04-24 23:13                 ` Oren Laadan
2008-04-24  1:19           ` Oren Laadan
2008-07-10  1:58             ` Eric W. Biederman
2008-07-10 17:12               ` Dave Hansen
2008-07-10 17:32                 ` Serge E. Hallyn
2008-07-10 18:55                   ` Eric W. Biederman
2008-07-10 19:06                     ` Dave Hansen
2008-07-10 19:21                       ` Eric W. Biederman
2008-07-10 19:47                         ` Dave Hansen
2008-07-11  0:32                           ` Alexey Dobriyan
2008-07-17 23:19                           ` Oren Laadan
2008-07-17 23:16                   ` Oren Laadan
2008-07-18 16:18                     ` Serge E. Hallyn
2008-07-17 23:14                 ` Oren Laadan
2008-07-17 23:09               ` Oren Laadan
2008-04-23 14:23 ` [PATCH 0/4] - v2 - Object creation with a specified id Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).