linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/4] A few patches in a sake of c/r functionality
@ 2012-01-23 14:20 Cyrill Gorcunov
  2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
                   ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

Hi,

here are a few patches in a sake of c/r. Nothing really
serious except a new kcmp syscall I think. Please review,
comments are really appreciated!

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 14:20 [patch 0/4] A few patches in a sake of c/r functionality Cyrill Gorcunov
@ 2012-01-23 14:20 ` Cyrill Gorcunov
  2012-01-23 18:54   ` Kees Cook
                     ` (2 more replies)
  2012-01-23 14:20 ` [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4 Cyrill Gorcunov
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Cyrill Gorcunov

[-- Attachment #1: fs-proc-tid-children-12 --]
[-- Type: text/plain, Size: 8035 bytes --]

When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

v2:
 - Kame suggested to use a separated /proc/<pid>/children entry
   instead of poking /proc/<pid>/status
 - Andew suggested to use rcu facility instead of locking
   tasklist_lock
 - Tejun pointed that non-seekable seq file might not be
   enough for tasks with large number of children

v3:
 - To be on a safe side use %lu format for pid_t printing

v4:
 - New line get printed when sequence ends not at seq->stop,
   a nit pointed by Tejun
 - Documentation update
 - tasklist_lock is back, Oleg pointed that ->children list
   is actually not rcu-safe

v5:
 - Oleg suggested to make /proc/<pid>/task/<tid>/children
   instead of global /proc/<pid>/children, which eliminates
   hardness related to threads and children migration, and
   allows patch to be a way simplier.

v6:
 - Drop ptrace_may_access tests, pids are can be found anyway
   so nothing to protect here.
 - Update comments and docs, pointed by Oleg.

v7:
 - Use get_pid over proc-pid directly, to simplify
   code, pointed by Oleg.

v8:
 - Obtain a starting pid from the proc's inode directly.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/filesystems/proc.txt |   18 +++++
 fs/proc/array.c                    |  123 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    1 
 fs/proc/internal.h                 |    1 
 4 files changed, 143 insertions(+)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.7   /proc/<pid>/task/<tid>/children - Information about task children
 
   4	Configuring procfs
   4.1	Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
 comm value.
 
 
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results and some children might be
+skipped, especially if they've exited right after we printed their
+pids, so one need to either stop or freeze processes being inspected
+if precise results are needed.
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,126 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+static struct pid *
+get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(proc_pid(inode), PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			if (list_is_last(&task->sibling, &start->children))
+				goto out;
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			pid = get_pid(task_pid(task));
+			goto out;
+		}
+	}
+
+	/*
+	 * Slow search case.
+	 *
+	 * We might miss some children here if children
+	 * are exited while we were not holding the lock,
+	 * but it was never promised to be accurate that
+	 * much.
+	 *
+	 * "Just suppose that the parent sleeps, but N children
+	 *  exit after we printed their tids. Now the slow paths
+	 *  skips N extra children, we miss N tasks." (c)
+	 *
+	 * So one need to stop or freeze the leader and all
+	 * its children to get a precise result.
+	 */
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			pid = get_pid(task_pid(task));
+			break;
+		}
+	}
+
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct inode *inode = seq->private;
+	pid_t pid;
+
+	pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
+	return seq_printf(seq, " %d", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct pid *pid = NULL;
+
+	pid = get_children_pid(seq->private, v, *pos + 1);
+	if (!pid)
+		seq_printf(seq, "\n");
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open(file, &children_seq_ops);
+	if (ret)
+		return ret;
+
+	m = file->private_data;
+	m->private = inode;
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3384,6 +3384,7 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+	REG("children",  S_IRUGO, proc_tid_children_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-23 14:20 [patch 0/4] A few patches in a sake of c/r functionality Cyrill Gorcunov
  2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
@ 2012-01-23 14:20 ` Cyrill Gorcunov
  2012-01-23 18:48   ` H. Peter Anvin
  2012-01-24  2:16   ` KAMEZAWA Hiroyuki
  2012-01-23 14:20 ` [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
  2012-01-23 14:20 ` [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
  3 siblings, 2 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Cyrill Gorcunov, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Valdis.Kletnieks

[-- Attachment #1: sys-kcmp-7 --]
[-- Type: text/plain, Size: 7857 bytes --]

While doing the checkpoint-restore in the userspace one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are shared
between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparision' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 agruments - the pids of the two tasks (which
characteristics should be compared), the comparision type and
(in case of comparision of files) two file descriptors.

At moment only x86 is supported.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Glauber Costa <glommer@parallels.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tejun Heo <tj@kernel.org>
CC: Matt Helsley <matthltc@us.ibm.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Valdis.Kletnieks@vt.edu
---
 arch/x86/include/asm/kcmp.h      |   17 ++++
 arch/x86/include/asm/syscalls.h  |    4 
 arch/x86/kernel/Makefile         |    1 
 arch/x86/kernel/kcmp.c           |  163 +++++++++++++++++++++++++++++++++++++++
 arch/x86/syscalls/syscall_32.tbl |    1 
 arch/x86/syscalls/syscall_64.tbl |    1 
 6 files changed, 187 insertions(+)

Index: linux-2.6.git/arch/x86/include/asm/kcmp.h
===================================================================
--- /dev/null
+++ linux-2.6.git/arch/x86/include/asm/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparision type */
+enum {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
Index: linux-2.6.git/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/syscalls.h
+++ linux-2.6.git/arch/x86/include/asm/syscalls.h
@@ -42,6 +42,10 @@ long sys_sigaltstack(const stack_t __use
 asmlinkage int sys_set_thread_area(struct user_desc __user *);
 asmlinkage int sys_get_thread_area(struct user_desc __user *);
 
+/* kenrel/kcmp.c */
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
+
 /* X86_32 only */
 #ifdef CONFIG_X86_32
 
Index: linux-2.6.git/arch/x86/kernel/Makefile
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/Makefile
+++ linux-2.6.git/arch/x86/kernel/Makefile
@@ -34,6 +34,7 @@ obj-y			+= alternative.o i8253.o pci-nom
 obj-y			+= tsc.o io_delay.o rtc.o
 obj-y			+= pci-iommu_table.o
 obj-y			+= resource.o
+obj-y			+= kcmp.o
 
 obj-y				+= trampoline.o trampoline_$(BITS).o
 obj-y				+= process.o
Index: linux-2.6.git/arch/x86/kernel/kcmp.c
===================================================================
--- /dev/null
+++ linux-2.6.git/arch/x86/kernel/kcmp.c
@@ -0,0 +1,163 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+
+#include <asm/unistd.h>
+#include <asm/kcmp.h>
+
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v + cookies[type][0]) ^ cookies[type][1];
+}
+
+/*
+ * 0 - equal
+ * 1 - less than
+ * 2 - greater than
+ * 3 - not equal but ordering unavailable
+ */
+static int kcmp_ptr(long v1, long v2, int type)
+{
+	long ret;
+
+	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+#define KCMP_TASK_PTR(task1, task2, member, type)	\
+	kcmp_ptr((long)(task1)->member,			\
+		 (long)(task2)->member,			\
+		 type)
+
+#define KCMP_PTR(ptr1, ptr2, type)			\
+	kcmp_ptr((long)ptr1, (long)ptr2, type)
+
+/* A caller must be sure the task is presented in memory */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	spin_lock(&task->files->file_lock);
+	fdt = files_fdtable(task->files);
+	if (idx < fdt->max_fds)
+		file = fdt->fd[idx];
+	else
+		file = NULL;
+	spin_unlock(&task->files->file_lock);
+
+	return file;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1;
+	struct task_struct *task2;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	task1 = find_task_by_vpid(pid1);
+	if (!task1) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	task2 = find_task_by_vpid(pid2);
+	if (!task2) {
+		put_task_struct(task1);
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	/*
+	 * Note for all cases but the KCMP_FILE we
+	 * don't take any locks in a sake of speed.
+	 */
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = KCMP_PTR(filp1, filp2, KCMP_FILE);
+		else
+			ret = -ENOENT;
+		break;
+	}
+	case KCMP_VM:
+		ret = KCMP_TASK_PTR(task1, task2, mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = KCMP_TASK_PTR(task1, task2, files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = KCMP_TASK_PTR(task1, task2, fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = KCMP_TASK_PTR(task1, task2, sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = KCMP_TASK_PTR(task1, task2, io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = KCMP_TASK_PTR(task1, task2, sysvsem.undo_list, KCMP_SYSVSEM);
+#else
+		ret = -ENOENT;
+		goto err;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto err;
+	}
+
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+}
+
+static __init int kcmp_cookie_init(void)
+{
+	int i, j;
+
+	for (i = 0; i < KCMP_TYPES; i++) {
+		for (j = 0; j < 2; j++) {
+			get_random_bytes(&cookies[i][j],
+					 sizeof(cookies[i][j]));
+			cookies[i][j] |= (~(~0UL >>  1) | 1);
+		}
+	}
+
+	return 0;
+}
+late_initcall(kcmp_cookie_init);
Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-23 14:20 [patch 0/4] A few patches in a sake of c/r functionality Cyrill Gorcunov
  2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
  2012-01-23 14:20 ` [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4 Cyrill Gorcunov
@ 2012-01-23 14:20 ` Cyrill Gorcunov
  2012-01-23 20:42   ` Kees Cook
  2012-01-24 23:59   ` Andrew Morton
  2012-01-23 14:20 ` [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
  3 siblings, 2 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Cyrill Gorcunov, Vasiliy Kulikov

[-- Attachment #1: fs-proc-extend-state --]
[-- Type: text/plain, Size: 1871 bytes --]

We would like to have an ability to restore command line
arguments and envirion pointers so the task being restored
would print appropriate values in /proc/pid/cmdline and
/proc/pid/envirion. The exit_code is needed to restore
zombie tasks.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/array.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -464,7 +464,8 @@ static int do_task_stat(struct seq_file
 
 	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu \
+%lu %lu %lu %lu %d\n",
 		pid_nr_ns(pid, ns),
 		tcomm,
 		state,
@@ -514,7 +515,12 @@ static int do_task_stat(struct seq_file
 		cputime_to_clock_t(cgtime),
 		(mm && permitted) ? mm->start_data : 0,
 		(mm && permitted) ? mm->end_data : 0,
-		(mm && permitted) ? mm->start_brk : 0);
+		(mm && permitted) ? mm->start_brk : 0,
+		(mm && permitted) ? mm->arg_start : 0,
+		(mm && permitted) ? mm->arg_end : 0,
+		(mm && permitted) ? mm->env_start : 0,
+		(mm && permitted) ? mm->env_end : 0,
+		task->exit_code);
 	if (mm)
 		mmput(mm);
 	return 0;


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-23 14:20 [patch 0/4] A few patches in a sake of c/r functionality Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2012-01-23 14:20 ` [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
@ 2012-01-23 14:20 ` Cyrill Gorcunov
  2012-01-23 15:55   ` Cyrill Gorcunov
  3 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Cyrill Gorcunov, Michael Kerrisk,
	Vasiliy Kulikov

[-- Attachment #1: prctl-restore-mm-members-2 --]
[-- Type: text/plain, Size: 5062 bytes --]

After restore we would like the 'ps' command show the command
line and evironment exactly the same it was at checkpoint time.

So this additional PR_SET_MM_ allow us to do so. Note that
these members of mm_struct is rather used for output in
procfs, except auxv vector which is used by ld.so mostly.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/prctl.h |    5 +++
 kernel/sys.c          |   73 +++++++++++++++++++++++++++++++-------------------
 2 files changed, 51 insertions(+), 27 deletions(-)

Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -113,5 +113,10 @@
 # define PR_SET_MM_START_STACK		5
 # define PR_SET_MM_START_BRK		6
 # define PR_SET_MM_BRK			7
+# define PR_SET_MM_ARG_START		8
+# define PR_SET_MM_ARG_END		9
+# define PR_SET_MM_ENV_START		10
+# define PR_SET_MM_ENV_END		11
+# define PR_SET_MM_AUXV			12
 
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1693,17 +1693,25 @@ SYSCALL_DEFINE1(umask, int, mask)
 }
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
+static bool vma_flags_mismatch(struct vm_area_struct *vma,
+			       unsigned long required,
+			       unsigned long banned)
+{
+	return (vma->vm_flags & required) != required ||
+		(vma->vm_flags & banned);
+}
+
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	unsigned long rlim = rlimit(RLIMIT_DATA);
-	unsigned long vm_req_flags;
-	unsigned long vm_bad_flags;
 	struct vm_area_struct *vma;
 	int error = 0;
 	struct mm_struct *mm = current->mm;
 
-	if (arg4 | arg5)
+	if (arg4 && opt != PR_SET_MM_AUXV)
+		return -EINVAL;
+	else if (arg4 | arg5)
 		return -EINVAL;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -1715,7 +1723,9 @@ static int prctl_set_mm(int opt, unsigne
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, addr);
 
-	if (opt != PR_SET_MM_START_BRK && opt != PR_SET_MM_BRK) {
+	if (opt != PR_SET_MM_START_BRK &&
+	    opt != PR_SET_MM_BRK &&
+	    opt != PR_SET_MM_AUXV) {
 		/* It must be existing VMA */
 		if (!vma || vma->vm_start > addr)
 			goto out;
@@ -1725,11 +1735,8 @@ static int prctl_set_mm(int opt, unsigne
 	switch (opt) {
 	case PR_SET_MM_START_CODE:
 	case PR_SET_MM_END_CODE:
-		vm_req_flags = VM_READ | VM_EXEC;
-		vm_bad_flags = VM_WRITE | VM_MAYSHARE;
-
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
-		    (vma->vm_flags & vm_bad_flags))
+		if (vma_flags_mismatch(vma, VM_READ | VM_EXEC,
+				       VM_WRITE | VM_MAYSHARE))
 			goto out;
 
 		if (opt == PR_SET_MM_START_CODE)
@@ -1740,11 +1747,8 @@ static int prctl_set_mm(int opt, unsigne
 
 	case PR_SET_MM_START_DATA:
 	case PR_SET_MM_END_DATA:
-		vm_req_flags = VM_READ | VM_WRITE;
-		vm_bad_flags = VM_EXEC | VM_MAYSHARE;
-
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
-		    (vma->vm_flags & vm_bad_flags))
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE,
+				       VM_EXEC | VM_MAYSHARE))
 			goto out;
 
 		if (opt == PR_SET_MM_START_DATA)
@@ -1753,19 +1757,6 @@ static int prctl_set_mm(int opt, unsigne
 			mm->end_data = addr;
 		break;
 
-	case PR_SET_MM_START_STACK:
-
-#ifdef CONFIG_STACK_GROWSUP
-		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
-#else
-		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
-#endif
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
-			goto out;
-
-		mm->start_stack = addr;
-		break;
-
 	case PR_SET_MM_START_BRK:
 		if (addr <= mm->end_data)
 			goto out;
@@ -1790,6 +1781,34 @@ static int prctl_set_mm(int opt, unsigne
 		mm->brk = addr;
 		break;
 
+	case PR_SET_MM_START_STACK:
+	case PR_SET_MM_ARG_START:
+	case PR_SET_MM_ARG_END:
+	case PR_SET_MM_ENV_START:
+	case PR_SET_MM_ENV_END:
+#ifdef CONFIG_STACK_GROWSUP
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSUP, 0))
+#else
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSDOWN, 0))
+#endif
+			goto out;
+		if (opt == PR_SET_MM_ARG_START)
+			mm->start_stack = addr;
+		else if (opt == PR_SET_MM_ARG_START)
+			mm->arg_start = addr;
+		else if (opt == PR_SET_MM_ARG_END)
+			mm->arg_end = addr;
+		else if (opt == PR_SET_MM_ENV_START)
+			mm->env_start = addr;
+		else if (opt == PR_SET_MM_ENV_END)
+			mm->env_end = addr;
+		break;
+
+	case PR_SET_MM_AUXV:
+		if (arg4 > sizeof(mm->saved_auxv))
+			goto out;
+		error = copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4);
+		break;
 	default:
 		error = -EINVAL;
 		goto out;


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-23 14:20 ` [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
@ 2012-01-23 15:55   ` Cyrill Gorcunov
  2012-01-23 20:02     ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Michael Kerrisk, Vasiliy Kulikov

On Mon, Jan 23, 2012 at 06:20:40PM +0400, Cyrill Gorcunov wrote:
> After restore we would like the 'ps' command show the command
> line and evironment exactly the same it was at checkpoint time.
> 
> So this additional PR_SET_MM_ allow us to do so. Note that
> these members of mm_struct is rather used for output in
> procfs, except auxv vector which is used by ld.so mostly.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Vagin <avagin@openvz.org>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Vasiliy Kulikov <segoon@openwall.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---

This one is with typo fixed.

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries

After restore we would like the 'ps' command show the command
line and evironment exactly the same it was at checkpoint time.

So this additional PR_SET_MM_ allow us to do so. Note that
these members of mm_struct is rather used for output in
procfs, except auxv vector which is used by ld.so mostly.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/prctl.h |    5 +++
 kernel/sys.c          |   73 +++++++++++++++++++++++++++++++-------------------
 2 files changed, 51 insertions(+), 27 deletions(-)

Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -113,5 +113,10 @@
 # define PR_SET_MM_START_STACK		5
 # define PR_SET_MM_START_BRK		6
 # define PR_SET_MM_BRK			7
+# define PR_SET_MM_ARG_START		8
+# define PR_SET_MM_ARG_END		9
+# define PR_SET_MM_ENV_START		10
+# define PR_SET_MM_ENV_END		11
+# define PR_SET_MM_AUXV			12
 
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1693,17 +1693,25 @@ SYSCALL_DEFINE1(umask, int, mask)
 }
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
+static bool vma_flags_mismatch(struct vm_area_struct *vma,
+			       unsigned long required,
+			       unsigned long banned)
+{
+	return (vma->vm_flags & required) != required ||
+		(vma->vm_flags & banned);
+}
+
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	unsigned long rlim = rlimit(RLIMIT_DATA);
-	unsigned long vm_req_flags;
-	unsigned long vm_bad_flags;
 	struct vm_area_struct *vma;
 	int error = 0;
 	struct mm_struct *mm = current->mm;
 
-	if (arg4 | arg5)
+	if (arg4 && opt != PR_SET_MM_AUXV)
+		return -EINVAL;
+	else if (arg4 | arg5)
 		return -EINVAL;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -1715,7 +1723,9 @@ static int prctl_set_mm(int opt, unsigne
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, addr);
 
-	if (opt != PR_SET_MM_START_BRK && opt != PR_SET_MM_BRK) {
+	if (opt != PR_SET_MM_START_BRK &&
+	    opt != PR_SET_MM_BRK &&
+	    opt != PR_SET_MM_AUXV) {
 		/* It must be existing VMA */
 		if (!vma || vma->vm_start > addr)
 			goto out;
@@ -1725,11 +1735,8 @@ static int prctl_set_mm(int opt, unsigne
 	switch (opt) {
 	case PR_SET_MM_START_CODE:
 	case PR_SET_MM_END_CODE:
-		vm_req_flags = VM_READ | VM_EXEC;
-		vm_bad_flags = VM_WRITE | VM_MAYSHARE;
-
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
-		    (vma->vm_flags & vm_bad_flags))
+		if (vma_flags_mismatch(vma, VM_READ | VM_EXEC,
+				       VM_WRITE | VM_MAYSHARE))
 			goto out;
 
 		if (opt == PR_SET_MM_START_CODE)
@@ -1740,11 +1747,8 @@ static int prctl_set_mm(int opt, unsigne
 
 	case PR_SET_MM_START_DATA:
 	case PR_SET_MM_END_DATA:
-		vm_req_flags = VM_READ | VM_WRITE;
-		vm_bad_flags = VM_EXEC | VM_MAYSHARE;
-
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
-		    (vma->vm_flags & vm_bad_flags))
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE,
+				       VM_EXEC | VM_MAYSHARE))
 			goto out;
 
 		if (opt == PR_SET_MM_START_DATA)
@@ -1753,19 +1757,6 @@ static int prctl_set_mm(int opt, unsigne
 			mm->end_data = addr;
 		break;
 
-	case PR_SET_MM_START_STACK:
-
-#ifdef CONFIG_STACK_GROWSUP
-		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
-#else
-		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
-#endif
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
-			goto out;
-
-		mm->start_stack = addr;
-		break;
-
 	case PR_SET_MM_START_BRK:
 		if (addr <= mm->end_data)
 			goto out;
@@ -1790,6 +1781,34 @@ static int prctl_set_mm(int opt, unsigne
 		mm->brk = addr;
 		break;
 
+	case PR_SET_MM_START_STACK:
+	case PR_SET_MM_ARG_START:
+	case PR_SET_MM_ARG_END:
+	case PR_SET_MM_ENV_START:
+	case PR_SET_MM_ENV_END:
+#ifdef CONFIG_STACK_GROWSUP
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSUP, 0))
+#else
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSDOWN, 0))
+#endif
+			goto out;
+		if (opt == PR_SET_MM_START_STACK)
+			mm->start_stack = addr;
+		else if (opt == PR_SET_MM_ARG_START)
+			mm->arg_start = addr;
+		else if (opt == PR_SET_MM_ARG_END)
+			mm->arg_end = addr;
+		else if (opt == PR_SET_MM_ENV_START)
+			mm->env_start = addr;
+		else if (opt == PR_SET_MM_ENV_END)
+			mm->env_end = addr;
+		break;
+
+	case PR_SET_MM_AUXV:
+		if (arg4 > sizeof(mm->saved_auxv))
+			goto out;
+		error = copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4);
+		break;
 	default:
 		error = -EINVAL;
 		goto out;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-23 14:20 ` [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4 Cyrill Gorcunov
@ 2012-01-23 18:48   ` H. Peter Anvin
  2012-01-23 20:03     ` Cyrill Gorcunov
  2012-01-24  2:16   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 60+ messages in thread
From: H. Peter Anvin @ 2012-01-23 18:48 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On 01/23/2012 06:20 AM, Cyrill Gorcunov wrote:
> +
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v + cookies[type][0]) ^ cookies[type][1];
> +}
> +

Arf... when I said to use xor I meant instead of the add, not instead of
the multiply, so:

	return (v ^ cookies[type][0]) * cookies[type][1];

Otherwise you have absolutely no source of diffusion at all (symmetric
cryptography is about combinations of diffusion -- spreading the content
-- and confusion -- scrambling individual bits of content.)

+	for (i = 0; i < KCMP_TYPES; i++) {
+		for (j = 0; j < 2; j++) {
+			get_random_bytes(&cookies[i][j],
+					 sizeof(cookies[i][j]));
+			cookies[i][j] |= (~(~0UL >>  1) | 1);
+		}
+	}

Only cookies[1] -- being used as a multiplicative constant -- needs the OR.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
@ 2012-01-23 18:54   ` Kees Cook
  2012-01-23 19:33     ` Cyrill Gorcunov
  2012-01-24  2:07   ` KAMEZAWA Hiroyuki
  2012-01-24 23:53   ` Andrew Morton
  2 siblings, 1 reply; 60+ messages in thread
From: Kees Cook @ 2012-01-23 18:54 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Mon, Jan 23, 2012 at 6:20 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> When we do checkpoint of a task we need to know the list of children
> the task, has but there is no easy and fast way to generate reverse
> parent->children chain from arbitrary <pid> (while a parent pid is
> provided in "PPid" field of /proc/<pid>/status).
> [...]
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

> +static int children_seq_show(struct seq_file *seq, void *v)
> +{
> +       struct inode *inode = seq->private;
> +       pid_t pid;
> +
> +       pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
> +       return seq_printf(seq, " %d", pid);
> +}

Does this mean the file contents always starts with a space? I think
I'd prefer a trailing space than a leading one. Better yet, neither.
:)

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 18:54   ` Kees Cook
@ 2012-01-23 19:33     ` Cyrill Gorcunov
  2012-01-23 20:29       ` Kees Cook
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 19:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Mon, Jan 23, 2012 at 10:54:53AM -0800, Kees Cook wrote:
> On Mon, Jan 23, 2012 at 6:20 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> > When we do checkpoint of a task we need to know the list of children
> > the task, has but there is no easy and fast way to generate reverse
> > parent->children chain from arbitrary <pid> (while a parent pid is
> > provided in "PPid" field of /proc/<pid>/status).
> > [...]
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Serge Hallyn <serge.hallyn@canonical.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> 
> > +static int children_seq_show(struct seq_file *seq, void *v)
> > +{
> > +       struct inode *inode = seq->private;
> > +       pid_t pid;
> > +
> > +       pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
> > +       return seq_printf(seq, " %d", pid);
> > +}
> 
> Does this mean the file contents always starts with a space? I think
> I'd prefer a trailing space than a leading one. Better yet, neither.
> :)
> 

Yeah, it there children, they will be in say " 1 2 3 4\n" format.
To drop this space completely i'll have to add more code, which I
actually trying to escape. Can we live with it? ;)

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-23 15:55   ` Cyrill Gorcunov
@ 2012-01-23 20:02     ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 20:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Michael Kerrisk, Vasiliy Kulikov

On Mon, Jan 23, 2012 at 07:55:43PM +0400, Cyrill Gorcunov wrote:
> On Mon, Jan 23, 2012 at 06:20:40PM +0400, Cyrill Gorcunov wrote:
> > After restore we would like the 'ps' command show the command
> > line and evironment exactly the same it was at checkpoint time.
> > 
> > So this additional PR_SET_MM_ allow us to do so. Note that
> > these members of mm_struct is rather used for output in
> > procfs, except auxv vector which is used by ld.so mostly.
> > 
> 
> This one is with typo fixed.
> ---

After some more testing I found there is a proble, so I'll send
updated version later. Please dont review ;)

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-23 18:48   ` H. Peter Anvin
@ 2012-01-23 20:03     ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 20:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Mon, Jan 23, 2012 at 10:48:57AM -0800, H. Peter Anvin wrote:
> On 01/23/2012 06:20 AM, Cyrill Gorcunov wrote:
> > +
> > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> > +
> > +static long kptr_obfuscate(long v, int type)
> > +{
> > +	return (v + cookies[type][0]) ^ cookies[type][1];
> > +}
> > +
> 
> Arf... when I said to use xor I meant instead of the add, not instead of
> the multiply, so:
> 
> 	return (v ^ cookies[type][0]) * cookies[type][1];
> 
> Otherwise you have absolutely no source of diffusion at all (symmetric
> cryptography is about combinations of diffusion -- spreading the content
> -- and confusion -- scrambling individual bits of content.)
> 
> +	for (i = 0; i < KCMP_TYPES; i++) {
> +		for (j = 0; j < 2; j++) {
> +			get_random_bytes(&cookies[i][j],
> +					 sizeof(cookies[i][j]));
> +			cookies[i][j] |= (~(~0UL >>  1) | 1);
> +		}
> +	}
> 
> Only cookies[1] -- being used as a multiplicative constant -- needs the OR.

OK. Thanks. I'll update.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 19:33     ` Cyrill Gorcunov
@ 2012-01-23 20:29       ` Kees Cook
  2012-01-23 20:39         ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Kees Cook @ 2012-01-23 20:29 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Mon, Jan 23, 2012 at 11:33 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Mon, Jan 23, 2012 at 10:54:53AM -0800, Kees Cook wrote:
>> On Mon, Jan 23, 2012 at 6:20 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
>> > When we do checkpoint of a task we need to know the list of children
>> > the task, has but there is no easy and fast way to generate reverse
>> > parent->children chain from arbitrary <pid> (while a parent pid is
>> > provided in "PPid" field of /proc/<pid>/status).
>> > [...]
>> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
>> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> > Cc: Pavel Emelyanov <xemul@parallels.com>
>> > Cc: Serge Hallyn <serge.hallyn@canonical.com>
>> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> Reviewed-by: Kees Cook <keescook@chromium.org>
>>
>> > +static int children_seq_show(struct seq_file *seq, void *v)
>> > +{
>> > +       struct inode *inode = seq->private;
>> > +       pid_t pid;
>> > +
>> > +       pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
>> > +       return seq_printf(seq, " %d", pid);
>> > +}
>>
>> Does this mean the file contents always starts with a space? I think
>> I'd prefer a trailing space than a leading one. Better yet, neither.
>> :)
>>
>
> Yeah, it there children, they will be in say " 1 2 3 4\n" format.
> To drop this space completely i'll have to add more code, which I
> actually trying to escape. Can we live with it? ;)

How about just:

return seq_printf(seq, "%d ", pid);

instead?

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 20:29       ` Kees Cook
@ 2012-01-23 20:39         ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 20:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Mon, Jan 23, 2012 at 12:29:52PM -0800, Kees Cook wrote:
> >
> > Yeah, it there children, they will be in say " 1 2 3 4\n" format.
> > To drop this space completely i'll have to add more code, which I
> > actually trying to escape. Can we live with it? ;)
> 
> How about just:
> 
> return seq_printf(seq, "%d ", pid);
> 
> instead?
> 

ok, will update.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-23 14:20 ` [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
@ 2012-01-23 20:42   ` Kees Cook
  2012-01-23 20:53     ` Cyrill Gorcunov
  2012-01-24 23:59   ` Andrew Morton
  1 sibling, 1 reply; 60+ messages in thread
From: Kees Cook @ 2012-01-23 20:42 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Vasiliy Kulikov

On Mon, Jan 23, 2012 at 6:20 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> We would like to have an ability to restore command line
> arguments and envirion pointers so the task being restored
> would print appropriate values in /proc/pid/cmdline and
> /proc/pid/envirion. The exit_code is needed to restore
> zombie tasks.
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Vagin <avagin@openvz.org>
> Cc: Vasiliy Kulikov <segoon@openwall.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/proc/array.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.git/fs/proc/array.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -464,7 +464,8 @@ static int do_task_stat(struct seq_file
>
>        seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
>  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
> +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu \
> +%lu %lu %lu %lu %d\n",
>                pid_nr_ns(pid, ns),
>                tcomm,
>                state,
> @@ -514,7 +515,12 @@ static int do_task_stat(struct seq_file
>                cputime_to_clock_t(cgtime),
>                (mm && permitted) ? mm->start_data : 0,
>                (mm && permitted) ? mm->end_data : 0,
> -               (mm && permitted) ? mm->start_brk : 0);
> +               (mm && permitted) ? mm->start_brk : 0,
> +               (mm && permitted) ? mm->arg_start : 0,
> +               (mm && permitted) ? mm->arg_end : 0,
> +               (mm && permitted) ? mm->env_start : 0,
> +               (mm && permitted) ? mm->env_end : 0,
> +               task->exit_code);
>        if (mm)
>                mmput(mm);
>        return 0;

You're not exposing auxv here? In your testing, what situations ended
up using auxv after initial startup? Or is your intention to be able
to freeze a process potentially before libc has examined auxv?

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-23 20:42   ` Kees Cook
@ 2012-01-23 20:53     ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-23 20:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Vasiliy Kulikov

On Mon, Jan 23, 2012 at 12:42:28PM -0800, Kees Cook wrote:
> On Mon, Jan 23, 2012 at 6:20 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> > We would like to have an ability to restore command line
> > arguments and envirion pointers so the task being restored
> > would print appropriate values in /proc/pid/cmdline and
> > /proc/pid/envirion. The exit_code is needed to restore
> > zombie tasks.
> >
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Serge Hallyn <serge.hallyn@canonical.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Cc: Alexey Dobriyan <adobriyan@gmail.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Andrew Vagin <avagin@openvz.org>
> > Cc: Vasiliy Kulikov <segoon@openwall.com>
> > Cc: Alexey Dobriyan <adobriyan@gmail.com>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > ---
> >  fs/proc/array.c |   10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > Index: linux-2.6.git/fs/proc/array.c
> > ===================================================================
> > --- linux-2.6.git.orig/fs/proc/array.c
> > +++ linux-2.6.git/fs/proc/array.c
> > @@ -464,7 +464,8 @@ static int do_task_stat(struct seq_file
> >
> >        seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
> >  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> > -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
> > +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu \
> > +%lu %lu %lu %lu %d\n",
> >                pid_nr_ns(pid, ns),
> >                tcomm,
> >                state,
> > @@ -514,7 +515,12 @@ static int do_task_stat(struct seq_file
> >                cputime_to_clock_t(cgtime),
> >                (mm && permitted) ? mm->start_data : 0,
> >                (mm && permitted) ? mm->end_data : 0,
> > -               (mm && permitted) ? mm->start_brk : 0);
> > +               (mm && permitted) ? mm->start_brk : 0,
> > +               (mm && permitted) ? mm->arg_start : 0,
> > +               (mm && permitted) ? mm->arg_end : 0,
> > +               (mm && permitted) ? mm->env_start : 0,
> > +               (mm && permitted) ? mm->env_end : 0,
> > +               task->exit_code);
> >        if (mm)
> >                mmput(mm);
> >        return 0;
> 
> You're not exposing auxv here? In your testing, what situations ended
> up using auxv after initial startup? Or is your intention to be able
> to freeze a process potentially before libc has examined auxv?
> 

auxv already available via /proc/pid/auxv so I thought it would be
redundant to put it into different place. Or you meant something
else?

In testing I simply read auxv values from /proc/pid/auxv and restore
them back via prctl at restore time. (prctl patch is in this series
but it's slightly buggy so I post correct version later).

And intention is simply dump this vector at checkpoint time and
restore when needed.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
  2012-01-23 18:54   ` Kees Cook
@ 2012-01-24  2:07   ` KAMEZAWA Hiroyuki
  2012-01-24  6:53     ` Cyrill Gorcunov
  2012-01-24 23:53   ` Andrew Morton
  2 siblings, 1 reply; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24  2:07 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Mon, 23 Jan 2012 18:20:37 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> When we do checkpoint of a task we need to know the list of children
> the task, has but there is no easy and fast way to generate reverse
> parent->children chain from arbitrary <pid> (while a parent pid is
> provided in "PPid" field of /proc/<pid>/status).
> 
> So instead of walking over all pids in the system (creating one big process
> tree in memory, just to figure out which children a task has) -- we add
> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> this kind of information but it is not yet exported.
> 
> This is a first level children, not the whole process tree.
> 
> v2:
>  - Kame suggested to use a separated /proc/<pid>/children entry
>    instead of poking /proc/<pid>/status
>  - Andew suggested to use rcu facility instead of locking
>    tasklist_lock
>  - Tejun pointed that non-seekable seq file might not be
>    enough for tasks with large number of children
> 
> v3:
>  - To be on a safe side use %lu format for pid_t printing
> 
> v4:
>  - New line get printed when sequence ends not at seq->stop,
>    a nit pointed by Tejun
>  - Documentation update
>  - tasklist_lock is back, Oleg pointed that ->children list
>    is actually not rcu-safe
> 
> v5:
>  - Oleg suggested to make /proc/<pid>/task/<tid>/children
>    instead of global /proc/<pid>/children, which eliminates
>    hardness related to threads and children migration, and
>    allows patch to be a way simplier.
> 
> v6:
>  - Drop ptrace_may_access tests, pids are can be found anyway
>    so nothing to protect here.
>  - Update comments and docs, pointed by Oleg.
> 
> v7:
>  - Use get_pid over proc-pid directly, to simplify
>    code, pointed by Oleg.
> 
> v8:
>  - Obtain a starting pid from the proc's inode directly.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

>From viewpoint I played with seq_file, yesterday.

> +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	return get_children_pid(seq->private, NULL, *pos);
> +}
> +
> +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct pid *pid = NULL;
> +
> +	pid = get_children_pid(seq->private, v, *pos + 1);
> +	if (!pid)
> +		seq_printf(seq, "\n");
> +	put_pid(v);

Because seq_printf() may fail. This seems dangeorus.

If seq_printf() fails and returns NULL, "\n" will not be
printed out and user land parser will go wrong.

I think all seq_printf() should be handled in ->show().
(And you can use seq_putc() for "\n".)

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-23 14:20 ` [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4 Cyrill Gorcunov
  2012-01-23 18:48   ` H. Peter Anvin
@ 2012-01-24  2:16   ` KAMEZAWA Hiroyuki
  2012-01-24  6:47     ` Cyrill Gorcunov
  1 sibling, 1 reply; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24  2:16 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Mon, 23 Jan 2012 18:20:38 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> While doing the checkpoint-restore in the userspace one need to determine
> whether various kernel objects (like mm_struct-s of file_struct-s) are shared
> between tasks and restore this state.
> 
> The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
> syscall, while there's currently no ways for solving the 1st one.
> 
> One of the ways for checking whether two tasks share e.g. mm_struct is to
> provide some mm_struct ID of a task to its proc file, but showing such
> info considered to be not that good for security reasons.
> 
> Thus after some debates we end up in conclusion that using that named
> 'comparision' syscall might be the best candidate. So here is it --
> __NR_kcmp.
> 
> It takes up to 5 agruments - the pids of the two tasks (which
> characteristics should be compared), the comparision type and
> (in case of comparision of files) two file descriptors.
> 
> At moment only x86 is supported.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: "Eric W. Biederman" <ebiederm@xmission.com>
> CC: Pavel Emelyanov <xemul@parallels.com>
> CC: Andrey Vagin <avagin@openvz.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: H. Peter Anvin <hpa@zytor.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Glauber Costa <glommer@parallels.com>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Tejun Heo <tj@kernel.org>
> CC: Matt Helsley <matthltc@us.ibm.com>
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Vasiliy Kulikov <segoon@openwall.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Valdis.Kletnieks@vt.edu
> ---
>  arch/x86/include/asm/kcmp.h      |   17 ++++
>  arch/x86/include/asm/syscalls.h  |    4 
>  arch/x86/kernel/Makefile         |    1 
>  arch/x86/kernel/kcmp.c           |  163 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/syscalls/syscall_32.tbl |    1 
>  arch/x86/syscalls/syscall_64.tbl |    1 
>  6 files changed, 187 insertions(+)
> 
> Index: linux-2.6.git/arch/x86/include/asm/kcmp.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/arch/x86/include/asm/kcmp.h
> @@ -0,0 +1,17 @@
> +#ifndef _LINUX_KCMP_H
> +#define _LINUX_KCMP_H
> +
> +/* Comparision type */
> +enum {
> +	KCMP_FILE,
> +	KCMP_VM,
> +	KCMP_FILES,
> +	KCMP_FS,
> +	KCMP_SIGHAND,
> +	KCMP_IO,
> +	KCMP_SYSVSEM,
> +
> +	KCMP_TYPES,
> +};
> +
> +#endif /* _LINUX_KCMP_H */

Why under /arch ?





> Index: linux-2.6.git/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/include/asm/syscalls.h
> +++ linux-2.6.git/arch/x86/include/asm/syscalls.h
> @@ -42,6 +42,10 @@ long sys_sigaltstack(const stack_t __use
>  asmlinkage int sys_set_thread_area(struct user_desc __user *);
>  asmlinkage int sys_get_thread_area(struct user_desc __user *);
>  
> +/* kenrel/kcmp.c */
> +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> +			 unsigned long idx1, unsigned long idx2);
> +
>  /* X86_32 only */
>  #ifdef CONFIG_X86_32
>  
> Index: linux-2.6.git/arch/x86/kernel/Makefile
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/kernel/Makefile
> +++ linux-2.6.git/arch/x86/kernel/Makefile
> @@ -34,6 +34,7 @@ obj-y			+= alternative.o i8253.o pci-nom
>  obj-y			+= tsc.o io_delay.o rtc.o
>  obj-y			+= pci-iommu_table.o
>  obj-y			+= resource.o
> +obj-y			+= kcmp.o
>  
>  obj-y				+= trampoline.o trampoline_$(BITS).o
>  obj-y				+= process.o
> Index: linux-2.6.git/arch/x86/kernel/kcmp.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/arch/x86/kernel/kcmp.c
> @@ -0,0 +1,163 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <linux/fdtable.h>
> +#include <linux/string.h>
> +#include <linux/random.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/cache.h>
> +#include <linux/bug.h>
> +#include <linux/err.h>
> +
> +#include <asm/unistd.h>
> +#include <asm/kcmp.h>
> +
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v + cookies[type][0]) ^ cookies[type][1];
> +}
> +

I'm sorry could you add comments to swho what this does ?



> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable
> + */
> +static int kcmp_ptr(long v1, long v2, int type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +#define KCMP_TASK_PTR(task1, task2, member, type)	\
> +	kcmp_ptr((long)(task1)->member,			\
> +		 (long)(task2)->member,			\
> +		 type)
> +
> +#define KCMP_PTR(ptr1, ptr2, type)			\
> +	kcmp_ptr((long)ptr1, (long)ptr2, type)
> +
> +/* A caller must be sure the task is presented in memory */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	spin_lock(&task->files->file_lock);
> +	fdt = files_fdtable(task->files);
> +	if (idx < fdt->max_fds)
> +		file = fdt->fd[idx];
> +	else
> +		file = NULL;
> +	spin_unlock(&task->files->file_lock);
> +
> +	return file;
> +}
> +
> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> +		unsigned long, idx1, unsigned long, idx2)
> +{
> +	struct task_struct *task1;
> +	struct task_struct *task2;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	task1 = find_task_by_vpid(pid1);
> +	if (!task1) {
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	task2 = find_task_by_vpid(pid2);
> +	if (!task2) {
> +		put_task_struct(task1);
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	get_task_struct(task1);
> +	get_task_struct(task2);
> +
> +	rcu_read_unlock();
> +
> +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> +		ret = -EACCES;
> +		goto err;
> +	}
> +
> +	/*
> +	 * Note for all cases but the KCMP_FILE we
> +	 * don't take any locks in a sake of speed.
> +	 */
> +
> +	switch (type) {
> +	case KCMP_FILE: {
> +		struct file *filp1, *filp2;
> +
> +		filp1 = get_file_raw_ptr(task1, idx1);
> +		filp2 = get_file_raw_ptr(task2, idx2);
> +
> +		if (filp1 && filp2)
> +			ret = KCMP_PTR(filp1, filp2, KCMP_FILE);
> +		else
> +			ret = -ENOENT;
> +		break;
> +	}
> +	case KCMP_VM:
> +		ret = KCMP_TASK_PTR(task1, task2, mm, KCMP_VM);
> +		break;
> +	case KCMP_FILES:
> +		ret = KCMP_TASK_PTR(task1, task2, files, KCMP_FILES);
> +		break;
> +	case KCMP_FS:
> +		ret = KCMP_TASK_PTR(task1, task2, fs, KCMP_FS);
> +		break;
> +	case KCMP_SIGHAND:
> +		ret = KCMP_TASK_PTR(task1, task2, sighand, KCMP_SIGHAND);
> +		break;
> +	case KCMP_IO:
> +		ret = KCMP_TASK_PTR(task1, task2, io_context, KCMP_IO);
> +		break;
> +	case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> +		ret = KCMP_TASK_PTR(task1, task2, sysvsem.undo_list, KCMP_SYSVSEM);
> +#else
> +		ret = -ENOENT;
> +		goto err;
> +#endif
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +err:
> +	put_task_struct(task1);
> +	put_task_struct(task2);
> +
> +	return ret;
> +}

It seems this function itself doesn't depend on arch.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  2:16   ` KAMEZAWA Hiroyuki
@ 2012-01-24  6:47     ` Cyrill Gorcunov
  2012-01-24  7:04       ` H. Peter Anvin
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  6:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 11:16:55AM +0900, KAMEZAWA Hiroyuki wrote:
...
> > 
> > Index: linux-2.6.git/arch/x86/include/asm/kcmp.h
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6.git/arch/x86/include/asm/kcmp.h
> > @@ -0,0 +1,17 @@
> > +#ifndef _LINUX_KCMP_H
> > +#define _LINUX_KCMP_H
> > +
> > +/* Comparision type */
> > +enum {
> > +	KCMP_FILE,
> > +	KCMP_VM,
> > +	KCMP_FILES,
> > +	KCMP_FS,
> > +	KCMP_SIGHAND,
> > +	KCMP_IO,
> > +	KCMP_SYSVSEM,
> > +
> > +	KCMP_TYPES,
> > +};
> > +
> > +#endif /* _LINUX_KCMP_H */
> 
> Why under /arch ?
> 

Hi Kame,

because I've tested it under x86 only. Once someone
confirm it's needed on some else arch and does work
as expected -- it wont be a problem to make it system
wide. Until then -- I think better to stick with
at least tested case.
...
> > +
> > +static long kptr_obfuscate(long v, int type)
> > +{
> > +	return (v + cookies[type][0]) ^ cookies[type][1];
> > +}
> > +
> 
> I'm sorry could you add comments to swho what this does ?

The idea is disorder the in-memory order but remain order
suitable for sorting. Anywa, as Peter pointed this helper
must be redone. I'll post updated version.

> 
> It seems this function itself doesn't depend on arch.
> 

Yes, it doesn't. But as I said if someone confirm it's
needed on non-x86 and get tested, that will not be a
problem to make it system-side.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  2:07   ` KAMEZAWA Hiroyuki
@ 2012-01-24  6:53     ` Cyrill Gorcunov
  2012-01-24  7:07       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  6:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Tue, Jan 24, 2012 at 11:07:30AM +0900, KAMEZAWA Hiroyuki wrote:
...
> 
> From viewpoint I played with seq_file, yesterday.
> 
> > +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	return get_children_pid(seq->private, NULL, *pos);
> > +}
> > +
> > +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +	struct pid *pid = NULL;
> > +
> > +	pid = get_children_pid(seq->private, v, *pos + 1);
> > +	if (!pid)
> > +		seq_printf(seq, "\n");
> > +	put_pid(v);
> 
> Because seq_printf() may fail. This seems dangeorus.
> 
> If seq_printf() fails and returns NULL, "\n" will not be
> printed out and user land parser will go wrong.
>

Hmm. But userspace app will get eof, so frankly I don't see
a problem here. Or maybe I miss something?

> I think all seq_printf() should be handled in ->show().
> (And you can use seq_putc() for "\n".)
> 

Sure, i'll change it to seq_putc.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  6:47     ` Cyrill Gorcunov
@ 2012-01-24  7:04       ` H. Peter Anvin
  2012-01-24  7:17         ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: H. Peter Anvin @ 2012-01-24  7:04 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On 01/23/2012 10:47 PM, Cyrill Gorcunov wrote:
>>
>> Why under /arch ?
>>
> 
> Hi Kame,
> 
> because I've tested it under x86 only. Once someone
> confirm it's needed on some else arch and does work
> as expected -- it wont be a problem to make it system
> wide. Until then -- I think better to stick with
> at least tested case.
> ...

That's not a reason to put it in arch/ ... that's possibly a reason to
not map the system call on other architectures yet.

	-hpa

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  6:53     ` Cyrill Gorcunov
@ 2012-01-24  7:07       ` KAMEZAWA Hiroyuki
  2012-01-24  7:21         ` Cyrill Gorcunov
  2012-01-24  8:51         ` Cyrill Gorcunov
  0 siblings, 2 replies; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24  7:07 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Tue, 24 Jan 2012 10:53:38 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Tue, Jan 24, 2012 at 11:07:30AM +0900, KAMEZAWA Hiroyuki wrote:
> ...
> > 
> > From viewpoint I played with seq_file, yesterday.
> > 
> > > +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> > > +{
> > > +	return get_children_pid(seq->private, NULL, *pos);
> > > +}
> > > +
> > > +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > > +{
> > > +	struct pid *pid = NULL;
> > > +
> > > +	pid = get_children_pid(seq->private, v, *pos + 1);
> > > +	if (!pid)
> > > +		seq_printf(seq, "\n");
> > > +	put_pid(v);
> > 
> > Because seq_printf() may fail. This seems dangeorus.
> > 
> > If seq_printf() fails and returns NULL, "\n" will not be
> > printed out and user land parser will go wrong.
> >
> 
> Hmm. But userspace app will get eof, so frankly I don't see
> a problem here. Or maybe I miss something?
> 

Userspace need to take care of whether there may be"\n" or not even
if read() returns EOF.
As an interface, it's BUG to say "\n" will be there if you're lucky!"
(*) I know script language can handle this but we shouldn't assume that.

How about just remove "\n" at EOF  ? I think it's unnecessary.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  7:04       ` H. Peter Anvin
@ 2012-01-24  7:17         ` Cyrill Gorcunov
  2012-01-24  7:20           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  7:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Mon, Jan 23, 2012 at 11:04:19PM -0800, H. Peter Anvin wrote:
> On 01/23/2012 10:47 PM, Cyrill Gorcunov wrote:
> >>
> >> Why under /arch ?
> >>
> > 
> > Hi Kame,
> > 
> > because I've tested it under x86 only. Once someone
> > confirm it's needed on some else arch and does work
> > as expected -- it wont be a problem to make it system
> > wide. Until then -- I think better to stick with
> > at least tested case.
> > ...
> 
> That's not a reason to put it in arch/ ... that's possibly a reason to
> not map the system call on other architectures yet.
> 

Where it should live then? In kernel/ or mm/ ?

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  7:17         ` Cyrill Gorcunov
@ 2012-01-24  7:20           ` KAMEZAWA Hiroyuki
  2012-01-24  7:38             ` Cyrill Gorcunov
  2012-01-24  8:49             ` Eric W. Biederman
  0 siblings, 2 replies; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24  7:20 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: H. Peter Anvin, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, 24 Jan 2012 11:17:16 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Mon, Jan 23, 2012 at 11:04:19PM -0800, H. Peter Anvin wrote:
> > On 01/23/2012 10:47 PM, Cyrill Gorcunov wrote:
> > >>
> > >> Why under /arch ?
> > >>
> > > 
> > > Hi Kame,
> > > 
> > > because I've tested it under x86 only. Once someone
> > > confirm it's needed on some else arch and does work
> > > as expected -- it wont be a problem to make it system
> > > wide. Until then -- I think better to stick with
> > > at least tested case.
> > > ...
> > 
> > That's not a reason to put it in arch/ ... that's possibly a reason to
> > not map the system call on other architectures yet.
> > 
> 
> Where it should live then? In kernel/ or mm/ ?
> 

kernel/checkpoint_restart ?

gathering related changes to a directory may help developpers joins later....
To me, this makes seeing git log easy ;)
 
Thanks,
-Kame


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  7:07       ` KAMEZAWA Hiroyuki
@ 2012-01-24  7:21         ` Cyrill Gorcunov
  2012-01-24  8:52           ` Eric W. Biederman
  2012-01-24  8:51         ` Cyrill Gorcunov
  1 sibling, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  7:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Tue, Jan 24, 2012 at 04:07:09PM +0900, KAMEZAWA Hiroyuki wrote:
> > 
> > Hmm. But userspace app will get eof, so frankly I don't see
> > a problem here. Or maybe I miss something?
> > 
> 
> Userspace need to take care of whether there may be"\n" or not even
> if read() returns EOF.
> As an interface, it's BUG to say "\n" will be there if you're lucky!"
> (*) I know script language can handle this but we shouldn't assume that.
> 
> How about just remove "\n" at EOF  ? I think it's unnecessary.
> 

Sure thing, it's not a problem to remove it completely.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  7:20           ` KAMEZAWA Hiroyuki
@ 2012-01-24  7:38             ` Cyrill Gorcunov
  2012-01-24  7:40               ` KAMEZAWA Hiroyuki
  2012-01-24  8:49             ` Eric W. Biederman
  1 sibling, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  7:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: H. Peter Anvin, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 04:20:31PM +0900, KAMEZAWA Hiroyuki wrote:
...
> > > 
> > > That's not a reason to put it in arch/ ... that's possibly a reason to
> > > not map the system call on other architectures yet.
> > > 
> > 
> > Where it should live then? In kernel/ or mm/ ?
> > 
> 
> kernel/checkpoint_restart ?
> 
> gathering related changes to a directory may help developpers joins later....
> To me, this makes seeing git log easy ;)
>  

Such separate directory implies everything related to c/r should live there,
it also implies (I think) that code which is under CHECKPOINT_RESTORE should
be moved there as well, so I'm not sure, Kame ;) I guess mm/ will be a good
place since these are operations with memory pointers.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  7:38             ` Cyrill Gorcunov
@ 2012-01-24  7:40               ` KAMEZAWA Hiroyuki
  2012-01-24  8:48                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24  7:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: H. Peter Anvin, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Eric W. Biederman, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, 24 Jan 2012 11:38:42 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Tue, Jan 24, 2012 at 04:20:31PM +0900, KAMEZAWA Hiroyuki wrote:
> ...
> > > > 
> > > > That's not a reason to put it in arch/ ... that's possibly a reason to
> > > > not map the system call on other architectures yet.
> > > > 
> > > 
> > > Where it should live then? In kernel/ or mm/ ?
> > > 
> > 
> > kernel/checkpoint_restart ?
> > 
> > gathering related changes to a directory may help developpers joins later....
> > To me, this makes seeing git log easy ;)
> >  
> 
> Such separate directory implies everything related to c/r should live there,
> it also implies (I think) that code which is under CHECKPOINT_RESTORE should
> be moved there as well, so I'm not sure, Kame ;) I guess mm/ will be a good
> place since these are operations with memory pointers.
> 

please do as you like. 
-Kame


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  7:40               ` KAMEZAWA Hiroyuki
@ 2012-01-24  8:48                 ` Cyrill Gorcunov
  2012-01-24 20:20                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  8:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, H. Peter Anvin
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 04:40:08PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 24 Jan 2012 11:38:42 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > On Tue, Jan 24, 2012 at 04:20:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > ...
> > > > > 
> > > > > That's not a reason to put it in arch/ ... that's possibly a reason to
> > > > > not map the system call on other architectures yet.
> > > > > 
> > > > 
> > > > Where it should live then? In kernel/ or mm/ ?
> > > > 
> > > 
> > > kernel/checkpoint_restart ?
> > > 
> > > gathering related changes to a directory may help developpers joins later....
> > > To me, this makes seeing git log easy ;)
> > >  
> > 
> > Such separate directory implies everything related to c/r should live there,
> > it also implies (I think) that code which is under CHECKPOINT_RESTORE should
> > be moved there as well, so I'm not sure, Kame ;) I guess mm/ will be a good
> > place since these are operations with memory pointers.
> > 
> 
> please do as you like. 

So it should be something like below I think...

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [RFC] syscalls, x86: Add __NR_kcmp syscall v5

While doing the checkpoint-restore in the userspace one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are shared
between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparision' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 agruments - the pids of the two tasks (which
characteristics should be compared), the comparision type and
(in case of comparision of files) two file descriptors.

At moment only x86 is supported.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Glauber Costa <glommer@parallels.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tejun Heo <tj@kernel.org>
CC: Matt Helsley <matthltc@us.ibm.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Valdis.Kletnieks@vt.edu
---
 arch/x86/include/asm/syscalls.h  |    4 
 arch/x86/syscalls/syscall_32.tbl |    1 
 arch/x86/syscalls/syscall_64.tbl |    1 
 include/linux/kcmp.h             |   17 ++++
 mm/Makefile                      |    1 
 mm/kcmp.c                        |  163 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 187 insertions(+)

Index: linux-2.6.git/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/syscalls.h
+++ linux-2.6.git/arch/x86/include/asm/syscalls.h
@@ -42,6 +42,10 @@ long sys_sigaltstack(const stack_t __use
 asmlinkage int sys_set_thread_area(struct user_desc __user *);
 asmlinkage int sys_get_thread_area(struct user_desc __user *);
 
+/* mm/kcmp.c */
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
+
 /* X86_32 only */
 #ifdef CONFIG_X86_32
 
Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
Index: linux-2.6.git/include/linux/kcmp.h
===================================================================
--- /dev/null
+++ linux-2.6.git/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparision type */
+enum {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
Index: linux-2.6.git/mm/Makefile
===================================================================
--- linux-2.6.git.orig/mm/Makefile
+++ linux-2.6.git/mm/Makefile
@@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoiso
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_X86) += kcmp.o
Index: linux-2.6.git/mm/kcmp.c
===================================================================
--- /dev/null
+++ linux-2.6.git/mm/kcmp.c
@@ -0,0 +1,163 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal
+ * 1 - less than
+ * 2 - greater than
+ * 3 - not equal but ordering unavailable
+ */
+static int kcmp_ptr(long v1, long v2, int type)
+{
+	long ret;
+
+	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+#define KCMP_TASK_PTR(task1, task2, member, type)	\
+	kcmp_ptr((long)(task1)->member,			\
+		 (long)(task2)->member,			\
+		 type)
+
+#define KCMP_PTR(ptr1, ptr2, type)			\
+	kcmp_ptr((long)ptr1, (long)ptr2, type)
+
+/* A caller must be sure the task is presented in memory */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	spin_lock(&task->files->file_lock);
+	fdt = files_fdtable(task->files);
+	if (idx < fdt->max_fds)
+		file = fdt->fd[idx];
+	else
+		file = NULL;
+	spin_unlock(&task->files->file_lock);
+
+	return file;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1;
+	struct task_struct *task2;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	task1 = find_task_by_vpid(pid1);
+	if (!task1) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	task2 = find_task_by_vpid(pid2);
+	if (!task2) {
+		put_task_struct(task1);
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	/*
+	 * Note for all cases but the KCMP_FILE we
+	 * don't take any locks in a sake of speed.
+	 */
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = KCMP_PTR(filp1, filp2, KCMP_FILE);
+		else
+			ret = -ENOENT;
+		break;
+	}
+	case KCMP_VM:
+		ret = KCMP_TASK_PTR(task1, task2, mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = KCMP_TASK_PTR(task1, task2, files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = KCMP_TASK_PTR(task1, task2, fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = KCMP_TASK_PTR(task1, task2, sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = KCMP_TASK_PTR(task1, task2, io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = KCMP_TASK_PTR(task1, task2, sysvsem.undo_list, KCMP_SYSVSEM);
+#else
+		ret = -ENOENT;
+		goto err;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto err;
+	}
+
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+}
+
+static __init int kcmp_cookie_init(void)
+{
+	int i, j;
+
+	for (i = 0; i < KCMP_TYPES; i++) {
+		for (j = 0; j < 2; j++) {
+			get_random_bytes(&cookies[i][j],
+					 sizeof(cookies[i][j]));
+		}
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+	}
+
+	return 0;
+}
+late_initcall(kcmp_cookie_init);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  7:20           ` KAMEZAWA Hiroyuki
  2012-01-24  7:38             ` Cyrill Gorcunov
@ 2012-01-24  8:49             ` Eric W. Biederman
  2012-01-24  8:49               ` Cyrill Gorcunov
  1 sibling, 1 reply; 60+ messages in thread
From: Eric W. Biederman @ 2012-01-24  8:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Cyrill Gorcunov, H. Peter Anvin, linux-kernel, Andrew Morton,
	Pavel Emelyanov, Serge Hallyn, Kees Cook, Tejun Heo,
	Andrew Vagin, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Tue, 24 Jan 2012 11:17:16 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
>> On Mon, Jan 23, 2012 at 11:04:19PM -0800, H. Peter Anvin wrote:
>> > On 01/23/2012 10:47 PM, Cyrill Gorcunov wrote:
>> > >>
>> > >> Why under /arch ?
>> > >>
>> > > 
>> > > Hi Kame,
>> > > 
>> > > because I've tested it under x86 only. Once someone
>> > > confirm it's needed on some else arch and does work
>> > > as expected -- it wont be a problem to make it system
>> > > wide. Until then -- I think better to stick with
>> > > at least tested case.
>> > > ...
>> > 
>> > That's not a reason to put it in arch/ ... that's possibly a reason to
>> > not map the system call on other architectures yet.
>> > 
>> 
>> Where it should live then? In kernel/ or mm/ ?
>> 
>
> kernel/checkpoint_restart ?
>
> gathering related changes to a directory may help developpers joins later....
> To me, this makes seeing git log easy ;)

kernel/

The dominant use may be checkpoint restart but the code is not at all
checkpoint restart specific.

Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  8:49             ` Eric W. Biederman
@ 2012-01-24  8:49               ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  8:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KAMEZAWA Hiroyuki, H. Peter Anvin, linux-kernel, Andrew Morton,
	Pavel Emelyanov, Serge Hallyn, Kees Cook, Tejun Heo,
	Andrew Vagin, Alexey Dobriyan, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 12:49:27AM -0800, Eric W. Biederman wrote:
...
> 
> kernel/
> 
> The dominant use may be checkpoint restart but the code is not at all
> checkpoint restart specific.
> 

Crap, just sent out mm/ version :)

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  7:07       ` KAMEZAWA Hiroyuki
  2012-01-24  7:21         ` Cyrill Gorcunov
@ 2012-01-24  8:51         ` Cyrill Gorcunov
  1 sibling, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  8:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Tue, Jan 24, 2012 at 04:07:09PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 24 Jan 2012 10:53:38 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > On Tue, Jan 24, 2012 at 11:07:30AM +0900, KAMEZAWA Hiroyuki wrote:
> > ...
> > > 
> > > From viewpoint I played with seq_file, yesterday.
> > > 
> > > > +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> > > > +{
> > > > +	return get_children_pid(seq->private, NULL, *pos);
> > > > +}
> > > > +
> > > > +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > > > +{
> > > > +	struct pid *pid = NULL;
> > > > +
> > > > +	pid = get_children_pid(seq->private, v, *pos + 1);
> > > > +	if (!pid)
> > > > +		seq_printf(seq, "\n");
> > > > +	put_pid(v);
> > > 
> > > Because seq_printf() may fail. This seems dangeorus.
> > > 
> > > If seq_printf() fails and returns NULL, "\n" will not be
> > > printed out and user land parser will go wrong.
> > >
> > 
> > Hmm. But userspace app will get eof, so frankly I don't see
> > a problem here. Or maybe I miss something?
> > 
> 
> Userspace need to take care of whether there may be"\n" or not even
> if read() returns EOF.
> As an interface, it's BUG to say "\n" will be there if you're lucky!"
> (*) I know script language can handle this but we shouldn't assume that.
> 
> How about just remove "\n" at EOF  ? I think it's unnecessary.
> 

This one should fit both "%d " and no "\n" requirements.

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v9

When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/filesystems/proc.txt |   18 +++++
 fs/proc/array.c                    |  121 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    1 
 fs/proc/internal.h                 |    1 
 4 files changed, 141 insertions(+)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.7   /proc/<pid>/task/<tid>/children - Information about task children
 
   4	Configuring procfs
   4.1	Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
 comm value.
 
 
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results and some children might be
+skipped, especially if they've exited right after we printed their
+pids, so one need to either stop or freeze processes being inspected
+if precise results are needed.
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,124 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+static struct pid *
+get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(proc_pid(inode), PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			if (list_is_last(&task->sibling, &start->children))
+				goto out;
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			pid = get_pid(task_pid(task));
+			goto out;
+		}
+	}
+
+	/*
+	 * Slow search case.
+	 *
+	 * We might miss some children here if children
+	 * are exited while we were not holding the lock,
+	 * but it was never promised to be accurate that
+	 * much.
+	 *
+	 * "Just suppose that the parent sleeps, but N children
+	 *  exit after we printed their tids. Now the slow paths
+	 *  skips N extra children, we miss N tasks." (c)
+	 *
+	 * So one need to stop or freeze the leader and all
+	 * its children to get a precise result.
+	 */
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			pid = get_pid(task_pid(task));
+			break;
+		}
+	}
+
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct inode *inode = seq->private;
+	pid_t pid;
+
+	pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
+	return seq_printf(seq, "%d ", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct pid *pid;
+
+	pid = get_children_pid(seq->private, v, *pos + 1);
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open(file, &children_seq_ops);
+	if (ret)
+		return ret;
+
+	m = file->private_data;
+	m->private = inode;
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3384,6 +3384,7 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+	REG("children",  S_IRUGO, proc_tid_children_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  7:21         ` Cyrill Gorcunov
@ 2012-01-24  8:52           ` Eric W. Biederman
  2012-01-24  9:11             ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Eric W. Biederman @ 2012-01-24  8:52 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Alexey Dobriyan

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Tue, Jan 24, 2012 at 04:07:09PM +0900, KAMEZAWA Hiroyuki wrote:
>> > 
>> > Hmm. But userspace app will get eof, so frankly I don't see
>> > a problem here. Or maybe I miss something?
>> > 
>> 
>> Userspace need to take care of whether there may be"\n" or not even
>> if read() returns EOF.
>> As an interface, it's BUG to say "\n" will be there if you're lucky!"
>> (*) I know script language can handle this but we shouldn't assume that.
>> 
>> How about just remove "\n" at EOF  ? I think it's unnecessary.
>> 
>
> Sure thing, it's not a problem to remove it completely.

Foolish question.  Is there any reason why this is a file instead
of being the obvious directory full of symlinks?

Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  8:52           ` Eric W. Biederman
@ 2012-01-24  9:11             ` Cyrill Gorcunov
  2012-01-25  1:14               ` KOSAKI Motohiro
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24  9:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Alexey Dobriyan

On Tue, Jan 24, 2012 at 12:52:03AM -0800, Eric W. Biederman wrote:
> Cyrill Gorcunov <gorcunov@gmail.com> writes:
> 
> > On Tue, Jan 24, 2012 at 04:07:09PM +0900, KAMEZAWA Hiroyuki wrote:
> >> > 
> >> > Hmm. But userspace app will get eof, so frankly I don't see
> >> > a problem here. Or maybe I miss something?
> >> > 
> >> 
> >> Userspace need to take care of whether there may be"\n" or not even
> >> if read() returns EOF.
> >> As an interface, it's BUG to say "\n" will be there if you're lucky!"
> >> (*) I know script language can handle this but we shouldn't assume that.
> >> 
> >> How about just remove "\n" at EOF  ? I think it's unnecessary.
> >> 
> >
> > Sure thing, it's not a problem to remove it completely.
> 
> Foolish question.  Is there any reason why this is a file instead
> of being the obvious directory full of symlinks?
> 

How would these symlinks look like? "../../pid"? There were a conversation
about such things (https://lkml.org/lkml/2011/12/2/142) but I suppose we
were agree on children with pids as consensus.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24  8:48                 ` Cyrill Gorcunov
@ 2012-01-24 20:20                   ` KOSAKI Motohiro
  2012-01-24 20:26                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: KOSAKI Motohiro @ 2012-01-24 20:20 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KAMEZAWA Hiroyuki, H. Peter Anvin, linux-kernel, Andrew Morton,
	Pavel Emelyanov, Serge Hallyn, Kees Cook, Tejun Heo,
	Andrew Vagin, Eric W. Biederman, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

>> please do as you like.
>
> So it should be something like below I think...

Looks ok this version to me. So, if you fix other developers pointed
issue, I'll ack this.

thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 20:20                   ` KOSAKI Motohiro
@ 2012-01-24 20:26                     ` Cyrill Gorcunov
  2012-01-24 20:44                       ` Eric W. Biederman
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24 20:26 UTC (permalink / raw)
  To: KOSAKI Motohiro, H. Peter Anvin, Eric W. Biederman
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Alexey Dobriyan, Ingo Molnar, Thomas Gleixner, Glauber Costa,
	Andi Kleen, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 03:20:26PM -0500, KOSAKI Motohiro wrote:
> >> please do as you like.
> >
> > So it should be something like below I think...
> 
> Looks ok this version to me. So, if you fix other developers pointed
> issue, I'll ack this.
> 

Thanks!

Eric, so mm/ would be fine or I still should move it to kernel/
instead? I've addressed other issues I hope.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 20:26                     ` Cyrill Gorcunov
@ 2012-01-24 20:44                       ` Eric W. Biederman
  2012-01-24 20:50                         ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Eric W. Biederman @ 2012-01-24 20:44 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, H. Peter Anvin, KAMEZAWA Hiroyuki, linux-kernel,
	Andrew Morton, Pavel Emelyanov, Serge Hallyn, Kees Cook,
	Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Tue, Jan 24, 2012 at 03:20:26PM -0500, KOSAKI Motohiro wrote:
>> >> please do as you like.
>> >
>> > So it should be something like below I think...
>> 
>> Looks ok this version to me. So, if you fix other developers pointed
>> issue, I'll ack this.
>> 
>
> Thanks!
>
> Eric, so mm/ would be fine or I still should move it to kernel/
> instead? I've addressed other issues I hope.

The world won't fall apart if the code lands in mm.  I have a strong
preference for kernel/.  I just don't see anything at all memory
management like about that code.  Even the fact that you are
comparing pointers is an implementation detail.

Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 20:44                       ` Eric W. Biederman
@ 2012-01-24 20:50                         ` Cyrill Gorcunov
  2012-01-24 21:20                           ` Eric W. Biederman
                                             ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24 20:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KOSAKI Motohiro, H. Peter Anvin, KAMEZAWA Hiroyuki, linux-kernel,
	Andrew Morton, Pavel Emelyanov, Serge Hallyn, Kees Cook,
	Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 12:44:59PM -0800, Eric W. Biederman wrote:
> Cyrill Gorcunov <gorcunov@gmail.com> writes:
> 
> > On Tue, Jan 24, 2012 at 03:20:26PM -0500, KOSAKI Motohiro wrote:
> >> >> please do as you like.
> >> >
> >> > So it should be something like below I think...
> >> 
> >> Looks ok this version to me. So, if you fix other developers pointed
> >> issue, I'll ack this.
> >> 
> >
> > Thanks!
> >
> > Eric, so mm/ would be fine or I still should move it to kernel/
> > instead? I've addressed other issues I hope.
> 
> The world won't fall apart if the code lands in mm.  I have a strong
> preference for kernel/.  I just don't see anything at all memory
> management like about that code.  Even the fact that you are
> comparing pointers is an implementation detail.

This one should fit all requirements I guess.

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [RFC] syscalls, x86: Add __NR_kcmp syscall v6

While doing the checkpoint-restore in the userspace one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are shared
between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparision' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 agruments - the pids of the two tasks (which
characteristics should be compared), the comparision type and
(in case of comparision of files) two file descriptors.

At moment only x86 is supported.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Glauber Costa <glommer@parallels.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tejun Heo <tj@kernel.org>
CC: Matt Helsley <matthltc@us.ibm.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Valdis.Kletnieks@vt.edu
---
 arch/x86/include/asm/syscalls.h  |    4 
 arch/x86/syscalls/syscall_32.tbl |    1 
 arch/x86/syscalls/syscall_64.tbl |    1 
 include/linux/kcmp.h             |   17 ++++
 kernel/Makefile                  |    1 
 kernel/kcmp.c                    |  163 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 187 insertions(+)

Index: linux-2.6.git/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/syscalls.h
+++ linux-2.6.git/arch/x86/include/asm/syscalls.h
@@ -42,6 +42,10 @@ long sys_sigaltstack(const stack_t __use
 asmlinkage int sys_set_thread_area(struct user_desc __user *);
 asmlinkage int sys_get_thread_area(struct user_desc __user *);
 
+/* kernel/kcmp.c */
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
+
 /* X86_32 only */
 #ifdef CONFIG_X86_32
 
Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
Index: linux-2.6.git/include/linux/kcmp.h
===================================================================
--- /dev/null
+++ linux-2.6.git/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparision type */
+enum {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
Index: linux-2.6.git/kernel/Makefile
===================================================================
--- linux-2.6.git.orig/kernel/Makefile
+++ linux-2.6.git/kernel/Makefile
@@ -25,6 +25,7 @@ endif
 obj-y += sched/
 obj-y += power/
 
+obj-$(CONFIG_X86) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
Index: linux-2.6.git/kernel/kcmp.c
===================================================================
--- /dev/null
+++ linux-2.6.git/kernel/kcmp.c
@@ -0,0 +1,163 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal
+ * 1 - less than
+ * 2 - greater than
+ * 3 - not equal but ordering unavailable
+ */
+static int kcmp_ptr(long v1, long v2, int type)
+{
+	long ret;
+
+	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+#define KCMP_TASK_PTR(task1, task2, member, type)	\
+	kcmp_ptr((long)(task1)->member,			\
+		 (long)(task2)->member,			\
+		 type)
+
+#define KCMP_PTR(ptr1, ptr2, type)			\
+	kcmp_ptr((long)ptr1, (long)ptr2, type)
+
+/* A caller must be sure the task is presented in memory */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	spin_lock(&task->files->file_lock);
+	fdt = files_fdtable(task->files);
+	if (idx < fdt->max_fds)
+		file = fdt->fd[idx];
+	else
+		file = NULL;
+	spin_unlock(&task->files->file_lock);
+
+	return file;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1;
+	struct task_struct *task2;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	task1 = find_task_by_vpid(pid1);
+	if (!task1) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	task2 = find_task_by_vpid(pid2);
+	if (!task2) {
+		put_task_struct(task1);
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	/*
+	 * Note for all cases but the KCMP_FILE we
+	 * don't take any locks in a sake of speed.
+	 */
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = KCMP_PTR(filp1, filp2, KCMP_FILE);
+		else
+			ret = -ENOENT;
+		break;
+	}
+	case KCMP_VM:
+		ret = KCMP_TASK_PTR(task1, task2, mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = KCMP_TASK_PTR(task1, task2, files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = KCMP_TASK_PTR(task1, task2, fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = KCMP_TASK_PTR(task1, task2, sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = KCMP_TASK_PTR(task1, task2, io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = KCMP_TASK_PTR(task1, task2, sysvsem.undo_list, KCMP_SYSVSEM);
+#else
+		ret = -ENOENT;
+		goto err;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto err;
+	}
+
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+}
+
+static __init int kcmp_cookie_init(void)
+{
+	int i, j;
+
+	for (i = 0; i < KCMP_TYPES; i++) {
+		for (j = 0; j < 2; j++) {
+			get_random_bytes(&cookies[i][j],
+					 sizeof(cookies[i][j]));
+		}
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+	}
+
+	return 0;
+}
+late_initcall(kcmp_cookie_init);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 20:50                         ` Cyrill Gorcunov
@ 2012-01-24 21:20                           ` Eric W. Biederman
  2012-01-24 21:34                             ` Cyrill Gorcunov
  2012-01-24 21:22                           ` Andrew Morton
  2012-01-24 21:25                           ` Andrew Morton
  2 siblings, 1 reply; 60+ messages in thread
From: Eric W. Biederman @ 2012-01-24 21:20 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, H. Peter Anvin, KAMEZAWA Hiroyuki, linux-kernel,
	Andrew Morton, Pavel Emelyanov, Serge Hallyn, Kees Cook,
	Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Tue, Jan 24, 2012 at 12:44:59PM -0800, Eric W. Biederman wrote:
>> Cyrill Gorcunov <gorcunov@gmail.com> writes:
>> 
>> > On Tue, Jan 24, 2012 at 03:20:26PM -0500, KOSAKI Motohiro wrote:
>> >> >> please do as you like.
>> >> >
>> >> > So it should be something like below I think...
>> >> 
>> >> Looks ok this version to me. So, if you fix other developers pointed
>> >> issue, I'll ack this.
>> >> 
>> >
>> > Thanks!
>> >
>> > Eric, so mm/ would be fine or I still should move it to kernel/
>> > instead? I've addressed other issues I hope.
>> 
>> The world won't fall apart if the code lands in mm.  I have a strong
>> preference for kernel/.  I just don't see anything at all memory
>> management like about that code.  Even the fact that you are
>> comparing pointers is an implementation detail.
>
> This one should fit all requirements I guess.

Bahahaha!

Looking I see one  more nit.

You need an entry in include/linux/syscalls.h

Eric

> ---
> From: Cyrill Gorcunov <gorcunov@openvz.org>
> Subject: [RFC] syscalls, x86: Add __NR_kcmp syscall v6
>
> While doing the checkpoint-restore in the userspace one need to determine
> whether various kernel objects (like mm_struct-s of file_struct-s) are shared
> between tasks and restore this state.
>
> The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
> syscall, while there's currently no ways for solving the 1st one.
>
> One of the ways for checking whether two tasks share e.g. mm_struct is to
> provide some mm_struct ID of a task to its proc file, but showing such
> info considered to be not that good for security reasons.
>
> Thus after some debates we end up in conclusion that using that named
> 'comparision' syscall might be the best candidate. So here is it --
> __NR_kcmp.
>
> It takes up to 5 agruments - the pids of the two tasks (which
> characteristics should be compared), the comparision type and
> (in case of comparision of files) two file descriptors.
>
> At moment only x86 is supported.
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: "Eric W. Biederman" <ebiederm@xmission.com>
> CC: Pavel Emelyanov <xemul@parallels.com>
> CC: Andrey Vagin <avagin@openvz.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: H. Peter Anvin <hpa@zytor.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Glauber Costa <glommer@parallels.com>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Tejun Heo <tj@kernel.org>
> CC: Matt Helsley <matthltc@us.ibm.com>
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Vasiliy Kulikov <segoon@openwall.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Valdis.Kletnieks@vt.edu
> ---
>  arch/x86/include/asm/syscalls.h  |    4 
>  arch/x86/syscalls/syscall_32.tbl |    1 
>  arch/x86/syscalls/syscall_64.tbl |    1 
>  include/linux/kcmp.h             |   17 ++++
>  kernel/Makefile                  |    1 
>  kernel/kcmp.c                    |  163 +++++++++++++++++++++++++++++++++++++++
>  6 files changed, 187 insertions(+)
>
> Index: linux-2.6.git/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/include/asm/syscalls.h
> +++ linux-2.6.git/arch/x86/include/asm/syscalls.h
> @@ -42,6 +42,10 @@ long sys_sigaltstack(const stack_t __use
>  asmlinkage int sys_set_thread_area(struct user_desc __user *);
>  asmlinkage int sys_get_thread_area(struct user_desc __user *);
>  
> +/* kernel/kcmp.c */
> +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> +			 unsigned long idx1, unsigned long idx2);
> +
>  /* X86_32 only */
>  #ifdef CONFIG_X86_32
>  
> Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
> +++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
> @@ -355,3 +355,4 @@
>  346	i386	setns			sys_setns
>  347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
>  348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
> +349	i386	kcmp			sys_kcmp
> Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
> +++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
> @@ -318,3 +318,4 @@
>  309	64	getcpu			sys_getcpu
>  310	64	process_vm_readv	sys_process_vm_readv
>  311	64	process_vm_writev	sys_process_vm_writev
> +312	64	kcmp			sys_kcmp
> Index: linux-2.6.git/include/linux/kcmp.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/include/linux/kcmp.h
> @@ -0,0 +1,17 @@
> +#ifndef _LINUX_KCMP_H
> +#define _LINUX_KCMP_H
> +
> +/* Comparision type */
> +enum {
> +	KCMP_FILE,
> +	KCMP_VM,
> +	KCMP_FILES,
> +	KCMP_FS,
> +	KCMP_SIGHAND,
> +	KCMP_IO,
> +	KCMP_SYSVSEM,
> +
> +	KCMP_TYPES,
> +};
> +
> +#endif /* _LINUX_KCMP_H */
> Index: linux-2.6.git/kernel/Makefile
> ===================================================================
> --- linux-2.6.git.orig/kernel/Makefile
> +++ linux-2.6.git/kernel/Makefile
> @@ -25,6 +25,7 @@ endif
>  obj-y += sched/
>  obj-y += power/
>  
> +obj-$(CONFIG_X86) += kcmp.o
>  obj-$(CONFIG_FREEZER) += freezer.o
>  obj-$(CONFIG_PROFILING) += profile.o
>  obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
> Index: linux-2.6.git/kernel/kcmp.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -0,0 +1,163 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <linux/fdtable.h>
> +#include <linux/string.h>
> +#include <linux/random.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/cache.h>
> +#include <linux/bug.h>
> +#include <linux/err.h>
> +#include <linux/kcmp.h>
> +
> +#include <asm/unistd.h>
> +
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable
> + */
> +static int kcmp_ptr(long v1, long v2, int type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +#define KCMP_TASK_PTR(task1, task2, member, type)	\
> +	kcmp_ptr((long)(task1)->member,			\
> +		 (long)(task2)->member,			\
> +		 type)
> +
> +#define KCMP_PTR(ptr1, ptr2, type)			\
> +	kcmp_ptr((long)ptr1, (long)ptr2, type)
> +
> +/* A caller must be sure the task is presented in memory */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	spin_lock(&task->files->file_lock);
> +	fdt = files_fdtable(task->files);
> +	if (idx < fdt->max_fds)
> +		file = fdt->fd[idx];
> +	else
> +		file = NULL;
> +	spin_unlock(&task->files->file_lock);
> +
> +	return file;
> +}
> +
> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> +		unsigned long, idx1, unsigned long, idx2)
> +{
> +	struct task_struct *task1;
> +	struct task_struct *task2;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	task1 = find_task_by_vpid(pid1);
> +	if (!task1) {
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	task2 = find_task_by_vpid(pid2);
> +	if (!task2) {
> +		put_task_struct(task1);
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	get_task_struct(task1);
> +	get_task_struct(task2);
> +
> +	rcu_read_unlock();
> +
> +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> +		ret = -EACCES;
> +		goto err;
> +	}
> +
> +	/*
> +	 * Note for all cases but the KCMP_FILE we
> +	 * don't take any locks in a sake of speed.
> +	 */
> +
> +	switch (type) {
> +	case KCMP_FILE: {
> +		struct file *filp1, *filp2;
> +
> +		filp1 = get_file_raw_ptr(task1, idx1);
> +		filp2 = get_file_raw_ptr(task2, idx2);
> +
> +		if (filp1 && filp2)
> +			ret = KCMP_PTR(filp1, filp2, KCMP_FILE);
> +		else
> +			ret = -ENOENT;
> +		break;
> +	}
> +	case KCMP_VM:
> +		ret = KCMP_TASK_PTR(task1, task2, mm, KCMP_VM);
> +		break;
> +	case KCMP_FILES:
> +		ret = KCMP_TASK_PTR(task1, task2, files, KCMP_FILES);
> +		break;
> +	case KCMP_FS:
> +		ret = KCMP_TASK_PTR(task1, task2, fs, KCMP_FS);
> +		break;
> +	case KCMP_SIGHAND:
> +		ret = KCMP_TASK_PTR(task1, task2, sighand, KCMP_SIGHAND);
> +		break;
> +	case KCMP_IO:
> +		ret = KCMP_TASK_PTR(task1, task2, io_context, KCMP_IO);
> +		break;
> +	case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> +		ret = KCMP_TASK_PTR(task1, task2, sysvsem.undo_list, KCMP_SYSVSEM);
> +#else
> +		ret = -ENOENT;
> +		goto err;
> +#endif
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +err:
> +	put_task_struct(task1);
> +	put_task_struct(task2);
> +
> +	return ret;
> +}
> +
> +static __init int kcmp_cookie_init(void)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < KCMP_TYPES; i++) {
> +		for (j = 0; j < 2; j++) {
> +			get_random_bytes(&cookies[i][j],
> +					 sizeof(cookies[i][j]));
> +		}
> +		cookies[i][1] |= (~(~0UL >>  1) | 1);
> +	}
> +
> +	return 0;
> +}
> +late_initcall(kcmp_cookie_init);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 20:50                         ` Cyrill Gorcunov
  2012-01-24 21:20                           ` Eric W. Biederman
@ 2012-01-24 21:22                           ` Andrew Morton
  2012-01-24 21:45                             ` Andrew Morton
                                               ` (2 more replies)
  2012-01-24 21:25                           ` Andrew Morton
  2 siblings, 3 replies; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 21:22 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Wed, 25 Jan 2012 00:50:39 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> This one should fit all requirements I guess.

[wakes up]

> While doing the checkpoint-restore in the userspace one need to determine
> whether various kernel objects (like mm_struct-s of file_struct-s) are shared
> between tasks and restore this state.
> 
> The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
> syscall, while there's currently no ways for solving the 1st one.
> 
> One of the ways for checking whether two tasks share e.g. mm_struct is to
> provide some mm_struct ID of a task to its proc file, but showing such
> info considered to be not that good for security reasons.
> 
> Thus after some debates we end up in conclusion that using that named
> 'comparision' syscall might be the best candidate. So here is it --
> __NR_kcmp.
> 
> It takes up to 5 agruments - the pids of the two tasks (which
> characteristics should be compared), the comparision type and
> (in case of comparision of files) two file descriptors.

PIDs are not unique.  One wonders what happens in this syscall if the
same pid appears in two namespaces.

<reads the code>

Seems that it performs lookups only in the caller's PID namespace. 
Maybe this is appropriate but it should be described and justified in
the changelog and in code comments, please.  And in the forthcoming
manpage ;)

> At moment only x86 is supported.

Presumably you have a test app.  Please let's include that app in
tools/testing/selftests/ for arch maintainers and others to use and
maintain.

>
> ...
>
> --- /dev/null
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -0,0 +1,163 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <linux/fdtable.h>
> +#include <linux/string.h>
> +#include <linux/random.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/cache.h>
> +#include <linux/bug.h>
> +#include <linux/err.h>
> +#include <linux/kcmp.h>
> +
> +#include <asm/unistd.h>
> +
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;

This reader of this code doesn't understand why all this cookie stuff
is in here.  Please include code comments which explain the reason for
the existence of this code.

> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable

what the heck does case 3 mean?  Why is it here?

> + */
> +static int kcmp_ptr(long v1, long v2, int type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +#define KCMP_TASK_PTR(task1, task2, member, type)	\
> +	kcmp_ptr((long)(task1)->member,			\
> +		 (long)(task2)->member,			\
> +		 type)
> +
> +#define KCMP_PTR(ptr1, ptr2, type)			\
> +	kcmp_ptr((long)ptr1, (long)ptr2, type)

ugh.  This:

static long kptr_obfuscate(void *p, enum you_forgot_to_name_the_enum type)
{
	return ((long)p ^ cookies[type][0]) * cookies[type][1];
}

static int kcmp_task_pointers(void *task1, void *task2, size_t field_offset,
				enum you_forgot_to_name_the_enum type)
{
	void **field1 = t1 + field_offset;	/* points to a pointer in the task_struct */
	void **field2 = t1 + field_offset;
	long diff;

	diff = kptr_obfuscate(*field1, type) - kptr_obfuscate(*field2, type);
	return (diff < 0) | ((diff > 0) << 1);
}

	...
	ret = kcmp_task_pointers(task1, task2, offsetof(task_struct, mm),
				KCMP_VM);
	...

see?  No nasty macros, it's type-correct and it uses only a single
explicit typecast.

> +/* A caller must be sure the task is presented in memory */

"The caller must have pinned the task"

> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	spin_lock(&task->files->file_lock);
> +	fdt = files_fdtable(task->files);
> +	if (idx < fdt->max_fds)
> +		file = fdt->fd[idx];
> +	else
> +		file = NULL;
> +	spin_unlock(&task->files->file_lock);
> +
> +	return file;
> +}
> +
> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> +		unsigned long, idx1, unsigned long, idx2)
> +{
> +	struct task_struct *task1;
> +	struct task_struct *task2;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	task1 = find_task_by_vpid(pid1);
> +	if (!task1) {
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	task2 = find_task_by_vpid(pid2);
> +	if (!task2) {
> +		put_task_struct(task1);
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	get_task_struct(task1);
> +	get_task_struct(task2);
> +
> +	rcu_read_unlock();
> +
> +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {

Add a comment explaining this decision.

> +		ret = -EACCES;
> +		goto err;
> +	}
> +
> +	/*
> +	 * Note for all cases but the KCMP_FILE we
> +	 * don't take any locks in a sake of speed.
> +	 */
> +
> +	switch (type) {
> +	case KCMP_FILE: {
> +		struct file *filp1, *filp2;
> +
> +		filp1 = get_file_raw_ptr(task1, idx1);
> +		filp2 = get_file_raw_ptr(task2, idx2);
> +
> +		if (filp1 && filp2)
> +			ret = KCMP_PTR(filp1, filp2, KCMP_FILE);
> +		else
> +			ret = -ENOENT;
> +		break;
> +	}
> +	case KCMP_VM:
> +		ret = KCMP_TASK_PTR(task1, task2, mm, KCMP_VM);
> +		break;
> +	case KCMP_FILES:
> +		ret = KCMP_TASK_PTR(task1, task2, files, KCMP_FILES);
> +		break;
> +	case KCMP_FS:
> +		ret = KCMP_TASK_PTR(task1, task2, fs, KCMP_FS);
> +		break;
> +	case KCMP_SIGHAND:
> +		ret = KCMP_TASK_PTR(task1, task2, sighand, KCMP_SIGHAND);
> +		break;
> +	case KCMP_IO:
> +		ret = KCMP_TASK_PTR(task1, task2, io_context, KCMP_IO);
> +		break;
> +	case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> +		ret = KCMP_TASK_PTR(task1, task2, sysvsem.undo_list, KCMP_SYSVSEM);
> +#else
> +		ret = -ENOENT;

ENOENT seems inappropriate here.

> +		goto err;
> +#endif
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +err:
> +	put_task_struct(task1);
> +	put_task_struct(task2);
> +
> +	return ret;
> +}
> +
> +static __init int kcmp_cookie_init(void)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < KCMP_TYPES; i++) {
> +		for (j = 0; j < 2; j++) {
> +			get_random_bytes(&cookies[i][j],
> +					 sizeof(cookies[i][j]));
> +		}
> +		cookies[i][1] |= (~(~0UL >>  1) | 1);

hm, what's the point in writing a random number to cookies[i][1] and
then immediately overwriting that with a constant?

> +	}
> +
> +	return 0;
> +}
> +late_initcall(kcmp_cookie_init);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 20:50                         ` Cyrill Gorcunov
  2012-01-24 21:20                           ` Eric W. Biederman
  2012-01-24 21:22                           ` Andrew Morton
@ 2012-01-24 21:25                           ` Andrew Morton
  2012-01-24 21:31                             ` Cyrill Gorcunov
  2 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 21:25 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Wed, 25 Jan 2012 00:50:39 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> While doing the checkpoint-restore in the userspace

btw, some questions regarding the overall project: do you guys actually
have working userspace code which can at least partially perform a c/r?
What state is all this in?

I'd be a bit concerned if we're putting code into mainline which has
not been demonstrated to achieve any form of successful
checkpoint/restore.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:25                           ` Andrew Morton
@ 2012-01-24 21:31                             ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24 21:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 01:25:16PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2012 00:50:39 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > While doing the checkpoint-restore in the userspace
> 
> btw, some questions regarding the overall project: do you guys actually
> have working userspace code which can at least partially perform a c/r?
> What state is all this in?
> 

Yes, we have a number of test-cases which do c/r. See details on
http://criu.org/ . We don't announce it yet since everything is under
heavy development.

> I'd be a bit concerned if we're putting code into mainline which has
> not been demonstrated to achieve any form of successful checkpoint/restore.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:20                           ` Eric W. Biederman
@ 2012-01-24 21:34                             ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24 21:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KOSAKI Motohiro, H. Peter Anvin, KAMEZAWA Hiroyuki, linux-kernel,
	Andrew Morton, Pavel Emelyanov, Serge Hallyn, Kees Cook,
	Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 01:20:12PM -0800, Eric W. Biederman wrote:
...
> >
> > This one should fit all requirements I guess.
> 
> Bahahaha!
> 
> Looking I see one  more nit.
> 
> You need an entry in include/linux/syscalls.h

Sigh.. ;) Not just this. I'll post updated version to address
Andrew's comments too once I finish it (not today probably).

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:22                           ` Andrew Morton
@ 2012-01-24 21:45                             ` Andrew Morton
  2012-01-24 21:46                               ` H. Peter Anvin
  2012-01-24 21:46                             ` Cyrill Gorcunov
  2012-01-24 22:54                             ` Eric W. Biederman
  2 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 21:45 UTC (permalink / raw)
  To: Cyrill Gorcunov, Eric W. Biederman, KOSAKI Motohiro,
	H. Peter Anvin, KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov,
	Serge Hallyn, Kees Cook, Tejun Heo, Andrew Vagin,
	Alexey Dobriyan, Ingo Molnar, Thomas Gleixner, Glauber Costa,
	Andi Kleen, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Valdis.Kletnieks

On Tue, 24 Jan 2012 13:22:22 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> static int kcmp_task_pointers(void *task1, void *task2, size_t field_offset,
> 				enum you_forgot_to_name_the_enum type)
> {
> 	void **field1 = t1 + field_offset;	/* points to a pointer in the task_struct */
> 	void **field2 = t1 + field_offset;

On reflection, this was being too cute.  It would be better to make the
function type-safer and just put up with the local typecasts:

static int kcmp_task_pointers(struct task_struct *task1,
			struct task_struct *task2, size_t field_offset,
			enum you_forgot_to_name_the_enum type)
{
	void **field1 = (void *)t1 + field_offset;
	void **field2 = (void *)t2 + field_offset;


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:22                           ` Andrew Morton
  2012-01-24 21:45                             ` Andrew Morton
@ 2012-01-24 21:46                             ` Cyrill Gorcunov
  2012-01-24 21:59                               ` Andrew Morton
  2012-01-24 22:54                             ` Eric W. Biederman
  2 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-24 21:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, Jan 24, 2012 at 01:22:22PM -0800, Andrew Morton wrote:
> 
> PIDs are not unique.  One wonders what happens in this syscall if the
> same pid appears in two namespaces.
> 
> <reads the code>
> 
> Seems that it performs lookups only in the caller's PID namespace. 
> Maybe this is appropriate but it should be described and justified in
> the changelog and in code comments, please.  And in the forthcoming
> manpage ;)
> 

Yes, caller's namespace was used intentionally, will add comments (manpage
makes me shiver).

> > At moment only x86 is supported.
> 
> Presumably you have a test app.  Please let's include that app in
> tools/testing/selftests/ for arch maintainers and others to use and
> maintain.

ok

> > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> 
> This reader of this code doesn't understand why all this cookie stuff
> is in here.  Please include code comments which explain the reason for
> the existence of this code.
> 

ok

> > +static long kptr_obfuscate(long v, int type)
> > +{
> > +	return (v ^ cookies[type][0]) * cookies[type][1];
> > +}
> > +
> > +/*
> > + * 0 - equal
> > + * 1 - less than
> > + * 2 - greater than
> > + * 3 - not equal but ordering unavailable
> 
> what the heck does case 3 mean?  Why is it here?
> 

I'll add a comment. It's reserved for case where we might
need to disable gt/lt comparision result. Probably in future.

> > +
> > +#define KCMP_PTR(ptr1, ptr2, type)			\
> > +	kcmp_ptr((long)ptr1, (long)ptr2, type)
> 
> ugh.  This:
> 
> static long kptr_obfuscate(void *p, enum you_forgot_to_name_the_enum type)
> {
> 	return ((long)p ^ cookies[type][0]) * cookies[type][1];
> }
> 
> static int kcmp_task_pointers(void *task1, void *task2, size_t field_offset,
> 				enum you_forgot_to_name_the_enum type)
> {
> 	void **field1 = t1 + field_offset;	/* points to a pointer in the task_struct */
> 	void **field2 = t1 + field_offset;
> 	long diff;
> 
> 	diff = kptr_obfuscate(*field1, type) - kptr_obfuscate(*field2, type);
> 	return (diff < 0) | ((diff > 0) << 1);
> }
> 
> 	...
> 	ret = kcmp_task_pointers(task1, task2, offsetof(task_struct, mm),
> 				KCMP_VM);
> 	...
> 
> see?  No nasty macros, it's type-correct and it uses only a single
> explicit typecast.
> 

ok, i'll change it of course, but I personally like macros version more.

> > +/* A caller must be sure the task is presented in memory */
> "The caller must have pinned the task"
> 
> > +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> > +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> 
> Add a comment explaining this decision.
> 

OK.

> 
> ENOENT seems inappropriate here.
>

Which one should be better?

> > +static __init int kcmp_cookie_init(void)
> > +{
> > +	int i, j;
> > +
> > +	for (i = 0; i < KCMP_TYPES; i++) {
> > +		for (j = 0; j < 2; j++) {
> > +			get_random_bytes(&cookies[i][j],
> > +					 sizeof(cookies[i][j]));
> > +		}
> > +		cookies[i][1] |= (~(~0UL >>  1) | 1);
> 
> hm, what's the point in writing a random number to cookies[i][1] and
> then immediately overwriting that with a constant?

It's '|=' , not '='.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:45                             ` Andrew Morton
@ 2012-01-24 21:46                               ` H. Peter Anvin
  2012-01-24 22:00                                 ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: H. Peter Anvin @ 2012-01-24 21:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Eric W. Biederman, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On 01/24/2012 01:45 PM, Andrew Morton wrote:
> On Tue, 24 Jan 2012 13:22:22 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>> static int kcmp_task_pointers(void *task1, void *task2, size_t field_offset,
>> 				enum you_forgot_to_name_the_enum type)
>> {
>> 	void **field1 = t1 + field_offset;	/* points to a pointer in the task_struct */
>> 	void **field2 = t1 + field_offset;
> 
> On reflection, this was being too cute.  It would be better to make the
> function type-safer and just put up with the local typecasts:
> 
> static int kcmp_task_pointers(struct task_struct *task1,
> 			struct task_struct *task2, size_t field_offset,
> 			enum you_forgot_to_name_the_enum type)
> {
> 	void **field1 = (void *)t1 + field_offset;
> 	void **field2 = (void *)t2 + field_offset;
> 

Arithmetic on void pointers?

	-hpa


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:46                             ` Cyrill Gorcunov
@ 2012-01-24 21:59                               ` Andrew Morton
  0 siblings, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 21:59 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Wed, 25 Jan 2012 01:46:30 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> > 
> > ENOENT seems inappropriate here.
> >
> 
> Which one should be better?

EINVAL, I guess.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:46                               ` H. Peter Anvin
@ 2012-01-24 22:00                                 ` Andrew Morton
  2012-01-24 22:52                                   ` H. Peter Anvin
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 22:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Cyrill Gorcunov, Eric W. Biederman, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, 24 Jan 2012 13:46:31 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On 01/24/2012 01:45 PM, Andrew Morton wrote:
> > On Tue, 24 Jan 2012 13:22:22 -0800
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> >> static int kcmp_task_pointers(void *task1, void *task2, size_t field_offset,
> >> 				enum you_forgot_to_name_the_enum type)
> >> {
> >> 	void **field1 = t1 + field_offset;	/* points to a pointer in the task_struct */
> >> 	void **field2 = t1 + field_offset;
> > 
> > On reflection, this was being too cute.  It would be better to make the
> > function type-safer and just put up with the local typecasts:
> > 
> > static int kcmp_task_pointers(struct task_struct *task1,
> > 			struct task_struct *task2, size_t field_offset,
> > 			enum you_forgot_to_name_the_enum type)
> > {
> > 	void **field1 = (void *)t1 + field_offset;
> > 	void **field2 = (void *)t2 + field_offset;
> > 
> 
> Arithmetic on void pointers?
> 

All the world is gcc!

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 22:00                                 ` Andrew Morton
@ 2012-01-24 22:52                                   ` H. Peter Anvin
  2012-01-24 23:42                                     ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: H. Peter Anvin @ 2012-01-24 22:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Eric W. Biederman, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On 01/24/2012 02:00 PM, Andrew Morton wrote:
>>
>> Arithmetic on void pointers?
> 
> All the world is gcc!

Even so, warnings are annoying...

	-hpa

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 22:54                             ` Eric W. Biederman
@ 2012-01-24 22:54                               ` Andrew Morton
  0 siblings, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 22:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cyrill Gorcunov, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, 24 Jan 2012 14:54:18 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Simply open coding the comparison would be better. 
> 
>         ret = kcmp_ptr(task1->files, task2->files, type);

yup, that's nice.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 21:22                           ` Andrew Morton
  2012-01-24 21:45                             ` Andrew Morton
  2012-01-24 21:46                             ` Cyrill Gorcunov
@ 2012-01-24 22:54                             ` Eric W. Biederman
  2012-01-24 22:54                               ` Andrew Morton
  2 siblings, 1 reply; 60+ messages in thread
From: Eric W. Biederman @ 2012-01-24 22:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, KOSAKI Motohiro, H. Peter Anvin,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

Andrew Morton <akpm@linux-foundation.org> writes:


> <reads the code>
>
> Seems that it performs lookups only in the caller's PID namespace. 
> Maybe this is appropriate but it should be described and justified in
> the changelog and in code comments, please.  And in the forthcoming
> manpage ;)

Well pids should always and only be looked up in the callers pid
namespace.  Any other behavior is broken.  It is probably worth
a mention in a manpage but you should not need to justify using
abstractions as they were designed to be used.


>> +static int kcmp_ptr(long v1, long v2, int type)
>> +{
>> +	long ret;
>> +
>> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
>> +
>> +	return (ret < 0) | ((ret > 0) << 1);
>> +}
>> +
>> +#define KCMP_TASK_PTR(task1, task2, member, type)	\
>> +	kcmp_ptr((long)(task1)->member,			\
>> +		 (long)(task2)->member,			\
>> +		 type)
>> +
>> +#define KCMP_PTR(ptr1, ptr2, type)			\
>> +	kcmp_ptr((long)ptr1, (long)ptr2, type)
>
> ugh.  This:
>
> static long kptr_obfuscate(void *p, enum you_forgot_to_name_the_enum type)
> {
> 	return ((long)p ^ cookies[type][0]) * cookies[type][1];
> }
>
> static int kcmp_task_pointers(void *task1, void *task2, size_t field_offset,
> 				enum you_forgot_to_name_the_enum type)
> {
> 	void **field1 = t1 + field_offset;	/* points to a pointer in the task_struct */
> 	void **field2 = t1 + field_offset;
> 	long diff;
>
> 	diff = kptr_obfuscate(*field1, type) - kptr_obfuscate(*field2, type);
> 	return (diff < 0) | ((diff > 0) << 1);
> }
>
> 	...
> 	ret = kcmp_task_pointers(task1, task2, offsetof(task_struct, mm),
> 				KCMP_VM);
> 	...
>
> see?  No nasty macros, it's type-correct and it uses only a single
> explicit typecast.

Seriously?  Simply open coding the comparison would be better. 

        ret = kcmp_ptr(task1->files, task2->files, type);

All pointers are not encoded the same as void * pointers.  Admittedly
the only case I can think of are function pointers on Itanium, but
what is a little wrong today can easily become a lot wrong tomorrow.

Making the kcmp_ptr arguments void * seems the way to go though.

Now there is one interesting case we are not handling properly.
If any of our pointers can be NULL which I think happens in the
file case we should return -EBADF instead of reporting two NULL
pointers point to the same object.

Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4
  2012-01-24 22:52                                   ` H. Peter Anvin
@ 2012-01-24 23:42                                     ` Andrew Morton
  0 siblings, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 23:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Cyrill Gorcunov, Eric W. Biederman, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, Tejun Heo, Andrew Vagin, Alexey Dobriyan, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Valdis.Kletnieks

On Tue, 24 Jan 2012 14:52:32 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On 01/24/2012 02:00 PM, Andrew Morton wrote:
> >>
> >> Arithmetic on void pointers?
> > 
> > All the world is gcc!
> 
> Even so, warnings are annoying...
> 

We already do arithmetic on void* in many places.  Run "make
W=3" and admire..


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
  2012-01-23 18:54   ` Kees Cook
  2012-01-24  2:07   ` KAMEZAWA Hiroyuki
@ 2012-01-24 23:53   ` Andrew Morton
  2012-01-25  6:52     ` Cyrill Gorcunov
  2 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 23:53 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Mon, 23 Jan 2012 18:20:37 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> When we do checkpoint of a task we need to know the list of children
> the task, has but there is no easy and fast way to generate reverse
> parent->children chain from arbitrary <pid> (while a parent pid is
> provided in "PPid" field of /proc/<pid>/status).
> 
> So instead of walking over all pids in the system (creating one big process
> tree in memory, just to figure out which children a task has) -- we add
> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> this kind of information but it is not yet exported.
> 
> This is a first level children, not the whole process tree.

Is there a reason for not putting this new code inside
CONFIG_CHECKPOINT_RESTART?  If so, please changelog that.

We could add

/* #ifdef CONFIG_CHECKPOINT_RESTART (reasoning goes here) */

so that if we should decide to pull all this code out again, a grep
will flag this code for consideration.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-23 14:20 ` [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
  2012-01-23 20:42   ` Kees Cook
@ 2012-01-24 23:59   ` Andrew Morton
  2012-01-25  6:54     ` Cyrill Gorcunov
  1 sibling, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2012-01-24 23:59 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Vasiliy Kulikov

On Mon, 23 Jan 2012 18:20:39 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> We would like to have an ability to restore command line
> arguments and envirion pointers so the task being restored
> would print appropriate values in /proc/pid/cmdline and
> /proc/pid/envirion. The exit_code is needed to restore
> zombie tasks.
> 

Please update Documentation/filesystems/proc.txt.

>
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -464,7 +464,8 @@ static int do_task_stat(struct seq_file
>  
>  	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
>  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
> +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu \
> +%lu %lu %lu %lu %d\n",
>  		pid_nr_ns(pid, ns),
>  		tcomm,
>  		state,
> @@ -514,7 +515,12 @@ static int do_task_stat(struct seq_file
>  		cputime_to_clock_t(cgtime),
>  		(mm && permitted) ? mm->start_data : 0,
>  		(mm && permitted) ? mm->end_data : 0,
> -		(mm && permitted) ? mm->start_brk : 0);
> +		(mm && permitted) ? mm->start_brk : 0,
> +		(mm && permitted) ? mm->arg_start : 0,
> +		(mm && permitted) ? mm->arg_end : 0,
> +		(mm && permitted) ? mm->env_start : 0,
> +		(mm && permitted) ? mm->env_end : 0,
> +		task->exit_code);
>  	if (mm)
>  		mmput(mm);
>  	return 0;

/proc/pid/stat is getting out of control.  People are now sending patches
because reading from this thing already takes too long.

The problem is that if you only want one field, you have to incur the
cost of preparing all the fields.  The magnitude of this problem
increases exponentially over time!

I'm unsure what to do about it really.  Perhaps add a new
/proc/pid/mmstat for MM-specific things.  We could put the above six
fields in there, as long as we move quickly.




^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24  9:11             ` Cyrill Gorcunov
@ 2012-01-25  1:14               ` KOSAKI Motohiro
  2012-01-25  2:11                 ` Eric W. Biederman
  0 siblings, 1 reply; 60+ messages in thread
From: KOSAKI Motohiro @ 2012-01-25  1:14 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, KAMEZAWA Hiroyuki, linux-kernel,
	Andrew Morton, Pavel Emelyanov, Serge Hallyn, Kees Cook,
	Tejun Heo, Andrew Vagin, Alexey Dobriyan

(1/24/12 4:11 AM), Cyrill Gorcunov wrote:
> On Tue, Jan 24, 2012 at 12:52:03AM -0800, Eric W. Biederman wrote:
>> Cyrill Gorcunov<gorcunov@gmail.com>  writes:
>>
>>> On Tue, Jan 24, 2012 at 04:07:09PM +0900, KAMEZAWA Hiroyuki wrote:
>>>>>
>>>>> Hmm. But userspace app will get eof, so frankly I don't see
>>>>> a problem here. Or maybe I miss something?
>>>>>
>>>>
>>>> Userspace need to take care of whether there may be"\n" or not even
>>>> if read() returns EOF.
>>>> As an interface, it's BUG to say "\n" will be there if you're lucky!"
>>>> (*) I know script language can handle this but we shouldn't assume that.
>>>>
>>>> How about just remove "\n" at EOF  ? I think it's unnecessary.
>>>>
>>>
>>> Sure thing, it's not a problem to remove it completely.
>>
>> Foolish question.  Is there any reason why this is a file instead
>> of being the obvious directory full of symlinks?
>>
>
> How would these symlinks look like? "../../pid"? There were a conversation
> about such things (https://lkml.org/lkml/2011/12/2/142) but I suppose we
> were agree on children with pids as consensus.

I couldn't find any agreement in this link. Suppose wrong url?



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-25  1:14               ` KOSAKI Motohiro
@ 2012-01-25  2:11                 ` Eric W. Biederman
  2012-01-25  6:55                   ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Eric W. Biederman @ 2012-01-25  2:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cyrill Gorcunov, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	Pavel Emelyanov, Serge Hallyn, Kees Cook, Tejun Heo,
	Andrew Vagin, Alexey Dobriyan

KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:

> (1/24/12 4:11 AM), Cyrill Gorcunov wrote:
>> On Tue, Jan 24, 2012 at 12:52:03AM -0800, Eric W. Biederman wrote:
>>> Cyrill Gorcunov<gorcunov@gmail.com>  writes:
>>>
>>>> On Tue, Jan 24, 2012 at 04:07:09PM +0900, KAMEZAWA Hiroyuki wrote:
>>>>>>
>>>>>> Hmm. But userspace app will get eof, so frankly I don't see
>>>>>> a problem here. Or maybe I miss something?
>>>>>>
>>>>>
>>>>> Userspace need to take care of whether there may be"\n" or not even
>>>>> if read() returns EOF.
>>>>> As an interface, it's BUG to say "\n" will be there if you're lucky!"
>>>>> (*) I know script language can handle this but we shouldn't assume that.
>>>>>
>>>>> How about just remove "\n" at EOF  ? I think it's unnecessary.
>>>>>
>>>>
>>>> Sure thing, it's not a problem to remove it completely.
>>>
>>> Foolish question.  Is there any reason why this is a file instead
>>> of being the obvious directory full of symlinks?
>>>
>>
>> How would these symlinks look like? "../../pid"? There were a conversation
>> about such things (https://lkml.org/lkml/2011/12/2/142) but I suppose we
>> were agree on children with pids as consensus.
>
> I couldn't find any agreement in this link. Suppose wrong url?

Now that you have reminded me of this thread.  I can say that the
link would need to look like ../../pid.  Our children will always
be thread group leaders, so we can safely point to the /proc/<pid>
directories.  So readlink would return ../../<pid> or however many
dots are needed.  Follow link could just warp us to that directory
as it does for the other magic proc symlinks.

My feeling is that a children subdirectory would be a lot more useful
than a simple file that lists the children.

Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-24 23:53   ` Andrew Morton
@ 2012-01-25  6:52     ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-25  6:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan

On Tue, Jan 24, 2012 at 03:53:03PM -0800, Andrew Morton wrote:
> On Mon, 23 Jan 2012 18:20:37 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > When we do checkpoint of a task we need to know the list of children
> > the task, has but there is no easy and fast way to generate reverse
> > parent->children chain from arbitrary <pid> (while a parent pid is
> > provided in "PPid" field of /proc/<pid>/status).
> > 
> > So instead of walking over all pids in the system (creating one big process
> > tree in memory, just to figure out which children a task has) -- we add
> > explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> > this kind of information but it is not yet exported.
> > 
> > This is a first level children, not the whole process tree.
> 
> Is there a reason for not putting this new code inside
> CONFIG_CHECKPOINT_RESTART?  If so, please changelog that.
> 

I don't know if someone might need it as well, so I've asked in one of
previous iterations if this file is to be under CONFIG_CHECKPOINT_RESTART.
So, sure, I'll wrap it this way.

> We could add
> 
> /* #ifdef CONFIG_CHECKPOINT_RESTART (reasoning goes here) */
> 
> so that if we should decide to pull all this code out again, a grep
> will flag this code for consideration.
> 

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-24 23:59   ` Andrew Morton
@ 2012-01-25  6:54     ` Cyrill Gorcunov
  2012-01-25  7:12       ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-25  6:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Vasiliy Kulikov

On Tue, Jan 24, 2012 at 03:59:41PM -0800, Andrew Morton wrote:
> On Mon, 23 Jan 2012 18:20:39 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > We would like to have an ability to restore command line
> > arguments and envirion pointers so the task being restored
> > would print appropriate values in /proc/pid/cmdline and
> > /proc/pid/envirion. The exit_code is needed to restore
> > zombie tasks.
> > 
> 
> Please update Documentation/filesystems/proc.txt.

ok

> > +		(mm && permitted) ? mm->start_brk : 0,
> > +		(mm && permitted) ? mm->arg_start : 0,
> > +		(mm && permitted) ? mm->arg_end : 0,
> > +		(mm && permitted) ? mm->env_start : 0,
> > +		(mm && permitted) ? mm->env_end : 0,
> > +		task->exit_code);
> >  	if (mm)
> >  		mmput(mm);
> >  	return 0;
> 
> /proc/pid/stat is getting out of control.  People are now sending patches
> because reading from this thing already takes too long.
> 
> The problem is that if you only want one field, you have to incur the
> cost of preparing all the fields.  The magnitude of this problem
> increases exponentially over time!
> 
> I'm unsure what to do about it really.  Perhaps add a new
> /proc/pid/mmstat for MM-specific things.  We could put the above six
> fields in there, as long as we move quickly.
> 

I can add prctl PR_GET_MM with subcodes, since PR_SET_MM is already here
and wrapped with CHECKPOINT_RESTORE. Would this be better?

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-25  2:11                 ` Eric W. Biederman
@ 2012-01-25  6:55                   ` Cyrill Gorcunov
  2012-01-25 15:29                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-25  6:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	Pavel Emelyanov, Serge Hallyn, Kees Cook, Tejun Heo,
	Andrew Vagin, Alexey Dobriyan

On Tue, Jan 24, 2012 at 06:11:21PM -0800, Eric W. Biederman wrote:
> KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
> >>
> >> How would these symlinks look like? "../../pid"? There were a conversation
> >> about such things (https://lkml.org/lkml/2011/12/2/142) but I suppose we
> >> were agree on children with pids as consensus.
> >
> > I couldn't find any agreement in this link. Suppose wrong url?
> 
> Now that you have reminded me of this thread.  I can say that the
> link would need to look like ../../pid.  Our children will always
> be thread group leaders, so we can safely point to the /proc/<pid>
> directories.  So readlink would return ../../<pid> or however many
> dots are needed.  Follow link could just warp us to that directory
> as it does for the other magic proc symlinks.
> 
> My feeling is that a children subdirectory would be a lot more useful
> than a simple file that lists the children.
> 

I'll check what I can do, thanks.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-25  6:54     ` Cyrill Gorcunov
@ 2012-01-25  7:12       ` Andrew Morton
  2012-01-25  7:18         ` Cyrill Gorcunov
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2012-01-25  7:12 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Vasiliy Kulikov

On Wed, 25 Jan 2012 10:54:50 +0400 Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> > > +		(mm && permitted) ? mm->start_brk : 0,
> > > +		(mm && permitted) ? mm->arg_start : 0,
> > > +		(mm && permitted) ? mm->arg_end : 0,
> > > +		(mm && permitted) ? mm->env_start : 0,
> > > +		(mm && permitted) ? mm->env_end : 0,
> > > +		task->exit_code);
> > >  	if (mm)
> > >  		mmput(mm);
> > >  	return 0;
> > 
> > /proc/pid/stat is getting out of control.  People are now sending patches
> > because reading from this thing already takes too long.

err, actually, that was /proc/stat/

> > The problem is that if you only want one field, you have to incur the
> > cost of preparing all the fields.  The magnitude of this problem
> > increases exponentially over time!
> > 
> > I'm unsure what to do about it really.  Perhaps add a new
> > /proc/pid/mmstat for MM-specific things.  We could put the above six
> > fields in there, as long as we move quickly.
> > 
> 
> I can add prctl PR_GET_MM with subcodes, since PR_SET_MM is already here
> and wrapped with CHECKPOINT_RESTORE. Would this be better?

mm, not really - /proc is the logical/expected place for it.

I'm thinking that perhaps we should start again with all of this and
export all this information in brand new, well-designed procfs files. 
We'd still maintain /proc/stat and /proc/pid/stat but people should
migrate off them.  Eventually (10 years?) everyone will be setting their
CONFIG_PROC_[PID_]STAT to 'n' and perhaps we can retire the things.

Meanwhile, I suppose you may as well continue to make /proc/pid/stat
even crazier :(  It isn't as bad as /proc/stat!


btw, do we really need to do "(mm && permitted)" so many times?  ie,
can we split that seq_printf up and do

	if (mm && permitted) {
		seq_printf(m + offset, "%lu", mm->start_data);
		...
	} else {
		seq_printf(m + offset, "0 0 0 0 0 0 0 0 0");
	}

?  Although this probably won't help much.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-25  7:12       ` Andrew Morton
@ 2012-01-25  7:18         ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-25  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	Alexey Dobriyan, Vasiliy Kulikov

On Tue, Jan 24, 2012 at 11:12:45PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2012 10:54:50 +0400 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > > /proc/pid/stat is getting out of control.  People are now sending patches
> > > because reading from this thing already takes too long.
> 
> err, actually, that was /proc/stat/
> 

Ah! (I saw those thread, but then I get confused and thought maybe there were
for /proc/pid/stat as well but I simply missed it ;)

> > 
> > I can add prctl PR_GET_MM with subcodes, since PR_SET_MM is already here
> > and wrapped with CHECKPOINT_RESTORE. Would this be better?
> 
> mm, not really - /proc is the logical/expected place for it.
> 
> I'm thinking that perhaps we should start again with all of this and
> export all this information in brand new, well-designed procfs files. 
> We'd still maintain /proc/stat and /proc/pid/stat but people should
> migrate off them.  Eventually (10 years?) everyone will be setting their
> CONFIG_PROC_[PID_]STAT to 'n' and perhaps we can retire the things.
> 
> Meanwhile, I suppose you may as well continue to make /proc/pid/stat
> even crazier :(  It isn't as bad as /proc/stat!
> 

At moment indeed it's not that bloated... yet.

> btw, do we really need to do "(mm && permitted)" so many times?  ie,
> can we split that seq_printf up and do
> 
> 	if (mm && permitted) {
> 		seq_printf(m + offset, "%lu", mm->start_data);
> 		...
> 	} else {
> 		seq_printf(m + offset, "0 0 0 0 0 0 0 0 0");
> 	}
> 
> ?  Although this probably won't help much.
> 

Yeah we can. I'll update.

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
  2012-01-25  6:55                   ` Cyrill Gorcunov
@ 2012-01-25 15:29                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 60+ messages in thread
From: Cyrill Gorcunov @ 2012-01-25 15:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	Pavel Emelyanov, Serge Hallyn, Kees Cook, Tejun Heo,
	Andrew Vagin, Alexey Dobriyan

On Wed, Jan 25, 2012 at 10:55:51AM +0400, Cyrill Gorcunov wrote:
> > 
> > Now that you have reminded me of this thread.  I can say that the
> > link would need to look like ../../pid.  Our children will always
> > be thread group leaders, so we can safely point to the /proc/<pid>
> > directories.  So readlink would return ../../<pid> or however many
> > dots are needed.  Follow link could just warp us to that directory
> > as it does for the other magic proc symlinks.
> > 
> > My feeling is that a children subdirectory would be a lot more useful
> > than a simple file that lists the children.
> > 
> 
> I'll check what I can do, thanks.
> 

Sigh. This will require complete code rewrite. And since I've just
made it to live under CONFIG_CHECKPOINT_RESTORE I would prefer if
we stick with children file. So if there are no strong objections
agains 'children' as a file, could we leave it as is, ie in
stream-of-pids form?

	Cyrill

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2012-01-25 15:30 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-23 14:20 [patch 0/4] A few patches in a sake of c/r functionality Cyrill Gorcunov
2012-01-23 14:20 ` [patch 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8 Cyrill Gorcunov
2012-01-23 18:54   ` Kees Cook
2012-01-23 19:33     ` Cyrill Gorcunov
2012-01-23 20:29       ` Kees Cook
2012-01-23 20:39         ` Cyrill Gorcunov
2012-01-24  2:07   ` KAMEZAWA Hiroyuki
2012-01-24  6:53     ` Cyrill Gorcunov
2012-01-24  7:07       ` KAMEZAWA Hiroyuki
2012-01-24  7:21         ` Cyrill Gorcunov
2012-01-24  8:52           ` Eric W. Biederman
2012-01-24  9:11             ` Cyrill Gorcunov
2012-01-25  1:14               ` KOSAKI Motohiro
2012-01-25  2:11                 ` Eric W. Biederman
2012-01-25  6:55                   ` Cyrill Gorcunov
2012-01-25 15:29                     ` Cyrill Gorcunov
2012-01-24  8:51         ` Cyrill Gorcunov
2012-01-24 23:53   ` Andrew Morton
2012-01-25  6:52     ` Cyrill Gorcunov
2012-01-23 14:20 ` [patch 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v4 Cyrill Gorcunov
2012-01-23 18:48   ` H. Peter Anvin
2012-01-23 20:03     ` Cyrill Gorcunov
2012-01-24  2:16   ` KAMEZAWA Hiroyuki
2012-01-24  6:47     ` Cyrill Gorcunov
2012-01-24  7:04       ` H. Peter Anvin
2012-01-24  7:17         ` Cyrill Gorcunov
2012-01-24  7:20           ` KAMEZAWA Hiroyuki
2012-01-24  7:38             ` Cyrill Gorcunov
2012-01-24  7:40               ` KAMEZAWA Hiroyuki
2012-01-24  8:48                 ` Cyrill Gorcunov
2012-01-24 20:20                   ` KOSAKI Motohiro
2012-01-24 20:26                     ` Cyrill Gorcunov
2012-01-24 20:44                       ` Eric W. Biederman
2012-01-24 20:50                         ` Cyrill Gorcunov
2012-01-24 21:20                           ` Eric W. Biederman
2012-01-24 21:34                             ` Cyrill Gorcunov
2012-01-24 21:22                           ` Andrew Morton
2012-01-24 21:45                             ` Andrew Morton
2012-01-24 21:46                               ` H. Peter Anvin
2012-01-24 22:00                                 ` Andrew Morton
2012-01-24 22:52                                   ` H. Peter Anvin
2012-01-24 23:42                                     ` Andrew Morton
2012-01-24 21:46                             ` Cyrill Gorcunov
2012-01-24 21:59                               ` Andrew Morton
2012-01-24 22:54                             ` Eric W. Biederman
2012-01-24 22:54                               ` Andrew Morton
2012-01-24 21:25                           ` Andrew Morton
2012-01-24 21:31                             ` Cyrill Gorcunov
2012-01-24  8:49             ` Eric W. Biederman
2012-01-24  8:49               ` Cyrill Gorcunov
2012-01-23 14:20 ` [patch 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
2012-01-23 20:42   ` Kees Cook
2012-01-23 20:53     ` Cyrill Gorcunov
2012-01-24 23:59   ` Andrew Morton
2012-01-25  6:54     ` Cyrill Gorcunov
2012-01-25  7:12       ` Andrew Morton
2012-01-25  7:18         ` Cyrill Gorcunov
2012-01-23 14:20 ` [patch 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
2012-01-23 15:55   ` Cyrill Gorcunov
2012-01-23 20:02     ` Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).