linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC c/r 0/4] [RFC c/r 0/@total@] A pile in c/r sake
@ 2012-01-27 17:53 Cyrill Gorcunov
  2012-01-27 17:53 ` [RFC c/r 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v9 Cyrill Gorcunov
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 17:53 UTC (permalink / raw)
  To: LKML; +Cc: Andrew Morton, Eric W. Biederman, Pavel Emelyanov, KOSAKI Motohiro

Hi, please review, I hope I've addressed all concerns,
if not -- please poke me.

And I left /proc/pid/task/tid/children "as a file"
approach since I've got no strong objections.

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC c/r 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v9
  2012-01-27 17:53 [RFC c/r 0/4] [RFC c/r 0/@total@] A pile in c/r sake Cyrill Gorcunov
@ 2012-01-27 17:53 ` Cyrill Gorcunov
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 17:53 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki

[-- Attachment #1: fs-proc-tid-children-13 --]
[-- Type: text/plain, Size: 7055 bytes --]

When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/filesystems/proc.txt |   18 +++++
 fs/proc/array.c                    |  123 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    3 
 fs/proc/internal.h                 |    1 
 4 files changed, 145 insertions(+)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.7   /proc/<pid>/task/<tid>/children - Information about task children
 
   4	Configuring procfs
   4.1	Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
 comm value.
 
 
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results and some children might be
+skipped, especially if they've exited right after we printed their
+pids, so one need to either stop or freeze processes being inspected
+if precise results are needed.
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,126 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static struct pid *
+get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(proc_pid(inode), PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			if (list_is_last(&task->sibling, &start->children))
+				goto out;
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			pid = get_pid(task_pid(task));
+			goto out;
+		}
+	}
+
+	/*
+	 * Slow search case.
+	 *
+	 * We might miss some children here if children
+	 * are exited while we were not holding the lock,
+	 * but it was never promised to be accurate that
+	 * much.
+	 *
+	 * "Just suppose that the parent sleeps, but N children
+	 *  exit after we printed their tids. Now the slow paths
+	 *  skips N extra children, we miss N tasks." (c)
+	 *
+	 * So one need to stop or freeze the leader and all
+	 * its children to get a precise result.
+	 */
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			pid = get_pid(task_pid(task));
+			break;
+		}
+	}
+
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct inode *inode = seq->private;
+	pid_t pid;
+
+	pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
+	return seq_printf(seq, "%d ", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct pid *pid;
+
+	pid = get_children_pid(seq->private, v, *pos + 1);
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open(file, &children_seq_ops);
+	if (ret)
+		return ret;
+
+	m = file->private_data;
+	m->private = inode;
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
+#endif
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3384,6 +3384,9 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	REG("children",  S_IRUGO, proc_tid_children_operations),
+#endif
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 17:53 [RFC c/r 0/4] [RFC c/r 0/@total@] A pile in c/r sake Cyrill Gorcunov
  2012-01-27 17:53 ` [RFC c/r 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v9 Cyrill Gorcunov
@ 2012-01-27 17:53 ` Cyrill Gorcunov
  2012-01-27 18:05   ` H. Peter Anvin
                     ` (4 more replies)
  2012-01-27 17:53 ` [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
  2012-01-27 17:53 ` [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
  3 siblings, 5 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 17:53 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Cyrill Gorcunov, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

[-- Attachment #1: sys-kcmp-10 --]
[-- Type: text/plain, Size: 12791 bytes --]

While doing the checkpoint-restore in the userspace one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are shared
between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparision' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 agruments - the pids of the two tasks (which
characteristics should be compared), the comparision type and
(in case of comparision of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Glauber Costa <glommer@parallels.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tejun Heo <tj@kernel.org>
CC: Matt Helsley <matthltc@us.ibm.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Valdis.Kletnieks@vt.edu
---
 arch/x86/syscalls/syscall_32.tbl         |    1 
 arch/x86/syscalls/syscall_64.tbl         |    1 
 include/linux/kcmp.h                     |   17 ++
 include/linux/syscalls.h                 |    2 
 kernel/Makefile                          |    1 
 kernel/kcmp.c                            |  181 +++++++++++++++++++++++++++++++
 tools/testing/selftests/kcmp/Makefile    |   35 +++++
 tools/testing/selftests/kcmp/kcmp_test.c |  126 +++++++++++++++++++++
 tools/testing/selftests/run_tests        |    2 
 9 files changed, 365 insertions(+), 1 deletion(-)

Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
Index: linux-2.6.git/include/linux/kcmp.h
===================================================================
--- /dev/null
+++ linux-2.6.git/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparision type */
+enum kcmp_type {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
Index: linux-2.6.git/include/linux/syscalls.h
===================================================================
--- linux-2.6.git.orig/include/linux/syscalls.h
+++ linux-2.6.git/include/linux/syscalls.h
@@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
 				      unsigned long riovcnt,
 				      unsigned long flags);
 
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
 #endif
Index: linux-2.6.git/kernel/Makefile
===================================================================
--- linux-2.6.git.orig/kernel/Makefile
+++ linux-2.6.git/kernel/Makefile
@@ -25,6 +25,7 @@ endif
 obj-y += sched/
 obj-y += power/
 
+obj-$(CONFIG_X86) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
Index: linux-2.6.git/kernel/kcmp.c
===================================================================
--- /dev/null
+++ linux-2.6.git/kernel/kcmp.c
@@ -0,0 +1,181 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+/*
+ * We don't expose real in-memory order of objects for security
+ * reasons, still the comparision results should be suitable for
+ * sorting. Thus, we obfuscate kernel pointers values (using random
+ * cookies obtaned at early boot stage) and compare the production
+ * instead.
+ */
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal
+ * 1 - less than
+ * 2 - greater than
+ * 3 - not equal but ordering unavailable (reserved for future)
+ */
+static int kcmp_ptr(long v1, long v2, enum kcmp_type type)
+{
+	long ret;
+
+	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+/* The caller must have pinned the task */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	spin_lock(&task->files->file_lock);
+	fdt = files_fdtable(task->files);
+	if (idx < fdt->max_fds)
+		file = fdt->fd[idx];
+	else
+		file = NULL;
+	spin_unlock(&task->files->file_lock);
+
+	return file;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1, *task2;
+	int ret;
+
+	rcu_read_lock();
+
+	/*
+	 * Tasks are looked up in caller's
+	 * PID namespace only.
+	 */
+
+	task1 = find_task_by_vpid(pid1);
+	if (!task1) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	task2 = find_task_by_vpid(pid2);
+	if (!task2) {
+		put_task_struct(task1);
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	/*
+	 * One should have enough rights to inspect tasks details.
+	 */
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	/*
+	 * Note for all cases but the KCMP_FILE we
+	 * don't take any locks in a sake of speed.
+	 */
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
+		else
+			ret = -ENOENT;
+		break;
+	}
+	case KCMP_VM:
+		ret = kcmp_ptr((long)task1->mm,
+			       (long)task2->mm,
+			       KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = kcmp_ptr((long)task1->files,
+			       (long)task2->files,
+			       KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = kcmp_ptr((long)task1->fs,
+			       (long)task2->fs,
+			       KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = kcmp_ptr((long)task1->sighand,
+			       (long)task2->sighand,
+			       KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = kcmp_ptr((long)task1->io_context,
+			       (long)task2->io_context,
+			       KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
+			       (long)task2->sysvsem.undo_list,
+			       KCMP_SYSVSEM);
+#else
+		ret = -EINVAL;
+		goto err;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto err;
+	}
+
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+}
+
+static __init int kcmp_cookie_init(void)
+{
+	int i, j;
+
+	for (i = 0; i < KCMP_TYPES; i++) {
+		for (j = 0; j < 2; j++) {
+			get_random_bytes(&cookies[i][j],
+					 sizeof(cookies[i][j]));
+		}
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+	}
+
+	return 0;
+}
+late_initcall(kcmp_cookie_init);
Index: linux-2.6.git/tools/testing/selftests/kcmp/Makefile
===================================================================
--- /dev/null
+++ linux-2.6.git/tools/testing/selftests/kcmp/Makefile
@@ -0,0 +1,35 @@
+ifeq ($(strip $(V)),)
+	E = @echo
+	Q = @
+else
+	E = @\#
+	Q =
+endif
+export E Q
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+	CFLAGS := -DCONFIG_X86_32
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+	CFLAGS := -DCONFIG_X86_64
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	$(E) "  CC run_test"
+	$(Q) gcc $(CFLAGS) kcmp_test.c -o run_test
+else
+	$(E) "Not an x86 target, can't build breakpoints selftests"
+endif
+
+clean:
+	$(E) "  CLEAN"
+	$(Q) rm -fr ./run_test
+	$(Q) rm -fr ./test-file
Index: linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
===================================================================
--- /dev/null
+++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
@@ -0,0 +1,126 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <limits.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+
+#ifdef CONFIG_X86_64
+#include "asm/unistd_64.h"
+#else
+#include "asm/unistd_32.h"
+#endif
+
+#include "linux/kcmp.h"
+
+#ifdef CONFIG_X86_64
+static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
+		unsigned long arg2, unsigned long arg3,
+		unsigned long arg4)
+{
+	long ret;
+	asm volatile(
+		"movl %1, %%eax		\n"
+		"movq %2, %%rdi		\n"
+		"movq %3, %%rsi		\n"
+		"movq %4, %%rdx		\n"
+		"movq %5, %%r10		\n"
+		"movq %6, %%r8		\n"
+		"syscall		\n"
+		"movq %%rax, %0		\n"
+		: "=r"(ret)
+		: "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
+			"g" (arg3), "g" (arg4)
+		: "rax", "rdi", "rsi", "rdx", "r10", "r8", "memory");
+	return ret;
+}
+#else
+static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
+		unsigned long arg2, unsigned long arg3,
+		unsigned long arg4)
+{
+	long ret;
+	asm volatile(
+		"movl %1, %%eax		\n"
+		"movl %2, %%ebx		\n"
+		"movl %3, %%ecx		\n"
+		"movl %4, %%edx		\n"
+		"movl %5, %%esi		\n"
+		"movl %6, %%edi		\n"
+		"intl $0x80		\n"
+		"movq %%eax, %0		\n"
+		: "=r"(ret)
+		: "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
+			"g" (arg3), "g" (arg4)
+		: "eax", "ebx", "ecx", "edx", "esi", "edi", "memory");
+	return ret;
+}
+#endif
+
+static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
+{
+	return syscall5(__NR_kcmp, (long)pid1, (long)pid2,
+			(long)type, (long)fd1, (long)fd2);
+}
+
+int main(int argc, char **argv)
+{
+	int pid1, pid2;
+	int fd1, fd2;
+	int status;
+
+	fd1 = open("test-file", O_RDWR | O_CREAT | O_TRUNC, 0644);
+	pid1 = getpid();
+
+	if (fd1 < 0) {
+		perror("Can't create file");
+		exit(1);
+	}
+
+	pid2 = fork();
+	if (pid2 < 0) {
+		perror("fork failed");
+		exit(1);
+	}
+
+	if (!pid2) {
+		int pid2 = getpid();
+
+		fd2 = open("test-file", O_RDWR, 0644);
+		if (fd2 < 0) {
+			perror("Can't open file");
+			exit(1);
+		}
+
+                printf("pid1: %6d pid2: %6d FD: %2d FILES: %2d VM: %2d FS: %2d "
+		       "SIGHAND: %2d IO: %2d SYSVSEM: %2d INV: %2d\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
+		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),
+
+			/* This one should fail */
+		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
+
+		/* This one should return same fd */
+		printf("pid1: %6d pid2: %6d SAME-FD: %2d\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd1));
+
+		exit(0);
+	}
+
+	waitpid(pid2, &status, P_ALL);
+
+	return 0;
+}
Index: linux-2.6.git/tools/testing/selftests/run_tests
===================================================================
--- linux-2.6.git.orig/tools/testing/selftests/run_tests
+++ linux-2.6.git/tools/testing/selftests/run_tests
@@ -1,6 +1,6 @@
 #!/bin/bash
 
-TARGETS=breakpoints
+TARGETS="breakpoints kcmp"
 
 for TARGET in $TARGETS
 do


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-27 17:53 [RFC c/r 0/4] [RFC c/r 0/@total@] A pile in c/r sake Cyrill Gorcunov
  2012-01-27 17:53 ` [RFC c/r 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v9 Cyrill Gorcunov
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
@ 2012-01-27 17:53 ` Cyrill Gorcunov
  2012-01-27 18:29   ` Kees Cook
  2012-01-27 20:00   ` KOSAKI Motohiro
  2012-01-27 17:53 ` [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
  3 siblings, 2 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 17:53 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn,
	Kees Cook, KAMEZAWA Hiroyuki, Alexey Dobriyan, Tejun Heo,
	Andrew Vagin, Vasiliy Kulikov

[-- Attachment #1: fs-proc-extend-state --]
[-- Type: text/plain, Size: 3179 bytes --]

We would like to have an ability to restore command line
arguments and envirion pointers so the task being restored
would print appropriate values in /proc/pid/cmdline and
/proc/pid/envirion. The exit_code is needed to restore
zombie tasks.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/filesystems/proc.txt |    5 +++++
 fs/proc/array.c                    |   23 ++++++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -311,6 +311,11 @@ Table 1-4: Contents of the stat files (a
   start_data    address above which program data+bss is placed
   end_data      address below which program data+bss is placed
   start_brk     address above which program heap can be expanded with brk()
+  arg_start     address above which program command line is placed
+  arg_end       address below which program command line is placed
+  env_start     address above which program environment is placed
+  env_end       address below which program environment is placed
+  exit_code     the thread's exit_code in the form reported by the waitpid system call
 ..............................................................................
 
 The /proc/PID/maps file containing the currently mapped memory regions and
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
 
 	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld ",
 		pid_nr_ns(pid, ns),
 		tcomm,
 		state,
@@ -511,10 +511,23 @@ static int do_task_stat(struct seq_file
 		task->policy,
 		(unsigned long long)delayacct_blkio_ticks(task),
 		cputime_to_clock_t(gtime),
-		cputime_to_clock_t(cgtime),
-		(mm && permitted) ? mm->start_data : 0,
-		(mm && permitted) ? mm->end_data : 0,
-		(mm && permitted) ? mm->start_brk : 0);
+		cputime_to_clock_t(cgtime));
+
+		if (mm && permitted) {
+			seq_printf(m, "%lu %lu %lu %lu %lu %lu %lu ",
+				   mm->start_data,
+				   mm->end_data,
+				   mm->start_brk,
+				   mm->arg_start,
+				   mm->arg_end,
+				   mm->env_start,
+				   mm->env_end);
+		} else {
+			seq_printf(m, "0 0 0 0 0 0 0 ");
+		}
+
+		seq_printf(m, "%d\n", task->exit_code);
+
 	if (mm)
 		mmput(mm);
 	return 0;


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-27 17:53 [RFC c/r 0/4] [RFC c/r 0/@total@] A pile in c/r sake Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2012-01-27 17:53 ` [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
@ 2012-01-27 17:53 ` Cyrill Gorcunov
  2012-01-27 18:37   ` Kees Cook
  3 siblings, 1 reply; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 17:53 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Cyrill Gorcunov, Michael Kerrisk, Kees Cook,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Pavel Emelyanov,
	Vasiliy Kulikov, KAMEZAWA Hiroyuki

[-- Attachment #1: prctl-restore-mm-members-4 --]
[-- Type: text/plain, Size: 5215 bytes --]

After restore we would like the 'ps' command show the command
line and evironment exactly the same it was at checkpoint time.

So this additional PR_SET_MM_ allow us to do so. Note that
these members of mm_struct is rather used for output in
procfs, except auxv vector which is used by ld.so mostly.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/prctl.h |    5 +++
 kernel/sys.c          |   78 +++++++++++++++++++++++++++++++-------------------
 2 files changed, 54 insertions(+), 29 deletions(-)

Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -113,5 +113,10 @@
 # define PR_SET_MM_START_STACK		5
 # define PR_SET_MM_START_BRK		6
 # define PR_SET_MM_BRK			7
+# define PR_SET_MM_ARG_START		8
+# define PR_SET_MM_ARG_END		9
+# define PR_SET_MM_ENV_START		10
+# define PR_SET_MM_ENV_END		11
+# define PR_SET_MM_AUXV			12
 
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1693,17 +1693,23 @@ SYSCALL_DEFINE1(umask, int, mask)
 }
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
+static bool vma_flags_mismatch(struct vm_area_struct *vma,
+			       unsigned long required,
+			       unsigned long banned)
+{
+	return (vma->vm_flags & required) != required ||
+		(vma->vm_flags & banned);
+}
+
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	unsigned long rlim = rlimit(RLIMIT_DATA);
-	unsigned long vm_req_flags;
-	unsigned long vm_bad_flags;
 	struct vm_area_struct *vma;
 	int error = 0;
 	struct mm_struct *mm = current->mm;
 
-	if (arg4 | arg5)
+	if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
 		return -EINVAL;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -1715,7 +1721,9 @@ static int prctl_set_mm(int opt, unsigne
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, addr);
 
-	if (opt != PR_SET_MM_START_BRK && opt != PR_SET_MM_BRK) {
+	if (opt != PR_SET_MM_START_BRK &&
+	    opt != PR_SET_MM_BRK &&
+	    opt != PR_SET_MM_AUXV) {
 		/* It must be existing VMA */
 		if (!vma || vma->vm_start > addr)
 			goto out;
@@ -1725,11 +1733,8 @@ static int prctl_set_mm(int opt, unsigne
 	switch (opt) {
 	case PR_SET_MM_START_CODE:
 	case PR_SET_MM_END_CODE:
-		vm_req_flags = VM_READ | VM_EXEC;
-		vm_bad_flags = VM_WRITE | VM_MAYSHARE;
-
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
-		    (vma->vm_flags & vm_bad_flags))
+		if (vma_flags_mismatch(vma, VM_READ | VM_EXEC,
+				       VM_WRITE | VM_MAYSHARE))
 			goto out;
 
 		if (opt == PR_SET_MM_START_CODE)
@@ -1740,11 +1745,8 @@ static int prctl_set_mm(int opt, unsigne
 
 	case PR_SET_MM_START_DATA:
 	case PR_SET_MM_END_DATA:
-		vm_req_flags = VM_READ | VM_WRITE;
-		vm_bad_flags = VM_EXEC | VM_MAYSHARE;
-
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
-		    (vma->vm_flags & vm_bad_flags))
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE,
+				       VM_EXEC | VM_MAYSHARE))
 			goto out;
 
 		if (opt == PR_SET_MM_START_DATA)
@@ -1753,19 +1755,6 @@ static int prctl_set_mm(int opt, unsigne
 			mm->end_data = addr;
 		break;
 
-	case PR_SET_MM_START_STACK:
-
-#ifdef CONFIG_STACK_GROWSUP
-		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
-#else
-		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
-#endif
-		if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
-			goto out;
-
-		mm->start_stack = addr;
-		break;
-
 	case PR_SET_MM_START_BRK:
 		if (addr <= mm->end_data)
 			goto out;
@@ -1790,16 +1779,47 @@ static int prctl_set_mm(int opt, unsigne
 		mm->brk = addr;
 		break;
 
+	case PR_SET_MM_START_STACK:
+	case PR_SET_MM_ARG_START:
+	case PR_SET_MM_ARG_END:
+	case PR_SET_MM_ENV_START:
+	case PR_SET_MM_ENV_END:
+#ifdef CONFIG_STACK_GROWSUP
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSUP, 0))
+#else
+		if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSDOWN, 0))
+#endif
+			goto out;
+		if (opt == PR_SET_MM_START_STACK)
+			mm->start_stack = addr;
+		else if (opt == PR_SET_MM_ARG_START)
+			mm->arg_start = addr;
+		else if (opt == PR_SET_MM_ARG_END)
+			mm->arg_end = addr;
+		else if (opt == PR_SET_MM_ENV_START)
+			mm->env_start = addr;
+		else if (opt == PR_SET_MM_ENV_END)
+			mm->env_end = addr;
+		break;
+
+	case PR_SET_MM_AUXV:
+		if (arg4 > sizeof(mm->saved_auxv))
+			goto out;
+		up_read(&mm->mmap_sem);
+
+		error = -EFAULT;
+		if (!copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4))
+			error = 0;
+
+		return error;
 	default:
 		error = -EINVAL;
 		goto out;
 	}
 
 	error = 0;
-
 out:
 	up_read(&mm->mmap_sem);
-
 	return error;
 }
 #else /* CONFIG_CHECKPOINT_RESTORE */


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
@ 2012-01-27 18:05   ` H. Peter Anvin
  2012-01-27 18:11     ` Cyrill Gorcunov
  2012-01-27 18:15   ` Andi Kleen
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2012-01-27 18:05 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On 01/27/2012 09:53 AM, Cyrill Gorcunov wrote:
> +
> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparision results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values (using random
> + * cookies obtaned at early boot stage) and compare the production
> + * instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable (reserved for future)
> + */
> +static int kcmp_ptr(long v1, long v2, enum kcmp_type type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +

I just want to point out that we could do hard cryptography, too --
using DES or AES and compare the result since symmetric cryptography is
an isomorphism.  One would have to compare the whole result, obviously,
not a truncated one, so using memcmp() or the similar.

I'll leave it up to the security guys to worry about if that is needed,
but since it is something that can be slotted in without changing the
API it seems reasonably safe.

	-hpa

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:05   ` H. Peter Anvin
@ 2012-01-27 18:11     ` Cyrill Gorcunov
  0 siblings, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 18:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 10:05:18AM -0800, H. Peter Anvin wrote:
> + */
> > +static int kcmp_ptr(long v1, long v2, enum kcmp_type type)
> > +{
> > +	long ret;
> > +
> > +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> > +
> > +	return (ret < 0) | ((ret > 0) << 1);
> > +}
> > +
> 
> I just want to point out that we could do hard cryptography, too --
> using DES or AES and compare the result since symmetric cryptography is
> an isomorphism.  One would have to compare the whole result, obviously,
> not a truncated one, so using memcmp() or the similar.
> 
> I'll leave it up to the security guys to worry about if that is needed,
> but since it is something that can be slotted in without changing the
> API it seems reasonably safe.
> 

Yes, it's internal operations not visible to user-space. Actually I thought
to even introduce kind of sysctl later so admin would be able to refresh
cookies manually when needed or completely disable this interface as well.
But not in this patch ;)

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
  2012-01-27 18:05   ` H. Peter Anvin
@ 2012-01-27 18:15   ` Andi Kleen
  2012-01-27 18:24     ` Cyrill Gorcunov
  2012-01-27 18:40   ` Eric W. Biederman
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2012-01-27 18:15 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa,
	Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

> It takes up to 5 agruments - the pids of the two tasks (which
> characteristics should be compared), the comparision type and
> (in case of comparision of files) two file descriptors.

Can you please write a manpage for it? That's really required
to evaluate the interface properly.

As I understand it every time the kernel adds some new kind of state
this would need to be extended too? This would seem like a lot of work,
especially since you always need to synchronize kernel/user space.
How would the user space break if it doesn't know about some newly
added state?

Maybe it would be better to put more of the relevant code into the
kernel to encapsulate this better.


> +	case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
> +			       (long)task2->sysvsem.undo_list,
> +			       KCMP_SYSVSEM);

I assume that's normally NULL.

> +#ifdef CONFIG_X86_64
> +static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
> +		unsigned long arg2, unsigned long arg3,
> +		unsigned long arg4)

Why not just use syscall() in glibc?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:15   ` Andi Kleen
@ 2012-01-27 18:24     ` Cyrill Gorcunov
  2012-01-27 18:30       ` H. Peter Anvin
  2012-01-27 18:31       ` Andi Kleen
  0 siblings, 2 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 18:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 07:15:25PM +0100, Andi Kleen wrote:
> > It takes up to 5 agruments - the pids of the two tasks (which
> > characteristics should be compared), the comparision type and
> > (in case of comparision of files) two file descriptors.
> 
> Can you please write a manpage for it? That's really required
> to evaluate the interface properly.

Sure, I'll try to (btw, where I should send it to? And in which
format novadays mans are written? In plain old troff or some
human readable asciidocs?) And... should I post man page
on LKML as well?

> 
> As I understand it every time the kernel adds some new kind of state
> this would need to be extended too? This would seem like a lot of work,
> especially since you always need to synchronize kernel/user space.
> How would the user space break if it doesn't know about some newly
> added state?

Wait, maybe I should use kernel-doc here and put comments with example
right on top of SYSCALL definition?

> 
> Maybe it would be better to put more of the relevant code into the
> kernel to encapsulate this better.
> 
> 
> > +	case KCMP_SYSVSEM:
> > +#ifdef CONFIG_SYSVIPC
> > +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
> > +			       (long)task2->sysvsem.undo_list,
> > +			       KCMP_SYSVSEM);
> 
> I assume that's normally NULL.
> 

Hmm.. Andi, I seem not following. And?

> > +#ifdef CONFIG_X86_64
> > +static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
> > +		unsigned long arg2, unsigned long arg3,
> > +		unsigned long arg4)
> 
> Why not just use syscall() in glibc?
> 

Never heard of it. I'll take a look, thanks. But I suppose all this
comments might be addressed in patch on top?

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-27 17:53 ` [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
@ 2012-01-27 18:29   ` Kees Cook
  2012-01-27 20:00   ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: Kees Cook @ 2012-01-27 18:29 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Alexey Dobriyan, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Fri, Jan 27, 2012 at 9:53 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> We would like to have an ability to restore command line
> arguments and envirion pointers so the task being restored
> would print appropriate values in /proc/pid/cmdline and
> /proc/pid/envirion. The exit_code is needed to restore
> zombie tasks.
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>

Reviewed-by: Kees Cook <keescook@chromium.org>

> +               if (mm && permitted) {
> +                       seq_printf(m, "%lu %lu %lu %lu %lu %lu %lu ",

I like this clean-up. :)

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:24     ` Cyrill Gorcunov
@ 2012-01-27 18:30       ` H. Peter Anvin
  2012-01-28 17:19         ` Michael Kerrisk
  2012-01-27 18:31       ` Andi Kleen
  1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2012-01-27 18:30 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andi Kleen, LKML, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, mtk.manpages

On 01/27/2012 10:24 AM, Cyrill Gorcunov wrote:
> 
> Sure, I'll try to (btw, where I should send it to? And in which
> format novadays mans are written? In plain old troff or some
> human readable asciidocs?) And... should I post man page
> on LKML as well?
> 

Normally they're written using the mandoc macro set in *roff.  Michael
Kerrisk (mtk) is the authority on how to do things.

	-hpa

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:24     ` Cyrill Gorcunov
  2012-01-27 18:30       ` H. Peter Anvin
@ 2012-01-27 18:31       ` Andi Kleen
  2012-01-27 18:40         ` Cyrill Gorcunov
  1 sibling, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2012-01-27 18:31 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andi Kleen, LKML, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 10:24:03PM +0400, Cyrill Gorcunov wrote:
> Sure, I'll try to (btw, where I should send it to? And in which
> format novadays mans are written? In plain old troff or some
> human readable asciidocs?) And... should I post man page
> on LKML as well?

Manpages are written in troff with -man and you would add the ASCII output
of the manpage to the commit log and send the troff source at some
point to linux-man.

> 
> > 
> > As I understand it every time the kernel adds some new kind of state
> > this would need to be extended too? This would seem like a lot of work,
> > especially since you always need to synchronize kernel/user space.
> > How would the user space break if it doesn't know about some newly
> > added state?
> 
> Wait, maybe I should use kernel-doc here and put comments with example
> right on top of SYSCALL definition?

The basic problem is if this interface is at the right level of abstraction.
I have some doubts on that. It seems like a long term maintenance nightmare to 
me.  It may be better to put the loop that would call this into the kernel.

> > > +#ifdef CONFIG_SYSVIPC
> > > +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
> > > +			       (long)task2->sysvsem.undo_list,
> > > +			       KCMP_SYSVSEM);
> > 
> > I assume that's normally NULL.
> > 
> 
> Hmm.. Andi, I seem not following. And?

It just doesn't seem like a very useful thing to compare.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-27 17:53 ` [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
@ 2012-01-27 18:37   ` Kees Cook
  2012-01-27 18:43     ` Cyrill Gorcunov
  2012-01-27 20:28     ` KOSAKI Motohiro
  0 siblings, 2 replies; 37+ messages in thread
From: Kees Cook @ 2012-01-27 18:37 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Michael Kerrisk, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki

On Fri, Jan 27, 2012 at 9:53 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> +               if (opt == PR_SET_MM_START_STACK)
> +                       mm->start_stack = addr;
> +               else if (opt == PR_SET_MM_ARG_START)
> +                       mm->arg_start = addr;
> +               else if (opt == PR_SET_MM_ARG_END)
> +                       mm->arg_end = addr;
> +               else if (opt == PR_SET_MM_ENV_START)
> +                       mm->env_start = addr;
> +               else if (opt == PR_SET_MM_ENV_END)
> +                       mm->env_end = addr;
> +               break;

Why not a switch statement here? Not that it really matters. :)

> +
> +       case PR_SET_MM_AUXV:
> +               if (arg4 > sizeof(mm->saved_auxv))
> +                       goto out;
> +               up_read(&mm->mmap_sem);
> +
> +               error = -EFAULT;
> +               if (!copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4))
> +                       error = 0;
> +
> +               return error;

Is the mmap_sem released here because of the copy_from_user()? Is it
still safe to write to saved_auxv?

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:31       ` Andi Kleen
@ 2012-01-27 18:40         ` Cyrill Gorcunov
  2012-01-27 19:40           ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 18:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 07:31:55PM +0100, Andi Kleen wrote:
> On Fri, Jan 27, 2012 at 10:24:03PM +0400, Cyrill Gorcunov wrote:
> > Sure, I'll try to (btw, where I should send it to? And in which
> > format novadays mans are written? In plain old troff or some
> > human readable asciidocs?) And... should I post man page
> > on LKML as well?
> 
> Manpages are written in troff with -man and you would add the ASCII output
> of the manpage to the commit log and send the troff source at some
> point to linux-man.
>

OK, I see, thanks.

> The basic problem is if this interface is at the right level of abstraction.
> I have some doubts on that. It seems like a long term maintenance nightmare to 
> me. It may be better to put the loop that would call this into the kernel.
> 

Hmm, ie selftest right in kenel?

> > 
> > Hmm.. Andi, I seem not following. And?
> 
> It just doesn't seem like a very useful thing to compare.

I see.

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
  2012-01-27 18:05   ` H. Peter Anvin
  2012-01-27 18:15   ` Andi Kleen
@ 2012-01-27 18:40   ` Eric W. Biederman
  2012-01-27 18:45     ` Cyrill Gorcunov
  2012-01-27 19:10     ` Cyrill Gorcunov
  2012-01-27 19:37   ` Vasiliy Kulikov
  2012-01-27 20:19   ` KOSAKI Motohiro
  4 siblings, 2 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-01-27 18:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, KOSAKI Motohiro,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

Cyrill Gorcunov <gorcunov@openvz.org> writes:

> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparision results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values (using random
> + * cookies obtaned at early boot stage) and compare the production
> + * instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable (reserved for future)
> + */
> +static int kcmp_ptr(long v1, long v2, enum kcmp_type type)
> +{

Can you make the prototype:
static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)

So the caller does not need to have casts?

> +	long ret;
> +
> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +/* The caller must have pinned the task */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	spin_lock(&task->files->file_lock);
> +	fdt = files_fdtable(task->files);
> +	if (idx < fdt->max_fds)
> +		file = fdt->fd[idx];
> +	else
> +		file = NULL;
> +	spin_unlock(&task->files->file_lock);

I believe I commented on this earlier but it looks like it got lost in
the noise.

file == NULL != file == NULL.

When you can't lookup the file kcmp should return -EBADF.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-27 18:37   ` Kees Cook
@ 2012-01-27 18:43     ` Cyrill Gorcunov
  2012-01-27 20:31       ` KOSAKI Motohiro
  2012-01-27 20:28     ` KOSAKI Motohiro
  1 sibling, 1 reply; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 18:43 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Michael Kerrisk, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki

On Fri, Jan 27, 2012 at 10:37:05AM -0800, Kees Cook wrote:
> On Fri, Jan 27, 2012 at 9:53 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> > +               if (opt == PR_SET_MM_START_STACK)
> > +                       mm->start_stack = addr;
> > +               else if (opt == PR_SET_MM_ARG_START)
> > +                       mm->arg_start = addr;
> > +               else if (opt == PR_SET_MM_ARG_END)
> > +                       mm->arg_end = addr;
> > +               else if (opt == PR_SET_MM_ENV_START)
> > +                       mm->env_start = addr;
> > +               else if (opt == PR_SET_MM_ENV_END)
> > +                       mm->env_end = addr;
> > +               break;
> 
> Why not a switch statement here? Not that it really matters. :)
> 

Just to look different from toplevel switch, which is better from my POV.

> > +
> > +       case PR_SET_MM_AUXV:
> > +               if (arg4 > sizeof(mm->saved_auxv))
> > +                       goto out;
> > +               up_read(&mm->mmap_sem);
> > +
> > +               error = -EFAULT;
> > +               if (!copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4))
> > +                       error = 0;
> > +
> > +               return error;
> 
> Is the mmap_sem released here because of the copy_from_user()? Is it
> still safe to write to saved_auxv?
> 

At moment I believe yes (if only I'm not missing something), we poke this
vector at elf loading procedure, so it's up to user to sync access to "own"
saved_auxv and write sane values inside.

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:40   ` Eric W. Biederman
@ 2012-01-27 18:45     ` Cyrill Gorcunov
  2012-01-27 19:10     ` Cyrill Gorcunov
  1 sibling, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 18:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Andrew Morton, Pavel Emelyanov, KOSAKI Motohiro,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 10:40:54AM -0800, Eric W. Biederman wrote:
> Can you make the prototype:
> static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
> 
> So the caller does not need to have casts?

Sure.

> I believe I commented on this earlier but it looks like it got lost in
> the noise.
> 
> file == NULL != file == NULL.
> 
> When you can't lookup the file kcmp should return -EBADF.
> 

Yes, Eric, I somehow missed it in replies flowing. I'll update.
Thanks!

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:40   ` Eric W. Biederman
  2012-01-27 18:45     ` Cyrill Gorcunov
@ 2012-01-27 19:10     ` Cyrill Gorcunov
  1 sibling, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 19:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Andrew Morton, Pavel Emelyanov, KOSAKI Motohiro,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 10:40:54AM -0800, Eric W. Biederman wrote:
...
> 
> I believe I commented on this earlier but it looks like it got lost in
> the noise.
> 
> file == NULL != file == NULL.
> 
> When you can't lookup the file kcmp should return -EBADF.
> 

Something like below, right? (Andi I'll address your comments in
on top patch I'll make later, ok? I need to figure out where this
syscall() lives in glibc before using it).

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [RFC] syscalls, x86: Add __NR_kcmp syscall v8

While doing the checkpoint-restore in the userspace one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are shared
between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparision' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 agruments - the pids of the two tasks (which
characteristics should be compared), the comparision type and
(in case of comparision of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Glauber Costa <glommer@parallels.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tejun Heo <tj@kernel.org>
CC: Matt Helsley <matthltc@us.ibm.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Valdis.Kletnieks@vt.edu
---
 arch/x86/syscalls/syscall_32.tbl         |    1 
 arch/x86/syscalls/syscall_64.tbl         |    1 
 include/linux/kcmp.h                     |   17 +++
 include/linux/syscalls.h                 |    2 
 kernel/Makefile                          |    1 
 kernel/kcmp.c                            |  171 +++++++++++++++++++++++++++++++
 tools/testing/selftests/kcmp/Makefile    |   35 ++++++
 tools/testing/selftests/kcmp/kcmp_test.c |  126 ++++++++++++++++++++++
 tools/testing/selftests/run_tests        |    2 
 9 files changed, 355 insertions(+), 1 deletion(-)

Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
===================================================================
--- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
+++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
Index: linux-2.6.git/include/linux/kcmp.h
===================================================================
--- /dev/null
+++ linux-2.6.git/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparision type */
+enum kcmp_type {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
Index: linux-2.6.git/include/linux/syscalls.h
===================================================================
--- linux-2.6.git.orig/include/linux/syscalls.h
+++ linux-2.6.git/include/linux/syscalls.h
@@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
 				      unsigned long riovcnt,
 				      unsigned long flags);
 
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
 #endif
Index: linux-2.6.git/kernel/Makefile
===================================================================
--- linux-2.6.git.orig/kernel/Makefile
+++ linux-2.6.git/kernel/Makefile
@@ -25,6 +25,7 @@ endif
 obj-y += sched/
 obj-y += power/
 
+obj-$(CONFIG_X86) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
Index: linux-2.6.git/kernel/kcmp.c
===================================================================
--- /dev/null
+++ linux-2.6.git/kernel/kcmp.c
@@ -0,0 +1,171 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+/*
+ * We don't expose real in-memory order of objects for security
+ * reasons, still the comparision results should be suitable for
+ * sorting. Thus, we obfuscate kernel pointers values (using random
+ * cookies obtaned at early boot stage) and compare the production
+ * instead.
+ */
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal
+ * 1 - less than
+ * 2 - greater than
+ * 3 - not equal but ordering unavailable (reserved for future)
+ */
+static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+{
+	long ret;
+
+	ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+/* The caller must have pinned the task */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	spin_lock(&task->files->file_lock);
+	fdt = files_fdtable(task->files);
+	if (idx < fdt->max_fds)
+		file = fdt->fd[idx];
+	else
+		file = NULL;
+	spin_unlock(&task->files->file_lock);
+
+	return file;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1, *task2;
+	int ret;
+
+	rcu_read_lock();
+
+	/*
+	 * Tasks are looked up in caller's
+	 * PID namespace only.
+	 */
+
+	task1 = find_task_by_vpid(pid1);
+	if (!task1) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	task2 = find_task_by_vpid(pid2);
+	if (!task2) {
+		put_task_struct(task1);
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	/*
+	 * One should have enough rights to inspect tasks details.
+	 */
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	/*
+	 * Note for all cases but the KCMP_FILE we
+	 * don't take any locks in a sake of speed.
+	 */
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = kcmp_ptr(filp1, filp2, KCMP_FILE);
+		else
+			ret = -EBADF;
+		break;
+	}
+	case KCMP_VM:
+		ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = kcmp_ptr(task1->sysvsem.undo_list,
+			       task2->sysvsem.undo_list,
+			       KCMP_SYSVSEM);
+#else
+		ret = -EINVAL;
+		goto err;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto err;
+	}
+
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+}
+
+static __init int kcmp_cookie_init(void)
+{
+	int i, j;
+
+	for (i = 0; i < KCMP_TYPES; i++) {
+		for (j = 0; j < 2; j++) {
+			get_random_bytes(&cookies[i][j],
+					 sizeof(cookies[i][j]));
+		}
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+	}
+
+	return 0;
+}
+late_initcall(kcmp_cookie_init);
Index: linux-2.6.git/tools/testing/selftests/kcmp/Makefile
===================================================================
--- /dev/null
+++ linux-2.6.git/tools/testing/selftests/kcmp/Makefile
@@ -0,0 +1,35 @@
+ifeq ($(strip $(V)),)
+	E = @echo
+	Q = @
+else
+	E = @\#
+	Q =
+endif
+export E Q
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+	CFLAGS := -DCONFIG_X86_32
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+	CFLAGS := -DCONFIG_X86_64
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	$(E) "  CC run_test"
+	$(Q) gcc $(CFLAGS) kcmp_test.c -o run_test
+else
+	$(E) "Not an x86 target, can't build breakpoints selftests"
+endif
+
+clean:
+	$(E) "  CLEAN"
+	$(Q) rm -fr ./run_test
+	$(Q) rm -fr ./test-file
Index: linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
===================================================================
--- /dev/null
+++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
@@ -0,0 +1,126 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <limits.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+
+#ifdef CONFIG_X86_64
+#include "asm/unistd_64.h"
+#else
+#include "asm/unistd_32.h"
+#endif
+
+#include "linux/kcmp.h"
+
+#ifdef CONFIG_X86_64
+static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
+		unsigned long arg2, unsigned long arg3,
+		unsigned long arg4)
+{
+	long ret;
+	asm volatile(
+		"movl %1, %%eax		\n"
+		"movq %2, %%rdi		\n"
+		"movq %3, %%rsi		\n"
+		"movq %4, %%rdx		\n"
+		"movq %5, %%r10		\n"
+		"movq %6, %%r8		\n"
+		"syscall		\n"
+		"movq %%rax, %0		\n"
+		: "=r"(ret)
+		: "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
+			"g" (arg3), "g" (arg4)
+		: "rax", "rdi", "rsi", "rdx", "r10", "r8", "memory");
+	return ret;
+}
+#else
+static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
+		unsigned long arg2, unsigned long arg3,
+		unsigned long arg4)
+{
+	long ret;
+	asm volatile(
+		"movl %1, %%eax		\n"
+		"movl %2, %%ebx		\n"
+		"movl %3, %%ecx		\n"
+		"movl %4, %%edx		\n"
+		"movl %5, %%esi		\n"
+		"movl %6, %%edi		\n"
+		"intl $0x80		\n"
+		"movq %%eax, %0		\n"
+		: "=r"(ret)
+		: "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
+			"g" (arg3), "g" (arg4)
+		: "eax", "ebx", "ecx", "edx", "esi", "edi", "memory");
+	return ret;
+}
+#endif
+
+static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
+{
+	return syscall5(__NR_kcmp, (long)pid1, (long)pid2,
+			(long)type, (long)fd1, (long)fd2);
+}
+
+int main(int argc, char **argv)
+{
+	int pid1, pid2;
+	int fd1, fd2;
+	int status;
+
+	fd1 = open("test-file", O_RDWR | O_CREAT | O_TRUNC, 0644);
+	pid1 = getpid();
+
+	if (fd1 < 0) {
+		perror("Can't create file");
+		exit(1);
+	}
+
+	pid2 = fork();
+	if (pid2 < 0) {
+		perror("fork failed");
+		exit(1);
+	}
+
+	if (!pid2) {
+		int pid2 = getpid();
+
+		fd2 = open("test-file", O_RDWR, 0644);
+		if (fd2 < 0) {
+			perror("Can't open file");
+			exit(1);
+		}
+
+                printf("pid1: %6d pid2: %6d FD: %2d FILES: %2d VM: %2d FS: %2d "
+		       "SIGHAND: %2d IO: %2d SYSVSEM: %2d INV: %2d\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
+		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),
+
+			/* This one should fail */
+		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
+
+		/* This one should return same fd */
+		printf("pid1: %6d pid2: %6d SAME-FD: %2d\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd1));
+
+		exit(0);
+	}
+
+	waitpid(pid2, &status, P_ALL);
+
+	return 0;
+}
Index: linux-2.6.git/tools/testing/selftests/run_tests
===================================================================
--- linux-2.6.git.orig/tools/testing/selftests/run_tests
+++ linux-2.6.git/tools/testing/selftests/run_tests
@@ -1,6 +1,6 @@
 #!/bin/bash
 
-TARGETS=breakpoints
+TARGETS="breakpoints kcmp"
 
 for TARGET in $TARGETS
 do

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
                     ` (2 preceding siblings ...)
  2012-01-27 18:40   ` Eric W. Biederman
@ 2012-01-27 19:37   ` Vasiliy Kulikov
  2012-01-27 19:59     ` hpanvin@gmail.com
  2012-01-27 20:19   ` KOSAKI Motohiro
  4 siblings, 1 reply; 37+ messages in thread
From: Vasiliy Kulikov @ 2012-01-27 19:37 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa,
	Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks

Hi Cyrill,

On Fri, Jan 27, 2012 at 21:53 +0400, Cyrill Gorcunov wrote:
> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparision results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values (using random
> + * cookies obtaned at early boot stage) and compare the production
> + * instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];

AFACS, cookies is fully random value, is it possible that

((v1 ^ cookies[type][0]) * cookies[type][1] == (v2 ^ cookies[type][0]) * cookies[type][1]) &&
(v1 != v2)

for too round cookies[type][1]?

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:40         ` Cyrill Gorcunov
@ 2012-01-27 19:40           ` Andi Kleen
  2012-01-27 20:55             ` Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2012-01-27 19:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andi Kleen, LKML, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

> > The basic problem is if this interface is at the right level of abstraction.
> > I have some doubts on that. It seems like a long term maintenance nightmare to 
> > me. It may be better to put the loop that would call this into the kernel.
> > 
> 
> Hmm, ie selftest right in kenel?

Not testing, but more the general stability of the interface. IMHO it exposes
too many kernel internals. I know they are already exposed by clone/unshare, 
but in those nothing breaks if the user program doesn't know about some new
flags. But this looks like the user always has to be updated for every change.
I think I would prefer if more of the user was in kernel to not expose
that much.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 19:37   ` Vasiliy Kulikov
@ 2012-01-27 19:59     ` hpanvin@gmail.com
  2012-01-27 20:07       ` Cyrill Gorcunov
  0 siblings, 1 reply; 37+ messages in thread
From: hpanvin@gmail.com @ 2012-01-27 19:59 UTC (permalink / raw)
  To: Vasiliy Kulikov, Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks

No, cookies[1] is always odd.

Vasiliy Kulikov <segoon@openwall.com> wrote:

>Hi Cyrill,
>
>On Fri, Jan 27, 2012 at 21:53 +0400, Cyrill Gorcunov wrote:
>> +/*
>> + * We don't expose real in-memory order of objects for security
>> + * reasons, still the comparision results should be suitable for
>> + * sorting. Thus, we obfuscate kernel pointers values (using random
>> + * cookies obtaned at early boot stage) and compare the production
>> + * instead.
>> + */
>> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
>> +
>> +static long kptr_obfuscate(long v, int type)
>> +{
>> +	return (v ^ cookies[type][0]) * cookies[type][1];
>
>AFACS, cookies is fully random value, is it possible that
>
>((v1 ^ cookies[type][0]) * cookies[type][1] == (v2 ^ cookies[type][0])
>* cookies[type][1]) &&
>(v1 != v2)
>
>for too round cookies[type][1]?
>
>Thanks,
>
>-- 
>Vasiliy Kulikov
>http://www.openwall.com - bringing security into open computing
>environments

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-27 17:53 ` [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
  2012-01-27 18:29   ` Kees Cook
@ 2012-01-27 20:00   ` KOSAKI Motohiro
  2012-01-27 20:10     ` Cyrill Gorcunov
  1 sibling, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2012-01-27 20:00 UTC (permalink / raw)
  To: gorcunov
  Cc: linux-kernel, akpm, ebiederm, xemul, xemul, serge.hallyn,
	keescook, kamezawa.hiroyu, adobriyan, tj, avagin, segoon

On 1/27/2012 12:53 PM, Cyrill Gorcunov wrote:
> We would like to have an ability to restore command line
> arguments and envirion pointers so the task being restored
> would print appropriate values in /proc/pid/cmdline and
> /proc/pid/envirion. The exit_code is needed to restore
> zombie tasks.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Vagin <avagin@openvz.org>
> Cc: Vasiliy Kulikov <segoon@openwall.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  Documentation/filesystems/proc.txt |    5 +++++
>  fs/proc/array.c                    |   23 ++++++++++++++++++-----
>  2 files changed, 23 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.git/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux-2.6.git.orig/Documentation/filesystems/proc.txt
> +++ linux-2.6.git/Documentation/filesystems/proc.txt
> @@ -311,6 +311,11 @@ Table 1-4: Contents of the stat files (a
>    start_data    address above which program data+bss is placed
>    end_data      address below which program data+bss is placed
>    start_brk     address above which program heap can be expanded with brk()
> +  arg_start     address above which program command line is placed
> +  arg_end       address below which program command line is placed
> +  env_start     address above which program environment is placed
> +  env_end       address below which program environment is placed
> +  exit_code     the thread's exit_code in the form reported by the waitpid system call
>  ..............................................................................
>  
>  The /proc/PID/maps file containing the currently mapped memory regions and
> Index: linux-2.6.git/fs/proc/array.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
>  
>  	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
>  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
> +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld ",
>  		pid_nr_ns(pid, ns),
>  		tcomm,
>  		state,
> @@ -511,10 +511,23 @@ static int do_task_stat(struct seq_file
>  		task->policy,
>  		(unsigned long long)delayacct_blkio_ticks(task),
>  		cputime_to_clock_t(gtime),
> -		cputime_to_clock_t(cgtime),
> -		(mm && permitted) ? mm->start_data : 0,
> -		(mm && permitted) ? mm->end_data : 0,
> -		(mm && permitted) ? mm->start_brk : 0);
> +		cputime_to_clock_t(cgtime));
> +
> +		if (mm && permitted) {
> +			seq_printf(m, "%lu %lu %lu %lu %lu %lu %lu ",
> +				   mm->start_data,
> +				   mm->end_data,
> +				   mm->start_brk,
> +				   mm->arg_start,
> +				   mm->arg_end,
> +				   mm->env_start,
> +				   mm->env_end);
> +		} else {
> +			seq_printf(m, "0 0 0 0 0 0 0 ");
> +		}

This part seems good.


> +
> +		seq_printf(m, "%d\n", task->exit_code);
> +

Bad this part seems to make new side channel. exit_code is one of inter process communication messages.
IPC messages should NOT be observed from completely unrelated proesses.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 19:59     ` hpanvin@gmail.com
@ 2012-01-27 20:07       ` Cyrill Gorcunov
  0 siblings, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 20:07 UTC (permalink / raw)
  To: hpanvin@gmail.com
  Cc: Vasiliy Kulikov, LKML, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa,
	Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 11:59:18AM -0800, hpanvin@gmail.com wrote:
> No, cookies[1] is always odd.
> 

Yeah, this requirement was the first problem I've been pointed
in early version of the patch ;)

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  2012-01-27 20:00   ` KOSAKI Motohiro
@ 2012-01-27 20:10     ` Cyrill Gorcunov
  0 siblings, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 20:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, akpm, ebiederm, xemul, xemul, serge.hallyn,
	keescook, kamezawa.hiroyu, adobriyan, tj, avagin, segoon

On Fri, Jan 27, 2012 at 03:00:37PM -0500, KOSAKI Motohiro wrote:
...
> 
> > +
> > +		seq_printf(m, "%d\n", task->exit_code);
> > +
> 
> Bad this part seems to make new side channel. exit_code is one of
> inter process communication messages. IPC messages should NOT be
> observed from completely unrelated proesses.
> 

You mean we need "permitted" test here, right?

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
                     ` (3 preceding siblings ...)
  2012-01-27 19:37   ` Vasiliy Kulikov
@ 2012-01-27 20:19   ` KOSAKI Motohiro
  2012-01-27 20:33     ` Eric W. Biederman
                       ` (2 more replies)
  4 siblings, 3 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2012-01-27 20:19 UTC (permalink / raw)
  To: gorcunov
  Cc: linux-kernel, akpm, ebiederm, xemul, xemul, avagin,
	kosaki.motohiro, mingo, hpa, tglx, glommer, andi, tj, matthltc,
	penberg, eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

On 1/27/2012 12:53 PM, Cyrill Gorcunov wrote:
> While doing the checkpoint-restore in the userspace one need to determine
> whether various kernel objects (like mm_struct-s of file_struct-s) are shared
> between tasks and restore this state.
> 
> The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
> syscall, while there's currently no ways for solving the 1st one.
> 
> One of the ways for checking whether two tasks share e.g. mm_struct is to
> provide some mm_struct ID of a task to its proc file, but showing such
> info considered to be not that good for security reasons.
> 
> Thus after some debates we end up in conclusion that using that named
> 'comparision' syscall might be the best candidate. So here is it --
> __NR_kcmp.
> 
> It takes up to 5 agruments - the pids of the two tasks (which
> characteristics should be compared), the comparision type and
> (in case of comparision of files) two file descriptors.
> 
> Lookups for pids are done in the caller's PID namespace only.
> 
> At moment only x86 is supported and tested.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: "Eric W. Biederman" <ebiederm@xmission.com>
> CC: Pavel Emelyanov <xemul@parallels.com>
> CC: Andrey Vagin <avagin@openvz.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: H. Peter Anvin <hpa@zytor.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Glauber Costa <glommer@parallels.com>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Tejun Heo <tj@kernel.org>
> CC: Matt Helsley <matthltc@us.ibm.com>
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Vasiliy Kulikov <segoon@openwall.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Valdis.Kletnieks@vt.edu
> ---
>  arch/x86/syscalls/syscall_32.tbl         |    1 
>  arch/x86/syscalls/syscall_64.tbl         |    1 
>  include/linux/kcmp.h                     |   17 ++
>  include/linux/syscalls.h                 |    2 
>  kernel/Makefile                          |    1 
>  kernel/kcmp.c                            |  181 +++++++++++++++++++++++++++++++
>  tools/testing/selftests/kcmp/Makefile    |   35 +++++
>  tools/testing/selftests/kcmp/kcmp_test.c |  126 +++++++++++++++++++++
>  tools/testing/selftests/run_tests        |    2 
>  9 files changed, 365 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
> +++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
> @@ -355,3 +355,4 @@
>  346	i386	setns			sys_setns
>  347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
>  348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
> +349	i386	kcmp			sys_kcmp
> Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
> +++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
> @@ -318,3 +318,4 @@
>  309	64	getcpu			sys_getcpu
>  310	64	process_vm_readv	sys_process_vm_readv
>  311	64	process_vm_writev	sys_process_vm_writev
> +312	64	kcmp			sys_kcmp
> Index: linux-2.6.git/include/linux/kcmp.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/include/linux/kcmp.h
> @@ -0,0 +1,17 @@
> +#ifndef _LINUX_KCMP_H
> +#define _LINUX_KCMP_H
> +
> +/* Comparision type */
> +enum kcmp_type {
> +	KCMP_FILE,
> +	KCMP_VM,
> +	KCMP_FILES,
> +	KCMP_FS,
> +	KCMP_SIGHAND,
> +	KCMP_IO,
> +	KCMP_SYSVSEM,
> +
> +	KCMP_TYPES,
> +};
> +
> +#endif /* _LINUX_KCMP_H */
> Index: linux-2.6.git/include/linux/syscalls.h
> ===================================================================
> --- linux-2.6.git.orig/include/linux/syscalls.h
> +++ linux-2.6.git/include/linux/syscalls.h
> @@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
>  				      unsigned long riovcnt,
>  				      unsigned long flags);
>  
> +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> +			 unsigned long idx1, unsigned long idx2);
>  #endif
> Index: linux-2.6.git/kernel/Makefile
> ===================================================================
> --- linux-2.6.git.orig/kernel/Makefile
> +++ linux-2.6.git/kernel/Makefile
> @@ -25,6 +25,7 @@ endif
>  obj-y += sched/
>  obj-y += power/
>  
> +obj-$(CONFIG_X86) += kcmp.o
>  obj-$(CONFIG_FREEZER) += freezer.o
>  obj-$(CONFIG_PROFILING) += profile.o
>  obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
> Index: linux-2.6.git/kernel/kcmp.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -0,0 +1,181 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <linux/fdtable.h>
> +#include <linux/string.h>
> +#include <linux/random.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/cache.h>
> +#include <linux/bug.h>
> +#include <linux/err.h>
> +#include <linux/kcmp.h>
> +
> +#include <asm/unistd.h>
> +
> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparision results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values (using random
> + * cookies obtaned at early boot stage) and compare the production
> + * instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable (reserved for future)
> + */
> +static int kcmp_ptr(long v1, long v2, enum kcmp_type type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +/* The caller must have pinned the task */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	spin_lock(&task->files->file_lock);
> +	fdt = files_fdtable(task->files);
> +	if (idx < fdt->max_fds)
> +		file = fdt->fd[idx];
> +	else
> +		file = NULL;
> +	spin_unlock(&task->files->file_lock);
> +
> +	return file;
> +}
> +
> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> +		unsigned long, idx1, unsigned long, idx2)
> +{
> +	struct task_struct *task1, *task2;
> +	int ret;
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * Tasks are looked up in caller's
> +	 * PID namespace only.
> +	 */
> +
> +	task1 = find_task_by_vpid(pid1);
> +	if (!task1) {
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	task2 = find_task_by_vpid(pid2);
> +	if (!task2) {
> +		put_task_struct(task1);
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	get_task_struct(task1);
> +	get_task_struct(task2);
> +
> +	rcu_read_unlock();
> +
> +	/*
> +	 * One should have enough rights to inspect tasks details.
> +	 */
> +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> +		ret = -EACCES;
> +		goto err;
> +	}
> +
> +	/*
> +	 * Note for all cases but the KCMP_FILE we
> +	 * don't take any locks in a sake of speed.
> +	 */
> +
> +	switch (type) {
> +	case KCMP_FILE: {
> +		struct file *filp1, *filp2;
> +
> +		filp1 = get_file_raw_ptr(task1, idx1);
> +		filp2 = get_file_raw_ptr(task2, idx2);
> +
> +		if (filp1 && filp2)
> +			ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
> +		else
> +			ret = -ENOENT;

If my remember is correct, Andrew pointed out EINVAL is better than ENOENT.


> +		break;
> +	}
> +	case KCMP_VM:
> +		ret = kcmp_ptr((long)task1->mm,
> +			       (long)task2->mm,
> +			       KCMP_VM);
> +		break;
> +	case KCMP_FILES:
> +		ret = kcmp_ptr((long)task1->files,
> +			       (long)task2->files,
> +			       KCMP_FILES);
> +		break;
> +	case KCMP_FS:
> +		ret = kcmp_ptr((long)task1->fs,
> +			       (long)task2->fs,
> +			       KCMP_FS);
> +		break;
> +	case KCMP_SIGHAND:
> +		ret = kcmp_ptr((long)task1->sighand,
> +			       (long)task2->sighand,
> +			       KCMP_SIGHAND);
> +		break;
> +	case KCMP_IO:
> +		ret = kcmp_ptr((long)task1->io_context,
> +			       (long)task2->io_context,
> +			       KCMP_IO);
> +		break;
> +	case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
> +			       (long)task2->sysvsem.undo_list,
> +			       KCMP_SYSVSEM);
> +#else
> +		ret = -EINVAL;

ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
but in this case, it is not bad. only the kernel doesn't have enough feature.


> +		goto err;

you don't need err label at all.


> +#endif
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +err:
> +	put_task_struct(task1);
> +	put_task_struct(task2);
> +
> +	return ret;
> +}
> +
> +static __init int kcmp_cookie_init(void)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < KCMP_TYPES; i++) {
> +		for (j = 0; j < 2; j++) {
> +			get_random_bytes(&cookies[i][j],
> +					 sizeof(cookies[i][j]));
> +		}
> +		cookies[i][1] |= (~(~0UL >>  1) | 1);
> +	}
> +
> +	return 0;
> +}
> +late_initcall(kcmp_cookie_init);
> Index: linux-2.6.git/tools/testing/selftests/kcmp/Makefile
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/tools/testing/selftests/kcmp/Makefile
> @@ -0,0 +1,35 @@
> +ifeq ($(strip $(V)),)
> +	E = @echo
> +	Q = @
> +else
> +	E = @\#
> +	Q =
> +endif
> +export E Q
> +
> +uname_M := $(shell uname -m 2>/dev/null || echo not)
> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
> +ifeq ($(ARCH),i386)
> +        ARCH := X86
> +	CFLAGS := -DCONFIG_X86_32
> +endif
> +ifeq ($(ARCH),x86_64)
> +	ARCH := X86
> +	CFLAGS := -DCONFIG_X86_64
> +endif
> +
> +CFLAGS += -I../../../../arch/x86/include/generated/
> +CFLAGS += -I../../../../include/
> +
> +all:
> +ifeq ($(ARCH),X86)
> +	$(E) "  CC run_test"
> +	$(Q) gcc $(CFLAGS) kcmp_test.c -o run_test
> +else
> +	$(E) "Not an x86 target, can't build breakpoints selftests"
> +endif
> +
> +clean:
> +	$(E) "  CLEAN"
> +	$(Q) rm -fr ./run_test
> +	$(Q) rm -fr ./test-file
> Index: linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
> @@ -0,0 +1,126 @@
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <limits.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <string.h>
> +#include <fcntl.h>
> +
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/wait.h>
> +
> +#ifdef CONFIG_X86_64
> +#include "asm/unistd_64.h"
> +#else
> +#include "asm/unistd_32.h"
> +#endif
> +
> +#include "linux/kcmp.h"
> +
> +#ifdef CONFIG_X86_64
> +static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
> +		unsigned long arg2, unsigned long arg3,
> +		unsigned long arg4)
> +{
> +	long ret;
> +	asm volatile(
> +		"movl %1, %%eax		\n"
> +		"movq %2, %%rdi		\n"
> +		"movq %3, %%rsi		\n"
> +		"movq %4, %%rdx		\n"
> +		"movq %5, %%r10		\n"
> +		"movq %6, %%r8		\n"
> +		"syscall		\n"
> +		"movq %%rax, %0		\n"
> +		: "=r"(ret)
> +		: "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
> +			"g" (arg3), "g" (arg4)
> +		: "rax", "rdi", "rsi", "rdx", "r10", "r8", "memory");
> +	return ret;
> +}
> +#else
> +static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
> +		unsigned long arg2, unsigned long arg3,
> +		unsigned long arg4)
> +{
> +	long ret;
> +	asm volatile(
> +		"movl %1, %%eax		\n"
> +		"movl %2, %%ebx		\n"
> +		"movl %3, %%ecx		\n"
> +		"movl %4, %%edx		\n"
> +		"movl %5, %%esi		\n"
> +		"movl %6, %%edi		\n"
> +		"intl $0x80		\n"
> +		"movq %%eax, %0		\n"
> +		: "=r"(ret)
> +		: "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
> +			"g" (arg3), "g" (arg4)
> +		: "eax", "ebx", "ecx", "edx", "esi", "edi", "memory");
> +	return ret;
> +}
> +#endif
> +
> +static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
> +{
> +	return syscall5(__NR_kcmp, (long)pid1, (long)pid2,
> +			(long)type, (long)fd1, (long)fd2);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int pid1, pid2;
> +	int fd1, fd2;
> +	int status;
> +
> +	fd1 = open("test-file", O_RDWR | O_CREAT | O_TRUNC, 0644);
> +	pid1 = getpid();
> +
> +	if (fd1 < 0) {
> +		perror("Can't create file");
> +		exit(1);
> +	}
> +
> +	pid2 = fork();
> +	if (pid2 < 0) {
> +		perror("fork failed");
> +		exit(1);
> +	}
> +
> +	if (!pid2) {
> +		int pid2 = getpid();
> +
> +		fd2 = open("test-file", O_RDWR, 0644);
> +		if (fd2 < 0) {
> +			perror("Can't open file");
> +			exit(1);
> +		}
> +
> +                printf("pid1: %6d pid2: %6d FD: %2d FILES: %2d VM: %2d FS: %2d "
> +		       "SIGHAND: %2d IO: %2d SYSVSEM: %2d INV: %2d\n",
> +		       pid1, pid2,
> +		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
> +		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),.

The best practice of auto test is

 AssertFooBar(expected_value, actual_value);

and, just only print "correct or not". Only you know the correct value.



> +
> +			/* This one should fail */
> +		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
> +
> +		/* This one should return same fd */
> +		printf("pid1: %6d pid2: %6d SAME-FD: %2d\n",
> +		       pid1, pid2,
> +		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd1));
> +
> +		exit(0);
> +	}
> +
> +	waitpid(pid2, &status, P_ALL);
> +
> +	return 0;
> +}
> Index: linux-2.6.git/tools/testing/selftests/run_tests
> ===================================================================
> --- linux-2.6.git.orig/tools/testing/selftests/run_tests
> +++ linux-2.6.git/tools/testing/selftests/run_tests
> @@ -1,6 +1,6 @@
>  #!/bin/bash
>  
> -TARGETS=breakpoints
> +TARGETS="breakpoints kcmp"
>  
>  for TARGET in $TARGETS
>  do
> 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-27 18:37   ` Kees Cook
  2012-01-27 18:43     ` Cyrill Gorcunov
@ 2012-01-27 20:28     ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2012-01-27 20:28 UTC (permalink / raw)
  To: keescook
  Cc: gorcunov, linux-kernel, akpm, ebiederm, xemul, mtk.manpages, tj,
	avagin, serge.hallyn, xemul, segoon, kamezawa.hiroyu

>> +
>> +       case PR_SET_MM_AUXV:
>> +               if (arg4 > sizeof(mm->saved_auxv))
>> +                       goto out;
>> +               up_read(&mm->mmap_sem);
>> +
>> +               error = -EFAULT;
>> +               if (!copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4))
>> +                       error = 0;
>> +
>> +               return error;
> 
> Is the mmap_sem released here because of the copy_from_user()? Is it
> still safe to write to saved_auxv?

Sure. It's unsafe.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries
  2012-01-27 18:43     ` Cyrill Gorcunov
@ 2012-01-27 20:31       ` KOSAKI Motohiro
  0 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2012-01-27 20:31 UTC (permalink / raw)
  To: gorcunov
  Cc: keescook, linux-kernel, akpm, ebiederm, xemul, mtk.manpages, tj,
	avagin, serge.hallyn, xemul, segoon, kamezawa.hiroyu

>>> +
>>> +       case PR_SET_MM_AUXV:
>>> +               if (arg4 > sizeof(mm->saved_auxv))
>>> +                       goto out;
>>> +               up_read(&mm->mmap_sem);
>>> +
>>> +               error = -EFAULT;
>>> +               if (!copy_from_user(mm->saved_auxv, (const void __user *)addr, arg4))
>>> +                       error = 0;
>>> +
>>> +               return error;
>>
>> Is the mmap_sem released here because of the copy_from_user()? Is it
>> still safe to write to saved_auxv?
>>
> 
> At moment I believe yes (if only I'm not missing something), we poke this
> vector at elf loading procedure, so it's up to user to sync access to "own"
> saved_auxv and write sane values inside.

Think prctl(PR_SET_MM_AUXV) vs prctl(PR_SET_MM_AUXV) race. That might make to write corrupted value.
syscall should be atomic from a point of userland view.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 20:19   ` KOSAKI Motohiro
@ 2012-01-27 20:33     ` Eric W. Biederman
  2012-01-27 20:50       ` Cyrill Gorcunov
  2012-01-27 20:34     ` Glauber Costa
  2012-01-27 20:47     ` Cyrill Gorcunov
  2 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-01-27 20:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: gorcunov, linux-kernel, akpm, xemul, xemul, avagin,
	kosaki.motohiro, mingo, hpa, tglx, glommer, andi, tj, matthltc,
	penberg, eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> writes:

> On 1/27/2012 12:53 PM, Cyrill Gorcunov wrote:
>> +	switch (type) {
>> +	case KCMP_FILE: {
>> +		struct file *filp1, *filp2;
>> +
>> +		filp1 = get_file_raw_ptr(task1, idx1);
>> +		filp2 = get_file_raw_ptr(task2, idx2);
>> +
>> +		if (filp1 && filp2)
>> +			ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
>> +		else
>> +			ret = -ENOENT;
>
> If my remember is correct, Andrew pointed out EINVAL is better than ENOENT.

Ah yes.  And really what it should be is
		if (!filp1 || !filp2)
 			return -EBADF;

At least EBADF is what you return if it is your process that doesn't
have the filedescriptor.

>> +		break;
>> +	case KCMP_SYSVSEM:
>> +#ifdef CONFIG_SYSVIPC
>> +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
>> +			       (long)task2->sysvsem.undo_list,
>> +			       KCMP_SYSVSEM);
>> +#else
>> +		ret = -EINVAL;
>
> ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
> but in this case, it is not bad. only the kernel doesn't have enough
> feature.

Careful a type compiled out should in principle match a type whose
support has not been implemented. That is the default case should match
what happens when you don't compile in sysvipc support.

>
>> +		goto err;
>
> you don't need err label at all.
>
>
>> +#endif
>> +		break;
>> +	default:
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +err:
>> +	put_task_struct(task1);
>> +	put_task_struct(task2);
>> +
>> +	return ret;
>> +}

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 20:19   ` KOSAKI Motohiro
  2012-01-27 20:33     ` Eric W. Biederman
@ 2012-01-27 20:34     ` Glauber Costa
  2012-01-27 20:37       ` H. Peter Anvin
  2012-01-27 20:47     ` Cyrill Gorcunov
  2 siblings, 1 reply; 37+ messages in thread
From: Glauber Costa @ 2012-01-27 20:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: gorcunov, linux-kernel, akpm, ebiederm, xemul, xemul, avagin,
	kosaki.motohiro, mingo, hpa, tglx, andi, tj, matthltc, penberg,
	eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

>> +	case KCMP_SYSVSEM:
>> +#ifdef CONFIG_SYSVIPC
>> +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
>> +			       (long)task2->sysvsem.undo_list,
>> +			       KCMP_SYSVSEM);
>> +#else
>> +		ret = -EINVAL;
> 
> ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
> but in this case, it is not bad. only the kernel doesn't have enough feature.
> 

Isn't it usually the case for ENOSYS, then ?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 20:34     ` Glauber Costa
@ 2012-01-27 20:37       ` H. Peter Anvin
  0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2012-01-27 20:37 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KOSAKI Motohiro, gorcunov, linux-kernel, akpm, ebiederm, xemul,
	xemul, avagin, kosaki.motohiro, mingo, tglx, andi, tj, matthltc,
	penberg, eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

On 01/27/2012 12:34 PM, Glauber Costa wrote:
>>> +	case KCMP_SYSVSEM:
>>> +#ifdef CONFIG_SYSVIPC
>>> +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
>>> +			       (long)task2->sysvsem.undo_list,
>>> +			       KCMP_SYSVSEM);
>>> +#else
>>> +		ret = -EINVAL;
>>
>> ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
>> but in this case, it is not bad. only the kernel doesn't have enough feature.
>>
> 
> Isn't it usually the case for ENOSYS, then ?
> 

ENOSYS means "no such system call".  That would apply if the kcmp system
call itself did not exist.

	-hpa


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 20:19   ` KOSAKI Motohiro
  2012-01-27 20:33     ` Eric W. Biederman
  2012-01-27 20:34     ` Glauber Costa
@ 2012-01-27 20:47     ` Cyrill Gorcunov
  2 siblings, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 20:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, akpm, ebiederm, xemul, xemul, avagin,
	kosaki.motohiro, mingo, hpa, tglx, glommer, andi, tj, matthltc,
	penberg, eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 03:19:07PM -0500, KOSAKI Motohiro wrote:
...
> > +
> > +		if (filp1 && filp2)
> > +			ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
> > +		else
> > +			ret = -ENOENT;
> 
> If my remember is correct, Andrew pointed out EINVAL is better than ENOENT.
> 

Phphphp :) there a version posted already which
returns -EBADFD (as Eric pointed me).

> > +#ifdef CONFIG_SYSVIPC
> > +		ret = kcmp_ptr((long)task1->sysvsem.undo_list,
> > +			       (long)task2->sysvsem.undo_list,
> > +			       KCMP_SYSVSEM);
> > +#else
> > +		ret = -EINVAL;
> 
> ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
> but in this case, it is not bad. only the kernel doesn't have enough feature.
> 

I see.

> 
> > +		goto err;
> 
> you don't need err label at all.
> 

yeah, thanks.

> > +
> > +                printf("pid1: %6d pid2: %6d FD: %2d FILES: %2d VM: %2d FS: %2d "
> > +		       "SIGHAND: %2d IO: %2d SYSVSEM: %2d INV: %2d\n",
> > +		       pid1, pid2,
> > +		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
> > +		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
> > +		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
> > +		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
> > +		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
> > +		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
> > +		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),.
> 
> The best practice of auto test is
> 
>  AssertFooBar(expected_value, actual_value);
> 
> and, just only print "correct or not". Only you know the correct value.
> 

ok

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 20:33     ` Eric W. Biederman
@ 2012-01-27 20:50       ` Cyrill Gorcunov
  2012-01-27 21:19         ` Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-27 20:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KOSAKI Motohiro, linux-kernel, akpm, xemul, xemul, avagin,
	kosaki.motohiro, mingo, hpa, tglx, glommer, andi, tj, matthltc,
	penberg, eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

On Fri, Jan 27, 2012 at 12:33:07PM -0800, Eric W. Biederman wrote:
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> writes:
> >> +			ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
> >> +		else
> >> +			ret = -ENOENT;
> >
> > If my remember is correct, Andrew pointed out EINVAL is better than ENOENT.
> 
> Ah yes.  And really what it should be is
> 		if (!filp1 || !filp2)
>  			return -EBADF;
> 
> At least EBADF is what you return if it is your process that doesn't
> have the filedescriptor.
> 

Eric, I've sent out version with

		if (filp1 && filp2)
			...
		else
			ret = -EBADF;

maybe you're lookin into previous version?

> >> +			       KCMP_SYSVSEM);
> >> +#else
> >> +		ret = -EINVAL;
> >
> > ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
> > but in this case, it is not bad. only the kernel doesn't have enough
> > feature.
> 
> Careful a type compiled out should in principle match a type whose
> support has not been implemented. That is the default case should match
> what happens when you don't compile in sysvipc support.

I don't get it :) Will -EINVAL be enough or not?

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 19:40           ` Andi Kleen
@ 2012-01-27 20:55             ` Eric W. Biederman
  0 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-01-27 20:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Cyrill Gorcunov, LKML, Andrew Morton, Pavel Emelyanov,
	KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

Andi Kleen <andi@firstfloor.org> writes:

>> > The basic problem is if this interface is at the right level of abstraction.
>> > I have some doubts on that. It seems like a long term maintenance nightmare to 
>> > me. It may be better to put the loop that would call this into the kernel.
>> > 
>> 
>> Hmm, ie selftest right in kenel?
>
> Not testing, but more the general stability of the interface. IMHO it exposes
> too many kernel internals. I know they are already exposed by clone/unshare, 
> but in those nothing breaks if the user program doesn't know about some new
> flags. But this looks like the user always has to be updated for every change.
> I think I would prefer if more of the user was in kernel to not expose
> that much.

Do you vote for putting the entire process serializer in one system call
then?

With the serializer in userspace you only need to update your userspace
code if something uses a new facility.  Which is the standard userspace
requirement.  Userspace does not need to be strictly in sync with the
kernel.

Personally I think all of this exporting extra state a little at a time
is horrible, but it seems to have a better chance of getting merged
because the pain comes a little at time.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 20:50       ` Cyrill Gorcunov
@ 2012-01-27 21:19         ` Eric W. Biederman
  0 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-01-27 21:19 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, linux-kernel, akpm, xemul, xemul, avagin,
	kosaki.motohiro, mingo, hpa, tglx, glommer, andi, tj, matthltc,
	penberg, eric.dumazet, segoon, adobriyan, Valdis.Kletnieks

Cyrill Gorcunov <gorcunov@openvz.org> writes:

> On Fri, Jan 27, 2012 at 12:33:07PM -0800, Eric W. Biederman wrote:
>> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> writes:
>> >> +			ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
>> >> +		else
>> >> +			ret = -ENOENT;
>> >
>> > If my remember is correct, Andrew pointed out EINVAL is better than ENOENT.
>> 
>> Ah yes.  And really what it should be is
>> 		if (!filp1 || !filp2)
>>  			return -EBADF;
>> 
>> At least EBADF is what you return if it is your process that doesn't
>> have the filedescriptor.
>> 
>
> Eric, I've sent out version with
>
> 		if (filp1 && filp2)
> 			...
> 		else
> 			ret = -EBADF;
>
> maybe you're lookin into previous version?

Yeah.  Comments and patch passing in the night.  It looks like you have
it right in your latest patch.

>> >> +			       KCMP_SYSVSEM);
>> >> +#else
>> >> +		ret = -EINVAL;
>> >
>> > ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
>> > but in this case, it is not bad. only the kernel doesn't have enough
>> > feature.
>> 
>> Careful a type compiled out should in principle match a type whose
>> support has not been implemented. That is the default case should match
>> what happens when you don't compile in sysvipc support.
>
> I don't get it :) Will -EINVAL be enough or not?

At the present time the only way we can get -EINVAL by not supporting a
type so there is no pressing need for something different.

However in the general case EINVAL is a pretty generic failure mode and
having something more precise that you can use to figure out what you
did wrong when calling a system call tends to help a great deal.

So I am favor of using a better error code if EOPNOTSUP or ENOTTY if we
can convince ourselves it is the proper error code.

Mostly it is bike shedding but it is a detail that getting it right will
help users of this interface in the long run.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-27 18:30       ` H. Peter Anvin
@ 2012-01-28 17:19         ` Michael Kerrisk
  2012-01-28 17:34           ` Cyrill Gorcunov
  0 siblings, 1 reply; 37+ messages in thread
From: Michael Kerrisk @ 2012-01-28 17:19 UTC (permalink / raw)
  To: H. Peter Anvin, Cyrill Gorcunov
  Cc: Andi Kleen, LKML, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, KOSAKI Motohiro, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks

On Sat, Jan 28, 2012 at 7:30 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/27/2012 10:24 AM, Cyrill Gorcunov wrote:
>>
>> Sure, I'll try to (btw, where I should send it to? And in which
>> format novadays mans are written? In plain old troff or some
>> human readable asciidocs?) And... should I post man page
>> on LKML as well?
>>
>
> Normally they're written using the mandoc macro set in *roff.  Michael
> Kerrisk (mtk) is the authority on how to do things.

Cyrill, see  http://www.kernel.org/doc/man-pages/patches.html and the
following man pages
man-pages(7)
man(7)
groff_man(7)

Thanks,

Michael
-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-28 17:19         ` Michael Kerrisk
@ 2012-01-28 17:34           ` Cyrill Gorcunov
  2012-01-28 17:36             ` Cyrill Gorcunov
  0 siblings, 1 reply; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-28 17:34 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: H. Peter Anvin, Andi Kleen, LKML, Andrew Morton,
	Eric W. Biederman, Pavel Emelyanov, KOSAKI Motohiro,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Tejun Heo, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan,
	Valdis.Kletnieks

On Sun, Jan 29, 2012 at 06:19:41AM +1300, Michael Kerrisk wrote:
> On Sat, Jan 28, 2012 at 7:30 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 01/27/2012 10:24 AM, Cyrill Gorcunov wrote:
> >>
> >> Sure, I'll try to (btw, where I should send it to? And in which
> >> format novadays mans are written? In plain old troff or some
> >> human readable asciidocs?) And... should I post man page
> >> on LKML as well?
> >>
> >
> > Normally they're written using the mandoc macro set in *roff.  Michael
> > Kerrisk (mtk) is the authority on how to do things.
> 
> Cyrill, see  http://www.kernel.org/doc/man-pages/patches.html and the
> following man pages
> man-pages(7)
> man(7)
> groff_man(7)
> 

Hi Michael, thanks! I've found this link yesterday night. Then
I tried to clone man-pages but for some reason it always fail

 | [cyrill@moon projects]$ git clone git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
 | Cloning into 'man-pages'...
 | fatal: The remote end hung up unexpectedly
 | [cyrill@moon projects]$

maybe there another mirror I could fetch this repo?

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
  2012-01-28 17:34           ` Cyrill Gorcunov
@ 2012-01-28 17:36             ` Cyrill Gorcunov
  0 siblings, 0 replies; 37+ messages in thread
From: Cyrill Gorcunov @ 2012-01-28 17:36 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: H. Peter Anvin, Andi Kleen, LKML, Andrew Morton,
	Eric W. Biederman, Pavel Emelyanov, KOSAKI Motohiro,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Tejun Heo, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan,
	Valdis.Kletnieks

On Sat, Jan 28, 2012 at 09:34:56PM +0400, Cyrill Gorcunov wrote:
> 
> Hi Michael, thanks! I've found this link yesterday night. Then
> I tried to clone man-pages but for some reason it always fail
> 
>  | [cyrill@moon projects]$ git clone git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
>  | Cloning into 'man-pages'...
>  | fatal: The remote end hung up unexpectedly
>  | [cyrill@moon projects]$
> 
> maybe there another mirror I could fetch this repo?
> 

Just cloned from github.

	Cyrill

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-01-28 17:36 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-27 17:53 [RFC c/r 0/4] [RFC c/r 0/@total@] A pile in c/r sake Cyrill Gorcunov
2012-01-27 17:53 ` [RFC c/r 1/4] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v9 Cyrill Gorcunov
2012-01-27 17:53 ` [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7 Cyrill Gorcunov
2012-01-27 18:05   ` H. Peter Anvin
2012-01-27 18:11     ` Cyrill Gorcunov
2012-01-27 18:15   ` Andi Kleen
2012-01-27 18:24     ` Cyrill Gorcunov
2012-01-27 18:30       ` H. Peter Anvin
2012-01-28 17:19         ` Michael Kerrisk
2012-01-28 17:34           ` Cyrill Gorcunov
2012-01-28 17:36             ` Cyrill Gorcunov
2012-01-27 18:31       ` Andi Kleen
2012-01-27 18:40         ` Cyrill Gorcunov
2012-01-27 19:40           ` Andi Kleen
2012-01-27 20:55             ` Eric W. Biederman
2012-01-27 18:40   ` Eric W. Biederman
2012-01-27 18:45     ` Cyrill Gorcunov
2012-01-27 19:10     ` Cyrill Gorcunov
2012-01-27 19:37   ` Vasiliy Kulikov
2012-01-27 19:59     ` hpanvin@gmail.com
2012-01-27 20:07       ` Cyrill Gorcunov
2012-01-27 20:19   ` KOSAKI Motohiro
2012-01-27 20:33     ` Eric W. Biederman
2012-01-27 20:50       ` Cyrill Gorcunov
2012-01-27 21:19         ` Eric W. Biederman
2012-01-27 20:34     ` Glauber Costa
2012-01-27 20:37       ` H. Peter Anvin
2012-01-27 20:47     ` Cyrill Gorcunov
2012-01-27 17:53 ` [RFC c/r 3/4] c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat Cyrill Gorcunov
2012-01-27 18:29   ` Kees Cook
2012-01-27 20:00   ` KOSAKI Motohiro
2012-01-27 20:10     ` Cyrill Gorcunov
2012-01-27 17:53 ` [RFC c/r 4/4] c/r: prctl: Extend PR_SET_MM to set up more mm_struct entries Cyrill Gorcunov
2012-01-27 18:37   ` Kees Cook
2012-01-27 18:43     ` Cyrill Gorcunov
2012-01-27 20:31       ` KOSAKI Motohiro
2012-01-27 20:28     ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).