* [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-17  8:20 Andrey Vagin
  2015-02-17  8:20   ` Andrey Vagin
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

Here is a preview version. It provides a restricted set of functionality.
I would like to collect feedback on this idea.

Currently we use the proc file system, where all information is
presented in text files, which is convenient for humans.  But if we need
to get information about processes from code (e.g. in C), procfs is not
a good fit.

From code we would prefer to get information in a binary format and to
be able to specify which information is required and for which tasks.
Here is a new interface with all these features, called task_diag.
In addition, it is much faster than procfs.

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.

A request is described by the task_diag_pid structure:

struct task_diag_pid {
       __u64   show_flags;	/* specify which information is required */
       __u64   dump_stratagy;   /* specify a group of processes */

       __u32   pid;
};

A response is a set of netlink messages. Each message describes one task.
All task properties are divided into groups. A message contains the
TASK_DIAG_MSG group and other groups if they have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
response will contain the TASK_DIAG_CRED group, which is described by the
task_diag_creds structure.

struct task_diag_msg {
	__u32	tgid;
	__u32	pid;
	__u32	ppid;
	__u32	tpid;
	__u32	sid;
	__u32	pgid;
	__u8	state;
	char	comm[TASK_DIAG_COMM_LEN];
};

Another good feature of task_diag is the ability to request information
for several processes at once. Currently there are two strategies:
TASK_DIAG_DUMP_ALL	- get information for all tasks
TASK_DIAG_DUMP_CHILDREN	- get information for the children of a
			  specified task

task_diag is much faster than the proc file system. We don't need to
create a new file descriptor for each task; we only send a request and
receive a response. This allows information for several tasks to be
obtained in a single request-response iteration.
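
For illustration, here is a rough sketch of what a client could look
like when written against libnl-genl (the selftests in the last patch
use raw netlink sockets instead; the family, command and attribute
names come from the patches below, everything else, including the use
of libnl, is only an illustration, and error handling is omitted):

/* sketch only; build with -lnl-3 -lnl-genl-3 */
#include <stdio.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>
#include <linux/taskdiag.h>

static int show_one(struct nl_msg *msg, void *arg)
{
	struct nlattr *attrs[TASK_DIAG_MSG + 1];
	struct task_diag_msg *m;

	/* attributes follow the genetlink header directly (hdrlen == 0) */
	genlmsg_parse(nlmsg_hdr(msg), 0, attrs, TASK_DIAG_MSG, NULL);
	if (!attrs[TASK_DIAG_MSG])
		return NL_SKIP;

	m = nla_data(attrs[TASK_DIAG_MSG]);
	printf("pid %d ppid %d comm %s\n", m->pid, m->ppid, m->comm);
	return NL_OK;
}

int main(void)
{
	struct task_diag_pid req = {
		.show_flags	= 0,
		.dump_stratagy	= TASK_DIAG_DUMP_ALL,
		.pid		= 0,	/* unused by TASK_DIAG_DUMP_ALL */
	};
	struct nl_sock *sk = nl_socket_alloc();
	struct nl_msg *msg = nlmsg_alloc();
	int family;

	genl_connect(sk);
	family = genl_ctrl_resolve(sk, TASKDIAG_GENL_NAME);

	genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0,
		    NLM_F_REQUEST | NLM_F_DUMP, TASKDIAG_CMD_GET,
		    TASKDIAG_GENL_VERSION);
	nla_put(msg, TASKDIAG_CMD_ATTR_GET, sizeof(req), &req);

	nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, show_one, NULL);
	nl_send_auto(sk, msg);
	nl_recvmsgs_default(sk);	/* one netlink message per task */

	nlmsg_free(msg);
	nl_socket_free(sk);
	return 0;
}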

I have compared the performance of procfs and task_diag for the
"ps ax -o pid,ppid" command.

The test machine was running 10348 processes:
$ ps ax -o pid,ppid | wc -l
10348

$ time ps ax -o pid,ppid > /dev/null

real	0m1.073s
user	0m0.086s
sys	0m0.903s

$ time ./task_diag_all > /dev/null

real	0m0.037s
user	0m0.004s
sys	0m0.020s

And here are statistics on the syscalls made by each command:
$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
            20,713      syscalls:sys_exit_open
            20,710      syscalls:sys_exit_close
            20,708      syscalls:sys_exit_read
            10,348      syscalls:sys_exit_newstat
                31      syscalls:sys_exit_write

$ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
               114      syscalls:sys_exit_recvfrom
                49      syscalls:sys_exit_write
                 8      syscalls:sys_exit_mmap
                 4      syscalls:sys_exit_mprotect
                 3      syscalls:sys_exit_newfstat

The test program used in this experiment can be found in the last patch.

The idea of this functionality was suggested by Pavel Emelyanov
(xemul@), when he found that operations on /proc form a significant
part of checkpointing time.

Ten years ago there was an attempt to add a netlink interface to access
/proc information:
http://lwn.net/Articles/99600/

Signed-off-by: Andrey Vagin <avagin@openvz.org>

git repo: https://github.com/avagin/linux-task-diag

Andrey Vagin (7):
  [RFC] kernel: add a netlink interface to get information about tasks
  kernel: move next_tgid from fs/proc
  task-diag: add ability to get information about all tasks
  task-diag: add a new group to get process credentials
  kernel: add ability to iterate children of a specified task
  task_diag: add ability to dump children
  selftest: check the task_diag functionality

 fs/proc/array.c                                    |  58 +---
 fs/proc/base.c                                     |  43 ---
 include/linux/proc_fs.h                            |  13 +
 include/uapi/linux/taskdiag.h                      |  89 ++++++
 init/Kconfig                                       |  12 +
 kernel/Makefile                                    |   1 +
 kernel/pid.c                                       |  94 ++++++
 kernel/taskdiag.c                                  | 343 +++++++++++++++++++++
 tools/testing/selftests/task_diag/Makefile         |  16 +
 tools/testing/selftests/task_diag/task_diag.c      |  59 ++++
 tools/testing/selftests/task_diag/task_diag_all.c  |  82 +++++
 tools/testing/selftests/task_diag/task_diag_comm.c | 195 ++++++++++++
 tools/testing/selftests/task_diag/task_diag_comm.h |  47 +++
 tools/testing/selftests/task_diag/taskdiag.h       |   1 +
 14 files changed, 967 insertions(+), 86 deletions(-)
 create mode 100644 include/uapi/linux/taskdiag.h
 create mode 100644 kernel/taskdiag.c
 create mode 100644 tools/testing/selftests/task_diag/Makefile
 create mode 100644 tools/testing/selftests/task_diag/task_diag.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
 create mode 120000 tools/testing/selftests/task_diag/taskdiag.h

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Roger Luethi <rl@hellgate.ch>
-- 
2.1.0



* [PATCH 1/7] kernel: add a netlink interface to get information about tasks
@ 2015-02-17  8:20   ` Andrey Vagin
  0 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.

task_diag is a new interface which is going to replace the proc file
system in cases when we need to get information in a binary format.

A request message is described by the task_diag_pid structure:
struct task_diag_pid {
       __u64   show_flags;
       __u64   dump_stratagy;

       __u32   pid;
};

A response is a set of netlink messages. Each message describes one task.
All task properties are divided into groups. A message contains the
TASK_DIAG_MSG group, and other groups if they have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
response will contain the TASK_DIAG_CRED group, which is described by the
task_diag_creds structure.

struct task_diag_msg {
	__u32	tgid;
	__u32	pid;
	__u32	ppid;
	__u32	tpid;
	__u32	sid;
	__u32	pgid;
	__u8	state;
	char	comm[TASK_DIAG_COMM_LEN];
};

The dump_stratagy field will be used in the following patches to request
information for a group of processes.
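
For reference, a receiver could walk the attribute groups of one reply
message roughly like this (an illustrative sketch using the raw netlink
macros, not part of this patch; show_groups() is a made-up name, and the
selftests in the last patch contain a complete version of this logic):

#include <stdio.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskdiag.h>

static void show_groups(struct nlmsghdr *nlh)
{
	/* attribute bytes after the netlink and genetlink headers */
	int len = NLMSG_PAYLOAD(nlh, GENL_HDRLEN);
	struct nlattr *na = (struct nlattr *)
				((char *)NLMSG_DATA(nlh) + GENL_HDRLEN);

	while (len >= NLA_HDRLEN &&
	       na->nla_len >= NLA_HDRLEN && na->nla_len <= len) {
		if (na->nla_type == TASK_DIAG_MSG) {
			struct task_diag_msg *msg =
				(void *)((char *)na + NLA_HDRLEN);

			printf("pid %d tgid %d comm %s\n",
			       msg->pid, msg->tgid, msg->comm);
		}
		len -= NLA_ALIGN(na->nla_len);
		na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len));
	}
}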

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/uapi/linux/taskdiag.h |  64 +++++++++++++++
 init/Kconfig                  |  12 +++
 kernel/Makefile               |   1 +
 kernel/taskdiag.c             | 179 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 256 insertions(+)
 create mode 100644 include/uapi/linux/taskdiag.h
 create mode 100644 kernel/taskdiag.c

diff --git a/include/uapi/linux/taskdiag.h b/include/uapi/linux/taskdiag.h
new file mode 100644
index 0000000..e1feb35
--- /dev/null
+++ b/include/uapi/linux/taskdiag.h
@@ -0,0 +1,64 @@
+#ifndef _LINUX_TASKDIAG_H
+#define _LINUX_TASKDIAG_H
+
+#include <linux/types.h>
+#include <linux/capability.h>
+
+#define TASKDIAG_GENL_NAME	"TASKDIAG"
+#define TASKDIAG_GENL_VERSION	0x1
+
+enum {
+	/* optional attributes which can be specified in show_flags */
+
+	/* other attributes */
+	TASK_DIAG_MSG = 64,
+};
+
+enum {
+	TASK_DIAG_RUNNING,
+	TASK_DIAG_INTERRUPTIBLE,
+	TASK_DIAG_UNINTERRUPTIBLE,
+	TASK_DIAG_STOPPED,
+	TASK_DIAG_TRACE_STOP,
+	TASK_DIAG_DEAD,
+	TASK_DIAG_ZOMBIE,
+};
+
+#define TASK_DIAG_COMM_LEN 16
+
+struct task_diag_msg {
+	__u32	tgid;
+	__u32	pid;
+	__u32	ppid;
+	__u32	tpid;
+	__u32	sid;
+	__u32	pgid;
+	__u8	state;
+	char	comm[TASK_DIAG_COMM_LEN];
+};
+
+enum {
+	TASKDIAG_CMD_UNSPEC = 0,	/* Reserved */
+	TASKDIAG_CMD_GET,
+	__TASKDIAG_CMD_MAX,
+};
+#define TASKDIAG_CMD_MAX (__TASKDIAG_CMD_MAX - 1)
+
+#define TASK_DIAG_DUMP_ALL	0
+
+struct task_diag_pid {
+	__u64	show_flags;
+	__u64	dump_stratagy;
+
+	__u32	pid;
+};
+
+enum {
+	TASKDIAG_CMD_ATTR_UNSPEC = 0,
+	TASKDIAG_CMD_ATTR_GET,
+	__TASKDIAG_CMD_ATTR_MAX,
+};
+
+#define TASKDIAG_CMD_ATTR_MAX (__TASKDIAG_CMD_ATTR_MAX - 1)
+
+#endif /* _LINUX_TASKDIAG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9afb971..e959ae3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -430,6 +430,18 @@ config TASKSTATS
 
 	  Say N if unsure.
 
+config TASK_DIAG
+	bool "Export task/process properties through netlink"
+	depends on NET
+	default n
+	help
+	  Export selected properties for tasks/processes through the
+	  generic netlink interface. Unlike the proc file system, task_diag
+	  returns information in a binary format, allows to specify which
+	  information are required.
+
+	  Say N if unsure.
+
 config TASK_DELAY_ACCT
 	bool "Enable per-task delay accounting"
 	depends on TASKSTATS
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..2d4fc71 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_DIAG) += taskdiag.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
new file mode 100644
index 0000000..5faf3f0
--- /dev/null
+++ b/kernel/taskdiag.c
@@ -0,0 +1,179 @@
+#include <uapi/linux/taskdiag.h>
+#include <net/genetlink.h>
+#include <linux/pid_namespace.h>
+#include <linux/ptrace.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+
+static struct genl_family family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= TASKDIAG_GENL_NAME,
+	.version	= TASKDIAG_GENL_VERSION,
+	.maxattr	= TASKDIAG_CMD_ATTR_MAX,
+	.netnsok	= true,
+};
+
+static size_t taskdiag_packet_size(u64 show_flags)
+{
+	return nla_total_size(sizeof(struct task_diag_msg));
+}
+
+/*
+ * The task state array is a strange "bitmap" of
+ * reasons to sleep. Thus "running" is zero, and
+ * you can test for combinations of others with
+ * simple bit tests.
+ */
+static const __u8 task_state_array[] = {
+	TASK_DIAG_RUNNING,
+	TASK_DIAG_INTERRUPTIBLE,
+	TASK_DIAG_UNINTERRUPTIBLE,
+	TASK_DIAG_STOPPED,
+	TASK_DIAG_TRACE_STOP,
+	TASK_DIAG_DEAD,
+	TASK_DIAG_ZOMBIE,
+};
+
+static inline const __u8 get_task_state(struct task_struct *tsk)
+{
+	unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+
+	BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
+
+	return task_state_array[fls(state)];
+}
+
+static int fill_task_msg(struct task_struct *p, struct sk_buff *skb)
+{
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	struct task_diag_msg *msg;
+	struct nlattr *attr;
+	char tcomm[sizeof(p->comm)];
+	struct task_struct *tracer;
+
+	attr = nla_reserve(skb, TASK_DIAG_MSG, sizeof(struct task_diag_msg));
+	if (!attr)
+		return -EMSGSIZE;
+
+	msg = nla_data(attr);
+
+	rcu_read_lock();
+	msg->ppid = pid_alive(p) ?
+		task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
+
+	msg->tpid = 0;
+	tracer = ptrace_parent(p);
+	if (tracer)
+		msg->tpid = task_pid_nr_ns(tracer, ns);
+
+	msg->tgid = task_tgid_nr_ns(p, ns);
+	msg->pid = task_pid_nr_ns(p, ns);
+	msg->sid = task_session_nr_ns(p, ns);
+	msg->pgid = task_pgrp_nr_ns(p, ns);
+
+	rcu_read_unlock();
+
+	get_task_comm(tcomm, p);
+	memset(msg->comm, 0, TASK_DIAG_COMM_LEN);
+	strncpy(msg->comm, tcomm, TASK_DIAG_COMM_LEN);
+
+	msg->state = get_task_state(p);
+
+	return 0;
+}
+
+static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
+				u64 show_flags, u32 portid, u32 seq)
+{
+	void *reply;
+	int err;
+
+	reply = genlmsg_put(skb, portid, seq, &family, 0, TASKDIAG_CMD_GET);
+	if (reply == NULL)
+		return -EMSGSIZE;
+
+	err = fill_task_msg(tsk, skb);
+	if (err)
+		goto err;
+
+	return genlmsg_end(skb, reply);
+err:
+	genlmsg_cancel(skb, reply);
+	return err;
+}
+
+static int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct task_struct *tsk = NULL;
+	struct task_diag_pid *req;
+	struct sk_buff *msg;
+	size_t size;
+	int rc;
+
+	req = nla_data(info->attrs[TASKDIAG_CMD_ATTR_GET]);
+	if (req == NULL)
+		return -EINVAL;
+
+	if (nla_len(info->attrs[TASKDIAG_CMD_ATTR_GET]) < sizeof(*req))
+		return -EINVAL;
+
+	size = taskdiag_packet_size(req->show_flags);
+	msg = genlmsg_new(size, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(req->pid);
+	if (tsk)
+		get_task_struct(tsk);
+	rcu_read_unlock();
+	if (!tsk) {
+		rc = -ESRCH;
+		goto err;
+	};
+
+	if (!ptrace_may_access(tsk, PTRACE_MODE_READ)) {
+		put_task_struct(tsk);
+		rc = -EPERM;
+		goto err;
+	}
+
+	rc = task_diag_fill(tsk, msg, req->show_flags,
+				info->snd_portid, info->snd_seq);
+	put_task_struct(tsk);
+	if (rc < 0)
+		goto err;
+
+	return genlmsg_reply(msg, info);
+err:
+	nlmsg_free(msg);
+	return rc;
+}
+
+static const struct nla_policy
+			taskstats_cmd_get_policy[TASKDIAG_CMD_ATTR_MAX+1] = {
+	[TASKDIAG_CMD_ATTR_GET]  = {	.type = NLA_UNSPEC,
+					.len = sizeof(struct task_diag_pid)
+				},
+};
+
+static const struct genl_ops taskdiag_ops[] = {
+	{
+		.cmd		= TASKDIAG_CMD_GET,
+		.doit		= taskdiag_doit,
+		.policy		= taskstats_cmd_get_policy,
+	},
+};
+
+static int __init taskdiag_init(void)
+{
+	int rc;
+
+	rc = genl_register_family_with_ops(&family, taskdiag_ops);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+late_initcall(taskdiag_init);
-- 
2.1.0



* [PATCH 2/7] kernel: move next_tgid from fs/proc
  2015-02-17  8:20 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Andrey Vagin
  2015-02-17  8:20   ` Andrey Vagin
@ 2015-02-17  8:20 ` Andrey Vagin
  2015-02-17  8:20   ` Andrey Vagin
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

This function will be used in task_diag.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 fs/proc/base.c          | 43 -------------------------------------------
 include/linux/proc_fs.h |  7 +++++++
 kernel/pid.c            | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..24ed43d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2795,49 +2795,6 @@ out:
 	return ERR_PTR(result);
 }
 
-/*
- * Find the first task with tgid >= tgid
- *
- */
-struct tgid_iter {
-	unsigned int tgid;
-	struct task_struct *task;
-};
-static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
-{
-	struct pid *pid;
-
-	if (iter.task)
-		put_task_struct(iter.task);
-	rcu_read_lock();
-retry:
-	iter.task = NULL;
-	pid = find_ge_pid(iter.tgid, ns);
-	if (pid) {
-		iter.tgid = pid_nr_ns(pid, ns);
-		iter.task = pid_task(pid, PIDTYPE_PID);
-		/* What we to know is if the pid we have find is the
-		 * pid of a thread_group_leader.  Testing for task
-		 * being a thread_group_leader is the obvious thing
-		 * todo but there is a window when it fails, due to
-		 * the pid transfer logic in de_thread.
-		 *
-		 * So we perform the straight forward test of seeing
-		 * if the pid we have found is the pid of a thread
-		 * group leader, and don't worry if the task we have
-		 * found doesn't happen to be a thread group leader.
-		 * As we don't care in the case of readdir.
-		 */
-		if (!iter.task || !has_group_leader_pid(iter.task)) {
-			iter.tgid += 1;
-			goto retry;
-		}
-		get_task_struct(iter.task);
-	}
-	rcu_read_unlock();
-	return iter;
-}
-
 #define TGID_OFFSET (FIRST_PROCESS_ENTRY + 2)
 
 /* for the /proc/ directory itself, after non-process stuff has been done */
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..136b6ed 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -82,4 +82,11 @@ static inline struct proc_dir_entry *proc_net_mkdir(
 	return proc_mkdir_data(name, 0, parent, net);
 }
 
+struct tgid_iter {
+	unsigned int tgid;
+	struct task_struct *task;
+};
+
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
+
 #endif /* _LINUX_PROC_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index cd36a5e..082307a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -568,6 +568,45 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 }
 
 /*
+ * Find the first task with tgid >= tgid
+ *
+ */
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
+{
+	struct pid *pid;
+
+	if (iter.task)
+		put_task_struct(iter.task);
+	rcu_read_lock();
+retry:
+	iter.task = NULL;
+	pid = find_ge_pid(iter.tgid, ns);
+	if (pid) {
+		iter.tgid = pid_nr_ns(pid, ns);
+		iter.task = pid_task(pid, PIDTYPE_PID);
+		/* What we to know is if the pid we have find is the
+		 * pid of a thread_group_leader.  Testing for task
+		 * being a thread_group_leader is the obvious thing
+		 * todo but there is a window when it fails, due to
+		 * the pid transfer logic in de_thread.
+		 *
+		 * So we perform the straight forward test of seeing
+		 * if the pid we have found is the pid of a thread
+		 * group leader, and don't worry if the task we have
+		 * found doesn't happen to be a thread group leader.
+		 * As we don't care in the case of readdir.
+		 */
+		if (!iter.task || !has_group_leader_pid(iter.task)) {
+			iter.tgid += 1;
+			goto retry;
+		}
+		get_task_struct(iter.task);
+	}
+	rcu_read_unlock();
+	return iter;
+}
+
+/*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
  * more.
-- 
2.1.0



* [PATCH 3/7] task-diag: add ability to get information about all tasks
@ 2015-02-17  8:20   ` Andrey Vagin
  0 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

For that we need to set NLM_F_DUMP in the request. Currently there are
no filters. Any suggestions are welcome.

I think we can add requests for children, threads, session or group
members.
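
On the client side a dump is requested by setting NLM_F_DUMP on the
request, and the reply comes back as a multi-part message, so reading
continues until NLMSG_DONE. A rough sketch (nl_sd and the request
sending are assumed to be set up as in the selftest patch, and
show_task() stands for whatever per-message parsing the caller does):

	char buf[16384];
	int len, done = 0;

	while (!done) {
		struct nlmsghdr *nlh;

		len = recv(nl_sd, buf, sizeof(buf), 0);
		if (len <= 0)
			break;

		for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
		     nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_type == NLMSG_DONE ||
			    nlh->nlmsg_type == NLMSG_ERROR) {
				done = 1;
				break;
			}
			show_task(nlh);
		}
	}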

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 kernel/taskdiag.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 5faf3f0..da4a51b 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -102,6 +102,46 @@ err:
 	return err;
 }
 
+static int taskdiag_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	struct tgid_iter iter;
+	struct nlattr *na;
+	struct task_diag_pid *req;
+	int rc;
+
+	if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(*req))
+		return -EINVAL;
+
+	na = nlmsg_data(cb->nlh) + GENL_HDRLEN;
+	if (na->nla_type < 0)
+		return -EINVAL;
+
+	req = (struct task_diag_pid *) nla_data(na);
+
+	iter.tgid = cb->args[0];
+	iter.task = NULL;
+	for (iter = next_tgid(ns, iter);
+	     iter.task;
+	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
+		if (!ptrace_may_access(iter.task, PTRACE_MODE_READ))
+			continue;
+
+		rc = task_diag_fill(iter.task, skb, req->show_flags,
+				NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq);
+		if (rc < 0) {
+			put_task_struct(iter.task);
+			if (rc != -EMSGSIZE)
+				return rc;
+			break;
+		}
+	}
+
+	cb->args[0] = iter.tgid;
+
+	return skb->len;
+}
+
 static int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct task_struct *tsk = NULL;
@@ -161,6 +201,7 @@ static const struct genl_ops taskdiag_ops[] = {
 	{
 		.cmd		= TASKDIAG_CMD_GET,
 		.doit		= taskdiag_doit,
+		.dumpit		= taskdiag_dumpid,
 		.policy		= taskstats_cmd_get_policy,
 	},
 };
-- 
2.1.0



* [PATCH 4/7] task-diag: add a new group to get process credentials
@ 2015-02-17  8:20   ` Andrey Vagin
  0 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

A response is represented by the task_diag_creds structure:

struct task_diag_creds {
       struct task_diag_caps cap_inheritable;
       struct task_diag_caps cap_permitted;
       struct task_diag_caps cap_effective;
       struct task_diag_caps cap_bset;

       __u32 uid;
       __u32 euid;
       __u32 suid;
       __u32 fsuid;
       __u32 gid;
       __u32 egid;
       __u32 sgid;
       __u32 fsgid;
};

This group is optional and is filled only if show_flags contains
TASK_DIAG_SHOW_CRED.
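
A client asks for this group by setting TASK_DIAG_SHOW_CRED in
req.show_flags and then looks for the TASK_DIAG_CRED attribute while
walking the groups of a reply message. A small sketch (show_creds() is
a made-up helper; na points at one attribute of the reply, as in the
group-walking sketch from the first patch):

static void show_creds(const struct nlattr *na)
{
	if (na->nla_type == TASK_DIAG_CRED) {
		const struct task_diag_creds *creds =
			(const void *)((const char *)na + NLA_HDRLEN);

		printf("uid %u euid %u suid %u fsuid %u\n",
		       creds->uid, creds->euid, creds->suid, creds->fsuid);
		printf("gid %u egid %u sgid %u fsgid %u\n",
		       creds->gid, creds->egid, creds->sgid, creds->fsgid);
	}
}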

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/uapi/linux/taskdiag.h | 23 ++++++++++++++++++
 kernel/taskdiag.c             | 55 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/taskdiag.h b/include/uapi/linux/taskdiag.h
index e1feb35..db12f6d 100644
--- a/include/uapi/linux/taskdiag.h
+++ b/include/uapi/linux/taskdiag.h
@@ -9,11 +9,14 @@
 
 enum {
 	/* optional attributes which can be specified in show_flags */
+	TASK_DIAG_CRED,
 
 	/* other attributes */
 	TASK_DIAG_MSG = 64,
 };
 
+#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
+
 enum {
 	TASK_DIAG_RUNNING,
 	TASK_DIAG_INTERRUPTIBLE,
@@ -37,6 +40,26 @@ struct task_diag_msg {
 	char	comm[TASK_DIAG_COMM_LEN];
 };
 
+struct task_diag_caps {
+	__u32 cap[_LINUX_CAPABILITY_U32S_3];
+};
+
+struct task_diag_creds {
+	struct task_diag_caps cap_inheritable;
+	struct task_diag_caps cap_permitted;
+	struct task_diag_caps cap_effective;
+	struct task_diag_caps cap_bset;
+
+	__u32 uid;
+	__u32 euid;
+	__u32 suid;
+	__u32 fsuid;
+	__u32 gid;
+	__u32 egid;
+	__u32 sgid;
+	__u32 fsgid;
+};
+
 enum {
 	TASKDIAG_CMD_UNSPEC = 0,	/* Reserved */
 	TASKDIAG_CMD_GET,
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index da4a51b..6ccbcaf 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -15,7 +15,14 @@ static struct genl_family family = {
 
 static size_t taskdiag_packet_size(u64 show_flags)
 {
-	return nla_total_size(sizeof(struct task_diag_msg));
+	size_t size;
+
+	size = nla_total_size(sizeof(struct task_diag_msg));
+
+	if (show_flags & TASK_DIAG_SHOW_CRED)
+		size += nla_total_size(sizeof(struct task_diag_creds));
+
+	return size;
 }
 
 /*
@@ -82,6 +89,46 @@ static int fill_task_msg(struct task_struct *p, struct sk_buff *skb)
 	return 0;
 }
 
+static inline void caps2diag(struct task_diag_caps *diag, const kernel_cap_t *cap)
+{
+	int i;
+
+	for (i = 0; i < _LINUX_CAPABILITY_U32S_3; i++)
+		diag->cap[i] = cap->cap[i];
+}
+
+static int fill_creds(struct task_struct *p, struct sk_buff *skb)
+{
+	struct user_namespace *user_ns = current_user_ns();
+	struct task_diag_creds *diag_cred;
+	const struct cred *cred;
+	struct nlattr *attr;
+
+	attr = nla_reserve(skb, TASK_DIAG_CRED, sizeof(struct task_diag_creds));
+	if (!attr)
+		return -EMSGSIZE;
+
+	diag_cred = nla_data(attr);
+
+	cred = get_task_cred(p);
+
+	caps2diag(&diag_cred->cap_inheritable, &cred->cap_inheritable);
+	caps2diag(&diag_cred->cap_permitted, &cred->cap_permitted);
+	caps2diag(&diag_cred->cap_effective, &cred->cap_effective);
+	caps2diag(&diag_cred->cap_bset, &cred->cap_bset);
+
+	diag_cred->uid   = from_kuid_munged(user_ns, cred->uid);
+	diag_cred->euid  = from_kuid_munged(user_ns, cred->euid);
+	diag_cred->suid  = from_kuid_munged(user_ns, cred->suid);
+	diag_cred->fsuid = from_kuid_munged(user_ns, cred->fsuid);
+	diag_cred->gid   = from_kgid_munged(user_ns, cred->gid);
+	diag_cred->egid  = from_kgid_munged(user_ns, cred->egid);
+	diag_cred->sgid  = from_kgid_munged(user_ns, cred->sgid);
+	diag_cred->fsgid = from_kgid_munged(user_ns, cred->fsgid);
+
+	return 0;
+}
+
 static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
 				u64 show_flags, u32 portid, u32 seq)
 {
@@ -96,6 +143,12 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
 	if (err)
 		goto err;
 
+	if (show_flags & TASK_DIAG_SHOW_CRED) {
+		err = fill_creds(tsk, skb);
+		if (err)
+			goto err;
+	}
+
 	return genlmsg_end(skb, reply);
 err:
 	genlmsg_cancel(skb, reply);
-- 
2.1.0



* [PATCH 5/7] kernel: add ability to iterate children of a specified task
  2015-02-17  8:20 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Andrey Vagin
                   ` (3 preceding siblings ...)
  2015-02-17  8:20   ` Andrey Vagin
@ 2015-02-17  8:20 ` Andrey Vagin
  2015-02-17  8:20 ` [PATCH 6/7] task_diag: add ability to dump children Andrey Vagin
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

The interface is similar to the tgid iterator. It is used in procfs and
will be used in task_diag.
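
For reference, a caller is expected to drive the iterator roughly like
this (a sketch modelled on the get_children_pid() conversion below, not
a literal excerpt; do_something() is a placeholder, and the caller must
hold a reference on the parent task):

	struct child_iter iter = {
		.parent	= parent,	/* task whose children we walk */
		.task	= NULL,		/* or the child to continue after */
		.pos	= 0,		/* slow-path position to resume from */
	};

	for (iter = next_child(iter); iter.task;
	     iter.pos++, iter = next_child(iter)) {
		/* next_child() holds a reference on iter.task for us */
		do_something(iter.task);
	}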

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 fs/proc/array.c         | 58 +++++++++++++------------------------------------
 include/linux/proc_fs.h |  6 +++++
 kernel/pid.c            | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+), 43 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bd117d0..7197c6a 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -579,54 +579,26 @@ get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
 {
 	struct task_struct *start, *task;
 	struct pid *pid = NULL;
+	struct child_iter iter;
 
-	read_lock(&tasklist_lock);
-
-	start = pid_task(proc_pid(inode), PIDTYPE_PID);
+	start = get_proc_task(inode);
 	if (!start)
-		goto out;
+		return NULL;
 
-	/*
-	 * Lets try to continue searching first, this gives
-	 * us significant speedup on children-rich processes.
-	 */
-	if (pid_prev) {
-		task = pid_task(pid_prev, PIDTYPE_PID);
-		if (task && task->real_parent == start &&
-		    !(list_empty(&task->sibling))) {
-			if (list_is_last(&task->sibling, &start->children))
-				goto out;
-			task = list_first_entry(&task->sibling,
-						struct task_struct, sibling);
-			pid = get_pid(task_pid(task));
-			goto out;
-		}
-	}
+	if (pid_prev)
+		task = get_pid_task(pid_prev, PIDTYPE_PID);
+	else
+		task = NULL;
 
-	/*
-	 * Slow search case.
-	 *
-	 * We might miss some children here if children
-	 * are exited while we were not holding the lock,
-	 * but it was never promised to be accurate that
-	 * much.
-	 *
-	 * "Just suppose that the parent sleeps, but N children
-	 *  exit after we printed their tids. Now the slow paths
-	 *  skips N extra children, we miss N tasks." (c)
-	 *
-	 * So one need to stop or freeze the leader and all
-	 * its children to get a precise result.
-	 */
-	list_for_each_entry(task, &start->children, sibling) {
-		if (pos-- == 0) {
-			pid = get_pid(task_pid(task));
-			break;
-		}
-	}
+	iter.parent = start;
+	iter.task = task;
+	iter.pos = pos;
+
+	iter = next_child(iter);
 
-out:
-	read_unlock(&tasklist_lock);
+	put_task_struct(start);
+	if (iter.task)
+		pid = get_pid(task_pid(iter.task));
 	return pid;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 136b6ed..eba98bc 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -89,4 +89,10 @@ struct tgid_iter {
 
 struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
 
+struct child_iter {
+	struct task_struct      *task, *parent;
+	unsigned int		pos;
+};
+
+struct child_iter next_child(struct child_iter iter);
 #endif /* _LINUX_PROC_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 082307a..6e3e42a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -606,6 +606,61 @@ retry:
 	return iter;
 }
 
+struct child_iter next_child(struct child_iter iter)
+{
+	struct task_struct *task;
+	loff_t pos = iter.pos;
+
+	read_lock(&tasklist_lock);
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (iter.task) {
+		task = iter.task;
+		if (task && task->real_parent == iter.parent &&
+		    !(list_empty(&task->sibling))) {
+			if (list_is_last(&task->sibling, &iter.parent->children)) {
+				task = NULL;
+				goto out;
+			}
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			goto out;
+		}
+	}
+
+	/*
+	 * Slow search case.
+	 *
+	 * We might miss some children here if children
+	 * are exited while we were not holding the lock,
+	 * but it was never promised to be accurate that
+	 * much.
+	 *
+	 * "Just suppose that the parent sleeps, but N children
+	 *  exit after we printed their tids. Now the slow paths
+	 *  skips N extra children, we miss N tasks." (c)
+	 *
+	 * So one need to stop or freeze the leader and all
+	 * its children to get a precise result.
+	 */
+	list_for_each_entry(task, &iter.parent->children, sibling) {
+		if (pos-- == 0)
+			goto out;
+	}
+	task = NULL;
+out:
+	if (iter.task)
+		put_task_struct(iter.task);
+	if (task)
+		get_task_struct(task);
+	iter.task = task;
+	read_unlock(&tasklist_lock);
+	return iter;
+}
+
 /*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
-- 
2.1.0



* [PATCH 6/7] task_diag: add ability to dump children
  2015-02-17  8:20 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Andrey Vagin
                   ` (4 preceding siblings ...)
  2015-02-17  8:20 ` [PATCH 5/7] kernel: add ability to iterate children of a specified task Andrey Vagin
@ 2015-02-17  8:20 ` Andrey Vagin
  2015-02-17  8:20 ` [PATCH 7/7] selftest: check the task_diag functionality Andrey Vagin
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

Now we can dump all tasks or the children of a specified task. It is an
example of how this interface can be extended for different use-cases.
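
From userspace the only difference from a full dump is the strategy and
the pid in the request; the request is still sent with NLM_F_DUMP set
(sketch; target_pid is a placeholder for the parent of interest):

	struct task_diag_pid req = {
		.show_flags	= 0,
		.dump_stratagy	= TASK_DIAG_DUMP_CHILDREN,
		.pid		= target_pid,
	};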

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/uapi/linux/taskdiag.h |  1 +
 kernel/taskdiag.c             | 83 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/taskdiag.h b/include/uapi/linux/taskdiag.h
index db12f6d..d8a9e92 100644
--- a/include/uapi/linux/taskdiag.h
+++ b/include/uapi/linux/taskdiag.h
@@ -68,6 +68,7 @@ enum {
 #define TASKDIAG_CMD_MAX (__TASKDIAG_CMD_MAX - 1)
 
 #define TASK_DIAG_DUMP_ALL	0
+#define TASK_DIAG_DUMP_CHILDREN	1
 
 struct task_diag_pid {
 	__u64	show_flags;
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 6ccbcaf..951ecbd 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -155,12 +155,71 @@ err:
 	return err;
 }
 
+struct task_iter {
+	struct task_diag_pid *req;
+	struct pid_namespace *ns;
+	struct netlink_callback *cb;
+
+	union {
+		struct tgid_iter tgid;
+		struct child_iter child;
+	};
+};
+
+static struct task_struct *iter_start(struct task_iter *iter)
+{
+	switch (iter->req->dump_stratagy) {
+	case TASK_DIAG_DUMP_CHILDREN:
+		rcu_read_lock();
+		iter->child.parent = find_task_by_pid_ns(iter->req->pid, iter->ns);
+		if (iter->child.parent)
+			get_task_struct(iter->child.parent);
+		rcu_read_unlock();
+
+		if (iter->child.parent == NULL)
+			return ERR_PTR(-ESRCH);
+
+		iter->child.pos = iter->cb->args[0];
+		iter->child.task = NULL;
+		iter->child = next_child(iter->child);
+		return iter->child.task;
+
+	case TASK_DIAG_DUMP_ALL:
+		iter->tgid.tgid = iter->cb->args[0];
+		iter->tgid.task = NULL;
+		iter->tgid = next_tgid(iter->ns, iter->tgid);
+		return iter->tgid.task;
+	}
+
+	return ERR_PTR(-EINVAL);
+}
+
+static struct task_struct *iter_next(struct task_iter *iter)
+{
+	switch (iter->req->dump_stratagy) {
+	case TASK_DIAG_DUMP_CHILDREN:
+		iter->child.pos += 1;
+		iter->child = next_child(iter->child);
+		iter->cb->args[0] = iter->child.pos;
+		return iter->child.task;
+
+	case TASK_DIAG_DUMP_ALL:
+		iter->tgid.tgid += 1;
+		iter->tgid = next_tgid(iter->ns, iter->tgid);
+		iter->cb->args[0] = iter->tgid.tgid;
+		return iter->tgid.task;
+	}
+
+	return NULL;
+}
+
 static int taskdiag_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct pid_namespace *ns = task_active_pid_ns(current);
-	struct tgid_iter iter;
+	struct task_iter iter;
 	struct nlattr *na;
 	struct task_diag_pid *req;
+	struct task_struct *task;
 	int rc;
 
 	if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(*req))
@@ -172,26 +231,28 @@ static int taskdiag_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 
 	req = (struct task_diag_pid *) nla_data(na);
 
-	iter.tgid = cb->args[0];
-	iter.task = NULL;
-	for (iter = next_tgid(ns, iter);
-	     iter.task;
-	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
-		if (!ptrace_may_access(iter.task, PTRACE_MODE_READ))
+	iter.req = req;
+	iter.ns  = ns;
+	iter.cb  = cb;
+
+	task = iter_start(&iter);
+	if (IS_ERR(task) < 0)
+		return PTR_ERR(task);
+
+	for (; task; task = iter_next(&iter)) {
+		if (!ptrace_may_access(task, PTRACE_MODE_READ))
 			continue;
 
-		rc = task_diag_fill(iter.task, skb, req->show_flags,
+		rc = task_diag_fill(task, skb, req->show_flags,
 				NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq);
 		if (rc < 0) {
-			put_task_struct(iter.task);
+			put_task_struct(task);
 			if (rc != -EMSGSIZE)
 				return rc;
 			break;
 		}
 	}
 
-	cb->args[0] = iter.tgid;
-
 	return skb->len;
 }
 
-- 
2.1.0



* [PATCH 7/7] selftest: check the task_diag functionality
  2015-02-17  8:20 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Andrey Vagin
                   ` (5 preceding siblings ...)
  2015-02-17  8:20 ` [PATCH 6/7] task_diag: add ability to dump children Andrey Vagin
@ 2015-02-17  8:20 ` Andrey Vagin
  2015-02-17  8:53   ` Arnd Bergmann
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Andrey Vagin @ 2015-02-17  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi, Andrey Vagin

Here are two test (example) programs:

task_diag	- request information for two processes
task_diag_all	- request information about all processes
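
They can be built and run with the usual selftest targets (this assumes
a kernel built with CONFIG_TASK_DIAG=y):

$ make -C tools/testing/selftests/task_diag
$ make -C tools/testing/selftests/task_diag run_tests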

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_diag/Makefile         |  16 ++
 tools/testing/selftests/task_diag/task_diag.c      |  56 ++++++
 tools/testing/selftests/task_diag/task_diag_all.c  |  82 ++++++++
 tools/testing/selftests/task_diag/task_diag_comm.c | 206 +++++++++++++++++++++
 tools/testing/selftests/task_diag/task_diag_comm.h |  47 +++++
 tools/testing/selftests/task_diag/taskdiag.h       |   1 +
 7 files changed, 409 insertions(+)
 create mode 100644 tools/testing/selftests/task_diag/Makefile
 create mode 100644 tools/testing/selftests/task_diag/task_diag.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
 create mode 120000 tools/testing/selftests/task_diag/taskdiag.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4e51122..c73d888 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -17,6 +17,7 @@ TARGETS += sysctl
 TARGETS += timers
 TARGETS += user
 TARGETS += vm
+TARGETS += task_diag
 #Please keep the TARGETS list alphabetically sorted
 
 TARGETS_HOTPLUG = cpu-hotplug
diff --git a/tools/testing/selftests/task_diag/Makefile b/tools/testing/selftests/task_diag/Makefile
new file mode 100644
index 0000000..d6583c4
--- /dev/null
+++ b/tools/testing/selftests/task_diag/Makefile
@@ -0,0 +1,16 @@
+all: task_diag task_diag_all
+
+run_tests: all
+	@./task_diag && ./task_diag_all && echo "task_diag: [PASS]" || echo "task_diag: [FAIL]"
+
+CFLAGS += -Wall -O2
+
+task_diag.o: task_diag.c task_diag_comm.h
+task_diag_all.o: task_diag_all.c task_diag_comm.h
+task_diag_comm.o: task_diag_comm.c task_diag_comm.h
+
+task_diag_all: task_diag_all.o task_diag_comm.o
+task_diag: task_diag.o task_diag_comm.o
+
+clean:
+	rm -rf task_diag task_diag_all task_diag_comm.o task_diag_all.o task_diag.o
diff --git a/tools/testing/selftests/task_diag/task_diag.c b/tools/testing/selftests/task_diag/task_diag.c
new file mode 100644
index 0000000..fafeeac
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag.c
@@ -0,0 +1,56 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+#include <signal.h>
+
+#include <linux/genetlink.h>
+#include "taskdiag.h"
+#include "task_diag_comm.h"
+
+int main(int argc, char *argv[])
+{
+	int exit_status = 1;
+	int rc, rep_len, id;
+	int nl_sd = -1;
+	struct task_diag_pid req;
+	char buf[4096];
+
+	req.show_flags = TASK_DIAG_SHOW_CRED;
+	req.pid = getpid();
+
+	nl_sd = create_nl_socket(NETLINK_GENERIC);
+	if (nl_sd < 0)
+		return -1;
+
+	id = get_family_id(nl_sd);
+	if (!id)
+		goto err;
+
+	rc = send_cmd(nl_sd, id, getpid(), TASKDIAG_CMD_GET,
+		      TASKDIAG_CMD_ATTR_GET, &req, sizeof(req), 0);
+	pr_info("Sent pid/tgid, retval %d\n", rc);
+	if (rc < 0)
+		goto err;
+
+	rep_len = recv(nl_sd, buf, sizeof(buf), 0);
+	if (rep_len < 0) {
+		pr_perror("Unable to receive a response\n");
+		goto err;
+	}
+	pr_info("received %d bytes\n", rep_len);
+
+	nlmsg_receive(buf, rep_len, &show_task);
+
+	exit_status = 0;
+err:
+	close(nl_sd);
+	return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_all.c b/tools/testing/selftests/task_diag/task_diag_all.c
new file mode 100644
index 0000000..85e1a0a
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_all.c
@@ -0,0 +1,82 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+#include <signal.h>
+
+#include "task_diag_comm.h"
+#include "taskdiag.h"
+
+int tasks;
+
+
+extern int _show_task(struct nlmsghdr *hdr)
+{
+	tasks++;
+	return show_task(hdr);
+}
+
+int main(int argc, char *argv[])
+{
+	int exit_status = 1;
+	int rc, rep_len, id;
+	int nl_sd = -1;
+	struct {
+		struct task_diag_pid req;
+	} pid_req;
+	char buf[4096];
+
+	quiet = 0;
+
+	pid_req.req.show_flags = 0;
+	pid_req.req.dump_stratagy = TASK_DIAG_DUMP_ALL;
+	pid_req.req.pid = 1;
+
+	nl_sd = create_nl_socket(NETLINK_GENERIC);
+	if (nl_sd < 0)
+		return -1;
+
+	id = get_family_id(nl_sd);
+	if (!id)
+		goto err;
+
+	rc = send_cmd(nl_sd, id, getpid(), TASKDIAG_CMD_GET,
+		      TASKDIAG_CMD_ATTR_GET, &pid_req, sizeof(pid_req), 1);
+	pr_info("Sent pid/tgid, retval %d\n", rc);
+	if (rc < 0)
+		goto err;
+
+	while (1) {
+		int err;
+
+		rep_len = recv(nl_sd, buf, sizeof(buf), 0);
+		pr_info("received %d bytes\n", rep_len);
+
+		if (rep_len < 0) {
+			pr_perror("Unable to receive a response\n");
+			goto err;
+		}
+
+		if (rep_len == 0)
+			break;
+
+		err = nlmsg_receive(buf, rep_len, &_show_task);
+		if (err < 0)
+			goto err;
+		if (err == 0)
+			break;
+	}
+	printf("tasks: %d\n", tasks);
+
+	exit_status = 0;
+err:
+	close(nl_sd);
+	return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.c b/tools/testing/selftests/task_diag/task_diag_comm.c
new file mode 100644
index 0000000..df7780d
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.c
@@ -0,0 +1,206 @@
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/genetlink.h>
+
+#include "taskdiag.h"
+#include "task_diag_comm.h"
+
+int quiet = 0;
+
+/*
+ * Create a raw netlink socket and bind
+ */
+int create_nl_socket(int protocol)
+{
+	int fd;
+	struct sockaddr_nl local;
+
+	fd = socket(AF_NETLINK, SOCK_RAW, protocol);
+	if (fd < 0)
+		return -1;
+
+	memset(&local, 0, sizeof(local));
+	local.nl_family = AF_NETLINK;
+
+	if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
+		goto error;
+
+	return fd;
+error:
+	close(fd);
+	return -1;
+}
+
+
+int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
+	     __u8 genl_cmd, __u16 nla_type,
+	     void *nla_data, int nla_len, int dump)
+{
+	struct nlattr *na;
+	struct sockaddr_nl nladdr;
+	int r, buflen;
+	char *buf;
+
+	struct msgtemplate msg;
+
+	msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+	msg.n.nlmsg_type = nlmsg_type;
+	msg.n.nlmsg_flags = NLM_F_REQUEST;
+	if (dump)
+		msg.n.nlmsg_flags |= NLM_F_DUMP;
+	msg.n.nlmsg_seq = 0;
+	msg.n.nlmsg_pid = nlmsg_pid;
+	msg.g.cmd = genl_cmd;
+	msg.g.version = 0x1;
+	na = (struct nlattr *) GENLMSG_DATA(&msg);
+	na->nla_type = nla_type;
+	na->nla_len = nla_len + 1 + NLA_HDRLEN;
+	memcpy(NLA_DATA(na), nla_data, nla_len);
+	msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+	buf = (char *) &msg;
+	buflen = msg.n.nlmsg_len;
+	memset(&nladdr, 0, sizeof(nladdr));
+	nladdr.nl_family = AF_NETLINK;
+	r = sendto(sd, buf, buflen, 0, (struct sockaddr *) &nladdr,
+			   sizeof(nladdr));
+	if (r != buflen) {
+		pr_perror("Unable to send %d (%d)", r, buflen);
+		return -1;
+	}
+	return 0;
+}
+
+
+/*
+ * Probe the controller in genetlink to find the family id
+ * for the TASKDIAG family
+ */
+int get_family_id(int sd)
+{
+	char name[100];
+	struct msgtemplate ans;
+
+	int id = 0, rc;
+	struct nlattr *na;
+	int rep_len;
+
+	strcpy(name, TASKDIAG_GENL_NAME);
+	rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
+			CTRL_ATTR_FAMILY_NAME, (void *)name,
+			strlen(TASKDIAG_GENL_NAME) + 1, 0);
+	if (rc < 0)
+		return -1;
+
+	rep_len = recv(sd, &ans, sizeof(ans), 0);
+	if (ans.n.nlmsg_type == NLMSG_ERROR ||
+	    (rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
+		return 0;
+
+	na = (struct nlattr *) GENLMSG_DATA(&ans);
+	na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
+	if (na->nla_type == CTRL_ATTR_FAMILY_ID)
+		id = *(__u16 *) NLA_DATA(na);
+
+	return id;
+}
+
+int nlmsg_receive(void *buf, int len, int (*cb)(struct nlmsghdr *))
+{
+	struct nlmsghdr *hdr;
+
+	for (hdr = (struct nlmsghdr *)buf;
+			NLMSG_OK(hdr, len); hdr = NLMSG_NEXT(hdr, len)) {
+
+		if (hdr->nlmsg_type == NLMSG_DONE) {
+			int *len = (int *)NLMSG_DATA(hdr);
+
+			if (*len < 0) {
+				pr_err("ERROR %d reported by netlink (%s)\n",
+					*len, strerror(-*len));
+				return *len;
+			}
+
+			return 0;
+		}
+
+		if (hdr->nlmsg_type == NLMSG_ERROR) {
+			struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(hdr);
+
+			if (hdr->nlmsg_len - sizeof(*hdr) < sizeof(struct nlmsgerr)) {
+				pr_err("ERROR truncated\n");
+				return -1;
+			}
+
+			if (err->error == 0)
+				return 0;
+
+			return -1;
+		}
+		if (cb && cb(hdr))
+			return -1;
+	}
+
+	return 1;
+}
+
+int show_task(struct nlmsghdr *hdr)
+{
+	int msg_len;
+	struct msgtemplate *msg;
+	struct nlattr *na;
+	int len;
+
+	msg_len = GENLMSG_PAYLOAD(hdr);
+
+	msg = (struct msgtemplate *)hdr;
+	na = (struct nlattr *) GENLMSG_DATA(msg);
+	len = 0;
+	while (len < msg_len) {
+		len += NLA_ALIGN(na->nla_len);
+		switch (na->nla_type) {
+		case TASK_DIAG_MSG:
+		{
+			struct task_diag_msg *msg;
+
+			/* For nested attributes, na follows */
+			msg = (struct task_diag_msg *) NLA_DATA(na);
+			pr_info("pid %d ppid %d comm %s\n", msg->pid, msg->ppid, msg->comm);
+			break;
+		}
+		case TASK_DIAG_CRED:
+		{
+			struct task_diag_creds *creds;
+
+			creds = (struct task_diag_creds *) NLA_DATA(na);
+			pr_info("uid: %d %d %d %d\n", creds->uid,
+					creds->euid, creds->suid, creds->fsuid);
+			pr_info("gid: %d %d %d %d\n", creds->gid,
+					creds->egid, creds->sgid, creds->fsgid);
+			pr_info("CapInh: %08x%08x\n",
+						creds->cap_inheritable.cap[1],
+						creds->cap_inheritable.cap[0]);
+			pr_info("CapPrm: %08x%08x\n",
+						creds->cap_permitted.cap[1],
+						creds->cap_permitted.cap[0]);
+			pr_info("CapEff: %08x%08x\n",
+						creds->cap_effective.cap[1],
+						creds->cap_effective.cap[0]);
+			pr_info("CapBnd: %08x%08x\n", creds->cap_bset.cap[1],
+						creds->cap_bset.cap[0]);
+			break;
+		}
+		default:
+			pr_err("Unknown nla_type %d\n",
+				na->nla_type);
+			return -1;
+		}
+		na = (struct nlattr *) (GENLMSG_DATA(msg) + len);
+	}
+
+	return 0;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.h b/tools/testing/selftests/task_diag/task_diag_comm.h
new file mode 100644
index 0000000..42f2088
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.h
@@ -0,0 +1,47 @@
+#ifndef __TASK_DIAG_COMM__
+#define __TASK_DIAG_COMM__
+
+#include <stdio.h>
+
+#include <linux/genetlink.h>
+#include "taskdiag.h"
+
+/*
+ * Generic macros for dealing with netlink sockets. Might be duplicated
+ * elsewhere. It is recommended that commercial grade applications use
+ * libnl or libnetlink and use the interfaces provided by the library
+ */
+#define GENLMSG_DATA(glh)	((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh)	(NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+#define NLA_DATA(na)		((void *)((char *)(na) + NLA_HDRLEN))
+#define NLA_PAYLOAD(len)	(len - NLA_HDRLEN)
+
+#define pr_err(fmt, ...)				\
+		fprintf(stderr, fmt, ##__VA_ARGS__)
+
+#define pr_perror(fmt, ...)				\
+		fprintf(stderr, fmt " : %m\n", ##__VA_ARGS__)
+
+extern int quiet;
+#define pr_info(fmt, arg...)			\
+	do {					\
+		if (!quiet)			\
+			printf(fmt, ##arg);	\
+	} while (0)				\
+
+struct msgtemplate {
+	struct nlmsghdr n;
+	struct genlmsghdr g;
+	char body[4096];
+};
+
+extern int create_nl_socket(int protocol);
+extern int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
+	     __u8 genl_cmd, __u16 nla_type,
+	     void *nla_data, int nla_len, int dump);
+
+extern int get_family_id(int sd);
+extern int nlmsg_receive(void *buf, int len, int (*cb)(struct nlmsghdr *));
+extern int show_task(struct nlmsghdr *hdr);
+
+#endif /* __TASK_DIAG_COMM__ */
diff --git a/tools/testing/selftests/task_diag/taskdiag.h b/tools/testing/selftests/task_diag/taskdiag.h
new file mode 120000
index 0000000..83e857e
--- /dev/null
+++ b/tools/testing/selftests/task_diag/taskdiag.h
@@ -0,0 +1 @@
+../../../../include/uapi/linux/taskdiag.h
\ No newline at end of file
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-17  8:53   ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2015-02-17  8:53 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: linux-kernel, linux-api, Oleg Nesterov, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
> 
> A request is described by the task_diag_pid structure:
> 
> struct task_diag_pid {
>        __u64   show_flags;      /* specify which information are required */
>        __u64   dump_stratagy;   /* specify a group of processes */
> 
>        __u32   pid;
> };

Can you explain how the interface relates to the 'taskstats' genetlink
API? Did you consider extending that interface to provide the
information you need instead of basing on the socket-diag?

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-17 16:09   ` David Ahern
  0 siblings, 0 replies; 41+ messages in thread
From: David Ahern @ 2015-02-17 16:09 UTC (permalink / raw)
  To: Andrey Vagin, linux-kernel
  Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Roger Luethi

On 2/17/15 1:20 AM, Andrey Vagin wrote:
> And here are statistics about syscalls which were called by each
> command.
> $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
>              20,713      syscalls:sys_exit_open
>              20,710      syscalls:sys_exit_close
>              20,708      syscalls:sys_exit_read
>              10,348      syscalls:sys_exit_newstat
>                  31      syscalls:sys_exit_write
>
> $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
>                 114      syscalls:sys_exit_recvfrom
>                  49      syscalls:sys_exit_write
>                   8      syscalls:sys_exit_mmap
>                   4      syscalls:sys_exit_mprotect
>                   3      syscalls:sys_exit_newfstat

'perf trace -s' gives the summary with stats.
e.g., perf trace -s --  ps ax -o pid,ppid

  ps (23850), 3117 events, 99.3%, 0.000 msec

    syscall            calls      min       avg       max      stddev
                                (msec)    (msec)    (msec)        (%)
    --------------- -------- --------- --------- ---------     ------
    read                 353     0.000     0.010     0.035      3.14%
    write                166     0.006     0.012     0.045      3.03%
    open                 365     0.002     0.005     0.178     11.29%
    close                354     0.001     0.002     0.024      3.57%
    stat                 170     0.002     0.007     0.662     52.99%
    fstat                 19     0.002     0.003     0.003      2.31%
    lseek                  2     0.003     0.003     0.003      6.49%
    mmap                  50     0.004     0.006     0.013      3.40%
...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-17  8:20 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Andrey Vagin
                   ` (8 preceding siblings ...)
  2015-02-17 16:09   ` David Ahern
@ 2015-02-17 19:05 ` Andy Lutomirski
  2015-02-18 14:27     ` Andrew Vagin
  9 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2015-02-17 19:05 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
	Andrew Morton, Linux API, linux-kernel

On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
>
> Here is a preview version. It provides restricted set of functionality.
> I would like to collect feedback about this idea.
>
> Currently we use the proc file system, where all information are
> presented in text files, what is convenient for humans.  But if we need
> to get information about processes from code (e.g. in C), the procfs
> doesn't look so cool.
>
> From code we would prefer to get information in binary format and to be
> able to specify which information and for which tasks are required. Here
> is a new interface with all these features, which is called task_diag.
> In addition it's much faster than procfs.
>
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
>
> A request is described by the task_diag_pid structure:
>
> struct task_diag_pid {
>        __u64   show_flags;      /* specify which information are required */
>        __u64   dump_stratagy;   /* specify a group of processes */
>
>        __u32   pid;
> };
>
> A respone is a set of netlink messages. Each message describes one task.
> All task properties are divided on groups. A message contains the
> TASK_DIAG_MSG group and other groups if they have been requested in
> show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> response will contain the TASK_DIAG_CRED group which is described by the
> task_diag_creds structure.
>
> struct task_diag_msg {
>         __u32   tgid;
>         __u32   pid;
>         __u32   ppid;
>         __u32   tpid;
>         __u32   sid;
>         __u32   pgid;
>         __u8    state;
>         char    comm[TASK_DIAG_COMM_LEN];
> };
>
> Another good feature of task_diag is an ability to request information
> for a few processes. Currently here are two stratgies
> TASK_DIAG_DUMP_ALL      - get information for all tasks
> TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
>                           tasks
>
> The task diag is much faster than the proc file system. We don't need to
> create a new file descriptor for each task. We need to send a request
> and get a response. It allows to get information for a few task in one
> request-response iteration.
>
> I have compared performance of procfs and task-diag for the
> "ps ax -o pid,ppid" command.
>
> A test stand contains 10348 processes.
> $ ps ax -o pid,ppid | wc -l
> 10348
>
> $ time ps ax -o pid,ppid > /dev/null
>
> real    0m1.073s
> user    0m0.086s
> sys     0m0.903s
>
> $ time ./task_diag_all > /dev/null
>
> real    0m0.037s
> user    0m0.004s
> sys     0m0.020s
>
> And here are statistics about syscalls which were called by each
> command.
> $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
>             20,713      syscalls:sys_exit_open
>             20,710      syscalls:sys_exit_close
>             20,708      syscalls:sys_exit_read
>             10,348      syscalls:sys_exit_newstat
>                 31      syscalls:sys_exit_write
>
> $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
>                114      syscalls:sys_exit_recvfrom
>                 49      syscalls:sys_exit_write
>                  8      syscalls:sys_exit_mmap
>                  4      syscalls:sys_exit_mprotect
>                  3      syscalls:sys_exit_newfstat
>
> You can find the test program from this experiment in the last patch.
>
> The idea of this functionality was suggested by Pavel Emelyanov
> (xemul@), when he found that operations with /proc forms a significant
> part of a checkpointing time.
>
> Ten years ago here was attempt to add a netlink interface to access to /proc
> information:
> http://lwn.net/Articles/99600/

I don't suppose this could use real syscalls instead of netlink.  If
nothing else, netlink seems to conflate pid and net namespaces.

Also, using an asynchronous interface (send, poll?, recv) for
something that's inherently synchronous (asking the kernel a local
question) seems awkward to me.

--Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-17 20:32     ` Andrew Vagin
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Vagin @ 2015-02-17 20:32 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Tue, Feb 17, 2015 at 09:09:47AM -0700, David Ahern wrote:
> On 2/17/15 1:20 AM, Andrey Vagin wrote:
> >And here are statistics about syscalls which were called by each
> >command.
> >$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> >             20,713      syscalls:sys_exit_open
> >             20,710      syscalls:sys_exit_close
> >             20,708      syscalls:sys_exit_read
> >             10,348      syscalls:sys_exit_newstat
> >                 31      syscalls:sys_exit_write
> >
> >$ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> >                114      syscalls:sys_exit_recvfrom
> >                 49      syscalls:sys_exit_write
> >                  8      syscalls:sys_exit_mmap
> >                  4      syscalls:sys_exit_mprotect
> >                  3      syscalls:sys_exit_newfstat
> 
> 'perf trace -s' gives the summary with stats.
> e.g., perf trace -s --  ps ax -o pid,ppid

Thank you for this command; I hadn't used it before.

 ps (21301), 145271 events, 100.0%, 0.000 msec

   syscall            calls      min       avg       max      stddev
                               (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- ---------     ------
   read               20717     0.000     0.020     1.631      0.64%
   write                  1     0.019     0.019     0.019      0.00%
   open               20722     0.025     0.035     3.624      0.93%
   close              20719     0.006     0.009     1.059      0.95%
   stat               10352     0.015     0.025     1.748      0.95%
   fstat                 12     0.010     0.012     0.020      6.17%
   lseek                  2     0.011     0.012     0.012      3.08%
   mmap                  30     0.012     0.034     0.094      9.35%
   mprotect              17     0.034     0.045     0.067      4.86%
   munmap                 3     0.028     0.058     0.108     44.12%
   brk                    4     0.011     0.015     0.019     11.24%
   rt_sigaction          25     0.011     0.011     0.014      1.27%
   rt_sigprocmask         1     0.012     0.012     0.012      0.00%
   ioctl                  4     0.010     0.012     0.014      6.94%
   access                 1     0.034     0.034     0.034      0.00%
   execve                 6     0.000     0.496     2.794     92.58%
   uname                  1     0.015     0.015     0.015      0.00%
   getdents              12     0.019     0.691     1.158     13.04%
   getrlimit              1     0.012     0.012     0.012      0.00%
   geteuid                1     0.012     0.012     0.012      0.00%
   arch_prctl             1     0.013     0.013     0.013      0.00%
   futex                  1     0.020     0.020     0.020      0.00%
   set_tid_address        1     0.012     0.012     0.012      0.00%
   openat                 1     0.030     0.030     0.030      0.00%
   set_robust_list        1     0.011     0.011     0.011      0.00%


 task_diag_all (21304), 569 events, 98.6%, 0.000 msec

   syscall            calls      min       avg       max      stddev
                               (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- ---------     ------
   read                   2     0.000     0.045     0.090    100.00%
   write                 77     0.010     0.013     0.083      7.93%
   open                   2     0.031     0.038     0.045     19.64%
   close                  3     0.010     0.014     0.017     13.43%
   fstat                  3     0.011     0.011     0.012      3.79%
   mmap                   8     0.013     0.027     0.049     16.72%
   mprotect               4     0.034     0.043     0.052      8.86%
   munmap                 1     0.031     0.031     0.031      0.00%
   brk                    1     0.014     0.014     0.014      0.00%
   ioctl                  1     0.010     0.010     0.010      0.00%
   access                 1     0.030     0.030     0.030      0.00%
   getpid                 1     0.011     0.011     0.011      0.00%
   socket                 1     0.045     0.045     0.045      0.00%
   sendto                 2     0.091     0.104     0.117     12.63%
   recvfrom             175     0.026     0.093     0.141      1.10%
   bind                   1     0.014     0.014     0.014      0.00%
   execve                 1     0.000     0.000     0.000      0.00%
   arch_prctl             1     0.011     0.011     0.011      0.00%

> 
>  ps (23850), 3117 events, 99.3%, 0.000 msec
> 
>    syscall            calls      min       avg       max      stddev
>                                (msec)    (msec)    (msec)        (%)
>    --------------- -------- --------- --------- ---------     ------
>    read                 353     0.000     0.010     0.035      3.14%
>    write                166     0.006     0.012     0.045      3.03%
>    open                 365     0.002     0.005     0.178     11.29%
>    close                354     0.001     0.002     0.024      3.57%
>    stat                 170     0.002     0.007     0.662     52.99%
>    fstat                 19     0.002     0.003     0.003      2.31%
>    lseek                  2     0.003     0.003     0.003      6.49%
>    mmap                  50     0.004     0.006     0.013      3.40%
> ...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-17  8:53   ` Arnd Bergmann
@ 2015-02-17 21:33     ` Andrew Vagin
  -1 siblings, 0 replies; 41+ messages in thread
From: Andrew Vagin @ 2015-02-17 21:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> > 
> > A request is described by the task_diag_pid structure:
> > 
> > struct task_diag_pid {
> >        __u64   show_flags;      /* specify which information are required */
> >        __u64   dump_stratagy;   /* specify a group of processes */
> > 
> >        __u32   pid;
> > };
> 
> Can you explain how the interface relates to the 'taskstats' genetlink
> API? Did you consider extending that interface to provide the
> information you need instead of basing on the socket-diag?

It isn't based on socket-diag; it just looks like socket-diag.

Currently task_diag registers a new genl family, but we can use the taskstats
family and add the task_diag commands to it.
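
A rough sketch of how that could look on the kernel side (assumptions:
TASKSTATS_CMD_TASK_DIAG_GET is a hypothetical new command value,
taskstats_user_cmd() is the existing taskstats handler, and
taskdiag_doit()/taskdiag_dumpid() are the handlers from this series):

static const struct genl_ops taskstats_ops[] = {
	{
		.cmd	= TASKSTATS_CMD_GET,		/* existing taskstats command */
		.doit	= taskstats_user_cmd,
	},
	{
		.cmd	= TASKSTATS_CMD_TASK_DIAG_GET,	/* hypothetical new command */
		.doit	= taskdiag_doit,
		.dumpit	= taskdiag_dumpid,
	},
};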

Thanks,
Andrew

> 
> 	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-18 11:06       ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2015-02-18 11:06 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > > 
> > > A request is described by the task_diag_pid structure:
> > > 
> > > struct task_diag_pid {
> > >        __u64   show_flags;      /* specify which information are required */
> > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > 
> > >        __u32   pid;
> > > };
> > 
> > Can you explain how the interface relates to the 'taskstats' genetlink
> > API? Did you consider extending that interface to provide the
> > information you need instead of basing on the socket-diag?
> 
> It isn't based on the socket-diag, it looks like socket-diag.
> 
> Current task_diag registers a new genl family, but we can use the taskstats
> family and add task_diag commands to it.

What I meant was more along the lines of making it look like taskstats
by adding new fields to 'struct taskstats' for what you want to return.
I don't know if that is possible or a good idea for the information
you want to get out of the kernel, but it seems like a more natural
interface, as it already has some of the same data (comm, gid, pid,
ppid, ...).

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-18 11:06       ` Arnd Bergmann
@ 2015-02-18 12:42         ` Andrew Vagin
  -1 siblings, 0 replies; 41+ messages in thread
From: Andrew Vagin @ 2015-02-18 12:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > > 
> > > > A request is described by the task_diag_pid structure:
> > > > 
> > > > struct task_diag_pid {
> > > >        __u64   show_flags;      /* specify which information are required */
> > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > > 
> > > >        __u32   pid;
> > > > };
> > > 
> > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > API? Did you consider extending that interface to provide the
> > > information you need instead of basing on the socket-diag?
> > 
> > It isn't based on the socket-diag, it looks like socket-diag.
> > 
> > Current task_diag registers a new genl family, but we can use the taskstats
> > family and add task_diag commands to it.
> 
> What I meant was more along the lines of making it look like taskstats
> by adding new fields to 'struct taskstat' for what you want return.
> I don't know if that is possible or a good idea for the information
> you want to get out of the kernel, but it seems like a more natural
> interface, as it already has some of the same data (comm, gid, pid,
> ppid, ...).

Now I see what you mean. task_diag has a more flexible and universal
interface than taskstats. A taskstats response only contains the
taskstats structure, while a task_diag response can contain several
types of properties, each described by its own structure.

Currently there are only two groups of parameters: task_diag_msg and
task_diag_creds.

task_diag_msg contains a few basic parameters.
task_diag_creds contains credentials.

I'm going to add other groups to describe all kinds of task properties
which are currently presented in procfs (e.g. /proc/pid/maps,
/proc/pid/fdinfo/*, /proc/pid/status, etc.).

One of the features of task_diag is the ability to choose which
information is required. This minimizes both the response size and the
time required to fill the response.

struct task_diag_msg {
        __u32   tgid;
        __u32   pid;
        __u32   ppid;
        __u32   tpid;
        __u32   sid;
        __u32   pgid;
        __u8    state;
        char    comm[TASK_DIAG_COMM_LEN];
};

struct task_diag_creds {
        struct task_diag_caps cap_inheritable;
        struct task_diag_caps cap_permitted;
        struct task_diag_caps cap_effective;
        struct task_diag_caps cap_bset;

        __u32 uid;
        __u32 euid;
        __u32 suid;
        __u32 fsuid;
        __u32 gid;
        __u32 egid;
        __u32 sgid;
        __u32 fsgid;
};
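
For example, with the helpers from the selftest in the last patch, a
caller that needs only the base group plus credentials for one task can
do something like this (a sketch based on task_diag.c from this series;
nl_sd, id and buf are set up as in that test):

	struct task_diag_pid req = { 0 };

	req.show_flags = TASK_DIAG_SHOW_CRED;	/* TASK_DIAG_MSG comes anyway */
	req.pid = getpid();

	send_cmd(nl_sd, id, getpid(), TASKDIAG_CMD_GET,
		 TASKDIAG_CMD_ATTR_GET, &req, sizeof(req), 0);

	rep_len = recv(nl_sd, buf, sizeof(buf), 0);
	nlmsg_receive(buf, rep_len, &show_task);	/* walks the TASK_DIAG_* groups */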

Thanks,
Andrew
> 
> 	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-18 14:27     ` Andrew Vagin
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Vagin @ 2015-02-18 14:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrey Vagin, Pavel Emelyanov, Roger Luethi, Oleg Nesterov,
	Cyrill Gorcunov, Andrew Morton, Linux API, linux-kernel

On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> >
> > Here is a preview version. It provides restricted set of functionality.
> > I would like to collect feedback about this idea.
> >
> > Currently we use the proc file system, where all information are
> > presented in text files, what is convenient for humans.  But if we need
> > to get information about processes from code (e.g. in C), the procfs
> > doesn't look so cool.
> >
> > From code we would prefer to get information in binary format and to be
> > able to specify which information and for which tasks are required. Here
> > is a new interface with all these features, which is called task_diag.
> > In addition it's much faster than procfs.
> >
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> >        __u64   show_flags;      /* specify which information are required */
> >        __u64   dump_stratagy;   /* specify a group of processes */
> >
> >        __u32   pid;
> > };
> >
> > A respone is a set of netlink messages. Each message describes one task.
> > All task properties are divided on groups. A message contains the
> > TASK_DIAG_MSG group and other groups if they have been requested in
> > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > response will contain the TASK_DIAG_CRED group which is described by the
> > task_diag_creds structure.
> >
> > struct task_diag_msg {
> >         __u32   tgid;
> >         __u32   pid;
> >         __u32   ppid;
> >         __u32   tpid;
> >         __u32   sid;
> >         __u32   pgid;
> >         __u8    state;
> >         char    comm[TASK_DIAG_COMM_LEN];
> > };
> >
> > Another good feature of task_diag is an ability to request information
> > for a few processes. Currently here are two stratgies
> > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> >                           tasks
> >
> > The task diag is much faster than the proc file system. We don't need to
> > create a new file descriptor for each task. We need to send a request
> > and get a response. It allows to get information for a few task in one
> > request-response iteration.
> >
> > I have compared performance of procfs and task-diag for the
> > "ps ax -o pid,ppid" command.
> >
> > A test stand contains 10348 processes.
> > $ ps ax -o pid,ppid | wc -l
> > 10348
> >
> > $ time ps ax -o pid,ppid > /dev/null
> >
> > real    0m1.073s
> > user    0m0.086s
> > sys     0m0.903s
> >
> > $ time ./task_diag_all > /dev/null
> >
> > real    0m0.037s
> > user    0m0.004s
> > sys     0m0.020s
> >
> > And here are statistics about syscalls which were called by each
> > command.
> > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> >             20,713      syscalls:sys_exit_open
> >             20,710      syscalls:sys_exit_close
> >             20,708      syscalls:sys_exit_read
> >             10,348      syscalls:sys_exit_newstat
> >                 31      syscalls:sys_exit_write
> >
> > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> >                114      syscalls:sys_exit_recvfrom
> >                 49      syscalls:sys_exit_write
> >                  8      syscalls:sys_exit_mmap
> >                  4      syscalls:sys_exit_mprotect
> >                  3      syscalls:sys_exit_newfstat
> >
> > You can find the test program from this experiment in the last patch.
> >
> > The idea of this functionality was suggested by Pavel Emelyanov
> > (xemul@), when he found that operations with /proc forms a significant
> > part of a checkpointing time.
> >
> > Ten years ago here was attempt to add a netlink interface to access to /proc
> > information:
> > http://lwn.net/Articles/99600/
> 
> I don't suppose this could use real syscalls instead of netlink.  If
> nothing else, netlink seems to conflate pid and net namespaces.

What do you mean by "conflate pid and net namespaces"?

> 
> Also, using an asynchronous interface (send, poll?, recv) for
> something that's inherently synchronous (asking the kernel a local
> question) seems awkward to me.

Actually, all requests are handled synchronously. We call sendmsg() to
send a request, and it is handled within that syscall:
 2)               |  netlink_sendmsg() {
 2)               |    netlink_unicast() {
 2)               |      taskdiag_doit() {
 2)   2.153 us    |        task_diag_fill();
 2)               |        netlink_unicast() {
 2)   0.185 us    |          netlink_attachskb();
 2)   0.291 us    |          __netlink_sendskb();
 2)   2.452 us    |        }
 2) + 33.625 us   |      }
 2) + 54.611 us   |    }
 2) + 76.370 us   |  }
 2)               |  netlink_recvmsg() {
 2)   1.178 us    |    skb_recv_datagram();
 2) + 46.953 us   |  }

If we request information for a group of tasks (NLM_F_DUMP), the first
portion of data is filled during the sendmsg() syscall. Then, as we read
from the socket, the kernel fills in the next portion (see the sketch
after the traces below).

 3)               |  netlink_sendmsg() {
 3)               |    __netlink_dump_start() {
 3)               |      netlink_dump() {
 3)               |        taskdiag_dumpid() {
 3)   0.685 us    |          task_diag_fill();
...
 3)   0.224 us    |          task_diag_fill();
 3) + 74.028 us   |        }
 3) + 88.757 us   |      }
 3) + 89.296 us   |    }
 3) + 98.705 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)               |      taskdiag_dumpid() {
 3)   0.594 us    |        task_diag_fill();
...
 3)   0.242 us    |        task_diag_fill();
 3) + 60.634 us   |      }
 3) + 72.803 us   |    }
 3) + 88.005 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)   2.403 us    |      taskdiag_dumpid();
 3) + 26.236 us   |    }
 3) + 40.522 us   |  }
 0) + 20.407 us   |  netlink_recvmsg();
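
To make the flow above concrete, here is roughly what the dump looks
like from userspace (a sketch using the helpers from the selftest in
this series; error handling is omitted):

	struct task_diag_pid req = {
		.show_flags	= 0,
		.dump_stratagy	= TASK_DIAG_DUMP_ALL,
		.pid		= 1,
	};

	/* The request is processed synchronously inside sendmsg(); the
	 * first portion of the answer is queued before it returns. */
	send_cmd(nl_sd, id, getpid(), TASKDIAG_CMD_GET,
		 TASKDIAG_CMD_ATTR_GET, &req, sizeof(req), 1 /* NLM_F_DUMP */);

	while (1) {
		int rep_len = recv(nl_sd, buf, sizeof(buf), 0);

		if (rep_len <= 0)
			break;
		/* Each recv() lets the kernel fill the next portion;
		 * nlmsg_receive() returns 0 once NLMSG_DONE is seen. */
		if (nlmsg_receive(buf, rep_len, &show_task) <= 0)
			break;
	}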


netlink is really good for this type of task.  It allows creating an
extensible interface which can be easily customized for different needs.

I don't think we would want to create another similar interface
just to be independent of the network subsystem.

Thanks,
Andrew

> 
> --Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-18 14:46           ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2015-02-18 14:46 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > is used to get information about sockets.
> > > > > 
> > > > > A request is described by the task_diag_pid structure:
> > > > > 
> > > > > struct task_diag_pid {
> > > > >        __u64   show_flags;      /* specify which information are required */
> > > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > > > 
> > > > >        __u32   pid;
> > > > > };
> > > > 
> > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > API? Did you consider extending that interface to provide the
> > > > information you need instead of basing on the socket-diag?
> > > 
> > > It isn't based on the socket-diag, it looks like socket-diag.
> > > 
> > > Current task_diag registers a new genl family, but we can use the taskstats
> > > family and add task_diag commands to it.
> > 
> > What I meant was more along the lines of making it look like taskstats
> > by adding new fields to 'struct taskstat' for what you want return.
> > I don't know if that is possible or a good idea for the information
> > you want to get out of the kernel, but it seems like a more natural
> > interface, as it already has some of the same data (comm, gid, pid,
> > ppid, ...).
> 
> Now I see what you mean. task_diag has more flexible and universal
> interface than taskstat. A response of taskstat only contains a
> taskstats structure. A response of taskdiag can contains a few types of
> properties. Each type is described by its own structure.

Right, so the question is whether that flexibility is actually required
here. Independent of which design you personally prefer, what are the
downsides of extending the existing but less flexible interface?

If it's good enough, that would seem to provide a more consistent
API, which in turn helps users understand the interface and use it
correctly.

> Curently here are only two groups of parameters: task_diag_msg and
> task_diag_creds.
> 
> task_diag_msg contains a few basic parameters.
> task_diag_creds contains credentials.
> 
> I'm going to add other groups to describe all kind of task properties
> which currently are presented in procfs (e.g. /proc/pid/maps,
> /proc/pid/fding/*, /proc/pid/status, etc).
> 
> One of features of task_diag is an ability to choose which information
> are required. This allows to minimize a response size and a time, which
> is requred to fill this response.

I realize that you are trying to optimize for performance, but it
would be nice to quantify this if you want to argue for requiring
a split interface.

> struct task_diag_msg {
>         __u32   tgid;
>         __u32   pid;
>         __u32   ppid;
>         __u32   tpid;
>         __u32   sid;
>         __u32   pgid;
>         __u8    state;
>         char    comm[TASK_DIAG_COMM_LEN];
> };

I guess this part would be a very natural extension to the
existing taskstats structure, and we should only add a new
one here if there are extremely good reasons for it.
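
(For context, struct taskstats normally grows by appending fields at the
end and bumping TASKSTATS_VERSION, so old binaries keep working; a
consumer typically copes with shorter or longer structs roughly like
this -- a sketch, not taken from any patch:)

#include <string.h>
#include <linux/taskstats.h>

/* 'payload' points at the TASKSTATS_TYPE_STATS attribute data */
static void read_stats(const void *payload, unsigned int payload_len)
{
	struct taskstats ts;

	memset(&ts, 0, sizeof(ts));
	/* older kernels send fewer trailing fields; copy what is there */
	memcpy(&ts, payload, payload_len < sizeof(ts) ? payload_len : sizeof(ts));

	if (ts.version >= TASKSTATS_VERSION) {
		/* every field this header knows about is present */
	}
}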

> struct task_diag_creds {
>         struct task_diag_caps cap_inheritable;
>         struct task_diag_caps cap_permitted;
>         struct task_diag_caps cap_effective;
>         struct task_diag_caps cap_bset;
> 
>         __u32 uid;
>         __u32 euid;
>         __u32 suid;
>         __u32 fsuid;
>         __u32 gid;
>         __u32 egid;
>         __u32 sgid;
>         __u32 fsgid;
> };

while this part could well be kept separate, so you can query it
independently of the rest of taskstats, but through a related
interface.

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-19  1:18       ` Andy Lutomirski
  0 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2015-02-19  1:18 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
	linux-kernel, Andrew Morton, Linux API, Andrey Vagin

On Feb 18, 2015 6:27 AM, "Andrew Vagin" <avagin@parallels.com> wrote:
>
> On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> > >
> > > Here is a preview version. It provides restricted set of functionality.
> > > I would like to collect feedback about this idea.
> > >
> > > Currently we use the proc file system, where all information are
> > > presented in text files, what is convenient for humans.  But if we need
> > > to get information about processes from code (e.g. in C), the procfs
> > > doesn't look so cool.
> > >
> > > From code we would prefer to get information in binary format and to be
> > > able to specify which information and for which tasks are required. Here
> > > is a new interface with all these features, which is called task_diag.
> > > In addition it's much faster than procfs.
> > >
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > >
> > > A request is described by the task_diag_pid structure:
> > >
> > > struct task_diag_pid {
> > >        __u64   show_flags;      /* specify which information are required */
> > >        __u64   dump_stratagy;   /* specify a group of processes */
> > >
> > >        __u32   pid;
> > > };
> > >
> > > A respone is a set of netlink messages. Each message describes one task.
> > > All task properties are divided on groups. A message contains the
> > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > response will contain the TASK_DIAG_CRED group which is described by the
> > > task_diag_creds structure.
> > >
> > > struct task_diag_msg {
> > >         __u32   tgid;
> > >         __u32   pid;
> > >         __u32   ppid;
> > >         __u32   tpid;
> > >         __u32   sid;
> > >         __u32   pgid;
> > >         __u8    state;
> > >         char    comm[TASK_DIAG_COMM_LEN];
> > > };
> > >
> > > Another good feature of task_diag is an ability to request information
> > > for a few processes. Currently here are two stratgies
> > > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > >                           tasks
> > >
> > > The task diag is much faster than the proc file system. We don't need to
> > > create a new file descriptor for each task. We need to send a request
> > > and get a response. It allows to get information for a few task in one
> > > request-response iteration.
> > >
> > > I have compared performance of procfs and task-diag for the
> > > "ps ax -o pid,ppid" command.
> > >
> > > A test stand contains 10348 processes.
> > > $ ps ax -o pid,ppid | wc -l
> > > 10348
> > >
> > > $ time ps ax -o pid,ppid > /dev/null
> > >
> > > real    0m1.073s
> > > user    0m0.086s
> > > sys     0m0.903s
> > >
> > > $ time ./task_diag_all > /dev/null
> > >
> > > real    0m0.037s
> > > user    0m0.004s
> > > sys     0m0.020s
> > >
> > > And here are statistics about syscalls which were called by each
> > > command.
> > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> > >             20,713      syscalls:sys_exit_open
> > >             20,710      syscalls:sys_exit_close
> > >             20,708      syscalls:sys_exit_read
> > >             10,348      syscalls:sys_exit_newstat
> > >                 31      syscalls:sys_exit_write
> > >
> > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> > >                114      syscalls:sys_exit_recvfrom
> > >                 49      syscalls:sys_exit_write
> > >                  8      syscalls:sys_exit_mmap
> > >                  4      syscalls:sys_exit_mprotect
> > >                  3      syscalls:sys_exit_newfstat
> > >
> > > You can find the test program from this experiment in the last patch.
> > >
> > > The idea of this functionality was suggested by Pavel Emelyanov
> > > (xemul@), when he found that operations with /proc forms a significant
> > > part of a checkpointing time.
> > >
> > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > information:
> > > http://lwn.net/Articles/99600/
> >
> > I don't suppose this could use real syscalls instead of netlink.  If
> > nothing else, netlink seems to conflate pid and net namespaces.
>
> What do you mean by "conflate pid and net namespaces"?

A netlink socket is bound to a network namespace, but you should be
returning data specific to a pid namespace.

On a related note, how does this interact with hidepid?  More
generally, what privileges are you requiring to obtain what data?

>
> >
> > Also, using an asynchronous interface (send, poll?, recv) for
> > something that's inherently synchronous (as the kernel a local
> > question) seems awkward to me.
>
> Actually all requests are handled synchronously. We call sendmsg to send
> a request and it is handled in this syscall.
>  2)               |  netlink_sendmsg() {
>  2)               |    netlink_unicast() {
>  2)               |      taskdiag_doit() {
>  2)   2.153 us    |        task_diag_fill();
>  2)               |        netlink_unicast() {
>  2)   0.185 us    |          netlink_attachskb();
>  2)   0.291 us    |          __netlink_sendskb();
>  2)   2.452 us    |        }
>  2) + 33.625 us   |      }
>  2) + 54.611 us   |    }
>  2) + 76.370 us   |  }
>  2)               |  netlink_recvmsg() {
>  2)   1.178 us    |    skb_recv_datagram();
>  2) + 46.953 us   |  }
>
> If we request information for a group of tasks (NLM_F_DUMP), a first
> portion of data is filled from the sendmsg syscall. And then when we read
> it, the kernel fills the next portion.
>
>  3)               |  netlink_sendmsg() {
>  3)               |    __netlink_dump_start() {
>  3)               |      netlink_dump() {
>  3)               |        taskdiag_dumpid() {
>  3)   0.685 us    |          task_diag_fill();
> ...
>  3)   0.224 us    |          task_diag_fill();
>  3) + 74.028 us   |        }
>  3) + 88.757 us   |      }
>  3) + 89.296 us   |    }
>  3) + 98.705 us   |  }
>  3)               |  netlink_recvmsg() {
>  3)               |    netlink_dump() {
>  3)               |      taskdiag_dumpid() {
>  3)   0.594 us    |        task_diag_fill();
> ...
>  3)   0.242 us    |        task_diag_fill();
>  3) + 60.634 us   |      }
>  3) + 72.803 us   |    }
>  3) + 88.005 us   |  }
>  3)               |  netlink_recvmsg() {
>  3)               |    netlink_dump() {
>  3)   2.403 us    |      taskdiag_dumpid();
>  3) + 26.236 us   |    }
>  3) + 40.522 us   |  }
>  0) + 20.407 us   |  netlink_recvmsg();
>
>
> netlink is really good for this type of tasks.  It allows to create an
> extendable interface which can be easy customized for different needs.
>
> I don't think that we would want to create another similar interface
> just to be independent from network subsystem.

I guess this is a bit streamy in that you ask one question and get
multiple answers.

>
> Thanks,
> Andrew
>
> >
> > --Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-18 14:46           ` Arnd Bergmann
@ 2015-02-19 14:04             ` Andrew Vagin
  -1 siblings, 0 replies; 41+ messages in thread
From: Andrew Vagin @ 2015-02-19 14:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi

On Wed, Feb 18, 2015 at 03:46:31PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> > On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > > is used to get information about sockets.
> > > > > > 
> > > > > > A request is described by the task_diag_pid structure:
> > > > > > 
> > > > > > struct task_diag_pid {
> > > > > >        __u64   show_flags;      /* specify which information are required */
> > > > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > > > > 
> > > > > >        __u32   pid;
> > > > > > };
> > > > > 
> > > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > > API? Did you consider extending that interface to provide the
> > > > > information you need instead of basing on the socket-diag?
> > > > 
> > > > It isn't based on the socket-diag, it looks like socket-diag.
> > > > 
> > > > Current task_diag registers a new genl family, but we can use the taskstats
> > > > family and add task_diag commands to it.
> > > 
> > > What I meant was more along the lines of making it look like taskstats
> > > by adding new fields to 'struct taskstat' for what you want return.
> > > I don't know if that is possible or a good idea for the information
> > > you want to get out of the kernel, but it seems like a more natural
> > > interface, as it already has some of the same data (comm, gid, pid,
> > > ppid, ...).
> > 
> > Now I see what you mean. task_diag has more flexible and universal
> > interface than taskstat. A response of taskstat only contains a
> > taskstats structure. A response of taskdiag can contains a few types of
> > properties. Each type is described by its own structure.
> 
> Right, so the question is whether that flexibility is actually required
> here. Independent of which design you personally prefer, what are the
> downsides of extending the existing but less flexible interface?

I have looked at taskstats once again.

The format of the response messages for taskstats and task_diag is the
same: a netlink message with a set of nested attributes. New attributes
can be added without breaking backward compatibility.

The request can be extended to specify which information is required
and for which tasks.

These two features significantly improve performance, because then we
don't need to do a system call for each task.
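
To illustrate why new attribute groups don't break old consumers, a
response parser can simply skip types it does not recognize. This is
only a sketch: the attribute type values and the assumption that the
reply carries no fixed user header after genlmsghdr are placeholders,
not the final ABI.

#include <linux/types.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

#define TASK_DIAG_MSG	0	/* placeholder attribute types */
#define TASK_DIAG_CRED	1

static void parse_task(struct nlmsghdr *h)
{
	int len = h->nlmsg_len - NLMSG_LENGTH(GENL_HDRLEN);
	struct nlattr *a = (struct nlattr *)((char *)NLMSG_DATA(h) + GENL_HDRLEN);

	while (len >= (int)sizeof(*a) && a->nla_len >= sizeof(*a) &&
	       a->nla_len <= len) {
		switch (a->nla_type & NLA_TYPE_MASK) {
		case TASK_DIAG_MSG:
			/* payload: struct task_diag_msg, at (void *)(a + 1) */
			break;
		case TASK_DIAG_CRED:
			/* payload: struct task_diag_creds */
			break;
		default:
			break;	/* unknown group: ignore it, stay compatible */
		}
		len -= NLA_ALIGN(a->nla_len);
		a = (struct nlattr *)((char *)a + NLA_ALIGN(a->nla_len));
	}
}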

I have done a few experiments to back this up.

task_proc_all reads /proc/pid/stat for each task
$ time ./task_proc_all > /dev/null

real	0m1.528s
user	0m0.016s
sys	0m1.341s

task_diag uses the task_diag interface and requests information for each
task separately.
$ time ./task_diag > /dev/null

real	0m1.166s
user	0m0.024s
sys	0m1.127s

task_diag_all uses the task_diag interface and requests information for
all tasks in one request.
$ time ./task_diag_all > /dev/null

real	0m0.077s
user	0m0.018s
sys	0m0.053s

So you can see that the ability to request information for a group of
tasks makes the interface significantly more efficient.
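
(In other words, the single-request dump is roughly 20x faster than
reading /proc/<pid>/stat per task, 1.528s / 0.077s, and about 15x faster
than issuing a separate task_diag request per task, 1.166s / 0.077s;
almost all of the win is system time, 1.341s vs 0.053s.)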

The summary of this message is that we can reuse the taskstats interface
with some extensions.

Arnd, thank you for your opinion and suggestions.

> 
> If it's good enough, that would seem to provide a more consistent
> API, which in turn helps users understand the interface and use it
> correctly.
> 
> > Curently here are only two groups of parameters: task_diag_msg and
> > task_diag_creds.
> > 
> > task_diag_msg contains a few basic parameters.
> > task_diag_creds contains credentials.
> > 
> > I'm going to add other groups to describe all kind of task properties
> > which currently are presented in procfs (e.g. /proc/pid/maps,
> > /proc/pid/fding/*, /proc/pid/status, etc).
> > 
> > One of features of task_diag is an ability to choose which information
> > are required. This allows to minimize a response size and a time, which
> > is requred to fill this response.
> 
> I realize that you are trying to optimize for performance, but it
> would be nice to quantify this if you want to argue for requiring
> a split interface.
> 
> > struct task_diag_msg {
> >         __u32   tgid;
> >         __u32   pid;
> >         __u32   ppid;
> >         __u32   tpid;
> >         __u32   sid;
> >         __u32   pgid;
> >         __u8    state;
> >         char    comm[TASK_DIAG_COMM_LEN];
> > };
> 
> I guess this part would be a very natural extension to the
> existing taskstats structure, and we should only add a new
> one here if there are extremely good reasons for it.

The task_diag_msg structure contains properties which are used more
frequently than the statistics in the taskstats structure.

The size of the task_diag_msg structure is 44 bytes, while the taskstats
structure is 328 bytes. The more data we return per task, the more
system calls we need. So I have done one more experiment to see how this
affects performance:

If we use the task_diag_msg structure:
$ time ./task_diag_all > /dev/null

real	0m0.077s
user	0m0.018s
sys	0m0.053s

If we use the taskstats structure:
$ time ./task_diag_all > /dev/null

real	0m0.117s
user	0m0.029s
sys	0m0.085s
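
(For scale: a taskstats record is roughly 7.5x larger than a
task_diag_msg record, 328 / 44 bytes, and in this test that costs about
1.5x in wall-clock time, 0.117s vs 0.077s, because each netlink buffer
holds fewer records and more copying is needed.)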

Thanks,
Andrew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-19 21:39         ` Andrew Vagin
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Vagin @ 2015-02-19 21:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
	linux-kernel, Andrew Morton, Linux API, Andrey Vagin

On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
> On Feb 18, 2015 6:27 AM, "Andrew Vagin" <avagin@parallels.com> wrote:
> >
> > On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> > > >
> > > > Here is a preview version. It provides restricted set of functionality.
> > > > I would like to collect feedback about this idea.
> > > >
> > > > Currently we use the proc file system, where all information are
> > > > presented in text files, what is convenient for humans.  But if we need
> > > > to get information about processes from code (e.g. in C), the procfs
> > > > doesn't look so cool.
> > > >
> > > > From code we would prefer to get information in binary format and to be
> > > > able to specify which information and for which tasks are required. Here
> > > > is a new interface with all these features, which is called task_diag.
> > > > In addition it's much faster than procfs.
> > > >
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > >
> > > > A request is described by the task_diag_pid structure:
> > > >
> > > > struct task_diag_pid {
> > > >        __u64   show_flags;      /* specify which information are required */
> > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > >
> > > >        __u32   pid;
> > > > };
> > > >
> > > > A respone is a set of netlink messages. Each message describes one task.
> > > > All task properties are divided on groups. A message contains the
> > > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > > response will contain the TASK_DIAG_CRED group which is described by the
> > > > task_diag_creds structure.
> > > >
> > > > struct task_diag_msg {
> > > >         __u32   tgid;
> > > >         __u32   pid;
> > > >         __u32   ppid;
> > > >         __u32   tpid;
> > > >         __u32   sid;
> > > >         __u32   pgid;
> > > >         __u8    state;
> > > >         char    comm[TASK_DIAG_COMM_LEN];
> > > > };
> > > >
> > > > Another good feature of task_diag is an ability to request information
> > > > for a few processes. Currently here are two stratgies
> > > > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > > >                           tasks
> > > >
> > > > The task diag is much faster than the proc file system. We don't need to
> > > > create a new file descriptor for each task. We need to send a request
> > > > and get a response. It allows to get information for a few task in one
> > > > request-response iteration.
> > > >
> > > > I have compared performance of procfs and task-diag for the
> > > > "ps ax -o pid,ppid" command.
> > > >
> > > > A test stand contains 10348 processes.
> > > > $ ps ax -o pid,ppid | wc -l
> > > > 10348
> > > >
> > > > $ time ps ax -o pid,ppid > /dev/null
> > > >
> > > > real    0m1.073s
> > > > user    0m0.086s
> > > > sys     0m0.903s
> > > >
> > > > $ time ./task_diag_all > /dev/null
> > > >
> > > > real    0m0.037s
> > > > user    0m0.004s
> > > > sys     0m0.020s
> > > >
> > > > And here are statistics about syscalls which were called by each
> > > > command.
> > > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> > > >             20,713      syscalls:sys_exit_open
> > > >             20,710      syscalls:sys_exit_close
> > > >             20,708      syscalls:sys_exit_read
> > > >             10,348      syscalls:sys_exit_newstat
> > > >                 31      syscalls:sys_exit_write
> > > >
> > > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> > > >                114      syscalls:sys_exit_recvfrom
> > > >                 49      syscalls:sys_exit_write
> > > >                  8      syscalls:sys_exit_mmap
> > > >                  4      syscalls:sys_exit_mprotect
> > > >                  3      syscalls:sys_exit_newfstat
> > > >
> > > > You can find the test program from this experiment in the last patch.
> > > >
> > > > The idea of this functionality was suggested by Pavel Emelyanov
> > > > (xemul@), when he found that operations with /proc forms a significant
> > > > part of a checkpointing time.
> > > >
> > > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > > information:
> > > > http://lwn.net/Articles/99600/
> > >
> > > I don't suppose this could use real syscalls instead of netlink.  If
> > > nothing else, netlink seems to conflate pid and net namespaces.
> >
> > What do you mean by "conflate pid and net namespaces"?
> 
> A netlink socket is bound to a network namespace, but you should be
> returning data specific to a pid namespace.

That is a good question. When we mount a procfs instance, the current
pidns is saved in the superblock. If we then read data from this procfs
instance from another pidns, we still see pids from the pidns where it
was mounted.

$ unshare -p -- bash -c '(bash)'
$ cat /proc/self/status | grep ^Pid:
Pid:	15770
$ echo $$
1

The situation is similar with socket_diag. A socket_diag socket is bound
to a network namespace: if we open a socket_diag socket and then change
the network namespace, it still returns information about the initial
netns.

In this version I always use the current pid namespace. But to be
consistent with the rest of the kernel, the task_diag socket should be
linked to the pidns in which it was created.

> 
> On a related note, how does this interact with hidepid?  More

Currently it always works like procfs with hidepid = 2 (the highest
level of security).

> generally, what privileges are you requiring to obtain what data?

It dumps information about a task only if ptrace_may_access(tsk, PTRACE_MODE_READ) returns true.
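
Roughly, the intended per-task gate looks like this (a sketch of the
intent only, not the exact patch code):

#include <linux/types.h>
#include <linux/sched.h>
#include <linux/ptrace.h>

/* dump a task only if the caller could ptrace-read it, the same kind of
 * check /proc applies to its more sensitive per-task files */
static bool task_diag_may_dump(struct task_struct *tsk)
{
	return ptrace_may_access(tsk, PTRACE_MODE_READ);
}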

> 
> >
> > >
> > > Also, using an asynchronous interface (send, poll?, recv) for
> > > something that's inherently synchronous (as the kernel a local
> > > question) seems awkward to me.
> >
> > Actually all requests are handled synchronously. We call sendmsg to send
> > a request and it is handled in this syscall.
> >  2)               |  netlink_sendmsg() {
> >  2)               |    netlink_unicast() {
> >  2)               |      taskdiag_doit() {
> >  2)   2.153 us    |        task_diag_fill();
> >  2)               |        netlink_unicast() {
> >  2)   0.185 us    |          netlink_attachskb();
> >  2)   0.291 us    |          __netlink_sendskb();
> >  2)   2.452 us    |        }
> >  2) + 33.625 us   |      }
> >  2) + 54.611 us   |    }
> >  2) + 76.370 us   |  }
> >  2)               |  netlink_recvmsg() {
> >  2)   1.178 us    |    skb_recv_datagram();
> >  2) + 46.953 us   |  }
> >
> > If we request information for a group of tasks (NLM_F_DUMP), a first
> > portion of data is filled from the sendmsg syscall. And then when we read
> > it, the kernel fills the next portion.
> >
> >  3)               |  netlink_sendmsg() {
> >  3)               |    __netlink_dump_start() {
> >  3)               |      netlink_dump() {
> >  3)               |        taskdiag_dumpid() {
> >  3)   0.685 us    |          task_diag_fill();
> > ...
> >  3)   0.224 us    |          task_diag_fill();
> >  3) + 74.028 us   |        }
> >  3) + 88.757 us   |      }
> >  3) + 89.296 us   |    }
> >  3) + 98.705 us   |  }
> >  3)               |  netlink_recvmsg() {
> >  3)               |    netlink_dump() {
> >  3)               |      taskdiag_dumpid() {
> >  3)   0.594 us    |        task_diag_fill();
> > ...
> >  3)   0.242 us    |        task_diag_fill();
> >  3) + 60.634 us   |      }
> >  3) + 72.803 us   |    }
> >  3) + 88.005 us   |  }
> >  3)               |  netlink_recvmsg() {
> >  3)               |    netlink_dump() {
> >  3)   2.403 us    |      taskdiag_dumpid();
> >  3) + 26.236 us   |    }
> >  3) + 40.522 us   |  }
> >  0) + 20.407 us   |  netlink_recvmsg();
> >
> >
> > netlink is really good for this type of tasks.  It allows to create an
> > extendable interface which can be easy customized for different needs.
> >
> > I don't think that we would want to create another similar interface
> > just to be independent from network subsystem.
> 
> I guess this is a bit streamy in that you ask one question and get
> multiple answers.

It's like seq_file in procfs: the kernel allocates a buffer, fills it,
copies it to userspace, fills it again, and so on. That is how we read
data from a file in portions.

Here is one more analogy. When we open a file in procfs, we send a
request to the kernel, and the file path is the request body. But with
procfs we cannot construct arbitrary requests; we only have a set of
predefined ones.

> 
> >
> > Thanks,
> > Andrew
> >
> > >
> > > --Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-20 20:33           ` Andy Lutomirski
  0 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2015-02-20 20:33 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
	linux-kernel, Andrew Morton, Linux API, Andrey Vagin

On Thu, Feb 19, 2015 at 1:39 PM, Andrew Vagin <avagin@parallels.com> wrote:
> On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
>> > > I don't suppose this could use real syscalls instead of netlink.  If
>> > > nothing else, netlink seems to conflate pid and net namespaces.
>> >
>> > What do you mean by "conflate pid and net namespaces"?
>>
>> A netlink socket is bound to a network namespace, but you should be
>> returning data specific to a pid namespace.
>
> Here is a good question. When we mount a procfs instance, the current
> pidns is saved on a superblock. Then if we read data from
> this procfs from another pidns, we will see pid-s from the pidns where
> this procfs has been mounted.
>
> $ unshare -p -- bash -c '(bash)'
> $ cat /proc/self/status | grep ^Pid:
> Pid:    15770
> $ echo $$
> 1
>
> A similar situation with socket_diag. A socket_diag socket is bound to a
> network namespace. If we open a socket_diag socket and change a network
> namespace, it will return infromation about the initial netns.
>
> In this version I always use a current pid namespace.
> But to be consistant with other kernel logic, a socket diag has to be
> linked with a pidns where it has been created.
>

Attaching a pidns to every freshly created netlink socket seems odd,
but I don't see a better solution that still uses netlink.

>>
>> On a related note, how does this interact with hidepid?  More
>
> Currently it always work as procfs with hidepid = 2 (highest level of
> security).
>
>> generally, what privileges are you requiring to obtain what data?
>
> It dumps information only if ptrace_may_access(tsk, PTRACE_MODE_READ) returns true

Sounds good to me.

>
>>
>> >
>> > >
>> > > Also, using an asynchronous interface (send, poll?, recv) for
>> > > something that's inherently synchronous (as the kernel a local
>> > > question) seems awkward to me.
>> >
>> > Actually all requests are handled synchronously. We call sendmsg to send
>> > a request and it is handled in this syscall.
>> >  2)               |  netlink_sendmsg() {
>> >  2)               |    netlink_unicast() {
>> >  2)               |      taskdiag_doit() {
>> >  2)   2.153 us    |        task_diag_fill();
>> >  2)               |        netlink_unicast() {
>> >  2)   0.185 us    |          netlink_attachskb();
>> >  2)   0.291 us    |          __netlink_sendskb();
>> >  2)   2.452 us    |        }
>> >  2) + 33.625 us   |      }
>> >  2) + 54.611 us   |    }
>> >  2) + 76.370 us   |  }
>> >  2)               |  netlink_recvmsg() {
>> >  2)   1.178 us    |    skb_recv_datagram();
>> >  2) + 46.953 us   |  }
>> >
>> > If we request information for a group of tasks (NLM_F_DUMP), the first
>> > portion of data is filled during the sendmsg syscall. Then, each time we
>> > read, the kernel fills the next portion.
>> >
>> >  3)               |  netlink_sendmsg() {
>> >  3)               |    __netlink_dump_start() {
>> >  3)               |      netlink_dump() {
>> >  3)               |        taskdiag_dumpid() {
>> >  3)   0.685 us    |          task_diag_fill();
>> > ...
>> >  3)   0.224 us    |          task_diag_fill();
>> >  3) + 74.028 us   |        }
>> >  3) + 88.757 us   |      }
>> >  3) + 89.296 us   |    }
>> >  3) + 98.705 us   |  }
>> >  3)               |  netlink_recvmsg() {
>> >  3)               |    netlink_dump() {
>> >  3)               |      taskdiag_dumpid() {
>> >  3)   0.594 us    |        task_diag_fill();
>> > ...
>> >  3)   0.242 us    |        task_diag_fill();
>> >  3) + 60.634 us   |      }
>> >  3) + 72.803 us   |    }
>> >  3) + 88.005 us   |  }
>> >  3)               |  netlink_recvmsg() {
>> >  3)               |    netlink_dump() {
>> >  3)   2.403 us    |      taskdiag_dumpid();
>> >  3) + 26.236 us   |    }
>> >  3) + 40.522 us   |  }
>> >  0) + 20.407 us   |  netlink_recvmsg();
>> >
>> >
>> > netlink is really good for this type of task.  It makes it possible to
>> > create an extensible interface which can be easily customized for
>> > different needs.
>> >
>> > I don't think that we would want to create another similar interface
>> > just to be independent of the network subsystem.
>>
>> I guess this is a bit streamy in that you ask one question and get
>> multiple answers.
>
> It's like seq_file in procfs. The kernel allocates a buffer, fills it,
> copies it into userspace, fills it again, and so on, so we can read the
> data from the file in portions.
>
> Actually, here is one more analogy. When we open a file in procfs, we
> send a request to the kernel, and the file path is the request body in
> this case. But in the case of procfs we can't construct requests; we
> only have a set of predefined requests.

Fair enough.  Procfs is also a bit absurd and only makes sense because
it's compatible with lots of tools.  In a totally sane world, I would
argue that you should issue one syscall asking questions about a task
and you should get the answers immediately.

--Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-27 20:54   ` David Ahern
@ 2015-02-27 21:50     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 41+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-02-27 21:50 UTC (permalink / raw)
  To: David Ahern; +Cc: Pavel Odintsov, Andrew Vagin, linux-kernel

On Fri, Feb 27, 2015 at 01:54:03PM -0700, David Ahern wrote:
> On 2/27/15 1:43 PM, Arnaldo Carvalho de Melo wrote:
> 
> > From the subject line, there is a patchkit, but I couldn't find it... Can
> > you resend it to me or point me to some URL where I can get it?
> 
> https://lkml.org/lkml/2015/2/17/64

Yeah, I eventually found it; this would be great for perf:

Another good feature of task_diag is an ability to request information
for a few processes. Currently here are two stratgies
TASK_DIAG_DUMP_ALL	- get information for all tasks
TASK_DIAG_DUMP_CHILDREN	- get information for children of a specified
			  tasks


I.e. 'perf record -a' would use that TASK_DIAG_DUMP_ALL to synthesize
PERF_RECORD_{FORK,COMM} events; we would still need some way to generate
the PERF_RECORD_MMAP entries, though.
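
As a rough sketch of what that synthesis could look like (a hedged
illustration, not perf or task_diag code): walk the task records that a
TASK_DIAG_DUMP_ALL request returns and emit one fork-style and one
comm-style record per task.  The record layouts below follow the
PERF_RECORD_FORK and PERF_RECORD_COMM bodies documented for
perf_event_open(2), with comm simplified to a fixed-size array;
struct task_rec and emit() are hypothetical stand-ins.

/* Hypothetical sketch: synthesize FORK/COMM records from dumped tasks. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>

struct task_rec {			/* stand-in for one dumped task entry */
	__u32 tgid, pid, ppid;
	char  comm[16];
};

struct fork_event {			/* PERF_RECORD_FORK body */
	struct perf_event_header header;
	__u32 pid, ppid;
	__u32 tid, ptid;
	__u64 time;
};

struct comm_event {			/* PERF_RECORD_COMM body (fixed-size comm) */
	struct perf_event_header header;
	__u32 pid, tid;
	char  comm[16];
};

static void emit(const void *rec, size_t len)
{
	fwrite(rec, len, 1, stdout);	/* stand-in for appending to perf.data */
}

static void synthesize_task(const struct task_rec *t)
{
	struct fork_event fe = {
		.header = { .type = PERF_RECORD_FORK, .size = sizeof(fe) },
		.pid = t->tgid, .ppid = t->ppid,
		.tid = t->pid,  .ptid = t->ppid,
	};
	struct comm_event ce = {
		.header = { .type = PERF_RECORD_COMM, .size = sizeof(ce) },
		.pid = t->tgid, .tid = t->pid,
	};

	strncpy(ce.comm, t->comm, sizeof(ce.comm) - 1);
	emit(&fe, sizeof(fe));
	emit(&ce, sizeof(ce));
}

int main(void)
{
	/* hypothetical usage with one fake task instead of a real dump */
	struct task_rec t = { .tgid = 1, .pid = 1, .ppid = 0, .comm = "init" };

	synthesize_task(&t);
	return 0;
}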

- Arnaldo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-27 20:43 ` Arnaldo Carvalho de Melo
@ 2015-02-27 20:54   ` David Ahern
  2015-02-27 21:50     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 41+ messages in thread
From: David Ahern @ 2015-02-27 20:54 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Pavel Odintsov; +Cc: linux-kernel

On 2/27/15 1:43 PM, Arnaldo Carvalho de Melo wrote:

>  From the subject line, there is a patchkit, but I couldn't find it... Can
> you resend it to me or point me to some URL where I can get it?

https://lkml.org/lkml/2015/2/17/64


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
  2015-02-19 13:00 Pavel Odintsov
@ 2015-02-27 20:43 ` Arnaldo Carvalho de Melo
  2015-02-27 20:54   ` David Ahern
  0 siblings, 1 reply; 41+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-02-27 20:43 UTC (permalink / raw)
  To: Pavel Odintsov; +Cc: linux-kernel, David Ahern

On Thu, Feb 19, 2015 at 05:00:12PM +0400, Pavel Odintsov wrote:
> Hello!
> 
> In addition to my post I want to mention another issue related to
> slow /proc reads in the perf toolkit. On my server with 25,000 processes
> I need about 15 minutes for perf top to load completely.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=86991

Right, one way would be, in the 'perf top' case, to defer getting
thread information until we need it, i.e. when we get a sample for a
pid that we have no struct thread associated with.

We would speed up 'perf top' startup but would introduce jitter down the
line, and would be open to races, but hey, we already are, using /proc
:-/

But that would not work for 'perf record', as we need to generate those
records in advance, since we don't do any processing of samples...

Yeah, for preexisting threads we have had a problem since day one; what
we use is just what can be done with existing stuff.

I saw that there were some more messages in this thread; it's just that I
hadn't found them in my mailbox when David Ahern pointed this discussion
out to me :-\

From the subject line, there is a patchkit, but I couldn't find it... Can
you resend it to me or point me to some URL where I can get it?

- Arnaldo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-19 13:00 Pavel Odintsov
  2015-02-27 20:43 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 41+ messages in thread
From: Pavel Odintsov @ 2015-02-19 13:00 UTC (permalink / raw)
  To: linux-kernel

Hello!

In addition to my post I want to mention another issue related to
slow /proc reads in the perf toolkit. On my server with 25,000 processes
I need about 15 minutes for perf top to load completely.

https://bugzilla.kernel.org/show_bug.cgi?id=86991


-- 
Sincerely yours, Pavel Odintsov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-19 12:50 Pavel Odintsov
  0 siblings, 0 replies; 41+ messages in thread
From: Pavel Odintsov @ 2015-02-19 12:50 UTC (permalink / raw)
  To: linux-kernel

Hello, folks!

These are very useful patches, and they can make my tasks simpler and faster.

In my day-to-day work I work with Linux servers with an enormous
number of processes (~25,000 per server). These servers run many
hundreds of Linux containers.

If I want to analyze processor load or network load, or check something
else, I use top/atop/htop/netstat. But they work very slowly and consume
a significant amount of CPU power parsing many thousands of text files
in /proc (like /proc/tcp, /proc/udp, /proc/status, /proc/$pid/status).

Some time ago I worked on a malware detection toolkit for Linux -
Antidoto (https://github.com/FastVPSEestiOu/Antidoto) - which uses the
/proc filesystem very heavily. For detecting malware I need to check
every descriptor and every socket, and to get complete information about
all processes on the system.

But with the current text-file-based architecture of /proc I can't
achieve a suitable speed for my toolkit.

For example, here is the time to process all network connections on a
server with 20244 processes with linux_network_activity_tracker.pl
(https://github.com/FastVPSEestiOu/Antidoto/blob/master/linux_network_activity_tracker.pl):

real 1m26.637s
user 0m23.945s
sys 0m43.978s

As you can see, this time is huge even though I use the latest CPUs from
Intel (Xeon 2697v3).

I have multiple ideas about complete realtime Linux server monitoring,
but without the ability to pull information from the Linux kernel faster
I can't realize them.

-- 
Sincerely yours, Pavel Odintsov

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2015-02-27 21:50 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-17  8:20 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Andrey Vagin
2015-02-17  8:20 ` [PATCH 1/7] kernel: add a netlink interface to get information about tasks Andrey Vagin
2015-02-17  8:20   ` Andrey Vagin
2015-02-17  8:20 ` [PATCH 2/7] kernel: move next_tgid from fs/proc Andrey Vagin
2015-02-17  8:20 ` [PATCH 3/7] task-diag: add ability to get information about all tasks Andrey Vagin
2015-02-17  8:20   ` Andrey Vagin
2015-02-17  8:20 ` [PATCH 4/7] task-diag: add a new group to get process credentials Andrey Vagin
2015-02-17  8:20   ` Andrey Vagin
2015-02-17  8:20 ` [PATCH 5/7] kernel: add ability to iterate children of a specified task Andrey Vagin
2015-02-17  8:20 ` [PATCH 6/7] task_diag: add ability to dump children Andrey Vagin
2015-02-17  8:20 ` [PATCH 7/7] selftest: check the task_diag functinonality Andrey Vagin
2015-02-17  8:53 ` [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Arnd Bergmann
2015-02-17  8:53   ` Arnd Bergmann
2015-02-17 21:33   ` Andrew Vagin
2015-02-17 21:33     ` Andrew Vagin
2015-02-18 11:06     ` Arnd Bergmann
2015-02-18 11:06       ` Arnd Bergmann
2015-02-18 12:42       ` Andrew Vagin
2015-02-18 12:42         ` Andrew Vagin
2015-02-18 14:46         ` Arnd Bergmann
2015-02-18 14:46           ` Arnd Bergmann
2015-02-19 14:04           ` Andrew Vagin
2015-02-19 14:04             ` Andrew Vagin
2015-02-17 16:09 ` David Ahern
2015-02-17 16:09   ` David Ahern
2015-02-17 20:32   ` Andrew Vagin
2015-02-17 20:32     ` Andrew Vagin
2015-02-17 19:05 ` Andy Lutomirski
2015-02-18 14:27   ` Andrew Vagin
2015-02-18 14:27     ` Andrew Vagin
2015-02-19  1:18     ` Andy Lutomirski
2015-02-19  1:18       ` Andy Lutomirski
2015-02-19 21:39       ` Andrew Vagin
2015-02-19 21:39         ` Andrew Vagin
2015-02-20 20:33         ` Andy Lutomirski
2015-02-20 20:33           ` Andy Lutomirski
2015-02-19 12:50 Pavel Odintsov
2015-02-19 13:00 Pavel Odintsov
2015-02-27 20:43 ` Arnaldo Carvalho de Melo
2015-02-27 20:54   ` David Ahern
2015-02-27 21:50     ` Arnaldo Carvalho de Melo
