* [PATCH bpf-next 0/5] bpf: add main_thread_only customization for task/task_file iterators
@ 2020-08-27  0:06 Yonghong Song
  2020-08-27  0:06 ` [PATCH bpf-next 1/5] bpf: make bpf_link_info.iter similar to bpf_iter_link_info Yonghong Song
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Yonghong Song @ 2020-08-27  0:06 UTC
  To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Commit e679654a704e ("bpf: Fix a rcu_sched stall issue with
bpf task/task_file iterator") introduced rate limiting in
bpf_seq_read() to fix a case where traversing too many tasks
and files (tens of millions of files) may cause a kernel rcu
stall. But rate limiting won't reduce the amount of work needed
to traverse all these files.

In practice, all threads belonging to a user process typically
share the same file table, so there is no need to visit every
thread for its files.

This patch set implements a customization for task/task_file
iterators to traverse only threads with task->tgid == task->pid,
which includes some kernel threads and the main threads of user
processes. Skipping this unnecessary work makes the iterator run
much faster when a process has many non-main threads and open
files.
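
From user space, opting into the new mode looks roughly like the
following sketch (based on the selftest in patch #5; the `prog'
pointer is assumed to come from an already-loaded bpf object):

  union bpf_iter_link_info linfo;
  DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  struct bpf_link *link;

  memset(&linfo, 0, sizeof(linfo));
  linfo.task.main_thread_only = true;  /* new bit added in patch #2 */
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);
  link = bpf_program__attach_iter(prog, &opts);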

Patch #1 fixes a uapi issue for bpf_link_info.iter.
Patch #2 implements the main_thread_only customization for
task/task_file iterators.
Patch #3 adds link_query support for the new customization.
Patch #4 adds bpftool support, and Patch #5 adds a selftest.

Yonghong Song (5):
  bpf: make bpf_link_info.iter similar to bpf_iter_link_info
  bpf: add main_thread_only customization for task/task_file iterators
  bpf: add link_query support for newly added main_thread_only info
  bpftool: support optional 'task main_thread_only' argument
  selftests/bpf: test task_file iterator with main_thread_only

 include/linux/bpf.h                           |  3 +-
 include/uapi/linux/bpf.h                      | 16 ++++-
 kernel/bpf/task_iter.c                        | 63 ++++++++++++++-----
 .../bpftool/Documentation/bpftool-iter.rst    | 17 ++++-
 tools/bpf/bpftool/bash-completion/bpftool     |  9 ++-
 tools/bpf/bpftool/iter.c                      | 28 +++++++--
 tools/bpf/bpftool/link.c                      | 12 ++++
 tools/include/uapi/linux/bpf.h                | 16 ++++-
 .../selftests/bpf/prog_tests/bpf_iter.c       | 50 +++++++++++----
 .../selftests/bpf/progs/bpf_iter_task_file.c  |  9 ++-
 10 files changed, 183 insertions(+), 40 deletions(-)

-- 
2.24.1



* [PATCH bpf-next 1/5] bpf: make bpf_link_info.iter similar to bpf_iter_link_info
  2020-08-27  0:06 [PATCH bpf-next 0/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
@ 2020-08-27  0:06 ` Yonghong Song
  2020-08-27  4:50   ` Andrii Nakryiko
  2020-08-27  0:06 ` [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2020-08-27  0:06 UTC
  To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

bpf_link_info.iter is used by link_query to return
bpf_iter_link_info to user space. Fields may be different,
e.g., map_fd vs. map_id, so we cannot reuse
the exact structure. But we can make them similar, e.g.,
  struct bpf_link_info {
     /* common fields */
     union {
	struct { ... } raw_tracepoint;
	struct { ... } tracing;
	...
	struct {
	    /* common fields for iter */
	    union {
		struct {
		    __u32 map_id;
		} map;
		/* other structs for other targets */
	    };
	};
    };
 };
so the structure is extensible the same way as
bpf_iter_link_info.

Fixes: 6b0a249a301e ("bpf: Implement link_query for bpf iterators")
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/bpf.h       | 6 ++++--
 tools/include/uapi/linux/bpf.h | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0388bc0200b0..ef7af384f5ee 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4251,8 +4251,10 @@ struct bpf_link_info {
 			__aligned_u64 target_name; /* in/out: target_name buffer ptr */
 			__u32 target_name_len;	   /* in/out: target_name buffer len */
 			union {
-				__u32 map_id;
-			} map;
+				struct {
+					__u32 map_id;
+				} map;
+			};
 		} iter;
 		struct  {
 			__u32 netns_ino;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0388bc0200b0..ef7af384f5ee 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4251,8 +4251,10 @@ struct bpf_link_info {
 			__aligned_u64 target_name; /* in/out: target_name buffer ptr */
 			__u32 target_name_len;	   /* in/out: target_name buffer len */
 			union {
-				__u32 map_id;
-			} map;
+				struct {
+					__u32 map_id;
+				} map;
+			};
 		} iter;
 		struct  {
 			__u32 netns_ino;
-- 
2.24.1



* [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators
  2020-08-27  0:06 [PATCH bpf-next 0/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
  2020-08-27  0:06 ` [PATCH bpf-next 1/5] bpf: make bpf_link_info.iter similar to bpf_iter_link_info Yonghong Song
@ 2020-08-27  0:06 ` Yonghong Song
  2020-08-27  5:07   ` Andrii Nakryiko
  2020-08-27  0:06 ` [PATCH bpf-next 3/5] bpf: add link_query support for newly added main_thread_only info Yonghong Song
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2020-08-27  0:06 UTC
  To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Currently, the task and task_file iterators by default iterate
through all tasks. For task_file, by default, all files from all
tasks will be traversed.

But for a user process, the file_table is shared by all threads
of that process. So traversing only the main thread of each process
should be enough to reach all files, and this can save a lot of cpu
time if a process has a large number of threads and each thread
has lots of open files.

This patch implements a customization for task/task_file iterators,
permitting traversal of only those kernel tasks whose pid equals
tgid. This includes some kernel threads and the main threads of
user processes, solving the potential performance issue above for
task_file. This customization may be useful for the task iterator
too if traversing only main threads is enough.
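
Without this customization, the only way to skip non-main threads
is to filter inside the bpf program itself, which still pays the
full in-kernel traversal cost. A sketch of such in-program
filtering, in the style of the selftest program in patch #5:

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
          struct task_struct *task = ctx->task;

          /* in-program filtering: the kernel still walks every
           * thread and every file before we can skip them here
           */
          if (task == (void *)0 || task->tgid != task->pid)
                  return 0;

          /* ... process main-thread files ... */
          return 0;
  }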

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |  3 ++-
 include/uapi/linux/bpf.h       |  5 ++++
 kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
 tools/include/uapi/linux/bpf.h |  5 ++++
 4 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a6131d95e31e..058eb9b0ba78 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 	int __init bpf_iter_ ## target(args) { return 0; }
 
 struct bpf_iter_aux_info {
-	struct bpf_map *map;
+	struct bpf_map *map;	/* for iterator traversing map elements */
+	bool main_thread_only;	/* for task/task_file iterator */
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef7af384f5ee..af5c600bf673 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -85,6 +85,11 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+
+	struct {
+		__u32	main_thread_only:1;
+		__u32	:31;
+	} task;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for details. */
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 232df29793e9..362bf2dda63a 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -11,19 +11,22 @@
 
 struct bpf_iter_seq_task_common {
 	struct pid_namespace *ns;
+	bool main_thread_only;
 };
 
 struct bpf_iter_seq_task_info {
 	/* The first field must be struct bpf_iter_seq_task_common.
-	 * this is assumed by {init, fini}_seq_pidns() callback functions.
+	 * this is assumed by {init, fini}_seq_task_common() callback functions.
 	 */
 	struct bpf_iter_seq_task_common common;
 	u32 tid;
 };
 
-static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
-					     u32 *tid)
+static struct task_struct *task_seq_get_next(
+	struct bpf_iter_seq_task_common *task_common, u32 *tid)
 {
+	bool main_thread_only = task_common->main_thread_only;
+	struct pid_namespace *ns = task_common->ns;
 	struct task_struct *task = NULL;
 	struct pid *pid;
 
@@ -31,7 +34,10 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
 retry:
 	pid = idr_get_next(&ns->idr, tid);
 	if (pid) {
-		task = get_pid_task(pid, PIDTYPE_PID);
+		if (main_thread_only)
+			task = get_pid_task(pid, PIDTYPE_TGID);
+		else
+			task = get_pid_task(pid, PIDTYPE_PID);
 		if (!task) {
 			++*tid;
 			goto retry;
@@ -47,7 +53,7 @@ static void *task_seq_start(struct seq_file *seq, loff_t *pos)
 	struct bpf_iter_seq_task_info *info = seq->private;
 	struct task_struct *task;
 
-	task = task_seq_get_next(info->common.ns, &info->tid);
+	task = task_seq_get_next(&info->common, &info->tid);
 	if (!task)
 		return NULL;
 
@@ -64,7 +70,7 @@ static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	++*pos;
 	++info->tid;
 	put_task_struct((struct task_struct *)v);
-	task = task_seq_get_next(info->common.ns, &info->tid);
+	task = task_seq_get_next(&info->common, &info->tid);
 	if (!task)
 		return NULL;
 
@@ -118,7 +124,7 @@ static const struct seq_operations task_seq_ops = {
 
 struct bpf_iter_seq_task_file_info {
 	/* The first field must be struct bpf_iter_seq_task_common.
-	 * this is assumed by {init, fini}_seq_pidns() callback functions.
+	 * this is assumed by {init, fini}_seq_task_common() callback functions.
 	 */
 	struct bpf_iter_seq_task_common common;
 	struct task_struct *task;
@@ -131,7 +137,6 @@ static struct file *
 task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 		       struct task_struct **task, struct files_struct **fstruct)
 {
-	struct pid_namespace *ns = info->common.ns;
 	u32 curr_tid = info->tid, max_fds;
 	struct files_struct *curr_files;
 	struct task_struct *curr_task;
@@ -147,7 +152,7 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 		curr_files = *fstruct;
 		curr_fd = info->fd;
 	} else {
-		curr_task = task_seq_get_next(ns, &curr_tid);
+		curr_task = task_seq_get_next(&info->common, &curr_tid);
 		if (!curr_task)
 			return NULL;
 
@@ -293,15 +298,16 @@ static void task_file_seq_stop(struct seq_file *seq, void *v)
 	}
 }
 
-static int init_seq_pidns(void *priv_data, struct bpf_iter_aux_info *aux)
+static int init_seq_task_common(void *priv_data, struct bpf_iter_aux_info *aux)
 {
 	struct bpf_iter_seq_task_common *common = priv_data;
 
 	common->ns = get_pid_ns(task_active_pid_ns(current));
+	common->main_thread_only = aux->main_thread_only;
 	return 0;
 }
 
-static void fini_seq_pidns(void *priv_data)
+static void fini_seq_task_common(void *priv_data)
 {
 	struct bpf_iter_seq_task_common *common = priv_data;
 
@@ -315,19 +321,28 @@ static const struct seq_operations task_file_seq_ops = {
 	.show	= task_file_seq_show,
 };
 
+static int bpf_iter_attach_task(struct bpf_prog *prog,
+				union bpf_iter_link_info *linfo,
+				struct bpf_iter_aux_info *aux)
+{
+	aux->main_thread_only = linfo->task.main_thread_only;
+	return 0;
+}
+
 BTF_ID_LIST(btf_task_file_ids)
 BTF_ID(struct, task_struct)
 BTF_ID(struct, file)
 
 static const struct bpf_iter_seq_info task_seq_info = {
 	.seq_ops		= &task_seq_ops,
-	.init_seq_private	= init_seq_pidns,
-	.fini_seq_private	= fini_seq_pidns,
+	.init_seq_private	= init_seq_task_common,
+	.fini_seq_private	= fini_seq_task_common,
 	.seq_priv_size		= sizeof(struct bpf_iter_seq_task_info),
 };
 
 static struct bpf_iter_reg task_reg_info = {
 	.target			= "task",
+	.attach_target		= bpf_iter_attach_task,
 	.ctx_arg_info_size	= 1,
 	.ctx_arg_info		= {
 		{ offsetof(struct bpf_iter__task, task),
@@ -338,13 +353,14 @@ static struct bpf_iter_reg task_reg_info = {
 
 static const struct bpf_iter_seq_info task_file_seq_info = {
 	.seq_ops		= &task_file_seq_ops,
-	.init_seq_private	= init_seq_pidns,
-	.fini_seq_private	= fini_seq_pidns,
+	.init_seq_private	= init_seq_task_common,
+	.fini_seq_private	= fini_seq_task_common,
 	.seq_priv_size		= sizeof(struct bpf_iter_seq_task_file_info),
 };
 
 static struct bpf_iter_reg task_file_reg_info = {
 	.target			= "task_file",
+	.attach_target		= bpf_iter_attach_task,
 	.ctx_arg_info_size	= 2,
 	.ctx_arg_info		= {
 		{ offsetof(struct bpf_iter__task_file, task),
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ef7af384f5ee..af5c600bf673 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -85,6 +85,11 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+
+	struct {
+		__u32	main_thread_only:1;
+		__u32	:31;
+	} task;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for details. */
-- 
2.24.1



* [PATCH bpf-next 3/5] bpf: add link_query support for newly added main_thread_only info
  2020-08-27  0:06 [PATCH bpf-next 0/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
  2020-08-27  0:06 ` [PATCH bpf-next 1/5] bpf: make bpf_link_info.iter similar to bpf_iter_link_info Yonghong Song
  2020-08-27  0:06 ` [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
@ 2020-08-27  0:06 ` Yonghong Song
  2020-08-27  5:14   ` Andrii Nakryiko
  2020-08-27  0:06 ` [PATCH bpf-next 4/5] bpftool: support optional 'task main_thread_only' argument Yonghong Song
  2020-08-27  0:06 ` [PATCH bpf-next 5/5] selftests/bpf: test task_file iterator with main_thread_only Yonghong Song
  4 siblings, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2020-08-27  0:06 UTC
  To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Added link_query support for the main_thread_only information
of task/task_file iterators.
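
User space can read the flag back through the generic object info
call; roughly (a sketch; link_fd is assumed to be a fd referring
to the attached iter link):

  struct bpf_link_info info;
  __u32 len = sizeof(info);

  memset(&info, 0, sizeof(info));
  if (!bpf_obj_get_info_by_fd(link_fd, &info, &len))
          printf("main_thread_only: %u\n",
                 info.iter.task.main_thread_only);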

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/bpf.h       |  5 +++++
 kernel/bpf/task_iter.c         | 17 +++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++++
 3 files changed, 27 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index af5c600bf673..595bdc4c9431 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4259,6 +4259,11 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+
+				struct {
+					__u32 main_thread_only:1;
+					__u32 :31;
+				} task;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 362bf2dda63a..7636abe05f27 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -329,6 +329,19 @@ static int bpf_iter_attach_task(struct bpf_prog *prog,
 	return 0;
 }
 
+static void bpf_iter_task_show_fdinfo(const struct bpf_iter_aux_info *aux,
+				      struct seq_file *seq)
+{
+	seq_printf(seq, "main_thread_only:\t%u\n", aux->main_thread_only);
+}
+
+static int bpf_iter_task_fill_link_info(const struct bpf_iter_aux_info *aux,
+					struct bpf_link_info *info)
+{
+	info->iter.task.main_thread_only = aux->main_thread_only;
+	return 0;
+}
+
 BTF_ID_LIST(btf_task_file_ids)
 BTF_ID(struct, task_struct)
 BTF_ID(struct, file)
@@ -343,6 +356,8 @@ static const struct bpf_iter_seq_info task_seq_info = {
 static struct bpf_iter_reg task_reg_info = {
 	.target			= "task",
 	.attach_target		= bpf_iter_attach_task,
+	.show_fdinfo		= bpf_iter_task_show_fdinfo,
+	.fill_link_info		= bpf_iter_task_fill_link_info,
 	.ctx_arg_info_size	= 1,
 	.ctx_arg_info		= {
 		{ offsetof(struct bpf_iter__task, task),
@@ -361,6 +376,8 @@ static const struct bpf_iter_seq_info task_file_seq_info = {
 static struct bpf_iter_reg task_file_reg_info = {
 	.target			= "task_file",
 	.attach_target		= bpf_iter_attach_task,
+	.show_fdinfo		= bpf_iter_task_show_fdinfo,
+	.fill_link_info		= bpf_iter_task_fill_link_info,
 	.ctx_arg_info_size	= 2,
 	.ctx_arg_info		= {
 		{ offsetof(struct bpf_iter__task_file, task),
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af5c600bf673..595bdc4c9431 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4259,6 +4259,11 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+
+				struct {
+					__u32 main_thread_only:1;
+					__u32 :31;
+				} task;
 			};
 		} iter;
 		struct  {
-- 
2.24.1



* [PATCH bpf-next 4/5] bpftool: support optional 'task main_thread_only' argument
  2020-08-27  0:06 [PATCH bpf-next 0/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
                   ` (2 preceding siblings ...)
  2020-08-27  0:06 ` [PATCH bpf-next 3/5] bpf: add link_query support for newly added main_thread_only info Yonghong Song
@ 2020-08-27  0:06 ` Yonghong Song
  2020-08-27  0:06 ` [PATCH bpf-next 5/5] selftests/bpf: test task_file iterator with main_thread_only Yonghong Song
  4 siblings, 0 replies; 12+ messages in thread
From: Yonghong Song @ 2020-08-27  0:06 UTC
  To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

For task and task_file bpf iterators, the optional 'task
main_thread_only' argument signals the kernel to iterate only
through the main thread of each process. link_query will also
print the main_thread_only value for task/task_file iterators.

This patch also fixes argument handling: if additional arguments
are not supported, bpftool now prints an error message and exits.

  $ ./bpftool iter pin ./bpf_iter_task.o /sys/fs/bpf/p1 task main_thread_only
  $ ./bpftool iter pin ./bpf_iter_task_file.o /sys/fs/bpf/p2 task main_thread_only
  $ ./bpftool iter pin ./bpf_iter_task_file.o /sys/fs/bpf/p3
  $ ./bpftool link show
  1: iter  prog 6  target_name bpf_map
  2: iter  prog 7  target_name bpf_prog
  3: iter  prog 12  target_name task  main_thread_only 1
  5: iter  prog 23  target_name task_file  main_thread_only 1
  6: iter  prog 28  target_name task_file  main_thread_only 0

  $ cat /sys/fs/bpf/p2
    tgid      gid       fd      file
  ...
    1716     1716      255 ffffffffa2e95ec0
    1756     1756        0 ffffffffa2e95ec0
    1756     1756        1 ffffffffa2e95ec0
    1756     1756        2 ffffffffa2e95ec0
    1756     1756        3 ffffffffa2e20a80
    1756     1756        4 ffffffffa2e19ba0
    1756     1756        5 ffffffffa2e16460
    1756     1756        6 ffffffffa2e16460
    1756     1756        7 ffffffffa2e16460
    1756     1756        8 ffffffffa2e16260
    1761     1761        0 ffffffffa2e95ec0
  ...
  $ ls /proc/1756/task/
    1756  1757  1758  1759  1760

  In the above task_file iterator output, the process with id 1756 has
  5 threads, and only the thread with pid = 1756 is processed by the
  bpf program.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../bpftool/Documentation/bpftool-iter.rst    | 17 +++++++++--
 tools/bpf/bpftool/bash-completion/bpftool     |  9 +++++-
 tools/bpf/bpftool/iter.c                      | 28 ++++++++++++++++---
 tools/bpf/bpftool/link.c                      | 12 ++++++++
 4 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-iter.rst b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
index 070ffacb42b5..d9aac12c76da 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-iter.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
@@ -17,15 +17,16 @@ SYNOPSIS
 ITER COMMANDS
 ===================
 
-|	**bpftool** **iter pin** *OBJ* *PATH* [**map** *MAP*]
+|	**bpftool** **iter pin** *OBJ* *PATH* [**map** *MAP* | **task** *TASK_OPT*]
 |	**bpftool** **iter help**
 |
 |	*OBJ* := /a/file/of/bpf_iter_target.o
 |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
+|	*TASK_OPT* := { **main_thread_only** }
 
 DESCRIPTION
 ===========
-	**bpftool iter pin** *OBJ* *PATH* [**map** *MAP*]
+	**bpftool iter pin** *OBJ* *PATH* [**map** *MAP* | **task** *TASK_OPT*]
 		  A bpf iterator combines a kernel iterating of
 		  particular kernel data (e.g., tasks, bpf_maps, etc.)
 		  and a bpf program called for each kernel data object
@@ -44,6 +45,11 @@ DESCRIPTION
 		  with each map element, do checking, filtering, aggregation,
 		  etc. without copying data to user space.
 
+		  The task or task_file bpf iterator can have an optional
+		  parameter *TASK_OPT*. The currently supported value is
+		  **main_thread_only**, which restricts iteration to the
+		  main threads of each process.
+
 		  User can then *cat PATH* to see the bpf iterator output.
 
 	**bpftool iter help**
@@ -78,6 +84,13 @@ EXAMPLES
    Create a file-based bpf iterator from bpf_iter_hashmap.o and map with
    id 20, and pin it to /sys/fs/bpf/my_hashmap
 
+**# bpftool iter pin bpf_iter_task.o /sys/fs/bpf/my_task task main_thread_only**
+
+::
+
+   Create a file-based bpf iterator from bpf_iter_task.o which iterates main
+   threads of processes only, and pin it to /sys/fs/bpf/my_task
+
 SEE ALSO
 ========
 	**bpf**\ (2),
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 7b68e3c0a5fb..84d538de71e1 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -613,6 +613,7 @@ _bpftool()
             esac
             ;;
         iter)
+            local TARGET_TYPE='map task'
             case $command in
                 pin)
                     case $prev in
@@ -628,9 +629,15 @@ _bpftool()
                         pinned)
                             _filedir
                             ;;
-                        *)
+                        task)
+                            _bpftool_one_of_list 'main_thread_only'
+                            ;;
+                        map)
                             _bpftool_one_of_list $MAP_TYPE
                             ;;
+                        *)
+                            _bpftool_one_of_list $TARGET_TYPE
+                            ;;
                     esac
                     return 0
                     ;;
diff --git a/tools/bpf/bpftool/iter.c b/tools/bpf/bpftool/iter.c
index 3b1aad7535dd..a4c789ea43f1 100644
--- a/tools/bpf/bpftool/iter.c
+++ b/tools/bpf/bpftool/iter.c
@@ -26,6 +26,7 @@ static int do_pin(int argc, char **argv)
 
 	/* optional arguments */
 	if (argc) {
+		memset(&linfo, 0, sizeof(linfo));
 		if (is_prefix(*argv, "map")) {
 			NEXT_ARG();
 
@@ -38,11 +39,29 @@ static int do_pin(int argc, char **argv)
 			if (map_fd < 0)
 				return -1;
 
-			memset(&linfo, 0, sizeof(linfo));
 			linfo.map.map_fd = map_fd;
-			iter_opts.link_info = &linfo;
-			iter_opts.link_info_len = sizeof(linfo);
+		} else if (is_prefix(*argv, "task")) {
+			NEXT_ARG();
+
+			if (!REQ_ARGS(1)) {
+				p_err("incorrect task spec");
+				return -1;
+			}
+
+			if (strcmp(*argv, "main_thread_only") != 0) {
+				p_err("incorrect task spec");
+				return -1;
+			}
+
+			linfo.task.main_thread_only = true;
+		} else {
+			p_err("expected no more arguments, 'map' or 'task', got: '%s'?",
+			      *argv);
+			return -1;
 		}
+
+		iter_opts.link_info = &linfo;
+		iter_opts.link_info_len = sizeof(linfo);
 	}
 
 	obj = bpf_object__open(objfile);
@@ -95,9 +114,10 @@ static int do_pin(int argc, char **argv)
 static int do_help(int argc, char **argv)
 {
 	fprintf(stderr,
-		"Usage: %1$s %2$s pin OBJ PATH [map MAP]\n"
+		"Usage: %1$s %2$s pin OBJ PATH [map MAP | task TASK_OPT]\n"
 		"       %1$s %2$s help\n"
 		"       " HELP_SPEC_MAP "\n"
+		"       TASK_OPT := { main_thread_only }\n"
 		"",
 		bin_name, "iter");
 
diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index e77e1525d20a..a159d5680c74 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -83,6 +83,12 @@ static bool is_iter_map_target(const char *target_name)
 	       strcmp(target_name, "bpf_sk_storage_map") == 0;
 }
 
+static bool is_iter_task_target(const char *target_name)
+{
+	return strcmp(target_name, "task") == 0 ||
+	       strcmp(target_name, "task_file") == 0;
+}
+
 static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
 {
 	const char *target_name = u64_to_ptr(info->iter.target_name);
@@ -91,6 +97,9 @@ static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
 
 	if (is_iter_map_target(target_name))
 		jsonw_uint_field(wtr, "map_id", info->iter.map.map_id);
+	else if (is_iter_task_target(target_name))
+		jsonw_uint_field(wtr, "main_thread_only",
+				 info->iter.task.main_thread_only);
 }
 
 static int get_prog_info(int prog_id, struct bpf_prog_info *info)
@@ -202,6 +211,9 @@ static void show_iter_plain(struct bpf_link_info *info)
 
 	if (is_iter_map_target(target_name))
 		printf("map_id %u  ", info->iter.map.map_id);
+	else if (is_iter_task_target(target_name))
+		printf("main_thread_only %u  ",
+		       info->iter.task.main_thread_only);
 }
 
 static int show_link_close_plain(int fd, struct bpf_link_info *info)
-- 
2.24.1



* [PATCH bpf-next 5/5] selftests/bpf: test task_file iterator with main_thread_only
  2020-08-27  0:06 [PATCH bpf-next 0/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
                   ` (3 preceding siblings ...)
  2020-08-27  0:06 ` [PATCH bpf-next 4/5] bpftool: support optional 'task main_thread_only' argument Yonghong Song
@ 2020-08-27  0:06 ` Yonghong Song
  4 siblings, 0 replies; 12+ messages in thread
From: Yonghong Song @ 2020-08-27  0:06 UTC
  To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Modified the existing bpf_iter_task_file program to check
whether all accessed files are from the main thread or not.
  $ ./test_progs -n 4
  ...
  #4/7 task_file:OK
  ...
  #4 bpf_iter:OK
  Summary: 1/24 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 50 ++++++++++++++-----
 .../selftests/bpf/progs/bpf_iter_task_file.c  |  9 +++-
 2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 7375d9a6d242..235580801372 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -37,13 +37,14 @@ static void test_btf_id_or_null(void)
 	}
 }
 
-static void do_dummy_read(struct bpf_program *prog)
+static void do_dummy_read(struct bpf_program *prog,
+			  struct bpf_iter_attach_opts *opts)
 {
 	struct bpf_link *link;
 	char buf[16] = {};
 	int iter_fd, len;
 
-	link = bpf_program__attach_iter(prog, NULL);
+	link = bpf_program__attach_iter(prog, opts);
 	if (CHECK(IS_ERR(link), "attach_iter", "attach_iter failed\n"))
 		return;
 
@@ -71,7 +72,7 @@ static void test_ipv6_route(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_ipv6_route);
+	do_dummy_read(skel->progs.dump_ipv6_route, NULL);
 
 	bpf_iter_ipv6_route__destroy(skel);
 }
@@ -85,7 +86,7 @@ static void test_netlink(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_netlink);
+	do_dummy_read(skel->progs.dump_netlink, NULL);
 
 	bpf_iter_netlink__destroy(skel);
 }
@@ -99,7 +100,7 @@ static void test_bpf_map(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_bpf_map);
+	do_dummy_read(skel->progs.dump_bpf_map, NULL);
 
 	bpf_iter_bpf_map__destroy(skel);
 }
@@ -113,7 +114,7 @@ static void test_task(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_task);
+	do_dummy_read(skel->progs.dump_task, NULL);
 
 	bpf_iter_task__destroy(skel);
 }
@@ -127,22 +128,47 @@ static void test_task_stack(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_task_stack);
+	do_dummy_read(skel->progs.dump_task_stack, NULL);
 
 	bpf_iter_task_stack__destroy(skel);
 }
 
+static void *do_nothing(void *arg)
+{
+	pthread_exit(arg);
+}
+
 static void test_task_file(void)
 {
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
 	struct bpf_iter_task_file *skel;
+	union bpf_iter_link_info linfo;
+	pthread_t thread_id;
+	void *ret;
 
 	skel = bpf_iter_task_file__open_and_load();
 	if (CHECK(!skel, "bpf_iter_task_file__open_and_load",
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_task_file);
+	if (CHECK(pthread_create(&thread_id, NULL, &do_nothing, NULL),
+		  "pthread_create", "pthread_create failed\n"))
+		goto done;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.task.main_thread_only = true;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	do_dummy_read(skel->progs.dump_task_file, &opts);
+
+	if (CHECK(pthread_join(thread_id, &ret) || ret != NULL,
+		  "pthread_join", "pthread_join failed\n"))
+		goto done;
+
+	CHECK(skel->bss->count != 0, "",
+	      "invalid non main-thread file visit %d\n", skel->bss->count);
 
+done:
 	bpf_iter_task_file__destroy(skel);
 }
 
@@ -155,7 +181,7 @@ static void test_tcp4(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_tcp4);
+	do_dummy_read(skel->progs.dump_tcp4, NULL);
 
 	bpf_iter_tcp4__destroy(skel);
 }
@@ -169,7 +195,7 @@ static void test_tcp6(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_tcp6);
+	do_dummy_read(skel->progs.dump_tcp6, NULL);
 
 	bpf_iter_tcp6__destroy(skel);
 }
@@ -183,7 +209,7 @@ static void test_udp4(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_udp4);
+	do_dummy_read(skel->progs.dump_udp4, NULL);
 
 	bpf_iter_udp4__destroy(skel);
 }
@@ -197,7 +223,7 @@ static void test_udp6(void)
 		  "skeleton open_and_load failed\n"))
 		return;
 
-	do_dummy_read(skel->progs.dump_udp6);
+	do_dummy_read(skel->progs.dump_udp6, NULL);
 
 	bpf_iter_udp6__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
index 8b787baa2654..880f6c9db8fb 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
@@ -6,6 +6,8 @@
 
 char _license[] SEC("license") = "GPL";
 
+int count = 0;
+
 SEC("iter/task_file")
 int dump_task_file(struct bpf_iter__task_file *ctx)
 {
@@ -17,8 +19,13 @@ int dump_task_file(struct bpf_iter__task_file *ctx)
 	if (task == (void *)0 || file == (void *)0)
 		return 0;
 
-	if (ctx->meta->seq_num == 0)
+	if (ctx->meta->seq_num == 0) {
+		count = 0; /* reset the counter for this seq_file session */
 		BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
+	}
+
+	if (task->tgid != task->pid)
+		count++;
 
 	BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
 		       (long)file->f_op);
-- 
2.24.1



* Re: [PATCH bpf-next 1/5] bpf: make bpf_link_info.iter similar to bpf_iter_link_info
  2020-08-27  0:06 ` [PATCH bpf-next 1/5] bpf: make bpf_link_info.iter similar to bpf_iter_link_info Yonghong Song
@ 2020-08-27  4:50   ` Andrii Nakryiko
  0 siblings, 0 replies; 12+ messages in thread
From: Andrii Nakryiko @ 2020-08-27  4:50 UTC
  To: Yonghong Song
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@fb.com> wrote:
>
> bpf_link_info.iter is used by link_query to return
> bpf_iter_link_info to user space. Fields may be different,
> e.g., map_fd vs. map_id, so we cannot reuse
> the exact structure. But we can make them similar, e.g.,
>   struct bpf_link_info {
>      /* common fields */
>      union {
>         struct { ... } raw_tracepoint;
>         struct { ... } tracing;
>         ...
>         struct {
>             /* common fields for iter */
>             union {
>                 struct {
>                     __u32 map_id;
>                 } map;
>                 /* other structs for other targets */
>             };
>         };
>     };
>  };
> so the structure is extensible the same way as
> bpf_iter_link_info.
>
> Fixes: 6b0a249a301e ("bpf: Implement link_query for bpf iterators")
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

I like this change, thanks!

Acked-by: Andrii Nakryiko <andriin@fb.com>

>  include/uapi/linux/bpf.h       | 6 ++++--
>  tools/include/uapi/linux/bpf.h | 6 ++++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
>

[...]


* Re: [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators
  2020-08-27  0:06 ` [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators Yonghong Song
@ 2020-08-27  5:07   ` Andrii Nakryiko
  2020-08-27 18:09     ` Yonghong Song
  0 siblings, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2020-08-27  5:07 UTC
  To: Yonghong Song
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@fb.com> wrote:
>
> Currently, the task and task_file iterators by default iterate
> through all tasks. For task_file, by default, all files from all
> tasks will be traversed.
>
> But for a user process, the file_table is shared by all threads
> of that process. So traversing only the main thread of each process
> should be enough to reach all files, and this can save a lot of cpu
> time if a process has a large number of threads and each thread
> has lots of open files.
>
> This patch implements a customization for task/task_file iterators,
> permitting traversal of only those kernel tasks whose pid equals
> tgid. This includes some kernel threads and the main threads of
> user processes, solving the potential performance issue above for
> task_file. This customization may be useful for the task iterator
> too if traversing only main threads is enough.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |  3 ++-
>  include/uapi/linux/bpf.h       |  5 ++++
>  kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
>  tools/include/uapi/linux/bpf.h |  5 ++++
>  4 files changed, 43 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a6131d95e31e..058eb9b0ba78 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>         int __init bpf_iter_ ## target(args) { return 0; }
>
>  struct bpf_iter_aux_info {
> -       struct bpf_map *map;
> +       struct bpf_map *map;    /* for iterator traversing map elements */
> +       bool main_thread_only;  /* for task/task_file iterator */

As a user of the task_file iterator I'd hate to make this decision,
honestly, especially if I can't prove that all threads of a process
share the same file table (I think the clone() syscall allows weird
combinations like that, right?). It does make sense for a task/
iterator, though, if I need to iterate a user-space process (group of
tasks). So can we do this:

1a. Either add a new bpf_iter type process/ (or in kernel lingo
task_group/) to iterate only main threads (group_leader in kernel
lingo);
1b. Or make this main_thread_only an option for only a task/ iterator
(and maybe call it "group_leader_only" or something to reflect kernel
terminology?)

2. For task/file iterator, still iterate all tasks, but if the task's
files struct is the same as group_leader's files struct, then go to
the next one. This should eliminate tons of duplication of iterating
the same files over and over. It would still iterate a bunch of tasks,
but compared to the number of files that's generally a much smaller
number, so should still give sizable savings. I don't think we need an
extra option for this, tbh, this behavior was my intuitive
expectation, so I don't think you'll be breaking any sane user of this
iterator.
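
Roughly something like this in task_file_seq_get_next(), right after
the next task is fetched (just a sketch from my side, field names
from memory and locking elided):

    /* skip threads that share the group leader's file table;
     * their files are already covered via the leader
     */
    if (curr_task != curr_task->group_leader &&
        curr_task->files == curr_task->group_leader->files) {
            ++curr_tid;
            goto next_task; /* hypothetical label for the retry path */
    }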

Disclaimer: I haven't got a chance to look through kernel code much,
so I'm sorry if what I'm proposing is something that is impossible to
implement or extremely hard/unreliable. But thought I'll put this idea
out there before we decide on this.

WDYT?

[...]


* Re: [PATCH bpf-next 3/5] bpf: add link_query support for newly added main_thread_only info
  2020-08-27  0:06 ` [PATCH bpf-next 3/5] bpf: add link_query support for newly added main_thread_only info Yonghong Song
@ 2020-08-27  5:14   ` Andrii Nakryiko
  0 siblings, 0 replies; 12+ messages in thread
From: Andrii Nakryiko @ 2020-08-27  5:14 UTC
  To: Yonghong Song
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@fb.com> wrote:
>
> Added link_query support for the main_thread_only information
> of task/task_file iterators.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/uapi/linux/bpf.h       |  5 +++++
>  kernel/bpf/task_iter.c         | 17 +++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  3 files changed, 27 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index af5c600bf673..595bdc4c9431 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4259,6 +4259,11 @@ struct bpf_link_info {
>                                 struct {
>                                         __u32 map_id;
>                                 } map;
> +
> +                               struct {
> +                                       __u32 main_thread_only:1;
> +                                       __u32 :31;
> +                               } task;
>                         };
>                 } iter;
>                 struct  {
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 362bf2dda63a..7636abe05f27 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -329,6 +329,19 @@ static int bpf_iter_attach_task(struct bpf_prog *prog,
>         return 0;
>  }
>
> +static void bpf_iter_task_show_fdinfo(const struct bpf_iter_aux_info *aux,
> +                                     struct seq_file *seq)
> +{
> +       seq_printf(seq, "main_thread_only:\t%u\n", aux->main_thread_only);
> +}
> +
> +static int bpf_iter_task_fill_link_info(const struct bpf_iter_aux_info *aux,
> +                                       struct bpf_link_info *info)
> +{
> +       info->iter.task.main_thread_only = aux->main_thread_only;
> +       return 0;
> +}
> +
>  BTF_ID_LIST(btf_task_file_ids)
>  BTF_ID(struct, task_struct)
>  BTF_ID(struct, file)
> @@ -343,6 +356,8 @@ static const struct bpf_iter_seq_info task_seq_info = {
>  static struct bpf_iter_reg task_reg_info = {
>         .target                 = "task",
>         .attach_target          = bpf_iter_attach_task,
> +       .show_fdinfo            = bpf_iter_task_show_fdinfo,
> +       .fill_link_info         = bpf_iter_task_fill_link_info,
>         .ctx_arg_info_size      = 1,
>         .ctx_arg_info           = {
>                 { offsetof(struct bpf_iter__task, task),
> @@ -361,6 +376,8 @@ static const struct bpf_iter_seq_info task_file_seq_info = {
>  static struct bpf_iter_reg task_file_reg_info = {
>         .target                 = "task_file",
>         .attach_target          = bpf_iter_attach_task,
> +       .show_fdinfo            = bpf_iter_task_show_fdinfo,
> +       .fill_link_info         = bpf_iter_task_fill_link_info,
>         .ctx_arg_info_size      = 2,
>         .ctx_arg_info           = {
>                 { offsetof(struct bpf_iter__task_file, task),
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index af5c600bf673..595bdc4c9431 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4259,6 +4259,11 @@ struct bpf_link_info {
>                                 struct {
>                                         __u32 map_id;
>                                 } map;
> +
> +                               struct {
> +                                       __u32 main_thread_only:1;
> +                                       __u32 :31;

nit: unless we want to always re-calculate how many bits we have left
(and specify that in UAPI header), using `__u32: 0` would work here,
won't require updating it with every new flag, and won't accidentally
add an extra 32 bits to the task struct, if we forget to update the
size.

But nothing wrong with :31,  if explicitness was a goal here.

> +                               } task;
>                         };
>                 } iter;
>                 struct  {
> --
> 2.24.1
>


* Re: [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators
  2020-08-27  5:07   ` Andrii Nakryiko
@ 2020-08-27 18:09     ` Yonghong Song
  2020-09-02  0:34       ` Andrii Nakryiko
  0 siblings, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2020-08-27 18:09 UTC
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 8/26/20 10:07 PM, Andrii Nakryiko wrote:
> On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Currently, the task and task_file iterators by default iterate
>> through all tasks. For task_file, by default, all files from all
>> tasks will be traversed.
>>
>> But for a user process, the file_table is shared by all threads
>> of that process. So traversing only the main thread of each process
>> should be enough to reach all files, and this can save a lot of cpu
>> time if a process has a large number of threads and each thread
>> has lots of open files.
>>
>> This patch implements a customization for task/task_file iterators,
>> permitting traversal of only those kernel tasks whose pid equals
>> tgid. This includes some kernel threads and the main threads of
>> user processes, solving the potential performance issue above for
>> task_file. This customization may be useful for the task iterator
>> too if traversing only main threads is enough.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h            |  3 ++-
>>   include/uapi/linux/bpf.h       |  5 ++++
>>   kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
>>   tools/include/uapi/linux/bpf.h |  5 ++++
>>   4 files changed, 43 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index a6131d95e31e..058eb9b0ba78 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>          int __init bpf_iter_ ## target(args) { return 0; }
>>
>>   struct bpf_iter_aux_info {
>> -       struct bpf_map *map;
>> +       struct bpf_map *map;    /* for iterator traversing map elements */
>> +       bool main_thread_only;  /* for task/task_file iterator */
> 
> As a user of the task_file iterator I'd hate to make this decision,
> honestly, especially if I can't prove that all threads of a process
> share the same file table (I think the clone() syscall allows weird
> combinations like that, right?). It does make sense for a task/

Right. The clone() syscall permits different kinds of sharing;
sharing the address space and sharing files are separate. It is
possible that the address space is shared while files are not.
That is why I want to add a flag for task_file so that users who
know they have no such case can use main_thread_only to improve
performance.
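
For example (illustrative only), a task created as below ends up in
the same thread group as its creator (pid != tgid) but with its own
copy of the file table, so a main-thread-only walk would miss any
files it opens later:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int child(void *arg)
    {
            /* files opened here live in this task's private table */
            sleep(1);
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(1024 * 1024);

            /* CLONE_THREAD requires CLONE_SIGHAND and CLONE_VM, but
             * CLONE_FILES is optional: omitting it copies the file
             * table instead of sharing it.
             */
            clone(child, stack + 1024 * 1024,
                  CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL);
            sleep(2);
            return 0;
    }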

> iterator, though, if I need to iterate a user-space process (group of
> tasks). So can we do this:
> 
> 1a. Either add a new bpf_iter type process/ (or in kernel lingo
> task_group/) to iterate only main threads (group_leader in kernel
> lingo);
> 1b. Or make this main_thread_only an option for only a task/ iterator
> (and maybe call it "group_leader_only" or something to reflect kernel
> terminology?)

The following is the kernel pid_type definition,

enum pid_type
{
         PIDTYPE_PID,
         PIDTYPE_TGID,
         PIDTYPE_PGID,
         PIDTYPE_SID,
         PIDTYPE_MAX,
};

Right now, the task iterator traverses following
PIDTYPE_PID, i.e., every task in the system.

To iterate through main threads only, we need to traverse
following PIDTYPE_TGID.

In the future, it is possible, people want to traverse
following PIDTYPE_PGID (process group) or PIDTYPE_SID (session).

Or we might have other customization, e.g., cgroup_id, which can
be filtered in the bpf program, but in-kernel filtering can
definitely improve performance.

So I prefer 1b.

Yes, naming is hard.
The name "main_thread_only" is mostly from a userspace perspective.
"group_leader_only" might not be good as it may be confused with
a possible future process group option.

> 
> 2. For task/file iterator, still iterate all tasks, but if the task's
> files struct is the same as group_leader's files struct, then go to
> the next one. This should eliminate tons of duplication of iterating
> the same files over and over. It would still iterate a bunch of tasks,
> but compared to the number of files that's generally a much smaller
> number, so should still give sizable savings. I don't think we need an
> extra option for this, tbh, this behavior was my intuitive
> expectation, so I don't think you'll be breaking any sane user of this
> iterator.

What you suggested makes sense. For the task_file iterator, we only
promise to visit all files from all tasks. We can do the necessary
filtering to remove duplicates in the kernel without breaking that
promise. I will remove the customization from the task_file iterator.

> 
> Disclaimer: I haven't got a chance to look through kernel code much,
> so I'm sorry if what I'm proposing is something that is impossible to
> implement or extremely hard/unreliable. But thought I'll put this idea
> out there before we decide on this.
> 
> WDYT?
> 
> [...]
> 


* Re: [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators
  2020-08-27 18:09     ` Yonghong Song
@ 2020-09-02  0:34       ` Andrii Nakryiko
  2020-09-02  2:23         ` Yonghong Song
  0 siblings, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2020-09-02  0:34 UTC
  To: Yonghong Song
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Thu, Aug 27, 2020 at 11:09 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 8/26/20 10:07 PM, Andrii Nakryiko wrote:
> > On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> Currently, the task and task_file iterators by default iterate
> >> through all tasks. For task_file, by default, all files from all
> >> tasks will be traversed.
> >>
> >> But for a user process, the file_table is shared by all threads
> >> of that process. So traversing only the main thread of each process
> >> should be enough to reach all files, and this can save a lot of cpu
> >> time if a process has a large number of threads and each thread
> >> has lots of open files.
> >>
> >> This patch implements a customization for task/task_file iterators,
> >> permitting traversal of only those kernel tasks whose pid equals
> >> tgid. This includes some kernel threads and the main threads of
> >> user processes, solving the potential performance issue above for
> >> task_file. This customization may be useful for the task iterator
> >> too if traversing only main threads is enough.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h            |  3 ++-
> >>   include/uapi/linux/bpf.h       |  5 ++++
> >>   kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
> >>   tools/include/uapi/linux/bpf.h |  5 ++++
> >>   4 files changed, 43 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index a6131d95e31e..058eb9b0ba78 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >>          int __init bpf_iter_ ## target(args) { return 0; }
> >>
> >>   struct bpf_iter_aux_info {
> >> -       struct bpf_map *map;
> >> +       struct bpf_map *map;    /* for iterator traversing map elements */
> >> +       bool main_thread_only;  /* for task/task_file iterator */
> >
> > As a user of the task_file iterator I'd hate to make this decision,
> > honestly, especially if I can't prove that all threads of a process
> > share the same file table (I think the clone() syscall allows weird
> > combinations like that, right?). It does make sense for a task/
>
> Right. The clone() syscall permits different kinds of sharing;
> sharing the address space and sharing files are separate. It is
> possible that the address space is shared while files are not.
> That is why I want to add a flag for task_file so that users who
> know they have no such case can use main_thread_only to improve
> performance.

The problem with such options is that for a lot of users it won't be
clear at all when and if those options can be used. E.g., when I
imagine myself building some generic tool utilizing task_file bpf_iter
that is supposed to be run on any server, how do I know if it's safe
to specify "main_thread_only"? I can't guarantee that. So I'll be
forced to either risk it or fall back to the default and
unnecessarily slow behavior.

This is different for task/ bpf_iter, though, so I support that for
task/ only. But you've already done that split, so thank you! :)

>
> > iterator, though, if I need to iterate a user-space process (group of
> > tasks). So can we do this:
> >
> > 1a. Either add a new bpf_iter type process/ (or in kernel lingo
> > task_group/) to iterate only main threads (group_leader in kernel
> > lingo);
> > 1b. Or make this main_thread_only an option for only a task/ iterator
> > (and maybe call it "group_leader_only" or something to reflect kernel
> > terminology?)
>
> The following is the kernel pid_type definition,
>
> enum pid_type
> {
>          PIDTYPE_PID,
>          PIDTYPE_TGID,
>          PIDTYPE_PGID,
>          PIDTYPE_SID,
>          PIDTYPE_MAX,
> };
>
> Right now, the task iterator traverses following
> PIDTYPE_PID, i.e., every task in the system.
>
> To iterate through main threads only, we need to traverse
> following PIDTYPE_TGID.
>
> In the future, it is possible, people want to traverse
> following PIDTYPE_PGID (process group) or PIDTYPE_SID (session).
>
> Or we might have other customization, e.g., cgroup_id, which can
> be filtered in the bpf program, but in-kernel filtering can
> definitely improve performance.
>
> So I prefer 1b.

Sounds good, but let's use a proper enum, not a set of bit fields. We
can support all 4 from the get go. There is no need to wait for the
actual use case to appear, as the iteration semantics is pretty clear.
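
Something like the following shape, say (all names here are
hypothetical, just to illustrate, not a concrete proposal):

  /* hypothetical uapi sketch */
  enum bpf_task_iter_traversal {
          BPF_TASK_ITER_PID = 0,  /* all tasks, current default */
          BPF_TASK_ITER_TGID,     /* main threads only */
          BPF_TASK_ITER_PGID,     /* process group leaders */
          BPF_TASK_ITER_SID,      /* session leaders */
  };

  union bpf_iter_link_info {
          struct {
                  __u32   map_fd;
          } map;

          struct {
                  __u32   traversal; /* enum bpf_task_iter_traversal */
          } task;
  };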

>
> Yes, naming is hard.
> The name "main_thread_only" is mostly from a userspace perspective.
> "group_leader_only" might not be good as it may be confused with
> a possible future process group option.
>
> >
> > 2. For task/file iterator, still iterate all tasks, but if the task's
> > files struct is the same as group_leader's files struct, then go to
> > the next one. This should eliminate tons of duplication of iterating
> > the same files over and over. It would still iterate a bunch of tasks,
> > but compared to the number of files that's generally a much smaller
> > number, so should still give sizable savings. I don't think we need an
> > extra option for this, tbh, this behavior was my intuitive
> > expectation, so I don't think you'll be breaking any sane user of this
> > iterator.
>
> What you suggested makes sense. For the task_file iterator, we only
> promise to visit all files from all tasks. We can do the necessary
> filtering to remove duplicates in the kernel without breaking that
> promise. I will remove the customization from the task_file iterator.

Thanks for doing that!

>
> >
> > Disclaimer: I haven't got a chance to look through kernel code much,
> > so I'm sorry if what I'm proposing is something that is impossible to
> > implement or extremely hard/unreliable. But thought I'll put this idea
> > out there before we decide on this.
> >
> > WDYT?
> >
> > [...]
> >


* Re: [PATCH bpf-next 2/5] bpf: add main_thread_only customization for task/task_file iterators
  2020-09-02  0:34       ` Andrii Nakryiko
@ 2020-09-02  2:23         ` Yonghong Song
  0 siblings, 0 replies; 12+ messages in thread
From: Yonghong Song @ 2020-09-02  2:23 UTC
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 9/1/20 5:34 PM, Andrii Nakryiko wrote:
> On Thu, Aug 27, 2020 at 11:09 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 8/26/20 10:07 PM, Andrii Nakryiko wrote:
>>> On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>> Currently, the task and task_file iterators by default iterate
>>>> through all tasks. For task_file, by default, all files from all
>>>> tasks will be traversed.
>>>>
>>>> But for a user process, the file_table is shared by all threads
>>>> of that process. So traversing only the main thread of each process
>>>> should be enough to reach all files, and this can save a lot of cpu
>>>> time if a process has a large number of threads and each thread
>>>> has lots of open files.
>>>>
>>>> This patch implements a customization for task/task_file iterators,
>>>> permitting traversal of only those kernel tasks whose pid equals
>>>> tgid. This includes some kernel threads and the main threads of
>>>> user processes, solving the potential performance issue above for
>>>> task_file. This customization may be useful for the task iterator
>>>> too if traversing only main threads is enough.
>>>>
>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>> ---
>>>>    include/linux/bpf.h            |  3 ++-
>>>>    include/uapi/linux/bpf.h       |  5 ++++
>>>>    kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
>>>>    tools/include/uapi/linux/bpf.h |  5 ++++
>>>>    4 files changed, 43 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index a6131d95e31e..058eb9b0ba78 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>>> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>>           int __init bpf_iter_ ## target(args) { return 0; }
>>>>
>>>>    struct bpf_iter_aux_info {
>>>> -       struct bpf_map *map;
>>>> +       struct bpf_map *map;    /* for iterator traversing map elements */
>>>> +       bool main_thread_only;  /* for task/task_file iterator */
>>>
>>> As a user of the task_file iterator I'd hate to make this decision,
>>> honestly, especially if I can't prove that all threads of a process
>>> share the same file table (I think the clone() syscall allows weird
>>> combinations like that, right?). It does make sense for a task/
>>
>> Right. The clone() syscall permits different kinds of sharing;
>> sharing the address space and sharing files are separate. It is
>> possible that the address space is shared while files are not.
>> That is why I want to add a flag for task_file so that users who
>> know they have no such case can use main_thread_only to improve
>> performance.
> 
> The problem with such options is that for a lot of users it won't be
> clear at all when and if those options can be used. E.g., when I
> imagine myself building some generic tool utilizing task_file bpf_iter
> that is supposed to be run on any server, how do I know if it's safe
> to specify "main_thread_only"? I can't guarantee that. So I'll be
> forced to either risk it or fall back to the default and
> unnecessarily slow behavior.
> 
> This is different for task/ bpf_iter, though, so I support that for
> task/ only. But you've already done that split, so thank you! :)
> 
>>
>>> iterator, though, if I need to iterate a user-space process (group of
>>> tasks). So can we do this:
>>>
>>> 1a. Either add a new bpf_iter type process/ (or in kernel lingo
>>> task_group/) to iterate only main threads (group_leader in kernel
>>> lingo);
>>> 1b. Or make this main_thread_only an option for only a task/ iterator
>>> (and maybe call it "group_leader_only" or something to reflect kernel
>>> terminology?)
>>
>> The following is the kernel pid_type definition,
>>
>> enum pid_type
>> {
>>           PIDTYPE_PID,
>>           PIDTYPE_TGID,
>>           PIDTYPE_PGID,
>>           PIDTYPE_SID,
>>           PIDTYPE_MAX,
>> };
>>
>> Right now, the task iterator traverses following
>> PIDTYPE_PID, i.e., every task in the system.
>>
>> To iterate through main threads only, we need to traverse
>> following PIDTYPE_TGID.
>>
>> In the future, it is possible, people want to traverse
>> following PIDTYPE_PGID (process group) or PIDTYPE_SID (session).
>>
>> Or we might have other customization, e.g., cgroup_id, which can
>> be filtered in the bpf program, but in-kernel filtering can
>> definitely improve performance.
>>
>> So I prefer 1b.
> 
> Sounds good, but let's use a proper enum, not a set of bit fields. We
> can support all 4 from the get go. There is no need to wait for the
> actual use case to appear, as the iteration semantics is pretty clear.

Agreed. An enum is better than individual bit fields, especially
considering they are mutually exclusive. Will design the interface
along this line when the time comes to implement those
customizations. Thanks!

> 
>>
>> Yes, naming is hard.
>> The name "main_thread_only" is mostly from a userspace perspective.
>> "group_leader_only" might not be good as it may be confused with
>> a possible future process group option.
>>
>>>
>>> 2. For task/file iterator, still iterate all tasks, but if the task's
>>> files struct is the same as group_leader's files struct, then go to
>>> the next one. This should eliminate tons of duplication of iterating
>>> the same files over and over. It would still iterate a bunch of tasks,
>>> but compared to the number of files that's generally a much smaller
>>> number, so should still give sizable savings. I don't think we need an
>>> extra option for this, tbh, this behavior was my intuitive
>>> expectation, so I don't think you'll be breaking any sane user of this
>>> iterator.
>>
>> What you suggested makes sense. For the task_file iterator, we only
>> promise to visit all files from all tasks. We can do the necessary
>> filtering to remove duplicates in the kernel without breaking that
>> promise. I will remove the customization from the task_file iterator.
> 
> Thanks for doing that!
> 
>>
>>>
>>> Disclaimer: I haven't got a chance to look through kernel code much,
>>> so I'm sorry if what I'm proposing is something that is impossible to
>>> implement or extremely hard/unreliable. But thought I'll put this idea
>>> out there before we decide on this.
>>>
>>> WDYT?
>>>
>>> [...]
>>>

