* [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data
@ 2020-04-27 20:12 Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure Yonghong Song
                   ` (18 more replies)
  0 siblings, 19 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Motivation:
  The current ways to dump kernel data structures are mostly:
    1. the /proc system
    2. various specific tools like "ss" which require kernel support
    3. drgn
  The drawback of the first two is that whenever you want to dump more, you
  need to change the kernel. For example, Martin wants to dump socket local
  storage with "ss", and a kernel change is needed for it to work ([1]).
  This is also the direct motivation for this work.

  drgn ([2]) solves this problem nicely and no kernel change is needed.
  But since drgn is not able to verify the validity of a particular pointer value,
  it might present incorrect results in rare cases.
  
  In this patch set, we introduce the bpf iterator. Initial kernel changes are
  still needed for the kernel data of interest, but later data structure changes
  will not require kernel changes any more; the bpf program itself can adapt
  to the new data structure. This gives certain flexibility with
  guaranteed correctness.
  
  In this patch set, kernel seq_ops is used to facilitate iterating through
  kernel data, similar to the current /proc and many other lossless kernel
  dumping facilities. In the future, different iterators can be
  implemented to trade off losslessness for other criteria, e.g., no
  repeated object visits.

User Interface:
  1. Similar to prog/map/link, the iterator can be pinned into a
     path within a bpffs mount point.
  2. The bpftool command can pin an iterator to a file
         bpftool iter pin <bpf_prog.o> <path>
  3. Use `cat <path>` to dump the contents.
     Use `rm -f <path>` to remove the pinned iterator.
  4. An anonymous iterator can be created as well; see the sketch below.
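
  For illustration, below is a minimal user-space sketch of the anonymous
  flow. This is not from the patches themselves: "link_fd" is assumed to
  come from BPF_LINK_CREATE with a tracing/iter program (patch #5), and
  most error handling is omitted.

      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      /* dump an anonymous bpf iterator to stdout */
      static int dump_anon_iter(int link_fd)
      {
              union bpf_attr attr = {};
              char buf[4096];
              ssize_t len;
              int iter_fd;

              /* BPF_ITER_CREATE (patch #7) turns a link fd into an
               * anonymous seq_file fd
               */
              attr.iter_create.link_fd = link_fd;
              iter_fd = syscall(__NR_bpf, BPF_ITER_CREATE, &attr, sizeof(attr));
              if (iter_fd < 0)
                      return -1;

              /* each read() drives the seq_file, which runs the bpf program */
              while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
                      write(STDOUT_FILENO, buf, len);

              close(iter_fd);
              return 0;
      }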

  Please see patches #17 and #18 for bpf programs and bpf iterator
  output examples.

  Note that certain iterators are namespace aware. For example,
  the task and task_file targets only iterate through the current pid namespace,
  and ipv6_route and netlink iterate through the current net namespace.

  Please see individual patches for implementation details.

Performance:
  The bpf iterator provides in-kernel aggregation abilities
  for kernel data. This can greatly improve performance
  compared to, e.g., iterating over all process directories under /proc.
  For example, I did an experiment on my VM with an application forking
  different numbers of tasks and each forked process opening various numbers
  of files. The following are the results, with latency in microseconds:

    # of forked tasks   # of open files    # of bpf_prog calls  latency (us)
    100                 100                11503                7586
    1000                1000               1013203              709513
    10000               100                1130203              764519

  The number of bpf_prog calls may be more than forked tasks multiplied by
  open files since there are other tasks running on the system.
  The bpf program is a do-nothing program. One million bpf program calls take
  less than one second.

Future Work:
  Although the initial motivation is Martin's sk_local_storage,
  this patch set does not implement tcp6 sockets and sk_local_storage.
  The /proc/net/tcp6 iteration involves three types of sockets: timewait,
  request and tcp6 sockets. Some kind of type casting or other
  mechanism is needed to handle all these socket types in one
  bpf program. This will be addressed in future work.

  Currently, we do not support kernel data generated by modules.
  This requires some BTF work.

  More work is needed for more iterators, e.g., bpf_progs, cgroups, bpf_map elements, etc.

Changelog:
  RFC v2 ([3]) -> non-RFC v1:
    - rename bpfdump to bpf_iter
    - use bpffs instead of a new file system
    - use bpf_link to streamline and simplify iterator creation.

References:
  [1]: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
  [2]: https://github.com/osandov/drgn
  [3]: https://lore.kernel.org/bpf/40e427e2-5b15-e9aa-e2cb-42dc1b53d047@gmail.com/T/

Yonghong Song (19):
  net: refactor net assignment for seq_net_private structure
  bpf: implement an interface to register bpf_iter targets
  bpf: add bpf_map iterator
  bpf: allow loading of a bpf_iter program
  bpf: support bpf tracing/iter programs for BPF_LINK_CREATE
  bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  bpf: create anonymous bpf iterator
  bpf: create file bpf iterator
  bpf: add PTR_TO_BTF_ID_OR_NULL support
  bpf: add netlink and ipv6_route targets
  bpf: add task and task/file targets
  bpf: add bpf_seq_printf and bpf_seq_write helpers
  bpf: handle spilled PTR_TO_BTF_ID properly when checking
    stack_boundary
  bpf: support variable length array in tracing programs
  tools/libbpf: add bpf_iter support
  tools/bpftool: add bpf_iter support for bpftool
  tools/bpf: selftests: add iterator programs for ipv6_route and netlink
  tools/bpf: selftests: add iter progs for bpf_map/task/task_file
  tools/bpf: selftests: add bpf_iter selftests

 fs/proc/proc_net.c                            |   5 +-
 include/linux/bpf.h                           |  33 ++
 include/linux/seq_file_net.h                  |   8 +
 include/uapi/linux/bpf.h                      |  38 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_iter.c                         | 358 ++++++++++++++++++
 kernel/bpf/btf.c                              |  38 +-
 kernel/bpf/inode.c                            |  28 ++
 kernel/bpf/map_iter.c                         | 107 ++++++
 kernel/bpf/syscall.c                          |  62 ++-
 kernel/bpf/task_iter.c                        | 319 ++++++++++++++++
 kernel/bpf/verifier.c                         |  47 ++-
 kernel/trace/bpf_trace.c                      | 159 ++++++++
 net/ipv6/ip6_fib.c                            |  71 +++-
 net/ipv6/route.c                              |  30 ++
 net/netlink/af_netlink.c                      |  99 ++++-
 scripts/bpf_helpers_doc.py                    |   2 +
 .../bpftool/Documentation/bpftool-iter.rst    |  71 ++++
 tools/bpf/bpftool/bash-completion/bpftool     |  13 +
 tools/bpf/bpftool/iter.c                      |  84 ++++
 tools/bpf/bpftool/main.c                      |   3 +-
 tools/bpf/bpftool/main.h                      |   1 +
 tools/include/uapi/linux/bpf.h                |  38 +-
 tools/lib/bpf/bpf.c                           |  11 +
 tools/lib/bpf/bpf.h                           |   2 +
 tools/lib/bpf/bpf_tracing.h                   |  23 ++
 tools/lib/bpf/libbpf.c                        |  60 +++
 tools/lib/bpf/libbpf.h                        |  11 +
 tools/lib/bpf/libbpf.map                      |   7 +
 .../selftests/bpf/prog_tests/bpf_iter.c       | 180 +++++++++
 .../selftests/bpf/progs/bpf_iter_bpf_map.c    |  32 ++
 .../selftests/bpf/progs/bpf_iter_ipv6_route.c |  69 ++++
 .../selftests/bpf/progs/bpf_iter_netlink.c    |  77 ++++
 .../selftests/bpf/progs/bpf_iter_task.c       |  29 ++
 .../selftests/bpf/progs/bpf_iter_task_file.c  |  28 ++
 .../selftests/bpf/progs/bpf_iter_test_kern1.c |   4 +
 .../selftests/bpf/progs/bpf_iter_test_kern2.c |   4 +
 .../selftests/bpf/progs/bpf_iter_test_kern3.c |  18 +
 .../bpf/progs/bpf_iter_test_kern_common.h     |  22 ++
 39 files changed, 2174 insertions(+), 19 deletions(-)
 create mode 100644 kernel/bpf/bpf_iter.c
 create mode 100644 kernel/bpf/map_iter.c
 create mode 100644 kernel/bpf/task_iter.c
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-iter.rst
 create mode 100644 tools/bpf/bpftool/iter.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_bpf_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern1.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern2.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern3.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern_common.h

-- 
2.24.1



* [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29  5:38   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets Yonghong Song
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Refactor the assignment of "net" in the seq_net_private structure
in proc_net.c into a helper function. The helper will later
be used by bpf_iter.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 fs/proc/proc_net.c           | 5 ++---
 include/linux/seq_file_net.h | 8 ++++++++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 4888c5224442..aee07c19cf8b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -75,9 +75,8 @@ static int seq_open_net(struct inode *inode, struct file *file)
 		put_net(net);
 		return -ENOMEM;
 	}
-#ifdef CONFIG_NET_NS
-	p->net = net;
-#endif
+
+	set_seq_net_private(p, net);
 	return 0;
 }
 
diff --git a/include/linux/seq_file_net.h b/include/linux/seq_file_net.h
index 0fdbe1ddd8d1..0ec4a18b9aca 100644
--- a/include/linux/seq_file_net.h
+++ b/include/linux/seq_file_net.h
@@ -35,4 +35,12 @@ static inline struct net *seq_file_single_net(struct seq_file *seq)
 #endif
 }
 
+static inline void set_seq_net_private(struct seq_net_private *p,
+				       struct net *net)
+{
+#ifdef CONFIG_NET_NS
+	p->net = net;
+#endif
+}
+
 #endif
-- 
2.24.1



* [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-28 16:20   ` Martin KaFai Lau
  2020-04-27 20:12 ` [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator Yonghong Song
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The target can call bpf_iter_reg_target() to register itself.
The needed information:
  target:           target name, represented as a directory hierarchy
  target_func_name: the kernel func name used by the verifier to
                    verify bpf programs
  seq_ops:          the seq_file operations for the target
  seq_priv_size:    the private_data size needed by the seq_file
                    operations
  target_feature:   certain features requested by the target for
                    bpf_iter to prepare for seq_file operations.

A few more words on the target name and target_feature.
For example, the target name can be "bpf_map", "task" or "task/file",
which represent iterating all bpf_maps, all tasks, or all files
of all tasks, respectively.

The target feature is mostly for reusing existing seq_file operations.
For example, the /proc/net/{tcp6, ipv6_route, netlink, ...} seq_file private
data contains a reference to the net namespace. When bpf_iter tries to
reuse the same seq_ops, its seq_file private data needs the net namespace
set up properly too. In this case, the bpf_iter infrastructure can set it
up properly before doing the seq_file operations.
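
As an illustration, a hypothetical "foo" target would register itself
roughly as below. This is a sketch only: "foo_seq_ops", "struct
foo_seq_priv" and "__bpf_iter__foo" are made-up names; the real bpf_map
target added later in the series follows the same shape:

  static int __init foo_iter_init(void)
  {
          struct bpf_iter_reg reg_info = {
                  .target           = "foo",
                  .target_func_name = "__bpf_iter__foo",
                  .seq_ops          = &foo_seq_ops,
                  .seq_priv_size    = sizeof(struct foo_seq_priv),
                  .target_feature   = 0,
          };

          return bpf_iter_reg_target(&reg_info);
  }
  late_initcall(foo_iter_init);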

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   | 11 ++++++++++
 kernel/bpf/Makefile   |  2 +-
 kernel/bpf/bpf_iter.c | 50 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/bpf_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 10960cfabea4..5e56abc1e2f1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -31,6 +31,7 @@ struct seq_file;
 struct btf;
 struct btf_type;
 struct exception_table_entry;
+struct seq_operations;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1109,6 +1110,16 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
+struct bpf_iter_reg {
+	const char *target;
+	const char *target_func_name;
+	const struct seq_operations *seq_ops;
+	u32 seq_priv_size;
+	u32 target_feature;
+};
+
+int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
+
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f2d7be596966..6a8b0febd3f6 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,7 +2,7 @@
 obj-y := core.o
 CFLAGS_core.o += $(call cc-disable-warning, override-init)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
new file mode 100644
index 000000000000..1115b978607a
--- /dev/null
+++ b/kernel/bpf/bpf_iter.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2020 Facebook */
+
+#include <linux/fs.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+
+struct bpf_iter_target_info {
+	struct list_head list;
+	const char *target;
+	const char *target_func_name;
+	const struct seq_operations *seq_ops;
+	u32 seq_priv_size;
+	u32 target_feature;
+};
+
+static struct list_head targets;
+static struct mutex targets_mutex;
+static bool bpf_iter_inited = false;
+
+int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
+{
+	struct bpf_iter_target_info *tinfo;
+
+	/* The earliest bpf_iter_reg_target() is called at init time
+	 * where the bpf_iter registration is serialized.
+	 */
+	if (!bpf_iter_inited) {
+		INIT_LIST_HEAD(&targets);
+		mutex_init(&targets_mutex);
+		bpf_iter_inited = true;
+	}
+
+	tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
+	if (!tinfo)
+		return -ENOMEM;
+
+	tinfo->target = reg_info->target;
+	tinfo->target_func_name = reg_info->target_func_name;
+	tinfo->seq_ops = reg_info->seq_ops;
+	tinfo->seq_priv_size = reg_info->seq_priv_size;
+	tinfo->target_feature = reg_info->target_feature;
+	INIT_LIST_HEAD(&tinfo->list);
+
+	mutex_lock(&targets_mutex);
+	list_add(&tinfo->list, &targets);
+	mutex_unlock(&targets_mutex);
+
+	return 0;
+}
-- 
2.24.1



* [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29  0:37   ` Martin KaFai Lau
  2020-04-29  6:04   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program Yonghong Song
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Implement the bpf_map iterator.
The bpf program is called in the seq_ops show() and stop() functions.
bpf_iter_get_prog() will retrieve the bpf program and other
parameters during seq_file object traversal. In the show() function,
the bpf program is called for every valid object, and in the stop()
function, the bpf program is called one more time after all
objects are traversed.

The first member of the bpf context contains the meta data, namely
the seq_file, session_id and seq_num. Here, the session_id is
a unique id for one specific seq_file session. The seq_num is
the number of bpf prog invocations in the current session.
bpf_iter_get_prog(), which will be implemented in subsequent
patches, will show more information on how the meta data are computed.

The second member of the bpf context is a struct bpf_map pointer,
which the bpf program can examine.

The target implementation also provides the structure definition
for the bpf program and the function definition for the verifier to
verify the bpf program. Specifically for the bpf_map iterator,
the structure is "bpf_iter__bpf_map" and the function is
"__bpf_iter__bpf_map".

More targets will be implemented later, all of which will include
the following, similar to the bpf_map iterator:
  - seq_ops() implementation
  - function definition for verifier to verify the bpf program
  - seq_file private data size
  - additional target feature
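
For reference, a bpf program for this target could look like the sketch
below. It is not part of this patch: it is modeled on the selftests
later in this series and uses the SEC("iter/bpf_map") convention, the
BPF_SEQ_PRINTF macro and the selftest-local "bpf_iter.h" header
introduced there:

  #include "bpf_iter.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("iter/bpf_map")
  int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct bpf_map *map = ctx->map;

          /* map is NULL on the final stop() invocation */
          if (!map)
                  return 0;

          BPF_SEQ_PRINTF(seq, "%8u %16llu\n", map->id, ctx->meta->seq_num);
          return 0;
  }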

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   |  10 ++++
 kernel/bpf/Makefile   |   2 +-
 kernel/bpf/bpf_iter.c |  19 ++++++++
 kernel/bpf/map_iter.c | 107 ++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c  |  13 +++++
 5 files changed, 150 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/map_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5e56abc1e2f1..4ac8d61f7c3e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1078,6 +1078,7 @@ int  generic_map_update_batch(struct bpf_map *map,
 int  generic_map_delete_batch(struct bpf_map *map,
 			      const union bpf_attr *attr,
 			      union bpf_attr __user *uattr);
+struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
@@ -1118,7 +1119,16 @@ struct bpf_iter_reg {
 	u32 target_feature;
 };
 
+struct bpf_iter_meta {
+	__bpf_md_ptr(struct seq_file *, seq);
+	u64 session_id;
+	u64 seq_num;
+};
+
 int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
+struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
+				   u64 *session_id, u64 *seq_num, bool is_last);
+int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6a8b0febd3f6..b2b5eefc5254 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,7 +2,7 @@
 obj-y := core.o
 CFLAGS_core.o += $(call cc-disable-warning, override-init)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 1115b978607a..284c95587803 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -48,3 +48,22 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
 
 	return 0;
 }
+
+struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
+				   u64 *session_id, u64 *seq_num, bool is_last)
+{
+	return NULL;
+}
+
+int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
+{
+	int ret;
+
+	migrate_disable();
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, ctx);
+	rcu_read_unlock();
+	migrate_enable();
+
+	return ret;
+}
diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c
new file mode 100644
index 000000000000..bb3ad4c3bde5
--- /dev/null
+++ b/kernel/bpf/map_iter.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2020 Facebook */
+#include <linux/bpf.h>
+#include <linux/fs.h>
+#include <linux/filter.h>
+#include <linux/kernel.h>
+
+struct bpf_iter_seq_map_info {
+	struct bpf_map *map;
+	u32 id;
+};
+
+static void *bpf_map_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_iter_seq_map_info *info = seq->private;
+	struct bpf_map *map;
+	u32 id = info->id;
+
+	map = bpf_map_get_curr_or_next(&id);
+	if (IS_ERR_OR_NULL(map))
+		return NULL;
+
+	++*pos;
+	info->map = map;
+	info->id = id;
+	return map;
+}
+
+static void *bpf_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_iter_seq_map_info *info = seq->private;
+	struct bpf_map *map;
+
+	++*pos;
+	++info->id;
+	map = bpf_map_get_curr_or_next(&info->id);
+	if (IS_ERR_OR_NULL(map))
+		return NULL;
+
+	bpf_map_put(info->map);
+	info->map = map;
+	return map;
+}
+
+struct bpf_iter__bpf_map {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct bpf_map *, map);
+};
+
+int __init __bpf_iter__bpf_map(struct bpf_iter_meta *meta, struct bpf_map *map)
+{
+	return 0;
+}
+
+static int bpf_map_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter_meta meta;
+	struct bpf_iter__bpf_map ctx;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.map = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
+				 &meta.session_id, &meta.seq_num,
+				 v == (void *)0);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static void bpf_map_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_iter_seq_map_info *info = seq->private;
+
+	if (!v)
+		bpf_map_seq_show(seq, v);
+
+	if (info->map) {
+		bpf_map_put(info->map);
+		info->map = NULL;
+	}
+}
+
+static const struct seq_operations bpf_map_seq_ops = {
+	.start	= bpf_map_seq_start,
+	.next	= bpf_map_seq_next,
+	.stop	= bpf_map_seq_stop,
+	.show	= bpf_map_seq_show,
+};
+
+static int __init bpf_map_iter_init(void)
+{
+	struct bpf_iter_reg reg_info = {
+		.target			= "bpf_map",
+		.target_func_name	= "__bpf_iter__bpf_map",
+		.seq_ops		= &bpf_map_seq_ops,
+		.seq_priv_size		= sizeof(struct bpf_iter_seq_map_info),
+		.target_feature		= 0,
+	};
+
+	return bpf_iter_reg_target(&reg_info);
+}
+
+late_initcall(bpf_map_iter_init);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7626b8024471..022187640943 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2800,6 +2800,19 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
 	return err;
 }
 
+struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
+{
+	struct bpf_map *map;
+
+	spin_lock_bh(&map_idr_lock);
+	map = idr_get_next(&map_idr, id);
+	if (map)
+		map = __bpf_map_inc_not_zero(map, false);
+	spin_unlock_bh(&map_idr_lock);
+
+	return map;
+}
+
 #define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
 
 struct bpf_prog *bpf_prog_by_id(u32 id)
-- 
2.24.1



* [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (2 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29  0:54   ` Martin KaFai Lau
  2020-04-27 20:12 ` [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE Yonghong Song
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

A bpf_iter program is a tracing program with attach type
BPF_TRACE_ITER. The load attribute
  attach_btf_id
is used by the verifier to check the program against a particular kernel
function, e.g., __bpf_iter__bpf_map in our previous bpf_map iterator.

The program return value must be 0 for now. In the
future, other return values may be used for filtering or
terminating the iterator.
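
A load-time sketch in raw syscall style (assuming <linux/bpf.h>,
<sys/syscall.h> and <unistd.h> are included; log buffer setup is
elided, and btf_id is assumed to have been resolved from vmlinux BTF
by the name "__bpf_iter__bpf_map"):

  static int load_iter_prog(const struct bpf_insn *insns, __u32 insn_cnt,
                            __u32 btf_id)
  {
          union bpf_attr attr = {};

          attr.prog_type            = BPF_PROG_TYPE_TRACING;
          attr.expected_attach_type = BPF_TRACE_ITER;
          attr.attach_btf_id        = btf_id;
          attr.insns                = (__u64)(unsigned long)insns;
          attr.insn_cnt             = insn_cnt;
          attr.license              = (__u64)(unsigned long)"GPL";

          return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }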

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/verifier.c          | 20 ++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4a6c47f3febe..f39b9fec37ab 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -215,6 +215,7 @@ enum bpf_attach_type {
 	BPF_TRACE_FEXIT,
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
+	BPF_TRACE_ITER,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 91728e0f27eb..fd36c22685d9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7074,6 +7074,11 @@ static int check_return_code(struct bpf_verifier_env *env)
 			return 0;
 		range = tnum_const(0);
 		break;
+	case BPF_PROG_TYPE_TRACING:
+		if (env->prog->expected_attach_type != BPF_TRACE_ITER)
+			return 0;
+		range = tnum_const(0);
+		break;
 	default:
 		return 0;
 	}
@@ -10454,6 +10459,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	struct bpf_prog *tgt_prog = prog->aux->linked_prog;
 	u32 btf_id = prog->aux->attach_btf_id;
 	const char prefix[] = "btf_trace_";
+	struct btf_func_model fmodel;
 	int ret = 0, subprog = -1, i;
 	struct bpf_trampoline *tr;
 	const struct btf_type *t;
@@ -10595,6 +10601,20 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 		prog->aux->attach_func_proto = t;
 		prog->aux->attach_btf_trace = true;
 		return 0;
+	case BPF_TRACE_ITER:
+		if (!btf_type_is_func(t)) {
+			verbose(env, "attach_btf_id %u is not a function\n",
+				btf_id);
+			return -EINVAL;
+		}
+		t = btf_type_by_id(btf, t->type);
+		if (!btf_type_is_func_proto(t))
+			return -EINVAL;
+		prog->aux->attach_func_name = tname;
+		prog->aux->attach_func_proto = t;
+		ret = btf_distill_func_proto(&env->log, btf, t,
+					     tname, &fmodel);
+		return ret;
 	default:
 		if (!prog_extension)
 			return -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a6c47f3febe..f39b9fec37ab 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -215,6 +215,7 @@ enum bpf_attach_type {
 	BPF_TRACE_FEXIT,
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
+	BPF_TRACE_ITER,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.24.1



* [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (3 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29  1:17   ` [Potential Spoof] " Martin KaFai Lau
  2020-04-29  6:25   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE Yonghong Song
                   ` (13 subsequent siblings)
  18 siblings, 2 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Given a bpf program, the steps to create an anonymous bpf iterator are:
  - create a bpf_iter_link, which combines the bpf program and the target.
    In the future, there could be more information recorded in the link.
    A link_fd will be returned to the user space.
  - create an anonymous bpf iterator with the given link_fd.

The anonymous bpf iterator (and its underlying bpf_link) will be
used to create the file based bpf iterator as well.

The benefits of using bpf_iter_link:
  - for the file based bpf iterator, bpf_iter_link provides a standard
    way to replace the underlying bpf program.
  - for both anonymous and file based iterators, the bpf link query
    capability can be leveraged.

This patch adds support of tracing/iter programs for BPF_LINK_CREATE.
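
A user-space sketch of the link creation in raw syscall style
("prog_fd" is assumed to be a loaded tracing/iter program as in the
previous patch's sketch; same includes as before):

  static int create_iter_link(int prog_fd)
  {
          union bpf_attr attr = {};

          attr.link_create.prog_fd     = prog_fd;
          attr.link_create.target_fd   = 0;              /* must be 0 for iter */
          attr.link_create.attach_type = BPF_TRACE_ITER;
          attr.link_create.flags       = 0;              /* must be 0 */

          return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }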

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   |  2 ++
 kernel/bpf/bpf_iter.c | 54 +++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c  | 15 ++++++++++++
 3 files changed, 71 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4ac8d61f7c3e..60ecb73d8f6d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1034,6 +1034,7 @@ extern const struct file_operations bpf_prog_fops;
 extern const struct bpf_prog_ops bpf_offload_prog_ops;
 extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
+extern const struct bpf_link_ops bpf_iter_link_lops;
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
@@ -1129,6 +1130,7 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
 struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
 				   u64 *session_id, u64 *seq_num, bool is_last);
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
+int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 284c95587803..9532e7bcb8e1 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -14,6 +14,11 @@ struct bpf_iter_target_info {
 	u32 target_feature;
 };
 
+struct bpf_iter_link {
+	struct bpf_link link;
+	struct bpf_iter_target_info *tinfo;
+};
+
 static struct list_head targets;
 static struct mutex targets_mutex;
 static bool bpf_iter_inited = false;
@@ -67,3 +72,52 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
 
 	return ret;
 }
+
+static void bpf_iter_link_release(struct bpf_link *link)
+{
+}
+
+static void bpf_iter_link_dealloc(struct bpf_link *link)
+{
+}
+
+const struct bpf_link_ops bpf_iter_link_lops = {
+	.release = bpf_iter_link_release,
+	.dealloc = bpf_iter_link_dealloc,
+};
+
+int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_iter_target_info *tinfo;
+	struct bpf_iter_link *link;
+	const char *func_name;
+	bool existed = false;
+	int err;
+
+	if (attr->link_create.target_fd || attr->link_create.flags)
+		return -EINVAL;
+
+	func_name = prog->aux->attach_func_name;
+	mutex_lock(&targets_mutex);
+	list_for_each_entry(tinfo, &targets, list) {
+		if (!strcmp(tinfo->target_func_name, func_name)) {
+			existed = true;
+			break;
+		}
+	}
+	mutex_unlock(&targets_mutex);
+	if (!existed)
+		return -ENOENT;
+
+	link = kzalloc(sizeof(*link), GFP_USER | __GFP_NOWARN);
+	if (!link)
+		return -ENOMEM;
+
+	bpf_link_init(&link->link, &bpf_iter_link_lops, prog);
+	link->tinfo = tinfo;
+
+	err = bpf_link_new_fd(&link->link);
+	if (err < 0)
+		kfree(link);
+	return err;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 022187640943..8741b5e11c85 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2269,6 +2269,8 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
 	else if (link->ops == &bpf_cgroup_link_lops)
 		link_type = "cgroup";
 #endif
+	else if (link->ops == &bpf_iter_link_lops)
+		link_type = "iter";
 	else
 		link_type = "unknown";
 
@@ -2597,6 +2599,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 	case BPF_CGROUP_GETSOCKOPT:
 	case BPF_CGROUP_SETSOCKOPT:
 		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
+	case BPF_TRACE_ITER:
+		return BPF_PROG_TYPE_TRACING;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -3571,6 +3575,14 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
 	return err;
 }
 
+static int tracing_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	if (attr->link_create.attach_type == BPF_TRACE_ITER)
+		return bpf_iter_link_attach(attr, prog);
+
+	return -EINVAL;
+}
+
 #define BPF_LINK_CREATE_LAST_FIELD link_create.flags
 static int link_create(union bpf_attr *attr)
 {
@@ -3607,6 +3619,9 @@ static int link_create(union bpf_attr *attr)
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 		ret = cgroup_bpf_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_TRACING:
+		ret = tracing_bpf_link_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
-- 
2.24.1



* [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (4 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29  1:32   ` Martin KaFai Lau
  2020-04-27 20:12 ` [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator Yonghong Song
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Add BPF_LINK_UPDATE support for tracing/iter programs.
This way, a file based bpf iterator, which holds a reference
to the link, can have its bpf program updated without
creating new files.
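
A user-space sketch of the update in raw syscall style (the
BPF_F_REPLACE flag and old_prog_fd are optional and only enforce
which program is being replaced; same includes as earlier sketches):

  static int update_iter_link(int link_fd, int new_prog_fd, int old_prog_fd)
  {
          union bpf_attr attr = {};

          attr.link_update.link_fd     = link_fd;
          attr.link_update.new_prog_fd = new_prog_fd;
          /* optional: fail unless old_prog_fd is the currently attached prog */
          attr.link_update.flags       = BPF_F_REPLACE;
          attr.link_update.old_prog_fd = old_prog_fd;

          return syscall(__NR_bpf, BPF_LINK_UPDATE, &attr, sizeof(attr));
  }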

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   |  2 ++
 kernel/bpf/bpf_iter.c | 29 +++++++++++++++++++++++++++++
 kernel/bpf/syscall.c  |  5 +++++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 60ecb73d8f6d..4fc39d9b5cd0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1131,6 +1131,8 @@ struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
 				   u64 *session_id, u64 *seq_num, bool is_last);
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
 int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
+			  struct bpf_prog *new_prog);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 9532e7bcb8e1..fc1ce5ee5c3f 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -23,6 +23,9 @@ static struct list_head targets;
 static struct mutex targets_mutex;
 static bool bpf_iter_inited = false;
 
+/* protect bpf_iter_link.link->prog update */
+static struct mutex bpf_iter_mutex;
+
 int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
 {
 	struct bpf_iter_target_info *tinfo;
@@ -33,6 +36,7 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
 	if (!bpf_iter_inited) {
 		INIT_LIST_HEAD(&targets);
 		mutex_init(&targets_mutex);
+		mutex_init(&bpf_iter_mutex);
 		bpf_iter_inited = true;
 	}
 
@@ -121,3 +125,28 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
 		kfree(link);
 	return err;
 }
+
+int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
+			  struct bpf_prog *new_prog)
+{
+	int ret = 0;
+
+	mutex_lock(&bpf_iter_mutex);
+	if (old_prog && link->prog != old_prog) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	if (link->prog->type != new_prog->type ||
+	    link->prog->expected_attach_type != new_prog->expected_attach_type ||
+	    strcmp(link->prog->aux->attach_func_name, new_prog->aux->attach_func_name)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	link->prog = new_prog;
+
+out_unlock:
+	mutex_unlock(&bpf_iter_mutex);
+	return ret;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8741b5e11c85..b7af4f006f2e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3679,6 +3679,11 @@ static int link_update(union bpf_attr *attr)
 		goto out_put_progs;
 	}
 #endif
+
+	if (link->ops == &bpf_iter_link_lops) {
+		ret = bpf_iter_link_replace(link, old_prog, new_prog);
+		goto out_put_progs;
+	}
 	ret = -EINVAL;
 
 out_put_progs:
-- 
2.24.1



* [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (5 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29  5:39   ` Martin KaFai Lau
                     ` (2 more replies)
  2020-04-27 20:12 ` [PATCH bpf-next v1 08/19] bpf: create file " Yonghong Song
                   ` (11 subsequent siblings)
  18 siblings, 3 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

A new bpf command BPF_ITER_CREATE is added.

The anonymous bpf iterator is seq_file based.
The seq_file private data are referenced by targets.
The bpf_iter infrastructure allocates additional space
at seq_file->private, after the space used by targets,
to store some meta data, e.g.,
  prog:       prog to run
  session_id: a unique id for each opened seq_file
  seq_num:    how many times the bpf program has been invoked in this session
  has_last:   indicates whether the bpf_prog has been called after
              all valid objects have been processed

A map between file and prog/link is established to help
fops->release(). When fops->release() is called, based just on the
inode and file, the bpf program cannot be located since the target
seq_priv_size is not available. This map helps retrieve the prog
whose reference count needs to be decremented.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |   3 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
 kernel/bpf/syscall.c           |  27 ++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 203 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4fc39d9b5cd0..0f0cafc65a04 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
+#define BPF_DUMP_SEQ_NET_PRIVATE	BIT(0)
+
 struct bpf_iter_reg {
 	const char *target;
 	const char *target_func_name;
@@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
 int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
 			  struct bpf_prog *new_prog);
+int bpf_iter_new_fd(struct bpf_link *link);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f39b9fec37ab..576651110d16 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -113,6 +113,7 @@ enum bpf_cmd {
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
 	BPF_LINK_UPDATE,
+	BPF_ITER_CREATE,
 };
 
 enum bpf_map_type {
@@ -590,6 +591,11 @@ union bpf_attr {
 		__u32		old_prog_fd;
 	} link_update;
 
+	struct { /* struct used by BPF_ITER_CREATE command */
+		__u32		link_fd;
+		__u32		flags;
+	} iter_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index fc1ce5ee5c3f..1f4e778d1814 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2020 Facebook */
 
 #include <linux/fs.h>
+#include <linux/anon_inodes.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
 
@@ -19,6 +20,19 @@ struct bpf_iter_link {
 	struct bpf_iter_target_info *tinfo;
 };
 
+struct extra_priv_data {
+	struct bpf_prog *prog;
+	u64 session_id;
+	u64 seq_num;
+	bool has_last;
+};
+
+struct anon_file_prog_assoc {
+	struct list_head list;
+	struct file *file;
+	struct bpf_prog *prog;
+};
+
 static struct list_head targets;
 static struct mutex targets_mutex;
 static bool bpf_iter_inited = false;
@@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
 /* protect bpf_iter_link.link->prog update */
 static struct mutex bpf_iter_mutex;
 
+/* In the anon seq_file release function, the prog cannot
+ * be retrieved because the target seq_priv_size is not available.
+ * Keep a list of <anon_file, prog> mappings, so that
+ * at file release stage, the prog can be released properly.
+ */
+static struct list_head anon_iter_info;
+static struct mutex anon_iter_info_mutex;
+
+/* incremented on every opened seq_file */
+static atomic64_t session_id;
+
+static u32 get_total_priv_dsize(u32 old_size)
+{
+	return roundup(old_size, 8) + sizeof(struct extra_priv_data);
+}
+
+static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
+{
+	return old_ptr + roundup(old_size, 8);
+}
+
+static int anon_iter_release(struct inode *inode, struct file *file)
+{
+	struct anon_file_prog_assoc *finfo;
+
+	mutex_lock(&anon_iter_info_mutex);
+	list_for_each_entry(finfo, &anon_iter_info, list) {
+		if (finfo->file == file) {
+			bpf_prog_put(finfo->prog);
+			list_del(&finfo->list);
+			kfree(finfo);
+			break;
+		}
+	}
+	mutex_unlock(&anon_iter_info_mutex);
+
+	return seq_release_private(inode, file);
+}
+
+static const struct file_operations anon_bpf_iter_fops = {
+	.read		= seq_read,
+	.release	= anon_iter_release,
+};
+
 int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
 {
 	struct bpf_iter_target_info *tinfo;
@@ -37,6 +95,8 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
 		INIT_LIST_HEAD(&targets);
 		mutex_init(&targets_mutex);
 		mutex_init(&bpf_iter_mutex);
+		INIT_LIST_HEAD(&anon_iter_info);
+		mutex_init(&anon_iter_info_mutex);
 		bpf_iter_inited = true;
 	}
 
@@ -61,7 +121,20 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
 struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
 				   u64 *session_id, u64 *seq_num, bool is_last)
 {
-	return NULL;
+	struct extra_priv_data *extra_data;
+
+	if (seq->file->f_op != &anon_bpf_iter_fops)
+		return NULL;
+
+	extra_data = get_extra_priv_dptr(seq->private, priv_data_size);
+	if (extra_data->has_last)
+		return NULL;
+
+	*session_id = extra_data->session_id;
+	*seq_num = extra_data->seq_num++;
+	extra_data->has_last = is_last;
+
+	return extra_data->prog;
 }
 
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
@@ -150,3 +223,90 @@ int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
 	mutex_unlock(&bpf_iter_mutex);
 	return ret;
 }
+
+static void init_seq_file(void *priv_data, struct bpf_iter_target_info *tinfo,
+			  struct bpf_prog *prog)
+{
+	struct extra_priv_data *extra_data;
+
+	if (tinfo->target_feature & BPF_DUMP_SEQ_NET_PRIVATE)
+		set_seq_net_private((struct seq_net_private *)priv_data,
+				    current->nsproxy->net_ns);
+
+	extra_data = get_extra_priv_dptr(priv_data, tinfo->seq_priv_size);
+	extra_data->session_id = atomic64_add_return(1, &session_id);
+	extra_data->prog = prog;
+	extra_data->seq_num = 0;
+	extra_data->has_last = false;
+}
+
+static int prepare_seq_file(struct file *file, struct bpf_iter_link *link)
+{
+	struct anon_file_prog_assoc *finfo;
+	struct bpf_iter_target_info *tinfo;
+	struct bpf_prog *prog;
+	u32 total_priv_dsize;
+	void *priv_data;
+
+	finfo = kmalloc(sizeof(*finfo), GFP_USER | __GFP_NOWARN);
+	if (!finfo)
+		return -ENOMEM;
+
+	mutex_lock(&bpf_iter_mutex);
+	prog = link->link.prog;
+	bpf_prog_inc(prog);
+	mutex_unlock(&bpf_iter_mutex);
+
+	tinfo = link->tinfo;
+	total_priv_dsize = get_total_priv_dsize(tinfo->seq_priv_size);
+	priv_data = __seq_open_private(file, tinfo->seq_ops, total_priv_dsize);
+	if (!priv_data) {
+		bpf_prog_sub(prog, 1);
+		kfree(finfo);
+		return -ENOMEM;
+	}
+
+	init_seq_file(priv_data, tinfo, prog);
+
+	finfo->file = file;
+	finfo->prog = prog;
+
+	mutex_lock(&anon_iter_info_mutex);
+	list_add(&finfo->list, &anon_iter_info);
+	mutex_unlock(&anon_iter_info_mutex);
+	return 0;
+}
+
+int bpf_iter_new_fd(struct bpf_link *link)
+{
+	struct file *file;
+	int err, fd;
+
+	if (link->ops != &bpf_iter_link_lops)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = anon_inode_getfile("bpf_iter", &anon_bpf_iter_fops,
+				  NULL, O_CLOEXEC);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto free_fd;
+	}
+
+	err = prepare_seq_file(file,
+			       container_of(link, struct bpf_iter_link, link));
+	if (err)
+		goto free_file;
+
+	fd_install(fd, file);
+	return fd;
+
+free_file:
+	fput(file);
+free_fd:
+	put_unused_fd(fd);
+	return err;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b7af4f006f2e..458f7000887a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3696,6 +3696,30 @@ static int link_update(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_ITER_CREATE_LAST_FIELD iter_create.flags
+
+static int bpf_iter_create(union bpf_attr *attr)
+{
+	struct bpf_link *link;
+	int err;
+
+	if (CHECK_ATTR(BPF_ITER_CREATE))
+		return -EINVAL;
+
+	if (attr->iter_create.flags)
+		return -EINVAL;
+
+	link = bpf_link_get_from_fd(attr->iter_create.link_fd);
+	if (IS_ERR(link))
+		return PTR_ERR(link);
+
+	err = bpf_iter_new_fd(link);
+	if (err < 0)
+		bpf_link_put(link);
+
+	return err;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr;
@@ -3813,6 +3837,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_LINK_UPDATE:
 		err = link_update(&attr);
 		break;
+	case BPF_ITER_CREATE:
+		err = bpf_iter_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f39b9fec37ab..576651110d16 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -113,6 +113,7 @@ enum bpf_cmd {
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
 	BPF_LINK_UPDATE,
+	BPF_ITER_CREATE,
 };
 
 enum bpf_map_type {
@@ -590,6 +591,11 @@ union bpf_attr {
 		__u32		old_prog_fd;
 	} link_update;
 
+	struct { /* struct used by BPF_ITER_CREATE command */
+		__u32		link_fd;
+		__u32		flags;
+	} iter_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
-- 
2.24.1



* [PATCH bpf-next v1 08/19] bpf: create file bpf iterator
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (6 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29 20:40   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support Yonghong Song
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

A new obj type BPF_TYPE_ITER is added to bpffs.
To produce a file bpf iterator, the fd must
correspond to a link_fd associated with a
tracing/iter program. When the pinned file is
opened, a seq_file will be generated.
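
A user-space sketch of the file based flow in raw syscall style (the
bpffs path is made up; <fcntl.h> is also needed in addition to the
includes of the earlier sketches):

  static int read_pinned_iter(int link_fd, const char *path)
  {
          union bpf_attr attr = {};
          char buf[4096];
          ssize_t len;
          int fd;

          /* pin the iter link into bpffs ... */
          attr.pathname = (__u64)(unsigned long)path;
          attr.bpf_fd   = link_fd;
          if (syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr)))
                  return -1;

          /* ... then read it like a file; open() creates the seq_file */
          fd = open(path, O_RDONLY);
          if (fd < 0)
                  return -1;
          while ((len = read(fd, buf, sizeof(buf))) > 0)
                  write(STDOUT_FILENO, buf, len);
          close(fd);
          return 0;
  }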

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   |  3 +++
 kernel/bpf/bpf_iter.c | 48 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/inode.c    | 28 +++++++++++++++++++++++++
 kernel/bpf/syscall.c  |  2 +-
 4 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0f0cafc65a04..601b3299b7e4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1021,6 +1021,8 @@ static inline void bpf_enable_instrumentation(void)
 
 extern const struct file_operations bpf_map_fops;
 extern const struct file_operations bpf_prog_fops;
+extern const struct file_operations bpf_link_fops;
+extern const struct file_operations bpffs_iter_fops;
 
 #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
 	extern const struct bpf_prog_ops _name ## _prog_ops; \
@@ -1136,6 +1138,7 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
 			  struct bpf_prog *new_prog);
 int bpf_iter_new_fd(struct bpf_link *link);
+void *bpf_iter_get_from_fd(u32 ufd);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 1f4e778d1814..f5e933236996 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -123,7 +123,8 @@ struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
 {
 	struct extra_priv_data *extra_data;
 
-	if (seq->file->f_op != &anon_bpf_iter_fops)
+	if (seq->file->f_op != &anon_bpf_iter_fops &&
+	    seq->file->f_op != &bpffs_iter_fops)
 		return NULL;
 
 	extra_data = get_extra_priv_dptr(seq->private, priv_data_size);
@@ -310,3 +311,48 @@ int bpf_iter_new_fd(struct bpf_link *link)
 	put_unused_fd(fd);
 	return err;
 }
+
+static int bpffs_iter_open(struct inode *inode, struct file *file)
+{
+	struct bpf_iter_link *link = inode->i_private;
+
+	return prepare_seq_file(file, link);
+}
+
+static int bpffs_iter_release(struct inode *inode, struct file *file)
+{
+	return anon_iter_release(inode, file);
+}
+
+const struct file_operations bpffs_iter_fops = {
+	.open		= bpffs_iter_open,
+	.read		= seq_read,
+	.release	= bpffs_iter_release,
+};
+
+void *bpf_iter_get_from_fd(u32 ufd)
+{
+	struct bpf_link *link;
+	struct bpf_prog *prog;
+	struct fd f;
+
+	f = fdget(ufd);
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_link_fops) {
+		link = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	link = f.file->private_data;
+	prog = link->prog;
+	if (prog->expected_attach_type != BPF_TRACE_ITER) {
+		link = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	bpf_link_inc(link);
+out:
+	fdput(f);
+	return link;
+}
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 95087d9f4ed3..de4493983a37 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -26,6 +26,7 @@ enum bpf_type {
 	BPF_TYPE_PROG,
 	BPF_TYPE_MAP,
 	BPF_TYPE_LINK,
+	BPF_TYPE_ITER,
 };
 
 static void *bpf_any_get(void *raw, enum bpf_type type)
@@ -38,6 +39,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
 		bpf_map_inc_with_uref(raw);
 		break;
 	case BPF_TYPE_LINK:
+	case BPF_TYPE_ITER:
 		bpf_link_inc(raw);
 		break;
 	default:
@@ -58,6 +60,7 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 		bpf_map_put_with_uref(raw);
 		break;
 	case BPF_TYPE_LINK:
+	case BPF_TYPE_ITER:
 		bpf_link_put(raw);
 		break;
 	default:
@@ -82,6 +85,15 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 		return raw;
 	}
 
+	/* check bpf_iter before bpf_link as
+	 * ufd is also a link.
+	 */
+	raw = bpf_iter_get_from_fd(ufd);
+	if (!IS_ERR(raw)) {
+		*type = BPF_TYPE_ITER;
+		return raw;
+	}
+
 	raw = bpf_link_get_from_fd(ufd);
 	if (!IS_ERR(raw)) {
 		*type = BPF_TYPE_LINK;
@@ -96,6 +108,7 @@ static const struct inode_operations bpf_dir_iops;
 static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops  = { };
+static const struct inode_operations bpf_iter_iops  = { };
 
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
@@ -135,6 +148,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 		*type = BPF_TYPE_MAP;
 	else if (inode->i_op == &bpf_link_iops)
 		*type = BPF_TYPE_LINK;
+	else if (inode->i_op == &bpf_iter_iops)
+		*type = BPF_TYPE_ITER;
 	else
 		return -EACCES;
 
@@ -362,6 +377,12 @@ static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
 			     &bpffs_obj_fops);
 }
 
+static int bpf_mkiter(struct dentry *dentry, umode_t mode, void *arg)
+{
+	return bpf_mkobj_ops(dentry, mode, arg, &bpf_iter_iops,
+			     &bpffs_iter_fops);
+}
+
 static struct dentry *
 bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
 {
@@ -441,6 +462,9 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	case BPF_TYPE_LINK:
 		ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
 		break;
+	case BPF_TYPE_ITER:
+		ret = vfs_mkobj(dentry, mode, bpf_mkiter, raw);
+		break;
 	default:
 		ret = -EPERM;
 	}
@@ -519,6 +543,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
 		ret = bpf_map_new_fd(raw, f_flags);
 	else if (type == BPF_TYPE_LINK)
 		ret = bpf_link_new_fd(raw);
+	else if (type == BPF_TYPE_ITER)
+		ret = bpf_iter_new_fd(raw);
 	else
 		return -ENOENT;
 
@@ -538,6 +564,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type
 		return ERR_PTR(-EINVAL);
 	if (inode->i_op == &bpf_link_iops)
 		return ERR_PTR(-EINVAL);
+	if (inode->i_op == &bpf_iter_iops)
+		return ERR_PTR(-EINVAL);
 	if (inode->i_op != &bpf_prog_iops)
 		return ERR_PTR(-EACCES);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 458f7000887a..e9ca5fbe8723 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2285,7 +2285,7 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
 }
 #endif
 
-static const struct file_operations bpf_link_fops = {
+const struct file_operations bpf_link_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= bpf_link_show_fdinfo,
 #endif
-- 
2.24.1



* [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (7 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 08/19] bpf: create file " Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-29 20:46   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets Yonghong Song
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Add bpf_reg_type PTR_TO_BTF_ID_OR_NULL support.
For a tracing/iter program, the bpf program context
definition, e.g., for the previous bpf_map target, looks like
  struct bpf_iter__bpf_map {
    struct bpf_iter_meta *meta;
    struct bpf_map *map;
  };

The kernel guarantees that meta is not NULL, but the
map pointer may be NULL. The NULL map indicates that all
objects have been traversed, so the bpf program can take
proper action, e.g., do final aggregation and/or send a
final report to user space.

Add btf_id_or_null_non0_off to the prog->aux structure, to
indicate that for tracing programs, if the context access
offset is not 0, the reg type is set to PTR_TO_BTF_ID_OR_NULL instead of
PTR_TO_BTF_ID. This bit is set for tracing/iter programs.
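
A program-side sketch of how the NULL case enables in-kernel
aggregation (not part of this patch; it reuses the SEC() and
BPF_SEQ_PRINTF conventions of the earlier sketch, with headers and
license elided):

  SEC("iter/bpf_map")
  int count_bpf_map(struct bpf_iter__bpf_map *ctx)
  {
          static __u64 count;

          /* ctx->map has verifier type PTR_TO_BTF_ID_OR_NULL, so a
           * NULL check is mandatory before dereferencing it
           */
          if (!ctx->map) {
                  /* all objects traversed: emit the final report */
                  BPF_SEQ_PRINTF(ctx->meta->seq, "total maps: %llu\n", count);
                  return 0;
          }

          count++;
          return 0;
  }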

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   |  2 ++
 kernel/bpf/btf.c      |  5 ++++-
 kernel/bpf/verifier.c | 19 ++++++++++++++-----
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 601b3299b7e4..d30cf0544ab0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -320,6 +320,7 @@ enum bpf_reg_type {
 	PTR_TO_TP_BUFFER,	 /* reg points to a writable raw tp's buffer */
 	PTR_TO_XDP_SOCK,	 /* reg points to struct xdp_sock */
 	PTR_TO_BTF_ID,		 /* reg points to kernel struct */
+	PTR_TO_BTF_ID_OR_NULL,	 /* reg points to kernel struct or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -658,6 +659,7 @@ struct bpf_prog_aux {
 	bool offload_requested;
 	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
 	bool func_proto_unreliable;
+	bool btf_id_or_null_non0_off;
 	enum bpf_tramp_prog_type trampoline_prog_type;
 	struct bpf_trampoline *trampoline;
 	struct hlist_node tramp_hlist;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index d65c6912bdaf..2c098e6b1acc 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3788,7 +3788,10 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 		return true;
 
 	/* this is a pointer to another type */
-	info->reg_type = PTR_TO_BTF_ID;
+	if (off != 0 && prog->aux->btf_id_or_null_non0_off)
+		info->reg_type = PTR_TO_BTF_ID_OR_NULL;
+	else
+		info->reg_type = PTR_TO_BTF_ID;
 
 	if (tgt_prog) {
 		ret = btf_translate_to_vmlinux(log, btf, t, tgt_prog->type, arg);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index fd36c22685d9..21ec85e382ca 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -396,7 +396,8 @@ static bool reg_type_may_be_null(enum bpf_reg_type type)
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
 	       type == PTR_TO_SOCKET_OR_NULL ||
 	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
-	       type == PTR_TO_TCP_SOCK_OR_NULL;
+	       type == PTR_TO_TCP_SOCK_OR_NULL ||
+	       type == PTR_TO_BTF_ID_OR_NULL;
 }
 
 static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
@@ -410,7 +411,8 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
 	return type == PTR_TO_SOCKET ||
 		type == PTR_TO_SOCKET_OR_NULL ||
 		type == PTR_TO_TCP_SOCK ||
-		type == PTR_TO_TCP_SOCK_OR_NULL;
+		type == PTR_TO_TCP_SOCK_OR_NULL ||
+		type == PTR_TO_BTF_ID_OR_NULL;
 }
 
 static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
@@ -462,6 +464,7 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_TP_BUFFER]	= "tp_buffer",
 	[PTR_TO_XDP_SOCK]	= "xdp_sock",
 	[PTR_TO_BTF_ID]		= "ptr_",
+	[PTR_TO_BTF_ID_OR_NULL]	= "ptr_or_null_",
 };
 
 static char slot_type_char[] = {
@@ -522,7 +525,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 			/* reg->off should be 0 for SCALAR_VALUE */
 			verbose(env, "%lld", reg->var_off.value + reg->off);
 		} else {
-			if (t == PTR_TO_BTF_ID)
+			if (t == PTR_TO_BTF_ID || t == PTR_TO_BTF_ID_OR_NULL)
 				verbose(env, "%s", kernel_type_name(reg->btf_id));
 			verbose(env, "(id=%d", reg->id);
 			if (reg_type_may_be_refcounted_or_null(t))
@@ -2118,6 +2121,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
 	case PTR_TO_BTF_ID:
+	case PTR_TO_BTF_ID_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -2638,7 +2642,7 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 		 */
 		*reg_type = info.reg_type;
 
-		if (*reg_type == PTR_TO_BTF_ID)
+		if (*reg_type == PTR_TO_BTF_ID || *reg_type == PTR_TO_BTF_ID_OR_NULL)
 			*btf_id = info.btf_id;
 		else
 			env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
@@ -3222,7 +3226,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				 * a sub-register.
 				 */
 				regs[value_regno].subreg_def = DEF_NOT_SUBREG;
-				if (reg_type == PTR_TO_BTF_ID)
+				if (reg_type == PTR_TO_BTF_ID ||
+				    reg_type == PTR_TO_BTF_ID_OR_NULL)
 					regs[value_regno].btf_id = btf_id;
 			}
 			regs[value_regno].type = reg_type;
@@ -6545,6 +6550,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			reg->type = PTR_TO_SOCK_COMMON;
 		} else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) {
 			reg->type = PTR_TO_TCP_SOCK;
+		} else if (reg->type == PTR_TO_BTF_ID_OR_NULL) {
+			reg->type = PTR_TO_BTF_ID;
 		}
 		if (is_null) {
 			/* We don't need id and ref_obj_id from this point
@@ -8403,6 +8410,7 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
 	case PTR_TO_BTF_ID:
+	case PTR_TO_BTF_ID_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -10612,6 +10620,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 			return -EINVAL;
 		prog->aux->attach_func_name = tname;
 		prog->aux->attach_func_proto = t;
+		prog->aux->btf_id_or_null_non0_off = true;
 		ret = btf_distill_func_proto(&env->log, btf, t,
 					     tname, &fmodel);
 		return ret;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (8 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-28 19:49     ` kbuild test robot
  2020-04-28 19:50     ` kbuild test robot
  2020-04-27 20:12 ` [PATCH bpf-next v1 11/19] bpf: add task and task/file targets Yonghong Song
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This patch adds netlink and ipv6_route targets, reusing the
same seq_ops as /proc/net/{netlink,ipv6_route} (except show(),
plus minor changes to stop()).

Note that both ipv6_route and netlink have target_feature
set to BPF_DUMP_SEQ_NET_PRIVATE. This notifies the
bpf_iter infrastructure to set the net namespace properly
in the seq_file private data area.

Since modules are not supported for now, ipv6_route is
supported only if IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once modules
are properly supported for bpf_iter.
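
For reference, both targets follow the same registration pattern;
a sketch with placeholder "foo" names (the actual registrations are
in the diff below):

  static int __init bpf_iter_register(void)
  {
  	struct bpf_iter_reg reg_info = {
  		.target			= "foo",
  		.target_func_name	= "__bpf_iter__foo",
  		.seq_ops		= &foo_seq_ops,
  		.seq_priv_size		= sizeof(struct foo_iter_state),
  		.target_feature		= BPF_DUMP_SEQ_NET_PRIVATE,
  	};

  	return bpf_iter_reg_target(&reg_info);
  }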

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 net/ipv6/ip6_fib.c       | 71 +++++++++++++++++++++++++++-
 net/ipv6/route.c         | 30 ++++++++++++
 net/netlink/af_netlink.c | 99 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 196 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 46ed56719476..588b5f508b18 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2467,7 +2467,7 @@ void fib6_gc_cleanup(void)
 }
 
 #ifdef CONFIG_PROC_FS
-static int ipv6_route_seq_show(struct seq_file *seq, void *v)
+static int ipv6_route_native_seq_show(struct seq_file *seq, void *v)
 {
 	struct fib6_info *rt = v;
 	struct ipv6_route_iter *iter = seq->private;
@@ -2625,7 +2625,7 @@ static bool ipv6_route_iter_active(struct ipv6_route_iter *iter)
 	return w->node && !(w->state == FWS_U && w->node == w->root);
 }
 
-static void ipv6_route_seq_stop(struct seq_file *seq, void *v)
+static void ipv6_route_native_seq_stop(struct seq_file *seq, void *v)
 	__releases(RCU_BH)
 {
 	struct net *net = seq_file_net(seq);
@@ -2637,6 +2637,73 @@ static void ipv6_route_seq_stop(struct seq_file *seq, void *v)
 	rcu_read_unlock_bh();
 }
 
+#if IS_BUILTIN(CONFIG_IPV6) && defined(CONFIG_BPF_SYSCALL)
+struct bpf_iter__ipv6_route {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct fib6_info *, rt);
+};
+
+static int ipv6_route_prog_seq_show(struct bpf_prog *prog, struct seq_file *seq,
+				    u64 session_id, u64 seq_num, void *v)
+{
+	struct bpf_iter__ipv6_route ctx;
+	struct bpf_iter_meta meta;
+	int ret;
+
+	meta.seq = seq;
+	meta.session_id = session_id;
+	meta.seq_num = seq_num;
+	ctx.meta = &meta;
+	ctx.rt = v;
+	ret = bpf_iter_run_prog(prog, &ctx);
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static int ipv6_route_seq_show(struct seq_file *seq, void *v)
+{
+	struct ipv6_route_iter *iter = seq->private;
+	u64 session_id, seq_num;
+	struct bpf_prog *prog;
+	int ret;
+
+	prog = bpf_iter_get_prog(seq, sizeof(struct ipv6_route_iter),
+				 &session_id, &seq_num, false);
+	if (!prog)
+		return ipv6_route_native_seq_show(seq, v);
+
+	ret = ipv6_route_prog_seq_show(prog, seq, session_id, seq_num, v);
+	iter->w.leaf = NULL;
+
+	return ret;
+}
+
+static void ipv6_route_seq_stop(struct seq_file *seq, void *v)
+{
+	u64 session_id, seq_num;
+	struct bpf_prog *prog;
+
+	if (!v) {
+		prog = bpf_iter_get_prog(seq, sizeof(struct ipv6_route_iter),
+					 &session_id, &seq_num, true);
+		if (prog)
+			ipv6_route_prog_seq_show(prog, seq, session_id,
+						 seq_num, v);
+	}
+
+	ipv6_route_native_seq_stop(seq, v);
+}
+#else
+static int ipv6_route_seq_show(struct seq_file *seq, void *v)
+{
+	return ipv6_route_native_seq_show(seq, v);
+}
+
+static void ipv6_route_seq_stop(struct seq_file *seq, void *v)
+{
+	ipv6_route_native_seq_stop(seq, v);
+}
+#endif
+
 const struct seq_operations ipv6_route_seq_ops = {
 	.start	= ipv6_route_seq_start,
 	.next	= ipv6_route_seq_next,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 310cbddaa533..f275a13e2aea 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6390,6 +6390,28 @@ void __init ip6_route_init_special_entries(void)
   #endif
 }
 
+#if IS_BUILTIN(CONFIG_IPV6)
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
+int __init __bpf_iter__ipv6_route(struct bpf_iter_meta *meta, struct fib6_info *rt)
+{
+	return 0;
+}
+
+static int __init bpf_iter_register(void)
+{
+	struct bpf_iter_reg reg_info = {
+		.target			= "ipv6_route",
+		.target_func_name	= "__bpf_iter__ipv6_route",
+		.seq_ops		= &ipv6_route_seq_ops,
+		.seq_priv_size		= sizeof(struct ipv6_route_iter),
+		.target_feature		= BPF_DUMP_SEQ_NET_PRIVATE,
+	};
+
+	return bpf_iter_reg_target(&reg_info);
+}
+#endif
+#endif
+
 int __init ip6_route_init(void)
 {
 	int ret;
@@ -6452,6 +6474,14 @@ int __init ip6_route_init(void)
 	if (ret)
 		goto out_register_late_subsys;
 
+#if IS_BUILTIN(CONFIG_IPV6)
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
+	ret = bpf_iter_register();
+	if (ret)
+		goto out_register_late_subsys;
+#endif
+#endif
+
 	for_each_possible_cpu(cpu) {
 		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 5ded01ca8b20..b6192cd66801 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2596,7 +2596,7 @@ static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	return __netlink_seq_next(seq);
 }
 
-static void netlink_seq_stop(struct seq_file *seq, void *v)
+static void netlink_native_seq_stop(struct seq_file *seq, void *v)
 {
 	struct nl_seq_iter *iter = seq->private;
 
@@ -2607,7 +2607,7 @@ static void netlink_seq_stop(struct seq_file *seq, void *v)
 }
 
 
-static int netlink_seq_show(struct seq_file *seq, void *v)
+static int netlink_native_seq_show(struct seq_file *seq, void *v)
 {
 	if (v == SEQ_START_TOKEN) {
 		seq_puts(seq,
@@ -2634,6 +2634,80 @@ static int netlink_seq_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_BPF_SYSCALL
+struct bpf_iter__netlink {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct netlink_sock *, sk);
+};
+
+int __init __bpf_iter__netlink(struct bpf_iter_meta *meta, struct netlink_sock *sk)
+{
+	return 0;
+}
+
+static int netlink_prog_seq_show(struct bpf_prog *prog, struct seq_file *seq,
+				 u64 session_id, u64 seq_num, void *v)
+{
+	struct bpf_iter__netlink ctx;
+	struct bpf_iter_meta meta;
+	int ret = 0;
+
+	meta.seq = seq;
+	meta.session_id = session_id;
+	meta.seq_num = seq_num;
+	ctx.meta = &meta;
+	ctx.sk = nlk_sk((struct sock *)v);
+	ret = bpf_iter_run_prog(prog, &ctx);
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static int netlink_seq_show(struct seq_file *seq, void *v)
+{
+	u64 session_id, seq_num;
+	struct bpf_prog *prog;
+
+	prog = bpf_iter_get_prog(seq, sizeof(struct nl_seq_iter),
+				 &session_id, &seq_num, false);
+	if (!prog)
+		return netlink_native_seq_show(seq, v);
+
+	if (v == SEQ_START_TOKEN)
+		return 0;
+
+	return netlink_prog_seq_show(prog, seq, session_id,
+				     seq_num - 1, v);
+}
+
+static void netlink_seq_stop(struct seq_file *seq, void *v)
+{
+	u64 session_id, seq_num;
+	struct bpf_prog *prog;
+
+	if (!v) {
+		prog = bpf_iter_get_prog(seq, sizeof(struct nl_seq_iter),
+					 &session_id, &seq_num, true);
+		if (prog) {
+			if (seq_num)
+				seq_num = seq_num - 1;
+			netlink_prog_seq_show(prog, seq, session_id,
+					      seq_num, v);
+		}
+	}
+
+	netlink_native_seq_stop(seq, v);
+}
+#else
+static int netlink_seq_show(struct seq_file *seq, void *v)
+{
+	return netlink_native_seq_show(seq, v);
+}
+
+static void netlink_seq_stop(struct seq_file *seq, void *v)
+{
+	netlink_native_seq_stop(seq, v);
+}
+#endif
+
 static const struct seq_operations netlink_seq_ops = {
 	.start  = netlink_seq_start,
 	.next   = netlink_seq_next,
@@ -2740,6 +2814,21 @@ static const struct rhashtable_params netlink_rhashtable_params = {
 	.automatic_shrinking = true,
 };
 
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
+static int __init bpf_iter_register(void)
+{
+	struct bpf_iter_reg reg_info = {
+		.target			= "netlink",
+		.target_func_name	= "__bpf_iter__netlink",
+		.seq_ops		= &netlink_seq_ops,
+		.seq_priv_size		= sizeof(struct nl_seq_iter),
+		.target_feature		= BPF_DUMP_SEQ_NET_PRIVATE,
+	};
+
+	return bpf_iter_reg_target(&reg_info);
+}
+#endif
+
 static int __init netlink_proto_init(void)
 {
 	int i;
@@ -2748,6 +2837,12 @@ static int __init netlink_proto_init(void)
 	if (err != 0)
 		goto out;
 
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
+	err = bpf_iter_register();
+	if (err)
+		goto out;
+#endif
+
 	BUILD_BUG_ON(sizeof(struct netlink_skb_parms) > sizeof_field(struct sk_buff, cb));
 
 	nl_table = kcalloc(MAX_LINKS, sizeof(*nl_table), GFP_KERNEL);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 11/19] bpf: add task and task/file targets
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (9 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-30  2:08   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Only the tasks belonging to the "current" pid namespace
are enumerated.

For the task/file target, the bpf program has access to
  struct task_struct *task
  u32 fd
  struct file *file
where fd/file is an open file for the task.
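
A minimal sketch of a bpf program for this target (illustrative
only; the "iter/task_file" section name follows the libbpf support
added later in this series):

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
  	struct task_struct *task = ctx->task;
  	struct file *file = ctx->file;
  	__u32 fd = ctx->fd;

  	if (!file)	/* all task/file pairs traversed */
  		return 0;

  	/* inspect task, fd and file here, e.g., count open
  	 * files per task
  	 */
  	return 0;
  }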

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/Makefile    |   2 +-
 kernel/bpf/task_iter.c | 319 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 320 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/task_iter.c

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index b2b5eefc5254..37b2d8620153 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,7 +2,7 @@
 obj-y := core.o
 CFLAGS_core.o += $(call cc-disable-warning, override-init)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
new file mode 100644
index 000000000000..ee29574e427d
--- /dev/null
+++ b/kernel/bpf/task_iter.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2020 Facebook */
+
+#include <linux/init.h>
+#include <linux/namei.h>
+#include <linux/pid_namespace.h>
+#include <linux/fs.h>
+#include <linux/fdtable.h>
+#include <linux/filter.h>
+
+struct bpf_iter_seq_task_info {
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	u32 id;
+};
+
+static struct task_struct *task_seq_get_next(struct pid_namespace *ns, u32 *id)
+{
+	struct task_struct *task = NULL;
+	struct pid *pid;
+
+	rcu_read_lock();
+	pid = idr_get_next(&ns->idr, id);
+	if (pid)
+		task = get_pid_task(pid, PIDTYPE_PID);
+	rcu_read_unlock();
+
+	return task;
+}
+
+static void *task_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_iter_seq_task_info *info = seq->private;
+	struct task_struct *task;
+	u32 id = info->id;
+
+	if (*pos == 0)
+		info->ns = task_active_pid_ns(current);
+
+	task = task_seq_get_next(info->ns, &id);
+	if (!task)
+		return NULL;
+
+	++*pos;
+	info->task = task;
+	info->id = id;
+
+	return task;
+}
+
+static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_iter_seq_task_info *info = seq->private;
+	struct task_struct *task;
+
+	++*pos;
+	++info->id;
+	task = task_seq_get_next(info->ns, &info->id);
+	if (!task)
+		return NULL;
+
+	put_task_struct(info->task);
+	info->task = task;
+	return task;
+}
+
+struct bpf_iter__task {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct task_struct *, task);
+};
+
+int __init __bpf_iter__task(struct bpf_iter_meta *meta, struct task_struct *task)
+{
+	return 0;
+}
+
+static int task_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter_meta meta;
+	struct bpf_iter__task ctx;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_task_info),
+				 &meta.session_id, &meta.seq_num,
+				 v == (void *)0);
+	if (prog) {
+		meta.seq = seq;
+		ctx.meta = &meta;
+		ctx.task = v;
+		ret = bpf_iter_run_prog(prog, &ctx);
+	}
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static void task_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_iter_seq_task_info *info = seq->private;
+
+	if (!v)
+		task_seq_show(seq, v);
+
+	if (info->task) {
+		put_task_struct(info->task);
+		info->task = NULL;
+	}
+}
+
+static const struct seq_operations task_seq_ops = {
+	.start	= task_seq_start,
+	.next	= task_seq_next,
+	.stop	= task_seq_stop,
+	.show	= task_seq_show,
+};
+
+struct bpf_iter_seq_task_file_info {
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct files_struct *files;
+	u32 id;
+	u32 fd;
+};
+
+static struct file *task_file_seq_get_next(struct pid_namespace *ns, u32 *id,
+					   int *fd, struct task_struct **task,
+					   struct files_struct **fstruct)
+{
+	struct files_struct *files;
+	struct task_struct *tk;
+	u32 sid = *id;
+	int sfd;
+
+	/* If this function returns a non-NULL file object,
+	 * it holds a reference to the files_struct and the file.
+	 * Otherwise, it does not hold any reference.
+	 */
+again:
+	if (*fstruct) {
+		files = *fstruct;
+		sfd = *fd;
+	} else {
+		tk = task_seq_get_next(ns, &sid);
+		if (!tk)
+			return NULL;
+
+		files = get_files_struct(tk);
+		put_task_struct(tk);
+		if (!files) {
+			sid = ++(*id);
+			*fd = 0;
+			goto again;
+		}
+		*fstruct = files;
+		*task = tk;
+		if (sid == *id) {
+			sfd = *fd;
+		} else {
+			*id = sid;
+			sfd = 0;
+		}
+	}
+
+	rcu_read_lock();
+	for (; sfd < files_fdtable(files)->max_fds; sfd++) {
+		struct file *f;
+
+		f = fcheck_files(files, sfd);
+		if (!f)
+			continue;
+		*fd = sfd;
+		get_file(f);
+		rcu_read_unlock();
+		return f;
+	}
+
+	/* the current task is done, go to the next task */
+	rcu_read_unlock();
+	put_files_struct(files);
+	*fstruct = NULL;
+	sid = ++(*id);
+	*fd = 0;
+	goto again;
+}
+
+static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_iter_seq_task_file_info *info = seq->private;
+	struct files_struct *files = NULL;
+	struct task_struct *task = NULL;
+	struct file *file;
+	u32 id = info->id;
+	int fd = info->fd;
+
+	if (*pos == 0)
+		info->ns = task_active_pid_ns(current);
+
+	file = task_file_seq_get_next(info->ns, &id, &fd, &task, &files);
+	if (!file) {
+		info->files = NULL;
+		return NULL;
+	}
+
+	++*pos;
+	info->id = id;
+	info->fd = fd;
+	info->task = task;
+	info->files = files;
+
+	return file;
+}
+
+static void *task_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_iter_seq_task_file_info *info = seq->private;
+	struct files_struct *files = info->files;
+	struct task_struct *task = info->task;
+	struct file *file;
+	u32 id = info->id;
+
+	++*pos;
+	++info->fd;
+	fput((struct file *)v);
+	file = task_file_seq_get_next(info->ns, &id, &info->fd, &task, &files);
+	if (!file) {
+		info->files = NULL;
+		return NULL;
+	}
+
+	info->id = id;
+	info->task = task;
+	info->files = files;
+
+	return file;
+}
+
+struct bpf_iter__task_file {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct task_struct *, task);
+	u32 fd;
+	__bpf_md_ptr(struct file *, file);
+};
+
+int __init __bpf_iter__task_file(struct bpf_iter_meta *meta,
+			      struct task_struct *task, u32 fd,
+			      struct file *file)
+{
+	return 0;
+}
+
+static int task_file_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter_seq_task_file_info *info = seq->private;
+	struct bpf_iter__task_file ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_task_file_info),
+				 &meta.session_id, &meta.seq_num, v == (void *)0);
+	if (prog) {
+		meta.seq = seq;
+		ctx.meta = &meta;
+		ctx.task = info->task;
+		ctx.fd = info->fd;
+		ctx.file = v;
+		ret = bpf_iter_run_prog(prog, &ctx);
+	}
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static void task_file_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_iter_seq_task_file_info *info = seq->private;
+
+	if (v)
+		fput((struct file *)v);
+	else
+		task_file_seq_show(seq, v);
+
+	if (info->files) {
+		put_files_struct(info->files);
+		info->files = NULL;
+	}
+}
+
+static const struct seq_operations task_file_seq_ops = {
+	.start	= task_file_seq_start,
+	.next	= task_file_seq_next,
+	.stop	= task_file_seq_stop,
+	.show	= task_file_seq_show,
+};
+
+static int __init task_iter_init(void)
+{
+	struct bpf_iter_reg task_file_reg_info = {
+		.target			= "task_file",
+		.target_func_name	= "__bpf_iter__task_file",
+		.seq_ops		= &task_file_seq_ops,
+		.seq_priv_size		= sizeof(struct bpf_iter_seq_task_file_info),
+		.target_feature		= 0,
+	};
+	struct bpf_iter_reg task_reg_info = {
+		.target			= "task",
+		.target_func_name	= "__bpf_iter__task",
+		.seq_ops		= &task_seq_ops,
+		.seq_priv_size		= sizeof(struct bpf_iter_seq_task_info),
+		.target_feature		= 0,
+	};
+	int ret;
+
+	ret = bpf_iter_reg_target(&task_reg_info);
+	if (ret)
+		return ret;
+
+	return bpf_iter_reg_target(&task_file_reg_info);
+}
+late_initcall(task_iter_init);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (10 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 11/19] bpf: add task and task/file targets Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-28  6:02     ` kbuild test robot
  2020-04-27 20:12 ` [PATCH bpf-next v1 13/19] bpf: handle spilled PTR_TO_BTF_ID properly when checking stack_boundary Yonghong Song
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Two helpers, bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.

bpf_seq_printf supports common format string flag/width/type
fields, enough to produce output identical to /proc for the
netlink and ipv6_route targets.

For both helpers, the return value -EOVERFLOW specifically
indicates a write failure due to overflow, which means the
object will be revisited in the next bpf invocation if the
object collection stays the same. Note that if the object
collection changes, depending on how collection traversal is
done, an object may not be visited even if it is still in the
collection.
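
A sketch of the intended calling convention from a bpf program
(`seq` and the printed values are hypothetical):

  static char fmt[] = "%08x %-11u\n";
  __u64 data[] = { val1, val2 };	/* one u64 slot per '%' specifier */
  int ret;

  /* sizeof(fmt) includes the trailing '\0', which the helper
   * requires; for %s and %p* the pointer value itself goes into
   * the u64 array
   */
  ret = bpf_seq_printf(seq, fmt, sizeof(fmt), data, sizeof(data));
  /* ret == -EOVERFLOW: the seq_file buffer overflowed and the
   * same object will be tried again on the next invocation
   */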

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/bpf.h       |  31 ++++++-
 kernel/trace/bpf_trace.c       | 159 +++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py     |   2 +
 tools/include/uapi/linux/bpf.h |  31 ++++++-
 4 files changed, 221 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 576651110d16..f0ab17d8fb73 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3042,6 +3042,33 @@ union bpf_attr {
  * 		See: clock_gettime(CLOCK_BOOTTIME)
  * 	Return
  * 		Current *ktime*.
+ *
+ * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
+ * 	Description
+ * 		seq_printf uses seq_file seq_printf() to print out the format string.
+ * 		The *m* represents the seq_file. The *fmt* and *fmt_size* are for
+ * 		the format string itself. The *data* and *data_len* are format string
+ * 		arguments. The *data* is a u64 array whose elements hold the
+ * 		corresponding format string values. For strings and pointers whose
+ * 		pointees are accessed, only the pointer values are stored in the
+ * 		*data* array. The *data_len* is the *data* size in terms of bytes.
+ * 	Return
+ * 		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EINVAL**		Invalid arguments, or invalid/unsupported formats.
+ *		* **-E2BIG**		Too many format specifiers.
+ *		* **-ENOMEM**		Not enough memory to copy pointees or strings.
+ *		* **-EOVERFLOW**	Overflow occurred; the same object will be tried again.
+ *
+ * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
+ * 	Description
+ * 		seq_write uses seq_file seq_write() to write the data.
+ * 		The *m* represents the seq_file. The *data* points to the data to
+ *		write and *len* gives its length in bytes.
+ * 	Return
+ * 		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EOVERFLOW**	Overflow occurred; the same object will be tried again.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3169,7 +3196,9 @@ union bpf_attr {
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
 	FN(sk_assign),			\
-	FN(ktime_get_boot_ns),
+	FN(ktime_get_boot_ns),		\
+	FN(seq_printf),			\
+	FN(seq_write),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e875c95d3ced..f7c5587b5d2e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -457,6 +457,161 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
 	return &bpf_trace_printk_proto;
 }
 
+#define MAX_SEQ_PRINTF_VARARGS	12
+#define MAX_SEQ_PRINTF_STR_LEN	128
+
+BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
+	   const void *, data, u32, data_len)
+{
+	char bufs[MAX_SEQ_PRINTF_VARARGS][MAX_SEQ_PRINTF_STR_LEN];
+	u64 params[MAX_SEQ_PRINTF_VARARGS];
+	int i, copy_size, num_args;
+	const u64 *args = data;
+	int fmt_cnt = 0;
+
+	/*
+	 * bpf_check()->check_func_arg()->check_stack_boundary()
+	 * guarantees that fmt points to bpf program stack,
+	 * fmt_size bytes of it were initialized and fmt_size > 0
+	 */
+	if (fmt[--fmt_size] != 0)
+		return -EINVAL;
+
+	if (data_len & 7)
+		return -EINVAL;
+
+	for (i = 0; i < fmt_size; i++) {
+		if (fmt[i] == '%' && (!data || !data_len))
+			return -EINVAL;
+	}
+
+	num_args = data_len / 8;
+
+	/* check format string for allowed specifiers */
+	for (i = 0; i < fmt_size; i++) {
+		if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
+			return -EINVAL;
+
+		if (fmt[i] != '%')
+			continue;
+
+		if (fmt_cnt >= MAX_SEQ_PRINTF_VARARGS)
+			return -E2BIG;
+
+		if (fmt_cnt >= num_args)
+			return -EINVAL;
+
+		/* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+		i++;
+
+		/* skip optional "[0+-][num]" width formatting field */
+		while (fmt[i] == '0' || fmt[i] == '+'  || fmt[i] == '-')
+			i++;
+		if (fmt[i] >= '1' && fmt[i] <= '9') {
+			i++;
+			while (fmt[i] >= '0' && fmt[i] <= '9')
+				i++;
+		}
+
+		if (fmt[i] == 's') {
+			/* disallow any further format extensions */
+			if (fmt[i + 1] != 0 &&
+			    !isspace(fmt[i + 1]) &&
+			    !ispunct(fmt[i + 1]))
+				return -EINVAL;
+
+			/* try our best to copy */
+			bufs[fmt_cnt][0] = 0;
+			strncpy_from_unsafe(bufs[fmt_cnt],
+					    (void *) (long) args[fmt_cnt],
+					    MAX_SEQ_PRINTF_STR_LEN);
+			params[fmt_cnt] = (u64)(long)bufs[fmt_cnt];
+
+			fmt_cnt++;
+			continue;
+		}
+
+		if (fmt[i] == 'p') {
+			if (fmt[i + 1] == 0 ||
+			    fmt[i + 1] == 'K' ||
+			    fmt[i + 1] == 'x') {
+				/* just kernel pointers */
+				params[fmt_cnt] = args[fmt_cnt];
+				fmt_cnt++;
+				continue;
+			}
+
+			/* only support "%pI4", "%pi4", "%pI6" and "%pi6". */
+			if (fmt[i + 1] != 'i' && fmt[i + 1] != 'I')
+				return -EINVAL;
+			if (fmt[i + 2] != '4' && fmt[i + 2] != '6')
+				return -EINVAL;
+
+			copy_size = (fmt[i + 2] == '4') ? 4 : 16;
+
+			/* try our best to copy */
+			probe_kernel_read(bufs[fmt_cnt],
+					  (void *) (long) args[fmt_cnt], copy_size);
+			params[fmt_cnt] = (u64)(long)bufs[fmt_cnt];
+
+			i += 2;
+			fmt_cnt++;
+			continue;
+		}
+
+		if (fmt[i] == 'l') {
+			i++;
+			if (fmt[i] == 'l')
+				i++;
+		}
+
+		if (fmt[i] != 'i' && fmt[i] != 'd' &&
+		    fmt[i] != 'u' && fmt[i] != 'x')
+			return -EINVAL;
+
+		params[fmt_cnt] = args[fmt_cnt];
+		fmt_cnt++;
+	}
+
+	/* We can have at most MAX_SEQ_PRINTF_VARARGS parameters, just give
+	 * all of them to seq_printf().
+	 */
+	seq_printf(m, fmt, params[0], params[1], params[2], params[3],
+		   params[4], params[5], params[6], params[7], params[8],
+		   params[9], params[10], params[11]);
+
+	return seq_has_overflowed(m) ? -EOVERFLOW : 0;
+}
+
+static int bpf_seq_printf_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_printf_proto = {
+	.func		= bpf_seq_printf,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type      = ARG_PTR_TO_MEM_OR_NULL,
+	.arg5_type      = ARG_CONST_SIZE_OR_ZERO,
+	.btf_id		= bpf_seq_printf_btf_ids,
+};
+
+BPF_CALL_3(bpf_seq_write, struct seq_file *, m, const void *, data, u32, len)
+{
+	return seq_write(m, data, len) ? -EOVERFLOW : 0;
+}
+
+static int bpf_seq_write_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_write_proto = {
+	.func		= bpf_seq_write,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.btf_id		= bpf_seq_write_btf_ids,
+};
+
 static __always_inline int
 get_map_perf_counter(struct bpf_map *map, u64 flags,
 		     u64 *value, u64 *enabled, u64 *running)
@@ -1226,6 +1381,10 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_xdp_output:
 		return &bpf_xdp_output_proto;
 #endif
+	case BPF_FUNC_seq_printf:
+		return &bpf_seq_printf_proto;
+	case BPF_FUNC_seq_write:
+		return &bpf_seq_write_proto;
 	default:
 		return raw_tp_prog_func_proto(func_id, prog);
 	}
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index f43d193aff3a..ded304c96a05 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -414,6 +414,7 @@ class PrinterHelpers(Printer):
             'struct sk_reuseport_md',
             'struct sockaddr',
             'struct tcphdr',
+            'struct seq_file',
 
             'struct __sk_buff',
             'struct sk_msg_md',
@@ -450,6 +451,7 @@ class PrinterHelpers(Printer):
             'struct sk_reuseport_md',
             'struct sockaddr',
             'struct tcphdr',
+            'struct seq_file',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 576651110d16..f0ab17d8fb73 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3042,6 +3042,33 @@ union bpf_attr {
  * 		See: clock_gettime(CLOCK_BOOTTIME)
  * 	Return
  * 		Current *ktime*.
+ *
+ * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
+ * 	Description
+ * 		seq_printf uses seq_file seq_printf() to print out the format string.
+ * 		The *m* represents the seq_file. The *fmt* and *fmt_size* are for
+ * 		the format string itself. The *data* and *data_len* are format string
+ * 		arguments. The *data* is a u64 array whose elements hold the
+ * 		corresponding format string values. For strings and pointers whose
+ * 		pointees are accessed, only the pointer values are stored in the
+ * 		*data* array. The *data_len* is the *data* size in terms of bytes.
+ * 	Return
+ * 		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EINVAL**		Invalid arguments, or invalid/unsupported formats.
+ *		* **-E2BIG**		Too many format specifiers.
+ *		* **-ENOMEM**		Not enough memory to copy pointees or strings.
+ *		* **-EOVERFLOW**	Overflow occurred; the same object will be tried again.
+ *
+ * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
+ * 	Description
+ * 		seq_write uses seq_file seq_write() to write the data.
+ * 		The *m* represents the seq_file. The *data* points to the data to
+ *		write and *len* gives its length in bytes.
+ * 	Return
+ * 		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EOVERFLOW**	Overflow occurred; the same object will be tried again.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3169,7 +3196,9 @@ union bpf_attr {
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
 	FN(sk_assign),			\
-	FN(ktime_get_boot_ns),
+	FN(ktime_get_boot_ns),		\
+	FN(seq_printf),			\
+	FN(seq_write),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 13/19] bpf: handle spilled PTR_TO_BTF_ID properly when checking stack_boundary
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (11 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 14/19] bpf: support variable length array in tracing programs Yonghong Song
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This is specifically to handle cases like the one below:
   // ptr below is a socket ptr identified by PTR_TO_BTF_ID
   u64 param[2] = { ptr, val };
   bpf_seq_printf(seq, fmt, sizeof(fmt), param, sizeof(param));

In this case, the 16-byte stack area for "param" contains:
   8 bytes for ptr with spilled PTR_TO_BTF_ID
   8 bytes for val as STACK_MISC

The current verifier will complain that the ptr should not be
visible to the helper.
   ...
   16: (7b) *(u64 *)(r10 -64) = r2
   18: (7b) *(u64 *)(r10 -56) = r1
   19: (bf) r4 = r10
   ;
   20: (07) r4 += -64
   ; BPF_SEQ_PRINTF(seq, fmt1, (long)s, s->sk_protocol);
   21: (bf) r1 = r6
   22: (18) r2 = 0xffffa8d00018605a
   24: (b4) w3 = 10
   25: (b4) w5 = 16
   26: (85) call bpf_seq_printf#125
    R0=inv(id=0) R1_w=ptr_seq_file(id=0,off=0,imm=0)
    R2_w=map_value(id=0,off=90,ks=4,vs=144,imm=0) R3_w=inv10
    R4_w=fp-64 R5_w=inv16 R6=ptr_seq_file(id=0,off=0,imm=0)
    R7=ptr_netlink_sock(id=0,off=0,imm=0) R10=fp0 fp-56_w=mmmmmmmm
    fp-64_w=ptr_
   last_idx 26 first_idx 13
   regs=8 stack=0 before 25: (b4) w5 = 16
   regs=8 stack=0 before 24: (b4) w3 = 10
   invalid indirect read from stack off -64+0 size 16

Let us permit this if the program is a tracing/iter program.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 21ec85e382ca..17a780e59f77 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3474,6 +3474,14 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
 			*stype = STACK_MISC;
 			goto mark;
 		}
+
+		/* pointer value can be visible to tracing/iter program */
+		if (env->prog->type == BPF_PROG_TYPE_TRACING &&
+		    env->prog->expected_attach_type == BPF_TRACE_ITER &&
+		    state->stack[spi].slot_type[0] == STACK_SPILL &&
+		    state->stack[spi].spilled_ptr.type == PTR_TO_BTF_ID)
+			goto mark;
+
 		if (state->stack[spi].slot_type[0] == STACK_SPILL &&
 		    state->stack[spi].spilled_ptr.type == SCALAR_VALUE) {
 			__mark_reg_unknown(env, &state->stack[spi].spilled_ptr);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 14/19] bpf: support variable length array in tracing programs
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (12 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 13/19] bpf: handle spilled PTR_TO_BTF_ID properly when checking stack_boundary Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-30 20:04   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support Yonghong Song
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

In /proc/net/ipv6_route, we have
  struct fib6_info {
    struct fib6_table *fib6_table;
    ...
    struct fib6_nh fib6_nh[0];
  }
  struct fib6_nh {
    struct fib_nh_common nh_common;
    struct rt6_info **rt6i_pcpu;
    struct rt6_exception_bucket *rt6i_exception_bucket;
  };
  struct fib_nh_common {
    ...
    u8 nhc_gw_family;
    ...
  }

The access:
  struct fib6_nh *fib6_nh = &rt->fib6_nh;
  ... fib6_nh->nh_common.nhc_gw_family ...

Since fib6_nh is a variable length (zero-sized) array at the end
of struct fib6_info, the access offset may go beyond the nominal
struct size. This patch ensures such an access is handled properly.
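
In btf_struct_access() terms, the relaxed check boils down to the
following sketch (moff is the byte offset of the trailing fib6_nh
array inside struct fib6_info):

  /* an access with off + size > t->size is allowed only when the
   * last member is a zero-sized array of structs; fold the offset
   * into one array element and re-verify against the element type
   */
  off = (off - moff) % sizeof(struct fib6_nh);
  /* then check the access as an access into struct fib6_nh */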

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/btf.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 2c098e6b1acc..22c69e1d5a56 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3831,6 +3831,7 @@ int btf_struct_access(struct bpf_verifier_log *log,
 	const struct btf_type *mtype, *elem_type = NULL;
 	const struct btf_member *member;
 	const char *tname, *mname;
+	u32 vlen;
 
 again:
 	tname = __btf_name_by_offset(btf_vmlinux, t->name_off);
@@ -3839,7 +3840,37 @@ int btf_struct_access(struct bpf_verifier_log *log,
 		return -EINVAL;
 	}
 
-	if (off + size > t->size) {
+	vlen = btf_type_vlen(t);
+	if (vlen > 0 && off + size > t->size) {
+		/* If the last element is a variable size array, we may
+		 * need to relax the rule.
+		 */
+		struct btf_array *array_elem;
+
+		member = btf_type_member(t) + vlen - 1;
+		mtype = btf_type_skip_modifiers(btf_vmlinux, member->type,
+						NULL);
+		if (!btf_type_is_array(mtype))
+			goto error;
+
+		array_elem = (struct btf_array *)(mtype + 1);
+		if (array_elem->nelems != 0)
+			goto error;
+
+		moff = btf_member_bit_offset(t, member) / 8;
+		if (off < moff)
+			goto error;
+
+		elem_type = btf_type_skip_modifiers(btf_vmlinux,
+						    array_elem->type, NULL);
+		if (!btf_type_is_struct(elem_type))
+			goto error;
+
+		off = (off - moff) % elem_type->size;
+		return btf_struct_access(log, elem_type, off, size, atype,
+					 next_btf_id);
+
+error:
 		bpf_log(log, "access beyond struct %s at off %u size %u\n",
 			tname, off, size);
 		return -EACCES;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (13 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 14/19] bpf: support variable length array in tracing programs Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-30  1:41   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bpftool Yonghong Song
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Three new libbpf APIs are added to support bpf_iter:
  - bpf_program__attach_iter
    Given a bpf program and additional parameters (none for
    now), returns a bpf_link.
  - bpf_link__create_iter
    Given a bpf_link, creates a bpf_iter and returns a fd
    so the user can then read() to get seq_file output data.
  - bpf_iter_create
    syscall-level API to create a bpf iterator.

Two macros, BPF_SEQ_PRINTF0 and BPF_SEQ_PRINTF, are also introduced.
They give bpf program writers nicer bpf_seq_printf syntax,
similar to the kernel one.
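
A minimal user-space sketch tying the new APIs together (error
handling elided; `prog` is assumed to be a loaded "iter/" program):

  struct bpf_link *link;
  char buf[4096];
  int iter_fd;
  ssize_t n;

  link = bpf_program__attach_iter(prog, NULL);	/* no extra opts yet */
  iter_fd = bpf_link__create_iter(link, 0);	/* wraps bpf_iter_create() */

  /* each read() drives the iterator and runs the bpf program
   * through the seq_file machinery
   */
  while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
  	write(STDOUT_FILENO, buf, n);

  close(iter_fd);
  bpf_link__destroy(link);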

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/bpf.c         | 11 +++++++
 tools/lib/bpf/bpf.h         |  2 ++
 tools/lib/bpf/bpf_tracing.h | 23 ++++++++++++++
 tools/lib/bpf/libbpf.c      | 60 +++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h      | 11 +++++++
 tools/lib/bpf/libbpf.map    |  7 +++++
 6 files changed, 114 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5cc1b0785d18..7ffd6c0ad95f 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -619,6 +619,17 @@ int bpf_link_update(int link_fd, int new_prog_fd,
 	return sys_bpf(BPF_LINK_UPDATE, &attr, sizeof(attr));
 }
 
+int bpf_iter_create(int link_fd, unsigned int flags)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.iter_create.link_fd = link_fd;
+	attr.iter_create.flags = flags;
+
+	return sys_bpf(BPF_ITER_CREATE, &attr, sizeof(attr));
+}
+
 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
 		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt)
 {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 46d47afdd887..db9df303090e 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -187,6 +187,8 @@ struct bpf_link_update_opts {
 LIBBPF_API int bpf_link_update(int link_fd, int new_prog_fd,
 			       const struct bpf_link_update_opts *opts);
 
+LIBBPF_API int bpf_iter_create(int link_fd, unsigned int flags);
+
 struct bpf_prog_test_run_attr {
 	int prog_fd;
 	int repeat;
diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h
index f3f3c3fb98cb..4a6dffaa7e57 100644
--- a/tools/lib/bpf/bpf_tracing.h
+++ b/tools/lib/bpf/bpf_tracing.h
@@ -413,4 +413,27 @@ typeof(name(0)) name(struct pt_regs *ctx)				    \
 }									    \
 static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
 
+/*
+ * BPF_SEQ_PRINTF to wrap bpf_seq_printf to-be-printed values
+ * in a structure. BPF_SEQ_PRINTF0 is a simple wrapper for
+ * bpf_seq_printf().
+ */
+#define BPF_SEQ_PRINTF0(seq, fmt)					\
+	({								\
+		int ret = bpf_seq_printf(seq, fmt, sizeof(fmt),		\
+					 (void *)0, 0);			\
+		ret;							\
+	})
+
+#define BPF_SEQ_PRINTF(seq, fmt, args...)				\
+	({								\
+		_Pragma("GCC diagnostic push")				\
+		_Pragma("GCC diagnostic ignored \"-Wint-conversion\"")	\
+		__u64 param[___bpf_narg(args)] = { args };		\
+		_Pragma("GCC diagnostic pop")				\
+		int ret = bpf_seq_printf(seq, fmt, sizeof(fmt),		\
+					 param, sizeof(param));		\
+		ret;							\
+	})
+
 #endif
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8e1dc6980fac..ffdc4d8e0cc0 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6366,6 +6366,9 @@ static const struct bpf_sec_def section_defs[] = {
 		.is_attach_btf = true,
 		.expected_attach_type = BPF_LSM_MAC,
 		.attach_fn = attach_lsm),
+	SEC_DEF("iter/", TRACING,
+		.expected_attach_type = BPF_TRACE_ITER,
+		.is_attach_btf = true),
 	BPF_PROG_SEC("xdp",			BPF_PROG_TYPE_XDP),
 	BPF_PROG_SEC("perf_event",		BPF_PROG_TYPE_PERF_EVENT),
 	BPF_PROG_SEC("lwt_in",			BPF_PROG_TYPE_LWT_IN),
@@ -6629,6 +6632,7 @@ static int bpf_object__collect_struct_ops_map_reloc(struct bpf_object *obj,
 
 #define BTF_TRACE_PREFIX "btf_trace_"
 #define BTF_LSM_PREFIX "bpf_lsm_"
+#define BTF_ITER_PREFIX "__bpf_iter__"
 #define BTF_MAX_NAME_SIZE 128
 
 static int find_btf_by_prefix_kind(const struct btf *btf, const char *prefix,
@@ -6659,6 +6663,9 @@ static inline int __find_vmlinux_btf_id(struct btf *btf, const char *name,
 	else if (attach_type == BPF_LSM_MAC)
 		err = find_btf_by_prefix_kind(btf, BTF_LSM_PREFIX, name,
 					      BTF_KIND_FUNC);
+	else if (attach_type == BPF_TRACE_ITER)
+		err = find_btf_by_prefix_kind(btf, BTF_ITER_PREFIX, name,
+					      BTF_KIND_FUNC);
 	else
 		err = btf__find_by_name_kind(btf, name, BTF_KIND_FUNC);
 
@@ -7617,6 +7624,59 @@ bpf_program__attach_cgroup(struct bpf_program *prog, int cgroup_fd)
 	return link;
 }
 
+struct bpf_link *
+bpf_program__attach_iter(struct bpf_program *prog,
+			 const struct bpf_iter_attach_opts *opts)
+{
+	enum bpf_attach_type attach_type;
+	char errmsg[STRERR_BUFSIZE];
+	struct bpf_link *link;
+	int prog_fd, link_fd;
+
+	if (!OPTS_VALID(opts, bpf_iter_attach_opts))
+		return ERR_PTR(-EINVAL);
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("program '%s': can't attach before loaded\n",
+			bpf_program__title(prog, false));
+		return ERR_PTR(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return ERR_PTR(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	attach_type = bpf_program__get_expected_attach_type(prog);
+	link_fd = bpf_link_create(prog_fd, 0, attach_type, NULL);
+	if (link_fd < 0) {
+		link_fd = -errno;
+		free(link);
+		pr_warn("program '%s': failed to attach to iterator: %s\n",
+			bpf_program__title(prog, false),
+			libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
+		return ERR_PTR(link_fd);
+	}
+	link->fd = link_fd;
+	return link;
+}
+
+int bpf_link__create_iter(struct bpf_link *link, unsigned int flags)
+{
+	char errmsg[STRERR_BUFSIZE];
+	int iter_fd;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link), flags);
+	if (iter_fd < 0) {
+		iter_fd = -errno;
+		pr_warn("failed to create an iterator: %s\n",
+			libbpf_strerror_r(iter_fd, errmsg, sizeof(errmsg)));
+	}
+
+	return iter_fd;
+}
+
 struct bpf_link *bpf_program__attach(struct bpf_program *prog)
 {
 	const struct bpf_sec_def *sec_def;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index f1dacecb1619..abe5786fcab3 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -258,6 +258,17 @@ struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map);
 
+struct bpf_iter_attach_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+};
+#define bpf_iter_attach_opts__last_field sz
+
+LIBBPF_API struct bpf_link *
+bpf_program__attach_iter(struct bpf_program *prog,
+			 const struct bpf_iter_attach_opts *opts);
+LIBBPF_API int
+bpf_link__create_iter(struct bpf_link *link, unsigned int flags);
+
 struct bpf_insn;
 
 /*
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index bb8831605b25..1cea36f9f2e2 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -254,3 +254,10 @@ LIBBPF_0.0.8 {
 		bpf_program__set_lsm;
 		bpf_set_link_xdp_fd_opts;
 } LIBBPF_0.0.7;
+
+LIBBPF_0.0.9 {
+	global:
+		bpf_link__create_iter;
+		bpf_program__attach_iter;
+		bpf_iter_create;
+} LIBBPF_0.0.8;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bpftool
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (14 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-28  9:27   ` Quentin Monnet
  2020-04-27 20:12 ` [PATCH bpf-next v1 17/19] tools/bpf: selftests: add iterator programs for ipv6_route and netlink Yonghong Song
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Currently, only one command is supported:
  bpftool iter pin <bpf_prog.o> <path>

It pins the trace/iter bpf program from the
object file <bpf_prog.o> to <path>, where
<path> must be on a bpffs mount.

For example,
  $ bpftool iter pin ./bpf_iter_ipv6_route.o \
    /sys/fs/bpf/my_route
The user can then `cat` the pinned file to print out the results:
  $ cat /sys/fs/bpf/my_route
    fe800000000000000000000000000000 40 00000000000000000000000000000000 ...
    00000000000000000000000000000000 00 00000000000000000000000000000000 ...
    00000000000000000000000000000001 80 00000000000000000000000000000000 ...
    fe800000000000008c0162fffebdfd57 80 00000000000000000000000000000000 ...
    ff000000000000000000000000000000 08 00000000000000000000000000000000 ...
    00000000000000000000000000000000 00 00000000000000000000000000000000 ...

The implementation for ipv6_route iterator is in one of subsequent
patches.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../bpftool/Documentation/bpftool-iter.rst    | 71 ++++++++++++++++
 tools/bpf/bpftool/bash-completion/bpftool     | 13 +++
 tools/bpf/bpftool/iter.c                      | 84 +++++++++++++++++++
 tools/bpf/bpftool/main.c                      |  3 +-
 tools/bpf/bpftool/main.h                      |  1 +
 5 files changed, 171 insertions(+), 1 deletion(-)
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-iter.rst
 create mode 100644 tools/bpf/bpftool/iter.c

diff --git a/tools/bpf/bpftool/Documentation/bpftool-iter.rst b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
new file mode 100644
index 000000000000..1997a6bac4a0
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
@@ -0,0 +1,71 @@
+============
+bpftool-iter
+============
+-------------------------------------------------------------------------------
+tool to create BPF iterators
+-------------------------------------------------------------------------------
+
+:Manual section: 8
+
+SYNOPSIS
+========
+
+	**bpftool** [*OPTIONS*] **iter** *COMMAND*
+
+	*COMMANDS* := { **pin** | **help** }
+
+ITER COMMANDS
+=============
+
+|	**bpftool** **iter pin** *OBJ* *PATH*
+|	**bpftool** **iter help**
+|
+|	*OBJ* := /a/file/of/bpf_iter_target.o
+
+
+DESCRIPTION
+===========
+	**bpftool iter pin** *OBJ* *PATH*
+		  Create a bpf iterator from *OBJ*, and pin it to
+		  *PATH*. The *PATH* must be located on a *bpffs* mount.
+
+	**bpftool iter help**
+		  Print short help message.
+
+OPTIONS
+=======
+	-h, --help
+		  Print short generic help message (similar to **bpftool help**).
+
+	-V, --version
+		  Print version number (similar to **bpftool version**).
+
+	-d, --debug
+		  Print all logs available, even debug-level information. This
+		  includes logs from libbpf as well as from the verifier, when
+		  attempting to load programs.
+
+EXAMPLES
+========
+**# bpftool iter pin bpf_iter_netlink.o /sys/fs/bpf/my_netlink**
+
+::
+
+   Create a file-based bpf iterator from bpf_iter_netlink.o and pin it
+   to /sys/fs/bpf/my_netlink
+
+
+SEE ALSO
+========
+	**bpf**\ (2),
+	**bpf-helpers**\ (7),
+	**bpftool**\ (8),
+	**bpftool-prog**\ (8),
+	**bpftool-map**\ (8),
+	**bpftool-cgroup**\ (8),
+	**bpftool-feature**\ (8),
+	**bpftool-net**\ (8),
+	**bpftool-perf**\ (8),
+	**bpftool-btf**\ (8),
+	**bpftool-gen**\ (8),
+	**bpftool-struct_ops**\ (8)
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 45ee99b159e2..17a81695da0f 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -604,6 +604,19 @@ _bpftool()
                     ;;
             esac
             ;;
+        iter)
+            case $command in
+                pin)
+                    _filedir
+                    return 0
+                    ;;
+                *)
+                    [[ $prev == $object ]] && \
+                        COMPREPLY=( $( compgen -W 'help' \
+                            -- "$cur" ) )
+                    ;;
+            esac
+            ;;
         map)
             local MAP_TYPE='id pinned name'
             case $command in
diff --git a/tools/bpf/bpftool/iter.c b/tools/bpf/bpftool/iter.c
new file mode 100644
index 000000000000..db9fae6be716
--- /dev/null
+++ b/tools/bpf/bpftool/iter.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (C) 2020 Facebook
+
+#define _GNU_SOURCE
+#include <linux/err.h>
+#include <bpf/libbpf.h>
+
+#include "main.h"
+
+static int do_pin(int argc, char **argv)
+{
+	const char *objfile, *path;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_link *link;
+	int err;
+
+	if (!REQ_ARGS(2))
+		usage();
+
+	objfile = GET_ARG();
+	path = GET_ARG();
+
+	obj = bpf_object__open(objfile);
+	if (IS_ERR_OR_NULL(obj)) {
+		p_err("can't open objfile %s", objfile);
+		return -1;
+	}
+
+	err = bpf_object__load(obj);
+	if (err < 0) {
+		err = -1;
+		p_err("can't load objfile %s", objfile);
+		goto close_obj;
+	}
+
+	prog = bpf_program__next(NULL, obj);
+	link = bpf_program__attach_iter(prog, NULL);
+	if (IS_ERR(link)) {
+		err = -1;
+		p_err("attach_iter failed for program %s",
+		      bpf_program__name(prog));
+		goto close_obj;
+	}
+
+	err = bpf_link__pin(link, path);
+	if (err) {
+		err = -1;
+		p_err("pin_iter failed for program %s to path %s",
+		      bpf_program__name(prog), path);
+		goto close_link;
+	}
+
+	err = 0;
+
+close_link:
+	bpf_link__disconnect(link);
+	bpf_link__destroy(link);
+close_obj:
+	bpf_object__close(obj);
+	return err;
+}
+
+static int do_help(int argc, char **argv)
+{
+	fprintf(stderr,
+		"Usage: %s %s pin OBJ PATH\n"
+		"       %s %s help\n"
+		"\n",
+		bin_name, argv[-2], bin_name, argv[-2]);
+
+	return 0;
+}
+
+static const struct cmd cmds[] = {
+	{ "help",	do_help },
+	{ "pin",	do_pin },
+	{ 0 }
+};
+
+int do_iter(int argc, char **argv)
+{
+	return cmd_select(cmds, argc, argv, do_help);
+}
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 466c269eabdd..6805b77789cb 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -58,7 +58,7 @@ static int do_help(int argc, char **argv)
 		"       %s batch file FILE\n"
 		"       %s version\n"
 		"\n"
-		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops }\n"
+		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops | iter }\n"
 		"       " HELP_SPEC_OPTIONS "\n"
 		"",
 		bin_name, bin_name, bin_name);
@@ -222,6 +222,7 @@ static const struct cmd cmds[] = {
 	{ "btf",	do_btf },
 	{ "gen",	do_gen },
 	{ "struct_ops",	do_struct_ops },
+	{ "iter",	do_iter },
 	{ "version",	do_version },
 	{ 0 }
 };
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 86f14ce26fd7..2b5d4a616b48 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -162,6 +162,7 @@ int do_feature(int argc, char **argv);
 int do_btf(int argc, char **argv);
 int do_gen(int argc, char **argv);
 int do_struct_ops(int argc, char **argv);
+int do_iter(int argc, char **argv);
 
 int parse_u32_arg(int *argc, char ***argv, __u32 *val, const char *what);
 int prog_parse_fd(int *argc, char ***argv);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 17/19] tools/bpf: selftests: add iterator programs for ipv6_route and netlink
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (15 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bpftool Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-30  2:12   ` Andrii Nakryiko
  2020-04-27 20:12 ` [PATCH bpf-next v1 18/19] tools/bpf: selftests: add iter progs for bpf_map/task/task_file Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 19/19] tools/bpf: selftests: add bpf_iter selftests Yonghong Song
  18 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Two bpf programs are added in this patch for the netlink and ipv6_route
targets. On my VM, their output is identical to that of
/proc/net/netlink and /proc/net/ipv6_route.

  $ cat /proc/net/netlink
  sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
  000000002c42d58b 0   0          00000000 0        0        0     2        0        7
  00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
  00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
  000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
  ....
  00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
  000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
  00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
  000000008398fb08 16  0          00000000 0        0        0     2        0        27
  $ cat /sys/fs/bpf/my_netlink
  sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
  000000002c42d58b 0   0          00000000 0        0        0     2        0        7
  00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
  00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
  000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
  ....
  00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
  000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
  00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
  000000008398fb08 16  0          00000000 0        0        0     2        0        27

  $ cat /proc/net/ipv6_route
  fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
  fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  $ cat /sys/fs/bpf/my_ipv6_route
  fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
  fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
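
  The pinned files above can be created with the bpftool support added
  earlier in this series, roughly as follows (paths are illustrative,
  object files as built from the selftests progs directory):

    $ bpftool iter pin bpf_iter_netlink.o /sys/fs/bpf/my_netlink
    $ bpftool iter pin bpf_iter_ipv6_route.o /sys/fs/bpf/my_ipv6_route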

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/progs/bpf_iter_ipv6_route.c | 69 +++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_netlink.c    | 77 +++++++++++++++++++
 2 files changed, 146 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_netlink.c

diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c b/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
new file mode 100644
index 000000000000..bed34521f997
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+extern bool CONFIG_IPV6_SUBTREES __kconfig __weak;
+
+#define	RTF_GATEWAY		0x0002
+#define IFNAMSIZ		16
+#define fib_nh_gw_family        nh_common.nhc_gw_family
+#define fib_nh_gw6              nh_common.nhc_gw.ipv6
+#define fib_nh_dev              nh_common.nhc_dev
+
+SEC("iter/ipv6_route")
+int dump_ipv6_route(struct bpf_iter__ipv6_route *ctx)
+{
+	static const char fmt1[] = "%pi6 %02x ";
+	static const char fmt2[] = "%pi6 ";
+	static const char fmt3[] = "00000000000000000000000000000000 ";
+	static const char fmt4[] = "%08x %08x %08x %08x %8s\n";
+	static const char fmt5[] = "%08x %08x %08x %08x\n";
+	static const char fmt7[] = "00000000000000000000000000000000 00 ";
+	struct seq_file *seq = ctx->meta->seq;
+	struct fib6_info *rt = ctx->rt;
+	const struct net_device *dev;
+	struct fib6_nh *fib6_nh;
+	unsigned int flags;
+	struct nexthop *nh;
+
+	if (rt == (void *)0)
+		return 0;
+
+	fib6_nh = &rt->fib6_nh[0];
+	flags = rt->fib6_flags;
+
+	/* FIXME: nexthop_is_multipath is not handled here. */
+	nh = rt->nh;
+	if (rt->nh)
+		fib6_nh = &nh->nh_info->fib6_nh;
+
+	BPF_SEQ_PRINTF(seq, fmt1, &rt->fib6_dst.addr, rt->fib6_dst.plen);
+
+	if (CONFIG_IPV6_SUBTREES)
+		BPF_SEQ_PRINTF(seq, fmt1, &rt->fib6_src.addr,
+			       rt->fib6_src.plen);
+	else
+		BPF_SEQ_PRINTF0(seq, fmt7);
+
+	if (fib6_nh->fib_nh_gw_family) {
+		flags |= RTF_GATEWAY;
+		BPF_SEQ_PRINTF(seq, fmt2, &fib6_nh->fib_nh_gw6);
+	} else {
+		BPF_SEQ_PRINTF0(seq, fmt3);
+	}
+
+	dev = fib6_nh->fib_nh_dev;
+	if (dev)
+		BPF_SEQ_PRINTF(seq, fmt4, rt->fib6_metric,
+			       rt->fib6_ref.refs.counter, 0, flags, dev->name);
+	else
+		BPF_SEQ_PRINTF(seq, fmt5, rt->fib6_metric,
+			       rt->fib6_ref.refs.counter, 0, flags);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
new file mode 100644
index 000000000000..54f93863c34c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define sk_rmem_alloc	sk_backlog.rmem_alloc
+#define sk_refcnt	__sk_common.skc_refcnt
+
+#define offsetof(TYPE, MEMBER)  ((size_t)&((TYPE *)0)->MEMBER)
+#define container_of(ptr, type, member)				\
+	({							\
+		void *__mptr = (void *)(ptr);			\
+		((type *)(__mptr - offsetof(type, member)));	\
+	})
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
+SEC("iter/netlink")
+int dump_netlink(struct bpf_iter__netlink *ctx)
+{
+	static const char banner[] =
+		"sk               Eth Pid        Groups   "
+		"Rmem     Wmem     Dump  Locks    Drops    Inode\n";
+	static const char fmt1[] = "%pK %-3d ";
+	static const char fmt2[] = "%-10u %08x %-8d %-8d %-5d %-8d ";
+	static const char fmt5[] = "%-8u %-8lu\n";
+	struct seq_file *seq = ctx->meta->seq;
+	struct netlink_sock *nlk = ctx->sk;
+	unsigned long group, ino;
+	struct inode *inode;
+	struct socket *sk;
+	struct sock *s;
+
+	if (nlk == (void *)0)
+		return 0;
+
+	if (ctx->meta->seq_num == 0)
+		BPF_SEQ_PRINTF0(seq, banner);
+
+	s = &nlk->sk;
+	BPF_SEQ_PRINTF(seq, fmt1, s, s->sk_protocol);
+
+	if (!nlk->groups)  {
+		group = 0;
+	} else {
+		/* FIXME: temporary use bpf_probe_read here, needs
+		 * verifier support to do direct access.
+		 */
+		bpf_probe_read(&group, sizeof(group), &nlk->groups[0]);
+	}
+	BPF_SEQ_PRINTF(seq, fmt2, nlk->portid, (u32)group,
+		       s->sk_rmem_alloc.counter,
+		       s->sk_wmem_alloc.refs.counter - 1,
+		       nlk->cb_running, s->sk_refcnt.refs.counter);
+
+	sk = s->sk_socket;
+	if (!sk) {
+		ino = 0;
+	} else {
+		/* FIXME: container_of inside SOCK_INODE has a forced
+		 * type conversion, and direct access cannot be used
+		 * with current verifier.
+		 */
+		inode = SOCK_INODE(sk);
+		bpf_probe_read(&ino, sizeof(ino), &inode->i_ino);
+	}
+	BPF_SEQ_PRINTF(seq, fmt5, s->sk_drops.counter, ino);
+
+	return 0;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 18/19] tools/bpf: selftests: add iter progs for bpf_map/task/task_file
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (16 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 17/19] tools/bpf: selftests: add iterator programs for ipv6_route and netlink Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  2020-04-27 20:12 ` [PATCH bpf-next v1 19/19] tools/bpf: selftests: add bpf_iter selftests Yonghong Song
  18 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The implementation is arbitrary, just to show how bpf programs
can be written for bpf_map/task/task_file. They can be customized
for specific needs.

For example, for bpf_map, the iterator prints out:
  $ cat /sys/fs/bpf/my_bpf_map
      id   refcnt  usercnt  locked_vm
       3        2        0         20
       6        2        0         20
       9        2        0         20
      12        2        0         20
      13        2        0         20
      16        2        0         20
      19        2        0         20
      === END ===

For task, the iterator prints out:
  $ cat /sys/fs/bpf/my_task
    tgid      gid
       1        1
       2        2
    ....
    1944     1944
    1948     1948
    1949     1949
    1953     1953
    === END ===

For task/file, the iterator prints out:
  $ cat /sys/fs/bpf/my_task_file
    tgid      gid       fd      file
       1        1        0 ffffffff95c97600
       1        1        1 ffffffff95c97600
       1        1        2 ffffffff95c97600
    ....
    1895     1895      255 ffffffff95c8fe00
    1932     1932        0 ffffffff95c8fe00
    1932     1932        1 ffffffff95c8fe00
    1932     1932        2 ffffffff95c8fe00
    1932     1932        3 ffffffff95c185c0

This is able to print out all open files (fd and file->f_op), so users can
compare f_op against a particular kernel file_operations structure to find
out what kind of file it is. For example, from /proc/kallsyms, we can find
  ffffffff95c185c0 r eventfd_fops
so we know that tgid 1932 fd 3 is an eventfd file descriptor.
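
A small pipeline can resolve every printed f_op value this way. This is
just a sketch, assuming the four-column output above and a readable
/proc/kallsyms:

  $ awk 'NR > 1 {print $4}' /sys/fs/bpf/my_task_file | sort -u | \
        while read a; do grep "^$a " /proc/kallsyms; done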

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/progs/bpf_iter_bpf_map.c    | 32 +++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_task.c       | 29 +++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_task_file.c  | 28 ++++++++++++++++
 3 files changed, 89 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_bpf_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task_file.c

diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_bpf_map.c b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_map.c
new file mode 100644
index 000000000000..d4973ba4f337
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_map.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("iter/bpf_map")
+int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
+{
+	static const char banner[] = "      id   refcnt  usercnt  locked_vm\n";
+	static const char footer[] = "      === END ===\n";
+	static const char fmt[] = "%8u %8ld %8ld %10lu\n";
+	struct seq_file *seq = ctx->meta->seq;
+	__u64 seq_num = ctx->meta->seq_num;
+	struct bpf_map *map = ctx->map;
+
+	if (map == (void *)0) {
+		BPF_SEQ_PRINTF0(seq, footer);
+		return 0;
+	}
+
+	if (seq_num == 0)
+		BPF_SEQ_PRINTF0(seq, banner);
+
+	BPF_SEQ_PRINTF(seq, fmt, map->id, map->refcnt.counter,
+		       map->usercnt.counter,
+		       map->memory.user->locked_vm.counter);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task.c b/tools/testing/selftests/bpf/progs/bpf_iter_task.c
new file mode 100644
index 000000000000..78583dda3739
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_task.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("iter/task")
+int dump_tasks(struct bpf_iter__task *ctx)
+{
+	static char const banner[] = "    tgid      gid\n";
+	static char const footer[] = "=== END ===\n";
+	static char const fmt[] = "%8d %8d\n";
+	struct seq_file *seq = ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+
+	if (task == (void *)0) {
+		BPF_SEQ_PRINTF0(seq, footer);
+		return 0;
+	}
+
+	if (ctx->meta->seq_num == 0)
+		BPF_SEQ_PRINTF0(seq, banner);
+
+	BPF_SEQ_PRINTF(seq, fmt, task->tgid, task->pid);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
new file mode 100644
index 000000000000..7ade0303a1a5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("iter/task_file")
+int dump_tasks(struct bpf_iter__task_file *ctx)
+{
+	static char const banner[] = "    tgid      gid       fd      file\n";
+	static char const fmt[] = "%8d %8d %8d %lx\n";
+	struct seq_file *seq = ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+	__u32 fd = ctx->fd;
+	struct file *file = ctx->file;
+
+	if (task == (void *)0 || file == (void *)0)
+		return 0;
+
+	if (ctx->meta->seq_num == 0)
+		BPF_SEQ_PRINTF0(seq, banner);
+
+	BPF_SEQ_PRINTF(seq, fmt, task->tgid, task->pid, fd, (long)file->f_op);
+	return 0;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH bpf-next v1 19/19] tools/bpf: selftests: add bpf_iter selftests
  2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
                   ` (17 preceding siblings ...)
  2020-04-27 20:12 ` [PATCH bpf-next v1 18/19] tools/bpf: selftests: add iter progs for bpf_map/task/task_file Yonghong Song
@ 2020-04-27 20:12 ` Yonghong Song
  18 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-27 20:12 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The test includes three subtests, one for each of the previous three
verifier changes:
  - the new reg state btf_id_or_null
  - accessing fields in the variable length array of a structure
  - putting a btf_id ptr value on the stack so that it is accessible
    to tracing/iter programs

The test also exercises the workflow of creating and reading data
from an anonymous or file-based iterator. For the file-based
iterator, it further verifies that a link update can change the
underlying bpf program.

  $ test_progs -n 2
  #2/1 btf_id_or_null:OK
  #2/2 ipv6_route:OK
  #2/3 netlink:OK
  #2/4 anon:OK
  #2/5 file:OK
  #2 bpf_iter:OK
  Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 180 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_test_kern1.c |   4 +
 .../selftests/bpf/progs/bpf_iter_test_kern2.c |   4 +
 .../selftests/bpf/progs/bpf_iter_test_kern3.c |  18 ++
 .../bpf/progs/bpf_iter_test_kern_common.h     |  22 +++
 5 files changed, 228 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern1.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern2.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern3.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_test_kern_common.h

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
new file mode 100644
index 000000000000..d51ed0d99a75
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include <test_progs.h>
+#include "bpf_iter_ipv6_route.skel.h"
+#include "bpf_iter_netlink.skel.h"
+#include "bpf_iter_test_kern1.skel.h"
+#include "bpf_iter_test_kern2.skel.h"
+#include "bpf_iter_test_kern3.skel.h"
+
+static int duration;
+
+static void test_btf_id_or_null(void)
+{
+	struct bpf_iter_test_kern3 *skel;
+
+	skel = bpf_iter_test_kern3__open_and_load();
+	if (CHECK(skel, "skel_open_and_load",
+		  "skeleton open_and_load unexpectedly succeeded\n")) {
+		bpf_iter_test_kern3__destroy(skel);
+		return;
+	}
+}
+
+static void test_load_ipv6_route(void)
+{
+	struct bpf_iter_ipv6_route *skel;
+
+	skel = bpf_iter_ipv6_route__open_and_load();
+	if (CHECK(!skel, "skel_open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	bpf_iter_ipv6_route__destroy(skel);
+}
+
+static void test_load_netlink(void)
+{
+	struct bpf_iter_netlink *skel;
+
+	skel = bpf_iter_netlink__open_and_load();
+	if (CHECK(!skel, "skel_open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	bpf_iter_netlink__destroy(skel);
+}
+
+static int do_read_with_fd(int iter_fd, const char *expected)
+{
+	int err = -1, len;
+	char buf[16] = {};
+
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0) {
+		if (CHECK(len != strlen(expected), "read",
+			  "wrong read len %d\n", len))
+			return -1;
+
+		if (CHECK(err == 0, "read", "invalid additional read\n"))
+			return -1;
+
+		err = strcmp(buf, expected);
+		if (CHECK(err, "read",
+			  "incorrect read result: buf %s, expected %s\n",
+			  buf, expected))
+			return -1;
+	}
+
+	CHECK(err, "read", "missing read result\n");
+	return err;
+}
+
+static void test_anon_iter(void)
+{
+	struct bpf_iter_test_kern1 *skel;
+	struct bpf_link *link;
+	int iter_fd;
+
+	skel = bpf_iter_test_kern1__open_and_load();
+	if (CHECK(!skel, "skel_open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	link = bpf_program__attach_iter(skel->progs.dump_tasks, NULL);
+	if (CHECK(IS_ERR(link), "attach_iter", "attach_iter failed\n"))
+		goto out;
+
+	iter_fd = bpf_link__create_iter(link, 0);
+	if (CHECK(iter_fd < 0, "create_iter", "create_iter failed\n"))
+		goto free_link;
+
+	do_read_with_fd(iter_fd, "abcd");
+	close(iter_fd);
+
+free_link:
+	bpf_link__disconnect(link);
+	bpf_link__destroy(link);
+out:
+	bpf_iter_test_kern1__destroy(skel);
+}
+
+static int do_read(const char *path, const char *expected)
+{
+	int err, iter_fd;
+
+	iter_fd = open(path, O_RDONLY);
+	if (CHECK(iter_fd < 0, "open", "open %s failed: %s\n",
+		  path, strerror(errno)))
+		return -1;
+
+	err = do_read_with_fd(iter_fd, expected);
+	close(iter_fd);
+	return err;
+}
+
+static void test_file_iter(void)
+{
+	const char *path = "/sys/fs/bpf/bpf_iter_test1";
+	struct bpf_iter_test_kern1 *skel1;
+	struct bpf_iter_test_kern2 *skel2;
+	struct bpf_link *link;
+	int err;
+
+	skel1 = bpf_iter_test_kern1__open_and_load();
+	if (CHECK(!skel1, "skel_open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	link = bpf_program__attach_iter(skel1->progs.dump_tasks, NULL);
+	if (CHECK(IS_ERR(link), "attach_iter", "attach_iter failed\n"))
+		goto out;
+
+	/* unlink this path if it exists. */
+	unlink(path);
+
+	err = bpf_link__pin(link, path);
+	if (CHECK(err, "pin_iter", "pin_iter to %s failed: %s\n", path,
+		  strerror(errno)))
+		goto free_link;
+
+	err = do_read(path, "abcd");
+	if (err)
+		goto free_link;
+
+	/* The file-based iterator seems to be working fine. Let us do
+	 * a link update of the underlying link and `cat` the iterator
+	 * again; its content should change.
+	 */
+	skel2 = bpf_iter_test_kern2__open_and_load();
+	if (CHECK(!skel2, "skel_open_and_load",
+		  "skeleton open_and_load failed\n"))
+		goto free_link;
+
+	err = bpf_link__update_program(link, skel2->progs.dump_tasks);
+	if (CHECK(err, "update_prog", "update_prog failed\n"))
+		goto destroy_skel2;
+
+	do_read(path, "ABCD");
+
+destroy_skel2:
+	bpf_iter_test_kern2__destroy(skel2);
+free_link:
+	bpf_link__disconnect(link);
+	bpf_link__destroy(link);
+out:
+	bpf_iter_test_kern1__destroy(skel1);
+}
+
+void test_bpf_iter(void)
+{
+	if (test__start_subtest("btf_id_or_null"))
+		test_btf_id_or_null();
+	if (test__start_subtest("ipv6_route"))
+		test_load_ipv6_route();
+	if (test__start_subtest("netlink"))
+		test_load_netlink();
+	if (test__start_subtest("anon"))
+		test_anon_iter();
+	if (test__start_subtest("file"))
+		test_file_iter();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_test_kern1.c b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern1.c
new file mode 100644
index 000000000000..c71a7c283108
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern1.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#define START_CHAR 'a'
+#include "bpf_iter_test_kern_common.h"
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_test_kern2.c b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern2.c
new file mode 100644
index 000000000000..8bdc8dc07444
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern2.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#define START_CHAR 'A'
+#include "bpf_iter_test_kern_common.h"
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_test_kern3.c b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern3.c
new file mode 100644
index 000000000000..a52555ef2826
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern3.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("iter/task")
+int dump_tasks(struct bpf_iter__task *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+	int tgid;
+
+	tgid = task->tgid;
+	bpf_seq_write(seq, &tgid, sizeof(tgid));
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_test_kern_common.h b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern_common.h
new file mode 100644
index 000000000000..7cd9125a291f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern_common.h
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+int count = 0;
+
+SEC("iter/task")
+int dump_tasks(struct bpf_iter__task *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	char c;
+
+	if (count < 4) {
+		c = START_CHAR + count;
+		bpf_seq_write(seq, &c, sizeof(c));
+		count++;
+	}
+
+	return 0;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-27 20:12 ` [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
@ 2020-04-28  6:02     ` kbuild test robot
  0 siblings, 0 replies; 85+ messages in thread
From: kbuild test robot @ 2020-04-28  6:02 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, kernel-team

[-- Attachment #1: Type: text/plain, Size: 6453 bytes --]

Hi Yonghong,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[cannot apply to bpf/master net/master vhost/linux-next net-next/master linus/master v5.7-rc3 next-20200424]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-implement-bpf-iterator-for-kernel-data/20200428-115101
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from kernel/trace/bpf_trace.c:10:
   kernel/trace/bpf_trace.c: In function 'bpf_seq_printf':
>> kernel/trace/bpf_trace.c:463:35: warning: the frame size of 1672 bytes is larger than 1024 bytes [-Wframe-larger-than=]
     463 | BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
         |                                   ^~~~~~~~
   include/linux/filter.h:456:30: note: in definition of macro '__BPF_CAST'
     456 |           (unsigned long)0, (t)0))) a
         |                              ^
>> include/linux/filter.h:449:27: note: in expansion of macro '__BPF_MAP_5'
     449 | #define __BPF_MAP(n, ...) __BPF_MAP_##n(__VA_ARGS__)
         |                           ^~~~~~~~~~
>> include/linux/filter.h:474:35: note: in expansion of macro '__BPF_MAP'
     474 |   return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
         |                                   ^~~~~~~~~
>> include/linux/filter.h:484:31: note: in expansion of macro 'BPF_CALL_x'
     484 | #define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
         |                               ^~~~~~~~~~
>> kernel/trace/bpf_trace.c:463:1: note: in expansion of macro 'BPF_CALL_5'
     463 | BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
         | ^~~~~~~~~~

vim +463 kernel/trace/bpf_trace.c

   462	
 > 463	BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
   464		   const void *, data, u32, data_len)
   465	{
   466		char bufs[MAX_SEQ_PRINTF_VARARGS][MAX_SEQ_PRINTF_STR_LEN];
   467		u64 params[MAX_SEQ_PRINTF_VARARGS];
   468		int i, copy_size, num_args;
   469		const u64 *args = data;
   470		int fmt_cnt = 0;
   471	
   472		/*
   473		 * bpf_check()->check_func_arg()->check_stack_boundary()
   474		 * guarantees that fmt points to bpf program stack,
   475		 * fmt_size bytes of it were initialized and fmt_size > 0
   476		 */
   477		if (fmt[--fmt_size] != 0)
   478			return -EINVAL;
   479	
   480		if (data_len & 7)
   481			return -EINVAL;
   482	
   483		for (i = 0; i < fmt_size; i++) {
   484			if (fmt[i] == '%' && (!data || !data_len))
   485				return -EINVAL;
   486		}
   487	
   488		num_args = data_len / 8;
   489	
   490		/* check format string for allowed specifiers */
   491		for (i = 0; i < fmt_size; i++) {
   492			if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
   493				return -EINVAL;
   494	
   495			if (fmt[i] != '%')
   496				continue;
   497	
   498			if (fmt_cnt >= MAX_SEQ_PRINTF_VARARGS)
   499				return -E2BIG;
   500	
   501			if (fmt_cnt >= num_args)
   502				return -EINVAL;
   503	
   504			/* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
   505			i++;
   506	
   507			/* skip optional "[0+-][num]" width formating field */
   508			while (fmt[i] == '0' || fmt[i] == '+'  || fmt[i] == '-')
   509				i++;
   510			if (fmt[i] >= '1' && fmt[i] <= '9') {
   511				i++;
   512				while (fmt[i] >= '0' && fmt[i] <= '9')
   513					i++;
   514			}
   515	
   516			if (fmt[i] == 's') {
   517				/* disallow any further format extensions */
   518				if (fmt[i + 1] != 0 &&
   519				    !isspace(fmt[i + 1]) &&
   520				    !ispunct(fmt[i + 1]))
   521					return -EINVAL;
   522	
   523				/* try our best to copy */
   524				bufs[fmt_cnt][0] = 0;
   525				strncpy_from_unsafe(bufs[fmt_cnt],
   526						    (void *) (long) args[fmt_cnt],
   527						    MAX_SEQ_PRINTF_STR_LEN);
   528				params[fmt_cnt] = (u64)(long)bufs[fmt_cnt];
   529	
   530				fmt_cnt++;
   531				continue;
   532			}
   533	
   534			if (fmt[i] == 'p') {
   535				if (fmt[i + 1] == 0 ||
   536				    fmt[i + 1] == 'K' ||
   537				    fmt[i + 1] == 'x') {
   538					/* just kernel pointers */
   539					params[fmt_cnt] = args[fmt_cnt];
   540					fmt_cnt++;
   541					continue;
   542				}
   543	
   544				/* only support "%pI4", "%pi4", "%pI6" and "%pi6". */
   545				if (fmt[i + 1] != 'i' && fmt[i + 1] != 'I')
   546					return -EINVAL;
   547				if (fmt[i + 2] != '4' && fmt[i + 2] != '6')
   548					return -EINVAL;
   549	
   550				copy_size = (fmt[i + 2] == '4') ? 4 : 16;
   551	
   552				/* try our best to copy */
   553				probe_kernel_read(bufs[fmt_cnt],
   554						  (void *) (long) args[fmt_cnt], copy_size);
   555				params[fmt_cnt] = (u64)(long)bufs[fmt_cnt];
   556	
   557				i += 2;
   558				fmt_cnt++;
   559				continue;
   560			}
   561	
   562			if (fmt[i] == 'l') {
   563				i++;
   564				if (fmt[i] == 'l')
   565					i++;
   566			}
   567	
   568			if (fmt[i] != 'i' && fmt[i] != 'd' &&
   569			    fmt[i] != 'u' && fmt[i] != 'x')
   570				return -EINVAL;
   571	
   572			params[fmt_cnt] = args[fmt_cnt];
   573			fmt_cnt++;
   574		}
   575	
   576		/* At most we can have MAX_SEQ_PRINTF_VARARGS parameters, just give
   577		 * all of them to seq_printf().
   578		 */
   579		seq_printf(m, fmt, params[0], params[1], params[2], params[3],
   580			   params[4], params[5], params[6], params[7], params[8],
   581			   params[9], params[10], params[11]);
   582	
   583		return seq_has_overflowed(m) ? -EOVERFLOW : 0;
   584	}
   585	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 54692 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bpftool
  2020-04-27 20:12 ` [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bpftool Yonghong Song
@ 2020-04-28  9:27   ` Quentin Monnet
  2020-04-28 17:35     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Quentin Monnet @ 2020-04-28  9:27 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

2020-04-27 13:12 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> Currently, only one command is supported
>   bpftool iter pin <bpf_prog.o> <path>
> 
> It will pin the trace/iter bpf program in
> the object file <bpf_prog.o> to the <path>
> where <path> should be on a bpffs mount.
> 
> For example,
>   $ bpftool iter pin ./bpf_iter_ipv6_route.o \
>     /sys/fs/bpf/my_route
> User can then do a `cat` to print out the results:
>   $ cat /sys/fs/bpf/my_route
>     fe800000000000000000000000000000 40 00000000000000000000000000000000 ...
>     00000000000000000000000000000000 00 00000000000000000000000000000000 ...
>     00000000000000000000000000000001 80 00000000000000000000000000000000 ...
>     fe800000000000008c0162fffebdfd57 80 00000000000000000000000000000000 ...
>     ff000000000000000000000000000000 08 00000000000000000000000000000000 ...
>     00000000000000000000000000000000 00 00000000000000000000000000000000 ...
> 
> The implementation for ipv6_route iterator is in one of subsequent
> patches.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  .../bpftool/Documentation/bpftool-iter.rst    | 71 ++++++++++++++++
>  tools/bpf/bpftool/bash-completion/bpftool     | 13 +++
>  tools/bpf/bpftool/iter.c                      | 84 +++++++++++++++++++
>  tools/bpf/bpftool/main.c                      |  3 +-
>  tools/bpf/bpftool/main.h                      |  1 +
>  5 files changed, 171 insertions(+), 1 deletion(-)
>  create mode 100644 tools/bpf/bpftool/Documentation/bpftool-iter.rst
>  create mode 100644 tools/bpf/bpftool/iter.c
> 
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-iter.rst b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
> new file mode 100644
> index 000000000000..1997a6bac4a0
> --- /dev/null
> +++ b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
> @@ -0,0 +1,71 @@
> +============
> +bpftool-iter
> +============
> +-------------------------------------------------------------------------------
> +tool to create BPF iterators
> +-------------------------------------------------------------------------------
> +
> +:Manual section: 8
> +
> +SYNOPSIS
> +========
> +
> +	**bpftool** [*OPTIONS*] **iter** *COMMAND*
> +
> +	*COMMANDS* := { **pin** | **help** }
> +
> +STRUCT_OPS COMMANDS

s/STRUCT_OPS/ITER/

> +===================
> +
> +|	**bpftool** **iter pin** *OBJ* *PATH*
> +|	**bpftool** **struct_ops help**

s/struct_ops/iter/

> +|
> +|	*OBJ* := /a/file/of/bpf_iter_target.o
> +
> +
> +DESCRIPTION
> +===========
> +	**bpftool iter pin** *OBJ* *PATH*

Would be great to have a small blurb on what BPF iterators are and what
they can do. I'm afraid users reading this man page will have no idea
whatsoever.

> +		  Create a bpf iterator from *OBJ*, and pin it to
> +		  *PATH*. The *PATH* should be located in *bpffs* mount.

Can you keep the note that other pages have about the dot character
being forbidden in *PATH* basename, please?

> +
> +	**bpftool struct_ops help**

s/struct_ops/iter/

> +		  Print short help message.
> +
> +OPTIONS
> +=======
> +	-h, --help
> +		  Print short generic help message (similar to **bpftool help**).
> +
> +	-V, --version
> +		  Print version number (similar to **bpftool version**).
> +
> +	-d, --debug
> +		  Print all logs available, even debug-level information. This
> +		  includes logs from libbpf as well as from the verifier, when
> +		  attempting to load programs.
> +
> +EXAMPLES
> +========
> +**# bpftool iter pin bpf_iter_netlink.o /sys/fs/bpf/my_netlink**
> +
> +::
> +
> +   Create a file-based bpf iterator from bpf_iter_netlink.o and pin it
> +   to /sys/fs/bpf/my_netlink
> +
> +
> +SEE ALSO
> +========
> +	**bpf**\ (2),
> +	**bpf-helpers**\ (7),
> +	**bpftool**\ (8),
> +	**bpftool-prog**\ (8),
> +	**bpftool-map**\ (8),
> +	**bpftool-cgroup**\ (8),
> +	**bpftool-feature**\ (8),
> +	**bpftool-net**\ (8),
> +	**bpftool-perf**\ (8),
> +	**bpftool-btf**\ (8)
> +	**bpftool-gen**\ (8)
> +	**bpftool-struct_ops**\ (8)
> diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> index 45ee99b159e2..17a81695da0f 100644
> --- a/tools/bpf/bpftool/bash-completion/bpftool
> +++ b/tools/bpf/bpftool/bash-completion/bpftool
> @@ -604,6 +604,19 @@ _bpftool()
>                      ;;
>              esac
>              ;;
> +        iter)
> +            case $command in
> +                pin)
> +                    _filedir
> +                    return 0
> +                    ;;
> +                *)
> +                    [[ $prev == $object ]] && \
> +                        COMPREPLY=( $( compgen -W 'help' \
> +                            -- "$cur" ) )

You should probably offer "pin" here in addition to "help".
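
Something like this, presumably:

                    COMPREPLY=( $( compgen -W 'pin help' \
                        -- "$cur" ) )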

> +                    ;;
> +            esac
> +            ;;
>          map)
>              local MAP_TYPE='id pinned name'
>              case $command in
> diff --git a/tools/bpf/bpftool/iter.c b/tools/bpf/bpftool/iter.c
> new file mode 100644
> index 000000000000..db9fae6be716
> --- /dev/null
> +++ b/tools/bpf/bpftool/iter.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (C) 2020 Facebook
> +
> +#define _GNU_SOURCE
> +#include <linux/err.h>
> +#include <bpf/libbpf.h>
> +
> +#include "main.h"
> +
> +static int do_pin(int argc, char **argv)
> +{
> +	const char *objfile, *path;
> +	struct bpf_program *prog;
> +	struct bpf_object *obj;
> +	struct bpf_link *link;
> +	int err;

Nit: initialise err to -1 so you don't have to set it three times below?

> +
> +	if (!REQ_ARGS(2))
> +		usage();
> +
> +	objfile = GET_ARG();
> +	path = GET_ARG();
> +
> +	obj = bpf_object__open(objfile);
> +	if (IS_ERR_OR_NULL(obj)) {
> +		p_err("can't open objfile %s", objfile);
> +		return -1;
> +	}
> +
> +	err = bpf_object__load(obj);
> +	if (err < 0) {
> +		err = -1;
> +		p_err("can't load objfile %s", objfile);
> +		goto close_obj;
> +	}
> +
> +	prog = bpf_program__next(NULL, obj);
> +	link = bpf_program__attach_iter(prog, NULL);
> +	if (IS_ERR(link)) {
> +		err = -1;
> +		p_err("attach_iter failed for program %s",
> +		      bpf_program__name(prog));
> +		goto close_obj;
> +	}
> +
> +	err = bpf_link__pin(link, path);

Try to mount bpffs before that if "-n" is not passed? You could even
call do_pin_any() from common.c by passing bpf_link__fd().

> +	if (err) {
> +		err = -1;
> +		p_err("pin_iter failed for program %s to path %s",
> +		      bpf_program__name(prog), path);
> +		goto close_link;
> +	}
> +
> +	err = 0;
> +
> +close_link:
> +	bpf_link__disconnect(link);
> +	bpf_link__destroy(link);
> +close_obj:
> +	bpf_object__close(obj);
> +	return err;
> +}
> +
> +static int do_help(int argc, char **argv)
> +{
> +	fprintf(stderr,
> +		"Usage: %s %s pin OBJ PATH\n"
> +		"       %s %s help\n"
> +		"\n",
> +		bin_name, argv[-2], bin_name, argv[-2]);
> +
> +	return 0;
> +}
> +
> +static const struct cmd cmds[] = {
> +	{ "help",	do_help },
> +	{ "pin",	do_pin },
> +	{ 0 }
> +};
> +
> +int do_iter(int argc, char **argv)
> +{
> +	return cmd_select(cmds, argc, argv, do_help);
> +}
> diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
> index 466c269eabdd..6805b77789cb 100644
> --- a/tools/bpf/bpftool/main.c
> +++ b/tools/bpf/bpftool/main.c
> @@ -58,7 +58,7 @@ static int do_help(int argc, char **argv)
>  		"       %s batch file FILE\n"
>  		"       %s version\n"
>  		"\n"
> -		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops }\n"
> +		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops | iter }\n"
>  		"       " HELP_SPEC_OPTIONS "\n"
>  		"",
>  		bin_name, bin_name, bin_name);
> @@ -222,6 +222,7 @@ static const struct cmd cmds[] = {
>  	{ "btf",	do_btf },
>  	{ "gen",	do_gen },
>  	{ "struct_ops",	do_struct_ops },
> +	{ "iter",	do_iter },
>  	{ "version",	do_version },
>  	{ 0 }
>  };
> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> index 86f14ce26fd7..2b5d4a616b48 100644
> --- a/tools/bpf/bpftool/main.h
> +++ b/tools/bpf/bpftool/main.h
> @@ -162,6 +162,7 @@ int do_feature(int argc, char **argv);
>  int do_btf(int argc, char **argv);
>  int do_gen(int argc, char **argv);
>  int do_struct_ops(int argc, char **argv);
> +int do_iter(int argc, char **argv);
>  
>  int parse_u32_arg(int *argc, char ***argv, __u32 *val, const char *what);
>  int prog_parse_fd(int *argc, char ***argv);
> 

Have you considered simply adapting the more traditional workflow
"bpftool prog load && bpftool prog attach" so that it supports iterators
instead of adding a new command? It would:

- Avoid adding yet another bpftool command with a single subcommand

- Enable reuse of the code from prog load, in particular for map reuse
(I'm not sure how relevant maps are for iterators, but I wouldn't be
surprised if someone finds a use case at some point?)

- Avoid users naively trying to run "bpftool prog load && bpftool prog
attach <prog> iter" and not understanding why it fails

Quentin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets
  2020-04-27 20:12 ` [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets Yonghong Song
@ 2020-04-28 16:20   ` Martin KaFai Lau
  2020-04-28 16:50     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-28 16:20 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Mon, Apr 27, 2020 at 01:12:36PM -0700, Yonghong Song wrote:
> The target can call bpf_iter_reg_target() to register itself.
> The needed information:
> >   target:           target name, represented as a directory hierarchy
>   target_func_name: the kernel func name used by verifier to
>                     verify bpf programs
>   seq_ops:          the seq_file operations for the target
>   seq_priv_size:    the private_data size needed by the seq_file
>                     operations
>   target_feature:   certain feature requested by the target for
>                     bpf_iter to prepare for seq_file operations.
> 
> A little more explanation on the target name and target_feature.
> For example, the target name can be "bpf_map", "task", "task/file",
> which represents iterating all bpf_map's, all tasks, or all files
> of all tasks.
> 
> The target feature is mostly for reusing existing seq_file operations.
> For example, /proc/net/{tcp6, ipv6_route, netlink, ...} seq_file private
> data contains a reference to net namespace. When bpf_iter tries to
> reuse the same seq_ops, its seq_file private data need the net namespace
> setup properly too. In this case, the bpf_iter infrastructure can help
> set up properly before doing seq_file operations.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   | 11 ++++++++++
>  kernel/bpf/Makefile   |  2 +-
>  kernel/bpf/bpf_iter.c | 50 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 62 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/bpf_iter.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 10960cfabea4..5e56abc1e2f1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -31,6 +31,7 @@ struct seq_file;
>  struct btf;
>  struct btf_type;
>  struct exception_table_entry;
> +struct seq_operations;
>  
>  extern struct idr btf_idr;
>  extern spinlock_t btf_idr_lock;
> @@ -1109,6 +1110,16 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>  int bpf_obj_get_user(const char __user *pathname, int flags);
>  
> +struct bpf_iter_reg {
> +	const char *target;
> +	const char *target_func_name;
> +	const struct seq_operations *seq_ops;
> +	u32 seq_priv_size;
> +	u32 target_feature;
> +};
> +
> +int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
> +
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f2d7be596966..6a8b0febd3f6 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -2,7 +2,7 @@
>  obj-y := core.o
>  CFLAGS_core.o += $(call cc-disable-warning, override-init)
>  
> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> new file mode 100644
> index 000000000000..1115b978607a
> --- /dev/null
> +++ b/kernel/bpf/bpf_iter.c
> @@ -0,0 +1,50 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2020 Facebook */
> +
> +#include <linux/fs.h>
> +#include <linux/filter.h>
> +#include <linux/bpf.h>
> +
> +struct bpf_iter_target_info {
> +	struct list_head list;
> +	const char *target;
> +	const char *target_func_name;
> +	const struct seq_operations *seq_ops;
> +	u32 seq_priv_size;
> +	u32 target_feature;
> +};
> +
> +static struct list_head targets;
> +static struct mutex targets_mutex;
> +static bool bpf_iter_inited = false;
The "!bpf_iter_inited" test below is racy.

LIST_HEAD_INIT and DEFINE_MUTEX can be used instead.
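
i.e., roughly:

	static LIST_HEAD(targets);
	static DEFINE_MUTEX(targets_mutex);

which would also remove the need for the bpf_iter_inited flag.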

> +
> +int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
> +{
> +	struct bpf_iter_target_info *tinfo;
> +
> +	/* The earliest bpf_iter_reg_target() is called at init time
> +	 * where the bpf_iter registration is serialized.
> +	 */
> +	if (!bpf_iter_inited) {
> +		INIT_LIST_HEAD(&targets);
> +		mutex_init(&targets_mutex);
> +		bpf_iter_inited = true;
> +	}

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-28  6:02     ` kbuild test robot
@ 2020-04-28 16:35       ` Yonghong Song
  -1 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-28 16:35 UTC (permalink / raw)
  To: kbuild test robot, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/27/20 11:02 PM, kbuild test robot wrote:
> Hi Yonghong,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on bpf-next/master]
> [cannot apply to bpf/master net/master vhost/linux-next net-next/master linus/master v5.7-rc3 next-20200424]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
> 
> url:    https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-implement-bpf-iterator-for-kernel-data/20200428-115101
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> config: sh-allmodconfig (attached as .config)
> compiler: sh4-linux-gcc (GCC) 9.3.0
> reproduce:
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # save the attached .config to linux build tree
>          COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=sh
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kbuild test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>):
> 
>     In file included from kernel/trace/bpf_trace.c:10:
>     kernel/trace/bpf_trace.c: In function 'bpf_seq_printf':
>>> kernel/trace/bpf_trace.c:463:35: warning: the frame size of 1672 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>       463 | BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,

Thanks for reporting. Currently, I support up to 12 string format
specifiers, each string up to 128 bytes. To avoid races and helper
memory allocation, I put the buffers on the stack, hence the 1672-byte
frame above. Practically, though, I think supporting 4 strings of
128 bytes each is enough. I will make a change in the next revision.
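
Roughly along these lines (name and limit illustrative, not final):

	/* cap the number of on-stack string buffers instead of
	 * having one buffer per vararg
	 */
	#define MAX_SEQ_PRINTF_MAX_MEMCPY	4

	char bufs[MAX_SEQ_PRINTF_MAX_MEMCPY][MAX_SEQ_PRINTF_STR_LEN];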

>           |                                   ^~~~~~~~
>     include/linux/filter.h:456:30: note: in definition of macro '__BPF_CAST'
>       456 |           (unsigned long)0, (t)0))) a
>           |                              ^
>>> include/linux/filter.h:449:27: note: in expansion of macro '__BPF_MAP_5'
>       449 | #define __BPF_MAP(n, ...) __BPF_MAP_##n(__VA_ARGS__)
>           |                           ^~~~~~~~~~
>>> include/linux/filter.h:474:35: note: in expansion of macro '__BPF_MAP'
>       474 |   return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
>           |                                   ^~~~~~~~~
>>> include/linux/filter.h:484:31: note: in expansion of macro 'BPF_CALL_x'
>       484 | #define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
>           |                               ^~~~~~~~~~
>>> kernel/trace/bpf_trace.c:463:1: note: in expansion of macro 'BPF_CALL_5'
>       463 | BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
>           | ^~~~~~~~~~
> 
> vim +463 kernel/trace/bpf_trace.c
> 
>     462	
>   > 463	BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
>     464		   const void *, data, u32, data_len)
>     465	{
>     466		char bufs[MAX_SEQ_PRINTF_VARARGS][MAX_SEQ_PRINTF_STR_LEN];
>     467		u64 params[MAX_SEQ_PRINTF_VARARGS];
>     468		int i, copy_size, num_args;
>     469		const u64 *args = data;
>     470		int fmt_cnt = 0;
>     471	
[...]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets
  2020-04-28 16:20   ` Martin KaFai Lau
@ 2020-04-28 16:50     ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-28 16:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team



On 4/28/20 9:20 AM, Martin KaFai Lau wrote:
> On Mon, Apr 27, 2020 at 01:12:36PM -0700, Yonghong Song wrote:
>> The target can call bpf_iter_reg_target() to register itself.
>> The needed information:
>>    target:           target name, represented as a directory hierarchy
>>    target_func_name: the kernel func name used by verifier to
>>                      verify bpf programs
>>    seq_ops:          the seq_file operations for the target
>>    seq_priv_size:    the private_data size needed by the seq_file
>>                      operations
>>    target_feature:   certain feature requested by the target for
>>                      bpf_iter to prepare for seq_file operations.
>>
>> A little bit more explanations on the target name and target_feature.
>> For example, the target name can be "bpf_map", "task", "task/file",
>> which represents iterating all bpf_map's, all tasks, or all files
>> of all tasks.
>>
>> The target feature is mostly for reusing existing seq_file operations.
>> For example, /proc/net/{tcp6, ipv6_route, netlink, ...} seq_file private
>> data contains a reference to net namespace. When bpf_iter tries to
>> reuse the same seq_ops, its seq_file private data need the net namespace
>> setup properly too. In this case, the bpf_iter infrastructure can help
>> set up properly before doing seq_file operations.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h   | 11 ++++++++++
>>   kernel/bpf/Makefile   |  2 +-
>>   kernel/bpf/bpf_iter.c | 50 +++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 62 insertions(+), 1 deletion(-)
>>   create mode 100644 kernel/bpf/bpf_iter.c
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 10960cfabea4..5e56abc1e2f1 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -31,6 +31,7 @@ struct seq_file;
>>   struct btf;
>>   struct btf_type;
>>   struct exception_table_entry;
>> +struct seq_operations;
>>   
>>   extern struct idr btf_idr;
>>   extern spinlock_t btf_idr_lock;
>> @@ -1109,6 +1110,16 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>>   int bpf_obj_get_user(const char __user *pathname, int flags);
>>   
>> +struct bpf_iter_reg {
>> +	const char *target;
>> +	const char *target_func_name;
>> +	const struct seq_operations *seq_ops;
>> +	u32 seq_priv_size;
>> +	u32 target_feature;
>> +};
>> +
>> +int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
>> +
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index f2d7be596966..6a8b0febd3f6 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -2,7 +2,7 @@
>>   obj-y := core.o
>>   CFLAGS_core.o += $(call cc-disable-warning, override-init)
>>   
>> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o
>>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>>   obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> new file mode 100644
>> index 000000000000..1115b978607a
>> --- /dev/null
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -0,0 +1,50 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright (c) 2020 Facebook */
>> +
>> +#include <linux/fs.h>
>> +#include <linux/filter.h>
>> +#include <linux/bpf.h>
>> +
>> +struct bpf_iter_target_info {
>> +	struct list_head list;
>> +	const char *target;
>> +	const char *target_func_name;
>> +	const struct seq_operations *seq_ops;
>> +	u32 seq_priv_size;
>> +	u32 target_feature;
>> +};
>> +
>> +static struct list_head targets;
>> +static struct mutex targets_mutex;
>> +static bool bpf_iter_inited = false;
> The "!bpf_iter_inited" test below is racy.

Yes, as mentioned in the comments, all currently implemented
targets register at the __init stage (do_basic_setup()->do_initcalls()),
so I thought there was no race here. But looking at the
code again, I am not so sure about that assumption any more.

> 
> LIST_HEAD_INIT and DEFINE_MUTEX can be used instead.

Will use these macros instead. Thanks!
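Something like this (a sketch), which also removes the need for the
racy bpf_iter_inited check:

    static LIST_HEAD(targets);
    static DEFINE_MUTEX(targets_mutex);

so bpf_iter_reg_target() can take the mutex directly.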

> 
>> +
>> +int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>> +{
>> +	struct bpf_iter_target_info *tinfo;
>> +
>> +	/* The earliest bpf_iter_reg_target() is called at init time
>> +	 * where the bpf_iter registration is serialized.
>> +	 */
>> +	if (!bpf_iter_inited) {
>> +		INIT_LIST_HEAD(&targets);
>> +		mutex_init(&targets_mutex);
>> +		bpf_iter_inited = true;
>> +	}

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bptool
  2020-04-28  9:27   ` Quentin Monnet
@ 2020-04-28 17:35     ` Yonghong Song
  2020-04-29  8:37       ` Quentin Monnet
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-28 17:35 UTC (permalink / raw)
  To: Quentin Monnet, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/28/20 2:27 AM, Quentin Monnet wrote:
> 2020-04-27 13:12 UTC-0700 ~ Yonghong Song <yhs@fb.com>
>> Currently, only one command is supported
>>    bpftool iter pin <bpf_prog.o> <path>
>>
>> It will pin the trace/iter bpf program in
>> the object file <bpf_prog.o> to the <path>
>> where <path> should be on a bpffs mount.
>>
>> For example,
>>    $ bpftool iter pin ./bpf_iter_ipv6_route.o \
>>      /sys/fs/bpf/my_route
>> User can then do a `cat` to print out the results:
>>    $ cat /sys/fs/bpf/my_route
>>      fe800000000000000000000000000000 40 00000000000000000000000000000000 ...
>>      00000000000000000000000000000000 00 00000000000000000000000000000000 ...
>>      00000000000000000000000000000001 80 00000000000000000000000000000000 ...
>>      fe800000000000008c0162fffebdfd57 80 00000000000000000000000000000000 ...
>>      ff000000000000000000000000000000 08 00000000000000000000000000000000 ...
>>      00000000000000000000000000000000 00 00000000000000000000000000000000 ...
>>
>> The implementation for ipv6_route iterator is in one of subsequent
>> patches.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   .../bpftool/Documentation/bpftool-iter.rst    | 71 ++++++++++++++++
>>   tools/bpf/bpftool/bash-completion/bpftool     | 13 +++
>>   tools/bpf/bpftool/iter.c                      | 84 +++++++++++++++++++
>>   tools/bpf/bpftool/main.c                      |  3 +-
>>   tools/bpf/bpftool/main.h                      |  1 +
>>   5 files changed, 171 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/bpf/bpftool/Documentation/bpftool-iter.rst
>>   create mode 100644 tools/bpf/bpftool/iter.c
>>
>> diff --git a/tools/bpf/bpftool/Documentation/bpftool-iter.rst b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
>> new file mode 100644
>> index 000000000000..1997a6bac4a0
>> --- /dev/null
>> +++ b/tools/bpf/bpftool/Documentation/bpftool-iter.rst
>> @@ -0,0 +1,71 @@
>> +============
>> +bpftool-iter
>> +============
>> +-------------------------------------------------------------------------------
>> +tool to create BPF iterators
>> +-------------------------------------------------------------------------------
>> +
>> +:Manual section: 8
>> +
>> +SYNOPSIS
>> +========
>> +
>> +	**bpftool** [*OPTIONS*] **iter** *COMMAND*
>> +
>> +	*COMMANDS* := { **pin** | **help** }
>> +
>> +STRUCT_OPS COMMANDS
> 
> s/STRUCT_OPS/ITER/

Oops. copy-paste error. Will fix.

> 
>> +===================
>> +
>> +|	**bpftool** **iter pin** *OBJ* *PATH*
>> +|	**bpftool** **struct_ops help**
> 
> s/struct_ops/iter/

Will fix.

> 
>> +|
>> +|	*OBJ* := /a/file/of/bpf_iter_target.o
>> +
>> +
>> +DESCRIPTION
>> +===========
>> +	**bpftool iter pin** *OBJ* *PATH*
> 
> Would be great to have a small blurb on what BPF iterators are and what
> they can do. I'm afraid users reading this man page will have no idea
> whatsoever.

Will add.

> 
>> +		  Create a bpf iterator from *OBJ*, and pin it to
>> +		  *PATH*. The *PATH* should be located in *bpffs* mount.
> 
> Can you keep the note that other pages have about the dot character
> being forbidden in *PATH* basename, please?

Will add.

> 
>> +
>> +	**bpftool struct_ops help**
> 
> s/struct_ops/iter/

Will fix.

> 
>> +		  Print short help message.
>> +
>> +OPTIONS
>> +=======
>> +	-h, --help
>> +		  Print short generic help message (similar to **bpftool help**).
>> +
>> +	-V, --version
>> +		  Print version number (similar to **bpftool version**).
>> +
>> +	-d, --debug
>> +		  Print all logs available, even debug-level information. This
>> +		  includes logs from libbpf as well as from the verifier, when
>> +		  attempting to load programs.
>> +
>> +EXAMPLES
>> +========
>> +**# bpftool iter pin bpf_iter_netlink.o /sys/fs/bpf/my_netlink**
>> +
>> +::
>> +
>> +   Create a file-based bpf iterator from bpf_iter_netlink.o and pin it
>> +   to /sys/fs/bpf/my_netlink
>> +
>> +
>> +SEE ALSO
>> +========
>> +	**bpf**\ (2),
>> +	**bpf-helpers**\ (7),
>> +	**bpftool**\ (8),
>> +	**bpftool-prog**\ (8),
>> +	**bpftool-map**\ (8),
>> +	**bpftool-cgroup**\ (8),
>> +	**bpftool-feature**\ (8),
>> +	**bpftool-net**\ (8),
>> +	**bpftool-perf**\ (8),
>> +	**bpftool-btf**\ (8)
>> +	**bpftool-gen**\ (8)
>> +	**bpftool-struct_ops**\ (8)
>> diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
>> index 45ee99b159e2..17a81695da0f 100644
>> --- a/tools/bpf/bpftool/bash-completion/bpftool
>> +++ b/tools/bpf/bpftool/bash-completion/bpftool
>> @@ -604,6 +604,19 @@ _bpftool()
>>                       ;;
>>               esac
>>               ;;
>> +        iter)
>> +            case $command in
>> +                pin)
>> +                    _filedir
>> +                    return 0
>> +                    ;;
>> +                *)
>> +                    [[ $prev == $object ]] && \
>> +                        COMPREPLY=( $( compgen -W 'help' \
>> +                            -- "$cur" ) )
> 
> You should probably offer "pin" here in addition to "help".

Will add.

> 
>> +                    ;;
>> +            esac
>> +            ;;
>>           map)
>>               local MAP_TYPE='id pinned name'
>>               case $command in
>> diff --git a/tools/bpf/bpftool/iter.c b/tools/bpf/bpftool/iter.c
>> new file mode 100644
>> index 000000000000..db9fae6be716
>> --- /dev/null
>> +++ b/tools/bpf/bpftool/iter.c
>> @@ -0,0 +1,84 @@
>> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>> +// Copyright (C) 2020 Facebook
>> +
>> +#define _GNU_SOURCE
>> +#include <linux/err.h>
>> +#include <bpf/libbpf.h>
>> +
>> +#include "main.h"
>> +
>> +static int do_pin(int argc, char **argv)
>> +{
>> +	const char *objfile, *path;
>> +	struct bpf_program *prog;
>> +	struct bpf_object *obj;
>> +	struct bpf_link *link;
>> +	int err;
> 
> Nit: initialise err to -1 so you don't have to set it three times below?

I double-checked how cmd_select() handles the return value:
   0 : success
   non-0 : failure

Looks like I can remove two of the `err = -1` assignments below.
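i.e., roughly (a sketch of the suggestion):

    int err = -1;

    ...
    if (bpf_object__load(obj) < 0) {
        p_err("can't load objfile %s", objfile);
        goto close_obj;		/* err is still -1 */
    }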

> 
>> +
>> +	if (!REQ_ARGS(2))
>> +		usage();
>> +
>> +	objfile = GET_ARG();
>> +	path = GET_ARG();
>> +
>> +	obj = bpf_object__open(objfile);
>> +	if (IS_ERR_OR_NULL(obj)) {
>> +		p_err("can't open objfile %s", objfile);
>> +		return -1;
>> +	}
>> +
>> +	err = bpf_object__load(obj);
>> +	if (err < 0) {
>> +		err = -1;
>> +		p_err("can't load objfile %s", objfile);
>> +		goto close_obj;
>> +	}
>> +
>> +	prog = bpf_program__next(NULL, obj);
>> +	link = bpf_program__attach_iter(prog, NULL);
>> +	if (IS_ERR(link)) {
>> +		err = -1;
>> +		p_err("attach_iter failed for program %s",
>> +		      bpf_program__name(prog));
>> +		goto close_obj;
>> +	}
>> +
>> +	err = bpf_link__pin(link, path);
> 
> Try to mount bpffs before that if "-n" is not passed? You could even
> call do_pin_any() from common.c by passing bpf_link__fd().

You probably mean do_pin_fd()? That is a good suggestion, I will use
it in the next revision.
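e.g. (a sketch, assuming do_pin_fd() from common.c and bpf_link__fd()
from libbpf):

    err = do_pin_fd(bpf_link__fd(link), path);

which would also take care of mounting bpffs when -n is not passed.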

> 
>> +	if (err) {
>> +		err = -1;
>> +		p_err("pin_iter failed for program %s to path %s",
>> +		      bpf_program__name(prog), path);
>> +		goto close_link;
>> +	}
>> +
>> +	err = 0;
>> +
>> +close_link:
>> +	bpf_link__disconnect(link);
>> +	bpf_link__destroy(link);
>> +close_obj:
>> +	bpf_object__close(obj);
>> +	return err;
>> +}
>> +
>> +static int do_help(int argc, char **argv)
>> +{
>> +	fprintf(stderr,
>> +		"Usage: %s %s pin OBJ PATH\n"
>> +		"       %s %s help\n"
>> +		"\n",
>> +		bin_name, argv[-2], bin_name, argv[-2]);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct cmd cmds[] = {
>> +	{ "help",	do_help },
>> +	{ "pin",	do_pin },
>> +	{ 0 }
>> +};
>> +
>> +int do_iter(int argc, char **argv)
>> +{
>> +	return cmd_select(cmds, argc, argv, do_help);
>> +}
>> diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
>> index 466c269eabdd..6805b77789cb 100644
>> --- a/tools/bpf/bpftool/main.c
>> +++ b/tools/bpf/bpftool/main.c
>> @@ -58,7 +58,7 @@ static int do_help(int argc, char **argv)
>>   		"       %s batch file FILE\n"
>>   		"       %s version\n"
>>   		"\n"
>> -		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops }\n"
>> +		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops | iter }\n"
>>   		"       " HELP_SPEC_OPTIONS "\n"
>>   		"",
>>   		bin_name, bin_name, bin_name);
>> @@ -222,6 +222,7 @@ static const struct cmd cmds[] = {
>>   	{ "btf",	do_btf },
>>   	{ "gen",	do_gen },
>>   	{ "struct_ops",	do_struct_ops },
>> +	{ "iter",	do_iter },
>>   	{ "version",	do_version },
>>   	{ 0 }
>>   };
>> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
>> index 86f14ce26fd7..2b5d4a616b48 100644
>> --- a/tools/bpf/bpftool/main.h
>> +++ b/tools/bpf/bpftool/main.h
>> @@ -162,6 +162,7 @@ int do_feature(int argc, char **argv);
>>   int do_btf(int argc, char **argv);
>>   int do_gen(int argc, char **argv);
>>   int do_struct_ops(int argc, char **argv);
>> +int do_iter(int argc, char **argv);
>>   
>>   int parse_u32_arg(int *argc, char ***argv, __u32 *val, const char *what);
>>   int prog_parse_fd(int *argc, char ***argv);
>>
> 
> Have you considered simply adapting the more traditional workflow
> "bpftool prog load && bpftool prog attach" so that it supports iterators
> instead of adding a new command? It would:

This is a good question, I should have clarified better in the commit
message.
   - prog load && prog attach won't work.
     Creating an iterator is a three-stage process:
       1. prog load
       2. create and attach to a link
       3. pin link
     In the current implementation, the link merely holds the program.
     But in the future, the link will have other parameters like map_id,
     tgid/gid, or cgroup_id, or others.

     We could say to do:
       1. bpftool prog load <pin_path>
       2. bpftool iter pin prog file
          <maybe more parameters in the future>

     But this requires pinning the program itself in bpffs, which is
     mostly unneeded when creating a file-based iterator.

     So the command `bpftool iter pin ...` is created for ease of use.

> 
> - Avoid adding yet another bpftool command with a single subcommand

So far, yes; in the future we may have more. In my RFC patch, I had
`bpftool iter show ...` for introspection, to show all registered
targets and the prog_ids of all file-based iterators.

This patch does not have it; I left it for future work.
I am considering using a bpf iterator to do the introspection here...

> 
> - Enable to reuse the code from prog load, in particular for map reuse
> (I'm not sure how relevant maps are for iterators, but I wouldn't be
> surprised if someone finds a use case at some point?)

Yes, we do plan to have map element iterators. We can also have
bpf_prog or other iterators. The map element iterator should be
implemented in the `bpftool map` code base, since it is a use of
the bpf_iter infrastructure.

> 
> - Avoid users naively trying to run "bpftool prog load && bpftool prog
> attach <prog> iter" and not understanding why it fails

`bpftool prog attach <prog> [map_id]` is mostly used to attach a
program to a map, right? In that case, it won't apply, right?

BTW, Thanks for reviewing and catching my mistakes!

> 
> Quentin
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets
  2020-04-27 20:12 ` [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets Yonghong Song
@ 2020-04-28 19:49     ` kbuild test robot
  2020-04-28 19:50     ` kbuild test robot
  1 sibling, 0 replies; 85+ messages in thread
From: kbuild test robot @ 2020-04-28 19:49 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, kernel-team

Hi Yonghong,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[also build test WARNING on bpf/master net/master vhost/linux-next net-next/master linus/master v5.7-rc3 next-20200428]
[cannot apply to ipvs/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-implement-bpf-iterator-for-kernel-data/20200428-115101
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-191-gc51a0382-dirty
        make ARCH=x86_64 allmodconfig
        make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> net/netlink/af_netlink.c:2643:12: sparse: sparse: symbol '__bpf_iter__netlink' was not declared. Should it be static?
   net/netlink/af_netlink.c:2534:13: sparse: sparse: context imbalance in 'netlink_walk_start' - wrong count at exit
   net/netlink/af_netlink.c:2540:13: sparse: sparse: context imbalance in 'netlink_walk_stop' - unexpected unlock
   net/netlink/af_netlink.c:2576:13: sparse: sparse: context imbalance in 'netlink_seq_start' - wrong count at exit

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [RFC PATCH] bpf: __bpf_iter__netlink() can be static
  2020-04-27 20:12 ` [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets Yonghong Song
@ 2020-04-28 19:50     ` kbuild test robot
  2020-04-28 19:50     ` kbuild test robot
  1 sibling, 0 replies; 85+ messages in thread
From: kbuild test robot @ 2020-04-28 19:50 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, kernel-team


Signed-off-by: kbuild test robot <lkp@intel.com>
---
 af_netlink.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b6192cd668013..b8c9a87bd3960 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2640,7 +2640,7 @@ struct bpf_iter__netlink {
 	__bpf_md_ptr(struct netlink_sock *, sk);
 };
 
-int __init __bpf_iter__netlink(struct bpf_iter_meta *meta, struct netlink_sock *sk)
+static int __init __bpf_iter__netlink(struct bpf_iter_meta *meta, struct netlink_sock *sk)
 {
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-27 20:12 ` [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator Yonghong Song
@ 2020-04-29  0:37   ` Martin KaFai Lau
  2020-04-29  0:48     ` Alexei Starovoitov
  2020-04-29  1:02     ` Yonghong Song
  2020-04-29  6:04   ` Andrii Nakryiko
  1 sibling, 2 replies; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  0:37 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Mon, Apr 27, 2020 at 01:12:37PM -0700, Yonghong Song wrote:
> The bpf_map iterator is implemented.
> The bpf program is called at seq_ops show() and stop() functions.
> bpf_iter_get_prog() will retrieve bpf program and other
> parameters during seq_file object traversal. In show() function,
> bpf program will traverse every valid object, and in stop()
> function, bpf program will be called one more time after all
> objects are traversed.
> 
> The first member of the bpf context contains the meta data, namely,
> the seq_file, session_id and seq_num. Here, the session_id is
> a unique id for one specific seq_file session. The seq_num is
> the number of bpf prog invocations in the current session.
> The bpf_iter_get_prog(), which will be implemented in subsequent
> patches, will have more information on how meta data are computed.
> 
> The second member of the bpf context is a struct bpf_map pointer,
> which bpf program can examine.
> 
> The target implementation also provided the structure definition
> for bpf program and the function definition for verifier to
> verify the bpf program. Specifically for bpf_map iterator,
> the structure is "bpf_iter__bpf_map" and the function is
> "__bpf_iter__bpf_map".
> 
> More targets will be implemented later, all of which will include
> the following, similar to bpf_map iterator:
>   - seq_ops() implementation
>   - function definition for verifier to verify the bpf program
>   - seq_file private data size
>   - additional target feature
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  10 ++++
>  kernel/bpf/Makefile   |   2 +-
>  kernel/bpf/bpf_iter.c |  19 ++++++++
>  kernel/bpf/map_iter.c | 107 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c  |  13 +++++
>  5 files changed, 150 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/map_iter.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5e56abc1e2f1..4ac8d61f7c3e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1078,6 +1078,7 @@ int  generic_map_update_batch(struct bpf_map *map,
>  int  generic_map_delete_batch(struct bpf_map *map,
>  			      const union bpf_attr *attr,
>  			      union bpf_attr __user *uattr);
> +struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
>  
>  extern int sysctl_unprivileged_bpf_disabled;
>  
> @@ -1118,7 +1119,16 @@ struct bpf_iter_reg {
>  	u32 target_feature;
>  };
>  
> +struct bpf_iter_meta {
> +	__bpf_md_ptr(struct seq_file *, seq);
> +	u64 session_id;
> +	u64 seq_num;
> +};
> +
>  int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
> +struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
> +				   u64 *session_id, u64 *seq_num, bool is_last);
> +int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>  
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 6a8b0febd3f6..b2b5eefc5254 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -2,7 +2,7 @@
>  obj-y := core.o
>  CFLAGS_core.o += $(call cc-disable-warning, override-init)
>  
> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o
> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 1115b978607a..284c95587803 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -48,3 +48,22 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>  
>  	return 0;
>  }
> +
> +struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
> +				   u64 *session_id, u64 *seq_num, bool is_last)
> +{
> +	return NULL;
Can this patch be moved after this function is implemented?

> +}
> +
> +int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
> +{
> +	int ret;
> +
> +	migrate_disable();
> +	rcu_read_lock();
> +	ret = BPF_PROG_RUN(prog, ctx);
> +	rcu_read_unlock();
> +	migrate_enable();
> +
> +	return ret;
> +}
> diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c
> new file mode 100644
> index 000000000000..bb3ad4c3bde5
> --- /dev/null
> +++ b/kernel/bpf/map_iter.c
> @@ -0,0 +1,107 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2020 Facebook */
> +#include <linux/bpf.h>
> +#include <linux/fs.h>
> +#include <linux/filter.h>
> +#include <linux/kernel.h>
> +
> +struct bpf_iter_seq_map_info {
> +	struct bpf_map *map;
> +	u32 id;
> +};
> +
> +static void *bpf_map_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct bpf_iter_seq_map_info *info = seq->private;
> +	struct bpf_map *map;
> +	u32 id = info->id;
> +
> +	map = bpf_map_get_curr_or_next(&id);
> +	if (IS_ERR_OR_NULL(map))
> +		return NULL;
> +
> +	++*pos;
Does pos always need to be incremented here?

> +	info->map = map;
> +	info->id = id;
> +	return map;
> +}
> +
> +static void *bpf_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_iter_seq_map_info *info = seq->private;
> +	struct bpf_map *map;
> +
> +	++*pos;
> +	++info->id;
> +	map = bpf_map_get_curr_or_next(&info->id);
> +	if (IS_ERR_OR_NULL(map))
> +		return NULL;
> +
> +	bpf_map_put(info->map);
> +	info->map = map;
> +	return map;
> +}
> +
> +struct bpf_iter__bpf_map {
> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> +	__bpf_md_ptr(struct bpf_map *, map);
> +};
> +
> +int __init __bpf_iter__bpf_map(struct bpf_iter_meta *meta, struct bpf_map *map)
> +{
> +	return 0;
> +}
> +
> +static int bpf_map_seq_show(struct seq_file *seq, void *v)
> +{
> +	struct bpf_iter_meta meta;
> +	struct bpf_iter__bpf_map ctx;
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	ctx.meta = &meta;
> +	ctx.map = v;
> +	meta.seq = seq;
> +	prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
> +				 &meta.session_id, &meta.seq_num,
> +				 v == (void *)0);
From looking at seq_file.c, when will show() be called with "v == NULL"?

> +	if (prog)
> +		ret = bpf_iter_run_prog(prog, &ctx);
> +
> +	return ret == 0 ? 0 : -EINVAL;
The verifier change in patch 4 should have ensured that prog
can only return 0?

> +}
> +
> +static void bpf_map_seq_stop(struct seq_file *seq, void *v)
> +{
> +	struct bpf_iter_seq_map_info *info = seq->private;
> +
> +	if (!v)
> +		bpf_map_seq_show(seq, v);
> +
> +	if (info->map) {
> +		bpf_map_put(info->map);
> +		info->map = NULL;
> +	}
> +}
> +
> +static const struct seq_operations bpf_map_seq_ops = {
> +	.start	= bpf_map_seq_start,
> +	.next	= bpf_map_seq_next,
> +	.stop	= bpf_map_seq_stop,
> +	.show	= bpf_map_seq_show,
> +};
> +
> +static int __init bpf_map_iter_init(void)
> +{
> +	struct bpf_iter_reg reg_info = {
> +		.target			= "bpf_map",
> +		.target_func_name	= "__bpf_iter__bpf_map",
> +		.seq_ops		= &bpf_map_seq_ops,
> +		.seq_priv_size		= sizeof(struct bpf_iter_seq_map_info),
> +		.target_feature		= 0,
> +	};
> +
> +	return bpf_iter_reg_target(&reg_info);
> +}
> +
> +late_initcall(bpf_map_iter_init);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 7626b8024471..022187640943 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2800,6 +2800,19 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
>  	return err;
>  }
>  
> +struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
> +{
> +	struct bpf_map *map;
> +
> +	spin_lock_bh(&map_idr_lock);
> +	map = idr_get_next(&map_idr, id);
> +	if (map)
> +		map = __bpf_map_inc_not_zero(map, false);
nit. For the !map case, set "map = ERR_PTR(-ENOENT)" so that
the _OR_NULL() test is not needed.  It will be more consistent
with other error checking codes in syscall.c.
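i.e. (a sketch):

	map = idr_get_next(&map_idr, id);
	if (map)
		map = __bpf_map_inc_not_zero(map, false);
	else
		map = ERR_PTR(-ENOENT);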

> +	spin_unlock_bh(&map_idr_lock);
> +
> +	return map;
> +}
> +
>  #define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
>  
>  struct bpf_prog *bpf_prog_by_id(u32 id)
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  0:37   ` Martin KaFai Lau
@ 2020-04-29  0:48     ` Alexei Starovoitov
  2020-04-29  1:15       ` Yonghong Song
  2020-04-29  1:02     ` Yonghong Song
  1 sibling, 1 reply; 85+ messages in thread
From: Alexei Starovoitov @ 2020-04-29  0:48 UTC (permalink / raw)
  To: Martin KaFai Lau, Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Daniel Borkmann, kernel-team

On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>> +	prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
>> +				 &meta.session_id, &meta.seq_num,
>> +				 v == (void *)0);
>  From looking at seq_file.c, when will show() be called with "v == NULL"?
> 

that v == NULL here and the whole verifier change just to allow NULL...
maybe use seq_num as an indicator of the last elem instead?
Like seq_num with the upper bit set to indicate that it's the last one?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program
  2020-04-27 20:12 ` [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program Yonghong Song
@ 2020-04-29  0:54   ` Martin KaFai Lau
  2020-04-29  1:27     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  0:54 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Mon, Apr 27, 2020 at 01:12:39PM -0700, Yonghong Song wrote:
> A bpf_iter program is a tracing program with attach type
> BPF_TRACE_ITER. The load attribute
>   attach_btf_id
> is used by the verifier against a particular kernel function,
> e.g., __bpf_iter__bpf_map in our previous bpf_map iterator.
> 
> The program return value must be 0 for now. In the
> future, other return values may be used for filtering or
> terminating the iterator.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/uapi/linux/bpf.h       |  1 +
>  kernel/bpf/verifier.c          | 20 ++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  1 +
>  3 files changed, 22 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4a6c47f3febe..f39b9fec37ab 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -215,6 +215,7 @@ enum bpf_attach_type {
>  	BPF_TRACE_FEXIT,
>  	BPF_MODIFY_RETURN,
>  	BPF_LSM_MAC,
> +	BPF_TRACE_ITER,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 91728e0f27eb..fd36c22685d9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7074,6 +7074,11 @@ static int check_return_code(struct bpf_verifier_env *env)
>  			return 0;
>  		range = tnum_const(0);
>  		break;
> +	case BPF_PROG_TYPE_TRACING:
> +		if (env->prog->expected_attach_type != BPF_TRACE_ITER)
> +			return 0;
> +		range = tnum_const(0);
> +		break;
>  	default:
>  		return 0;
>  	}
> @@ -10454,6 +10459,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>  	struct bpf_prog *tgt_prog = prog->aux->linked_prog;
>  	u32 btf_id = prog->aux->attach_btf_id;
>  	const char prefix[] = "btf_trace_";
> +	struct btf_func_model fmodel;
>  	int ret = 0, subprog = -1, i;
>  	struct bpf_trampoline *tr;
>  	const struct btf_type *t;
> @@ -10595,6 +10601,20 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>  		prog->aux->attach_func_proto = t;
>  		prog->aux->attach_btf_trace = true;
>  		return 0;
> +	case BPF_TRACE_ITER:
> +		if (!btf_type_is_func(t)) {
> +			verbose(env, "attach_btf_id %u is not a function\n",
> +				btf_id);
> +			return -EINVAL;
> +		}
> +		t = btf_type_by_id(btf, t->type);
> +		if (!btf_type_is_func_proto(t))
Other than the type tests,
to ensure the attach_btf_id is a supported bpf_iter target,
should the prog be checked against the target list
("struct list_head targets") here also during the prog load time?

> +			return -EINVAL;
> +		prog->aux->attach_func_name = tname;
> +		prog->aux->attach_func_proto = t;
> +		ret = btf_distill_func_proto(&env->log, btf, t,
> +					     tname, &fmodel);
> +		return ret;
>  	default:
>  		if (!prog_extension)
>  			return -EINVAL;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  0:37   ` Martin KaFai Lau
  2020-04-29  0:48     ` Alexei Starovoitov
@ 2020-04-29  1:02     ` Yonghong Song
  1 sibling, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  1:02 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team



On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> On Mon, Apr 27, 2020 at 01:12:37PM -0700, Yonghong Song wrote:
>> The bpf_map iterator is implemented.
>> The bpf program is called at seq_ops show() and stop() functions.
>> bpf_iter_get_prog() will retrieve bpf program and other
>> parameters during seq_file object traversal. In show() function,
>> bpf program will traverse every valid object, and in stop()
>> function, bpf program will be called one more time after all
>> objects are traversed.
>>
>> The first member of the bpf context contains the meta data, namely,
>> the seq_file, session_id and seq_num. Here, the session_id is
>> a unique id for one specific seq_file session. The seq_num is
>> the number of bpf prog invocations in the current session.
>> The bpf_iter_get_prog(), which will be implemented in subsequent
>> patches, will have more information on how meta data are computed.
>>
>> The second member of the bpf context is a struct bpf_map pointer,
>> which bpf program can examine.
>>
>> The target implementation also provided the structure definition
>> for bpf program and the function definition for verifier to
>> verify the bpf program. Specifically for bpf_map iterator,
>> the structure is "bpf_iter__bpf_map" and the function is
>> "__bpf_iter__bpf_map".
>>
>> More targets will be implemented later, all of which will include
>> the following, similar to bpf_map iterator:
>>    - seq_ops() implementation
>>    - function definition for verifier to verify the bpf program
>>    - seq_file private data size
>>    - additional target feature
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h   |  10 ++++
>>   kernel/bpf/Makefile   |   2 +-
>>   kernel/bpf/bpf_iter.c |  19 ++++++++
>>   kernel/bpf/map_iter.c | 107 ++++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c  |  13 +++++
>>   5 files changed, 150 insertions(+), 1 deletion(-)
>>   create mode 100644 kernel/bpf/map_iter.c
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 5e56abc1e2f1..4ac8d61f7c3e 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1078,6 +1078,7 @@ int  generic_map_update_batch(struct bpf_map *map,
>>   int  generic_map_delete_batch(struct bpf_map *map,
>>   			      const union bpf_attr *attr,
>>   			      union bpf_attr __user *uattr);
>> +struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
>>   
>>   extern int sysctl_unprivileged_bpf_disabled;
>>   
>> @@ -1118,7 +1119,16 @@ struct bpf_iter_reg {
>>   	u32 target_feature;
>>   };
>>   
>> +struct bpf_iter_meta {
>> +	__bpf_md_ptr(struct seq_file *, seq);
>> +	u64 session_id;
>> +	u64 seq_num;
>> +};
>> +
>>   int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
>> +struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>> +				   u64 *session_id, u64 *seq_num, bool is_last);
>> +int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>>   
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index 6a8b0febd3f6..b2b5eefc5254 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -2,7 +2,7 @@
>>   obj-y := core.o
>>   CFLAGS_core.o += $(call cc-disable-warning, override-init)
>>   
>> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o
>> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o
>>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>>   obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> index 1115b978607a..284c95587803 100644
>> --- a/kernel/bpf/bpf_iter.c
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -48,3 +48,22 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>>   
>>   	return 0;
>>   }
>> +
>> +struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>> +				   u64 *session_id, u64 *seq_num, bool is_last)
>> +{
>> +	return NULL;
> Can this patch be moved after this function is implemented?

I tried to provide an example of how registration looks, so I put
the bpf_map iterator implementation patch immediately after the
bpf_iter_reg_target() patch. Unfortunately, to make the iterator
implementation complete so it compiles, I need this function to be
implemented here.

I guess I can delay this patch until I can properly
implement it, just like my RFC v2.

> 
>> +}
>> +
>> +int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
>> +{
>> +	int ret;
>> +
>> +	migrate_disable();
>> +	rcu_read_lock();
>> +	ret = BPF_PROG_RUN(prog, ctx);
>> +	rcu_read_unlock();
>> +	migrate_enable();
>> +
>> +	return ret;
>> +}
>> diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c
>> new file mode 100644
>> index 000000000000..bb3ad4c3bde5
>> --- /dev/null
>> +++ b/kernel/bpf/map_iter.c
>> @@ -0,0 +1,107 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright (c) 2020 Facebook */
>> +#include <linux/bpf.h>
>> +#include <linux/fs.h>
>> +#include <linux/filter.h>
>> +#include <linux/kernel.h>
>> +
>> +struct bpf_iter_seq_map_info {
>> +	struct bpf_map *map;
>> +	u32 id;
>> +};
>> +
>> +static void *bpf_map_seq_start(struct seq_file *seq, loff_t *pos)
>> +{
>> +	struct bpf_iter_seq_map_info *info = seq->private;
>> +	struct bpf_map *map;
>> +	u32 id = info->id;
>> +
>> +	map = bpf_map_get_curr_or_next(&id);
>> +	if (IS_ERR_OR_NULL(map))
>> +		return NULL;
>> +
>> +	++*pos;
> Does pos always need to be incremented here?

Yes, I skipped passing SEQ_START_TOKEN to show(). Put another way,
the bpf program won't be called for SEQ_START_TOKEN, so I took a
shortcut here.
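For contrast, the classic seq_file pattern this shortcut avoids would
look roughly like (a sketch):

    if (*pos == 0)
        return SEQ_START_TOKEN;	/* show() would then have to
				 * special-case this token */

Incrementing *pos in start() means show() only ever sees real objects.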

> 
>> +	info->map = map;
>> +	info->id = id;
>> +	return map;
>> +}
>> +
>> +static void *bpf_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> +	struct bpf_iter_seq_map_info *info = seq->private;
>> +	struct bpf_map *map;
>> +
>> +	++*pos;
>> +	++info->id;
>> +	map = bpf_map_get_curr_or_next(&info->id);
>> +	if (IS_ERR_OR_NULL(map))
>> +		return NULL;
>> +
>> +	bpf_map_put(info->map);
>> +	info->map = map;
>> +	return map;
>> +}
>> +
>> +struct bpf_iter__bpf_map {
>> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
>> +	__bpf_md_ptr(struct bpf_map *, map);
>> +};
>> +
>> +int __init __bpf_iter__bpf_map(struct bpf_iter_meta *meta, struct bpf_map *map)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int bpf_map_seq_show(struct seq_file *seq, void *v)
>> +{
>> +	struct bpf_iter_meta meta;
>> +	struct bpf_iter__bpf_map ctx;
>> +	struct bpf_prog *prog;
>> +	int ret = 0;
>> +
>> +	ctx.meta = &meta;
>> +	ctx.map = v;
>> +	meta.seq = seq;
>> +	prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
>> +				 &meta.session_id, &meta.seq_num,
>> +				 v == (void *)0);
>  From looking at seq_file.c, when will show() be called with "v == NULL"?

In the stop() function.

> 
>> +	if (prog)
>> +		ret = bpf_iter_run_prog(prog, &ctx);
>> +
>> +	return ret == 0 ? 0 : -EINVAL;
> The verifier change in patch 4 should have ensured that prog
> can only return 0?

Yes. I forgot to update this after adding the verifier enforcement
at the last minute. I can do
	if (prog)
		bpf_iter_run_prog(prog, &ctx);

	return 0;

> 
>> +}
>> +
>> +static void bpf_map_seq_stop(struct seq_file *seq, void *v)
>> +{
>> +	struct bpf_iter_seq_map_info *info = seq->private;
>> +
>> +	if (!v)
>> +		bpf_map_seq_show(seq, v);

The bpf program for the NULL object is called here.

>> +
>> +	if (info->map) {
>> +		bpf_map_put(info->map);
>> +		info->map = NULL;
>> +	}
>> +}
>> +
>> +static const struct seq_operations bpf_map_seq_ops = {
>> +	.start	= bpf_map_seq_start,
>> +	.next	= bpf_map_seq_next,
>> +	.stop	= bpf_map_seq_stop,
>> +	.show	= bpf_map_seq_show,
>> +};
>> +
>> +static int __init bpf_map_iter_init(void)
>> +{
>> +	struct bpf_iter_reg reg_info = {
>> +		.target			= "bpf_map",
>> +		.target_func_name	= "__bpf_iter__bpf_map",
>> +		.seq_ops		= &bpf_map_seq_ops,
>> +		.seq_priv_size		= sizeof(struct bpf_iter_seq_map_info),
>> +		.target_feature		= 0,
>> +	};
>> +
>> +	return bpf_iter_reg_target(&reg_info);
>> +}
>> +
>> +late_initcall(bpf_map_iter_init);
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 7626b8024471..022187640943 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2800,6 +2800,19 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
>>   	return err;
>>   }
>>   
>> +struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
>> +{
>> +	struct bpf_map *map;
>> +
>> +	spin_lock_bh(&map_idr_lock);
>> +	map = idr_get_next(&map_idr, id);
>> +	if (map)
>> +		map = __bpf_map_inc_not_zero(map, false);
> nit. For the !map case, set "map = ERR_PTR(-ENOENT)" so that
> the _OR_NULL() test is not needed.  It will be more consistent
> with other error checking codes in syscall.c.

Good point, will do that.

> 
>> +	spin_unlock_bh(&map_idr_lock);
>> +
>> +	return map;
>> +}
>> +
>>   #define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
>>   
>>   struct bpf_prog *bpf_prog_by_id(u32 id)
>> -- 
>> 2.24.1
>>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  0:48     ` Alexei Starovoitov
@ 2020-04-29  1:15       ` Yonghong Song
  2020-04-29  2:44         ` Alexei Starovoitov
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  1:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Daniel Borkmann, kernel-team



On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
>>> +                 &meta.session_id, &meta.seq_num,
>>> +                 v == (void *)0);
>>  From looking at seq_file.c, when will show() be called with "v == NULL"?
>>
> 
> that v == NULL here and the whole verifier change just to allow NULL...
> may be use seq_num as an indicator of the last elem instead?
> Like seq_num with upper bit set to indicate that it's last?

We could. But then the verifier won't have an easy way to verify that.
For example, the following is expected:

      int prog(struct bpf_map *map, u64 seq_num) {
         if (seq_num >> 63)
           return 0;
         ... map->id ...
         ... map->user_cnt ...
      }

But if user writes

      int prog(struct bpf_map *map, u64 seq_num) {
          ... map->id ...
          ... map->user_cnt ...
      }

it won't be easy for the verifier to conclude that the map pointer
tracing is improper here, and the above map->id and map->user_cnt
accesses will cause exceptions and silently get value 0.

I do have another potential use case for this ptr_to_btf_id_or_null,
e.g., for tcp6, instead of pointer casting, I could have bpf_prog
like
     int prog(..., struct tcp6_sock *tcp_sk,
              struct timewait_sock *tw_sk, struct request_sock *req_sk) {
         if (tcp_sk) { /* dump tcp_sk ... */ }
         if (tw_sk) { /* dump tw_sk ... */ }
         if (req_sk) { /* dump req_sk ... */ }
     }
The kernel infrastructure will ensure that at any time only one
of tcp_sk/tw_sk/req_sk is valid and the other two are NULL.





^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [Potential Spoof] [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE
  2020-04-27 20:12 ` [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE Yonghong Song
@ 2020-04-29  1:17   ` Martin KaFai Lau
  2020-04-29  6:25   ` Andrii Nakryiko
  1 sibling, 0 replies; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  1:17 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Mon, Apr 27, 2020 at 01:12:40PM -0700, Yonghong Song wrote:
> Given a bpf program, the step to create an anonymous bpf iterator is:
>   - create a bpf_iter_link, which combines bpf program and the target.
>     In the future, there could be more information recorded in the link.
>     A link_fd will be returned to the user space.
>   - create an anonymous bpf iterator with the given link_fd.
> 
> The anonymous bpf iterator (and its underlying bpf_link) will be
> used to create file based bpf iterator as well.
> 
> The benefit to use of bpf_iter_link:
>   - for file based bpf iterator, bpf_iter_link provides a standard
>     way to replace underlying bpf programs.
>   - for both anonymous and free based iterators, bpf link query
>     capability can be leveraged.
> 
> The patch added support of tracing/iter programs for  BPF_LINK_CREATE.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  2 ++
>  kernel/bpf/bpf_iter.c | 54 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c  | 15 ++++++++++++
>  3 files changed, 71 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4ac8d61f7c3e..60ecb73d8f6d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1034,6 +1034,7 @@ extern const struct file_operations bpf_prog_fops;
>  extern const struct bpf_prog_ops bpf_offload_prog_ops;
>  extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
>  extern const struct bpf_verifier_ops xdp_analyzer_ops;
> +extern const struct bpf_link_ops bpf_iter_link_lops;
>  
>  struct bpf_prog *bpf_prog_get(u32 ufd);
>  struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
> @@ -1129,6 +1130,7 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
>  struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>  				   u64 *session_id, u64 *seq_num, bool is_last);
>  int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
> +int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 284c95587803..9532e7bcb8e1 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -14,6 +14,11 @@ struct bpf_iter_target_info {
>  	u32 target_feature;
>  };
>  
> +struct bpf_iter_link {
> +	struct bpf_link link;
> +	struct bpf_iter_target_info *tinfo;
> +};
> +
>  static struct list_head targets;
>  static struct mutex targets_mutex;
>  static bool bpf_iter_inited = false;
> @@ -67,3 +72,52 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
>  
>  	return ret;
>  }
> +
> +static void bpf_iter_link_release(struct bpf_link *link)
> +{
> +}
> +
> +static void bpf_iter_link_dealloc(struct bpf_link *link)
> +{
> +}
> +
> +const struct bpf_link_ops bpf_iter_link_lops = {
> +	.release = bpf_iter_link_release,
> +	.dealloc = bpf_iter_link_dealloc,
> +};
> +
> +int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_iter_target_info *tinfo;
> +	struct bpf_iter_link *link;
> +	const char *func_name;
> +	bool existed = false;
> +	int err;
> +
> +	if (attr->link_create.target_fd || attr->link_create.flags)
> +		return -EINVAL;
> +
> +	func_name = prog->aux->attach_func_name;
> +	mutex_lock(&targets_mutex);
> +	list_for_each_entry(tinfo, &targets, list) {
> +		if (!strcmp(tinfo->target_func_name, func_name)) {
This can be done in prog load time.

Also, is it better to store a btf_id at tinfo instead of doing strcmp here?
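
Something like this (untested sketch; assumes a btf_id field is
added to struct bpf_iter_target_info and recorded at registration
time):

		if (tinfo->btf_id == prog->aux->attach_btf_id) {
			existed = true;
			break;
		}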

> +			existed = true;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&targets_mutex);
> +	if (!existed)
> +		return -ENOENT;
> +
> +	link = kzalloc(sizeof(*link), GFP_USER | __GFP_NOWARN);
> +	if (!link)
> +		return -ENOMEM;
> +
> +	bpf_link_init(&link->link, &bpf_iter_link_lops, prog);
> +	link->tinfo = tinfo;
> +
> +	err = bpf_link_new_fd(&link->link);
> +	if (err < 0)
> +		kfree(link);
> +	return err;
> +}
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 022187640943..8741b5e11c85 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2269,6 +2269,8 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
>  	else if (link->ops == &bpf_cgroup_link_lops)
>  		link_type = "cgroup";
>  #endif
> +	else if (link->ops == &bpf_iter_link_lops)
> +		link_type = "iter";
>  	else
>  		link_type = "unknown";
>  
> @@ -2597,6 +2599,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>  	case BPF_CGROUP_GETSOCKOPT:
>  	case BPF_CGROUP_SETSOCKOPT:
>  		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
> +	case BPF_TRACE_ITER:
> +		return BPF_PROG_TYPE_TRACING;
>  	default:
>  		return BPF_PROG_TYPE_UNSPEC;
>  	}
> @@ -3571,6 +3575,14 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
>  	return err;
>  }
>  
> +static int tracing_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	if (attr->link_create.attach_type == BPF_TRACE_ITER)
Has prog->expected_attach_type been checked also?
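
E.g. something like (untested sketch):

	if (prog->expected_attach_type != BPF_TRACE_ITER)
		return -EINVAL;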

> +		return bpf_iter_link_attach(attr, prog);
> +
> +	return -EINVAL;
> +}
> +
>  #define BPF_LINK_CREATE_LAST_FIELD link_create.flags
>  static int link_create(union bpf_attr *attr)
>  {
> @@ -3607,6 +3619,9 @@ static int link_create(union bpf_attr *attr)
>  	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>  		ret = cgroup_bpf_link_attach(attr, prog);
>  		break;
> +	case BPF_PROG_TYPE_TRACING:
> +		ret = tracing_bpf_link_attach(attr, prog);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  	}
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program
  2020-04-29  0:54   ` Martin KaFai Lau
@ 2020-04-29  1:27     ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  1:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team



On 4/28/20 5:54 PM, Martin KaFai Lau wrote:
> On Mon, Apr 27, 2020 at 01:12:39PM -0700, Yonghong Song wrote:
>> A bpf_iter program is a tracing program with attach type
>> BPF_TRACE_ITER. The load attribute
>>    attach_btf_id
>> is used by the verifier against a particular kernel function,
>> e.g., __bpf_iter__bpf_map in our previous bpf_map iterator.
>>
>> The program return value must be 0 for now. In the
>> future, other return values may be used for filtering or
>> terminating the iterator.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/verifier.c          | 20 ++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   3 files changed, 22 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 4a6c47f3febe..f39b9fec37ab 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -215,6 +215,7 @@ enum bpf_attach_type {
>>   	BPF_TRACE_FEXIT,
>>   	BPF_MODIFY_RETURN,
>>   	BPF_LSM_MAC,
>> +	BPF_TRACE_ITER,
>>   	__MAX_BPF_ATTACH_TYPE
>>   };
>>   
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 91728e0f27eb..fd36c22685d9 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -7074,6 +7074,11 @@ static int check_return_code(struct bpf_verifier_env *env)
>>   			return 0;
>>   		range = tnum_const(0);
>>   		break;
>> +	case BPF_PROG_TYPE_TRACING:
>> +		if (env->prog->expected_attach_type != BPF_TRACE_ITER)
>> +			return 0;
>> +		range = tnum_const(0);
>> +		break;
>>   	default:
>>   		return 0;
>>   	}
>> @@ -10454,6 +10459,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>>   	struct bpf_prog *tgt_prog = prog->aux->linked_prog;
>>   	u32 btf_id = prog->aux->attach_btf_id;
>>   	const char prefix[] = "btf_trace_";
>> +	struct btf_func_model fmodel;
>>   	int ret = 0, subprog = -1, i;
>>   	struct bpf_trampoline *tr;
>>   	const struct btf_type *t;
>> @@ -10595,6 +10601,20 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>>   		prog->aux->attach_func_proto = t;
>>   		prog->aux->attach_btf_trace = true;
>>   		return 0;
>> +	case BPF_TRACE_ITER:
>> +		if (!btf_type_is_func(t)) {
>> +			verbose(env, "attach_btf_id %u is not a function\n",
>> +				btf_id);
>> +			return -EINVAL;
>> +		}
>> +		t = btf_type_by_id(btf, t->type);
>> +		if (!btf_type_is_func_proto(t))
> Other than the type tests,
> to ensure the attach_btf_id is a supported bpf_iter target,
> should the prog be checked against the target list
> ("struct list_head targets") here also during the prog load time?

This is a good question. In my RFC v2, I did this, checking against
registered targets (essentially, program loading + attaching to the target).

In this version, program loading and attaching are separated.
   - program loading: against btf_id
   - attaching: linking bpf program to target
     the current linking parameter is only the bpf_program, but later on
     there may be additional parameters like map_id, pid, cgroup_id
     etc. for tailoring the iterator behavior.

This seems to give a better separation. Agreed that checking
at load time may return an error earlier instead of at link_create
time. Let me think about this.


> 
>> +			return -EINVAL;
>> +		prog->aux->attach_func_name = tname;
>> +		prog->aux->attach_func_proto = t;
>> +		ret = btf_distill_func_proto(&env->log, btf, t,
>> +					     tname, &fmodel);
>> +		return ret;
>>   	default:
>>   		if (!prog_extension)
>>   			return -EINVAL;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  2020-04-27 20:12 ` [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE Yonghong Song
@ 2020-04-29  1:32   ` Martin KaFai Lau
  2020-04-29  5:04     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  1:32 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Mon, Apr 27, 2020 at 01:12:41PM -0700, Yonghong Song wrote:
> Added BPF_LINK_UPDATE support for tracing/iter programs.
> This way, a file based bpf iterator, which holds a reference
> to the link, can have its bpf program updated without
> creating new files.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  2 ++
>  kernel/bpf/bpf_iter.c | 29 +++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c  |  5 +++++
>  3 files changed, 36 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 60ecb73d8f6d..4fc39d9b5cd0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1131,6 +1131,8 @@ struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>  				   u64 *session_id, u64 *seq_num, bool is_last);
>  int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>  int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> +			  struct bpf_prog *new_prog);
>  
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 9532e7bcb8e1..fc1ce5ee5c3f 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -23,6 +23,9 @@ static struct list_head targets;
>  static struct mutex targets_mutex;
>  static bool bpf_iter_inited = false;
>  
> +/* protect bpf_iter_link.link->prog update */
> +static struct mutex bpf_iter_mutex;
> +
>  int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>  {
>  	struct bpf_iter_target_info *tinfo;
> @@ -33,6 +36,7 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>  	if (!bpf_iter_inited) {
>  		INIT_LIST_HEAD(&targets);
>  		mutex_init(&targets_mutex);
> +		mutex_init(&bpf_iter_mutex);
>  		bpf_iter_inited = true;
>  	}
>  
> @@ -121,3 +125,28 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
>  		kfree(link);
>  	return err;
>  }
> +
> +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> +			  struct bpf_prog *new_prog)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&bpf_iter_mutex);
> +	if (old_prog && link->prog != old_prog) {
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +
> +	if (link->prog->type != new_prog->type ||
> +	    link->prog->expected_attach_type != new_prog->expected_attach_type ||
> +	    strcmp(link->prog->aux->attach_func_name, new_prog->aux->attach_func_name)) {
Can attach_btf_id be compared instead of strcmp()?
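
i.e. something like (untested sketch):

	if (link->prog->type != new_prog->type ||
	    link->prog->expected_attach_type != new_prog->expected_attach_type ||
	    link->prog->aux->attach_btf_id != new_prog->aux->attach_btf_id) {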

> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	link->prog = new_prog;
Does the old link->prog need a bpf_prog_put()?

> +
> +out_unlock:
> +	mutex_unlock(&bpf_iter_mutex);
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  1:15       ` Yonghong Song
@ 2020-04-29  2:44         ` Alexei Starovoitov
  2020-04-29  5:09           ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Alexei Starovoitov @ 2020-04-29  2:44 UTC (permalink / raw)
  To: Yonghong Song, Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Daniel Borkmann, kernel-team

On 4/28/20 6:15 PM, Yonghong Song wrote:
> 
> 
> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct 
>>>> bpf_iter_seq_map_info),
>>>> +                 &meta.session_id, &meta.seq_num,
>>>> +                 v == (void *)0);
>>>  From looking at seq_file.c, when will show() be called with "v == 
>>> NULL"?
>>>
>>
>> that v == NULL here and the whole verifier change just to allow NULL...
>> maybe use seq_num as an indicator of the last elem instead?
>> Like seq_num with upper bit set to indicate that it's last?
> 
> We could. But then verifier won't have an easy way to verify that.
> For example, the above is expected:
> 
>       int prog(struct bpf_map *map, u64 seq_num) {
>          if (seq_num >> 63)
>            return 0;
>          ... map->id ...
>          ... map->user_cnt ...
>       }
> 
> But if user writes
> 
>       int prog(struct bpf_map *map, u64 seq_num) {
>           ... map->id ...
>           ... map->user_cnt ...
>       }
> 
> it won't be easy for the verifier to conclude that the map pointer
> tracing is improper here, and in the above case map->id and
> map->user_cnt will cause exceptions and silently get value 0.

I mean always pass valid object pointer into the prog.
In above case 'map' will always be valid.
Consider a prog that iterates over all map elements.
It's weird that the prog would always need to do
if (map == 0)
   goto out;
even if it doesn't care about finding last.
All progs would have to have such extra 'if'.
If we always pass a valid object then there is no need
for such extra checks inside the prog.
First and last element can be indicated via seq_num
or via another flag or via helper call like is_this_last_elem()
or something.
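
i.e. the prog could look like (rough sketch; is_this_last_elem()
is just a name for a hypothetical helper, nothing that exists
today):

      int prog(struct bpf_map *map, u64 seq_num) {
          /* map is always a valid pointer here */
          ... map->id ...
          ... map->user_cnt ...
          if (is_this_last_elem())
              /* emit footer/aggregation */
      }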

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  2020-04-29  1:32   ` Martin KaFai Lau
@ 2020-04-29  5:04     ` Yonghong Song
  2020-04-29  5:58       ` Martin KaFai Lau
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  5:04 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team



On 4/28/20 6:32 PM, Martin KaFai Lau wrote:
> On Mon, Apr 27, 2020 at 01:12:41PM -0700, Yonghong Song wrote:
>> Added BPF_LINK_UPDATE support for tracing/iter programs.
>> This way, a file based bpf iterator, which holds a reference
>> to the link, can have its bpf program updated without
>> creating new files.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h   |  2 ++
>>   kernel/bpf/bpf_iter.c | 29 +++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c  |  5 +++++
>>   3 files changed, 36 insertions(+)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 60ecb73d8f6d..4fc39d9b5cd0 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1131,6 +1131,8 @@ struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>>   				   u64 *session_id, u64 *seq_num, bool is_last);
>>   int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>>   int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>> +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>> +			  struct bpf_prog *new_prog);
>>   
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> index 9532e7bcb8e1..fc1ce5ee5c3f 100644
>> --- a/kernel/bpf/bpf_iter.c
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -23,6 +23,9 @@ static struct list_head targets;
>>   static struct mutex targets_mutex;
>>   static bool bpf_iter_inited = false;
>>   
>> +/* protect bpf_iter_link.link->prog update */
>> +static struct mutex bpf_iter_mutex;
>> +
>>   int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>>   {
>>   	struct bpf_iter_target_info *tinfo;
>> @@ -33,6 +36,7 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>>   	if (!bpf_iter_inited) {
>>   		INIT_LIST_HEAD(&targets);
>>   		mutex_init(&targets_mutex);
>> +		mutex_init(&bpf_iter_mutex);
>>   		bpf_iter_inited = true;
>>   	}
>>   
>> @@ -121,3 +125,28 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
>>   		kfree(link);
>>   	return err;
>>   }
>> +
>> +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>> +			  struct bpf_prog *new_prog)
>> +{
>> +	int ret = 0;
>> +
>> +	mutex_lock(&bpf_iter_mutex);
>> +	if (old_prog && link->prog != old_prog) {
>> +		ret = -EPERM;
>> +		goto out_unlock;
>> +	}
>> +
>> +	if (link->prog->type != new_prog->type ||
>> +	    link->prog->expected_attach_type != new_prog->expected_attach_type ||
>> +	    strcmp(link->prog->aux->attach_func_name, new_prog->aux->attach_func_name)) {
> Can attach_btf_id be compared instead of strcmp()?

Yes, we can do it.

> 
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	link->prog = new_prog;
> Does the old link->prog need a bpf_prog_put()?

The old_prog is replaced in caller link_update (syscall.c):
static int link_update(union bpf_attr *attr)
{
         struct bpf_prog *old_prog = NULL, *new_prog;
         struct bpf_link *link;
         u32 flags;
         int ret;
...
         if (link->ops == &bpf_iter_link_lops) {
                 ret = bpf_iter_link_replace(link, old_prog, new_prog);
                 goto out_put_progs;
         }
         ret = -EINVAL;

out_put_progs:
         if (old_prog)
                 bpf_prog_put(old_prog);
...

> 
>> +
>> +out_unlock:
>> +	mutex_unlock(&bpf_iter_mutex);
>> +	return ret;
>> +}

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  2:44         ` Alexei Starovoitov
@ 2020-04-29  5:09           ` Yonghong Song
  2020-04-29  6:08             ` Andrii Nakryiko
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  5:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin KaFai Lau
  Cc: Andrii Nakryiko, bpf, netdev, Daniel Borkmann, kernel-team



On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>
>>
>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct 
>>>>> bpf_iter_seq_map_info),
>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>> +                 v == (void *)0);
>>>>  From looking at seq_file.c, when will show() be called with "v == 
>>>> NULL"?
>>>>
>>>
>>> that v == NULL here and the whole verifier change just to allow NULL...
> >>> maybe use seq_num as an indicator of the last elem instead?
>>> Like seq_num with upper bit set to indicate that it's last?
>>
>> We could. But then verifier won't have an easy way to verify that.
>> For example, the above is expected:
>>
>>       int prog(struct bpf_map *map, u64 seq_num) {
>>          if (seq_num >> 63)
>>            return 0;
>>          ... map->id ...
>>          ... map->user_cnt ...
>>       }
>>
>> But if user writes
>>
>>       int prog(struct bpf_map *map, u64 seq_num) {
>>           ... map->id ...
>>           ... map->user_cnt ...
>>       }
>>
>> it won't be easy for the verifier to conclude that the map pointer
>> tracing is improper here, and in the above case map->id and
>> map->user_cnt will cause exceptions and silently get value 0.
> 
> I mean always pass valid object pointer into the prog.
> In above case 'map' will always be valid.
> Consider a prog that iterates over all map elements.
> It's weird that the prog would always need to do
> if (map == 0)
>    goto out;
> even if it doesn't care about finding last.
> All progs would have to have such extra 'if'.
> If we always pass a valid object then there is no need
> for such extra checks inside the prog.
> First and last element can be indicated via seq_num
> or via another flag or via helper call like is_this_last_elem()
> or something.

Okay, I see what you mean now. Basically this means
seq_ops->next() should try to get/maintain next two elements,
otherwise, we won't know whether the one in seq_ops->show()
is the last or not. We could do it in newly implemented
iterator bpf_map/task/task_file. Let me check how I could
make existing seq_ops (ipv6_route/netlink) work with
minimum changes.
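
A minimal look-ahead sketch (all names made up for illustration):

     /* ->next() returns the current object but already fetches the
      * following one, so ->show() can tell whether the current one
      * is the last.
      */
     static void *iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
     {
             struct iter_priv *p = seq->private;

             ++*pos;
             p->cur = p->next;
             p->next = fetch_next_object(p);   /* hypothetical */
             return p->cur;
     }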

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure
  2020-04-27 20:12 ` [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure Yonghong Song
@ 2020-04-29  5:38   ` Andrii Nakryiko
  0 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  5:38 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:13 PM Yonghong Song <yhs@fb.com> wrote:
>
> Refactor assignment of "net" in seq_net_private structure
> in proc_net.c to a helper function. The helper later will
> be used by bpfdump.

typo: bpfdump -> bpf_iter ?

Otherwise:

Acked-by: Andrii Nakryiko <andriin@fb.com>

>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  fs/proc/proc_net.c           | 5 ++---
>  include/linux/seq_file_net.h | 8 ++++++++
>  2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 4888c5224442..aee07c19cf8b 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -75,9 +75,8 @@ static int seq_open_net(struct inode *inode, struct file *file)
>                 put_net(net);
>                 return -ENOMEM;
>         }
> -#ifdef CONFIG_NET_NS
> -       p->net = net;
> -#endif
> +
> +       set_seq_net_private(p, net);
>         return 0;
>  }
>
> diff --git a/include/linux/seq_file_net.h b/include/linux/seq_file_net.h
> index 0fdbe1ddd8d1..0ec4a18b9aca 100644
> --- a/include/linux/seq_file_net.h
> +++ b/include/linux/seq_file_net.h
> @@ -35,4 +35,12 @@ static inline struct net *seq_file_single_net(struct seq_file *seq)
>  #endif
>  }
>
> +static inline void set_seq_net_private(struct seq_net_private *p,
> +                                      struct net *net)
> +{
> +#ifdef CONFIG_NET_NS
> +       p->net = net;
> +#endif
> +}
> +
>  #endif
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-27 20:12 ` [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator Yonghong Song
@ 2020-04-29  5:39   ` Martin KaFai Lau
  2020-04-29  6:56   ` Andrii Nakryiko
  2020-04-29 19:39   ` Andrii Nakryiko
  2 siblings, 0 replies; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  5:39 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Mon, Apr 27, 2020 at 01:12:42PM -0700, Yonghong Song wrote:
> A new bpf command BPF_ITER_CREATE is added.
> 
> The anonymous bpf iterator is seq_file based.
> The seq_file private data are referenced by targets.
> The bpf_iter infrastructure allocates additional space
> at seq_file->private after the space used by targets
> to store some meta data, e.g.,
>   prog:       prog to run
>   session_id: a unique id for each opened seq_file
>   seq_num:    how many times bpf programs are queried in this session
>   has_last:   indicate whether or not bpf_prog has been called after
>               all valid objects have been processed
> 
> A map between file and prog/link is established to help
> fops->release(). When fops->release() is called, just based on
> inode and file, bpf program cannot be located since target
> seq_priv_size not available. This map helps retrieve the prog
> whose reference count needs to be decremented.
How about putting "struct extra_priv_data" at the beginning of
the seq_file's private store instead, since its size is known?
seq->private can then point to an aligned byte after
sizeof(struct extra_priv_data).
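
i.e. something like (untested sketch, using the names from this patch):

	total = sizeof(struct extra_priv_data) + tinfo->seq_priv_size;
	extra_data = __seq_open_private(file, tinfo->seq_ops, total);
	/* __seq_open_private() points seq->private at the allocation;
	 * move it past the meta data (modulo alignment):
	 */
	seq = file->private_data;
	seq->private = extra_data + 1;

	/* at release time the meta data can then be recovered without
	 * knowing seq_priv_size:
	 */
	extra_data = (struct extra_priv_data *)seq->private - 1;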

[ ... ]

> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index fc1ce5ee5c3f..1f4e778d1814 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c

[ ... ]

> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
>  /* protect bpf_iter_link.link->prog update */
>  static struct mutex bpf_iter_mutex;
>  
> +/* At the anon seq_file release function, the prog cannot
> + * be retrieved since target seq_priv_size is not available.
> + * Keep a list of <anon_file, prog> mapping, so that
> + * at file release stage, the prog can be released properly.
> + */
> +static struct list_head anon_iter_info;
> +static struct mutex anon_iter_info_mutex;
> +
> +/* incremented on every opened seq_file */
> +static atomic64_t session_id;
> +
> +static u32 get_total_priv_dsize(u32 old_size)
> +{
> +	return roundup(old_size, 8) + sizeof(struct extra_priv_data);
> +}
> +
> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
> +{
> +	return old_ptr + roundup(old_size, 8);
> +}
> +
> +static int anon_iter_release(struct inode *inode, struct file *file)
> +{
> +	struct anon_file_prog_assoc *finfo;
> +
> +	mutex_lock(&anon_iter_info_mutex);
> +	list_for_each_entry(finfo, &anon_iter_info, list) {
> +		if (finfo->file == file) {
> +			bpf_prog_put(finfo->prog);
> +			list_del(&finfo->list);
> +			kfree(finfo);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&anon_iter_info_mutex);
> +
> +	return seq_release_private(inode, file);
> +}
> +
> +static const struct file_operations anon_bpf_iter_fops = {
> +	.read		= seq_read,
> +	.release	= anon_iter_release,
> +};
> +
>  int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>  {
>  	struct bpf_iter_target_info *tinfo;

[ ... ]

> @@ -150,3 +223,90 @@ int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>  	mutex_unlock(&bpf_iter_mutex);
>  	return ret;
>  }
> +
> +static void init_seq_file(void *priv_data, struct bpf_iter_target_info *tinfo,
> +			  struct bpf_prog *prog)
> +{
> +	struct extra_priv_data *extra_data;
> +
> +	if (tinfo->target_feature & BPF_DUMP_SEQ_NET_PRIVATE)
> +		set_seq_net_private((struct seq_net_private *)priv_data,
> +				    current->nsproxy->net_ns);
> +
> +	extra_data = get_extra_priv_dptr(priv_data, tinfo->seq_priv_size);
> +	extra_data->session_id = atomic64_add_return(1, &session_id);
> +	extra_data->prog = prog;
> +	extra_data->seq_num = 0;
> +	extra_data->has_last = false;
> +}
> +
> +static int prepare_seq_file(struct file *file, struct bpf_iter_link *link)
> +{
> +	struct anon_file_prog_assoc *finfo;
> +	struct bpf_iter_target_info *tinfo;
> +	struct bpf_prog *prog;
> +	u32 total_priv_dsize;
> +	void *priv_data;
> +
> +	finfo = kmalloc(sizeof(*finfo), GFP_USER | __GFP_NOWARN);
> +	if (!finfo)
> +		return -ENOMEM;
> +
> +	mutex_lock(&bpf_iter_mutex);
> +	prog = link->link.prog;
> +	bpf_prog_inc(prog);
> +	mutex_unlock(&bpf_iter_mutex);
> +
> +	tinfo = link->tinfo;
> +	total_priv_dsize = get_total_priv_dsize(tinfo->seq_priv_size);
> +	priv_data = __seq_open_private(file, tinfo->seq_ops, total_priv_dsize);
> +	if (!priv_data) {
> +		bpf_prog_sub(prog, 1);
Could prog's refcnt go 0 here?  If yes, bpf_prog_put() should be used.

> +		kfree(finfo);
> +		return -ENOMEM;
> +	}
> +
> +	init_seq_file(priv_data, tinfo, prog);
> +
> +	finfo->file = file;
> +	finfo->prog = prog;
> +
> +	mutex_lock(&anon_iter_info_mutex);
> +	list_add(&finfo->list, &anon_iter_info);
> +	mutex_unlock(&anon_iter_info_mutex);
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  2020-04-29  5:04     ` Yonghong Song
@ 2020-04-29  5:58       ` Martin KaFai Lau
  2020-04-29  6:32         ` Andrii Nakryiko
  0 siblings, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  5:58 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, kernel-team

On Tue, Apr 28, 2020 at 10:04:54PM -0700, Yonghong Song wrote:
> 
> 
> On 4/28/20 6:32 PM, Martin KaFai Lau wrote:
> > On Mon, Apr 27, 2020 at 01:12:41PM -0700, Yonghong Song wrote:
> > > Added BPF_LINK_UPDATE support for tracing/iter programs.
> > > This way, a file based bpf iterator, which holds a reference
> > > to the link, can have its bpf program updated without
> > > creating new files.
> > > 

[ ... ]

> > > --- a/kernel/bpf/bpf_iter.c
> > > +++ b/kernel/bpf/bpf_iter.c

[ ... ]

> > > @@ -121,3 +125,28 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > >   		kfree(link);
> > >   	return err;
> > >   }
> > > +
> > > +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> > > +			  struct bpf_prog *new_prog)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&bpf_iter_mutex);
> > > +	if (old_prog && link->prog != old_prog) {
hmm....

If I read this function correctly,
old_prog could be NULL here and it is only needed during BPF_F_REPLACE
to ensure it is replacing a particular old_prog, no?


> > > +		ret = -EPERM;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	if (link->prog->type != new_prog->type ||
> > > +	    link->prog->expected_attach_type != new_prog->expected_attach_type ||
> > > +	    strcmp(link->prog->aux->attach_func_name, new_prog->aux->attach_func_name)) {
> > Can attach_btf_id be compared instead of strcmp()?
> 
> Yes, we can do it.
> 
> > 
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	link->prog = new_prog;
> > Does the old link->prog need a bpf_prog_put()?
> 
> The old_prog is replaced in caller link_update (syscall.c):

> static int link_update(union bpf_attr *attr)
> {
>         struct bpf_prog *old_prog = NULL, *new_prog;
>         struct bpf_link *link;
>         u32 flags;
>         int ret;
> ...
>         if (link->ops == &bpf_iter_link_lops) {
>                 ret = bpf_iter_link_replace(link, old_prog, new_prog);
>                 goto out_put_progs;
>         }
>         ret = -EINVAL;
> 
> out_put_progs:
>         if (old_prog)
>                 bpf_prog_put(old_prog);
The old_prog in link_update() took a separate refcnt from bpf_prog_get().
I don't see how it is related to the existing refcnt held in the link->prog.

Or am I missing something in BPF_F_REPLACE?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-27 20:12 ` [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator Yonghong Song
  2020-04-29  0:37   ` Martin KaFai Lau
@ 2020-04-29  6:04   ` Andrii Nakryiko
  1 sibling, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  6:04 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:18 PM Yonghong Song <yhs@fb.com> wrote:
>
> The bpf_map iterator is implemented.
> The bpf program is called at seq_ops show() and stop() functions.
> bpf_iter_get_prog() will retrieve bpf program and other
> parameters during seq_file object traversal. In show() function,
> bpf program will traverse every valid object, and in stop()
> function, bpf program will be called one more time after all
> objects are traversed.
>
> The first member of the bpf context contains the meta data, namely,
> the seq_file, session_id and seq_num. Here, the session_id is
> a unique id for one specific seq_file session. The seq_num is
> the number of bpf prog invocations in the current session.
> The bpf_iter_get_prog(), which will be implemented in subsequent
> patches, will have more information on how meta data are computed.
>
> The second member of the bpf context is a struct bpf_map pointer,
> which bpf program can examine.
>
> The target implementation also provides the structure definition
> for bpf program and the function definition for verifier to
> verify the bpf program. Specifically for bpf_map iterator,
> the structure is "bpf_iter__bpf_map" and the function is
> "__bpf_iter__bpf_map".
>
> More targets will be implemented later, all of which will include
> the following, similar to bpf_map iterator:
>   - seq_ops() implementation
>   - function definition for verifier to verify the bpf program
>   - seq_file private data size
>   - additional target feature
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  10 ++++
>  kernel/bpf/Makefile   |   2 +-
>  kernel/bpf/bpf_iter.c |  19 ++++++++
>  kernel/bpf/map_iter.c | 107 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c  |  13 +++++
>  5 files changed, 150 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/map_iter.c
>

[...]

> +static int __init bpf_map_iter_init(void)
> +{
> +       struct bpf_iter_reg reg_info = {
> +               .target                 = "bpf_map",
> +               .target_func_name       = "__bpf_iter__bpf_map",

I wonder if it would be better instead of strings to use a pointer to
a function here. It would preserve __bpf_iter__bpf_map function
without __init, plus it's hard to mistype the name accidentally. In
bpf_iter_reg_target() one would just need to find the function in
kallsyms by its address and extract its name.

Or that would be too much trouble?
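
Rough sketch of what I mean (the target_func field is assumed,
not part of this series):

     struct bpf_iter_reg reg_info = {
             .target      = "bpf_map",
             .target_func = __bpf_iter__bpf_map,   /* pointer, not string */
             ...
     };

     /* in bpf_iter_reg_target(), recover the name via kallsyms: */
     char sym[KSYM_NAME_LEN];

     lookup_symbol_name((unsigned long)reg_info->target_func, sym);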

> +               .seq_ops                = &bpf_map_seq_ops,
> +               .seq_priv_size          = sizeof(struct bpf_iter_seq_map_info),
> +               .target_feature         = 0,
> +       };
> +
> +       return bpf_iter_reg_target(&reg_info);
> +}
> +
> +late_initcall(bpf_map_iter_init);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 7626b8024471..022187640943 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2800,6 +2800,19 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
>         return err;
>  }
>
> +struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
> +{
> +       struct bpf_map *map;
> +
> +       spin_lock_bh(&map_idr_lock);
> +       map = idr_get_next(&map_idr, id);
> +       if (map)
> +               map = __bpf_map_inc_not_zero(map, false);

When __bpf_map_inc_not_zero returns ENOENT, it doesn't mean there are
no more BPF maps, it just means that the current one we got was
already released (or in the process of being released). I think you
need to retry with id+1 in such case, otherwise your iteration might
end prematurely.
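
i.e. something like (untested sketch):

     struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
     {
            struct bpf_map *map;

            spin_lock_bh(&map_idr_lock);
     again:
            map = idr_get_next(&map_idr, id);
            if (map) {
                    map = __bpf_map_inc_not_zero(map, false);
                    /* map was already dying: skip to the next id */
                    if (IS_ERR(map) && PTR_ERR(map) == -ENOENT) {
                            (*id)++;
                            goto again;
                    }
            }
            spin_unlock_bh(&map_idr_lock);

            return map;
     }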

> +       spin_unlock_bh(&map_idr_lock);
> +
> +       return map;
> +}
> +
>  #define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
>
>  struct bpf_prog *bpf_prog_by_id(u32 id)
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  5:09           ` Yonghong Song
@ 2020-04-29  6:08             ` Andrii Nakryiko
  2020-04-29  6:20               ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  6:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team

On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> > On 4/28/20 6:15 PM, Yonghong Song wrote:
> >>
> >>
> >> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> >>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> >>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
> >>>>> bpf_iter_seq_map_info),
> >>>>> +                 &meta.session_id, &meta.seq_num,
> >>>>> +                 v == (void *)0);
> >>>>  From looking at seq_file.c, when will show() be called with "v ==
> >>>> NULL"?
> >>>>
> >>>
> >>> that v == NULL here and the whole verifier change just to allow NULL...
> >>> maybe use seq_num as an indicator of the last elem instead?
> >>> Like seq_num with upper bit set to indicate that it's last?
> >>
> >> We could. But then verifier won't have an easy way to verify that.
> >> For example, the above is expected:
> >>
> >>       int prog(struct bpf_map *map, u64 seq_num) {
> >>          if (seq_num >> 63)
> >>            return 0;
> >>          ... map->id ...
> >>          ... map->user_cnt ...
> >>       }
> >>
> >> But if user writes
> >>
> >>       int prog(struct bpf_map *map, u64 seq_num) {
> >>           ... map->id ...
> >>           ... map->user_cnt ...
> >>       }
> >>
> >> it won't be easy for the verifier to conclude that the map pointer
> >> tracing is improper here, and in the above case map->id and
> >> map->user_cnt will cause exceptions and silently get value 0.
> >
> > I mean always pass valid object pointer into the prog.
> > In above case 'map' will always be valid.
> > Consider a prog that iterates over all map elements.
> > It's weird that the prog would always need to do
> > if (map == 0)
> >    goto out;
> > even if it doesn't care about finding last.
> > All progs would have to have such extra 'if'.
> > If we always pass a valid object then there is no need
> > for such extra checks inside the prog.
> > First and last element can be indicated via seq_num
> > or via another flag or via helper call like is_this_last_elem()
> > or something.
>
> Okay, I see what you mean now. Basically this means
> seq_ops->next() should try to get/maintain next two elements,

What about the case when there are no elements to iterate to begin
with? In that case, we still need to call bpf_prog for (empty)
post-aggregation, but we have no valid element... For bpf_map
iteration we could have a fake empty bpf_map that would be passed, but
I'm not sure it's applicable for any type of object (e.g., having a
fake task_struct is probably quite a bit more problematic?)...

> otherwise, we won't know whether the one in seq_ops->show()
> is the last or not. We could do it in newly implemented
> iterator bpf_map/task/task_file. Let me check how I could
> make existing seq_ops (ipv6_route/netlink) work with
> minimum changes.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:08             ` Andrii Nakryiko
@ 2020-04-29  6:20               ` Yonghong Song
  2020-04-29  6:30                 ` Alexei Starovoitov
  2020-04-29  6:34                 ` Martin KaFai Lau
  0 siblings, 2 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  6:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team



On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>> bpf_iter_seq_map_info),
>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>> +                 v == (void *)0);
>>>>>>   From looking at seq_file.c, when will show() be called with "v ==
>>>>>> NULL"?
>>>>>>
>>>>>
>>>>> that v == NULL here and the whole verifier change just to allow NULL...
>>>>> may be use seq_num as an indicator of the last elem instead?
>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>
>>>> We could. But then verifier won't have an easy way to verify that.
>>>> For example, the above is expected:
>>>>
>>>>        int prog(struct bpf_map *map, u64 seq_num) {
>>>>           if (seq_num >> 63)
>>>>             return 0;
>>>>           ... map->id ...
>>>>           ... map->user_cnt ...
>>>>        }
>>>>
>>>> But if user writes
>>>>
>>>>        int prog(struct bpf_map *map, u64 seq_num) {
>>>>            ... map->id ...
>>>>            ... map->user_cnt ...
>>>>        }
>>>>
>>>> verifier won't be easy to conclude inproper map pointer tracing
>>>> here and in the above map->id, map->user_cnt will cause
>>>> exceptions and they will silently get value 0.
>>>
>>> I mean always pass valid object pointer into the prog.
>>> In above case 'map' will always be valid.
>>> Consider prog that iterating all map elements.
>>> It's weird that the prog would always need to do
>>> if (map == 0)
>>>     goto out;
>>> even if it doesn't care about finding last.
>>> All progs would have to have such extra 'if'.
>>> If we always pass valid object than there is no need
>>> for such extra checks inside the prog.
>>> First and last element can be indicated via seq_num
>>> or via another flag or via helper call like is_this_last_elem()
>>> or something.
>>
>> Okay, I see what you mean now. Basically this means
>> seq_ops->next() should try to get/maintain next two elements,
> 
> What about the case when there are no elements to iterate to begin
> with? In that case, we still need to call bpf_prog for (empty)
> post-aggregation, but we have no valid element... For bpf_map
> iteration we could have a fake empty bpf_map that would be passed, but
> I'm not sure it's applicable for any type of object (e.g., having a
> fake task_struct is probably quite a bit more problematic?)...

Oh, yes, thanks for reminding me of this. I put a call to
bpf_prog in seq_ops->stop() especially to handle the no-object
case. In that case, seq_ops->start() will return NULL,
seq_ops->next() won't be called, and then seq_ops->stop()
is called. My earlier attempt tried to hook with next()
and found it not working in all cases.
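
Roughly (simplified from the bpf_map iterator in patch #03):

     static void bpf_map_seq_stop(struct seq_file *seq, void *v)
     {
             struct bpf_iter_meta meta;
             struct bpf_prog *prog;

             /* v == NULL both when there was no object at all and
              * at the normal end of iteration; the prog is invoked
              * one more time with a NULL object so it can emit any
              * footer/aggregation.
              */
             prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
                                      &meta.session_id, &meta.seq_num,
                                      v == (void *)0);
             ...
     }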

> 
>> otherwise, we won't know whether the one in seq_ops->show()
>> is the last or not. We could do it in newly implemented
>> iterator bpf_map/task/task_file. Let me check how I could
>> make existing seq_ops (ipv6_route/netlink) work with
>> minimum changes.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE
  2020-04-27 20:12 ` [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE Yonghong Song
  2020-04-29  1:17   ` Martin KaFai Lau
@ 2020-04-29  6:25   ` Andrii Nakryiko
  1 sibling, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  6:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:13 PM Yonghong Song <yhs@fb.com> wrote:
>
> Given a bpf program, the step to create an anonymous bpf iterator is:
>   - create a bpf_iter_link, which combines bpf program and the target.
>     In the future, there could be more information recorded in the link.
>     A link_fd will be returned to the user space.
>   - create an anonymous bpf iterator with the given link_fd.
>
> The anonymous bpf iterator (and its underlying bpf_link) will be
> used to create file based bpf iterator as well.
>
> The benefits of using bpf_iter_link:
>   - for file based bpf iterator, bpf_iter_link provides a standard
>     way to replace underlying bpf programs.
>   - for both anonymous and file based iterators, bpf link query
>     capability can be leveraged.
>
> The patch adds support for tracing/iter programs for BPF_LINK_CREATE.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  2 ++
>  kernel/bpf/bpf_iter.c | 54 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c  | 15 ++++++++++++
>  3 files changed, 71 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4ac8d61f7c3e..60ecb73d8f6d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1034,6 +1034,7 @@ extern const struct file_operations bpf_prog_fops;
>  extern const struct bpf_prog_ops bpf_offload_prog_ops;
>  extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
>  extern const struct bpf_verifier_ops xdp_analyzer_ops;
> +extern const struct bpf_link_ops bpf_iter_link_lops;

show_fdinfo implementation for bpf_link has changed, so thankfully
this won't be necessary after you rebase on latest master :)

>
>  struct bpf_prog *bpf_prog_get(u32 ufd);
>  struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
> @@ -1129,6 +1130,7 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
>  struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>                                    u64 *session_id, u64 *seq_num, bool is_last);
>  int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
> +int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 284c95587803..9532e7bcb8e1 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -14,6 +14,11 @@ struct bpf_iter_target_info {
>         u32 target_feature;
>  };
>
> +struct bpf_iter_link {
> +       struct bpf_link link;
> +       struct bpf_iter_target_info *tinfo;
> +};
> +
>  static struct list_head targets;
>  static struct mutex targets_mutex;
>  static bool bpf_iter_inited = false;
> @@ -67,3 +72,52 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
>
>         return ret;
>  }
> +
> +static void bpf_iter_link_release(struct bpf_link *link)
> +{
> +}
> +
> +static void bpf_iter_link_dealloc(struct bpf_link *link)
> +{

Here you need to kfree() link struct. See bpf_raw_tp_link_dealloc() for example.
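
i.e. following the bpf_raw_tp_link_dealloc() pattern:

     static void bpf_iter_link_dealloc(struct bpf_link *link)
     {
             struct bpf_iter_link *iter_link =
                     container_of(link, struct bpf_iter_link, link);

             kfree(iter_link);
     }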


> +}
> +
> +const struct bpf_link_ops bpf_iter_link_lops = {
> +       .release = bpf_iter_link_release,
> +       .dealloc = bpf_iter_link_dealloc,
> +};
> +

[...]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:20               ` Yonghong Song
@ 2020-04-29  6:30                 ` Alexei Starovoitov
  2020-04-29  6:40                   ` Andrii Nakryiko
  2020-04-29  6:34                 ` Martin KaFai Lau
  1 sibling, 1 reply; 85+ messages in thread
From: Alexei Starovoitov @ 2020-04-29  6:30 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team

On 4/28/20 11:20 PM, Yonghong Song wrote:
> 
> 
> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>>
>>>
>>>
>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>>> bpf_iter_seq_map_info),
>>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>>> +                 v == (void *)0);
>>>>>>>   From looking at seq_file.c, when will show() be called with "v ==
>>>>>>> NULL"?
>>>>>>>
>>>>>>
>>>>>> that v == NULL here and the whole verifier change just to allow 
>>>>>> NULL...
> >>>>>> maybe use seq_num as an indicator of the last elem instead?
>>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>>
>>>>> We could. But then verifier won't have an easy way to verify that.
>>>>> For example, the above is expected:
>>>>>
>>>>>        int prog(struct bpf_map *map, u64 seq_num) {
>>>>>           if (seq_num >> 63)
>>>>>             return 0;
>>>>>           ... map->id ...
>>>>>           ... map->user_cnt ...
>>>>>        }
>>>>>
>>>>> But if user writes
>>>>>
>>>>>        int prog(struct bpf_map *map, u64 seq_num) {
>>>>>            ... map->id ...
>>>>>            ... map->user_cnt ...
>>>>>        }
>>>>>
> >>>>> it won't be easy for the verifier to conclude that the map pointer
> >>>>> tracing is improper here, and in the above case map->id and
> >>>>> map->user_cnt will cause exceptions and silently get value 0.
>>>>
>>>> I mean always pass valid object pointer into the prog.
>>>> In above case 'map' will always be valid.
> >>>> Consider a prog that iterates over all map elements.
>>>> It's weird that the prog would always need to do
>>>> if (map == 0)
>>>>     goto out;
>>>> even if it doesn't care about finding last.
>>>> All progs would have to have such extra 'if'.
> >>>> If we always pass a valid object then there is no need
>>>> for such extra checks inside the prog.
>>>> First and last element can be indicated via seq_num
>>>> or via another flag or via helper call like is_this_last_elem()
>>>> or something.
>>>
>>> Okay, I see what you mean now. Basically this means
>>> seq_ops->next() should try to get/maintain next two elements,
>>
>> What about the case when there are no elements to iterate to begin
>> with? In that case, we still need to call bpf_prog for (empty)
>> post-aggregation, but we have no valid element... For bpf_map
> >> iteration we could have a fake empty bpf_map that would be passed, but
> >> I'm not sure it's applicable for any type of object (e.g., having a
>> fake task_struct is probably quite a bit more problematic?)...
> 
> Oh, yes, thanks for reminding me of this. I put a call to
> bpf_prog in seq_ops->stop() especially to handle the no-object
> case. In that case, seq_ops->start() will return NULL,
> seq_ops->next() won't be called, and then seq_ops->stop()
> is called. My earlier attempt tried to hook with next()
> and found it not working in all cases.

Wait a sec. seq_ops->stop() is not the end.
With lseek on a seq_file it can be called multiple times.
What's the point of calling the bpf prog with NULL then?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  2020-04-29  5:58       ` Martin KaFai Lau
@ 2020-04-29  6:32         ` Andrii Nakryiko
  2020-04-29  6:41           ` Martin KaFai Lau
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  6:32 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Tue, Apr 28, 2020 at 10:59 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Apr 28, 2020 at 10:04:54PM -0700, Yonghong Song wrote:
> >
> >
> > On 4/28/20 6:32 PM, Martin KaFai Lau wrote:
> > > On Mon, Apr 27, 2020 at 01:12:41PM -0700, Yonghong Song wrote:
> > > > Added BPF_LINK_UPDATE support for tracing/iter programs.
> > > > This way, a file based bpf iterator, which holds a reference
> > > > to the link, can have its bpf program updated without
> > > > creating new files.
> > > >
>
> [ ... ]
>
> > > > --- a/kernel/bpf/bpf_iter.c
> > > > +++ b/kernel/bpf/bpf_iter.c
>
> [ ... ]
>
> > > > @@ -121,3 +125,28 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > > >                   kfree(link);
> > > >           return err;
> > > >   }
> > > > +
> > > > +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> > > > +                   struct bpf_prog *new_prog)
> > > > +{
> > > > + int ret = 0;
> > > > +
> > > > + mutex_lock(&bpf_iter_mutex);
> > > > + if (old_prog && link->prog != old_prog) {
> hmm....
>
> If I read this function correctly,
> old_prog could be NULL here and it is only needed during BPF_F_REPLACE
> to ensure it is replacing a particular old_prog, no?

Yes, do you see any problem with the above logic?

>
>
> > > > +         ret = -EPERM;
> > > > +         goto out_unlock;
> > > > + }
> > > > +
> > > > + if (link->prog->type != new_prog->type ||
> > > > +     link->prog->expected_attach_type != new_prog->expected_attach_type ||
> > > > +     strcmp(link->prog->aux->attach_func_name, new_prog->aux->attach_func_name)) {
> > > Can attach_btf_id be compared instead of strcmp()?
> >
> > Yes, we can do it.
> >
> > >
> > > > +         ret = -EINVAL;
> > > > +         goto out_unlock;
> > > > + }
> > > > +
> > > > + link->prog = new_prog;
> > > Does the old link->prog need a bpf_prog_put()?
> >
> > The old_prog is replaced in caller link_update (syscall.c):
>
> > static int link_update(union bpf_attr *attr)
> > {
> >         struct bpf_prog *old_prog = NULL, *new_prog;
> >         struct bpf_link *link;
> >         u32 flags;
> >         int ret;
> > ...
> >         if (link->ops == &bpf_iter_link_lops) {
> >                 ret = bpf_iter_link_replace(link, old_prog, new_prog);
> >                 goto out_put_progs;
> >         }
> >         ret = -EINVAL;
> >
> > out_put_progs:
> >         if (old_prog)
> >                 bpf_prog_put(old_prog);
> The old_prog in link_update() took a separate refcnt from bpf_prog_get().
> I don't see how it is related to the existing refcnt held in the link->prog.
>
> or I am missing something in BPF_F_REPLACE?

Martin is right, bpf_iter_link_replace() needs to drop its own refcnt
on old_prog, in addition to what generic link_update logic does here,
because bpf_iter_link bumped old_prog's refcnt when it was created or
updated last time.
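
i.e. something like (sketch, still under bpf_iter_mutex and after
the compatibility checks):

     prev_prog = link->prog;
     link->prog = new_prog;
     bpf_prog_put(prev_prog);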

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:20               ` Yonghong Song
  2020-04-29  6:30                 ` Alexei Starovoitov
@ 2020-04-29  6:34                 ` Martin KaFai Lau
  2020-04-29  6:51                   ` Yonghong Song
  1 sibling, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  6:34 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team

On Tue, Apr 28, 2020 at 11:20:30PM -0700, Yonghong Song wrote:
> 
> 
> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> > On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
> > > 
> > > 
> > > 
> > > On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> > > > On 4/28/20 6:15 PM, Yonghong Song wrote:
> > > > > 
> > > > > 
> > > > > On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> > > > > > On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> > > > > > > > +    prog = bpf_iter_get_prog(seq, sizeof(struct
> > > > > > > > bpf_iter_seq_map_info),
> > > > > > > > +                 &meta.session_id, &meta.seq_num,
> > > > > > > > +                 v == (void *)0);
> > > > > > >   From looking at seq_file.c, when will show() be called with "v ==
> > > > > > > NULL"?
> > > > > > > 
> > > > > > 
> > > > > > that v == NULL here and the whole verifier change just to allow NULL...
> > > > > > maybe use seq_num as an indicator of the last elem instead?
> > > > > > Like seq_num with upper bit set to indicate that it's last?
> > > > > 
> > > > > We could. But then verifier won't have an easy way to verify that.
> > > > > For example, the above is expected:
> > > > > 
> > > > >        int prog(struct bpf_map *map, u64 seq_num) {
> > > > >           if (seq_num >> 63)
> > > > >             return 0;
> > > > >           ... map->id ...
> > > > >           ... map->user_cnt ...
> > > > >        }
> > > > > 
> > > > > But if user writes
> > > > > 
> > > > >        int prog(struct bpf_map *map, u64 seq_num) {
> > > > >            ... map->id ...
> > > > >            ... map->user_cnt ...
> > > > >        }
> > > > > 
> > > > > it won't be easy for the verifier to conclude that the map pointer
> > > > > tracing is improper here, and in the above case map->id and
> > > > > map->user_cnt will cause exceptions and silently get value 0.
> > > > 
> > > > I mean always pass valid object pointer into the prog.
> > > > In above case 'map' will always be valid.
> > > > Consider a prog that iterates over all map elements.
> > > > It's weird that the prog would always need to do
> > > > if (map == 0)
> > > >     goto out;
> > > > even if it doesn't care about finding last.
> > > > All progs would have to have such extra 'if'.
> > > > If we always pass a valid object then there is no need
> > > > for such extra checks inside the prog.
> > > > First and last element can be indicated via seq_num
> > > > or via another flag or via helper call like is_this_last_elem()
> > > > or something.
> > > 
> > > Okay, I see what you mean now. Basically this means
> > > seq_ops->next() should try to get/maintain next two elements,
> > 
> > What about the case when there are no elements to iterate to begin
> > with? In that case, we still need to call bpf_prog for (empty)
> > post-aggregation, but we have no valid element... For bpf_map
> > iteration we could have a fake empty bpf_map that would be passed, but
> > I'm not sure it's applicable for any type of object (e.g., having a
> > fake task_struct is probably quite a bit more problematic?)...
> 
> Oh, yes, thanks for reminding me of this. I put a call to
> bpf_prog in seq_ops->stop() especially to handle the no-object
> case. In that case, seq_ops->start() will return NULL,
> seq_ops->next() won't be called, and then seq_ops->stop()
> is called. My earlier attempt tried to hook with next()
> and found it not working in all cases.
> 
> > 
> > > otherwise, we won't know whether the one in seq_ops->show()
> > > is the last or not. 
I think "show()" is convoluted with "stop()/eof()".  Could "stop()/eof()"
be its own separate (and optional) bpf_prog which only does "stop()/eof()"?

> > > We could do it in newly implemented
> > > iterator bpf_map/task/task_file. Let me check how I could
> > > make existing seq_ops (ipv6_route/netlink) work with
> > > minimum changes.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:30                 ` Alexei Starovoitov
@ 2020-04-29  6:40                   ` Andrii Nakryiko
  2020-04-29  6:44                     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  6:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Martin KaFai Lau, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team

On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 4/28/20 11:20 PM, Yonghong Song wrote:
> >
> >
> > On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> >> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>
> >>>
> >>>
> >>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> >>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
> >>>>>
> >>>>>
> >>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> >>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> >>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
> >>>>>>>> bpf_iter_seq_map_info),
> >>>>>>>> +                 &meta.session_id, &meta.seq_num,
> >>>>>>>> +                 v == (void *)0);
> >>>>>>>   From looking at seq_file.c, when will show() be called with "v ==
> >>>>>>> NULL"?
> >>>>>>>
> >>>>>>
> >>>>>> that v == NULL here and the whole verifier change just to allow
> >>>>>> NULL...
> >>>>>> maybe use seq_num as an indicator of the last elem instead?
> >>>>>> Like seq_num with upper bit set to indicate that it's last?
> >>>>>
> >>>>> We could. But then verifier won't have an easy way to verify that.
> >>>>> For example, the above is expected:
> >>>>>
> >>>>>        int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>           if (seq_num >> 63)
> >>>>>             return 0;
> >>>>>           ... map->id ...
> >>>>>           ... map->user_cnt ...
> >>>>>        }
> >>>>>
> >>>>> But if user writes
> >>>>>
> >>>>>        int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>            ... map->id ...
> >>>>>            ... map->user_cnt ...
> >>>>>        }
> >>>>>
> >>>>> it won't be easy for the verifier to detect the improper map
> >>>>> pointer tracing here, and the above map->id and map->user_cnt
> >>>>> accesses will cause exceptions and silently get value 0.
> >>>>
> >>>> I mean always pass a valid object pointer into the prog.
> >>>> In the above case 'map' will always be valid.
> >>>> Consider a prog that iterates over all map elements.
> >>>> It's weird that the prog would always need to do
> >>>> if (map == 0)
> >>>>     goto out;
> >>>> even if it doesn't care about finding last.
> >>>> All progs would have to have such extra 'if'.
> >>>> If we always pass a valid object then there is no need
> >>>> for such extra checks inside the prog.
> >>>> First and last element can be indicated via seq_num
> >>>> or via another flag or via helper call like is_this_last_elem()
> >>>> or something.
> >>>
> >>> Okay, I see what you mean now. Basically this means
> >>> seq_ops->next() should try to get/maintain next two elements,
> >>
> >> What about the case when there are no elements to iterate to begin
> >> with? In that case, we still need to call bpf_prog for (empty)
> >> post-aggregation, but we have no valid element... For bpf_map
> >> iteration we could have fake empty bpf_map that would be passed, but
> >> I'm not sure it's applicable for any type of object (e.g., having a
> >> fake task_struct is probably quite a bit more problematic?)...
> >
> > Oh, yes, thanks for reminding me of this. I put a call to
> > bpf_prog in seq_ops->stop() specifically to handle the
> > no-object case. In that case, seq_ops->start() will return NULL,
> > seq_ops->next() won't be called, and then seq_ops->stop()
> > is called. My earlier attempt tried to hook into next()
> > and then found it did not work in all cases.
>
> wait a sec. seq_ops->stop() is not the end.
> With lseek of seq_file it can be called multiple times.

We don't allow seeking on seq_file created from bpf_iter_link, so
there should be no lseek'ing?

> What's the point calling bpf prog with NULL then?

To know that the iteration has ended, even if there were 0 elements to
iterate. Whether it's 0, 1 or N doesn't matter; we might still need to
do some final actions (e.g., submit or print a summary).
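
Something like this, using the prog signature proposed earlier in the
thread (just a sketch; a NULL map marks the end of the iteration):

    int prog(struct bpf_map *map, u64 seq_num)
    {
            if (map == NULL) {
                    /* iteration is done, possibly with 0 elements
                     * seen: emit the aggregated summary here
                     */
                    return 0;
            }
            /* per-map work, e.g., accumulate counters */
            return 0;
    }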

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE
  2020-04-29  6:32         ` Andrii Nakryiko
@ 2020-04-29  6:41           ` Martin KaFai Lau
  0 siblings, 0 replies; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29  6:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Tue, Apr 28, 2020 at 11:32:15PM -0700, Andrii Nakryiko wrote:
> On Tue, Apr 28, 2020 at 10:59 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Tue, Apr 28, 2020 at 10:04:54PM -0700, Yonghong Song wrote:
> > >
> > >
> > > On 4/28/20 6:32 PM, Martin KaFai Lau wrote:
> > > > On Mon, Apr 27, 2020 at 01:12:41PM -0700, Yonghong Song wrote:
> > > > > Added BPF_LINK_UPDATE support for tracing/iter programs.
> > > > > This way, a file based bpf iterator, which holds a reference
> > > > > to the link, can have its bpf program updated without
> > > > > creating new files.
> > > > >
> >
> > [ ... ]
> >
> > > > > --- a/kernel/bpf/bpf_iter.c
> > > > > +++ b/kernel/bpf/bpf_iter.c
> >
> > [ ... ]
> >
> > > > > @@ -121,3 +125,28 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > > > >                   kfree(link);
> > > > >           return err;
> > > > >   }
> > > > > +
> > > > > +int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> > > > > +                   struct bpf_prog *new_prog)
> > > > > +{
> > > > > + int ret = 0;
> > > > > +
> > > > > + mutex_lock(&bpf_iter_mutex);
> > > > > + if (old_prog && link->prog != old_prog) {
> > hmm....
> >
> > If I read this function correctly,
> > old_prog could be NULL here and it is only needed during BPF_F_REPLACE
> > to ensure it is replacing a particular old_prog, no?
> 
> Yes, do you see any problem with the above logic?
Not at all.  I just want to point out that when old_prog is NULL,
the link_update() would not even call bpf_prog_put(old_prog).

> 
> >
> >
> > > > > +         ret = -EPERM;
> > > > > +         goto out_unlock;
> > > > > + }
> > > > > +
> > > > > + if (link->prog->type != new_prog->type ||
> > > > > +     link->prog->expected_attach_type != new_prog->expected_attach_type ||
> > > > > +     strcmp(link->prog->aux->attach_func_name, new_prog->aux->attach_func_name)) {
> > > > Can attach_btf_id be compared instead of strcmp()?
> > >
> > > Yes, we can do it.
> > >
> > > >
> > > > > +         ret = -EINVAL;
> > > > > +         goto out_unlock;
> > > > > + }
> > > > > +
> > > > > + link->prog = new_prog;
> > > > Does the old link->prog need a bpf_prog_put()?
> > >
> > > The old_prog is replaced in caller link_update (syscall.c):
> >
> > > static int link_update(union bpf_attr *attr)
> > > {
> > >         struct bpf_prog *old_prog = NULL, *new_prog;
> > >         struct bpf_link *link;
> > >         u32 flags;
> > >         int ret;
> > > ...
> > >         if (link->ops == &bpf_iter_link_lops) {
> > >                 ret = bpf_iter_link_replace(link, old_prog, new_prog);
> > >                 goto out_put_progs;
> > >         }
> > >         ret = -EINVAL;
> > >
> > > out_put_progs:
> > >         if (old_prog)
> > >                 bpf_prog_put(old_prog);
> > The old_prog in link_update() took a separate refcnt from bpf_prog_get().
> > I don't see how it is related to the existing refcnt held in the link->prog.
> >
> > or I am missing something in BPF_F_REPLACE?
> 
> Martin is right, bpf_iter_link_replace() needs to drop its own refcnt
> on old_prog, in addition to what generic link_update logic does here,
> because the bpf_iter link bumped old_prog's refcnt when it was created or
> updated last time.
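> 
> I.e., something like this (just a sketch):
> 
>         old_link_prog = link->prog;
>         link->prog = new_prog;
>         /* drop the reference the link itself held on the old prog */
>         bpf_prog_put(old_link_prog);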

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:40                   ` Andrii Nakryiko
@ 2020-04-29  6:44                     ` Yonghong Song
  2020-04-29 15:34                       ` Alexei Starovoitov
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  6:44 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team



On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
> On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 4/28/20 11:20 PM, Yonghong Song wrote:
>>>
>>>
>>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
>>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>>>>> bpf_iter_seq_map_info),
>>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>>>>> +                 v == (void *)0);
>>>>>>>>>    From looking at seq_file.c, when will show() be called with "v ==
>>>>>>>>> NULL"?
>>>>>>>>>
>>>>>>>>
>>>>>>>> that v == NULL here and the whole verifier change just to allow
>>>>>>>> NULL...
>>>>>>>> maybe use seq_num as an indicator of the last elem instead?
>>>>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>>>>
>>>>>>> We could. But then verifier won't have an easy way to verify that.
>>>>>>> For example, the above is expected:
>>>>>>>
>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>            if (seq_num >> 63)
>>>>>>>              return 0;
>>>>>>>            ... map->id ...
>>>>>>>            ... map->user_cnt ...
>>>>>>>         }
>>>>>>>
>>>>>>> But if user writes
>>>>>>>
>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>             ... map->id ...
>>>>>>>             ... map->user_cnt ...
>>>>>>>         }
>>>>>>>
>>>>>>> it won't be easy for the verifier to detect the improper map
>>>>>>> pointer tracing here, and the above map->id and map->user_cnt
>>>>>>> accesses will cause exceptions and silently get value 0.
>>>>>>
>>>>>> I mean always pass a valid object pointer into the prog.
>>>>>> In the above case 'map' will always be valid.
>>>>>> Consider a prog that iterates over all map elements.
>>>>>> It's weird that the prog would always need to do
>>>>>> if (map == 0)
>>>>>>      goto out;
>>>>>> even if it doesn't care about finding last.
>>>>>> All progs would have to have such extra 'if'.
>>>>>> If we always pass a valid object then there is no need
>>>>>> for such extra checks inside the prog.
>>>>>> First and last element can be indicated via seq_num
>>>>>> or via another flag or via helper call like is_this_last_elem()
>>>>>> or something.
>>>>>
>>>>> Okay, I see what you mean now. Basically this means
>>>>> seq_ops->next() should try to get/maintain next two elements,
>>>>
>>>> What about the case when there are no elements to iterate to begin
>>>> with? In that case, we still need to call bpf_prog for (empty)
>>>> post-aggregation, but we have no valid element... For bpf_map
>>>> iteration we could have fake empty bpf_map that would be passed, but
>>>> I'm not sure it's applicable for any type of object (e.g., having a
>>>> fake task_struct is probably quite a bit more problematic?)...
>>>
>>> Oh, yes, thanks for reminding me of this. I put a call to
>>> bpf_prog in seq_ops->stop() specifically to handle the
>>> no-object case. In that case, seq_ops->start() will return NULL,
>>> seq_ops->next() won't be called, and then seq_ops->stop()
>>> is called. My earlier attempt tried to hook into next()
>>> and then found it did not work in all cases.
>>
>> wait a sec. seq_ops->stop() is not the end.
>> With lseek of seq_file it can be called multiple times.

Yes, I have taken care of this. When the object is NULL,
the bpf program will be called. When the object is NULL again,
it won't be called; the private data remembers that the prog
has already been called with NULL.
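
Roughly like this (a sketch based on the extra_priv_data of this
series; seq_priv_size comes from the target info, and the details
may change):

    static void bpf_iter_seq_stop(struct seq_file *seq, void *v)
    {
            struct extra_priv_data *priv =
                    get_extra_priv_dptr(seq->private, seq_priv_size);

            if (!v && !priv->has_last) {
                    /* run the prog once with a NULL object to
                     * signal the end of the iteration
                     */
                    priv->has_last = true;
            }
            /* a later stop() with NULL won't run the prog again */
    }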

> 
> We don't allow seeking on seq_file created from bpf_iter_link, so
> there should be no lseek'ing?
> 
>> What's the point calling bpf prog with NULL then?
> 
> To know that the iteration has ended, even if there were 0 elements to
> iterate. Whether it's 0, 1 or N doesn't matter; we might still need to
> do some final actions (e.g., submit or print a summary).
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:34                 ` Martin KaFai Lau
@ 2020-04-29  6:51                   ` Yonghong Song
  2020-04-29 19:25                     ` Andrii Nakryiko
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  6:51 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team



On 4/28/20 11:34 PM, Martin KaFai Lau wrote:
> On Tue, Apr 28, 2020 at 11:20:30PM -0700, Yonghong Song wrote:
>>
>>
>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>>>> bpf_iter_seq_map_info),
>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>>>> +                 v == (void *)0);
>>>>>>>>    From looking at seq_file.c, when will show() be called with "v ==
>>>>>>>> NULL"?
>>>>>>>>
>>>>>>>
>>>>>>> that v == NULL here and the whole verifier change just to allow NULL...
>>>>>>> maybe use seq_num as an indicator of the last elem instead?
>>>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>>>
>>>>>> We could. But then verifier won't have an easy way to verify that.
>>>>>> For example, the above is expected:
>>>>>>
>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>            if (seq_num >> 63)
>>>>>>              return 0;
>>>>>>            ... map->id ...
>>>>>>            ... map->user_cnt ...
>>>>>>         }
>>>>>>
>>>>>> But if user writes
>>>>>>
>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>             ... map->id ...
>>>>>>             ... map->user_cnt ...
>>>>>>         }
>>>>>>
>>>>>> it won't be easy for the verifier to detect the improper map
>>>>>> pointer tracing here, and the above map->id and map->user_cnt
>>>>>> accesses will cause exceptions and silently get value 0.
>>>>>
>>>>> I mean always pass a valid object pointer into the prog.
>>>>> In the above case 'map' will always be valid.
>>>>> Consider a prog that iterates over all map elements.
>>>>> It's weird that the prog would always need to do
>>>>> if (map == 0)
>>>>>      goto out;
>>>>> even if it doesn't care about finding last.
>>>>> All progs would have to have such extra 'if'.
>>>>> If we always pass a valid object then there is no need
>>>>> for such extra checks inside the prog.
>>>>> First and last element can be indicated via seq_num
>>>>> or via another flag or via helper call like is_this_last_elem()
>>>>> or something.
>>>>
>>>> Okay, I see what you mean now. Basically this means
>>>> seq_ops->next() should try to get/maintain next two elements,
>>>
>>> What about the case when there are no elements to iterate to begin
>>> with? In that case, we still need to call bpf_prog for (empty)
>>> post-aggregation, but we have no valid element... For bpf_map
>>> iteration we could have fake empty bpf_map that would be passed, but
>>> I'm not sure it's applicable for any type of object (e.g., having a
>>> fake task_struct is probably quite a bit more problematic?)...
>>
>> Oh, yes, thanks for reminding me of this. I put a call to
>> bpf_prog in seq_ops->stop() specifically to handle the
>> no-object case. In that case, seq_ops->start() will return NULL,
>> seq_ops->next() won't be called, and then seq_ops->stop()
>> is called. My earlier attempt tried to hook into next()
>> and then found it did not work in all cases.
>>
>>>
>>>> otherwise, we won't know whether the one in seq_ops->show()
>>>> is the last or not.
> I think "show()" is convoluted with "stop()/eof()".  Could "stop()/eof()"
> be its own separate (and optional) bpf_prog which only does "stop()/eof()"?

I thought about this before. But then the user needs to write a separate
program instead of a simple "if" condition in the main program...

> 
>>>> We could do it in the newly implemented
>>>> bpf_map/task/task_file iterators. Let me check how I could
>>>> make the existing seq_ops (ipv6_route/netlink) work with
>>>> minimum changes.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-27 20:12 ` [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator Yonghong Song
  2020-04-29  5:39   ` Martin KaFai Lau
@ 2020-04-29  6:56   ` Andrii Nakryiko
  2020-04-29  7:06     ` Yonghong Song
  2020-04-29 19:39   ` Andrii Nakryiko
  2 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29  6:56 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
>
> A new bpf command BPF_ITER_CREATE is added.
>
> The anonymous bpf iterator is seq_file based.
> The seq_file private data are referenced by targets.
> The bpf_iter infrastructure allocated additional space
> at seq_file->private after the space used by targets
> to store some meta data, e.g.,
>   prog:       prog to run
> >   session_id: a unique id for each opened seq_file
>   seq_num:    how many times bpf programs are queried in this session
>   has_last:   indicate whether or not bpf_prog has been called after
>               all valid objects have been processed
>
> A map between file and prog/link is established to help
> fops->release(). When fops->release() is called, just based on
> inode and file, bpf program cannot be located since target
> seq_priv_size not available. This map helps retrieve the prog
> whose reference count needs to be decremented.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |   3 +
>  include/uapi/linux/bpf.h       |   6 ++
>  kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c           |  27 ++++++
>  tools/include/uapi/linux/bpf.h |   6 ++
>  5 files changed, 203 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4fc39d9b5cd0..0f0cafc65a04 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>  int bpf_obj_get_user(const char __user *pathname, int flags);
>
> +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
> +
>  struct bpf_iter_reg {
>         const char *target;
>         const char *target_func_name;
> @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>  int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>                           struct bpf_prog *new_prog);
> +int bpf_iter_new_fd(struct bpf_link *link);
>
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f39b9fec37ab..576651110d16 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -113,6 +113,7 @@ enum bpf_cmd {
>         BPF_MAP_DELETE_BATCH,
>         BPF_LINK_CREATE,
>         BPF_LINK_UPDATE,
> +       BPF_ITER_CREATE,
>  };
>
>  enum bpf_map_type {
> @@ -590,6 +591,11 @@ union bpf_attr {
>                 __u32           old_prog_fd;
>         } link_update;
>
> +       struct { /* struct used by BPF_ITER_CREATE command */
> +               __u32           link_fd;
> +               __u32           flags;
> +       } iter_create;
> +
>  } __attribute__((aligned(8)));
>
>  /* The description below is an attempt at providing documentation to eBPF
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index fc1ce5ee5c3f..1f4e778d1814 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -2,6 +2,7 @@
>  /* Copyright (c) 2020 Facebook */
>
>  #include <linux/fs.h>
> +#include <linux/anon_inodes.h>
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>
> @@ -19,6 +20,19 @@ struct bpf_iter_link {
>         struct bpf_iter_target_info *tinfo;
>  };
>
> +struct extra_priv_data {
> +       struct bpf_prog *prog;
> +       u64 session_id;
> +       u64 seq_num;
> +       bool has_last;
> +};
> +
> +struct anon_file_prog_assoc {
> +       struct list_head list;
> +       struct file *file;
> +       struct bpf_prog *prog;
> +};
> +
>  static struct list_head targets;
>  static struct mutex targets_mutex;
>  static bool bpf_iter_inited = false;
> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
>  /* protect bpf_iter_link.link->prog upddate */
>  static struct mutex bpf_iter_mutex;
>
> +/* Since at anon seq_file release function, the prog cannot
> + * be retrieved since target seq_priv_size is not available.
> + * Keep a list of <anon_file, prog> mapping, so that
> + * at file release stage, the prog can be released properly.
> + */
> +static struct list_head anon_iter_info;
> +static struct mutex anon_iter_info_mutex;
> +
> +/* incremented on every opened seq_file */
> +static atomic64_t session_id;
> +
> +static u32 get_total_priv_dsize(u32 old_size)
> +{
> +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
> +}
> +
> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
> +{
> +       return old_ptr + roundup(old_size, 8);
> +}
> +
> +static int anon_iter_release(struct inode *inode, struct file *file)
> +{
> +       struct anon_file_prog_assoc *finfo;
> +
> +       mutex_lock(&anon_iter_info_mutex);
> +       list_for_each_entry(finfo, &anon_iter_info, list) {
> +               if (finfo->file == file) {

I'll look at this and other patches more thoroughly tomorrow with
clear head, but this iteration to find anon_file_prog_assoc is really
unfortunate.

I think the problem is that you are allowing seq_file infrastructure
to call directly into target implementation of seq_operations without
intercepting them. If you change that and put whatever extra info is
necessary into seq_file->private in front of target's private state,
then you shouldn't need this, right?

This would also make each target's logic a bit simpler because you can:
- centralize creation and initialization of bpf_iter_meta (session_id,
seq, seq_num will be set up once in this generic code);
- centralize loff_t pos increments;
- you can extract and centralize bpf_iter_get_prog() call in show()
implementation as well.

I think with that each target's logic will be simpler and you won't
need to maintain anon_file_prog_assocs.

Are there complications I'm missing?

[...]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-29  6:56   ` Andrii Nakryiko
@ 2020-04-29  7:06     ` Yonghong Song
  2020-04-29 18:16       ` Andrii Nakryiko
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29  7:06 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/28/20 11:56 PM, Andrii Nakryiko wrote:
> On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> A new bpf command BPF_ITER_CREATE is added.
>>
>> The anonymous bpf iterator is seq_file based.
>> The seq_file private data are referenced by targets.
>> The bpf_iter infrastructure allocated additional space
>> at seq_file->private after the space used by targets
>> to store some meta data, e.g.,
>>    prog:       prog to run
>>    session_id: a unique id for each opened seq_file
>>    seq_num:    how many times bpf programs are queried in this session
>>    has_last:   indicate whether or not bpf_prog has been called after
>>                all valid objects have been processed
>>
>> A map between file and prog/link is established to help
>> fops->release(). When fops->release() is called, just based on
>> inode and file, bpf program cannot be located since target
>> seq_priv_size not available. This map helps retrieve the prog
>> whose reference count needs to be decremented.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h            |   3 +
>>   include/uapi/linux/bpf.h       |   6 ++
>>   kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
>>   kernel/bpf/syscall.c           |  27 ++++++
>>   tools/include/uapi/linux/bpf.h |   6 ++
>>   5 files changed, 203 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 4fc39d9b5cd0..0f0cafc65a04 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>>   int bpf_obj_get_user(const char __user *pathname, int flags);
>>
>> +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
>> +
>>   struct bpf_iter_reg {
>>          const char *target;
>>          const char *target_func_name;
>> @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>>   int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>>   int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>>                            struct bpf_prog *new_prog);
>> +int bpf_iter_new_fd(struct bpf_link *link);
>>
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index f39b9fec37ab..576651110d16 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -113,6 +113,7 @@ enum bpf_cmd {
>>          BPF_MAP_DELETE_BATCH,
>>          BPF_LINK_CREATE,
>>          BPF_LINK_UPDATE,
>> +       BPF_ITER_CREATE,
>>   };
>>
>>   enum bpf_map_type {
>> @@ -590,6 +591,11 @@ union bpf_attr {
>>                  __u32           old_prog_fd;
>>          } link_update;
>>
>> +       struct { /* struct used by BPF_ITER_CREATE command */
>> +               __u32           link_fd;
>> +               __u32           flags;
>> +       } iter_create;
>> +
>>   } __attribute__((aligned(8)));
>>
>>   /* The description below is an attempt at providing documentation to eBPF
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> index fc1ce5ee5c3f..1f4e778d1814 100644
>> --- a/kernel/bpf/bpf_iter.c
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -2,6 +2,7 @@
>>   /* Copyright (c) 2020 Facebook */
>>
>>   #include <linux/fs.h>
>> +#include <linux/anon_inodes.h>
>>   #include <linux/filter.h>
>>   #include <linux/bpf.h>
>>
>> @@ -19,6 +20,19 @@ struct bpf_iter_link {
>>          struct bpf_iter_target_info *tinfo;
>>   };
>>
>> +struct extra_priv_data {
>> +       struct bpf_prog *prog;
>> +       u64 session_id;
>> +       u64 seq_num;
>> +       bool has_last;
>> +};
>> +
>> +struct anon_file_prog_assoc {
>> +       struct list_head list;
>> +       struct file *file;
>> +       struct bpf_prog *prog;
>> +};
>> +
>>   static struct list_head targets;
>>   static struct mutex targets_mutex;
>>   static bool bpf_iter_inited = false;
>> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
>>   /* protect bpf_iter_link.link->prog upddate */
>>   static struct mutex bpf_iter_mutex;
>>
>> +/* Since at anon seq_file release function, the prog cannot
>> + * be retrieved since target seq_priv_size is not available.
>> + * Keep a list of <anon_file, prog> mapping, so that
>> + * at file release stage, the prog can be released properly.
>> + */
>> +static struct list_head anon_iter_info;
>> +static struct mutex anon_iter_info_mutex;
>> +
>> +/* incremented on every opened seq_file */
>> +static atomic64_t session_id;
>> +
>> +static u32 get_total_priv_dsize(u32 old_size)
>> +{
>> +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
>> +}
>> +
>> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
>> +{
>> +       return old_ptr + roundup(old_size, 8);
>> +}
>> +
>> +static int anon_iter_release(struct inode *inode, struct file *file)
>> +{
>> +       struct anon_file_prog_assoc *finfo;
>> +
>> +       mutex_lock(&anon_iter_info_mutex);
>> +       list_for_each_entry(finfo, &anon_iter_info, list) {
>> +               if (finfo->file == file) {
> 
> I'll look at this and other patches more thoroughly tomorrow with
> clear head, but this iteration to find anon_file_prog_assoc is really
> unfortunate.
> 
> I think the problem is that you are allowing seq_file infrastructure
> to call directly into target implementation of seq_operations without
> intercepting them. If you change that and put whatever extra info is
> necessary into seq_file->private in front of target's private state,
> then you shouldn't need this, right?

Yes. This is true. The idea was to minimize the target changes.
But maybe this is not a good goal by itself.

You are right, if I intercept all seq_ops(), I do not need the
above change; I can tailor the seq_file private data right before
calling the target one and restore it after the target call.

Originally I only had one interception, show(); now I have
stop() too, to call bpf at the end of the iteration. Maybe I can
intercept all four, I think. This way, I can also get rid
of the target feature.
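
Something along these lines (the wrapper names are made up):

    static const struct seq_operations bpf_iter_seq_ops = {
            .start = bpf_seq_start, /* fix up private data, call target start() */
            .next  = bpf_seq_next,  /* bump seq_num, call target next() */
            .stop  = bpf_seq_stop,  /* run prog with NULL once at the real end */
            .show  = bpf_seq_show,  /* set up meta, run prog on the object */
    };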

> 
> This would also make each target's logic a bit simpler because you can:
> - centralize creation and initialization of bpf_iter_meta (session_id,
> seq, seq_num will be set up once in this generic code);
> > - centralize loff_t pos increments;
> - you can extract and centralize bpf_iter_get_prog() call in show()
> implementation as well.
> 
> I think with that each target's logic will be simpler and you won't
> need to maintain anon_file_prog_assocs.
> 
> Are there complications I'm missing?
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bptool
  2020-04-28 17:35     ` Yonghong Song
@ 2020-04-29  8:37       ` Quentin Monnet
  0 siblings, 0 replies; 85+ messages in thread
From: Quentin Monnet @ 2020-04-29  8:37 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

2020-04-28 10:35 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> 
> 
> On 4/28/20 2:27 AM, Quentin Monnet wrote:

[...]

>>> +    err = bpf_link__pin(link, path);
>>
>> Try to mount bpffs before that if "-n" is not passed? You could even
>> call do_pin_any() from common.c by passing bpf_link__fd().
> 
> You probably means do_pin_fd()? That is a good suggestion, will use it
> in the next revision.

Right, passing bpf_link__fd() to do_pin_any() wouldn't work; it does not
take the arguments expected by the "get_fd()" callback. My bad. So yeah,
just do_pin_fd() in that case :)

[...]

>>
>> Have you considered simply adapting the more traditional workflow
>> "bpftool prog load && bpftool prog attach" so that it supports iterators
>> instead of adding a new command? It would:
> 
> This is a good question, I should have clarified this better in the
> commit message.
>   - prog load && prog attach won't work.
>     Creating an iterator is a three-stage process:
>       1. prog load
>       2. create and attach to a link
>       3. pin link
>     In the current implementation, the link merely holds the program.
>     But in the future, the link will have other parameters like map_id,
>     tgid/gid, or cgroup_id, or others.
> 
>     We could say to do:
>       1. bpftool prog load <pin_path>
>       2. bpftool iter pin prog file
>          <maybe more parameters in the future>
> 
>     But this requires pinning the program itself in bpffs, which is
>     mostly unneeded when creating a file iterator.
> 
>     So the command `bpftool iter pin ...` is created for ease of use.
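> 
>     The end-to-end usage would then be something like (the object and
>     pin path names here are just placeholders):
> 
>       $ bpftool iter pin bpf_iter_netlink.o /sys/fs/bpf/my_netlink
>       $ cat /sys/fs/bpf/my_netlink    # runs the iterator, dumps output
>       $ rm -f /sys/fs/bpf/my_netlink  # removes the pinned iterator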
> 
>>
>> - Avoid adding yet another bpftool command with a single subcommand
> 
> So far, yes; in the future we may have more. In my RFC patch, I have
> `bpftool iter show ...` for introspection; it shows all
> registered targets and all file iterators' prog_ids.
> 
> This patch does not have it and I left it for future work.
> I am considering using a bpf iterator to do the introspection here...

Ok, so given the otherwise useless bpffs pinning step and the prospect
of other subcommands in the future, I agree it makes sense to have
"iter" as a new command. And as you say, handling of the link may grow,
so it's probably not a bad thing to have it separate from the "prog"
command. Thanks for the clarification (maybe add some of it to the
commit log indeed?).

> 
>>
>> - Enable to reuse the code from prog load, in particular for map reuse
>> (I'm not sure how relevant maps are for iterators, but I wouldn't be
>> surprised if someone finds a use case at some point?)
> 
> Yes, we do plan to have map element iterators. We can also have
> bpf_prog or other iterators. Yes, the map element iterator
> implementation should reuse the `bpftool map` code base since it is
> a user of the bpf_iter infrastructure.

My point was more about loading programs that reuse pre-existing maps,
as in "bpftool prog load foo /sys/fs/bpf/foo map name foomap id 1337". It
seems likely that similar syntax will be needed for loading/pinning
iterators as well eventually, but I suppose we can try to refactor the
code from prog.c to reuse it when the time comes.

> 
>>
>> - Avoid users naively trying to run "bpftool prog load && bpftool prog
>> attach <prog> iter" and not understanding why it fails
> 
> `bpftool prog attach <prog> [map_id]` is mostly used to attach a program
> to a map, right? In this case, it won't apply, right?

Right, I'm just not convinced that all users are aware of that :) But
fair enough.

> 
> BTW, Thanks for reviewing and catching my mistakes!
> 

Thanks for your reply and clarification, that's appreciated too!
Quentin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:44                     ` Yonghong Song
@ 2020-04-29 15:34                       ` Alexei Starovoitov
  2020-04-29 18:14                         ` Yonghong Song
  2020-04-29 19:19                         ` Andrii Nakryiko
  0 siblings, 2 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2020-04-29 15:34 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team

On 4/28/20 11:44 PM, Yonghong Song wrote:
> 
> 
> On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
>> On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@fb.com> wrote:
>>>
>>> On 4/28/20 11:20 PM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
>>>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>>>>>> bpf_iter_seq_map_info),
>>>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>>>>>> +                 v == (void *)0);
>>>>>>>>>>    From looking at seq_file.c, when will show() be called with 
>>>>>>>>>> "v ==
>>>>>>>>>> NULL"?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> that v == NULL here and the whole verifier change just to allow
>>>>>>>>> NULL...
>>>>>>>>> maybe use seq_num as an indicator of the last elem instead?
>>>>>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>>>>>
>>>>>>>> We could. But then verifier won't have an easy way to verify that.
>>>>>>>> For example, the above is expected:
>>>>>>>>
>>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>>            if (seq_num >> 63)
>>>>>>>>              return 0;
>>>>>>>>            ... map->id ...
>>>>>>>>            ... map->user_cnt ...
>>>>>>>>         }
>>>>>>>>
>>>>>>>> But if user writes
>>>>>>>>
>>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>>             ... map->id ...
>>>>>>>>             ... map->user_cnt ...
>>>>>>>>         }
>>>>>>>>
>>>>>>>> it won't be easy for the verifier to detect the improper map
>>>>>>>> pointer tracing here, and the above map->id and map->user_cnt
>>>>>>>> accesses will cause exceptions and silently get value 0.
>>>>>>>
>>>>>>> I mean always pass a valid object pointer into the prog.
>>>>>>> In the above case 'map' will always be valid.
>>>>>>> Consider a prog that iterates over all map elements.
>>>>>>> It's weird that the prog would always need to do
>>>>>>> if (map == 0)
>>>>>>>      goto out;
>>>>>>> even if it doesn't care about finding last.
>>>>>>> All progs would have to have such extra 'if'.
>>>>>>> If we always pass a valid object then there is no need
>>>>>>> for such extra checks inside the prog.
>>>>>>> First and last element can be indicated via seq_num
>>>>>>> or via another flag or via helper call like is_this_last_elem()
>>>>>>> or something.
>>>>>>
>>>>>> Okay, I see what you mean now. Basically this means
>>>>>> seq_ops->next() should try to get/maintain next two elements,
>>>>>
>>>>> What about the case when there are no elements to iterate to begin
>>>>> with? In that case, we still need to call bpf_prog for (empty)
>>>>> post-aggregation, but we have no valid element... For bpf_map
>>>>> iteration we could have fake empty bpf_map that would be passed, but
>>>>> I'm not sure it's applicable for any type of object (e.g., having a
>>>>> fake task_struct is probably quite a bit more problematic?)...
>>>>
>>>> Oh, yes, thanks for reminding me of this. I put a call to
>>>> bpf_prog in seq_ops->stop() specifically to handle the
>>>> no-object case. In that case, seq_ops->start() will return NULL,
>>>> seq_ops->next() won't be called, and then seq_ops->stop()
>>>> is called. My earlier attempt tried to hook into next()
>>>> and then found it did not work in all cases.
>>>
>>> wait a sec. seq_ops->stop() is not the end.
>>> With lseek of seq_file it can be called multiple times.
> 
> Yes, I have taken care of this. When the object is NULL,
> the bpf program will be called. When the object is NULL again,
> it won't be called; the private data remembers that the prog
> has already been called with NULL.

Even without lseek, stop() will be called multiple times.
If I read seq_file.c correctly, it will be called before
every copy_to_user(), which means that for a lot of text
(or if read() is done with a small buffer) there will be
plenty of start,show,show,stop sequences.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29 15:34                       ` Alexei Starovoitov
@ 2020-04-29 18:14                         ` Yonghong Song
  2020-04-29 19:19                         ` Andrii Nakryiko
  1 sibling, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-29 18:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team



On 4/29/20 8:34 AM, Alexei Starovoitov wrote:
> On 4/28/20 11:44 PM, Yonghong Song wrote:
>>
>>
>> On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
>>> On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@fb.com> wrote:
>>>>
>>>> On 4/28/20 11:20 PM, Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
>>>>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>>>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>>>>>>> bpf_iter_seq_map_info),
>>>>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>>>>>>> +                 v == (void *)0);
>>>>>>>>>>>    From looking at seq_file.c, when will show() be called 
>>>>>>>>>>> with "v ==
>>>>>>>>>>> NULL"?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> that v == NULL here and the whole verifier change just to allow
>>>>>>>>>> NULL...
>>>>>>>>>> maybe use seq_num as an indicator of the last elem instead?
>>>>>>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>>>>>>
>>>>>>>>> We could. But then verifier won't have an easy way to verify that.
>>>>>>>>> For example, the above is expected:
>>>>>>>>>
>>>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>>>            if (seq_num >> 63)
>>>>>>>>>              return 0;
>>>>>>>>>            ... map->id ...
>>>>>>>>>            ... map->user_cnt ...
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> But if user writes
>>>>>>>>>
>>>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>>>             ... map->id ...
>>>>>>>>>             ... map->user_cnt ...
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> it won't be easy for the verifier to detect the improper map
>>>>>>>>> pointer tracing here, and the above map->id and map->user_cnt
>>>>>>>>> accesses will cause exceptions and silently get value 0.
>>>>>>>>
>>>>>>>> I mean always pass a valid object pointer into the prog.
>>>>>>>> In the above case 'map' will always be valid.
>>>>>>>> Consider a prog that iterates over all map elements.
>>>>>>>> It's weird that the prog would always need to do
>>>>>>>> if (map == 0)
>>>>>>>>      goto out;
>>>>>>>> even if it doesn't care about finding last.
>>>>>>>> All progs would have to have such extra 'if'.
>>>>>>>> If we always pass a valid object then there is no need
>>>>>>>> for such extra checks inside the prog.
>>>>>>>> First and last element can be indicated via seq_num
>>>>>>>> or via another flag or via helper call like is_this_last_elem()
>>>>>>>> or something.
>>>>>>>
>>>>>>> Okay, I see what you mean now. Basically this means
>>>>>>> seq_ops->next() should try to get/maintain next two elements,
>>>>>>
>>>>>> What about the case when there are no elements to iterate to begin
>>>>>> with? In that case, we still need to call bpf_prog for (empty)
>>>>>> post-aggregation, but we have no valid element... For bpf_map
>>>>>> iteration we could have fake empty bpf_map that would be passed, but
>>>>>> I'm not sure it's applicable for any type of object (e.g., having a
>>>>>> fake task_struct is probably quite a bit more problematic?)...
>>>>>
>>>>> Oh, yes, thanks for reminding me of this. I put a call to
>>>>> bpf_prog in seq_ops->stop() specifically to handle the
>>>>> no-object case. In that case, seq_ops->start() will return NULL,
>>>>> seq_ops->next() won't be called, and then seq_ops->stop()
>>>>> is called. My earlier attempt tried to hook into next()
>>>>> and then found it did not work in all cases.
>>>>
>>>> wait a sec. seq_ops->stop() is not the end.
>>>> With lseek of seq_file it can be called multiple times.
>>
>> Yes, I have taken care of this. When the object is NULL,
>> the bpf program will be called. When the object is NULL again,
>> it won't be called; the private data remembers that the prog
>> has already been called with NULL.
> 
> Even without lseek, stop() will be called multiple times.
> If I read seq_file.c correctly, it will be called before
> every copy_to_user(), which means that for a lot of text
> (or if read() is done with a small buffer) there will be
> plenty of start,show,show,stop sequences.

That is true; this may revisit the same object if the object
still exists when start() is called again. I followed a similar
practice to ipv6_route, trying to look up the same
object at start() and only advancing right before next().
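
For reference, the simplified control flow of seq_read() is roughly:

    p = start(&pos);
    while (p && output still fits in the user buffer) {
            show(p);
            p = next(p, &pos);
    }
    stop(p);
    copy_to_user(...);

so every read() is a full start()...stop() pass, and start() must be
able to resume at the position saved by next().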

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-29  7:06     ` Yonghong Song
@ 2020-04-29 18:16       ` Andrii Nakryiko
  2020-04-29 18:46         ` Martin KaFai Lau
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29 18:16 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 29, 2020 at 12:07 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/28/20 11:56 PM, Andrii Nakryiko wrote:
> > On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> A new bpf command BPF_ITER_CREATE is added.
> >>
> >> The anonymous bpf iterator is seq_file based.
> >> The seq_file private data are referenced by targets.
> >> The bpf_iter infrastructure allocated additional space
> >> at seq_file->private after the space used by targets
> >> to store some meta data, e.g.,
> >>    prog:       prog to run
> >>    session_id: a unique id for each opened seq_file
> >>    seq_num:    how many times bpf programs are queried in this session
> >>    has_last:   indicate whether or not bpf_prog has been called after
> >>                all valid objects have been processed
> >>
> >> A map between file and prog/link is established to help
> >> fops->release(). When fops->release() is called, just based on
> >> inode and file, bpf program cannot be located since target
> >> seq_priv_size not available. This map helps retrieve the prog
> >> whose reference count needs to be decremented.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h            |   3 +
> >>   include/uapi/linux/bpf.h       |   6 ++
> >>   kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
> >>   kernel/bpf/syscall.c           |  27 ++++++
> >>   tools/include/uapi/linux/bpf.h |   6 ++
> >>   5 files changed, 203 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 4fc39d9b5cd0..0f0cafc65a04 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> >>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> >>   int bpf_obj_get_user(const char __user *pathname, int flags);
> >>
> >> +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
> >> +
> >>   struct bpf_iter_reg {
> >>          const char *target;
> >>          const char *target_func_name;
> >> @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
> >>   int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> >>   int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> >>                            struct bpf_prog *new_prog);
> >> +int bpf_iter_new_fd(struct bpf_link *link);
> >>
> >>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
> >>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index f39b9fec37ab..576651110d16 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -113,6 +113,7 @@ enum bpf_cmd {
> >>          BPF_MAP_DELETE_BATCH,
> >>          BPF_LINK_CREATE,
> >>          BPF_LINK_UPDATE,
> >> +       BPF_ITER_CREATE,
> >>   };
> >>
> >>   enum bpf_map_type {
> >> @@ -590,6 +591,11 @@ union bpf_attr {
> >>                  __u32           old_prog_fd;
> >>          } link_update;
> >>
> >> +       struct { /* struct used by BPF_ITER_CREATE command */
> >> +               __u32           link_fd;
> >> +               __u32           flags;
> >> +       } iter_create;
> >> +
> >>   } __attribute__((aligned(8)));
> >>
> >>   /* The description below is an attempt at providing documentation to eBPF
> >> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> >> index fc1ce5ee5c3f..1f4e778d1814 100644
> >> --- a/kernel/bpf/bpf_iter.c
> >> +++ b/kernel/bpf/bpf_iter.c
> >> @@ -2,6 +2,7 @@
> >>   /* Copyright (c) 2020 Facebook */
> >>
> >>   #include <linux/fs.h>
> >> +#include <linux/anon_inodes.h>
> >>   #include <linux/filter.h>
> >>   #include <linux/bpf.h>
> >>
> >> @@ -19,6 +20,19 @@ struct bpf_iter_link {
> >>          struct bpf_iter_target_info *tinfo;
> >>   };
> >>
> >> +struct extra_priv_data {
> >> +       struct bpf_prog *prog;
> >> +       u64 session_id;
> >> +       u64 seq_num;
> >> +       bool has_last;
> >> +};
> >> +
> >> +struct anon_file_prog_assoc {
> >> +       struct list_head list;
> >> +       struct file *file;
> >> +       struct bpf_prog *prog;
> >> +};
> >> +
> >>   static struct list_head targets;
> >>   static struct mutex targets_mutex;
> >>   static bool bpf_iter_inited = false;
> >> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
> >>   /* protect bpf_iter_link.link->prog upddate */
> >>   static struct mutex bpf_iter_mutex;
> >>
> >> +/* Since at anon seq_file release function, the prog cannot
> >> + * be retrieved since target seq_priv_size is not available.
> >> + * Keep a list of <anon_file, prog> mapping, so that
> >> + * at file release stage, the prog can be released properly.
> >> + */
> >> +static struct list_head anon_iter_info;
> >> +static struct mutex anon_iter_info_mutex;
> >> +
> >> +/* incremented on every opened seq_file */
> >> +static atomic64_t session_id;
> >> +
> >> +static u32 get_total_priv_dsize(u32 old_size)
> >> +{
> >> +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
> >> +}
> >> +
> >> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
> >> +{
> >> +       return old_ptr + roundup(old_size, 8);
> >> +}
> >> +
> >> +static int anon_iter_release(struct inode *inode, struct file *file)
> >> +{
> >> +       struct anon_file_prog_assoc *finfo;
> >> +
> >> +       mutex_lock(&anon_iter_info_mutex);
> >> +       list_for_each_entry(finfo, &anon_iter_info, list) {
> >> +               if (finfo->file == file) {
> >
> > I'll look at this and other patches more thoroughly tomorrow with
> > clear head, but this iteration to find anon_file_prog_assoc is really
> > unfortunate.
> >
> > I think the problem is that you are allowing seq_file infrastructure
> > to call directly into target implementation of seq_operations without
> > intercepting them. If you change that and put whatever extra info is
> > necessary into seq_file->private in front of target's private state,
> > then you shouldn't need this, right?
>
> Yes. This is true. The idea was to minimize the target changes.
> But maybe this is not a good goal by itself.
>
> You are right, if I intercept all seq_ops(), I do not need the
> above change; I can tailor the seq_file private data right before
> calling the target one and restore it after the target call.
>
> Originally I only had one interception, show(); now I have
> stop() too, to call bpf at the end of the iteration. Maybe I can
> intercept all four, I think. This way, I can also get rid
> of the target feature.

If the main goal is to minimize target changes and make them exact
seq_operations implementations, then one easier way to get easy access
to our own metadata in seq_file->private is to set it to point
**after** our metadata, but before the target's private state. Roughly
in pseudo code:

struct bpf_iter_seq_file_meta {
        /* prog, session_id, seq_num, ... */
} __attribute__((aligned(8)));

void *meta = kmalloc(sizeof(struct bpf_iter_seq_file_meta) +
                     target_private_size, GFP_KERNEL);
seq_file->private = meta + sizeof(struct bpf_iter_seq_file_meta);

Then to recover the bpf_iter_seq_file_meta:

struct bpf_iter_seq_file_meta *meta = seq_file->private - sizeof(*meta);

/* voila! */

This doesn't have the benefit of making targets simpler, but it will
require no changes to them at all. Plus fewer indirect calls, so less
performance penalty.

>
> >
> > This would also make each target's logic a bit simpler because you can:
> > - centralize creation and initialization of bpf_iter_meta (session_id,
> > seq, seq_num will be set up once in this generic code);
> > > - centralize loff_t pos increments;
> > - you can extract and centralize bpf_iter_get_prog() call in show()
> > implementation as well.
> >
> > I think with that each target's logic will be simpler and you won't
> > need to maintain anon_file_prog_assocs.
> >
> > Are there complications I'm missing?
> >
> > [...]
> >

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-29 18:16       ` Andrii Nakryiko
@ 2020-04-29 18:46         ` Martin KaFai Lau
  2020-04-29 19:20           ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29 18:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 29, 2020 at 11:16:35AM -0700, Andrii Nakryiko wrote:
> On Wed, Apr 29, 2020 at 12:07 AM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 4/28/20 11:56 PM, Andrii Nakryiko wrote:
> > > On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
> > >>
> > >> A new bpf command BPF_ITER_CREATE is added.
> > >>
> > >> The anonymous bpf iterator is seq_file based.
> > >> The seq_file private data are referenced by targets.
> > >> The bpf_iter infrastructure allocated additional space
> > >> at seq_file->private after the space used by targets
> > >> to store some meta data, e.g.,
> > >>    prog:       prog to run
> > >>    session_id: a unique id for each opened seq_file
> > >>    seq_num:    how many times bpf programs are queried in this session
> > >>    has_last:   indicate whether or not bpf_prog has been called after
> > >>                all valid objects have been processed
> > >>
> > >> A map between file and prog/link is established to help
> > >> fops->release(). When fops->release() is called, just based on
> > >> inode and file, bpf program cannot be located since target
> > >> seq_priv_size not available. This map helps retrieve the prog
> > >> whose reference count needs to be decremented.
> > >>
> > >> Signed-off-by: Yonghong Song <yhs@fb.com>
> > >> ---
> > >>   include/linux/bpf.h            |   3 +
> > >>   include/uapi/linux/bpf.h       |   6 ++
> > >>   kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
> > >>   kernel/bpf/syscall.c           |  27 ++++++
> > >>   tools/include/uapi/linux/bpf.h |   6 ++
> > >>   5 files changed, 203 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > >> index 4fc39d9b5cd0..0f0cafc65a04 100644
> > >> --- a/include/linux/bpf.h
> > >> +++ b/include/linux/bpf.h
> > >> @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > >>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > >>   int bpf_obj_get_user(const char __user *pathname, int flags);
> > >>
> > >> +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
> > >> +
> > >>   struct bpf_iter_reg {
> > >>          const char *target;
> > >>          const char *target_func_name;
> > >> @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
> > >>   int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> > >>   int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> > >>                            struct bpf_prog *new_prog);
> > >> +int bpf_iter_new_fd(struct bpf_link *link);
> > >>
> > >>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
> > >>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index f39b9fec37ab..576651110d16 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -113,6 +113,7 @@ enum bpf_cmd {
> > >>          BPF_MAP_DELETE_BATCH,
> > >>          BPF_LINK_CREATE,
> > >>          BPF_LINK_UPDATE,
> > >> +       BPF_ITER_CREATE,
> > >>   };
> > >>
> > >>   enum bpf_map_type {
> > >> @@ -590,6 +591,11 @@ union bpf_attr {
> > >>                  __u32           old_prog_fd;
> > >>          } link_update;
> > >>
> > >> +       struct { /* struct used by BPF_ITER_CREATE command */
> > >> +               __u32           link_fd;
> > >> +               __u32           flags;
> > >> +       } iter_create;
> > >> +
> > >>   } __attribute__((aligned(8)));
> > >>
> > >>   /* The description below is an attempt at providing documentation to eBPF
> > >> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> > >> index fc1ce5ee5c3f..1f4e778d1814 100644
> > >> --- a/kernel/bpf/bpf_iter.c
> > >> +++ b/kernel/bpf/bpf_iter.c
> > >> @@ -2,6 +2,7 @@
> > >>   /* Copyright (c) 2020 Facebook */
> > >>
> > >>   #include <linux/fs.h>
> > >> +#include <linux/anon_inodes.h>
> > >>   #include <linux/filter.h>
> > >>   #include <linux/bpf.h>
> > >>
> > >> @@ -19,6 +20,19 @@ struct bpf_iter_link {
> > >>          struct bpf_iter_target_info *tinfo;
> > >>   };
> > >>
> > >> +struct extra_priv_data {
> > >> +       struct bpf_prog *prog;
> > >> +       u64 session_id;
> > >> +       u64 seq_num;
> > >> +       bool has_last;
> > >> +};
> > >> +
> > >> +struct anon_file_prog_assoc {
> > >> +       struct list_head list;
> > >> +       struct file *file;
> > >> +       struct bpf_prog *prog;
> > >> +};
> > >> +
> > >>   static struct list_head targets;
> > >>   static struct mutex targets_mutex;
> > >>   static bool bpf_iter_inited = false;
> > >> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
> > >>   /* protect bpf_iter_link.link->prog update */
> > >>   static struct mutex bpf_iter_mutex;
> > >>
> > >> +/* At anon seq_file release time, the prog cannot be
> > >> + * retrieved because the target seq_priv_size is not available.
> > >> + * Keep a list of <anon_file, prog> mappings, so that
> > >> + * at file release stage, the prog can be released properly.
> > >> + */
> > >> +static struct list_head anon_iter_info;
> > >> +static struct mutex anon_iter_info_mutex;
> > >> +
> > >> +/* incremented on every opened seq_file */
> > >> +static atomic64_t session_id;
> > >> +
> > >> +static u32 get_total_priv_dsize(u32 old_size)
> > >> +{
> > >> +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
> > >> +}
> > >> +
> > >> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
> > >> +{
> > >> +       return old_ptr + roundup(old_size, 8);
> > >> +}
> > >> +
> > >> +static int anon_iter_release(struct inode *inode, struct file *file)
> > >> +{
> > >> +       struct anon_file_prog_assoc *finfo;
> > >> +
> > >> +       mutex_lock(&anon_iter_info_mutex);
> > >> +       list_for_each_entry(finfo, &anon_iter_info, list) {
> > >> +               if (finfo->file == file) {
> > >
> > > I'll look at this and other patches more thoroughly tomorrow with
> > > clear head, but this iteration to find anon_file_prog_assoc is really
> > > unfortunate.
> > >
> > > I think the problem is that you are allowing seq_file infrastructure
> > > to call directly into target implementation of seq_operations without
> > > intercepting them. If you change that and put whatever extra info is
> > > necessary into seq_file->private in front of target's private state,
> > > then you shouldn't need this, right?
> >
> > Yes. This is true. The idea is to minimize target changes.
> > But maybe this is not a good goal by itself.
> >
> > You are right: if I intercept all seq_ops(), I do not need the
> > above change. I can tailor the seq_file private_data right before
> > calling the target one and restore it after the target call.
> >
> > Originally I only had one interception, show(); now I have
> > stop() too, to call bpf at the end of iteration. Maybe I can
> > intercept all four, I think. This way, I can also get rid
> > of the target feature.
> 
> If the main goal is to minimize target changes and make them exactly
> seq_operations implementation, then one easier way to get access
> to our own metadata in seq_file->private is to set it to point
> **after** our metadata, but before target's metadata. Roughly in
> pseudo code:
> 
> struct bpf_iter_seq_file_meta {} __attribute__((aligned(8)));
> 
> void *meta = kmalloc(sizeof(struct bpf_iter_seq_file_meta) +
> target_private_size);
> seq_file->private = meta + sizeof(struct bpf_iter_seq_file_meta);
I have suggested the same thing earlier.  Good to know that we think alike ;)

Maybe put them in a struct such that container_of() etc. can be used:
struct bpf_iter_private {
        struct extra_priv_data iter_private;
	u8 target_private[] __aligned(8);
};
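
Allocation would then be roughly (a minimal sketch; tinfo->seq_priv_size
is the target's private size from bpf_iter_reg, and the error handling
around it is elided):

struct bpf_iter_private *priv;

priv = kzalloc(sizeof(*priv) + tinfo->seq_priv_size,
               GFP_USER | __GFP_NOWARN);
if (!priv)
        return -ENOMEM;

/* the target only ever sees its own tail of the allocation */
seq->private = priv->target_private;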

> 
> 
> Then to recover bpf_iter_seq_file_meta:
> 
> struct bpf_iter_seq_file_meta *meta = seq_file->private - sizeof(*meta);
> 
> /* voila! */
> 
> This doesn't have the benefit of making targets simpler, but it will
> require no changes to them at all. Plus fewer indirect calls, so less
> performance penalty.
> 


* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29 15:34                       ` Alexei Starovoitov
  2020-04-29 18:14                         ` Yonghong Song
@ 2020-04-29 19:19                         ` Andrii Nakryiko
  2020-04-29 20:15                           ` Yonghong Song
  1 sibling, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29 19:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Martin KaFai Lau, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team

On Wed, Apr 29, 2020 at 8:34 AM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 4/28/20 11:44 PM, Yonghong Song wrote:
> >
> >
> > On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
> >> On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@fb.com> wrote:
> >>>
> >>> On 4/28/20 11:20 PM, Yonghong Song wrote:
> >>>>
> >>>>
> >>>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> >>>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> >>>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> >>>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> >>>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
> >>>>>>>>>>> bpf_iter_seq_map_info),
> >>>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
> >>>>>>>>>>> +                 v == (void *)0);
> >>>>>>>>>>    From looking at seq_file.c, when will show() be called with
> >>>>>>>>>> "v ==
> >>>>>>>>>> NULL"?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> that v == NULL here and the whole verifier change just to allow
> >>>>>>>>> NULL...
> >>>>>>>>> may be use seq_num as an indicator of the last elem instead?
> >>>>>>>>> Like seq_num with upper bit set to indicate that it's last?
> >>>>>>>>
> >>>>>>>> We could. But then verifier won't have an easy way to verify that.
> >>>>>>>> For example, the above is expected:
> >>>>>>>>
> >>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>>>            if (seq_num >> 63)
> >>>>>>>>              return 0;
> >>>>>>>>            ... map->id ...
> >>>>>>>>            ... map->user_cnt ...
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> But if user writes
> >>>>>>>>
> >>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>>>             ... map->id ...
> >>>>>>>>             ... map->user_cnt ...
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> it won't be easy for the verifier to conclude there is improper map
> >>>>>>>> pointer tracking here, and in the above, map->id and map->user_cnt
> >>>>>>>> will cause exceptions and silently get value 0.
> >>>>>>>
> >>>>>>> I mean always pass valid object pointer into the prog.
> >>>>>>> In above case 'map' will always be valid.
> >>>>>>> Consider prog that iterating all map elements.
> >>>>>>> It's weird that the prog would always need to do
> >>>>>>> if (map == 0)
> >>>>>>>      goto out;
> >>>>>>> even if it doesn't care about finding last.
> >>>>>>> All progs would have to have such extra 'if'.
> >>>>>>> If we always pass a valid object then there is no need
> >>>>>>> for such extra checks inside the prog.
> >>>>>>> First and last element can be indicated via seq_num
> >>>>>>> or via another flag or via helper call like is_this_last_elem()
> >>>>>>> or something.
> >>>>>>
> >>>>>> Okay, I see what you mean now. Basically this means
> >>>>>> seq_ops->next() should try to get/maintain the next two elements,
> >>>>>
> >>>>> What about the case when there are no elements to iterate to begin
> >>>>> with? In that case, we still need to call bpf_prog for (empty)
> >>>>> post-aggregation, but we have no valid element... For bpf_map
> >>>>> iteration we could have fake empty bpf_map that would be passed, but
> >>>>> I'm not sure it's applicable for any type of object (e.g., having a
> >>>>> fake task_struct is probably quite a bit more problematic?)...
> >>>>
> >>>> Oh, yes, thanks for reminding me of this. I put a call to
> >>>> bpf_prog in seq_ops->stop() especially to handle the no-object
> >>>> case. In that case, seq_ops->start() will return NULL,
> >>>> seq_ops->next() won't be called, and then seq_ops->stop()
> >>>> is called. My earlier attempt tried to hook into next()
> >>>> and then found it did not work in all cases.
> >>>
> >>> wait a sec. seq_ops->stop() is not the end.
> >>> With lseek of seq_file it can be called multiple times.
> >
> > Yes, I have taken care of this. When the object is NULL,
> > the bpf program will be called. When the object is NULL again,
> > it won't be called. The private data remembers it has
> > been called with NULL.
>
> Even without lseek, stop() will be called multiple times.
> If I read seq_file.c correctly, it will be called before
> every copy_to_user(). Which means that for a lot of text
> (or if read() is done with a small buffer) there will be
> plenty of start, show, show, stop sequences.


Right, start()/stop() can be called multiple times, but it seems like
there are clear indicators of the beginning and the end of iteration:
- start() with seq_num == 0 is the start of iteration (can be called
multiple times, if the first element overflows the buffer);
- stop() with p == NULL is the end of iteration (it seems it can be
called multiple times as well, if the user keeps read()'ing after the
iteration completed); see the sketch after this list.
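
To illustrate, an intercepting stop() that uses these indicators could
look roughly like this. This is a sketch only: bpf_iter_private and
extra_priv_data follow the patch #7 discussion, the tinfo member is an
assumed addition, and the ctx layout here is illustrative, not the real
program context:

static void bpf_iter_seq_stop(struct seq_file *seq, void *v)
{
        struct bpf_iter_private *priv =
                container_of(seq->private, struct bpf_iter_private,
                             target_private);
        struct extra_priv_data *extra = &priv->iter_private;

        if (!v && !extra->has_last) {
                struct {
                        __u64 session_id;
                        __u64 seq_num;
                } ctx = {
                        .session_id = extra->session_id,
                        .seq_num    = extra->seq_num,
                };

                /* one final prog invocation to flush any aggregation */
                bpf_iter_run_prog(extra->prog, &ctx);
                extra->has_last = true;
        }

        priv->tinfo->seq_ops->stop(seq, v);     /* delegate to target */
}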

There is another problem with stop(), though. If the BPF program
attempts to output anything during stop(), that output will just be
discarded. Not great. Especially if that output overflows and we need
to re-allocate the buffer.

We are trying to use seq_file just to reuse 140 lines of code in
seq_read(), which is no magic, just a simple double-buffer-and-retry
piece of logic. We don't need lseek and traverse, and we don't need all
the escaping stuff. I think the bpf_iter implementation would be much
simpler if bpf_iter had better control over iteration. Then this whole
"end of iteration" behavior would be crystal clear. Should we maybe
reconsider again?

I understand we want to re-use the networking iteration code, but we can
still do that with a custom implementation of seq_read, because we are
still using struct seq_file and following its semantics. The change would
be to allow stop(NULL) (or any stop() call for that matter) to perform
output (and handle retry and buffer re-allocation). Or, alternatively,
coupled with the seq_operations intercept proposal in the patch #7
discussion, we can add an extra method (e.g., finish()) that would be
called after all elements are traversed and would allow emitting extra
stuff. We can do that (implement finish()) in seq_read as well, if
that's going to fly ok with the seq_file maintainers, of course.
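
Concretely, the finish() variant could be one extra optional callback
(a sketch of the idea, not a tested patch):

struct seq_operations {
        void * (*start) (struct seq_file *m, loff_t *pos);
        void (*stop) (struct seq_file *m, void *v);
        void * (*next) (struct seq_file *m, void *v, loff_t *pos);
        int (*show) (struct seq_file *m, void *v);
        /* new and optional: called once after the last element;
         * unlike stop(), output emitted here would go through the
         * usual overflow/retry handling in seq_read()
         */
        int (*finish) (struct seq_file *m);
};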


* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-29 18:46         ` Martin KaFai Lau
@ 2020-04-29 19:20           ` Yonghong Song
  2020-04-29 20:50             ` Martin KaFai Lau
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29 19:20 UTC (permalink / raw)
  To: Martin KaFai Lau, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team



On 4/29/20 11:46 AM, Martin KaFai Lau wrote:
> On Wed, Apr 29, 2020 at 11:16:35AM -0700, Andrii Nakryiko wrote:
>> On Wed, Apr 29, 2020 at 12:07 AM Yonghong Song <yhs@fb.com> wrote:
>>>
>>>
>>>
>>> On 4/28/20 11:56 PM, Andrii Nakryiko wrote:
>>>> On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>
>>>>> A new bpf command BPF_ITER_CREATE is added.
>>>>>
>>>>> The anonymous bpf iterator is seq_file based.
>>>>> The seq_file private data are referenced by targets.
>>>>> The bpf_iter infrastructure allocates additional space
>>>>> at seq_file->private after the space used by targets
>>>>> to store some metadata, e.g.,
>>>>>     prog:       prog to run
>>>>>     session_id: a unique id for each opened seq_file
>>>>>     seq_num:    how many times the bpf program has been queried in this session
>>>>>     has_last:   indicates whether or not the bpf_prog has been called after
>>>>>                 all valid objects have been processed
>>>>>
>>>>> A map between file and prog/link is established to help
>>>>> fops->release(). When fops->release() is called, the bpf program
>>>>> cannot be located based on inode and file alone, since the target
>>>>> seq_priv_size is not available. This map helps retrieve the prog
>>>>> whose reference count needs to be decremented.
>>>>>
>>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>> ---
>>>>>    include/linux/bpf.h            |   3 +
>>>>>    include/uapi/linux/bpf.h       |   6 ++
>>>>>    kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
>>>>>    kernel/bpf/syscall.c           |  27 ++++++
>>>>>    tools/include/uapi/linux/bpf.h |   6 ++
>>>>>    5 files changed, 203 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 4fc39d9b5cd0..0f0cafc65a04 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>>> @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>>>>>    int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>>>>>    int bpf_obj_get_user(const char __user *pathname, int flags);
>>>>>
>>>>> +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
>>>>> +
>>>>>    struct bpf_iter_reg {
>>>>>           const char *target;
>>>>>           const char *target_func_name;
>>>>> @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>>>>>    int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>>>>>    int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>>>>>                             struct bpf_prog *new_prog);
>>>>> +int bpf_iter_new_fd(struct bpf_link *link);
>>>>>
>>>>>    int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>>>>    int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>> index f39b9fec37ab..576651110d16 100644
>>>>> --- a/include/uapi/linux/bpf.h
>>>>> +++ b/include/uapi/linux/bpf.h
>>>>> @@ -113,6 +113,7 @@ enum bpf_cmd {
>>>>>           BPF_MAP_DELETE_BATCH,
>>>>>           BPF_LINK_CREATE,
>>>>>           BPF_LINK_UPDATE,
>>>>> +       BPF_ITER_CREATE,
>>>>>    };
>>>>>
>>>>>    enum bpf_map_type {
>>>>> @@ -590,6 +591,11 @@ union bpf_attr {
>>>>>                   __u32           old_prog_fd;
>>>>>           } link_update;
>>>>>
>>>>> +       struct { /* struct used by BPF_ITER_CREATE command */
>>>>> +               __u32           link_fd;
>>>>> +               __u32           flags;
>>>>> +       } iter_create;
>>>>> +
>>>>>    } __attribute__((aligned(8)));
>>>>>
>>>>>    /* The description below is an attempt at providing documentation to eBPF
>>>>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>>>>> index fc1ce5ee5c3f..1f4e778d1814 100644
>>>>> --- a/kernel/bpf/bpf_iter.c
>>>>> +++ b/kernel/bpf/bpf_iter.c
>>>>> @@ -2,6 +2,7 @@
>>>>>    /* Copyright (c) 2020 Facebook */
>>>>>
>>>>>    #include <linux/fs.h>
>>>>> +#include <linux/anon_inodes.h>
>>>>>    #include <linux/filter.h>
>>>>>    #include <linux/bpf.h>
>>>>>
>>>>> @@ -19,6 +20,19 @@ struct bpf_iter_link {
>>>>>           struct bpf_iter_target_info *tinfo;
>>>>>    };
>>>>>
>>>>> +struct extra_priv_data {
>>>>> +       struct bpf_prog *prog;
>>>>> +       u64 session_id;
>>>>> +       u64 seq_num;
>>>>> +       bool has_last;
>>>>> +};
>>>>> +
>>>>> +struct anon_file_prog_assoc {
>>>>> +       struct list_head list;
>>>>> +       struct file *file;
>>>>> +       struct bpf_prog *prog;
>>>>> +};
>>>>> +
>>>>>    static struct list_head targets;
>>>>>    static struct mutex targets_mutex;
>>>>>    static bool bpf_iter_inited = false;
>>>>> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
>>>>>    /* protect bpf_iter_link.link->prog update */
>>>>>    static struct mutex bpf_iter_mutex;
>>>>>
>>>>> +/* At anon seq_file release time, the prog cannot be
>>>>> + * retrieved because the target seq_priv_size is not available.
>>>>> + * Keep a list of <anon_file, prog> mappings, so that
>>>>> + * at file release stage, the prog can be released properly.
>>>>> + */
>>>>> +static struct list_head anon_iter_info;
>>>>> +static struct mutex anon_iter_info_mutex;
>>>>> +
>>>>> +/* incremented on every opened seq_file */
>>>>> +static atomic64_t session_id;
>>>>> +
>>>>> +static u32 get_total_priv_dsize(u32 old_size)
>>>>> +{
>>>>> +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
>>>>> +}
>>>>> +
>>>>> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
>>>>> +{
>>>>> +       return old_ptr + roundup(old_size, 8);
>>>>> +}
>>>>> +
>>>>> +static int anon_iter_release(struct inode *inode, struct file *file)
>>>>> +{
>>>>> +       struct anon_file_prog_assoc *finfo;
>>>>> +
>>>>> +       mutex_lock(&anon_iter_info_mutex);
>>>>> +       list_for_each_entry(finfo, &anon_iter_info, list) {
>>>>> +               if (finfo->file == file) {
>>>>
>>>> I'll look at this and other patches more thoroughly tomorrow with
>>>> clear head, but this iteration to find anon_file_prog_assoc is really
>>>> unfortunate.
>>>>
>>>> I think the problem is that you are allowing seq_file infrastructure
>>>> to call directly into target implementation of seq_operations without
>>>> intercepting them. If you change that and put whatever extra info is
>>>> necessary into seq_file->private in front of target's private state,
>>>> then you shouldn't need this, right?
>>>
>>> Yes. This is true. The idea is to minimize target changes.
>>> But maybe this is not a good goal by itself.
>>>
>>> You are right: if I intercept all seq_ops(), I do not need the
>>> above change. I can tailor the seq_file private_data right before
>>> calling the target one and restore it after the target call.
>>>
>>> Originally I only had one interception, show(); now I have
>>> stop() too, to call bpf at the end of iteration. Maybe I can
>>> intercept all four, I think. This way, I can also get rid
>>> of the target feature.
>>
>> If the main goal is to minimize target changes and make them exactly
>> seq_operations implementation, then one easier way to get access
>> to our own metadata in seq_file->private is to set it to point
>> **after** our metadata, but before target's metadata. Roughly in
>> pseudo code:
>>
>> struct bpf_iter_seq_file_meta {} __attribute__((aligned(8)));
>>
>> void *meta = kmalloc(sizeof(struct bpf_iter_seq_file_meta) +
>> target_private_size);
>> seq_file->private = meta + sizeof(struct bpf_iter_seq_file_meta);
> I have suggested the same thing earlier.  Good to know that we think alike ;)
> 
> Maybe put them in a struct such that container_of() etc. can be used:
> struct bpf_iter_private {
>          struct extra_priv_data iter_private;
> 	u8 target_private[] __aligned(8);
> };

This should work, but I would need to intercept all seq_ops() operations,
because the target expects the private data to be `target_private` only.
Let me experiment to find the best way to do this.
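
For example, an intercepted show() would roughly be the following (a
sketch; it assumes seq->private points at the full bpf_iter_private and
that the struct also carries a tinfo pointer, which is my addition):

static int bpf_iter_seq_show(struct seq_file *seq, void *v)
{
        struct bpf_iter_private *priv = seq->private;
        int ret;

        /* present only the target's view while its callback runs */
        seq->private = priv->target_private;
        ret = priv->tinfo->seq_ops->show(seq, v);
        seq->private = priv;

        return ret;
}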

> 
>>
>>
>> Then to recover bpf_iter_seq_file_meta:
>>
>> struct bpf_iter_seq_file_meta *meta = seq_file->private - sizeof(*meta);
>>
>> /* voila! */
>>
>> This doesn't have the benefit of making targets simpler, but it will
>> require no changes to them at all. Plus fewer indirect calls, so less
>> performance penalty.
>>


* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29  6:51                   ` Yonghong Song
@ 2020-04-29 19:25                     ` Andrii Nakryiko
  0 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29 19:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Daniel Borkmann, Kernel Team

On Tue, Apr 28, 2020 at 11:51 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/28/20 11:34 PM, Martin KaFai Lau wrote:
> > On Tue, Apr 28, 2020 at 11:20:30PM -0700, Yonghong Song wrote:
> >>
> >>
> >> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> >>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> >>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> >>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> >>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
> >>>>>>>>> bpf_iter_seq_map_info),
> >>>>>>>>> +                 &meta.session_id, &meta.seq_num,
> >>>>>>>>> +                 v == (void *)0);
> >>>>>>>>    From looking at seq_file.c, when will show() be called with "v ==
> >>>>>>>> NULL"?
> >>>>>>>>
> >>>>>>>
> >>>>>>> that v == NULL here and the whole verifier change just to allow NULL...
> >>>>>>> may be use seq_num as an indicator of the last elem instead?
> >>>>>>> Like seq_num with upper bit set to indicate that it's last?
> >>>>>>
> >>>>>> We could. But then verifier won't have an easy way to verify that.
> >>>>>> For example, the above is expected:
> >>>>>>
> >>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>            if (seq_num >> 63)
> >>>>>>              return 0;
> >>>>>>            ... map->id ...
> >>>>>>            ... map->user_cnt ...
> >>>>>>         }
> >>>>>>
> >>>>>> But if user writes
> >>>>>>
> >>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>             ... map->id ...
> >>>>>>             ... map->user_cnt ...
> >>>>>>         }
> >>>>>>
> >>>>>> it won't be easy for the verifier to conclude there is improper map
> >>>>>> pointer tracking here, and in the above, map->id and map->user_cnt
> >>>>>> will cause exceptions and silently get value 0.
> >>>>>
> >>>>> I mean always pass valid object pointer into the prog.
> >>>>> In above case 'map' will always be valid.
> >>>>> Consider prog that iterating all map elements.
> >>>>> It's weird that the prog would always need to do
> >>>>> if (map == 0)
> >>>>>      goto out;
> >>>>> even if it doesn't care about finding last.
> >>>>> All progs would have to have such extra 'if'.
> >>>>> If we always pass a valid object then there is no need
> >>>>> for such extra checks inside the prog.
> >>>>> First and last element can be indicated via seq_num
> >>>>> or via another flag or via helper call like is_this_last_elem()
> >>>>> or something.
> >>>>
> >>>> Okay, I see what you mean now. Basically this means
> >>>> seq_ops->next() should try to get/maintain the next two elements,
> >>>
> >>> What about the case when there are no elements to iterate to begin
> >>> with? In that case, we still need to call bpf_prog for (empty)
> >>> post-aggregation, but we have no valid element... For bpf_map
> >>> iteration we could have fake empty bpf_map that would be passed, but
> >>> I'm not sure it's applicable for any type of object (e.g., having a
> >>> fake task_struct is probably quite a bit more problematic?)...
> >>
> >> Oh, yes, thanks for reminding me of this. I put a call to
> >> bpf_prog in seq_ops->stop() especially to handle the no-object
> >> case. In that case, seq_ops->start() will return NULL,
> >> seq_ops->next() won't be called, and then seq_ops->stop()
> >> is called. My earlier attempt tried to hook into next()
> >> and then found it did not work in all cases.
> >>
> >>>
> >>>> otherwise, we won't know whether the one in seq_ops->show()
> >>>> is the last or not.
> > I think "show()" is convoluted with "stop()/eof()".  Could "stop()/eof()"
> > be its own separate (and optional) bpf_prog which only does "stop()/eof()"?
>
> I thought about this before. But then the user needs to write a separate
> program instead of a simple "if" condition in the main program...
>

I agree with Yonghong: requiring the user to check for NULL is pretty
trivial, and the verifier can give a very clear error message if the
user didn't check.
PTR_TO_BTF_ID_OR_NULL seems useful in general as well: it's an
optional typed input argument and might be useful in other
situations. The verifier changes don't seem excessive either.
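
For example, a minimal iterator program with the NULL check would just
be the following (a sketch; the context struct follows the
bpf_iter_bpf_map definition from this patch set, while the section name
and includes are assumptions):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

__u64 map_cnt = 0;

SEC("iter/bpf_map")
int count_maps(struct bpf_iter_bpf_map *ctx)
{
        /* NULL map marks the final, post-iteration invocation */
        if (!ctx->map)
                return 0;

        /* past this check the verifier treats ctx->map as a
         * non-NULL PTR_TO_BTF_ID, so field accesses are allowed
         */
        map_cnt++;
        return 0;
}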

Having two coupled BPF programs to do a single iteration becomes awkward
to manage, and will complicate the kernel interface (e.g., special
variants of LINK_CREATE and LINK_UPDATE) and the libbpf implementation.
It's also going to be harder to replace them atomically. I think overall
the cons outweigh the pros.

As one way to maybe simplify it for users a bit, we can make this
post-aggregation call optional with an extra flag on BPF_PROG_LOAD.
Unless the extra flag is specified, input arguments can stay PTR_TO_BTF_ID
and we'll just get non-NULL inputs and no "end of iteration" call.
With the extra flag, inputs become PTR_TO_BTF_ID_OR_NULL and there is
one extra call at the end.



> >
> >>>> We could do it in newly implemented
> >>>> iterator bpf_map/task/task_file. Let me check how I could
> >>>> make existing seq_ops (ipv6_route/netlink) works with
> >>>> minimum changes.


* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-27 20:12 ` [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator Yonghong Song
  2020-04-29  5:39   ` Martin KaFai Lau
  2020-04-29  6:56   ` Andrii Nakryiko
@ 2020-04-29 19:39   ` Andrii Nakryiko
  2 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29 19:39 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
>
> A new bpf command BPF_ITER_CREATE is added.
>
> The anonymous bpf iterator is seq_file based.
> The seq_file private data are referenced by targets.
> The bpf_iter infrastructure allocated additional space
> at seq_file->private after the space used by targets
> to store some meta data, e.g.,
>   prog:       prog to run
>   session_id: an unique id for each opened seq_file
>   seq_num:    how many times bpf programs are queried in this session
>   has_last:   indicate whether or not bpf_prog has been called after
>               all valid objects have been processed
>
> A map between file and prog/link is established to help
> fops->release(). When fops->release() is called, just based on
> inode and file, bpf program cannot be located since target
> seq_priv_size not available. This map helps retrieve the prog
> whose reference count needs to be decremented.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |   3 +
>  include/uapi/linux/bpf.h       |   6 ++
>  kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c           |  27 ++++++
>  tools/include/uapi/linux/bpf.h |   6 ++
>  5 files changed, 203 insertions(+), 1 deletion(-)
>

[...]

>  int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>  {
>         struct bpf_iter_target_info *tinfo;
> @@ -37,6 +95,8 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>                 INIT_LIST_HEAD(&targets);
>                 mutex_init(&targets_mutex);
>                 mutex_init(&bpf_iter_mutex);
> +               INIT_LIST_HEAD(&anon_iter_info);
> +               mutex_init(&anon_iter_info_mutex);
>                 bpf_iter_inited = true;
>         }
>
> @@ -61,7 +121,20 @@ int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
>  struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>                                    u64 *session_id, u64 *seq_num, bool is_last)

instead of passing many pointers (session_id, seq_num), would it be
better to just pass a bpf_iter_meta *?

>  {
> -       return NULL;
> +       struct extra_priv_data *extra_data;
> +
> +       if (seq->file->f_op != &anon_bpf_iter_fops)
> +               return NULL;
> +
> +       extra_data = get_extra_priv_dptr(seq->private, priv_data_size);
> +       if (extra_data->has_last)
> +               return NULL;
> +
> +       *session_id = extra_data->session_id;
> +       *seq_num = extra_data->seq_num++;
> +       extra_data->has_last = is_last;
> +
> +       return extra_data->prog;
>  }
>
>  int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
> @@ -150,3 +223,90 @@ int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>         mutex_unlock(&bpf_iter_mutex);
>         return ret;
>  }
> +
> +static void init_seq_file(void *priv_data, struct bpf_iter_target_info *tinfo,
> +                         struct bpf_prog *prog)
> +{
> +       struct extra_priv_data *extra_data;
> +
> +       if (tinfo->target_feature & BPF_DUMP_SEQ_NET_PRIVATE)
> +               set_seq_net_private((struct seq_net_private *)priv_data,
> +                                   current->nsproxy->net_ns);
> +
> +       extra_data = get_extra_priv_dptr(priv_data, tinfo->seq_priv_size);
> +       extra_data->session_id = atomic64_add_return(1, &session_id);
> +       extra_data->prog = prog;
> +       extra_data->seq_num = 0;
> +       extra_data->has_last = false;
> +}
> +
> +static int prepare_seq_file(struct file *file, struct bpf_iter_link *link)
> +{
> +       struct anon_file_prog_assoc *finfo;
> +       struct bpf_iter_target_info *tinfo;
> +       struct bpf_prog *prog;
> +       u32 total_priv_dsize;
> +       void *priv_data;
> +
> +       finfo = kmalloc(sizeof(*finfo), GFP_USER | __GFP_NOWARN);
> +       if (!finfo)
> +               return -ENOMEM;
> +
> +       mutex_lock(&bpf_iter_mutex);
> +       prog = link->link.prog;
> +       bpf_prog_inc(prog);
> +       mutex_unlock(&bpf_iter_mutex);
> +
> +       tinfo = link->tinfo;
> +       total_priv_dsize = get_total_priv_dsize(tinfo->seq_priv_size);
> +       priv_data = __seq_open_private(file, tinfo->seq_ops, total_priv_dsize);
> +       if (!priv_data) {
> +               bpf_prog_sub(prog, 1);
> +               kfree(finfo);
> +               return -ENOMEM;
> +       }
> +
> +       init_seq_file(priv_data, tinfo, prog);
> +
> +       finfo->file = file;
> +       finfo->prog = prog;
> +
> +       mutex_lock(&anon_iter_info_mutex);
> +       list_add(&finfo->list, &anon_iter_info);
> +       mutex_unlock(&anon_iter_info_mutex);
> +       return 0;
> +}
> +
> +int bpf_iter_new_fd(struct bpf_link *link)
> +{
> +       struct file *file;
> +       int err, fd;
> +
> +       if (link->ops != &bpf_iter_link_lops)
> +               return -EINVAL;
> +
> +       fd = get_unused_fd_flags(O_CLOEXEC);
> +       if (fd < 0)
> +               return fd;
> +
> +       file = anon_inode_getfile("bpf_iter", &anon_bpf_iter_fops,
> +                                 NULL, O_CLOEXEC);

Shouldn't this anon file be readable and have O_RDONLY flag as well?

> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto free_fd;
> +       }
> +
> +       err = prepare_seq_file(file,
> +                              container_of(link, struct bpf_iter_link, link));
> +       if (err)
> +               goto free_file;
> +
> +       fd_install(fd, file);
> +       return fd;
> +
> +free_file:
> +       fput(file);
> +free_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index b7af4f006f2e..458f7000887a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3696,6 +3696,30 @@ static int link_update(union bpf_attr *attr)
>         return ret;
>  }
>
> +#define BPF_ITER_CREATE_LAST_FIELD iter_create.flags
> +
> +static int bpf_iter_create(union bpf_attr *attr)
> +{
> +       struct bpf_link *link;
> +       int err;
> +
> +       if (CHECK_ATTR(BPF_ITER_CREATE))
> +               return -EINVAL;
> +
> +       if (attr->iter_create.flags)
> +               return -EINVAL;
> +
> +       link = bpf_link_get_from_fd(attr->iter_create.link_fd);
> +       if (IS_ERR(link))
> +               return PTR_ERR(link);
> +
> +       err = bpf_iter_new_fd(link);
> +       if (err < 0)
> +               bpf_link_put(link);

bpf_iter_new_fd() doesn't take a refcnt on link, so you need to put it
regardless of success or error
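
i.e., something like:

        err = bpf_iter_new_fd(link);
        /* drop the ref taken by bpf_link_get_from_fd() in all cases */
        bpf_link_put(link);

        return err;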

> +
> +       return err;
> +}
> +
>  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
>  {
>         union bpf_attr attr;
> @@ -3813,6 +3837,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>         case BPF_LINK_UPDATE:
>                 err = link_update(&attr);
>                 break;
> +       case BPF_ITER_CREATE:
> +               err = bpf_iter_create(&attr);
> +               break;
>         default:
>                 err = -EINVAL;
>                 break;

[...]


* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29 19:19                         ` Andrii Nakryiko
@ 2020-04-29 20:15                           ` Yonghong Song
  2020-04-30  3:06                             ` Alexei Starovoitov
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-04-29 20:15 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team



On 4/29/20 12:19 PM, Andrii Nakryiko wrote:
> On Wed, Apr 29, 2020 at 8:34 AM Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 4/28/20 11:44 PM, Yonghong Song wrote:
>>>
>>>
>>> On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
>>>> On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@fb.com> wrote:
>>>>>
>>>>> On 4/28/20 11:20 PM, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
>>>>>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
>>>>>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
>>>>>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
>>>>>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
>>>>>>>>>>>>> bpf_iter_seq_map_info),
>>>>>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
>>>>>>>>>>>>> +                 v == (void *)0);
>>>>>>>>>>>>     From looking at seq_file.c, when will show() be called with
>>>>>>>>>>>> "v ==
>>>>>>>>>>>> NULL"?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> that v == NULL here and the whole verifier change just to allow
>>>>>>>>>>> NULL...
>>>>>>>>>>> may be use seq_num as an indicator of the last elem instead?
>>>>>>>>>>> Like seq_num with upper bit set to indicate that it's last?
>>>>>>>>>>
>>>>>>>>>> We could. But then verifier won't have an easy way to verify that.
>>>>>>>>>> For example, the above is expected:
>>>>>>>>>>
>>>>>>>>>>          int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>>>>             if (seq_num >> 63)
>>>>>>>>>>               return 0;
>>>>>>>>>>             ... map->id ...
>>>>>>>>>>             ... map->user_cnt ...
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> But if user writes
>>>>>>>>>>
>>>>>>>>>>          int prog(struct bpf_map *map, u64 seq_num) {
>>>>>>>>>>              ... map->id ...
>>>>>>>>>>              ... map->user_cnt ...
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> it won't be easy for the verifier to conclude there is improper map
>>>>>>>>>> pointer tracking here, and in the above, map->id and map->user_cnt
>>>>>>>>>> will cause exceptions and silently get value 0.
>>>>>>>>>
>>>>>>>>> I mean always pass valid object pointer into the prog.
>>>>>>>>> In above case 'map' will always be valid.
>>>>>>>>> Consider prog that iterating all map elements.
>>>>>>>>> It's weird that the prog would always need to do
>>>>>>>>> if (map == 0)
>>>>>>>>>       goto out;
>>>>>>>>> even if it doesn't care about finding last.
>>>>>>>>> All progs would have to have such extra 'if'.
>>>>>>>>> If we always pass a valid object then there is no need
>>>>>>>>> for such extra checks inside the prog.
>>>>>>>>> First and last element can be indicated via seq_num
>>>>>>>>> or via another flag or via helper call like is_this_last_elem()
>>>>>>>>> or something.
>>>>>>>>
>>>>>>>> Okay, I see what you mean now. Basically this means
>>>>>>>> seq_ops->next() should try to get/maintain the next two elements,
>>>>>>>
>>>>>>> What about the case when there are no elements to iterate to begin
>>>>>>> with? In that case, we still need to call bpf_prog for (empty)
>>>>>>> post-aggregation, but we have no valid element... For bpf_map
>>>>>>> iteration we could have fake empty bpf_map that would be passed, but
>>>>>>> I'm not sure it's applicable for any type of object (e.g., having a
>>>>>>> fake task_struct is probably quite a bit more problematic?)...
>>>>>>
>>>>>> Oh, yes, thanks for reminding me of this. I put a call to
>>>>>> bpf_prog in seq_ops->stop() especially to handle the no-object
>>>>>> case. In that case, seq_ops->start() will return NULL,
>>>>>> seq_ops->next() won't be called, and then seq_ops->stop()
>>>>>> is called. My earlier attempt tried to hook into next()
>>>>>> and then found it did not work in all cases.
>>>>>
>>>>> wait a sec. seq_ops->stop() is not the end.
>>>>> With lseek of seq_file it can be called multiple times.
>>>
>>> Yes, I have taken care of this. When the object is NULL,
>>> the bpf program will be called. When the object is NULL again,
>>> it won't be called. The private data remembers it has
>>> been called with NULL.
>>
>> Even without lseek, stop() will be called multiple times.
>> If I read seq_file.c correctly, it will be called before
>> every copy_to_user(). Which means that for a lot of text
>> (or if read() is done with a small buffer) there will be
>> plenty of start, show, show, stop sequences.
> 
> 
> Right, start()/stop() can be called multiple times, but it seems like
> there are clear indicators of the beginning and the end of iteration:
> - start() with seq_num == 0 is the start of iteration (can be called
> multiple times, if the first element overflows the buffer);
> - stop() with p == NULL is the end of iteration (it seems it can be
> called multiple times as well, if the user keeps read()'ing after the
> iteration completed).
> 
> There is another problem with stop(), though. If the BPF program
> attempts to output anything during stop(), that output will just be
> discarded. Not great. Especially if that output overflows and we need

The stop() output will not be discarded in the following cases:
    - the regular show() objects overflow and the stop() BPF program is
      not called
    - the regular show() objects do not overflow, which means iteration
      is done, and the stop() BPF program output does not overflow.

The stop() seq_file output will be discarded if
    - the regular show() objects do not overflow and the stop() BPF
      program output overflows.
    - there are no objects to iterate; the BPF program gets called, but
      its seq_file write/printf output will be discarded.

Two options here:
   - implement Alexei's suggestion to look ahead two elements so that
     there is always a valid object, indicating the last element
     with a special flag (see the sketch below).
   - per Andrii's suggestion below, implement a new way or
     tweak seq_file() a little bit to resolve the above cases
     where stop() seq_file output is being discarded.

Will try to experiment with both options above...
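
A rough sketch of what the first option's next() could look like (the
struct, its placement in seq_file->private, and the target_ops member
are all assumptions for illustration):

struct bpf_iter_lookahead {
        const struct seq_operations *target_ops;
        void *cur;      /* object handed to show() */
        void *nxt;      /* pre-fetched successor; NULL means cur is last */
};

static void *bpf_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
        struct bpf_iter_lookahead *la = seq->private;   /* assumed layout */

        la->cur = la->nxt;
        if (la->cur)
                la->nxt = la->target_ops->next(seq, la->cur, pos);

        /* show() can treat la->cur as the last element when !la->nxt */
        return la->cur;
}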


> to re-allocate the buffer.
> 
> We are trying to use seq_file just to reuse 140 lines of code in
> seq_read(), which is no magic, just a simple double-buffer-and-retry
> piece of logic. We don't need lseek and traverse, and we don't need all
> the escaping stuff. I think the bpf_iter implementation would be much
> simpler if bpf_iter had better control over iteration. Then this whole
> "end of iteration" behavior would be crystal clear. Should we maybe
> reconsider again?
> 
> I understand we want to re-use the networking iteration code, but we can
> still do that with a custom implementation of seq_read, because we are
> still using struct seq_file and following its semantics. The change would
> be to allow stop(NULL) (or any stop() call for that matter) to perform
> output (and handle retry and buffer re-allocation). Or, alternatively,
> coupled with the seq_operations intercept proposal in the patch #7
> discussion, we can add an extra method (e.g., finish()) that would be
> called after all elements are traversed and would allow emitting extra
> stuff. We can do that (implement finish()) in seq_read as well, if
> that's going to fly ok with the seq_file maintainers, of course.
> 


* Re: [PATCH bpf-next v1 08/19] bpf: create file bpf iterator
  2020-04-27 20:12 ` [PATCH bpf-next v1 08/19] bpf: create file " Yonghong Song
@ 2020-04-29 20:40   ` Andrii Nakryiko
  2020-04-30 18:02     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29 20:40 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:18 PM Yonghong Song <yhs@fb.com> wrote:
>
> A new obj type BPF_TYPE_ITER is added to bpffs.
> To produce a file bpf iterator, the fd must
> correspond to a link_fd associated with a
> trace/iter program. When the pinned file is
> opened, a seq_file will be generated.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  3 +++
>  kernel/bpf/bpf_iter.c | 48 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/bpf/inode.c    | 28 +++++++++++++++++++++++++
>  kernel/bpf/syscall.c  |  2 +-
>  4 files changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0f0cafc65a04..601b3299b7e4 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1021,6 +1021,8 @@ static inline void bpf_enable_instrumentation(void)
>
>  extern const struct file_operations bpf_map_fops;
>  extern const struct file_operations bpf_prog_fops;
> +extern const struct file_operations bpf_link_fops;
> +extern const struct file_operations bpffs_iter_fops;
>
>  #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
>         extern const struct bpf_prog_ops _name ## _prog_ops; \
> @@ -1136,6 +1138,7 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>                           struct bpf_prog *new_prog);
>  int bpf_iter_new_fd(struct bpf_link *link);
> +void *bpf_iter_get_from_fd(u32 ufd);
>
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 1f4e778d1814..f5e933236996 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -123,7 +123,8 @@ struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>  {
>         struct extra_priv_data *extra_data;
>
> -       if (seq->file->f_op != &anon_bpf_iter_fops)
> +       if (seq->file->f_op != &anon_bpf_iter_fops &&
> +           seq->file->f_op != &bpffs_iter_fops)

Do we really need both anon_bpf_iter_fops and bpffs_iter_fops? It seems
the only difference is bpffs_iter_open. Could it be implemented as
part of anon_bpf_iter_fops as well? It seems open() is never called
for an anon inode file, so it should work for both?

>                 return NULL;
>
>         extra_data = get_extra_priv_dptr(seq->private, priv_data_size);
> @@ -310,3 +311,48 @@ int bpf_iter_new_fd(struct bpf_link *link)
>         put_unused_fd(fd);
>         return err;
>  }
> +
> +static int bpffs_iter_open(struct inode *inode, struct file *file)
> +{
> +       struct bpf_iter_link *link = inode->i_private;
> +
> +       return prepare_seq_file(file, link);
> +}
> +
> +static int bpffs_iter_release(struct inode *inode, struct file *file)
> +{
> +       return anon_iter_release(inode, file);
> +}
> +
> +const struct file_operations bpffs_iter_fops = {
> +       .open           = bpffs_iter_open,
> +       .read           = seq_read,
> +       .release        = bpffs_iter_release,
> +};
> +
> +void *bpf_iter_get_from_fd(u32 ufd)

return struct bpf_iter_link * here, given this is a specific constructor
for bpf_iter_link?

> +{
> +       struct bpf_link *link;
> +       struct bpf_prog *prog;
> +       struct fd f;
> +
> +       f = fdget(ufd);
> +       if (!f.file)
> +               return ERR_PTR(-EBADF);
> +       if (f.file->f_op != &bpf_link_fops) {
> +               link = ERR_PTR(-EINVAL);
> +               goto out;
> +       }
> +
> +       link = f.file->private_data;
> +       prog = link->prog;
> +       if (prog->expected_attach_type != BPF_TRACE_ITER) {
> +               link = ERR_PTR(-EINVAL);
> +               goto out;
> +       }
> +
> +       bpf_link_inc(link);
> +out:
> +       fdput(f);
> +       return link;
> +}
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 95087d9f4ed3..de4493983a37 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -26,6 +26,7 @@ enum bpf_type {
>         BPF_TYPE_PROG,
>         BPF_TYPE_MAP,
>         BPF_TYPE_LINK,
> +       BPF_TYPE_ITER,

Adding ITER as an alternative type of pinned object to BPF_TYPE_LINK
seems undesirable. We can allow opening bpf_iter's seq_file by doing
the same trick as is done for bpf_maps, supporting seq_show (see
bpf_mkmap() and bpf_map_support_seq_show()). Do you think we can do
the same here? If we later see that more kinds of links would want to
allow direct open() to create a file with some output from a BPF
program, we can generalize this as part of the bpf_link infrastructure.
For now, having a custom check similar to bpf_map's seems sufficient.

What do you think?
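
Roughly, by analogy with bpf_mkmap() and bpf_map_support_seq_show() in
kernel/bpf/inode.c (the link-side names below are made up, not existing
API):

static bool bpf_link_support_seq_show(const struct bpf_link *link)
{
        /* only iter links know how to render seq_file output */
        return link->ops == &bpf_iter_link_lops;
}

static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
{
        struct bpf_link *link = arg;

        return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
                             bpf_link_support_seq_show(link) ?
                             &bpffs_iter_fops : &bpf_link_fops);
}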

>  };
>
>  static void *bpf_any_get(void *raw, enum bpf_type type)
> @@ -38,6 +39,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>                 bpf_map_inc_with_uref(raw);
>                 break;
>         case BPF_TYPE_LINK:
> +       case BPF_TYPE_ITER:
>                 bpf_link_inc(raw);
>                 break;
>         default:
> @@ -58,6 +60,7 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>                 bpf_map_put_with_uref(raw);
>                 break;
>         case BPF_TYPE_LINK:
> +       case BPF_TYPE_ITER:
>                 bpf_link_put(raw);
>                 break;
>         default:
> @@ -82,6 +85,15 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
>                 return raw;
>         }
>

[...]


* Re: [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support
  2020-04-27 20:12 ` [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support Yonghong Song
@ 2020-04-29 20:46   ` Andrii Nakryiko
  2020-04-29 20:51     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-29 20:46 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:13 PM Yonghong Song <yhs@fb.com> wrote:
>
> Add bpf_reg_type PTR_TO_BTF_ID_OR_NULL support.
> For tracing/iter program, the bpf program context
> definition, e.g., for previous bpf_map target, looks like
>   struct bpf_iter_bpf_map {
>     struct bpf_dump_meta *meta;
>     struct bpf_map *map;
>   };
>
> The kernel guarantees that meta is not NULL, but the
> map pointer may be NULL. The NULL map indicates that all
> objects have been traversed, so bpf program can take
> proper action, e.g., do final aggregation and/or send
> final report to user space.
>
> Add btf_id_or_null_non0_off to the prog->aux structure, to
> indicate that for tracing programs, if the context access
> offset is not 0, the register type is set to PTR_TO_BTF_ID_OR_NULL
> instead of PTR_TO_BTF_ID. This bit is set for tracing/iter programs.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  2 ++
>  kernel/bpf/btf.c      |  5 ++++-
>  kernel/bpf/verifier.c | 19 ++++++++++++++-----
>  3 files changed, 20 insertions(+), 6 deletions(-)
>

[...]

>
>  static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
> @@ -410,7 +411,8 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
>         return type == PTR_TO_SOCKET ||
>                 type == PTR_TO_SOCKET_OR_NULL ||
>                 type == PTR_TO_TCP_SOCK ||
> -               type == PTR_TO_TCP_SOCK_OR_NULL;
> +               type == PTR_TO_TCP_SOCK_OR_NULL ||
> +               type == PTR_TO_BTF_ID_OR_NULL;

BTF_ID is not considered to be refcounted for the purposes of the
verifier, unless I'm missing something?
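
Presumably the new type belongs in reg_type_may_be_null() instead,
i.e. roughly:

static bool reg_type_may_be_null(enum bpf_reg_type type)
{
        return type == PTR_TO_MAP_VALUE_OR_NULL ||
               type == PTR_TO_SOCKET_OR_NULL ||
               type == PTR_TO_SOCK_COMMON_OR_NULL ||
               type == PTR_TO_TCP_SOCK_OR_NULL ||
               type == PTR_TO_BTF_ID_OR_NULL;
}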

>  }
>
>  static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
> @@ -462,6 +464,7 @@ static const char * const reg_type_str[] = {
>         [PTR_TO_TP_BUFFER]      = "tp_buffer",
>         [PTR_TO_XDP_SOCK]       = "xdp_sock",
>         [PTR_TO_BTF_ID]         = "ptr_",
> +       [PTR_TO_BTF_ID_OR_NULL] = "ptr_or_null_",
>  };
>

[...]


* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-29 19:20           ` Yonghong Song
@ 2020-04-29 20:50             ` Martin KaFai Lau
  2020-04-29 20:54               ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Martin KaFai Lau @ 2020-04-29 20:50 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 29, 2020 at 12:20:05PM -0700, Yonghong Song wrote:
> 
> 
> On 4/29/20 11:46 AM, Martin KaFai Lau wrote:
> > On Wed, Apr 29, 2020 at 11:16:35AM -0700, Andrii Nakryiko wrote:
> > > On Wed, Apr 29, 2020 at 12:07 AM Yonghong Song <yhs@fb.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > On 4/28/20 11:56 PM, Andrii Nakryiko wrote:
> > > > > On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > > 
> > > > > > A new bpf command BPF_ITER_CREATE is added.
> > > > > > 
> > > > > > The anonymous bpf iterator is seq_file based.
> > > > > > The seq_file private data are referenced by targets.
> > > > > > The bpf_iter infrastructure allocates additional space
> > > > > > at seq_file->private after the space used by targets
> > > > > > to store some metadata, e.g.,
> > > > > >     prog:       prog to run
> > > > > >     session_id: a unique id for each opened seq_file
> > > > > >     seq_num:    how many times the bpf program has been queried in this session
> > > > > >     has_last:   indicates whether or not the bpf_prog has been called after
> > > > > >                 all valid objects have been processed
> > > > > >
> > > > > > A map between file and prog/link is established to help
> > > > > > fops->release(). When fops->release() is called, the bpf program
> > > > > > cannot be located based on inode and file alone, since the target
> > > > > > seq_priv_size is not available. This map helps retrieve the prog
> > > > > > whose reference count needs to be decremented.
> > > > > > 
> > > > > > Signed-off-by: Yonghong Song <yhs@fb.com>
> > > > > > ---
> > > > > >    include/linux/bpf.h            |   3 +
> > > > > >    include/uapi/linux/bpf.h       |   6 ++
> > > > > >    kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
> > > > > >    kernel/bpf/syscall.c           |  27 ++++++
> > > > > >    tools/include/uapi/linux/bpf.h |   6 ++
> > > > > >    5 files changed, 203 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > > index 4fc39d9b5cd0..0f0cafc65a04 100644
> > > > > > --- a/include/linux/bpf.h
> > > > > > +++ b/include/linux/bpf.h
> > > > > > @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > > > > >    int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > > > >    int bpf_obj_get_user(const char __user *pathname, int flags);
> > > > > > 
> > > > > > +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
> > > > > > +
> > > > > >    struct bpf_iter_reg {
> > > > > >           const char *target;
> > > > > >           const char *target_func_name;
> > > > > > @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
> > > > > >    int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> > > > > >    int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
> > > > > >                             struct bpf_prog *new_prog);
> > > > > > +int bpf_iter_new_fd(struct bpf_link *link);
> > > > > > 
> > > > > >    int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
> > > > > >    int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > index f39b9fec37ab..576651110d16 100644
> > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > @@ -113,6 +113,7 @@ enum bpf_cmd {
> > > > > >           BPF_MAP_DELETE_BATCH,
> > > > > >           BPF_LINK_CREATE,
> > > > > >           BPF_LINK_UPDATE,
> > > > > > +       BPF_ITER_CREATE,
> > > > > >    };
> > > > > > 
> > > > > >    enum bpf_map_type {
> > > > > > @@ -590,6 +591,11 @@ union bpf_attr {
> > > > > >                   __u32           old_prog_fd;
> > > > > >           } link_update;
> > > > > > 
> > > > > > +       struct { /* struct used by BPF_ITER_CREATE command */
> > > > > > +               __u32           link_fd;
> > > > > > +               __u32           flags;
> > > > > > +       } iter_create;
> > > > > > +
> > > > > >    } __attribute__((aligned(8)));
> > > > > > 
> > > > > >    /* The description below is an attempt at providing documentation to eBPF
> > > > > > diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> > > > > > index fc1ce5ee5c3f..1f4e778d1814 100644
> > > > > > --- a/kernel/bpf/bpf_iter.c
> > > > > > +++ b/kernel/bpf/bpf_iter.c
> > > > > > @@ -2,6 +2,7 @@
> > > > > >    /* Copyright (c) 2020 Facebook */
> > > > > > 
> > > > > >    #include <linux/fs.h>
> > > > > > +#include <linux/anon_inodes.h>
> > > > > >    #include <linux/filter.h>
> > > > > >    #include <linux/bpf.h>
> > > > > > 
> > > > > > @@ -19,6 +20,19 @@ struct bpf_iter_link {
> > > > > >           struct bpf_iter_target_info *tinfo;
> > > > > >    };
> > > > > > 
> > > > > > +struct extra_priv_data {
> > > > > > +       struct bpf_prog *prog;
> > > > > > +       u64 session_id;
> > > > > > +       u64 seq_num;
> > > > > > +       bool has_last;
> > > > > > +};
> > > > > > +
> > > > > > +struct anon_file_prog_assoc {
> > > > > > +       struct list_head list;
> > > > > > +       struct file *file;
> > > > > > +       struct bpf_prog *prog;
> > > > > > +};
> > > > > > +
> > > > > >    static struct list_head targets;
> > > > > >    static struct mutex targets_mutex;
> > > > > >    static bool bpf_iter_inited = false;
> > > > > > @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
> > > > > >    /* protect bpf_iter_link.link->prog update */
> > > > > >    static struct mutex bpf_iter_mutex;
> > > > > > 
> > > > > > +/* At anon seq_file release time, the prog cannot be
> > > > > > + * retrieved because the target seq_priv_size is not available.
> > > > > > + * Keep a list of <anon_file, prog> mappings, so that
> > > > > > + * at file release stage, the prog can be released properly.
> > > > > > + */
> > > > > > +static struct list_head anon_iter_info;
> > > > > > +static struct mutex anon_iter_info_mutex;
> > > > > > +
> > > > > > +/* incremented on every opened seq_file */
> > > > > > +static atomic64_t session_id;
> > > > > > +
> > > > > > +static u32 get_total_priv_dsize(u32 old_size)
> > > > > > +{
> > > > > > +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
> > > > > > +}
> > > > > > +
> > > > > > +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
> > > > > > +{
> > > > > > +       return old_ptr + roundup(old_size, 8);
> > > > > > +}
> > > > > > +
> > > > > > +static int anon_iter_release(struct inode *inode, struct file *file)
> > > > > > +{
> > > > > > +       struct anon_file_prog_assoc *finfo;
> > > > > > +
> > > > > > +       mutex_lock(&anon_iter_info_mutex);
> > > > > > +       list_for_each_entry(finfo, &anon_iter_info, list) {
> > > > > > +               if (finfo->file == file) {
> > > > > 
> > > > > I'll look at this and other patches more thoroughly tomorrow with
> > > > > clear head, but this iteration to find anon_file_prog_assoc is really
> > > > > unfortunate.
> > > > > 
> > > > > I think the problem is that you are allowing seq_file infrastructure
> > > > > to call directly into target implementation of seq_operations without
> > > > > intercepting them. If you change that and put whatever extra info is
> > > > > necessary into seq_file->private in front of target's private state,
> > > > > then you shouldn't need this, right?
> > > > 
> > > > Yes. This is true. The idea is to minimize the target changes.
> > > > But maybe this is not a good goal by itself.
> > > >
> > > > You are right, if I intercept all seq_ops(), I do not need the
> > > > above change; I can tailor seq_file private_data right before
> > > > calling the target one and restore it after the target call.
> > > >
> > > > Originally I only had one interception, show(); now I have
> > > > stop() too, to call bpf at the end of iteration. Maybe I can
> > > > intercept all four, I think. This way, I can also get rid
> > > > of the target feature.
> > > 
> > > If the main goal is to minimize target changes and make them exactly
> > > seq_operations implementations, then an easy way to get access
> > > to our own metadata in seq_file->private is to set it to point
> > > **after** our metadata, but before target's metadata. Roughly in
> > > pseudo code:
> > > 
> > > struct bpf_iter_seq_file_meta {} __attribute__((aligned(8)));
> > > 
> > > void *meta = kmalloc(sizeof(struct bpf_iter_seq_file_meta) +
> > > target_private_size);
> > > seq_file->private = meta + sizeof(struct bpf_iter_seq_file_meta);
> > I have suggested the same thing earlier.  Good to know that we think alike ;)
> > 
> > May be put them in a struct such that container_of...etc can be used:
> > struct bpf_iter_private {
> >          struct extra_priv_data iter_private;
> >          u8 target_private[] __aligned(8);
> > };
> 
> This should work, but we need to intercept all seq_ops() operations
> because the target expects the private data to be `target_private` only.
> Let me experiment with what is the best way to do this.
As long as "seq_file->private = bpf_iter_private->target_private;" as
Andrii also suggested, the existing seq_ops() should not have to be
changed or needed to be intercepted because they are only
accessing it through seq_file->private.

The bpf_iter logic should be the only one that needs to access
bpf_iter_private->iter_private, and it can be obtained by, e.g.,
"container_of(seq_file->private, struct bpf_iter_private, target_private)"

> 
> > 
> > > 
> > > 
> > > Then to recover bpf_iter_seq_file_meta:
> > > 
> > > struct bpf_iter_seq_file_meta *meta = seq_file->private - sizeof(*meta);
> > > 
> > > /* voila! */
> > > 
> > > This doesn't have the benefit of making targets simpler, but it will
> > > require no changes to them at all. Plus fewer indirect calls, so less
> > > performance penalty.
> > > 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support
  2020-04-29 20:46   ` Andrii Nakryiko
@ 2020-04-29 20:51     ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-29 20:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/29/20 1:46 PM, Andrii Nakryiko wrote:
> On Mon, Apr 27, 2020 at 1:13 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Add bpf_reg_type PTR_TO_BTF_ID_OR_NULL support.
>> For tracing/iter program, the bpf program context
>> definition, e.g., for previous bpf_map target, looks like
>>    struct bpf_iter_bpf_map {
>>      struct bpf_dump_meta *meta;
>>      struct bpf_map *map;
>>    };
>>
>> The kernel guarantees that meta is not NULL, but the
>> map pointer may be NULL. The NULL map indicates that all
>> objects have been traversed, so bpf program can take
>> proper action, e.g., do final aggregation and/or send
>> final report to user space.
>>
>> Add btf_id_or_null_non0_off to prog->aux structure, to
>> indicate that for tracing programs, if the context access
>> offset is not 0, set to PTR_TO_BTF_ID_OR_NULL instead of
>> PTR_TO_BTF_ID. This bit is set for tracing/iter program.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h   |  2 ++
>>   kernel/bpf/btf.c      |  5 ++++-
>>   kernel/bpf/verifier.c | 19 ++++++++++++++-----
>>   3 files changed, 20 insertions(+), 6 deletions(-)
>>
> 
> [...]
> 
>>
>>   static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
>> @@ -410,7 +411,8 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
>>          return type == PTR_TO_SOCKET ||
>>                  type == PTR_TO_SOCKET_OR_NULL ||
>>                  type == PTR_TO_TCP_SOCK ||
>> -               type == PTR_TO_TCP_SOCK_OR_NULL;
>> +               type == PTR_TO_TCP_SOCK_OR_NULL ||
>> +               type == PTR_TO_BTF_ID_OR_NULL;
> 
> BTF_ID is not considered to be refcounted for the purpose of verifier,
> unless I'm missing something?

You are correct. PTR_TO_BTF_ID is not there, which is a clear sign
that PTR_TO_BTF_ID_OR_NULL should not be there either.

> 
>>   }
>>
>>   static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
>> @@ -462,6 +464,7 @@ static const char * const reg_type_str[] = {
>>          [PTR_TO_TP_BUFFER]      = "tp_buffer",
>>          [PTR_TO_XDP_SOCK]       = "xdp_sock",
>>          [PTR_TO_BTF_ID]         = "ptr_",
>> +       [PTR_TO_BTF_ID_OR_NULL] = "ptr_or_null_",
>>   };
>>
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator
  2020-04-29 20:50             ` Martin KaFai Lau
@ 2020-04-29 20:54               ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-29 20:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/29/20 1:50 PM, Martin KaFai Lau wrote:
> On Wed, Apr 29, 2020 at 12:20:05PM -0700, Yonghong Song wrote:
>>
>>
>> On 4/29/20 11:46 AM, Martin KaFai Lau wrote:
>>> On Wed, Apr 29, 2020 at 11:16:35AM -0700, Andrii Nakryiko wrote:
>>>> On Wed, Apr 29, 2020 at 12:07 AM Yonghong Song <yhs@fb.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 4/28/20 11:56 PM, Andrii Nakryiko wrote:
>>>>>> On Mon, Apr 27, 2020 at 1:19 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>>>
>>>>>>> A new bpf command BPF_ITER_CREATE is added.
>>>>>>>
>>>>>>> The anonymous bpf iterator is seq_file based.
>>>>>>> The seq_file private data are referenced by targets.
>>>>>>> The bpf_iter infrastructure allocates additional space
>>>>>>> at seq_file->private after the space used by targets
>>>>>>> to store some meta data, e.g.,
>>>>>>>      prog:       prog to run
>>>>>>>      session_id: a unique id for each opened seq_file
>>>>>>>      seq_num:    how many times bpf programs are queried in this session
>>>>>>>      has_last:   indicate whether or not bpf_prog has been called after
>>>>>>>                  all valid objects have been processed
>>>>>>>
>>>>>>> A map between file and prog/link is established to help
>>>>>>> fops->release(). When fops->release() is called, just based on
>>>>>>> inode and file, bpf program cannot be located since target
>>>>>>> seq_priv_size not available. This map helps retrieve the prog
>>>>>>> whose reference count needs to be decremented.
>>>>>>>
>>>>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>>> ---
>>>>>>>     include/linux/bpf.h            |   3 +
>>>>>>>     include/uapi/linux/bpf.h       |   6 ++
>>>>>>>     kernel/bpf/bpf_iter.c          | 162 ++++++++++++++++++++++++++++++++-
>>>>>>>     kernel/bpf/syscall.c           |  27 ++++++
>>>>>>>     tools/include/uapi/linux/bpf.h |   6 ++
>>>>>>>     5 files changed, 203 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>>>> index 4fc39d9b5cd0..0f0cafc65a04 100644
>>>>>>> --- a/include/linux/bpf.h
>>>>>>> +++ b/include/linux/bpf.h
>>>>>>> @@ -1112,6 +1112,8 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>>>>>>>     int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>>>>>>>     int bpf_obj_get_user(const char __user *pathname, int flags);
>>>>>>>
>>>>>>> +#define BPF_DUMP_SEQ_NET_PRIVATE       BIT(0)
>>>>>>> +
>>>>>>>     struct bpf_iter_reg {
>>>>>>>            const char *target;
>>>>>>>            const char *target_func_name;
>>>>>>> @@ -1133,6 +1135,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
>>>>>>>     int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>>>>>>>     int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>>>>>>>                              struct bpf_prog *new_prog);
>>>>>>> +int bpf_iter_new_fd(struct bpf_link *link);
>>>>>>>
>>>>>>>     int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>>>>>>     int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>> index f39b9fec37ab..576651110d16 100644
>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>> @@ -113,6 +113,7 @@ enum bpf_cmd {
>>>>>>>            BPF_MAP_DELETE_BATCH,
>>>>>>>            BPF_LINK_CREATE,
>>>>>>>            BPF_LINK_UPDATE,
>>>>>>> +       BPF_ITER_CREATE,
>>>>>>>     };
>>>>>>>
>>>>>>>     enum bpf_map_type {
>>>>>>> @@ -590,6 +591,11 @@ union bpf_attr {
>>>>>>>                    __u32           old_prog_fd;
>>>>>>>            } link_update;
>>>>>>>
>>>>>>> +       struct { /* struct used by BPF_ITER_CREATE command */
>>>>>>> +               __u32           link_fd;
>>>>>>> +               __u32           flags;
>>>>>>> +       } iter_create;
>>>>>>> +
>>>>>>>     } __attribute__((aligned(8)));
>>>>>>>
>>>>>>>     /* The description below is an attempt at providing documentation to eBPF
>>>>>>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>>>>>>> index fc1ce5ee5c3f..1f4e778d1814 100644
>>>>>>> --- a/kernel/bpf/bpf_iter.c
>>>>>>> +++ b/kernel/bpf/bpf_iter.c
>>>>>>> @@ -2,6 +2,7 @@
>>>>>>>     /* Copyright (c) 2020 Facebook */
>>>>>>>
>>>>>>>     #include <linux/fs.h>
>>>>>>> +#include <linux/anon_inodes.h>
>>>>>>>     #include <linux/filter.h>
>>>>>>>     #include <linux/bpf.h>
>>>>>>>
>>>>>>> @@ -19,6 +20,19 @@ struct bpf_iter_link {
>>>>>>>            struct bpf_iter_target_info *tinfo;
>>>>>>>     };
>>>>>>>
>>>>>>> +struct extra_priv_data {
>>>>>>> +       struct bpf_prog *prog;
>>>>>>> +       u64 session_id;
>>>>>>> +       u64 seq_num;
>>>>>>> +       bool has_last;
>>>>>>> +};
>>>>>>> +
>>>>>>> +struct anon_file_prog_assoc {
>>>>>>> +       struct list_head list;
>>>>>>> +       struct file *file;
>>>>>>> +       struct bpf_prog *prog;
>>>>>>> +};
>>>>>>> +
>>>>>>>     static struct list_head targets;
>>>>>>>     static struct mutex targets_mutex;
>>>>>>>     static bool bpf_iter_inited = false;
>>>>>>> @@ -26,6 +40,50 @@ static bool bpf_iter_inited = false;
>>>>>>>     /* protect bpf_iter_link.link->prog update */
>>>>>>>     static struct mutex bpf_iter_mutex;
>>>>>>>
>>>>>>> +/* At the anon seq_file release function, the prog cannot
>>>>>>> + * be retrieved since target seq_priv_size is not available.
>>>>>>> + * Keep a list of <anon_file, prog> mappings, so that
>>>>>>> + * at file release stage, the prog can be released properly.
>>>>>>> + */
>>>>>>> +static struct list_head anon_iter_info;
>>>>>>> +static struct mutex anon_iter_info_mutex;
>>>>>>> +
>>>>>>> +/* incremented on every opened seq_file */
>>>>>>> +static atomic64_t session_id;
>>>>>>> +
>>>>>>> +static u32 get_total_priv_dsize(u32 old_size)
>>>>>>> +{
>>>>>>> +       return roundup(old_size, 8) + sizeof(struct extra_priv_data);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
>>>>>>> +{
>>>>>>> +       return old_ptr + roundup(old_size, 8);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int anon_iter_release(struct inode *inode, struct file *file)
>>>>>>> +{
>>>>>>> +       struct anon_file_prog_assoc *finfo;
>>>>>>> +
>>>>>>> +       mutex_lock(&anon_iter_info_mutex);
>>>>>>> +       list_for_each_entry(finfo, &anon_iter_info, list) {
>>>>>>> +               if (finfo->file == file) {
>>>>>>
>>>>>> I'll look at this and other patches more thoroughly tomorrow with
>>>>>> a clear head, but this iteration to find anon_file_prog_assoc is really
>>>>>> unfortunate.
>>>>>>
>>>>>> I think the problem is that you are allowing seq_file infrastructure
>>>>>> to call directly into target implementation of seq_operations without
>>>>>> intercepting them. If you change that and put whatever extra info is
>>>>>> necessary into seq_file->private in front of target's private state,
>>>>>> then you shouldn't need this, right?
>>>>>
>>>>> Yes. This is true. The idea is to minimize the target changes.
>>>>> But maybe this is not a good goal by itself.
>>>>>
>>>>> You are right, if I intercept all seq_ops(), I do not need the
>>>>> above change; I can tailor seq_file private_data right before
>>>>> calling the target one and restore it after the target call.
>>>>>
>>>>> Originally I only had one interception, show(); now I have
>>>>> stop() too, to call bpf at the end of iteration. Maybe I can
>>>>> intercept all four, I think. This way, I can also get rid
>>>>> of the target feature.
>>>>
>>>> If the main goal is to minimize target changes and make them exactly
>>>> seq_operations implementations, then an easy way to get access
>>>> to our own metadata in seq_file->private is to set it to point
>>>> **after** our metadata, but before target's metadata. Roughly in
>>>> pseudo code:
>>>>
>>>> struct bpf_iter_seq_file_meta {} __attribute__((aligned(8)));
>>>>
>>>> void *meta = kmalloc(sizeof(struct bpf_iter_seq_file_meta) +
>>>> target_private_size);
>>>> seq_file->private = meta + sizeof(struct bpf_iter_seq_file_meta);
>>> I have suggested the same thing earlier.  Good to know that we think alike ;)
>>>
>>> May be put them in a struct such that container_of...etc can be used:
>>> struct bpf_iter_private {
>>>           struct extra_priv_data iter_private;
>>>           u8 target_private[] __aligned(8);
>>> };
>>
>> This should work, but we need to intercept all seq_ops() operations
>> because the target expects the private data to be `target_private` only.
>> Let me experiment with what is the best way to do this.
> As long as "seq_file->private = bpf_iter_private->target_private;" is
> set up as Andrii also suggested, the existing seq_ops() should not
> have to be changed or intercepted, because they only access it
> through seq_file->private.
> 
> The bpf_iter logic should be the only one that needs to access
> bpf_iter_private->iter_private, and it can be obtained by, e.g.,
> "container_of(seq_file->private, struct bpf_iter_private, target_private)"

Thanks for the explanation! I never thought of using container_of
magic here. It indeed works very nicely.
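
For the record, with this layout, retrieving the prog becomes roughly
the following (sketch; struct names are the suggested ones, not final,
and the signature is trimmed for brevity):

struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq,
				   u64 *session_id, u64 *seq_num)
{
	struct bpf_iter_private *iter_priv;

	iter_priv = container_of(seq->private, struct bpf_iter_private,
				 target_private);
	*session_id = iter_priv->iter_private.session_id;
	*seq_num = iter_priv->iter_private.seq_num;
	return iter_priv->iter_private.prog;
}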

> 
>>
>>>
>>>>
>>>>
>>>> Then to recover bpf_iter_seq_file_meta:
>>>>
>>>> struct bpf_iter_seq_file_meta *meta = seq_file->private - sizeof(*meta);
>>>>
>>>> /* voila! */
>>>>
>>>> This doesn't have the benefit of making targets simpler, but it will
>>>> require no changes to them at all. Plus fewer indirect calls, so less
>>>> performance penalty.
>>>>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support
  2020-04-27 20:12 ` [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support Yonghong Song
@ 2020-04-30  1:41   ` Andrii Nakryiko
  2020-05-02  7:17     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-30  1:41 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:17 PM Yonghong Song <yhs@fb.com> wrote:
>
> Three new libbpf APIs are added to support bpf_iter:
>   - bpf_program__attach_iter
>     Given a bpf program and additional parameters (none for
>     now), returns a bpf_link.
>   - bpf_link__create_iter
>     Given a bpf_link, create a bpf_iter and return an fd
>     so the user can then do read() to get seq_file output data.
>   - bpf_iter_create
>     syscall level API to create a bpf iterator.
>
> Two macros, BPF_SEQ_PRINTF0 and BPF_SEQ_PRINTF, are also introduced.
> These two macros can help bpf program writers with
> nicer bpf_seq_printf syntax similar to the kernel one.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/lib/bpf/bpf.c         | 11 +++++++
>  tools/lib/bpf/bpf.h         |  2 ++
>  tools/lib/bpf/bpf_tracing.h | 23 ++++++++++++++
>  tools/lib/bpf/libbpf.c      | 60 +++++++++++++++++++++++++++++++++++++
>  tools/lib/bpf/libbpf.h      | 11 +++++++
>  tools/lib/bpf/libbpf.map    |  7 +++++
>  6 files changed, 114 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 5cc1b0785d18..7ffd6c0ad95f 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -619,6 +619,17 @@ int bpf_link_update(int link_fd, int new_prog_fd,
>         return sys_bpf(BPF_LINK_UPDATE, &attr, sizeof(attr));
>  }
>
> +int bpf_iter_create(int link_fd, unsigned int flags)

Do you envision anything more than just flags being passed for
bpf_iter_create? I wonder if we should just go ahead with an
options struct here?
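
E.g. (sketch, following the usual opts convention; names are
illustrative):

struct bpf_iter_create_opts {
	size_t sz; /* size of this struct, for fwd/bwd compatibility */
	__u32 flags;
};
#define bpf_iter_create_opts__last_field flags

LIBBPF_API int bpf_iter_create(int link_fd,
			       const struct bpf_iter_create_opts *opts);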

> +{
> +       union bpf_attr attr;
> +
> +       memset(&attr, 0, sizeof(attr));
> +       attr.iter_create.link_fd = link_fd;
> +       attr.iter_create.flags = flags;
> +
> +       return sys_bpf(BPF_ITER_CREATE, &attr, sizeof(attr));
> +}
> +

[...]

> +/*
> + * BPF_SEQ_PRINTF to wrap bpf_seq_printf to-be-printed values
> + * in a structure. BPF_SEQ_PRINTF0 is a simple wrapper for
> + * bpf_seq_printf().
> + */
> +#define BPF_SEQ_PRINTF0(seq, fmt)                                      \
> +       ({                                                              \
> +               int ret = bpf_seq_printf(seq, fmt, sizeof(fmt),         \
> +                                        (void *)0, 0);                 \
> +               ret;                                                    \
> +       })
> +
> +#define BPF_SEQ_PRINTF(seq, fmt, args...)                              \

You can unify BPF_SEQ_PRINTF and BPF_SEQ_PRINTF0 by using the
___bpf_empty() macro. See bpf_core_read.h for a similar use case.
Specifically, look at ___empty (the equivalent of ___bpf_empty) and
the ___core_read, ___core_read0, ___core_readN macros.
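
Roughly (untested sketch; assumes ___empty()/___apply() are made
available to bpf_tracing.h the same way bpf_core_read.h defines them):

#define ___bpf_seq_printf0(seq, fmt)					\
	bpf_seq_printf(seq, fmt, sizeof(fmt), (void *)0, 0)

#define ___bpf_seq_printfN(seq, fmt, args...)				\
	({								\
		unsigned long long ___param[] = { args };		\
		bpf_seq_printf(seq, fmt, sizeof(fmt),			\
			       ___param, sizeof(___param));		\
	})

#define BPF_SEQ_PRINTF(seq, fmt, args...)				\
	___apply(___bpf_seq_printf, ___empty(args))(seq, fmt, ##args)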

> +       ({                                                              \
> +               _Pragma("GCC diagnostic push")                          \
> +               _Pragma("GCC diagnostic ignored \"-Wint-conversion\"")  \
> +               __u64 param[___bpf_narg(args)] = { args };              \

Do you need to provide the size of the array here? If you omit
___bpf_narg(args), wouldn't the compiler automatically calculate the
right size?

Also, can you please use "unsigned long long" to not have any implicit
dependency on __u64 being defined?

> +               _Pragma("GCC diagnostic pop")                           \
> +               int ret = bpf_seq_printf(seq, fmt, sizeof(fmt),         \
> +                                        param, sizeof(param));         \
> +               ret;                                                    \
> +       })
> +
>  #endif
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 8e1dc6980fac..ffdc4d8e0cc0 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -6366,6 +6366,9 @@ static const struct bpf_sec_def section_defs[] = {
>                 .is_attach_btf = true,
>                 .expected_attach_type = BPF_LSM_MAC,
>                 .attach_fn = attach_lsm),
> +       SEC_DEF("iter/", TRACING,
> +               .expected_attach_type = BPF_TRACE_ITER,
> +               .is_attach_btf = true),

It would be nice to implement auto-attach capabilities (similar to
fentry/fexit, lsm and raw_tracepoint). The section name should have
enough information for this, no?

>         BPF_PROG_SEC("xdp",                     BPF_PROG_TYPE_XDP),
>         BPF_PROG_SEC("perf_event",              BPF_PROG_TYPE_PERF_EVENT),
>         BPF_PROG_SEC("lwt_in",                  BPF_PROG_TYPE_LWT_IN),
> @@ -6629,6 +6632,7 @@ static int bpf_object__collect_struct_ops_map_reloc(struct bpf_object *obj,
>

[...]

> +
> +       link = calloc(1, sizeof(*link));
> +       if (!link)
> +               return ERR_PTR(-ENOMEM);
> +       link->detach = &bpf_link__detach_fd;
> +
> +       attach_type = bpf_program__get_expected_attach_type(prog);

Given you know it has to be BPF_TRACE_ITER, it's better to explicitly
specify that. If the provided program wasn't loaded with the correct
expected_attach_type, the kernel will reject it. But if you don't do
it, then you can accidentally create some other type of bpf_link.

> +       link_fd = bpf_link_create(prog_fd, 0, attach_type, NULL);
> +       if (link_fd < 0) {
> +               link_fd = -errno;
> +               free(link);
> +               pr_warn("program '%s': failed to attach to iterator: %s\n",
> +                       bpf_program__title(prog, false),
> +                       libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
> +               return ERR_PTR(link_fd);
> +       }
> +       link->fd = link_fd;
> +       return link;
> +}
> +
> +int bpf_link__create_iter(struct bpf_link *link, unsigned int flags)
> +{

Same question as for low-level bpf_link_create(). If we expect the
need to extend optional things in the future, I'd add opts right now.

But I wonder if bpf_link__create_iter() provides any additional value
beyond bpf_iter_create(). Maybe let's not add it (yet)?

> +       char errmsg[STRERR_BUFSIZE];
> +       int iter_fd;
> +
> +       iter_fd = bpf_iter_create(bpf_link__fd(link), flags);
> +       if (iter_fd < 0) {
> +               iter_fd = -errno;
> +               pr_warn("failed to create an iterator: %s\n",
> +                       libbpf_strerror_r(iter_fd, errmsg, sizeof(errmsg)));
> +       }
> +
> +       return iter_fd;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 11/19] bpf: add task and task/file targets
  2020-04-27 20:12 ` [PATCH bpf-next v1 11/19] bpf: add task and task/file targets Yonghong Song
@ 2020-04-30  2:08   ` Andrii Nakryiko
  2020-05-01 17:23     ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-30  2:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:17 PM Yonghong Song <yhs@fb.com> wrote:
>
> Only the tasks belonging to "current" pid namespace
> are enumerated.
>
> For task/file target, the bpf program will have access to
>   struct task_struct *task
>   u32 fd
>   struct file *file
> where fd/file is an open file for the task.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/Makefile    |   2 +-
>  kernel/bpf/task_iter.c | 319 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 320 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/task_iter.c
>

[...]

> +static void *task_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +       struct bpf_iter_seq_task_info *info = seq->private;
> +       struct task_struct *task;
> +       u32 id = info->id;
> +
> +       if (*pos == 0)
> +               info->ns = task_active_pid_ns(current);

I wonder why pid namespace is set in start() callback each time, while
net_ns was set once when seq_file is created. I think it should be
consistent, no? Either pid_ns is another feature and is set
consistently just once using the context of the process that creates
seq_file, or net_ns could be set using the same method without
bpf_iter infra knowing about this feature? Or there are some
non-obvious aspects which make pid_ns easier to work with?

Either way, process read()'ing seq_file might be different than
process open()'ing seq_file, so they might have different namespaces.
We need to decide explicitly which context should be used and do it
consistently.

> +
> +       task = task_seq_get_next(info->ns, &id);
> +       if (!task)
> +               return NULL;
> +
> +       ++*pos;
> +       info->task = task;
> +       info->id = id;
> +
> +       return task;
> +}
> +
> +static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +       struct bpf_iter_seq_task_info *info = seq->private;
> +       struct task_struct *task;
> +
> +       ++*pos;
> +       ++info->id;

this would make iterator skip pid 0? Is that by design?

> +       task = task_seq_get_next(info->ns, &info->id);
> +       if (!task)
> +               return NULL;
> +
> +       put_task_struct(info->task);

on very first iteration info->task might be NULL, right?

> +       info->task = task;
> +       return task;
> +}
> +
> +struct bpf_iter__task {
> +       __bpf_md_ptr(struct bpf_iter_meta *, meta);
> +       __bpf_md_ptr(struct task_struct *, task);
> +};
> +
> +int __init __bpf_iter__task(struct bpf_iter_meta *meta, struct task_struct *task)
> +{
> +       return 0;
> +}
> +
> +static int task_seq_show(struct seq_file *seq, void *v)
> +{
> +       struct bpf_iter_meta meta;
> +       struct bpf_iter__task ctx;
> +       struct bpf_prog *prog;
> +       int ret = 0;
> +
> +       prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_task_info),
> +                                &meta.session_id, &meta.seq_num,
> +                                v == (void *)0);
> +       if (prog) {

can it happen that prog is NULL?


> +               meta.seq = seq;
> +               ctx.meta = &meta;
> +               ctx.task = v;
> +               ret = bpf_iter_run_prog(prog, &ctx);
> +       }
> +
> +       return ret == 0 ? 0 : -EINVAL;
> +}
> +
> +static void task_seq_stop(struct seq_file *seq, void *v)
> +{
> +       struct bpf_iter_seq_task_info *info = seq->private;
> +
> +       if (!v)
> +               task_seq_show(seq, v);

hmm... show() called from stop()? what's the case where this is necessary?
> +
> +       if (info->task) {
> +               put_task_struct(info->task);
> +               info->task = NULL;
> +       }
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 17/19] tools/bpf: selftests: add iterator programs for ipv6_route and netlink
  2020-04-27 20:12 ` [PATCH bpf-next v1 17/19] tools/bpf: selftests: add iterator programs for ipv6_route and netlink Yonghong Song
@ 2020-04-30  2:12   ` Andrii Nakryiko
  0 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-30  2:12 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:18 PM Yonghong Song <yhs@fb.com> wrote:
>
> Two bpf programs are added in this patch for the netlink and ipv6_route
> targets. On my VM, I am able to achieve results identical
> to /proc/net/netlink and /proc/net/ipv6_route.
>
>   $ cat /proc/net/netlink
>   sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
>   000000002c42d58b 0   0          00000000 0        0        0     2        0        7
>   00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
>   00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
>   000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
>   ....
>   00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
>   000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
>   00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
>   000000008398fb08 16  0          00000000 0        0        0     2        0        27
>   $ cat /sys/fs/bpf/my_netlink
>   sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
>   000000002c42d58b 0   0          00000000 0        0        0     2        0        7
>   00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
>   00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
>   000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
>   ....
>   00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
>   000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
>   00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
>   000000008398fb08 16  0          00000000 0        0        0     2        0        27
>
>   $ cat /proc/net/ipv6_route
>   fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>   00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
>   fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
>   ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>   $ cat /sys/fs/bpf/my_ipv6_route
>   fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>   00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
>   fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
>   ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  .../selftests/bpf/progs/bpf_iter_ipv6_route.c | 69 +++++++++++++++++
>  .../selftests/bpf/progs/bpf_iter_netlink.c    | 77 +++++++++++++++++++
>  2 files changed, 146 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
>
> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c b/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
> new file mode 100644
> index 000000000000..bed34521f997
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Facebook */
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_endian.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern bool CONFIG_IPV6_SUBTREES __kconfig __weak;
> +
> +#define        RTF_GATEWAY             0x0002
> +#define IFNAMSIZ               16
> +#define fib_nh_gw_family        nh_common.nhc_gw_family
> +#define fib_nh_gw6              nh_common.nhc_gw.ipv6
> +#define fib_nh_dev              nh_common.nhc_dev
> +
> +SEC("iter/ipv6_route")
> +int dump_ipv6_route(struct bpf_iter__ipv6_route *ctx)
> +{
> +       static const char fmt1[] = "%pi6 %02x ";
> +       static const char fmt2[] = "%pi6 ";
> +       static const char fmt3[] = "00000000000000000000000000000000 ";
> +       static const char fmt4[] = "%08x %08x %08x %08x %8s\n";
> +       static const char fmt5[] = "%08x %08x %08x %08x\n";
> +       static const char fmt7[] = "00000000000000000000000000000000 00 ";
> +       struct seq_file *seq = ctx->meta->seq;
> +       struct fib6_info *rt = ctx->rt;
> +       const struct net_device *dev;
> +       struct fib6_nh *fib6_nh;
> +       unsigned int flags;
> +       struct nexthop *nh;
> +
> +       if (rt == (void *)0)
> +               return 0;
> +
> +       fib6_nh = &rt->fib6_nh[0];
> +       flags = rt->fib6_flags;
> +
> +       /* FIXME: nexthop_is_multipath is not handled here. */
> +       nh = rt->nh;
> +       if (rt->nh)
> +               fib6_nh = &nh->nh_info->fib6_nh;
> +
> +       BPF_SEQ_PRINTF(seq, fmt1, &rt->fib6_dst.addr, rt->fib6_dst.plen);
> +
> +       if (CONFIG_IPV6_SUBTREES)
> +               BPF_SEQ_PRINTF(seq, fmt1, &rt->fib6_src.addr,
> +                              rt->fib6_src.plen);
> +       else
> +               BPF_SEQ_PRINTF0(seq, fmt7);

Looking at these examples, I think BPF_SEQ_PRINTF should just assume
that the fmt argument is a string literal and do:

static const char ___tmp_fmt[] = fmt;

inside that macro. So one can just do:

BPF_SEQ_PRINTF(seq, "Hello, world!\n");

or

BPF_SEQ_PRINTF(seq, "My awesome template %d ==> %s\n", id, some_string);

WDYT?
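
I.e. something like this (sketch for the non-empty-args case; the
-Wint-conversion pragmas from the original macro are omitted for
brevity):

#define BPF_SEQ_PRINTF(seq, fmt, args...)				\
	({								\
		static const char ___fmt[] = fmt;			\
		unsigned long long ___param[] = { args };		\
		bpf_seq_printf(seq, ___fmt, sizeof(___fmt),		\
			       ___param, sizeof(___param));		\
	})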

> +
> +       if (fib6_nh->fib_nh_gw_family) {
> +               flags |= RTF_GATEWAY;
> +               BPF_SEQ_PRINTF(seq, fmt2, &fib6_nh->fib_nh_gw6);
> +       } else {
> +               BPF_SEQ_PRINTF0(seq, fmt3);
> +       }
> +
> +       dev = fib6_nh->fib_nh_dev;
> +       if (dev)
> +               BPF_SEQ_PRINTF(seq, fmt4, rt->fib6_metric,
> +                              rt->fib6_ref.refs.counter, 0, flags, dev->name);
> +       else
> +               BPF_SEQ_PRINTF(seq, fmt4, rt->fib6_metric,
> +                              rt->fib6_ref.refs.counter, 0, flags);
> +
> +       return 0;
> +}

[...]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-29 20:15                           ` Yonghong Song
@ 2020-04-30  3:06                             ` Alexei Starovoitov
  2020-04-30  4:01                               ` Yonghong Song
  0 siblings, 1 reply; 85+ messages in thread
From: Alexei Starovoitov @ 2020-04-30  3:06 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team

On 4/29/20 1:15 PM, Yonghong Song wrote:
>>>
>>> Even without lseek stop() will be called multiple times.
>>> If I read seq_file.c correctly it will be called before
>>> every copy_to_user(). Which means that for a lot of text
>>> (or if read() is done with small buffer) there will be
>>> plenty of start,show,show,stop sequences.
>>
>>
>> Right start/stop can be called multiple times, but seems like there
>> are clear indicators of beginning of iteration and end of iteration:
>> - start() with seq_num == 0 is start of iteration (can be called
>> multiple times, if first element overflows buffer);
>> - stop() with p == NULL is end of iteration (seems like can be called
>> multiple times as well, if user keeps read()'ing after iteration
>> completed).
>>
>> There is another problem with stop(), though. If BPF program will
>> attempt to output anything during stop(), that output will be just
>> discarded. Not great. Especially if that output overflows and we need
> 
> The stop() output will not be discarded in the following cases:
>     - regular show() objects overflow and stop() BPF program not called
>     - regular show() objects not overflow, which means iteration is done,
>       and stop() BPF program does not overflow.
> 
> The stop() seq_file output will be discarded if
>     - regular show() objects not overflow and stop() BPF program output
>       overflows.
>     - no objects to iterate, BPF program got called, but its seq_file
>       write/printf will be discarded.
> 
> Two options here:
>    - implement Alexei's suggestion to look ahead two elements so we
>      always have a valid object, indicating the last element
>      with a special flag.
>    - per Andrii's suggestion below, implement a new way or
>      tweak seq_file() a little bit to resolve the above cases
>      where stop() seq_file output is being discarded.
> 
> Will try to experiment with both of the above options...
> 
> 
>> to re-allocate buffer.
>>
>> We are trying to use seq_file just to reuse 140 lines of code in
>> seq_read(), which is no magic, just a simple double buffer and retry
>> piece of logic. We don't need lseek and traverse, we don't need all
>> the escaping stuff. I think bpf_iter implementation would be much
>> simpler if bpf_iter had better control over iteration. Then this whole
>> "end of iteration" behavior would be crystal clear. Should we maybe
>> reconsider again?

That's what I was advocating for some time now.

I think seq_file is barely usable as a /proc extension and completely
unusable for iterating.
All the discussions in the last few weeks are pointing out that
majority of use cases are in the iterating space instead of dumping.
Dumping human readable strings as unstable /proc extension is
a small subset. So I think we shouldn't use fs/seq_file.c.
The dance around double op->next() or introducing op->finish()
into seq_ops looks like a fifth wheel on the car.
I think bpf_iter semantics and bpf prog logic would be much simpler
and easier to understand if op->read method was re-implemented
for the purpose of iterating the objects.
I mean seq_op->start/next/stop can be reused as-is to iterate
existing kernel objects like sockets, but seq_read() will not be
used. We should explicitly disable lseek and write on our
cat-able files and use new bpf_seq_read() as .read op.
This specialized bpf_seq_read() will still do a sequence of
start/show/show/stop for every copy_to_user, but we don't need to
add finish() to seq_op and hack existing seq_read().
We also will be able to provide precise seq_num into the program
instead of approximation.
bpf_seq_read wouldn't need to deal with ppos and traverse.
It wouldn't need fancy m->size<<=1 retries.
It can allocate fixed PAGE_SIZE and be done with it.
It's fine to restrict bpf progs to not dump more than 4k
characters per object.
And we can call bpf_iter prog exactly once per element.
Plenty of pros and no real cons.
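
Roughly the shape I have in mind (just a sketch; locking and the
bookkeeping for partially consumed buffers are elided):

static ssize_t bpf_seq_read(struct file *file, char __user *buf,
			    size_t size, loff_t *ppos)
{
	struct seq_file *m = file->private_data;
	size_t n;
	void *p;

	/* no lseek support; ppos is intentionally ignored */
	if (!m->buf) {
		m->size = PAGE_SIZE;
		m->buf = kvmalloc(m->size, GFP_KERNEL);
		if (!m->buf)
			return -ENOMEM;
	}

	p = m->op->start(m, &m->index);
	while (p && !IS_ERR(p)) {
		if (m->op->show(m, p) || seq_has_overflowed(m))
			break;	/* error, or a single object over 4k */
		if (m->count >= size)
			break;	/* user buffer is full */
		p = m->op->next(m, p, &m->index);
	}
	m->op->stop(m, p);

	n = min(m->count, size);
	if (copy_to_user(buf, m->buf, n))
		return -EFAULT;
	m->count -= n;
	return n;
}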

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator
  2020-04-30  3:06                             ` Alexei Starovoitov
@ 2020-04-30  4:01                               ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-30  4:01 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko
  Cc: Martin KaFai Lau, Andrii Nakryiko, bpf, Networking,
	Daniel Borkmann, Kernel Team



On 4/29/20 8:06 PM, Alexei Starovoitov wrote:
> On 4/29/20 1:15 PM, Yonghong Song wrote:
>>>>
>>>> Even without lseek stop() will be called multiple times.
>>>> If I read seq_file.c correctly it will be called before
>>>> every copy_to_user(). Which means that for a lot of text
>>>> (or if read() is done with small buffer) there will be
>>>> plenty of start,show,show,stop sequences.
>>>
>>>
>>> Right start/stop can be called multiple times, but seems like there
>>> are clear indicators of beginning of iteration and end of iteration:
>>> - start() with seq_num == 0 is start of iteration (can be called
>>> multiple times, if first element overflows buffer);
>>> - stop() with p == NULL is end of iteration (seems like can be called
>>> multiple times as well, if user keeps read()'ing after iteration
>>> completed).
>>>
>>> There is another problem with stop(), though. If BPF program will
>>> attempt to output anything during stop(), that output will be just
>>> discarded. Not great. Especially if that output overflows and we need
>>
>> The stop() output will not be discarded in the following cases:
>>     - regular show() objects overflow and stop() BPF program not called
>>     - regular show() objects not overflow, which means iteration is done,
>>       and stop() BPF program does not overflow.
>>
>> The stop() seq_file output will be discarded if
>>     - regular show() objects not overflow and stop() BPF program output
>>       overflows.
>>     - no objects to iterate, BPF program got called, but its seq_file
>>       write/printf will be discarded.
>>
>> Two options here:
>>    - implement Alexei's suggestion to look ahead two elements so we
>>      always have a valid object, indicating the last element
>>      with a special flag.
>>    - per Andrii's suggestion below, implement a new way or
>>      tweak seq_file() a little bit to resolve the above cases
>>      where stop() seq_file output is being discarded.
>>
>> Will try to experiment with both of the above options...
>>
>>
>>> to re-allocate buffer.
>>>
>>> We are trying to use seq_file just to reuse 140 lines of code in
>>> seq_read(), which is no magic, just a simple double buffer and retry
>>> piece of logic. We don't need lseek and traverse, we don't need all
>>> the escaping stuff. I think bpf_iter implementation would be much
>>> simpler if bpf_iter had better control over iteration. Then this whole
>>> "end of iteration" behavior would be crystal clear. Should we maybe
>>> reconsider again?
> 
> That's what I was advocating for some time now.
> 
> I think seq_file is barely usable as a /proc extension and completely
> unusable for iterating.
> All the discussions in the last few weeks are pointing out that
> majority of use cases are in the iterating space instead of dumping.
> Dumping human readable strings as unstable /proc extension is
> a small subset. So I think we shouldn't use fs/seq_file.c.
> The dance around double op->next() or introducing op->finish()
> into seq_ops looks like a fifth wheel on the car.
> I think bpf_iter semantics and bpf prog logic would be much simpler
> and easier to understand if op->read method was re-implemented
> for the purpose of iterating the objects.
> I mean seq_op->start/next/stop can be reused as-is to iterate
> existing kernel objects like sockets, but seq_read() will not be
> used. We should explicitly disable lseek and write on our
> cat-able files and use new bpf_seq_read() as .read op.
> This specialized bpf_seq_read() will still do a sequence of
> start/show/show/stop for every copy_to_user, but we don't need to
> add finish() to seq_op and hack existing seq_read().
> We also will be able to provide precise seq_num into the program
> instead of approximation.
> bpf_seq_read wouldn't need to deal with ppos and traverse.
> It wouldn't need fancy m->size<<=1 retries.
> It can allocate fixed PAGE_SIZE and be done with it.
> It's fine to restrict bpf progs to not dump more than 4k
> characters per object.
> And we can call bpf_iter prog exactly once per element.
> Plenty of pros and no real cons.

This may indeed be simpler and more scalable since it is specific to
our use case, compared to the double next() approach with tweaking the
existing targets (ipv6_route, netlink, etc.). Will explore
this way.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 08/19] bpf: create file bpf iterator
  2020-04-29 20:40   ` Andrii Nakryiko
@ 2020-04-30 18:02     ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-04-30 18:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/29/20 1:40 PM, Andrii Nakryiko wrote:
> On Mon, Apr 27, 2020 at 1:18 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> A new obj type BPF_TYPE_ITER is added to bpffs.
>> To produce a file bpf iterator, the fd must be
>> corresponding to a link_fd assocciated with a
>> trace/iter program. When the pinned file is
>> opened, a seq_file will be generated.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h   |  3 +++
>>   kernel/bpf/bpf_iter.c | 48 ++++++++++++++++++++++++++++++++++++++++++-
>>   kernel/bpf/inode.c    | 28 +++++++++++++++++++++++++
>>   kernel/bpf/syscall.c  |  2 +-
>>   4 files changed, 79 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 0f0cafc65a04..601b3299b7e4 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1021,6 +1021,8 @@ static inline void bpf_enable_instrumentation(void)
>>
>>   extern const struct file_operations bpf_map_fops;
>>   extern const struct file_operations bpf_prog_fops;
>> +extern const struct file_operations bpf_link_fops;
>> +extern const struct file_operations bpffs_iter_fops;
>>
>>   #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
>>          extern const struct bpf_prog_ops _name ## _prog_ops; \
>> @@ -1136,6 +1138,7 @@ int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>>   int bpf_iter_link_replace(struct bpf_link *link, struct bpf_prog *old_prog,
>>                            struct bpf_prog *new_prog);
>>   int bpf_iter_new_fd(struct bpf_link *link);
>> +void *bpf_iter_get_from_fd(u32 ufd);
>>
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> index 1f4e778d1814..f5e933236996 100644
>> --- a/kernel/bpf/bpf_iter.c
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -123,7 +123,8 @@ struct bpf_prog *bpf_iter_get_prog(struct seq_file *seq, u32 priv_data_size,
>>   {
>>          struct extra_priv_data *extra_data;
>>
>> -       if (seq->file->f_op != &anon_bpf_iter_fops)
>> +       if (seq->file->f_op != &anon_bpf_iter_fops &&
>> +           seq->file->f_op != &bpffs_iter_fops)
> 
> Do we really need anon_bpf_iter_fops and bpffs_iter_fops? Seems like
> the only difference is bpffs_iter_open. Could it be implemented as
> part of anon_bpf_iter_fops as well? Seems like open() is never called
> for anon_inode_file, so it should work for both?

Yes, open() will not be used for anon_bpf_iter. I used two
file_operations just for this reason. But I guess I can
just use one. It won't hurt.

> 
>>                  return NULL;
>>
>>          extra_data = get_extra_priv_dptr(seq->private, priv_data_size);
>> @@ -310,3 +311,48 @@ int bpf_iter_new_fd(struct bpf_link *link)
>>          put_unused_fd(fd);
>>          return err;
>>   }
>> +
>> +static int bpffs_iter_open(struct inode *inode, struct file *file)
>> +{
>> +       struct bpf_iter_link *link = inode->i_private;
>> +
>> +       return prepare_seq_file(file, link);
>> +}
>> +
>> +static int bpffs_iter_release(struct inode *inode, struct file *file)
>> +{
>> +       return anon_iter_release(inode, file);
>> +}
>> +
>> +const struct file_operations bpffs_iter_fops = {
>> +       .open           = bpffs_iter_open,
>> +       .read           = seq_read,
>> +       .release        = bpffs_iter_release,
>> +};
>> +
>> +void *bpf_iter_get_from_fd(u32 ufd)
> 
> return struct bpf_iter_link * here, given this is specific constructor
> for bpf_iter_link?
> 
>> +{
>> +       struct bpf_link *link;
>> +       struct bpf_prog *prog;
>> +       struct fd f;
>> +
>> +       f = fdget(ufd);
>> +       if (!f.file)
>> +               return ERR_PTR(-EBADF);
>> +       if (f.file->f_op != &bpf_link_fops) {
>> +               link = ERR_PTR(-EINVAL);
>> +               goto out;
>> +       }
>> +
>> +       link = f.file->private_data;
>> +       prog = link->prog;
>> +       if (prog->expected_attach_type != BPF_TRACE_ITER) {
>> +               link = ERR_PTR(-EINVAL);
>> +               goto out;
>> +       }
>> +
>> +       bpf_link_inc(link);
>> +out:
>> +       fdput(f);
>> +       return link;
>> +}
>> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
>> index 95087d9f4ed3..de4493983a37 100644
>> --- a/kernel/bpf/inode.c
>> +++ b/kernel/bpf/inode.c
>> @@ -26,6 +26,7 @@ enum bpf_type {
>>          BPF_TYPE_PROG,
>>          BPF_TYPE_MAP,
>>          BPF_TYPE_LINK,
>> +       BPF_TYPE_ITER,
> 
> Adding ITER as an alternative type of pinned object to BPF_TYPE_LINK
> seems undesirable. We can allow opening bpf_iter's seq_file by doing
> the same trick as is done for bpf_maps, supporting seq_show (see
> bpf_mkmap() and bpf_map_support_seq_show()). Do you think we can do
> the same here? If we later see that more kinds of links would want to
> allow direct open() to create a file with some output from BPF
> program, we can generalize this as part of bpf_link infrastructure.
> For now having a custom check similar to bpf_map's seems sufficient.
> 
> What do you think?

Sounds good. Will use the mechanism similar to bpf_map.
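
Something along these lines maybe (sketch; bpf_link_is_iter() would be
a new small helper, mirroring what bpf_map_support_seq_show() does for
maps):

static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
{
	struct bpf_link *link = arg;

	return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
			     bpf_link_is_iter(link) ?
			     &bpf_iter_fops : &bpffs_obj_fops);
}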

> 
>>   };
>>
>>   static void *bpf_any_get(void *raw, enum bpf_type type)
>> @@ -38,6 +39,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>>                  bpf_map_inc_with_uref(raw);
>>                  break;
>>          case BPF_TYPE_LINK:
>> +       case BPF_TYPE_ITER:
>>                  bpf_link_inc(raw);
>>                  break;
>>          default:
>> @@ -58,6 +60,7 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>>                  bpf_map_put_with_uref(raw);
>>                  break;
>>          case BPF_TYPE_LINK:
>> +       case BPF_TYPE_ITER:
>>                  bpf_link_put(raw);
>>                  break;
>>          default:
>> @@ -82,6 +85,15 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
>>                  return raw;
>>          }
>>
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 14/19] bpf: support variable length array in tracing programs
  2020-04-27 20:12 ` [PATCH bpf-next v1 14/19] bpf: support variable length array in tracing programs Yonghong Song
@ 2020-04-30 20:04   ` Andrii Nakryiko
  0 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-30 20:04 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 27, 2020 at 1:17 PM Yonghong Song <yhs@fb.com> wrote:
>
> In /proc/net/ipv6_route, we have
>   struct fib6_info {
>     struct fib6_table *fib6_table;
>     ...
>     struct fib6_nh fib6_nh[0];
>   }
>   struct fib6_nh {
>     struct fib_nh_common nh_common;
>     struct rt6_info **rt6i_pcpu;
>     struct rt6_exception_bucket *rt6i_exception_bucket;
>   };
>   struct fib_nh_common {
>     ...
>     u8 nhc_gw_family;
>     ...
>   }
>
> The access:
>   struct fib6_nh *fib6_nh = &rt->fib6_nh;
>   ... fib6_nh->nh_common.nhc_gw_family ...
>
> This patch ensures such an access is handled properly.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/btf.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 2c098e6b1acc..22c69e1d5a56 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3831,6 +3831,7 @@ int btf_struct_access(struct bpf_verifier_log *log,
>         const struct btf_type *mtype, *elem_type = NULL;
>         const struct btf_member *member;
>         const char *tname, *mname;
> +       u32 vlen;
>
>  again:
>         tname = __btf_name_by_offset(btf_vmlinux, t->name_off);
> @@ -3839,7 +3840,37 @@ int btf_struct_access(struct bpf_verifier_log *log,
>                 return -EINVAL;
>         }
>
> -       if (off + size > t->size) {
> +       vlen = btf_type_vlen(t);
> +       if (vlen > 0 && off + size > t->size) {

if vlen == 0, it will skip this entire check and will eventually go to:

bpf_log(log, "struct %s doesn't have field at offset %d\n", tname, off);
return -EINVAL;

That's probably not right and we are better off still reporting:

bpf_log(log, "access beyond struct %s at off %u size %u\n",
        tname, off, size);
return -EACCES;

So this if (vlen > 0) check should be nested in this if?
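
I.e. (sketch):

	if (off + size > t->size) {
		struct btf_array *array_elem;

		if (vlen == 0)
			goto error;

		member = btf_type_member(t) + vlen - 1;
		/* ... the flex-array handling from above ... */
	}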

> +               /* If the last element is a variable size array, we may
> +                * need to relax the rule.
> +                */
> +               struct btf_array *array_elem;
> +
> +               member = btf_type_member(t) + vlen - 1;
> +               mtype = btf_type_skip_modifiers(btf_vmlinux, member->type,
> +                                               NULL);
> +               if (!btf_type_is_array(mtype))
> +                       goto error;
> +
> +               array_elem = (struct btf_array *)(mtype + 1);
> +               if (array_elem->nelems != 0)
> +                       goto error;
> +
> +               moff = btf_member_bit_offset(t, member) / 8;
> +               if (off < moff)
> +                       goto error;
> +
> +               elem_type = btf_type_skip_modifiers(btf_vmlinux,
> +                                                   array_elem->type, NULL);
> +               if (!btf_type_is_struct(elem_type))
> +                       goto error;

What about arrays of primitive types or pointers? Do we want to
explicitly disable such use cases?

> +
> +               off = (off - moff) % elem_type->size;
> +               return btf_struct_access(log, elem_type, off, size, atype,
> +                                        next_btf_id);
> +
> +error:
>                 bpf_log(log, "access beyond struct %s at off %u size %u\n",
>                         tname, off, size);
>                 return -EACCES;
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-28 16:35       ` Yonghong Song
  (?)
@ 2020-04-30 20:06       ` Andrii Nakryiko
  -1 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-04-30 20:06 UTC (permalink / raw)
  To: Yonghong Song
  Cc: kbuild test robot, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, kbuild-all, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On Tue, Apr 28, 2020 at 9:36 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/27/20 11:02 PM, kbuild test robot wrote:
> > Hi Yonghong,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on bpf-next/master]
> > [cannot apply to bpf/master net/master vhost/linux-next net-next/master linus/master v5.7-rc3 next-20200424]
> > [if your patch is applied to the wrong git tree, please drop us a note to help
> > improve the system. BTW, we also suggest to use '--base' option to specify the
> > base tree in git format-patch, please see https://stackoverflow.com/a/37406982 ]
> >
> > url:    https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-implement-bpf-iterator-for-kernel-data/20200428-115101
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> > config: sh-allmodconfig (attached as .config)
> > compiler: sh4-linux-gcc (GCC) 9.3.0
> > reproduce:
> >          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >          chmod +x ~/bin/make.cross
> >          # save the attached .config to linux build tree
> >          COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=sh
> >
> > If you fix the issue, kindly add following tag as appropriate
> > Reported-by: kbuild test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> >     In file included from kernel/trace/bpf_trace.c:10:
> >     kernel/trace/bpf_trace.c: In function 'bpf_seq_printf':
> >>> kernel/trace/bpf_trace.c:463:35: warning: the frame size of 1672 bytes is larger than 1024 bytes [-Wframe-larger-than=]
> >       463 | BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
>
> Thanks for reporting. Currently, I am supporting up to 12 string format
> specifiers, each string up to 128 bytes. To avoid races and memory
> allocation in the helper, I put it on the stack, hence the above 1672
> bytes; but practically, I think supporting 4 strings with 128 bytes
> each is enough. I will make a change in the next revision.

It's still quite a lot of data on the stack. How about a per-CPU
buffer that this function can use for temporary storage?
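
For illustration, a per-CPU scratch buffer could look roughly like
this (a minimal sketch with hypothetical names, not code from this
patch set):

struct bpf_seq_printf_bufs {
	char bufs[MAX_SEQ_PRINTF_VARARGS][MAX_SEQ_PRINTF_STR_LEN];
};
static DEFINE_PER_CPU(struct bpf_seq_printf_bufs, bpf_seq_printf_bufs);
static DEFINE_PER_CPU(int, bpf_seq_printf_nest);

static char *bpf_seq_printf_buf_get(void)
{
	/* the nest counter guards against reentrancy on the same CPU,
	 * e.g. from an interrupt; this assumes the helper runs with
	 * preemption/migration disabled
	 */
	if (this_cpu_inc_return(bpf_seq_printf_nest) != 1) {
		this_cpu_dec(bpf_seq_printf_nest);
		return NULL;
	}
	return this_cpu_ptr(&bpf_seq_printf_bufs)->bufs[0];
}

static void bpf_seq_printf_buf_put(void)
{
	this_cpu_dec(bpf_seq_printf_nest);
}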

>
> >           |                                   ^~~~~~~~
> >     include/linux/filter.h:456:30: note: in definition of macro '__BPF_CAST'
> >       456 |           (unsigned long)0, (t)0))) a
> >           |                              ^
> >>> include/linux/filter.h:449:27: note: in expansion of macro '__BPF_MAP_5'
> >       449 | #define __BPF_MAP(n, ...) __BPF_MAP_##n(__VA_ARGS__)
> >           |                           ^~~~~~~~~~
> >>> include/linux/filter.h:474:35: note: in expansion of macro '__BPF_MAP'
> >       474 |   return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
> >           |                                   ^~~~~~~~~
> >>> include/linux/filter.h:484:31: note: in expansion of macro 'BPF_CALL_x'
> >       484 | #define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
> >           |                               ^~~~~~~~~~
> >>> kernel/trace/bpf_trace.c:463:1: note: in expansion of macro 'BPF_CALL_5'
> >       463 | BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
> >           | ^~~~~~~~~~
> >
> > vim +463 kernel/trace/bpf_trace.c
> >
> >     462
> >   > 463       BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
> >     464                  const void *, data, u32, data_len)
> >     465       {
> >     466               char bufs[MAX_SEQ_PRINTF_VARARGS][MAX_SEQ_PRINTF_STR_LEN];
> >     467               u64 params[MAX_SEQ_PRINTF_VARARGS];
> >     468               int i, copy_size, num_args;
> >     469               const u64 *args = data;
> >     470               int fmt_cnt = 0;
> >     471
> [...]


* Re: [PATCH bpf-next v1 11/19] bpf: add task and task/file targets
  2020-04-30  2:08   ` Andrii Nakryiko
@ 2020-05-01 17:23     ` Yonghong Song
  2020-05-01 19:01       ` Andrii Nakryiko
  0 siblings, 1 reply; 85+ messages in thread
From: Yonghong Song @ 2020-05-01 17:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/29/20 7:08 PM, Andrii Nakryiko wrote:
> On Mon, Apr 27, 2020 at 1:17 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Only the tasks belonging to "current" pid namespace
>> are enumerated.
>>
>> For task/file target, the bpf program will have access to
>>    struct task_struct *task
>>    u32 fd
>>    struct file *file
>> where fd/file is an open file for the task.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   kernel/bpf/Makefile    |   2 +-
>>   kernel/bpf/task_iter.c | 319 +++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 320 insertions(+), 1 deletion(-)
>>   create mode 100644 kernel/bpf/task_iter.c
>>
> 
> [...]
> 
>> +static void *task_seq_start(struct seq_file *seq, loff_t *pos)
>> +{
>> +       struct bpf_iter_seq_task_info *info = seq->private;
>> +       struct task_struct *task;
>> +       u32 id = info->id;
>> +
>> +       if (*pos == 0)
>> +               info->ns = task_active_pid_ns(current);
> 
> I wonder why pid namespace is set in start() callback each time, while
> net_ns was set once when seq_file is created. I think it should be
> consistent, no? Either pid_ns is another feature and is set
> consistently just once using the context of the process that creates
> seq_file, or net_ns could be set using the same method without
> bpf_iter infra knowing about this feature? Or there are some
> non-obvious aspects which make pid_ns easier to work with?
> 
> Either way, process read()'ing seq_file might be different than
> process open()'ing seq_file, so they might have different namespaces.
> We need to decide explicitly which context should be used and do it
> consistently.

Good point. For the networking case, the `net` namespace is locked
down at seq_file open time and is then used for seq_read().

I think I should do the same thing here, locking down the pid
namespace at open.
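
For illustration, this could be done in the seq_file private-data
init/fini callbacks, roughly like the following (function names here
are hypothetical):

static int init_seq_pidns(void *priv_data)
{
	struct bpf_iter_seq_task_info *info = priv_data;

	/* capture and pin the opener's pid namespace exactly once */
	info->ns = get_pid_ns(task_active_pid_ns(current));
	return 0;
}

static void fini_seq_pidns(void *priv_data)
{
	struct bpf_iter_seq_task_info *info = priv_data;

	put_pid_ns(info->ns);
}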

> 
>> +
>> +       task = task_seq_get_next(info->ns, &id);
>> +       if (!task)
>> +               return NULL;
>> +
>> +       ++*pos;
>> +       info->task = task;
>> +       info->id = id;
>> +
>> +       return task;
>> +}
>> +
>> +static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> +       struct bpf_iter_seq_task_info *info = seq->private;
>> +       struct task_struct *task;
>> +
>> +       ++*pos;
>> +       ++info->id;
> 
> this would make iterator skip pid 0? Is that by design?

start() will try to find pid 0. That means start() will never
return SEQ_START_TOKEN, since the bpf program wouldn't be called for
it anyway.

> 
>> +       task = task_seq_get_next(info->ns, &info->id);
>> +       if (!task)
>> +               return NULL;
>> +
>> +       put_task_struct(info->task);
> 
> on very first iteration info->task might be NULL, right?

Even on the first iteration info->task is not NULL. start()
always searches for the first real task, starting from idr number 0.

> 
>> +       info->task = task;
>> +       return task;
>> +}
>> +
>> +struct bpf_iter__task {
>> +       __bpf_md_ptr(struct bpf_iter_meta *, meta);
>> +       __bpf_md_ptr(struct task_struct *, task);
>> +};
>> +
>> +int __init __bpf_iter__task(struct bpf_iter_meta *meta, struct task_struct *task)
>> +{
>> +       return 0;
>> +}
>> +
>> +static int task_seq_show(struct seq_file *seq, void *v)
>> +{
>> +       struct bpf_iter_meta meta;
>> +       struct bpf_iter__task ctx;
>> +       struct bpf_prog *prog;
>> +       int ret = 0;
>> +
>> +       prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_task_info),
>> +                                &meta.session_id, &meta.seq_num,
>> +                                v == (void *)0);
>> +       if (prog) {
> 
> can it happen that prog is NULL?

Yes. This function is shared between show() and stop().
The stop() function might be called multiple times, since the
user can repeatedly call read() even when nothing is left, in
which case the seq_ops sequence is just start() followed by
stop().

> 
> 
>> +               meta.seq = seq;
>> +               ctx.meta = &meta;
>> +               ctx.task = v;
>> +               ret = bpf_iter_run_prog(prog, &ctx);
>> +       }
>> +
>> +       return ret == 0 ? 0 : -EINVAL;
>> +}
>> +
>> +static void task_seq_stop(struct seq_file *seq, void *v)
>> +{
>> +       struct bpf_iter_seq_task_info *info = seq->private;
>> +
>> +       if (!v)
>> +               task_seq_show(seq, v);
> 
> hmm... show() called from stop()? what's the case where this is necessary?

I will refactor this to be cleaner. The point is to invoke the bpf
program from stop() with a NULL object, to signal the end of the
iteration.
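
For illustration, the NULL object is what would let a bpf program
emit a trailer once iteration is done. A sketch, assuming the ctx
layout from this patch and the BPF_SEQ_PRINTF macros from patch 15:

SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;

	if (task == (void *)0) {
		/* invoked from stop(): iteration is over */
		BPF_SEQ_PRINTF0(seq, "=== END ===\n");
		return 0;
	}

	BPF_SEQ_PRINTF(seq, "%8d %s\n", task->pid, task->comm);
	return 0;
}

char _license[] SEC("license") = "GPL";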

>> +
>> +       if (info->task) {
>> +               put_task_struct(info->task);
>> +               info->task = NULL;
>> +       }
>> +}
>> +
> 
> [...]
> 


* Re: [PATCH bpf-next v1 11/19] bpf: add task and task/file targets
  2020-05-01 17:23     ` Yonghong Song
@ 2020-05-01 19:01       ` Andrii Nakryiko
  0 siblings, 0 replies; 85+ messages in thread
From: Andrii Nakryiko @ 2020-05-01 19:01 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, May 1, 2020 at 10:23 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/29/20 7:08 PM, Andrii Nakryiko wrote:
> > On Mon, Apr 27, 2020 at 1:17 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> Only the tasks belonging to "current" pid namespace
> >> are enumerated.
> >>
> >> For task/file target, the bpf program will have access to
> >>    struct task_struct *task
> >>    u32 fd
> >>    struct file *file
> >> where fd/file is an open file for the task.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   kernel/bpf/Makefile    |   2 +-
> >>   kernel/bpf/task_iter.c | 319 +++++++++++++++++++++++++++++++++++++++++
> >>   2 files changed, 320 insertions(+), 1 deletion(-)
> >>   create mode 100644 kernel/bpf/task_iter.c
> >>
> >
> > [...]
> >
> >> +static void *task_seq_start(struct seq_file *seq, loff_t *pos)
> >> +{
> >> +       struct bpf_iter_seq_task_info *info = seq->private;
> >> +       struct task_struct *task;
> >> +       u32 id = info->id;
> >> +
> >> +       if (*pos == 0)
> >> +               info->ns = task_active_pid_ns(current);
> >
> > I wonder why pid namespace is set in start() callback each time, while
> > net_ns was set once when seq_file is created. I think it should be
> > consistent, no? Either pid_ns is another feature and is set
> > consistently just once using the context of the process that creates
> > seq_file, or net_ns could be set using the same method without
> > bpf_iter infra knowing about this feature? Or there are some
> > non-obvious aspects which make pid_ns easier to work with?
> >
> > Either way, process read()'ing seq_file might be different than
> > process open()'ing seq_file, so they might have different namespaces.
> > We need to decide explicitly which context should be used and do it
> > consistently.
>
> Good point. For the networking case, the `net` namespace is locked
> down at seq_file open time and is then used for seq_read().
>
> I think I should do the same thing here, locking down the pid
> namespace at open.

Yeah, I think it's a good idea.

>
> >
> >> +
> >> +       task = task_seq_get_next(info->ns, &id);
> >> +       if (!task)
> >> +               return NULL;
> >> +
> >> +       ++*pos;
> >> +       info->task = task;
> >> +       info->id = id;
> >> +
> >> +       return task;
> >> +}
> >> +
> >> +static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> >> +{
> >> +       struct bpf_iter_seq_task_info *info = seq->private;
> >> +       struct task_struct *task;
> >> +
> >> +       ++*pos;
> >> +       ++info->id;
> >
> > this would make iterator skip pid 0? Is that by design?
>
> start() will try to find pid 0. That means start() will never
> return SEQ_START_TOKEN, since the bpf program wouldn't be called for
> it anyway.

Never mind, I confused task_seq_next() and task_seq_get_next() :)

>
> >
> >> +       task = task_seq_get_next(info->ns, &info->id);
> >> +       if (!task)
> >> +               return NULL;
> >> +
> >> +       put_task_struct(info->task);
> >
> > on very first iteration info->task might be NULL, right?
>
> Even on the first iteration info->task is not NULL. start()
> always searches for the first real task, starting from idr number 0.
>

Right, this goes back to the same confusion as above, sorry.

> >
> >> +       info->task = task;
> >> +       return task;
> >> +}
> >> +
> >> +struct bpf_iter__task {
> >> +       __bpf_md_ptr(struct bpf_iter_meta *, meta);
> >> +       __bpf_md_ptr(struct task_struct *, task);
> >> +};
> >> +
> >> +int __init __bpf_iter__task(struct bpf_iter_meta *meta, struct task_struct *task)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +static int task_seq_show(struct seq_file *seq, void *v)
> >> +{
> >> +       struct bpf_iter_meta meta;
> >> +       struct bpf_iter__task ctx;
> >> +       struct bpf_prog *prog;
> >> +       int ret = 0;
> >> +
> >> +       prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_task_info),
> >> +                                &meta.session_id, &meta.seq_num,
> >> +                                v == (void *)0);
> >> +       if (prog) {
> >
> > can it happen that prog is NULL?
>
> Yes. This function is shared between show() and stop().
> The stop() function might be called multiple times, since the
> user can repeatedly call read() even when nothing is left, in
> which case the seq_ops sequence is just start() followed by
> stop().

Ah, right, NULL case after end of iteration, got it.

>
> >
> >
> >> +               meta.seq = seq;
> >> +               ctx.meta = &meta;
> >> +               ctx.task = v;
> >> +               ret = bpf_iter_run_prog(prog, &ctx);
> >> +       }
> >> +
> >> +       return ret == 0 ? 0 : -EINVAL;
> >> +}
> >> +
> >> +static void task_seq_stop(struct seq_file *seq, void *v)
> >> +{
> >> +       struct bpf_iter_seq_task_info *info = seq->private;
> >> +
> >> +       if (!v)
> >> +               task_seq_show(seq, v);
> >
> > hmm... show() called from stop()? what's the case where this is necessary?
>
> I will refactor this to be cleaner. The point is to invoke the
> bpf program from stop() with a NULL object, to signal the end of
> the iteration.
>
> >> +
> >> +       if (info->task) {
> >> +               put_task_struct(info->task);
> >> +               info->task = NULL;
> >> +       }
> >> +}
> >> +
> >
> > [...]
> >


* Re: [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support
  2020-04-30  1:41   ` Andrii Nakryiko
@ 2020-05-02  7:17     ` Yonghong Song
  0 siblings, 0 replies; 85+ messages in thread
From: Yonghong Song @ 2020-05-02  7:17 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/29/20 6:41 PM, Andrii Nakryiko wrote:
> On Mon, Apr 27, 2020 at 1:17 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Three new libbpf APIs are added to support bpf_iter:
>>    - bpf_program__attach_iter
>>      Given a bpf program and additional parameters (none for
>>      now), returns a bpf_link.
>>    - bpf_link__create_iter
>>      Given a bpf_link, create a bpf_iter and return a fd
>>      so user can then do read() to get seq_file output data.
>>    - bpf_iter_create
>>      syscall level API to create a bpf iterator.
>>
>> Two macros, BPF_SEQ_PRINTF0 and BPF_SEQ_PRINTF, are also introduced.
>> These two macros can help bpf program writers with
>> nicer bpf_seq_printf syntax similar to the kernel one.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/lib/bpf/bpf.c         | 11 +++++++
>>   tools/lib/bpf/bpf.h         |  2 ++
>>   tools/lib/bpf/bpf_tracing.h | 23 ++++++++++++++
>>   tools/lib/bpf/libbpf.c      | 60 +++++++++++++++++++++++++++++++++++++
>>   tools/lib/bpf/libbpf.h      | 11 +++++++
>>   tools/lib/bpf/libbpf.map    |  7 +++++
>>   6 files changed, 114 insertions(+)
>>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 5cc1b0785d18..7ffd6c0ad95f 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -619,6 +619,17 @@ int bpf_link_update(int link_fd, int new_prog_fd,
>>          return sys_bpf(BPF_LINK_UPDATE, &attr, sizeof(attr));
>>   }
>>
>> +int bpf_iter_create(int link_fd, unsigned int flags)
> 
> Do you envision anything more than just flags being passed for
> bpf_iter_create? I wonder if we should just go ahead with an options
> struct here?

I think most, if not all, parameters should go to link create.
This way, we can reach an identical anon_iter through either path:
   link -> anon_iter
   link -> pinned file -> anon_iter

I do not really expect any more fields for bpf_iter_create.
The flags field here is for potential future extension, though I
have no idea yet what that would look like.

> 
>> +{
>> +       union bpf_attr attr;
>> +
>> +       memset(&attr, 0, sizeof(attr));
>> +       attr.iter_create.link_fd = link_fd;
>> +       attr.iter_create.flags = flags;
>> +
>> +       return sys_bpf(BPF_ITER_CREATE, &attr, sizeof(attr));
>> +}
>> +
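
For illustration, a minimal user-space consumer of this API could
look as follows (a sketch; link_fd is assumed to come from a
successful bpf_link_create() for an iter program):

#include <stdio.h>
#include <unistd.h>

static void dump_iter(int link_fd)
{
	int iter_fd = bpf_iter_create(link_fd, 0);
	char buf[4096];
	ssize_t n;

	if (iter_fd < 0)
		return;
	/* each read() drives the seq_file machinery, which runs the
	 * bpf program once per object
	 */
	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(iter_fd);
}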
> 
> [...]
> 
>> +/*
>> + * BPF_SEQ_PRINTF to wrap bpf_seq_printf to-be-printed values
>> + * in a structure. BPF_SEQ_PRINTF0 is a simple wrapper for
>> + * bpf_seq_printf().
>> + */
>> +#define BPF_SEQ_PRINTF0(seq, fmt)                                      \
>> +       ({                                                              \
>> +               int ret = bpf_seq_printf(seq, fmt, sizeof(fmt),         \
>> +                                        (void *)0, 0);                 \
>> +               ret;                                                    \
>> +       })
>> +
>> +#define BPF_SEQ_PRINTF(seq, fmt, args...)                              \
> 
> You can unify BPF_SEQ_PRINTF and BPF_SEQ_PRINTF0 by using
> ___bpf_empty() macro. See bpf_core_read.h for similar use case.
> Specifically, look at ___empty (equivalent of ___bpf_empty) and
> ___core_read, ___core_read0, ___core_readN macro.

Thanks for the tip. Will try.
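
For illustration, the unification could look roughly like this. It is
only a sketch: it assumes the ___apply/___empty helpers from
bpf_core_read.h are usable here, and it is not the final macro from
the series:

#define ___bpf_seq_printf0(seq, fmt)					\
	bpf_seq_printf(seq, fmt, sizeof(fmt), (void *)0, 0)

#define ___bpf_seq_printfN(seq, fmt, args...)				\
	({								\
		_Pragma("GCC diagnostic push")				\
		_Pragma("GCC diagnostic ignored \"-Wint-conversion\"")	\
		unsigned long long param[] = { args };			\
		_Pragma("GCC diagnostic pop")				\
		bpf_seq_printf(seq, fmt, sizeof(fmt),			\
			       param, sizeof(param));			\
	})

/* dispatch on whether any varargs were supplied: ___empty(args)
 * expands to 0 for no arguments and N otherwise
 */
#define BPF_SEQ_PRINTF(seq, fmt, args...)				\
	___apply(___bpf_seq_printf, ___empty(args))(seq, fmt, ##args)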

> 
>> +       ({                                                              \
>> +               _Pragma("GCC diagnostic push")                          \
>> +               _Pragma("GCC diagnostic ignored \"-Wint-conversion\"")  \
>> +               __u64 param[___bpf_narg(args)] = { args };              \
> 
> Do you need to provide the size of the array here? If you omit
> ___bpf_narg(args), wouldn't the compiler automatically calculate the
> right size?
> 

Yes, the compiler should calculate the correct size.

> Also, can you please use "unsigned long long" to not have any implicit
> dependency on __u64 being defined?

Will do.

> 
>> +               _Pragma("GCC diagnostic pop")                           \
>> +               int ret = bpf_seq_printf(seq, fmt, sizeof(fmt),         \
>> +                                        param, sizeof(param));         \
>> +               ret;                                                    \
>> +       })
>> +
>>   #endif
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 8e1dc6980fac..ffdc4d8e0cc0 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -6366,6 +6366,9 @@ static const struct bpf_sec_def section_defs[] = {
>>                  .is_attach_btf = true,
>>                  .expected_attach_type = BPF_LSM_MAC,
>>                  .attach_fn = attach_lsm),
>> +       SEC_DEF("iter/", TRACING,
>> +               .expected_attach_type = BPF_TRACE_ITER,
>> +               .is_attach_btf = true),
> 
> It would be nice to implement auto-attach capabilities (similar to
> fentry/fexit, lsm and raw_tracepoint). Section name should have enough
> information for this, no?

In the current form, yes, auto-attach is possible.
But I am thinking we may soon have additional information,
like a map_id (appearing in link_create), etc., which would
make auto-attach impossible. That is why I implemented an
explicit attach. Is this assessment correct?

> 
>>          BPF_PROG_SEC("xdp",                     BPF_PROG_TYPE_XDP),
>>          BPF_PROG_SEC("perf_event",              BPF_PROG_TYPE_PERF_EVENT),
>>          BPF_PROG_SEC("lwt_in",                  BPF_PROG_TYPE_LWT_IN),
>> @@ -6629,6 +6632,7 @@ static int bpf_object__collect_struct_ops_map_reloc(struct bpf_object *obj,
>>
> 
> [...]
> 
>> +
>> +       link = calloc(1, sizeof(*link));
>> +       if (!link)
>> +               return ERR_PTR(-ENOMEM);
>> +       link->detach = &bpf_link__detach_fd;
>> +
>> +       attach_type = bpf_program__get_expected_attach_type(prog);
> 
> Given you know it has to be BPF_TRACE_ITER, it's better to specify
> that explicitly. If the provided program wasn't loaded with the
> correct expected_attach_type, the kernel will reject it. But if you
> don't, you can accidentally create some other type of bpf_link.

Yes, will do.
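
For illustration, the fix amounts to hard-coding the attach type at
the call site (a sketch of the intended change):

	/* always create an iter link; don't trust the program's
	 * recorded expected_attach_type here
	 */
	link_fd = bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, NULL);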

> 
>> +       link_fd = bpf_link_create(prog_fd, 0, attach_type, NULL);
>> +       if (link_fd < 0) {
>> +               link_fd = -errno;
>> +               free(link);
>> +               pr_warn("program '%s': failed to attach to iterator: %s\n",
>> +                       bpf_program__title(prog, false),
>> +                       libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
>> +               return ERR_PTR(link_fd);
>> +       }
>> +       link->fd = link_fd;
>> +       return link;
>> +}
>> +
>> +int bpf_link__create_iter(struct bpf_link *link, unsigned int flags)
>> +{
> 
> Same question as for low-level bpf_link_create(). If we expect the
> need to extend optional things in the future, I'd add opts right now.
> 
> But I wonder if bpf_link__create_iter() provides any additional value
> beyond bpf_iter_create(). Maybe let's not add it (yet)?

The only additional thing is a better warning message.
Agreed, that is pretty marginal. Will drop it.

> 
>> +       char errmsg[STRERR_BUFSIZE];
>> +       int iter_fd;
>> +
>> +       iter_fd = bpf_iter_create(bpf_link__fd(link), flags);
>> +       if (iter_fd < 0) {
>> +               iter_fd = -errno;
>> +               pr_warn("failed to create an iterator: %s\n",
>> +                       libbpf_strerror_r(iter_fd, errmsg, sizeof(errmsg)));
>> +       }
>> +
>> +       return iter_fd;
>> +}
>> +
> 
> [...]
> 


end of thread, other threads:[~2020-05-02  7:17 UTC | newest]

Thread overview: 85+ messages
2020-04-27 20:12 [PATCH bpf-next v1 00/19] bpf: implement bpf iterator for kernel data Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 01/19] net: refactor net assignment for seq_net_private structure Yonghong Song
2020-04-29  5:38   ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 02/19] bpf: implement an interface to register bpf_iter targets Yonghong Song
2020-04-28 16:20   ` Martin KaFai Lau
2020-04-28 16:50     ` Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator Yonghong Song
2020-04-29  0:37   ` Martin KaFai Lau
2020-04-29  0:48     ` Alexei Starovoitov
2020-04-29  1:15       ` Yonghong Song
2020-04-29  2:44         ` Alexei Starovoitov
2020-04-29  5:09           ` Yonghong Song
2020-04-29  6:08             ` Andrii Nakryiko
2020-04-29  6:20               ` Yonghong Song
2020-04-29  6:30                 ` Alexei Starovoitov
2020-04-29  6:40                   ` Andrii Nakryiko
2020-04-29  6:44                     ` Yonghong Song
2020-04-29 15:34                       ` Alexei Starovoitov
2020-04-29 18:14                         ` Yonghong Song
2020-04-29 19:19                         ` Andrii Nakryiko
2020-04-29 20:15                           ` Yonghong Song
2020-04-30  3:06                             ` Alexei Starovoitov
2020-04-30  4:01                               ` Yonghong Song
2020-04-29  6:34                 ` Martin KaFai Lau
2020-04-29  6:51                   ` Yonghong Song
2020-04-29 19:25                     ` Andrii Nakryiko
2020-04-29  1:02     ` Yonghong Song
2020-04-29  6:04   ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 04/19] bpf: allow loading of a bpf_iter program Yonghong Song
2020-04-29  0:54   ` Martin KaFai Lau
2020-04-29  1:27     ` Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 05/19] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE Yonghong Song
2020-04-29  1:17   ` [Potential Spoof] " Martin KaFai Lau
2020-04-29  6:25   ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 06/19] bpf: support bpf tracing/iter programs for BPF_LINK_UPDATE Yonghong Song
2020-04-29  1:32   ` Martin KaFai Lau
2020-04-29  5:04     ` Yonghong Song
2020-04-29  5:58       ` Martin KaFai Lau
2020-04-29  6:32         ` Andrii Nakryiko
2020-04-29  6:41           ` Martin KaFai Lau
2020-04-27 20:12 ` [PATCH bpf-next v1 07/19] bpf: create anonymous bpf iterator Yonghong Song
2020-04-29  5:39   ` Martin KaFai Lau
2020-04-29  6:56   ` Andrii Nakryiko
2020-04-29  7:06     ` Yonghong Song
2020-04-29 18:16       ` Andrii Nakryiko
2020-04-29 18:46         ` Martin KaFai Lau
2020-04-29 19:20           ` Yonghong Song
2020-04-29 20:50             ` Martin KaFai Lau
2020-04-29 20:54               ` Yonghong Song
2020-04-29 19:39   ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 08/19] bpf: create file " Yonghong Song
2020-04-29 20:40   ` Andrii Nakryiko
2020-04-30 18:02     ` Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 09/19] bpf: add PTR_TO_BTF_ID_OR_NULL support Yonghong Song
2020-04-29 20:46   ` Andrii Nakryiko
2020-04-29 20:51     ` Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 10/19] bpf: add netlink and ipv6_route targets Yonghong Song
2020-04-28 19:49   ` kbuild test robot
2020-04-28 19:49     ` kbuild test robot
2020-04-28 19:50   ` [RFC PATCH] bpf: __bpf_iter__netlink() can be static kbuild test robot
2020-04-28 19:50     ` kbuild test robot
2020-04-27 20:12 ` [PATCH bpf-next v1 11/19] bpf: add task and task/file targets Yonghong Song
2020-04-30  2:08   ` Andrii Nakryiko
2020-05-01 17:23     ` Yonghong Song
2020-05-01 19:01       ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 12/19] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
2020-04-28  6:02   ` kbuild test robot
2020-04-28  6:02     ` kbuild test robot
2020-04-28 16:35     ` Yonghong Song
2020-04-28 16:35       ` Yonghong Song
2020-04-30 20:06       ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 13/19] bpf: handle spilled PTR_TO_BTF_ID properly when checking stack_boundary Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 14/19] bpf: support variable length array in tracing programs Yonghong Song
2020-04-30 20:04   ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 15/19] tools/libbpf: add bpf_iter support Yonghong Song
2020-04-30  1:41   ` Andrii Nakryiko
2020-05-02  7:17     ` Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 16/19] tools/bpftool: add bpf_iter support for bptool Yonghong Song
2020-04-28  9:27   ` Quentin Monnet
2020-04-28 17:35     ` Yonghong Song
2020-04-29  8:37       ` Quentin Monnet
2020-04-27 20:12 ` [PATCH bpf-next v1 17/19] tools/bpf: selftests: add iterator programs for ipv6_route and netlink Yonghong Song
2020-04-30  2:12   ` Andrii Nakryiko
2020-04-27 20:12 ` [PATCH bpf-next v1 18/19] tools/bpf: selftests: add iter progs for bpf_map/task/task_file Yonghong Song
2020-04-27 20:12 ` [PATCH bpf-next v1 19/19] tools/bpf: selftests: add bpf_iter selftests Yonghong Song
