bpf.vger.kernel.org archive mirror
* [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures
@ 2020-04-08 23:25 Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 01/16] net: refactor net assignment for seq_net_private structure Yonghong Song
                   ` (15 more replies)
  0 siblings, 16 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Motivation:
  The current ways to dump kernel data structures are mostly:
    1. the /proc file system
    2. various specific tools like "ss" which require kernel support.
    3. drgn
  The drawback of the first two is that whenever you want to dump more, you
  need to change the kernel. For example, Martin wants to dump socket local
  storage with "ss", and a kernel change is needed for that to work ([1]).
  This is also the direct motivation for this work.

  drgn ([2]) solves this problem nicely and no kernel change is needed.
  But since drgn is not able to verify the validity of a particular pointer value,
  it might present wrong results in rare cases.

  In this patch set, we introduce bpf based dumping. Initial kernel changes are
  still needed, but a data structure change will not require kernel changes
  any more; the bpf program itself adapts to data structure changes.
  This gives certain flexibility with guaranteed correctness.

  Here, kernel seq_ops is used to facilitate dumping, similar to current
  /proc and many other lossless kernel dumping facilities.

User Interfaces:
  1. A new mount file system, bpfdump, at /sys/kernel/bpfdump is introduced.
     Different from /sys/fs/bpf, this is a single-instance mount, so all
     mounts are identical. The mount command can be:
        mount -t bpfdump bpfdump /sys/kernel/bpfdump
  2. Kernel bpf dumpable data structures are represented as directories
     under /sys/kernel/bpfdump, e.g.,
       /sys/kernel/bpfdump/ipv6_route/
       /sys/kernel/bpfdump/netlink/
       /sys/kernel/bpfdump/bpf_map/
       /sys/kernel/bpfdump/task/
       /sys/kernel/bpfdump/task/file/
     In this patch set, we use "target" to represent a particular bpf
     supported data structure, for example, the targets "ipv6_route",
     "netlink", "bpf_map", "task" and "task/file", which are the actual
     directory hierarchy relative to /sys/kernel/bpfdump/.

     Note that nested targets are supported for sub fields of a major
     data structure. For example, the target "task/file" exists to examine
     all open files of all tasks (task_struct->files), since a reference
     count and locks are needed to access task_struct->files safely.
  3. The bpftool command can be used to create a dumper:
       bpftool dumper pin <bpf_prog.o> <dumper_name>
     where the bpf_prog.o encodes the target information. For example, the
     following dumpers can be created:
       /sys/kernel/bpfdump/ipv6_route/{my1, my2}
       /sys/kernel/bpfdump/task/file/{f1, f2}
  4. Use "cat <dumper>" to dump the contents.
     Use "rm -f <dumper>" to delete the dumper.
  5. An anonymous dumper can be created without pinning to a
     physical file. The fd is returned to the application, which
     can then "read" the contents.

Please see patch #14 and #15 for bpf programs and
bpf dumper output examples.

Two new helpers, bpf_seq_printf() and bpf_seq_write(), are introduced:
bpf_seq_printf() mostly for file based dumpers and bpf_seq_write()
mostly for anonymous dumpers.

Note that certain dumpers are namespace aware. For example, the
task and task/... targets only iterate through the current pid namespace,
while ipv6_route and netlink iterate through the current net namespace.

For introspection, see patch #13,
  bpftool dumper show {target|dumper}
can show all targets and their function prototypes (for writing bpf
programs), or all dumpers with their associated bpf prog_id.
For any open file descriptors (anonymous or from dumper file),
  cat /proc/<pid>/fdinfo/<fd>
will show target and its associated prog_id as well.

In the current implementation, the userspace code in libbpf and bpftool
is really rough. My implementation of the seq_ops operations for bpf_map,
task and task/file needs more expert scrutiny. I haven't really
thought about dumper file permission control, etc.

Although the initial motivation is from Martin's sk_local_storage,
this patch set does not implement tcp6 sockets and sk_local_storage.
/proc/net/tcp6 involves three types of sockets: timewait,
request and full tcp6 sockets. Some kind of type casting is needed
to convert a sock_common to one of these three socket types based
on the socket state. This will be addressed in future work.

This is submitted as an RFC to get some comments, as the implementation
is not complete.

References:
  [1]: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
  [2]: https://github.com/osandov/drgn

Yonghong Song (16):
  net: refactor net assignment for seq_net_private structure
  bpf: create /sys/kernel/bpfdump mount file system
  bpf: provide a way for targets to register themselves
  bpf: allow loading of a dumper program
  bpf: create file or anonymous dumpers
  bpf: add netlink and ipv6_route targets
  bpf: add bpf_map target
  bpf: add task and task/file targets
  bpf: add bpf_seq_printf and bpf_seq_write helpers
  bpf: support variable length array in tracing programs
  bpf: implement query for target_proto and file dumper prog_id
  tools/libbpf: libbpf support for bpfdump
  tools/bpftool: add bpf dumper support
  tools/bpf: selftests: add dumper programs for ipv6_route and netlink
  tools/bpf: selftests: add dumper progs for bpf_map/task/task_file
  tools/bpf: selftests: add a selftest for anonymous dumper

 fs/proc/proc_net.c                            |   5 +-
 include/linux/bpf.h                           |  13 +
 include/linux/seq_file_net.h                  |   8 +
 include/uapi/linux/bpf.h                      |  38 +-
 include/uapi/linux/magic.h                    |   1 +
 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/btf.c                              |  25 +
 kernel/bpf/dump.c                             | 707 ++++++++++++++++++
 kernel/bpf/dump_task.c                        | 294 ++++++++
 kernel/bpf/syscall.c                          | 137 +++-
 kernel/bpf/verifier.c                         |  15 +
 kernel/trace/bpf_trace.c                      | 172 +++++
 net/ipv6/ip6_fib.c                            |  41 +-
 net/ipv6/route.c                              |  22 +
 net/netlink/af_netlink.c                      |  54 +-
 scripts/bpf_helpers_doc.py                    |   2 +
 tools/bpf/bpftool/dumper.c                    | 131 ++++
 tools/bpf/bpftool/main.c                      |   3 +-
 tools/bpf/bpftool/main.h                      |   1 +
 tools/include/uapi/linux/bpf.h                |  38 +-
 tools/lib/bpf/bpf.c                           |  33 +
 tools/lib/bpf/bpf.h                           |   5 +
 tools/lib/bpf/libbpf.c                        |  48 +-
 tools/lib/bpf/libbpf.h                        |   1 +
 tools/lib/bpf/libbpf.map                      |   3 +
 .../selftests/bpf/prog_tests/bpfdump_test.c   |  41 +
 .../selftests/bpf/progs/bpfdump_bpf_map.c     |  24 +
 .../selftests/bpf/progs/bpfdump_ipv6_route.c  |  63 ++
 .../selftests/bpf/progs/bpfdump_netlink.c     |  74 ++
 .../selftests/bpf/progs/bpfdump_task.c        |  21 +
 .../selftests/bpf/progs/bpfdump_task_file.c   |  24 +
 .../selftests/bpf/progs/bpfdump_test_kern.c   |  26 +
 32 files changed, 2055 insertions(+), 16 deletions(-)
 create mode 100644 kernel/bpf/dump.c
 create mode 100644 kernel/bpf/dump_task.c
 create mode 100644 tools/bpf/bpftool/dumper.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpfdump_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_bpf_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_netlink.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_task.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_task_file.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_test_kern.c

-- 
2.24.1


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC PATCH bpf-next 01/16] net: refactor net assignment for seq_net_private structure
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 02/16] bpf: create /sys/kernel/bpfdump mount file system Yonghong Song
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Refactor the assignment of "net" in the seq_net_private structure
in proc_net.c into a helper function. The helper will later
be used by bpfdump.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 fs/proc/proc_net.c           | 5 ++---
 include/linux/seq_file_net.h | 8 ++++++++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 4888c5224442..aee07c19cf8b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -75,9 +75,8 @@ static int seq_open_net(struct inode *inode, struct file *file)
 		put_net(net);
 		return -ENOMEM;
 	}
-#ifdef CONFIG_NET_NS
-	p->net = net;
-#endif
+
+	set_seq_net_private(p, net);
 	return 0;
 }
 
diff --git a/include/linux/seq_file_net.h b/include/linux/seq_file_net.h
index 0fdbe1ddd8d1..0ec4a18b9aca 100644
--- a/include/linux/seq_file_net.h
+++ b/include/linux/seq_file_net.h
@@ -35,4 +35,12 @@ static inline struct net *seq_file_single_net(struct seq_file *seq)
 #endif
 }
 
+static inline void set_seq_net_private(struct seq_net_private *p,
+				       struct net *net)
+{
+#ifdef CONFIG_NET_NS
+	p->net = net;
+#endif
+}
+
 #endif
-- 
2.24.1



* [RFC PATCH bpf-next 02/16] bpf: create /sys/kernel/bpfdump mount file system
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 01/16] net: refactor net assignment for seq_net_private structure Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves Yonghong Song
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This patch creates a mount point "bpfdump" under
/sys/kernel. The file system is a single-instance
mount, i.e., all mount points will be identical.

The magic number I picked for the new file system
is "dump".

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/magic.h |  1 +
 kernel/bpf/Makefile        |  1 +
 kernel/bpf/dump.c          | 79 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+)
 create mode 100644 kernel/bpf/dump.c

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..4ce3d8882315 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -88,6 +88,7 @@
 #define BPF_FS_MAGIC		0xcafe4a11
 #define AAFS_MAGIC		0x5a3c69f0
 #define ZONEFS_MAGIC		0x5a4f4653
+#define DUMPFS_MAGIC		0x64756d70
 
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f2d7be596966..4a1376ab2bea 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
 endif
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
+obj-$(CONFIG_BPF_SYSCALL) += dump.o
 endif
 ifeq ($(CONFIG_BPF_JIT),y)
 obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
new file mode 100644
index 000000000000..e0c33486e0e7
--- /dev/null
+++ b/kernel/bpf/dump.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2020 Facebook */
+
+#include <linux/init.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/anon_inodes.h>
+#include <linux/namei.h>
+#include <linux/fs.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+
+static void bpfdump_free_inode(struct inode *inode)
+{
+	kfree(inode->i_private);
+	free_inode_nonrcu(inode);
+}
+
+static const struct super_operations bpfdump_super_operations = {
+	.statfs		= simple_statfs,
+	.free_inode	= bpfdump_free_inode,
+};
+
+static int bpfdump_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	static const struct tree_descr files[] = { { "" } };
+	int err;
+
+	err = simple_fill_super(sb, DUMPFS_MAGIC, files);
+	if (err)
+		return err;
+
+	sb->s_op = &bpfdump_super_operations;
+	return 0;
+}
+
+static int bpfdump_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, bpfdump_fill_super);
+}
+
+static const struct fs_context_operations bpfdump_context_ops = {
+	.get_tree	= bpfdump_get_tree,
+};
+
+static int bpfdump_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &bpfdump_context_ops;
+	return 0;
+}
+
+static struct file_system_type fs_type = {
+	.owner			= THIS_MODULE,
+	.name			= "bpfdump",
+	.init_fs_context	= bpfdump_init_fs_context,
+	.kill_sb		= kill_litter_super,
+};
+
+static int __init bpfdump_init(void)
+{
+	int ret;
+
+	ret = sysfs_create_mount_point(kernel_kobj, "bpfdump");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&fs_type);
+	if (ret)
+		goto remove_mount;
+
+	return 0;
+
+remove_mount:
+	sysfs_remove_mount_point(kernel_kobj, "bpfdump");
+	return ret;
+}
+core_initcall(bpfdump_init);
-- 
2.24.1



* [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 01/16] net: refactor net assignment for seq_net_private structure Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 02/16] bpf: create /sys/kernel/bpfdump mount file system Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10 22:18   ` Andrii Nakryiko
  2020-04-10 22:25   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program Yonghong Song
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Here, the target refers to a particular data structure
inside the kernel we want to dump. For example, it
can be all task_structs in the current pid namespace,
or it could be all open files for all task_structs
in the current pid namespace.

Each target is identified with the following information:
   target_rel_path   <=== relative path to /sys/kernel/bpfdump
   target_proto      <=== kernel func proto which represents
                          bpf program signature for this target
   seq_ops           <=== seq_ops for seq_file operations
   seq_priv_size     <=== seq_file private data size
   target_feature    <=== target specific feature which needs
                          handling outside seq_ops.

The target relative path is a relative directory to /sys/kernel/bpfdump/.
For example, it could be:
   task                  <=== all tasks
   task/file             <=== all open files under all tasks
   ipv6_route            <=== all ipv6_routes
   tcp6/sk_local_storage <=== all tcp6 socket local storages
   foo/bar/tar           <=== all tar's in bar in foo

The "target_feature" is mostly used for reusing existing seq_ops.
For example, for /proc/net/<> stats, the "net" namespace is often
stored in the file private data. The target_feature enables a bpf
based dumper to set "net" properly for itself before calling the
shared seq_ops.

bpf_dump_reg_target() is implemented so that targets
can register themselves. Currently, modules are not
supported, so there is no bpf_dump_unreg_target().
The main reason is that BTF is not available for modules
yet.

Since a target might call bpf_dump_reg_target() before
the bpfdump mount point is created, __bpfdump_init()
may be called from bpf_dump_reg_target() as well.

The file-based dumpers will be regular files under
the specific target directory. For example,
   task/my1      <=== dumper "my1" iterates through all tasks
   task/file/my2 <=== dumper "my2" iterates through all open files
                      under all tasks

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h |   4 +
 kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fd2b2322412d..53914bec7590 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1109,6 +1109,10 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
+int bpf_dump_reg_target(const char *target, const char *target_proto,
+			const struct seq_operations *seq_ops,
+			u32 seq_priv_size, u32 target_feature);
+
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
index e0c33486e0e7..45528846557f 100644
--- a/kernel/bpf/dump.c
+++ b/kernel/bpf/dump.c
@@ -12,6 +12,173 @@
 #include <linux/filter.h>
 #include <linux/bpf.h>
 
+struct bpfdump_target_info {
+	struct list_head list;
+	const char *target;
+	const char *target_proto;
+	struct dentry *dir_dentry;
+	const struct seq_operations *seq_ops;
+	u32 seq_priv_size;
+	u32 target_feature;
+};
+
+struct bpfdump_targets {
+	struct list_head dumpers;
+	struct mutex dumper_mutex;
+};
+
+/* registered dump targets */
+static struct bpfdump_targets dump_targets;
+
+static struct dentry *bpfdump_dentry;
+
+static struct dentry *bpfdump_add_dir(const char *name, struct dentry *parent,
+				      const struct inode_operations *i_ops,
+				      void *data);
+static int __bpfdump_init(void);
+
+static int dumper_unlink(struct inode *dir, struct dentry *dentry)
+{
+	kfree(d_inode(dentry)->i_private);
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations bpf_dir_iops = {
+	.lookup		= simple_lookup,
+	.unlink		= dumper_unlink,
+};
+
+int bpf_dump_reg_target(const char *target,
+			const char *target_proto,
+			const struct seq_operations *seq_ops,
+			u32 seq_priv_size, u32 target_feature)
+{
+	struct bpfdump_target_info *tinfo, *ptinfo;
+	struct dentry *dentry, *parent;
+	const char *lastslash;
+	bool existed = false;
+	int err, parent_len;
+
+	if (!bpfdump_dentry) {
+		err = __bpfdump_init();
+		if (err)
+			return err;
+	}
+
+	tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
+	if (!tinfo)
+		return -ENOMEM;
+
+	tinfo->target = target;
+	tinfo->target_proto = target_proto;
+	tinfo->seq_ops = seq_ops;
+	tinfo->seq_priv_size = seq_priv_size;
+	tinfo->target_feature = target_feature;
+	INIT_LIST_HEAD(&tinfo->list);
+
+	lastslash = strrchr(target, '/');
+	if (!lastslash) {
+		parent = bpfdump_dentry;
+	} else {
+		parent_len = (unsigned long)lastslash - (unsigned long)target;
+
+		mutex_lock(&dump_targets.dumper_mutex);
+		list_for_each_entry(ptinfo, &dump_targets.dumpers, list) {
+			if (strlen(ptinfo->target) == parent_len &&
+			    strncmp(ptinfo->target, target, parent_len) == 0) {
+				existed = true;
+				break;
+			}
+		}
+		mutex_unlock(&dump_targets.dumper_mutex);
+		if (existed == false) {
+			err = -ENOENT;
+			goto free_tinfo;
+		}
+
+		parent = ptinfo->dir_dentry;
+		target = lastslash + 1;
+	}
+	dentry = bpfdump_add_dir(target, parent, &bpf_dir_iops, tinfo);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto free_tinfo;
+	}
+
+	tinfo->dir_dentry = dentry;
+
+	mutex_lock(&dump_targets.dumper_mutex);
+	list_add(&tinfo->list, &dump_targets.dumpers);
+	mutex_unlock(&dump_targets.dumper_mutex);
+	return 0;
+
+free_tinfo:
+	kfree(tinfo);
+	return err;
+}
+
+static struct dentry *
+bpfdump_create_dentry(const char *name, umode_t mode, struct dentry *parent,
+		      void *data, const struct inode_operations *i_ops,
+		      const struct file_operations *f_ops)
+{
+	struct inode *dir, *inode;
+	struct dentry *dentry;
+	int err;
+
+	dir = d_inode(parent);
+
+	inode_lock(dir);
+	dentry = lookup_one_len(name, parent, strlen(name));
+	if (IS_ERR(dentry))
+		goto unlock;
+
+	if (d_really_is_positive(dentry)) {
+		err = -EEXIST;
+		goto dentry_put;
+	}
+
+	inode = new_inode(dir->i_sb);
+	if (!inode) {
+		err = -ENOMEM;
+		goto dentry_put;
+	}
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	inode->i_private = data;
+
+	if (S_ISDIR(mode)) {
+		inode->i_op = i_ops;
+		inode->i_fop = f_ops;
+		inc_nlink(inode);
+		inc_nlink(dir);
+	} else {
+		inode->i_fop = f_ops;
+	}
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+	inode_unlock(dir);
+	return dentry;
+
+dentry_put:
+	dput(dentry);
+	dentry = ERR_PTR(err);
+unlock:
+	inode_unlock(dir);
+	return dentry;
+}
+
+static struct dentry *
+bpfdump_add_dir(const char *name, struct dentry *parent,
+		const struct inode_operations *i_ops, void *data)
+{
+	return bpfdump_create_dentry(name, S_IFDIR | 0755, parent,
+				     data, i_ops, &simple_dir_operations);
+}
+
 static void bpfdump_free_inode(struct inode *inode)
 {
 	kfree(inode->i_private);
@@ -58,8 +225,10 @@ static struct file_system_type fs_type = {
 	.kill_sb		= kill_litter_super,
 };
 
-static int __init bpfdump_init(void)
+static int __bpfdump_init(void)
 {
+	struct vfsmount *mount;
+	int mount_count;
 	int ret;
 
 	ret = sysfs_create_mount_point(kernel_kobj, "bpfdump");
@@ -70,10 +239,29 @@ static int __init bpfdump_init(void)
 	if (ret)
 		goto remove_mount;
 
+	/* get a reference to mount so we can populate targets
+	 * at init time.
+	 */
+	ret = simple_pin_fs(&fs_type, &mount, &mount_count);
+	if (ret)
+		goto remove_mount;
+
+	bpfdump_dentry = mount->mnt_root;
+
+	INIT_LIST_HEAD(&dump_targets.dumpers);
+	mutex_init(&dump_targets.dumper_mutex);
 	return 0;
 
 remove_mount:
 	sysfs_remove_mount_point(kernel_kobj, "bpfdump");
 	return ret;
 }
+
+static int __init bpfdump_init(void)
+{
+	if (bpfdump_dentry)
+		return 0;
+
+	return __bpfdump_init();
+}
 core_initcall(bpfdump_init);
-- 
2.24.1



* [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (2 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10 22:36   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

A dumper bpf program is a tracing program with attach type
BPF_TRACE_DUMP. During bpf program load, the load attribute
   attach_prog_fd
carries the target directory fd. The program will be
verified against btf_id of the target_proto.

If the program is loaded successfully, the dump target,
represented as a relative path to /sys/kernel/bpfdump,
will be remembered in prog->aux->dump_target, which will
be used later to create dumpers.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/dump.c              | 40 ++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  8 ++++++-
 kernel/bpf/verifier.c          | 15 +++++++++++++
 tools/include/uapi/linux/bpf.h |  1 +
 6 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 53914bec7590..44268d36d901 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,7 @@ struct bpf_prog_aux {
 	struct bpf_map **used_maps;
 	struct bpf_prog *prog;
 	struct user_struct *user;
+	const char *dump_target;
 	u64 load_time; /* ns since boottime */
 	struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
 	char name[BPF_OBJ_NAME_LEN];
@@ -1112,6 +1113,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 int bpf_dump_reg_target(const char *target, const char *target_proto,
 			const struct seq_operations *seq_ops,
 			u32 seq_priv_size, u32 target_feature);
+int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2e29a671d67e..0f1cbed446c1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -215,6 +215,7 @@ enum bpf_attach_type {
 	BPF_TRACE_FEXIT,
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
+	BPF_TRACE_DUMP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
index 45528846557f..1091affe8b3f 100644
--- a/kernel/bpf/dump.c
+++ b/kernel/bpf/dump.c
@@ -11,6 +11,9 @@
 #include <linux/fs_parser.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
+#include <linux/btf.h>
+
+extern struct btf *btf_vmlinux;
 
 struct bpfdump_target_info {
 	struct list_head list;
@@ -48,6 +51,43 @@ static const struct inode_operations bpf_dir_iops = {
 	.unlink		= dumper_unlink,
 };
 
+int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog)
+{
+	struct bpfdump_target_info *tinfo;
+	const char *target_proto;
+	struct file *target_file;
+	struct fd tfd;
+	int err = 0, btf_id;
+
+	if (!btf_vmlinux)
+		return -EINVAL;
+
+	tfd = fdget(target_fd);
+	target_file = tfd.file;
+	if (!target_file)
+		return -EBADF;
+
+	if (target_file->f_inode->i_op != &bpf_dir_iops) {
+		err = -EINVAL;
+		goto done;
+	}
+
+	tinfo = target_file->f_inode->i_private;
+	target_proto = tinfo->target_proto;
+	btf_id = btf_find_by_name_kind(btf_vmlinux, target_proto,
+				       BTF_KIND_FUNC);
+
+	if (btf_id > 0) {
+		prog->aux->dump_target = tinfo->target;
+		prog->aux->attach_btf_id = btf_id;
+	}
+
+	err = min(btf_id, 0);
+done:
+	fdput(tfd);
+	return err;
+}
+
 int bpf_dump_reg_target(const char *target,
 			const char *target_proto,
 			const struct seq_operations *seq_ops,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 64783da34202..41005dee8957 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2060,7 +2060,12 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 
 	prog->expected_attach_type = attr->expected_attach_type;
 	prog->aux->attach_btf_id = attr->attach_btf_id;
-	if (attr->attach_prog_fd) {
+	if (type == BPF_PROG_TYPE_TRACING &&
+	    attr->expected_attach_type == BPF_TRACE_DUMP) {
+		err = bpf_dump_set_target_info(attr->attach_prog_fd, prog);
+		if (err)
+			goto free_prog_nouncharge;
+	} else if (attr->attach_prog_fd) {
 		struct bpf_prog *tgt_prog;
 
 		tgt_prog = bpf_prog_get(attr->attach_prog_fd);
@@ -2145,6 +2150,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	err = bpf_prog_new_fd(prog);
 	if (err < 0)
 		bpf_prog_put(prog);
+
 	return err;
 
 free_used_maps:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 04c6630cc18f..f531cee24fc5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10426,6 +10426,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	struct bpf_prog *tgt_prog = prog->aux->linked_prog;
 	u32 btf_id = prog->aux->attach_btf_id;
 	const char prefix[] = "btf_trace_";
+	struct btf_func_model fmodel;
 	int ret = 0, subprog = -1, i;
 	struct bpf_trampoline *tr;
 	const struct btf_type *t;
@@ -10566,6 +10567,20 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 		prog->aux->attach_func_proto = t;
 		prog->aux->attach_btf_trace = true;
 		return 0;
+	case BPF_TRACE_DUMP:
+		if (!btf_type_is_func(t)) {
+			verbose(env, "attach_btf_id %u is not a function\n",
+				btf_id);
+			return -EINVAL;
+		}
+		t = btf_type_by_id(btf, t->type);
+		if (!btf_type_is_func_proto(t))
+			return -EINVAL;
+		prog->aux->attach_func_name = tname;
+		prog->aux->attach_func_proto = t;
+		ret = btf_distill_func_proto(&env->log, btf, t,
+					     tname, &fmodel);
+		return ret;
 	default:
 		if (!prog_extension)
 			return -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2e29a671d67e..0f1cbed446c1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -215,6 +215,7 @@ enum bpf_attach_type {
 	BPF_TRACE_FEXIT,
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
+	BPF_TRACE_DUMP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.24.1



* [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (3 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10  3:00   ` Alexei Starovoitov
                     ` (3 more replies)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets Yonghong Song
                   ` (10 subsequent siblings)
  15 siblings, 4 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Given a loaded dumper bpf program, which already
knows which target it should bind to, there are
two ways to create a dumper:
  - a file based dumper under the hierarchy of
    /sys/kernel/bpfdump/, which users can
    "cat" to print out the output.
  - an anonymous dumper from which a user application
    can "read" the dumping output.

For file based dumper, BPF_OBJ_PIN syscall interface
is used. For anonymous dumper, BPF_PROG_ATTACH
syscall interface is used.

To facilitate the target seq_ops->show() getting the
bpf program easily, dumper creation increases
the target-provided seq_file private data size
so that the bpf program pointer is also stored in
the seq_file private data.

Further, a seq_num, which represents how many times
bpf_dump_get_prog() has been called, is also
available to the target seq_ops->show().
Such information can be used to, e.g., print a
banner before printing out the actual data.

Note that the seq_num does not represent the number
of unique kernel objects the bpf program has
seen, but it should be a good approximation.

A target feature, BPF_DUMP_SEQ_NET_PRIVATE,
is implemented; it is specifically useful for
net based dumpers. It sets the net namespace
to the current process's net namespace.
This avoids changing existing net seq_ops
in order to retrieve the net namespace from
the seq_file pointer.

For open dumper files, anonymous or not, the
fdinfo will show the target and prog_id
associated with that file descriptor. For the
dumper file itself, a kernel interface to
retrieve the prog_id will be provided in one of
the later patches.
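For illustration, the fdinfo lines have the following shape; the target name and prog id below are made-up sample values:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reproduce the two fdinfo lines with snprintf, the same format
 * string the patch passes to seq_printf().
 */
static int format_fdinfo(char *buf, size_t len,
			 const char *target, unsigned int prog_id)
{
	return snprintf(buf, len, "target:\t%s\nprog_id:\t%u\n",
			target, prog_id);
}
```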

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |   5 +
 include/uapi/linux/bpf.h       |   6 +-
 kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
 kernel/bpf/syscall.c           |  11 +-
 tools/include/uapi/linux/bpf.h |   6 +-
 5 files changed, 362 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 44268d36d901..8171e01ff4be 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1110,10 +1110,15 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
+#define BPF_DUMP_SEQ_NET_PRIVATE	BIT(0)
+
 int bpf_dump_reg_target(const char *target, const char *target_proto,
 			const struct seq_operations *seq_ops,
 			u32 seq_priv_size, u32 target_feature);
 int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog);
+int bpf_dump_create(u32 prog_fd, const char __user *dumper_name);
+struct bpf_prog *bpf_dump_get_prog(struct seq_file *seq, u32 priv_data_size,
+				   u64 *seq_num);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0f1cbed446c1..b51d56fc77f9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -354,6 +354,7 @@ enum {
 /* Flags for accessing BPF object from syscall side. */
 	BPF_F_RDONLY		= (1U << 3),
 	BPF_F_WRONLY		= (1U << 4),
+	BPF_F_DUMP		= (1U << 5),
 
 /* Flag for stack_map, store build_id+offset instead of pointer */
 	BPF_F_STACK_BUILD_ID	= (1U << 5),
@@ -481,7 +482,10 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
-		__aligned_u64	pathname;
+		union {
+			__aligned_u64	pathname;
+			__aligned_u64	dumper_name;
+		};
 		__u32		bpf_fd;
 		__u32		file_flags;
 	};
diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
index 1091affe8b3f..ac6856abb711 100644
--- a/kernel/bpf/dump.c
+++ b/kernel/bpf/dump.c
@@ -30,22 +30,173 @@ struct bpfdump_targets {
 	struct mutex dumper_mutex;
 };
 
+struct dumper_inode_info {
+	struct bpfdump_target_info *tinfo;
+	struct bpf_prog *prog;
+};
+
+struct dumper_info {
+	struct list_head list;
+	/* file to identify an anon dumper,
+	 * dentry to identify a file dumper.
+	 */
+	union {
+		struct file *file;
+		struct dentry *dentry;
+	};
+	struct bpfdump_target_info *tinfo;
+	struct bpf_prog *prog;
+};
+
+struct dumpers {
+	struct list_head dumpers;
+	struct mutex dumper_mutex;
+};
+
+struct extra_priv_data {
+	struct bpf_prog *prog;
+	u64 seq_num;
+};
+
 /* registered dump targets */
 static struct bpfdump_targets dump_targets;
 
 static struct dentry *bpfdump_dentry;
 
+static struct dumpers anon_dumpers, file_dumpers;
+
+static const struct file_operations bpf_dumper_ops;
+static const struct inode_operations bpf_dir_iops;
+
+static struct dentry *bpfdump_add_file(const char *name, struct dentry *parent,
+				       const struct file_operations *f_ops,
+				       void *data);
 static struct dentry *bpfdump_add_dir(const char *name, struct dentry *parent,
 				      const struct inode_operations *i_ops,
 				      void *data);
 static int __bpfdump_init(void);
 
+static u32 get_total_priv_dsize(u32 old_size)
+{
+	return roundup(old_size, 8) + sizeof(struct extra_priv_data);
+}
+
+static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
+{
+	return old_ptr + roundup(old_size, 8);
+}
+
+#ifdef CONFIG_PROC_FS
+static void dumper_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	struct dumper_inode_info *i_info = filp->f_inode->i_private;
+
+	seq_printf(m, "target:\t%s\n"
+		      "prog_id:\t%u\n",
+		   i_info->tinfo->target,
+		   i_info->prog->aux->id);
+}
+
+static void anon_dumper_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	struct dumper_info *dinfo;
+
+	mutex_lock(&anon_dumpers.dumper_mutex);
+	list_for_each_entry(dinfo, &anon_dumpers.dumpers, list) {
+		if (dinfo->file == filp) {
+			seq_printf(m, "target:\t%s\n"
+				      "prog_id:\t%u\n",
+				   dinfo->tinfo->target,
+				   dinfo->prog->aux->id);
+			break;
+		}
+	}
+	mutex_unlock(&anon_dumpers.dumper_mutex);
+}
+
+#endif
+
+static void process_target_feature(u32 feature, void *priv_data)
+{
+	/* use the current net namespace */
+	if (feature & BPF_DUMP_SEQ_NET_PRIVATE)
+		set_seq_net_private((struct seq_net_private *)priv_data,
+				    current->nsproxy->net_ns);
+}
+
+static int dumper_open(struct inode *inode, struct file *file)
+{
+	struct dumper_inode_info *i_info = inode->i_private;
+	struct extra_priv_data *extra_data;
+	u32 old_priv_size, total_priv_size;
+	void *priv_data;
+
+	old_priv_size = i_info->tinfo->seq_priv_size;
+	total_priv_size = get_total_priv_dsize(old_priv_size);
+	priv_data = __seq_open_private(file, i_info->tinfo->seq_ops,
+				       total_priv_size);
+	if (!priv_data)
+		return -ENOMEM;
+
+	process_target_feature(i_info->tinfo->target_feature, priv_data);
+
+	extra_data = get_extra_priv_dptr(priv_data, old_priv_size);
+	extra_data->prog = i_info->prog;
+	extra_data->seq_num = 0;
+
+	return 0;
+}
+
+static int anon_dumper_release(struct inode *inode, struct file *file)
+{
+	struct dumper_info *dinfo;
+
+	/* release the bpf program */
+	mutex_lock(&anon_dumpers.dumper_mutex);
+	list_for_each_entry(dinfo, &anon_dumpers.dumpers, list) {
+		if (dinfo->file == file) {
+			bpf_prog_put(dinfo->prog);
+			list_del(&dinfo->list);
+			break;
+		}
+	}
+	mutex_unlock(&anon_dumpers.dumper_mutex);
+
+	return seq_release_private(inode, file);
+}
+
+static int dumper_release(struct inode *inode, struct file *file)
+{
+	return seq_release_private(inode, file);
+}
+
 static int dumper_unlink(struct inode *dir, struct dentry *dentry)
 {
-	kfree(d_inode(dentry)->i_private);
+	struct dumper_inode_info *i_info = d_inode(dentry)->i_private;
+
+	bpf_prog_put(i_info->prog);
+	kfree(i_info);
+
 	return simple_unlink(dir, dentry);
 }
 
+static const struct file_operations bpf_dumper_ops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= dumper_show_fdinfo,
+#endif
+	.open		= dumper_open,
+	.read		= seq_read,
+	.release	= dumper_release,
+};
+
+static const struct file_operations anon_bpf_dumper_ops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= anon_dumper_show_fdinfo,
+#endif
+	.read		= seq_read,
+	.release	= anon_dumper_release,
+};
+
 static const struct inode_operations bpf_dir_iops = {
 	.lookup		= simple_lookup,
 	.unlink		= dumper_unlink,
@@ -88,6 +239,179 @@ int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog)
 	return err;
 }
 
+static int create_anon_dumper(struct bpfdump_target_info *tinfo,
+			      struct bpf_prog *prog)
+{
+	struct extra_priv_data *extra_data;
+	u32 old_priv_size, total_priv_size;
+	struct dumper_info *dinfo;
+	struct file *file;
+	int err, anon_fd;
+	void *priv_data;
+	struct fd fd;
+
+	anon_fd = anon_inode_getfd("bpf-dumper", &anon_bpf_dumper_ops,
+				   NULL, O_CLOEXEC);
+	if (anon_fd < 0)
+		return anon_fd;
+
+	/* setup seq_file for anon dumper */
+	fd = fdget(anon_fd);
+	file = fd.file;
+
+	dinfo = kmalloc(sizeof(*dinfo), GFP_KERNEL);
+	if (!dinfo) {
+		err = -ENOMEM;
+		goto free_fd;
+	}
+
+	old_priv_size = tinfo->seq_priv_size;
+	total_priv_size = get_total_priv_dsize(old_priv_size);
+
+	priv_data = __seq_open_private(file, tinfo->seq_ops,
+				       total_priv_size);
+	if (!priv_data) {
+		err = -ENOMEM;
+		goto free_dinfo;
+	}
+
+	dinfo->file = file;
+	dinfo->tinfo = tinfo;
+	dinfo->prog = prog;
+
+	mutex_lock(&anon_dumpers.dumper_mutex);
+	list_add(&dinfo->list, &anon_dumpers.dumpers);
+	mutex_unlock(&anon_dumpers.dumper_mutex);
+
+	process_target_feature(tinfo->target_feature, priv_data);
+
+	extra_data = get_extra_priv_dptr(priv_data, old_priv_size);
+	extra_data->prog = prog;
+	extra_data->seq_num = 0;
+
+	fdput(fd);
+	return anon_fd;
+
+free_dinfo:
+	kfree(dinfo);
+free_fd:
+	fdput(fd);
+	return err;
+}
+
+static int create_dumper(struct bpfdump_target_info *tinfo,
+			 const char __user *dumper_name,
+			 struct bpf_prog *prog)
+{
+	struct dumper_inode_info *i_info;
+	struct dumper_info *dinfo;
+	struct dentry *dentry;
+	const char *dname;
+	int err = 0;
+
+	i_info = kmalloc(sizeof(*i_info), GFP_KERNEL);
+	if (!i_info)
+		return -ENOMEM;
+
+	i_info->tinfo = tinfo;
+	i_info->prog = prog;
+
+	dinfo = kmalloc(sizeof(*dinfo), GFP_KERNEL);
+	if (!dinfo) {
+		err = -ENOMEM;
+		goto free_i_info;
+	}
+
+	dname = strndup_user(dumper_name, PATH_MAX);
+	if (!dname) {
+		err = -ENOMEM;
+		goto free_dinfo;
+	}
+
+	dentry = bpfdump_add_file(dname, tinfo->dir_dentry,
+				  &bpf_dumper_ops, i_info);
+	kfree(dname);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto free_dinfo;
+	}
+
+	dinfo->dentry = dentry;
+	dinfo->tinfo = tinfo;
+	dinfo->prog = prog;
+
+	mutex_lock(&file_dumpers.dumper_mutex);
+	list_add(&dinfo->list, &file_dumpers.dumpers);
+	mutex_unlock(&file_dumpers.dumper_mutex);
+
+	return 0;
+
+free_dinfo:
+	kfree(dinfo);
+free_i_info:
+	kfree(i_info);
+	return err;
+}
+
+int bpf_dump_create(u32 prog_fd, const char __user *dumper_name)
+{
+	struct bpfdump_target_info *tinfo;
+	const char *target;
+	struct bpf_prog *prog;
+	bool existed = false;
+	int err = 0;
+
+	prog = bpf_prog_get(prog_fd);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	target = prog->aux->dump_target;
+	if (!target) {
+		err = -EINVAL;
+		goto free_prog;
+	}
+
+	mutex_lock(&dump_targets.dumper_mutex);
+	list_for_each_entry(tinfo, &dump_targets.dumpers, list) {
+		if (strcmp(tinfo->target, target) == 0) {
+			existed = true;
+			break;
+		}
+	}
+	mutex_unlock(&dump_targets.dumper_mutex);
+
+	if (!existed) {
+		err = -EINVAL;
+		goto free_prog;
+	}
+
+	err = dumper_name ? create_dumper(tinfo, dumper_name, prog)
+			  : create_anon_dumper(tinfo, prog);
+	if (err < 0)
+		goto free_prog;
+
+	return err;
+
+free_prog:
+	bpf_prog_put(prog);
+	return err;
+}
+
+struct bpf_prog *bpf_dump_get_prog(struct seq_file *seq, u32 priv_data_size,
+	u64 *seq_num)
+{
+	struct extra_priv_data *extra_data;
+
+	if (seq->file->f_op != &bpf_dumper_ops &&
+	    seq->file->f_op != &anon_bpf_dumper_ops)
+		return NULL;
+
+	extra_data = get_extra_priv_dptr(seq->private, priv_data_size);
+	*seq_num = extra_data->seq_num++;
+
+	return extra_data->prog;
+}
+
 int bpf_dump_reg_target(const char *target,
 			const char *target_proto,
 			const struct seq_operations *seq_ops,
@@ -211,6 +535,14 @@ bpfdump_create_dentry(const char *name, umode_t mode, struct dentry *parent,
 	return dentry;
 }
 
+static struct dentry *
+bpfdump_add_file(const char *name, struct dentry *parent,
+		 const struct file_operations *f_ops, void *data)
+{
+	return bpfdump_create_dentry(name, S_IFREG | 0444, parent,
+				     data, NULL, f_ops);
+}
+
 static struct dentry *
 bpfdump_add_dir(const char *name, struct dentry *parent,
 		const struct inode_operations *i_ops, void *data)
@@ -290,6 +622,10 @@ static int __bpfdump_init(void)
 
 	INIT_LIST_HEAD(&dump_targets.dumpers);
 	mutex_init(&dump_targets.dumper_mutex);
+	INIT_LIST_HEAD(&anon_dumpers.dumpers);
+	mutex_init(&anon_dumpers.dumper_mutex);
+	INIT_LIST_HEAD(&file_dumpers.dumpers);
+	mutex_init(&file_dumpers.dumper_mutex);
 	return 0;
 
 remove_mount:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 41005dee8957..b5e4f18cc633 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2173,9 +2173,13 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 
 static int bpf_obj_pin(const union bpf_attr *attr)
 {
-	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
+	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags & ~BPF_F_DUMP)
 		return -EINVAL;
 
+	if (attr->file_flags == BPF_F_DUMP)
+		return bpf_dump_create(attr->bpf_fd,
+				       u64_to_user_ptr(attr->dumper_name));
+
 	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
 }
 
@@ -2605,6 +2609,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 	case BPF_CGROUP_GETSOCKOPT:
 	case BPF_CGROUP_SETSOCKOPT:
 		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
+	case BPF_TRACE_DUMP:
+		return BPF_PROG_TYPE_TRACING;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -2663,6 +2669,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_SOCK_OPS:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 		break;
+	case BPF_PROG_TYPE_TRACING:
+		ret = bpf_dump_create(attr->attach_bpf_fd, (void __user *)NULL);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0f1cbed446c1..b51d56fc77f9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -354,6 +354,7 @@ enum {
 /* Flags for accessing BPF object from syscall side. */
 	BPF_F_RDONLY		= (1U << 3),
 	BPF_F_WRONLY		= (1U << 4),
+	BPF_F_DUMP		= (1U << 5),
 
 /* Flag for stack_map, store build_id+offset instead of pointer */
 	BPF_F_STACK_BUILD_ID	= (1U << 5),
@@ -481,7 +482,10 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
-		__aligned_u64	pathname;
+		union {
+			__aligned_u64	pathname;
+			__aligned_u64	dumper_name;
+		};
 		__u32		bpf_fd;
 		__u32		file_flags;
 	};
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (4 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10 23:13   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 07/16] bpf: add bpf_map target Yonghong Song
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This patch adds netlink and ipv6_route targets,
reusing the same seq_ops (except show()) as
/proc/net/{netlink,ipv6_route}.

Since modules are not supported for now,
ipv6_route is supported only if IPV6 is built-in,
i.e., not compiled as a module. The restriction
can be lifted once modules are properly supported
for bpfdump.
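The show() wiring is the same for both targets and can be modeled as a small dispatch helper (a stand-alone sketch; in the real code a non-zero bpf program return is mapped to -EINVAL):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*show_fn)(void *obj);

/* If no dumper bpf program is bound to this seq_file, fall back to
 * the native /proc formatting so existing readers see no change;
 * otherwise run the program and map any failure to -22 (-EINVAL).
 */
static int dispatch_show(show_fn prog, show_fn native, void *obj)
{
	if (!prog)
		return native(obj);
	return prog(obj) == 0 ? 0 : -22;
}

/* Sample callbacks for demonstration only. */
static int native_ok(void *o) { (void)o; return 0; }
static int prog_ok(void *o)   { (void)o; return 0; }
static int prog_fail(void *o) { (void)o; return 1; }
```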

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h      |  1 +
 kernel/bpf/dump.c        | 13 ++++++++++
 net/ipv6/ip6_fib.c       | 41 +++++++++++++++++++++++++++++-
 net/ipv6/route.c         | 22 ++++++++++++++++
 net/netlink/af_netlink.c | 54 +++++++++++++++++++++++++++++++++++++++-
 5 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8171e01ff4be..f7d4269d77b8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1119,6 +1119,7 @@ int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog);
 int bpf_dump_create(u32 prog_fd, const char __user *dumper_name);
 struct bpf_prog *bpf_dump_get_prog(struct seq_file *seq, u32 priv_data_size,
 				   u64 *seq_num);
+int bpf_dump_run_prog(struct bpf_prog *prog, void *ctx);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
index ac6856abb711..4e009b2612c2 100644
--- a/kernel/bpf/dump.c
+++ b/kernel/bpf/dump.c
@@ -412,6 +412,19 @@ struct bpf_prog *bpf_dump_get_prog(struct seq_file *seq, u32 priv_data_size,
 	return extra_data->prog;
 }
 
+int bpf_dump_run_prog(struct bpf_prog *prog, void *ctx)
+{
+	int ret;
+
+	migrate_disable();
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, ctx);
+	rcu_read_unlock();
+	migrate_enable();
+
+	return ret;
+}
+
 int bpf_dump_reg_target(const char *target,
 			const char *target_proto,
 			const struct seq_operations *seq_ops,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 46ed56719476..0a8dbdcf5f12 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2467,7 +2467,7 @@ void fib6_gc_cleanup(void)
 }
 
 #ifdef CONFIG_PROC_FS
-static int ipv6_route_seq_show(struct seq_file *seq, void *v)
+static int ipv6_route_native_seq_show(struct seq_file *seq, void *v)
 {
 	struct fib6_info *rt = v;
 	struct ipv6_route_iter *iter = seq->private;
@@ -2637,6 +2637,45 @@ static void ipv6_route_seq_stop(struct seq_file *seq, void *v)
 	rcu_read_unlock_bh();
 }
 
+#if IS_BUILTIN(CONFIG_IPV6)
+static int ipv6_route_prog_seq_show(struct bpf_prog *prog, struct seq_file *seq,
+				    u64 seq_num, void *v)
+{
+	struct ipv6_route_iter *iter = seq->private;
+	struct {
+		struct fib6_info *rt;
+		struct seq_file *seq;
+		u64 seq_num;
+	} ctx = {
+		.rt = v,
+		.seq = seq,
+		.seq_num = seq_num,
+	};
+	int ret;
+
+	ret = bpf_dump_run_prog(prog, &ctx);
+	iter->w.leaf = NULL;
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static int ipv6_route_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_prog *prog;
+	u64 seq_num;
+
+	prog = bpf_dump_get_prog(seq, sizeof(struct ipv6_route_iter), &seq_num);
+	if (!prog)
+		return ipv6_route_native_seq_show(seq, v);
+
+	return ipv6_route_prog_seq_show(prog, seq, seq_num, v);
+}
+#else
+static int ipv6_route_seq_show(struct seq_file *seq, void *v)
+{
+	return ipv6_route_native_seq_show(seq, v);
+}
+#endif
+
 const struct seq_operations ipv6_route_seq_ops = {
 	.start	= ipv6_route_seq_start,
 	.next	= ipv6_route_seq_next,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 310cbddaa533..f3457d9d5a8b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6390,6 +6390,16 @@ void __init ip6_route_init_special_entries(void)
   #endif
 }
 
+#if IS_BUILTIN(CONFIG_IPV6)
+#ifdef CONFIG_PROC_FS
+int __init bpfdump__ipv6_route(struct fib6_info *rt, struct seq_file *seq,
+			       u64 seq_num)
+{
+	return 0;
+}
+#endif
+#endif
+
 int __init ip6_route_init(void)
 {
 	int ret;
@@ -6452,6 +6462,18 @@ int __init ip6_route_init(void)
 	if (ret)
 		goto out_register_late_subsys;
 
+#if IS_BUILTIN(CONFIG_IPV6)
+#ifdef CONFIG_PROC_FS
+	ret = bpf_dump_reg_target("ipv6_route",
+				  "bpfdump__ipv6_route",
+				  &ipv6_route_seq_ops,
+				  sizeof(struct ipv6_route_iter),
+				  BPF_DUMP_SEQ_NET_PRIVATE);
+	if (ret)
+		goto out_register_late_subsys;
+#endif
+#endif
+
 	for_each_possible_cpu(cpu) {
 		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 5ded01ca8b20..b6ab827e8d47 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2607,7 +2607,7 @@ static void netlink_seq_stop(struct seq_file *seq, void *v)
 }
 
 
-static int netlink_seq_show(struct seq_file *seq, void *v)
+static int netlink_native_seq_show(struct seq_file *seq, void *v)
 {
 	if (v == SEQ_START_TOKEN) {
 		seq_puts(seq,
@@ -2634,6 +2634,40 @@ static int netlink_seq_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
+static int netlink_prog_seq_show(struct bpf_prog *prog, struct seq_file *seq,
+				 u64 seq_num, void *v)
+{
+	struct {
+		struct netlink_sock *sk;
+		struct seq_file *seq;
+		u64 seq_num;
+	} ctx = {
+		.seq = seq,
+		.seq_num = seq_num - 1,
+	};
+	int ret = 0;
+
+	if (v == SEQ_START_TOKEN)
+		return 0;
+
+	ctx.sk = nlk_sk((struct sock *)v);
+	ret = bpf_dump_run_prog(prog, &ctx);
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static int netlink_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_prog *prog;
+	u64 seq_num;
+
+	prog = bpf_dump_get_prog(seq, sizeof(struct nl_seq_iter), &seq_num);
+	if (!prog)
+		return netlink_native_seq_show(seq, v);
+
+	return netlink_prog_seq_show(prog, seq, seq_num, v);
+}
+
 static const struct seq_operations netlink_seq_ops = {
 	.start  = netlink_seq_start,
 	.next   = netlink_seq_next,
@@ -2740,6 +2774,14 @@ static const struct rhashtable_params netlink_rhashtable_params = {
 	.automatic_shrinking = true,
 };
 
+#ifdef CONFIG_PROC_FS
+int __init bpfdump__netlink(struct netlink_sock *sk, struct seq_file *seq,
+			    u64 seq_num)
+{
+	return 0;
+}
+#endif
+
 static int __init netlink_proto_init(void)
 {
 	int i;
@@ -2764,6 +2806,16 @@ static int __init netlink_proto_init(void)
 		}
 	}
 
+#ifdef CONFIG_PROC_FS
+	err = bpf_dump_reg_target("netlink",
+				  "bpfdump__netlink",
+				  &netlink_seq_ops,
+				  sizeof(struct nl_seq_iter),
+				  BPF_DUMP_SEQ_NET_PRIVATE);
+	if (err)
+		goto out;
+#endif
+
 	netlink_add_usersock_entry();
 
 	sock_register(&netlink_family_ops);
-- 
2.24.1



* [RFC PATCH bpf-next 07/16] bpf: add bpf_map target
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (5 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-13 22:18   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets Yonghong Song
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This patch adds the bpf_map target, traversing
all bpf_maps through map_idr. A reference is held
on the map during show() to ensure safety and
correctness of field accesses.
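The traversal resumes from last id + 1 on every start()/next(), which is what makes the walk robust against maps disappearing mid-dump. A toy model of the idr_get_next() contract (a sorted id array stands in for map_idr):

```c
#include <assert.h>

/* Like idr_get_next(): return the first live id that is >= *id
 * (updating *id to it), or -1 if none remain. Ids deleted between
 * calls are simply skipped over.
 */
static int toy_idr_get_next(const int *ids, int nids, int *id)
{
	for (int i = 0; i < nids; i++)
		if (ids[i] >= *id) {
			*id = ids[i];
			return ids[i];
		}
	return -1;
}
```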

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/syscall.c | 104 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b5e4f18cc633..62a872a406ca 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3797,3 +3797,107 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 
 	return err;
 }
+
+struct bpfdump_seq_map_info {
+	struct bpf_map *map;
+	u32 id;
+};
+
+static struct bpf_map *bpf_map_seq_get_next(u32 *id)
+{
+	struct bpf_map *map;
+
+	spin_lock_bh(&map_idr_lock);
+	map = idr_get_next(&map_idr, id);
+	if (map)
+		map = __bpf_map_inc_not_zero(map, false);
+	spin_unlock_bh(&map_idr_lock);
+
+	return map;
+}
+
+static void *bpf_map_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpfdump_seq_map_info *info = seq->private;
+	struct bpf_map *map;
+	u32 id = info->id + 1;
+
+	map = bpf_map_seq_get_next(&id);
+	if (!map)
+		return NULL;
+
+	++*pos;
+	info->map = map;
+	info->id = id;
+	return map;
+}
+
+static void *bpf_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpfdump_seq_map_info *info = seq->private;
+	struct bpf_map *map;
+	u32 id = info->id + 1;
+
+	++*pos;
+	map = bpf_map_seq_get_next(&id);
+	if (!map)
+		return NULL;
+
+	__bpf_map_put(info->map, true);
+	info->map = map;
+	info->id = id;
+	return map;
+}
+
+static void bpf_map_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpfdump_seq_map_info *info = seq->private;
+
+	if (info->map) {
+		__bpf_map_put(info->map, true);
+		info->map = NULL;
+	}
+}
+
+static int bpf_map_seq_show(struct seq_file *seq, void *v)
+{
+	struct {
+		struct bpf_map *map;
+		struct seq_file *seq;
+		u64 seq_num;
+	} ctx = {
+		.map = v,
+		.seq = seq,
+	};
+	struct bpf_prog *prog;
+	int ret;
+
+	prog = bpf_dump_get_prog(seq, sizeof(struct bpfdump_seq_map_info),
+				 &ctx.seq_num);
+	ret = bpf_dump_run_prog(prog, &ctx);
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static const struct seq_operations bpf_map_seq_ops = {
+	.start	= bpf_map_seq_start,
+	.next	= bpf_map_seq_next,
+	.stop	= bpf_map_seq_stop,
+	.show	= bpf_map_seq_show,
+};
+
+int __init bpfdump__bpf_map(struct bpf_map *map, struct seq_file *seq,
+			    u64 seq_num)
+{
+	return 0;
+}
+
+static int __init bpf_map_dump_init(void)
+{
+	return bpf_dump_reg_target("bpf_map",
+				   "bpfdump__bpf_map",
+				   &bpf_map_seq_ops,
+				   sizeof(struct bpfdump_seq_map_info), 0);
+}
+
+late_initcall(bpf_map_dump_init);
-- 
2.24.1



* [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (6 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 07/16] bpf: add bpf_map target Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10  3:22   ` Alexei Starovoitov
  2020-04-13 23:00   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Only tasks belonging to the "current" pid
namespace are enumerated.

For task/file target, the bpf program will have access to
  struct task_struct *task
  u32 fd
  struct file *file
where fd/file is an open file for the task.
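The iteration order is effectively a nested loop over (task, fd) pairs, skipping empty fd slots. A toy userspace model (arrays stand in for the pid idr and the per-task fd tables):

```c
#include <assert.h>

struct toy_task {
	const int *fds; /* fd table: 1 = open, 0 = empty slot */
	int max_fds;
};

/* Collect (task index, fd) pairs into out[]; returns the count.
 * This mirrors the resume logic of task_file_seq_get_next():
 * exhaust one task's fds, then move to the next task at fd 0.
 */
static int walk_task_files(const struct toy_task *tasks, int ntasks,
			   int (*out)[2], int max_out)
{
	int n = 0;

	for (int t = 0; t < ntasks; t++)
		for (int fd = 0; fd < tasks[t].max_fds; fd++)
			if (tasks[t].fds[fd] && n < max_out) {
				out[n][0] = t;
				out[n][1] = fd;
				n++;
			}
	return n;
}
```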

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/Makefile    |   2 +-
 kernel/bpf/dump_task.c | 294 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 295 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/dump_task.c

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 4a1376ab2bea..7e2c73deabab 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -26,7 +26,7 @@ obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
 endif
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
-obj-$(CONFIG_BPF_SYSCALL) += dump.o
+obj-$(CONFIG_BPF_SYSCALL) += dump.o dump_task.o
 endif
 ifeq ($(CONFIG_BPF_JIT),y)
 obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
diff --git a/kernel/bpf/dump_task.c b/kernel/bpf/dump_task.c
new file mode 100644
index 000000000000..69b0bcec68e9
--- /dev/null
+++ b/kernel/bpf/dump_task.c
@@ -0,0 +1,294 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2020 Facebook */
+
+#include <linux/init.h>
+#include <linux/namei.h>
+#include <linux/pid_namespace.h>
+#include <linux/fs.h>
+#include <linux/fdtable.h>
+#include <linux/filter.h>
+
+struct bpfdump_seq_task_info {
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	u32 id;
+};
+
+static struct task_struct *task_seq_get_next(struct pid_namespace *ns, u32 *id)
+{
+	struct task_struct *task;
+	struct pid *pid;
+
+	rcu_read_lock();
+	pid = idr_get_next(&ns->idr, id);
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	return task;
+}
+
+static void *task_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpfdump_seq_task_info *info = seq->private;
+	struct task_struct *task;
+	u32 id = info->id + 1;
+
+	if (*pos == 0)
+		info->ns = task_active_pid_ns(current);
+
+	task = task_seq_get_next(info->ns, &id);
+	if (!task)
+		return NULL;
+
+	++*pos;
+	info->task = task;
+	info->id = id;
+
+	return task;
+}
+
+static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpfdump_seq_task_info *info = seq->private;
+	struct task_struct *task;
+	u32 id = info->id + 1;
+
+	++*pos;
+	task = task_seq_get_next(info->ns, &id);
+	if (!task)
+		return NULL;
+
+	put_task_struct(info->task);
+	info->task = task;
+	info->id = id;
+	return task;
+}
+
+static void task_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpfdump_seq_task_info *info = seq->private;
+
+	if (info->task) {
+		put_task_struct(info->task);
+		info->task = NULL;
+	}
+}
+
+static int task_seq_show(struct seq_file *seq, void *v)
+{
+	struct {
+		struct task_struct *task;
+		struct seq_file *seq;
+		u64 seq_num;
+	} ctx = {
+		.task = v,
+		.seq = seq,
+	};
+	struct bpf_prog *prog;
+	int ret;
+
+	prog = bpf_dump_get_prog(seq, sizeof(struct bpfdump_seq_task_info),
+				 &ctx.seq_num);
+	ret = bpf_dump_run_prog(prog, &ctx);
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static const struct seq_operations task_seq_ops = {
+        .start  = task_seq_start,
+        .next   = task_seq_next,
+        .stop   = task_seq_stop,
+        .show   = task_seq_show,
+};
+
+struct bpfdump_seq_task_file_info {
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct files_struct *files;
+	u32 id;
+	u32 fd;
+};
+
+static struct file *task_file_seq_get_next(struct pid_namespace *ns, u32 *id,
+					   int *fd, struct task_struct **task,
+					   struct files_struct **fstruct)
+{
+	struct files_struct *files;
+	struct task_struct *tk;
+	u32 sid = *id;
+	int sfd;
+
+	/* If this function returns a non-NULL file object,
+	 * it held a reference to the files_struct and file.
+	 * Otherwise, it does not hold any reference.
+	 */
+again:
+	if (*fstruct) {
+		files = *fstruct;
+		sfd = *fd;
+	} else {
+		tk = task_seq_get_next(ns, &sid);
+		if (!tk)
+			return NULL;
+		files = get_files_struct(tk);
+		put_task_struct(tk);
+		if (!files)
+			return NULL;
+		*fstruct = files;
+		*task = tk;
+		if (sid == *id) {
+			sfd = *fd;
+		} else {
+			*id = sid;
+			sfd = 0;
+		}
+	}
+
+	spin_lock(&files->file_lock);
+	for (; sfd < files_fdtable(files)->max_fds; sfd++) {
+		struct file *f;
+
+		f = fcheck_files(files, sfd);
+		if (!f)
+			continue;
+
+		*fd = sfd;
+		get_file(f);
+		spin_unlock(&files->file_lock);
+		return f;
+	}
+
+	/* the current task is done, go to the next task */
+	spin_unlock(&files->file_lock);
+	put_files_struct(files);
+	*fstruct = NULL;
+	sid = ++(*id);
+	goto again;
+}
+
+static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpfdump_seq_task_file_info *info = seq->private;
+	struct files_struct *files = NULL;
+	struct task_struct *task = NULL;
+	struct file *file;
+	u32 id = info->id;
+	int fd = info->fd + 1;
+
+	if (*pos == 0)
+		info->ns = task_active_pid_ns(current);
+
+	file = task_file_seq_get_next(info->ns, &id, &fd, &task, &files);
+	if (!file) {
+		info->files = NULL;
+		return NULL;
+	}
+
+	++*pos;
+	info->id = id;
+	info->fd = fd;
+	info->task = task;
+	info->files = files;
+
+	return file;
+}
+
+static void *task_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpfdump_seq_task_file_info *info = seq->private;
+	struct files_struct *files = info->files;
+	struct task_struct *task = info->task;
+	int fd = info->fd + 1;
+	struct file *file;
+	u32 id = info->id;
+
+	++*pos;
+	fput((struct file *)v);
+	file = task_file_seq_get_next(info->ns, &id, &fd, &task, &files);
+	if (!file) {
+		info->files = NULL;
+		return NULL;
+	}
+
+	info->id = id;
+	info->fd = fd;
+	info->task = task;
+	info->files = files;
+
+	return file;
+}
+
+static void task_file_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpfdump_seq_task_file_info *info = seq->private;
+
+	if (v)
+		fput((struct file *)v);
+	if (info->files) {
+		put_files_struct(info->files);
+		info->files = NULL;
+	}
+}
+
+static int task_file_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpfdump_seq_task_file_info *info = seq->private;
+	struct {
+		struct task_struct *task;
+		u32 fd;
+		struct file *file;
+		struct seq_file *seq;
+		u64 seq_num;
+	} ctx = {
+		.file = v,
+		.seq = seq,
+	};
+	struct bpf_prog *prog;
+	int ret;
+
+	prog = bpf_dump_get_prog(seq, sizeof(struct bpfdump_seq_task_file_info),
+				 &ctx.seq_num);
+	ctx.task = info->task;
+	ctx.fd = info->fd;
+	ret = bpf_dump_run_prog(prog, &ctx);
+
+	return ret == 0 ? 0 : -EINVAL;
+}
+
+static const struct seq_operations task_file_seq_ops = {
+        .start  = task_file_seq_start,
+        .next   = task_file_seq_next,
+        .stop   = task_file_seq_stop,
+        .show   = task_file_seq_show,
+};
+
+int __init bpfdump__task(struct task_struct *task, struct seq_file *seq,
+			 u64 seq_num) {
+	return 0;
+}
+
+int __init bpfdump__task_file(struct task_struct *task, u32 fd,
+			      struct file *file, struct seq_file *seq,
+			      u64 seq_num)
+{
+	return 0;
+}
+
+static int __init task_dump_init(void)
+{
+	int ret;
+
+	ret = bpf_dump_reg_target("task", "bpfdump__task",
+				  &task_seq_ops,
+				  sizeof(struct bpfdump_seq_task_info), 0);
+	if (ret)
+		return ret;
+
+	return bpf_dump_reg_target("task/file", "bpfdump__task_file",
+				   &task_file_seq_ops,
+				   sizeof(struct bpfdump_seq_task_file_info),
+				   0);
+}
+late_initcall(task_dump_init);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (7 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10  3:26   ` Alexei Starovoitov
  2020-04-14  5:28   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 10/16] bpf: support variable length array in tracing programs Yonghong Song
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Two helpers, bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.

bpf_seq_printf supports common format string flag/width/type
fields, enough to produce output identical to the existing
/proc files for the netlink and ipv6_route targets.

For bpf_seq_printf, a return value of 1 specifically indicates
a write failure due to buffer overflow, to differentiate it
from format string related failures, which return negative
values.

For seq_file show, the same object may be passed to the bpf
program twice, and some bpf programs might be sensitive to this.
With a return value indicating that an overflow happened, the
bpf program can react accordingly.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/bpf.h       |  18 +++-
 kernel/trace/bpf_trace.c       | 172 +++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py     |   2 +
 tools/include/uapi/linux/bpf.h |  18 +++-
 4 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b51d56fc77f9..a245f0df53c4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3030,6 +3030,20 @@ union bpf_attr {
  *		* **-EOPNOTSUPP**	Unsupported operation, for example a
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
+ *
+ * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, ...)
+ * 	Description
+ * 		Print a formatted string into the seq_file *m*, similar to seq_printf().
+ * 	Return
+ * 		0 if successful, or
+ * 		1 if failure due to buffer overflow, or
+ * 		a negative value for format string related failures.
+ *
+ * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
+ * 	Description
+ * 		Write *len* bytes from *data* into the seq_file *m*, similar to seq_write().
+ * 	Return
+ * 		0 if successful, non-zero otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3156,7 +3170,9 @@ union bpf_attr {
 	FN(xdp_output),			\
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
-	FN(sk_assign),
+	FN(sk_assign),			\
+	FN(seq_printf),			\
+	FN(seq_write),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1796747a77..e7d6ba7c9c51 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -457,6 +457,174 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
 	return &bpf_trace_printk_proto;
 }
 
+BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size, u64, arg1,
+	   u64, arg2)
+{
+	bool buf_used = false;
+	int i, copy_size;
+	int mod[2] = {};
+	int fmt_cnt = 0;
+	u64 unsafe_addr;
+	char buf[64];
+
+	/*
+	 * bpf_check()->check_func_arg()->check_stack_boundary()
+	 * guarantees that fmt points to bpf program stack,
+	 * fmt_size bytes of it were initialized and fmt_size > 0
+	 */
+	if (fmt[--fmt_size] != 0)
+		return -EINVAL;
+
+	/* check format string for allowed specifiers */
+	for (i = 0; i < fmt_size; i++) {
+		if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
+			return -EINVAL;
+
+		if (fmt[i] != '%')
+			continue;
+
+		if (fmt_cnt >= 2)
+			return -EINVAL;
+
+		/* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+		i++;
+
+		/* skip optional "[0+-][num]" width formating field */
+		while (fmt[i] == '0' || fmt[i] == '+'  || fmt[i] == '-')
+			i++;
+		if (fmt[i] >= '1' && fmt[i] <= '9') {
+			i++;
+			while (fmt[i] >= '0' && fmt[i] <= '9')
+				i++;
+		}
+
+		if (fmt[i] == 'l') {
+			mod[fmt_cnt]++;
+			i++;
+		} else if (fmt[i] == 's') {
+			mod[fmt_cnt]++;
+			fmt_cnt++;
+			/* disallow any further format extensions */
+			if (fmt[i + 1] != 0 &&
+			    !isspace(fmt[i + 1]) &&
+			    !ispunct(fmt[i + 1]))
+				return -EINVAL;
+
+			if (buf_used)
+				/* allow only one '%s'/'%p' per fmt string */
+				return -EINVAL;
+			buf_used = true;
+
+			if (fmt_cnt == 1) {
+				unsafe_addr = arg1;
+				arg1 = (long) buf;
+			} else {
+				unsafe_addr = arg2;
+				arg2 = (long) buf;
+			}
+			buf[0] = 0;
+			strncpy_from_unsafe(buf,
+					    (void *) (long) unsafe_addr,
+					    sizeof(buf));
+			continue;
+		} else if (fmt[i] == 'p') {
+			mod[fmt_cnt]++;
+			fmt_cnt++;
+			if (fmt[i + 1] == 0 ||
+			    fmt[i + 1] == 'K' ||
+			    fmt[i + 1] == 'x') {
+				/* just kernel pointers */
+				continue;
+			}
+
+			if (buf_used)
+				return -EINVAL;
+			buf_used = true;
+
+			/* only support "%pI4", "%pi4", "%pI6" and "%pi6". */
+			if (fmt[i + 1] != 'i' && fmt[i + 1] != 'I')
+				return -EINVAL;
+			if (fmt[i + 2] != '4' && fmt[i + 2] != '6')
+				return -EINVAL;
+
+			copy_size = (fmt[i + 2] == '4') ? 4 : 16;
+
+			if (fmt_cnt == 1) {
+				unsafe_addr = arg1;
+				arg1 = (long) buf;
+			} else {
+				unsafe_addr = arg2;
+				arg2 = (long) buf;
+			}
+			probe_kernel_read(buf, (void *) (long) unsafe_addr, copy_size);
+
+			i += 2;
+			continue;
+		}
+
+		if (fmt[i] == 'l') {
+			mod[fmt_cnt]++;
+			i++;
+		}
+
+		if (fmt[i] != 'i' && fmt[i] != 'd' &&
+		    fmt[i] != 'u' && fmt[i] != 'x')
+			return -EINVAL;
+		fmt_cnt++;
+	}
+
+/* Horrid workaround for getting va_list handling working with different
+ * argument type combinations generically for 32 and 64 bit archs.
+ */
+#define __BPF_SP_EMIT()	__BPF_ARG2_SP()
+#define __BPF_SP(...)							\
+	seq_printf(m, fmt, ##__VA_ARGS__)
+
+#define __BPF_ARG1_SP(...)						\
+	((mod[0] == 2 || (mod[0] == 1 && __BITS_PER_LONG == 64))	\
+	  ? __BPF_SP(arg1, ##__VA_ARGS__)				\
+	  : ((mod[0] == 1 || (mod[0] == 0 && __BITS_PER_LONG == 32))	\
+	      ? __BPF_SP((long)arg1, ##__VA_ARGS__)			\
+	      : __BPF_SP((u32)arg1, ##__VA_ARGS__)))
+
+#define __BPF_ARG2_SP(...)						\
+	((mod[1] == 2 || (mod[1] == 1 && __BITS_PER_LONG == 64))	\
+	  ? __BPF_ARG1_SP(arg2, ##__VA_ARGS__)				\
+	  : ((mod[1] == 1 || (mod[1] == 0 && __BITS_PER_LONG == 32))	\
+	      ? __BPF_ARG1_SP((long)arg2, ##__VA_ARGS__)		\
+	      : __BPF_ARG1_SP((u32)arg2, ##__VA_ARGS__)))
+
+	__BPF_SP_EMIT();
+	return seq_has_overflowed(m);
+}
+
+static int bpf_seq_printf_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_printf_proto = {
+	.func		= bpf_seq_printf,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.btf_id		= bpf_seq_printf_btf_ids,
+};
+
+BPF_CALL_3(bpf_seq_write, struct seq_file *, m, const char *, data, u32, len)
+{
+	return seq_write(m, data, len);
+}
+
+static int bpf_seq_write_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_write_proto = {
+	.func		= bpf_seq_write,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.btf_id		= bpf_seq_write_btf_ids,
+};
+
 static __always_inline int
 get_map_perf_counter(struct bpf_map *map, u64 flags,
 		     u64 *value, u64 *enabled, u64 *running)
@@ -1224,6 +1392,10 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_xdp_output:
 		return &bpf_xdp_output_proto;
 #endif
+	case BPF_FUNC_seq_printf:
+		return &bpf_seq_printf_proto;
+	case BPF_FUNC_seq_write:
+		return &bpf_seq_write_proto;
 	default:
 		return raw_tp_prog_func_proto(func_id, prog);
 	}
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index f43d193aff3a..ded304c96a05 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -414,6 +414,7 @@ class PrinterHelpers(Printer):
             'struct sk_reuseport_md',
             'struct sockaddr',
             'struct tcphdr',
+            'struct seq_file',
 
             'struct __sk_buff',
             'struct sk_msg_md',
@@ -450,6 +451,7 @@ class PrinterHelpers(Printer):
             'struct sk_reuseport_md',
             'struct sockaddr',
             'struct tcphdr',
+            'struct seq_file',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b51d56fc77f9..a245f0df53c4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3030,6 +3030,20 @@ union bpf_attr {
  *		* **-EOPNOTSUPP**	Unsupported operation, for example a
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
+ *
+ * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, ...)
+ * 	Description
+ * 		Print a formatted string into the seq_file *m*, similar to seq_printf().
+ * 	Return
+ * 		0 if successful, or
+ * 		1 if failure due to buffer overflow, or
+ * 		a negative value for format string related failures.
+ *
+ * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
+ * 	Description
+ * 		Write *len* bytes from *data* into the seq_file *m*, similar to seq_write().
+ * 	Return
+ * 		0 if successful, non-zero otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3156,7 +3170,9 @@ union bpf_attr {
 	FN(xdp_output),			\
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
-	FN(sk_assign),
+	FN(sk_assign),			\
+	FN(seq_printf),			\
+	FN(seq_write),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.24.1



* [RFC PATCH bpf-next 10/16] bpf: support variable length array in tracing programs
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (8 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-14  0:13   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id Yonghong Song
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

In /proc/net/ipv6_route, we have
  struct fib6_info {
    struct fib6_table *fib6_table;
    ...
    struct fib6_nh fib6_nh[0];
  }
  struct fib6_nh {
    struct fib_nh_common nh_common;
    struct rt6_info **rt6i_pcpu;
    struct rt6_exception_bucket *rt6i_exception_bucket;
  };
  struct fib_nh_common {
    ...
    u8 nhc_gw_family;
    ...
  }

The access:
  struct fib6_nh *fib6_nh = &rt->fib6_nh;
  ... fib6_nh->nh_common.nhc_gw_family ...

This patch ensures such an access is handled properly.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/btf.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index d65c6912bdaf..89a0d983b169 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3837,6 +3837,31 @@ int btf_struct_access(struct bpf_verifier_log *log,
 	}
 
 	if (off + size > t->size) {
+		/* If the last element is a variable size array, we may
+		 * need to relax the rule.
+		 */
+		struct btf_array *array_elem;
+		u32 vlen = btf_type_vlen(t);
+		u32 last_member_type;
+
+		member = btf_type_member(t);
+		last_member_type = member[vlen - 1].type;
+		mtype = btf_type_by_id(btf_vmlinux, last_member_type);
+		if (!btf_type_is_array(mtype))
+			goto error;
+
+		array_elem = (struct btf_array *)(mtype + 1);
+		if (array_elem->nelems != 0)
+			goto error;
+
+		elem_type = btf_type_by_id(btf_vmlinux, array_elem->type);
+		if (!btf_type_is_struct(elem_type))
+			goto error;
+
+		off = (off - t->size) % elem_type->size;
+		return btf_struct_access(log, elem_type, off, size, atype, next_btf_id);
+
+error:
 		bpf_log(log, "access beyond struct %s at off %u size %u\n",
 			tname, off, size);
 		return -EACCES;
-- 
2.24.1



* [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (9 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 10/16] bpf: support variable length array in tracing programs Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10  3:10   ` Alexei Starovoitov
  2020-04-08 23:25 ` [RFC PATCH bpf-next 12/16] tools/libbpf: libbpf support for bpfdump Yonghong Song
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Given a fd representing a bpfdump target, a user
can retrieve the target_proto name, which represents
the bpf program prototype.

Given a fd representing a file dumper, a user can
retrieve the bpf_prog id associated with that dumper.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 13 +++++++++
 kernel/bpf/dump.c              | 51 ++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           | 14 ++++++++++
 tools/include/uapi/linux/bpf.h | 13 +++++++++
 5 files changed, 92 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f7d4269d77b8..c9aec3b02dfa 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1120,6 +1120,7 @@ int bpf_dump_create(u32 prog_fd, const char __user *dumper_name);
 struct bpf_prog *bpf_dump_get_prog(struct seq_file *seq, u32 priv_data_size,
 				   u64 *seq_num);
 int bpf_dump_run_prog(struct bpf_prog *prog, void *ctx);
+int bpf_dump_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a245f0df53c4..fc2157e319f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -113,6 +113,7 @@ enum bpf_cmd {
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
 	BPF_LINK_UPDATE,
+	BPF_DUMP_QUERY,
 };
 
 enum bpf_map_type {
@@ -594,6 +595,18 @@ union bpf_attr {
 		__u32		old_prog_fd;
 	} link_update;
 
+	struct {
+		__u32		query_fd;
+		__u32		flags;
+		union {
+			struct {
+				__aligned_u64	target_proto;
+				__u32		proto_buf_len;
+			};
+			__u32			prog_id;
+		};
+	} dump_query;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
index 4e009b2612c2..f3041b362057 100644
--- a/kernel/bpf/dump.c
+++ b/kernel/bpf/dump.c
@@ -86,6 +86,57 @@ static void *get_extra_priv_dptr(void *old_ptr, u32 old_size)
 	return old_ptr + roundup(old_size, 8);
 }
 
+int bpf_dump_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	struct bpfdump_target_info *tinfo;
+	struct dumper_inode_info *i_info;
+	const char *target_proto;
+	void * __user proto_buf;
+	struct file *filp;
+	u32 proto_len;
+	struct fd qfd;
+	int err = 0;
+
+	if (attr->dump_query.flags != 0)
+		return -EINVAL;
+
+	qfd = fdget(attr->dump_query.query_fd);
+	filp = qfd.file;
+	if (!filp)
+		return -EBADF;
+
+	if (filp->f_op != &bpf_dumper_ops &&
+	    filp->f_inode->i_op != &bpf_dir_iops) {
+		err = -EINVAL;
+		goto done;
+	}
+
+	if (filp->f_op == &bpf_dumper_ops) {
+		i_info = filp->f_inode->i_private;
+		if (put_user(i_info->prog->aux->id, &uattr->dump_query.prog_id))
+			err = -EFAULT;
+
+		goto done;
+	}
+
+	tinfo = filp->f_inode->i_private;
+	target_proto = tinfo->target_proto;
+
+	proto_len = strlen(target_proto) + 1;
+	if (attr->dump_query.proto_buf_len < proto_len) {
+		err = -ENOSPC;
+		goto done;
+	}
+
+	proto_buf = u64_to_user_ptr(attr->dump_query.target_proto);
+	if (copy_to_user(proto_buf, target_proto, proto_len))
+		err = -EFAULT;
+
+done:
+	fdput(qfd);
+	return err;
+}
+
 #ifdef CONFIG_PROC_FS
 static void dumper_show_fdinfo(struct seq_file *m, struct file *filp)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 62a872a406ca..46b58f1f2d75 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3673,6 +3673,17 @@ static int link_update(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_DUMP_QUERY_LAST_FIELD dump_query.proto_buf_len
+
+static int bpf_dump_do_query(const union bpf_attr *attr,
+			     union bpf_attr __user *uattr)
+{
+	if (CHECK_ATTR(BPF_DUMP_QUERY))
+		return -EINVAL;
+
+	return bpf_dump_query(attr, uattr);
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr;
@@ -3790,6 +3801,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_LINK_UPDATE:
 		err = link_update(&attr);
 		break;
+	case BPF_DUMP_QUERY:
+		err = bpf_dump_do_query(&attr, uattr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a245f0df53c4..fc2157e319f1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -113,6 +113,7 @@ enum bpf_cmd {
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
 	BPF_LINK_UPDATE,
+	BPF_DUMP_QUERY,
 };
 
 enum bpf_map_type {
@@ -594,6 +595,18 @@ union bpf_attr {
 		__u32		old_prog_fd;
 	} link_update;
 
+	struct {
+		__u32		query_fd;
+		__u32		flags;
+		union {
+			struct {
+				__aligned_u64	target_proto;
+				__u32		proto_buf_len;
+			};
+			__u32			prog_id;
+		};
+	} dump_query;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
-- 
2.24.1



* [RFC PATCH bpf-next 12/16] tools/libbpf: libbpf support for bpfdump
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (10 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 13/16] tools/bpftool: add bpf dumper support Yonghong Song
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Add a few libbpf APIs for bpfdump pin and query.

Also, parse the dump program section name, retrieve
the dump target path, and open the path to get a fd
that is assigned to prog->attach_prog_fd.
This is not really desirable; more thought is needed
on how to provide an equally good user interface that
copes well with libbpf.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/bpf.c      | 33 +++++++++++++++++++++++++++
 tools/lib/bpf/bpf.h      |  5 +++++
 tools/lib/bpf/libbpf.c   | 48 ++++++++++++++++++++++++++++++++++++----
 tools/lib/bpf/libbpf.h   |  1 +
 tools/lib/bpf/libbpf.map |  3 +++
 5 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5cc1b0785d18..e8d4304fcc98 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -533,6 +533,39 @@ int bpf_obj_get(const char *pathname)
 	return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
 }
 
+int bpf_obj_pin_dumper(int fd, const char *dname)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dumper_name = ptr_to_u64((void *)dname);
+	attr.bpf_fd = fd;
+	attr.file_flags = BPF_F_DUMP;
+
+	return sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
+}
+
+int bpf_dump_query(int query_fd, __u32 flags, void *target_proto_buf,
+		   __u32 buf_len, __u32 *prog_id)
+{
+	union bpf_attr attr;
+	int ret;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump_query.query_fd = query_fd;
+	attr.dump_query.flags = flags;
+	if (target_proto_buf) {
+		attr.dump_query.target_proto = ptr_to_u64((void *)target_proto_buf);
+		attr.dump_query.proto_buf_len = buf_len;
+	}
+
+	ret = sys_bpf(BPF_DUMP_QUERY, &attr, sizeof(attr));
+	if (!ret && prog_id)
+		*prog_id = attr.dump_query.prog_id;
+
+	return ret;
+}
+
 int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
 		    unsigned int flags)
 {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 46d47afdd887..2f89f8445962 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -149,8 +149,13 @@ LIBBPF_API int bpf_map_update_batch(int fd, void *keys, void *values,
 				    __u32 *count,
 				    const struct bpf_map_batch_opts *opts);
 
+LIBBPF_API int bpf_dump_query(int query_fd, __u32 flags,
+			      void *target_proto_buf, __u32 buf_len,
+			      __u32 *prog_id);
+
 LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
 LIBBPF_API int bpf_obj_get(const char *pathname);
+LIBBPF_API int bpf_obj_pin_dumper(int fd, const char *dname);
 
 struct bpf_prog_attach_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ff9174282a8c..c7a81ede56ce 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -79,6 +79,7 @@ static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj,
 							int idx);
 static const struct btf_type *
 skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id);
+static int fill_dumper_info(struct bpf_program *prog);
 
 static int __base_pr(enum libbpf_print_level level, const char *format,
 		     va_list args)
@@ -2365,8 +2366,12 @@ static inline bool libbpf_prog_needs_vmlinux_btf(struct bpf_program *prog)
 	/* BPF_PROG_TYPE_TRACING programs which do not attach to other programs
 	 * also need vmlinux BTF
 	 */
-	if (prog->type == BPF_PROG_TYPE_TRACING && !prog->attach_prog_fd)
-		return true;
+	if (prog->type == BPF_PROG_TYPE_TRACING) {
+		if (prog->expected_attach_type == BPF_TRACE_DUMP)
+			return false;
+		if (!prog->attach_prog_fd)
+			return true;
+	}
 
 	return false;
 }
@@ -4958,7 +4963,7 @@ int bpf_program__load(struct bpf_program *prog, char *license, __u32 kern_ver)
 {
 	int err = 0, fd, i, btf_id;
 
-	if ((prog->type == BPF_PROG_TYPE_TRACING ||
+	if (((prog->type == BPF_PROG_TYPE_TRACING && prog->expected_attach_type != BPF_TRACE_DUMP) ||
 	     prog->type == BPF_PROG_TYPE_LSM ||
 	     prog->type == BPF_PROG_TYPE_EXT) && !prog->attach_btf_id) {
 		btf_id = libbpf_find_attach_btf_id(prog);
@@ -5319,6 +5324,7 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
 
 int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
 {
+	struct bpf_program *prog;
 	struct bpf_object *obj;
 	int err, i;
 
@@ -5335,7 +5341,17 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
 
 	obj->loaded = true;
 
-	err = bpf_object__probe_caps(obj);
+	err = 0;
+	bpf_object__for_each_program(prog, obj) {
+		if (prog->type == BPF_PROG_TYPE_TRACING &&
+		    prog->expected_attach_type == BPF_TRACE_DUMP) {
+			err = fill_dumper_info(prog);
+			if (err)
+				break;
+		}
+	}
+
+	err = err ? : bpf_object__probe_caps(obj);
 	err = err ? : bpf_object__resolve_externs(obj, obj->kconfig);
 	err = err ? : bpf_object__sanitize_and_load_btf(obj);
 	err = err ? : bpf_object__sanitize_maps(obj);
@@ -5459,6 +5475,11 @@ int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
 	return 0;
 }
 
+int bpf_program__pin_dumper(struct bpf_program *prog, const char *dname)
+{
+	return bpf_obj_pin_dumper(bpf_program__fd(prog), dname);
+}
+
 int bpf_program__unpin_instance(struct bpf_program *prog, const char *path,
 				int instance)
 {
@@ -6322,6 +6343,8 @@ static const struct bpf_sec_def section_defs[] = {
 		.is_attach_btf = true,
 		.expected_attach_type = BPF_LSM_MAC,
 		.attach_fn = attach_lsm),
+	SEC_DEF("dump/", TRACING,
+		.expected_attach_type = BPF_TRACE_DUMP),
 	BPF_PROG_SEC("xdp",			BPF_PROG_TYPE_XDP),
 	BPF_PROG_SEC("perf_event",		BPF_PROG_TYPE_PERF_EVENT),
 	BPF_PROG_SEC("lwt_in",			BPF_PROG_TYPE_LWT_IN),
@@ -6401,6 +6424,23 @@ static const struct bpf_sec_def *find_sec_def(const char *sec_name)
 	return NULL;
 }
 
+static int fill_dumper_info(struct bpf_program *prog)
+{
+	const struct bpf_sec_def *sec;
+	const char *dump_target;
+	int fd;
+
+	sec = find_sec_def(bpf_program__title(prog, false));
+	if (sec) {
+		dump_target = bpf_program__title(prog, false) + sec->len;
+		fd = open(dump_target, O_RDONLY);
+		if (fd < 0)
+			return fd;
+		prog->attach_prog_fd = fd;
+	}
+	return 0;
+}
+
 static char *libbpf_get_type_names(bool attach_type)
 {
 	int i, len = ARRAY_SIZE(section_defs) * MAX_TYPE_NAME_SIZE;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 44df1d3e7287..ccb5d30fff4a 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -216,6 +216,7 @@ LIBBPF_API int bpf_program__unpin_instance(struct bpf_program *prog,
 LIBBPF_API int bpf_program__pin(struct bpf_program *prog, const char *path);
 LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char *path);
 LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
+LIBBPF_API int bpf_program__pin_dumper(struct bpf_program *prog, const char *dname);
 
 struct bpf_link;
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index bb8831605b25..ed6234bb199f 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -238,6 +238,7 @@ LIBBPF_0.0.7 {
 
 LIBBPF_0.0.8 {
 	global:
+		bpf_dump_query;
 		bpf_link__fd;
 		bpf_link__open;
 		bpf_link__pin;
@@ -248,8 +249,10 @@ LIBBPF_0.0.8 {
 		bpf_link_update;
 		bpf_map__set_initial_value;
 		bpf_program__attach_cgroup;
+		bpf_obj_pin_dumper;
 		bpf_program__attach_lsm;
 		bpf_program__is_lsm;
+		bpf_program__pin_dumper;
 		bpf_program__set_attach_target;
 		bpf_program__set_lsm;
 		bpf_set_link_xdp_fd_opts;
-- 
2.24.1



* [RFC PATCH bpf-next 13/16] tools/bpftool: add bpf dumper support
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (11 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 12/16] tools/libbpf: libbpf support for bpfdump Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 14/16] tools/bpf: selftests: add dumper programs for ipv6_route and netlink Yonghong Song
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Implement the "bpftool dumper" command, with two subcommands:
  bpftool dumper pin <bpf_prog.o> <dumper_name>
  bpftool dumper show {target|dumper}

The "pin" subcommand creates a dumper named <dumper_name>
under the dump target (specified in <bpf_prog>.o).
The "show target" subcommand shows the target func protos,
which are useful when writing the bpf program, and
"show dumper" shows the corresponding prog_id for each
dumper.

For example, with some of the later selftest dumpers pinned
in the kernel, we can do inspection like below:
  $ bpftool dumper show target
  target                  prog_proto
  task                    bpfdump__task
  task/file               bpfdump__task_file
  bpf_map                 bpfdump__bpf_map
  ipv6_route              bpfdump__ipv6_route
  netlink                 bpfdump__netlink
  $ bpftool dumper show dumper
  dumper                  prog_id
  task/my1                8
  task/file/my1           12
  bpf_map/my1             4
  ipv6_route/my2          16
  netlink/my2             24
  netlink/my3             20

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/dumper.c | 131 +++++++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/main.c   |   3 +-
 tools/bpf/bpftool/main.h   |   1 +
 3 files changed, 134 insertions(+), 1 deletion(-)
 create mode 100644 tools/bpf/bpftool/dumper.c

diff --git a/tools/bpf/bpftool/dumper.c b/tools/bpf/bpftool/dumper.c
new file mode 100644
index 000000000000..81389083dec5
--- /dev/null
+++ b/tools/bpf/bpftool/dumper.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (C) 2020 Facebook
+// Author: Yonghong Song <yhs@fb.com>
+
+#define _GNU_SOURCE
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <ftw.h>
+
+#include <linux/err.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "main.h"
+
+static int do_pin(int argc, char **argv)
+{
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	const char *objfile, *dname;
+	int err;
+
+	if (!REQ_ARGS(2)) {
+		usage();
+		return -1;
+	}
+
+	objfile = GET_ARG();
+	dname = GET_ARG();
+
+	obj = bpf_object__open(objfile);
+	if (IS_ERR_OR_NULL(obj))
+		return -1;
+
+	err = bpf_object__load(obj);
+	if (err < 0)
+		return -1;
+
+	prog = bpf_program__next(NULL, obj);
+	err = bpf_program__pin_dumper(prog, dname);
+	return err;
+}
+
+static bool for_targets;
+static const char *bpfdump_root = "/sys/kernel/bpfdump";
+
+static int check_file(const char *fpath, const struct stat *sb,
+		      int typeflag, struct FTW *ftwbuf)
+{
+	char proto_buf[64];
+	unsigned prog_id;
+	const char *name;
+	int ret, fd;
+
+	if ((for_targets && typeflag == FTW_F) ||
+	    (!for_targets && typeflag == FTW_D))
+		return 0;
+
+	if (for_targets && strcmp(fpath, bpfdump_root) == 0)
+		return 0;
+
+	fd = open(fpath, O_RDONLY);
+	if (fd < 0)
+		return fd;
+
+	ret = bpf_dump_query(fd, 0, proto_buf, sizeof(proto_buf),
+			     &prog_id);
+	if (ret < 0)
+		goto done;
+
+	name = fpath + strlen(bpfdump_root) + 1;
+	if (for_targets)
+		fprintf(stdout, "%-24s%-24s\n", name, proto_buf);
+	else
+		fprintf(stdout, "%-24s%-24d\n", name, prog_id);
+
+done:
+	close(fd);
+	return ret;
+}
+
+static int do_show(int argc, char **argv)
+{
+	int flags = FTW_PHYS;
+	int nopenfd = 16;
+	const char *spec;
+
+	if (!REQ_ARGS(1)) {
+		usage();
+		return -1;
+	}
+
+	spec = GET_ARG();
+	if (strcmp(spec, "target") == 0) {
+		for_targets = true;
+		fprintf(stdout, "target                  prog_proto\n");
+	} else if (strcmp(spec, "dumper") == 0) {
+		fprintf(stdout, "dumper                  prog_id\n");
+		for_targets = false;
+	} else {
+		return -1;
+	}
+
+	if (nftw(bpfdump_root, check_file, nopenfd, flags) == -1)
+		return -1;
+
+	return 0;
+}
+
+static int do_help(int argc, char **argv)
+{
+	return 0;
+}
+
+static const struct cmd cmds[] = {
+	{ "help",	do_help },
+	{ "show",	do_show },
+	{ "pin",	do_pin },
+	{ 0 }
+};
+
+int do_dumper(int argc, char **argv)
+{
+	return cmd_select(cmds, argc, argv, do_help);
+}
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 466c269eabdd..8489aba6543d 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -58,7 +58,7 @@ static int do_help(int argc, char **argv)
 		"       %s batch file FILE\n"
 		"       %s version\n"
 		"\n"
-		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops }\n"
+		"       OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen | struct_ops | dumper}\n"
 		"       " HELP_SPEC_OPTIONS "\n"
 		"",
 		bin_name, bin_name, bin_name);
@@ -222,6 +222,7 @@ static const struct cmd cmds[] = {
 	{ "btf",	do_btf },
 	{ "gen",	do_gen },
 	{ "struct_ops",	do_struct_ops },
+	{ "dumper",	do_dumper },
 	{ "version",	do_version },
 	{ 0 }
 };
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 86f14ce26fd7..2c59f319bbe9 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -162,6 +162,7 @@ int do_feature(int argc, char **argv);
 int do_btf(int argc, char **argv);
 int do_gen(int argc, char **argv);
 int do_struct_ops(int argc, char **argv);
+int do_dumper(int argc, char **arg);
 
 int parse_u32_arg(int *argc, char ***argv, __u32 *val, const char *what);
 int prog_parse_fd(int *argc, char ***argv);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH bpf-next 14/16] tools/bpf: selftests: add dumper programs for ipv6_route and netlink
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (12 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 13/16] tools/bpftool: add bpf dumper support Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-14  5:39   ` Andrii Nakryiko
  2020-04-08 23:25 ` [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file Yonghong Song
  2020-04-08 23:25 ` [RFC PATCH bpf-next 16/16] tools/bpf: selftests: add a selftest for anonymous dumper Yonghong Song
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Two bpf programs are added in this patch for the netlink and ipv6_route
targets. On my VM, they produce output identical to /proc/net/netlink
and /proc/net/ipv6_route.

  $ cat /proc/net/netlink
  sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
  000000002c42d58b 0   0          00000000 0        0        0     2        0        7
  00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
  00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
  000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
  ....
  00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
  000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
  00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
  000000008398fb08 16  0          00000000 0        0        0     2        0        27
  $ cat /sys/kernel/bpfdump/netlink/my1
  sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
  000000002c42d58b 0   0          00000000 0        0        0     2        0        7
  00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
  00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
  000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
  ....
  00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
  000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
  00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
  000000008398fb08 16  0          00000000 0        0        0     2        0        27

  $ cat /proc/net/ipv6_route
  fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
  fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  $ cat /sys/kernel/bpfdump/ipv6_route/my1
  fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
  fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
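The identical-output claim above can be checked mechanically rather than by eye. A minimal sketch (the helper is illustrative, not part of the patch; the dumper path assumes a dumper named "my1" has been pinned on the running kernel):

```python
def same_contents(path_a, path_b):
    """Return True if two pseudo-files produce byte-identical dumps."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return a.read() == b.read()

# Hypothetical invocation on a live system:
# same_contents("/proc/net/ipv6_route", "/sys/kernel/bpfdump/ipv6_route/my1")
```

Note that for /proc/net/netlink the socket set can change between the two reads, so a live comparison may need to tolerate some churn.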

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/progs/bpfdump_ipv6_route.c  | 63 ++++++++++++++++
 .../selftests/bpf/progs/bpfdump_netlink.c     | 74 +++++++++++++++++++
 2 files changed, 137 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_netlink.c

diff --git a/tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c b/tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
new file mode 100644
index 000000000000..590e56791052
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+extern bool CONFIG_IPV6_SUBTREES __kconfig __weak;
+
+#define	RTF_GATEWAY		0x0002
+#define IFNAMSIZ		16
+#define fib_nh_gw_family        nh_common.nhc_gw_family
+#define fib_nh_gw6              nh_common.nhc_gw.ipv6
+#define fib_nh_dev              nh_common.nhc_dev
+
+SEC("dump//sys/kernel/bpfdump/ipv6_route")
+int BPF_PROG(dump_ipv6_route, struct fib6_info *rt, struct seq_file *seq, u64 seq_num)
+{
+	struct fib6_nh *fib6_nh = &rt->fib6_nh[0];
+	unsigned int flags = rt->fib6_flags;
+	const struct net_device *dev;
+	struct nexthop *nh;
+	static const char fmt1[] = "%pi6 %02x ";
+	static const char fmt2[] = "%pi6 ";
+	static const char fmt3[] = "00000000000000000000000000000000 ";
+	static const char fmt4[] = "%08x %08x ";
+	static const char fmt5[] = "%8s\n";
+	static const char fmt6[] = "\n";
+	static const char fmt7[] = "00000000000000000000000000000000 00 ";
+
+	/* FIXME: nexthop_is_multipath is not handled here. */
+	nh = rt->nh;
+	if (rt->nh)
+		fib6_nh = &nh->nh_info->fib6_nh;
+
+	bpf_seq_printf(seq, fmt1, sizeof(fmt1), &rt->fib6_dst.addr,
+		       rt->fib6_dst.plen);
+
+	if (CONFIG_IPV6_SUBTREES)
+		bpf_seq_printf(seq, fmt1, sizeof(fmt1), &rt->fib6_src.addr,
+			       rt->fib6_src.plen);
+	else
+		bpf_seq_printf(seq, fmt7, sizeof(fmt7));
+
+	if (fib6_nh->fib_nh_gw_family) {
+		flags |= RTF_GATEWAY;
+		bpf_seq_printf(seq, fmt2, sizeof(fmt2), &fib6_nh->fib_nh_gw6);
+	} else {
+		bpf_seq_printf(seq, fmt3, sizeof(fmt3));
+	}
+
+	dev = fib6_nh->fib_nh_dev;
+	bpf_seq_printf(seq, fmt4, sizeof(fmt4), rt->fib6_metric, rt->fib6_ref.refs.counter);
+	bpf_seq_printf(seq, fmt4, sizeof(fmt4), 0, flags);
+	if (dev)
+		bpf_seq_printf(seq, fmt5, sizeof(fmt5), dev->name);
+	else
+		bpf_seq_printf(seq, fmt6, sizeof(fmt6));
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpfdump_netlink.c b/tools/testing/selftests/bpf/progs/bpfdump_netlink.c
new file mode 100644
index 000000000000..37c9be546b99
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpfdump_netlink.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define sk_rmem_alloc	sk_backlog.rmem_alloc
+#define sk_refcnt	__sk_common.skc_refcnt
+
+#define offsetof(TYPE, MEMBER)  ((size_t)&((TYPE *)0)->MEMBER)
+#define container_of(ptr, type, member) ({                              \
+        void *__mptr = (void *)(ptr);                                   \
+        ((type *)(__mptr - offsetof(type, member))); })
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
+SEC("dump//sys/kernel/bpfdump/netlink")
+int BPF_PROG(dump_netlink, struct netlink_sock *nlk, struct seq_file *seq, u64 seq_num)
+{
+	static const char banner[] =
+		"sk               Eth Pid        Groups   "
+		"Rmem     Wmem     Dump  Locks    Drops    Inode\n";
+	static const char fmt1[] = "%pK %-3d ";
+	static const char fmt2[] = "%-10u %08x ";
+	static const char fmt3[] = "%-8d %-8d ";
+	static const char fmt4[] = "%-5d %-8d ";
+	static const char fmt5[] = "%-8u %-8lu\n";
+	struct sock *s = &nlk->sk;
+	unsigned long group, ino;
+	struct inode *inode;
+	struct socket *sk;
+
+	if (seq_num == 0)
+		bpf_seq_printf(seq, banner, sizeof(banner));
+
+	bpf_seq_printf(seq, fmt1, sizeof(fmt1), s, s->sk_protocol);
+
+	if (!nlk->groups)  {
+		group = 0;
+	} else {
+		/* FIXME: temporary use bpf_probe_read here, needs
+		 * verifier support to do direct access.
+		 */
+		bpf_probe_read(&group, sizeof(group), &nlk->groups[0]);
+	}
+	bpf_seq_printf(seq, fmt2, sizeof(fmt2), nlk->portid, (u32)group);
+
+
+	bpf_seq_printf(seq, fmt3, sizeof(fmt3), s->sk_rmem_alloc.counter,
+		       s->sk_wmem_alloc.refs.counter - 1);
+	bpf_seq_printf(seq, fmt4, sizeof(fmt4), nlk->cb_running,
+		       s->sk_refcnt.refs.counter);
+
+	sk = s->sk_socket;
+	if (!sk) {
+		ino = 0;
+	} else {
+		/* FIXME: container_of inside SOCK_INODE has a forced
+		 * type conversion, and direct access cannot be used
+		 * with current verifier.
+		 */
+		inode = SOCK_INODE(sk);
+		bpf_probe_read(&ino, sizeof(ino), &inode->i_ino);
+	}
+	bpf_seq_printf(seq, fmt5, sizeof(fmt5), s->sk_drops.counter, ino);
+
+	return 0;
+}
-- 
2.24.1



* [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (13 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 14/16] tools/bpf: selftests: add dumper programs for ipv6_route and netlink Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  2020-04-10  3:33   ` Alexei Starovoitov
  2020-04-08 23:25 ` [RFC PATCH bpf-next 16/16] tools/bpf: selftests: add a selftest for anonymous dumper Yonghong Song
  15 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The implementation is arbitrary, just to show how bpf programs
can be written for bpf_map/task/task_file. They can be customized
for specific needs.

For example, for bpf_map, the dumper prints out:
  $ cat /sys/kernel/bpfdump/bpf_map/my1
      id   refcnt  usercnt  locked_vm
       3        2        0         20
       6        2        0         20
       9        2        0         20
      12        2        0         20
      13        2        0         20
      16        2        0         20
      19        2        0         20

For task, the dumper prints out:
  $ cat /sys/kernel/bpfdump/task/my1
    tgid      gid
       1        1
       2        2
    ....
    1944     1944
    1948     1948
    1949     1949
    1953     1953

For task/file, the dumper prints out:
  $ cat /sys/kernel/bpfdump/task/file/my1
    tgid      gid       fd      file
       1        1        0 ffffffff95c97600
       1        1        1 ffffffff95c97600
       1        1        2 ffffffff95c97600
    ....
    1895     1895      255 ffffffff95c8fe00
    1932     1932        0 ffffffff95c8fe00
    1932     1932        1 ffffffff95c8fe00
    1932     1932        2 ffffffff95c8fe00
    1932     1932        3 ffffffff95c185c0

This prints out all open files (fd and file->f_op), so users can compare
f_op against a particular kernel file_operations structure to identify the file type.
For example, from /proc/kallsyms, we can find
  ffffffff95c185c0 r eventfd_fops
so we will know tgid 1932 fd 3 is an eventfd file descriptor.
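The address-to-symbol lookup described above can be scripted. A small sketch (the helper name is an assumption, and the sample address is the one from the output above; real addresses differ per boot and read as zeros without CAP_SYSLOG):

```python
def resolve_kallsyms(addr, lines):
    """Map an address to a symbol name, given /proc/kallsyms-style
    lines of the form "<hex-addr> <type> <name>"."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            table[int(fields[0], 16)] = fields[2]
    return table.get(addr)

# Hypothetical use against the live kernel:
# with open("/proc/kallsyms") as f:
#     print(resolve_kallsyms(0xffffffff95c185c0, f))  # e.g. "eventfd_fops"
```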

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/progs/bpfdump_bpf_map.c     | 24 +++++++++++++++++++
 .../selftests/bpf/progs/bpfdump_task.c        | 21 ++++++++++++++++
 .../selftests/bpf/progs/bpfdump_task_file.c   | 24 +++++++++++++++++++
 3 files changed, 69 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_bpf_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_task.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_task_file.c

diff --git a/tools/testing/selftests/bpf/progs/bpfdump_bpf_map.c b/tools/testing/selftests/bpf/progs/bpfdump_bpf_map.c
new file mode 100644
index 000000000000..c85f5a330010
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpfdump_bpf_map.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("dump//sys/kernel/bpfdump/bpf_map")
+int BPF_PROG(dump_bpf_map, struct bpf_map *map, struct seq_file *seq, u64 seq_num)
+{
+	static const char banner[] = "      id   refcnt  usercnt  locked_vm\n";
+	static const char fmt1[] = "%8u %8ld ";
+	static const char fmt2[] = "%8ld %10lu\n";
+
+	if (seq_num == 0)
+		bpf_seq_printf(seq, banner, sizeof(banner));
+
+	bpf_seq_printf(seq, fmt1, sizeof(fmt1), map->id, map->refcnt.counter);
+	bpf_seq_printf(seq, fmt2, sizeof(fmt2), map->usercnt.counter,
+		       map->memory.user->locked_vm.counter);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpfdump_task.c b/tools/testing/selftests/bpf/progs/bpfdump_task.c
new file mode 100644
index 000000000000..4d90ba97fbda
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpfdump_task.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("dump//sys/kernel/bpfdump/task")
+int BPF_PROG(dump_tasks, struct task_struct *task, struct seq_file *seq, u64 seq_num)
+{
+	static char const banner[] = "    tgid      gid\n";
+	static char const fmt[] = "%8d %8d\n";
+
+	if (seq_num == 0)
+		bpf_seq_printf(seq, banner, sizeof(banner));
+
+	bpf_seq_printf(seq, fmt, sizeof(fmt), task->tgid, task->pid);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpfdump_task_file.c b/tools/testing/selftests/bpf/progs/bpfdump_task_file.c
new file mode 100644
index 000000000000..5cf02c050e1f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpfdump_task_file.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("dump//sys/kernel/bpfdump/task/file")
+int BPF_PROG(dump_tasks, struct task_struct *task, __u32 fd, struct file *file,
+	     struct seq_file *seq, u64 seq_num)
+{
+	static char const banner[] = "    tgid      gid       fd      file\n";
+	static char const fmt1[] = "%8d %8d";
+	static char const fmt2[] = " %8d %lx\n";
+
+	if (seq_num == 0)
+		bpf_seq_printf(seq, banner, sizeof(banner));
+
+	bpf_seq_printf(seq, fmt1, sizeof(fmt1), task->tgid, task->pid);
+	bpf_seq_printf(seq, fmt2, sizeof(fmt2), fd, (long)file->f_op);
+	return 0;
+}
-- 
2.24.1



* [RFC PATCH bpf-next 16/16] tools/bpf: selftests: add a selftest for anonymous dumper
  2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
                   ` (14 preceding siblings ...)
  2020-04-08 23:25 ` [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file Yonghong Song
@ 2020-04-08 23:25 ` Yonghong Song
  15 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-08 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The selftest creates an anonymous dumper for the
/sys/kernel/bpfdump/task/ target and ensures that
user space gets the expected contents. Both the
bpf_seq_printf() and bpf_seq_write() helpers
are exercised in this selftest.

  $ test_progs -n 2
  #2 bpfdump_test:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
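The expected buffer "0A1B2C3D" follows directly from the dumper program in this patch: for the first four tasks it prints the counter with bpf_seq_printf() and then writes one raw character with bpf_seq_write(). A sketch of how that string is assembled:

```python
expected = ""
for count in range(4):
    expected += str(count)             # bpf_seq_printf(seq, "%d", count)
    expected += chr(ord("A") + count)  # bpf_seq_write(seq, &c, 1)

# expected == "0A1B2C3D"
```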

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/prog_tests/bpfdump_test.c   | 41 +++++++++++++++++++
 .../selftests/bpf/progs/bpfdump_test_kern.c   | 26 ++++++++++++
 2 files changed, 67 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpfdump_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_test_kern.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpfdump_test.c b/tools/testing/selftests/bpf/prog_tests/bpfdump_test.c
new file mode 100644
index 000000000000..a04fae7f1e3d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpfdump_test.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include "bpfdump_test_kern.skel.h"
+
+void test_bpfdump_test(void)
+{
+	int err, prog_fd, dumper_fd, duration = 0;
+	struct bpfdump_test_kern *skel;
+	char buf[16] = {};
+	const char *expected = "0A1B2C3D";
+
+	skel = bpfdump_test_kern__open_and_load();
+	if (CHECK(!skel, "skel_open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	prog_fd = bpf_program__fd(skel->progs.dump_tasks);
+	dumper_fd = bpf_prog_attach(prog_fd, 0, BPF_TRACE_DUMP, 0);
+	if (CHECK(dumper_fd < 0, "bpf_prog_attach", "attach dumper failed\n"))
+		goto destroy_skel;
+
+	err = -EINVAL;
+	while (read(dumper_fd, buf, sizeof(buf)) > 0) {
+		if (CHECK(!err, "read", "unexpected extra read\n"))
+			goto close_fd;
+
+		err = strcmp(buf, expected) != 0;
+		if (CHECK(err, "read",
+			  "read failed: buf %s, expected %s\n", buf,
+			  expected))
+			goto close_fd;
+	}
+
+	CHECK(err, "read", "read failed: no read, expected %s\n",
+	      expected);
+
+close_fd:
+	close(dumper_fd);
+destroy_skel:
+	bpfdump_test_kern__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpfdump_test_kern.c b/tools/testing/selftests/bpf/progs/bpfdump_test_kern.c
new file mode 100644
index 000000000000..4758f5d11d9c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpfdump_test_kern.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_endian.h>
+
+char _license[] SEC("license") = "GPL";
+
+int count = 0;
+
+SEC("dump//sys/kernel/bpfdump/task")
+int BPF_PROG(dump_tasks, struct task_struct *task, struct seq_file *seq, u64 seq_num)
+{
+	static char fmt[] = "%d";
+	char c;
+
+	if (count < 4) {
+		bpf_seq_printf(seq, fmt, sizeof(fmt), count);
+		c = 'A' + count;
+		bpf_seq_write(seq, &c, sizeof(c));
+		count++;
+	}
+
+	return 0;
+}
-- 
2.24.1



* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
@ 2020-04-10  3:00   ` Alexei Starovoitov
  2020-04-10  6:09     ` Yonghong Song
  2020-04-10 22:42     ` Yonghong Song
  2020-04-10 22:51   ` Andrii Nakryiko
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10  3:00 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team

On Wed, Apr 08, 2020 at 04:25:26PM -0700, Yonghong Song wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0f1cbed446c1..b51d56fc77f9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -354,6 +354,7 @@ enum {
>  /* Flags for accessing BPF object from syscall side. */
>  	BPF_F_RDONLY		= (1U << 3),
>  	BPF_F_WRONLY		= (1U << 4),
> +	BPF_F_DUMP		= (1U << 5),
...
>  static int bpf_obj_pin(const union bpf_attr *attr)
>  {
> -	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
> +	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags & ~BPF_F_DUMP)
>  		return -EINVAL;
>  
> +	if (attr->file_flags == BPF_F_DUMP)
> +		return bpf_dump_create(attr->bpf_fd,
> +				       u64_to_user_ptr(attr->dumper_name));
> +
>  	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
>  }

I think kernel can be a bit smarter here. There is no need for user space
to pass BPF_F_DUMP flag to kernel just to differentiate the pinning.
Can prog attach type be used instead?


* Re: [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id
  2020-04-08 23:25 ` [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id Yonghong Song
@ 2020-04-10  3:10   ` Alexei Starovoitov
  2020-04-10  6:11     ` Yonghong Song
  0 siblings, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10  3:10 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team

On Wed, Apr 08, 2020 at 04:25:33PM -0700, Yonghong Song wrote:
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index a245f0df53c4..fc2157e319f1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -113,6 +113,7 @@ enum bpf_cmd {
>  	BPF_MAP_DELETE_BATCH,
>  	BPF_LINK_CREATE,
>  	BPF_LINK_UPDATE,
> +	BPF_DUMP_QUERY,
>  };
>  
>  enum bpf_map_type {
> @@ -594,6 +595,18 @@ union bpf_attr {
>  		__u32		old_prog_fd;
>  	} link_update;
>  
> +	struct {
> +		__u32		query_fd;
> +		__u32		flags;
> +		union {
> +			struct {
> +				__aligned_u64	target_proto;
> +				__u32		proto_buf_len;
> +			};
> +			__u32			prog_id;
> +		};
> +	} dump_query;

I think it would be cleaner to use BPF_OBJ_GET_INFO_BY_FD instead of
introducing new command.


* Re: [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets
  2020-04-08 23:25 ` [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets Yonghong Song
@ 2020-04-10  3:22   ` Alexei Starovoitov
  2020-04-10  6:19     ` Yonghong Song
  2020-04-13 23:00   ` Andrii Nakryiko
  1 sibling, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10  3:22 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team

On Wed, Apr 08, 2020 at 04:25:29PM -0700, Yonghong Song wrote:
> +
> +	spin_lock(&files->file_lock);
> +	for (; sfd < files_fdtable(files)->max_fds; sfd++) {
> +		struct file *f;
> +
> +		f = fcheck_files(files, sfd);
> +		if (!f)
> +			continue;
> +
> +		*fd = sfd;
> +		get_file(f);
> +		spin_unlock(&files->file_lock);
> +		return f;
> +	}
> +
> +	/* the current task is done, go to the next task */
> +	spin_unlock(&files->file_lock);
> +	put_files_struct(files);

I think spin_lock is unnecessary.
It's similarly unnecessary in bpf_task_fd_query().
Take a look at proc_readfd_common() in fs/proc/fd.c.
It only needs rcu_read_lock() to iterate fd array.


* Re: [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-08 23:25 ` [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
@ 2020-04-10  3:26   ` Alexei Starovoitov
  2020-04-10  6:12     ` Yonghong Song
  2020-04-14  5:28   ` Andrii Nakryiko
  1 sibling, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10  3:26 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team

On Wed, Apr 08, 2020 at 04:25:31PM -0700, Yonghong Song wrote:
>  
> +BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size, u64, arg1,
> +	   u64, arg2)
> +{
> +	bool buf_used = false;
> +	int i, copy_size;
> +	int mod[2] = {};
> +	int fmt_cnt = 0;
> +	u64 unsafe_addr;
> +	char buf[64];
> +
> +	/*
> +	 * bpf_check()->check_func_arg()->check_stack_boundary()
> +	 * guarantees that fmt points to bpf program stack,
> +	 * fmt_size bytes of it were initialized and fmt_size > 0
> +	 */
> +	if (fmt[--fmt_size] != 0)
> +		return -EINVAL;
...
> +/* Horrid workaround for getting va_list handling working with different
> + * argument type combinations generically for 32 and 64 bit archs.
> + */
> +#define __BPF_SP_EMIT()	__BPF_ARG2_SP()
> +#define __BPF_SP(...)							\
> +	seq_printf(m, fmt, ##__VA_ARGS__)
> +
> +#define __BPF_ARG1_SP(...)						\
> +	((mod[0] == 2 || (mod[0] == 1 && __BITS_PER_LONG == 64))	\
> +	  ? __BPF_SP(arg1, ##__VA_ARGS__)				\
> +	  : ((mod[0] == 1 || (mod[0] == 0 && __BITS_PER_LONG == 32))	\
> +	      ? __BPF_SP((long)arg1, ##__VA_ARGS__)			\
> +	      : __BPF_SP((u32)arg1, ##__VA_ARGS__)))
> +
> +#define __BPF_ARG2_SP(...)						\
> +	((mod[1] == 2 || (mod[1] == 1 && __BITS_PER_LONG == 64))	\
> +	  ? __BPF_ARG1_SP(arg2, ##__VA_ARGS__)				\
> +	  : ((mod[1] == 1 || (mod[1] == 0 && __BITS_PER_LONG == 32))	\
> +	      ? __BPF_ARG1_SP((long)arg2, ##__VA_ARGS__)		\
> +	      : __BPF_ARG1_SP((u32)arg2, ##__VA_ARGS__)))
> +
> +	__BPF_SP_EMIT();
> +	return seq_has_overflowed(m);
> +}

This function is mostly a copy-paste of bpf_trace_printk() with difference
of printing via seq_printf vs __trace_printk.
Please find a way to share the code.


* Re: [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file
  2020-04-08 23:25 ` [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file Yonghong Song
@ 2020-04-10  3:33   ` Alexei Starovoitov
  2020-04-10  6:41     ` Yonghong Song
  0 siblings, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10  3:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team

On Wed, Apr 08, 2020 at 04:25:38PM -0700, Yonghong Song wrote:
> For task/file, the dumper prints out:
>   $ cat /sys/kernel/bpfdump/task/file/my1
>     tgid      gid       fd      file
>        1        1        0 ffffffff95c97600
>        1        1        1 ffffffff95c97600
>        1        1        2 ffffffff95c97600
>     ....
>     1895     1895      255 ffffffff95c8fe00
>     1932     1932        0 ffffffff95c8fe00
>     1932     1932        1 ffffffff95c8fe00
>     1932     1932        2 ffffffff95c8fe00
>     1932     1932        3 ffffffff95c185c0
...
> +SEC("dump//sys/kernel/bpfdump/task/file")
> +int BPF_PROG(dump_tasks, struct task_struct *task, __u32 fd, struct file *file,
> +	     struct seq_file *seq, u64 seq_num)
> +{
> +	static char const banner[] = "    tgid      gid       fd      file\n";
> +	static char const fmt1[] = "%8d %8d";
> +	static char const fmt2[] = " %8d %lx\n";
> +
> +	if (seq_num == 0)
> +		bpf_seq_printf(seq, banner, sizeof(banner));
> +
> +	bpf_seq_printf(seq, fmt1, sizeof(fmt1), task->tgid, task->pid);
> +	bpf_seq_printf(seq, fmt2, sizeof(fmt2), fd, (long)file->f_op);
> +	return 0;
> +}

I wonder what is the speed of walking all files in all tasks with an empty
program? If it's fast I can imagine a million use cases for such searching bpf
prog. Like finding which task owns particular socket. This could be a massive
feature.

With one redundant spin_lock removed it seems it will be one spin_lock per prog
invocation? May be eventually it can be amortized within seq_file iterating
logic. Would be really awesome if the cost is just refcnt ++/-- per call and
rcu_read_lock.


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10  3:00   ` Alexei Starovoitov
@ 2020-04-10  6:09     ` Yonghong Song
  2020-04-10 22:42     ` Yonghong Song
  1 sibling, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-10  6:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/9/20 8:00 PM, Alexei Starovoitov wrote:
> On Wed, Apr 08, 2020 at 04:25:26PM -0700, Yonghong Song wrote:
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0f1cbed446c1..b51d56fc77f9 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -354,6 +354,7 @@ enum {
>>   /* Flags for accessing BPF object from syscall side. */
>>   	BPF_F_RDONLY		= (1U << 3),
>>   	BPF_F_WRONLY		= (1U << 4),
>> +	BPF_F_DUMP		= (1U << 5),
> ...
>>   static int bpf_obj_pin(const union bpf_attr *attr)
>>   {
>> -	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
>> +	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags & ~BPF_F_DUMP)
>>   		return -EINVAL;
>>   
>> +	if (attr->file_flags == BPF_F_DUMP)
>> +		return bpf_dump_create(attr->bpf_fd,
>> +				       u64_to_user_ptr(attr->dumper_name));
>> +
>>   	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
>>   }
> 
> I think kernel can be a bit smarter here. There is no need for user space
> to pass BPF_F_DUMP flag to kernel just to differentiate the pinning.
> Can prog attach type be used instead?

Good point. Yes, no need for BPF_F_DUMP, expected_attach_type
is enough. Will change.


* Re: [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id
  2020-04-10  3:10   ` Alexei Starovoitov
@ 2020-04-10  6:11     ` Yonghong Song
  0 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-10  6:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/9/20 8:10 PM, Alexei Starovoitov wrote:
> On Wed, Apr 08, 2020 at 04:25:33PM -0700, Yonghong Song wrote:
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index a245f0df53c4..fc2157e319f1 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -113,6 +113,7 @@ enum bpf_cmd {
>>   	BPF_MAP_DELETE_BATCH,
>>   	BPF_LINK_CREATE,
>>   	BPF_LINK_UPDATE,
>> +	BPF_DUMP_QUERY,
>>   };
>>   
>>   enum bpf_map_type {
>> @@ -594,6 +595,18 @@ union bpf_attr {
>>   		__u32		old_prog_fd;
>>   	} link_update;
>>   
>> +	struct {
>> +		__u32		query_fd;
>> +		__u32		flags;
>> +		union {
>> +			struct {
>> +				__aligned_u64	target_proto;
>> +				__u32		proto_buf_len;
>> +			};
>> +			__u32			prog_id;
>> +		};
>> +	} dump_query;
> 
> I think it would be cleaner to use BPF_OBJ_GET_INFO_BY_FD instead of
> introducing new command.

Right, using BPF_OBJ_GET_INFO_BY_FD should be good. Previously, I
was using the target/dumper name to query; later I changed to an fd,
but still kept the BPF_DUMP_QUERY command...

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-10  3:26   ` Alexei Starovoitov
@ 2020-04-10  6:12     ` Yonghong Song
  0 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-10  6:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/9/20 8:26 PM, Alexei Starovoitov wrote:
> On Wed, Apr 08, 2020 at 04:25:31PM -0700, Yonghong Song wrote:
>>   
>> +BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size, u64, arg1,
>> +	   u64, arg2)
>> +{
>> +	bool buf_used = false;
>> +	int i, copy_size;
>> +	int mod[2] = {};
>> +	int fmt_cnt = 0;
>> +	u64 unsafe_addr;
>> +	char buf[64];
>> +
>> +	/*
>> +	 * bpf_check()->check_func_arg()->check_stack_boundary()
>> +	 * guarantees that fmt points to bpf program stack,
>> +	 * fmt_size bytes of it were initialized and fmt_size > 0
>> +	 */
>> +	if (fmt[--fmt_size] != 0)
>> +		return -EINVAL;
> ...
>> +/* Horrid workaround for getting va_list handling working with different
>> + * argument type combinations generically for 32 and 64 bit archs.
>> + */
>> +#define __BPF_SP_EMIT()	__BPF_ARG2_SP()
>> +#define __BPF_SP(...)							\
>> +	seq_printf(m, fmt, ##__VA_ARGS__)
>> +
>> +#define __BPF_ARG1_SP(...)						\
>> +	((mod[0] == 2 || (mod[0] == 1 && __BITS_PER_LONG == 64))	\
>> +	  ? __BPF_SP(arg1, ##__VA_ARGS__)				\
>> +	  : ((mod[0] == 1 || (mod[0] == 0 && __BITS_PER_LONG == 32))	\
>> +	      ? __BPF_SP((long)arg1, ##__VA_ARGS__)			\
>> +	      : __BPF_SP((u32)arg1, ##__VA_ARGS__)))
>> +
>> +#define __BPF_ARG2_SP(...)						\
>> +	((mod[1] == 2 || (mod[1] == 1 && __BITS_PER_LONG == 64))	\
>> +	  ? __BPF_ARG1_SP(arg2, ##__VA_ARGS__)				\
>> +	  : ((mod[1] == 1 || (mod[1] == 0 && __BITS_PER_LONG == 32))	\
>> +	      ? __BPF_ARG1_SP((long)arg2, ##__VA_ARGS__)		\
>> +	      : __BPF_ARG1_SP((u32)arg2, ##__VA_ARGS__)))
>> +
>> +	__BPF_SP_EMIT();
>> +	return seq_has_overflowed(m);
>> +}
> 
> This function is mostly a copy-paste of bpf_trace_printk() with difference
> of printing via seq_printf vs __trace_printk.
> Please find a way to share the code.

Will do.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets
  2020-04-10  3:22   ` Alexei Starovoitov
@ 2020-04-10  6:19     ` Yonghong Song
  2020-04-10 21:31       ` Alexei Starovoitov
  0 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-10  6:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/9/20 8:22 PM, Alexei Starovoitov wrote:
> On Wed, Apr 08, 2020 at 04:25:29PM -0700, Yonghong Song wrote:
>> +
>> +	spin_lock(&files->file_lock);
>> +	for (; sfd < files_fdtable(files)->max_fds; sfd++) {
>> +		struct file *f;
>> +
>> +		f = fcheck_files(files, sfd);
>> +		if (!f)
>> +			continue;
>> +
>> +		*fd = sfd;
>> +		get_file(f);
>> +		spin_unlock(&files->file_lock);
>> +		return f;
>> +	}
>> +
>> +	/* the current task is done, go to the next task */
>> +	spin_unlock(&files->file_lock);
>> +	put_files_struct(files);
> 
> I think spin_lock is unnecessary.
> It's similarly unnecessary in bpf_task_fd_query().
> Take a look at proc_readfd_common() in fs/proc/fd.c.
> It only needs rcu_read_lock() to iterate fd array.

I see. I was looking at the function seq_show() in fs/proc/fd.c,

...
                 spin_lock(&files->file_lock);
                 file = fcheck_files(files, fd);
                 if (file) {
                         struct fdtable *fdt = files_fdtable(files);

                         f_flags = file->f_flags;
                         if (close_on_exec(fd, fdt))
                                 f_flags |= O_CLOEXEC;

                         get_file(file);
                         ret = 0;
                 }
                 spin_unlock(&files->file_lock);
                 put_files_struct(files);
...

I guess here spin_lock is needed due to close_on_exec().

Will use rcu_read_lock() mechanism then.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file
  2020-04-10  3:33   ` Alexei Starovoitov
@ 2020-04-10  6:41     ` Yonghong Song
  0 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-10  6:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/9/20 8:33 PM, Alexei Starovoitov wrote:
> On Wed, Apr 08, 2020 at 04:25:38PM -0700, Yonghong Song wrote:
>> For task/file, the dumper prints out:
>>    $ cat /sys/kernel/bpfdump/task/file/my1
>>      tgid      gid       fd      file
>>         1        1        0 ffffffff95c97600
>>         1        1        1 ffffffff95c97600
>>         1        1        2 ffffffff95c97600
>>      ....
>>      1895     1895      255 ffffffff95c8fe00
>>      1932     1932        0 ffffffff95c8fe00
>>      1932     1932        1 ffffffff95c8fe00
>>      1932     1932        2 ffffffff95c8fe00
>>      1932     1932        3 ffffffff95c185c0
> ...
>> +SEC("dump//sys/kernel/bpfdump/task/file")
>> +int BPF_PROG(dump_tasks, struct task_struct *task, __u32 fd, struct file *file,
>> +	     struct seq_file *seq, u64 seq_num)
>> +{
>> +	static char const banner[] = "    tgid      gid       fd      file\n";
>> +	static char const fmt1[] = "%8d %8d";
>> +	static char const fmt2[] = " %8d %lx\n";
>> +
>> +	if (seq_num == 0)
>> +		bpf_seq_printf(seq, banner, sizeof(banner));
>> +
>> +	bpf_seq_printf(seq, fmt1, sizeof(fmt1), task->tgid, task->pid);
>> +	bpf_seq_printf(seq, fmt2, sizeof(fmt2), fd, (long)file->f_op);
>> +	return 0;
>> +}
> 
> I wonder what is the speed of walking all files in all tasks with an empty
> program? If it's fast I can imagine a million use cases for such searching bpf
> prog. Like finding which task owns particular socket. This could be a massive
> feature.
> 
> With one redundant spin_lock removed it seems it will be one spin_lock per prog
> invocation? May be eventually it can be amortized within seq_file iterating
> logic. Would be really awesome if the cost is just refcnt ++/-- per call and
> rcu_read_lock.

The main seq_read() loop is below:
         while (1) {
                 size_t offs = m->count;
                 loff_t pos = m->index;

                 p = m->op->next(m, p, &m->index);
                 if (pos == m->index)
                         /* Buggy ->next function */
                         m->index++;
                 if (!p || IS_ERR(p)) {
                         err = PTR_ERR(p);
                         break;
                 }
                 if (m->count >= size)
                         break;
                 err = m->op->show(m, p);
                 if (seq_has_overflowed(m) || err) {
                         m->count = offs;
                         if (likely(err <= 0))
                                 break;
                 }
         }

If we remove the spin_lock() as suggested in the other email comment,
we won't have a spin_lock() in the seq_ops->next() function, only
refcnt ++/-- and rcu_read_{lock,unlock}(). The seq_ops->show() does
not take any spin_lock() either.

I have not got time to do a perf measurement yet.
Will do in the next revision.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets
  2020-04-10  6:19     ` Yonghong Song
@ 2020-04-10 21:31       ` Alexei Starovoitov
  2020-04-10 21:33         ` Alexei Starovoitov
  0 siblings, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10 21:31 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team

On Thu, Apr 09, 2020 at 11:19:10PM -0700, Yonghong Song wrote:
> 
> 
> On 4/9/20 8:22 PM, Alexei Starovoitov wrote:
> > On Wed, Apr 08, 2020 at 04:25:29PM -0700, Yonghong Song wrote:
> > > +
> > > +	spin_lock(&files->file_lock);
> > > +	for (; sfd < files_fdtable(files)->max_fds; sfd++) {
> > > +		struct file *f;
> > > +
> > > +		f = fcheck_files(files, sfd);
> > > +		if (!f)
> > > +			continue;
> > > +
> > > +		*fd = sfd;
> > > +		get_file(f);
> > > +		spin_unlock(&files->file_lock);
> > > +		return f;
> > > +	}
> > > +
> > > +	/* the current task is done, go to the next task */
> > > +	spin_unlock(&files->file_lock);
> > > +	put_files_struct(files);
> > 
> > I think spin_lock is unnecessary.
> > It's similarly unnecessary in bpf_task_fd_query().
> > Take a look at proc_readfd_common() in fs/proc/fd.c.
> > It only needs rcu_read_lock() to iterate fd array.
> 
> I see. I was looking at function seq_show() at fs/proc/fd.c,
> 
> ...
>                 spin_lock(&files->file_lock);
>                 file = fcheck_files(files, fd);
>                 if (file) {
>                         struct fdtable *fdt = files_fdtable(files);
> 
>                         f_flags = file->f_flags;
>                         if (close_on_exec(fd, fdt))
>                                 f_flags |= O_CLOEXEC;
> 
>                         get_file(file);
>                         ret = 0;
>                 }
>                 spin_unlock(&files->file_lock);
>                 put_files_struct(files);
> ...
> 
> I guess here spin_lock is needed due to close_on_exec().

Right. fdt->close_on_exec array is not rcu protected and needs that spin_lock.

> Will use rcu_read_lock() mechanism then.

Thanks!

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets
  2020-04-10 21:31       ` Alexei Starovoitov
@ 2020-04-10 21:33         ` Alexei Starovoitov
  0 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-10 21:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Network Development,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 2:31 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 09, 2020 at 11:19:10PM -0700, Yonghong Song wrote:
> >
> >
> > On 4/9/20 8:22 PM, Alexei Starovoitov wrote:
> > > On Wed, Apr 08, 2020 at 04:25:29PM -0700, Yonghong Song wrote:
> > > > +
> > > > + spin_lock(&files->file_lock);
> > > > + for (; sfd < files_fdtable(files)->max_fds; sfd++) {
> > > > +         struct file *f;
> > > > +
> > > > +         f = fcheck_files(files, sfd);
> > > > +         if (!f)
> > > > +                 continue;
> > > > +
> > > > +         *fd = sfd;
> > > > +         get_file(f);
> > > > +         spin_unlock(&files->file_lock);
> > > > +         return f;
> > > > + }
> > > > +
> > > > + /* the current task is done, go to the next task */
> > > > + spin_unlock(&files->file_lock);
> > > > + put_files_struct(files);
> > >
> > > I think spin_lock is unnecessary.
> > > It's similarly unnecessary in bpf_task_fd_query().
> > > Take a look at proc_readfd_common() in fs/proc/fd.c.
> > > It only needs rcu_read_lock() to iterate fd array.
> >
> > I see. I was looking at function seq_show() at fs/proc/fd.c,
> >
> > ...
> >                 spin_lock(&files->file_lock);
> >                 file = fcheck_files(files, fd);
> >                 if (file) {
> >                         struct fdtable *fdt = files_fdtable(files);
> >
> >                         f_flags = file->f_flags;
> >                         if (close_on_exec(fd, fdt))
> >                                 f_flags |= O_CLOEXEC;
> >
> >                         get_file(file);
> >                         ret = 0;
> >                 }
> >                 spin_unlock(&files->file_lock);
> >                 put_files_struct(files);
> > ...
> >
> > I guess here spin_lock is needed due to close_on_exec().
>
> Right. fdt->close_on_exec array is not rcu protected and needs that spin_lock.

Actually. I'll take it back. fdt is rcu protected and that member is part of it.
So imo seq_show() is taking that spin_lock unnecessarily.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-08 23:25 ` [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves Yonghong Song
@ 2020-04-10 22:18   ` Andrii Nakryiko
  2020-04-10 23:24     ` Yonghong Song
  2020-04-15 22:57     ` Yonghong Song
  2020-04-10 22:25   ` Andrii Nakryiko
  1 sibling, 2 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 22:18 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Here, the target refers to a particular data structure
> inside the kernel we want to dump. For example, it
> can be all task_structs in the current pid namespace,
> or it could be all open files for all task_structs
> in the current pid namespace.
>
> Each target is identified with the following information:
>    target_rel_path   <=== relative path to /sys/kernel/bpfdump
>    target_proto      <=== kernel func proto which represents
>                           bpf program signature for this target
>    seq_ops           <=== seq_ops for seq_file operations
>    seq_priv_size     <=== seq_file private data size
>    target_feature    <=== target specific feature which needs
>                           handling outside seq_ops.

It's not clear what "feature" stands for here... Is this just a sort
of private_data passed through to dumper?

>
> The target relative path is a relative directory to /sys/kernel/bpfdump/.
> For example, it could be:
>    task                  <=== all tasks
>    task/file             <=== all open files under all tasks
>    ipv6_route            <=== all ipv6_routes
>    tcp6/sk_local_storage <=== all tcp6 socket local storages
>    foo/bar/tar           <=== all tar's in bar in foo

^^ this seems useful, but I don't think the code as is supports more than 2 levels?

>
> The "target_feature" is mostly used for reusing existing seq_ops.
> For example, for /proc/net/<> stats, the "net" namespace is often
> stored in file private data. The target_feature enables bpf based
> dumper to set "net" properly for itself before calling shared
> seq_ops.
>
> bpf_dump_reg_target() is implemented so targets
> can register themselves. Currently, module is not
> supported, so there is no bpf_dump_unreg_target().
> The main reason is that BTF is not available for modules
> yet.
>
> Since target might call bpf_dump_reg_target() before
> bpfdump mount point is created, __bpfdump_init()
> may be called in bpf_dump_reg_target() as well.
>
> The file-based dumpers will be regular files under
> the specific target directory. For example,
>    task/my1      <=== dumper "my1" iterates through all tasks
>    task/file/my2 <=== dumper "my2" iterates through all open files
>                       under all tasks
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h |   4 +
>  kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 193 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fd2b2322412d..53914bec7590 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1109,6 +1109,10 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>  int bpf_obj_get_user(const char __user *pathname, int flags);
>
> +int bpf_dump_reg_target(const char *target, const char *target_proto,
> +                       const struct seq_operations *seq_ops,
> +                       u32 seq_priv_size, u32 target_feature);
> +
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
> diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
> index e0c33486e0e7..45528846557f 100644
> --- a/kernel/bpf/dump.c
> +++ b/kernel/bpf/dump.c
> @@ -12,6 +12,173 @@
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>
> +struct bpfdump_target_info {
> +       struct list_head list;
> +       const char *target;
> +       const char *target_proto;
> +       struct dentry *dir_dentry;
> +       const struct seq_operations *seq_ops;
> +       u32 seq_priv_size;
> +       u32 target_feature;
> +};
> +
> +struct bpfdump_targets {
> +       struct list_head dumpers;
> +       struct mutex dumper_mutex;

nit: would be a bit simpler if these were static variables with static
initialization, similar to how bpfdump_dentry is separate?

> +};
> +
> +/* registered dump targets */
> +static struct bpfdump_targets dump_targets;
> +
> +static struct dentry *bpfdump_dentry;
> +
> +static struct dentry *bpfdump_add_dir(const char *name, struct dentry *parent,
> +                                     const struct inode_operations *i_ops,
> +                                     void *data);
> +static int __bpfdump_init(void);
> +
> +static int dumper_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +       kfree(d_inode(dentry)->i_private);
> +       return simple_unlink(dir, dentry);
> +}
> +
> +static const struct inode_operations bpf_dir_iops = {
> +       .lookup         = simple_lookup,
> +       .unlink         = dumper_unlink,
> +};
> +
> +int bpf_dump_reg_target(const char *target,
> +                       const char *target_proto,
> +                       const struct seq_operations *seq_ops,
> +                       u32 seq_priv_size, u32 target_feature)
> +{
> +       struct bpfdump_target_info *tinfo, *ptinfo;
> +       struct dentry *dentry, *parent;
> +       const char *lastslash;
> +       bool existed = false;
> +       int err, parent_len;
> +
> +       if (!bpfdump_dentry) {
> +               err = __bpfdump_init();

This will be called (again) if bpfdump_init() fails? Not sure why? In
rare cases, some dumper will fail to initialize, but then some might
succeed, which is going to be even more confusing, no?

> +               if (err)
> +                       return err;
> +       }
> +
> +       tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
> +       if (!tinfo)
> +               return -ENOMEM;
> +
> +       tinfo->target = target;
> +       tinfo->target_proto = target_proto;
> +       tinfo->seq_ops = seq_ops;
> +       tinfo->seq_priv_size = seq_priv_size;
> +       tinfo->target_feature = target_feature;
> +       INIT_LIST_HEAD(&tinfo->list);
> +
> +       lastslash = strrchr(target, '/');
> +       if (!lastslash) {
> +               parent = bpfdump_dentry;

Two nits here. First, it supports only one and two levels. But it
seems like it wouldn't be hard to support multiple? Instead of
reverse-searching for /, you can forward search and keep track of
"current parent".

nit2:

parent = bpfdump_dentry;
if (lastslash) {

    parent = ptinfo->dir_dentry;
}

seems a bit cleaner (and generalizes to multi-level a bit better).

> +       } else {
> +               parent_len = (unsigned long)lastslash - (unsigned long)target;
> +
> +               mutex_lock(&dump_targets.dumper_mutex);
> +               list_for_each_entry(ptinfo, &dump_targets.dumpers, list) {
> +                       if (strlen(ptinfo->target) == parent_len &&
> +                           strncmp(ptinfo->target, target, parent_len) == 0) {
> +                               existed = true;
> +                               break;
> +                       }
> +               }
> +               mutex_unlock(&dump_targets.dumper_mutex);
> +               if (existed == false) {
> +                       err = -ENOENT;
> +                       goto free_tinfo;
> +               }
> +
> +               parent = ptinfo->dir_dentry;
> +               target = lastslash + 1;
> +       }
> +       dentry = bpfdump_add_dir(target, parent, &bpf_dir_iops, tinfo);
> +       if (IS_ERR(dentry)) {
> +               err = PTR_ERR(dentry);
> +               goto free_tinfo;
> +       }
> +
> +       tinfo->dir_dentry = dentry;
> +
> +       mutex_lock(&dump_targets.dumper_mutex);
> +       list_add(&tinfo->list, &dump_targets.dumpers);
> +       mutex_unlock(&dump_targets.dumper_mutex);
> +       return 0;
> +
> +free_tinfo:
> +       kfree(tinfo);
> +       return err;
> +}
> +

[...]

> +       if (S_ISDIR(mode)) {
> +               inode->i_op = i_ops;
> +               inode->i_fop = f_ops;
> +               inc_nlink(inode);
> +               inc_nlink(dir);
> +       } else {
> +               inode->i_fop = f_ops;
> +       }
> +
> +       d_instantiate(dentry, inode);
> +       dget(dentry);

lookup_one_len already bumped refcount, why the second time here?

> +       inode_unlock(dir);
> +       return dentry;
> +
> +dentry_put:
> +       dput(dentry);
> +       dentry = ERR_PTR(err);
> +unlock:
> +       inode_unlock(dir);
> +       return dentry;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-08 23:25 ` [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves Yonghong Song
  2020-04-10 22:18   ` Andrii Nakryiko
@ 2020-04-10 22:25   ` Andrii Nakryiko
  2020-04-10 23:25     ` Yonghong Song
  1 sibling, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 22:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Here, the target refers to a particular data structure
> inside the kernel we want to dump. For example, it
> can be all task_structs in the current pid namespace,
> or it could be all open files for all task_structs
> in the current pid namespace.
>
> Each target is identified with the following information:
>    target_rel_path   <=== relative path to /sys/kernel/bpfdump
>    target_proto      <=== kernel func proto which represents
>                           bpf program signature for this target
>    seq_ops           <=== seq_ops for seq_file operations
>    seq_priv_size     <=== seq_file private data size
>    target_feature    <=== target specific feature which needs
>                           handling outside seq_ops.
>
> The target relative path is a relative directory to /sys/kernel/bpfdump/.
> For example, it could be:
>    task                  <=== all tasks
>    task/file             <=== all open files under all tasks
>    ipv6_route            <=== all ipv6_routes
>    tcp6/sk_local_storage <=== all tcp6 socket local storages
>    foo/bar/tar           <=== all tar's in bar in foo
>
> The "target_feature" is mostly used for reusing existing seq_ops.
> For example, for /proc/net/<> stats, the "net" namespace is often
> stored in file private data. The target_feature enables bpf based
> dumper to set "net" properly for itself before calling shared
> seq_ops.
>
> bpf_dump_reg_target() is implemented so targets
> can register themselves. Currently, module is not
> supported, so there is no bpf_dump_unreg_target().
> The main reason is that BTF is not available for modules
> yet.
>
> Since target might call bpf_dump_reg_target() before
> bpfdump mount point is created, __bpfdump_init()
> may be called in bpf_dump_reg_target() as well.
>
> The file-based dumpers will be regular files under
> the specific target directory. For example,
>    task/my1      <=== dumper "my1" iterates through all tasks
>    task/file/my2 <=== dumper "my2" iterates through all open files
>                       under all tasks
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h |   4 +
>  kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 193 insertions(+), 1 deletion(-)
>

[...]

> +
> +static int dumper_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +       kfree(d_inode(dentry)->i_private);
> +       return simple_unlink(dir, dentry);
> +}
> +
> +static const struct inode_operations bpf_dir_iops = {

noticed this reading next patch. It should probably be called
bpfdump_dir_iops to avoid confusion with bpf_dir_iops of BPF FS in
kernel/bpf/inode.c?

> +       .lookup         = simple_lookup,
> +       .unlink         = dumper_unlink,
> +};
> +

[...]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program
  2020-04-08 23:25 ` [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program Yonghong Song
@ 2020-04-10 22:36   ` Andrii Nakryiko
  2020-04-10 23:28     ` Yonghong Song
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 22:36 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:25 PM Yonghong Song <yhs@fb.com> wrote:
>
> A dumper bpf program is a tracing program with attach type
> BPF_TRACE_DUMP. During bpf program load, the load attribute
>    attach_prog_fd
> carries the target directory fd. The program will be
> verified against btf_id of the target_proto.
>
> If the program is loaded successfully, the dump target, as
> represented as a relative path to /sys/kernel/bpfdump,
> will be remembered in prog->aux->dump_target, which will
> be used later to create dumpers.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |  2 ++
>  include/uapi/linux/bpf.h       |  1 +
>  kernel/bpf/dump.c              | 40 ++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  8 ++++++-
>  kernel/bpf/verifier.c          | 15 +++++++++++++
>  tools/include/uapi/linux/bpf.h |  1 +
>  6 files changed, 66 insertions(+), 1 deletion(-)
>

[...]

>
> +int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog)
> +{
> +       struct bpfdump_target_info *tinfo;
> +       const char *target_proto;
> +       struct file *target_file;
> +       struct fd tfd;
> +       int err = 0, btf_id;
> +
> +       if (!btf_vmlinux)
> +               return -EINVAL;
> +
> +       tfd = fdget(target_fd);
> +       target_file = tfd.file;
> +       if (!target_file)
> +               return -EBADF;

fdput is missing (or rather err = -EBADF; goto done; ?)


> +
> +       if (target_file->f_inode->i_op != &bpf_dir_iops) {
> +               err = -EINVAL;
> +               goto done;
> +       }
> +
> +       tinfo = target_file->f_inode->i_private;
> +       target_proto = tinfo->target_proto;
> +       btf_id = btf_find_by_name_kind(btf_vmlinux, target_proto,
> +                                      BTF_KIND_FUNC);
> +
> +       if (btf_id > 0) {
> +               prog->aux->dump_target = tinfo->target;
> +               prog->aux->attach_btf_id = btf_id;
> +       }
> +
> +       err = min(btf_id, 0);

this min trick looks too clever... why not more straightforward and composable:

if (btf_id < 0) {
    err = btf_id;
    goto done;
}

prog->aux->dump_target = tinfo->target;
prog->aux->attach_btf_id = btf_id;

?

> +done:
> +       fdput(tfd);
> +       return err;
> +}
> +
>  int bpf_dump_reg_target(const char *target,
>                         const char *target_proto,
>                         const struct seq_operations *seq_ops,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 64783da34202..41005dee8957 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2060,7 +2060,12 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>
>         prog->expected_attach_type = attr->expected_attach_type;
>         prog->aux->attach_btf_id = attr->attach_btf_id;
> -       if (attr->attach_prog_fd) {
> +       if (type == BPF_PROG_TYPE_TRACING &&
> +           attr->expected_attach_type == BPF_TRACE_DUMP) {
> +               err = bpf_dump_set_target_info(attr->attach_prog_fd, prog);

looking at bpf_attr, it's not clear why attach_prog_fd and
prog_ifindex were not combined into a single union field... this
probably got missed? But in this case I'd say let's create a

union {
    __u32 attach_prog_fd;
    __u32 attach_target_fd; (similar to terminology for BPF_PROG_ATTACH)
};

instead of reusing not-exactly-matching field names?

> +               if (err)
> +                       goto free_prog_nouncharge;
> +       } else if (attr->attach_prog_fd) {
>                 struct bpf_prog *tgt_prog;
>
>                 tgt_prog = bpf_prog_get(attr->attach_prog_fd);
> @@ -2145,6 +2150,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>         err = bpf_prog_new_fd(prog);
>         if (err < 0)
>                 bpf_prog_put(prog);
> +
>         return err;
>

[...]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10  3:00   ` Alexei Starovoitov
  2020-04-10  6:09     ` Yonghong Song
@ 2020-04-10 22:42     ` Yonghong Song
  2020-04-10 22:53       ` Andrii Nakryiko
  1 sibling, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 22:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, netdev,
	Alexei Starovoitov, Daniel Borkmann, kernel-team



On 4/9/20 8:00 PM, Alexei Starovoitov wrote:
> On Wed, Apr 08, 2020 at 04:25:26PM -0700, Yonghong Song wrote:
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0f1cbed446c1..b51d56fc77f9 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -354,6 +354,7 @@ enum {
>>   /* Flags for accessing BPF object from syscall side. */
>>   	BPF_F_RDONLY		= (1U << 3),
>>   	BPF_F_WRONLY		= (1U << 4),
>> +	BPF_F_DUMP		= (1U << 5),
> ...
>>   static int bpf_obj_pin(const union bpf_attr *attr)
>>   {
>> -	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
>> +	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags & ~BPF_F_DUMP)
>>   		return -EINVAL;
>>   
>> +	if (attr->file_flags == BPF_F_DUMP)
>> +		return bpf_dump_create(attr->bpf_fd,
>> +				       u64_to_user_ptr(attr->dumper_name));
>> +
>>   	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
>>   }
> 
> I think kernel can be a bit smarter here. There is no need for user space
> to pass BPF_F_DUMP flag to kernel just to differentiate the pinning.
> Can prog attach type be used instead?

Thought about it again. I think a flag is still useful.
Suppose that we have the following scenario:
   - the current directory /sys/fs/bpf/
   - user says pin a tracing/dump (target task) prog to "p1"

It is not really clear whether the user wants to pin to
    /sys/fs/bpf/p1
or wants to pin to
    /sys/kernel/bpfdump/task/p1

unless we say that a tracing/dump program cannot be pinned
to /sys/fs/bpf, which seems an unnecessary restriction.

What do you think?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
  2020-04-10  3:00   ` Alexei Starovoitov
@ 2020-04-10 22:51   ` Andrii Nakryiko
  2020-04-10 23:41     ` Yonghong Song
  2020-04-10 23:25   ` Andrii Nakryiko
  2020-04-14  5:56   ` Andrii Nakryiko
  3 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 22:51 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Given a loaded dumper bpf program, which already
> knows which target it should bind to, there are
> two ways to create a dumper:
>   - a file based dumper under the hierarchy of
>     /sys/kernel/bpfdump/, which users can
>     "cat" to print out the output.
>   - an anonymous dumper which a user application
>     can "read" to get the dumping output.
>
> For file based dumper, BPF_OBJ_PIN syscall interface
> is used. For anonymous dumper, BPF_PROG_ATTACH
> syscall interface is used.
>
> To facilitate target seq_ops->show() to get the
> bpf program easily, dumper creation increases
> the target-provided seq_file private data size
> so the bpf program pointer is also stored in the
> seq_file private data.
>
> Further, a seq_num which represents how many
> times bpf_dump_get_prog() has been called is
> also available to the target seq_ops->show().
> Such information can be used to e.g., print
> banner before printing out actual data.
>
> Note the seq_num does not represent the number
> of unique kernel objects the bpf program has
> seen, but it should be a good approximation.
>
> A target feature BPF_DUMP_SEQ_NET_PRIVATE
> is implemented; it is specifically useful for
> net based dumpers. It sets the net namespace
> to the current process net namespace.
> This avoids changing existing net seq_ops
> in order to retrieve the net namespace from
> the seq_file pointer.
>
> For open dumper files, anonymous or not, the
> fdinfo will show the target and prog_id associated
> with that file descriptor. For dumper file itself,
> a kernel interface will be provided to retrieve the
> prog_id in one of the later patches.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |   5 +
>  include/uapi/linux/bpf.h       |   6 +-
>  kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c           |  11 +-
>  tools/include/uapi/linux/bpf.h |   6 +-
>  5 files changed, 362 insertions(+), 4 deletions(-)
>

[...]

>
> +struct dumper_inode_info {
> +       struct bpfdump_target_info *tinfo;
> +       struct bpf_prog *prog;
> +};
> +
> +struct dumper_info {
> +       struct list_head list;
> +       /* file to identify an anon dumper,
> +        * dentry to identify a file dumper.
> +        */
> +       union {
> +               struct file *file;
> +               struct dentry *dentry;
> +       };
> +       struct bpfdump_target_info *tinfo;
> +       struct bpf_prog *prog;
> +};

This is essentially a bpf_link. Why not do it as a bpf_link from the
get go? Instead of having all this duplication for anonymous and
pinned dumpers, it would always be a bpf_link-based dumper, but for
those pinned bpf_link itself is going to be pinned. You also get a
benefit of being able to list all dumpers through existing bpf_link
API (also see my RFC patches with bpf_link_prime/bpf_link_settle,
which makes using bpf_link safe and simple).

[...]

> +
> +static void anon_dumper_show_fdinfo(struct seq_file *m, struct file *filp)
> +{
> +       struct dumper_info *dinfo;
> +
> +       mutex_lock(&anon_dumpers.dumper_mutex);
> +       list_for_each_entry(dinfo, &anon_dumpers.dumpers, list) {

this (and a few other places where you search in a loop) would also be
simplified, because struct file* would point to bpf_dumper_link, which
then would have a pointer to bpf_prog, dentry (if pinned), etc. No
searching at all.

> +               if (dinfo->file == filp) {
> +                       seq_printf(m, "target:\t%s\n"
> +                                     "prog_id:\t%u\n",
> +                                  dinfo->tinfo->target,
> +                                  dinfo->prog->aux->id);
> +                       break;
> +               }
> +       }
> +       mutex_unlock(&anon_dumpers.dumper_mutex);
> +}
> +
> +#endif
> +

[...]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10 22:42     ` Yonghong Song
@ 2020-04-10 22:53       ` Andrii Nakryiko
  2020-04-10 23:47         ` Yonghong Song
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 22:53 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 3:43 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/9/20 8:00 PM, Alexei Starovoitov wrote:
> > On Wed, Apr 08, 2020 at 04:25:26PM -0700, Yonghong Song wrote:
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 0f1cbed446c1..b51d56fc77f9 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -354,6 +354,7 @@ enum {
> >>   /* Flags for accessing BPF object from syscall side. */
> >>      BPF_F_RDONLY            = (1U << 3),
> >>      BPF_F_WRONLY            = (1U << 4),
> >> +    BPF_F_DUMP              = (1U << 5),
> > ...
> >>   static int bpf_obj_pin(const union bpf_attr *attr)
> >>   {
> >> -    if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
> >> +    if (CHECK_ATTR(BPF_OBJ) || attr->file_flags & ~BPF_F_DUMP)
> >>              return -EINVAL;
> >>
> >> +    if (attr->file_flags == BPF_F_DUMP)
> >> +            return bpf_dump_create(attr->bpf_fd,
> >> +                                   u64_to_user_ptr(attr->dumper_name));
> >> +
> >>      return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
> >>   }
> >
> > I think kernel can be a bit smarter here. There is no need for user space
> > to pass BPF_F_DUMP flag to kernel just to differentiate the pinning.
> > Can prog attach type be used instead?
>
> Thinking about it again, I think a flag is still useful.
> Suppose that we have the following scenario:
>    - the current directory /sys/fs/bpf/
>    - user says pin a tracing/dump (target task) prog to "p1"
>
> It is not really clear whether the user wants to pin to
>     /sys/fs/bpf/p1
> or wants to pin to
>     /sys/kernel/bpfdump/task/p1
>
> unless we say that a tracing/dump program cannot pin
> to /sys/fs/bpf, which seems an unnecessary restriction.
>
> What do you think?

Instead of special-casing dumper_name, can we require specifying full
path, and then check whether it is in BPF FS vs BPFDUMP FS? If the
latter, additionally check that it is in the right sub-directory
matching its intended target type.

But honestly, just doing everything within BPF FS starts to seem
cleaner at this point...

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets
  2020-04-08 23:25 ` [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets Yonghong Song
@ 2020-04-10 23:13   ` Andrii Nakryiko
  2020-04-10 23:52     ` Yonghong Song
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 23:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:25 PM Yonghong Song <yhs@fb.com> wrote:
>
> This patch added netlink and ipv6_route targets, using
> the same seq_ops (except show()) for /proc/net/{netlink,ipv6_route}.
>
> Since module is not supported for now, ipv6_route is
> supported only if the IPV6 is built-in, i.e., not compiled
> as a module. The restriction can be lifted once module
> is properly supported for bpfdump.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h      |  1 +
>  kernel/bpf/dump.c        | 13 ++++++++++
>  net/ipv6/ip6_fib.c       | 41 +++++++++++++++++++++++++++++-
>  net/ipv6/route.c         | 22 ++++++++++++++++
>  net/netlink/af_netlink.c | 54 +++++++++++++++++++++++++++++++++++++++-
>  5 files changed, 129 insertions(+), 2 deletions(-)
>

[...]

>
> +#if IS_BUILTIN(CONFIG_IPV6)
> +static int ipv6_route_prog_seq_show(struct bpf_prog *prog, struct seq_file *seq,
> +                                   u64 seq_num, void *v)
> +{
> +       struct ipv6_route_iter *iter = seq->private;
> +       struct {
> +               struct fib6_info *rt;
> +               struct seq_file *seq;
> +               u64 seq_num;
> +       } ctx = {

So this anonymous struct definition has to match bpfdump__ipv6_route
function prototype, if I understand correctly. So this means that BTF
will have a very useful struct, that can be used directly in BPF
program, but it won't have a canonical name. This is very sad... Would
it be possible to instead use a struct as a prototype for these
dumpers? Here's why it matters. Instead of currently requiring BPF
users to declare their dumpers as (just copy-pasted):

int BPF_PROG(some_name, struct fib6_info *rt, struct seq_file *seq,
u64 seq_num) {
   ...
}

if bpfdump__ipv6_route was actually a struct definition:


struct bpfdump__ipv6_route {
    struct fib6_info *rt;
    struct seq_file *seq;
    u64 seq_num;
};

Then with vmlinux.h, such program would be very nicely declared and used as:

int some_name(struct bpfdump__ipv6_route *ctx) {
  /* here use ctx->rt, ctx->seq, ctx->seqnum */
}

This would be nice to have for raw_tp and tp_btf as well.


Of course we can also code-generate such types from func_protos in
bpftool, and that's a plan B for this, IMO. But it seems like in this
case you already have to keep two separate entities in sync: the func
proto and the struct for the context, so I thought I'd bring it up.

> +               .rt = v,
> +               .seq = seq,
> +               .seq_num = seq_num,
> +       };
> +       int ret;
> +
> +       ret = bpf_dump_run_prog(prog, &ctx);
> +       iter->w.leaf = NULL;
> +       return ret == 0 ? 0 : -EINVAL;
> +}
> +

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-10 22:18   ` Andrii Nakryiko
@ 2020-04-10 23:24     ` Yonghong Song
  2020-04-13 19:31       ` Andrii Nakryiko
  2020-04-15 22:57     ` Yonghong Song
  1 sibling, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 23:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 3:18 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Here, the target refers to a particular data structure
>> inside the kernel we want to dump. For example, it
>> can be all task_structs in the current pid namespace,
>> or it could be all open files for all task_structs
>> in the current pid namespace.
>>
>> Each target is identified with the following information:
>>     target_rel_path   <=== relative path to /sys/kernel/bpfdump
>>     target_proto      <=== kernel func proto which represents
>>                            bpf program signature for this target
>>     seq_ops           <=== seq_ops for seq_file operations
>>     seq_priv_size     <=== seq_file private data size
>>     target_feature    <=== target specific feature which needs
>>                            handling outside seq_ops.
> 
> It's not clear what "feature" stands for here... Is this just a sort
> of private_data passed through to dumper?

This is described later. It is target-specific information passed
through to the dumper.

> 
>>
>> The target relative path is a relative directory to /sys/kernel/bpfdump/.
>> For example, it could be:
>>     task                  <=== all tasks
>>     task/file             <=== all open files under all tasks
>>     ipv6_route            <=== all ipv6_routes
>>     tcp6/sk_local_storage <=== all tcp6 socket local storages
>>     foo/bar/tar           <=== all tar's in bar in foo
> 
> ^^ this seems useful, but I don't think code as is supports more than 2 levels?

The current implementation should support it.
You need
  - first register 'foo'. target name 'foo'.
  - then register 'foo/bar'. 'foo' will be the parent of 'bar'. target 
name 'foo/bar'.
  - then 'foo/bar/tar'. 'foo/bar' will be the parent of 'tar'. target 
name 'foo/bar/tar'.

> 
>>
>> The "target_feature" is mostly used for reusing existing seq_ops.
>> For example, for /proc/net/<> stats, the "net" namespace is often
>> stored in file private data. The target_feature enables bpf based
>> dumper to set "net" properly for itself before calling shared
>> seq_ops.
>>
>> bpf_dump_reg_target() is implemented so targets
>> can register themselves. Currently, module is not
>> supported, so there is no bpf_dump_unreg_target().
>> The main reason is that BTF is not available for modules
>> yet.
>>
>> Since target might call bpf_dump_reg_target() before
>> bpfdump mount point is created, __bpfdump_init()
>> may be called in bpf_dump_reg_target() as well.
>>
>> The file-based dumpers will be regular files under
>> the specific target directory. For example,
>>     task/my1      <=== dumper "my1" iterates through all tasks
>>     task/file/my2 <=== dumper "my2" iterates through all open files
>>                        under all tasks
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h |   4 +
>>   kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 193 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index fd2b2322412d..53914bec7590 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1109,6 +1109,10 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>>   int bpf_obj_get_user(const char __user *pathname, int flags);
>>
>> +int bpf_dump_reg_target(const char *target, const char *target_proto,
>> +                       const struct seq_operations *seq_ops,
>> +                       u32 seq_priv_size, u32 target_feature);
>> +
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
>> diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
>> index e0c33486e0e7..45528846557f 100644
>> --- a/kernel/bpf/dump.c
>> +++ b/kernel/bpf/dump.c
>> @@ -12,6 +12,173 @@
>>   #include <linux/filter.h>
>>   #include <linux/bpf.h>
>>
>> +struct bpfdump_target_info {
>> +       struct list_head list;
>> +       const char *target;
>> +       const char *target_proto;
>> +       struct dentry *dir_dentry;
>> +       const struct seq_operations *seq_ops;
>> +       u32 seq_priv_size;
>> +       u32 target_feature;
>> +};
>> +
>> +struct bpfdump_targets {
>> +       struct list_head dumpers;
>> +       struct mutex dumper_mutex;
> 
> nit: would be a bit simpler if these were static variables with static
> initialization, similar to how bpfdump_dentry is separate?

yes, we could do that; not 100% sure whether it will be simpler or not.
the structure is there to glue them together.

> 
>> +};
>> +
>> +/* registered dump targets */
>> +static struct bpfdump_targets dump_targets;
>> +
>> +static struct dentry *bpfdump_dentry;
>> +
>> +static struct dentry *bpfdump_add_dir(const char *name, struct dentry *parent,
>> +                                     const struct inode_operations *i_ops,
>> +                                     void *data);
>> +static int __bpfdump_init(void);
>> +
>> +static int dumper_unlink(struct inode *dir, struct dentry *dentry)
>> +{
>> +       kfree(d_inode(dentry)->i_private);
>> +       return simple_unlink(dir, dentry);
>> +}
>> +
>> +static const struct inode_operations bpf_dir_iops = {
>> +       .lookup         = simple_lookup,
>> +       .unlink         = dumper_unlink,
>> +};
>> +
>> +int bpf_dump_reg_target(const char *target,
>> +                       const char *target_proto,
>> +                       const struct seq_operations *seq_ops,
>> +                       u32 seq_priv_size, u32 target_feature)
>> +{
>> +       struct bpfdump_target_info *tinfo, *ptinfo;
>> +       struct dentry *dentry, *parent;
>> +       const char *lastslash;
>> +       bool existed = false;
>> +       int err, parent_len;
>> +
>> +       if (!bpfdump_dentry) {
>> +               err = __bpfdump_init();
> 
> This will be called (again) if bpfdump_init() fails? Not sure why? In
> rare cases, some dumper will fail to initialize, but then some might
> succeed, which is going to be even more confusing, no?

I can add a static variable recording that bpfdump_init() has already
been attempted, to avoid any second try in such a situation.

> 
>> +               if (err)
>> +                       return err;
>> +       }
>> +
>> +       tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
>> +       if (!tinfo)
>> +               return -ENOMEM;
>> +
>> +       tinfo->target = target;
>> +       tinfo->target_proto = target_proto;
>> +       tinfo->seq_ops = seq_ops;
>> +       tinfo->seq_priv_size = seq_priv_size;
>> +       tinfo->target_feature = target_feature;
>> +       INIT_LIST_HEAD(&tinfo->list);
>> +
>> +       lastslash = strrchr(target, '/');
>> +       if (!lastslash) {
>> +               parent = bpfdump_dentry;
> 
> Two nits here. First, it supports only one and two levels. But it
> seems like it wouldn't be hard to support multiple? Instead of
> reverse-searching for /, you can forward search and keep track of
> "current parent".
> 
> nit2:
> 
> parent = bpfdump_dentry;
> if (lastslash) {
> 
>      parent = ptinfo->dir_dentry;
> }
> 
> seems a bit cleaner (and generalizes to multi-level a bit better).
> 
>> +       } else {
>> +               parent_len = (unsigned long)lastslash - (unsigned long)target;
>> +
>> +               mutex_lock(&dump_targets.dumper_mutex);
>> +               list_for_each_entry(ptinfo, &dump_targets.dumpers, list) {
>> +                       if (strlen(ptinfo->target) == parent_len &&
>> +                           strncmp(ptinfo->target, target, parent_len) == 0) {
>> +                               existed = true;
>> +                               break;
>> +                       }
>> +               }
>> +               mutex_unlock(&dump_targets.dumper_mutex);
>> +               if (existed == false) {
>> +                       err = -ENOENT;
>> +                       goto free_tinfo;
>> +               }
>> +
>> +               parent = ptinfo->dir_dentry;
>> +               target = lastslash + 1;
>> +       }
>> +       dentry = bpfdump_add_dir(target, parent, &bpf_dir_iops, tinfo);
>> +       if (IS_ERR(dentry)) {
>> +               err = PTR_ERR(dentry);
>> +               goto free_tinfo;
>> +       }
>> +
>> +       tinfo->dir_dentry = dentry;
>> +
>> +       mutex_lock(&dump_targets.dumper_mutex);
>> +       list_add(&tinfo->list, &dump_targets.dumpers);
>> +       mutex_unlock(&dump_targets.dumper_mutex);
>> +       return 0;
>> +
>> +free_tinfo:
>> +       kfree(tinfo);
>> +       return err;
>> +}
>> +
> 
> [...]
> 
>> +       if (S_ISDIR(mode)) {
>> +               inode->i_op = i_ops;
>> +               inode->i_fop = f_ops;
>> +               inc_nlink(inode);
>> +               inc_nlink(dir);
>> +       } else {
>> +               inode->i_fop = f_ops;
>> +       }
>> +
>> +       d_instantiate(dentry, inode);
>> +       dget(dentry);
> 
> lookup_one_len already bumped refcount, why the second time here?

good question. this is what security/inode.c is doing and it seems to
work. I do not really know the science behind this; will check more.

> 
>> +       inode_unlock(dir);
>> +       return dentry;
>> +
>> +dentry_put:
>> +       dput(dentry);
>> +       dentry = ERR_PTR(err);
>> +unlock:
>> +       inode_unlock(dir);
>> +       return dentry;
>> +}
>> +
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-10 22:25   ` Andrii Nakryiko
@ 2020-04-10 23:25     ` Yonghong Song
  0 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 3:25 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Here, the target refers to a particular data structure
>> inside the kernel we want to dump. For example, it
>> can be all task_structs in the current pid namespace,
>> or it could be all open files for all task_structs
>> in the current pid namespace.
>>
>> Each target is identified with the following information:
>>     target_rel_path   <=== relative path to /sys/kernel/bpfdump
>>     target_proto      <=== kernel func proto which represents
>>                            bpf program signature for this target
>>     seq_ops           <=== seq_ops for seq_file operations
>>     seq_priv_size     <=== seq_file private data size
>>     target_feature    <=== target specific feature which needs
>>                            handling outside seq_ops.
>>
>> The target relative path is a relative directory to /sys/kernel/bpfdump/.
>> For example, it could be:
>>     task                  <=== all tasks
>>     task/file             <=== all open files under all tasks
>>     ipv6_route            <=== all ipv6_routes
>>     tcp6/sk_local_storage <=== all tcp6 socket local storages
>>     foo/bar/tar           <=== all tar's in bar in foo
>>
>> The "target_feature" is mostly used for reusing existing seq_ops.
>> For example, for /proc/net/<> stats, the "net" namespace is often
>> stored in file private data. The target_feature enables bpf based
>> dumper to set "net" properly for itself before calling shared
>> seq_ops.
>>
>> bpf_dump_reg_target() is implemented so targets
>> can register themselves. Currently, module is not
>> supported, so there is no bpf_dump_unreg_target().
>> The main reason is that BTF is not available for modules
>> yet.
>>
>> Since target might call bpf_dump_reg_target() before
>> bpfdump mount point is created, __bpfdump_init()
>> may be called in bpf_dump_reg_target() as well.
>>
>> The file-based dumpers will be regular files under
>> the specific target directory. For example,
>>     task/my1      <=== dumper "my1" iterates through all tasks
>>     task/file/my2 <=== dumper "my2" iterates through all open files
>>                        under all tasks
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h |   4 +
>>   kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 193 insertions(+), 1 deletion(-)
>>
> 
> [...]
> 
>> +
>> +static int dumper_unlink(struct inode *dir, struct dentry *dentry)
>> +{
>> +       kfree(d_inode(dentry)->i_private);
>> +       return simple_unlink(dir, dentry);
>> +}
>> +
>> +static const struct inode_operations bpf_dir_iops = {
> 
> noticed this reading next patch. It should probably be called
> bpfdump_dir_iops to avoid confusion with bpf_dir_iops of BPF FS in
> kernel/bpf/inode.c?

makes sense. it was originally copied from inode.c and the name was
not changed.

> 
>> +       .lookup         = simple_lookup,
>> +       .unlink         = dumper_unlink,
>> +};
>> +
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
  2020-04-10  3:00   ` Alexei Starovoitov
  2020-04-10 22:51   ` Andrii Nakryiko
@ 2020-04-10 23:25   ` Andrii Nakryiko
  2020-04-11  0:23     ` Yonghong Song
  2020-04-14  5:56   ` Andrii Nakryiko
  3 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 23:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Given a loaded dumper bpf program, which already
> knows which target it should bind to, there
> two ways to create a dumper:
>   - a file based dumper under hierarchy of
>     /sys/kernel/bpfdump/ which uses can
>     "cat" to print out the output.
>   - an anonymous dumper which user application
>     can "read" the dumping output.
>
> For file based dumper, BPF_OBJ_PIN syscall interface
> is used. For anonymous dumper, BPF_PROG_ATTACH
> syscall interface is used.
>
> To facilitate target seq_ops->show() to get the
> bpf program easily, dumper creation increases
> the target-provided seq_file private data size
> so the bpf program pointer is also stored in the
> seq_file private data.
>
> Further, a seq_num which represents how many
> times bpf_dump_get_prog() has been called is
> also available to the target seq_ops->show().
> Such information can be used to e.g., print
> banner before printing out actual data.

So I looked up seq_operations struct and did a very cursory read of
fs/seq_file.c and seq_file documentation, so I might be completely off
here.

start() is called before iteration begins, stop() is called after
iteration ends. Would it be a bit better and user-friendly interface
to have to extra calls to BPF program, say with NULL input element,
but with extra enum/flag that specifies that this is a START or END of
iteration, in addition to seq_num?

Also, right now it's impossible to write stateful dumpers that do any
kind of stats calculation, because it's impossible to determine when
iteration restarted (it starts from the very beginning, not from the
last element). It's impossible to just remember the last processed
seq_num, because the BPF program might be called for a new "session" in
parallel with the old one.

So it seems like few things would be useful:

1. end flag for post-aggregation and/or footer printing (seq_num == 0
provides a similar means for a start flag).
2. Some sort of "session id", so that bpfdumper can maintain
per-session intermediate state. Plus with this it would be possible to
detect restarts (if there is some state for the same session and
seq_num == 0, this is restart).

It seems like it might be a bit more flexible to, instead of providing
seq_file * pointer directly, actually provide a bpfdumper_context
struct, which would have seq_file * as one of its fields, the others
being session_id and start/stop flags.

A bit unstructured thoughts, but what do you think?

>
> Note the seq_num does not represent the number
> of unique kernel objects the bpf program has
> seen, but it should be a good approximation.
>
> A target feature BPF_DUMP_SEQ_NET_PRIVATE
> is implemented; it is specifically useful for
> net based dumpers. It sets the net namespace
> to the current process net namespace.
> This avoids changing existing net seq_ops
> in order to retrieve the net namespace from
> the seq_file pointer.
>
> For open dumper files, anonymous or not, the
> fdinfo will show the target and prog_id associated
> with that file descriptor. For dumper file itself,
> a kernel interface will be provided to retrieve the
> prog_id in one of the later patches.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |   5 +
>  include/uapi/linux/bpf.h       |   6 +-
>  kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c           |  11 +-
>  tools/include/uapi/linux/bpf.h |   6 +-
>  5 files changed, 362 insertions(+), 4 deletions(-)
>

[...]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program
  2020-04-10 22:36   ` Andrii Nakryiko
@ 2020-04-10 23:28     ` Yonghong Song
  2020-04-13 19:33       ` Andrii Nakryiko
  0 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 23:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 3:36 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:25 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> A dumper bpf program is a tracing program with attach type
>> BPF_TRACE_DUMP. During bpf program load, the load attribute
>>     attach_prog_fd
>> carries the target directory fd. The program will be
>> verified against btf_id of the target_proto.
>>
>> If the program is loaded successfully, the dump target, as
>> represented as a relative path to /sys/kernel/bpfdump,
>> will be remembered in prog->aux->dump_target, which will
>> be used later to create dumpers.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h            |  2 ++
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/dump.c              | 40 ++++++++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c           |  8 ++++++-
>>   kernel/bpf/verifier.c          | 15 +++++++++++++
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   6 files changed, 66 insertions(+), 1 deletion(-)
>>
> 
> [...]
> 
>>
>> +int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog)
>> +{
>> +       struct bpfdump_target_info *tinfo;
>> +       const char *target_proto;
>> +       struct file *target_file;
>> +       struct fd tfd;
>> +       int err = 0, btf_id;
>> +
>> +       if (!btf_vmlinux)
>> +               return -EINVAL;
>> +
>> +       tfd = fdget(target_fd);
>> +       target_file = tfd.file;
>> +       if (!target_file)
>> +               return -EBADF;
> 
> fdput is missing (or rather err = -BADF; goto done; ?)

No need to do fdput if tfd.file is NULL.

> 
> 
>> +
>> +       if (target_file->f_inode->i_op != &bpf_dir_iops) {
>> +               err = -EINVAL;
>> +               goto done;
>> +       }
>> +
>> +       tinfo = target_file->f_inode->i_private;
>> +       target_proto = tinfo->target_proto;
>> +       btf_id = btf_find_by_name_kind(btf_vmlinux, target_proto,
>> +                                      BTF_KIND_FUNC);
>> +
>> +       if (btf_id > 0) {
>> +               prog->aux->dump_target = tinfo->target;
>> +               prog->aux->attach_btf_id = btf_id;
>> +       }
>> +
>> +       err = min(btf_id, 0);
> 
> this min trick looks too clever... why not more straightforward and composable:
> 
> if (btf_id < 0) {
>      err = btf_id;
>      goto done;
> }
> 
> prog->aux->dump_target = tinfo->target;
> prog->aux->attach_btf_id = btf_id;
> 
> ?

this can be done.

> 
>> +done:
>> +       fdput(tfd);
>> +       return err;
>> +}
>> +
>>   int bpf_dump_reg_target(const char *target,
>>                          const char *target_proto,
>>                          const struct seq_operations *seq_ops,
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 64783da34202..41005dee8957 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2060,7 +2060,12 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>>
>>          prog->expected_attach_type = attr->expected_attach_type;
>>          prog->aux->attach_btf_id = attr->attach_btf_id;
>> -       if (attr->attach_prog_fd) {
>> +       if (type == BPF_PROG_TYPE_TRACING &&
>> +           attr->expected_attach_type == BPF_TRACE_DUMP) {
>> +               err = bpf_dump_set_target_info(attr->attach_prog_fd, prog);
> 
> looking at bpf_attr, it's not clear why attach_prog_fd and
> prog_ifindex were not combined into a single union field... this
> probably got missed? But in this case I'd say let's create a
> 
> union {
>      __u32 attach_prog_fd;
>      __u32 attach_target_fd; (similar to terminology for BPF_PROG_ATTACH)
> };
> 
> instead of reusing not-exactly-matching field names?

I thought about this, but wanted to avoid a uapi change (although it 
would be compatible). Maybe we should. Let me think about it.

> 
>> +               if (err)
>> +                       goto free_prog_nouncharge;
>> +       } else if (attr->attach_prog_fd) {
>>                  struct bpf_prog *tgt_prog;
>>
>>                  tgt_prog = bpf_prog_get(attr->attach_prog_fd);
>> @@ -2145,6 +2150,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>>          err = bpf_prog_new_fd(prog);
>>          if (err < 0)
>>                  bpf_prog_put(prog);
>> +
>>          return err;
>>
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10 22:51   ` Andrii Nakryiko
@ 2020-04-10 23:41     ` Yonghong Song
  2020-04-13 19:45       ` Andrii Nakryiko
  0 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 23:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 3:51 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Given a loaded dumper bpf program, which already
>> knows which target it should bind to, there
>> are two ways to create a dumper:
>>    - a file based dumper under the hierarchy of
>>      /sys/kernel/bpfdump/ which users can
>>      "cat" to print out the output.
>>    - an anonymous dumper from which a user
>>      application can "read" the dumping output.
>>
>> For file based dumper, BPF_OBJ_PIN syscall interface
>> is used. For anonymous dumper, BPF_PROG_ATTACH
>> syscall interface is used.
>>
>> To facilitate target seq_ops->show() to get the
>> bpf program easily, dumper creation increased
>> the target-provided seq_file private data size
>> so bpf program pointer is also stored in seq_file
>> private data.
>>
>> Further, a seq_num which represents how many
>> bpf_dump_get_prog() has been called is also
>> available to the target seq_ops->show().
>> Such information can be used to e.g., print
>> banner before printing out actual data.
>>
>> Note the seq_num does not represent the number
>> of unique kernel objects the bpf program has
>> seen, but it should be a good approximation.
>>
>> A target feature BPF_DUMP_SEQ_NET_PRIVATE
>> is implemented specifically useful for
>> net based dumpers. It sets net namespace
>> as the current process net namespace.
>> This avoids changing existing net seq_ops
>> in order to retrieve net namespace from
>> the seq_file pointer.
>>
>> For open dumper files, anonymous or not, the
>> fdinfo will show the target and prog_id associated
>> with that file descriptor. For dumper file itself,
>> a kernel interface will be provided to retrieve the
>> prog_id in one of the later patches.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h            |   5 +
>>   include/uapi/linux/bpf.h       |   6 +-
>>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
>>   kernel/bpf/syscall.c           |  11 +-
>>   tools/include/uapi/linux/bpf.h |   6 +-
>>   5 files changed, 362 insertions(+), 4 deletions(-)
>>
> 
> [...]
> 
>>
>> +struct dumper_inode_info {
>> +       struct bpfdump_target_info *tinfo;
>> +       struct bpf_prog *prog;
>> +};
>> +
>> +struct dumper_info {
>> +       struct list_head list;
>> +       /* file to identify an anon dumper,
>> +        * dentry to identify a file dumper.
>> +        */
>> +       union {
>> +               struct file *file;
>> +               struct dentry *dentry;
>> +       };
>> +       struct bpfdump_target_info *tinfo;
>> +       struct bpf_prog *prog;
>> +};
> 
> This is essentially a bpf_link. Why not do it as a bpf_link from the
> get go? Instead of having all this duplication for anonymous and

This is a good question. Maybe part of bpf_link can be reused and
I will have to implement the rest. I will check.

> pinned dumpers, it would always be a bpf_link-based dumper, but for
> those pinned bpf_link itself is going to be pinned. You also get a
> benefit of being able to list all dumpers through existing bpf_link
> API (also see my RFC patches with bpf_link_prime/bpf_link_settle,
> which makes using bpf_link safe and simple).

Agree. An alternative is to use BPF_OBJ_GET_INFO_BY_FD to query an
individual dumper, as the directory tree walk can easily be done in
user space.


> 
> [...]
> 
>> +
>> +static void anon_dumper_show_fdinfo(struct seq_file *m, struct file *filp)
>> +{
>> +       struct dumper_info *dinfo;
>> +
>> +       mutex_lock(&anon_dumpers.dumper_mutex);
>> +       list_for_each_entry(dinfo, &anon_dumpers.dumpers, list) {
> 
> this (and few other places where you search in a loop) would also be
> simplified, because struct file* would point to bpf_dumper_link, which
> then would have a pointer to bpf_prog, dentry (if pinned), etc. No
> searching at all.

There is a reason for this. As with bpf_link, bpfdump already has
the full information about the file, inode, etc.
But the file's private_data actually points to the seq_file, and the
seq_file private data is used by the target. That is exactly why we
maintain this mapping to keep track of the association; bpf_link won't
help here.

> 
>> +               if (dinfo->file == filp) {
>> +                       seq_printf(m, "target:\t%s\n"
>> +                                     "prog_id:\t%u\n",
>> +                                  dinfo->tinfo->target,
>> +                                  dinfo->prog->aux->id);
>> +                       break;
>> +               }
>> +       }
>> +       mutex_unlock(&anon_dumpers.dumper_mutex);
>> +}
>> +
>> +#endif
>> +
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10 22:53       ` Andrii Nakryiko
@ 2020-04-10 23:47         ` Yonghong Song
  2020-04-11 23:11           ` Alexei Starovoitov
  0 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 23:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 3:53 PM, Andrii Nakryiko wrote:
> On Fri, Apr 10, 2020 at 3:43 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 4/9/20 8:00 PM, Alexei Starovoitov wrote:
>>> On Wed, Apr 08, 2020 at 04:25:26PM -0700, Yonghong Song wrote:
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 0f1cbed446c1..b51d56fc77f9 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -354,6 +354,7 @@ enum {
>>>>    /* Flags for accessing BPF object from syscall side. */
>>>>       BPF_F_RDONLY            = (1U << 3),
>>>>       BPF_F_WRONLY            = (1U << 4),
>>>> +    BPF_F_DUMP              = (1U << 5),
>>> ...
>>>>    static int bpf_obj_pin(const union bpf_attr *attr)
>>>>    {
>>>> -    if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
>>>> +    if (CHECK_ATTR(BPF_OBJ) || attr->file_flags & ~BPF_F_DUMP)
>>>>               return -EINVAL;
>>>>
>>>> +    if (attr->file_flags == BPF_F_DUMP)
>>>> +            return bpf_dump_create(attr->bpf_fd,
>>>> +                                   u64_to_user_ptr(attr->dumper_name));
>>>> +
>>>>       return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
>>>>    }
>>>
>>> I think kernel can be a bit smarter here. There is no need for user space
>>> to pass BPF_F_DUMP flag to kernel just to differentiate the pinning.
>>> Can prog attach type be used instead?
>>
>> Thinking about it again, I think a flag is still useful.
>> Suppose that we have the following scenario:
>>     - the current directory /sys/fs/bpf/
>>     - user says pin a tracing/dump (target task) prog to "p1"
>>
>> It is not really clear whether user wants to pin to
>>      /sys/fs/bpf/p1
>> or user wants to pin to
>>      /sys/kernel/bpfdump/task/p1
>>
>> unless we say that a tracing/dump program cannot pin
>> to /sys/fs/bpf which seems unnecessary restriction.
>>
>> What do you think?
> 
> Instead of special-casing dumper_name, can we require specifying full
> path, and then check whether it is in BPF FS vs BPFDUMP FS? If the
> latter, additionally check that it is in the right sub-directory
> matching its intended target type.

We could. I just think specifying the full path for bpfdump is not 
necessary since it is a single-user mount...

> 
> But honestly, just doing everything within BPF FS starts to seem
> cleaner at this point...

bpffs allows multiple mounts, which is not a perfect fit for bpfdump;
considering mounting inside a namespace, etc., all dumpers would be gone.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets
  2020-04-10 23:13   ` Andrii Nakryiko
@ 2020-04-10 23:52     ` Yonghong Song
  0 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-10 23:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 4:13 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:25 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> This patch added netlink and ipv6_route targets, using
>> the same seq_ops (except show()) for /proc/net/{netlink,ipv6_route}.
>>
>> Since module is not supported for now, ipv6_route is
>> supported only if the IPV6 is built-in, i.e., not compiled
>> as a module. The restriction can be lifted once module
>> is properly supported for bpfdump.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h      |  1 +
>>   kernel/bpf/dump.c        | 13 ++++++++++
>>   net/ipv6/ip6_fib.c       | 41 +++++++++++++++++++++++++++++-
>>   net/ipv6/route.c         | 22 ++++++++++++++++
>>   net/netlink/af_netlink.c | 54 +++++++++++++++++++++++++++++++++++++++-
>>   5 files changed, 129 insertions(+), 2 deletions(-)
>>
> 
> [...]
> 
>>
>> +#if IS_BUILTIN(CONFIG_IPV6)
>> +static int ipv6_route_prog_seq_show(struct bpf_prog *prog, struct seq_file *seq,
>> +                                   u64 seq_num, void *v)
>> +{
>> +       struct ipv6_route_iter *iter = seq->private;
>> +       struct {
>> +               struct fib6_info *rt;
>> +               struct seq_file *seq;
>> +               u64 seq_num;
>> +       } ctx = {
> 
> So this anonymous struct definition has to match bpfdump__ipv6_route
> function prototype, if I understand correctly. So this means that BTF
> will have a very useful struct, that can be used directly in BPF
> program, but it won't have a canonical name. This is very sad... Would
> it be possible to instead use a struct as a prototype for these
> dumpers? Here's why it matters. Instead of currently requiring BPF
> users to declare their dumpers as (just copy-pasted):
> 
> int BPF_PROG(some_name, struct fib6_info *rt, struct seq_file *seq,
> u64 seq_num) {
>     ...
> }
> 
> if bpfdump__ipv6_route was actually a struct definition:
> 
> 
> struct bpfdump__ipv6_route {
>      struct fib6_info *rt;
>      struct seq_file *seq;
>      u64 seq_num;
> };
> 
> Then with vmlinux.h, such program would be very nicely declared and used as:
> 
> int some_name(struct bpfdump__ipv6_route *ctx) {
>    /* here use ctx->rt, ctx->seq, ctx->seqnum */
> }

Thanks, I did not know this!
This is definitely better and may make the kernel code simpler.
Will experiment.

> 
> This is would would be nice to have for raw_tp and tp_btf as well.
> 
> 
> Of course we can also code-generate such types from func_protos in
> bpftool, and that's a plan B for this, IMO. But seem like in this case
> you already have two keep two separate entities in sync: func proto
> and struct for context, so I thought I'd bring it up.
> 
>> +               .rt = v,
>> +               .seq = seq,
>> +               .seq_num = seq_num,
>> +       };
>> +       int ret;
>> +
>> +       ret = bpf_dump_run_prog(prog, &ctx);
>> +       iter->w.leaf = NULL;
>> +       return ret == 0 ? 0 : -EINVAL;
>> +}
>> +

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10 23:25   ` Andrii Nakryiko
@ 2020-04-11  0:23     ` Yonghong Song
  2020-04-11 23:17       ` Alexei Starovoitov
  2020-04-13 19:59       ` Andrii Nakryiko
  0 siblings, 2 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-11  0:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 4:25 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Given a loaded dumper bpf program, which already
>> knows which target it should bind to, there
>> are two ways to create a dumper:
>>    - a file based dumper under the hierarchy of
>>      /sys/kernel/bpfdump/ which users can
>>      "cat" to print out the output.
>>    - an anonymous dumper from which a user
>>      application can "read" the dumping output.
>>
>> For file based dumper, BPF_OBJ_PIN syscall interface
>> is used. For anonymous dumper, BPF_PROG_ATTACH
>> syscall interface is used.
>>
>> To facilitate target seq_ops->show() to get the
>> bpf program easily, dumper creation increased
>> the target-provided seq_file private data size
>> so bpf program pointer is also stored in seq_file
>> private data.
>>
>> Further, a seq_num which represents how many
>> bpf_dump_get_prog() has been called is also
>> available to the target seq_ops->show().
>> Such information can be used to e.g., print
>> banner before printing out actual data.
> 
> So I looked up seq_operations struct and did a very cursory read of
> fs/seq_file.c and seq_file documentation, so I might be completely off
> here.
> 
> start() is called before iteration begins, stop() is called after
> iteration ends. Would it be a bit better and user-friendly interface
> to have to extra calls to BPF program, say with NULL input element,
> but with extra enum/flag that specifies that this is a START or END of
> iteration, in addition to seq_num?

The current design always passes a valid object (task, file, netlink_sock,
fib6_info). That is, accessing fields of those data structures won't
cause runtime exceptions.

Therefore, with the existing seq_ops implementation for ipv6_route
and netlink, etc, we don't have END information. We can get START
information though.

> 
> Also, right now it's impossible to write stateful dumpers that do any
> kind of stats calculation, because it's impossible to determine when
> iteration restarted (it starts from the very beginning, not from the
> last element). It's impossible to just rememebr last processed
> seq_num, because BPF program might be called for a new "session" in
> parallel with the old one.

Theoretically, the session end can be detected by checking the return
value of the last bpf_seq_printf() or bpf_seq_write(). If it indicates
an overflow, that means the session ended.

Alternatively, the bpfdump infrastructure could do this work and provide
a session id.

> 
> So it seems like few things would be useful:
> 
> 1. end flag for post-aggregation and/or footer printing (seq_num == 0
> is providing similar means for start flag).

The end flag is a problem. We could, say, hijack next() or stop() so we
can detect the end, but passing a NULL pointer as the object
to the bpf program may be problematic without verifier enforcement,
as it may cause a lot of exceptions... Although all these exceptions
will be silenced by the bpf infra, I am still not sure whether this
is acceptable.

> 2. Some sort of "session id", so that bpfdumper can maintain
> per-session intermediate state. Plus with this it would be possible to
> detect restarts (if there is some state for the same session and
> seq_num == 0, this is restart).

I guess we can do this.

> 
> It seems like it might be a bit more flexible to, instead of providing
> seq_file * pointer directly, actually provide a bpfdumper_context
> struct, which would have seq_file * as one of fields, other being
> session_id and start/stop flags.

As you mentioned, if we have more fields related to the seq_file passed
to the bpf program, then yes, grouping them into a structure makes sense.

> 
> A bit unstructured thoughts, but what do you think?
> 
>>
>> Note the seq_num does not represent the number
>> of unique kernel objects the bpf program has
>> seen, but it should be a good approximation.
>>
>> A target feature BPF_DUMP_SEQ_NET_PRIVATE
>> is implemented specifically useful for
>> net based dumpers. It sets net namespace
>> as the current process net namespace.
>> This avoids changing existing net seq_ops
>> in order to retrieve net namespace from
>> the seq_file pointer.
>>
>> For open dumper files, anonymous or not, the
>> fdinfo will show the target and prog_id associated
>> with that file descriptor. For dumper file itself,
>> a kernel interface will be provided to retrieve the
>> prog_id in one of the later patches.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h            |   5 +
>>   include/uapi/linux/bpf.h       |   6 +-
>>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
>>   kernel/bpf/syscall.c           |  11 +-
>>   tools/include/uapi/linux/bpf.h |   6 +-
>>   5 files changed, 362 insertions(+), 4 deletions(-)
>>
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10 23:47         ` Yonghong Song
@ 2020-04-11 23:11           ` Alexei Starovoitov
  2020-04-12  6:51             ` Yonghong Song
  2020-04-13 20:48             ` Andrii Nakryiko
  0 siblings, 2 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-11 23:11 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 04:47:36PM -0700, Yonghong Song wrote:
> > 
> > Instead of special-casing dumper_name, can we require specifying full
> > path, and then check whether it is in BPF FS vs BPFDUMP FS? If the
> > latter, additionally check that it is in the right sub-directory
> > matching its intended target type.
> 
> We could. I just think specifying full path for bpfdump is not necessary
> since it is a single user mount...
> 
> > 
> > But honestly, just doing everything within BPF FS starts to seem
> > cleaner at this point...
> 
> bpffs is multi mount, which is not a perfect fit for bpfdump,
> considering mounting inside namespace, etc, all dumpers are gone.

As Yonghong pointed out reusing bpffs for dumpers doesn't look possible
from implementation perspective.
Even if it was possible the files in such mix-and-match file system
would be of different kinds with different semantics. I think that
will lead to mediocre user experience when file 'foo' is cat-able
with nice human output, but file 'bar' isn't cat-able at all because
it's just a pinned map. imo having all dumpers in one fixed location
in /sys/kernel/bpfdump makes it easy to discover for folks who might
not even know what bpf is.
For example when I'm trying to learn some new area of the kernel I might go
poke around /proc and /sys directory looking for a file name that could be
interesting to 'cat'. This is how I discovered /sys/kernel/slab/ :)
I think keeping all dumpers in /sys/kernel/bpfdump/ will make them
similarly discoverable.

re: f_dump flag...
Maybe it's a sign that pinning is not the right name for such an operation?
If kernel cannot distinguish pinning dumper prog into bpffs as a vanilla
pinning operation vs pinning into bpfdumpfs to make it cat-able then something
isn't right about api. Either it needs to be a new bpf syscall command (like
install_dumper_in_dumpfs) or reuse pinning command, but make libbpf specify the
full path. From bpf prog point of view it may still specify only the final
name, but libbpf can prepend the /sys/kernel/bpfdump/.../. Maybe there is a
third option. An extra flag for pinning just doesn't look right. What if we do
another specialized file system later? Would it need yet another flag to pin
there?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-11  0:23     ` Yonghong Song
@ 2020-04-11 23:17       ` Alexei Starovoitov
  2020-04-13 21:04         ` Andrii Nakryiko
  2020-04-13 19:59       ` Andrii Nakryiko
  1 sibling, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-11 23:17 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 05:23:30PM -0700, Yonghong Song wrote:
> > 
> > So it seems like few things would be useful:
> > 
> > 1. end flag for post-aggregation and/or footer printing (seq_num == 0
> > is providing similar means for start flag).
> 
> the end flag is a problem. We could say hijack next or stop so we
> can detect the end, but passing a NULL pointer as the object
> to the bpf program may be problematic without verifier enforcement
> as it may cause a lot of exceptions... Although all these exception
> will be silenced by bpf infra, but still not sure whether this
> is acceptable or not.

I don't like passing NULL there just to indicate something to a program.
It's not too horrible to support from the verifier side, but NULL is only
one such flag. What is it supposed to indicate? That the dumper prog
is just starting? Or ending? Let's pass (void *)1 and (void *)2?
I'm not a fan of such in-band signaling.
imo it's cleaner and simpler when that object pointer is always valid.

> > 2. Some sort of "session id", so that bpfdumper can maintain
> > per-session intermediate state. Plus with this it would be possible to
> > detect restarts (if there is some state for the same session and
> > seq_num == 0, this is restart).
> 
> I guess we can do this.

Beyond seq_num, passing a session_id is a good idea. Though I don't quite see
the use case where you'd need a bpfdumper prog to be stateful, it doesn't hurt.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-11 23:11           ` Alexei Starovoitov
@ 2020-04-12  6:51             ` Yonghong Song
  2020-04-13 20:48             ` Andrii Nakryiko
  1 sibling, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-12  6:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/11/20 4:11 PM, Alexei Starovoitov wrote:
> On Fri, Apr 10, 2020 at 04:47:36PM -0700, Yonghong Song wrote:
>>>
>>> Instead of special-casing dumper_name, can we require specifying full
>>> path, and then check whether it is in BPF FS vs BPFDUMP FS? If the
>>> latter, additionally check that it is in the right sub-directory
>>> matching its intended target type.
>>
>> We could. I just think specifying full path for bpfdump is not necessary
>> since it is a single user mount...
>>
>>>
>>> But honestly, just doing everything within BPF FS starts to seem
>>> cleaner at this point...
>>
>> bpffs is multi mount, which is not a perfect fit for bpfdump,
>> considering mounting inside namespace, etc, all dumpers are gone.
> 
> As Yonghong pointed out reusing bpffs for dumpers doesn't look possible
> from implementation perspective.
> Even if it was possible the files in such mix-and-match file system
> would be of different kinds with different semantics. I think that
> will lead to mediocre user experience when file 'foo' is cat-able
> with nice human output, but file 'bar' isn't cat-able at all because
> it's just a pinned map. imo having all dumpers in one fixed location
> in /sys/kernel/bpfdump makes it easy to discover for folks who might
> not even know what bpf is.
> For example when I'm trying to learn some new area of the kernel I might go
> poke around /proc and /sys directory looking for a file name that could be
> interesting to 'cat'. This is how I discovered /sys/kernel/slab/ :)
> I think keeping all dumpers in /sys/kernel/bpfdump/ will make them
> similarly discoverable.
> 
> re: f_dump flag...
> May be it's a sign that pinning is not the right name for such operation?
> If kernel cannot distinguish pinning dumper prog into bpffs as a vanilla
> pinning operation vs pinning into bpfdumpfs to make it cat-able then something
> isn't right about api. Either it needs to be a new bpf syscall command (like
> install_dumper_in_dumpfs) or reuse pinning command, but make libbpf specify the
> full path. From bpf prog point of view it may still specify only the final
> name, but libbpf can prepend the /sys/kernel/bpfdump/.../. May be there is a
> third option. Extra flag for pinning just doesn't look right. What if we do
> another specialized file system later? It would need yet another flag to pin
> there?

For the 2nd option,
    - the user still just specifies the dumper name, and
    - bpftool will prepend /sys/kernel/bpfdump/...
this should work. In this case, the kernel API
to create a bpf dumper will be
    BPF_OBJ_PIN with a file path
This is fine except for the following annoyance.
Suppose somehow:
    - bpfdump is mounted at /sys/kernel/bpfdump and somewhere else, say
      /root/tmp/bpfdump/
      [
        I checked do_mount in namespace.c, and did not find a flag
        to prevent multiple mounting; maybe I missed something. I will be
        glad if somebody knows and lets me know.
      ]
    - the user calls BPF_OBJ_PIN with path /root/tmp/bpfdump/task/my_task.
    - But the file will actually also appear at
      /sys/kernel/bpfdump/task/my_task.
There is a little confusion here based on the kernel API.
That is exactly why I supplied only the filename. Conceptually, it
is then clear that the dumper will appear in all mount points.

Maybe a new bpf subcommand is warranted, say BPF_DUMPER_INSTALL?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-10 23:24     ` Yonghong Song
@ 2020-04-13 19:31       ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 19:31 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 4:24 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/10/20 3:18 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> Here, the target refers to a particular data structure
> >> inside the kernel we want to dump. For example, it
> >> can be all task_structs in the current pid namespace,
> >> or it could be all open files for all task_structs
> >> in the current pid namespace.
> >>
> >> Each target is identified with the following information:
> >>     target_rel_path   <=== relative path to /sys/kernel/bpfdump
> >>     target_proto      <=== kernel func proto which represents
> >>                            bpf program signature for this target
> >>     seq_ops           <=== seq_ops for seq_file operations
> >>     seq_priv_size     <=== seq_file private data size
> >>     target_feature    <=== target specific feature which needs
> >>                            handling outside seq_ops.
> >
> > It's not clear what "feature" stands for here... Is this just a sort
> > of private_data passed through to dumper?
>
> This is described later. It is some target-specific information passed to the dumper.
>
> >
> >>
> >> The target relative path is a relative directory to /sys/kernel/bpfdump/.
> >> For example, it could be:
> >>     task                  <=== all tasks
> >>     task/file             <=== all open files under all tasks
> >>     ipv6_route            <=== all ipv6_routes
> >>     tcp6/sk_local_storage <=== all tcp6 socket local storages
> >>     foo/bar/tar           <=== all tar's in bar in foo
> >
> > ^^ this seems useful, but I don't think code as is supports more than 2 levels?
>
> The current implementation should support it.
> You need to
>   - first register 'foo'. target name 'foo'.
>   - then register 'foo/bar'. 'foo' will be the parent of 'bar'. target
> name 'foo/bar'.
>   - then 'foo/bar/tar'. 'foo/bar' will be the parent of 'tar'. target
> name 'foo/bar/tar'.

Ah, I see, right, that would work. Please disregard then.

>
> >
> >>
> >> The "target_feature" is mostly used for reusing existing seq_ops.
> >> For example, for /proc/net/<> stats, the "net" namespace is often
> >> stored in file private data. The target_feature enables bpf based
> >> dumper to set "net" properly for itself before calling shared
> >> seq_ops.
> >>
> >> bpf_dump_reg_target() is implemented so targets
> >> can register themselves. Currently, module is not
> >> supported, so there is no bpf_dump_unreg_target().
> >> The main reason is that BTF is not available for modules
> >> yet.
> >>
> >> Since target might call bpf_dump_reg_target() before
> >> bpfdump mount point is created, __bpfdump_init()
> >> may be called in bpf_dump_reg_target() as well.
> >>
> >> The file-based dumpers will be regular files under
> >> the specific target directory. For example,
> >>     task/my1      <=== dumper "my1" iterates through all tasks
> >>     task/file/my2 <=== dumper "my2" iterates through all open files
> >>                        under all tasks
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h |   4 +
> >>   kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
> >>   2 files changed, 193 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index fd2b2322412d..53914bec7590 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1109,6 +1109,10 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> >>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> >>   int bpf_obj_get_user(const char __user *pathname, int flags);
> >>
> >> +int bpf_dump_reg_target(const char *target, const char *target_proto,
> >> +                       const struct seq_operations *seq_ops,
> >> +                       u32 seq_priv_size, u32 target_feature);
> >> +
> >>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
> >>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
> >>   int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
> >> diff --git a/kernel/bpf/dump.c b/kernel/bpf/dump.c
> >> index e0c33486e0e7..45528846557f 100644
> >> --- a/kernel/bpf/dump.c
> >> +++ b/kernel/bpf/dump.c
> >> @@ -12,6 +12,173 @@
> >>   #include <linux/filter.h>
> >>   #include <linux/bpf.h>
> >>
> >> +struct bpfdump_target_info {
> >> +       struct list_head list;
> >> +       const char *target;
> >> +       const char *target_proto;
> >> +       struct dentry *dir_dentry;
> >> +       const struct seq_operations *seq_ops;
> >> +       u32 seq_priv_size;
> >> +       u32 target_feature;
> >> +};
> >> +
> >> +struct bpfdump_targets {
> >> +       struct list_head dumpers;
> >> +       struct mutex dumper_mutex;
> >
> > nit: would be a bit simpler if these were static variables with static
> > initialization, similar to how bpfdump_dentry is separate?
>
> Yes, we could do that. I am not 100% sure whether it would be simpler.
> The structure is there to glue them together.
>
> >
> >> +};
> >> +
> >> +/* registered dump targets */
> >> +static struct bpfdump_targets dump_targets;
> >> +
> >> +static struct dentry *bpfdump_dentry;
> >> +
> >> +static struct dentry *bpfdump_add_dir(const char *name, struct dentry *parent,
> >> +                                     const struct inode_operations *i_ops,
> >> +                                     void *data);
> >> +static int __bpfdump_init(void);
> >> +
> >> +static int dumper_unlink(struct inode *dir, struct dentry *dentry)
> >> +{
> >> +       kfree(d_inode(dentry)->i_private);
> >> +       return simple_unlink(dir, dentry);
> >> +}
> >> +
> >> +static const struct inode_operations bpf_dir_iops = {
> >> +       .lookup         = simple_lookup,
> >> +       .unlink         = dumper_unlink,
> >> +};
> >> +
> >> +int bpf_dump_reg_target(const char *target,
> >> +                       const char *target_proto,
> >> +                       const struct seq_operations *seq_ops,
> >> +                       u32 seq_priv_size, u32 target_feature)
> >> +{
> >> +       struct bpfdump_target_info *tinfo, *ptinfo;
> >> +       struct dentry *dentry, *parent;
> >> +       const char *lastslash;
> >> +       bool existed = false;
> >> +       int err, parent_len;
> >> +
> >> +       if (!bpfdump_dentry) {
> >> +               err = __bpfdump_init();
> >
> > This will be called (again) if bpfdump_init() fails? Not sure why? In
> > rare cases, some dumper will fail to initialize, but then some might
> > succeed, which is going to be even more confusing, no?
>
> I can have a static variable to record that bpfdump_init has already
> been attempted, so we avoid any second try.
>
> >
> >> +               if (err)
> >> +                       return err;
> >> +       }
> >> +
> >> +       tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
> >> +       if (!tinfo)
> >> +               return -ENOMEM;
> >> +
> >> +       tinfo->target = target;
> >> +       tinfo->target_proto = target_proto;
> >> +       tinfo->seq_ops = seq_ops;
> >> +       tinfo->seq_priv_size = seq_priv_size;
> >> +       tinfo->target_feature = target_feature;
> >> +       INIT_LIST_HEAD(&tinfo->list);
> >> +
> >> +       lastslash = strrchr(target, '/');
> >> +       if (!lastslash) {
> >> +               parent = bpfdump_dentry;
> >
> > Two nits here. First, it supports only one and two levels. But it
> > seems like it wouldn't be hard to support multiple? Instead of
> > reverse-searching for /, you can forward search and keep track of
> > "current parent".
> >
> > nit2:
> >
> > parent = bpfdump_dentry;
> > if (lastslash) {
> >
> >      parent = ptinfo->dir_dentry;
> > }
> >
> > seems a bit cleaner (and generalizes to multi-level a bit better).
> >
> >> +       } else {
> >> +               parent_len = (unsigned long)lastslash - (unsigned long)target;
> >> +
> >> +               mutex_lock(&dump_targets.dumper_mutex);
> >> +               list_for_each_entry(ptinfo, &dump_targets.dumpers, list) {
> >> +                       if (strlen(ptinfo->target) == parent_len &&
> >> +                           strncmp(ptinfo->target, target, parent_len) == 0) {
> >> +                               existed = true;
> >> +                               break;
> >> +                       }
> >> +               }
> >> +               mutex_unlock(&dump_targets.dumper_mutex);
> >> +               if (existed == false) {
> >> +                       err = -ENOENT;
> >> +                       goto free_tinfo;
> >> +               }
> >> +
> >> +               parent = ptinfo->dir_dentry;
> >> +               target = lastslash + 1;
> >> +       }
> >> +       dentry = bpfdump_add_dir(target, parent, &bpf_dir_iops, tinfo);
> >> +       if (IS_ERR(dentry)) {
> >> +               err = PTR_ERR(dentry);
> >> +               goto free_tinfo;
> >> +       }
> >> +
> >> +       tinfo->dir_dentry = dentry;
> >> +
> >> +       mutex_lock(&dump_targets.dumper_mutex);
> >> +       list_add(&tinfo->list, &dump_targets.dumpers);
> >> +       mutex_unlock(&dump_targets.dumper_mutex);
> >> +       return 0;
> >> +
> >> +free_tinfo:
> >> +       kfree(tinfo);
> >> +       return err;
> >> +}
> >> +
> >
> > [...]
> >
> >> +       if (S_ISDIR(mode)) {
> >> +               inode->i_op = i_ops;
> >> +               inode->i_fop = f_ops;
> >> +               inc_nlink(inode);
> >> +               inc_nlink(dir);
> >> +       } else {
> >> +               inode->i_fop = f_ops;
> >> +       }
> >> +
> >> +       d_instantiate(dentry, inode);
> >> +       dget(dentry);
> >
> > lookup_one_len already bumped refcount, why the second time here?
>
> good question. this is what security/inode.c is doing and it seems to
> work. I do not really know the science behind this. will check more.

sounds good

>
> >
> >> +       inode_unlock(dir);
> >> +       return dentry;
> >> +
> >> +dentry_put:
> >> +       dput(dentry);
> >> +       dentry = ERR_PTR(err);
> >> +unlock:
> >> +       inode_unlock(dir);
> >> +       return dentry;
> >> +}
> >> +
> >
> > [...]
> >

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program
  2020-04-10 23:28     ` Yonghong Song
@ 2020-04-13 19:33       ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 19:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 4:28 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/10/20 3:36 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:25 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> A dumper bpf program is a tracing program with attach type
> >> BPF_TRACE_DUMP. During bpf program load, the load attribute
> >>     attach_prog_fd
> >> carries the target directory fd. The program will be
> >> verified against btf_id of the target_proto.
> >>
> >> If the program is loaded successfully, the dump target, as
> >> represented as a relative path to /sys/kernel/bpfdump,
> >> will be remembered in prog->aux->dump_target, which will
> >> be used later to create dumpers.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h            |  2 ++
> >>   include/uapi/linux/bpf.h       |  1 +
> >>   kernel/bpf/dump.c              | 40 ++++++++++++++++++++++++++++++++++
> >>   kernel/bpf/syscall.c           |  8 ++++++-
> >>   kernel/bpf/verifier.c          | 15 +++++++++++++
> >>   tools/include/uapi/linux/bpf.h |  1 +
> >>   6 files changed, 66 insertions(+), 1 deletion(-)
> >>
> >
> > [...]
> >
> >>
> >> +int bpf_dump_set_target_info(u32 target_fd, struct bpf_prog *prog)
> >> +{
> >> +       struct bpfdump_target_info *tinfo;
> >> +       const char *target_proto;
> >> +       struct file *target_file;
> >> +       struct fd tfd;
> >> +       int err = 0, btf_id;
> >> +
> >> +       if (!btf_vmlinux)
> >> +               return -EINVAL;
> >> +
> >> +       tfd = fdget(target_fd);
> >> +       target_file = tfd.file;
> >> +       if (!target_file)
> >> +               return -EBADF;
> >
> > fdput is missing (or rather err = -BADF; goto done; ?)
>
> No need to do fdput if tfd.file is NULL.

ah, right :)

>
> >
> >
> >> +
> >> +       if (target_file->f_inode->i_op != &bpf_dir_iops) {
> >> +               err = -EINVAL;
> >> +               goto done;
> >> +       }
> >> +
> >> +       tinfo = target_file->f_inode->i_private;
> >> +       target_proto = tinfo->target_proto;
> >> +       btf_id = btf_find_by_name_kind(btf_vmlinux, target_proto,
> >> +                                      BTF_KIND_FUNC);
> >> +
> >> +       if (btf_id > 0) {
> >> +               prog->aux->dump_target = tinfo->target;
> >> +               prog->aux->attach_btf_id = btf_id;
> >> +       }
> >> +
> >> +       err = min(btf_id, 0);
> >
> > this min trick looks too clever... why not more straightforward and composable:
> >
> > if (btf_id < 0) {
> >      err = btf_id;
> >      goto done;
> > }
> >
> > prog->aux->dump_target = tinfo->target;
> > prog->aux->attach_btf_id = btf_id;
> >
> > ?
>
> this can be done.
>
> >
> >> +done:
> >> +       fdput(tfd);
> >> +       return err;
> >> +}
> >> +
> >>   int bpf_dump_reg_target(const char *target,
> >>                          const char *target_proto,
> >>                          const struct seq_operations *seq_ops,
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index 64783da34202..41005dee8957 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -2060,7 +2060,12 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
> >>
> >>          prog->expected_attach_type = attr->expected_attach_type;
> >>          prog->aux->attach_btf_id = attr->attach_btf_id;
> >> -       if (attr->attach_prog_fd) {
> >> +       if (type == BPF_PROG_TYPE_TRACING &&
> >> +           attr->expected_attach_type == BPF_TRACE_DUMP) {
> >> +               err = bpf_dump_set_target_info(attr->attach_prog_fd, prog);
> >
> > looking at bpf_attr, it's not clear why attach_prog_fd and
> > prog_ifindex were not combined into a single union field... this
> > probably got missed? But in this case I'd say let's create a
> >
> > union {
> >      __u32 attach_prog_fd;
> >      __u32 attach_target_fd; (similar to terminology for BPF_PROG_ATTACH)
> > };
> >
> > instead of reusing not-exactly-matching field names?
>
> I thought about this, but thinking to avoid uapi change (although
> compatible). Maybe we should. Let me think about this.

This is creating a new alias for the same field, so should be fine
from UAPI perspective.

>
> >
> >> +               if (err)
> >> +                       goto free_prog_nouncharge;
> >> +       } else if (attr->attach_prog_fd) {
> >>                  struct bpf_prog *tgt_prog;
> >>
> >>                  tgt_prog = bpf_prog_get(attr->attach_prog_fd);
> >> @@ -2145,6 +2150,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
> >>          err = bpf_prog_new_fd(prog);
> >>          if (err < 0)
> >>                  bpf_prog_put(prog);
> >> +
> >>          return err;
> >>
> >
> > [...]
> >


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-10 23:41     ` Yonghong Song
@ 2020-04-13 19:45       ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 19:45 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 4:41 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/10/20 3:51 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> Given a loaded dumper bpf program, which already
> >> knows which target it should bind to, there
> >> two ways to create a dumper:
> >>    - a file based dumper under hierarchy of
> >>      /sys/kernel/bpfdump/ which uses can
> >>      "cat" to print out the output.
> >>    - an anonymous dumper which user application
> >>      can "read" the dumping output.
> >>
> >> For file based dumper, BPF_OBJ_PIN syscall interface
> >> is used. For anonymous dumper, BPF_PROG_ATTACH
> >> syscall interface is used.
> >>
> >> To facilitate target seq_ops->show() to get the
> >> bpf program easily, dumper creation increases
> >> the target-provided seq_file private data size
> >> so bpf program pointer is also stored in seq_file
> >> private data.
> >>
> >> Further, a seq_num which represents how many
> >> bpf_dump_get_prog() has been called is also
> >> available to the target seq_ops->show().
> >> Such information can be used to e.g., print
> >> banner before printing out actual data.
> >>
> >> Note the seq_num does not represent the num
> >> of unique kernel objects the bpf program has
> >> seen. But it should be a good approximation.
> >>
> >> A target feature, BPF_DUMP_SEQ_NET_PRIVATE,
> >> is implemented; it is specifically useful for
> >> net based dumpers. It sets the net namespace
> >> to the current process's net namespace.
> >> This avoids changing existing net seq_ops
> >> in order to retrieve the net namespace from
> >> the seq_file pointer.
> >>
> >> For open dumper files, anonymous or not, the
> >> fdinfo will show the target and prog_id associated
> >> with that file descriptor. For dumper file itself,
> >> a kernel interface will be provided to retrieve the
> >> prog_id in one of the later patches.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h            |   5 +
> >>   include/uapi/linux/bpf.h       |   6 +-
> >>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
> >>   kernel/bpf/syscall.c           |  11 +-
> >>   tools/include/uapi/linux/bpf.h |   6 +-
> >>   5 files changed, 362 insertions(+), 4 deletions(-)
> >>
> >
> > [...]
> >
> >>
> >> +struct dumper_inode_info {
> >> +       struct bpfdump_target_info *tinfo;
> >> +       struct bpf_prog *prog;
> >> +};
> >> +
> >> +struct dumper_info {
> >> +       struct list_head list;
> >> +       /* file to identify an anon dumper,
> >> +        * dentry to identify a file dumper.
> >> +        */
> >> +       union {
> >> +               struct file *file;
> >> +               struct dentry *dentry;
> >> +       };
> >> +       struct bpfdump_target_info *tinfo;
> >> +       struct bpf_prog *prog;
> >> +};
> >
> > This is essentially a bpf_link. Why not do it as a bpf_link from the
> > get go? Instead of having all this duplication for anonymous and
>
> This is a good question. Maybe part of bpf_link can be used and
> I will have to implement the rest. I will check.
>
> > pinned dumpers, it would always be a bpf_link-based dumper, but for
> > those pinned bpf_link itself is going to be pinned. You also get a
> > benefit of being able to list all dumpers through existing bpf_link
> > API (also see my RFC patches with bpf_link_prime/bpf_link_settle,
> > which makes using bpf_link safe and simple).
>
> Agree. Alternative is to use BPF_OBJ_GET_INFO_BY_FD to query individual
> dumper as directory tree walk can be easily done at user space.

But BPF_OBJ_GET_INFO_BY_FD won't work well for anonymous dumpers,
because it's not so easy to iterate over them (possible, but not
easy)?

>
>
> >
> > [...]
> >
> >> +
> >> +static void anon_dumper_show_fdinfo(struct seq_file *m, struct file *filp)
> >> +{
> >> +       struct dumper_info *dinfo;
> >> +
> >> +       mutex_lock(&anon_dumpers.dumper_mutex);
> >> +       list_for_each_entry(dinfo, &anon_dumpers.dumpers, list) {
> >
> > this (and few other places where you search in a loop) would also be
> > simplified, because struct file* would point to bpf_dumper_link, which
> > then would have a pointer to bpf_prog, dentry (if pinned), etc. No
> > searching at all.
>
> That is the reason for this: just like bpf_link, bpfdump already has
> the full information about the file, inode, etc.

I think (if I understand what you are saying), this is my point. What
you have in struct dumper_info is already a custom bpf_link. You are
just missing `struct bpf_link link;` field there and plugging it into
overall bpf_link infrastructure (bpf_link__init + bpf_link__prime +
bpf_link__settle, from my RFC) to gain benefits of bpf_link infra.


> The file private_data actually points to seq_file. The seq_file private
> data is used in the target. That is exactly why we try to have this
> mapping to keep track. bpf_link won't help here.

I need to go and re-read all the code again carefully with who stores
what in their private_data field...

>
> >
> >> +               if (dinfo->file == filp) {
> >> +                       seq_printf(m, "target:\t%s\n"
> >> +                                     "prog_id:\t%u\n",
> >> +                                  dinfo->tinfo->target,
> >> +                                  dinfo->prog->aux->id);
> >> +                       break;
> >> +               }
> >> +       }
> >> +       mutex_unlock(&anon_dumpers.dumper_mutex);
> >> +}
> >> +
> >> +#endif
> >> +
> >
> > [...]
> >


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-11  0:23     ` Yonghong Song
  2020-04-11 23:17       ` Alexei Starovoitov
@ 2020-04-13 19:59       ` Andrii Nakryiko
  1 sibling, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 19:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Fri, Apr 10, 2020 at 5:23 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/10/20 4:25 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> Given a loaded dumper bpf program, which already
> >> knows which target it should bind to, there
> >> are two ways to create a dumper:
> >>    - a file based dumper under the hierarchy of
> >>      /sys/kernel/bpfdump/ which users can
> >>      "cat" to print out the output.
> >>    - an anonymous dumper which user application
> >>      can "read" the dumping output.
> >>
> >> For file based dumper, BPF_OBJ_PIN syscall interface
> >> is used. For anonymous dumper, BPF_PROG_ATTACH
> >> syscall interface is used.
> >>
> >> To facilitate target seq_ops->show() to get the
> >> bpf program easily, dumper creation increases
> >> the target-provided seq_file private data size
> >> so bpf program pointer is also stored in seq_file
> >> private data.
> >>
> >> Further, a seq_num which represents how many
> >> bpf_dump_get_prog() has been called is also
> >> available to the target seq_ops->show().
> >> Such information can be used to e.g., print
> >> banner before printing out actual data.
> >
> > So I looked up seq_operations struct and did a very cursory read of
> > fs/seq_file.c and seq_file documentation, so I might be completely off
> > here.
> >
> > start() is called before iteration begins, stop() is called after
> > iteration ends. Would it be a bit better and user-friendly interface
> > to have to extra calls to BPF program, say with NULL input element,
> > but with extra enum/flag that specifies that this is a START or END of
> > iteration, in addition to seq_num?
>
> The current design always pass a valid object (task, file, netlink_sock,
> fib6_info). That is, access to fields to those data structure won't
> cause runtime exceptions.
>
> Therefore, with the existing seq_ops implementation for ipv6_route
> and netlink, etc, we don't have END information. We can get START
> information though.

Right, I understand this about the current implementation, because it
calls the BPF program from show(). But I also noticed stop(), which

>
> >
> > Also, right now it's impossible to write stateful dumpers that do any
> > kind of stats calculation, because it's impossible to determine when
> > iteration restarted (it starts from the very beginning, not from the
> > last element). It's impossible to just rememebr last processed
> > seq_num, because BPF program might be called for a new "session" in
> > parallel with the old one.
>
> Theoretically, session end can be detected by checking the return
> value of last bpf_seq_printf() or bpf_seq_write(). If it indicates
> an overflow, that means session end.

That's not what I meant by session end. If there is an overflow, the
session is going to be restarted from the start (but it's still the same
session, we just got bigger output buffer).

>
> Or bpfdump infrastructure can help do this work to provide
> session id.

Well, come to think of it, the seq_file pointer itself is unique per
session, so that one can be used as the session id, is that right?

>
> >
> > So it seems like few things would be useful:
> >
> > 1. end flag for post-aggregation and/or footer printing (seq_num == 0
> > is providing similar means for start flag).
>
> the end flag is a problem. We could say hijack next or stop so we
> can detect the end, but passing a NULL pointer as the object
> to the bpf program may be problematic without verifier enforcement
> as it may cause a lot of exceptions... Although all these exceptions
> will be silenced by the bpf infra, I am still not sure whether this
> is acceptable or not.

Right, verifier will need to know that item can be valid pointer or
NULL. It's not perfect, but not too big of a deal for user to check
for NULL at the very beginning.

What I'm aiming for with this end flags is ability for BPF program to
collect data during show() calls, and then at the end get extra call
to give ability to post-aggregate this data and emit some sort of
summary into seq_file. Think about printing out summary stats across
all tasks (e.g., p50 of run queue latency, or something like that). In
that case, I need to iterate all tasks, I don't need to emit anything
for any individual tasks, but I need to produce an aggregation and
output after the last task was iterated. Right now it's impossible to
do, but it seems like an extremely powerful and useful feature. drgn
could utilize this to speed up its scripts. There are plenty of tools
that would like a frequent but cheap view into internals of
the system, which currently is implemented through netlink (taskstats)
or procfs, both quite expensive if polled every second.

Anonymous bpfdump, though, is going to be much cheaper, because a lot
of aggregation can happen in the kernel and only minimal output at the
end will be read by user-space.

>
> > 2. Some sort of "session id", so that bpfdumper can maintain
> > per-session intermediate state. Plus with this it would be possible to
> > detect restarts (if there is some state for the same session and
> > seq_num == 0, this is restart).
>
> I guess we can do this.

See above, probably using seq_file pointer is good enough.

>
> >
> > It seems like it might be a bit more flexible to, instead of providing
> > seq_file * pointer directly, actually provide a bpfdumper_context
> > struct, which would have seq_file * as one of fields, other being
> > session_id and start/stop flags.
>
> As you mentioned, if we have more fields related to seq_file passing
> to bpf program, yes, grouping them into a structure makes sense.
>
> >
> > A bit unstructured thoughts, but what do you think?
> >
> >>
> >> Note the seq_num does not represent the num
> >> of unique kernel objects the bpf program has
> >> seen. But it should be a good approximation.
> >>
> >> A target feature, BPF_DUMP_SEQ_NET_PRIVATE,
> >> is implemented; it is specifically useful for
> >> net based dumpers. It sets the net namespace
> >> to the current process's net namespace.
> >> This avoids changing existing net seq_ops
> >> in order to retrieve the net namespace from
> >> the seq_file pointer.
> >>
> >> For open dumper files, anonymous or not, the
> >> fdinfo will show the target and prog_id associated
> >> with that file descriptor. For dumper file itself,
> >> a kernel interface will be provided to retrieve the
> >> prog_id in one of the later patches.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h            |   5 +
> >>   include/uapi/linux/bpf.h       |   6 +-
> >>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
> >>   kernel/bpf/syscall.c           |  11 +-
> >>   tools/include/uapi/linux/bpf.h |   6 +-
> >>   5 files changed, 362 insertions(+), 4 deletions(-)
> >>
> >
> > [...]
> >


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-11 23:11           ` Alexei Starovoitov
  2020-04-12  6:51             ` Yonghong Song
@ 2020-04-13 20:48             ` Andrii Nakryiko
  1 sibling, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 20:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Sat, Apr 11, 2020 at 4:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 10, 2020 at 04:47:36PM -0700, Yonghong Song wrote:
> > >
> > > Instead of special-casing dumper_name, can we require specifying full
> > > path, and then check whether it is in BPF FS vs BPFDUMP FS? If the
> > > latter, additionally check that it is in the right sub-directory
> > > matching its intended target type.
> >
> > We could. I just think specifying full path for bpfdump is not necessary
> > since it is a single user mount...
> >
> > >
> > > But honestly, just doing everything within BPF FS starts to seem
> > > cleaner at this point...
> >
> > bpffs is multi mount, which is not a perfect fit for bpfdump;
> > when mounting inside a namespace, etc., all the dumpers are gone.
>
> As Yonghong pointed out, reusing bpffs for dumpers doesn't look possible
> from an implementation perspective.
> Even if it was possible the files in such mix-and-match file system
> would be of different kinds with different semantics. I think that
> will lead to mediocre user experience when file 'foo' is cat-able
> with nice human output, but file 'bar' isn't cat-able at all because
> it's just a pinned map. imo having all dumpers in one fixed location
> in /sys/kernel/bpfdump makes them easy to discover for folks who might
> not even know what bpf is.

I agree about importance of discoverability, but bpffs will typically
be mounted as /sys/fs/bpf/ as well, so it's just as discoverable at
/sys/fs/bpf/bpfdump. But I'm not too fixated on unifying bpffs and
bpfdumpfs, it's just that bpfdumpfs feels a bit too single-purpose.

> For example when I'm trying to learn some new area of the kernel I might go
> poke around /proc and /sys directory looking for a file name that could be
> interesting to 'cat'. This is how I discovered /sys/kernel/slab/ :)
> I think keeping all dumpers in /sys/kernel/bpfdump/ will make them
> similarly discoverable.
>
> re: f_dump flag...
> Maybe it's a sign that pinning is not the right name for such an operation?
> If kernel cannot distinguish pinning dumper prog into bpffs as a vanilla
> pinning operation vs pinning into bpfdumpfs to make it cat-able then something
> isn't right about api. Either it needs to be a new bpf syscall command (like
> install_dumper_in_dumpfs) or reuse pinning command, but make libbpf specify the
> full path. From bpf prog point of view it may still specify only the final
> name, but libbpf can prepend the /sys/kernel/bpfdump/.../. May be there is a
> third option. Extra flag for pinning just doesn't look right. What if we do
> another specialized file system later? It would need yet another flag to pin
> there?

I agree about specifying the full path from the libbpf side. But the section
definition shouldn't include the /sys/kernel/bpfdump part, so the program
would be defined as:

SEC("dump/task/file")
int prog(...) { }

And libbpf by default will concat that with /sys/kernel/bpfdump, but
probably should also provide a way to override prefix with custom
value, provided by users.


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-11 23:17       ` Alexei Starovoitov
@ 2020-04-13 21:04         ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 21:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Sat, Apr 11, 2020 at 4:17 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 10, 2020 at 05:23:30PM -0700, Yonghong Song wrote:
> > >
> > > So it seems like few things would be useful:
> > >
> > > 1. end flag for post-aggregation and/or footer printing (seq_num == 0
> > > is providing similar means for start flag).
> >
> > the end flag is a problem. We could say hijack next or stop so we
> > can detect the end, but passing a NULL pointer as the object
> > to the bpf program may be problematic without verifier enforcement
> > as it may cause a lot of exceptions... Although all these exceptions
> > will be silenced by the bpf infra, I am still not sure whether this
> > is acceptable or not.
>
> I don't like passing NULL there just to indicate something to a program.
> It's not too horrible to support from verifier side, but NULL is only
> one such flag. What does it suppose to indicate? That dumper prog
> is just starting? or ending? Let's pass (void*)1, and (void *)2 ?
> I'm not a fan of such inband signaling.
> imo it's cleaner and simpler when that object pointer is always valid.

I'm not proposing to pass fake pointers. I proposed to have bpfdump
context instead. E.g., one way to do this would be something like:

struct bpf_dump_context {
  struct seq_file *seq;
  u64 seq_num;
  int flags; /* 0 | BPF_DUMP_START | BPF_DUMP_END */
};

int prog(struct bpf_dump_context *ctx, struct netlink_sock *sk) {
  if (ctx->flags & BPF_DUMP_END) {
    /* emit summary */
    return 0;
  }

  /* sk must be not null here. */
}


This is one way. We can make it simpler by saying that sk == NULL is
always the end of aggregation for a given seq_file; then we won't need
flags and will just require an explicit `if (!sk)` check.
best way, but what I'm advocating for is to have a way for BPF program
to know that processing is finished and it's time to emit summary. See
my other reply in this thread with example use cases.


>
> > > 2. Some sort of "session id", so that bpfdumper can maintain
> > > per-session intermediate state. Plus with this it would be possible to
> > > detect restarts (if there is some state for the same session and
> > > seq_num == 0, this is restart).
> >
> > I guess we can do this.
>
> beyond seq_num, passing session_id is a good idea. Though I don't quite see
> the use case where you'd need a bpfdumper prog to be stateful, it doesn't hurt.

State per session seems most useful, so session id + hashmap solves
it. If we do sk_local storage per seq_file, that might be enough as
well, I guess...

Examples are any kind of summary stats across all sockets/tasks/etc.

Another interesting use case: produce map from process ID (tgid) to
bpf_maps, bpf_progs, bpf_links (or sockets, or whatever kind of file
we need). You'd need FD/file -> kernel object map and then kernel
object -> tgid map. I think there are many useful use-cases beyond
"one line per object" output cases that inspired bpfdump in the first
place.


* Re: [RFC PATCH bpf-next 07/16] bpf: add bpf_map target
  2020-04-08 23:25 ` [RFC PATCH bpf-next 07/16] bpf: add bpf_map target Yonghong Song
@ 2020-04-13 22:18   ` Andrii Nakryiko
  2020-04-13 22:47     ` Andrii Nakryiko
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 22:18 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> This patch adds the bpf_map target, traversing all bpf_maps
> through map_idr. A reference is held on the map during
> show() to ensure safety and correctness of field accesses.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/syscall.c | 104 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 104 insertions(+)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index b5e4f18cc633..62a872a406ca 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3797,3 +3797,107 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>
>         return err;
>  }
> +
> +struct bpfdump_seq_map_info {
> +       struct bpf_map *map;
> +       u32 id;
> +};
> +
> +static struct bpf_map *bpf_map_seq_get_next(u32 *id)
> +{
> +       struct bpf_map *map;
> +
> +       spin_lock_bh(&map_idr_lock);
> +       map = idr_get_next(&map_idr, id);
> +       if (map)
> +               map = __bpf_map_inc_not_zero(map, false);
> +       spin_unlock_bh(&map_idr_lock);
> +
> +       return map;
> +}
> +
> +static void *bpf_map_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +       struct bpfdump_seq_map_info *info = seq->private;
> +       struct bpf_map *map;
> +       u32 id = info->id + 1;

shouldn't it always start from id=0? This seems buggy and should break
on seq_file restart.

> +
> +       map = bpf_map_seq_get_next(&id);
> +       if (!map)

bpf_map_seq_get_next will return error code, not NULL, if bpf_map
refcount couldn't be incremented. So this must be IS_ERR(map).

> +               return NULL;
> +
> +       ++*pos;
> +       info->map = map;
> +       info->id = id;
> +       return map;
> +}
> +
> +static void *bpf_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +       struct bpfdump_seq_map_info *info = seq->private;
> +       struct bpf_map *map;
> +       u32 id = info->id + 1;
> +
> +       ++*pos;
> +       map = bpf_map_seq_get_next(&id);
> +       if (!map)

same here, IS_ERR(map)

> +               return NULL;
> +
> +       __bpf_map_put(info->map, true);
> +       info->map = map;
> +       info->id = id;
> +       return map;
> +}
> +

[...]


* Re: [RFC PATCH bpf-next 07/16] bpf: add bpf_map target
  2020-04-13 22:18   ` Andrii Nakryiko
@ 2020-04-13 22:47     ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 22:47 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Apr 13, 2020 at 3:18 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
> >
> > This patch adds a bpf_map target, traversing all bpf_maps
> > through map_idr. A reference is held on the map during
> > show() to ensure safe and correct field accesses.
> >
> > Signed-off-by: Yonghong Song <yhs@fb.com>
> > ---
> >  kernel/bpf/syscall.c | 104 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 104 insertions(+)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index b5e4f18cc633..62a872a406ca 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3797,3 +3797,107 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> >
> >         return err;
> >  }
> > +
> > +struct bpfdump_seq_map_info {
> > +       struct bpf_map *map;
> > +       u32 id;
> > +};
> > +
> > +static struct bpf_map *bpf_map_seq_get_next(u32 *id)
> > +{
> > +       struct bpf_map *map;
> > +
> > +       spin_lock_bh(&map_idr_lock);
> > +       map = idr_get_next(&map_idr, id);
> > +       if (map)
> > +               map = __bpf_map_inc_not_zero(map, false);
> > +       spin_unlock_bh(&map_idr_lock);
> > +
> > +       return map;
> > +}
> > +
> > +static void *bpf_map_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +       struct bpfdump_seq_map_info *info = seq->private;
> > +       struct bpf_map *map;
> > +       u32 id = info->id + 1;
>
> shouldn't it always start from id=0? This seems buggy and should break
> on seq_file restart.

Actually, never mind this: from reading the fs/seq_file.c code I had
been under the impression that start() is only called for full
restarts, but that's not true.


>
> > +
> > +       map = bpf_map_seq_get_next(&id);
> > +       if (!map)
>
> bpf_map_seq_get_next will return error code, not NULL, if bpf_map
> refcount couldn't be incremented. So this must be IS_ERR(map).
>
> > +               return NULL;
> > +
> > +       ++*pos;
> > +       info->map = map;
> > +       info->id = id;
> > +       return map;
> > +}
> > +
> > +static void *bpf_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +       struct bpfdump_seq_map_info *info = seq->private;
> > +       struct bpf_map *map;
> > +       u32 id = info->id + 1;
> > +
> > +       ++*pos;
> > +       map = bpf_map_seq_get_next(&id);
> > +       if (!map)
>
> same here, IS_ERR(map)
>
> > +               return NULL;
> > +
> > +       __bpf_map_put(info->map, true);
> > +       info->map = map;
> > +       info->id = id;
> > +       return map;
> > +}
> > +
>
> [...]


* Re: [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets
  2020-04-08 23:25 ` [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets Yonghong Song
  2020-04-10  3:22   ` Alexei Starovoitov
@ 2020-04-13 23:00   ` Andrii Nakryiko
  1 sibling, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-13 23:00 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Only the tasks belonging to the "current" pid namespace
> are enumerated.
>
> For task/file target, the bpf program will have access to
>   struct task_struct *task
>   u32 fd
>   struct file *file
> where fd/file is an open file for the task.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/Makefile    |   2 +-
>  kernel/bpf/dump_task.c | 294 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 295 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/dump_task.c
>
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 4a1376ab2bea..7e2c73deabab 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -26,7 +26,7 @@ obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
>  endif
>  ifeq ($(CONFIG_SYSFS),y)
>  obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
> -obj-$(CONFIG_BPF_SYSCALL) += dump.o
> +obj-$(CONFIG_BPF_SYSCALL) += dump.o dump_task.o
>  endif
>  ifeq ($(CONFIG_BPF_JIT),y)
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
> diff --git a/kernel/bpf/dump_task.c b/kernel/bpf/dump_task.c
> new file mode 100644
> index 000000000000..69b0bcec68e9
> --- /dev/null
> +++ b/kernel/bpf/dump_task.c
> @@ -0,0 +1,294 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2020 Facebook */
> +
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/fs.h>
> +#include <linux/fdtable.h>
> +#include <linux/filter.h>
> +
> +struct bpfdump_seq_task_info {
> +       struct pid_namespace *ns;
> +       struct task_struct *task;
> +       u32 id;
> +};
> +
> +static struct task_struct *task_seq_get_next(struct pid_namespace *ns, u32 *id)
> +{
> +       struct task_struct *task;
> +       struct pid *pid;
> +
> +       rcu_read_lock();
> +       pid = idr_get_next(&ns->idr, id);
> +       task = get_pid_task(pid, PIDTYPE_PID);
> +       if (task)
> +               get_task_struct(task);

I think get_pid_task() already calls get_task_struct() internally on
success. See also bpf_task_fd_query() implementation, it doesn't take
extra refcnt on task.

> +       rcu_read_unlock();
> +
> +       return task;
> +}
> +

[...]

> +static struct file *task_file_seq_get_next(struct pid_namespace *ns, u32 *id,
> +                                          int *fd, struct task_struct **task,
> +                                          struct files_struct **fstruct)
> +{
> +       struct files_struct *files;
> +       struct task_struct *tk;
> +       u32 sid = *id;
> +       int sfd;
> +
> +       /* If this function returns a non-NULL file object,
> +        * it holds references to the files_struct and the file.
> +        * Otherwise, it does not hold any reference.
> +        */
> +again:
> +       if (*fstruct) {
> +               files = *fstruct;
> +               sfd = *fd;
> +       } else {
> +               tk = task_seq_get_next(ns, &sid);
> +               if (!tk)
> +                       return NULL;
> +               files = get_files_struct(tk);
> +               put_task_struct(tk);
> +               if (!files)
> +                       return NULL;

There might still be another task with its own files, so shouldn't we
keep iterating tasks here?

> +               *fstruct = files;
> +               *task = tk;
> +               if (sid == *id) {
> +                       sfd = *fd;
> +               } else {
> +                       *id = sid;
> +                       sfd = 0;
> +               }
> +       }
> +
> +       spin_lock(&files->file_lock);
> +       for (; sfd < files_fdtable(files)->max_fds; sfd++) {
> +               struct file *f;
> +
> +               f = fcheck_files(files, sfd);
> +               if (!f)
> +                       continue;
> +
> +               *fd = sfd;
> +               get_file(f);
> +               spin_unlock(&files->file_lock);
> +               return f;
> +       }
> +
> +       /* the current task is done, go to the next task */
> +       spin_unlock(&files->file_lock);
> +       put_files_struct(files);
> +       *fstruct = NULL;
> +       sid = ++(*id);
> +       goto again;
> +}
> +

[...]

> +static int task_file_seq_show(struct seq_file *seq, void *v)
> +{
> +       struct bpfdump_seq_task_file_info *info = seq->private;
> +       struct {
> +               struct task_struct *task;
> +               u32 fd;
> +               struct file *file;
> +               struct seq_file *seq;
> +               u64 seq_num;

should all the fields here be 8-byte aligned, including pointers
(because BPF is a 64-bit arch)? Well, at least `u32 fd` should?

> +       } ctx = {
> +               .file = v,
> +               .seq = seq,
> +       };
> +       struct bpf_prog *prog;
> +       int ret;
> +
> +       prog = bpf_dump_get_prog(seq, sizeof(struct bpfdump_seq_task_file_info),
> +                                &ctx.seq_num);
> +       ctx.task = info->task;
> +       ctx.fd = info->fd;
> +       ret = bpf_dump_run_prog(prog, &ctx);
> +
> +       return ret == 0 ? 0 : -EINVAL;
> +}
> +
> +static const struct seq_operations task_file_seq_ops = {
> +        .start  = task_file_seq_start,
> +        .next   = task_file_seq_next,
> +        .stop   = task_file_seq_stop,
> +        .show   = task_file_seq_show,
> +};
> +
> +int __init bpfdump__task(struct task_struct *task, struct seq_file *seq,
> +                        u64 seq_num) {
> +       return 0;
> +}
> +
> +int __init bpfdump__task_file(struct task_struct *task, u32 fd,
> +                             struct file *file, struct seq_file *seq,
> +                             u64 seq_num)
> +{
> +       return 0;
> +}
> +
> +static int __init task_dump_init(void)
> +{
> +       int ret;
> +
> +       ret = bpf_dump_reg_target("task", "bpfdump__task",
> +                                 &task_seq_ops,
> +                                 sizeof(struct bpfdump_seq_task_info), 0);
> +       if (ret)
> +               return ret;
> +
> +       return bpf_dump_reg_target("task/file", "bpfdump__task_file",
> +                                  &task_file_seq_ops,
> +                                  sizeof(struct bpfdump_seq_task_file_info),
> +                                  0);
> +}
> +late_initcall(task_dump_init);
> --
> 2.24.1
>


* Re: [RFC PATCH bpf-next 10/16] bpf: support variable length array in tracing programs
  2020-04-08 23:25 ` [RFC PATCH bpf-next 10/16] bpf: support variable length array in tracing programs Yonghong Song
@ 2020-04-14  0:13   ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-14  0:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> In /proc/net/ipv6_route, we have
>   struct fib6_info {
>     struct fib6_table *fib6_table;
>     ...
>     struct fib6_nh fib6_nh[0];
>   }
>   struct fib6_nh {
>     struct fib_nh_common nh_common;
>     struct rt6_info **rt6i_pcpu;
>     struct rt6_exception_bucket *rt6i_exception_bucket;
>   };
>   struct fib_nh_common {
>     ...
>     u8 nhc_gw_family;
>     ...
>   }
>
> The access:
>   struct fib6_nh *fib6_nh = &rt->fib6_nh;
>   ... fib6_nh->nh_common.nhc_gw_family ...
>
> This patch ensures such an access is handled properly.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/btf.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index d65c6912bdaf..89a0d983b169 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3837,6 +3837,31 @@ int btf_struct_access(struct bpf_verifier_log *log,
>         }
>
>         if (off + size > t->size) {
> +               /* If the last element is a variable size array, we may
> +                * need to relax the rule.
> +                */
> +               struct btf_array *array_elem;
> +               u32 vlen = btf_type_vlen(t);
> +               u32 last_member_type;
> +
> +               member = btf_type_member(t);
> +               last_member_type = member[vlen - 1].type;

vlen could be zero, and then member[vlen - 1] below would be an
out-of-bounds access


> +               mtype = btf_type_by_id(btf_vmlinux, last_member_type);

might want to strip modifiers here?

> +               if (!btf_type_is_array(mtype))
> +                       goto error;
> +

should probably check that off is >= the last member's offset within
the struct? Otherwise the access might span the previous field and
this array.

> +               array_elem = (struct btf_array *)(mtype + 1);
> +               if (array_elem->nelems != 0)
> +                       goto error;
> +
> +               elem_type = btf_type_by_id(btf_vmlinux, array_elem->type);

strip modifiers

> +               if (!btf_type_is_struct(elem_type))
> +                       goto error;
> +
> +               off = (off - t->size) % elem_type->size;

I think it would be safer to use the field offset, not the struct
size. Consider the example below.

$ cat test-test.c
struct bla {
        long a;
        int b;
        char c[];
};

int main() {
        static struct bla *x = 0;
        return 0;
}

$ pahole -F btf -C bla test-test.o
struct bla {
        long int                   a;                    /*     0     8 */
        int                        b;                    /*     8     4 */
        char                       c[];                  /*    12     0 */

        /* size: 16, cachelines: 1, members: 3 */
        /* padding: 4 */
        /* last cacheline: 16 bytes */
};

c is at offset 12, but struct size is 16 due to long alignment. It
could be a 4-byte struct instead of char there.

> +               return btf_struct_access(log, elem_type, off, size, atype, next_btf_id);
> +
> +error:
>                 bpf_log(log, "access beyond struct %s at off %u size %u\n",
>                         tname, off, size);
>                 return -EACCES;
> --
> 2.24.1
>


* Re: [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers
  2020-04-08 23:25 ` [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
  2020-04-10  3:26   ` Alexei Starovoitov
@ 2020-04-14  5:28   ` Andrii Nakryiko
  1 sibling, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-14  5:28 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Two helpers, bpf_seq_printf and bpf_seq_write, are added for
> writing data to the seq_file buffer.
>
> bpf_seq_printf supports common format string flag/width/type
> fields, so that identical results can be produced at least for the
> netlink and ipv6_route targets.
>
> For bpf_seq_printf, a return value of 1 specifically indicates
> a write failure due to overflow, in order to differentiate this
> failure from format string errors.
>
> For seq_file show(), since it may be called twice for the same
> object, some bpf_progs might be sensitive to this. With a return
> value indicating that an overflow happened, the bpf program can
> react differently.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/uapi/linux/bpf.h       |  18 +++-
>  kernel/trace/bpf_trace.c       | 172 +++++++++++++++++++++++++++++++++
>  scripts/bpf_helpers_doc.py     |   2 +
>  tools/include/uapi/linux/bpf.h |  18 +++-
>  4 files changed, 208 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b51d56fc77f9..a245f0df53c4 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3030,6 +3030,20 @@ union bpf_attr {
>   *             * **-EOPNOTSUPP**       Unsupported operation, for example a
>   *                                     call from outside of TC ingress.
>   *             * **-ESOCKTNOSUPPORT**  Socket type not supported (reuseport).
> + *
> + * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, ...)
> + *     Description
> + *             seq_printf
> + *     Return
> + *             0 if successful, or
> + *             1 if failure due to buffer overflow, or
> + *             a negative value for format string related failures.

This encoding feels a bit arbitrary; why not stick to normal error
codes and return, for example, EAGAIN on overflow (or EOVERFLOW)?

> + *
> + * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
> + *     Description
> + *             seq_write
> + *     Return
> + *             0 if successful, non-zero otherwise.

Especially given that bpf_seq_write will probably return <0 on the same error?

>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -3156,7 +3170,9 @@ union bpf_attr {
>         FN(xdp_output),                 \
>         FN(get_netns_cookie),           \
>         FN(get_current_ancestor_cgroup_id),     \
> -       FN(sk_assign),
> +       FN(sk_assign),                  \
> +       FN(seq_printf),                 \
> +       FN(seq_write),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1796747a77..e7d6ba7c9c51 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -457,6 +457,174 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
>         return &bpf_trace_printk_proto;
>  }
>
> +BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size, u64, arg1,
> +          u64, arg2)
> +{

I honestly didn't dare to look at the implementation below, but this
limitation of only up to 2 arguments in bpf_seq_printf (arg1 and arg2)
seems extremely limiting. It might be ok for bpf_printk, but not for
the more general and non-debugging bpf_seq_printf.

How about instead of passing arguments as the 4th and 5th argument,
bpf_seq_printf would require passing a pointer to a long array, where
each item corresponds to a printf argument? So on the BPF program
side, one would have to do this to printf 5 arguments:

long __tmp_arr[] = { 123, pointer_to_str, some_input_int,
                     some_input_long, 5 * arg_x };
return bpf_seq_printf(m, fmt, fmt_size, &__tmp_arr, sizeof(__tmp_arr));

And the bpf_seq_printf would know that 4th argument is a pointer to an
array of size provided in 5th argument and process them accordingly.
This would theoretically allow to have arbitrary number of arguments.
This local array construction can be abstracted into macro, of course.
Would something like this be possible?

[...]

> +/* Horrid workaround for getting va_list handling working with different
> + * argument type combinations generically for 32 and 64 bit archs.
> + */
> +#define __BPF_SP_EMIT()        __BPF_ARG2_SP()
> +#define __BPF_SP(...)                                                  \
> +       seq_printf(m, fmt, ##__VA_ARGS__)
> +
> +#define __BPF_ARG1_SP(...)                                             \
> +       ((mod[0] == 2 || (mod[0] == 1 && __BITS_PER_LONG == 64))        \
> +         ? __BPF_SP(arg1, ##__VA_ARGS__)                               \
> +         : ((mod[0] == 1 || (mod[0] == 0 && __BITS_PER_LONG == 32))    \
> +             ? __BPF_SP((long)arg1, ##__VA_ARGS__)                     \
> +             : __BPF_SP((u32)arg1, ##__VA_ARGS__)))
> +
> +#define __BPF_ARG2_SP(...)                                             \
> +       ((mod[1] == 2 || (mod[1] == 1 && __BITS_PER_LONG == 64))        \
> +         ? __BPF_ARG1_SP(arg2, ##__VA_ARGS__)                          \
> +         : ((mod[1] == 1 || (mod[1] == 0 && __BITS_PER_LONG == 32))    \
> +             ? __BPF_ARG1_SP((long)arg2, ##__VA_ARGS__)                \
> +             : __BPF_ARG1_SP((u32)arg2, ##__VA_ARGS__)))

hm... wouldn't this make it impossible to print 64-bit numbers on
32-bit arches? It seems to be truncating to 32-bit unconditionally....

> +
> +       __BPF_SP_EMIT();
> +       return seq_has_overflowed(m);
> +}
> +

[...]


* Re: [RFC PATCH bpf-next 14/16] tools/bpf: selftests: add dumper programs for ipv6_route and netlink
  2020-04-08 23:25 ` [RFC PATCH bpf-next 14/16] tools/bpf: selftests: add dumper programs for ipv6_route and netlink Yonghong Song
@ 2020-04-14  5:39   ` Andrii Nakryiko
  0 siblings, 0 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-14  5:39 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Two bpf programs are added in this patch for the netlink and
> ipv6_route targets. On my VM, I am able to achieve identical
> results compared to /proc/net/netlink and /proc/net/ipv6_route.
>
>   $ cat /proc/net/netlink
>   sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
>   000000002c42d58b 0   0          00000000 0        0        0     2        0        7
>   00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
>   00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
>   000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
>   ....
>   00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
>   000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
>   00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
>   000000008398fb08 16  0          00000000 0        0        0     2        0        27
>   $ cat /sys/kernel/bpfdump/netlink/my1
>   sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
>   000000002c42d58b 0   0          00000000 0        0        0     2        0        7
>   00000000a4e8b5e1 0   1          00000551 0        0        0     2        0        18719
>   00000000e1b1c195 4   0          00000000 0        0        0     2        0        16422
>   000000007e6b29f9 6   0          00000000 0        0        0     2        0        16424
>   ....
>   00000000159a170d 15  1862       00000002 0        0        0     2        0        1886
>   000000009aca4bc9 15  3918224839 00000002 0        0        0     2        0        19076
>   00000000d0ab31d2 15  1          00000002 0        0        0     2        0        18683
>   000000008398fb08 16  0          00000000 0        0        0     2        0        27
>
>   $ cat /proc/net/ipv6_route
>   fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>   00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
>   fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
>   ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>   $ cat /sys/kernel/bpfdump/ipv6_route/my1
>   fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>   00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
>   fe80000000000000c04b03fffe7827ce 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
>   ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000003 00000000 00000001     eth0
>   00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  .../selftests/bpf/progs/bpfdump_ipv6_route.c  | 63 ++++++++++++++++
>  .../selftests/bpf/progs/bpfdump_netlink.c     | 74 +++++++++++++++++++
>  2 files changed, 137 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpfdump_netlink.c
>
> diff --git a/tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c b/tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
> new file mode 100644
> index 000000000000..590e56791052
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpfdump_ipv6_route.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Facebook */
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_endian.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern bool CONFIG_IPV6_SUBTREES __kconfig __weak;
> +
> +#define        RTF_GATEWAY             0x0002
> +#define IFNAMSIZ               16
> +#define fib_nh_gw_family        nh_common.nhc_gw_family
> +#define fib_nh_gw6              nh_common.nhc_gw.ipv6
> +#define fib_nh_dev              nh_common.nhc_dev
> +
> +SEC("dump//sys/kernel/bpfdump/ipv6_route")
> +int BPF_PROG(dump_ipv6_route, struct fib6_info *rt, struct seq_file *seq, u64 seq_num)
> +{
> +       struct fib6_nh *fib6_nh = &rt->fib6_nh[0];
> +       unsigned int flags = rt->fib6_flags;
> +       const struct net_device *dev;
> +       struct nexthop *nh;
> +       static const char fmt1[] = "%pi6 %02x ";
> +       static const char fmt2[] = "%pi6 ";
> +       static const char fmt3[] = "00000000000000000000000000000000 ";
> +       static const char fmt4[] = "%08x %08x ";
> +       static const char fmt5[] = "%8s\n";
> +       static const char fmt6[] = "\n";
> +       static const char fmt7[] = "00000000000000000000000000000000 00 ";
> +
> +       /* FIXME: nexthop_is_multipath is not handled here. */
> +       nh = rt->nh;
> +       if (rt->nh)
> +               fib6_nh = &nh->nh_info->fib6_nh;
> +
> +       bpf_seq_printf(seq, fmt1, sizeof(fmt1), &rt->fib6_dst.addr,
> +                      rt->fib6_dst.plen);
> +
> +       if (CONFIG_IPV6_SUBTREES)
> +               bpf_seq_printf(seq, fmt1, sizeof(fmt1), &rt->fib6_src.addr,
> +                              rt->fib6_src.plen);
> +       else
> +               bpf_seq_printf(seq, fmt7, sizeof(fmt7));
> +
> +       if (fib6_nh->fib_nh_gw_family) {
> +               flags |= RTF_GATEWAY;
> +               bpf_seq_printf(seq, fmt2, sizeof(fmt2), &fib6_nh->fib_nh_gw6);
> +       } else {
> +               bpf_seq_printf(seq, fmt3, sizeof(fmt3));
> +       }
> +
> +       dev = fib6_nh->fib_nh_dev;
> +       bpf_seq_printf(seq, fmt4, sizeof(fmt4), rt->fib6_metric, rt->fib6_ref.refs.counter);
> +       bpf_seq_printf(seq, fmt4, sizeof(fmt4), 0, flags);
> +       if (dev)
> +               bpf_seq_printf(seq, fmt5, sizeof(fmt5), dev->name);
> +       else
> +               bpf_seq_printf(seq, fmt6, sizeof(fmt6));
> +
> +       return 0;
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bpfdump_netlink.c b/tools/testing/selftests/bpf/progs/bpfdump_netlink.c
> new file mode 100644
> index 000000000000..37c9be546b99
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpfdump_netlink.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Facebook */
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_endian.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define sk_rmem_alloc  sk_backlog.rmem_alloc
> +#define sk_refcnt      __sk_common.skc_refcnt
> +
> +#define offsetof(TYPE, MEMBER)  ((size_t)&((TYPE *)0)->MEMBER)
> +#define container_of(ptr, type, member) ({                              \
> +        void *__mptr = (void *)(ptr);                                   \
> +        ((type *)(__mptr - offsetof(type, member))); })
> +
> +static inline struct inode *SOCK_INODE(struct socket *socket)
> +{
> +       return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
> +}
> +
> +SEC("dump//sys/kernel/bpfdump/netlink")

We already discussed this on a previous patch, but just to put it
into a visual comparison:

SEC("dump/netlink")

looks so much nicer :)

> +int BPF_PROG(dump_netlink, struct netlink_sock *nlk, struct seq_file *seq, u64 seq_num)
> +{
> +       static const char banner[] =
> +               "sk               Eth Pid        Groups   "
> +               "Rmem     Wmem     Dump  Locks    Drops    Inode\n";
> +       static const char fmt1[] = "%pK %-3d ";
> +       static const char fmt2[] = "%-10u %08x ";
> +       static const char fmt3[] = "%-8d %-8d ";
> +       static const char fmt4[] = "%-5d %-8d ";
> +       static const char fmt5[] = "%-8u %-8lu\n";
> +       struct sock *s = &nlk->sk;
> +       unsigned long group, ino;
> +       struct inode *inode;
> +       struct socket *sk;
> +
> +       if (seq_num == 0)
> +               bpf_seq_printf(seq, banner, sizeof(banner));
> +
> +       bpf_seq_printf(seq, fmt1, sizeof(fmt1), s, s->sk_protocol);
> +
> +       if (!nlk->groups)  {
> +               group = 0;
> +       } else {
> +               /* FIXME: temporary use bpf_probe_read here, needs
> +                * verifier support to do direct access.
> +                */
> +               bpf_probe_read(&group, sizeof(group), &nlk->groups[0]);

Is this what's being fixed by patch #10?

> +       }
> +       bpf_seq_printf(seq, fmt2, sizeof(fmt2), nlk->portid, (u32)group);
> +
> +
> +       bpf_seq_printf(seq, fmt3, sizeof(fmt3), s->sk_rmem_alloc.counter,
> +                      s->sk_wmem_alloc.refs.counter - 1);
> +       bpf_seq_printf(seq, fmt4, sizeof(fmt4), nlk->cb_running,
> +                      s->sk_refcnt.refs.counter);
> +
> +       sk = s->sk_socket;
> +       if (!sk) {
> +               ino = 0;
> +       } else {
> +               /* FIXME: container_of inside SOCK_INODE has a forced
> +                * type conversion, and direct access cannot be used
> +                * with current verifier.
> +                */
> +               inode = SOCK_INODE(sk);
> +               bpf_probe_read(&ino, sizeof(ino), &inode->i_ino);
> +       }
> +       bpf_seq_printf(seq, fmt5, sizeof(fmt5), s->sk_drops.counter, ino);
> +
> +       return 0;
> +}
> --
> 2.24.1
>


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
                     ` (2 preceding siblings ...)
  2020-04-10 23:25   ` Andrii Nakryiko
@ 2020-04-14  5:56   ` Andrii Nakryiko
  2020-04-14 23:59     ` Yonghong Song
  3 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-14  5:56 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>
> Given a loaded dumper bpf program, which already
> knows which target it should bind to, there are
> two ways to create a dumper:
>   - a file based dumper under the hierarchy of
>     /sys/kernel/bpfdump/ which users can
>     "cat" to print out the output.
>   - an anonymous dumper from which a user
>     application can "read" the dumping output.
>
> For the file based dumper, the BPF_OBJ_PIN syscall interface
> is used. For the anonymous dumper, the BPF_PROG_ATTACH
> syscall interface is used.

We discussed this offline with Yonghong a bit, but I thought I'd put
my thoughts about this in writing for completeness. To me, it seems
like the most consistent way to do both anonymous and named dumpers is
through the following steps:

1. BPF_PROG_LOAD to load/verify program, that created program FD.
2. LINK_CREATE using that program FD and direntry FD. This creates
dumper bpf_link (bpf_dumper_link), returns anonymous link FD. If link
FD is closed, dumper program is detached and dumper is destroyed
(unless pinned in bpffs, just like with any other bpf_link).
3. At this point bpf_dumper_link can be treated like a factory of
seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
illustration purposes) command, that accepts dumper link FD and
returns a new seq_file FD, which can be read() normally (or, e.g.,
cat'ed from shell).
4. Additionally, this anonymous bpf_link can be pinned/mounted in
bpfdumpfs. We can do it as BPF_OBJ_PIN or as a separate command. Once
pinned at, e.g., /sys/fs/bpfdump/task/my_dumper, just opening that
file is equivalent to BPF_DUMPER_OPEN_FILE and will create a new
seq_file that can be read() independently from other seq_files opened
against the same dumper. Pinning bpfdumpfs entry also bumps refcnt of
bpf_link itself, so even if process that created link dies, bpf dumper
stays attached until its bpfdumpfs entry is deleted.
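A condensed pseudocode sketch of the proposed flow (every command name here is, as noted above, for illustration only; none of this is an existing kernel API):

```
/* 1. load/verify the dumper program */
prog_fd = bpf(BPF_PROG_LOAD, ...);

/* 2. create the dumper bpf_link; closing an unpinned link_fd
 *    detaches the dumper program */
link_fd = bpf(LINK_CREATE, { .prog_fd = prog_fd, ... });

/* 3. the link acts as a factory of seq_files; each call yields
 *    an independently readable FD */
seq_fd = bpf(BPF_DUMPER_OPEN_FILE, { .link_fd = link_fd });
read(seq_fd, buf, sizeof(buf));

/* 4. optionally pin the link so it survives process exit; open()
 *    of the pinned file behaves like BPF_DUMPER_OPEN_FILE */
bpf_obj_pin(link_fd, "/sys/fs/bpfdump/task/my_dumper");
```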

Apart from BPF_DUMPER_OPEN_FILE and open()'ing bpfdumpfs file duality,
it seems pretty consistent and follows safe-by-default auto-cleanup of
anonymous link, unless pinned in bpfdumpfs (or one can still pin
bpf_link in bpffs, but it can't be open()'ed the same way, it just
preserves BPF program from being cleaned up).

Out of all schemes I could come up with, this one seems most unified
and nicely fits into bpf_link infra. Thoughts?

>
> To facilitate target seq_ops->show() to get the
> bpf program easily, dumper creation increased
> the target-provided seq_file private data size
> so bpf program pointer is also stored in seq_file
> private data.
>
> Further, a seq_num which represents how many
> times bpf_dump_get_prog() has been called is also
> available to the target seq_ops->show().
> Such information can be used to e.g., print a
> banner before printing out actual data.
>
> Note the seq_num does not represent the number
> of unique kernel objects the bpf program has
> seen. But it should be a good approximation.
>
> A target feature BPF_DUMP_SEQ_NET_PRIVATE
> is implemented which is specifically useful for
> net based dumpers. It sets the net namespace
> to the current process's net namespace.
> This avoids changing existing net seq_ops
> in order to retrieve the net namespace from
> the seq_file pointer.
>
> For open dumper files, anonymous or not, the
> fdinfo will show the target and prog_id associated
> with that file descriptor. For dumper file itself,
> a kernel interface will be provided to retrieve the
> prog_id in one of the later patches.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h            |   5 +
>  include/uapi/linux/bpf.h       |   6 +-
>  kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c           |  11 +-
>  tools/include/uapi/linux/bpf.h |   6 +-
>  5 files changed, 362 insertions(+), 4 deletions(-)
>

[...]


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-14  5:56   ` Andrii Nakryiko
@ 2020-04-14 23:59     ` Yonghong Song
  2020-04-15  4:45       ` Andrii Nakryiko
  0 siblings, 1 reply; 71+ messages in thread
From: Yonghong Song @ 2020-04-14 23:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/13/20 10:56 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Given a loaded dumper bpf program, which already
>> knows which target it should bind to, there
>> two ways to create a dumper:
>>    - a file based dumper under hierarchy of
>>      /sys/kernel/bpfdump/ which uses can
>>      "cat" to print out the output.
>>    - an anonymous dumper which user application
>>      can "read" the dumping output.
>>
>> For file based dumper, BPF_OBJ_PIN syscall interface
>> is used. For anonymous dumper, BPF_PROG_ATTACH
>> syscall interface is used.
> 
> We discussed this offline with Yonghong a bit, but I thought I'd put
> my thoughts about this in writing for completeness. To me, it seems
> like the most consistent way to do both anonymous and named dumpers is
> through the following steps:

The main motivation for me to use bpf_link is to enumerate
anonymous bpf dumpers by using the idr based link_query mechanism
from one of Andrii's previous RFC patches, so I do not need to
re-invent the wheel.

But it looks like there are some difficulties:

> 
> 1. BPF_PROG_LOAD to load/verify program, that created program FD.
> 2. LINK_CREATE using that program FD and direntry FD. This creates
> dumper bpf_link (bpf_dumper_link), returns anonymous link FD. If link

A bpf dump program already has the target information (needed
for verification purposes), so it does not need a directory FD.
LINK_CREATE is probably not a good fit here.

A bpf dump program is kind of similar to a fentry/fexit program,
where after successful program loading, the program will know
where to attach the trampoline.

Looking at the kernel code, for a fentry/fexit program, at the
raw_tracepoint_open syscall the trampoline will be installed and
the bpf program will actually be called.

So, ideally, if we want to use a kernel bpf_link, we want to
return a cat-able bpf_link, because ultimately we want to query
the file descriptors which actually 'read' bpf program outputs.

The current bpf_link is not cat-able.
I tried to hack around this by manipulating fops and other stuff;
it may work, but looks ugly. Or we could create a bpf_catable_link
and build an infrastructure around that? Not sure whether it is
worthwhile for this one-off thing (bpfdump).

Or, to query anonymous bpf dumpers, I can just write a bpf dump
program that goes through all fds to find out.

BTW, my current approach (in my private branch),
anonymous dumper:
    bpf_raw_tracepoint_open(NULL, prog) -> cat-able fd
file dumper:
    bpf_obj_pin(prog, path)  -> a cat-able file

If you consider the program itself to be a link, this is like
what is described below in 3 and 4.


> FD is closed, dumper program is detached and dumper is destroyed
> (unless pinned in bpffs, just like with any other bpf_link.
> 3. At this point bpf_dumper_link can be treated like a factory of
> seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> illustration purposes) command, that accepts dumper link FD and
> returns a new seq_file FD, which can be read() normally (or, e.g.,
> cat'ed from shell).

In this case, link_query may not be accurate if a bpf_dumper_link
is created but there is no corresponding bpf_dumper_open_file. What
we really need is to iterate through all dumper seq_file FDs.

> 4. Additionally, this anonymous bpf_link can be pinned/mounted in
> bpfdumpfs. We can do it as BPF_OBJ_PIN or as a separate command. Once
> pinned at, e.g., /sys/fs/bpfdump/task/my_dumper, just opening that
> file is equivalent to BPF_DUMPER_OPEN_FILE and will create a new
> seq_file that can be read() independently from other seq_files opened
> against the same dumper. Pinning bpfdumpfs entry also bumps refcnt of
> bpf_link itself, so even if process that created link dies, bpf dumper
> stays attached until its bpfdumpfs entry is deleted.
> 
> Apart from BPF_DUMPER_OPEN_FILE and open()'ing bpfdumpfs file duality,
> it seems pretty consistent and follows safe-by-default auto-cleanup of
> anonymous link, unless pinned in bpfdumpfs (or one can still pin
> bpf_link in bpffs, but it can't be open()'ed the same way, it just
> preserves BPF program from being cleaned up).
> 
> Out of all schemes I could come up with, this one seems most unified
> and nicely fits into bpf_link infra. Thoughts?
> 
>>
>> To facilitate target seq_ops->show() to get the
>> bpf program easily, dumper creation increased
>> the target-provided seq_file private data size
>> so bpf program pointer is also stored in seq_file
>> private data.
>>
>> Further, a seq_num which represents how many
>> bpf_dump_get_prog() has been called is also
>> available to the target seq_ops->show().
>> Such information can be used to e.g., print
>> banner before printing out actual data.
>>
>> Note the seq_num does not represent the num
>> of unique kernel objects the bpf program has
>> seen. But it should be a good approximate.
>>
>> A target feature BPF_DUMP_SEQ_NET_PRIVATE
>> is implemented specifically useful for
>> net based dumpers. It sets net namespace
>> as the current process net namespace.
>> This avoids changing existing net seq_ops
>> in order to retrieve net namespace from
>> the seq_file pointer.
>>
>> For open dumper files, anonymous or not, the
>> fdinfo will show the target and prog_id associated
>> with that file descriptor. For dumper file itself,
>> a kernel interface will be provided to retrieve the
>> prog_id in one of the later patches.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h            |   5 +
>>   include/uapi/linux/bpf.h       |   6 +-
>>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
>>   kernel/bpf/syscall.c           |  11 +-
>>   tools/include/uapi/linux/bpf.h |   6 +-
>>   5 files changed, 362 insertions(+), 4 deletions(-)
>>
> 
> [...]
> 


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-14 23:59     ` Yonghong Song
@ 2020-04-15  4:45       ` Andrii Nakryiko
  2020-04-15 16:46         ` Alexei Starovoitov
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-15  4:45 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Tue, Apr 14, 2020 at 4:59 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/13/20 10:56 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> Given a loaded dumper bpf program, which already
> >> knows which target it should bind to, there
> >> two ways to create a dumper:
> >>    - a file based dumper under hierarchy of
> >>      /sys/kernel/bpfdump/ which uses can
> >>      "cat" to print out the output.
> >>    - an anonymous dumper which user application
> >>      can "read" the dumping output.
> >>
> >> For file based dumper, BPF_OBJ_PIN syscall interface
> >> is used. For anonymous dumper, BPF_PROG_ATTACH
> >> syscall interface is used.
> >
> > We discussed this offline with Yonghong a bit, but I thought I'd put
> > my thoughts about this in writing for completeness. To me, it seems
> > like the most consistent way to do both anonymous and named dumpers is
> > through the following steps:
>
> The main motivation for me to use bpf_link is to enumerate
> anonymous bpf dumpers by using idr based link_query mechanism in one
> of previous Andrii's RFC patch so I do not need to re-invent the wheel.
>
> But looks like there are some difficulties:
>
> >
> > 1. BPF_PROG_LOAD to load/verify program, that created program FD.
> > 2. LINK_CREATE using that program FD and direntry FD. This creates
> > dumper bpf_link (bpf_dumper_link), returns anonymous link FD. If link
>
> bpf dump program already have the target information as part of
> verification propose, so it does not need directory FD.
> LINK_CREATE probably not a good fit here.
>
> bpf dump program is kind similar to fentry/fexit program,
> where after successful program loading, the program will know
> where to attach trampoline.
>
> Looking at kernel code, for fentry/fexit program, at raw_tracepoint_open
> syscall, the trampoline will be installed and actually bpf program will
> be called.
>

The direntry FD doesn't have to be specified at attach time; I forgot
that it is already provided during load. That wasn't a requirement or
critical part. I think if we already had a LINK_CREATE command, we'd
never have had to create the RAW_TRACEPOINT_OPEN one; they could all
be the same command.

> So, ideally, if we want to use kernel bpf_link, we want to
> return a cat-able bpf_link because ultimately we want to query
> file descriptors which actually 'read' bpf program outputs.
>
> Current bpf_link is not cat-able.

Let's be precise here. By cat-able you mean that you'd like to just
start issuing read() calls and get the output of the bpfdump program,
is that right? Wouldn't that mean that you can read the output just
once? So it won't be possible to create an anonymous dumper and
periodically get up-to-date output. The user would need to call
RAW_TRACEPOINT_OPEN every single time it needs to do a dump. I guess
that would work, but I'm not seeing why it has to be that way.

What I proposed above was that once you create a bpf_link, you can use
that same bpf_link to open many seq_files, each with its own FD, which
can be read() independently of each other. This behavior would be
consistent with named bpfdumper, which can produce many independent
seq_files with each new open() syscall, but all from exactly the same
attached bpfdumper.

> I try to hack by manipulating fops and other stuff, it may work,
> but looks ugly. Or we create a bpf_catable_link and build an
> infrastructure around that? Not sure whether it is worthwhile for this
> one-off thing (bpfdump)?
>
> Or to query anonymous bpf dumpers, I can just write a bpf dump program
> to go through all fd's to find out.
>
> BTW, my current approach (in my private branch),
> anonymous dumper:
>     bpf_raw_tracepoint_open(NULL, prog) -> cat-able fd

So just to re-iterate. If my understanding is correct, this cat-able
fd is a single seq_file. If you want to dump it again, you would call
bpf_raw_tracepoint_open() again?

> file dumper:
>     bpf_obj_pin(prog, path)  -> a cat-able file

While in this case, you'd open() as many times as you need and get new
cat-able fd for each of those calls.

>
> If you consider program itself is a link, this is like what
> described below in 3 and 4.

A program is not a link. Same as a cgroup BPF program attached
somewhere to a cgroup is not a link, because that BPF program can be
attached to multiple cgroups, or even under multiple attach types to
the same cgroup. Same here: the same dumper can be "attached" in
bpfdumpfs multiple times, and each instance of attachment is a link,
but it's still the same program.

>
>
> > FD is closed, dumper program is detached and dumper is destroyed
> > (unless pinned in bpffs, just like with any other bpf_link.
> > 3. At this point bpf_dumper_link can be treated like a factory of
> > seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> > illustration purposes) command, that accepts dumper link FD and
> > returns a new seq_file FD, which can be read() normally (or, e.g.,
> > cat'ed from shell).
>
> In this case, link_query may not be accurate if a bpf_dumper_link
> is created but no corresponding bpf_dumper_open_file. What we really
> need to iterate through all dumper seq_file FDs.

If the goal is to iterate all the open seq_files (i.e., bpfdump active
sessions), then bpf_link is clearly not the right approach. But I
thought we were talking about iterating all the bpfdump program
attachments, not **sessions**, in which case bpf_link is exactly the
right approach.


>
> > 4. Additionally, this anonymous bpf_link can be pinned/mounted in
> > bpfdumpfs. We can do it as BPF_OBJ_PIN or as a separate command. Once
> > pinned at, e.g., /sys/fs/bpfdump/task/my_dumper, just opening that
> > file is equivalent to BPF_DUMPER_OPEN_FILE and will create a new
> > seq_file that can be read() independently from other seq_files opened
> > against the same dumper. Pinning bpfdumpfs entry also bumps refcnt of
> > bpf_link itself, so even if process that created link dies, bpf dumper
> > stays attached until its bpfdumpfs entry is deleted.
> >
> > Apart from BPF_DUMPER_OPEN_FILE and open()'ing bpfdumpfs file duality,
> > it seems pretty consistent and follows safe-by-default auto-cleanup of
> > anonymous link, unless pinned in bpfdumpfs (or one can still pin
> > bpf_link in bpffs, but it can't be open()'ed the same way, it just
> > preserves BPF program from being cleaned up).
> >
> > Out of all schemes I could come up with, this one seems most unified
> > and nicely fits into bpf_link infra. Thoughts?
> >
> >>
> >> To facilitate target seq_ops->show() to get the
> >> bpf program easily, dumper creation increased
> >> the target-provided seq_file private data size
> >> so bpf program pointer is also stored in seq_file
> >> private data.
> >>
> >> Further, a seq_num which represents how many
> >> bpf_dump_get_prog() has been called is also
> >> available to the target seq_ops->show().
> >> Such information can be used to e.g., print
> >> banner before printing out actual data.
> >>
> >> Note the seq_num does not represent the num
> >> of unique kernel objects the bpf program has
> >> seen. But it should be a good approximate.
> >>
> >> A target feature BPF_DUMP_SEQ_NET_PRIVATE
> >> is implemented specifically useful for
> >> net based dumpers. It sets net namespace
> >> as the current process net namespace.
> >> This avoids changing existing net seq_ops
> >> in order to retrieve net namespace from
> >> the seq_file pointer.
> >>
> >> For open dumper files, anonymous or not, the
> >> fdinfo will show the target and prog_id associated
> >> with that file descriptor. For dumper file itself,
> >> a kernel interface will be provided to retrieve the
> >> prog_id in one of the later patches.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   include/linux/bpf.h            |   5 +
> >>   include/uapi/linux/bpf.h       |   6 +-
> >>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
> >>   kernel/bpf/syscall.c           |  11 +-
> >>   tools/include/uapi/linux/bpf.h |   6 +-
> >>   5 files changed, 362 insertions(+), 4 deletions(-)
> >>
> >
> > [...]
> >


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-15  4:45       ` Andrii Nakryiko
@ 2020-04-15 16:46         ` Alexei Starovoitov
  2020-04-16  1:48           ` Andrii Nakryiko
  0 siblings, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-15 16:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:
> >
> > > FD is closed, dumper program is detached and dumper is destroyed
> > > (unless pinned in bpffs, just like with any other bpf_link.
> > > 3. At this point bpf_dumper_link can be treated like a factory of
> > > seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> > > illustration purposes) command, that accepts dumper link FD and
> > > returns a new seq_file FD, which can be read() normally (or, e.g.,
> > > cat'ed from shell).
> >
> > In this case, link_query may not be accurate if a bpf_dumper_link
> > is created but no corresponding bpf_dumper_open_file. What we really
> > need to iterate through all dumper seq_file FDs.
> 
> If the goal is to iterate all the open seq_files (i.e., bpfdump active
> sessions), then bpf_link is clearly not the right approach. But I
> thought we are talking about iterating all the bpfdump programs
> attachments, not **sessions**, in which case bpf_link is exactly the
> right approach.

That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?
Every time 'cat' opens it, a new seq_file is created with a new FD, right?
Reading of that file can take an infinite amount of time, since 'cat' can be
paused in the middle.
I think we're dealing with several different kinds of objects here.
1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/
2. given instance of seq_file after "template" was open
3. bpfdumper program
4. and now links. One bpf_link from the seq_file template to the bpf prog and
  many other bpf_links from actual seq_file kernel objects to the bpf prog.
  I think both kinds of links need to be iterable via get_next_id.

At the same time I don't think 1 and 2 are links.
read-ing link FD should not trigger program execution. link is the connecting
abstraction. It shouldn't be used to trigger anything. It's static.
Otherwise read-ing cgroup-bpf link would need to trigger cgroup bpf prog too.
FD that points to actual seq_file is the one that should be triggering
iteration of kernel objects and corresponding execution of linked prog.
That FD can be anon_inode returned from raw_tp_open (or something else)
or FD from open("/sys/kernel/bpfdump/foo").

The more I think about all the objects involved the more it feels that the
whole process should consist of three steps (instead of two).
1. load bpfdump prog
2. create seq_file-template in /sys/kernel/bpfdump/
   (not sure which api should do that)
3. use bpf_link_create api to attach bpfdumper prog to that seq_file-template

Then when the file is opened a new bpf_link is created for that reading session.
At the same time both kinds of links (to the template and to the seq_file)
should be iterable for observability reasons, but get_fd_from_id on them
should probably be disallowed, since holding such an FD to these special
links by another process has odd semantics.

Similarly for anon seq_file it should be three step process as well:
1. load bpfdump prog
2. create anon seq_file (api is tbd) that returns FD
3. use bpf_link_create to attach prog to seq_file FD

Maybe it's all overkill. These are just my thoughts so far.


* Re: [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves
  2020-04-10 22:18   ` Andrii Nakryiko
  2020-04-10 23:24     ` Yonghong Song
@ 2020-04-15 22:57     ` Yonghong Song
  1 sibling, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-15 22:57 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/10/20 3:18 PM, Andrii Nakryiko wrote:
> On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Here, the target refers to a particular data structure
>> inside the kernel we want to dump. For example, it
>> can be all task_structs in the current pid namespace,
>> or it could be all open files for all task_structs
>> in the current pid namespace.
>>
>> Each target is identified with the following information:
>>     target_rel_path   <=== relative path to /sys/kernel/bpfdump
>>     target_proto      <=== kernel func proto which represents
>>                            bpf program signature for this target
>>     seq_ops           <=== seq_ops for seq_file operations
>>     seq_priv_size     <=== seq_file private data size
>>     target_feature    <=== target specific feature which needs
>>                            handling outside seq_ops.
> 
> It's not clear what "feature" stands for here... Is this just a sort
> of private_data passed through to dumper?
> 
>>
>> The target relative path is a relative directory to /sys/kernel/bpfdump/.
>> For example, it could be:
>>     task                  <=== all tasks
>>     task/file             <=== all open files under all tasks
>>     ipv6_route            <=== all ipv6_routes
>>     tcp6/sk_local_storage <=== all tcp6 socket local storages
>>     foo/bar/tar           <=== all tar's in bar in foo
> 
> ^^ this seems useful, but I don't think code as is supports more than 2 levels?
> 
>>
>> The "target_feature" is mostly used for reusing existing seq_ops.
>> For example, for /proc/net/<> stats, the "net" namespace is often
>> stored in file private data. The target_feature enables bpf based
>> dumper to set "net" properly for itself before calling shared
>> seq_ops.
>>
>> bpf_dump_reg_target() is implemented so targets
>> can register themselves. Currently, module is not
>> supported, so there is no bpf_dump_unreg_target().
>> The main reason is that BTF is not available for modules
>> yet.
>>
>> Since target might call bpf_dump_reg_target() before
>> bpfdump mount point is created, __bpfdump_init()
>> may be called in bpf_dump_reg_target() as well.
>>
>> The file-based dumpers will be regular files under
>> the specific target directory. For example,
>>     task/my1      <=== dumper "my1" iterates through all tasks
>>     task/file/my2 <=== dumper "my2" iterates through all open files
>>                        under all tasks
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h |   4 +
>>   kernel/bpf/dump.c   | 190 +++++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 193 insertions(+), 1 deletion(-)
>>

>> +
> 
> [...]
> 
>> +       if (S_ISDIR(mode)) {
>> +               inode->i_op = i_ops;
>> +               inode->i_fop = f_ops;
>> +               inc_nlink(inode);
>> +               inc_nlink(dir);
>> +       } else {
>> +               inode->i_fop = f_ops;
>> +       }
>> +
>> +       d_instantiate(dentry, inode);
>> +       dget(dentry);
> 
> lookup_one_len already bumped refcount, why the second time here?

This is due to an artifact in security/inode.c:

void securityfs_remove(struct dentry *dentry)
{
         struct inode *dir;

         if (!dentry || IS_ERR(dentry))
                 return;

         dir = d_inode(dentry->d_parent);
         inode_lock(dir);
         if (simple_positive(dentry)) {
                 if (d_is_dir(dentry))
                         simple_rmdir(dir, dentry);
                 else
                         simple_unlink(dir, dentry);
                 dput(dentry);
         }
         inode_unlock(dir);
         simple_release_fs(&mount, &mount_count);
}
EXPORT_SYMBOL_GPL(securityfs_remove);

I did not implement bpfdumpfs_remove like the above.
I just use simple_unlink, so I indeed do not need the above dget().
I have removed it in RFC v2, tested it, and it works fine.

I think we may not need that additional reference either in
security/inode.c.

> 
>> +       inode_unlock(dir);
>> +       return dentry;
>> +
>> +dentry_put:
>> +       dput(dentry);
>> +       dentry = ERR_PTR(err);
>> +unlock:
>> +       inode_unlock(dir);
>> +       return dentry;
>> +}
>> +
> 
> [...]
> 


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-15 16:46         ` Alexei Starovoitov
@ 2020-04-16  1:48           ` Andrii Nakryiko
  2020-04-16  7:15             ` Yonghong Song
  2020-04-16 17:04             ` Alexei Starovoitov
  0 siblings, 2 replies; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-16  1:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 15, 2020 at 9:46 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:
> > >
> > > > FD is closed, dumper program is detached and dumper is destroyed
> > > > (unless pinned in bpffs, just like with any other bpf_link.
> > > > 3. At this point bpf_dumper_link can be treated like a factory of
> > > > seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> > > > illustration purposes) command, that accepts dumper link FD and
> > > > returns a new seq_file FD, which can be read() normally (or, e.g.,
> > > > cat'ed from shell).
> > >
> > > In this case, link_query may not be accurate if a bpf_dumper_link
> > > is created but no corresponding bpf_dumper_open_file. What we really
> > > need to iterate through all dumper seq_file FDs.
> >
> > If the goal is to iterate all the open seq_files (i.e., bpfdump active
> > sessions), then bpf_link is clearly not the right approach. But I
> > thought we are talking about iterating all the bpfdump programs
> > attachments, not **sessions**, in which case bpf_link is exactly the
> > right approach.
>
> That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?

Assuming it's not a rhetorical question, foo is a pinned bpf_dumper
link (in my interpretation of all this).

> Every time 'cat' opens it a new seq_file is created with new FD, right ?

yes

> Reading of that file can take infinite amount of time, since 'cat' can be
> paused in the middle.

yep, correct (though most use cases are probably going to be very short-lived)

> I think we're dealing with several different kinds of objects here.
> 1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/

Let's clarify here again, because this can be interpreted differently.

Are you talking about, e.g., /sys/fs/bpfdump/task directory that
defines what class of items should be iterated? Or you are talking
about named dumper: /sys/fs/bpfdump/task/my_dumper?

If the former, I agree that it's not a link. If the latter, then
that's what we've been so far calling "a named bpfdumper". Which is
what I argue is a link, pinned in bpfdumpfs (*not bpffs*).

UPD: reading further, seems like it's some third interpretation, so
please clarify.

> 2. given instance of seq_file after "template" was open

Right, corresponding to "bpfdump session" (has its own unique session_id).

> 3. bpfdumper program

Yep, BPF_PROG_LOAD returns FD to verified bpfdumper program.

> 4. and now links. One bpf_link from seq_file template to bpf prog and

So I guess "seq_file template" is /sys/kernel/bpfdump/tasks direntry
itself, which has to be specified as FD during BPF_PROG_LOAD, is that
right? If yes, I agree, "seq_file template" + attached bpf_prog is a
link.

>   many other bpf_links from actual seq_file kernel object to bpf prog.

I think this one is not a link at all. It's a bpfdumper session. For
me this is equivalent of a single BPF program invocation on cgroup due
to a single packet. I understand that in this case it's multiple BPF
program invocations, so it's not exactly 1:1, but if we had an easy
way to do iteration from inside BPF program over all, say, tasks, that
would be one BPF program invocation with a loop inside. So to me one
seq_file session is analogous to a single BPF program execution (or,
say one hardware event triggering one execution of perf_event BPF
program).

>   I think both kinds of links need to be iteratable via get_next_id.
>
> At the same time I don't think 1 and 2 are links.
> read-ing link FD should not trigger program execution. link is the connecting
> abstraction. It shouldn't be used to trigger anything. It's static.
> Otherwise read-ing cgroup-bpf link would need to trigger cgroup bpf prog too.
> FD that points to actual seq_file is the one that should be triggering
> iteration of kernel objects and corresponding execution of linked prog.

Yep, I agree totally; reading a bpf_link FD directly as if it were a
seq_file seems weird and would support only a single read.

> That FD can be anon_inode returned from raw_tp_open (or something else)

raw_tp_open currently always returns bpf_link FDs, so if this suddenly
returns readable seq_file instead, that would be weird, IMO.


> or FD from open("/sys/kernel/bpfdump/foo").

Agreed.

>
> The more I think about all the objects involved the more it feels that the
> whole process should consist of three steps (instead of two).
> 1. load bpfdump prog
> 2. create seq_file-template in /sys/kernel/bpfdump/
>    (not sure which api should do that)

Hm... ok, I think seq_file-template means something else entirely.
It's not an attached BPF program, but also not a /sys/fs/bpfdump/task
"provider". What is it and what is its purpose? Also, how is it
cleaned up if application crashes between creating "seq_file-template"
and attaching BPF program to it?

> 3. use bpf_link_create api to attach bpfdumper prog to that seq_file-template
>
> Then when the file is opened a new bpf_link is created for that reading session.
> At the same time both kinds of links (to teamplte and to seq_file) should be
> iteratable for observability reasons, but get_fd_from_id on them should probably
> be disallowed, since holding such FD to these special links by other process
> has odd semantics.

This special get_fd_from_id handling for bpfdumper links (in your
interpretation) looks like a sign that using bpf_link to represent a
specific bpfdumper session is not the right design.

As for observability of bpfdumper sessions, I think using a bpfdump
program + the task/file provider will give a good way to do this,
actually, with no need to maintain a separate IDR just for bpfdumper
sessions.

>
> Similarly for anon seq_file it should be three step process as well:
> 1. load bpfdump prog
> 2. create anon seq_file (api is tbd) that returns FD
> 3. use bpf_link_create to attach prog to seq_file FD
>
> May be it's all overkill. These are just my thoughts so far.

Just to contrast, in a condensed form, what I was proposing:

For named dumper:
1. load bpfdump prog
2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
bpf_link anon FD back
3. pin link in bpfdumpfs (e.g., /sys/fs/bpfdump/task/my_dumper)
4. each open() of /sys/fs/bpfdump/task/my_dumper produces new
bpfdumper session/seq_file

For anon dumper:
1. load bpfdump prog
2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
bpf_link anon FD back
3. give bpf_link FD to some new API (say, BPF_DUMP_NEW_SESSION or
whatever name) to create seq_file/bpfdumper session, which will create
FD that can be read(). One can do that many times, each time getting
its own bpfdumper session.

First two steps are exactly the same, as it should be, because a
named/anon dumper is still the same dumper. Note also that we can use
the bpf_link FD of a named dumper with the BPF_DUMP_NEW_SESSION command
to create sessions as well, which further underlines that the only
difference between named and anon dumpers is the bpfdumpfs direntry that
allows creating a new seq_file/session by doing a normal open(), instead
of BPF's BPF_DUMP_NEW_SESSION.

Named vs anon dumper is like "named" vs "anon" bpf_link -- we don't
even talk in those terms about bpf_link, because the only difference
is pinned direntry in a special FS, really.
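To make the contrast concrete, here is a toy Python model of the flow above. Every name in it (Provider, Link, Session, new_session, pin) is purely illustrative, standing in for APIs that do not exist yet; it only models the object relationships, not real kernel code:

```python
# Toy model: a "link" couples a prog with a provider and acts as a
# factory of independent read sessions (seq_files). A "named" dumper
# is the same link, merely pinned under a bpfdumpfs path.

class Provider:
    """Models e.g. /sys/fs/bpfdump/task: the class of iterated objects."""
    def __init__(self, name, items):
        self.name, self.items = name, items

class Link:
    """prog attached to a provider; a factory of sessions."""
    def __init__(self, prog, provider):
        self.prog, self.provider = prog, provider
        self.pinned_path = None          # set when pinned in bpfdumpfs

    def pin(self, path):                 # step 3 of the "named" flow
        self.pinned_path = path

    def new_session(self):               # open() or BPF_DUMP_NEW_SESSION
        return Session(self)

class Session:
    """One seq_file: a single pass of prog over the provider's items."""
    def __init__(self, link):
        self.link = link

    def read(self):
        return [self.link.prog(item) for item in self.link.provider.items]

tasks = Provider("task", ["pid 1", "pid 2"])
link = Link(lambda t: f"dump: {t}", tasks)     # steps 1+2, shared by both flows
link.pin("/sys/fs/bpfdump/task/my_dumper")     # named-only step
s1, s2 = link.new_session(), link.new_session()  # each open() = a new session
```

Note that s1 and s2 are distinct sessions backed by the one and same link, which is the whole point of the named/anon symmetry.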

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-16  1:48           ` Andrii Nakryiko
@ 2020-04-16  7:15             ` Yonghong Song
  2020-04-16 17:04             ` Alexei Starovoitov
  1 sibling, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-16  7:15 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/15/20 6:48 PM, Andrii Nakryiko wrote:
> On Wed, Apr 15, 2020 at 9:46 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:
>>>>
>>>>> FD is closed, dumper program is detached and dumper is destroyed
>>>>> (unless pinned in bpffs, just like with any other bpf_link.
>>>>> 3. At this point bpf_dumper_link can be treated like a factory of
>>>>> seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
>>>>> illustration purposes) command, that accepts dumper link FD and
>>>>> returns a new seq_file FD, which can be read() normally (or, e.g.,
>>>>> cat'ed from shell).
>>>>
>>>> In this case, link_query may not be accurate if a bpf_dumper_link
>>>> is created but no corresponding bpf_dumper_open_file. What we really
>>>> need to iterate through all dumper seq_file FDs.
>>>
>>> If the goal is to iterate all the open seq_files (i.e., bpfdump active
>>> sessions), then bpf_link is clearly not the right approach. But I
>>> thought we are talking about iterating all the bpfdump programs
>>> attachments, not **sessions**, in which case bpf_link is exactly the
>>> right approach.
>>
>> That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?
> 
> Assuming it's not a rhetorical question, foo is a pinned bpf_dumper
> link (in my interpretation of all this).
> 
>> Every time 'cat' opens it a new seq_file is created with new FD, right ?
> 
> yes
> 
>> Reading of that file can take infinite amount of time, since 'cat' can be
>> paused in the middle.
> 
> yep, correct (though most use case probably going to be very short-lived)
> 
>> I think we're dealing with several different kinds of objects here.
>> 1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/
> 
> Let's clarify here again, because this can be interpreted differently.
> 
> Are you talking about, e.g., /sys/fs/bpfdump/task directory that
> defines what class of items should be iterated? Or you are talking
> about named dumper: /sys/fs/bpfdump/task/my_dumper?
> 
> If the former, I agree that it's not a link. If the latter, then
> that's what we've been so far calling "a named bpfdumper". Which is
> what I argue is a link, pinned in bpfdumpfs (*not bpffs*).
> 
> UPD: reading further, seems like it's some third interpretation, so
> please clarify.
> 
>> 2. given instance of seq_file after "template" was open
> 
> Right, corresponding to "bpfdump session" (has its own unique session_id).
> 
>> 3. bpfdumper program
> 
> Yep, BPF_PROG_LOAD returns FD to verified bpfdumper program.
> 
>> 4. and now links. One bpf_link from seq_file template to bpf prog and
> 
> So I guess "seq_file template" is /sys/kernel/bpfdump/tasks direntry
> itself, which has to be specified as FD during BPF_PROG_LOAD, is that
> right? If yes, I agree, "seq_file template" + attached bpf_prog is a
> link.
> 
>>    many other bpf_links from actual seq_file kernel object to bpf prog.
> 
> I think this one is not a link at all. It's a bpfdumper session. For
> me this is equivalent of a single BPF program invocation on cgroup due
> to a single packet. I understand that in this case it's multiple BPF
> program invocations, so it's not exactly 1:1, but if we had an easy
> way to do iteration from inside BPF program over all, say, tasks, that
> would be one BPF program invocation with a loop inside. So to me one
> seq_file session is analogous to a single BPF program execution (or,
> say one hardware event triggering one execution of perf_event BPF
> program).
> 
>>    I think both kinds of links need to be iteratable via get_next_id.
>>
>> At the same time I don't think 1 and 2 are links.
>> read-ing link FD should not trigger program execution. link is the connecting
>> abstraction. It shouldn't be used to trigger anything. It's static.
>> Otherwise read-ing cgroup-bpf link would need to trigger cgroup bpf prog too.
>> FD that points to actual seq_file is the one that should be triggering
>> iteration of kernel objects and corresponding execution of linked prog.
> 
> Yep, I agree totally, reading bpf_link FD directly as if it was
> seq_file seems weird and would support only a single time to read.
> 
>> That FD can be anon_inode returned from raw_tp_open (or something else)
> 
> raw_tp_open currently always returns bpf_link FDs, so if this suddenly
> returns readable seq_file instead, that would be weird, IMO.
> 
> 
>> or FD from open("/sys/kernel/bpfdump/foo").
> 
> Agreed.
> 
>>
>> The more I think about all the objects involved the more it feels that the
>> whole process should consist of three steps (instead of two).
>> 1. load bpfdump prog
>> 2. create seq_file-template in /sys/kernel/bpfdump/
>>     (not sure which api should do that)
> 
> Hm... ok, I think seq_file-template means something else entirely.
> It's not an attached BPF program, but also not a /sys/fs/bpfdump/task
> "provider". What is it and what is its purpose? Also, how is it
> cleaned up if application crashes between creating "seq_file-template"
> and attaching BPF program to it?
> 
>> 3. use bpf_link_create api to attach bpfdumper prog to that seq_file-template
>>
>> Then when the file is opened a new bpf_link is created for that reading session.
>> At the same time both kinds of links (to teamplte and to seq_file) should be
>> iteratable for observability reasons, but get_fd_from_id on them should probably
>> be disallowed, since holding such FD to these special links by other process
>> has odd semantics.
> 
> This special get_fd_from_id handling for bpfdumper links (in your
> interpretation) looks like a sign that using bpf_link to represent a
> specific bpfdumper session is not the right design.
> 
> As for obserabilitiy of bpfdumper sessions, I think using bpfdump
> program + task/file provider will give a good way to do this,
> actually, with no need to maintain a separate IDR just for bpfdumper
> sessions.
> 
>>
>> Similarly for anon seq_file it should be three step process as well:
>> 1. load bpfdump prog
>> 2. create anon seq_file (api is tbd) that returns FD
>> 3. use bpf_link_create to attach prog to seq_file FD
>>
>> May be it's all overkill. These are just my thoughts so far.
> 
> Just to contrast, in a condensed form, what I was proposing:
> 
> For named dumper:
> 1. load bpfdump prog
> 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> bpf_link anon FD back

I actually tried a prototype earlier today.
For an existing tracing program (non-raw-tracepoint, e.g., fentry/fexit),
when raw_tracepoint_open() is called,
bpf_trampoline_link_prog() is called, the trampoline is actually
updated with the bpf program, and the program starts running. You can hold
this link_fd in the user application or pin it to /sys/fs/bpf.

That is what I referred to in my previous email about whether
we can have a 'cat'-able link or not. But it looks pretty hard.

Alternatively, we could still return a bpf_link.
The only thing the bpf_link does is hold a reference count for the bpf_prog
and nothing else. Later on, we can use this bpf_link to pin the dumper
or open an anonymous seq_file.

But since the bpf_link just holds a reference to the prog and nothing more,
that is why I mentioned I am not 100% sure whether a bpf_link is needed,
as I could achieve the same thing with the bpf_prog. Further,
it does not provide the ability to query open files (a bpf program
for the task/file target should be able to do that).

But if, for API consistency, we prefer raw_tracepoint_open() to
return a link fd, I can still do it, I guess. Or maybe link_query
is still useful in some way.


> 3. pin link in bpfdumpfs (e.g., /sys/fs/bpfdump/task/my_dumper)
> 4. each open() of /sys/fs/bpfdump/task/my_dumper produces new
> bpfdumper session/seq_file
> 
> For anon dumper:
> 1. load bpfdump prog
> 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> bpf_link anon FD back
> 3. give bpf_link FD to some new API (say, BPF_DUMP_NEW_SESSION or
> whatever name) to create seq_file/bpfdumper session, which will create
> FD that can be read(). One can do that many times, each time getting
> its own bpfdumper session.
> 
> First two steps are exactly the same, as it should be, because
> named/anon dumper is still the same dumper. Note also that we can use
> bpf_link FD of named dumper and BPF_DUMP_NEW_SESSION command to also
> create sessions, which further underlines that the only difference
> between named and anon dumper is this bpfdumpfs direntry that allows
> to create new seq_file/session by doing normal open(), instead of
> BPF's BPF_DUMP_NEW_SESSION.
> 
> Named vs anon dumper is like "named" vs "anon" bpf_link -- we don't
> even talk in those terms about bpf_link, because the only difference
> is pinned direntry in a special FS, really.
> 


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-16  1:48           ` Andrii Nakryiko
  2020-04-16  7:15             ` Yonghong Song
@ 2020-04-16 17:04             ` Alexei Starovoitov
  2020-04-16 19:35               ` Andrii Nakryiko
  1 sibling, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-16 17:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Wed, Apr 15, 2020 at 06:48:13PM -0700, Andrii Nakryiko wrote:
> On Wed, Apr 15, 2020 at 9:46 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:
> > > >
> > > > > FD is closed, dumper program is detached and dumper is destroyed
> > > > > (unless pinned in bpffs, just like with any other bpf_link.
> > > > > 3. At this point bpf_dumper_link can be treated like a factory of
> > > > > seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> > > > > illustration purposes) command, that accepts dumper link FD and
> > > > > returns a new seq_file FD, which can be read() normally (or, e.g.,
> > > > > cat'ed from shell).
> > > >
> > > > In this case, link_query may not be accurate if a bpf_dumper_link
> > > > is created but no corresponding bpf_dumper_open_file. What we really
> > > > need to iterate through all dumper seq_file FDs.
> > >
> > > If the goal is to iterate all the open seq_files (i.e., bpfdump active
> > > sessions), then bpf_link is clearly not the right approach. But I
> > > thought we are talking about iterating all the bpfdump programs
> > > attachments, not **sessions**, in which case bpf_link is exactly the
> > > right approach.
> >
> > That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?
> 
> Assuming it's not a rhetorical question, foo is a pinned bpf_dumper
> link (in my interpretation of all this).

It wasn't a rhetorical question, and your answer is different from mine :)
It's not a link. It's a template of a seq_file. It's the same as
$ stat /proc/net/ipv6_route
  File: ‘/proc/net/ipv6_route’
  Size: 0         	Blocks: 0          IO Block: 1024   regular empty file

> > Every time 'cat' opens it a new seq_file is created with new FD, right ?
> 
> yes
> 
> > Reading of that file can take infinite amount of time, since 'cat' can be
> > paused in the middle.
> 
> yep, correct (though most use case probably going to be very short-lived)
> 
> > I think we're dealing with several different kinds of objects here.
> > 1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/
> 
> Let's clarify here again, because this can be interpreted differently.
> 
> Are you talking about, e.g., /sys/fs/bpfdump/task directory that
> defines what class of items should be iterated? Or you are talking
> about named dumper: /sys/fs/bpfdump/task/my_dumper?

the latter.

> 
> If the former, I agree that it's not a link. If the latter, then
> that's what we've been so far calling "a named bpfdumper". Which is
> what I argue is a link, pinned in bpfdumpfs (*not bpffs*).

It cannot be a link, since a link is only a connection between
a kernel object and a bpf prog,
whereas the seq_file is such a kernel object.

> 
> For named dumper:
> 1. load bpfdump prog
> 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> bpf_link anon FD back
> 3. pin link in bpfdumpfs (e.g., /sys/fs/bpfdump/task/my_dumper)
> 4. each open() of /sys/fs/bpfdump/task/my_dumper produces new
> bpfdumper session/seq_file
> 
> For anon dumper:
> 1. load bpfdump prog
> 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> bpf_link anon FD back
> 3. give bpf_link FD to some new API (say, BPF_DUMP_NEW_SESSION or
> whatever name) to create seq_file/bpfdumper session, which will create
> FD that can be read(). One can do that many times, each time getting
> its own bpfdumper session.

I slept on it and still fundamentally disagree that seq_file + bpf_prog
is a derivative of link. Or in OO terms it's not a child class of bpf_link.
seq_file is its own class that should contain a bpf_link as one of its
members, but it shouldn't be derived from 'class bpf_link'.

In that sense Yonghong's proposed APIs (raw_tp_open to create an anon seq_file+prog
and obj_pin to create a template of a named seq_file+prog) are the best fit.
Implementation-wise, his 'struct extra_priv_data' needs to include a
'struct bpf_link' instead of a 'struct bpf_prog *prog;' directly.

So every time 'cat' opens the named seq_file there is a bpf_link registered in the IDR.
An anon seq_file should have another bpf_link as well.

My earlier suggestion to disallow get_fd_from_id for such links is wrong.
It's fine to get an FD to such a link, but it shouldn't prevent destruction
of the seq_file. 'cat' will close the named seq_file, and the 'struct extra_priv_data'
class should do a link_put. If some other process did get_fd_from_id, then such a link
will become dangling, just like removal of a netdev makes xdp links dangling.
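As a sketch of the containment and refcount behavior I mean (all class names here are purely illustrative, not real kernel code): the seq_file HAS-A link, closing the seq_file puts the link, and a held link FD does not keep the seq_file alive:

```python
# Toy model: seq_file contains a link member rather than deriving from
# bpf_link. Closing the seq_file drops its link reference; anyone still
# holding the link (e.g. via get_fd_from_id) is left with a dangling link.

class BpfProg:
    def __init__(self):
        self.refcnt = 0

class BpfSeqFileLink:
    def __init__(self, prog):
        self.prog, self.refcnt, self.seq_file = prog, 1, None
        prog.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.prog.refcnt -= 1

class BpfdumpSeqFile:
    """Owns a link member (composition, not inheritance)."""
    def __init__(self, prog):
        self.link = BpfSeqFileLink(prog)
        self.link.seq_file = self

    def close(self):
        self.link.seq_file = None   # link dangles if someone holds an FD to it
        self.link.put()

prog = BpfProg()
sf = BpfdumpSeqFile(prog)
held = sf.link                      # another process did get_fd_from_id
held.refcnt += 1
sf.close()                          # 'cat' exits: seq_file gone, link dangles
```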


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-16 17:04             ` Alexei Starovoitov
@ 2020-04-16 19:35               ` Andrii Nakryiko
  2020-04-16 23:18                 ` Alexei Starovoitov
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-16 19:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Thu, Apr 16, 2020 at 10:04 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Apr 15, 2020 at 06:48:13PM -0700, Andrii Nakryiko wrote:
> > On Wed, Apr 15, 2020 at 9:46 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:
> > > > >
> > > > > > FD is closed, dumper program is detached and dumper is destroyed
> > > > > > (unless pinned in bpffs, just like with any other bpf_link.
> > > > > > 3. At this point bpf_dumper_link can be treated like a factory of
> > > > > > seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> > > > > > illustration purposes) command, that accepts dumper link FD and
> > > > > > returns a new seq_file FD, which can be read() normally (or, e.g.,
> > > > > > cat'ed from shell).
> > > > >
> > > > > In this case, link_query may not be accurate if a bpf_dumper_link
> > > > > is created but no corresponding bpf_dumper_open_file. What we really
> > > > > need to iterate through all dumper seq_file FDs.
> > > >
> > > > If the goal is to iterate all the open seq_files (i.e., bpfdump active
> > > > sessions), then bpf_link is clearly not the right approach. But I
> > > > thought we are talking about iterating all the bpfdump programs
> > > > attachments, not **sessions**, in which case bpf_link is exactly the
> > > > right approach.
> > >
> > > That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?
> >
> > Assuming it's not a rhetorical question, foo is a pinned bpf_dumper
> > link (in my interpretation of all this).
>
> It wasn't rhetorical question and your answer is differrent from mine :)
> It's not a link. It's a template of seq_file. It's the same as
> $ stat /proc/net/ipv6_route
>   File: ‘/proc/net/ipv6_route’
>   Size: 0               Blocks: 0          IO Block: 1024   regular empty file

I don't see a contradiction. Pinning bpfdumper link in bpfdumpfs will
create a direntry and corresponding inode. That inode's i_private
field will contain a pointer to that link. When that direntry is
open()'ed, seq_file is going to be created. That seq_file will
probably need to take refcnt on underlying bpf_link and store it in
its private data. I was *not* implying that
/sys/kernel/bpfdump/tasks/foo is same as bpf_link pinned in bpffs,
which you can restore by doing BPF_OBJ_GET. It's more of a "backed by
bpf_link", if that helps to clarify.

But in your terminology, bpfdumper bpf_link *is* "a template of
seq_file", that I agree.

>
> > > Every time 'cat' opens it a new seq_file is created with new FD, right ?
> >
> > yes
> >
> > > Reading of that file can take infinite amount of time, since 'cat' can be
> > > paused in the middle.
> >
> > yep, correct (though most use case probably going to be very short-lived)
> >
> > > I think we're dealing with several different kinds of objects here.
> > > 1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/
> >
> > Let's clarify here again, because this can be interpreted differently.
> >
> > Are you talking about, e.g., /sys/fs/bpfdump/task directory that
> > defines what class of items should be iterated? Or you are talking
> > about named dumper: /sys/fs/bpfdump/task/my_dumper?
>
> the latter.
>
> >
> > If the former, I agree that it's not a link. If the latter, then
> > that's what we've been so far calling "a named bpfdumper". Which is
> > what I argue is a link, pinned in bpfdumpfs (*not bpffs*).
>
> It cannot be a link, since link is only a connection between
> kernel object and bpf prog.
> Whereas seq_file is such kernel object.

Not sure, but maybe that's where the disconnect is? A seq_file instance
is a derivative of bpf_prog + bpfdump provider. That coupling of bpf_prog
and provider is a link to me. That bpf_link can then be used to "produce"
many independent seq_files.

I do agree that link is a connection between prog and kernel object,
but I argue that "kernel object" in this case is bpfdumper provider
(e.g., what is backing /sys/fs/bpfdump/task), not any specific
seq_file.

>
> >
> > For named dumper:
> > 1. load bpfdump prog
> > 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> > bpf_link anon FD back
> > 3. pin link in bpfdumpfs (e.g., /sys/fs/bpfdump/task/my_dumper)
> > 4. each open() of /sys/fs/bpfdump/task/my_dumper produces new
> > bpfdumper session/seq_file
> >
> > For anon dumper:
> > 1. load bpfdump prog
> > 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> > bpf_link anon FD back
> > 3. give bpf_link FD to some new API (say, BPF_DUMP_NEW_SESSION or
> > whatever name) to create seq_file/bpfdumper session, which will create
> > FD that can be read(). One can do that many times, each time getting
> > its own bpfdumper session.
>
> I slept on it and still fundamentally disagree that seq_file + bpf_prog
> is a derivative of link. Or in OoO terms it's not a child class of bpf_link.
> seq_file is its own class that should contain bpf_link as one of its
> members, but it shouldn't be derived from 'class bpf_link'.

Referring to inheritance here doesn't seem necessary or helpful, I'd
rather not confuse and complicate all this further.

bpfdump provider/target + bpf_prog = bpf_link. bpf_link is "a factory"
of seq_files. That's it, no inheritance.


>
> In that sense Yonghong proposed api (raw_tp_open to create anon seq_file+prog
> and obj_pin to create a template of named seq_file+prog) are the best fit.
> Implementation wise his 'struct extra_priv_data' needs to include
> 'struct bpf_link' instead of 'struct bpf_prog *prog;' directly.
>
> So evertime 'cat' opens named seq_file there is bpf_link registered in IDR.
> Anon seq_file should have another bpf_link as well.

So that's where I disagree and don't see the point of having all those
short-lived bpf_links. cat opening seq_file doesn't create a bpf_link,
it creates a seq_file. If we want to associate some ID with it, it's
fine, but it's not a bpf_link ID (in my opinion, of course).

>
> My earlier suggestion to disallow get_fd_from_id for such links is wrong.
> It's fine to get an FD to such link, but it shouldn't prevent destruction

This is again a custom limitation and special-case implementation, which
again I think is a sign of a not-ideal design for this. And now that
we'll have a bpfdumper to iterate task/file, I also don't think that
everything should have an ID to be "iterable" anymore.


> of seq_file. 'cat' will close named seq_file and 'struct extra_priv_data' class
> should do link_put. If some other process did get_fd_from_id then such link will
> become dangling. Just like removal of netdev will make dangling xdp links.


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-16 19:35               ` Andrii Nakryiko
@ 2020-04-16 23:18                 ` Alexei Starovoitov
  2020-04-17  5:11                   ` Andrii Nakryiko
  0 siblings, 1 reply; 71+ messages in thread
From: Alexei Starovoitov @ 2020-04-16 23:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Thu, Apr 16, 2020 at 12:35:07PM -0700, Andrii Nakryiko wrote:
> >
> > I slept on it and still fundamentally disagree that seq_file + bpf_prog
> > is a derivative of link. Or in OoO terms it's not a child class of bpf_link.
> > seq_file is its own class that should contain bpf_link as one of its
> > members, but it shouldn't be derived from 'class bpf_link'.
> 
> Referring to inheritance here doesn't seem necessary or helpful, I'd
> rather not confuse and complicate all this further.
> 
> bpfdump provider/target + bpf_prog = bpf_link. bpf_link is "a factory"
> of seq_files. That's it, no inheritance.

The named seq_file in bpfdumpfs does indeed look like the "factory" pattern.
And yes, there is no inheritance between the named seq_file and a given seq_file after open().

> > In that sense Yonghong proposed api (raw_tp_open to create anon seq_file+prog
> > and obj_pin to create a template of named seq_file+prog) are the best fit.
> > Implementation wise his 'struct extra_priv_data' needs to include
> > 'struct bpf_link' instead of 'struct bpf_prog *prog;' directly.
> >
> > So evertime 'cat' opens named seq_file there is bpf_link registered in IDR.
> > Anon seq_file should have another bpf_link as well.
> 
> So that's where I disagree and don't see the point of having all those
> short-lived bpf_links. cat opening seq_file doesn't create a bpf_link,
> it creates a seq_file. If we want to associate some ID with it, it's
> fine, but it's not a bpf_link ID (in my opinion, of course).

I thought we're on the same page with the definition of bpf_link ;)
Let's recap. To make it easier I'll keep using object oriented analogy
since I think it's the most appropriate to internalize all the concepts.
- first what is file descriptor? It's nothing but std::shared_ptr<> to some kernel object.
- then there is a key class == struct bpf_link
- for raw tracepoints raw_tp_open() returns an FD to child class of bpf_link
  which is 'struct bpf_raw_tp_link'.
  In other words it returns std::shared_ptr<struct bpf_raw_tp_link>.
- for fentry/fexit/freplace/lsm raw_tp_open() returns an FD to a different child
  class of bpf_link which is 'struct bpf_tracing_link'.
  This is std::shared_ptr<struct bpf_tracing_link>.
- for cgroup-bpf progs bpf_link_create() returns an FD to a child class of bpf_link
  which is 'struct bpf_cgroup_link'.
  This is std::shared_ptr<struct bpf_cgroup_link>.

In all those cases three different shared pointers are seen as file descriptors
from the process pov but they point to different children of bpf_link base class.
link_update() is a method of the base class bpf_link and it has to work for
all child classes.
Similarly your future get_obj_info_by_fd() from any of these three shared pointers
will return information specific to that child class.
In all those cases one link attaches one program to one kernel object.

Now back to bpfdumpfs.
In Yonghong's latest patches raw_tp_open() returns an FD that is a pointer
to a seq_file. This is an existing kernel base class. It has its own seq_operations
virtual methods, defined for bpfdumpfs_seq_file, a child class
of seq_file that keeps the start/stop/next methods as-is and overrides the show()
method to be able to call the bpf prog for every iterable kernel object.

What you're proposing is to make the bpfdump_seq_file class a child of two
base classes (seq_file and bpf_link), whereas I'm saying that it should be
a child of seq_file only, since bpf_link methods do not apply to it.
Like there is no sensible behavior for link_update() on such a dual-parent object.

In my proposal the bpfdump_seq_file class keeps cat-ability and all methods of seq_file,
and no extra methods from bpf_link that don't belong in seq_file.
But I'm arguing that the bpfdump_seq_file class should have a bpf_link member
instead of simply holding a bpf_prog via refcnt.
Let's call this child class of bpf_link the bpf_seq_file_link class. Having
bpf_seq_file_link as member would mean that such link is discoverable via IDR,
the user process can get an FD to it and can do get_obj_info_by_fd().
The information returned for such link will be a pair (bpfdump_prog, bpfdump_seq_file).
Meaning that at any given time 'bpftool link show' will show where every bpf
prog in the system is attached to.
Say a named bpfdump_seq_file exists in /sys/kernel/bpfdump/tasks/foo.
No one is doing a 'cat' on it yet.
"bpftool link show" will show one link, which is the pair (bpfdump_prog, "tasks/foo").
Now two humans are doing 'cat' of that file.
The bpfdump_prog refcnt is now 3 and there are two additional seq_files created
by the kernel when the users did open("/sys/kernel/bpfdump/tasks/foo").
If these two humans are slow, somebody could have done "rm /sys/kernel/bpfdump/tasks/foo",
and that bpfdump_seq_file and its member bpf_seq_file_link would be gone,
but the two other bpfdump_seq_file-s are still active, and they are different.
"bpftool link show" should then show two pairs, (bpfdump_prog, seq_file_A) and
(bpfdump_prog, seq_file_B).
The users could have been in different pid namespaces, and what seq_file_A is
iterating could be completely different from seq_file_B, but I think it's
useful for an admin to know where all bpf progs in the system are attached and
what kinds of things are triggering them.
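A toy walk-through of this tasks/foo scenario (class names and the IDR list are illustrative stand-ins, not real kernel code):

```python
# One link for the pinned named dumper plus one link per open seq_file,
# all registered in an IDR so "bpftool link show" can enumerate them.

class Prog:
    def __init__(self):
        self.refcnt = 0

idr = []                        # stands in for the kernel link IDR

class SeqFileLink:
    def __init__(self, prog, what):
        self.prog, self.what = prog, what
        prog.refcnt += 1
        idr.append(self)

    def destroy(self):
        self.prog.refcnt -= 1
        idr.remove(self)

bpfdump_prog = Prog()
named = SeqFileLink(bpfdump_prog, "tasks/foo")     # pinned named seq_file
a = SeqFileLink(bpfdump_prog, "seq_file_A")        # first 'cat'
b = SeqFileLink(bpfdump_prog, "seq_file_B")        # second 'cat'
assert bpfdump_prog.refcnt == 3                    # as in the text above

named.destroy()                  # rm /sys/kernel/bpfdump/tasks/foo
link_show = [(l.prog, l.what) for l in idr]        # the two slow readers remain
```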


* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-16 23:18                 ` Alexei Starovoitov
@ 2020-04-17  5:11                   ` Andrii Nakryiko
  2020-04-19  6:11                     ` Yonghong Song
  0 siblings, 1 reply; 71+ messages in thread
From: Andrii Nakryiko @ 2020-04-17  5:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Martin KaFai Lau,
	Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Thu, Apr 16, 2020 at 4:18 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 16, 2020 at 12:35:07PM -0700, Andrii Nakryiko wrote:
> > >
> > > I slept on it and still fundamentally disagree that seq_file + bpf_prog
> > > is a derivative of link. Or in OoO terms it's not a child class of bpf_link.
> > > seq_file is its own class that should contain bpf_link as one of its
> > > members, but it shouldn't be derived from 'class bpf_link'.
> >
> > Referring to inheritance here doesn't seem necessary or helpful, I'd
> > rather not confuse and complicate all this further.
> >
> > bpfdump provider/target + bpf_prog = bpf_link. bpf_link is "a factory"
> > of seq_files. That's it, no inheritance.
>
> named seq_file in bpfdumpfs does indeed look like "factory" pattern.
> And yes, there is no inheritance between named seq_file and given seq_file after open().
>
> > > In that sense Yonghong proposed api (raw_tp_open to create anon seq_file+prog
> > > and obj_pin to create a template of named seq_file+prog) are the best fit.
> > > Implementation wise his 'struct extra_priv_data' needs to include
> > > 'struct bpf_link' instead of 'struct bpf_prog *prog;' directly.
> > >
> > > So every time 'cat' opens named seq_file there is bpf_link registered in IDR.
> > > Anon seq_file should have another bpf_link as well.
> >
> > So that's where I disagree and don't see the point of having all those
> > short-lived bpf_links. cat opening seq_file doesn't create a bpf_link,
> > it creates a seq_file. If we want to associate some ID with it, it's
> > fine, but it's not a bpf_link ID (in my opinion, of course).
>
> I thought we're on the same page with the definition of bpf_link ;)
> Let's recap. To make it easier I'll keep using object oriented analogy
> since I think it's the most appropriate to internalize all the concepts.
> - first what is file descriptor? It's nothing but std::shared_ptr<> to some kernel object.

I agree overall, but if I may be 100% pedantic, FD and kernel objects
topology can be quite a bit more complicated:

FD ---> struct file --(private_data)----> kernel object
     /                                 /
FD --                                 /
                                     /
FD ---> struct file --(private_data)/

I'll refer to this a bit further down.

> - then there is a key class == struct bpf_link
> - for raw tracepoints raw_tp_open() returns an FD to child class of bpf_link
>   which is 'struct bpf_raw_tp_link'.
>   In other words it returns std::shared_ptr<struct bpf_raw_tp_link>.
> - for fentry/fexit/freplace/lsm raw_tp_open() returns an FD to a different child
>   class of bpf_link which is "struct bpf_tracing_link".
>   This is std::shared_ptr<struct bpf_tracing_link>.
> - for cgroup-bpf progs bpf_link_create() returns an FD to child class of bpf_link
>   which is 'struct bpf_cgroup_link'.
>   This is std::shared_ptr<struct bpf_cgroup_link>.
>
> In all those cases three different shared pointers are seen as file descriptors
> from the process pov but they point to different children of bpf_link base class.
> link_update() is a method of base class bpf_link and it has to work for
> all children classes.
> Similarly your future get_obj_info_by_fd() from any of these three shared pointers
> will return information specific to that child class.
> In all those cases one link attaches one program to one kernel object.
>

Thank you for a nice recap! :)

> Now back to bpfdumpfs.
> In the latest Yonghong's patches raw_tp_open() returns an FD that is a pointer
> to seq_file. This is an existing kernel base class. It has its own seq_operations
> virtual methods that are defined for bpfdumpfs_seq_file which is a child class
> of seq_file that keeps start/stop/next methods as-is and overrides show()
> method to be able to call the bpf prog for every iterable kernel object.
>
> What you're proposing is to make the bpfdump_seq_file class a child of two
> base classes (seq_file and bpf_link) whereas I'm saying that it should be
> a child of seq_file only, since bpf_link methods do not apply to it.
> Like there is no sensible behavior for link_update() on such dual parent object.
>
> In my proposal bpfdump_seq_file class keeps cat-ability and all methods of seq_file
> and no extra methods from bpf_link that don't belong in seq_file.
> But I'm arguing that bpfdump_seq_file class should have a member bpf_link
> instead of simply holding bpf_prog via refcnt.
> Let's call this child class of bpf_link the bpf_seq_file_link class. Having
> bpf_seq_file_link as member would mean that such link is discoverable via IDR,
> the user process can get an FD to it and can do get_obj_info_by_fd().
> The information returned for such link will be a pair (bpfdump_prog, bpfdump_seq_file).
> Meaning that at any given time 'bpftool link show' will show where every bpf
> prog in the system is attached.
> Say named bpfdump_seq_file exists in /sys/kernel/bpfdump/tasks/foo.
> No one is doing a 'cat' on it yet.
> "bpftool link show" will show one link which is a pair (bpfdump_prog, "tasks/foo").
> Now two humans are doing 'cat' of that file.
> The bpfdump_prog refcnt is now 3 and there are two additional seq_files created
> by the kernel when user said open("/sys/kernel/bpfdump/tasks/foo").
> If these two humans are slow somebody could have done "rm /sys/kernel/bpfdump/tasks/foo"
> and that bpfdump_seq_file and its member bpf_seq_file_link would be gone,
> but two other bpfdump_seq_files are still active and they are different.
> "bpftool link show" should be showing two pairs (bpfdump_prog, seq_file_A) and
> (bpfdump_prog, seq_file_B).
> The users could have been in different pid namespaces. What seq_file_A is
> iterating could be completely different from seq_file_B, but I think it's
> useful for admin to know where all bpf progs in the system are attached and
> what kind of things are triggering them.

How exactly bpf_link is implemented for bpfdumper is not all that
important to me. It can be a separate struct, a field, a pointer to a
separate struct -- not that different.

I didn't mean for this thread to be just another endless discussion,
so I'll try to wrap it up in this email. I really like the bpfdumper idea
and usability overall. Getting a call at the end of iteration is a big
deal and I'm glad I got at least that :)

But let me try to point out a few things you proposed above that I
disagree with on a high level, as well as provide a few supporting
points for the scheme I proposed previously. If all that is not
convincing, I rest my case and I won't object to bpfdumper going in in
any form, as long as I can use it anonymously with an extra call at the
end to do post-aggregation.

So, first. I do not see the point of treating each instance of seq_file
as if it were a new bpf_link:
1. It's a bit like saying that each inherited cgroup bpf program in
effective_prog_array should have a new bpf_link created. That's not how
it's done for cgroups, and I think for a good reason.
2. Further, each seq_file, when created from the "seq_file template",
should take a refcnt on the bpf_prog, not the bpf_link. Because a
seq_file expects the bpf_prog itself to be exactly the same throughout
the entire iteration process. A bpf_link, on the other hand, allows an
in-place update of the bpf_prog, which could potentially ruin the
seq_file iteration. I know we can disable that, but it just feels like
an arbitrary restriction.
3. Suppose each seq_file is/has a bpf_link and one can iterate over each
active seq_file (what I've been calling a session, but whatever). What
kind of user-facing info can you get from get_obj_info? prog_id,
prog_tag, provider ID/name (i.e., /sys/fs/bpfdump/task). Is that
useful? Yes! Is it enough to do anything actionable? No! Immediately
you'd need to know the PIDs of all processes that have an FD open to
that seq_file (and see the diagram above; there could be many processes
with many FDs for the same seq_file). bpf_link doesn't know all the
PIDs. So it's this generic "who has this file opened" problem all over
again, which I'm already pretty tired of talking about :) Except now we
have at least 3 ways to answer such questions: iterate procfs+fdinfo,
drgn scripts, and now also a bpfdump program for the task/file provider.

So even if you can enumerate active bpfdump seq_files in the system,
you still need extra info and to iterate over task/file items to be
able to do anything about it. That is one of the reasons I think
auto-creating bpf_links for each seq_file is useless and will just
pollute the view of the system (from a bpf_link standpoint).

Now, second. Getting back to what I proposed with the 3-4 step process
(load --> attach (create_link) --> (pin in bpfdumpfs + open() |
BPF_NEW_DUMP_SESSION)). I realize now that attach might seem
superfluous here, because it doesn't provide any extra information (the
FD of the provider was specified at prog load time). It does feel a bit
weird, but:

1. It's not that weird, because fentry/fexit/freplace and tp_btf also
don't provide anything extra: all the info was specified at load time.
2. This attach step is a good point to provide some sort of
"parametrization" to narrow down behavior of providers. I'll give two
examples that I think are going to be very useful and we'll eventually
add support for them in one way or another.

Example A. task/file provider. Instead of iterating over all tasks,
the simplest extension would be to specify **one** specific task PID
to iterate all files of. Attach time would be the place to specify
this PID. We don't need to know the PID at load time, because it doesn't
change anything about BPF program validation and the verifier just
doesn't need to know. So now, once attached, a bpf_link is created that
can be
pinned in bpfdumpfs or BPF_NEW_DUMP_SESSION can be used to create
potentially many seq_files (e.g., poll every second) to see all open
files from a specific task. We can keep generalizing to, say, having
all tasks in a given cgroup. All that can be implemented by filtering
out inside BPF program, of course, but having narrower scope from the
beginning could save tons of time and resources.

Example B. Iterating BPF map items. We already have bpf_map provider,
next could be bpf_map/items, which would call BPF program for each
key/value pair, something like:

int BPF_PROG(for_each_map_kv, struct seq_file *seq, struct bpf_map *map,
             void *key, size_t key_size, void *value, size_t value_size)
{
    ...
}

Now, once you have that, a natural next desire is to say "only dump
items of map with ID 123", instead of iterating over all BPF maps in
the system. That map ID could be specified at attachment time, when
a bpf_link with these parameters is going to be created. Again, at load
time the BPF verifier doesn't need to know the specific BPF map we are
going to iterate, if we stick to generic key/value blob semantics.

So with such possibilities considered, I hope having an explicit
LINK_CREATE step starts making much more sense. This, plus not having
to distinguish between named and anonymous dumpers (just like we don't
distinguish a pinned, i.e. "named", bpf_link from an anonymous one),
makes me still believe that this is a better approach.

But alas, my goal here is to bring different perspectives, not to
obstruct or delay progress. So I'm going to spend some more time
reviewing v2 and will provide feedback on relevant patches, but if my
arguments were not convincing, I'm fine with that. I managed to
convince you guys that an "anonymous" bpfdumper without bpfdumpfs
pinning and a post-aggregation callback are a good thing, and I'm happy
about that already. Can't get 100% of what I want, right? :)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
  2020-04-17  5:11                   ` Andrii Nakryiko
@ 2020-04-19  6:11                     ` Yonghong Song
  0 siblings, 0 replies; 71+ messages in thread
From: Yonghong Song @ 2020-04-19  6:11 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Martin KaFai Lau, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 4/16/20 10:11 PM, Andrii Nakryiko wrote:
> On Thu, Apr 16, 2020 at 4:18 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Thu, Apr 16, 2020 at 12:35:07PM -0700, Andrii Nakryiko wrote:
>>>>
>>>> I slept on it and still fundamentally disagree that seq_file + bpf_prog
>>>> is a derivative of link. Or in OOP terms it's not a child class of bpf_link.
>>>> seq_file is its own class that should contain bpf_link as one of its
>>>> members, but it shouldn't be derived from 'class bpf_link'.
>>>
>>> Referring to inheritance here doesn't seem necessary or helpful, I'd
>>> rather not confuse and complicate all this further.
>>>
>>> bpfdump provider/target + bpf_prog = bpf_link. bpf_link is "a factory"
>>> of seq_files. That's it, no inheritance.
>>
>> named seq_file in bpfdumpfs does indeed look like "factory" pattern.
>> And yes, there is no inheritance between named seq_file and given seq_file after open().
>>
>>>> In that sense Yonghong proposed api (raw_tp_open to create anon seq_file+prog
>>>> and obj_pin to create a template of named seq_file+prog) are the best fit.
>>>> Implementation wise his 'struct extra_priv_data' needs to include
>>>> 'struct bpf_link' instead of 'struct bpf_prog *prog;' directly.
>>>>
>> So every time 'cat' opens named seq_file there is bpf_link registered in IDR.
>>>> Anon seq_file should have another bpf_link as well.
>>>
>>> So that's where I disagree and don't see the point of having all those
>>> short-lived bpf_links. cat opening seq_file doesn't create a bpf_link,
>>> it creates a seq_file. If we want to associate some ID with it, it's
>>> fine, but it's not a bpf_link ID (in my opinion, of course).
>>
>> I thought we're on the same page with the definition of bpf_link ;)
>> Let's recap. To make it easier I'll keep using object oriented analogy
>> since I think it's the most appropriate to internalize all the concepts.
>> - first what is file descriptor? It's nothing but std::shared_ptr<> to some kernel object.
> 
> I agree overall, but if I may be 100% pedantic, FD and kernel objects
> topology can be quite a bit more complicated:
> 
> FD ---> struct file --(private_data)----> kernel object
>       /                                 /
> FD --                                 /
>                                       /
> FD ---> struct file --(private_data)/
> 
> I'll refer to this a bit further down.
> 
>> - then there is a key class == struct bpf_link
>> - for raw tracepoints raw_tp_open() returns an FD to child class of bpf_link
>>    which is 'struct bpf_raw_tp_link'.
>>    In other words it returns std::shared_ptr<struct bpf_raw_tp_link>.
>> - for fentry/fexit/freplace/lsm raw_tp_open() returns an FD to a different child
>>    class of bpf_link which is "struct bpf_tracing_link".
>>    This is std::shared_ptr<struct bpf_tracing_link>.
>> - for cgroup-bpf progs bpf_link_create() returns an FD to child class of bpf_link
>>    which is 'struct bpf_cgroup_link'.
>>    This is std::shared_ptr<struct bpf_cgroup_link>.
>>
>> In all those cases three different shared pointers are seen as file descriptors
>> from the process pov but they point to different children of bpf_link base class.
>> link_update() is a method of base class bpf_link and it has to work for
>> all children classes.
>> Similarly your future get_obj_info_by_fd() from any of these three shared pointers
>> will return information specific to that child class.
>> In all those cases one link attaches one program to one kernel object.
>>
> 
> Thank you for a nice recap! :)
> 
>> Now back to bpfdumpfs.
>> In the latest Yonghong's patches raw_tp_open() returns an FD that is a pointer
>> to seq_file. This is an existing kernel base class. It has its own seq_operations
>> virtual methods that are defined for bpfdumpfs_seq_file which is a child class
>> of seq_file that keeps start/stop/next methods as-is and overrides show()
>> method to be able to call the bpf prog for every iterable kernel object.
>>
>> What you're proposing is to make the bpfdump_seq_file class a child of two
>> base classes (seq_file and bpf_link) whereas I'm saying that it should be
>> a child of seq_file only, since bpf_link methods do not apply to it.
>> Like there is no sensible behavior for link_update() on such dual parent object.
>>
>> In my proposal bpfdump_seq_file class keeps cat-ability and all methods of seq_file
>> and no extra methods from bpf_link that don't belong in seq_file.
>> But I'm arguing that bpfdump_seq_file class should have a member bpf_link
>> instead of simply holding bpf_prog via refcnt.
>> Let's call this child class of bpf_link the bpf_seq_file_link class. Having
>> bpf_seq_file_link as member would mean that such link is discoverable via IDR,
>> the user process can get an FD to it and can do get_obj_info_by_fd().
>> The information returned for such link will be a pair (bpfdump_prog, bpfdump_seq_file).
>> Meaning that at any given time 'bpftool link show' will show where every bpf
>> prog in the system is attached.
>> Say named bpfdump_seq_file exists in /sys/kernel/bpfdump/tasks/foo.
>> No one is doing a 'cat' on it yet.
>> "bpftool link show" will show one link which is a pair (bpfdump_prog, "tasks/foo").
>> Now two humans are doing 'cat' of that file.
>> The bpfdump_prog refcnt is now 3 and there are two additional seq_files created
>> by the kernel when user said open("/sys/kernel/bpfdump/tasks/foo").
>> If these two humans are slow somebody could have done "rm /sys/kernel/bpfdump/tasks/foo"
>> and that bpfdump_seq_file and its member bpf_seq_file_link would be gone,
>> but two other bpfdump_seq_files are still active and they are different.
>> "bpftool link show" should be showing two pairs (bpfdump_prog, seq_file_A) and
>> (bpfdump_prog, seq_file_B).
>> The users could have been in different pid namespaces. What seq_file_A is
>> iterating could be completely different from seq_file_B, but I think it's
>> useful for admin to know where all bpf progs in the system are attached and
>> what kind of things are triggering them.
> 
> How exactly bpf_link is implemented for bpfdumper is not all that
> important to me. It can be a separate struct, a field, a pointer to a
> separate struct -- not that different.
> 
> I didn't mean for this thread to be just another endless discussion,
> so I'll try to wrap it up in this email. I really like the bpfdumper idea
> and usability overall. Getting a call at the end of iteration is a big
> deal and I'm glad I got at least that :)
> 
> But let me try to point out a few things you proposed above that I
> disagree with on a high level, as well as provide a few supporting
> points for the scheme I proposed previously. If all that is not
> convincing, I rest my case and I won't object to bpfdumper going in in
> any form, as long as I can use it anonymously with an extra call at the
> end to do post-aggregation.
> 
> So, first. I do not see the point of treating each instance of seq_file
> as if it were a new bpf_link:
> 1. It's a bit like saying that each inherited cgroup bpf program in
> effective_prog_array should have a new bpf_link created. That's not how
> it's done for cgroups, and I think for a good reason.
> 2. Further, each seq_file, when created from the "seq_file template",
> should take a refcnt on the bpf_prog, not the bpf_link. Because a
> seq_file expects the bpf_prog itself to be exactly the same throughout
> the entire iteration process. A bpf_link, on the other hand, allows an
> in-place update of the bpf_prog, which could potentially ruin the
> seq_file iteration. I know we can disable that, but it just feels like
> an arbitrary restriction.
> 3. Suppose each seq_file is/has a bpf_link and one can iterate over each
> active seq_file (what I've been calling a session, but whatever). What
> kind of user-facing info can you get from get_obj_info? prog_id,
> prog_tag, provider ID/name (i.e., /sys/fs/bpfdump/task). Is that
> useful? Yes! Is it enough to do anything actionable? No! Immediately
> you'd need to know the PIDs of all processes that have an FD open to
> that seq_file (and see the diagram above; there could be many processes
> with many FDs for the same seq_file). bpf_link doesn't know all the
> PIDs. So it's this generic "who has this file opened" problem all over
> again, which I'm already pretty tired of talking about :) Except now we
> have at least 3 ways to answer such questions: iterate procfs+fdinfo,
> drgn scripts, and now also a bpfdump program for the task/file provider.
> 
> So even if you can enumerate active bpfdump seq_files in the system,
> you still need extra info and to iterate over task/file items to be
> able to do anything about it. That is one of the reasons I think
> auto-creating bpf_links for each seq_file is useless and will just
> pollute the view of the system (from a bpf_link standpoint).
> 
> Now, second. Getting back to what I proposed with the 3-4 step process
> (load --> attach (create_link) --> (pin in bpfdumpfs + open() |
> BPF_NEW_DUMP_SESSION)). I realize now that attach might seem
> superfluous here, because it doesn't provide any extra information (the
> FD of the provider was specified at prog load time). It does feel a bit
> weird, but:
> 
> 1. It's not that weird, because fentry/fexit/freplace and tp_btf also
> don't provide anything extra: all the info was specified at load time.
> 2. This attach step is a good point to provide some sort of
> "parametrization" to narrow down behavior of providers. I'll give two
> examples that I think are going to be very useful and we'll eventually
> add support for them in one way or another.
> 
> Example A. task/file provider. Instead of iterating over all tasks,
> the simplest extension would be to specify **one** specific task PID
> to iterate all files of. Attach time would be the place to specify
> this PID. We don't need to know the PID at load time, because it doesn't
> change anything about BPF program validation and the verifier just
> doesn't need to know. So now, once attached, a bpf_link is created that
> can be
> pinned in bpfdumpfs or BPF_NEW_DUMP_SESSION can be used to create
> potentially many seq_files (e.g., poll every second) to see all open
> files from a specific task. We can keep generalizing to, say, having
> all tasks in a given cgroup. All that can be implemented by filtering
> out inside BPF program, of course, but having narrower scope from the
> beginning could save tons of time and resources.
> 
> Example B. Iterating BPF map items. We already have bpf_map provider,
> next could be bpf_map/items, which would call BPF program for each
> key/value pair, something like:
> 
> int BPF_PROG(for_each_map_kv, struct seq_file *seq, struct bpf_map *map,
>               void *key, size_t key_size, void *value, size_t value_size)
> {
>      ...
> }
> 
> Now, once you have that, a natural next desire is to say "only dump
> items of map with ID 123", instead of iterating over all BPF maps in
> the system. That map ID could be specified at attachment time, when
> a bpf_link with these parameters is going to be created. Again, at load
> time the BPF verifier doesn't need to know the specific BPF map we are
> going to iterate, if we stick to generic key/value blob semantics.

Thanks for bringing up this use case. I had not thought about this 
carefully before, assuming that bpf filtering, even for second-level 
data structures, would be enough for most cases. But I do agree that in 
certain cases this is not good, e.g., when every map has millions of 
elements and you only want to scan through a particular map id.

But I think fixed parameterization at the kernel interface might not be 
flexible enough. For example,
     - we want to filter only files for this pid:
       the pid is passed to the kernel
     - we want to filter only files for tasks in a particular cgroup:
       the cgroup id is passed to the kernel and the target needs to
       check whether a particular task belongs to this cgroup
     - a hypothetical case:
       suppose you want to traverse the nh_list for a certain route
       with src1 and dst1;
       src1 and dst1 need to be passed to the kernel and target.

Maybe a bpf based filter is a good choice here.

For a dumper program prog3 at foo1/foo2/foo3,
two filter programs can exist:
    prog1: target foo1
    prog2: target foo1/foo2
A return value of 1 from prog1/prog2 means skip that object; 0 means
do not skip.

For the dumper prog3, a return value of 1 means stop the dump; 0 means
continue.

Note here that I did not put any further restrictions on prog1/prog2;
they can use bpf_seq_printf() or any other tracing prog helpers.

So when creating a dumper (anonymous or file), multiple bpf programs
*can* be present:
    - all programs must be in the same hierarchy:
      foo1/, foo1/foo3 are good;
      foo1/, bar1/ will be rejected
    - each hierarchy level can have 0 or 1 program
    - the deepest hierarchy program is the one that does the dump;
      all earlier hierarchy programs, if present, are filter programs.
      if no filter program exists for a particular hierarchy level,
      assume a program that always returns "do not skip"

I have not thought about the kernel API yet. Not 100% sure whether
LINK_CREATE is the right choice here.

Any thoughts?

> 
> So with such possibilities considered, I hope having an explicit
> LINK_CREATE step starts making much more sense. This, plus not having
> to distinguish between named and anonymous dumpers (just like we don't
> distinguish a pinned, i.e. "named", bpf_link from an anonymous one),
> makes me still believe that this is a better approach.
> 
> But alas, my goal here is to bring different perspectives, not to
> obstruct or delay progress. So I'm going to spend some more time
> reviewing v2 and will provide feedback on relevant patches, but if my
> arguments were not convincing, I'm fine with that. I managed to
> convince you guys that an "anonymous" bpfdumper without bpfdumpfs
> pinning and a post-aggregation callback are a good thing, and I'm happy
> about that already. Can't get 100% of what I want, right? :)
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2020-04-19  6:12 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-08 23:25 [RFC PATCH bpf-next 00/16] bpf: implement bpf based dumping of kernel data structures Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 01/16] net: refactor net assignment for seq_net_private structure Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 02/16] bpf: create /sys/kernel/bpfdump mount file system Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 03/16] bpf: provide a way for targets to register themselves Yonghong Song
2020-04-10 22:18   ` Andrii Nakryiko
2020-04-10 23:24     ` Yonghong Song
2020-04-13 19:31       ` Andrii Nakryiko
2020-04-15 22:57     ` Yonghong Song
2020-04-10 22:25   ` Andrii Nakryiko
2020-04-10 23:25     ` Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 04/16] bpf: allow loading of a dumper program Yonghong Song
2020-04-10 22:36   ` Andrii Nakryiko
2020-04-10 23:28     ` Yonghong Song
2020-04-13 19:33       ` Andrii Nakryiko
2020-04-08 23:25 ` [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers Yonghong Song
2020-04-10  3:00   ` Alexei Starovoitov
2020-04-10  6:09     ` Yonghong Song
2020-04-10 22:42     ` Yonghong Song
2020-04-10 22:53       ` Andrii Nakryiko
2020-04-10 23:47         ` Yonghong Song
2020-04-11 23:11           ` Alexei Starovoitov
2020-04-12  6:51             ` Yonghong Song
2020-04-13 20:48             ` Andrii Nakryiko
2020-04-10 22:51   ` Andrii Nakryiko
2020-04-10 23:41     ` Yonghong Song
2020-04-13 19:45       ` Andrii Nakryiko
2020-04-10 23:25   ` Andrii Nakryiko
2020-04-11  0:23     ` Yonghong Song
2020-04-11 23:17       ` Alexei Starovoitov
2020-04-13 21:04         ` Andrii Nakryiko
2020-04-13 19:59       ` Andrii Nakryiko
2020-04-14  5:56   ` Andrii Nakryiko
2020-04-14 23:59     ` Yonghong Song
2020-04-15  4:45       ` Andrii Nakryiko
2020-04-15 16:46         ` Alexei Starovoitov
2020-04-16  1:48           ` Andrii Nakryiko
2020-04-16  7:15             ` Yonghong Song
2020-04-16 17:04             ` Alexei Starovoitov
2020-04-16 19:35               ` Andrii Nakryiko
2020-04-16 23:18                 ` Alexei Starovoitov
2020-04-17  5:11                   ` Andrii Nakryiko
2020-04-19  6:11                     ` Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 06/16] bpf: add netlink and ipv6_route targets Yonghong Song
2020-04-10 23:13   ` Andrii Nakryiko
2020-04-10 23:52     ` Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 07/16] bpf: add bpf_map target Yonghong Song
2020-04-13 22:18   ` Andrii Nakryiko
2020-04-13 22:47     ` Andrii Nakryiko
2020-04-08 23:25 ` [RFC PATCH bpf-next 08/16] bpf: add task and task/file targets Yonghong Song
2020-04-10  3:22   ` Alexei Starovoitov
2020-04-10  6:19     ` Yonghong Song
2020-04-10 21:31       ` Alexei Starovoitov
2020-04-10 21:33         ` Alexei Starovoitov
2020-04-13 23:00   ` Andrii Nakryiko
2020-04-08 23:25 ` [RFC PATCH bpf-next 09/16] bpf: add bpf_seq_printf and bpf_seq_write helpers Yonghong Song
2020-04-10  3:26   ` Alexei Starovoitov
2020-04-10  6:12     ` Yonghong Song
2020-04-14  5:28   ` Andrii Nakryiko
2020-04-08 23:25 ` [RFC PATCH bpf-next 10/16] bpf: support variable length array in tracing programs Yonghong Song
2020-04-14  0:13   ` Andrii Nakryiko
2020-04-08 23:25 ` [RFC PATCH bpf-next 11/16] bpf: implement query for target_proto and file dumper prog_id Yonghong Song
2020-04-10  3:10   ` Alexei Starovoitov
2020-04-10  6:11     ` Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 12/16] tools/libbpf: libbpf support for bpfdump Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 13/16] tools/bpftool: add bpf dumper support Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 14/16] tools/bpf: selftests: add dumper programs for ipv6_route and netlink Yonghong Song
2020-04-14  5:39   ` Andrii Nakryiko
2020-04-08 23:25 ` [RFC PATCH bpf-next 15/16] tools/bpf: selftests: add dumper progs for bpf_map/task/task_file Yonghong Song
2020-04-10  3:33   ` Alexei Starovoitov
2020-04-10  6:41     ` Yonghong Song
2020-04-08 23:25 ` [RFC PATCH bpf-next 16/16] tools/bpf: selftests: add a selftest for anonymous dumper Yonghong Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).