* [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf
@ 2022-02-25 23:43 Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
                   ` (8 more replies)
  0 siblings, 9 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

This patchset provides a bpf solution for monitoring cgroup activities
and exporting cgroup states in an organized way in bpffs. It introduces
the following features:

 1. sleepable tracepoints and sleepable tracing programs.
 2. a set of bpf helpers for creating and deleting files and
    directories in bpffs.
 3. a new iter prog, parameterizable by cgroup ids, to print cgroup
    state.

Sleepable tracepoints and tracing progs allow us to run bpf progs when
a new cgroup is created or an existing cgroup is removed. The set of
filesystem helpers allows sleepable tracing progs to set up directories
in bpffs for each cgroup. The progs can also pin bpf objects in, and
unlink them from, these bpffs directories. The new iter prog can be
used to export cgroup states. Together, these additions extend the
current cgroup interface to export per-cgroup stats.
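
For example, a sleepable tracing prog that sets up a bpffs directory at
cgroup creation could look roughly like the sketch below. The helper
names come from this series, but the prog body and the "/sys/fs/bpf/<id>"
layout are illustrative only:

  /* Sketch: attaches to the sleepable cgroup_mkdir_s tracepoint
   * (patch 05) and calls the sleepable bpf_mkdir() helper (patch 01).
   * The global buffer is zero-initialized, so the trailing NUL that
   * bpf_mkdir() requires at pathname[pathname_sz - 1] is present.
   */
  SEC("tp_btf.s/cgroup_mkdir_s")
  int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp)
  {
          static char dirname[64];

          BPF_SNPRINTF(dirname, sizeof(dirname), "/sys/fs/bpf/%llu",
                       cgrp->kn->id);
          return bpf_mkdir(dirname, sizeof(dirname), 0755);
  }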

See the selftest added in patch 09/09, test_cgroup_stats, as a full
example of how this can be done. The test develops a custom metric
measuring per-cgroup scheduling latencies and exports it via cgroup
iters, which are pinned by sleepable tracing progs attached at cgroup
tracepoints.

The same approach is not limited to per-cgroup stats; it can be used
for other state such as task_vma iters and per-bpf-prog state. As an
example, we can write sleepable tracing progs that monitor task fork
and exit, set up directories, parameterize task_vma iters and pin the
iters.

Hao Luo (9):
  bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper
  selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall
  bpf: Introduce sleepable tracepoints
  cgroup: Sleepable cgroup tracepoints.
  libbpf: Add sleepable tp_btf
  bpf: Lift permission check in __sys_bpf when called from kernel.
  bpf: Introduce cgroup iter
  selftests/bpf: Tests using sleepable tracepoints to monitor cgroup
    events

 include/linux/bpf.h                           |  16 +-
 include/linux/tracepoint-defs.h               |   1 +
 include/trace/bpf_probe.h                     |  22 +-
 include/trace/events/cgroup.h                 |  45 ++++
 include/uapi/linux/bpf.h                      |  32 +++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/cgroup_iter.c                      | 141 +++++++++++
 kernel/bpf/inode.c                            |  33 ++-
 kernel/bpf/syscall.c                          | 237 ++++++++++++++++--
 kernel/cgroup/cgroup.c                        |   5 +
 kernel/trace/bpf_trace.c                      |   5 +
 tools/include/uapi/linux/bpf.h                |  32 +++
 tools/lib/bpf/libbpf.c                        |   1 +
 .../selftests/bpf/prog_tests/syscall.c        |  67 ++++-
 .../bpf/prog_tests/test_cgroup_stats.c        | 187 ++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_monitor.c      |  78 ++++++
 .../selftests/bpf/progs/cgroup_sched_lat.c    | 232 +++++++++++++++++
 .../testing/selftests/bpf/progs/syscall_fs.c  |  69 +++++
 19 files changed, 1175 insertions(+), 37 deletions(-)
 create mode 100644 kernel/bpf/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_monitor.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_sched_lat.c
 create mode 100644 tools/testing/selftests/bpf/progs/syscall_fs.c

-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-02-27  5:18   ` Kumar Kartikeya Dwivedi
                     ` (2 more replies)
  2022-02-25 23:43 ` [PATCH bpf-next v1 2/9] bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper Hao Luo
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

This patch allows a bpf_syscall prog to perform some basic filesystem
operations: creating and removing directories, and unlinking files.
Three bpf helpers are added for this purpose. Combined with the
following patches, which allow pinning and getting bpf objects from a
bpf prog, this feature can be used to create a directory hierarchy in
bpffs that helps manage bpf objects purely using bpf progs.

The added helpers are subject to the same permission checks as their
syscall counterparts. For example, one cannot write to a read-only
filesystem, and the identity of the current process is checked to see
whether it has sufficient permission to perform the operations.

Only directories and files in bpffs can be created or removed by these
helpers, but it would not be too hard to allow them to operate on
files in other filesystems, if we want.
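
For example, a syscall prog could call the helpers roughly as follows
(a sketch; the path is illustrative). Note that *pathname_sz* counts
the trailing NUL, which the helpers require to be present:

  /* Sketch only. A global string lives in .data, so the verifier sees
   * a constant size and sizeof() includes the required trailing NUL.
   */
  static char dir[] = "/sys/fs/bpf/example";

  SEC("syscall")
  int make_dir(void *ctx)
  {
          return bpf_mkdir(dir, sizeof(dir), 0755);
  }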

Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |  26 +++++
 kernel/bpf/inode.c             |   9 +-
 kernel/bpf/syscall.c           | 177 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  26 +++++
 5 files changed, 236 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f19abc59b6cd..fce5e26179f5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1584,6 +1584,7 @@ int bpf_link_new_fd(struct bpf_link *link);
 struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 
+bool bpf_path_is_bpf_dir(const struct path *path);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index afe3d0d7f5f2..a5dbc794403d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5086,6 +5086,29 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure. On error
  *		*dst* buffer is zeroed out.
+ *
+ * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
+ *	Description
+ *		Attempts to create a directory named *pathname*. The argument
+ *		*pathname_sz* specifies the length of the string *pathname*.
+ *		The argument *mode* specifies the mode for the new directory. It
+ *		is modified by the process's umask. It has the same semantics as
+ *		the syscall mkdir(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_rmdir(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a directory, which must be empty.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_unlink(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a name and possibly the file it refers to. It has the
+ *		same semantics as the syscall unlink(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5280,6 +5303,9 @@ union bpf_attr {
 	FN(xdp_load_bytes),		\
 	FN(xdp_store_bytes),		\
 	FN(copy_from_user_task),	\
+	FN(mkdir),			\
+	FN(rmdir),			\
+	FN(unlink),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4f841e16779e..3aca00e9e950 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -414,6 +414,11 @@ static const struct inode_operations bpf_dir_iops = {
 	.unlink		= simple_unlink,
 };
 
+bool bpf_path_is_bpf_dir(const struct path *path)
+{
+	return d_inode(path->dentry)->i_op == &bpf_dir_iops;
+}
+
 /* pin iterator link into bpffs */
 static int bpf_iter_link_pin_kernel(struct dentry *parent,
 				    const char *name, struct bpf_link *link)
@@ -439,7 +444,6 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 			  enum bpf_type type)
 {
 	struct dentry *dentry;
-	struct inode *dir;
 	struct path path;
 	umode_t mode;
 	int ret;
@@ -454,8 +458,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	if (ret)
 		goto out;
 
-	dir = d_inode(path.dentry);
-	if (dir->i_op != &bpf_dir_iops) {
+	if (!bpf_path_is_bpf_dir(&path)) {
 		ret = -EPERM;
 		goto out;
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index db402ebc5570..07683b791733 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -12,6 +12,7 @@
 #include <linux/sched/signal.h>
 #include <linux/vmalloc.h>
 #include <linux/mmzone.h>
+#include <linux/namei.h>
 #include <linux/anon_inodes.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
@@ -4867,6 +4868,176 @@ const struct bpf_func_proto bpf_kallsyms_lookup_name_proto = {
 	.arg4_type	= ARG_PTR_TO_LONG,
 };
 
+BPF_CALL_3(bpf_mkdir, const char *, pathname, int, pathname_sz, u32, raw_mode)
+{
+	struct user_namespace *mnt_userns;
+	struct dentry *dentry;
+	struct path path;
+	umode_t mode;
+	int err;
+
+	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
+		return -EINVAL;
+
+	dentry = kern_path_create(AT_FDCWD, pathname, &path, LOOKUP_DIRECTORY);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
+	if (!bpf_path_is_bpf_dir(&path)) {
+		err = -EPERM;
+		goto err_exit;
+	}
+
+	mode = raw_mode;
+	if (!IS_POSIXACL(path.dentry->d_inode))
+		mode &= ~current_umask();
+	err = security_path_mkdir(&path, dentry, mode);
+	if (err)
+		goto err_exit;
+
+	mnt_userns = mnt_user_ns(path.mnt);
+	err = vfs_mkdir(mnt_userns, d_inode(path.dentry), dentry, mode);
+
+err_exit:
+	done_path_create(&path, dentry);
+	return err;
+}
+
+const struct bpf_func_proto bpf_mkdir_proto = {
+	.func		= bpf_mkdir,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_rmdir, const char *, pathname, int, pathname_sz)
+{
+	struct user_namespace *mnt_userns;
+	struct path parent;
+	struct dentry *dentry;
+	int err;
+
+	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
+		return -EINVAL;
+
+	err = kern_path(pathname, 0, &parent);
+	if (err)
+		return err;
+
+	if (!bpf_path_is_bpf_dir(&parent)) {
+		err = -EPERM;
+		goto exit1;
+	}
+
+	err = mnt_want_write(parent.mnt);
+	if (err)
+		goto exit1;
+
+	dentry = kern_path_locked(pathname, &parent);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto exit2;
+	}
+
+	if (d_really_is_negative(dentry)) {
+		err = -ENOENT;
+		goto exit3;
+	}
+
+	err = security_path_rmdir(&parent, dentry);
+	if (err)
+		goto exit3;
+
+	mnt_userns = mnt_user_ns(parent.mnt);
+	err = vfs_rmdir(mnt_userns, d_inode(parent.dentry), dentry);
+exit3:
+	dput(dentry);
+	inode_unlock(d_inode(parent.dentry));
+exit2:
+	mnt_drop_write(parent.mnt);
+exit1:
+	path_put(&parent);
+	return err;
+}
+
+const struct bpf_func_proto bpf_rmdir_proto = {
+	.func		= bpf_rmdir,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
+BPF_CALL_2(bpf_unlink, const char *, pathname, int, pathname_sz)
+{
+	struct user_namespace *mnt_userns;
+	struct path parent;
+	struct dentry *dentry;
+	struct inode *inode = NULL;
+	int err;
+
+	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
+		return -EINVAL;
+
+	err = kern_path(pathname, 0, &parent);
+	if (err)
+		return err;
+
+	err = mnt_want_write(parent.mnt);
+	if (err)
+		goto exit1;
+
+	dentry = kern_path_locked(pathname, &parent);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto exit2;
+	}
+
+	if (!bpf_path_is_bpf_dir(&parent)) {
+		err = -EPERM;
+		goto exit3;
+	}
+
+	if (d_is_negative(dentry)) {
+		err = -ENOENT;
+		goto exit3;
+	}
+
+	if (d_is_dir(dentry)) {
+		err = -EISDIR;
+		goto exit3;
+	}
+
+	inode = dentry->d_inode;
+	ihold(inode);
+	err = security_path_unlink(&parent, dentry);
+	if (err)
+		goto exit3;
+
+	mnt_userns = mnt_user_ns(parent.mnt);
+	err = vfs_unlink(mnt_userns, d_inode(parent.dentry), dentry, NULL);
+exit3:
+	dput(dentry);
+	inode_unlock(d_inode(parent.dentry));
+	if (inode)
+		iput(inode);
+exit2:
+	mnt_drop_write(parent.mnt);
+exit1:
+	path_put(&parent);
+	return err;
+}
+
+const struct bpf_func_proto bpf_unlink_proto = {
+	.func		= bpf_unlink,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto *
 syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4879,6 +5050,12 @@ syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sys_close_proto;
 	case BPF_FUNC_kallsyms_lookup_name:
 		return &bpf_kallsyms_lookup_name_proto;
+	case BPF_FUNC_mkdir:
+		return &bpf_mkdir_proto;
+	case BPF_FUNC_rmdir:
+		return &bpf_rmdir_proto;
+	case BPF_FUNC_unlink:
+		return &bpf_unlink_proto;
 	default:
 		return tracing_prog_func_proto(func_id, prog);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index afe3d0d7f5f2..a5dbc794403d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5086,6 +5086,29 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure. On error
  *		*dst* buffer is zeroed out.
+ *
+ * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
+ *	Description
+ *		Attempts to create a directory named *pathname*. The argument
+ *		*pathname_sz* specifies the length of the string *pathname*.
+ *		The argument *mode* specifies the mode for the new directory. It
+ *		is modified by the process's umask. It has the same semantics as
+ *		the syscall mkdir(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_rmdir(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a directory, which must be empty.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_unlink(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a name and possibly the file it refers to. It has the
+ *		same semantics as the syscall unlink(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5280,6 +5303,9 @@ union bpf_attr {
 	FN(xdp_load_bytes),		\
 	FN(xdp_store_bytes),		\
 	FN(copy_from_user_task),	\
+	FN(mkdir),			\
+	FN(rmdir),			\
+	FN(unlink),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 2/9] bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 3/9] selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall Hao Luo
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Now a bpf syscall prog is able to pin bpf objects into bpffs and get
pinned objects from the filesystem. Combined with the previous patch,
which introduced the helpers for creating and deleting directories in
bpffs, a syscall prog can now persist bpf objects and organize them in
a directory hierarchy.
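
For example (a sketch; the pin path is illustrative, and prog_fd is
assumed to be an fd obtained earlier, e.g. from BPF_PROG_LOAD via
bpf_sys_bpf()):

  static union bpf_attr pin_attr;
  static char path[] = "/sys/fs/bpf/example/prog";

  /* Sketch: pin prog_fd at a bpffs path from within a syscall prog */
  pin_attr.pathname = (__u64)(long)path;
  pin_attr.bpf_fd = prog_fd;
  err = bpf_sys_bpf(BPF_OBJ_PIN, &pin_attr, sizeof(pin_attr));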

Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h  |  4 ++--
 kernel/bpf/inode.c   | 24 ++++++++++++++++++------
 kernel/bpf/syscall.c | 21 ++++++++++++++-------
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fce5e26179f5..c36eeced3838 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1585,8 +1585,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 
 bool bpf_path_is_bpf_dir(const struct path *path);
-int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
-int bpf_obj_get_user(const char __user *pathname, int flags);
+int bpf_obj_pin_path(u32 ufd, bpfptr_t pathname);
+int bpf_obj_get_path(bpfptr_t pathname, int flags);
 
 #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
 #define DEFINE_BPF_ITER_FUNC(target, args...)			\
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 3aca00e9e950..6c2db54a2ff9 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -440,7 +440,7 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent,
 	return ret;
 }
 
-static int bpf_obj_do_pin(const char __user *pathname, void *raw,
+static int bpf_obj_do_pin(bpfptr_t pathname, void *raw,
 			  enum bpf_type type)
 {
 	struct dentry *dentry;
@@ -448,7 +448,13 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	umode_t mode;
 	int ret;
 
-	dentry = user_path_create(AT_FDCWD, pathname, &path, 0);
+	if (bpfptr_is_null(pathname))
+		return -EINVAL;
+
+	if (bpfptr_is_kernel(pathname))
+		dentry = kern_path_create(AT_FDCWD, pathname.kernel, &path, 0);
+	else
+		dentry = user_path_create(AT_FDCWD, pathname.user, &path, 0);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -481,7 +487,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	return ret;
 }
 
-int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
+int bpf_obj_pin_path(u32 ufd, bpfptr_t pathname)
 {
 	enum bpf_type type;
 	void *raw;
@@ -498,7 +504,7 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
 	return ret;
 }
 
-static void *bpf_obj_do_get(const char __user *pathname,
+static void *bpf_obj_do_get(bpfptr_t pathname,
 			    enum bpf_type *type, int flags)
 {
 	struct inode *inode;
@@ -506,7 +512,13 @@ static void *bpf_obj_do_get(const char __user *pathname,
 	void *raw;
 	int ret;
 
-	ret = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
+	if (bpfptr_is_null(pathname))
+		return ERR_PTR(-EINVAL);
+
+	if (bpfptr_is_kernel(pathname))
+		ret = kern_path(pathname.kernel, LOOKUP_FOLLOW, &path);
+	else
+		ret = user_path_at(AT_FDCWD, pathname.user, LOOKUP_FOLLOW, &path);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -530,7 +542,7 @@ static void *bpf_obj_do_get(const char __user *pathname,
 	return ERR_PTR(ret);
 }
 
-int bpf_obj_get_user(const char __user *pathname, int flags)
+int bpf_obj_get_path(bpfptr_t pathname, int flags)
 {
 	enum bpf_type type = BPF_TYPE_UNSPEC;
 	int f_flags;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 07683b791733..9e6d8d0c8af5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2402,22 +2402,27 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 
 #define BPF_OBJ_LAST_FIELD file_flags
 
-static int bpf_obj_pin(const union bpf_attr *attr)
+static int bpf_obj_pin(const union bpf_attr *attr, bpfptr_t uattr)
 {
+	bpfptr_t pathname;
+
 	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
 		return -EINVAL;
 
-	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
+	pathname = make_bpfptr(attr->pathname, bpfptr_is_kernel(uattr));
+	return bpf_obj_pin_path(attr->bpf_fd, pathname);
 }
 
-static int bpf_obj_get(const union bpf_attr *attr)
+static int bpf_obj_get(const union bpf_attr *attr, bpfptr_t uattr)
 {
+	bpfptr_t pathname;
+
 	if (CHECK_ATTR(BPF_OBJ) || attr->bpf_fd != 0 ||
 	    attr->file_flags & ~BPF_OBJ_FLAG_MASK)
 		return -EINVAL;
 
-	return bpf_obj_get_user(u64_to_user_ptr(attr->pathname),
-				attr->file_flags);
+	pathname = make_bpfptr(attr->pathname, bpfptr_is_kernel(uattr));
+	return bpf_obj_get_path(pathname, attr->file_flags);
 }
 
 void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
@@ -4648,10 +4653,10 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 		err = bpf_prog_load(&attr, uattr);
 		break;
 	case BPF_OBJ_PIN:
-		err = bpf_obj_pin(&attr);
+		err = bpf_obj_pin(&attr, uattr);
 		break;
 	case BPF_OBJ_GET:
-		err = bpf_obj_get(&attr);
+		err = bpf_obj_get(&attr, uattr);
 		break;
 	case BPF_PROG_ATTACH:
 		err = bpf_prog_attach(&attr);
@@ -4776,6 +4781,8 @@ BPF_CALL_3(bpf_sys_bpf, int, cmd, union bpf_attr *, attr, u32, attr_size)
 	case BPF_BTF_LOAD:
 	case BPF_LINK_CREATE:
 	case BPF_RAW_TRACEPOINT_OPEN:
+	case BPF_OBJ_PIN:
+	case BPF_OBJ_GET:
 		break;
 #ifdef CONFIG_BPF_JIT /* __bpf_prog_enter_sleepable used by trampoline and JIT */
 	case BPF_PROG_TEST_RUN:
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 3/9] selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 2/9] bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints Hao Luo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Add a subtest to the syscall test that exercises bpf_mkdir(),
bpf_rmdir(), bpf_unlink() and object pinning in a syscall prog.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../selftests/bpf/prog_tests/syscall.c        | 67 +++++++++++++++++-
 .../testing/selftests/bpf/progs/syscall_fs.c  | 69 +++++++++++++++++++
 2 files changed, 135 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/syscall_fs.c

diff --git a/tools/testing/selftests/bpf/prog_tests/syscall.c b/tools/testing/selftests/bpf/prog_tests/syscall.c
index f4d40001155a..782b5fe73096 100644
--- a/tools/testing/selftests/bpf/prog_tests/syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/syscall.c
@@ -1,7 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2021 Facebook */
+#include <sys/stat.h>
 #include <test_progs.h>
 #include "syscall.skel.h"
+#include "syscall_fs.skel.h"
 
 struct args {
 	__u64 log_buf;
@@ -12,7 +14,7 @@ struct args {
 	int btf_fd;
 };
 
-void test_syscall(void)
+static void test_syscall_basic(void)
 {
 	static char verifier_log[8192];
 	struct args ctx = {
@@ -53,3 +55,66 @@ void test_syscall(void)
 	if (ctx.btf_fd > 0)
 		close(ctx.btf_fd);
 }
+
+static void test_syscall_fs(void)
+{
+	char tmpl[] = "/sys/fs/bpf/syscall_XXXXXX";
+	struct stat statbuf = {};
+	static char verifier_log[8192];
+	struct args ctx = {
+		.log_buf = (uintptr_t) verifier_log,
+		.log_size = sizeof(verifier_log),
+		.prog_fd = 0,
+	};
+	LIBBPF_OPTS(bpf_test_run_opts, tattr,
+		.ctx_in = &ctx,
+		.ctx_size_in = sizeof(ctx),
+	);
+	struct syscall_fs *skel = NULL;
+	int err, mkdir_fd, rmdir_fd;
+	char *root, *dir, *path;
+
+	/* prepares test directories */
+	system("mount -t bpf bpffs /sys/fs/bpf");
+	root = mkdtemp(tmpl);
+	chmod(root, 0755);
+
+	/* loads prog */
+	skel = syscall_fs__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	dir = skel->bss->dirname;
+	snprintf(dir, sizeof(skel->bss->dirname), "%s/test", root);
+	path = skel->bss->pathname;
+	snprintf(path, sizeof(skel->bss->pathname), "%s/prog", dir);
+
+	/* tests mkdir */
+	mkdir_fd = bpf_program__fd(skel->progs.mkdir_prog);
+	err = bpf_prog_test_run_opts(mkdir_fd, &tattr);
+	ASSERT_EQ(err, 0, "mkdir_err");
+	ASSERT_EQ(tattr.retval, 0, "mkdir_retval");
+	ASSERT_OK(stat(dir, &statbuf), "mkdir_success");
+	ASSERT_OK(stat(path, &statbuf), "pin_success");
+
+	/* tests rmdir */
+	rmdir_fd = bpf_program__fd(skel->progs.rmdir_prog);
+	err = bpf_prog_test_run_opts(rmdir_fd, &tattr);
+	ASSERT_EQ(err, 0, "rmdir_err");
+	ASSERT_EQ(tattr.retval, 0, "rmdir_retval");
+	ASSERT_ERR(stat(path, &statbuf), "unlink_success");
+	ASSERT_ERR(stat(dir, &statbuf), "rmdir_success");
+
+cleanup:
+	syscall_fs__destroy(skel);
+	if (ctx.prog_fd > 0)
+		close(ctx.prog_fd);
+	rmdir(root);
+}
+
+void test_syscall(void) {
+	if (test__start_subtest("basic"))
+		test_syscall_basic();
+	if (test__start_subtest("filesystem"))
+		test_syscall_fs();
+}
diff --git a/tools/testing/selftests/bpf/progs/syscall_fs.c b/tools/testing/selftests/bpf/progs/syscall_fs.c
new file mode 100644
index 000000000000..9418d1364c09
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/syscall_fs.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <../../../tools/include/linux/filter.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct args {
+	__u64 log_buf;
+	__u32 log_size;
+	int max_entries;
+	int map_fd;
+	int prog_fd;
+	int btf_fd;
+};
+
+char dirname[64];
+char pathname[64];
+
+SEC("syscall")
+int mkdir_prog(struct args *ctx)
+{
+	static char license[] = "GPL";
+	static struct bpf_insn insns[] = {
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+	static union bpf_attr load_attr = {
+		.prog_type = BPF_PROG_TYPE_XDP,
+		.insn_cnt = sizeof(insns) / sizeof(insns[0]),
+	};
+	static union bpf_attr pin_attr = {
+		.file_flags = 0,
+	};
+	int ret;
+
+	ret = bpf_mkdir(dirname, sizeof(dirname), 0644);
+	if (ret)
+		return ret;
+
+	load_attr.license = (long) license;
+	load_attr.insns = (long) insns;
+	load_attr.log_buf = ctx->log_buf;
+	load_attr.log_size = ctx->log_size;
+	load_attr.log_level = 1;
+	ret = bpf_sys_bpf(BPF_PROG_LOAD, &load_attr, sizeof(load_attr));
+	if (ret < 0)
+		return ret;
+	else if (ret == 0)
+		return -1;
+	ctx->prog_fd = ret;
+
+	pin_attr.pathname = (__u64)pathname;
+	pin_attr.bpf_fd = ret;
+	return bpf_sys_bpf(BPF_OBJ_PIN, &pin_attr, sizeof(pin_attr));
+}
+
+SEC("syscall")
+int rmdir_prog(struct args *ctx)
+{
+	int ret;
+
+	ret = bpf_unlink(pathname, sizeof(pathname));
+	if (ret)
+		return ret;
+
+	return bpf_rmdir(dirname, sizeof(dirname));
+}
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
                   ` (2 preceding siblings ...)
  2022-02-25 23:43 ` [PATCH bpf-next v1 3/9] selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-03-02 19:41   ` Alexei Starovoitov
  2022-03-02 21:23   ` Yonghong Song
  2022-02-25 23:43 ` [PATCH bpf-next v1 5/9] cgroup: Sleepable cgroup tracepoints Hao Luo
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Add a new type of bpf tracepoint: sleepable tracepoints, which allow
the handlers to make calls that may sleep. With sleepable tracepoints,
the set of syscall helpers (which may sleep) also becomes callable from
the attached tracing programs.

In the following patches, we will whitelist some tracepoints as
sleepable.
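
On the bpf side, a tracepoint opts in by being defined through the new
sleepable macros. For instance, the next patch redefines DEFINE_EVENT
for the cgroup events:

  #ifdef DEFINE_EVENT_SLEEPABLE
  #undef DEFINE_EVENT
  #define DEFINE_EVENT(template, call, proto, args)		\
          DEFINE_EVENT_SLEEPABLE(template, call, PARAMS(proto), PARAMS(args))
  #endif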

Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h             | 10 +++++++-
 include/linux/tracepoint-defs.h |  1 +
 include/trace/bpf_probe.h       | 22 ++++++++++++++----
 kernel/bpf/syscall.c            | 41 +++++++++++++++++++++++----------
 kernel/trace/bpf_trace.c        |  5 ++++
 5 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c36eeced3838..759ade7b24b3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1810,6 +1810,9 @@ struct bpf_prog *bpf_prog_by_id(u32 id);
 struct bpf_link *bpf_link_by_id(u32 id);
 
 const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
+const struct bpf_func_proto *
+tracing_prog_syscall_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
+
 void bpf_task_storage_free(struct task_struct *task);
 bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
 const struct btf_func_model *
@@ -1822,7 +1825,6 @@ struct bpf_core_ctx {
 
 int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo,
 		   int relo_idx, void *insn);
-
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {
@@ -2011,6 +2013,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	return NULL;
 }
 
+static inline struct bpf_func_proto *
+tracing_prog_syscall_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return NULL;
+}
+
 static inline void bpf_task_storage_free(struct task_struct *task)
 {
 }
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index e7c2276be33e..c73c7ab3680e 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -51,6 +51,7 @@ struct bpf_raw_event_map {
 	void			*bpf_func;
 	u32			num_args;
 	u32			writable_size;
+	u32			sleepable;
 } __aligned(32);
 
 /*
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index 7660a7846586..4edfc6df2f52 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -88,7 +88,7 @@ __bpf_trace_##call(void *__data, proto)					\
  * to make sure that if the tracepoint handling changes, the
  * bpf probe will fail to compile unless it too is updated.
  */
-#define __DEFINE_EVENT(template, call, proto, args, size)		\
+#define __DEFINE_EVENT(template, call, proto, args, size, sleep)	\
 static inline void bpf_test_probe_##call(void)				\
 {									\
 	check_trace_callback_type_##call(__bpf_trace_##template);	\
@@ -104,6 +104,7 @@ __section("__bpf_raw_tp_map") = {					\
 		.bpf_func	= __bpf_trace_##template,		\
 		.num_args	= COUNT_ARGS(args),			\
 		.writable_size	= size,					\
+		.sleepable	= sleep,				\
 	},								\
 };
 
@@ -123,11 +124,15 @@ static inline void bpf_test_buffer_##call(void)				\
 #undef DEFINE_EVENT_WRITABLE
 #define DEFINE_EVENT_WRITABLE(template, call, proto, args, size) \
 	__CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
-	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size)
+	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size, 0)
+
+#undef DEFINE_EVENT_SLEEPABLE
+#define DEFINE_EVENT_SLEEPABLE(template, call, proto, args)	\
+	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0, 1)
 
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, call, proto, args)			\
-	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0)
+	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0, 0)
 
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
@@ -136,19 +141,26 @@ static inline void bpf_test_buffer_##call(void)				\
 #undef DECLARE_TRACE
 #define DECLARE_TRACE(call, proto, args)				\
 	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
-	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
+	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0, 0)
 
 #undef DECLARE_TRACE_WRITABLE
 #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
 	__CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
 	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
-	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)
+	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size, 0)
+
+#undef DECLARE_TRACE_SLEEPABLE
+#define DECLARE_TRACE_SLEEPABLE(call, proto, args)			\
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
+	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0, 1)
 
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 #undef DECLARE_TRACE_WRITABLE
 #undef DEFINE_EVENT_WRITABLE
 #undef __CHECK_WRITABLE_BUF_SIZE
+#undef DECLARE_TRACE_SLEEPABLE
+#undef DEFINE_EVENT_SLEEPABLE
 #undef __DEFINE_EVENT
 #undef FIRST
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9e6d8d0c8af5..0a12f52fe8a9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4827,12 +4827,6 @@ static const struct bpf_func_proto bpf_sys_bpf_proto = {
 	.arg3_type	= ARG_CONST_SIZE,
 };
 
-const struct bpf_func_proto * __weak
-tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
-{
-	return bpf_base_func_proto(func_id);
-}
-
 BPF_CALL_1(bpf_sys_close, u32, fd)
 {
 	/* When bpf program calls this helper there should not be
@@ -5045,24 +5039,47 @@ const struct bpf_func_proto bpf_unlink_proto = {
 	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
-static const struct bpf_func_proto *
-syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+/* Syscall helpers that are also allowed in sleepable tracing prog. */
+const struct bpf_func_proto *
+tracing_prog_syscall_func_proto(enum bpf_func_id func_id,
+				const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_sys_bpf:
 		return &bpf_sys_bpf_proto;
-	case BPF_FUNC_btf_find_by_name_kind:
-		return &bpf_btf_find_by_name_kind_proto;
 	case BPF_FUNC_sys_close:
 		return &bpf_sys_close_proto;
-	case BPF_FUNC_kallsyms_lookup_name:
-		return &bpf_kallsyms_lookup_name_proto;
 	case BPF_FUNC_mkdir:
 		return &bpf_mkdir_proto;
 	case BPF_FUNC_rmdir:
 		return &bpf_rmdir_proto;
 	case BPF_FUNC_unlink:
 		return &bpf_unlink_proto;
+	default:
+		return NULL;
+	}
+}
+
+const struct bpf_func_proto * __weak
+tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	const struct bpf_func_proto *fn;
+
+	fn = tracing_prog_syscall_func_proto(func_id, prog);
+	if (fn)
+		return fn;
+
+	return bpf_base_func_proto(func_id);
+}
+
+static const struct bpf_func_proto *
+syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_btf_find_by_name_kind:
+		return &bpf_btf_find_by_name_kind_proto;
+	case BPF_FUNC_kallsyms_lookup_name:
+		return &bpf_kallsyms_lookup_name_proto;
 	default:
 		return tracing_prog_func_proto(func_id, prog);
 	}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a2024ba32a20..c816e0e0d4a0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1691,6 +1691,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		fn = raw_tp_prog_func_proto(func_id, prog);
 		if (!fn && prog->expected_attach_type == BPF_TRACE_ITER)
 			fn = bpf_iter_get_func_proto(func_id, prog);
+		if (!fn && prog->aux->sleepable)
+			fn = tracing_prog_syscall_func_proto(func_id, prog);
 		return fn;
 	}
 }
@@ -2053,6 +2055,9 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *
 	if (prog->aux->max_tp_access > btp->writable_size)
 		return -EINVAL;
 
+	if (prog->aux->sleepable && !btp->sleepable)
+		return -EPERM;
+
 	return tracepoint_probe_register_may_exist(tp, (void *)btp->bpf_func,
 						   prog);
 }
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 5/9] cgroup: Sleepable cgroup tracepoints.
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
                   ` (3 preceding siblings ...)
  2022-02-25 23:43 ` [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 6/9] libbpf: Add sleepable tp_btf Hao Luo
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Add two new sleepable tracepoints in cgroup: cgroup_mkdir_s and
cgroup_rmdir_s. The suffix _s means they run in a sleepable context.
These two tracepoints don't need full cgroup paths, so they don't have
to live in an atomic context; they are also called without holding
cgroup_mutex.

Bpf can use them to monitor cgroup creation and deletion. Sleepable bpf
programs can attach to these two tracepoints and create corresponding
directories in bpffs. The created directories don't need the cgroup
paths; the cgroup id is sufficient to identify the cgroup. Once the
bpffs directories have been created, the bpf prog can further pin bpf
objects inside them and allow users to read the pinned objects.

This serves as a way to extend the fixed cgroup interface.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/trace/events/cgroup.h | 45 +++++++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c        |  5 ++++
 2 files changed, 50 insertions(+)

diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index dd7d7c9efecd..4483a7d6c43a 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -204,6 +204,51 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
 	TP_ARGS(cgrp, path, val)
 );
 
+/*
+ * The following tracepoints are supposed to be called in a sleepable context.
+ */
+DECLARE_EVENT_CLASS(cgroup_sleepable_tp,
+
+	TP_PROTO(struct cgroup *cgrp),
+
+	TP_ARGS(cgrp),
+
+	TP_STRUCT__entry(
+		__field(	int,		root			)
+		__field(	int,		level			)
+		__field(	u64,		id			)
+	),
+
+	TP_fast_assign(
+		__entry->root = cgrp->root->hierarchy_id;
+		__entry->id = cgroup_id(cgrp);
+		__entry->level = cgrp->level;
+	),
+
+	TP_printk("root=%d id=%llu level=%d",
+		  __entry->root, __entry->id, __entry->level)
+);
+
+#ifdef DEFINE_EVENT_SLEEPABLE
+#undef DEFINE_EVENT
+#define DEFINE_EVENT(template, call, proto, args)		\
+	DEFINE_EVENT_SLEEPABLE(template, call, PARAMS(proto), PARAMS(args))
+#endif
+
+DEFINE_EVENT(cgroup_sleepable_tp, cgroup_mkdir_s,
+
+	TP_PROTO(struct cgroup *cgrp),
+
+	TP_ARGS(cgrp)
+);
+
+DEFINE_EVENT(cgroup_sleepable_tp, cgroup_rmdir_s,
+
+	TP_PROTO(struct cgroup *cgrp),
+
+	TP_ARGS(cgrp)
+);
+
 #endif /* _TRACE_CGROUP_H */
 
 /* This part must be outside protection */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9d05c3ca2d5e..f14ab00d9ef5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5535,6 +5535,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 	cgroup_destroy_locked(cgrp);
 out_unlock:
 	cgroup_kn_unlock(parent_kn);
+	if (!ret)
+		trace_cgroup_mkdir_s(cgrp);
 	return ret;
 }
 
@@ -5725,6 +5727,9 @@ int cgroup_rmdir(struct kernfs_node *kn)
 		TRACE_CGROUP_PATH(rmdir, cgrp);
 
 	cgroup_kn_unlock(kn);
+
+	if (!ret)
+		trace_cgroup_rmdir_s(cgrp);
 	return ret;
 }
 
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 6/9] libbpf: Add sleepable tp_btf
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
                   ` (4 preceding siblings ...)
  2022-02-25 23:43 ` [PATCH bpf-next v1 5/9] cgroup: Sleepable cgroup tracepoints Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel Hao Luo
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

In the previous patches, we have introduced sleepable tracepoints in
the kernel and marked a couple of cgroup tracepoints as sleepable. This
patch introduces a sleepable version of tp_btf. Sleepable tp_btf progs
can only attach to sleepable tracepoints.
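
With this, a prog attaching to a sleepable tracepoint can be declared
as follows (a sketch, using one of the cgroup tracepoints from the
previous patch):

  SEC("tp_btf.s/cgroup_mkdir_s")
  int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp)
  {
          /* may call sleepable helpers such as bpf_mkdir() here */
          return 0;
  }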

Signed-off-by: Hao Luo <haoluo@google.com>
---
 tools/lib/bpf/libbpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 776b8e034d62..910682357390 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8619,6 +8619,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("fentry/",		TRACING, BPF_TRACE_FENTRY, SEC_ATTACH_BTF, attach_trace),
 	SEC_DEF("fmod_ret/",		TRACING, BPF_MODIFY_RETURN, SEC_ATTACH_BTF, attach_trace),
 	SEC_DEF("fexit/",		TRACING, BPF_TRACE_FEXIT, SEC_ATTACH_BTF, attach_trace),
+	SEC_DEF("tp_btf.s/",            TRACING, BPF_TRACE_RAW_TP, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace),
 	SEC_DEF("fentry.s/",		TRACING, BPF_TRACE_FENTRY, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace),
 	SEC_DEF("fmod_ret.s/",		TRACING, BPF_MODIFY_RETURN, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace),
 	SEC_DEF("fexit.s/",		TRACING, BPF_TRACE_FEXIT, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace),
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel.
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
                   ` (5 preceding siblings ...)
  2022-02-25 23:43 ` [PATCH bpf-next v1 6/9] libbpf: Add sleepable tp_btf Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-03-02 20:01   ` Alexei Starovoitov
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
  2022-02-25 23:43 ` [PATCH bpf-next v1 9/9] selftests/bpf: Tests using sleepable tracepoints to monitor cgroup events Hao Luo
  8 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

After we introduced sleepable tracing programs, we now have an
interesting problem: there are three execution paths that can
reach bpf_sys_bpf:

 1. called from bpf syscall.
 2. called from kernel context (e.g. kernel modules).
 3. called from bpf programs.

Ideally, the capability check in bpf_sys_bpf is necessary for the
first two scenarios, but it may not be necessary for the third case.

The use case of sleepable tracepoints is to allow the root user to
deploy bpf progs which run when certain kernel tracepoints are
triggered. An example use case is to monitor cgroup creation and
perform bpf operations whenever a cgroup is created. These operations
include pinning an iter to export the cgroup's state. Using sleepable
tracing is preferred because it eliminates the need for a userspace
daemon to monitor cgroup changes.

However, in this use case, the current task that triggers the
tracepoint may be unprivileged, and the permission check in __sys_bpf
will thus prevent it from making bpf syscalls. Therefore the tracing
progs deployed by root cannot be used by non-root users.

A solution to this problem is to lift the permission check if the caller
of bpf_sys_bpf comes from either kernel context or bpf programs.

An alternative to lifting this permission check would be to introduce
an 'unpriv' version of bpf_sys_bpf, which doesn't check the current
task's capability. If the owner of the tracing prog wants it to be
exclusively used by root users, they can use the 'priv' version of
bpf_sys_bpf; if the owner wants it to be usable by non-root users, they
can use the 'unpriv' version.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0a12f52fe8a9..3bf88002ee56 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4613,7 +4613,7 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	union bpf_attr attr;
 	int err;
 
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
+	if (sysctl_unprivileged_bpf_disabled && !bpf_capable() && !uattr.is_kernel)
 		return -EPERM;
 
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
                   ` (6 preceding siblings ...)
  2022-02-25 23:43 ` [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  2022-02-26  2:32   ` kernel test robot
                     ` (4 more replies)
  2022-02-25 23:43 ` [PATCH bpf-next v1 9/9] selftests/bpf: Tests using sleepable tracepoints to monitor cgroup events Hao Luo
  8 siblings, 5 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Introduce a new type of iter prog: cgroup. Unlike other bpf_iters,
this iter doesn't iterate over a set of kernel objects. Instead, it is
parameterized by a cgroup id and prints only that cgroup, so one needs
to specify a target cgroup id when attaching this iter.

The target cgroup's state can be read out via a link of this iter.
Typically, we can monitor cgroup creation and deletion using sleepable
tracing, use it to create a corresponding directory in bpffs for each
cgroup, and pin a link parameterized by the cgroup id in that
directory. Then we can read the auto-pinned iter link to get the
cgroup's state. The output of the iter link is determined by the
program. See the selftest test_cgroup_stats.c for an example.
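
From userspace, the target cgroup is supplied at attach time through
bpf_iter_link_info. A rough libbpf sketch (cgroup_id and pin_path are
assumed to be set up by the caller):

  union bpf_iter_link_info linfo = {
          .cgroup.cgroup_id = cgroup_id,
  };
  LIBBPF_OPTS(bpf_iter_attach_opts, opts,
          .link_info = &linfo,
          .link_info_len = sizeof(linfo));
  struct bpf_link *link;

  link = bpf_program__attach_iter(prog, &opts);
  /* reading the pinned file runs the iter prog for this one cgroup */
  bpf_link__pin(link, pin_path);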

Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 155 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 759ade7b24b3..3ce9b0b7ed89 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags);
 
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	u64 cgroup_id;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a5dbc794403d..855ad80d9983 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5887,6 +5890,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..52a0e4c6e96e 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..011d9dcd1d51
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct cgroup *cgroup;
+	u64 cgroup_id;
+
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	cgroup_id = *(u64 *)seq->private;
+	cgroup = cgroup_get_from_id(cgroup_id);
+	if (!cgroup)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return cgroup;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+	if (v)
+		cgroup_put(v);
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(u64 *)priv_data = aux->cgroup_id;
+	return 0;
+}
+
+static void cgroup_iter_seq_fini(void *priv_data)
+{
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.fini_seq_private       = cgroup_iter_seq_fini,
+	.seq_priv_size          = sizeof(u64),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	aux->cgroup_id = linfo->cgroup.cgroup_id;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+}
+
+void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+				 struct seq_file *seq)
+{
+	char buf[64] = {0};
+
+	cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
+	seq_printf(seq, "cgroup_id:\t%llu\n", aux->cgroup_id);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+}
+
+int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+				   struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = aux->cgroup_id;
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a5dbc794403d..855ad80d9983 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5887,6 +5890,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
-- 
2.35.1.574.g5d30c73bfb-goog



* [PATCH bpf-next v1 9/9] selftests/bpf: Tests using sleepable tracepoints to monitor cgroup events
  2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
                   ` (7 preceding siblings ...)
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
@ 2022-02-25 23:43 ` Hao Luo
  8 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-02-25 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Test the functionality of sleepable tracing progs, sleepable
tracepoints (i.e. cgroup_mkdir_s and cgroup_rmdir_s) and the cgroup
iter prog all together.

The added selftest resembles a real-world application, where bpf is used
to export cgroup-level performance stats. There are two sets of progs in
the test: cgroup_monitor and cgroup_sched_lat.

- cgroup_monitor monitors cgroup creation and deletion using sleepable
  tracing; for each cgroup created, it creates a directory in bpffs,
  then creates a cgroup iter link and pins it in that directory.

- cgroup_sched_lat collects cgroups' scheduling latencies and stores
  them in a hash map. It also implements a cgroup iter prog, which
  reads the stats from the map and seq_prints them. This cgroup iter
  prog is the one pinned by cgroup_monitor in each bpffs directory.

The cgroup_sched_lat progs in this test can be adapted to export
similar cgroup-level performance stats.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../bpf/prog_tests/test_cgroup_stats.c        | 187 ++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_monitor.c      |  78 ++++++
 .../selftests/bpf/progs/cgroup_sched_lat.c    | 232 ++++++++++++++++++
 4 files changed, 504 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_monitor.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_sched_lat.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
new file mode 100644
index 000000000000..b6607ac074bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#define _GNU_SOURCE
+#include <sys/stat.h>	/* mkdir */
+#include <fcntl.h>	/* name_to_handle_at */
+#include <stdlib.h>
+#include <test_progs.h>
+#include "cgroup_monitor.skel.h"
+#include "cgroup_sched_lat.skel.h"
+
+static char mkdir_prog_path[64];
+static char rmdir_prog_path[64];
+static char dump_prog_path[64];
+
+/* Get cgroup id from a full path to cgroup */
+static int get_cgroup_id(const char *cgroup)
+{
+	int mount_id = 0;
+	struct {
+		struct file_handle fh;
+		__u64 cgid;
+	} fh = {};
+
+	fh.fh.handle_bytes = sizeof(fh.cgid);
+	if (name_to_handle_at(AT_FDCWD, cgroup, &fh.fh, &mount_id, 0))
+		return -1;
+
+	return fh.cgid;
+}
+
+static void spin_on_cpu(int seconds)
+{
+	time_t start, now;
+
+	start = time(NULL);
+	do {
+		now = time(NULL);
+	} while (now - start < seconds);
+}
+
+static void do_work(const char *cgroup)
+{
+	int i, cpu = 0, pid;
+	char cmd[128];
+
+	/* make cgroup threaded */
+	snprintf(cmd, 128, "echo threaded > %s/cgroup.type", cgroup);
+	system(cmd);
+
+	/* try to enable cpu controller. this may fail if cpu controller is not
+	 * available in cgroup.controllers or there is a cgroup v1 already
+	 * mounted in the system.
+	 */
+	snprintf(cmd, 128, "echo \"+cpu\" > %s/cgroup.subtree_control", cgroup);
+	system(cmd);
+
+	/* launch two children, both running in child cgroup */
+	for (i = 0; i < 2; ++i) {
+		pid = fork();
+		if (pid == 0) {
+			/* attach to cgroup */
+			snprintf(cmd, 128, "echo %d > %s/cgroup.procs", getpid(), cgroup);
+			system(cmd);
+
+			/* pin process to target cpu */
+			snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+			system(cmd);
+
+			spin_on_cpu(3); /* spin on cpu for 3 seconds */
+			exit(0);
+		}
+	}
+
+	/* pin parent process to target cpu */
+	snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+	system(cmd);
+
+	spin_on_cpu(3); /* spin on cpu for 3 seconds */
+	wait(NULL);
+}
+
+/* Check reading cgroup stats from auto pinned objects
+ * @root: root directory in bpffs set up for this test
+ * @cgroup: cgroup path
+ */
+static void check_cgroup_stats(const char *root, const char *cgroup)
+{
+	unsigned long queue_self, queue_other;
+	char buf[64], path[64];
+	int id, cgroup_id;
+	FILE *file;
+
+	id = get_cgroup_id(cgroup);
+	if (!ASSERT_GE(id, 0, "get_cgroup_id"))
+		return;
+
+	snprintf(path, sizeof(path), "%s/%d/stats", root, id);
+	file = fopen(path, "r");
+	if (!ASSERT_OK_PTR(file, "open"))
+		return;
+
+	ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+	ASSERT_EQ(sscanf(buf, "cgroup_id: %8d", &cgroup_id), 1, "output");
+	ASSERT_EQ(id, cgroup_id, "cgroup_id");
+
+	ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+	ASSERT_EQ(sscanf(buf, "queue_self: %8lu", &queue_self), 1, "output");
+
+	ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+	ASSERT_EQ(sscanf(buf, "queue_other: %8lu", &queue_other), 1, "output");
+	fclose(file);
+}
+
+/* Set up bpf progs for monitoring cgroup activities. */
+static void setup_cgroup_monitor(const char *root)
+{
+	struct cgroup_monitor *skel = NULL;
+
+	skel = cgroup_monitor__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_monitor_skel_load"))
+		return;
+
+	cgroup_monitor__attach(skel);
+
+	snprintf(skel->bss->root, sizeof(skel->bss->root), "%s", root);
+
+	snprintf(mkdir_prog_path, 64, "%s/mkdir_prog", root);
+	bpf_obj_pin(bpf_link__fd(skel->links.mkdir_prog), mkdir_prog_path);
+
+	snprintf(rmdir_prog_path, 64, "%s/rmdir_prog", root);
+	bpf_obj_pin(bpf_link__fd(skel->links.rmdir_prog), rmdir_prog_path);
+
+	cgroup_monitor__destroy(skel);
+}
+
+void test_cgroup_stats(void)
+{
+	char bpf_tmpl[] = "/sys/fs/bpf/XXXXXX";
+	char cgrp_tmpl[] = "/sys/fs/cgroup/XXXXXX";
+	struct cgroup_sched_lat *skel = NULL;
+	char *root, *cgroup;
+
+	/* prepare test directories */
+	system("mount -t cgroup2 none /sys/fs/cgroup");
+	system("mount -t bpf bpffs /sys/fs/bpf");
+	root = mkdtemp(bpf_tmpl);
+	chmod(root, 0777);
+
+	/* set up progs for monitoring cgroup events */
+	setup_cgroup_monitor(root);
+
+	/* set up progs for profiling cgroup stats */
+	skel = cgroup_sched_lat__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_sched_lat_skel_load"))
+		goto cleanup_root;
+
+	snprintf(dump_prog_path, 64, "%s/prog", root);
+	bpf_obj_pin(bpf_program__fd(skel->progs.dump_cgroup), dump_prog_path);
+	chmod(dump_prog_path, 0644);
+
+	cgroup_sched_lat__attach(skel);
+
+	/* thanks to cgroup monitoring progs, a directory corresponding to the
+	 * cgroup is created in bpffs.
+	 */
+	cgroup = mkdtemp(cgrp_tmpl);
+
+	/* collect some cgroup-level stats and check reading them from bpffs */
+	do_work(cgroup);
+	check_cgroup_stats(root, cgroup);
+
+	/* thanks to cgroup monitoring progs, removing cgroups also removes
+	 * the created directory in bpffs.
+	 */
+	rmdir(cgroup);
+
+	/* clean up cgroup monitoring progs */
+	cgroup_sched_lat__detach(skel);
+	cgroup_sched_lat__destroy(skel);
+	unlink(dump_prog_path);
+cleanup_root:
+	/* remove test directories in bpffs */
+	unlink(mkdir_prog_path);
+	unlink(rmdir_prog_path);
+	rmdir(root);
+	return;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..0d1bf954e831 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@ struct bpf_iter__sockmap {
 	struct sock *sk;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_monitor.c b/tools/testing/selftests/bpf/progs/cgroup_monitor.c
new file mode 100644
index 000000000000..fa5debe1e15a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_monitor.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+/* root is the directory path. */
+char root[64];
+
+SEC("tp_btf.s/cgroup_mkdir_s")
+int BPF_PROG(mkdir_prog, struct cgroup *cgrp)
+{
+	static char dirname[64];
+	static char prog_path[64];
+	static char iter_path[64];
+	static union bpf_iter_link_info info;
+	static union bpf_attr get_attr;
+	static union bpf_attr link_attr;
+	static union bpf_attr pin_attr;
+	int link_fd, prog_fd, ret;
+	__u64 id;
+
+	/* create directory in bpffs named by cgroup's id. */
+	id = cgrp->kn->id;
+	BPF_SNPRINTF(dirname, sizeof(dirname), "%s/%lu", root, id);
+	ret = bpf_mkdir(dirname, sizeof(dirname), 0755);
+	if (ret)
+		return ret;
+
+	/* get cgroup iter prog pinned by test progs. */
+	BPF_SNPRINTF(prog_path, sizeof(prog_path), "%s/prog", root);
+	get_attr.bpf_fd = 0;
+	get_attr.pathname = (__u64)prog_path;
+	get_attr.file_flags = BPF_F_RDONLY;
+	prog_fd = bpf_sys_bpf(BPF_OBJ_GET, &get_attr, sizeof(get_attr));
+	if (prog_fd < 0)
+		return prog_fd;
+
+	/* create a link, parameterized by cgroup id. */
+	info.cgroup.cgroup_id = id;
+	link_attr.link_create.prog_fd = prog_fd;
+	link_attr.link_create.attach_type = BPF_TRACE_ITER;
+	link_attr.link_create.target_fd = 0;
+	link_attr.link_create.flags = 0;
+	link_attr.link_create.iter_info = (__u64)&info;
+	link_attr.link_create.iter_info_len = sizeof(info);
+	ret = bpf_sys_bpf(BPF_LINK_CREATE, &link_attr, sizeof(link_attr));
+	if (ret < 0) {
+		bpf_sys_close(prog_fd);
+		return ret;
+	}
+	link_fd = ret;
+
+	/* pin the link in the created directory */
+	BPF_SNPRINTF(iter_path, sizeof(iter_path), "%s/stats", dirname);
+	pin_attr.pathname = (__u64)iter_path;
+	pin_attr.bpf_fd = link_fd;
+	pin_attr.file_flags = 0;
+	ret = bpf_sys_bpf(BPF_OBJ_PIN, &pin_attr, sizeof(pin_attr));
+
+	bpf_sys_close(prog_fd);
+	bpf_sys_close(link_fd);
+	return ret;
+}
+
+SEC("tp_btf.s/cgroup_rmdir_s")
+int BPF_PROG(rmdir_prog, struct cgroup *cgrp)
+{
+	static char dirname[64];
+	static char path[64];
+
+	BPF_SNPRINTF(dirname, sizeof(dirname), "%s/%lu", root, cgrp->kn->id);
+	BPF_SNPRINTF(path, sizeof(path), "%s/stats", dirname);
+	bpf_unlink(path, sizeof(path));
+	return bpf_rmdir(dirname, sizeof(dirname));
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_sched_lat.c b/tools/testing/selftests/bpf/progs/cgroup_sched_lat.c
new file mode 100644
index 000000000000..90fe709377e1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_sched_lat.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define TASK_RUNNING 0
+#define BPF_F_CURRENT_CPU 0xffffffffULL
+
+extern void fair_sched_class __ksym;
+extern bool CONFIG_FAIR_GROUP_SCHED __kconfig;
+extern bool CONFIG_CGROUP_SCHED __kconfig;
+
+struct wait_lat {
+	/* Queue_self stands for the latency a task experiences while waiting
+	 * behind the tasks that are from the same cgroup.
+	 *
+	 * Queue_other stands for the latency a task experiences while waiting
+	 * behind the tasks that are from other cgroups.
+	 *
+	 * For example, if there are three tasks: A, B and C. Suppose A and B
+	 * are in the same cgroup and C is in another cgroup and we see A has
+	 * a queueing latency of X milliseconds. Say during those X milliseconds,
+	 * B has run for Y milliseconds. We can break X down into two parts: the
+	 * time when B is on cpu, that is Y; and the time when C is on cpu, X - Y.
+	 *
+	 * Queue_self is the former (Y) while queue_other is the latter (X - Y).
+	 *
+	 * A large value in queue_self is an indication of contention within a
+	 * cgroup, while a large value in queue_other is an indication of
+	 * contention from multiple cgroups.
+	 */
+	u64 queue_self;
+	u64 queue_other;
+};
+
+struct timestamp {
+	/* timestamp when last queued */
+	u64 tsp;
+
+	/* cgroup exec_clock when last queued */
+	u64 exec_clock;
+};
+
+/* Map to store per-cgroup wait latency */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, u64);
+	__type(value, struct wait_lat);
+	__uint(max_entries, 65532);
+} cgroup_lat SEC(".maps");
+
+/* Map to store per-task queue timestamp */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct timestamp);
+} start SEC(".maps");
+
+/* adapted from task_cfs_rq in kernel/sched/sched.h */
+__always_inline
+struct cfs_rq *task_cfs_rq(struct task_struct *t)
+{
+	if (!CONFIG_FAIR_GROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(&t->se, cfs_rq);
+}
+
+/* record enqueue timestamp */
+__always_inline
+static int trace_enqueue(struct task_struct *t)
+{
+	u32 pid = t->pid;
+	struct timestamp *ptr;
+	struct cfs_rq *cfs_rq;
+
+	if (!pid)
+		return 0;
+
+	/* only measure for CFS tasks */
+	if (t->sched_class != &fair_sched_class)
+		return 0;
+
+	ptr = bpf_task_storage_get(&start, t, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(t);
+	if (!cfs_rq)
+		return 0;
+
+	ptr->tsp = bpf_ktime_get_ns();
+	ptr->exec_clock = BPF_CORE_READ(cfs_rq, exec_clock);
+	return 0;
+}
+
+SEC("tp_btf/sched_wakeup")
+int handle__sched_wakeup(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+SEC("tp_btf/sched_wakeup_new")
+int handle__sched_wakeup_new(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+/* task_group() from kernel/sched/sched.h */
+__always_inline
+struct task_group *task_group(struct task_struct *p)
+{
+	if (!CONFIG_CGROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(p, sched_task_group);
+}
+
+__always_inline
+struct cgroup *task_cgroup(struct task_struct *p)
+{
+	struct task_group *tg;
+
+	tg = task_group(p);
+	if (!tg)
+		return NULL;
+
+	return BPF_CORE_READ(tg, css).cgroup;
+}
+
+__always_inline
+u64 max(u64 x, u64 y)
+{
+	return x > y ? x : y;
+}
+
+SEC("tp_btf/sched_switch")
+int handle__sched_switch(u64 *ctx)
+{
+	/* TP_PROTO(bool preempt, struct task_struct *prev,
+	 *	    struct task_struct *next)
+	 */
+	struct task_struct *prev = (struct task_struct *)ctx[1];
+	struct task_struct *next = (struct task_struct *)ctx[2];
+	u64 delta, delta_self, delta_other, id;
+	struct cfs_rq *cfs_rq;
+	struct timestamp *tsp;
+	struct wait_lat *lat;
+	struct cgroup *cgroup;
+
+	/* ivcsw: treat like an enqueue event and store timestamp */
+	if (prev->__state == TASK_RUNNING)
+		trace_enqueue(prev);
+
+	/* only measure for CFS tasks */
+	if (next->sched_class != &fair_sched_class)
+		return 0;
+
+	/* fetch timestamp and calculate delta */
+	tsp = bpf_task_storage_get(&start, next, 0, 0);
+	if (!tsp)
+		return 0;   /* missed enqueue */
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(next);
+	if (!cfs_rq)
+		return 0;
+
+	/* cpu controller may not be enabled */
+	cgroup = task_cgroup(next);
+	if (!cgroup)
+		return 0;
+
+	/* calculate self delay and other delay */
+	delta = bpf_ktime_get_ns() - tsp->tsp;
+	delta_self = BPF_CORE_READ(cfs_rq, exec_clock) - tsp->exec_clock;
+	if (delta_self > delta)
+		delta_self = delta;
+	delta_other = delta - delta_self;
+
+	/* insert into cgroup_lat map */
+	id = BPF_CORE_READ(cgroup, kn, id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (!lat) {
+		struct wait_lat w = {
+			.queue_self = delta_self,
+			.queue_other = delta_other,
+		};
+
+		bpf_map_update_elem(&cgroup_lat, &id, &w, BPF_ANY);
+	} else {
+		lat->queue_self = max(delta_self, lat->queue_self);
+		lat->queue_other = max(delta_other, lat->queue_other);
+	}
+
+	bpf_task_storage_delete(&start, next);
+	return 0;
+}
+
+SEC("iter/cgroup")
+int dump_cgroup(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct wait_lat *lat;
+	u64 id = cgroup->kn->id;
+
+	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (lat) {
+		BPF_SEQ_PRINTF(seq, "queue_self: %8lu\n", lat->queue_self);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8lu\n", lat->queue_other);
+	} else {
+		/* print anyway for universal parsing logic in userspace. */
+		BPF_SEQ_PRINTF(seq, "queue_self: %8d\n", 0);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8d\n", 0);
+	}
+	return 0;
+}
-- 
2.35.1.574.g5d30c73bfb-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
@ 2022-02-26  2:32   ` kernel test robot
  2022-02-26  2:32   ` kernel test robot
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 54+ messages in thread
From: kernel test robot @ 2022-02-26  2:32 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: llvm, kbuild-all, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Hi Hao,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: hexagon-randconfig-r023-20220226 (https://download.01.org/0day-ci/archive/20220226/202202261010.IU59MxY8-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d271fc04d5b97b12e6b797c6067d3c96a8d7470e)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/ee74423719e2efb4efa7a4491920c78b60024ec7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
        git checkout ee74423719e2efb4efa7a4491920c78b60024ec7
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/bpf/cgroup_iter.c:107:39: warning: format specifies type 'unsigned long' but the argument has type 'u64' (aka 'unsigned long long') [-Wformat]
           seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
                                        ~~~     ^~~~~~~~~~~~~~
                                        %llu
>> kernel/bpf/cgroup_iter.c:101:6: warning: no previous prototype for function 'bpf_iter_cgroup_show_fdinfo' [-Wmissing-prototypes]
   void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
        ^
   kernel/bpf/cgroup_iter.c:101:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
   ^
   static 
>> kernel/bpf/cgroup_iter.c:111:5: warning: no previous prototype for function 'bpf_iter_cgroup_fill_link_info' [-Wmissing-prototypes]
   int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
       ^
   kernel/bpf/cgroup_iter.c:111:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
   ^
   static 
   3 warnings generated.


vim +107 kernel/bpf/cgroup_iter.c

   100	
 > 101	void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
   102					 struct seq_file *seq)
   103	{
   104		char buf[64] = {0};
   105	
   106		cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
 > 107		seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
   108		seq_printf(seq, "cgroup_path:\t%s\n", buf);
   109	}
   110	
 > 111	int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
   112					   struct bpf_link_info *info)
   113	{
   114		info->iter.cgroup.cgroup_id = aux->cgroup_id;
   115		return 0;
   116	}
   117	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
  2022-02-26  2:32   ` kernel test robot
@ 2022-02-26  2:32   ` kernel test robot
  2022-02-26  2:53   ` kernel test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 54+ messages in thread
From: kernel test robot @ 2022-02-26  2:32 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: kbuild-all, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Hi Hao,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-defconfig (https://download.01.org/0day-ci/archive/20220226/202202261001.PWCTwlQp-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/ee74423719e2efb4efa7a4491920c78b60024ec7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
        git checkout ee74423719e2efb4efa7a4491920c78b60024ec7
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=m68k SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_stop':
>> kernel/bpf/cgroup_iter.c:60:17: error: implicit declaration of function 'cgroup_put'; did you mean 'cgroup_psi'? [-Werror=implicit-function-declaration]
      60 |                 cgroup_put(v);
         |                 ^~~~~~~~~~
         |                 cgroup_psi
   kernel/bpf/cgroup_iter.c: At top level:
   kernel/bpf/cgroup_iter.c:101:6: warning: no previous prototype for 'bpf_iter_cgroup_show_fdinfo' [-Wmissing-prototypes]
     101 | void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
   kernel/bpf/cgroup_iter.c:107:40: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'u64' {aka 'long long unsigned int'} [-Wformat=]
     107 |         seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
         |                                      ~~^     ~~~~~~~~~~~~~~
         |                                        |        |
         |                                        |        u64 {aka long long unsigned int}
         |                                        long unsigned int
         |                                      %llu
   kernel/bpf/cgroup_iter.c: At top level:
   kernel/bpf/cgroup_iter.c:111:5: warning: no previous prototype for 'bpf_iter_cgroup_fill_link_info' [-Wmissing-prototypes]
     111 | int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
         |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +60 kernel/bpf/cgroup_iter.c

    56	
    57	static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
    58	{
    59		if (v)
  > 60			cgroup_put(v);
    61	}
    62	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
  2022-02-26  2:32   ` kernel test robot
  2022-02-26  2:32   ` kernel test robot
@ 2022-02-26  2:53   ` kernel test robot
  2022-03-02 21:59   ` Yonghong Song
  2022-03-02 22:45   ` Kumar Kartikeya Dwivedi
  4 siblings, 0 replies; 54+ messages in thread
From: kernel test robot @ 2022-02-26  2:53 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: kbuild-all, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel, Hao Luo

Hi Hao,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-randconfig-c002-20220226 (https://download.01.org/0day-ci/archive/20220226/202202261033.JZXn2oS0-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/ee74423719e2efb4efa7a4491920c78b60024ec7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
        git checkout ee74423719e2efb4efa7a4491920c78b60024ec7
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/bpf/cgroup_iter.c:101:6: warning: no previous prototype for 'bpf_iter_cgroup_show_fdinfo' [-Wmissing-prototypes]
     101 | void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
>> kernel/bpf/cgroup_iter.c:107:40: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'u64' {aka 'long long unsigned int'} [-Wformat=]
     107 |         seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
         |                                      ~~^     ~~~~~~~~~~~~~~
         |                                        |        |
         |                                        |        u64 {aka long long unsigned int}
         |                                        long unsigned int
         |                                      %llu
   kernel/bpf/cgroup_iter.c: At top level:
>> kernel/bpf/cgroup_iter.c:111:5: warning: no previous prototype for 'bpf_iter_cgroup_fill_link_info' [-Wmissing-prototypes]
     111 | int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
         |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/bpf_iter_cgroup_show_fdinfo +101 kernel/bpf/cgroup_iter.c

   100	
 > 101	void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
   102					 struct seq_file *seq)
   103	{
   104		char buf[64] = {0};
   105	
   106		cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
 > 107		seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
   108		seq_printf(seq, "cgroup_path:\t%s\n", buf);
   109	}
   110	
 > 111	int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
   112					   struct bpf_link_info *info)
   113	{
   114		info->iter.cgroup.cgroup_id = aux->cgroup_id;
   115		return 0;
   116	}
   117	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
@ 2022-02-27  5:18   ` Kumar Kartikeya Dwivedi
  2022-02-28 22:10     ` Hao Luo
  2022-03-02 20:55   ` Yonghong Song
  2022-03-12  3:46   ` Al Viro
  2 siblings, 1 reply; 54+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-27  5:18 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> This patch allows bpf_syscall prog to perform some basic filesystem
> operations: create, remove directories and unlink files. Three bpf
> helpers are added for this purpose. When combined with the following
> patches that allow pinning and getting bpf objects from bpf prog,
> this feature can be used to create directory hierarchy in bpffs that
> help manage bpf objects purely using bpf progs.
>
> The added helpers are subject to the same permission checks as their
> syscall versions. For example, one cannot write to a read-only file system;
> The identity of the current process is checked to see whether it has
> sufficient permission to perform the operations.
>
> Only directories and files in bpffs can be created or removed by these
> helpers. But it won't be too hard to allow these helpers to operate
> on files in other filesystems, if we want.
>
> Signed-off-by: Hao Luo <haoluo@google.com>
> ---
>  include/linux/bpf.h            |   1 +
>  include/uapi/linux/bpf.h       |  26 +++++
>  kernel/bpf/inode.c             |   9 +-
>  kernel/bpf/syscall.c           | 177 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  26 +++++
>  5 files changed, 236 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f19abc59b6cd..fce5e26179f5 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1584,6 +1584,7 @@ int bpf_link_new_fd(struct bpf_link *link);
>  struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>
> +bool bpf_path_is_bpf_dir(const struct path *path);
>  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>  int bpf_obj_get_user(const char __user *pathname, int flags);
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index afe3d0d7f5f2..a5dbc794403d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5086,6 +5086,29 @@ union bpf_attr {
>   *	Return
>   *		0 on success, or a negative error in case of failure. On error
>   *		*dst* buffer is zeroed out.
> + *
> + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> + *	Description
> + *		Attempts to create a directory name *pathname*. The argument
> + *		*pathname_sz* specifies the length of the string *pathname*.
> + *		The argument *mode* specifies the mode for the new directory. It
> + *		is modified by the process's umask. It has the same semantic as
> + *		the syscall mkdir(2).
> + *	Return
> + *		0 on success, or a negative error in case of failure.
> + *
> + * long bpf_rmdir(const char *pathname, int pathname_sz)
> + *	Description
> + *		Deletes a directory, which must be empty.
> + *	Return
> + *		0 on success, or a negative error in case of failure.
> + *
> + * long bpf_unlink(const char *pathname, int pathname_sz)
> + *	Description
> + *		Deletes a name and possibly the file it refers to. It has the
> + *		same semantic as the syscall unlink(2).
> + *	Return
> + *		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5280,6 +5303,9 @@ union bpf_attr {
>  	FN(xdp_load_bytes),		\
>  	FN(xdp_store_bytes),		\
>  	FN(copy_from_user_task),	\
> +	FN(mkdir),			\
> +	FN(rmdir),			\
> +	FN(unlink),			\
>  	/* */
>

How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
more useful for other cases in future, and when AT_FDCWD is passed, has the same
functionality as these, but when openat/fget is supported, it would work
relative to other dirfds as well. It can also allow using a dirfd of the
process calling read for an iterator (e.g. if it sets the fd number using
skel->bss).
unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
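
For concreteness, a rough sketch of what the "at" variants' signatures
could look like (hypothetical, mirroring mkdirat(2)/unlinkat(2); not
part of this series):

	long bpf_sys_mkdirat(int dirfd, const char *pathname, int pathname_sz, u32 mode);
	long bpf_sys_unlinkat(int dirfd, const char *pathname, int pathname_sz, int flags);

	/* dirfd == AT_FDCWD would behave like the non-"at" helpers above;
	 * unlinkat with AT_REMOVEDIR subsumes bpf_rmdir().
	 */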

WDYT?

> [...]

--
Kartikeya

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-02-27  5:18   ` Kumar Kartikeya Dwivedi
@ 2022-02-28 22:10     ` Hao Luo
  2022-03-02 19:34       ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-02-28 22:10 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

Hi Kumar,

On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> > This patch allows bpf_syscall prog to perform some basic filesystem
> > operations: create, remove directories and unlink files. Three bpf
> > helpers are added for this purpose. When combined with the following
> > patches that allow pinning and getting bpf objects from bpf prog,
> > this feature can be used to create directory hierarchy in bpffs that
> > help manage bpf objects purely using bpf progs.
> >
> > The added helpers are subject to the same permission checks as their
> > syscall versions. For example, one cannot write to a read-only file system;
> > The identity of the current process is checked to see whether it has
> > sufficient permission to perform the operations.
> >
> > Only directories and files in bpffs can be created or removed by these
> > helpers. But it won't be too hard to allow these helpers to operate
> > on files in other filesystems, if we want.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > ---
> > + *
> > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> > + *   Description
> > + *           Attempts to create a directory name *pathname*. The argument
> > + *           *pathname_sz* specifies the length of the string *pathname*.
> > + *           The argument *mode* specifies the mode for the new directory. It
> > + *           is modified by the process's umask. It has the same semantic as
> > + *           the syscall mkdir(2).
> > + *   Return
> > + *           0 on success, or a negative error in case of failure.
> > + *
> > + * long bpf_rmdir(const char *pathname, int pathname_sz)
> > + *   Description
> > + *           Deletes a directory, which must be empty.
> > + *   Return
> > + *           0 on success, or a negative error in case of failure.
> > + *
> > + * long bpf_unlink(const char *pathname, int pathname_sz)
> > + *   Description
> > + *           Deletes a name and possibly the file it refers to. It has the
> > + *           same semantic as the syscall unlink(2).
> > + *   Return
> > + *           0 on success, or a negative error in case of failure.
> >   */
> >
>
> How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
> more useful for other cases in future, and when AT_FDCWD is passed, has the same
> functionality as these, but when openat/fget is supported, it would work
> relative to other dirfds as well. It can also allow using a dirfd of the
> process calling read for an iterator (e.g. if it sets the fd number using
> skel->bss).
> unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
>
> WDYT?
>

The idea sounds good to me; it's more flexible. But I don't have a real
use case for the added 'dirfd' at this moment. For all the use cases I
can think of, absolute paths should suffice. Unless other reviewers
object, I will try switching to mkdirat and unlinkat in v2.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-02-28 22:10     ` Hao Luo
@ 2022-03-02 19:34       ` Alexei Starovoitov
  2022-03-03 18:50         ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-02 19:34 UTC (permalink / raw)
  To: Hao Luo
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Mon, Feb 28, 2022 at 02:10:39PM -0800, Hao Luo wrote:
> Hi Kumar,
> 
> On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> > > This patch allows bpf_syscall prog to perform some basic filesystem
> > > operations: create, remove directories and unlink files. Three bpf
> > > helpers are added for this purpose. When combined with the following
> > > patches that allow pinning and getting bpf objects from bpf prog,
> > > this feature can be used to create directory hierarchy in bpffs that
> > > help manage bpf objects purely using bpf progs.
> > >
> > > The added helpers are subject to the same permission checks as their
> > > syscall versions. For example, one cannot write to a read-only file system;
> > > The identity of the current process is checked to see whether it has
> > > sufficient permission to perform the operations.
> > >
> > > Only directories and files in bpffs can be created or removed by these
> > > helpers. But it won't be too hard to allow these helpers to operate
> > > on files in other filesystems, if we want.
> > >
> > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > ---
> > > + *
> > > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> > > + *   Description
> > > + *           Attempts to create a directory name *pathname*. The argument
> > > + *           *pathname_sz* specifies the length of the string *pathname*.
> > > + *           The argument *mode* specifies the mode for the new directory. It
> > > + *           is modified by the process's umask. It has the same semantic as
> > > + *           the syscall mkdir(2).
> > > + *   Return
> > > + *           0 on success, or a negative error in case of failure.
> > > + *
> > > + * long bpf_rmdir(const char *pathname, int pathname_sz)
> > > + *   Description
> > > + *           Deletes a directory, which must be empty.
> > > + *   Return
> > > + *           0 on success, or a negative error in case of failure.
> > > + *
> > > + * long bpf_unlink(const char *pathname, int pathname_sz)
> > > + *   Description
> > > + *           Deletes a name and possibly the file it refers to. It has the
> > > + *           same semantic as the syscall unlink(2).
> > > + *   Return
> > > + *           0 on success, or a negative error in case of failure.
> > >   */
> > >
> >
> > How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
> > more useful for other cases in future, and when AT_FDCWD is passed, has the same
> > functionality as these, but when openat/fget is supported, it would work
> > relative to other dirfds as well. It can also allow using a dirfd of the
> > process calling read for an iterator (e.g. if it sets the fd number using
> > skel->bss).
> > unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
> >
> > WDYT?
> >
> 
> The idea sounds good to me; it's more flexible. But I don't have a real
> use case for the added 'dirfd' at this moment. For all the use cases I
> can think of, absolute paths should suffice. Unless other reviewers
> object, I will try switching to mkdirat and unlinkat in v2.

I'm surprised you don't need "at" variants.
I thought your production setup has a top level cgroup controller and
then inner tasks inside containers manage cgroups on their own.
Since containers are involved they likely run inside their own mountns.
The cgroupfs mount is a single mount, so you probably don't even need to
bind mount it inside containers, but bpffs is not a single mount. You need
to bind mount the top bpffs inside containers for tasks to access it.
Now for cgroupfs the abs path is not an issue, but for bpffs
AT_FDCWD becomes a problem. AT_FDCWD uses the current mount ns.
Inside a container that will be different. Unless you bind mount into the
exact same path, the full path has different meanings inside and outside
of the container.
It seems to me the bpf progs attached to cgroup sleepable events should
be using FD of bpffs. Then when these tracepoints are triggered from
different containers in different mountns they will get the right dir prefix.
What am I missing?

I think non-AT variants are not needed. The prog can always pass AT_FDCWD
if that's really the intent, but passing an actual FD seems more error-proof.
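
A hand-wavy sketch of the call shape only (bpf_sys_mkdirat is the
hypothetical "at" variant from earlier in this thread, and how the fd is
made valid in the prog's context is glossed over here):

	/* fd referring to a bpffs dir, e.g. published via skel->bss */
	int bpffs_dir_fd;

	SEC("tp_btf.s/cgroup_mkdir_s")
	int BPF_PROG(mkdir_prog, struct cgroup *cgrp)
	{
		static char dirname[64];

		BPF_SNPRINTF(dirname, sizeof(dirname), "%lu", cgrp->kn->id);
		/* resolve relative to the bpffs mount behind the fd rather
		 * than the triggering task's cwd/mountns
		 */
		return bpf_sys_mkdirat(bpffs_dir_fd, dirname, sizeof(dirname), 0755);
	}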

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-02-25 23:43 ` [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints Hao Luo
@ 2022-03-02 19:41   ` Alexei Starovoitov
  2022-03-03 19:37     ` Hao Luo
  2022-03-02 21:23   ` Yonghong Song
  1 sibling, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-02 19:41 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Fri, Feb 25, 2022 at 03:43:34PM -0800, Hao Luo wrote:
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index e7c2276be33e..c73c7ab3680e 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -51,6 +51,7 @@ struct bpf_raw_event_map {
>  	void			*bpf_func;
>  	u32			num_args;
>  	u32			writable_size;
> +	u32			sleepable;

It increases the size for all tracepoints. 
See BPF_RAW_TP in include/asm-generic/vmlinux.lds.h
Please switch writable_size and sleepable to u16.
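
I.e. (a sketch based on the struct quoted above):

	struct bpf_raw_event_map {
		struct tracepoint	*tp;
		void			*bpf_func;
		u32			num_args;
		u16			writable_size;
		u16			sleepable;
	} __aligned(32);
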
>  
> -static const struct bpf_func_proto *
> -syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +/* Syscall helpers that are also allowed in sleepable tracing prog. */
> +const struct bpf_func_proto *
> +tracing_prog_syscall_func_proto(enum bpf_func_id func_id,
> +				const struct bpf_prog *prog)
>  {
>  	switch (func_id) {
>  	case BPF_FUNC_sys_bpf:
>  		return &bpf_sys_bpf_proto;
> -	case BPF_FUNC_btf_find_by_name_kind:
> -		return &bpf_btf_find_by_name_kind_proto;
>  	case BPF_FUNC_sys_close:
>  		return &bpf_sys_close_proto;
> -	case BPF_FUNC_kallsyms_lookup_name:
> -		return &bpf_kallsyms_lookup_name_proto;
>  	case BPF_FUNC_mkdir:
>  		return &bpf_mkdir_proto;
>  	case BPF_FUNC_rmdir:
>  		return &bpf_rmdir_proto;
>  	case BPF_FUNC_unlink:
>  		return &bpf_unlink_proto;
> +	default:
> +		return NULL;
> +	}
> +}

If I read this correctly the goal is to disallow find_by_name_kind
and lookup_name from sleepable tps. Why? What's the harm?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel.
  2022-02-25 23:43 ` [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel Hao Luo
@ 2022-03-02 20:01   ` Alexei Starovoitov
  2022-03-03 19:14     ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-02 20:01 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Fri, Feb 25, 2022 at 03:43:37PM -0800, Hao Luo wrote:
> After we introduced sleepable tracing programs, we now have an
> interesting problem. There are now three execution paths that can
> reach bpf_sys_bpf:
> 
>  1. called from bpf syscall.
>  2. called from kernel context (e.g. kernel modules).
>  3. called from bpf programs.
> 
> Ideally, capability check in bpf_sys_bpf is necessary for the first two
> scenarios. But it may not be necessary for the third case.

Well, it's unnecessary for the first two as well.
When called from the kernel lskel it's a pointless check.
The kernel module can do anything regardless.
When called from a bpf syscall program it's not quite correct either.
When CAP_BPF was introduced we designed it to enforce permissions
at prog load time. The prog_run path doesn't check permissions.
So syscall progs don't need this secondary permission check.
Please add "case BPF_PROG_TYPE_SYSCALL:" to is_perfmon_prog_type()
and combine it with this patch.
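
Roughly, something like this (sketch; the existing case list is elided):

	static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
	{
		switch (prog_type) {
		/* ... existing tracing prog types (kprobe, tracepoint, etc.) ... */
		case BPF_PROG_TYPE_SYSCALL:
			return true;
		default:
			return false;
		}
	}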

That would be the best. The alternative below is less appealing.

> An alternative of lifting this permission check would be introducing an
> 'unpriv' version of bpf_sys_bpf, which doesn't check the current task's
> capability. If the owner of the tracing prog wants it to be exclusively
> used by root users, they can use the 'priv' version of bpf_sys_bpf; if
> the owner wants it to be usable for non-root users, they can use the
> 'unpriv' version.

...

> -	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable() && !uattr.is_kernel)

This is a great idea. If I had thought of this I would have gone with it
when prog_syscall was introduced.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
  2022-02-27  5:18   ` Kumar Kartikeya Dwivedi
@ 2022-03-02 20:55   ` Yonghong Song
  2022-03-03 18:56     ` Hao Luo
  2022-03-12  3:46   ` Al Viro
  2 siblings, 1 reply; 54+ messages in thread
From: Yonghong Song @ 2022-03-02 20:55 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel



On 2/25/22 3:43 PM, Hao Luo wrote:
> This patch allows bpf_syscall prog to perform some basic filesystem
> operations: create, remove directories and unlink files. Three bpf
> helpers are added for this purpose. When combined with the following
> patches that allow pinning and getting bpf objects from bpf prog,
> this feature can be used to create directory hierarchy in bpffs that
> help manage bpf objects purely using bpf progs.
> 
> The added helpers are subject to the same permission checks as their
> syscall versions. For example, one cannot write to a read-only file system;
> The identity of the current process is checked to see whether it has
> sufficient permission to perform the operations.
> 
> Only directories and files in bpffs can be created or removed by these
> helpers. But it won't be too hard to allow these helpers to operate
> on files in other filesystems, if we want.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> ---
>   include/linux/bpf.h            |   1 +
>   include/uapi/linux/bpf.h       |  26 +++++
>   kernel/bpf/inode.c             |   9 +-
>   kernel/bpf/syscall.c           | 177 +++++++++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h |  26 +++++
>   5 files changed, 236 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f19abc59b6cd..fce5e26179f5 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1584,6 +1584,7 @@ int bpf_link_new_fd(struct bpf_link *link);
>   struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>   struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>   
> +bool bpf_path_is_bpf_dir(const struct path *path);
>   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>   int bpf_obj_get_user(const char __user *pathname, int flags);
>   
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index afe3d0d7f5f2..a5dbc794403d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5086,6 +5086,29 @@ union bpf_attr {
>    *	Return
>    *		0 on success, or a negative error in case of failure. On error
>    *		*dst* buffer is zeroed out.
> + *
> + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)

Can we make pathname_sz u32 instead of int? pathname_sz should
never be negative anyway.

Also, I think it is a good idea to add a 'u64 flags' parameter to all
three helpers, so we have room in the future to tune for new use cases.
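
I.e. something like (a sketch of the suggested signatures; helper names
unchanged, flags reserved for future use):

	long bpf_mkdir(const char *pathname, u32 pathname_sz, u32 mode, u64 flags);
	long bpf_rmdir(const char *pathname, u32 pathname_sz, u64 flags);
	long bpf_unlink(const char *pathname, u32 pathname_sz, u64 flags);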

> + *	Description
> + *		Attempts to create a directory name *pathname*. The argument
> + *		*pathname_sz* specifies the length of the string *pathname*.
> + *		The argument *mode* specifies the mode for the new directory. It
> + *		is modified by the process's umask. It has the same semantic as
> + *		the syscall mkdir(2).
> + *	Return
> + *		0 on success, or a negative error in case of failure.
> + *
> + * long bpf_rmdir(const char *pathname, int pathname_sz)
> + *	Description
> + *		Deletes a directory, which must be empty.
> + *	Return
> + *		0 on success, or a negative error in case of failure.
> + *
> + * long bpf_unlink(const char *pathname, int pathname_sz)
> + *	Description
> + *		Deletes a name and possibly the file it refers to. It has the
> + *		same semantic as the syscall unlink(2).
> + *	Return
> + *		0 on success, or a negative error in case of failure.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -5280,6 +5303,9 @@ union bpf_attr {
>   	FN(xdp_load_bytes),		\
>   	FN(xdp_store_bytes),		\
>   	FN(copy_from_user_task),	\
> +	FN(mkdir),			\
> +	FN(rmdir),			\
> +	FN(unlink),			\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
[...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-02-25 23:43 ` [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints Hao Luo
  2022-03-02 19:41   ` Alexei Starovoitov
@ 2022-03-02 21:23   ` Yonghong Song
  2022-03-02 21:30     ` Alexei Starovoitov
  1 sibling, 1 reply; 54+ messages in thread
From: Yonghong Song @ 2022-03-02 21:23 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel



On 2/25/22 3:43 PM, Hao Luo wrote:
> Add a new type of bpf tracepoints: sleepable tracepoints, which allows
> the handler to make calls that may sleep. With sleepable tracepoints, a
> set of syscall helpers (which may sleep) may also be called from
> sleepable tracepoints.

There are some old discussions on sleepable tracepoints; it may be
worthwhile to take a look.

https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/

> 
> In the following patches, we will whitelist some tracepoints to be
> sleepable.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> ---
>   include/linux/bpf.h             | 10 +++++++-
>   include/linux/tracepoint-defs.h |  1 +
>   include/trace/bpf_probe.h       | 22 ++++++++++++++----
>   kernel/bpf/syscall.c            | 41 +++++++++++++++++++++++----------
>   kernel/trace/bpf_trace.c        |  5 ++++
>   5 files changed, 61 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c36eeced3838..759ade7b24b3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1810,6 +1810,9 @@ struct bpf_prog *bpf_prog_by_id(u32 id);
>   struct bpf_link *bpf_link_by_id(u32 id);
>   
>   const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
> +const struct bpf_func_proto *
> +tracing_prog_syscall_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
> +
>   void bpf_task_storage_free(struct task_struct *task);
>   bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
>   const struct btf_func_model *
> @@ -1822,7 +1825,6 @@ struct bpf_core_ctx {
>   
>   int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo,
>   		   int relo_idx, void *insn);
> -
>   #else /* !CONFIG_BPF_SYSCALL */
>   static inline struct bpf_prog *bpf_prog_get(u32 ufd)
>   {
> @@ -2011,6 +2013,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   	return NULL;
>   }
>   
> +static inline struct bpf_func_proto *
> +tracing_prog_syscall_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return NULL;
> +}
> +
>   static inline void bpf_task_storage_free(struct task_struct *task)
>   {
>   }
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index e7c2276be33e..c73c7ab3680e 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -51,6 +51,7 @@ struct bpf_raw_event_map {
>   	void			*bpf_func;
>   	u32			num_args;
>   	u32			writable_size;
> +	u32			sleepable;
>   } __aligned(32);
>   
>   /*
[...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-02 21:23   ` Yonghong Song
@ 2022-03-02 21:30     ` Alexei Starovoitov
  2022-03-03  1:08       ` Yonghong Song
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-02 21:30 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, Stanislav Fomichev, bpf, LKML

On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/25/22 3:43 PM, Hao Luo wrote:
> > Add a new type of bpf tracepoints: sleepable tracepoints, which allows
> > the handler to make calls that may sleep. With sleepable tracepoints, a
> > set of syscall helpers (which may sleep) may also be called from
> > sleepable tracepoints.
>
> There are some old discussions on sleepable tracepoints, maybe
> worthwhile to take a look.
>
> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/

Right. It's very much related, but obsolete too.
We don't need any of that for sleepable _raw_ tps.
I prefer to stay with the "sleepable" name as well, to
match the rest of the bpf sleepable code.
In all cases it's faultable.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
                     ` (2 preceding siblings ...)
  2022-02-26  2:53   ` kernel test robot
@ 2022-03-02 21:59   ` Yonghong Song
  2022-03-03 20:02     ` Hao Luo
  2022-03-02 22:45   ` Kumar Kartikeya Dwivedi
  4 siblings, 1 reply; 54+ messages in thread
From: Yonghong Song @ 2022-03-02 21:59 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel



On 2/25/22 3:43 PM, Hao Luo wrote:
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter.
> 
> The target cgroup's state can be read out via a link of this iter.
> Typically, we can monitor cgroup creation and deletion using sleepable
> tracing and use it to create corresponding directories in bpffs and pin
> a cgroup id parameterized link in the directory. Then we can read the
> auto-pinned iter link to get cgroup's state. The output of the iter link
> is determined by the program. See the selftest test_cgroup_stats.c for
> an example.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> ---
>   include/linux/bpf.h            |   1 +
>   include/uapi/linux/bpf.h       |   6 ++
>   kernel/bpf/Makefile            |   2 +-
>   kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h |   6 ++
>   5 files changed, 155 insertions(+), 1 deletion(-)
>   create mode 100644 kernel/bpf/cgroup_iter.c
> 
[...]
> +
> +static const struct seq_operations cgroup_iter_seq_ops = {
> +	.start  = cgroup_iter_seq_start,
> +	.next   = cgroup_iter_seq_next,
> +	.stop   = cgroup_iter_seq_stop,
> +	.show   = cgroup_iter_seq_show,
> +};
> +
> +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
> +
> +static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
> +{
> +	*(u64 *)priv_data = aux->cgroup_id;
> +	return 0;
> +}
> +
> +static void cgroup_iter_seq_fini(void *priv_data)
> +{
> +}
> +
> +static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
> +	.seq_ops                = &cgroup_iter_seq_ops,
> +	.init_seq_private       = cgroup_iter_seq_init,
> +	.fini_seq_private       = cgroup_iter_seq_fini,

Since cgroup_iter_seq_fini() is a nop, you can just have
	.fini_seq_private	= NULL,

> +	.seq_priv_size          = sizeof(u64),
> +};
> +
> +static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
> +				  union bpf_iter_link_info *linfo,
> +				  struct bpf_iter_aux_info *aux)
> +{
> +	aux->cgroup_id = linfo->cgroup.cgroup_id;
> +	return 0;
> +}
> +
> +static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
> +{
> +}
> +
> +void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> +				 struct seq_file *seq)
> +{
> +	char buf[64] = {0};

Is this 64 the maximum possible cgroup path length?
If there is a macro for that, I think it would be good to use it.

> +
> +	cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));

cgroup_path_from_kernfs_id() might fail, in which case buf will stay
empty and cgroup_path will print nothing. I guess this might be the
expected result. It might be good to add a comment in the code to clarify.
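
E.g. (a sketch of the suggested clarification, also folding in the %llu
fix the test robot flagged):

	cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
	/* If the cgroup is gone, buf stays empty and cgroup_path is
	 * printed as an empty string.
	 */
	seq_printf(seq, "cgroup_id:\t%llu\n", aux->cgroup_id);
	seq_printf(seq, "cgroup_path:\t%s\n", buf);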


> +	seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
> +	seq_printf(seq, "cgroup_path:\t%s\n", buf);
> +}
> +
> +int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
> +				   struct bpf_link_info *info)
> +{
> +	info->iter.cgroup.cgroup_id = aux->cgroup_id;
> +	return 0;
> +}
> +
> +DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
> +		     struct cgroup *cgroup)
> +
> +static struct bpf_iter_reg bpf_cgroup_reg_info = {
> +	.target			= "cgroup",
> +	.attach_target		= bpf_iter_attach_cgroup,
> +	.detach_target		= bpf_iter_detach_cgroup,

The same here: since bpf_iter_detach_cgroup() is a nop,
you can replace it with NULL in the above.

> +	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
> +	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
> +	.ctx_arg_info_size	= 1,
> +	.ctx_arg_info		= {
> +		{ offsetof(struct bpf_iter__cgroup, cgroup),
> +		  PTR_TO_BTF_ID },
> +	},
> +	.seq_info		= &cgroup_iter_seq_info,
> +};
> +
> +static int __init bpf_cgroup_iter_init(void)
> +{
> +	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
> +	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
> +}
> +
[...]


* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
                     ` (3 preceding siblings ...)
  2022-03-02 21:59   ` Yonghong Song
@ 2022-03-02 22:45   ` Kumar Kartikeya Dwivedi
  2022-03-03  2:03     ` Yonghong Song
  4 siblings, 1 reply; 54+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-02 22:45 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter.
>
> The target cgroup's state can be read out via a link of this iter.
> Typically, we can monitor cgroup creation and deletion using sleepable
> tracing and use it to create corresponding directories in bpffs and pin
> a cgroup id parameterized link in the directory. Then we can read the
> auto-pinned iter link to get cgroup's state. The output of the iter link
> is determined by the program. See the selftest test_cgroup_stats.c for
> an example.
>
> Signed-off-by: Hao Luo <haoluo@google.com>
> ---
>  include/linux/bpf.h            |   1 +
>  include/uapi/linux/bpf.h       |   6 ++
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   6 ++
>  5 files changed, 155 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/cgroup_iter.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 759ade7b24b3..3ce9b0b7ed89 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags);
>
>  struct bpf_iter_aux_info {
>  	struct bpf_map *map;
> +	u64 cgroup_id;
>  };
>
>  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a5dbc794403d..855ad80d9983 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>  	struct {
>  		__u32	map_fd;
>  	} map;
> +	struct {
> +		__u64	cgroup_id;
> +	} cgroup;
>  };
>
>  /* BPF syscall commands, see bpf(2) man-page for more details. */
> @@ -5887,6 +5890,9 @@ struct bpf_link_info {
>  				struct {
>  					__u32 map_id;
>  				} map;
> +				struct {
> +					__u64 cgroup_id;
> +				} cgroup;
>  			};
>  		} iter;
>  		struct  {
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index c1a9be6a4b9f..52a0e4c6e96e 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
> new file mode 100644
> index 000000000000..011d9dcd1d51
> --- /dev/null
> +++ b/kernel/bpf/cgroup_iter.c
> @@ -0,0 +1,141 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Google */
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/cgroup.h>
> +#include <linux/kernel.h>
> +#include <linux/seq_file.h>
> +
> +struct bpf_iter__cgroup {
> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> +	__bpf_md_ptr(struct cgroup *, cgroup);
> +};
> +
> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct cgroup *cgroup;
> +	u64 cgroup_id;
> +
> +	/* Only one session is supported. */
> +	if (*pos > 0)
> +		return NULL;
> +
> +	cgroup_id = *(u64 *)seq->private;
> +	cgroup = cgroup_get_from_id(cgroup_id);
> +	if (!cgroup)
> +		return NULL;
> +
> +	if (*pos == 0)
> +		++*pos;
> +
> +	return cgroup;
> +}
> +
> +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	++*pos;
> +	return NULL;
> +}
> +
> +static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
> +{
> +	struct bpf_iter__cgroup ctx;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	ctx.meta = &meta;
> +	ctx.cgroup = v;
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, false);
> +	if (prog)
> +		ret = bpf_iter_run_prog(prog, &ctx);
> +
> +	return ret;
> +}
> +
> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> +{
> +	if (v)
> +		cgroup_put(v);
> +}

I think in existing iterators, we make a final call to seq_show, with v as NULL,
is there a specific reason to do it differently for this? There is logic in
bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
NULL, to execute BPF program with NULL p, see the comment above stop label.

If you do add the seq_show call with NULL, you'd also need to change the
ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
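
For reference, task_iter re-dispatches show() from stop() for the final
NULL call; adapted to this patch it would be roughly (untested sketch):

	static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
	{
		/* v == NULL means start()/next() ended the walk; run the
		 * prog one last time so it can see the end of iteration.
		 */
		if (!v)
			(void)cgroup_iter_seq_show(seq, NULL);
		else
			cgroup_put(v);
	}

together with PTR_TO_BTF_ID_OR_NULL in ctx_arg_info as noted above.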

[...]

--
Kartikeya


* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-02 21:30     ` Alexei Starovoitov
@ 2022-03-03  1:08       ` Yonghong Song
  2022-03-03  2:29         ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Yonghong Song @ 2022-03-03  1:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, Stanislav Fomichev, bpf, LKML



On 3/2/22 1:30 PM, Alexei Starovoitov wrote:
> On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 2/25/22 3:43 PM, Hao Luo wrote:
>>> Add a new type of bpf tracepoints: sleepable tracepoints, which allows
>>> the handler to make calls that may sleep. With sleepable tracepoints, a
>>> set of syscall helpers (which may sleep) may also be called from
>>> sleepable tracepoints.
>>
>> There are some old discussions on sleepable tracepoints, maybe
>> worthwhile to take a look.
>>
>> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/
> 
> Right. It's very much related, but obsolete too.
> We don't need any of that for sleepable _raw_ tps.
> I prefer to stay with "sleepable" name as well to
> match the rest of the bpf sleepable code.
> In all cases it's faultable.

Sounds good to me. Agreed that for the bpf use case, Hao's
implementation should be enough.


* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-02 22:45   ` Kumar Kartikeya Dwivedi
@ 2022-03-03  2:03     ` Yonghong Song
  2022-03-03  3:03       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 54+ messages in thread
From: Yonghong Song @ 2022-03-03  2:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel



On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
>> be parameterized by a cgroup id and prints only that cgroup. So one
>> needs to specify a target cgroup id when attaching this iter.
>>
>> The target cgroup's state can be read out via a link of this iter.
>> Typically, we can monitor cgroup creation and deletion using sleepable
>> tracing and use it to create corresponding directories in bpffs and pin
>> a cgroup id parameterized link in the directory. Then we can read the
>> auto-pinned iter link to get cgroup's state. The output of the iter link
>> is determined by the program. See the selftest test_cgroup_stats.c for
>> an example.
>>
>> Signed-off-by: Hao Luo <haoluo@google.com>
>> ---
>>   include/linux/bpf.h            |   1 +
>>   include/uapi/linux/bpf.h       |   6 ++
>>   kernel/bpf/Makefile            |   2 +-
>>   kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h |   6 ++
>>   5 files changed, 155 insertions(+), 1 deletion(-)
>>   create mode 100644 kernel/bpf/cgroup_iter.c
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 759ade7b24b3..3ce9b0b7ed89 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags);
>>
>>   struct bpf_iter_aux_info {
>>   	struct bpf_map *map;
>> +	u64 cgroup_id;
>>   };
>>
>>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index a5dbc794403d..855ad80d9983 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>   	struct {
>>   		__u32	map_fd;
>>   	} map;
>> +	struct {
>> +		__u64	cgroup_id;
>> +	} cgroup;
>>   };
>>
>>   /* BPF syscall commands, see bpf(2) man-page for more details. */
>> @@ -5887,6 +5890,9 @@ struct bpf_link_info {
>>   				struct {
>>   					__u32 map_id;
>>   				} map;
>> +				struct {
>> +					__u64 cgroup_id;
>> +				} cgroup;
>>   			};
>>   		} iter;
>>   		struct  {
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index c1a9be6a4b9f..52a0e4c6e96e 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>>
>>   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
>>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
>>   obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>>   obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>> diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
>> new file mode 100644
>> index 000000000000..011d9dcd1d51
>> --- /dev/null
>> +++ b/kernel/bpf/cgroup_iter.c
>> @@ -0,0 +1,141 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright (c) 2022 Google */
>> +#include <linux/bpf.h>
>> +#include <linux/btf_ids.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/kernel.h>
>> +#include <linux/seq_file.h>
>> +
>> +struct bpf_iter__cgroup {
>> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
>> +	__bpf_md_ptr(struct cgroup *, cgroup);
>> +};
>> +
>> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
>> +{
>> +	struct cgroup *cgroup;
>> +	u64 cgroup_id;
>> +
>> +	/* Only one session is supported. */
>> +	if (*pos > 0)
>> +		return NULL;
>> +
>> +	cgroup_id = *(u64 *)seq->private;
>> +	cgroup = cgroup_get_from_id(cgroup_id);
>> +	if (!cgroup)
>> +		return NULL;
>> +
>> +	if (*pos == 0)
>> +		++*pos;
>> +
>> +	return cgroup;
>> +}
>> +
>> +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> +	++*pos;
>> +	return NULL;
>> +}
>> +
>> +static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
>> +{
>> +	struct bpf_iter__cgroup ctx;
>> +	struct bpf_iter_meta meta;
>> +	struct bpf_prog *prog;
>> +	int ret = 0;
>> +
>> +	ctx.meta = &meta;
>> +	ctx.cgroup = v;
>> +	meta.seq = seq;
>> +	prog = bpf_iter_get_info(&meta, false);
>> +	if (prog)
>> +		ret = bpf_iter_run_prog(prog, &ctx);
>> +
>> +	return ret;
>> +}
>> +
>> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
>> +{
>> +	if (v)
>> +		cgroup_put(v);
>> +}
> 
> I think in existing iterators, we make a final call to seq_show, with v as NULL,
> is there a specific reason to do it differently for this? There is logic in
> bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
> NULL, to execute BPF program with NULL p, see the comment above stop label.
> 
> If you do add the seq_show call with NULL, you'd also need to change the
> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.

Kumar, PTR_TO_BTF_ID should be okay since the show() never takes a
NULL cgroup. But we do have issues for cgroup_iter_seq_stop(), which
I missed earlier.

For cgroup_iter, the following is the current workflow:
    start -> not NULL -> show -> next -> NULL -> stop
or
    start -> NULL -> stop

So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
the cgroup_put() is not actually called, i.e., the reference on the
corresponding cgroup is never released.

There are two ways to fix the issue:
   . call cgroup_put() in next() before returning NULL. This way,
     stop() will be a noop.
   . move cgroup_get_from_id() and cgroup_put() into
     bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().

I prefer the second approach as it is cleaner.
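
A rough sketch of the second option (untested; it assumes
bpf_iter_aux_info is changed to hold a struct cgroup * instead of the
raw id):

	static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
					  union bpf_iter_link_info *linfo,
					  struct bpf_iter_aux_info *aux)
	{
		struct cgroup *cgroup;

		cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id);
		if (!cgroup)
			return -ENOENT;

		/* reference is held for the lifetime of the link */
		aux->cgroup = cgroup;
		return 0;
	}

	static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
	{
		cgroup_put(aux->cgroup);
	}

start() then only hands out aux->cgroup and stop() stays trivial.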

[...]


* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-03  1:08       ` Yonghong Song
@ 2022-03-03  2:29         ` Alexei Starovoitov
  2022-03-03 19:43           ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-03  2:29 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, Josh Don, Stanislav Fomichev, bpf, LKML

On Wed, Mar 2, 2022 at 5:09 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/2/22 1:30 PM, Alexei Starovoitov wrote:
> > On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 2/25/22 3:43 PM, Hao Luo wrote:
> >>> Add a new type of bpf tracepoints: sleepable tracepoints, which allows
> >>> the handler to make calls that may sleep. With sleepable tracepoints, a
> >>> set of syscall helpers (which may sleep) may also be called from
> >>> sleepable tracepoints.
> >>
> >> There are some old discussions on sleepable tracepoints, maybe
> >> worthwhile to take a look.
> >>
> >> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/
> >
> > Right. It's very much related, but obsolete too.
> > We don't need any of that for sleepable _raw_ tps.
> > I prefer to stay with "sleepable" name as well to
> > match the rest of the bpf sleepable code.
> > In all cases it's faultable.
>
> sounds good to me. Agree that for the bpf user case, Hao's
> implementation should be enough.

Just remembered that we can also do a trivial noinline __weak
nop function and mark it sleepable on the verifier side.
That's what we were planning to do to trace map update/delete ops
in Joe Burton's series.
Then we don't need to extend the tp infra.
I'm fine either way. I see pros and cons in both options.
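
i.e. roughly something like this (illustrative only, the name is made
up):

	/* Empty hook called on cgroup creation; a sleepable fentry
	 * prog can attach here.
	 */
	noinline __weak void bpf_cgroup_mkdir_hook(struct cgroup *cgrp)
	{
	}

and then teach the verifier to allow sleepable progs on it.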


* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-03  2:03     ` Yonghong Song
@ 2022-03-03  3:03       ` Kumar Kartikeya Dwivedi
  2022-03-03  4:00         ` Alexei Starovoitov
  2022-03-03  7:33         ` Yonghong Song
  0 siblings, 2 replies; 54+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-03  3:03 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel

On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
>
>
> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> > On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > > be parameterized by a cgroup id and prints only that cgroup. So one
> > > needs to specify a target cgroup id when attaching this iter.
> > >
> > > The target cgroup's state can be read out via a link of this iter.
> > > Typically, we can monitor cgroup creation and deletion using sleepable
> > > tracing and use it to create corresponding directories in bpffs and pin
> > > a cgroup id parameterized link in the directory. Then we can read the
> > > auto-pinned iter link to get cgroup's state. The output of the iter link
> > > is determined by the program. See the selftest test_cgroup_stats.c for
> > > an example.
> > >
> > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > ---
> > >   include/linux/bpf.h            |   1 +
> > >   include/uapi/linux/bpf.h       |   6 ++
> > >   kernel/bpf/Makefile            |   2 +-
> > >   kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
> > >   tools/include/uapi/linux/bpf.h |   6 ++
> > >   5 files changed, 155 insertions(+), 1 deletion(-)
> > >   create mode 100644 kernel/bpf/cgroup_iter.c
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 759ade7b24b3..3ce9b0b7ed89 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags);
> > >
> > >   struct bpf_iter_aux_info {
> > >   	struct bpf_map *map;
> > > +	u64 cgroup_id;
> > >   };
> > >
> > >   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index a5dbc794403d..855ad80d9983 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> > >   	struct {
> > >   		__u32	map_fd;
> > >   	} map;
> > > +	struct {
> > > +		__u64	cgroup_id;
> > > +	} cgroup;
> > >   };
> > >
> > >   /* BPF syscall commands, see bpf(2) man-page for more details. */
> > > @@ -5887,6 +5890,9 @@ struct bpf_link_info {
> > >   				struct {
> > >   					__u32 map_id;
> > >   				} map;
> > > +				struct {
> > > +					__u64 cgroup_id;
> > > +				} cgroup;
> > >   			};
> > >   		} iter;
> > >   		struct  {
> > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > > index c1a9be6a4b9f..52a0e4c6e96e 100644
> > > --- a/kernel/bpf/Makefile
> > > +++ b/kernel/bpf/Makefile
> > > @@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
> > >
> > >   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
> > >   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> > > -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> > > +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
> > >   obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
> > >   obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
> > >   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> > > diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
> > > new file mode 100644
> > > index 000000000000..011d9dcd1d51
> > > --- /dev/null
> > > +++ b/kernel/bpf/cgroup_iter.c
> > > @@ -0,0 +1,141 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/* Copyright (c) 2022 Google */
> > > +#include <linux/bpf.h>
> > > +#include <linux/btf_ids.h>
> > > +#include <linux/cgroup.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/seq_file.h>
> > > +
> > > +struct bpf_iter__cgroup {
> > > +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> > > +	__bpf_md_ptr(struct cgroup *, cgroup);
> > > +};
> > > +
> > > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > > +{
> > > +	struct cgroup *cgroup;
> > > +	u64 cgroup_id;
> > > +
> > > +	/* Only one session is supported. */
> > > +	if (*pos > 0)
> > > +		return NULL;
> > > +
> > > +	cgroup_id = *(u64 *)seq->private;
> > > +	cgroup = cgroup_get_from_id(cgroup_id);
> > > +	if (!cgroup)
> > > +		return NULL;
> > > +
> > > +	if (*pos == 0)
> > > +		++*pos;
> > > +
> > > +	return cgroup;
> > > +}
> > > +
> > > +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > > +{
> > > +	++*pos;
> > > +	return NULL;
> > > +}
> > > +
> > > +static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
> > > +{
> > > +	struct bpf_iter__cgroup ctx;
> > > +	struct bpf_iter_meta meta;
> > > +	struct bpf_prog *prog;
> > > +	int ret = 0;
> > > +
> > > +	ctx.meta = &meta;
> > > +	ctx.cgroup = v;
> > > +	meta.seq = seq;
> > > +	prog = bpf_iter_get_info(&meta, false);
> > > +	if (prog)
> > > +		ret = bpf_iter_run_prog(prog, &ctx);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> > > +{
> > > +	if (v)
> > > +		cgroup_put(v);
> > > +}
> >
> > I think in existing iterators, we make a final call to seq_show, with v as NULL,
> > is there a specific reason to do it differently for this? There is logic in
> > bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
> > NULL, to execute BPF program with NULL p, see the comment above stop label.
> >
> > If you do add the seq_show call with NULL, you'd also need to change the
> > ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
>
> Kumar, PTR_TO_BTF_ID should be okay since the show() never takes a NULL
> cgroup. But we do have issues for cgroup_iter_seq_stop(), which I missed
> earlier.
>

Right, I was thinking whether it should call seq_show for the v == NULL case.
All other iterators seem to do so. It's a bit different here since it is only
iterating over a single cgroup, I guess, but it would be nice to have some
consistency.

> For cgroup_iter, the following is the current workflow:
>    start -> not NULL -> show -> next -> NULL -> stop
> or
>    start -> NULL -> stop
>
> So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
> the cgroup_put() is not actually called, i.e., corresponding cgroup is
> not freed.
>
> There are two ways to fix the issue:
>   . call cgroup_put() in next() before return NULL. This way,
>     stop() will be a noop.
>   . put cgroup_get_from_id() and cgroup_put() in
>     bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
>
> I prefer the second approach as it is cleaner.
>

I think the current approach is also not safe if the cgroup_id gets reused,
right? I.e. it only does cgroup_get_from_id in seq_start, not at attach time,
so it may not be the same cgroup when calling read(2). kernfs is using
idr_alloc_cyclic, so reuse is less likely to occur, but since it wraps around
to find a free ID it might not be purely theoretical.

[...]

--
Kartikeya


* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-03  3:03       ` Kumar Kartikeya Dwivedi
@ 2022-03-03  4:00         ` Alexei Starovoitov
  2022-03-03  7:33         ` Yonghong Song
  1 sibling, 0 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-03  4:00 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, Hao Luo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Wed, Mar 2, 2022 at 7:03 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
>
> I think current approach is also not safe if cgroup_id gets reused, right? I.e.
> it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> is less likely to occur, but since it wraps around to find a free ID it might
> not be theoretical.

cgroupid is 64-bit.


* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-03  3:03       ` Kumar Kartikeya Dwivedi
  2022-03-03  4:00         ` Alexei Starovoitov
@ 2022-03-03  7:33         ` Yonghong Song
  2022-03-03  8:13           ` Kumar Kartikeya Dwivedi
  2022-03-03 21:52           ` Hao Luo
  1 sibling, 2 replies; 54+ messages in thread
From: Yonghong Song @ 2022-03-03  7:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel



On 3/2/22 7:03 PM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
>>
>>
>> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
>>> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
>>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
>>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
>>>> be parameterized by a cgroup id and prints only that cgroup. So one
>>>> needs to specify a target cgroup id when attaching this iter.
>>>>
>>>> The target cgroup's state can be read out via a link of this iter.
>>>> Typically, we can monitor cgroup creation and deletion using sleepable
>>>> tracing and use it to create corresponding directories in bpffs and pin
>>>> a cgroup id parameterized link in the directory. Then we can read the
>>>> auto-pinned iter link to get cgroup's state. The output of the iter link
>>>> is determined by the program. See the selftest test_cgroup_stats.c for
>>>> an example.
>>>>
>>>> Signed-off-by: Hao Luo <haoluo@google.com>
>>>> ---
>>>>    include/linux/bpf.h            |   1 +
>>>>    include/uapi/linux/bpf.h       |   6 ++
>>>>    kernel/bpf/Makefile            |   2 +-
>>>>    kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
>>>>    tools/include/uapi/linux/bpf.h |   6 ++
>>>>    5 files changed, 155 insertions(+), 1 deletion(-)
>>>>    create mode 100644 kernel/bpf/cgroup_iter.c
>>>>
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index 759ade7b24b3..3ce9b0b7ed89 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>>> @@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags);
>>>>
>>>>    struct bpf_iter_aux_info {
>>>>    	struct bpf_map *map;
>>>> +	u64 cgroup_id;
>>>>    };
>>>>
>>>>    typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index a5dbc794403d..855ad80d9983 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>>    	struct {
>>>>    		__u32	map_fd;
>>>>    	} map;
>>>> +	struct {
>>>> +		__u64	cgroup_id;
>>>> +	} cgroup;
>>>>    };
>>>>
>>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>>> @@ -5887,6 +5890,9 @@ struct bpf_link_info {
>>>>    				struct {
>>>>    					__u32 map_id;
>>>>    				} map;
>>>> +				struct {
>>>> +					__u64 cgroup_id;
>>>> +				} cgroup;
>>>>    			};
>>>>    		} iter;
>>>>    		struct  {
>>>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>>>> index c1a9be6a4b9f..52a0e4c6e96e 100644
>>>> --- a/kernel/bpf/Makefile
>>>> +++ b/kernel/bpf/Makefile
>>>> @@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>>>>
>>>>    obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
>>>>    obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>>>> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>>>> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
>>>>    obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>>>>    obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>>>>    obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>>>> diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
>>>> new file mode 100644
>>>> index 000000000000..011d9dcd1d51
>>>> --- /dev/null
>>>> +++ b/kernel/bpf/cgroup_iter.c
>>>> @@ -0,0 +1,141 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/* Copyright (c) 2022 Google */
>>>> +#include <linux/bpf.h>
>>>> +#include <linux/btf_ids.h>
>>>> +#include <linux/cgroup.h>
>>>> +#include <linux/kernel.h>
>>>> +#include <linux/seq_file.h>
>>>> +
>>>> +struct bpf_iter__cgroup {
>>>> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
>>>> +	__bpf_md_ptr(struct cgroup *, cgroup);
>>>> +};
>>>> +
>>>> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
>>>> +{
>>>> +	struct cgroup *cgroup;
>>>> +	u64 cgroup_id;
>>>> +
>>>> +	/* Only one session is supported. */
>>>> +	if (*pos > 0)
>>>> +		return NULL;
>>>> +
>>>> +	cgroup_id = *(u64 *)seq->private;
>>>> +	cgroup = cgroup_get_from_id(cgroup_id);
>>>> +	if (!cgroup)
>>>> +		return NULL;
>>>> +
>>>> +	if (*pos == 0)
>>>> +		++*pos;
>>>> +
>>>> +	return cgroup;
>>>> +}
>>>> +
>>>> +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>>>> +{
>>>> +	++*pos;
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
>>>> +{
>>>> +	struct bpf_iter__cgroup ctx;
>>>> +	struct bpf_iter_meta meta;
>>>> +	struct bpf_prog *prog;
>>>> +	int ret = 0;
>>>> +
>>>> +	ctx.meta = &meta;
>>>> +	ctx.cgroup = v;
>>>> +	meta.seq = seq;
>>>> +	prog = bpf_iter_get_info(&meta, false);
>>>> +	if (prog)
>>>> +		ret = bpf_iter_run_prog(prog, &ctx);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
>>>> +{
>>>> +	if (v)
>>>> +		cgroup_put(v);
>>>> +}
>>>
>>> I think in existing iterators, we make a final call to seq_show, with v as NULL,
>>> is there a specific reason to do it differently for this? There is logic in
>>> bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
>>> NULL, to execute BPF program with NULL p, see the comment above stop label.
>>>
>>> If you do add the seq_show call with NULL, you'd also need to change the
>>> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
>>
>> Kumar, PTR_TO_BTF_ID should be okay since the show() never takes a NULL
>> cgroup. But we do have issues for cgroup_iter_seq_stop(), which I missed
>> earlier.
>>
> 
> Right, I was thinking whether it should call seq_show for v == NULL case. All
> other iterators seem to do so, it's a bit different here since it is only
> iterating over a single cgroup, I guess, but it would be nice to have some
> consistency.

You are correct; I think it is okay since it only iterates over one
cgroup. This is different from other cases so far, where more than one
object may be traversed. We may have other use cases in the future, e.g.,
one task. I think we can abstract out the start()/next()/stop() callbacks
for such use cases. So it is okay that it is different from the existing
iterators, since they are indeed different.

> 
>> For cgroup_iter, the following is the current workflow:
>>     start -> not NULL -> show -> next -> NULL -> stop
>> or
>>     start -> NULL -> stop
>>
>> So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
>> the cgroup_put() is not actually called, i.e., corresponding cgroup is
>> not freed.
>>
>> There are two ways to fix the issue:
>>    . call cgroup_put() in next() before return NULL. This way,
>>      stop() will be a noop.
>>    . put cgroup_get_from_id() and cgroup_put() in
>>      bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
>>
>> I prefer the second approach as it is cleaner.
>>
> 
> I think current approach is also not safe if cgroup_id gets reused, right? I.e.
> it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> is less likely to occur, but since it wraps around to find a free ID it might
> not be theoretical.

As Alexei mentioned, the cgroup id is 64-bit, so a collision should
be nearly impossible. Another option is to get an fd from
the cgroup path and send the fd to the kernel. This probably
works.
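
i.e. userspace would open the cgroup directory and pass the fd in the
link info, along these lines (sketch; the cgroup_fd field and the
program name are hypothetical):

	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	struct bpf_link *link;
	int cgroup_fd;

	cgroup_fd = open("/sys/fs/cgroup/my-job", O_RDONLY | O_DIRECTORY);
	if (cgroup_fd < 0)
		return -errno;

	linfo.cgroup.cgroup_fd = cgroup_fd;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);
	link = bpf_program__attach_iter(skel->progs.dump_cgroup, &opts);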

> 
>>>
>>>> +
>>>> +static const struct seq_operations cgroup_iter_seq_ops = {
>>>> +	.start  = cgroup_iter_seq_start,
>>>> +	.next   = cgroup_iter_seq_next,
>>>> +	.stop   = cgroup_iter_seq_stop,
>>>> +	.show   = cgroup_iter_seq_show,
>>>> +};
>>>> +
>>>> +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
>>>> +
>>>> +static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
>>>> +{
>>>> +	*(u64 *)priv_data = aux->cgroup_id;
>>>> +	return 0;
>>>> +}
>>>> +
[...]


* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-03  7:33         ` Yonghong Song
@ 2022-03-03  8:13           ` Kumar Kartikeya Dwivedi
  2022-03-03 21:52           ` Hao Luo
  1 sibling, 0 replies; 54+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-03  8:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel

On Thu, Mar 03, 2022 at 01:03:57PM IST, Yonghong Song wrote:
> [...]
> >
> > Right, I was thinking whether it should call seq_show for v == NULL case. All
> > other iterators seem to do so, it's a bit different here since it is only
> > iterating over a single cgroup, I guess, but it would be nice to have some
> > consistency.
>
> You are correct that I think it is okay since it only iterates with one
> cgroup. This is different from other cases so far where more than one
> objects may be traversed. We may have future other use cases, e.g.,
> one task. I think we can abstract out start()/next()/stop() callbacks
> for such use cases. So it is okay it is different from other existing
> iterators since they are indeed different.
>
> >
> > > For cgroup_iter, the following is the current workflow:
> > >     start -> not NULL -> show -> next -> NULL -> stop
> > > or
> > >     start -> NULL -> stop
> > >
> > > So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
> > > the cgroup_put() is not actually called, i.e., corresponding cgroup is
> > > not freed.
> > >
> > > There are two ways to fix the issue:
> > >    . call cgroup_put() in next() before return NULL. This way,
> > >      stop() will be a noop.
> > >    . put cgroup_get_from_id() and cgroup_put() in
> > >      bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
> > >
> > > I prefer the second approach as it is cleaner.
> > >
> >
> > I think current approach is also not safe if cgroup_id gets reused, right? I.e.
> > it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> > be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> > is less likely to occur, but since it wraps around to find a free ID it might
> > not be theoretical.
>
> As Alexei mentioned, cgroup id is 64-bit, the collision should
> be nearly impossible. Another option is to get a fd from
> the cgroup path, and send the fd to the kernel. This probably
> works.
>

I see, even on 32-bit systems the actual id is 64-bit.
As for cgroup fd vs. id, existing cgroup BPF programs seem to take an fd, and
the map iter also takes a map fd, so it might make sense to use a cgroup fd
here as well.
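
The attach side would then boil down to (sketch, with aux holding a
struct cgroup * as discussed above):

	cgroup = cgroup_get_from_fd(linfo->cgroup.cgroup_fd);
	if (IS_ERR(cgroup))
		return PTR_ERR(cgroup);
	aux->cgroup = cgroup;	/* reference held until detach */

which also sidesteps the id-reuse concern, since the fd pins the
cgroup itself rather than an id.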

> [...]

--
Kartikeya


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-02 19:34       ` Alexei Starovoitov
@ 2022-03-03 18:50         ` Hao Luo
  2022-03-04 18:37           ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-03-03 18:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Wed, Mar 2, 2022 at 11:34 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 02:10:39PM -0800, Hao Luo wrote:
> > Hi Kumar,
> >
> > On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> > > > This patch allows bpf_syscall prog to perform some basic filesystem
> > > > operations: create, remove directories and unlink files. Three bpf
> > > > helpers are added for this purpose. When combined with the following
> > > > patches that allow pinning and getting bpf objects from bpf prog,
> > > > this feature can be used to create directory hierarchy in bpffs that
> > > > help manage bpf objects purely using bpf progs.
> > > >
> > > > The added helpers are subject to the same permission checks as their syscall
> > > > version. For example, one can not write to a read-only file system;
> > > > The identity of the current process is checked to see whether it has
> > > > sufficient permission to perform the operations.
> > > >
> > > > Only directories and files in bpffs can be created or removed by these
> > > > helpers. But it won't be too hard to allow these helpers to operate
> > > > on files in other filesystems, if we want.
> > > >
> > > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > > ---
> > > > + *
> > > > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> > > > + *   Description
> > > > + *           Attempts to create a directory name *pathname*. The argument
> > > > + *           *pathname_sz* specifies the length of the string *pathname*.
> > > > + *           The argument *mode* specifies the mode for the new directory. It
> > > > + *           is modified by the process's umask. It has the same semantic as
> > > > + *           the syscall mkdir(2).
> > > > + *   Return
> > > > + *           0 on success, or a negative error in case of failure.
> > > > + *
> > > > + * long bpf_rmdir(const char *pathname, int pathname_sz)
> > > > + *   Description
> > > > + *           Deletes a directory, which must be empty.
> > > > + *   Return
> > > > + *           0 on success, or a negative error in case of failure.
> > > > + *
> > > > + * long bpf_unlink(const char *pathname, int pathname_sz)
> > > > + *   Description
> > > > + *           Deletes a name and possibly the file it refers to. It has the
> > > > + *           same semantic as the syscall unlink(2).
> > > > + *   Return
> > > > + *           0 on success, or a negative error in case of failure.
> > > >   */
> > > >
> > >
> > > How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
> > > more useful for other cases in future, and when AT_FDCWD is passed, has the same
> > > functionality as these, but when openat/fget is supported, it would work
> > > relative to other dirfds as well. It can also allow using dirfd of the process
> > > calling read for a iterator (e.g. if it sets the fd number using skel->bss).
> > > unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
> > >
> > > WDYT?
> > >
> >
> > The idea sounds good to me, more flexible. But I don't have a real use
> > case for using the added 'dirfd' at this moment. For all the use cases
> > I can think of, absolute paths will suffice, I think. Unless other
> > reviewers have opposition, I will try switching to mkdirat and
> > unlinkat in v2.
>
> I'm surprised you don't need "at" variants.
> I thought your production setup has a top level cgroup controller and
> then inner tasks inside containers manage cgroups on their own.
> Since containers are involved they likely run inside their own mountns.
> cgroupfs mount is single. So you probably don't even need to bind mount it
> inside containers, but bpffs is not a single mount. You need
> to bind mount top bpffs inside containers for tasks to access it.
> Now for cgroupfs the abs path is not an issue, but for bpffs
> the AT_FDCWD becomes a problem. AT_FDCWD is using current mount ns.
> Inside container that will be different. Unless you bind mount into exact
> same path the full path has different meanings inside and outside of the container.
> It seems to me the bpf progs attached to cgroup sleepable events should
> be using FD of bpffs. Then when these tracepoints are triggered from
> different containers in different mountns they will get the right dir prefix.
> What am I missing?
>

Alexei, you are perfectly right. To be honest, I failed to see that
the sleepable tp prog runs in the container's mount ns. I
think we can bind mount the top bpffs into exactly the same path
inside the container, right? But I haven't tested this idea. Passing FDs
should be better.

> I think non-AT variants are not needed. The prog can always pass AT_FDCWD
> if it's really the intent, but passing actual FD seems more error-proof.

Let's have the AT version. Passing an FD seems the right approach when
we use it in the context of a container.
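
For the record, the AT variants could be documented roughly like this
(a sketch only, not the final uapi text):

 * long bpf_mkdirat(int dirfd, const char *pathname, int pathname_sz,
 *		    u32 mode)
 *	Description
 *		Like mkdirat(2): creates *pathname* relative to *dirfd*,
 *		or relative to the current working directory when
 *		*dirfd* is **AT_FDCWD**.
 *
 * long bpf_unlinkat(int dirfd, const char *pathname, int pathname_sz,
 *		     int flags)
 *	Description
 *		Like unlinkat(2). Passing **AT_REMOVEDIR** in *flags*
 *		removes a directory, so a separate bpf_rmdir is not
 *		needed.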


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-02 20:55   ` Yonghong Song
@ 2022-03-03 18:56     ` Hao Luo
  2022-03-03 19:13       ` Yonghong Song
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-03-03 18:56 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel

On Wed, Mar 2, 2022 at 12:55 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/25/22 3:43 PM, Hao Luo wrote:
> > This patch allows bpf_syscall prog to perform some basic filesystem
> > operations: create, remove directories and unlink files. Three bpf
> > helpers are added for this purpose. When combined with the following
> > patches that allow pinning and getting bpf objects from bpf prog,
> > this feature can be used to create directory hierarchy in bpffs that
> > help manage bpf objects purely using bpf progs.
> >
> > The added helpers are subject to the same permission checks as their syscall
> > version. For example, one can not write to a read-only file system;
> > The identity of the current process is checked to see whether it has
> > sufficient permission to perform the operations.
> >
> > Only directories and files in bpffs can be created or removed by these
> > helpers. But it won't be too hard to allow these helpers to operate
> > on files in other filesystems, if we want.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > ---
> >   include/linux/bpf.h            |   1 +
> >   include/uapi/linux/bpf.h       |  26 +++++
> >   kernel/bpf/inode.c             |   9 +-
> >   kernel/bpf/syscall.c           | 177 +++++++++++++++++++++++++++++++++
> >   tools/include/uapi/linux/bpf.h |  26 +++++
> >   5 files changed, 236 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f19abc59b6cd..fce5e26179f5 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1584,6 +1584,7 @@ int bpf_link_new_fd(struct bpf_link *link);
> >   struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> >   struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> >
> > +bool bpf_path_is_bpf_dir(const struct path *path);
> >   int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> >   int bpf_obj_get_user(const char __user *pathname, int flags);
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index afe3d0d7f5f2..a5dbc794403d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5086,6 +5086,29 @@ union bpf_attr {
> >    *  Return
> >    *          0 on success, or a negative error in case of failure. On error
> >    *          *dst* buffer is zeroed out.
> > + *
> > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
>
> Can we make pathname_sz to be u32 instead of int? pathname_sz should
> never be negative any way.
>
> Also, I think it is a good idea to add 'u64 flags' parameter for all
> three helpers, so we have room in the future to tune for new use cases.
>

SG. Will make this change.

Actually, I think I need to cap pathname_sz from above, to ensure
pathname_sz isn't too big. Is that necessary? I see there are other
helpers that don't have this type of check.

[...]


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-03 18:56     ` Hao Luo
@ 2022-03-03 19:13       ` Yonghong Song
  2022-03-03 19:15         ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Yonghong Song @ 2022-03-03 19:13 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel



On 3/3/22 10:56 AM, Hao Luo wrote:
> On Wed, Mar 2, 2022 at 12:55 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 2/25/22 3:43 PM, Hao Luo wrote:
>>> This patch allows bpf_syscall prog to perform some basic filesystem
>>> operations: create, remove directories and unlink files. Three bpf
>>> helpers are added for this purpose. When combined with the following
>>> patches that allow pinning and getting bpf objects from bpf prog,
>>> this feature can be used to create a directory hierarchy in bpffs that
>>> helps manage bpf objects purely using bpf progs.
>>>
>>> The added helpers are subject to the same permission checks as their syscall
>>> versions. For example, one cannot write to a read-only file system;
>>> the identity of the current process is checked to see whether it has
>>> sufficient permission to perform the operations.
>>>
>>> Only directories and files in bpffs can be created or removed by these
>>> helpers. But it won't be too hard to allow these helpers to operate
>>> on files in other filesystems, if we want.
>>>
>>> Signed-off-by: Hao Luo <haoluo@google.com>
>>> ---
>>>    include/linux/bpf.h            |   1 +
>>>    include/uapi/linux/bpf.h       |  26 +++++
>>>    kernel/bpf/inode.c             |   9 +-
>>>    kernel/bpf/syscall.c           | 177 +++++++++++++++++++++++++++++++++
>>>    tools/include/uapi/linux/bpf.h |  26 +++++
>>>    5 files changed, 236 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index f19abc59b6cd..fce5e26179f5 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1584,6 +1584,7 @@ int bpf_link_new_fd(struct bpf_link *link);
>>>    struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>>>    struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>>>
>>> +bool bpf_path_is_bpf_dir(const struct path *path);
>>>    int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>>>    int bpf_obj_get_user(const char __user *pathname, int flags);
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index afe3d0d7f5f2..a5dbc794403d 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -5086,6 +5086,29 @@ union bpf_attr {
>>>     *  Return
>>>     *          0 on success, or a negative error in case of failure. On error
>>>     *          *dst* buffer is zeroed out.
>>> + *
>>> + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
>>
>> Can we make pathname_sz u32 instead of int? pathname_sz should
>> never be negative anyway.
>>
>> Also, I think it is a good idea to add a 'u64 flags' parameter for all
>> three helpers, so we have room in the future to tune for new use cases.
>>
> 
> SG. Will make this change.
> 
> > Actually, I think I need to cap pathname_sz from above, to ensure
> pathname_sz isn't too big. Is that necessary? I see there are other
> helpers that don't have this type of check.

There is no need. The verifier should ensure the memory held by pathname 
will have at least pathname_sz bytes.
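
For reference, that guarantee comes from how the helper proto pairs the
arguments. A sketch (not necessarily the exact proto in the patch):

const struct bpf_func_proto bpf_mkdir_proto = {
	.func		= bpf_mkdir,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	/* ARG_PTR_TO_MEM paired with ARG_CONST_SIZE* makes the verifier
	 * prove the prog passes a buffer of at least pathname_sz bytes */
	.arg1_type	= ARG_PTR_TO_MEM,
	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
	.arg3_type	= ARG_ANYTHING,
};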

> 
>>> + *   Description
> >>> + *           Attempts to create a directory named *pathname*. The argument
>>> + *           *pathname_sz* specifies the length of the string *pathname*.
>>> + *           The argument *mode* specifies the mode for the new directory. It
>>> + *           is modified by the process's umask. It has the same semantic as
>>> + *           the syscall mkdir(2).
>>> + *   Return
>>> + *           0 on success, or a negative error in case of failure.
>>> + *
>>> + * long bpf_rmdir(const char *pathname, int pathname_sz)
>>> + *   Description
>>> + *           Deletes a directory, which must be empty.
>>> + *   Return
> >>> + *           0 on success, or a negative error in case of failure.
>>> + *
>>> + * long bpf_unlink(const char *pathname, int pathname_sz)
>>> + *   Description
>>> + *           Deletes a name and possibly the file it refers to. It has the
> >>> + *           same semantics as the syscall unlink(2).
>>> + *   Return
>>> + *           0 on success, or a negative error in case of failure.
>>>     */
>>>    #define __BPF_FUNC_MAPPER(FN)               \
>>>        FN(unspec),                     \
>>> @@ -5280,6 +5303,9 @@ union bpf_attr {
>>>        FN(xdp_load_bytes),             \
>>>        FN(xdp_store_bytes),            \
>>>        FN(copy_from_user_task),        \
>>> +     FN(mkdir),                      \
>>> +     FN(rmdir),                      \
>>> +     FN(unlink),                     \
>>>        /* */
>>>
>>>    /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> [...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel.
  2022-03-02 20:01   ` Alexei Starovoitov
@ 2022-03-03 19:14     ` Hao Luo
  0 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-03 19:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Wed, Mar 2, 2022 at 12:02 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 03:43:37PM -0800, Hao Luo wrote:
> > After we introduced sleepable tracing programs, we now have an
> > interesting problem. There are now three execution paths that can
> > reach bpf_sys_bpf:
> >
> >  1. called from bpf syscall.
> >  2. called from kernel context (e.g. kernel modules).
> >  3. called from bpf programs.
> >
> > Ideally, the capability check in bpf_sys_bpf is necessary for the first two
> > scenarios. But it may not be necessary for the third case.
>
> Well, it's unnecessary for the first two as well.
> When called from the kernel lskel it's a pointless check.
> The kernel module can do anything regardless.
> When called from bpf syscall program it's not quite correct either.
> When CAP_BPF was introduced we designed it to enforce permissions
> at prog load time. The prog_run doesn't check permissions.
> So syscall progs don't need this secondary permission check.
> Please add "case BPF_PROG_TYPE_SYSCALL:" to is_perfmon_prog_type()
> and combine it with this patch.
>

Sure, will do.
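
Something along these lines, presumably (a sketch, eliding the
existing cases):

static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
{
	switch (prog_type) {
	case BPF_PROG_TYPE_KPROBE:
	case BPF_PROG_TYPE_TRACEPOINT:
	case BPF_PROG_TYPE_TRACING:
	/* ... other existing cases ... */
	case BPF_PROG_TYPE_SYSCALL:	/* the addition discussed above */
		return true;
	default:
		return false;
	}
}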

> That would be the best. The alternative below is less appealing.
>
> > An alternative to lifting this permission check would be introducing an
> > 'unpriv' version of bpf_sys_bpf, which doesn't check the current task's
> > capability. If the owner of the tracing prog wants it to be exclusively
> > used by root users, they can use the 'priv' version of bpf_sys_bpf; if
> > the owner wants it to be usable for non-root users, they can use the
> > 'unpriv' version.
>
> ...
>
> > -     if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> > +     if (sysctl_unprivileged_bpf_disabled && !bpf_capable() && !uattr.is_kernel)
>
> This is a great idea. If I had thought of this I would have gone with it when prog_syscall
> was introduced.

Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-03 19:13       ` Yonghong Song
@ 2022-03-03 19:15         ` Hao Luo
  0 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-03 19:15 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel

On Thu, Mar 3, 2022 at 11:13 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/3/22 10:56 AM, Hao Luo wrote:
> > On Wed, Mar 2, 2022 at 12:55 PM Yonghong Song <yhs@fb.com> wrote:
> >> On 2/25/22 3:43 PM, Hao Luo wrote:
> >>> @@ -5086,6 +5086,29 @@ union bpf_attr {
> >>>     *  Return
> >>>     *          0 on success, or a negative error in case of failure. On error
> >>>     *          *dst* buffer is zeroed out.
> >>> + *
> >>> + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> >>
> >> Can we make pathname_sz u32 instead of int? pathname_sz should
> >> never be negative anyway.
> >>
> >> Also, I think it is a good idea to add a 'u64 flags' parameter for all
> >> three helpers, so we have room in the future to tune for new use cases.
> >>
> >
> > SG. Will make this change.
> >
> > Actually, I think I need to cap pathname_sz from above, to ensure
> > pathname_sz isn't too big. Is that necessary? I see there are other
> > helpers that don't have this type of check.
>
> There is no need. The verifier should ensure the memory held by pathname
> will have at least pathname_sz bytes.
>

SG. Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-02 19:41   ` Alexei Starovoitov
@ 2022-03-03 19:37     ` Hao Luo
  2022-03-03 19:59       ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-03-03 19:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Wed, Mar 2, 2022 at 11:41 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 03:43:34PM -0800, Hao Luo wrote:
> > diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> > index e7c2276be33e..c73c7ab3680e 100644
> > --- a/include/linux/tracepoint-defs.h
> > +++ b/include/linux/tracepoint-defs.h
> > @@ -51,6 +51,7 @@ struct bpf_raw_event_map {
> >       void                    *bpf_func;
> >       u32                     num_args;
> >       u32                     writable_size;
> > +     u32                     sleepable;
>
> It increases the size for all tracepoints.
> See BPF_RAW_TP in include/asm-generic/vmlinux.lds.h
> Please switch writable_size and sleepable to u16.

No problem.

> >
> > -static const struct bpf_func_proto *
> > -syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +/* Syscall helpers that are also allowed in sleepable tracing prog. */
> > +const struct bpf_func_proto *
> > +tracing_prog_syscall_func_proto(enum bpf_func_id func_id,
> > +                             const struct bpf_prog *prog)
> >  {
> >       switch (func_id) {
> >       case BPF_FUNC_sys_bpf:
> >               return &bpf_sys_bpf_proto;
> > -     case BPF_FUNC_btf_find_by_name_kind:
> > -             return &bpf_btf_find_by_name_kind_proto;
> >       case BPF_FUNC_sys_close:
> >               return &bpf_sys_close_proto;
> > -     case BPF_FUNC_kallsyms_lookup_name:
> > -             return &bpf_kallsyms_lookup_name_proto;
> >       case BPF_FUNC_mkdir:
> >               return &bpf_mkdir_proto;
> >       case BPF_FUNC_rmdir:
> >               return &bpf_rmdir_proto;
> >       case BPF_FUNC_unlink:
> >               return &bpf_unlink_proto;
> > +     default:
> > +             return NULL;
> > +     }
> > +}
>
> If I read this correctly the goal is to disallow find_by_name_kind
> and lookup_name from sleepable tps. Why? What's the harm?

A couple of thoughts, please correct me if they don't make sense. I
may think too much.

1. The very first reason is, I don't know the use case of them in
tracing. So I think I can leave them out for now and add them later if
the maintainers want them.

2. A related question is, do we actually want all syscall helpers to
be in sleepable tracing? Some helpers may cause re-entering the
tracepoints. For a hypothetical example, what if we call mkdir while
tracing some tracepoint in vfs_mkdir? Do we have protection against this?
Another potential problem is about lookup_name in particular:
sleepable tracing could be triggered by any user, so will lookup_name
leak kernel addresses to users in some way? The filesystem helpers
have some basic perm checks, so I would think they're relatively safer.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-03  2:29         ` Alexei Starovoitov
@ 2022-03-03 19:43           ` Hao Luo
  2022-03-03 20:02             ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-03-03 19:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Wed, Mar 2, 2022 at 6:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 5:09 PM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 3/2/22 1:30 PM, Alexei Starovoitov wrote:
> > > On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2/25/22 3:43 PM, Hao Luo wrote:
> > >>> Add a new type of bpf tracepoints: sleepable tracepoints, which allow
> > >>> the handler to make calls that may sleep. With sleepable tracepoints, a
> > >>> set of syscall helpers (which may sleep) may also be called from
> > >>> sleepable tracepoints.
> > >>
> > >> There are some old discussions on sleepable tracepoints, maybe
> > >> worthwhile to take a look.
> > >>
> > >> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/
> > >
> > > Right. It's very much related, but obsolete too.
> > > We don't need any of that for sleepable _raw_ tps.
> > > I prefer to stay with "sleepable" name as well to
> > > match the rest of the bpf sleepable code.
> > > In all cases it's faultable.
> >
> > sounds good to me. Agree that for the bpf user case, Hao's
> > implementation should be enough.
>
> > Just remembered that we can also do a trivial noinline __weak
> nop function and mark it sleepable on the verifier side.
> That's what we were planning to do to trace map update/delete ops
> in Joe Burton's series.
> Then we don't need to extend tp infra.
> I'm fine whichever way. I see pros and cons in both options.

Joe is also cc'ed on this patchset; I will sync up with him on the
status of the trace map work.

Alexei, do we potentially have other variants of tp? We can make the
current u16 sleepable a flag, so we can reuse this flag later when we
have another type of tracepoint.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-03 19:37     ` Hao Luo
@ 2022-03-03 19:59       ` Alexei Starovoitov
  0 siblings, 0 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-03 19:59 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Thu, Mar 3, 2022 at 11:37 AM Hao Luo <haoluo@google.com> wrote:
>
> On Wed, Mar 2, 2022 at 11:41 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Feb 25, 2022 at 03:43:34PM -0800, Hao Luo wrote:
> > > diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> > > index e7c2276be33e..c73c7ab3680e 100644
> > > --- a/include/linux/tracepoint-defs.h
> > > +++ b/include/linux/tracepoint-defs.h
> > > @@ -51,6 +51,7 @@ struct bpf_raw_event_map {
> > >       void                    *bpf_func;
> > >       u32                     num_args;
> > >       u32                     writable_size;
> > > +     u32                     sleepable;
> >
> > It increases the size for all tracepoints.
> > See BPF_RAW_TP in include/asm-generic/vmlinux.lds.h
> > > Please switch writable_size and sleepable to u16.
>
> No problem.
>
> > >
> > > -static const struct bpf_func_proto *
> > > -syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > +/* Syscall helpers that are also allowed in sleepable tracing prog. */
> > > +const struct bpf_func_proto *
> > > +tracing_prog_syscall_func_proto(enum bpf_func_id func_id,
> > > +                             const struct bpf_prog *prog)
> > >  {
> > >       switch (func_id) {
> > >       case BPF_FUNC_sys_bpf:
> > >               return &bpf_sys_bpf_proto;
> > > -     case BPF_FUNC_btf_find_by_name_kind:
> > > -             return &bpf_btf_find_by_name_kind_proto;
> > >       case BPF_FUNC_sys_close:
> > >               return &bpf_sys_close_proto;
> > > -     case BPF_FUNC_kallsyms_lookup_name:
> > > -             return &bpf_kallsyms_lookup_name_proto;
> > >       case BPF_FUNC_mkdir:
> > >               return &bpf_mkdir_proto;
> > >       case BPF_FUNC_rmdir:
> > >               return &bpf_rmdir_proto;
> > >       case BPF_FUNC_unlink:
> > >               return &bpf_unlink_proto;
> > > +     default:
> > > +             return NULL;
> > > +     }
> > > +}
> >
> > If I read this correctly the goal is to disallow find_by_name_kind
> > and lookup_name from sleepable tps. Why? What's the harm?
>
> A couple of thoughts, please correct me if they don't make sense. I
> may think too much.
>
> 1. The very first reason is, I don't know the use case of them in
> tracing. So I think I can leave them out for now and add them later if
> the maintainers want them.
>
> 2. A related question is, do we actually want all syscall helpers to
> be in sleepable tracing? Some helpers may cause re-entering the
> tracepoints. For a hypothetical example, what if we call mkdir while
> tracing some tracepoint in vfs_mkdir? Do we have protection against this?

If we go with the noinline weak nop function approach then we will
get recursion protection for free. All trampoline powered progs have it.
Both sleepable and not.

> Another potential problem is about lookup_name in particular:
> sleepable tracing could be triggered by any user, so will lookup_name
> leak kernel addresses to users in some way? The filesystem helpers
> have some basic perm checks, so I would think they're relatively safer.

The tracepoint may be triggerable by any user, but the sleepable
tp bpf prog will be loaded with cap_perfmon permissions, so it has
the right to read anything.
So I don't see any concerns with enabling lookup_name for both
syscall bpf progs and tp progs.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-03 19:43           ` Hao Luo
@ 2022-03-03 20:02             ` Alexei Starovoitov
  2022-03-03 20:04               ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-03 20:02 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Thu, Mar 3, 2022 at 11:43 AM Hao Luo <haoluo@google.com> wrote:
>
> On Wed, Mar 2, 2022 at 6:29 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 2, 2022 at 5:09 PM Yonghong Song <yhs@fb.com> wrote:
> > >
> > >
> > >
> > > On 3/2/22 1:30 PM, Alexei Starovoitov wrote:
> > > > On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2/25/22 3:43 PM, Hao Luo wrote:
> > > >>> Add a new type of bpf tracepoints: sleepable tracepoints, which allow
> > > >>> the handler to make calls that may sleep. With sleepable tracepoints, a
> > > >>> set of syscall helpers (which may sleep) may also be called from
> > > >>> sleepable tracepoints.
> > > >>
> > > >> There are some old discussions on sleepable tracepoints, maybe
> > > >> worthwhile to take a look.
> > > >>
> > > >> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/
> > > >
> > > > Right. It's very much related, but obsolete too.
> > > > > We don't need any of that for sleepable _raw_ tps.
> > > > I prefer to stay with "sleepable" name as well to
> > > > match the rest of the bpf sleepable code.
> > > > In all cases it's faultable.
> > >
> > > sounds good to me. Agree that for the bpf user case, Hao's
> > > implementation should be enough.
> >
> > > Just remembered that we can also do a trivial noinline __weak
> > nop function and mark it sleepable on the verifier side.
> > That's what we were planning to do to trace map update/delete ops
> > in Joe Burton's series.
> > Then we don't need to extend tp infra.
> > I'm fine whichever way. I see pros and cons in both options.
>
> > Joe is also cc'ed on this patchset; I will sync up with him on the
> > status of the trace map work.
>
> > Alexei, do we potentially have other variants of tp? We can make the
> > current u16 sleepable a flag, so we can reuse this flag later when we
> > have another type of tracepoint.

When we added the ability to attach to kernel functions and mark them
as allow_error_inject, the usefulness of tracepoints and even
writable tracepoints was diminished.
If we do sleepable tracepoint, I suspect, it may be the last extension
in that area.
I guess I'm convincing myself that noinline weak nop func
is better here. Just like it's better for Joe's map tracing.
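
For concreteness, the shape being discussed is roughly this (the hook
name is made up for illustration):

/* an empty hook the verifier can treat as a sleepable attach point */
noinline __weak void bpf_cgroup_mkdir_hook(struct cgroup *cgrp)
{
}

/* cgroup_mkdir() would call bpf_cgroup_mkdir_hook(cgrp); a sleepable
 * fentry prog attaches to it via the bpf trampoline, which is what
 * provides the recursion protection mentioned earlier. */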

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-02 21:59   ` Yonghong Song
@ 2022-03-03 20:02     ` Hao Luo
  0 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-03 20:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton,
	Tejun Heo, joshdon, sdf, bpf, linux-kernel

Thanks Yonghong,

On Wed, Mar 2, 2022 at 2:00 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/25/22 3:43 PM, Hao Luo wrote:
> > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > be parameterized by a cgroup id and prints only that cgroup. So one
> > needs to specify a target cgroup id when attaching this iter.
> >
> > The target cgroup's state can be read out via a link of this iter.
> > Typically, we can monitor cgroup creation and deletion using sleepable
> > tracing and use it to create corresponding directories in bpffs and pin
> > a cgroup id parameterized link in the directory. Then we can read the
> > auto-pinned iter link to get cgroup's state. The output of the iter link
> > is determined by the program. See the selftest test_cgroup_stats.c for
> > an example.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > ---
> >   include/linux/bpf.h            |   1 +
> >   include/uapi/linux/bpf.h       |   6 ++
> >   kernel/bpf/Makefile            |   2 +-
> >   kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
> >   tools/include/uapi/linux/bpf.h |   6 ++
> >   5 files changed, 155 insertions(+), 1 deletion(-)
> >   create mode 100644 kernel/bpf/cgroup_iter.c
> >
[...]
> > +static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
> > +     .seq_ops                = &cgroup_iter_seq_ops,
> > +     .init_seq_private       = cgroup_iter_seq_init,
> > +     .fini_seq_private       = cgroup_iter_seq_fini,
>
> Since cgroup_iter_seq_fini() is a nop, you can just have
>         .fini_seq_private       = NULL,
>

Sounds good. It looks weird to have .init without .fini; this may
indicate a bug somewhere. The same goes for .attach and .detach. I see
that you pointed out a bug in a follow-up reply and the fix has paired
attach and detach. That explains something. :)

> > +void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> > +                              struct seq_file *seq)
> > +{
> > +     char buf[64] = {0};
>
> Is this 64 the maximum possible cgroup path length?
> If there is a macro for that, I think it would be good to use it.
>

64 is something I made up. There is a macro for path length. Let me
use that in v2.

> > +
> > +     cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
>
> cgroup_path_from_kernfs_id() might fail, in which case buf will be 0
> and cgroup_path will be nothing. I guess this might be the expected
> result. It might be good to add a comment to clarify in the code.
>

No problem.

>
> > +     seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
> > +     seq_printf(seq, "cgroup_path:\t%s\n", buf);
> > +}
> > +
> > +int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
> > +                                struct bpf_link_info *info)
> > +{
> > +     info->iter.cgroup.cgroup_id = aux->cgroup_id;
> > +     return 0;
> > +}
> > +
> > +DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
> > +                  struct cgroup *cgroup)
> > +
> > +static struct bpf_iter_reg bpf_cgroup_reg_info = {
> > +     .target                 = "cgroup",
> > +     .attach_target          = bpf_iter_attach_cgroup,
> > +     .detach_target          = bpf_iter_detach_cgroup,
>
> The same here, since bpf_iter_detach_cgroup() is a nop,
> you can replace it with NULL in the above.
>
> > +     .show_fdinfo            = bpf_iter_cgroup_show_fdinfo,
> > +     .fill_link_info         = bpf_iter_cgroup_fill_link_info,
> > +     .ctx_arg_info_size      = 1,
> > +     .ctx_arg_info           = {
> > +             { offsetof(struct bpf_iter__cgroup, cgroup),
> > +               PTR_TO_BTF_ID },
> > +     },
> > +     .seq_info               = &cgroup_iter_seq_info,
> > +};
> > +
> > +static int __init bpf_cgroup_iter_init(void)
> > +{
> > +     bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
> > +     return bpf_iter_reg_target(&bpf_cgroup_reg_info);
> > +}
> > +
> [...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-03 20:02             ` Alexei Starovoitov
@ 2022-03-03 20:04               ` Alexei Starovoitov
  2022-03-03 22:06                 ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-03 20:04 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Thu, Mar 3, 2022 at 12:02 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 3, 2022 at 11:43 AM Hao Luo <haoluo@google.com> wrote:
> >
> > On Wed, Mar 2, 2022 at 6:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Mar 2, 2022 at 5:09 PM Yonghong Song <yhs@fb.com> wrote:
> > > >
> > > >
> > > >
> > > > On 3/2/22 1:30 PM, Alexei Starovoitov wrote:
> > > > > On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 2/25/22 3:43 PM, Hao Luo wrote:
> > > > >>> Add a new type of bpf tracepoints: sleepable tracepoints, which allow
> > > > >>> the handler to make calls that may sleep. With sleepable tracepoints, a
> > > > >>> set of syscall helpers (which may sleep) may also be called from
> > > > >>> sleepable tracepoints.
> > > > >>
> > > > >> There are some old discussions on sleepable tracepoints, maybe
> > > > >> worthwhile to take a look.
> > > > >>
> > > > >> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/
> > > > >
> > > > > Right. It's very much related, but obsolete too.
> > > > > > We don't need any of that for sleepable _raw_ tps.
> > > > > I prefer to stay with "sleepable" name as well to
> > > > > match the rest of the bpf sleepable code.
> > > > > In all cases it's faultable.
> > > >
> > > > sounds good to me. Agree that for the bpf user case, Hao's
> > > > implementation should be enough.
> > >
> > > Just remembered that we can also do a trivial noinline __weak
> > > nop function and mark it sleepable on the verifier side.
> > > That's what we were planning to do to trace map update/delete ops
> > > in Joe Burton's series.
> > > Then we don't need to extend tp infra.
> > > I'm fine whichever way. I see pros and cons in both options.
> >
> > > Joe is also cc'ed on this patchset; I will sync up with him on the
> > > status of the trace map work.
> >
> > > Alexei, do we potentially have other variants of tp? We can make the
> > > current u16 sleepable a flag, so we can reuse this flag later when we
> > > have another type of tracepoint.
>
> When we added the ability to attach to kernel functions and mark them
> > as allow_error_inject, the usefulness of tracepoints and even
> > writable tracepoints was diminished.
> If we do sleepable tracepoint, I suspect, it may be the last extension
> in that area.
> I guess I'm convincing myself that noinline weak nop func
> is better here. Just like it's better for Joe's map tracing.

To add to the above... The only downside of sleepable nop func
comparing to tp is the lack of static_branch.
So this nop call will always be there.
For map tracing and for cgroup mkdir/rmdir the few nanosecond
overhead of calling an empty function isn't even measurable.
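
Schematically (names illustrative, not actual kernel code):

/* a tracepoint call site is guarded by a patched-out branch */
if (static_branch_unlikely(&__tracepoint_cgroup_mkdir.key))
	do_trace_cgroup_mkdir(cgrp, path);

/* the nop-func approach is an unconditional call to an empty
 * noinline function, so the call itself always executes */
bpf_cgroup_mkdir_hook(cgrp);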

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
  2022-03-03  7:33         ` Yonghong Song
  2022-03-03  8:13           ` Kumar Kartikeya Dwivedi
@ 2022-03-03 21:52           ` Hao Luo
  1 sibling, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-03 21:52 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Wed, Mar 2, 2022 at 11:34 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/2/22 7:03 PM, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
> >>
> >>
> >> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> >>> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> >>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> >>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> >>>> be parameterized by a cgroup id and prints only that cgroup. So one
> >>>> needs to specify a target cgroup id when attaching this iter.
> >>>>
> >>>> The target cgroup's state can be read out via a link of this iter.
> >>>> Typically, we can monitor cgroup creation and deletion using sleepable
> >>>> tracing and use it to create corresponding directories in bpffs and pin
> >>>> a cgroup id parameterized link in the directory. Then we can read the
> >>>> auto-pinned iter link to get cgroup's state. The output of the iter link
> >>>> is determined by the program. See the selftest test_cgroup_stats.c for
> >>>> an example.
> >>>>
> >>>> Signed-off-by: Hao Luo <haoluo@google.com>
> >>>> ---
> >>>>    include/linux/bpf.h            |   1 +
> >>>>    include/uapi/linux/bpf.h       |   6 ++
> >>>>    kernel/bpf/Makefile            |   2 +-
> >>>>    kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
> >>>>    tools/include/uapi/linux/bpf.h |   6 ++
> >>>>    5 files changed, 155 insertions(+), 1 deletion(-)
> >>>>    create mode 100644 kernel/bpf/cgroup_iter.c
[...]
> >>>
> >>> I think in existing iterators, we make a final call to seq_show with v as NULL;
> >>> is there a specific reason to do it differently for this? There is logic in
> >>> bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
> >>> NULL, to execute BPF program with NULL p, see the comment above stop label.
> >>>
> >>> If you do add the seq_show call with NULL, you'd also need to change the
> >>> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
> >>
> >> Kumar, PTR_TO_BTF_ID should be okay since show() is never called with a NULL
> >> cgroup. But we do have issues with cgroup_iter_seq_stop() which I missed
> >> earlier.
> >>
> >
> > Right, I was wondering whether it should call seq_show for the v == NULL case. All
> > other iterators seem to do so, it's a bit different here since it is only
> > iterating over a single cgroup, I guess, but it would be nice to have some
> > consistency.
>
> You are right; I think it is okay since it only iterates over one
> cgroup. This is different from other cases so far where more than one
> object may be traversed. We may have other future use cases, e.g.,
> one task. I think we can abstract out start()/next()/stop() callbacks
> for such use cases. So it is okay that it differs from other existing
> iterators, since they are indeed different.
>

Right. This iter is special. It has a single element. So we don't
really need a preamble and epilogue; they can be coded directly in
the iter program. And we can also guarantee the cgroup passed in is
always valid; otherwise we wouldn't invoke show(). So passing
PTR_TO_BTF_ID is fine. I did so mainly in order to save a null check
inside the prog.
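
In other words, a prog like this sketch (ctx layout per this series'
bpf_iter__cgroup; BPF_SEQ_PRINTF comes from libbpf's bpf_tracing.h) can
dereference the cgroup directly:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("iter/cgroup")
int dump_cgroup_id(struct bpf_iter__cgroup *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct cgroup *cgroup = ctx->cgroup;

	/* no NULL check needed: PTR_TO_BTF_ID (vs. _OR_NULL) means the
	 * verifier guarantees cgroup is non-NULL when show() runs */
	BPF_SEQ_PRINTF(seq, "cgroup_id: %llu\n", cgroup->kn->id);
	return 0;
}

char _license[] SEC("license") = "GPL";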

> >
> >> For cgroup_iter, the following is the current workflow:
> >>     start -> not NULL -> show -> next -> NULL -> stop
> >> or
> >>     start -> NULL -> stop
> >>
> >> So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
> >> the cgroup_put() is not actually called, i.e., corresponding cgroup is
> >> not freed.
> >>
> >> There are two ways to fix the issue:
> >>    . call cgroup_put() in next() before return NULL. This way,
> >>      stop() will be a noop.
> >>    . put cgroup_get_from_id() and cgroup_put() in
> >>      bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
> >>
> >> I prefer the second approach as it is cleaner.
> >>

Yeah, the second approach should be fine. I was thinking of holding
the cgroup's reference only when we actually start reading, so that a
cgroup can go away at any time and this iter takes a reference only on a
best-effort basis. Now a reference is held from attach to detach, but I think it
should be fine. Let me test.
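
Roughly like this, I think (a sketch; it assumes bpf_iter_aux_info
gains a cgroup pointer, and the linfo field names are per this series):

static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
				  union bpf_iter_link_info *linfo,
				  struct bpf_iter_aux_info *aux)
{
	struct cgroup *cgrp = cgroup_get_from_id(linfo->cgroup.cgroup_id);

	if (!cgrp)
		return -ENOENT;
	aux->cgroup = cgrp;	/* hold the ref for the link's lifetime */
	return 0;
}

static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
{
	cgroup_put(aux->cgroup);
}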

> >
> > I think the current approach is also not safe if cgroup_id gets reused, right? I.e.
> > it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> > be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> > is less likely to occur, but since it wraps around to find a free ID it might
> > not be theoretical.
>
> As Alexei mentioned, cgroup id is 64-bit, so a collision should
> be nearly impossible. Another option is to get an fd from
> the cgroup path, and send the fd to the kernel. This probably
> works.
>

A 64-bit cgroup id should be fine. Using a cgroup path and fd is
unnecessarily complicated, IMHO.
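
On the userspace side, parameterizing the iter link by id would then
look something like this (a sketch; linfo.cgroup is what this series
adds to union bpf_iter_link_info):

	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	struct bpf_link *link;

	linfo.cgroup.cgroup_id = cgroup_id;	/* 64-bit kernfs id */
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);
	link = bpf_program__attach_iter(prog, &opts);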

> [...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints
  2022-03-03 20:04               ` Alexei Starovoitov
@ 2022-03-03 22:06                 ` Hao Luo
  0 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-03 22:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Thu, Mar 3, 2022 at 12:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 3, 2022 at 12:02 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Mar 3, 2022 at 11:43 AM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Wed, Mar 2, 2022 at 6:29 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 2, 2022 at 5:09 PM Yonghong Song <yhs@fb.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 3/2/22 1:30 PM, Alexei Starovoitov wrote:
> > > > > > On Wed, Mar 2, 2022 at 1:23 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 2/25/22 3:43 PM, Hao Luo wrote:
> > > > > > >>> Add a new type of bpf tracepoints: sleepable tracepoints, which allow
> > > > > >>> the handler to make calls that may sleep. With sleepable tracepoints, a
> > > > > >>> set of syscall helpers (which may sleep) may also be called from
> > > > > >>> sleepable tracepoints.
> > > > > >>
> > > > > >> There are some old discussions on sleepable tracepoints, maybe
> > > > > >> worthwhile to take a look.
> > > > > >>
> > > > > >> https://lore.kernel.org/bpf/20210218222125.46565-5-mjeanson@efficios.com/T/
> > > > > >
> > > > > > Right. It's very much related, but obsolete too.
> > > > > > We don't need any of that for sleepable _raw_ tps.
> > > > > > I prefer to stay with "sleepable" name as well to
> > > > > > match the rest of the bpf sleepable code.
> > > > > > In all cases it's faultable.
> > > > >
> > > > > sounds good to me. Agree that for the bpf user case, Hao's
> > > > > implementation should be enough.
> > > >
> > > > Just remembered that we can also do a trivial noinline __weak
> > > > nop function and mark it sleepable on the verifier side.
> > > > That's what we were planning to do to trace map update/delete ops
> > > > in Joe Burton's series.
> > > > Then we don't need to extend tp infra.
> > > > I'm fine whichever way. I see pros and cons in both options.
> > >
> > > Joe is also cc'ed on this patchset; I will sync up with him on the
> > > status of the trace map work.
> > >
> > > Alexei, do we potentially have other variants of tp? We can make the
> > > current u16 sleepable a flag, so we can reuse this flag later when we
> > > have another type of tracepoint.
> >
> > When we added the ability to attach to kernel functions and mark them
> > as allow_error_inject, the usefulness of tracepoints and even
> > writable tracepoints was diminished.
> > If we do sleepable tracepoint, I suspect, it may be the last extension
> > in that area.
> > I guess I'm convincing myself that noinline weak nop func
> > is better here. Just like it's better for Joe's map tracing.
>
> To add to the above... The only downside of sleepable nop func
> comparing to tp is the lack of static_branch.
> So this nop call will always be there.
> For map tracing and for cgroup mkdir/rmdir the few nanosecond
> overhead of calling an empty function isn't even measurable.

The overhead should be fine, I think. mkdir/rmdir won't be frequent
operations. Thanks for the explanation. Let me give it a try and
report back how it works.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-03 18:50         ` Hao Luo
@ 2022-03-04 18:37           ` Hao Luo
  2022-03-05 23:47             ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-03-04 18:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Thu, Mar 3, 2022 at 10:50 AM Hao Luo <haoluo@google.com> wrote:
>
> On Wed, Mar 2, 2022 at 11:34 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Feb 28, 2022 at 02:10:39PM -0800, Hao Luo wrote:
> > > Hi Kumar,
> > >
> > > On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> > > > > This patch allows bpf_syscall prog to perform some basic filesystem
> > > > > operations: create, remove directories and unlink files. Three bpf
> > > > > helpers are added for this purpose. When combined with the following
> > > > > patches that allow pinning and getting bpf objects from bpf prog,
> > > > > this feature can be used to create a directory hierarchy in bpffs that
> > > > > helps manage bpf objects purely using bpf progs.
> > > > >
> > > > > The added helpers are subject to the same permission checks as their syscall
> > > > > versions. For example, one cannot write to a read-only file system;
> > > > > the identity of the current process is checked to see whether it has
> > > > > sufficient permission to perform the operations.
> > > > >
> > > > > Only directories and files in bpffs can be created or removed by these
> > > > > helpers. But it won't be too hard to allow these helpers to operate
> > > > > on files in other filesystems, if we want.
> > > > >
> > > > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > > > ---
> > > > > + *
> > > > > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> > > > > + *   Description
> > > > > + *           Attempts to create a directory named *pathname*. The argument
> > > > > + *           *pathname_sz* specifies the length of the string *pathname*.
> > > > > + *           The argument *mode* specifies the mode for the new directory. It
> > > > > + *           is modified by the process's umask. It has the same semantic as
> > > > > + *           the syscall mkdir(2).
> > > > > + *   Return
> > > > > + *           0 on success, or a negative error in case of failure.
> > > > > + *
> > > > > + * long bpf_rmdir(const char *pathname, int pathname_sz)
> > > > > + *   Description
> > > > > + *           Deletes a directory, which must be empty.
> > > > > + *   Return
> > > > > + *           0 on success, or a negative error in case of failure.
> > > > > + *
> > > > > + * long bpf_unlink(const char *pathname, int pathname_sz)
> > > > > + *   Description
> > > > > + *           Deletes a name and possibly the file it refers to. It has the
> > > > > + *           same semantics as the syscall unlink(2).
> > > > > + *   Return
> > > > > + *           0 on success, or a negative error in case of failure.
> > > > >   */
> > > > >
> > > >
> > > > How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
> > > > more useful for other cases in future, and when AT_FDCWD is passed, has the same
> > > > functionality as these, but when openat/fget is supported, it would work
> > > > relative to other dirfds as well. It can also allow using dirfd of the process
> > > > calling read for a iterator (e.g. if it sets the fd number using skel->bss).
> > > > unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
> > > >
> > > > WDYT?
> > > >
> > >
> > > The idea sounds good to me, more flexible. But I don't have a real use
> > > case for using the added 'dirfd' at this moment. For all the use cases
> > > I can think of, absolute paths will suffice, I think. Unless other
> > > reviewers have opposition, I will try switching to mkdirat and
> > > unlinkat in v2.
> >
> > I'm surprised you don't need "at" variants.
> > I thought your production setup has a top level cgroup controller and
> > then inner tasks inside containers manage cgroups on their own.
> > Since containers are involved they likely run inside their own mountns.
> > cgroupfs mount is single. So you probably don't even need to bind mount it
> > inside containers, but bpffs is not a single mount. You need
> > to bind mount top bpffs inside containers for tasks to access it.
> > Now for cgroupfs the abs path is not an issue, but for bpffs
> > the AT_FDCWD becomes a problem. AT_FDCWD is using current mount ns.
> > Inside container that will be different. Unless you bind mount into exact
> > same path the full path has different meanings inside and outside of the container.
> > It seems to me the bpf progs attached to cgroup sleepable events should
> > be using FD of bpffs. Then when these tracepoints are triggered from
> > different containers in different mountns they will get the right dir prefix.
> > What am I missing?
> >
>
> Alexei, you are perfectly right. To be honest, I failed to see the
> fact that the sleepable tp prog runs in the container's mount ns. I
> think we can bind mount the top bpffs into exactly the same path
> inside the container, right? But I haven't tested this idea. Passing FDs
> should be better.
>

I gave this question more thought. We don't need to bind mount the top
bpffs into the container; instead, we may be able to overlay a bpffs
directory into the container. Here is the workflow in my mind:

For each job, let's say A, the container runtime can create a
directory in bpffs, for example

  /sys/fs/bpf/jobs/A

and then create the cgroup for A. The sleepable tracing prog will
create the file:

  /sys/fs/bpf/jobs/A/100/stats

100 is the created cgroup's id. Then the container runtime overlays
the bpffs directory into container A in the same path:

  [A's container path]/sys/fs/bpf/jobs/A.

A can see the stats at the path within its mount ns:

  /sys/fs/bpf/jobs/A/100/stats

When A creates a cgroup, it is able to write to the top layer of the
overlayed directory. So it is

  /sys/fs/bpf/jobs/A/101/stats

Some of my thoughts:
  1. Compared to bind mounting the top bpffs into the container, overlaying a
directory avoids exposing other jobs' stats. This gives better
isolation. I already have a patch supporting layering bpffs over
other filesystems; it's not too hard.
  2. Once the container runtime has overlaid the directory into the
container, it has no need to create more cgroups for this job. It
doesn't need to track the stats of job-created cgroups, which are
mainly for inspection by the job itself. Even if it needs to collect
the stats from those cgroups, it can read from the path in the
container.
  3. The overlay path in the container doesn't have to be exactly the
same as the path in the root mount ns. In the sleepable tracing prog,
we may select paths based on the current process's ns; a sketch of that
follows. If we choose to do this, we can further avoid exposing cgroup
id and job name to the container.
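
Minimal sketch of the ns-based selection (reading the mount ns inum via
BTF is my assumption for illustration, not something this series adds):

/* pick a per-container bpffs prefix keyed by the caller's mount ns */
static __always_inline __u64 current_mnt_ns_inum(void)
{
	struct task_struct *task = bpf_get_current_task_btf();

	return BPF_CORE_READ(task, nsproxy, mnt_ns, ns.inum);
}

/* the sleepable prog would look up a per-container path prefix in a
 * map keyed by this inum, then bpf_mkdir()/pin under it (elided) */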


> > I think non-AT variants are not needed. The prog can always pass AT_FDCWD
> > if it's really the intent, but passing actual FD seems more error-proof.
>
> Let's have the AT version. Passing FD seems the right approach, when
> we use it in the context of container.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-04 18:37           ` Hao Luo
@ 2022-03-05 23:47             ` Alexei Starovoitov
  2022-03-08 21:08               ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-05 23:47 UTC (permalink / raw)
  To: Hao Luo
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Fri, Mar 4, 2022 at 10:37 AM Hao Luo <haoluo@google.com> wrote:
>
> I gave this question more thought. We don't need to bind mount the top
> bpffs into the container; instead, we may be able to overlay a bpffs
> directory into the container. Here is the workflow in my mind:

I don't quite follow what you mean by 'overlay' here.
Another bpffs mount or future overlayfs that supports bpffs?

> For each job, let's say A, the container runtime can create a
> directory in bpffs, for example
>
>   /sys/fs/bpf/jobs/A
>
> and then create the cgroup for A. The sleepable tracing prog will
> create the file:
>
>   /sys/fs/bpf/jobs/A/100/stats
>
> 100 is the created cgroup's id. Then the container runtime overlays
> the bpffs directory into container A in the same path:

Why cgroup id ? Wouldn't it be easier to use the same cgroup name
as in cgroupfs ?

>   [A's container path]/sys/fs/bpf/jobs/A.
>
> A can see the stats at the path within its mount ns:
>
>   /sys/fs/bpf/jobs/A/100/stats
>
> When A creates a cgroup, it is able to write to the top layer of the
> overlayed directory. So it is
>
>   /sys/fs/bpf/jobs/A/101/stats
>
> Some of my thoughts:
>   1. Compared to bind mounting the top bpffs into the container, overlaying a
> directory avoids exposing other jobs' stats. This gives better
> isolation. I already have a patch supporting layering bpffs over
> other filesystems; it's not too hard.

So it's an overlayfs combination of bpffs and something like ext4, right?
I thought you found out that overlayfs has to be the upper fs
and the lower fs shouldn't be modified underneath.
So if bpffs is a lower fs the writes into it should go
through the upper overlayfs, right?

>   2. Once the container runtime has overlaid the directory into the
> container, it has no need to create more cgroups for this job. It
> doesn't need to track the stats of job-created cgroups, which are
> mainly for inspection by the job itself. Even if it needs to collect
> the stats from those cgroups, it can read from the path in the
> container.
>   3. The overlay path in the container doesn't have to be exactly the same
> as the path in the root mount ns. In the sleepable tracing prog, we may
> select paths based on the current process's ns. If we choose to do this,
> we can further avoid exposing cgroup id and job name to the container.

The benefits make sense.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-05 23:47             ` Alexei Starovoitov
@ 2022-03-08 21:08               ` Hao Luo
  0 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-08 21:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Sat, Mar 5, 2022 at 3:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 4, 2022 at 10:37 AM Hao Luo <haoluo@google.com> wrote:
> >
> > I gave this question more thought. We don't need to bind mount the top
> > bpffs into the container; instead, we may be able to overlay a bpffs
> > directory into the container. Here is the workflow in my mind:
>
> I don't quite follow what you mean by 'overlay' here.
> Another bpffs mount or future overlayfs that supports bpffs?
>
> > For each job, let's say A, the container runtime can create a
> > directory in bpffs, for example
> >
> >   /sys/fs/bpf/jobs/A
> >
> > and then create the cgroup for A. The sleepable tracing prog will
> > create the file:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > 100 is the created cgroup's id. Then the container runtime overlays
> > the bpffs directory into container A in the same path:
>
> Why cgroup id ? Wouldn't it be easier to use the same cgroup name
> as in cgroupfs ?
>

Cgroup name isn't unique. We don't need the hierarchy information of
cgroups. We can use a library function to translate a cgroup path to a
cgroup id. See the get_cgroup_id() in patch 9/9. It works fine in the
selftest.
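
For reference, such a translation can be done from userspace with
name_to_handle_at(2), since the kernfs file handle encodes the 64-bit
id; a sketch along those lines (error handling trimmed, not the exact
helper from patch 9/9):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>

static unsigned long long get_cgroup_id(const char *path)
{
	unsigned long long id = 0;
	char buf[sizeof(struct file_handle) + sizeof(id)];
	struct file_handle *fh = (struct file_handle *)buf;
	int mount_id;

	fh->handle_bytes = sizeof(id);
	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0)
		return 0;
	memcpy(&id, fh->f_handle, sizeof(id));
	return id;
}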

> >   [A's container path]/sys/fs/bpf/jobs/A.
> >
> > A can see the stats at the path within its mount ns:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > When A creates a cgroup, it is able to write to the top layer of the
> > overlayed directory. So it is
> >
> >   /sys/fs/bpf/jobs/A/101/stats
> >
> > Some of my thoughts:
> >   1. Compared to bind mounting the top bpffs into the container, overlaying a
> > directory avoids exposing other jobs' stats. This gives better
> > isolation. I already have a patch supporting layering bpffs over
> > other filesystems; it's not too hard.
>
> So it's an overlayfs combination of bpffs and something like ext4, right?
> I thought you found out that overlayfs has to be the upper fs
> and the lower fs shouldn't be modified underneath.
> So if bpffs is a lower fs the writes into it should go
> through the upper overlayfs, right?
>

It's overlayfs combining bpffs and ext4. Bpffs is the upper layer. The
lower layer is an empty ext4 directory. The merged directory is a
directory in the container.
The upper layer contains bpf objects that we want to expose to the
container, for example, the sleepable tracing progs and the iter link
for reading stats. Only the merged directory is visible to the
container and all the updates go through the merged directory.

The following is the example of workflow I'm thinking:

Step 1: We first set up directories and bpf objects needed by containers.

[# ~] ls /sys/fs/bpf/container/upper
tracing_prog   iter_link
[# ~] ls /sys/fs/bpf/container/work
[# ~] ls /container
root   lower
[# ~] ls /container/root
bpf
[# ~] ls /container/root/bpf

Step 2: Use overlayfs to mount a directory from bpffs into the container's home.

[# ~] mkdir /container/lower
[# ~] mkdir /sys/fs/bpf/container/workdir
[# ~] mount -t overlay overlay -o \
 lowerdir=/container/lower,\
 upperdir=/sys/fs/bpf/container/upper,\
 workdir=/sys/fs/bpf/container/work \
  /container/root/bpf
[# ~] ls /container/root/bpf
tracing_prog    iter_link

Step 3: pivot root for the container; we expect to see the bpf objects
mapped into the container:

[# ~] chroot /container/root
[# ~] ls /
bpf
[# ~] ls /bpf
tracing_prog   iter_link

Note:

- I haven't tested Step 3, but Step 1 and Step 2 seem to be working as
expected. I am testing the behaviors of the bpf objects after we
enter the container.

- Only a directory in bpffs is mapped into the container, not the top
bpffs. The path is uniform in all containers, that is, /bpf. The
container should be able to mkdir in /bpf, etc.

> >   2. Once the container runtime has overlaid the directory into the
> > container, it has no need to create more cgroups for this job. It
> > doesn't need to track the stats of job-created cgroups, which are
> > mainly for inspection by the job itself. Even if it needs to collect
> > the stats from those cgroups, it can read from the path in the
> > container.
> >   3. The overlay path in the container doesn't have to be exactly the same
> > as the path in the root mount ns. In the sleepable tracing prog, we may
> > select paths based on the current process's ns. If we choose to do this,
> > we can further avoid exposing cgroup id and job name to the container.
>
> The benefits make sense.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
  2022-02-27  5:18   ` Kumar Kartikeya Dwivedi
  2022-03-02 20:55   ` Yonghong Song
@ 2022-03-12  3:46   ` Al Viro
  2022-03-14 17:07     ` Hao Luo
  2 siblings, 1 reply; 54+ messages in thread
From: Al Viro @ 2022-03-12  3:46 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Fri, Feb 25, 2022 at 03:43:31PM -0800, Hao Luo wrote:
> This patch allows bpf_syscall prog to perform some basic filesystem
> operations: create, remove directories and unlink files. Three bpf
> helpers are added for this purpose. When combined with the following
> patches that allow pinning and getting bpf objects from bpf prog,
> this feature can be used to create a directory hierarchy in bpffs that
> helps manage bpf objects purely using bpf progs.
> 
> The added helpers are subject to the same permission checks as their syscall
> versions. For example, one cannot write to a read-only file system;
> the identity of the current process is checked to see whether it has
> sufficient permission to perform the operations.
> 
> Only directories and files in bpffs can be created or removed by these
> helpers. But it won't be too hard to allow these helpers to operate
> on files in other filesystems, if we want.

In which contexts can those be called?

> +BPF_CALL_2(bpf_rmdir, const char *, pathname, int, pathname_sz)
> +{
> +	struct user_namespace *mnt_userns;
> +	struct path parent;
> +	struct dentry *dentry;
> +	int err;
> +
> +	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
> +		return -EINVAL;
> +
> +	err = kern_path(pathname, 0, &parent);
> +	if (err)
> +		return err;
> +
> +	if (!bpf_path_is_bpf_dir(&parent)) {
> +		err = -EPERM;
> +		goto exit1;
> +	}
> +
> +	err = mnt_want_write(parent.mnt);
> +	if (err)
> +		goto exit1;
> +
> +	dentry = kern_path_locked(pathname, &parent);

This can't be right.  Ever.  There is no promise whatsoever
that these two lookups will resolve to the same place.
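
I.e., nothing prevents this interleaving (illustration, not kernel code):

	/* task A: bpf_rmdir("/sys/fs/bpf/x") */
	kern_path("/sys/fs/bpf/x", 0, &parent);	/* resolves to dentry D1 */
	bpf_path_is_bpf_dir(&parent);		/* check performed on D1 */
						/* task B replaces "x"   */
	dentry = kern_path_locked("/sys/fs/bpf/x", &parent);	/* now D2 */
	/* the check validated D1, but the removal operates on D2 */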

> +BPF_CALL_2(bpf_unlink, const char *, pathname, int, pathname_sz)
> +{
> +	struct user_namespace *mnt_userns;
> +	struct path parent;
> +	struct dentry *dentry;
> +	struct inode *inode = NULL;
> +	int err;
> +
> +	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
> +		return -EINVAL;
> +
> +	err = kern_path(pathname, 0, &parent);
> +	if (err)
> +		return err;
> +
> +	err = mnt_want_write(parent.mnt);
> +	if (err)
> +		goto exit1;
> +
> +	dentry = kern_path_locked(pathname, &parent);
> +	if (IS_ERR(dentry)) {
> +		err = PTR_ERR(dentry);
> +		goto exit2;
> +	}

Ditto.  NAK; if you want to poke into fs/namei.c guts, do it right.
Or at least discuss that on fsdevel.  As it is, it's completely broken.
It's racy *and* it blatantly leaks both vfsmount and dentry references.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-12  3:46   ` Al Viro
@ 2022-03-14 17:07     ` Hao Luo
  2022-03-14 23:10       ` Al Viro
  0 siblings, 1 reply; 54+ messages in thread
From: Hao Luo @ 2022-03-14 17:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

Hello Al,

On Fri, Mar 11, 2022 at 7:46 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Feb 25, 2022 at 03:43:31PM -0800, Hao Luo wrote:
> > This patch allows a bpf_syscall prog to perform some basic
> > filesystem operations: creating and removing directories and
> > unlinking files. Three bpf helpers are added for this purpose. When
> > combined with the following patches that allow pinning and getting
> > bpf objects from bpf progs, this feature can be used to create a
> > directory hierarchy in bpffs that helps manage bpf objects purely
> > using bpf progs.
> >
> > The added helpers are subject to the same permission checks as
> > their syscall versions. For example, one cannot write to a
> > read-only filesystem; the identity of the current process is
> > checked to see whether it has sufficient permission to perform the
> > operations.
> >
> > Only directories and files in bpffs can be created or removed by
> > these helpers. But it won't be too hard to allow these helpers to
> > operate on files in other filesystems, if we want.
>
> In which contexts can those be called?
>

In a sleepable context. The plan is to introduce certain tracepoints
as sleepable; a program that attaches to a sleepable tracepoint is
allowed to call these functions. In particular, the first sleepable
tracepoint introduced in this patchset is one at the end of
cgroup_mkdir(). Do you have any advice?

> > +BPF_CALL_2(bpf_rmdir, const char *, pathname, int, pathname_sz)
> > +{
> > +     struct user_namespace *mnt_userns;
> > +     struct path parent;
> > +     struct dentry *dentry;
> > +     int err;
> > +
> > +     if (pathname_sz <= 1 || pathname[pathname_sz - 1])
> > +             return -EINVAL;
> > +
> > +     err = kern_path(pathname, 0, &parent);
> > +     if (err)
> > +             return err;
> > +
> > +     if (!bpf_path_is_bpf_dir(&parent)) {
> > +             err = -EPERM;
> > +             goto exit1;
> > +     }
> > +
> > +     err = mnt_want_write(parent.mnt);
> > +     if (err)
> > +             goto exit1;
> > +
> > +     dentry = kern_path_locked(pathname, &parent);
>
> This can't be right.  Ever.  There is no promise whatsoever
> that these two lookups will resolve to the same place.
>
> > +BPF_CALL_2(bpf_unlink, const char *, pathname, int, pathname_sz)
> > +{
> > +     struct user_namespace *mnt_userns;
> > +     struct path parent;
> > +     struct dentry *dentry;
> > +     struct inode *inode = NULL;
> > +     int err;
> > +
> > +     if (pathname_sz <= 1 || pathname[pathname_sz - 1])
> > +             return -EINVAL;
> > +
> > +     err = kern_path(pathname, 0, &parent);
> > +     if (err)
> > +             return err;
> > +
> > +     err = mnt_want_write(parent.mnt);
> > +     if (err)
> > +             goto exit1;
> > +
> > +     dentry = kern_path_locked(pathname, &parent);
> > +     if (IS_ERR(dentry)) {
> > +             err = PTR_ERR(dentry);
> > +             goto exit2;
> > +     }
>
> Ditto.  NAK; if you want to poke into fs/namei.c guts, do it right.
> Or at least discuss that on fsdevel.  As it is, it's completely broken.
> It's racy *and* it blatantly leaks both vfsmount and dentry references.
>
> NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

Thanks Al for taking a look. Actually, there is a simpler approach:
can we export two functions in namei.c that wrap calls to do_mkdirat
and do_unlinkat but take a kernel string as the pathname? Then these
two bpf helpers can use them and don't have to be this complicated.
Does this sound good to you?

Thanks!
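
For the record, a hedged sketch of what such exported wrappers might
look like (the names kern_mkdirat/kern_unlinkat are made up). It
relies on two properties of namei.c as of this series' base:
do_mkdirat() and do_unlinkat() take a struct filename and consume it
on all paths, and getname_kernel() builds one from a kernel string,
with putname() tolerating the ERR_PTR case:

/* in fs/namei.c, next to do_mkdirat()/do_unlinkat() */
int kern_mkdirat(int dfd, const char *pathname, umode_t mode)
{
	return do_mkdirat(dfd, getname_kernel(pathname), mode);
}

long kern_unlinkat(int dfd, const char *pathname)
{
	return do_unlinkat(dfd, getname_kernel(pathname));
}

Note that these still perform full pathname resolution internally, so
they would not sidestep the locking question Al raises next.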


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-14 17:07     ` Hao Luo
@ 2022-03-14 23:10       ` Al Viro
  2022-03-15 17:27         ` Hao Luo
  0 siblings, 1 reply; 54+ messages in thread
From: Al Viro @ 2022-03-14 23:10 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Mon, Mar 14, 2022 at 10:07:31AM -0700, Hao Luo wrote:
> Hello Al,

> > In which contexts can those be called?
> >
> 
> In a sleepable context. The plan is to introduce certain tracepoints
> as sleepable; a program that attaches to a sleepable tracepoint is
> allowed to call these functions. In particular, the first sleepable
> tracepoint introduced in this patchset is one at the end of
> cgroup_mkdir(). Do you have any advice?

Yes - don't do it, unless you really want a lot of user-triggerable
deadlocks.

Pathname resolution is not locking-agnostic.  In particular, you can't
do it if you are under any ->i_rwsem, whether it's shared or exclusive.
That includes cgroup_mkdir() callchains.  And if the pathname passed
to these functions will have you walk through the parent directory,
you would get screwed (e.g. if the next component happens to be
inexistent, triggering a lookup, which takes ->i_rwsem shared).
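
Concretely, the self-deadlock being described can be drawn out as a
single-task trace (a hedged illustration with a made-up path; rwsems
are not recursive, so the down_read() blocks forever):

mkdir("/sys/fs/cgroup/job")                /* userspace             */
  down_write(&cgroup_root_dir->i_rwsem)    /* VFS locks the parent  */
  cgroup_mkdir()
    sleepable tracepoint -> bpf prog
      bpf_mkdir("/sys/fs/cgroup/job/x")    /* walks through cgroupfs */
        "job" misses in the dcache, slow-path lookup
          down_read(&cgroup_root_dir->i_rwsem)  /* held above: deadlock */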


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-14 23:10       ` Al Viro
@ 2022-03-15 17:27         ` Hao Luo
  2022-03-15 18:59           ` Alexei Starovoitov
  2022-03-15 19:00           ` Al Viro
  0 siblings, 2 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-15 17:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Mon, Mar 14, 2022 at 4:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Mar 14, 2022 at 10:07:31AM -0700, Hao Luo wrote:
> > Hello Al,
>
> > > In which contexts can those be called?
> > >
> >
> > In a sleepable context. The plan is to introduce certain tracepoints
> > as sleepable; a program that attaches to a sleepable tracepoint is
> > allowed to call these functions. In particular, the first sleepable
> > tracepoint introduced in this patchset is one at the end of
> > cgroup_mkdir(). Do you have any advice?
>
> Yes - don't do it, unless you really want a lot of user-triggerable
> deadlocks.
>
> Pathname resolution is not locking-agnostic.  In particular, you can't
> do it if you are under any ->i_rwsem, whether it's shared or exclusive.
> That includes cgroup_mkdir() callchains.  And if the pathname passed
> to these functions will have you walk through the parent directory,
> you would get screwed (e.g. if the next component happens to be
> inexistent, triggering a lookup, which takes ->i_rwsem shared).

I'm thinking of two options; let's see if either can work out:

Option 1: We can put restrictions on the pathname passed into this
helper. We can explicitly require the parameter dirfd to be in bpffs
(we can verify this). In addition, we check that the pathname
contains no dot or dotdot components, so the resolved path will end
up inside bpffs and therefore won't take any ->i_rwsem that is held
in the callchain of cgroup_mkdir().

Option 2: We can avoid pathname resolution entirely. Like above, we
can adjust the semantics of this helper: it creates an immediate
child directory under the dirfd passed in. In particular, like
above, we can require the dirfd to be in bpffs and the pathname to
consist only of alphanumeric characters. With these restrictions, we
call vfs_mkdir() to create directories.

Being able to mkdir from bpf has useful use cases; let's try to make
it happen, even with many limitations.

Thanks!
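
A hedged sketch of what Option 2 could look like (illustrative only;
it assumes @parent was already resolved from the dirfd and verified
to be a directory in bpffs, which is the part the helper would have
to enforce). lookup_one_len() itself rejects "", ".", ".." and any
name containing '/', so the single-component restriction comes for
free:

static int bpf_mkdir_under(const struct path *parent, const char *name,
			   umode_t mode)
{
	struct inode *dir = d_inode(parent->dentry);
	struct dentry *dentry;
	int err;

	err = mnt_want_write(parent->mnt);
	if (err)
		return err;

	/* no path walk: lock the parent and look up one component */
	inode_lock_nested(dir, I_MUTEX_PARENT);
	dentry = lookup_one_len(name, parent->dentry, strlen(name));
	err = PTR_ERR(dentry);
	if (IS_ERR(dentry))
		goto out_unlock;

	err = vfs_mkdir(mnt_user_ns(parent->mnt), dir, dentry, mode);
	dput(dentry);
out_unlock:
	inode_unlock(dir);
	mnt_drop_write(parent->mnt);
	return err;
}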


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-15 17:27         ` Hao Luo
@ 2022-03-15 18:59           ` Alexei Starovoitov
  2022-03-15 19:03             ` Alexei Starovoitov
  2022-03-15 19:00           ` Al Viro
  1 sibling, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-15 18:59 UTC (permalink / raw)
  To: Hao Luo
  Cc: Al Viro, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Tue, Mar 15, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Mar 14, 2022 at 4:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Mon, Mar 14, 2022 at 10:07:31AM -0700, Hao Luo wrote:
> > > Hello Al,
> >
> > > > In which contexts can those be called?
> > > >
> > >
> > > In a sleepable context. The plan is to introduce certain tracepoints
> > > as sleepable; a program that attaches to a sleepable tracepoint is
> > > allowed to call these functions. In particular, the first sleepable
> > > tracepoint introduced in this patchset is one at the end of
> > > cgroup_mkdir(). Do you have any advice?
> >
> > Yes - don't do it, unless you really want a lot of user-triggerable
> > deadlocks.
> >
> > Pathname resolution is not locking-agnostic.  In particular, you can't
> > do it if you are under any ->i_rwsem, whether it's shared or exclusive.
> > That includes cgroup_mkdir() callchains.  And if the pathname passed
> > to these functions will have you walk through the parent directory,
> > you would get screwed (e.g. if the next component happens to be
> > inexistent, triggering a lookup, which takes ->i_rwsem shared).
>
> I'm thinking of two options; let's see if either can work out:
>
> Option 1: We can put restrictions on the pathname passed into this
> helper. We can explicitly require the parameter dirfd to be in bpffs
> (we can verify this). In addition, we check that the pathname
> contains no dot or dotdot components, so the resolved path will end
> up inside bpffs and therefore won't take any ->i_rwsem that is held
> in the callchain of cgroup_mkdir().
>
> Option 2: We can avoid pathname resolution entirely. Like above, we
> can adjust the semantics of this helper: it creates an immediate
> child directory under the dirfd passed in. In particular, like
> above, we can require the dirfd to be in bpffs and the pathname to
> consist only of alphanumeric characters. With these restrictions, we
> call vfs_mkdir() to create directories.
>
> Being able to mkdir from bpf has useful use cases; let's try to make
> it happen, even with many limitations.

Option 3. delegate vfs_mkdir to a worker and wait in the helper.


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-15 17:27         ` Hao Luo
  2022-03-15 18:59           ` Alexei Starovoitov
@ 2022-03-15 19:00           ` Al Viro
  2022-03-15 19:47             ` Hao Luo
  1 sibling, 1 reply; 54+ messages in thread
From: Al Viro @ 2022-03-15 19:00 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Tue, Mar 15, 2022 at 10:27:39AM -0700, Hao Luo wrote:

> Option 1: We can put restrictions on the pathname passed into this
> helper. We can explicitly require the parameter dirfd to be in bpffs
> (we can verify this). In addition, we check that the pathname
> contains no dot or dotdot components, so the resolved path will end
> up inside bpffs and therefore won't take any ->i_rwsem that is held
> in the callchain of cgroup_mkdir().

Won't be enough - mount --bind the parent under itself and there you go...
Sure, you could prohibit mountpoint crossing, etc., but at that point
I'd question the usefulness of pathname resolution in the first place.
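
Spelling out one evasion of that sort (a hedged guess at the kind of
mount Al means, assuming bpffs is mounted at /sys/fs/bpf): after

	/* userspace, from <sys/mount.h>; needs CAP_SYS_ADMIN */
	mount("/sys/fs", "/sys/fs/bpf/escape", NULL, MS_BIND, NULL);

the dot/dotdot-free path "escape/bpf/foo" still resolves to a
location inside bpffs, yet the walk crosses non-bpffs directories and
takes their ->i_rwsem along the way, which is exactly what the
restriction was meant to rule out.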


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-15 18:59           ` Alexei Starovoitov
@ 2022-03-15 19:03             ` Alexei Starovoitov
  0 siblings, 0 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2022-03-15 19:03 UTC (permalink / raw)
  To: Hao Luo
  Cc: Al Viro, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
	Stanislav Fomichev, bpf, LKML

On Tue, Mar 15, 2022 at 11:59 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 15, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> >
> > On Mon, Mar 14, 2022 at 4:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Mon, Mar 14, 2022 at 10:07:31AM -0700, Hao Luo wrote:
> > > > Hello Al,
> > >
> > > > > In which contexts can those be called?
> > > > >
> > > >
> > > > In a sleepable context. The plan is to introduce certain tracepoints
> > > > as sleepable; a program that attaches to a sleepable tracepoint is
> > > > allowed to call these functions. In particular, the first sleepable
> > > > tracepoint introduced in this patchset is one at the end of
> > > > cgroup_mkdir(). Do you have any advice?
> > >
> > > Yes - don't do it, unless you really want a lot of user-triggerable
> > > deadlocks.
> > >
> > > Pathname resolution is not locking-agnostic.  In particular, you can't
> > > do it if you are under any ->i_rwsem, whether it's shared or exclusive.
> > > That includes cgroup_mkdir() callchains.  And if the pathname passed
> > > to these functions will have you walk through the parent directory,
> > > you would get screwed (e.g. if the next component happens to be
> > > inexistent, triggering a lookup, which takes ->i_rwsem shared).
> >
> > I'm thinking of two options; let's see if either can work out:
> >
> > Option 1: We can put restrictions on the pathname passed into this
> > helper. We can explicitly require the parameter dirfd to be in bpffs
> > (we can verify this). In addition, we check that the pathname
> > contains no dot or dotdot components, so the resolved path will end
> > up inside bpffs and therefore won't take any ->i_rwsem that is held
> > in the callchain of cgroup_mkdir().
> >
> > Option 2: We can avoid pathname resolution entirely. Like above, we
> > can adjust the semantics of this helper: it creates an immediate
> > child directory under the dirfd passed in. In particular, like
> > above, we can require the dirfd to be in bpffs and the pathname to
> > consist only of alphanumeric characters. With these restrictions, we
> > call vfs_mkdir() to create directories.
> >
> > Being able to mkdir from bpf has useful use cases; let's try to make
> > it happen, even with many limitations.
>
> Option 3. delegate vfs_mkdir to a worker and wait in the helper.

I meant _don't_ wait, of course.
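
For the archive, a hedged sketch of what "hand it off and don't wait"
could look like (every name here is made up, and it assumes
do_mkdirat()/getname_kernel() are callable from this code). The
trade-off is visible in the return value: the prog learns whether the
work was queued, not whether the mkdir succeeded, and the directory
may not exist yet when the helper returns:

struct bpf_mkdir_work {
	struct work_struct work;
	umode_t mode;
	char pathname[PATH_MAX];	/* absolute pathnames only */
};

static void bpf_mkdir_workfn(struct work_struct *work)
{
	struct bpf_mkdir_work *w =
		container_of(work, struct bpf_mkdir_work, work);

	/* runs in process context, outside the tracepoint's locks */
	do_mkdirat(AT_FDCWD, getname_kernel(w->pathname), w->mode);
	kfree(w);
}

static int bpf_mkdir_deferred(const char *pathname, umode_t mode)
{
	struct bpf_mkdir_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

	if (!w)
		return -ENOMEM;
	if (strscpy(w->pathname, pathname, sizeof(w->pathname)) < 0) {
		kfree(w);
		return -ENAMETOOLONG;
	}
	w->mode = mode;
	INIT_WORK(&w->work, bpf_mkdir_workfn);
	schedule_work(&w->work);
	return 0;	/* fire and forget: mkdir errors aren't reported */
}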


* Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
  2022-03-15 19:00           ` Al Viro
@ 2022-03-15 19:47             ` Hao Luo
  0 siblings, 0 replies; 54+ messages in thread
From: Hao Luo @ 2022-03-15 19:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Tejun Heo, joshdon, sdf, bpf,
	linux-kernel

On Tue, Mar 15, 2022 at 12:02 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Mar 15, 2022 at 10:27:39AM -0700, Hao Luo wrote:
>
> > Option 1: We can put restrictions on the pathname passed into this
> > helper. We can explicitly require the parameter dirfd to be in bpffs
> > (we can verify this). In addition, we check that the pathname
> > contains no dot or dotdot components, so the resolved path will end
> > up inside bpffs and therefore won't take any ->i_rwsem that is held
> > in the callchain of cgroup_mkdir().
>
> Won't be enough - mount --bind the parent under itself and there you go...
> Sure, you could prohibit mountpoint crossing, etc., but at that point
> I'd question the usefulness of pathname resolution in the first place.

[Apologies for the resend; my earlier response did not get delivered
to the mailing list]

I don't see a use case right now where we need to bind mount
directories in bpffs. So in option 1, we can also prohibit
mountpoint crossing.

Pathname resolution is still useful in this case. Imagine we want to
put all the created dirs under a base dir: we can open the base dir
and reuse its fd for multiple mkdirs, for example:

Userspace:
  fd = openat(..., "/sys/fs/bpf", ...);
  pass fd to the bpf prog

bpf prog:
  bpf_mkdirat(fd, "common1", ...);
  bpf_mkdirat(fd, "common1/a", ...);
  bpf_mkdirat(fd, "common1/b", ...);
  bpf_mkdirat(fd, "common2", ...);
  ...

It would be very inconvenient if we can't resolve multi-level paths.

As Alexei said, another option is to delegate the syscall to a worker
thread. IMHO, we could do that in the future if we find there is a
need for the full pathname resolution feature.

Al, does that sound good?


Thread overview: 54+ messages
2022-02-25 23:43 [PATCH bpf-next v1 0/9] Extend cgroup interface with bpf Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall Hao Luo
2022-02-27  5:18   ` Kumar Kartikeya Dwivedi
2022-02-28 22:10     ` Hao Luo
2022-03-02 19:34       ` Alexei Starovoitov
2022-03-03 18:50         ` Hao Luo
2022-03-04 18:37           ` Hao Luo
2022-03-05 23:47             ` Alexei Starovoitov
2022-03-08 21:08               ` Hao Luo
2022-03-02 20:55   ` Yonghong Song
2022-03-03 18:56     ` Hao Luo
2022-03-03 19:13       ` Yonghong Song
2022-03-03 19:15         ` Hao Luo
2022-03-12  3:46   ` Al Viro
2022-03-14 17:07     ` Hao Luo
2022-03-14 23:10       ` Al Viro
2022-03-15 17:27         ` Hao Luo
2022-03-15 18:59           ` Alexei Starovoitov
2022-03-15 19:03             ` Alexei Starovoitov
2022-03-15 19:00           ` Al Viro
2022-03-15 19:47             ` Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 2/9] bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 3/9] selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints Hao Luo
2022-03-02 19:41   ` Alexei Starovoitov
2022-03-03 19:37     ` Hao Luo
2022-03-03 19:59       ` Alexei Starovoitov
2022-03-02 21:23   ` Yonghong Song
2022-03-02 21:30     ` Alexei Starovoitov
2022-03-03  1:08       ` Yonghong Song
2022-03-03  2:29         ` Alexei Starovoitov
2022-03-03 19:43           ` Hao Luo
2022-03-03 20:02             ` Alexei Starovoitov
2022-03-03 20:04               ` Alexei Starovoitov
2022-03-03 22:06                 ` Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 5/9] cgroup: Sleepable cgroup tracepoints Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 6/9] libbpf: Add sleepable tp_btf Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel Hao Luo
2022-03-02 20:01   ` Alexei Starovoitov
2022-03-03 19:14     ` Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter Hao Luo
2022-02-26  2:32   ` kernel test robot
2022-02-26  2:32   ` kernel test robot
2022-02-26  2:53   ` kernel test robot
2022-03-02 21:59   ` Yonghong Song
2022-03-03 20:02     ` Hao Luo
2022-03-02 22:45   ` Kumar Kartikeya Dwivedi
2022-03-03  2:03     ` Yonghong Song
2022-03-03  3:03       ` Kumar Kartikeya Dwivedi
2022-03-03  4:00         ` Alexei Starovoitov
2022-03-03  7:33         ` Yonghong Song
2022-03-03  8:13           ` Kumar Kartikeya Dwivedi
2022-03-03 21:52           ` Hao Luo
2022-02-25 23:43 ` [PATCH bpf-next v1 9/9] selftests/bpf: Tests using sleepable tracepoints to monitor cgroup events Hao Luo
