bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/3] Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support
@ 2023-05-16  0:13 Andrii Nakryiko
  2023-05-16  0:13 ` [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands Andrii Nakryiko
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-16  0:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: cyphar, brauner, lennart, Andrii Nakryiko

Add ability to specify pinning location within BPF FS using O_PATH-based FDs,
similar to openat() family of APIs. Patch #1 adds necessary kernel-side
changes. Patch #2 exposes this through libbpf APIs. Patch #3 uses new mount
APIs (fsopen, fsconfig, fsmount) to demonstrated how now it's possible to work
with detach-mounted BPF FS using new BPF_OBJ_PIN and BPF_OBJ_GET
functionality.

This feature is inspired as a result of recent conversations during
LSF/MM/BPF 2023 conference about shortcomings of being able to perform BPF
objects pinning only using lookup-based paths.

Andrii Nakryiko (3):
  bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  libbpf: add opts-based bpf_obj_pin() API and add support for path_fd
  selftests/bpf: add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests

 include/linux/bpf.h                           |   4 +-
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/inode.c                            |  16 +--
 kernel/bpf/syscall.c                          |   8 +-
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/lib/bpf/bpf.c                           |  16 ++-
 tools/lib/bpf/bpf.h                           |  17 ++-
 tools/lib/bpf/libbpf.map                      |   1 +
 .../bpf/prog_tests/bpf_obj_pinning.c          | 110 ++++++++++++++++++
 9 files changed, 164 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  2023-05-16  0:13 [PATCH bpf-next 0/3] Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support Andrii Nakryiko
@ 2023-05-16  0:13 ` Andrii Nakryiko
  2023-05-16  8:52   ` Jiri Olsa
  2023-05-16  9:07   ` Christian Brauner
  2023-05-16  0:13 ` [PATCH bpf-next 2/3] libbpf: add opts-based bpf_obj_pin() API and add support for path_fd Andrii Nakryiko
  2023-05-16  0:13 ` [PATCH bpf-next 3/3] selftests/bpf: add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests Andrii Nakryiko
  2 siblings, 2 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-16  0:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: cyphar, brauner, lennart, Andrii Nakryiko

Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.

One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.

This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.

This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h            |  4 ++--
 include/uapi/linux/bpf.h       |  5 +++++
 kernel/bpf/inode.c             | 16 ++++++++--------
 kernel/bpf/syscall.c           |  8 +++++---
 tools/include/uapi/linux/bpf.h |  5 +++++
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 36e4b2d8cca2..f58895830ada 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
 
-int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
-int bpf_obj_get_user(const char __user *pathname, int flags);
+int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
+int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
 
 #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
 #define DEFINE_BPF_ITER_FUNC(target, args...)			\
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1bb11a6ee667..db2870a52ce0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1420,6 +1420,11 @@ union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 		__u32		file_flags;
+		/* same as dirfd in openat() syscall; see openat(2)
+		 * manpage for details of dirfd/path_fd and pathname semantics;
+		 * zero path_fd implies AT_FDCWD behavior
+		 */
+		__u32		path_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 9948b542a470..13bb54f6bd17 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -435,7 +435,7 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent,
 	return ret;
 }
 
-static int bpf_obj_do_pin(const char __user *pathname, void *raw,
+static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw,
 			  enum bpf_type type)
 {
 	struct dentry *dentry;
@@ -444,7 +444,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	umode_t mode;
 	int ret;
 
-	dentry = user_path_create(AT_FDCWD, pathname, &path, 0);
+	dentry = user_path_create(path_fd, pathname, &path, 0);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -478,7 +478,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	return ret;
 }
 
-int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
+int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname)
 {
 	enum bpf_type type;
 	void *raw;
@@ -488,14 +488,14 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
 	if (IS_ERR(raw))
 		return PTR_ERR(raw);
 
-	ret = bpf_obj_do_pin(pathname, raw, type);
+	ret = bpf_obj_do_pin(path_fd, pathname, raw, type);
 	if (ret != 0)
 		bpf_any_put(raw, type);
 
 	return ret;
 }
 
-static void *bpf_obj_do_get(const char __user *pathname,
+static void *bpf_obj_do_get(int path_fd, const char __user *pathname,
 			    enum bpf_type *type, int flags)
 {
 	struct inode *inode;
@@ -503,7 +503,7 @@ static void *bpf_obj_do_get(const char __user *pathname,
 	void *raw;
 	int ret;
 
-	ret = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
+	ret = user_path_at(path_fd, pathname, LOOKUP_FOLLOW, &path);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -527,7 +527,7 @@ static void *bpf_obj_do_get(const char __user *pathname,
 	return ERR_PTR(ret);
 }
 
-int bpf_obj_get_user(const char __user *pathname, int flags)
+int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags)
 {
 	enum bpf_type type = BPF_TYPE_UNSPEC;
 	int f_flags;
@@ -538,7 +538,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
 	if (f_flags < 0)
 		return f_flags;
 
-	raw = bpf_obj_do_get(pathname, &type, f_flags);
+	raw = bpf_obj_do_get(path_fd, pathname, &type, f_flags);
 	if (IS_ERR(raw))
 		return PTR_ERR(raw);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 909c112ef537..65a46f6d4be0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2697,14 +2697,15 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	return err;
 }
 
-#define BPF_OBJ_LAST_FIELD file_flags
+#define BPF_OBJ_LAST_FIELD path_fd
 
 static int bpf_obj_pin(const union bpf_attr *attr)
 {
 	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
 		return -EINVAL;
 
-	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
+	return bpf_obj_pin_user(attr->bpf_fd, attr->path_fd ?: AT_FDCWD,
+				u64_to_user_ptr(attr->pathname));
 }
 
 static int bpf_obj_get(const union bpf_attr *attr)
@@ -2713,7 +2714,8 @@ static int bpf_obj_get(const union bpf_attr *attr)
 	    attr->file_flags & ~BPF_OBJ_FLAG_MASK)
 		return -EINVAL;
 
-	return bpf_obj_get_user(u64_to_user_ptr(attr->pathname),
+	return bpf_obj_get_user(attr->path_fd ?: AT_FDCWD,
+				u64_to_user_ptr(attr->pathname),
 				attr->file_flags);
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1bb11a6ee667..db2870a52ce0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1420,6 +1420,11 @@ union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 		__u32		file_flags;
+		/* same as dirfd in openat() syscall; see openat(2)
+		 * manpage for details of dirfd/path_fd and pathname semantics;
+		 * zero path_fd implies AT_FDCWD behavior
+		 */
+		__u32		path_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH bpf-next 2/3] libbpf: add opts-based bpf_obj_pin() API and add support for path_fd
  2023-05-16  0:13 [PATCH bpf-next 0/3] Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support Andrii Nakryiko
  2023-05-16  0:13 ` [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands Andrii Nakryiko
@ 2023-05-16  0:13 ` Andrii Nakryiko
  2023-05-16  0:13 ` [PATCH bpf-next 3/3] selftests/bpf: add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests Andrii Nakryiko
  2 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-16  0:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: cyphar, brauner, lennart, Andrii Nakryiko

Add path_fd support for bpf_obj_pin() and bpf_obj_get() operations
(through their opts-based variants). This allows to take advantage of
new kernel-side support for O_PATH-based pin/get location specification.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c      | 16 +++++++++++++---
 tools/lib/bpf/bpf.h      | 17 +++++++++++++++--
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 128ac723c4ea..35cdc0777d79 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -572,13 +572,17 @@ int bpf_map_update_batch(int fd, const void *keys, const void *values, __u32 *co
 				    (void *)keys, (void *)values, count, opts);
 }
 
-int bpf_obj_pin(int fd, const char *pathname)
+int bpf_obj_pin_opts(int fd, const char *pathname, const struct bpf_obj_pin_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, file_flags);
+	const size_t attr_sz = offsetofend(union bpf_attr, path_fd);
 	union bpf_attr attr;
 	int ret;
 
+	if (!OPTS_VALID(opts, bpf_obj_pin_opts))
+		return libbpf_err(-EINVAL);
+
 	memset(&attr, 0, attr_sz);
+	attr.path_fd = OPTS_GET(opts, path_fd, 0);
 	attr.pathname = ptr_to_u64((void *)pathname);
 	attr.bpf_fd = fd;
 
@@ -586,6 +590,11 @@ int bpf_obj_pin(int fd, const char *pathname)
 	return libbpf_err_errno(ret);
 }
 
+int bpf_obj_pin(int fd, const char *pathname)
+{
+	return bpf_obj_pin_opts(fd, pathname, NULL);
+}
+
 int bpf_obj_get(const char *pathname)
 {
 	return bpf_obj_get_opts(pathname, NULL);
@@ -593,7 +602,7 @@ int bpf_obj_get(const char *pathname)
 
 int bpf_obj_get_opts(const char *pathname, const struct bpf_obj_get_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, file_flags);
+	const size_t attr_sz = offsetofend(union bpf_attr, path_fd);
 	union bpf_attr attr;
 	int fd;
 
@@ -601,6 +610,7 @@ int bpf_obj_get_opts(const char *pathname, const struct bpf_obj_get_opts *opts)
 		return libbpf_err(-EINVAL);
 
 	memset(&attr, 0, attr_sz);
+	attr.path_fd = OPTS_GET(opts, path_fd, 0);
 	attr.pathname = ptr_to_u64((void *)pathname);
 	attr.file_flags = OPTS_GET(opts, file_flags, 0);
 
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index a2c091389b18..22500a526520 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -284,16 +284,29 @@ LIBBPF_API int bpf_map_update_batch(int fd, const void *keys, const void *values
 				    __u32 *count,
 				    const struct bpf_map_batch_opts *opts);
 
+struct bpf_obj_pin_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+
+	int path_fd;
+
+	size_t :0;
+};
+#define bpf_obj_pin_opts__last_field path_fd
+
+LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
+LIBBPF_API int bpf_obj_pin_opts(int fd, const char *pathname,
+				const struct bpf_obj_pin_opts *opts);
+
 struct bpf_obj_get_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
 
 	__u32 file_flags;
+	int path_fd;
 
 	size_t :0;
 };
-#define bpf_obj_get_opts__last_field file_flags
+#define bpf_obj_get_opts__last_field path_fd
 
-LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
 LIBBPF_API int bpf_obj_get(const char *pathname);
 LIBBPF_API int bpf_obj_get_opts(const char *pathname,
 				const struct bpf_obj_get_opts *opts);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index a5aa3a383d69..7a4fe80da360 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -389,5 +389,6 @@ LIBBPF_1.2.0 {
 		bpf_link__update_map;
 		bpf_link_get_info_by_fd;
 		bpf_map_get_info_by_fd;
+		bpf_obj_pin_opts;
 		bpf_prog_get_info_by_fd;
 } LIBBPF_1.1.0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH bpf-next 3/3] selftests/bpf: add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
  2023-05-16  0:13 [PATCH bpf-next 0/3] Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support Andrii Nakryiko
  2023-05-16  0:13 ` [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands Andrii Nakryiko
  2023-05-16  0:13 ` [PATCH bpf-next 2/3] libbpf: add opts-based bpf_obj_pin() API and add support for path_fd Andrii Nakryiko
@ 2023-05-16  0:13 ` Andrii Nakryiko
  2 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-16  0:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: cyphar, brauner, lennart, Andrii Nakryiko

Add a selftest demonstrating using detach-mounted BPF FS using new mount
APIs, and pinning and getting BPF map using such mount. This
demonstrates how something like container manager could setup BPF FS,
pin and adjust all the necessary objects in it, all before exposing BPF
FS to a particular mount namespace.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../bpf/prog_tests/bpf_obj_pinning.c          | 110 ++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c b/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c
new file mode 100644
index 000000000000..4624f58e7b8a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <linux/unistd.h>
+#include <linux/mount.h>
+#include <sys/syscall.h>
+
+static inline int sys_fsopen(const char *fsname, unsigned flags)
+{
+	return syscall(__NR_fsopen, fsname, flags);
+}
+
+static inline int sys_fsconfig(int fs_fd, unsigned cmd, const char *key, const void *val, int aux)
+{
+	return syscall(__NR_fsconfig, fs_fd, cmd, key, val, aux);
+}
+
+static inline int sys_fsmount(int fs_fd, unsigned flags, unsigned ms_flags)
+{
+	return syscall(__NR_fsmount, fs_fd, flags, ms_flags);
+}
+
+static inline int sys_move_mount(int from_dfd, const char *from_path,
+			         int to_dfd, const char *to_path,
+			         unsigned int ms_flags)
+{
+	return syscall(__NR_move_mount, from_dfd, from_path, to_dfd, to_path, ms_flags);
+}
+
+void test_bpf_obj_pinning(void)
+{
+	LIBBPF_OPTS(bpf_obj_pin_opts, pin_opts);
+	LIBBPF_OPTS(bpf_obj_get_opts, get_opts);
+	int fs_fd = -1, mnt_fd = -1;
+	int map_fd = -1, map_fd2 = -1;
+	int zero = 0, src_value, dst_value, err;
+	const char *map_name = "fsmount_map";
+
+	/* A bunch of below UAPI calls are constructed based on reading:
+	 * https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
+	 */
+
+	/* create VFS context */
+	fs_fd = sys_fsopen("bpf", 0);
+	if (!ASSERT_GE(fs_fd, 0, "fs_fd"))
+		goto cleanup;
+
+	/* instantiate FS object */
+	err = sys_fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+	if (!ASSERT_OK(err, "fs_create"))
+		goto cleanup;
+
+	/* create O_PATH fd for detached mount */
+	mnt_fd = sys_fsmount(fs_fd, 0, 0);
+	if (!ASSERT_GE(mnt_fd, 0, "mnt_fd"))
+		goto cleanup;
+
+	/* If we wanted to expose detached mount in the file system, we'd do
+	 * something like below. But the whole point is that we actually don't
+	 * even have to expose BPF FS in the file system to be able to work
+	 * (pin/get objects) with it.
+	 *
+	 * err = sys_move_mount(mnt_fd, "", -EBADF, mnt_path, MOVE_MOUNT_F_EMPTY_PATH);
+	 * if (!ASSERT_OK(err, "move_mount"))
+	 *	goto cleanup;
+	 */
+
+	/* create BPF map to pin */
+	map_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, map_name, 4, 4, 1, NULL);
+	if (!ASSERT_GE(map_fd, 0, "map_fd"))
+		goto cleanup;
+
+	/* pin BPF map into detached BPF FS through mnt_fd */
+	pin_opts.path_fd = mnt_fd;
+	err = bpf_obj_pin_opts(map_fd, map_name, &pin_opts);
+	if (!ASSERT_OK(err, "map_pin"))
+		goto cleanup;
+
+	/* get BPF map from detached BPF FS through mnt_fd */
+	get_opts.path_fd = mnt_fd;
+	map_fd2 = bpf_obj_get_opts(map_name, &get_opts);
+	if (!ASSERT_GE(map_fd2, 0, "map_get"))
+		goto cleanup;
+
+	/* update map through one FD */
+	src_value = 0xcafebeef;
+	err = bpf_map_update_elem(map_fd, &zero, &src_value, 0);
+	ASSERT_OK(err, "map_update");
+
+	/* check values written/read through different FDs do match */
+	dst_value = 0;
+	err = bpf_map_lookup_elem(map_fd2, &zero, &dst_value);
+	ASSERT_OK(err, "map_lookup");
+	ASSERT_EQ(dst_value, src_value, "map_value_eq1");
+	ASSERT_EQ(dst_value, 0xcafebeef, "map_value_eq2");
+
+cleanup:
+	if (map_fd >= 0)
+		ASSERT_OK(close(map_fd), "close_map_fd");
+	if (map_fd2 >= 0)
+		ASSERT_OK(close(map_fd2), "close_map_fd2");
+	if (fs_fd >= 0)
+		ASSERT_OK(close(fs_fd), "close_fs_fd");
+	if (mnt_fd >= 0)
+		ASSERT_OK(close(mnt_fd), "close_mnt_fd");
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  2023-05-16  0:13 ` [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands Andrii Nakryiko
@ 2023-05-16  8:52   ` Jiri Olsa
  2023-05-16 18:02     ` Andrii Nakryiko
  2023-05-16  9:07   ` Christian Brauner
  1 sibling, 1 reply; 30+ messages in thread
From: Jiri Olsa @ 2023-05-16  8:52 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, cyphar, brauner, lennart

On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> forces users to specify pinning location as a string-based absolute or
> relative (to current working directory) path. This has various
> implications related to security (e.g., symlink-based attacks), forces
> BPF FS to be exposed in the file system, which can cause races with
> other applications.
> 
> One of the feedbacks we got from folks working with containers heavily
> was that inability to use purely FD-based location specification was an
> unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> commands. This patch closes this oversight, adding path_fd field to
> BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> *at() syscalls for dirfd + pathname combinations.
> 
> This now allows interesting possibilities like working with detached BPF
> FS mount (e.g., to perform multiple pinnings without running a risk of
> someone interfering with them), and generally making pinning/getting
> more secure and not prone to any races and/or security attacks.
> 
> This is demonstrated by a selftest added in subsequent patch that takes
> advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> creating detached BPF FS mount, pinning, and then getting BPF map out of
> it, all while never exposing this private instance of BPF FS to outside
> worlds.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf.h            |  4 ++--
>  include/uapi/linux/bpf.h       |  5 +++++
>  kernel/bpf/inode.c             | 16 ++++++++--------
>  kernel/bpf/syscall.c           |  8 +++++---
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  5 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 36e4b2d8cca2..f58895830ada 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
>  
> -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> -int bpf_obj_get_user(const char __user *pathname, int flags);
> +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
>  
>  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
>  #define DEFINE_BPF_ITER_FUNC(target, args...)			\
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1bb11a6ee667..db2870a52ce0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1420,6 +1420,11 @@ union bpf_attr {
>  		__aligned_u64	pathname;
>  		__u32		bpf_fd;
>  		__u32		file_flags;
> +		/* same as dirfd in openat() syscall; see openat(2)
> +		 * manpage for details of dirfd/path_fd and pathname semantics;
> +		 * zero path_fd implies AT_FDCWD behavior
> +		 */
> +		__u32		path_fd;

I'd probably call it dir_fd to emphasize the similarity,
but I don't mind path_fd as well

I have a note that you suggested to introduce this for uprobe
multi link as well, so I'll do something similar

lgtm

Acked-by: Jiri Olsa <jolsa@kernel.org>

jirka

>  	};
>  
>  	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 9948b542a470..13bb54f6bd17 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -435,7 +435,7 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent,
>  	return ret;
>  }
>  
> -static int bpf_obj_do_pin(const char __user *pathname, void *raw,
> +static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw,
>  			  enum bpf_type type)
>  {
>  	struct dentry *dentry;
> @@ -444,7 +444,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
>  	umode_t mode;
>  	int ret;
>  
> -	dentry = user_path_create(AT_FDCWD, pathname, &path, 0);
> +	dentry = user_path_create(path_fd, pathname, &path, 0);
>  	if (IS_ERR(dentry))
>  		return PTR_ERR(dentry);
>  
> @@ -478,7 +478,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
>  	return ret;
>  }
>  
> -int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
> +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname)
>  {
>  	enum bpf_type type;
>  	void *raw;
> @@ -488,14 +488,14 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
>  	if (IS_ERR(raw))
>  		return PTR_ERR(raw);
>  
> -	ret = bpf_obj_do_pin(pathname, raw, type);
> +	ret = bpf_obj_do_pin(path_fd, pathname, raw, type);
>  	if (ret != 0)
>  		bpf_any_put(raw, type);
>  
>  	return ret;
>  }
>  
> -static void *bpf_obj_do_get(const char __user *pathname,
> +static void *bpf_obj_do_get(int path_fd, const char __user *pathname,
>  			    enum bpf_type *type, int flags)
>  {
>  	struct inode *inode;
> @@ -503,7 +503,7 @@ static void *bpf_obj_do_get(const char __user *pathname,
>  	void *raw;
>  	int ret;
>  
> -	ret = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
> +	ret = user_path_at(path_fd, pathname, LOOKUP_FOLLOW, &path);
>  	if (ret)
>  		return ERR_PTR(ret);
>  
> @@ -527,7 +527,7 @@ static void *bpf_obj_do_get(const char __user *pathname,
>  	return ERR_PTR(ret);
>  }
>  
> -int bpf_obj_get_user(const char __user *pathname, int flags)
> +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags)
>  {
>  	enum bpf_type type = BPF_TYPE_UNSPEC;
>  	int f_flags;
> @@ -538,7 +538,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
>  	if (f_flags < 0)
>  		return f_flags;
>  
> -	raw = bpf_obj_do_get(pathname, &type, f_flags);
> +	raw = bpf_obj_do_get(path_fd, pathname, &type, f_flags);
>  	if (IS_ERR(raw))
>  		return PTR_ERR(raw);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 909c112ef537..65a46f6d4be0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2697,14 +2697,15 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>  	return err;
>  }
>  
> -#define BPF_OBJ_LAST_FIELD file_flags
> +#define BPF_OBJ_LAST_FIELD path_fd
>  
>  static int bpf_obj_pin(const union bpf_attr *attr)
>  {
>  	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
>  		return -EINVAL;
>  
> -	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
> +	return bpf_obj_pin_user(attr->bpf_fd, attr->path_fd ?: AT_FDCWD,
> +				u64_to_user_ptr(attr->pathname));
>  }
>  
>  static int bpf_obj_get(const union bpf_attr *attr)
> @@ -2713,7 +2714,8 @@ static int bpf_obj_get(const union bpf_attr *attr)
>  	    attr->file_flags & ~BPF_OBJ_FLAG_MASK)
>  		return -EINVAL;
>  
> -	return bpf_obj_get_user(u64_to_user_ptr(attr->pathname),
> +	return bpf_obj_get_user(attr->path_fd ?: AT_FDCWD,
> +				u64_to_user_ptr(attr->pathname),
>  				attr->file_flags);
>  }
>  
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1bb11a6ee667..db2870a52ce0 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1420,6 +1420,11 @@ union bpf_attr {
>  		__aligned_u64	pathname;
>  		__u32		bpf_fd;
>  		__u32		file_flags;
> +		/* same as dirfd in openat() syscall; see openat(2)
> +		 * manpage for details of dirfd/path_fd and pathname semantics;
> +		 * zero path_fd implies AT_FDCWD behavior
> +		 */
> +		__u32		path_fd;
>  	};
>  
>  	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -- 
> 2.34.1
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  2023-05-16  0:13 ` [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands Andrii Nakryiko
  2023-05-16  8:52   ` Jiri Olsa
@ 2023-05-16  9:07   ` Christian Brauner
  2023-05-16 18:02     ` Andrii Nakryiko
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Brauner @ 2023-05-16  9:07 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, cyphar, lennart

On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> forces users to specify pinning location as a string-based absolute or
> relative (to current working directory) path. This has various
> implications related to security (e.g., symlink-based attacks), forces
> BPF FS to be exposed in the file system, which can cause races with
> other applications.
> 
> One of the feedbacks we got from folks working with containers heavily
> was that inability to use purely FD-based location specification was an
> unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> commands. This patch closes this oversight, adding path_fd field to

Cool!

> BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> *at() syscalls for dirfd + pathname combinations.
> 
> This now allows interesting possibilities like working with detached BPF
> FS mount (e.g., to perform multiple pinnings without running a risk of
> someone interfering with them), and generally making pinning/getting
> more secure and not prone to any races and/or security attacks.
> 
> This is demonstrated by a selftest added in subsequent patch that takes
> advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> creating detached BPF FS mount, pinning, and then getting BPF map out of
> it, all while never exposing this private instance of BPF FS to outside
> worlds.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf.h            |  4 ++--
>  include/uapi/linux/bpf.h       |  5 +++++
>  kernel/bpf/inode.c             | 16 ++++++++--------
>  kernel/bpf/syscall.c           |  8 +++++---
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  5 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 36e4b2d8cca2..f58895830ada 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
>  
> -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> -int bpf_obj_get_user(const char __user *pathname, int flags);
> +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
>  
>  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
>  #define DEFINE_BPF_ITER_FUNC(target, args...)			\
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1bb11a6ee667..db2870a52ce0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1420,6 +1420,11 @@ union bpf_attr {
>  		__aligned_u64	pathname;
>  		__u32		bpf_fd;
>  		__u32		file_flags;
> +		/* same as dirfd in openat() syscall; see openat(2)
> +		 * manpage for details of dirfd/path_fd and pathname semantics;
> +		 * zero path_fd implies AT_FDCWD behavior
> +		 */
> +		__u32		path_fd;
>  	};

So 0 is a valid file descriptor and can trivially be created and made to
refer to any file. Is this a conscious decision to have a zero value
imply AT_FDCWD and have you done this somewhere else in bpf already?
Because that's contrary to how any file descriptor based apis work.

How this is usually solved for extensible structs is to have a flag
field that raises a flag to indicate that the fd fiel is set and thus 0
can be used as a valid value.

The way you're doing it right now is very counterintuitive to userspace
and pretty much guaranteed to cause subtle bugs.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  2023-05-16  9:07   ` Christian Brauner
@ 2023-05-16 18:02     ` Andrii Nakryiko
  2023-05-17  9:11       ` fd == 0 means AT_FDCWD " Christian Brauner
  0 siblings, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-16 18:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, cyphar, lennart

On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > forces users to specify pinning location as a string-based absolute or
> > relative (to current working directory) path. This has various
> > implications related to security (e.g., symlink-based attacks), forces
> > BPF FS to be exposed in the file system, which can cause races with
> > other applications.
> >
> > One of the feedbacks we got from folks working with containers heavily
> > was that inability to use purely FD-based location specification was an
> > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > commands. This patch closes this oversight, adding path_fd field to
>
> Cool!
>
> > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > *at() syscalls for dirfd + pathname combinations.
> >
> > This now allows interesting possibilities like working with detached BPF
> > FS mount (e.g., to perform multiple pinnings without running a risk of
> > someone interfering with them), and generally making pinning/getting
> > more secure and not prone to any races and/or security attacks.
> >
> > This is demonstrated by a selftest added in subsequent patch that takes
> > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > it, all while never exposing this private instance of BPF FS to outside
> > worlds.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf.h            |  4 ++--
> >  include/uapi/linux/bpf.h       |  5 +++++
> >  kernel/bpf/inode.c             | 16 ++++++++--------
> >  kernel/bpf/syscall.c           |  8 +++++---
> >  tools/include/uapi/linux/bpf.h |  5 +++++
> >  5 files changed, 25 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 36e4b2d8cca2..f58895830ada 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> >
> > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> >
> >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 1bb11a6ee667..db2870a52ce0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1420,6 +1420,11 @@ union bpf_attr {
> >               __aligned_u64   pathname;
> >               __u32           bpf_fd;
> >               __u32           file_flags;
> > +             /* same as dirfd in openat() syscall; see openat(2)
> > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > +              * zero path_fd implies AT_FDCWD behavior
> > +              */
> > +             __u32           path_fd;
> >       };
>
> So 0 is a valid file descriptor and can trivially be created and made to
> refer to any file. Is this a conscious decision to have a zero value
> imply AT_FDCWD and have you done this somewhere else in bpf already?
> Because that's contrary to how any file descriptor based apis work.
>
> How this is usually solved for extensible structs is to have a flag
> field that raises a flag to indicate that the fd fiel is set and thus 0
> can be used as a valid value.
>
> The way you're doing it right now is very counterintuitive to userspace
> and pretty much guaranteed to cause subtle bugs.

Yes, it's a very bpf()-specific convention we've settled on a while
ago. It allows a cleaner and simpler backwards compatibility story
without having to introduce new flags every single time. Most of BPF
UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
it to bpf() syscall. Most of the time users will be blissfully unaware
because libbpf and other BPF libraries are checking for fd == 0 and
dup()'ing them to avoid ever returning FD 0 to the user.

tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
is a bpf() peculiarity, yes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  2023-05-16  8:52   ` Jiri Olsa
@ 2023-05-16 18:02     ` Andrii Nakryiko
  0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-16 18:02 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, cyphar, brauner, lennart

On Tue, May 16, 2023 at 1:52 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > forces users to specify pinning location as a string-based absolute or
> > relative (to current working directory) path. This has various
> > implications related to security (e.g., symlink-based attacks), forces
> > BPF FS to be exposed in the file system, which can cause races with
> > other applications.
> >
> > One of the feedbacks we got from folks working with containers heavily
> > was that inability to use purely FD-based location specification was an
> > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > commands. This patch closes this oversight, adding path_fd field to
> > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > *at() syscalls for dirfd + pathname combinations.
> >
> > This now allows interesting possibilities like working with detached BPF
> > FS mount (e.g., to perform multiple pinnings without running a risk of
> > someone interfering with them), and generally making pinning/getting
> > more secure and not prone to any races and/or security attacks.
> >
> > This is demonstrated by a selftest added in subsequent patch that takes
> > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > it, all while never exposing this private instance of BPF FS to outside
> > worlds.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf.h            |  4 ++--
> >  include/uapi/linux/bpf.h       |  5 +++++
> >  kernel/bpf/inode.c             | 16 ++++++++--------
> >  kernel/bpf/syscall.c           |  8 +++++---
> >  tools/include/uapi/linux/bpf.h |  5 +++++
> >  5 files changed, 25 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 36e4b2d8cca2..f58895830ada 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> >
> > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> >
> >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 1bb11a6ee667..db2870a52ce0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1420,6 +1420,11 @@ union bpf_attr {
> >               __aligned_u64   pathname;
> >               __u32           bpf_fd;
> >               __u32           file_flags;
> > +             /* same as dirfd in openat() syscall; see openat(2)
> > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > +              * zero path_fd implies AT_FDCWD behavior
> > +              */
> > +             __u32           path_fd;
>
> I'd probably call it dir_fd to emphasize the similarity,
> but I don't mind path_fd as well

I considered that, but it's really not necessarily a directory, it
could be a specific file location (with O_PATH), so I felt like a more
generic "path_fd" would be better (plus we have *path*name to combine
with). It's minor, I can be convinced if others feel strongly about
this.


>
> I have a note that you suggested to introduce this for uprobe
> multi link as well, so I'll do something similar
>
> lgtm
>
> Acked-by: Jiri Olsa <jolsa@kernel.org>
>
> jirka
>
> >       };
> >
> >       struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 9948b542a470..13bb54f6bd17 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c

[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-16 18:02     ` Andrii Nakryiko
@ 2023-05-17  9:11       ` Christian Brauner
  2023-05-17 12:05         ` Christoph Hellwig
  2023-05-18 21:56         ` Andrii Nakryiko
  0 siblings, 2 replies; 30+ messages in thread
From: Christian Brauner @ 2023-05-17  9:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, cyphar, lennart,
	linux-fsdevel, Christoph Hellwig, Al Viro

On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote:
> On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > > forces users to specify pinning location as a string-based absolute or
> > > relative (to current working directory) path. This has various
> > > implications related to security (e.g., symlink-based attacks), forces
> > > BPF FS to be exposed in the file system, which can cause races with
> > > other applications.
> > >
> > > One of the feedbacks we got from folks working with containers heavily
> > > was that inability to use purely FD-based location specification was an
> > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > > commands. This patch closes this oversight, adding path_fd field to
> >
> > Cool!
> >
> > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > > *at() syscalls for dirfd + pathname combinations.
> > >
> > > This now allows interesting possibilities like working with detached BPF
> > > FS mount (e.g., to perform multiple pinnings without running a risk of
> > > someone interfering with them), and generally making pinning/getting
> > > more secure and not prone to any races and/or security attacks.
> > >
> > > This is demonstrated by a selftest added in subsequent patch that takes
> > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > > it, all while never exposing this private instance of BPF FS to outside
> > > worlds.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h            |  4 ++--
> > >  include/uapi/linux/bpf.h       |  5 +++++
> > >  kernel/bpf/inode.c             | 16 ++++++++--------
> > >  kernel/bpf/syscall.c           |  8 +++++---
> > >  tools/include/uapi/linux/bpf.h |  5 +++++
> > >  5 files changed, 25 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 36e4b2d8cca2..f58895830ada 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> > >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> > >
> > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> > >
> > >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> > >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 1bb11a6ee667..db2870a52ce0 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -1420,6 +1420,11 @@ union bpf_attr {
> > >               __aligned_u64   pathname;
> > >               __u32           bpf_fd;
> > >               __u32           file_flags;
> > > +             /* same as dirfd in openat() syscall; see openat(2)
> > > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > > +              * zero path_fd implies AT_FDCWD behavior
> > > +              */
> > > +             __u32           path_fd;
> > >       };
> >
> > So 0 is a valid file descriptor and can trivially be created and made to
> > refer to any file. Is this a conscious decision to have a zero value
> > imply AT_FDCWD and have you done this somewhere else in bpf already?
> > Because that's contrary to how any file descriptor based apis work.
> >
> > How this is usually solved for extensible structs is to have a flag
> > field that raises a flag to indicate that the fd fiel is set and thus 0
> > can be used as a valid value.
> >
> > The way you're doing it right now is very counterintuitive to userspace
> > and pretty much guaranteed to cause subtle bugs.
> 
> Yes, it's a very bpf()-specific convention we've settled on a while
> ago. It allows a cleaner and simpler backwards compatibility story
> without having to introduce new flags every single time. Most of BPF
> UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
> it to bpf() syscall. Most of the time users will be blissfully unaware
> because libbpf and other BPF libraries are checking for fd == 0 and
> dup()'ing them to avoid ever returning FD 0 to the user.
> 
> tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
> is a bpf() peculiarity, yes.

Adding fsdevel so we're aware of this quirk.

So I'm not sure whether this was ever discussed on fsdevel when you took
the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
invalid value.

If it was discussed then great but if not then I would like to make it
very clear that if in the future you decide to introduce custom
semantics for vfs provided infrastructure - especially when exposed to
userspace - that you please Cc us.

You often make it very clear on the list that you don't like it when
anything that touches bpf code doesn't end up getting sent to the bpf
mailing list. It is exactly the same for us.

This is not a rant I'm really just trying to make sure that we agree on
common ground when it comes to touching each others code or semantic
assumptions.

I personally find this extremely weird to treat fd 0 as anything other
than a random fd number as it goes against any userspace assumptions and
drastically deviates from basically every file descriptor interface we
have. I mean, you're not just saying fd 0 is invalid you're even saying
it means AT_FDCWD.

For every other interface, including those that pass fds in structs
whose extensibility is premised on unknown fields being set to zero,
have ways to make fd 0 work just fine. You could've done that to without
inventing custom fd semantics.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17  9:11       ` fd == 0 means AT_FDCWD " Christian Brauner
@ 2023-05-17 12:05         ` Christoph Hellwig
  2023-05-17 16:17           ` Alexei Starovoitov
  2023-05-18 21:56         ` Andrii Nakryiko
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2023-05-17 12:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, ast, daniel, martin.lau,
	cyphar, lennart, linux-fsdevel, Christoph Hellwig, Al Viro

On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> Adding fsdevel so we're aware of this quirk.
> 
> So I'm not sure whether this was ever discussed on fsdevel when you took
> the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> invalid value.

I've never heard of this before, and I think it is compltely
unacceptable. 0 ist just a normal FD, although one that happens to
have specific meaning in userspace as stdin.

> 
> If it was discussed then great but if not then I would like to make it
> very clear that if in the future you decide to introduce custom
> semantics for vfs provided infrastructure - especially when exposed to
> userspace - that you please Cc us.

I don't think it's just the future.  We really need to undo this ASAP.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17 12:05         ` Christoph Hellwig
@ 2023-05-17 16:17           ` Alexei Starovoitov
  2023-05-17 21:48             ` Alexei Starovoitov
  2023-05-18  8:38             ` Christian Brauner
  0 siblings, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2023-05-17 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > Adding fsdevel so we're aware of this quirk.
> >
> > So I'm not sure whether this was ever discussed on fsdevel when you took
> > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > invalid value.
>
> I've never heard of this before, and I think it is compltely
> unacceptable. 0 ist just a normal FD, although one that happens to
> have specific meaning in userspace as stdin.
>
> >
> > If it was discussed then great but if not then I would like to make it
> > very clear that if in the future you decide to introduce custom
> > semantics for vfs provided infrastructure - especially when exposed to
> > userspace - that you please Cc us.
>
> I don't think it's just the future.  We really need to undo this ASAP.

Christian is not correct in stating that treatment of fd==0 as invalid
bpf object applies to vfs fd-s.
The path_fd addition in this patch is really the very first one of this kind.
At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
are invalid and this is not going to change. It's been uapi for a long time.

More so fd-s 0,1,2 are not "normal FDs".
Unix has made two mistakes:
1. fd==0 being valid fd
2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.

The first mistake makes it hard to pass FD without an extra flag.
The 2nd mistake is just awful.
We've seen plenty of severe datacenter wide issues because some
library or piece of software assumes stdin/out/err.
Various services have been hurt badly by this "convention".
In libbpf we added ensure_good_fd() to make sure none of bpf objects
(progs, maps, etc) are ever seen with fd=0,1,2.
Other pieces of datacenter software enforce the same.

In other words fds=0,1,2 are taken. They must not be anything but
stdin/out/err or gutted to /dev/null.
Otherwise expect horrible bugs and multi day debugging.

Because of that, several years ago, we've decided to fix unix mistake #1
when it comes to bpf objects and started reserving fd=0 as invalid.
This patch is proposing to do the same for path_fd (normal vfs fd) when
it is passed to bpf syscall. I think it's a good trade-off and fits
the rest of bpf uapi.

Everyone who's hiding behind statements: but POSIX is a standard..
or this is how we've been doing things... are ignoring the practical
situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17 16:17           ` Alexei Starovoitov
@ 2023-05-17 21:48             ` Alexei Starovoitov
  2023-05-18  8:38             ` Christian Brauner
  1 sibling, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2023-05-17 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Wed, May 17, 2023 at 9:17 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > Adding fsdevel so we're aware of this quirk.
> > >
> > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > invalid value.
> >
> > I've never heard of this before, and I think it is compltely
> > unacceptable. 0 ist just a normal FD, although one that happens to
> > have specific meaning in userspace as stdin.
> >
> > >
> > > If it was discussed then great but if not then I would like to make it
> > > very clear that if in the future you decide to introduce custom
> > > semantics for vfs provided infrastructure - especially when exposed to
> > > userspace - that you please Cc us.
> >
> > I don't think it's just the future.  We really need to undo this ASAP.
>
> Christian is not correct in stating that treatment of fd==0 as invalid
> bpf object applies to vfs fd-s.
> The path_fd addition in this patch is really the very first one of this kind.
> At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> are invalid and this is not going to change. It's been uapi for a long time.
>
> More so fd-s 0,1,2 are not "normal FDs".
> Unix has made two mistakes:
> 1. fd==0 being valid fd
> 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
>
> The first mistake makes it hard to pass FD without an extra flag.
> The 2nd mistake is just awful.
> We've seen plenty of severe datacenter wide issues because some
> library or piece of software assumes stdin/out/err.
> Various services have been hurt badly by this "convention".
> In libbpf we added ensure_good_fd() to make sure none of bpf objects
> (progs, maps, etc) are ever seen with fd=0,1,2.
> Other pieces of datacenter software enforce the same.
>
> In other words fds=0,1,2 are taken. They must not be anything but
> stdin/out/err or gutted to /dev/null.
> Otherwise expect horrible bugs and multi day debugging.
>
> Because of that, several years ago, we've decided to fix unix mistake #1
> when it comes to bpf objects and started reserving fd=0 as invalid.
> This patch is proposing to do the same for path_fd (normal vfs fd) when
> it is passed to bpf syscall. I think it's a good trade-off and fits
> the rest of bpf uapi.
>
> Everyone who's hiding behind statements: but POSIX is a standard..
> or this is how we've been doing things... are ignoring the practical
> situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.

Summarizing an offlist discussion with Christian and Andrii.

The key issue is that fd == 0 must not mean AT_FDCWD and that's clear.
We'll respin with an extra flag to indicate that path_fd should be used.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17 16:17           ` Alexei Starovoitov
  2023-05-17 21:48             ` Alexei Starovoitov
@ 2023-05-18  8:38             ` Christian Brauner
  2023-05-18 14:30               ` Theodore Ts'o
  2023-05-18 16:25               ` Alexei Starovoitov
  1 sibling, 2 replies; 30+ messages in thread
From: Christian Brauner @ 2023-05-18  8:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > Adding fsdevel so we're aware of this quirk.
> > >
> > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > invalid value.
> >
> > I've never heard of this before, and I think it is compltely
> > unacceptable. 0 ist just a normal FD, although one that happens to
> > have specific meaning in userspace as stdin.
> >
> > >
> > > If it was discussed then great but if not then I would like to make it
> > > very clear that if in the future you decide to introduce custom
> > > semantics for vfs provided infrastructure - especially when exposed to
> > > userspace - that you please Cc us.
> >
> > I don't think it's just the future.  We really need to undo this ASAP.
> 
> Christian is not correct in stating that treatment of fd==0 as invalid
> bpf object applies to vfs fd-s.
> The path_fd addition in this patch is really the very first one of this kind.
> At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> are invalid and this is not going to change. It's been uapi for a long time.
> 
> More so fd-s 0,1,2 are not "normal FDs".
> Unix has made two mistakes:
> 1. fd==0 being valid fd
> 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> 
> The first mistake makes it hard to pass FD without an extra flag.
> The 2nd mistake is just awful.
> We've seen plenty of severe datacenter wide issues because some
> library or piece of software assumes stdin/out/err.
> Various services have been hurt badly by this "convention".
> In libbpf we added ensure_good_fd() to make sure none of bpf objects
> (progs, maps, etc) are ever seen with fd=0,1,2.
> Other pieces of datacenter software enforce the same.
> 
> In other words fds=0,1,2 are taken. They must not be anything but
> stdin/out/err or gutted to /dev/null.
> Otherwise expect horrible bugs and multi day debugging.
> 
> Because of that, several years ago, we've decided to fix unix mistake #1
> when it comes to bpf objects and started reserving fd=0 as invalid.
> This patch is proposing to do the same for path_fd (normal vfs fd) when

It isn't as you now realized but I'm glad we cleared that up off-list.

> it is passed to bpf syscall. I think it's a good trade-off and fits
> the rest of bpf uapi.
> 
> Everyone who's hiding behind statements: but POSIX is a standard..
> or this is how we've been doing things... are ignoring the practical
> situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.

(Prefix: Imagine me calmly writing this and in a relaxed tone.)

Just to clarify. I do think that deciding that 0 is an invalid file
descriptor number is weird and I wish you'd have discussed this with us
before you took that decision. You've seen the reaction that other
low-level userspace people you talked to had to these news...

I'm not sure what to make of the POSIX excursion. I think that it is a
complete sideshow to the issue here in a way. But fwiw...

We don't follow arbitrary conventions such as 0, 1, and 2 because we all
have sworn allegiance to The Secret Order of the POSIX but because the
alternative is that one subsystem finds it neat to use fd 0 to refer to
AT_FDCWD and another one to AT_MY_CUSTOM_FD0_MEANING. Which is exactly
what would've happened if this patch would have made it unnoticed.

This doesn't scale and our interfaces aren't designed around
Shakespeare's dictum "What's in a name?".

This will quickly devolve into a situation similar to letting a step on
a staircase be off by a few millimeters. See those users falling.

I'm glad we cleared this up. My main issue is indeed that fd 0 now isn't
just forbidden it would be given an entirely different meaning which is
not acceptable from the vfs perspective.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18  8:38             ` Christian Brauner
@ 2023-05-18 14:30               ` Theodore Ts'o
  2023-05-18 16:25               ` Alexei Starovoitov
  1 sibling, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2023-05-18 14:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

The other thing to note is that while the *convention* may be that 0,
1, and 2 are for stdin, stdout, and stderr, this is a *userspae*
convention.  After all, system daemons like getty, gnome-terminal,
et. al, need to be able to open file descriptors for stdin, stdout,
and stderr, and it would be.....highly undesirable for the kernel to
have to special case those processes from being able to open those
file descriptors.  So in the eyes of Kernel to Userspace API's we
should not specially privilege the meaning of file descriptors 0, 1,
and 2.

Besides, we have a perfectly good way of expressing "not a FD" and
that is negative values!  File descriptors, after all, are signed
integers.

Finally, by having some kernel subsystem have a different meaning for
fd 0 means that there are potential security vulernabilities.  It may
be the case that userspace *SHOULD* not use fd 0 for anythingn other
than stdin, and that should be something which should be handed to it
by its parent process.

However, consider what might happen if a malicious program where to
exec a process, perhaps a setuid process, with fd 0 closed.  Now the
first file opened by that program will be assigned fd 0, and if that
gets passed to BPF, something surprising and wonderous --- but
hopefully not something that can be leveraged to be a high severity
security vulnerability --- may very well happen.

So if there is anyway to that we can change the BPF API's to change to
use negative values for special case meanings, we should do it.
Certainly for any new API's, and even for old API's, Linus has always
said that there are some special case times when we can break the
userspace ABI --- and security vulnerabilites are certainly one of
them.

Best regards,

					- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18  8:38             ` Christian Brauner
  2023-05-18 14:30               ` Theodore Ts'o
@ 2023-05-18 16:25               ` Alexei Starovoitov
  2023-05-18 16:33                 ` Matthew Wilcox
  2023-05-18 17:20                 ` Christian Brauner
  1 sibling, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2023-05-18 16:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > > Adding fsdevel so we're aware of this quirk.
> > > >
> > > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > > invalid value.
> > >
> > > I've never heard of this before, and I think it is compltely
> > > unacceptable. 0 ist just a normal FD, although one that happens to
> > > have specific meaning in userspace as stdin.
> > >
> > > >
> > > > If it was discussed then great but if not then I would like to make it
> > > > very clear that if in the future you decide to introduce custom
> > > > semantics for vfs provided infrastructure - especially when exposed to
> > > > userspace - that you please Cc us.
> > >
> > > I don't think it's just the future.  We really need to undo this ASAP.
> > 
> > Christian is not correct in stating that treatment of fd==0 as invalid
> > bpf object applies to vfs fd-s.
> > The path_fd addition in this patch is really the very first one of this kind.
> > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> > are invalid and this is not going to change. It's been uapi for a long time.
> > 
> > More so fd-s 0,1,2 are not "normal FDs".
> > Unix has made two mistakes:
> > 1. fd==0 being valid fd
> > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> > 
> > The first mistake makes it hard to pass FD without an extra flag.
> > The 2nd mistake is just awful.
> > We've seen plenty of severe datacenter wide issues because some
> > library or piece of software assumes stdin/out/err.
> > Various services have been hurt badly by this "convention".
> > In libbpf we added ensure_good_fd() to make sure none of bpf objects
> > (progs, maps, etc) are ever seen with fd=0,1,2.
> > Other pieces of datacenter software enforce the same.
> > 
> > In other words fds=0,1,2 are taken. They must not be anything but
> > stdin/out/err or gutted to /dev/null.
> > Otherwise expect horrible bugs and multi day debugging.
> > 
> > Because of that, several years ago, we've decided to fix unix mistake #1
> > when it comes to bpf objects and started reserving fd=0 as invalid.
> > This patch is proposing to do the same for path_fd (normal vfs fd) when
> 
> It isn't as you now realized but I'm glad we cleared that up off-list.
> 
> > it is passed to bpf syscall. I think it's a good trade-off and fits
> > the rest of bpf uapi.
> > 
> > Everyone who's hiding behind statements: but POSIX is a standard..
> > or this is how we've been doing things... are ignoring the practical
> > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
> 
> (Prefix: Imagine me calmly writing this and in a relaxed tone.)
> 
> Just to clarify. I do think that deciding that 0 is an invalid file

We're still talking past each other.
0 is an invalid bpf object. Not file.
There is a difference.
The kernel is breaking user space by returning non-file FDs in 0,1,2.
Especially as fd = 1 and 2.
ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
are not the reason for user app brekage.
I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
(under new sysctl, for example) will prevent user space breakage.

And to answer Ted's question..
Yes. It's a security issue, but it's the other way around.
The kernel returning non vfs file FD in [0,1,2] range is a security issue.
I'm proposing to fix it with new sysctl or boot flag.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 16:25               ` Alexei Starovoitov
@ 2023-05-18 16:33                 ` Matthew Wilcox
  2023-05-18 17:22                   ` Christian Brauner
  2023-05-18 17:20                 ` Christian Brauner
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2023-05-18 16:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> We're still talking past each other.
> 0 is an invalid bpf object. Not file.
> There is a difference.
> The kernel is breaking user space by returning non-file FDs in 0,1,2.
> Especially as fd = 1 and 2.
> ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
> are not the reason for user app brekage.
> I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
> (under new sysctl, for example) will prevent user space breakage.

Wait, why are socket FDs special?  I shouldn't be able to have anything
but chardev fds, pipes and regular files as fd 0,1,2?  I agree that having
directory fds and blockdev fds as fd 0,1,2 are confusing and pointless,
but I see the value in having a TCP socket as stdin/stdout/stderr.

If a fd shouldn't be used for stdio, having an ioctl to enable it
and read/write return errors until/unless it's enabled makes sense.
But now we have to label each fd as safe/not-safe for stdio, which we
can as easily do by setting up our fops appropriately.  So I'm not sure
what you're trying to accomplish here.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 16:25               ` Alexei Starovoitov
  2023-05-18 16:33                 ` Matthew Wilcox
@ 2023-05-18 17:20                 ` Christian Brauner
  2023-05-18 17:33                   ` Linus Torvalds
  2023-05-18 18:26                   ` Alexei Starovoitov
  1 sibling, 2 replies; 30+ messages in thread
From: Christian Brauner @ 2023-05-18 17:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro,
	Linus Torvalds

On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote:
> > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > > > Adding fsdevel so we're aware of this quirk.
> > > > >
> > > > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > > > invalid value.
> > > >
> > > > I've never heard of this before, and I think it is compltely
> > > > unacceptable. 0 ist just a normal FD, although one that happens to
> > > > have specific meaning in userspace as stdin.
> > > >
> > > > >
> > > > > If it was discussed then great but if not then I would like to make it
> > > > > very clear that if in the future you decide to introduce custom
> > > > > semantics for vfs provided infrastructure - especially when exposed to
> > > > > userspace - that you please Cc us.
> > > >
> > > > I don't think it's just the future.  We really need to undo this ASAP.
> > > 
> > > Christian is not correct in stating that treatment of fd==0 as invalid
> > > bpf object applies to vfs fd-s.
> > > The path_fd addition in this patch is really the very first one of this kind.
> > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> > > are invalid and this is not going to change. It's been uapi for a long time.
> > > 
> > > More so fd-s 0,1,2 are not "normal FDs".
> > > Unix has made two mistakes:
> > > 1. fd==0 being valid fd
> > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> > > 
> > > The first mistake makes it hard to pass FD without an extra flag.
> > > The 2nd mistake is just awful.
> > > We've seen plenty of severe datacenter wide issues because some
> > > library or piece of software assumes stdin/out/err.
> > > Various services have been hurt badly by this "convention".
> > > In libbpf we added ensure_good_fd() to make sure none of bpf objects
> > > (progs, maps, etc) are ever seen with fd=0,1,2.
> > > Other pieces of datacenter software enforce the same.
> > > 
> > > In other words fds=0,1,2 are taken. They must not be anything but
> > > stdin/out/err or gutted to /dev/null.
> > > Otherwise expect horrible bugs and multi day debugging.
> > > 
> > > Because of that, several years ago, we've decided to fix unix mistake #1
> > > when it comes to bpf objects and started reserving fd=0 as invalid.
> > > This patch is proposing to do the same for path_fd (normal vfs fd) when
> > 
> > It isn't as you now realized but I'm glad we cleared that up off-list.
> > 
> > > it is passed to bpf syscall. I think it's a good trade-off and fits
> > > the rest of bpf uapi.
> > > 
> > > Everyone who's hiding behind statements: but POSIX is a standard..
> > > or this is how we've been doing things... are ignoring the practical
> > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
> > 
> > (Prefix: Imagine me calmly writing this and in a relaxed tone.)
> > 
> > Just to clarify. I do think that deciding that 0 is an invalid file

descriptor

> 
> We're still talking past each other.
> 0 is an invalid bpf object. Not file.
> There is a difference.

You cut of a word above. I can't follow your argument.
File descriptor numbers are free to refer to whatever we want.
They don't care about what type of object they refer to and they
better not.

> The kernel is breaking user space by returning non-file FDs in 0,1,2.
> Especially as fd = 1 and 2.

This has a strong aura of a strawman argument. ;)

> ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
> are not the reason for user app brekage.
> I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
> (under new sysctl, for example) will prevent user space breakage.
> 
> And to answer Ted's question..
> Yes. It's a security issue, but it's the other way around.
> The kernel returning non vfs file FD in [0,1,2] range is a security issue.
> I'm proposing to fix it with new sysctl or boot flag.

That's just completely weird. We can see what Linus thinks but I think
that's a somewhat outlandish proposal that I wouldn't support.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 16:33                 ` Matthew Wilcox
@ 2023-05-18 17:22                   ` Christian Brauner
  0 siblings, 0 replies; 30+ messages in thread
From: Christian Brauner @ 2023-05-18 17:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 05:33:32PM +0100, Matthew Wilcox wrote:
> On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> > We're still talking past each other.
> > 0 is an invalid bpf object. Not file.
> > There is a difference.
> > The kernel is breaking user space by returning non-file FDs in 0,1,2.
> > Especially as fd = 1 and 2.
> > ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
> > are not the reason for user app brekage.
> > I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
> > (under new sysctl, for example) will prevent user space breakage.
> 
> Wait, why are socket FDs special?  I shouldn't be able to have anything
> but chardev fds, pipes and regular files as fd 0,1,2?  I agree that having
> directory fds and blockdev fds as fd 0,1,2 are confusing and pointless,
> but I see the value in having a TCP socket as stdin/stdout/stderr.
> 
> If a fd shouldn't be used for stdio, having an ioctl to enable it
> and read/write return errors until/unless it's enabled makes sense.
> But now we have to label each fd as safe/not-safe for stdio, which we
> can as easily do by setting up our fops appropriately.  So I'm not sure
> what you're trying to accomplish here.

Yeah, I don't think we want weird ioctl()s to restrict file descriptor
ranges in any way. This all sounds pretty weird to me and I don't even
want to imagine the semantical oddness of suddenly restricting the
kernels ability to return some fds.

Honestly, most of the time sysctls such as this are the equivalent of
throwing the hands up in the air and leaving the room.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 17:20                 ` Christian Brauner
@ 2023-05-18 17:33                   ` Linus Torvalds
  2023-05-18 18:21                     ` Christian Brauner
  2023-05-18 18:26                   ` Alexei Starovoitov
  1 sibling, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2023-05-18 17:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 10:20 AM Christian Brauner <brauner@kernel.org> wrote:
>
> That's just completely weird. We can see what Linus thinks but I think
> that's a somewhat outlandish proposal that I wouldn't support.

I have no idea of the background here.

But fd 0 is in absolutely no way special. Anything that thinks that a
zero fd is invalid or in any way different from (say) fd 5 is
completely and utterly buggy by definition.

Now, fd 0 can obviously be invalid in the sense that it may not be
open, exactly the same way fd 100 may not be open. So in *that* sense
we can have an invalid fd 0, and system calls might return EBADF for
trying to access it if somebody has closed it.

If bpf thinks that 0 is not a file descriptor, then bpf is simply
wrong. No ifs, buts or maybes about it. It's like saying "1 is not a
number". It's nonsensical garbage.

But maybe I misunderstand the issue.

              Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 17:33                   ` Linus Torvalds
@ 2023-05-18 18:21                     ` Christian Brauner
  0 siblings, 0 replies; 30+ messages in thread
From: Christian Brauner @ 2023-05-18 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 10:33:30AM -0700, Linus Torvalds wrote:
> On Thu, May 18, 2023 at 10:20 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > That's just completely weird. We can see what Linus thinks but I think
> > that's a somewhat outlandish proposal that I wouldn't support.
> 
> I have no idea of the background here.
> 
> But fd 0 is in absolutely no way special. Anything that thinks that a
> zero fd is invalid or in any way different from (say) fd 5 is
> completely and utterly buggy by definition.
> 
> Now, fd 0 can obviously be invalid in the sense that it may not be
> open, exactly the same way fd 100 may not be open. So in *that* sense
> we can have an invalid fd 0, and system calls might return EBADF for
> trying to access it if somebody has closed it.
> 
> If bpf thinks that 0 is not a file descriptor, then bpf is simply
> wrong. No ifs, buts or maybes about it. It's like saying "1 is not a
> number". It's nonsensical garbage.
> 
> But maybe I misunderstand the issue.

TL;DR bpf has considerd fd 0 to be an invalid fd value for a long time.
We can't exactly follow the motiviation:
https://lore.kernel.org/bpf/CAADnVQLitLUc1SozzKjBgq6HGTchE1cO+e4j8eDgtE0zFn5VEw@mail.gmail.com
and it's probably to late to change this.

Yes, passing fds in extensible structs is inconvenient because you need
to pass an indicator alongside an fd to indicate that the zero value is
anactual file descriptor. Because unknown fields need to be initialzied
to zero so that old kernels can ignore larger structs. So what we
usually have to do:

mount_setattr()

attr.set |= MOUNT_ATTR_IDMAP;
attr.userns_fd = 0;

We just found out about fd 0 being invalid because a new bpf feature
proposal now tried to reuse fd 0 to mean AT_FDCWD when passed through a
new bpf feature; which I said isn't acceptable.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 17:20                 ` Christian Brauner
  2023-05-18 17:33                   ` Linus Torvalds
@ 2023-05-18 18:26                   ` Alexei Starovoitov
  2023-05-18 18:57                     ` Linus Torvalds
  1 sibling, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2023-05-18 18:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro,
	Linus Torvalds

On Thu, May 18, 2023 at 07:20:29PM +0200, Christian Brauner wrote:
> On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> > On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote:
> > > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> > > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > >
> > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > > > > Adding fsdevel so we're aware of this quirk.
> > > > > >
> > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > > > > invalid value.
> > > > >
> > > > > I've never heard of this before, and I think it is compltely
> > > > > unacceptable. 0 ist just a normal FD, although one that happens to
> > > > > have specific meaning in userspace as stdin.
> > > > >
> > > > > >
> > > > > > If it was discussed then great but if not then I would like to make it
> > > > > > very clear that if in the future you decide to introduce custom
> > > > > > semantics for vfs provided infrastructure - especially when exposed to
> > > > > > userspace - that you please Cc us.
> > > > >
> > > > > I don't think it's just the future.  We really need to undo this ASAP.
> > > > 
> > > > Christian is not correct in stating that treatment of fd==0 as invalid
> > > > bpf object applies to vfs fd-s.
> > > > The path_fd addition in this patch is really the very first one of this kind.
> > > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> > > > are invalid and this is not going to change. It's been uapi for a long time.
> > > > 
> > > > More so fd-s 0,1,2 are not "normal FDs".
> > > > Unix has made two mistakes:
> > > > 1. fd==0 being valid fd
> > > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> > > > 
> > > > The first mistake makes it hard to pass FD without an extra flag.
> > > > The 2nd mistake is just awful.
> > > > We've seen plenty of severe datacenter wide issues because some
> > > > library or piece of software assumes stdin/out/err.
> > > > Various services have been hurt badly by this "convention".
> > > > In libbpf we added ensure_good_fd() to make sure none of bpf objects
> > > > (progs, maps, etc) are ever seen with fd=0,1,2.
> > > > Other pieces of datacenter software enforce the same.
> > > > 
> > > > In other words fds=0,1,2 are taken. They must not be anything but
> > > > stdin/out/err or gutted to /dev/null.
> > > > Otherwise expect horrible bugs and multi day debugging.
> > > > 
> > > > Because of that, several years ago, we've decided to fix unix mistake #1
> > > > when it comes to bpf objects and started reserving fd=0 as invalid.
> > > > This patch is proposing to do the same for path_fd (normal vfs fd) when
> > > 
> > > It isn't as you now realized but I'm glad we cleared that up off-list.
> > > 
> > > > it is passed to bpf syscall. I think it's a good trade-off and fits
> > > > the rest of bpf uapi.
> > > > 
> > > > Everyone who's hiding behind statements: but POSIX is a standard..
> > > > or this is how we've been doing things... are ignoring the practical
> > > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
> > > 
> > > (Prefix: Imagine me calmly writing this and in a relaxed tone.)
> > > 
> > > Just to clarify. I do think that deciding that 0 is an invalid file
> 
> descriptor
> 
> > 
> > We're still talking past each other.
> > 0 is an invalid bpf object. Not file.
> > There is a difference.
> 
> You cut of a word above. I can't follow your argument.
> File descriptor numbers are free to refer to whatever we want.
> They don't care about what type of object they refer to and they
> better not.

User space cares what they refer to.
Unix made integer FD into broken abstraction.
regular files need one set of syscalls to work with such 'integer FDs'.
sockets need another set of syscalls.
All other anon-inodes need another set of syscalls.
While user space sees an integer without type and that a root cause of the bugs.
And that's particularly bad for integers 0,1,2 where user space assumes that
regular files are behind them.

Imagine C++ project where base class is called 'FD' while children
implement their own access methods that don't overlap with each other.
Now one application passes this 'FD' class to another.
That other app can only do trial and error to figure out what it got.
Any user space developer would reject such class hierarchy design,
but kernel folks are so proud of this broken abstraction.
FDs are not special.. POSIX is the standard... Right.
That's why user space keeps their workarounds, because kernel folks
are not empathic to user space needs.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 18:26                   ` Alexei Starovoitov
@ 2023-05-18 18:57                     ` Linus Torvalds
  2023-05-19  4:44                       ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2023-05-18 18:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 2091 bytes --]

[ Out walking for health - sorry for html crud ]

On Thu, May 18, 2023, 11:26 Alexei Starovoitov <alexei.starovoitov@gmail.com>
wrote:

>
> User space cares what they refer to.
> Unix made integer FD into broken abstraction.
>

Stop trying to blame anybody else.

The unix people came up with a good abstraction, and you've screwed up.
That is nobody's fault but your own, and you should just admit it rather
than trying to double down on being wrong.

Integer file descriptions are fine. Negative ones are invalid, and people
who wanted to use "root" and "cwd" have already done so using them for
decades. Look up openat().

You just picked completely the wrong invalid number.

That's on you. Nobody else.

regular files need one set of syscalls to work with such 'integer FDs'.
> sockets need another set of syscalls.
>

None of that has anything to do with file descriptors, which are a unified
format.

The reason sockets have extra system calls is that you're can just do extra
things with them. That's no different from ioctl's, and they could have
been done that way, but the extra system calls are cleaner.

While user space sees an integer without type and that a root cause of the
> bugs.
> And that's particularly bad for integers 0,1,2 where user space assumes
> that
> regular files are behind them.
>

Stop making yourself look any more stupid than necessary.

The 0/1/2 file descriptors are not at all special. They are a shell
pipeline default, nothing more. They are not the argument your think they
are, and you should stop trying to make them an argument.

It only makes you look bad.

Imagine C++ project where base class is called 'FD' while children
>

Ok, now you've just gone totally off the rails.

Stop this silliness.

You did something bad from ignorance. That's fine. But making excuses for
it turns ignorance to willful stupidity. Stop digging yourself deeper in
that hole.

The system call interface is not some C++ object interface. And we should
all be grateful that it isn't.

        Linus

[-- Attachment #2: Type: text/html, Size: 3875 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17  9:11       ` fd == 0 means AT_FDCWD " Christian Brauner
  2023-05-17 12:05         ` Christoph Hellwig
@ 2023-05-18 21:56         ` Andrii Nakryiko
  1 sibling, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-18 21:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, cyphar, lennart,
	linux-fsdevel, Christoph Hellwig, Al Viro

On Wed, May 17, 2023 at 2:11 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote:
> > On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > > > forces users to specify pinning location as a string-based absolute or
> > > > relative (to current working directory) path. This has various
> > > > implications related to security (e.g., symlink-based attacks), forces
> > > > BPF FS to be exposed in the file system, which can cause races with
> > > > other applications.
> > > >
> > > > One of the feedbacks we got from folks working with containers heavily
> > > > was that inability to use purely FD-based location specification was an
> > > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > > > commands. This patch closes this oversight, adding path_fd field to
> > >
> > > Cool!
> > >
> > > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > > > *at() syscalls for dirfd + pathname combinations.
> > > >
> > > > This now allows interesting possibilities like working with detached BPF
> > > > FS mount (e.g., to perform multiple pinnings without running a risk of
> > > > someone interfering with them), and generally making pinning/getting
> > > > more secure and not prone to any races and/or security attacks.
> > > >
> > > > This is demonstrated by a selftest added in subsequent patch that takes
> > > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > > > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > > > it, all while never exposing this private instance of BPF FS to outside
> > > > worlds.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  include/linux/bpf.h            |  4 ++--
> > > >  include/uapi/linux/bpf.h       |  5 +++++
> > > >  kernel/bpf/inode.c             | 16 ++++++++--------
> > > >  kernel/bpf/syscall.c           |  8 +++++---
> > > >  tools/include/uapi/linux/bpf.h |  5 +++++
> > > >  5 files changed, 25 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 36e4b2d8cca2..f58895830ada 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> > > >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > > >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> > > >
> > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> > > >
> > > >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> > > >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 1bb11a6ee667..db2870a52ce0 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -1420,6 +1420,11 @@ union bpf_attr {
> > > >               __aligned_u64   pathname;
> > > >               __u32           bpf_fd;
> > > >               __u32           file_flags;
> > > > +             /* same as dirfd in openat() syscall; see openat(2)
> > > > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > > > +              * zero path_fd implies AT_FDCWD behavior
> > > > +              */
> > > > +             __u32           path_fd;
> > > >       };
> > >
> > > So 0 is a valid file descriptor and can trivially be created and made to
> > > refer to any file. Is this a conscious decision to have a zero value
> > > imply AT_FDCWD and have you done this somewhere else in bpf already?
> > > Because that's contrary to how any file descriptor based apis work.
> > >
> > > How this is usually solved for extensible structs is to have a flag
> > > field that raises a flag to indicate that the fd fiel is set and thus 0
> > > can be used as a valid value.
> > >
> > > The way you're doing it right now is very counterintuitive to userspace
> > > and pretty much guaranteed to cause subtle bugs.
> >
> > Yes, it's a very bpf()-specific convention we've settled on a while
> > ago. It allows a cleaner and simpler backwards compatibility story
> > without having to introduce new flags every single time. Most of BPF
> > UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
> > it to bpf() syscall. Most of the time users will be blissfully unaware
> > because libbpf and other BPF libraries are checking for fd == 0 and
> > dup()'ing them to avoid ever returning FD 0 to the user.
> >
> > tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
> > is a bpf() peculiarity, yes.
>
> Adding fsdevel so we're aware of this quirk.
>
> So I'm not sure whether this was ever discussed on fsdevel when you took
> the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> invalid value.
>
> If it was discussed then great but if not then I would like to make it
> very clear that if in the future you decide to introduce custom
> semantics for vfs provided infrastructure - especially when exposed to
> userspace - that you please Cc us.

Yep, I'll remember to cc linux-fsdevel@vger.kernel.org for future
patches touching on vfs-related concepts, no problem. I wasn't trying
to sneak it in or anything, it just never occurred to me, sorry about
that.

>
> You often make it very clear on the list that you don't like it when
> anything that touches bpf code doesn't end up getting sent to the bpf
> mailing list. It is exactly the same for us.

That's a fair request, ack.

>
> This is not a rant I'm really just trying to make sure that we agree on
> common ground when it comes to touching each others code or semantic
> assumptions.
>
> I personally find this extremely weird to treat fd 0 as anything other
> than a random fd number as it goes against any userspace assumptions and
> drastically deviates from basically every file descriptor interface we
> have. I mean, you're not just saying fd 0 is invalid you're even saying
> it means AT_FDCWD.

Agreed, I can see how this could have undesirable (not just
surprising) implications if, say, open(O_PATH) returned fd=0. I just
sent a v2 with a new flag that needs to be specified to be able to use
path FD (and falling back to backwards-compatible AT_FDCWD behavior
otherwise).

>
> For every other interface, including those that pass fds in structs
> whose extensibility is premised on unknown fields being set to zero,
> have ways to make fd 0 work just fine. You could've done that to without
> inventing custom fd semantics.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 18:57                     ` Linus Torvalds
@ 2023-05-19  4:44                       ` Alexei Starovoitov
  2023-05-19  8:13                         ` Christian Brauner
                                           ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2023-05-19  4:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 11:57:14AM -0700, Linus Torvalds wrote:
> That is nobody's fault but your own, and you should just admit it rather
> than trying to double down on being wrong.

You're correct. I was indeed doubling down on that.
Thanks for putting it straight like that.

> The 0/1/2 file descriptors are not at all special. They are a shell
> pipeline default, nothing more. They are not the argument your think they
> are, and you should stop trying to make them an argument.

I'm well aware that any file type is allowed to be in FDs 0,1,2 and
some user space is using it that way, like old inetd:
https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
That puts the same socket into 0,1,2 before exec-ing new process.

My point that the kernel has to assist user space instead of
stubbornly sticking to POSIX and saying all FDs are equal.

Most user space developers know that care should be taken with FDs 0,1,2,
but it's still easy to make a mistake.

To explain the motivation a bit of background:
"folly" is a core C++ library for fb apps. Like libstdc++ and a lot more.
Until this commit in 2021:
https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea
the user could launch a new process with flag "folly::Subprocess::CLOSE".
It's useful for the cases when child doesn't want to inherit stdin/out/err.
There is also GLOG. google's logging library that can be configured to log to stderr.
Both libraries are well written with the high code quality.
In a big app multiple people use different pieces and may not be aware
how all pieces are put together. You can guess the rest...
Important service used a library that used another library that started a
process with folly::Subprocess::CLOSE. That process opened network connections
and used glog. It was "working" for some time, because sys_write() to a socket
is a valid command, but when TCP buffers got full synchronous innocuous logging
prevented parent from making progress.

That footgun was removed from folly in 2021, but we still see this issue from time to time.
My point that the kernel can help here.
Since folks don't like sysctl to control FD assignment how about something like this:

diff --git a/fs/file.c b/fs/file.c
index 7893ea161d77..896e79433f61 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
        return error;
 }

+__weak noinline u32 get_start_fd(void)
+{
+       return 0;
+}
+/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */
+
 int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
 {
-       return alloc_fd(0, nofile, flags);
+       return alloc_fd(get_start_fd(), nofile, flags);
 }

Then we can enforce fd >= 3 for a certain container or for a particular app.

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
@ 2023-05-19  8:13                         ` Christian Brauner
  2023-05-19 14:27                           ` Theodore Ts'o
  2023-05-19 17:51                         ` Linus Torvalds
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Christian Brauner @ 2023-05-19  8:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 09:44:33PM -0700, Alexei Starovoitov wrote:
> On Thu, May 18, 2023 at 11:57:14AM -0700, Linus Torvalds wrote:
> > That is nobody's fault but your own, and you should just admit it rather
> > than trying to double down on being wrong.
> 
> You're correct. I was indeed doubling down on that.
> Thanks for putting it straight like that.
> 
> > The 0/1/2 file descriptors are not at all special. They are a shell
> > pipeline default, nothing more. They are not the argument your think they
> > are, and you should stop trying to make them an argument.
> 
> I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> some user space is using it that way, like old inetd:
> https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> That puts the same socket into 0,1,2 before exec-ing new process.
> 
> My point that the kernel has to assist user space instead of
> stubbornly sticking to POSIX and saying all FDs are equal.
> 
> Most user space developers know that care should be taken with FDs 0,1,2,
> but it's still easy to make a mistake.
> 
> To explain the motivation a bit of background:
> "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more.
> Until this commit in 2021:
> https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea
> the user could launch a new process with flag "folly::Subprocess::CLOSE".
> It's useful for the cases when child doesn't want to inherit stdin/out/err.
> There is also GLOG. google's logging library that can be configured to log to stderr.
> Both libraries are well written with the high code quality.
> In a big app multiple people use different pieces and may not be aware
> how all pieces are put together. You can guess the rest...
> Important service used a library that used another library that started a
> process with folly::Subprocess::CLOSE. That process opened network connections
> and used glog. It was "working" for some time, because sys_write() to a socket
> is a valid command, but when TCP buffers got full synchronous innocuous logging
> prevented parent from making progress.
> 
> That footgun was removed from folly in 2021, but we still see this issue from time to time.
> My point that the kernel can help here.
> Since folks don't like sysctl to control FD assignment how about something like this:
> 
> diff --git a/fs/file.c b/fs/file.c
> index 7893ea161d77..896e79433f61 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>         return error;
>  }
> 
> +__weak noinline u32 get_start_fd(void)
> +{
> +       return 0;
> +}
> +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */
> +
>  int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
>  {
> -       return alloc_fd(0, nofile, flags);
> +       return alloc_fd(get_start_fd(), nofile, flags);

I'm sorry but I really don't think this is a good idea. We're not going
to run BPF programs in core file code. That stuff is sensitive and
complex enough as it is without having to take into account that a bpf
program can modify behavior. It's also completely unclear whether that's
safe to do as this would allow to change fd allocation across the whole
kernel.

This idea that fd 0, 1, and 2 or any other fd deserve special treatment
by the kernel needs to die; and quickly at that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  8:13                         ` Christian Brauner
@ 2023-05-19 14:27                           ` Theodore Ts'o
  0 siblings, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2023-05-19 14:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexei Starovoitov, Linus Torvalds, Christoph Hellwig,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai,
	Lennart Poettering, Linux-Fsdevel, Al Viro

On Fri, May 19, 2023 at 10:13:09AM +0200, Christian Brauner wrote:
> > I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> > some user space is using it that way, like old inetd:
> > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> > That puts the same socket into 0,1,2 before exec-ing new process.

This is a *feature*.  I've seen, and actually written shell scripts
which have been wired into /etc/inetd.conf. amd so the fact that shell
script can send stdout out to a incoming TCP connection.  It should be
possible to implement the finger protocol (RFC 1288) as a shell or
python script, *precisely* because having inetd connect a socket to
FDs 0, 1, and 2 is a good and useful thing to do.

> > My point that the kernel has to assist user space instead of
> > stubbornly sticking to POSIX and saying all FDs are equal.

This is not a matter of adhering to Posix.  It's about the fundamental
Unix philosophy.  Not everything needs to be implemented in a
complicated C++ program....


> > To explain the motivation a bit of background:
> > "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more.
> > Until this commit in 2021:
> > https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea
> > the user could launch a new process with flag "folly::Subprocess::CLOSE".
> > It's useful for the cases when child doesn't want to inherit stdin/out/err.

Yeah, sorry, that's just simple bug in the Folly library (which I
guess was well named).  Closing all of the file descriptors and then
opening 0, 1, and 2 using /dev/null is a pretty basic.  In fact,
there's a convenient daemon(3) will do this for you.  No muss, no
fuss, no dirty dishes.

> I'm sorry but I really don't think this is a good idea. We're not going
> to run BPF programs in core file code. That stuff is sensitive and
> complex enough as it is without having to take into account that a bpf
> program can modify behavior. It's also completely unclear whether that's
> safe to do as this would allow to change fd allocation across the whole
> kernel.
> 
> This idea that fd 0, 1, and 2 or any other fd deserve special treatment
> by the kernel needs to die; and quickly at that.

+1.

Making fundamentally violent changes to core Unix design and
philosophy just to accomodate incompetent user space programmers is
IMHO a really bad idea.

						- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
  2023-05-19  8:13                         ` Christian Brauner
@ 2023-05-19 17:51                         ` Linus Torvalds
  2023-05-23  7:49                         ` Lennart Poettering
  2023-08-26  4:27                         ` Al Viro
  3 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2023-05-19 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 9:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> My point that the kernel can help here.
> Since folks don't like sysctl to control FD assignment how about something like this:

No. Really.

The only thing that needs to happen is that people need to *know* that
fd's 0/1/2 are not at all special.

The thing is, it's even *traditional* to do something like

        close(0);
        close(1);
        close(2);

and fork() twice (exiting the first child after the second fork).
Usually you'd also make sure to re-set umask, and do a setsid() to
make sure you're not part of the controlling terminal any more.

Now, some people would then redirect /dev/null to those file
descriptors, but not always: file descriptors used to be a fairly
limited resource. So people would *want* to use all the file
descriptors they could for whatever server connections they
implemented. Including, very much, fd 0.

So really. There is not a way in hell we will *EVER* consider fd 0 to
be special. It isn't, it never has been, and it never shall be.

Instead, just spread the word that fd 0 isn't special.

The fact that you think that some completely broken C++ code was
"written with high quality", and you think that is an excuse for
garbage is just sad.

Those C++ libraries WERE INCREDIBLE CRAP. They were buggy garbage. And
no, they are in no way going to affect the kernel and make the kernel
do stupid and wrong things.

                    Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
  2023-05-19  8:13                         ` Christian Brauner
  2023-05-19 17:51                         ` Linus Torvalds
@ 2023-05-23  7:49                         ` Lennart Poettering
  2023-05-23 17:25                           ` Andrii Nakryiko
  2023-08-26  4:27                         ` Al Viro
  3 siblings, 1 reply; 30+ messages in thread
From: Lennart Poettering @ 2023-05-23  7:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Christian Brauner, Christoph Hellwig,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai, Linux-Fsdevel,
	Al Viro

On Do, 18.05.23 21:44, Alexei Starovoitov (alexei.starovoitov@gmail.com) wrote:

> > The 0/1/2 file descriptors are not at all special. They are a shell
> > pipeline default, nothing more. They are not the argument your think they
> > are, and you should stop trying to make them an argument.
>
> I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> some user space is using it that way, like old inetd:
> https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> That puts the same socket into 0,1,2 before exec-ing new process.
>
> My point that the kernel has to assist user space instead of
> stubbornly sticking to POSIX and saying all FDs are equal.
>
> Most user space developers know that care should be taken with FDs 0,1,2,
> but it's still easy to make a mistake.

If I look at libbpf, which supposedly gets the fd handling right I
can't find any hint it actually moves the fds it gets from open() to
an fd > 2, though?

i.e. the code that invokes open() calls in the libbpf codebase happily
just accepts an fd < 2, including fd == 0, and this is then later
passed back into the kernel in various bpf() syscall invocations,
which should refuse it, no? So what's going on there?

I did find this though:

<snip>
	new_fd = open("/", O_RDONLY | O_CLOEXEC);
	if (new_fd < 0) {
		err = -errno;
		goto err_free_new_name;
	}

	new_fd = dup3(fd, new_fd, O_CLOEXEC);
	if (new_fd < 0) {
		err = -errno;
		goto err_close_new_fd;
	}
</snip>

(This is from libbpf.c, bpf_map__reuse_fd(), i.e. https://github.com/libbpf/libbpf/blob/master/src/libbpf.c)

Not sure what's going on here, what is this about? you allocate an fd
you then immediately replace? Is this done to move the fd away from
fd=0?  but that doesn't work that way, in case fd 0 is closed when
entering this function.

Or is this about dup'ping with O_CLOEXEC?

Please be aware that F_DUPFD_CLOEXEC exists, which allows you to
easily move some fd above some treshold, *and* set O_CLOEXEC at the
same time. In the systemd codebase we call this frequently for code
that ends up being loaded in arbitrary processes (glibc NSS modules,
PAM modules), in order to ensure we get out of the fd < 3 territory
quickly.

(btw, if you do care about O_CLOEXEC – which is great – then you also
want to replace a bunch of fopen(…, "r") with fopen(…, "re") in your
codebase)

Lennart

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-23  7:49                         ` Lennart Poettering
@ 2023-05-23 17:25                           ` Andrii Nakryiko
  0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2023-05-23 17:25 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Alexei Starovoitov, Linus Torvalds, Christian Brauner,
	Christoph Hellwig, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai, Linux-Fsdevel,
	Al Viro

On Tue, May 23, 2023 at 12:49 AM Lennart Poettering
<lennart@poettering.net> wrote:
>
> On Do, 18.05.23 21:44, Alexei Starovoitov (alexei.starovoitov@gmail.com) wrote:
>
> > > The 0/1/2 file descriptors are not at all special. They are a shell
> > > pipeline default, nothing more. They are not the argument your think they
> > > are, and you should stop trying to make them an argument.
> >
> > I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> > some user space is using it that way, like old inetd:
> > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> > That puts the same socket into 0,1,2 before exec-ing new process.
> >
> > My point that the kernel has to assist user space instead of
> > stubbornly sticking to POSIX and saying all FDs are equal.
> >
> > Most user space developers know that care should be taken with FDs 0,1,2,
> > but it's still easy to make a mistake.
>
> If I look at libbpf, which supposedly gets the fd handling right I
> can't find any hint it actually moves the fds it gets from open() to
> an fd > 2, though?
>
> i.e. the code that invokes open() calls in the libbpf codebase happily
> just accepts an fd < 2, including fd == 0, and this is then later
> passed back into the kernel in various bpf() syscall invocations,
> which should refuse it, no? So what's going on there?

libbpf's attempt to ensure that fd>2 applies mostly to BPF objects:
maps, progs, btfs, links, which are always returned from bpf()
syscall. That's where we use ensure_good_fd(). The snippet you found
in bpf_map__reuse_fd() is problematic and slipped through the cracks,
I'll fix it, thanks for bringing this to my attention. I'll also check
if there are any other places where we should use ensure_good_fd() as
well.

>
> I did find this though:
>
> <snip>
>         new_fd = open("/", O_RDONLY | O_CLOEXEC);
>         if (new_fd < 0) {
>                 err = -errno;
>                 goto err_free_new_name;
>         }
>
>         new_fd = dup3(fd, new_fd, O_CLOEXEC);
>         if (new_fd < 0) {
>                 err = -errno;
>                 goto err_close_new_fd;
>         }
> </snip>
>
> (This is from libbpf.c, bpf_map__reuse_fd(), i.e. https://github.com/libbpf/libbpf/blob/master/src/libbpf.c)
>
> Not sure what's going on here, what is this about? you allocate an fd
> you then immediately replace? Is this done to move the fd away from
> fd=0?  but that doesn't work that way, in case fd 0 is closed when
> entering this function.
>
> Or is this about dup'ping with O_CLOEXEC?


This code predates me, so I don't know the exact reasons why it's
implemented exactly like this. It seems like instead of doing
open()+dup3() we should do dup3()+ensure_good_fd(). I need to look at
this carefully, this is not the code I work with regularly.

>
> Please be aware that F_DUPFD_CLOEXEC exists, which allows you to
> easily move some fd above some treshold, *and* set O_CLOEXEC at the
> same time. In the systemd codebase we call this frequently for code
> that ends up being loaded in arbitrary processes (glibc NSS modules,
> PAM modules), in order to ensure we get out of the fd < 3 territory
> quickly.

nice, thanks

>
> (btw, if you do care about O_CLOEXEC – which is great – then you also
> want to replace a bunch of fopen(…, "r") with fopen(…, "re") in your
> codebase)

I will check this as well, thanks for the hints :)

>
> Lennart

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
                                           ` (2 preceding siblings ...)
  2023-05-23  7:49                         ` Lennart Poettering
@ 2023-08-26  4:27                         ` Al Viro
  3 siblings, 0 replies; 30+ messages in thread
From: Al Viro @ 2023-08-26  4:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Christian Brauner, Christoph Hellwig,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai,
	Lennart Poettering, Linux-Fsdevel

On Thu, May 18, 2023 at 09:44:33PM -0700, Alexei Starovoitov wrote:

> That footgun was removed from folly in 2021, but we still see this issue from time to time.
> My point that the kernel can help here.
> Since folks don't like sysctl to control FD assignment how about something like this:
> 
> diff --git a/fs/file.c b/fs/file.c
> index 7893ea161d77..896e79433f61 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>         return error;
>  }
> 
> +__weak noinline u32 get_start_fd(void)
> +{
> +       return 0;
> +}
> +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */
> +
>  int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
>  {
> -       return alloc_fd(0, nofile, flags);
> +       return alloc_fd(get_start_fd(), nofile, flags);
>  }
> 
> Then we can enforce fd >= 3 for a certain container or for a particular app.

[an extremely belated reply - had been net.dead since mid-May, just got to
that thread]

As far as I'm concerned, the main conclusion is that BPF handling of file
descriptors needs a fairly hostile code review, regarding the interactions
with dup2(), shared descriptor tables, SCM_RIGHTS, etc.  Rationale:
demonstrated utter lack of clue about the nature of file descriptors,
along with a weird mental model of how they are used, complete with
"if they are used not in the way we expect, let's shove a hook
somewhere to enforce The Right Way(tm)".  Will do...

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-08-26  4:27 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-16  0:13 [PATCH bpf-next 0/3] Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support Andrii Nakryiko
2023-05-16  0:13 ` [PATCH bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands Andrii Nakryiko
2023-05-16  8:52   ` Jiri Olsa
2023-05-16 18:02     ` Andrii Nakryiko
2023-05-16  9:07   ` Christian Brauner
2023-05-16 18:02     ` Andrii Nakryiko
2023-05-17  9:11       ` fd == 0 means AT_FDCWD " Christian Brauner
2023-05-17 12:05         ` Christoph Hellwig
2023-05-17 16:17           ` Alexei Starovoitov
2023-05-17 21:48             ` Alexei Starovoitov
2023-05-18  8:38             ` Christian Brauner
2023-05-18 14:30               ` Theodore Ts'o
2023-05-18 16:25               ` Alexei Starovoitov
2023-05-18 16:33                 ` Matthew Wilcox
2023-05-18 17:22                   ` Christian Brauner
2023-05-18 17:20                 ` Christian Brauner
2023-05-18 17:33                   ` Linus Torvalds
2023-05-18 18:21                     ` Christian Brauner
2023-05-18 18:26                   ` Alexei Starovoitov
2023-05-18 18:57                     ` Linus Torvalds
2023-05-19  4:44                       ` Alexei Starovoitov
2023-05-19  8:13                         ` Christian Brauner
2023-05-19 14:27                           ` Theodore Ts'o
2023-05-19 17:51                         ` Linus Torvalds
2023-05-23  7:49                         ` Lennart Poettering
2023-05-23 17:25                           ` Andrii Nakryiko
2023-08-26  4:27                         ` Al Viro
2023-05-18 21:56         ` Andrii Nakryiko
2023-05-16  0:13 ` [PATCH bpf-next 2/3] libbpf: add opts-based bpf_obj_pin() API and add support for path_fd Andrii Nakryiko
2023-05-16  0:13 ` [PATCH bpf-next 3/3] selftests/bpf: add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).