All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Move pidfd to tiny pseudo fs
@ 2024-02-13 16:45 Christian Brauner
  2024-02-13 16:45 ` [PATCH 1/2] pidfd: move struct pidfd_fops Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-13 16:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen,
	Christian Brauner

Hey,

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do so far simply because of
the very justified limitations of anonymous inodes. So yesterday I sat
down and wrote it down.

Back when I added pidfds the concept was new (on Linux) and the
limitations were acceptable but now it's starting to hurt us. And with
the concept of pidfds having been around quite a while and being widely
used this is worth doing. This makes it so that:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() for equality.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.
* Pidfds now go through the regular dentry_open() path which means that
  all security hooks are called unblocking proper LSM management for
  pidfds. In addition fsnotify hooks are called and allow for listening
  to open events on pidfds.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no inode
operations in general, so nothing complex. It's hopefully the best kind
of dumb there is. Dentries and inodes are always deleted when the last
pidfd is closed.

I've made the new code optional and placed it under CONFIG_FS_PIDFD but
I'm confident we can remove that very soon. This takes some inspiration
from nsfs which uses a similar stashing mechanism.

Thanks!
Christian

Signed-off-by: Christian Brauner <brauner@kernel.org>

---
base-commit: 3f643cd2351099e6b859533b6f984463e5315e5f
change-id: 20240212-vfs-pidfd_fs-9a6e49283d80


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 1/2] pidfd: move struct pidfd_fops
  2024-02-13 16:45 [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner
@ 2024-02-13 16:45 ` Christian Brauner
  2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
  2024-02-13 17:02 ` [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner
  2 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-13 16:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen,
	Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Makefile   |   2 +-
 fs/pidfdfs.c  | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c | 110 ---------------------------------------------------
 3 files changed, 124 insertions(+), 111 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..0fe5d0151fcc 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -15,7 +15,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o mnt_idmapping.o remap_range.o
+		kernel_read_file.o mnt_idmapping.o remap_range.o pidfdfs.o
 
 obj-$(CONFIG_BUFFER_HEAD)	+= buffer.o mpage.o
 obj-$(CONFIG_PROC_FS)		+= proc_namespace.o
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
new file mode 100644
index 000000000000..55e8396e7fc4
--- /dev/null
+++ b/fs/pidfdfs.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/poll.h>
+#include <linux/proc_fs.h>
+#include <linux/proc_ns.h>
+#include <linux/pseudo_fs.h>
+#include <linux/seq_file.h>
+#include <uapi/linux/pidfd.h>
+
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+/**
+ * pidfd_show_fdinfo - print information about a pidfd
+ * @m: proc fdinfo file
+ * @f: file referencing a pidfd
+ *
+ * Pid:
+ * This function will print the pid that a given pidfd refers to in the
+ * pid namespace of the procfs instance.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its pid. This is
+ * similar to calling getppid() on a process whose parent is outside of
+ * its pid namespace.
+ *
+ * NSpid:
+ * If pid namespaces are supported then this function will also print
+ * the pid of a given pidfd refers to for all descendant pid namespaces
+ * starting from the current pid namespace of the instance, i.e. the
+ * Pid field and the first entry in the NSpid field will be identical.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its first NSpid
+ * entry and no others will be shown.
+ * Note that this differs from the Pid and NSpid fields in
+ * /proc/<pid>/status where Pid and NSpid are always shown relative to
+ * the  pid namespace of the procfs instance. The difference becomes
+ * obvious when sending around a pidfd between pid namespaces from a
+ * different branch of the tree, i.e. where no ancestral relation is
+ * present between the pid namespaces:
+ * - create two new pid namespaces ns1 and ns2 in the initial pid
+ *   namespace (also take care to create new mount namespaces in the
+ *   new pid namespace and mount procfs)
+ * - create a process with a pidfd in ns1
+ * - send pidfd from ns1 to ns2
+ * - read /proc/self/fdinfo/<pidfd> and observe that both Pid and NSpid
+ *   have exactly one entry, which is 0
+ */
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid *pid = f->private_data;
+	struct pid_namespace *ns;
+	pid_t nr = -1;
+
+	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
+		ns = proc_pid_ns(file_inode(m->file)->i_sb);
+		nr = pid_nr_ns(pid, ns);
+	}
+
+	seq_put_decimal_ll(m, "Pid:\t", nr);
+
+#ifdef CONFIG_PID_NS
+	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
+	if (nr > 0) {
+		int i;
+
+		/* If nr is non-zero it means that 'pid' is valid and that
+		 * ns, i.e. the pid namespace associated with the procfs
+		 * instance, is in the pid namespace hierarchy of pid.
+		 * Start at one below the already printed level.
+		 */
+		for (i = ns->level + 1; i <= pid->level; i++)
+			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
+	}
+#endif
+	seq_putc(m, '\n');
+}
+#endif
+
+/*
+ * Poll support for process exit notification.
+ */
+static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct pid *pid = file->private_data;
+	bool thread = file->f_flags & PIDFD_THREAD;
+	struct task_struct *task;
+	__poll_t poll_flags = 0;
+
+	poll_wait(file, &pid->wait_pidfd, pts);
+	/*
+	 * Depending on PIDFD_THREAD, inform pollers when the thread
+	 * or the whole thread-group exits.
+	 */
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
+	else if (task->exit_state && (thread || thread_group_empty(task)))
+		poll_flags = EPOLLIN | EPOLLRDNORM;
+	rcu_read_unlock();
+
+	return poll_flags;
+}
+
+const struct file_operations pidfd_fops = {
+	.release	= pidfd_release,
+	.poll		= pidfd_poll,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= pidfd_show_fdinfo,
+#endif
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f22ec90c5c6..662a61f340ce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1993,116 +1993,6 @@ struct pid *pidfd_pid(const struct file *file)
 	return ERR_PTR(-EBADF);
 }
 
-static int pidfd_release(struct inode *inode, struct file *file)
-{
-	struct pid *pid = file->private_data;
-
-	file->private_data = NULL;
-	put_pid(pid);
-	return 0;
-}
-
-#ifdef CONFIG_PROC_FS
-/**
- * pidfd_show_fdinfo - print information about a pidfd
- * @m: proc fdinfo file
- * @f: file referencing a pidfd
- *
- * Pid:
- * This function will print the pid that a given pidfd refers to in the
- * pid namespace of the procfs instance.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its pid. This is
- * similar to calling getppid() on a process whose parent is outside of
- * its pid namespace.
- *
- * NSpid:
- * If pid namespaces are supported then this function will also print
- * the pid of a given pidfd refers to for all descendant pid namespaces
- * starting from the current pid namespace of the instance, i.e. the
- * Pid field and the first entry in the NSpid field will be identical.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its first NSpid
- * entry and no others will be shown.
- * Note that this differs from the Pid and NSpid fields in
- * /proc/<pid>/status where Pid and NSpid are always shown relative to
- * the  pid namespace of the procfs instance. The difference becomes
- * obvious when sending around a pidfd between pid namespaces from a
- * different branch of the tree, i.e. where no ancestral relation is
- * present between the pid namespaces:
- * - create two new pid namespaces ns1 and ns2 in the initial pid
- *   namespace (also take care to create new mount namespaces in the
- *   new pid namespace and mount procfs)
- * - create a process with a pidfd in ns1
- * - send pidfd from ns1 to ns2
- * - read /proc/self/fdinfo/<pidfd> and observe that both Pid and NSpid
- *   have exactly one entry, which is 0
- */
-static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
-{
-	struct pid *pid = f->private_data;
-	struct pid_namespace *ns;
-	pid_t nr = -1;
-
-	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
-		ns = proc_pid_ns(file_inode(m->file)->i_sb);
-		nr = pid_nr_ns(pid, ns);
-	}
-
-	seq_put_decimal_ll(m, "Pid:\t", nr);
-
-#ifdef CONFIG_PID_NS
-	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
-	if (nr > 0) {
-		int i;
-
-		/* If nr is non-zero it means that 'pid' is valid and that
-		 * ns, i.e. the pid namespace associated with the procfs
-		 * instance, is in the pid namespace hierarchy of pid.
-		 * Start at one below the already printed level.
-		 */
-		for (i = ns->level + 1; i <= pid->level; i++)
-			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
-	}
-#endif
-	seq_putc(m, '\n');
-}
-#endif
-
-/*
- * Poll support for process exit notification.
- */
-static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
-{
-	struct pid *pid = file->private_data;
-	bool thread = file->f_flags & PIDFD_THREAD;
-	struct task_struct *task;
-	__poll_t poll_flags = 0;
-
-	poll_wait(file, &pid->wait_pidfd, pts);
-	/*
-	 * Depending on PIDFD_THREAD, inform pollers when the thread
-	 * or the whole thread-group exits.
-	 */
-	rcu_read_lock();
-	task = pid_task(pid, PIDTYPE_PID);
-	if (!task)
-		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
-		poll_flags = EPOLLIN | EPOLLRDNORM;
-	rcu_read_unlock();
-
-	return poll_flags;
-}
-
-const struct file_operations pidfd_fops = {
-	.release = pidfd_release,
-	.poll = pidfd_poll,
-#ifdef CONFIG_PROC_FS
-	.show_fdinfo = pidfd_show_fdinfo,
-#endif
-};
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 2/2] pidfd: add pidfdfs
  2024-02-13 16:45 [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner
  2024-02-13 16:45 ` [PATCH 1/2] pidfd: move struct pidfd_fops Christian Brauner
@ 2024-02-13 16:45 ` Christian Brauner
  2024-02-13 17:17   ` Linus Torvalds
                     ` (3 more replies)
  2024-02-13 17:02 ` [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner
  2 siblings, 4 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-13 16:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen,
	Christian Brauner

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.
* Pidfds now go through the regular dentry_open() path which means that
  all security hooks are called unblocking proper LSM management for
  pidfds. In addition fsnotify hooks are called and allow for listening
  to open events on pidfds.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig                 |   6 ++
 fs/pidfdfs.c               | 189 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/pid.h        |   4 +
 include/linux/pidfdfs.h    |   9 +++
 include/uapi/linux/magic.h |   1 +
 init/main.c                |   2 +
 kernel/fork.c              |  13 +---
 kernel/nsproxy.c           |   2 +-
 kernel/pid.c               |   2 +
 9 files changed, 214 insertions(+), 14 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..c7ed65e34820 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PIDFD
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	help
+	  Pidfdfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index 55e8396e7fc4..efc68ef3a08d 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -1,9 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
 #include <linux/mount.h>
 #include <linux/pid.h>
+#include <linux/pidfdfs.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -12,12 +14,25 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op != &pidfd_fops)
+		return ERR_PTR(-EBADF);
+#ifdef CONFIG_FS_PIDFD
+	return file_inode(file)->i_private;
+#else
+	return file->private_data;
+#endif
+}
+
 static int pidfd_release(struct inode *inode, struct file *file)
 {
+#ifndef CONFIG_FS_PIDFD
 	struct pid *pid = file->private_data;
 
 	file->private_data = NULL;
 	put_pid(pid);
+#endif
 	return 0;
 }
 
@@ -59,7 +74,7 @@ static int pidfd_release(struct inode *inode, struct file *file)
  */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
-	struct pid *pid = f->private_data;
+	struct pid *pid = pidfd_pid(f);
 	struct pid_namespace *ns;
 	pid_t nr = -1;
 
@@ -93,7 +108,7 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
-	struct pid *pid = file->private_data;
+	struct pid *pid = pidfd_pid(file);
 	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;
@@ -121,3 +136,173 @@ const struct file_operations pidfd_fops = {
 	.show_fdinfo	= pidfd_show_fdinfo,
 #endif
 };
+
+#ifdef CONFIG_FS_PIDFD
+static struct vfsmount *pidfdfs_mnt __ro_after_init;
+static struct super_block *pidfdfs_sb __ro_after_init;
+static u64 pidfdfs_ino = 0;
+
+static void pidfdfs_evict_inode(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+
+	clear_inode(inode);
+	put_pid(pid);
+}
+
+static const struct super_operations pidfdfs_sops = {
+	.statfs		= simple_statfs,
+	.evict_inode	= pidfdfs_evict_inode,
+};
+
+static void pidfdfs_prune_dentry(struct dentry *dentry)
+{
+	struct inode *inode;
+	struct pid *pid;
+
+	inode = d_inode(dentry);
+	if (!inode)
+		return;
+
+	pid = inode->i_private;
+	atomic_long_set(&pid->stashed, 0);
+}
+
+static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(buffer, buflen, "pidfd:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+const struct dentry_operations pidfdfs_dentry_operations = {
+	.d_prune	= pidfdfs_prune_dentry,
+	.d_delete	= always_delete_dentry,
+	.d_dname	= pidfdfs_dname,
+};
+
+static int pidfdfs_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, PIDFDFS_MAGIC);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ops = &pidfdfs_sops;
+	ctx->dops = &pidfdfs_dentry_operations;
+	return 0;
+}
+
+static struct file_system_type pidfdfs_type = {
+	.name			= "pidfdfs",
+	.init_fs_context	= pidfdfs_init_fs_context,
+	.kill_sb		= kill_anon_super,
+};
+
+static struct dentry *pidfdfs_dentry(struct pid *pid)
+{
+	struct inode *inode;
+	struct dentry *dentry;
+	unsigned long i_ptr;
+
+	inode = new_inode_pseudo(pidfdfs_sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode->i_ino	= pid->ino;
+	inode->i_mode	= S_IFREG | S_IRUGO;
+	inode->i_fop	= &pidfd_fops;
+	inode->i_flags |= S_IMMUTABLE;
+	simple_inode_init_ts(inode);
+	/* grab a reference */
+	inode->i_private = get_pid(pid);
+
+	/* consumes @inode */
+	dentry = d_make_root(inode);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
+	i_ptr = atomic_long_cmpxchg(&pid->stashed, 0, (unsigned long)dentry);
+	if (i_ptr) {
+		d_delete(dentry); /* make sure ->d_prune() does nothing */
+		dput(dentry);
+		cpu_relax();
+		return ERR_PTR(-EAGAIN);
+	}
+
+	return dentry;
+}
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+
+	struct path path;
+	struct dentry *dentry;
+	struct file *pidfd_file;
+
+	for (;;) {
+		rcu_read_lock();
+		dentry = (struct dentry *)atomic_long_read(&pid->stashed);
+		if (!dentry || !lockref_get_not_dead(&dentry->d_lockref)) {
+			rcu_read_unlock();
+
+			dentry = pidfdfs_dentry(pid);
+			if (!IS_ERR(dentry))
+				break;
+			if (PTR_ERR(dentry) == -EAGAIN)
+				continue;
+		}
+		rcu_read_unlock();
+		break;
+	}
+	if (IS_ERR(dentry))
+		return ERR_CAST(dentry);
+
+	path.mnt = mntget(pidfdfs_mnt);
+	path.dentry = dentry;
+
+	pidfd_file = dentry_open(&path, flags, current_cred());
+	path_put(&path);
+
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid)
+{
+	atomic_long_set(&pid->stashed, 0);
+	pid->ino = ++pidfdfs_ino;
+}
+
+void __init pidfdfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&pidfdfs_type);
+	if (err)
+		panic("Failed to register pidfdfs pseudo filesystem");
+
+	pidfdfs_mnt = kern_mount(&pidfdfs_type);
+	if (IS_ERR(pidfdfs_mnt))
+		panic("Failed to mount pidfdfs pseudo filesystem");
+
+	pidfdfs_sb = pidfdfs_mnt->mnt_sb;
+}
+
+#else /* !CONFIG_FS_PIDFD */
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+	struct file *pidfd_file;
+
+	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+					flags | O_RDWR);
+	if (IS_ERR(pidfd_file))
+		return pidfd_file;
+
+	get_pid(pid);
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid) { }
+void __init pidfdfs_init(void) { }
+#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8124d57752b9..1a47676a04c2 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -55,6 +55,10 @@ struct pid
 	refcount_t count;
 	unsigned int level;
 	spinlock_t lock;
+#ifdef CONFIG_FS_PIDFD
+	atomic_long_t stashed;
+	unsigned long ino;
+#endif
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/include/linux/pidfdfs.h b/include/linux/pidfdfs.h
new file mode 100644
index 000000000000..760dbc163625
--- /dev/null
+++ b/include/linux/pidfdfs.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PIDFDFS_H
+#define _LINUX_PIDFDFS_H
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags);
+void __init pidfdfs_init(void);
+void pid_init_pidfdfs(struct pid *pid);
+
+#endif /* _LINUX_PIDFDFS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..a0d5480115c5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define PIDFDFS_MAGIC		0x50494446	/* "PIDF" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..0663003f3146 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/pidfdfs.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1059,6 +1060,7 @@ void start_kernel(void)
 	seq_file_init();
 	proc_root_init();
 	nsfs_init();
+	pidfdfs_init();
 	cpuset_init();
 	cgroup_init();
 	taskstats_init_early();
diff --git a/kernel/fork.c b/kernel/fork.c
index 662a61f340ce..eab2fcc90342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@
 #include <linux/iommu.h>
 #include <linux/rseq.h>
 #include <uapi/linux/pidfd.h>
+#include <linux/pidfdfs.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1985,14 +1986,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
-struct pid *pidfd_pid(const struct file *file)
-{
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
-
-	return ERR_PTR(-EBADF);
-}
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd
@@ -2030,13 +2023,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (pidfd < 0)
 		return pidfd;
 
-	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
-					flags | O_RDWR);
+	pidfd_file = pidfdfs_alloc_file(pid, flags | O_RDWR);
 	if (IS_ERR(pidfd_file)) {
 		put_unused_fd(pidfd);
 		return PTR_ERR(pidfd_file);
 	}
-	get_pid(pid); /* held by pidfd_file now */
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 15781acaac1c..6ec3deec68c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -573,7 +573,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	if (proc_ns_file(f.file))
 		err = validate_ns(&nsset, ns);
 	else
-		err = validate_nsset(&nsset, f.file->private_data);
+		err = validate_nsset(&nsset, pidfd_pid(f.file));
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
diff --git a/kernel/pid.c b/kernel/pid.c
index c1d940fbd314..dbff84493bff 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/pidfdfs.h>
 #include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 
@@ -270,6 +271,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
+	pid_init_pidfdfs(pid);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/2] Move pidfd to tiny pseudo fs
  2024-02-13 16:45 [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner
  2024-02-13 16:45 ` [PATCH 1/2] pidfd: move struct pidfd_fops Christian Brauner
  2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
@ 2024-02-13 17:02 ` Christian Brauner
  2 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-13 17:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen

On Tue, Feb 13, 2024 at 05:45:45PM +0100, Christian Brauner wrote:
> Hey,
> 
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do so far simply because of
> the very justified limitations of anonymous inodes. So yesterday I sat
> down and wrote it down.
> 
> Back when I added pidfds the concept was new (on Linux) and the
> limitations were acceptable but now it's starting to hurt us. And with
> the concept of pidfds having been around quite a while and being widely
> used this is worth doing. This makes it so that:
> 
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() for equality.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>   file->private_data. This means it is now possible to introduce
>   concepts that operate on a process once all file descriptors have been
>   closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode. Even
>   if we were to move to anon_inode_create_getfile() which creates new
>   inodes we'd still be associating the same struct pid with multiple
>   different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>   all security hooks are called unblocking proper LSM management for
>   pidfds. In addition fsnotify hooks are called and allow for listening
>   to open events on pidfds.
> 
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no inode
> operations in general, so nothing complex. It's hopefully the best kind
> of dumb there is. Dentries and inodes are always deleted when the last
> pidfd is closed.
> 
> I've made the new code optional and placed it under CONFIG_FS_PIDFD but
> I'm confident we can remove that very soon. This takes some inspiration
> from nsfs which uses a similar stashing mechanism.
> 
> Thanks!
> Christian
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> ---
> base-commit: 3f643cd2351099e6b859533b6f984463e5315e5f
> change-id: 20240212-vfs-pidfd_fs-9a6e49283d80

I forgot to mention that pidfds are explicitly not simply directory
inodes in procfs for various reasons so this isn't an option I want to
pursue. Integrating them into procfs would be a nasty level of
complexity that makes for very ugly and convoluted code. Especially how
this would need to be integrated into copy_process() and other
locations. It also poses significant security and permission checking
challenges to userspace because it is generally not safe to send around
file descriptors for /proc/<pid> directories. It's a pretty big attack
vector and cause of security issues. So really this is not a path that I
want to go down. It defeats the whole purpose of pidfds as opaque, easy
delegatable handles.

Oh, and tree is vfs.pidfd at the usual location
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
@ 2024-02-13 17:17   ` Linus Torvalds
  2024-02-14 14:40     ` Christian Brauner
  2024-02-22 19:03   ` Nathan Chancellor
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-13 17:17 UTC (permalink / raw)
  To: Christian Brauner, Al Viro; +Cc: linux-fsdevel, Seth Forshee, Tycho Andersen

On Tue, 13 Feb 2024 at 08:46, Christian Brauner <brauner@kernel.org> wrote:
>
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode.

The games for this are disgusting. This needs to be done some other way.

Yes, I realize that you copied what Al did for nsfs. It's still
entirely disgusting.

To quote the Romans: "Quod licet Al, non licet bovi".

It might be a *bit* less disgusting if we hid the casting games with
"atomic_long_xyz" by having nice wrappers that do "atomic_ptr_xyz".
Because having an "atomic_long_t" and stuffing a dentry pointer in it
is just too ugly for words.

But I do think that if we're going to do the same disgusting thing
that nsfs does, then the whole "keep one dentry" logic needs to be
shared between the two filesystem. Not be two different copies of the
same inscrutable thing.

Because they *are* the same thing, although you wrote the pidfs copy
slightly differently.

I was ok with one incredibly ugly thing, but I'm not ok with having it
copied and duplicated.

               Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-13 17:17   ` Linus Torvalds
@ 2024-02-14 14:40     ` Christian Brauner
  2024-02-14 18:27       ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-14 14:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Tue, Feb 13, 2024 at 09:17:18AM -0800, Linus Torvalds wrote:
> On Tue, 13 Feb 2024 at 08:46, Christian Brauner <brauner@kernel.org> wrote:
> >
> > * Each struct pid will refer to a different inode but the same struct
> >   pid will refer to the same inode if it's opened multiple times. In
> >   contrast to now where each struct pid refers to the same inode.
> 
> The games for this are disgusting. This needs to be done some other way.

Yeah, I'm not particularly happy about it and I did consider deviating
from this completely and doing a lookup based on stashed inode number
alone. But that would've been a bit more code. Let me see though.

> To quote the Romans:

uniti aedificamus

> Because they *are* the same thing, although you wrote the pidfs copy

I didn't see the point in sharing the code particularly for an rfc
because I didn't expect you to be enthusiastic about the original code
(Which is also why I Cced you here in the first place).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-14 14:40     ` Christian Brauner
@ 2024-02-14 18:27       ` Christian Brauner
  2024-02-14 18:37         ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-14 18:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Wed, Feb 14, 2024 at 03:40:26PM +0100, Christian Brauner wrote:
> On Tue, Feb 13, 2024 at 09:17:18AM -0800, Linus Torvalds wrote:
> > On Tue, 13 Feb 2024 at 08:46, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > * Each struct pid will refer to a different inode but the same struct
> > >   pid will refer to the same inode if it's opened multiple times. In
> > >   contrast to now where each struct pid refers to the same inode.
> > 
> > The games for this are disgusting. This needs to be done some other way.
> 
> Yeah, I'm not particularly happy about it and I did consider deviating
> from this completely and doing a lookup based on stashed inode number
> alone. But that would've been a bit more code. Let me see though.

Ok, that turned out to be simpler than I had evisioned - unless I made
horrible mistakes:

From c96d88910d029ea639902d245619acbd910fc32d Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 12 Feb 2024 16:32:38 +0100
Subject: [PATCH] [RFC] pidfd: add pidfdfs

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfd for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.

The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig                 |   6 ++
 fs/pidfdfs.c               | 134 ++++++++++++++++++++++++++++++++++++-
 include/linux/pid.h        |   3 +
 include/linux/pidfdfs.h    |   9 +++
 include/uapi/linux/magic.h |   1 +
 init/main.c                |   2 +
 kernel/fork.c              |  13 +---
 kernel/nsproxy.c           |   2 +-
 kernel/pid.c               |   2 +
 9 files changed, 158 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/pidfdfs.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..c7ed65e34820 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PIDFD
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	help
+	  Pidfdfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index 55e8396e7fc4..80caf1759f97 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -1,9 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
 #include <linux/mount.h>
 #include <linux/pid.h>
+#include <linux/pidfdfs.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -12,12 +14,25 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op != &pidfd_fops)
+		return ERR_PTR(-EBADF);
+#ifdef CONFIG_FS_PIDFD
+	return file_inode(file)->i_private;
+#else
+	return file->private_data;
+#endif
+}
+
 static int pidfd_release(struct inode *inode, struct file *file)
 {
+#ifndef CONFIG_FS_PIDFD
 	struct pid *pid = file->private_data;
 
 	file->private_data = NULL;
 	put_pid(pid);
+#endif
 	return 0;
 }
 
@@ -59,7 +74,7 @@ static int pidfd_release(struct inode *inode, struct file *file)
  */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
-	struct pid *pid = f->private_data;
+	struct pid *pid = pidfd_pid(f);
 	struct pid_namespace *ns;
 	pid_t nr = -1;
 
@@ -93,7 +108,7 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
-	struct pid *pid = file->private_data;
+	struct pid *pid = pidfd_pid(file);
 	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;
@@ -121,3 +136,118 @@ const struct file_operations pidfd_fops = {
 	.show_fdinfo	= pidfd_show_fdinfo,
 #endif
 };
+
+#ifdef CONFIG_FS_PIDFD
+static struct vfsmount *pidfdfs_mnt __ro_after_init;
+static struct super_block *pidfdfs_sb __ro_after_init;
+static u64 pidfdfs_ino = 0;
+
+static void pidfdfs_evict_inode(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+
+	clear_inode(inode);
+	put_pid(pid);
+}
+
+static const struct super_operations pidfdfs_sops = {
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= pidfdfs_evict_inode,
+	.statfs		= simple_statfs,
+};
+
+static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(buffer, buflen, "pidfd:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+const struct dentry_operations pidfdfs_dentry_operations = {
+	.d_delete	= always_delete_dentry,
+	.d_dname	= pidfdfs_dname,
+};
+
+static int pidfdfs_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, PIDFDFS_MAGIC);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ops = &pidfdfs_sops;
+	ctx->dops = &pidfdfs_dentry_operations;
+	return 0;
+}
+
+static struct file_system_type pidfdfs_type = {
+	.name			= "pidfdfs",
+	.init_fs_context	= pidfdfs_init_fs_context,
+	.kill_sb		= kill_anon_super,
+};
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+
+	struct inode *inode;
+	struct file *pidfd_file;
+
+	inode = iget_locked(pidfdfs_sb, pid->ino);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	if (inode->i_state & I_NEW) {
+		inode->i_ino = pid->ino;
+		inode->i_mode = S_IFREG | S_IRUGO;
+		inode->i_fop = &pidfd_fops;
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_private = get_pid(pid);
+		simple_inode_init_ts(inode);
+		unlock_new_inode(inode);
+	}
+
+	pidfd_file = alloc_file_pseudo(inode, pidfdfs_mnt, "", flags, &pidfd_fops);
+	if (IS_ERR(pidfd_file))
+		iput(inode);
+
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid)
+{
+	pid->ino = ++pidfdfs_ino;
+}
+
+void __init pidfdfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&pidfdfs_type);
+	if (err)
+		panic("Failed to register pidfdfs pseudo filesystem");
+
+	pidfdfs_mnt = kern_mount(&pidfdfs_type);
+	if (IS_ERR(pidfdfs_mnt))
+		panic("Failed to mount pidfdfs pseudo filesystem");
+
+	pidfdfs_sb = pidfdfs_mnt->mnt_sb;
+}
+
+#else /* !CONFIG_FS_PIDFD */
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+	struct file *pidfd_file;
+
+	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+					flags | O_RDWR);
+	if (IS_ERR(pidfd_file))
+		return pidfd_file;
+
+	get_pid(pid);
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid) { }
+void __init pidfdfs_init(void) { }
+#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8124d57752b9..7b6f5deab36a 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -55,6 +55,9 @@ struct pid
 	refcount_t count;
 	unsigned int level;
 	spinlock_t lock;
+#ifdef CONFIG_FS_PIDFD
+	unsigned long ino;
+#endif
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/include/linux/pidfdfs.h b/include/linux/pidfdfs.h
new file mode 100644
index 000000000000..760dbc163625
--- /dev/null
+++ b/include/linux/pidfdfs.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PIDFDFS_H
+#define _LINUX_PIDFDFS_H
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags);
+void __init pidfdfs_init(void);
+void pid_init_pidfdfs(struct pid *pid);
+
+#endif /* _LINUX_PIDFDFS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..a0d5480115c5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define PIDFDFS_MAGIC		0x50494446	/* "PIDF" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..0663003f3146 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/pidfdfs.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1059,6 +1060,7 @@ void start_kernel(void)
 	seq_file_init();
 	proc_root_init();
 	nsfs_init();
+	pidfdfs_init();
 	cpuset_init();
 	cgroup_init();
 	taskstats_init_early();
diff --git a/kernel/fork.c b/kernel/fork.c
index 662a61f340ce..eab2fcc90342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@
 #include <linux/iommu.h>
 #include <linux/rseq.h>
 #include <uapi/linux/pidfd.h>
+#include <linux/pidfdfs.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1985,14 +1986,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
-struct pid *pidfd_pid(const struct file *file)
-{
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
-
-	return ERR_PTR(-EBADF);
-}
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd
@@ -2030,13 +2023,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (pidfd < 0)
 		return pidfd;
 
-	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
-					flags | O_RDWR);
+	pidfd_file = pidfdfs_alloc_file(pid, flags | O_RDWR);
 	if (IS_ERR(pidfd_file)) {
 		put_unused_fd(pidfd);
 		return PTR_ERR(pidfd_file);
 	}
-	get_pid(pid); /* held by pidfd_file now */
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 15781acaac1c..6ec3deec68c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -573,7 +573,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	if (proc_ns_file(f.file))
 		err = validate_ns(&nsset, ns);
 	else
-		err = validate_nsset(&nsset, f.file->private_data);
+		err = validate_nsset(&nsset, pidfd_pid(f.file));
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
diff --git a/kernel/pid.c b/kernel/pid.c
index c1d940fbd314..dbff84493bff 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/pidfdfs.h>
 #include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 
@@ -270,6 +271,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
+	pid_init_pidfdfs(pid);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-14 18:27       ` Christian Brauner
@ 2024-02-14 18:37         ` Linus Torvalds
  2024-02-15 16:11           ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-14 18:37 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
>
> Ok, that turned out to be simpler than I had evisioned - unless I made
> horrible mistakes:

Hmm. Could we do the same for nsfs?

Also, quick question:

> +void pid_init_pidfdfs(struct pid *pid)
> +{
> +       pid->ino = ++pidfdfs_ino;
> +}

As far as I can tell, the above is only safe because it is serialized by

        spin_lock_irq(&pidmap_lock);

in the only caller.

Can we please just not have a function at all, and just move it there,
so that it's a whole lot more obvious that that inode generation
really gets us a unique number?

Am I missing something?

                 Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-14 18:37         ` Linus Torvalds
@ 2024-02-15 16:11           ` Christian Brauner
  2024-02-16 11:50             ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-15 16:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Ok, that turned out to be simpler than I had evisioned - unless I made
> > horrible mistakes:
> 
> Hmm. Could we do the same for nsfs?

Jein (Ja/Yes + Nein/No).
We could but it would alter nsfs quite a bit.

Before stashing nsfs did (roughly):

/* get an unhashed dentry */
dentry = d_alloc_pseudo(sb, &qname);

/* get new or find existing inode */
inode = iget_locked(sb, ns->inum);

/* Add new or reuse existing dentry atomically */
d_instantiate_unique(dentry, inode);

Afaict, that was intended to work because d_instantiate_unique()
existed. But I don't think it ever worked before the stashing. I'll get
to that below.

d_instantiate_unique() used to to find an existing dentry or added a new
dentry all with inode->i_lock held.

A little after that stashing mechanism for nsfs was added
d_instantiate_unique() was split into two: d_exact_alias() and
d_splice_alias(). The semantics are different. d_exact_alias() discounts
hashed dentries - returns NULL - and only considers unhashed dentries
eligible for reuse. If it finds an unhashed alias it will rehash it and
return it.

And whereas d_instantiate_unique() found and added a new dentry with
inode->i_lock held and thus guaranteed that there would only be one
dentry the d_exact_alias() followed by d_splice_alias() doesn't because
i_lock is dropped in between of course.

But even the logic back then didn't work. Because back then nsfs called
d_alloc_pseudo() which sets dentry->d_parent = dentry. IOW, creates
IS_ROOT() dentries. But __d_instantiate_unique() did:

if (alias->d_parent != entry->d_parent)
        continue;

and so would reliably discount d_alloc_pseudo() dentries... So even back
then nsfs created aliases for each open of the same namespace.

Right now the same problem plus another problem would exist. Consider:

/* get an unhashed dentry */
dentry = d_alloc_anon(sb, &qname);

/* get new or find existing inode */
inode = iget_locked(sb, ns->inum);

/* Add new or reuse existing dentry atomically */
alias = d_exact_alias(dentry, inode);
if (!alias)
	d_splice_alias(inode, dentry);

That doesn't work. First, because of the inode->i_lock drop in between
d_exact_alias() and d_splice_alias(). Second, because d_exact_alias()
discounts IS_ROOT() for reuse. Third, because d_exact_alias() discounts
hashed dentries for reuse.

Consequently switching to a similar iget_locked() mechanism for nsfs as
for pidfdfs right now would mean that there's now not just a single
dentry that's reused by all concurrent __ns_get_path() calls. Instead
one does get multiple dentries like in the old (likely broken) scheme.

For nsfs this might matter because in contrast to anonymous inodes and
pidfds these namespace fds can be bind-mounted and thus persisted. And
in that case it's nice if you only have a single dentry. Whether that
matters in practice in terms of memory I'm not sure. It likely doesn't.

Now, for pidfds we don't care. pidfds can never be bind-mounted and
there's no reason for that. That doesn't work with anonymous inodes and
it doesn't work with pidfdfs. The pidfds_mnt is marked as vfs internal
preventing it from being used for mounting.

For pidfds we also don't care about multiple dentries for the same
inode. Because right now - with pidfds being anonymous inodes - that's
exacty what already happens. And that really hasn't been an issue so far
so I don't see why it would be an issue now.

But, if we wanted this we could solve it without that stashing mechanism
as a patch on top of both pidfdfs and nsfs later but it would require
some dcache help.

I think what we'd need - specifically tailored to both nsfs and pidfdfs
- is something like the below. So, this is only expected to work on
stuff that does d_alloc_anon(). So has no real name and dentry->d_parent
== dentry; basically a bunch of IS_ROOT() dentries. I can add that on
top and then pidfds can use that and nsfs as well. But again, I don't
need it for pidfdfs to be functional. Up to you.

diff --git a/fs/dcache.c b/fs/dcache.c
index b813528fb147..7c78b8b1a8b3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2642,6 +2642,39 @@ void d_add(struct dentry *entry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_add);

+/*
+ * Helper for special filesystems that want to recycle the exact same dentry
+ * over and over. Requires d_alloc_anon() and will reject anything that isn't
+ * IS_ROOT(). Caller must provide valid inode.
+ */
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode)
+{
+       struct dentry *alias;
+       unsigned int hash = entry->d_name.hash;
+
+       if (!inode)
+               return NULL;
+
+       if (!IS_ROOT(entry))
+               return NULL;
+
+       spin_lock(&inode->i_lock);
+       hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+               if (alias->d_name.hash != hash)
+                       continue;
+               if (!d_same_name(alias, entry->d_parent, &entry->d_name))
+                       continue;
+               dget_dlock(alias);
+               spin_unlock(&inode->i_lock);
+               return alias;
+       }
+
+       __d_instantiate(entry, inode);
+       spin_unlock(&inode->i_lock);
+       security_d_instantiate(entry, inode);
+       return NULL;
+}
+
 /**
  * d_exact_alias - find and hash an exact unhashed alias
  * @entry: dentry to add
diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..41b441c7b2a0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -310,3 +310,4 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode);

And then this can become roughly:

struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
{
        struct path path;
        struct dentry *dentry, *alias;
        struct inode *inode;
        struct file *pidfd_file;

        dentry = d_alloc_anon(pidfdfs_sb);
        if (!dentry)
                return ERR_PTR(-ENOMEM);

        inode = iget_locked(pidfdfs_sb, pid->ino);
        if (!inode) {
                dput(dentry);
                return ERR_PTR(-ENOMEM);
        }

        if (inode->i_state & I_NEW) {
                inode->i_ino = pid->ino;
                inode->i_mode = S_IFREG | S_IRUGO;
                inode->i_fop = &pidfd_fops;
                inode->i_flags |= S_IMMUTABLE;
                inode->i_private = get_pid(pid);
                simple_inode_init_ts(inode);
                unlock_new_inode(inode);
        }

        alias = d_instantiate_unique_anon(dentry, inode);
        if (alias) {
                dput(dentry);
                dentry = alias;
        }

        path.dentry = dentry;
        path.mnt = mntget(pidfdfs_mnt);

        pidfd_file = dentry_open(&path, flags, current_cred());
        path_put(&path);
        return pidfd_file;
}

> 
> Also, quick question:
> 
> > +void pid_init_pidfdfs(struct pid *pid)
> > +{
> > +       pid->ino = ++pidfdfs_ino;
> > +}
> 
> As far as I can tell, the above is only safe because it is serialized by
> 
>         spin_lock_irq(&pidmap_lock);

Yes, I'm explicitly relying on that because that thing serializes all
alloc_pid() calls anyway which came in very handy nice.

> 
> in the only caller.
> 
> Can we please just not have a function at all, and just move it there,
> so that it's a whole lot more obvious that that inode generation
> really gets us a unique number?

Yes, of course. I just did it to avoid an ifdef basically.

> 
> Am I missing something?

Nope.

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-15 16:11           ` Christian Brauner
@ 2024-02-16 11:50             ` Christian Brauner
  2024-02-16 16:41               ` Christian Brauner
  2024-02-17 13:59               ` Oleg Nesterov
  0 siblings, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-16 11:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 604 bytes --]

On Thu, Feb 15, 2024 at 05:11:31PM +0100, Christian Brauner wrote:
> On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> > On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Ok, that turned out to be simpler than I had evisioned - unless I made
> > > horrible mistakes:
> > 
> > Hmm. Could we do the same for nsfs?

Ok, here's what I got. I'll put the changes to switch both nsfs and
pidfdfs to the proposed unique mechanism suggested yesterday on top. I
would send that in two batches in any case. So if that's somehow broken
then we can just drop it.

[-- Attachment #2: 0001-pidfd-move-struct-pidfd_fops.eml --]
[-- Type: message/rfc822, Size: 10113 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,  Alexander Viro <viro@zeniv.linux.org.uk>,  Seth Forshee <sforshee@kernel.org>, Tycho Andersen <tycho@tycho.pizza>,  Christian Brauner <brauner@kernel.org>
Subject: [PATCH v2 1/5] pidfd: move struct pidfd_fops
Date: Fri, 16 Feb 2024 12:40:11 +0100
Message-ID: <20240216-vfs-pidfd_fs-v2-1-8365d659464d@kernel.org>

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Makefile   |   2 +-
 fs/pidfdfs.c  | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c | 110 ---------------------------------------------------
 3 files changed, 124 insertions(+), 111 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..0fe5d0151fcc 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -15,7 +15,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o mnt_idmapping.o remap_range.o
+		kernel_read_file.o mnt_idmapping.o remap_range.o pidfdfs.o
 
 obj-$(CONFIG_BUFFER_HEAD)	+= buffer.o mpage.o
 obj-$(CONFIG_PROC_FS)		+= proc_namespace.o
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
new file mode 100644
index 000000000000..55e8396e7fc4
--- /dev/null
+++ b/fs/pidfdfs.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/poll.h>
+#include <linux/proc_fs.h>
+#include <linux/proc_ns.h>
+#include <linux/pseudo_fs.h>
+#include <linux/seq_file.h>
+#include <uapi/linux/pidfd.h>
+
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+/**
+ * pidfd_show_fdinfo - print information about a pidfd
+ * @m: proc fdinfo file
+ * @f: file referencing a pidfd
+ *
+ * Pid:
+ * This function will print the pid that a given pidfd refers to in the
+ * pid namespace of the procfs instance.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its pid. This is
+ * similar to calling getppid() on a process whose parent is outside of
+ * its pid namespace.
+ *
+ * NSpid:
+ * If pid namespaces are supported then this function will also print
+ * the pid of a given pidfd refers to for all descendant pid namespaces
+ * starting from the current pid namespace of the instance, i.e. the
+ * Pid field and the first entry in the NSpid field will be identical.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its first NSpid
+ * entry and no others will be shown.
+ * Note that this differs from the Pid and NSpid fields in
+ * /proc/<pid>/status where Pid and NSpid are always shown relative to
+ * the  pid namespace of the procfs instance. The difference becomes
+ * obvious when sending around a pidfd between pid namespaces from a
+ * different branch of the tree, i.e. where no ancestral relation is
+ * present between the pid namespaces:
+ * - create two new pid namespaces ns1 and ns2 in the initial pid
+ *   namespace (also take care to create new mount namespaces in the
+ *   new pid namespace and mount procfs)
+ * - create a process with a pidfd in ns1
+ * - send pidfd from ns1 to ns2
+ * - read /proc/self/fdinfo/<pidfd> and observe that both Pid and NSpid
+ *   have exactly one entry, which is 0
+ */
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid *pid = f->private_data;
+	struct pid_namespace *ns;
+	pid_t nr = -1;
+
+	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
+		ns = proc_pid_ns(file_inode(m->file)->i_sb);
+		nr = pid_nr_ns(pid, ns);
+	}
+
+	seq_put_decimal_ll(m, "Pid:\t", nr);
+
+#ifdef CONFIG_PID_NS
+	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
+	if (nr > 0) {
+		int i;
+
+		/* If nr is non-zero it means that 'pid' is valid and that
+		 * ns, i.e. the pid namespace associated with the procfs
+		 * instance, is in the pid namespace hierarchy of pid.
+		 * Start at one below the already printed level.
+		 */
+		for (i = ns->level + 1; i <= pid->level; i++)
+			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
+	}
+#endif
+	seq_putc(m, '\n');
+}
+#endif
+
+/*
+ * Poll support for process exit notification.
+ */
+static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct pid *pid = file->private_data;
+	bool thread = file->f_flags & PIDFD_THREAD;
+	struct task_struct *task;
+	__poll_t poll_flags = 0;
+
+	poll_wait(file, &pid->wait_pidfd, pts);
+	/*
+	 * Depending on PIDFD_THREAD, inform pollers when the thread
+	 * or the whole thread-group exits.
+	 */
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
+	else if (task->exit_state && (thread || thread_group_empty(task)))
+		poll_flags = EPOLLIN | EPOLLRDNORM;
+	rcu_read_unlock();
+
+	return poll_flags;
+}
+
+const struct file_operations pidfd_fops = {
+	.release	= pidfd_release,
+	.poll		= pidfd_poll,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= pidfd_show_fdinfo,
+#endif
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f22ec90c5c6..662a61f340ce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1993,116 +1993,6 @@ struct pid *pidfd_pid(const struct file *file)
 	return ERR_PTR(-EBADF);
 }
 
-static int pidfd_release(struct inode *inode, struct file *file)
-{
-	struct pid *pid = file->private_data;
-
-	file->private_data = NULL;
-	put_pid(pid);
-	return 0;
-}
-
-#ifdef CONFIG_PROC_FS
-/**
- * pidfd_show_fdinfo - print information about a pidfd
- * @m: proc fdinfo file
- * @f: file referencing a pidfd
- *
- * Pid:
- * This function will print the pid that a given pidfd refers to in the
- * pid namespace of the procfs instance.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its pid. This is
- * similar to calling getppid() on a process whose parent is outside of
- * its pid namespace.
- *
- * NSpid:
- * If pid namespaces are supported then this function will also print
- * the pid of a given pidfd refers to for all descendant pid namespaces
- * starting from the current pid namespace of the instance, i.e. the
- * Pid field and the first entry in the NSpid field will be identical.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its first NSpid
- * entry and no others will be shown.
- * Note that this differs from the Pid and NSpid fields in
- * /proc/<pid>/status where Pid and NSpid are always shown relative to
- * the  pid namespace of the procfs instance. The difference becomes
- * obvious when sending around a pidfd between pid namespaces from a
- * different branch of the tree, i.e. where no ancestral relation is
- * present between the pid namespaces:
- * - create two new pid namespaces ns1 and ns2 in the initial pid
- *   namespace (also take care to create new mount namespaces in the
- *   new pid namespace and mount procfs)
- * - create a process with a pidfd in ns1
- * - send pidfd from ns1 to ns2
- * - read /proc/self/fdinfo/<pidfd> and observe that both Pid and NSpid
- *   have exactly one entry, which is 0
- */
-static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
-{
-	struct pid *pid = f->private_data;
-	struct pid_namespace *ns;
-	pid_t nr = -1;
-
-	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
-		ns = proc_pid_ns(file_inode(m->file)->i_sb);
-		nr = pid_nr_ns(pid, ns);
-	}
-
-	seq_put_decimal_ll(m, "Pid:\t", nr);
-
-#ifdef CONFIG_PID_NS
-	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
-	if (nr > 0) {
-		int i;
-
-		/* If nr is non-zero it means that 'pid' is valid and that
-		 * ns, i.e. the pid namespace associated with the procfs
-		 * instance, is in the pid namespace hierarchy of pid.
-		 * Start at one below the already printed level.
-		 */
-		for (i = ns->level + 1; i <= pid->level; i++)
-			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
-	}
-#endif
-	seq_putc(m, '\n');
-}
-#endif
-
-/*
- * Poll support for process exit notification.
- */
-static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
-{
-	struct pid *pid = file->private_data;
-	bool thread = file->f_flags & PIDFD_THREAD;
-	struct task_struct *task;
-	__poll_t poll_flags = 0;
-
-	poll_wait(file, &pid->wait_pidfd, pts);
-	/*
-	 * Depending on PIDFD_THREAD, inform pollers when the thread
-	 * or the whole thread-group exits.
-	 */
-	rcu_read_lock();
-	task = pid_task(pid, PIDTYPE_PID);
-	if (!task)
-		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
-		poll_flags = EPOLLIN | EPOLLRDNORM;
-	rcu_read_unlock();
-
-	return poll_flags;
-}
-
-const struct file_operations pidfd_fops = {
-	.release = pidfd_release,
-	.poll = pidfd_poll,
-#ifdef CONFIG_PROC_FS
-	.show_fdinfo = pidfd_show_fdinfo,
-#endif
-};
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd

-- 
2.43.0


[-- Attachment #3: 0002-pidfd-add-pidfdfs.eml --]
[-- Type: message/rfc822, Size: 13182 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,  Alexander Viro <viro@zeniv.linux.org.uk>,  Seth Forshee <sforshee@kernel.org>, Tycho Andersen <tycho@tycho.pizza>,  Christian Brauner <brauner@kernel.org>
Subject: [PATCH v2 2/5] pidfd: add pidfdfs
Date: Fri, 16 Feb 2024 12:40:12 +0100
Message-ID: <20240216-vfs-pidfd_fs-v2-2-8365d659464d@kernel.org>

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfd for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.

The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig                 |   6 +++
 fs/pidfdfs.c               | 128 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/pid.h        |   3 ++
 include/linux/pidfdfs.h    |   9 ++++
 include/uapi/linux/magic.h |   1 +
 init/main.c                |   2 +
 kernel/fork.c              |  13 +----
 kernel/nsproxy.c           |   2 +-
 kernel/pid.c               |   7 +++
 9 files changed, 157 insertions(+), 14 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..c7ed65e34820 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PIDFD
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	help
+	  Pidfdfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index 55e8396e7fc4..be4e74cec8b9 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -1,9 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
 #include <linux/mount.h>
 #include <linux/pid.h>
+#include <linux/pidfdfs.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -12,12 +14,25 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op != &pidfd_fops)
+		return ERR_PTR(-EBADF);
+#ifdef CONFIG_FS_PIDFD
+	return file_inode(file)->i_private;
+#else
+	return file->private_data;
+#endif
+}
+
 static int pidfd_release(struct inode *inode, struct file *file)
 {
+#ifndef CONFIG_FS_PIDFD
 	struct pid *pid = file->private_data;
 
 	file->private_data = NULL;
 	put_pid(pid);
+#endif
 	return 0;
 }
 
@@ -59,7 +74,7 @@ static int pidfd_release(struct inode *inode, struct file *file)
  */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
-	struct pid *pid = f->private_data;
+	struct pid *pid = pidfd_pid(f);
 	struct pid_namespace *ns;
 	pid_t nr = -1;
 
@@ -93,7 +108,7 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
-	struct pid *pid = file->private_data;
+	struct pid *pid = pidfd_pid(file);
 	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;
@@ -121,3 +136,112 @@ const struct file_operations pidfd_fops = {
 	.show_fdinfo	= pidfd_show_fdinfo,
 #endif
 };
+
+#ifdef CONFIG_FS_PIDFD
+static struct vfsmount *pidfdfs_mnt __ro_after_init;
+static struct super_block *pidfdfs_sb __ro_after_init;
+
+static void pidfdfs_evict_inode(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+
+	clear_inode(inode);
+	put_pid(pid);
+}
+
+static const struct super_operations pidfdfs_sops = {
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= pidfdfs_evict_inode,
+	.statfs		= simple_statfs,
+};
+
+static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(buffer, buflen, "pidfd:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+const struct dentry_operations pidfdfs_dentry_operations = {
+	.d_delete	= always_delete_dentry,
+	.d_dname	= pidfdfs_dname,
+};
+
+static int pidfdfs_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, PIDFDFS_MAGIC);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ops = &pidfdfs_sops;
+	ctx->dops = &pidfdfs_dentry_operations;
+	return 0;
+}
+
+static struct file_system_type pidfdfs_type = {
+	.name			= "pidfdfs",
+	.init_fs_context	= pidfdfs_init_fs_context,
+	.kill_sb		= kill_anon_super,
+};
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+
+	struct inode *inode;
+	struct file *pidfd_file;
+
+	inode = iget_locked(pidfdfs_sb, pid->ino);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	if (inode->i_state & I_NEW) {
+		inode->i_ino = pid->ino;
+		inode->i_mode = S_IFREG | S_IRUGO;
+		inode->i_fop = &pidfd_fops;
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_private = get_pid(pid);
+		simple_inode_init_ts(inode);
+		unlock_new_inode(inode);
+	}
+
+	pidfd_file = alloc_file_pseudo(inode, pidfdfs_mnt, "", flags, &pidfd_fops);
+	if (IS_ERR(pidfd_file))
+		iput(inode);
+
+	return pidfd_file;
+}
+
+void __init pidfdfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&pidfdfs_type);
+	if (err)
+		panic("Failed to register pidfdfs pseudo filesystem");
+
+	pidfdfs_mnt = kern_mount(&pidfdfs_type);
+	if (IS_ERR(pidfdfs_mnt))
+		panic("Failed to mount pidfdfs pseudo filesystem");
+
+	pidfdfs_sb = pidfdfs_mnt->mnt_sb;
+}
+
+#else /* !CONFIG_FS_PIDFD */
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+	struct file *pidfd_file;
+
+	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+					flags | O_RDWR);
+	if (IS_ERR(pidfd_file))
+		return pidfd_file;
+
+	get_pid(pid);
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid) { }
+void __init pidfdfs_init(void) { }
+#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8124d57752b9..7b6f5deab36a 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -55,6 +55,9 @@ struct pid
 	refcount_t count;
 	unsigned int level;
 	spinlock_t lock;
+#ifdef CONFIG_FS_PIDFD
+	unsigned long ino;
+#endif
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/include/linux/pidfdfs.h b/include/linux/pidfdfs.h
new file mode 100644
index 000000000000..760dbc163625
--- /dev/null
+++ b/include/linux/pidfdfs.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PIDFDFS_H
+#define _LINUX_PIDFDFS_H
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags);
+void __init pidfdfs_init(void);
+void pid_init_pidfdfs(struct pid *pid);
+
+#endif /* _LINUX_PIDFDFS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..a0d5480115c5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define PIDFDFS_MAGIC		0x50494446	/* "PIDF" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..0663003f3146 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/pidfdfs.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1059,6 +1060,7 @@ void start_kernel(void)
 	seq_file_init();
 	proc_root_init();
 	nsfs_init();
+	pidfdfs_init();
 	cpuset_init();
 	cgroup_init();
 	taskstats_init_early();
diff --git a/kernel/fork.c b/kernel/fork.c
index 662a61f340ce..eab2fcc90342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@
 #include <linux/iommu.h>
 #include <linux/rseq.h>
 #include <uapi/linux/pidfd.h>
+#include <linux/pidfdfs.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1985,14 +1986,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
-struct pid *pidfd_pid(const struct file *file)
-{
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
-
-	return ERR_PTR(-EBADF);
-}
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd
@@ -2030,13 +2023,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (pidfd < 0)
 		return pidfd;
 
-	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
-					flags | O_RDWR);
+	pidfd_file = pidfdfs_alloc_file(pid, flags | O_RDWR);
 	if (IS_ERR(pidfd_file)) {
 		put_unused_fd(pidfd);
 		return PTR_ERR(pidfd_file);
 	}
-	get_pid(pid); /* held by pidfd_file now */
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 15781acaac1c..6ec3deec68c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -573,7 +573,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	if (proc_ns_file(f.file))
 		err = validate_ns(&nsset, ns);
 	else
-		err = validate_nsset(&nsset, f.file->private_data);
+		err = validate_nsset(&nsset, pidfd_pid(f.file));
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
diff --git a/kernel/pid.c b/kernel/pid.c
index c1d940fbd314..2c0a9e8f58e2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/pidfdfs.h>
 #include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 
@@ -65,6 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
 
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
+#ifdef CONFIG_FS_PIDFD
+static u64 pidfdfs_ino = 0;
+#endif
 
 /*
  * PID-map pages start out as NULL, they get allocated upon
@@ -272,6 +276,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
+#ifdef CONFIG_FS_PIDFD
+	pid->ino = ++pidfdfs_ino;
+#endif
 	for ( ; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);

-- 
2.43.0


[-- Attachment #4: 0003-dcache-add-d_instantiate_unique_anon.eml --]
[-- Type: message/rfc822, Size: 3927 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,  Alexander Viro <viro@zeniv.linux.org.uk>,  Seth Forshee <sforshee@kernel.org>, Tycho Andersen <tycho@tycho.pizza>,  Christian Brauner <brauner@kernel.org>
Subject: [PATCH v2 3/5] dcache: add d_instantiate_unique_anon()
Date: Fri, 16 Feb 2024 12:40:13 +0100
Message-ID: <20240216-vfs-pidfd_fs-v2-3-8365d659464d@kernel.org>

Add a helper for nsfs and pidfds. Both filesystems only allocate
anonymous dentries that aren't children of any other dentries and aren't
parents of any other dentries. So dentry->d_parent points to dentry. For
each unique inode we only need a unique dentry. Add a helper that both
nsfs and pidfdfs can use. Not exported, only internal.h and others
should refrain from using this.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/dcache.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/internal.h |  1 +
 2 files changed, 50 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index b813528fb147..e492c515b0e6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2642,6 +2642,55 @@ void d_add(struct dentry *entry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_add);
 
+/**
+ * d_instantiate_unique_anon - reuse existing, unhashed dentry or add new one
+ * @entry: dentry to add
+ * @inode: The inode to attach this dentry
+ *
+ * Helper for special filesystems that want to recycle the exact same dentry
+ * over and over. Dentries must be unhashed, IS_ROOT() dentries gotten via
+ * d_alloc_anon(). Anything else is a bug. Caller must provide valid inode.
+ */
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode)
+{
+	struct dentry *alias;
+	unsigned int hash = entry->d_name.hash;
+
+	if (!inode)
+		return NULL;
+
+	if (!IS_ROOT(entry))
+		return NULL;
+
+	if (WARN_ON_ONCE(!d_unhashed(entry)))
+		return NULL;
+
+	if (WARN_ON_ONCE(!hlist_unhashed(&entry->d_u.d_alias)))
+		return NULL;
+
+	spin_lock(&inode->i_lock);
+	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+		if (alias->d_name.hash != hash)
+			continue;
+		if (!d_same_name(alias, entry->d_parent, &entry->d_name))
+			continue;
+		if (WARN_ON_ONCE(!IS_ROOT(alias)))
+			continue;
+		if (WARN_ON_ONCE(!d_unhashed(alias)))
+			continue;
+		spin_lock(&alias->d_lock);
+		dget_dlock(alias);
+		spin_unlock(&alias->d_lock);
+		spin_unlock(&inode->i_lock);
+		return alias;
+	}
+
+	__d_instantiate(entry, inode);
+	spin_unlock(&inode->i_lock);
+	security_d_instantiate(entry, inode); /* groan */
+	return NULL;
+}
+
 /**
  * d_exact_alias - find and hash an exact unhashed alias
  * @entry: dentry to add
diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..41b441c7b2a0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -310,3 +310,4 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode);

-- 
2.43.0


[-- Attachment #5: 0004-pidfdfs-use-new-helper.eml --]
[-- Type: message/rfc822, Size: 2882 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,  Alexander Viro <viro@zeniv.linux.org.uk>,  Seth Forshee <sforshee@kernel.org>, Tycho Andersen <tycho@tycho.pizza>,  Christian Brauner <brauner@kernel.org>
Subject: [PATCH v2 4/5] pidfdfs: use new helper
Date: Fri, 16 Feb 2024 12:40:14 +0100
Message-ID: <20240216-vfs-pidfd_fs-v2-4-8365d659464d@kernel.org>

We currently allocate a new dentry for each opener of the same struct
pid. We don't really need that and with the new helper introduced
earlier we can just reuse any existing dentry.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfdfs.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index be4e74cec8b9..cff617fa3fb9 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -14,6 +14,8 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+#include "internal.h"
+
 struct pid *pidfd_pid(const struct file *file)
 {
 	if (file->f_op != &pidfd_fops)
@@ -189,11 +191,19 @@ struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
 {
 
 	struct inode *inode;
+	struct dentry *dentry, *alias;
+	struct path path;
 	struct file *pidfd_file;
 
+	dentry = d_alloc_anon(pidfdfs_sb);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
 	inode = iget_locked(pidfdfs_sb, pid->ino);
-	if (!inode)
+	if (!inode) {
+		dput(dentry);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	if (inode->i_state & I_NEW) {
 		inode->i_ino = pid->ino;
@@ -205,10 +215,16 @@ struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
 		unlock_new_inode(inode);
 	}
 
-	pidfd_file = alloc_file_pseudo(inode, pidfdfs_mnt, "", flags, &pidfd_fops);
-	if (IS_ERR(pidfd_file))
-		iput(inode);
+	alias = d_instantiate_unique_anon(dentry, inode);
+	if (alias) {
+		dput(dentry);
+		dentry = alias;
+	}
 
+	path.dentry = dentry;
+	path.mnt = mntget(pidfdfs_mnt);
+	pidfd_file = dentry_open(&path, flags, current_cred());
+	path_put(&path);
 	return pidfd_file;
 }
 

-- 
2.43.0


[-- Attachment #6: 0005-nsfs-remove-dentry-stashing-mechanism.eml --]
[-- Type: message/rfc822, Size: 6375 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,  Alexander Viro <viro@zeniv.linux.org.uk>,  Seth Forshee <sforshee@kernel.org>, Tycho Andersen <tycho@tycho.pizza>,  Christian Brauner <brauner@kernel.org>
Subject: [PATCH v2 5/5] nsfs: remove dentry stashing mechanism
Date: Fri, 16 Feb 2024 12:40:15 +0100
Message-ID: <20240216-vfs-pidfd_fs-v2-5-8365d659464d@kernel.org>

As Linus points out this is pretty ugly. With the new helper we can
achieve the same result without the ugliness.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nsfs.c                 | 118 ++++++++++++++++++----------------------------
 include/linux/ns_common.h |   1 -
 include/linux/proc_ns.h   |   1 -
 3 files changed, 45 insertions(+), 75 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 34e1e3e36733..0c7593865ec9 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -33,20 +33,9 @@ static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
 		ns_ops->name, inode->i_ino);
 }
 
-static void ns_prune_dentry(struct dentry *dentry)
-{
-	struct inode *inode = d_inode(dentry);
-	if (inode) {
-		struct ns_common *ns = inode->i_private;
-		atomic_long_set(&ns->stashed, 0);
-	}
-}
-
-const struct dentry_operations ns_dentry_operations =
-{
-	.d_prune	= ns_prune_dentry,
-	.d_delete	= always_delete_dentry,
-	.d_dname	= ns_dname,
+const struct dentry_operations ns_dentry_operations = {
+	.d_delete = always_delete_dentry,
+	.d_dname = ns_dname,
 };
 
 static void nsfs_evict(struct inode *inode)
@@ -58,65 +47,51 @@ static void nsfs_evict(struct inode *inode)
 
 static int __ns_get_path(struct path *path, struct ns_common *ns)
 {
-	struct vfsmount *mnt = nsfs_mnt;
-	struct dentry *dentry;
+	struct super_block *nsfs_sb = nsfs_mnt->mnt_sb;
+	struct dentry *dentry, *alias;
 	struct inode *inode;
-	unsigned long d;
-
-	rcu_read_lock();
-	d = atomic_long_read(&ns->stashed);
-	if (!d)
-		goto slow;
-	dentry = (struct dentry *)d;
-	if (!lockref_get_not_dead(&dentry->d_lockref))
-		goto slow;
-	rcu_read_unlock();
-	ns->ops->put(ns);
-got_it:
-	path->mnt = mntget(mnt);
-	path->dentry = dentry;
-	return 0;
-slow:
-	rcu_read_unlock();
-	inode = new_inode_pseudo(mnt->mnt_sb);
+
+	dentry = d_alloc_anon(nsfs_sb);
+	if (!dentry)
+		return -ENOMEM;
+
+	inode = iget_locked(nsfs_sb, ns->inum);
 	if (!inode) {
-		ns->ops->put(ns);
+		dput(dentry);
 		return -ENOMEM;
 	}
-	inode->i_ino = ns->inum;
-	simple_inode_init_ts(inode);
-	inode->i_flags |= S_IMMUTABLE;
-	inode->i_mode = S_IFREG | S_IRUGO;
-	inode->i_fop = &ns_file_operations;
-	inode->i_private = ns;
-
-	dentry = d_make_root(inode);	/* not the normal use, but... */
-	if (!dentry)
-		return -ENOMEM;
+
+	if (inode->i_state & I_NEW) {
+		inode->i_ino = ns->inum;
+		simple_inode_init_ts(inode);
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_mode = S_IFREG | S_IRUGO;
+		inode->i_fop = &ns_file_operations;
+		inode->i_private = ns;
+		unlock_new_inode(inode);
+	} else {
+		ns->ops->put(ns);
+	}
+
 	dentry->d_fsdata = (void *)ns->ops;
-	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
-	if (d) {
-		d_delete(dentry);	/* make sure ->d_prune() does nothing */
+	alias = d_instantiate_unique_anon(dentry, inode);
+	if (alias) {
 		dput(dentry);
-		cpu_relax();
-		return -EAGAIN;
+		dentry = alias;
 	}
-	goto got_it;
+
+	path->dentry = dentry;
+	path->mnt = mntget(nsfs_mnt);
+	return 0;
 }
 
 int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
-		     void *private_data)
+		   void *private_data)
 {
-	int ret;
-
-	do {
-		struct ns_common *ns = ns_get_cb(private_data);
-		if (!ns)
-			return -ENOENT;
-		ret = __ns_get_path(path, ns);
-	} while (ret == -EAGAIN);
-
-	return ret;
+	struct ns_common *ns = ns_get_cb(private_data);
+	if (!ns)
+		return -ENOENT;
+	return __ns_get_path(path, ns);
 }
 
 struct ns_get_path_task_args {
@@ -147,6 +122,7 @@ int open_related_ns(struct ns_common *ns,
 {
 	struct path path = {};
 	struct file *f;
+	struct ns_common *relative;
 	int err;
 	int fd;
 
@@ -154,18 +130,13 @@ int open_related_ns(struct ns_common *ns,
 	if (fd < 0)
 		return fd;
 
-	do {
-		struct ns_common *relative;
-
-		relative = get_ns(ns);
-		if (IS_ERR(relative)) {
-			put_unused_fd(fd);
-			return PTR_ERR(relative);
-		}
-
-		err = __ns_get_path(&path, relative);
-	} while (err == -EAGAIN);
+	relative = get_ns(ns);
+	if (IS_ERR(relative)) {
+		put_unused_fd(fd);
+		return PTR_ERR(relative);
+	}
 
+	err = __ns_get_path(&path, relative);
 	if (err) {
 		put_unused_fd(fd);
 		return err;
@@ -259,6 +230,7 @@ static const struct super_operations nsfs_ops = {
 	.statfs = simple_statfs,
 	.evict_inode = nsfs_evict,
 	.show_path = nsfs_show_path,
+	.drop_inode = generic_delete_inode,
 };
 
 static int nsfs_init_fs_context(struct fs_context *fc)
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 0f1d024bd958..016258562b5d 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -7,7 +7,6 @@
 struct proc_ns_operations;
 
 struct ns_common {
-	atomic_long_t stashed;
 	const struct proc_ns_operations *ops;
 	unsigned int inum;
 	refcount_t count;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 49539bc416ce..acd3d347a6a5 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -66,7 +66,6 @@ static inline void proc_free_inum(unsigned int inum) {}
 
 static inline int ns_alloc_inum(struct ns_common *ns)
 {
-	atomic_long_set(&ns->stashed, 0);
 	return proc_alloc_inum(&ns->inum);
 }
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-16 11:50             ` Christian Brauner
@ 2024-02-16 16:41               ` Christian Brauner
  2024-02-17 13:59               ` Oleg Nesterov
  1 sibling, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-16 16:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Fri, Feb 16, 2024 at 12:50:45PM +0100, Christian Brauner wrote:
> On Thu, Feb 15, 2024 at 05:11:31PM +0100, Christian Brauner wrote:
> > On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> > > On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > Ok, that turned out to be simpler than I had evisioned - unless I made
> > > > horrible mistakes:
> > > 
> > > Hmm. Could we do the same for nsfs?
> 
> Ok, here's what I got. I'll put the changes to switch both nsfs and
> pidfdfs to the proposed unique mechanism suggested yesterday on top. I
> would send that in two batches in any case. So if that's somehow broken
> then we can just drop it.

Bah, I pasted the version where I forgot an iput() when a reusable
dentry is found. Fixed that already.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-16 11:50             ` Christian Brauner
  2024-02-16 16:41               ` Christian Brauner
@ 2024-02-17 13:59               ` Oleg Nesterov
  2024-02-17 17:30                 ` Linus Torvalds
  2024-02-18  9:30                 ` Christian Brauner
  1 sibling, 2 replies; 61+ messages in thread
From: Oleg Nesterov @ 2024-02-17 13:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On 02/16, Christian Brauner wrote:
>
> +struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
> +{
> +
> +	struct inode *inode;
> +	struct file *pidfd_file;
> +
> +	inode = iget_locked(pidfdfs_sb, pid->ino);
> +	if (!inode)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (inode->i_state & I_NEW) {
> +		inode->i_ino = pid->ino;

I guess this is unnecessary, iget_locked() should initialize i_ino if I_NEW ?

But I have a really stupid (I know nothing about vfs) question, why do we
need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?

IIUC, if this pid is freed and then another "struct pid" has the same address
we can rely on __wait_on_freeing_inode() ?

Oleg.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-17 13:59               ` Oleg Nesterov
@ 2024-02-17 17:30                 ` Linus Torvalds
  2024-02-17 17:38                   ` Linus Torvalds
  2024-02-18 11:15                   ` Christian Brauner
  2024-02-18  9:30                 ` Christian Brauner
  1 sibling, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2024-02-17 17:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 2386 bytes --]

On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
>
> But I have a really stupid (I know nothing about vfs) question, why do we
> need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
>
> IIUC, if this pid is freed and then another "struct pid" has the same address
> we can rely on __wait_on_freeing_inode() ?

Heh. Maybe it would work, but we really don't want to expose core
kernel pointers to user space as the inode number.

So then we'd have to add extra hackery to that (ie we'd have to
intercept stat calls, and we'd have to have something else for
->d_dname() etc..).

Those are all things that the VFS does support, but ...

So I do prefer Christian's new approach, although some of it ends up
being a bit unclear.

Christian, can you explain why this:

        spin_lock(&alias->d_lock);
        dget_dlock(alias);
        spin_unlock(&alias->d_lock);

instead of just 'dget()'?

Also, while I found the old __ns_get_path() to be fairly disgusting, I
actually think it's simpler and clearer than playing around with the
dentry alias list. So my expectation on code sharing was that you'd
basically lift the old __ns_get_path(), make *that* the helper, and
just pass it an argument that is the pointer to the filesystem
"stashed" entry...

And yes, using "atomic_long_t" for stashed is a crime against
humanity. It's also entirely pointless. There are no actual atomic
operations that the code wants except for reading and writing (aka
READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
"atomic_long_t" buys the code nothing, and only makes things more
complicated and requires crazy casts.

So I think the nsfs.c code should be changed to do

-       atomic_long_t stashed;
+       struct dentry *stashed;

and remove the crazy workarounds for using the wrong type.

Something like the attached patch.

Then, I think the whole "lockref_get_not_dead()" etc games of
__ns_get_path() could be extracted out into a helper function that
takes that "&ns->stashed" pointer as an argument, and now that helper
could also be used for pidfs, except pidfs obviously just does
"&pid->stashed" instead.

Hmm?

Entirely untested patch. Also, the above idea may be broken because of
some obvious issue that I didn't think about. Christian?

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2108 bytes --]

 fs/nsfs.c                 | 11 ++++-------
 include/linux/ns_common.h |  2 +-
 include/linux/proc_ns.h   |  2 +-
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 34e1e3e36733..1104db67c9a4 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -38,7 +38,7 @@ static void ns_prune_dentry(struct dentry *dentry)
 	struct inode *inode = d_inode(dentry);
 	if (inode) {
 		struct ns_common *ns = inode->i_private;
-		atomic_long_set(&ns->stashed, 0);
+		WRITE_ONCE(ns->stashed, NULL);
 	}
 }
 
@@ -61,13 +61,11 @@ static int __ns_get_path(struct path *path, struct ns_common *ns)
 	struct vfsmount *mnt = nsfs_mnt;
 	struct dentry *dentry;
 	struct inode *inode;
-	unsigned long d;
 
 	rcu_read_lock();
-	d = atomic_long_read(&ns->stashed);
-	if (!d)
+	dentry = READ_ONCE(ns->stashed);
+	if (!dentry)
 		goto slow;
-	dentry = (struct dentry *)d;
 	if (!lockref_get_not_dead(&dentry->d_lockref))
 		goto slow;
 	rcu_read_unlock();
@@ -94,8 +92,7 @@ static int __ns_get_path(struct path *path, struct ns_common *ns)
 	if (!dentry)
 		return -ENOMEM;
 	dentry->d_fsdata = (void *)ns->ops;
-	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
-	if (d) {
+	if (cmpxchg(&ns->stashed, NULL, dentry)) {
 		d_delete(dentry);	/* make sure ->d_prune() does nothing */
 		dput(dentry);
 		cpu_relax();
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 0f1d024bd958..7d22ea50b098 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -7,7 +7,7 @@
 struct proc_ns_operations;
 
 struct ns_common {
-	atomic_long_t stashed;
+	struct dentry *stashed;
 	const struct proc_ns_operations *ops;
 	unsigned int inum;
 	refcount_t count;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 49539bc416ce..5ea470eb4d76 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -66,7 +66,7 @@ static inline void proc_free_inum(unsigned int inum) {}
 
 static inline int ns_alloc_inum(struct ns_common *ns)
 {
-	atomic_long_set(&ns->stashed, 0);
+	WRITE_ONCE(ns->stashed, NULL);
 	return proc_alloc_inum(&ns->inum);
 }
 

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-17 17:30                 ` Linus Torvalds
@ 2024-02-17 17:38                   ` Linus Torvalds
  2024-02-18 11:15                   ` Christian Brauner
  1 sibling, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2024-02-17 17:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Sat, 17 Feb 2024 at 09:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And yes, using "atomic_long_t" for stashed is a crime against
> humanity. It's also entirely pointless. There are no actual atomic
> operations that the code wants [..]

Just to clarify: the reason to use 'atomic_long_t' is for the
_arithmetic_ atomic ops. So the "inc/dec/inc_and_test" etc.

The code wants none of that, and can make do with the regular smp-safe
cmpxchg operation that works on any word-sized type.

              Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-17 13:59               ` Oleg Nesterov
  2024-02-17 17:30                 ` Linus Torvalds
@ 2024-02-18  9:30                 ` Christian Brauner
  1 sibling, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-18  9:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Sat, Feb 17, 2024 at 02:59:16PM +0100, Oleg Nesterov wrote:
> On 02/16, Christian Brauner wrote:
> >
> > +struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
> > +{
> > +
> > +	struct inode *inode;
> > +	struct file *pidfd_file;
> > +
> > +	inode = iget_locked(pidfdfs_sb, pid->ino);
> > +	if (!inode)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	if (inode->i_state & I_NEW) {
> > +		inode->i_ino = pid->ino;
> 
> I guess this is unnecessary, iget_locked() should initialize i_ino if I_NEW ?

Yes, it does. I just like to be explicit in such cases.

> 
> But I have a really stupid (I know nothing about vfs) question, why do we
> need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> 
> IIUC, if this pid is freed and then another "struct pid" has the same address
> we can rely on __wait_on_freeing_inode() ?

Yeah, I had thought about something like this but see Linus' reply.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-17 17:30                 ` Linus Torvalds
  2024-02-17 17:38                   ` Linus Torvalds
@ 2024-02-18 11:15                   ` Christian Brauner
  2024-02-18 11:33                     ` Christian Brauner
  2024-02-18 14:27                     ` Oleg Nesterov
  1 sibling, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-18 11:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > But I have a really stupid (I know nothing about vfs) question, why do we
> > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> >
> > IIUC, if this pid is freed and then another "struct pid" has the same address
> > we can rely on __wait_on_freeing_inode() ?
> 
> Heh. Maybe it would work, but we really don't want to expose core
> kernel pointers to user space as the inode number.

And then also the property that the inode number is unique for the
system lifetime is extremely useful for userspace and I would like to
retain that property.

> 
> So then we'd have to add extra hackery to that (ie we'd have to
> intercept stat calls, and we'd have to have something else for
> ->d_dname() etc..).
> 
> Those are all things that the VFS does support, but ...
> 
> So I do prefer Christian's new approach, although some of it ends up
> being a bit unclear.
> 
> Christian, can you explain why this:
> 
>         spin_lock(&alias->d_lock);
>         dget_dlock(alias);
>         spin_unlock(&alias->d_lock);
> 
> instead of just 'dget()'?

No reason other than I forgot to switch to dget().

> 
> Also, while I found the old __ns_get_path() to be fairly disgusting, I
> actually think it's simpler and clearer than playing around with the
> dentry alias list. So my expectation on code sharing was that you'd

It's overall probably also cheaper, I think.

> basically lift the old __ns_get_path(), make *that* the helper, and
> just pass it an argument that is the pointer to the filesystem
> "stashed" entry...
> 
> And yes, using "atomic_long_t" for stashed is a crime against
> humanity. It's also entirely pointless. There are no actual atomic
> operations that the code wants except for reading and writing (aka
> READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> "atomic_long_t" buys the code nothing, and only makes things more
> complicated and requires crazy casts.

Yup, I had that as a draft and that introduced struct ino_stash which
contained a dentry pointer and the inode number using cmpxchg(). But I
decided against this because ns_common.h would require to have access to
ino_stash definition so we wouldn't just able to hide it in internal.h
where it should belong.

> 
> So I think the nsfs.c code should be changed to do
> 
> -       atomic_long_t stashed;
> +       struct dentry *stashed;
> 
> and remove the crazy workarounds for using the wrong type.
> 
> Something like the attached patch.
> 
> Then, I think the whole "lockref_get_not_dead()" etc games of
> __ns_get_path() could be extracted out into a helper function that
> takes that "&ns->stashed" pointer as an argument, and now that helper
> could also be used for pidfs, except pidfs obviously just does
> "&pid->stashed" instead.
> 
> Hmm?
> 
> Entirely untested patch. Also, the above idea may be broken because of
> some obvious issue that I didn't think about. Christian?

Yeah, sure. Works for me. Let me play with something.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-18 11:15                   ` Christian Brauner
@ 2024-02-18 11:33                     ` Christian Brauner
  2024-02-18 17:54                       ` Christian Brauner
  2024-02-18 14:27                     ` Oleg Nesterov
  1 sibling, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-18 11:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Sun, Feb 18, 2024 at 12:15:02PM +0100, Christian Brauner wrote:
> On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > >
> > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > we can rely on __wait_on_freeing_inode() ?
> > 
> > Heh. Maybe it would work, but we really don't want to expose core
> > kernel pointers to user space as the inode number.
> 
> And then also the property that the inode number is unique for the
> system lifetime is extremely useful for userspace and I would like to
> retain that property.
> 
> > 
> > So then we'd have to add extra hackery to that (ie we'd have to
> > intercept stat calls, and we'd have to have something else for
> > ->d_dname() etc..).
> > 
> > Those are all things that the VFS does support, but ...
> > 
> > So I do prefer Christian's new approach, although some of it ends up
> > being a bit unclear.
> > 
> > Christian, can you explain why this:
> > 
> >         spin_lock(&alias->d_lock);
> >         dget_dlock(alias);
> >         spin_unlock(&alias->d_lock);
> > 
> > instead of just 'dget()'?
> 
> No reason other than I forgot to switch to dget().
> 
> > 
> > Also, while I found the old __ns_get_path() to be fairly disgusting, I
> > actually think it's simpler and clearer than playing around with the
> > dentry alias list. So my expectation on code sharing was that you'd
> 
> It's overall probably also cheaper, I think.
> 
> > basically lift the old __ns_get_path(), make *that* the helper, and
> > just pass it an argument that is the pointer to the filesystem
> > "stashed" entry...
> > 
> > And yes, using "atomic_long_t" for stashed is a crime against
> > humanity. It's also entirely pointless. There are no actual atomic
> > operations that the code wants except for reading and writing (aka
> > READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> > "atomic_long_t" buys the code nothing, and only makes things more
> > complicated and requires crazy casts.
> 
> Yup, I had that as a draft and that introduced struct ino_stash which
> contained a dentry pointer and the inode number using cmpxchg(). But I
> decided against this because ns_common.h would require to have access to
> ino_stash definition so we wouldn't just able to hide it in internal.h
> where it should belong.

Right, I remember. The annoying thing will be how to cleanly handle this
without having to pass too many parameters because we need d_fsdata, the
vfsmount, and the inode->i_fop. So let me see if I can get this to
something that doesn't look too ugly.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-18 11:15                   ` Christian Brauner
  2024-02-18 11:33                     ` Christian Brauner
@ 2024-02-18 14:27                     ` Oleg Nesterov
  1 sibling, 0 replies; 61+ messages in thread
From: Oleg Nesterov @ 2024-02-18 14:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On 02/18, Christian Brauner wrote:
>
> On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > >
> > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > we can rely on __wait_on_freeing_inode() ?
> >
> > Heh. Maybe it would work, but we really don't want to expose core
> > kernel pointers to user space as the inode number.

We could use ptr_to_hashval(pid).

> And then also the property that the inode number is unique for the
> system lifetime is extremely useful for userspace and I would like to
> retain that property.

OK.

and perhaps task->thread_pid->ino can also be used as task_monotonic_id(task)
if we move the pid->ino initialization into init_task_pid(PIDTYPE_PID), this
allows to implement for_each_process_thread_break/continue... Nevermind, please
forget.

> > > +	if (inode->i_state & I_NEW) {
> > > +		inode->i_ino = pid->ino;
> >
> > I guess this is unnecessary, iget_locked() should initialize i_ino if I_NEW ?
>
> Yes, it does. I just like to be explicit in such cases.

Well. Of course I won't insist, this is minor. But to me this adds the
unnecessary confusion, as if we need to override ->ino for some reason.

Oleg.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-18 11:33                     ` Christian Brauner
@ 2024-02-18 17:54                       ` Christian Brauner
  2024-02-18 18:08                         ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-18 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 3731 bytes --]

On Sun, Feb 18, 2024 at 12:33:41PM +0100, Christian Brauner wrote:
> On Sun, Feb 18, 2024 at 12:15:02PM +0100, Christian Brauner wrote:
> > On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > > >
> > > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > > we can rely on __wait_on_freeing_inode() ?
> > > 
> > > Heh. Maybe it would work, but we really don't want to expose core
> > > kernel pointers to user space as the inode number.
> > 
> > And then also the property that the inode number is unique for the
> > system lifetime is extremely useful for userspace and I would like to
> > retain that property.
> > 
> > > 
> > > So then we'd have to add extra hackery to that (ie we'd have to
> > > intercept stat calls, and we'd have to have something else for
> > > ->d_dname() etc..).
> > > 
> > > Those are all things that the VFS does support, but ...
> > > 
> > > So I do prefer Christian's new approach, although some of it ends up
> > > being a bit unclear.
> > > 
> > > Christian, can you explain why this:
> > > 
> > >         spin_lock(&alias->d_lock);
> > >         dget_dlock(alias);
> > >         spin_unlock(&alias->d_lock);
> > > 
> > > instead of just 'dget()'?
> > 
> > No reason other than I forgot to switch to dget().
> > 
> > > 
> > > Also, while I found the old __ns_get_path() to be fairly disgusting, I
> > > actually think it's simpler and clearer than playing around with the
> > > dentry alias list. So my expectation on code sharing was that you'd
> > 
> > It's overall probably also cheaper, I think.
> > 
> > > basically lift the old __ns_get_path(), make *that* the helper, and
> > > just pass it an argument that is the pointer to the filesystem
> > > "stashed" entry...
> > > 
> > > And yes, using "atomic_long_t" for stashed is a crime against
> > > humanity. It's also entirely pointless. There are no actual atomic
> > > operations that the code wants except for reading and writing (aka
> > > READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> > > "atomic_long_t" buys the code nothing, and only makes things more
> > > complicated and requires crazy casts.
> > 
> > Yup, I had that as a draft and that introduced struct ino_stash which
> > contained a dentry pointer and the inode number using cmpxchg(). But I
> > decided against this because ns_common.h would require to have access to
> > ino_stash definition so we wouldn't just able to hide it in internal.h
> > where it should belong.
> 
> Right, I remember. The annoying thing will be how to cleanly handle this
> without having to pass too many parameters because we need d_fsdata, the
> vfsmount, and the inode->i_fop. So let me see if I can get this to
> something that doesn't look too ugly.

So, I'm running out of time today so I'm appending the draft I jotted
down now (untested).
Roughly, here's what I think would work for both nsfs and pidfs (I got
the hint and will rename from pidfdfs ;)). The thing that makes it a bit
tricky is that we need to indicate to the caller whether we've reused a
stashed or added a new inode/dentry so the caller can put the reference
to the object it took for i_private in case we reused a dentry/inode. On
EAGAIN i_private is property of the fs and will be cleaned up in
->evict(). Alternative is a callback for getting a reference which I
think is also ugly. Better suggestions welcome of course.

[-- Attachment #2: 0001-RFC-UNTESTED-LIKELY-STILL-BROKEN-internal-add-path_f.patch --]
[-- Type: text/x-diff, Size: 3166 bytes --]

From 281553f0059a889476dba5f4460570ea08ceefe5 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Sun, 18 Feb 2024 14:50:13 +0100
Subject: [PATCH 1/3] [RFC UNTESTED LIKELY STILL BROKEN] internal: add
 path_from_stashed()

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/internal.h |  3 ++
 fs/libfs.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..cfddaec6fbf6 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -310,3 +310,6 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+int path_from_stashed(struct dentry **stashed, unsigned long ino,
+		      struct vfsmount *mnt, const struct file_operations *fops,
+		      void *data, struct path *path);
diff --git a/fs/libfs.c b/fs/libfs.c
index eec6031b0155..af46a83cd476 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1973,3 +1973,81 @@ struct timespec64 simple_inode_init_ts(struct inode *inode)
 	return ts;
 }
 EXPORT_SYMBOL(simple_inode_init_ts);
+
+static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
+{
+	struct dentry *dentry;
+
+	rcu_read_lock();
+	dentry = READ_ONCE(stashed);
+	if (!dentry || !lockref_get_not_dead(&dentry->d_lockref))
+		dentry = NULL;
+	rcu_read_unlock();
+	return dentry;
+}
+
+static struct dentry *stash_dentry(struct dentry **stashed, unsigned long ino,
+				   struct super_block *sb,
+				   const struct file_operations *fops,
+				   void *data)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	dentry = d_alloc_anon(sb);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
+	inode = new_inode_pseudo(sb);
+	if (!inode) {
+		dput(dentry);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	inode->i_ino = ino;
+	inode->i_flags |= S_IMMUTABLE;
+	inode->i_mode = S_IFREG | S_IRUGO;
+	inode->i_fop = fops;
+	inode->i_private = data;
+	simple_inode_init_ts(inode);
+
+	/* @data is now owned by the fs */
+	d_instantiate(dentry, inode);
+
+	if (cmpxchg(stashed, NULL, dentry)) {
+		d_delete(dentry); /* make sure ->d_prune() does nothing */
+		dput(dentry);
+		cpu_relax();
+		return ERR_PTR(-EAGAIN);
+	}
+
+	return dentry;
+}
+
+/*
+ * Try to retrieve stashed dentry or allocate a new one. Indicate to the caller
+ * whether we reused an existing one by returning 0 or when we added a new one
+ * by returning 1. This allows the caller to put any references. Alternative is
+ * a callback which is ugly.
+ */
+int path_from_stashed(struct dentry **stashed, unsigned long ino,
+		      struct vfsmount *mnt, const struct file_operations *fops,
+		      void *data, struct path *path)
+{
+	struct dentry *dentry;
+	int ret = 0;
+
+	dentry = get_stashed_dentry(*stashed);
+	if (dentry)
+		goto out_path;
+
+	dentry = stash_dentry(stashed, ino, mnt->mnt_sb, fops, data);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+	ret = 1;
+
+out_path:
+	path->dentry = dentry;
+	path->mnt = mntget(mnt);
+	return ret;
+}
-- 
2.43.0


[-- Attachment #3: 0002-RFC-UNTESTED-LIKELY-STILL-BROKEN-nsfs-convert-to-pat.patch --]
[-- Type: text/x-diff, Size: 4301 bytes --]

From ecf6c69f62a3b96926d1a4bb5f43dc18d6a60b0a Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Sun, 18 Feb 2024 14:51:23 +0100
Subject: [PATCH 2/3] [RFC UNTESTED LIKELY STILL BROKEN] nsfs: convert to
 path_from_stashed() helper

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nsfs.c                 | 70 +++++++++------------------------------
 include/linux/ns_common.h |  2 +-
 include/linux/proc_ns.h   |  2 +-
 3 files changed, 18 insertions(+), 56 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 34e1e3e36733..31d02fb6cb2e 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -38,7 +38,7 @@ static void ns_prune_dentry(struct dentry *dentry)
 	struct inode *inode = d_inode(dentry);
 	if (inode) {
 		struct ns_common *ns = inode->i_private;
-		atomic_long_set(&ns->stashed, 0);
+		WRITE_ONCE(ns->stashed, NULL);
 	}
 }
 
@@ -56,54 +56,6 @@ static void nsfs_evict(struct inode *inode)
 	ns->ops->put(ns);
 }
 
-static int __ns_get_path(struct path *path, struct ns_common *ns)
-{
-	struct vfsmount *mnt = nsfs_mnt;
-	struct dentry *dentry;
-	struct inode *inode;
-	unsigned long d;
-
-	rcu_read_lock();
-	d = atomic_long_read(&ns->stashed);
-	if (!d)
-		goto slow;
-	dentry = (struct dentry *)d;
-	if (!lockref_get_not_dead(&dentry->d_lockref))
-		goto slow;
-	rcu_read_unlock();
-	ns->ops->put(ns);
-got_it:
-	path->mnt = mntget(mnt);
-	path->dentry = dentry;
-	return 0;
-slow:
-	rcu_read_unlock();
-	inode = new_inode_pseudo(mnt->mnt_sb);
-	if (!inode) {
-		ns->ops->put(ns);
-		return -ENOMEM;
-	}
-	inode->i_ino = ns->inum;
-	simple_inode_init_ts(inode);
-	inode->i_flags |= S_IMMUTABLE;
-	inode->i_mode = S_IFREG | S_IRUGO;
-	inode->i_fop = &ns_file_operations;
-	inode->i_private = ns;
-
-	dentry = d_make_root(inode);	/* not the normal use, but... */
-	if (!dentry)
-		return -ENOMEM;
-	dentry->d_fsdata = (void *)ns->ops;
-	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
-	if (d) {
-		d_delete(dentry);	/* make sure ->d_prune() does nothing */
-		dput(dentry);
-		cpu_relax();
-		return -EAGAIN;
-	}
-	goto got_it;
-}
-
 int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 		     void *private_data)
 {
@@ -113,10 +65,16 @@ int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 		struct ns_common *ns = ns_get_cb(private_data);
 		if (!ns)
 			return -ENOENT;
-		ret = __ns_get_path(path, ns);
+		ret = path_from_stashed(&ns->stashed, ns->inum, nsfs_mnt,
+					&ns_file_operations, ns, path);
+		if (!ret || ret != -EAGAIN)
+			ns->ops->put(ns);
 	} while (ret == -EAGAIN);
 
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return 0;
 }
 
 struct ns_get_path_task_args {
@@ -163,10 +121,13 @@ int open_related_ns(struct ns_common *ns,
 			return PTR_ERR(relative);
 		}
 
-		err = __ns_get_path(&path, relative);
+		err = path_from_stashed(&ns->stashed, ns->inum, nsfs_mnt,
+					&ns_file_operations, ns, &path);
+		if (!err || err != -EAGAIN)
+			ns->ops->put(ns);
 	} while (err == -EAGAIN);
 
-	if (err) {
+	if (err < 0) {
 		put_unused_fd(fd);
 		return err;
 	}
@@ -249,7 +210,8 @@ bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
 static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
 {
 	struct inode *inode = d_inode(dentry);
-	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;
+	const struct ns_common *ns = inode->i_private;
+	const struct proc_ns_operations *ns_ops = ns->ops;
 
 	seq_printf(seq, "%s:[%lu]", ns_ops->name, inode->i_ino);
 	return 0;
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 0f1d024bd958..7d22ea50b098 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -7,7 +7,7 @@
 struct proc_ns_operations;
 
 struct ns_common {
-	atomic_long_t stashed;
+	struct dentry *stashed;
 	const struct proc_ns_operations *ops;
 	unsigned int inum;
 	refcount_t count;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 49539bc416ce..5ea470eb4d76 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -66,7 +66,7 @@ static inline void proc_free_inum(unsigned int inum) {}
 
 static inline int ns_alloc_inum(struct ns_common *ns)
 {
-	atomic_long_set(&ns->stashed, 0);
+	WRITE_ONCE(ns->stashed, NULL);
 	return proc_alloc_inum(&ns->inum);
 }
 
-- 
2.43.0


[-- Attachment #4: 0003-RFC-UNTESTED-LIKELY-STILL-BROKEN-pidfds-convert-to-p.patch --]
[-- Type: text/x-diff, Size: 3121 bytes --]

From 9bd2f66776f06621ae4a71d511615272971ef293 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Sun, 18 Feb 2024 14:52:24 +0100
Subject: [PATCH 3/3] [RFC UNTESTED LIKELY STILL BROKEN] pidfds: convert to
 path_from_stashed() helper

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfdfs.c        | 43 ++++++++++++++++++++++++++-----------------
 include/linux/pid.h |  1 +
 kernel/pid.c        |  1 +
 3 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index be4e74cec8b9..3e1204553ef2 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -14,6 +14,8 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+#include "internal.h"
+
 struct pid *pidfd_pid(const struct file *file)
 {
 	if (file->f_op != &pidfd_fops)
@@ -161,9 +163,21 @@ static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
 			     d_inode(dentry)->i_ino);
 }
 
+static void pidfdfs_prune_dentry(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	inode = d_inode(dentry);
+	if (inode) {
+		struct pid *pid = inode->i_private;
+		WRITE_ONCE(pid->stashed, NULL);
+	}
+}
+
 const struct dentry_operations pidfdfs_dentry_operations = {
 	.d_delete	= always_delete_dentry,
 	.d_dname	= pidfdfs_dname,
+	.d_prune	= pidfdfs_prune_dentry,
 };
 
 static int pidfdfs_init_fs_context(struct fs_context *fc)
@@ -188,27 +202,22 @@ static struct file_system_type pidfdfs_type = {
 struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
 {
 
-	struct inode *inode;
 	struct file *pidfd_file;
+	struct path path;
+	int ret;
 
-	inode = iget_locked(pidfdfs_sb, pid->ino);
-	if (!inode)
-		return ERR_PTR(-ENOMEM);
-
-	if (inode->i_state & I_NEW) {
-		inode->i_ino = pid->ino;
-		inode->i_mode = S_IFREG | S_IRUGO;
-		inode->i_fop = &pidfd_fops;
-		inode->i_flags |= S_IMMUTABLE;
-		inode->i_private = get_pid(pid);
-		simple_inode_init_ts(inode);
-		unlock_new_inode(inode);
-	}
+	do {
+		ret = path_from_stashed(&pid->stashed, pid->ino, pidfdfs_mnt,
+					&pidfd_fops, pid, &path);
+	} while (ret == -EAGAIN);
 
-	pidfd_file = alloc_file_pseudo(inode, pidfdfs_mnt, "", flags, &pidfd_fops);
-	if (IS_ERR(pidfd_file))
-		iput(inode);
+	if (ret < 0)
+		return ERR_PTR(ret);
 
+	if (ret)
+		get_pid(pid);
+	pidfd_file = dentry_open(&path, flags, current_cred());
+	path_put(&path);
 	return pidfd_file;
 }
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7b6f5deab36a..3d1e817a809f 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -56,6 +56,7 @@ struct pid
 	unsigned int level;
 	spinlock_t lock;
 #ifdef CONFIG_FS_PIDFD
+	struct dentry *stashed;
 	unsigned long ino;
 #endif
 	/* lists of tasks that use this pid */
diff --git a/kernel/pid.c b/kernel/pid.c
index 2c0a9e8f58e2..f2f418ecf232 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -277,6 +277,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 #ifdef CONFIG_FS_PIDFD
+	pid->stashed = NULL;
 	pid->ino = ++pidfdfs_ino;
 #endif
 	for ( ; upid >= pid->numbers; --upid) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-18 17:54                       ` Christian Brauner
@ 2024-02-18 18:08                         ` Linus Torvalds
  2024-02-18 18:57                           ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-18 18:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Sun, 18 Feb 2024 at 09:54, Christian Brauner <brauner@kernel.org> wrote:
>
> So, I'm running out of time today so I'm appending the draft I jotted
> down now (untested).

This looks quite nice to me. Written like that with those well-named
helpers it is much more readable than the old nsfs code, honestly.

The only ugliness I see is the one that comes from the original code -
I'm not thrilled about the "return -EAGAIN" part, and I think that if
we found a previously stashed entry after all, we should loop.

But I think that whole horror comes from a fear of an endless loop
when the dentry is marked dead, and another CPU still sees the old
value (so you can't re-use it, and yet it's not NULL).

Which probably never happens, but I didn't think about it too much.

Anyway, that issue is pre-existing code, and your patch looks like an
improvement regardless, so that is *not* in any way a NAK, it's just
me musing on the code.

Al's email address has been bouncing for me for the last week or so,
so presumably he is not getting this and commenting on it. Oh well.

Anyway, ACK from me on this approach (assuming it passes your testing
when you get back, of course).

              Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-18 18:08                         ` Linus Torvalds
@ 2024-02-18 18:57                           ` Linus Torvalds
  2024-02-19 18:05                             ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-18 18:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Sun, 18 Feb 2024 at 10:08, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The only ugliness I see is the one that comes from the original code -
> I'm not thrilled about the "return -EAGAIN" part, and I think that if
> we found a previously stashed entry after all, we should loop.
>
> But I think that whole horror comes from a fear of an endless loop
> when the dentry is marked dead, and another CPU still sees the old
> value (so you can't re-use it, and yet it's not NULL).

Actually, I think this is fairly easily fixable, but let's fix it
*after* you've done your cleanups.

The eventual fix is fairly simple: allow installing a new dentry not
just as a replacement for a previous NULL dentry, but also replacing a
previous dead dentry.

That requires just two simple changes:

 - the ->d_prune() callback should no longer just blindly set the
stashed value to NULL, it would do

        // Somebody could have re-used our stash as the dentry
        // died, so we only NULL it out of if matches our pruned one
        cmpxchg(&stashed, dentry, NULL);

 - when installing, instead of doing

        if (cmpxchg(&stashed, NULL, dentry) .. FAIL ..

   we'd loop with something like this:

        guard(rcu)();
        for (;;) {
                struct dentry *old;

                // Assume any old dentry was cleared out
                old = cmpxchg(&stashed, NULL, dentry);
                if (likely(!old))
                        break;

                // Maybe somebody else installed a good dentry
                // .. so release ours and use the new one
                if (lockref_get_not_dead(&old->d_lockref)) {
                        d_delete(dentry);
                        dput(dentry);
                        return old;
                }

                // There's an old dead dentry there, try to take it over
                if (likely(try_cmpxchg(&stashed, old, dentry)))
                        break;

                // Ooops, that failed, to back and try again
                cpu_relax();
        }

        // We successfully installed our dentry
        // as the new stashed one
        return dentry;

which really isn't doesn't look that complicated (note the RCU guard
as a way to make sure this all runs RCU-locked without needing to
worry about the unlock sequence).

Note: your initial optimistic "get_stashed_dentry()" stays exactly as
it is. The above loop is just for the "oh, we didn't trivially re-use
an old dentry, so now we need to allocate a new one and install it"
case.

Anyway, the above is written just in the MUA, there's no testing of
the above, and again - I think this should be done *after* you've done
the cleanups of the current code. But I think it would clarify the odd
race condition with an old dentry dying just as a new one is created,
and make sure there isn't some -EAGAIN case that people need to worry
about.

Because either we can re-use the old one, or there isn't an old one,
or we find a dead one that can't be reused but can just be replaced.

Fairly straightforward, no?

               Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-18 18:57                           ` Linus Torvalds
@ 2024-02-19 18:05                             ` Christian Brauner
  2024-02-19 18:34                               ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-19 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

> Fairly straightforward, no?

So, I've finished the series today. It took a while because I wanted to
hammer on it with stress-ng --{procfs,pidfd}, run LTP, and then do some
other tests. It survives everything now for over 24h. There were a few
bugs as expected in the first draft.

@Linus, if you're up for it, please take a look at:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.pidfd

The topmost 6 commits contain everything we've had on here. I renamed
pidfdfs to pidfs and all helpers completely mechanically. The rest is
unchanged. At the end we have the conversion fo path_from_stashed(). And
the topmost commit implements your suggestion. Let me know if your ACK
still stands.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-19 18:05                             ` Christian Brauner
@ 2024-02-19 18:34                               ` Linus Torvalds
  2024-02-19 21:18                                 ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-19 18:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Mon, 19 Feb 2024 at 10:05, Christian Brauner <brauner@kernel.org> wrote:
>
> @Linus, if you're up for it, please take a look at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.pidfd
>
> The topmost 6 commits contain everything we've had on here.

Looks ok. The commit message on that last one reads a bit oddly,
because the "Quoting Linus" part looks like it means just the first
quote, even if it's really everything.

I'd suggest you just not attribute that explanation to me at all, and
edit it down to just a neutral explanation of what is going on.

But the code itself looks fine, and I like how it just cleaned up the
callers a lot now that they don't have that odd EAGAIN loop thing.

I expected that to happen, of course, and it was the point of my
suggestion, but it's still nice to actually see it as a patch that
removes the nasty code rather than just my "I think that's ugly and
could be done differently".

             Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-19 18:34                               ` Linus Torvalds
@ 2024-02-19 21:18                                 ` Christian Brauner
  2024-02-19 23:24                                   ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-19 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Mon, Feb 19, 2024 at 10:34:47AM -0800, Linus Torvalds wrote:
> On Mon, 19 Feb 2024 at 10:05, Christian Brauner <brauner@kernel.org> wrote:
> >
> > @Linus, if you're up for it, please take a look at:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.pidfd
> >
> > The topmost 6 commits contain everything we've had on here.
> 
> Looks ok. The commit message on that last one reads a bit oddly,
> because the "Quoting Linus" part looks like it means just the first
> quote, even if it's really everything.
> 
> I'd suggest you just not attribute that explanation to me at all, and
> edit it down to just a neutral explanation of what is going on.

Done.

> 
> But the code itself looks fine, and I like how it just cleaned up the
> callers a lot now that they don't have that odd EAGAIN loop thing.

Yup.

> 
> I expected that to happen, of course, and it was the point of my
> suggestion, but it's still nice to actually see it as a patch that
> removes the nasty code rather than just my "I think that's ugly and
> could be done differently".

I've moved the shared cmpxchg() bit in {ns,pidfs}_prune_dentry() to a
tiny helper so the cmpxchg() isn't coded in the open and is documented
in a single location. Pushed.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-19 21:18                                 ` Christian Brauner
@ 2024-02-19 23:24                                   ` Linus Torvalds
  0 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2024-02-19 23:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Al Viro, linux-fsdevel, Seth Forshee, Tycho Andersen

On Mon, 19 Feb 2024 at 13:18, Christian Brauner <brauner@kernel.org> wrote:
>
> I've moved the shared cmpxchg() bit in {ns,pidfs}_prune_dentry() to a
> tiny helper so the cmpxchg() isn't coded in the open and is documented
> in a single location.

Ack, thumbs up. LGTM,

              Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
  2024-02-13 17:17   ` Linus Torvalds
@ 2024-02-22 19:03   ` Nathan Chancellor
  2024-02-23 10:18     ` Heiko Carstens
  2024-02-23 11:55     ` Christian Brauner
  2024-03-12 10:35   ` Geert Uytterhoeven
  2024-05-15 11:10   ` Jiri Slaby
  3 siblings, 2 replies; 61+ messages in thread
From: Nathan Chancellor @ 2024-02-22 19:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen

Hi Christian,

On Tue, Feb 13, 2024 at 05:45:47PM +0100, Christian Brauner wrote:
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do simply because of the
> very justified limitations of anonymous inodes. Moving pidfds to a tiny
> pseudo filesystem allows:
> 
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() and then comparing inode
>   numbers.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>   file->private_data. This means it is now possible to introduce
>   concepts that operate on a process once all file descriptors have been
>   closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode. Even
>   if we were to move to anon_inode_create_getfile() which creates new
>   inodes we'd still be associating the same struct pid with multiple
>   different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>   all security hooks are called unblocking proper LSM management for
>   pidfds. In addition fsnotify hooks are called and allow for listening
>   to open events on pidfds.
> 
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no complex
> inode operations, nothing. Dentries and inodes are always deleted when
> the last pidfd is closed.
> 
> The code is entirely optional and fairly small. If it's not selected we
> fallback to anonymous inodes. Heavily inspired by nsfs which uses a
> similar stashing mechanism just for namespaces.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Apologies if this has already been reported or fixed but I did not see
anything on the mailing list.

On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
services such as abrtd, dbus, and polkit fail to start on my Fedora
machines, which causes further isssues like failing to start network
interfaces with NetworkManager. I can easily reproduce this in a Fedora
39 QEMU virtual machine, which has:

  # systemctl --version
  systemd 254 (254.9-1.fc39)
  +PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT
   default-hierarchy=unified

Unfortunately, there does not really appear to be much information to
provide off bat but I am more than happy to try and gather whatever
information would be helpful if you are not able to reproduce locally.

  # uname -r
  6.8.0-rc1-00017-ga1a466d5af6c

  # zgrep CONFIG_FS_PID /proc/config.gz
  CONFIG_FS_PID=y

  # systemctl status polkit.service
  × polkit.service - Authorization Manager
       Loaded: loaded (/usr/lib/systemd/system/polkit.service; static)
      Drop-In: /usr/lib/systemd/system/service.d
               └─10-timeout-abort.conf
       Active: failed (Result: timeout) since Thu 2024-02-22 11:35:52 MST; 11min ago
         Docs: man:polkit(8)
      Process: 844 ExecStart=/usr/lib/polkit-1/polkitd --no-debug (code=killed, signal=TERM)
     Main PID: 844 (code=killed, signal=TERM)
          CPU: 116ms

  Feb 22 11:34:22 qemu systemd[1]: Starting polkit.service - Authorization Manager...
  Feb 22 11:34:22 qemu polkitd[844]: Started polkitd version 123
  Feb 22 11:34:22 qemu polkitd[844]: Loading rules from directory /etc/polkit-1/rules.d
  Feb 22 11:34:22 qemu polkitd[844]: Loading rules from directory /usr/share/polkit-1/rules.d
  Feb 22 11:34:22 qemu polkitd[844]: Finished loading, compiling and executing 5 rules
  Feb 22 11:34:22 qemu polkitd[844]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
  Feb 22 11:35:52 qemu systemd[1]: polkit.service: start operation timed out. Terminating.
  Feb 22 11:35:52 qemu systemd[1]: polkit.service: Failed with result 'timeout'.
  Feb 22 11:35:52 qemu systemd[1]: Failed to start polkit.service - Authorization Manager.

vs.

  # uname -r
  6.8.0-rc1-00016-gd68c1231c030

  # systemctl status polkit.service
  ● polkit.service - Authorization Manager
       Loaded: loaded (/usr/lib/systemd/system/polkit.service; static)
      Drop-In: /usr/lib/systemd/system/service.d
               └─10-timeout-abort.conf
       Active: active (running) since Thu 2024-02-22 11:30:38 MST; 21s ago
         Docs: man:polkit(8)
     Main PID: 843 (polkitd)
        Tasks: 4 (limit: 19010)
       Memory: 5.0M
          CPU: 169ms
       CGroup: /system.slice/polkit.service
               └─843 /usr/lib/polkit-1/polkitd --no-debug

  Feb 22 11:30:38 qemu systemd[1]: Starting polkit.service - Authorization Manager...
  Feb 22 11:30:38 qemu polkitd[843]: Started polkitd version 123
  Feb 22 11:30:38 qemu polkitd[843]: Loading rules from directory /etc/polkit-1/rules.d
  Feb 22 11:30:38 qemu polkitd[843]: Loading rules from directory /usr/share/polkit-1/rules.d
  Feb 22 11:30:38 qemu polkitd[843]: Finished loading, compiling and executing 5 rules
  Feb 22 11:30:38 qemu polkitd[843]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
  Feb 22 11:30:38 qemu systemd[1]: Started polkit.service - Authorization Manager.

or

  # uname -r
  6.8.0-rc1-00017-ga1a466d5af6c

  # zgrep CONFIG_FS_PID /proc/config.gz
  # CONFIG_FS_PID is not set

  # systemctl status polkit.service
  ● polkit.service - Authorization Manager
       Loaded: loaded (/usr/lib/systemd/system/polkit.service; static)
      Drop-In: /usr/lib/systemd/system/service.d
               └─10-timeout-abort.conf
       Active: active (running) since Thu 2024-02-22 11:52:41 MST; 5min ago
         Docs: man:polkit(8)
     Main PID: 845 (polkitd)
        Tasks: 4 (limit: 19010)
       Memory: 5.0M
          CPU: 177ms
       CGroup: /system.slice/polkit.service
               └─845 /usr/lib/polkit-1/polkitd --no-debug

  Feb 22 11:52:41 qemu systemd[1]: Starting polkit.service - Authorization Manager...
  Feb 22 11:52:41 qemu polkitd[845]: Started polkitd version 123
  Feb 22 11:52:41 qemu polkitd[845]: Loading rules from directory /etc/polkit-1/rules.d
  Feb 22 11:52:41 qemu polkitd[845]: Loading rules from directory /usr/share/polkit-1/rules.d
  Feb 22 11:52:41 qemu polkitd[845]: Finished loading, compiling and executing 5 rules
  Feb 22 11:52:41 qemu polkitd[845]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
  Feb 22 11:52:41 qemu systemd[1]: Started polkit.service - Authorization Manager.

I looked your most recent push of vfs.pidfd but I did not see anything
that would have appeared to fix this, so I did not test it.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-22 19:03   ` Nathan Chancellor
@ 2024-02-23 10:18     ` Heiko Carstens
  2024-02-23 11:56       ` Christian Brauner
  2024-02-23 11:55     ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Heiko Carstens @ 2024-02-23 10:18 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Christian Brauner, linux-fsdevel, Linus Torvalds, Alexander Viro,
	Seth Forshee, Tycho Andersen

On Thu, Feb 22, 2024 at 12:03:34PM -0700, Nathan Chancellor wrote:
> Apologies if this has already been reported or fixed but I did not see
> anything on the mailing list.
> 
> On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> services such as abrtd, dbus, and polkit fail to start on my Fedora
> machines, which causes further isssues like failing to start network
> interfaces with NetworkManager. I can easily reproduce this in a Fedora
> 39 QEMU virtual machine, which has:

Same here with Fedora 39 on s390 and next-20240223: network does not
come up.

Disabling CONFIG_FS_PID "fixes" the problem.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-22 19:03   ` Nathan Chancellor
  2024-02-23 10:18     ` Heiko Carstens
@ 2024-02-23 11:55     ` Christian Brauner
  2024-02-23 12:57       ` Heiko Carstens
                         ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-23 11:55 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen

> Apologies if this has already been reported or fixed but I did not see
> anything on the mailing list.
> 
> On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> services such as abrtd, dbus, and polkit fail to start on my Fedora
> machines, which causes further isssues like failing to start network
> interfaces with NetworkManager. I can easily reproduce this in a Fedora
> 39 QEMU virtual machine, which has:
> 
>   # systemctl --version
>   systemd 254 (254.9-1.fc39)

If something fails for completely inexplicable reasons:

Feb 23 12:09:58 fed1 audit[353]: AVC avc:  denied  { read write open } for  pid=353 comm="systemd-userdbd" path="pidfd:[709]" dev="pidfs" ino=709 scontext=system_u:system_r:systemd_userdbd_t:>

>   +SELINUX

pidfd creation can now be mediated by LSMs since we can finally go
through the regular open path. That wasn't possible before but LSM
mediation ability had been requested a few times.

In short, we have to update the selinux policy for Fedora. (Fwiw, went
through the same excercise with nsfs back then.)

I've created a pull-request here:

https://github.com/fedora-selinux/selinux-policy/pull/2050

and filed an issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=2265630

We have sufficient time to get this resolved and I was assured that this
would be resolved. If we can't get it resolved in a timely manner we'll
default to N for a while until everything's updated but I'd like to
avoid that. I'll track that issue.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 10:18     ` Heiko Carstens
@ 2024-02-23 11:56       ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-23 11:56 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nathan Chancellor, linux-fsdevel, Linus Torvalds, Alexander Viro,
	Seth Forshee, Tycho Andersen

On Fri, Feb 23, 2024 at 11:18:33AM +0100, Heiko Carstens wrote:
> On Thu, Feb 22, 2024 at 12:03:34PM -0700, Nathan Chancellor wrote:
> > Apologies if this has already been reported or fixed but I did not see
> > anything on the mailing list.
> > 
> > On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> > services such as abrtd, dbus, and polkit fail to start on my Fedora
> > machines, which causes further isssues like failing to start network
> > interfaces with NetworkManager. I can easily reproduce this in a Fedora
> > 39 QEMU virtual machine, which has:
> 
> Same here with Fedora 39 on s390 and next-20240223: network does not
> come up.
> 
> Disabling CONFIG_FS_PID "fixes" the problem.

It's Selinux. See the other reply. It's already tracked.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 11:55     ` Christian Brauner
@ 2024-02-23 12:57       ` Heiko Carstens
  2024-02-23 13:27         ` Christian Brauner
  2024-02-23 13:41       ` Christian Brauner
  2024-02-23 21:26       ` Christian Brauner
  2 siblings, 1 reply; 61+ messages in thread
From: Heiko Carstens @ 2024-02-23 12:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nathan Chancellor, linux-fsdevel, Linus Torvalds, Alexander Viro,
	Seth Forshee, Tycho Andersen

On Fri, Feb 23, 2024 at 12:55:07PM +0100, Christian Brauner wrote:
> In short, we have to update the selinux policy for Fedora. (Fwiw, went
> through the same excercise with nsfs back then.)
> 
> I've created a pull-request here:
> 
> https://github.com/fedora-selinux/selinux-policy/pull/2050
> 
> and filed an issue here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2265630
> 
> We have sufficient time to get this resolved and I was assured that this
> would be resolved. If we can't get it resolved in a timely manner we'll
> default to N for a while until everything's updated but I'd like to
> avoid that. I'll track that issue.

So you are basically saying that for now it is ok to break everybody's
system who tries linux-next and let them bisect, just to figure out they
have to disable a config option?

To me this is a clear indication that the default is wrong already now.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 12:57       ` Heiko Carstens
@ 2024-02-23 13:27         ` Christian Brauner
  2024-02-23 13:35           ` Heiko Carstens
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-23 13:27 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nathan Chancellor, linux-fsdevel, Linus Torvalds, Alexander Viro,
	Seth Forshee, Tycho Andersen

> So you are basically saying that for now it is ok to break everybody's
> system who tries linux-next and let them bisect, just to figure out they
> have to disable a config option?

Let me fix that suggestive phrasing for you. You seem to struggle a bit:

> I think we should flip the default because this is breaking people who
> are trying linux-next and making them bisect which can be annoying.

Yes, that was the intent. I probably should've made that clear. What I
was trying to do is to flip the default to N and fix the policy. When
that is done we should aim to flip back to y.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 13:27         ` Christian Brauner
@ 2024-02-23 13:35           ` Heiko Carstens
  0 siblings, 0 replies; 61+ messages in thread
From: Heiko Carstens @ 2024-02-23 13:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nathan Chancellor, linux-fsdevel, Linus Torvalds, Alexander Viro,
	Seth Forshee, Tycho Andersen

On Fri, Feb 23, 2024 at 02:27:23PM +0100, Christian Brauner wrote:
> > So you are basically saying that for now it is ok to break everybody's
> > system who tries linux-next and let them bisect, just to figure out they
> > have to disable a config option?
> 
> Let me fix that suggestive phrasing for you. You seem to struggle a bit:

Thanks for helping!

> > I think we should flip the default because this is breaking people who
> > are trying linux-next and making them bisect which can be annoying.
> 
> Yes, that was the intent. I probably should've made that clear. What I
> was trying to do is to flip the default to N and fix the policy. When
> that is done we should aim to flip back to y.

Ok, thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 11:55     ` Christian Brauner
  2024-02-23 12:57       ` Heiko Carstens
@ 2024-02-23 13:41       ` Christian Brauner
  2024-02-23 21:26       ` Christian Brauner
  2 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-23 13:41 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen, Heiko Carstens

> default to N for a while until everything's updated but I'd like to

Ok, updated vfs.pidfd with a patch to flip this to n as the default for
now until the LSM learns to deal with this. Should show up in -next
tomorrow.

---

From 57a220844820980f8e3de1c1cd9d112e6e73da83 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Fri, 23 Feb 2024 14:17:21 +0100
Subject: [PATCH] pidfs: default to n for now

Moving pidfds from the anonymous inode infrastructure to a separate tiny
in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes
Selinux denials and thus various userspace components that make heavy
use of pidfds to fail.

Feb 23 12:09:58 fed1 audit[353]: AVC avc:  denied  { read write open } for  pid=353 comm="systemd-userdbd" path="pidfd:[709]" dev="pidfs" ino=709 scontext=system_u:system_r:systemd_userdbd_t:>

So far pidfds weren't able to be mediated by selinux which was requested
multiple times. Now that pidfs exists it is actually possible to medite
pidfds because they go through the regular open path that calls the
security_file_open() hook. This is a huge advantage.

Until the Selinux policy is fixed we need to default to n to avoid
breaking people. That process is under way in [1] and [2].

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630 [1]
Link: https://github.com/fedora-selinux/selinux-policy/pull/2050 [2]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index f3dbd84a0e40..eea2582fd4af 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -177,7 +177,7 @@ source "fs/sysfs/Kconfig"
 config FS_PID
 	bool "Pseudo filesystem for process file descriptors"
 	depends on 64BIT
-	default y
+	default n
 	help
 	  Pidfs implements advanced features for process file descriptors.
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 11:55     ` Christian Brauner
  2024-02-23 12:57       ` Heiko Carstens
  2024-02-23 13:41       ` Christian Brauner
@ 2024-02-23 21:26       ` Christian Brauner
  2024-02-23 21:58         ` Linus Torvalds
  2 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-02-23 21:26 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen, Heiko Carstens

On Fri, Feb 23, 2024 at 12:55:07PM +0100, Christian Brauner wrote:
> > Apologies if this has already been reported or fixed but I did not see
> > anything on the mailing list.
> > 
> > On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> > services such as abrtd, dbus, and polkit fail to start on my Fedora
> > machines, which causes further isssues like failing to start network
> > interfaces with NetworkManager. I can easily reproduce this in a Fedora
> > 39 QEMU virtual machine, which has:
> > 
> >   # systemctl --version
> >   systemd 254 (254.9-1.fc39)
> 
> If something fails for completely inexplicable reasons:
> 
> Feb 23 12:09:58 fed1 audit[353]: AVC avc:  denied  { read write open } for  pid=353 comm="systemd-userdbd" path="pidfd:[709]" dev="pidfs" ino=709 scontext=system_u:system_r:systemd_userdbd_t:>
> 
> >   +SELINUX
> 
> pidfd creation can now be mediated by LSMs since we can finally go
> through the regular open path. That wasn't possible before but LSM
> mediation ability had been requested a few times.
> 
> In short, we have to update the selinux policy for Fedora. (Fwiw, went
> through the same excercise with nsfs back then.)
> 
> I've created a pull-request here:
> 
> https://github.com/fedora-selinux/selinux-policy/pull/2050
> 
> and filed an issue here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2265630
> 
> We have sufficient time to get this resolved and I was assured that this
> would be resolved. If we can't get it resolved in a timely manner we'll
> default to N for a while until everything's updated but I'd like to
> avoid that. I'll track that issue.

So I want to provide more context since I took the time to track this
all down in detail.

The failure you are seeing is indeed an selinux denial as I've pointed
out. The core failure is dbus-broker. That cascades into all the other
services failing. When dbus-broker fails to start polkit and all the
others won't be able to work because they depend on dbus-broker.

The reason for dbus-broker failing is because it doesn't handle failures
for SO_PEERPIDFD correctly. Last kernel release (either v6.7 or v6.6,
I'm not completely sure right now) we introduced SO_PEERPIDFD (and
SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit and others to
receive a pidfd for the peer of an AF_UNIX socket. This is the first
time in the history of Linux that we can safely authenticate clients in
a race-free manner. :)

dbus-broker immediately made use of this but messed up the error
checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD.
That's obviously problematic not just because of LSM denials but because
of seccomp denials that would prevent SO_PEERPIDFD from working; or any
other new error code from there.

So this is catching a flawed implementation in dbus-broker as well. It
_has_ to fallback to the old pid-based authentication when SO_PEERPIDFD
doesn't work no matter the reasons otherwise it'll always risk such
failures.

So, the immediate fix separate from the selinux policy update is to fix
dbus-broker which we've done now:

https://github.com/bus1/dbus-broker/pull/343

That should make it into Fedora asap as well.

The selinux reference policy is also updated in addition to the Fedora
policy:

https://github.com/bus1/dbus-broker/pull/343

So overall that LSM denial should not have caused dbus-broker to fail.
It can never assume that a feature released one kernel ago like
SO_PEERPIDFD can be assumed to be available.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 21:26       ` Christian Brauner
@ 2024-02-23 21:58         ` Linus Torvalds
  2024-02-24  5:52           ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-23 21:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

On Fri, 23 Feb 2024 at 13:26, Christian Brauner <brauner@kernel.org> wrote:
>
> So, the immediate fix separate from the selinux policy update is to fix
> dbus-broker which we've done now:
>
> https://github.com/bus1/dbus-broker/pull/343

Why is that code then continuing the idiocy of doing different things
for different error conditions?

IOW, it causes user space failure when that code doesn't fall back to
"don't do pidfd", but then it continues the crazy habit of treating
*some* error returns as "fallback to not use pidfd" and other errors
as "fail user space".

That was the fundamental bug with special-casing EINVAL in the first
place, and the above "fix" continues the braindamage.

Did nobody learn anything?

Also, honestly, if this breaks existing setups, then we should fix the
kernel anyway. Changing things from the old anonymous inodes to the
new pidfs inodes should *not* have caused any LSM denial issues.

You used the same pointer to dbus-broker for the LSM changes, but I
really don't think this should have required LSM changes in the first
place. Your reaction to "my kernel change caused LSM to barf" should
have made you go "let's fix the kernel so that LSM _doesn't_ barf".

Maybe by making pidfs look exactly like anonfs to LSM. Since I don't
see the LSM change, I'm not actually sure exactly what LSM even
reacted to in that switch-over.

           Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-23 21:58         ` Linus Torvalds
@ 2024-02-24  5:52           ` Christian Brauner
  2024-02-24  6:05             ` Christian Brauner
  2024-02-24 18:48             ` Linus Torvalds
  0 siblings, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-24  5:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

On Fri, Feb 23, 2024 at 01:58:36PM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 13:26, Christian Brauner <brauner@kernel.org> wrote:
> >
> > So, the immediate fix separate from the selinux policy update is to fix
> > dbus-broker which we've done now:
> >
> > https://github.com/bus1/dbus-broker/pull/343
> 
> Why is that code then continuing the idiocy of doing different things
> for different error conditions?

Not under my control unfortunately.

> Also, honestly, if this breaks existing setups, then we should fix the
> kernel anyway. Changing things from the old anonymous inodes to the
> new pidfs inodes should *not* have caused any LSM denial issues.
> 
> You used the same pointer to dbus-broker for the LSM changes, but I
> really don't think this should have required LSM changes in the first
> place. Your reaction to "my kernel change caused LSM to barf" should
> have made you go "let's fix the kernel so that LSM _doesn't_ barf".
> 
> Maybe by making pidfs look exactly like anonfs to LSM. Since I don't
> see the LSM change, I'm not actually sure exactly what LSM even
> reacted to in that switch-over.

This is selinux. So I think this is a misunderstanding. This isn't
something we can fix in the kernel. If Selinux is in enforcing mode in
userspace and it encounters anything that it doesn't know about it will
deny it by default. And the policy is entirely in userspace including
declaring new types for stuff like nsfs or pidfs to allow it. There's
just nothing to do in the kernel.

The Selinux policy update in userspace would always have to happen just
like it had to for nsfs. Usually that happens after a change has landed
and people realize breakage or realize that new functionality isn't
available. This time it's just interacting with bad error handling in
dbus-broker.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-24  5:52           ` Christian Brauner
@ 2024-02-24  6:05             ` Christian Brauner
  2024-02-24 18:48             ` Linus Torvalds
  1 sibling, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-24  6:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

On Sat, Feb 24, 2024 at 06:52:41AM +0100, Christian Brauner wrote:
> On Fri, Feb 23, 2024 at 01:58:36PM -0800, Linus Torvalds wrote:
> > On Fri, 23 Feb 2024 at 13:26, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > So, the immediate fix separate from the selinux policy update is to fix
> > > dbus-broker which we've done now:
> > >
> > > https://github.com/bus1/dbus-broker/pull/343
> > 
> > Why is that code then continuing the idiocy of doing different things
> > for different error conditions?
> 
> Not under my control unfortunately.
> 
> > Also, honestly, if this breaks existing setups, then we should fix the
> > kernel anyway. Changing things from the old anonymous inodes to the
> > new pidfs inodes should *not* have caused any LSM denial issues.
> > 
> > You used the same pointer to dbus-broker for the LSM changes, but I
> > really don't think this should have required LSM changes in the first
> > place. Your reaction to "my kernel change caused LSM to barf" should
> > have made you go "let's fix the kernel so that LSM _doesn't_ barf".
> > 
> > Maybe by making pidfs look exactly like anonfs to LSM. Since I don't
> > see the LSM change, I'm not actually sure exactly what LSM even
> > reacted to in that switch-over.
> 
> This is selinux. So I think this is a misunderstanding. This isn't
> something we can fix in the kernel. If Selinux is in enforcing mode in
> userspace and it encounters anything that it doesn't know about it will
> deny it by default. And the policy is entirely in userspace including
> declaring new types for stuff like nsfs or pidfs to allow it. There's
> just nothing to do in the kernel.
> 
> The Selinux policy update in userspace would always have to happen just
> like it had to for nsfs. Usually that happens after a change has landed
> and people realize breakage or realize that new functionality isn't
> available. This time it's just interacting with bad error handling in
> dbus-broker.

I found the old thread for nsfs for example. Same thing:

https://www.spinics.net/lists/selinux/msg18425.html

"Since Linux 3.19 targets of /proc/PID/ns/* symlinks have lived in a fs
separated from /proc, named nsfs [1].  [...] 
When using a recent kernel with a policy without nsfs support, the
inodes are not labeled, as reported for example in Fedora bug #1234757
[3].  As I encounter this issue on my systems, I asked yesterday on the
refpolicy ML how nsfs inodes should be labeled [4]."

With the asker being pointed to a userspace policy update in

https://spinics.net/lists/selinux/msg18426.html

Honestly, my default reaction is always to test things like that with
various security modules and if I encounter anything that I can fix in
the kernel I do it. But the policies aren't in the kernel. The last link
above explicitly mentions this.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-24  5:52           ` Christian Brauner
  2024-02-24  6:05             ` Christian Brauner
@ 2024-02-24 18:48             ` Linus Torvalds
  2024-02-24 19:15               ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-02-24 18:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

On Fri, 23 Feb 2024 at 21:52, Christian Brauner <brauner@kernel.org> wrote:
>
> This is selinux. So I think this is a misunderstanding. This isn't
> something we can fix in the kernel.

Sure it is. SELinux just goes by what the kernel tells it anyway.

Presumably this is purely about the fact that the inode in question
*used* to be that magical 'anon_inode_inode' that is shared when you
don't want or need a separate inode allocation. I assume it doesn't
even look at that, it just looks at the 'anon_inode_fs_type' thing (or
maybe at the anon_inode_mnt->mnt_sb that is created by kern_mount in
anon_inode_init?)

IOW, isn't the *only* difference that selinux can actually see just
the inode allocation? It used to be that

       inode = anon_inode_getfile();

now it is

        inode = new_inode_pseudo(pidfdfs_sb);

and instead of sharing one single inode (like anon_inode_getfile()
does unless you ask for separate inodes), it now shares the dentry
instead (for the same pid).

Would selinux be happy if the inode allocation just used the
anon_inode superblock instead of pidfdfs_sb?

               Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-24 18:48             ` Linus Torvalds
@ 2024-02-24 19:15               ` Christian Brauner
  2024-02-24 19:19                 ` Christian Brauner
                                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-24 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1823 bytes --]

On Sat, Feb 24, 2024 at 10:48:11AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 21:52, Christian Brauner <brauner@kernel.org> wrote:
> >
> > This is selinux. So I think this is a misunderstanding. This isn't
> > something we can fix in the kernel.
> 
> Sure it is. SELinux just goes by what the kernel tells it anyway.
> 
> Presumably this is purely about the fact that the inode in question
> *used* to be that magical 'anon_inode_inode' that is shared when you
> don't want or need a separate inode allocation. I assume it doesn't
> even look at that, it just looks at the 'anon_inode_fs_type' thing (or
> maybe at the anon_inode_mnt->mnt_sb that is created by kern_mount in
> anon_inode_init?)
> 
> IOW, isn't the *only* difference that selinux can actually see just
> the inode allocation? It used to be that
> 
>        inode = anon_inode_getfile();
> 
> now it is
> 
>         inode = new_inode_pseudo(pidfdfs_sb);
> 
> and instead of sharing one single inode (like anon_inode_getfile()
> does unless you ask for separate inodes), it now shares the dentry
> instead (for the same pid).
> 
> Would selinux be happy if the inode allocation just used the
> anon_inode superblock instead of pidfdfs_sb?

No, unfortunately not. The core issue is that anon_inode_getfile() isn't
subject to any LSM hooks which is what pidfds used. But dentry_open() is
via security_file_open(). LSMs wanted to have a say in pidfd mediation
which is now possible. So the switch to dentry_open() is what is causing
the issue.

But here's a straightforward fix appended. We let pidfs.c use that fix
as and then we introduce a new LSM hook for pidfds that allows mediation
of pidfds and selinux can implement it when they're ready. This is
regression free and future proof. I actually tested this already today.

How does that sounds?

[-- Attachment #2: 0001-FIX.patch --]
[-- Type: text/x-diff, Size: 1856 bytes --]

From f2281c90f9c7f67c5f3d2457268cd9877acc1fa9 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Sat, 24 Feb 2024 19:55:46 +0100
Subject: [PATCH] FIX

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/file_table.c | 4 ++--
 fs/internal.h   | 2 ++
 fs/pidfs.c      | 5 +++--
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index b991f90571b4..685198338c4b 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -282,8 +282,8 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
  * @flags: O_... flags with which the new file will be opened
  * @fop: the 'struct file_operations' for the new file
  */
-static struct file *alloc_file(const struct path *path, int flags,
-		const struct file_operations *fop)
+struct file *alloc_file(const struct path *path, int flags,
+			const struct file_operations *fop)
 {
 	struct file *file;
 
diff --git a/fs/internal.h b/fs/internal.h
index b0c843c3fa3c..ac0d1fbd6d8d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -315,3 +315,5 @@ int path_from_stashed(struct dentry **stashed, unsigned long ino,
 		      const struct inode_operations *iops, void *data,
 		      struct path *path);
 void stashed_dentry_prune(struct dentry *dentry);
+struct file *alloc_file(const struct path *path, int flags,
+			const struct file_operations *fop);
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 2e165e6911df..57585de8f973 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -227,8 +227,9 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	if (ret < 0)
 		return ERR_PTR(ret);
 
-	pidfd_file = dentry_open(&path, flags, current_cred());
-	path_put(&path);
+	pidfd_file = alloc_file(&path, flags, &pidfs_file_operations);
+	if (IS_ERR(pidfd_file))
+		path_put(&path);
 	return pidfd_file;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-24 19:15               ` Christian Brauner
@ 2024-02-24 19:19                 ` Christian Brauner
  2024-02-24 19:21                 ` Linus Torvalds
  2024-02-27 19:26                 ` Nathan Chancellor
  2 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-24 19:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

> How does that sounds?

And fwiw, this is equivalent to what we do today.



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-24 19:15               ` Christian Brauner
  2024-02-24 19:19                 ` Christian Brauner
@ 2024-02-24 19:21                 ` Linus Torvalds
  2024-02-27 19:26                 ` Nathan Chancellor
  2 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2024-02-24 19:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nathan Chancellor, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

On Sat, 24 Feb 2024 at 11:16, Christian Brauner <brauner@kernel.org> wrote:
>
> > Would selinux be happy if the inode allocation just used the
> > anon_inode superblock instead of pidfdfs_sb?
>
> No, unfortunately not. The core issue is that anon_inode_getfile() isn't
> subject to any LSM hooks which is what pidfds used. But dentry_open() is
> via security_file_open().

Ahh.

> But here's a straightforward fix appended. We let pidfs.c use that fix
> as and then we introduce a new LSM hook for pidfds that allows mediation
> of pidfds and selinux can implement it when they're ready. This is
> regression free and future proof. I actually tested this already today.
>
> How does that sounds?

Ack. Perfect. This is how new features go in: they act like the old
ones, but have expanded capabilities that you can expose for people
who want to use them.

The fact that this all apparently happened in nsfs too is a bit sad. I
hadn't even been aware of it.

I absolutely *hate* how some kernel people will just say "the fix is
to upgrade your user space".

Oh well, water under the bridge. But let's do it right for pidfs, and
your fix looks good to me.

Thanks,

            Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-24 19:15               ` Christian Brauner
  2024-02-24 19:19                 ` Christian Brauner
  2024-02-24 19:21                 ` Linus Torvalds
@ 2024-02-27 19:26                 ` Nathan Chancellor
  2024-02-27 22:13                   ` Christian Brauner
  2 siblings, 1 reply; 61+ messages in thread
From: Nathan Chancellor @ 2024-02-27 19:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

Hi Christian,

On Sat, Feb 24, 2024 at 08:15:53PM +0100, Christian Brauner wrote:
> On Sat, Feb 24, 2024 at 10:48:11AM -0800, Linus Torvalds wrote:
> > On Fri, 23 Feb 2024 at 21:52, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > This is selinux. So I think this is a misunderstanding. This isn't
> > > something we can fix in the kernel.
> > 
> > Sure it is. SELinux just goes by what the kernel tells it anyway.
> > 
> > Presumably this is purely about the fact that the inode in question
> > *used* to be that magical 'anon_inode_inode' that is shared when you
> > don't want or need a separate inode allocation. I assume it doesn't
> > even look at that, it just looks at the 'anon_inode_fs_type' thing (or
> > maybe at the anon_inode_mnt->mnt_sb that is created by kern_mount in
> > anon_inode_init?)
> > 
> > IOW, isn't the *only* difference that selinux can actually see just
> > the inode allocation? It used to be that
> > 
> >        inode = anon_inode_getfile();
> > 
> > now it is
> > 
> >         inode = new_inode_pseudo(pidfdfs_sb);
> > 
> > and instead of sharing one single inode (like anon_inode_getfile()
> > does unless you ask for separate inodes), it now shares the dentry
> > instead (for the same pid).
> > 
> > Would selinux be happy if the inode allocation just used the
> > anon_inode superblock instead of pidfdfs_sb?
> 
> No, unfortunately not. The core issue is that anon_inode_getfile() isn't
> subject to any LSM hooks which is what pidfds used. But dentry_open() is
> via security_file_open(). LSMs wanted to have a say in pidfd mediation
> which is now possible. So the switch to dentry_open() is what is causing
> the issue.
> 
> But here's a straightforward fix appended. We let pidfs.c use that fix
> as and then we introduce a new LSM hook for pidfds that allows mediation
> of pidfds and selinux can implement it when they're ready. This is
> regression free and future proof. I actually tested this already today.
> 
> How does that sounds?

I see a patch similar to this change in your vfs.pidfs branch as
commit 47a1fbce74c3 ("pidfs: convert to path_from_stashed() helper"),
which also appears to be in next-20240227. However, I still seem to be
having similar issues (although I cannot reproduce them every single
boot like I used to). I do see some SELinux denials for pidfs, although
it seems like it is only write that is being denied, rather than open,
read, and write?

  # uname -r
  6.8.0-rc6-next-20240227

  # systemctl --failed --no-legend --plain
  fwupd-refresh.service loaded failed failed Refresh fwupd metadata and update motd
  mcelog.service        loaded failed failed Machine Check Exception Logging Daemon
  polkit.service        loaded failed failed Authorization Manager

  # journalctl -b 0 -g denied -t audit | head -3
  Feb 27 10:49:20 qemu audit[1]: AVC avc:  denied  { write } for  pid=1 comm="systemd" path="pidfd:[1547]" dev="pidfs" ino=1547 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0
  Feb 27 10:49:21 qemu audit[1]: AVC avc:  denied  { write } for  pid=1 comm="systemd" path="pidfd:[1564]" dev="pidfs" ino=1564 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0
  Feb 27 10:50:50 qemu audit[1]: AVC avc:  denied  { write } for  pid=1 comm="systemd" path="pidfd:[1547]" dev="pidfs" ino=1547 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-27 19:26                 ` Nathan Chancellor
@ 2024-02-27 22:13                   ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-02-27 22:13 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Linus Torvalds, linux-fsdevel, Seth Forshee, Tycho Andersen,
	Heiko Carstens, Al Viro

> boot like I used to). I do see some SELinux denials for pidfs, although
> it seems like it is only write that is being denied, rather than open,
> read, and write?

Yeah, the full extent of this just became apparent to me today. pidfd
inodes are currently private and they need to continue to be so until
userspace has caught up. This is now fixed and will show up in -next
tomorrow.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
  2024-02-13 17:17   ` Linus Torvalds
  2024-02-22 19:03   ` Nathan Chancellor
@ 2024-03-12 10:35   ` Geert Uytterhoeven
  2024-03-12 14:09     ` Christian Brauner
  2024-05-15 11:10   ` Jiri Slaby
  3 siblings, 1 reply; 61+ messages in thread
From: Geert Uytterhoeven @ 2024-03-12 10:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen, linux-kernel

 	Hi Christian,

On Tue, 13 Feb 2024, Christian Brauner wrote:
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do simply because of the
> very justified limitations of anonymous inodes. Moving pidfds to a tiny
> pseudo filesystem allows:
>
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() and then comparing inode
>  numbers.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>  file->private_data. This means it is now possible to introduce
>  concepts that operate on a process once all file descriptors have been
>  closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>  pid will refer to the same inode if it's opened multiple times. In
>  contrast to now where each struct pid refers to the same inode. Even
>  if we were to move to anon_inode_create_getfile() which creates new
>  inodes we'd still be associating the same struct pid with multiple
>  different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>  all security hooks are called unblocking proper LSM management for
>  pidfds. In addition fsnotify hooks are called and allow for listening
>  to open events on pidfds.
>
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no complex
> inode operations, nothing. Dentries and inodes are always deleted when
> the last pidfd is closed.
>
> The code is entirely optional and fairly small. If it's not selected we
> fallback to anonymous inodes. Heavily inspired by nsfs which uses a
> similar stashing mechanism just for namespaces.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Thanks for your patch, which is now commit cb12fd8e0dabb9a1
("pidfd: add pidfs") upstream.

> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
> source "fs/kernfs/Kconfig"
> source "fs/sysfs/Kconfig"
>
> +config FS_PIDFD
> +	bool "Pseudo filesystem for process file descriptors"
> +	depends on 64BIT

I think it would have been good if this dependency would have been
explained in the commit message.  I.e. why does this depend on 64BIT?

What is the risk this will not stay entirely optional?  I.e. can it
become a requirement for modern userspace, and thus be used as a stick
to kill support for 32-bit architectures?

> +	help
> +	  Pidfdfs implements advanced features for process file descriptors.
> +
> config TMPFS
> 	bool "Tmpfs virtual memory file system support (former shm fs)"
> 	depends on SHMEM

Gr{oetje,eeting}s,

 						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
 							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-03-12 10:35   ` Geert Uytterhoeven
@ 2024-03-12 14:09     ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-03-12 14:09 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen, linux-kernel

> What is the risk this will not stay entirely optional?  I.e. can it
> become a requirement for modern userspace, and thus be used as a stick
> to kill support for 32-bit architectures?

Yeah, Linus has requested to remove the config option.
I'm about to send him a patch.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
                     ` (2 preceding siblings ...)
  2024-03-12 10:35   ` Geert Uytterhoeven
@ 2024-05-15 11:10   ` Jiri Slaby
  2024-05-15 16:39     ` Christian Brauner
  3 siblings, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-15 11:10 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel
  Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen

On 13. 02. 24, 17:45, Christian Brauner wrote:
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do simply because of the
> very justified limitations of anonymous inodes. Moving pidfds to a tiny
> pseudo filesystem allows:
> 
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() and then comparing inode
>    numbers.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>    file->private_data. This means it is now possible to introduce
>    concepts that operate on a process once all file descriptors have been
>    closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>    pid will refer to the same inode if it's opened multiple times. In
>    contrast to now where each struct pid refers to the same inode. Even
>    if we were to move to anon_inode_create_getfile() which creates new
>    inodes we'd still be associating the same struct pid with multiple
>    different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>    all security hooks are called unblocking proper LSM management for
>    pidfds. In addition fsnotify hooks are called and allow for listening
>    to open events on pidfds.
> 
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no complex
> inode operations, nothing. Dentries and inodes are always deleted when
> the last pidfd is closed.

This breaks lsof and util-linux.

Without the commit, lsof shows:
systemd      ... 59 [pidfd:899]


With the commit:
systemd      ... 1187 pidfd


And that user-visible change breaks a lot of stuff, incl. lsof tests.

For util-linux, its test fail with:

> [  125s] --- tests/expected/lsfd/column-name-pidfd	2024-05-06 07:20:54.655845940 +0000
> [  125s] +++ tests/output/lsfd/column-name-pidfd	2024-05-15 01:04:15.406666666 +0000
> [  125s] @@ -1,2 +1,2 @@
> [  125s] -3 anon_inode:[pidfd] pid=1 comm= nspid=1
> [  125s] +3 pidfd:[INODENUM] pidfd:[INODENUM]
> [  125s]  pidfd:ASSOC,KNAME,NAME: 0
> [  125s]          lsfd: NAME and KNAME column: [02] pidfd             ... FAILED (lsfd/column-name-pidfd)

And:
> [  125s] --- tests/expected/lsfd/column-type-pidfd	2024-05-06 07:20:54.655845940 +0000
> [  125s] +++ tests/output/lsfd/column-type-pidfd	2024-05-15 01:04:15.573333333 +0000
> [  125s] @@ -1,2 +1,2 @@
> [  125s] -3 UNKN pidfd
> [  125s] +3 REG REG
> [  125s]  pidfd:ASSOC,STTYPE,TYPE: 0
> [  125s]          lsfd: TYPE and STTYPE column: [02] pidfd            ... FAILED (lsfd/column-type-pidfd)

Any ideas?

thanks,
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-15 11:10   ` Jiri Slaby
@ 2024-05-15 16:39     ` Christian Brauner
  2024-05-16  5:28       ` Jiri Slaby
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-05-15 16:39 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen

On Wed, May 15, 2024 at 01:10:49PM +0200, Jiri Slaby wrote:
> On 13. 02. 24, 17:45, Christian Brauner wrote:
> > This moves pidfds from the anonymous inode infrastructure to a tiny
> > pseudo filesystem. This has been on my todo for quite a while as it will
> > unblock further work that we weren't able to do simply because of the
> > very justified limitations of anonymous inodes. Moving pidfds to a tiny
> > pseudo filesystem allows:
> > 
> > * statx() on pidfds becomes useful for the first time.
> > * pidfds can be compared simply via statx() and then comparing inode
> >    numbers.
> > * pidfds have unique inode numbers for the system lifetime.
> > * struct pid is now stashed in inode->i_private instead of
> >    file->private_data. This means it is now possible to introduce
> >    concepts that operate on a process once all file descriptors have been
> >    closed. A concrete example is kill-on-last-close.
> > * file->private_data is freed up for per-file options for pidfds.
> > * Each struct pid will refer to a different inode but the same struct
> >    pid will refer to the same inode if it's opened multiple times. In
> >    contrast to now where each struct pid refers to the same inode. Even
> >    if we were to move to anon_inode_create_getfile() which creates new
> >    inodes we'd still be associating the same struct pid with multiple
> >    different inodes.
> > * Pidfds now go through the regular dentry_open() path which means that
> >    all security hooks are called unblocking proper LSM management for
> >    pidfds. In addition fsnotify hooks are called and allow for listening
> >    to open events on pidfds.
> > 
> > The tiny pseudo filesystem is not visible anywhere in userspace exactly
> > like e.g., pipefs and sockfs. There's no lookup, there's no complex
> > inode operations, nothing. Dentries and inodes are always deleted when
> > the last pidfd is closed.
> 
> This breaks lsof and util-linux.
> 
> Without the commit, lsof shows:
> systemd      ... 59 [pidfd:899]
> 
> 
> With the commit:
> systemd      ... 1187 pidfd
> 
> 
> And that user-visible change breaks a lot of stuff, incl. lsof tests.
> 
> For util-linux, its test fail with:
> 
> > [  125s] --- tests/expected/lsfd/column-name-pidfd	2024-05-06 07:20:54.655845940 +0000
> > [  125s] +++ tests/output/lsfd/column-name-pidfd	2024-05-15 01:04:15.406666666 +0000
> > [  125s] @@ -1,2 +1,2 @@
> > [  125s] -3 anon_inode:[pidfd] pid=1 comm= nspid=1
> > [  125s] +3 pidfd:[INODENUM] pidfd:[INODENUM]
> > [  125s]  pidfd:ASSOC,KNAME,NAME: 0
> > [  125s]          lsfd: NAME and KNAME column: [02] pidfd             ... FAILED (lsfd/column-name-pidfd)
> 
> And:
> > [  125s] --- tests/expected/lsfd/column-type-pidfd	2024-05-06 07:20:54.655845940 +0000
> > [  125s] +++ tests/output/lsfd/column-type-pidfd	2024-05-15 01:04:15.573333333 +0000
> > [  125s] @@ -1,2 +1,2 @@
> > [  125s] -3 UNKN pidfd
> > [  125s] +3 REG REG
> > [  125s]  pidfd:ASSOC,STTYPE,TYPE: 0
> > [  125s]          lsfd: TYPE and STTYPE column: [02] pidfd            ... FAILED (lsfd/column-type-pidfd)
> 
> Any ideas?

util-linux upstream is already handling that correctly now but it seems that
lsof is not. To fix this in the kernel we'll need something like. If you could
test this it'd be great as I'm currently traveling:

diff --git a/fs/pidfs.c b/fs/pidfs.c
index a63d5d24aa02..3da848a8a95e 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -201,10 +201,8 @@ static const struct super_operations pidfs_sops = {

 static char *pidfs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
-       struct inode *inode = d_inode(dentry);
-       struct pid *pid = inode->i_private;
-
-       return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
+       /* Fake the old name as some userspace seems to rely on this. */
+       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]");
 }

 static const struct dentry_operations pidfs_dentry_operations = {


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-15 16:39     ` Christian Brauner
@ 2024-05-16  5:28       ` Jiri Slaby
  2024-05-17  7:09         ` Jiri Slaby
  0 siblings, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-16  5:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen

On 15. 05. 24, 18:39, Christian Brauner wrote:
> On Wed, May 15, 2024 at 01:10:49PM +0200, Jiri Slaby wrote:
>> On 13. 02. 24, 17:45, Christian Brauner wrote:
>>> This moves pidfds from the anonymous inode infrastructure to a tiny
>>> pseudo filesystem. This has been on my todo for quite a while as it will
>>> unblock further work that we weren't able to do simply because of the
>>> very justified limitations of anonymous inodes. Moving pidfds to a tiny
>>> pseudo filesystem allows:
>>>
>>> * statx() on pidfds becomes useful for the first time.
>>> * pidfds can be compared simply via statx() and then comparing inode
>>>     numbers.
>>> * pidfds have unique inode numbers for the system lifetime.
>>> * struct pid is now stashed in inode->i_private instead of
>>>     file->private_data. This means it is now possible to introduce
>>>     concepts that operate on a process once all file descriptors have been
>>>     closed. A concrete example is kill-on-last-close.
>>> * file->private_data is freed up for per-file options for pidfds.
>>> * Each struct pid will refer to a different inode but the same struct
>>>     pid will refer to the same inode if it's opened multiple times. In
>>>     contrast to now where each struct pid refers to the same inode. Even
>>>     if we were to move to anon_inode_create_getfile() which creates new
>>>     inodes we'd still be associating the same struct pid with multiple
>>>     different inodes.
>>> * Pidfds now go through the regular dentry_open() path which means that
>>>     all security hooks are called unblocking proper LSM management for
>>>     pidfds. In addition fsnotify hooks are called and allow for listening
>>>     to open events on pidfds.
>>>
>>> The tiny pseudo filesystem is not visible anywhere in userspace exactly
>>> like e.g., pipefs and sockfs. There's no lookup, there's no complex
>>> inode operations, nothing. Dentries and inodes are always deleted when
>>> the last pidfd is closed.
>>
>> This breaks lsof and util-linux.
>>
>> Without the commit, lsof shows:
>> systemd      ... 59 [pidfd:899]
>>
>>
>> With the commit:
>> systemd      ... 1187 pidfd
>>
>>
>> And that user-visible change breaks a lot of stuff, incl. lsof tests.
>>
>> For util-linux, its test fail with:
>>
>>> [  125s] --- tests/expected/lsfd/column-name-pidfd	2024-05-06 07:20:54.655845940 +0000
>>> [  125s] +++ tests/output/lsfd/column-name-pidfd	2024-05-15 01:04:15.406666666 +0000
>>> [  125s] @@ -1,2 +1,2 @@
>>> [  125s] -3 anon_inode:[pidfd] pid=1 comm= nspid=1
>>> [  125s] +3 pidfd:[INODENUM] pidfd:[INODENUM]
>>> [  125s]  pidfd:ASSOC,KNAME,NAME: 0
>>> [  125s]          lsfd: NAME and KNAME column: [02] pidfd             ... FAILED (lsfd/column-name-pidfd)
>>
>> And:
>>> [  125s] --- tests/expected/lsfd/column-type-pidfd	2024-05-06 07:20:54.655845940 +0000
>>> [  125s] +++ tests/output/lsfd/column-type-pidfd	2024-05-15 01:04:15.573333333 +0000
>>> [  125s] @@ -1,2 +1,2 @@
>>> [  125s] -3 UNKN pidfd
>>> [  125s] +3 REG REG
>>> [  125s]  pidfd:ASSOC,STTYPE,TYPE: 0
>>> [  125s]          lsfd: TYPE and STTYPE column: [02] pidfd            ... FAILED (lsfd/column-type-pidfd)
>>
>> Any ideas?
> 
> util-linux upstream is already handling that correctly now but it seems that
> lsof is not. To fix this in the kernel we'll need something like. If you could
> test this it'd be great as I'm currently traveling:
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index a63d5d24aa02..3da848a8a95e 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -201,10 +201,8 @@ static const struct super_operations pidfs_sops = {
> 
>   static char *pidfs_dname(struct dentry *dentry, char *buffer, int buflen)
>   {
> -       struct inode *inode = d_inode(dentry);
> -       struct pid *pid = inode->i_private;
> -
> -       return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
> +       /* Fake the old name as some userspace seems to rely on this. */
> +       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]");

No, the lsof test runs "lsof -p $pid -a -d $fd -F pfn" and expects:
"p${pid} f${fd} n[pidfd:$pid]"

But it gets now:
p959 f3 nanon_inode

I.e. "anon_inode" instead of "n[pidfd:959]".

Did you intend to fix by the patch the lsfd's (util-linux) 
column-name-pidfd test by this instead (the above)?

thanks,
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-16  5:28       ` Jiri Slaby
@ 2024-05-17  7:09         ` Jiri Slaby
  2024-05-17  7:54           ` Jiri Slaby
  0 siblings, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-17  7:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen

On 16. 05. 24, 7:28, Jiri Slaby wrote:
> On 15. 05. 24, 18:39, Christian Brauner wrote:
>> On Wed, May 15, 2024 at 01:10:49PM +0200, Jiri Slaby wrote:
>>> On 13. 02. 24, 17:45, Christian Brauner wrote:
>>>> This moves pidfds from the anonymous inode infrastructure to a tiny
>>>> pseudo filesystem. This has been on my todo for quite a while as it 
>>>> will
>>>> unblock further work that we weren't able to do simply because of the
>>>> very justified limitations of anonymous inodes. Moving pidfds to a tiny
>>>> pseudo filesystem allows:
>>>>
>>>> * statx() on pidfds becomes useful for the first time.
>>>> * pidfds can be compared simply via statx() and then comparing inode
>>>>     numbers.
>>>> * pidfds have unique inode numbers for the system lifetime.
>>>> * struct pid is now stashed in inode->i_private instead of
>>>>     file->private_data. This means it is now possible to introduce
>>>>     concepts that operate on a process once all file descriptors 
>>>> have been
>>>>     closed. A concrete example is kill-on-last-close.
>>>> * file->private_data is freed up for per-file options for pidfds.
>>>> * Each struct pid will refer to a different inode but the same struct
>>>>     pid will refer to the same inode if it's opened multiple times. In
>>>>     contrast to now where each struct pid refers to the same inode. 
>>>> Even
>>>>     if we were to move to anon_inode_create_getfile() which creates new
>>>>     inodes we'd still be associating the same struct pid with multiple
>>>>     different inodes.
>>>> * Pidfds now go through the regular dentry_open() path which means that
>>>>     all security hooks are called unblocking proper LSM management for
>>>>     pidfds. In addition fsnotify hooks are called and allow for 
>>>> listening
>>>>     to open events on pidfds.
>>>>
>>>> The tiny pseudo filesystem is not visible anywhere in userspace exactly
>>>> like e.g., pipefs and sockfs. There's no lookup, there's no complex
>>>> inode operations, nothing. Dentries and inodes are always deleted when
>>>> the last pidfd is closed.
>>>
>>> This breaks lsof and util-linux.
>>>
>>> Without the commit, lsof shows:
>>> systemd      ... 59 [pidfd:899]
>>>
>>>
>>> With the commit:
>>> systemd      ... 1187 pidfd
>>>
>>>
>>> And that user-visible change breaks a lot of stuff, incl. lsof tests.
>>>
>>> For util-linux, its test fail with:
>>>
>>>> [  125s] --- tests/expected/lsfd/column-name-pidfd    2024-05-06 
>>>> 07:20:54.655845940 +0000
>>>> [  125s] +++ tests/output/lsfd/column-name-pidfd    2024-05-15 
>>>> 01:04:15.406666666 +0000
>>>> [  125s] @@ -1,2 +1,2 @@
>>>> [  125s] -3 anon_inode:[pidfd] pid=1 comm= nspid=1
>>>> [  125s] +3 pidfd:[INODENUM] pidfd:[INODENUM]
>>>> [  125s]  pidfd:ASSOC,KNAME,NAME: 0
>>>> [  125s]          lsfd: NAME and KNAME column: [02] 
>>>> pidfd             ... FAILED (lsfd/column-name-pidfd)
>>>
>>> And:
>>>> [  125s] --- tests/expected/lsfd/column-type-pidfd    2024-05-06 
>>>> 07:20:54.655845940 +0000
>>>> [  125s] +++ tests/output/lsfd/column-type-pidfd    2024-05-15 
>>>> 01:04:15.573333333 +0000
>>>> [  125s] @@ -1,2 +1,2 @@
>>>> [  125s] -3 UNKN pidfd
>>>> [  125s] +3 REG REG
>>>> [  125s]  pidfd:ASSOC,STTYPE,TYPE: 0
>>>> [  125s]          lsfd: TYPE and STTYPE column: [02] 
>>>> pidfd            ... FAILED (lsfd/column-type-pidfd)
>>>
>>> Any ideas?
>>
>> util-linux upstream is already handling that correctly now but it 
>> seems that
>> lsof is not. To fix this in the kernel we'll need something like. If 
>> you could
>> test this it'd be great as I'm currently traveling:
>>
>> diff --git a/fs/pidfs.c b/fs/pidfs.c
>> index a63d5d24aa02..3da848a8a95e 100644
>> --- a/fs/pidfs.c
>> +++ b/fs/pidfs.c
>> @@ -201,10 +201,8 @@ static const struct super_operations pidfs_sops = {
>>
>>   static char *pidfs_dname(struct dentry *dentry, char *buffer, int 
>> buflen)
>>   {
>> -       struct inode *inode = d_inode(dentry);
>> -       struct pid *pid = inode->i_private;
>> -
>> -       return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
>> +       /* Fake the old name as some userspace seems to rely on this. */
>> +       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]");
> 
> No, the lsof test runs "lsof -p $pid -a -d $fd -F pfn" and expects:
> "p${pid} f${fd} n[pidfd:$pid]"
> 
> But it gets now:
> p959 f3 nanon_inode
> 
> I.e. "anon_inode" instead of "n[pidfd:959]".
> 
> Did you intend to fix by the patch the lsfd's (util-linux) 
> column-name-pidfd test by this instead (the above)?

This is now discussed in https://github.com/lsof-org/lsof/issues/317 too.

So yeah,
# ll /proc/984/fd
total 0
lrwx------ 1 xslaby users 64 May 17 09:00 0 -> /dev/pts/1
lrwx------ 1 xslaby users 64 May 17 09:00 2 -> /dev/pts/1
lrwx------ 1 xslaby users 64 May 17 09:00 3 -> anon_inode:[pidfd]

looks good with the patch. But lsof checks if this IS_REG(). And if it 
is, pidfd handling is not done.

> thanks,
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-17  7:09         ` Jiri Slaby
@ 2024-05-17  7:54           ` Jiri Slaby
  2024-05-17 20:07             ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-17  7:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Seth Forshee,
	Tycho Andersen

On 17. 05. 24, 9:09, Jiri Slaby wrote:
> On 16. 05. 24, 7:28, Jiri Slaby wrote:
>> On 15. 05. 24, 18:39, Christian Brauner wrote:
>>> On Wed, May 15, 2024 at 01:10:49PM +0200, Jiri Slaby wrote:
>>>> On 13. 02. 24, 17:45, Christian Brauner wrote:
>>>>> This moves pidfds from the anonymous inode infrastructure to a tiny
>>>>> pseudo filesystem. This has been on my todo for quite a while as it 
>>>>> will
>>>>> unblock further work that we weren't able to do simply because of the
>>>>> very justified limitations of anonymous inodes. Moving pidfds to a 
>>>>> tiny
>>>>> pseudo filesystem allows:
>>>>>
>>>>> * statx() on pidfds becomes useful for the first time.
>>>>> * pidfds can be compared simply via statx() and then comparing inode
>>>>>     numbers.
>>>>> * pidfds have unique inode numbers for the system lifetime.
>>>>> * struct pid is now stashed in inode->i_private instead of
>>>>>     file->private_data. This means it is now possible to introduce
>>>>>     concepts that operate on a process once all file descriptors 
>>>>> have been
>>>>>     closed. A concrete example is kill-on-last-close.
>>>>> * file->private_data is freed up for per-file options for pidfds.
>>>>> * Each struct pid will refer to a different inode but the same struct
>>>>>     pid will refer to the same inode if it's opened multiple times. In
>>>>>     contrast to now where each struct pid refers to the same inode. 
>>>>> Even
>>>>>     if we were to move to anon_inode_create_getfile() which creates 
>>>>> new
>>>>>     inodes we'd still be associating the same struct pid with multiple
>>>>>     different inodes.
>>>>> * Pidfds now go through the regular dentry_open() path which means 
>>>>> that
>>>>>     all security hooks are called unblocking proper LSM management for
>>>>>     pidfds. In addition fsnotify hooks are called and allow for 
>>>>> listening
>>>>>     to open events on pidfds.
>>>>>
>>>>> The tiny pseudo filesystem is not visible anywhere in userspace 
>>>>> exactly
>>>>> like e.g., pipefs and sockfs. There's no lookup, there's no complex
>>>>> inode operations, nothing. Dentries and inodes are always deleted when
>>>>> the last pidfd is closed.
>>>>
>>>> This breaks lsof and util-linux.
>>>>
>>>> Without the commit, lsof shows:
>>>> systemd      ... 59 [pidfd:899]
>>>>
>>>>
>>>> With the commit:
>>>> systemd      ... 1187 pidfd
>>>>
>>>>
>>>> And that user-visible change breaks a lot of stuff, incl. lsof tests.
>>>>
>>>> For util-linux, its test fail with:
>>>>
>>>>> [  125s] --- tests/expected/lsfd/column-name-pidfd    2024-05-06 
>>>>> 07:20:54.655845940 +0000
>>>>> [  125s] +++ tests/output/lsfd/column-name-pidfd    2024-05-15 
>>>>> 01:04:15.406666666 +0000
>>>>> [  125s] @@ -1,2 +1,2 @@
>>>>> [  125s] -3 anon_inode:[pidfd] pid=1 comm= nspid=1
>>>>> [  125s] +3 pidfd:[INODENUM] pidfd:[INODENUM]
>>>>> [  125s]  pidfd:ASSOC,KNAME,NAME: 0
>>>>> [  125s]          lsfd: NAME and KNAME column: [02] 
>>>>> pidfd             ... FAILED (lsfd/column-name-pidfd)
>>>>
>>>> And:
>>>>> [  125s] --- tests/expected/lsfd/column-type-pidfd    2024-05-06 
>>>>> 07:20:54.655845940 +0000
>>>>> [  125s] +++ tests/output/lsfd/column-type-pidfd    2024-05-15 
>>>>> 01:04:15.573333333 +0000
>>>>> [  125s] @@ -1,2 +1,2 @@
>>>>> [  125s] -3 UNKN pidfd
>>>>> [  125s] +3 REG REG
>>>>> [  125s]  pidfd:ASSOC,STTYPE,TYPE: 0
>>>>> [  125s]          lsfd: TYPE and STTYPE column: [02] 
>>>>> pidfd            ... FAILED (lsfd/column-type-pidfd)
>>>>
>>>> Any ideas?
>>>
>>> util-linux upstream is already handling that correctly now but it 
>>> seems that
>>> lsof is not. To fix this in the kernel we'll need something like. If 
>>> you could
>>> test this it'd be great as I'm currently traveling:
>>>
>>> diff --git a/fs/pidfs.c b/fs/pidfs.c
>>> index a63d5d24aa02..3da848a8a95e 100644
>>> --- a/fs/pidfs.c
>>> +++ b/fs/pidfs.c
>>> @@ -201,10 +201,8 @@ static const struct super_operations pidfs_sops = {
>>>
>>>   static char *pidfs_dname(struct dentry *dentry, char *buffer, int 
>>> buflen)
>>>   {
>>> -       struct inode *inode = d_inode(dentry);
>>> -       struct pid *pid = inode->i_private;
>>> -
>>> -       return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
>>> +       /* Fake the old name as some userspace seems to rely on this. */
>>> +       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]");
>>
>> No, the lsof test runs "lsof -p $pid -a -d $fd -F pfn" and expects:
>> "p${pid} f${fd} n[pidfd:$pid]"
>>
>> But it gets now:
>> p959 f3 nanon_inode
>>
>> I.e. "anon_inode" instead of "n[pidfd:959]".
>>
>> Did you intend to fix by the patch the lsfd's (util-linux) 
>> column-name-pidfd test by this instead (the above)?
> 
> This is now discussed in https://github.com/lsof-org/lsof/issues/317 too.
> 
> So yeah,
> # ll /proc/984/fd
> total 0
> lrwx------ 1 xslaby users 64 May 17 09:00 0 -> /dev/pts/1
> lrwx------ 1 xslaby users 64 May 17 09:00 2 -> /dev/pts/1
> lrwx------ 1 xslaby users 64 May 17 09:00 3 -> anon_inode:[pidfd]
> 
> looks good with the patch. But lsof checks if this IS_REG(). And if it 
> is, pidfd handling is not done.

So this fixes it, of course:
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -2026,7 +2026,6 @@ static struct dentry *prepare_anon_dentry(struct 
dentry **stashed,
         }

         /* Notice when this is changed. */
-       WARN_ON_ONCE(!S_ISREG(inode->i_mode));
         WARN_ON_ONCE(!IS_IMMUTABLE(inode));

         dentry = d_alloc_anon(sb);
diff --git a/fs/pidfs.c b/fs/pidfs.c
index a63d5d24..f51a794f 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -201,10 +201,7 @@ static const struct super_operations pidfs_sops = {

  static char *pidfs_dname(struct dentry *dentry, char *buffer, int buflen)
  {
-       struct inode *inode = d_inode(dentry);
-       struct pid *pid = inode->i_private;
-
-       return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
+       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]");
  }

  static const struct dentry_operations pidfs_dentry_operations = {
@@ -217,6 +214,7 @@ static int pidfs_init_inode(struct inode *inode, 
void *data)
  {
         inode->i_private = data;
         inode->i_flags |= S_PRIVATE;
+       inode->i_mode &= ~S_IFREG;
         inode->i_mode |= S_IRWXU;
         inode->i_op = &pidfs_inode_operations;
         inode->i_fop = &pidfs_file_operations;


> 
>> thanks,

-- 
js
suse labs


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-17  7:54           ` Jiri Slaby
@ 2024-05-17 20:07             ` Linus Torvalds
  2024-05-20  8:23               ` Jiri Slaby
  2024-05-21 12:16               ` Christian Brauner
  0 siblings, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2024-05-17 20:07 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Seth Forshee,
	Tycho Andersen

On Fri, 17 May 2024 at 00:54, Jiri Slaby <jirislaby@kernel.org> wrote:
>
>          inode->i_private = data;
>          inode->i_flags |= S_PRIVATE;
> +       inode->i_mode &= ~S_IFREG;

That is not a sensible operation. S_IFREG isn't a bit mask.

But it looks like 'anon_inode' traditionally had *no* type bytes at
all. That's literally crazy.

Doing a 'stat -L' on one in /proc/X/fd/Y will correctly say "weird
file" about them.

What a crock. That's horrible, and we apparently never noticed how
broken anon_inodes were because nobody really cared. But then lsof
seems to have done the *opposite* and just said (for unfathomable
reasons) "this can't be a normal regular file".

But I can't actually find that code in lsof. I see

                 if (rest && rest[0] == '[' && rest[1] == 'p')
                     fdinfo_mask |= FDINFO_PID;

which only checks that the name starts with '[p'. Hmm.

[ Time passes, I go looking ]

Oh Christ. It's process_proc_node:

        type = s->st_mode & S_IFMT;
        switch (type) {
        ...
        case 0:
            if (!strcmp(p, "anon_inode"))
                Lf->ntype = Ntype = N_ANON_INODE;
            break;

so yes, process_proc_node() really seems to have intentionally noticed
that our anon inodes forgot to put a file type in the st_mode, and
together with the path from readlink matching 'anon_inode' is how lsof
determines it's one of the special inodes.

So yeah, we made a mistake, and then lsof decided that mistake was a feature.

But that does mean that we probably just have to live in the bed we made.

But that

> +       inode->i_mode &= ~S_IFREG;

is still very very wrong. It should use the proper bit mask: S_IFMT.

And we'd have to add a big comment about our historical stupidity that
we are perpetuating.

Oh well.

               Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-17 20:07             ` Linus Torvalds
@ 2024-05-20  8:23               ` Jiri Slaby
  2024-05-20 19:01                 ` Linus Torvalds
  2024-05-21 12:16               ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-20  8:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Seth Forshee,
	Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 3106 bytes --]

On 17. 05. 24, 22:07, Linus Torvalds wrote:
> On Fri, 17 May 2024 at 00:54, Jiri Slaby <jirislaby@kernel.org> wrote:
>>
>>           inode->i_private = data;
>>           inode->i_flags |= S_PRIVATE;
>> +       inode->i_mode &= ~S_IFREG;
> 
> That is not a sensible operation. S_IFREG isn't a bit mask.

Oh, sure. I just unmasked what libfs' prepare_anon_dentry() set by default.

> But it looks like 'anon_inode' traditionally had *no* type bytes at
> all. That's literally crazy.
> 
> Doing a 'stat -L' on one in /proc/X/fd/Y will correctly say "weird
> file" about them.
> 
> What a crock. That's horrible, and we apparently never noticed how
> broken anon_inodes were because nobody really cared. But then lsof
> seems to have done the *opposite* and just said (for unfathomable
> reasons) "this can't be a normal regular file".
> 
> But I can't actually find that code in lsof. I see
> 
>                   if (rest && rest[0] == '[' && rest[1] == 'p')
>                       fdinfo_mask |= FDINFO_PID;
> 
> which only checks that the name starts with '[p'. Hmm.

lsof just has received a fix in a form of:
    else if (Lf->ntype == N_REGLR && rest && *rest && strcmp(pbuf, 
"pidfd") == 0) {

https://github.com/lsof-org/lsof/pull/319/commits/c1678e3f6e4b4d984cb3078b7bf0c9e24bedb8ca

> [ Time passes, I go looking ]
> 
> Oh Christ. It's process_proc_node:

Yes, didn't I note it? Hmm, apparently not (or maybe it's hidden in all 
those pulls/isuses/bugs). But definitely been there, seen that. Sorry.

>          type = s->st_mode & S_IFMT;
>          switch (type) {
>          ...
>          case 0:
>              if (!strcmp(p, "anon_inode"))
>                  Lf->ntype = Ntype = N_ANON_INODE;
>              break;
> 
> so yes, process_proc_node() really seems to have intentionally noticed
> that our anon inodes forgot to put a file type in the st_mode, and
> together with the path from readlink matching 'anon_inode' is how lsof
> determines it's one of the special inodes.
> 
> So yeah, we made a mistake, and then lsof decided that mistake was a feature.

Yes, but we can schedule a removal of this compat handling after some 
years...

> But that does mean that we probably just have to live in the bed we made.
> 
> But that
> 
>> +       inode->i_mode &= ~S_IFREG;
> 
> is still very very wrong. It should use the proper bit mask: S_IFMT.

Either, I don't like removing that WARN_ON_ONCE() from libfs' 
prepare_anon_dentry(). Is it OK to remove this S_IFREG after 
path_from_stashed() in pidfs' pidfs_alloc_file(). I.e. after 
d_alloc_anon(), d_instantiate(), stash_dentry(), but before dentry_open()?

That looks weird.

Instead, add a sort of LEGACY_DONT_WARN_ABOUT_IFMT to path_from_stashed()?

Dirty, I think.

So what about LEGACY_NO_MODE which would set "i_mode = 0" and mangle the 
WARN_ON appropriately. Like in the patch attached? It works (when 
applied together with the anon_inode name fix).

> And we'd have to add a big comment about our historical stupidity that
> we are perpetuating.

And immediately add it to Documentation/ABI/obsolete/?

thanks,
-- 
js
suse labs

[-- Attachment #2: 0001-add-LEGACY_NO_MODE.patch --]
[-- Type: text/x-patch, Size: 4723 bytes --]

From b005cd96c97684adfabf07c56bd91fabe45c8cb7 Mon Sep 17 00:00:00 2001
From: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Date: Mon, 20 May 2024 10:13:44 +0200
Subject: [PATCH] add LEGACY_NO_MODE

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
---
 Documentation/ABI/obsolete/libfs-LEGACY_NO_MODE |  8 ++++++++
 fs/internal.h                                   |  5 ++++-
 fs/libfs.c                                      | 11 ++++++-----
 fs/nsfs.c                                       |  4 ++--
 fs/pidfs.c                                      |  3 ++-
 5 files changed, 22 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/ABI/obsolete/libfs-LEGACY_NO_MODE

diff --git a/Documentation/ABI/obsolete/libfs-LEGACY_NO_MODE b/Documentation/ABI/obsolete/libfs-LEGACY_NO_MODE
new file mode 100644
index 000000000000..37ad036b18b2
--- /dev/null
+++ b/Documentation/ABI/obsolete/libfs-LEGACY_NO_MODE
@@ -0,0 +1,8 @@
+What:		libfs' LEGACY_NO_MODE
+Date:		May 2024
+KernelVersion:	6.10
+Contact:	linux-fsdevel@vger.kernel.org
+Description:	LEGACY_NO_MODE was added to mimic the old (wrong) i_mode (zero)
+		of anon_inode when pidfs was moved away from anon_inode to
+		libfs helpers.
+Users:		lsof
diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..a7cd0eecc266 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -315,6 +315,9 @@ struct stashed_operations {
 	void (*put_data)(void *data);
 	int (*init_inode)(struct inode *inode, void *data);
 };
+
+#define LEGACY_NO_MODE		BIT(0)
+
 int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
-		      struct path *path);
+		      struct path *path, unsigned int flags);
 void stashed_dentry_prune(struct dentry *dentry);
diff --git a/fs/libfs.c b/fs/libfs.c
index b635ee5adbcc..c047aa4f4dac 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -2045,11 +2045,12 @@ static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
 
 static struct dentry *prepare_anon_dentry(struct dentry **stashed,
 					  struct super_block *sb,
-					  void *data)
+					  void *data, unsigned int flags)
 {
 	struct dentry *dentry;
 	struct inode *inode;
 	const struct stashed_operations *sops = sb->s_fs_info;
+	umode_t i_mode;
 	int ret;
 
 	inode = new_inode_pseudo(sb);
@@ -2059,7 +2060,7 @@ static struct dentry *prepare_anon_dentry(struct dentry **stashed,
 	}
 
 	inode->i_flags |= S_IMMUTABLE;
-	inode->i_mode = S_IFREG;
+	inode->i_mode = i_mode = (flags & LEGACY_NO_MODE) ? 0 : S_IFREG;
 	simple_inode_init_ts(inode);
 
 	ret = sops->init_inode(inode, data);
@@ -2069,7 +2070,7 @@ static struct dentry *prepare_anon_dentry(struct dentry **stashed,
 	}
 
 	/* Notice when this is changed. */
-	WARN_ON_ONCE(!S_ISREG(inode->i_mode));
+	WARN_ON_ONCE((inode->i_mode & S_IFMT) != i_mode);
 	WARN_ON_ONCE(!IS_IMMUTABLE(inode));
 
 	dentry = d_alloc_anon(sb);
@@ -2126,7 +2127,7 @@ static struct dentry *stash_dentry(struct dentry **stashed,
  * Return: On success zero and on failure a negative error is returned.
  */
 int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
-		      struct path *path)
+		      struct path *path, unsigned int flags)
 {
 	struct dentry *dentry;
 	const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;
@@ -2139,7 +2140,7 @@ int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
 	}
 
 	/* Allocate a new dentry. */
-	dentry = prepare_anon_dentry(stashed, mnt->mnt_sb, data);
+	dentry = prepare_anon_dentry(stashed, mnt->mnt_sb, data, flags);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 07e22a15ef02..11b8ef2aeaed 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -56,7 +56,7 @@ int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 	if (!ns)
 		return -ENOENT;
 
-	return path_from_stashed(&ns->stashed, nsfs_mnt, ns, path);
+	return path_from_stashed(&ns->stashed, nsfs_mnt, ns, path, 0);
 }
 
 struct ns_get_path_task_args {
@@ -101,7 +101,7 @@ int open_related_ns(struct ns_common *ns,
 		return PTR_ERR(relative);
 	}
 
-	err = path_from_stashed(&relative->stashed, nsfs_mnt, relative, &path);
+	err = path_from_stashed(&relative->stashed, nsfs_mnt, relative, &path, 0);
 	if (err < 0) {
 		put_unused_fd(fd);
 		return err;
diff --git a/fs/pidfs.c b/fs/pidfs.c
index a63d5d24aa02..cb894e9024ed 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -266,7 +266,8 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	struct path path;
 	int ret;
 
-	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
+	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path,
+				LEGACY_NO_MODE);
 	if (ret < 0)
 		return ERR_PTR(ret);
 
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-20  8:23               ` Jiri Slaby
@ 2024-05-20 19:01                 ` Linus Torvalds
  2024-05-20 19:15                   ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-05-20 19:01 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Seth Forshee,
	Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 627 bytes --]

On Mon, 20 May 2024 at 01:23, Jiri Slaby <jirislaby@kernel.org> wrote:
>
> So what about LEGACY_NO_MODE which would set "i_mode = 0" and mangle the
> WARN_ON appropriately. Like in the patch attached? It works (when
> applied together with the anon_inode name fix).

No, that's horrendous.

We actually have a much better place to handle this nasty thing:
pidfs_getattr() for the returned st_mode, and pidfs_dname() for the
name.

So how about just a patch like this?  It doesn't do anything
*internally* to the inodes, but it fixes up what we expose to user
level to make it look like lsof expects.

                    Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1799 bytes --]

 fs/pidfs.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index a63d5d24aa02..5231ddb27d25 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -169,6 +169,24 @@ static int pidfs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return -EOPNOTSUPP;
 }
 
+
+/*
+ * User space expects pidfs inodes to have no file type in st_mode.
+ *
+ * In particular, 'lsof' has this legacy logic:
+ *
+ *	type = s->st_mode & S_IFMT;
+ *	switch (type) {
+ *	  ...
+ *	case 0:
+ *		if (!strcmp(p, "anon_inode"))
+ *			Lf->ntype = Ntype = N_ANON_INODE;
+ *
+ * to detect our old anon_inode logic.
+ *
+ * Rather than mess with our internal sane inode data, just fix it
+ * up here in getattr() by masking off the format bits.
+ */
 static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 			 struct kstat *stat, u32 request_mask,
 			 unsigned int query_flags)
@@ -176,6 +194,7 @@ static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 	struct inode *inode = d_inode(path->dentry);
 
 	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
+	stat->mode &= ~S_IFMT;
 	return 0;
 }
 
@@ -199,12 +218,16 @@ static const struct super_operations pidfs_sops = {
 	.statfs		= simple_statfs,
 };
 
+/*
+ * 'lsof' has knowledge of out historical anon_inode use, and expects
+ * the pidfs dentry name to start with 'anon_inode'.
+ */
 static char *pidfs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
 	struct inode *inode = d_inode(dentry);
 	struct pid *pid = inode->i_private;
 
-	return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
+	return dynamic_dname(buffer, buflen, "anon_inode:[pidfd-%llu]", pid->ino);
 }
 
 static const struct dentry_operations pidfs_dentry_operations = {

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-20 19:01                 ` Linus Torvalds
@ 2024-05-20 19:15                   ` Linus Torvalds
  2024-05-21  6:07                     ` Jiri Slaby
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-05-20 19:15 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Seth Forshee,
	Tycho Andersen

On Mon, 20 May 2024 at 12:01, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So how about just a patch like this?  It doesn't do anything
> *internally* to the inodes, but it fixes up what we expose to user
> level to make it look like lsof expects.

Note that the historical dname for those pidfs files was
"anon_inode:[pidfd]", and that patch still kept the inode number in
there, so now it's "anon_inode:[pidfd-XYZ]", but I think lsof is still
happy with that.

Somebody should check.

I also wrote some kind of tentative commit log for this:

    fs/pidfs: make 'lsof' happy with our inode changes

    pidfs started using much saner inodes in commit b28ddcc32d8f ("pidfs:
    convert to path_from_stashed() helper"), but that exposed the fact that
    lsof had some knowledge of just how odd our old anon_inode usage was.

    For example, legacy anon_inodes hadn't even initialized the inode type
    in the inode mode, so everything had a type of zero.

    So sane tools like 'stat' would report these files as "weird file", but
    'lsof' instead used that (together with the name of the link in proc) to
    notice that it's an anonymous inode, and used it to detect pidfd files.

    Let's keep our internal new sane inode model, but mask the file type
    bits at 'stat()' time in the getattr() function we already have, and by
    making the dentry name match what lsof expects too.

    This keeps our internal models sane, but should make user space see the
    same old odd behavior.

    Reported-by: Jiri Slaby <jirislaby@kernel.org>
    Link: https://lore.kernel.org/all/a15b1050-4b52-4740-a122-a4d055c17f11@kernel.org/
    Link: https://github.com/lsof-org/lsof/issues/317
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Seth Forshee <sforshee@kernel.org>
    Cc: Tycho Andersen <tycho@tycho.pizza>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

but I haven't actually committed it to my main tree, since I'd hope to
get Ack's for it (and testing).

Or does somebody have any other ideas? I think that patch is fairly
clean, even if the *reason* for the patch is odd as heck.

           Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-20 19:15                   ` Linus Torvalds
@ 2024-05-21  6:07                     ` Jiri Slaby
  2024-05-21  6:13                       ` Jiri Slaby
  0 siblings, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-21  6:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Seth Forshee,
	Tycho Andersen

On 20. 05. 24, 21:15, Linus Torvalds wrote:
> On Mon, 20 May 2024 at 12:01, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So how about just a patch like this?  It doesn't do anything
>> *internally* to the inodes, but it fixes up what we expose to user
>> level to make it look like lsof expects.
> 
> Note that the historical dname for those pidfs files was
> "anon_inode:[pidfd]", and that patch still kept the inode number in
> there, so now it's "anon_inode:[pidfd-XYZ]", but I think lsof is still
> happy with that.

Now the last column of lsof still differs from 6.8:
-[pidfd:1234]
+[pidfd-4321]

And lsof tests still fail, as "lsof -F pfn" is checked against:
     if ! fgrep -q "p${pid} f${fd} n[pidfd:$pid]" <<<"$line"; then

Where $line is:
p1015 f3 n[pidfd-1315]

Wait, even if I change that minus to a colon, the inner pid (1315) 
differs from the outer (1015), but it should not (according to the test).

regards,
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-21  6:07                     ` Jiri Slaby
@ 2024-05-21  6:13                       ` Jiri Slaby
  2024-05-21 12:33                         ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Jiri Slaby @ 2024-05-21  6:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Seth Forshee,
	Tycho Andersen

On 21. 05. 24, 8:07, Jiri Slaby wrote:
> On 20. 05. 24, 21:15, Linus Torvalds wrote:
>> On Mon, 20 May 2024 at 12:01, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> So how about just a patch like this?  It doesn't do anything
>>> *internally* to the inodes, but it fixes up what we expose to user
>>> level to make it look like lsof expects.
>>
>> Note that the historical dname for those pidfs files was
>> "anon_inode:[pidfd]", and that patch still kept the inode number in
>> there, so now it's "anon_inode:[pidfd-XYZ]", but I think lsof is still
>> happy with that.
> 
> Now the last column of lsof still differs from 6.8:
> -[pidfd:1234]
> +[pidfd-4321]
> 
> And lsof tests still fail, as "lsof -F pfn" is checked against:
>      if ! fgrep -q "p${pid} f${fd} n[pidfd:$pid]" <<<"$line"; then
> 
> Where $line is:
> p1015 f3 n[pidfd-1315]
> 
> Wait, even if I change that minus to a colon, the inner pid (1315) 
> differs from the outer (1015), but it should not (according to the test).

This fixes the test (meaning literally "it shuts up the test", but I 
have no idea if it is correct thing to do at all):
-       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd-%llu]", 
pid->ino);
+       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd:%d]", 
pid_nr(pid));

Maybe pid_vnr() would be more appropriate, I have no idea either.

> regards,-- 
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-17 20:07             ` Linus Torvalds
  2024-05-20  8:23               ` Jiri Slaby
@ 2024-05-21 12:16               ` Christian Brauner
  1 sibling, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-05-21 12:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Slaby, linux-fsdevel, Alexander Viro, Seth Forshee, Tycho Andersen

Sorry for the delayed reply. I was attending LSFMM last week. Including
travel that basically amounted to a week of limited email time. Catching
up on things now.

On Fri, May 17, 2024 at 01:07:43PM -0700, Linus Torvalds wrote:
> On Fri, 17 May 2024 at 00:54, Jiri Slaby <jirislaby@kernel.org> wrote:
> >
> >          inode->i_private = data;
> >          inode->i_flags |= S_PRIVATE;
> > +       inode->i_mode &= ~S_IFREG;
> 
> That is not a sensible operation. S_IFREG isn't a bit mask.
> 
> But it looks like 'anon_inode' traditionally had *no* type bytes at
> all. That's literally crazy.

Yes, it's pretty wild and I had long discussions about this before when
people told me it's impossible for st_mode to have no type. A lot of
low-level software isn't aware of this quirk and it's not something I
particularly enjoy.

> Doing a 'stat -L' on one in /proc/X/fd/Y will correctly say "weird
> file" about them.

Oh yes, it also broke assumptions in systemd and some other software
which had code related to stat() that assumed it could never see an
empty file type in st_mode.

> What a crock. That's horrible, and we apparently never noticed how
> broken anon_inodes were because nobody really cared. But then lsof
> seems to have done the *opposite* and just said (for unfathomable
> reasons) "this can't be a normal regular file".
> 
> But I can't actually find that code in lsof. I see
> 
>                  if (rest && rest[0] == '[' && rest[1] == 'p')
>                      fdinfo_mask |= FDINFO_PID;
> 
> which only checks that the name starts with '[p'. Hmm.
> 
> [ Time passes, I go looking ]
> 
> Oh Christ. It's process_proc_node:
> 
>         type = s->st_mode & S_IFMT;
>         switch (type) {
>         ...
>         case 0:
>             if (!strcmp(p, "anon_inode"))
>                 Lf->ntype = Ntype = N_ANON_INODE;
>             break;
> 
> so yes, process_proc_node() really seems to have intentionally noticed
> that our anon inodes forgot to put a file type in the st_mode, and
> together with the path from readlink matching 'anon_inode' is how lsof
> determines it's one of the special inodes.

strace does that too but strace updated it's code fairly early on to
accommodate pidfs. I had searched github to check that
anon_inode:[pidfd] wasn't something that userspace relied on and other
than strace it didn't yield any meaningful results and didn't surface
lsof unfortunately.

> 
> So yeah, we made a mistake, and then lsof decided that mistake was a feature.
> 
> But that does mean that we probably just have to live in the bed we made.

Yes, we likely do.

> 
> But that
> 
> > +       inode->i_mode &= ~S_IFREG;
> 
> is still very very wrong. It should use the proper bit mask: S_IFMT.
> 
> And we'd have to add a big comment about our historical stupidity that
> we are perpetuating.
> 
> Oh well.
> 
>                Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-21  6:13                       ` Jiri Slaby
@ 2024-05-21 12:33                         ` Christian Brauner
  2024-05-21 12:40                           ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-05-21 12:33 UTC (permalink / raw)
  To: Jiri Slaby, Linus Torvalds
  Cc: linux-fsdevel, Alexander Viro, Seth Forshee, Tycho Andersen

On Tue, May 21, 2024 at 08:13:08AM +0200, Jiri Slaby wrote:
> On 21. 05. 24, 8:07, Jiri Slaby wrote:
> > On 20. 05. 24, 21:15, Linus Torvalds wrote:
> > > On Mon, 20 May 2024 at 12:01, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > > 
> > > > So how about just a patch like this?  It doesn't do anything
> > > > *internally* to the inodes, but it fixes up what we expose to user
> > > > level to make it look like lsof expects.
> > > 
> > > Note that the historical dname for those pidfs files was
> > > "anon_inode:[pidfd]", and that patch still kept the inode number in
> > > there, so now it's "anon_inode:[pidfd-XYZ]", but I think lsof is still
> > > happy with that.
> > 
> > Now the last column of lsof still differs from 6.8:
> > -[pidfd:1234]
> > +[pidfd-4321]
> > 
> > And lsof tests still fail, as "lsof -F pfn" is checked against:
> >      if ! fgrep -q "p${pid} f${fd} n[pidfd:$pid]" <<<"$line"; then
> > 
> > Where $line is:
> > p1015 f3 n[pidfd-1315]
> > 
> > Wait, even if I change that minus to a colon, the inner pid (1315)
> > differs from the outer (1015), but it should not (according to the
> > test).
> 
> This fixes the test (meaning literally "it shuts up the test", but I have no
> idea if it is correct thing to do at all):
> -       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd-%llu]",
> pid->ino);
> +       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd:%d]",
> pid_nr(pid));
> 
> Maybe pid_vnr() would be more appropriate, I have no idea either.

So as pointed out the legacy format for pidfds is:

lrwx------ 1 root root  64 21. Mai 14:24 39 -> 'anon_inode:[pidfd]'

So it's neither showing inode number nor pid.

The problem with showing the pid unconditionally like that in
dynamic_dname() is that it's wrong in various circumstances. For
example, when the pidfd is sent into a pid namespace outside the it's
pid namespace hierarchy (e.g., into a sibling pid namespace or when the
task has already been reaped.

Imho, showing the pid is more work than it's worth especially because we
expose that info in fdinfo/<nr> anyway. So let's just do the simple thing.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-21 12:33                         ` Christian Brauner
@ 2024-05-21 12:40                           ` Christian Brauner
  2024-05-21 15:10                             ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2024-05-21 12:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Jiri Slaby, Alexander Viro, Seth Forshee, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 2814 bytes --]

On Tue, May 21, 2024 at 02:33:24PM +0200, Christian Brauner wrote:
> On Tue, May 21, 2024 at 08:13:08AM +0200, Jiri Slaby wrote:
> > On 21. 05. 24, 8:07, Jiri Slaby wrote:
> > > On 20. 05. 24, 21:15, Linus Torvalds wrote:
> > > > On Mon, 20 May 2024 at 12:01, Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > > 
> > > > > So how about just a patch like this?  It doesn't do anything
> > > > > *internally* to the inodes, but it fixes up what we expose to user
> > > > > level to make it look like lsof expects.
> > > > 
> > > > Note that the historical dname for those pidfs files was
> > > > "anon_inode:[pidfd]", and that patch still kept the inode number in
> > > > there, so now it's "anon_inode:[pidfd-XYZ]", but I think lsof is still
> > > > happy with that.
> > > 
> > > Now the last column of lsof still differs from 6.8:
> > > -[pidfd:1234]
> > > +[pidfd-4321]
> > > 
> > > And lsof tests still fail, as "lsof -F pfn" is checked against:
> > >      if ! fgrep -q "p${pid} f${fd} n[pidfd:$pid]" <<<"$line"; then
> > > 
> > > Where $line is:
> > > p1015 f3 n[pidfd-1315]
> > > 
> > > Wait, even if I change that minus to a colon, the inner pid (1315)
> > > differs from the outer (1015), but it should not (according to the
> > > test).
> > 
> > This fixes the test (meaning literally "it shuts up the test", but I have no
> > idea if it is correct thing to do at all):
> > -       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd-%llu]",
> > pid->ino);
> > +       return dynamic_dname(buffer, buflen, "anon_inode:[pidfd:%d]",
> > pid_nr(pid));
> > 
> > Maybe pid_vnr() would be more appropriate, I have no idea either.
> 
> So as pointed out the legacy format for pidfds is:
> 
> lrwx------ 1 root root  64 21. Mai 14:24 39 -> 'anon_inode:[pidfd]'
> 
> So it's neither showing inode number nor pid.
> 
> The problem with showing the pid unconditionally like that in
> dynamic_dname() is that it's wrong in various circumstances. For
> example, when the pidfd is sent into a pid namespace outside the it's
> pid namespace hierarchy (e.g., into a sibling pid namespace or when the
> task has already been reaped.
> 
> Imho, showing the pid is more work than it's worth especially because we
> expose that info in fdinfo/<nr> anyway. So let's just do the simple thing.

Here's the updated patch appended. Linus, feel free to commit it
directly or if you prefer I can send it to you later this week.

In any case, I really really would like to try to move away from the
current insanity maybe by the end of the year. So I really hope that
lsof changes to the same format that strace already changed to so we can
flip the switch. That should allow us to get rid of both the weird
non-type st_mode issue and the unpleasant name faking. Does that sound
like something we can try?

[-- Attachment #2: 0001-fs-pidfs-make-lsof-happy-with-our-inode-changes.patch --]
[-- Type: text/x-diff, Size: 3343 bytes --]

From 0e027ea62813cb61d95012757908e92e8cd46894 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 21 May 2024 14:34:43 +0200
Subject: [PATCH] fs/pidfs: make 'lsof' happy with our inode changes

pidfs started using much saner inodes in commit b28ddcc32d8f ("pidfs:
convert to path_from_stashed() helper"), but that exposed the fact that
lsof had some knowledge of just how odd our old anon_inode usage was.

For example, legacy anon_inodes hadn't even initialized the inode type
in the inode mode, so everything had a type of zero.

So sane tools like 'stat' would report these files as "weird file", but
'lsof' instead used that (together with the name of the link in proc) to
notice that it's an anonymous inode, and used it to detect pidfd files.

Let's keep our internal new sane inode model, but mask the file type
bits at 'stat()' time in the getattr() function we already have, and by
making the dentry name match what lsof expects too.

This keeps our internal models sane, but should make user space see the
same old odd behavior.

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/a15b1050-4b52-4740-a122-a4d055c17f11@kernel.org/
Link: https://github.com/lsof-org/lsof/issues/317
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index a63d5d24aa02..dbb9d854d1c5 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -169,6 +169,24 @@ static int pidfs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return -EOPNOTSUPP;
 }
 
+
+/*
+ * User space expects pidfs inodes to have no file type in st_mode.
+ *
+ * In particular, 'lsof' has this legacy logic:
+ *
+ *	type = s->st_mode & S_IFMT;
+ *	switch (type) {
+ *	  ...
+ *	case 0:
+ *		if (!strcmp(p, "anon_inode"))
+ *			Lf->ntype = Ntype = N_ANON_INODE;
+ *
+ * to detect our old anon_inode logic.
+ *
+ * Rather than mess with our internal sane inode data, just fix it
+ * up here in getattr() by masking off the format bits.
+ */
 static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 			 struct kstat *stat, u32 request_mask,
 			 unsigned int query_flags)
@@ -176,6 +194,7 @@ static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 	struct inode *inode = d_inode(path->dentry);
 
 	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
+	stat->mode &= ~S_IFMT;
 	return 0;
 }
 
@@ -199,12 +218,13 @@ static const struct super_operations pidfs_sops = {
 	.statfs		= simple_statfs,
 };
 
+/*
+ * 'lsof' has knowledge of out historical anon_inode use, and expects
+ * the pidfs dentry name to start with 'anon_inode'.
+ */
 static char *pidfs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
-	struct inode *inode = d_inode(dentry);
-	struct pid *pid = inode->i_private;
-
-	return dynamic_dname(buffer, buflen, "pidfd:[%llu]", pid->ino);
+	return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]");
 }
 
 static const struct dentry_operations pidfs_dentry_operations = {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-21 12:40                           ` Christian Brauner
@ 2024-05-21 15:10                             ` Linus Torvalds
  2024-05-25 11:57                               ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2024-05-21 15:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jiri Slaby, Alexander Viro, Seth Forshee, Tycho Andersen

On Tue, 21 May 2024 at 05:40, Christian Brauner <brauner@kernel.org> wrote:
>
> Here's the updated patch appended. Linus, feel free to commit it
> directly or if you prefer I can send it to you later this week.

Applied.

> In any case, I really really would like to try to move away from the
> current insanity maybe by the end of the year. So I really hope that
> lsof changes to the same format that strace already changed to so we can
> flip the switch. That should allow us to get rid of both the weird
> non-type st_mode issue and the unpleasant name faking. Does that sound
> like something we can try?

We can try it again later and see if nobody notices because they've
updated their user space.

That said, from previous experience, some people (and some distros)
very seldom update user space, but this is hopefully enough of a
corner case that *most* people won't even realize they've hit it, so
maybe it's one of those "fraction of a fraction is zero" cases.

             Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/2] pidfd: add pidfdfs
  2024-05-21 15:10                             ` Linus Torvalds
@ 2024-05-25 11:57                               ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2024-05-25 11:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Jiri Slaby, Alexander Viro, Seth Forshee, Tycho Andersen

On Tue, May 21, 2024 at 08:10:26AM -0700, Linus Torvalds wrote:
> On Tue, 21 May 2024 at 05:40, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Here's the updated patch appended. Linus, feel free to commit it
> > directly or if you prefer I can send it to you later this week.
> 
> Applied.
> 
> > In any case, I really really would like to try to move away from the
> > current insanity maybe by the end of the year. So I really hope that
> > lsof changes to the same format that strace already changed to so we can
> > flip the switch. That should allow us to get rid of both the weird
> > non-type st_mode issue and the unpleasant name faking. Does that sound
> > like something we can try?
> 
> We can try it again later and see if nobody notices because they've
> updated their user space.
> 
> That said, from previous experience, some people (and some distros)
> very seldom update user space, but this is hopefully enough of a
> corner case that *most* people won't even realize they've hit it, so
> maybe it's one of those "fraction of a fraction is zero" cases.

Cool! Fwiw, lsof has now already adapted to the new scheme as well in:
https://github.com/lsof-org/lsof/commit/cd5bb9bd7fbb65bca6cc60de1d35ff0e1c1b81e0

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2024-05-25 11:57 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-13 16:45 [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner
2024-02-13 16:45 ` [PATCH 1/2] pidfd: move struct pidfd_fops Christian Brauner
2024-02-13 16:45 ` [PATCH 2/2] pidfd: add pidfdfs Christian Brauner
2024-02-13 17:17   ` Linus Torvalds
2024-02-14 14:40     ` Christian Brauner
2024-02-14 18:27       ` Christian Brauner
2024-02-14 18:37         ` Linus Torvalds
2024-02-15 16:11           ` Christian Brauner
2024-02-16 11:50             ` Christian Brauner
2024-02-16 16:41               ` Christian Brauner
2024-02-17 13:59               ` Oleg Nesterov
2024-02-17 17:30                 ` Linus Torvalds
2024-02-17 17:38                   ` Linus Torvalds
2024-02-18 11:15                   ` Christian Brauner
2024-02-18 11:33                     ` Christian Brauner
2024-02-18 17:54                       ` Christian Brauner
2024-02-18 18:08                         ` Linus Torvalds
2024-02-18 18:57                           ` Linus Torvalds
2024-02-19 18:05                             ` Christian Brauner
2024-02-19 18:34                               ` Linus Torvalds
2024-02-19 21:18                                 ` Christian Brauner
2024-02-19 23:24                                   ` Linus Torvalds
2024-02-18 14:27                     ` Oleg Nesterov
2024-02-18  9:30                 ` Christian Brauner
2024-02-22 19:03   ` Nathan Chancellor
2024-02-23 10:18     ` Heiko Carstens
2024-02-23 11:56       ` Christian Brauner
2024-02-23 11:55     ` Christian Brauner
2024-02-23 12:57       ` Heiko Carstens
2024-02-23 13:27         ` Christian Brauner
2024-02-23 13:35           ` Heiko Carstens
2024-02-23 13:41       ` Christian Brauner
2024-02-23 21:26       ` Christian Brauner
2024-02-23 21:58         ` Linus Torvalds
2024-02-24  5:52           ` Christian Brauner
2024-02-24  6:05             ` Christian Brauner
2024-02-24 18:48             ` Linus Torvalds
2024-02-24 19:15               ` Christian Brauner
2024-02-24 19:19                 ` Christian Brauner
2024-02-24 19:21                 ` Linus Torvalds
2024-02-27 19:26                 ` Nathan Chancellor
2024-02-27 22:13                   ` Christian Brauner
2024-03-12 10:35   ` Geert Uytterhoeven
2024-03-12 14:09     ` Christian Brauner
2024-05-15 11:10   ` Jiri Slaby
2024-05-15 16:39     ` Christian Brauner
2024-05-16  5:28       ` Jiri Slaby
2024-05-17  7:09         ` Jiri Slaby
2024-05-17  7:54           ` Jiri Slaby
2024-05-17 20:07             ` Linus Torvalds
2024-05-20  8:23               ` Jiri Slaby
2024-05-20 19:01                 ` Linus Torvalds
2024-05-20 19:15                   ` Linus Torvalds
2024-05-21  6:07                     ` Jiri Slaby
2024-05-21  6:13                       ` Jiri Slaby
2024-05-21 12:33                         ` Christian Brauner
2024-05-21 12:40                           ` Christian Brauner
2024-05-21 15:10                             ` Linus Torvalds
2024-05-25 11:57                               ` Christian Brauner
2024-05-21 12:16               ` Christian Brauner
2024-02-13 17:02 ` [PATCH 0/2] Move pidfd to tiny pseudo fs Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.